
Add (rolling average) response time to grafana
Open, Normal, Public

Description

It would be good if we could see response time data on Grafana.

Both:
A) The response time Icinga sees when querying health checks
B) Average response times for actual users

This would help us identify, during incidents affecting load times, how large the impact actually is.
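For illustration only, a minimal sketch of what "rolling average" means for a stream of response times. The function name and the sample values are hypothetical and are not taken from any existing dashboard or check.

```python
from collections import deque


def rolling_average(samples, window):
    """Yield the mean of the most recent `window` samples after each new sample."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    total = 0.0
    for value in samples:
        # When the window is full, the oldest sample is about to be evicted.
        if len(recent) == window:
            total -= recent[0]
        recent.append(value)
        total += value
        yield total / len(recent)


# Hypothetical response times in seconds; the last one simulates a slow request.
response_times = [1.0, 2.0, 3.0, 4.0]
averages = list(rolling_average(response_times, window=2))
```

A single slow request moves the rolling value less than it moves the raw series, which is why the averaged view is easier to read during an incident.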

Event Timeline

We have the blackbox exporter for this. Can we help you by monitoring specific URLs?

Universal_Omega moved this task from Backlog to Long Term on the MediaWiki (SRE) board.

> We have the blackbox exporter for this. Can we help you by monitoring specific URLs?

As mentioned in the task, /healthcheck is the biggest one because it affects uptime if its response time gets too high.

I would also recommend we monitor one that loads quite a few resources (e.g. images, JavaScript, etc.).

>> We have the blackbox exporter for this. Can we help you by monitoring specific URLs?

> As mentioned in the task, /healthcheck is the biggest one because it affects uptime if its response time gets too high.

/healthcheck = Meta's Main Page. We're already monitoring that.

> I would also recommend we monitor one that loads quite a few resources (e.g. images, JavaScript, etc.).

The blackbox exporter does not monitor subsequent requests, such as resources (images?) used on an article. We can monitor those too, but you'll need to provide specific URLs. :)
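As a rough sketch of how a rolling average per monitored URL could be expressed for a Grafana panel, the helper below builds a PromQL expression over the blackbox exporter's `probe_duration_seconds` metric using `avg_over_time`. It assumes the probed URL ends up in the `instance` label, which depends on the relabelling in the Prometheus scrape config and is not confirmed in this task. The function name, the 5m window and the example URL are all hypothetical.

```python
def rolling_probe_duration_query(url, window="5m"):
    """Build a PromQL expression averaging blackbox probe duration for one URL.

    Assumption: the probed URL is stored in the `instance` label.
    """
    # Escape backslashes and double quotes so the URL is a valid PromQL string.
    escaped = url.replace("\\", "\\\\").replace('"', '\\"')
    return f'avg_over_time(probe_duration_seconds{{instance="{escaped}"}}[{window}])'


# Hypothetical URL; real targets would be the specific URLs requested above.
query = rolling_probe_duration_query("https://example.org/healthcheck")
```

Each specific URL gets its own expression, which matches the exporter's model of one probe per target rather than one probe per page plus its resources.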

Further stats, beyond the response timings we need to see whether we're getting close to depool territory during incidents, would be nice. However, we've found a way to see, during an incident affecting load times, how close we are to false-positive depools from health checks.