Formalise and agree on MediaWiki SLOs
Open, NormalPublic

Description

Grafana dashboard: https://grafana.miraheze.org/d/pfjAbhf7k/mediawiki-slos

  • JobQueue
  • Memcached
    • SLO: Availability of Memcached to be at least x%. SLI: Service uptime.
  • MediaWiki
    • SLO: Availability of MediaWiki to be at least x%. SLI: Service uptime.
    • SLO: Errors must account for less than x% of requests. SLI: Exceptions graph (total errors, alerts via email when too high) vs. total requests.
    • SLO: Latency of backend response times (requires a measurable goal). SLI: Blackbox.
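
The error SLO in the list above compares total errors against total requests. A minimal sketch of that ratio, assuming both counts are already available for the same time window; the function names and the 1% limit in the usage note are hypothetical stand-ins for the still-undecided x%:

```python
def error_ratio(error_count: int, total_requests: int) -> float:
    """Fraction of requests that ended in an error.

    Returns 0.0 when there was no traffic, so an idle window
    does not count against the SLO.
    """
    if total_requests == 0:
        return 0.0
    return error_count / total_requests


def meets_error_slo(error_count: int, total_requests: int,
                    max_error_ratio: float) -> bool:
    """True when errors account for less than max_error_ratio of requests."""
    return error_ratio(error_count, total_requests) < max_error_ratio
```

For example, `meets_error_slo(5, 1000, 0.01)` checks 5 errors in 1000 requests against a hypothetical 1% limit.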

Event Timeline

John triaged this task as Normal priority. Feb 20 2022, 14:58
John created this task.

Uptime: I'd say 95% really at a minimum for all. That's still 1.5 days a month not working in total which is a lot.
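
The 1.5 days figure follows from taking 5% of a 30-day month. A small sketch of that arithmetic, assuming a 30-day period (the function name is illustrative only):

```python
def allowed_downtime_days(availability_percent: float,
                          period_days: float = 30.0) -> float:
    """Downtime permitted per period by an availability target.

    Assumes a 30-day month by default; pass period_days to change it.
    """
    return period_days * (100.0 - availability_percent) / 100.0


# 95% over 30 days leaves 5% of the month as permitted downtime.
print(allowed_downtime_days(95.0))
```

The same function can be used to compare other candidate targets before agreeing on one.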

Abandoned jobs: not much of an idea; we should base this on the Prometheus data.

Job Latency: We really should be handling all non-expensive jobs within 1 hour of being pushed. I'm pretty sure we do this now.
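
One way to check the 1-hour goal against real data is to measure the share of jobs completed within an hour of being pushed. This is only a sketch: it assumes push and completion timestamps (in seconds) can be exported per job, and the names used here are hypothetical rather than anything the job queue exposes.

```python
ONE_HOUR_SECONDS = 3600.0


def fraction_within_deadline(jobs, deadline_seconds=ONE_HOUR_SECONDS):
    """Share of jobs completed within deadline_seconds of being pushed.

    `jobs` is an iterable of (pushed_at, completed_at) timestamp pairs
    in seconds. Returns 1.0 for an empty input, since no job missed
    the deadline.
    """
    jobs = list(jobs)
    if not jobs:
        return 1.0
    on_time = sum(1 for pushed_at, completed_at in jobs
                  if completed_at - pushed_at <= deadline_seconds)
    return on_time / len(jobs)
```

A result of 1.0 would mean every sampled job met the 1-hour goal.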

Errors: Probably a rate on Grafana higher than 0.4, as we class that as an emergency and use the stalk if Prometheus worked.

Latency: hard to tell with no data but we should probably set it at something like 10% higher than the 30 day average. We could probably alert on this too.
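
As a rough sketch of the "10% higher than the 30 day average" idea, assuming one average response time per day is available (the input format and function name are assumptions, not something the dashboard currently provides):

```python
from statistics import mean


def latency_alert_threshold(daily_averages_ms, margin=0.10):
    """Threshold set `margin` above the mean of the supplied daily averages.

    `daily_averages_ms` would hold the last 30 daily average response
    times in milliseconds; margin=0.10 means 10% above that average.
    """
    if not daily_averages_ms:
        raise ValueError("need at least one daily average")
    return mean(daily_averages_ms) * (1.0 + margin)
```

With 30 days that each averaged 200 ms, the threshold comes out at roughly 220 ms.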

Sorry about the delay here. While not my strong area I'd propose the following:

MediaWiki:
Service uptime: 97% (as RhinosF1 calculates above, 95% means 1.5 days/mo not working, which is too much to be reasonable IMO, so 97% is better)
Errors: (Exception graph is broken so I'm unable to provide a suggestion here) - https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki?orgId=1&viewPanel=187
Latency: As RhinosF1 says, I find it difficult to make a suggestion here without seeing a graph first.

Jobqueue:
Availability: 97% would be reasonable? (though not with our current issues)
Errors: Difficult to propose without total jobs
Latency: same as above

Memcached:
Availability: 97% still sounds like a good number to me

Are all of these based on current data from Grafana? They need to be graphed historically before agreeing on values, otherwise you could be agreeing to things you either can't accurately measure or can't achieve.

For memcached at least uptime/availability doesn't seem to be an issue.

To formalise it, can you add it to the SLO dashboard listed above so it can be tracked and monitored in Grafana?