Formalise and agree on MediaWiki SLOs
Closed, ResolvedPublic

Description

Grafana dashboard: https://grafana.miraheze.org/d/pfjAbhf7k/mediawiki-slos

  • JobQueue
    • SLO: Availability to submit/run jobs is at least 99.5%. SLI: Service uptime.
    • SLO: Errors: abandoned jobs are less than 1.5% of jobs over 1 day. SLI: Abandoned jobs.
  • Memcached
    • SLO: Availability of Memcached to be at least 99.5%. SLI: Service uptime.
  • MediaWiki
    • SLO: Availability of MediaWiki to be at least 99%. SLI: Nginx 50x responses / total requests.
    • SLO: Errors must account for less than 3% of requests. SLI: Errors vs. total hits.
    • SLO: Latency: backend response times to be below 3s. SLI: Nginx request time average.
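As an illustration of how the MediaWiki availability SLI above could be turned into a pass/fail check, here is a minimal sketch. It assumes availability is read as 1 minus the ratio of Nginx 50x responses to total requests; the names `WindowCounts`, `availability` and `meets_slo` are hypothetical and do not come from the dashboard.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts over one measurement window (hypothetical structure)."""
    total_requests: int
    responses_50x: int


def availability(counts: WindowCounts) -> float:
    """Return availability as a fraction: 1 - (50x responses / total requests)."""
    if counts.total_requests == 0:
        # No traffic in the window: treat as fully available (an assumption).
        return 1.0
    return 1.0 - counts.responses_50x / counts.total_requests


def meets_slo(value: float, target: float) -> bool:
    """True when the measured value is at or above the target fraction."""
    return value >= target


window = WindowCounts(total_requests=200_000, responses_50x=1_500)
print(f"availability: {availability(window):.4%}")
print("meets 99% SLO:", meets_slo(availability(window), 0.99))
```

The same shape would work for the JobQueue and Memcached availability targets by swapping in their own counts and the 99.5% target.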

Event Timeline

John triaged this task as Normal priority. Feb 20 2022, 14:58
John created this task.

Uptime: I'd say 95% really at a minimum for all. That's still 1.5 days a month not working in total which is a lot.
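The downtime figure above follows directly from the availability target. A small sketch of the arithmetic, assuming a 30-day month (the function name `allowed_downtime_days` is made up for illustration):

```python
def allowed_downtime_days(availability_pct: float, days_in_period: float = 30.0) -> float:
    """Return how many days of downtime an availability target permits per period."""
    return (100.0 - availability_pct) / 100.0 * days_in_period


# Targets that come up in this task.
for target in (95.0, 97.0, 99.0, 99.5):
    days = allowed_downtime_days(target)
    print(f"{target}% -> {days:.2f} days ({days * 24:.1f} hours) per 30 days")
```

With these assumptions, 95% allows 1.5 days of downtime per 30 days, while 99.5% allows 0.15 days (3.6 hours).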

Abandoned jobs: not much of an idea; we should base this on the Prometheus data.

Job Latency: We really should be handling all non-expensive jobs within 1 hour of being pushed. I'm pretty sure we do this now.

Errors: Probably a rate on Grafana higher than 0.4, as we class that as an emergency and would use the stalk if Prometheus worked.

Latency: hard to tell with no data but we should probably set it at something like 10% higher than the 30 day average. We could probably alert on this too.
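The latency suggestion above (a threshold about 10% above the 30-day average) could be sketched as follows. This is only an illustration: the input is assumed to be a list of daily average response times in seconds, and `latency_threshold` and `should_alert` are hypothetical names, not existing tooling.

```python
from statistics import mean


def latency_threshold(daily_averages_s: list[float], margin: float = 0.10) -> float:
    """Return the mean of the daily averages raised by the given margin."""
    if not daily_averages_s:
        raise ValueError("need at least one daily average")
    return mean(daily_averages_s) * (1.0 + margin)


def should_alert(current_s: float, threshold_s: float) -> bool:
    """True when the current latency exceeds the threshold."""
    return current_s > threshold_s


# Placeholder values standing in for 30 days of real measurements.
history = [1.0, 2.0, 3.0]
threshold = latency_threshold(history)
print(f"threshold: {threshold:.2f}s, alert at 2.5s: {should_alert(2.5, threshold)}")
```

A threshold derived this way moves with the data, so it would need to be recomputed (or frozen) once an SLO value is agreed.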

Sorry about the delay here. While not my strong area I'd propose the following:

MediaWiki:
Service uptime: 97% (as RhinosF1 calculates above, 95% means 1.5 days/mo not working, which is too much to be reasonable, so 97% is better IMO)
Errors: (Exception graph is broken so I'm unable to provide a suggestion here) - https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki?orgId=1&viewPanel=187
Latency: As RhinosF1 says, I find it difficult to make a suggestion here without seeing a graph first.

Jobqueue:
Availability: 97% would be reasonable? (though not with our current issues)
Errors: Without total jobs difficult to propose
Latency: same as above

Memcached:
Availability: 97% still sounds like a good number to me

Are all of these based on current data from Grafana? They need to be graphed historically before agreeing on values, otherwise you could be agreeing to things you either can’t accurately measure or things you can’t achieve.

For memcached at least uptime/availability doesn't seem to be an issue.

To formalise it, can you add it to the SLO dashboard listed above so it can be tracked and monitored in Grafana?

Unknown Object (User) moved this task from Backlog to Short Term on the MediaWiki (SRE) board. May 9 2022, 19:05
Unknown Object (User) moved this task from Short Term to Goals on the MediaWiki (SRE) board. May 9 2022, 19:19
Unknown Object (User) updated the task description. Sep 7 2022, 19:08
Unknown Object (User) added a comment. Sep 8 2022, 03:08

I got the prometheus-es-exporter working again today for this, so we can add exceptions etc. again. I can do more work on this a bit later.

Unknown Object (User) updated the task description. Sep 8 2022, 21:37
Unknown Object (User) updated the task description. Sep 9 2022, 05:39
Unknown Object (User) updated the task description. Sep 9 2022, 06:24
Unknown Object (User) updated the task description. Sep 9 2022, 06:38
Unknown Object (User) updated the task description. Sep 9 2022, 19:04
Unknown Object (User) updated the task description. Sep 10 2022, 00:47
Unknown Object (User) updated the task description. Sep 10 2022, 05:02
Unknown Object (User) updated the task description. Sep 11 2022, 02:13
Unknown Object (User) updated the task description. Sep 17 2022, 23:28
Unknown Object (User) added a comment. Sep 20 2022, 15:59

I recommend we also remove latency from MediaWiki. We don't really have a way to monitor it anymore: I removed blackbox monitoring from MediaWiki because it seemed to make hundreds of thousands of semi-expensive, otherwise unnecessary daily requests, which seemed to add frontend load and degrade performance. Backend monitoring was not worth frontend performance degradation, in my opinion. I don't see any other way to really monitor latency in this case.

We could store and average the backend response times variable that we post in web requests - this would be a form of inexpensive latency monitoring.
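One inexpensive way to "store and average" response times is to keep only a running count and sum rather than every sample. The sketch below assumes the response time is already available per request as a number of seconds; how it is extracted is not specified in this task, and `RunningAverage` is a hypothetical helper.

```python
class RunningAverage:
    """Accumulates response times using constant memory (count and sum only)."""

    def __init__(self) -> None:
        self.count = 0
        self.total_s = 0.0

    def add(self, seconds: float) -> None:
        """Record one backend response time in seconds."""
        self.count += 1
        self.total_s += seconds

    @property
    def average(self) -> float:
        """Mean response time so far, or 0.0 if nothing has been recorded."""
        return self.total_s / self.count if self.count else 0.0


avg = RunningAverage()
for sample in (0.2, 0.4, 0.6):  # placeholder samples, not real measurements
    avg.add(sample)
print(f"{avg.count} requests, average {avg.average:.3f}s")
```

Because it piggybacks on requests that are already being served, this kind of measurement adds no extra probe traffic.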

Unknown Object (User) updated the task description. Sep 23 2022, 00:38
Unknown Object (User) updated the task description.
Unknown Object (User) updated the task description. Sep 23 2022, 00:55
Unknown Object (User) added a subscriber: MacFan4000. Sep 23 2022, 01:05

@Reception123 @MacFan4000 All the monitoring for this task should be done now, but we need to make sure that the selected numbers here are agreed upon. Any other suggestions for it?

How is the abandoned jobs measurement to work as it’s not clear what percentage of jobs have failed?

The SLO for MediaWiki latency also looks extremely high based on the small data set available at 5s - is there a reason for this?

Unknown Object (User) added a comment. Sep 23 2022, 18:57
In T8802#197599, @John wrote:

How is the abandoned jobs measurement to work as it’s not clear what percentage of jobs have failed?

The SLO for MediaWiki latency also looks extremely high based on the small data set available at 5s - is there a reason for this?

For abandoned jobs, I actually just realised that right before I saw your comment. I will fix that a little later. As for latency, that is probably true, the SLO probably doesn't need to be that high. Thank you!

Unknown Object (User) updated the task description. Sep 23 2022, 18:57
Unknown Object (User) updated the task description. Sep 24 2022, 18:35
Unknown Object (User) updated the task description. Sep 24 2022, 18:42
Unknown Object (User) added a comment. Sep 24 2022, 18:48
In T8802#197599, @John wrote:

How is the abandoned jobs measurement to work as it’s not clear what percentage of jobs have failed?

The SLO for MediaWiki latency also looks extremely high based on the small data set available at 5s - is there a reason for this?

I have now adjusted the jobs SLO and graph to use the data we have and to show the percentage of abandoned jobs.
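For reference, the percentage of abandoned jobs can be checked against the 1.5% target from the task description along these lines. This is a sketch only; the counts are placeholders and the function names are hypothetical, not the actual dashboard query.

```python
def abandoned_percentage(abandoned: int, total: int) -> float:
    """Return abandoned jobs as a percentage of all jobs in the window."""
    if total == 0:
        # No jobs in the window: report 0% rather than dividing by zero.
        return 0.0
    return abandoned / total * 100.0


def within_slo(abandoned: int, total: int, limit_pct: float = 1.5) -> bool:
    """True when abandoned jobs stay below the limit (1.5% per the description)."""
    return abandoned_percentage(abandoned, total) < limit_pct


print(f"{abandoned_percentage(12, 1000):.2f}% abandoned, within SLO: {within_slo(12, 1000)}")
```

The key point raised earlier in the thread is that this needs a total job count as the denominator; a raw count of abandoned jobs alone cannot be compared with a percentage target.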

Unknown Object (User) closed this task as Resolved. Sep 26 2022, 05:08
Unknown Object (User) claimed this task.
Unknown Object (User) moved this task from Unsorted to Goals on the Universal Omega board. Sep 26 2022, 05:08