This tag is used to track everything related to SLOs, whether that is adding new SLOs for services being deployed or collating tasks on SLO failures and review processes.
Jun 16 2023
May 19 2023
Mar 21 2023
Given that the error rate in swift dropped off massively in late February, at about the time T10510 was closed, I'm assuming that task covered both the cause and the solution.
Feb 15 2023
@Paladox we're halfway through Feb, we really need to look into this ASAP.
Feb 11 2023
I've looked into this and the metric being used in Grafana was wildly wrong.
Feb 4 2023
For January 2023 SLO Reporting - JobQueue failed the SLO for Errors.
Feb 2 2023
For January 2023 SLO Reporting - JobQueue failed the SLO for Errors.
The agreed Errors SLO was 1.5%.
The performance achieved was 3.4%.
Jan 22 2023
The availability failure was likely caused by cloud14 being down at one point, which also took down mwtask141 and the mw141/mw142 jobrunners. The availability issue has not recurred in the past 30 days.
This was originally due to the cloud14 outage, as every request hitting the down wikis returned a 500 status code. This was outside of our control.
Jan 21 2023
Jan 1 2023
This has been fixed. It was generating around 1440 failures a day. To meet the error threshold with those numbers, we'd have needed to send 144000 emails a day, or 100 a minute. As we don't operate at these volumes, the SLO was always going to fail.
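The email figures above can be checked with a quick sketch. It assumes the error threshold is 1%, which is only inferred from the 1440 and 144000 numbers and is not stated explicitly here. The function and constant names are hypothetical.

```python
import math
from fractions import Fraction

MINUTES_PER_DAY = 24 * 60  # 1440


def required_daily_volume(failures_per_day: int, max_error_rate: Fraction) -> int:
    """Smallest daily volume at which the given failure count stays within the threshold."""
    # Fraction keeps the division exact, so ceil() cannot be thrown off by float rounding.
    return math.ceil(Fraction(failures_per_day) / max_error_rate)


# Assumption: a 1% error threshold, inferred from the figures in the comment above.
ERROR_THRESHOLD = Fraction(1, 100)
FAILURES_PER_DAY = 1440  # roughly one failure a minute

emails_per_day = required_daily_volume(FAILURES_PER_DAY, ERROR_THRESHOLD)
emails_per_minute = emails_per_day / MINUTES_PER_DAY

print(emails_per_day, emails_per_minute)
```

With a fixed number of failures per day, the only way to pass is for the total volume to grow, which is why a low-traffic service could not meet this threshold.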
Availability - having reviewed this, I am certain that the failure here is attributable to two things: one beyond our control, and one where we have an open task that is blocked on MediaWiki (SRE) for a resolution.