MediaWiki - JobQueue - SLO Error/Availability Failure
In the December 2022 SLO reporting, JobQueue failed the SLOs for Errors and Availability.

The Errors SLO agreed was: 1.5%.
The Performance achieved was: 1.8%.

The Availability SLO agreed was: 99.5%.
The Performance achieved was: 95.3%.
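
To put the figures above in perspective, here is a minimal sketch of the comparison. It assumes availability is measured as a fraction of wall-clock time over the 31 days of December, which this task does not state. The helper names are hypothetical and are not part of any reporting tooling.

```python
# Assumption: availability is a share of wall-clock time in a 31-day month.
HOURS_IN_DECEMBER = 31 * 24  # 744 hours


def downtime_hours(availability_pct, period_hours=HOURS_IN_DECEMBER):
    """Convert an availability percentage into hours of downtime."""
    return (100.0 - availability_pct) / 100.0 * period_hours


def meets_slo(achieved, target, higher_is_better):
    """Return True if the achieved value satisfies the target."""
    return achieved >= target if higher_is_better else achieved <= target


allowed = downtime_hours(99.5)  # downtime budget implied by the 99.5% target
actual = downtime_hours(95.3)   # downtime implied by the 95.3% achieved

print(f"Allowed downtime: {allowed:.2f} h, implied downtime: {actual:.2f} h")
print("Errors SLO met:", meets_slo(1.8, 1.5, higher_is_better=False))
print("Availability SLO met:", meets_slo(95.3, 99.5, higher_is_better=True))
```

Under that assumption, the 99.5% target allows roughly 3.7 hours of downtime in the month, while 95.3% corresponds to roughly 35 hours.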

Please investigate the reasons behind not meeting the SLO and provide a clear summary on this task identifying whether:

  • the failure was transient due to factors outside of the team's control, or
  • the failure was preventable and clear steps have been taken to investigate and implement controls to minimise the risk of failing in January 2023.

The availability failure was likely caused by cloud14 being down at one point, which also took down mwtask141 and the mw141/mw142 jobrunners. The availability issue does not seem to have been present in the past 30 days.

As for errors, that is a wider issue that still seems to be present (although a little better). It needs a larger investigation into why it happens and what can be done about it.