MediaWiki - JobQueue - SLO Error/Availability Failure
Closed, Resolved (Public)

Description

For December 2022 SLO Reporting - JobQueue failed the SLO for Errors and Availability.

The Errors SLO agreed was: 1.5%.
The Performance achieved was: 1.8%.

The Availability SLO agreed was: 99.5%.
The Performance achieved was: 95.3%.

Please investigate the reasons behind not meeting the SLO and provide a clear summary on this task identifying whether:

  • the failure was transient due to factors outside of the team's control, or
  • the failure was preventable and clear steps have been taken to investigate and implement controls to minimise the risk of failing in January 2023.
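To make the December figures above concrete, here is a minimal sketch of the pass/fail comparison. It assumes the Errors SLO is an upper bound and the Availability SLO is a lower bound, and that availability is time-based, so the shortfall can be read as hours over a 31-day month. The function names are hypothetical and are not taken from any Miraheze tooling.

```python
# Hypothetical helpers; the percentages come from the task description.
HOURS_IN_DECEMBER = 31 * 24  # 744 hours


def meets_error_slo(achieved_pct, target_pct):
    """Errors SLO treated as an upper bound: lower is better."""
    return achieved_pct <= target_pct


def meets_availability_slo(achieved_pct, target_pct):
    """Availability SLO treated as a lower bound: higher is better."""
    return achieved_pct >= target_pct


def downtime_hours(availability_pct, period_hours=HOURS_IN_DECEMBER):
    """Assumption: availability is time-based, so the gap maps to hours."""
    return (100.0 - availability_pct) / 100.0 * period_hours


print(meets_error_slo(1.8, 1.5))           # False: 1.8% exceeds the 1.5% target
print(meets_availability_slo(95.3, 99.5))  # False: 95.3% is below 99.5%
print(round(downtime_hours(99.5), 2))      # 3.72 hours allowed by the target
print(round(downtime_hours(95.3), 2))      # 34.97 hours implied by the result
```

Under these assumptions, 95.3% availability corresponds to roughly 35 hours of unavailability in the month, against a budget of under 4 hours.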

Event Timeline

John triaged this task as Normal priority. Dec 30 2022, 22:37
John created this task.
Reception123 assigned this task to Unknown Object (User). Jan 21 2023, 08:54
Unknown Object (User) added a comment. Jan 22 2023, 22:57

The availability failure was likely caused by cloud14 being down at one point, which also took down mwtask141 and the mw141/mw142 jobrunners. The availability issue does not appear to have recurred in the past 30 days.

As for errors, that is a wider issue that still seems to be present (although slightly improved) and needs a larger investigation into why it happens and what can be done about it.

For January 2023 SLO Reporting - JobQueue failed the SLO for Errors
The Errors SLO agreed was: 1.5%.
The Performance achieved was: 3.4%.

John removed Unknown Object (User) as the assignee of this task. Feb 4 2023, 12:26
John moved this task from Failure Stage 1 to Failure Stage 2 on the SLO board.
John edited projects, added Infrastructure (SRE); removed MediaWiki (SRE).

For January 2023 SLO Reporting - JobQueue failed the SLO for Errors.

The SLO agreed was: 1.5%.
The Performance achieved was: 3.37%.

As this is the second failure, this has now been escalated to Infrastructure for review.

John claimed this task.

I've looked into this and the metric being used in Grafana was wildly wrong.

This has now been fixed and the metric is passing for the last 30 days.