
Evaluate what happened during the 1.36 update
Closed, Resolved · Public

Description

We ran over and many things were handled chaotically.

We need to document these and report on them.

Related Objects

Status     Assigned                 Task
Resolved   Reception123
Resolved   Unknown Object (User)
Resolved   Reception123
Resolved   Unknown Object (User)
Resolved   Unknown Object (User)
Resolved   RhinosF1
Resolved   Unknown Object (User)
Resolved   Unknown Object (User)
Declined   Unknown Object (User)
Declined   Unknown Object (User)
Resolved   Unknown Object (User)
Resolved   Unknown Object (User)
Declined   Unknown Object (User)

Event Timeline

A brief outside review suggests things weren't planned appropriately: there was confusion about how to change git over, and there were no practice runs on test3 for handling a code transition either.

There was confusion over fundamental aspects of how puppet works and how to deploy puppet changes to puppet3.
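For reference, rolling out a merged puppet change usually amounts to only a couple of steps, roughly as sketched below. This is a generic outline rather than Miraheze's documented procedure; the code directory path and the host name are assumptions.

    # On the puppet server (assumed here to be puppet3): pull the merged change
    # into the production environment
    cd /etc/puppetlabs/code/environments/production
    sudo git pull

    # On the affected node: trigger an immediate agent run rather than waiting
    # for the next scheduled run, and review the changes it applies
    sudo puppet agent --test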

Distractions were rife - I counted at least 5 separate conversations that members involved in this deployment were engaged in, meaning they were not concentrating on the deployment.

Yet again, timing was a guess and not considered in detail, which Reception stated in -sre.

Unknown Object (User) added a comment. Jun 12 2021, 23:23

I do think an incident report may be warranted since it went over the approximation by almost 30 minutes. But yes, this was just a guess and we did say "approximately", so I am not certain.

I do think an incident report may be warranted since it went over the approximation by almost 30 minutes. But yes, this was just a guess and we did say "approximately", so I am not certain.

An incident report is definitely warranted, because the work exceeded the planned maintenance window; at that point it became an outage, not a maintenance.

Who had server access during what time periods?

Who was taking lead of the upgrade?

What were the backup plans in place?

Unknown Object (User) triaged this task as High priority. Jun 13 2021, 00:26

High for now; feel free to lower as needed, though.

In T7450#148970, @John wrote:

Who had server access during what time periods?

I believe @Paladox and @Universal_Omega had access the entire time
@Reception123 had access for at least the first hour and 40 mins
@RhinosF1 may have had access at one point

Who was taking lead of the upgrade?

To my knowledge it was @Reception123

Unknown Object (User) added a comment. Jun 13 2021, 01:19
In T7450#148970, @John wrote:

Who had server access during what time periods?

I believe @Paladox and @Universal_Omega had access the entire time
@Reception123 had access for at least the first hour and 40 mins
@RhinosF1 may have had access at one point

Who was taking lead of the upgrade?

To my knowledge it was @Reception123

That is mostly correct, but I am not certain if @RhinosF1 had server access or was just around helping without access.

I was only around just in case. I wasn't supposed to be involved, but it seems that changed…

Unknown Object (User) added a comment. Jun 13 2021, 02:48

I was only around just in case. I wasn't supposed to be involved, but it seems that changed…

Yes, and thank you for all the help today. It was greatly appreciated.

Regarding the time estimate, it is true that we didn't really know how to arrive at a more accurate one. That being said, I believe that if everything had gone fully to plan we would probably have managed to stay within the two hours that were planned.

If @Reception123 was in charge of the upgrade, why did they not have server access for the full duration? That is a critical failure if true.

That is mostly correct, but I am not certain if @RhinosF1 had server access or was just around helping without access.

I'm not certain if RhinosF1 had access either, but I'm also not sure why they didn't have access, as I thought they weren't going to be around until after the upgrade to handle post-upgrade issues.

Unknown Object (User) moved this task from Backlog to Short Term on the MediaWiki (SRE) board. Jun 15 2021, 17:04
Reception123 claimed this task.

This has been discussed with the team and the following main points have been retained for the next update:

  • In order to be efficient and avoid any sort of confusion, there should be a paste containing the exact commands that will be executed and by whom (see the sketch after this list).
  • It must be clear who is taking the lead of the upgrade, who is assisting, and who else is around (with or without access) during the upgrade. Preferably, at least one member of the Infra team should be around with access, but they should not be expected to actively participate in the upgrade unless an unexpected issue makes it necessary.
  • A clearer plan B should exist in case something goes wrong (e.g. an extension doesn't work), and the team should be prepared to disable an extension immediately.
  • More thought needs to be given to the estimated time.
  • There should be more concentration on the upgrade, and the person(s) doing the upgrade should avoid discussing unrelated topics with users at the same time.
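To illustrate the first point, a command paste for a branch upgrade could look roughly like the sketch below. It assumes a standard MediaWiki core checkout and the stock maintenance scripts; the paths, branch name, and person-to-command assignments are illustrative only, not Miraheze's actual deployment procedure.

    # 1. (lead, e.g. Reception123) switch core to the new release branch
    cd /srv/mediawiki && git fetch origin && git checkout REL1_36

    # 2. (lead) pull in the libraries required by the new branch
    composer update --no-dev

    # 3. (assistant, e.g. Paladox) run the schema updates; on a wiki farm this
    #    step is repeated per wiki, e.g. via the generic --wiki option
    php maintenance/update.php --quick

    # 4. (assistant, e.g. Universal_Omega) confirm the deployed version
    php maintenance/version.php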