
Evaluate what happened during the 1.36 update
Closed, Resolved · Public

Description

We ran over and many things were handled chaotically.

We need to document these and report on them.

Related Objects

Status     Assigned                 Task
Resolved   Reception123
Resolved   Unknown Object (User)
Resolved   Reception123
Resolved   Unknown Object (User)
Resolved   Unknown Object (User)
Resolved   RhinosF1
Resolved   Unknown Object (User)
Resolved   Unknown Object (User)
Declined   Unknown Object (User)
Declined   Unknown Object (User)
Resolved   Unknown Object (User)
Resolved   Unknown Object (User)
Declined   Unknown Object (User)

Event Timeline

A brief outside review suggests things weren't planned appropriately: there was confusion about how to change git over, and there were no practice runs on test3 for handling a code transition either.

There was confusion over fundamental aspects of how puppet works and how to deploy puppet changes to puppet3.
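For reference, rolling out a merged puppet change usually amounts to only a couple of steps, roughly as sketched below. This is a generic outline rather than Miraheze's documented procedure; the code directory path and the host name are assumptions.

    # On the puppet server (assumed here to be puppet3): pull the merged change
    # into the production environment
    cd /etc/puppetlabs/code/environments/production
    sudo git pull

    # On the affected node: trigger an immediate agent run rather than waiting
    # for the next scheduled run, and review the changes it applies
    sudo puppet agent --test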

Distractions were rife - I counted at least 5 separate conversations that members involved in this deployment were engaged in, meaning they were not concentrating on the deployment.

Yet again, timing was a guess and not considered in detail, which Reception stated in -sre.

Unknown Object (User) added a comment. Jun 12 2021, 23:23

I do think an incident report may be warranted since it went over the approximation by almost 30 minutes. But yes, this was just a guess and we did say "approximately", so I am not certain.

I do think an incident report may be warranted since it went over the approximation by almost 30 minutes. But yes, this was just a guess and we did say "approximately", so I am not certain.

An incident report is definitely warranted, because the work exceeded the planned maintenance window; at that point it became an outage, not a maintenance.

Who had server access during what time periods?

Who was taking lead of the upgrade?

What were the backup plans in place?

Unknown Object (User) triaged this task as High priority. Jun 13 2021, 00:26

High for now; feel free to lower as needed, though.

In T7450#148970, @John wrote:

Who had server access during what time periods?

I believe @Paladox and @Universal_Omega had access the entire time
@Reception123 had access for at least the first hour and 40 mins
@RhinosF1 may have had access at one point

Who was taking lead of the upgrade?

To my knowledge it was @Reception123

Unknown Object (User) added a comment. Jun 13 2021, 01:19
In T7450#148970, @John wrote:

Who had server access during what time periods?

I believe @Paladox and @Universal_Omega had access the entire time
@Reception123 had access for at least the first hour and 40 mins
@RhinosF1 may have had access at one point

Who was taking lead of the upgrade?

To my knowledge it was @Reception123

That is mostly correct, but I am not certain if @RhinosF1 had server access or was just around helping without access.

I was only around just in case. I wasn't supposed to be involved, but it seems that changed…

Unknown Object (User) added a comment. Jun 13 2021, 02:48

I was only around just in case. I wasn't supposed to be involved, but it seems that changed…

Yes, and thank you for all the help today. It was greatly appreciated.

Regarding the time estimate, it is true that we didn't really know how to arrive at a more accurate one. That being said, I believe that if everything had gone fully to plan we would probably have managed to stay within the two hours that were planned.

If @Reception123 was in charge of the upgrade, why did they not have server access for the full duration? That is a critical failure if true.

That is mostly correct, but I am not certain if @RhinosF1 had server access or was just around helping without access.

I'm not certain if RhinosF1 had access either, but I'm also not sure why they didn't have access, as I thought they weren't going to be around until after the upgrade to handle post-upgrade issues.

Unknown Object (User) moved this task from Backlog to Short Term on the MediaWiki (SRE) board. Jun 15 2021, 17:04
Reception123 claimed this task.

This has been discussed with the team and the following main points have been retained for the next update:

  • In order to be efficient and avoid any sort of confusion, there should be a paste containing the exact commands that will be executed and by whom (see the sketch after this list).
  • It must be clear who is taking the lead of the upgrade, who is assisting, and who else is around (with or without access) during the upgrade. Preferably, at least one member of the Infra team should be around with access, but they should not be expected to actively participate in the upgrade unless an unexpected issue makes it necessary.
  • A clearer plan B should exist in case something goes wrong (e.g. an extension doesn't work), and the team should be prepared to disable an extension immediately.
  • More thought needs to be given to the estimated time.
  • There should be more concentration on the upgrade, and the person(s) doing the upgrade should avoid discussing unrelated topics with users at the same time.
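To illustrate the first point, a command paste for a branch upgrade could look roughly like the sketch below. It assumes a standard MediaWiki core checkout and the stock maintenance scripts; the paths, branch name, and person-to-command assignments are illustrative only, not Miraheze's actual deployment procedure.

    # 1. (lead, e.g. Reception123) switch core to the new release branch
    cd /srv/mediawiki && git fetch origin && git checkout REL1_36

    # 2. (lead) pull in the libraries required by the new branch
    composer update --no-dev

    # 3. (assistant, e.g. Paladox) run the schema updates; on a wiki farm this
    #    step is repeated per wiki, e.g. via the generic --wiki option
    php maintenance/update.php --quick

    # 4. (assistant, e.g. Universal_Omega) confirm the deployed version
    php maintenance/version.php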