We ran over the planned maintenance window and many things were handled chaotically. We need to document these issues and report on them.
Status | Assigned | Task
---|---|---
Resolved | Reception123 | T7450 Evaluate what happened during the 1.36 update
Resolved | Unknown Object (User) | T7117 Upgrade to MediaWiki 1.36.0
Resolved | Reception123 | T7116 Consider installing SecureLinkFixer
Resolved | Unknown Object (User) | T7253 Disable Modern Vector in 1.36 by default, Slow roll eventually
Resolved | Unknown Object (User) | T7357 Apply test3 drop default schema changes
Resolved | RhinosF1 | T7358 Enable wgCheckUserEnableSpecialInvestigate
Resolved | Unknown Object (User) | T7359 Test all extensions for 1.36
Resolved | Unknown Object (User) | T7383 Consider undeploying ModernSkylight skin
Declined | Unknown Object (User) | T7389 Switch PortableInfobox repository
Declined | Unknown Object (User) | T7360 Remove Variables
Resolved | Unknown Object (User) | T7387 Consider enabling logging of CU data for logins
Resolved | Unknown Object (User) | T7449 Add $wgLogos->icon to ManageWiki
Declined | Unknown Object (User) | T7429 Set wgMiserMode to true on all Miraheze wikis
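For context, several of the tasks above map to plain MediaWiki configuration flags. The following is a minimal LocalSettings.php sketch of those flags on a stock MediaWiki 1.36 install, assuming the CheckUser extension and the Vector skin are loaded; it is not Miraheze's actual configuration, which is driven through ManageWiki, and the file paths are hypothetical.

```php
<?php
// Minimal sketch for a stock MediaWiki 1.36 LocalSettings.php.
// Assumes CheckUser and Vector are already loaded; this is NOT
// Miraheze's actual ManageWiki-driven configuration.

// T7358: enable Special:Investigate (CheckUser extension).
$wgCheckUserEnableSpecialInvestigate = true;

// T7253: keep legacy Vector as the default instead of modern Vector.
$wgDefaultSkin = 'vector';
$wgVectorDefaultSkinVersion = '1'; // '1' = legacy, '2' = modern Vector

// T7449: $wgLogos has supported an 'icon' key since MediaWiki 1.35.
$wgLogos = [
    '1x'   => "$wgResourceBasePath/resources/assets/wiki.png",
    'icon' => "$wgResourceBasePath/resources/assets/icon.png", // hypothetical path
];

// T7429 (declined): $wgMiserMode disables some expensive site features,
// so it stays at its default of false.
$wgMiserMode = false;
```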
A brief outside review suggests things weren't planned appropriately: there was confusion about how to switch git over, and there were no practice runs of a code transition on test3 either.
There was confusion over fundamental aspects of how Puppet works and how to deploy Puppet changes to puppet3.
Distractions were rife: I counted at least five separate conversations that members involved in this deployment were engaged in, so they were not concentrating on the deployment.
Yet again, the timing was a guess and not considered in detail, as Reception stated in -sre.
I do think an incident report may be warranted, since we went over the estimate by almost 30 minutes. But that estimate was just a guess, and we did say "approximately", so I am not certain.
An incident report is definitely warranted, because we exceeded the planned maintenance window; at that point it stopped being maintenance and became an outage.
Who had server access during what time periods?
Who was taking lead of the upgrade?
What were the backup plans in place?
I believe @Paladox and @Universal_Omega had access the entire time
@Reception123 had access for at least the first hour and 40 minutes
@RhinosF1 may have had access at one point
Who was taking lead of the upgrade?
To my knowledge it was @Reception123
That is mostly correct, but I am not certain whether @RhinosF1 had server access or was just around helping without access.
I was only around just in case. I wasn't supposed to be involved, but it seems that changed…
Regarding the time estimate: it is true that we didn't really know how to arrive at a more accurate estimate. That being said, I believe that if everything had gone fully to plan, we probably would have managed to stay within the two hours that were planned.
If @Reception123 was in charge of the upgrade, why did they not have server access for the full duration? That is a critical failure if true.
I'm not certain whether RhinosF1 had access either, but I am also not sure why they wouldn't have, as I thought they weren't going to be around until after the upgrade to handle post-upgrade issues.
This has been discussed with the team, and the following main points have been retained for the next update: