Hi, I'm John. I'm the Co-Founder of Miraheze and the Engineering Manager for the Infrastructure team.
Mar 17 2023
Mar 3 2023
Feb 24 2023
https://github.com/miraheze/puppet/commit/bedbbf259236895187b13d9dde21e980787117bd - a temporary solution until we have more disk space to expand.
Change deployed locally
Will take a look over this later tonight
So the idea itself is invalid - if backups remained locally stored, the server would run out of space, causing the backup to fail.
Feb 16 2023
Could this be why T10434 failed? If so, I think this task is high priority given the timeline left on Feb's SLO reporting
Feb 15 2023
@Paladox we're halfway through Feb; we really need to look into this ASAP
Feb 12 2023
I've just run that command on puppet141 25 times and the average was 0.007s (max of 0.017s), and 25 times on cloud14, which consistently gave 0.003s or 0.004s (max of 0.005s).
Grafana has no data to suggest it is slower either.
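For reference, a timing loop along these lines reproduces that kind of measurement. The `true` builtin stands in for the command actually benchmarked on puppet141/cloud14, which isn't named above:

```shell
#!/bin/sh
# Run a command 25 times, recording wall-clock time in nanoseconds,
# then print the average and the maximum. 'true' is a placeholder
# for the command under test.
total=0
max=0
for i in $(seq 1 25); do
  start=$(date +%s%N)
  true                      # placeholder for the command under test
  end=$(date +%s%N)
  elapsed=$((end - start))
  total=$((total + elapsed))
  if [ "$elapsed" -gt "$max" ]; then max=$elapsed; fi
done
avg=$((total / 25))
echo "avg: ${avg}ns max: ${max}ns"
```

`date +%s%N` is GNU-specific; on the Debian hosts above it is available, but BSD `date` would need a different approach.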
Feb 11 2023
I've looked into this and the metric being used in Grafana was wildly wrong.
Feb 4 2023
For January 2023 SLO Reporting - JobQueue failed the SLO for Errors.
Jan 28 2023
@Collei why did you mark this Dec 2022 task as a duplicate of a Dec 2021 task?
Jan 24 2023
The backups are automatically deleted - backup/dbs was full of backups from early 2022, before the backup system even existed.
Jan 23 2023
The blocker on this task is actually unresolved
I do not - I haven't touched the code in about 2 years, and the fact that this started happening immediately after an upgrade suggests something changed and CreateWiki wasn't tested correctly.
Jan 22 2023
Boldly going to mark as declined due to concerns raised above.
Jan 4 2023
Jan 2 2023
A dialogue box when the group contains managewiki rights is a rather easy solution to implement, as all the logic would just be checking the group rights that are already exposed for the rights selection pages.
Personally, I think we should just provide a warning rather than restrict the ability entirely - a user can legitimately want to reconfigure the wiki ecosystem, and a hard restriction would just generate a new type of workload: requesting that stewards correctly delete a bureaucrat or sysop group. That changes the problem rather than fixing it.
New run schedules:
What probably doesn't help is that all db servers take backups at the same time. I suggest we stagger these a bit and watch the impact on the next run.
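As a sketch, staggering could look something like the following in cron syntax. The server comments, times, and the `backup sql` subcommand are illustrative assumptions, not taken from the actual puppet config:

```
# Hypothetical staggered schedule - one db server every two hours
# instead of all starting at once (times and hosts are illustrative):
0 1 * * 0  root  /usr/local/bin/miraheze-backup backup sql   # db server A
0 3 * * 0  root  /usr/local/bin/miraheze-backup backup sql   # db server B
0 5 * * 0  root  /usr/local/bin/miraheze-backup backup sql   # db server C
```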
This was actually the major reason we didn't have backups for so long - this performance problem.
Suggestions are welcome, but the current setup is the best option we have.
Jan 1 2023
This has been fixed. This was generating around 1440 failures a day - to meet the error threshold at that failure count, we'd need to have sent 144000 emails a day, or 100 a minute. As we don't operate at those volumes, the SLO was always going to be breached.
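The arithmetic checks out if the SLO error threshold is 1% of total volume - that percentage is an assumption here, as it isn't stated above:

```shell
#!/bin/sh
# Verify the SLO arithmetic: with ~1440 failures/day and an assumed
# 1% error threshold, total volume would need to be 144000 emails/day
# (100/minute) for the failures to fall inside the SLO.
failures_per_day=1440
required_total=$((failures_per_day * 100))   # assumed 1% threshold
per_minute=$((required_total / 1440))        # 1440 minutes in a day
echo "need ${required_total} emails/day (${per_minute}/min)"
```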
Availability - having reviewed this, I am certain the failure here is attributable to two things: one beyond our control, and one where we have an open task that is blocked on MediaWiki (SRE) for a resolution.
Dec 31 2022
Dec 30 2022
Dec 29 2022
Re-assigning, as the action identified above is one for the MediaWiki team, not Infrastructure.
Dec 28 2022
Dec 27 2022
Backup schedules defined:
- Private - weekly
- SSL Keys - weekly
- SQL - fortnightly
- mediawiki-xml - MediaWiki (SRE): can someone propose a time frame for XML dumps please? Every 3 months?
- Phabricator Static - fortnightly
root@puppet141:~/private# /usr/local/bin/miraheze-backup backup private
Starting backup of 'private' for date 2022-12-27...
Completed! This took 8.501368522644043s
root@puppet141:~/private# /usr/local/bin/miraheze-backup backup sslkeys
Starting backup of 'sslkeys' for date 2022-12-27...
Completed! This took 7.49277400970459s