Since we've got SQL backups as well, I'd say every 3 months for XML would be reasonable
Dec 30 2022
Dec 28 2022
Dec 27 2022
Backup schedules defined:
- Private - weekly
- SSL Keys - weekly
- SQL - fortnightly
- mediawiki-xml - MediaWiki (SRE) - can someone propose a time frame for XML dumps please? - 3 monthly?
- Phabricator Static - fortnightly
root@puppet141:~/private# /usr/local/bin/miraheze-backup backup private
Starting backup of 'private' for date 2022-12-27...
Completed! This took 8.501368522644043s
root@puppet141:~/private# /usr/local/bin/miraheze-backup backup sslkeys
Starting backup of 'sslkeys' for date 2022-12-27...
Completed! This took 7.49277400970459s
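A minimal sketch of how the schedules listed above could be checked for due backups. This is not how miraheze-backup itself works; the key names other than 'private', 'sslkeys' and 'mediawiki-xml' are hypothetical, and the 90-day XML interval is only the proposed "3 monthly" figure, not an agreed value.

```python
from datetime import date, timedelta

# Intervals mirror the schedule list above. 'sql' and 'phabricator-static'
# are hypothetical key names; the XML interval is a proposal, not a decision.
BACKUP_INTERVALS = {
    "private": timedelta(weeks=1),
    "sslkeys": timedelta(weeks=1),
    "sql": timedelta(weeks=2),
    "phabricator-static": timedelta(weeks=2),
    "mediawiki-xml": timedelta(days=90),
}


def is_due(name: str, last_run: date, today: date) -> bool:
    """Return True if at least one full interval has passed since last_run."""
    return today - last_run >= BACKUP_INTERVALS[name]


# Illustrative data only: pretend every backup last ran on 2022-12-27.
last_runs = {name: date(2022, 12, 27) for name in BACKUP_INTERVALS}
today = date(2023, 1, 10)
due = [name for name in BACKUP_INTERVALS if is_due(name, last_runs[name], today)]
print(due)
```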
Dec 26 2022
Availability for the Swift proxy, ac, and object servers should be 99.5%, I think.
Dec 21 2022
Dec 11 2022
@Paladox can we draft some Swift SLOs please so that we can start to monitor them before the end of this year?
Nov 25 2022
Given recent events, a workaround will be developed and, hopefully, released this weekend.
Sep 26 2022
Sep 24 2022
In T8802#197599, @John wrote: How is the abandoned jobs measurement meant to work, given that it’s not clear what percentage of jobs have failed?
At 5s, the SLO for MediaWiki latency also looks extremely high based on the small data set available - is there a reason for this?
Sep 23 2022
@Reception123 @MacFan4000 All the monitoring for this task should be done now, but we need to make sure that the selected numbers here are agreed upon. Any other suggestions for it?
Sep 20 2022
We could store and average the backend response times variable that we post in web requests - this would be a form of inexpensive latency monitoring
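A rough sketch of the averaging idea, assuming response times are recorded in seconds as each request completes. The class name and window size are hypothetical, and how the values would actually be collected from web requests is not shown.

```python
from collections import deque
from typing import Optional


class RollingLatency:
    """Keep the most recent `window` response times and average them."""

    def __init__(self, window: int = 1000) -> None:
        # A bounded deque drops the oldest sample once the window is full.
        self._samples = deque(maxlen=window)

    def record(self, seconds: float) -> None:
        self._samples.append(seconds)

    def average(self) -> Optional[float]:
        # None signals "no data yet" rather than a misleading 0.0.
        if not self._samples:
            return None
        return sum(self._samples) / len(self._samples)


tracker = RollingLatency(window=3)
for value in (1.0, 2.0, 3.0, 4.0):
    tracker.record(value)
print(tracker.average())
```

Because only a fixed number of samples is kept, memory use stays constant, which fits the "inexpensive" aim of the suggestion.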
I recommend we remove latency from MediaWiki as well. We no longer have a way to monitor it: I removed blackbox monitoring from MediaWiki because it made hundreds of thousands of semi-expensive, otherwise unnecessary daily requests, which seemed to add needless frontend load and degradation. Backend monitoring was not worth frontend performance degradation, in my opinion. I don't see any other way to monitor latency in this case.
Sep 17 2022
Sep 11 2022
Sep 10 2022
Sep 9 2022
Sep 8 2022
I got the prometheus-es-exporter working again today for this, so we can add exceptions, etc. again. I can do more work on this a bit later.
Sep 7 2022
Sep 1 2022
Aug 21 2022
- DNS
- SLO: Latency for DNS lookups to be below 5ms at least 99.5% of the time
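As a sketch of how a latency SLO like the one above could be evaluated, assuming lookup times have already been collected in milliseconds. The function names are hypothetical and the sample values are made up for illustration.

```python
from typing import Sequence


def fraction_within(samples_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of samples strictly below the latency threshold."""
    if not samples_ms:
        raise ValueError("no samples to evaluate")
    within = sum(1 for sample in samples_ms if sample < threshold_ms)
    return within / len(samples_ms)


def meets_slo(samples_ms: Sequence[float],
              threshold_ms: float = 5.0,
              target: float = 0.995) -> bool:
    """True if at least `target` of the samples are below the threshold."""
    return fraction_within(samples_ms, threshold_ms) >= target


# Made-up data: 999 fast lookups and one slow one.
samples = [1.0] * 999 + [12.0]
print(fraction_within(samples, 5.0), meets_slo(samples))
```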
Aug 2 2022
Jul 2 2022
- Mail
- SLO: Availability of Mail servers to be at least 99.5%.
- SLO: Error rate for sending mail is below 1%.
- SLO: Latency of message delivery is below 30 seconds.
- MariaDB
- SLO: Availability of MariaDB is at least 99.5%.
- SLO: Error rates for access are below 5%.
- LDAP
- SLO: Availability of LDAP to be at least 99.5%.
Proposals for some of these are below:
Jun 25 2022
Resolved
Running and supporting effective backups looks like it will first require reducing the strain on existing infrastructure. So this is unfortunately blocked on bigger projects in the next few months.
@Paladox less than a week until end of goal period - do we have an update on this?
May 30 2022
In T8801#188483, @Dmehus wrote: Has there been any more discussion among SRE team members with regard to agreeing on SLOs for the above?
May 9 2022
Apr 16 2022
Apr 11 2022
In T8802#183519, @Reception123 wrote: For memcached at least uptime/availability doesn't seem to be an issue.
Apr 8 2022
In T8802#183142, @Reception123 wrote: Sorry about the delay here. While not my strong area, I'd propose the following:
MediaWiki:
Service uptime: 97% (as RhinosF1 calculates above, 95% means 1.5 days/month not working, which is too much to be reasonable, so 97% is better IMO)
Errors: (Exception graph is broken so I'm unable to provide a suggestion here) - https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki?orgId=1&viewPanel=187
Latency: As RhinosF1 says, I find it difficult to make a suggestion here without seeing a graph first.
Jobqueue:
Availability: 97% would be reasonable? (though not with our current issues)
Errors: Without total jobs, difficult to propose
Latency: same as above
Memcached:
Availability: 97% still sounds like a good number to me
Feb 27 2022
Feb 26 2022
Uptime: I'd say 95% really at a minimum for all. That's still 1.5 days a month not working in total which is a lot.
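The arithmetic behind that figure can be sketched as follows, assuming a 30-day month. The function name is hypothetical; the targets are the ones discussed in this task.

```python
def allowed_downtime_hours(availability_pct: float, period_days: float = 30.0) -> float:
    """Hours of downtime permitted in a period at a given availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (100.0 - availability_pct) / 100.0 * period_days * 24.0


# Compare the targets raised in the discussion over a 30-day month.
for target in (95.0, 97.0, 99.5):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.1f} h/month ({hours / 24:.2f} days)")
```

At 95% this works out to 36 hours, i.e. the 1.5 days a month mentioned above; 97% allows roughly 21.6 hours and 99.5% roughly 3.6 hours.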
Feb 21 2022
Pretty sure this is offtopic
Feb 20 2022
Feb 19 2022
Feb 14 2022
Started work on this using a Python handler for interacting with OVH's PCA via Swift.
Feb 7 2022
Upping to normal as we are looking to decommission Bacula in the very near future.
Jan 25 2022
No action required from Infra. ES7 is deployed with no plan to downgrade.
ES 7 is not compatible with MediaWiki
https://github.com/miraheze/puppet/pull/2392 to add Composer for Elastica
Please use this form if you would like it on your wiki.
Jan 24 2022
Oh, you're right. It hasn't started yet.
In T7740#175455, @MikeV wrote: What is the URL for gratispaideiawiki? I'd like to see if you can include hyphens in the search term.
Jan 23 2022
Jan 22 2022
gratispaideiawiki is open for testing :)
Will look at this tomorrow
In T7740#175232, @RhinosF1 wrote: Is it worth trying what it's like on a few wikis?
I'm happy to try and set some config up and test perf on experimental wikis.
You can use it for:
- search
- text
- file metadata - probably minor and easily switchable back if it fails?
- ...?
Is it worth trying what it's like on a few wikis?
Re-assigning team.