Resolved
Jun 25 2022
@Paladox less than a week until end of goal period - do we have an update on this?
May 9 2022
Apr 16 2022
Feb 21 2022
Jan 1 2022
Dec 5 2021
I am going to start progress on this task, firstly by cleaning up how we define all of this in puppet. I'll introduce simple logging stanzas that we can define over and over again for each log file, which handle all of the syslog-ng logic + logrotate configuration for the new system.
Nov 7 2021
Oct 20 2021
New server list for checking the above plan against:
Plan for resolving this task:
- All services will have their logs ingested into Graylog, this isn't negotiable.
- Where logs are ingested, we will maintain 24-48 hours of *local* logs on the server. This will be supported by log rotation.
Oct 15 2021
This is now resolved.
Oct 13 2021
db13:
- Time taken: 2 hours and 20 minutes
- Size: 33G
https://github.com/miraheze/puppet/compare/6d6dcbc15b0e...139bf730eb26 automates this daily, so we should have a live, accessible copy with a 24-hour RPO - and bacula will store backups for a longer period of time (TBD).
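For illustration only (the real scheduling lives in the puppet change linked above), a daily dump could be driven by a cron entry along these lines; the user, paths and schedule here are hypothetical, and % is escaped because cron treats it specially:
0 2 * * * dbcopy mydumper -G -E -R -v 3 -t 2 -c --trx-consistency-only -o "/srv/backups/db13/$(date +\%Y\%m\%d)" -L "/srv/backups/db13/$(date +\%Y\%m\%d).log"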
The backup ran for 14 hours before I killed it as it caused T8163.
Oct 12 2021
Currently running the above command, but over an NFS mount to dbbackup1, which is in the US. This will take significantly longer - that is the main thing I am interested in right now.
mydumper -G -E -R -m -v 3 -t 2 -c -x "^(?!([0-9a-z]+wiki.(objectcache|querycache|querycachetwo|recentchanges|searchindex)))" -L "/home/johnflewis/$(date +"%Y%m%d%H%M%S").log" --trx-consistency-only
On db12:
- Time taken: 103 minutes (1 hour and 43 minutes)
- Size: 30G
Oct 11 2021
Trying to optimise the dump by reducing the amount of data carried over (because not everything in MediaWiki is irreplaceable!)
T7740 is likely to be influenced by work done on this task.
Oct 9 2021
Sep 28 2021
De-assigned due to lack of progress.
Sep 21 2021
@Southparkfan Any updates on this task? If there isn't an update provided in a week, I'll reassign the task to ensure it gets completed.
Aug 10 2021
In T5044#156437, @John wrote: @Paladox has raised concerns with centralised-only logging. We should explore these concerns before pushing for things like nginx access logs as these are critical for debugging some traffic influx/DoS attacks.
I agree with that. At least for some logs it's definitely useful to have them stored locally, in case something goes wrong and the logs don't get transmitted to Graylog.
@Paladox has raised concerns with centralised-only logging. We should explore these concerns before pushing for things like nginx access logs as these are critical for debugging some traffic influx/DoS attacks.
Any updates since the last one on June 1st?
Jul 31 2021
Jul 28 2021
In T5412#154985, @Universal_Omega wrote: This has now been deployed.
This has now been deployed.
Jul 26 2021
Currently blocked on community consensus.
I guess this wasn't moved over to the next goal period, so doing that.
I drafted a bit of JS for this, using the OOjs UI dialogs. This should be fairly straightforward to do, with a "review" button next to the save button, so it does not annoy users who don't want to review them.
Jul 3 2021
Moving over to the new goal period. Feel free to remove this if it shouldn't be moved over.
Moving over to the new goal period. Feel free to remove this if it shouldn't be moved over.
Jun 14 2021
I could look into taking this over from @Paladox. Is there anything not on this task that I should be aware of if I do?
Jun 13 2021
This went live after T7117: Upgrade to MediaWiki 1.36.0 was done.
Jun 1 2021
The latency between db and dbbackup causes the slowness in the dump process. Moving the dbbackup VM to NL should improve the performance, but NL is much closer to the UK than the US is. A disaster impacting both the UK and NL is not very likely, but still...
May 27 2021
https://github.com/miraheze/IncidentReporting/pull/22 should complete this.
May 9 2021
Going to decom dbbackup2 (we'll be using dbbackup1).
May 3 2021
Test backup: mydumper -G -E -R -v 3 -t 2 -c -L "/home/dbcopy/dbbackup1-mnt/$(date +"%Y%m%d%H%M%S").log" --trx-consistency-only
- db11
- Duration: 2095 minutes (34.9 hours)
- Size: 14 GB
- Tables: 204,174
- db12
- Duration: 1615 minutes (26.9 hours)
- Size: 26 GB
- Tables: 156,104
- db13
- Duration: 1359 minutes (22.7 hours)
- Size: 35 GB
- Tables: 125,530
May 2 2021
Running on db1{1,2,3} simultaneously:
mydumper -G -E -R -v 3 -t 2 -c -L "/home/dbcopy/dbbackup1-mnt/$(date +"%Y%m%d%H%M%S").log"
EDIT: trying again with --trx-consistency-only
Apr 26 2021
Other tests required:
- A test with the following settings: 1) -t 4 (true core count of each virtual machine) 2) --triggers --events --routines (a sketch of this command follows the list)
- Another test, but with -t 2 (to lessen server load)
- What happens to performance if we backup three masters simultaneously? (reason: to maximise backup consistency)
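A sketch of what the first of those tests could look like, reusing the flags from the earlier runs with the thread count bumped to 4 (the log path is just an example):
mydumper -G -E -R -v 3 -t 4 -c --trx-consistency-only -L "/home/dbcopy/dbbackup1-mnt/$(date +"%Y%m%d%H%M%S").log"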
In T5877#142646, @Southparkfan wrote: In T5877#142347, @John wrote: @Southparkfan updates on the above?
Sorry for the lack of response. Still working on this: 16:36:25 <+SPF|Cloud> !log https://phabricator.miraheze.org/T5877#140588: run test backup on db11 with six threads. I stopped the backup from T5877#141278 mid-way by accident.
Command: mydumper -t 6 -v 3 -c --trx-consistency-only
Start: 2021-04-24 14:36 UTC
End: 2021-04-26 04:39 UTC (38 hours)
Backup size: 14 GB
Apr 25 2021
Apr 24 2021
In T5877#142347, @John wrote: @Southparkfan updates on the above?
Apr 20 2021
@Southparkfan updates on the above?
Apr 19 2021
There's one other log I don't think we need to send for Proxmox (it didn't really contain any info we needed, I think).
Added pve* logging via https://github.com/miraheze/puppet/pull/1713
I will try and finish this now (for cloud*)
Apr 9 2021
Running dump from db11 to dbbackup1:/srv/backups/db11. @Paladox and I are around to monitor.
Apr 4 2021
New performance test (using sshfs setup, 4 mydumper threads):
- Uncompressed: 290 seconds
- Compressed: 210 seconds
For reference: mydumper is superior to mysqldump due to its better performance (using multiple threads) and its flexibility (PCRE-based table inclusion/exclusion), in conjunction with transaction consistency and (almost) no locking (no read-only time required during backups). However, mydumper does not support TLS connections, so dumping must happen at the database master.
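As a rough sketch of that constraint in practice: the dump has to run on the master itself, with the output directory pointed at storage that lives elsewhere, for example the sshfs mount used in the tests above; the output directory name below is hypothetical:
mydumper -G -E -R -v 3 -t 2 -c --trx-consistency-only -o "/home/dbcopy/dbbackup1-mnt/db11-$(date +%Y%m%d)" -L "/home/dbcopy/dbbackup1-mnt/$(date +%Y%m%d%H%M%S).log"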
Apr 3 2021
Mar 31 2021
More testing is required to determine the final backup sizes.
In T5877#139273, @Southparkfan wrote: A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.
Mar 28 2021
Mar 27 2021
Mar 25 2021
A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.
Mar 21 2021
Mar 18 2021
Perhaps it is possible to dump directly from the masters with very little interruption: https://stackoverflow.com/q/56715657.
In that case, we can use the RamNode VMs to store the logical dumps (mydumper to stdout | ssh - local file). The disadvantage is that we won't have a live replica at all times (if a master crashes for good, the data between <most recent backup> and <crash> will be lost), but it's much cheaper: I/O limit is not much of an issue and since data is not replicated, there is more space for storing logical dumps.
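mydumper writes per-table files into a directory rather than a single stream, so a close shell equivalent of the pipeline described above might be to dump locally and then stream a compressed archive over ssh; the hostnames and paths here are hypothetical:
mydumper -G -E -R -v 3 -t 2 -c --trx-consistency-only -o /srv/tmp/dump && tar -cf - -C /srv/tmp dump | ssh dumps-vm "cat > /srv/backups/db11-$(date +%Y%m%d).tar"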
In T4420#138213, @John wrote: In T4420#138212, @Universal_Omega wrote: In T4420#138210, @John wrote: When I try this and select ‘show number of incidents’ and ‘show all services’, all the rows turn up empty with no numbers. This is the same for visible outage and total outage.
Oh, hmm. That didn't happen to me when I was testing this. I will attach screenshots of local test shortly
If this is deployed, a local test adds no value here, because it’s deployed in production now.
In T4420#138212, @Universal_Omega wrote: In T4420#138210, @John wrote: When I try this and select ‘show number of incidents’ and ‘show all services’, all the rows turn up empty with no numbers. This is the same for visible outage and total outage.
Oh, hmm. That didn't happen to me when I was testing this. I will attach screenshots of local test shortly
Mar 17 2021
In T4420#138210, @John wrote: When I try this and select ‘show number of incidents’ and ‘show all services’, all the rows turn up empty with no numbers. This is the same for visible outage and total outage.
When I try this and select ‘show number of incidents’ and ‘show all services’, all the rows turn up empty with no numbers. This is the same for visible outage and total outage.
Mar 15 2021
Done with:
Mar 14 2021
Mar 11 2021
Mar 10 2021
https://github.com/miraheze/CreateWiki/pull/200 resolves this task; only setting a configuration in LS is required now to enable this.
Mar 9 2021
In T5044#136907, @Paladox wrote: We switched off syslog-ng logging on the cloud servers. Not sure if we want to switch it back on, @John @Southparkfan?
Yes, let's see if we can receive proxmox logs without further tweaking.
So I've created and merged this pull request: https://github.com/miraheze/puppet/pull/1695. Essentially, logs for puppetserver/puppetdb are now read and sent to Graylog.
Mar 8 2021
We switched off syslog-ng logging on the cloud servers. Not sure if we want to switch it back on, @John @Southparkfan?