
Revise MariaDB backup strategy
Closed, Resolved · Public

Description

(todo: fill in with more information)

Our current backup strategy is to copy /srv/mariadb directly to Bacula. However, that is practically guaranteed to produce corrupt backups, since no table locks are acquired during the copy. On top of that, retrieving specific databases or tables from such a backup is a pain.

See https://wikitech.wikimedia.org/wiki/MariaDB/Backups, and https://wikitech.wikimedia.org/wiki/MariaDB/Backups#Snapshoting_and_disaster_recovery in particular. While our wiki sizes are nowhere near Wikimedia's, their backups are generated with commonly used tools: mariabackup (which we already use for creating replicas) and mydumper (an alternative to mysqldump).

Event Timeline

Southparkfan triaged this task as Normal priority. Jul 6 2020, 17:30
Southparkfan created this task.

Contacted Owen for a data processing agreement for the free infra offers.

John subscribed.

Moving over to the next goal period. @Southparkfan, any updates/re-prioritisations?

Budget approved by Board. I have purchased a 4GB MKVM (dbbackup1.miraheze.org) @ RamNode.

mariabackup is running on db12 and db13 and is dumping all databases directly (over ssh) to dbbackup1.
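
For reference, the streaming setup is roughly of the following shape (a sketch only; the exact invocation, SSH user and target path are assumptions on my part):

# On db12/db13: stream a physical backup straight to dbbackup1 over SSH
mariabackup --backup --stream=xbstream \
  | ssh dbcopy@dbbackup1.miraheze.org 'cat > /srv/backups/db12.xbstream'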

Overview of plan:

  1. Set up replication for db12 (c3) and db13 (c4)
  2. Dump db11 (c2) to dbbackup1
  3. Set up replication for db11 (c2)
  4. Explore the possibility of using wmfbackups, otherwise write our own script that stops replication for a given cluster, uses mydumper to back up all databases to sql.gz files, and starts replication again (a rough sketch follows below)
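
A rough sketch of what the script in step 4 could look like (the file name, paths and options are assumptions based on the commands used elsewhere in this task, not a final implementation):

#!/bin/bash
# Hypothetical per-cluster backup wrapper, e.g. backup-cluster.sh c4
set -eu

cluster="$1"
socket="/var/run/mysqld/mysqld.${cluster}.sock"
outdir="/srv/backups/${cluster}"
logfile="/srv/backups/logs/$(date +"%Y%m%d%H%M%S")-${cluster}.log"

# Stop replication so the dump reflects a single point in time
mysql -S "$socket" -e 'STOP SLAVE'

# Restart replication afterwards, even if the dump fails
trap 'mysql -S "$socket" -e "START SLAVE"' EXIT

# Dump everything except the mysql and test schemas to compressed sql.gz files,
# including triggers, events and routines
mydumper -x '^(?!(mysql\.|test\.))' -G -E -R -v 3 -c -S "$socket" -o "$outdir" -L "$logfile"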

MariaDB instances became corrupted due to my own fault. Recloning.

dbbackup1 is replicating all clusters now.

Testing backups:

# Stop replication on the c4 instance so the dump reflects a single point in time
mysql -S /var/run/mysqld/mysqld.c4.sock -e 'stop slave'

cd /srv/backups

# Dump everything except the mysql and test schemas (compressed, one thread,
# including triggers/events/routines), logging to a timestamped file
mydumper -x '^(?!(mysql\.|test\.))' -G -E -R -v 3 -t 1 -c -S /var/run/mysqld/mysqld.c4.sock -L "/srv/backups/logs/$(date +"%Y%m%d%H%M%S")-c4.log"

# Resume replication
mysql -S /var/run/mysqld/mysqld.c4.sock -e 'start slave'

It may be possible to dump directly from the masters, with very little interruption: https://stackoverflow.com/q/56715657.
In that case, we can use the RamNode VMs to store the logical dumps (mydumper to stdout | ssh - local file). The disadvantage is that we won't have a live replica at all times (if a master crashes for good, the data between <most recent backup> and <crash> will be lost), but it's much cheaper: the I/O limit is not much of an issue and, since data is not replicated, there is more space for storing logical dumps.

A maintenance window is required for dumping from the masters directly. Not because impact is guaranteed, but because dumping may cause database locks lasting multiple seconds, increasing page save times or knocking wikis offline.

Decided to test this on db11 (c2 master) for metawiki, using this command: mydumper -B metawiki -G -E -R -v 3 -c --trx-consistency-only (dump metawiki compressed, high verbosity, use transactional consistency, include triggers/events/routines)

Results:

  • Uncompressed metawiki backup (without -c flag) is 2.9G, compressed is 1.3G
  • Dumping uncompressed takes real 1m18.535s (user 0m38.117s, sys 0m4.680s); compressed takes real 2m25.700s (user 1m59.198s, sys 0m6.715s)
  • I had no trouble editing metawiki during the backup: https://meta.miraheze.org/w/index.php?title=User:Southparkfan/Test&action=history
  • The last entry in the text table (which took the longest to dump) corresponds to an abuse filter log entry, and that entry is also the last row in the abuse_filter_log table. The second-to-last entry in the text table is the last entry of the revision/slots/content tables, which is good as well.
  • /srv/mariadb/metawiki is ~5.2G (excl. full-text search cache), /srv/mariadb is ~201G (excl. binary logs + full-text search cache), metawiki is ~2.6% of db11

Conclusions:

  • Dumping from the master directly looks promising
  • Estimation: dumping db11 uncompressed will take about 50 minutes, compressed about 93 minutes (see the arithmetic below)
  • Estimation: dumping db11 uncompressed requires about 111G of disk space, compressed about 50G
  • Since logical dumps are always full (incremental or differential dumps are not possible), disk space usage grows linearly
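
For transparency, these estimates simply scale the metawiki sample by its ~2.6% share of db11: 2.9G / 0.026 ≈ 111G uncompressed and 1.3G / 0.026 ≈ 50G compressed for disk space, and 78.5s / 0.026 ≈ 50 minutes (uncompressed) and 145.7s / 0.026 ≈ 93 minutes (compressed) for duration.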

More testing is required to determine the final backup sizes.

Proposal 1:

  • RPO of 24 hours (RPO = maximum tolerable data loss) -> daily backups
  • Retention period of 21 days -> 21 backups stored
  • ~160 GB per compressed backup for all clusters

160G * 21 = 3360 GB, costs: 1x RamNode MKVM 12GB (4000 GB) = $72/mo, but plenty of space left for future growth

Proposal 2:

  • RPO of 72 hours (RPO = maximum tolerable data loss) -> a backup every three days
  • Retention period of 21 days -> 7 backups stored
  • ~160 GB per compressed backup for all clusters

160G * 7 = 1120 GB, costs: 1x RamNode MKVM 4GB (1350 GB) = $24/mo

Proposal 3:

  • RPO of 48 hours (RPO = maximum tolerable data loss) -> a backup every two days
  • Retention period of 22 days -> 11 backups stored
  • ~160 GB per compressed backup for all clusters

160G * 11 = 1760 GB, costs: 1x RamNode MKVM 8GB (2650 GB) = $48/mo, but plenty of space left for future growth

Proposal 4:

  • RPO of 168 hours (RPO = maximum tolerable data loss) -> a backup every seven days
  • Retention period of 84 days -> 7 backups stored
  • ~160 GB per compressed backup for all clusters

160G * 7 = 1120 GB, costs: 1x RamNode MKVM 4GB (1350 GB) = $24/mo


For reference: mydumper is superior to mysqldump due to its better performance (multiple threads) and its flexibility (PCRE-based table inclusion/exclusion), combined with transactional consistency and (almost) no locking (no read-only time required during backups). However, mydumper does not support TLS connections, so dumping must happen on the database master.

mydumper writes to a local directory, not to stdout. Since we cannot afford to host 50G of backups on the masters and then scp them over, the backups should be piped (over SSH) to the backup server, but mydumper doesn't support that.

Fortunately, SSHFS saves us. We can mount /srv/backups (on dbbackup) in /home/dbcopy on the master. The mount is completely transparent to mydumper: the tool dumps to a directory, but behind the scenes SSHFS writes the content to the remote server (dbbackup) rather than to the master's local disk.

Proof of concept:

master: apt-get install sshfs
master: sshfs -o IdentityFile=/home/dbcopy/.ssh/id_ed25519 dbcopy@dbbackup1.miraheze.org:/srv/backups /home/dbcopy/dbbackup1-mnt/
master (test backup): mydumper -B metawiki -x '^(?!(metawiki\.text))' -c -v 3 -t 4 --trx-consistency-only

There may be better tools performance-wise, but I think SSHFS is the way to go:

  1. It uses battle-tested SSH.
  2. Setup: very easy to set up, requiring only the sshfs package and a persistent mount in Puppet (see the example below). We already run the OpenSSH daemon on the dbbackup servers, and it works well.
  3. Security: we trust OpenSSH for all administrative tasks, and the SFTP traffic runs over SSH. No additional firewall rules required, no unknown daemons required.
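
For example, the persistent mount could be a single fstab entry managed by Puppet (illustrative only; the exact mount options are an assumption):

# /etc/fstab on the database master (illustrative)
dbcopy@dbbackup1.miraheze.org:/srv/backups  /home/dbcopy/dbbackup1-mnt  fuse.sshfs  _netdev,IdentityFile=/home/dbcopy/.ssh/id_ed25519,reconnect,ServerAliveInterval=15  0  0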

New performance test (using sshfs setup, 4 mydumper threads):

  • Uncompressed: 290 seconds
  • Compressed: 210 seconds

Running dump from db11 to dbbackup1:/srv/backups/db11. @Paladox and I are around to monitor.

In T5877#142347, @John wrote:

@Southparkfan updates on the above?

Sorry for the lack of response. Still working on this: 16:36:25 <+SPF|Cloud> !log https://phabricator.miraheze.org/T5877#140588: run test backup on db11 with six threads. I stopped the backup from T5877#141278 mid-way by accident.

Command: mydumper -t 6 -v 3 -c --trx-consistency-only
Start: 2021-04-24 14:36 UTC
End: 2021-04-26 04:39 UTC (38 hours)
Backup size: 14 GB

Other tests required:

  • A test with the following settings: 1) -t 4 (true core count of each virtual machine) 2) --triggers --events --routines
  • Another test, but with -t 2 (to lessen server load)
  • What happens to performance if we back up three masters simultaneously? (reason: to maximise backup consistency)

Implementation:

  1. Agree on RPO and backup retention period
  2. Remove MariaDB replication instances, decom dbbackup2.miraheze.org
  3. Create local user account for database backups on database server and dbbackup servers, distribute SSH keys to database servers (similar to dbcopy)
  4. Create MariaDB account for database backups (to-do: determine correct read-only permissions, see MariaDB documentation for FLUSH)
  5. Puppetise creation of /srv/backups/c{2,3,4} with correct permissions (only the account from '3.' should be able to read/write)
  6. Puppetise the SSHFS mount (see this)
  7. Create a /var/log/mydumper directory on database servers with correct permissions
  8. Create a cronjob for backup management: 1 or 2 (a rough sketch follows after this list)
  9. Add monitoring checks: 1) alert if the last backup is older than RPO + ~48 hours (or <actual dump generation time>) 2) alert if the last backup is smaller than ~14 GB (?)
  10. Write a procedure for restoring backups, test the procedure by deliberately corrupting a test wiki. If successful, create the on-wiki documentation.
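
For step 8, a rough sketch of what the cronjob could look like, reusing the SSHFS mount and the mydumper flags tested above (the schedule, paths and user are assumptions, not a decision). For step 10, myloader (mydumper's companion tool) would be the natural way to load these dumps back.

# /etc/cron.d/db-backups on each master (illustrative): dump every two days at 03:00
0 3 */2 * * dbcopy mydumper -G -E -R -v 3 -t 2 -c --trx-consistency-only -o /home/dbcopy/dbbackup1-mnt/$(hostname -s)/$(date +\%Y\%m\%d) -L /var/log/mydumper/$(date +\%Y\%m\%d).log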

Running on db1{1,2,3} simultaneously:

mydumper -G -E -R -v 3 -t 2 -c -L "/home/dbcopy/dbbackup1-mnt/$(date +"%Y%m%d%H%M%S").log"

EDIT: trying again with --trx-consistency-only

Test backup: mydumper -G -E -R -v 3 -t 2 -c -L "/home/dbcopy/dbbackup1-mnt/$(date +"%Y%m%d%H%M%S").log" --trx-consistency-only

  • db11
    • Duration: 2095 minutes (34.9 hours)
    • Size: 14 GB
    • Tables: 204,174
  • db12
    • Duration: 1615 minutes (26.9 hours)
    • Size: 26 GB
    • Tables: 156,104
  • db13
    • Duration: 1359 minutes (22.7 hours)
    • Size: 35 GB
    • Tables: 125,530

Going to decom dbbackup2 (we'll be using dbbackup1).

The latency between the db servers and dbbackup is what makes the dump process slow. Moving the dbbackup VM to NL should improve performance, but NL is much closer to the UK than the US is. A disaster impacting both the UK and NL is not very likely, but still...

Moving over to the new goal period. Feel free to remove this if it shouldn't be carried over.

Unknown Object (User) moved this task from Backlog to Infrastructure on the Goal-2021-Jul-Dec board. Jul 3 2021, 18:45

Updates since last one on June 1st?

@Southparkfan Any updates on this task? If there isn't an update provided in a week, I'll reassign the task to ensure it gets completed.

De-assigned per lack of progress.

Trying to optimise the dump by reducing the amount of data carried over (because not everything in MediaWiki is irreplaceable!).

The data below was not collected over the network, so the numbers will differ vastly from those above. I just don't fancy spending days or weeks fine-tuning something I can do in a day by removing the unchanging network element.
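
For instance, the cache and derived tables excluded in the commands below (objectcache, querycache(two), recentchanges, searchindex) can be regenerated after a restore with standard MediaWiki maintenance scripts, so excluding them does not lose irreplaceable data. A sketch (the wiki name is hypothetical and the exact scripts to run would need checking):

# After restoring a wiki, rebuild the derived tables that were excluded from the dump
php maintenance/rebuildrecentchanges.php --wiki=examplewiki
php maintenance/rebuildtextindex.php --wiki=examplewiki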

Initial command shared by SPF:

mydumper -G -E -R -v 3 -t 2 -c -L "/home/johnflewis/$(date +"%Y%m%d%H%M%S").log" --trx-consistency-only

On db12:

  • Time taken: 151 minutes (2 hours, 31 minutes)
  • Size: 33GB

Second attempt, excluding the recentchanges and searchindex tables:

mydumper -G -E -R -v 3 -t 2 -c -x "^(?!([0-9a-z]+wiki.(recentchanges|searchindex)))" -L "/home/johnflewis/$(date +"%Y%m%d%H%M%S").log" --trx-consistency-only

On db12:

  • Time taken: 129 minutes (2 hours and 9 minutes)
  • Size: 32G

Third attempt, also excluding the objectcache, querycache and querycachetwo tables, and skipping schema dumps (-m):

mydumper -G -E -R -m -v 3 -t 2 -c -x "^(?!([0-9a-z]+wiki.(objectcache|querycache|querycachetwo|recentchanges|searchindex)))" -L "/home/johnflewis/$(date +"%Y%m%d%H%M%S").log" --trx-consistency-only

On db12:

  • Time taken: 103 minutes (1 hour and 43 minutes)
  • Size: 30G

Currently doing the above command but over an NFS mount to dbbackup1, which is in the US. This will take significantly longer - that is the main thing I am interested in right now.

Depending on how long it takes and our retention demands, I'll make a decision about whether dbbackup1 is kept or whether we seek to invest further in our bacula2 install.

The backup ran for 14 hours before I killed it as it caused T8163.

Based on this information, I think doing the backups over the network is not practical - therefore I will be decoming dbbackup1 later today.

https://github.com/miraheze/puppet/compare/6d6dcbc15b0e...139bf730eb26 automates this daily, so we should have a live, accessible copy for a 24-hour RPO - and Bacula will store backups for a longer period of time (TBD).

I'll populate live backups at some point today to get an understanding of the full data requirements. We might need to expand disk space on db* to accommodate natural growth, but that could be avoided with T8157.

My expectation is to have this task resolved today.

db13:

  • Time taken: 2 hours and 20 minutes
  • Size: 33G

https://github.com/miraheze/puppet/commit/9feab439cdfee3ee6af9db651212d78163fc3943

This is now resolved.

I will write some brief docs on this tonight as well.