As discussed, candidate for Goal-2021-Jul-Dec.
- Queries
- All Stories
- Search
- Advanced Search
- Transactions
- Transaction Logs
Advanced Search
May 3 2021
Discussed: some of the tasks are being handed over to me (see subtasks); we won't delay this.
Puppet 6 is EOL in December 2022, so there is no need to rush this. Scheduled for Q4 2021 / Q1-Q2 2022.
Until configuration has been synced (mostly) with Varnish's.
Discussed; paladox will contact Wikimedia DBAs.
+ Phabricator admin?
May 2 2021
@RhinosF1 The task is for the Infrastructure team now, but JohnLewis couldn't have known that in the first place.
Scheduler for sda and sdb: [mq-deadline] none
Facts:
- cloud4 has been experiencing high disk utilization as of ~01:00: https://grafana.miraheze.org/d/W9MIkA7iz/miraheze-cluster?viewPanel=287&orgId=1&left=%5B%22now-1h%22,%22now%22,%22Prometheus%22,%7B%7D%5D&var-job=node&var-node=cloud4.miraheze.org&var-port=9100&from=1619864292000&to=1619993832000
- cloud5 experienced high disk utilization between ~01:00 and ~17:45: https://grafana.miraheze.org/d/W9MIkA7iz/miraheze-cluster?viewPanel=287&orgId=1&left=%5B%22now-1h%22,%22now%22,%22Prometheus%22,%7B%7D%5D&var-job=node&var-node=cloud4.miraheze.org&var-port=9100&from=1619864292000&to=1619993832000
- checkarray is running on cloud4, but not on cloud5:
  $ cat /proc/mdstat
  check = 84.6% (...) finish=654.8min
@Paladox and I are investigating the possibility of the /etc/cron.d/mdadm check (on cloud nodes only) being the cause of high I/O.
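If the mdadm cron job does turn out to be the cause, one possible mitigation (a sketch, not a tested config; the cron schedule and paths are taken from the standard Debian mdadm package and may differ on our hosts) is to run the array check at idle I/O priority, and optionally cap the kernel's md check/resync speed:

```
# /etc/cron.d/mdadm (sketch): run checkarray at idle I/O priority
57 0 * * 0 root if [ -x /usr/share/mdadm/checkarray ]; then ionice -c3 /usr/share/mdadm/checkarray --cron --all --quiet; fi

# Optionally cap md sync/check throughput (value in KB/s, illustrative):
# echo 50000 > /proc/sys/dev/raid/speed_limit_max
```

ionice -c3 (idle class) lets normal I/O preempt the check, at the cost of the check taking longer.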
Running on db1{2,3,4} simultaneously:
mydumper -G -E -R -v 3 -t 2 -c -L "/home/dbcopy/dbbackup1-mnt/$(date +"%Y%m%d%H%M%S").log"
EDIT: trying again with --trx-consistency-only
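For clarity, the retried invocation would look roughly like this (a sketch: host/credential options omitted, log directory taken from the command above):

```shell
# Sketch: the mydumper command above with --trx-consistency-only added.
# -G/-E/-R dump triggers, events and routines; -t 2 limits threads;
# -c compresses output; -L writes the log to the dbbackup1 mount.
LOGDIR="/home/dbcopy/dbbackup1-mnt"
CMD="mydumper -G -E -R -v 3 -t 2 -c --trx-consistency-only -L ${LOGDIR}/$(date +%Y%m%d%H%M%S).log"
echo "$CMD"
```

--trx-consistency-only shortens the initial global lock: mydumper only holds it long enough to start consistent transactions, rather than for the whole dump.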
May 1 2021
Apr 30 2021
In T7117#143323, @RhinosF1 wrote: @Southparkfan: Can you advise what the best process for taking a backup of test3 DB prior to running the sql & maint scripts is?
You can see the proposed commands for extensions and core file list at P403
Assuming test3wiki can survive read-only mode / performance (database locking) issues for a few minutes, a mysqldump --single-transaction by @Reception123 (on <the database server hosting test3wiki>) is good enough.
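A minimal sketch of such a dump (output path and extra flags are my assumptions, not a prescribed procedure):

```shell
# Sketch: one-off consistent dump of test3wiki before the migration.
# --single-transaction gives a consistent snapshot for InnoDB tables
# without a long lock; --quick streams rows instead of buffering them.
DB="test3wiki"
OUT="/root/${DB}-$(date +%Y%m%d).sql.gz"
CMD="mysqldump --single-transaction --quick ${DB}"
echo "$CMD | gzip > $OUT"
```

Note that --single-transaction only guarantees consistency for transactional (InnoDB) tables; any non-InnoDB tables would still need a lock or read-only window.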
Apr 29 2021
Apr 26 2021
Other tests required:
- A test with the following settings: 1) -t 4 (true core count of each virtual machine) 2) --triggers --events --routines
- Another test, but with -t 2 (to lessen server load)
- What happens to performance if we back up three masters simultaneously? (reason: to maximise backup consistency)
In T5877#142646, @Southparkfan wrote: In T5877#142347, @John wrote: @Southparkfan updates on the above?
Sorry for the lack of response. Still working on this:
16:36:25 <+SPF|Cloud> !log https://phabricator.miraheze.org/T5877#140588: run test backup on db11 with six threads.
I stopped the backup from T5877#141278 mid-way by accident.
Command: mydumper -t 6 -v 3 -c --trx-consistency-only
Start: 2021-04-24 14:36 UTC
End: 2021-04-26 04:39 UTC (38 hours)
Backup size: 14 GB
Apr 25 2021
This won't be an issue anymore.
Apr 24 2021
In T5877#142347, @John wrote: @Southparkfan updates on the above?
In T4425#142254, @John wrote: @Southparkfan See the above please
Apr 19 2021
@GodlessRaven Unfortunately, our servers have trouble loading the images of country flags. Please reduce the number of images in Template:Geonav.
Apr 14 2021
Actually, I think I'm closer now. Top profile entries for the Germany article without parsercache hit:
100.00% 4987.024      1 - main()
 99.22% 4947.946      1 - wfIndexMain
 99.22% 4947.914      1 - MediaWiki::run
 99.21% 4947.640      1 - MediaWiki::main
 96.67% 4820.735      1 - MediaWiki::performRequest
 96.58% 4816.498      1 - MediaWiki::performAction
 96.57% 4816.175      1 - ViewAction::show
 96.42% 4808.691      1 - Article::view
 95.22% 4748.519      1 - PoolCounterWork::execute
 95.22% 4748.501      1 - PoolWorkArticleView::doWork
 94.86% 4730.743    364 - call_user_func
 94.73% 4724.182      1 - MediaWiki\Revision\RenderedRevision::getRevisionParserOutput
 94.73% 4724.173      1 - MediaWiki\Revision\RevisionRenderer::MediaWiki\Revision\{closure}
 94.73% 4724.172      1 - MediaWiki\Revision\RevisionRenderer::combineSlotOutput
 94.73% 4724.165      1 - MediaWiki\Revision\RenderedRevision::getSlotParserOutput
 94.73% 4724.150      1 - MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached
 94.73% 4724.147      1 - AbstractContent::getParserOutput
 94.50% 4712.590      9 - Parser::parse
 94.39% 4707.457      1 - WikitextContent::fillParserOutput
 92.26% 4601.219      9 - Parser::internalParse
 50.38% 2512.279    442 - Parser::replaceVariables
 50.32% 2509.301    262 - PPFrame_Hash::expand
 49.79% 2483.122     65 - Parser::braceSubstitution
 48.95% 2440.924    241 - PPFrame_Hash::expand@1
 48.88% 2437.498     66 - Parser::braceSubstitution@1
 48.63% 2424.996      2 - PPTemplateFrame_Hash::cachedExpand
 48.56% 2421.826    206 - PPFrame_Hash::expand@2
 48.48% 2417.873     34 - Parser::braceSubstitution@2
 48.46% 2416.850    219 - PPFrame_Hash::expand@3
 48.43% 2415.128    235 - Parser::argSubstitution
 48.38% 2412.822    234 - PPTemplateFrame_Hash::getArgument
 48.30% 2408.795    243 - PPTemplateFrame_Hash::getNumberedArgument
 47.81% 2384.088    156 - PPFrame_Hash::expand@4
 47.80% 2383.565     95 - PPTemplateFrame_Hash::getNamedArgument
 47.75% 2381.453    166 - PPFrame_Hash::expand@5
 47.74% 2380.743    299 - Parser::braceSubstitution@3
 47.73% 2380.329     55 - Parser::argSubstitution@1
 47.72% 2379.760     55 - PPTemplateFrame_Hash::getArgument@1
 46.37% 2312.559    506 - PPFrame_Hash::expand@6
 44.65% 2226.903    250 - Parser::braceSubstitution@4
 44.36% 2212.446    353 - Parser::callParserFunction
 43.84% 2186.357    254 - MediaWiki\Extensions\ParserFunctions\ParserFunctions::switch
 41.80% 2084.648     32 - Parser::handleInternalLinks
 41.80% 2084.477     32 - Parser::handleInternalLinks2
 38.41% 1915.588  72583 - PPFrame_Hash::expand@7
 29.46% 1469.174    257 - Parser::braceSubstitution@5
 28.39% 1415.838   1003 - PPFrame_Hash::expand@8
 28.10% 1401.562    247 - Parser::braceSubstitution@6
 27.84% 1388.267    307 - Parser::callParserFunction@1
 27.75% 1383.777    247 - MediaWiki\Extensions\ParserFunctions\ParserFunctions::switch@1
 27.60% 1376.460   4138 - PPFrame_Hash::expand@9
 26.82% 1337.538    730 - Parser::braceSubstitution@7
 25.18% 1255.556 177103 - MediaWiki\Extensions\ParserFunctions\ParserFunctions::decodeTrimExpand
 24.88% 1240.986    254 - MediaWiki\BadFileLookup::isBadFile
 24.86% 1239.964    508 - RepoGroup::findFile
 24.19% 1206.336   1954 - PPFrame_Hash::expand@10
 21.63% 1078.468    730 - Parser::braceSubstitution@8
 21.26% 1060.474    738 - FileRepo::findFile
 20.30% 1012.231   1857 - WANObjectCache::getWithSetCallback
 20.08% 1001.627    733 - Parser::callParserFunction@2
 19.92%  993.603   1849 - WANObjectCache::fetchOrRegenerate
 19.79%  987.131    730 - MediaWiki\Extensions\ParserFunctions\ParserFunctions::switch@2
 15.57%  776.300    254 - Parser::makeImage
 14.30%  713.180   1356 - Wikimedia\Rdbms\DBConnRef::__call
 14.08%  702.115    801 - Wikimedia\Rdbms\Database::select
 13.42%  669.325    813 - Wikimedia\Rdbms\Database::query
 13.31%  663.683    816 - Wikimedia\Rdbms\Database::executeQuery
 13.29%  662.669    743 - Wikimedia\Rdbms\DBConnRef::selectRow
 13.25%  660.912    744 - Wikimedia\Rdbms\Database::selectRow
 13.17%  656.632    816 - Wikimedia\Rdbms\Database::executeQueryAttempt
 12.99%  647.844 102832 - PPFrame_Hash::expand@11
 12.52%  624.529   1852 - WANObjectCache::get
 12.45%  621.126   1852 - WANObjectCache::getMulti
 12.05%  600.810   1855 - MediumSpecificBagOStuff::getMulti
 11.92%  594.457    816 - Wikimedia\Rdbms\DatabaseMysqli::doQuery
 11.89%  592.824   1855 - MemcachedPhpBagOStuff::doGetMulti
 11.81%  589.080   1855 - MemcachedClient::get_multi
 11.77%  587.036    816 - mysqli::query
 10.96%  546.385   3715 - MemcachedClient::_fgets
 10.72%  534.668   1863 - MemcachedClient::_load_items
 10.71%  534.110   3715 - fgets
  9.54%  475.907    492 - LocalRepo::checkRedirect
  9.05%  451.240   1008 - ForeignAPIRepo::fetchImageQuery
  8.59%  428.321   1008 - ForeignAPIRepo::httpGetCached
  8.37%  417.328    254 - Linker::makeImageLink
  8.12%  404.924    984 - LocalFile::load
  8.11%  404.227    492 - LocalFile::loadFromCache
  7.65%  381.575 381284 - call_user_func@1
  7.64%  381.088    762 - ForeignAPIFile::transform
  7.46%  371.800 380846 - ParserOptions::getOption
  6.70%  333.928    486 - spl_autoload_call
  6.69%  333.546    762 - ForeignAPIRepo::getThumbUrlFromCache
  6.66%  332.034    762 - ForeignAPIRepo::getThumbUrl
  6.35%  316.879    478 - AutoLoader::autoload
  6.17%  307.496    246 - LocalFile::loadFromDB
  5.92%  295.367 177199 - PPNode_Hash_Tree::splitArg
  5.31%  264.929 177199 - PPNode_Hash_Tree::splitRawArg
  5.07%  252.726 185159 - ParserOptions::getMaxPPNodeCount
  4.78%  238.492    246 - section.query-m: SELECT img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,comment_img_description.comment_text AS `img_description_text`,comment_img_description.comment_data AS `img_description
  4.57%  227.997    247 - Parser::parseLinkParameter
  4.51%  225.003    630 - LinkCache::addLinkObj
  4.51%  224.969    460 - Title::getArticleID
  4.50%  224.621    254 - Linker::processResponsiveImages
  4.39%  218.950    243 - LocalRepo::{closure}
  4.34%  216.456    623 - ParserOutput::addLink
  3.86%  192.463    245 - LinkCache::fetchPageRow
  3.61%  180.236   4960 - PPNode_Hash_Tree::splitRawTemplate
  3.60%  179.284 185159 - ParserOptions::getMaxPPExpandDepth
  3.56%  177.691   6729 - MediaWiki\HookContainer\HookContainer::run
  3.51%  174.833 380846 - ParserOptions::optionUsed
  3.48%  173.330    738 - FileRepo::newFile
  3.26%  162.529      1 - RepoGroup::initialiseRepos
  3.26%  162.513      3 - RepoGroup::newRepo
  3.24%  161.691    243 - section.query-m: SELECT rd_namespace,rd_title FROM `page`,`redirect` WHERE page_namespace = N AND page_title = 'X' AND (rd_from = page_id) LIMIT N
  3.18%  158.616    245 - section.query-m: SELECT page_id,page_len,page_is_redirect,page_latest,page_restrictions,page_content_model FROM `page` WHERE page_namespace = N AND page_title = 'X' LIMIT N
  3.00%  149.794    246 - ForeignAPIRepo::newFile
  2.94%  146.832    246 - ForeignAPIFile::newFromTitle
  2.89%  144.363      3 - FileRepo::__construct
  2.89%  144.294      2 - LocalRepo::__construct
  2.62%  130.767      3 - FileBackendGroup::get
Submitting an edit to the Germany article is enough to reproduce: https://socdemwiki.miraheze.org/w/index.php?diff=8321&oldid=8307&rcid=8653
Apr 13 2021
In T7131#141558, @GodlessRaven wrote: Weird, because I got a "504 Gateway Time-out" just now (trying to access https://socdemwiki.miraheze.org/wiki/Iceland).
The slowlog has been re-added locally, so debugging is possible again; lowering priority. Leaving the task open though:
- It would be useful to correlate NGINX upstream timeouts with slowlogs in Graylog -> have syslog-ng read the slowlog and send the data to Graylog
- There is no logrotate configuration for the slowlogs, so the logs may eventually fill up the disk (though this will take years)
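A minimal logrotate sketch for the slowlogs (the glob path and rotation policy here are assumptions; adjust to wherever the slowlog actually lives):

```
# /etc/logrotate.d/php-slowlog (sketch)
/var/log/php*/slowlog*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
}
```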
@GodlessRaven's first request:
*   << BeReq    >> 97781317
-   Begin          bereq 97781316 pass
-   Timestamp      Start: 1618257674.184309 0.000000 0.000000
-   BereqMethod    POST
-   BereqURL       /w/index.php?title=Germany&action=submit
-   BereqProtocol  HTTP/1.1
-   BereqHeader    Host: socdemwiki.miraheze.org
<REDACTED>
-   BereqHeader    sec-fetch-dest: document
-   BereqHeader    referer: https://socdemwiki.miraheze.org/w/index.php?title=Germany&action=submit
<REDACTED>
-   BereqHeader    X-Device: desktop
-   BereqHeader    X-Use-Mobile: 0
-   BereqHeader    X-Varnish: 97781317
-   VCL_call       BACKEND_FETCH
-   VCL_return     fetch
-   BackendOpen    89 mw11 127.0.0.1 8088 127.0.0.1 60006
-   BackendStart   127.0.0.1 8088
-   Timestamp      Bereq: 1618257674.184397 0.000087 0.000087
-   FetchError     HTC idle (3)
-   BackendClose   89 mw11
-   Timestamp      Beresp: 1618257794.150665 119.966356 119.966269
-   Timestamp      Error: 1618257794.150675 119.966366 0.000010
-   BerespProtocol HTTP/1.1
-   BerespStatus   503
-   BerespReason   Service Unavailable
-   BerespReason   Backend fetch failed
-   BerespHeader   Date: Mon, 12 Apr 2021 20:03:14 GMT
-   BerespHeader   Server: Varnish
-   VCL_call       BACKEND_ERROR
-   BerespHeader   Content-Type: text/html; charset=utf-8
-   VCL_return     deliver
-   Storage        malloc Transient
-   Length         2934
-   BereqAcct      1981 7270 9251 0 0 0
Apr 9 2021
Running dump from db11 to dbbackup1:/srv/backups/db11. @Paladox and I are around to monitor.
Apr 6 2021
Apr 5 2021
Apr 4 2021
New performance test (using sshfs setup, 4 mydumper threads):
- Uncompressed: 290 seconds
- Compressed: 210 seconds
For reference: mydumper is superior to mysqldump due to its better performance (multiple threads) and flexibility (PCRE-based table inclusion/exclusion), combined with transaction consistency and (almost) no locking (no read-only time required during backups). However, mydumper does not support TLS connections, so dumping must happen on the database master itself.
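The PCRE-based filtering mentioned above uses mydumper's --regex option; a sketch (the exclusion pattern here is illustrative, matching the common example of skipping system schemas):

```shell
# Sketch: exclude the mysql and performance_schema schemas via a
# negative-lookahead PCRE; mydumper matches the regex against
# "database.table" names.
REGEX='^(?!(mysql\.|performance_schema\.))'
CMD="mydumper -t 4 -c --trx-consistency-only --regex $REGEX"
echo "$CMD"
```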
Apr 2 2021
In T7087#140328, @RhinosF1 wrote: In T7087#140325, @Southparkfan wrote: We have the blackbox exporter for this. Can we help you by monitoring specific URLs?
As mentioned in the task, /healthcheck is the biggest one because it has an effect on uptime if that gets too high.
/healthcheck = Meta's Main Page. We're already monitoring that.
I would recommend we do one that loads quite a few resources (eg. Images, javascript etc)
The blackbox exporter does not monitor subsequent requests, such as resources (images?) used on an article. We can monitor such resources directly though, but you'll need to provide specific URLs. :)
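For reference, probing a specific URL via the blackbox exporter looks roughly like this in prometheus.yml (job name, target URL and exporter address are illustrative, not our actual config):

```
scrape_configs:
  - job_name: 'blackbox-article'
    metrics_path: /probe
    params:
      module: [http_2xx]   # probe succeeds on a 2xx response
    static_configs:
      - targets:
          - https://meta.miraheze.org/wiki/Miraheze
    relabel_configs:
      # Pass the target URL to the exporter as ?target=...
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the blackbox exporter itself
      - target_label: __address__
        replacement: 127.0.0.1:9115
```

Each URL to monitor becomes one more entry under targets; the exporter only fetches that URL, not the sub-resources it references.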
We have the blackbox exporter for this. Can we help you by monitoring specific URLs?
Apr 1 2021
In T7073#140060, @John wrote: Is there a use case for this that the ES data source wouldn't fulfil? Is this the approach MediaWiki (SRE) wish to take? If so this would fall under the MW team to implement as part of their task as without a use case for Infra, what's the point in implementing something unused?
There are more use cases than just MediaWiki. For example, I would like to monitor SSH authentication attempts and access logs of non-MediaWiki services, which is a task for us, not for the MediaWiki team. The proof of concept above was tailored for MediaWiki logs, because those logs have a higher priority.
Mar 31 2021
Proof of concept:
/etc/prometheus-es-exporter/mediawiki.cfg:
[query_log_mediawiki]
QueryIntervalSecs = 900
QueryIndices = <graylog_deflector>
QueryJson = {
    "size": 0,
    "track_total_hits": true,
    "query": {
      "bool": {
        "must": [
          { "match": { "application_name": "mediawiki" } }
        ],
        "filter": [
          { "range": { "timestamp": { "gte": "now-15m", "lte": "now" } } }
        ]
      }
    },
    "aggs": {
      "mediawiki-channels": {
        "terms": { "field": "mediawiki_channel" }
      }
    }
  }
The rebuild should finish in 10-30 minutes. If the errors are gone after the rebuild, you can close this task. If not, assistance will be needed.
Already did that for 'en', but without luck. Started the rebuild in a screen now.
More testing is required to determine the final backup sizes.
In T5877#139273, @Southparkfan wrote: A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.
In T7067#139966, @Reception123 wrote: Just noting that it's been decided to discontinue SRE duty due to the new team system, and it didn't seem to be functioning anymore. The dashboard and links we've compiled have still been kept though, as they're useful.
Mar 30 2021
Mar 29 2021
In order to do proper backend certificate verification (the CN check), we have tested using ENFORCE. However, the Host header from the client (e.g. allthetropes.org) is used for the CN check at the backend, so the allthetropes.org certificate would still be mandatory at the backend, even though I would prefer to remove all certificates (including our wildcard one) except a single domain (such as ats-internal.miraheze.wiki) from the MediaWiki servers.
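For context, origin verification in ATS is configured in sni.yaml; a sketch of what was tested might look like the following (hostname match and property set are my assumptions; note the documented policy value is spelled ENFORCED):

```
# sni.yaml (sketch): verify the origin server's certificate
sni:
  - fqdn: '*'
    verify_server_policy: ENFORCED
    verify_server_properties: ALL   # check both signature and name
```

The name check is exactly where the problem above arises: the name ATS verifies is derived from the client's Host header, not from a fixed internal hostname.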
Mar 28 2021
The future of these servers depends on the outcome of testing regarding T5877#139273.
In T7033#139663, @Paladox wrote: @Southparkfan should we make this task public viewable?
I recommend rebooting the dbbackup servers. They may or may not be affected by CVE-2021-3450, but as long as these servers are rebooted gracefully, we can survive without them for a few minutes.
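To check whether a host is even running an affected OpenSSL build: CVE-2021-3450 was introduced in 1.1.1h and fixed in 1.1.1k, so a quick version test suffices (the helper below is a sketch; distro builds may backport the fix without bumping the version, so treat a match as "potentially affected"):

```shell
# Return 0 (true) if the given OpenSSL version string falls in the
# range affected by CVE-2021-3450 (1.1.1h through 1.1.1j).
vuln_3450() {
  case "$1" in
    1.1.1h|1.1.1i|1.1.1j) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: check the locally installed version.
ver=$(openssl version | awk '{print $2}')
if vuln_3450 "$ver"; then
  echo "$ver: potentially affected by CVE-2021-3450"
else
  echo "$ver: not in the affected range"
fi
```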
Mar 26 2021
I cannot find the minion in /etc/salt/roster.
Servers that haven't been rebooted, except for db1[1-3] / cloud[3-5] / mon2 / ns[12]:
- dbbackup1
- dbbackup2
- mem1
- mem2
Fixed.
RamNode is short on capacity, so we can't resize bacula yet. I hope we can resize the server next week.
13:08:41 <+SPF|Cloud> first, the nginx config points to /etc/ssl/certs/miraheze.wiki.crt, but we have switched to /etc/ssl/localcerts
13:09:17 <+SPF|Cloud> second, the certificate is valid for 'miraheze.wiki', but not phab.miraheze.wiki
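Fixing both findings would look roughly like this in the vhost (a sketch: the exact filenames under /etc/ssl/localcerts are assumptions, and a certificate whose SAN actually covers phab.miraheze.wiki is still required):

```
# Sketch: point NGINX at the localcerts path, using a certificate
# valid for phab.miraheze.wiki (e.g. a *.miraheze.wiki wildcard).
ssl_certificate     /etc/ssl/localcerts/wildcard.miraheze.wiki.crt;
ssl_certificate_key /etc/ssl/private/wildcard.miraheze.wiki.key;
```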
Mar 25 2021
+$5/mo is approved by me, only requires John's approval as the EM of Infrastructure.
A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.
Spoke with @Paladox regarding ATS. Installing and testing ATS on test3 is not ideal, since that server is used for MediaWiki tests. Installing a new server as a testing cache proxy has my support, granted that this cache proxy may not receive the 'allow 80/443 tcp' rules yet for security reasons (we have agreed on a security review beforehand).
19:58:18 <+SPF|Cloud> my advice: reboot all VMs with services that can be depooled and repooled easily, in order to preserve uptime, do it the normal way (adhere to the 5 minutes DNS TTL, depool from varnish, wait until requests have finished, etc)
20:00:49 <+SPF|Cloud> on the critical servers, db1[1-3], cloud[3-5], mon2 and ns[12], restarting syslog-ng / IRC bots is fine, anything else shouldn't be touched (yet)