May 3 2021

Southparkfan added a comment to T6759: Automate the adding of SSL private keys to puppet3.

As discussed; candidate for Goal-2021-Jul-Dec.

May 3 2021, 17:51 · SRE Automation, Goal-2021-Jul-Dec, Infrastructure (SRE), SSL
Southparkfan added a comment to T4302: Deploy Apache Traffic Server.

Discussed; handing over some of the tasks to me (see subtasks), we won't delay this.

May 3 2021, 17:47 · Goal-2021-Jul-Dec, Infrastructure (SRE)
Southparkfan lowered the priority of T6839: Upgrade puppet to puppet 7 from Normal to Low.

Puppet 6 is EOL in December 2022, no need to rush this. Scheduled for Q4 2021 / Q1-Q2 2022.

May 3 2021, 17:46 · Puppet, Infrastructure (SRE)
Southparkfan moved T7241: ATS: Deploy healthchecker that depools/repools from Incoming to Long Term on the Infrastructure (SRE) board.
May 3 2021, 17:41 · Infrastructure (SRE)
Southparkfan moved T7240: ATS: Review security from Incoming to Long Term on the Infrastructure (SRE) board.
May 3 2021, 17:41 · Infrastructure (SRE)
Southparkfan moved T7239: ATS: Review performance from Incoming to Long Term on the Infrastructure (SRE) board.
May 3 2021, 17:41 · Infrastructure (SRE)
Southparkfan lowered the priority of T7240: ATS: Review security from Normal to Low.

Until the configuration has been (mostly) synced with Varnish's.

May 3 2021, 17:41 · Infrastructure (SRE)
Southparkfan lowered the priority of T7239: ATS: Review performance from Normal to Low.

Until the configuration has been (mostly) synced with Varnish's.

May 3 2021, 17:41 · Infrastructure (SRE)
Southparkfan added a comment to T4425: Fix all mysql tables that are using latin rather than binary.

Discussed; paladox will contact Wikimedia DBAs.

May 3 2021, 17:34 · Database, Infrastructure (SRE)
Southparkfan moved T7230: High I/O on cloud nodes affecting GlusterFS from Incoming to Short Term on the Infrastructure (SRE) board.
May 3 2021, 17:28 · Infrastructure (SRE), Cloud Infrastructure, MediaWiki (SRE), Performance
Southparkfan assigned T7230: High I/O on cloud nodes affecting GlusterFS to Paladox.
May 3 2021, 17:28 · Infrastructure (SRE), Cloud Infrastructure, MediaWiki (SRE), Performance
Southparkfan added a member for Infrastructure (SRE): Southparkfan.
May 3 2021, 17:22
Southparkfan removed a member for Infrastructure (SRE): John.
May 3 2021, 17:22
Southparkfan added a comment to T7238: Removal of access for John.

+ Phabricator admin?

May 3 2021, 17:06 · Site Reliability Engineering

May 2 2021

Southparkfan added a comment to T7230: High I/O on cloud nodes affecting GlusterFS.

@RhinosF1 The task is for the Infrastructure team now, but JohnLewis couldn't have known that in the first place.

May 2 2021, 22:45 · Infrastructure (SRE), Cloud Infrastructure, MediaWiki (SRE), Performance
Southparkfan added a comment to T7230: High I/O on cloud nodes affecting GlusterFS.

Scheduler for sda and sdb: [mq-deadline] none

May 2 2021, 22:32 · Infrastructure (SRE), Cloud Infrastructure, MediaWiki (SRE), Performance
Southparkfan added a comment to T7230: High I/O on cloud nodes affecting GlusterFS.

Facts:

May 2 2021, 22:24 · Infrastructure (SRE), Cloud Infrastructure, MediaWiki (SRE), Performance
Southparkfan renamed T7230: High I/O on cloud nodes affecting GlusterFS from Load times high enough to cause depool to High I/O on cloud nodes affecting GlusterFS.
May 2 2021, 21:58 · Infrastructure (SRE), Cloud Infrastructure, MediaWiki (SRE), Performance
Southparkfan added projects to T7230: High I/O on cloud nodes affecting GlusterFS: Cloud Infrastructure, Infrastructure (SRE).
May 2 2021, 21:57 · Infrastructure (SRE), Cloud Infrastructure, MediaWiki (SRE), Performance
Southparkfan lowered the priority of T7230: High I/O on cloud nodes affecting GlusterFS from Unbreak Now! to High.

@Paladox and I are investigating the possibility of the /etc/cron.d/mdadm check (on cloud nodes only) being the cause of high I/O.

May 2 2021, 21:56 · Infrastructure (SRE), Cloud Infrastructure, MediaWiki (SRE), Performance
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

Running on db1{2,3,4} simultaneously:

mydumper -G -E -R -v 3 -t 2 -c -L "/home/dbcopy/dbbackup1-mnt/$(date +"%Y%m%d%H%M%S").log"

EDIT: trying again with --trx-consistency-only

May 2 2021, 18:39 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Dmehus awarded T7224: Uncompressed puppetserver json logs fill up disk a Like token.
May 2 2021, 14:24 · Puppet, Infrastructure (SRE)

May 1 2021

Southparkfan triaged T7224: Uncompressed puppetserver json logs fill up disk as High priority.
May 1 2021, 17:41 · Puppet, Infrastructure (SRE)

Apr 30 2021

Southparkfan added a comment to T7117: Upgrade to MediaWiki 1.36.0.

@Southparkfan: Can you advise what the best process for taking a backup of test3 DB prior to running the sql & maint scripts is?

You can see the proposed commands for extensions and core file list at P403

Assuming test3wiki can survive read-only mode / performance (database locking) issues for a few minutes, a mysqldump --single-transaction by @Reception123 (on <the database server hosting test3wiki>) is good enough.

Apr 30 2021, 11:06 · Universal Omega, MediaWiki (SRE), MediaWiki

Apr 29 2021

Southparkfan updated Southparkfan.
Apr 29 2021, 22:45
Southparkfan edited Description on Goal-2021-Jul-Dec.
Apr 29 2021, 18:25
Southparkfan added a hashtag to Goal-2021-Jul-Dec: #goal-2021-jul-dec.
Apr 29 2021, 18:24

Apr 26 2021

Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

Other tests required:

  • A test with the following settings: 1) -t 4 (true core count of each virtual machine) 2) --triggers --events --routines
  • Another test, but with -t 2 (to lessen server load)
  • What happens to performance if we backup three masters simultaneously? (reason: to maximise backup consistency)
Apr 26 2021, 21:38 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.
In T5877#142347, @John wrote:

@Southparkfan updates on the above?

Sorry for the lack of response. Still working on this: 16:36:25 <+SPF|Cloud> !log https://phabricator.miraheze.org/T5877#140588: run test backup on db11 with six threads. I stopped the backup from T5877#141278 mid-way by accident.

Command: mydumper -t 6 -v 3 -c --trx-consistency-only
Start: 2021-04-24 14:36 UTC
End: 2021-04-26 04:39 UTC (38 hours)
Backup size: 14 GB
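
As a quick cross-check of the figures above, the elapsed time can be recomputed from the start and end timestamps. This is only an illustrative sketch; the variable names are made up for the example.

```python
from datetime import datetime, timezone

# Timestamps copied from the test run above (both in UTC).
start = datetime(2021, 4, 24, 14, 36, tzinfo=timezone.utc)
end = datetime(2021, 4, 26, 4, 39, tzinfo=timezone.utc)

elapsed = end - start
hours = elapsed.total_seconds() / 3600

print(f"elapsed: {elapsed} (~{hours:.2f} hours)")
```

The result is just over 38 hours, which matches the rounded figure quoted with the end time.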

Apr 26 2021, 21:08 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec

Apr 25 2021

Southparkfan closed T6900: Create draft of Data Processing Inventory as Resolved.
Apr 25 2021, 12:08 · Site Reliability Engineering
Southparkfan closed T6984: High load on dbbackup servers, a subtask of T5877: Revise MariaDB backup strategy, as Invalid.
Apr 25 2021, 12:08 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan closed T6984: High load on dbbackup servers as Invalid.

This won't be an issue anymore.

Apr 25 2021, 12:08 · Database, Monitoring, Infrastructure (SRE)

Apr 24 2021

Southparkfan added a comment to T5877: Revise MariaDB backup strategy.
In T5877#142347, @John wrote:

@Southparkfan updates on the above?

Apr 24 2021, 14:36 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T4425: Fix all mysql tables that are using latin rather than binary.
In T4425#142254, @John wrote:

@Southparkfan See the above please

Apr 24 2021, 14:20 · Database, Infrastructure (SRE)

Apr 19 2021

Southparkfan closed T7131: Consistent 50x errors as Invalid.

@GodlessRaven Unfortunately, our servers have trouble loading the images of country flags. Please reduce the number of images in Template:Geonav.

Apr 19 2021, 11:16 · Production Error, MediaWiki (SRE)

Apr 14 2021

Southparkfan added a comment to T7131: Consistent 50x errors.

Actually, I think I'm closer now. Top profile entries for the Germany article without parsercache hit:

100.00% 4987.024      1 - main()
 99.22% 4947.946      1 - wfIndexMain
 99.22% 4947.914      1 - MediaWiki::run
 99.21% 4947.640      1 - MediaWiki::main
 96.67% 4820.735      1 - MediaWiki::performRequest
 96.58% 4816.498      1 - MediaWiki::performAction
 96.57% 4816.175      1 - ViewAction::show
 96.42% 4808.691      1 - Article::view
 95.22% 4748.519      1 - PoolCounterWork::execute
 95.22% 4748.501      1 - PoolWorkArticleView::doWork
 94.86% 4730.743    364 - call_user_func
 94.73% 4724.182      1 - MediaWiki\Revision\RenderedRevision::getRevisionParserOutput
 94.73% 4724.173      1 - MediaWiki\Revision\RevisionRenderer::MediaWiki\Revision\{closure}
 94.73% 4724.172      1 - MediaWiki\Revision\RevisionRenderer::combineSlotOutput
 94.73% 4724.165      1 - MediaWiki\Revision\RenderedRevision::getSlotParserOutput
 94.73% 4724.150      1 - MediaWiki\Revision\RenderedRevision::getSlotParserOutputUncached
 94.73% 4724.147      1 - AbstractContent::getParserOutput
 94.50% 4712.590      9 - Parser::parse
 94.39% 4707.457      1 - WikitextContent::fillParserOutput
 92.26% 4601.219      9 - Parser::internalParse
 50.38% 2512.279    442 - Parser::replaceVariables
 50.32% 2509.301    262 - PPFrame_Hash::expand
 49.79% 2483.122     65 - Parser::braceSubstitution
 48.95% 2440.924    241 - PPFrame_Hash::expand@1
 48.88% 2437.498     66 - Parser::braceSubstitution@1
 48.63% 2424.996      2 - PPTemplateFrame_Hash::cachedExpand
 48.56% 2421.826    206 - PPFrame_Hash::expand@2
 48.48% 2417.873     34 - Parser::braceSubstitution@2
 48.46% 2416.850    219 - PPFrame_Hash::expand@3
 48.43% 2415.128    235 - Parser::argSubstitution
 48.38% 2412.822    234 - PPTemplateFrame_Hash::getArgument
 48.30% 2408.795    243 - PPTemplateFrame_Hash::getNumberedArgument
 47.81% 2384.088    156 - PPFrame_Hash::expand@4
 47.80% 2383.565     95 - PPTemplateFrame_Hash::getNamedArgument
 47.75% 2381.453    166 - PPFrame_Hash::expand@5
 47.74% 2380.743    299 - Parser::braceSubstitution@3
 47.73% 2380.329     55 - Parser::argSubstitution@1
 47.72% 2379.760     55 - PPTemplateFrame_Hash::getArgument@1
 46.37% 2312.559    506 - PPFrame_Hash::expand@6
 44.65% 2226.903    250 - Parser::braceSubstitution@4
 44.36% 2212.446    353 - Parser::callParserFunction
 43.84% 2186.357    254 - MediaWiki\Extensions\ParserFunctions\ParserFunctions::switch
 41.80% 2084.648     32 - Parser::handleInternalLinks
 41.80% 2084.477     32 - Parser::handleInternalLinks2
 38.41% 1915.588  72583 - PPFrame_Hash::expand@7
 29.46% 1469.174    257 - Parser::braceSubstitution@5
 28.39% 1415.838   1003 - PPFrame_Hash::expand@8
 28.10% 1401.562    247 - Parser::braceSubstitution@6
 27.84% 1388.267    307 - Parser::callParserFunction@1
 27.75% 1383.777    247 - MediaWiki\Extensions\ParserFunctions\ParserFunctions::switch@1
 27.60% 1376.460   4138 - PPFrame_Hash::expand@9
 26.82% 1337.538    730 - Parser::braceSubstitution@7
 25.18% 1255.556 177103 - MediaWiki\Extensions\ParserFunctions\ParserFunctions::decodeTrimExpand
 24.88% 1240.986    254 - MediaWiki\BadFileLookup::isBadFile
 24.86% 1239.964    508 - RepoGroup::findFile
 24.19% 1206.336   1954 - PPFrame_Hash::expand@10
 21.63% 1078.468    730 - Parser::braceSubstitution@8
 21.26% 1060.474    738 - FileRepo::findFile
 20.30% 1012.231   1857 - WANObjectCache::getWithSetCallback
 20.08% 1001.627    733 - Parser::callParserFunction@2
 19.92% 993.603   1849 - WANObjectCache::fetchOrRegenerate
 19.79% 987.131    730 - MediaWiki\Extensions\ParserFunctions\ParserFunctions::switch@2
 15.57% 776.300    254 - Parser::makeImage
 14.30% 713.180   1356 - Wikimedia\Rdbms\DBConnRef::__call
 14.08% 702.115    801 - Wikimedia\Rdbms\Database::select
 13.42% 669.325    813 - Wikimedia\Rdbms\Database::query
 13.31% 663.683    816 - Wikimedia\Rdbms\Database::executeQuery
 13.29% 662.669    743 - Wikimedia\Rdbms\DBConnRef::selectRow
 13.25% 660.912    744 - Wikimedia\Rdbms\Database::selectRow
 13.17% 656.632    816 - Wikimedia\Rdbms\Database::executeQueryAttempt
 12.99% 647.844 102832 - PPFrame_Hash::expand@11
 12.52% 624.529   1852 - WANObjectCache::get
 12.45% 621.126   1852 - WANObjectCache::getMulti
 12.05% 600.810   1855 - MediumSpecificBagOStuff::getMulti
 11.92% 594.457    816 - Wikimedia\Rdbms\DatabaseMysqli::doQuery
 11.89% 592.824   1855 - MemcachedPhpBagOStuff::doGetMulti
 11.81% 589.080   1855 - MemcachedClient::get_multi
 11.77% 587.036    816 - mysqli::query
 10.96% 546.385   3715 - MemcachedClient::_fgets
 10.72% 534.668   1863 - MemcachedClient::_load_items
 10.71% 534.110   3715 - fgets
  9.54% 475.907    492 - LocalRepo::checkRedirect
  9.05% 451.240   1008 - ForeignAPIRepo::fetchImageQuery
  8.59% 428.321   1008 - ForeignAPIRepo::httpGetCached
  8.37% 417.328    254 - Linker::makeImageLink
  8.12% 404.924    984 - LocalFile::load
  8.11% 404.227    492 - LocalFile::loadFromCache
  7.65% 381.575 381284 - call_user_func@1
  7.64% 381.088    762 - ForeignAPIFile::transform
  7.46% 371.800 380846 - ParserOptions::getOption
  6.70% 333.928    486 - spl_autoload_call
  6.69% 333.546    762 - ForeignAPIRepo::getThumbUrlFromCache
  6.66% 332.034    762 - ForeignAPIRepo::getThumbUrl
  6.35% 316.879    478 - AutoLoader::autoload
  6.17% 307.496    246 - LocalFile::loadFromDB
  5.92% 295.367 177199 - PPNode_Hash_Tree::splitArg
  5.31% 264.929 177199 - PPNode_Hash_Tree::splitRawArg
  5.07% 252.726 185159 - ParserOptions::getMaxPPNodeCount
  4.78% 238.492    246 - section.query-m: SELECT img_name,img_size,img_width,img_height,img_metadata,img_bits,img_media_type,img_major_mime,img_minor_mime,img_timestamp,img_sha1,comment_img_description.comment_text AS `img_description_text`,comment_img_description.comment_data AS `img_description
  4.57% 227.997    247 - Parser::parseLinkParameter
  4.51% 225.003    630 - LinkCache::addLinkObj
  4.51% 224.969    460 - Title::getArticleID
  4.50% 224.621    254 - Linker::processResponsiveImages
  4.39% 218.950    243 - LocalRepo::{closure}
  4.34% 216.456    623 - ParserOutput::addLink
  3.86% 192.463    245 - LinkCache::fetchPageRow
  3.61% 180.236   4960 - PPNode_Hash_Tree::splitRawTemplate
  3.60% 179.284 185159 - ParserOptions::getMaxPPExpandDepth
  3.56% 177.691   6729 - MediaWiki\HookContainer\HookContainer::run
  3.51% 174.833 380846 - ParserOptions::optionUsed
  3.48% 173.330    738 - FileRepo::newFile
  3.26% 162.529      1 - RepoGroup::initialiseRepos
  3.26% 162.513      3 - RepoGroup::newRepo
  3.24% 161.691    243 - section.query-m: SELECT rd_namespace,rd_title FROM `page`,`redirect` WHERE page_namespace = N AND page_title = 'X' AND (rd_from = page_id) LIMIT N 
  3.18% 158.616    245 - section.query-m: SELECT page_id,page_len,page_is_redirect,page_latest,page_restrictions,page_content_model FROM `page` WHERE page_namespace = N AND page_title = 'X' LIMIT N 
  3.00% 149.794    246 - ForeignAPIRepo::newFile
  2.94% 146.832    246 - ForeignAPIFile::newFromTitle
  2.89% 144.363      3 - FileRepo::__construct
  2.89% 144.294      2 - LocalRepo::__construct
  2.62% 130.767      3 - FileBackendGroup::get
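
A listing like the one above is easier to explore once parsed. The sketch below assumes each line has the shape "percent, inclusive time, call count, function name" and that the time column is in milliseconds; neither is confirmed by the source. It ranks a handful of entries copied from the listing by call count. `ProfileEntry` and `parse_profile` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class ProfileEntry(NamedTuple):
    percent: float   # share of total request time
    time_ms: float   # inclusive time (unit assumed to be milliseconds)
    calls: int       # number of calls
    function: str    # function or section name


# Matches lines such as " 24.88% 1240.986    254 - Some\Class::method"
LINE_RE = re.compile(r"^\s*(\d+\.\d+)%\s+(\d+\.\d+)\s+(\d+)\s+-\s+(.+?)\s*$")


def parse_profile(text: str) -> List[ProfileEntry]:
    """Parse profiler lines; lines that do not match are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            pct, ms, calls, func = match.groups()
            entries.append(ProfileEntry(float(pct), float(ms), int(calls), func))
    return entries


# A few lines copied from the listing above (raw string because of backslashes).
SAMPLE = r"""
100.00% 4987.024      1 - main()
 25.18% 1255.556 177103 - MediaWiki\Extensions\ParserFunctions\ParserFunctions::decodeTrimExpand
 24.88% 1240.986    254 - MediaWiki\BadFileLookup::isBadFile
  9.05% 451.240   1008 - ForeignAPIRepo::fetchImageQuery
  7.46% 371.800 380846 - ParserOptions::getOption
"""

entries = parse_profile(SAMPLE)
by_calls = sorted(entries, key=lambda e: e.calls, reverse=True)
for entry in by_calls[:3]:
    per_call = entry.time_ms / entry.calls
    print(f"{entry.calls:>7} calls  {per_call:9.4f} per call  {entry.function}")
```

Sorting by call count separates cheap functions that are called very often from expensive ones that are called rarely, such as the file lookups.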
Apr 14 2021, 00:22 · Production Error, MediaWiki (SRE)
Southparkfan added a comment to T7131: Consistent 50x errors.

Submitting an edit to the Germany article is enough to reproduce: https://socdemwiki.miraheze.org/w/index.php?diff=8321&oldid=8307&rcid=8653

Apr 14 2021, 00:03 · Production Error, MediaWiki (SRE)

Apr 13 2021

Southparkfan added a comment to T7131: Consistent 50x errors.

Weird, because I got a "504 Gateway Time-out" just now (trying to access https://socdemwiki.miraheze.org/wiki/Iceland).

Apr 13 2021, 23:53 · Production Error, MediaWiki (SRE)
Southparkfan added a comment to T7135: Ingest PHP-FPM slowlogs into Graylog.

The slowlog has been re-added locally, hence lowering priority (debugging is possible now). Leaving the task open though:

  1. It would be useful to correlate NGINX upstream timeouts to slowlogs in Graylog -> let syslog-ng read the slowlog, send the data to Graylog
  2. There is no logrotate configuration for the slowlogs, logs may fill up the disk in the future (but this will take years)
Apr 13 2021, 23:50 · Monitoring, MediaWiki (SRE)
Southparkfan lowered the priority of T7135: Ingest PHP-FPM slowlogs into Graylog from High to Normal.

https://github.com/miraheze/puppet/commit/8e399fcf25453535173737d7980ff122a05378ec

Apr 13 2021, 23:41 · Monitoring, MediaWiki (SRE)
Southparkfan updated the task description for T7135: Ingest PHP-FPM slowlogs into Graylog.
Apr 13 2021, 23:38 · Monitoring, MediaWiki (SRE)
Southparkfan triaged T7135: Ingest PHP-FPM slowlogs into Graylog as High priority.
Apr 13 2021, 23:38 · Monitoring, MediaWiki (SRE)
Southparkfan created T7134: Puppet cannot remount GlusterFS mount if directory exists.
Apr 13 2021, 23:30 · Puppet, Infrastructure (SRE)
Southparkfan added projects to T7131: Consistent 50x errors: MediaWiki (SRE), Production Error.

@GodlessRaven's first request:

*   << BeReq    >> 97781317
-   Begin          bereq 97781316 pass
-   Timestamp      Start: 1618257674.184309 0.000000 0.000000
-   BereqMethod    POST
-   BereqURL       /w/index.php?title=Germany&action=submit
-   BereqProtocol  HTTP/1.1
-   BereqHeader    Host: socdemwiki.miraheze.org
<REDACTED>
-   BereqHeader    sec-fetch-dest: document
-   BereqHeader    referer: https://socdemwiki.miraheze.org/w/index.php?title=Germany&action=submit
<REDACTED>
-   BereqHeader    X-Device: desktop
-   BereqHeader    X-Use-Mobile: 0
-   BereqHeader    X-Varnish: 97781317
-   VCL_call       BACKEND_FETCH
-   VCL_return     fetch
-   BackendOpen    89 mw11 127.0.0.1 8088 127.0.0.1 60006
-   BackendStart   127.0.0.1 8088
-   Timestamp      Bereq: 1618257674.184397 0.000087 0.000087
-   FetchError     HTC idle (3)
-   BackendClose   89 mw11
-   Timestamp      Beresp: 1618257794.150665 119.966356 119.966269
-   Timestamp      Error: 1618257794.150675 119.966366 0.000010
-   BerespProtocol HTTP/1.1
-   BerespStatus   503
-   BerespReason   Service Unavailable
-   BerespReason   Backend fetch failed
-   BerespHeader   Date: Mon, 12 Apr 2021 20:03:14 GMT
-   BerespHeader   Server: Varnish
-   VCL_call       BACKEND_ERROR
-   BerespHeader   Content-Type: text/html; charset=utf-8
-   VCL_return     deliver
-   Storage        malloc Transient
-   Length         2934
-   BereqAcct      1981 7270 9251 0 0 0
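
The Timestamp records in the log above show how long Varnish waited for the backend. Each record carries the absolute time, the time since the start of the request, and the time since the previous timestamp. The sketch below pulls out the four Timestamp lines and prints the wait before the fetch failed; `parse_timestamps` is a hypothetical helper, not part of any Varnish tooling.

```python
from datetime import datetime, timezone


def parse_timestamps(log_text):
    """Return {label: (absolute, since_start, since_last)} for Timestamp records."""
    result = {}
    for line in log_text.splitlines():
        fields = line.split()
        # Expected shape: "-", "Timestamp", "Label:", absolute, since_start, since_last
        if len(fields) == 6 and fields[1] == "Timestamp":
            label = fields[2].rstrip(":")
            result[label] = tuple(float(value) for value in fields[3:])
    return result


# Timestamp lines copied from the log above.
SAMPLE = """\
-   Timestamp      Start: 1618257674.184309 0.000000 0.000000
-   Timestamp      Bereq: 1618257674.184397 0.000087 0.000087
-   Timestamp      Beresp: 1618257794.150665 119.966356 119.966269
-   Timestamp      Error: 1618257794.150675 119.966366 0.000010
"""

stamps = parse_timestamps(SAMPLE)
waited = stamps["Beresp"][2]  # seconds between sending the request and giving up
failed_at = datetime.fromtimestamp(stamps["Error"][0], tz=timezone.utc)

print(f"backend wait: {waited:.1f} s, failed at {failed_at:%Y-%m-%d %H:%M:%S} UTC")
```

The wait of roughly 120 seconds, together with the "HTC idle" fetch error, is consistent with a backend timeout being hit, although the configured timeout itself is not shown in the log.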
Apr 13 2021, 23:12 · Production Error, MediaWiki (SRE)

Apr 9 2021

Southparkfan updated subscribers of T5877: Revise MariaDB backup strategy.

Running dump from db11 to dbbackup1:/srv/backups/db11. @Paladox and I are around to monitor.

Apr 9 2021, 22:21 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec

Apr 6 2021

Reception123 awarded Blog Post: An interview with Co-Founder Ferran Tufan a Mountain of Wealth token.
Apr 6 2021, 11:24 · Site Reliability Engineering
Southparkfan published Blog Post: An interview with Co-Founder Ferran Tufan.
Apr 6 2021, 10:31 · Site Reliability Engineering

Apr 5 2021

Southparkfan added a reverting change for rPUPCbb1ec901f84f: Revert "Varnish: block Googlebot requests with specific parameter": rPUPC0b1a5da44601: Revert "Revert "Varnish: block Googlebot requests with specific parameter"".
Apr 5 2021, 15:36
Southparkfan committed rPUPC0b1a5da44601: Revert "Revert "Varnish: block Googlebot requests with specific parameter"".
Revert "Revert "Varnish: block Googlebot requests with specific parameter""
Apr 5 2021, 15:36
Southparkfan added a reverting change for rPUPCc32785b0fa6f: Varnish: block Googlebot requests with specific parameter: rPUPCbb1ec901f84f: Revert "Varnish: block Googlebot requests with specific parameter".
Apr 5 2021, 15:20
Southparkfan committed rPUPCbb1ec901f84f: Revert "Varnish: block Googlebot requests with specific parameter".
Revert "Varnish: block Googlebot requests with specific parameter"
Apr 5 2021, 15:20

Apr 4 2021

Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

New performance test (using sshfs setup, 4 mydumper threads):

  • Uncompressed: 290 seconds
  • Compressed: 210 seconds
Apr 4 2021, 22:07 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

For reference: mydumper is superior to mysqldump due to its better performance (using multiple threads) and the flexibility (PCRE based table inclusion/exclusion) in conjunction with transaction consistency and (almost) no locking (no read-only time required during backups). However, mydumper does not support TLS in connections, so dumping must happen at the database master.

Apr 4 2021, 21:37 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan raised the priority of T6900: Create draft of Data Processing Inventory from Low to Normal.
Apr 4 2021, 20:35 · Site Reliability Engineering

Apr 2 2021

Southparkfan added a comment to T7087: Add (rolling average) response time to grafana.

We have the blackbox exporter for this. Can we help you by monitoring specific URLs?

As mentioned in the task, /healthcheck is the biggest one because it has an effect on uptime if that gets too high.

/healthcheck = Meta's Main Page. We're already monitoring that.

I would recommend we do one that loads quite a few resources (eg. Images, javascript etc)

The blackbox exporter does not monitor subsequent requests, such as resources (images?) used on an article. We can monitor that though, but you'll need to provide specific URLs. :)

Apr 2 2021, 21:50 · MediaWiki (SRE), MediaWiki, Monitoring
Southparkfan added a comment to T7087: Add (rolling average) response time to grafana.

We have the blackbox exporter for this. Can we help you by monitoring specific URLs?

Apr 2 2021, 21:45 · MediaWiki (SRE), MediaWiki, Monitoring

Apr 1 2021

Southparkfan added a comment to T7073: Install prometheus-es-exporter for prometheus <-> graylog integration.
In T7073#140060, @John wrote:

Is there a use case for this that the ES data source wouldn’t fulfil? Is this the approach MediaWiki (SRE) wish to take? If so this would fall under the MW team to implement as part of their task as without a use case for Infra, what’s the point in implementing something unused?

There are more use cases than MediaWiki only. For example, I would like to monitor SSH authentication attempts and access logs of non-MediaWiki services, which is a task for us, not for the MediaWiki team. The proof of concept above was tailored for MediaWiki logs, because said logs have a higher priority.

Apr 1 2021, 00:12 · Universal Omega, MediaWiki (SRE), Monitoring

Mar 31 2021

Southparkfan added a comment to T7073: Install prometheus-es-exporter for prometheus <-> graylog integration.

Proof of concept:
/etc/prometheus-es-exporter/mediawiki.cfg:

[query_log_mediawiki]
QueryIntervalSecs = 900
QueryIndices = <graylog_deflector>
QueryJson = {
    "size": 0,
    "track_total_hits": true,
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "application_name": "mediawiki"
                    }
                }
            ],
            "filter": [
                {
                    "range": {
                        "timestamp": { "gte": "now-15m", "lte": "now" }
                    }
                }
            ]
        }
    },
    "aggs": {
        "mediawiki-channels": {
            "terms": {
                "field": "mediawiki_channel"
            }
        }
    }
    }
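
A config like the one above can be sanity-checked before deployment. The sketch below assumes the file is read by an INI parser that behaves like Python's `configparser`, where indented lines continue the previous value; that is an assumption about the exporter, not something stated here. It uses a condensed copy of the query and checks that the JSON parses and that the 15-minute range matches the 900-second interval.

```python
import configparser
import json
import re

# Condensed copy of the proof-of-concept config; <graylog_deflector> is the
# placeholder from the original and is left as-is.
CONFIG = """\
[query_log_mediawiki]
QueryIntervalSecs = 900
QueryIndices = <graylog_deflector>
QueryJson = {
    "size": 0,
    "track_total_hits": true,
    "query": {
        "bool": {
            "must": [{"match": {"application_name": "mediawiki"}}],
            "filter": [{"range": {"timestamp": {"gte": "now-15m", "lte": "now"}}}]
        }
    },
    "aggs": {
        "mediawiki-channels": {"terms": {"field": "mediawiki_channel"}}
    }
    }
"""

# Interpolation is disabled so that a literal "%" in a query could not break parsing.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG)
section = parser["query_log_mediawiki"]

interval_secs = section.getint("QueryIntervalSecs")
query = json.loads(section["QueryJson"])  # raises ValueError if the JSON is malformed

# Extract the "now-15m" lower bound and convert it to seconds.
lower_bound = query["query"]["bool"]["filter"][0]["range"]["timestamp"]["gte"]
window_secs = int(re.fullmatch(r"now-(\d+)m", lower_bound).group(1)) * 60

agg_field = query["aggs"]["mediawiki-channels"]["terms"]["field"]
print(f"interval={interval_secs}s window={window_secs}s aggregating on {agg_field}")
```

Keeping the query window equal to the query interval means each run covers exactly the period since the previous run, so log entries are neither skipped nor counted twice.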
Mar 31 2021, 23:56 · Universal Omega, MediaWiki (SRE), Monitoring
Southparkfan triaged T7073: Install prometheus-es-exporter for prometheus <-> graylog integration as Normal priority.
Mar 31 2021, 23:01 · Universal Omega, MediaWiki (SRE), Monitoring
Dmehus awarded T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc' a Like token.
Mar 31 2021, 18:05 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan closed T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc' as Resolved.

afbeelding.png (290×1 px, 11 KB)

Mar 31 2021, 17:54 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan added a comment to T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc'.

The rebuild should finish in 10-30 minutes. If the errors are gone after the rebuild, you can close this task. If not, assistance will be needed.

Mar 31 2021, 17:44 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan added a comment to T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc'.

Already did that for 'en', but without luck. Started the rebuild in a screen now.

Mar 31 2021, 17:43 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan updated the task description for T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc'.
Mar 31 2021, 17:16 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan triaged T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc' as Unbreak Now! priority.
Mar 31 2021, 17:16 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

More testing is required to determine the final backup sizes.

Mar 31 2021, 15:10 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.

Mar 31 2021, 14:27 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan updated the task description for T7067: Subscribe SRE to OpenCVE for notifications.
Mar 31 2021, 13:16 · Security, Site Reliability Engineering
Southparkfan added a comment to T7067: Subscribe SRE to OpenCVE for notifications.

Just noting that it's been decided to discontinue SRE duty due to the new team system and it didn't seem to be functioning anymore. The dashboard and links we've compiled have still been kept though as they're useful.

Mar 31 2021, 13:16 · Security, Site Reliability Engineering

Mar 30 2021

Southparkfan updated the task description for T7067: Subscribe SRE to OpenCVE for notifications.
Mar 30 2021, 21:58 · Security, Site Reliability Engineering
Southparkfan updated subscribers of T7067: Subscribe SRE to OpenCVE for notifications.
Mar 30 2021, 21:53 · Security, Site Reliability Engineering
Southparkfan moved T7067: Subscribe SRE to OpenCVE for notifications from Radar to Discussion on the Site Reliability Engineering board.
Mar 30 2021, 21:52 · Security, Site Reliability Engineering
Southparkfan triaged T7067: Subscribe SRE to OpenCVE for notifications as Normal priority.
Mar 30 2021, 21:52 · Security, Site Reliability Engineering

Mar 29 2021

Southparkfan added a comment to T4302: Deploy Apache Traffic Server.

In order to do proper backend verification of the certificate (CN), we have tested using ENFORCE. However, the Host header from the client (e.g. allthetropes.org) is used for the CN check at the backend. Therefore, the allthetropes.org certificate would still be mandatory at the backend, even though I would prefer to remove all certificates (including our wildcard one) from the MediaWiki servers, except for one covering a single domain (such as ats-internal.miraheze.wiki).

Mar 29 2021, 00:46 · Goal-2021-Jul-Dec, Infrastructure (SRE)

Mar 28 2021

Southparkfan changed the status of T6984: High load on dbbackup servers, a subtask of T5877: Revise MariaDB backup strategy, from Open to Stalled.
Mar 28 2021, 22:44 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan changed the status of T6984: High load on dbbackup servers from Open to Stalled.

The future of these servers depends on the outcome of testing regarding T5877#139273.

Mar 28 2021, 22:44 · Database, Monitoring, Infrastructure (SRE)
Southparkfan changed the edit policy for T7033: Restart services running on older openssl binaries.
Mar 28 2021, 22:40 · Infrastructure (SRE), Security
Southparkfan changed the visibility for T7033: Restart services running on older openssl binaries.
Mar 28 2021, 22:40 · Infrastructure (SRE), Security
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

@Southparkfan should we make this task public viewable?

Mar 28 2021, 22:40 · Infrastructure (SRE), Security
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

I recommend rebooting the dbbackup servers. They may or may not be affected by CVE-2021-3450, but as long as these servers are rebooted gracefully, we can survive without them for a few minutes.

Mar 28 2021, 21:32 · Infrastructure (SRE), Security

Mar 26 2021

Southparkfan added a comment to T7042: salt-ssh broken due to unknown minion.

I cannot find the minion in /etc/salt/roster.

Mar 26 2021, 12:37 · Infrastructure (SRE)
Southparkfan updated the task description for T7042: salt-ssh broken due to unknown minion.
Mar 26 2021, 12:33 · Infrastructure (SRE)
Southparkfan triaged T7042: salt-ssh broken due to unknown minion as High priority.
Mar 26 2021, 12:33 · Infrastructure (SRE)
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

Servers that haven't been rebooted, except for db1[1-3] / cloud[3-5] / mon2 / ns[12]:

  • dbbackup1
  • dbbackup2
  • mem1
  • mem2
Mar 26 2021, 12:31 · Infrastructure (SRE), Security
Southparkfan closed T7041: phab.miraheze.wiki cert expired as Resolved.

Fixed.

Mar 26 2021, 12:24 · MediaWiki (SRE), SSL
Southparkfan reopened T7041: phab.miraheze.wiki cert expired as "Open".
Mar 26 2021, 12:19 · MediaWiki (SRE), SSL
Southparkfan added a comment to T7038: Existing Server Resource Request for bacula2.

RamNode is short on capacity, so we can't resize bacula yet. I hope we can resize the server next week.

Mar 26 2021, 12:14 · Infrastructure (SRE)
Southparkfan added a comment to T7041: phab.miraheze.wiki cert expired.
13:08:41 <+SPF|Cloud> first, the nginx config points to /etc/ssl/certs/miraheze.wiki.crt, but we have switched to /etc/ssl/localcerts 
13:09:17 <+SPF|Cloud> second, the certificate is valid for 'miraheze.wiki', but not phab.miraheze.wiki
Mar 26 2021, 12:09 · MediaWiki (SRE), SSL

Mar 25 2021

Southparkfan added a comment to T7038: Existing Server Resource Request for bacula2.

+$5/mo is approved by me; it only requires John's approval as the EM of Infrastructure.

Mar 25 2021, 22:40 · Infrastructure (SRE)
Southparkfan updated the task description for T7038: Existing Server Resource Request for bacula2.
Mar 25 2021, 22:39 · Infrastructure (SRE)
Southparkfan created T7038: Existing Server Resource Request for bacula2.
Mar 25 2021, 22:39 · Infrastructure (SRE)
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.

Mar 25 2021, 22:08 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T7037: [New] Server Resource Request for ats.

Spoke with @Paladox regarding ATS. Installing and testing ATS on test3 is not ideal, since that server is used for MediaWiki tests. Installing a new server as a testing cache proxy, granted that this cache proxy may not receive the 'allow 80/443 tcp' rules yet due to security reasons (we have agreed on a security review beforehand), has my support.

Mar 25 2021, 21:57 · Infrastructure (SRE)
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.
19:58:18 <+SPF|Cloud> my advice: reboot all VMs with services that can be depooled and repooled easily, in order to preserve uptime, do it the normal way (adhere to the 5 minutes DNS TTL, depool from varnish, wait until requests have finished, etc)
20:00:49 <+SPF|Cloud> on the critical servers, db1[1-3], cloud[3-5], mon2 and ns[12], restarting syslog-ng / IRC bots is fine, anything else shouldn't be touched (yet)
Mar 25 2021, 19:05 · Infrastructure (SRE), Security
Southparkfan updated the task description for T7033: Restart services running on older openssl binaries.
Mar 25 2021, 18:50 · Infrastructure (SRE), Security
Southparkfan created T7033: Restart services running on older openssl binaries.
Mar 25 2021, 18:41 · Infrastructure (SRE), Security
Redmin awarded T4005: Execute external commands on MediaWiki servers inside sandboxes a 100 token.
Mar 25 2021, 10:30 · Universal Omega, MediaWiki (SRE), Security, MediaWiki

Mar 24 2021

Southparkfan lowered the priority of T6984: High load on dbbackup servers from High to Normal.
Mar 24 2021, 12:20 · Database, Monitoring, Infrastructure (SRE)

Mar 22 2021

Dmehus awarded T5222: MediaWiki response time can fluctuate due to messages a Like token.
Mar 22 2021, 21:44 · MediaWiki (SRE), MediaWiki

Mar 20 2021

Dmehus awarded T6765: Cache frequently accessed files on MediaWiki servers a Like token.
Mar 20 2021, 16:51 · MediaWiki (SRE), Performance, MediaWiki