
Significant drop in cache hit rate caused all wikis to go down about 2am UK time 24 May 2021
Closed, Declined · Public

Assigned To
Authored By: Dmehus (May 24 2021, 02:03)
Referenced Files
F1454048: afbeelding.png (Jun 9 2021, 21:06)
F1454046: afbeelding.png (Jun 9 2021, 21:06)
F1454040: afbeelding.png (Jun 9 2021, 21:06)
F1454044: afbeelding.png (Jun 9 2021, 21:06)

Description

Problem: Per a report on IRC around this timestamp, users are reporting 502 Bad Gateway and 500 Internal Server Error responses on other wikis as well as on metawiki, where I first reported it. The cause should be investigated, as the problem appears to be system-wide.

Relevant logs from IRC:

<icinga-miraheze> IRC echo bot 
PROBLEM - lcn.zfc.id.lv - reverse DNS on sslhost is WARNING: rDNS WARNING - reverse DNS entry for lcn.zfc.id.lv could not be found
17:31 
RECOVERY - graylog2 Puppet on graylog2 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
17:35 
PROBLEM - mw9 Current Load on mw9 is WARNING: WARNING - load average: 7.60, 6.59, 6.07
17:37 
RECOVERY - mw9 Current Load on mw9 is OK: OK - load average: 5.53, 6.09, 5.94
17:38 
RECOVERY - lcn.zfc.id.lv - reverse DNS on sslhost is OK: rDNS OK - lcn.zfc.id.lv reverse DNS resolves to cp13.miraheze.org
17:46 
PROBLEM - lcn.zfc.id.lv - reverse DNS on sslhost is WARNING: rDNS WARNING - reverse DNS entry for lcn.zfc.id.lv could not be found
18:05 
PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 8.46, 7.03, 5.86
18:07 
RECOVERY - mw11 Current Load on mw11 is OK: OK - load average: 5.40, 6.55, 5.84
18:15 
↔ MirahezeBot nipped out  
18:41 <icinga-miraheze> IRC echo bot 
RECOVERY - lcn.zfc.id.lv - reverse DNS on sslhost is OK: rDNS OK - lcn.zfc.id.lv reverse DNS resolves to cp13.miraheze.org
18:55 
PROBLEM - mw11 Current Load on mw11 is WARNING: WARNING - load average: 6.91, 6.23, 5.66
18:57 
PROBLEM - mw11 Current Load on mw11 is CRITICAL: CRITICAL - load average: 9.18, 6.83, 5.92
18:59 
PROBLEM - mw9 Current Load on mw9 is CRITICAL: CRITICAL - load average: 18.76, 10.95, 7.25
19:00 
PROBLEM - mw10 Current Load on mw10 is CRITICAL: CRITICAL - load average: 14.01, 10.12, 7.37
19:00 
PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 6 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 51.38.69.175/cpweb, 2001:41d0:801:2000::58af/cpweb, 51.222.25.132/cpweb, 2607:5300:205:200::1c30/cpweb
19:00 
PROBLEM - cp12 Stunnel Http for mw8 on cp12 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
19:00 
PROBLEM - mw8 Current Load on mw8 is CRITICAL: CRITICAL - load average: 14.02, 9.53, 6.75
19:00 
PROBLEM - cp12 HTTP 4xx/5xx ERROR Rate on cp12 is CRITICAL: CRITICAL - NGINX Error Rate is 95%
19:00 
PROBLEM - cp13 HTTPS on cp13 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:00 
PROBLEM - cp12 HTTPS on cp12 is CRITICAL: HTTP CRITICAL - No data received from host
19:00 
PROBLEM - cp3 HTTPS on cp3 is CRITICAL: CRITICAL - Cannot make SSL connection.
19:00 
PROBLEM - cp12 Current Load on cp12 is WARNING: WARNING - load average: 1.76, 1.60, 1.22
19:01 
PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 6 datacenters are down: 128.199.139.216/cpweb, 2400:6180:0:d0::403:f001/cpweb, 51.38.69.175/cpweb, 2001:41d0:801:2000::58af/cpweb, 51.222.25.132/cpweb, 2607:5300:205:200::1c30/cpweb
19:01 
PROBLEM - wiki.scrungecraft.gg - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:01 
PROBLEM - planetnomads.speedbomb.de - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:01 
PROBLEM - icinga.bots.miraheze.wiki - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:01 
PROBLEM - rangpurpedia.xyz - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:01 
PROBLEM - data.nonbinary.wiki - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:01 
PROBLEM - wiki.mobilityengineer.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:01 
PROBLEM - wiki.globasa.net - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - mr.gyaanipedia.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - wiki.candela.digital - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 <dmehus> Doug 
!sre 502 Bad Gateway and 500 Internal Server Error on `metawiki`
19:02 <icinga-miraheze> IRC echo bot 
PROBLEM - wiki.rebirthofthenight.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - wiki.yumeka.icu - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - wiki.nowchess.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - egoish.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - cp13 HTTP 4xx/5xx ERROR Rate on cp13 is CRITICAL: CRITICAL - NGINX Error Rate is 96%
19:02 
PROBLEM - christipedia.nl - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - wiki.thedev.gq - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - cp3 HTTP 4xx/5xx ERROR Rate on cp3 is CRITICAL: CRITICAL - NGINX Error Rate is 84%
19:02 
PROBLEM - wiki.rpointer.ml - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - electowiki.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - wiki.insideearth.info - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - erikapedia.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - wiki.shaazzz.ir - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:02 
PROBLEM - cp11 Stunnel Http for mw8 on cp11 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
19:02 
PROBLEM - kkutu.wiki - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - wiki.jacksonheights.nyc - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - cp13 Stunnel Http for mw8 on cp13 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
19:03 
PROBLEM - encyclopediaofastrobiology.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - archive.a2b2.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - wiki.umbranet.mandelbrotgames.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - wiki.astralprojections.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - cp12 Current Load on cp12 is CRITICAL: CRITICAL - load average: 4.85, 2.59, 1.62
19:03 
PROBLEM - dc-multiverse.dcwikis.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - programming.red - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - www.zenbuddhism.info - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - www.publictestwiki.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - bebaskanpengetahuan.id - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - puardian.xyz - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - mw8 MediaWiki Rendering on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:03 
PROBLEM - mw8 MediaWiki Rendering on mw8 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:03 
PROBLEM - wiki.thefactoryhka.com.pa - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - www.permanentfuturelab.wiki - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - wiki.make717.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - es.nocyclo.tk - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - wiki.usagihime.ml - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - www.rangpurpedia.xyz - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - cp10 Stunnel Http for mw8 on cp10 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
19:03 
PROBLEM - animatedfeet.miraheze.org - Sectigo on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - wiki.meregos.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - wiki.mikrodev.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - wiki.jumpbound.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - wiki.helioss.co - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:03 
PROBLEM - tensegritywiki.com - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.
19:04 
PROBLEM - wiki.conworlds.org - LetsEncrypt on sslhost is CRITICAL: CRITICAL - Cannot make SSL connection.

Event Timeline

Dmehus triaged this task as Unbreak Now! priority. May 24 2021, 02:03
Dmehus created this task.

@Cocopuff2018 has also reported that all wikis are down now.

I can access metawiki intermittently now, but it's hit or miss, and reportedly other wikis are inaccessible for some users.

Dmehus renamed this task from 502 Bad Gateway and 500 Internal Server Error on metawiki to 502 Bad Gateway and 500 Internal Server Error on metawiki (and all wikis). May 24 2021, 02:22

We seem to be slowly coming back online, with users now able to access all or most wikis, but load times remain significantly elevated, though they do seem to be improving.

At minimum, we should:

  1. Confirm the problem no longer exists;
  2. Investigate the likely cause(s); and
  3. Prepare an incident report.
Unknown Object (User) lowered the priority of this task from Unbreak Now! to High. Edited May 24 2021, 02:58

Since the error seems to have resolved itself, I'm lowering this to High for investigation. I do want to know what caused this, though: even Matomo seems to have gone down for 4 minutes, and the outage duration for MediaWiki appears to have been about 14 minutes. Was the issue SSL- or DNS-related?

Unknown Object (User) renamed this task from 502 Bad Gateway and 500 Internal Server Error on metawiki (and all wikis) to 502 Bad Gateway or 500 Internal Server Error on all wikis. May 24 2021, 03:02

Did this happen only on custom domains, per the relevant logs from IRC?

In T7352#146390, @Anton wrote:

Did this happen only on custom domains, per the relevant logs from IRC?

Because I don't see this error on the miraheze.org domain.

No, custom domains just have more checks.

RhinosF1 renamed this task from 502 Bad Gateway or 500 Internal Server Error on all wikis to Significant drop in cache hit rate caused all wikis to go down about 2am UK time 24 May 2021. May 24 2021, 06:19

It was interrupted again for 10 seconds with a 502 error.

6:22 AM UTC, 24 May 2021

With the help of several people, I've managed to work out that Varnish logs aren't in Graylog and there's no easy way to look up past data.

With the help of several people, I've managed to work out that Varnish logs aren't in Graylog and there's no easy way to look up past data.

T5044

That graph shows the drop in caching rate aligns almost perfectly with the drop in requests and the increase in cache expirations. It looks like you've diagnosed a symptom rather than a cause, as a drop in cache hit rate would be expected if you drop all your cacheable requests.
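To illustrate that point with purely hypothetical numbers (an illustrative sketch, not measured data): if cacheable traffic collapses while the uncacheable share stays put, the overall hit rate falls even though the cache behaves exactly as before.

# Hypothetical request counts, for illustration only (not measured values).
# Assume cacheable requests are served with a 90% hit rate and uncacheable
# requests always miss.
cacheable_before, uncacheable = 900, 100
hits_before = int(cacheable_before * 0.9)                      # 810
print(hits_before / (cacheable_before + uncacheable))          # 0.81

# If most cacheable traffic stops arriving, the remaining mix is dominated
# by uncacheable requests, so the overall hit rate drops sharply.
cacheable_after = 100
hits_after = int(cacheable_after * 0.9)                        # 90
print(hits_after / (cacheable_after + uncacheable))            # 0.45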

Thanks @John, we're back to square one then on debugging it.

In T7352#146390, @Anton wrote:

Did this happen only on custom domains, per the relevant logs from IRC?

It did affect wikis on *.miraheze.org.

I need to debug this.

@Southparkfan Thank you, and thank you for claiming this. :)

There are multiple high-priority tasks now, so this will have to wait until more time is available.

Still blocked on internal issues. Backed up various logs from cp13 in my home dir.

According to https://wm-bot.wmcloud.org/browser/index.php?start=05%2F24%2F2021&end=05%2F24%2F2021&display=%23miraheze-sre, the issues started around 02:00 UTC.

It looks like Varnish and NGINX were OOM-killed:

May 24 01:58:17 cp13 kernel: [208254.943992] nginx invoked oom-killer: gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null), order=0, oom_score_adj=0
May 24 01:58:17 cp13 kernel: [208255.144671] Out of memory: Kill process 1306 (cache-main) score 412 or sacrifice child
May 24 02:05:21 cp13 kernel: [208678.856078] nginx invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
May 24 02:05:21 cp13 kernel: [208679.100701] oom_reaper: reaped process 8870 (cache-main), now anon-rss:0kB, file-rss:0kB, shmem-rss:82104kB
May 24 02:05:24 cp13 kernel: [208682.065205] TCP: out of memory -- consider tuning tcp_mem
May 24 02:06:32 cp13 kernel: [208749.665111] nrpe invoked oom-killer: gfp_mask=0x7080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), nodemask=(null), order=1, oom_score_adj=-500
May 24 02:06:32 cp13 kernel: [208749.861551] Killed process 5254 (nginx) total-vm:351804kB, anon-rss:289156kB, file-rss:348kB, shmem-rss:20484kB
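
For reference, here is a minimal sketch of how the OOM events could be summarised from a saved kernel log such as the ones backed up from cp13; the file name is an assumption, and this is only one way to do it, not the tooling actually used.

import re

# Minimal sketch: summarise oom-killer activity from a saved kernel log.
# "kern.log" is an assumed file name; point it at whichever copy was backed up.
LOG_PATH = "kern.log"

invoked = re.compile(r"(\S+) invoked oom-killer")
killed = re.compile(r"Kill(?:ed)? process (\d+) \((\S+)\)")

with open(LOG_PATH, errors="replace") as f:
    for line in f:
        if (m := invoked.search(line)):
            print("oom-killer invoked by:", m.group(1))
        if (m := killed.search(line)):
            print("killed pid", m.group(1), "process", m.group(2))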

The graphs suggest the request rate increased dramatically just before the processes were killed:

afbeelding.png (836×1 px, 103 KB)

Normal value for 'cp13.miraheze.org:9113 active': 140-200.
Build-up until the first error messages:

  • 2021-05-24 01:56:45 165
  • 2021-05-24 01:57:00 933
  • 2021-05-24 01:57:30 3022
  • 2021-05-24 01:58:00 4316

Connection tracking:

afbeelding.png (375×905 px, 25 KB)

TCP stats:

afbeelding.png (379×911 px, 42 KB)

TCP errors:

afbeelding.png (533×793 px, 53 KB)

cp12 and cp13 were overwhelmed by the increased connection and request rate. That is not unexpected, given the low resources on these VMs, although we could do better. The access logs are gone, so we cannot determine the exact culprit of this load (T5044).

Proposed actionables:

  • Perform a load test on the cache proxies: which process or resource is the first bottleneck, and can we revisit the resource limits on the cache proxies? (A rough sketch of such a test follows this list.)
  • Move access logs to Graylog (T5044).
  • Depending on the outcome of the first actionable, tune host firewall limits and/or NGINX limits. Limiting traffic at the firewall level is more efficient, but NGINX can do more granular filtering (L7).
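
As a rough starting point for the first actionable, the sketch below drives concurrent requests at a target using only the Python standard library and reports the error rate and p95 latency. The target URL, worker count and request count are placeholder assumptions, and any real test against the cache proxies would need to be coordinated and paired with watching memory, connection and load graphs on the proxy to see which resource gives out first.

import threading
import time
import urllib.request

# Rough load-test sketch. TARGET, WORKERS and REQUESTS_PER_WORKER are
# placeholder assumptions, not agreed test parameters.
TARGET = "https://example.org/"
WORKERS = 20
REQUESTS_PER_WORKER = 50

results = []
lock = threading.Lock()

def worker():
    for _ in range(REQUESTS_PER_WORKER):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(TARGET, timeout=10) as resp:
                status = resp.status
        except Exception:
            status = None  # timeout or connection error
        elapsed = time.monotonic() - start
        with lock:
            results.append((status, elapsed))

threads = [threading.Thread(target=worker) for _ in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

errors = sum(1 for status, _ in results if status is None or status >= 500)
latencies = sorted(elapsed for _, elapsed in results)
p95 = latencies[int(len(latencies) * 0.95)]
print(f"{len(results)} requests, {errors} errors/5xx, p95 latency {p95:.2f}s")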

@Paladox do you agree?

Those actionables sound fine to me.

Out of interest, is there an incident report for this yet?

Southparkfan lowered the priority of this task from High to Normal. Jun 12 2021, 15:20

Those actionables sound fine to me.

Thanks.

In T7352#148824, @John wrote:

Out of interest, is there an incident report for this yet?

Not yet, but a report will be written by me (after E269).

Three weeks with loss of logs for a high-priority issue? This will be an interesting incident report detailing the systematic failures in dealing with this, I'd presume?

Oddly, marking this as declined: the investigation into this task has been insufficient, so no culprit has been identified, and since the task was left open for three months while we regularly cycle all logs after a few weeks, no meaningful investigation can now occur.

No incident report has been completed for this incident either, which is not acceptable for providing accountability and lesson learning.