Several MediaWiki backends are down
Closed, Resolved (Public)

Description

Persistent 503s and 502s on multiple wikis on Miraheze, for the past fifty (50) minutes or so. Universal Omega reports several MediaWiki backends are down.

From IRC:

<dmehus> SRE; persistent 503s and 502s on multiple wikis
19:57 ⇐ darkmatterman450 quit (~darkmatte@user/darkmatterman450) Quit: Connection closed
20:01 <•CosmicAlpha> dmehus: I don't think anyone is around right now. The issue is known but I'm only available on mobile and not much I can do even if I wasn't.
20:02 <icinga-miraheze> PROBLEM - cp30 Current Load on cp30 is WARNING: WARNING - load average: 0.61, 1.09, 1.79
20:04 <•CosmicAlpha> dmehus: oh, I just looked at Icinga, this isn't just a localised outage. It looks like 3 MediaWiki servers are down completely. Have been for 50 minutes. It was 4 servers a little bit ago.
20:05 <icinga-miraheze> RECOVERY - cp30 Current Load on cp30 is OK: OK - load average: 0.82, 1.01, 1.63
20:06 <•CosmicAlpha> paladox: around?
20:08 <•CosmicAlpha> Different mw servers are down from different cache proxies it looks like...
20:10 <•CosmicAlpha> Reception123: ^
20:12 <icinga-miraheze> PROBLEM - cp31 Current Load on cp31 is CRITICAL: CRITICAL - load average: 1.94, 2.11, 1.40
20:12 <•CosmicAlpha> Now ongoing for an hour
20:15 <icinga-miraheze> RECOVERY - cp31 Current Load on cp31 is OK: OK - load average: 0.96, 1.57, 1.32
20:19 <dmehus> CosmicAlpha, yeah
20:19 <dmehus> worthy of an !sre ping?
20:20 <MacFan4000> @Void ^^
20:20 <•CosmicAlpha> It is. Yes.
20:21 <dmehus> Created a task
20:21 <dmehus> Void went offline, MacFan4000
20:21 <dmehus> Likely will need paladox or Reception123 when he's up

Event Timeline

Dmehus triaged this task as Unbreak Now! priority. Jan 24 2022, 04:21
Dmehus created this task.

Five mw server backends appear to be down at this time, and have been for about an hour. The set also seems to differ between cp servers, fluctuating between 2 and 5 backends down at a time on a given server.

This will also need an IR, then.

I've alerted @Owen in a Trust and Safety channel, as he can possibly alert @John via mobile.

Yes 100%

Looking at Grafana, the incident started several hours after the deployment and the service is fine right now, so the two are 100% unrelated.

Universal_Omega lowered the priority of this task from Unbreak Now! to High. Jan 24 2022, 08:00

Lowering to high since it's OK now.

Some 502s and 503s on tuscriaturaswiki.

A recent error message:
Error 503 Backend fetch failed, forwarded for 77.226.103.47, 127.0.0.1
(Varnish XID 374214046) via cp21 at Mon, 24 Jan 2022 12:29:12 GMT.

Several "503 Bad gateway" incidents last hour. After refresh (F5) however the requested wikipage is displayed.

All The Tropes has been returning 502s and 503s consistently for the last four hours or so, and possibly longer (since at least approx. 8:30 AM 24 January EST).

Universal_Omega raised the priority of this task from High to Unbreak Now!. Jan 24 2022, 20:32

This seems to still be ongoing.

This is fully recovered according to icinga.