Miraheze Phabricator Feed

Feb 28 2024

Collei updated the task description for T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Feb 28 2024, 07:05 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei updated the task description for T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Feb 28 2024, 07:04 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Feb 26 2024

Xena added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

A user on Discord has reported it happening again; it's possible the issue wasn't fully resolved.

Feb 26 2024, 17:00 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Sounds good

Feb 26 2024, 05:13 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Dicto added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Hmm, that's weird, but now I don't get Error 500 either when importing pages on gameshows or when editing with the code editor on chernowiki. Looks like the problem is actually resolved.

Feb 26 2024, 04:13 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Visual editor being broken is already tracked in T11903. As for the other issues, can you reproduce this on any wikis other than gameshowswiki?

Feb 26 2024, 01:15 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Feb 25 2024

Dicto added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Still getting Error 500 when trying to import pages on gameshows.miraheze.org. Small XML files import fine, while large ones (around 750 KB) fail.

Feb 25 2024, 23:52 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
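Since only the larger dumps fail, one workaround is to split the export into smaller uploads. The following is a rough sketch, not an official tool; the input filename, chunk size, and export namespace handling are assumptions.

```python
# Rough workaround sketch: split a large MediaWiki XML export into smaller
# chunks so each upload stays under the size that was failing (~750 KB here).
import xml.etree.ElementTree as ET

DUMP = "gameshows-export.xml"      # hypothetical input file
PAGES_PER_CHUNK = 50               # tune so each chunk stays well under the failing size

tree = ET.parse(DUMP)
root = tree.getroot()              # <mediawiki xmlns="...">
ns = root.tag.split("}")[0] + "}"  # keep whatever export namespace the dump uses
ET.register_namespace("", ns[1:-1])  # write the default namespace back out

pages = [el for el in root if el.tag == ns + "page"]
header = [el for el in root if el.tag != ns + "page"]  # e.g. <siteinfo>

for i in range(0, len(pages), PAGES_PER_CHUNK):
    chunk_root = ET.Element(root.tag, root.attrib)
    for el in header + pages[i:i + PAGES_PER_CHUNK]:
        chunk_root.append(el)
    ET.ElementTree(chunk_root).write(
        f"chunk-{i // PAGES_PER_CHUNK:03d}.xml",
        encoding="utf-8", xml_declaration=True,
    )
```

Each chunk keeps the original header elements, so it can be uploaded on its own.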
Agent_Isai closed T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions as Resolved.

Once again purged 13-16G of Varnish logs.

Feb 25 2024, 13:54 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
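The resolution above was purging 13-16G of Varnish logs, so a simple disk-space check of the kind sketched below could flag the problem before it recurs. The log directory and the 90% threshold are assumptions, not Miraheze's actual configuration.

```python
# Illustrative only: warn when the filesystem holding Varnish logs is nearly
# full and list the largest files so an operator can decide what to purge.
import os
import shutil

LOG_DIR = "/var/log/varnish"   # hypothetical location of the Varnish logs
THRESHOLD = 0.90               # alert when the filesystem is 90% full

usage = shutil.disk_usage(LOG_DIR)
if usage.used / usage.total >= THRESHOLD:
    files = sorted(
        (entry for entry in os.scandir(LOG_DIR) if entry.is_file()),
        key=lambda e: e.stat().st_size,
        reverse=True,
    )
    for entry in files[:10]:
        print(f"{entry.stat().st_size / 2**30:.1f} GiB  {entry.path}")
```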
Collei renamed T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions from 500 Internal Server Error - uploading images and editing pages to 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Feb 25 2024, 01:54 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei merged T11900: XML ImportDump feature gives a "500 Internal Server" error into T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Feb 25 2024, 01:53 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

Several Discord users have reported this occurring recently and more frequently.

Feb 25 2024, 01:51 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Feb 24 2024

RhinosF1 edited projects for T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions, added: Infrastructure (SRE); removed MediaWiki (SRE).
Feb 24 2024, 22:47 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
RhinosF1 raised the priority of T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions from High to Unbreak Now!.
Feb 24 2024, 22:46 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei updated the task description for T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Feb 24 2024, 22:23 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei raised the priority of T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions from Normal to High.

Having reviewed the Discord and Phabricator issues needing triage, I think this is probably a larger issue than I first assumed.

Feb 24 2024, 21:26 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei renamed T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions from 500 Internal Server Error - uploading images to 500 Internal Server Error - uploading images and editing pages.
Feb 24 2024, 21:25 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei merged T11892: Can not make changes to page into T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Feb 24 2024, 21:25 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

To be clear, I lowered this to Normal because it only appears to be happening on some wikis and not all of their pages. Most functionality still works. Feel free to change it back if I'm wrong about this triage.

Feb 24 2024, 21:19 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei merged T11899: 500 Internal Server Error into T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.
Feb 24 2024, 21:18 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei added a comment to T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions.

This is occurring again; see T11899.

Feb 24 2024, 21:17 · Infrastructure (SRE), Varnish, MediaWiki, Production Error
Collei reopened T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions as "Open".
Feb 24 2024, 21:17 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Feb 23 2024

Universal_Omega closed T11891: 500 Internal Server Error - uploading images, editing pages, and taking other actions as Resolved.
Feb 23 2024, 20:57 · Infrastructure (SRE), Varnish, MediaWiki, Production Error

Feb 22 2024

Universal_Omega added a comment to T11887: Vanish cannot be executed.

https://github.com/miraheze/MirahezeMagic/pull/469

Feb 22 2024, 04:35 · Configuration, MediaWiki (SRE)
1108-Kiju renamed T11887: Vanish cannot be executed from Vanish cannot be executed to varnish cannot be executed.
Feb 22 2024, 04:31 · Configuration, MediaWiki (SRE)
1108-Kiju triaged T11887: Vanish cannot be executed as Normal priority.
Feb 22 2024, 04:30 · Configuration, MediaWiki (SRE)

Jan 24 2024

Agent_Isai closed T11052: Do something about broken twiter feed as Invalid.

503s no longer display a Twitter feed. They instead link to a static help page on GitHub Pages, which explains what may have happened and links to our social media and status page, so technically this is invalid?

Jan 24 2024, 00:34 · Varnish, Technical-Debt, Design
labster added a comment to T11052: Do something about broken twiter feed.

T&S exists now, and @Agent_Isai is likely the best person to approve what comes next.

Jan 24 2024, 00:28 · Varnish, Technical-Debt, Design

Oct 25 2023

OrangeStar added a comment to T11052: Do something about broken twiter feed.

If the problem is with CSP reviews, I'd argue emfed has a better shot than Facebook.

Oct 25 2023, 14:10 · Varnish, Technical-Debt, Design

Oct 24 2023

MacFan4000 added a comment to T11052: Do something about broken twiter feed.

Replacing it with Mastodon is the easiest route, since you already have that up and running. A quick search brings up https://sampsyo.github.io/emfed/. I could write a PR including emfed from the jsDelivr CDN if wanted.

Oct 24 2023, 22:35 · Varnish, Technical-Debt, Design
OrangeStar added a comment to T11052: Do something about broken twiter feed.

Replacing it with Mastodon is the easiest route, since you already have that up and running. A quick search brings up https://sampsyo.github.io/emfed/. I could write a PR including emfed from the jsDelivr CDN if wanted.

Oct 24 2023, 19:22 · Varnish, Technical-Debt, Design

Sep 11 2023

Paladox closed T11207: Consider adding 'browsing-topics=()' to permission-policy header as Resolved.
Sep 11 2023, 13:11 · Infrastructure (SRE), Varnish, revi
RhinosF1 added projects to T11207: Consider adding 'browsing-topics=()' to permission-policy header: Varnish, Infrastructure (SRE).

Makes sense to me

Sep 11 2023, 11:59 · Infrastructure (SRE), Varnish, revi
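For T11207, a quick way to verify the change is to check that responses carry `browsing-topics=()` in the Permissions-Policy header (the standard name of the header the task refers to). A minimal sketch, assuming the `requests` library and an example URL:

```python
# Verification sketch: confirm browsing-topics is disabled via Permissions-Policy.
import requests

resp = requests.head("https://meta.miraheze.org/", allow_redirects=True, timeout=10)
policy = resp.headers.get("Permissions-Policy", "")
print("Permissions-Policy:", policy or "<not set>")
print("browsing-topics disabled:", "browsing-topics=()" in policy)
```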

Aug 9 2023

PlanToSaveNoWork added a comment to T11133: [trash].

This is the truth: Miraheze is burning like a house and the supposed firefighters are sitting there relaxing, having a coffee, instead of helping.

Aug 9 2023, 17:52 · Trash
PlanToSaveNoWork triaged T11133: [trash] as Unbreak Now! priority.
Aug 9 2023, 17:51 · Trash

Jul 11 2023

MacFan4000 triaged T11052: Do something about broken twiter feed as Normal priority.
Jul 11 2023, 19:04 · Varnish, Technical-Debt, Design

May 19 2023

MacFan4000 removed a member for Varnish: Southparkfan.
May 19 2023, 20:08

Apr 16 2022

RhinosF1 renamed T8983: 23 Mar 2022 DoS from 23 Mar 2022 DDoS to 23 Mar 2022 DoS.
Apr 16 2022, 09:17 · MediaWiki, Infrastructure (SRE), Varnish, Security

Mar 26 2022

John changed the visibility for T8983: 23 Mar 2022 DoS.
Mar 26 2022, 19:05 · MediaWiki, Infrastructure (SRE), Varnish, Security
John closed T8983: 23 Mar 2022 DoS as Resolved.

Spoke with @Paladox and no further action is needed on this task.

Mar 26 2022, 19:05 · MediaWiki, Infrastructure (SRE), Varnish, Security
John moved T8983: 23 Mar 2022 DoS from Incoming to Short Term on the Infrastructure (SRE) board.
Mar 26 2022, 17:14 · MediaWiki, Infrastructure (SRE), Varnish, Security
Reception123 added a comment to T8983: 23 Mar 2022 DoS.

@RhinosF1 Do we still need this task open since the incident has passed?

Mar 26 2022, 08:13 · MediaWiki, Infrastructure (SRE), Varnish, Security

Mar 23 2022

RhinosF1 added a comment to T8983: 23 Mar 2022 DoS.

NCSC are aware

Mar 23 2022, 23:04 · MediaWiki, Infrastructure (SRE), Varnish, Security
RhinosF1 assigned T8983: 23 Mar 2022 DoS to Paladox.

Blocked at the firewall level globally; let's keep an eye on it.

Mar 23 2022, 22:48 · MediaWiki, Infrastructure (SRE), Varnish, Security
RhinosF1 raised the priority of T8983: 23 Mar 2022 DoS from High to Unbreak Now!.
Mar 23 2022, 21:36 · MediaWiki, Infrastructure (SRE), Varnish, Security

Mar 14 2022

RhinosF1 edited projects for T8930: Persistent 503s on multiple wikis, including `metawiki`, added: Database, Infrastructure (SRE); removed MediaWiki (SRE).

00:08:29 <JohnLewis> dmehus: yeah, IO on cloud11's SSDs is pretty high because of piwik db migration

Mar 14 2022, 07:26 · MediaWiki (SRE), MediaWiki
RhinosF1 added a comment to T8930: Persistent 503s on multiple wikis, including `metawiki`.

php-fpm looks to be struggling to keep up again.

Mar 14 2022, 06:54 · MediaWiki (SRE), MediaWiki
RobLa added a comment to T8930: Persistent 503s on multiple wikis, including `metawiki`.

When I tried to reach https://robla.miraheze.org about 20 minutes ago, I received the following error:

Mar 14 2022, 04:30 · MediaWiki (SRE), MediaWiki
Dmehus added a comment to T8930: Persistent 503s on multiple wikis, including `metawiki`.
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
20:19 
PROBLEM - test101 Current Load on test101 is CRITICAL: CRITICAL - load average: 2.09, 2.06, 1.84
20:20 
RECOVERY - matomo101 SSH on matomo101 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0)
20:20 
RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 3.172 seconds response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
20:20 
PROBLEM - db101 Current Load on db101 is CRITICAL: CRITICAL - load average: 8.22, 7.19, 6.98
20:21 
RECOVERY - cp30 Stunnel HTTP for mw101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.312 second response time
20:21 
RECOVERY - cp31 Varnish Backends on cp31 is OK: All 12 backends are healthy
20:21 
PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.07, 1.69, 1.73
20:22 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.24, 6.69, 6.81
20:23 
RECOVERY - cp30 Varnish Backends on cp30 is OK: All 12 backends are healthy
20:23 
RECOVERY - test101 Current Load on test101 is OK: OK - load average: 1.10, 1.50, 1.66
20:25 
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
20:26 <dmehus> Doug 
!sre
20:26 <icinga-miraheze> IRC echo bot 
PROBLEM - matomo101 SSH on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:26 
RECOVERY - db101 Current Load on db101 is OK: OK - load average: 5.97, 6.63, 6.77
20:27 
PROBLEM - cp30 Stunnel HTTP for mw101 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
20:27 
PROBLEM - cp31 Stunnel HTTP for phab121 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
20:27 
RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 1.177 second response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
20:27 
PROBLEM - cp30 Stunnel HTTP for mw111 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:27 
PROBLEM - cp21 Stunnel HTTP for mw122 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:27 
PROBLEM - cp21 Stunnel HTTP for mw111 on cp21 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.011 second response time
20:27 
PROBLEM - cp31 Stunnel HTTP for mw111 on cp31 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.317 second response time
20:27 
PROBLEM - cp20 Stunnel HTTP for mw111 on cp20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.012 second response time
20:27 
PROBLEM - mw111 MediaWiki Rendering on mw111 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 1595 bytes in 0.008 second response time
20:28 <dmehus> Doug 
Can reproduce the above persistently
20:28 <icinga-miraheze> IRC echo bot 
PROBLEM - cp31 Stunnel HTTP for mw122 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
20:28 
PROBLEM - matomo101 conntrack_table_size on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
20:28 
PROBLEM - cp30 Stunnel HTTP for mw122 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds

^ Additional icinga alerts

Mar 14 2022, 03:29 · MediaWiki (SRE), MediaWiki
Dmehus added a comment to T8930: Persistent 503s on multiple wikis, including `metawiki`.
<icinga-miraheze> IRC echo bot 
RECOVERY - mw121 Current Load on mw121 is OK: OK - load average: 6.68, 8.41, 8.47
18:17 
PROBLEM - db112 Current Load on db112 is WARNING: WARNING - load average: 5.19, 5.81, 5.32
18:17 
RECOVERY - mw112 Current Load on mw112 is OK: OK - load average: 6.64, 7.32, 8.32
18:19 
PROBLEM - db112 Current Load on db112 is CRITICAL: CRITICAL - load average: 6.57, 6.00, 5.44
18:19 
RECOVERY - gluster101 Current Load on gluster101 is OK: OK - load average: 3.19, 3.18, 3.16
18:20 
alerting : [FIRING:1] (PHP-FPM Worker Usage High yes mediawiki) https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki
18:20 
RECOVERY - mw111 Current Load on mw111 is OK: OK - load average: 7.34, 7.62, 8.49
18:21 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.84, 8.74, 8.71
18:23 
PROBLEM - db112 Current Load on db112 is WARNING: WARNING - load average: 2.62, 4.79, 5.11
18:24 
PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 8.70, 8.49, 8.68
18:25 
RECOVERY - db112 Current Load on db112 is OK: OK - load average: 2.99, 4.25, 4.87
18:27 
→ darkmatterman450 joined (~darkmatte@user/darkmatterman450)
18:27 <icinga-miraheze> IRC echo bot 
PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 10.21, 9.05, 8.85
18:28 
PROBLEM - mw111 Current Load on mw111 is CRITICAL: CRITICAL - load average: 10.69, 9.69, 9.15
18:29 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 9.93, 9.14, 8.90
18:30 
PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 7.77, 9.17, 9.04
18:30 
PROBLEM - mw121 Current Load on mw121 is WARNING: WARNING - load average: 8.33, 8.92, 8.52
18:31 
PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 10.86, 9.82, 9.19
18:34 
PROBLEM - mw111 Current Load on mw111 is CRITICAL: CRITICAL - load average: 11.45, 10.20, 9.45
18:34 
PROBLEM - mw121 Current Load on mw121 is CRITICAL: CRITICAL - load average: 10.20, 9.44, 8.79
18:36 
PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 9.32, 9.86, 9.42
18:36 
PROBLEM - mw121 Current Load on mw121 is WARNING: WARNING - load average: 7.81, 8.99, 8.72
18:41 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.68, 9.52, 9.53
18:42 
PROBLEM - mw102 Current Load on mw102 is WARNING: WARNING - load average: 8.73, 7.72, 6.97
18:43 
PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 10.80, 10.19, 9.78
18:44 
RECOVERY - mw102 Current Load on mw102 is OK: OK - load average: 7.10, 7.53, 7.00
18:44 
PROBLEM - mw122 Current Load on mw122 is CRITICAL: CRITICAL - load average: 10.86, 8.60, 8.05
18:45 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 9.56, 9.96, 9.76
18:47 
PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 11.72, 10.28, 9.87
18:48 
PROBLEM - mw122 Current Load on mw122 is WARNING: WARNING - load average: 9.66, 9.10, 8.36
18:50 
PROBLEM - cp31 Current Load on cp31 is CRITICAL: CRITICAL - load average: 2.54, 1.96, 1.29
18:51 
PROBLEM - cp30 Current Load on cp30 is WARNING: WARNING - load average: 1.75, 1.65, 1.28
18:52 
PROBLEM - cp30 Stunnel HTTP for matomo101 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:52 
PROBLEM - cp21 Stunnel HTTP for matomo101 on cp21 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 358 bytes in 0.157 second response time
18:52 
PROBLEM - matomo101 Current Load on matomo101 is CRITICAL: CRITICAL - load average: 20.08, 9.20, 4.29
18:52 
PROBLEM - cp31 Stunnel HTTP for matomo101 on cp31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:52 
RECOVERY - mw111 Current Load on mw111 is OK: OK - load average: 7.08, 7.80, 8.46
18:52 
RECOVERY - cp31 Current Load on cp31 is OK: OK - load average: 1.10, 1.63, 1.25
18:53 
PROBLEM - db101 Current Load on db101 is CRITICAL: CRITICAL - load average: 9.95, 7.83, 6.37
18:53 
PROBLEM - matomo101 HTTPS on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:53 
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
18:53 
PROBLEM - cp20 Stunnel HTTP for matomo101 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:53 
RECOVERY - cp30 Current Load on cp30 is OK: OK - load average: 1.21, 1.54, 1.29
18:53 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.99, 9.74, 9.82
18:54 
RECOVERY - mw122 Current Load on mw122 is OK: OK - load average: 5.58, 7.61, 8.01
18:55 
RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 2.725 seconds response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
18:57 
PROBLEM - cp31 Varnish Backends on cp31 is CRITICAL: 1 backends are down. mw111
18:58 
PROBLEM - matomo101 Redis Process on matomo101 is CRITICAL: PROCS CRITICAL: 0 processes with args 'redis-server'
18:58 
PROBLEM - mw102 Current Load on mw102 is WARNING: WARNING - load average: 8.76, 8.09, 7.53
18:58 
RECOVERY - mw121 Current Load on mw121 is OK: OK - load average: 6.47, 7.93, 8.49
18:59 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.51, 7.73, 6.89
18:59 
RECOVERY - cp31 Varnish Backends on cp31 is OK: All 12 backends are healthy
19:00 
RECOVERY - matomo101 Redis Process on matomo101 is OK: PROCS OK: 1 process with args 'redis-server'
19:00 
PROBLEM - matomo101 SSH on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:00 
RECOVERY - mw102 Current Load on mw102 is OK: OK - load average: 8.47, 8.20, 7.64
19:01 
PROBLEM - db101 Current Load on db101 is CRITICAL: CRITICAL - load average: 9.59, 8.57, 7.31
19:01 
PROBLEM - test101 Current Load on test101 is CRITICAL: CRITICAL - load average: 2.07, 1.82, 1.51
19:02 
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
19:03 
PROBLEM - matomo101 NTP time on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:03 
PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.51, 1.73, 1.52
19:03 
PROBLEM - matomo101 Puppet on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:05 
PROBLEM - matomo101 conntrack_table_size on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:05 
RECOVERY - mw112 Current Load on mw112 is OK: OK - load average: 5.63, 6.85, 8.34
19:05 
PROBLEM - gluster101 Current Load on gluster101 is CRITICAL: CRITICAL - load average: 4.93, 4.15, 3.36
19:05 
PROBLEM - test101 Current Load on test101 is CRITICAL: CRITICAL - load average: 2.16, 1.89, 1.60
19:05 
PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 4.66, 3.35, 2.84
19:06 
PROBLEM - cp30 Stunnel HTTP for test101 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:06 
PROBLEM - cp30 Stunnel HTTP for mw121 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:07 
PROBLEM - gluster101 Current Load on gluster101 is WARNING: WARNING - load average: 3.16, 3.74, 3.31
19:07 
PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 3.91, 3.31, 2.87
19:08 
PROBLEM - matomo101 ferm_active on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:08 
RECOVERY - cp30 Stunnel HTTP for test101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14564 bytes in 0.338 second response time
19:08 
RECOVERY - cp30 Stunnel HTTP for mw121 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 0.852 second response time
19:08 
PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 2 datacenters are down: 149.56.140.43/cpweb, 2607:5300:201:3100::929a/cpweb
19:09 
PROBLEM - gluster121 Current Load on gluster121 is CRITICAL: CRITICAL - load average: 4.93, 4.00, 3.10
19:09 
PROBLEM - cp30 Varnish Backends on cp30 is CRITICAL: 7 backends are down. mw101 mw102 mw111 mw112 mw121 mw122 mediawiki
19:09 
PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 2 datacenters are down: 149.56.140.43/cpweb, 2607:5300:201:3100::929a/cpweb
19:09 
PROBLEM - matomo101 Redis Process on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:09 
PROBLEM - gluster101 Current Load on gluster101 is CRITICAL: CRITICAL - load average: 4.99, 4.13, 3.49
19:09 
PROBLEM - matomo101 Disk Space on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:09 
PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.69, 1.87, 1.67
19:09 
PROBLEM - matomo101 php-fpm on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:09 
PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 4.20, 3.56, 3.01
19:10 
RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online
19:11 
RECOVERY - cp30 Varnish Backends on cp30 is OK: All 12 backends are healthy
19:11 
RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
19:11 
PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 3.54, 3.78, 3.18
19:11 
PROBLEM - gluster101 Current Load on gluster101 is WARNING: WARNING - load average: 2.38, 3.59, 3.38
19:11 
RECOVERY - test101 Current Load on test101 is OK: OK - load average: 1.37, 1.70, 1.63
19:11 
RECOVERY - matomo101 Disk Space on matomo101 is OK: DISK OK - free space: / 1205 MB (12% inode=80%);
19:11 
RECOVERY - matomo101 Puppet on matomo101 is OK: OK: Puppet is currently enabled, last run 51 minutes ago with 0 failures
19:11 
RECOVERY - matomo101 php-fpm on matomo101 is OK: PROCS OK: 5 processes with command name 'php-fpm7.4'
19:11 
RECOVERY - matomo101 Redis Process on matomo101 is OK: PROCS OK: 1 process with args 'redis-server'
19:11 
RECOVERY - matomo101 NTP time on matomo101 is OK: NTP OK: Offset -0.005568474531 secs
19:12 
RECOVERY - matomo101 SSH on matomo101 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0)
19:13 
RECOVERY - matomo101 ferm_active on matomo101 is OK: OK ferm input default policy is set
19:13 
RECOVERY - matomo101 conntrack_table_size on matomo101 is OK: OK: nf_conntrack is 0 % full
19:13 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.85, 7.63, 7.55
19:13 
RECOVERY - cp30 Stunnel HTTP for matomo101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 3.464 second response time
19:13 
RECOVERY - cp21 Stunnel HTTP for matomo101 on cp21 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 0.889 second response time
19:13 
RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 0.811 seconds response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
19:13 
PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 5.26, 4.24, 3.41
19:13 
RECOVERY - cp20 Stunnel HTTP for matomo101 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 0.705 second response time
19:13 
PROBLEM - gluster101 Current Load on gluster101 is CRITICAL: CRITICAL - load average: 4.51, 3.95, 3.54
19:13 
RECOVERY - cp31 Stunnel HTTP for matomo101 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 0.663 second response time
19:14 
RECOVERY - matomo101 HTTPS on matomo101 is OK: HTTP OK: HTTP/1.1 200 OK - 66479 bytes in 1.038 second response time
19:14 
PROBLEM - gluster121 Current Load on gluster121 is WARNING: WARNING - load average: 3.53, 3.88, 3.40
19:15 
PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 2.95, 3.69, 3.31
19:15 
ok : [RESOLVED] (PHP-FPM Worker Usage High yes mediawiki) https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki
19:16 
PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 9.25, 8.08, 7.21
19:16 
PROBLEM - gluster121 Current Load on gluster121 is CRITICAL: CRITICAL - load average: 8.47, 4.85, 3.77
19:17 
PROBLEM - gluster101 Current Load on gluster101 is WARNING: WARNING - load average: 3.26, 3.93, 3.66
19:18 
RECOVERY - mw111 Current Load on mw111 is OK: OK - load average: 7.62, 7.88, 7.25
19:19 
PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 4.06, 3.89, 3.45
19:19 
PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.77, 1.64, 1.59
19:20 
PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.11, 8.72, 8.18
19:20 
PROBLEM - mw121 Current Load on mw121 is WARNING: WARNING - load average: 9.69, 8.68, 7.66
19:21 
RECOVERY - db101 Current Load on db101 is OK: OK - load average: 6.29, 6.11, 6.73
19:21 
PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 2.87, 3.62, 3.41
19:21 
alerting : [FIRING:1] (PHP-FPM Worker Usage High yes mediawiki) https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki
19:21 
RECOVERY - test101 Current Load on test101 is OK: OK - load average: 1.21, 1.50, 1.55
19:22 
PROBLEM - cp31 Stunnel HTTP for matomo101 on cp31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:22 
RECOVERY - mw112 Current Load on mw112 is OK: OK - load average: 6.77, 8.29, 8.10
19:22 
RECOVERY - mw121 Current Load on mw121 is OK: OK - load average: 5.47, 7.77, 7.48
19:23 
RECOVERY - gluster111 Current Load on gluster111 is OK: OK - load average: 2.23, 3.29, 3.32
19:23 
PROBLEM - cp30 Stunnel HTTP for matomo101 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:23 
PROBLEM - cp20 Stunnel HTTP for matomo101 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:23 
PROBLEM - matomo101 HTTPS on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:23 
PROBLEM - cp21 Stunnel HTTP for matomo101 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:24 
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
19:24 
PROBLEM - gluster121 Current Load on gluster121 is WARNING: WARNING - load average: 1.87, 3.56, 3.75
19:24 
PROBLEM - matomo101 SSH on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:24 
PROBLEM - cp30 Stunnel HTTP for mail121 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:24 
PROBLEM - cp31 Stunnel HTTP for mon111 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:25 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 7.20, 6.62, 6.79
19:25 
PROBLEM - cp20 Stunnel HTTP for mw111 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 
PROBLEM - cp21 Stunnel HTTP for mw121 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 
PROBLEM - cp20 Stunnel HTTP for mw121 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 
PROBLEM - cp30 Stunnel HTTP for mw111 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 
RECOVERY - gluster101 Current Load on gluster101 is OK: OK - load average: 1.47, 2.91, 3.37
19:25 
PROBLEM - cp30 Stunnel HTTP for phab121 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:26 
PROBLEM - matomo101 NTP time on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:26 
PROBLEM - cp20 Stunnel HTTP for mw101 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:26 
PROBLEM - matomo101 Puppet on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:26 
RECOVERY - cp30 Stunnel HTTP for mail121 on cp30 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 427 bytes in 0.241 second response time
19:27 
RECOVERY - db101 Current Load on db101 is OK: OK - load average: 6.06, 6.37, 6.67
19:27 
PROBLEM - cp30 Varnish Backends on cp30 is CRITICAL: 3 backends are down. mw101 mw102 mw122
19:27 
RECOVERY - cp20 Stunnel HTTP for mw111 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 7.008 second response time
19:27 
RECOVERY - cp21 Stunnel HTTP for mw121 on cp21 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 7.478 second response time
19:27 
RECOVERY - cp20 Stunnel HTTP for mw121 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 7.526 second response time
19:27 
RECOVERY - cp30 Stunnel HTTP for mw111 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 5.210 second response time
19:27 
PROBLEM - cp31 Varnish Backends on cp31 is CRITICAL: 5 backends are down. mw102 mw111 mw112 mw121 mw122
19:27 
RECOVERY - cp30 Stunnel HTTP for phab121 on cp30 is OK: HTTP OK: Status line output matched "500" - 2855 bytes in 0.353 second response time
19:28 
PROBLEM - cp31 Stunnel HTTP for mw101 on cp31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 
PROBLEM - cp21 Stunnel HTTP for mw101 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 
RECOVERY - matomo101 NTP time on matomo101 is OK: NTP OK: Offset -0.006324976683 secs
19:28 
PROBLEM - mw101 MediaWiki Rendering on mw101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 
PROBLEM - cp30 Stunnel HTTP for mw101 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 
RECOVERY - cp31 Stunnel HTTP for mon111 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 33915 bytes in 1.185 second response time
19:29 
RECOVERY - cp30 Varnish Backends on cp30 is OK: All 12 backends are healthy
19:29 
PROBLEM - cp20 Varnish Backends on cp20 is CRITICAL: 1 backends are down. mw101
19:30 
RECOVERY - gluster121 Current Load on gluster121 is OK: OK - load average: 2.30, 2.80, 3.33
19:31 
PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.54, 6.84, 6.83
19:33 
RECOVERY - db101 Current Load on db101 is OK: OK - load average: 6.42, 6.67, 6.77
19:33 
RECOVERY - cp20 Varnish Backends on cp20 is OK: All 12 backends are healthy
19:34 
RECOVERY - cp21 Stunnel HTTP for mw101 on cp21 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 3.751 second response time
19:34 
PROBLEM - cp31 Stunnel HTTP for mw122 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:34 
PROBLEM - cp31 Stunnel HTTP for mw112 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:34 
RECOVERY - mw101 MediaWiki Rendering on mw101 is OK: HTTP OK: HTTP/1.1 200 OK - 22336 bytes in 3.518 second response time
19:34 
PROBLEM - cp30 Stunnel HTTP for mw112 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:34 
PROBLEM - matomo101 conntrack_table_size on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:34 
RECOVERY - cp30 Stunnel HTTP for mw101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.313 second response time
19:34 
RECOVERY - cp20 Stunnel HTTP for mw101 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.015 second response time
19:35 
PROBLEM - cp30 Varnish Backends on cp30 is CRITICAL: 7 backends are down. mw101 mw102 mw111 mw112 mw121 mw122 mediawiki
19:35 
PROBLEM - cp31 Stunnel HTTP for phab121 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:35 <dmehus> Doug 
SRE: persistent 503s on multiple wikis
19:35 <icinga-miraheze> IRC echo bot 
RECOVERY - cp31 Stunnel HTTP for mw101 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.312 second response time
19:36 
RECOVERY - cp31 Stunnel HTTP for mw112 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.325 second response time
19:36 
RECOVERY - cp31 Stunnel HTTP for mw122 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 3.995 second response time
19:36 
RECOVERY - cp30 Stunnel HTTP for mw112 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.358 second response time
Mar 14 2022, 02:37 · MediaWiki (SRE), MediaWiki
Dmehus triaged T8930: Persistent 503s on multiple wikis, including `metawiki` as Unbreak Now! priority.
Mar 14 2022, 02:36 · MediaWiki (SRE), MediaWiki

Mar 11 2022

John changed the visibility for T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif.
Mar 11 2022, 17:43 · Infrastructure (SRE), Varnish, Security
John closed T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif as Resolved.

I am now no longer able to reproduce this.

Mar 11 2022, 17:40 · Infrastructure (SRE), Varnish, Security
John added a comment to T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif.
Mar 11 2022, 09:43 · Infrastructure (SRE), Varnish, Security

Mar 9 2022

Lens0021 added a comment to T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif.

And `css|js|json` clearly doesn't seem to be for images; the use cases should be allowed.

Mar 9 2022, 12:37 · Infrastructure (SRE), Varnish, Security
Lens0021 added a comment to T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif.

I'm not complaining, but https://meta.miraheze.org/wiki/?action=raw&title=Miraheze&ARBITRARY=/w/img_auth.php/.gif still works.

Mar 9 2022, 10:47 · Infrastructure (SRE), Varnish, Security
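The behaviour Lens0021 describes can be checked from a script by sending a cross-origin request with and without the `&ARBITRARY=.gif` suffix and comparing the CORS header. A reproduction sketch, assuming the `requests` library; the Origin value is arbitrary:

```python
# Compare Access-Control-Allow-Origin for the plain URL vs the .gif-suffixed one.
import requests

BASE = "https://meta.miraheze.org/wiki/?action=raw&title=Miraheze"
origin = {"Origin": "https://attacker.example"}

for url in (BASE, BASE + "&ARBITRARY=/w/img_auth.php/.gif"):
    acao = requests.get(url, headers=origin, timeout=10).headers.get(
        "Access-Control-Allow-Origin"
    )
    print(url, "->", acao)
```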
John moved T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif from Incoming to Short Term on the Infrastructure (SRE) board.
Mar 9 2022, 09:28 · Infrastructure (SRE), Varnish, Security
John added a project to T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif: Infrastructure (SRE).
Mar 9 2022, 09:28 · Infrastructure (SRE), Varnish, Security
John claimed T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif.

Thank you for identifying this problem. I have pushed a resolution but have not fully tested it yet. I will verify the resolution before making this task public.

Mar 9 2022, 09:27 · Infrastructure (SRE), Varnish, Security
Lens0021 updated the task description for T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif.
Mar 9 2022, 03:51 · Infrastructure (SRE), Varnish, Security
Lens0021 updated the task description for T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif.
Mar 9 2022, 03:45 · Infrastructure (SRE), Varnish, Security
Lens0021 updated the task description for T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif.
Mar 9 2022, 03:45 · Infrastructure (SRE), Varnish, Security
Lens0021 created T8912: Possible to make the Access-Control-Allow-Origin header to be * by appending &ARBITRARY=.gif.
Mar 9 2022, 03:42 · Infrastructure (SRE), Varnish, Security

Feb 25 2022

RhinosF1 shifted T8832: Prevent MJ12Bot's access from the Restricted Space space to the S1 Public space.
Feb 25 2022, 13:36 · Trust & Safety, MediaWiki, Security, Varnish, Infrastructure (SRE), MediaWiki (SRE)
RhinosF1 closed T8832: Prevent MJ12Bot's access as Resolved.

I'm closing this task. I've tweaked logging so we can tell when LoginNotify is being triggered. We should follow up with some way to alert like we do for exceptions.

Feb 25 2022, 13:36 · Trust & Safety, MediaWiki, Security, Varnish, Infrastructure (SRE), MediaWiki (SRE)
RhinosF1 added a comment to T8832: Prevent MJ12Bot's access.

What do you mean by "security sensitive" pages?

Feb 25 2022, 13:34 · Trust & Safety, MediaWiki, Security, Varnish, Infrastructure (SRE), MediaWiki (SRE)
Reception123 added a comment to T8832: Prevent MJ12Bot's access.

What do you mean by "security sensitive" pages?

Feb 25 2022, 13:02 · Trust & Safety, MediaWiki, Security, Varnish, Infrastructure (SRE), MediaWiki (SRE)
RhinosF1 lowered the priority of T8832: Prevent MJ12Bot's access from Unbreak Now! to High.

Blocked since 12:26

Feb 25 2022, 12:56 · Trust & Safety, MediaWiki, Security, Varnish, Infrastructure (SRE), MediaWiki (SRE)
RhinosF1 triaged T8832: Prevent MJ12Bot's access as Unbreak Now! priority.
Feb 25 2022, 11:48 · Trust & Safety, MediaWiki, Security, Varnish, Infrastructure (SRE), MediaWiki (SRE)

Feb 7 2022

John added a comment to T8737: Log Varnish XID -> Request mapping.

Won't the backends' nginx have access to the X-Varnish header? We can log it there and put it in Graylog.

Feb 7 2022, 13:30 · Monitoring, Varnish, Infrastructure (SRE)
RhinosF1 added a comment to T8737: Log Varnish XID -> Request mapping.

Won't the backends' nginx have access to the X-Varnish header? We can log it there and put it in Graylog.

Feb 7 2022, 12:15 · Monitoring, Varnish, Infrastructure (SRE)

Feb 6 2022

RhinosF1 added a comment to T8737: Log Varnish XID -> Request mapping.

Which isn't helpful if users don't save that, which most aren't going to.

Feb 6 2022, 14:53 · Monitoring, Varnish, Infrastructure (SRE)
John added a comment to T8737: Log Varnish XID -> Request mapping.

There’s only Varnishlog, which is easiest to search using the XID.

Feb 6 2022, 14:50 · Monitoring, Varnish, Infrastructure (SRE)
RhinosF1 added a comment to T8737: Log Varnish XID -> Request mapping.

Is there an access log we can have that shows the XID, URL, and IP on the Varnish end? That should be enough to match them up.

Feb 6 2022, 14:41 · Monitoring, Varnish, Infrastructure (SRE)
John closed T8737: Log Varnish XID -> Request mapping as Declined.

X-Varnish is set on the response, not on the request, because the header logs the response ID.

Feb 6 2022, 14:07 · Monitoring, Varnish, Infrastructure (SRE)
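Since X-Varnish only appears on the response, a reporter can still capture it client-side and quote the XID in a task so it can be matched against varnishlog on the cache proxies. A minimal sketch, assuming the `requests` library; the URL is only an example:

```python
# Capture the X-Varnish transaction ID from a response for later correlation.
import requests

resp = requests.get("https://meta.miraheze.org/wiki/Miraheze", timeout=10)
# On a cache hit, X-Varnish contains two XIDs: this request's and the one
# that originally stored the object.
print("X-Varnish:", resp.headers.get("X-Varnish"))
```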

Feb 5 2022

RhinosF1 created T8737: Log Varnish XID -> Request mapping.
Feb 5 2022, 09:45 · Monitoring, Varnish, Infrastructure (SRE)

Jan 26 2022

John created P452 Top 10 HIT/MISS of Varnish (25/01/2022).
Jan 26 2022, 17:50 · Varnish

Dec 30 2021

RhinosF1 added a comment to T8547: Advise MW Team of available stunnel ports.

8094-8101 used

Dec 30 2021, 15:04 · Varnish, Infrastructure (SRE)
RhinosF1 added a comment to T8547: Advise MW Team of available stunnel ports.

Thank you!

Dec 30 2021, 15:02 · Varnish, Infrastructure (SRE)
John closed T8547: Advise MW Team of available stunnel ports as Resolved.

Ports have historically been assigned in numerical order; the last one used for MediaWiki was 8093, so use 8094 and up.

Dec 30 2021, 14:58 · Varnish, Infrastructure (SRE)
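The convention above (take the next ports after the highest one already assigned) is easy to express as a small helper; the assigned set below is hypothetical, and, as noted earlier in the thread, 8094-8101 is what ended up being used.

```python
# Suggest the next free stunnel ports given the ones already assigned.
def next_free_ports(assigned: set[int], count: int, start: int = 8080) -> list[int]:
    """Return the next `count` unassigned ports at or above `start`."""
    ports, candidate = [], start
    while len(ports) < count:
        if candidate not in assigned:
            ports.append(candidate)
        candidate += 1
    return ports

# Example: MediaWiki had ports up to 8093, so the next eight are 8094-8101.
print(next_free_ports(set(range(8080, 8094)), count=8))
```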
RhinosF1 moved T8547: Advise MW Team of available stunnel ports from Incoming to Short Term on the Infrastructure (SRE) board.
Dec 30 2021, 14:00 · Varnish, Infrastructure (SRE)
RhinosF1 triaged T8547: Advise MW Team of available stunnel ports as Normal priority.
Dec 30 2021, 13:42 · Varnish, Infrastructure (SRE)

Nov 20 2021

Unknown Object (User) closed T7311: Continuous JavaScript errors with Echo and GlobalWatchlist: Access to XMLHttpRequest blocked by CORS policy as Resolved.

This is resolved with the commits.

Nov 20 2021, 19:10 · Varnish, Puppet, Universal Omega, MediaWiki (SRE)

Nov 17 2021

Unknown Object (User) claimed T7311: Continuous JavaScript errors with Echo and GlobalWatchlist: Access to XMLHttpRequest blocked by CORS policy.
Nov 17 2021, 20:27 · Varnish, Puppet, Universal Omega, MediaWiki (SRE)

Oct 11 2021

John closed T8024: Prevent large objects from being cached within varnish as Declined.
Oct 11 2021, 17:56 · Varnish, Infrastructure (SRE)

Sep 28 2021

John added a comment to T8024: Prevent large objects from being cached within varnish.

@Paladox see above

Sep 28 2021, 11:25 · Varnish, Infrastructure (SRE)

Sep 12 2021

John added a comment to T8024: Prevent large objects from being cached within varnish.

What are examples of large objects? Are they infrequently requested? Frequently requested large objects would make more sense to cache than smaller, infrequently requested ones.

Sep 12 2021, 20:01 · Varnish, Infrastructure (SRE)
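A back-of-the-envelope way to see John's point: the origin traffic a cached object saves is roughly request rate times object size times hit ratio, so a large but popular object can be worth far more cache space than a small, rarely requested one. All numbers below are made up for illustration.

```python
# Illustrative arithmetic only; not a Varnish sizing tool.
def origin_mb_saved(requests_per_hour: float, size_mb: float, hit_ratio: float) -> float:
    return requests_per_hour * size_mb * hit_ratio

large_popular = origin_mb_saved(requests_per_hour=600, size_mb=20, hit_ratio=0.95)
small_rare = origin_mb_saved(requests_per_hour=5, size_mb=0.2, hit_ratio=0.95)
print(f"large, popular object: ~{large_popular:,.0f} MB/hour saved at the origin")
print(f"small, rare object:    ~{small_rare:,.1f} MB/hour saved at the origin")
```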
Paladox triaged T8024: Prevent large objects from being cached within varnish as Normal priority.
Sep 12 2021, 19:03 · Varnish, Infrastructure (SRE)

Sep 6 2021

Unknown Object (User) moved T7698: Wikimedia incident can bring us down from Unsorted to Short Term on the Universal Omega board.
Sep 6 2021, 03:07 · Universal Omega, Monitoring, MediaWiki, Varnish, MediaWiki (SRE)
Unknown Object (User) closed T7698: Wikimedia incident can bring us down as Resolved.
Sep 6 2021, 03:07 · Universal Omega, Monitoring, MediaWiki, Varnish, MediaWiki (SRE)

Sep 4 2021

Unknown Object (User) added a comment to T7698: Wikimedia incident can bring us down.

Yes it did happen.

Sep 4 2021, 07:58 · Universal Omega, Monitoring, MediaWiki, Varnish, MediaWiki (SRE)
RhinosF1 added a comment to T7698: Wikimedia incident can bring us down.

Logs from last night indicate it might have happened; I see depools around the time of the Wikimedia outage.

Sep 4 2021, 07:31 · Universal Omega, Monitoring, MediaWiki, Varnish, MediaWiki (SRE)

Aug 12 2021

John closed T7319: Varnish should clean up ramdisk after failure, a subtask of T7318: Varnish cp12 OOM'd May 16 20:25 UK time, as Resolved.
Aug 12 2021, 13:16 · Production Error, Infrastructure (SRE), Varnish
John closed T7319: Varnish should clean up ramdisk after failure as Resolved.

System logs show the child process restarting, and no errors are displayed. Logs show that the ramdisk is being cleared now.

Aug 12 2021, 13:16 · Infrastructure (SRE), Varnish
John added a comment to T7319: Varnish should clean up ramdisk after failure.

Looking at Grafana for cp13, when the software OOMs, disk usage drops significantly and immediately, which suggests proper disk cleanup is occurring. This is replicated on cp12 as well.

Aug 12 2021, 12:51 · Infrastructure (SRE), Varnish
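Along the lines John describes, the cleanup can also be spot-checked on the host itself by looking at how full the ramdisk is after a crash. A sketch only; the mount point is an assumption, not the real cp12/cp13 path.

```python
# Spot-check: is the ramdisk Varnish uses for storage actually freeing space?
import shutil

RAMDISK = "/var/lib/varnish"   # hypothetical ramdisk mount used by varnishd

usage = shutil.disk_usage(RAMDISK)
pct = usage.used / usage.total * 100
print(f"{RAMDISK}: {usage.used / 2**30:.1f} GiB used of "
      f"{usage.total / 2**30:.1f} GiB ({pct:.0f}%)")
if pct > 80:
    print("ramdisk still mostly full - cleanup after the crash may not have run")
```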
RhinosF1 added a comment to T7698: Wikimedia incident can bring us down.

Latency increases beyond the Varnish cutoff, so Varnish depools everything.

Aug 12 2021, 06:11 · Universal Omega, Monitoring, MediaWiki, Varnish, MediaWiki (SRE)
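A simplified illustration of that failure mode (not Varnish's actual probe code): once backend latency exceeds the probe timeout, every probe fails, every backend is marked sick, and the caches can only serve errors. The timeout and latencies below are made up.

```python
# Toy model of "all backends depooled because probes time out".
PROBE_TIMEOUT_S = 5.0
backend_latency_s = {"mw101": 7.2, "mw102": 6.8, "mw111": 9.1}  # hypothetical

healthy = {name: lat < PROBE_TIMEOUT_S for name, lat in backend_latency_s.items()}
print(healthy)
if not any(healthy.values()):
    print("all backends depooled -> users see 503s")
```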
Void reopened T7319: Varnish should clean up ramdisk after failure, a subtask of T7318: Varnish cp12 OOM'd May 16 20:25 UK time, as Open.
Aug 12 2021, 05:43 · Production Error, Infrastructure (SRE), Varnish
Void reopened T7319: Varnish should clean up ramdisk after failure as "Open".

Both cp12 and cp13 OOM'd tonight and didn't restart cleanly. Logs suggest this issue isn't fixed.

Aug 12 2021, 05:43 · Infrastructure (SRE), Varnish

Aug 11 2021

Unknown Object (User) moved T7698: Wikimedia incident can bring us down from Backlog to Upstream on the MediaWiki board.
Aug 11 2021, 22:43 · Universal Omega, Monitoring, MediaWiki, Varnish, MediaWiki (SRE)
Unknown Object (User) added a comment to T7698: Wikimedia incident can bring us down.

Can I please have some context here (just for me, not for anything else)? I must not have been following what happened last time, attempted solutions, etc...

Aug 11 2021, 22:43 · Universal Omega, Monitoring, MediaWiki, Varnish, MediaWiki (SRE)

Aug 10 2021

John closed T7763: Review GDNSD setup when GB DC down as Resolved.

From a review perspective, this is sorted, as this was an accepted risk taken by the previous DSRE. I'm intending to do some capacity reviews, so I will follow this up with a relevant task or communication once I get onto the traffic side of things.

Aug 10 2021, 17:17 · Varnish, DNS, Infrastructure (SRE)