A user on Discord has reported it happening again, so it's possible the issue wasn't fully resolved.
Feb 28 2024
Feb 26 2024
Sounds good
Hmm, that's weird, but now I don't get Error 500 either when importing pages on gameshows or when editing with the code editor on chernowiki. Looks like the problem is actually resolved.
Visual editor being broken is already tracked in T11903. As for the other issues, can you reproduce this on any wikis other than gameshowswiki?
Feb 25 2024
Still getting Error 500 when trying to import pages on gameshows.miraheze.org. Small XML files import fine, while large ones (around 750 KB) fail.
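One possible workaround sketch while the 500s are investigated: split the export into chunks below the size that still imports. This assumes a standard Special:Export dump (pages delimited by `<page>` elements inside a `<mediawiki>` root) and a Node runtime; the file naming and the 500 KB default are illustrative, not tested against these wikis.

```ts
// Split a MediaWiki XML export into smaller, individually valid dumps.
import { readFileSync, writeFileSync } from "node:fs";

const [, , input, maxKb = "500"] = process.argv;
const xml = readFileSync(input, "utf8");

const headerEnd = xml.indexOf("<page>");
if (headerEnd === -1) throw new Error(`no <page> elements found in ${input}`);
const footerStart = xml.lastIndexOf("</page>") + "</page>".length;

const header = xml.slice(0, headerEnd); // <mediawiki ...> plus <siteinfo>
const footer = xml.slice(footerStart); // usually just </mediawiki>
const pages = xml.slice(headerEnd, footerStart).split(/(?<=<\/page>)/);

const maxBytes = Number(maxKb) * 1024;
let chunk: string[] = [];
let size = 0;
let n = 0;

const flush = () => {
  if (chunk.length === 0) return;
  writeFileSync(`${input}.part${++n}.xml`, header + chunk.join("") + footer);
  chunk = [];
  size = 0;
};

for (const page of pages) {
  if (size + page.length > maxBytes) flush(); // start a new chunk
  chunk.push(page);
  size += page.length;
}
flush();

console.log(`wrote ${n} chunk(s) from ${pages.length} page(s)`);
```

Each chunk keeps the original header and footer, so every part should import on its own through Special:Import.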
Once again purged 13-16G of Varnish logs.
Several Discord users have reported this occurring recently, and more frequently.
Feb 24 2024
Having reviewed the Discord and Phabricator issues needing triage, I think this is probably a larger issue than I first assumed.
To be clear, I lowered this to Normal because it only appears to be happening on some wikis and not all of their pages. Most functionality still works. Feel free to change it back if I'm wrong about this triage.
This is occurring again, see T11899
Feb 23 2024
Feb 22 2024
Jan 24 2024
503s no longer display a Twitter feed. They instead link to a static help page on GitHub Pages, which explains what may have happened and links to our social media and status page, so technically this is invalid?
T&S exists now, and @Agent_Isai is likely the best person to approve what comes next.
Oct 25 2023
If the problem is with CSP reviews, I'd argue emfed has a better shot than Facebook.
Oct 24 2023
Replacing it with Mastodon is the easiest route, since you already have that up and running. A quick search brings up https://sampsyo.github.io/emfed/. I could write a PR including emfed from the jsdelivr cdn if wanted.
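For reference, a minimal sketch of what the emfed wiring could look like. The module path, account handle, container id, and `data-toot-limit` attribute are assumptions to verify against the emfed documentation, not a tested integration:

```ts
// Hypothetical emfed usage: emfed scans for anchors with the class
// "mastodon-feed" and replaces them with the account's recent toots.
const feedLink = document.createElement("a");
feedLink.className = "mastodon-feed";
feedLink.href = "https://mastodon.social/@miraheze"; // assumed account URL
feedLink.dataset.tootLimit = "5"; // assumed attribute: number of toots shown
feedLink.textContent = "Follow us on Mastodon";
document.querySelector("#social-feed")?.append(feedLink); // assumed container

// Pull the module from the jsDelivr CDN, as proposed above; loading it
// triggers the transformation (exact package path is an assumption).
import("https://cdn.jsdelivr.net/npm/emfed@1").catch((err) => {
  console.error("emfed failed to load:", err);
});
```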
Sep 11 2023
Makes sense to me
Aug 9 2023
This is the truth: Miraheze is burning like a house and the supposed firefighters are sitting around relaxing, having a coffee, instead of helping.
Jul 11 2023
May 19 2023
Apr 16 2022
Mar 26 2022
Spoke with @Paladox and no further action is needed on this task.
@RhinosF1 Do we still need this task open since the incident has passed?
Mar 23 2022
NCSC are aware
Blocked at the firewall level globally; let's keep an eye on it.
Mar 14 2022
00:08:29 <JohnLewis> dmehus: yeah, IO on cloud11's SSDs is pretty high because of piwik db migration
php-fpm looks to be struggling to keep up again.
When I tried to reach https://robla.miraheze.org about 20 minutes ago, I received the following error:
PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
20:19 PROBLEM - test101 Current Load on test101 is CRITICAL: CRITICAL - load average: 2.09, 2.06, 1.84
20:20 RECOVERY - matomo101 SSH on matomo101 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0)
20:20 RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 3.172 seconds response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
20:20 PROBLEM - db101 Current Load on db101 is CRITICAL: CRITICAL - load average: 8.22, 7.19, 6.98
20:21 RECOVERY - cp30 Stunnel HTTP for mw101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.312 second response time
20:21 RECOVERY - cp31 Varnish Backends on cp31 is OK: All 12 backends are healthy
20:21 PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.07, 1.69, 1.73
20:22 PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.24, 6.69, 6.81
20:23 RECOVERY - cp30 Varnish Backends on cp30 is OK: All 12 backends are healthy
20:23 RECOVERY - test101 Current Load on test101 is OK: OK - load average: 1.10, 1.50, 1.66
20:25 PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
20:26 <dmehus> Doug !sre
20:26 <icinga-miraheze> IRC echo bot PROBLEM - matomo101 SSH on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:26 RECOVERY - db101 Current Load on db101 is OK: OK - load average: 5.97, 6.63, 6.77
20:27 PROBLEM - cp30 Stunnel HTTP for mw101 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
20:27 PROBLEM - cp31 Stunnel HTTP for phab121 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
20:27 RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 1.177 second response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
20:27 PROBLEM - cp30 Stunnel HTTP for mw111 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:27 PROBLEM - cp21 Stunnel HTTP for mw122 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
20:27 PROBLEM - cp21 Stunnel HTTP for mw111 on cp21 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.011 second response time
20:27 PROBLEM - cp31 Stunnel HTTP for mw111 on cp31 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.317 second response time
20:27 PROBLEM - cp20 Stunnel HTTP for mw111 on cp20 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 328 bytes in 0.012 second response time
20:27 PROBLEM - mw111 MediaWiki Rendering on mw111 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 1595 bytes in 0.008 second response time
20:28 <dmehus> Doug Can reproduce the above persistently
20:28 <icinga-miraheze> IRC echo bot PROBLEM - cp31 Stunnel HTTP for mw122 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
20:28 PROBLEM - matomo101 conntrack_table_size on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
20:28 PROBLEM - cp30 Stunnel HTTP for mw122 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
^ Additional icinga alerts
<icinga-miraheze> IRC echo bot RECOVERY - mw121 Current Load on mw121 is OK: OK - load average: 6.68, 8.41, 8.47
18:17 PROBLEM - db112 Current Load on db112 is WARNING: WARNING - load average: 5.19, 5.81, 5.32
18:17 RECOVERY - mw112 Current Load on mw112 is OK: OK - load average: 6.64, 7.32, 8.32
18:19 PROBLEM - db112 Current Load on db112 is CRITICAL: CRITICAL - load average: 6.57, 6.00, 5.44
18:19 RECOVERY - gluster101 Current Load on gluster101 is OK: OK - load average: 3.19, 3.18, 3.16
18:20 alerting : [FIRING:1] (PHP-FPM Worker Usage High yes mediawiki) https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki
18:20 RECOVERY - mw111 Current Load on mw111 is OK: OK - load average: 7.34, 7.62, 8.49
18:21 PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.84, 8.74, 8.71
18:23 PROBLEM - db112 Current Load on db112 is WARNING: WARNING - load average: 2.62, 4.79, 5.11
18:24 PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 8.70, 8.49, 8.68
18:25 RECOVERY - db112 Current Load on db112 is OK: OK - load average: 2.99, 4.25, 4.87
18:27 → darkmatterman450 joined (~darkmatte@user/darkmatterman450)
18:27 <icinga-miraheze> IRC echo bot PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 10.21, 9.05, 8.85
18:28 PROBLEM - mw111 Current Load on mw111 is CRITICAL: CRITICAL - load average: 10.69, 9.69, 9.15
18:29 PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 9.93, 9.14, 8.90
18:30 PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 7.77, 9.17, 9.04
18:30 PROBLEM - mw121 Current Load on mw121 is WARNING: WARNING - load average: 8.33, 8.92, 8.52
18:31 PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 10.86, 9.82, 9.19
18:34 PROBLEM - mw111 Current Load on mw111 is CRITICAL: CRITICAL - load average: 11.45, 10.20, 9.45
18:34 PROBLEM - mw121 Current Load on mw121 is CRITICAL: CRITICAL - load average: 10.20, 9.44, 8.79
18:36 PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 9.32, 9.86, 9.42
18:36 PROBLEM - mw121 Current Load on mw121 is WARNING: WARNING - load average: 7.81, 8.99, 8.72
18:41 PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.68, 9.52, 9.53
18:42 PROBLEM - mw102 Current Load on mw102 is WARNING: WARNING - load average: 8.73, 7.72, 6.97
18:43 PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 10.80, 10.19, 9.78
18:44 RECOVERY - mw102 Current Load on mw102 is OK: OK - load average: 7.10, 7.53, 7.00
18:44 PROBLEM - mw122 Current Load on mw122 is CRITICAL: CRITICAL - load average: 10.86, 8.60, 8.05
18:45 PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 9.56, 9.96, 9.76
18:47 PROBLEM - mw112 Current Load on mw112 is CRITICAL: CRITICAL - load average: 11.72, 10.28, 9.87
18:48 PROBLEM - mw122 Current Load on mw122 is WARNING: WARNING - load average: 9.66, 9.10, 8.36
18:50 PROBLEM - cp31 Current Load on cp31 is CRITICAL: CRITICAL - load average: 2.54, 1.96, 1.29
18:51 PROBLEM - cp30 Current Load on cp30 is WARNING: WARNING - load average: 1.75, 1.65, 1.28
18:52 PROBLEM - cp30 Stunnel HTTP for matomo101 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:52 PROBLEM - cp21 Stunnel HTTP for matomo101 on cp21 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 358 bytes in 0.157 second response time
18:52 PROBLEM - matomo101 Current Load on matomo101 is CRITICAL: CRITICAL - load average: 20.08, 9.20, 4.29
18:52 PROBLEM - cp31 Stunnel HTTP for matomo101 on cp31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:52 RECOVERY - mw111 Current Load on mw111 is OK: OK - load average: 7.08, 7.80, 8.46
18:52 RECOVERY - cp31 Current Load on cp31 is OK: OK - load average: 1.10, 1.63, 1.25
18:53 PROBLEM - db101 Current Load on db101 is CRITICAL: CRITICAL - load average: 9.95, 7.83, 6.37
18:53 PROBLEM - matomo101 HTTPS on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:53 PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
18:53 PROBLEM - cp20 Stunnel HTTP for matomo101 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
18:53 RECOVERY - cp30 Current Load on cp30 is OK: OK - load average: 1.21, 1.54, 1.29
18:53 PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.99, 9.74, 9.82
18:54 RECOVERY - mw122 Current Load on mw122 is OK: OK - load average: 5.58, 7.61, 8.01
18:55 RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 2.725 seconds response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
18:57 PROBLEM - cp31 Varnish Backends on cp31 is CRITICAL: 1 backends are down. mw111
18:58 PROBLEM - matomo101 Redis Process on matomo101 is CRITICAL: PROCS CRITICAL: 0 processes with args 'redis-server'
18:58 PROBLEM - mw102 Current Load on mw102 is WARNING: WARNING - load average: 8.76, 8.09, 7.53
18:58 RECOVERY - mw121 Current Load on mw121 is OK: OK - load average: 6.47, 7.93, 8.49
18:59 PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.51, 7.73, 6.89
18:59 RECOVERY - cp31 Varnish Backends on cp31 is OK: All 12 backends are healthy
19:00 RECOVERY - matomo101 Redis Process on matomo101 is OK: PROCS OK: 1 process with args 'redis-server'
19:00 PROBLEM - matomo101 SSH on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:00 RECOVERY - mw102 Current Load on mw102 is OK: OK - load average: 8.47, 8.20, 7.64
19:01 PROBLEM - db101 Current Load on db101 is CRITICAL: CRITICAL - load average: 9.59, 8.57, 7.31
19:01 PROBLEM - test101 Current Load on test101 is CRITICAL: CRITICAL - load average: 2.07, 1.82, 1.51
19:02 PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
19:03 PROBLEM - matomo101 NTP time on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:03 PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.51, 1.73, 1.52
19:03 PROBLEM - matomo101 Puppet on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:05 PROBLEM - matomo101 conntrack_table_size on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:05 RECOVERY - mw112 Current Load on mw112 is OK: OK - load average: 5.63, 6.85, 8.34
19:05 PROBLEM - gluster101 Current Load on gluster101 is CRITICAL: CRITICAL - load average: 4.93, 4.15, 3.36
19:05 PROBLEM - test101 Current Load on test101 is CRITICAL: CRITICAL - load average: 2.16, 1.89, 1.60
19:05 PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 4.66, 3.35, 2.84
19:06 PROBLEM - cp30 Stunnel HTTP for test101 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:06 PROBLEM - cp30 Stunnel HTTP for mw121 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:07 PROBLEM - gluster101 Current Load on gluster101 is WARNING: WARNING - load average: 3.16, 3.74, 3.31
19:07 PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 3.91, 3.31, 2.87
19:08 PROBLEM - matomo101 ferm_active on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:08 RECOVERY - cp30 Stunnel HTTP for test101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14564 bytes in 0.338 second response time
19:08 RECOVERY - cp30 Stunnel HTTP for mw121 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 0.852 second response time
19:08 PROBLEM - ns2 GDNSD Datacenters on ns2 is CRITICAL: CRITICAL - 2 datacenters are down: 149.56.140.43/cpweb, 2607:5300:201:3100::929a/cpweb
19:09 PROBLEM - gluster121 Current Load on gluster121 is CRITICAL: CRITICAL - load average: 4.93, 4.00, 3.10
19:09 PROBLEM - cp30 Varnish Backends on cp30 is CRITICAL: 7 backends are down. mw101 mw102 mw111 mw112 mw121 mw122 mediawiki
19:09 PROBLEM - ns1 GDNSD Datacenters on ns1 is CRITICAL: CRITICAL - 2 datacenters are down: 149.56.140.43/cpweb, 2607:5300:201:3100::929a/cpweb
19:09 PROBLEM - matomo101 Redis Process on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:09 PROBLEM - gluster101 Current Load on gluster101 is CRITICAL: CRITICAL - load average: 4.99, 4.13, 3.49
19:09 PROBLEM - matomo101 Disk Space on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:09 PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.69, 1.87, 1.67
19:09 PROBLEM - matomo101 php-fpm on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:09 PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 4.20, 3.56, 3.01
19:10 RECOVERY - ns2 GDNSD Datacenters on ns2 is OK: OK - all datacenters are online
19:11 RECOVERY - cp30 Varnish Backends on cp30 is OK: All 12 backends are healthy
19:11 RECOVERY - ns1 GDNSD Datacenters on ns1 is OK: OK - all datacenters are online
19:11 PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 3.54, 3.78, 3.18
19:11 PROBLEM - gluster101 Current Load on gluster101 is WARNING: WARNING - load average: 2.38, 3.59, 3.38
19:11 RECOVERY - test101 Current Load on test101 is OK: OK - load average: 1.37, 1.70, 1.63
19:11 RECOVERY - matomo101 Disk Space on matomo101 is OK: DISK OK - free space: / 1205 MB (12% inode=80%);
19:11 RECOVERY - matomo101 Puppet on matomo101 is OK: OK: Puppet is currently enabled, last run 51 minutes ago with 0 failures
19:11 RECOVERY - matomo101 php-fpm on matomo101 is OK: PROCS OK: 5 processes with command name 'php-fpm7.4'
19:11 RECOVERY - matomo101 Redis Process on matomo101 is OK: PROCS OK: 1 process with args 'redis-server'
19:11 RECOVERY - matomo101 NTP time on matomo101 is OK: NTP OK: Offset -0.005568474531 secs
19:12 RECOVERY - matomo101 SSH on matomo101 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0)
19:13 RECOVERY - matomo101 ferm_active on matomo101 is OK: OK ferm input default policy is set
19:13 RECOVERY - matomo101 conntrack_table_size on matomo101 is OK: OK: nf_conntrack is 0 % full
19:13 PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.85, 7.63, 7.55
19:13 RECOVERY - cp30 Stunnel HTTP for matomo101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 3.464 second response time
19:13 RECOVERY - cp21 Stunnel HTTP for matomo101 on cp21 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 0.889 second response time
19:13 RECOVERY - matomo101 PowerDNS Recursor on matomo101 is OK: DNS OK: 0.811 seconds response time. miraheze.org returns 198.244.148.90,2001:41d0:801:2000::1b80,2001:41d0:801:2000::4c25,51.195.220.68
19:13 PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 5.26, 4.24, 3.41
19:13 RECOVERY - cp20 Stunnel HTTP for matomo101 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 0.705 second response time
19:13 PROBLEM - gluster101 Current Load on gluster101 is CRITICAL: CRITICAL - load average: 4.51, 3.95, 3.54
19:13 RECOVERY - cp31 Stunnel HTTP for matomo101 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 66463 bytes in 0.663 second response time
19:14 RECOVERY - matomo101 HTTPS on matomo101 is OK: HTTP OK: HTTP/1.1 200 OK - 66479 bytes in 1.038 second response time
19:14 PROBLEM - gluster121 Current Load on gluster121 is WARNING: WARNING - load average: 3.53, 3.88, 3.40
19:15 PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 2.95, 3.69, 3.31
19:15 ok : [RESOLVED] (PHP-FPM Worker Usage High yes mediawiki) https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki
19:16 PROBLEM - mw111 Current Load on mw111 is WARNING: WARNING - load average: 9.25, 8.08, 7.21
19:16 PROBLEM - gluster121 Current Load on gluster121 is CRITICAL: CRITICAL - load average: 8.47, 4.85, 3.77
19:17 PROBLEM - gluster101 Current Load on gluster101 is WARNING: WARNING - load average: 3.26, 3.93, 3.66
19:18 RECOVERY - mw111 Current Load on mw111 is OK: OK - load average: 7.62, 7.88, 7.25
19:19 PROBLEM - gluster111 Current Load on gluster111 is CRITICAL: CRITICAL - load average: 4.06, 3.89, 3.45
19:19 PROBLEM - test101 Current Load on test101 is WARNING: WARNING - load average: 1.77, 1.64, 1.59
19:20 PROBLEM - mw112 Current Load on mw112 is WARNING: WARNING - load average: 8.11, 8.72, 8.18
19:20 PROBLEM - mw121 Current Load on mw121 is WARNING: WARNING - load average: 9.69, 8.68, 7.66
19:21 RECOVERY - db101 Current Load on db101 is OK: OK - load average: 6.29, 6.11, 6.73
19:21 PROBLEM - gluster111 Current Load on gluster111 is WARNING: WARNING - load average: 2.87, 3.62, 3.41
19:21 alerting : [FIRING:1] (PHP-FPM Worker Usage High yes mediawiki) https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki
19:21 RECOVERY - test101 Current Load on test101 is OK: OK - load average: 1.21, 1.50, 1.55
19:22 PROBLEM - cp31 Stunnel HTTP for matomo101 on cp31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:22 RECOVERY - mw112 Current Load on mw112 is OK: OK - load average: 6.77, 8.29, 8.10
19:22 RECOVERY - mw121 Current Load on mw121 is OK: OK - load average: 5.47, 7.77, 7.48
19:23 RECOVERY - gluster111 Current Load on gluster111 is OK: OK - load average: 2.23, 3.29, 3.32
19:23 PROBLEM - cp30 Stunnel HTTP for matomo101 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:23 PROBLEM - cp20 Stunnel HTTP for matomo101 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:23 PROBLEM - matomo101 HTTPS on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:23 PROBLEM - cp21 Stunnel HTTP for matomo101 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:24 PROBLEM - matomo101 PowerDNS Recursor on matomo101 is CRITICAL: CRITICAL - Plugin timed out while executing system call
19:24 PROBLEM - gluster121 Current Load on gluster121 is WARNING: WARNING - load average: 1.87, 3.56, 3.75
19:24 PROBLEM - matomo101 SSH on matomo101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:24 PROBLEM - cp30 Stunnel HTTP for mail121 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:24 PROBLEM - cp31 Stunnel HTTP for mon111 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:25 PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 7.20, 6.62, 6.79
19:25 PROBLEM - cp20 Stunnel HTTP for mw111 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 PROBLEM - cp21 Stunnel HTTP for mw121 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 PROBLEM - cp20 Stunnel HTTP for mw121 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 PROBLEM - cp30 Stunnel HTTP for mw111 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 RECOVERY - gluster101 Current Load on gluster101 is OK: OK - load average: 1.47, 2.91, 3.37
19:25 PROBLEM - cp30 Stunnel HTTP for phab121 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:26 PROBLEM - matomo101 NTP time on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:26 PROBLEM - cp20 Stunnel HTTP for mw101 on cp20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:26 PROBLEM - matomo101 Puppet on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:26 RECOVERY - cp30 Stunnel HTTP for mail121 on cp30 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 427 bytes in 0.241 second response time
19:27 RECOVERY - db101 Current Load on db101 is OK: OK - load average: 6.06, 6.37, 6.67
19:27 PROBLEM - cp30 Varnish Backends on cp30 is CRITICAL: 3 backends are down. mw101 mw102 mw122
19:27 RECOVERY - cp20 Stunnel HTTP for mw111 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 7.008 second response time
19:27 RECOVERY - cp21 Stunnel HTTP for mw121 on cp21 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 7.478 second response time
19:27 RECOVERY - cp20 Stunnel HTTP for mw121 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 7.526 second response time
19:27 RECOVERY - cp30 Stunnel HTTP for mw111 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 5.210 second response time
19:27 PROBLEM - cp31 Varnish Backends on cp31 is CRITICAL: 5 backends are down. mw102 mw111 mw112 mw121 mw122
19:27 RECOVERY - cp30 Stunnel HTTP for phab121 on cp30 is OK: HTTP OK: Status line output matched "500" - 2855 bytes in 0.353 second response time
19:28 PROBLEM - cp31 Stunnel HTTP for mw101 on cp31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 PROBLEM - cp21 Stunnel HTTP for mw101 on cp21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 RECOVERY - matomo101 NTP time on matomo101 is OK: NTP OK: Offset -0.006324976683 secs
19:28 PROBLEM - mw101 MediaWiki Rendering on mw101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 PROBLEM - cp30 Stunnel HTTP for mw101 on cp30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:28 RECOVERY - cp31 Stunnel HTTP for mon111 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 33915 bytes in 1.185 second response time
19:29 RECOVERY - cp30 Varnish Backends on cp30 is OK: All 12 backends are healthy
19:29 PROBLEM - cp20 Varnish Backends on cp20 is CRITICAL: 1 backends are down. mw101
19:30 RECOVERY - gluster121 Current Load on gluster121 is OK: OK - load average: 2.30, 2.80, 3.33
19:31 PROBLEM - db101 Current Load on db101 is WARNING: WARNING - load average: 6.54, 6.84, 6.83
19:33 RECOVERY - db101 Current Load on db101 is OK: OK - load average: 6.42, 6.67, 6.77
19:33 RECOVERY - cp20 Varnish Backends on cp20 is OK: All 12 backends are healthy
19:34 RECOVERY - cp21 Stunnel HTTP for mw101 on cp21 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 3.751 second response time
19:34 PROBLEM - cp31 Stunnel HTTP for mw122 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:34 PROBLEM - cp31 Stunnel HTTP for mw112 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:34 RECOVERY - mw101 MediaWiki Rendering on mw101 is OK: HTTP OK: HTTP/1.1 200 OK - 22336 bytes in 3.518 second response time
19:34 PROBLEM - cp30 Stunnel HTTP for mw112 on cp30 is CRITICAL: HTTP CRITICAL - No data received from host
19:34 PROBLEM - matomo101 conntrack_table_size on matomo101 is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds.
19:34 RECOVERY - cp30 Stunnel HTTP for mw101 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.313 second response time
19:34 RECOVERY - cp20 Stunnel HTTP for mw101 on cp20 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.015 second response time
19:35 PROBLEM - cp30 Varnish Backends on cp30 is CRITICAL: 7 backends are down. mw101 mw102 mw111 mw112 mw121 mw122 mediawiki
19:35 PROBLEM - cp31 Stunnel HTTP for phab121 on cp31 is CRITICAL: HTTP CRITICAL - No data received from host
19:35 <dmehus> Doug SRE: persistent 503s on multiple wikis
19:35 <icinga-miraheze> IRC echo bot RECOVERY - cp31 Stunnel HTTP for mw101 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.312 second response time
19:36 RECOVERY - cp31 Stunnel HTTP for mw112 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.325 second response time
19:36 RECOVERY - cp31 Stunnel HTTP for mw122 on cp31 is OK: HTTP OK: HTTP/1.1 200 OK - 14556 bytes in 3.995 second response time
19:36 RECOVERY - cp30 Stunnel HTTP for mw112 on cp30 is OK: HTTP OK: HTTP/1.1 200 OK - 14562 bytes in 0.358 second response time
Mar 11 2022
I am now no longer able to reproduce this.
Mar 9 2022
And css|js|json clearly doesn't seem to be for images; those use cases should be allowed.
I'm not complaining, but https://meta.miraheze.org/wiki/?action=raw&title=Miraheze&ARBITRARY=/w/img_auth.php/.gif still works.
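To make the report easy to re-check after a fix lands, a small probe could look like this (a sketch assuming Node 18+ for the global fetch; the expected status codes are an assumption about how a fix would behave, not documented behaviour):

```ts
// Probe the reported bypass: action=raw with an arbitrary query parameter
// that merely mentions the img_auth.php path.
const bypassUrl =
  "https://meta.miraheze.org/wiki/?action=raw&title=Miraheze&ARBITRARY=/w/img_auth.php/.gif";

const res = await fetch(bypassUrl, { redirect: "manual" });
// As reported above, this currently returns the raw wikitext (HTTP 200);
// a fix that stops matching on query strings should presumably return 4xx.
console.log(res.status, res.headers.get("content-type"));
```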
Thank you for identifying this problem. I have pushed a fix but have not fully tested it yet; I will verify the resolution before making this public.
Feb 25 2022
I'm closing this task. I've tweaked logging so we can tell when LoginNotify is being triggered. We should follow up with some way to alert like we do for exceptions.
What do you mean by "security sensitive" pages?
Blocked since 12:26
Feb 7 2022
Won't the backends' nginx have access to the X-Varnish header? We can log it there and put it in Graylog.
Feb 6 2022
Which isn't helpful if users don't save that, which most aren't going to do.
There’s only Varnishlog, which is easiest to search using the XID.
Is there an access log we can have that shows the XID, URL, and IP on the Varnish end? It should be enough to match them up.
X-Varnish is set on response, not on request. This is because the header logs the response ID.
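As a concrete illustration of what a report would need to capture for matching against varnishlog, a sketch (Node 18+ assumed; the example URL is arbitrary) that records the XID together with the URL and a timestamp:

```ts
// Capture the X-Varnish response header alongside the URL and time: the
// three pieces needed to find the corresponding transaction server-side.
const url = "https://meta.miraheze.org/wiki/Miraheze"; // arbitrary example
const res = await fetch(url);

// On a cache hit the header holds two IDs: the delivered response's XID and
// the XID of the backend request that populated the cache.
const xid = res.headers.get("x-varnish");
console.log(JSON.stringify({ url, xid, at: new Date().toISOString() }));

// The XID can then be looked up in the shared-memory log on the cache
// proxy, e.g. `varnishlog -d -q 'vxid == <XID>'` (exact query syntax may
// vary between Varnish versions).
```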
Feb 5 2022
Jan 26 2022
Dec 30 2021
8094-8101 used
Thank you!
Ports are assigned historically in numerical order; the last one used for MediaWiki was 8093, so 8094+.
Nov 20 2021
This is resolved with the commits.
Nov 17 2021
Oct 11 2021
Sep 28 2021
@Paladox see above
Sep 12 2021
What are examples of large objects? Are they infrequently requested? Frequently requested large objects would make more sense to cache than smaller, infrequently requested objects.
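To make that trade-off concrete, a toy calculation (all numbers invented): the backend bandwidth a cached object saves is roughly its size times how often it is hit, so a large frequently requested object can dominate a small or rarely requested one.

```ts
// Rough bytes-from-backend saved per day if an object is served from cache.
interface CacheCandidate {
  name: string;
  sizeBytes: number;
  hitsPerDay: number;
}

const bytesSavedPerDay = (o: CacheCandidate): number =>
  o.sizeBytes * o.hitsPerDay;

const candidates: CacheCandidate[] = [
  { name: "large, frequent", sizeBytes: 5_000_000, hitsPerDay: 1_000 },
  { name: "large, rare", sizeBytes: 5_000_000, hitsPerDay: 2 },
  { name: "small, frequent", sizeBytes: 20_000, hitsPerDay: 1_000 },
];

for (const c of candidates) {
  console.log(`${c.name}: ~${bytesSavedPerDay(c).toLocaleString()} bytes/day`);
}
```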
Sep 6 2021
Sep 4 2021
Yes it did happen.
Logs show an indication that it might have happened last night. I see depools around the Wikimedia outage.
Aug 12 2021
System logs show the child restarts; no errors are displayed. Logs show that the ramdisk is clearing now.
Looking at Grafana for cp13, when the software OOMs, disk usage drops significantly and immediately, which suggests proper disk cleanup is occurring. This is replicated on cp12 as well.
Latency increases beyond the Varnish cutoff, so Varnish depools everything.
Both cp12 and cp13 OOM'd tonight and didn't restart cleanly. Logs suggest this issue isn't fixed.
Aug 11 2021
Can I please have some context here (just for me, not for anything else)? I must not have been following what happened last time, attempted solutions, etc...
Aug 10 2021
From a review perspective, this is sorted, as this was an accepted risk taken by the previous DSRE. I'm intending to do some reviews in terms of capacity, so I will follow this up with a relevant task/communication once I get onto the traffic side of things.