We are having constant alerts and 50x errors due to high php-fpm child usage.
Agreed with the above, and thanks to all SRE who were on hand to stabilize the issues.
It’s definitely the same line of enquiry in terms of investigation.
Though given the user impact, an incident report should be generated for this.
https://meta.miraheze.org/wiki/Special:IncidentReports/49 I've put in everything I know
@Universal_Omega and I have looked and have no idea why this is happening or how to fix it. Restarts of services and reboots of servers aren't helping. When/if this stabilizes, this task will remain UBN as we really need to look into this.
Yes, I have tried rebooting mw*, which had absolutely no effect either. So yes, +1 to this remaining UBN, as resolving this should be the #1 priority, but I am also out of things I'm able to do here.
There is a bit of a pattern with these outages: both times test101 was also affected, and both times it was at the same time of day (roughly 8:00 PM EDT/0:00 UTC).
https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki?orgId=1&from=1649366728843&to=1649489031546 shows a ten-fold jump in active connections at the same time
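For anyone reading that link later: the `from` and `to` parameters are epoch milliseconds, so the window can be decoded without opening the dashboard. A minimal sketch, assuming only the URL above (the helper name `grafana_window` is made up for illustration and is not part of any Grafana tooling):

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs


def grafana_window(url):
    """Return the (start, end) of a Grafana dashboard link as UTC datetimes."""
    query = parse_qs(urlparse(url).query)
    # Absolute time ranges in the link are epoch milliseconds in `from` and `to`.
    start_ms = int(query["from"][0])
    end_ms = int(query["to"][0])

    def to_utc(ms):
        # Drop the millisecond part; second resolution is enough here.
        return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)

    return to_utc(start_ms), to_utc(end_ms)


url = (
    "https://grafana.miraheze.org/d/dsHv5-4nz/mediawiki"
    "?orgId=1&from=1649366728843&to=1649489031546"
)
start, end = grafana_window(url)
print(start.isoformat(), end.isoformat())
```

That range works out to 2022-04-07 21:25 UTC through 2022-04-09 07:23 UTC, so it spans 0:00 UTC on both 8 and 9 April.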
I've had a quick glance through traffic patterns and there's no spike in actual requests. Something must be being accessed, but there is no hint as to what.
It seemed very strange to me that test101 was affected as well, given that it is not even in working order and is down right now.
A complete outage like the one we saw on 7/8 April has not occurred since database backups were disabled, so I am lowering this from UBN to High as it is not currently impacting us.
Is there a reason this is still a security task? It is not an issue that users can reproduce themselves, so I see no reason why it should be.