Investigate recent outages
Closed, DeclinedPublic

Description

Over the past few days, there have been numerous instances of servers going down with icinga returning "No data received from host".

This included almost every server with any sort of web access, with the exception of cp* and graylog121. So it included mw*, reports121, matomo101, mon111, puppet111, mwtask111, test101, mail121, and phab121.

This leads to temporary, visible, user-facing outages, usually lasting no more than 20-30 seconds. It has been happening fairly frequently lately, sometimes 2-3 times in a day. It self-recovers, usually taking a few minutes to recover fully. Sometimes, however, it recovers fully and then, less than 10 minutes later, the same issues repeat before it recovers again, this time usually staying up.

But the issues have become too frequent to be considered a rare occurrence, and should definitely be investigated ASAP.

Per @Paladox:

i think if everything is affected then it's because of the network

I'm leaving this as UBN, since the outage is still ongoing (affecting only some of the services at this point) and should be investigated ASAP. It can be lowered if necessary.

Event Timeline

Universal_Omega triaged this task as Unbreak Now! priority. Mon, Jun 20, 17:31
Universal_Omega created this task.

Has anyone pulled any MTRs for these periods?

Do we have dates and time periods? Original description is vague and this would be needed to do log diving hardware side.

In T9425#190850, @John wrote:

Do we have dates and time periods? Original description is vague and this would be needed to do log diving hardware side.

Sorry for not providing more information initially.

Today's incident started at 2022-06-20 16:38:04 and fully ended around 2022-06-20 17:26:57 (according to IRC logs from icinga-miraheze), but it had mostly recovered a bit before that final end time.
One of the first times I noticed it was on 2022-06-13, but I can't pull the exact times for that day. I do know it happened 4 times within a 14-hour period spanning 2022-06-13 to 2022-06-14.

I am still looking back in my own logs to see if I can see more of when it happened.
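
For reference, a minimal sketch of how alert timestamps pulled from logs could be grouped into outage windows with start and end times. It assumes the timestamps have already been extracted by hand or by some other script; the function name `group_outage_windows` and the sample timestamps are hypothetical and are not taken from the actual icinga-miraheze logs. The 10-minute gap mirrors the "less than 10 minutes later" recurrence mentioned in the description.

```python
from datetime import datetime, timedelta


def group_outage_windows(alert_times, max_gap=timedelta(minutes=10)):
    """Group alert timestamps into (start, end) windows.

    Two consecutive alerts are treated as part of the same window
    when they are at most `max_gap` apart.
    """
    windows = []
    for ts in sorted(alert_times):
        if windows and ts - windows[-1][1] <= max_gap:
            # Close enough to the previous alert: extend that window.
            windows[-1] = (windows[-1][0], ts)
        else:
            # Too far from the previous alert: start a new window.
            windows.append((ts, ts))
    return windows


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    sample = [
        datetime(2022, 6, 20, 16, 38, 4),
        datetime(2022, 6, 20, 16, 41, 30),
        datetime(2022, 6, 20, 17, 20, 0),
    ]
    for start, end in group_outage_windows(sample):
        print(f"{start} -> {end}")
```

The resulting windows are the kind of date and time ranges that could then be matched against hardware-side logs.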

The switch seems non-responsive to web logins, so my thinking is currently that the switch could be at fault - but would need MTRs to back this up.
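
A rough sketch of how MTR reports could be captured with a timestamp while an outage is in progress, so there is something to compare against later. It assumes `mtr` is installed on the machine running it; the helper names `build_mtr_command` and `capture_mtr` and the target host `example.org` are placeholders, not anything already deployed here.

```python
import shutil
import subprocess
from datetime import datetime, timezone


def build_mtr_command(host, cycles=10):
    """Build an mtr invocation that prints a one-shot report."""
    return [
        "mtr",
        "--report",           # run non-interactively and print a summary
        "--report-wide",      # do not truncate long hostnames
        "--no-dns",           # skip reverse DNS lookups
        "--report-cycles", str(cycles),  # number of probes per hop
        host,
    ]


def capture_mtr(host, cycles=10):
    """Run mtr once and return its report prefixed with a UTC timestamp."""
    if shutil.which("mtr") is None:
        raise RuntimeError("mtr is not installed on this machine")
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    result = subprocess.run(
        build_mtr_command(host, cycles),
        capture_output=True,
        text=True,
        timeout=300,
        check=False,  # keep whatever output exists even on failure
    )
    return f"=== {stamp} {host} ===\n{result.stdout}{result.stderr}"


if __name__ == "__main__":
    # Placeholder target; replace with an affected server.
    print(capture_mtr("example.org", cycles=5))
```

Running this from more than one vantage point during an incident would help show whether loss starts at the switch or further upstream.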

In T9425#190855, @John wrote:

The switch seems non-responsive to web logins, so my thinking is currently that the switch could be at fault - but would need MTRs to back this up.

I've managed to get the web logins to work by setting up an nginx proxy, which I connect to over SOCKS5. You can do this via cloud12 and then it should work. It should be the same as how you set it up for graylog, except that you use cloud12 instead of graylog. I'd prefer to wait until you are around before doing anything. I can reboot via it, but I'm not sure I should do that without your approval.

I would point out that in the survey responses I've seen, many users complain of 503/502 issues, so I definitely think this is something we need to prioritise and fix ASAP.

Because there are no MTRs showing the problem, nor any other evidence of this issue, closing as declined as it is untraceable.

This follows a log trawl at both the software and hardware levels.