Investigate database server/cache proxy issues and extreme load times this evening
Closed, Resolved (Public)

Description

Could the SRE Infrastructure team please investigate and report on the possible or probable causes of this evening's database server, cache proxy, and backend fetch errors, as well as the extremely high load times?

Relevant IRC logs link: permalink

Created this task after discussion with @Universal_Omega, who has no access at the moment.

Event Timeline

Dmehus triaged this task as High priority. Mar 21 2021, 02:30
Dmehus created this task.
John claimed this task.
John added a subscriber: John.

Looking into this, the MySQL server crashed because of a memory issue at around 02:02, which matches the timing of the error reports.

Digging deeper, the OOM was caused by an access to inexecutable memory stored in db11's RAM. That access triggered the kernel's hwpoison process, which led to the affected areas of memory being marked as unusable. As a result there was no longer sufficient memory, so the OOM killer kicked in and killed mysql.
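As a rough way to see how much memory the kernel has marked as unusable, here is a minimal sketch. It assumes a Linux host whose kernel has hwpoison support, where /proc/meminfo includes a `HardwareCorrupted` line. The function name is hypothetical and this is not how the figure above was obtained.

```python
def hardware_corrupted_kb(meminfo_text):
    """Return the HardwareCorrupted value in kB, or None if the field is absent.

    meminfo_text is the full text of /proc/meminfo (assumed input format:
    one "Name:   value kB" entry per line).
    """
    for line in meminfo_text.splitlines():
        if line.startswith("HardwareCorrupted:"):
            # Line looks like "HardwareCorrupted:   128 kB"
            return int(line.split()[1])
    return None


# Illustrative sample text, not real data from db11.
sample = "MemTotal:       16384 kB\nHardwareCorrupted:   128 kB\n"
corrupted = hardware_corrupted_kb(sample)
```

On a real host the text would come from reading /proc/meminfo. A non-zero value suggests pages have been taken out of use by hwpoison handling.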

Grafana shows an unusual increase in pgmajfault (major page faults): graph.
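For reference, `pgmajfault` is a cumulative counter that Linux exposes in /proc/vmstat. A minimal sketch of turning two samples of that file into a rate follows. The function names are hypothetical, and how the Grafana graph itself is fed is not stated here, so treat this as an illustration only.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def major_fault_rate(before_text, after_text, interval_seconds):
    """Major page faults per second between two samples of /proc/vmstat."""
    before = parse_vmstat(before_text)["pgmajfault"]
    after = parse_vmstat(after_text)["pgmajfault"]
    return (after - before) / interval_seconds


# Illustrative samples taken 30 seconds apart, not real data from db11.
first = "pgfault 10\npgmajfault 100\n"
second = "pgfault 20\npgmajfault 160\n"
rate = major_fault_rate(first, second, 30)
```

A sustained jump in this rate means processes are repeatedly waiting on pages being read back from disk, which fits a host that is short on usable memory.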

Unless this happens again, the cost of investigating further may significantly outweigh the benefits, as looking into the full health of the RAM would require extensive downtime for all wikis.

In T7008#138619, @John wrote:

Unless this happens again, the cost of investigating further may significantly outweigh the benefits, as looking into the full health of the RAM would require extensive downtime for all wikis.

This isn't the first time this has happened, actually. IIRC, this is the third time.

This is the first time this particular issue has occurred to my knowledge.

In T7008#138643, @John wrote:

This is the first time this particular issue has occurred to my knowledge.

I obviously have no way of knowing whether the downtime was caused by this specific issue, but I first reported this error exactly a week ago (you seem to have replied to it stating the cause, which is unrelated, and your reply went unnoticed on my part), and it has happened several times since then. I just decided not to report those occurrences as they were temporary and resolved themselves quickly.