
db121 frequently OOMs
Open, High, Public

Description

I hesitated to make this task high priority, given that there are already so many high-priority tasks and that the OOMs aren't that frequent. However, the issue isn't stopping, and some users appear to be worried; given the recent db141 incident, it's not a good image to have another db server go down every once in a while. As far as I'm aware from looking at SAL, it OOM'd on 10 December, 22 November, 13 November, and 8 November.

Event Timeline

Reception123 created this task.

There's minimal opportunity to grow memory on db121. As far as I know, the cause is likely parsercache, which would mean the easiest fix is to reduce the amount of caching on the MW side.
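Reducing caching on the MW side would most likely mean lowering the parser cache retention window; $wgParserCacheExpireTime is the relevant core setting. A minimal sketch, with an illustrative (not recommended or tested) value:

```php
# LocalSettings.php — sketch of reducing parser cache retention.
# MediaWiki's default is 86400 seconds (24 hours); the 21600 value
# below is illustrative only, not a tuned recommendation.
$wgParserCacheExpireTime = 21600; // 6 hours
```

A shorter expiry shrinks the steady-state size of the parsercache tables at the cost of more re-parses.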

In T10117#203977, @John wrote:

There's minimal opportunity to grow memory on db121. As far as I know, the cause is likely parsercache, which would mean the easiest fix is to reduce the amount of caching on the MW side.

I have also mentioned a couple of times that the cause is parsercache.

I was thinking, however: what if we moved parsercache to, say, db141, and potentially increased memory on db141, where we have more available on cloud14? Another option I was considering is investing in a smaller db server, on cloud14 for instance, dedicated to parsercache, so that it going down would not also bring down MediaWiki wikis (a configuration sketch follows below).

I am not certain, but decreasing caching on the MW side also seems less than ideal to me; caching less just doesn't seem like the preferred route, since caching more lets us keep more and benefits overall performance.
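For the dedicated-server option, MediaWiki core already supports routing the parser cache to a separate database backend via $wgObjectCaches and SqlBagOStuff. A minimal sketch, assuming a hypothetical host dbpc1.example.net and placeholder credentials:

```php
# LocalSettings.php — sketch of a parser cache on its own DB server.
# The host and credentials below are hypothetical placeholders.
$wgObjectCaches['parsercache-db'] = [
    'class'  => SqlBagOStuff::class,
    'server' => [
        'type'     => 'mysql',
        'host'     => 'dbpc1.example.net', // hypothetical dedicated host
        'user'     => 'parsercache',       // placeholder credentials
        'password' => 'changeme',
        'dbname'   => 'parsercache',
    ],
];
# Route the parser cache to that backend instead of the main cluster.
$wgParserCacheType = 'parsercache-db';
```

With this split, an OOM on the parsercache host would only degrade performance (pages re-parse on demand) rather than take the wikis down with it.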

Once the cloud13 reboots happen, a request for a smaller db server could be considered; moving parsercache to another existing server is also an option.

Re-assigning, as the action identified above is one for the MediaWiki team and not Infrastructure.

Noting that the last OOM was on January 4.