
Poor Redis performance since ~10:40 8 July
Closed, Resolved (Public)

Description

Several wiki requests I've approved in the past 12 or so hours have thrown exceptions instead of creating the wikis:

  • https://meta.miraheze.org/wiki/Special:RequestWikiQueue/33393#mw-section-comments
  • https://meta.miraheze.org/wiki/Special:RequestWikiQueue/33397#mw-section-comments
  • https://meta.miraheze.org/wiki/Special:RequestWikiQueue/33399#mw-section-comments

List of steps to reproduce (step by step, including full links if applicable):

  • Submit a request (optional but preferred for testing)
  • Go to Special:RequestWikiQueue
  • Approve any request

What happens?:

There's a chance that an "Exception experienced creating the wiki. Error is: Redis server error: socket error on read socket" message will appear instead of "Wiki created".

What should have happened instead?:

"Wiki created" should almost always appear.

Browser information, screenshots and other applicable information:

Links to wiki requests where this has occurred provided above.

Event Timeline

RhinosF1 raised the priority of this task from Normal to Unbreak Now! (Jul 8 2023, 15:21)
RhinosF1 subscribed.

https://grafana.miraheze.org/d/HZGjmu_Zz/redis?orgId=1&from=now-7d&to=now

Performance has been much worse over the last day, and wiki creation is fairly critical. Declaring UBN.

Redis was restarted. That may help.

RhinosF1 lowered the priority of this task from Unbreak Now! to Normal (Jul 8 2023, 16:25)

Nothing is exploding, although commands/sec (and misses) are still elevated.
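For anyone who wants to check the same two numbers without Grafana, here is a minimal sketch that reads them from the text output of the Redis `INFO stats` command. The field names (`instantaneous_ops_per_sec`, `keyspace_hits`, `keyspace_misses`) are standard Redis INFO fields; the helper names and the sample values are made up for illustration and are not taken from the Miraheze servers.

```python
def parse_redis_info(text: str) -> dict[str, str]:
    """Parse the key:value lines of Redis INFO output, skipping '# Section' headers."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def miss_ratio(fields: dict[str, str]) -> float:
    """Fraction of key lookups that missed; 0.0 if there were no lookups."""
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    total = hits + misses
    return misses / total if total else 0.0


# Hypothetical sample, shaped like `redis-cli INFO stats` output.
SAMPLE = """\
# Stats
instantaneous_ops_per_sec:1200
keyspace_hits:900
keyspace_misses:300
"""

fields = parse_redis_info(SAMPLE)
print("ops/sec:", fields["instantaneous_ops_per_sec"])
print("miss ratio:", miss_ratio(fields))
```

Note that the hit and miss counters are cumulative since the last restart, so comparing two samples taken a few minutes apart gives a better picture of the current miss rate than a single reading.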

RhinosF1 renamed this task from "CreateWiki throwing exceptions" to "Poor Redis performance since ~10:40 8 July" (Jul 8 2023, 16:26)
RhinosF1 removed a project: MediaWiki.

Most of the wikis created since Redis was restarted have had no issues, but one created very recently threw an error: https://meta.miraheze.org/wiki/Special:RequestWikiQueue/33416

@Paladox has made changes, and they seem to have solved the issues.

This needs some sort of incident report / summary of what happened.

Another reason why we shouldn't use software that has been abandoned for the last 6 years and doesn't scale in the slightest.

The update I made to jobrunner broke jobchron. I pulled in changes from upstream (starting as new), and something in them broke things.