
redis-server is occasionally killed for OOM
Closed, Resolved, Public

Description

I've been investigating the incomplete wiki creations that result from JobQueue errors. I found that Redis restarted while one such wiki was being created. Looking into it further, I discovered this restart was initiated by:

oom_reaper: reaped process 16298 (redis-server)

We should investigate why this is being done, and hopefully prevent oom_reaper from killing redis unless absolutely necessary. If that isn't possible, perhaps we should set up redundant redis with jobrunner4, but I'm not too certain as to the logistics of this.
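One way to make the kernel less likely to pick redis-server is to lower its OOM score adjustment. The following is a minimal sketch, assuming a Linux host; the helper names and the `proc_root` parameter are hypothetical and exist only to make the idea testable, and nothing here was decided in this task. Writing a lower value than the current one normally requires root.

```python
from pathlib import Path


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> None:
    """Write an OOM score adjustment for `pid`.

    The kernel accepts values from -1000 (never pick this process)
    to 1000 (pick this process first).
    """
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    target = Path(proc_root) / str(pid) / "oom_score_adj"
    target.write_text(f"{value}\n")


def get_oom_score_adj(pid: int, proc_root: str = "/proc") -> int:
    """Read back the current adjustment for `pid`."""
    target = Path(proc_root) / str(pid) / "oom_score_adj"
    return int(target.read_text().strip())
```

This only changes which process the kernel prefers to kill; it does not free any memory, so another process on the host would be chosen instead.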

Event Timeline

Void triaged this task as High priority. Jul 11 2021, 21:49
Void created this task.

We should investigate why this is being done, and hopefully prevent oom_reaper from killing redis unless absolutely necessary.

I'm slightly concerned that this would just lead to something else, equally risky, being killed. While we can cope with a short loss of the jobrunner, we probably still need to think about available resources.

If that isn't possible, perhaps we should set up redundant redis with jobrunner4, but I'm not too certain as to the logistics of this.

As far as I'm aware, even the rdb servers are active-passive.

Void raised the priority of this task from High to Unbreak Now!. Jul 17 2021, 00:57

Jobrunner3 is showing 155 out-of-memory events in the past 24 hours, which killed several processes, including redis repeatedly.
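A count like this can be broken down by process to see which ones are killed most often. The sketch below assumes the kernel log contains lines in the form quoted in the description (`oom_reaper: reaped process 16298 (redis-server)`); the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the kernel message quoted in the task description, e.g.
# "oom_reaper: reaped process 16298 (redis-server)".
OOM_REAPED = re.compile(r"oom_reaper: reaped process (\d+) \(([^)]+)\)")


def count_reaped(lines):
    """Return a Counter mapping process name to the number of times it was reaped."""
    counts = Counter()
    for line in lines:
        match = OOM_REAPED.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

For example, the output of `journalctl -k --since "24 hours ago"` could be piped in and passed to `count_reaped(sys.stdin)`.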

Unknown Object (User) added a comment. Jul 17 2021, 01:03

From what I've observed, this has been a recurring issue for the past 3-4 months. I didn't realize before that it was so serious or so frequent, but it was likely the same underlying issue throughout those months.

I have a process running on jobrunner3 that should report the full process information for any process that gets killed by OOM. Hopefully it will tell us more about which processes are using too much memory.
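The reporting process itself is not shown in this task. As a rough illustration of the kind of data such a tool could collect, the sketch below lists the processes with the largest resident memory. It assumes the Linux `/proc/<pid>/status` format (`Name:` and `VmRSS:` fields); the function names and the `proc_root` parameter are hypothetical.

```python
import os


def parse_status(text: str) -> dict:
    """Extract the process name and resident set size (kB) from the
    contents of a /proc/<pid>/status file."""
    info = {"name": None, "rss_kb": 0}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            info["name"] = value.strip()
        elif key == "VmRSS":
            # The value looks like "   204800 kB"; kernel threads have no VmRSS line.
            info["rss_kb"] = int(value.split()[0])
    return info


def top_memory_users(limit: int = 5, proc_root: str = "/proc") -> list:
    """Return the `limit` processes with the largest resident memory."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # Only numeric directories are processes.
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                info = parse_status(handle.read())
        except OSError:
            continue  # The process exited while we were scanning.
        info["pid"] = int(entry)
        results.append(info)
    return sorted(results, key=lambda item: item["rss_kb"], reverse=True)[:limit]
```

Running `top_memory_users()` periodically and logging the result would give a picture of memory use in the minutes before an OOM kill.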

Unknown Object (User) added a comment. Jul 19 2021, 18:05

Redis server error: LOADING Redis is loading the dataset in memory

corresponds almost exactly with some of the reported times at which wiki creation failed.

That is due to RequestWikiAIJob

#0 /srv/jobrunner/src/RedisJobService.php(260): Redis->connect('51.195.236.220', '6379', 5)

is usually logged about one second before the other Redis error as well.

<11>1 2021-07-10T23:04:29+00:00 jobrunner3.miraheze.org mediawiki - - - {"@timestamp":"2021-07-10T23:04:29.565386+00:00","@version":1,"host":"jobrunner3","message":"RequestWikiAIJob Special: description=Quiero hacer una web donde la gente tenga fácil acceso a mi forma de ver las cosas con respecto a una subcultura, con tal de reflexionar y profundizar, y ver si es posible hacer un cambio positivo, quiero que la información de la wiki sea definitiva y una fuente fácil de consultar al momento de que alguien quiera conocer las doctrinas, no es nada ilegal, solo hablaré de aspectos sociales más que todo. id=19224 requestId=879a8d57aba33b80219792ac namespace=-1 title= (uuid=d3006058972b4685b7ed3ff0f6cee65f,timestamp=1625958259) t=7819 error=JobQueueError: Redis server error: LOADING Redis is loading the dataset in memory\n","type":"mediawiki","channel":"runJobs","level":"ERROR","monolog_level":400,"shard":"c2","wiki":"metawiki","mwversion":"1.36.1","reqId":"879a8d57aba33b80219792ac","cli_argv":"/srv/mediawiki/w/maintenance/runJobs.php --wiki=metawiki --type=RequestWikiAIJob --maxtime=60 --memory-limit=1750M --result=json","job_type":"RequestWikiAIJob","job_duration":7819,"job_error":"JobQueueError: Redis server error: LOADING Redis is loading the dataset in memory\n"}

corresponds exactly with the time it was reported in T7338#153237 for https://meta.miraheze.org/wiki/Special:RequestWikiQueue/19224.
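Matching log entries against reported failure times can be partly automated. The sketch below assumes each line has the shape shown above, a syslog header followed by a single JSON object with `@timestamp`, `job_type` and `job_error` keys; the function names are hypothetical.

```python
import json


def parse_mediawiki_log(line: str) -> dict:
    """Decode the JSON payload of a syslog-wrapped MediaWiki log line."""
    start = line.index("{")  # The JSON object follows the syslog header.
    return json.loads(line[start:])


def job_failures(lines, needle="LOADING"):
    """Yield (timestamp, job_type, error) for jobs whose error mentions `needle`."""
    for line in lines:
        try:
            record = parse_mediawiki_log(line)
        except ValueError:
            continue  # No JSON payload on this line, or it is malformed.
        error = record.get("job_error", "")
        if needle in error:
            yield record.get("@timestamp"), record.get("job_type"), error.strip()
```

The timestamps this yields can then be compared by hand with the times given in the wiki request reports.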

I am not 100% positive this is related, but it seems to be. I had assumed that RequestWikiAIJob was a possible cause of this. I do know with certainty, however, that the two errors mentioned here are related.

Can we try temporarily disabling RequestWikiAIJob to see if this alleviates the load? Alternatively, could we prevent jobrunner3 from running these types of jobs?

If either of these solves the issue, we may want to consider un-deploying T5105. In either case, we should also probably evaluate whether or not we actually want to move ahead with that, even if resources were not a problem.

Void lowered the priority of this task from Unbreak Now! to High. Jul 21 2021, 02:06

Effectively stalled on T7139 unless we can prevent jobrunner3 from accepting high-intensity jobs (assuming RequestWikiAIJob is the cause).

Tentatively closing, looks like we've stabilized.

RhinosF1 reopened this task as Open. Edited Aug 13 2021, 07:58

We just got redis errors on an import. (Not entirely sure whether it's an OOM, but the Grafana graphs show an increase in commands/sec and traffic.)

Cc @John

Redis shows 17-hour uptime and memory usage peaks at 2 MB. Grafana shows no evidence of an OOM even being a remote possibility here.
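Uptime and peak memory figures like these can be read from the output of the Redis `INFO` command. The sketch below parses that output and summarizes the `uptime_in_seconds` and `used_memory_peak` fields; the function names are hypothetical, and how the figures above were actually obtained is not stated in this task.

```python
def parse_redis_info(text: str) -> dict:
    """Parse the `key:value` lines returned by the Redis INFO command."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and section headers such as "# Server".
        key, _, value = line.partition(":")
        info[key] = value
    return info


def summarize(info: dict) -> str:
    """Format uptime in hours and peak memory in MB (1 MB = 1024 * 1024 bytes)."""
    hours = int(info["uptime_in_seconds"]) / 3600
    peak_mb = int(info["used_memory_peak"]) / (1024 * 1024)
    return f"uptime {hours:.1f} h, peak memory {peak_mb:.1f} MB"
```

The text to parse can be captured with `redis-cli INFO`. A short uptime relative to the host's uptime would indicate a recent restart.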

In T7626#157137, @John wrote:

Redis shows 17-hour uptime and memory usage peaks at 2 MB. Grafana shows no evidence of an OOM even being a remote possibility here.

Any idea as to what could be the cause?

Unknown Object (User) added a comment. Edited Aug 13 2021, 08:17

The error I got during the import was "Redis server error: socket error on read socket", and seconds later CreateWiki was reported to be giving JobQueueConnectionError.

Also, I just got it again now, so the exact error is:

JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket

Can confirm I also just got this (it killed my imports):

JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket

#0 /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php(240): JobQueueRedis->handleErrorAndMakeException(Object(RedisConnRef), Object(RedisException))
#1 /srv/mediawiki/w/includes/jobqueue/JobQueue.php(365): JobQueueRedis->doBatchPush(Array, 0)
#2 /srv/mediawiki/w/includes/jobqueue/JobQueue.php(335): JobQueue->batchPush(Array, 0)
#3 /srv/mediawiki/w/includes/jobqueue/JobQueueGroup.php(174): JobQueue->push(Array)
#4 /srv/mediawiki/w/includes/jobqueue/JobQueueGroup.php(213): JobQueueGroup->push(Array)
#5 /srv/mediawiki/w/includes/page/WikiPage.php(3850): JobQueueGroup->lazyPush(Array)
#6 /srv/mediawiki/w/includes/Storage/DerivedPageDataUpdater.php(1618): WikiPage::onArticleEdit(Object(Title), Object(MediaWiki\Revision\RevisionStoreRecord), Array)
#7 /srv/mediawiki/w/includes/page/WikiPage.php(2381): MediaWiki\Storage\DerivedPageDataUpdater->doUpdates()
#8 /srv/mediawiki/w/includes/import/ImportableOldRevisionImporter.php(226): WikiPage->doEditUpdates(Object(MediaWiki\Revision\RevisionStoreRecord), Object(User), Array)
#9 /srv/mediawiki/w/includes/import/WikiRevision.php(670): ImportableOldRevisionImporter->import(Object(WikiRevision))
#10 /srv/mediawiki/w/includes/import/WikiImporter.php(429): WikiRevision->importOldRevision()
#11 /srv/mediawiki/w/maintenance/importDump.php(201): WikiImporter->importRevision(Object(WikiRevision))
#12 /srv/mediawiki/w/includes/import/WikiImporter.php(571): BackupReader->handleRevision(Object(WikiRevision), Object(WikiImporter))
#13 /srv/mediawiki/w/includes/import/WikiImporter.php(1059): WikiImporter->revisionCallback(Object(WikiRevision))
#14 /srv/mediawiki/w/includes/import/WikiImporter.php(926): WikiImporter->processRevision(Array, Array)
#15 /srv/mediawiki/w/includes/import/WikiImporter.php(861): WikiImporter->handleRevision(Array)
#16 /srv/mediawiki/w/includes/import/WikiImporter.php(678): WikiImporter->handlePage()
#17 /srv/mediawiki/w/maintenance/importDump.php(353): WikiImporter->doImport()
#18 /srv/mediawiki/w/maintenance/importDump.php(286): BackupReader->importFromHandle(Resource id #929)
#19 /srv/mediawiki/w/maintenance/importDump.php(130): BackupReader->importFromFile('/home/reception...')
#20 /srv/mediawiki/w/maintenance/doMaintenance.php(112): BackupReader->execute()
#21 /srv/mediawiki/w/maintenance/importDump.php(358): require_once('/srv/mediawiki/...')
#22 {main}

The later redis issues appear unrelated to the original task. I'd suggest we open a new task for whatever this new problem is.