redis-server is occasionally killed for OOM
Closed, Resolved · Public

Assigned To

Authored By

	Void
	Jul 11 2021, 21:49

Description

I've been investigating the incomplete wiki creations that result from JobQueue errors, and found that Redis restarted while one such wiki was being created. Looking into it further, I discovered this restart was initiated by:

oom_reaper: reaped process 16298 (redis-server)

We should investigate why this is being done, and hopefully prevent oom_reaper from killing Redis unless absolutely necessary. If that isn't possible, perhaps we should set up a redundant Redis with jobrunner4, but I'm not too certain about the logistics of this.
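For reference, one standard Linux mechanism for making a process less attractive to the OOM killer is its `oom_score_adj` value under `/proc`. The following is a minimal sketch only: the helper name and the `proc_root` parameter are hypothetical, nothing in this task says this was deployed, and lowering the score for one process makes the kernel more likely to pick a different one instead.

```python
from pathlib import Path

OOM_ADJ_MIN, OOM_ADJ_MAX = -1000, 1000  # range accepted by the Linux kernel


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> Path:
    """Write `value` to <proc_root>/<pid>/oom_score_adj (hypothetical helper).

    Negative values make the kernel less likely to pick the process and
    -1000 exempts it entirely. Lowering the value normally needs root.
    """
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError(
            f"oom_score_adj must be in [{OOM_ADJ_MIN}, {OOM_ADJ_MAX}]"
        )
    target = Path(proc_root) / str(pid) / "oom_score_adj"
    target.write_text(f"{value}\n")
    return target
```

Calling `set_oom_score_adj(16298, -500)` as root would apply to the PID from the log line above; the PID changes on every restart, so a real setup would more likely configure this through the service manager.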

Related Objects

Mentioned In: T7801: Segmentation fault in jobrunner
T7139: MediaWiki Capacity Proposal
T7373: Investigate cause of redis server error (socket error on read socket) when CreateWiki Extension creates a wiki
Mentioned Here: T7139: MediaWiki Capacity Proposal
T5105: Investigate and Implement basic Machine Learning concepts for automatic wiki creation
T7338: Investigate cause of wiki being created but not creation farmer log entry being created

Event Timeline

Void triaged this task as High priority. Jul 11 2021, 21:49

Void created this task.

Herald added subscribers: Bukkit, Unknown Object (User), RhinosF1, Reception123. Jul 11 2021, 21:49

We should investigate why this is being done, and hopefully prevent oom_reaper from killing redis unless absolutely necessary.

I'm slightly concerned that this would just lead to something else, equally risky, being killed instead. While we can cope with a short loss of the jobrunner, we probably still need to think about available resources.

If that isn't possible, perhaps we should setup redundant redis with jobrunner4, but I'm not too certain as to the logistics of this.

As far as I'm aware even the rdb servers are active-passive

Redmin subscribed. Jul 12 2021, 07:37

Void moved this task from Incoming to Short Term on the Infrastructure (SRE) board. Jul 13 2021, 22:11

Void mentioned this in T7373: Investigate cause of redis server error (socket error on read socket) when CreateWiki Extension creates a wiki. Jul 16 2021, 03:55

Unknown Object (User) merged a task: T7373: Investigate cause of redis server error (socket error on read socket) when CreateWiki Extension creates a wiki. Jul 16 2021, 06:40

Unknown Object (User) added subscribers: Dmehus, John, NDKilla, • WikiJS.

Jobrunner3 is showing 155 out-of-memory events in the past 24 hours, which killed several processes, including Redis repeatedly.

From what I've observed, this has also been a recurring issue for the past 3-4 months. I didn't realize before that it was so severe and so frequent, but it has likely been the same issue throughout those months.

Void merged a task: T7338: Investigate cause of wiki being created but not creation farmer log entry being created. Jul 17 2021, 01:05

Void claimed this task. Jul 17 2021, 21:43

I have a process running on jobrunner3 that should report the full process information for any process that gets killed by OOM. Hopefully it will tell us more about which processes are using too much memory.
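The reporting process itself is not shown in this task. As a rough illustration of the after-the-fact side of that idea, the sketch below scans kernel log lines for the `oom_reaper` message format quoted in the description and tallies which commands were reaped. The function and class names are hypothetical, and it only sees the PID and command name, not full process information.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, NamedTuple

# Matches kernel messages such as:
#   oom_reaper: reaped process 16298 (redis-server)
REAPED = re.compile(
    r"oom_reaper: reaped process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
)


class Reaped(NamedTuple):
    pid: int
    comm: str


def find_reaped(lines: Iterable[str]) -> Iterator[Reaped]:
    """Yield one record per oom_reaper message found in `lines`."""
    for line in lines:
        match = REAPED.search(line)
        if match:
            yield Reaped(int(match.group("pid")), match.group("comm"))


def count_by_command(lines: Iterable[str]) -> Counter:
    """Count reaped processes per command name."""
    return Counter(record.comm for record in find_reaped(lines))
```

Feeding it the kernel log (for example the lines of a saved `dmesg` dump) gives a quick view of whether Redis or something else is the usual victim.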

Redis server error: LOADING Redis is loading the dataset in memory

corresponds almost exactly with some of the times at which wiki creation was reported to have failed.

That is due to RequestWikiAIJob

#0 /srv/jobrunner/src/RedisJobService.php(260): Redis->connect('51.195.236.220', '6379', 5)

is usually logged about one second before the other Redis error as well.

<11>1 2021-07-10T23:04:29+00:00 jobrunner3.miraheze.org mediawiki - - - {"@timestamp":"2021-07-10T23:04:29.565386+00:00","@version":1,"host":"jobrunner3","message":"RequestWikiAIJob Special: description=Quiero hacer una web donde la gente tenga fácil acceso a mi forma de ver las cosas con respecto a una subcultura, con tal de reflexionar y profundizar, y ver si es posible hacer un cambio positivo, quiero que la información de la wiki sea definitiva y una fuente fácil de consultar al momento de que alguien quiera conocer las doctrinas, no es nada ilegal, solo hablaré de aspectos sociales más que todo. id=19224 requestId=879a8d57aba33b80219792ac namespace=-1 title= (uuid=d3006058972b4685b7ed3ff0f6cee65f,timestamp=1625958259) t=7819 error=JobQueueError: Redis server error: LOADING Redis is loading the dataset in memory\n","type":"mediawiki","channel":"runJobs","level":"ERROR","monolog_level":400,"shard":"c2","wiki":"metawiki","mwversion":"1.36.1","reqId":"879a8d57aba33b80219792ac","cli_argv":"/srv/mediawiki/w/maintenance/runJobs.php --wiki=metawiki --type=RequestWikiAIJob --maxtime=60 --memory-limit=1750M --result=json","job_type":"RequestWikiAIJob","job_duration":7819,"job_error":"JobQueueError: Redis server error: LOADING Redis is loading the dataset in memory\n"}

corresponds exactly with the time it was reported in T7338#153237 for https://meta.miraheze.org/wiki/Special:RequestWikiQueue/19224.
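Matching log entries to reported failure times is easier with the JSON payload pulled out of each line. A minimal sketch, assuming every line has the same shape as the one above (a syslog header followed by a single JSON object); the function names are hypothetical, while the field names are the ones visible in that line.

```python
import json
from datetime import datetime

LOADING_ERROR = "LOADING Redis is loading the dataset in memory"


def parse_job_log(line: str) -> dict:
    """Extract the fields of interest from one runJobs log line.

    Assumes the syslog header contains no "{", so the JSON payload
    starts at the first "{" in the line.
    """
    payload = json.loads(line[line.index("{"):])
    return {
        "time": datetime.fromisoformat(payload["@timestamp"]),
        "host": payload["host"],
        "job_type": payload.get("job_type"),
        "job_error": (payload.get("job_error") or "").strip(),
        "req_id": payload.get("reqId"),
    }


def is_redis_loading(entry: dict) -> bool:
    """True if the job failed because Redis was still loading its dataset."""
    return LOADING_ERROR in entry["job_error"]
```

Filtering a log file with `is_redis_loading` yields the timestamps at which jobs hit a freshly restarted Redis, which can then be compared against the reported failure times.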

I am not 100% positive this is related, but it seems to be. I did assume that RequestWikiAIJob was a possible cause of this. I do know with certainty, however, that the two errors mentioned here are related.

Can we try temporarily disabling RequestWikiAIJob to see if this alleviates the load? Alternatively, could we prevent jobrunner3 from running these types of jobs?

If either of these solves the issue, we may want to consider un-deploying T5105. In either case, we should also probably evaluate whether or not we actually want to move ahead with that, even if resources were not a problem.

Void mentioned this in T7139: MediaWiki Capacity Proposal. Jul 21 2021, 01:08

Effectively stalled on T7139 unless we can prevent jobrunner3 from accepting high-intensity jobs (assuming RequestWikiAIJob is the cause).

Literally a setting in puppet: https://git.io/Jlvs7

Tentatively closing, looks like we've stabilized.

We just got Redis errors on an import. (I'm not entirely sure whether it's an OOM, but the Grafana graphs show an increase in commands/sec and traffic.)

Cc @John

Redis shows 17 hour uptime and memory usage peaks at 2MB. Grafana shows no evidence of an OOM even being a remote possibility here
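How these figures were obtained is not stated; one way to get comparable numbers is the text returned by the Redis `INFO` command, which includes the `uptime_in_seconds` and `used_memory_peak` fields. A minimal sketch that parses that text, with hypothetical function names:

```python
def parse_redis_info(text: str) -> dict:
    """Parse the `key:value` lines of Redis INFO output into a dict.

    Section headers (lines starting with "#") and blank lines are skipped.
    """
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def summarize(info: dict) -> str:
    """Report uptime in hours and peak memory in MB."""
    uptime_h = int(info["uptime_in_seconds"]) / 3600
    peak_mb = int(info["used_memory_peak"]) / (1024 * 1024)
    return f"uptime {uptime_h:.1f} h, peak memory {peak_mb:.1f} MB"
```

For an INFO dump containing `uptime_in_seconds:61200` and `used_memory_peak:2097152`, `summarize` returns `uptime 17.0 h, peak memory 2.0 MB`. An uptime of that length also rules out a recent restart, which is the useful signal when checking for OOM kills.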

In T7626#157137, @John wrote:

Redis shows 17 hour uptime and memory usage peaks at 2MB. Grafana shows no evidence of an OOM even being a remote possibility here

Any idea as to what could be the cause?

The one I got with import was the "Redis server error: socket error on read socket" one, and seconds later it was reported that CreateWiki was giving JobQueueConnectionError.

Also, I just got it again now, so the exact error is:

JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket

Can confirm I also just got this (it killed my imports):

JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket

#0 /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php(240): JobQueueRedis->handleErrorAndMakeException(Object(RedisConnRef), Object(RedisException))
#1 /srv/mediawiki/w/includes/jobqueue/JobQueue.php(365): JobQueueRedis->doBatchPush(Array, 0)
#2 /srv/mediawiki/w/includes/jobqueue/JobQueue.php(335): JobQueue->batchPush(Array, 0)
#3 /srv/mediawiki/w/includes/jobqueue/JobQueueGroup.php(174): JobQueue->push(Array)
#4 /srv/mediawiki/w/includes/jobqueue/JobQueueGroup.php(213): JobQueueGroup->push(Array)
#5 /srv/mediawiki/w/includes/page/WikiPage.php(3850): JobQueueGroup->lazyPush(Array)
#6 /srv/mediawiki/w/includes/Storage/DerivedPageDataUpdater.php(1618): WikiPage::onArticleEdit(Object(Title), Object(MediaWiki\Revision\RevisionStoreRecord), Array)
#7 /srv/mediawiki/w/includes/page/WikiPage.php(2381): MediaWiki\Storage\DerivedPageDataUpdater->doUpdates()
#8 /srv/mediawiki/w/includes/import/ImportableOldRevisionImporter.php(226): WikiPage->doEditUpdates(Object(MediaWiki\Revision\RevisionStoreRecord), Object(User), Array)
#9 /srv/mediawiki/w/includes/import/WikiRevision.php(670): ImportableOldRevisionImporter->import(Object(WikiRevision))
#10 /srv/mediawiki/w/includes/import/WikiImporter.php(429): WikiRevision->importOldRevision()
#11 /srv/mediawiki/w/maintenance/importDump.php(201): WikiImporter->importRevision(Object(WikiRevision))
#12 /srv/mediawiki/w/includes/import/WikiImporter.php(571): BackupReader->handleRevision(Object(WikiRevision), Object(WikiImporter))
#13 /srv/mediawiki/w/includes/import/WikiImporter.php(1059): WikiImporter->revisionCallback(Object(WikiRevision))
#14 /srv/mediawiki/w/includes/import/WikiImporter.php(926): WikiImporter->processRevision(Array, Array)
#15 /srv/mediawiki/w/includes/import/WikiImporter.php(861): WikiImporter->handleRevision(Array)
#16 /srv/mediawiki/w/includes/import/WikiImporter.php(678): WikiImporter->handlePage()
#17 /srv/mediawiki/w/maintenance/importDump.php(353): WikiImporter->doImport()
#18 /srv/mediawiki/w/maintenance/importDump.php(286): BackupReader->importFromHandle(Resource id #929)
#19 /srv/mediawiki/w/maintenance/importDump.php(130): BackupReader->importFromFile('/home/reception...')
#20 /srv/mediawiki/w/maintenance/doMaintenance.php(112): BackupReader->execute()
#21 /srv/mediawiki/w/maintenance/importDump.php(358): require_once('/srv/mediawiki/...')
#22 {main}
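The trace shows a single failed push to the job queue aborting the whole import. As a general illustration of how callers can tolerate brief Redis interruptions, here is a sketch of retrying a transient failure with exponential backoff. It is not MediaWiki's code: `TransientQueueError` and `with_retries` are hypothetical names, and the delays are arbitrary.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientQueueError(Exception):
    """Hypothetical stand-in for a short-lived queue connection failure."""


def with_retries(
    op: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `op`, retrying on TransientQueueError with doubling delays.

    The last failure is re-raised once `attempts` calls have been made.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return op()
        except TransientQueueError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

With the defaults, the waits between tries are 0.5, 1 and 2 seconds, which would cover a Redis restart of a few seconds but not a longer outage.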

The later Redis issues appear unrelated to the original task. I'd suggest we open a new task for whatever this new problem is.

Void mentioned this in T7801: Segmentation fault in jobrunner. Aug 13 2021, 20:42
