
jobrunner1 temporarily was unable to run jobs leading to a critical backlog.
Closed, ResolvedPublic

Description

I see no useful logs, and a restart hasn't helped. It seems to have stopped between 01:00 and 02:00 UK time.

I see a traffic surge during that period.

Any ideas?

Event Timeline

RhinosF1 triaged this task as Unbreak Now! priority.Fri, Jul 31, 04:54
RhinosF1 created this task.
RhinosF1 updated the task description.Fri, Jul 31, 04:58
RobLa added a subscriber: RobLa.Fri, Jul 31, 05:17

It's my understanding from chatting with @RhinosF1 that some of the jobs in the job queue are getting backed up. I'm pretty rusty on my MediaWiki skills, so I'm not yet familiar with which set of jobs is affected, or whether the problem is with something in MediaWiki core or with one of the many Miraheze custom jobs.

We're in the process of asking for help on the #mediawiki channel over on Freenode IRC. I may try to distill some of the logs that @RhinosF1 posted in DM and some private channels we're both in so that we make sense to the Wikimedia folks trying to help us.

> It's my understanding from chatting with @RhinosF1 that some of the jobs in the job queue are getting backed up. I'm pretty rusty on my MediaWiki skills, so I'm not yet familiar with which set of jobs is affected, or whether the problem is with something in MediaWiki core or with one of the many Miraheze custom jobs.

It's anything run by the jobrunner.

RobLa added a comment.Fri, Jul 31, 05:21

This makes me hopeful:

The times above are PDT, which is UTC-7. It's 22:20 PDT right now, so this was just a couple of minutes ago.

Zppix added a subscriber: Zppix.Fri, Jul 31, 05:33

@RobLa it's still failing... my initial guess is that it may be hitting memory limits. Regardless, this needs an immediate fix; I suggest we do manual job runs until then.

A restart of jobrunner1 seems to have stopped it failing. Let's hope runJobs.php clears the backlog and it stays this way.

RhinosF1 lowered the priority of this task from Unbreak Now! to High.Fri, Jul 31, 06:54

My runJobs script was accidentally run on mw7, but load looks fine, so I'm leaving it there to avoid restarting it. I'll move it if it causes an issue.

I do not see any new crashes, but the backlog is still critical and growing.

RhinosF1 renamed this task from jobrunner has stopped running jobs to jobrunner1 temporarily was unable to run Jobs leading to a critical backlog..Fri, Jul 31, 07:41
RhinosF1 renamed this task from jobrunner1 temporarily was unable to run Jobs leading to a critical backlog. to jobrunner1 temporarily was unable to run jobs leading to a critical backlog..
RobLa added a comment.Fri, Jul 31, 08:23

I learned a lot via DM about the not-quite-outage. The most interesting theory I've heard (and I *think* that @RhinosF1 gets credit for this) is that the job for updating global user pages is dying for some reason. However, it could be that some other process (which is trying to be helpful) is killing those processes.

Anyway, it's way past bedtime in this part of the world, and I stopped paying attention to IRC/Discord at least an hour ago, so my understanding is probably stale. I hope y'all figure it out!

@RobLa: I'll talk to @Southparkfan about how we avoid getting here again, but it now looks to be just a case of bringing the backlog under control.

Reception123 closed this task as Resolved.Sat, Aug 1, 04:47
Reception123 claimed this task.

The issue described here has now been resolved and the backlog has been cleared. As an actionable, T5994 has been created, so the discussion of how to prevent this from happening again can continue there.