
Consider how to make us less vulnerable to traffic surges
Closed, Resolved (Public)

Description

We've had a number of issues in the last few weeks caused by traffic surges, with things unmounting and OOMing as a result (T5889 is today's example).

This has caused user-impacting downtime. We need to work out whether there are ways we can tweak config to reduce the risk of OOMs, and to ensure that if things like varnish/gluster disconnect or break, they can be safely and automatically repaired.

Event Timeline

RhinosF1 triaged this task as High priority. Jul 9 2020, 13:07
RhinosF1 created this task.
RhinosF1 created this object in space Restricted Space.
RhinosF1 created this object with visibility "Custom Policy".
RhinosF1 added a subscriber: Paladox.

I had a chat on IRC with you, so I'm assigning this to you in case you have anything to add or want to make public. I'm happy to write the IR for cp6.

There are 3 things that concern me specifically as actionables from the recent incidents:

  • gluster not auto-remounting (there is a task about static disconnecting too much, which @Paladox tried to resolve and might have, but I'd like to see it fix itself)
  • varnish not auto-restarting (now fixed)
  • GDND not auto-depooling. As discussed, see the IRC scrollback from the day for the exact times of the 502s I got.

While traffic surges are a risk we take and one that anyone can fall victim to, I think the above points are actionable to improve resilience.
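To illustrate the first point, here is a rough sketch of what "fix itself" could look like for a dropped mount. It is not what runs on our servers: the function name, the retry count and the example path are all made up for illustration, and it assumes the mount has an entry in /etc/fstab so that a bare `mount <dir>` can find it.

```python
import os
import subprocess
from typing import Callable, Sequence


def ensure_mounted(
    mount_point: str,
    is_mounted: Callable[[str], bool] = os.path.ismount,
    run: Callable[[Sequence[str]], int] = subprocess.call,
    max_attempts: int = 3,
) -> bool:
    """Return True if mount_point ends up mounted, trying to remount if not.

    The check and the command runner are injectable so the logic can be
    exercised without touching a real filesystem.
    """
    for _ in range(max_attempts):
        if is_mounted(mount_point):
            return True
        # Assumption: the mount point is listed in /etc/fstab, so
        # `mount <dir>` alone is enough to bring it back.
        run(["mount", mount_point])
    # Final check after the last remount attempt.
    return is_mounted(mount_point)


# Hypothetical usage from a cron job or timer (path is an example only):
#   if not ensure_mounted("/mnt/example-static"):
#       raise SystemExit("mount still missing, alert a human")
```

Something like this would only cover the case where the mount silently drops; a hung or stale mount may still report as mounted and would need a different check.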

RhinosF1 closed this task as Resolved. Edited Jul 13 2020, 17:49
RhinosF1 shifted this object from the Restricted Space space to the S1 Public space.
RhinosF1 changed the visibility from "Custom Policy" to "Public (No Login Required)".

Discussed again with @Southparkfan and we consider this resolved.

  • Gluster has an open, public task.
  • Varnish now auto-restarts.
  • We think the 502s came from mw*, not cp*.