Alert when icinga-miraheze disconnects from IRC
Closed, Resolved (Public)

Description

During the incident today, icinga-miraheze disconnected from IRC.

We should have an alert when this happens so we don’t miss anything important.

Self-assigning, as I will handle this at some point this week.

Event Timeline

RhinosF1 triaged this task as Normal priority. Apr 21 2020, 13:28
RhinosF1 created this task.

SigmaBot may be able to ping staff when icinga-miraheze parts the #miraheze channel. Staff please let me know your thoughts on this.

@John: Is someone going to work on the puppet change?

My plan was to have ZppixBot just watch for when it quits.

We already have redundant monitoring.

SigmaBot may be able to ping staff when icinga-miraheze parts the #miraheze channel. Staff please let me know your thoughts on this.

My plan was to have ZppixBot set off an alert with ‘@page’ mentioned so staff can stalk.

We already have redundant monitoring.

Did it alert?

If this is a task here tagged as monitoring, it’s a puppet change. If it’s not a change you’ll be making in Miraheze, this isn’t the place to track it currently.

In T5453#106269, @John wrote:

If this is a task here tagged as monitoring, it’s a puppet change. If it’s not a change you’ll be making in Miraheze, this isn’t the place to track it currently.

It’s probably best hosted outside MH’s network, to reduce the chance that it goes down when we do.

This is true about all our monitoring. But we can’t afford such liberties.

The backup monitoring solution will:
Ping !staff when icinga-miraheze disconnects
Ping meta.miraheze.org & icinga.miraheze.org
Report whether the sites are UP or DOWN

All of this should happen within seconds of icinga failing (which could be up to 300s after it stops working).
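
For anyone following along, here is a rough sketch of how the quit alert and short status check described above could fit together. It is not the bot’s actual code: the function names (`is_up`, `build_alert`, `on_quit`), the use of HTTPS requests as the "ping", and the message wording are all assumptions for illustration. Only the nick, the two hostnames, the !staff ping and the UP/DOWN wording come from this task.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

# Hostnames are from the task; checking them over HTTPS is an assumption.
SITES = ["https://meta.miraheze.org", "https://icinga.miraheze.org"]
WATCHED_NICK = "icinga-miraheze"


def is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors, DNS failures, timeouts and malformed URLs count as DOWN.
        return False


def build_alert(nick: str, probe: Callable[[str], bool] = is_up) -> str:
    """Build the alert line: staff ping plus an UP/DOWN note per site."""
    parts = [f"{site} is {'UP' if probe(site) else 'DOWN'}" for site in SITES]
    return f"!staff {nick} has disconnected from IRC. " + "; ".join(parts)


def on_quit(nick: str, probe: Callable[[str], bool] = is_up) -> Optional[str]:
    """Hypothetical quit handler: return an alert only for the watched nick."""
    if nick.lower() != WATCHED_NICK.lower():
        return None
    return build_alert(nick, probe)
```

The `probe` argument only exists so the check can be swapped out when testing without touching the network; a real bot would wire `on_quit` to whatever QUIT event hook its IRC framework provides.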

There will also be a manual !status command to check whether key services are down.

RhinosF1 closed this task as Resolved. Edited Apr 21 2020, 16:24

Complete.

SigmaBot will generate an alert including the ping “!sre” (stalk if you want to) when icinga-miraheze quits. It will run a short status check to indicate whether icinga and meta are up.

SRE, Examknow or I can run a full status check using !status, but the output is long because it checks multiple services, so only use it if you have to.

To add to that, if SRE wants a shorter report, they can use !s and the bot will give the total number of alive and dead services. I just felt there needed to be a less spammy option for non-emergencies.
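
To illustrate the difference between the two commands, here is a minimal sketch, assuming the bot already holds a mapping of service name to up/down result. The function names, the service names in the usage example and the exact output wording are hypothetical and not taken from the bot.

```python
from typing import Dict, List


def full_report(results: Dict[str, bool]) -> List[str]:
    """One line per service, as a long !status style report might give."""
    return [f"{name}: {'UP' if ok else 'DOWN'}" for name, ok in results.items()]


def short_report(results: Dict[str, bool]) -> str:
    """A single summary line with totals, as a short !s style report might give."""
    alive = sum(1 for ok in results.values() if ok)
    dead = len(results) - alive
    return f"{alive} alive, {dead} dead"


# Usage with made-up service names and results:
example = {"meta": True, "icinga": False, "mail": True}
print(short_report(example))
for line in full_report(example):
    print(line)
```

The short form stays at one line however many services are checked, which is what makes it the less noisy option in a busy channel.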