During the incident today, icinga-miraheze disconnected from IRC.
We should have an alert when this happens so we don’t miss anything important.
Self-assigning, as I will handle this at some point this week.
SigmaBot may be able to ping staff when icinga-miraheze parts the #miraheze channel. Staff, please let me know your thoughts on this.
@John: Is someone going to work on the puppet change?
My plan was to have ZppixBot just watch for when it quits.
My plan was to have ZppixBot set off an alert with ‘@page’ mentioned so staff can stalk.
If this is a task here tagged as monitoring, it’s a puppet change. If it’s not a change you’ll be making in Miraheze, this isn’t the place to track it currently.
It’s probably best hosted outside MH’s network, to reduce the chance that it goes down when we do.
The backup monitoring solution will:
Ping !staff when icinga-miraheze disconnects
Ping meta.miraheze.org & icinga.miraheze.org
Report whether the sites are UP or DOWN
All of this should happen within seconds of icinga-miraheze disconnecting (which could be up to 300s after it stops working).
There will also be a manual !status command to check whether key services are down.
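For illustration, the behaviour described above could be sketched roughly as follows. This is not the bot's actual code: the function names `is_site_up` and `build_alert`, the HTTPS check, the 5-second timeout and the exact alert wording are all assumptions, and the IRC side (detecting the QUIT) is left to whatever framework the bot uses.

```python
from typing import Callable, Optional, Sequence
import urllib.error
import urllib.request

# Nick and hosts are taken from the task; everything else here is hypothetical.
WATCHED_NICK = "icinga-miraheze"
SITES = ("meta.miraheze.org", "icinga.miraheze.org")


def is_site_up(host: str, timeout: float = 5.0) -> bool:
    """Assumed check: the site counts as UP if HTTPS answers with a non-5xx status."""
    try:
        with urllib.request.urlopen(f"https://{host}/", timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # The server answered, so only treat server errors as DOWN.
        return err.code < 500
    except OSError:
        # Covers URLError, connection failures and timeouts.
        return False


def build_alert(
    quit_nick: str,
    check: Callable[[str], bool] = is_site_up,
    sites: Sequence[str] = SITES,
) -> Optional[str]:
    """Return the alert text if the watched nick quit, otherwise None."""
    if quit_nick.lower() != WATCHED_NICK:
        return None
    statuses = [f"{host} is {'UP' if check(host) else 'DOWN'}" for host in sites]
    return f"!staff {WATCHED_NICK} has disconnected. " + ", ".join(statuses)
```

The bot's QUIT handler would call `build_alert(nick)` and post the result to the channel when it is not `None`. Passing `check` as a parameter keeps the message logic testable without touching the network.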
Complete.
SigmaBot will generate an alert including the ping “!sre” (stalk it if you want to) when icinga-miraheze quits. It will then run a short status check to indicate whether icinga and meta are up.
A full status check can be run by SRE, me or Examknow using !status, but it is long because it checks multiple services, so only use it if you have to.
To add to that, if SRE wants a shorter report, they can use !s and the bot will give the total number of alive and dead services. I just felt there needed to be a less spammy option for non-emergencies.
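As a rough idea of what the shorter !s report amounts to, here is a minimal sketch. The function name `short_report`, the input shape (a mapping of service name to up/down) and the output wording are assumptions, not the bot's actual implementation.

```python
from typing import Dict


def short_report(results: Dict[str, bool]) -> str:
    """Collapse per-service results (True = alive) into a one-line count."""
    alive = sum(1 for up in results.values() if up)
    dead = len(results) - alive
    return f"{alive} alive, {dead} dead"
```

For example, `short_report({"meta.miraheze.org": True, "icinga.miraheze.org": False})` gives a single line instead of one line per service, which is the point of the less spammy option.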