Grafana dashboard: https://grafana.miraheze.org/d/-rJPCPJnz/infrastructure-slos
- [x] Bastion
- [x] **SLO**: Availability of SSH services to be at least **99.5**%. **SLI**: Service uptime.
- [ ] Cache Proxies
- [ ] **SLO**: Errors account for less than **x**% of requests. **SLI**: NGINX logs, where errors are responses with HTTP status codes in the 4xx and 5xx ranges.
- [x] **SLO**: Availability of **75**% of backends is at least **99**%. **SLI**: Varnish backend monitoring shows at least 75% of backends are up.
- [x] Cloud
- [x] **SLO**: Availability of cloud infrastructure is at least **99.5**%. **SLI**: Uptime of servers.
- [ ] DNS
- [x] **SLO**: Availability of DNS services to be at least **99.5**%. **SLI**: Service uptime.
- [x] **SLO**: Errors must be less than **0.5**% of total DNS requests. **SLI**: Requests that do not end with NOERROR, divided by total requests. This is to monitor spikes.
- [ ] **SLO**: Latency for DNS lookups to be below **x**ms. **SLI**: DNS lookup metrics, p95? p99?
- [x] Elasticsearch
- [x] **SLO**: Availability of service to be at least **99.5**%. **SLI**: Service uptime.
- [ ] Gluster
- [ ] **SLO**: Availability of Gluster read-total time is at least **x**%. **SLI**: Read-write availability.
- [ ] Graylog
- [x] **SLO**: Availability of Graylog is at least **99.5**%. **SLI**: Service uptime (Graylog Sidecar?)
- [ ] **SLO**: Error rate for log indexing is less than **x**%. **SLI**: org.graylog2.shared.messageq.MessageQueueWriter.failed-write-attempts metric?
- [ ] **SLO**: Latency for log indexing is less than **x**ms. **SLI**: org.graylog2.shared.buffers.ProcessBuffer.parseTime metric?
- [x] Mail
- [x] **SLO**: Availability of Mail servers to be at least **99.5**%. **SLI**: Ingestion service uptime.
- [x] **SLO**: Error rate for sending mail is below **1**%. **SLI**: Error-related metrics - receiving/sending?
- [x] **SLO**: Latency of message delivery is below **30** seconds. **SLI**: Average age of messages.
- [x] MariaDB
- [x] **SLO**: Availability of MariaDB is at least **99.5**%. **SLI**: Monitoring of MariaDB's accessibility.
- [x] **SLO**: Error rates for access are below **5**%. **SLI**: Connection failure error rates with respect to total connections.
- [x] LDAP
- [x] **SLO**: Availability of LDAP to be at least **99.5**%. **SLI**: Service uptime.
- [ ] Phabricator
- [x] **SLO**: Availability of Phabricator to be at least **99.5**%. **SLI**: Site uptime.
- [ ] **SLO**: Latency of Phabricator to be below **x**ms. **SLI**: Grafana Prometheus blackbox exporter metrics over p95/p99?
- [ ] Puppet
- [ ] **SLO**: Availability of Puppet Server is to be at least **x**%. **SLI**: Service uptime.
- [ ] **SLO**: Latency of external HTTP communications for find facts is to be below **x**ms. **SLI**: Grafana metrics for find facts latency over p95/p99?
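
To get a feel for what the targets above mean in practice, here is a minimal Python sketch of the three SLI calculations the list relies on: downtime allowed by an availability SLO, an error percentage, and a latency percentile. The function names, the 30-day window, and the sample numbers are assumptions made for illustration; they are not taken from the Grafana dashboard or any existing tooling, and the percentile uses the simple nearest-rank method, which may differ from what a metrics backend reports.

```python
import math


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime (in minutes) an availability SLO allows over the window.

    The 30-day default window is an assumption, not a documented choice.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


def error_ratio_percent(bad: int, total: int) -> float:
    """Errors as a percentage of all events; 0.0 when there was no traffic."""
    if total == 0:
        return 0.0
    return 100.0 * bad / total


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (e.g. p=95) of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # A 99.5% availability SLO over 30 days leaves 216 minutes of downtime.
    print(error_budget_minutes(99.5))

    # Hypothetical DNS counters: 3 non-NOERROR responses out of 1000 requests.
    print(error_ratio_percent(3, 1000) < 0.5)

    # Hypothetical latency samples in milliseconds.
    latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print(percentile(latencies_ms, 95))
```

The same helpers could be pointed at whichever counters end up backing each SLI, which may help when picking values for the remaining **x** placeholders.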