Grafana dashboard: https://grafana.miraheze.org/d/-rJPCPJnz/infrastructure-slos
- Bastion
- SLO: Availability of SSH services to be at least 99.5%. SLI: Service uptime.
- Cache Proxies
- SLO: Errors account for less than 7.5% of requests. SLI: NGINX logs; errors are HTTP status codes in the 4xx and 5xx ranges.
- SLO: Availability of 75% of backends is at least 99%. SLI: Varnish backend monitoring shows at least 75% of backends are up.
- Cloud
- SLO: Availability of cloud infrastructure is at least 99.5%. SLI: Uptime of servers.
- DNS
- SLO: Availability of DNS services to be at least 99.5%. SLI: Service uptime.
- SLO: Errors must be less than 0.5% of total DNS requests. SLI: Requests that do not end with NOERR, divided by total requests. This is to monitor spikes.
- SLO: Latency for DNS lookups to be below 5ms. SLI: DNS lookup metrics, p95? p99?
- ElasticSearch
- SLO: Availability of service to be at least 99.5%. SLI: Service uptime.
- Gluster
- SLO: Availability of Gluster read-total time is at least x%. SLI: Read-write availability.
- Graylog
- SLO: Availability of Graylog is at least 99.5%. SLI: Service uptime (Graylog sidecar?)
- SLO: Errors for log indexing are less than 0.5%. SLI: org.graylog2.shared.messageq.MessageQueueWriter.failed-write-attempts metric?
- SLO: Latency for log indexing is less than 5ms. SLI: org.graylog2.shared.buffers.ProcessBuffer.parseTime metric?
- Mail
- SLO: Availability of Mail servers to be at least 99.5%. SLI: Ingestion service uptime.
- SLO: Errors for sending mail are below 1%. SLI: Error-related metrics - receiving/sending?
- SLO: Latency of message delivery is below 30 seconds. SLI: Average age of messages.
- MariaDB
- SLO: Availability of MariaDB is at least 99.5%. SLI: Monitoring of MariaDB's accessibility.
- SLO: Error rates for access are below 5%. SLI: Connection failure error rates with respect to total connections.
- LDAP
- SLO: Availability of LDAP to be at least 99.5%. SLI: Service uptime.
- Phabricator
- SLO: Availability of Phabricator to be at least 99.5%. SLI: Site uptime.
- SLO: Latency of Phabricator to be below 5s. SLI: Grafana Prometheus blackbox exporter metrics over p95/p99?
- Puppet
- SLO: Availability of Puppet Server is to be at least 99.5%. SLI: Service uptime.
- SLO: Latency of external HTTP communications for find facts is to be below 30ms. SLI: Grafana metrics for find facts latency over p95/p99?
- Swift
- SLO: Availability of Swift to be at least 99.5%. SLI: Service uptime calculated by dispersion.
- SLO: Errors for Swift requests to be less than 1% of all requests. SLI: Server 500 errors against total Swift requests.
- SLO: Latency for Swift Time To First Byte to be less than 1s. SLI: TTFB for Object Servers.
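
The three kinds of objective used above (availability, error ratio, latency percentile) can be checked with simple arithmetic. The following is a minimal sketch, not the project's actual tooling: all function names are hypothetical, the 30-day evaluation window is an assumption (the list does not state a window), and the nearest-rank percentile is only one possible answer to the open "p95? p99?" question.

```python
import math


def availability_percent(up_seconds: float, total_seconds: float) -> float:
    """Availability SLI: share of the window in which the service was up."""
    return 100.0 * up_seconds / total_seconds


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window.

    The 30-day default is an assumption, not something the list specifies.
    """
    window_minutes = window_days * 24 * 60
    return (100.0 - slo_percent) / 100.0 * window_minutes


def error_percent(error_count: int, total_count: int) -> float:
    """Error-ratio SLI: errors as a percentage of all requests."""
    if total_count == 0:
        return 0.0  # no traffic means no observed errors
    return 100.0 * error_count / total_count


def percentile(samples: list[float], p: int) -> float:
    """Latency SLI using the nearest-rank method (p is a whole percent)."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # A 99.5% availability SLO over an assumed 30-day window.
    print(f"Budget: {error_budget_minutes(99.5):.0f} minutes of downtime")

    # Cache proxy style check: 30 errors out of 1000 requests against 7.5%.
    print("Error SLO met:", error_percent(30, 1000) < 7.5)

    # DNS style check: p95 of made-up lookup latencies (ms) against 5 ms.
    latencies_ms = [1.2, 0.8, 2.5, 3.1, 0.9, 4.7, 1.1, 2.2, 6.3, 1.0]
    print("p95 latency (ms):", percentile(latencies_ms, 95))
```

With a 99.5% target, the assumed 30-day window leaves roughly 216 minutes of permitted downtime; a different window scales that budget proportionally. The latency values in the usage section are invented for illustration only.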