Formalise and agree on Infrastructure SLOs
Open, Normal, Public

Description

Grafana dashboard: https://grafana.miraheze.org/d/-rJPCPJnz/infrastructure-slos

  • Bastion
    • SLO: Availability of SSH services to be at least x%. SLI: Service uptime.
  • Cache Proxies
    • SLO: Errors account for less than x% of requests. SLI: NGINX logs, where an error is any response with an HTTP status code in the 4xx or 5xx range.
    • SLO: Availability of backends is at least x%. SLI: Varnish backend monitoring shows at least x backends are up.
  • Cloud
    • SLO: Availability of cloud infrastructure is at least x%. SLI: Uptime of servers.
  • DNS
    • SLO: Availability of DNS services to be at least x%. SLI: Service uptime.
    • SLO: Errors must be less than x% of total DNS requests. SLI: Number of requests whose response code is not NOERROR, divided by total requests. This is to monitor spikes.
    • SLO: Latency for DNS lookups to be below xms. SLI: DNS lookup metrics, p95? p99?
  • Elasticsearch
    • SLO: Availability of service to be at least x%. SLI: Service uptime.
  • Gluster
    • SLO: Availability of Gluster for reads and writes is at least x%. SLI: Read-write availability.
  • Graylog
    • SLO: Availability of Graylog is at least x%. SLI: Service uptime (graylog sidecar?)
    • SLO: Errors for log indexing are less than x%. SLI: org.graylog2.shared.messageq.MessageQueueWriter.failed-write-attempts metric?
    • SLO: Latency for log indexing is less than xms. SLI: org.graylog2.shared.buffers.ProcessBuffer.parseTime metric?
  • Mail
    • SLO: Availability of Mail servers to be at least x%. SLI: Ingestion service uptime.
    • SLO: Errors for sending mail are below x%. SLI: Error-related metrics - receiving/sending?
    • SLO: Latency of message delivery is below x minutes. SLI: Average age of messages.
  • MariaDB
    • SLO: Availability of MariaDB is at least x%. SLI: Monitoring of MariaDB's accessibility.
    • SLO: Error rates for access are below x%. SLI: Connection failure error rates with respect to total connections.
  • LDAP
    • SLO: Availability of LDAP to be at least x%. SLI: Service uptime.
  • Phabricator
    • SLO: Availability of Phabricator to be at least x%. SLI: Site uptime.
    • SLO: Latency of Phabricator to be below xms. SLI: Grafana Prometheus blackbox exporter metrics over p95/p99?
  • Puppet
    • SLO: Availability of Puppet Server is to be at least x%. SLI: Service uptime.
    • SLO: Latency of external HTTP communications for find facts is to be below xms. SLI: Grafana metrics for find facts latency over p95/p99?
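
The three SLI shapes used throughout the list above (availability as a ratio of successful probes, errors as a ratio of total requests, and latency as a high percentile) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the input shapes (probe counts, a list of HTTP status codes, a mapping of DNS response codes to counts, a list of latency samples) and the nearest-rank percentile method are hypothetical choices, not how the Grafana dashboard or any exporter actually computes them. The x targets remain undecided and are deliberately not filled in.

```python
import math
from typing import Mapping, Sequence


def availability(up_probes: int, total_probes: int) -> float:
    """Uptime SLI: fraction of probes that found the service up."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    return up_probes / total_probes


def http_error_ratio(status_codes: Sequence[int]) -> float:
    """Error SLI for the cache proxies: share of responses in the
    4xx or 5xx range, as the task description defines an error."""
    if not status_codes:
        raise ValueError("status_codes must not be empty")
    errors = sum(1 for code in status_codes if 400 <= code < 600)
    return errors / len(status_codes)


def dns_error_ratio(rcode_counts: Mapping[str, int]) -> float:
    """Error SLI for DNS: responses that are not NOERROR over all responses.
    rcode_counts maps a response code name to its count (assumed input shape)."""
    total = sum(rcode_counts.values())
    if total <= 0:
        raise ValueError("rcode_counts must contain at least one response")
    return (total - rcode_counts.get("NOERROR", 0)) / total


def percentile(samples: Sequence[float], p: float) -> float:
    """Latency SLI: nearest-rank percentile (one of several common methods)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up sample data, purely for illustration.
    print(availability(999, 1000))
    print(http_error_ratio([200, 200, 404, 503]))
    print(dns_error_ratio({"NOERROR": 980, "NXDOMAIN": 15, "SERVFAIL": 5}))
    print(percentile([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 95))
```

An SLO check is then a comparison of one of these values against the agreed x for that service. With few samples, p95 and p99 under nearest-rank both collapse to the maximum, which may be worth keeping in mind when choosing between them for low-traffic services.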