Formalise and agree on Infrastructure SLOs
Closed, Resolved, Public

Description

Grafana dashboard: https://grafana.miraheze.org/d/-rJPCPJnz/infrastructure-slos

  • Bastion
    • SLO: Availability of SSH services to be at least 99.5%. SLI: Service uptime.
  • Cache Proxies
    • SLO: Errors account for less than 7.5% of requests. SLI: NGINX Logs, errors are HTTP codes in the 5xx and 4xx range.
    • SLO: Availability of 75% of backends is at least 99%. SLI: Varnish backend monitoring shows at least 75% of backends are up.
  • Cloud
    • SLO: Availability of cloud infrastructure is at least 99.5%. SLI: Uptime of servers.
  • DNS
    • SLO: Availability of DNS services to be at least 99.5%. SLI: Service uptime.
    • SLO: Errors must be less than 0.5% of total DNS requests. SLI: Requests that do not end with NOERROR, divided by total requests. This is to monitor spikes.
    • SLO: Latency for DNS lookups to be below 5ms. SLI: DNS lookup metrics, p95? p99?
  • ElasticSearch
    • SLO: Availability of service to be at least 99.5%. SLI: Service uptime.
  • Gluster
    • SLO: Availability of Gluster read-total time is at least x%. SLI: Read-write availability.
  • Graylog
    • SLO: Availability of Graylog is at least 99.5%. SLI: Service uptime (graylog sidecar?)
    • SLO: Errors for log indexing are less than 0.5%. SLI: org.graylog2.shared.messageq.MessageQueueWriter.failed-write-attempts metric?
    • SLO: Latency for log indexing is less than 5ms. SLI: org.graylog2.shared.buffers.ProcessBuffer.parseTime metric?
  • Mail
    • SLO: Availability of Mail servers to be at least 99.5%. SLI: Ingestion service uptime.
    • SLO: Errors for sending mail are below 1%. SLI: Error-related metrics - receiving/sending?
    • SLO: Latency of message delivery is below 30 seconds. SLI: Average age of messages.
  • MariaDB
    • SLO: Availability of MariaDB is at least 99.5%. SLI: Monitoring of MariaDB's accessibility.
    • SLO: Error rates for access are below 5%. SLI: Connection failure error rates with respect to total connections.
  • LDAP
    • SLO: Availability of LDAP to be at least 99.5%. SLI: Service uptime.
  • Phabricator
    • SLO: Availability of Phabricator to be at least 99.5%. SLI: Site uptime.
    • SLO: Latency of Phabricator to be below 5s. SLI: Grafana Prometheus blackbox exporter metrics over p95/p99?
  • Puppet
    • SLO: Availability of Puppet Server is to be at least 99.5%. SLI: Service uptime.
    • SLO: Latency of external HTTP communications for find facts is to be below 30ms. SLI: Grafana metrics for find facts latency over p95/p99?
  • Swift
    • SLO: Availability of Swift to be at least 99.5%. SLI: Service uptime calculated by dispersion.
    • SLO: Errors for Swift requests to be less than 1% of all requests. SLI: Server 500 errors against total Swift requests.
    • SLO: Latency for Swift Time To First Byte to be less than 1s. SLI: TTFB for Object Servers.
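
To make the availability and error targets above concrete, here is a minimal Python sketch of the arithmetic behind them. It is not taken from the Grafana dashboard; the function names, the 30-day window, and the sample counts are assumptions chosen for illustration. A 99.5% availability target over 30 days leaves roughly 3.6 hours of allowed downtime.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime within `window` for an availability target such as 0.995.

    Hypothetical helper: the window length is an assumption, not something
    the task specifies.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def error_ratio(errors: int, total: int) -> float:
    """Error-ratio SLI: failed requests divided by all requests."""
    if total == 0:
        return 0.0  # no traffic means nothing has failed
    return errors / total


def meets_error_slo(errors: int, total: int, max_ratio: float) -> bool:
    """True when the observed error ratio is strictly below the SLO ceiling."""
    return error_ratio(errors, total) < max_ratio


if __name__ == "__main__":
    # 99.5% availability over an assumed 30-day window.
    print(error_budget(0.995, timedelta(days=30)))
    # DNS-style error SLO: fewer than 0.5% of requests may fail.
    print(meets_error_slo(errors=3, total=1000, max_ratio=0.005))
```

The same `meets_error_slo` check would apply to the cache proxy (7.5%), mail (1%), MariaDB (5%), and Swift (1%) error targets by changing `max_ratio`.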

Event Timeline

John triaged this task as Normal priority. Feb 20 2022, 14:53
John created this task.

Has there been any more discussion among SRE team members with regard to agreeing on SLOs for the above?

Nope, they will be formalised once I get some time in the next few weeks to work on them.

Proposals for some of them are below:

  • Bastion
    • SLO: Availability of SSH services to be at least 99.5%.
  • Cache Proxies
    • SLO: Availability of 75% of backends is at least 99%.
  • Cloud
    • SLO: Availability of cloud infrastructure is at least 99.5%.
  • DNS
    • SLO: Availability of DNS services to be at least 99.5%.
    • SLO: Errors must be less than 0.5% of total DNS requests.
  • ElasticSearch
    • SLO: Availability of service to be at least 99.5%.
  • Graylog
    • SLO: Availability of Graylog is at least 99.5%.
  • Mail
    • SLO: Availability of Mail servers to be at least 99.5%.
    • SLO: Errors for sending mail are below 1%.
    • SLO: Latency of message delivery is below 30 seconds.
  • MariaDB
    • SLO: Availability of MariaDB is at least 99.5%.
    • SLO: Error rates for access are below 5%.
  • LDAP
    • SLO: Availability of LDAP to be at least 99.5%.
  • Phabricator
    • SLO: Availability of Phabricator to be at least 99.5%.
  • DNS
    • SLO: Latency for DNS lookups to be below 5ms at least 99.5% of the time.
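
A latency target like the DNS one above can be checked in two ways: as the fraction of lookups under the threshold, or as a percentile compared against the threshold. The Python sketch below shows both. The function names and sample data are hypothetical, and the nearest-rank percentile method is an assumption, since the task leaves the choice between p95 and p99 open.

```python
import math
from typing import Sequence


def fraction_within(samples_ms: Sequence[float], threshold_ms: float) -> float:
    """Share of samples strictly below the latency threshold."""
    if not samples_ms:
        raise ValueError("no samples")
    within = sum(1 for s in samples_ms if s < threshold_ms)
    return within / len(samples_ms)


def percentile_nearest_rank(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up data: 199 fast lookups and one slow outlier.
    lookups_ms = [1.0] * 199 + [9.0]
    print(fraction_within(lookups_ms, threshold_ms=5.0) >= 0.995)
    print(percentile_nearest_rank(lookups_ms, 99) < 5.0)
```

The two views can disagree near the boundary, so it helps to fix one definition in the SLO text before alerting on it.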

@Paladox can we draft some Swift SLOs please so that we can start to monitor them before the end of this year?

Availability for the Swift proxy, ac (account/container) and object servers should be 99.5%, I think.
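
One thing to keep in mind when setting per-tier targets: if a request has to pass through every tier and the tiers fail independently, the end-to-end availability is the product of the individual figures. The sketch below works through that arithmetic in Python. Both the independence assumption and the tier names in the dictionary are illustrative, not a description of how Swift is actually deployed here.

```python
import math
from typing import Mapping


def serial_availability(tiers: Mapping[str, float]) -> float:
    """End-to-end availability when every tier is required and failures are independent."""
    for name, value in tiers.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability for {name!r} must be in [0, 1]")
    return math.prod(tiers.values())


if __name__ == "__main__":
    # Hypothetical tiers, each at the proposed 99.5% target.
    swift_tiers = {"proxy": 0.995, "ac": 0.995, "object": 0.995}
    print(f"{serial_availability(swift_tiers):.4%}")
```

Under these assumptions three tiers at 99.5% each give about 98.5% end to end, so a service-level 99.5% target would need tighter per-tier figures or redundancy within each tier.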

John claimed this task.