It would be useful to get a full picture of how many API requests we serve on average per second/minute, broken down by module and request type. That would let us pinpoint increased resource usage more quickly (e.g. bots suddenly misusing the API) and guide potential caching work.
For this, aggregating over the access logs minute by minute may be the way to go (Graylog dashboards could also help?). Reporting the counts to Grafana directly, or exposing them via Prometheus, is probably a better option than setting up a dedicated data storage service at this point.
I told you on IRC yesterday, but it's good practice to document this on Phabricator: my suggestion is to use https://github.com/braedon/prometheus-es-exporter, a tool that runs on the Graylog hosts. The tool takes an Elasticsearch query, performs the search, and returns the result in Prometheus format. Prometheus collects the metrics, after which we can use them in Grafana dashboards.
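As a rough sketch of what such a query could look like: prometheus-es-exporter reads an INI-style config where each `[query_*]` section defines an Elasticsearch search to run periodically. The section name, index pattern, and field names (`url`, `timestamp`) below are assumptions about our Graylog setup, not a tested config:

```ini
# Hypothetical section for exporter.cfg; field names and index pattern are guesses.
[query_api_requests]
# Run the search once per minute.
QueryIntervalSecs = 60
# Graylog stores messages in graylog_* indices by default.
QueryIndices = graylog_*
# Count access-log entries for /api.php in the last minute
# (assumes the path is logged in a field named "url").
QueryJson = {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"prefix": {"url": "/api.php"}},
                    {"range": {"timestamp": {"gte": "now-1m"}}}
                ]
            }
        }
    }
```

If this works, the exporter should expose the hit count as a metric derived from the section name (e.g. `api_requests_hits`), which Prometheus can then scrape and Grafana can graph. Breaking the count down by module would presumably be an aggregation on top of this.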
ACLs in Elasticsearch are hard (but aren't needed so far, since Elasticsearch only listens on 127.0.0.1), and the prometheus-es-exporter tool is a nice solution.
Wikimedia queries: https://github.com/wikimedia/puppet/tree/f9bdcb97b5d1a6154bfa033f0cac292ede3710a1/modules/prometheus/files/es_exporter
Puppet classes: https://github.com/wikimedia/puppet/blob/f9bdcb97b5d1a6154bfa033f0cac292ede3710a1/modules/prometheus/manifests/es_exporter.pp