Sun, Apr 11

John removed a project from T7127: Add more jobrunner rate tasks to Grafana: Redis-JobRunner.
Sun, Apr 11, 17:09 · MediaWiki (SRE), Monitoring
John closed T7108: Remove abandoned l-unclaimed entries as Resolved.

https://github.com/miraheze/jobrunner-service/compare/de7d72b68abc...7e6175d56b4e

Sun, Apr 11, 15:02 · Redis-JobRunner, Infrastructure (SRE)

Fri, Apr 9

John added a comment to T7067: Subscribe SRE to OpenCVE for notifications.

It looks like a useful service, so we should definitely give it a try and evaluate it from a security perspective.

Fri, Apr 9, 10:42 · Security, Site Reliability Engineering
John committed rPUPC63c60c548bf2: rm double keystroke (authored by John).
rm double keystroke
Fri, Apr 9, 10:37
John closed T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket as Resolved.

The changes never got deployed on the server; this has been fixed now.

Fri, Apr 9, 10:22 · Infrastructure (SRE)
John committed rPUPC222d2ffbc71c: jobrunner: ensure latest not present (authored by John).
jobrunner: ensure latest not present
Fri, Apr 9, 10:19

Thu, Apr 8

John closed T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket as Resolved.

T7107

Thu, Apr 8, 11:27 · Infrastructure (SRE)
John closed T7107: Remove :rootjobs: periodically as Resolved.
Thu, Apr 8, 11:26 · Redis-JobRunner, Infrastructure (SRE)
John moved T7107: Remove :rootjobs: periodically from Incoming to Short Term on the Infrastructure (SRE) board.
Thu, Apr 8, 11:21 · Redis-JobRunner, Infrastructure (SRE)
John moved T7108: Remove abandoned l-unclaimed entries from Incoming to Short Term on the Infrastructure (SRE) board.
Thu, Apr 8, 11:21 · Redis-JobRunner, Infrastructure (SRE)
John moved T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket from Incoming to Short Term on the Infrastructure (SRE) board.
Thu, Apr 8, 11:21 · Infrastructure (SRE)
John added a comment to T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket.

Because of our monitoring, we’re running fairly intensive Lua scripts on almost 100k keys, which can take up to 2 seconds to run. We have set our connectTimeout for Redis to 2s (https://github.com/miraheze/mw-config/blob/master/GlobalCache.php#L48).

Thu, Apr 8, 10:17 · Infrastructure (SRE)
John edited projects for T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket, added: Infrastructure (SRE); removed Redis-JobRunner.

This is the Redis software, not the job queue software, as this is run manually, not as a job.

Thu, Apr 8, 10:09 · Infrastructure (SRE)

Wed, Apr 7

John moved T7108: Remove abandoned l-unclaimed entries from To Triage to Bugs on the Redis-JobRunner board.
Wed, Apr 7, 20:31 · Redis-JobRunner, Infrastructure (SRE)
John moved T7107: Remove :rootjobs: periodically from To Triage to Features on the Redis-JobRunner board.
Wed, Apr 7, 20:31 · Redis-JobRunner, Infrastructure (SRE)
John triaged T7108: Remove abandoned l-unclaimed entries as Normal priority.
Wed, Apr 7, 20:31 · Redis-JobRunner, Infrastructure (SRE)
John triaged T7107: Remove :rootjobs: periodically as Low priority.
Wed, Apr 7, 20:26 · Redis-JobRunner, Infrastructure (SRE)
John set the image for Redis-JobRunner to F1420607: fa-briefcase-blue.png.
Wed, Apr 7, 20:20
John created Redis-JobRunner.
Wed, Apr 7, 20:20
John committed rPUPC8bf8f9bce546: jobrunner: only run jobchron on one server (authored by John).
jobrunner: only run jobchron on one server
Wed, Apr 7, 20:03
John committed rPUPC2229561d9ad9: jobrunner: use Miraheze repo not Wikimedia (authored by John).
jobrunner: use Miraheze repo not Wikimedia
Wed, Apr 7, 15:42
Reception123 awarded T6974: Jobs Statistics in Grafana a Haypence token.
Wed, Apr 7, 04:35 · Monitoring, MediaWiki (SRE)
Dmehus awarded T6974: Jobs Statistics in Grafana a Like token.
Wed, Apr 7, 01:54 · Monitoring, MediaWiki (SRE)
John closed T6974: Jobs Statistics in Grafana as Resolved.

https://grafana.miraheze.org/d/3L3WYylMz/mediawiki-job-queue?orgId=1

Wed, Apr 7, 01:19 · Monitoring, MediaWiki (SRE)
John committed rPUPCadcfd9f89f8f: wiki:jobqueue not global:jobqueue (authored by John).
wiki:jobqueue not global:jobqueue
Wed, Apr 7, 00:25
John committed rPUPCa4a8da3589f0: remove claimed and delayed data collection (authored by John).
remove claimed and delayed data collection
Wed, Apr 7, 00:17

Tue, Apr 6

John added a comment to T6974: Jobs Statistics in Grafana.

https://github.com/miraheze/puppet/blob/master/modules/prometheus/files/redis/jobQueueCollector.lua

Tue, Apr 6, 23:49 · Monitoring, MediaWiki (SRE)
John committed rPUPC7429c584a9bf: add jobQueueCollector script to Redis Prometheus exporter (authored by John).
add jobQueueCollector script to Redis Prometheus exporter
Tue, Apr 6, 23:32
John added a comment to T6974: Jobs Statistics in Grafana.

Basic Lua script to handle this:

Tue, Apr 6, 18:48 · Monitoring, MediaWiki (SRE)

Mon, Apr 5

John claimed T6974: Jobs Statistics in Grafana.
Mon, Apr 5, 11:41 · Monitoring, MediaWiki (SRE)
John added a comment to T7073: Install prometheus-es-exporter for prometheus <-> graylog integration.

Since there are more uses than MediaWiki, should this be tagged as MediaWiki (SRE) only?

Mon, Apr 5, 11:16 · MediaWiki (SRE), Monitoring

Thu, Apr 1

John edited projects for T7073: Install prometheus-es-exporter for prometheus <-> graylog integration, added: MediaWiki (SRE); removed Infrastructure (SRE).
Thu, Apr 1, 00:08 · MediaWiki (SRE), Monitoring

Wed, Mar 31

John added a comment to T7073: Install prometheus-es-exporter for prometheus <-> graylog integration.

Is there a use case for this that the ES data source wouldn’t fulfil? Is this the approach MediaWiki (SRE) wish to take? If so, this would fall to the MW team to implement as part of their task; without a use case for Infra, what’s the point in implementing something unused?

Wed, Mar 31, 23:41 · MediaWiki (SRE), Monitoring

Sun, Mar 28

John changed the status of T6984: High load on dbbackup servers, a subtask of T5877: Revise MariaDB backup strategy, from Stalled to Open.
Sun, Mar 28, 23:07 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
John changed the status of T6984: High load on dbbackup servers from Stalled to Open.

Not blocked on external entity

Sun, Mar 28, 23:07 · Database, Monitoring, Infrastructure (SRE)
John moved T7033: Restart services running on older openssl binaries from Incoming to Short Term on the Infrastructure (SRE) board.
Sun, Mar 28, 19:26 · Infrastructure (SRE), Security
John assigned T7033: Restart services running on older openssl binaries to Southparkfan.
Sun, Mar 28, 19:25 · Infrastructure (SRE), Security
John removed a project from T7046: New Resource Request for MediaWiki-Extension-Updates: MediaWiki (SRE).
Sun, Mar 28, 19:20 · Infrastructure (SRE)
John closed T7046: New Resource Request for MediaWiki-Extension-Updates as Declined.

We still need something to test on though. I suggest we use test3 at first. We therefore just need to know which db to put the cached info on.

Sun, Mar 28, 19:18 · Infrastructure (SRE)

Sat, Mar 27

John closed T7042: salt-ssh broken due to unknown minion as Invalid.

Sounds like there isn't a problem then?

Sat, Mar 27, 10:15 · Infrastructure (SRE)
John added a comment to T7033: Restart services running on older openssl binaries.

Do we have an update on this? Also, who is taking responsibility for this?

Sat, Mar 27, 10:14 · Infrastructure (SRE), Security

Fri, Mar 26

John reassigned T7046: New Resource Request for MediaWiki-Extension-Updates from John to Reception123.
Fri, Mar 26, 18:24 · Infrastructure (SRE)
John added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

Before I can review this, more information needs to be provided.

Fri, Mar 26, 18:18 · Infrastructure (SRE)

Thu, Mar 25

John closed T7038: Existing Server Resource Request for bacula2 as Resolved.

Approved, with spending authorisation by @Southparkfan

Thu, Mar 25, 22:41 · Infrastructure (SRE)
John edited P386 Resources Table.
Thu, Mar 25, 22:32 · Cloud Infrastructure, Infrastructure (SRE)
John closed T7037: [New] Server Resource Request for ats as Resolved.

Approved for cloud4.

Thu, Mar 25, 22:32 · Infrastructure (SRE)
John added a project to T7033: Restart services running on older openssl binaries: Infrastructure (SRE).
Thu, Mar 25, 18:42 · Infrastructure (SRE), Security

Tue, Mar 23

John closed T4191: Redesign compression of content inside NGINX and Varnish as Declined.

T4302 - if that task gets declined in the future then this task would need re-opening.

Tue, Mar 23, 17:09 · Infrastructure (SRE), Varnish
John committed rPUPC086bcaa22a85: grafana: add sre-mediawiki as Editors (authored by John).
grafana: add sre-mediawiki as Editors
Tue, Mar 23, 16:13

Mon, Mar 22

John assigned T4302: Deploy Apache Traffic Server to Paladox.
Mon, Mar 22, 20:29 · Infrastructure (SRE)
John claimed T5397: Create a logbot for server actions.
Mon, Mar 22, 20:09 · Infrastructure (SRE)
John committed rPUPC704701df93d8: set service_count to 0 (authored by John).
set service_count to 0
Mon, Mar 22, 19:04
John added a comment to T6974: Jobs Statistics in Grafana.

Get a list of all h-sha1ById keys and loop over them: running HLEN on each key returns how many unclaimed jobs there are for that job type. Add these up and the data exists for the whole job queue as well as per job type (and, if you want to go further, for each job type on each wiki).
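
The approach described in this comment could be sketched roughly as follows. This is only an illustration, not the script that was eventually deployed: the key layout `<wiki>:jobqueue:<jobtype>:h-sha1ById`, the function name `count_jobs` and the `RedisLike` interface are assumptions made for the example. The two methods used, `scan_iter(match=...)` and `hlen(name)`, exist on the redis-py client, so a real `redis.Redis` instance could be passed in.

```python
from collections import Counter
from typing import Iterable, Protocol, Union


class RedisLike(Protocol):
    """Minimal interface needed here; redis.Redis from redis-py satisfies it."""

    def scan_iter(self, match: str) -> Iterable[Union[str, bytes]]: ...
    def hlen(self, name: str) -> int: ...


def count_jobs(client: RedisLike, pattern: str = "*:jobqueue:*:h-sha1ById"):
    """Sum HLEN over every h-sha1ById key.

    Assumes (hypothetically) that keys look like
    '<wiki>:jobqueue:<jobtype>:h-sha1ById'.
    Returns (total, per job type, per (wiki, job type)).
    """
    per_type: Counter = Counter()
    per_wiki_type: Counter = Counter()
    for key in client.scan_iter(match=pattern):
        if isinstance(key, bytes):
            key = key.decode()
        # Split into wiki, the literal 'jobqueue', job type and key suffix.
        wiki, _, job_type, _ = key.split(":", 3)
        n = client.hlen(key)
        per_type[job_type] += n
        per_wiki_type[(wiki, job_type)] += n
    return sum(per_type.values()), per_type, per_wiki_type
```

Iterating with SCAN rather than KEYS avoids blocking Redis for the whole listing, although a run over a very large keyspace will still take noticeable time.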

Mon, Mar 22, 17:06 · Monitoring, MediaWiki (SRE)
John committed rPUPC8107e7fc79c8: change mailname on all non-mail servers (authored by John).
change mailname on all non-mail servers
Mon, Mar 22, 13:09
John closed T6976: General Mail Statistics as Resolved.

Reviewing the stats already put up by @Paladox and looking into Dovecot's stats facility in more detail, I don't believe we would gain any new information from Dovecot stats as the Postfix ones already cover all bases of mail, including connections, logins and auth failures.

Mon, Mar 22, 13:00 · Monitoring, Mail, Infrastructure (SRE)

Sun, Mar 21

Dmehus awarded T7008: Investigate database server/cache proxy issues and extreme load times this evening an Orange Medal token.
Sun, Mar 21, 19:24 · Infrastructure (SRE)
John changed the start date for E239: Infrastructure SRE Weekly Meeting from Sun, Mar 21, 20:00 to Mon, Mar 22, 20:00.
Sun, Mar 21, 18:25 · Infrastructure (SRE)
John changed the start date for E239: Infrastructure SRE Weekly Meeting from Sun, Mar 21, 19:00 to Sun, Mar 21, 20:00.
Sun, Mar 21, 18:24 · Infrastructure (SRE)
John set E239: Infrastructure SRE Weekly Meeting to repeat weekly.
Sun, Mar 21, 18:22 · Infrastructure (SRE)
John created E239: Infrastructure SRE Weekly Meeting.
Sun, Mar 21, 18:22 · Infrastructure (SRE)
John cancelled E106: SRE Duty.
Sun, Mar 21, 18:18
John cancelled E237: SRE Duty.
Sun, Mar 21, 18:18
John cancelled E236: SRE Duty.
Sun, Mar 21, 18:18
John cancelled E234: SRE Duty.
Sun, Mar 21, 18:18
John cancelled E235: SRE Duty.
Sun, Mar 21, 18:18
John cancelled E233: SRE Duty.
Sun, Mar 21, 18:18
John added a comment to T7008: Investigate database server/cache proxy issues and extreme load times this evening.
In T7008#138619, @John wrote:

Unless this happens again, the cost of investigating further may significantly outweigh the benefits, as looking into the full health of the RAM would require extensive downtime of all wikis.

This isn't the first time this happened actually. IIRC, this is the third time.

Sun, Mar 21, 16:09 · Infrastructure (SRE)
John closed T7008: Investigate database server/cache proxy issues and extreme load times this evening as Resolved.

Looking into this, the MySQL server crashed because of a memory issue around 0202, which matches up to error reports.

Sun, Mar 21, 13:15 · Infrastructure (SRE)

Fri, Mar 19

John added a comment to T6759: Automate the adding of SSL private keys to puppet3.

I've finally found the ticket, pasting my IRC comment here:
22:37:46 <+SPF|Cloud> @SRE, I can't recall who was talking about it (and where I read it), but I saw some messages regarding automating the addition of a new certificate (for https). have you considered https://wikitech.wikimedia.org/wiki/Acme-chief?

Fri, Mar 19, 18:31 · Infrastructure (SRE), SSL
John added a comment to T6981: Consider Deploying NavigationTiming Extension.

If this requires Kafka then it should be declined.

Fri, Mar 19, 00:19 · Extensions, Monitoring, Universal Omega, MediaWiki (SRE)

Thu, Mar 18

John added a comment to T6974: Jobs Statistics in Grafana.

Possible implementations can be:
Accessible JSON table: https://grafana.com/grafana/plugins/simpod-json-datasource/
Data from Redis directly: https://grafana.com/grafana/plugins/redis-datasource/

Thu, Mar 18, 19:39 · Monitoring, MediaWiki (SRE)
John added a comment to T6979: Collect Statistics for API Requests (Including Module Type).

Data could potentially be collected directly via ES instead: https://grafana.com/docs/grafana/latest/datasources/elasticsearch/

Thu, Mar 18, 19:32 · Monitoring, MediaWiki (SRE)
John added a comment to T6765: Cache frequently accessed files on MediaWiki servers.

Can we note explicitly in the documentation that it can be done at any point then and the deployment doesn't need to wait until SRE are ready to merge a puppet PR?

Thu, Mar 18, 13:29 · MediaWiki (SRE), Performance, MediaWiki
John added a comment to T4420: Introduce stats for IncidentReports.
In T4420#138210, @John wrote:

When I try this and select ‘show number of incidents’ and ‘show all services’, all the rows turn up empty with no numbers. This is the same for visible outage and total outage.

Oh, hmm. That didn't happen to me when I was testing this. I will attach screenshots of local test shortly

Thu, Mar 18, 00:05 · MediaWiki (SRE), Goal-2021-Jan-Jun, Goal-2020-Jul-Dec, IncidentReporting

Wed, Mar 17

John reopened T4420: Introduce stats for IncidentReports as "Open".

When I try this and select ‘show number of incidents’ and ‘show all services’, all the rows turn up empty with no numbers. This is the same for visible outage and total outage.

Wed, Mar 17, 23:50 · MediaWiki (SRE), Goal-2021-Jan-Jun, Goal-2020-Jul-Dec, IncidentReporting
John edited P386 Resources Table.
Wed, Mar 17, 19:31 · Cloud Infrastructure, Infrastructure (SRE)
John closed T6993: [Existing] Server Resource Request for mem[12] as Resolved.

Increasing mem[12]'s memory to 10GB approved.

Wed, Mar 17, 19:30 · Infrastructure (SRE)
John updated the task description for T6993: [Existing] Server Resource Request for mem[12].
Wed, Mar 17, 19:29 · Infrastructure (SRE)
John added a comment to T6990: Increase in PHP Errors.
PHP Notice:  Undefined index: musgwiki in /srv/mediawiki/w/extensions/CreateWiki/includes/WikiInitialise.php on line 107

I just checked graylog now, and I actually don't even see the error mentioned.

Wed, Mar 17, 18:30 · Production Error, CreateWiki, Universal Omega, MediaWiki (SRE)
John triaged T6990: Increase in PHP Errors as High priority.
Wed, Mar 17, 16:51 · Production Error, CreateWiki, Universal Omega, MediaWiki (SRE)
John closed T6973: Monitor Physical Disk Health as Resolved.
Wed, Mar 17, 14:17 · Monitoring, Cloud Infrastructure, Infrastructure (SRE)
John committed rPUPCdb518efd378f: fact -> facts (authored by John).
fact -> facts
Wed, Mar 17, 14:15
John committed rPUPC392cd05e6f2b: T6973: monitor disk health (authored by John).
T6973: monitor disk health
Wed, Mar 17, 14:12
John added a comment to T6979: Collect Statistics for API Requests (Including Module Type).

For this, looping over access logs on a minute-by-minute basis may be of use, and reporting usage directly to Grafana or via Prometheus may be the best option, rather than setting up specific data storage services at present.
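
As a rough illustration of the log-looping idea, the sketch below counts API requests per minute and per module from access log lines. It assumes a combined-style log format with a bracketed timestamp and a quoted request line, and that the module is visible as the `action` query parameter. The function name `count_api_modules` is hypothetical, and exporting the counts to Prometheus or Grafana is left out.

```python
import re
from collections import Counter
from typing import Iterable
from urllib.parse import parse_qs, urlsplit

# Assumed log shape: ... [17/Mar/2021:11:44:03 +0000] "GET /w/api.php?action=query HTTP/1.1" ...
REQUEST_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?:GET|POST) (?P<url>\S+)')


def count_api_modules(lines: Iterable[str]) -> Counter:
    """Count api.php requests keyed by (minute, action).

    Only the query string is inspected, so a POST request that sends its
    action in the request body is counted under 'unknown'.
    """
    counts: Counter = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if not m:
            continue
        parts = urlsplit(m.group("url"))
        if not parts.path.endswith("/api.php"):
            continue
        action = parse_qs(parts.query).get("action", ["unknown"])[0]
        # Truncate 'dd/Mon/yyyy:HH:MM:SS +zzzz' to minute resolution.
        minute = m.group("ts")[:17]
        counts[(minute, action)] += 1
    return counts
```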

Wed, Mar 17, 11:44 · Monitoring, MediaWiki (SRE)
John closed T6989: Review new OVH backup policy as Invalid.

We maintain our own backup infrastructure to ensure continuity of service and peace of mind for ourselves. Any external provisions by the data centres we use would have no effect on us in practical terms.

Wed, Mar 17, 09:56 · Infrastructure (SRE)

Tue, Mar 16

John assigned T6975: LDAP Statistics to Paladox.
Tue, Mar 16, 17:23 · Monitoring, Infrastructure (SRE)
John claimed T6976: General Mail Statistics.
Tue, Mar 16, 17:22 · Monitoring, Mail, Infrastructure (SRE)
John assigned T6977: Document Memcached to Paladox.
Tue, Mar 16, 17:22 · Infrastructure (SRE)
John moved T6984: High load on dbbackup servers from Incoming to Short Term on the Infrastructure (SRE) board.
Tue, Mar 16, 17:21 · Database, Monitoring, Infrastructure (SRE)

Mon, Mar 15

John triaged T6981: Consider Deploying NavigationTiming Extension as Normal priority.
Mon, Mar 15, 17:39 · Extensions, Monitoring, Universal Omega, MediaWiki (SRE)
John triaged T6979: Collect Statistics for API Requests (Including Module Type) as Normal priority.
Mon, Mar 15, 17:18 · Monitoring, MediaWiki (SRE)
John moved T6973: Monitor Physical Disk Health from Incoming to Short Term on the Infrastructure (SRE) board.
Mon, Mar 15, 16:30 · Monitoring, Cloud Infrastructure, Infrastructure (SRE)
John moved T6975: LDAP Statistics from Incoming to Short Term on the Infrastructure (SRE) board.
Mon, Mar 15, 16:30 · Monitoring, Infrastructure (SRE)
John moved T6976: General Mail Statistics from Incoming to Short Term on the Infrastructure (SRE) board.
Mon, Mar 15, 16:30 · Monitoring, Mail, Infrastructure (SRE)
John moved T6977: Document Memcached from Incoming to Short Term on the Infrastructure (SRE) board.
Mon, Mar 15, 16:30 · Infrastructure (SRE)
John triaged T6978: Document the MediaWiki Application Stack as Normal priority.
Mon, Mar 15, 16:29 · MediaWiki (SRE)
John triaged T6977: Document Memcached as Normal priority.
Mon, Mar 15, 16:19 · Infrastructure (SRE)
John triaged T6976: General Mail Statistics as Normal priority.
Mon, Mar 15, 16:13 · Monitoring, Mail, Infrastructure (SRE)
John triaged T6975: LDAP Statistics as Normal priority.
Mon, Mar 15, 16:12 · Monitoring, Infrastructure (SRE)
John triaged T6974: Jobs Statistics in Grafana as Normal priority.
Mon, Mar 15, 16:09 · Monitoring, MediaWiki (SRE)