Infrastructure (SRE)
Group · Active · Public

Members (2)

Details

Description

This is the project for the Infrastructure team based in the Site Reliability Engineering department.

This project is used to organise and manage all work which falls under the primacy of the Infrastructure team. Any queries about progress or the allocation of resources should be directed to the Engineering Manager for the Infrastructure team.

Engineering Manager: @John

Recent Activity

Sun, Apr 11

John closed T7108: Remove abandoned l-unclaimed entries as Resolved.

https://github.com/miraheze/jobrunner-service/compare/de7d72b68abc...7e6175d56b4e

Sun, Apr 11, 15:02 · Redis-JobRunner, Infrastructure (SRE)

Fri, Apr 9

Southparkfan updated subscribers of T5877: Revise MariaDB backup strategy.

Running dump from db11 to dbbackup1:/srv/backups/db11. @Paladox and I are around to monitor.

Fri, Apr 9, 22:21 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Paladox closed T6975: LDAP Statistics as Resolved.
Fri, Apr 9, 21:24 · Monitoring, Infrastructure (SRE)
Paladox added a comment to T6975: LDAP Statistics.

I've added LDAP monitoring. You can view it at https://grafana.miraheze.org/d/uOLD33lMz/ldap?orgId=1

Fri, Apr 9, 21:23 · Monitoring, Infrastructure (SRE)
John closed T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket as Resolved.

The changes never got deployed on the server; this has been fixed now.

Fri, Apr 9, 10:22 · Infrastructure (SRE)
Reception123 reopened T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket as "Open".

@John I've run into the error again I'm afraid (though this time the dump has gone on for way longer, but eventually it happens)

Fri, Apr 9, 09:34 · Infrastructure (SRE)

Thu, Apr 8

John closed T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket as Resolved.

T7107

Thu, Apr 8, 11:27 · Infrastructure (SRE)
John closed T7107: Remove :rootjobs: periodically as Resolved.
Thu, Apr 8, 11:26 · Redis-JobRunner, Infrastructure (SRE)
John moved T7107: Remove :rootjobs: periodically from Incoming to Short Term on the Infrastructure (SRE) board.
Thu, Apr 8, 11:21 · Redis-JobRunner, Infrastructure (SRE)
John moved T7108: Remove abandoned l-unclaimed entries from Incoming to Short Term on the Infrastructure (SRE) board.
Thu, Apr 8, 11:21 · Redis-JobRunner, Infrastructure (SRE)
John moved T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket from Incoming to Short Term on the Infrastructure (SRE) board.
Thu, Apr 8, 11:21 · Infrastructure (SRE)
John added a comment to T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket.

Because of our monitoring, we’re running fairly intensive Lua scripts on almost 100k keys, which can take up to 2 seconds to run. We have set our connectTimeout in Redis to 2s (https://github.com/miraheze/mw-config/blob/master/GlobalCache.php#L48).

Thu, Apr 8, 10:17 · Infrastructure (SRE)
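
The timing problem described in the comment above can be sketched as follows. This is a minimal model, assuming Redis runs commands one at a time, so a command that arrives while a Lua script is executing is answered only once the script finishes. The function names, the `command_cost_s` figure and the sample offsets are hypothetical; only the 2 second script runtime and the 2 second timeout come from the comment, and whether that timeout also governs reads is not stated there.

```python
def reply_delay(script_runtime_s: float, arrival_offset_s: float,
                command_cost_s: float = 0.001) -> float:
    """Seconds a client waits for its reply.

    arrival_offset_s is how long the Lua script had already been running
    when the client's command arrived (hypothetical parameter).
    """
    remaining_script_time = max(0.0, script_runtime_s - arrival_offset_s)
    return remaining_script_time + command_cost_s


def times_out(script_runtime_s: float, arrival_offset_s: float,
              timeout_s: float = 2.0) -> bool:
    """True when the wait exceeds the client's timeout."""
    return reply_delay(script_runtime_s, arrival_offset_s) > timeout_s


# A command sent just as a 2 s script starts waits slightly longer than 2 s.
print(times_out(script_runtime_s=2.0, arrival_offset_s=0.0))
# A command sent half a second into the same script still gets its reply in time.
print(times_out(script_runtime_s=2.0, arrival_offset_s=0.5))
```

Under these assumptions the error would only appear when a command lands near the start of a script that runs close to the full 2 seconds, which would be consistent with it showing up intermittently during a long dump.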
John edited projects for T7112: JobQueueError from line 778 of /srv/mediawiki/w/includes/jobqueue/JobQueueRedis.php: Redis server error: socket error on read socket, added: Infrastructure (SRE); removed Redis-JobRunner.

This concerns the Redis software, not the jobqueue software, as this is run manually, not as a job.

Thu, Apr 8, 10:09 · Infrastructure (SRE)

Wed, Apr 7

John moved T7108: Remove abandoned l-unclaimed entries from To Triage to Bugs on the Redis-JobRunner board.
Wed, Apr 7, 20:31 · Redis-JobRunner, Infrastructure (SRE)
John moved T7107: Remove :rootjobs: periodically from To Triage to Features on the Redis-JobRunner board.
Wed, Apr 7, 20:31 · Redis-JobRunner, Infrastructure (SRE)
John triaged T7108: Remove abandoned l-unclaimed entries as Normal priority.
Wed, Apr 7, 20:31 · Redis-JobRunner, Infrastructure (SRE)
John triaged T7107: Remove :rootjobs: periodically as Low priority.
Wed, Apr 7, 20:26 · Redis-JobRunner, Infrastructure (SRE)

Mon, Apr 5

Paladox added a comment to T4425: Fix all mysql tables that are using latin rather then utf8mb4.

@Southparkfan I'm wondering if I could have assistance on this please? This is a really big change and could lead to data loss.

Mon, Apr 5, 19:09 · Infrastructure (SRE)

Sun, Apr 4

Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

New performance test (using sshfs setup, 4 mydumper threads):

  • Uncompressed: 290 seconds
  • Compressed: 210 seconds
Sun, Apr 4, 22:07 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
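
For reference, the two timings in the performance test above work out as follows. This is plain arithmetic on the reported 290 and 210 seconds; nothing else is assumed.

```python
uncompressed_s = 290  # reported runtime without compression
compressed_s = 210    # reported runtime with compression

saved_s = uncompressed_s - compressed_s
saved_pct = round(100 * saved_s / uncompressed_s, 1)
speedup = round(uncompressed_s / compressed_s, 2)

print(f"Compressed dump saves {saved_s} s ({saved_pct}% less time, {speedup}x faster)")
```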
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

For reference: mydumper is superior to mysqldump due to its better performance (using multiple threads) and the flexibility (PCRE based table inclusion/exclusion) in conjunction with transaction consistency and (almost) no locking (no read-only time required during backups). However, mydumper does not support TLS in connections, so dumping must happen at the database master.

Sun, Apr 4, 21:37 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
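
A mydumper invocation using the features mentioned above (multiple threads, PCRE-based table selection) might be assembled like this. It is a sketch only: the helper `build_mydumper_command` is hypothetical, the regex is an illustrative pattern rather than the one used on these servers, and the flag names (`--host`, `--outputdir`, `--threads`, `--regex`, `--compress`) should be checked against `mydumper --help` for the installed version. The host and output directory are taken from the db11 dump mentioned elsewhere in this feed.

```python
import shlex


def build_mydumper_command(host, outputdir, threads=4, table_regex=None, compress=False):
    """Return the argv list for a mydumper run (nothing is executed here)."""
    argv = [
        "mydumper",
        "--host", host,
        "--outputdir", outputdir,
        "--threads", str(threads),
    ]
    if table_regex is not None:
        # PCRE applied to "database.table" names to include or exclude tables.
        argv += ["--regex", table_regex]
    if compress:
        argv.append("--compress")
    return argv


cmd = build_mydumper_command(
    host="db11",
    outputdir="/srv/backups/db11",
    threads=4,
    table_regex=r"^(?!(mysql\.|test\.))",  # illustrative: skip the mysql and test schemas
    compress=True,
)
print(shlex.join(cmd))
```

Building the argument list rather than a shell string avoids quoting problems with the regex; `shlex.join` is used only to print a copy-pasteable form.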

Wed, Mar 31

Southparkfan added a comment to T7073: Install prometheus-es-exporter for prometheus <-> graylog integration.

Proof of concept:
/etc/prometheus-es-exporter/mediawiki.cfg:

[query_log_mediawiki]
QueryIntervalSecs = 900
QueryIndices = <graylog_deflector>
QueryJson = {
        "size": 0,
        "track_total_hits": true,
        "query": {
            "bool": {
                "must": [
                    {
                        "match": {
                            "application_name": "mediawiki"
                        }
                    }
                ],
                "filter": [
                    {
                        "range": {
                            "timestamp": { "gte": "now-15m", "lte": "now" }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "mediawiki-channels": {
                "terms": {
                    "field": "mediawiki_channel"
                }
            }
        }
    }
Wed, Mar 31, 23:56 · MediaWiki (SRE), Monitoring
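
To show what the query in the proof of concept above yields, the sketch below flattens an Elasticsearch terms-aggregation response into per-channel counts. The `channel_counts` helper and the sample response are hypothetical: the channel names and numbers are invented for illustration, and how prometheus-es-exporter names the resulting metrics is not shown here.

```python
def channel_counts(response: dict, agg_name: str = "mediawiki-channels") -> dict:
    """Map each terms-aggregation bucket key to its document count."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Invented response in the shape Elasticsearch returns for a terms aggregation.
sample_response = {
    "hits": {"total": {"value": 42, "relation": "eq"}},
    "aggregations": {
        "mediawiki-channels": {
            "buckets": [
                {"key": "exception", "doc_count": 30},
                {"key": "error", "doc_count": 12},
            ]
        }
    },
}

for channel, count in channel_counts(sample_response).items():
    print(f"{channel}: {count} log entries in the last 15 minutes")
```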
John added a comment to T7073: Install prometheus-es-exporter for prometheus <-> graylog integration.

Is there a use case for this that the ES data source wouldn’t fulfil? Is this the approach MediaWiki (SRE) wish to take? If so this would fall under the MW team to implement as part of their task as without a use case for Infra, what’s the point in implementing something unused?

Wed, Mar 31, 23:41 · MediaWiki (SRE), Monitoring
Southparkfan triaged T7073: Install prometheus-es-exporter for prometheus <-> graylog integration as Normal priority.
Wed, Mar 31, 23:01 · MediaWiki (SRE), Monitoring
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

More testing is required to determine the final backup sizes.

Wed, Mar 31, 15:10 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.

Wed, Mar 31, 14:27 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec

Mon, Mar 29

Southparkfan added a comment to T4302: Deploy Apache Traffic Server.

In order to do proper backend verification of the certificate (CN), we have tested using ENFORCE. However, the Host header from the client (e.g. allthetropes.org) is used for the CN check at the backend. Therefore, the allthetropes.org certificate would still be mandatory at the backend, even though I would prefer to remove all certificates (including our wildcard one) except one for a single domain (such as ats-internal.miraheze.wiki) from the MediaWiki servers.

Mon, Mar 29, 00:46 · Infrastructure (SRE)
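
The mismatch described in the comment above can be illustrated with a simplified name check. This is a sketch under stated assumptions, not how Apache Traffic Server implements verification: `hostname_matches` is a hypothetical helper that only handles exact names and a single leading wildcard label, and `*.miraheze.org` and `meta.miraheze.org` are example names.

```python
def hostname_matches(cert_name: str, host: str) -> bool:
    """Simplified certificate-name check: exact match or one leading '*' label."""
    cert_labels = cert_name.lower().split(".")
    host_labels = host.lower().split(".")
    if len(cert_labels) != len(host_labels):
        return False
    if cert_labels[0] == "*":
        # A wildcard covers exactly one label, so the rest must match exactly.
        return cert_labels[1:] == host_labels[1:]
    return cert_labels == host_labels


client_host = "allthetropes.org"  # Host header sent by the client

# A backend presenting only the internal certificate fails the check.
print(hostname_matches("ats-internal.miraheze.wiki", client_host))
# Neither would a wildcard for another domain help.
print(hostname_matches("*.miraheze.org", client_host))
# Only a certificate for the custom domain itself passes.
print(hostname_matches("allthetropes.org", client_host))
```

Because the name being checked is the client's Host header rather than a fixed backend name, every custom domain would need its own certificate on the backend under this model.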

Sun, Mar 28

John changed the status of T6984: High load on dbbackup servers, a subtask of T5877: Revise MariaDB backup strategy, from Stalled to Open.
Sun, Mar 28, 23:07 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
John changed the status of T6984: High load on dbbackup servers from Stalled to Open.

Not blocked on external entity

Sun, Mar 28, 23:07 · Database, Monitoring, Infrastructure (SRE)
Southparkfan changed the status of T6984: High load on dbbackup servers, a subtask of T5877: Revise MariaDB backup strategy, from Open to Stalled.
Sun, Mar 28, 22:44 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan changed the status of T6984: High load on dbbackup servers from Open to Stalled.

The future of these servers depends on the outcome of testing regarding T5877#139273.

Sun, Mar 28, 22:44 · Database, Monitoring, Infrastructure (SRE)
Southparkfan changed the edit policy for T7033: Restart services running on older openssl binaries.
Sun, Mar 28, 22:40 · Infrastructure (SRE), Security
Southparkfan changed the visibility for T7033: Restart services running on older openssl binaries.
Sun, Mar 28, 22:40 · Infrastructure (SRE), Security
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

Sun, Mar 28, 22:40 · Infrastructure (SRE), Security
Paladox closed T7033: Restart services running on older openssl binaries as Resolved.

@Southparkfan should we make this task publicly viewable?

Sun, Mar 28, 22:35 · Infrastructure (SRE), Security
Paladox added a comment to T7033: Restart services running on older openssl binaries.

I've restarted both dbbackup1 and dbbackup2. I think we can close this as resolved per your acceptance of a low risk.

Sun, Mar 28, 22:35 · Infrastructure (SRE), Security
Paladox updated the task description for T4302: Deploy Apache Traffic Server.
Sun, Mar 28, 22:33 · Infrastructure (SRE)
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

I recommend rebooting the dbbackup servers. They may or may not be affected by CVE-2021-3450, but as long as these servers are rebooted gracefully, we can survive without them for a few minutes.

Sun, Mar 28, 21:32 · Infrastructure (SRE), Security
John moved T7033: Restart services running on older openssl binaries from Incoming to Short Term on the Infrastructure (SRE) board.
Sun, Mar 28, 19:26 · Infrastructure (SRE), Security
John assigned T7033: Restart services running on older openssl binaries to Southparkfan.
Sun, Mar 28, 19:25 · Infrastructure (SRE), Security
Paladox added a comment to T7033: Restart services running on older openssl binaries.

I've restarted all the servers that I could without causing downtime. I'm waiting on SPF for the next moves.

Sun, Mar 28, 19:23 · Infrastructure (SRE), Security
John removed a project from T7046: New Resource Request for MediaWiki-Extension-Updates: MediaWiki (SRE).
Sun, Mar 28, 19:20 · Infrastructure (SRE)
John closed T7046: New Resource Request for MediaWiki-Extension-Updates as Declined.

We still need something to test on though. I suggest we use test3 at first. We therefore just need to know which db to put the cached info on.

Sun, Mar 28, 19:18 · Infrastructure (SRE)

Sat, Mar 27

John closed T7042: salt-ssh broken due to unknown minion as Invalid.

Sounds like there isn't a problem then?

Sat, Mar 27, 10:15 · Infrastructure (SRE)
John added a comment to T7033: Restart services running on older openssl binaries.

Do we have an update on this? Also, who is taking responsibility for this?

Sat, Mar 27, 10:14 · Infrastructure (SRE), Security

Fri, Mar 26

RhinosF1 added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

We still need something to test on though. I suggest we use test3 at first. We therefore just need to know which db to put the cached info on.

Fri, Mar 26, 18:33 · Infrastructure (SRE)
Universal_Omega added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

The plan for me is that we will have a web-based interface that will list every extension/skin, its latest commit, the commit we have deployed, whether it's i18n only and a link to the diff.

I shared some links yesterday showing how we can get the info.

Fri, Mar 26, 18:31 · Infrastructure (SRE)
RhinosF1 added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

The plan for me is that we will have a web-based interface that will list every extension/skin, its latest commit, the commit we have deployed, whether it's i18n only and a link to the diff.

Fri, Mar 26, 18:26 · Infrastructure (SRE)
Reception123 added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

I think before we consider doing this, we also need to have a feasible implementation plan for a system (so that we don't start on it only for it to end up not being feasible with our resources).

Fri, Mar 26, 18:24 · Infrastructure (SRE)
John reassigned T7046: New Resource Request for MediaWiki-Extension-Updates from John to Reception123.
Fri, Mar 26, 18:24 · Infrastructure (SRE)
RhinosF1 added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

I suggest we start on test3.

Fri, Mar 26, 18:23 · Infrastructure (SRE)