Mar 28 2021

John changed the status of T6984: High load on dbbackup servers, a subtask of T5877: Revise MariaDB backup strategy, from Stalled to Open.
Mar 28 2021, 23:07 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
John changed the status of T6984: High load on dbbackup servers from Stalled to Open.

Not blocked on external entity

Mar 28 2021, 23:07 · Database, Monitoring, Infrastructure (SRE)
Southparkfan changed the status of T6984: High load on dbbackup servers, a subtask of T5877: Revise MariaDB backup strategy, from Open to Stalled.
Mar 28 2021, 22:44 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan changed the status of T6984: High load on dbbackup servers from Open to Stalled.

The future of these servers depends on the outcome of testing regarding T5877#139273.

Mar 28 2021, 22:44 · Database, Monitoring, Infrastructure (SRE)
Southparkfan changed the edit policy for T7033: Restart services running on older openssl binaries.
Mar 28 2021, 22:40 · Infrastructure (SRE), Security
Southparkfan changed the visibility for T7033: Restart services running on older openssl binaries.
Mar 28 2021, 22:40 · Infrastructure (SRE), Security
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

@Southparkfan should we make this task publicly viewable?

Mar 28 2021, 22:40 · Infrastructure (SRE), Security
Paladox closed T7033: Restart services running on older openssl binaries as Resolved.

@Southparkfan should we make this task publicly viewable?

Mar 28 2021, 22:35 · Infrastructure (SRE), Security
Paladox added a comment to T7033: Restart services running on older openssl binaries.

I've restarted both dbbackup1 and dbbackup2. I think we can close this as resolved per your acceptance of a low risk.

Mar 28 2021, 22:35 · Infrastructure (SRE), Security
Paladox updated the task description for T4302: Deploy Apache Traffic Server.
Mar 28 2021, 22:33 · Goal-2021-Jul-Dec, Infrastructure (SRE)
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

I recommend rebooting the dbbackup servers. They may or may not be affected by CVE-2021-3450, but as long as these servers are rebooted gracefully, we can survive without them for a few minutes.

Mar 28 2021, 21:32 · Infrastructure (SRE), Security
John moved T7033: Restart services running on older openssl binaries from Incoming to Short Term on the Infrastructure (SRE) board.
Mar 28 2021, 19:26 · Infrastructure (SRE), Security
John assigned T7033: Restart services running on older openssl binaries to Southparkfan.
Mar 28 2021, 19:25 · Infrastructure (SRE), Security
Paladox added a comment to T7033: Restart services running on older openssl binaries.

I've restarted all the servers that I could without causing downtime. I'm waiting on SPF for the next moves.

Mar 28 2021, 19:23 · Infrastructure (SRE), Security
John removed a project from T7046: New Resource Request for MediaWiki-Extension-Updates: MediaWiki (SRE).
Mar 28 2021, 19:20 · Infrastructure (SRE)
John closed T7046: New Resource Request for MediaWiki-Extension-Updates as Declined.

We still need something to test on though. I suggest we use test3 at first. We therefore just need to know which db to put the cached info on.

Mar 28 2021, 19:18 · Infrastructure (SRE)

Mar 27 2021

John closed T7042: salt-ssh broken due to unknown minion as Invalid.

Sounds like there isn't a problem then?

Mar 27 2021, 10:15 · Infrastructure (SRE)
John added a comment to T7033: Restart services running on older openssl binaries.

Do we have an update on this? Also, who is taking responsibility for this?

Mar 27 2021, 10:14 · Infrastructure (SRE), Security

Mar 26 2021

RhinosF1 added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

We still need something to test on though. I suggest we use test3 at first. We therefore just need to know which db to put the cached info on.

Mar 26 2021, 18:33 · Infrastructure (SRE)
Unknown Object (User) added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

The plan for me is that we will have a web-based interface that will list every extension/skin, its latest commit, the commit we have deployed, whether it's i18n-only and a link to the diff.

I shared some links yesterday with how we can get the info.

Mar 26 2021, 18:31 · Infrastructure (SRE)
RhinosF1 added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

The plan for me is that we will have a web-based interface that will list every extension/skin, its latest commit, the commit we have deployed, whether it's i18n-only and a link to the diff.

Mar 26 2021, 18:26 · Infrastructure (SRE)
Reception123 added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

I think before we consider doing this, we also need to have a feasible implementation plan for a system. (Meaning that we don't start on it only for it to end up not being feasible with our resources.)

Mar 26 2021, 18:24 · Infrastructure (SRE)
John reassigned T7046: New Resource Request for MediaWiki-Extension-Updates from John to Reception123.
Mar 26 2021, 18:24 · Infrastructure (SRE)
RhinosF1 added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

I suggest we start on test3.

Mar 26 2021, 18:23 · Infrastructure (SRE)
Unknown Object (User) added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

I think before we consider doing this, we also need to have a feasible implementation plan for a system. (Meaning that we don't start on it only for it to end up not being feasible with our resources.)

Mar 26 2021, 18:22 · Infrastructure (SRE)
John added a comment to T7046: New Resource Request for MediaWiki-Extension-Updates.

Before I can review this, more information needs to be provided.

Mar 26 2021, 18:18 · Infrastructure (SRE)
RhinosF1 created T7046: New Resource Request for MediaWiki-Extension-Updates.
Mar 26 2021, 16:24 · Infrastructure (SRE)
Paladox updated the task description for T4302: Deploy Apache Traffic Server.
Mar 26 2021, 16:09 · Goal-2021-Jul-Dec, Infrastructure (SRE)
Unknown Object (User) raised the priority of T4302: Deploy Apache Traffic Server from Low to Normal.
Mar 26 2021, 16:08 · Goal-2021-Jul-Dec, Infrastructure (SRE)
Paladox closed T7034: cp3 has gone down as Resolved.

Reinstalled cp3 and it's back online now.

Mar 26 2021, 15:37 · Infrastructure (SRE)
Paladox added a comment to T7042: salt-ssh broken due to unknown minion.
root@puppet3:/home/paladox# salt-ssh -E ".*" cmd.run 'echo test'
cloud3.miraheze.org:
    test
Mar 26 2021, 14:14 · Infrastructure (SRE)
Paladox added a comment to T7042: salt-ssh broken due to unknown minion.

You run salt-ssh -E ".*" cmd.run 'echo test'.

Mar 26 2021, 14:14 · Infrastructure (SRE)
Southparkfan added a comment to T7042: salt-ssh broken due to unknown minion.

I cannot find the minion in /etc/salt/roster.

Mar 26 2021, 12:37 · Infrastructure (SRE)
Southparkfan updated the task description for T7042: salt-ssh broken due to unknown minion.
Mar 26 2021, 12:33 · Infrastructure (SRE)
Southparkfan triaged T7042: salt-ssh broken due to unknown minion as High priority.
Mar 26 2021, 12:33 · Infrastructure (SRE)
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

Servers that haven't been rebooted, except for db1[1-3] / cloud[3-5] / mon2 / ns[12]:

  • dbbackup1
  • dbbackup2
  • mem1
  • mem2
Mar 26 2021, 12:31 · Infrastructure (SRE), Security
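
The host shorthand used in the comment above (db1[1-3], cloud[3-5], ns[12]) can be expanded into individual host names mechanically. The following is a minimal sketch, assuming the brackets behave like a single-character class with optional ranges; the function name expand_hosts is hypothetical and not part of any Miraheze tooling.

```python
import re


def expand_hosts(pattern):
    """Expand one bracket shorthand such as 'db1[1-3]' into host names.

    Assumption: a bracket group is a single-character class, where 'a-b'
    denotes an inclusive range and any other character stands for itself.
    """
    match = re.search(r"\[([^\]]+)\]", pattern)
    if not match:
        return [pattern]  # nothing to expand, e.g. 'mon2'

    body = match.group(1)
    chars = []
    i = 0
    while i < len(body):
        # 'x-y' inside the brackets is an inclusive character range.
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1

    head, tail = pattern[: match.start()], pattern[match.end():]
    hosts = []
    for char in chars:
        # Recurse so that a pattern with several bracket groups also works.
        hosts.extend(expand_hosts(head + char + tail))
    return hosts


if __name__ == "__main__":
    for shorthand in ["db1[1-3]", "cloud[3-5]", "mon2", "ns[12]"]:
        print(shorthand, "->", expand_hosts(shorthand))
```

Under that assumption, db1[1-3] stands for db11, db12 and db13, and ns[12] stands for ns1 and ns2.
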
Southparkfan added a comment to T7038: Existing Server Resource Request for bacula2.

RamNode is short on capacity, so we can't resize bacula yet. I hope we can resize the server next week.

Mar 26 2021, 12:14 · Infrastructure (SRE)

Mar 25 2021

John closed T7038: Existing Server Resource Request for bacula2 as Resolved.

Approved, with spending authorisation by @Southparkfan

Mar 25 2021, 22:41 · Infrastructure (SRE)
Southparkfan added a comment to T7038: Existing Server Resource Request for bacula2.

+$5/mo is approved by me; it only requires John's approval as the EM of Infrastructure.

Mar 25 2021, 22:40 · Infrastructure (SRE)
Southparkfan updated the task description for T7038: Existing Server Resource Request for bacula2.
Mar 25 2021, 22:39 · Infrastructure (SRE)
Southparkfan created T7038: Existing Server Resource Request for bacula2.
Mar 25 2021, 22:39 · Infrastructure (SRE)
John edited P386 Resources Table.
Mar 25 2021, 22:32 · Cloud Infrastructure, Infrastructure (SRE)
John closed T7037: [New] Server Resource Request for ats as Resolved.

Approved for cloud4.

Mar 25 2021, 22:32 · Infrastructure (SRE)
Paladox updated the task description for T7037: [New] Server Resource Request for ats.
Mar 25 2021, 22:26 · Infrastructure (SRE)
Paladox updated the task description for T7037: [New] Server Resource Request for ats.
Mar 25 2021, 22:21 · Infrastructure (SRE)
Paladox updated the task description for T7037: [New] Server Resource Request for ats.
Mar 25 2021, 22:21 · Infrastructure (SRE)
Paladox renamed T4302: Deploy Apache Traffic Server from Experiment with Apache Traffic Server to Deploy Apache Traffic Server.
Mar 25 2021, 22:20 · Goal-2021-Jul-Dec, Infrastructure (SRE)
Paladox removed a subtask for T7036: Deploy test vm for ats: T7037: [New] Server Resource Request for ats.
Mar 25 2021, 22:16 · Goal-2021-Jan-Jun, Infrastructure (SRE)
Paladox removed a parent task for T7037: [New] Server Resource Request for ats: T7036: Deploy test vm for ats.
Mar 25 2021, 22:16 · Infrastructure (SRE)
Paladox changed the status of T7036: Deploy test vm for ats, a subtask of T7035: GOAL: Deploy Apache Traffic Server, from Declined to Invalid.
Mar 25 2021, 22:16 · Goal-2021-Jan-Jun, Infrastructure (SRE)
Paladox changed the status of T7036: Deploy test vm for ats from Declined to Invalid.
Mar 25 2021, 22:16 · Goal-2021-Jan-Jun, Infrastructure (SRE)
Paladox closed T7035: GOAL: Deploy Apache Traffic Server as Invalid.

Will reuse the main task T4302

Mar 25 2021, 22:16 · Goal-2021-Jan-Jun, Infrastructure (SRE)
Paladox closed T7036: Deploy test vm for ats, a subtask of T7035: GOAL: Deploy Apache Traffic Server, as Declined.
Mar 25 2021, 22:15 · Goal-2021-Jan-Jun, Infrastructure (SRE)
Paladox closed T7036: Deploy test vm for ats as Declined.

Will reuse the main task T4302

Mar 25 2021, 22:15 · Goal-2021-Jan-Jun, Infrastructure (SRE)
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.

Mar 25 2021, 22:08 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Paladox claimed T7034: cp3 has gone down.
Mar 25 2021, 22:01 · Infrastructure (SRE)
Southparkfan added a comment to T7037: [New] Server Resource Request for ats.

Spoke with @Paladox regarding ATS. Installing and testing ATS on test3 is not ideal, since that server is used for MediaWiki tests. Installing a new server as a testing cache proxy, granted that this cache proxy may not receive the 'allow 80/443 tcp' rules yet due to security reasons (we have agreed on a security review beforehand), has my support.

Mar 25 2021, 21:57 · Infrastructure (SRE)
Paladox added a parent task for T7037: [New] Server Resource Request for ats: T7036: Deploy test vm for ats.
Mar 25 2021, 21:53 · Infrastructure (SRE)
Paladox added a subtask for T7036: Deploy test vm for ats: T7037: [New] Server Resource Request for ats.
Mar 25 2021, 21:53 · Goal-2021-Jan-Jun, Infrastructure (SRE)
Paladox created T7037: [New] Server Resource Request for ats.
Mar 25 2021, 21:52 · Infrastructure (SRE)
Paladox triaged T7036: Deploy test vm for ats as Normal priority.
Mar 25 2021, 21:52 · Goal-2021-Jan-Jun, Infrastructure (SRE)
Paladox triaged T7035: GOAL: Deploy Apache Traffic Server as Normal priority.
Mar 25 2021, 21:50 · Goal-2021-Jan-Jun, Infrastructure (SRE)
Paladox added a comment to T7033: Restart services running on older openssl binaries.

I've done all servers apart from the critical ones you mentioned (instead I just restarted syslog-ng and, where appropriate, the IRC bots).

Mar 25 2021, 19:59 · Infrastructure (SRE), Security
Paladox triaged T7034: cp3 has gone down as Unbreak Now! priority.
Mar 25 2021, 19:16 · Infrastructure (SRE)
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.
19:58:18 <+SPF|Cloud> my advice: reboot all VMs with services that can be depooled and repooled easily, in order to preserve uptime, do it the normal way (adhere to the 5 minutes DNS TTL, depool from varnish, wait until requests have finished, etc)
20:00:49 <+SPF|Cloud> on the critical servers, db1[1-3], cloud[3-5], mon2 and ns[12], restarting syslog-ng / IRC bots is fine, anything else shouldn't be touched (yet)
Mar 25 2021, 19:05 · Infrastructure (SRE), Security
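
The depool, wait, reboot, repool order advised in the IRC excerpt above can be sketched as a small loop. This is only an illustration of the ordering: the depool, reboot and repool callables are hypothetical placeholders supplied by the caller, and the 300 second wait is taken from the 5 minute DNS TTL mentioned in the advice.

```python
import time

DNS_TTL_SECONDS = 300  # the 5 minute DNS TTL mentioned in the advice


def rolling_reboot(hosts, depool, reboot, repool,
                   wait=time.sleep, drain_seconds=DNS_TTL_SECONDS):
    """Reboot hosts one at a time so the remaining ones keep serving.

    depool, reboot and repool are caller-supplied callables (hypothetical
    here); wait is injectable so the loop can be dry-run without sleeping.
    """
    for host in hosts:
        depool(host)          # stop sending new requests to this host
        wait(drain_seconds)   # let the DNS TTL expire and in-flight requests finish
        reboot(host)
        repool(host)          # return the host to service before moving on


if __name__ == "__main__":
    # Dry run with hypothetical host names; nothing is actually rebooted.
    rolling_reboot(
        ["example1", "example2"],
        depool=lambda h: print("depool", h),
        reboot=lambda h: print("reboot", h),
        repool=lambda h: print("repool", h),
        wait=lambda s: print("wait", s, "seconds"),
    )
```

Handling one host at a time is the design choice that preserves uptime: a host only leaves the pool after the previous one has been repooled.
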
Southparkfan updated the task description for T7033: Restart services running on older openssl binaries.
Mar 25 2021, 18:50 · Infrastructure (SRE), Security
John added a project to T7033: Restart services running on older openssl binaries: Infrastructure (SRE).
Mar 25 2021, 18:42 · Infrastructure (SRE), Security

Mar 24 2021

Southparkfan lowered the priority of T6984: High load on dbbackup servers from High to Normal.
Mar 24 2021, 12:20 · Database, Monitoring, Infrastructure (SRE)

Mar 23 2021

John closed T4191: Redesign compression of content inside NGINX and Varnish as Declined.

T4302 - if that task gets declined in the future then this task would need re-opening.

Mar 23 2021, 17:09 · Infrastructure (SRE), Varnish

Mar 22 2021

John assigned T4302: Deploy Apache Traffic Server to Paladox.
Mar 22 2021, 20:29 · Goal-2021-Jul-Dec, Infrastructure (SRE)
Paladox added a comment to T6975: LDAP Statistics.

I think we can try out https://github.com/tomcz/openldap_exporter as it is a newer exporter.

Mar 22 2021, 20:11 · Monitoring, Infrastructure (SRE)
Paladox added a comment to T6975: LDAP Statistics.

I still need to do this.

Mar 22 2021, 20:11 · Monitoring, Infrastructure (SRE)
John claimed T5397: Create a logbot for server actions.
Mar 22 2021, 20:09 · Infrastructure (SRE)
John closed T6976: General Mail Statistics as Resolved.

Reviewing the stats already put up by @Paladox and looking into Dovecot's stats facility in more detail, I don't believe we would gain any new information from Dovecot stats as the Postfix ones already cover all bases of mail, including connections, logins and auth failures.

Mar 22 2021, 13:00 · Monitoring, Mail, Infrastructure (SRE)

Mar 21 2021

John changed the start date for E239: Infrastructure SRE Weekly Meeting from Mar 21 2021, 20:00 to Mar 22 2021, 20:00.
Mar 21 2021, 18:25 · Infrastructure (SRE)
John changed the start date for E239: Infrastructure SRE Weekly Meeting from Mar 21 2021, 19:00 to Mar 21 2021, 20:00.
Mar 21 2021, 18:24 · Infrastructure (SRE)
Paladox is attending E239: Infrastructure SRE Weekly Meeting.
Mar 21 2021, 18:23 · Infrastructure (SRE)
John set E239: Infrastructure SRE Weekly Meeting to repeat weekly.
Mar 21 2021, 18:22 · Infrastructure (SRE)
John created E239: Infrastructure SRE Weekly Meeting.
Mar 21 2021, 18:22 · Infrastructure (SRE)
Redmin added a comment to T7008: Investigate database server/cache proxy issues and extreme load times this evening.
In T7008#138643, @John wrote:
In T7008#138642, @R4356th wrote:
In T7008#138619, @John wrote:

Unless this happens again, the cost of investigating further may significantly outweigh the benefits, as looking into the full health of the RAM would require extensive downtime of all wikis.

This isn't the first time this happened actually. IIRC, this is the third time.

This is the first time this particular issue has occurred to my knowledge.

Mar 21 2021, 16:25 · Infrastructure (SRE)
John added a comment to T7008: Investigate database server/cache proxy issues and extreme load times this evening.
In T7008#138642, @R4356th wrote:
In T7008#138619, @John wrote:

Unless this happens again, the cost of investigating further may significantly outweigh the benefits, as looking into the full health of the RAM would require extensive downtime of all wikis.

This isn't the first time this happened actually. IIRC, this is the third time.

Mar 21 2021, 16:09 · Infrastructure (SRE)
Redmin added a comment to T7008: Investigate database server/cache proxy issues and extreme load times this evening.
In T7008#138619, @John wrote:

Unless this happens again, the cost of investigating further may significantly outweigh the benefits, as looking into the full health of the RAM would require extensive downtime of all wikis.

Mar 21 2021, 16:04 · Infrastructure (SRE)
Paladox added a comment to T6984: High load on dbbackup servers.

I much prefer your other suggestion of using mysql backup, which means we at least have a stable backup even if it's not a replica.

Mar 21 2021, 15:03 · Database, Monitoring, Infrastructure (SRE)
Paladox added a comment to T6984: High load on dbbackup servers.

@Paladox I have disabled c4 replication on dbbackup1, but the lag is not decreasing. It looks like dbbackup1 still doesn't have enough room to replicate a full database cluster. Do you see room for improvements?

Mar 21 2021, 15:01 · Database, Monitoring, Infrastructure (SRE)
John closed T7008: Investigate database server/cache proxy issues and extreme load times this evening as Resolved.

Looking into this, the MySQL server crashed because of a memory issue around 02:02, which matches the error reports.

Mar 21 2021, 13:15 · Infrastructure (SRE)
HooleHistory moved T6759: Automate the adding of SSL private keys to puppet3 from Waiting on response to Backlog on the SSL board.
Mar 21 2021, 11:58 · SRE Automation, Goal-2021-Jul-Dec, Infrastructure (SRE), SSL
HooleHistory moved T6759: Automate the adding of SSL private keys to puppet3 from Backlog to Waiting on response on the SSL board.
Mar 21 2021, 11:57 · SRE Automation, Goal-2021-Jul-Dec, Infrastructure (SRE), SSL
Dmehus triaged T7008: Investigate database server/cache proxy issues and extreme load times this evening as High priority.
Mar 21 2021, 02:30 · Infrastructure (SRE)

Mar 19 2021

John added a comment to T6759: Automate the adding of SSL private keys to puppet3.

I've finally found the ticket, pasting my IRC comment here:
22:37:46 <+SPF|Cloud> @SRE, I can't recall who was talking about it (and where I read it), but I saw some messages regarding automating the addition of a new certificate (for https). have you considered https://wikitech.wikimedia.org/wiki/Acme-chief?

Mar 19 2021, 18:31 · SRE Automation, Goal-2021-Jul-Dec, Infrastructure (SRE), SSL
Southparkfan placed T4191: Redesign compression of content inside NGINX and Varnish up for grabs.
Mar 19 2021, 16:42 · Infrastructure (SRE), Varnish
Southparkfan added a comment to T6759: Automate the adding of SSL private keys to puppet3.

I've finally found the ticket, pasting my IRC comment here:
22:37:46 <+SPF|Cloud> @SRE, I can't recall who was talking about it (and where I read it), but I saw some messages regarding automating the addition of a new certificate (for https). have you considered https://wikitech.wikimedia.org/wiki/Acme-chief?

Mar 19 2021, 16:29 · SRE Automation, Goal-2021-Jul-Dec, Infrastructure (SRE), SSL

Mar 18 2021

Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

It may be possible to dump directly from the masters with very little interruption: https://stackoverflow.com/q/56715657.
In that case, we can use the RamNode VMs to store the logical dumps (mydumper to stdout | ssh - local file). The disadvantage is that we won't have a live replica at all times (if a master crashes for good, the data between <most recent backup> and <crash> will be lost), but it's much cheaper: I/O limit is not much of an issue and since data is not replicated, there is more space for storing logical dumps.

Mar 18 2021, 23:08 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T6975: LDAP Statistics.

Available exporters: https://github.com/jcollie/openldap_exporter https://github.com/tomcz/openldap_exporter
WMF (for dashboard examples): https://grafana.wikimedia.org/d/DnxQ26qmk/ldap?orgId=1 / https://phabricator.wikimedia.org/T181511

Mar 18 2021, 21:26 · Monitoring, Infrastructure (SRE)
Southparkfan added a comment to T6984: High load on dbbackup servers.

@Paladox I have disabled c4 replication on dbbackup1, but the lag is not decreasing. It looks like dbbackup1 still doesn't have enough room to replicate a full database cluster. Do you see room for improvements?

Mar 18 2021, 21:21 · Database, Monitoring, Infrastructure (SRE)
Southparkfan added a comment to T6984: High load on dbbackup servers.

Analysing the queries from a binlog (c4):
mysqlbinlog mysql-bin.001818 | grep '::' > ~/analyse-queries-T6984.txt
(any query that has '::' in it usually includes the PHP caller as SQL comments)

Mar 18 2021, 13:54 · Database, Monitoring, Infrastructure (SRE)
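
Once the filtered file from the command above exists, the callers can be tallied to see which code paths generate the most writes. The following is a minimal sketch, assuming the caller appears as a leading SQL comment of the form /* Class::method ... */; the function name count_callers and the sample lines are hypothetical, not taken from the actual binlog.

```python
import re
from collections import Counter

# Assumption: the PHP caller is the first 'Class::method' token inside a
# '/* ... */' SQL comment on the query line.
CALLER_RE = re.compile(r"/\*\s*([\w\\]+::[\w\\{}]+)")


def count_callers(lines):
    """Count how often each 'Class::method' caller appears in query lines."""
    counts = Counter()
    for line in lines:
        match = CALLER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; in practice, iterate over the lines of
    # ~/analyse-queries-T6984.txt instead.
    sample = [
        "UPDATE /* ExampleUpdate::doUpdate */ example_table SET x = x + 1",
        "INSERT /* ExampleJob::run */ INTO example_log VALUES (1)",
        "UPDATE /* ExampleUpdate::doUpdate */ example_table SET x = x + 1",
    ]
    for caller, hits in count_callers(sample).most_common():
        print(hits, caller)
```

Sorting by frequency gives a quick first view of which callers dominate the replicated write load.
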
Southparkfan added a subtask for T5877: Revise MariaDB backup strategy: T6984: High load on dbbackup servers.
Mar 18 2021, 13:36 · Goal-2021-Jul-Dec, Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a parent task for T6984: High load on dbbackup servers: T5877: Revise MariaDB backup strategy.
Mar 18 2021, 13:36 · Database, Monitoring, Infrastructure (SRE)
Southparkfan added a comment to T6984: High load on dbbackup servers.

Investigating load on dbbackup1.

Mar 18 2021, 13:36 · Database, Monitoring, Infrastructure (SRE)

Mar 17 2021

John edited P386 Resources Table.
Mar 17 2021, 19:31 · Cloud Infrastructure, Infrastructure (SRE)
John closed T6993: [Existing] Server Resource Request for mem[12] as Resolved.

Increasing mem[12]'s memory to 10GB approved.

Mar 17 2021, 19:30 · Infrastructure (SRE)