Mar 28 2021
The future of these servers depends on the outcome of testing regarding T5877#139273.
In T7033#139663, @Paladox wrote: @Southparkfan should we make this task publicly viewable?
@Southparkfan should we make this task publicly viewable?
I've restarted both dbbackup1 and dbbackup2. I think we can close this as resolved per your acceptance of a low risk.
I recommend rebooting the dbbackup servers. They may or may not be affected by CVE-2021-3450, but as long as these servers are rebooted gracefully, we can survive without them for a few minutes.
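For context, a graceful reboot here would look roughly like this (a sketch only; the exact stop/start ordering is my assumption, not an agreed procedure):

# on dbbackup1 / dbbackup2, before the reboot
mysql -e "STOP ALL SLAVES;"    # pause replication so the datadir is quiescent
systemctl stop mariadb         # clean shutdown, avoids InnoDB crash recovery on boot
systemctl reboot
# after boot: systemctl start mariadb && mysql -e "START ALL SLAVES;"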
I've restarted all the servers that I could without causing downtime. I'm waiting on SPF for the next moves.
In T7046#139478, @RhinosF1 wrote: We still need something to test on though. I suggest we use test3 at first. We therefore just need to know which db to put the cached info on.
Mar 27 2021
Sounds like there isn't a problem then?
Do we have an update on this? Also, who is taking responsibility for this?
Mar 26 2021
We still need something to test on though. I suggest we use test3 at first. We therefore just need to know which db to put the cached info on.
In T7046#139476, @RhinosF1 wrote: The plan for me is that we will have a web-based interface that will list every extension/skin, its latest commit, the commit we have deployed, whether the change is i18n-only, and a link to the diff.
I shared some links yesterday with how we can get the info.
The plan for me is that we will have a web-based interface that will list every extension/skin, its latest commit, the commit we have deployed, whether the change is i18n-only, and a link to the diff.
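As a rough illustration of how the underlying data could be gathered (a sketch only; the checkout path and upstream branch are assumptions, not our actual layout):

# for each deployed extension/skin checkout (path is hypothetical)
cd /srv/mediawiki/w/extensions/SomeExtension
deployed=$(git rev-parse HEAD)
latest=$(git ls-remote origin refs/heads/master | cut -f1)
git fetch origin
# flag the pending update as i18n-only if no non-i18n files changed
git diff --name-only "$deployed" "$latest" | grep -vq '^i18n/' || echo "i18n-only"

The diff link for the interface would then simply be the compare URL between $deployed and $latest.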
In T7046#139471, @Universal_Omega wrote: I think before we consider doing this, we also need to have a feasible implementation plan for a system. (Meaning that we don't start on it only to find it isn't feasible with our resources.)
I suggest we start on test3.
I think before we consider doing this, we also need to have a feasible implementation plan for a system. (Meaning that we don't start on it only to find it isn't feasible with our resources.)
Before I can review this, more information needs to be provided.
Reinstalled cp3 and it's back online now.
root@puppet3:/home/paladox# salt-ssh -E ".*" cmd.run 'echo test'
cloud3.miraheze.org:
    test
You run salt-ssh -E ".*" cmd.run 'echo test'.
I cannot find the minion in /etc/salt/roster.
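For reference, salt-ssh only knows about hosts listed in that file, so the minion would need an entry roughly like this (hostname and user below are assumptions for illustration):

# append a roster entry on puppet3 (values are hypothetical)
cat >> /etc/salt/roster <<'EOF'
cloud3:
  host: cloud3.miraheze.org
  user: root
  port: 22
EOF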
Servers that haven't been rebooted yet, not counting the critical ones (db1[1-3] / cloud[3-5] / mon2 / ns[12]):
- dbbackup1
- dbbackup2
- mem1
- mem2
RamNode is short on capacity, so we can't resize bacula yet. I hope we can resize the server next week.
Mar 25 2021
Approved, with spending authorisation by @Southparkfan
+$5/mo is approved by me; it only requires John's approval as the EM of Infrastructure.
Approved for cloud4.
Will reuse the main task T4302
A maintenance window is required for dumping from the masters directly: not because impact is guaranteed, but because dumping may hold database locks for multiple seconds, which could increase save times or knock wikis offline.
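For illustration, a dump that keeps locking to a minimum on InnoDB tables could look something like this (a sketch; the host and database names are assumptions):

# --single-transaction reads from a consistent snapshot instead of locking tables,
# --quick streams rows instead of buffering them in memory
mysqldump -h db11.miraheze.org --single-transaction --quick somewiki > somewiki.sql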
Spoke with @Paladox regarding ATS. Installing and testing ATS on test3 is not ideal, since that server is used for MediaWiki tests. Installing a new server as a testing cache proxy has my support, provided that this cache proxy does not receive the 'allow 80/443 tcp' rules yet for security reasons (we have agreed on a security review beforehand).
I've done all servers apart from the critical ones you mentioned (instead I just restarted syslog-ng and, where appropriate, the IRC bots).
19:58:18 <+SPF|Cloud> my advice: reboot all VMs with services that can be depooled and repooled easily, in order to preserve uptime, do it the normal way (adhere to the 5 minutes DNS TTL, depool from varnish, wait until requests have finished, etc)
20:00:49 <+SPF|Cloud> on the critical servers, db1[1-3], cloud[3-5], mon2 and ns[12], restarting syslog-ng / IRC bots is fine, anything else shouldn't be touched (yet)
Mar 24 2021
Mar 23 2021
T4302 - if that task gets declined in the future then this task would need re-opening.
Mar 22 2021
I think we can try out https://github.com/tomcz/openldap_exporter as it is a newer exporter.
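If we go that route, the Prometheus side is just an extra scrape job along these lines (a config sketch; the target host and port are assumptions and need checking against the exporter's defaults):

# fragment for prometheus.yml (hypothetical target)
scrape_configs:
  - job_name: 'openldap'
    static_configs:
      - targets: ['ldap1.miraheze.org:9330']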
I still need to do this.
Reviewing the stats already put up by @Paladox and looking into Dovecot's stats facility in more detail, I don't believe we would gain any new information from Dovecot stats as the Postfix ones already cover all bases of mail, including connections, logins and auth failures.
Mar 21 2021
In T7008#138643, @John wrote: In T7008#138642, @R4356th wrote: In T7008#138619, @John wrote: Unless this happens again, the cost of investigating further may significantly outweigh the benefits, as looking into the full health of the RAM would require extensive downtime of all wikis.
This isn't the first time this happened actually. IIRC, this is the third time.
This is the first time this particular issue has occurred to my knowledge.
In T7008#138642, @R4356th wrote: In T7008#138619, @John wrote: Unless this happens again, the cost of investigating further may significantly outweigh the benefits, as looking into the full health of the RAM would require extensive downtime of all wikis.
This isn't the first time this happened actually. IIRC, this is the third time.
In T7008#138619, @John wrote: Unless this happens again, the cost of investigating further may significantly outweigh the benefits, as looking into the full health of the RAM would require extensive downtime of all wikis.
I much prefer your other suggestion of using MySQL backups, which means we at least have a stable backup even if it's not a replica.
In T6984#138318, @Southparkfan wrote: @Paladox I have disabled c4 replication on dbbackup1, but the lag is not decreasing. It looks like dbbackup1 still doesn't have enough room to replicate a full database cluster. Do you see room for improvements?
Looking into this, the MySQL server crashed because of a memory issue around 02:02, which matches up with the error reports.
Mar 19 2021
In T6759#138382, @Southparkfan wrote: I've finally found the ticket, pasting my IRC comment here:
22:37:46 <+SPF|Cloud> @SRE, I can't recall who was talking about it (and where I read it), but I saw some messages regarding automating the addition of a new certificate (for https). have you considered https://wikitech.wikimedia.org/wiki/Acme-chief?
I've finally found the ticket, pasting my IRC comment here:
22:37:46 <+SPF|Cloud> @SRE, I can't recall who was talking about it (and where I read it), but I saw some messages regarding automating the addition of a new certificate (for https). have you considered https://wikitech.wikimedia.org/wiki/Acme-chief?
Mar 18 2021
It may be possible to dump directly from the masters with very little interruption: https://stackoverflow.com/q/56715657.
In that case, we can use the RamNode VMs to store the logical dumps (mydumper to stdout, piped over SSH to a local file). The disadvantage is that we won't have a live replica at all times (if a master crashes for good, the data between <most recent backup> and <crash> will be lost), but it's much cheaper: the I/O limit is not much of an issue and, since data is not replicated, there is more space for storing logical dumps.
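Roughly what that pipeline could look like (a sketch only; I've used mysqldump here because mydumper writes to a directory rather than stdout, and the hostnames/paths are assumptions):

# run on the RamNode VM: pull a compressed logical dump straight onto local disk
ssh db11.miraheze.org "mysqldump --single-transaction --all-databases | gzip" > /srv/backups/c4-$(date +%F).sql.gz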
Available exporters: https://github.com/jcollie/openldap_exporter https://github.com/tomcz/openldap_exporter
WMF (for dashboard examples): https://grafana.wikimedia.org/d/DnxQ26qmk/ldap?orgId=1 / https://phabricator.wikimedia.org/T181511
@Paladox I have disabled c4 replication on dbbackup1, but the lag is not decreasing. It looks like dbbackup1 still doesn't have enough room to replicate a full database cluster. Do you see room for improvements?
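For reference, with MariaDB multi-source replication the per-cluster connection can be inspected and paused like this (the 'c4' connection name is taken from the comment above; treat the exact commands as a sketch):

# on dbbackup1
mysql -e "SHOW SLAVE 'c4' STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
mysql -e "STOP SLAVE 'c4';"    # pauses only the c4 connection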
Analysing the queries from a binlog (c4):
mysqlbinlog mysql-bin.001818 | grep '::' > ~/analyse-queries-T6984.txt
(any query that has '::' in it usually includes the PHP caller as SQL comments)
Investigating load on dbbackup1.
Mar 17 2021
Increasing mem[12]'s memory to 10 GB is approved.