
Southparkfan (Ferran Tufan)
Director of Site Reliability Engineering · Administrator

User Details

User Since
Apr 17 2016, 19:18 (259 w, 6 d)
Roles
Administrator
Availability
Available
IRC Nickname
SPF|Cloud
GitHub User
Southparkfan
Miraheze User
Southparkfan [ Global Accounts ]

Hi! I am Southparkfan, co-founder and system administrator for Miraheze. I am responsible for the smooth operation of Miraheze's servers, which includes applying configuration changes, conducting maintenance and incident investigations, performance tuning, monitoring the servers and other miscellaneous tasks.

You can usually find me on IRC in the #miraheze channel on chat.freenode.net.

Recent Activity

Fri, Apr 9

Southparkfan updated subscribers of T5877: Revise MariaDB backup strategy.

Running dump from db11 to dbbackup1:/srv/backups/db11. @Paladox and I are around to monitor.

Fri, Apr 9, 22:21 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec

Tue, Apr 6

Reception123 awarded Blog Post: An interview with Co-Founder Ferran Tufan a Mountain of Wealth token.
Tue, Apr 6, 11:24 · Site Reliability Engineering
Southparkfan published Blog Post: An interview with Co-Founder Ferran Tufan.
Tue, Apr 6, 10:31 · Site Reliability Engineering

Mon, Apr 5

Southparkfan added a reverting change for rPUPCbb1ec901f84f: Revert "Varnish: block Googlebot requests with specific parameter": rPUPC0b1a5da44601: Revert "Revert "Varnish: block Googlebot requests with specific parameter"".
Mon, Apr 5, 15:36
Southparkfan committed rPUPC0b1a5da44601: Revert "Revert "Varnish: block Googlebot requests with specific parameter"" (authored by Southparkfan).
Revert "Revert "Varnish: block Googlebot requests with specific parameter""
Mon, Apr 5, 15:36
Southparkfan added a reverting change for rPUPCc32785b0fa6f: Varnish: block Googlebot requests with specific parameter: rPUPCbb1ec901f84f: Revert "Varnish: block Googlebot requests with specific parameter".
Mon, Apr 5, 15:20
Southparkfan committed rPUPCbb1ec901f84f: Revert "Varnish: block Googlebot requests with specific parameter" (authored by Southparkfan).
Revert "Varnish: block Googlebot requests with specific parameter"
Mon, Apr 5, 15:20

Sun, Apr 4

Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

New performance test (using sshfs setup, 4 mydumper threads):

  • Uncompressed: 290 seconds
  • Compressed: 210 seconds
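
To put the two timings side by side, a minimal sketch in Python (the helper name is illustrative and not part of any backup tooling):

```python
# Timings quoted above (sshfs setup, 4 mydumper threads).
uncompressed_s = 290
compressed_s = 210


def relative_saving(baseline: float, candidate: float) -> float:
    """Return the fraction of the baseline time saved by the candidate run."""
    return (baseline - candidate) / baseline


saving = relative_saving(uncompressed_s, compressed_s)
speedup = uncompressed_s / compressed_s
print(f"compressed run is {saving:.1%} shorter ({speedup:.2f}x faster)")
```

In this single test the compressed dump takes roughly a quarter less time. More runs would be needed before treating that as a stable figure.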
Sun, Apr 4, 22:07 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

For reference: mydumper is superior to mysqldump because of its better performance (multiple threads) and its flexibility (PCRE-based table inclusion/exclusion), combined with transactional consistency and (almost) no locking (no read-only time is required during backups). However, mydumper does not support TLS connections, so dumping must happen on the database master.
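
To illustrate the regex-based inclusion/exclusion, here is a small sketch that filters `db.table` names with a pattern. The table names and the pattern are made up for the example, and Python's `re` module stands in for PCRE (the lookaheads used here behave the same in both):

```python
import re


def select_tables(tables, pattern):
    """Keep only the 'db.table' names that match the given pattern."""
    rx = re.compile(pattern)
    return [name for name in tables if rx.search(name)]


# Hypothetical table list, for illustration only.
tables = [
    "metawiki.page",
    "metawiki.objectcache",
    "mysql.user",
    "testwiki.revision",
]

# Skip the mysql schema and every objectcache table.
pattern = r"^(?!mysql\.)(?!.*\.objectcache$)"
print(select_tables(tables, pattern))
```

A pattern of this shape would be passed to the dump tool's regex option. The exact pattern to use depends on which tables are worth backing up.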

Sun, Apr 4, 21:37 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan raised the priority of T6900: Create draft of Data Processing Inventory from Low to Normal.
Sun, Apr 4, 20:35 · Site Reliability Engineering

Fri, Apr 2

Southparkfan added a comment to T7087: Add (rolling average) response time to grafana.

We have the blackbox exporter for this. Can we help you by monitoring specific URLs?

As mentioned in the task, /healthcheck is the biggest one because it has an effect on uptime if that gets too high.

/healthcheck = Meta's Main Page. We're already monitoring that.

I would recommend we do one that loads quite a few resources (eg. Images, javascript etc)

The blackbox exporter does not monitor subsequent requests, such as resources (images?) used on an article. We can monitor that though, but you'll need to provide specific URLs. :)
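
For the "rolling average" part of the task, the idea can be sketched in a few lines of Python. This is only an illustration of the calculation over a fixed window of probe results. The sample values are invented, and in practice the averaging would more likely be done in the dashboard query than in custom code:

```python
from collections import deque


class RollingAverage:
    """Average of the most recent `window` samples."""

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        self._samples = deque(maxlen=window)

    def add(self, value: float) -> float:
        """Record a sample and return the current rolling average."""
        self._samples.append(value)
        return sum(self._samples) / len(self._samples)


avg = RollingAverage(window=3)
for response_ms in [120.0, 150.0, 90.0, 300.0]:  # invented probe timings
    current = avg.add(response_ms)
print(current)
```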

Fri, Apr 2, 21:50 · MediaWiki (SRE), MediaWiki, Monitoring
Southparkfan added a comment to T7087: Add (rolling average) response time to grafana.

We have the blackbox exporter for this. Can we help you by monitoring specific URLs?

Fri, Apr 2, 21:45 · MediaWiki (SRE), MediaWiki, Monitoring

Thu, Apr 1

Southparkfan added a comment to T7073: Install prometheus-es-exporter for prometheus <-> graylog integration.
In T7073#140060, @John wrote:

Is there a use case for this that the ES data source wouldn’t fulfil? Is this the approach MediaWiki (SRE) wish to take? If so this would fall under the MW team to implement as part of their task as without a use case for Infra, what’s the point in implementing something unused?

There are more use cases than MediaWiki only. For example, I would like to monitor SSH authentication attempts and access logs of non-MediaWiki services, which is a task for us, not for the MediaWiki team. The proof of concept above was tailored for MediaWiki logs, because said logs have a higher priority.

Thu, Apr 1, 00:12 · MediaWiki (SRE), Monitoring

Wed, Mar 31

Southparkfan added a comment to T7073: Install prometheus-es-exporter for prometheus <-> graylog integration.

Proof of concept:
/etc/prometheus-es-exporter/mediawiki.cfg:

[query_log_mediawiki]
QueryIntervalSecs = 900
QueryIndices = <graylog_deflector>
QueryJson = {
    "size": 0,
    "track_total_hits": true,
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "application_name": "mediawiki"
                    }
                }
            ],
            "filter": [
                {
                    "range": {
                        "timestamp": { "gte": "now-15m", "lte": "now" }
                    }
                }
            ]
        }
    },
    "aggs": {
        "mediawiki-channels": {
            "terms": {
                "field": "mediawiki_channel"
            }
        }
    }
    }
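
A quick way to sanity-check a file like this before deploying it is to parse it and confirm the embedded JSON is valid. The sketch below uses Python's `configparser`, which accepts indented continuation lines (note that the closing brace must stay indented). It uses a shortened sample config, and treating every `query_`-prefixed section as a query is an assumption based on the section name above, not a statement about how the exporter itself parses the file:

```python
import configparser
import json

# Shortened sample in the same shape as the proof of concept above.
SAMPLE_CFG = """\
[query_log_mediawiki]
QueryIntervalSecs = 900
QueryIndices = <graylog_deflector>
QueryJson = {
    "size": 0,
    "query": {"match": {"application_name": "mediawiki"}},
    "aggs": {"mediawiki-channels": {"terms": {"field": "mediawiki_channel"}}}
    }
"""


def load_queries(text):
    """Parse the INI text and decode each query's JSON body."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    queries = {}
    for section in parser.sections():
        if not section.startswith("query_"):
            continue
        queries[section[len("query_"):]] = {
            "interval": parser.getint(section, "QueryIntervalSecs"),
            "indices": parser.get(section, "QueryIndices"),
            # json.loads raises ValueError if the body is malformed.
            "body": json.loads(parser.get(section, "QueryJson")),
        }
    return queries


queries = load_queries(SAMPLE_CFG)
print(queries["log_mediawiki"]["interval"])
```

The 900-second interval lines up with the `now-15m` range filter, so each run covers the window since the previous one.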
Wed, Mar 31, 23:56 · MediaWiki (SRE), Monitoring
Southparkfan triaged T7073: Install prometheus-es-exporter for prometheus <-> graylog integration as Normal priority.
Wed, Mar 31, 23:01 · MediaWiki (SRE), Monitoring
Dmehus awarded T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc' a Like token.
Wed, Mar 31, 18:05 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan closed T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc' as Resolved.

Wed, Mar 31, 17:54 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan added a comment to T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc'.

The rebuild should finish in 10-30 minutes. If the errors are gone after the rebuild, you can close this task. If not, assistance will be needed.

Wed, Mar 31, 17:44 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan added a comment to T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc'.

Already did that for 'en', but without luck. Started the rebuild in a screen now.

Wed, Mar 31, 17:43 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan updated the task description for T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc'.
Wed, Mar 31, 17:16 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan triaged T7070: MWException from line 129 of /srv/mediawiki/w/includes/MagicWord.php: Error: invalid magic word 'getshortdesc' as Unbreak Now! priority.
Wed, Mar 31, 17:16 · Upstream, Production Error, MediaWiki (SRE), MediaWiki
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

More testing is required to determine the final backup sizes.

Wed, Mar 31, 15:10 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.

Wed, Mar 31, 14:27 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan updated the task description for T7067: Subscribe SRE to OpenCVE for notifications.
Wed, Mar 31, 13:16 · Security, Site Reliability Engineering
Southparkfan added a comment to T7067: Subscribe SRE to OpenCVE for notifications.

Just noting that it's been decided to discontinue SRE duty due to the new team system and it didn't seem to be functioning anymore. The dashboard and links we've compiled have still been kept though as they're useful.

Wed, Mar 31, 13:16 · Security, Site Reliability Engineering

Tue, Mar 30

Southparkfan updated the task description for T7067: Subscribe SRE to OpenCVE for notifications.
Tue, Mar 30, 21:58 · Security, Site Reliability Engineering
Southparkfan updated subscribers of T7067: Subscribe SRE to OpenCVE for notifications.
Tue, Mar 30, 21:53 · Security, Site Reliability Engineering
Southparkfan moved T7067: Subscribe SRE to OpenCVE for notifications from Radar to Discussion on the Site Reliability Engineering board.
Tue, Mar 30, 21:52 · Security, Site Reliability Engineering
Southparkfan triaged T7067: Subscribe SRE to OpenCVE for notifications as Normal priority.
Tue, Mar 30, 21:52 · Security, Site Reliability Engineering

Mon, Mar 29

Southparkfan added a comment to T4302: Deploy Apache Traffic Server.

To verify the backend's certificate (CN) properly, we have tested using ENFORCE. However, the Host header from the client (e.g. allthetropes.org) is used for the CN check at the backend. The allthetropes.org certificate would therefore still be mandatory at the backend, whereas I would prefer to remove every certificate (including our wildcard one) from the MediaWiki servers except one for a single domain (such as ats-internal.miraheze.wiki).

Mon, Mar 29, 00:46 · Infrastructure (SRE)

Sun, Mar 28

Southparkfan changed the status of T6984: High load on dbbackup servers, a subtask of T5877: Revise MariaDB backup strategy, from Open to Stalled.
Sun, Mar 28, 22:44 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan changed the status of T6984: High load on dbbackup servers from Open to Stalled.

The future of these servers depends on the outcome of testing regarding T5877#139273.

Sun, Mar 28, 22:44 · Database, Monitoring, Infrastructure (SRE)
Southparkfan changed the edit policy for T7033: Restart services running on older openssl binaries.
Sun, Mar 28, 22:40 · Infrastructure (SRE), Security
Southparkfan changed the visibility for T7033: Restart services running on older openssl binaries.
Sun, Mar 28, 22:40 · Infrastructure (SRE), Security
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

@Southparkfan should we make this task public viewable?

Sun, Mar 28, 22:40 · Infrastructure (SRE), Security
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

I recommend rebooting the dbbackup servers. They may or may not be affected by CVE-2021-3450, but as long as these servers are rebooted gracefully, we can survive without them for a few minutes.

Sun, Mar 28, 21:32 · Infrastructure (SRE), Security

Fri, Mar 26

Southparkfan added a comment to T7042: salt-ssh broken due to unknown minion.

I cannot find the minion in /etc/salt/roster.

Fri, Mar 26, 12:37 · Infrastructure (SRE)
Southparkfan updated the task description for T7042: salt-ssh broken due to unknown minion.
Fri, Mar 26, 12:33 · Infrastructure (SRE)
Southparkfan triaged T7042: salt-ssh broken due to unknown minion as High priority.
Fri, Mar 26, 12:33 · Infrastructure (SRE)
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.

Servers that haven't been rebooted, except for db1[1-3] / cloud[3-5] / mon2 / ns[12]:

  • dbbackup1
  • dbbackup2
  • mem1
  • mem2
Fri, Mar 26, 12:31 · Infrastructure (SRE), Security
Southparkfan closed T7041: phab.miraheze.wiki cert expired as Resolved.

Fixed.

Fri, Mar 26, 12:24 · MediaWiki (SRE), SSL
Southparkfan reopened T7041: phab.miraheze.wiki cert expired as "Open".
Fri, Mar 26, 12:19 · MediaWiki (SRE), SSL
Southparkfan added a comment to T7038: Existing Server Resource Request for bacula2.

RamNode is short on capacity, so we can't resize bacula yet. I hope we can resize the server next week.

Fri, Mar 26, 12:14 · Infrastructure (SRE)
Southparkfan added a comment to T7041: phab.miraheze.wiki cert expired.
13:08:41 <+SPF|Cloud> first, the nginx config points to /etc/ssl/certs/miraheze.wiki.crt, but we have switched to /etc/ssl/localcerts 
13:09:17 <+SPF|Cloud> second, the certificate is valid for 'miraheze.wiki', but not phab.miraheze.wiki
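
The second point can be illustrated with a simplified name check. This is a sketch of the usual matching rule (exact match, or a wildcard covering exactly one leftmost label), not the code any TLS library actually runs, and the wildcard name in the example is hypothetical:

```python
def hostname_matches(hostname: str, cert_name: str) -> bool:
    """Simplified check of a hostname against one certificate name."""
    host_labels = hostname.lower().rstrip(".").split(".")
    name_labels = cert_name.lower().rstrip(".").split(".")
    if len(host_labels) != len(name_labels):
        return False
    if name_labels[0] == "*":
        # The wildcard stands for exactly one leftmost label.
        return host_labels[1:] == name_labels[1:]
    return host_labels == name_labels


print(hostname_matches("phab.miraheze.wiki", "miraheze.wiki"))    # False
print(hostname_matches("phab.miraheze.wiki", "*.miraheze.wiki"))  # True
```

A certificate issued only for the bare domain therefore does not cover the phab subdomain.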
Fri, Mar 26, 12:09 · MediaWiki (SRE), SSL

Thu, Mar 25

Southparkfan added a comment to T7038: Existing Server Resource Request for bacula2.

+$5/mo is approved by me; it only requires John's approval as the EM of Infrastructure.

Thu, Mar 25, 22:40 · Infrastructure (SRE)
Southparkfan updated the task description for T7038: Existing Server Resource Request for bacula2.
Thu, Mar 25, 22:39 · Infrastructure (SRE)
Southparkfan created T7038: Existing Server Resource Request for bacula2.
Thu, Mar 25, 22:39 · Infrastructure (SRE)
Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

A maintenance window is required for dumping from masters directly. Not because impact is guaranteed, but because dumping may cause database locks for multiple seconds, hence increasing save time or knocking wikis offline.

Thu, Mar 25, 22:08 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a comment to T7037: [New] Server Resource Request for ats.

Spoke with @Paladox regarding ATS. Installing and testing ATS on test3 is not ideal, since that server is used for MediaWiki tests. Installing a new server as a testing cache proxy has my support, provided that this cache proxy does not receive the 'allow 80/443 tcp' rules yet for security reasons (we have agreed on a security review beforehand).

Thu, Mar 25, 21:57 · Infrastructure (SRE)
Southparkfan added a comment to T7033: Restart services running on older openssl binaries.
19:58:18 <+SPF|Cloud> my advice: reboot all VMs with services that can be depooled and repooled easily, in order to preserve uptime, do it the normal way (adhere to the 5 minutes DNS TTL, depool from varnish, wait until requests have finished, etc)
20:00:49 <+SPF|Cloud> on the critical servers, db1[1-3], cloud[3-5], mon2 and ns[12], restarting syslog-ng / IRC bots is fine, anything else shouldn't be touched (yet)
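
The bracket shorthand used for the critical servers expands like a character class: `db1[1-3]` means db11, db12 and db13, and `ns[12]` means ns1 and ns2. A small sketch of that expansion, with illustrative helper names and no dependency on any particular tool's syntax:

```python
import itertools
import re

_CLASS = re.compile(r"\[([^\]]+)\]")


def _expand_class(body):
    """Expand a class body such as '1-3' or '12' into its characters."""
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def expand_hosts(pattern):
    """Expand every [...] class in the pattern into concrete host names."""
    # re.split with a capture group alternates literal text and class bodies.
    parts = _CLASS.split(pattern)
    choices = [
        _expand_class(part) if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


for pattern in ["db1[1-3]", "cloud[3-5]", "mon2", "ns[12]"]:
    print(pattern, "->", expand_hosts(pattern))
```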
Thu, Mar 25, 19:05 · Infrastructure (SRE), Security
Southparkfan updated the task description for T7033: Restart services running on older openssl binaries.
Thu, Mar 25, 18:50 · Infrastructure (SRE), Security
Southparkfan created T7033: Restart services running on older openssl binaries.
Thu, Mar 25, 18:41 · Infrastructure (SRE), Security
R4356th awarded T4005: Execute external commands on MediaWiki servers inside sandboxes a 100 token.
Thu, Mar 25, 10:30 · Universal Omega, MediaWiki (SRE), Security, MediaWiki

Wed, Mar 24

Southparkfan lowered the priority of T6984: High load on dbbackup servers from High to Normal.
Wed, Mar 24, 12:20 · Database, Monitoring, Infrastructure (SRE)

Mon, Mar 22

Dmehus awarded T5222: MediaWiki response time can fluctuate due to messages a Like token.
Mon, Mar 22, 21:44 · MediaWiki (SRE), MediaWiki

Sat, Mar 20

Dmehus awarded T6765: Cache frequently accessed files on MediaWiki servers a Like token.
Sat, Mar 20, 16:51 · MediaWiki (SRE), Performance, MediaWiki

Fri, Mar 19

Southparkfan added a comment to T7003: Cargo: Error: unclosed string literal..

Sounds Upstream.

Agreed. There are multiple issues here:

  • Unclosed literals (with a single quote) are not a problem in column names, the check is too strict here
  • If the bug above won't be fixed, then the extension lacks basic input validation upon creating a table
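
To make the first point concrete: a check that simply counts single quotes will flag the column name, even though the name is safe once it is quoted as an identifier. The sketch below is an illustration of that mismatch and is not the extension's actual code:

```python
def has_unclosed_quote(text: str) -> bool:
    """Naive literal check: an odd number of single quotes looks 'unclosed'."""
    return text.count("'") % 2 == 1


def quote_identifier(name: str) -> str:
    """Quote a MariaDB identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


column = "King's_Rock"
print(has_unclosed_quote(column))   # True: the naive check rejects it
print(quote_identifier(column))     # `King's_Rock`
print(has_unclosed_quote("Kings_Rock"))  # False: renaming avoids the check
```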
Fri, Mar 19, 21:00 · Upstream, Extensions, MediaWiki (SRE), Universal Omega
Southparkfan added a comment to T7003: Cargo: Error: unclosed string literal..
21:46:14 <+SPF|Cloud> renaming the column to Kings_Rock or similar will fix the issue
21:48:18 <+SPF|Cloud> and https://mariadb.com/kb/en/identifier-names/#quoted says that the single quote (should have called them 'quotes' instead of 'apostrophes', I guess) is a valid character in a column name
21:49:18 <+SPF|Cloud> the extension is at fault here ;)
Fri, Mar 19, 20:52 · Upstream, Extensions, MediaWiki (SRE), Universal Omega
Southparkfan added a comment to T6985: Cargo command error.

HOWEVER
I still have to mention an actual bug: I can't see one of my tables for some reason, but it works normally. This happened just after the recent outage.

Just commenting because I don't know if it's worth a task (considering the table is working as intended, it's just the page that is not working).

Opened T7003

Fri, Mar 19, 20:43 · Universal Omega, Extensions, MediaWiki (SRE)
Southparkfan added a comment to T7003: Cargo: Error: unclosed string literal..

Exception is generated at https://github.com/wikimedia/mediawiki-extensions-Cargo/blob/a619875ea82539bfbf525b8813036876e5cf39b4/includes/CargoUtils.php#L408
$string is string(11) "King's_Rock"
King's_Rock is a column in the cargo__Moves table:

stdClass Object
(
    [Field] => King's_Rock
    [Type] => tinyint(1)
    [Null] => YES
    [Key] => MUL
    [Default] =>
    [Extra] =>
)
Fri, Mar 19, 20:42 · Upstream, Extensions, MediaWiki (SRE), Universal Omega
Southparkfan moved T7003: Cargo: Error: unclosed string literal. from Backlog to Deployed Extension Bugs on the Extensions board.
Fri, Mar 19, 20:34 · Upstream, Extensions, MediaWiki (SRE), Universal Omega
Southparkfan triaged T7003: Cargo: Error: unclosed string literal. as Normal priority.
Fri, Mar 19, 20:34 · Upstream, Extensions, MediaWiki (SRE), Universal Omega
Southparkfan placed T4191: Redesign compression of content inside NGINX and Varnish up for grabs.
Fri, Mar 19, 16:42 · Infrastructure (SRE), Varnish
Southparkfan added a comment to T4005: Execute external commands on MediaWiki servers inside sandboxes.

Firejail helps to secure external binaries, but apparently firejail itself doesn't have a good track record either: https://github.com/netblue30/firejail/issues/3046

Fri, Mar 19, 16:41 · Universal Omega, MediaWiki (SRE), Security, MediaWiki
Southparkfan added a comment to T6759: Automate the adding of SSL private keys to puppet3.

I've finally found the ticket, pasting my IRC comment here:
22:37:46 <+SPF|Cloud> @SRE, I can't recall who was talking about it (and where I read it), but I saw some messages regarding automating the addition of a new certificate (for https). have you considered https://wikitech.wikimedia.org/wiki/Acme-chief?

Fri, Mar 19, 16:29 · Infrastructure (SRE), SSL
Southparkfan added a comment to T6985: Cargo command error.

Considering the recent outages and the mention of MariaDB

I doubt it.

Fri, Mar 19, 16:27 · Universal Omega, Extensions, MediaWiki (SRE)

Thu, Mar 18

Southparkfan added a comment to T5877: Revise MariaDB backup strategy.

Perhaps, it may be possible to directly dump from the masters, with very little interruption: https://stackoverflow.com/q/56715657.
In that case, we can use the RamNode VMs to store the logical dumps (mydumper to stdout | ssh - local file). The disadvantage is that we won't have a live replica at all times (if a master crashes for good, the data between <most recent backup> and <crash> will be lost), but it's much cheaper: I/O limit is not much of an issue and since data is not replicated, there is more space for storing logical dumps.

Thu, Mar 18, 23:08 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan committed rPUPC8e1d7afd5272: Prometheus blackbox: add HTTPS metrics for more sites (T6800) (authored by Southparkfan).
Prometheus blackbox: add HTTPS metrics for more sites (T6800)
Thu, Mar 18, 21:57
Southparkfan added a comment to T6800: Create SLOs/SLIs for services.

Collecting real data: T6981
Blackbox testing: https://github.com/prometheus/blackbox_exporter (currently being tested)

Thu, Mar 18, 21:42 · Site Reliability Engineering
Southparkfan added a comment to T6981: Consider Deploying NavigationTiming Extension.

I am a huge fan of data driven decision making!

Thu, Mar 18, 21:35 · Extensions, Monitoring, MediaWiki (SRE), Universal Omega
Southparkfan added a comment to T6975: LDAP Statistics.

Available exporters: https://github.com/jcollie/openldap_exporter https://github.com/tomcz/openldap_exporter
WMF (for dashboard examples): https://grafana.wikimedia.org/d/DnxQ26qmk/ldap?orgId=1 / https://phabricator.wikimedia.org/T181511

Thu, Mar 18, 21:26 · Monitoring, Infrastructure (SRE)
Southparkfan added a comment to T6984: High load on dbbackup servers.

@Paladox I have disabled c4 replication on dbbackup1, but the lag is not decreasing. It looks like dbbackup1 still doesn't have enough room to replicate a full database cluster. Do you see room for improvements?

Thu, Mar 18, 21:21 · Database, Monitoring, Infrastructure (SRE)
Southparkfan added a comment to T6979: Collect Statistics for API Requests (Including Module Type).
In T6979#138313, @John wrote:

https://grafana.com/docs/grafana/latest/datasources/elasticsearch/ data could alternatively be collected directly via ES potentially

I told you on IRC yesterday, but it's good practice to document this on Phabricator: my suggestion is to use https://github.com/braedon/prometheus-es-exporter, a tool that runs on the graylog hosts. The tool takes an elasticsearch query, performs the search and returns the result in prometheus format. Prometheus collects the metrics, after which we can use the metrics in Grafana dashboards.
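
As a rough picture of that pipeline, the sketch below turns an Elasticsearch-style aggregation response into Prometheus text lines. The metric naming scheme, the sample response and the channel names are assumptions made for illustration. The exporter's real output format should be checked against its documentation:

```python
import re


def to_prometheus(query_name, response):
    """Render total hits and terms-aggregation buckets as metric lines."""
    total = response["hits"]["total"]["value"]
    lines = [f"{query_name}_hits {total}"]
    for agg_name, agg in response.get("aggregations", {}).items():
        # Metric names may only contain letters, digits and underscores here.
        metric = re.sub(r"[^a-zA-Z0-9_]", "_", f"{query_name}_{agg_name}_doc_count")
        for bucket in agg.get("buckets", []):
            key = bucket["key"]
            count = bucket["doc_count"]
            lines.append(f'{metric}{{bucket="{key}"}} {count}')
    return lines


# Invented sample response, shaped like a search with track_total_hits enabled.
sample = {
    "hits": {"total": {"value": 42, "relation": "eq"}},
    "aggregations": {
        "mediawiki-channels": {
            "buckets": [
                {"key": "exception", "doc_count": 30},
                {"key": "error", "doc_count": 12},
            ]
        }
    },
}

print("\n".join(to_prometheus("log_mediawiki", sample)))
```

Label values are not escaped in this sketch, which is fine for simple channel names but would need handling for arbitrary strings.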

Thu, Mar 18, 21:15 · Monitoring, MediaWiki (SRE)
Southparkfan added a comment to T6900: Create draft of Data Processing Inventory.

Following a course regarding the basics of GDPR at the moment.

Thu, Mar 18, 21:08 · Site Reliability Engineering
Southparkfan added a comment to T6984: High load on dbbackup servers.

Analysing the queries from a binlog (c4):
mysqlbinlog mysql-bin.001818 | grep '::' > ~/analyse-queries-T6984.txt
(any query that has '::' in it usually includes the PHP caller as SQL comments)
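
Once the queries are in a text file, counting them per caller shows which code paths generate the most writes. A minimal sketch, assuming the caller appears as a leading `/* Class::method */` comment. The sample lines are invented and the exact comment format may differ:

```python
import re
from collections import Counter

# Assumed comment shape: /* Some\Namespaced\Class::method ... */
CALLER = re.compile(r"/\*\s*([\w\\]+::\w+)")


def count_callers(lines):
    """Count queries per PHP caller found in SQL comments."""
    counts = Counter()
    for line in lines:
        match = CALLER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines, for illustration only.
sample_lines = [
    "UPDATE /* ExampleClass::save */ `page` SET page_touched = '...'",
    "INSERT /* OtherClass::push */ INTO `job` (job_cmd) VALUES ('...')",
    "UPDATE /* ExampleClass::save */ `page` SET page_touched = '...'",
]

for caller, count in count_callers(sample_lines).most_common():
    print(count, caller)
```

The same function can be fed the lines of the generated analysis file to get a ranked list of callers.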

Thu, Mar 18, 13:54 · Database, Monitoring, Infrastructure (SRE)
Southparkfan added a subtask for T5877: Revise MariaDB backup strategy: T6984: High load on dbbackup servers.
Thu, Mar 18, 13:36 · Infrastructure (SRE), Goal-2021-Jan-Jun, Database, Goal-2020-Jul-Dec
Southparkfan added a parent task for T6984: High load on dbbackup servers: T5877: Revise MariaDB backup strategy.
Thu, Mar 18, 13:36 · Database, Monitoring, Infrastructure (SRE)
Southparkfan added a comment to T6984: High load on dbbackup servers.

Investigating load on dbbackup1.

Thu, Mar 18, 13:36 · Database, Monitoring, Infrastructure (SRE)
Southparkfan added a comment to T6765: Cache frequently accessed files on MediaWiki servers.

Can't we generate the list dynamically? ie by running my 'script' periodically on a server?

Thu, Mar 18, 13:24 · MediaWiki (SRE), Performance, MediaWiki

Wed, Mar 17

Southparkfan added a comment to T6984: High load on dbbackup servers.

Downtimed dbbackup[12] Current Load.

Wed, Mar 17, 00:18 · Database, Monitoring, Infrastructure (SRE)
Southparkfan added a comment to T6984: High load on dbbackup servers.

I am not sure if load alerts are useful for these servers. These servers are not production facing; as long as the replication processes are running and the replication lag is not greater than X seconds, I don't mind what the load average is.
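
The alerting rule described here could look roughly like the following sketch. The threshold stays a parameter because no value for X has been chosen. The function name, message wording and the 600-second value in the example are placeholders:

```python
from typing import Optional


def replication_alert(
    io_running: bool,
    sql_running: bool,
    lag_seconds: Optional[int],
    max_lag_seconds: int,
) -> Optional[str]:
    """Return an alert message, or None when replication looks healthy."""
    if not (io_running and sql_running):
        return "CRITICAL: replication threads are not running"
    if lag_seconds is None:
        return "CRITICAL: replication lag is unknown"
    if lag_seconds > max_lag_seconds:
        return f"WARNING: replication lag {lag_seconds}s exceeds {max_lag_seconds}s"
    return None


# Load average is deliberately not an input: only replication health matters.
print(replication_alert(True, True, 30, max_lag_seconds=600))
print(replication_alert(True, True, 1200, max_lag_seconds=600))
```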

Wed, Mar 17, 00:14 · Database, Monitoring, Infrastructure (SRE)

Fri, Mar 12

Southparkfan added a comment to T6952: cp12 OOM'd March 12 2021 16:36.

Varnish was stopped at 16:36:45, gdnsd depooled cp12 at 16:37:02 on ns1 and at 16:37:04 on ns2. That's an excellent MTTR. I am not able to find out why the service ran out of memory, but since this is quite a rare event, I'm tempted to skip the investigation process.
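
The depool delays follow directly from the timestamps above. A short sketch of the arithmetic:

```python
from datetime import datetime

FMT = "%H:%M:%S"
stopped = datetime.strptime("16:36:45", FMT)
depooled = {
    "ns1": datetime.strptime("16:37:02", FMT),
    "ns2": datetime.strptime("16:37:04", FMT),
}

# Seconds between Varnish stopping and each nameserver depooling cp12.
delays = {ns: int((at - stopped).total_seconds()) for ns, at in depooled.items()}
print(delays)  # {'ns1': 17, 'ns2': 19}
```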

Fri, Mar 12, 22:14 · Infrastructure (SRE)

Mar 11 2021

Southparkfan added a comment to T6946: Login failed in primary authentication because no provider accepted.

For tracking purposes: https://github.com/wikimedia/mediawiki/blob/6978577273048ca9a32baaa285d4270d6e860e6a/includes/user/UserNameUtils.php#L181 returns false for 'SHEIKH', but https://github.com/wikimedia/mediawiki/blob/6978577273048ca9a32baaa285d4270d6e860e6a/includes/user/UserNameUtils.php#L183 returns true, so I presume SHEIKH ended up being in the reserved names list.

Mar 11 2021, 23:27 · Configuration, Universal Omega, MediaWiki (SRE)

Mar 10 2021

Southparkfan reassigned T6071: Set up replicas for all database clusters from Southparkfan to Paladox.
Mar 10 2021, 22:37 · Infrastructure (SRE), Database
Southparkfan added a comment to T6071: Set up replicas for all database clusters.

c2 is done and replicates fine. c3 has not been cloned yet. c4 had its replica process stopped for too long and thus the binary position it's on doesn't exist on the master anymore: a reclone is needed.

Mar 10 2021, 22:21 · Infrastructure (SRE), Database

Mar 9 2021

Southparkfan added a comment to T5044: Setup centralised logging for services.

We switched off syslog-ng logging on the cloud servers. Not sure if we want to switch it back on @John @Southparkfan ?

Yes, let's see if we can receive proxmox logs without further tweaking.

Mar 9 2021, 11:45 · Infrastructure (SRE), Goal-2021-Jan-Jun, Goal-2020-Jul-Dec, Goal-2020-Jan-Jun

Feb 27 2021

Southparkfan created T6907: WikiDiscover API does not honor siteprop parameter.
Feb 27 2021, 21:38 · MediaWiki (SRE), WikiDiscover, Universal Omega

Feb 26 2021

Southparkfan reassigned T5433: Evaluate regular SRE meetings from Southparkfan to Reception123.

What's the status on meetings in the MediaWiki team?

Feb 26 2021, 22:12 · Site Reliability Engineering
Southparkfan placed T6830: Add icinga/prometheus monitoring for multi-instance up for grabs.

(no time currently)

Feb 26 2021, 22:02 · Monitoring, Infrastructure (SRE), Database
Southparkfan moved T6900: Create draft of Data Processing Inventory from Radar to Management on the Site Reliability Engineering board.
Feb 26 2021, 21:59 · Site Reliability Engineering
Southparkfan triaged T6900: Create draft of Data Processing Inventory as Low priority.
Feb 26 2021, 21:58 · Site Reliability Engineering
Southparkfan lowered the priority of T4017: Reconfigure TLS settings inside MariaDB from Normal to Low.
Feb 26 2021, 21:47 · Infrastructure (SRE), Goal-2019-Jul-Dec, Goal-2020-Jan-Jun

Feb 25 2021

Southparkfan added a comment to T6765: Cache frequently accessed files on MediaWiki servers.
In T6765#134610, @John wrote:

If restarting the service is required to pick up changes, databases.json and *wiki.json can't be cached

Feb 25 2021, 17:21 · MediaWiki (SRE), Performance, MediaWiki

Feb 24 2021

Southparkfan edited P387 syslog-ng log to local file.
Feb 24 2021, 22:06
Southparkfan edited P387 syslog-ng log to local file.
Feb 24 2021, 22:05
Southparkfan created P387 syslog-ng log to local file.
Feb 24 2021, 22:04

Feb 17 2021

Southparkfan added a comment to T6858: Messages take a while to be sent to graylog.

Do you know what the bottleneck is? Is it MediaWiki -> syslog-ng or syslog-ng -> graylog?

Feb 17 2021, 22:31 · Infrastructure (SRE)

Feb 14 2021

Southparkfan added a comment to T6849: Grafana bug CVE-2019-15043 can still be exploited despite being out of vulnerable range.

Wikimedia has been contacted. Waiting on more information from the researcher.

Feb 14 2021, 13:41 · Upstream, Infrastructure (SRE), Security
Southparkfan lowered the priority of T6849: Grafana bug CVE-2019-15043 can still be exploited despite being out of vulnerable range from Unbreak Now! to High.
  • The API endpoint is open
  • Our Grafana is on a patched version
  • Only availability (A) is impacted, not confidentiality or integrity (C/I); Grafana is not critical
    • However, taking into account that Grafana resides on a system where critical systems (Icinga) are hosted..
Feb 14 2021, 12:54 · Upstream, Infrastructure (SRE), Security
Southparkfan added a comment to T6849: Grafana bug CVE-2019-15043 can still be exploited despite being out of vulnerable range.
southparkfan@test3:~$ python3 cve.py --url 'https://grafana.miraheze.org'
[-] Testing https://grafana.miraheze.org...
[-] Status: 200
[-] Checking for version...
[-] Grafana version appears to be: 7.4.1
[!] Version seems to indicate it's probably not vulnerable.
[-] Checking if snapshot api requires authentiation...
[+] Snapshot endpoint doesn't seem to require authentication! Host may be vulnerable.
Feb 14 2021, 12:51 · Upstream, Infrastructure (SRE), Security