Regarding IPv4 addresses, we will either need to repurpose the IPv4 addresses from jobrunner* (which would leave jobchron1 and task1 without IPv4 addresses), or purchase two additional IPv4 addresses for these servers.
In theory we can recycle jobrunner3 into jobchron1. We'd need to remove the mediawiki and jobrunner roles, update the hostname and DNS, and then adjust the hardware settings. The downside to this approach is that we'd keep the 47GB disk instead of a 10GB one; I don't believe we can shrink it.
I'm planning to recycle the existing jobrunner4 into this new task1 server. In theory, all we should need to do is update the hostname and DNS, then do a short restart to adjust the hardware settings, and it should be good.
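For reference, a rough sketch of the rename steps for either recycle (hostnames are taken from the comments above; the exact certificate handling depends on our puppet setup, so treat this as illustrative):

```
# On the puppet server: retire the old certname (assumes Puppet 6+ CA tooling).
puppetserver ca clean --certname jobrunner4.miraheze.org

# On the recycled host: set the new hostname and refresh local records.
hostnamectl set-hostname task1.miraheze.org
sed -i 's/jobrunner4/task1/g' /etc/hosts

# Regenerate the agent certificate under the new name and re-run puppet.
rm -rf /etc/puppetlabs/puppet/ssl
puppet agent -t
```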
I'm not certain I would reduce the maximum (currently 4096). In theory, the only problem with allowing passwords longer than 128 characters is the potential server load from hashing very long inputs.
Wed, Jul 28
Tue, Jul 27
+1 on disclosure from me.
Mon, Jul 26
I've disabled Loops on all wikis. I should probably have done it differently; I'll fix it in a moment.
Tentatively resolving based on private task, please reopen if issues persist.
Most likely the cause of T7693.
Sat, Jul 24
Confirm edit is enabled on all wikis by default.
Fri, Jul 23
This bug fix, along with apparently a number of other changes, appears to be available only on the master branch of the extension. We should consider updating to that, as we currently use only the REL1_36 branch. I'm not sure whether we'd need to take any special measures when updating, though.
Wed, Jul 21
Tentatively closing, looks like we've stabilized.
This would be difficult to implement, I believe. I think (for MediaWiki sites) we'd only have to worry about /wiki and /w, but I'd be worried about static.miraheze.org and probably a few other things.
Literally a setting in puppet: https://git.io/Jlvs7
Tracked passively in Grafana. Correct me if I'm mistaken, but anything short of an obvious DoS attack from a single source wouldn't be actionable beyond provisioning additional MW servers? If that's the case, then there are other things that take precedence at the moment.
Has Outlook been made aware of the status of the subtask (a general message would suffice)? Or have we had any other reports come in from Outlook?
Effectively stalled on T7139 unless we can prevent jobrunner3 from accepting high-intensity jobs (presumably RequestWikiAIJob).
Technically resolved on reboot. No cause has been identified, but as a non-recurring issue, it isn't worth spending more time investigating.
I'm removing parent tasks, as I don't think it is reasonable to wait for and test Debian Bullseye in our infrastructure given our current capacity problems.
Tue, Jul 20
Operating at 77% disk usage, so looks good. Feel free to reopen if icinga reports a disk usage warning, or you see any disk usage warning in graylog.
Can we try temporarily disabling RequestWikiAIJob to see if this alleviates the load? Alternatively, could we prevent jobrunner3 from running these types of jobs?
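Before disabling anything, it's worth confirming that RequestWikiAIJob actually dominates the queue. A quick check, assuming the standard MediaWiki maintenance scripts (the install path and wiki name here are assumptions):

```
# showJobs.php ships with MediaWiki core; --group prints counts per job type.
php /srv/mediawiki/w/maintenance/showJobs.php --wiki=metawiki --group
```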
Declining for now; we're set to expand resources soon, but don't have the capacity to do so immediately.
Follow up: https://github.com/miraheze/puppet/pull/1816
Mon, Jul 19
Security advisory has been published, and CVE-2021-32774 was issued.
Scrambling the password has effectively solved this issue, but renders the guest account inaccessible for valid use cases (not that I think many users were using guest). We can follow up later with either restoring the account, or doing away with it entirely.
Confirmed the guest account (the one we use for icinga - guest/guest) could be logged into on the mail server. I've scrambled that password for now (see K13); we'll have to see if this solves it. However, I also note that graylog suggests the account was only logged into today, so it might not be the full cause, but it is the cause of those two pastes.
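For the record, "scrambling" here just means setting the password to a random value, along these lines (assuming guest is a local system account on the mail server; adjust for the actual auth backend):

```
# Replace the guest password with 24 bytes of base64 randomness.
echo "guest:$(openssl rand -base64 24)" | sudo chpasswd
```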
To clarify, Nuke does not show imported pages. This is an upstream bug, but I don't see it being fixed anytime soon, as it is over a decade old at this point.
Sun, Jul 18
I think instead of having tabs on the form, we should have a filter that simply updates the visibility of the different items (I think we do something similar with the yearly Survey, where checking a checkbox makes more questions visible). This way we could default the page to showing all items, but also easily filter it down to categories. Additionally, depending on how it's implemented, it could display multiple categories at once, and the same item could be in multiple categories.
I have a process running on jobrunner3 that should report the full process information for any process that ends up killed by OOM. Hopefully it will tell us more about which processes are using too much memory.
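Roughly, a watcher like that works along these lines (a sketch, not the deployed script; the log path is illustrative):

```
#!/bin/bash
# Follow kernel messages and record context whenever the OOM killer fires.
journalctl -kf | while read -r line; do
  if echo "$line" | grep -q 'Out of memory'; then
    {
      echo "=== $(date -Is) OOM event ==="
      echo "$line"
      # Snapshot the current top memory consumers for context.
      ps aux --sort=-%mem | head -n 15
    } >> /var/log/oom-watch.log
  fi
done
```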
Sat, Jul 17
Still monitoring this, but our storage usage is down to 10GB of logs per day, from 30GB per day. I think we can sustain this without difficulty.
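One way to sanity-check the intake is to ask Elasticsearch for the per-index sizes directly (assumes ES is listening locally on the default port, and the default graylog_* index naming):

```
# store.size shows the on-disk size of each Graylog index.
curl -s 'http://localhost:9200/_cat/indices/graylog_*?v&h=index,docs.count,store.size'
```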
I'll note that loading a large number of scripts at once (such as listing multiple scripts in your common.js) can cause this to happen, particularly if those scripts make additional requests to the server.
Flow is awkward, but if you're referring to Topic:Wcwif8hjydot2c1y, the post in question was hidden, not deleted. It therefore can be viewed by anyone with the flow-hide permission, which is everyone on the wiki.
Adding mahjongwiki as well
Jobrunner3 has logged 155 out-of-memory events in the past 24 hours; the OOM killer has killed several processes, redis repeatedly among them.
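A figure like this can be pulled straight out of the kernel log (assumes journald is in use):

```
# Count OOM-killer events on jobrunner3 over the last day.
journalctl -k --since "24 hours ago" | grep -c 'Out of memory'
```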
Fri, Jul 16
There's no indication of a cause anywhere in the cp3 logs. How was the outage reported and verified?
Could be T7626?
Tue, Jul 13
We don't have any available IPv4 addresses, so this may have to wait, unless we want an MW server that is only reachable over IPv6. In any case, I'm not actually available for the rest of today, so this would be done tomorrow at the earliest.
Have access now to OVH/RN/Proxmox. I believe that's everything.
FYI, if anyone needs to get graylog working again: go to System > Indices > Default index set, and delete the oldest indices (oldest at the bottom) until at least 15% disk space is available.
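If the web UI itself is unresponsive (e.g. because the disk is completely full), the same cleanup can likely be done through Graylog's REST API. The endpoint, host, and index name below are assumptions, so verify them against the API browser first:

```
# Delete the oldest index; Graylog requires the X-Requested-By header
# on mutating requests.
curl -u admin -H 'X-Requested-By: cli' \
  -X DELETE 'http://127.0.0.1:9000/api/system/indexer/indices/graylog_0'
```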
Cleared some disk space, will need to monitor if our recent changes are sufficient to prevent any further issues.
Created https://github.com/miraheze/mw-config/pull/3997, but I'm not yet willing to merge it. Thoughts?