MediaWiki Capacity Proposal
Closed, Resolved | Public

Description

From an Infrastructure perspective, the current state of play is:

4x MW servers with 4vCPU, 4GB memory and 20GB disk
2x JR servers with 3vCPU, 3GB memory and 45GB disk.

This means we have the following resources allocated to MediaWiki on each cloud server: 11 vCPUs, 11GB memory, and 85GB of disk space.
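
As a quick check of the figures above, here is a minimal Python sketch. It assumes each cloud server hosts two MW servers and one JR server, a layout inferred from the totals rather than stated in this task. The names `Spec`, `per_cloud_server` and `cluster_total` are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Resources of one server (or a sum of servers)."""
    vcpus: int
    memory_gb: int
    disk_gb: int

    def __mul__(self, count: int) -> "Spec":
        # Resources of `count` identical servers.
        return Spec(self.vcpus * count, self.memory_gb * count, self.disk_gb * count)

    def __add__(self, other: "Spec") -> "Spec":
        # Combined resources of two groups of servers.
        return Spec(
            self.vcpus + other.vcpus,
            self.memory_gb + other.memory_gb,
            self.disk_gb + other.disk_gb,
        )


MW = Spec(vcpus=4, memory_gb=4, disk_gb=20)  # one MW server, per the task
JR = Spec(vcpus=3, memory_gb=3, disk_gb=45)  # one JR server, per the task

# Assumption: 2 MW + 1 JR per cloud server (inferred, not stated in the task).
per_cloud_server = MW * 2 + JR
# All 4 MW and 2 JR servers together.
cluster_total = MW * 4 + JR * 2

print(per_cloud_server)
print(cluster_total)
```

Under that assumption the per-cloud-server figures match the 11 vCPUs, 11GB and 85GB quoted above, and the whole cluster comes to 22 vCPUs, 22GB of memory and 170GB of disk.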

If we can improve our capacity and throughput while keeping these resources, that is a bonus. If it takes an increase, that is not bad given our growth.

MediaWiki Processing Capacity (Current)

  • Number of children deployed currently: 26 per server
  • Capacity to Increase This? Yes
  • Need to Increase? No - Grafana

Current Capacity = 26 * 4 = 104 simultaneous requests

JobRunner Capacity (Current)

  • Number of processes: 2 per server (+1 dedicated)
  • Capacity to Increase? No
  • Need to Increase? No - Grafana

Proposal for Future Capacity Planning

Based on the above information, we have a known number of resources currently allocated and a known processing threshold. While there is no specific need to increase this capacity, planning for the future now cannot hurt, surely? So this is my brief proposal for discussion...

  • Remove jobrunner* servers - freeing up 6 vCPUs, 6GB of memory and 90GB of disk space.
  • Introduce two new mw servers (12/13) - with similar specs, this would mean an addition of 8vCPUs, 8GB of memory and 40GB of disk space to the cluster.
  • Introduce a lightweight jobchron server - minimal specs, 1 vCPU, 1GB RAM, 10GB disk space to host Redis + JobChron service - and potentially introduce a 2nd one as a Redis replica if we need to in the future, as it'll be entirely decoupled.
  • Introduce jobrunner onto all mainstream MediaWiki servers as a 1 threaded process.
  • Introduce a MediaWiki task server for heavy and intensive MediaWiki tasks, e.g. long-running scripts and imports. It would not be web accessible but would be publicly exposed (via SSH of course). Resources would potentially be 2 vCPUs, 2GB of memory and 50GB of disk space?

This would increase MediaWiki processing capacity to 156 simultaneous requests, with 8 jobrunner processes instead of 4. If we experience problems with long-running jobs in the future, we could introduce the jobrunner to the MediaWiki task server, as it fits the purpose perfectly.
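
The arithmetic behind the proposal can be sketched as follows. This is an illustrative Python calculation using only the figures quoted in this task; the task server specs are the tentative ones from the proposal, and the names `web_capacity` and `changes` are made up for the example.

```python
CHILDREN_PER_SERVER = 26  # children deployed per MW server, per the task


def web_capacity(mw_servers: int, children: int = CHILDREN_PER_SERVER) -> int:
    """Simultaneous requests the MW servers can handle in total."""
    return mw_servers * children


# Resource deltas as (vCPUs, memory GB, disk GB), taken from the proposal.
changes = {
    "remove 2x jobrunner": (-6, -6, -90),
    "add 2x mw": (8, 8, 40),
    "add jobchron": (1, 1, 10),
    "add task server (tentative specs)": (2, 2, 50),
}

# Sum each column to get the net change across the cluster.
net_vcpus, net_memory_gb, net_disk_gb = (sum(col) for col in zip(*changes.values()))

current_capacity = web_capacity(4)   # 4 MW servers today
proposed_capacity = web_capacity(6)  # 6 MW servers after adding two

print(net_vcpus, net_memory_gb, net_disk_gb)
print(current_capacity, proposed_capacity)
```

On these figures the proposal adds a net 5 vCPUs, 5GB of memory and 10GB of disk across the cluster, while web capacity goes from 104 to 156 simultaneous requests.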

Event Timeline

John triaged this task as Normal priority. Apr 14 2021, 19:30
John created this task.

This sounds like an interesting proposal, and I think having a server dedicated to maintenance scripts and similar tasks is not a bad idea.

Steps to enact the above would be:

  • Reduce jobrunners from 2 to 1 and deploy 'jobrunner' on all servers
  • Fine-tune and monitor the current mw* servers to see whether they handle the load well and whether there are any problems to be solved
  • Create an approval task to introduce two new mw* servers and deploy
  • Create an approval task for a new task server and deploy
  • Create an approval task for a new jobchron server and deploy
  • Decom jobrunner servers

Noting that MW Engineering needs to consider how to balance both the MediaWiki + Debian upgrades before proceeding with installing new servers.

Deployment of jobrunner on all servers has now happened. Per the above, this is blocked on MediaWiki (SRE) deciding when they wish to deploy an additional two servers.

Added the 2 things we'd like to do before adding new servers as subtasks.

This is probably gonna be a few months.

Bullseye is apparently due July 31

Can we move forward with this now? Current tasks such as T7626 and T7633 indicate we cannot wait on expanding our infrastructure.

In T7139#154236, @Void wrote:

Can we move forward with this now? Current tasks such as T7626 and T7633 indicate we cannot wait on expanding our infrastructure.

As I said with bringing mw12 forward, yes we simply need to do this now.

@Reception123 can you review my plans on T7676, T7677, and T7678? I'm setting up pull requests for if we do them, but should we decide to take another approach, that's fine as well.

See also: https://github.com/miraheze/puppet/pull/1827

In T7139#155127, @Void wrote:

@Reception123 can you review my plans on T7676, T7677, and T7678? I'm setting up pull requests for if we do them, but should we decide to take another approach, that's fine as well.

See also: https://github.com/miraheze/puppet/pull/1827

I've had a look, and if you think this is the best approach I have no issue with your proposed plans. I'd say that even with its downsides I would probably prefer the repurpose/recycle approach, but if you think the other is better, let me know and we can discuss.

Void raised the priority of this task from Normal to High. Aug 2 2021, 17:48

I intend to get this done this week.