Page MenuHomeMiraheze

Investigate MW 502/503 errors/high load (poor optimisation)
Closed, ResolvedPublic

Description

https://dmlwiki.miraheze.org/dmlwiki.miraheze.org

We have gotten semi-regular 503s since the migration. I've seen mention in discord a few times that these have been fixed, but for our wiki they have persisted throughout this time, continuing to happen through to today:

  • Error 503 Backend fetch failed, forwarded for 66.115.147.158, 127.0.0.1 (IP is a VPN IP, so fine to have posted here)

(Varnish XID 293834952) via cp31 at Fri, 04 Mar 2022 01:42:42 GMT.

  • Error 503 Backend fetch failed, forwarded for 66.115.147.158, 127.0.0.1 (IP is a VPN IP, so fine to have posted here)

(Varnish XID 225218191) via cp30 at Fri, 04 Mar 2022 01:27:18 GMT

  • Error 503 Backend fetch failed, forwarded for 66.115.147.158, 127.0.0.1

(Varnish XID 303237555) via cp31 at Fri, 04 Mar 2022 02:04:24 GMT.


Merged task about most imports failing with 502s. (Reception123)

Event Timeline

I see others have been reporting in Discord too.

Not certain if related, but I'm also getting multiple and repeated "bad gateway" errors when using DPL3 to list result sets with a template to format results. At first, lowering the number of results from 300 to 200 seemed to help, now I'm getting errors at 200 and below also.

Reception123 renamed this task from 503s persisting since migration to Investigate MW 502/503 errors/high load (poor optimisation).Mar 6 2022, 19:41
Reception123 added subscribers: MacFan4000, Unknown Object (User).

MediaWiki (SRE): Could we look into this more? It seems like something we need to focus on fixing.

Reception123 raised the priority of this task from Normal to High.Mar 6 2022, 19:49

I've decided to move this to high priority because this issue has been going on for months and not much attention has been paid to it. It's essential that we avoid such disruptions to the service, and if we just take the issue of imports it means that effectively people can't import almost anything without SRE intervention which is not good at all, I've seen imports as low as 1MB fail.

https://grafana.miraheze.org/d/W9MIkA7iz/miraheze-cluster?orgId=1&var-job=node&var-node=mw102.miraheze.org&var-port=9100&viewPanel=281

I've temporarily disabled jobrunner on mw101 so we can have a look at how it affects the load and other params.

I just noticed on the description that DPL is used here. I would very much think that it's got a lot to do with it on some wikis. The wikimedia version won't get extended to more wikis because it's so awful at performance.

This might be something to talk to UO about when he's around and see if we can mitigate any issues.

Unknown Object (User) added a comment.Mar 10 2022, 00:23

I started https://github.com/Universal-Omega/DynamicPageList3/pull/107 which should greatly improve performance of DPL3 if it properly works. But I'm not available for testing this out, so unable to confirm if it'll do any good, or cause issues.

Unknown Object (User) added a comment.EditedMar 10 2022, 03:08

I started https://github.com/Universal-Omega/DynamicPageList3/pull/107 which should greatly improve performance of DPL3 if it properly works. But I'm not available for testing this out, so unable to confirm if it'll do any good, or cause issues.

That PR should now be complete but still needs testing, which I'm not around to do. But it should improve performance and add some support for limiting query time, plus query cache support.

I've temporarily disabled jobrunner on mw101 so we can have a look at how it affects the load and other params.

This change seems to have dropped load by around 1 core, which makes sense as the process would actively utilise 1 core. Given the drop is only 1, I would not make a decision that this change improves the situation, rather provides an immediate and limited temporary solution.

The below is an example of mysql-slow query, this query is in the top 10 on all 3 main DB servers solely. Limits of 50 are much less than 24, but also remain ranked in the top 10 (though they average 4-5s)

Count: 1  Time=24.30s (24s)  Lock=0.00s (0s)  Rows_sent=500.0 (500), Rows_examined=135833.0 (135833), Rows_affected=0.0 (0), mediawiki[mediawiki]@[2a10:6740::6:207]
  SELECT /* SpecialRecentChanges::doMainQuery  */  rc_id,rc_timestamp,rc_namespace,rc_title,rc_minor,rc_bot,rc_new,rc_cur_id,rc_this_oldid,rc_last_oldid,rc_type,rc_source,rc_patrolled,rc_ip,rc_old_len,rc_new_len,rc_deleted,rc_logid,rc_log_type,rc_log_action,rc_params,rc_actor,recentchanges_actor.actor_user AS `rc_user`,recentchanges_actor.actor_name AS `rc_user_text`,comment_rc_comment.comment_text AS `rc_comment_text`,comment_rc_comment.comment_data AS `rc_comment_data`,comment_rc_comment.comment_id AS `rc_comment_cid`,rc_title,rc_namespace,wl_user,wl_notificationtimestamp,page_latest,(SELECT  GROUP_CONCAT(ctd_name SEPARATOR ',')  FROM `change_tag` JOIN `change_tag_def` ON ((ct_tag_id=ctd_id))   WHERE ct_rc_id=rc_id  ) AS `ts_tags`  FROM `recentchanges` JOIN `actor` `recentchanges_actor` ON ((actor_id=rc_actor)) JOIN `comment` `comment_rc_comment` ON ((comment_rc_comment.comment_id = rc_comment_id)) LEFT JOIN `watchlist` ON (wl_user = 1101 AND (wl_title=rc_title) AND (wl_namespace=rc_namespace)) LEFT JOIN `page` ON ((rc_cur_id=page_id))   WHERE rc_bot = 0 AND (rc_namespace NOT IN (1198,1199)) AND (rc_timestamp >= '20210916192044') AND rc_new IN (0,1)   ORDER BY rc_timestamp DESC LIMIT 500
Unknown Object (User) added a comment.EditedMar 17 2022, 06:12

I just noticed on the description that DPL is used here. I would very much think that it's got a lot to do with it on some wikis. The wikimedia version won't get extended to more wikis because it's so awful at performance.

This might be something to talk to UO about when he's around and see if we can mitigate any issues.

Just to note an update was now deployed, and should greatly improve DPL3 performance in this regard.

Thanks very much for the performance improvements. I'll test, but first I'm going to test with DPL3 disabled to see if our 502/503s continue on other pages, and on pages that use Cargo, which is having weird problems of its own, per T8828 (related or unrelated, no idea).

DPL3 + Cargo will probably be the worst possible combination in terms of performance.
(They shouldn't cause 502/3s but they are fairly expensive)

Unknown Object (User) added a comment.Mar 18 2022, 07:00

DPL3 + Cargo will probably be the worst possible combination in terms of performance.
(They shouldn't cause 502/3s but they are fairly expensive)

DPL3 is not expensive every single time anymore. It has both parser and query cache functional now, to prevent it from always being expensive (doing sql and parse every page load) which was the whole point of the patch. Its still not a great combination but DPL3 shouldn't have much performance issues anymore now at least not every page load. It also sets the queries to automatically time out after 10 seconds now, again to prevent issues with some sql queries with it, especially very long ones.

Only one or the other is used on individual pages, not both. Or are you referring to simply having both extensions enabled on a single wiki at the same time (and not using them together on pages)? When we tested their use together on a standalone MW install (1.34 and 1.35 LTS), they were fine.

We are only testing cargo at this point, it is not widely used on the site (a handful of pages only, and those that have no DPL3 use). One is simply in testing as a fallback for the other, if for some reason one was disabled temporarily, which has occurred on MH.

It also seemed prudent to have a plan A, B etc., in case a third-party extension (we plan to use a LOT) is having a bug or falling unmaintained.

With one or the other disabled, they seem to perform the same as both being enabled together (meaning together on the same site, not on the same page). I'm putting both through their paces, separate and together (again not on same page), and can find no difference in performance. The only notable exception is that DPL3 now can handle more results again as it did before, regardless of cargo enabled or not. Cargo, no difference if DPL3 is enabled/disabled.

I also see in discord the 502's and 503's are being reported as occurring on wikis with neither Cargo, nor DPL3 enabled (such as Meta) and other wikis). Are you sure the issue is related to the presence of both of these extensions, even if one is not enabled?

DPL3 (previous to UO's patch) and Cargo are what would class as 'expensive' which mean they take long times to parse and load.

They won't help any wiki with them on and most definitely not both on a single page with any loading time.

Our current issues are generally with the way mediawiki is interacting with our database because write/read queries can be very slow and build up.

Unfortunately, the upstream fixes didn't backport well so we've got to wait until 1.38 until we can deploy them which might prevent mediawiki being completely crazy. Upstream have put a lot of work in the last few releases into improving the schemas (thanks Amir as much as they are a headache to deploy it's much appreciated) so hopefully we'll be able to get a better idea soon.

Our infrastructure is shared between ~1/3rd of wikis for databases which can mean that if one wiki starts causing problems, it can cascade and affect a lot which then slows our frontend down which affects all wikis.

Thanks for confirming, (and yep, using Cargo and DPL3 together on pages sounds a little insane lol). We've entirely disabled Cargo, and we'll just stick where our production wiki is at for now, and finish thoroughly testing the last LTS against Cargo and DPL3 before we do an upgrade. We'll edit once a month or something on MH to keep alive for a future migration re-test.

Do you guys have statistics on roughly how many Miraheze wikis are using Cargo at current?

Forgot to add: Also for UO, the current version of DPL3 seems to work well so far. It is basically running as it was prior to the 1.37 upgrade. Thanks again so much!

Do you guys have statistics on roughly how many Miraheze wikis are using Cargo at current?

Per {{NUMBEROFWIKISBYEXTENSION:cargo}} at press time, 351.

Thanks for this!

Per {{NUMBEROFWIKISBYEXTENSION:cargo}} at press time, 351.

Unknown Object (User) added a comment.Apr 15 2022, 00:53

I have deployed some things that should greatly improve this. Leaving this open, at least for a few days to see if it maintains a better state.

Unknown Object (User) added a project: Performance.Apr 15 2022, 00:56
Unknown Object (User) moved this task from Backlog to Short Term on the MediaWiki (SRE) board.Apr 15 2022, 02:10
Unknown Object (User) closed this task as Resolved.Apr 16 2022, 17:37
Unknown Object (User) claimed this task.

Closing as resolved since it seems to be at the very least, less frequent, and users are noticing that lately.