
Implement local nameserver cache daemons on servers
Closed, Resolved (Public)

Description

Since the introduction of TLS for MariaDB connections, MediaWiki connects to the database servers by hostname instead of by IP address. This works fine, but connecting to MariaDB has become much slower, because every connection (about four per request!) triggers A and AAAA lookups through authoritative DNS.

Without any form of caching: 16.29% 58.451 4 - Wikimedia\Rdbms\DatabaseMysqli::mysqlConnect
When hardcoding IPv4 or IPv6 for 'db1[1-3].miraheze.org' in /etc/hosts: 7.67% 17.928 4 - Wikimedia\Rdbms\DatabaseMysqli::mysqlConnect

Presumably due to load on the MediaWiki servers, four connections on a busier MediaWiki server (e.g. one being flooded with requests) take 700 ms(!), whereas the /etc/hosts hack reduced this to 200 ms.
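For anyone wanting to reproduce this kind of comparison outside the MediaWiki profiler, here is a minimal sketch that times repeated name lookups. It is not the profiling method used above; the function name `time_lookups`, the default port 3306 and the injectable `resolve` parameter are assumptions made for illustration.

```python
import socket
import time


def time_lookups(host, port=3306, runs=5, resolve=socket.getaddrinfo):
    """Return the wall-clock duration of each lookup in milliseconds.

    `resolve` defaults to socket.getaddrinfo, which goes through the system
    resolver (so /etc/hosts entries and any local cache are honoured). It is
    injectable so the function can be exercised without touching the network.
    """
    durations_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        # With no address family given, this typically asks for both
        # IPv4 (A) and IPv6 (AAAA) results.
        resolve(host, port)
        durations_ms.append((time.perf_counter() - start) * 1000.0)
    return durations_ms
```

Calling `time_lookups("db.example.org")` (a placeholder hostname) before and after adding an /etc/hosts entry should show the same kind of gap as the profile lines above, although absolute numbers will differ per server and load.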

Hardcoding the IP addresses of database servers in /etc/hosts is the easiest solution at first glance, but not the best example of reducing technical debt. Besides, applications like Matomo perform lots of DNS lookups as well, so a cluster-wide solution is the best one. https://phabricator.wikimedia.org/T171498 is a task at Wikimedia describing a different problem (overloaded DNS servers), but indicating the same solution: cache daemons on servers.
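To illustrate why a cache daemon helps, the following toy sketch puts a TTL cache in front of a resolver function. It is only an in-process illustration of the idea, not how a real caching resolver is implemented: a real daemon honours per-record TTLs from DNS and serves every process on the host. The class name `CachingResolver`, the fixed `ttl_seconds` and the injectable `clock` are all assumptions.

```python
import socket
import time


class CachingResolver:
    """Toy TTL cache in front of a resolver function (illustration only)."""

    def __init__(self, resolve=socket.getaddrinfo, ttl_seconds=60.0,
                 clock=time.monotonic):
        self._resolve = resolve
        self._ttl = ttl_seconds  # fixed TTL: an assumption, real caches use record TTLs
        self._clock = clock
        self._cache = {}  # (host, port) -> (expiry time, answer)

    def lookup(self, host, port):
        key = (host, port)
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no upstream query
        answer = self._resolve(host, port)  # cache miss: ask upstream
        self._cache[key] = (now + self._ttl, answer)
        return answer
```

With something like this in place, the roughly four connections per request would pay for at most one upstream lookup per hostname per TTL window instead of one per connection.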

Event Timeline

Dmehus triaged this task as Normal priority. Jan 17 2021, 20:19
Dmehus added a subscriber: Dmehus.

Assigning as normal priority, unless @Southparkfan feels differently

John claimed this task.
John added a subscriber: John.

Unreliable numbers here from small-scale testing, but:

| Dig Specified Hostname | 5-total (ms) | Average (ms) |
| ----- | ----- | ----- |
| (blank) | 53 | 10.6 |
| 192.184.82.120 (ns1) | 475 | 95 |
| 8.8.8.8 (Google) | 64 | 12.8 |
| 1.1.1.1 (CloudFlare) | 11 | 2.2 |
| 127.0.0.1 (pdns) | 3 | 0.6 |

Considering that our resolver config is currently set up to query Google's resolvers, local caching would bring an average reduction of about 10 ms per query. In addition, this would allow us to use the internal domains for private IP servers.
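The arithmetic behind that estimate can be checked with a short sketch. The totals are copied from the table above; the helper name `saving_vs_local` is made up for illustration, and treating the "(blank)" row as the servers' current default resolver is an assumption.

```python
# 5-query totals in milliseconds, copied from the table above.
totals_ms = {
    "(blank)": 53,
    "192.184.82.120 (ns1)": 475,
    "8.8.8.8 (Google)": 64,
    "1.1.1.1 (CloudFlare)": 11,
    "127.0.0.1 (pdns)": 3,
}
QUERIES = 5

# Average per query for each resolver.
averages_ms = {name: total / QUERIES for name, total in totals_ms.items()}
local_ms = averages_ms["127.0.0.1 (pdns)"]


def saving_vs_local(name):
    """Average per-query saving (ms) when moving from `name` to the local cache."""
    return round(averages_ms[name] - local_ms, 1)
```

Under that reading, `saving_vs_local("(blank)")` gives 10.0 ms and `saving_vs_local("8.8.8.8 (Google)")` gives 12.2 ms, which is in line with the "about 10 ms" figure. With only five queries per resolver, these numbers are indicative at best.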

Average probe DNS lookups:

| Server | Pre-Change (ms) | Post Change (ms) | Percentage Change |
| ----- | ----- | ----- | ----- |
| cp3 | 27.5 | 0.279 | -99% |
| cp10 | 28.3 | 0.562 | -98% |
| cp11 | 29.9 | 0.463 | -98% |
| cp12 | 27 | 0.423 | -98.5% |
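As a sanity check on the percentage column, the relative change can be recomputed from the pre- and post-change averages. This is a small sketch with a hypothetical helper name (`percent_change`); the inputs are copied from the table above.

```python
def percent_change(before_ms, after_ms):
    """Relative change from before to after, in percent (negative means faster)."""
    return (after_ms - before_ms) / before_ms * 100.0


# (pre-change ms, post-change ms) per server, copied from the table above.
probes = {
    "cp3": (27.5, 0.279),
    "cp10": (28.3, 0.562),
    "cp11": (29.9, 0.463),
    "cp12": (27.0, 0.423),
}

changes = {server: percent_change(pre, post) for server, (pre, post) in probes.items()}
```

All four results fall between roughly -98% and -99%, matching the table to within rounding.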

Graphs back up the data. Specifically, this graph of cp3's lookup times shows the full impact for users in Asia. No points for guessing at what time the change went live based on the graph.

I'll provide another update on the real-terms impact of the change once more time has passed, to allow the averages to fully stabilise and the graphs to give a truer reflection.