Page MenuHomeMiraheze

cp12 OOM'd March 12 2021 16:36
Closed, ResolvedPublic

Description

cp12 crashed due to OOM at the stated time causing a user visible outage.

It was depooled/rebooted/repooled by @Reception123

Please can infra review whether any mitigations / improvements could be made to prevent further outages.

See https://www.irccloud.com/pastebin/XSuAQOmv for log

Event Timeline

Unknown Object (User) triaged this task as High priority.Mar 12 2021, 16:58

Downtime for cp12 until depooling and restarting was ~5-6 minutes, for the Infrastructure team's determination of whether an incident report would be useful here.

Downtime for cp12 until depooling and restarting was ~5-6 minutes, for the Infrastructure team's determination of whether an incident report would be useful here.

An IR is most definitely needed. We need to find out exactly why we ran out of memory and what we can do to prevent future outages.

I see this:

Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) died signal=6
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Child starts
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Assert error in vsmw_newcluster(), common/common_vsmw.c line 247:
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Condition((VFIL_allocate(fd, (off_t)len, 1)) == 0) not true.
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said errno = 28 (No space left on device)
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Wrong turn in child_signal_handler(), cache/cache_main.c line 284: Signal 6 (Aborted) received at 0x7300000376 si_code -6
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said errno = 28 (No space left on device)

Varnish was stopped at 16:36:45, gdnsd depooled cp12 at 16:37:02 on ns1 and at 16:37:04 on ns2. That's an excellent MTTR. I am not able to find out why the service ran out of memory, but since this quite a rare event, I'm tempted to skip the investigation process, since it is unlikely we will find more information.

However, it doesn't look like varnish is properly restarted after a failure: see below for more information.

I see this:

Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) died signal=6
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Child starts
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Assert error in vsmw_newcluster(), common/common_vsmw.c line 247:
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Condition((VFIL_allocate(fd, (off_t)len, 1)) == 0) not true.
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said errno = 28 (No space left on device)
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Wrong turn in child_signal_handler(), cache/cache_main.c line 284: Signal 6 (Aborted) received at 0x7300000376 si_code -6
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said errno = 28 (No space left on device)

tmpfs was full? tmpfs is very small (128M), if a process is SIGKILLed, then varnish has no chance to clean up /var/lib/varnish (tmpfs).
My idea to fix this:

  1. Ensure the systemd unit is configured to restart the service after a SIGKILL: https://askubuntu.com/a/953383
  2. Run rm -rf /var/lib/varnish/ after a failure (before the service is restarted): https://unix.stackexchange.com/a/486032
John claimed this task.

Varnish was stopped at 16:36:45, gdnsd depooled cp12 at 16:37:02 on ns1 and at 16:37:04 on ns2. That's an excellent MTTR. I am not able to find out why the service ran out of memory, but since this quite a rare event, I'm tempted to skip the investigation process, since it is unlikely we will find more information.

However, it doesn't look like varnish is properly restarted after a failure: see below for more information.

I see this:

Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) died signal=6
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Child starts
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Assert error in vsmw_newcluster(), common/common_vsmw.c line 247:
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Condition((VFIL_allocate(fd, (off_t)len, 1)) == 0) not true.
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said errno = 28 (No space left on device)
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said Wrong turn in child_signal_handler(), cache/cache_main.c line 284: Signal 6 (Aborted) received at 0x7300000376 si_code -6
Mar 12 16:36:47 cp12 varnishd[26110]: Child (886) said errno = 28 (No space left on device)

tmpfs was full? tmpfs is very small (128M), if a process is SIGKILLed, then varnish has no chance to clean up /var/lib/varnish (tmpfs).
My idea to fix this:

  1. Ensure the systemd unit is configured to restart the service after a SIGKILL: https://askubuntu.com/a/953383
  2. Run rm -rf /var/lib/varnish/ after a failure (before the service is restarted): https://unix.stackexchange.com/a/486032

@Southparkfan What's MTTR stand for?