When Nomad misses a (heart)beat

In my homelab I have a hybrid setup (nodes both in the cloud and in the basement), and I use Tailscale to bridge the physical gap in the network.

For a while now (I just never bothered to investigate), I have been noticing the following mystery:

  1. One node receives a lot of work to do (think: a request for a multi-platform Docker build via Gitea Actions)

  2. Docker containers on that node get restarted en masse

  3. Nomad restarts all the jobs and everything "just works" again

Because of point 3, I never really had an incentive to find and fix the problem: Nomad stabilizes the system rather quickly (within a minute or so). The problem kept recurring for months, but I didn't care much (pets vs cattle and all that).

What eventually turned me around is that the problem struck while I was introducing a new CI/CD platform (I am replacing Drone CI with Gitea Actions). Debugging long builds that also fail because the Docker containers running them die is not the best use of my free time.

Now, my assumption was that the node would just work with the basic Linux (Debian) setup without any system tinkering (Ansible sets up the SSH server, my user account with a bunch of dotfiles, etc., but that's all just normal customization everyone does). That was a standard but lousy assumption.

Down the rabbit hole we go...

Is it OOM?

I noticed in my dmesg output [1] that some Docker processes had been taken down by the OOM killer, so I immediately added swap to the system.
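If you want to look for the same thing on your own node, here is a minimal sketch. The helper name find_oom_events is made up for this post, and the pattern is an assumption based on the messages recent Linux kernels typically print, so adjust it to what your logs actually contain.

```shell
#!/bin/sh
# Hypothetical helper: keep only the kernel log lines on stdin that look
# like OOM-killer activity. The pattern is an assumption, not an official list.
find_oom_events() {
    grep -iE 'out of memory|oom[-_]kill|oom_reaper'
}

# Typical usage (needs permission to read the kernel log):
#   sudo dmesg -T | find_oom_events
#   sudo journalctl -k | find_oom_events
```

Reading from stdin keeps the filter independent of where the log comes from, which is handy when you are not sure whether dmesg or the journal still has the relevant lines.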

It's not that hard to add a swap file to a Debian system. DigitalOcean, for example, has nice and very readable articles covering basic administration tasks for standard Linux packages, and they have one for swap as well.
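For reference, the usual steps look roughly like the sketch below. The function name add_swapfile, the /swapfile path, and the 4G size are example values I picked for illustration, not necessarily what I ran back then. It must run as root, and it assumes a filesystem where fallocate produces a file usable as swap (ext4 does).

```shell
#!/bin/sh
# Hypothetical helper: create a swap file, enable it, and persist it.
# Usage (as root): add_swapfile /swapfile 4G
add_swapfile() {
    swap_path="$1"
    swap_size="$2"
    fallocate -l "$swap_size" "$swap_path" || return 1  # reserve the space
    chmod 600 "$swap_path" || return 1                  # readable by root only
    mkswap "$swap_path" || return 1                     # write the swap signature
    swapon "$swap_path" || return 1                     # enable it right away
    # Add an fstab entry so the swap file is enabled again after a reboot.
    printf '%s none swap sw 0 0\n' "$swap_path" >> /etc/fstab
}
```

Afterwards, swapon --show or free -h should list the new swap space.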

Now, it is said that in the cloud age one shouldn't really depend on swap and the workload should just be stable enough to run out of RAM (for various reasons), but we are talking about a small 16 GB laptop [2], so it makes sense for me to still resort to swap for those peak moments.

This helped a bit, and Docker ran much more stably. Containers still disappeared, though, so memory definitely wasn't the root cause; it just made the failures happen (even) more often.

Is it the Docker service?

The next suspect was the Docker daemon itself.

I've noticed this strange message appearing over and over again in the logs:

Your kernel does not support memory swappiness capabilities or the cgroup is not mounted. Memory swappiness discarded

But this is standard Debian Linux, so why wouldn't that cgroup be enabled/mounted? Weird. Searching the Net, though, we find that Docker considers this normal; they even have a section of the docs dedicated to this specific message.

Unfortunately, their fix didn't work on the current stable Debian (11). I had to research further and found a great thread for microk8s that pointed out a change in Debian and, later in the thread, a way to work around it until cgroups v2 are supported. There I finally found my hero, someone with the exact same problem, and with them the solution to finally get rid of the message in the logs:

# set in /etc/default/grub
GRUB_CMDLINE_LINUX="cgroup_enable=memory cgroup_memory=1 systemd.unified_cgroup_hierarchy=0"

Then just run update-grub and restart the laptop [3]. This removed the predominant error message but, as you probably guessed, the containers continued to die under heavy load.
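To confirm after the reboot that the kernel really picked up the new parameters, a small check like this one can help. The helper name check_cgroup_flags is hypothetical; the three flags are exactly the ones from the GRUB line above.

```shell
#!/bin/sh
# Hypothetical helper: report which of the three GRUB flags are present
# in the kernel command line passed as the first argument.
# Returns 0 when all are present, 1 otherwise.
check_cgroup_flags() {
    cmdline="$1"
    missing=0
    for flag in cgroup_enable=memory cgroup_memory=1 systemd.unified_cgroup_hierarchy=0; do
        case " $cmdline " in
            *" $flag "*) echo "ok:      $flag" ;;
            *)           echo "missing: $flag"; missing=1 ;;
        esac
    done
    return "$missing"
}

# After the reboot:
#   check_cgroup_flags "$(cat /proc/cmdline)"
#   stat -fc %T /sys/fs/cgroup   # "tmpfs" points to cgroup v1, "cgroup2fs" to v2
```

If a flag shows up as missing, the most likely culprit is a forgotten update-grub or an edit to the wrong GRUB variable.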

Is it Nomad?

But then I figured out something (completely obvious in hindsight): the containers that were restarted were all Nomad jobs. No other container running on that Docker daemon was ever restarted. 🤦

So I refocused on the Nomad setup: what could have gone wrong there?

Nomad hides some very complex machinery behind the simplicity of its job scheduling, and it has great documentation.

What I had missed so far is that the containers never failed on their own: they were effectively killed by the scheduler. If the Nomad agent on a node doesn't check in with the server in time through the heartbeat mechanism, the server considers the node down and its jobs get rescheduled.

More research, another great GitHub issue thread, and here we go: the Nomad team lead simply says there is a way around this problem:

Just curious is there any way to increase heartbeat manually?

There are a few heartbeat related settings on the server: https://www.nomadproject.io/docs/agent/configuration/server.html#heartbeat_grace

So, the solution for my network setup was to give the heartbeat more slack than the default: I replaced the default value of 10s for heartbeat_grace in the server block of the Nomad server configuration with an extremely large 120s (but I'm done with tiny hammers at this point).
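For illustration, here is the fragment in question, wrapped in a tiny shell helper so the value is easy to change. The helper name heartbeat_grace_snippet is made up; only the heartbeat_grace setting and the 120s value come from my setup. Treat the output as something to merge into your existing server block rather than a complete server configuration.

```shell
#!/bin/sh
# Hypothetical helper: print a Nomad server block fragment with the given
# heartbeat grace period (defaults to the 120s used in this post).
heartbeat_grace_snippet() {
    grace="${1:-120s}"
    cat <<EOF
server {
  # Extra time a client gets beyond its heartbeat TTL before it is marked down.
  heartbeat_grace = "$grace"
}
EOF
}

# Example: print the fragment, then merge it by hand into the server config.
#   heartbeat_grace_snippet 120s
```

The setting lives on the Nomad servers, not on the clients, and the server agent needs a restart to pick it up.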

Increasing the grace setting is probably the most straightforward way to give clients more time to recover when the CPU is under very heavy load. In my case, I assume that the heartbeat going over the Tailscale VPN and the server running on a rather old and weak Intel chip don't help either.

Many Gitea Actions runs with very heavy CPU and memory workloads later, Nomad still hasn't decided to take any node down.

So far, so good: no more restarts noticed. Let's hope it stays stable! I definitely don't want to see this problem ever again.

  1. Or was it journalctl -xn? Not sure, this was a long time ago at this point. But there definitely were OOM strings in the logs, and I did have that problem.

  2. I know, I should replace it with an Intel NUC. I am just waiting for the damn thing to fail... it has been working well enough for 6 years already and I don't like replacing stuff that just works.

  3. Or, in my case, use my Telegram bot that runs laptop-booter to go through Intel AMT for a power cycle and then the Dropbear SSH port to run cryptroot-unlock for my full-disk-encryption setup (but you know, that's just me).