Oct 27, 2016 | Daniel Jones
A customer of ours recently had trouble with their Concourse instance, which resulted in us raising a GitHub issue describing how containers may get orphaned and how users may then see the error message insufficient subnets remaining in the pool.
This is most easily remedied by performing bosh recreate on the Worker VM, or, if you’re not sensible enough to have deployed your Concourse with BOSH, by locating and killing all the orphaned container processes. The debugging journey was rather fun and exposed some of the innards of Concourse and Garden, so we’ve decided to share it.
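If you go the manual route, here’s a rough sketch of the clean-up (hedged: the process pattern comes from what we found later in this post, and the PID is hypothetical - do check what you’re killing first):
# Each orphaned container has an init process that shows up as '/proc/self/exe init' in ps;
# list them, work out which ones no running build owns, then kill them:
$ ps -ef --sort=start_time | grep '/proc/self/exe [i]nit'
$ kill <pid-of-orphan>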
Before going any further I should point out that all of the Bash-fu is the work of my colleague Jonathan Matthews.
Our first instinct was to check fly containers to see how many were in existence. We only saw 14 in total, which shouldn’t be enough to starve Concourse of any resources. So we then used fly workers to get a second view of how many containers were active:
$ fly -t target workers
name                                  containers  platform  tags  team
0a04d3fd-d485-4976-b4a0-84e12c37c05c  242         linux     none  none
A quick bit of Googling (other search engines are available) led us to a GitHub issue citing the same error message, but with no conclusive resolution. The issue pointed out that there’s a hard limit in the defaults of Garden (the underlying container-running implementation behind Concourse) of 250 containers, and that there’s also a limit on the subnets that can be assigned to those containers. This at least explained why the error message was about subnets.
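To make the subnet limit concrete, here’s the back-of-the-envelope arithmetic, assuming Garden’s default network pool of a /22 carved into a /30 per container (our understanding of the defaults at the time):
# Number of /30 subnets that fit in a /22:
$ echo $(( 2 ** (30 - 22) ))
256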
Our next question was: “What are all these containers doing?” This Concourse instance has around 10 modest pipelines, each with no more than 5 resources to be polled. A quick survey of the teams yielded that no-one had set check_every on their resources to anything other than the default, so we were able to rule out a user DoSing Garden with excessive resource polling.
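A quick way to double-check that, if you’re in the same boat (the pipeline name is hypothetical; no output means everyone is on the default polling interval of roughly a minute):
# Dump a pipeline's config and look for any overridden polling interval:
$ fly -t target get-pipeline -p some-pipeline | grep check_every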
We bosh ssh’d onto the Worker VM to investigate further, first poking around in the Garden logs in /var/vcap/sys/log/garden. We then downloaded the very handy gaol CLI to interact with the Garden server running on the Worker - it connected automatically, presumably defaulting to conventional settings for the server.
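Roughly what that looked like, hedged from memory (gaol ships as a single static binary via its GitHub releases, and with no target specified it talks to the local Garden API, conventionally on port 7777):
# Make the downloaded binary executable and get a raw count of Garden's containers:
$ chmod +x ./gaol
$ ./gaol list | wc -l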
We used gaol list to have a look at the containers this Garden instance was running, and found a very long list of UUIDs. All of the UUIDs from fly containers appeared in gaol list, which would have to be a Universe-breaking coincidence if Concourse wasn’t exposing Garden UUIDs via its UI. We’re rather glad that it does, as this made debugging a little easier.
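That cross-check can also be scripted (hedged: in the fly version we were using, the container handle was the first column of fly containers, after a header row):
# Handles that fly knows about but Garden doesn't - an empty result means
# the two views line up:
$ fly -t target containers | awk 'NR>1 {print $1}' | sort > fly-handles
$ gaol list | sort > gaol-handles
$ comm -23 fly-handles gaol-handles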
gaol properties revealed that our containers had a grace-time of 300000000000. The only snafu is that there are no units specified, so we had to trawl through the Garden and Concourse code to try and find out what the default unit of measurement was. Sadly it was quicker to work out that 5 minutes is equal to 300,000,000,000 nanoseconds than it was to find any conclusive proof, so we made the assumption that Concourse was setting the grace time of Garden containers in line with its advertised TTL.
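The back-of-the-envelope check, assuming the unit really is nanoseconds:
# 300,000,000,000 ns -> seconds -> minutes
$ echo $(( 300000000000 / 1000000000 / 60 ))
5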
At this point we know that Concourse says its Worker has many containers, possibly more than we think it should. Garden concurs with this point of view. Because of the limitations of fly, we can’t see all containers across all teams.
We needed to try and determine whether the containers alive in Garden were ‘valid’ or not: should they be there? Are they for active running tasks?
gaol shell allowed us to get a shell session in any given container. The first few we picked at random seemed quite innocuous, mostly Concourse resources being checked. New containers were being spun up as others were being torn down, which made investigating them all a game of whack-a-mole. We instead needed to do a coarse-grained diff over the containers - how many would still be alive after the stated TTL of 5 minutes? We knew that few jobs in any of the teams’ pipelines lasted longer than a few minutes, so we were unlikely to see hours-long tasks.
$ gaol list > containers-before
# 15 minutes later...
$ gaol list > containers-after
$ diff --side <(sort containers-before) <(sort containers-after) | grep -c -e '<' -e '>' -e '|'
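An alternative that counts just the survivors directly would have been comm:
# Handles present in both snapshots, i.e. containers that outlived the 5-minute TTL:
$ comm -12 <(sort containers-before) <(sort containers-after) | wc -l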
This yielded the fact that the majority of containers were living longer than 15 minutes. It only then dawned on us that we could simply use ps to see how long some of these processes had been kicking around for:
$ ps aux | grep run[c]
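# (the [c] stops grep from matching its own entry in the ps output)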
Some of these containers had been kicking around for a week! What’s more, the ps output included the UUID of each of them, so we could use gaol shell to get inside and see what was happening. Or so we thought.
$ gaol shell a65b0643-4e41-4081-634c-0256fe4e91e8
error: hijack: Backend error: Exit status: 500, message: {"Type":"","Message":"unable to find user root: no matching entries in passwd file","Handle":""}
The long-lived containers seemed to be in some state of distress. Were their volumes missing? gaol properties tells us the UUIDs of the volumes mounted into each container, and we happen to know that those volumes are managed by Baggage Claim and live in /var/vcap/data/baggageclaim/volumes/.
$ gaol properties eb8f2bbf-86e1-4603-4881-548c9133aa62
# lots of stuff...
concourse:volumes ["88914c83-fc71-4358-4f3f-b886429f9013","1a657656-9647-4bee-54be-8505eb323b9e"]
# lots of stuff...
We now had a place to go hunting for further information:
$ ll /var/vcap/data/baggageclaim/volumes/live/88914c83-fc71-4358-4f3f-b886429f9013
total 8
drwxr-xr-x 1 root root 58 Oct 19 19:07 ./
drwxr-xr-x 1 root root 17424 Oct 27 15:58 ../
-rw-r--r-- 1 root root 461 Oct 19 19:08 properties.json
-rw-r--r-- 1 root root 36 Oct 21 14:29 ttl.json
drwxr-xr-x 1 root root 130 Oct 19 19:08 volume/
$ cat properties.json
{
"initialized":"yep",
"resource-params":"04f8ff2682604862e405bf88de102ed7710ac45c1205957625e4ee3e5f5a2241e453614acc451345b91bafc88f38804019c7492444595674e94e8cf4be53817f",
"resource-source":"262e6b8d86067f0f96a98611e89a05a5e28632790cadca4853976260860a02342e93a7bc8b0a0ddf2577654ff03dac5c2cc64a4c7497eafa35edfefd161e946d",
"resource-type":"docker-image",
"resource-version":"{\"digest\":\"sha256:869f748b9399b1650abe41108c11d2817bc5d4c2b226799b3041ca74bf3f88ca\"}"
}
$ cat ttl.json
{"ttl":0,"expires_at":1477060154}
$ date -d @1477060154
Fri Oct 21 14:29:14 UTC 2016
So here was a container that definitely should have died off a long time ago - it was Thursday 27th when we were doing this investigation. But what about the other volume?
$ stat /var/vcap/data/baggageclaim/volumes/live/1a657656-9647-4bee-54be-8505eb323b9e
stat: cannot stat ‘/var/vcap/data/baggageclaim/volumes/live/1a657656-9647-4bee-54be-8505eb323b9e’: No such file or directory
Checking the Baggage Claim logs in /var/vcap/sys/log/baggageclaim showed that the volume had been intentionally deleted:
$ grep -e 1a657656-9647-4bee-54be-8505eb323b9e baggageclaim.std*
baggageclaim.stdout.log:{
"timestamp":"1477596534.502955675",
"source":"baggageclaim",
"message":"baggageclaim.tick.reaping",
"log_level":1,
"data":{
"handle":"1a657656-9647-4bee-54be-8505eb323b9e",
"session":"69393",
"ttl":300
}
}
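For reference, that reaping entry’s timestamp (the seconds before the decimal point) converts to the very day of this investigation:
$ date -d @1477596534
Thu Oct 27 19:28:54 UTC 2016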
There was however no mention of the container in the ATC logs:
$ zgrep eb8f2bbf-86e1-4603-4881-548c9133aa62 * | wc -l
0
…and so ends the trail. We’ve not been able to figure out exactly what was going on. We had originally theorised that this was a one-off event caused by a Worker restart a week previously, whereby the Worker didn’t connect to the TSA on the ATC for a good hour (again, we don’t know why, but the debug session for that is a whole other blog post, complete with voodoo exorcism). This theory didn’t hold up to scrutiny, however: when we looked at the start times of the processes relating to our containers, we saw that ‘orphans’ had been created right up until the present:
# Ran at about midday
$ ps -ef --sort=start_time | grep "/proc/self/exe"
# SNIP!
root 30704 1 0 Oct27 ? 00:00:00 /proc/self/exe init
root 5486 1 0 Oct27 ? 00:00:00 /proc/self/exe init
root 21706 1 0 Oct27 ? 00:00:00 /proc/self/exe init
root 32505 1 0 Oct27 ? 00:00:00 /proc/self/exe init
root 1629 1 0 04:10 ? 00:00:00 /proc/self/exe init
root 20108 1 0 09:48 ? 00:00:00 /proc/self/exe init
root 20853 1 0 09:48 ? 00:00:00 /proc/self/exe init
root 22398 1 0 09:48 ? 00:00:00 /proc/self/exe init
root 31571 1 0 09:51 ? 00:00:00 /proc/self/exe init
root 1366 1 0 09:51 ? 00:00:00 /proc/self/exe init
root 8974 1 0 09:53 ? 00:00:00 /proc/self/exe init
root 10608 1 0 09:54 ? 00:00:00 /proc/self/exe init
root 13974 1 0 09:54 ? 00:00:00 /proc/self/exe init
root 24614 1 0 09:57 ? 00:00:00 /proc/self/exe init
root 27895 1 0 09:58 ? 00:00:00 /proc/self/exe init
root 30103 1 0 09:59 ? 00:00:00 /proc/self/exe init
root 30769 1 0 09:59 ? 00:00:00 /proc/self/exe init
It’s a little odd that there are almost no orphans created overnight, yet as soon as office hours start we see new ones appearing every few minutes.