Or why don’t free and top work in a Linux container?
Lately at Heroku, we have been trying to find the best way to expose memory usage and limits inside Linux containers. It would be easy to do it in a vendor-specific way: most container-specific metrics are available at the cgroup filesystem via /path/to/cgroup/memory.stat, /path/to/cgroup/memory.usage_in_bytes, /path/to/cgroup/memory.limit_in_bytes and others.
An implementation of Linux containers could easily inject one or more of those files inside containers. Here is a hypothetical example of what Heroku, Docker and others could do:
    # create a new dyno (container):
    $ heroku run bash
    # then, inside the dyno:
    (dyno) $ cat /sys/fs/cgroup/memory/memory.stat
    cache 15582273536
    rss 2308546560
    mapped_file 275681280
    swap 94928896
    pgpgin 30203686979
    pgpgout 30199319103
    # ...
/sys/fs/cgroup/ is the recommended location for cgroup hierarchies, but it is not a standard. A tool or library that reads from it and wants to be portable across multiple container implementations would first need to discover the location by parsing /proc/self/cgroup and /proc/self/mountinfo. Further, /sys/fs/cgroup is just an umbrella for all cgroup hierarchies; there is no recommendation or standard for the location of my own cgroup. Thinking about it, /sys/fs/cgroup/self would not be a bad idea.
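To make the discovery step concrete, here is a minimal Python sketch, assuming the cgroup v1 line formats of /proc/self/cgroup (`id:controllers:path`) and /proc/self/mountinfo. The function names are hypothetical, and the sketch ignores the mountinfo root field and escaped characters in mount points, so treat it as an illustration rather than a portable implementation.

```python
import posixpath


def memory_cgroup_path(cgroup_text):
    """Return this process's path inside the memory hierarchy, or None."""
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)  # hierarchy id, controller list, path
        if len(parts) == 3 and "memory" in parts[1].split(","):
            return parts[2]
    return None


def memory_mount_point(mountinfo_text):
    """Return where the memory controller hierarchy is mounted, or None."""
    for line in mountinfo_text.splitlines():
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        fields = left.split()  # the mount point is the fifth field
        fs = right.split()     # filesystem type, source, super options
        if len(fields) < 5 or len(fs) < 3:
            continue
        if fs[0] == "cgroup" and "memory" in fs[2].split(","):
            return fields[4]
    return None


def memory_cgroup_dir(cgroup_text, mountinfo_text):
    """Combine both lookups into the directory holding memory.* files."""
    path = memory_cgroup_path(cgroup_text)
    mount = memory_mount_point(mountinfo_text)
    if path is None or mount is None:
        return None
    return posixpath.normpath(mount + "/" + path)
```

In a real process the two arguments would be the contents of /proc/self/cgroup and /proc/self/mountinfo. Passing the text in keeps the parsing separate from the file access, which may not be permitted inside every container.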
If we decide to go down that path, I would personally prefer to work with the rest of the Linux containers community first and come up with a standard.
I wish it were that simple.
The problem
Most of the Linux tools providing system resource metrics were created before cgroups even existed (e.g.: free and top, both from procps). They usually read memory metrics from the proc filesystem: /proc/meminfo, /proc/vmstat, /proc/PID/smaps and others.
Unfortunately, /proc/meminfo, /proc/vmstat and friends are not containerized, meaning that they are not cgroup-aware. They will always display memory numbers from the host system (physical or virtual machine) as a whole, which is useless for modern Linux containers (Heroku, Docker, etc.). Processes inside a container cannot rely on free, top and others to determine how much memory they have to work with; they are subject to limits imposed by their cgroups and can’t use all the memory available in the host system.
This causes a lot of confusion for users of Linux containers. Why does free say there is 32GB of free memory, when the container only allows 512MB to be used?
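The mismatch can be illustrated with a short Python sketch. It assumes the usual `Name: value kB` layout of /proc/meminfo and a cgroup v1 memory.limit_in_bytes file that holds a single integer; the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of values in bytes."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if not fields or not fields[0].isdigit():
            continue
        value = int(fields[0])
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024  # values carrying a kB suffix are kibibytes
        values[name.strip()] = value
    return values


def memory_view(meminfo_text, limit_text):
    """Return what the host reports next to what the cgroup allows."""
    meminfo = parse_meminfo(meminfo_text)
    return {
        "host_total": meminfo.get("MemTotal"),
        "host_free": meminfo.get("MemFree"),
        "container_limit": int(limit_text.strip()),
    }
```

Fed with the host's /proc/meminfo and the container's limit file, the two numbers can differ by orders of magnitude, which is exactly the confusion described above.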
With the popularization of Linux container technologies such as Heroku, Docker, LXC (version 1.0 was recently released), CoreOS, lmctfy, systemd and friends, more and more people will face the same problem. It is time to start fixing it.
Why is this important?
Visibility into memory usage is very important. It allows people running applications inside containers to optimize their code and troubleshoot problems: memory leaks, swap space usage, etc.
Some time ago, we shipped log-runtime-metrics at Heroku, as an experimental labs feature. It is not a portable solution though, and it does not expose the information inside containers, where monitoring agents could read it. To make things worse, most of the monitoring agents out there (e.g.: NewRelic [1]) rely on information provided by free, /proc/meminfo, etc. That is plain broken inside Linux containers.
On top of that, more and more people have been trying to maximize resource usage inside containers, usually by auto-scaling the number of workers, processes or threads running inside them. This is usually a function of how much memory is available (and/or free) inside the container, and for that to be done programmatically, the information needs to be accessible from inside the container.
More about /proc
In case you wondered, none of the files provided by the cgroup filesystem (/sys/fs/cgroup/memory/memory.*) can be used as a drop-in replacement (i.e.: bind mounted on top of) for /proc/meminfo or /proc/vmstat. They have different formats and use slightly different names for each metric. Why memory.stat and friends decided to use a format different from what was already being used at /proc/meminfo is beyond my comprehension.
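The format gap is easy to see in code. The Python sketch below parses memory.stat text (`name value` pairs in bytes) and renders a few lines in the /proc/meminfo style (`Name: value kB`). The mapping it uses (MemTotal from the cgroup limit, Cached from cache, MemFree as limit minus rss minus cache) is an assumption made for illustration only, not something defined by the kernel or by any existing tool.

```python
def parse_memory_stat(text):
    """Parse cgroup memory.stat text into a dict of integer byte counts."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats


def render_meminfo_style(stats, limit_bytes):
    """Render selected cgroup numbers using meminfo-like names and units."""
    rss = stats.get("rss", 0)
    cache = stats.get("cache", 0)
    rows = [
        ("MemTotal", limit_bytes),                         # assumed mapping
        ("MemFree", max(limit_bytes - rss - cache, 0)),    # assumed mapping
        ("Cached", cache),
    ]
    return "\n".join(
        f"{name + ':':<16}{value // 1024:>8} kB" for name, value in rows
    )
```

Even this small translation has to pick names, units and a definition of "free", which hints at why a plain bind mount of memory.stat over /proc/meminfo cannot work.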
Some of the contents of a /proc filesystem are properly containerized, like the /proc/PID/* and /proc/net/* namespaces, but not all of them. Unfortunately, /proc in general is considered to be a mess. From the excellent “Creating Linux virtual filesystems” article on LWN:
Linus and numerous other kernel developers dislike the ioctl() system call, seeing it as an uncontrolled way of adding new system calls to the kernel. Putting new files into /proc is also discouraged, since that area is seen as being a bit of a mess. Developers who populate their code with ioctl() implementations or /proc files are often encouraged to create a standalone virtual filesystem instead.
I went ahead and started experimenting with that: procg is an alternative proc filesystem that can be mounted inside Linux containers. It replaces /proc/meminfo with a version that reads cgroup-specific information. My goal was for it to be a drop-in replacement for proc, without requiring any patches to the Linux kernel. Unfortunately, I later found that this was not possible, because none of the functions to read memory statistics from a cgroup (linux/memcontrol.h and mm/memcontrol.c) are public in the kernel. I hope to continue this discussion on LKML soon.
Others have tried similar things modifying the proc filesystem directly, but that is unlikely to be merged to the mainstream kernel if it affects all users of the proc filesystem. It would either need to be a custom filesystem (like procg) or a custom mount option to proc. E.g.:
mount -t proc -o meminfo-from-cgroup none /path/to/container/proc
FUSE
There is also a group of kernel developers advocating that this would be better served by something outside of the kernel, in userspace, making /proc/meminfo be a virtual file that collects information elsewhere and formats it appropriately.
FUSE can be used to implement a filesystem in userspace to do just that. Libvirt went down that path with its libvirt-lxc driver. There were attempts to integrate a FUSE version of /proc/meminfo into LXC too.
Even though there is a very nice implementation of FUSE in pure Go, and I am excited about the idea of contributing a plugin/patch to Docker using it, at Heroku we (myself included) have a lot of resistance to using FUSE in production.
This is mainly due to bad past experiences with FUSE filesystems (sshfs, s3fs) and the increased surface area for attacks. My research so far has revealed that the situation may be much better nowadays, and I would even be willing to give it a try if there were no other problems with using FUSE to replace the proc filesystem.
I am also not comfortable with making my containers dependent on a userspace daemon that serves FUSE requests. What happens when that daemon crashes? All containers in the box are probably left without access to their /proc/meminfo. Either that, or we would have to run a different daemon per container. Hundreds of containers in a box would require hundreds of such daemons. Ugh.
/proc is not the only issue: sysinfo
Even if we could find a solution to containerize /proc/meminfo with which everyone is happy, it would not be enough.
Linux also provides the sysinfo(2) syscall, which returns information about system resources (e.g. memory). As with /proc/meminfo, it is not containerized: it always returns metrics for the box as a whole.
I was surprised while testing my proc replacement (procg) that it did not work with Busybox. Later, I discovered that Busybox’s implementation of free does not use /proc/meminfo. Guess what? It uses sysinfo(2). What else out there could also be using sysinfo(2) and be broken inside containers?
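For readers who want to see what sysinfo(2) reports from inside a container, here is a minimal Python sketch that calls it through ctypes. It assumes Linux and a libc that exports sysinfo. The structure mirrors the field order documented in the sysinfo(2) man page, with a slightly oversized reserved tail so the buffer is never smaller than what the kernel writes. The function name is hypothetical.

```python
import ctypes


class SysInfo(ctypes.Structure):
    """Field layout following struct sysinfo from the sysinfo(2) man page."""
    _fields_ = [
        ("uptime", ctypes.c_long),
        ("loads", ctypes.c_ulong * 3),
        ("totalram", ctypes.c_ulong),
        ("freeram", ctypes.c_ulong),
        ("sharedram", ctypes.c_ulong),
        ("bufferram", ctypes.c_ulong),
        ("totalswap", ctypes.c_ulong),
        ("freeswap", ctypes.c_ulong),
        ("procs", ctypes.c_ushort),
        ("pad", ctypes.c_ushort),
        ("totalhigh", ctypes.c_ulong),
        ("freehigh", ctypes.c_ulong),
        ("mem_unit", ctypes.c_uint),
        ("_reserved", ctypes.c_char * 8),  # deliberately generous padding
    ]


def read_sysinfo():
    """Call sysinfo(2) and return total and free RAM in bytes."""
    libc = ctypes.CDLL(None, use_errno=True)
    info = SysInfo()
    if libc.sysinfo(ctypes.byref(info)) != 0:
        raise OSError(ctypes.get_errno(), "sysinfo(2) failed")
    return {
        "total_bytes": info.totalram * info.mem_unit,
        "free_bytes": info.freeram * info.mem_unit,
    }
```

Given the behaviour described above, the totals returned here would be expected to match the host rather than the cgroup limit, regardless of what is mounted over /proc.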
ulimit, setrlimit
On top of cgroup limits, Linux processes are also subject to resource limits applied to them individually, via setrlimit(2).
Both cgroup limits and rlimit apply when memory is being allocated by a process.
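A process can read its own rlimits with getrlimit(2), which Python exposes through the standard resource module. The sketch below reads the soft RLIMIT_AS value and picks the tightest of several candidate limits. Note that RLIMIT_AS caps address space while the cgroup limit caps memory actually charged to the group, so treating them as directly comparable is a simplification made here for illustration. The function names are hypothetical.

```python
import resource


def address_space_limit():
    """Return the soft RLIMIT_AS in bytes, or None when it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return None if soft == resource.RLIM_INFINITY else soft


def tightest(*limits):
    """Return the smallest known limit, ignoring None (meaning unlimited)."""
    known = [limit for limit in limits if limit is not None]
    return min(known) if known else None
```

For example, `tightest(address_space_limit(), cgroup_limit)` would give a rough upper bound, where `cgroup_limit` is a value read from the cgroup filesystem by some other means.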
systemd
Soon, cgroups are going to be managed by systemd. All operations on cgroups are going to be done through API calls to systemd, over D-Bus (or a shared library).
That makes me think that systemd could also expose a consistent API for processes to query their available memory.
But until then…
Solution?
Some kernel developers (and I am starting to agree with them) believe that the best option is a userspace library that processes can use to query their memory usage and available memory.
libmymem would do all the hard work of figuring out where to pull numbers from (/proc vs. cgroup vs. getrlimit(2) vs. systemd, etc.). I am considering starting one. New code could easily benefit from it, but it is unlikely that all existing tools (free, top, etc.) will just switch to it. For now, we might need to encourage people to stop using those tools inside containers.
I hope my unfortunate realization – figuring out how much memory you can use inside a container is harder than it should be – helps people better understand the problem. Please leave a comment below and let me know what you think.
[1] To be fair, I don’t really know what NewRelic is doing these days, I am just using them as an example. They may be reading memory metrics in a different way (maybe aggregating information from /proc/*/smaps). Pardon my ignorance.
It’s indeed a complex problem and one that many of our users have been asking how to solve.
Daniel Lezcano (the original LXC maintainer before he handed over the project to Serge Hallyn and me) wrote a small FUSE filesystem years ago, doing something very similar to what you describe. As you said, though, that doesn’t work for everything, and it has the disadvantage of adding extra userspace/kernelspace roundtrips, which may be really problematic for some tools (think top/htop) that do a lot of accesses.
There’s also the problem that there’s no guarantee that your memory/cpu/whatever limits apply to the whole container, as cgroups apply to PIDs. You may have 20 processes running in your container that have access to just 1 CPU and 512MB of RAM and another 20 that have 4 CPUs and 2GB of RAM, so anything mounting over /proc will have to do per-process lookup (which sort of excludes any caching and just creates additional context switches…).
There are other ways that people have tried, such as adding those files into the different cgroup controllers and then having them bind-mounted over the real thing, but that comes back to the same problem (all processes must be in the same cgroup) and still doesn’t solve sysinfo.
The only transparent fix would be to have the kernel do it transparently for you, possibly under a namespaced sysctl so that this can be turned on for containers and not for the rest of the system. However I very much doubt this would be approved given how much pushback there has been on similar tricks so far.
Coming up with clever userspace libraries is definitely a good idea, if only to save everyone some pain. Unfortunately, this means a lot of software will still report the old thing, because they don’t want to get that extra dependency or because they are just plain outdated. This includes quite a bunch of proprietary software which currently behaves very badly when provided with the wrong memory information…
So anyway, great article, tough topic, I really hope we’ll eventually come up with a solution everyone is fine with…
Thanks Stéphane!
Good point about multiple cgroups in a container. It isn’t a usual use case though, at least in my experience, and it may be better served by nested containers. It’s definitely something to be considered anyway.
Even if I or someone else goes ahead and starts implementing a userspace library, it would still need to read cgroup data from somewhere. We may need a new syscall, or a standard location for cgroup related files (/sys/fs/cgroup/self?) inside containers. Thoughts?
The library probably ought to be aware of multiple methods:
1) Direct fs lookup using /proc/mounts to find the cgroup hierarchies and /proc/self/cgroup to know what cgroup to look into.
2) If cgroupfs isn’t mounted, try to contact cgmanager, the systemd API or any other cgroup manager that may exist (I’m only aware of those two).
3) Give up and return the system value.
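The three methods above amount to an ordered fallback chain, which could be sketched in Python as follows. Everything here is hypothetical: `ask_manager` stands in for whatever call cgmanager or systemd would offer (no such call is made in this sketch), and `cgroup_dir` is assumed to have been discovered beforehand from /proc/mounts and /proc/self/cgroup.

```python
def read_int_file(path):
    """Return the integer stored in a file, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def first_available(providers):
    """Try (name, callable) pairs in order; return the first non-None result."""
    for name, provider in providers:
        value = provider()
        if value is not None:
            return name, value
    return None, None


def memory_limit(cgroup_dir, ask_manager, system_total):
    """Apply the lookup order: cgroupfs, then a manager, then the host value."""
    providers = [
        ("cgroupfs", lambda: read_int_file(cgroup_dir + "/memory.limit_in_bytes")),
        ("manager", ask_manager),     # placeholder for a cgroup manager query
        ("system", system_total),     # last resort: the system-wide number
    ]
    return first_available(providers)
```

Returning the name of the source alongside the value lets a caller tell a real container limit apart from the system-wide fallback.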
Forgot to mention, this won’t work with most LXC containers since for security reasons, we usually never mount cgroupfs or give any way to access it for privileged LXC containers (as if we did, the container would then be able to change its own limits).
CGManager knows how to deal with that and if the cgmanager socket is available inside the container, it will let you do read-only lookups of your own limits (and reject any changes to your own limits even if you are uid 0).
Unprivileged containers aren’t allowed to mount cgroupfs at the moment, so the use of a cgroup manager is required in those cases.
Instead of mounting cgroupfs inside the container, I was thinking of bind mounting specific files read-only inside it (memory.stat, memory.usage_in_bytes, etc.).
A cgmanager (or systemd) D-Bus socket would also work; however, we’d need to agree on a standard location inside the container (probably re-use /sys/fs/cgroup/cgmanager?).
BTW, I wasn’t aware of cgmanager. Serge’s proposal seems to be exactly what we need.
When using cgmanager inside of LXC, the socket is always available at /sys/fs/cgroup/cgmanager/sock
Did you try “lxc.mount.auto = cgroup:mixed” in your config? That one should do the cgroup bind-mounts for you, in theory blocking all dangerous bits and on systems using cgmanager, it’ll only bind-mount the cgmanager socket instead. That’s what I’m usually using on my systems when doing nested containers where I don’t want the intermediary containers to have unreasonable rights.
Not yet. We are not using many LXC features, and we do most of the cgroup/namespaces management ourselves. Thanks for the advice anyway, I’ll look into it.
Thanks for the article — definitely found it interesting. Looking forward to seeing what comes out of this problem. Good luck!
Re: the implied question about New Relic: the New Relic Ruby agent runs entirely in-process, and only reports memory usage stats about the host processes in which it runs. There’s also a standalone server agent that runs as a separate process and reports system-wide stats, but it’s not used on Heroku.
The Ruby agent reads the VmRSS field from /proc/<pid>/status in order to determine the memory usage of a given process on Linux (whether containerized or not). We’d like to be able to also report dyno-level stats on Heroku, but I haven’t been able to figure out exactly how to derive numbers that match what log-runtime-metrics reports. The docs for log-runtime-metrics aren’t very specific about exactly how things are calculated (for example: “Resident Memory (memory_rss): The portion of the dyno’s memory (megabytes) held in RAM.”) and I’ve not been able to derive numbers that exactly line up from either /proc/<pid>/smaps or /proc/<pid>/status, so I assume they come from elsewhere (presumably somewhere that’s only accessible outside the container).
A userspace library that abstracted all of this stuff would be awesome (though we’d need a Ruby wrapper for it in order to access it from the Ruby agent).
In the meantime, if you have any suggestions for how we might approximate log-runtime-metrics’s memory_rss or memory_total fields using stats available within the container, we’d be all ears!
Thanks for your work on this, and keep fighting the good fight!
Thanks for the info Ben!
This is a fine approximation (see below), as long as you do that for all processes inside the container (all /proc/PID/* entries), and also include memory mapped files and cache/buffers. Here is an important excerpt from the proc documentation:
log-runtime-metrics is just reporting metrics provided by the cgroup memory controller itself (<cgroupfs>/memory.stat). This is unfortunately not available inside Heroku dynos yet (which is the whole point of this blog post).
For now, the best thing to do is to aggregate what’s in /proc/PID/smaps (also documented at the link above) for all processes inside the container, and consider cache/buffers. I haven’t done it myself yet, but I imagine that care must be taken with shared pages, so that they are not accounted multiple times. Capturing information for all processes is important because the in-process agent won’t be inside child processes (fork+exec), and will ignore the (minimal) overhead of our init-like (PID=1) process inside dynos.
Have a look at tools like https://github.com/pixelb/ps_mem which should work better due to enumerating processes first, before determining the core memory associated with them. Accurate totals are also provided by accounting for shared memory through the use of /proc/$pid/smaps.
The analogous problem, “CPU inside Linux containers”, occurs with the cpu controller. If you restrict the CPU resources of a container by limiting the number of cores, the common userland tools will fail to detect this. Therefore, all the daemons that derive their default number of threads from the number of available cores will also make the wrong assumptions, and one has to take care of it by hand.
Exactly!
Fabio – I’d really like it if you could check out this memory analysis tool:
https://github.com/etep/zest
This does not address your needs, per se, but it is fundamental research for future computer memory systems. I have been lucky enough to connect with several other folks running large scale applications (e.g. engineers at SAP, LinkedIn, Facebook, Yelp, …) and essentially want to look at as many workloads as possible. Happy to discuss more, shoot me an email if you want.
Hi Fabio –
I’d really appreciate it if you could take a look at the following memory analysis tool:
https://github.com/etep/zest
This doesn’t address your needs directly, but rather, it is fundamental research into future memory system design. I have been lucky enough to connect with engineers running other large scale applications, namely some folks at SAP, LinkedIn, Yelp, Facebook, and other companies, and get some datasets over real apps using real data. The results are pretty amazing, but as with any such project, the more data the better. So please take a look. I would be happy to just communicate regarding the work, the research, etc. Thanks for the awesome blog post.
Regards,
Pete Stevenson
I have started a library for cgroups monitoring and in general a statistics collector for Linux. It is still in alpha but all ears for feedback if it is useful for you.
https://github.com/square/prodeng/blob/master/inspect/README.md
Oh you are wondering about running the tool within a cgroup. inspect wouldn’t be useful in that regard.
Hi Fabio, what do you use nowadays for container monitoring? Great blog by the way.
Thank you! I haven’t been using anything specific to containers, just regular app monitoring tools (Pingdom, NewRelic, Librato, log alerts, rollbar, etc).
Hi Fabio, great summary! Do you have any update so far? Thanks!
Thanks! Unfortunately not, but Serge is trying to get some momentum here: https://github.com/hallyn/libresource
Hello, your post inspired this: https://github.com/giampaolo/psutil/issues/1015
Hi Fabio, thank you, awesome blog by the way. Does this problem still exist? I am experiencing a similar issue on OpenShift Origin v3.11.
Thanks,
Lahiru
Hi Lahiru, thanks for the kind message. Things haven’t changed much on the Linux kernel itself, and I’m not sure what OpenShift Origin is doing, but some container platforms started adopting things like lxcfs (https://linuxcontainers.org/lxcfs/introduction/) to get around this problem.
The JVM specifically added some support for Linux containers (cgroups) recently: https://blogs.oracle.com/java-platform-group/java-se-support-for-docker-cpu-and-memory-limits