Lately at Heroku, we have been trying to find the best way to expose memory usage and limits inside Linux containers. It would be easy to do it in a vendor-specific way: most container-specific metrics are available in the cgroup filesystem, via /path/to/cgroup/memory.limit_in_bytes and others.
An implementation of Linux containers could easily inject one or more of those files inside containers. Here is a hypothetical example of what Heroku, Docker and others could do:
# create a new dyno (container):
$ heroku run bash

# then, inside the dyno:
(dyno) $ cat /sys/fs/cgroup/memory/memory.stat
cache 15582273536
rss 2308546560
mapped_file 275681280
swap 94928896
pgpgin 30203686979
pgpgout 30199319103
# ...
/sys/fs/cgroup/ is the recommended location for cgroup hierarchies, but it is not a standard. A tool or library trying to read from it, and be portable across multiple container implementations, would need to discover the location first, presumably by parsing /proc/mounts. /sys/fs/cgroup is just an umbrella for all cgroup hierarchies; there is no recommendation or standard for the location of a container's own cgroup. Thinking about it, /sys/fs/cgroup/self would not be a bad idea.
If we decide to go down that path, I would personally prefer to work with the rest of the Linux containers community first and come up with a standard.
I wish it were that simple.
Most of the Linux tools providing system resource metrics were created before cgroups even existed (e.g. free and top, both from procps). They usually read memory metrics from the proc filesystem: /proc/meminfo, /proc/vmstat, /proc/PID/smaps and others. Unfortunately, /proc/meminfo, /proc/vmstat and friends are not containerized, meaning that they are not cgroup-aware: they will always display memory numbers for the host system (physical or virtual machine) as a whole, which is useless for modern Linux containers (Heroku, Docker, etc.). Processes inside a container cannot rely on free, top and others to determine how much memory they have to work with; they are subject to the limits imposed by their cgroups and cannot use all the memory available in the host system.
This causes a lot of confusion for users of Linux containers. Why does free say there is 32GB of free memory when the container only allows 512MB to be used?
With the popularization of Linux container technologies – Heroku, Docker, LXC (version 1.0 was recently released), CoreOS, lmctfy, systemd and friends – more and more people will face the same problem. It is time to start fixing it.
Why is this important?
Visibility into memory usage is very important. It allows people running applications inside containers to optimize their code and troubleshoot problems: memory leaks, swap space usage, etc.
Some time ago, we shipped log-runtime-metrics at Heroku as an experimental labs feature. It is not a portable solution, though, and it does not expose the information inside containers, where monitoring agents could read it. To make things worse, most of the monitoring agents out there (e.g. NewRelic¹) rely on information provided by /proc/meminfo, etc., which is plain broken inside Linux containers.
On top of that, more and more people have been trying to maximize resource usage inside containers, usually by auto-scaling the number of workers, processes or threads running inside them. This is usually a function of how much memory is available (and/or free) inside the container, and for that to be done programmatically, the information needs to be accessible from inside the container.
In case you wondered, none of the files provided by the cgroup filesystem (/sys/fs/cgroup/memory/memory.*) can be used as a drop-in replacement (i.e. bind mounted on top of) for /proc/vmstat or /proc/meminfo: they have different formats and use slightly different names for each metric. Why memory.stat and friends decided to use a format different from what was already being used in /proc/meminfo is beyond my comprehension.
Some of the contents of the /proc filesystem are properly containerized (/proc/net/*, for example, is aware of network namespaces), but not all of them. Unfortunately, /proc in general is considered to be a mess. From the excellent “Creating Linux virtual filesystems” article on LWN:
Linus and numerous other kernel developers dislike the ioctl() system call, seeing it as an uncontrolled way of adding new system calls to the kernel. Putting new files into /proc is also discouraged, since that area is seen as being a bit of a mess. Developers who populate their code with ioctl() implementations or /proc files are often encouraged to create a standalone virtual filesystem instead.
I went ahead and started experimenting with that: procg is an alternative proc filesystem that can be mounted inside Linux containers. It replaces /proc/meminfo with a version that reads cgroup-specific information. My goal was for it to be a drop-in replacement for proc, without requiring any patches to the Linux kernel. Unfortunately, I later found that this was not possible, because none of the functions that read memory statistics from a cgroup (mm/memcontrol.c) are public in the kernel. I hope to continue this discussion on LKML soon.
Others have tried similar things by modifying the proc filesystem directly, but that is unlikely to be merged into the mainline kernel if it affects all users of the proc filesystem. It would either need to be a custom filesystem (like procg) or a custom mount option to proc, e.g.:
mount -t proc -o meminfo-from-cgroup none /path/to/container/proc
There is also a group of kernel developers advocating that this would be better served by something outside of the kernel, in userspace: /proc/meminfo would become a virtual file that collects information elsewhere and formats it appropriately.
FUSE can be used to implement a filesystem in userspace that does just that. Libvirt went down that path with its libvirt-lxc driver, and there were attempts to integrate a FUSE version of /proc/meminfo into LXC too.
Even though there is a very nice implementation of FUSE in pure Go, and I am excited about the idea of contributing a plugin/patch to Docker using it, at Heroku we (myself included) have a lot of resistance to using FUSE in production. This is mainly due to bad past experiences with FUSE filesystems (sshfs, s3fs) and the increased attack surface. My research so far suggests that the situation may be much better nowadays, and I would even be willing to give it a try if there were not other problems with using FUSE to replace the proc filesystem.
I am also not comfortable making my containers dependent on a userspace daemon that serves FUSE requests. What happens when that daemon crashes? All containers in the box are probably left without access to their /proc/meminfo. Either that, or we would have to run a different daemon per container: hundreds of containers in a box would require hundreds of such daemons. Ugh.
/proc is not the only issue:
Even if we could find a solution to containerize /proc/meminfo with which everyone is happy, it would not be enough. Linux also provides the sysinfo(2) syscall, which returns information about system resources (e.g. memory). As with /proc/meminfo, it is not containerized: it always returns metrics for the box as a whole.
I was surprised while testing my proc replacement (procg) to find that it did not work with Busybox. Later, I discovered that Busybox’s implementation of free does not use /proc/meminfo. Guess what? It uses sysinfo(2). What else out there could also be using sysinfo(2) and be broken inside containers?
On top of cgroup limits, Linux processes are also subject to resource limits applied to them individually, via rlimits (see setrlimit(2)). Both cgroup limits and rlimits apply when memory is being allocated by a process.
Soon, cgroups are going to be managed by systemd: all operations on cgroups will be done through API calls to systemd, over D-Bus (or a shared library). That makes me think that systemd could also expose a consistent API for processes to query their available memory.
But until then…
Some kernel developers (and I am starting to agree with them) believe that the best option is a userspace library that processes can use to query their memory usage and available memory.
This hypothetical libmymem would do all the hard work of figuring out where to pull numbers from (cgroup files vs. getrlimit(2) vs. systemd, etc.). I am considering starting one. New code could easily benefit from it, but it is unlikely that all existing tools (top, etc.) will just switch to it. For now, we might need to encourage people to stop using those tools inside containers.
I hope my unfortunate realization – figuring out how much memory you can use inside a container is harder than it should be – helps people better understand the problem. Please leave a comment below and let me know what you think.
¹ To be fair, I don’t really know what NewRelic is doing these days; I am just using them as an example. They may be reading memory metrics in a different way (maybe aggregating information from /proc/*/smaps). Pardon my ignorance. ↩