Memory inside Linux containers

Or why don’t free and top work in a Linux container?

Lately at Heroku, we have been trying to find the best way to expose memory usage and limits inside Linux containers. It would be easy to do it in a vendor-specific way: most container-specific metrics are available through the cgroup filesystem via /path/to/cgroup/memory.stat, /path/to/cgroup/memory.usage_in_bytes, /path/to/cgroup/memory.limit_in_bytes and others.

An implementation of Linux containers could easily inject one or more of those files inside containers. Here is a hypothetical example of what Heroku, Docker and others could do:

# create a new dyno (container):
$ heroku run bash

# then, inside the dyno:
(dyno) $ cat /sys/fs/cgroup/memory/memory.stat
cache 15582273536
rss 2308546560
mapped_file 275681280
swap 94928896
pgpgin 30203686979
pgpgout 30199319103
# ...

/sys/fs/cgroup/ is the recommended location for cgroup hierarchies, but it is not a standard. A tool or library that reads from it and wants to be portable across multiple container implementations would need to discover the location first by parsing /proc/self/cgroup and /proc/self/mountinfo. Further, /sys/fs/cgroup is just an umbrella for all cgroup hierarchies; there is no recommendation or standard for where a process finds its own cgroup. Thinking about it, /sys/fs/cgroup/self would not be a bad idea.
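To make the discovery step concrete, here is a minimal sketch in Python of how a portable tool might combine the two files. It assumes the cgroup v1 layout, where the memory controller is mounted as its own hierarchy, and the function names are made up for illustration. It also ignores escaped characters in mount points, so treat it as a starting point rather than a complete parser.

```python
import os


def memory_cgroup_path(cgroup_text):
    """Return the process's path inside the memory hierarchy, or None."""
    for line in cgroup_text.splitlines():
        # Each line looks like "hierarchy-id:controller,list:/path".
        parts = line.split(":", 2)
        if len(parts) == 3 and "memory" in parts[1].split(","):
            return parts[2]
    return None


def memory_mount(mountinfo_text):
    """Return (root, mount_point) of the memory hierarchy mount, or None."""
    for line in mountinfo_text.splitlines():
        # Fields before " - " describe the mount; fields after it are
        # the filesystem type, the mount source and the super options.
        left, sep, right = line.partition(" - ")
        before, after = left.split(), right.split()
        if not sep or len(before) < 5 or len(after) < 3:
            continue
        fs_type, super_options = after[0], after[2]
        if fs_type == "cgroup" and "memory" in super_options.split(","):
            return before[3], before[4]
    return None


def memory_cgroup_dir(cgroup_text, mountinfo_text):
    """Return the directory holding this process's memory.* files, or None."""
    path = memory_cgroup_path(cgroup_text)
    mount = memory_mount(mountinfo_text)
    if path is None or mount is None:
        return None
    root, mount_point = mount
    # A container may only see a subtree of the hierarchy (root != "/").
    # In that case the visible path is relative to that subtree.
    if root != "/" and (path == root or path.startswith(root + "/")):
        path = path[len(root):]
    return os.path.normpath(mount_point + "/" + path)
```

A caller would pass in the contents of /proc/self/cgroup and /proc/self/mountinfo and then read memory.stat or memory.limit_in_bytes from the returned directory. A result of None means the memory hierarchy is not visible from inside the container.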

If we decide to go down that path, I would personally prefer to work with the rest of the Linux containers community first and come up with a standard.

I wish it were that simple.

The problem

Most of the Linux tools providing system resource metrics were created before cgroups even existed (e.g.: free and top, both from procps). They usually read memory metrics from the proc filesystem: /proc/meminfo, /proc/vmstat, /proc/PID/smaps and others.

Unfortunately /proc/meminfo, /proc/vmstat and friends are not containerized, meaning that they are not cgroup-aware. They will always display memory numbers from the host system (physical or virtual machine) as a whole, which is useless for modern Linux containers (Heroku, Docker, etc.). Processes inside a container cannot rely on free, top and others to determine how much memory they have to work with; they are subject to limits imposed by their cgroups and can’t use all the memory available in the host system.

This causes a lot of confusion for users of Linux containers. Why does free say there is 32GB of free memory, when the container only allows 512MB to be used?

With the popularization of Linux container technologies – Heroku, Docker, LXC (version 1.0 was recently released), CoreOS, lmctfy, systemd and friends – more and more people will face the same problem. It is time to start fixing it.

Why is this important?

Visibility into memory usage is very important. It allows people running applications inside containers to optimize their code and troubleshoot problems: memory leaks, swap space usage, etc.

Some time ago, we shipped log-runtime-metrics at Heroku, as an experimental labs feature. It is not a portable solution though, and it does not expose the information inside containers, where monitoring agents could read it. To make things worse, most of the monitoring agents out there (e.g.: NewRelic [1]) rely on information provided by free, /proc/meminfo, etc. That is plain broken inside Linux containers.

On top of that, more and more people have been trying to maximize resource usage inside containers, usually by auto-scaling the number of workers, processes or threads running inside them. This is usually a function of how much memory is available (and/or free) inside the container, and for that to be done programmatically, the information needs to be accessible from inside the container.
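As a rough illustration of that kind of auto-scaling, the sketch below sizes a worker pool from a memory limit. The function name, the per-worker estimate and the reserved amount are all assumptions for the example. The only point is that limit_bytes has to be the container's limit, not the host total reported by free.

```python
def workers_for_memory(limit_bytes, per_worker_bytes, reserved_bytes=0,
                       max_workers=None):
    """Return how many workers fit in the memory the container may use."""
    if per_worker_bytes <= 0:
        raise ValueError("per_worker_bytes must be positive")
    # Keep some memory aside for the master process, caches, etc.
    usable = limit_bytes - reserved_bytes
    # Always run at least one worker, even if the estimate does not fit.
    count = max(1, usable // per_worker_bytes)
    if max_workers is not None:
        count = min(count, max_workers)
    return count
```

With a 512MB limit, 32MB reserved and an estimated 120MB per worker, this yields 4 workers. Fed with the 32GB that free reports for the host, the same formula would start hundreds of workers and exhaust the container's real limit.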

More about /proc

In case you wondered, none of the files provided by the cgroup filesystem (/sys/fs/cgroup/memory/memory.*) can be used as a drop-in replacement (i.e.: bind mounted on top of) for /proc/meminfo, or /proc/vmstat. They have different formats and use slightly different names for each metric. Why memory.stat and friends decided to use a format different from what was already being used at /proc/meminfo is beyond my comprehension.
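To show how far apart the two formats are, here is a small Python sketch that parses memory.stat (plain "name value" pairs, in bytes) and renders a few lines in the /proc/meminfo style ("Name:   value kB"). The mapping between field names is my own guess for illustration. It is not an established equivalence, and a real replacement would need many more fields.

```python
def parse_memory_stat(text):
    """Parse the "name value" lines of a cgroup v1 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not exactly "name <integer>".
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def render_meminfo(stats, limit_bytes, usage_bytes):
    """Render a few /proc/meminfo-style lines from cgroup numbers."""
    fields = [
        # Hypothetical mapping: the limit plays the role of total memory.
        ("MemTotal", limit_bytes),
        ("MemFree", max(0, limit_bytes - usage_bytes)),
        ("Cached", stats.get("cache", 0)),
        ("Mapped", stats.get("mapped_file", 0)),
    ]
    # /proc/meminfo reports kB, memory.stat reports bytes.
    return "".join("%-16s%8d kB\n" % (name + ":", value // 1024)
                   for name, value in fields)
```

Here limit_bytes and usage_bytes would come from memory.limit_in_bytes and memory.usage_in_bytes. Even this toy version has to rename every metric and convert units, which is why a plain bind mount cannot work.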

Some of the contents of a /proc filesystem are properly containerized, like the /proc/PID/* and /proc/net/* namespaces, but not all of them. Unfortunately, /proc in general is considered to be a mess. From the excellent “Creating Linux virtual filesystems” article on LWN:

Linus and numerous other kernel developers dislike the ioctl() system call, seeing it as an uncontrolled way of adding new system calls to the kernel. Putting new files into /proc is also discouraged, since that area is seen as being a bit of a mess. Developers who populate their code with ioctl() implementations or /proc files are often encouraged to create a standalone virtual filesystem instead.

I went ahead and started experimenting with that: procg is an alternative proc filesystem that can be mounted inside Linux containers. It replaces /proc/meminfo with a version that reads cgroup-specific information. My goal was for it to be a drop-in replacement for proc, without requiring any patches to the Linux kernel. Unfortunately, I later found that it was not possible, because none of the functions to read memory statistics from a cgroup (linux/memcontrol.h and mm/memcontrol.c) are public in the kernel. I hope to continue this discussion on LKML soon.

Others have tried similar things by modifying the proc filesystem directly, but that is unlikely to be merged into the mainline kernel if it affects all users of the proc filesystem. It would either need to be a custom filesystem (like procg) or a custom mount option to proc. E.g.:

mount -t proc -o meminfo-from-cgroup none /path/to/container/proc

FUSE

There is also a group of kernel developers advocating that this would be better served by something outside of the kernel, in userspace, making /proc/meminfo be a virtual file that collects information elsewhere and formats it appropriately.

FUSE can be used to implement a filesystem in userspace to do just that. Libvirt went down that path with its libvirt-lxc driver. There were attempts to integrate a FUSE version of /proc/meminfo into LXC too.

Even though there is a very nice implementation of FUSE in pure Go, and I am excited about the idea of contributing a plugin/patch to Docker using it, at Heroku we (myself included) have a lot of resistance against using FUSE in production.

This is mainly due to bad past experiences with FUSE filesystems (sshfs, s3fs) and the increased surface area for attacks. My research so far has revealed that the situation may be much better nowadays, and I would even be willing to give it a try if there were not other problems with using FUSE to replace the proc filesystem.

I am also not comfortable with making my containers dependent on a userspace daemon that serves FUSE requests. What happens when that daemon crashes? All containers in the box are probably left without access to their /proc/meminfo. Either that, or having to run a different daemon per container. Hundreds of containers in a box would require hundreds of such daemons. Ugh.

/proc is not the only issue: sysinfo

Even if we could find a solution to containerize /proc/meminfo with which everyone is happy, it would not be enough.

Linux also provides the sysinfo(2) syscall, which returns information about system resources (e.g. memory). As with /proc/meminfo, it is not containerized: it always returns metrics for the box as a whole.

I was surprised while testing my proc replacement (procg) that it did not work with Busybox. Later, I discovered that Busybox’s implementation of free does not use /proc/meminfo. Guess what? It uses sysinfo(2). What else out there could also be using sysinfo(2) and be broken inside containers?
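For anyone who wants to see what sysinfo(2) reports from inside a container, here is a minimal Python sketch that calls it through ctypes. The structure layout follows the sysinfo(2) man page, and the function name is mine. It is Linux-only and meant purely as a diagnostic.

```python
import ctypes
import os


class SysInfo(ctypes.Structure):
    """Mirror of struct sysinfo as described in sysinfo(2)."""
    _fields_ = [
        ("uptime", ctypes.c_long),
        ("loads", ctypes.c_ulong * 3),
        ("totalram", ctypes.c_ulong),
        ("freeram", ctypes.c_ulong),
        ("sharedram", ctypes.c_ulong),
        ("bufferram", ctypes.c_ulong),
        ("totalswap", ctypes.c_ulong),
        ("freeswap", ctypes.c_ulong),
        ("procs", ctypes.c_ushort),
        ("pad", ctypes.c_ushort),
        ("totalhigh", ctypes.c_ulong),
        ("freehigh", ctypes.c_ulong),
        ("mem_unit", ctypes.c_uint),
        # Padding that brings the structure to its documented size.
        ("_f", ctypes.c_char * (20 - 2 * ctypes.sizeof(ctypes.c_long)
                                - ctypes.sizeof(ctypes.c_uint))),
    ]


def memory_via_sysinfo():
    """Return total and free RAM in bytes as reported by sysinfo(2)."""
    libc = ctypes.CDLL(None, use_errno=True)
    info = SysInfo()
    if libc.sysinfo(ctypes.byref(info)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # Memory fields are expressed in multiples of mem_unit bytes.
    unit = info.mem_unit or 1
    return {"total_bytes": info.totalram * unit,
            "free_bytes": info.freeram * unit}
```

Run inside a container with a 512MB limit, I would expect total_bytes to still show the host's memory, which is the same wrong answer Busybox's free gives.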

ulimit, setrlimit

On top of cgroup limits, Linux processes are also subject to resource limits applied to them individually, via setrlimit(2).

Both cgroup limits and rlimit apply when memory is being allocated by a process.
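A tool that wants a single "how much can I use" number therefore has to look at both. The Python sketch below takes the tighter of the two, using the standard resource module to read RLIMIT_AS. The two limits do not measure the same thing (address space versus memory charged to the cgroup), so taking the minimum is only a conservative approximation, and the function names are hypothetical.

```python
import resource


def address_space_limit():
    """Return the soft RLIMIT_AS in bytes, or None when it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return None if soft == resource.RLIM_INFINITY else soft


def tightest_limit(cgroup_limit_bytes, rlimit_bytes):
    """Return the smaller of the two limits; None means "no limit"."""
    limits = [value for value in (cgroup_limit_bytes, rlimit_bytes)
              if value is not None]
    return min(limits) if limits else None
```

For example, tightest_limit(cgroup_limit, address_space_limit()) returns the cgroup limit when no ulimit is set. How the cgroup side reports "no limit" varies, so the caller is expected to translate that into None before calling.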

systemd

Soon, cgroups are going to be managed by systemd. All operations on cgroups are going to be done through API calls to systemd, over D-Bus (or a shared library).

That makes me think that systemd could also expose a consistent API for processes to query their available memory.

But until then…

Solution?

Some kernel developers (and I am starting to agree with them) believe that the best option is a userspace library that processes can use to query their memory usage and available memory.

libmymem would do all the hard work of figuring out where to pull numbers from (/proc vs. cgroup vs. getrlimit(2) vs. systemd, etc.). I am considering starting one. New code could easily benefit from it, but it is unlikely that all existing tools (free, top, etc.) will just switch to it. For now, we might need to encourage people to stop using those tools inside containers.
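To give an idea of the shape such a library could take, here is a very small Python sketch of the "try each source in order" logic. libmymem does not exist yet, so every name here is hypothetical, and the real thing would need more sources (systemd, cgmanager) and more careful error handling.

```python
def read_int_file(path):
    """Read a file that contains a single integer, e.g. a cgroup limit."""
    with open(path) as handle:
        return int(handle.read().strip())


def meminfo_total(path="/proc/meminfo"):
    """Return MemTotal from a meminfo-style file in bytes, or None."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    return None


def first_available(sources):
    """Return (name, value) from the first source that yields a value.

    sources is a list of (name, callable) pairs, most specific first.
    A source that raises OSError or ValueError, or returns None, is skipped.
    """
    for name, probe in sources:
        try:
            value = probe()
        except (OSError, ValueError):
            continue
        if value is not None:
            return name, value
    return None, None
```

A caller might list a cgroup file first, for example ("cgroup", lambda: read_int_file(cgroup_dir + "/memory.limit_in_bytes")) where cgroup_dir is wherever the container exposes its cgroup, and finish with ("host", meminfo_total) as the last resort. Returning the source name along with the value lets callers know when they got the host-wide fallback.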

I hope my unfortunate realization – figuring out how much memory you can use inside a container is harder than it should be – helps people better understand the problem. Please leave a comment below and let me know what you think.


  1. To be fair, I don’t really know what NewRelic is doing these days; I am just using them as an example. They may be reading memory metrics in a different way (maybe aggregating information from /proc/*/smaps). Pardon my ignorance.

16 thoughts on “Memory inside Linux containers”

  1. It’s indeed a complex problem and one that many of our users have been asking how to solve.

    Daniel Lezcano (the original LXC maintainer before he handed over the project to Serge Hallyn and me) wrote a small FUSE filesystem years ago, doing something very similar to what you describe. As you said, though, that doesn’t work for everything, and it has the disadvantage of adding extra userspace/kernelspace round trips, which may be really problematic for tools (think top/htop) that do a lot of accesses.

    There’s also the problem that there’s no guarantee that your memory/cpu/whatever limits apply to the whole container, as cgroups apply to PIDs. You may have 20 processes running in your container that have access to just 1 CPU and 512MB of RAM and another 20 that have 4 CPUs and 2GB of RAM, so anything mounting over /proc will have to do a per-process lookup (which sort of excludes any caching and just creates additional context switches…).

    There are other ways that people have tried, such as adding those files into the different cgroup controllers, then having those bind-mounted over the real thing, but that comes back to the same problem (all processes must be in the same cgroup) and still doesn’t solve sysinfo.

    The only transparent fix would be to have the kernel do it transparently for you, possibly under a namespaced sysctl so that this can be turned on for containers and not for the rest of the system. However I very much doubt this would be approved given how much pushback there has been on similar tricks so far.

    Coming up with clever userspace libraries is definitely a good idea, if only to save everyone some pain. Unfortunately, this means a lot of software will still report the old thing, because the authors don’t want that extra dependency or because the software is just plain outdated. This includes quite a bunch of proprietary software which currently behaves very badly when provided with the wrong memory information…

    So anyway, great article, tough topic, I really hope we’ll eventually come up with a solution everyone is fine with…

    • Thanks Stéphane!

      Good point about multiple cgroups in a container. It isn’t a common use case though, at least in my experience, and it may be better served by nested containers. It’s definitely something to be considered anyway.

      Even if I or someone else goes ahead and starts implementing a userspace library, it would still need to read cgroup data from somewhere. We may need a new syscall, or a standard location for cgroup related files (/sys/fs/cgroup/self?) inside containers. Thoughts?

      • The library probably ought to be aware of multiple methods:
        1) Direct fs lookup using /proc/mounts to find the cgroup hierarchies and /proc/self/cgroup to know what cgroup to look into.
        2) If cgroupfs isn’t mounted, try to contact cgmanager, the systemd API or any other cgroup manager that may exist (I’m only aware of those two).
        3) Give up and return the system value.

      • Forgot to mention, this won’t work with most LXC containers since for security reasons, we usually never mount cgroupfs or give any way to access it for privileged LXC containers (as if we did, the container would then be able to change its own limits).

        CGManager knows how to deal with that and if the cgmanager socket is available inside the container, it will let you do read-only lookups of your own limits (and reject any changes to your own limits even if you are uid 0).

        Unprivileged containers aren’t allowed to mount cgroupfs at the moment, so the use of a cgroup manager is required in those cases.

      • Instead of mounting cgroupfs inside the container, I was thinking of bind mounting specific files read-only inside it (memory.stat, memory.usage_in_bytes, etc.).

        A cgmanager (or systemd) D-Bus socket would also work, however we’d need to agree on a standard location inside the container (probably re-use /sys/fs/cgroup/cgmanager?).

        BTW, I wasn’t aware of cgmanager. Serge’s proposal seems to be exactly what we need.

      • When using cgmanager inside of LXC, the socket is always available at /sys/fs/cgroup/cgmanager/sock

        Did you try “lxc.mount.auto = cgroup:mixed” in your config? That one should do the cgroup bind-mounts for you, in theory blocking all dangerous bits and on systems using cgmanager, it’ll only bind-mount the cgmanager socket instead. That’s what I’m usually using on my systems when doing nested containers where I don’t want the intermediary containers to have unreasonable rights.

      • Did you try “lxc.mount.auto = cgroup:mixed” in your config?

        Not yet. We are not using many LXC features, and we do most of the cgroup/namespaces management ourselves. Thanks for the advice anyway, I’ll look into it.

  2. Thanks for the article — definitely found it interesting. Looking forward to seeing what comes out of this problem. Good luck!

  3. Re: the implied question about New Relic: the New Relic Ruby agent runs entirely in-process, and only reports memory usage stats about the host processes in which it runs. There’s also a standalone server agent that runs as a separate process and reports system-wide stats, but it’s not used on Heroku.

    The Ruby agent reads the VmRSS field from /proc/<pid>/status in order to determine the memory usage of a given process on Linux (whether containerized or not). We’d like to be able to also report dyno-level stats on Heroku, but I haven’t been able to figure out exactly how to derive numbers that match what log-runtime-metrics reports. The docs for log-runtime-metrics aren’t very specific about exactly how things are calculated (for example: “Resident Memory (memory_rss): The portion of the dyno’s memory (megabytes) held in RAM.”) and I’ve not been able to derive numbers that exactly line up from either /proc/<pid>/smaps or /proc/<pid>/status, so I assume they come from elsewhere (presumably somewhere that’s only accessible outside the container).

    A userspace library that abstracted all of this stuff would be awesome (though we’d need a Ruby wrapper for it in order to access it from the Ruby agent).

    In the meantime, if you have any suggestions for how we might approximate log-runtime-metrics’s memory_rss or memory_total fields using stats available within the container, we’d be all ears!

    Thanks for your work on this, and keep fighting the good fight!

    • Thanks for the info Ben!

      The Ruby agent reads the VmRSS field from /proc/<pid>/status

      This is a fine approximation (see below), as long as you do that for all processes inside the container (all /proc/PID/* entries), and also include memory mapped files and cache/buffers. Here is an important excerpt from the proc documentation:

      “For making accounting scalable, RSS related information are handled in asynchronous manner and the value may not be very precise. To see a precise snapshot of a moment, you can see /proc/PID/smaps file and scan page table.

      It’s slow but very precise.”

      The docs for log-runtime-metrics aren’t very specific about exactly how things are calculated

      log-runtime-metrics is just reporting metrics provided by the cgroup memory controller itself (<cgroupfs>/memory.stat). This is unfortunately not available inside Heroku dynos yet (which is the whole point of this blog post).

      In the meantime, if you have any suggestions for how we might approximate log-runtime-metrics’s memory_rss or memory_total fields using stats available within the container, we’d be all ears!

      For now, the best thing to do is to aggregate what’s in /proc/PID/smaps (also documented at the link above) for all processes inside the container, and consider cache/buffers. I haven’t done it myself yet, but I imagine that care must be taken with shared pages, so that they are not accounted multiple times. Capturing information for all processes is important because the in-process agent won’t be inside child processes (fork + exec), and will ignore the (minimal) overhead of our init-like (PID=1) process inside dynos.

  4. The analogous problem “CPU inside Linux containers” occurs with the cpu controller. If you restrict the CPU resources of a container by limiting the number of cores, the common userland tools will fail to detect this. Therefore, all the daemons that derive their default number of threads from the number of available cores will also make the wrong assumptions, and one has to take care of it by hand.

  5. Fabio – I’d really like it if you could check out this memory analysis tool:
    https://github.com/etep/zest

    This does not address your needs, per se, but it is fundamental research for future computer memory systems. I have been lucky enough to connect with several other folks running large scale applications (e.g. engineers at SAP, LinkedIn, Facebook, Yelp, …) and essentially want to look at as many workloads as possible. Happy to discuss more, shoot me an email if you want.

  6. Hi Fabio -
    I’d really appreciate it if you could take a look at the following memory analysis tool:
    https://github.com/etep/zest

    This doesn’t address your needs directly, but rather, it is fundamental research into future memory system design. I have been lucky enough to connect with engineers running other large scale applications, namely some folks at SAP, LinkedIn, Yelp, Facebook, and other companies, and get some datasets over real apps using real data. The results are pretty amazing, but as with any such project, the more data the better. So please take a look. I would be happy to just communicate regarding the work, the research, etc. Thanks for the awesome blog post.

    Regards,
    Pete Stevenson
