My DockerCon 2014 talk: Thoughts on interoperable containers

Recently, I finally had a chance to read Presentation Zen: Simple Ideas on Presentation Design and Delivery, by Garr Reynolds. I wish I had done it sooner, after so many years of giving less-than-ideal talks and ignoring recommendations from presenters I admire, who give presentations I highly respect.

It is usual to post your slides somewhere public right after you give a presentation. I’m sure that Docker organizers are going to publish my slides somewhere, but here they are if you really want to see them and don’t want to wait. However, if you agree with some of the guidelines from Presentation Zen, slides from a good presentation will not have a lot of useful information. Slides are there just as a visual aid to the story being told during a presentation. They should be highly visual and help illustrate your points. A slide deck is not a document.

I want to try something different this time. This blog post is an attempt at a proper handout for my slides. I hope it is a bit more useful than a bunch of slides with beautiful pictures and short sentences to all of you who could not attend DockerCon and didn’t see my talk. Or for those of you who attended (thank you!) and want to review what I said.

We want to run our apps, unmodified, everywhere. I too have been trying to find ways to make it happen, and I want to share some of the things I have discovered so far.

I run Linux Containers at Heroku. Lots of them. Heroku has been running Linux Containers (dynos) for more than 3 years. Maybe 4 years, I don’t know exactly when the move from chroot jails was made.

Docker has so many different uses! Every other week I discover people using Docker in a different way. For example, some people are now distributing CLI tools as Docker containers. Here is (more or less) how Ruby VMs are being built these days at Heroku (thanks to Terence and ENTRYPOINT):

docker run hone/ruby-builder --version 2.1.0p123

# a new ruby VM is now available in a local directory

Compare that with a ruby-builder binary that would do the same thing: the containerized version includes all the dependencies and works reliably everywhere Docker is available.

We are interested in one particular use of Docker though: running portable, server-side web applications that follow the 12factor.net guidelines.

“Write once, run everywhere” is the dream. In reality, each group (app developers, PaaS providers, Docker developers, LXC developers, Linux Kernel developers, etc.) has different priorities. Different trade-offs and optimizations are being made by each one of them.

Developers want apps: “hey provider, just run my damn container!”.
PaaS providers want scale: they must scale as a business, with sustainable operations, while being fast and secure. Arbitrary Code Execution as a Service.
Docker wants to be the right tool for many different use cases. It is a toolkit that will enable many different things, but we can’t expect Docker developers to solve all the problems. They are too busy already!

I have been navigating this problem space for a while now, playing all the different roles. In the end, besides building a PaaS at Heroku, I am also an app developer and I enjoy hacking on open source container managers (Docker, LXC, the Linux kernel).

Of all the slides with beautiful pictures, this is the only one I can truly say I made myself. 🙂

I have been trying to reconcile all sides and would like to share some things I learned. Ultimately, I want to hear from you: tell me what is bullshit and what wrong assumptions I’m making, and/or if you agree. You can also contribute and help make this happen with ideas, design and code.

The container shipping analogy is the first thing that comes to mind. You build a full container locally then ship it to Platform and Infrastructure providers. A Docker container would run anywhere.

trying to make Docker secure for multi-tenant scenarios is a can of worms

– darren0, at #docker-dev

I agree with darren0. Most of the challenges I have faced so far when trying to make it work as a Platform provider come from the differences between local or small Docker environments (a single app or a few apps on a few servers) and multitenant environments, where boxes are packed with containers from many different apps (tenants).

At Heroku, we run millions of containers (apps). It is not hard to imagine why trade-offs need to be different than with a smaller environment.

One of these challenges is root access. When you build a container locally or in your build servers, you have root access inside it.

Root access is controversial. For some people it is obvious why containers can’t have it in a shared (multitenant) environment, or any production environment for that matter. Others have a hard time accepting the idea it is dangerous and believe that containers should be safe enough to cover all issues. Or they don’t want to care and just want the runtime environment to solve all the problems.

App developers often want root access because:

apt-get install ..., they want to install packages, tools and libraries.
vi /etc/..., they want to edit important configuration files.
mount -t fancy ..., they want to mount filesystems and use what the kernel has to offer.
modprobe something, they want to load modules and use what the kernel has to offer #2.
iptables -A INPUT ..., they want to configure firewall rules, port mappings, …

The big problem is that root access in a container is also root access on the host system, the same host machine that runs many other containers for different apps/tenants. If anyone can – for any reason – escape a container, they would be able to do anything they wanted with other containers in the same box: look at other containers’ data and code, and potentially escalate to other components of the infrastructure.

This used to be much easier. To be honest, things are much better nowadays, and I will admit that there is a lot of FUD. Jérôme Petazzoni gave a nice talk about it earlier this year, go check it out:

LXC, Docker, Security

Still, you always need to be careful and do the right thing when running containers. Thankfully, Docker, LXC and Heroku (the container managers I’m familiar with) are in general doing a good job IMO: dropping capabilities, using AppArmor/SELinux/grsecurity, kernel namespaces, cgroup controllers, etc. But anyone can still do bad things on top of that and leak resources into containers. Just as an example, I’ve seen people mention that they bind mount the Docker socket (/var/run/docker.sock) inside containers, which basically gives those containers full privileges on the host.

But, escaping containers is not the only problem with root access. In the Linux Kernel, there are still many parts that haven’t been containerized. Some places in the kernel will hold global locks, or expose global resources. Some examples are the /sys and /proc filesystems. Docker does a great job preventing known problems, but if you are not careful:

# don't do this on a machine you care about: the host will reboot immediately
docker run --privileged ubuntu sh -c "echo b > /proc/sysrq-trigger"

If you are not careful protecting /sys and /proc (and again, Docker by default does the right thing AFAIK), any container can bring a whole host down, affecting other tenants running on the same box. Even when container managers do everything they can to protect the host, it might not be enough. Root users inside containers can still call syscalls or operations usually only available to root, that will cause the kernel to hold global locks and impact the host. I wouldn’t be surprised if we keep finding more of such parts of the kernel.
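As a rough illustration of the two mistakes mentioned so far (running with --privileged and bind mounting the Docker socket), here is a minimal sketch of a pre-flight check over `docker run` arguments. The function name and warning texts are hypothetical, this is not part of Docker, and the argument scanning is deliberately naive.

```python
DOCKER_SOCKET = "/var/run/docker.sock"


def risky_run_flags(args):
    """Return warnings for a list of `docker run` arguments (hypothetical helper)."""
    warnings = []
    for i, arg in enumerate(args):
        if arg in ("--privileged", "--privileged=true"):
            warnings.append("--privileged removes most of the default protections")

        # Volumes can be given as `-v SPEC`, `--volume SPEC` or `--volume=SPEC`.
        spec = None
        if arg in ("-v", "--volume") and i + 1 < len(args):
            spec = args[i + 1]
        elif arg.startswith("--volume="):
            spec = arg.split("=", 1)[1]

        # The host path is the part before the first colon of the volume spec.
        if spec is not None and spec.split(":")[0] == DOCKER_SOCKET:
            warnings.append("bind mounting the Docker socket gives control of the host daemon")
    return warnings
```

A provider could run something like this before accepting a launch request, although a real check belongs in the container manager itself rather than in argument parsing.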

This is also one of the reasons why many people are choosing to run one container per box, physical or virtual (the other reason being better resource isolation). One container per VM is a nice way to leverage the advantages of both containers and hypervisors.

I am happy that this is getting better and better with time, and most of these concerns are not really a big deal anymore, as long as you use bleeding edge versions of Docker, the Linux Kernel, AppArmor/SELinux, etc. I can see a light at the end of the tunnel that would allow Providers to permit root access inside containers. Some Providers have even started running code as root inside containers already. However, it increases the attack surface considerably, and we can’t blame providers for trying to prevent tenants from (maliciously or even unintentionally) DoS’ing their boxes or getting access to other containers (apps).

User namespaces (a.k.a. unprivileged containers) were recently added to the Linux Kernel to provide safe root access inside containers. They allow regular (unprivileged) user ids on a host machine to be mapped to arbitrary user ids inside a new namespace.

That way, root users (uid=0) inside a container (namespace) can be mapped to regular unprivileged users outside of the container. Root users inside user namespaces are not necessarily root on the whole box. It is a better implementation of fakeroot enforced and controlled by the kernel. Nice!
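To make the mapping concrete, here is a small sketch that translates a uid inside a namespace to the uid seen outside of it. It assumes the three-column format the kernel exposes in /proc/<pid>/uid_map (first id inside, first id outside, length of the range); the function names and the example range are mine.

```python
def parse_uid_map(text):
    """Parse uid_map contents into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            ranges.append((inside, outside, length))
    return ranges


def outside_uid(inside_uid, ranges):
    """Translate a uid inside the namespace to the uid outside; None if unmapped."""
    for inside, outside, length in ranges:
        if inside <= inside_uid < inside + length:
            return outside + (inside_uid - inside)
    return None


# Example mapping (made up): uid 0 inside is uid 100000 outside.
example_ranges = parse_uid_map("0 100000 65536\n")
```

With that example mapping, `outside_uid(0, example_ranges)` is 100000: root inside the container is just an unprivileged user on the box.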

This is very recent though, and will probably take some time to stabilize. It is not hard to find exploits from earlier this year (a few months ago).

Because of the way it works, by design, it can also be made dangerous. By default, root users inside new user namespaces have all the kernel capabilities (superpowers) inside that namespace (container).

Providers relying on user namespaces to provide (fake)root access will still need to drop capabilities, otherwise (fake)root inside containers will still be able to call syscalls and operations on the kernel that only root users can call. With that, they can potentially DoS boxes or escape containers. Another simple example is mknod and mount: if a container is not protected appropriately, even with user namespaces, a (fake)root user could just get raw access to local disks in a box, mount them, and start reading all unencrypted data in the box, including data from other containers. Again, Docker does the right thing AFAIK, as long as containers don’t run as --privileged. But, Docker doesn’t support user namespaces yet. I’m sure Docker developers are thinking about it before blindly adding support.
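One way to see whether capabilities were really dropped is to decode the CapEff bitmask that the kernel reports in /proc/<pid>/status. The sketch below only knows about a handful of capabilities related to the examples in this post; the bit numbers come from linux/capability.h, while the helper names are hypothetical. Keep in mind that the mask is relative to the user namespace the process lives in.

```python
# Bit positions from <linux/capability.h>; only the ones relevant here.
CAPABILITY_BITS = {
    "CAP_NET_ADMIN": 12,   # firewall rules, interface configuration
    "CAP_SYS_MODULE": 16,  # loading kernel modules
    "CAP_SYS_ADMIN": 21,   # mount and many other privileged operations
    "CAP_MKNOD": 27,       # creating device nodes
}


def capabilities_in(mask_hex):
    """Return the names of the known capabilities set in a hex bitmask."""
    mask = int(mask_hex, 16)
    return {name for name, bit in CAPABILITY_BITS.items() if mask & (1 << bit)}


def effective_capabilities(status_text):
    """Decode the CapEff line found in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return capabilities_in(line.split()[1])
    raise ValueError("no CapEff line found")
```

If `effective_capabilities` still reports CAP_MKNOD or CAP_SYS_ADMIN for a process inside a container, the mknod and mount scenario described above deserves a closer look.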

Another attack vector is code like this:

if (getuid() == 0) {
  // do root stuff
}

Who has not done this before? Blame me, I certainly have. This is how auth was intended to be done on Linux systems, before capabilities and user namespaces were added. Nowadays, in order to be safe, that code would need to check capabilities in a specific user namespace.

How much code like this is out there? Who knows… but if you are uid 0 in a namespace you will be able to pass all these validations.

When a root user has all its capabilities dropped in a new user namespace, it is very similar to a regular, non-privileged user: most (if not all) of the things a user would want root access for will not work anyway.

Just don’t run as root?

As an app developer, why should I care? Providers then can just run app code as unprivileged users and end of story!

Unfortunately not: setuid binaries are a problem in this scenario. Binaries with that permission (bit) set are executed as the owner of the file, not with the permissions of the user that runs the binary.

Injecting arbitrary code as setuid binaries owned by root is a very well known old attack to execute code as root on UNIX systems. Container images built locally can contain anything, including setuid binaries owned by root. If PaaS providers accept arbitrary images to run, they must be very careful to remove all setuid binaries owned by root, or disable setuid completely on all filesystems (with the nosuid flag and AppArmor/SELinux, for example).

Providers also need to be extra careful with what gets injected into containers, e.g.: bind mounts with --volume or --volumes-from. A filesystem without nosuid leaking into a container allows malicious users to create and execute arbitrary setuid binaries.
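As an illustration of the first option (finding setuid binaries before running an image), here is a minimal sketch that walks an unpacked image filesystem and lists regular files with the setuid bit set. The function name and the `owner_uid` parameter are my own; what a provider does with the results (strip the bit, reject the image) is a policy decision this sketch does not make.

```python
import os
import stat


def find_setuid_files(rootfs, owner_uid=0):
    """Yield regular files under rootfs that are setuid and owned by owner_uid."""
    for dirpath, _dirnames, filenames in os.walk(rootfs):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # lstat, so symlinks pointing outside the image are not followed
            info = os.lstat(path)
            if (stat.S_ISREG(info.st_mode)
                    and info.st_mode & stat.S_ISUID
                    and info.st_uid == owner_uid):
                yield path
```

Something like `list(find_setuid_files("/path/to/unpacked/image"))` would give the candidates to review, with the path being wherever the image was extracted.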

These binaries are sometimes useful and some apps may require them. setuid is what allows unprivileged users to execute useful things traditionally only available to superusers, like ping (requires raw socket access), tcpdump (requires promiscuous mode), etc. These will probably not be available in many multitenant container Platforms.

This all makes me start thinking that we may need a way to specify constraints on container images that particular runtime environments impose. Different Providers will make different choices on what they accept or not, and it may be useful to capture these requirements, or container capabilities as container metadata.

We would need something like a Restrictions API for containers, that could be validated at runtime by Providers, and/or during build by container managers, or during docker push by registries, …

Here are some other examples of requirements that may need to be imposed on containers by some runtime environments:

Networking: the number of network interfaces available, which ports are reachable/routable from the outside, how many IP addresses are available, public vs. private IPs, firewall rules, etc.
Ephemeral disks: some Providers may not provide persistent disks for containers.
Arch, OSes: which architectures are supported (e.g.: x86_64, arm).
Image size: there probably will be a max amount of disk a container can occupy, or a max size for the container image to be downloaded.
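To give the Restrictions API idea a more concrete shape, here is a purely hypothetical sketch: one record for what a Provider accepts, one for what an image declares, and a function that lists the mismatches. None of these names or fields exist in Docker or in any registry; they only mirror some of the examples above.

```python
from dataclasses import dataclass


@dataclass
class Restrictions:
    """What a (hypothetical) Provider accepts."""
    allow_root: bool = False
    allow_setuid: bool = False
    persistent_disk: bool = False
    architectures: frozenset = frozenset({"x86_64"})
    max_image_bytes: int = 5 * 1024 ** 3


@dataclass
class ImageMetadata:
    """What a (hypothetical) container image declares about itself."""
    architecture: str
    size_bytes: int
    needs_root: bool = False
    has_setuid_binaries: bool = False
    needs_persistent_disk: bool = False


def violations(image, restrictions):
    """Return a list of reasons why the image cannot run under the restrictions."""
    problems = []
    if image.needs_root and not restrictions.allow_root:
        problems.append("root access is not allowed")
    if image.has_setuid_binaries and not restrictions.allow_setuid:
        problems.append("setuid binaries are not allowed")
    if image.needs_persistent_disk and not restrictions.persistent_disk:
        problems.append("only ephemeral disks are available")
    if image.architecture not in restrictions.architectures:
        problems.append("unsupported architecture: " + image.architecture)
    if image.size_bytes > restrictions.max_image_bytes:
        problems.append("image is larger than the maximum size")
    return problems
```

The same check could run at different moments: during a build, during docker push, or right before a container is launched.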

container-rfc is an attempt to define a standard container image format and metadata that would work on multiple container managers.

Container images, in my experience, are 2-5GB in size.

Heroku is very dynamic. Containers are constantly being cycled and are always moving. Thousands of containers are being launched per minute. Having to download a 2-5GB image every time a container is being launched does not sound reasonable.

Docker solves this problem with layers.

Deltas are overlaid on top of a base image. All containers are formed from a hierarchy of read-only layers, and a private writeable layer on top of them. Read-only layers are shared between all containers that use them on a host, and are cached locally.

When many containers use the same base image and share some layers, they are downloaded only once. If providers can restrict the number of base images they support (Restrictions API, again?) known base images can be pre-cached in runtime servers and containers can be launched very quickly, as only the layers (deltas) specific to each app are going to be downloaded.

Even so, Providers will probably need a way to limit the max size of a layer, or a set of layers.
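The effect of a warm cache on launch time can be sketched in a few lines. This is a simplified model with made-up layer names and sizes, not how Docker tracks layers internally.

```python
def layers_to_download(image_layers, cached_layers):
    """Return the layers of an image (base first) missing from the local cache."""
    return [layer for layer in image_layers if layer not in cached_layers]


def within_size_limit(layers, sizes, max_bytes):
    """Check a set of layers against a provider-imposed size limit."""
    return sum(sizes[layer] for layer in layers) <= max_bytes


# Hypothetical example: the base layers are pre-cached on the runtime server.
image = ["base-os", "base-runtime", "app-deps", "app-code"]
cache = {"base-os", "base-runtime"}
sizes = {"base-os": 2_000_000_000, "base-runtime": 500_000_000,
         "app-deps": 80_000_000, "app-code": 5_000_000}

missing = layers_to_download(image, cache)
```

Here `missing` only contains the two app-specific layers (85MB in this made-up example), instead of the whole 2.5GB+ image.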

This works well, until base images (or shared layers) need to be updated. We do that constantly at Heroku, for example with security updates. Every time a base image or shared layer gets updated, all containers using them need to be rebuilt, to pick up the changes.

Heroku wouldn’t be able to quickly respond to security incidents like Heartbleed if we had to rebuild millions of containers every time the base image needs to be updated.

Compare that with what we currently do at Heroku.

It is a more traditional approach. Similar to how packages are traditionally installed on Operating Systems. Apps are just unpacked on top of a read-only base image, inside containers. The read-only base image is shared (bind mounted) between all containers, and downloaded only once on each box.

When containers are being launched, only the app package (a.k.a. slug) is downloaded.

When the base image needs to be updated, a new version is sent to all runtime instances, and new containers are simply launched on top of a new read-only base image. No rebuild operations are required on any apps.
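The difference between the two models when a base image changes can be reduced to a toy comparison. The names below are hypothetical and the data structures are deliberately simplistic.

```python
def images_needing_rebuild(app_parents, updated_base):
    """Layered model: every app image built on the updated base must be rebuilt."""
    return sorted(app for app, parent in app_parents.items() if parent == updated_base)


def launch_container(slug, current_base):
    """Slug model: base image and app package are only combined at launch time."""
    return {"base": current_base, "slug": slug}


# Layered model: two of the three apps have to be rebuilt after a security fix.
parents = {"blog": "cedar-v1", "shop": "cedar-v1", "api": "other-v3"}
to_rebuild = images_needing_rebuild(parents, "cedar-v1")

# Slug model: the same slug simply starts on top of the new base.
container = launch_container("blog.tgz", "cedar-v2")
```

With millions of apps, the length of `to_rebuild` is what makes the layered model painful during security incidents.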

It is also possible to do that with Docker, but it does not follow the traditional model of shipping containers (docker push + docker pull). The community may come up with good ways to solve this problem in the future, see dotcloud/docker#332 for example.

# idea: make a container point to a new base image?
docker save myapp | docker load --rebase=new-base-image

Honestly, the whole idea of supporting an arbitrary image format everywhere reminds me a bit of what happened with VM image formats a while ago. Many people were trying to find a common format for VMs that would work on any hypervisor. They failed. Do you remember VMDK vs VHD vs QCow vs QCow2 vs …?

I can’t say yet if the idea of shipping whole containers everywhere is a rabbit hole. Maybe it is, maybe it isn’t. I’ve been experimenting with it and others should as well.

But let’s take a step back. How about, instead of putting restrictions on whole container images and distributing them everywhere, we turn portable apps into containers? We shift the focus back to the app code to be distributed and make sure that apps are portable and can run anywhere, then we map apps to execution runtimes, like Docker containers.

Runtime environments don’t need to be just containers. Portable apps (12factor.net) can then be mapped to raw VMs, or even bare metal servers.

What we are missing in this context is something standard to prepare these portable apps for different runtime environments.

Buildpacks are a possible way to make this happen. Initially, buildpacks were created to compile (build) 12factor apps into a package that can be executed inside Heroku containers (dynos). But buildpacks are very simple and flexible: just a few executables (bin/detect, bin/compile and bin/release) that are used to build an app for a runtime environment.
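To show how small that interface is, here is a sketch of a driver that runs the three executables in order. The function name is made up, and it assumes bin/detect and bin/release receive the app directory while bin/compile receives the app directory and a cache directory; check the buildpack documentation of your platform for the exact contract.

```python
import os
import subprocess


def build_with_buildpack(buildpack_dir, build_dir, cache_dir):
    """Run bin/detect, bin/compile and bin/release against an app directory.

    Returns (detected_name, release_output), or None when the buildpack
    says it cannot build this app.
    """
    def script(name):
        return os.path.join(buildpack_dir, "bin", name)

    detect = subprocess.run([script("detect"), build_dir],
                            stdout=subprocess.PIPE, universal_newlines=True)
    if detect.returncode != 0:
        return None  # a non-zero exit means this buildpack does not apply

    # compile does the real work: it builds the app in place
    subprocess.run([script("compile"), build_dir, cache_dir], check=True)

    # release prints metadata about how to run the app
    release = subprocess.run([script("release"), build_dir], check=True,
                             stdout=subprocess.PIPE, universal_newlines=True)
    return detect.stdout.strip(), release.stdout
```

Anything able to execute three programs can host a buildpack, which is what makes them a good candidate for mapping apps to different runtime environments.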

Many different PaaS providers adopted buildpacks as their mechanisms to build app code. Each language runtime (Ruby, Python, Node.js, etc.) has a different buildpack.

I propose that we extend the concept of a buildpack: given a base image, it is something that maps (transforms) apps into runtime environments (e.g.: a Docker container) during builds.

One way to implement this idea for Docker, is to make each buildpack a container image that can be used as a parent image by 12factor apps:

$ cat my-portable-app/Dockerfile
FROM heroku/heroku-buildpack-ruby

Buildpack images can then use ONBUILD triggers to do everything required to turn an app into an executable Docker container:

$ cat heroku/heroku-buildpack-ruby/Dockerfile
# Heroku's base image based on Ubuntu 10.04
FROM heroku/cedar

ADD . /buildpack
ONBUILD ADD . /app
ONBUILD RUN /buildpack/bin/compile /app
ONBUILD ENV PORT 5000
ONBUILD EXPOSE 5000

I am experimenting heavily with this and I’ve already learned that ONBUILD has limitations (dotcloud/docker#5714). I hope to share more soon.

I am also sure there are other possible ways to use buildpacks to turn apps into executable docker containers. This is just an example (not even complete). Others are doing similar things:

Buildstep, from Flynn/Dokku.
Google *-runtime images act as buildpack images too, though they don’t use heroku buildpacks – yet :-).
Radial tries to map 12factor guidelines to Docker.

I’m not trying to undervalue containers (and Docker!). They are still an amazing way of running apps. But once we shift the focus from “shipping containers” to “shipping apps” again, we open possibilities to run our apps in more runtime environments. As an example, some of my apps have a Makefile to build them locally using a buildpack, so that I can run them locally, on my dev machine (no VMs, no containers, just plain old process execution):

#!/usr/bin/env make -f

buildpath := .build
buildpackpath := $(buildpath)/pack
buildpackcache := $(buildpath)/cache

build: $(buildpackpath)/bin
	$(buildpackpath)/bin/compile . $(buildpackcache)

$(buildpackcache):
	mkdir -p $(buildpath)
	mkdir -p $(buildpackcache)
	curl -O https://codon-buildpacks.s3.amazonaws.com/buildpacks/kr/go.tgz
	mv go.tgz $(buildpath)

$(buildpackpath)/bin: $(buildpackcache)
	mkdir -p $(buildpackpath)
	tar -C $(buildpackpath) -zxf $(buildpath)/go.tgz

make is not the most straightforward thing, but this should be simple enough. make build will build my code using a buildpack. In this case, the code is written in Go, but it could be any language/framework that has a buildpack. The Makefile just downloads a buildpack and calls bin/compile, the standard buildpack interface to build apps.

This makefile can be seen as a buildpack that maps an app (source code) into a runtime environment (single binary), given a base image (the OS on my laptop).

Here is another example we use at Heroku to run apps on raw machines or VMs (hey, sometimes we need to do it too!):

ruby = "https://codon-buildpacks.s3.amazonaws.com/buildpacks/heroku/ruby.tgz"

app_container "myapp" do
  buildpack ruby
  git_url "git@mycompany.com:myapp.git"
end

define :app_container,
       name: nil,
       buildpack: nil,
       git_url: nil do
  # ...

  execute "#{name} buildpack compile" do
    command "#{dir}/.build/pack/bin/compile #{dir} .build/cache"
  end
end

It is a Chef recipe that builds an app inside a machine using a buildpack, by calling bin/compile. Again!

If an app can be built with a buildpack, then it can potentially run on multiple runtime environments, including Docker containers, Heroku, Cloud Foundry, your own machines via Chef recipes, etc.

My talk was not intended to provide any final answers. In my personal journey, I’ve been bouncing between these two concepts: container centric and app centric, and so far I have found that both have their pros and cons. I’m leaning more towards the app centric model, but I’m biased working as a PaaS provider. Ultimately, I hope we can find something most people are happy with and that everyone can use to run their apps (almost) anywhere.

5 thoughts on “My DockerCon 2014 talk: Thoughts on interoperable containers”

Saul Shanabrook (@SaulShanabrook) says:

June 14, 2014 at 7:55 pm

Maybe this is super naive, but what is the problem with Heroku just pulling source, running docker build -t something . && docker run something?

Gabriel Rubens says:

June 18, 2014 at 7:05 pm

Great idea, I’ll do it with my slides too.
There is a balance between clean slides and a good explanation.

Sergio Freire says:

September 17, 2014 at 3:48 am

Hi Fabio,
I’m trying to understand the rationale of having just ONE container per VM. I understand that containers on top of VMs have increased benefits. My question is why have just ONE container per VM. What would you get compared with the use case where you have just VMs?

- Fabio Kung says:
  
  September 29, 2014 at 5:40 am
  
  Containers are still not great at resource isolation and sharing, especially I/O. They are more vulnerable to “noisy neighbor” types of problems. Hypervisors are currently better at isolating and sharing resources.
  
  The kernel is also shared, so if for any reason you need to have different kernels (different versions, features, configuration, or to be protected against vulnerabilities in the kernel code), you need a different VM per kernel (and possibly a VM per container).
  
  As for comparing with just using VMs, containers are a great way to distribute your application. All dependencies are vendored, and your app is “isolated” from the services running inside the host VM and other containers. Running containers in production, even if each VM only runs one container, is a great way to increase dev/prod parity and also have the same container running in other environments (QA, staging, etc).
  
  IMO, using containers is more of a development/build related choice, whereas the density on VMs (i.e.: how many containers to run per VM) is a runtime/production related choice.
  
  Does this help?
  
Sergio Freire says:

September 29, 2014 at 6:33 am

So if I correctly understood, you would say that putting one (or more) containers inside VMs would be because of:
– increased security due to isolation
– better IO/handling and management, due to more features available by VM solutions (including networking, live migrations, resource sharing, so on)
– similar approach between dev/build (where we would use containers a lot possibly on top of a VM/server) and production (where we would use more containers more wisely)

Is that it?

Fabio Kung

software, programming, computers, technology
