My DockerCon 2014 talk: Thoughts on interoperable containers

Recently, I finally had a chance to read Presentation Zen: Simple Ideas on Presentation Design and Delivery, by Garr Reynolds. I wish I had read it sooner, after so many years of giving less-than-ideal talks and ignoring recommendations from presenters I admire, whose presentations I highly respect.

It is common to post your slides somewhere public right after giving a presentation. I’m sure the DockerCon organizers are going to publish my slides somewhere, but here they are if you really want to see them and don’t want to wait. However, if you agree with some of the guidelines from Presentation Zen, slides from a good presentation will not carry a lot of useful information on their own. Slides are there just as a visual aid to the story being told during a presentation. They should be highly visual and help illustrate your points. A slide deck is not a document.

I want to try something different this time. This blog post is an attempt at a proper handout for my slides. I hope it is a bit more useful than a bunch of slides with beautiful pictures and short sentences, for those of you who could not attend DockerCon and didn’t see my talk. Or for those of you who attended (thank you!) and want to review what I said.

interoperable containers

We want to run our apps, unmodified, everywhere. I too have been trying to find ways to make it happen, and I want to share some of the things I have discovered so far.


I run Linux Containers at Heroku. Lots of them. Heroku has been running Linux Containers (dynos) for more than 3 years. Maybe 4 years, I don’t know exactly when the move from chroot jails was made.

Docker has so many different uses! Every other week I discover people using Docker in a different way. For example, some people are now distributing CLI tools as Docker containers. Here is (more or less) how Ruby VMs are being built these days at Heroku (thanks to Terence and ENTRYPOINT):

docker run hone/ruby-builder --version 2.1.0p123

# a new ruby VM is now available in a local directory

Compare that with a ruby-builder binary that would do the same thing: the containerized version includes all the dependencies and works reliably everywhere Docker is available.
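I don’t know how hone/ruby-builder is actually put together, but the general pattern is tiny: an image whose ENTRYPOINT is the tool itself, so arguments passed to docker run become arguments to the tool. A hypothetical sketch (image contents and script name are made up):

```dockerfile
# hypothetical sketch, not the real hone/ruby-builder
FROM ubuntu
# build-ruby.sh is a made-up script that compiles a Ruby VM
ADD build-ruby.sh /usr/local/bin/build-ruby
# ENTRYPOINT makes the container behave like a CLI binary:
# `docker run image --version 2.1.0p123` runs `build-ruby --version 2.1.0p123`
ENTRYPOINT ["/usr/local/bin/build-ruby"]
```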

We are interested in one particular use of Docker, though: running portable, server-side web applications following the guidelines.

“Write once, run everywhere” is the dream. In reality, each group (app developers, PaaS providers, Docker developers, LXC developers, Linux Kernel developers, etc.) has different priorities. Different trade-offs and optimizations are being made by each one of them.

  • Developers want apps: “hey provider, just run my damn container!”.
  • PaaS providers want scale: they must scale as a business, with sustainable operations, while being fast and secure. Arbitrary Code Execution as a Service.
  • Docker wants to be the right tool for many different use cases. It is a toolkit that will enable many different things, but we can’t expect Docker developers to solve all the problems. They are too busy already!

Docker GitHub issues

I have been navigating this problem space for a while now, playing all the different roles. In the end, besides building a PaaS at Heroku, I am also an app developer, and I enjoy hacking on open source container managers (Docker, LXC, the Linux kernel).

different groups and me

Of all the slides with beautiful pictures, this is the only one I can truly say I made myself. 🙂

I have been trying to reconcile all sides and would like to share some things I learned. Ultimately, I want to hear from you: tell me what is bullshit and which assumptions I’m getting wrong, or whether you agree. You can also contribute and help make this happen with ideas, design and code.


The container shipping analogy is the first thing that comes to mind: build a full container locally, then ship it to Platform and Infrastructure providers. A Docker container would run anywhere.

trying to make Docker secure for multi-tenant scenarios is a can of worms

– darren0, at #docker-dev

I agree with darren0. Most of the challenges I have faced so far as a Platform provider come down to the differences between local or small Docker environments (a single app, or a few apps on a few servers) and multitenant environments where boxes are packed with containers from many different apps (tenants).

1 vs 1M

At Heroku, we run millions of containers (apps). It is not hard to imagine why trade-offs need to be different than with a smaller environment.

One of these challenges is root access. When you build a container locally or in your build servers, you have root access inside it.


Root access is controversial. For some people it is obvious why containers can’t have it in a shared (multitenant) environment, or any production environment for that matter. Others have a hard time accepting the idea it is dangerous and believe that containers should be safe enough to cover all issues. Or they don’t want to care and just want the runtime environment to solve all the problems.

App developers often want root access because:

  • apt-get install ..., they want to install packages, tools and libraries.
  • vi /etc/..., they want to edit important configuration files.
  • mount -t fancy ..., they want to mount filesystems and use what the kernel has to offer.
  • modprobe something, they want to load modules and use what the kernel has to offer #2.
  • iptables -A INPUT ..., they want to configure firewall rules, port mappings, …

kernelspace abuse

The big problem is that root access in a container is also root access on the host system – the same host machine that runs many other containers for different apps/tenants. If anyone manages – for any reason – to escape a container, they can do anything they want with the other containers on the same box: look at other containers’ data and code, and potentially escalate to other components of the infrastructure.

This used to be much easier. To be honest, things are much better nowadays, and I will admit that there is a lot of FUD. Jérôme Petazzoni gave a nice talk about it earlier this year, go check it out:

LXC, Docker, Security

Still, you always need to be careful and do the right thing when running containers. Thankfully, Docker, LXC and Heroku (the container managers I’m familiar with) are in general doing a good job IMO: dropping capabilities, using AppArmor/SELinux/grsecurity, kernel namespaces, cgroup controllers, etc. But anyone can still do bad things on top of that and leak resources into containers. Just as an example, I’ve seen people mention that they are bind mounting the Docker socket (/var/run/docker.sock) inside containers, which basically gives those containers full privileges on the host.

But, escaping containers is not the only problem with root access. In the Linux Kernel, there are still many parts that haven’t been containerized. Some places in the kernel will hold global locks, or expose global resources. Some examples are the /sys and /proc filesystems. Docker does a great job preventing known problems, but if you are not careful:

# don't do this in a machine you care about, the host will halt
docker run --privileged ubuntu sh -c "echo b > /proc/sysrq-trigger"

If you are not careful protecting /sys and /proc (and again, Docker by default does the right thing AFAIK), any container can bring a whole host down, affecting other tenants running on the same box. Even when container managers do everything they can to protect the host, it might not be enough. Root users inside containers can still call syscalls or operations usually only available to root, that will cause the kernel to hold global locks and impact the host. I wouldn’t be surprised if we keep finding more of such parts of the kernel.

This is also one of the reasons many people are choosing to run one container per box, physical or virtual (the other reason being better resource isolation). One container per VM is a nice way to leverage the advantages of both containers and hypervisors.

I am happy that this is getting better and better with time, and most of these concerns are not really a big deal anymore, as long as you use bleeding edge versions of Docker, the Linux Kernel, AppArmor/SELinux, etc. I can see a light at the end of the tunnel that would allow Providers to permit root access inside containers, and some Providers have even already started running code as root inside containers. However, it increases the attack surface considerably, and we can’t blame providers for trying to prevent malicious (or even careless) tenants from DoS’ing their boxes or getting access to other containers (apps).

User Namespaces

User namespaces (a.k.a. unprivileged containers) were recently added to the Linux Kernel to provide safe root access inside containers. They allow regular (unprivileged) user ids on a host machine to be mapped to arbitrary user ids inside a new namespace.

That way, root users (uid=0) inside a container (namespace) can be mapped to regular unprivileged users outside of the container. Root users inside user namespaces are not necessarily root on the whole box. It is a better implementation of fakeroot enforced and controlled by the kernel. Nice!
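On a Linux box, you can see the mapping the kernel applies to a process in /proc/<pid>/uid_map. A quick illustration (my own, not from the talk):

```shell
# Print this process's uid mapping. Each line reads:
#   <id inside ns> <id outside ns> <count>
# Outside any new user namespace it is the identity mapping,
# e.g. "0 0 4294967295". A container with remapped ids could show
# "0 100000 65536": uid 0 inside is the unprivileged uid 100000 outside.
cat /proc/self/uid_map
```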

Kernel capabilities in user namespaces

This is very recent though, and will probably take some time to stabilize. It is not hard to find exploits from earlier this year (a few months ago).

Because of the way it works, by design, it can also be made dangerous. By default, root users inside new user namespaces have all the kernel capabilities (superpowers) inside that namespace (container).

Providers relying on user namespaces to provide (fake)root access will still need to drop capabilities, otherwise (fake)root inside containers will still be able to call syscalls and operations on the kernel that only root users can call. With that, they can potentially DoS boxes or escape containers. Another simple example is mknod and mount: if a container is not protected appropriately, even with user namespaces, a (fake)root user could just get raw access to local disks in a box, mount them, and start reading all unencrypted data in the box, including data from other containers. Again, Docker does the right thing AFAIK, as long as containers don’t run as --privileged. But, Docker doesn’t support user namespaces yet. I’m sure Docker developers are thinking about it before blindly adding support.

Another attack vector is code like this:

if (getuid() == 0) {
  // do root stuff
}

Who has not written code like this before? Blame me, I certainly have. This is how privilege checks were traditionally done on Linux systems, before capabilities and user namespaces were added. Nowadays, to be safe, that code would need to check capabilities in a specific user namespace.

What code like this is out there? Who knows… but if you are uid 0 in a namespace you will be able to pass all these validations.

When a root user has all its capabilities dropped in a new user namespace, it might as well be a regular, non-privileged user: most (if not all) of the things developers want root access for will not work anyway.

Just don’t run as root?

As an app developer, why should I care? Providers then can just run app code as unprivileged users and end of story!

Unfortunately not: setuid binaries are a problem in this scenario. Binaries with that permission bit set are executed with the permissions of the file’s owner, not those of the user running them.

Injecting arbitrary code via setuid binaries owned by root is a well-known, old attack for executing code as root on UNIX systems, and container images built locally can contain anything. If PaaS providers accept arbitrary images to run, they must be very careful to remove all setuid binaries owned by root, or to disable setuid completely on all filesystems (with the nosuid mount flag and AppArmor/SELinux, for example).

Providers also need to be extra careful with what gets injected into containers, e.g.: bind mounts with --volume or --volumes-from. A filesystem without nosuid leaking into a container allows malicious users to create and execute arbitrary setuid binaries.

These binaries are sometimes useful and some apps may require them. setuid is what allows unprivileged users to execute useful things only traditionally available to superusers, like ping (requires raw socket access), tcpdump (requires promiscuous mode), etc. These will probably not be available in many multitenant container Platforms.
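As an illustration of what the provider-side scan mentioned above could look like (file names here are made up for the demo; a real scan would walk the untrusted image’s root and also filter with -user root):

```shell
# create a fake image tree containing a setuid binary
tree=$(mktemp -d)
touch "$tree/fake-ping"
chmod 4755 "$tree/fake-ping"        # 4xxx = the setuid bit
# provider-side scan: list setuid executables before accepting the image
find "$tree" -perm -4000 -type f
```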


This all makes me think that we may need a way to specify the constraints that particular runtime environments impose on container images. Different Providers will make different choices about what they accept, and it may be useful to capture these requirements, or container capabilities, as container metadata.

We would need something like a Restrictions API for containers that could be validated at runtime by Providers, during builds by container managers, or during docker push by registries, …

Here are some other examples of requirements that may need to be imposed on containers by some runtime environments:

  • Networking: the number of network interfaces available, which ports are reachable/routable from the outside, how many IP addresses are available, public vs. private IPs, firewall rules, etc.
  • Ephemeral disks: some Providers may not provide persistent disks for containers.
  • Arch, OSes: which architectures are supported (e.g.: x86_64, arm).
  • Image size: there probably will be a max amount of disk a container can occupy, or a max size for the container image to be downloaded.
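To make the idea concrete, a restrictions descriptor shipped as container metadata could look something like this. The format is entirely made up, just a sketch of the kind of constraints a Provider might declare:

```json
{
  "restrictions": {
    "arch": ["x86_64"],
    "max_image_size": "1GB",
    "persistent_disks": false,
    "network": {
      "interfaces": 1,
      "routable_ports": [80, 443]
    },
    "setuid_binaries": false,
    "root_access": false
  }
}
```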

container-rfc is an attempt to define a standard container image format and metadata that would work on multiple container managers.

image size

Container images, in my experience, are 2-5GB in size.

Heroku is very dynamic. Containers are constantly being cycled and are always moving. Thousands of containers are being launched per minute. Having to download a 2-5GB image every time a container is being launched does not sound reasonable.

Docker solves this problem with layers.


Deltas are overlaid on top of a base image. All containers are formed from a hierarchy of read-only layers, and a private writeable layer on top of them. Read-only layers are shared between all containers that use them on a host, and are cached locally.
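The lookup rule is easy to model. Here is a toy sketch (my own illustration, nothing like Docker’s actual storage drivers) of layered reads with copy-on-write writes, using plain directories as layers:

```shell
# toy model: each layer is a directory, top-most layer wins
base=$(mktemp -d); delta=$(mktemp -d); top=$(mktemp -d)
echo ubuntu > "$base/os-release"       # shared read-only base layer
echo "app code" > "$delta/web.rb"      # app-specific read-only delta layer

lookup() {
  # search the private writable layer first, then deltas, then base
  for layer in "$top" "$delta" "$base"; do
    [ -f "$layer/$1" ] && { cat "$layer/$1"; return; }
  done
  return 1
}

echo patched > "$top/web.rb"           # copy-on-write into the private layer
lookup web.rb        # the private copy shadows the shared delta layer
lookup os-release    # still served from the shared base layer
```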

When many containers use the same base image and share some layers, those layers are downloaded only once. If providers can restrict the number of base images they support (Restrictions API, again?), known base images can be pre-cached on runtime servers, and containers can be launched very quickly, as only the layers (deltas) specific to each app need to be downloaded.

Even so, Providers will probably need a way to limit the max size of a layer, or a set of layers.


This works well, until base images (or shared layers) need to be updated. We do that constantly at Heroku, for example with security updates. Every time a base image or shared layer gets updated, all containers using them need to be rebuilt, to pick up the changes.

Heroku wouldn’t be able to quickly respond to security incidents like Heartbleed if we had to rebuild millions of containers every time the base image needs to be updated.

Compare that with what we currently do at Heroku.


It is a more traditional approach. Similar to how packages are traditionally installed on Operating Systems. Apps are just unpacked on top of a read-only base image, inside containers. The read-only base image is shared (bind mounted) between all containers, and downloaded only once on each box.

When containers are being launched, only the app package (a.k.a. slug) is downloaded.

When the base image needs to be updated, a new version is sent to all runtime instances, and new containers are simply launched on top of a new read-only base image. No rebuild operations are required on any apps.

It is also possible to do that with Docker, but it does not follow the traditional model of shipping containers (docker push + docker pull). The community may come up with good ways to solve this problem in the future, see dotcloud/docker#332 for example.

# idea: make a container point to a new base image?
docker save myapp | docker load --rebase=new-base-image


Honestly, the whole idea of supporting an arbitrary image format everywhere reminds me a bit of what happened with VM image formats a while ago. Many people were trying to find a common format for VMs that would work on any hypervisor. They failed. Do you remember VMDK vs VHD vs QCow vs QCow2 vs …?

I can’t say yet if the idea of shipping whole containers everywhere is a rabbit hole. Maybe it is, maybe it isn’t. I’ve been experimenting with it and others should do as well.

But let’s take a step back. How about, instead of putting restrictions on whole container images and distributing them everywhere, we turn portable apps into containers? We shift the focus back to the app code being distributed, make sure apps are portable and can run anywhere, and then map apps to execution runtimes, like Docker containers.

Runtime environments don’t need to be just containers. Portable apps can then be mapped to raw VMs, or even bare metal servers.

What we are missing in this context is something standard to prepare these portable apps for different runtime environments.


Buildpacks are a possible way to make this happen. Initially, buildpacks were created to compile (build) 12factor apps into a package that can be executed inside Heroku containers (dynos). But buildpacks are very simple and flexible: just a few executables (bin/detect, bin/compile and bin/release) that are used to build an app for a runtime environment.
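The whole interface fits in a few lines of shell. Here is a minimal fake buildpack (my own sketch of the contract, not Heroku’s code) with just the detect and compile steps:

```shell
# lay out a tiny buildpack: bin/detect and bin/compile
pack=$(mktemp -d)
mkdir -p "$pack/bin"

cat > "$pack/bin/detect" <<'EOF'
#!/bin/sh
# detect: exit 0 (and print a name) only if this buildpack applies
[ -f "$1/Gemfile" ] && echo "Ruby"
EOF

cat > "$pack/bin/compile" <<'EOF'
#!/bin/sh
# compile: build the app in $1, using $2 as a cache between builds
echo "compiled $1"
EOF
chmod +x "$pack/bin/detect" "$pack/bin/compile"

# run it against a fake Ruby app
app=$(mktemp -d)
touch "$app/Gemfile"
"$pack/bin/detect" "$app"                 # prints: Ruby
"$pack/bin/compile" "$app" "$app/.cache"
```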

Many different PaaS providers adopted buildpacks as their mechanisms to build app code. Each language runtime (Ruby, Python, Node.js, etc.) has a different buildpack.

I propose that we extend the concept of a buildpack: given a base image, it is something that maps (transforms) apps into runtime environments (e.g.: a Docker container) during builds.

One way to implement this idea for Docker is to make each buildpack a container image that can be used as a parent image by 12factor apps:

$ cat my-portable-app/Dockerfile
FROM heroku/heroku-buildpack-ruby

Buildpack images can then use ONBUILD triggers to do everything required to turn an app into an executable Docker container:

$ cat heroku/heroku-buildpack-ruby/Dockerfile
# Heroku's base image based on Ubuntu 10.04
FROM heroku/cedar

ADD . /buildpack
ONBUILD RUN /buildpack/bin/compile /app

I have been experimenting heavily with this and I’ve already learned that ONBUILD has limitations (dotcloud/docker#5714). I hope to share more soon.

I am also sure there are other possible ways to use buildpacks to turn apps into executable Docker containers. This is just one (incomplete) example, and others are doing similar things.

I’m not trying to undervalue containers (and Docker!). They are still an amazing way of running apps. But once we shift the focus from “shipping containers” to “shipping apps” again, we open possibilities to run our apps in more runtime environments. As an example, some of my apps have a Makefile to build them locally using a buildpack, so that I can run them locally, on my dev machine (no VMs, no containers, just plain old process execution):

#!/usr/bin/env make -f

buildpath := .build
buildpackpath := $(buildpath)/pack
buildpackcache := $(buildpath)/cache

build: $(buildpackpath)/bin
    $(buildpackpath)/bin/compile . $(buildpackcache)

$(buildpackcache):
    mkdir -p $(buildpath)
    mkdir -p $(buildpackcache)
    curl -O
    mv go.tgz $(buildpath)

$(buildpackpath)/bin: $(buildpackcache)
    mkdir -p $(buildpackpath)
    tar -C $(buildpackpath) -zxf $(buildpath)/go.tgz

make is not the most straightforward tool, but this should be simple enough: make build will build my code using a buildpack. In this case, the code is written in Go, but it could be in any language/framework that has a buildpack. The Makefile just downloads a buildpack and calls bin/compile, the standard buildpack interface to build apps.

This makefile can be seen as a buildpack that maps an app (source code) into a runtime environment (single binary), given a base image (the OS on my laptop).

Here is another example we use at Heroku to run apps in raw machines or VMs (hey, sometimes we need to do it too!):

ruby = ""

app_container "myapp" do
  buildpack ruby
  git_url ""
end

define :app_container,
       name: nil,
       buildpack: nil,
       git_url: nil do
  # ...

  execute "#{name} buildpack compile" do
    command "#{dir}/.build/pack/bin/compile #{dir} .build/cache"
  end
end

It is a Chef recipe that builds an app inside a machine using a buildpack, by calling bin/compile. Again!

If an app can be built with a buildpack, then it can potentially run on multiple runtime environments, including Docker containers, Heroku, Cloud Foundry, your own machines via Chef recipes, etc.


My talk was not intended to provide any final answers. In my personal journey, I’ve been bouncing between these two concepts: container centric and app centric, and so far I have found that both have their pros and cons. I’m leaning more towards the app centric model, but I’m biased working as a PaaS provider. Ultimately, I hope we can find something most people are happy with and that everyone can use to run their apps (almost) anywhere.

Thank you!

Running Java Web Applications on Heroku Cedar Stack

Update: Heroku now does support Java.

Heroku does not officially support Java applications yet (yes, it does). However, the most recently launched stack comes with support for Clojure. Well, if Heroku can run Clojure code, it is certainly running a JVM. Then why can’t we deploy regular Java code on top of it?

I was playing with it today and found a way to do that. I admit it is just a hack so far, but it works fine. The best (?) part is that it should allow any maven-based Java Web Application to be deployed fairly easily. So, if your project can be built with maven, i.e. mvn package generates a WAR package for you, then you are ready to run on the fantastic Heroku Platform as a Service.

Wait. There is more. In these polyglot times, we all know that being able to run Java code means that we are able to run basically anything on top of the JVM. I’ve got a simple Scala (Lift) Web Application running on Heroku, for example.

There are also examples of simple spring-roo and VRaptor web applications. I had a lot of fun coding on that stuff, which finally gave me an opportunity to put my hands on Clojure. There are even two new Leiningen plugins: lein-mvn and lein-herokujetty. 🙂

VRaptor on Heroku

Let me show you what I did with an example. Here is a step-by-step to deploy a VRaptor application on Heroku Celadon Cedar:

  1. Go to Heroku and create an account if you do not have one yet.
  2. We are going to need some tools. First Heroku gems:
    $ gem install heroku
    $ gem install foreman
  3. Then Leiningen, the Clojure project automation tool. On my Mac, I installed it with brew:
    $ brew update
    $ brew install leiningen
  4. The vraptor-scaffold gem, to help us bootstrap a new project:
    $ gem install vraptor

That is it for the preparation. We may now start dealing with the project.

  1. First, we need to create the project skeleton:
    $ vraptor new <project-name> -b mvn
    $ lein new <project-name>
    $ cd <project-name>

    The lein command is not strictly necessary, but it helps with .gitignore and other minor stuff.

  2. Now, the secret sauce. This is the ugly code that tries to be smart and tricks Heroku into doing additional build/compilation steps:
    $ wget -P src

    Or if you do not have wget installed:

    $ curl -L > src/heroku_autobuild.clj
  3. You also need to tweak the Leiningen project definition – project.clj. The template is here. Please remember to adjust your project name. It must be the same that you are using in the pom.xml. Or you may download it directly if you prefer:
    $ curl -L > project.clj
  4. Unfortunately, Leiningen comes bundled with an old version of the Maven/Ant integration: the embedded maven is at version 2.0.8. The versions of some plugins configured in the default pom.xml from vraptor-scaffold are incompatible with this old version of maven.

    The best way to solve it for now is to remove all <version> tags from items inside the <plugins> section of your pom.xml. This is specific to VRaptor and other frameworks may need different modifications. See below for spring-roo and Lift instructions.

    If even after removing versions from all plugins, you still get ClassNotFoundException: org.apache.maven.toolchain.ToolchainManager errors, try cleaning your local maven repository (the default location is $HOME/.m2/repository). The pom.xml that I used is here.

  5. Now, try the same command that Heroku uses during its slug compilation phase. It should download all dependencies and package your application in a WAR file inside the target directory.
    $ lein deps

    Confirm that the WAR was built into target/. It must have the same name as defined in your project.clj.

  6. Create your Heroku/Foreman Procfile containing the web process definition:
    web: lein herokujetty

    And test with:

    $ foreman start

    A Jetty server should start and listen on a random port defined by foreman. Heroku does the same.

  7. Time to push your application to Heroku. Start by editing your .gitignore: add target/ and tmp/, and remove the pom.xml entry. Important: pom.xml must not be ignored.
  8. You may also remove some unused files (created by leiningen):
    $ rm -rf src/<project-name>
    # do not remove src/main src/test and src/heroku_autobuild.clj!
  9. Create a git repository:
    $ git init
    $ git add .
    $ git commit -m "initial import"
  10. Push everything to Heroku and (hopefully) watch it get built there!
    $ heroku create --stack cedar
    $ git push -u heroku master
  11. When it ends, open another terminal on the project directory, to follow logs:
    $ heroku logs -t
  12. Open the URL that heroku creates and test the application. You may also spawn more instances and use everything Heroku has to offer:
    $ heroku ps:scale web=2
    $ heroku ps:scale web=4
    $ heroku ps:scale web=0
  13. Quick tip: you do not have SSH access directly to Dynos, but this little trick is very useful to troubleshoot issues and to see how slugs were compiled:
    $ heroku run bash

spring-roo on Heroku

Once you understand the process for a VRaptor application, it should be easy for others as well. Just follow the simple step-by-step here, to create a simple spring-roo maven application.

Then, before running lein deps or foreman start, you must adjust your pom.xml, to remove plugins incompatible with the older mvn that is bundled in Leiningen. Here is the pom.xml that I am using (with some of the plugins downgraded).

You can see my spring-roo application running on Heroku here.

Scala/Lift on Heroku

The same for Scala/Lift: follow the instructions to bootstrap an application with maven. One simple change to the pom.xml is required. My version is here.

You also need to turn off the AUTO_SERVER feature of the H2 DB. It makes H2 automatically start a server process, which binds to a TCP port. Heroku only allows one bound port per process/dyno, and for the web process it must be the jetty port.

To turn it off, change the database URL inside src/main/scala/bootstrap/liftweb/Boot.scala. I changed mine to an in-memory database. The URL must be something like jdbc:h2:mem:<project>.db.

My Lift application is running on Heroku here.

I hope this helps. Official support for Java must be in Heroku’s plans, but even without it we’ve already got a lot of possibilities.

Status report: new job, new life

I’m sorry my last post was about three months ago, but I have a good excuse: I just got married!
(and I took a nice, quick honeymoon vacation)

Besides that, after three happy years working full time at Caelum, here is the shocking news: I’m now part of the Locaweb team!


This was (and still is) a very hard decision. People close to me know that I have a strong relationship with Caelum. Man, I love the company!

I’m feeling very strange: I’m sad that I’m no longer full time at Caelum and, at the same time, extremely excited about the new challenges to come. My decision to join Locaweb, which is a really good place to work, just shows how eager I am to do a good job there. Besides, I’m going to sit near well-known people such as my friends Fabio Akita and Daniel Cukier, just to name a few.

I’m joining the talented Cloud Computing team, and I hope I can help them improve the Cloud Server product. An enormous responsibility!

Cloud Server

Locaweb is a huge company, and I must admit I’m a little bit scared by the scale of things there. They have many teams working on a wide variety of challenging products. They even have their own datacenter! Locaweb has already embraced agile, and people there are very open minded: as a hosting provider, they have to deal with almost every kind of technology.

Because of my work prior to Locaweb, I know the expectations of me are quite high. I like being honest, and just to be clear: I’m not better than anyone. I’m relatively new to this area, so I still have a lot to learn. Fortunately, I already know many people at Locaweb and I really believe they will help me become productive quickly. Additionally, there are areas where I know I can contribute something. I’m joining to help, not to prove anything to anyone.

It’s also important to say that I couldn’t fully leave Caelum. I’m still part of the team as an instructor, and I keep helping them in many areas, such as the book, courses, textbooks, internal discussions, events, talks, and wherever else I can.

Arquitetura e Design de Software - Uma visão sobre a plataforma Java

That’s it. Comments, questions and feature suggestions are, as always, welcome. As a cloud provider, I’m happy to hear what you would like to see in a cloud product, and what we can do to help you scale and profit.

From now on, I’ll stop referring to Locaweb as “they” and start saying “we at Locaweb …”.