What’s in your container?
Pieter van Noordennen
Apr 22, 2021
The world of cloud-native development is rife with metaphor, and none more so than the container-ship analogy surrounding Docker. When it comes to Docker Layers, it might be useful to equate them to wooden pallets in the shipping analogy.
A Docker layer is a set of filesystem changes that creates an intermediate image in the build process. Like pallets, the Layers used to build a container:
- Stack on top of each other sequentially to create the final container;
- When arranged thoughtfully, can make a container organized, fast, and easy to work with;
- Conversely, make a giant mess when not constructed well;
- Are of little interest to end users (i.e., consumers or developers), who care more about the final delivery than the container’s internal construction.
Instructions specified in a Dockerfile (FROM, RUN, COPY, etc.) alter the previous image, thus creating a new layer. Docker layers give developers a way to effectively “track changes” when authoring a new Docker image, but they can also add performance overhead and increase container size. A simple Dockerfile for a Python service looks like this:
FROM python:2.7.15
RUN mkdir -p /opt/my/service
COPY service /opt/my/service
WORKDIR /opt/my/service
RUN pip install -r requirements.txt
EXPOSE 9000
ENTRYPOINT ["python","/opt/my/service/server.py"]
Outside of DevOps specialists, most developers neither understand nor care about the internal structure of their Docker images and the related build paths. But those internals, most notably the way layers are constructed in an image, play a critical role in the performance, functionality, and size of the final container.
Engineers who specialize in container optimization and security examine layers during construction so they can tune the image build process, reduce container size, improve performance and security, and triage errors in generated containers.
Docker Layers and Optimization: the Good, the Bad, and the Ugly
This layering concept is valuable when authoring and optimizing images for several reasons. It makes changes easier to see from one step to the next, it increases build speed through layer caching, and it allows container authors to arrange their image construction to improve build speed.
In effect, Docker stages changes like a version control system would, adding a change, then another, then another, in sequence until the build is complete. Each successive change is a diff between the previous layer and the new one, just like you’d see in a pull request on GitHub.
When creating a layer, Docker mounts its own read-write filesystem layer on top of what’s already there and begins making changes. Layers themselves do not hold elements like environment variables or default values (these are properties of the image as a whole rather than of any particular layer), and a layer should never depend on the state of any external system or process.
The benefit of this “immutability” restriction is twofold: any number of containers can be started from the same image, which makes the state of a freshly created container predictable, and Docker can use layer caching to reduce build time.
The downside, however, is that packages, files, and data needed along the way are often kept in the filesystem and packaged with the resulting container. Three instructions (ADD, COPY, and RUN) create layers that increase the size of the resulting container. If a Dockerfile adds, deletes, and changes many files across separate instructions, the image grows, because deleting a file in a later layer hides it without removing it from the earlier layer. These larger images can negatively impact performance.
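As a small illustration of why deleting files does not shrink an image, consider the sketch below. The archive name and paths are hypothetical placeholders chosen for this example, not part of any real project.

```dockerfile
# Hypothetical example: build-assets.tar.gz and /opt/assets are placeholders.
FROM python:2.7.15

# Layer 1: the archive is stored in this layer.
COPY build-assets.tar.gz /tmp/build-assets.tar.gz

# Layer 2: the extracted files are stored in this layer.
RUN mkdir -p /opt/assets && tar -xzf /tmp/build-assets.tar.gz -C /opt/assets

# Layer 3: the archive disappears from the final filesystem, but layer 1
# still contains it, so the image does not get any smaller.
RUN rm /tmp/build-assets.tar.gz
```

Running `docker history` on the resulting image lists the size contributed by each instruction, which makes this kind of leftover easy to spot.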
Multi-stage builds are meant to address this, but can be cumbersome to create and debug. Open source projects like our own DockerSlim can automate size reduction, but come with a learning curve before they are implemented correctly.
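For readers who have not seen one, a multi-stage build might look roughly like the sketch below, which adapts the Python service example from earlier in this article. The `-slim` base tag, the `builder` stage name, and the `/install` prefix are assumptions made for illustration; dependencies with compiled extensions may need extra system libraries in the final stage.

```dockerfile
# Stage 1: install dependencies in the full image, where build tools exist.
FROM python:2.7.15 AS builder
WORKDIR /opt/my/service
COPY service/requirements.txt .
# Installing under /install (an illustrative choice) keeps the packages in
# one directory tree that is easy to copy into the next stage.
RUN pip install --prefix=/install -r requirements.txt

# Stage 2: start from a smaller base and copy in only what runs in production.
FROM python:2.7.15-slim
COPY --from=builder /install /usr/local
COPY service /opt/my/service
WORKDIR /opt/my/service
EXPOSE 9000
ENTRYPOINT ["python", "/opt/my/service/server.py"]
```

Only the layers of the final stage end up in the shipped image; everything created in the `builder` stage is left behind unless it is copied explicitly.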
Because layers are intermediate images, if you make a change to your Dockerfile, Docker will rebuild only the layer that was changed and the ones after it. This process is called layer caching.
Layer caching is useful when rebuilding an image that already exists on your machine. Each layer has its own sha256 value, and Docker simply compares that value with the sha256 value in its cache to decide whether it needs to rebuild the layer.
Layer caching can lead those new to Docker into a trap, however. A common anti-pattern in Dockerfiles is to include instructions that depend on the state of external systems, such as a change pushed to a development database or microservice. If Docker sees no changes to the underlying layer (it doesn’t observe the outside world), it will assume the cached layer is still correct and will not rebuild it. This can lead to container errors that take costly cycles to debug.
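A sketch of the trap, along with one possible workaround, is shown below. The URL and the `SCHEMA_VERSION` argument are hypothetical names invented for this example.

```dockerfile
FROM python:2.7.15

# Without the ARG below, the text of the RUN instruction would never change,
# so Docker would keep reusing the cached layer even after the remote file
# was updated.
#
# One possible workaround: a build argument whose value is changed on
# purpose. A new value causes a cache miss at the first instruction that
# uses it. SCHEMA_VERSION and the URL are hypothetical.
ARG SCHEMA_VERSION=1
RUN echo "fetching schema version ${SCHEMA_VERSION}" \
    && curl -fsSL https://example.com/dev/schema.sql -o /opt/schema.sql
```

Building with `docker build --build-arg SCHEMA_VERSION=2 .` would then re-run the download, and `docker build --no-cache .` skips the cache entirely at the cost of a full rebuild.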
Since each layer is immutable, Docker doesn’t need to rebuild any layer in the cache that hasn’t changed. Almost all Docker image builds start with the operating system install in Layer 0 (often the result of the FROM instruction in the Dockerfile). This is because the OS is highly unlikely to have changed from one build to the next.
In good Dockerfile construction, layers are built from least likely to change to most likely to change, with some exceptions. This takes full advantage of layer caching, allowing images to be rebuilt and redeployed quickly.
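Applied to the Python service example from earlier in this article, that ordering might look like the sketch below. It assumes, as the original Dockerfile implies, that requirements.txt lives inside the service directory.

```dockerfile
FROM python:2.7.15
WORKDIR /opt/my/service

# Dependencies change rarely: copy only the requirements file first, so the
# pip install layer stays cached until requirements.txt itself changes.
COPY service/requirements.txt .
RUN pip install -r requirements.txt

# Application code changes often: copy it last, so an edit invalidates only
# this layer and the ones after it.
COPY service .

EXPOSE 9000
ENTRYPOINT ["python", "/opt/my/service/server.py"]
```

With this arrangement, a change to server.py reuses the cached dependency layer instead of reinstalling every package.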
However, sometimes container authors take shortcuts here to reduce image size, such as combining a bunch of actions into a single Dockerfile instruction using the && operator. There’s nothing wrong with this, but if the instructions are haphazard, it can backfire by lumping costly actions that could be cached together with those that are bound to change frequently in development.
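The difference can be sketched as follows. The `libpq-dev` package is only an illustrative choice, and the base image is reused from this article's example for continuity; on an image this old the package mirrors may have moved, so treat the package commands as illustrative.

```dockerfile
FROM python:2.7.15

# Haphazard (shown commented out): the slow system-package install is
# chained to a step that depends on frequently edited application files,
# so every code change repeats the whole line.
# COPY service /opt/my/service
# RUN apt-get update && apt-get install -y libpq-dev \
#     && pip install -r /opt/my/service/requirements.txt

# More cache-friendly: steps that belong together and rarely change share
# one layer, and cleanup happens in that same layer so it saves space.
RUN apt-get update \
    && apt-get install -y --no-install-recommends libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# Frequently changing inputs come afterwards, in their own layers.
COPY service /opt/my/service
RUN pip install -r /opt/my/service/requirements.txt
```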
In summary, while Docker layers are of little interest to those who simply want to grab a container, toss their app in it, and go, they matter a lot when it comes to optimizing a container.