The sophisticated rkt offered what might be called hard tenancy between containers. This strict isolation enabled true protection for Customer B if Customer A was compromised; and although containers are, again, not virtual machines, rkt bridged a gap where previously few other security innovations had succeeded.
A modern approach under active development, similar to that of rkt, is called Kata Containers (katacontainers.io), via the OpenStack Foundation (OSF). The marketing strapline on the website confidently declares that you can achieve the “speed of containers” and still have the “security of VMs.” In a similar vein to rkt, MicroVMs are offered via an Open Source runtime. By using hardware virtualization, the isolation of containerized workloads can be comfortably assured. This post from Red Hat about SELinux alterations for Kata Containers is informative: www.redhat.com/sysadmin/selinux-kata-containers. Users of Kata Containers apparently include internet giants such as Baidu, which runs it in production, and you are encouraged to investigate the project further.
Finally, following a slight tangent, another interesting addition to this space is courtesy of AWS, which, in 2020, announced the general availability of an Open Source Linux distribution called Bottlerocket (aws.amazon.com/bottlerocket). This operating system is designed specifically to run containers with improved security. The premise for the operational side of Bottlerocket is that a distribution containing only the minimal files required for running containers has a significantly reduced attack surface. SELinux is used to increase isolation between containers and the underlying host, and the usual suspects are present too: cgroups, namespaces, and seccomp. There is also device mapper functionality from dm-verity that provides integrity checking of block devices to reduce the chances of advanced persistent threats taking hold. While time will tell if Bottlerocket proves to be popular, it is an interesting development that should be watched.
Summary
In this chapter, we looked at some of the key concepts around container security and how the Linux kernel developers have added a number of features over the years to help protect containerized workloads, along with contributions from commercial entities such as Google.
We then looked at some hands-on examples of how a container is constructed and how containers are ultimately viewed from a system's perspective. Our approach made it easy to appreciate how any kind of privilege escalation can lead to unwanted results for other containers and critically important system resources on a host machine.
Additionally, we saw that the USER instruction should never be set to root within a container and how a simple Dockerfile can be constructed securely if permissions are set correctly for resources, using some forethought. Finally, we noted that other technologies such as serverless also use containerization for their needs.
CHAPTER 2 Rootless Runtimes
In Chapter 1, “What Is A Container?,” we looked at the components that make up a container and how a system is sliced up into segments to provide isolation for the standard components that Linux usually offers.
We also discussed the likely issues that could be caused by offering a container excessive privileges. It became clear that, having examined a container's innards, opening up as few Linux kernel capabilities as possible and stoically avoiding the use of Privileged mode was the way to run containers in the most secure fashion.
In this chapter, we continue looking at developments in the container space that have meant it is no longer necessary to always use the root user to run the underlying container runtime(s). Consider that for a moment. In Chapter 1 we discussed how a compromised container can provide a significant threat to the underlying operating system (OS) and other containers running on the host. Additionally, we looked at how the root user on the host transposed directly to the root user within a container. If the container was subject to a compromise, then any resources that the container could access were also accessible on the host; and most alarmingly, they would have superuser permissions. For a number of years, to improve the Linux container security model, developers made great efforts to run containers without providing root user permissions. Relatively recent runtime innovations have meant that the Holy Grail is now a reality.
In the first half of this chapter, we will look at an experimental feature available from Docker (www.docker.com), known as rootless mode, which apparently is soon to be a stable feature. Following that we will explore another prominent container runtime, called Podman (podman.io), that offers similar functionality with some welcome extra features.
Docker Rootless Mode
Docker, beginning with v19.03 (docs.docker.com/engine/release-notes/#19030), offers a clever feature it calls rootless mode, in which Docker Engine doesn't require superuser privileges to spawn containers. Rootless mode appears to be an extension of a stable feature called user namespaces, which helped harden a container. The premise of that functionality was to effectively fool a container into thinking that it was using a host's user ID (UID) and group ID (GID) normally, when from the host's perspective the UID/GID used in the container carried no privileges and so was of much less consequence to the host's security.
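To make the UID translation more concrete, here is a minimal sketch of how an ID inside a user namespace corresponds to an ID on the host. It relies on the kernel's uid_map format (three fields per line: first ID inside the namespace, first ID on the host, and the length of the range). The function name map_uid is an illustrative helper and not part of Docker or any standard tool.

```shell
#!/bin/sh
# Translate a UID as seen inside a user namespace into the UID the host sees.
# Each line of a uid_map file has three fields:
#   <first ID inside namespace> <first ID on host> <length of range>
# Usage: map_uid <uid_inside_namespace> [uid_map_file]
# map_uid is a hypothetical helper name; the file defaults to /proc/self/uid_map.
map_uid() {
    awk -v uid="$1" '
        # Pick the first range that contains the requested ID.
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        # Signal failure if no range covers the ID.
        END { if (!found) exit 1 }
    ' "${2:-/proc/self/uid_map}"
}
```

Given a map file containing the two lines "0 1000 1" and "1 100000 65536" (an assumed example), calling map_uid 0 against it prints 1000, so root inside the namespace is just an ordinary user on the host, and map_uid 33 prints 100032.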
With rootless mode there are some prerequisites to get started; these have to do with mapping unprivileged users with kernel namespaces. On Debian derivatives, the package we need is called uidmap, but we will start (as the root user) by removing Docker Engine and its associated packages with this command (be careful only to do this on systems that are used for development, for obvious reasons):
$ apt purge docker
Then, continuing as the superuser, we will install the package noted earlier with this command:
$ apt install uidmap
Next, we need to check the following two files to make sure that a less-privileged user (named chris in this instance) has 65,536 UIDs and GIDs available for re-mapping:
$ cat /etc/subuid
chris:100000:65536
$ cat /etc/subgid
chris:100000:65536
The output is what is expected, so we can continue. One caveat with this experimental functionality is that Docker Inc. encourages you to use an Ubuntu kernel. We will test this setup on a Linux Mint machine with Ubuntu 18.04 LTS under the hood.
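The check on /etc/subuid and /etc/subgid can also be scripted. The following is a rough sketch, assuming the name:start:count entry format shown in those files; the function name has_subid_range is a hypothetical helper, and the 65,536 default simply mirrors the figure used in this chapter.

```shell
#!/bin/sh
# Succeed if <user> has a subordinate ID range of at least <min_count> IDs.
# Entries are expected in the form "name:start:count".
# Usage: has_subid_range <user> <file> [min_count]
# has_subid_range is a hypothetical helper name, not a standard utility.
has_subid_range() {
    awk -F: -v user="$1" -v min="${3:-65536}" '
        # Remember whether any single range for this user is large enough.
        $1 == user && $3 >= min { ok = 1 }
        # Exit status 0 when a suitable range was found, 1 otherwise.
        END { exit !ok }
    ' "$2"
}
```

For instance, has_subid_range chris /etc/subuid && has_subid_range chris /etc/subgid would succeed on the machine described here and fail if either file lacked a large enough range for that user.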
If you want to try this on Debian Linux, Arch Linux, openSUSE, Fedora (v31+), or CentOS, then you will need to prepare your machine a little more beforehand. For example, although Debian is the underlying OS for Ubuntu, there are clearly notable differences between the two OSs; to try this feature on Debian, you would need to adjust the kernel settings a little beforehand. The required kernel tweak would be as follows, relating to user namespaces:
kernel.unprivileged_userns_clone=1 # add me to /etc/sysctl.conf to persist after a reboot
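Before changing the setting, it can be useful to see what the running kernel currently reports. The sketch below reads the value from /proc/sys, which is where sysctl settings are exposed. The function name userns_clone_status is a hypothetical helper, and the optional path argument exists only so that the function can be tried against a test file. On kernels that do not provide this particular setting, the file is simply missing, which the sketch reports as "absent" rather than treating it as disabled.

```shell
#!/bin/sh
# Report the state of the kernel.unprivileged_userns_clone setting.
# Prints one of: enabled, disabled, absent.
# Usage: userns_clone_status [path]
# userns_clone_status is a hypothetical helper name.
userns_clone_status() {
    f="${1:-/proc/sys/kernel/unprivileged_userns_clone}"
    if [ ! -r "$f" ]; then
        # The running kernel does not expose this setting.
        echo "absent"
    elif [ "$(cat "$f")" = "1" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

If the function prints "disabled", the kernel tweak described here would be the one to apply.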
You would also be wise to use the overlay2 storage driver with this command:
$ modprobe overlay permit_mounts_in_userns=1 # add me to /etc/modprobe.d to survive a reboot
There are a few limitations that we will need to look at before continuing. The earlier user namespace feature had some trade-offs that meant the functionality was not suited for every application. For example, the --net=host feature was not compatible. However, that is not a problem, because the feature is a security hole; it is not recommended, because it opens up the host's network stack to a container for abuse. Similarly, we saw that the same applied when we tried to share the process table with the --pid switch in Chapter 1. It was also impossible to use --read-only containers to prevent data being saved to the