sent to the kernel. Syscalls are used whenever a process requests anything from the kernel; that could involve access to a file, memory, or another process, among many other things. The manual explains that on traditional Unix-like systems there are two categories of processes: privileged processes (belonging to the root user) and unprivileged processes (which don't belong to the root user). According to the Kernel Development site (lwn.net/1999/1202/kernel.php3), kernel capabilities were introduced in 1999; they were developed during the v2.1 series and shipped in the stable v2.2 kernel. Using kernel capabilities, it is possible to finely tune how much system access a process can get without being the root user.
By contrast, cgroups, or control groups, were designed by Google engineers from 2006 onward and merged into the mainline kernel in version 2.6.24 (released in early 2008) to enforce quotas for system resources including RAM and CPU; such limitations are also of great benefit to the security of a system when it is sliced into smaller pieces to run containers.
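To make the idea of resource quotas concrete, here is a minimal sketch of how the limits applied to a container could be interpreted. It assumes a host using cgroup v2, where each group's limits appear in files named memory.max and cpu.max under /sys/fs/cgroup; the function names are our own and purely illustrative.

```python
from typing import Optional


def parse_memory_max(text: str) -> Optional[int]:
    """Return the memory limit in bytes, or None when the group is unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_cpu_max(text: str) -> Optional[float]:
    """Return the CPU limit in cores, or None when the group is unlimited.

    cpu.max holds "<quota> <period>" in microseconds, or "max <period>".
    """
    quota, period = text.split()
    return None if quota == "max" else int(quota) / int(period)


if __name__ == "__main__":
    # Illustrative file contents, not values read from a real system.
    print(parse_memory_max("536870912\n"))
    print(parse_cpu_max("50000 100000\n"))
    print(parse_cpu_max("max 100000\n"))
```

Inside a container, the text passed to these helpers would typically come from reading /sys/fs/cgroup/memory.max and /sys/fs/cgroup/cpu.max, although the exact paths depend on how the host mounts the cgroup filesystem.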
The problem that kernel capabilities addressed was that privileged processes bypass all kernel permission checks, while all nonroot processes are run through security checks that involve the user ID (UID), the group ID (GID), and any other groups the user is a member of (known as supplementary groups). These checks are performed against what is called the effective UID of the process. In other words, imagine that you have just logged in as a nonroot user chris and then elevate to become the root user with an su - command. Your login identity remains chris, but after you elevate to become the superuser, your “effective UID” is now 0, the UID for the root user. (Strictly speaking, su also resets the “real UID” of the new shell to 0; it is the login UID recorded by the kernel's audit subsystem that stays the same, whereas real and effective UIDs differ mainly while a setuid program is running.) This is an important concept to understand for security, because security controls need to track both identities throughout their lifecycle on a system. Clearly you don't want a security application telling you only that the root user is attacking your system; you need to know the login user, chris in this example, that elevated to become the root user. If you are ever working within a container for testing and changing the USER instruction in the Dockerfile that created the container image, then the id command is a helpful tool, offering output such as this so you can find out exactly which user you currently are:
uid=0(root) gid=0(root) groups=0(root)
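The same identity information that the id command prints can also be read from inside any process. The following is a minimal sketch, assuming a Linux system with Python 3; the helper names are hypothetical. In an ordinary shell session the real and effective values match, and they diverge mainly while a setuid program is running.

```python
import os
import pwd


def user_name(uid: int) -> str:
    """Look up a user name, falling back to the number if there is no passwd entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # Containers started with an arbitrary UID often have no matching entry.
        return str(uid)


def current_ids() -> dict:
    """Collect the real and effective user and group IDs of this process."""
    return {
        "real_uid": os.getuid(),
        "effective_uid": os.geteuid(),
        "real_gid": os.getgid(),
        "effective_gid": os.getegid(),
        "supplementary_groups": os.getgroups(),
    }


if __name__ == "__main__":
    ids = current_ids()
    print(f"real uid={ids['real_uid']}({user_name(ids['real_uid'])})")
    print(f"effective uid={ids['effective_uid']}({user_name(ids['effective_uid'])})")
    print(f"supplementary groups={ids['supplementary_groups']}")
```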
Even with other security controls used within a Linux system running containers, such as namespaces that segregate access between pods in Kubernetes and OpenShift or containers within a runtime, it is highly advisable never to run a container as the root user. A typical Dockerfile that prevents the root user from running within the container might be created as shown in Listing 1.1.
Listing 1.1: A Simple Example Dockerfile of How to Spawn a Container as Nonroot
FROM debian:stable
USER root
RUN apt-get update && apt-get install -y iftop && apt-get clean
USER nobody
CMD bash
In Listing 1.1, the second line explicitly states that the root user is initially used to install the packages in the container image, and then the nobody user actually executes the final command. The USER root line isn't strictly needed, because the debian base image already runs its build steps as the root user by default, but it is added here to demonstrate the change between responsibilities for each USER clearly.
Once an image is built from that Dockerfile, any container spawned from it will run as the nobody user, with the predictable UID and GID of 65534 on Debian derivatives or UID/GID 99 on Red Hat Enterprise Linux derivatives. These UIDs or usernames are useful to remember so that you can check that the permissions within your containers are set up to suit your needs. You might need them to mount a storage volume with the correct permissions, for example.
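Whether the nobody user can write to a mounted volume depends on the directory's ownership and mode bits. The following rough sketch applies the classic owner/group/other check for a nonroot user; can_write is a hypothetical helper, the UID and GID constants are the Debian values mentioned above, and the check deliberately ignores ACLs, supplementary groups, and capabilities.

```python
import stat

NOBODY_UID = 65534  # nobody on Debian-derived images, as noted in the text
NOBODY_GID = 65534


def can_write(mode: int, owner_uid: int, owner_gid: int, uid: int, gid: int) -> bool:
    """Apply the classic Unix write-permission check for a nonroot user.

    Exactly one permission class applies: owner, then group, then other.
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if gid == owner_gid:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


if __name__ == "__main__":
    # A root-owned directory with mode 0755 is not writable by nobody.
    print(can_write(0o755, 0, 0, NOBODY_UID, NOBODY_GID))
    # The same directory owned by nobody is writable.
    print(can_write(0o755, NOBODY_UID, NOBODY_GID, NOBODY_UID, NOBODY_GID))
```

For a real directory, the mode and ownership would come from os.stat(path), using its st_mode, st_uid, and st_gid fields.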
Now that we have covered some of the theory, we'll move on to a more hands-on approach to demonstrate the components of how a container is constructed. In our case we will not use the dreaded --privileged option, which to all intents and purposes gives a container root permissions. Docker offers the following useful security documentation about privileges and kernel capabilities, which is worth reading for greater clarity in this area:
docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities
The docs describe Privileged mode as essentially enabling “…access to all devices on the host as well as [having the ability to] set some configuration in AppArmor or SELinux to allow the container nearly all the same access to the host as processes running outside containers on the host.” In other words, you should rarely, if ever, use this switch on your container command line. It is simply the least secure and laziest approach, widely abused when developers cannot get features to work. Avoiding it might mean, for example, that a volume has to be mounted from a container with tightened permissions onto a host's directory, which takes more effort but achieves a more secure outcome. Rest assured that, with some effort, whichever way you approach the problem there will be a possible solution using specific kernel capabilities, potentially coupled with other mechanisms, which means that you don't have to open the floodgates and use Privileged mode.
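As an illustration of granting specific capabilities rather than reaching for Privileged mode, the following sketch builds a docker run argument list that drops every capability and then adds back only those named. The docker_run_args helper is hypothetical; the --cap-drop and --cap-add options are the documented Docker flags, and ALL is the documented keyword for every capability.

```python
from typing import Iterable, List


def docker_run_args(image: str, cap_add: Iterable[str] = ()) -> List[str]:
    """Build a `docker run` argument list that drops all capabilities first.

    Only the capabilities named in cap_add are granted back.
    """
    args = ["docker", "run", "--rm", "--cap-drop", "ALL"]
    for cap in cap_add:
        # Use the short form without the CAP_ prefix, as in the text.
        name = cap.upper()
        if name.startswith("CAP_"):
            name = name[len("CAP_"):]
        args += ["--cap-add", name]
    args.append(image)
    return args


if __name__ == "__main__":
    print(" ".join(docker_run_args("debian:latest", ["cap_net_admin"])))
```

The resulting list could be passed to subprocess.run on a host where Docker is installed; the point of the design is that --privileged never appears, so every extra permission has to be named explicitly.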
For our example, we will choose two of the most powerful kernel capabilities to demonstrate what a container looks like, from the inside out. They are CAP_SYS_ADMIN and CAP_NET_ADMIN (commonly abbreviated without CAP_ in Docker and kernel parlance).
The first of these enables a container to run a number of sysadmin commands to control a system in ways a root user would. The second capability is similarly powerful but applies to networking: it can manipulate the host's and the container's network stacks. In the Linux manual page (man7.org/linux/man-pages/man7/capabilities.7.html) you can see the capabilities afforded to these --cap-add settings within Docker.
From that web page we can see that Network Admin (CAP_NET_ADMIN) includes the following:
Interface configuration
Administration of IP firewall
Modifying routing tables
Binding to any address for proxying
Switching on promiscuous mode
Enabling multicasting
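To check whether a process actually holds one of these capabilities, you can inspect the CapEff bitmask that the kernel reports in /proc/<pid>/status. The following is a minimal sketch: the bit positions follow the kernel's capability numbering, the helper names are our own, and the sample mask is the value commonly reported for Docker's default capability set, so treat it as illustrative rather than guaranteed.

```python
# Bit positions from the kernel's capability numbering (linux/capability.h).
CAP_BITS = {
    "CAP_NET_ADMIN": 12,
    "CAP_NET_RAW": 13,
    "CAP_SYS_ADMIN": 21,
}


def effective_mask(status_text: str) -> int:
    """Extract the CapEff bitmask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_capability(mask: int, name: str) -> bool:
    """Report whether the named capability's bit is set in the mask."""
    return bool(mask & (1 << CAP_BITS[name]))


if __name__ == "__main__":
    # Illustrative status excerpt, not read from a live process.
    sample = "Name:\tbash\nCapEff:\t00000000a80425fb\n"
    mask = effective_mask(sample)
    for cap in sorted(CAP_BITS):
        print(cap, has_capability(mask, cap))
```

With that sample mask, neither CAP_SYS_ADMIN nor CAP_NET_ADMIN is set, which is why they have to be requested explicitly with --cap-add.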
We will start our look at a container's internal components by running this command:
$ docker run -d --rm --name apache -p443:443 httpd:latest
We can now check that TCP port 443 is available from our Apache container (Apache is also known as httpd) and that the default port, TCP port 80, has been exposed, like so:
$ docker ps
IMAGE   COMMAND              CREATED          STATUS   PORTS                  NAMES
httpd   "httpd-foreground"   36 seconds ago   Up 33s   80/tcp, 443->443/tcp   apache
Having seen the slightly redacted output from that command, we will now use a second container (running Debian Linux) to look inside our first container with the following command, which elevates the permissions available to the container using the first of the two kernel capabilities that we just looked at:
$ docker run --rm -it --name debian --pid=container:apache \
    --net=container:apache --cap-add sys_admin debian:latest
We will come back to the contents of that command, which started a Debian container, in a moment. Now that we're running a Bash shell inside our Debian container, let's install the procps package and see what processes the container is running:
root@0237e1ebcc85:/# apt update; apt install procps -y
root@0237e1ebcc85:/# ps -ef
UID      PID  PPID  C  STIME  TTY    TIME      CMD
root       1     0  0  15:17  ?      00:00:00  httpd -DFOREGROUND
daemon     9     1  0  15:17  ?      00:00:00  httpd -DFOREGROUND
daemon    10     1  0  15:17  ?      00:00:00  httpd -DFOREGROUND
daemon    11     1  0  15:17  ?      00:00:00  httpd -DFOREGROUND
root      93     0  0  15:45  pts/0  00:00:00  bash
root     670    93  0  15:51  pts/0  00:00:00  ps -ef
We can see from the ps command's output that the bash and ps -ef processes are present, but several Apache web server processes are also shown as httpd. Why