kubernetes/docs/proposals/selinux-enhancements.md

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

## Abstract

This document proposes enhancing the security of Kubernetes clusters using
SELinux and simplifying the implementation of SELinux support within the
Kubelet by removing the need to label the Kubelet directory with an SELinux
context usable from a container.

## Motivation

The current Kubernetes codebase relies upon the Kubelet directory being
labeled with an SELinux context usable from a container.  This means that a
container escaping namespace isolation will be able to use any file within the
Kubelet directory without defeating kernel
[MAC (mandatory access control)](https://en.wikipedia.org/wiki/Mandatory_access_control).
In order to limit the attack surface, we should enhance the Kubelet to relabel
any bind-mounts into containers into a usable SELinux context without depending
on the Kubelet directory's SELinux context.

## Constraints and Assumptions

1.  No API changes allowed
2.  Behavior must be fully backward compatible
3.  No new admission controllers - make incremental improvements without huge
    refactorings

## Use Cases

1.  As a cluster operator, I want to avoid having to label the Kubelet
    directory with a label usable from a container, so that I can limit the
    attack surface available to a container escaping its namespace isolation
2.  As a user, I want to run a pod without an SELinux context explicitly
    specified and be isolated using MCS (multi-category security) on systems
    where SELinux is enabled, so that the pods on each host are isolated from
    one another
3.  As a user, I want to run a pod that uses the host IPC or PID namespace and
    want the system to do the right thing with regard to SELinux, so that no
    unnecessary relabel actions are performed

### Labeling the Kubelet directory

As previously stated, the current codebase relies on the Kubelet directory
being labeled with an SELinux context usable from a container.  The Kubelet
uses the SELinux context of this directory to determine what SELinux context
`tmpfs` mounts (provided by the EmptyDir memory-medium option) should receive.
The problem with this is that it opens an attack surface to a container that
escapes its namespace isolation; such a container would be able to use any
file in the Kubelet directory without defeating kernel MAC.

### SELinux when no context is specified

When no SELinux context is specified, Kubernetes should just do the right
thing, where doing the right thing is defined as isolating pods with a
node-unique set of categories.  Node-uniqueness means unique among the pods
scheduled onto the node.  Long-term, we want to have a cluster-wide allocator
for MCS labels.  Node-unique MCS labels are a good middle ground that is
possible without a large new feature.
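
As a rough illustration of node-uniqueness, the sketch below hands each pod on
a node a distinct pair of categories.  It is not part of this proposal or of
any existing library: the `mcsAllocator` type, its methods, the `s0:cX,cY`
label layout, and the limit of 1024 categories are all assumptions made for
the example.

```go
package main

import (
	"errors"
	"fmt"
)

// maxCategories is an assumption for this example; the number of
// categories available depends on the SELinux policy in use.
const maxCategories = 1024

// mcsAllocator is a hypothetical allocator that tracks which category
// pairs are held by pods on a single node.
type mcsAllocator struct {
	inUse map[string]string // pod UID -> MCS label
	taken map[string]bool   // MCS label -> currently allocated
}

func newMCSAllocator() *mcsAllocator {
	return &mcsAllocator{inUse: map[string]string{}, taken: map[string]bool{}}
}

// Allocate returns the label already held by the pod, or else the first
// free category pair in ascending order.
func (a *mcsAllocator) Allocate(podUID string) (string, error) {
	if label, ok := a.inUse[podUID]; ok {
		return label, nil
	}
	for i := 0; i < maxCategories; i++ {
		for j := i + 1; j < maxCategories; j++ {
			label := fmt.Sprintf("s0:c%d,c%d", i, j)
			if !a.taken[label] {
				a.taken[label] = true
				a.inUse[podUID] = label
				return label, nil
			}
		}
	}
	return "", errors.New("no free MCS category pairs on this node")
}

// Release frees the pod's label so that a later pod can reuse it.
func (a *mcsAllocator) Release(podUID string) {
	if label, ok := a.inUse[podUID]; ok {
		delete(a.taken, label)
		delete(a.inUse, podUID)
	}
}

func main() {
	alloc := newMCSAllocator()
	first, _ := alloc.Allocate("pod-a")
	second, _ := alloc.Allocate("pod-b")
	fmt.Println(first, second)
}
```

A cluster-wide allocator would need shared state and coordination between
nodes, which is why the node-local form is the smaller step.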

### SELinux and host IPC and PID namespaces

Containers in pods that use the host IPC or PID namespaces need access to
other processes and IPC mechanisms on the host.  Therefore, these containers
should be run with the `spc_t` SELinux type by the container runtime.  The
`spc_t` type is an unconfined type that other SELinux domains are allowed to
connect to.  In the case where a pod uses one of these host namespaces, it
should be unnecessary to relabel the pod's volumes.

## Analysis

### Libcontainer SELinux library

Docker and rkt both use the libcontainer SELinux library.  This library
provides a method, `GetLxcContexts`, that returns unique SELinux contexts for
container processes and the files used by them.  `GetLxcContexts` reads the
base SELinux context information from a file at
`/etc/selinux/<policy-name>/contexts/lxc_contexts` and then adds a
process-unique MCS label.

Docker and rkt both leverage this call to determine the 'starting' SELinux
contexts for containers.
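
The two steps described above (read the base contexts, then attach an MCS
label) can be sketched as follows.  This is not the libcontainer
implementation: `parseLxcContexts` and `withMCS` are hypothetical helpers, and
the sample file content, including the type names in it, is an assumed
example of the `key = "value"` layout rather than the content of any
particular policy.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseLxcContexts extracts the "process" and "file" entries from
// lxc_contexts-style content of the form: key = "value".
func parseLxcContexts(content string) (process, file string) {
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		value := strings.Trim(strings.TrimSpace(parts[1]), `"`)
		switch key {
		case "process":
			process = value
		case "file":
			file = value
		}
	}
	return process, file
}

// withMCS replaces the level (fourth field) of a user:role:type:level
// context with the given MCS label.
func withMCS(context, mcs string) string {
	fields := strings.SplitN(context, ":", 4)
	if len(fields) < 3 {
		return context
	}
	return strings.Join(append(fields[:3], mcs), ":")
}

func main() {
	// Assumed sample content; real files vary by policy.
	sample := `process = "system_u:system_r:svirt_lxc_net_t:s0"
file = "system_u:object_r:svirt_sandbox_file_t:s0"
`
	process, file := parseLxcContexts(sample)
	fmt.Println(withMCS(process, "s0:c1,c2"))
	fmt.Println(withMCS(file, "s0:c1,c2"))
}
```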

### Docker

Docker's behavior when no SELinux context is defined for a container is to
give the container a node-unique MCS label.

#### Sharing IPC namespaces

On the Docker runtime, the containers in a Kubernetes pod share the IPC and
PID namespaces of the pod's infra container.

Docker's behavior for containers sharing these namespaces is as follows: if a
container B shares the IPC namespace of another container A, container B is
given the SELinux context of container A.  Therefore, for Kubernetes pods
running on Docker, absent other factors, the containers in a pod should have
the same SELinux context.

[**Known issue**](https://bugzilla.redhat.com/show_bug.cgi?id=1377869): When
the seccomp profile is set on a Docker container that shares the IPC namespace
of another container, that container will not receive the other container's
SELinux context.

#### Host IPC and PID namespaces

In the case of a pod that shares the host IPC or PID namespace, this flag is
simply ignored and the container receives the `spc_t` SELinux type.  The
`spc_t` type is unconfined, and so no relabeling needs to be done for volumes
for these pods.  Currently, however, there is code which relabels volumes into
explicitly specified SELinux contexts for these pods. This code is unnecessary
and should be removed.

#### Relabeling bind-mounts

Docker is capable of relabeling bind-mounts into containers using the `:Z`
bind-mount flag.  However, in the current implementation of the Docker runtime
in Kubernetes, the `:Z` option is only applied when the pod's SecurityContext
contains an SELinux context.  We could easily implement the correct behavior
by always setting `:Z` on systems where SELinux is enabled.

### rkt

rkt's behavior when no SELinux context is defined for a pod is similar to
Docker's -- an SELinux context with a node-unique MCS label is given to the
containers of a pod.

#### Sharing IPC namespaces

Containers (apps, in rkt terminology) in rkt pods share an IPC and PID
namespace by default.

#### Relabeling bind-mounts

Bind-mounts into rkt pods are automatically relabeled into the pod's SELinux
context.

#### Host IPC and PID namespaces

Using the host IPC and PID namespaces is not currently supported by rkt.

## Proposed Changes

### Refactor `pkg/util/selinux`

1.  The `selinux` package should provide a method `SELinuxEnabled` that returns
    whether SELinux is enabled, and is built for all platforms (the
    libcontainer SELinux library is only built on Linux)
2.  The `SelinuxContextRunner` interface should be renamed to `SELinuxRunner`
    and be changed to have the same method names and signatures as the
    libcontainer methods its implementations wrap
3.  The `SELinuxRunner` interface should have a new method added called
    `GetLxcContexts`; this should return a **shared** (i.e., without MCS
    labels) SELinux context usable by a container

```go
package selinux

// SELinuxRunner wraps certain libcontainer SELinux calls. For more
// information, see:
//
// https://github.com/opencontainers/runc/blob/master/libcontainer/selinux/selinux.go
//
// Note: the libcontainer SELinux package is only built for Linux, so a NOP
// wrapper built for non-Linux platforms is necessary; it lets code that
// links to this package avoid differentiating its own methods for Linux
// and non-Linux platforms.
type SELinuxRunner interface {
	// Setfilecon sets the SELinux context for the given path or returns an
	// error.
	Setfilecon(path, context string) error

	// Getfilecon returns the SELinux context for the given path or returns an
	// error.
	Getfilecon(path string) (string, error)

	// GetLxcContexts returns the process and file SELinux contexts to use for
	// containers.
	GetLxcContexts() (string, string)
}
```
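
The NOP wrapper mentioned in the comment above might look roughly like the
following.  This is a sketch under assumptions: the `nopRunner` type is a
hypothetical name, the interface is repeated only to keep the example
self-contained, and in a real package the NOP and Linux implementations would
sit in separate files selected by build constraints.

```go
package main

import "fmt"

// SELinuxRunner mirrors the interface proposed in this document.
type SELinuxRunner interface {
	Setfilecon(path, context string) error
	Getfilecon(path string) (string, error)
	GetLxcContexts() (string, string)
}

// nopRunner is a hypothetical implementation for platforms without
// SELinux. In a real package it would live in a file guarded by a build
// constraint such as "//go:build !linux".
type nopRunner struct{}

// Compile-time check that nopRunner satisfies the interface.
var _ SELinuxRunner = nopRunner{}

func (nopRunner) Setfilecon(path, context string) error  { return nil }
func (nopRunner) Getfilecon(path string) (string, error) { return "", nil }
func (nopRunner) GetLxcContexts() (string, string)       { return "", "" }

// SELinuxEnabled always reports false in the NOP build.
func SELinuxEnabled() bool { return false }

func main() {
	var runner SELinuxRunner = nopRunner{}
	process, file := runner.GetLxcContexts()
	fmt.Printf("enabled=%v process=%q file=%q\n", SELinuxEnabled(), process, file)
}
```

Callers can then hold an `SELinuxRunner` unconditionally and branch only on
`SELinuxEnabled`, rather than carrying their own platform-specific code.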

### Kubelet Changes

1.  The `relabelVolumes` method in `kubelet_volumes.go` is not needed and can
    be removed
2.  The `GenerateRunContainerOptions` method in `kubelet_pods.go` should no
    longer call `relabelVolumes`
3.  The `makeHostsMount` method in `kubelet_pods.go` should set the
    `SELinuxRelabel` attribute of the mount for the pod's hosts file to `true`

### Changes to `pkg/kubelet/dockertools/`

1.  The `makeMountBindings` method should be changed to:
    1.  No longer accept the `podHasSELinuxLabel` parameter
    2.  Always use the `:Z` bind-mount flag when SELinux is enabled and the
        mount has the `SELinuxRelabel` attribute set to `true`
2.  The `runContainer` method should be changed to always use the `:Z`
    bind-mount flag on the termination message mount when SELinux is enabled
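
A minimal sketch of how `makeMountBindings` could behave after this change is
given here.  It is illustrative only: the `mount` struct is a simplified
stand-in for the Kubelet's mount type, `selinuxEnabled` is passed as a
parameter instead of being queried from the `selinux` package, and the paths
are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// mount is a simplified stand-in for the Kubelet's mount description; only
// the fields needed for this example are included.
type mount struct {
	HostPath       string
	ContainerPath  string
	ReadOnly       bool
	SELinuxRelabel bool
}

// makeMountBindings builds Docker bind strings of the form
// host:container[:options]. The Z option is added whenever SELinux is
// enabled and the mount asks for relabeling, regardless of whether the
// pod specifies an SELinux context.
func makeMountBindings(mounts []mount, selinuxEnabled bool) []string {
	var binds []string
	for _, m := range mounts {
		var opts []string
		if m.ReadOnly {
			opts = append(opts, "ro")
		}
		if selinuxEnabled && m.SELinuxRelabel {
			opts = append(opts, "Z")
		}
		bind := m.HostPath + ":" + m.ContainerPath
		if len(opts) > 0 {
			bind += ":" + strings.Join(opts, ",")
		}
		binds = append(binds, bind)
	}
	return binds
}

func main() {
	mounts := []mount{
		{HostPath: "/example/etc-hosts", ContainerPath: "/etc/hosts", SELinuxRelabel: true},
		{HostPath: "/example/config", ContainerPath: "/config", ReadOnly: true, SELinuxRelabel: true},
		{HostPath: "/example/data", ContainerPath: "/data"},
	}
	for _, b := range makeMountBindings(mounts, true) {
		fmt.Println(b)
	}
}
```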

### Changes to `pkg/kubelet/rkt`

There should not be any required changes for the rkt runtime; we should test
to ensure things work as expected under rkt.

### Changes to volume plugins and infrastructure

1.  The `VolumeHost` interface contains a method called `GetRootContext`; this
    is an artifact of the old assumptions about the Kubelet directory's SELinux
    context and can be removed
2.  The `empty_dir.go` file should be changed to create an `SELinuxRunner` and
    call its `GetLxcContexts` method to determine the right SELinux context to
    give `tmpfs` mounts
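
To illustrate the second item, the sketch below derives `tmpfs` mount options
from the file context returned by `GetLxcContexts`.  It rests on several
assumptions: `contextSource`, `fixedContexts`, and `tmpfsMountOptions` are
hypothetical names, the context value is a sample, and the use of the
`rootcontext=` mount option (documented in mount(8)) is one plausible way to
apply the label, not a statement of what `empty_dir.go` does.

```go
package main

import "fmt"

// contextSource is a hypothetical one-method subset of SELinuxRunner.
type contextSource interface {
	GetLxcContexts() (process string, file string)
}

// fixedContexts is a stub used only for this example.
type fixedContexts struct{ process, file string }

func (f fixedContexts) GetLxcContexts() (string, string) { return f.process, f.file }

// tmpfsMountOptions returns mount options that label the root of a tmpfs
// mount with the container file context, or nil when none is available
// (for example, when SELinux is disabled).
func tmpfsMountOptions(src contextSource) []string {
	_, fileContext := src.GetLxcContexts()
	if fileContext == "" {
		return nil
	}
	return []string{fmt.Sprintf("rootcontext=%q", fileContext)}
}

func main() {
	// Sample context; the actual value depends on the node's policy.
	src := fixedContexts{file: "system_u:object_r:svirt_sandbox_file_t:s0"}
	fmt.Println(tmpfsMountOptions(src))
}
```

The point of the design is that the label comes from the node's SELinux
policy rather than from whatever context the Kubelet directory happens to
carry.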

### Changes to `pkg/controller/...`

The `VolumeHost` abstraction is used in a couple of PV controllers as NOP
implementations.  These should be altered to no longer include `GetRootContext`.
