From 8b72dd9000c7315aa93143d79aea60d571a875f1 Mon Sep 17 00:00:00 2001
From: csrwng <cewong@redhat.com>
Date: Thu, 22 Jan 2015 09:32:30 -0500
Subject: [PATCH 1/2] [Proposal] Security Contexts

---
 docs/design/security_context.md | 158 ++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 docs/design/security_context.md

diff --git a/docs/design/security_context.md b/docs/design/security_context.md
new file mode 100644
index 00000000000..87d67aa7409
--- /dev/null
+++ b/docs/design/security_context.md
@@ -0,0 +1,158 @@
+# Security Contexts
+## Abstract
+A security context is a set of constraints that are applied to a container in order to achieve the following goals (from [security design](security.md)):
+
+1.  Ensure clear isolation between a container and the underlying host it runs on
+2.  Limit the ability of the container to negatively impact the infrastructure or other containers
+
+## Background
+
+The problem of securing containers in Kubernetes has come up [before](https://github.com/GoogleCloudPlatform/kubernetes/issues/398) and the potential problems with container security are [well known](http://opensource.com/business/14/7/docker-security-selinux). Although it is not possible to completely isolate Docker containers from their hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) make it possible to greatly reduce the attack surface.
+
+## Motivation
+
+### Container isolation
+
+In order to improve container isolation from the host and from other containers running on the host, containers should only be
+granted the access they need to perform their work. To this end, it should be possible to take advantage of Docker
+features such as the ability to [add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration)
+to the container process.
+
+Support for user namespaces has recently been [merged](https://github.com/docker/libcontainer/pull/304) into Docker's libcontainer project and should soon surface in Docker itself. It will make it possible to assign a range of unprivileged uids and gids from the host to each container, improving the isolation between host and container and between containers.
+
+### External integration with shared storage
+In order to support external integration with shared storage, processes running in a Kubernetes cluster
+should be uniquely identifiable by their Unix UID, such that a chain of ownership can be established.
+Processes in pods will need to have consistent UID/GID/SELinux category labels in order to access shared disks.
+
+## Constraints and Assumptions
+* It is out of the scope of this document to prescribe a specific set 
+  of constraints to isolate containers from their host. Different use cases need different
+  settings.
+* The concept of a security context should not be tied to a particular security mechanism or platform
+  (e.g. SELinux, AppArmor).
+* Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for
+  [service accounts](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297).
+
+## Use Cases
+
+In order of increasing complexity, the following are example use cases that would
+be addressed with security contexts:
+
+1.  Kubernetes is used to run a single cloud application. In order to protect
+    nodes from containers:
+    * All containers run as a single non-root user
+    * Privileged containers are disabled
+    * All containers run with a particular MCS label 
+    * Kernel capabilities like CHOWN and MKNOD are removed from containers
+    
+2.  As in case #1, except that more than one application runs on
+    the Kubernetes cluster.
+    * Each application is run in its own namespace to avoid name collisions
+    * For each application a different uid and MCS label is used
+    
+3.  Kubernetes is used as the base for a PaaS with
+    multiple projects, each project represented by a namespace.
+    * Each namespace is associated with a range of uids/gids on the node that
+      are mapped to uids/gids on containers using Linux user namespaces.
+    * Certain pods in each namespace have special privileges to perform system
+      actions such as talking back to the server for deployment, running Docker
+      builds, etc.
+    * External NFS storage is assigned to each namespace and permissions set
+      using the range of uids/gids assigned to that namespace. 
+
+## Proposed Design
+
+### Overview
+A *security context* consists of a set of constraints that determine how a container
+is secured before getting created and run. It has a 1:1 correspondence to a
+[service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). A *security context provider* is passed to the Kubelet so that it can
+mutate Docker API calls in order to apply the security context.
+
+It is recommended that this design be implemented in two phases:
+
+1.  Implement the security context provider extension point in the Kubelet 
+    so that a default security context can be applied on container run and creation.
+2.  Implement a security context structure that is part of a service account. The
+    default context provider can then be used to apply a security context based
+    on the service account associated with the pod.
+    
+### Security Context Provider
+
+The Kubelet will have an interface that points to a `SecurityContextProvider`. The `SecurityContextProvider` is invoked before creating and running a given container:
+
+```go
+type SecurityContextProvider interface {
+	// ModifyContainerConfig is called before the Docker createContainer call.
+	// The security context provider can make changes to the Config with which
+	// the container is created.
+	// An error is returned if it's not possible to secure the container as
+	// requested with a security context.
+	ModifyContainerConfig(pod *api.BoundPod, container *api.Container, config *docker.Config) error
+
+	// ModifyHostConfig is called before the Docker runContainer call.
+	// The security context provider can make changes to the HostConfig, affecting
+	// security options, whether the container is privileged, volume binds, etc.
+	// An error is returned if it's not possible to secure the container as requested
+	// with a security context.
+	ModifyHostConfig(pod *api.BoundPod, container *api.Container, hostConfig *docker.HostConfig)
+}
+```
+If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today.   
+
+### Security Context
+
+A security context has a 1:1 correspondence to a service account and it can be included as
+part of the service account resource. Following is an example of an initial implementation:
+
+```go
+type SecurityContext struct {
+    // user is the uid to use when running the container
+	User int
+	
+	// allowPrivileged indicates whether this context allows privileged mode containers
+	AllowPrivileged bool
+	
+	// allowedVolumeTypes lists the types of volumes that a container can bind
+	AllowedVolumeTypes []string
+	
+	// addCapabilities is the list of Linux kernel capabilities to add
+	AddCapabilities []string
+	
+	// removeCapabilities is the list of Linux kernel capabilities to remove
+	RemoveCapabilities []string
+	
+	// SELinux specific settings (optional)
+	SELinux *SELinuxContext
+	
+	// AppArmor specific settings (optional)
+	AppArmor *AppArmorContext
+	
+	// FUTURE:
+	// With Linux user namespace support, it should be possible to map
+	// a range of container uids/gids to arbitrary host uids/gids
+	// UserMappings []IDMapping
+	// GroupMappings []IDMapping
+}
+
+type SELinuxContext struct {
+    // MCS label/SELinux level to run the container under
+    Level string
+    
+    // SELinux type label for container processes
+    Type  string    
+    
+    // FUTURE:
+    // LabelVolumeMountsExclusive []Volume
+    // LabelVolumeMountsShared    []Volume
+}
+
+type AppArmorContext struct {
+	// AppArmor profile
+	Profile string
+}
+```
+
+#### Security Context Lifecycle
+ 
+The lifecycle of a security context will be tied to that of a service account. It is expected that a service account with a default security context will be created for every Kubernetes namespace (without administrator intervention). If resources need to be allocated when creating a security context (for example, assign a range of host uids/gids), a pattern such as [finalizers](https://github.com/GoogleCloudPlatform/kubernetes/issues/3585) can be used before declaring the security context / service account / namespace ready for use.
\ No newline at end of file

From 2b01746104211797ccd516bae59d996acd33caca Mon Sep 17 00:00:00 2001
From: csrwng <cewong@redhat.com>
Date: Mon, 9 Feb 2015 14:17:51 -0500
Subject: [PATCH 2/2] Specify intent for container isolation and add details
 for id mapping

---
 docs/design/security_context.md | 88 ++++++++++++++++++++++-----------
 1 file changed, 60 insertions(+), 28 deletions(-)

diff --git a/docs/design/security_context.md b/docs/design/security_context.md
index 87d67aa7409..400d30e97b3 100644
--- a/docs/design/security_context.md
+++ b/docs/design/security_context.md
@@ -98,6 +98,7 @@ type SecurityContextProvider interface {
 	ModifyHostConfig(pod *api.BoundPod, container *api.Container, hostConfig *docker.HostConfig)
 }
 ```
+
 If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today.   
 
 ### Security Context
@@ -106,53 +107,84 @@ A security context has a 1:1 correspondence to a service account and it can be i
 part of the service account resource. Following is an example of an initial implementation:
 
 ```go
+
+// SecurityContext specifies the security constraints associated with a service account
 type SecurityContext struct {
     // user is the uid to use when running the container
 	User int
 	
-	// allowPrivileged indicates whether this context allows privileged mode containers
+	// AllowPrivileged indicates whether this context allows privileged mode containers
 	AllowPrivileged bool
 	
-	// allowedVolumeTypes lists the types of volumes that a container can bind
+	// AllowedVolumeTypes lists the types of volumes that a container can bind
 	AllowedVolumeTypes []string
 	
-	// addCapabilities is the list of Linux kernel capabilities to add
+	// AddCapabilities is the list of Linux kernel capabilities to add
 	AddCapabilities []string
 	
-	// removeCapabilities is the list of Linux kernel capabilities to remove
+	// RemoveCapabilities is the list of Linux kernel capabilities to remove
 	RemoveCapabilities []string
 	
-	// SELinux specific settings (optional)
-	SELinux *SELinuxContext
-	
-	// AppArmor specific settings (optional)
-	AppArmor *AppArmorContext
-	
-	// FUTURE:
-	// With Linux user namespace support, it should be possible to map
-	// a range of container uids/gids to arbitrary host uids/gids
-	// UserMappings []IDMapping
-	// GroupMappings []IDMapping
+	// Isolation specifies the type of isolation required for containers 
+	// in this security context 
+	Isolation ContainerIsolationSpec
 }
 
-type SELinuxContext struct {
-    // MCS label/SELinux level to run the container under
-    Level string
-    
-    // SELinux type label for container processes
-    Type  string    
-    
-    // FUTURE:
-    // LabelVolumeMountsExclusive []Volume
-    // LabelVolumeMountsShared    []Volume
+// ContainerIsolationSpec indicates intent for container isolation
+type ContainerIsolationSpec struct {
+	// Type is the container isolation type (None, Private)
+	Type ContainerIsolationType
+	
+	// FUTURE: IDMapping specifies how users and groups from the host will be mapped
+	IDMapping *IDMapping
 }
 
-type AppArmorContext struct {
-	// AppArmor profile
-	Profile string
+// ContainerIsolationType is the type of container isolation for a security context
+type ContainerIsolationType string
+
+const (
+	// ContainerIsolationNone means that no additional constraints are added to
+	// containers to isolate them from their host
+	ContainerIsolationNone ContainerIsolationType = "None"
+	
+	// ContainerIsolationPrivate means that containers are isolated in process
+	// and storage from their host and other containers.
+	ContainerIsolationPrivate ContainerIsolationType = "Private"
+)
+
+// IDMapping specifies the requested user and group mappings for containers 
+// associated with a specific security context
+type IDMapping struct {
+	// SharedUsers is the set of user ranges that must be unique to the entire cluster
+	SharedUsers []IDMappingRange
+	
+	// SharedGroups is the set of group ranges that must be unique to the entire cluster
+	SharedGroups []IDMappingRange
+
+	// PrivateUsers are mapped to users on the host node, but are not necessarily
+	// unique to the entire cluster
+	PrivateUsers []IDMappingRange
+
+	// PrivateGroups are mapped to groups on the host node, but are not necessarily
+	// unique to the entire cluster
+	PrivateGroups []IDMappingRange
 }
+
+// IDMappingRange specifies a mapping between container IDs and node IDs
+type IDMappingRange struct {
+	// ContainerID is the starting container ID
+	ContainerID int
+
+	// HostID is the starting host ID
+	HostID int
+	
+	// Length is the length of the ID range
+	Length int
+}
+
 ```
 
+
 #### Security Context Lifecycle
  
 The lifecycle of a security context will be tied to that of a service account. It is expected that a service account with a default security context will be created for every Kubernetes namespace (without administrator intervention). If resources need to be allocated when creating a security context (for example, assign a range of host uids/gids), a pattern such as [finalizers](https://github.com/GoogleCloudPlatform/kubernetes/issues/3585) can be used before declaring the security context / service account / namespace ready for use.
\ No newline at end of file