Merge pull request #25899 from yujuhong/ncri

Automatic merge from submit-queue

Add a new container runtime interface

This PR includes a proposal and a Go file to re-define the container runtime interface.
This is based on the original doc: https://docs.google.com/document/d/1ietD5eavK0aTuMQTw6-21r67UU73_vqYSUIPFdA0J5Q/

The umbrella issue is #22964

/cc @kubernetes/sig-node
k8s-merge-robot 2016-07-01 16:55:44 -07:00 committed by GitHub
commit fb19362e01
3 changed files with 696 additions and 0 deletions


@ -0,0 +1,281 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
<!-- BEGIN STRIP_FOR_RELEASE -->
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--
<!-- END STRIP_FOR_RELEASE -->
<!-- END MUNGE: UNVERSIONED_WARNING -->
# Redefine Container Runtime Interface
The umbrella issue: [#22964](https://issues.k8s.io/22964)
## Motivation
Kubelet employs a declarative pod-level interface, which acts as the sole
integration point for container runtimes (e.g., `docker` and `rkt`). The
high-level, declarative interface has caused higher integration and maintenance
cost, and also slowed down feature velocity for the following reasons.
1. **Not every container runtime supports the concept of pods natively**.
When integrating with Kubernetes, a significant amount of work needs to
go into implementing a shim of significant size to support all pod
features. This also adds maintenance overhead (e.g., `docker`).
2. **High-level interface discourages code sharing and reuse among runtimes**.
E.g., each runtime today implements an all-encompassing `SyncPod()`
function, with the Pod Spec as the input argument. The runtime implements
logic to determine how to achieve the desired state based on the current
status, (re-)starts pods/containers and manages lifecycle hooks
accordingly.
3. **Pod Spec is evolving rapidly**. New features are being added constantly.
Any pod-level change or addition (e.g., init containers and volume
containers) requires changes to all container runtime shims.
## Goals and Non-Goals
The goals of defining the interface are to
- **improve extensibility**: Easier container runtime integration.
- **improve feature velocity**
- **improve code maintainability**
The non-goals include
- proposing *how* to integrate with new runtimes, i.e., where the shim
resides. The discussion of adopting a client-server architecture is tracked
by [#13768](https://issues.k8s.io/13768), where benefits and shortcomings of
such an architecture are discussed.
- versioning the new interface/API. We intend to provide API versioning to
offer stability for runtime integrations, but the details are beyond the
scope of this proposal.
- adding support for Windows containers. Windows container support is a
parallel effort and is tracked by [#22623](https://issues.k8s.io/22623).
The new interface will not be augmented to support Windows containers, but
it will be made extensible such that the support can be added in the future.
- re-defining Kubelet's internal interfaces. These interfaces, though they may
affect Kubelet's maintainability, are not relevant to runtime integration.
- improving Kubelet's efficiency or performance, e.g., adopting event stream
from the container runtime [#8756](https://issues.k8s.io/8756),
[#16831](https://issues.k8s.io/16831).
## Requirements
* Support the already integrated container runtimes: `docker` and `rkt`
* Support hypervisor-based container runtimes: `hyper`.
The existing pod-level interface will remain as it is in the near future to
ensure continued support for all existing runtimes. Meanwhile, we will
work with all parties involved to switch to the proposed interface.
## Container Runtime Interface
The main idea of this proposal is to adopt an imperative container-level
interface, which allows Kubelet to directly control the lifecycles of the
containers.
A pod is composed of a group of containers in an isolated environment with
resource constraints. In Kubernetes, a pod is also the smallest schedulable unit.
After a pod has been scheduled to the node, Kubelet will create the environment
for the pod, and add/update/remove containers in that environment to meet the
Pod Spec. To distinguish between the environment and the pod as a whole, we
will call the pod environment **PodSandbox.**
Container runtimes may interpret the PodSandbox concept differently based
on how they operate internally. For runtimes relying on a hypervisor, a sandbox
naturally represents a virtual machine. For others, it can be Linux namespaces.
In short, a PodSandbox should have the following features.
* **Isolation**: E.g., Linux namespaces or a full virtual machine, possibly
with additional security features.
* **Compute resource specifications**: A PodSandbox should implement pod-level
resource demands and restrictions.
*NOTE: The resource specification does not include externalized costs to
container setup that are not currently trackable as Pod constraints, e.g.,
filesystem setup, container image pulling, etc.*
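
One way to read "pod-level resource demands" is as the aggregate of the per-container limits, which is how the accompanying Go file describes the sandbox's `Resources` field. The sketch below sums container limits into a sandbox-level limit. The names `limits` and `sandboxLimits` are hypothetical, and plain integers stand in for Kubernetes resource quantities.

```go
package main

import "fmt"

// limits is a hypothetical, simplified resource limit: CPU in millicores and
// memory in bytes. The proposed interface uses resource quantities instead.
type limits struct {
	MilliCPU    int64
	MemoryBytes int64
}

// sandboxLimits returns the aggregate of all container limits, which a
// runtime could apply to the sandbox as a whole (e.g., to size a VM).
func sandboxLimits(containers []limits) limits {
	var total limits
	for _, c := range containers {
		total.MilliCPU += c.MilliCPU
		total.MemoryBytes += c.MemoryBytes
	}
	return total
}

func main() {
	const mib = 1024 * 1024
	containers := []limits{
		{MilliCPU: 500, MemoryBytes: 256 * mib},
		{MilliCPU: 250, MemoryBytes: 128 * mib},
	}
	total := sandboxLimits(containers)
	fmt.Printf("sandbox limit: %dm CPU, %d MiB memory\n", total.MilliCPU, total.MemoryBytes/mib)
}
```

A runtime that relies on a pod-level cgroup may not need these numbers at all, while a hypervisor-based runtime would likely need them when the sandbox is created.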
A container in a PodSandbox maps to an application in the Pod Spec. For Linux
containers, they are expected to share at least network and IPC namespaces,
with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615).
Below is an example of the proposed interfaces.
```go
// PodSandboxManager contains basic operations for sandbox.
type PodSandboxManager interface {
	Create(config *PodSandboxConfig) (string, error)
	Delete(id string) (string, error)
	List(filter PodSandboxFilter) []PodSandboxListItem
	Status(id string) PodSandboxStatus
}

// ContainerRuntime contains basic operations for containers.
type ContainerRuntime interface {
	Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error)
	Start(id string) error
	Stop(id string, timeout int) error
	Remove(id string) error
	List(filter ContainerFilter) ([]ContainerListItem, error)
	Status(id string) (ContainerStatus, error)
	Exec(id string, cmd []string, streamOpts StreamOptions) error
}

// ImageService contains image-related operations.
type ImageService interface {
	List() ([]Image, error)
	Pull(image ImageSpec, auth AuthConfig) error
	Remove(image ImageSpec) error
	Status(image ImageSpec) (Image, error)
	Metrics(image ImageSpec) (ImageMetrics, error)
}

// ContainerMetricsGetter contains container metrics operations.
type ContainerMetricsGetter interface {
	ContainerMetrics(id string) (ContainerMetrics, error)
}
```

All functions listed above are expected to be thread-safe.
### Pod/Container Lifecycle
The PodSandbox's lifecycle is decoupled from that of the containers, i.e., a sandbox
is created before any containers, and can exist after all containers in it have
terminated.
Assume there is a pod with a single container C. To start a pod:
```
create sandbox Foo --> create container C --> start container C
```
To delete a pod:
```
stop container C --> remove container C --> delete sandbox Foo
```
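
The two sequences above can be sketched against a toy in-memory runtime that only records the calls it receives. This is an illustration of the call order, not the Kubelet implementation: `fakeRuntime`, `startPod`, and `deletePod` are hypothetical names, and the single `do` method stands in for the separate interface methods.

```go
package main

import "fmt"

// fakeRuntime is a hypothetical in-memory stand-in for a container runtime.
// It only records the calls it receives, in order.
type fakeRuntime struct {
	calls []string
}

// do records one operation. A real runtime would block until the operation
// succeeds or return an error.
func (r *fakeRuntime) do(op, name string) error {
	r.calls = append(r.calls, op+" "+name)
	return nil
}

// startPod follows the start sequence: sandbox first, then each container.
func startPod(r *fakeRuntime, sandbox string, containers []string) error {
	if err := r.do("create sandbox", sandbox); err != nil {
		return err
	}
	for _, c := range containers {
		if err := r.do("create container", c); err != nil {
			return err
		}
		if err := r.do("start container", c); err != nil {
			return err
		}
	}
	return nil
}

// deletePod follows the delete sequence: containers first, sandbox last.
func deletePod(r *fakeRuntime, sandbox string, containers []string) error {
	for _, c := range containers {
		if err := r.do("stop container", c); err != nil {
			return err
		}
		if err := r.do("remove container", c); err != nil {
			return err
		}
	}
	return r.do("delete sandbox", sandbox)
}

func main() {
	r := &fakeRuntime{}
	if err := startPod(r, "Foo", []string{"C"}); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	if err := deletePod(r, "Foo", []string{"C"}); err != nil {
		fmt.Println("delete failed:", err)
		return
	}
	for _, call := range r.calls {
		fmt.Println(call)
	}
}
```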
The restart policy in the Pod Spec defines how individual containers should
be handled when they terminate. Kubelet is responsible for ensuring that the
restart policy is enforced. In other words, once Kubelet discovers that a
container has terminated (e.g., through `List()`), it will create and start a new
container if needed.
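
The restart decision itself can be sketched as a small pure function. The three policy values are the ones defined in the Pod Spec. The `exitedContainer` type and the `shouldRestart` helper are hypothetical and only illustrate the kind of check Kubelet would make after discovering a terminated container.

```go
package main

import "fmt"

// RestartPolicy mirrors the three restart policies in the Pod Spec.
type RestartPolicy string

const (
	RestartAlways    RestartPolicy = "Always"
	RestartOnFailure RestartPolicy = "OnFailure"
	RestartNever     RestartPolicy = "Never"
)

// exitedContainer is a hypothetical, trimmed-down view of a terminated
// container, as Kubelet might see it after listing containers.
type exitedContainer struct {
	Name     string
	ExitCode int
}

// shouldRestart reports whether a terminated container should be replaced
// by a newly created and started container.
func shouldRestart(policy RestartPolicy, c exitedContainer) bool {
	switch policy {
	case RestartAlways:
		return true
	case RestartOnFailure:
		return c.ExitCode != 0
	default: // RestartNever, or a policy this sketch does not know
		return false
	}
}

func main() {
	exited := []exitedContainer{
		{Name: "web", ExitCode: 0},
		{Name: "worker", ExitCode: 1},
	}
	for _, c := range exited {
		fmt.Printf("%s (exit %d): restart under OnFailure = %v\n",
			c.Name, c.ExitCode, shouldRestart(RestartOnFailure, c))
	}
}
```

Keeping this logic in Kubelet means a runtime only has to create and start containers on request, and never has to interpret the restart policy itself.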
Kubelet is also responsible for gracefully terminating all the containers
in the sandbox before deleting the sandbox. If Kubelet chooses to delete
the sandbox with running containers in it, those containers should be forcibly
deleted.
Note that every PodSandbox/container lifecycle operation (create, start,
stop, delete) should either return an error or block until the operation
succeeds. A successful operation should include a state transition of the
PodSandbox/container. E.g., if a `Create` call for a container does not
return an error, the container state should be "created" when the runtime is
queried.
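
A toy runtime makes this contract concrete: each call either completes its state transition before returning or returns an error. The `toyRuntime` type below is hypothetical, and the state names other than "created" (here "running" and "exited") are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

// state is a container state as reported by the toy runtime.
type state string

const (
	created state = "created"
	running state = "running"
	exited  state = "exited"
)

// toyRuntime is a hypothetical in-memory runtime. Every call completes its
// state transition before returning, or returns an error.
type toyRuntime struct {
	mu         sync.Mutex // calls are expected to be thread-safe
	containers map[string]state
	nextID     int
}

func newToyRuntime() *toyRuntime {
	return &toyRuntime{containers: make(map[string]state)}
}

// transition moves a container from one state to another, or fails.
func (r *toyRuntime) transition(id string, from, to state) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if got := r.containers[id]; got != from {
		return fmt.Errorf("container %q is %q, want %q", id, got, from)
	}
	r.containers[id] = to
	return nil
}

// Create registers a new container in the "created" state.
func (r *toyRuntime) Create() (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nextID++
	id := fmt.Sprintf("c%d", r.nextID)
	r.containers[id] = created
	return id, nil
}

func (r *toyRuntime) Start(id string) error { return r.transition(id, created, running) }

func (r *toyRuntime) Stop(id string) error { return r.transition(id, running, exited) }

// Status returns the current state, or an error for an unknown container.
func (r *toyRuntime) Status(id string) (state, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.containers[id]
	if !ok {
		return "", fmt.Errorf("no such container %q", id)
	}
	return s, nil
}

func main() {
	rt := newToyRuntime()
	id, _ := rt.Create()
	s, _ := rt.Status(id)
	fmt.Println("after Create:", s)
	if err := rt.Stop(id); err != nil {
		fmt.Println("Stop before Start fails:", err)
	}
}
```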
### Updates to PodSandbox or Containers
Kubernetes supports updates only to a very limited set of fields in the Pod
Spec. These updates may require containers to be re-created by Kubelet. This
can be achieved through the proposed, imperative container-level interface.
On the other hand, PodSandbox update currently is not required.
### Container Lifecycle Hooks
Kubernetes supports post-start and pre-stop lifecycle hooks, with ongoing
discussion for supporting pre-start and post-stop hooks in
[#140](https://issues.k8s.io/140).
These lifecycle hooks will be implemented by Kubelet via `Exec` calls to the
container runtime. This frees the runtimes from having to support hooks
natively.
Illustration of the container lifecycle and hooks:
```
            pre-start post-start    pre-stop post-stop
               |        |              |       |
              exec     exec           exec    exec
               |        |              |       |
 create --------> start ----------------> stop --------> remove
```
In order for the lifecycle hooks to function as expected, the `Exec` call
will need access to the container's filesystem (e.g., mount namespaces).
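
The hook flow can be sketched as a thin wrapper around `Start`, `Stop`, and `Exec`. The `hookRuntime` interface below is a trimmed-down, hypothetical subset of the proposed interface (its `Exec` omits stream options), and `hooks`, `startWithHook`, `stopWithHook`, and `logRuntime` are names invented for this example. Only the two hooks Kubernetes supports today are shown.

```go
package main

import (
	"fmt"
	"strings"
)

// hookRuntime is a hypothetical subset of the proposed container interface.
type hookRuntime interface {
	Start(id string) error
	Stop(id string, timeout int) error
	Exec(id string, cmd []string) error
}

// hooks holds the commands for the post-start and pre-stop hooks.
type hooks struct {
	PostStart []string
	PreStop   []string
}

// startWithHook starts the container, then runs the post-start hook via Exec.
func startWithHook(rt hookRuntime, id string, h hooks) error {
	if err := rt.Start(id); err != nil {
		return err
	}
	if len(h.PostStart) == 0 {
		return nil
	}
	return rt.Exec(id, h.PostStart)
}

// stopWithHook runs the pre-stop hook via Exec, then stops the container.
// Stopping even when the hook fails is a choice made for this sketch.
func stopWithHook(rt hookRuntime, id string, h hooks, timeout int) error {
	if len(h.PreStop) != 0 {
		if err := rt.Exec(id, h.PreStop); err != nil {
			fmt.Println("pre-stop hook failed:", err)
		}
	}
	return rt.Stop(id, timeout)
}

// logRuntime is a fake runtime that records every call it receives.
type logRuntime struct {
	log []string
}

func (l *logRuntime) Start(id string) error {
	l.log = append(l.log, "start "+id)
	return nil
}

func (l *logRuntime) Stop(id string, timeout int) error {
	l.log = append(l.log, "stop "+id)
	return nil
}

func (l *logRuntime) Exec(id string, cmd []string) error {
	l.log = append(l.log, "exec "+id+": "+strings.Join(cmd, " "))
	return nil
}

func main() {
	rt := &logRuntime{}
	h := hooks{PostStart: []string{"/bin/warm-cache"}, PreStop: []string{"/bin/drain"}}
	_ = startWithHook(rt, "C", h)
	_ = stopWithHook(rt, "C", h, 30)
	fmt.Println(strings.Join(rt.log, "\n"))
}
```

The hook commands (`/bin/warm-cache`, `/bin/drain`) are placeholders. The runtime sees them as ordinary `Exec` calls and needs no hook-specific support.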
### Extensibility
There are several dimensions for container runtime extensibility.
- Host OS (e.g., Linux)
- PodSandbox isolation mechanism (e.g., namespaces or VM)
- PodSandbox OS (e.g., Linux)
As mentioned previously, this proposal will only address the Linux-based
PodSandbox and containers. All Linux-specific configuration will be grouped
into one field. A container runtime is required to enforce all configuration
applicable to its platform, and should return an error otherwise.
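
The "enforce or return an error" rule can be sketched as a pre-flight check on the sandbox config. The types below are trimmed-down copies of the proposed config types. The `checkConfig` helper and the premise that this imaginary runtime cannot share the host PID namespace are assumptions made purely for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// NamespaceOptions is a trimmed-down copy of the proposed type.
type NamespaceOptions struct {
	HostNetwork bool
	HostPID     bool
	HostIPC     bool
}

// LinuxPodSandboxConfig groups all Linux-specific sandbox settings.
type LinuxPodSandboxConfig struct {
	CgroupParent     string
	NamespaceOptions NamespaceOptions
}

// PodSandboxConfig keeps platform-specific settings in a single field.
type PodSandboxConfig struct {
	Name  string
	Linux *LinuxPodSandboxConfig
}

var errUnsupported = errors.New("unsupported sandbox configuration")

// checkConfig is a hypothetical check for an imaginary runtime that cannot
// share the host PID namespace. It rejects the config instead of silently
// ignoring the setting.
func checkConfig(c *PodSandboxConfig) error {
	if c.Linux == nil {
		return nil // nothing platform-specific to enforce
	}
	if c.Linux.NamespaceOptions.HostPID {
		return fmt.Errorf("%w: HostPID requested by sandbox %q", errUnsupported, c.Name)
	}
	return nil
}

func main() {
	cfg := &PodSandboxConfig{
		Name:  "Foo",
		Linux: &LinuxPodSandboxConfig{NamespaceOptions: NamespaceOptions{HostPID: true}},
	}
	if err := checkConfig(cfg); err != nil {
		fmt.Println("create rejected:", err)
	}
}
```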
### Keep it minimal
The proposed interface is experimental, i.e., it will go through (many) changes
until it stabilizes. The principle is to keep the interface minimal and
extend it later if needed. This applies to several features that are still under
discussion and may be achieved by other means:
* `AttachContainer`: [#23335](https://issues.k8s.io/23335)
* `PortForward`: [#25113](https://issues.k8s.io/25113)
## Alternatives
**[Status quo] Declarative pod-level interface**
- Pros: No changes needed.
- Cons: All the issues stated in #motivation
**Allow integration at both pod- and container-level interfaces**
- Pros: Flexibility.
- Cons: All the issues stated in #motivation
**Imperative pod-level interface**
The interface contains only CreatePod(), StartPod(), StopPod() and RemovePod().
This implies that the runtime needs to take over container lifecycle
management (i.e., enforce restart policy), lifecycle hooks, liveness checks,
etc. Kubelet will mainly be responsible for interfacing with the apiserver, and
can potentially become a very thin daemon.
- Pros: Lower maintenance overhead for the Kubernetes maintainers if `Docker`
shim maintenance cost is discounted.
- Cons: This will incur higher integration cost because every new container
runtime needs to implement all the features and needs to understand the
concept of pods. This would also lead to lower feature velocity because the
interface will need to be changed, and the new pod-level feature will need
to be supported in each runtime.
## Related Issues
* Metrics: [#27097](https://issues.k8s.io/27097)
* Log management: [#24677](https://issues.k8s.io/24677)
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/container-runtime-interface-v1.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@ -0,0 +1,413 @@
/*
Copyright 2016 The Kubernetes Authors.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package container
import (
"io"
"k8s.io/kubernetes/pkg/api"
"k8s.io/kubernetes/pkg/api/resource"
"k8s.io/kubernetes/pkg/api/unversioned"
)
// PodSandboxID is the identifier of a PodSandbox, as returned by Create.
type PodSandboxID string
// PodSandboxManager provides basic operations to create/delete and examine the
// PodSandboxes. These methods should either return an error or block until the
// operation succeeds.
type PodSandboxManager interface {
// Create creates a sandbox based on the given config, and returns the
// ID of the new sandbox.
Create(config *PodSandboxConfig) (PodSandboxID, error)
// Stop stops the sandbox by its ID. If there are any running
// containers in the sandbox, they will be terminated as a side-effect.
Stop(id PodSandboxID) error
// Delete deletes the sandbox by its ID. If there are any running
// containers in the sandbox, they will be deleted as a side-effect.
Delete(id PodSandboxID) error
// List lists existing sandboxes, filtered by the given PodSandboxFilter.
List(filter PodSandboxFilter) ([]PodSandboxListItem, error)
// Status gets the status of the sandbox by ID.
Status(id PodSandboxID) (PodSandboxStatus, error)
}
// PodSandboxConfig holds all the required and optional fields for creating a
// sandbox.
type PodSandboxConfig struct {
// Name is the name of the sandbox. The string should conform to
// [a-zA-Z0-9_-]+.
Name string
// Hostname is the hostname of the sandbox.
Hostname string
// DNSOptions sets the DNS options for the sandbox.
DNSOptions DNSOptions
// PortMappings lists the port mappings for the sandbox.
PortMappings []PortMapping
// Resources specifies the resource limits for the sandbox (i.e., the
// aggregate cpu/memory resources limits of all containers).
// Note: On a Linux host, kubelet will create a pod-level cgroup and pass
// it as the cgroup parent for the PodSandbox. For some runtimes, this is
// sufficient. For others, e.g., hypervisor-based runtimes, explicit
// resource limits for the sandbox are needed at creation time.
Resources PodSandboxResources
// Path to the directory on the host in which container log files are
// stored.
// By default, the log of a container going into the LogDirectory will be
// hooked up to STDOUT and STDERR. However, the LogDirectory may contain
// binary log files with structured logging data from the individual
// containers. For example, the files might be newline-separated JSON
// structured logs, systemd-journald journal files, gRPC trace files, etc.
// E.g.,
// PodSandboxConfig.LogDirectory = `/var/log/pods/<podUID>/`
// ContainerConfig.LogPath = `containerName_Instance#.log`
//
// WARNING: Log management and how kubelet should interface with the
// container logs are under active discussion in
// https://issues.k8s.io/24677. There *may* be future change of direction
// for logging as the discussion carries on.
LogDirectory string
// Labels are key value pairs that may be used to scope and select
// individual resources.
Labels Labels
// Annotations is an unstructured key value map that may be set by external
// tools to store and retrieve arbitrary metadata.
Annotations map[string]string
// Linux contains configurations specific to Linux hosts.
Linux *LinuxPodSandboxConfig
}
// Labels are key value pairs that may be used to scope and select individual
// resources.
// Label keys are of the form:
// label-key ::= prefixed-name | name
// prefixed-name ::= prefix '/' name
// prefix ::= DNS_SUBDOMAIN
// name ::= DNS_LABEL
type Labels map[string]string
// LinuxPodSandboxConfig holds platform-specific configurations for Linux
// host platforms and Linux-based containers.
type LinuxPodSandboxConfig struct {
// CgroupParent is the parent cgroup of the sandbox. The cgroupfs style
// syntax will be used, but the container runtime can convert it to systemd
// semantics if needed.
CgroupParent string
// NamespaceOptions contains configurations for the sandbox's namespaces.
// This will be used only if the PodSandbox uses namespaces for isolation.
NamespaceOptions NamespaceOptions
}
// NamespaceOptions provides options for Linux namespaces.
type NamespaceOptions struct {
// HostNetwork uses the host's network namespace.
HostNetwork bool
// HostPID uses the host's pid namespace.
HostPID bool
// HostIPC uses the host's ipc namespace.
HostIPC bool
}
// DNSOptions specifies the DNS servers and search domains.
type DNSOptions struct {
// Servers is a list of DNS servers of the cluster.
Servers []string
// Searches is a list of DNS search domains of the cluster.
Searches []string
}
// PodSandboxState describes whether a sandbox is functioning properly.
type PodSandboxState string
const (
// PodSandboxReady means the sandbox is functioning properly.
PodSandboxReady PodSandboxState = "Ready"
// PodSandboxNotReady means the sandbox is not functioning properly.
PodSandboxNotReady PodSandboxState = "NotReady"
)
// PodSandboxFilter is used to filter a list of PodSandboxes.
type PodSandboxFilter struct {
// Name of the sandbox.
Name *string
// ID of the sandbox.
ID *PodSandboxID
// State of the sandbox.
State *PodSandboxState
// LabelSelector to select matches.
// Only api.MatchLabels is supported for now and the requirements
// are ANDed. MatchExpressions is not supported yet.
LabelSelector unversioned.LabelSelector
}
// PodSandboxListItem contains minimal information about a sandbox.
type PodSandboxListItem struct {
ID PodSandboxID
State PodSandboxState
// Labels are key value pairs that may be used to scope and select individual resources.
Labels Labels
}
// PodSandboxStatus contains the status of the PodSandbox.
type PodSandboxStatus struct {
// ID of the sandbox.
ID PodSandboxID
// State of the sandbox.
State PodSandboxState
// Network contains network status if network is handled by the runtime.
Network *PodSandboxNetworkStatus
// Status specific to a Linux sandbox.
Linux *LinuxPodSandboxStatus
// Labels are key value pairs that may be used to scope and select individual resources.
Labels Labels
// Annotations is an unstructured key value map.
Annotations map[string]string
}
// PodSandboxNetworkStatus is the status of the network for a PodSandbox.
type PodSandboxNetworkStatus struct {
IPs []string
}
// Namespaces contains paths to the namespaces.
type Namespaces struct {
// Network is the path to the network namespace.
Network string
}
// LinuxPodSandboxStatus contains status specific to Linux sandboxes.
type LinuxPodSandboxStatus struct {
// Namespaces contains paths to the sandbox's namespaces.
Namespaces *Namespaces
}
// PodSandboxResources contains the CPU/memory resource requirements.
type PodSandboxResources struct {
// CPU resource requirement.
CPU resource.Quantity
// Memory resource requirement.
Memory resource.Quantity
}
// RawContainerID is distinct from the existing ContainerID type, which includes a
// runtime type prefix (e.g., docker://). We may rename this later.
type RawContainerID string
// ContainerRuntime provides methods for container lifecycle operations, as
// well as listing or inspecting existing containers. These methods should
// either return an error or block until the operation succeeds.
type ContainerRuntime interface {
// Create creates a container in the sandbox, and returns the ID
// of the created container.
Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, sandboxID PodSandboxID) (RawContainerID, error)
// Start starts a created container.
Start(id RawContainerID) error
// Stop stops a running container with a grace period (i.e., timeout).
Stop(id RawContainerID, timeout int) error
// Remove removes the container.
Remove(id RawContainerID) error
// List lists the existing containers that match the ContainerFilter.
// The returned list should only include containers previously created
// by this ContainerRuntime.
List(filter ContainerFilter) ([]ContainerListItem, error)
// Status returns the status of the container.
Status(id RawContainerID) (RawContainerStatus, error)
// Exec executes a command in the container.
Exec(id RawContainerID, cmd []string, streamOpts StreamOptions) error
}
// ContainerListItem provides the runtime information for a container returned
// by List().
type ContainerListItem struct {
// The ID of the container, used by the container runtime to identify
// a container.
ID ContainerID
// The name of the container, which should be the same as specified by
// api.Container.
Name string
// Reference to the image in use. For most runtimes, this should be an
// image ID.
ImageRef string
// State is the state of the container.
State ContainerState
// Labels are key value pairs that may be used to scope and select individual resources.
Labels Labels
}
// ContainerConfig holds all the required and optional fields for creating a
// container.
type ContainerConfig struct {
// Name of the container. The string should conform to [a-zA-Z0-9_-]+.
Name string
// Image to use.
Image ImageSpec
// Command to execute (i.e., entrypoint for docker)
Command []string
// Args for the Command (i.e., command for docker)
Args []string
// Current working directory of the command.
WorkingDir string
// List of environment variables to set in the container
Env []KeyValue
// Mounts specifies mounts for the container
Mounts []Mount
// Labels are key value pairs that may be used to scope and select individual resources.
Labels Labels
// Annotations is an unstructured key value map that may be set by external
// tools to store and retrieve arbitrary metadata.
Annotations map[string]string
// Privileged runs the container in the privileged mode.
Privileged bool
// ReadOnlyRootFS sets the root filesystem of the container to be
// read-only.
ReadOnlyRootFS bool
// Path relative to PodSandboxConfig.LogDirectory for container to store
// the log (STDOUT and STDERR) on the host.
// E.g.,
// PodSandboxConfig.LogDirectory = `/var/log/pods/<podUID>/`
// ContainerConfig.LogPath = `containerName_Instance#.log`
//
// WARNING: Log management and how kubelet should interface with the
// container logs are under active discussion in
// https://issues.k8s.io/24677. There *may* be future change of direction
// for logging as the discussion carries on.
LogPath string
// Variables for interactive containers, these have very specialized
// use-cases (e.g. debugging).
// TODO: Determine if we need to continue supporting these fields that are
// part of Kubernetes's Container Spec.
STDIN bool
STDINONCE bool
TTY bool
// Linux contains configuration specific to Linux containers.
Linux *LinuxContainerConfig
}
// RawContainerStatus represents the status of a container.
type RawContainerStatus struct {
// ID of the container.
ID ContainerID
// Name of the container.
Name string
// State of the container.
State ContainerState
// Creation time of the container.
CreatedAt unversioned.Time
// Start time of the container.
StartedAt unversioned.Time
// Finish time of the container.
FinishedAt unversioned.Time
// Exit code of the container.
ExitCode int
// Reference to the image in use. For most runtimes, this should be an
// image ID.
ImageRef string
// Labels are key value pairs that may be used to scope and select individual resources.
Labels Labels
// Annotations is an unstructured key value map.
Annotations map[string]string
// A brief CamelCase string explaining why the container is in its current state.
Reason string
}
// LinuxContainerConfig contains platform-specific configuration for
// Linux-based containers.
type LinuxContainerConfig struct {
// Resources specification for the container.
Resources *LinuxContainerResources
// Capabilities to add or drop.
Capabilities *api.Capabilities
// SELinux is the SELinux context to be applied.
SELinux *api.SELinuxOptions
// TODO: Add support for seccomp.
}
// LinuxContainerResources specifies Linux specific configuration for
// resources.
// TODO: Consider using Resources from opencontainers/runtime-spec/specs-go
// directly.
type LinuxContainerResources struct {
// CPU CFS (Completely Fair Scheduler) period
CPUPeriod *int64
// CPU CFS (Completely Fair Scheduler) quota
CPUQuota *int64
// CPU shares (relative weight vs. other containers)
CPUShares *int64
// Memory limit in bytes
MemoryLimitInBytes *int64
// OOMScoreAdj adjusts the oom-killer score.
OOMScoreAdj *int64
}
// ContainerFilter is used to filter containers.
type ContainerFilter struct {
// Name of the container.
Name *string
// ID of the container.
ID *RawContainerID
// State of the container.
State *ContainerState
// ID of the PodSandbox.
PodSandboxID *PodSandboxID
// LabelSelector to select matches.
// Only api.MatchLabels is supported for now and the requirements
// are ANDed. MatchExpressions is not supported yet.
LabelSelector unversioned.LabelSelector
}
// StreamOptions specifies the streams and TTY setting used by Exec.
type StreamOptions struct {
TTY bool
InputStream io.Reader
OutputStream io.Writer
ErrorStream io.Writer
}
// KeyValue represents a key-value pair.
type KeyValue struct {
Key string
Value string
}
// ImageService offers basic image operations.
type ImageService interface {
// List lists the existing images.
List() ([]Image, error)
// Pull pulls an image with authentication config. The PodSandboxConfig is
// passed so that the image service can charge the resources used for
// pulling to a specific pod.
Pull(image ImageSpec, auth AuthConfig, sandboxConfig *PodSandboxConfig) error
// Remove removes an image.
Remove(image ImageSpec) error
// Status returns the status of an image.
Status(image ImageSpec) (Image, error)
}
// AuthConfig contains authorization information for connecting to a registry.
// TODO: This is copied from docker's AuthConfig. We should re-evaluate to
// support other registries.
type AuthConfig struct {
Username string
Password string
Auth string
ServerAddress string
// IdentityToken is used to authenticate the user and get
// an access token for the registry.
IdentityToken string
// RegistryToken is a bearer token to be sent to a registry
RegistryToken string
}
// TODO: Add ContainerMetricsGetter and ImageMetricsGetter.


@ -233,6 +233,8 @@ const (
ContainerStateExited ContainerState = "exited"
// This unknown encompasses all the states that we currently don't care.
ContainerStateUnknown ContainerState = "unknown"
// Not in use yet.
ContainerStateCreated ContainerState = "created"
)
// Container provides the runtime information for a container, such as ID, hash,