diff --git a/docs/proposals/container-runtime-interface-v1.md b/docs/proposals/container-runtime-interface-v1.md new file mode 100644 index 00000000000..d8378e18f17 --- /dev/null +++ b/docs/proposals/container-runtime-interface-v1.md @@ -0,0 +1,281 @@ + + + + +WARNING +WARNING +WARNING +WARNING +WARNING + +

PLEASE NOTE: This document applies to the HEAD of the source tree

+ +If you are using a released version of Kubernetes, you should +refer to the docs that go with that version. + +Documentation for other releases can be found at +[releases.k8s.io](http://releases.k8s.io). + +-- + + + + + +# Redefine Container Runtime Interface + +The umbrella issue: [#22964](https://issues.k8s.io/22964) + +## Motivation + +Kubelet employs a declarative pod-level interface, which acts as the sole +integration point for container runtimes (e.g., `docker` and `rkt`). The +high-level, declarative interface has caused higher integration and maintenance +cost, and also slowed down feature velocity for the following reasons. + 1. **Not every container runtime supports the concept of pods natively**. + When integrating with Kubernetes, a significant amount of work needs to + go into implementing a shim of significant size to support all pod + features. This also adds maintenance overhead (e.g., `docker`). + 2. **High-level interface discourages code sharing and reuse among runtimes**. + E.g., each runtime today implements an all-encompassing `SyncPod()` + function, with the Pod Spec as the input argument. The runtime implements + logic to determine how to achieve the desired state based on the current + status, (re-)starts pods/containers and manages lifecycle hooks + accordingly. + 3. **Pod Spec is evolving rapidly**. New features are being added constantly. + Any pod-level change or addition requires changing all container + runtime shims. E.g., init containers and volume containers. + +## Goals and Non-Goals + +The goals of defining the interface are to + - **improve extensibility**: Easier container runtime integration. + - **improve feature velocity** + - **improve code maintainability** + +The non-goals include + - proposing *how* to integrate with new runtimes, i.e., where the shim + resides. 
The discussion of adopting a client-server architecture is tracked + by [#13768](https://issues.k8s.io/13768), where benefits and shortcomings of + such an architecture are discussed. + - versioning the new interface/API. We intend to provide API versioning to + offer stability for runtime integrations, but the details are beyond the + scope of this proposal. + - adding support for Windows containers. Windows container support is a + parallel effort and is tracked by [#22623](https://issues.k8s.io/22623). + The new interface will not be augmented to support Windows containers, but + it will be made extensible such that the support can be added in the future. + - re-defining Kubelet's internal interfaces. These interfaces, though they may + affect Kubelet's maintainability, are not relevant to runtime integration. + - improving Kubelet's efficiency or performance, e.g., adopting event stream + from the container runtime [#8756](https://issues.k8s.io/8756), + [#16831](https://issues.k8s.io/16831). + +## Requirements + + * Support the already integrated container runtimes: `docker` and `rkt`. + * Support hypervisor-based container runtimes: `hyper`. + +The existing pod-level interface will remain as it is in the near future to +ensure that support for all existing runtimes continues. Meanwhile, we will +work with all parties involved to switch to the proposed interface. + + +## Container Runtime Interface + +The main idea of this proposal is to adopt an imperative container-level +interface, which allows Kubelet to directly control the lifecycles of the +containers. + +A pod is composed of a group of containers in an isolated environment with +resource constraints. In Kubernetes, a pod is also the smallest schedulable unit. +After a pod has been scheduled to the node, Kubelet will create the environment +for the pod, and add/update/remove containers in that environment to meet the +Pod Spec. 
To distinguish between the environment and the pod as a whole, we +will call the pod environment **PodSandbox.** + +The container runtimes may interpret the PodSandbox concept differently based +on how they operate internally. For runtimes relying on a hypervisor, a sandbox +naturally represents a virtual machine. For others, it can be Linux namespaces. + +In short, a PodSandbox should have the following features. + + * **Isolation**: E.g., Linux namespaces or a full virtual machine, or even + support additional security features. + * **Compute resource specifications**: A PodSandbox should implement pod-level + resource demands and restrictions. + +*NOTE: The resource specification does not include externalized costs to +container setup that are not currently trackable as Pod constraints, e.g., +filesystem setup, container image pulling, etc.* + +A container in a PodSandbox maps to an application in the Pod Spec. For Linux +containers, they are expected to share at least network and IPC namespaces, +with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615). + + +Below is an example of the proposed interfaces. + +```go +// PodSandboxManager contains basic operations for sandboxes. +type PodSandboxManager interface { + Create(config *PodSandboxConfig) (string, error) + Delete(id string) (string, error) + List(filter PodSandboxFilter) []PodSandboxListItem + Status(id string) PodSandboxStatus +} + +// ContainerRuntime contains basic operations for containers. +type ContainerRuntime interface { + Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error) + Start(id string) error + Stop(id string, timeout int) error + Remove(id string) error + List(filter ContainerFilter) ([]ContainerListItem, error) + Status(id string) (ContainerStatus, error) + Exec(id string, cmd []string, streamOpts StreamOptions) error +} + +// ImageService contains image-related operations. 
+type ImageService interface { + List() ([]Image, error) + Pull(image ImageSpec, auth AuthConfig) error + Remove(image ImageSpec) error + Status(image ImageSpec) (Image, error) + Metrics(image ImageSpec) (ImageMetrics, error) +} + +type ContainerMetricsGetter interface { + ContainerMetrics(id string) (ContainerMetrics, error) +} +``` + +All functions listed above are expected to be thread-safe. + +### Pod/Container Lifecycle + +The PodSandbox’s lifecycle is decoupled from the containers, i.e., a sandbox +is created before any containers, and can exist after all containers in it have +terminated. + +Assume there is a pod with a single container C. To start a pod: + +``` + create sandbox Foo --> create container C --> start container C +``` + +To delete a pod: + +``` + stop container C --> remove container C --> delete sandbox Foo +``` + +The restart policy in the Pod Spec defines how individual containers should +be handled when they terminate. Kubelet is responsible for ensuring that the +restart policy is enforced. In other words, once Kubelet discovers that a +container has terminated (e.g., through `List()`), it will create and start a new +container if needed. + +Kubelet is also responsible for gracefully terminating all the containers +in the sandbox before deleting the sandbox. If Kubelet chooses to delete +the sandbox with running containers in it, those containers should be forcibly +deleted. + +Note that every PodSandbox/container lifecycle operation (create, start, +stop, delete) should either return an error or block until the operation +succeeds. A successful operation should include a state transition of the +PodSandbox/container. E.g., if a `Create` call for a container does not +return an error, the container state should be "created" when the runtime is +queried. + +### Updates to PodSandbox or Containers + +Kubernetes supports updates only to a very limited set of fields in the Pod +Spec. These updates may require containers to be re-created by Kubelet. 
This +can be achieved through the proposed, imperative container-level interface. +On the other hand, PodSandbox updates are currently not required. + + +### Container Lifecycle Hooks + +Kubernetes supports post-start and pre-stop lifecycle hooks, with ongoing +discussion for supporting pre-start and post-stop hooks in +[#140](https://issues.k8s.io/140). + +These lifecycle hooks will be implemented by Kubelet via `Exec` calls to the +container runtime. This frees the runtimes from having to support hooks +natively. + +Illustration of the container lifecycle and hooks: + +``` + pre-start post-start pre-stop post-stop + | | | | + exec exec exec exec + | | | | + create --------> start ----------------> stop --------> remove +``` + +In order for the lifecycle hooks to function as expected, the `Exec` call +will need access to the container's filesystem (e.g., mount namespaces). + +### Extensibility + +There are several dimensions for container runtime extensibility. + - Host OS (e.g., Linux) + - PodSandbox isolation mechanism (e.g., namespaces or VM) + - PodSandbox OS (e.g., Linux) + +As mentioned previously, this proposal will only address the Linux-based +PodSandbox and containers. All Linux-specific configuration will be grouped +into one field. A container runtime is required to enforce all configuration +applicable to its platform, and should return an error otherwise. + +### Keep it minimal + +The proposed interface is experimental, i.e., it will go through (many) changes +until it stabilizes. The principle is to keep the interface minimal and +extend it later if needed. This includes several features that are still in +discussion and may be achieved alternatively: + + * `AttachContainer`: [#23335](https://issues.k8s.io/23335) + * `PortForward`: [#25113](https://issues.k8s.io/25113) + +## Alternatives + +**[Status quo] Declarative pod-level interface** + - Pros: No changes needed. 
+ - Cons: All the issues stated in #motivation + +**Allow integration at both pod- and container-level interfaces** + - Pros: Flexibility. + - Cons: All the issues stated in #motivation + +**Imperative pod-level interface** +The interface contains only CreatePod(), StartPod(), StopPod() and RemovePod(). +This implies that the runtime needs to take over container lifecycle +management (i.e., enforce restart policy), lifecycle hooks, liveness checks, +etc. Kubelet will mainly be responsible for interfacing with the apiserver, and +can potentially become a very thin daemon. + - Pros: Lower maintenance overhead for the Kubernetes maintainers if `docker` + shim maintenance cost is discounted. + - Cons: This will incur higher integration cost because every new container + runtime needs to implement all the features and to understand the + concept of pods. This would also lead to lower feature velocity because the + interface will need to be changed, and the new pod-level feature will need + to be supported in each runtime. + +## Related Issues + + * Metrics: [#27097](https://issues.k8s.io/27097) + * Log management: [#24677](https://issues.k8s.io/24677) + + + +[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/container-runtime-interface-v1.md?pixel)]() + diff --git a/pkg/kubelet/container/interface.go b/pkg/kubelet/container/interface.go new file mode 100644 index 00000000000..198545ecd20 --- /dev/null +++ b/pkg/kubelet/container/interface.go @@ -0,0 +1,413 @@ +/* +Copyright 2016 The Kubernetes Authors. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License. +*/ + +package container + +import ( + "io" + + "k8s.io/kubernetes/pkg/api" + "k8s.io/kubernetes/pkg/api/resource" + "k8s.io/kubernetes/pkg/api/unversioned" +) + +type PodSandboxID string + +// PodSandboxManager provides basic operations to create/delete and examine the +// PodSandboxes. These methods should either return an error or block until the +// operation succeeds. +type PodSandboxManager interface { + // Create creates a sandbox based on the given config, and returns the + // ID of the new sandbox. + Create(config *PodSandboxConfig) (PodSandboxID, error) + // Stop stops the sandbox by its ID. If there are any running + // containers in the sandbox, they will be terminated as a side-effect. + Stop(id PodSandboxID) error + // Delete deletes the sandbox by its ID. If there are any running + // containers in the sandbox, they will be deleted as a side-effect. + Delete(id PodSandboxID) error + // List lists existing sandboxes, filtered by the given PodSandboxFilter. + List(filter PodSandboxFilter) ([]PodSandboxListItem, error) + // Status gets the status of the sandbox by ID. + Status(id PodSandboxID) (PodSandboxStatus, error) +} + +// PodSandboxConfig holds all the required and optional fields for creating a +// sandbox. +type PodSandboxConfig struct { + // Name is the name of the sandbox. The string should conform to + // [a-zA-Z0-9_-]+. + Name string + // Hostname is the hostname of the sandbox. + Hostname string + // DNSOptions sets the DNS options for the sandbox. + DNSOptions DNSOptions + // PortMappings lists the port mappings for the sandbox. + PortMappings []PortMapping + // Resources specifies the resource limits for the sandbox (i.e., the + // aggregate cpu/memory resource limits of all containers). + // Note: On a Linux host, kubelet will create a pod-level cgroup and pass + // it as the cgroup parent for the PodSandbox. 
For some runtimes, this is 
For some runtimes, this is + // sufficent. For others, e.g., hypervisor-based runtimes, explicit + // resource limits for the sandbox are needed at creation time. + Resources PodSandboxResources + // Path to the directory on the host in which container log files are + // stored. + // By default the Log of a container going into the LogDirectory will be + // hooked up to STDOUT and STDERR. However, the LogDirectory may contain + // binary log files with structured logging data from the individual + // containers. For example the files might be newline seperated JSON + // structured logs, systemd-journald journal files, gRPC trace files, etc. + // E.g., + // PodSandboxConfig.LogDirectory = `/var/log/pods//` + // ContainerConfig.LogPath = `containerName_Instance#.log` + // + // WARNING: Log managment and how kubelet should interface with the + // container logs are under active discussion in + // https://issues.k8s.io/24677. There *may* be future change of direction + // for logging as the discussion carries on. + LogDirectory string + // Labels are key value pairs that may be used to scope and select + // individual resources. + Labels Labels + // Annotations is an unstructured key value map that may be set by external + // tools to store and retrieve arbitrary metadata. + Annotations map[string]string + + // Linux contains configurations specific to Linux hosts. + Linux *LinuxPodSandboxConfig +} + +// Labels are key value pairs that may be used to scope and select individual +// resources. +// Label keys are of the form: +// label-key ::= prefixed-name | name +// prefixed-name ::= prefix '/' name +// prefix ::= DNS_SUBDOMAIN +// name ::= DNS_LABEL +type Labels map[string]string + +// LinuxPodSandboxConfig holds platform-specific configuraions for Linux +// host platforms and Linux-based containers. +type LinuxPodSandboxConfig struct { + // CgroupParent is the parent cgroup of the sandbox. 
The cgroupfs style + // syntax will be used, but the container runtime can convert it to systemd + // semantices if needed. + CgroupParent string + // NamespaceOptions contains configurations for the sandbox's namespaces. + // This will be used only if the PodSandbox uses namespace for isolation. + NamespaceOptions NamespaceOptions +} + +// NamespaceOptions provides options for Linux namespaces. +type NamespaceOptions struct { + // HostNetwork uses the host's network namespace. + HostNetwork bool + // HostPID uses the host's pid namesapce. + HostPID bool + // HostIPC uses the host's ipc namespace. + HostIPC bool +} + +// DNSOptions specifies the DNS servers and search domains. +type DNSOptions struct { + // Servers is a list of DNS servers of the cluster. + Servers []string + // Searches is a list of DNS search domains of the cluster. + Searches []string +} + +type PodSandboxState string + +const ( + // PodSandboxReady means the sandbox is functioning properly. + PodSandboxReady PodSandboxState = "Ready" + // PodSandboxInNotReady means the sandbox is not functioning properly. + PodSandboxNotReady PodSandboxState = "NotReady" +) + +// PodSandboxFilter is used to filter a list of PodSandboxes. +type PodSandboxFilter struct { + // Name of the sandbox. + Name *string + // ID of the sandbox. + ID *PodSandboxID + // State of the sandbox. + State *PodSandboxState + // LabelSelector to select matches. + // Only api.MatchLabels is supported for now and the requirements + // are ANDed. MatchExpressions is not supported yet. + LabelSelector unversioned.LabelSelector +} + +// PodSandboxListItem contains minimal information about a sandbox. +type PodSandboxListItem struct { + ID PodSandboxID + State PodSandboxState + // Labels are key value pairs that may be used to scope and select individual resources. + Labels Labels +} + +// PodSandboxStatus contains the status of the PodSandbox. +type PodSandboxStatus struct { + // ID of the sandbox. 
+ ID PodSandboxID + // State of the sandbox. + State PodSandboxState + // Network contains network status if network is handled by the runtime. + Network *PodSandboxNetworkStatus + // Status specific to a Linux sandbox. + Linux *LinuxPodSandboxStatus + // Labels are key value pairs that may be used to scope and select individual resources. + Labels Labels + // Annotations is an unstructured key value map. + Annotations map[string]string +} + +// PodSandboxNetworkStatus is the status of the network for a PodSandbox. +type PodSandboxNetworkStatus struct { + IPs []string +} + +// Namespaces contains paths to the namespaces. +type Namespaces struct { + // Network is the path to the network namespace. + Network string +} + +// LinuxPodSandboxStatus contains status specific to Linux sandboxes. +type LinuxPodSandboxStatus struct { + // Namespaces contains paths to the sandbox's namespaces. + Namespaces *Namespaces +} + +// PodSandboxResources contains the CPU/memory resource requirements. +type PodSandboxResources struct { + // CPU resource requirement. + CPU resource.Quantity + // Memory resource requirement. + Memory resource.Quantity +} + +// This is to distinguish it from the existing ContainerID type, which includes a +// runtime type prefix (e.g., docker://). We may rename this later. +type RawContainerID string + +// ContainerRuntime provides methods for container lifecycle operations, as +// well as listing or inspecting existing containers. These methods should +// either return an error or block until the operation succeeds. +type ContainerRuntime interface { + // Create creates a container in the sandbox, and returns the ID + // of the created container. + Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, sandboxID PodSandboxID) (RawContainerID, error) + // Start starts a created container. + Start(id RawContainerID) error + // Stop stops a running container with a grace period (i.e., timeout). 
+ Stop(id RawContainerID, timeout int) error + // Remove removes the container. + Remove(id RawContainerID) error + // List lists the existing containers that match the ContainerFilter. + // The returned list should only include containers previously created + // by this ContainerRuntime. + List(filter ContainerFilter) ([]ContainerListItem, error) + // Status returns the status of the container. + Status(id RawContainerID) (RawContainerStatus, error) + // Exec executes a command in the container. + Exec(id RawContainerID, cmd []string, streamOpts StreamOptions) error +} + +// ContainerListItem provides the runtime information for a container returned +// by List(). +type ContainerListItem struct { + // The ID of the container, used by the container runtime to identify + // a container. + ID ContainerID + // The name of the container, which should be the same as specified by + // api.Container. + Name string + // Reference to the image in use. For most runtimes, this should be an + // image ID. + ImageRef string + // State is the state of the container. + State ContainerState + // Labels are key value pairs that may be used to scope and select individual resources. + Labels Labels +} + +type ContainerConfig struct { + // Name of the container. The string should conform to [a-zA-Z0-9_-]+. + Name string + // Image to use. + Image ImageSpec + // Command to execute (i.e., entrypoint for docker) + Command []string + // Args for the Command (i.e., command for docker) + Args []string + // Current working directory of the command. + WorkingDir string + // List of environment variables to set in the container + Env []KeyValue + // Mounts specifies mounts for the container + Mounts []Mount + // Labels are key value pairs that may be used to scope and select individual resources. + Labels Labels + // Annotations is an unstructured key value map that may be set by external + // tools to store and retrieve arbitrary metadata. 
+ Annotations map[string]string + // Privileged runs the container in the privileged mode. + Privileged bool + // ReadOnlyRootFS sets the root filesystem of the container to be + // read-only. + ReadOnlyRootFS bool + // Path relative to PodSandboxConfig.LogDirectory for container to store + // the log (STDOUT and STDERR) on the host. + // E.g., + // PodSandboxConfig.LogDirectory = `/var/log/pods//` + // ContainerConfig.LogPath = `containerName_Instance#.log` + // + // WARNING: Log management and how kubelet should interface with the + // container logs are under active discussion in + // https://issues.k8s.io/24677. There *may* be a future change of direction + // for logging as the discussion carries on. + LogPath string + + // Variables for interactive containers, these have very specialized + // use-cases (e.g. debugging). + // TODO: Determine if we need to continue supporting these fields that are + // part of Kubernetes's Container Spec. + STDIN bool + STDINONCE bool + TTY bool + + // Linux contains configuration specific to Linux containers. + Linux *LinuxContainerConfig +} + +// RawContainerStatus represents the status of a container. +type RawContainerStatus struct { + // ID of the container. + ID ContainerID + // Name of the container. + Name string + // State of the container. + State ContainerState + // Creation time of the container. + CreatedAt unversioned.Time + // Start time of the container. + StartedAt unversioned.Time + // Finish time of the container. + FinishedAt unversioned.Time + // Exit code of the container. + ExitCode int + // Reference to the image in use. For most runtimes, this should be an + // image ID. + ImageRef string + // Labels are key value pairs that may be used to scope and select individual resources. + Labels Labels + // Annotations is an unstructured key value map. + Annotations map[string]string + // A brief CamelCase string explaining why the container is in its current state. 
+ Reason string +} + +// LinuxContainerConfig contains platform-specific configuration for +// Linux-based containers. +type LinuxContainerConfig struct { + // Resources specification for the container. + Resources *LinuxContainerResources + // Capabilities to add or drop. + Capabilities *api.Capabilities + // SELinux is the SELinux context to be applied. + SELinux *api.SELinuxOptions + // TODO: Add support for seccomp. +} + +// LinuxContainerResources specifies Linux-specific configuration for +// resources. +// TODO: Consider using Resources from opencontainers/runtime-spec/specs-go +// directly. +type LinuxContainerResources struct { + // CPU CFS (Completely Fair Scheduler) period + CPUPeriod *int64 + // CPU CFS (Completely Fair Scheduler) quota + CPUQuota *int64 + // CPU shares (relative weight vs. other containers) + CPUShares *int64 + // Memory limit in bytes + MemoryLimitInBytes *int64 + // OOMScoreAdj adjusts the oom-killer score. + OOMScoreAdj *int64 +} + +// ContainerFilter is used to filter containers. +type ContainerFilter struct { + // Name of the container. + Name *string + // ID of the container. + ID *RawContainerID + // State of the container. + State *ContainerState + // ID of the PodSandbox. + PodSandboxID *PodSandboxID + // LabelSelector to select matches. + // Only api.MatchLabels is supported for now and the requirements + // are ANDed. MatchExpressions is not supported yet. + LabelSelector unversioned.LabelSelector +} + +type StreamOptions struct { + TTY bool + InputStream io.Reader + OutputStream io.Writer + ErrorStream io.Writer +} + +// KeyValue represents a key-value pair. +type KeyValue struct { + Key string + Value string +} + +// ImageService offers basic image operations. +type ImageService interface { + // List lists the existing images. + List() ([]Image, error) + // Pull pulls an image with authentication config. 
The PodSandboxConfig is + // passed so that the image service can charge the resources used for + // pulling to a specific pod. + Pull(image ImageSpec, auth AuthConfig, sandboxConfig *PodSandboxConfig) error + // Remove removes an image. + Remove(image ImageSpec) error + // Status returns the status of an image. + Status(image ImageSpec) (Image, error) +} + +// AuthConfig contains authorization information for connecting to a registry. +// TODO: This is copied from docker's AuthConfig. We should re-evaluate to +// support other registries. +type AuthConfig struct { + Username string + Password string + Auth string + ServerAddress string + // IdentityToken is used to authenticate the user and get + // an access token for the registry. + IdentityToken string + // RegistryToken is a bearer token to be sent to a registry + RegistryToken string +} + +// TODO: Add ContainerMetricsGetter and ImageMetricsGetter. diff --git a/pkg/kubelet/container/runtime.go b/pkg/kubelet/container/runtime.go index e110a16d2af..df06a737b73 100644 --- a/pkg/kubelet/container/runtime.go +++ b/pkg/kubelet/container/runtime.go @@ -233,6 +233,8 @@ const ( ContainerStateExited ContainerState = "exited" // This unknown encompasses all the states that we currently don't care. ContainerStateUnknown ContainerState = "unknown" + // Not in use yet. + ContainerStateCreated ContainerState = "created" ) // Container provides the runtime information for a container, such as ID, hash,