Merge pull request #39121 from michelleN/docs-design-stubs

replace contents of docs/design with stubs
Brian Grant 2017-01-13 15:18:34 -08:00 committed by GitHub
commit 1d6e85bf71
61 changed files with 47 additions and 15329 deletions

View File

@ -1,62 +1 @@
# Kubernetes Design Overview
Kubernetes is a system for managing containerized applications across multiple
hosts, providing basic mechanisms for deployment, maintenance, and scaling of
applications.
Kubernetes establishes robust declarative primitives for maintaining the desired
state requested by the user. We see these primitives as the main value added by
Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and
replicating containers require active controllers, not just imperative
orchestration.
Kubernetes is primarily targeted at applications composed of multiple
containers, such as elastic, distributed micro-services. It is also designed to
facilitate migration of non-containerized application stacks to Kubernetes. It
therefore includes abstractions for grouping containers in both loosely coupled
and tightly coupled formations, and provides ways for containers to find and
communicate with each other in relatively familiar ways.
Kubernetes enables users to ask a cluster to run a set of containers. The system
automatically chooses hosts to run those containers on. While Kubernetes's
scheduler is currently very simple, we expect it to grow in sophistication over
time. Scheduling is a policy-rich, topology-aware, workload-specific function
that significantly impacts availability, performance, and capacity. The
scheduler needs to take into account individual and collective resource
requirements, quality of service requirements, hardware/software/policy
constraints, affinity and anti-affinity specifications, data locality,
inter-workload interference, deadlines, and so on. Workload-specific
requirements will be exposed through the API as necessary.
Kubernetes is intended to run on a number of cloud providers, as well as on
physical hosts.
A single Kubernetes cluster is not intended to span multiple availability zones.
Instead, we recommend building a higher-level layer to replicate complete
deployments of highly available applications across multiple zones (see
[the multi-cluster doc](../admin/multi-cluster.md) and [cluster federation proposal](../proposals/federation.md)
for more details).
Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS
platform and toolkit. Therefore, architecturally, we want Kubernetes to be built
as a collection of pluggable components and layers, with the ability to use
alternative schedulers, controllers, storage systems, and distribution
mechanisms, and we're evolving its current code in that direction. Furthermore,
we want others to be able to extend Kubernetes functionality, such as with
higher-level PaaS functionality or multi-cluster layers, without modification of
core Kubernetes source. Therefore, its API isn't just (or even necessarily
mainly) targeted at end users, but at tool and extension developers. Its APIs
are intended to serve as the foundation for an open ecosystem of tools,
automation systems, and higher-level API layers. Consequently, there are no
"internal" inter-component APIs. All APIs are visible and available, including
the APIs used by the scheduler, the node controller, the replication-controller
manager, Kubelet's API, etc. There's no glass to break -- in order to handle
more complex use cases, one can just access the lower-level APIs in a fully
transparent, composable manner.
For more about the Kubernetes architecture, see [architecture](architecture.md).
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/README.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/README.md)

View File

@ -1,376 +1 @@
# K8s Identity and Access Management Sketch
This document suggests a direction for identity and access management in the
Kubernetes system.
## Background
High level goals are:
- Have a plan for how identity, authentication, and authorization will fit in
to the API.
- Have a plan for partitioning resources within a cluster between independent
organizational units.
- Ease integration with existing enterprise and hosted scenarios.
### Actors
Each of these can act as normal users or attackers.
- External Users: People who are accessing applications running on K8s (e.g.
a web site served by webserver running in a container on K8s), but who do not
have K8s API access.
- K8s Users: People who access the K8s API (e.g. create K8s API objects like
Pods)
- K8s Project Admins: People who manage access for some K8s Users
- K8s Cluster Admins: People who control the machines, networks, or binaries
that make up a K8s cluster.
- K8s Admin means K8s Cluster Admins and K8s Project Admins taken together.
### Threats
Both intentional attacks and accidental use of privilege are concerns.
For both cases it may be useful to think about these categories differently:
- Application Path - attack by sending network messages from the internet to
the IP/port of any application running on K8s. May exploit weakness in
application or misconfiguration of K8s.
- K8s API Path - attack by sending network messages to any K8s API endpoint.
- Insider Path - attack on K8s system components. Attacker may have
privileged access to networks, machines or K8s software and data. Software
errors in K8s system components and administrator error are some types of threat
in this category.
This document is primarily concerned with K8s API paths, and secondarily with
Insider paths. The Application path also needs to be secure, but is not the
focus of this document.
### Assets to protect
External User assets:
- Personal information like private messages, or images uploaded by External
Users.
- web server logs.
K8s User assets:
- External User assets of each K8s User.
- things private to the K8s app, like:
  - credentials for accessing other services (docker private repos, storage
    services, facebook, etc)
  - SSL certificates for web servers
  - proprietary data and code
K8s Cluster assets:
- Assets of each K8s User.
- Machine Certificates or secrets.
- The value of K8s cluster computing resources (cpu, memory, etc).
This document is primarily about protecting K8s User assets and K8s cluster
assets from other K8s Users and K8s Project and Cluster Admins.
### Usage environments
Cluster in Small organization:
- K8s Admins may be the same people as K8s Users.
- Few K8s Admins.
- Prefer ease of use to fine-grained access control/precise accounting, etc.
- Product requirement that it be easy for potential K8s Cluster Admin to try
out setting up a simple cluster.
Cluster in Large organization:
- K8s Admins typically distinct people from K8s Users. May need to divide
K8s Cluster Admin access by roles.
- K8s Users need to be protected from each other.
- Auditing of K8s User and K8s Admin actions important.
- Flexible accurate usage accounting and resource controls important.
- Lots of automated access to APIs.
- Need to integrate with existing enterprise directory, authentication,
accounting, auditing, and security policy infrastructure.
Org-run cluster:
- Organization that runs K8s master components is same as the org that runs
apps on K8s.
- Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix.
Hosted cluster:
- Offering K8s API as a service, or offering a PaaS or SaaS built on K8s.
- May already offer web services, and need to integrate with existing customer
account concept, and existing authentication, accounting, auditing, and security
policy infrastructure.
- May want to leverage K8s User accounts and accounting to manage their User
accounts (not a priority to support this use case.)
- Precise and accurate accounting of resources needed. Resource controls
needed for hard limits (Users given limited slice of data) and soft limits
(Users can grow up to some limit and then be expanded).
K8s ecosystem services:
- There may be companies that want to offer their existing services (Build, CI,
A/B-test, release automation, etc) for use with K8s. There should be some story
for this case.
Pods configs should be largely portable between Org-run and hosted
configurations.
# Design
Related discussion:
- http://issue.k8s.io/442
- http://issue.k8s.io/443
This doc describes two security profiles:
- Simple profile: like single-user mode. Make it easy to evaluate K8s
without lots of configuring accounts and policies. Protects from unauthorized
users, but does not partition authorized users.
- Enterprise profile: Provide mechanisms needed for large numbers of users.
Defense in depth. Should integrate with existing enterprise security
infrastructure.
K8s distribution should include templates of config, and documentation, for
simple and enterprise profiles. System should be flexible enough for
knowledgeable users to create intermediate profiles, but K8s developers should
only reason about those two Profiles, not a matrix.
Features in this doc are divided into "Initial Feature", and "Improvements".
Initial features would be candidates for version 1.00.
## Identity
### userAccount
K8s will have a `userAccount` API object.
- `userAccount` has a UID which is immutable. This is used to associate users
with objects and to record actions in audit logs.
- `userAccount` has a name which is a string and human readable and unique among
userAccounts. It is used to refer to users in Policies, to ensure that the
Policies are human readable. It can be changed only when there are no Policy
objects or other objects which refer to that name. An email address is a
suggested format for this field.
- `userAccount` is not related to the unix username of processes in Pods created
by that userAccount.
- `userAccount` API objects can have labels.
The system may associate one or more Authentication Methods with a
`userAccount` (but they are not formally part of the userAccount object.)
In a simple deployment, the authentication method for a user might be an
authentication token which is verified by a K8s server. In a more complex
deployment, the authentication might be delegated to another system which is
trusted by the K8s API to authenticate users, but where the authentication
details are unknown to K8s.
Initial Features:
- There is no superuser `userAccount`
- `userAccount` objects are statically populated in the K8s API store by reading
a config file. Only a K8s Cluster Admin can do this.
- `userAccount` can have a default `namespace`. If API call does not specify a
`namespace`, the default `namespace` for that caller is assumed.
- `userAccount` is global. A single human with access to multiple namespaces is
recommended to only have one userAccount.
Improvements:
- Make `userAccount` part of a separate API group from core K8s objects like
`pod`. Facilitates plugging in alternate Access Management.
Simple Profile:
- Single `userAccount`, used by all K8s Users and Project Admins. One access
token shared by all.
Enterprise Profile:
- Every human user has own `userAccount`.
- `userAccount`s have labels that indicate both membership in groups, and
ability to act in certain roles.
- Each service using the API has own `userAccount` too. (e.g. `scheduler`,
`repcontroller`)
- Automated jobs to denormalize the LDAP group info and the local system
list of users into the K8s userAccount file.
### Unix accounts
A `userAccount` is not a Unix user account. The fact that a pod is started by a
`userAccount` does not mean that the processes in that pod's containers run as a
Unix user with a corresponding name or identity.
Initially:
- The unix accounts available in a container, and used by the processes running
in a container are those that are provided by the combination of the base
operating system and the Docker manifest.
- Kubernetes doesn't enforce any relation between `userAccount` and unix
accounts.
Improvements:
- Kubelet allocates disjoint blocks of root-namespace uids for each container.
This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572)
- requires docker to integrate user namespace support, and deciding what
getpwnam() does for these uids.
- any features that help users avoid use of privileged containers
(http://issue.k8s.io/391)
### Namespaces
K8s will have a `namespace` API object. It is similar to a Google Compute
Engine `project`. It provides a namespace for objects created by a group of
people co-operating together, preventing name collisions with non-cooperating
groups. It also serves as a reference point for authorization policies.
Namespaces are described in [namespaces.md](namespaces.md).
In the Enterprise Profile:
- a `userAccount` may have permission to access several `namespace`s.
In the Simple Profile:
- There is a single `namespace` used by the single user.
Namespaces versus userAccount vs. Labels:
- `userAccount`s are intended for audit logging (both name and UID should be
logged), and to define who has access to `namespace`s.
- `labels` (see [docs/user-guide/labels.md](../../docs/user-guide/labels.md))
should be used to distinguish pods, users, and other objects that cooperate
towards a common goal but are different in some way, such as version, or
responsibilities.
- `namespace`s prevent name collisions between uncoordinated groups of people,
and provide a place to attach common policies for co-operating groups of people.
## Authentication
Goals for K8s authentication:
- Include a built-in authentication system with no configuration required to use
in single-user mode, and little configuration required to add several user
accounts, and no https proxy required.
- Allow for authentication to be handled by a system external to Kubernetes, to
allow integration with existing enterprise authorization systems. The
Kubernetes namespace itself should avoid taking contributions of multiple
authorization schemes. Instead, a trusted proxy in front of the apiserver can be
used to authenticate users.
- For organizations whose security requirements only allow FIPS compliant
implementations (e.g. apache) for authentication.
- So the proxy can terminate SSL, and isolate the CA-signed certificate from
less trusted, higher-touch APIserver.
- For organizations that already have existing SaaS web services (e.g.
storage, VMs) and want a common authentication portal.
- Avoid mixing authentication and authorization, so that authorization policies
can be centrally managed, and to allow changes in authentication methods without
affecting authorization code.
Initially:
- Tokens used to authenticate a user.
- Long lived tokens identify a particular `userAccount`.
- Administrator utility generates tokens at cluster setup.
- OAuth2.0 Bearer tokens protocol, http://tools.ietf.org/html/rfc6750
- No scopes for tokens. Authorization happens in the API server
- Tokens dynamically generated by apiserver to identify pods which are making
API calls.
- Tokens checked in a module of the APIserver.
- Authentication in apiserver can be disabled by flag, to allow testing without
authorization enabled, and to allow use of an authenticating proxy. In this
mode, a query parameter or header added by the proxy will identify the caller.
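The initial token scheme above can be sketched in Go. This is a minimal illustration under stated assumptions, not the apiserver's implementation: `tokenStore`, `errUnauthenticated`, and `authenticate` are hypothetical names, and the in-memory map stands in for whatever store the administrator utility would populate at cluster setup.

```go
package main

import (
	"errors"
	"strings"
)

// tokenStore maps long-lived tokens to userAccount names. The map is a
// hypothetical stand-in; the proposal does not say how tokens are stored.
type tokenStore map[string]string

// errUnauthenticated is returned for a missing, malformed, or unknown token.
var errUnauthenticated = errors.New("unauthenticated")

// authenticate parses an Authorization header value of the form
// "Bearer <token>" (RFC 6750) and returns the userAccount it identifies.
// It only answers "who is calling"; authorization is a separate step.
func authenticate(store tokenStore, authorization string) (string, error) {
	parts := strings.SplitN(strings.TrimSpace(authorization), " ", 2)
	if len(parts) != 2 || !strings.EqualFold(parts[0], "Bearer") {
		return "", errUnauthenticated
	}
	user, ok := store[strings.TrimSpace(parts[1])]
	if !ok {
		return "", errUnauthenticated
	}
	return user, nil
}
```

Because the function returns only an identity, the token check can later be swapped for an authenticating proxy without touching authorization code, which is the separation the goals above ask for.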
Improvements:
- Refresh of tokens.
- SSH keys to access inside containers.
To be considered for subsequent versions:
- Fuller use of OAuth (http://tools.ietf.org/html/rfc6749)
- Scoped tokens.
- Tokens that are bound to the channel between the client and the api server
- http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf
- http://www.browserauth.net
## Authorization
K8s authorization should:
- Allow for a range of maturity levels, from single-user for those test driving
the system, to integration with existing enterprise authorization systems.
- Allow for centralized management of users and policies. In some
organizations, this will mean that the definition of users and access policies
needs to reside on a system other than k8s and encompass other web services
(such as a storage service).
- Allow processes running in K8s Pods to take on identity, and to allow narrow
scoping of permissions for those identities in order to limit damage from
software faults.
- Have Authorization Policies exposed as API objects so that a single config
file can create or delete Pods, Replication Controllers, Services, and the
identities and policies for those Pods and Replication Controllers.
- Be separate as much as practical from Authentication, to allow Authentication
methods to change over time and space, without impacting Authorization policies.
K8s will implement a relatively simple
[Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model.
The model will be described in more detail in a forthcoming document. The model
will:
- Be less complex than XACML
- Be easily recognizable to those familiar with Amazon IAM Policies.
- Have a subset/aliases/defaults which allow it to be used in a way comfortable
to those users more familiar with Role-Based Access Control.
Authorization policy is set by creating a set of Policy objects.
The API Server will be the Enforcement Point for Policy. For each API call that
it receives, it will construct the Attributes needed to evaluate the policy
(what user is making the call, what resource they are accessing, what they are
trying to do to that resource, etc) and pass those attributes to a Decision Point.
The Decision Point code evaluates the Attributes against all the Policies and
allows or denies the API call. The system will be modular enough that the
Decision Point code can either be linked into the APIserver binary, or be
another service that the apiserver calls for each Decision (with appropriate
time-limited caching as needed for performance).
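The Enforcement Point / Decision Point split can be sketched as follows. The field set of `Attributes`, the shape of `Policy`, and the "empty field matches anything" rule are all assumptions made for illustration; the proposal defers the actual model to a later document.

```go
package main

// Attributes describes one API call as constructed by the Enforcement Point.
// The field set is hypothetical.
type Attributes struct {
	User      string
	Namespace string
	Resource  string
	Verb      string
}

// Policy allows a call when every non-empty field equals the corresponding
// attribute. An empty field acts as a wildcard (an assumption of this sketch).
type Policy struct {
	User      string
	Namespace string
	Resource  string
	Verb      string
}

// matches reports whether a policy field accepts an attribute value.
func matches(want, got string) bool {
	return want == "" || want == got
}

// Authorize plays the role of the Decision Point: it evaluates the attributes
// against all policies and allows the call if any policy matches.
// With no matching policy the call is denied.
func Authorize(policies []Policy, a Attributes) bool {
	for _, p := range policies {
		if matches(p.User, a.User) &&
			matches(p.Namespace, a.Namespace) &&
			matches(p.Resource, a.Resource) &&
			matches(p.Verb, a.Verb) {
			return true
		}
	}
	return false
}
```

Keeping `Authorize` a pure function of policies and attributes is what lets it be linked into the apiserver or moved behind a service call without changing callers.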
Some Policy objects may be applicable only to a single namespace; K8s Project
Admins would be able to create those as needed. Other Policy objects may be
applicable to all namespaces; a K8s Cluster Admin might create those in order to
authorize a new type of controller to be used by all namespaces, or to make a
K8s User into a K8s Project Admin.
## Accounting
The API should have a `quota` concept (see http://issue.k8s.io/442). A quota
object relates a namespace (and optionally a label selector) to a maximum
quantity of resources that may be used (see [resources design doc](resources.md)).
Initially:
- A `quota` object is immutable.
- For hosted K8s systems that do billing, Project is recommended level for
billing accounts.
- Every object that consumes resources should have a `namespace` so that
Resource usage stats are roll-up-able to `namespace`.
- K8s Cluster Admin sets quota objects by writing a config file.
Improvements:
- Allow one namespace to charge the quota for one or more other namespaces. This
would be controlled by a policy which allows changing a `billing_namespace=`
label on an object.
- Allow quota to be set by namespace owners for (namespace x label) combinations
(e.g. let "webserver" namespace use 100 cores, but to prevent accidents, don't
allow "webserver" namespace and "instance=test" to use more than 10 cores).
- Tools to help write consistent quota config files based on number of nodes,
historical namespace usages, QoS needs, etc.
- Way for K8s Cluster Admin to incrementally adjust Quota objects.
Simple profile:
- A single `namespace` with infinite resource limits.
Enterprise profile:
- Multiple namespaces each with their own limits.
Issues:
- Need for locking or "eventual consistency" when multiple apiserver goroutines
are accessing the object store and handling pod creations.
## Audit Logging
API actions can be logged.
Initial implementation:
- All API calls logged to nginx logs.
Improvements:
- API server does logging instead.
- Policies to drop logging for high rate trusted API calls, or by users
performing audit or other sensitive functions.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/access.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/access.md)

View File

@ -1,106 +1 @@
# Kubernetes Proposal - Admission Control
**Related PR:**
| Topic | Link |
| ----- | ---- |
| Separate validation from RESTStorage | http://issue.k8s.io/2977 |
## Background
High level goals:
* Enable an easy-to-use mechanism to provide admission control to cluster.
* Enable a provider to support multiple admission control strategies or author
their own.
* Ensure any rejected request propagates an error back to the caller explaining
why the request failed.
Authorization via policy is focused on answering whether a user is authorized to
perform an action.
Admission Control is focused on whether the system will accept an authorized action.
Kubernetes may choose to dismiss an authorized action based on any number of
admission control strategies.
This proposal documents the basic design, and describes how any number of
admission control plug-ins could be injected.
Implementations of specific admission control strategies are handled in separate
documents.
## kube-apiserver
The kube-apiserver takes the following OPTIONAL arguments to enable admission
control:
| Option | Behavior |
| ------ | -------- |
| admission-control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. |
| admission-control-config-file | File with admission control configuration parameters to boot-strap plug-in. |
An **AdmissionControl** plug-in is an implementation of the following interface:
```go
package admission

// Attributes is an interface used by a plug-in to make an admission decision
// on an individual request.
type Attributes interface {
	GetNamespace() string
	GetKind() string
	GetOperation() string
	GetObject() runtime.Object
}

// Interface is an abstract, pluggable interface for Admission Control decisions.
type Interface interface {
	// Admit makes an admission decision based on the request attributes.
	// An error is returned if it denies the request.
	Admit(a Attributes) (err error)
}
```
A **plug-in** must be compiled with the binary, and is registered as an
available option by providing a name, and implementation of admission.Interface.
```go
func init() {
	admission.RegisterPlugin("AlwaysDeny", func(client client.Interface, config io.Reader) (admission.Interface, error) {
		return NewAlwaysDeny(), nil
	})
}
```
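As an illustration of what sits behind a registration like the one above, here is a self-contained sketch of an always-deny plug-in. It redeclares the two interfaces locally so it compiles without Kubernetes imports; `GetObject` returns `interface{}` instead of `runtime.Object` for that reason, and the `request` type and the error wording are hypothetical.

```go
package main

import "fmt"

// Attributes mirrors the proposal's interface. GetObject returns interface{}
// here (not runtime.Object) so the sketch needs no Kubernetes packages.
type Attributes interface {
	GetNamespace() string
	GetKind() string
	GetOperation() string
	GetObject() interface{}
}

// Interface mirrors the proposal's admission.Interface.
type Interface interface {
	Admit(a Attributes) error
}

// alwaysDeny rejects every request it is asked about.
type alwaysDeny struct{}

// Admit always returns an error that tells the caller why the request failed.
func (alwaysDeny) Admit(a Attributes) error {
	return fmt.Errorf("admission denied: %s of %s in namespace %q",
		a.GetOperation(), a.GetKind(), a.GetNamespace())
}

// NewAlwaysDeny returns the plug-in as an Interface value.
func NewAlwaysDeny() Interface { return alwaysDeny{} }

// request is a hypothetical Attributes implementation used for illustration.
type request struct {
	namespace, kind, operation string
}

func (r request) GetNamespace() string   { return r.namespace }
func (r request) GetKind() string        { return r.kind }
func (r request) GetOperation() string   { return r.operation }
func (r request) GetObject() interface{} { return nil }
```

Calling `NewAlwaysDeny().Admit(request{namespace: "default", kind: "Pod", operation: "CREATE"})` returns a non-nil error, which the API server would propagate back to the caller.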
A **plug-in** must be added to the imports in [plugins.go](../../cmd/kube-apiserver/app/plugins.go)
```go
// Admission policies
_ "k8s.io/kubernetes/plugin/pkg/admission/admit"
_ "k8s.io/kubernetes/plugin/pkg/admission/alwayspullimages"
_ "k8s.io/kubernetes/plugin/pkg/admission/antiaffinity"
...
_ "<YOUR NEW PLUGIN>"
```
Invocation of admission control is handled by the **APIServer** and not
individual **RESTStorage** implementations.
This design assumes that **Issue 2977** is adopted, and as a consequence, the
general framework of the APIServer request/response flow will ensure the
following:
1. Incoming request
2. Authenticate user
3. Authorize user
4. If operation=create|update|delete|connect, then admission.Admit(requestAttributes)
   - invoke each admission.Interface object in sequence
5. Case on the operation:
   - If operation=create|update, then validate(object) and persist
   - If operation=delete, delete the object
   - If operation=connect, exec
If there is an error at any step, the request is canceled.
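Steps 4 and 5 of the flow can be sketched as an ordered chain of handlers that stops at the first error. All names here (`Admitter`, `AdmitFunc`, `chain`, `denyNamespace`, `handle`) are hypothetical, `Attributes` is reduced to a plain struct, and the `execute` callback stands in for validate-and-persist, delete, or exec.

```go
package main

import "errors"

// Attributes is a simplified, hypothetical stand-in for request attributes.
type Attributes struct {
	Namespace string
	Kind      string
	Operation string
}

// Admitter is the admission decision interface used by this sketch.
type Admitter interface {
	Admit(a Attributes) error
}

// AdmitFunc adapts a plain function to the Admitter interface.
type AdmitFunc func(Attributes) error

// Admit calls the wrapped function.
func (f AdmitFunc) Admit(a Attributes) error { return f(a) }

// chain is an ordered list of admission handlers.
type chain []Admitter

// Admit invokes each handler in sequence and returns the first error.
func (c chain) Admit(a Attributes) error {
	for _, h := range c {
		if err := h.Admit(a); err != nil {
			return err
		}
	}
	return nil
}

// denyNamespace is an example handler that rejects requests in one namespace.
func denyNamespace(ns string) AdmitFunc {
	return func(a Attributes) error {
		if a.Namespace == ns {
			return errors.New("namespace " + ns + " does not accept requests")
		}
		return nil
	}
}

// needsAdmission reports whether the operation is subject to admission control.
func needsAdmission(op string) bool {
	switch op {
	case "create", "update", "delete", "connect":
		return true
	}
	return false
}

// handle runs admission (step 4) and then the operation itself (step 5).
// Any error cancels the request before execute is called.
func handle(c chain, a Attributes, execute func(Attributes) error) error {
	if needsAdmission(a.Operation) {
		if err := c.Admit(a); err != nil {
			return err
		}
	}
	return execute(a)
}
```

Because the chain is ordered, a cheap or strict handler can be listed first in `--admission-control` style configuration so later handlers never see a request that was already rejected.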
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control.md)

View File

@ -1,233 +1 @@
# Admission control plugin: LimitRanger
## Background
This document proposes a system for enforcing resource requirement constraints
as part of admission control.
## Use cases
1. Ability to enumerate resource requirement constraints per namespace
2. Ability to enumerate min/max resource constraints for a pod
3. Ability to enumerate min/max resource constraints for a container
4. Ability to specify default resource limits for a container
5. Ability to specify default resource requests for a container
6. Ability to enforce a ratio between request and limit for a resource.
7. Ability to enforce min/max storage requests for persistent volume claims
## Data Model
The **LimitRange** resource is scoped to a **Namespace**.
### Type
```go
// LimitType is a type of object that is limited
type LimitType string

const (
	// Limit that applies to all pods in a namespace
	LimitTypePod LimitType = "Pod"
	// Limit that applies to all containers in a namespace
	LimitTypeContainer LimitType = "Container"
)

// LimitRangeItem defines a min/max usage limit for any resource that matches
// on kind.
type LimitRangeItem struct {
	// Type of resource that this limit applies to.
	Type LimitType `json:"type,omitempty"`
	// Max usage constraints on this kind by resource name.
	Max ResourceList `json:"max,omitempty"`
	// Min usage constraints on this kind by resource name.
	Min ResourceList `json:"min,omitempty"`
	// Default resource requirement limit value by resource name if resource limit
	// is omitted.
	Default ResourceList `json:"default,omitempty"`
	// DefaultRequest is the default resource requirement request value by
	// resource name if resource request is omitted.
	DefaultRequest ResourceList `json:"defaultRequest,omitempty"`
	// MaxLimitRequestRatio if specified, the named resource must have a request
	// and limit that are both non-zero where limit divided by request is less
	// than or equal to the enumerated value; this represents the max burst for
	// the named resource.
	MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"`
}

// LimitRangeSpec defines a min/max usage limit for resources that match
// on kind.
type LimitRangeSpec struct {
	// Limits is the list of LimitRangeItem objects that are enforced.
	Limits []LimitRangeItem `json:"limits"`
}

// LimitRange sets resource usage limits for each kind of resource in a
// Namespace.
type LimitRange struct {
	TypeMeta `json:",inline"`
	// Standard object's metadata.
	// More info:
	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
	ObjectMeta `json:"metadata,omitempty"`
	// Spec defines the limits enforced.
	// More info:
	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
	Spec LimitRangeSpec `json:"spec,omitempty"`
}

// LimitRangeList is a list of LimitRange items.
type LimitRangeList struct {
	TypeMeta `json:",inline"`
	// Standard list metadata.
	// More info:
	// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
	ListMeta `json:"metadata,omitempty"`
	// Items is a list of LimitRange objects.
	// More info:
	// http://releases.k8s.io/HEAD/docs/design/admission_control_limit_range.md
	Items []LimitRange `json:"items"`
}
```
### Validation
Validation of a **LimitRange** enforces that for a given named resource the
following rules apply:
Min (if specified) <= DefaultRequest (if specified) <= Default (if specified)
<= Max (if specified)
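The ordering rule can be checked with a few lines of Go. This is a sketch under simplifying assumptions: values are plain `int64` quantities in a single unit (for example millicores) rather than Kubernetes resource quantities, a nil pointer means "not specified", and the names `limits`, `validate`, and `i64` are hypothetical.

```go
package main

import "fmt"

// limits holds the optional values for one named resource.
// A nil pointer means the value was not specified.
type limits struct {
	Min, DefaultRequest, Default, Max *int64
}

// validate enforces Min <= DefaultRequest <= Default <= Max over the values
// that are specified. Comparing each specified value with the previous
// specified one is sufficient because <= is transitive.
func validate(l limits) error {
	names := []string{"Min", "DefaultRequest", "Default", "Max"}
	values := []*int64{l.Min, l.DefaultRequest, l.Default, l.Max}
	prev := -1 // index of the last specified value, -1 if none yet
	for i, v := range values {
		if v == nil {
			continue
		}
		if prev >= 0 && *values[prev] > *v {
			return fmt.Errorf("%s (%d) must be <= %s (%d)",
				names[prev], *values[prev], names[i], *v)
		}
		prev = i
	}
	return nil
}

// i64 returns a pointer to v, for building limits literals.
func i64(v int64) *int64 { return &v }
```

For instance, `validate(limits{Min: i64(600), Default: i64(500)})` fails even though `DefaultRequest` is absent, because the unspecified value is skipped rather than treated as zero.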
### Default Value Behavior
The following default value behaviors are applied to a LimitRange for a given
named resource.
```
if LimitRangeItem.Default[resourceName] is undefined
  if LimitRangeItem.Max[resourceName] is defined
    LimitRangeItem.Default[resourceName] = LimitRangeItem.Max[resourceName]
```
```
if LimitRangeItem.DefaultRequest[resourceName] is undefined
  if LimitRangeItem.Default[resourceName] is defined
    LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName]
  else if LimitRangeItem.Min[resourceName] is defined
    LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName]
```
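The two defaulting rules translate directly into Go. The sketch below assumes a simplified `ResourceList` that maps resource names to `int64` values (the real type holds resource quantities), and it reads the `else if` as belonging to the inner check on `Default`; `applyDefaults` and `set` are hypothetical helper names.

```go
package main

// ResourceList is a simplified stand-in that maps a resource name to a value
// in a single unit. The real type stores resource quantities.
type ResourceList map[string]int64

// LimitRangeItem keeps only the fields the defaulting rules touch.
type LimitRangeItem struct {
	Max, Min, Default, DefaultRequest ResourceList
}

// set stores a value, allocating the map on first use.
func set(l *ResourceList, name string, v int64) {
	if *l == nil {
		*l = ResourceList{}
	}
	(*l)[name] = v
}

// applyDefaults fills in Default and DefaultRequest for one named resource:
// Default falls back to Max; DefaultRequest falls back to Default, then Min.
func applyDefaults(item *LimitRangeItem, name string) {
	if _, ok := item.Default[name]; !ok {
		if maxV, ok := item.Max[name]; ok {
			set(&item.Default, name, maxV)
		}
	}
	if _, ok := item.DefaultRequest[name]; !ok {
		if def, ok := item.Default[name]; ok {
			set(&item.DefaultRequest, name, def)
		} else if minV, ok := item.Min[name]; ok {
			set(&item.DefaultRequest, name, minV)
		}
	}
}
```

Note the order dependence: the `Default` rule runs first, so an item that specifies only `Max` ends up with both `Default` and `DefaultRequest` equal to `Max`.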
## AdmissionControl plugin: LimitRanger
The **LimitRanger** plug-in introspects all incoming pod requests and evaluates
the constraints defined on a LimitRange.
If a constraint is not specified for an enumerated resource, it is not enforced
or tracked.
To enable the plug-in and support for LimitRange, the kube-apiserver must be
configured as follows:
```console
$ kube-apiserver --admission-control=LimitRanger
```
### Enforcement of constraints
**Type: Container**
Supported Resources:
1. memory
2. cpu
Supported Constraints:
Per container, the following must hold true:
| Constraint | Behavior |
| ---------- | -------- |
| Min | Min <= Request (required) <= Limit (optional) |
| Max | Limit (required) <= Max |
| LimitRequestRatio | ( Limit (required, non-zero) / Request (required, non-zero) ) <= LimitRequestRatio |
Supported Defaults:
1. Default - if the named resource has no enumerated value, the Limit is equal
to the Default
2. DefaultRequest - if the named resource has no enumerated value, the Request
is equal to the DefaultRequest
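A per-container check of these constraints might look like the sketch below. It follows the `MaxLimitRequestRatio` field comment in the data model (limit divided by request must not exceed the ratio). Values are `int64` in one unit, zero means "not specified", and `constraint`, `usage`, and `check` are hypothetical names.

```go
package main

import "fmt"

// constraint holds the LimitRange values for one resource; 0 = unspecified.
type constraint struct {
	Min, Max, MaxLimitRequestRatio int64
}

// usage holds a container's request and limit for one resource; 0 = unset.
type usage struct {
	Request, Limit int64
}

// check verifies one container's usage against the constraint.
// A constraint that is not specified is not enforced.
func check(c constraint, u usage) error {
	if c.Min > 0 {
		if u.Request == 0 {
			return fmt.Errorf("request is required when Min is set")
		}
		if u.Request < c.Min {
			return fmt.Errorf("request %d is below Min %d", u.Request, c.Min)
		}
		if u.Limit > 0 && u.Request > u.Limit {
			return fmt.Errorf("request %d exceeds limit %d", u.Request, u.Limit)
		}
	}
	if c.Max > 0 {
		if u.Limit == 0 {
			return fmt.Errorf("limit is required when Max is set")
		}
		if u.Limit > c.Max {
			return fmt.Errorf("limit %d exceeds Max %d", u.Limit, c.Max)
		}
	}
	if c.MaxLimitRequestRatio > 0 {
		if u.Request == 0 || u.Limit == 0 {
			return fmt.Errorf("request and limit must both be non-zero when a ratio is set")
		}
		// Multiply instead of dividing to avoid integer truncation.
		if u.Limit > c.MaxLimitRequestRatio*u.Request {
			return fmt.Errorf("limit/request ratio exceeds %d", c.MaxLimitRequestRatio)
		}
	}
	return nil
}
```

With `constraint{Min: 100, Max: 1000, MaxLimitRequestRatio: 4}`, a container requesting 250 with a limit of 500 passes, while a request of 100 with the same limit is rejected because the ratio is 5.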
**Type: Pod**
Supported Resources:
1. memory
2. cpu
Supported Constraints:
Across all containers in a pod, the following must hold true:
| Constraint | Behavior |
| ---------- | -------- |
| Min | Min <= Request (required) <= Limit (optional) |
| Max | Limit (required) <= Max |
| LimitRequestRatio | ( Limit (required, non-zero) / Request (non-zero) ) <= LimitRequestRatio |
**Type: PersistentVolumeClaim**
Supported Resources:
1. storage
Supported Constraints:
Across all claims in a namespace, the following must hold true:
| Constraint | Behavior |
| ---------- | -------- |
| Min | Min <= Request (required) |
| Max | Request (required) <= Max |
Supported Defaults: None. Storage is a required field in `PersistentVolumeClaim`, so defaults are not applied at this time.
## Run-time configuration
The default `LimitRange` that is applied via Salt configuration will be
updated as follows:
```
apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: "limits"
  namespace: default
spec:
  limits:
    - type: "Container"
      defaultRequest:
        cpu: "100m"
## Example
An example LimitRange configuration:
| Type | Resource | Min | Max | Default | DefaultRequest | LimitRequestRatio |
| ---- | -------- | --- | --- | ------- | -------------- | ----------------- |
| Container | cpu | .1 | 1 | 500m | 250m | 4 |
| Container | memory | 250Mi | 1Gi | 500Mi | 250Mi | |
Assuming an incoming container that specified no incoming resource requirements,
the following would happen.
1. The incoming container cpu would request 250m with a limit of 500m.
2. The incoming container memory would request 250Mi with a limit of 500Mi
3. If the container is later resized, its cpu would be constrained to between
.1 and 1 and the ratio of limit to request could not exceed 4.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_limit_range.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_limit_range.md)

View File

@ -1,215 +1 @@
# Admission control plugin: ResourceQuota
## Background
This document describes a system for enforcing hard resource usage limits per
namespace as part of admission control.
## Use cases
1. Ability to enumerate resource usage limits per namespace.
2. Ability to monitor resource usage for tracked resources.
3. Ability to reject resource usage exceeding hard quotas.
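The third use case, rejecting usage that would exceed a hard quota, can be sketched as below. `ResourceList` is simplified to a map of `int64` counts, `admitQuota` is a hypothetical name, and the choice to ignore resources that have no hard limit is an assumption of this sketch.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in that maps a resource name to a count.
type ResourceList map[string]int64

// admitQuota decides whether a request that would consume delta fits within
// the hard limits given the current usage. On success it returns the new
// usage; the used map passed in is never modified. Resources without a hard
// limit are treated as untracked and skipped (an assumption of this sketch).
func admitQuota(hard, used, delta ResourceList) (ResourceList, error) {
	next := ResourceList{}
	for name, v := range used {
		next[name] = v
	}
	for name, d := range delta {
		limit, tracked := hard[name]
		if !tracked {
			continue
		}
		if next[name]+d > limit {
			return nil, fmt.Errorf("quota exceeded for %s: used %d, requested %d, hard limit %d",
				name, next[name], d, limit)
		}
		next[name] += d
	}
	return next, nil
}
```

Returning a fresh usage map instead of mutating `used` keeps a rejected request from leaving partially updated counts behind.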
## Data Model
The **ResourceQuota** object is scoped to a **Namespace**.
```go
// The following identify resource constants for Kubernetes object types
const (
	// Pods, number
	ResourcePods ResourceName = "pods"
	// Services, number
	ResourceServices ResourceName = "services"
	// ReplicationControllers, number
	ResourceReplicationControllers ResourceName = "replicationcontrollers"
	// ResourceQuotas, number
	ResourceQuotas ResourceName = "resourcequotas"
	// Secrets, number
	ResourceSecrets ResourceName = "secrets"
	// PersistentVolumeClaims, number
	ResourcePersistentVolumeClaims ResourceName = "persistentvolumeclaims"
)

// ResourceQuotaSpec defines the desired hard limits to enforce for Quota
type ResourceQuotaSpec struct {
	// Hard is the set of desired hard limits for each named resource
	Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
}

// ResourceQuotaStatus defines the enforced hard limits and observed use
type ResourceQuotaStatus struct {
	// Hard is the set of enforced hard limits for each named resource
	Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
	// Used is the current observed total usage of the resource in the namespace
	Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"`
}

// ResourceQuota sets aggregate quota restrictions enforced per namespace
type ResourceQuota struct {
	TypeMeta   `json:",inline"`
	ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`
	// Spec defines the desired quota
	Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`
	// Status defines the actual enforced quota and its current usage
	Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status"`
}

// ResourceQuotaList is a list of ResourceQuota items
type ResourceQuotaList struct {
	TypeMeta `json:",inline"`
	ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata"`
	// Items is a list of ResourceQuota objects
	Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
}
```
## Quota Tracked Resources
The following resources are supported by the quota system:
| Resource | Description |
| ------------ | ----------- |
| cpu | Total requested cpu usage |
| memory | Total requested memory usage |
| pods | Total number of active pods where phase is pending or active. |
| services | Total number of services |
| replicationcontrollers | Total number of replication controllers |
| resourcequotas | Total number of resource quotas |
| secrets | Total number of secrets |
| persistentvolumeclaims | Total number of persistent volume claims |
If a third party wants to track additional resources, it must follow the
resource naming conventions prescribed by Kubernetes. This means the resource
must have a fully-qualified name (e.g. mycompany.org/shinynewresource).
## Resource Requirements: Requests vs. Limits
If a resource supports the ability to distinguish between a request and a limit
for a resource, the quota tracking system will only cost the request value
against the quota usage. If a resource is tracked by quota, and no request value
is provided, the associated entity is rejected as part of admission.
For example, consider the following scenarios relative to tracking quota on
CPU:
| Pod | Container | Request CPU | Limit CPU | Result |
| --- | --------- | ----------- | --------- | ------ |
| X | C1 | 100m | 500m | The quota usage is incremented 100m |
| Y | C2 | 100m | none | The quota usage is incremented 100m |
| Y | C2 | none | 500m | The quota usage is incremented 500m since request will default to limit |
| Z | C3 | none | none | The pod is rejected since it does not enumerate a request. |
The rationale for accounting for the requested amount of a resource versus the
limit is the belief that a user should only be charged for what they are
scheduled against in the cluster. In addition, attempting to track usage against
actual usage, where request < actual < limit, is considered highly volatile.
As a consequence of this decision, the user is able to spread their usage of a
resource across multiple tiers of service. Let's demonstrate this via an
example with a 4 cpu quota.
The quota may be allocated as follows:
| Pod | Container | Request CPU | Limit CPU | Tier | Quota Usage |
| --- | --------- | ----------- | --------- | ---- | ----------- |
| X | C1 | 1 | 4 | Burstable | 1 |
| Y | C2 | 2 | 2 | Guaranteed | 2 |
| Z | C3 | 1 | 3 | Burstable | 1 |
It is possible that the pods may consume 9 cpu over a given time period,
depending on the cpu available on the nodes that held pods X and Z, but since we
scheduled X and Z relative to the request, we only track the requested value
against their allocated quota. If one wants to restrict the ratio between the
request and limit, it is encouraged that the user define a **LimitRange** with
**LimitRequestRatio** to control burst-out behavior. This would, in effect, let
an administrator keep the difference between request and limit more in line with
tracked usage if desired.
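The arithmetic behind the 4 cpu example can be checked with a short Go sketch. The `container` type and the `tier` and `totals` helpers are hypothetical and model CPU as whole cores; the tier rule used here (request equal to limit means Guaranteed, otherwise Burstable) is inferred from the table and is a simplification.

```go
package main

import "fmt"

// container is a hypothetical record of one row of the allocation table.
type container struct {
	Pod            string
	Request, Limit int64 // whole CPUs
}

// tier classifies a container the way the table does: equal request and
// limit is Guaranteed, anything else with a request is Burstable.
func tier(c container) string {
	if c.Request == c.Limit {
		return "Guaranteed"
	}
	return "Burstable"
}

// totals returns the CPU charged against quota (sum of requests) and the
// most the pods could consume if every container bursts to its limit.
func totals(cs []container) (charged, burst int64) {
	for _, c := range cs {
		charged += c.Request
		burst += c.Limit
	}
	return charged, burst
}

func main() {
	cs := []container{{"X", 1, 4}, {"Y", 2, 2}, {"Z", 1, 3}}
	for _, c := range cs {
		fmt.Println(c.Pod, tier(c))
	}
	charged, burst := totals(cs)
	fmt.Printf("charged against quota: %d cpu, possible burst: %d cpu\n", charged, burst)
}
```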
## Status API
A REST API endpoint to update the status section of the **ResourceQuota** is
exposed. It requires an atomic compare-and-swap in order to keep resource usage
tracking consistent.
## Resource Quota Controller
A resource quota controller monitors observed usage for tracked resources in the
**Namespace**.
If there is an observed difference between the current usage stats and the
current **ResourceQuota.Status**, the controller posts an update of the
currently observed usage metrics to the **ResourceQuota** via the /status
endpoint.
The resource quota controller is the only component capable of monitoring and
recording usage updates after a DELETE operation since admission control is
incapable of guaranteeing a DELETE request actually succeeded.
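One pass of such a controller might look like the following Go sketch. Everything here is hypothetical: `resourceList` is a plain map rather than the API type, and `post` stands in for a call to the /status endpoint.

```go
package main

import "fmt"

// resourceList is a simplified stand-in for the API's ResourceList.
type resourceList map[string]int64

// equal reports whether two resource lists hold the same entries.
func equal(a, b resourceList) bool {
	if len(a) != len(b) {
		return false
	}
	for name, v := range a {
		if w, ok := b[name]; !ok || w != v {
			return false
		}
	}
	return true
}

// syncQuota compares freshly observed usage with the recorded status and
// calls post (a stand-in for the /status endpoint) only when they differ.
// It reports whether an update was posted.
func syncQuota(observed, recorded resourceList, post func(resourceList) error) (bool, error) {
	if equal(observed, recorded) {
		return false, nil
	}
	if err := post(observed); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	recorded := resourceList{"pods": 3, "services": 1}
	observed := resourceList{"pods": 2, "services": 1} // a pod was deleted

	posted, err := syncQuota(observed, recorded, func(used resourceList) error {
		fmt.Println("posting usage:", used)
		return nil
	})
	fmt.Println(posted, err)
}
```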
## AdmissionControl plugin: ResourceQuota
The **ResourceQuota** plug-in introspects all incoming admission requests.
To enable the plug-in and support for ResourceQuota, the kube-apiserver must be
configured as follows:
```
$ kube-apiserver --admission-control=ResourceQuota
```
It makes decisions by evaluating the incoming object against all defined
**ResourceQuota.Status.Hard** resource limits in the request namespace. If
acceptance of the resource would cause the total usage of a named resource to
exceed its hard limit, the request is denied.
If the incoming request does not cause the total usage to exceed any of the
enumerated hard resource limits, the plug-in will post a
**ResourceQuota.Status** document to the server to atomically update the
observed usage based on the previously read **ResourceQuota.ResourceVersion**.
This keeps incremental usage atomically consistent, but does introduce a
bottleneck (intentionally) into the system.
To optimize system performance, it is encouraged that all resource quotas are
tracked on the same **ResourceQuota** document in a **Namespace**. As a result,
it is encouraged to cap the total number of individual quotas that are tracked
in the **Namespace** at 1 in the **ResourceQuota** document.
## kubectl
kubectl is modified to support the **ResourceQuota** resource.
`kubectl describe` provides a human-readable output of quota.
For example:
```console
$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/namespace.yaml
namespace "quota-example" created
$ kubectl create -f test/fixtures/doc-yaml/admin/resourcequota/quota.yaml --namespace=quota-example
resourcequota "quota" created
$ kubectl describe quota quota --namespace=quota-example
Name: quota
Namespace: quota-example
Resource Used Hard
-------- ---- ----
cpu 0 20
memory 0 1Gi
persistentvolumeclaims 0 10
pods 0 10
replicationcontrollers 0 20
resourcequotas 1 1
secrets 1 10
services 0 5
```
## More information
See [resource quota document](../admin/resource-quota.md) and the [example of Resource Quota](../admin/resourcequota/) for more information.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_resource_quota.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/admission_control_resource_quota.md)

Binary file not shown.


@ -1,85 +1 @@
# Kubernetes architecture
A running Kubernetes cluster contains node agents (`kubelet`) and master
components (APIs, scheduler, etc.), on top of a distributed storage solution.
This diagram shows our desired eventual state, though we're still working on a
few things, like making `kubelet` itself (all our components, really) run within
containers, and making the scheduler 100% pluggable.
![Architecture Diagram](architecture.png?raw=true "Architecture overview")
## The Kubernetes Node
When looking at the architecture of the system, we'll break it down to services
that run on the worker node and services that compose the cluster-level control
plane.
The Kubernetes node has the services necessary to run application containers and
be managed from the master systems.
Each node runs Docker, of course. Docker takes care of the details of
downloading images and running containers.
### `kubelet`
The `kubelet` manages [pods](../user-guide/pods.md) and their containers, their
images, their volumes, etc.
### `kube-proxy`
Each node also runs a simple network proxy and load balancer (see the
[services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for
more details). This reflects `services` (see
[the services doc](../user-guide/services.md) for more details) as defined in
the Kubernetes API on each node and can do simple TCP and UDP stream forwarding
(round robin) across a set of backends.
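Round-robin selection across a set of backends can be illustrated with a few lines of Go. This is a hypothetical, simplified picker and not kube-proxy's actual load balancer; the backend addresses are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// roundRobin hands out backends in rotation. It is a hypothetical,
// simplified picker, not kube-proxy's actual load balancer.
type roundRobin struct {
	mu       sync.Mutex
	backends []string
	next     int
}

// pick returns the next backend, or false when there are none.
func (r *roundRobin) pick() (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.backends) == 0 {
		return "", false
	}
	b := r.backends[r.next%len(r.backends)]
	r.next = (r.next + 1) % len(r.backends)
	return b, true
}

func main() {
	// Example addresses only; real backends come from the service's endpoints.
	rr := &roundRobin{backends: []string{"10.244.1.5:6379", "10.244.2.7:6379"}}
	for i := 0; i < 3; i++ {
		b, _ := rr.pick()
		fmt.Println(b)
	}
}
```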
Service endpoints are currently found via [DNS](../admin/dns.md) or through
environment variables (both
[Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and
Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are
supported). These variables resolve to ports managed by the service proxy.
## The Kubernetes Control Plane
The Kubernetes control plane is split into a set of components. Currently they
all run on a single _master_ node, but that is expected to change soon in order
to support high-availability clusters. These components work together to provide
a unified view of the cluster.
### `etcd`
All persistent master state is stored in an instance of `etcd`. This provides a
great way to store configuration data reliably. With `watch` support,
coordinating components can be notified very quickly of changes.
### Kubernetes API Server
The apiserver serves up the [Kubernetes API](../api.md). It is intended to be a
CRUD-y server, with most/all business logic implemented in separate components
or in plug-ins. It mainly processes REST operations, validates them, and updates
the corresponding objects in `etcd` (and eventually other stores).
### Scheduler
The scheduler binds unscheduled pods to nodes via the `/binding` API. The
scheduler is pluggable, and we expect to support multiple cluster schedulers and
even user-provided schedulers in the future.
### Kubernetes Controller Manager Server
All other cluster-level functions are currently performed by the Controller
Manager. For instance, `Endpoints` objects are created and updated by the
endpoints controller, and nodes are discovered, managed, and monitored by the
node controller. These could eventually be split into separate components to
make them independently pluggable.
The [`replicationcontroller`](../user-guide/replication-controller.md) is a
mechanism that is layered on top of the simple [`pod`](../user-guide/pods.md)
API. We eventually plan to port it to a generic plug-in mechanism, once one is
implemented.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture.md)

Binary file not shown.


File diff suppressed because it is too large



@ -1,310 +1 @@
# Peeking under the hood of Kubernetes on AWS
This document provides high-level insight into how Kubernetes works on AWS and
maps to AWS objects. We assume that you are familiar with AWS.
We encourage you to use [kube-up](../getting-started-guides/aws.md) to create
clusters on AWS. We recommend that you avoid manual configuration but are aware
that sometimes it's the only option.
Tip: You should open an issue and let us know what enhancements can be made to
the scripts to better suit your needs.
That said, it's also useful to know what's happening under the hood when
Kubernetes clusters are created on AWS. This can be particularly useful if
problems arise or in circumstances where the provided scripts are lacking and
you manually created or configured your cluster.
**Table of contents:**
* [Architecture overview](#architecture-overview)
* [Storage](#storage)
* [Auto Scaling group](#auto-scaling-group)
* [Networking](#networking)
* [NodePort and LoadBalancer services](#nodeport-and-loadbalancer-services)
* [Identity and access management (IAM)](#identity-and-access-management-iam)
* [Tagging](#tagging)
* [AWS objects](#aws-objects)
* [Manual infrastructure creation](#manual-infrastructure-creation)
* [Instance boot](#instance-boot)
### Architecture overview
Kubernetes is a cluster of several machines that consists of a Kubernetes
master and a set number of nodes (previously known as 'minions') for which the
master is responsible. See the [Architecture](architecture.md) topic for
more details.
By default on AWS:
* Instances run Ubuntu 15.04 (the official AMI). It includes a sufficiently
modern kernel that pairs well with Docker and doesn't require a
reboot. (The default SSH user is `ubuntu` for this and other Ubuntu images.)
* Nodes use aufs instead of ext4 as the filesystem / container storage (mostly
because this is what Google Compute Engine uses).
You can override these defaults by passing different environment variables to
kube-up.
### Storage
AWS supports persistent volumes by using [Elastic Block Store (EBS)](../user-guide/volumes.md#awselasticblockstore).
These can then be attached to pods that should store persistent data (e.g. if
you're running a database).
By default, nodes in AWS use [instance storage](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html)
unless you create pods with persistent volumes
[(EBS)](../user-guide/volumes.md#awselasticblockstore). In general, Kubernetes
containers do not have persistent storage unless you attach a persistent
volume, and so nodes on AWS use instance storage. Instance storage is cheaper,
often faster, and historically more reliable. Unless you can make do with
whatever space is left on your root partition, you must choose an instance type
that provides you with sufficient instance storage for your needs.
To configure Kubernetes to use EBS storage, pass the environment variable
`KUBE_AWS_STORAGE=ebs` to kube-up.
Note: The master uses a persistent volume ([etcd](architecture.md#etcd)) to
track its state. Similar to nodes, containers are mostly run against instance
storage, except that we repoint some important data onto the persistent volume.
The default storage driver for Docker images is aufs. Specifying btrfs (by
passing the environment variable `DOCKER_STORAGE=btrfs` to kube-up) is also a
good choice for a filesystem. btrfs is relatively reliable with Docker and has
improved its reliability with modern kernels. It can easily span multiple
volumes, which is particularly useful when we are using an instance type with
multiple ephemeral instance disks.
### Auto Scaling group
Nodes (but not the master) are run in an
[Auto Scaling group](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroup.html)
on AWS. Currently auto-scaling (e.g. based on CPU) is not actually enabled
([#11935](http://issues.k8s.io/11935)). Instead, the Auto Scaling group means
that AWS will relaunch any nodes that are terminated.
We do not currently run the master in an AutoScalingGroup, but we should
([#11934](http://issues.k8s.io/11934)).
### Networking
Kubernetes uses an IP-per-pod model. This means that a node, which runs many
pods, must have many IPs. AWS uses virtual private clouds (VPCs) and advanced
routing support so each node is assigned a /24 CIDR for its pods. The assigned CIDR is then
configured to route to an instance in the VPC routing table.
It is also possible to use overlay networking on AWS, but that is not the
default configuration of the kube-up script.
### NodePort and LoadBalancer services
Kubernetes on AWS integrates with [Elastic Load Balancing
(ELB)](http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/US_SetUpASLBApp.html).
When you create a service with `Type=LoadBalancer`, Kubernetes (the
kube-controller-manager) will create an ELB, create a security group for the
ELB which allows access on the service ports, attach all the nodes to the ELB,
and modify the security group for the nodes to allow traffic from the ELB to
the nodes. This traffic reaches kube-proxy where it is then forwarded to the
pods.
ELB has some restrictions:
* ELB requires that all nodes listen on a single port.
* ELB acts as a forwarding proxy (i.e. the source IP is not preserved, but see below
on ELB annotations for pods speaking HTTP).
To work with these restrictions, in Kubernetes, [LoadBalancer
services](../user-guide/services.md#type-loadbalancer) are exposed as
[NodePort services](../user-guide/services.md#type-nodeport). Then
kube-proxy listens externally on the cluster-wide port that's assigned to
NodePort services and forwards traffic to the corresponding pods.
For example, if we configure a service of Type LoadBalancer with a
public port of 80:
* Kubernetes will assign a NodePort to the service (e.g. port 31234)
* ELB is configured to proxy traffic on the public port 80 to the NodePort
assigned to the service (in this example port 31234).
* Then any incoming traffic that ELB forwards to the NodePort (31234)
is recognized by kube-proxy and sent to the correct pods for that service.
Note that we do not automatically open NodePort services in the AWS firewall
(although we do open LoadBalancer services). This is because we expect that
NodePort services are more of a building block for things like inter-cluster
services or for LoadBalancer. To consume a NodePort service externally, you
will likely have to open the port in the node security group
(`kubernetes-node-<clusterid>`).
For SSL support, starting with 1.3 two annotations can be added to a service:
```
service.beta.kubernetes.io/aws-load-balancer-ssl-cert=arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012
```
The first specifies which certificate to use. It can be either a
certificate from a third party issuer that was uploaded to IAM or one created
within AWS Certificate Manager.
```
service.beta.kubernetes.io/aws-load-balancer-backend-protocol=(https|http|ssl|tcp)
```
The second annotation specifies which protocol a pod speaks. For HTTPS and
SSL, the ELB will expect the pod to authenticate itself over the encrypted
connection.
HTTP and HTTPS will select layer 7 proxying: the ELB will terminate
the connection with the user, parse headers and inject the `X-Forwarded-For`
header with the user's IP address (pods will only see the IP address of the
ELB at the other end of its connection) when forwarding requests.
TCP and SSL will select layer 4 proxying: the ELB will forward traffic without
modifying the headers.
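The effect of the backend-protocol annotation, as described above, can be summarized in a small lookup. This Go sketch is illustrative only: `proxyMode` and `modeFor` are hypothetical names, and the sketch assumes the annotation value is given in lower case.

```go
package main

import "fmt"

// proxyMode summarizes how the ELB treats a backend protocol.
type proxyMode struct {
	Layer            int  // 7 for http/https, 4 for tcp/ssl
	AddsForwardedFor bool // whether X-Forwarded-For is injected
	EncryptedBackend bool // whether the pod must serve an encrypted connection
}

// modeFor maps the aws-load-balancer-backend-protocol value to the
// behaviour described in the text. Values are assumed to be lower case.
func modeFor(protocol string) (proxyMode, error) {
	switch protocol {
	case "http":
		return proxyMode{Layer: 7, AddsForwardedFor: true}, nil
	case "https":
		return proxyMode{Layer: 7, AddsForwardedFor: true, EncryptedBackend: true}, nil
	case "tcp":
		return proxyMode{Layer: 4}, nil
	case "ssl":
		return proxyMode{Layer: 4, EncryptedBackend: true}, nil
	default:
		return proxyMode{}, fmt.Errorf("unsupported backend protocol %q", protocol)
	}
}

func main() {
	for _, p := range []string{"http", "https", "tcp", "ssl"} {
		mode, _ := modeFor(p)
		fmt.Printf("%-5s %+v\n", p, mode)
	}
}
```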
### Identity and Access Management (IAM)
kube-up sets up two IAM roles, one for the master called
[kubernetes-master](../../cluster/aws/templates/iam/kubernetes-master-policy.json)
and one for the nodes called
[kubernetes-node](../../cluster/aws/templates/iam/kubernetes-minion-policy.json).
The master is responsible for creating ELBs and configuring them, as well as
setting up advanced VPC routing. Currently it has blanket permissions on EC2,
along with rights to create and destroy ELBs.
The nodes do not need a lot of access to the AWS APIs. They need to download
a distribution file, and then are responsible for attaching and detaching EBS
volumes to and from themselves.
The node policy is relatively minimal. In 1.2 and later, nodes can retrieve ECR
authorization tokens, refresh them every 12 hours if needed, and fetch Docker
images from ECR, as long as the appropriate permissions are enabled. Those in
[AmazonEC2ContainerRegistryReadOnly](http://docs.aws.amazon.com/AmazonECR/latest/userguide/ecr_managed_policies.html#AmazonEC2ContainerRegistryReadOnly),
without write access, should suffice. The master policy is probably overly
permissive. The security conscious may want to lock down the IAM policies
further ([#11936](http://issues.k8s.io/11936)).
We should make it easier to extend IAM permissions and also ensure that they
are correctly configured ([#14226](http://issues.k8s.io/14226)).
### Tagging
All AWS resources are tagged with a tag named "KubernetesCluster", with a value
that is the unique cluster-id. This tag is used to identify a particular
'instance' of Kubernetes, even if two clusters are deployed into the same VPC.
Resources are considered to belong to the same cluster if and only if they have
the same value in the tag named "KubernetesCluster". (The kube-up script is
not configured to create multiple clusters in the same VPC by default, but it
is possible to create another cluster in the same VPC.)
Within the AWS cloud provider logic, we filter requests to the AWS APIs to
match resources with our cluster tag. By filtering the requests, we ensure
that we see only our own AWS objects.
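The ownership rule can be sketched as a filter over tagged resources. This Go example is a simplification: the `resource` type and `ownedBy` helper are hypothetical, the resource IDs are made up, and the filtering is done on an in-memory slice rather than in the requests sent to the AWS APIs.

```go
package main

import "fmt"

// clusterTagKey is the tag name used to mark cluster ownership.
const clusterTagKey = "KubernetesCluster"

// resource is a hypothetical stand-in for any tagged AWS object.
type resource struct {
	ID   string
	Tags map[string]string
}

// ownedBy keeps only the resources whose cluster tag equals clusterID.
// Untagged resources never match, even for an empty clusterID.
func ownedBy(clusterID string, all []resource) []resource {
	var owned []resource
	for _, r := range all {
		if v, ok := r.Tags[clusterTagKey]; ok && v == clusterID {
			owned = append(owned, r)
		}
	}
	return owned
}

func main() {
	all := []resource{
		{ID: "i-aaa", Tags: map[string]string{clusterTagKey: "prod"}},
		{ID: "i-bbb", Tags: map[string]string{clusterTagKey: "staging"}},
		{ID: "i-ccc", Tags: nil}, // not part of any cluster
	}
	for _, r := range ownedBy("prod", all) {
		fmt.Println(r.ID)
	}
}
```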
**Important:** If you choose not to use kube-up, you must pick a unique
cluster-id value, and ensure that all AWS resources have a tag with
`Name=KubernetesCluster,Value=<clusterid>`.
### AWS objects
The kube-up script does a number of things in AWS:
* Creates an S3 bucket (`AWS_S3_BUCKET`) and then copies the Kubernetes
distribution and the salt scripts into it. They are made world-readable and the
HTTP URLs are passed to instances; this is how Kubernetes code gets onto the
machines.
* Creates two IAM profiles based on templates in [cluster/aws/templates/iam](../../cluster/aws/templates/iam/):
* `kubernetes-master` is used by the master.
* `kubernetes-node` is used by nodes.
* Creates an AWS SSH key named `kubernetes-<fingerprint>`. Fingerprint here is
the OpenSSH key fingerprint, so that multiple users can run the script with
different keys and their keys will not collide (with near-certainty). It will
use an existing key if one is found at `AWS_SSH_KEY`, otherwise it will create
one there. (With the default Ubuntu images, if you have to SSH in: the user is
`ubuntu` and that user can `sudo`).
* Creates a VPC for use with the cluster (with a CIDR of 172.20.0.0/16) and
enables the `dns-support` and `dns-hostnames` options.
* Creates an internet gateway for the VPC.
* Creates a route table for the VPC, with the internet gateway as the default
route.
* Creates a subnet (with a CIDR of 172.20.0.0/24) in the AZ `KUBE_AWS_ZONE`
(defaults to us-west-2a). Currently, each Kubernetes cluster runs in a
single AZ on AWS. However, there are two philosophies under discussion on how to
achieve High Availability (HA):
* cluster-per-AZ: An independent cluster for each AZ, where each cluster
is entirely separate.
* cross-AZ-clusters: A single cluster spans multiple AZs.
The debate is open here, where cluster-per-AZ is discussed as more robust but
cross-AZ-clusters are more convenient.
* Associates the subnet to the route table.
* Creates security groups for the master (`kubernetes-master-<clusterid>`)
and the nodes (`kubernetes-node-<clusterid>`).
* Configures security groups so that masters and nodes can communicate. This
includes intercommunication between masters and nodes, opening SSH publicly
for both masters and nodes, and opening port 443 on the master for the HTTPS
API endpoints.
* Creates an EBS volume for the master of size `MASTER_DISK_SIZE` and type
`MASTER_DISK_TYPE`.
* Launches a master with a fixed IP address (172.20.0.9) that is also
configured for the security group and all the necessary IAM credentials. An
instance script is used to pass vital configuration information to Salt. Note:
The hope is that over time we can reduce the amount of configuration
information that must be passed in this way.
* Once the instance is up, it attaches the EBS volume and sets up a manual
routing rule for the internal network range (`MASTER_IP_RANGE`, defaults to
10.246.0.0/24).
* For auto-scaling, it creates a launch configuration and group for the nodes.
The name for both is <*KUBE_AWS_INSTANCE_PREFIX*>-node-group. The default
name is kubernetes-node-group. The auto-scaling group has a min and max size
that are both set to NUM_NODES. You can change the size of the auto-scaling
group to add or remove nodes from within the AWS API or
Console. Each node self-configures, meaning that it comes up; runs Salt with
the stored configuration; connects to the master; is assigned an internal CIDR;
and then the master configures the route-table with the assigned CIDR. The
kube-up script performs a health-check on the nodes but it's a self-check that
is not required.
If attempting this configuration manually, it is recommended to follow along
with the kube-up script, being sure to tag everything with a tag with name
`KubernetesCluster` and value set to a unique cluster-id. Also, passing the
right configuration options to Salt when not using the script is tricky: the
plan here is to simplify this by having Kubernetes take on more node
configuration, and even potentially remove Salt altogether.
### Manual infrastructure creation
While this work is not yet complete, advanced users might choose to manually
create certain AWS objects while still making use of the kube-up script (to
configure Salt, for example). These objects can currently be manually created:
* Set the `AWS_S3_BUCKET` environment variable to use an existing S3 bucket.
* Set the `VPC_ID` environment variable to reuse an existing VPC.
* Set the `SUBNET_ID` environment variable to reuse an existing subnet.
* If your route table has a matching `KubernetesCluster` tag, it will be reused.
* If your security groups are appropriately named, they will be reused.
Currently there is no way to do the following with kube-up:
* Use an existing AWS SSH key with an arbitrary name.
* Override the IAM credentials in a sensible way
([#14226](http://issues.k8s.io/14226)).
* Use different security group permissions.
* Configure your own auto-scaling groups.
If any of the above items apply to your situation, open an issue to request an
enhancement to the kube-up script. You should provide a complete description of
the use-case, including all the details around what you want to accomplish.
### Instance boot
The instance boot procedure is currently pretty complicated, primarily because
we must marshal configuration from Bash to Salt via the AWS instance script.
As we move more post-boot configuration out of Salt and into Kubernetes, we
will hopefully be able to simplify this.
When the kube-up script launches instances, it builds an instance startup
script which includes some configuration options passed to kube-up, and
concatenates some of the scripts found in the cluster/aws/templates directory.
These scripts are responsible for mounting and formatting volumes, downloading
Salt and Kubernetes from the S3 bucket, and then triggering Salt to actually
install Kubernetes.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/aws_under_the_hood.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/aws_under_the_hood.md)


@ -1,128 +1 @@
# Clustering in Kubernetes
## Overview
The term "clustering" refers to the process of having all members of the
Kubernetes cluster find and trust each other. There are multiple different ways
to achieve clustering with different security and usability profiles. This
document attempts to lay out the user experiences for clustering that Kubernetes
aims to address.
Once a cluster is established, the following is true:
1. **Master -> Node** The master needs to know which nodes can take work and
what their current status is with respect to capacity.
    1. **Location** The master knows the name and location of all of the nodes in
    the cluster.
        * For the purposes of this doc, location and name should be enough
        information so that the master can open a TCP connection to the Node. Most
        probably we will make this either an IP address or a DNS name. It is going to be
        important to be consistent here (master must be able to reach kubelet on that
        DNS name) so that we can verify certificates appropriately.
    2. **Target AuthN** A way to securely talk to the kubelet on that node.
    Currently we call out to the kubelet over HTTP. This should be over HTTPS and
    the master should know what CA to trust for that node.
    3. **Caller AuthN/Z** This would be the master verifying itself (and
    permissions) when calling the node. Currently, this is only used to collect
    statistics as authorization isn't critical. This may change in the future
    though.
2. **Node -> Master** The nodes currently talk to the master to know which pods
have been assigned to them and to publish events.
    1. **Location** The nodes must know where the master is.
    2. **Target AuthN** Since the master is assigning work to the nodes, it is
    critical that they verify whom they are talking to.
    3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to
    the master. Ideally this authentication is specific to each node so that
    authorization can be narrowly scoped. The details of the work to run (including
    things like environment variables) might be considered sensitive and should be
    locked down also.
**Note:** While the description here refers to a singular Master, in the future
we should enable multiple Masters operating in an HA mode. While the "Master" is
currently the combination of the API Server, Scheduler and Controller Manager,
we will restrict ourselves to thinking about the main API and policy engine --
the API Server.
## Current Implementation
A central authority (generally the master) is responsible for determining the
set of machines which are members of the cluster. Calls to create and remove
worker nodes in the cluster are restricted to this single authority, and any
other requests to add or remove worker nodes are rejected. (1.i.)
Communication from the master to nodes is currently over HTTP and is not secured
or authenticated in any way. (1.ii, 1.iii.)
The location of the master is communicated out of band to the nodes. For GCE,
this is done via Salt. Other cluster instructions/scripts use other methods.
(2.i.)
Currently most communication from the node to the master is over HTTP. When it
is done over HTTPS, the master's certificate is not currently verified.
(2.ii.)
Currently, the node/kubelet is authenticated to the master via a token shared
across all nodes. This token is distributed out of band (using Salt for GCE) and
is optional. If it is not present then the kubelet is unable to publish events
to the master. (2.iii.)
Our current mix of out of band communication doesn't meet all of our needs from
a security point of view and is difficult to set up and configure.
## Proposed Solution
The proposed solution will provide a range of options for setting up and
maintaining a secure Kubernetes cluster. We want to allow for both centrally
controlled systems (leveraging pre-existing trust and configuration systems) and
more ad-hoc, automagic systems that are incredibly easy to set up.
The building blocks of an easier solution:
* **Move to TLS** We will move to using TLS for all intra-cluster communication.
We will explicitly identify the trust chain (the set of trusted CAs) as opposed
to trusting the system CAs. We will also use client certificates for all AuthN.
* [optional] **API driven CA** Optionally, we will run a CA in the master that
will mint certificates for the nodes/kubelets. There will be pluggable policies
that will automatically approve certificate requests here as appropriate.
* **CA approval policy** This is a pluggable policy object that can
automatically approve CA signing requests. Stock policies will include
`always-reject`, `queue` and `insecure-always-approve`. With `queue` there would
be an API for evaluating and accepting/rejecting requests. Cloud providers could
implement a policy here that verifies other out of band information and
automatically approves/rejects based on other external factors.
* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give
a node permission to register itself.
* To start with, we'd have the kubelets generate a cert/account in the form of
`kubelet:<host>`. We would then hard-code policy that gives that particular
account appropriate permissions. Over time, we can make the policy engine more
generic.
* [optional] **Bootstrap API endpoint** This is a helper service hosted outside
of the Kubernetes cluster that helps with initial discovery of the master.
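The pluggable CA approval policy described in the list above could be sketched roughly as follows. This is only an illustration: the `ApprovalPolicy` interface, the `SigningRequest` type, and the `policyByName` helper are hypothetical names invented for this sketch and are not part of any existing Kubernetes API. Only the three policy names come from the proposal.

```go
package main

import "fmt"

// Decision is the outcome of evaluating one signing request.
type Decision string

const (
	Approved Decision = "approved"
	Rejected Decision = "rejected"
	Pending  Decision = "pending" // parked until an admin accepts or rejects it
)

// SigningRequest is a hypothetical, heavily simplified signing request.
type SigningRequest struct {
	Account string // for example "kubelet:node-1"
}

// ApprovalPolicy is the (assumed) pluggable decision point.
type ApprovalPolicy interface {
	Evaluate(req SigningRequest) Decision
}

type alwaysReject struct{}

func (alwaysReject) Evaluate(SigningRequest) Decision { return Rejected }

type insecureAlwaysApprove struct{}

func (insecureAlwaysApprove) Evaluate(SigningRequest) Decision { return Approved }

// queue parks every request so that it can be reviewed through an API later.
type queue struct {
	pending []SigningRequest
}

func (q *queue) Evaluate(req SigningRequest) Decision {
	q.pending = append(q.pending, req)
	return Pending
}

// policyByName maps the stock policy names from the proposal to sketches.
func policyByName(name string) (ApprovalPolicy, error) {
	switch name {
	case "always-reject":
		return alwaysReject{}, nil
	case "queue":
		return &queue{}, nil
	case "insecure-always-approve":
		return insecureAlwaysApprove{}, nil
	}
	return nil, fmt.Errorf("unknown approval policy %q", name)
}

func main() {
	policy, err := policyByName("queue")
	if err != nil {
		panic(err)
	}
	fmt.Println(policy.Evaluate(SigningRequest{Account: "kubelet:node-1"}))
}
```

A cloud provider policy would, under the same assumptions, be another implementation of the interface that consults out-of-band information before returning `Approved` or `Rejected`.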
### Static Clustering
In this sequence diagram there is an out-of-band admin entity that creates all
certificates and distributes them. It also makes sure that the kubelets know
where to find the master. This provides a lot of control but is more difficult
to set up, as lots of information must be communicated outside of Kubernetes.
![Static Sequence Diagram](clustering/static.png)
### Dynamic Clustering
This diagram shows dynamic clustering using the bootstrap API endpoint. This
endpoint is used to both find the location of the master and communicate the
root CA for the master.
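As a rough sketch of what "trusting an explicit CA rather than the system CAs" could look like on the kubelet side, the Go snippet below builds a TLS client configuration from the CA data returned by the bootstrap endpoint. The function name `trustOnly` and the way the CA bytes are obtained are assumptions made for illustration; only the standard library calls are real.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// trustOnly returns a TLS client configuration that trusts only the CAs in
// caPEM (not the system roots) and presents clientCert, if one is given,
// for client-certificate authentication.
func trustOnly(caPEM []byte, clientCert *tls.Certificate) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no usable CA certificate found in PEM data")
	}
	cfg := &tls.Config{
		RootCAs:    pool, // explicit trust chain
		MinVersion: tls.VersionTLS12,
	}
	if clientCert != nil {
		cfg.Certificates = []tls.Certificate{*clientCert}
	}
	return cfg, nil
}

func main() {
	// Garbage input stands in for a bad or missing master-ca value.
	_, err := trustOnly([]byte("not a certificate"), nil)
	fmt.Println("rejected bad CA data:", err != nil)
}
```

Failing closed when the CA data cannot be parsed matters here: silently falling back to the system roots would defeat the purpose of distributing the master's CA.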
This flow has the admin manually approving the kubelet signing requests. This is
the `queue` policy defined above. This manual intervention could be replaced by
code that can verify the signing requests via other means.
![Dynamic Sequence Diagram](clustering/dynamic.png)
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/clustering.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/clustering.md)

View File

@ -1,26 +0,0 @@
# Copyright 2016 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM debian:jessie
RUN apt-get update
RUN apt-get -qy install python-seqdiag make curl
WORKDIR /diagrams
RUN curl -sLo DroidSansMono.ttf https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf
ADD . /diagrams
CMD bash -c 'make >/dev/stderr && tar cf - *.png'

View File

@ -1,41 +0,0 @@
# Copyright 2016 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FONT := DroidSansMono.ttf
PNGS := $(patsubst %.seqdiag,%.png,$(wildcard *.seqdiag))
.PHONY: all
all: $(PNGS)
.PHONY: watch
watch:
	fswatch *.seqdiag | xargs -n 1 sh -c "make || true"
$(FONT):
	curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT)
%.png: %.seqdiag $(FONT)
	seqdiag --no-transparency -a -f '$(FONT)' $<
# Build the stuff via a docker image
.PHONY: docker
docker:
	docker build -t clustering-seqdiag .
	docker run --rm clustering-seqdiag | tar xvf -
.PHONY: docker-clean
docker-clean:
	docker rmi clustering-seqdiag || true
	docker images -q --filter "dangling=true" | xargs docker rmi

View File

@ -1,35 +1 @@
This directory contains diagrams for the clustering design doc.
This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index.html).
Assuming you have a non-borked python install, this should be installable with:
```sh
pip install seqdiag
```
Just call `make` to regenerate the diagrams.
## Building with Docker
If you are on a Mac or your pip install is messed up, you can easily build with
docker:
```sh
make docker
```
The first run will be slow but things should be fast after that.
To clean up the docker containers that are created (and other cruft that is left
around) you can run `make docker-clean`.
## Automatically rebuild on file changes
If you have the fswatch utility installed, you can have it monitor the file
system and automatically rebuild when files have changed. Just do a
`make watch`.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/clustering/README.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/clustering/README.md)

Binary file not shown.

View File

@ -1,24 +0,0 @@
seqdiag {
activation = none;
user[label = "Admin User"];
bootstrap[label = "Bootstrap API\nEndpoint"];
master;
kubelet[stacked];
user -> bootstrap [label="createCluster", return="cluster ID"];
user <-- bootstrap [label="returns\n- bootstrap-cluster-uri"];
user ->> master [label="start\n- bootstrap-cluster-uri"];
master => bootstrap [label="setMaster\n- master-location\n- master-ca"];
user ->> kubelet [label="start\n- bootstrap-cluster-uri"];
kubelet => bootstrap [label="get-master", return="returns\n- master-location\n- master-ca"];
kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="returns\n- kubelet-cert"];
user => master [label="getSignRequests"];
user => master [label="approveSignRequests"];
kubelet <<-- master [label="returns\n- kubelet-cert"];
kubelet => master [label="register\n- kubelet-location"]
}

Binary file not shown.

View File

@ -1,16 +0,0 @@
seqdiag {
activation = none;
admin[label = "Manual Admin"];
ca[label = "Manual CA"]
master;
kubelet[stacked];
admin => ca [label="create\n- master-cert"];
admin ->> master [label="start\n- ca-root\n- master-cert"];
admin => ca [label="create\n- kubelet-cert"];
admin ->> kubelet [label="start\n- ca-root\n- kubelet-cert\n- master-location"];
kubelet => master [label="register\n- kubelet-location"];
}

View File

@ -1,158 +1 @@
# Container Command Execution & Port Forwarding in Kubernetes
## Abstract
This document describes how to use Kubernetes to execute commands in containers,
with stdin/stdout/stderr streams attached and how to implement port forwarding
to the containers.
## Background
See the following related issues/PRs:
- [Support attach](http://issue.k8s.io/1521)
- [Real container ssh](http://issue.k8s.io/1513)
- [Provide easy debug network access to services](http://issue.k8s.io/1863)
- [OpenShift container command execution proposal](https://github.com/openshift/origin/pull/576)
## Motivation
Users and administrators are accustomed to being able to access their systems
via SSH to run remote commands, get shell access, and do port forwarding.
Supporting SSH to containers in Kubernetes is a difficult task. You must
specify a "user" and a hostname to make an SSH connection, and `sshd` requires
real users (resolvable by NSS and PAM). Because a container belongs to a pod,
and the pod belongs to a namespace, you need to specify namespace/pod/container
to uniquely identify the target container. Unfortunately, a
namespace/pod/container is not a real user as far as SSH is concerned. Also,
most Linux systems limit user names to 32 characters, which is unlikely to be
large enough to contain namespace/pod/container. We could devise some scheme to
map each namespace/pod/container to a 32-character user name, adding entries to
`/etc/passwd` (or LDAP, etc.) and keeping those entries fully in sync all the
time. Alternatively, we could write custom NSS and PAM modules that allow the
host to resolve a namespace/pod/container to a user without needing to keep
files or LDAP in sync.
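To make the mapping idea concrete, here is a minimal sketch of one possible scheme: hash the namespace/pod/container triple and truncate the digest to fit the 32-character limit. The function name, the `k` prefix, and the choice of SHA-256 are all assumptions for illustration and not something the design prescribes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// maxUserNameLen is the common Linux user name limit mentioned above.
const maxUserNameLen = 32

// userNameFor derives a fixed-length user name from the triple that
// identifies a container. The "k" prefix keeps the name from starting
// with a digit; the rest is a truncated hex digest.
func userNameFor(namespace, pod, container string) string {
	sum := sha256.Sum256([]byte(namespace + "/" + pod + "/" + container))
	return "k" + hex.EncodeToString(sum[:])[:maxUserNameLen-1]
}

func main() {
	name := userNameFor("default", "config-volume-example", "redis")
	fmt.Println(name, len(name))
}
```

Such a mapping is one-way: the host still cannot recover the namespace/pod/container from the name, so the sync problem with `/etc/passwd` or LDAP described above remains.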
As an alternative to SSH, we are using a multiplexed streaming protocol that
runs on top of HTTP. There are no requirements about users being real users,
nor is there any limitation on user name length, as the protocol is under our
control. The only downside is that standard tooling that expects to use SSH
won't be able to work with this mechanism, unless adapters can be written.
## Constraints and Assumptions
- SSH support is not currently in scope.
- CGroup confinement is ultimately desired, but implementing that support is not
currently in scope.
- SELinux confinement is ultimately desired, but implementing that support is
not currently in scope.
## Use Cases
- A user of a Kubernetes cluster wants to run arbitrary commands in a
container with local stdin/stdout/stderr attached to the container.
- A user of a Kubernetes cluster wants to connect to local ports on their
computer and have them forwarded to ports in a container.
## Process Flow
### Remote Command Execution Flow
1. The client connects to the Kubernetes Master to initiate a remote command
execution request.
2. The Master proxies the request to the Kubelet where the container lives.
3. The Kubelet executes nsenter + the requested command and streams
stdin/stdout/stderr back and forth between the client and the container.
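Step 3 can be pictured as the Kubelet assembling an `nsenter` command line around the requested command. The sketch below only builds the argument list and does not execute anything. The helper name and the assumption that the Kubelet already knows the container's init process ID are hypothetical; the flags are the long options of the util-linux `nsenter` tool.

```go
package main

import (
	"fmt"
	"strconv"
)

// nsenterArgs builds an argv that joins the mount, UTS, IPC, network and PID
// namespaces of the process containerPID and then runs command there.
func nsenterArgs(containerPID int, command []string) []string {
	args := []string{
		"nsenter",
		"--target", strconv.Itoa(containerPID),
		"--mount", "--uts", "--ipc", "--net", "--pid",
		"--", // everything after this is the user's command
	}
	return append(args, command...)
}

func main() {
	fmt.Println(nsenterArgs(4242, []string{"ls", "-l", "/"}))
}
```

Passing the command as a separate argument list, rather than as one shell string, avoids re-parsing user input on the node.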
### Port Forwarding Flow
1. The client connects to the Kubernetes Master to initiate a port forwarding
request.
2. The Master proxies the request to the Kubelet where the container lives.
3. The client listens on each specified local port, awaiting local connections.
4. The client connects to one of the local listening ports.
5. The client notifies the Kubelet of the new connection.
6. The Kubelet executes nsenter + socat and streams data back and forth between
the client and the port in the container.
## Design Considerations
### Streaming Protocol
The current multiplexed streaming protocol used is SPDY. This is not the
long-term desire, however. As soon as there is viable support for HTTP/2 in Go,
we will switch to that.
### Master as First Level Proxy
Clients should not be allowed to communicate directly with the Kubelet for
security reasons. Therefore, the Master is currently the only suggested entry
point to be used for remote command execution and port forwarding. This is not
necessarily desirable, as it means that all remote command execution and port
forwarding traffic must travel through the Master, potentially impacting other
API requests.
In the future, it might make more sense to retrieve an authorization token from
the Master, and then use that token to initiate a remote command execution or
port forwarding request with a load balanced proxy service dedicated to this
functionality. This would keep the streaming traffic out of the Master.
### Kubelet as Backend Proxy
The kubelet is currently responsible for handling remote command execution and
port forwarding requests. Just like with the Master described above, this means
that all remote command execution and port forwarding streaming traffic must
travel through the Kubelet, which could result in a degraded ability to service
other requests.
In the future, it might make more sense to use a separate service on the node.
Alternatively, we could possibly inject a process into the container that only
listens for a single request, expose that process's listening port on the node,
and then issue a redirect to the client such that it would connect to the first
level proxy, which would then proxy directly to the injected process's exposed
port. This would minimize the amount of proxying that takes place.
### Scalability
There are at least 2 different ways to execute a command in a container:
`docker exec` and `nsenter`. While `docker exec` might seem like an easier and
more obvious choice, it has some drawbacks.
#### `docker exec`
We could expose `docker exec` (i.e. have Docker listen on an exposed TCP port
on the node), but this would require proxying from the edge and securing the
Docker API. `docker exec` calls go through the Docker daemon, meaning that all
stdin/stdout/stderr traffic is proxied through the Daemon, adding an extra hop.
Additionally, you can't isolate 1 malicious `docker exec` call from normal
usage, meaning an attacker could initiate a denial of service or other attack
and take down the Docker daemon, or the node itself.
We expect remote command execution and port forwarding requests to be long
running and/or high bandwidth operations, and routing all the streaming data
through the Docker daemon feels like a bottleneck we can avoid.
#### `nsenter`
The implementation currently uses `nsenter` to run commands in containers,
joining the appropriate container namespaces. `nsenter` runs directly on the
node and is not proxied through any single daemon process.
### Security
Authentication and authorization have not yet been specifically tested with
this functionality. We need to make sure that users are not allowed to execute
remote commands or do port forwarding to containers they aren't allowed to
access.
Additional work is required to ensure that multiple command execution or port
forwarding connections from different clients are not able to see each other's
data. This can most likely be achieved via SELinux labeling and unique process
contexts.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/command_execution_port_forwarding.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/command_execution_port_forwarding.md)

View File

@ -1,300 +1 @@
# Generic Configuration Object
## Abstract
The `ConfigMap` API resource stores data used for the configuration of
applications deployed on Kubernetes.
The main focus of this resource is to:
* Provide dynamic distribution of configuration data to deployed applications.
* Encapsulate configuration information and simplify `Kubernetes` deployments.
* Create a flexible configuration model for `Kubernetes`.
## Motivation
A `Secret`-like API resource is needed to store configuration data that pods can
consume.
Goals of this design:
1. Describe a `ConfigMap` API resource.
2. Describe the semantics of consuming `ConfigMap` as environment variables.
3. Describe the semantics of consuming `ConfigMap` as files in a volume.
## Use Cases
1. As a user, I want to be able to consume configuration data as environment
variables.
2. As a user, I want to be able to consume configuration data as files in a
volume.
3. As a user, I want my view of configuration data in files to be eventually
consistent with changes to the data.
### Consuming `ConfigMap` as Environment Variables
A series of events for consuming `ConfigMap` as environment variables:
1. Create a `ConfigMap` object.
2. Create a pod to consume the configuration data via environment variables.
3. The pod is scheduled onto a node.
4. The Kubelet retrieves the `ConfigMap` resource(s) referenced by the pod and
starts the container processes with the appropriate configuration data from
environment variables.
### Consuming `ConfigMap` in Volumes
A series of events for consuming `ConfigMap` as configuration files in a volume:
1. Create a `ConfigMap` object.
2. Create a new pod using the `ConfigMap` via a volume plugin.
3. The pod is scheduled onto a node.
4. The Kubelet creates an instance of the volume plugin and calls its `Setup()`
method.
5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod
and projects the appropriate configuration data into the volume.
### Consuming `ConfigMap` Updates
Any long-running system has configuration that is mutated over time. Changes
made to configuration data must be made visible to pods consuming data in
volumes so that they can respond to those changes.
The `resourceVersion` of the `ConfigMap` object will be updated by the API
server every time the object is modified. After an update, modifications will be
made visible to the consumer container:
1. Create a `ConfigMap` object.
2. Create a new pod using the `ConfigMap` via the volume plugin.
3. The pod is scheduled onto a node.
4. During the sync loop, the Kubelet creates an instance of the volume plugin
and calls its `Setup()` method.
5. The volume plugin retrieves the `ConfigMap` resource(s) referenced by the pod
and projects the appropriate data into the volume.
6. The `ConfigMap` referenced by the pod is updated.
7. During the next iteration of the `syncLoop`, the Kubelet creates an instance
of the volume plugin and calls its `Setup()` method.
8. The volume plugin projects the updated data into the volume atomically.
It is the consuming pod's responsibility to make use of the updated data once it
is made visible.
Because environment variables cannot be updated without restarting a container,
configuration data consumed in environment variables will not be updated.
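One common way to project updated files "atomically" is to write each version into its own directory and then swap a symlink with a rename. The sketch below illustrates that technique under stated assumptions: the directory layout, the `current` link name, and the function names are invented for this example and are not taken from the volume plugin itself.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// project writes data into a fresh directory named version under volumeDir,
// then atomically repoints the "current" symlink at it. Readers that go
// through "current" see either the old files or the new ones, never a mix.
func project(volumeDir, version string, data map[string]string) error {
	versionDir := filepath.Join(volumeDir, version)
	if err := os.MkdirAll(versionDir, 0o755); err != nil {
		return err
	}
	for key, value := range data {
		file := filepath.Join(versionDir, key)
		if err := os.WriteFile(file, []byte(value), 0o644); err != nil {
			return err
		}
	}
	tmpLink := filepath.Join(volumeDir, "current.tmp")
	_ = os.Remove(tmpLink) // clear a leftover from an interrupted update
	if err := os.Symlink(version, tmpLink); err != nil {
		return err
	}
	// rename(2) replaces the old link in a single step.
	return os.Rename(tmpLink, filepath.Join(volumeDir, "current"))
}

// readAfterUpdate projects two versions and reads the value a consumer sees.
func readAfterUpdate() (string, error) {
	dir, err := os.MkdirTemp("", "configmap-volume")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	if err := project(dir, "v1", map[string]string{"port": "6379"}); err != nil {
		return "", err
	}
	if err := project(dir, "v2", map[string]string{"port": "6380"}); err != nil {
		return "", err
	}
	content, err := os.ReadFile(filepath.Join(dir, "current", "port"))
	return string(content), err
}

func main() {
	got, err := readAfterUpdate()
	fmt.Println(got, err)
}
```

The sketch assumes a POSIX filesystem with symlink support and leaves out cleanup of old version directories.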
### Advantages
* Easy to consume in pods; consumer-agnostic
* Configuration data is persistent and versioned
* Consumers of configuration data in volumes can respond to changes in the data
## Proposed Design
### API Resource
The `ConfigMap` resource will be added to the main API:
```go
package api
// ConfigMap holds configuration data for pods to consume.
type ConfigMap struct {
TypeMeta `json:",inline"`
ObjectMeta `json:"metadata,omitempty"`
// Data contains the configuration data. Each key must be a valid
// DNS_SUBDOMAIN or leading dot followed by valid DNS_SUBDOMAIN.
Data map[string]string `json:"data,omitempty"`
}
type ConfigMapList struct {
TypeMeta `json:",inline"`
ListMeta `json:"metadata,omitempty"`
Items []ConfigMap `json:"items"`
}
```
A `Registry` implementation for `ConfigMap` will be added to
`pkg/registry/configmap`.
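The key rule in the `Data` comment (a DNS subdomain, optionally preceded by a single leading dot) could be checked along these lines. This is a sketch only: the regular expression and the 253-character limit reflect the usual DNS subdomain definition and are assumptions about how the rule would be enforced, and `validKey` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// maxSubdomainLen is the conventional DNS subdomain length limit (assumed).
const maxSubdomainLen = 253

// dnsSubdomain matches dot-separated labels of lowercase alphanumerics and
// '-', where every label starts and ends with an alphanumeric character.
var dnsSubdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// validKey reports whether key is a DNS subdomain, optionally preceded by
// one leading dot (as in ".hidden-file").
func validKey(key string) bool {
	name := strings.TrimPrefix(key, ".")
	return len(name) <= maxSubdomainLen && dnsSubdomain.MatchString(name)
}

func main() {
	for _, key := range []string{"redis.conf", ".hidden", "Bad_Key", "..x"} {
		fmt.Printf("%q valid: %v\n", key, validKey(key))
	}
}
```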
### Environment Variables
The `EnvVarSource` will be extended with a new selector for `ConfigMap`:
```go
package api
// EnvVarSource represents a source for the value of an EnvVar.
type EnvVarSource struct {
// other fields omitted
// Selects a key of a ConfigMap.
ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
}
// Selects a key from a ConfigMap.
type ConfigMapKeySelector struct {
// The ConfigMap to select from.
LocalObjectReference `json:",inline"`
// The key to select.
Key string `json:"key"`
}
```
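To show how these selectors might be resolved when a container is started, the following sketch looks up each referenced key in an in-memory set of `ConfigMap` objects. The types are simplified stand-ins for the API types above (the `Name` field replaces the inlined `LocalObjectReference`), and treating a missing `ConfigMap` or key as an error is an assumption of this sketch rather than something the design states.

```go
package main

import "fmt"

// ConfigMap is a pared-down stand-in for the API type.
type ConfigMap struct {
	Name string
	Data map[string]string
}

// ConfigMapKeySelector selects one key of one ConfigMap.
type ConfigMapKeySelector struct {
	Name string // stands in for the inlined LocalObjectReference
	Key  string
}

// EnvVar is a simplified environment variable that always uses a selector.
type EnvVar struct {
	Name            string
	ConfigMapKeyRef ConfigMapKeySelector
}

// resolveEnv turns selectors into concrete values, failing on the first
// reference that cannot be satisfied.
func resolveEnv(vars []EnvVar, configMaps map[string]ConfigMap) (map[string]string, error) {
	env := make(map[string]string, len(vars))
	for _, v := range vars {
		ref := v.ConfigMapKeyRef
		cm, ok := configMaps[ref.Name]
		if !ok {
			return nil, fmt.Errorf("configmap %q not found", ref.Name)
		}
		value, ok := cm.Data[ref.Key]
		if !ok {
			return nil, fmt.Errorf("key %q not found in configmap %q", ref.Key, ref.Name)
		}
		env[v.Name] = value
	}
	return env, nil
}

func main() {
	configMaps := map[string]ConfigMap{
		"etcd-env-config": {
			Name: "etcd-env-config",
			Data: map[string]string{"number-of-members": "1"},
		},
	}
	vars := []EnvVar{{
		Name:            "ETCD_NUM_MEMBERS",
		ConfigMapKeyRef: ConfigMapKeySelector{Name: "etcd-env-config", Key: "number-of-members"},
	}}
	fmt.Println(resolveEnv(vars, configMaps))
}
```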
### Volume Source
A new `ConfigMapVolumeSource` type of volume source containing the `ConfigMap`
object will be added to the `VolumeSource` struct in the API:
```go
package api
type VolumeSource struct {
// other fields omitted
ConfigMap *ConfigMapVolumeSource `json:"configMap,omitempty"`
}
// Represents a volume that holds configuration data.
type ConfigMapVolumeSource struct {
LocalObjectReference `json:",inline"`
// A list of keys to project into the volume.
// If unspecified, each key-value pair in the Data field of the
// referenced ConfigMap will be projected into the volume as a file whose name
// is the key and content is the value.
// If specified, the listed keys will be projected into the specified paths, and
// unlisted keys will not be present.
Items []KeyToPath `json:"items,omitempty"`
}
// Represents a mapping of a key to a relative path.
type KeyToPath struct {
// The name of the key to select
Key string `json:"key"`
// The relative path name of the file to be created.
// Must not be absolute or contain the '..' path. Must be utf-8 encoded.
// The first item of the relative path must not start with '..'
Path string `json:"path"`
}
```
**Note:** The update logic used in the downward API volume plug-in will be
extracted and re-used in the volume plug-in for `ConfigMap`.
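The constraints stated in the `KeyToPath.Path` comment can be expressed as a small validation function. The sketch below is one reading of those comments, not the actual validation code: `validatePath` is a hypothetical name, rejecting an empty path is an added assumption, and the UTF-8 requirement is not checked.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// validatePath checks a KeyToPath.Path value: it must be relative, must not
// contain a ".." element, and must not start with "..".
func validatePath(p string) error {
	if p == "" {
		return errors.New("path must not be empty") // assumption of this sketch
	}
	if path.IsAbs(p) {
		return errors.New("path must be relative")
	}
	for _, element := range strings.Split(p, "/") {
		if element == ".." {
			return errors.New("path must not contain '..'")
		}
	}
	if strings.HasPrefix(p, "..") {
		return errors.New("path must not start with '..'")
	}
	return nil
}

func main() {
	for _, p := range []string{"etc/redis.conf", "/etc/redis.conf", "a/../b", "..conf/x"} {
		fmt.Printf("%q: %v\n", p, validatePath(p))
	}
}
```

Checks like these keep a projected file from escaping the volume's own directory.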
### Changes to Secret
We will update the Secret volume plugin to have a similar API to the new
`ConfigMap` volume plugin. The secret volume plugin will also begin updating
secret content in the volume when secrets change.
## Examples
#### Consuming `ConfigMap` as Environment Variables
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: etcd-env-config
data:
  number-of-members: "1"
  initial-cluster-state: new
  initial-cluster-token: DUMMY_ETCD_INITIAL_CLUSTER_TOKEN
  discovery-token: DUMMY_ETCD_DISCOVERY_TOKEN
  discovery-url: http://etcd-discovery:2379
  etcdctl-peers: http://etcd:2379
```
This pod consumes the `ConfigMap` as environment variables:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: config-env-example
spec:
  containers:
  - name: etcd
    image: openshift/etcd-20-centos7
    ports:
    - containerPort: 2379
      protocol: TCP
    - containerPort: 2380
      protocol: TCP
    env:
    - name: ETCD_NUM_MEMBERS
      valueFrom:
        configMapKeyRef:
          name: etcd-env-config
          key: number-of-members
    - name: ETCD_INITIAL_CLUSTER_STATE
      valueFrom:
        configMapKeyRef:
          name: etcd-env-config
          key: initial-cluster-state
    - name: ETCD_DISCOVERY_TOKEN
      valueFrom:
        configMapKeyRef:
          name: etcd-env-config
          key: discovery-token
    - name: ETCD_DISCOVERY_URL
      valueFrom:
        configMapKeyRef:
          name: etcd-env-config
          key: discovery-url
    - name: ETCDCTL_PEERS
      valueFrom:
        configMapKeyRef:
          name: etcd-env-config
          key: etcdctl-peers
```
#### Consuming `ConfigMap` as Volumes
`redis-volume-config` is intended to be used as a volume containing a config
file:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-volume-config
data:
  redis.conf: "pidfile /var/run/redis.pid\nport 6379\ntcp-backlog 511\ndatabases 1\ntimeout 0\n"
```
The following pod consumes the `redis-volume-config` in a volume:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: config-volume-example
spec:
  containers:
  - name: redis
    image: kubernetes/redis
    command: ["redis-server", "/mnt/config-map/etc/redis.conf"]
    ports:
    - containerPort: 6379
    volumeMounts:
    - name: config-map-volume
      mountPath: /mnt/config-map
  volumes:
  - name: config-map-volume
    configMap:
      name: redis-volume-config
      items:
      - path: "etc/redis.conf"
        key: redis.conf
```
## Future Improvements
In the future, we may add the ability to specify an init-container that can
watch the volume contents for updates and respond to changes when they occur.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/configmap.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/configmap.md)

View File

@ -1,241 +1 @@
# Kubernetes and Cluster Federation Control Plane Resilience
## Long Term Design and Current Status
### by Quinton Hoole, Mike Danese and Justin Santa-Barbara
### December 14, 2015
## Summary
Some confusion exists around how we currently ensure, and in future want to
ensure, resilience of the Kubernetes (and by implication Kubernetes Cluster
Federation) control plane. This document is an attempt to capture that
definitively. It covers areas including self-healing, high
availability, bootstrapping and recovery. Most of the information in
this document already exists in the form of GitHub comments,
PRs/proposals, scattered documents, and corridor conversations, so
this document is primarily a consolidation and clarification of existing
ideas.
## Terms
* **Self-healing:** automatically restarting or replacing failed
processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
even if some components are down or uncontactable. This typically
involves multiple replicas of critical services, and a reliable way
to find available replicas. Note that it's possible (but not
desirable) to have high
availability properties (e.g. multiple replicas) in the absence of
self-healing properties (e.g. if a replica fails, nothing replaces
it). Fairly obviously, given enough time, such systems typically
become unavailable (after enough replicas have failed).
* **Bootstrapping**: creating an empty cluster from nothing
* **Recovery**: recreating a non-empty cluster after perhaps
catastrophic failure/unavailability/data corruption
## Overall Goals
1. **Resilience to single failures:** Kubernetes clusters constrained
to single availability zones should be resilient to individual
machine and process failures by being both self-healing and highly
available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
scripts for (at least) GCE, AWS and basic bare metal should adhere
to the above (self-healing and high availability) by default (with
options available to disable these features to reduce control plane
resource requirements if so required). It is hoped that other
cloud providers will also follow the above guidelines, but the
above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
which span multiple availability zones in a region should by
default be resilient to complete failure of one entire availability
zone (by similarly providing self-healing and high availability in
the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
differences between the default implementations of the above for
GCE, AWS and basic bare metal should be minimized. This implies
using shared libraries across these providers in the default
scripts in preference to highly customized implementations per
cloud provider. This is not to say that highly differentiated,
customized per-cloud cluster creation processes (e.g. for GKE on
GCE, or some hosted Kubernetes provider on AWS) are discouraged.
But those fall squarely outside the basic cross-platform OSS
Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
for achieving system resilience (replication controllers, health
checking, service load balancing etc) should be used in preference
to building a separate set of mechanisms to achieve the same thing.
This implies that self hosting (the kubernetes control plane on
kubernetes) is strongly preferred, with the caveat below.
1. **Recovery from catastrophic failure:** The ability to quickly and
reliably recover a cluster from catastrophic failure is critical,
and should not be compromised by the above goal to self-host
(i.e. it goes without saying that the cluster should be quickly and
reliably recoverable, even if the cluster control plane is
broken). This implies that such catastrophic failure scenarios
should be carefully thought out, and the subject of regular
continuous integration testing, and disaster recovery exercises.
## Relative Priorities
1. **(Possibly manual) recovery from catastrophic failures:** having a
Kubernetes cluster, and all applications running inside it, disappear forever
is perhaps the worst possible failure mode. So it is critical that we be able to
recover the applications running inside a cluster from such failures in some
well-bounded time period.
1. In theory a cluster can be recovered by replaying all API calls
that have ever been executed against it, in order, but most
often that state has been lost, and/or is scattered across
multiple client applications or groups. So in general it is
probably infeasible.
1. In theory a cluster can also be recovered to some relatively
recent non-corrupt backup/snapshot of the disk(s) backing the
etcd cluster state. But we have no default consistent
backup/snapshot, verification or restoration process. And we
don't routinely test restoration, so even if we did routinely
perform and verify backups, we have no hard evidence that we
can in practice effectively recover from catastrophic cluster
failure or data corruption by restoring from these backups. So
there's more work to be done here.
1. **Self-healing:** Most major cloud providers provide the ability to
easily and automatically replace failed virtual machines within a
small number of minutes (e.g. GCE
[Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
and Managed Instance Groups,
AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
can fairly trivially be used to reduce control-plane down-time due
to machine failure to a small number of minutes per failure
(i.e. typically around "3 nines" availability), provided that:
1. cluster persistent state (i.e. etcd disks) is either:
1. truly persistent (i.e. remote persistent disks), or
1. reconstructible (e.g. using etcd [dynamic member
addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
or [backup and
recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).
1. and boot disks are either:
1. truly persistent (i.e. remote persistent disks), or
1. reconstructible (e.g. using boot-from-snapshot,
boot-from-pre-configured-image or
boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
availability above the approximately "3 nines" level provided by
automated self-healing, but it's somewhat more complex, and
requires additional resources (e.g. redundant API servers and etcd
quorum members). In environments where cloud-assisted automatic
self-healing might be infeasible (e.g. on-premise bare-metal
deployments), it also gives cluster administrators more time to
respond (e.g. replace/repair failed machines) without incurring
system downtime.
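
The "3 nines" figure used in the list above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original design: the function names and the example failure rate (50 failures per year, 10 minutes each) are illustrative assumptions.

```go
package main

import "fmt"

// minutesPerYear assumes a 365-day year.
const minutesPerYear = 365 * 24 * 60

// downtimeBudgetMinutes returns the yearly downtime, in minutes, allowed by
// an availability target given as a fraction (0.999 is "3 nines").
func downtimeBudgetMinutes(availability float64) float64 {
    return (1 - availability) * minutesPerYear
}

// availabilityFor returns the availability achieved when failuresPerYear
// outages each take recoveryMinutes to heal (hypothetical inputs).
func availabilityFor(failuresPerYear, recoveryMinutes float64) float64 {
    return 1 - failuresPerYear*recoveryMinutes/minutesPerYear
}

func main() {
    fmt.Printf("3 nines budget: %.1f minutes/year\n", downtimeBudgetMinutes(0.999))
    fmt.Printf("50 failures x 10 min: %.4f availability\n", availabilityFor(50, 10))
}
```

Under these assumptions, "3 nines" leaves roughly 525 minutes of downtime per year, so a few minutes of automated recovery per failure stays within budget only while failures remain fairly infrequent.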
## Design and Status (as of December 2015)
<table>
<tr>
<td><b>Control Plane Component</b></td>
<td><b>Resilience Plan</b></td>
<td><b>Current Status</b></td>
</tr>
<tr>
<td><b>API Server</b></td>
<td>
Multiple stateless, self-hosted, self-healing API servers behind a HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM). Note that the single-host approach of
having etcd listen only on localhost to ensure that only API server can
connect to it will no longer work, so alternative security will be
needed in that regard (either using firewall rules, SSL certs, or
something else). All necessary flags are currently supported to enable
SSL between API server and etcd (OpenShift runs like this out of the
box), but this needs to be woven into the "kube-up" and related
scripts. The design of self-hosting and related bootstrapping
and catastrophic failure recovery will be detailed in a separate
design doc.
</td>
<td>
No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro. To be clear, "no self healing"
means that even if multiple e.g. API servers are provisioned for HA
purposes, if they fail, nothing replaces them, so eventually the
system will fail. Self-healing and HA can be set up
manually by following documented instructions, but this is not
currently an automated process, and it is not tested as part of
continuous integration. So it's probably safest to assume that it
doesn't actually work in practice.
</td>
</tr>
<tr>
<td><b>Controller manager and scheduler</b></td>
<td>
Multiple self-hosted, self-healing, warm-standby stateless controller
managers and schedulers with leader election and automatic failover of API
server clients, automatically installed by default "kube-up" automation.
</td>
<td>As above.</td>
</tr>
<tr>
<td><b>etcd</b></td>
<td>
Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one to another).
Regarding self-healing, if a node running etcd goes down, it is always necessary
to do three things:
<ol>
<li>allocate a new node (not necessary if running etcd as a pod, in
which case specific measures are required to prevent user pods from
interfering with system pods, for example using node selectors),
<li>start an etcd replica on that new node, and
<li>have the new replica recover the etcd state.
</ol>
In the case of local disk (which fails in concert with the machine), the etcd
state must be recovered from the other replicas. This is called
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">
dynamic member addition</A>.
In the case of remote persistent disk, the etcd state can be recovered by
attaching the remote persistent disk to the replacement node, thus the state is
recoverable even if all other replicas are down.
There are also significant performance differences between local disks and remote
persistent disks. For example, the
<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">
sustained throughput of local disks in GCE is approximately 20x that of remote
disks</A>.
Hence we suggest that self-healing be provided by remotely mounted persistent
disks in non-performance-critical, single-zone cloud deployments. For
performance-critical installations, faster local SSDs should be used, in which
case remounting on node failure is not an option, so
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md ">
etcd runtime configuration</A> should be used to replace the failed machine.
Similarly, for cross-zone self-healing, cloud persistent disks are zonal, so
automatic <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">
runtime configuration</A> is required. Similarly, basic bare metal deployments
cannot generally rely on remote persistent disks, so the same approach applies
there.
</td>
<td>
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
Somewhat vague instructions exist</A> on how to set some of this up manually in
a self-hosted configuration. But automatic bootstrapping and self-healing are not
described (and are not implemented for the non-PD cases). This all still needs to
be automated and continuously tested.
</td>
</tr>
</table>
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/control-plane-resilience.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/control-plane-resilience.md)

View File

@ -1,206 +1 @@
# DaemonSet in Kubernetes
**Author**: Ananya Kumar (@AnanyaKumar)
**Status**: Implemented.
This document presents the design of the Kubernetes DaemonSet, describes use
cases, and gives an overview of the code.
## Motivation
Many users have requested a way to run a daemon on every node in a
Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential
for use cases such as building a sharded datastore, or running a logger on every
node. In comes the DaemonSet, a way to conveniently create and manage
daemon-like workloads in Kubernetes.
## Use Cases
The DaemonSet can be used for user-specified system services, cluster-level
applications with strong node ties, and Kubernetes node services. Below are
example use cases in each category.
### User-Specified System Services
Logging: Some users want a way to collect logs and statistics about nodes in a
cluster and send them to an external database. For example, system administrators
might want to know if their machines are performing as expected, if they need to
add more machines to the cluster, or if they should switch cloud providers. The
DaemonSet can be used to run a data collection service (for example fluentd) on
every node and send the data to a service like Elasticsearch for analysis.
### Cluster-Level Applications
Datastore: Users might want to implement a sharded datastore in their cluster. A
few nodes in the cluster, labeled app=datastore, might be responsible for
storing data shards, and pods running on these nodes might serve data. This
architecture requires a way to bind pods to specific nodes, so it cannot be
achieved using a Replication Controller. A DaemonSet is a convenient way to
implement such a datastore.
For other uses, see the related [feature request](https://issues.k8s.io/1518)
## Functionality
The DaemonSet supports standard API features:
- create
  - The spec for DaemonSets has a pod template field.
  - Using the pod's nodeSelector field, DaemonSets can be restricted to operate
    over nodes that have a certain label. For example, suppose that in a cluster
    some nodes are labeled app=database. You can use a DaemonSet to launch a
    datastore pod on exactly those nodes labeled app=database.
  - Using the pod's nodeName field, DaemonSets can be restricted to operate on a
    specified node.
  - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec
    used by the Replication Controller.
  - The initial implementation will not guarantee that DaemonSet pods are
    created on nodes before other pods.
  - The initial implementation of DaemonSet does not guarantee that DaemonSet
    pods show up on nodes (for example because of resource limitations of the node),
    but makes a best effort to launch DaemonSet pods (like Replication Controllers
    do with pods). Subsequent revisions might ensure that DaemonSet pods show up on
    nodes, preempting other pods if necessary.
  - The DaemonSet controller adds an annotation:
    `kubernetes.io/created-by: <json API object reference>`
  - YAML example:
```YAML
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  labels:
    app: datastore
  name: datastore
spec:
  template:
    metadata:
      labels:
        app: datastore-shard
    spec:
      nodeSelector:
        app: datastore-node
      containers:
        - name: datastore-shard
          image: kubernetes/sharded
          ports:
            - containerPort: 9042
              name: main
```
- commands that get info:
  - get (e.g. kubectl get daemonsets)
  - describe
- Modifiers:
  - delete (if --cascade=true, then first the client turns down all the pods
    controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is
    unlikely to be set on any node); then it deletes the DaemonSet; then it deletes
    the pods)
  - label
  - annotate
  - update operations like patch and replace (only allowed on the selector and on
    the nodeSelector and nodeName of the pod template)
  - DaemonSets have labels, so you could, for example, list all DaemonSets
    with certain labels (the same way you would for a Replication Controller).
In general, for all the supported features like get, describe, update, etc.,
the DaemonSet works in a similar way to the Replication Controller. However,
note that the DaemonSet and the Replication Controller are different constructs.
### Persisting Pods
- Ordinary liveness probes specified in the pod template work to keep pods
created by a DaemonSet running.
- If a daemon pod is killed or stopped, the DaemonSet will create a new
replica of the daemon pod on the node.
### Cluster Mutations
- When a new node is added to the cluster, the DaemonSet controller starts
daemon pods on the node for DaemonSets whose pod template nodeSelectors match
the node's labels.
- Suppose the user launches a DaemonSet that runs a logging daemon on all
nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label
to a node (that did not initially have the label), the logging daemon will
launch on the node. Additionally, if a user removes the label from a node, the
logging daemon on that node will be killed.
## Alternatives Considered
We considered several alternatives that were deemed inferior to the approach of
creating a new DaemonSet abstraction.
One alternative is to include the daemon in the machine image. In this case it
would run outside of Kubernetes proper, and thus not be monitored, health
checked, usable as a service endpoint, easily upgradable, etc.
A related alternative is to package daemons as static pods. This would address
most of the problems described above, but they would still not be easily
upgradable, and more generally could not be managed through the API server
interface.
A third alternative is to generalize the Replication Controller. We would do
something like: if you set the `replicas` field of the ReplicationControllerSpec
to -1, then it means "run exactly one replica on every node matching the
nodeSelector in the pod template." The ReplicationController would pretend
`replicas` had been set to some large number -- larger than the largest number
of nodes ever expected in the cluster -- and would use some anti-affinity
mechanism to ensure that no more than one Pod from the ReplicationController
runs on any given node. There are two downsides to this approach. First,
there would always be a large number of Pending pods in the scheduler (these
will be scheduled onto new machines when they are added to the cluster). The
second downside is more philosophical: DaemonSet and the Replication Controller
are very different concepts. We believe that having small, targeted controllers
for distinct purposes makes Kubernetes easier to understand and use, compared to
having larger multi-functional controllers (see
["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for
some discussion of this topic).
## Design
#### Client
- Add support for DaemonSet commands to kubectl and the client. Client code was
added to pkg/client/unversioned. The main files in kubectl that were modified are
pkg/kubectl/describe.go and pkg/kubectl/stop.go, since for other calls like Get, Create,
and Update, the client simply forwards the request to the backend via the REST
API.
#### Apiserver
- Accept, parse, validate client commands
- REST API calls are handled in pkg/registry/daemonset
- In particular, the api server will add the object to etcd
- DaemonManager listens for updates to etcd (using Framework.informer)
- API objects for DaemonSet were created in expapi/v1/types.go and
expapi/v1/register.go
- Validation code is in expapi/validation
#### Daemon Manager
- Creates new DaemonSets when requested. Launches the corresponding daemon pod
on all nodes with labels matching the new DaemonSet's selector.
- Listens for addition of new nodes to the cluster, by setting up a
framework.NewInformer that watches for the creation of Node API objects. When a
new node is added, the daemon manager will loop through each DaemonSet. If the
label of the node matches the selector of the DaemonSet, then the daemon manager
will create the corresponding daemon pod on the new node.
- The daemon manager creates a pod on a node by sending a command to the API
server, requesting that a pod be bound to the node (the node will be specified
via its hostname).
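
The node-matching step described in the list above can be sketched as follows. This is a simplified illustration rather than the actual controller code: the `Node` type, the function names and the `running` map are hypothetical stand-ins for the real API objects and informer state.

```go
package main

import "fmt"

// Node is a simplified, hypothetical stand-in for the Node API object.
type Node struct {
    Name   string
    Labels map[string]string
}

// matches reports whether every key/value pair in selector is present in
// labels. An empty selector matches every node.
func matches(selector, labels map[string]string) bool {
    for k, v := range selector {
        if got, ok := labels[k]; !ok || got != v {
            return false
        }
    }
    return true
}

// nodesNeedingDaemon returns, in input order, the names of nodes that match
// the selector but do not yet run a daemon pod.
func nodesNeedingDaemon(nodes []Node, selector map[string]string, running map[string]bool) []string {
    var out []string
    for _, n := range nodes {
        if matches(selector, n.Labels) && !running[n.Name] {
            out = append(out, n.Name)
        }
    }
    return out
}

func main() {
    nodes := []Node{
        {Name: "node-a", Labels: map[string]string{"logger": "fluentd"}},
        {Name: "node-b", Labels: map[string]string{}},
        {Name: "node-c", Labels: map[string]string{"logger": "fluentd"}},
    }
    selector := map[string]string{"logger": "fluentd"}
    running := map[string]bool{"node-a": true} // node-a already has its daemon pod
    fmt.Println(nodesNeedingDaemon(nodes, selector, running)) // prints [node-c]
}
```

For each returned node, the daemon manager would then ask the API server to create a pod bound to that node.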
#### Kubelet
- Does not need to be modified, but health checking will occur for the daemon
pods and revive the pods if they are killed (we set the pod restartPolicy to
Always). We reject DaemonSet objects with pod templates that don't have
restartPolicy set to Always.
## Open Issues
- Should work similarly to [Deployment](http://issues.k8s.io/1743).
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/daemon.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/daemon.md)

View File

@ -1,622 +1 @@
# Downward API for resource limits and requests
## Background
Currently the downward API (via environment variables and volume plugin) only
supports exposing a Pod's name, namespace, annotations, labels and its IP
([see details](http://kubernetes.io/docs/user-guide/downward-api/)). This
document explains the need for, and the design of, extending them to expose
resources (e.g. cpu, memory) limits and requests.
## Motivation
Software applications require configuration to work optimally with the resources they're allowed to use.
Exposing the requested and limited amounts of available resources inside containers will allow
these applications to be configured more easily. Although docker already
exposes some of this information inside containers, the downward API helps
expose this information in a runtime-agnostic manner in Kubernetes.
## Use cases
As an application author, I want to be able to use cpu or memory requests and
limits to configure the operational requirements of my applications inside containers.
For example, Java applications expect to be made aware of the available heap size via
a command line argument to the JVM, for example: `java -Xmx<heap-size>`. Similarly, an
application may want to configure its thread pool based on available cpu resources and
the exported value of GOMAXPROCS.
## Design
This is mostly driven by the discussion in [this issue](https://github.com/kubernetes/kubernetes/issues/9473).
There are three approaches discussed in this document to obtain resources limits
and requests to be exposed as environment variables and volumes inside
containers:
1. The first approach requires users to specify full json path selectors
   in which selectors are relative to the pod spec. The benefit of this
   approach is that it can specify pod-level resources, and since containers are
   also part of a pod spec, it can be used to specify container-level
   resources too.
2. The second approach requires specifying partial json path selectors
   which are relative to the container spec. This approach helps
   in retrieving container-specific resources limits and requests, and at
   the same time, it is simpler to specify than full json path selectors.
3. In the third approach, users specify fixed strings (magic keys) to retrieve
   resources limits and requests and do not specify any json path
   selectors. This approach is similar to the existing downward API
   implementation approach. The advantages of this approach are that it is
   simpler to specify than the first two, and does not require any type of
   conversion between internal and versioned objects or json selectors as
   discussed below.
Before discussing a bit more about merits of each approach, here is a
brief discussion about json path selectors and some implications related
to their use.
#### JSONpath selectors
Versioned objects in kubernetes have json tags as part of their golang fields.
Currently, objects in the internal API have json tags, but it is planned that
these will eventually be removed (see [3933](https://github.com/kubernetes/kubernetes/issues/3933)
for discussion). So for discussion in this proposal, we assume that
internal objects do not have json tags. In the first two approaches
(full and partial json selectors), when a user creates a pod and its
containers, the user specifies a json path selector in the pod's
spec to retrieve values of its limits and requests. The selector
is composed of json tags similar to json paths used with kubectl
([json](http://kubernetes.io/docs/user-guide/jsonpath/)). This proposal
uses kubernetes' json path library to process the selectors to retrieve
the values. As kubelet operates on internal objects (without json tags),
and the selectors are part of versioned objects, retrieving values of
the limits and requests can be handled using these two solutions:
1. By converting an internal object to versioned object, and then using
the json path library to retrieve the values from the versioned object
by processing the selector.
2. By converting a json selector of the versioned objects to internal
object's golang expression and then using the json path library to
retrieve the values from the internal object by processing the golang
expression. However, converting a json selector of the versioned objects
to internal object's golang expression will still require an instance
of the versioned object, so it seems to be more work than the first solution
unless there is another way that does not require the versioned object.
So there is a one time conversion cost associated with the first (full
path) and second (partial path) approaches, whereas the third approach
(magic keys) does not require any such conversion and can directly
work on internal objects. If we want to avoid the conversion cost and
keep the implementation simple, the magic keys approach appears to be
the easiest way to expose limits and requests, with the
least impact on existing functionality.
To summarize merits/demerits of each approach:
|Approach | Scope | Conversion cost | JSON selectors | Future extension|
| ---------- | ------------------- | -------------------| ------------------- | ------------------- |
|Full selectors | Pod/Container | Yes | Yes | Possible |
|Partial selectors | Container | Yes | Yes | Possible |
|Magic keys | Container | No | No | Possible|
Note: Pod resources can always be accessed using the existing `type ObjectFieldSelector` object
in conjunction with partial selectors and magic keys approaches.
### API with full JSONpath selectors
Full json path selectors specify the complete path to the resources
limits and requests relative to pod spec.
#### Environment variables
This table shows how selectors can be used for various requests and
limits to be exposed as environment variables. Environment variable names
are examples only and not necessarily as specified, and the selectors do not
have to start with dot.
| Env Var Name | Selector |
| ---- | ------------------- |
| CPU_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.cpu|
| MEMORY_LIMIT | spec.containers[?(@.name=="container-name")].resources.limits.memory|
| CPU_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.cpu|
| MEMORY_REQUEST | spec.containers[?(@.name=="container-name")].resources.requests.memory |
#### Volume plugin
This table shows how selectors can be used for various requests and
limits to be exposed as volumes. The path names are examples only and
not necessarily as specified, and the selectors do not have to start with dot.
| Path | Selector |
| ---- | ------------------- |
| cpu_limit | spec.containers[?(@.name=="container-name")].resources.limits.cpu|
| memory_limit| spec.containers[?(@.name=="container-name")].resources.limits.memory|
| cpu_request | spec.containers[?(@.name=="container-name")].resources.requests.cpu|
| memory_request |spec.containers[?(@.name=="container-name")].resources.requests.memory|
Volumes are pod scoped, so a selector must be specified with a container name.
Full json path selectors will use existing `type ObjectFieldSelector`
to extend the current implementation for resources requests and limits.
```
// ObjectFieldSelector selects an APIVersioned field of an object.
type ObjectFieldSelector struct {
    APIVersion string `json:"apiVersion"`
    // Required: Path of the field to select in the specified API version
    FieldPath string `json:"fieldPath"`
}
```
#### Examples
These examples show how to use full selectors with environment variables and volume plugin.
```
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh","-c", "env" ]
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
      env:
        - name: CPU_LIMIT
          valueFrom:
            fieldRef:
              fieldPath: spec.containers[?(@.name=="test-container")].resources.limits.cpu
```
```
apiVersion: v1
kind: Pod
metadata:
  name: kubernetes-downwardapi-volume-example
spec:
  containers:
    - name: client-container
      image: gcr.io/google_containers/busybox
      command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"]
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
      volumeMounts:
        - name: podinfo
          mountPath: /etc
          readOnly: false
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: "cpu_limit"
            fieldRef:
              fieldPath: spec.containers[?(@.name=="client-container")].resources.limits.cpu
```
#### Validations
For APIs with full json path selectors, verify that selectors are
valid relative to pod spec.
### API with partial JSONpath selectors
Partial json path selectors specify paths to resources limits and requests
relative to the container spec. These will be implemented by introducing a
`ContainerSpecFieldSelector` (json: `containerSpecFieldRef`) to extend the current
implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`.
```
// ContainerSpecFieldSelector selects an APIVersioned field of an object.
type ContainerSpecFieldSelector struct {
    APIVersion string `json:"apiVersion"`
    // Container name
    ContainerName string `json:"containerName,omitempty"`
    // Required: Path of the field to select in the specified API version
    FieldPath string `json:"fieldPath"`
}

// Represents a single file containing information from the downward API
type DownwardAPIVolumeFile struct {
    // Required: Path is the relative path name of the file to be created.
    Path string `json:"path"`
    // Selects a field of the pod: only annotations, labels, name and
    // namespace are supported.
    FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
    // Selects a field of the container: only resources limits and requests
    // (resources.limits.cpu, resources.limits.memory, resources.requests.cpu,
    // resources.requests.memory) are currently supported.
    ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"`
}

// EnvVarSource represents a source for the value of an EnvVar.
// Only one of its fields may be set.
type EnvVarSource struct {
    // Selects a field of the container: only resources limits and requests
    // (resources.limits.cpu, resources.limits.memory, resources.requests.cpu,
    // resources.requests.memory) are currently supported.
    ContainerSpecFieldRef *ContainerSpecFieldSelector `json:"containerSpecFieldRef,omitempty"`
    // Selects a field of the pod; only name and namespace are supported.
    FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
    // Selects a key of a ConfigMap.
    ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
    // Selects a key of a secret in the pod's namespace.
    SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"`
}
```
#### Environment variables
This table shows how partial selectors can be used for various requests and
limits to be exposed as environment variables. Environment variable names
are examples only and not necessarily as specified, and the selectors do not
have to start with dot.
| Env Var Name | Selector |
| -------------------- | -------------------|
| CPU_LIMIT | resources.limits.cpu |
| MEMORY_LIMIT | resources.limits.memory |
| CPU_REQUEST | resources.requests.cpu |
| MEMORY_REQUEST | resources.requests.memory |
Since environment variables are container scoped, it is optional
to specify container name as part of the partial selectors as they are
relative to container spec. If container name is not specified, then
it defaults to current container. However, container name could be specified
to expose variables from other containers.
#### Volume plugin
This table shows volume paths and partial selectors used for resources cpu and memory.
Volume path names are examples only and not necessarily as specified, and the
selectors do not have to start with dot.
| Path | Selector |
| -------------------- | -------------------|
| cpu_limit | resources.limits.cpu |
| memory_limit | resources.limits.memory |
| cpu_request | resources.requests.cpu |
| memory_request | resources.requests.memory |
Volumes are pod scoped, so the container name must be specified as part of
`containerSpecFieldRef` with them.
#### Examples
These examples show how to use partial selectors with environment variables and volume plugin.
```
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh","-c", "env" ]
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
      env:
        - name: CPU_LIMIT
          valueFrom:
            containerSpecFieldRef:
              fieldPath: resources.limits.cpu
```
```
apiVersion: v1
kind: Pod
metadata:
  name: kubernetes-downwardapi-volume-example
spec:
  containers:
    - name: client-container
      image: gcr.io/google_containers/busybox
      command: ["sh", "-c", "while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"]
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
      volumeMounts:
        - name: podinfo
          mountPath: /etc
          readOnly: false
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: "cpu_limit"
            containerSpecFieldRef:
              containerName: "client-container"
              fieldPath: resources.limits.cpu
```
#### Validations
For APIs with partial json path selectors, verify
that selectors are valid relative to container spec.
Also verify that container name is provided with volumes.
### API with magic keys
In this approach, users specify fixed strings (or magic keys) to retrieve resources
limits and requests. This approach is similar to the existing downward
API implementation approach. The fixed strings used for resources limits and requests
for cpu and memory are `limits.cpu`, `limits.memory`,
`requests.cpu` and `requests.memory`. Although these strings look the same
as json path selectors, they are processed as fixed strings. These will be implemented by
introducing a `ResourceFieldSelector` (json: `resourceFieldRef`) to extend the current
implementation for `type DownwardAPIVolumeFile struct` and `type EnvVarSource struct`.
The fields in ResourceFieldSelector are `containerName` to specify the name of a
container, `resource` to specify the type of a resource (cpu or memory), and `divisor`
to specify the output format of values of exposed resources. The default value of divisor
is `1`, which means cores for cpu and bytes for memory. For cpu, the divisor's valid
values are `1m` (millicores) and `1` (cores). For memory, the valid values in fixed point integer
(decimal) are `1` (bytes), `1k` (kilobytes), `1M` (megabytes), `1G` (gigabytes),
`1T` (terabytes), `1P` (petabytes) and `1E` (exabytes), and their power-of-two equivalents are `1Ki` (kibibytes),
`1Mi` (mebibytes), `1Gi` (gibibytes), `1Ti` (tebibytes), `1Pi` (pebibytes) and `1Ei` (exbibytes).
For more information about these resource formats, [see details](resources.md).
Also, the exposed values will be the ceiling of the actual values in the format requested by the divisor.
For example, if requests.cpu is `250m` (250 millicores) and the divisor is the default `1`, then
the exposed value will be `1` core, because 250 millicores converted to cores is 0.25 and
the ceiling of 0.25 is 1.
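
The ceiling rule can be illustrated with integer arithmetic. This is a minimal sketch under stated assumptions: the function name is hypothetical, and values and divisors are plain integers in a common base unit (millicores for cpu, bytes for memory) rather than the `resource.Quantity` type used in the proposed API.

```go
package main

import "fmt"

// exposedValue returns ceil(value / divisor) for positive integers that are
// expressed in the same base unit (millicores for cpu, bytes for memory).
func exposedValue(value, divisor int64) int64 {
    return (value + divisor - 1) / divisor
}

func main() {
    const (
        milliPerCore int64 = 1000    // divisor "1" for cpu, in millicores
        mebibyte     int64 = 1 << 20 // divisor "1Mi" for memory, in bytes
    )
    fmt.Println(exposedValue(250, milliPerCore))      // 250m with divisor 1   -> 1
    fmt.Println(exposedValue(500, 1))                 // 500m with divisor 1m  -> 500
    fmt.Println(exposedValue(128*mebibyte, mebibyte)) // 128Mi with divisor 1Mi -> 128
    fmt.Println(exposedValue(128*mebibyte, 1))        // 128Mi with divisor 1  -> 134217728
}
```

Rounding up rather than down means a fractional cpu request is never reported as zero cores.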
```
type ResourceFieldSelector struct {
    // Container name
    ContainerName string `json:"containerName,omitempty"`
    // Required: Resource to select
    Resource string `json:"resource"`
    // Specifies the output format of the exposed resources
    Divisor resource.Quantity `json:"divisor,omitempty"`
}

// Represents a single file containing information from the downward API
type DownwardAPIVolumeFile struct {
    // Required: Path is the relative path name of the file to be created.
    Path string `json:"path"`
    // Selects a field of the pod: only annotations, labels, name and
    // namespace are supported.
    FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
    // Selects a resource of the container: only resources limits and requests
    // (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
    ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
}

// EnvVarSource represents a source for the value of an EnvVar.
// Only one of its fields may be set.
type EnvVarSource struct {
    // Selects a resource of the container: only resources limits and requests
    // (limits.cpu, limits.memory, requests.cpu and requests.memory) are currently supported.
    ResourceFieldRef *ResourceFieldSelector `json:"resourceFieldRef,omitempty"`
    // Selects a field of the pod; only name and namespace are supported.
    FieldRef *ObjectFieldSelector `json:"fieldRef,omitempty"`
    // Selects a key of a ConfigMap.
    ConfigMapKeyRef *ConfigMapKeySelector `json:"configMapKeyRef,omitempty"`
    // Selects a key of a secret in the pod's namespace.
    SecretKeyRef *SecretKeySelector `json:"secretKeyRef,omitempty"`
}
```
#### Environment variables
This table shows environment variable names and strings used for resources cpu and memory.
The variable names are examples only and not necessarily as specified.
| Env Var Name | Resource |
| -------------------- | -------------------|
| CPU_LIMIT | limits.cpu |
| MEMORY_LIMIT | limits.memory |
| CPU_REQUEST | requests.cpu |
| MEMORY_REQUEST | requests.memory |
Since environment variables are container scoped, it is optional
to specify the container name as part of `resourceFieldRef`. If the
container name is not specified, then
it defaults to the current container. However, a container name could be specified
to expose variables from other containers.
#### Volume plugin
This table shows volume paths and strings used for resources cpu and memory.
Volume path names are examples only and not necessarily as specified.
| Path | Resource |
| -------------------- | -------------------|
| cpu_limit | limits.cpu |
| memory_limit | limits.memory|
| cpu_request | requests.cpu |
| memory_request | requests.memory |
Volumes are pod scoped, so the container name must be specified as part of
`resourceFieldRef` with them.
#### Examples
These examples show how to use magic keys approach with environment variables and volume plugin.
```
apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh","-c", "env" ]
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
      env:
        - name: CPU_LIMIT
          valueFrom:
            resourceFieldRef:
              resource: limits.cpu
        - name: MEMORY_LIMIT
          valueFrom:
            resourceFieldRef:
              resource: limits.memory
              divisor: "1Mi"
```
In the above example, the exposed values of CPU_LIMIT and MEMORY_LIMIT will be 1 (in cores) and 128 (in Mi), respectively.
```
apiVersion: v1
kind: Pod
metadata:
  name: kubernetes-downwardapi-volume-example
spec:
  containers:
    - name: client-container
      image: gcr.io/google_containers/busybox
      command: ["sh", "-c","while true; do if [[ -e /etc/labels ]]; then cat /etc/labels; fi; if [[ -e /etc/annotations ]]; then cat /etc/annotations; fi; sleep 5; done"]
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"
      volumeMounts:
        - name: podinfo
          mountPath: /etc
          readOnly: false
  volumes:
    - name: podinfo
      downwardAPI:
        items:
          - path: "cpu_limit"
            resourceFieldRef:
              containerName: client-container
              resource: limits.cpu
              divisor: "1m"
          - path: "memory_limit"
            resourceFieldRef:
              containerName: client-container
              resource: limits.memory
```
In the above example, the values exposed in the files `cpu_limit` and `memory_limit` will be 500 (in millicores) and 134217728 (in bytes), respectively.
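The arithmetic behind the exposed values in the two examples can be sketched in Go. This is only an illustration: the helper `exposedValue` is hypothetical, quantities are passed in a common base unit (millicores or bytes), and rounding up is an assumption inferred from the `500m` limit being exposed as `1` core.

```go
package main

import "fmt"

// exposedValue divides a positive quantity by a divisor (both given in the
// same base unit) and rounds up. Rounding up is an assumption inferred from
// the examples; it is not stated explicitly in the proposal text.
func exposedValue(quantity, divisor int64) int64 {
	return (quantity + divisor - 1) / divisor
}

func main() {
	const mi = int64(1024 * 1024) // bytes in 1Mi

	fmt.Println(exposedValue(500, 1000))  // limits.cpu 500m, divisor 1 (one core = 1000m)
	fmt.Println(exposedValue(128*mi, mi)) // limits.memory 128Mi, divisor 1Mi
	fmt.Println(exposedValue(500, 1))     // limits.cpu 500m, divisor 1m
	fmt.Println(exposedValue(128*mi, 1))  // limits.memory 128Mi, divisor 1 (one byte)
}
```

The four calls correspond, in order, to `CPU_LIMIT` and `MEMORY_LIMIT` in the environment variable example and to `cpu_limit` and `memory_limit` in the volume example.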
#### Validations
For APIs with magic keys, verify that each resource string is valid, i.e. that it is one
of `limits.cpu`, `limits.memory`, `requests.cpu` and `requests.memory`.
Also verify that the container name is provided with volumes.
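A minimal Go sketch of these two checks follows. The struct below is a simplified stand-in whose fields are inferred from the `containerName` and `resource` keys in the YAML examples, and `validateResourceFieldSelector` is a hypothetical helper name, not the actual validation code.

```go
package main

import (
	"errors"
	"fmt"
)

// ResourceFieldSelector is a simplified stand-in; the field layout is an
// assumption based on the YAML examples in this proposal.
type ResourceFieldSelector struct {
	ContainerName string
	Resource      string
}

// supportedResources lists the resource strings the proposal allows.
var supportedResources = map[string]bool{
	"limits.cpu":      true,
	"limits.memory":   true,
	"requests.cpu":    true,
	"requests.memory": true,
}

// validateResourceFieldSelector rejects unknown resource strings and, for
// volumes (which are pod scoped), a missing container name.
func validateResourceFieldSelector(sel ResourceFieldSelector, forVolume bool) error {
	if !supportedResources[sel.Resource] {
		return fmt.Errorf("unsupported resource %q", sel.Resource)
	}
	if forVolume && sel.ContainerName == "" {
		return errors.New("containerName is required when used with volumes")
	}
	return nil
}

func main() {
	// Environment variable: container name is optional.
	fmt.Println(validateResourceFieldSelector(ResourceFieldSelector{Resource: "limits.cpu"}, false))
	// Volume item without a container name: rejected.
	fmt.Println(validateResourceFieldSelector(ResourceFieldSelector{Resource: "limits.cpu"}, true))
	// Unknown resource string: rejected.
	fmt.Println(validateResourceFieldSelector(ResourceFieldSelector{Resource: "limits.disk"}, false))
}
```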
## Pod-level and container-level resource access
Pod-level resources (like `metadata.name`, `status.podIP`) will always be accessed with a `type ObjectFieldSelector` object in
all approaches. Container-level resources will be accessed by `type ObjectFieldSelector`
with the full selector approach, and by `type ContainerSpecFieldRef` and `type ResourceFieldRef`
with the partial and magic keys approaches, respectively. The following table
summarizes resource access with these approaches.
| Approach | Pod resources| Container resources |
| -------------------- | -------------------|-------------------|
| Full selectors | `ObjectFieldSelector` | `ObjectFieldSelector`|
| Partial selectors | `ObjectFieldSelector`| `ContainerSpecFieldRef` |
| Magic keys | `ObjectFieldSelector`| `ResourceFieldRef` |
## Output format
The output format for resource limits and requests will be the same as the
cgroups output format, i.e. cpu in cpu shares (cores multiplied by 1024
and rounded to an integer) and memory in bytes. For example, a memory request
or limit of `64Mi` in the container spec will be output as `67108864`
bytes, and a cpu request or limit of `250m` (millicores) will be output as
`256` cpu shares.
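The conversion described above can be checked with a short Go sketch. The helper names `cpuShares` and `mebibytesToBytes` are hypothetical, and rounding to the nearest integer is an assumption (the text only says "rounded to integer").

```go
package main

import (
	"fmt"
	"math"
)

// cpuShares converts millicores to cpu shares: cores multiplied by 1024,
// rounded to the nearest integer (the rounding mode is an assumption).
func cpuShares(milliCPU int64) int64 {
	return int64(math.Round(float64(milliCPU) * 1024 / 1000))
}

// mebibytesToBytes converts a quantity expressed in Mi to bytes.
func mebibytesToBytes(mi int64) int64 {
	return mi * 1024 * 1024
}

func main() {
	fmt.Println(cpuShares(250))       // 250m from the example above
	fmt.Println(mebibytesToBytes(64)) // 64Mi from the example above
}
```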
## Implementation approach
The current implementation of this proposal will focus on the API with magic keys
approach. The main reason for selecting this approach is that it might be
easier to incorporate and extend resource specific functionality.
## Applied example
Here we discuss how to use exposed resource values to set, for example, Java
memory size or GOMAXPROCS for your applications. Let's say you expose a container's
(running an application like tomcat, for example) requested memory as `HEAP_SIZE`
and requested cpu as `CPU_LIMIT` (or directly as `GOMAXPROCS`) environment variables.
One way to set the heap size or cpu for this application would be to wrap the binary
in a shell script, and then export the `JAVA_OPTS` (assuming your container image supports it)
and `GOMAXPROCS` environment variables inside the container image. The spec file for the
application pod could look like:
```
apiVersion: v1
kind: Pod
metadata:
  name: kubernetes-downwardapi-volume-example
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh","-c", "env" ]
      resources:
        requests:
          memory: "64M"
          cpu: "250m"
        limits:
          memory: "128M"
          cpu: "500m"
      env:
        - name: HEAP_SIZE
          valueFrom:
            resourceFieldRef:
              resource: requests.memory
        - name: CPU_LIMIT
          valueFrom:
            resourceFieldRef:
              resource: requests.cpu
```
Note that the value of divisor by default is `1`. Now inside the container,
the HEAP_SIZE (in bytes) and GOMAXPROCS (in cores) could be exported as:
```
export JAVA_OPTS="$JAVA_OPTS -Xmx${HEAP_SIZE}"
# and
export GOMAXPROCS="${CPU_LIMIT}"
```
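If the application is itself a Go program, it could also read the exposed variable directly instead of relying on a wrapper script. The sketch below is an illustration only: `procsFromEnv` is a hypothetical helper, and it assumes `CPU_LIMIT` holds a whole number of cores as in the pod spec above.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// procsFromEnv parses a whole-core count such as "1". It returns fallback
// when the value is missing, malformed, or less than 1.
func procsFromEnv(value string, fallback int) int {
	n, err := strconv.Atoi(value)
	if err != nil || n < 1 {
		return fallback
	}
	return n
}

func main() {
	// CPU_LIMIT is the variable name used in the pod spec above.
	n := procsFromEnv(os.Getenv("CPU_LIMIT"), runtime.NumCPU())
	runtime.GOMAXPROCS(n)
	fmt.Println("GOMAXPROCS set to", n)
}
```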
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/downward_api_resources_limits_requests.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/downward_api_resources_limits_requests.md)

View File

@ -1,429 +1 @@
# Enhance Pluggable Policy
While trying to develop an authorization plugin for Kubernetes, we found a few
places where API extensions would ease development and add power. There are a
few goals:
1. Provide an authorization plugin that can evaluate a .Authorize() call based
on the full content of the request to RESTStorage. This includes information
like the full verb, the content of creates and updates, and the names of
resources being acted upon.
1. Provide a way to ask whether a user is permitted to take an action without
running in process with the API Authorizer. For instance, a proxy for exec
calls could ask whether a user can run the exec they are requesting.
1. Provide a way to ask who can perform a given action on a given resource.
This is useful for answering questions like, "who can create replication
controllers in my namespace".
This proposal adds to and extends the existing API so that authorizers may
provide the functionality described above. It does not attempt to describe how
the policies themselves can be expressed; that is up to the authorization plugins
themselves.
## Enhancements to existing Authorization interfaces
The existing Authorization interfaces are described
[here](../admin/authorization.md). A couple of additions will allow the development
of an Authorizer that matches based on different rules than the existing
implementation.
### Request Attributes
The existing authorizer.Attributes only has 5 attributes (user, groups,
isReadOnly, kind, and namespace). If we add more detailed verbs, content, and
resource names, then Authorizer plugins will have the same level of information
available to RESTStorage components in order to express more detailed policy.
The replacement excerpt is below.
An API request has the following attributes that can be considered for
authorization:
- user - the user-string which a user was authenticated as. This is included
in the Context.
- groups - the groups to which the user belongs. This is included in the
Context.
- verb - string describing the requested action. Today we have: get, list,
watch, create, update, and delete. The old `readOnly` behavior is equivalent to
allowing get, list, watch.
- namespace - the namespace of the object being accessed, or the empty string if
the endpoint does not support namespaced objects. This is included in the
Context.
- resourceGroup - the API group of the resource being accessed
- resourceVersion - the API version of the resource being accessed
- resource - which resource is being accessed
- applies only to the API endpoints, such as `/api/v1beta1/pods`. For
miscellaneous endpoints, like `/version`, the resource is the empty string.
- resourceName - the name of the resource during a get, update, or delete
action.
- subresource - which subresource is being accessed
A non-API request has 2 attributes:
- verb - the HTTP verb of the request
- path - the path of the URL being requested
### Authorizer Interface
The existing Authorizer interface is very simple, but there isn't a way to
provide details about allows, denies, or failures. The extended detail is useful
for UIs that want to describe why certain actions are allowed or disallowed. Not
all Authorizers will want to provide that information, but for those that do,
having that capability is useful. In addition, adding a `GetAllowedSubjects`
method that returns the users and groups that can perform a particular
action makes it possible to answer questions like, "who can see resources in my
namespace" (see [ResourceAccessReview](#ResourceAccessReview) further down).
```go
// OLD
type Authorizer interface {
Authorize(a Attributes) error
}
```
```go
// NEW
// Authorizer provides the ability to determine if a particular user can perform
// a particular action
type Authorizer interface {
// Authorize takes a Context (for namespace, user, and traceability) and
// Attributes to make a policy determination.
// reason is an optional return value that can describe why a policy decision
// was made. Reasons are useful during debugging when trying to figure out
// why a user or group has access to perform a particular action.
Authorize(ctx api.Context, a Attributes) (allowed bool, reason string, evaluationError error)
}
// AuthorizerIntrospection is an optional interface that provides the ability to
// determine which users and groups can perform a particular action. This is
// useful for building caches of who can see what. For instance, "which
// namespaces can this user see". That would allow someone to see only the
// namespaces they are allowed to view instead of having to choose between
// listing them all or listing none.
type AuthorizerIntrospection interface {
// GetAllowedSubjects takes a Context (for namespace and traceability) and
// Attributes to determine which users and groups are allowed to perform the
// described action in the namespace. This API enables the ResourceAccessReview
// requests below
GetAllowedSubjects(ctx api.Context, a Attributes) (users util.StringSet, groups util.StringSet, evaluationError error)
}
```
### SubjectAccessReviews
This set of APIs answers the question: can a user or group (use authenticated
user if none is specified) perform a given action. Given the Authorizer
interface (proposed or existing), this endpoint can be implemented generically
against any Authorizer by creating the correct Attributes and making an
.Authorize() call.
There are three different flavors:
1. `/apis/authorization.kubernetes.io/{version}/subjectAccessReviews` - this
checks to see if a specified user or group can perform a given action at the
cluster scope or across all namespaces. This is a highly privileged operation.
It allows a cluster-admin to inspect rights of any person across the entire
cluster and against cluster level resources.
2. `/apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews` -
this checks to see if the current user (including his groups) can perform a
given action at any specified scope. This is an unprivileged operation. It
doesn't expose any information that a user couldn't discover simply by trying an
endpoint themselves.
3. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews` -
this checks to see if a specified user or group can perform a given action in
**this** namespace. This is a moderately privileged operation. In a multi-tenant
environment, having a namespace scoped resource makes it very easy to reason
about powers granted to a namespace admin. This allows a namespace admin
(someone able to manage permissions inside one namespace, but not all
namespaces) the power to inspect whether a given user or group can manipulate
resources in that namespace.
SubjectAccessReview is a runtime.Object with associated RESTStorage that only
accepts creates. The caller POSTs a SubjectAccessReview to this URL and gets
a SubjectAccessReviewResponse back. Here is an example of a call and its
corresponding return:
```
// input
{
"kind": "SubjectAccessReview",
"apiVersion": "authorization.kubernetes.io/v1",
"authorizationAttributes": {
"verb": "create",
"resource": "pods",
"user": "Clark",
"groups": ["admins", "managers"]
}
}
// POSTed like this
curl -X POST /apis/authorization.kubernetes.io/{version}/subjectAccessReviews -d @subject-access-review.json
// or
accessReviewResult, err := Client.SubjectAccessReviews().Create(subjectAccessReviewObject)
// output
{
"kind": "SubjectAccessReviewResponse",
"apiVersion": "authorization.kubernetes.io/v1",
"allowed": true
}
```
PersonalSubjectAccessReview is a runtime.Object with associated RESTStorage that
only accepts creates. The caller POSTs a PersonalSubjectAccessReview to this URL
and gets a PersonalSubjectAccessReviewResponse back. Here is an example of a call and
its corresponding return:
```
// input
{
  "kind": "PersonalSubjectAccessReview",
  "apiVersion": "authorization.kubernetes.io/v1",
  "authorizationAttributes": {
    "verb": "create",
    "resource": "pods",
    "namespace": "any-ns"
  }
}
// POSTed like this
curl -X POST /apis/authorization.kubernetes.io/{version}/personalSubjectAccessReviews -d @personal-subject-access-review.json
// or
accessReviewResult, err := Client.PersonalSubjectAccessReviews().Create(personalSubjectAccessReviewObject)
// output
{
"kind": "PersonalSubjectAccessReviewResponse",
"apiVersion": "authorization.kubernetes.io/v1",
"allowed": true
}
```
LocalSubjectAccessReview is a runtime.Object with associated RESTStorage that only
accepts creates. The caller POSTs a LocalSubjectAccessReview to this URL and
gets a LocalSubjectAccessReviewResponse back. Here is an example of a call and
its corresponding return:
```
// input
{
  "kind": "LocalSubjectAccessReview",
  "apiVersion": "authorization.kubernetes.io/v1",
  "namespace": "my-ns",
  "authorizationAttributes": {
    "verb": "create",
    "resource": "pods",
    "user": "Clark",
    "groups": ["admins", "managers"]
  }
}
// POSTed like this
curl -X POST /apis/authorization.kubernetes.io/{version}/ns/{namespace}/localSubjectAccessReviews -d @local-subject-access-review.json
// or
accessReviewResult, err := Client.LocalSubjectAccessReviews().Create(localSubjectAccessReviewObject)
// output
{
  "kind": "LocalSubjectAccessReviewResponse",
  "apiVersion": "authorization.kubernetes.io/v1",
  "namespace": "my-ns",
  "allowed": true
}
```
The actual Go objects look like this:
```go
type AuthorizationAttributes struct {
// Namespace is the namespace of the action being requested. Currently, there
// is no distinction between no namespace and all namespaces
Namespace string `json:"namespace" description:"namespace of the action being requested"`
// Verb is one of: get, list, watch, create, update, delete
Verb string `json:"verb" description:"one of get, list, watch, create, update, delete"`
// ResourceGroup is the API group of the resource being requested
ResourceGroup string `json:"resourceGroup" description:"group of the resource being requested"`
// ResourceVersion is the version of resource
ResourceVersion string `json:"resourceVersion" description:"version of the resource being requested"`
// Resource is one of the existing resource types
Resource string `json:"resource" description:"one of the existing resource types"`
// ResourceName is the name of the resource being requested for a "get" or
// deleted for a "delete"
ResourceName string `json:"resourceName" description:"name of the resource being requested for a get or delete"`
// Subresource is one of the existing subresources types
Subresource string `json:"subresource" description:"one of the existing subresources"`
}
// SubjectAccessReview is an object for requesting information about whether a
// user or group can perform an action
type SubjectAccessReview struct {
kapi.TypeMeta `json:",inline"`
// AuthorizationAttributes describes the action being tested.
AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
// User is optional, but at least one of User or Groups must be specified
User string `json:"user" description:"optional, user to check"`
// Groups is optional, but at least one of User or Groups must be specified
Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"`
}
// SubjectAccessReviewResponse describes whether or not a user or group can
// perform an action
type SubjectAccessReviewResponse struct {
kapi.TypeMeta
// Allowed is required. True if the action would be allowed, false otherwise.
Allowed bool
// Reason is optional. It indicates why a request was allowed or denied.
Reason string
}
// PersonalSubjectAccessReview is an object for requesting information about
// whether the current user can perform an action
type PersonalSubjectAccessReview struct {
kapi.TypeMeta `json:",inline"`
// AuthorizationAttributes describes the action being tested.
AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
}
// PersonalSubjectAccessReviewResponse describes whether this user can perform
// an action
type PersonalSubjectAccessReviewResponse struct {
kapi.TypeMeta
// Namespace is the namespace used for the access review
Namespace string
// Allowed is required. True if the action would be allowed, false otherwise.
Allowed bool
// Reason is optional. It indicates why a request was allowed or denied.
Reason string
}
// LocalSubjectAccessReview is an object for requesting information about
// whether a user or group can perform an action
type LocalSubjectAccessReview struct {
kapi.TypeMeta `json:",inline"`
// AuthorizationAttributes describes the action being tested.
AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
// User is optional, but at least one of User or Groups must be specified
User string `json:"user" description:"optional, user to check"`
// Groups is optional, but at least one of User or Groups must be specified
Groups []string `json:"groups" description:"optional, list of groups to which the user belongs"`
}
// LocalSubjectAccessReviewResponse describes whether or not a user or group can
// perform an action
type LocalSubjectAccessReviewResponse struct {
kapi.TypeMeta
// Namespace is the namespace used for the access review
Namespace string
// Allowed is required. True if the action would be allowed, false otherwise.
Allowed bool
// Reason is optional. It indicates why a request was allowed or denied.
Reason string
}
```
### ResourceAccessReview
This set of APIs answers the question: which users and groups can perform the
specified verb on the specified resourceKind. Given the Authorizer interface
described above, this endpoint can be implemented generically against any
Authorizer by calling the .GetAllowedSubjects() function.
There are two different flavors:
1. `/apis/authorization.kubernetes.io/{version}/resourceAccessReviews` - this
checks to see which users and groups can perform a given action at the cluster
scope or across all namespaces. This is a highly privileged operation. It allows
a cluster-admin to inspect rights of all subjects across the entire cluster and
against cluster level resources.
2. `/apis/authorization.kubernetes.io/{version}/ns/{namespace}/localResourceAccessReviews` -
this checks to see which users and groups can perform a given action in **this**
namespace. This is a moderately privileged operation. In a multi-tenant
environment, having a namespace scoped resource makes it very easy to reason
about powers granted to a namespace admin. This allows a namespace admin
(someone able to manage permissions inside one namespace, but not all
namespaces) the power to inspect which users and groups can manipulate
resources in that namespace.
ResourceAccessReview is a runtime.Object with associated RESTStorage that only
accepts creates. The caller POSTs a ResourceAccessReview to this URL and gets
a ResourceAccessReviewResponse back. Here is an example of a call and its
corresponding return:
```
// input
{
"kind": "ResourceAccessReview",
"apiVersion": "authorization.kubernetes.io/v1",
"authorizationAttributes": {
"verb": "list",
"resource": "replicationcontrollers"
}
}
// POSTed like this
curl -X POST /apis/authorization.kubernetes.io/{version}/resourceAccessReviews -d @resource-access-review.json
// or
accessReviewResult, err := Client.ResourceAccessReviews().Create(resourceAccessReviewObject)
// output
{
  "kind": "ResourceAccessReviewResponse",
  "apiVersion": "authorization.kubernetes.io/v1",
  "namespace": "default",
  "users": ["Clark", "Hubert"],
  "groups": ["cluster-admins"]
}
```
The actual Go objects look like this:
```go
// ResourceAccessReview is a means to request a list of which users and groups
// are authorized to perform the action specified by spec
type ResourceAccessReview struct {
kapi.TypeMeta `json:",inline"`
// AuthorizationAttributes describes the action being tested.
AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
}
// ResourceAccessReviewResponse describes who can perform the action
type ResourceAccessReviewResponse struct {
kapi.TypeMeta
// Users is the list of users who can perform the action
Users []string
// Groups is the list of groups who can perform the action
Groups []string
}
// LocalResourceAccessReview is a means to request a list of which users and
// groups are authorized to perform the action specified in a specific namespace
type LocalResourceAccessReview struct {
kapi.TypeMeta `json:",inline"`
// AuthorizationAttributes describes the action being tested.
AuthorizationAttributes `json:"authorizationAttributes" description:"the action being tested"`
}
// LocalResourceAccessReviewResponse describes who can perform the action
type LocalResourceAccessReviewResponse struct {
kapi.TypeMeta
// Namespace is the namespace used for the access review
Namespace string
// Users is the list of users who can perform the action
Users []string
// Groups is the list of groups who can perform the action
Groups []string
}
```
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/enhance-pluggable-policy.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/enhance-pluggable-policy.md)

View File

@ -1,169 +1 @@
# Kubernetes Event Compression
This document captures the design of event compression.
## Background
Kubernetes components can get into a state where they generate tons of events.
The events can be categorized in one of two ways:
1. same - The event is identical to previous events except it varies only on
timestamp.
2. similar - The event is identical to previous events except it varies on
timestamp and message.
For example, when pulling a non-existing image, Kubelet will repeatedly generate
`image_not_existing` and `container_is_waiting` events until upstream components
correct the image. When this happens, the spam from the repeated events makes
the entire event mechanism useless. It also appears to cause memory pressure in
etcd (see [#3853](http://issue.k8s.io/3853)).
The goal is to introduce event counting to increment same events, and event
aggregation to collapse similar events.
## Proposal
Each binary that generates events (for example, `kubelet`) should keep track of
previously generated events so that it can collapse recurring events into a
single event instead of creating a new instance for each new event. In addition,
if many similar events are created, events should be aggregated into a single
event to reduce spam.
Event compression should be best effort (not guaranteed). Meaning, in the worst
case, `n` identical (minus timestamp) events may still result in `n` event
entries.
## Design
Instead of a single Timestamp, each event object
[contains](http://releases.k8s.io/HEAD/pkg/api/types.go#L1111) the following
fields:
* `FirstTimestamp unversioned.Time`
* The date/time of the first occurrence of the event.
* `LastTimestamp unversioned.Time`
* The date/time of the most recent occurrence of the event.
* On first occurrence, this is equal to the FirstTimestamp.
* `Count int`
* The number of occurrences of this event between FirstTimestamp and
LastTimestamp.
* On first occurrence, this is 1.
Each binary that generates events:
* Maintains a historical record of previously generated events:
* Implemented with
["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go)
in [`pkg/client/record/events_cache.go`](../../pkg/client/record/events_cache.go).
* Implemented behind an `EventCorrelator` that manages two subcomponents:
`EventAggregator` and `EventLogger`.
* The `EventCorrelator` observes all incoming events and lets each
subcomponent visit and modify the event in turn.
* The `EventAggregator` runs an aggregation function over each event. This
function buckets each event based on an `aggregateKey` and identifies the event
uniquely with a `localKey` in that bucket.
* The default aggregation function groups similar events that differ only by
`event.Message`. Its `localKey` is `event.Message` and its aggregate key is
produced by joining:
* `event.Source.Component`
* `event.Source.Host`
* `event.InvolvedObject.Kind`
* `event.InvolvedObject.Namespace`
* `event.InvolvedObject.Name`
* `event.InvolvedObject.UID`
* `event.InvolvedObject.APIVersion`
* `event.Reason`
* If the `EventAggregator` observes a similar event produced 10 times in a 10
minute window, it drops the event that was provided as input and creates a new
event that differs only on the message. The message denotes that this event is
used to group similar events that matched on reason. This aggregated `Event` is
then used in the event processing sequence.
* The `EventLogger` observes the event coming out of the `EventAggregator` and tracks
the number of times it has observed that event previously by incrementing a key
in a cache associated with that matching event.
* The key in the cache is generated from the event object minus
timestamps/count/transient fields. Specifically, the following event fields are
used to construct a unique key for an event:
* `event.Source.Component`
* `event.Source.Host`
* `event.InvolvedObject.Kind`
* `event.InvolvedObject.Namespace`
* `event.InvolvedObject.Name`
* `event.InvolvedObject.UID`
* `event.InvolvedObject.APIVersion`
* `event.Reason`
* `event.Message`
* The LRU cache is capped at 4096 events for both `EventAggregator` and
`EventLogger`. That means if a component (e.g. kubelet) runs for a long period
of time and generates tons of unique events, the previously generated events
cache will not grow unchecked in memory. Instead, after 4096 unique events are
generated, the oldest events are evicted from the cache.
* When an event is generated, the previously generated events cache is checked
(see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/HEAD/pkg/client/record/event.go)).
* If the key for the new event matches the key for a previously generated
event (meaning all of the above fields match between the new event and some
previously generated event), then the event is considered to be a duplicate and
the existing event entry is updated in etcd:
* The new PUT (update) event API is called to update the existing event
entry in etcd with the new last seen timestamp and count.
* The event is also updated in the previously generated events cache with
an incremented count, updated last seen timestamp, name, and new resource
version (all required to issue a future event update).
* If the key for the new event does not match the key for any previously
generated event (meaning at least one of the above fields differs between the new event
and every previously generated event), then the event is considered to be
new/unique and a new event entry is created in etcd:
* The usual POST/create event API is called to create a new event entry in
etcd.
* An entry for the event is also added to the previously generated events
cache.
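The key-based duplicate counting described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `pkg/client/record`: the `Event` struct is a flattened stand-in for the real object, the `|` separator and the names `eventKey`, `eventLogger`, and `observe` are hypothetical, and the LRU eviction (capped at 4096 entries) is omitted in favor of a plain map.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a flattened, simplified stand-in for the real event object.
type Event struct {
	Component, Host                        string // event.Source.*
	Kind, Namespace, Name, UID, APIVersion string // event.InvolvedObject.*
	Reason, Message                        string
	Count                                  int
}

// eventKey joins the identifying fields listed above; timestamps and count
// are deliberately excluded so that repeats map to the same key.
func eventKey(e Event) string {
	return strings.Join([]string{
		e.Component, e.Host,
		e.Kind, e.Namespace, e.Name, e.UID, e.APIVersion,
		e.Reason, e.Message,
	}, "|")
}

// eventLogger counts how often each key has been observed.
type eventLogger struct {
	counts map[string]int
}

func newEventLogger() *eventLogger {
	return &eventLogger{counts: make(map[string]int)}
}

// observe returns the event with its updated Count and reports whether it is
// a duplicate (update the existing entry) or new (create a new entry).
func (l *eventLogger) observe(e Event) (Event, bool) {
	key := eventKey(e)
	l.counts[key]++
	e.Count = l.counts[key]
	return e, e.Count > 1
}

func main() {
	logger := newEventLogger()
	failed := Event{
		Component: "scheduler",
		Kind:      "Pod",
		Name:      "skydns-ls6k1",
		Reason:    "failedScheduling",
		Message:   "Error scheduling: no nodes available to schedule pods",
	}
	for i := 0; i < 4; i++ {
		e, duplicate := logger.observe(failed)
		fmt.Println(e.Count, duplicate)
	}
}
```

In the real design the duplicate branch corresponds to the PUT (update) call and the new branch to the POST (create) call.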
## Issues/Risks
* Compression is not guaranteed, because each component keeps track of event
history in memory
* An application restart causes event history to be cleared, meaning event
history is not preserved across application restarts and compression will not
occur across component restarts.
* Because an LRU cache is used to keep track of previously generated events,
if too many unique events are generated, old events will be evicted from the
cache, so events will only be compressed until they age out of the events cache,
at which point any new instance of the event will cause a new entry to be
created in etcd.
## Example
Sample kubectl output:
```console
FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE
Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-node-4.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Starting kubelet.
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-1.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-1.c.saad-dev-vms.internal} Starting kubelet.
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-3.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-3.c.saad-dev-vms.internal} Starting kubelet.
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-node-2.c.saad-dev-vms.internal Node starting {kubelet kubernetes-node-2.c.saad-dev-vms.internal} Starting kubelet.
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet kubernetes-node-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest"
Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-node-4.c.saad-dev-vms.internal
```
This demonstrates what would have been 20 separate entries (indicating
scheduling failure) collapsed/compressed down to 5 entries.
## Related Pull Requests/Issues
* Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events.
* PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API.
* PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow
compressing multiple recurring events in to a single event.
* PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a
single event to optimize etcd storage.
* PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache
instead of map.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/event_compression.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/event_compression.md)

View File

@ -1,417 +1 @@
# Variable expansion in pod command, args, and env
## Abstract
A proposal for the expansion of environment variables using a simple `$(var)`
syntax.
## Motivation
It is extremely common for users to need to compose environment variables or
pass arguments to their commands using the values of environment variables.
Kubernetes should provide a facility for the 80% cases in order to decrease
coupling and the use of workarounds.
## Goals
1. Define the syntax format
2. Define the scoping and ordering of substitutions
3. Define the behavior for unmatched variables
4. Define the behavior for unexpected/malformed input
## Constraints and Assumptions
* This design should describe the simplest possible syntax to accomplish the
use-cases.
* Expansion syntax will not support more complicated shell-like behaviors such
as default values (viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc.
## Use Cases
1. As a user, I want to compose new environment variables for a container using
a substitution syntax to reference other variables in the container's
environment and service environment variables.
1. As a user, I want to substitute environment variables into a container's
command.
1. As a user, I want to do the above without requiring the container's image to
have a shell.
1. As a user, I want to be able to specify a default value for a service
variable which may not exist.
1. As a user, I want to see an event associated with the pod if an expansion
fails (i.e., references variable names that cannot be expanded).
### Use Case: Composition of environment variables
Currently, containers are injected with docker-style environment variables for
the services in their pod's namespace. There are several variables for each
service, but users routinely need to compose URLs based on these variables
because there is not a variable for the exact format they need. Users should be
able to build new environment variables with the exact format they need.
Eventually, it should also be possible to turn off the automatic injection of
the docker-style variables into pods and let the users consume the exact
information they need via the downward API and composition.
#### Expanding expanded variables
It should be possible to reference a variable which is itself the result of an
expansion, if the referenced variable is declared in the container's environment
prior to the one referencing it. Put another way -- a container's environment is
expanded in order, and expanded variables are available to subsequent
expansions.
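The ordering rule above can be sketched as follows. This is a minimal illustration, not the kubelet's implementation: `EnvVar` and `ExpandEnvInOrder` are hypothetical names, and the regular expression only recognises plain `$(NAME)` references (the `$$` escape is deliberately ignored here).

```go
package main

import (
	"fmt"
	"regexp"
)

// EnvVar is a hypothetical name/value pair; slice order is declaration order.
type EnvVar struct {
	Name  string
	Value string
}

// refPattern matches a plain $(NAME) reference. Escapes ($$) are not handled.
var refPattern = regexp.MustCompile(`\$\(([A-Za-z_][A-Za-z0-9_]*)\)`)

// ExpandEnvInOrder expands each value using only the variables declared
// before it, so earlier expansions are visible to later ones.
func ExpandEnvInOrder(env []EnvVar) map[string]string {
	resolved := make(map[string]string, len(env))
	for _, e := range env {
		resolved[e.Name] = refPattern.ReplaceAllStringFunc(e.Value, func(ref string) string {
			name := ref[2 : len(ref)-1] // strip "$(" and ")"
			if val, ok := resolved[name]; ok {
				return val
			}
			return ref // not declared yet (or at all): keep the reference verbatim
		})
	}
	return resolved
}

func main() {
	env := []EnvVar{
		{"HOST", "gitserver"},
		{"URL", "http://$(HOST):$(PORT)"},
		{"PORT", "8080"},
	}
	fmt.Println(ExpandEnvInOrder(env)["URL"])
}
```

Because `PORT` is declared after `URL`, only `$(HOST)` is expanded in `URL`; the `$(PORT)` reference is left as written.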
### Use Case: Variable expansion in command
Users frequently need to pass the values of environment variables to a
container's command. Currently, Kubernetes does not perform any expansion of
variables. The workaround is to invoke a shell in the container's command and
have the shell perform the substitution, or to write a wrapper script that sets
up the environment and runs the command. This has a number of drawbacks:
1. Solutions that require a shell are unfriendly to images that do not contain
a shell.
2. Wrapper scripts make it harder to use images as base images.
3. Wrapper scripts increase coupling to Kubernetes.
Users should be able to do the 80% case of variable expansion in command without
writing a wrapper script or adding a shell invocation to their containers'
commands.
### Use Case: Images without shells
The current workaround for variable expansion in a container's command requires
the container's image to have a shell. This is unfriendly to images that do not
contain a shell (`scratch` images, for example). Users should be able to perform
the other use-cases in this design without regard to the content of their
images.
### Use Case: See an event for incomplete expansions
It is possible that a container with incorrect variable values or command line
may continue to run for a long period of time, and that the end-user would have
no visual or obvious warning of the incorrect configuration. If the kubelet
creates an event when an expansion references a variable that cannot be
expanded, it will help users quickly detect problems with expansions.
## Design Considerations
### What features should be supported?
In order to limit complexity, we want to provide the right amount of
functionality so that the 80% cases can be realized and nothing more. We felt
that the essentials boiled down to:
1. Ability to perform direct expansion of variables in a string.
2. Ability to specify default values via a prioritized mapping function but
without support for defaults as a syntax-level feature.
### What should the syntax be?
The exact syntax for variable expansion has a large impact on how users perceive
and relate to the feature. We considered implementing a very restrictive subset
of the shell `${var}` syntax. This syntax is an attractive option on some level,
because many people are familiar with it. However, this syntax also has a large
number of lesser known features such as the ability to provide default values
for unset variables, perform inline substitution, etc.
In the interest of preventing conflation of the expansion feature in Kubernetes
with the shell feature, we chose a different syntax similar to the one in
Makefiles, `$(var)`. We also chose not to support the bare `$var` format, since
it is not required to implement the required use-cases.
Nested references, i.e., variable expansion within variable names, are not
supported.
#### How should unmatched references be treated?
Ideally, it should be extremely clear when a variable reference couldn't be
expanded. We decided the best experience for unmatched variable references would
be to have the entire reference, syntax included, show up in the output. As an
example, if the reference `$(VARIABLE_NAME)` cannot be expanded, then
`$(VARIABLE_NAME)` should be present in the output.
#### Escaping the operator
Although the `$(var)` syntax does overlap with the `$(command)` form of command
substitution supported by many shells, because unexpanded variables are present
verbatim in the output, we expect this will not present a problem to many users.
If there is a collision between a variable name and command substitution syntax,
the syntax can be escaped with the form `$$(VARIABLE_NAME)`, which will evaluate
to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not.
## Design
This design encompasses the variable expansion syntax and specification and the
changes needed to incorporate the expansion feature into the container's
environment and command.
### Syntax and expansion mechanics
This section describes the expansion syntax, evaluation of variable values, and
how unexpected or malformed inputs are handled.
#### Syntax
The inputs to the expansion feature are:
1. A utf-8 string (the input string) which may contain variable references.
2. A function (the mapping function) that maps the name of a variable to the
variable's value, of type `func(string) string`.
Variable references in the input string are indicated exclusively with the syntax
`$(<variable-name>)`. The syntax tokens are:
- `$`: the operator,
- `(`: the reference opener, and
- `)`: the reference closer.
The operator has no meaning unless accompanied by the reference opener and
closer tokens. The operator can be escaped using `$$`. One literal `$` will be
emitted for each `$$` in the input.
The reference opener and closer characters have no meaning when not part of a
variable reference. If a variable reference is malformed, viz: `$(VARIABLE_NAME`
without a closing expression, the operator and expression opening characters are
treated as ordinary characters without special meanings.
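The syntax rules above can be sketched as a single pass over the input. This is an illustrative implementation written for this section, not necessarily the code that shipped; it assumes the mapping function returns the reference wrapped in `$(...)` when a variable is unknown.

```go
package main

import (
	"fmt"
	"strings"
)

// Expand replaces $(NAME) references in input using mapping. "$$" yields one
// literal "$"; a "$" that does not start a well-formed reference is copied
// unchanged.
func Expand(input string, mapping func(string) string) string {
	var out strings.Builder
	for i := 0; i < len(input); i++ {
		c := input[i]
		if c != '$' || i+1 == len(input) {
			out.WriteByte(c) // ordinary byte, or a trailing operator
			continue
		}
		switch input[i+1] {
		case '$': // escaped operator: emit one "$" and skip the second
			out.WriteByte('$')
			i++
		case '(': // possible reference: look for the closer
			end := strings.IndexByte(input[i+2:], ')')
			if end < 0 {
				out.WriteByte('$') // malformed: "$" is an ordinary character
				continue
			}
			out.WriteString(mapping(input[i+2 : i+2+end]))
			i += 2 + end // move to the closer; the loop steps past it
		default: // operator without an opener has no meaning
			out.WriteByte('$')
		}
	}
	return out.String()
}

func main() {
	vars := map[string]string{"VAR_A": "A", "VAR_B": "B"}
	mapping := func(name string) string {
		if v, ok := vars[name]; ok {
			return v
		}
		return "$(" + name + ")" // unmatched references are emitted verbatim
	}
	fmt.Println(Expand("$$(VAR_B)_$(VAR_A)", mapping))
	fmt.Println(Expand("$(VAR_DNE)-$(VAR_A", mapping))
}
```

The first call exercises the escape rule and the second exercises an unmatched and a malformed reference.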
#### Scope and ordering of substitutions
The scope in which variable references are expanded is defined by the mapping
function. Within the mapping function, any arbitrary strategy may be used to
determine the value of a variable name. The most basic implementation of a
mapping function is to use a `map[string]string` to lookup the value of a
variable.
In order to support default values for variables like service variables
presented by the kubelet, which may not be bound because the service that
provides them does not yet exist, there should be a mapping function that uses a
list of `map[string]string` like:
```go
func MakeMappingFunc(maps ...map[string]string) func(string) string {
	return func(input string) string {
		for _, context := range maps {
			val, ok := context[input]
			if ok {
				return val
			}
		}
		return ""
	}
}

// elsewhere
containerEnv := map[string]string{
	"FOO":           "BAR",
	"ZOO":           "ZAB",
	"SERVICE2_HOST": "some-host",
}
serviceEnv := map[string]string{
	"SERVICE_HOST": "another-host",
	"SERVICE_PORT": "8083",
}

// single-map variation
mapping := MakeMappingFunc(containerEnv)

// default variables not found in serviceEnv
mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv)
```
### Implementation changes
The necessary changes to implement this functionality are:
1. Add a new interface, `ObjectEventRecorder`, which is like the
`EventRecorder` interface, but scoped to a single object, and a function that
returns an `ObjectEventRecorder` given an `ObjectReference` and an
`EventRecorder`.
2. Introduce `third_party/golang/expansion` package that provides:
    1. An `Expand(string, func(string) string) string` function.
    2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) func(string) string`
       function.
3. Make the kubelet expand environment correctly.
4. Make the kubelet expand command correctly.
#### Event Recording
In order to provide an event when an expansion references undefined variables,
the mapping function must be able to create an event. In order to facilitate
this, we should create a new interface in the `api/client/record` package which
is similar to `EventRecorder`, but scoped to a single object:
```go
// ObjectEventRecorder knows how to record events about a single object.
type ObjectEventRecorder interface {
	// Event constructs an event from the given information and puts it in the queue for sending.
	// 'reason' is the reason this event is generated. 'reason' should be short and unique; it will
	// be used to automate handling of events, so imagine people writing switch statements to
	// handle them. You want to make that easy.
	// 'message' is intended to be human readable.
	//
	// The resulting event will be created in the same namespace as the reference object.
	Event(reason, message string)

	// Eventf is just like Event, but with Sprintf for the message field.
	Eventf(reason, messageFmt string, args ...interface{})

	// PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field.
	PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{})
}
```
There should also be a function that can construct an `ObjectEventRecorder` from a `runtime.Object`
and an `EventRecorder`:
```go
type objectRecorderImpl struct {
	object   runtime.Object
	recorder EventRecorder
}

func (r *objectRecorderImpl) Event(reason, message string) {
	r.recorder.Event(r.object, reason, message)
}

func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder {
	return &objectRecorderImpl{object, recorder}
}
```
#### Expansion package
The expansion package should provide two functions:
```go
// MappingFuncFor returns a mapping function for use with Expand that
// implements the expansion semantics defined in the expansion spec; it
// returns the input string wrapped in the expansion syntax if no mapping
// for the input is found. If no expansion is found for a key, an event
// is raised on the given recorder.
func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string {
	// ...
}

// Expand replaces variable references in the input string according to
// the expansion spec using the given mapping function to resolve the
// values of variables.
func Expand(input string, mapping func(string) string) string {
	// ...
}
```
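One way the elided body of `MappingFuncFor` might look is sketched below. To keep the example self-contained, the recorder is replaced by a plain callback; the `onMissing` parameter is a hypothetical stand-in for `ObjectEventRecorder` and is not part of the proposed signature.

```go
package main

import "fmt"

// MappingFuncFor returns a mapping function that consults each context map in
// order. onMissing is a hypothetical stand-in for the ObjectEventRecorder: it
// is called with the variable name when no map defines it.
func MappingFuncFor(onMissing func(name string), context ...map[string]string) func(string) string {
	return func(name string) string {
		for _, vars := range context {
			if val, ok := vars[name]; ok {
				return val
			}
		}
		if onMissing != nil {
			onMissing(name) // where the real function would record an event
		}
		return "$(" + name + ")" // wrap the name back in the expansion syntax
	}
}

func main() {
	serviceEnv := map[string]string{"SERVICE_HOST": "another-host"}
	containerEnv := map[string]string{"SERVICE_HOST": "default-host", "FOO": "BAR"}

	var missing []string
	mapping := MappingFuncFor(func(n string) { missing = append(missing, n) }, serviceEnv, containerEnv)

	host, foo, nope := mapping("SERVICE_HOST"), mapping("FOO"), mapping("NOPE")
	fmt.Println(host, foo, nope, missing)
}
```

Earlier maps take priority, so passing `serviceEnv` before `containerEnv` lets the container's own value act as a default when the service variable is not bound.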
#### Kubelet changes
The Kubelet should be made to correctly expand variable references in a
container's environment, command, and args. Changes will need to be made to:
1. The `makeEnvironmentVariables` function in the kubelet; this is used by
`GenerateRunContainerOptions`, which is used by both the docker and rkt
container runtimes.
2. The docker manager `setEntrypointAndCommand` func has to be changed to
perform variable expansion.
3. The rkt runtime should be made to support expansion in command and args
when support for it is implemented.
### Examples
#### Inputs and outputs
These examples are in the context of the mapping:
| Name | Value |
|-------------|------------|
| `VAR_A` | `"A"` |
| `VAR_B` | `"B"` |
| `VAR_C` | `"C"` |
| `VAR_REF` | `$(VAR_A)` |
| `VAR_EMPTY` | `""` |
No other variables are defined.
| Input | Result |
|--------------------------------|----------------------------|
| `"$(VAR_A)"` | `"A"` |
| `"___$(VAR_B)___"` | `"___B___"` |
| `"___$(VAR_C)"` | `"___C"` |
| `"$(VAR_A)-$(VAR_A)"` | `"A-A"` |
| `"$(VAR_A)-1"` | `"A-1"` |
| `"$(VAR_A)_$(VAR_B)_$(VAR_C)"` | `"A_B_C"` |
| `"$$(VAR_B)_$(VAR_A)"` | `"$(VAR_B)_A"` |
| `"$$(VAR_A)_$$(VAR_B)"` | `"$(VAR_A)_$(VAR_B)"` |
| `"f000-$$VAR_A"` | `"f000-$VAR_A"` |
| `"foo\\$(VAR_C)bar"` | `"foo\Cbar"` |
| `"foo\\\\$(VAR_C)bar"` | `"foo\\Cbar"` |
| `"foo\\\\\\\\$(VAR_A)bar"` | `"foo\\\\Abar"` |
| `"$(VAR_A$(VAR_B))"` | `"$(VAR_A$(VAR_B))"` |
| `"$(VAR_A$(VAR_B)"` | `"$(VAR_A$(VAR_B)"` |
| `"$(VAR_REF)"` | `"$(VAR_A)"` |
| `"%%$(VAR_REF)--$(VAR_REF)%%"` | `"%%$(VAR_A)--$(VAR_A)%%"` |
| `"foo$(VAR_EMPTY)bar"` | `"foobar"` |
| `"foo$(VAR_Awhoops!"` | `"foo$(VAR_Awhoops!"` |
| `"f00__(VAR_A)__"` | `"f00__(VAR_A)__"` |
| `"$?_boo_$!"` | `"$?_boo_$!"` |
| `"$VAR_A"` | `"$VAR_A"` |
| `"$(VAR_DNE)"` | `"$(VAR_DNE)"` |
| `"$$$$$$(BIG_MONEY)"` | `"$$$(BIG_MONEY)"` |
| `"$$$$$$(VAR_A)"` | `"$$$(VAR_A)"` |
| `"$$$$$$$(GOOD_ODDS)"` | `"$$$$(GOOD_ODDS)"` |
| `"$$$$$$$(VAR_A)"` | `"$$$A"` |
| `"$VAR_A)"` | `"$VAR_A)"` |
| `"${VAR_A}"` | `"${VAR_A}"` |
| `"$(VAR_B)_______$(A"` | `"B_______$(A"` |
| `"$(VAR_C)_______$("` | `"C_______$("` |
| `"$(VAR_A)foobarzab$"` | `"Afoobarzab$"` |
| `"foo-\\$(VAR_A"` | `"foo-\$(VAR_A"` |
| `"--$($($($($--"` | `"--$($($($($--"` |
| `"$($($($($--foo$("` | `"$($($($($--foo$("` |
| `"foo0--$($($($("` | `"foo0--$($($($("` |
| `"$(foo$$var)"` | `"$(foo$$var)"` |
#### In a pod: building a URL
Notice the `$(var)` syntax.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: expansion-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: PUBLIC_URL
          value: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)"
  restartPolicy: Never
```
#### In a pod: building a URL using downward API
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: expansion-pod
spec:
  containers:
    - name: test-container
      image: gcr.io/google_containers/busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: "metadata.namespace"
        - name: PUBLIC_URL
          value: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)"
  restartPolicy: Never
```
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/expansion.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/expansion.md)

View File

@ -1,203 +1 @@
# Adding custom resources to the Kubernetes API server
This document describes the design for implementing the storage of custom API
types in the Kubernetes API Server.
## Resource Model
### The ThirdPartyResource
The `ThirdPartyResource` resource describes the multiple versions of a custom
resource that the user wants to add to the Kubernetes API. `ThirdPartyResource`
is a non-namespaced resource; attempting to place it in a namespace will return
an error.
Each `ThirdPartyResource` resource has the following:
* Standard Kubernetes object metadata.
* ResourceKind - The kind of the resources described by this third party
resource.
* Description - A free text description of the resource.
* APIGroup - An API group that this resource should be placed into.
* Versions - One or more `Version` objects.
### The `Version` Object
The `Version` object describes a single concrete version of a custom resource.
The `Version` object currently only specifies:
* The `Name` of the version.
* The `APIGroup` this version should belong to.
## Expectations about third party objects
Every object that is added to a third-party Kubernetes object store is expected
to contain Kubernetes compatible [object metadata](../devel/api-conventions.md#metadata).
This requirement enables the Kubernetes API server to provide the following
features:
* Filtering lists of objects via label queries.
* `resourceVersion`-based optimistic concurrency via compare-and-swap.
* Versioned storage.
* Event recording.
* Integration with basic `kubectl` command line tooling.
* Watch for resource changes.
The `Kind` for an instance of a third-party object (e.g. CronTab) below is
expected to be programmatically convertible to the name of the resource using
the following conversion. Kinds are expected to be of the form
`<CamelCaseKind>`, and the `APIVersion` for the object is expected to be
`<api-group>/<api-version>`. To prevent collisions, it's expected that you'll
use a DNS name of at least three segments for the API group, e.g. `mygroup.example.com`.
For example `mygroup.example.com/v1`
'CamelCaseKind' is the specific type name.
To convert this into the `metadata.name` for the `ThirdPartyResource` resource
instance, the API group's domain name is copied verbatim, and the `CamelCaseKind` is
converted using '-' instead of capitalization ('camel-case'), with the first
character being assumed to be capitalized. In pseudo code:
```go
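// Uses unicode.IsUpper and unicode.ToLower from the standard "unicode" package.
var result []rune
for ix, r := range kindName {
	// Insert '-' before every capital except the leading one.
	if ix > 0 && unicode.IsUpper(r) {
		result = append(result, '-')
	}
	result = append(result, unicode.ToLower(r))
}
// string(result) holds the converted kind, e.g. "camel-case-kind".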
```
As a concrete example, the resource named `camel-case-kind.mygroup.example.com` defines
resources of Kind `CamelCaseKind`, in the APIGroup with the prefix
`mygroup.example.com/...`.
The reason for this is to enable rapid lookup of a `ThirdPartyResource` object
given the kind information. This is also the reason why `ThirdPartyResource` is
not namespaced.
## Usage
When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts
by creating a new, namespaced RESTful resource path. For now, non-namespaced
objects are not supported. As with existing built-in objects, deleting a
namespace deletes all third party resources in that namespace.
For example, if a user creates:
```yaml
metadata:
  name: cron-tab.mygroup.example.com
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
description: "A specification of a Pod to run on a cron style schedule"
versions:
- name: v1
- name: v2
```
Then the API server will program in the new RESTful resource path:
* `/apis/mygroup.example.com/v1/namespaces/<namespace>/crontabs/...`
**Note: Registration of the RESTful resource path may take a while. Always
check that the path exists before you create resource instances.**
Now that this schema has been created, a user can `POST`:
```json
{
"metadata": {
"name": "my-new-cron-object"
},
"apiVersion": "mygroup.example.com/v1",
"kind": "CronTab",
"cronSpec": "* * * * /5",
"image": "my-awesome-cron-image"
}
```
to: `/apis/mygroup.example.com/v1/namespaces/default/crontabs`
and the corresponding data will be stored into etcd by the APIServer, so that
when the user issues:
```
GET /apis/mygroup.example.com/v1/namespaces/default/crontabs/my-new-cron-object
```
they will get back the same data, but with additional
Kubernetes metadata (e.g. `resourceVersion`, `createdTimestamp`) filled in.
Likewise, to list all resources, a user can issue:
```
GET /apis/mygroup.example.com/v1/namespaces/default/crontabs
```
and get back:
```json
{
"apiVersion": "mygroup.example.com/v1",
"kind": "CronTabList",
"items": [
{
"metadata": {
"name": "my-new-cron-object"
},
"apiVersion": "mygroup.example.com/v1",
"kind": "CronTab",
"cronSpec": "* * * * /5",
"image": "my-awesome-cron-image"
}
]
}
```
Because all objects are expected to contain standard Kubernetes metadata fields,
these list operations can also use label queries to filter requests down to
specific subsets.
Likewise, clients can use watch endpoints to watch for changes to stored
objects.
## Storage
In order to store custom user data in a versioned fashion inside of etcd, we
need to also introduce a `Codec`-compatible object for persistent storage in
etcd. This object is `ThirdPartyResourceData` and it contains:
* Standard API Metadata.
* `Data`: The raw JSON data for this custom object.
### Storage key specification
Each custom object stored by the API server needs a custom key in storage. The
key is described below:
#### Definitions
* `resource-namespace`: the namespace of the particular resource that is
being stored
* `resource-name`: the name of the particular resource being stored
* `third-party-resource-namespace`: the namespace of the `ThirdPartyResource`
resource that represents the type for the specific instance being stored
* `third-party-resource-name`: the name of the `ThirdPartyResource` resource
that represents the type for the specific instance being stored
#### Key
Given the definitions above, the key for a specific third-party object is:
```
${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/${resource-name}
```
Thus, listing a third-party resource can be achieved by listing the directory:
```
${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/
```
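The key layout above can be sketched as two small helpers. `ObjectKey` and `ListKey` are hypothetical names, and the `/registry` prefix and the `default` namespace segments in `main` are placeholder values for illustration only, not values taken from this design.

```go
package main

import (
	"fmt"
	"strings"
)

// ListKey returns the directory that holds every object of one third-party
// resource type within a resource namespace; note the trailing slash.
func ListKey(prefix, tprNamespace, tprName, resourceNamespace string) string {
	segments := []string{prefix, "third-party-resources", tprNamespace, tprName, resourceNamespace}
	return strings.Join(segments, "/") + "/"
}

// ObjectKey returns the storage key for one third-party object, which is the
// list directory followed by the object's name.
func ObjectKey(prefix, tprNamespace, tprName, resourceNamespace, resourceName string) string {
	return ListKey(prefix, tprNamespace, tprName, resourceNamespace) + resourceName
}

func main() {
	// All argument values below are placeholders for illustration.
	fmt.Println(ObjectKey("/registry", "default", "cron-tab.mygroup.example.com", "default", "my-new-cron-object"))
	fmt.Println(ListKey("/registry", "default", "cron-tab.mygroup.example.com", "default"))
}
```

Deriving the object key from the list key keeps the two formats in step, so a directory listing always covers exactly the objects stored under it.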
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/extending-api.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/extending-api.md)

View File

@ -1,513 +1 @@
# Federated ReplicaSets
# Requirements & Design Document
This document is a markdown version converted from a working [Google Doc](https://docs.google.com/a/google.com/document/d/1C1HEHQ1fwWtEhyl9JYu6wOiIUJffSmFmZgkGta4720I/edit?usp=sharing). Please refer to the original for extended commentary and discussion.
Author: Marcin Wielgus [mwielgus@google.com](mailto:mwielgus@google.com)
Based on discussions with
Quinton Hoole [quinton@google.com](mailto:quinton@google.com), Wojtek Tyczyński [wojtekt@google.com](mailto:wojtekt@google.com)
## Overview
### Summary & Vision
When running a global application on a federation of Kubernetes
clusters, the owner currently has to start it in multiple clusters and
check that enough application replicas are running both
locally in each of the clusters (so that, for example, users are
handled by a nearby cluster, with low latency) and globally (so that
there is always enough capacity to handle all traffic). If one of the
clusters has issues or doesn't have enough capacity to run the given set of
replicas, the replicas should be automatically moved to some other
cluster to keep the application responsive.
In single cluster Kubernetes there is a concept of ReplicaSet that
manages the replicas locally. We want to expand this concept to the
federation level.
### Goals
+ Win large enterprise customers who want to easily run applications
across multiple clusters
+ Create a reference controller implementation to facilitate bringing
other Kubernetes concepts to Federated Kubernetes.
## Glossary
Federation Cluster - a cluster that is a member of federation.
Local ReplicaSet (LRS) - ReplicaSet defined and running on a cluster
that is a member of federation.
Federated ReplicaSet (FRS) - ReplicaSet defined and running inside of Federated K8S server.
Federated ReplicaSet Controller (FRSC) - A controller running inside
of Federated K8S server that controls FRS.
## User Experience
### Critical User Journeys
+ [CUJ1] User wants to create a ReplicaSet in each of the federation
clusters. They create a definition of federated ReplicaSet on the
federated master and (local) ReplicaSets are automatically created
in each of the federation clusters. The number of replicas in each
of the Local ReplicaSets is (perhaps indirectly) configurable by
the user.
+ [CUJ2] When the current number of replicas in a cluster drops below
the desired number and new replicas cannot be scheduled then they
should be started in some other cluster.
### Features Enabling Critical User Journeys
Feature #1 -> CUJ1:
A component which looks for newly created Federated ReplicaSets and
creates the appropriate Local ReplicaSet definitions in the federated
clusters.
Feature #2 -> CUJ2:
A component that checks how many replicas are actually running in each
of the subclusters and whether the number matches the
FederatedReplicaSet preferences (by default spread replicas evenly
across the clusters but custom preferences are allowed - see
below). If it doesn't and the situation is unlikely to improve soon
then the replicas should be moved to other subclusters.
### API and CLI
All interaction with FederatedReplicaSet will be done by issuing
kubectl commands pointed at the Federated Master API Server. All the
commands would behave in a similar way as on the regular master,
however in the next versions (1.5+) some of the commands may give
slightly different output. For example `kubectl describe` on a federated
replica set should also give some information about the subclusters.
Moreover, for safety, some defaults will be different. For example, for
`kubectl delete federatedreplicaset`, `cascade` will be set to false.
FederatedReplicaSet would have the same object as local ReplicaSet
(although it will be accessible in a different part of the
api). Scheduling preferences (how many replicas in which cluster) will
be passed as annotations.
### FederatedReplicaSet preferences
The preferences are expressed by the following structure, passed as a
serialized json inside annotations.
```
type FederatedReplicaSetPreferences struct {
	// If set to true then already scheduled and running replicas may be moved to other clusters
	// in order to bring cluster replicasets towards a desired state. Otherwise, if set to false,
	// up and running replicas will not be moved.
	Rebalance bool `json:"rebalance,omitempty"`

	// Map from cluster name to preferences for that cluster. It is assumed that if a cluster
	// doesn't have a matching entry then it should not have local replica. The cluster matches
	// to "*" if there is no entry with the real cluster name.
	Clusters map[string]ClusterReplicaSetPreferences
}

// Preferences regarding number of replicas assigned to a cluster replicaset within a federated replicaset.
type ClusterReplicaSetPreferences struct {
	// Minimum number of replicas that should be assigned to this Local ReplicaSet. 0 by default.
	MinReplicas int64 `json:"minReplicas,omitempty"`

	// Maximum number of replicas that should be assigned to this Local ReplicaSet. Unbounded if no value provided (default).
	MaxReplicas *int64 `json:"maxReplicas,omitempty"`

	// A number expressing the preference to put an additional replica to this LocalReplicaSet. 0 by default.
	Weight int64
}
```
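As a rough illustration of passing the preferences as serialized JSON inside annotations, the sketch below round-trips the structure through `encoding/json`. The annotation key `prefsAnnotation` is an assumption made for this example (the document does not name one), and the map values use the `ClusterReplicaSetPreferences` type defined above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type ClusterReplicaSetPreferences struct {
	MinReplicas int64  `json:"minReplicas,omitempty"`
	MaxReplicas *int64 `json:"maxReplicas,omitempty"`
	Weight      int64
}

type FederatedReplicaSetPreferences struct {
	Rebalance bool `json:"rebalance,omitempty"`
	Clusters  map[string]ClusterReplicaSetPreferences
}

// prefsAnnotation is a hypothetical annotation key used only in this sketch.
const prefsAnnotation = "example.com/federated-replica-set-preferences"

// ToAnnotations serializes the preferences into an annotations map.
func ToAnnotations(p FederatedReplicaSetPreferences) (map[string]string, error) {
	raw, err := json.Marshal(p)
	if err != nil {
		return nil, err
	}
	return map[string]string{prefsAnnotation: string(raw)}, nil
}

// FromAnnotations reads the preferences back from an annotations map.
func FromAnnotations(a map[string]string) (FederatedReplicaSetPreferences, error) {
	var p FederatedReplicaSetPreferences
	err := json.Unmarshal([]byte(a[prefsAnnotation]), &p)
	return p, err
}

func main() {
	limit := int64(20)
	prefs := FederatedReplicaSetPreferences{
		Rebalance: true,
		Clusters: map[string]ClusterReplicaSetPreferences{
			"*": {Weight: 1},
			"C": {MaxReplicas: &limit, Weight: 1},
		},
	}
	ann, err := ToAnnotations(prefs)
	if err != nil {
		panic(err)
	}
	fmt.Println(ann[prefsAnnotation])
}
```

Fields without a `json` tag (`Clusters`, `Weight`) are serialized under their Go field names, and a nil `MaxReplicas` is omitted, which is how "unbounded" is represented.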
How this works in practice:
**Scenario 1**. I want to spread my 50 replicas evenly across all available clusters. Config:
```
FederatedReplicaSetPreferences {
Rebalance : true
Clusters : map[string]LocalReplicaSet {
"*" : LocalReplicaSet{ Weight: 1}
}
}
```
Example:
+ Clusters A,B,C, all have capacity.
Replica layout: A=16 B=17 C=17.
+ Clusters A,B,C and C has capacity for 6 replicas.
Replica layout: A=22 B=22 C=6
+ Clusters A,B,C. B and C are offline:
Replica layout: A=50
**Scenario 2**. I want to have only 2 replicas in each of the clusters.
```
FederatedReplicaSetPreferences {
Rebalance : true
Clusters : map[string]LocalReplicaSet {
"*" : LocalReplicaSet{ MaxReplicas: 2; Weight: 1}
}
}
```
Or
```
FederatedReplicaSetPreferences {
Rebalance : true
Clusters : map[string]LocalReplicaSet {
"*" : LocalReplicaSet{ MinReplicas: 2; Weight: 0 }
}
}
```
Or
```
FederatedReplicaSetPreferences {
Rebalance : true
Clusters : map[string]LocalReplicaSet {
"*" : LocalReplicaSet{ MinReplicas: 2; MaxReplicas: 2}
}
}
```
There is a global target for 50, however if there are 3 clusters there will be only 6 replicas running.
**Scenario 3**. I want to have 20 replicas in each of 3 clusters.
```
FederatedReplicaSetPreferences {
Rebalance : true
Clusters : map[string]LocalReplicaSet {
"*" : LocalReplicaSet{ MinReplicas: 20; Weight: 0}
}
}
```
There is a global target for 50, however clusters require 60. So some clusters will have fewer replicas.
Replica layout: A=20 B=20 C=10.
**Scenario 4**. I want to have an equal number of replicas in clusters A,B,C, however do not put more than 20 replicas in cluster C.
```
FederatedReplicaSetPreferences {
Rebalance : true
Clusters : map[string]LocalReplicaSet {
"*" : LocalReplicaSet{ Weight: 1}
"C" : LocalReplicaSet{ MaxReplicas: 20, Weight: 1}
}
}
```
Example:
+ All have capacity.
Replica layout: A=16 B=17 C=17.
+ B is offline/has no capacity
Replica layout: A=30 B=0 C=20
+ A and B are offline:
Replica layout: C=20
**Scenario 5**. I want to run my application in cluster A, however if there are troubles FRS can also use clusters B and C, equally.
```
FederatedReplicaSetPreferences {
Clusters : map[string]LocalReplicaSet {
"A" : LocalReplicaSet{ Weight: 1000000}
"B" : LocalReplicaSet{ Weight: 1}
"C" : LocalReplicaSet{ Weight: 1}
}
}
```
Example:
+ All have capacity.
Replica layout: A=50 B=0 C=0.
+ A has capacity for only 40 replicas
Replica layout: A=40 B=5 C=5
**Scenario 6**. I want to run my application in clusters A, B and C. Cluster A gets twice the QPS of the other clusters.
```
FederatedReplicaSetPreferences {
Clusters : map[string]LocalReplicaSet {
"A" : LocalReplicaSet{ Weight: 2}
"B" : LocalReplicaSet{ Weight: 1}
"C" : LocalReplicaSet{ Weight: 1}
}
}
```
**Scenario 7**. I want to spread my 50 replicas evenly across all available clusters, but if there
are already some replicas, please do not move them. Config:
```
FederatedReplicaSetPreferences {
Rebalance : false
Clusters : map[string]LocalReplicaSet {
"*" : LocalReplicaSet{ Weight: 1}
}
}
```
Example:
+ Clusters A,B,C, all have capacity, but A already has 20 replicas
Replica layout: A=20 B=15 C=15.
+ Clusters A,B,C and C has capacity for 6 replicas, A has already 20 replicas.
Replica layout: A=22 B=22 C=6
+ Clusters A,B,C and C has capacity for 6 replicas, A has already 30 replicas.
Replica layout: A=30 B=14 C=6
## The Idea
A new federated controller - Federated Replica Set Controller (FRSC)
will be created inside federated controller manager. Below are
enumerated the key idea elements:
+ [I0] It is considered OK to have slightly higher number of replicas
globally for some time.
+ [I1] FRSC starts an informer on the FederatedReplicaSet that listens
on FRS being created, updated or deleted. On each create/update the
scheduling code will be started to calculate where to put the
replicas. The default behavior is to start the same number of
replicas in each of the clusters. While creating LocalReplicaSets
(LRS) the following errors/issues can occur:
+ [E1] Master rejects LRS creation (for known or unknown
reason). In this case another attempt to create the LRS should be
made in 1m or so. This action can be tied with
[[I5]](#heading=h.ififs95k9rng). Until the LRS is created
the situation is the same as [E5]. If this happens multiple
times all due replicas should be moved elsewhere and later moved
back once the LRS is created.
+ [E2] LRS with the same name but different configuration already
exists. The LRS is then overwritten and an appropriate event
created to explain what happened. Pods under the control of the
old LRS are left intact and the new LRS may adopt them if they
match the selector.
+ [E3] LRS is new but the pods that match the selector exist. The
pods are adopted by the RS (if not owned by some other
RS). However they may have a different image, configuration
etc. Just like with regular LRS.
+ [I2] For each of the clusters FRSC starts a store and an informer on
LRS that will listen for status updates. These status changes are
only interesting in case of trouble. Otherwise it is assumed that
LRS runs trouble-free and there is always the right number of pods
created but possibly not scheduled.
+ [E4] LRS is manually deleted from the local cluster. In this case
a new LRS should be created. It is the same case as
[[E1]](#heading=h.wn3dfsyc4yuh). Any pods that were left behind
won't be killed and will be adopted after the LRS is recreated.
+ [E5] LRS fails to create (not necessarily schedule) the desired
number of pods due to master troubles, admission control
etc. This should be considered as the same situation as replicas
unable to schedule (see [[I4]](#heading=h.dqalbelvn1pv)).
+ [E6] It is impossible to tell that an informer lost connection
with a remote cluster or has another synchronization problem, so it
should be handled by cluster liveness probe and deletion
[[I6]](#heading=h.z90979gc2216).
+ [I3] For each of the clusters start a store and an informer to monitor
whether the created pods are eventually scheduled and what is the
current number of correctly running ready pods. Errors:
+ [E7] It is impossible to tell that an informer lost connection
with a remote cluster or has another synchronization problem, so it
should be handled by cluster liveness probe and deletion
[[I6]](#heading=h.z90979gc2216).
+ [I4] It is assumed that an unscheduled pod is a normal situation
and can last up to X min if there is heavy traffic on the
cluster. However, if the replicas are not scheduled in that time then
FRSC should consider moving most of the unscheduled replicas
elsewhere. For that purpose FRSC will maintain a data structure
where for each FRS-controlled LRS we store a list of pods belonging
to that LRS along with their current status and status change timestamp.
+ [I5] If a new cluster is added to the federation then it doesn't
have an LRS and the situation is equal to
[[E1]](#heading=h.wn3dfsyc4yuh)/[[E4]](#heading=h.vlyovyh7eef).
+ [I6] If a cluster is removed from the federation then the situation
is equal to multiple [E4]. It is assumed that if a connection with
a cluster is lost completely then the cluster is removed from
the cluster list (or marked accordingly) so
[[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda)
don't need to be handled.
+ [I7] All ToBeChecked FRS are browsed every 1 min (configurable),
checked against the current list of clusters, and all missing LRS
are created. This will be executed in combination with [I8].
+ [I8] All pods from ToBeChecked FRS/LRS are browsed every 1 min
(configurable) to check whether some replicas need to be moved between
clusters.
+ FRSC never moves replicas to an LRS that has pods that are not
scheduled/running or pods that failed to be created.
+ When FRSC notices that a number of pods are not scheduled/running
or not_even_created in one LRS for more than Y minutes it takes
most of them from that LRS, leaving a couple still waiting so that once
they are scheduled FRSC will know that it is ok to put some more
replicas in that cluster.
+ [I9] FRS becomes ToBeChecked if:
+ It is newly created
+ Some replica set inside changed its status
+ Some pods inside cluster changed their status
+ Some cluster is added or deleted.
> FRS stops being ToBeChecked if it is in the desired configuration (or is stable enough).
## (RE)Scheduling algorithm
To calculate the (re)scheduling moves for a given FRS:
1. For each cluster FRSC calculates the number of replicas that are placed
(not necessarily up and running) in the cluster and the number of replicas that
failed to be scheduled. Cluster capacity is the difference between
the placed and failed to be scheduled.
2. Order all clusters by their weight and hash of the name so that every time
we process the same replica-set we process the clusters in the same order.
Include federated replica set name in the cluster name hash so that we get
slightly different ordering for different RS, so that not all RS of size 1
end up on the same cluster.
3. Assign the minimum preferred number of replicas to each of the clusters, if
there are enough replicas and capacity.
4. If rebalance = false, assign the previously present replicas to the clusters
(again, only if there are enough replicas and capacity) and
remember the number of extra replicas added (ER).
5. Distribute the remaining replicas with regard to weights and cluster capacity.
In multiple iterations calculate how many of the replicas should end up in each cluster.
For each cluster, cap the number of assigned replicas by the max number of replicas and
the cluster capacity. If there were some extra replicas added to the cluster in step
4, don't really add the replicas but balance them against ER from step 4.
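
Step 2 can be illustrated with a short sketch. It assumes that clusters with a higher weight come first and that FNV-1a is an acceptable hash; the proposal specifies neither, and the names `weighted`, `nameHash` and `orderClusters` are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// weighted pairs a cluster name with its weight from the preferences.
type weighted struct {
	Name   string
	Weight int
}

// nameHash mixes the FRS name into the cluster name hash so that different
// replica sets see a different order among equally weighted clusters.
func nameHash(frsName, clusterName string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(frsName))
	h.Write([]byte{0}) // separator, so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(clusterName))
	return h.Sum32()
}

// orderClusters returns the clusters sorted by weight (descending, an
// assumption of this sketch), then by hash, then by name as a final tie-break.
// The result depends only on the inputs, never on their original order.
func orderClusters(frsName string, clusters []weighted) []weighted {
	out := append([]weighted(nil), clusters...)
	sort.Slice(out, func(i, j int) bool {
		if out[i].Weight != out[j].Weight {
			return out[i].Weight > out[j].Weight
		}
		hi, hj := nameHash(frsName, out[i].Name), nameHash(frsName, out[j].Name)
		if hi != hj {
			return hi < hj
		}
		return out[i].Name < out[j].Name
	})
	return out
}

func main() {
	clusters := []weighted{{"A", 2}, {"B", 1}, {"C", 1}}
	fmt.Println(orderClusters("frontend", clusters))
	fmt.Println(orderClusters("backend", clusters))
}
```

Cluster A is always first here because of its weight; the relative order of B and C depends on the replica set name.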
## Goroutines layout
+ [GR1] Involved in FRS informer (see
[[I1]]). Whenever an FRS is created or
updated it puts the new/updated FRS on FRS_TO_CHECK_QUEUE with
delay 0.
+ [GR2_1...GR2_N] Involved in informers/store on LRS (see
[[I2]]). On all changes the FRS is put on
FRS_TO_CHECK_QUEUE with delay 1min.
+ [GR3_1...GR3_N] Involved in informers/store on Pods
(see [[I3]] and [[I4]]). They maintain the status store
so that for each of the LRS we know the number of pods that are
actually running and ready in O(1) time. They also put the
corresponding FRS on FRS_TO_CHECK_QUEUE with delay 1min.
+ [GR4] Involved in cluster informer (see
[[I5]] and [[I6]] ). It puts all FRS on FRS_TO_CHECK_QUEUE
with delay 0.
+ [GR5_*] Goroutines handling FRS_TO_CHECK_QUEUE that put FRS on
FRS_CHANNEL after the given delay (and remove from
FRS_TO_CHECK_QUEUE). Every time an already present FRS is added to
FRS_TO_CHECK_QUEUE the delays are compared and updated so that the
shorter delay is used.
+ [GR6] Contains a selector that listens on FRS_CHANNEL. Whenever
an FRS is received it is put on a work queue. The work queue has no delay
and makes sure that a single replica set is processed by
only one goroutine.
+ [GR7_*] Goroutines related to the work queue. They fire DoFrsCheck on the FRS.
Multiple replica sets can be processed in parallel. Two goroutines cannot
process the same FRS at the same time.
## Func DoFrsCheck
The function does [[I7]] and [[I8]]. It is assumed that it is run on a
single thread/goroutine so we do not check and evaluate the same FRS on many
goroutines (however if needed the function can be parallelized for
different FRS). It takes data only from the stores maintained by GR2_* and
GR3_*. External communication is only required to:
+ Create LRS. If an LRS doesn't exist it is created after the
rescheduling, when we know how many replicas it should have.
+ Update LRS replica targets.
If FRS is not in the desired state then it is put to
FRS_TO_CHECK_QUEUE with delay 1min (possibly increasing).
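
The two kinds of external calls made by DoFrsCheck can be derived by comparing the scheduled layout with what the LRS store currently reports. The sketch below is only an illustration: `action` and `planActions` are hypothetical names, and clusters that have an LRS but no longer appear in the desired layout are deliberately left out because the text above does not say how they are handled.

```go
package main

import (
	"fmt"
	"sort"
)

// action is one external call DoFrsCheck would have to make.
type action struct {
	Cluster  string
	Replicas int
	Create   bool // true: create the LRS; false: update its replica target
}

// planActions compares the desired replicas per cluster with the replica
// targets of the existing LRSs (as read from the local store) and returns
// the calls needed to converge, sorted by cluster name.
func planActions(desired, current map[string]int) []action {
	var out []action
	for cluster, want := range desired {
		have, exists := current[cluster]
		switch {
		case !exists:
			out = append(out, action{cluster, want, true})
		case have != want:
			out = append(out, action{cluster, want, false})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Cluster < out[j].Cluster })
	return out
}

func main() {
	desired := map[string]int{"A": 2, "B": 2, "C": 2}
	current := map[string]int{"A": 2, "B": 1} // C has no LRS yet
	for _, a := range planActions(desired, current) {
		fmt.Printf("%+v\n", a)
	}
	// An empty plan means the FRS needs no further external calls.
	fmt.Println(len(planActions(desired, desired)) == 0) // true
}
```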
## Monitoring and status reporting
FRSC should expose a number of metrics from the run, like
+ FRSC -> LRS communication latency
+ Total times spent in various elements of DoFrsCheck
FRSC should also expose the status of FRS as an annotation on FRS and
as events.
## Workflow
Here is the sequence of tasks that need to be done in order for a
typical FRS to be split into a number of LRSs and to be created in
the underlying federated clusters.
Note a: the reason the workflow would be helpful at this phase is that
for every one or two steps we can create PRs accordingly to start with
the development.
Note b: we assume that the federation is already in place and the
federated clusters are added to the federation.
Step 1. the client sends an RS create request to the
federation-apiserver
Step 2. federation-apiserver persists an FRS into the federation etcd
Note c: federation-apiserver populates the clusterid field in the FRS
before persisting it into the federation etcd
Step 3. the federation-level “informer” in FRSC watches federation
etcd for new/modified FRSs, with empty clusterid or clusterid equal
to federation ID, and if detected, it calls the scheduling code
Step 4.
Note d: scheduler populates the clusterid field in the LRS with the
IDs of target clusters
Note e: at this point let us assume that it only does the even
distribution, i.e., equal weights for all of the underlying clusters
Step 5. As soon as the scheduler function returns the control to FRSC,
the FRSC starts a number of cluster-level “informer”s, one per every
target cluster, to watch changes in every target cluster etcd
regarding the posted LRSs and if any deviation from the scheduled
number of replicas is detected the scheduling code is re-called for
re-scheduling purposes.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-replicasets.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-replicasets.md)

View File

@ -1,517 +1 @@
# Kubernetes Cluster Federation (previously nicknamed "Ubernetes")
## Cross-cluster Load Balancing and Service Discovery
### Requirements and System Design
### by Quinton Hoole, Dec 3 2015
## Requirements
### Discovery, Load-balancing and Failover
1. **Internal discovery and connection**: Pods/containers (running in
a Kubernetes cluster) must be able to easily discover and connect
to endpoints for Kubernetes services on which they depend in a
consistent way, irrespective of whether those services exist in a
different kubernetes cluster within the same cluster federation.
Such clients are henceforth referred to as "cluster-internal clients", or simply
"internal clients".
1. **External discovery and connection**: External clients (running
outside a Kubernetes cluster) must be able to discover and connect
to endpoints for Kubernetes services on which they depend.
1. **External clients predominantly speak HTTP(S)**: External
clients are most often, but not always, web browsers, or at
least speak HTTP(S). Notable exceptions include Enterprise
Message Busses (Java, TLS), DNS servers (UDP),
SIP servers and databases.
1. **Find the "best" endpoint:** Upon initial discovery and
connection, both internal and external clients should ideally find
"the best" endpoint if multiple eligible endpoints exist. "Best"
in this context implies the closest (by network topology) endpoint
that is both operational (as defined by some positive health check)
and not overloaded (by some published load metric). For example:
1. An internal client should find an endpoint which is local to its
own cluster if one exists, in preference to one in a remote
cluster (if both are operational and non-overloaded).
Similarly, one in a nearby cluster (e.g. in the same zone or
region) is preferable to one further afield.
1. An external client (e.g. in New York City) should find an
endpoint in a nearby cluster (e.g. U.S. East Coast) in
preference to one further away (e.g. Japan).
1. **Easy fail-over:** If the endpoint to which a client is connected
becomes unavailable (no network response/disconnected) or
overloaded, the client should reconnect to a better endpoint,
somehow.
1. In the case where there exist one or more connection-terminating
load balancers between the client and the serving Pod, failover
might be completely automatic (i.e. the client's end of the
connection remains intact, and the client is completely
oblivious of the fail-over). This approach incurs network speed
and cost penalties (by traversing possibly multiple load
balancers), but requires zero smarts in clients, DNS libraries,
recursing DNS servers etc, as the IP address of the endpoint
remains constant over time.
1. In a scenario where clients need to choose between multiple load
balancer endpoints (e.g. one per cluster), multiple DNS A
records associated with a single DNS name enable even relatively
dumb clients to try the next IP address in the list of returned
A records (without even necessarily re-issuing a DNS resolution
request). For example, all major web browsers will try all A
records in sequence until a working one is found (TBD: justify
this claim with details for Chrome, IE, Safari, Firefox).
1. In a slightly more sophisticated scenario, upon disconnection, a
smarter client might re-issue a DNS resolution query, and
(modulo DNS record TTL's which can typically be set as low as 3
minutes, and buggy DNS resolvers, caches and libraries which
have been known to completely ignore TTL's), receive updated A
records specifying a new set of IP addresses to which to
connect.
### Portability
A Kubernetes application configuration (e.g. for a Pod, Replication
Controller, Service etc) should be able to be successfully deployed
into any Kubernetes Cluster or Federation of Clusters,
without modification. More specifically, a typical configuration
should work correctly (although possibly not optimally) across any of
the following environments:
1. A single Kubernetes Cluster on one cloud provider (e.g. Google
Compute Engine, GCE).
1. A single Kubernetes Cluster on a different cloud provider
(e.g. Amazon Web Services, AWS).
1. A single Kubernetes Cluster on a non-cloud, on-premise data center
1. A Federation of Kubernetes Clusters all on the same cloud provider
(e.g. GCE).
1. A Federation of Kubernetes Clusters across multiple different cloud
providers and/or on-premise data centers (e.g. one cluster on
GCE/GKE, one on AWS, and one on-premise).
### Trading Portability for Optimization
It should be possible to explicitly opt out of portability across some
subset of the above environments in order to take advantage of
non-portable load balancing and DNS features of one or more
environments. More specifically, for example:
1. For HTTP(S) applications running on GCE-only Federations,
[GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
should be usable. These provide single, static global IP addresses
which load balance and fail over globally (i.e. across both regions
and zones). These allow for really dumb clients, but they only
work on GCE, and only for HTTP(S) traffic.
1. For non-HTTP(S) applications running on GCE-only Federations within
a single region,
[GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
should be usable. These provide TCP (i.e. both HTTP/S and
non-HTTP/S) load balancing and failover, but only on GCE, and only
within a single region.
[Google Cloud DNS](https://cloud.google.com/dns) can be used to
route traffic between regions (and between different cloud
providers and on-premise clusters, as it's plain DNS, IP only).
1. For applications running on AWS-only Federations,
[AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
should be usable. These provide both L7 (HTTP(S)) and L4 load
balancing, but only within a single region, and only on AWS
([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
used to load balance and fail over across multiple regions, and is
also capable of resolving to non-AWS endpoints).
## Component Cloud Services
Cross-cluster Federated load balancing is built on top of the following:
1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
provide single, static global IP addresses which load balance and
fail over globally (i.e. across both regions and zones). These
allow for really dumb clients, but they only work on GCE, and only
for HTTP(S) traffic.
1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
provide both HTTP(S) and non-HTTP(S) load balancing and failover,
but only on GCE, and only within a single region.
1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
provide both L7 (HTTP(S)) and L4 load balancing, but only within a
single region, and only on AWS.
1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
programmable DNS service, like
[CloudFlare](http://www.cloudflare.com)) can be used to route
traffic between regions (and between different cloud providers and
on-premise clusters, as it's plain DNS, IP only). Google Cloud DNS
doesn't provide any built-in geo-DNS, latency-based routing, health
checking, weighted round robin or other advanced capabilities.
It's plain old DNS. We would need to build all the aforementioned
on top of it. It can provide internal DNS services (i.e. serve RFC
1918 addresses).
1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
be used to load balance and fail over across regions, and is also
capable of routing to non-AWS endpoints. It provides built-in
geo-DNS, latency-based routing, health checking, weighted
round robin and optional tight integration with some other
AWS services (e.g. Elastic Load Balancers).
1. Kubernetes L4 Service Load Balancing: This provides both a
[virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
and a
[real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
service IP which is load-balanced (currently simple round-robin)
across the healthy pods comprising a service within a single
Kubernetes cluster.
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html):
A generic wrapper around cloud-provided L4 and L7 load balancing services, and
roll-your-own load balancers run in pods, e.g. HA Proxy.
## Cluster Federation API
The Cluster Federation API for load balancing should be compatible with the equivalent
Kubernetes API, to ease porting of clients between Kubernetes and
federations of Kubernetes clusters.
Further details below.
## Common Client Behavior
To be useful, our load balancing solution needs to work properly with real
client applications. There are a few different classes of those...
### Browsers
These are the most common external clients. These are all well-written. See below.
### Well-written clients
1. Do a DNS resolution every time they connect.
1. Don't cache beyond TTL (although a small percentage of the DNS
servers on which they rely might).
1. Do try multiple A records (in order) to connect.
1. (in an ideal world) Do use SRV records rather than hard-coded port numbers.
Examples:
+ all common browsers (except for SRV records)
+ ...
### Dumb clients
1. Don't do a DNS resolution every time they connect (or do cache beyond the
TTL).
1. Do try multiple A records
Examples:
+ ...
### Dumber clients
1. Only do a DNS lookup once on startup.
1. Only try the first returned DNS A record.
Examples:
+ ...
### Dumbest clients
1. Never do a DNS lookup - are pre-configured with a single (or possibly
multiple) fixed server IP(s). Nothing else matters.
## Architecture and Implementation
### General Control Plane Architecture
Each cluster hosts one or more Cluster Federation master components (Federation API
servers, controller managers with leader election, and etcd quorum members). This
is documented in more detail in a separate design doc:
[Kubernetes and Cluster Federation Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).
In the description below, assume that 'n' clusters, named 'cluster-1'...
'cluster-n' have been registered against a Cluster Federation "federation-1",
each with their own set of Kubernetes API endpoints, so
[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n).
### Federated Services
Federated Services are pretty straight-forward. They're comprised of multiple
equivalent underlying Kubernetes Services, each with their own external
endpoint, and a load balancing mechanism across them. Let's work through how
exactly that works in practice.
Our user creates the following Federated Service (against a Federation
API endpoint):

    $ kubectl create -f my-service.yaml --context="federation-1"

where `my-service.yaml` contains the following:

    apiVersion: v1
    kind: Service
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      ports:
      - port: 2379
        protocol: TCP
        targetPort: 2379
        name: client
      - port: 2380
        protocol: TCP
        targetPort: 2380
        name: peer
      selector:
        run: my-service
      type: LoadBalancer

The Cluster Federation control system in turn creates one equivalent service (identical config to the above)
in each of the underlying Kubernetes clusters, each of which results in
something like this:

    $ kubectl get -o yaml --context="cluster-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:25Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "147365"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00002
    spec:
      clusterIP: 10.0.153.185
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - ip: 104.197.117.10

Similar services are created in `cluster-2` and `cluster-3`, each of which is
allocated its own `spec.clusterIP` and `status.loadBalancer.ingress.ip`.
In the Cluster Federation `federation-1`, the resulting federated service looks as follows:

    $ kubectl get -o yaml --context="federation-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:23Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "157333"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00007
    spec:
      clusterIP:
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - hostname: my-service.my-namespace.my-federation.my-domain.com

Note that the federated service:
1. Is API-compatible with a vanilla Kubernetes service.
1. has no clusterIP (as it is cluster-independent)
1. has a federation-wide load balancer hostname
In addition to the set of underlying Kubernetes services (one per cluster)
described above, the Cluster Federation control system has also created a DNS name (e.g. on
[Google Cloud DNS](https://cloud.google.com/dns) or
[AWS Route 53](https://aws.amazon.com/route53/), depending on configuration)
which provides load balancing across all of those services. For example, in a
very basic configuration:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.117.10
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157

Each of the above IP addresses (which are just the external load balancer
ingress IP's of each cluster service) is of course load balanced across the pods
comprising the service in each cluster.
In a more sophisticated configuration (e.g. on GCE or GKE), the Cluster
Federation control system
automatically creates a
[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
which exposes a single, globally load-balanced IP:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 107.194.17.44

Optionally, the Cluster Federation control system also configures the local DNS servers (SkyDNS)
in each Kubernetes cluster to preferentially return the local
clusterIP for the service in that cluster, with other clusters'
external service IP's (or a global load-balanced IP) also configured
for failover purposes:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 10.0.153.185
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN A 104.197.38.157

If Cluster Federation Global Service Health Checking is enabled, multiple service health
checkers running across the federated clusters collaborate to monitor the health
of the service endpoints, and automatically remove unhealthy endpoints from the
DNS record (e.g. a majority quorum is required to vote a service endpoint
unhealthy, to avoid false positives due to individual health checker network
isolation).
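
The majority-quorum rule for health checking can be stated in a few lines. This is a sketch of the voting rule only, assuming each health checker reports a simple boolean; `unhealthyByQuorum` is a hypothetical name and says nothing about how the checkers exchange their votes.

```go
package main

import "fmt"

// unhealthyByQuorum reports whether a strict majority of health checkers
// voted the endpoint unhealthy. Each element of votes is one checker's
// verdict: true means "unhealthy".
func unhealthyByQuorum(votes []bool) bool {
	unhealthy := 0
	for _, v := range votes {
		if v {
			unhealthy++
		}
	}
	return unhealthy*2 > len(votes)
}

func main() {
	// One isolated checker cannot remove an endpoint from DNS on its own.
	fmt.Println(unhealthyByQuorum([]bool{true, false, false})) // false
	// Two out of three checkers agree: the endpoint is removed.
	fmt.Println(unhealthyByQuorum([]bool{true, true, false})) // true
}
```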
### Federated Replication Controllers
So far we have a federated service defined, with a resolvable load balancer
hostname by which clients can reach it, but no pods serving traffic directed
there. So now we need a Federated Replication Controller. These are also fairly
straight-forward, being comprised of multiple underlying Kubernetes Replication
Controllers which do the hard work of keeping the desired number of Pod replicas
alive in each Kubernetes cluster.

    $ kubectl create -f my-service-rc.yaml --context="federation-1"

where `my-service-rc.yaml` contains the following:

    apiVersion: v1
    kind: ReplicationController
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      replicas: 6
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP

The Cluster Federation control system in turn creates one equivalent replication controller
(identical config to the above, except for the replica count) in each
of the underlying Kubernetes clusters, each of which results in
something like this:

    $ kubectl get -o yaml rc my-service --context="cluster-1"

    kind: ReplicationController
    metadata:
      creationTimestamp: 2015-12-02T23:00:47Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
      uid: 86542109-9948-11e5-a38c-42010af00002
    spec:
      replicas: 2
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
            resources: {}
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    status:
      replicas: 2

The exact number of replicas created in each underlying cluster will of course
depend on what scheduling policy is in force. In the above example, the
scheduler created an equal number of replicas (2) in each of the three
underlying clusters, to make up the total of 6 replicas required. To handle
entire cluster failures, various approaches are possible, including:
1. **simple overprovisioning**, such that sufficient replicas remain even if a
cluster fails. This wastes some resources, but is simple and reliable.
2. **pod autoscaling**, where the replication controller in each
cluster automatically and autonomously increases the number of
replicas in its cluster in response to the additional traffic
diverted from the failed cluster. This saves resources and is relatively
simple, but there is some delay in the autoscaling.
3. **federated replica migration**, where the Cluster Federation
control system detects the cluster failure and automatically
increases the replica count in the remaining clusters to make up
for the lost replicas in the failed cluster. This does not seem to
offer any benefits relative to pod autoscaling above, and is
arguably more complex to implement, but we note it here as a
possibility.
### Implementation Details
The implementation approach and architecture is very similar to Kubernetes, so
if you're familiar with how Kubernetes works, none of what follows will be
surprising. One additional design driver not present in Kubernetes is that
the Cluster Federation control system aims to be resilient to individual cluster and availability zone
failures. So the control plane spans multiple clusters. More specifically:
+ Cluster Federation runs its own distinct set of API servers (typically one
or more per underlying Kubernetes cluster). These are completely
distinct from the Kubernetes API servers for each of the underlying
clusters.
+ Cluster Federation runs its own distinct quorum-based metadata store (etcd,
by default). Approximately 1 quorum member runs in each underlying
cluster ("approximately" because we aim for an odd number of quorum
members, and typically don't want more than 5 quorum members, even
if we have a larger number of federated clusters, so 2 clusters->3
quorum members, 3->3, 4->3, 5->5, 6->5, 7->5 etc).
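
The cluster-to-quorum-size mapping listed above fits in a one-line rule. The function name `quorumMembers` is hypothetical, and the result for fewer than two clusters is an assumption, since the text only gives values from two clusters upward.

```go
package main

import "fmt"

// quorumMembers returns the number of etcd quorum members for a federation
// of the given size: an odd number, at least 3 and at most 5.
// For fewer than 2 clusters the source gives no value; this sketch returns 3.
func quorumMembers(clusters int) int {
	if clusters >= 5 {
		return 5
	}
	return 3
}

func main() {
	for n := 2; n <= 7; n++ {
		fmt.Printf("%d clusters -> %d quorum members\n", n, quorumMembers(n))
	}
}
```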
Cluster Controllers in the Federation control system watch the
Federation API server/etcd
state, and apply changes to the underlying Kubernetes clusters accordingly. They
also provide the anti-entropy mechanism for reconciling Cluster Federation "desired desired"
state against Kubernetes "actual desired" state.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-services.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federated-services.md)

View File

@ -1,407 +1 @@
# Ubernetes Design Spec (phase one)
**Huawei PaaS Team**
## INTRODUCTION
In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For background of
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is organized as follows. First we briefly list scenarios
and use cases that motivate K8S federation work. These use cases drive
the design and they also verify the design. We summarize the
functionality requirements from these use cases, and define the “in
scope” functionalities that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks. We also go through several activity flows to
see how these building blocks work together to support use cases.
## REQUIREMENTS
There are many reasons why customers may want to build a K8S
federation:
+ **High Availability:** Customers want to be immune to the outage of
a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
primary cluster. But if the capacity of the cluster is not
sufficient, workloads should be automatically distributed to other
clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
workloads on different cloud providers, and can easily increase or
decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only support
a limited size. While the community is actively improving it, it can
be expected that cluster size will be a problem if K8S is used for
large workloads or public PaaS infrastructure. While we can separate
different tenants to different clusters, it would be good to have a
unified view.
Here are the functionality requirements derived from the above use cases:
+ Clients of the federation control plane API server can register and deregister
clusters.
+ Workloads should be spread to different clusters according to the
workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
clusters (in cases where inter-cluster networking is necessary,
desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
similar to load balancing, although it might not be strictly
speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
the workloads to other clusters.
+ Clients have a unified view and a central control point for the above
activities.
## SCOPE
It's difficult to arrive at a perfect design in one step that implements
all the above requirements. Therefore we will take an iterative
approach to designing and building the system. This document describes
phase one of the whole work. In phase one we will cover only the
following objectives:
+ Define the basic building blocks and API objects of control plane
+ Implement a basic end-to-end workflow
  + Clients register federated clusters
  + Clients submit a workload
  + The workload is distributed to different clusters
  + Service discovery
  + Load balancing
The following parts are NOT covered in phase one:
+ Authentication and authorization (other than basic client
authentication against the Ubernetes API, and from the Ubernetes control
plane to the underlying Kubernetes clusters).
+ Deployment units other than replication controller and service
+ Complex distribution policy of workloads
+ Service affinity and migration
## ARCHITECTURE
The overall architecture of the control plane is shown below:
![Ubernetes Architecture](ubernetes-design.png)
Some design principles we follow in this architecture:
1. Keep the underlying K8S clusters independent. They should have no
knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much as
possible.
1. Re-use concepts from K8S as much as possible. This reduces
customers' learning curve and is good for adoption.

Below is a brief description of each module contained in the above diagram.
## Ubernetes API Server
The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects. This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters. We still use `etcd` as the distributed
storage so customers don't need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.
## Ubernetes Scheduler
The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters. For example it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work. For each unscheduled replication controller, it calls the policy
engine to decide how to split workloads among clusters. It creates a
Kubernetes Replication Controller for one or more underlying clusters,
and posts them back to `etcd` storage.
One subtlety worth noting here is that the scheduling decision is arrived at by
combining the application-specific request from the user (which might
include, for example, placement constraints), and the global policy specified
by the federation administrator (for example, "prefer on-premise
clusters over AWS clusters" or "spread load equally across clusters").
## Ubernetes Cluster Controller
The cluster controller
performs the following two kinds of work:
1. It watches all the sub-resources that are created by Ubernetes
components, like a sub-RC or a sub-service, and then creates the
corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
underlying K8S cluster, and updates them as object status of the
`cluster` API object. An alternative design might be to run a pod
in each underlying cluster that reports metrics for that cluster to
the Ubernetes control plane. Which approach is better remains an
open topic of discussion.
## Ubernetes Service Controller
The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on
the control plane and creates corresponding K8S services on each involved
K8S cluster. Besides interacting with service resources on each
individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.
## API OBJECTS
## Cluster
Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. Clients register or deregister a cluster by
sending the corresponding REST requests to the following URL:
`/api/{$version}/clusters`. Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains necessary properties like the K8S cluster address and
credentials. The status of a cluster API object will contain
the following information:
1. The current phase of its lifecycle.
1. Cluster resource metrics for scheduling decisions.
1. Other metadata like the version of the cluster.
$version.clusterSpec
<table style="border:1px solid #000000;border-collapse:collapse;">
<tbody>
<tr>
<td style="padding:5px;"><b>Name</b><br>
</td>
<td style="padding:5px;"><b>Description</b><br>
</td>
<td style="padding:5px;"><b>Required</b><br>
</td>
<td style="padding:5px;"><b>Schema</b><br>
</td>
<td style="padding:5px;"><b>Default</b><br>
</td>
</tr>
<tr>
<td style="padding:5px;">Address<br>
</td>
<td style="padding:5px;">address of the cluster<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">address<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
<tr>
<td style="padding:5px;">Credential<br>
</td>
<td style="padding:5px;">the type (e.g. bearer token, client
certificate etc) and data of the credential used to access the cluster. It's used for system routines (not on behalf of users)<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">string <br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
</tbody>
</table>
$version.clusterStatus
<table style="border:1px solid #000000;border-collapse:collapse;">
<tbody>
<tr>
<td style="padding:5px;"><b>Name</b><br>
</td>
<td style="padding:5px;"><b>Description</b><br>
</td>
<td style="padding:5px;"><b>Required</b><br>
</td>
<td style="padding:5px;"><b>Schema</b><br>
</td>
<td style="padding:5px;"><b>Default</b><br>
</td>
</tr>
<tr>
<td style="padding:5px;">Phase<br>
</td>
<td style="padding:5px;">the most recently observed lifecycle phase of the cluster<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">enum<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
<tr>
<td style="padding:5px;">Capacity<br>
</td>
<td style="padding:5px;">represents the available resources of a cluster<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">any<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
<tr>
<td style="padding:5px;">ClusterMeta<br>
</td>
<td style="padding:5px;">Other cluster metadata like the version<br>
</td>
<td style="padding:5px;">yes<br>
</td>
<td style="padding:5px;">ClusterMeta<br>
</td>
<td style="padding:5px;"><p></p></td>
</tr>
</tbody>
</table>
**For simplicity we didn't introduce a separate “cluster metrics” API
object here**. The cluster resource metrics are stored in the cluster
status section, just as we do for nodes in K8S. In phase one it
only contains available CPU and memory resources. The
cluster controller will periodically poll the underlying cluster API
Server to get the cluster capacity. In phase one it gets the metrics by
simply aggregating metrics from all nodes. In the future we will improve
this with more efficient approaches such as leveraging Heapster, and more
metrics will be supported. Similar to node phases in K8S, the “phase”
field includes the following values:
+ pending: newly registered clusters or clusters suspended by admin
for various reasons. They are not eligible for accepting workloads
+ running: clusters in normal status that can accept workloads
+ offline: clusters temporarily down or not reachable
+ terminated: clusters removed from federation
Below is the state transition diagram.
![Cluster State Transition Diagram](ubernetes-cluster-state.png)
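As a rough illustration, the four phases listed above could be modeled as a Go string enum. The type name, constant names and the `AcceptsWorkloads` helper are assumptions made for this sketch and are not part of the proposed API. The sketch also assumes that offline and terminated clusters, like pending ones, are not eligible for workloads. Transitions between phases are left to the diagram and are not modeled here.

```go
package main

import "fmt"

// ClusterPhase is a hypothetical type for the lifecycle phase of a cluster.
type ClusterPhase string

const (
	ClusterPending    ClusterPhase = "pending"    // newly registered, or suspended by admin
	ClusterRunning    ClusterPhase = "running"    // normal status, can accept workloads
	ClusterOffline    ClusterPhase = "offline"    // temporarily down or not reachable
	ClusterTerminated ClusterPhase = "terminated" // removed from federation
)

// AcceptsWorkloads reports whether a cluster in phase p is eligible for
// scheduling. In this sketch only running clusters are.
func AcceptsWorkloads(p ClusterPhase) bool {
	return p == ClusterRunning
}

func main() {
	phases := []ClusterPhase{ClusterPending, ClusterRunning, ClusterOffline, ClusterTerminated}
	for _, p := range phases {
		fmt.Printf("%-10s accepts workloads: %v\n", p, AcceptsWorkloads(p))
	}
}
```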
## Replication Controller
A global workload submitted to the control plane is represented as a
replication controller in the Cluster Federation control plane. When a replication controller
is submitted to the control plane, clients need a way to express their
requirements or preferences on clusters. Depending on the use
case, these may be complex. For example:
+ This workload can only be scheduled to cluster Foo. It cannot be
scheduled to any other clusters (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
capacity on cluster Foo, it's OK for it to be scheduled to cluster Bar
(use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
and thirty percent should be scheduled to cluster Bar (use case:
vendor lock-in avoidance).

In phase one, we only introduce a
_clusterSelector_ field to filter acceptable clusters. By default
there is no such selector, which means any cluster is
acceptable.
Below is a sample of the YAML to create such a replication controller.
```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
      clusterSelector:
        name in (Foo, Bar)
```
Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed across these acceptable clusters in phase one. After
phase one we will define syntax to represent more advanced
constraints, like cluster preference ordering, desired number of
split workloads, desired ratio of workloads spread across different
clusters, etc.
Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example, it may define a
“nodeSelector” which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.
## Federated Services
The Service API object exposed by the Cluster Federation is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
federation service controller will create corresponding Kubernetes
service objects on underlying clusters. These are detailed in a
separate design document: [Federated Services](federated-services.md).
## Pod
In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Cluster Federation API compatible with the Kubernetes API.
## ACTIVITY FLOWS
## Scheduling
The diagram below shows how workloads are scheduled on the Cluster Federation control
plane:
1. A replication controller is created by the client.
1. The API Server persists it into the storage.
1. The cluster controller periodically polls the latest available resource
metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up the RC, makes
policy-driven decisions and splits it into different sub RCs.
1. Each cluster controller watches the sub RCs bound to its
corresponding cluster. It picks up the newly created sub RC.
1. The cluster controller issues requests to the underlying cluster
API Server to create the RC.

In phase one we don't support complex
distribution policies. The scheduling rule is basically:
1. If an RC does not specify any nodeSelector, it will be scheduled
to the least loaded K8S cluster(s) that have enough available
resources.
1. If an RC specifies _N_ acceptable clusters in the
clusterSelector, all replicas will be evenly distributed among
these clusters.
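
The even-distribution rule for an RC that lists _N_ acceptable clusters can be sketched as follows. This is only an illustration: the function name `splitReplicas` is hypothetical, and the tie-break used when the replica count is not divisible by _N_ (earlier clusters in the list receive one extra replica) is an assumption of this sketch, since the design does not specify it.

```go
package main

import "fmt"

// splitReplicas divides total replicas as evenly as possible among the
// acceptable clusters. When total is not divisible by len(clusters), the
// first (total % len(clusters)) clusters receive one extra replica.
func splitReplicas(total int, clusters []string) map[string]int {
	out := make(map[string]int, len(clusters))
	if len(clusters) == 0 {
		return out
	}
	base, extra := total/len(clusters), total%len(clusters)
	for i, c := range clusters {
		out[c] = base
		if i < extra {
			out[c]++
		}
	}
	return out
}

func main() {
	// Mirrors the sample RC: 5 replicas, clusterSelector "name in (Foo, Bar)".
	fmt.Println(splitReplicas(5, []string{"Foo", "Bar"}))
}
```

With the sample values, cluster Foo would receive three replicas and cluster Bar two.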
There is a potential race condition here. Say at time _T1_ the control
plane learns there are _m_ available resources in a K8S cluster. As
the cluster is working independently it still accepts workload
requests from other K8S clients or even another Cluster Federation control
plane. The Cluster Federation scheduling decision is based on this data of
available resources. However when the actual RC creation happens on
the cluster at time _T2_, the cluster may not have enough resources
at that time. We will address this problem in later phases with some
proposed solutions like resource reservation mechanisms.
![Federated Scheduling](ubernetes-scheduling.png)
## Service Discovery
This part is covered in the section “Federated Service” of the
document
“[Federated Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”.
Please refer to that document for details.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation-phase-1.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/federation-phase-1.md)

View File

@ -1,236 +1 @@
# Automated HA master deployment
**Author:** filipg@, jsz@
# Introduction
We want to allow users to easily replicate Kubernetes masters to get a highly available cluster,
initially using `kube-up.sh` and `kube-down.sh`.
This document describes the technical design of this feature. It assumes that we are using the aforementioned
scripts for cluster deployment. All of the ideas described in the following sections should be easy
to implement on GCE, AWS and other cloud providers.
It is a non-goal to design a specific setup for bare-metal environments, which
might be very different.
# Overview
In a cluster with a replicated master, we will have N VMs, each running the regular master components
such as apiserver, etcd, scheduler and controller manager. These components will interact in the
following way:
* All etcd replicas will be clustered together and will use master election
and a quorum mechanism to agree on the state. All of these mechanisms are integral
parts of etcd and we will only have to configure them properly.
* All apiserver replicas will work independently, talking to an etcd on
127.0.0.1 (i.e. the local etcd replica), which if needed will forward requests to the current etcd master
(as explained [here](https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html)).
* We will introduce provider-specific solutions to load balance traffic between master replicas
(see section `load balancing`).
* Controller manager, scheduler & cluster autoscaler will use a lease mechanism and
only a single instance will be the active master. All others will wait in standby mode.
* All add-on managers will work independently and each of them will try to keep add-ons in sync.
# Detailed design
## Components
### etcd
```
Note: This design for etcd clustering is quite pet-set like - each etcd
replica has its name which is explicitly used in etcd configuration etc. In
medium-term future we would like to have the ability to run masters as part of
autoscaling-group (AWS) or managed-instance-group (GCE) and add/remove replicas
automatically. This is pretty tricky and this design does not cover this.
It will be covered in a separate doc.
```
All etcd instances will be clustered together and one of them will be the elected master.
In order to commit any change, a quorum of the cluster will have to confirm it. Etcd will be
configured in such a way that all writes and reads go through the master (requests
will be forwarded by the local etcd server so that this is invisible to the user). This will
affect latency for all operations, but latency should not increase by much more than the network
latency between master replicas (latency between GCE zones within a region is < 10ms).
Currently etcd exposes its port only on the localhost interface. In order to allow clustering
and inter-VM communication we will also have to use the public interface. To secure the
communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)).
When generating the command line for etcd we will always assume it's part of a cluster
(initially of size 1) and list all existing Kubernetes master replicas.
Based on that, we will set the following flags:
* `-initial-cluster` - list of all hostnames/DNS names for master replicas (including the new one)
* `-initial-cluster-state` (keep in mind that we are adding master replicas one by one):
  * `new` if we are adding the first replica, i.e. the list of existing master replicas is empty
  * `existing` if there is more than one replica, i.e. the list of existing master replicas is non-empty.

This will allow us to have exactly the same logic for HA and non-HA masters. The list of DNS names for VMs
with master replicas will be generated in the `kube-up.sh` script and passed as an env variable
`INITIAL_ETCD_CLUSTER`.
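
The rule for choosing between `new` and `existing` can be sketched in a few lines of Go. This is a minimal illustration, not the logic in `kube-up.sh`: the `etcdBootstrap` type and `etcdBootstrapFor` function are hypothetical names, and the sketch deliberately returns the member list as a slice rather than guessing at the exact string format etcd expects for the flag value.

```go
package main

import "fmt"

// etcdBootstrap holds the values that would feed the two etcd flags
// discussed above. Both the type and its field names are hypothetical.
type etcdBootstrap struct {
	InitialCluster      []string // all master replicas, including the new one
	InitialClusterState string   // "new" or "existing"
}

// etcdBootstrapFor derives the flag values when adding newReplica to a
// cluster whose current master replicas are listed in existing.
func etcdBootstrapFor(existing []string, newReplica string) etcdBootstrap {
	state := "new" // first replica: the list of existing replicas is empty
	if len(existing) > 0 {
		state = "existing"
	}
	// Copy before appending so the caller's slice is never modified.
	members := append(append([]string{}, existing...), newReplica)
	return etcdBootstrap{InitialCluster: members, InitialClusterState: state}
}

func main() {
	fmt.Printf("%+v\n", etcdBootstrapFor(nil, "master-a"))
	fmt.Printf("%+v\n", etcdBootstrapFor([]string{"master-a"}, "master-b"))
}
```

Because the same function handles the empty and non-empty cases, a single-master deployment is just the first call in `main`, which matches the goal of sharing logic between HA and non-HA setups.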
### apiservers
All apiservers will work independently. They will contact etcd on 127.0.0.1, i.e. they will always contact
etcd replica running on the same VM. If needed, such requests will be forwarded by etcd server to the
etcd leader. This functionality is completely hidden from the client (apiserver
in our case).
The caching mechanism implemented in apiserver will not be affected by
replicating the master because:
* GET requests go directly to etcd
* LIST requests go either directly to etcd or to a cache populated via watch
(depending on the ResourceVersion in ListOptions). In the second scenario,
after a PUT/POST request, changes might not be visible in the LIST response.
This is however no worse than it is with the current single master.
* WATCH does not give any guarantees about when a change will be delivered.
#### load balancing
With multiple apiservers we need a way to load balance traffic to/from master replicas. As different cloud
providers have different capabilities and limitations, we will not try to find a lowest common
denominator that will work everywhere. Instead we will document various options and apply different
solutions for different deployments. Below we list possible approaches:
1. `Managed DNS` - the user needs to specify a domain name during cluster creation. DNS entries will be managed
automatically by the deployment tool, which will be integrated with solutions like Route53 (AWS)
or Google Cloud DNS (GCP). For load balancing we will have two options:
   1.1. create an L4 load balancer in front of all apiservers and update the DNS name appropriately
   1.2. use the round-robin DNS technique to access all apiservers directly
2. `Unmanaged DNS` - this is very similar to `Managed DNS`, with the exception that DNS entries
will be manually managed by the user. We will provide detailed documentation for the entries we
expect.
3. [GCP only] `Promote master IP` - in GCP, when we create the first master replica, we generate a static
external IP address that is later assigned to the master VM. When creating additional replicas we
will create a load balancer in front of them and reassign the aforementioned IP to point to the load balancer
instead of a single master. When removing the second-to-last replica we will reverse this operation (assign
the IP address to the remaining master VM and delete the load balancer). That way the user will not have to provide
a domain name and all client configurations will keep working.

This will also impact `kubelet <-> master` communication, which should go through the load
balancer as well. We will configure kubelet according to the chosen
method.
#### `kubernetes` service
Kubernetes maintains a special service called `kubernetes`. Currently it keeps a
list of IP addresses for all apiservers. As it uses a command line flag
`--apiserver-count` it is not very dynamic and would require restarting all
masters to change the number of master replicas.
To allow dynamic changes to the number of apiservers in the cluster, we will
introduce a `ConfigMap` in the `kube-system` namespace that will keep an expiration
time for each apiserver (keyed by IP). Each apiserver will do three things:
1. periodically update the expiration time for its own IP address
2. remove all the stale IP addresses from the endpoints list
3. add its own IP address if it's not on the list yet.
That way we will not only solve the problem of dynamically changing number
of apiservers in the cluster, but also the problem of non-responsive apiservers
that should be removed from the `kubernetes` service endpoints list.
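
One pass of the three steps above might look like the following sketch. It is illustrative only: a plain map stands in for the `ConfigMap`, times are Unix seconds, and `reconcileEndpoints` is a hypothetical name. Treating an endpoint with no recorded expiration as stale is also an assumption of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// reconcileEndpoints runs one pass of the three steps for a single apiserver.
// leases maps apiserver IP -> expiration time (Unix seconds) and is updated
// in place; endpoints is the current `kubernetes` service endpoints list.
// The returned list is sorted so the result is deterministic.
func reconcileEndpoints(leases map[string]int64, endpoints []string, ownIP string, now, ttl int64) []string {
	// 1. Refresh the expiration time for our own IP address.
	leases[ownIP] = now + ttl

	// 2. Remove stale IP addresses (missing or expired lease).
	live := []string{}
	hasOwn := false
	for _, ip := range endpoints {
		exp, ok := leases[ip]
		if !ok || exp <= now {
			delete(leases, ip)
			continue
		}
		if ip == ownIP {
			hasOwn = true
		}
		live = append(live, ip)
	}

	// 3. Add our own IP address if it is not on the list yet.
	if !hasOwn {
		live = append(live, ownIP)
	}
	sort.Strings(live)
	return live
}

func main() {
	leases := map[string]int64{
		"10.0.0.1": 90,  // expired before now=100
		"10.0.0.2": 130, // still valid
	}
	fmt.Println(reconcileEndpoints(leases, []string{"10.0.0.1", "10.0.0.2"}, "10.0.0.3", 100, 30))
}
```

A real implementation would also have to handle concurrent updates from several apiservers, which this single-process sketch ignores.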
#### Certificates
Certificate generation will work as today. In particular, on GCE, we will
generate it for the public IP used to access the cluster (see `load balancing`
section) and local IP of the master replica VM.
That means that with multiple master replicas and a load balancer in front
of them, accessing one of the replicas directly (using its ephemeral public
IP) will not work on GCE without appropriate flags:
- `kubectl --insecure-skip-tls-verify=true`
- `curl --insecure`
- `wget --no-check-certificate`
For other deployment tools and providers the details of certificate generation
may be different, but it must be possible to access the cluster by using either
the main cluster endpoint (DNS name or IP address) or internal service called
`kubernetes` that points directly to the apiservers.
### controller manager, scheduler & cluster autoscaler
Controller manager and scheduler will by default use a lease mechanism to choose an active instance
among all masters. Only one instance will be performing any operations.
All others will wait in standby mode.
We will use the same configuration in non-replicated mode to simplify deployment scripts.
### add-on manager
All add-on managers will work independently. Each of them will observe the current state of
add-ons and will try to sync it with files on disk. As a result, due to races, a single add-on
can be updated multiple times in a row after upgrading the master. Long-term we should fix this
by using a mechanism similar to that of the controller manager or scheduler. However, currently the add-on
manager is just a bash script and adding a master election mechanism would not be easy.
## Adding replica
Command to add a new replica on GCE using the kube-up script:
```
KUBE_REPLICATE_EXISTING_MASTER=true KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-up.sh
```
Pseudo-code for adding a new master replica using managed DNS and a load balancer:
```
1. If there is no load balancer for this cluster:
   1. Create load balancer using ephemeral IP address
   2. Add existing apiserver to the load balancer
   3. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
   4. Update DNS to point to the load balancer.
2. Clone existing master (create a new VM with the same configuration) including
   all env variables (certificates, IP ranges etc), with the exception of
   `INITIAL_ETCD_CLUSTER`.
3. SSH to an existing master and run the following command to extend etcd cluster
   with the new instance:
   `curl <existing_master>:4001/v2/members -XPOST -H "Content-Type: application/json" -d '{"peerURLs":["http://<new_master>:2380"]}'`
4. Add IP address of the new apiserver to the load balancer.
```
A simplified algorithm for adding a new master replica and promoting the master IP to the load balancer
is identical to the one used with DNS, except for the steps that set up the load balancer:
```
1. If there is no load balancer for this cluster:
   1. Unassign IP from the existing master replica
   2. Create load balancer using static IP reclaimed in the previous step
   3. Add existing apiserver to the load balancer
   4. Wait until load balancer is working, i.e. all data is propagated, in GCE up to 20 min (sic!)
...
```
## Deleting replica
Command to delete one replica on GCE using the kube-down script:
```
KUBE_DELETE_NODES=false KUBE_GCE_ZONE=us-central1-b kubernetes/cluster/kube-down.sh
```
Pseudo-code for deleting an existing master replica:
```
1. Remove replica IP address from the load balancer or DNS configuration
2. SSH to one of the remaining masters and run the following command to remove replica from the cluster:
`curl etcd-0:4001/v2/members/<id> -XDELETE -L`
3. Delete replica VM
4. If load balancer has only a single target instance, then delete load balancer
5. Update DNS to point to the remaining master replica, or [on GCE] assign static IP back to the master VM.
```
## Upgrades
Upgrading replicated masters will be possible by upgrading them one by one using existing tools
(e.g. upgrade.sh for GCE). This will work out of the box because:
* Requests from nodes will be correctly served by either a new or an old master because apiserver is backward compatible.
* Requests from the scheduler (and controllers) go to a local apiserver via the localhost interface, so both components
will be at the same version.
* Apiserver talks only to a local etcd replica, which will be at a compatible version.
* We assume we will introduce this setup after we upgrade to etcd v3, so we don't need to cover upgrading the database.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/ha_master.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/ha_master.md)

View File

@ -1,263 +1 @@
<h2>Warning! This document might be outdated.</h2>
# Horizontal Pod Autoscaling
## Preface
This document briefly describes the design of the horizontal autoscaler for
pods. The autoscaler (implemented as a Kubernetes API resource and controller)
is responsible for dynamically controlling the number of replicas of some
collection (e.g. the pods of a ReplicationController) to meet some objective(s),
for example a target per-pod CPU utilization.
This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.0/docs/proposals/autoscaling.md).
## Overview
The resource usage of a serving application usually varies over time: sometimes
the demand for the application rises, and sometimes it drops. In Kubernetes
version 1.0, a user can only manually set the number of serving pods. Our aim is
to provide a mechanism for the automatic adjustment of the number of pods based
on CPU utilization statistics (a future version will allow autoscaling based on
other resources/metrics).
## Scale Subresource
In Kubernetes version 1.1, we are introducing the Scale subresource and implementing
horizontal autoscaling of pods based on it. The Scale subresource is supported for
replication controllers and deployments. It is a Virtual Resource
(it does not correspond to an object stored in etcd). It is only present in the API
as an interface that a controller (in this case the HorizontalPodAutoscaler) can
use to dynamically scale the number of replicas controlled by some other API
object (currently ReplicationController and Deployment) and to learn the current
number of replicas. Scale is a subresource of the API object that it serves as
the interface for. The Scale subresource is useful because whenever we introduce
another type we want to autoscale, we just need to implement the Scale
subresource for it. The wider discussion regarding Scale took place in issue
[#1629](https://github.com/kubernetes/kubernetes/issues/1629).
The Scale subresource is exposed in the API for a replication controller or deployment under the
following paths:
`apis/extensions/v1beta1/replicationcontrollers/myrc/scale`
`apis/extensions/v1beta1/deployments/mydeployment/scale`
It has the following structure:
```go
// represents a scaling request for a resource.
type Scale struct {
	unversioned.TypeMeta
	api.ObjectMeta
	// defines the behavior of the scale.
	Spec ScaleSpec
	// current status of the scale.
	Status ScaleStatus
}

// describes the attributes of a scale subresource
type ScaleSpec struct {
	// desired number of instances for the scaled object.
	Replicas int `json:"replicas,omitempty"`
}

// represents the current status of a scale subresource.
type ScaleStatus struct {
	// actual number of observed instances of the scaled object.
	Replicas int `json:"replicas"`
	// label query over pods that should match the replicas count.
	Selector map[string]string `json:"selector,omitempty"`
}
```
Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment
associated with the given Scale subresource. `ScaleStatus.Replicas` reports how
many pods are currently running in the replication controller/deployment, and
`ScaleStatus.Selector` returns the selector for the pods.
## HorizontalPodAutoscaler Object
In Kubernetes version 1.1, we are introducing the HorizontalPodAutoscaler object. It
is accessible under:
`apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler`
It has the following structure:
```go
// configuration of a horizontal pod autoscaler.
type HorizontalPodAutoscaler struct {
	unversioned.TypeMeta
	api.ObjectMeta
	// behavior of autoscaler.
	Spec HorizontalPodAutoscalerSpec
	// current information about the autoscaler.
	Status HorizontalPodAutoscalerStatus
}

// specification of a horizontal pod autoscaler.
type HorizontalPodAutoscalerSpec struct {
	// reference to Scale subresource; horizontal pod autoscaler will learn the current resource
	// consumption from its status, and will set the desired number of pods by modifying its spec.
	ScaleRef SubresourceReference
	// lower limit for the number of pods that can be set by the autoscaler, default 1.
	MinReplicas *int
	// upper limit for the number of pods that can be set by the autoscaler.
	// It cannot be smaller than MinReplicas.
	MaxReplicas int
	// target average CPU utilization (represented as a percentage of requested CPU) over all the pods;
	// if not specified it defaults to the target CPU utilization at 80% of the requested resources.
	CPUUtilization *CPUTargetUtilization
}

type CPUTargetUtilization struct {
	// fraction of the requested CPU that should be utilized/used,
	// e.g. 70 means that 70% of the requested CPU should be in use.
	TargetPercentage int
}

// current status of a horizontal pod autoscaler
type HorizontalPodAutoscalerStatus struct {
	// most recent generation observed by this autoscaler.
	ObservedGeneration *int64
	// last time the HorizontalPodAutoscaler scaled the number of pods;
	// used by the autoscaler to control how often the number of pods is changed.
	LastScaleTime *unversioned.Time
	// current number of replicas of pods managed by this autoscaler.
	CurrentReplicas int
	// desired number of replicas of pods managed by this autoscaler.
	DesiredReplicas int
	// current average CPU utilization over all pods, represented as a percentage of requested CPU,
	// e.g. 70 means that an average pod is using now 70% of its requested CPU.
	CurrentCPUUtilizationPercentage *int
}
```
`ScaleRef` is a reference to the Scale subresource.
`MinReplicas`, `MaxReplicas` and `CPUUtilization` define autoscaler
configuration. We are also introducing the HorizontalPodAutoscalerList object to
enable listing all autoscalers in a namespace:
```go
// list of horizontal pod autoscaler objects.
type HorizontalPodAutoscalerList struct {
	unversioned.TypeMeta
	unversioned.ListMeta
	// list of horizontal pod autoscaler objects.
	Items []HorizontalPodAutoscaler
}
```
## Autoscaling Algorithm
The autoscaler is implemented as a control loop. It periodically queries the pods
described by `Status.Selector` of the Scale subresource, and collects their CPU
utilization. Then, it compares the arithmetic mean of the pods' CPU utilization
with the target defined in `Spec.CPUUtilization`, and adjusts the replicas of
the Scale if needed to match the target (preserving the condition: MinReplicas <=
Replicas <= MaxReplicas).
The period of the autoscaler is controlled by the
`--horizontal-pod-autoscaler-sync-period` flag of the controller manager. The
default value is 30 seconds.
CPU utilization is the recent CPU usage of a pod (average across the last 1
minute) divided by the CPU requested by the pod. In Kubernetes version 1.1, CPU
usage is taken directly from Heapster. In the future, there will be an API on the
master for this purpose (see issue [#11951](https://github.com/kubernetes/kubernetes/issues/11951)).
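As a rough illustration of the definition above, the utilization of a single pod could be computed as follows. The function name and the choice of millicores as the unit are assumptions of this sketch, not part of the proposed API.

```go
package main

import "fmt"

// cpuUtilizationPercent returns recent CPU usage as a percentage of the
// requested CPU. Both arguments are in millicores (an assumption of this
// sketch). The second result is false when no CPU is requested, because the
// ratio is undefined in that case.
func cpuUtilizationPercent(usageMilli, requestMilli int64) (int, bool) {
	if requestMilli <= 0 {
		return 0, false
	}
	return int(usageMilli * 100 / requestMilli), true
}

func main() {
	// A pod that requested 500m and used 350m on average over the last minute.
	pct, ok := cpuUtilizationPercent(350, 500)
	fmt.Println(pct, ok) // 70 true
}
```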
The target number of pods is calculated from the following formula:
```
TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
```
Starting and stopping pods may introduce noise to the metric (for instance,
starting may temporarily increase CPU). So, after each action, the autoscaler
should wait some time for reliable data. Scale-up can only happen if there was
no rescaling within the last 3 minutes. Scale-down will wait for 5 minutes from
the last rescaling. Moreover, any scaling will only be made if
`avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1
(10% tolerance). This approach has two benefits:
* Autoscaler works in a conservative way. If new user load appears, it is
important for us to rapidly increase the number of pods, so that user requests
will not be rejected. Lowering the number of pods is not that urgent.
* Autoscaler avoids thrashing, i.e. it prevents rapid execution of conflicting
decisions if the load is not stable.
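The formula, the 10% tolerance, and the two waiting periods can be put together in a short sketch. This is only an illustration of the rules described in this section: the function and constant names are hypothetical, and the real controller may differ in details such as how it handles missing metrics.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Waiting periods after the last rescaling, as described in the text.
const (
	scaleUpWait   = 3 * time.Minute
	scaleDownWait = 5 * time.Minute
)

// desiredReplicas applies TargetNumOfPods = ceil(sum(utilization) / target),
// skips the change when avg/target is within the 10% tolerance, and keeps the
// result within [minReplicas, maxReplicas].
func desiredReplicas(utilization []float64, target float64, current, minReplicas, maxReplicas int) int {
	if len(utilization) == 0 || target <= 0 {
		return current // no reliable data: leave the replica count alone
	}
	sum := 0.0
	for _, u := range utilization {
		sum += u
	}
	ratio := sum / float64(len(utilization)) / target
	if ratio >= 0.9 && ratio <= 1.1 {
		return current // within tolerance, avoid thrashing
	}
	n := int(math.Ceil(sum / target))
	if n < minReplicas {
		n = minReplicas
	}
	if n > maxReplicas {
		n = maxReplicas
	}
	return n
}

// cooldownElapsed reports whether enough time has passed since the last
// rescaling for a scale-up or a scale-down to be allowed.
func cooldownElapsed(sinceLastScale time.Duration, scaleUp bool) bool {
	if scaleUp {
		return sinceLastScale >= scaleUpWait
	}
	return sinceLastScale >= scaleDownWait
}

func main() {
	// Three pods at 90% of their request with a 50% target.
	fmt.Println(desiredReplicas([]float64{90, 90, 90}, 50, 3, 1, 10)) // 6
	fmt.Println(cooldownElapsed(4*time.Minute, true))                 // true
	fmt.Println(cooldownElapsed(4*time.Minute, false))                // false
}
```

Four minutes after the last rescaling, a scale-up is allowed but a scale-down is not, which reflects the asymmetry motivated in the first bullet above.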
## Relative vs. absolute metrics
We chose values of the target metric to be relative (e.g. 90% of requested CPU
resource) rather than absolute (e.g. 0.6 core) for the following reason. If we
chose an absolute metric, the user would need to guarantee that the target is
lower than the request. Otherwise, overloaded pods may not be able to consume more
than the autoscaler's absolute target utilization, thereby preventing the
autoscaler from seeing high enough utilization to trigger it to scale up. This
may be especially troublesome when a user changes requested resources for a pod
because they would need to also change the autoscaler utilization threshold.
Therefore, we decided to choose a relative metric. For the user, it is enough to
set it to a value smaller than 100%, and further changes of requested resources
will not invalidate it.
## Support in kubectl
To make manipulating HorizontalPodAutoscaler objects simpler, we added support
for creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl. In
addition, we are planning to add kubectl support for the following use-cases in
the future:
* When creating a replication controller or deployment with
`kubectl create [-f]`, there should be a possibility to specify an additional
autoscaler object. (This should work out-of-the-box when creation of autoscaler
is supported by kubectl as we may include multiple objects in the same config
file).
* *[future]* When running an image with `kubectl run`, there should be an
additional option to create an autoscaler for it.
* *[future]* We will add a new command `kubectl autoscale` that will allow for
easy creation of an autoscaler object for an already existing replication
controller/deployment.
## Next steps
We list here some features that are not supported in Kubernetes version 1.1.
However, we want to keep them in mind, as they will most probably be needed in
the future.
Our design is in general compatible with them.
* *[future]* **Autoscale pods based on metrics other than CPU** (e.g.
memory, network traffic, qps). This includes scaling based on a custom/application metric.
* *[future]* **Autoscale pods based on an aggregate metric.** Autoscaler,
instead of computing average for a target metric across pods, will use a single,
external, metric (e.g. qps metric from load balancer). The metric will be
aggregated while the target will remain per-pod (e.g. when observing 100 qps on
load balancer while the target is 20 qps per pod, autoscaler will set the number
of replicas to 5).
* *[future]* **Autoscale pods based on multiple metrics.** If the target numbers
of pods for different metrics are different, choose the largest target number of
pods.
* *[future]* **Scale the number of pods starting from 0.** All pods can be
turned-off, and then turned-on when there is a demand for them. When a request
to service with no pods arrives, kube-proxy will generate an event for
autoscaler to create a new pod. Discussed in issue [#3247](https://github.com/kubernetes/kubernetes/issues/3247).
* *[future]* **When scaling down, make a more educated decision about which pods
to kill.** E.g.: if two or more pods from the same replication controller are on
the same node, kill one of them. Discussed in issue [#4301](https://github.com/kubernetes/kubernetes/issues/4301).
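The arithmetic behind the aggregate-metric and multiple-metrics items can be sketched as follows. Both helper names are hypothetical, and rounding up when the aggregate does not divide evenly is an assumption carried over from the CPU formula rather than something this list specifies.

```go
package main

import (
	"fmt"
	"math"
)

// replicasForAggregate divides a single external metric (e.g. qps observed on
// a load balancer) by the per-pod target. Rounding up is an assumption.
func replicasForAggregate(aggregate, targetPerPod float64) int {
	if targetPerPod <= 0 {
		return 0
	}
	return int(math.Ceil(aggregate / targetPerPod))
}

// replicasForMultipleMetrics picks the largest of the per-metric targets.
func replicasForMultipleMetrics(perMetric []int) int {
	largest := 0
	for _, n := range perMetric {
		if n > largest {
			largest = n
		}
	}
	return largest
}

func main() {
	// 100 qps on the load balancer with a target of 20 qps per pod.
	fmt.Println(replicasForAggregate(100, 20)) // 5
	// CPU suggests 3 pods, qps suggests 5, memory suggests 4.
	fmt.Println(replicasForMultipleMetrics([]int{3, 5, 4})) // 5
}
```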
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/horizontal-pod-autoscaler.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/horizontal-pod-autoscaler.md)

View File

@ -1,113 +1 @@
# Identifiers and Names in Kubernetes
A summary of the goals and recommendations for identifiers in Kubernetes.
Described in GitHub issue [#199](http://issue.k8s.io/199).
## Definitions
`UID`: A non-empty, opaque, system-generated value guaranteed to be unique in time
and space; intended to distinguish between historical occurrences of similar
entities.
`Name`: A non-empty string guaranteed to be unique within a given scope at a
particular time; used in resource URLs; provided by clients at creation time and
encouraged to be human friendly; intended to facilitate creation idempotence and
space-uniqueness of singleton objects, distinguish distinct entities, and
reference particular entities across operations.
[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `label` (DNS_LABEL):
An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters,
with the '-' character allowed anywhere except the first or last character,
suitable for use as a hostname or segment in a domain name.
[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) `subdomain` (DNS_SUBDOMAIN):
One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum
length of 253 characters.
[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) `universally unique identifier` (UUID):
A 128 bit generated value that is extremely unlikely to collide across time and
space and requires no central coordination.
[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) `port name` (IANA_SVC_NAME):
An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters,
with the '-' character allowed anywhere except the first or the last character
or adjacent to another '-' character. It must contain at least one (a-z)
character.
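The three string formats defined above translate fairly directly into validation code. The following is a minimal sketch based only on the definitions in this section; the function names are hypothetical and it is not the validation code used by the apiserver.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Lowercase alphanumerics, with '-' allowed anywhere except first or last.
var labelRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isDNSLabel checks the DNS_LABEL definition: at most 63 characters.
func isDNSLabel(s string) bool {
	return len(s) <= 63 && labelRE.MatchString(s)
}

// isDNSSubdomain checks the DNS_SUBDOMAIN definition: one or more labels
// separated by '.', at most 253 characters in total.
func isDNSSubdomain(s string) bool {
	if len(s) == 0 || len(s) > 253 {
		return false
	}
	for _, label := range strings.Split(s, ".") {
		if !isDNSLabel(label) {
			return false
		}
	}
	return true
}

// isIANASvcName checks the IANA_SVC_NAME definition: at most 15 characters,
// no leading, trailing, or adjacent '-', and at least one letter.
func isIANASvcName(s string) bool {
	if len(s) > 15 || !labelRE.MatchString(s) || strings.Contains(s, "--") {
		return false
	}
	return strings.ContainsAny(s, "abcdefghijklmnopqrstuvwxyz")
}

func main() {
	fmt.Println(isDNSLabel("backend-x4eb1"))      // true
	fmt.Println(isDNSSubdomain("guestbook.user")) // true
	fmt.Println(isIANASvcName("8080"))            // false: no letter
}
```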
## Objectives for names and UIDs
1. Uniquely identify (via a UID) an object across space and time.
2. Uniquely name (via a name) an object across space.
3. Provide human-friendly names in API operations and/or configuration files.
4. Allow idempotent creation of API resources (#148) and enforcement of
space-uniqueness of singleton objects.
5. Allow DNS names to be automatically generated for some objects.
## General design
1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must
be specified. Name must be non-empty and unique within the apiserver. This
enables idempotent and space-unique creation operations. Parts of the system
(e.g. replication controller) may join strings (e.g. a base name and a random
suffix) to create a unique Name. For situations where generating a name is
impractical, some or all objects may support a param to auto-generate a name.
Generating random names will defeat idempotency.
* Examples: "guestbook.user", "backend-x4eb1"
2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN?
format TBD via #1114) may be specified. Depending on the API receiver,
namespaces might be validated (e.g. apiserver might ensure that the namespace
actually exists). If a namespace is not specified, one will be assigned by the
API receiver. This assignment policy might vary across API receivers (e.g.
apiserver might have a default, kubelet might generate something semi-random).
* Example: "api.k8s.example.com"
3. Upon acceptance of an object via an API, the object is assigned a UID
(a UUID). UID must be non-empty and unique across space and time.
* Example: "01234567-89ab-cdef-0123-456789abcdef"
## Case study: Scheduling a pod
Pods can be placed onto a particular node in a number of ways. This case study
demonstrates how the above design can be applied to satisfy the objectives.
### A pod scheduled by a user through the apiserver
1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.
2. The apiserver validates the input.
1. A default Namespace is assigned.
2. The pod name must be space-unique within the Namespace.
3. Each container within the pod has a name which must be space-unique within
the pod.
3. The pod is accepted.
1. A new UID is assigned.
4. The pod is bound to a node.
1. The kubelet on the node is passed the pod's UID, Namespace, and Name.
5. Kubelet validates the input.
6. Kubelet runs the pod.
1. Each container is started up with enough metadata to distinguish the pod
from whence it came.
2. Each attempt to run a container is assigned a UID (a string) that is
unique across time.
* This may correspond to Docker's container ID.
### A pod placed by a config file on the node
1. A config file is stored on the node, containing a pod with UID="",
Namespace="", and Name="cadvisor".
2. Kubelet validates the input.
1. Since UID is not provided, kubelet generates one.
2. Since Namespace is not provided, kubelet generates one.
1. The generated namespace should be deterministic and cluster-unique for
the source, such as a hash of the hostname and file path.
* E.g. Namespace="file-f4231812554558a718a01ca942782d81"
3. Kubelet runs the pod.
1. Each container is started up with enough metadata to distinguish the pod
from whence it came.
2. Each attempt to run a container is assigned a UID (a string) that is
unique across time.
1. This may correspond to Docker's container ID.
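A deterministic namespace of the kind shown in the example above could be derived as in the sketch below. The choice of MD5, the separator byte, and the function name are assumptions made for illustration; the text only asks for a hash of the hostname and file path and does not name an algorithm.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// fileNamespace derives a namespace from the node's hostname and the path of
// the config file. The same inputs always give the same namespace. MD5 is an
// assumption of this sketch; it is used here only as a stable fingerprint.
func fileNamespace(hostname, path string) string {
	// The NUL separator keeps ("ab", "c") distinct from ("a", "bc").
	sum := md5.Sum([]byte(hostname + "\x00" + path))
	return "file-" + hex.EncodeToString(sum[:])
}

func main() {
	// Hypothetical hostname and manifest path.
	fmt.Println(fileNamespace("node-1", "/etc/kubernetes/manifests/cadvisor.yaml"))
}
```

The result is "file-" followed by 32 hexadecimal digits, which has the same shape as the example namespace and fits within the 63-character DNS_LABEL limit.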
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/identifiers.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/identifiers.md)

View File

@ -1,900 +1 @@
# Design: Indexed Feature of Job object
## Summary
This design extends Kubernetes with user-friendly support for
running embarrassingly parallel jobs.
Here, *parallel* means on multiple nodes, which means multiple pods.
By *embarrassingly parallel*, it is meant that the pods
have no dependencies between each other. In particular, neither
ordering between pods nor gang scheduling is supported.
Users already have two other options for running embarrassingly parallel
Jobs (described in the next section), but both have ease-of-use issues.
Therefore, this document proposes extending the Job resource type to support
a third way to run embarrassingly parallel programs, with a focus on
ease of use.
This new style of Job is called an *indexed job*, because each Pod of the Job
is specialized to work on a particular *index* from a fixed length array of work
items.
## Background
The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports
the embarrassingly parallel use case through *workqueue jobs*.
While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very
flexible, they can be difficult to use. They (1) typically require running a
message queue or other database service, (2) typically require modifications
to existing binaries and images, and (3) are prone to subtle race conditions
that are easy to overlook.
Users also have another option for parallel jobs: creating [multiple Job objects
from a template](hdocs/design/indexed-job.md#job-patterns). For small numbers of
Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job
objects at once. But, that approach also has its drawbacks: (1) for large levels
of parallelism (hundreds or thousands of pods) this approach means that listing
all jobs presents too much information, (2) users want a single source of
information about the success or failure of what the user views as a single
logical process.
Indexed job provides a third option with better ease-of-use for common
use cases.
## Requirements
### User Requirements
- Users want an easy way to run a Pod to completion *for each* item within a
[work list](#example-use-cases).
- Users want to run these pods in parallel for speed, but to vary the level of
parallelism as needed, independent of the number of work items.
- Users want to do this without requiring changes to existing images,
or source-to-image pipelines.
- Users want a single object that encompasses the lifetime of the parallel
program. Deleting it should delete all dependent objects. It should report the
status of the overall process. Users should be able to wait for it to complete,
and to refer to it from other resource types, such as
[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980).
### Example Use Cases
Here are several examples of *work lists*: lists of command lines that the user
wants to run, each line its own Pod. (Note that in practice, a work list may not
ever be written out in this form, but it exists in the mind of the Job creator,
and it is a useful way to talk about the intent of the user when discussing
alternatives for specifying Indexed Jobs).
Note that we will not have the user express their requirements in work list
form; it is just a format for presenting use cases. Subsequent discussion will
reference these work lists.
#### Work List 1
Process several files with the same program:
```
/usr/local/bin/process_file 12342.dat
/usr/local/bin/process_file 97283.dat
/usr/local/bin/process_file 38732.dat
```
#### Work List 2
Process a matrix (or image, etc) in rectangular blocks:
```
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
```
#### Work List 3
Build a program at several different git commits:
```
HASH=3cab5cb4a git checkout $HASH && make clean && make VERSION=$HASH
HASH=fe97ef90b git checkout $HASH && make clean && make VERSION=$HASH
HASH=a8b5e34c5 git checkout $HASH && make clean && make VERSION=$HASH
```
#### Work List 4
Render several frames of a movie:
```
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 2
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 3
```
#### Work List 5
Render several blocks of frames (Render blocks to avoid Pod startup overhead for
every frame):
```
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 101 --frame-end 200
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 201 --frame-end 300
```
## Design Discussion
### Converting Work Lists into Indexed Jobs
Given a work list, like in the [work list examples](#example-use-cases),
the information from the work list needs to get into each Pod of the Job.
Users will typically not want to create a new image for each job they
run. They will want to use existing images. So, the image is not the place
for the work list.
A work list can be stored on networked storage, and mounted by pods of the job.
Also, as a shortcut, for small worklists, it can be included in an annotation on
the Job object, which is then exposed as a volume in the pod via the downward
API.
### What Varies Between Pods of a Job
Pods need to differ in some way to do something different. (They do not differ
in the work-queue style of Job, but that style has ease-of-use issues).
A general approach would be to allow pods to differ from each other in arbitrary
ways. For example, the Job object could have a list of PodSpecs to run.
However, this is so general that it provides little value, and it has several
drawbacks:
- The Job Spec becomes very verbose, especially for jobs with thousands of work
items.
- Job becomes such a vague concept that it is hard to explain to users.
- In practice, we do not see cases where many pods differ across many fields of
their specs and still need to run as a group with no ordering constraints.
- CLIs and UIs need to support more options for creating a Job.
- Monitoring and accounting databases want to aggregate data for pods with the
same controller. However, pods with very different Specs may not make sense to
aggregate.
- Profiling, debugging, accounting, auditing and monitoring tools cannot assume
common images/files, behaviors, provenance and so on between Pods of a Job.
Also, variety has another cost. Pods which differ in ways that affect scheduling
(node constraints, resource requirements, labels) prevent the scheduler from
treating them as fungible, which is an important optimization for the scheduler.
Therefore, we will not allow Pods from the same Job to differ arbitrarily
(anyway, users can use multiple Job objects for that case). We will try to
allow as little as possible to differ between pods of the same Job, while still
allowing users to express common parallel patterns easily. For users who need to
run jobs which differ in other ways, they can create multiple Jobs, and manage
them as a group using labels.
From the above work lists, we see a need for Pods which differ in their command
lines, and in their environment variables. These work lists do not require the
pods to differ in other ways.
Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf)
has shown this model to be applicable to a very broad range of problems, despite
this restriction.
Therefore we will allow pods in the same Job to differ **only** in the following
aspects:
- command line
- environment variables
### Composition of existing images
The docker image that is used in a job may not be maintained by the person
running the job. Over time, the Dockerfile may change the ENTRYPOINT or CMD.
If we require people to specify the complete command line to use Indexed Job,
then they will not automatically pick up changes in the default
command or args.
This needs more thought.
### Running Ad-Hoc Jobs using kubectl
A user should be able to easily start an Indexed Job using `kubectl`. For
example to run [work list 1](#work-list-1), a user should be able to type
something simple like:
```
kubectl run process-files --image=myfileprocessor \
--per-completion-env=F="12342.dat 97283.dat 38732.dat" \
--restart=OnFailure \
-- \
/usr/local/bin/process_file '$F'
```
In the above example:
- `--restart=OnFailure` implies creating a job instead of replicationController.
- Each pod's command line is `/usr/local/bin/process_file $F`.
- `--per-completion-env=` implies the job's `.spec.completions` is set to the
length of the argument array (3 in the example).
- `--per-completion-env=F=<values>` causes an env var named `F` to be available
in the environment when the command line is evaluated.
How exactly this happens is discussed later in the doc: this is a sketch of the
user experience.
In practice, the list of files might be much longer and stored in a file on the
user's local host, like:
```
$ cat files-to-process.txt
12342.dat
97283.dat
38732.dat
...
```
So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`.
However, `kubectl` should also support a format like:
`--per-completion-env=F=@files-to-process.txt`.
That allows `kubectl` to parse the file and point out any syntax errors, and it
avoids running up against command line length limits (2MB is common, as low as
4kB is POSIX compliant).
One case we do not try to handle is where the file of work is stored on a cloud
filesystem, and not accessible from the user's local host. Then we cannot easily
use indexed job, because we do not know the number of completions. The user
needs to copy the file locally first or use the Work-Queue style of Job (already
supported).
Another case we do not try to handle is where the input file does not exist yet
because this Job is to be run at a future time, or depends on another job. The
workflow and scheduled job proposal need to consider this case. For that case,
you could use an indexed job which runs a program which shards the input file
(map-reduce-style).
#### Multiple parameters
The user may also have multiple parameters, like in [work list 2](#work-list-2).
One way is to just list all the command lines already expanded, one per line, in
a file, like this:
```
$ cat matrix-commandlines.txt
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
```
and run the Job like this:
```
kubectl run process-matrix --image=my/matrix \
--per-completion-env=COMMAND_LINE=@matrix-commandlines.txt \
--restart=OnFailure \
-- \
'eval "$COMMAND_LINE"'
```
However, this may have some subtleties with shell escaping. Also, it depends on
the user knowing all the correct arguments to the docker image being used (more
on this later).
Instead, kubectl should support multiple instances of the `--per-completion-env`
flag. For example, to implement work list 2, a user could do:
```
kubectl run process-matrix --image=my/matrix \
--per-completion-env=SR="0 16 0 16" \
--per-completion-env=ER="15 31 15 31" \
--per-completion-env=SC="0 0 16 16" \
--per-completion-env=EC="15 15 31 31" \
--restart=OnFailure \
-- \
/usr/local/bin/process_matrix_block -start_row $SR -end_row $ER -start_col $SC --end_col $EC
```
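To make the semantics of repeated `--per-completion-env` flags concrete, here is a minimal sketch of how a client might turn them into one set of environment variables per completion. The function names are hypothetical and the `@file` form is not handled. Requiring every flag to list the same number of values is an assumption that follows from `.spec.completions` being set to the length of the list.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePerCompletionEnv splits one flag value of the form KEY=VALUES, where
// VALUES is a whitespace-separated list with one entry per completion.
func parsePerCompletionEnv(flag string) (string, []string, error) {
	parts := strings.SplitN(flag, "=", 2)
	if len(parts) != 2 || parts[0] == "" {
		return "", nil, fmt.Errorf("expected KEY=VALUES, got %q", flag)
	}
	values := strings.Fields(parts[1])
	if len(values) == 0 {
		return "", nil, fmt.Errorf("no values given for %s", parts[0])
	}
	return parts[0], values, nil
}

// envPerCompletion combines several flags into one env map per completion.
// All flags must list the same number of values.
func envPerCompletion(flags []string) ([]map[string]string, error) {
	var envs []map[string]string
	for _, f := range flags {
		key, values, err := parsePerCompletionEnv(f)
		if err != nil {
			return nil, err
		}
		if envs == nil {
			envs = make([]map[string]string, len(values))
			for i := range envs {
				envs[i] = map[string]string{}
			}
		} else if len(values) != len(envs) {
			return nil, fmt.Errorf("%s has %d values, want %d", key, len(values), len(envs))
		}
		for i, v := range values {
			envs[i][key] = v
		}
	}
	return envs, nil
}

func main() {
	envs, err := envPerCompletion([]string{
		"SR=0 16 0 16", "ER=15 31 15 31", "SC=0 0 16 16", "EC=15 15 31 31",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, e := range envs {
		fmt.Printf("%d: -start_row %s -end_row %s -start_col %s --end_col %s\n",
			i, e["SR"], e["ER"], e["SC"], e["EC"])
	}
}
```

With the values used for work list 2, the four printed lines carry the same arguments as the four command lines of that work list, in the same order.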
### Composition With Workflows and ScheduledJob
A user should be able to create a job (Indexed or not) which runs at a specific
time(s). For example:
```
$ kubectl run process-files --image=myfileprocessor \
--per-completion-env=F="12342.dat 97283.dat 38732.dat" \
--restart=OnFailure \
--runAt=2015-07-21T14:00:00Z \
-- \
/usr/local/bin/process_file '$F'
created "scheduledJob/process-files-37dt3"
```
Kubectl should build the same JobSpec, and then put it into a ScheduledJob
(#11980) and create that.
For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a
complete workflow from a single command line would be messy, because of the need
to specify all the arguments multiple times.
For that use case, the user could create a workflow message by hand. Or the user
could create a job template, and then make a workflow from the templates,
perhaps like this:
```
$ kubectl run process-files --image=myfileprocessor \
--per-completion-env=F="12342.dat 97283.dat 38732.dat" \
--restart=OnFailure \
--asTemplate \
-- \
/usr/local/bin/process_file '$F'
created "jobTemplate/process-files"
$ kubectl run merge-files --image=mymerger \
--restart=OnFailure \
--asTemplate \
-- \
/usr/local/bin/mergefiles 12342.out 97283.out 38732.out
created "jobTemplate/merge-files"
$ kubectl create-workflow process-and-merge \
--job=jobTemplate/process-files \
--job=jobTemplate/merge-files \
--dependency=process-files:merge-files
created "workflow/process-and-merge"
```
### Completion Indexes
A JobSpec specifies the number of times a pod needs to complete successfully,
through the `job.Spec.Completions` field. The number of completions will be
equal to the number of work items in the work list.
Each pod that the job controller creates is intended to complete one work item
from the work list. Since a pod may fail, several pods may, serially, attempt to
complete the same index. Therefore, we call it a *completion index* (or just
*index*), but not a *pod index*.
For each completion index, in the range 0 to `job.Spec.Completions - 1`, the job
controller will create a pod with that index, and keep creating them on failure,
until each index is completed.
A dense integer index, rather than a sparse string index (e.g. using just
`metadata.generate-name`), makes it easy to use the index to look up parameters
in, for example, an array in shared storage.
### Pod Identity and Template Substitution in Job Controller
The JobSpec contains a single pod template. When the job controller creates a
particular pod, it copies the pod template and modifies it in some way to make
that pod distinctive. Whatever is distinctive about that pod is its *identity*.
We consider several options.
#### Index Substitution Only
The job controller substitutes only the *completion index* of the pod into the
pod template when creating it. The JSON it POSTs differs only in a single
field.
We would put the completion index, as a stringified integer, into an annotation
of the pod. The user can extract it from the annotation into an env var via the
downward API, or put it in a file via a Downward API volume, and parse it
themselves.
Once it is an environment variable in the pod (say `$INDEX`), then one of two
things can happen.
First, the main program can know how to map from an integer index to what it
needs to do. For example, from Work List 4 above:
```
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX
```
Second, a shell script can be prepended to the original command line which maps
the index to one or more string parameters. For example, to implement Work List
5 above, you could do:
```
/vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME
```
In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX`
and exports `$START_FRAME` and `$END_FRAME`.
The shell script could be part of the image, but more usefully, it could be
generated by a program and stuffed in an annotation or a configMap, and from
there added to a volume.
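The mapping that `/vol0/setupenv.sh` performs for Work List 5 is simple arithmetic, sketched below in Go for concreteness. The block size of 100, the zero-based index, and the function name are assumptions of this sketch. Because a child process cannot export variables into its parent shell, the program prints shell assignments that a wrapper could `eval`.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// framesPerBlock matches the blocks of 100 frames used in Work List 5.
const framesPerBlock = 100

// frameRange maps a zero-based completion index to the first and last frame
// of its block: index 0 gives 1..100, index 1 gives 101..200, and so on.
func frameRange(index, blockSize int) (start, end int) {
	return index*blockSize + 1, (index + 1) * blockSize
}

func main() {
	raw := os.Getenv("INDEX")
	if raw == "" {
		raw = "0" // default so the sketch also runs outside a pod
	}
	index, err := strconv.Atoi(raw)
	if err != nil || index < 0 {
		fmt.Fprintln(os.Stderr, "INDEX must be a non-negative integer")
		os.Exit(1)
	}
	start, end := frameRange(index, framesPerBlock)
	fmt.Printf("export START_FRAME=%d END_FRAME=%d\n", start, end)
}
```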
The first approach may require the user to modify an existing image (see next
section) to be able to accept an `$INDEX` env var or argument. The second
approach requires that the image have a shell. We think that together these two
options cover a wide range of use cases (though not all).
#### Multiple Substitution
In this option, the JobSpec is extended to include a list of values to
substitute, and which fields to substitute them into. For example, a worklist
like this:
```
FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds
FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt
FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit
```
can be broken down into a template like this, with three parameters:
```
<custom env var 1>; process-fruit -a -b -c <custom arg 1> <custom arg 2>
```
and a list of parameter tuples, like this:
```
("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds")
("FRUIT_COLOR=yellow", "-f banana.txt", "")
("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit")
```
The JobSpec can be extended to hold a list of parameter tuples (which are more
easily expressed as a list of lists of individual parameters). For example:
```
apiVersion: extensions/v1beta1
kind: Job
...
spec:
completions: 3
...
template:
...
perCompletionArgs:
container: 0
-
- "-f apple.txt"
- "-f banana.txt"
- "-f cherry.txt"
-
- "--remove-seeds"
- ""
- "--remove-pit"
perCompletionEnvVars:
- name: "FRUIT_COLOR"
- "green"
- "yellow"
- "red"
```
However, just providing custom env vars, and not arguments, is sufficient for
many use cases: parameters can be put into env vars, and then substituted on the
command line.
#### Comparison
The multiple substitution approach:
- keeps the *per completion parameters* in the JobSpec.
- Drawback: makes the job spec large for jobs with thousands of completions. (But
for very large jobs, the work-queue style or another type of controller, such as
map-reduce or spark, may be a better fit.)
- Drawback: is a form of server-side templating, which we want in Kubernetes but
have not fully designed (see the [StatefulSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).
The index-only approach:
- Requires that the user keep the *per completion parameters* in a separate
storage, such as a configData or networked storage.
- Makes no changes to the JobSpec.
- Drawback: while in separate storage, the parameters could be mutated, which
would have unexpected effects.
- Drawback: Logic for using index to lookup parameters needs to be in the Pod.
- Drawback: CLIs and UIs are limited to using the "index" as the identity of a
pod from a job. They cannot easily say, for example `repeated failures on the
pod processing banana.txt`.
Index-only approach relies on at least one of the following being true:
1. The image contains a shell and certain shell commands (not all images have
this).
1. The main program directly consumes the index from annotations (via a file or
env var) and maps it to specific behavior.
Also, using the index-only approach from non-kubectl clients requires that they
mimic the script-generation step, or only use the second style.
#### Decision
It is decided to implement the Index-only approach now. Once the server-side
templating design is complete for Kubernetes, and we have feedback from users,
we can consider whether to add Multiple Substitution.
## Detailed Design
#### Job Resource Schema Changes
No changes are made to the JobSpec.
The JobStatus is also not changed. The user can gauge the progress of the job by
the `.status.succeeded` count.
#### Job Spec Compatibility
A job spec written before this change will work exactly the same as before with
the new controller. The Pods it creates will have the same environment as
before. They will have a new annotation, but pods are expected to tolerate
unfamiliar annotations.
However, if the job controller version is reverted to a version before this
change, the jobs whose pod specs depend on the new annotation will fail.
This is okay for a Beta resource.
#### Job Controller Changes
The Job controller will maintain for each Job a data structure which
indicates the status of each completion index. We call this the
*scoreboard* for short. It is an array of length `.spec.completions`.
Elements of the array are of an `enum` type with possible values including
`complete`, `running`, and `notStarted`.
The scoreboard is stored in Job controller memory for efficiency. It can be
reconstructed by watching the pods of the job (such as after a controller
manager restart). The index of each pod can be extracted from the pod
annotation.
When the Job controller sees that the number of running pods is less than the
desired parallelism of the job, it finds the first index in the scoreboard with
value `notStarted`. It creates a pod with this completion index.
When it creates a pod with completion index `i`, it makes a copy of the
`.spec.template`, and sets
`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to
`i`. It does this in both the index-only and multiple-substitutions options.
Then it creates the pod.
When the controller notices that a pod has completed or is running or failed,
it updates the scoreboard.
When all entries in the scoreboard are `complete`, then the job is complete.
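The scoreboard logic described in this section can be sketched in a few lines. This is an in-memory illustration only: the type and method names are hypothetical, pod creation is reduced to building the annotation map, and resetting a failed index to `notStarted` is an assumption based on the controller retrying an index until it completes.

```go
package main

import (
	"fmt"
	"strconv"
)

type indexState int

const (
	notStarted indexState = iota
	running
	complete
)

// Annotation key taken from the text above.
const completionIndexAnnotation = "kubernetes.io/job/completion-index"

// scoreboard holds one state per completion index.
type scoreboard []indexState

func newScoreboard(completions int) scoreboard {
	return make(scoreboard, completions)
}

// nextIndex returns the first index that has not been started, or -1.
func (s scoreboard) nextIndex() int {
	for i, st := range s {
		if st == notStarted {
			return i
		}
	}
	return -1
}

// markFailed makes an index eligible for another attempt.
func (s scoreboard) markFailed(i int) {
	if s[i] != complete {
		s[i] = notStarted
	}
}

// done reports whether every index has completed.
func (s scoreboard) done() bool {
	for _, st := range s {
		if st != complete {
			return false
		}
	}
	return true
}

// podAnnotations returns the annotation set on the copy of the pod template.
func podAnnotations(index int) map[string]string {
	return map[string]string{completionIndexAnnotation: strconv.Itoa(index)}
}

func main() {
	s := newScoreboard(3)
	for !s.done() {
		i := s.nextIndex()
		s[i] = running
		fmt.Println("create pod with annotations", podAnnotations(i))
		s[i] = complete // pretend the pod ran to completion
	}
	fmt.Println("job complete")
}
```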
#### Downward API Changes
The downward API is changed to support extracting specific key names into a
single environment variable. So, the following would be supported:
```
kind: Pod
apiVersion: v1
spec:
  containers:
  - name: foo
    env:
    - name: MY_INDEX
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
```
This requires kubelet changes.
Users who fail to upgrade their kubelets at the same time as they upgrade their
controller manager will see a failure for pods to run when they are created by
the controller. The Kubelet will send an event about failure to create the pod.
The `kubectl describe job` will show many failed pods.
#### Kubectl Interface Changes
The `--completions` and `--completion-index-var-name` flags are added to
kubectl.
For example, this command:
```
kubectl run say-number --image=busybox \
--completions=3 \
--completion-index-var-name=I \
-- \
sh -c 'echo "My index is $I" && sleep 5'
```
will run 3 pods to completion, each printing one of the following lines:
```
My index is 1
My index is 2
My index is 0
```
Kubectl would create the following pod:
Kubectl will also support the `--per-completion-env` flag, as described
previously. For example, this command:
```
kubectl run say-fruit --image=busybox \
--per-completion-env=FRUIT="apple banana cherry" \
--per-completion-env=COLOR="green yellow red" \
-- \
sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```
or equivalently:
```
echo "apple banana cherry" > fruits.txt
echo "green yellow red" > colors.txt
kubectl run say-fruit --image=busybox \
--per-completion-env=FRUIT="$(cat fruits.txt)" \
--per-completion-env=COLOR="$(cat colors.txt)" \
-- \
sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```
or similarly:
```
kubectl run say-fruit --image=busybox \
--per-completion-env=FRUIT=@fruits.txt \
--per-completion-env=COLOR=@colors.txt \
-- \
sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```
will all run 3 pods in parallel. Index 0 pod will log:
```
Have a nice green apple
```
and so on.
Notes:
- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a
quoted space separated list or `@` and the name of a text file containing a
list.
- `--per-completion-env=` can be specified several times, but all must have the
same length list.
- `--completions=N` with `N` equal to list length is implied.
- The flag `--completions=3` sets `job.spec.completions=3`.
- The flag `--completion-index-var-name=I` causes an env var to be created named
I in each pod, with the index in it.
- The flag `--restart=OnFailure` is implied by `--completions` or any
job-specific arguments. The user can also specify `--restart=Never` if they
desire but may not specify `--restart=Always` with job-related flags.
- Setting any of these flags in turn tells kubectl to create a Job, not a
replicationController.
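The rules in the notes above (one `KEY=VALUES` list per flag, all lists of equal length, list length implying `--completions`) can be sketched as follows. This is a hedged illustration rather than kubectl code: `parsePerCompletionEnv` and `envForIndex` are hypothetical helpers, and the `@file` form is left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePerCompletionEnv parses values of the form KEY=VALUES, where VALUES is
// a space-separated list. It returns the lists by key and the implied number
// of completions. All lists must have the same length.
func parsePerCompletionEnv(flags []string) (map[string][]string, int, error) {
	lists := make(map[string][]string)
	completions := 0
	for n, f := range flags {
		key, raw, ok := strings.Cut(f, "=")
		if !ok || key == "" {
			return nil, 0, fmt.Errorf("expected KEY=VALUES, got %q", f)
		}
		values := strings.Fields(raw)
		if n == 0 {
			completions = len(values)
		} else if len(values) != completions {
			return nil, 0, fmt.Errorf("list for %s has %d values, want %d",
				key, len(values), completions)
		}
		lists[key] = values
	}
	return lists, completions, nil
}

// envForIndex picks the value at completion index i from every list.
func envForIndex(lists map[string][]string, i int) map[string]string {
	env := make(map[string]string, len(lists))
	for key, values := range lists {
		env[key] = values[i]
	}
	return env
}

func main() {
	lists, n, err := parsePerCompletionEnv([]string{
		"FRUIT=apple banana cherry",
		"COLOR=green yellow red",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("implied completions:", n)
	for i := 0; i < n; i++ {
		env := envForIndex(lists, i)
		fmt.Printf("index %d: Have a nice %s %s\n", i, env["COLOR"], env["FRUIT"])
	}
}
```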
#### How Kubectl Creates Job Specs.
To pass in the parameters, kubectl will generate a shell script which
can:
- parse the index from the annotation
- hold all the parameter lists.
- lookup the correct index in each parameter list and set an env var.
For example, consider this command:
```
kubectl run say-fruit --image=busybox \
--per-completion-env=FRUIT="apple banana cherry" \
--per-completion-env=COLOR="green yellow red" \
-- \
sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```
First, kubectl generates the PodSpec as it normally does for `kubectl run`.
But, then it will generate this script:
```sh
#!/bin/sh
# Generated by kubectl run ...
# Check for needed commands
if ! type cat > /dev/null 2>&1
then
  echo "$0: Image does not include required command: cat"
  exit 2
fi
if ! type grep > /dev/null 2>&1
then
  echo "$0: Image does not include required command: grep"
  exit 2
fi
# Check that annotations are mounted from downward API
if [ ! -e /etc/annotations ]
then
  echo "$0: Cannot find /etc/annotations"
  exit 2
fi
# Get our index from annotations file
I=$(cat /etc/annotations | grep job.kubernetes.io/index | cut -f 2 -d '"') || echo "$0: failed to extract index"
export I
# Our parameter lists are stored inline in this script.
FRUIT_0="apple"
FRUIT_1="banana"
FRUIT_2="cherry"
# Extract the right parameter value based on our index.
# This works on any Bourne-based shell.
FRUIT=$(eval echo \$"FRUIT_$I")
export FRUIT
COLOR_0="green"
COLOR_1="yellow"
COLOR_2="red"
COLOR=$(eval echo \$"COLOR_$I")
export COLOR
```
Then it POSTs this script, encoded, inside a ConfigData.
It attaches this volume to the PodSpec.
Then it will edit the command line of the Pod to run this script before the rest of
the command line.
Then it appends a DownwardAPI volume to the pod spec to get the annotations in a file, like this:
It also appends the Secret (later configData) volume with the script in it.
So, the Pod template that kubectl creates (inside the job template) looks like this:
```
apiVersion: v1
kind: Job
...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: c
        image: gcr.io/google_containers/busybox
        command:
        - 'sh'
        - '-c'
        - '/etc/job-params.sh; echo "this is the rest of the command"'
        volumeMounts:
        - name: annotations
          mountPath: /etc
        - name: script
          mountPath: /etc
      volumes:
      - name: annotations
        downwardAPI:
          items:
          - path: "annotations"
            fieldRef:
              fieldPath: metadata.annotations
      - name: script
        secret:
          secretName: jobparams-abc123
```
###### Alternatives
Kubectl could append a `valueFrom` line like this to
get the index into the environment:
```yaml
apiVersion: extensions/v1beta1
kind: Job
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: foo
        ...
        env:
        # following block added:
        - name: I
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations."kubernetes.io/job-idx"
```
However, in order to inject other env vars from parameter list,
kubectl still needs to edit the command line.
Parameter lists could be passed via a configData volume instead of a secret.
Kubectl can be changed to work that way once the configData implementation is
complete.
Parameter lists could be passed inside an EnvVar. This would have length
limitations, would pollute the output of `kubectl describe pods` and `kubectl
get pods -o json`.
Parameter lists could be passed inside an annotation. This would have length
limitations, would pollute the output of `kubectl describe pods` and `kubectl
get pods -o json`. Also, currently annotations can only be extracted into a
single file. Complex logic is then needed to filter out exactly the desired
annotation data.
Bash array variables could simplify extraction of a particular parameter from a
list of parameters. However, some popular base images do not include
`/bin/bash`. For example, `busybox` uses a compact `/bin/sh` implementation
that does not support array syntax.
Kubelet does support [expanding variables without a
shell](http://kubernetes.io/kubernetes/v1.1/docs/design/expansion.html). But it does not
allow for recursive substitution, which is required to extract the correct
parameter from a list based on the completion index of the pod. The syntax
could be extended, but doing so seems complex and will be an unfamiliar syntax
for users.
Putting all the command line editing into a script and running that causes
the least pollution to the original command line, and it allows
for complex error handling.
Kubectl could store the script in an [Inline Volume](
https://github.com/kubernetes/kubernetes/issues/13610) if that proposal
is approved. That would remove the need to manage the lifetime of the
configData/secret, and prevent the case where someone changes the
configData mid-job, and breaks things in a hard-to-debug way.
## Interactions with other features
#### Supporting Work Queue Jobs too
For Work Queue Jobs, completions has no meaning. Parallelism should be allowed
to be greater than it, and pods have no identity. So, the job controller should
not create a scoreboard in the JobStatus, just a count. Therefore, we need to
add one of the following to JobSpec:
- allow unset `.spec.completions` to indicate no scoreboard, and no index for
tasks (identical tasks).
- allow `.spec.completions=-1` to indicate the same.
- add `.spec.indexed` to job to indicate need for scoreboard.
#### Interaction with vertical autoscaling
Since pods of the same job will not be created with different resources,
a vertical autoscaler will need to:
- if it has index-specific initial resource suggestions, suggest those at
admission time; it will need to understand indexes.
- mutate resource requests on already created pods based on usage trend or
previous container failures.
- modify the job template, affecting all indexes.
#### Comparison to StatefulSets (previously named PetSets)
The *Index substitution-only* option corresponds roughly to StatefulSet Proposal 1b.
The `perCompletionArgs` approach is similar to StatefulSet Proposal 1e, but more
restrictive and thus less verbose.
It would be easier for users if Indexed Job and StatefulSet are similar where
possible. However, StatefulSet differs in several key respects:
- StatefulSet is for ones to tens of instances. Indexed job should work with tens of
thousands of instances.
- When you have few instances, you may want to give them names. When you have many instances,
integer indexes make more sense.
- When you have thousands of instances, storing the work-list in the JobSpec
is verbose. For StatefulSet, this is less of a problem.
- StatefulSets (apparently) need to differ in more fields than indexed Jobs.
This differs from StatefulSet in that StatefulSet uses names and not indexes. StatefulSet is
intended to support ones to tens of things.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/indexed-job.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/indexed-job.md)

View File

@ -1,137 +1 @@
# MetadataPolicy and its use in choosing the scheduler in a multi-scheduler system
## Introduction
This document describes a new API resource, `MetadataPolicy`, that configures an
admission controller to take one or more actions based on an object's metadata.
Initially the metadata fields that the predicates can examine are labels and
annotations, and the actions are to add one or more labels and/or annotations,
or to reject creation/update of the object. In the future other actions might be
supported, such as applying an initializer.
The first use of `MetadataPolicy` will be to decide which scheduler should
schedule a pod in a [multi-scheduler](../proposals/multiple-schedulers.md)
Kubernetes system. In particular, the policy will add the scheduler name
annotation to a pod based on an annotation that is already on the pod that
indicates the QoS of the pod. (That annotation was presumably set by a simpler
admission controller that uses code, rather than configuration, to map the
resource requests and limits of a pod to QoS, and attaches the corresponding
annotation.)
We anticipate a number of other uses for `MetadataPolicy`, such as defaulting
for labels and annotations, prohibiting/requiring particular labels or
annotations, or choosing a scheduling policy within a scheduler. We do not
discuss them in this doc.
## API
```go
// MetadataPolicySpec defines the configuration of the MetadataPolicy API resource.
// Every rule is applied, in an unspecified order, but if the action for any rule
// that matches is to reject the object, then the object is rejected without being mutated.
type MetadataPolicySpec struct {
Rules []MetadataPolicyRule `json:"rules,omitempty"`
}
// If the PolicyPredicate is met, then the PolicyAction is applied.
// Example rules:
// reject object if label with key X is present (i.e. forbid X)
// reject object if label with key X is not present (i.e. require X)
// add label X=Y if label with key X is not present (i.e. default X)
// add annotation A=B if object has annotation C=D or E=F
type MetadataPolicyRule struct {
PolicyPredicate PolicyPredicate `json:"policyPredicate"`
PolicyAction PolicyAction `json:"policyAction"`
}
// All criteria must be met for the PolicyPredicate to be considered met.
type PolicyPredicate struct {
// Note that Namespace is not listed here because MetadataPolicy is per-Namespace.
LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
AnnotationSelector *LabelSelector `json:"annotationSelector,omitempty"`
}
// Apply the indicated Labels and/or Annotations (if present), unless Reject is set
// to true, in which case reject the object without mutating it.
type PolicyAction struct {
// If true, the object will be rejected and not mutated.
Reject bool `json:"reject"`
// The labels to add or update, if any.
UpdatedLabels *map[string]string `json:"updatedLabels,omitempty"`
// The annotations to add or update, if any.
UpdatedAnnotations *map[string]string `json:"updatedAnnotations,omitempty"`
}
// MetadataPolicy describes the MetadataPolicy API resource, which is used for specifying
// policies that should be applied to objects based on the objects' metadata. All MetadataPolicy's
// are applied to all objects in the namespace; the order of evaluation is not guaranteed,
// but if any of the matching policies have an action of rejecting the object, then the object
// will be rejected without being mutated.
type MetadataPolicy struct {
unversioned.TypeMeta `json:",inline"`
// Standard object's metadata.
// More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#metadata
ObjectMeta `json:"metadata,omitempty"`
// Spec defines the metadata policy that should be enforced.
// http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#spec-and-status
Spec MetadataPolicySpec `json:"spec,omitempty"`
}
// MetadataPolicyList is a list of MetadataPolicy items.
type MetadataPolicyList struct {
unversioned.TypeMeta `json:",inline"`
// Standard list metadata.
// More info: http://releases.k8s.io/HEAD/docs/devel/api-conventions.md#types-kinds
unversioned.ListMeta `json:"metadata,omitempty"`
// Items is a list of MetadataPolicy objects.
// More info: http://releases.k8s.io/HEAD/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota
Items []MetadataPolicy `json:"items"`
}
```
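To make the evaluation semantics concrete, here is a minimal sketch of how an admission controller might apply a set of rules. It is not the proposed implementation: the types are simplified stand-ins for the API types above (selectors are reduced to exact-match maps), the scheduler name `best-effort-scheduler` is hypothetical, the annotation key `scheduler.alpha.kubernetes.io/name` is used only for illustration, and evaluating every predicate against the original metadata is an assumption, since the proposal leaves rule order unspecified.

```go
package main

import "fmt"

// Simplified stand-ins for PolicyPredicate, PolicyAction and MetadataPolicyRule.
type predicate struct {
	matchLabels      map[string]string
	matchAnnotations map[string]string
}

type action struct {
	reject             bool
	updatedLabels      map[string]string
	updatedAnnotations map[string]string
}

type rule struct {
	predicate predicate
	action    action
}

type objectMeta struct {
	labels      map[string]string
	annotations map[string]string
}

// matches reports whether every wanted key/value pair is present in have.
func matches(want, have map[string]string) bool {
	for k, v := range want {
		if got, ok := have[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func copyMap(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

// apply returns the mutated metadata and true, or the untouched metadata and
// false if any matching rule rejects the object.
func apply(rules []rule, obj objectMeta) (objectMeta, bool) {
	var matched []action
	for _, r := range rules {
		if matches(r.predicate.matchLabels, obj.labels) &&
			matches(r.predicate.matchAnnotations, obj.annotations) {
			if r.action.reject {
				return obj, false // rejected without being mutated
			}
			matched = append(matched, r.action)
		}
	}
	out := objectMeta{labels: copyMap(obj.labels), annotations: copyMap(obj.annotations)}
	for _, a := range matched {
		for k, v := range a.updatedLabels {
			out.labels[k] = v
		}
		for k, v := range a.updatedAnnotations {
			out.annotations[k] = v
		}
	}
	return out, true
}

func main() {
	rules := []rule{{
		predicate: predicate{matchAnnotations: map[string]string{
			"scheduler.alpha.kubernetes.io/qos": "BestEffort"}},
		action: action{updatedAnnotations: map[string]string{
			"scheduler.alpha.kubernetes.io/name": "best-effort-scheduler"}},
	}}
	pod := objectMeta{annotations: map[string]string{
		"scheduler.alpha.kubernetes.io/qos": "BestEffort"}}
	out, ok := apply(rules, pod)
	fmt.Println(ok, out.annotations)
}
```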
## Implementation plan
1. Create `MetadataPolicy` API resource
1. Create admission controller that implements policies defined in
`MetadataPolicy`
1. Create admission controller that sets annotation
`scheduler.alpha.kubernetes.io/qos: <QoS>`
(where `<QoS>` is one of `Guaranteed`, `Burstable`, `BestEffort`)
based on pod's resource request and limit.
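As a rough illustration of the third step, the sketch below derives a QoS class from per-container requests and limits. The proposal does not spell out the mapping; the rules used here follow the commonly documented Kubernetes QoS classes and should be read as an assumption. The `resources` and `container` types and the `qosClass` function are hypothetical simplifications in which a zero value means "not set".

```go
package main

import "fmt"

// resources holds cpu (millicores) and memory (bytes); zero means "not set".
type resources struct{ cpu, memory int64 }

type container struct{ requests, limits resources }

// qosClass returns Guaranteed, Burstable or BestEffort for a pod's containers.
func qosClass(containers []container) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		req, lim := c.requests, c.limits
		if req != (resources{}) || lim != (resources{}) {
			anySet = true
		}
		// Assumption: an unset request defaults to the corresponding limit.
		if req.cpu == 0 {
			req.cpu = lim.cpu
		}
		if req.memory == 0 {
			req.memory = lim.memory
		}
		// Guaranteed needs both limits set and equal to the requests.
		if lim.cpu == 0 || lim.memory == 0 || req != lim {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	pod := []container{{
		requests: resources{cpu: 500, memory: 1 << 30},
		limits:   resources{cpu: 500, memory: 1 << 30},
	}}
	fmt.Printf("scheduler.alpha.kubernetes.io/qos: %s\n", qosClass(pod))
}
```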
## Future work
Longer-term we will have QoS be set on create and update by the registry,
similar to `Pending` phase today, instead of having an admission controller
(that runs before the one that takes `MetadataPolicy` as input) do it.
We plan to eventually move from having an admission controller set the scheduler
name as a pod annotation, to using the initializer concept. In particular, the
scheduler will be an initializer, and the admission controller that decides
which scheduler to use will add the scheduler's name to the list of initializers
for the pod (presumably the scheduler will be the last initializer to run on
each pod). The admission controller would still be configured using the
`MetadataPolicy` described here, only the mechanism the admission controller
uses to record its decision of which scheduler to use would change.
## Related issues
The main issue for multiple schedulers is #11793. There was also a lot of
discussion in PRs #17197 and #17865.
We could use the approach described here to choose a scheduling policy within a
single scheduler, as opposed to choosing a scheduler, a desire mentioned in #9920.
Issue #17097 describes a scenario unrelated to scheduler-choosing where
`MetadataPolicy` could be used. Issue #17324 proposes to create a generalized
API for matching "claims" to "service classes"; matching a pod to a scheduler
would be one use for such an API.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/metadata-policy.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/metadata-policy.md)

View File

@ -1,203 +1 @@
# Kubernetes monitoring architecture
## Executive Summary
Monitoring is split into two pipelines:
* A **core metrics pipeline** consisting of Kubelet, a resource estimator, a slimmed-down
Heapster called metrics-server, and the API server serving the master metrics API. These
metrics are used by core system components, such as scheduling logic (e.g. scheduler and
horizontal pod autoscaling based on system metrics) and simple out-of-the-box UI components
(e.g. `kubectl top`). This pipeline is not intended for integration with third-party
monitoring systems.
* A **monitoring pipeline** used for collecting various metrics from the system and exposing
them to end-users, as well as to the Horizontal Pod Autoscaler (for custom metrics) and Infrastore
via adapters. Users can choose from many monitoring system vendors, or run none at all. In
open-source, Kubernetes will not ship with a monitoring pipeline, but third-party options
will be easy to install. We expect that such pipelines will typically consist of a per-node
agent and a cluster-level aggregator.
The architecture is illustrated in the diagram in the Appendix of this doc.
## Introduction and Objectives
This document proposes a high-level monitoring architecture for Kubernetes. It covers
a subset of the issues mentioned in the “Kubernetes Monitoring Architecture” doc,
specifically focusing on an architecture (components and their interactions) that
hopefully meets the numerous requirements. We do not specify any particular timeframe
for implementing this architecture, nor any particular roadmap for getting there.
### Terminology
There are two types of metrics, system metrics and service metrics. System metrics are
generic metrics that are generally available from every entity that is monitored (e.g.
usage of CPU and memory by container and node). Service metrics are explicitly defined
in application code and exported (e.g. number of 500s served by the API server). Both
system metrics and service metrics can originate from user containers or from system
infrastructure components (master components like the API server, addon pods running on
the master, and addon pods running on user nodes).
We divide system metrics into
* *core metrics*, which are metrics that Kubernetes understands and uses for operation
of its internal components and core utilities -- for example, metrics used for scheduling
(including the inputs to the algorithms for resource estimation, initial resources/vertical
autoscaling, cluster autoscaling, and horizontal pod autoscaling excluding custom metrics),
the kube dashboard, and “kubectl top.” As of now this would consist of cpu cumulative usage,
memory instantaneous usage, disk usage of pods, disk usage of containers
* *non-core metrics*, which are not interpreted by Kubernetes; we generally assume they
include the core metrics (though not necessarily in a format Kubernetes understands) plus
additional metrics.
Service metrics can be divided into those produced by Kubernetes infrastructure components
(and thus useful for operation of the Kubernetes cluster) and those produced by user applications.
Service metrics used as input to horizontal pod autoscaling are sometimes called custom metrics.
Of course horizontal pod autoscaling also uses core metrics.
We consider logging to be separate from monitoring, so logging is outside the scope of
this doc.
### Requirements
The monitoring architecture should
* include a solution that is part of core Kubernetes and
* makes core system metrics about nodes, pods, and containers available via a standard
master API (today the master metrics API), such that core Kubernetes features do not
depend on non-core components
* requires Kubelet to only export a limited set of metrics, namely those required for
core Kubernetes components to correctly operate (this is related to #18770)
* can scale up to at least 5000 nodes
* is small enough that we can require that all of its components be running in all deployment
configurations
* include an out-of-the-box solution that can serve historical data, e.g. to support Initial
Resources and vertical pod autoscaling as well as cluster analytics queries, that depends
only on core Kubernetes
* allow for third-party monitoring solutions that are not part of core Kubernetes and can
be integrated with components like Horizontal Pod Autoscaler that require service metrics
## Architecture
We divide our description of the long-term architecture plan into the core metrics pipeline
and the monitoring pipeline. For each, it is necessary to think about how to deal with each
type of metric (core metrics, non-core metrics, and service metrics) from both the master
and minions.
### Core metrics pipeline
The core metrics pipeline collects a set of core system metrics. There are two sources for
these metrics
* Kubelet, providing per-node/pod/container usage information (the current cAdvisor that
is part of Kubelet will be slimmed down to provide only core system metrics)
* a resource estimator that runs as a DaemonSet and turns raw usage values scraped from
Kubelet into resource estimates (values used by scheduler for a more advanced usage-based
scheduler)
These sources are scraped by a component we call *metrics-server* which is like a slimmed-down
version of today's Heapster. metrics-server stores locally only latest values and has no sinks.
metrics-server exposes the master metrics API. (The configuration described here is similar
to the current Heapster in “standalone” mode.)
[Discovery summarizer](../../docs/proposals/federated-api-servers.md)
makes the master metrics API available to external clients such that from the client's perspective
it looks the same as talking to the API server.
Core (system) metrics are handled as described above in all deployment environments. The only
easily replaceable part is resource estimator, which could be replaced by power users. In
theory, metrics-server itself can also be substituted, but it'd be similar to substituting
apiserver itself or controller-manager - possible, but not recommended and not supported.
Eventually the core metrics pipeline might also collect metrics from Kubelet and Docker daemon
themselves (e.g. CPU usage of Kubelet), even though they do not run in containers.
The core metrics pipeline is intentionally small and not designed for third-party integrations.
“Full-fledged” monitoring is left to third-party systems, which provide the monitoring pipeline
(see next section) and can run on Kubernetes without having to make changes to upstream components.
In this way we can remove the burden we have today that comes with maintaining Heapster as the
integration point for every possible metrics source, sink, and feature.
#### Infrastore
We will build an open-source Infrastore component (most likely reusing existing technologies)
for serving historical queries over core system metrics and events, which it will fetch from
the master APIs. Infrastore will expose one or more APIs (possibly just SQL-like queries --
this is TBD) to handle the following use cases
* initial resources
* vertical autoscaling
* oldtimer API
* decision-support queries for debugging, capacity planning, etc.
* usage graphs in the [Kubernetes Dashboard](https://github.com/kubernetes/dashboard)
In addition, it may collect monitoring metrics and service metrics (at least from Kubernetes
infrastructure containers), described in the upcoming sections.
### Monitoring pipeline
One of the goals of building a dedicated metrics pipeline for core metrics, as described in the
previous section, is to allow for a separate monitoring pipeline that can be very flexible
because core Kubernetes components do not need to rely on it. By default we will not provide
one, but we will provide an easy way to install one (using a single command, most likely using
Helm). We describe the monitoring pipeline in this section.
Data collected by the monitoring pipeline may contain any sub- or superset of the following groups
of metrics:
* core system metrics
* non-core system metrics
* service metrics from user application containers
* service metrics from Kubernetes infrastructure containers; these metrics are exposed using
Prometheus instrumentation
It is up to the monitoring solution to decide which of these are collected.
In order to enable horizontal pod autoscaling based on custom metrics, the provider of the
monitoring pipeline would also have to create a stateless API adapter that pulls the custom
metrics from the monitoring pipeline and exposes them to the Horizontal Pod Autoscaler. Such
an API will be a well-defined, versioned API similar to regular APIs. Details of how it will be
exposed or discovered will be covered in a detailed design doc for this component.
The same approach applies if it is desired to make monitoring pipeline metrics available in
Infrastore. These adapters could be standalone components, libraries, or part of the monitoring
solution itself.
There are many possible combinations of node and cluster-level agents that could comprise a
monitoring pipeline, including
* cAdvisor + Heapster + InfluxDB (or any other sink)
* cAdvisor + collectd + Heapster
* cAdvisor + Prometheus
* snapd + Heapster
* snapd + SNAP cluster-level agent
* Sysdig
As an example we'll describe a potential integration with cAdvisor + Prometheus.
Prometheus has the following metric sources on a node:
* core and non-core system metrics from cAdvisor
* service metrics exposed by containers via HTTP handler in Prometheus format
* [optional] metrics about node itself from Node Exporter (a Prometheus component)
All of them are polled by the Prometheus cluster-level agent. We can use the Prometheus
cluster-level agent as a source for horizontal pod autoscaling custom metrics by using a
standalone API adapter that proxies/translates between the Prometheus Query Language endpoint
on the Prometheus cluster-level agent and an HPA-specific API. Likewise an adapter can be
used to make the metrics from the monitoring pipeline available in Infrastore. Neither
adapter is necessary if the user does not need the corresponding feature.
The command that installs cAdvisor+Prometheus should also automatically set up collection
of the metrics from infrastructure containers. This is possible because the names of the
infrastructure containers and metrics of interest are part of the Kubernetes control plane
configuration itself, and because the infrastructure containers export their metrics in
Prometheus format.
## Appendix: Architecture diagram
### Open-source monitoring pipeline
![Architecture Diagram](monitoring_architecture.png?raw=true "Architecture overview")
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/monitoring_architecture.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/monitoring_architecture.md)

Binary file not shown.


View File

@ -1,370 +1 @@
# Namespaces
## Abstract
A Namespace is a mechanism to partition resources created by users into
a logically named group.
## Motivation
A single cluster should be able to satisfy the needs of multiple user
communities.
Each user community wants to be able to work in isolation from other
communities.
Each user community has its own:
1. resources (pods, services, replication controllers, etc.)
2. policies (who can or cannot perform actions in their community)
3. constraints (this community is allowed this much quota, etc.)
A cluster operator may create a Namespace for each unique user community.
The Namespace provides a unique scope for:
1. named resources (to avoid basic naming collisions)
2. delegated management authority to trusted users
3. ability to limit community resource consumption
## Use cases
1. As a cluster operator, I want to support multiple user communities on a
single cluster.
2. As a cluster operator, I want to delegate authority to partitions of the
cluster to trusted users in those communities.
3. As a cluster operator, I want to limit the amount of resources each
community can consume in order to limit the impact to other communities using
the cluster.
4. As a cluster user, I want to interact with resources that are pertinent to
my user community in isolation of what other user communities are doing on the
cluster.
## Design
### Data Model
A *Namespace* defines a logically named group for multiple *Kind*s of resources.
```go
type Namespace struct {
TypeMeta `json:",inline"`
ObjectMeta `json:"metadata,omitempty"`
Spec NamespaceSpec `json:"spec,omitempty"`
Status NamespaceStatus `json:"status,omitempty"`
}
```
A *Namespace* name is a DNS compatible label.
A *Namespace* must exist prior to associating content with it.
A *Namespace* must not be deleted if there is content associated with it.
To associate a resource with a *Namespace* the following conditions must be
satisfied:
1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with
the server
2. The resource's *TypeMeta.Namespace* field must have a value that references
an existing *Namespace*
The *Name* of a resource associated with a *Namespace* is unique to that *Kind*
in that *Namespace*.
It is intended to be used in resource URLs; provided by clients at creation
time, and encouraged to be human friendly; intended to facilitate idempotent
creation, space-uniqueness of singleton objects, distinguish distinct entities,
and reference particular entities across operations.
### Authorization
A *Namespace* provides an authorization scope for accessing content associated
with the *Namespace*.
See [Authorization plugins](../admin/authorization.md)
### Limit Resource Consumption
A *Namespace* provides a scope to limit resource consumption.
A *LimitRange* defines min/max constraints on the amount of resources a single
entity can consume in a *Namespace*.
See [Admission control: Limit Range](admission_control_limit_range.md)
A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and
allows cluster operators to define *Hard* resource usage limits that a
*Namespace* may consume.
See [Admission control: Resource Quota](admission_control_resource_quota.md)
### Finalizers
Upon creation of a *Namespace*, the creator may provide a list of *Finalizer*
objects.
```go
type FinalizerName string
// These are internal finalizers to Kubernetes, must be qualified name unless defined here
const (
FinalizerKubernetes FinalizerName = "kubernetes"
)
// NamespaceSpec describes the attributes on a Namespace
type NamespaceSpec struct {
// Finalizers is an opaque list of values that must be empty to permanently remove object from storage
Finalizers []FinalizerName
}
```
A *FinalizerName* is a qualified name.
The API Server enforces that a *Namespace* can be deleted from storage if
and only if its *Namespace.Spec.Finalizers* is empty.
A *finalize* operation is the only mechanism to modify the
*Namespace.Spec.Finalizers* field post creation.
Each *Namespace* created has *kubernetes* as an item in its list of initial
*Namespace.Spec.Finalizers* set by default.
### Phases
A *Namespace* may exist in the following phases.
```go
type NamespacePhase string
const(
NamespaceActive NamespacePhase = "Active"
NamespaceTerminating NamespacePhase = "Terminating"
)
type NamespaceStatus struct {
...
Phase NamespacePhase
}
```
A *Namespace* is in the **Active** phase if it does not have an
*ObjectMeta.DeletionTimestamp*.
A *Namespace* is in the **Terminating** phase if it has a
*ObjectMeta.DeletionTimestamp*.
**Active**
Upon creation, a *Namespace* enters the *Active* phase. This means that content
may be associated with the namespace, and all normal interactions with the
namespace are allowed to occur in the cluster.
If a DELETE request occurs for a *Namespace*, the
*Namespace.ObjectMeta.DeletionTimestamp* is set to the current server time. A
*namespace controller* observes the change, and sets the
*Namespace.Status.Phase* to *Terminating*.
**Terminating**
A *namespace controller* watches for *Namespace* objects that have a
*Namespace.ObjectMeta.DeletionTimestamp* value set in order to know when to
initiate graceful termination of the content associated with the *Namespace*
that is known to the cluster.
The *namespace controller* enumerates each known resource type in that namespace
and deletes its instances one by one.
Admission control blocks creation of new resources in that namespace in order to
prevent a race condition where the controller could believe all of a given
resource type had been deleted from the namespace, when in fact some other rogue
client agent had created new objects. Using admission control in this scenario
means the registry implementations for the individual objects do not need
to take the Namespace life-cycle into account.
Once all objects known to the *namespace controller* have been deleted, the
*namespace controller* executes a *finalize* operation on the namespace that
removes the *kubernetes* value from the *Namespace.Spec.Finalizers* list.
If the *namespace controller* sees a *Namespace* whose
*ObjectMeta.DeletionTimestamp* is set, and whose *Namespace.Spec.Finalizers*
list is empty, it will signal the server to permanently remove the *Namespace*
from storage by sending a final DELETE action to the API server.
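The termination flow described above can be summarized as a decision function evaluated on each pass of the controller's sync loop. The sketch below is an assumption-laden illustration, not the controller's code: `Namespace` is flattened to the fields involved, and `Action`, `nextAction`, and `markDeleted` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

type NamespacePhase string

const (
	NamespaceActive      NamespacePhase = "Active"
	NamespaceTerminating NamespacePhase = "Terminating"
)

// Namespace is flattened to the fields the controller inspects.
type Namespace struct {
	DeletionTimestamp *time.Time // nil until a DELETE request is accepted
	Finalizers        []string
	Phase             NamespacePhase
}

// Action values are hypothetical labels for the steps described in the text.
type Action string

const (
	ActionNone              Action = "nothing to do"
	ActionMarkTerminating   Action = "set phase to Terminating"
	ActionDeleteContent     Action = "delete remaining content"
	ActionFinalize          Action = "finalize: remove kubernetes"
	ActionRemoveFromStorage Action = "send final DELETE"
)

// markDeleted (hypothetical) mimics the server stamping a deletion time.
func markDeleted(ns Namespace) Namespace {
	now := time.Now()
	ns.DeletionTimestamp = &now
	return ns
}

func hasFinalizer(ns Namespace, name string) bool {
	for _, f := range ns.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

// nextAction decides what one pass of the sync loop should do. remaining is
// the number of objects the controller still finds in the namespace.
func nextAction(ns Namespace, remaining int) Action {
	switch {
	case ns.DeletionTimestamp == nil:
		return ActionNone // still Active
	case ns.Phase != NamespaceTerminating:
		return ActionMarkTerminating
	case len(ns.Finalizers) == 0:
		return ActionRemoveFromStorage
	case !hasFinalizer(ns, "kubernetes"):
		return ActionNone // remaining finalizers belong to other agents
	case remaining > 0:
		return ActionDeleteContent
	default:
		return ActionFinalize
	}
}

func main() {
	ns := markDeleted(Namespace{
		Finalizers: []string{"openshift.com/origin", "kubernetes"},
		Phase:      NamespaceActive,
	})
	fmt.Println(nextAction(ns, 3))
	ns.Phase = NamespaceTerminating
	fmt.Println(nextAction(ns, 3))
	fmt.Println(nextAction(ns, 0))
	ns.Finalizers = nil
	fmt.Println(nextAction(ns, 0))
}
```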
### REST API
To interact with the Namespace API:
| Action | HTTP Verb | Path | Description |
| ------ | --------- | ---- | ----------- |
| CREATE | POST | /api/{version}/namespaces | Create a namespace |
| LIST | GET | /api/{version}/namespaces | List all namespaces |
| UPDATE | PUT | /api/{version}/namespaces/{namespace} | Update namespace {namespace} |
| DELETE | DELETE | /api/{version}/namespaces/{namespace} | Delete namespace {namespace} |
| FINALIZE | POST | /api/{version}/namespaces/{namespace}/finalize | Finalize namespace {namespace} |
| WATCH | GET | /api/{version}/watch/namespaces | Watch all namespaces |
This specification reserves the name *finalize* as a sub-resource to namespace.
As a consequence, it is invalid to have a *resourceType* managed by a namespace whose kind is *finalize*.
To interact with content associated with a Namespace:
| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/namespaces/{namespace}/{resourceType}/ | Create instance of {resourceType} in namespace {namespace} |
| GET | GET | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Get instance of {resourceType} in namespace {namespace} with {name} |
| UPDATE | PUT | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Update instance of {resourceType} in namespace {namespace} with {name} |
| DELETE | DELETE | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {namespace} with {name} |
| LIST | GET | /api/{version}/namespaces/{namespace}/{resourceType} | List instances of {resourceType} in namespace {namespace} |
| WATCH | GET | /api/{version}/watch/namespaces/{namespace}/{resourceType} | Watch for changes to a {resourceType} in namespace {namespace} |
| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |
The API server verifies the *Namespace* on resource creation matches the
*{namespace}* on the path.
The API server will associate a resource with a *Namespace* if not populated by
the end-user based on the *Namespace* context of the incoming request. If the
*Namespace* of the resource being created, or updated does not match the
*Namespace* on the request, then the API server will reject the request.
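As a rough sketch of the defaulting and validation rule just described (the function name `resolveNamespace` and the error value are hypothetical, not part of the API server):

```go
package main

import (
	"errors"
	"fmt"
)

var errNamespaceMismatch = errors.New("resource namespace does not match request namespace")

// resolveNamespace returns the namespace a resource should be stored under.
// requestNS comes from the request path; resourceNS is what the client put
// in the object, which may be empty.
func resolveNamespace(requestNS, resourceNS string) (string, error) {
	if resourceNS == "" {
		// Not populated by the end-user: take it from the request context.
		return requestNS, nil
	}
	if resourceNS != requestNS {
		return "", errNamespaceMismatch
	}
	return resourceNS, nil
}

func main() {
	fmt.Println(resolveNamespace("development", ""))
	fmt.Println(resolveNamespace("development", "development"))
	fmt.Println(resolveNamespace("development", "production"))
}
```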
### Storage
A namespace provides a unique identifier space and therefore must be in the
storage path of a resource.
In etcd, we want to continue to support efficient WATCH across namespaces.
Resources that persist content in etcd will have storage paths as follows:
/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}
This enables consumers to WATCH /registry/{resourceType} for changes across
namespaces for a particular {resourceType}.
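To illustrate why this layout keeps cross-namespace watches cheap, the sketch below builds keys in the order given above and shows that every key for a resource type shares one prefix. `storageKey` and `watchPrefix` are hypothetical helpers, and `registry` is assumed as the storage prefix only because the sentence above uses it.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// storageKey builds /{prefix}/{resourceType}/{namespace}/{name}.
func storageKey(prefix, resourceType, namespace, name string) string {
	return path.Join("/", prefix, resourceType, namespace, name)
}

// watchPrefix is the key prefix shared by every namespace's objects of one
// resource type. The trailing slash avoids matching types such as "podsX".
func watchPrefix(prefix, resourceType string) string {
	return path.Join("/", prefix, resourceType) + "/"
}

func main() {
	keys := []string{
		storageKey("registry", "pods", "development", "web"),
		storageKey("registry", "pods", "production", "web"),
		storageKey("registry", "services", "development", "web"),
	}
	p := watchPrefix("registry", "pods")
	for _, k := range keys {
		fmt.Println(k, strings.HasPrefix(k, p))
	}
}
```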
### Kubelet
The kubelet will register pods it sources from a file or HTTP source with a
namespace associated with the *cluster-id*.
### Example: OpenShift Origin managing a Kubernetes Namespace
In this example, we demonstrate how the design allows for agents built on-top of
Kubernetes that manage their own set of resource types associated with a
*Namespace* to take part in Namespace termination.
OpenShift creates a Namespace in Kubernetes:
```json
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "development",
    "labels": {
      "name": "development"
    }
  },
  "spec": {
    "finalizers": ["openshift.com/origin", "kubernetes"]
  },
  "status": {
    "phase": "Active"
  }
}
```
OpenShift then goes and creates a set of resources (pods, services, etc)
associated with the "development" namespace. It also creates its own set of
resources in its own storage associated with the "development" namespace unknown
to Kubernetes.
The user deletes the Namespace in Kubernetes, and the Namespace now has the following state:
```json
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "development",
    "deletionTimestamp": "...",
    "labels": {
      "name": "development"
    }
  },
  "spec": {
    "finalizers": ["openshift.com/origin", "kubernetes"]
  },
  "status": {
    "phase": "Terminating"
  }
}
```
The Kubernetes *namespace controller* observes the namespace has a
*deletionTimestamp* and begins to terminate all of the content in the namespace
that it knows about. Upon success, it executes a *finalize* action that modifies
the *Namespace* by removing *kubernetes* from the list of finalizers:
```json
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "development",
    "deletionTimestamp": "...",
    "labels": {
      "name": "development"
    }
  },
  "spec": {
    "finalizers": ["openshift.com/origin"]
  },
  "status": {
    "phase": "Terminating"
  }
}
```
OpenShift Origin has its own *namespace controller* that is observing cluster
state, and it observes the same namespace had a *deletionTimestamp* assigned to
it. It too will go and purge resources from its own storage that it manages
associated with that namespace. Upon completion, it executes a *finalize* action
and removes the reference to "openshift.com/origin" from the list of finalizers.
This results in the following state:
```json
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "name": "development",
    "deletionTimestamp": "...",
    "labels": {
      "name": "development"
    }
  },
  "spec": {
    "finalizers": []
  },
  "status": {
    "phase": "Terminating"
  }
}
```
At this point, the Kubernetes *namespace controller* in its sync loop will see
that the namespace has a deletion timestamp and that its list of finalizers is
empty. As a result, it knows all content associated with that namespace has been
purged. It performs a final DELETE action to remove that Namespace from
storage.
At this point, all content associated with that Namespace, and the Namespace
itself are gone.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/namespaces.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/namespaces.md)

View File

@ -1,190 +1 @@
# Networking
There are 4 distinct networking problems to solve:
1. Highly-coupled container-to-container communications
2. Pod-to-Pod communications
3. Pod-to-Service communications
4. External-to-internal communications
## Model and motivation
Kubernetes deviates from the default Docker networking model (though as of
Docker 1.8 their network plugins are getting closer). The goal is for each pod
to have an IP in a flat shared networking namespace that has full communication
with other physical computers and containers across the network. IP-per-pod
creates a clean, backward-compatible model where pods can be treated much like
VMs or physical hosts from the perspectives of port allocation, networking,
naming, service discovery, load balancing, application configuration, and
migration.
Dynamic port allocation, on the other hand, requires supporting both static
ports (e.g., for externally accessible services) and dynamically allocated
ports, requires partitioning centrally allocated and locally acquired dynamic
ports, complicates scheduling (since ports are a scarce resource), is
inconvenient for users, complicates application configuration, is plagued by
port conflicts and reuse and exhaustion, requires non-standard approaches to
naming (e.g. consul or etcd rather than DNS), requires proxies and/or
redirection for programs using standard naming/addressing mechanisms (e.g. web
browsers), requires watching and cache invalidation for address/port changes
for instances in addition to watching group membership changes, and obstructs
container/pod migration (e.g. using CRIU). NAT introduces additional complexity
by fragmenting the addressing space, which breaks self-registration mechanisms,
among other problems.
## Container to container
All containers within a pod behave as if they are on the same host with regard
to networking. They can all reach each other's ports on localhost. This offers
simplicity (static ports known a priori), security (ports bound to localhost
are visible within the pod but never outside it), and performance. This also
reduces friction for applications moving from the world of uncontainerized apps
on physical or virtual hosts. People running application stacks together on
the same host have already figured out how to make ports not conflict and have
arranged for clients to find them.
The approach does reduce isolation between containers within a pod &mdash;
ports could conflict, and there can be no container-private ports, but these
seem to be relatively minor issues with plausible future workarounds. Besides,
the premise of pods is that containers within a pod share some resources
(volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation.
Additionally, the user can control what containers belong to the same pod
whereas, in general, they don't control what pods land together on a host.
## Pod to pod
Because every pod gets a "real" (not machine-private) IP address, pods can
communicate without proxies or translations. The pod can use well-known port
numbers and can avoid the use of higher-level service discovery systems like
DNS-SD, Consul, or Etcd.
When any container calls ioctl(SIOCGIFADDR) (get the address of an interface),
it sees the same IP that any peer container would see them coming from &mdash;
each pod has its own IP address that other pods can know. By making IP addresses
and ports the same both inside and outside the pods, we create a NAT-less, flat
address space. Running "ip addr show" should work as expected. This would enable
all existing naming/discovery mechanisms to work out of the box, including
self-registration mechanisms and applications that distribute IP addresses. We
should be optimizing for inter-pod network communication. Within a pod,
containers are more likely to use communication through volumes (e.g., tmpfs) or
IPC.
This is different from the standard Docker model. In that model, each container
gets an IP in the 172-dot space and would only see that 172-dot address from
SIOCGIFADDR. If these containers connect to another container, the peer would see
the connection coming from a different IP than the container itself knows. In
short, you can never self-register anything from a container, because a
container cannot be reached on its private IP.
An alternative we considered was an additional layer of addressing: pod-centric
IP per container. Each container would have its own local IP address, visible
only within that pod. This would perhaps make it easier for containerized
applications to move from physical/virtual hosts to pods, but would be more
complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS)
and to reason about, due to the additional layer of address translation, and
would break self-registration and IP distribution mechanisms.
Like Docker, ports can still be published to the host node's interface(s), but
the need for this is radically diminished.
## Implementation
For the Google Compute Engine cluster configuration scripts, we use [advanced
routing rules](https://developers.google.com/compute/docs/networking#routing)
and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that
get routed to it. This is in addition to the 'main' IP address assigned to the
VM that is NAT-ed for Internet access. The container bridge (called `cbr0` to
differentiate it from `docker0`) is set up outside of Docker proper.
Example of GCE's advanced routing rules:
```sh
gcloud compute routes add "${NODE_NAMES[$i]}" \
  --project "${PROJECT}" \
  --destination-range "${NODE_IP_RANGES[$i]}" \
  --network "${NETWORK}" \
  --next-hop-instance "${NODE_NAMES[$i]}" \
  --next-hop-instance-zone "${ZONE}" &
```
GCE itself does not know anything about these IPs, though. This means that when
a pod tries to egress beyond GCE's project the packets must be SNAT'ed
(masqueraded) to the VM's IP, which GCE recognizes and allows.
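To make "an extra 256 IP addresses per VM" concrete, the sketch below carves the i-th /24 out of a larger cluster range, one per node. It is only an illustration: the `10.244.0.0/16` range and the helper name `nodeRange` are assumptions made here, and the GCE scripts populate `NODE_IP_RANGES` in their own way.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// nodeRange returns the i-th /24 (256 addresses) inside clusterCIDR.
func nodeRange(clusterCIDR string, i int) (*net.IPNet, error) {
	_, cluster, err := net.ParseCIDR(clusterCIDR)
	if err != nil {
		return nil, err
	}
	base := cluster.IP.To4()
	if base == nil {
		return nil, fmt.Errorf("not an IPv4 range: %s", clusterCIDR)
	}
	ones, _ := cluster.Mask.Size()
	if ones > 24 {
		return nil, fmt.Errorf("range %s is smaller than a /24", clusterCIDR)
	}
	if i < 0 || i >= 1<<uint(24-ones) {
		return nil, fmt.Errorf("node index %d does not fit in %s", i, clusterCIDR)
	}
	// Each node's block starts 256 addresses after the previous one.
	start := binary.BigEndian.Uint32(base) + uint32(i)<<8
	ip := make(net.IP, 4)
	binary.BigEndian.PutUint32(ip, start)
	return &net.IPNet{IP: ip, Mask: net.CIDRMask(24, 32)}, nil
}

func main() {
	for i := 0; i < 3; i++ {
		r, err := nodeRange("10.244.0.0/16", i)
		if err != nil {
			panic(err)
		}
		fmt.Printf("node %d routes %s\n", i, r)
	}
}
```

Each range produced this way would be the `--destination-range` of one route pointing at that node.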
### Other implementations
With the primary aim of providing the IP-per-pod model, other implementations exist
to serve the purpose outside of GCE.
- [OpenVSwitch with GRE/VxLAN](../admin/ovs-networking.md)
- [Flannel](https://github.com/coreos/flannel#flannel)
- [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/)
("With Linux Bridge devices" section)
- [Weave](https://github.com/zettio/weave) is yet another way to build an
overlay network, primarily aiming at Docker integration.
- [Calico](https://github.com/Metaswitch/calico) uses BGP to enable real
container IPs.
## Pod to service
The [service](../user-guide/services.md) abstraction provides a way to group pods under a
common access policy (e.g. load-balanced). The implementation of this creates a
virtual IP which clients can access and which is transparently proxied to the
pods in a Service. Each node runs a kube-proxy process which programs
`iptables` rules to trap access to service IPs and redirect them to the correct
backends. This provides a highly-available load-balancing solution with low
performance overhead by balancing client traffic from a node on that same node.
## External to internal
So far the discussion has been about how to access a pod or service from within
the cluster. Accessing a pod from outside the cluster is a bit more tricky. We
want to offer highly-available, high-performance load balancing to target
Kubernetes Services. Most public cloud providers are simply not flexible enough
yet.
The way this is generally implemented is to set up external load balancers (e.g.
GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When
traffic arrives at a node it is recognized as being part of a particular Service
and routed to an appropriate backend Pod. This does mean that some traffic will
get double-bounced on the network. Once cloud providers have better offerings
we can take advantage of those.
## Challenges and future work
### Docker API
Right now, docker inspect doesn't show the networking configuration of the
containers, since they derive it from another container. That information should
be exposed somehow.
### External IP assignment
We want to be able to assign IP addresses externally from Docker
[#6743](https://github.com/dotcloud/docker/issues/6743) so that we don't need
to statically allocate fixed-size IP ranges to each node, so that IP addresses
can be made stable across pod infra container restarts
([#2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate
pod migration. Right now, if the pod infra container dies, all the user
containers must be stopped and restarted because the netns of the pod infra
container will change on restart, and any subsequent user container restart
will join that new netns, thereby not being able to see its peers.
Additionally, a change in IP address would encounter DNS caching/TTL problems.
External IP assignment would also simplify DNS support (see below).
### IPv6
IPv6 support would be nice but requires significant internal changes in a few
areas. First, pods should be able to report multiple IP addresses
[Kubernetes issue #27398](https://github.com/kubernetes/kubernetes/issues/27398)
and the network plugin architecture Kubernetes uses needs to allow returning
IPv6 addresses too [CNI issue #245](https://github.com/containernetworking/cni/issues/245).
Kubernetes code that deals with IP addresses must then be audited and fixed to
support both IPv4 and IPv6 addresses and not assume IPv4.
Additionally, direct IPv6 assignment to instances doesn't appear to be supported
by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull
requests from people running Kubernetes on bare metal, though. :-)
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/networking.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/networking.md)

View File

@ -1,246 +1 @@
# Node affinity and NodeSelector
## Introduction
This document proposes a new label selector representation, called
`NodeSelector`, that is similar in many ways to `LabelSelector`, but is a bit
more flexible and is intended to be used only for selecting nodes.
In addition, we propose to replace the `map[string]string` in `PodSpec` that the
scheduler currently uses as part of restricting the set of nodes onto which a
pod is eligible to schedule, with a field of type `Affinity` that contains one
or more affinity specifications. In this document we discuss `NodeAffinity`,
which contains one or more of the following:
* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be
represented by a `NodeSelector`, and thus generalizes the scheduling behavior of
the current `map[string]string` but still serves the purpose of restricting
the set of nodes onto which the pod can schedule. In addition, unlike the
behavior of the current `map[string]string`, when it becomes violated the system
will try to eventually evict the pod from its node.
* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is
identical to `RequiredDuringSchedulingRequiredDuringExecution` except that the
system may or may not try to eventually evict the pod from its node.
* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that
specifies which nodes are preferred for scheduling among those that meet all
scheduling requirements.
(In practice, as discussed later, we will actually *add* the `Affinity` field
rather than replacing `map[string]string`, due to backward compatibility
requirements.)
The affinity specifications described above allow a pod to request various
properties that are inherent to nodes, for example "run this pod on a node with
an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z."
([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
some of the properties that a node might publish as labels, which affinity
expressions can match against.) They do *not* allow a pod to request to schedule
(or not schedule) on a node based on what other pods are running on the node.
That feature is called "inter-pod topological affinity/anti-affinity" and is
described [here](https://github.com/kubernetes/kubernetes/pull/18265).
## API
### NodeSelector
```go
// A node selector represents the union of the results of one or more label queries
// over a set of nodes; that is, it represents the OR of the selectors represented
// by the nodeSelectorTerms.
type NodeSelector struct {
    // nodeSelectorTerms is a list of node selector terms. The terms are ORed.
    NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
}

// An empty node selector term matches all objects. A null node selector term
// matches no objects.
type NodeSelectorTerm struct {
    // matchExpressions is a list of node selector requirements. The requirements are ANDed.
    MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
}

// A node selector requirement is a selector that contains values, a key, and an operator
// that relates the key and values.
type NodeSelectorRequirement struct {
    // key is the label key that the selector applies to.
    Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
    // operator represents a key's relationship to a set of values.
    // Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
    Operator NodeSelectorOperator `json:"operator"`
    // values is an array of string values. If the operator is In or NotIn,
    // the values array must be non-empty. If the operator is Exists or DoesNotExist,
    // the values array must be empty. If the operator is Gt or Lt, the values
    // array must have a single element, which will be interpreted as an integer.
    // This array is replaced during a strategic merge patch.
    Values []string `json:"values,omitempty"`
}

// A node selector operator is the set of operators that can be used in
// a node selector requirement.
type NodeSelectorOperator string

const (
    NodeSelectorOpIn           NodeSelectorOperator = "In"
    NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
    NodeSelectorOpExists       NodeSelectorOperator = "Exists"
    NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
    NodeSelectorOpGt           NodeSelectorOperator = "Gt"
    NodeSelectorOpLt           NodeSelectorOperator = "Lt"
)
```
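The OR-of-ANDs semantics in the comments above can be sketched as a small evaluator over a node's labels. This is an illustration under stated assumptions, not the scheduler's code: the types are repeated without JSON tags, the function names are hypothetical, `NotIn` is assumed to match when the key is absent (as with `LabelSelector`), and `Gt`/`Lt` are assumed to require the key to be present with an integer value.

```go
package main

import (
	"fmt"
	"strconv"
)

type NodeSelectorOperator string

const (
	NodeSelectorOpIn           NodeSelectorOperator = "In"
	NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
	NodeSelectorOpExists       NodeSelectorOperator = "Exists"
	NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
	NodeSelectorOpGt           NodeSelectorOperator = "Gt"
	NodeSelectorOpLt           NodeSelectorOperator = "Lt"
)

type NodeSelectorRequirement struct {
	Key      string
	Operator NodeSelectorOperator
	Values   []string
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm
}

func contains(values []string, v string) bool {
	for _, x := range values {
		if x == v {
			return true
		}
	}
	return false
}

// requirementMatches evaluates one requirement against a node's labels.
func requirementMatches(r NodeSelectorRequirement, labels map[string]string) bool {
	val, present := labels[r.Key]
	switch r.Operator {
	case NodeSelectorOpIn:
		return present && contains(r.Values, val)
	case NodeSelectorOpNotIn:
		return !present || !contains(r.Values, val) // assumption: absent key matches
	case NodeSelectorOpExists:
		return present
	case NodeSelectorOpDoesNotExist:
		return !present
	case NodeSelectorOpGt, NodeSelectorOpLt:
		if !present || len(r.Values) != 1 {
			return false
		}
		have, err1 := strconv.Atoi(val)
		want, err2 := strconv.Atoi(r.Values[0])
		if err1 != nil || err2 != nil {
			return false
		}
		if r.Operator == NodeSelectorOpGt {
			return have > want
		}
		return have < want
	}
	return false
}

// termMatches ANDs the requirements; an empty term matches every node.
func termMatches(t NodeSelectorTerm, labels map[string]string) bool {
	for _, r := range t.MatchExpressions {
		if !requirementMatches(r, labels) {
			return false
		}
	}
	return true
}

// selectorMatches ORs the terms; a selector with no terms matches nothing.
func selectorMatches(s NodeSelector, labels map[string]string) bool {
	for _, t := range s.NodeSelectorTerms {
		if termMatches(t, labels) {
			return true
		}
	}
	return false
}

func main() {
	sel := NodeSelector{NodeSelectorTerms: []NodeSelectorTerm{
		{MatchExpressions: []NodeSelectorRequirement{
			{Key: "zone", Operator: NodeSelectorOpIn, Values: []string{"z1", "z2"}},
			{Key: "cores", Operator: NodeSelectorOpGt, Values: []string{"4"}},
		}},
		{MatchExpressions: []NodeSelectorRequirement{
			{Key: "dedicated", Operator: NodeSelectorOpExists},
		}},
	}}
	fmt.Println(selectorMatches(sel, map[string]string{"zone": "z1", "cores": "8"}))
	fmt.Println(selectorMatches(sel, map[string]string{"zone": "z3", "cores": "8"}))
}
```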
### NodeAffinity
We will add one field to `PodSpec`
```go
Affinity *Affinity `json:"affinity,omitempty"`
```
The `Affinity` type is defined as follows
```go
type Affinity struct {
    NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
}

type NodeAffinity struct {
    // If the affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a node label update),
    // the system will try to eventually evict the pod from its node.
    RequiredDuringSchedulingRequiredDuringExecution *NodeSelector `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
    // If the affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a node label update),
    // the system may or may not try to eventually evict the pod from its node.
    RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
    // The scheduler will prefer to schedule pods to nodes that satisfy
    // the affinity expressions specified by this field, but it may choose
    // a node that violates one or more of the expressions. The node that is
    // most preferred is the one with the greatest sum of weights, i.e.
    // for each node that meets all of the scheduling requirements (resource
    // request, RequiredDuringScheduling affinity expressions, etc.),
    // compute a sum by iterating through the elements of this field and adding
    // "weight" to the sum if the node matches the corresponding MatchExpressions; the
    // node(s) with the highest sum are the most preferred.
    PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

// An empty preferred scheduling term matches all objects with implicit weight 0
// (i.e. it's a no-op). A null preferred scheduling term matches no objects.
type PreferredSchedulingTerm struct {
    // weight is in the range 1-100
    Weight int `json:"weight"`
    // matchExpressions is a list of node selector requirements. The requirements are ANDed.
    MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
}
```
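The weight-summing rule described in the `PreferredDuringSchedulingIgnoredDuringExecution` comment could look roughly like the sketch below. It is a simplified illustration rather than the scheduler's priority function: the operator is a plain string, `Gt` and `Lt` are omitted for brevity, the nodes are assumed to have already passed every required predicate, and `score` and `mostPreferred` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

type NodeSelectorRequirement struct {
	Key      string
	Operator string // "In", "NotIn", "Exists" or "DoesNotExist" in this sketch
	Values   []string
}

type PreferredSchedulingTerm struct {
	Weight           int
	MatchExpressions []NodeSelectorRequirement
}

func matches(r NodeSelectorRequirement, labels map[string]string) bool {
	val, present := labels[r.Key]
	in := false
	for _, v := range r.Values {
		if present && v == val {
			in = true
		}
	}
	switch r.Operator {
	case "In":
		return in
	case "NotIn":
		return !in
	case "Exists":
		return present
	case "DoesNotExist":
		return !present
	}
	return false
}

// score adds up the weights of all terms whose requirements the node satisfies.
func score(terms []PreferredSchedulingTerm, labels map[string]string) int {
	sum := 0
	for _, t := range terms {
		if len(t.MatchExpressions) == 0 {
			continue // empty term is a no-op (implicit weight 0)
		}
		all := true
		for _, r := range t.MatchExpressions {
			if !matches(r, labels) {
				all = false
				break
			}
		}
		if all {
			sum += t.Weight
		}
	}
	return sum
}

// mostPreferred returns the names of the nodes with the highest sum, sorted.
func mostPreferred(terms []PreferredSchedulingTerm, nodes map[string]map[string]string) []string {
	best, names := -1, []string{}
	for name, labels := range nodes {
		s := score(terms, labels)
		if s > best {
			best, names = s, []string{name}
		} else if s == best {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	terms := []PreferredSchedulingTerm{
		{Weight: 80, MatchExpressions: []NodeSelectorRequirement{{Key: "zone", Operator: "In", Values: []string{"z1"}}}},
		{Weight: 20, MatchExpressions: []NodeSelectorRequirement{{Key: "disk", Operator: "In", Values: []string{"ssd"}}}},
	}
	nodes := map[string]map[string]string{
		"a": {"zone": "z1"},
		"b": {"zone": "z2", "disk": "ssd"},
		"c": {"zone": "z1", "disk": "ssd"},
	}
	fmt.Println(mostPreferred(terms, nodes))
}
```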
Unfortunately, the name of the existing `map[string]string` field in PodSpec is
`NodeSelector` and we can't change it since this name is part of the API.
Hopefully this won't cause too much confusion.
## Examples
** TODO: fill in this section **
* Run this pod on a node with an Intel or AMD CPU
* Run this pod on a node in availability zone Z
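As a sketch only, the two cases above might be expressed with the proposed types as follows. The label keys `example.com/cpu-vendor` and `example.com/zone` and their values are hypothetical, since this document does not fix which labels nodes publish. The helper `required` is also an illustration-only name, and the types are trimmed to the fields used here, with the operator as a plain string.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type NodeSelectorRequirement struct {
	Key      string   `json:"key"`
	Operator string   `json:"operator"`
	Values   []string `json:"values,omitempty"`
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
}

type NodeAffinity struct {
	RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type Affinity struct {
	NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
}

// required (hypothetical helper) wraps the requirements in a single term of a
// RequiredDuringSchedulingIgnoredDuringExecution selector.
func required(reqs ...NodeSelectorRequirement) Affinity {
	return Affinity{NodeAffinity: &NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &NodeSelector{
			NodeSelectorTerms: []NodeSelectorTerm{{MatchExpressions: reqs}},
		},
	}}
}

func main() {
	// "Run this pod on a node with an Intel or AMD CPU"
	cpu := required(NodeSelectorRequirement{
		Key: "example.com/cpu-vendor", Operator: "In", Values: []string{"intel", "amd"},
	})
	// "Run this pod on a node in availability zone Z"
	zone := required(NodeSelectorRequirement{
		Key: "example.com/zone", Operator: "In", Values: []string{"Z"},
	})
	for _, a := range []Affinity{cpu, zone} {
		out, err := json.MarshalIndent(a, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```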
## Backward compatibility
When we add `Affinity` to PodSpec, we will deprecate, but not remove, the
current field in PodSpec
```go
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
```
Old versions of the scheduler will ignore the `Affinity` field. New versions of
the scheduler will apply their scheduling predicates to both `Affinity` and
`nodeSelector`, i.e. the pod can only schedule onto nodes that satisfy both sets
of requirements. We will not attempt to convert between `Affinity` and
`nodeSelector`.
Old versions of non-scheduling clients will not know how to do anything
semantically meaningful with `Affinity`, but we don't expect that this will
cause a problem.
See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
for more discussion.
Users should not start using `NodeAffinity` until the full implementation has
been in Kubelet and the master for enough binary versions that we feel
comfortable that we will not need to roll back either Kubelet or master to a
version that does not support them. Longer-term we will use a programmatic
approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
## Implementation plan
1. Add the `Affinity` field to PodSpec and the `NodeAffinity`,
`PreferredDuringSchedulingIgnoredDuringExecution`, and
`RequiredDuringSchedulingIgnoredDuringExecution` types to the API.
2. Implement a scheduler predicate that takes
`RequiredDuringSchedulingIgnoredDuringExecution` into account.
3. Implement a scheduler priority function that takes
`PreferredDuringSchedulingIgnoredDuringExecution` into account.
4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be
marked as deprecated.
5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API.
6. Modify the scheduler predicate from step 2 to also take
`RequiredDuringSchedulingRequiredDuringExecution` into account.
7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission
decision.
8. Implement code in Kubelet *or* the controllers that evicts a pod that no
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
We assume Kubelet publishes labels describing the node's membership in all of
the relevant scheduling domains (e.g. node name, rack name, availability zone
name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
## Extensibility
The design described here is the result of careful analysis of use cases, a
decade of experience with Borg at Google, and a review of similar features in
other open-source container orchestration systems. We believe that it properly
balances the goal of expressiveness against the goals of simplicity and
efficiency of implementation. However, we recognize that use cases may arise in
the future that cannot be expressed using the syntax described here. Although we
are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
for Kubernetes users to get a consistent experience, etc.), the regular
Kubernetes annotation mechanism can be used to add or replace affinity rules.
The way this would work is:
1. Define one or more annotations to describe the new affinity rule(s)
1. User (or an admission controller) attaches the annotation(s) to pods to
request the desired scheduling behavior. If the new rule(s) *replace* one or
more fields of `Affinity` then the user would omit those fields from `Affinity`;
if they are *additional rules*, then the user would fill in `Affinity` as well
as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
If some particular new syntax becomes popular, we would consider upstreaming it
by integrating it into the standard `Affinity`.
## Future work
Are there any other fields we should convert from `map[string]string` to
`NodeSelector`?
## Related issues
The review for this proposal is in [#18261](https://github.com/kubernetes/kubernetes/issues/18261).
The main related issue is [#341](https://github.com/kubernetes/kubernetes/issues/341).
Issue [#367](https://github.com/kubernetes/kubernetes/issues/367) is also related.
Those issues reference other related issues.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/nodeaffinity.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/nodeaffinity.md)

View File

@ -1,292 +1 @@
# Persistent Storage
This document proposes a model for managing persistent, cluster-scoped storage
for applications requiring long lived data.
### Abstract
Two new API kinds:
A `PersistentVolume` (PV) is a storage resource provisioned by an administrator.
It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/)
for how to use it.
A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to
use in a pod. It is analogous to a pod.
One new system component:
`PersistentVolumeClaimBinder` is a singleton running in master that watches all
PersistentVolumeClaims in the system and binds them to the closest matching
available PersistentVolume. The volume manager watches the API for newly created
volumes to manage.
One new volume:
`PersistentVolumeClaimVolumeSource` references the user's PVC in the same
namespace. This volume finds the bound PV and mounts that volume for the pod. A
`PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another
type of volume that is owned by someone else (the system).
Kubernetes makes no guarantees at runtime that the underlying storage exists or
is available. High availability is left to the storage provider.
### Goals
* Allow administrators to describe available storage.
* Allow pod authors to discover and request persistent volumes to use with pods.
* Enforce security through access control lists and securing storage to the same
namespace as the pod volume.
* Enforce quotas through admission control.
* Enforce scheduler rules by resource counting.
* Ensure developers can rely on storage being available without being closely
bound to a particular disk, server, network, or storage device.
#### Describe available storage
Cluster administrators use the API to manage *PersistentVolumes*. A custom store
`NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by
storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for
storage and binds them to an available volume by matching the volume's
characteristics (AccessModes and storage size) to the user's request.
PVs are system objects and, thus, have no namespace.
Many means of dynamic provisioning will eventually be implemented for various
storage types.
##### PersistentVolume API
| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume |
| GET | GET | /api/{version}/persistentvolumes/{name} | Get instance of PersistentVolume with {name} |
| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume with {name} |
| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume with {name} |
| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume |
| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume |
#### Request Storage
Kubernetes users request persistent storage for their pod by creating a
```PersistentVolumeClaim```. Their request for storage is described by their
requirements for resources and mount capabilities.
Requests for volumes are bound to available volumes by the volume manager, if a
suitable match is found. Requests for resources can go unfulfilled.
Users attach their claim to their pod using a new
```PersistentVolumeClaimVolumeSource``` volume source.
##### PersistentVolumeClaim API
| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/namespaces/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} |
| GET | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} |
| UPDATE | PUT | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} |
| DELETE | DELETE | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} |
| LIST | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} |
| WATCH | GET | /api/{version}/watch/namespaces/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} |
#### Scheduling constraints
Scheduling constraints are to be handled similarly to pod resource constraints.
Pods will need to be annotated or decorated with the number of resources they
require on a node. Similarly, a node will need to list how many it has used or
has available.
TBD
#### Events
The implementation of persistent storage will not require events to communicate
to the user the state of their claim. The CLI for bound claims contains a
reference to the backing persistent volume. This is always present in the API
and CLI, making an event to communicate the same unnecessary.
Events that communicate the state of a mounted volume are left to the volume
plugins.
### Example
#### Admin provisions storage
An administrator provisions storage by posting PVs to the API. Various ways to
automate this task can be scripted. Dynamic provisioning is a future feature
that can maintain levels of PVs.
```yaml
POST:

kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv0001
spec:
  capacity:
    storage: 10
  persistentDisk:
    pdName: "abc123"
    fsType: "ext4"
```
```console
$ kubectl get pv
NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
pv0001 map[] 10737418240 RWO Pending
```
#### Users request storage
A user requests storage by posting a PVC to the API. Their request contains the
AccessModes they wish their volume to have and the minimum size needed.
The user must be within a namespace to create PVCs.
```yaml
POST:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim-1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3
```
```console
$ kubectl get pvc
NAME LABELS STATUS VOLUME
myclaim-1 map[] pending
```
#### Matching and binding
The ```PersistentVolumeClaimBinder``` attempts to find an available volume that
most closely matches the user's request. If one exists, the volume and the claim
are bound by placing a reference to the PVC on the PV. Requests can go
unfulfilled if a suitable match is not found.
```console
$ kubectl get pv
NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e
$ kubectl get pvc
NAME LABELS STATUS VOLUME
myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e
```
A claim must request access modes and storage capacity. This is because internally PVs are
indexed by their `AccessModes`, and target PVs are, to some degree, sorted by their capacity.
A claim may request one or more of the following attributes to better match a PV: volume name, selectors,
and volume class (currently implemented as an annotation).
A PV may define a `ClaimRef` which can greatly influence (but does not absolutely guarantee) which
PVC it will match.
A PV may also define labels, annotations, and a volume class (currently implemented as an
annotation) to better target PVCs.
As of Kubernetes version 1.4, the following algorithm describes in more detail how a claim is
matched to a PV:
1. Only PVs with `accessModes` equal to or greater than the claim's requested `accessModes` are considered.
"Greater" here means that the PV has defined more modes than needed by the claim, but it also defines
the mode requested by the claim.
1. The potential PVs above are considered in order of the closest access mode match, with the best case
being an exact match, and a worse case being more modes than requested by the claim.
1. Each PV above is processed. If the PV has a `claimRef` matching the claim, *and* the PV's capacity
is not less than the storage being requested by the claim then this PV will bind to the claim. Done.
1. Otherwise, if the PV has the "volume.alpha.kubernetes.io/storage-class" annotation defined then it is
skipped and will be handled by Dynamic Provisioning.
1. Otherwise, if the PV has a `claimRef` defined, which can specify a different claim or simply be a
placeholder, then the PV is skipped.
1. Otherwise, if the claim is using a selector but it does *not* match the PV's labels (if any) then the
PV is skipped. But, even if a claim has selectors which match a PV that does not guarantee a match
since capacities may differ.
1. Otherwise, if the PV's "volume.beta.kubernetes.io/storage-class" annotation (which is a placeholder
for a volume class) does *not* match the claim's annotation (same placeholder) then the PV is skipped.
If the annotations for the PV and PVC are empty they are treated as being equal.
1. Otherwise, what remains is a list of PVs that may match the claim. Within this list of remaining PVs,
the PV with the smallest capacity that is also equal to or greater than the claim's requested storage
is the matching PV and will be bound to the claim. Done. In the case of two or more PVs matching all
of the above criteria, the first PV (remember the PV order is based on `accessModes`) is the winner.
*Note:* if no PV matches the claim and the claim defines a `StorageClass` (or a default
`StorageClass` has been defined) then a volume will be dynamically provisioned.
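The matching steps above can be sketched in Go. This is a minimal illustration, not the binder's actual code: the `PV` and `PVC` structs, the `CapacityGiB` and `RequestGiB` fields, and `findMatch` are hypothetical stand-ins, the selector is reduced to equality matching, and the alpha storage-class annotation step is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// PV and PVC are simplified, hypothetical stand-ins for the API objects.
type PV struct {
	Name        string
	AccessModes []string
	CapacityGiB int64
	ClaimRef    string // name of a pre-bound claim; empty means unclaimed
	Class       string
	Labels      map[string]string
}

type PVC struct {
	Name        string
	AccessModes []string
	RequestGiB  int64
	Class       string
	Selector    map[string]string
}

// hasAllModes reports whether have contains every mode in want.
func hasAllModes(have, want []string) bool {
	set := make(map[string]bool, len(have))
	for _, m := range have {
		set[m] = true
	}
	for _, m := range want {
		if !set[m] {
			return false
		}
	}
	return true
}

// labelsMatch reports whether labels carry every key/value in selector.
func labelsMatch(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// findMatch returns the PV chosen for claim, or nil if none qualifies.
func findMatch(pvs []PV, claim PVC) *PV {
	// Keep only PVs that offer at least the requested access modes.
	var candidates []PV
	for _, pv := range pvs {
		if hasAllModes(pv.AccessModes, claim.AccessModes) {
			candidates = append(candidates, pv)
		}
	}
	// Closest access-mode match first: fewer extra modes sort earlier.
	sort.SliceStable(candidates, func(i, j int) bool {
		return len(candidates[i].AccessModes) < len(candidates[j].AccessModes)
	})
	var best *PV
	for i := range candidates {
		pv := &candidates[i]
		if pv.ClaimRef != "" {
			// A PV pre-bound to this claim wins if it is large enough.
			if pv.ClaimRef == claim.Name && pv.CapacityGiB >= claim.RequestGiB {
				return pv
			}
			continue // reserved for another claim, or too small
		}
		if !labelsMatch(pv.Labels, claim.Selector) || pv.Class != claim.Class {
			continue
		}
		if pv.CapacityGiB < claim.RequestGiB {
			continue
		}
		// Smallest sufficient capacity wins; ties keep the earlier PV.
		if best == nil || pv.CapacityGiB < best.CapacityGiB {
			best = pv
		}
	}
	return best
}

func main() {
	pvs := []PV{
		{Name: "big", AccessModes: []string{"ReadWriteOnce"}, CapacityGiB: 10},
		{Name: "small", AccessModes: []string{"ReadWriteOnce"}, CapacityGiB: 5},
	}
	claim := PVC{Name: "myclaim-1", AccessModes: []string{"ReadWriteOnce"}, RequestGiB: 3}
	if pv := findMatch(pvs, claim); pv != nil {
		fmt.Println("bound to", pv.Name)
	}
}
```

Empty `Class` values on both sides compare equal here, mirroring the rule that empty annotations on the PV and PVC are treated as matching.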
#### Claim usage
The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim
and mount its volume for a pod.
The claim holder owns the claim and its data for as long as the claim exists.
The pod using the claim can be deleted, but the claim remains in the user's
namespace. It can be used again and again by many pods.
```yaml
POST:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
    - image: nginx
      name: myfrontend
      volumeMounts:
        - mountPath: "/var/www/html"
          name: mypd
  volumes:
    - name: mypd
      source:
        persistentVolumeClaim:
          accessMode: ReadWriteOnce
          claimRef:
            name: myclaim-1
```
#### Releasing a claim and Recycling a volume
When a claim holder is finished with their data, they can delete their claim.
```console
$ kubectl delete pvc myclaim-1
```
The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim
reference from the PV and changing the PV's status to 'Released'.
Admins can script the recycling of released volumes. Future dynamic provisioners
will understand how a volume should be recycled.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/persistent-storage.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/persistent-storage.md)


@ -1,673 +1 @@
# Inter-pod topological affinity and anti-affinity
## Introduction
NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.
This document describes a proposal for specifying and implementing inter-pod
topological affinity and anti-affinity. By that we mean: rules that specify that
certain pods should be placed in the same topological domain (e.g. same node,
same rack, same zone, same power domain, etc.) as some other pods, or,
conversely, should *not* be placed in the same topological domain as some other
pods.
Here are a few example rules; we explain how to express them using the API
described in this doc later, in the section "Examples."
* Affinity
* Co-locate the pods from a particular service or Job in the same availability
zone, without specifying which zone that should be.
* Co-locate the pods from service S1 with pods from service S2 because S1 uses
S2 and thus it is useful to minimize the network latency between them.
Co-location might mean same nodes and/or same availability zone.
* Anti-affinity
* Spread the pods of a service across nodes and/or availability zones, e.g. to
reduce correlated failures.
* Give a pod "exclusive" access to a node to guarantee resource isolation --
it must never share the node with other pods.
* Don't schedule the pods of a particular service on the same nodes as pods of
another service that are known to interfere with the performance of the pods of
the first service.
For both affinity and anti-affinity, there are three variants. Two variants have
the property of requiring the affinity/anti-affinity to be satisfied for the pod
to be allowed to schedule onto a node; the difference between them is that if
the condition ceases to be met later on at runtime, for one of them the system
will try to eventually evict the pod, while for the other the system may not try
to do so. The third variant simply provides scheduling-time *hints* that the
scheduler will try to satisfy but may not be able to. These three variants are
directly analogous to the three variants of [node affinity](nodeaffinity.md).
Note that this proposal is only about *inter-pod* topological affinity and
anti-affinity. There are other forms of topological affinity and anti-affinity.
For example, you can use [node affinity](nodeaffinity.md) to require (prefer)
that a set of pods all be scheduled in some specific zone Z. Node affinity is
not capable of expressing inter-pod dependencies, and conversely the API we
describe in this document is not capable of expressing node affinity rules. For
simplicity, we will use the terms "affinity" and "anti-affinity" to mean
"inter-pod topological affinity" and "inter-pod topological anti-affinity,"
respectively, in the remainder of this document.
## API
We will add one field to `PodSpec`
```go
Affinity *Affinity `json:"affinity,omitempty"`
```
The `Affinity` type is defined as follows
```go
type Affinity struct {
PodAffinity *PodAffinity `json:"podAffinity,omitempty"`
PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"`
}
type PodAffinity struct {
// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system will try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system may or may not try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
// The scheduler will prefer to schedule pods to nodes that satisfy
// the affinity expressions specified by this field, but it may choose
// a node that violates one or more of the expressions. The node that is
// most preferred is the one with the greatest sum of weights, i.e.
// for each node that meets all of the scheduling requirements (resource
// request, RequiredDuringScheduling affinity expressions, etc.),
// compute a sum by iterating through the elements of this field and adding
// "weight" to the sum if the node matches the corresponding MatchExpressions; the
// node(s) with the highest sum are the most preferred.
PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}
type PodAntiAffinity struct {
// If the anti-affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the anti-affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system will try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
// If the anti-affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the anti-affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system may or may not try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
// The scheduler will prefer to schedule pods to nodes that satisfy
// the anti-affinity expressions specified by this field, but it may choose
// a node that violates one or more of the expressions. The node that is
// most preferred is the one with the greatest sum of weights, i.e.
// for each node that meets all of the scheduling requirements (resource
// request, RequiredDuringScheduling anti-affinity expressions, etc.),
// compute a sum by iterating through the elements of this field and adding
// "weight" to the sum if the node matches the corresponding MatchExpressions; the
// node(s) with the highest sum are the most preferred.
PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}
type WeightedPodAffinityTerm struct {
// weight is in the range 1-100
Weight int `json:"weight"`
PodAffinityTerm PodAffinityTerm `json:"podAffinityTerm"`
}
type PodAffinityTerm struct {
LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
// namespaces specifies which namespaces the LabelSelector applies to (matches against);
// nil list means "this pod's namespace," empty list means "all namespaces"
// The json tag here is not "omitempty" since we need to distinguish nil and empty.
// See https://golang.org/pkg/encoding/json/#Marshal for more details.
Namespaces []api.Namespace `json:"namespaces"`
// empty topology key is interpreted by the scheduler as "all topologies"
TopologyKey string `json:"topologyKey,omitempty"`
}
```
Note that the `Namespaces` field is necessary because normal `LabelSelector` is
scoped to the pod's namespace, but we need to be able to match against all pods
globally.
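The nil-versus-empty distinction for `Namespaces` depends on how `encoding/json` encodes slices. The sketch below uses two hypothetical, trimmed-down structs (`termOmitEmpty` and `termKeepEmpty`, with `[]string` standing in for the real element type) to show that `omitempty` drops both a nil and an empty slice, while a tag without it keeps them distinguishable.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// termOmitEmpty and termKeepEmpty are hypothetical types that differ only
// in the JSON tag on Namespaces.
type termOmitEmpty struct {
	Namespaces []string `json:"namespaces,omitempty"`
}

type termKeepEmpty struct {
	Namespaces []string `json:"namespaces"`
}

// encode marshals v to a JSON string; it panics on error for brevity.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// With omitempty, nil and empty slices both vanish from the output.
	fmt.Println(encode(termOmitEmpty{Namespaces: nil}))        // {}
	fmt.Println(encode(termOmitEmpty{Namespaces: []string{}})) // {}

	// Without omitempty, nil encodes as null and empty as [].
	fmt.Println(encode(termKeepEmpty{Namespaces: nil}))        // {"namespaces":null}
	fmt.Println(encode(termKeepEmpty{Namespaces: []string{}})) // {"namespaces":[]}
}
```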
To explain how this API works, let's say that the `PodSpec` of a pod `P` has an
`Affinity` that is configured as follows (note that we've omitted and collapsed
some fields for simplicity, but this should sufficiently convey the intent of
the design):
```go
PodAffinity {
RequiredDuringScheduling: {{LabelSelector: P1, TopologyKey: "node"}},
PreferredDuringScheduling: {{LabelSelector: P2, TopologyKey: "zone"}},
}
PodAntiAffinity {
RequiredDuringScheduling: {{LabelSelector: P3, TopologyKey: "rack"}},
PreferredDuringScheduling: {{LabelSelector: P4, TopologyKey: "power"}}
}
```
Then when scheduling pod P, the scheduler:
* Can only schedule P onto nodes that are running pods that satisfy `P1`.
(Assumes all nodes have a label with key `node` and value specifying their node
name.)
* Should try to schedule P onto zones that are running pods that satisfy `P2`.
(Assumes all nodes have a label with key `zone` and value specifying their
zone.)
* Cannot schedule P onto any racks that are running pods that satisfy `P3`.
(Assumes all nodes have a label with key `rack` and value specifying their rack
name.)
* Should try not to schedule P onto any power domains that are running pods that
satisfy `P4`. (Assumes all nodes have a label with key `power` and value
specifying their power domain.)
When `RequiredDuringScheduling` has multiple elements, the requirements are
ANDed. For `PreferredDuringScheduling` the weights are added for the terms that
are satisfied for each node, and the node(s) with the highest weight(s) are the
most preferred.
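The weight summation for `PreferredDuringScheduling` can be sketched as follows. The `weightedTerm` type with its `Satisfied` callback, and the `score` and `mostPreferred` helpers, are hypothetical simplifications for illustration; a real scheduler would evaluate each term against the pods running in the node's topology domain.

```go
package main

import "fmt"

// weightedTerm is a hypothetical, flattened stand-in for
// WeightedPodAffinityTerm: Satisfied reports whether a node meets the term.
type weightedTerm struct {
	Weight    int // expected to be in the range 1-100
	Satisfied func(node string) bool
}

// score sums the weights of all terms that the node satisfies.
func score(node string, terms []weightedTerm) int {
	sum := 0
	for _, t := range terms {
		if t.Satisfied(node) {
			sum += t.Weight
		}
	}
	return sum
}

// mostPreferred returns the node(s) with the highest score.
func mostPreferred(nodes []string, terms []weightedTerm) []string {
	var best []string
	bestScore := -1 // scores are never negative
	for _, n := range nodes {
		s := score(n, terms)
		switch {
		case s > bestScore:
			best, bestScore = []string{n}, s
		case s == bestScore:
			best = append(best, n)
		}
	}
	return best
}

func main() {
	zoneOf := map[string]string{"n1": "a", "n2": "b", "n3": "a"}
	terms := []weightedTerm{
		{Weight: 10, Satisfied: func(n string) bool { return zoneOf[n] == "a" }},
		{Weight: 5, Satisfied: func(n string) bool { return n != "n1" }},
	}
	fmt.Println(mostPreferred([]string{"n1", "n2", "n3"}, terms))
}
```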
In reality there are two variants of `RequiredDuringScheduling`: one suffixed
with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`.
For the first variant, if the affinity/anti-affinity ceases to be met at some
point during pod execution (e.g. due to a pod label update), the system will try
to eventually evict the pod from its node. In the second variant, the system may
or may not try to eventually evict the pod from its node.
## A comment on symmetry
One thing that makes affinity and anti-affinity tricky is symmetry.
Imagine a cluster that is running pods from two services, S1 and S2. Imagine
that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not
run me on nodes that are running pods from S2." It is not sufficient just to
check that there are no S2 pods on a node when you are scheduling a S1 pod. You
also need to ensure that there are no S1 pods on a node when you are scheduling
a S2 pod, *even though the S2 pod does not have any anti-affinity rules*.
Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's
RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving
S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling
anti-affinity rule, then:
* if a node is empty, you can schedule S1 or S2 onto the node
* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node
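A minimal Go sketch of this symmetric check, assuming a hypothetical `pod` type that records only its service and the services it refuses to share a node with. The point is that the scheduler must consult the rules of the pods already on the node as well as the rules of the incoming pod.

```go
package main

import "fmt"

// pod is a hypothetical minimal pod: the service it belongs to and the
// services it must not share a node with (hard anti-affinity).
type pod struct {
	Service string
	Avoid   []string
}

// avoids reports whether p has an anti-affinity rule against service.
func avoids(p pod, service string) bool {
	for _, s := range p.Avoid {
		if s == service {
			return true
		}
	}
	return false
}

// canSchedule reports whether incoming may join the pods already running
// on a node. The check runs in both directions to enforce symmetry.
func canSchedule(incoming pod, running []pod) bool {
	for _, r := range running {
		if avoids(incoming, r.Service) || avoids(r, incoming.Service) {
			return false
		}
	}
	return true
}

func main() {
	s1 := pod{Service: "S1", Avoid: []string{"S2"}}
	s2 := pod{Service: "S2"} // no rules of its own

	fmt.Println(canSchedule(s1, nil))       // empty node
	fmt.Println(canSchedule(s2, []pod{s1})) // blocked by S1's rule
}
```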
Note that while RequiredDuringScheduling anti-affinity is symmetric,
RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1
have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2," it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node. More specifically, if S1 has the
aforementioned RequiredDuringScheduling affinity rule, then:
* if a node is empty, you can schedule S2 onto the node
* if a node is empty, you cannot schedule S1 onto the node
* if a node is running S2, you can schedule S1 onto the node
* if a node is running S1+S2 and S1 terminates, S2 continues running
* if a node is running S1+S2 and S2 terminates, the system terminates S1
(eventually)
However, although RequiredDuringScheduling affinity is not symmetric, there is
an implicit PreferredDuringScheduling affinity rule corresponding to every
RequiredDuringScheduling affinity rule: if the pods of S1 have a
RequiredDuringScheduling affinity rule "run me on nodes that are running pods
from S2" then it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node, but it would be better if there are.
PreferredDuringScheduling is symmetric. If the pods of S1 had a
PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that
are running pods from S2" then we would prefer to keep a S1 pod that we are
scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that
we are scheduling off of nodes that are running S1 pods. Likewise if the pods of
S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that
are running pods from S2" then we would prefer to place a S1 pod that we are
scheduling onto a node that is running a S2 pod, and also to place a S2 pod that
we are scheduling onto a node that is running a S1 pod.
## Examples
Here are some examples of how you would express various affinity and
anti-affinity rules using the API we described.
### Affinity
In the examples below, the word "put" is intentionally ambiguous; the rules are
the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
put" (PreferredDuringScheduling)--all that changes is which field the rule goes
into. Also, we only discuss scheduling-time, and ignore the execution-time.
Finally, some of the examples use "zone" and some use "node," just to make the
examples more interesting; any of the examples with "zone" will also work for
"node" if you change the `TopologyKey`, and vice-versa.
* **Put the pod in zone Z**:
Tricked you! It is not possible to express this using the API described here. For
this you should use node affinity.
* **Put the pod in a zone that is running at least one pod from service S**:
`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`
* **Put the pod on a node that is already running a pod that requires a license
for software package P**: Assuming pods that require a license for software
package P have a label `{key=license, value=P}`:
`{LabelSelector: "license" In "P", TopologyKey: "node"}`
* **Put this pod in the same zone as other pods from its same service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
This last example illustrates a small issue with this API when it is used with a
scheduler that processes the pending queue one pod at a time, like the current
Kubernetes scheduler. The RequiredDuringScheduling rule
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
only "works" once one pod from service S has been scheduled. But if all pods in
service S have this RequiredDuringScheduling rule in their PodSpec, then the
RequiredDuringScheduling rule will block the first pod of the service from ever
scheduling, since it is only allowed to run in a zone with another pod from the
same service. And of course that means none of the pods of the service will be
able to schedule. This problem *only* applies to RequiredDuringScheduling
affinity, not PreferredDuringScheduling affinity or any variant of
anti-affinity. There are at least three ways to solve this problem:
* **short-term**: have the scheduler use a rule that if the
RequiredDuringScheduling affinity requirement matches a pod's own labels, and
there are no other such pods anywhere, then disregard the requirement. This
approach has a corner case when running parallel schedulers that are allowed to
schedule pods from the same replicated set (e.g. a single PodTemplate): both
schedulers may try to schedule pods from the set at the same time and think
there are no other pods from that set scheduled yet (e.g. they are trying to
schedule the first two pods from the set), but by the time the second binding is
committed, the first one has already been committed, leaving you with two pods
running that do not respect their RequiredDuringScheduling affinity. There is no
simple way to detect this "conflict" at scheduling time given the current system
implementation.
* **longer-term**: when a controller creates pods from a PodTemplate, for
exactly *one* of those pods, it should omit any RequiredDuringScheduling
affinity rules that select the pods of that PodTemplate.
* **very long-term/speculative**: controllers could present the scheduler with a
group of pods from the same PodTemplate as a single unit. This is similar to the
first approach described above but avoids the corner case. No special logic is
needed in the controllers. Moreover, this would allow the scheduler to do proper
[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since
it could receive an entire gang simultaneously as a single unit.
### Anti-affinity
As with the affinity examples, the examples here can be RequiredDuringScheduling
or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as
"must not" or as "try not to" depending on whether the rule appears in
`RequiredDuringScheduling` or `PreferredDuringScheduling`.
* **Spread the pods of this service S across nodes and zones**:
`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"},
{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
(note that if this is specified as a RequiredDuringScheduling anti-affinity,
then the first clause is redundant, since the second clause will force the
scheduler to not put more than one pod from S in the same zone, and thus by
definition it will not put more than one pod from S on the same node, assuming
each node is in one zone. This rule is more useful as PreferredDuringScheduling
anti-affinity, e.g. one might expect it to be common in
[Cluster Federation](../../docs/proposals/federation.md) clusters.)
* **Don't co-locate pods of this service with pods from service "evilService"**:
`{LabelSelector: <selector that matches evilService's pods>, TopologyKey: "node"}`
* **Don't co-locate pods of this service with any other pods including pods of this service**:
`{LabelSelector: empty, TopologyKey: "node"}`
* **Don't co-locate pods of this service with any other pods except other pods of this service**:
Assuming pods from the service have some label `{key=service, value=S}`:
`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
Note that this works because `"service" NotIn "S"` matches pods with no key
"service" as well as pods with key "service" and a corresponding value that is
not "S."
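The `NotIn` behaviour described in the last example can be illustrated with a small Go helper. The function `notIn` is a hypothetical sketch of the operator's semantics as stated here, not the label selector library's API.

```go
package main

import "fmt"

// notIn reports whether labels satisfy `key NotIn values`: true when the
// key is absent, or when it is present with a value outside the list.
func notIn(labels map[string]string, key string, values ...string) bool {
	v, ok := labels[key]
	if !ok {
		return true // pods without the key match
	}
	for _, x := range values {
		if v == x {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(notIn(map[string]string{"service": "S"}, "service", "S"))
	fmt.Println(notIn(map[string]string{"service": "T"}, "service", "S"))
	fmt.Println(notIn(map[string]string{"tier": "db"}, "service", "S"))
}
```

Pods of service S fail the selector, so they are exempt from the anti-affinity rule; every other pod, labelled or not, is matched and therefore kept off the node.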
## Algorithm
An example algorithm a scheduler might use to implement affinity and
anti-affinity rules is as follows. There are certainly more efficient ways to
do it; this is just intended to demonstrate that the API's semantics are
implementable.
Terminology definition: We say a pod P is "feasible" on a node N if P meets all
of the scheduler predicates for scheduling P onto N. Note that this algorithm is
only concerned about scheduling time, thus it makes no distinction between
RequiredDuringExecution and IgnoredDuringExecution.
To make the algorithm slightly more readable, we use the term "HardPodAffinity"
as shorthand for "RequiredDuringScheduling pod affinity" and
"SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity."
Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."
**TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity}
into account; currently it assumes all terms have weight 1.**
```
Z = the pod you are scheduling
{N} = the set of all nodes in the system  // this algorithm will reduce it to the set of all nodes feasible for Z

// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction
X = {Z's PodSpec's HardPodAffinity}
foreach element H of {X}
  P = {all pods in the system that match H.LabelSelector}
  M map[string]int  // topology value -> number of pods running on nodes with that topology value
  foreach pod Q of {P}
    L = {labels of the node on which Q is running, represented as a map from label key to label value}
    M[L[H.TopologyKey]]++
  {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]}

// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity
// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0
X = {Z's PodSpec's HardPodAntiAffinity}
foreach element H of {X}
  P = {all pods in the system that match H.LabelSelector}
  M map[string]int  // topology value -> number of pods running on nodes with that topology value
  foreach pod Q of {P}
    L = {labels of the node on which Q is running, represented as a map from label key to label value}
    M[L[H.TopologyKey]]++
  {N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]}

// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity
foreach node A of {N}
  foreach pod B that is bound to A
    if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N}

// At this point, all nodes in {N} are feasible for Z.

// Step 3a: Soft version of Step 1a
Y map[string]int  // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node
Initialize the keys of Y to all of the nodes in {N}, and the values to 0
X = {Z's PodSpec's SoftPodAffinity}
Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"

// Step 3b: Soft version of Step 1b
X = {Z's PodSpec's SoftPodAntiAffinity}
Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"

// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft
foreach node A of {N}
  foreach pod B that is bound to A
    increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A

// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is
// a map whose keys are the elements of {N} and whose values are how "good" of a choice N is for Z with
// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better).
```
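Steps 1a and 1b of the pseudocode can be sketched in Go as one filter with a flag for the anti-affinity case. All names here (`node`, `podInfo`, `term`, `domainCounts`, `filterNodes`) are hypothetical, and the label selector is reduced to equality matching; the symmetry and soft-preference steps are not covered.

```go
package main

import "fmt"

// node, podInfo and term are hypothetical, simplified types.
type node struct {
	Name   string
	Labels map[string]string
}

type podInfo struct {
	Labels map[string]string
	Node   string // name of the node the pod runs on
}

type term struct {
	Selector    map[string]string // equality-only selector (a simplification)
	TopologyKey string
}

// matches reports whether labels carry every key/value in selector.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// domainCounts builds M from the pseudocode: topology value -> number of
// pods matching the term that run on nodes with that topology value.
func domainCounts(t term, pods []podInfo, all map[string]node) map[string]int {
	m := make(map[string]int)
	for _, p := range pods {
		if !matches(p.Labels, t.Selector) {
			continue
		}
		if v, ok := all[p.Node].Labels[t.TopologyKey]; ok {
			m[v]++
		}
	}
	return m
}

// filterNodes keeps the candidates that satisfy every term. For affinity
// (anti == false) a node's domain needs M[K] > 0; for anti-affinity
// (anti == true) it needs M[K] == 0.
func filterNodes(candidates []node, terms []term, pods []podInfo, all map[string]node, anti bool) []node {
	for _, t := range terms {
		m := domainCounts(t, pods, all)
		var kept []node
		for _, n := range candidates {
			v, ok := n.Labels[t.TopologyKey]
			if !ok {
				continue // node lacks the topology label
			}
			if (m[v] > 0) != anti {
				kept = append(kept, n)
			}
		}
		candidates = kept
	}
	return candidates
}

func main() {
	n1 := node{Name: "n1", Labels: map[string]string{"zone": "a"}}
	n2 := node{Name: "n2", Labels: map[string]string{"zone": "b"}}
	all := map[string]node{"n1": n1, "n2": n2}
	pods := []podInfo{{Labels: map[string]string{"app": "db"}, Node: "n1"}}
	terms := []term{{Selector: map[string]string{"app": "db"}, TopologyKey: "zone"}}

	fmt.Println(filterNodes([]node{n1, n2}, terms, pods, all, false))
	fmt.Println(filterNodes([]node{n1, n2}, terms, pods, all, true))
}
```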
## Special considerations for RequiredDuringScheduling anti-affinity
In this section we discuss three issues with RequiredDuringScheduling
anti-affinity: Denial of Service (DoS), co-existing with daemons, and
determining which pod(s) to kill. See issue [#18265](https://github.com/kubernetes/kubernetes/issues/18265)
for additional discussion of these topics.
### Denial of Service
Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity
can intentionally or unintentionally cause various problems for other pods, due
to the symmetry property of anti-affinity.
The most notable danger is the ability for a pod that arrives first to some
topology domain, to block all other pods from scheduling there by stating a
conflict with all other pods. The standard approach to preventing resource
hogging is quota, but simple resource quota cannot prevent this scenario because
the pod may request very few resources. Addressing this using quota requires
a quota scheme that charges based on "opportunity cost" rather than based simply
on requested resources. For example, when handling a pod that expresses
RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey`
(i.e. exclusive access to a node), it could charge for the resources of the
average or largest node in the cluster. Likewise if a pod expresses
RequiredDuringScheduling anti-affinity for all pods using a "cluster"
`TopologyKey`, it could charge for the resources of the entire cluster. If node
affinity is used to constrain the pod to a particular topology domain, then the
admission-time quota charging should take that into account (e.g. not charge for
the average/largest machine if the PodSpec constrains the pod to a specific
machine with a known size; instead charge for the size of the actual machine
that the pod was constrained to). In all cases once the pod is scheduled, the
quota charge should be adjusted down to the actual amount of resources allocated
(e.g. the size of the actual machine that was assigned, not the
average/largest). If a cluster administrator wants to overcommit quota, for
example to allow more than N pods across all users to request exclusive node
access in a cluster with N nodes, then a priority/preemption scheme should be
added so that the most important pods run when resource demand exceeds supply.
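The "opportunity cost" charging idea can be sketched as follows. This is only an illustration under stated assumptions: the function name, the use of CPU milli-cores as the charged resource, and the choice of the largest node (rather than the average) are all hypothetical and not part of the proposal.

```python
def admission_charge(topology_key, node_cpu_milli, pinned_node=None):
    """CPU (milli-cores) to charge against quota when the pod is admitted.

    node_cpu_milli: dict of node name -> CPU capacity in milli-cores.
    pinned_node: node name if the PodSpec constrains the pod to one machine.
    """
    if topology_key == "cluster":
        # The pod excludes all other pods from the whole cluster.
        return sum(node_cpu_milli.values())
    if topology_key == "node":
        if pinned_node is not None:
            # The machine is known, so charge for its actual size.
            return node_cpu_milli[pinned_node]
        # Assumption: charge for the largest node until the pod is scheduled.
        return max(node_cpu_milli.values())
    raise ValueError("unsupported topology key in this sketch: %r" % topology_key)
```

Once the pod is bound to a node, the charge would be lowered to that node's actual size, which is the same lookup as the `pinned_node` branch.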
An alternative approach, which is a bit of a blunt hammer, is to use a
capability mechanism to restrict use of RequiredDuringScheduling anti-affinity
to trusted users. A more complex capability mechanism might only restrict it
when using a non-"node" TopologyKey.
Our initial implementation will use a variant of the capability approach, which
requires no configuration: we will simply reject ALL requests, regardless of
user, that specify "all namespaces" with non-"node" TopologyKey for
RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use
case while prohibiting the more dangerous ones.
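A minimal sketch of this rejection rule, assuming each RequiredDuringScheduling anti-affinity term is a plain dictionary and that "all namespaces" is represented by a hypothetical `ALL_NAMESPACES` marker (neither is the real API representation):

```python
ALL_NAMESPACES = "*"  # hypothetical marker for "all namespaces"


def admit(required_anti_affinity_terms, node_topology_key="node"):
    """Reject any term that targets all namespaces with a non-"node" key."""
    for term in required_anti_affinity_terms:
        targets_everyone = term["namespaces"] == ALL_NAMESPACES
        if targets_everyone and term["topology_key"] != node_topology_key:
            return False
    return True
```

The "exclusive node" case (all namespaces, "node" key) passes, while the same term with a zone-level key is rejected for every user.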
A weaker variant of the problem described in the previous paragraph is a pod's
ability to use anti-affinity to degrade the scheduling quality of another pod,
but not completely block it from scheduling. For example, a set of pods S1 could
use node affinity to request to schedule onto a set of nodes that some other set
of pods S2 prefers to schedule onto. If the pods in S1 have
RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for
S2, then due to the symmetry property of anti-affinity, they can prevent the
pods in S2 from scheduling onto their preferred nodes if they arrive first (for
sure in the RequiredDuringScheduling case, and with some probability that
depends on the weighting scheme for the PreferredDuringScheduling case). A very
sophisticated priority and/or quota scheme could mitigate this, or alternatively
we could eliminate the symmetry property of the implementation of
PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling
anti-affinity could affect scheduling quality of another pod, and as we
described in the previous paragraph, such pods could be charged quota for the
full topology domain, thereby reducing the potential for abuse.
We won't try to address this issue in our initial implementation; we can
consider one of the approaches mentioned above if it turns out to be a problem
in practice.
### Co-existing with daemons
A cluster administrator may wish to allow pods that express anti-affinity
against all pods, to nonetheless co-exist with system daemon pods, such as those
run by DaemonSet. In principle, we would like the specification for
RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
more other pods (see [#18263](https://github.com/kubernetes/kubernetes/issues/18263)
for a more detailed explanation of the toleration concept).
There are at least two ways to accomplish this:
* Scheduler special-cases the namespace(s) where daemons live, in the
sense that it ignores pods in those namespaces when it is
determining feasibility for pods with anti-affinity. The name(s) of
the special namespace(s) could be a scheduler configuration
parameter, and default to `kube-system`. We could allow
multiple namespaces to be specified if we want cluster admins to be
able to give their own daemons this special power (they would add
their namespace to the list in the scheduler configuration). And of
course this would be symmetric, so daemons could schedule onto a node
that is already running a pod with anti-affinity.
* We could add an explicit "toleration" concept/field to allow the
user to specify namespaces that are excluded when they use
RequiredDuringScheduling anti-affinity, and use an admission
controller/defaulter to ensure these namespaces are always listed.
Our initial implementation will use the first approach.
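The first approach might look roughly like the sketch below. The pod dictionaries, the exact-match selector, and the `ignored_namespaces` parameter name are assumptions for illustration; only the `kube-system` default comes from the text above.

```python
def matches(selector, labels):
    """True if every key/value pair in the selector appears in labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def blocking_pods(pods_on_node, anti_affinity_selector,
                  ignored_namespaces=("kube-system",)):
    """Names of pods that make the node infeasible for the incoming pod.

    Pods in the ignored (daemon) namespaces are skipped entirely.
    """
    return [
        p["name"]
        for p in pods_on_node
        if p["namespace"] not in ignored_namespaces
        and matches(anti_affinity_selector, p["labels"])
    ]
```

An empty selector stands for "anti-affinity against all pods": a node running only a daemon pod in `kube-system` stays feasible, while a node that also runs a user pod does not.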
### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)
Because anti-affinity is symmetric, in the case of
RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must
determine which pod(s) to kill when a pod's labels are updated in such a way as
to cause them to conflict with one or more other pods'
RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the
absence of a priority/preemption scheme, our rule will be that the pod with the
anti-affinity rule that becomes violated should be the one killed. A pod should
only specify constraints that apply to namespaces it trusts to not do malicious
things. Once we have priority/preemption, we can change the rule to say that the
lowest-priority pod(s) are killed until all
RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.
## Special considerations for RequiredDuringScheduling affinity
The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its
symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with
conflicting pods, and pods that conflict with P cannot schedule onto the node
once P has been scheduled there. The design we have described says that the
symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P
says it can only schedule onto nodes running pod Q, this does not mean Q can
only run on a node that is running P, but the scheduler will try to schedule Q
onto a node that is running P (i.e. treats the reverse direction as preferred).
This raises the same scheduling quality concern as we mentioned at the end of
the Denial of Service section above, and can be addressed in similar ways.
The nature of affinity (as opposed to anti-affinity) means that there is no
issue of determining which pod(s) to kill when a pod's labels change: it is
obviously the pod with the affinity rule that becomes violated that must be
killed. (Killing a pod never "fixes" violation of an affinity rule; it can only
"fix" violation of an anti-affinity rule.) However, affinity does have a different
question related to killing: how long should the system wait before declaring
that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met
at runtime? For example, if a pod P has such an affinity for a pod Q and pod Q
is temporarily killed so that it can be updated to a new binary version, should
that trigger killing of P? More generally, how long should the system wait
before declaring that P's affinity is violated? (Of course affinity is expressed
in terms of label selectors, not for a specific pod, but the scenario is easier
to describe using a concrete pod.) This is closely related to the concept of
forgiveness (see issue [#1574](https://github.com/kubernetes/kubernetes/issues/1574)).
In theory we could make this time duration be configurable by the user on a per-pod
basis, but for the first version of this feature we will make it a configurable
property of whichever component does the killing and that applies across all pods
using the feature. Making it configurable by the user would require a nontrivial
change to the API syntax (since the field would only apply to
RequiredDuringSchedulingRequiredDuringExecution affinity).
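The waiting behavior can be pictured as a single component-wide grace period. The sketch below is an assumption about how such a check might be written; the function and parameter names are hypothetical, and times are plain seconds.

```python
def should_evict(now, last_satisfied, grace_period):
    """True once the affinity has gone unsatisfied for longer than the grace period.

    now, last_satisfied: timestamps in seconds.
    grace_period: component-wide setting in seconds (not per pod).
    """
    return (now - last_satisfied) > grace_period
```

With a 30 second grace period, a 5 second restart of Q during a binary update would not trigger killing P, whereas an outage of several minutes would.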
## Implementation plan
1. Add the `Affinity` field to PodSpec and the `PodAffinity` and
`PodAntiAffinity` types to the API along with all of their descendant types.
2. Implement a scheduler predicate that takes
`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into
account. Include a workaround for the issue described at the end of the Affinity
section of the Examples section (can't schedule first pod).
3. Implement a scheduler priority function that takes
`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity
into account.
4. Implement admission controller that rejects requests that specify "all
namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling`
anti-affinity. This admission controller should be enabled by default.
5. Implement the recommended solution to the "co-existing with daemons" issue
6. At this point, the feature can be deployed.
7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity
and anti-affinity, and make sure the pieces of the system already implemented
for `RequiredDuringSchedulingIgnoredDuringExecution` also take
`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the
scheduler predicate, the quota mechanism, the "co-existing with daemons"
solution).
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node"
`TopologyKey` to Kubelet's admission decision.
9. Implement code in Kubelet *or* the controllers that evicts a pod that no
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet
then only for "node" `TopologyKey`; if controller then potentially for all
`TopologyKey`s (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
Do so in a way that addresses the "determining which pod(s) to kill" issue.
We assume Kubelet publishes labels describing the node's membership in all of
the relevant scheduling domains (e.g. node name, rack name, availability zone
name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
## Backward compatibility
Old versions of the scheduler will ignore `Affinity`.
Users should not start using `Affinity` until the full implementation has been
in Kubelet and the master for enough binary versions that we feel comfortable
that we will not need to roll back either Kubelet or master to a version that
does not support them. Longer-term we will use a programmatic approach to
enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
## Extensibility
The design described here is the result of careful analysis of use cases, a
decade of experience with Borg at Google, and a review of similar features in
other open-source container orchestration systems. We believe that it properly
balances the goal of expressiveness against the goals of simplicity and
efficiency of implementation. However, we recognize that use cases may arise in
the future that cannot be expressed using the syntax described here. Although we
are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
for Kubernetes users to get a consistent experience, etc.), the regular
Kubernetes annotation mechanism can be used to add or replace affinity rules.
The way this would work is:
1. Define one or more annotations to describe the new affinity rule(s)
1. User (or an admission controller) attaches the annotation(s) to pods to
request the desired scheduling behavior. If the new rule(s) *replace* one or
more fields of `Affinity` then the user would omit those fields from `Affinity`;
if they are *additional rules*, then the user would fill in `Affinity` as well
as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
If some particular new syntax becomes popular, we would consider upstreaming it
by integrating it into the standard `Affinity`.
## Future work and non-work
One can imagine that in the anti-affinity RequiredDuringScheduling case one
might want to associate a number with the rule, for example "do not allow this
pod to share a rack with more than three other pods (in total, or from the same
service as the pod)." We could allow this to be specified by adding an integer
`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case.
However, this flexibility complicates the system and we do not intend to
implement it.
It is likely that the specification and implementation of pod anti-affinity
can be unified with [taints and tolerations](taint-toleration-dedicated.md),
and likewise that the specification and implementation of pod affinity
can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod
labels would be "inherited" by the node, and pods would only be able to specify
affinity and anti-affinity for a node's labels. Our main motivation for not
unifying taints and tolerations with pod anti-affinity is that we foresee taints
and tolerations as being a concept that only cluster administrators need to
understand (and indeed in some setups taints and tolerations wouldn't even be
directly manipulated by a cluster administrator, instead they would only be set
by an admission controller that is implementing the administrator's high-level
policy about different classes of special machines and the users who belong to
the groups allowed to access them). Moreover, the concept of nodes "inheriting"
labels from pods seems complicated; it seems conceptually simpler to separate
rules involving relatively static properties of nodes from rules involving which
other pods are running on the same node or larger topology domain.
Data/storage affinity is related to pod affinity, and is likely to draw on some
of the ideas we have used for pod affinity. Today, data/storage affinity is
expressed using node affinity, on the assumption that the pod knows which
node(s) store(s) the data it wants. But a more flexible approach would allow the
pod to name the data rather than the node.
## Related issues
The review for this proposal is in [#18265](https://github.com/kubernetes/kubernetes/issues/18265).
The topic of affinity/anti-affinity has generated a lot of discussion. The main
issue is [#367](https://github.com/kubernetes/kubernetes/issues/367)
but [#14484](https://github.com/kubernetes/kubernetes/issues/14484)/[#14485](https://github.com/kubernetes/kubernetes/issues/14485),
[#9560](https://github.com/kubernetes/kubernetes/issues/9560), [#11369](https://github.com/kubernetes/kubernetes/issues/11369),
[#14543](https://github.com/kubernetes/kubernetes/issues/14543), [#11707](https://github.com/kubernetes/kubernetes/issues/11707),
[#3945](https://github.com/kubernetes/kubernetes/issues/3945), [#341](https://github.com/kubernetes/kubernetes/issues/341),
[#1965](https://github.com/kubernetes/kubernetes/issues/1965), and [#2906](https://github.com/kubernetes/kubernetes/issues/2906)
all have additional discussion and use cases.
As the examples in this document have demonstrated, topological affinity is very
useful in clusters that are spread across availability zones, e.g. to co-locate
pods of a service in the same zone to avoid a wide-area network hop, or to
spread pods across zones for failure tolerance. [#17059](https://github.com/kubernetes/kubernetes/issues/17059),
[#13056](https://github.com/kubernetes/kubernetes/issues/13056), [#13063](https://github.com/kubernetes/kubernetes/issues/13063),
and [#4235](https://github.com/kubernetes/kubernetes/issues/4235) are relevant.
Issue [#15675](https://github.com/kubernetes/kubernetes/issues/15675) describes connection affinity, which is vaguely related.
This proposal is to satisfy [#14816](https://github.com/kubernetes/kubernetes/issues/14816).
## Related work
** TODO: cite references **
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/podaffinity.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/podaffinity.md)

View File

@ -1,101 +1 @@
# Design Principles
Principles to follow when extending Kubernetes.
## API
See also the [API conventions](../devel/api-conventions.md).
* All APIs should be declarative.
* API objects should be complementary and composable, not opaque wrappers.
* The control plane should be transparent -- there are no hidden internal APIs.
* The cost of API operations should be proportional to the number of objects
intentionally operated upon. Therefore, common filtered lookups must be indexed.
Beware of patterns of multiple API calls that would incur quadratic behavior.
* Object status must be 100% reconstructable by observation. Any history kept
must be just an optimization and not required for correct operation.
* Cluster-wide invariants are difficult to enforce correctly. Try not to add
them. If you must have them, don't enforce them atomically in master components;
that is contention-prone and doesn't provide a recovery path in the case of a
bug allowing the invariant to be violated. Instead, provide a series of checks
to reduce the probability of a violation, and make every component involved able
to recover from an invariant violation.
* Low-level APIs should be designed for control by higher-level systems.
Higher-level APIs should be intent-oriented (think SLOs) rather than
implementation-oriented (think control knobs).
## Control logic
* Functionality must be *level-based*, meaning the system must operate correctly
given the desired state and the current/observed state, regardless of how many
intermediate state updates may have been missed. Edge-triggered behavior must be
just an optimization.
* Assume an open world: continually verify assumptions and gracefully adapt to
external events and/or actors. Example: we allow users to kill pods under
control of a replication controller; it just replaces them.
* Do not define comprehensive state machines for objects with behaviors
associated with state transitions and/or "assumed" states that cannot be
ascertained by observation.
* Don't assume a component's decisions will not be overridden or rejected, nor
that the component will always understand why. For example, etcd may reject writes.
Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry,
but back off and/or make alternative decisions.
* Components should be self-healing. For example, if you must keep some state
(e.g., a cache), the content needs to be periodically refreshed, so that if an item
does get erroneously stored or a deletion event is missed, etc., it will soon be
fixed, ideally on timescales that are shorter than what will attract attention
from humans.
* Component behavior should degrade gracefully. Prioritize actions so that the
most important activities can continue to function even when overloaded and/or
in states of partial failure.
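To make the level-based principle concrete, here is a minimal sketch of a reconcile step that looks only at the desired and observed state. The function name and the action tuples are assumptions for illustration, not the replication controller's actual code.

```python
def reconcile(desired_replicas, observed_pods):
    """Compute actions from the desired and observed state alone.

    No event history is consulted, so missed intermediate updates
    cannot lead to a wrong result.
    """
    diff = desired_replicas - len(observed_pods)
    if diff > 0:
        # Too few pods: create the missing ones.
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: delete the surplus (sorted for determinism).
        return [("delete", name) for name in sorted(observed_pods)[:-diff]]
    return []
```

If a user kills a pod, the next call simply observes one pod too few and returns a create action, regardless of whether the deletion event was ever seen.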
## Architecture
* Only the apiserver should communicate with etcd/store, and not other
components (scheduler, kubelet, etc.).
* Compromising a single node shouldn't compromise the cluster.
* Components should continue to do what they were last told in the absence of
new instructions (e.g., due to network partition or component outage).
* All components should keep all relevant state in memory all the time. The
apiserver should write through to etcd/store, other components should write
through to the apiserver, and they should watch for updates made by other
clients.
* Watch is preferred over polling.
## Extensibility
TODO: pluggability
## Bootstrapping
* [Self-hosting](http://issue.k8s.io/246) of all components is a goal.
* Minimize the number of dependencies, particularly those required for
steady-state operation.
* Stratify the dependencies that remain via principled layering.
* Break any circular dependencies by converting hard dependencies to soft
dependencies.
* Also accept data that would normally come from other components from another
source, such as local files, which can then be manually populated at bootstrap
time and then continuously updated once those other components are available.
* State should be rediscoverable and/or reconstructable.
* Make it easy to run temporary, bootstrap instances of all components in
order to create the runtime state needed to run the components in the steady
state; use a lock (master election for distributed components, file lock for
local components like Kubelet) to coordinate handoff. We call this technique
"pivoting".
* Have a solution to restart dead components. For distributed components,
replication works well. For local components such as Kubelet, a process manager
or even a simple shell loop works.
## Availability
TODO
## General principles
* [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/principles.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/principles.md)

View File

@ -1,218 +1 @@
# Resource Quality of Service in Kubernetes
**Author(s)**: Vishnu Kannan (vishh@), Ananya Kumar (@AnanyaKumar)
**Last Updated**: 5/17/2016
**Status**: Implemented
*This document presents the design of resource quality of service for containers in Kubernetes, and describes use cases and implementation details.*
## Introduction
This document describes the way Kubernetes provides different levels of Quality of Service to pods depending on what they *request*.
Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee.
Specifically, for each resource, containers specify a request, which is the amount of that resource that the system will guarantee to the container, and a limit which is the maximum amount that the system will allow the container to use.
The system computes pod level requests and limits by summing up per-resource requests and limits across all containers.
When request == limit, the resources are guaranteed, and when request < limit, the pod is guaranteed the request but can opportunistically scavenge the difference between request and limit if they are not being used by other containers.
This allows Kubernetes to oversubscribe nodes, which increases utilization, while at the same time maintaining resource guarantees for the containers that need guarantees.
Borg increased utilization by about 20% when it started allowing use of such non-guaranteed resources, and we hope to see similar improvements in Kubernetes.
## Requests and Limits
For each resource, containers can specify a resource request and limit, `0 <= request <= `[`Node Allocatable`](../proposals/node-allocatable.md) & `request <= limit <= Infinity`.
If a pod is successfully scheduled, the container is guaranteed the amount of resources requested.
Scheduling is based on `requests` and not `limits`.
The pod and its containers will not be allowed to exceed the specified limit.
How the request and limit are enforced depends on whether the resource is [compressible or incompressible](resources.md).
### Compressible Resource Guarantees
- For now, we are only supporting CPU.
- Pods are guaranteed to get the amount of CPU they request; they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because CPU isolation is at the container level. Pod-level cgroups will be introduced soon to achieve this goal.
- Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests 600 milli CPUs, and container B requests 300 milli CPUs. Suppose that both containers are trying to use as much CPU as they can. Then the extra 10 milli CPUs will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).
- Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
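The proportional split of excess CPU described above comes down to simple arithmetic. The sketch below shows only that arithmetic, with a hypothetical function name; it does not model how the kernel or cgroups enforce the shares.

```python
from fractions import Fraction


def share_excess_cpu(requests_milli, excess_milli):
    """Split excess CPU among containers in proportion to their requests.

    requests_milli: dict of container name -> requested milli CPUs (sum > 0).
    Returns exact fractions so the ratios are not blurred by rounding.
    """
    total = sum(requests_milli.values())
    return {
        name: Fraction(excess_milli * request, total)
        for name, request in requests_milli.items()
    }
```

For the example in the text, `share_excess_cpu({"A": 600, "B": 300}, 10)` gives A twice as much of the extra CPU as B, and the two shares add up to the full 10 milli CPUs.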
### Incompressible Resource Guarantees
- For now, we are only supporting memory.
- Pods will get the amount of memory they request. If they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
- When Pods use more memory than their limit, the process that is using the most memory inside one of the pod's containers will be killed by the kernel.
### Admission/Scheduling Policy
- Pods will be admitted by Kubelet & scheduled by the scheduler based on the sum of the requests of their containers. The scheduler & kubelet will ensure that the sum of requests of all containers is within the node's [allocatable](../proposals/node-allocatable.md) capacity (for both memory and CPU).
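The admission check can be sketched as a sum-of-requests comparison against the node's allocatable capacity. The data shapes here (a pod as a list of per-container request dictionaries, resources as plain integers) are assumptions for the example.

```python
def fits(allocatable, running_pods, new_pod):
    """True if the new pod's requests fit within the node's allocatable capacity.

    allocatable: dict of resource name -> allocatable amount on the node.
    running_pods: list of pods; each pod is a list of per-container request dicts.
    new_pod: list of per-container request dicts.
    Limits play no part in this decision.
    """
    for resource, capacity in allocatable.items():
        used = sum(c.get(resource, 0) for pod in running_pods for c in pod)
        wanted = sum(c.get(resource, 0) for c in new_pod)
        if used + wanted > capacity:
            return False
    return True
```

A pod with no requests always fits, which is why such pods can be oversubscribed onto a node that is already fully requested.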
## QoS Classes
In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority.
The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theoretically, the policy of classifying pods into QoS classes is orthogonal to the requests and limits specified for the container. Hypothetically, users could use a (currently unplanned) API to specify whether a pod is guaranteed or best-effort. However, in the current design, the policy of classifying pods into QoS classes is intimately tied to "Requests and Limits" - in fact, QoS classes are used to implement some of the memory guarantees described in the previous section.
Pods can be of one of 3 different classes:
- If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are *equal*, then the pod is classified as **Guaranteed**.
Examples:
```yaml
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
```
```yaml
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
    requests:
      cpu: 100m
      memory: 100Mi
```
- If `requests` and optionally `limits` are set (not equal to `0`) for one or more resources across one or more containers, and they are *not equal*, then the pod is classified as **Burstable**.
When `limits` are not specified, they default to the node capacity.
Examples:
Container `bar` has no resources specified.
```yaml
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
```
Containers `foo` and `bar` have limits set for different resources.
```yaml
containers:
- name: foo
  resources:
    limits:
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
```
Container `foo` has no limits set, and `bar` has neither requests nor limits specified.
```yaml
containers:
- name: foo
  resources:
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
```
- If `requests` and `limits` are not set for all of the resources, across all containers, then the pod is classified as **Best-Effort**.
Examples:
```yaml
containers:
- name: foo
  resources: {}
- name: bar
  resources: {}
```
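The three classification rules can be summarized in a short sketch. It assumes each container is a dictionary with optional `requests` and `limits` maps, that quantities are already normalized to plain integers (milli-cores and bytes), and that only `cpu` and `memory` are considered; the function name is hypothetical.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or Best-Effort."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests") or {}
        limits = container.get("limits") or {}
        if requests or limits:
            anything_set = True
        for resource in RESOURCES:
            limit = limits.get(resource)
            if not limit:
                # A missing or zero limit rules out Guaranteed.
                guaranteed = False
            elif requests.get(resource, limit) != limit:
                # A request, when set, must equal the limit.
                guaranteed = False
    if not anything_set:
        return "Best-Effort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Applied to the examples above, limits-only pods and pods with equal requests and limits come out as Guaranteed, any partially specified pod as Burstable, and a pod with no requests or limits as Best-Effort.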
Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU); instead, they will be temporarily throttled.
Memory is an incompressible resource and so let's discuss the semantics of memory management a bit.
- *Best-Effort* pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory.
These containers can use any amount of free memory in the node though.
- *Guaranteed* pods are considered top-priority and are guaranteed not to be killed unless they exceed their limits, or the system is under memory pressure and there are no lower-priority containers that can be evicted.
- *Burstable* pods have some form of minimal resource guarantee, but can use more resources when available.
Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no *Best-Effort* pods exist.
### OOM Score configuration at the Nodes
Pod OOM score configuration
- Note that the OOM score of a process is 10 times the % of memory the process consumes, adjusted by OOM_SCORE_ADJ, barring exceptions (e.g. process is launched by root). Processes with higher OOM scores are killed first.
- The base OOM score is between 0 and 1000, so if process A's OOM_SCORE_ADJ - process B's OOM_SCORE_ADJ is over 1000, then process A will always be OOM killed before B.
- The final OOM score of a process is also between 0 and 1000
*Best-effort*
- Set OOM_SCORE_ADJ: 1000
- So processes in best-effort containers will have an OOM_SCORE of 1000
*Guaranteed*
- Set OOM_SCORE_ADJ: -998
- So processes in guaranteed containers will have an OOM_SCORE of 0 or 1
*Burstable*
- If total memory request > 99.8% of available memory, OOM_SCORE_ADJ: 2
- Otherwise, set OOM_SCORE_ADJ to 1000 - 10 * (% of memory requested)
- This ensures that the OOM_SCORE of a burstable pod is > 1
- If memory request is `0`, OOM_SCORE_ADJ is set to `999`.
- So burstable pods will be killed if they conflict with guaranteed pods
- If a burstable pod uses less memory than requested, its OOM_SCORE < 1000
- So best-effort pods will be killed if they conflict with burstable pods using less than requested memory
- If a process in a burstable pod's container uses more memory than what the container had requested, its OOM_SCORE will be 1000; if not, its OOM_SCORE will be < 1000
- Assuming that a container typically has a single big process, if a burstable pod's container that uses more memory than requested conflicts with another burstable pod's container using less memory than requested, the former will be killed
- If burstable pods' containers with multiple processes conflict, then the formula for OOM scores is a heuristic; it will not ensure "Request and Limit" guarantees.
*Pod infra containers* or *Special Pod init process*
- OOM_SCORE_ADJ: -998
*Kubelet, Docker*
- OOM_SCORE_ADJ: -999 (won't be OOM killed)
- Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume.
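The per-class OOM_SCORE_ADJ rules for pods can be collected into one function. This is a sketch under assumptions: memory amounts are integers in bytes, integer division stands in for the percentage arithmetic, and the cap at 999 for very small non-zero requests is an assumption that extends the "request is `0`" rule rather than something stated above.

```python
def oom_score_adj(qos, memory_request=0, memory_capacity=1):
    """Return the OOM_SCORE_ADJ for a pod of the given QoS class.

    memory_request and memory_capacity are in bytes; they only matter
    for Burstable pods.
    """
    if qos == "Guaranteed":
        return -998
    if qos == "Best-Effort":
        return 1000
    # Burstable from here on.
    if memory_request == 0:
        return 999
    if memory_request * 1000 > 998 * memory_capacity:
        # Request is above 99.8% of available memory.
        return 2
    # 1000 - 10 * (% of memory requested), in integer arithmetic.
    adj = 1000 - (1000 * memory_request) // memory_capacity
    # Assumption: stay strictly below the Best-Effort value.
    return min(999, adj)
```

A Burstable pod requesting half of the node's memory gets an adjustment of 500, which places it between Guaranteed pods and Best-Effort pods in kill order.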
## Known issues and possible improvements
The above implementation provides for basic oversubscription with protection, but there are a few known limitations.
#### Support for Swap
- The current QoS policy assumes that swap is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn't enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior.
## Alternative QoS Class Policy
An alternative is to have user-specified numerical priorities that guide Kubelet on which tasks to kill (if the node runs out of memory, lower priority tasks will be killed).
A strict hierarchy of user-specified numerical priorities is not desirable because:
1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively
2. Changes to desired priority bands would require changes to all user pod configurations.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md)


@ -1,370 +1 @@
**Note: this is a design doc, which describes features that have not been
completely implemented. User documentation of the current state is
[here](../user-guide/compute-resources.md). The tracking issue for
implementation of this model is [#168](http://issue.k8s.io/168). Currently, both
limits and requests of memory and cpu on containers (not pods) are supported.
"memory" is in bytes and "cpu" is in milli-cores.**
# The Kubernetes resource model
To do good pod placement, Kubernetes needs to know how big pods are, as well as
the sizes of the nodes onto which they are being placed. The definition of "how
big" is given by the Kubernetes resource model &mdash; the subject of this
document.
The resource model aims to be:
* simple, for common cases;
* extensible, to accommodate future growth;
* regular, with few special cases; and
* precise, to avoid misunderstandings and promote pod portability.
## The resource model
A Kubernetes _resource_ is something that can be requested by, allocated to, or
consumed by a pod or container. Examples include memory (RAM), CPU, disk-time,
and network bandwidth.
Once resources on a node have been allocated to one pod, they should not be
allocated to another until that pod is removed or exits. This means that
Kubernetes schedulers should ensure that the sum of the resources allocated
(requested and granted) to its pods never exceeds the usable capacity of the
node. Testing whether a pod will fit on a node is called _feasibility checking_.
Note that the resource model currently prohibits over-committing resources; we
will want to relax that restriction later.
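Feasibility checking as described above can be sketched as follows. This is a minimal illustration, not the scheduler's code: the `ResourceList` type and the `fits` function are hypothetical names, and quantities are assumed to be integers already (milli-CPUs and bytes).

```go
package main

import "fmt"

// ResourceList maps a resource typename to a quantity in integer units.
// (Hypothetical type for this sketch.)
type ResourceList map[string]int64

// fits reports whether adding podRequest to the requests already allocated
// on a node stays within the node's capacity for every requested resource.
// No over-commitment is allowed, matching the current model.
func fits(capacity ResourceList, allocated []ResourceList, podRequest ResourceList) bool {
	for name, want := range podRequest {
		used := int64(0)
		for _, a := range allocated {
			used += a[name]
		}
		if used+want > capacity[name] {
			return false
		}
	}
	return true
}

func main() {
	capacity := ResourceList{"cpu": 4000, "memory": 8 << 30}
	allocated := []ResourceList{{"cpu": 2500, "memory": 4 << 30}}

	fmt.Println(fits(capacity, allocated, ResourceList{"cpu": 1000, "memory": 2 << 30})) // true
	fmt.Println(fits(capacity, allocated, ResourceList{"cpu": 2000, "memory": 1 << 30})) // false
}
```

A resource that the node does not advertise reads as capacity 0 here, so any non-zero request for it is infeasible.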
### Resource types
All resources have a _type_ that is identified by their _typename_ (a string,
e.g., "memory"). Several resource types are predefined by Kubernetes (a full
list is below), although only two will be supported at first: CPU and memory.
Users and system administrators can define their own resource types if they wish
(e.g., Hadoop slots).
A fully-qualified resource typename is constructed from a DNS-style _subdomain_,
followed by a slash `/`, followed by a name.
* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt)
(e.g., `kubernetes.io`, `example.com`).
* The name must be not more than 63 characters, consisting of upper- or
lower-case alphanumeric characters, with the `-`, `_`, and `.` characters
allowed anywhere except the first or last character.
* As a shorthand, any resource typename that does not start with a subdomain and
a slash will automatically be prefixed with the built-in Kubernetes _namespace_,
`kubernetes.io/` in order to fully-qualify it. This namespace is reserved for
code in the open source Kubernetes repository; as a result, all user typenames
MUST be fully qualified, and cannot be created in this namespace.
Some example typenames include `memory` (which will be fully-qualified as
`kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`.
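The naming rules above can be sketched in a few lines. This is an illustration under assumptions: `qualify` and `validName` are hypothetical helpers, subdomain validation against RFC 1123 is left out, and the regular expression is one possible reading of "allowed anywhere except the first or last character".

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// defaultNamespace is the built-in prefix for unqualified typenames.
const defaultNamespace = "kubernetes.io/"

// namePattern: alphanumerics, with '-', '_' and '.' allowed except at the ends.
var namePattern = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9_.-]*[A-Za-z0-9])?$`)

// validName checks the name part (after the slash) of a typename.
func validName(name string) bool {
	return len(name) <= 63 && namePattern.MatchString(name)
}

// qualify prefixes the default namespace when no subdomain and slash are present.
func qualify(typename string) string {
	if strings.Contains(typename, "/") {
		return typename
	}
	return defaultNamespace + typename
}

func main() {
	fmt.Println(qualify("memory"))
	fmt.Println(qualify("example.com/Shiny_New-Resource.Type"))
	fmt.Println(validName("Shiny_New-Resource.Type"), validName("-bad"))
}
```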
For future reference, note that some resources, such as CPU and network
bandwidth, are _compressible_, which means that their usage can potentially be
throttled in a relatively benign manner. All other resources are
_incompressible_, which means that any attempt to throttle them is likely to
cause grief. This distinction will be important if a Kubernetes implementation
supports over-committing of resources.
### Resource quantities
Initially, all Kubernetes resource types are _quantitative_, and have an
associated _unit_ for quantities of the associated resource (e.g., bytes for
memory, bytes per second for bandwidth, instances for software licences). The
units will always be a resource type's natural base units (e.g., bytes, not MB),
to avoid confusion between binary and decimal multipliers and the underlying
unit multiplier (e.g., is memory measured in MiB, MB, or GB?).
Resource quantities can be added and subtracted: for example, a node has a fixed
quantity of each resource type that can be allocated to pods/containers; once
such an allocation has been made, the allocated resources cannot be made
available to other pods/containers without over-committing the resources.
To make life easier for people, quantities can be represented externally as
unadorned integers, or as fixed-point integers with one of these SI suffixes
(E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi,
Ki). For example, the following represent roughly the same value: 128974848,
"129e6", "129M", "123Mi". Small quantities can be represented directly as
decimals (e.g., 0.3), or using milli-units (e.g., "300m").
* "Externally" means in user interfaces, reports, graphs, and in JSON or YAML
resource specifications that might be generated or read by people.
* Case is significant: "m" and "M" are not the same, so "k" is not a valid SI
suffix. There are no power-of-two equivalents for SI suffixes that represent
multipliers less than 1.
* These conventions only apply to resource quantities, not arbitrary values.
Internally (i.e., everywhere else), Kubernetes will represent resource
quantities as integers so it can avoid problems with rounding errors, and will
not use strings to represent numeric values. To achieve this, quantities that
naturally have fractional parts (e.g., CPU seconds/second) will be scaled to
integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in.
Internal APIs, data structures, and protobufs will use these scaled integer
units. Raw measurement data such as usage may still need to be tracked and
calculated using floating point values, but internally they should be rescaled
to avoid some values being in milli-units and some not.
* Note that reading in a resource quantity and writing it out again may change
the way its values are represented, and truncate precision (e.g., 1.0001 may
become 1.000), so comparison and difference operations (e.g., by an updater)
must be done on the internal representations.
* Avoiding milli-units in external representations has advantages for people
who will use Kubernetes, but runs the risk of developers forgetting to rescale
or accidentally using floating-point representations. That seems like the right
choice. We will try to reduce the risk by providing libraries that automatically
do the quantization for JSON/YAML inputs.
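As a rough sketch of the quantization step, the following parser turns an external quantity string into an integer number of milli-units. It is deliberately partial and is not the Kubernetes library: `parseMilli` is a hypothetical name, only integer mantissas are accepted (no "0.3" or "129e6"), and only a subset of the suffixes is included so the result fits in an `int64`.

```go
package main

import (
	"fmt"
	"strconv"
)

// milliMultipliers maps a suffix to its multiplier expressed in milli-units.
// Case is significant: "m" is milli, "M" is mega, and "k" is not accepted.
var milliMultipliers = map[string]int64{
	"":   1000,
	"m":  1,
	"K":  1000 * 1000,
	"M":  1000 * 1000 * 1000,
	"G":  1000 * 1000 * 1000 * 1000,
	"Ki": 1000 * 1024,
	"Mi": 1000 * 1024 * 1024,
	"Gi": 1000 * 1024 * 1024 * 1024,
}

// parseMilli converts strings such as "300m" or "123Mi" to milli-units.
func parseMilli(s string) (int64, error) {
	i := 0
	for i < len(s) && s[i] >= '0' && s[i] <= '9' {
		i++
	}
	if i == 0 {
		return 0, fmt.Errorf("no digits in %q", s)
	}
	n, err := strconv.ParseInt(s[:i], 10, 64)
	if err != nil {
		return 0, err
	}
	mult, ok := milliMultipliers[s[i:]]
	if !ok {
		return 0, fmt.Errorf("unknown suffix %q", s[i:])
	}
	return n * mult, nil
}

func main() {
	for _, s := range []string{"128974848", "123Mi", "129M", "300m", "1k"} {
		v, err := parseMilli(s)
		fmt.Println(s, v, err)
	}
}
```

Comparing the parsed integers rather than the original strings is what lets "128974848" and "123Mi" be recognized as the same quantity.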
### Resource specifications
Both users and a number of system components, such as schedulers, (horizontal)
auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers
need to reason about resource requirements of workloads, resource capacities of
nodes, and resource usage. Kubernetes divides specifications of *desired state*,
aka the Spec, and representations of *current state*, aka the Status. Resource
requirements and total node capacity fall into the specification category, while
resource usage, characterizations derived from usage (e.g., maximum usage,
histograms), and other resource demand signals (e.g., CPU load) clearly fall
into the status category and are discussed in the Appendix for now.
Resource requirements for a container or pod should have the following form:
```yaml
resourceRequirementSpec: [
  request: [ cpu: 2.5, memory: "40Mi" ],
  limit:   [ cpu: 4.0, memory: "99Mi" ],
]
```
Where:
* _request_ [optional]: the amount of resources being requested, or that were
requested and have been allocated. Scheduler algorithms will use these
quantities to test feasibility (whether a pod will fit onto a node).
If a container (or pod) tries to use more resources than its _request_, any
associated SLOs are voided &mdash; e.g., the program it is running may be
throttled (compressible resource types), or the attempt may be denied. If
_request_ is omitted for a container, it defaults to _limit_ if that is
explicitly specified, otherwise to an implementation-defined value; this will
always be 0 for a user-defined resource type. If _request_ is omitted for a pod,
it defaults to the sum of the (explicit or implicit) _request_ values for the
containers it encloses.
* _limit_ [optional]: an upper bound or cap on the maximum amount of resources
that will be made available to a container or pod; if a container or pod uses
more resources than its _limit_, it may be terminated. The _limit_ defaults to
"unbounded"; in practice, this probably means the capacity of an enclosing
container, pod, or node, but may result in non-deterministic behavior,
especially for memory.
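The defaulting rule for a container's _request_ can be sketched as below. The function name `effectiveRequest` is hypothetical, and the "implementation-defined value" for a resource with neither a request nor a limit is assumed to be 0 in this sketch (the document only guarantees 0 for user-defined types).

```go
package main

import "fmt"

// effectiveRequest returns the request used for feasibility checking:
// an explicit request wins; otherwise the request defaults to the limit.
// Resources with neither value are absent from the result and read as 0.
func effectiveRequest(request, limit map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(limit))
	for name, q := range limit {
		out[name] = q
	}
	for name, q := range request {
		out[name] = q
	}
	return out
}

func main() {
	request := map[string]int64{"cpu": 2500}                     // milli-CPUs
	limit := map[string]int64{"cpu": 4000, "memory": 99 << 20} // 99Mi in bytes

	eff := effectiveRequest(request, limit)
	fmt.Println(eff["cpu"], eff["memory"], eff["example.com/widgets"])
	// prints: 2500 103809024 0
}
```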
Total capacity for a node should have a similar structure:
```yaml
resourceCapacitySpec: [
  total: [ cpu: 12, memory: "128Gi" ]
]
```
Where:
* _total_: the total allocatable resources of a node. Initially, the resources
at a given scope will bound the resources of the sum of inner scopes.
#### Notes
* It is an error to specify the same resource type more than once in each
list.
* It is an error for the _request_ or _limit_ values for a pod to be less than
the sum of the (explicit or defaulted) values for the containers it encloses.
(We may relax this later.)
* If multiple pods are running on the same node and attempting to use more
resources than they have requested, the result is implementation-defined. For
example: unallocated or unused resources might be spread equally across
claimants, or the assignment might be weighted by the size of the original
request, or as a function of limits, or priority, or the phase of the moon,
perhaps modulated by the direction of the tide. Thus, although it's not
mandatory to provide a _request_, it's probably a good idea. (Note that the
_request_ could be filled in by an automated system that is observing actual
usage and/or historical data.)
* Internally, the Kubernetes master can decide the defaulting behavior and the
kubelet implementation may expect an absolute specification. For example, if
the master decided that "the default is unbounded" it would pass 2^64 to the
kubelet.
## Kubernetes-defined resource types
The following resource types are predefined ("reserved") by Kubernetes in the
`kubernetes.io` namespace, and so cannot be used for user-defined resources.
Note that the syntax of all resource types in the resource spec is deliberately
similar, but some resource types (e.g., CPU) may receive significantly more
support than simply tracking quantities in the schedulers and/or the Kubelet.
### Processor cycles
* Name: `cpu` (or `kubernetes.io/cpu`)
* Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to
a canonical "Kubernetes CPU")
* Internal representation: milli-KCUs
* Compressible? yes
* Qualities: this is a placeholder for the kind of thing that may be supported
in the future &mdash; see [#147](http://issue.k8s.io/147)
  * [future] `schedulingLatency`: as per lmctfy
  * [future] `cpuConversionFactor`: property of a node: the speed of a CPU
  core on the node's processor divided by the speed of the canonical Kubernetes
  CPU (a floating point value; default = 1.0).
To reduce performance portability problems for pods, and to avoid worst-case
provisioning behavior, the units of CPU will be normalized to a canonical
"Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be
equivalent to a single CPU hyperthreaded core for some recent x86 processor. The
normalization may be implementation-defined, although some reasonable defaults
will be provided in the open-source Kubernetes code.
Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will
be allocated &mdash; control of aspects like this will be handled by resource
_qualities_ (a future feature).
### Memory
* Name: `memory` (or `kubernetes.io/memory`)
* Units: bytes
* Compressible? no (at least initially)
The precise meaning of "memory" is implementation-dependent, but the
basic idea is to rely on the underlying `memcg` mechanisms, support, and
definitions.
Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory
quantities rather than decimal ones: "64MiB" rather than "64MB".
## Resource metadata
A resource type may have an associated read-only ResourceType structure, that
contains metadata about the type. For example:
```yaml
resourceTypes: [
  "kubernetes.io/memory": [
    isCompressible: false, ...
  ]
  "kubernetes.io/cpu": [
    isCompressible: true,
    internalScaleExponent: 3, ...
  ]
  "kubernetes.io/disk-space": [ ... ]
]
```
Kubernetes will provide ResourceType metadata for its predefined types. If no
resource metadata can be found for a resource type, Kubernetes will assume that
it is a quantified, incompressible resource that is not specified in
milli-units, and has no default value.
The defined properties are as follows:
| field name | type | contents |
| ---------- | ---- | -------- |
| name | string, required | the typename, as a fully-qualified string (e.g., `kubernetes.io/cpu`) |
| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) |
| units | string, required | format: `unit* [per unit+]` (e.g., `second`, `byte per second`). An empty unit field means "dimensionless". |
| isCompressible | bool, default=false | true if the resource type is compressible |
| defaultRequest | string, default=none | in the same format as a user-supplied value |
| _[future]_ quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). |
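To show how this metadata might be consumed, here is a sketch of a lookup with the documented fallback plus the scaling implied by `internalScaleExponent`. The Go struct, the `lookup` and `toInternal` helpers, and the registry contents are assumptions for illustration; only the fields needed by the sketch are included.

```go
package main

import (
	"fmt"
	"math"
)

// ResourceType carries a subset of the metadata fields from the table above.
type ResourceType struct {
	Name                  string
	InternalScaleExponent int
	IsCompressible        bool
}

// known is a hypothetical registry of predefined types.
var known = map[string]ResourceType{
	"kubernetes.io/cpu":    {Name: "kubernetes.io/cpu", InternalScaleExponent: 3, IsCompressible: true},
	"kubernetes.io/memory": {Name: "kubernetes.io/memory"},
}

// lookup falls back to an incompressible type that is not in milli-units
// when no metadata is found.
func lookup(name string) ResourceType {
	if rt, ok := known[name]; ok {
		return rt
	}
	return ResourceType{Name: name}
}

// toInternal multiplies an external value by 10^InternalScaleExponent
// and rounds to the nearest integer.
func toInternal(external float64, rt ResourceType) int64 {
	return int64(math.Round(external * math.Pow10(rt.InternalScaleExponent)))
}

func main() {
	fmt.Println(toInternal(2.5, lookup("kubernetes.io/cpu")))    // 2500
	fmt.Println(toInternal(40, lookup("example.com/hadoop-slot"))) // 40
	fmt.Println(lookup("example.com/hadoop-slot").IsCompressible)  // false
}
```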
# Appendix: future extensions
The following are planned future extensions to the resource model, included here
to encourage comments.
## Usage data
Because resource usage and related metrics change continuously, need to be
tracked over time (i.e., historically), can be characterized in a variety of
ways, and are fairly voluminous, we will not include usage in core API objects,
such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs
for accessing and managing that data. See the Appendix for possible
representations of usage data, but the representation we'll use is TBD.
Singleton values for observed and predicted future usage will rapidly prove
inadequate, so we will support the following structure for extended usage
information:
```yaml
resourceStatus: [
  usage:     [ cpu: <CPU-info>, memory: <memory-info> ],
  maxusage:  [ cpu: <CPU-info>, memory: <memory-info> ],
  predicted: [ cpu: <CPU-info>, memory: <memory-info> ],
]
```
where a `<CPU-info>` or `<memory-info>` structure looks like this:
```yaml
{
  mean: <value>   # arithmetic mean
  max: <value>    # maximum value
  min: <value>    # minimum value
  count: <value>  # number of data points
  percentiles: [  # map from %iles to values
    "10": <10th-percentile-value>,
    "50": <median-value>,
    "99": <99th-percentile-value>,
    "99.9": <99.9th-percentile-value>,
    ...
  ]
}
```
All parts of this structure are optional, although we strongly encourage
including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles.
_[In practice, it will be important to include additional info such as the
length of the time window over which the averages are calculated, the
confidence level, and information-quality metrics such as the number of dropped
or discarded data points.]_
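One way such a structure could be filled from raw samples is sketched below. The `Summary` type and `summarize` function are hypothetical, the percentile method (nearest rank) is an assumption since the document does not pick one, and the time window and data-quality fields mentioned above are left out.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Summary mirrors the <CPU-info>/<memory-info> structure sketched above.
type Summary struct {
	Mean, Min, Max float64
	Count          int
	Percentiles    map[string]float64
}

// reported lists the percentiles to compute, keyed as in the example.
var reported = []struct {
	key string
	p   float64
}{{"10", 10}, {"50", 50}, {"99", 99}, {"99.9", 99.9}}

// percentile uses the nearest-rank method on already sorted data.
func percentile(sorted []float64, p float64) float64 {
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// summarize computes the summary without modifying the input slice.
func summarize(samples []float64) Summary {
	if len(samples) == 0 {
		return Summary{}
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	sum := 0.0
	for _, v := range sorted {
		sum += v
	}
	s := Summary{
		Mean:        sum / float64(len(sorted)),
		Min:         sorted[0],
		Max:         sorted[len(sorted)-1],
		Count:       len(sorted),
		Percentiles: map[string]float64{},
	}
	for _, r := range reported {
		s.Percentiles[r.key] = percentile(sorted, r.p)
	}
	return s
}

func main() {
	s := summarize([]float64{10, 9, 8, 7, 6, 5, 4, 3, 2, 1})
	fmt.Println(s.Mean, s.Min, s.Max, s.Count, s.Percentiles["50"], s.Percentiles["99"])
}
```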
## Future resource types
### _[future] Network bandwidth_
* Name: "network-bandwidth" (or `kubernetes.io/network-bandwidth`)
* Units: bytes per second
* Compressible? yes
### _[future] Network operations_
* Name: "network-iops" (or `kubernetes.io/network-iops`)
* Units: operations (messages) per second
* Compressible? yes
### _[future] Storage space_
* Name: "storage-space" (or `kubernetes.io/storage-space`)
* Units: bytes
* Compressible? no
The amount of secondary storage space available to a container. The main target
is local disk drives and SSDs, although this could also be used to qualify
remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a
disk array, or a file system fronting any of these, is left for future work.
### _[future] Storage time_
* Name: "storage-time" (or `kubernetes.io/storage-time`)
* Units: seconds per second of disk time
* Internal representation: milli-units
* Compressible? yes
This is the amount of time a container spends accessing disk, including actuator
and transfer time. A standard disk drive provides 1.0 diskTime seconds per
second.
### _[future] Storage operations_
* Name: "storage-iops" (or `kubernetes.io/storage-iops`)
* Units: operations per second
* Compressible? yes
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resources.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resources.md)


@ -1,105 +1 @@
# Scheduler extender
There are three ways to add new scheduling rules (predicates and priority
functions) to Kubernetes: (1) by adding these rules to the scheduler and
recompiling (described here:
https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler.md),
(2) by implementing your own scheduler process that runs instead of, or alongside,
the standard Kubernetes scheduler, or (3) by implementing a "scheduler extender"
process that the standard Kubernetes scheduler calls out to as a final pass when
making scheduling decisions.
This document describes the third approach. This approach is needed for use
cases where scheduling decisions need to be made on resources not directly
managed by the standard Kubernetes scheduler. The extender helps make scheduling
decisions based on such resources. (Note that the three approaches are not
mutually exclusive.)
When scheduling a pod, the extender allows an external process to filter and
prioritize nodes. Two separate http/https calls are issued to the extender, one
for "filter" and one for "prioritize" actions. To use the extender, you must
create a scheduler policy configuration file. The configuration specifies how to
reach the extender, whether to use http or https and the timeout.
```go
// Holds the parameters used to communicate with the extender. If a verb is unspecified/empty,
// it is assumed that the extender chose not to provide that extension.
type ExtenderConfig struct {
	// URLPrefix at which the extender is available
	URLPrefix string `json:"urlPrefix"`
	// Verb for the filter call, empty if not supported. This verb is appended to the URLPrefix when issuing the filter call to extender.
	FilterVerb string `json:"filterVerb,omitempty"`
	// Verb for the prioritize call, empty if not supported. This verb is appended to the URLPrefix when issuing the prioritize call to extender.
	PrioritizeVerb string `json:"prioritizeVerb,omitempty"`
	// The numeric multiplier for the node scores that the prioritize call generates.
	// The weight should be a positive integer
	Weight int `json:"weight,omitempty"`
	// EnableHttps specifies whether https should be used to communicate with the extender
	EnableHttps bool `json:"enableHttps,omitempty"`
	// TLSConfig specifies the transport layer security config
	TLSConfig *client.TLSClientConfig `json:"tlsConfig,omitempty"`
	// HTTPTimeout specifies the timeout duration for a call to the extender. Filter timeout fails the scheduling of the pod. Prioritize
	// timeout is ignored, k8s/other extenders priorities are used to select the node.
	HTTPTimeout time.Duration `json:"httpTimeout,omitempty"`
}
```
A sample scheduler policy file with extender configuration:
```json
{
  "predicates": [
    {
      "name": "HostName"
    },
    {
      "name": "MatchNodeSelector"
    },
    {
      "name": "PodFitsResources"
    }
  ],
  "priorities": [
    {
      "name": "LeastRequestedPriority",
      "weight": 1
    }
  ],
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:12345/api/scheduler",
      "filterVerb": "filter",
      "enableHttps": false
    }
  ]
}
```
Arguments passed to the FilterVerb endpoint on the extender are the set of nodes
filtered through the k8s predicates and the pod. Arguments passed to the
PrioritizeVerb endpoint on the extender are the set of nodes filtered through
the k8s predicates and extender predicates and the pod.
```go
// ExtenderArgs represents the arguments needed by the extender to filter/prioritize
// nodes for a pod.
type ExtenderArgs struct {
	// Pod being scheduled
	Pod api.Pod `json:"pod"`
	// List of candidate nodes where the pod can be scheduled
	Nodes api.NodeList `json:"nodes"`
}
```
The "filter" call returns a list of nodes (schedulerapi.ExtenderFilterResult). The "prioritize" call
returns priorities for each node (schedulerapi.HostPriorityList).
The "filter" call may prune the set of nodes based on its predicates. Scores
returned by the "prioritize" call are added to the k8s scores (computed through
its priority functions) and used for final host selection.
Multiple extenders can be configured in the scheduler policy.
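The way the two calls feed into host selection can be sketched as follows. This is not the scheduler's code: the helper names are hypothetical, nodes are reduced to their names, scores to plain integers, and breaking ties by node name is an assumption made only to keep the sketch deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// applyFilter keeps only the candidates that the extender's filter call returned.
func applyFilter(candidates, extenderKept []string) []string {
	kept := make(map[string]bool, len(extenderKept))
	for _, n := range extenderKept {
		kept[n] = true
	}
	var out []string
	for _, n := range candidates {
		if kept[n] {
			out = append(out, n)
		}
	}
	return out
}

// combineScores adds weight*extenderScore to each node's k8s score.
func combineScores(k8s, extender map[string]int, weight int) map[string]int {
	total := make(map[string]int, len(k8s))
	for node, score := range k8s {
		total[node] = score + weight*extender[node]
	}
	return total
}

// bestNode returns the highest-scoring node, breaking ties by name.
func bestNode(scores map[string]int) string {
	names := make([]string, 0, len(scores))
	for n := range scores {
		names = append(names, n)
	}
	sort.Strings(names)
	best := ""
	for _, n := range names {
		if best == "" || scores[n] > scores[best] {
			best = n
		}
	}
	return best
}

func main() {
	nodes := applyFilter([]string{"node-a", "node-b", "node-c"}, []string{"node-a", "node-b"})
	total := combineScores(map[string]int{"node-a": 5, "node-b": 7}, map[string]int{"node-a": 3}, 2)
	fmt.Println(nodes, total, bestNode(total))
}
```

With several extenders configured, the same two steps would simply be repeated per extender before the final selection.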
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduler_extender.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/scheduler_extender.md)


@ -1,266 +1 @@
## Abstract
A proposal for adding **alpha** support for
[seccomp](https://github.com/seccomp/libseccomp) to Kubernetes. Seccomp is a
system call filtering facility in the Linux kernel which lets applications
define limits on system calls they may make, and what should happen when
system calls are made. Seccomp is used to reduce the attack surface available
to applications.
## Motivation
Applications use seccomp to restrict the set of system calls they can make.
Recently, container runtimes have begun adding features to allow the runtime
to interact with seccomp on behalf of the application, which eliminates the
need for applications to link against libseccomp directly. Adding support in
the Kubernetes API for describing seccomp profiles will allow administrators
greater control over the security of workloads running in Kubernetes.
Goals of this design:
1. Describe how to reference seccomp profiles in containers that use them
## Constraints and Assumptions
This design should:
* build upon previous security context work
* be container-runtime agnostic
* allow use of custom profiles
* facilitate containerized applications that link directly to libseccomp
## Use Cases
1. As an administrator, I want to be able to grant access to a seccomp profile
to a class of users
2. As a user, I want to run an application with a seccomp profile similar to
the default one provided by my container runtime
3. As a user, I want to run an application which is already libseccomp-aware
in a container, and for my application to manage interacting with seccomp
unmediated by Kubernetes
4. As a user, I want to be able to use a custom seccomp profile and use
it with my containers
### Use Case: Administrator access control
Controlling access to seccomp profiles is a cluster administrator
concern. It should be possible for an administrator to control which users
have access to which profiles.
The [pod security policy](https://github.com/kubernetes/kubernetes/pull/7893)
API extension governs the ability of users to make requests that affect pod
and container security contexts. The proposed design should deal with
required changes to control access to new functionality.
### Use Case: Seccomp profiles similar to container runtime defaults
Many users will want to use images that make assumptions about running in the
context of their chosen container runtime. Such images are likely to
frequently assume that they are running in the context of the container
runtime's default seccomp settings. Therefore, it should be possible to
express a seccomp profile similar to a container runtime's defaults.
As an example, all dockerhub 'official' images are compatible with the Docker
default seccomp profile. So, any user who wanted to run one of these images
with seccomp would want the default profile to be accessible.
### Use Case: Applications that link to libseccomp
Some applications already link to libseccomp and control seccomp directly. It
should be possible to run these applications unmodified in Kubernetes; this
implies there should be a way to disable seccomp control in Kubernetes for
certain containers, or to run with a "no-op" or "unconfined" profile.
Sometimes, applications that link to seccomp can use the default profile for a
container runtime, and restrict further on top of that. It is important to
note here that in this case, applications can only place _further_
restrictions on themselves. It is not possible to re-grant the ability of a
process to make a system call once it has been removed with seccomp.
As an example, elasticsearch manages its own seccomp filters in its code.
Currently, elasticsearch is capable of running in the context of the default
Docker profile, but if in the future, elasticsearch needed to be able to call
`ioperm` or `iopl` (both of which are disallowed in the default profile), it
should be possible to run elasticsearch by delegating the seccomp controls to
the pod.
### Use Case: Custom profiles
Different applications have different requirements for seccomp profiles; it
should be possible to specify an arbitrary seccomp profile and use it in a
container. This is more of a concern for applications which need a higher
level of privilege than what is granted by the default profile for a cluster,
since applications that want to restrict privileges further can always make
additional calls in their own code.
An example of an application that requires the use of a syscall disallowed in
the Docker default profile is Chrome, which needs `clone` to create a new user
namespace. Another example would be a program which uses `ptrace` to
implement a sandbox for user-provided code, such as
[eval.in](https://eval.in/).
## Community Work
### Container runtime support for seccomp
#### Docker / opencontainers
Docker supports the open container initiative's API for
seccomp, which is very close to the libseccomp API. It allows full
specification of seccomp filters, with arguments, operators, and actions.
Docker allows the specification of a single seccomp filter. There are
community requests for:
Issues:
* [docker/22109](https://github.com/docker/docker/issues/22109): composable
seccomp filters
* [docker/21105](https://github.com/docker/docker/issues/22105): custom
seccomp filters for builds
#### rkt / appcontainers
The `rkt` runtime delegates to systemd for seccomp support; there is an open
issue to add support once `appc` supports it. The `appc` project has an open
issue to be able to describe seccomp as an isolator in an appc pod.
The systemd seccomp facility is based on a whitelist of system calls that can
be made, rather than a full filter specification.
Issues:
* [appc/529](https://github.com/appc/spec/issues/529)
* [rkt/1614](https://github.com/coreos/rkt/issues/1614)
#### HyperContainer
[HyperContainer](https://hypercontainer.io) does not support seccomp.
### Other platforms and seccomp-like capabilities
FreeBSD has a seccomp/capability-like facility called
[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4).
#### lxd
[`lxd`](http://www.ubuntu.com/cloud/lxd) constrains containers using a default profile.
Issues:
* [lxd/1084](https://github.com/lxc/lxd/issues/1084): add knobs for seccomp
## Proposed Design
### Seccomp API Resource?
An earlier draft of this proposal described a new global API resource that
could be used to describe seccomp profiles. After some discussion, it was
determined that without a feedback signal from users indicating a need to
describe new profiles in the Kubernetes API, it is not possible to know
whether a new API resource is warranted.
That being the case, we will not propose a new API resource at this time. If
there is strong community desire for such a resource, we may consider it in
the future.
Instead of implementing a new API resource, we propose that pods be able to
reference seccomp profiles by name. Since this is an alpha feature, we will
use annotations instead of extending the API with new fields.
### API changes?
In the alpha version of this feature we will use annotations to store the
names of seccomp profiles. The keys will be:
`container.seccomp.security.alpha.kubernetes.io/<container name>`
which will be used to set the seccomp profile of a container, and:
`seccomp.security.alpha.kubernetes.io/pod`
which will set the seccomp profile for the containers of an entire pod. If a
pod-level annotation is present, and a container-level annotation is present
for a container, then the container-level profile takes precedence.
The value of these keys should be container-runtime agnostic. We will
establish a format that expresses the conventions for distinguishing between
an unconfined profile, the container runtime's default, or a custom profile.
Since the format of a profile is likely to be runtime dependent, we will
consider profiles to be opaque to Kubernetes for now.
The value format is scoped as follows:
1. `runtime/default` - the default profile for the container runtime
2. `unconfined` - unconfined profile, i.e., no seccomp sandboxing
3. `localhost/<profile-name>` - the profile installed to the node's local seccomp profile root
Since seccomp profile schemes may vary between container runtimes, we will
treat the contents of profiles as opaque for now and avoid attempting to find
a common way to describe them. It is up to the container runtime to be
sensitive to the annotations proposed here and to interpret instructions about
local profiles.
A new area on disk (which we will call the seccomp profile root) must be
established to hold seccomp profiles. A field will be added to the Kubelet
for the seccomp profile root and a knob (`--seccomp-profile-root`) exposed to
allow admins to set it. If unset, it should default to the `seccomp`
subdirectory of the kubelet root directory.
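A sketch of how a runtime integration might resolve these annotations follows. The function names are hypothetical, the root path used in `main` is only an example value, and the sketch does not guard against profile names that contain `..`, which a real implementation would need to reject.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

const (
	podKey             = "seccomp.security.alpha.kubernetes.io/pod"
	containerKeyPrefix = "container.seccomp.security.alpha.kubernetes.io/"
)

// profileFor returns the profile for a container: the container-level
// annotation takes precedence over the pod-level one. An empty result
// means no profile was requested.
func profileFor(annotations map[string]string, container string) string {
	if v, ok := annotations[containerKeyPrefix+container]; ok {
		return v
	}
	return annotations[podKey]
}

// localProfilePath maps "localhost/<profile-name>" to a file under the
// seccomp profile root. It reports false for any other profile value.
func localProfilePath(profile, root string) (string, bool) {
	name := strings.TrimPrefix(profile, "localhost/")
	if name == profile || name == "" {
		return "", false
	}
	return path.Join(root, name), true
}

func main() {
	annotations := map[string]string{
		podKey:                          "runtime/default",
		containerKeyPrefix + "explorer": "localhost/example-explorer-profile",
	}
	root := "/var/lib/kubelet/seccomp" // example value for --seccomp-profile-root

	for _, c := range []string{"explorer", "sidecar"} {
		profile := profileFor(annotations, c)
		p, isLocal := localProfilePath(profile, root)
		fmt.Println(c, profile, p, isLocal)
	}
}
```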
### Pod Security Policy annotation
The `PodSecurityPolicy` type should be annotated with the allowed seccomp
profiles using the key
`seccomp.security.alpha.kubernetes.io/allowedProfileNames`. The value of this
key should be a comma delimited list.
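Checking a requested profile against that annotation could look like the sketch below. The helper name is hypothetical, and trimming whitespace around list entries is an assumption, since the proposal only says the value is comma delimited.

```go
package main

import (
	"fmt"
	"strings"
)

// profileAllowed reports whether profile appears in the comma-delimited
// allowedProfileNames annotation value.
func profileAllowed(allowedProfileNames, profile string) bool {
	for _, p := range strings.Split(allowedProfileNames, ",") {
		if strings.TrimSpace(p) == profile {
			return true
		}
	}
	return false
}

func main() {
	allowed := "runtime/default,localhost/example-explorer-profile"
	fmt.Println(profileAllowed(allowed, "runtime/default")) // true
	fmt.Println(profileAllowed(allowed, "unconfined"))      // false
}
```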
## Examples
### Unconfined profile
Here's an example of a pod that uses the unconfined profile:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trustworthy-pod
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: unconfined
spec:
  containers:
    - name: trustworthy-container
      image: sotrustworthy:latest
```
### Custom profile
Here's an example of a pod that uses a profile called
`example-explorer-profile` using the container-level annotation:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: explorer
  annotations:
    container.seccomp.security.alpha.kubernetes.io/explorer: localhost/example-explorer-profile
spec:
  containers:
    - name: explorer
      image: gcr.io/google_containers/explorer:1.0
      args: ["-port=8080"]
      ports:
        - containerPort: 8080
          protocol: TCP
      volumeMounts:
        - mountPath: "/mount/test-volume"
          name: test-volume
  volumes:
    - name: test-volume
      emptyDir: {}
```
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/seccomp.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/seccomp.md)

View File

@ -1,628 +1 @@
## Abstract
A proposal for the distribution of [secrets](../user-guide/secrets.md)
(passwords, keys, etc.) to the Kubelet and to containers inside Kubernetes using
a custom [volume](../user-guide/volumes.md#secrets) type. See the
[secrets example](../user-guide/secrets/) for more information.
## Motivation
Secrets are needed in containers to access internal resources like the
Kubernetes master or external resources such as git repositories, databases,
etc. Users may also want behaviors in the kubelet that depend on secret data
(credentials for image pull from a docker registry) associated with pods.
Goals of this design:
1. Describe a secret resource
2. Define the various challenges attendant to managing secrets on the node
3. Define a mechanism for consuming secrets in containers without modification
## Constraints and Assumptions
* This design does not prescribe a method for storing secrets; storage of
secrets should be pluggable to accommodate different use-cases
* Encryption of secret data and node security are orthogonal concerns
* It is assumed that node and master are secure and that compromising their
security could also compromise secrets:
* If a node is compromised, the only secrets that could potentially be
exposed should be the secrets belonging to containers scheduled onto it
* If the master is compromised, all secrets in the cluster may be exposed
* Secret rotation is an orthogonal concern, but it should be facilitated by
this proposal
* A user who can consume a secret in a container can know the value of the
secret; secrets must be provisioned judiciously
## Use Cases
1. As a user, I want to store secret artifacts for my applications and consume
them securely in containers, so that I can keep the configuration for my
applications separate from the images that use them:
    1. As a cluster operator, I want to allow a pod to access the Kubernetes
    master using a custom `.kubeconfig` file, so that I can securely reach the
    master
    2. As a cluster operator, I want to allow a pod to access a Docker registry
    using credentials from a `.dockercfg` file, so that containers can push images
    3. As a cluster operator, I want to allow a pod to access a git repository
    using SSH keys, so that I can push to and fetch from the repository
2. As a user, I want to allow containers to consume supplemental information
about services such as username and password which should be kept secret, so
that I can share secrets about a service amongst the containers in my
application securely
3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a
secret and have the kubelet implement some reserved behaviors based on the types
of secrets the service account consumes:
    1. Use credentials for a docker registry to pull the pod's docker image
    2. Present Kubernetes auth token to the pod or transparently decorate
    traffic between the pod and master service
4. As a user, I want to be able to indicate that a secret expires and for that
secret's value to be rotated once it expires, so that the system can help me
follow good practices
### Use-Case: Configuration artifacts
Many configuration files contain secrets intermixed with other configuration
information. For example, a user's application may contain a properties file
that contains database credentials, SaaS API tokens, etc. Users should be able
to consume configuration artifacts in their containers and be able to control
the path on the container's filesystems where the artifact will be presented.
### Use-Case: Metadata about services
Most pieces of information about how to use a service are secrets. For example,
a service that provides a MySQL database needs to provide the username,
password, and database name to consumers so that they can authenticate and use
the correct database. Containers in pods consuming the MySQL service would also
consume the secrets associated with the MySQL service.
### Use-Case: Secrets associated with service accounts
[Service Accounts](service_accounts.md) are proposed as a mechanism to decouple
capabilities and security contexts from individual human users. A
`ServiceAccount` contains references to some number of secrets. A `Pod` can
specify that it is associated with a `ServiceAccount`. Secrets should have a
`Type` field to allow the Kubelet and other system components to take action
based on the secret's type.
#### Example: service account consumes auth token secret
As an example, the service account proposal discusses service accounts consuming
secrets which contain Kubernetes auth tokens. When a Kubelet starts a pod
associated with a service account which consumes this type of secret, the
Kubelet may take a number of actions:
1. Expose the secret in a `.kubernetes_auth` file in a well-known location in
the container's file system
2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod
to the `kubernetes-master` service with the auth token, e.g. by adding a header
to the request (see the [LOAS Daemon](http://issue.k8s.io/2209) proposal)
#### Example: service account consumes docker registry credentials
Another example use case is where a pod is associated with a secret containing
docker registry credentials. The Kubelet could use these credentials for the
docker pull to retrieve the image.
### Use-Case: Secret expiry and rotation
Rotation is considered a good practice for many types of secret data. It should
be possible to express that a secret has an expiry date; this would make it
possible to implement a system component that could regenerate expired secrets.
As an example, consider a component that rotates expired secrets. The rotator
could periodically regenerate the values for expired secrets of common types and
update their expiry dates.
## Deferral: Consuming secrets as environment variables
Some images will expect to receive configuration items as environment variables
instead of files. We should consider what the best way to allow this is; there
are a few different options:
1. Force the user to adapt files into environment variables. Users can store
secrets that need to be presented as environment variables in a format that is
easy to consume from a shell:
$ cat /etc/secrets/my-secret.txt
export MY_SECRET_ENV=MY_SECRET_VALUE
The user could `source` the file at `/etc/secrets/my-secret.txt` prior to
executing the command for the image either inline in the command or in an init
script.
2. Give secrets an attribute that allows users to express the intent that the
platform should generate the above syntax in the file used to present a secret.
The user could consume these files in the same manner as the above option.
3. Give secrets attributes that allow the user to express that the secret
should be presented to the container as an environment variable. The container's
environment would contain the desired values and the software in the container
could use them without accommodation in the command or setup script.
For our initial work, we will treat all secrets as files to narrow the problem
space. There will be a future proposal that handles exposing secrets as
environment variables.
## Flow analysis of secret data with respect to the API server
There are two fundamentally different use-cases for access to secrets:
1. CRUD operations on secrets by their owners
2. Read-only access to the secrets needed for a particular node by the kubelet
### Use-Case: CRUD operations by owners
In use cases for CRUD operations, the user experience for secrets should be no
different than for other API resources.
#### Data store backing the REST API
The data store backing the REST API should be pluggable because different
cluster operators will have different preferences for the central store of
secret data. Some possibilities for storage:
1. An etcd collection alongside the storage for other API resources
2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module)
3. A secrets server like [Vault](https://www.vaultproject.io/) or
[Keywhiz](https://square.github.io/keywhiz/)
4. An external datastore such as an external etcd, RDBMS, etc.
#### Size limit for secrets
There should be a size limit for secrets in order to:
1. Prevent DoS attacks against the API server
2. Allow kubelet implementations that prevent secret data from touching the
node's filesystem
The size limit should satisfy the following conditions:
1. Large enough to store common artifact types (encryption keypairs,
certificates, small configuration files)
2. Small enough to avoid large impact on node resource consumption (storage,
RAM for tmpfs, etc)
To begin discussion, we propose an initial value for this size limit of **1MB**.
#### Other limitations on secrets
Defining a policy for limitations on how a secret may be referenced by another
API resource and how constraints should be applied throughout the cluster is
tricky due to the number of variables involved:
1. Should there be a maximum number of secrets a pod can reference via a
volume?
2. Should there be a maximum number of secrets a service account can reference?
3. Should there be a total maximum number of secrets a pod can reference via
its own spec and its associated service account?
4. Should there be a total size limit on the amount of secret data consumed by
a pod?
5. How will cluster operators want to be able to configure these limits?
6. How will these limits impact API server validations?
7. How will these limits affect scheduling?
For now, we will not implement validations around these limits. Cluster
operators will decide how much node storage is allocated to secrets. It will be
the operator's responsibility to ensure that the allocated storage is sufficient
for the workload scheduled onto a node.
For now, kubelets will only attach secrets to api-sourced pods, and not file-
or http-sourced ones. Attaching secrets to file- or http-sourced pods would:
- confuse the secrets admission controller in the case of mirror pods.
- create an apiserver-liveness dependency -- avoiding this dependency is a
main reason to use non-api-sourced pods.
### Use-Case: Kubelet read of secrets for node
The use-case where the kubelet reads secrets has several additional requirements:
1. Kubelets should only be able to receive secret data which is required by
pods scheduled onto the kubelet's node
2. Kubelets should have read-only access to secret data
3. Secret data should not be transmitted over the wire insecurely
4. Kubelets must ensure pods do not have access to each other's secrets
#### Read of secret data by the Kubelet
The Kubelet should only be allowed to read secrets which are consumed by pods
scheduled onto that Kubelet's node and their associated service accounts.
Authorization of the Kubelet to read this data would be delegated to an
authorization plugin and associated policy rule.
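The policy rule described above amounts to computing, per node, the set of secrets referenced by pods scheduled there. The sketch below illustrates that idea only; the `pod` and `secretKey` types and the `secretsForNode` function are simplified, hypothetical stand-ins rather than API types, and secrets reached through a service account are assumed to have been folded into `SecretNames` already.

```go
package main

import "fmt"

// pod is a simplified stand-in for the API Pod type (an assumption of this sketch).
type pod struct {
	Namespace   string
	NodeName    string
	SecretNames []string // secrets used by the pod and its service account
}

// secretKey identifies a secret by namespace and name.
type secretKey struct {
	Namespace string
	Name      string
}

// secretsForNode returns the secrets a Kubelet on the given node would be
// authorized to read: only those consumed by pods scheduled onto that node.
func secretsForNode(pods []pod, node string) map[secretKey]bool {
	allowed := make(map[secretKey]bool)
	for _, p := range pods {
		if p.NodeName != node {
			continue
		}
		for _, name := range p.SecretNames {
			allowed[secretKey{Namespace: p.Namespace, Name: name}] = true
		}
	}
	return allowed
}

func main() {
	pods := []pod{
		{Namespace: "example", NodeName: "node-a", SecretNames: []string{"ssh-key-secret"}},
		{Namespace: "example", NodeName: "node-b", SecretNames: []string{"prod-db-secret"}},
	}
	fmt.Println(secretsForNode(pods, "node-a"))
}
```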
#### Secret data on the node: data at rest
Consideration must be given to whether secret data should be allowed to be at
rest on the node:
1. If secret data is not allowed to be at rest, the size of secret data becomes
another draw on the node's RAM - should it affect scheduling?
2. If secret data is allowed to be at rest, should it be encrypted?
    1. If so, how should this be done?
    2. If not, what threats exist? What types of secret are appropriate to
    store this way?
For the sake of limiting complexity, we propose that initially secret data
should not be allowed to be at rest on a node; secret data should be stored on a
node-level tmpfs filesystem. This filesystem can be subdivided into directories
for use by the kubelet and by the volume plugin.
#### Secret data on the node: resource consumption
The Kubelet will be responsible for creating the per-node tmpfs file system for
secret storage. It is hard to make a prescriptive declaration about how much
storage is appropriate to reserve for secrets because different installations
will vary widely in available resources, desired pod to node density, overcommit
policy, and other operation dimensions. That being the case, we propose for
simplicity that the amount of secret storage be controlled by a new parameter to
the kubelet with a default value of **64MB**. It is the cluster operator's
responsibility to handle choosing the right storage size for their installation
and configuring their Kubelets correctly.
Configuring each Kubelet is not the ideal story for operator experience; it is
more intuitive that the cluster-wide storage size be readable from a central
configuration store like the one proposed in [#1553](http://issue.k8s.io/1553).
When such a store exists, the Kubelet could be modified to read this
configuration item from the store.
When the Kubelet is modified to advertise node resources (as proposed in
[#4441](http://issue.k8s.io/4441)), the capacity calculation
for available memory should factor in the potential size of the node-level tmpfs
in order to avoid memory overcommit on the node.
#### Secret data on the node: isolation
Every pod will have a [security context](security_context.md).
Secret data on the node should be isolated according to the security context of
the container. The Kubelet volume plugin API will be changed so that a volume
plugin receives the security context of a volume along with the volume spec.
This will allow volume plugins to implement setting the security context of
volumes they manage.
## Community work
Several proposals / upstream patches are notable as background for this
proposal:
1. [Docker vault proposal](https://github.com/docker/docker/issues/10310)
2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277)
3. [Kubernetes service account proposal](service_accounts.md)
4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075)
5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697)
## Proposed Design
We propose a new `Secret` resource which is mounted into containers with a new
volume type. Secret volumes will be handled by a volume plugin that does the
actual work of fetching the secret and storing it. Secrets contain multiple
pieces of data that are presented as different files within the secret volume
(example: SSH key pair).
In order to remove the burden from the end user in specifying every file that a
secret consists of, it should be possible to mount all files provided by a
secret with a single `VolumeMount` entry in the container specification.
### Secret API Resource
A new resource for secrets will be added to the API:
```go
type Secret struct {
	TypeMeta
	ObjectMeta

	// Data contains the secret data. Each key must be a valid DNS_SUBDOMAIN.
	// The serialized form of the secret data is a base64 encoded string,
	// representing the arbitrary (possibly non-string) data value here.
	Data map[string][]byte `json:"data,omitempty"`

	// Used to facilitate programmatic handling of secret data.
	Type SecretType `json:"type,omitempty"`
}

type SecretType string

const (
	SecretTypeOpaque              SecretType = "Opaque"                              // Opaque (arbitrary data; default)
	SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token" // Kubernetes auth token
	SecretTypeDockercfg           SecretType = "kubernetes.io/dockercfg"             // Docker registry auth
	SecretTypeDockerConfigJson    SecretType = "kubernetes.io/dockerconfigjson"      // Latest Docker registry auth
	// FUTURE: other type values
)

const MaxSecretSize = 1 * 1024 * 1024
```
A Secret can declare a type in order to provide type information to system
components that work with secrets. The default type is `Opaque`, which
represents arbitrary user-owned data.
Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must
be valid DNS subdomains.
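As a rough sketch of these two validations: the function name `validateSecretData` is hypothetical, the regular expression is one common way to express the DNS subdomain rules, and measuring size as the sum of the value lengths is an assumption, since the proposal does not say exactly how size is measured.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// MaxSecretSize mirrors the constant proposed above (1MB).
const MaxSecretSize = 1 * 1024 * 1024

// dnsSubdomain matches dot-separated labels of lowercase alphanumerics and '-',
// where each label starts and ends with an alphanumeric character.
var dnsSubdomain = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// validateSecretData is a hypothetical validator for the Data field.
func validateSecretData(data map[string][]byte) error {
	total := 0
	for key, value := range data {
		// A DNS subdomain is at most 253 characters long.
		if len(key) > 253 || !dnsSubdomain.MatchString(key) {
			return fmt.Errorf("key %q is not a valid DNS subdomain", key)
		}
		total += len(value) // assumption: size is the sum of the raw value lengths
	}
	if total > MaxSecretSize {
		return errors.New("secret data exceeds MaxSecretSize")
	}
	return nil
}

func main() {
	err := validateSecretData(map[string][]byte{
		"id-rsa":     []byte("private"),
		"id-rsa.pub": []byte("public"),
	})
	fmt.Println(err)
}
```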
A new REST API and registry interface will be added to accompany the `Secret`
resource. The default implementation of the registry will store `Secret`
information in etcd. Future registry implementations could store the `TypeMeta`
and `ObjectMeta` fields in etcd and store the secret data in another data store
entirely, or store the whole object in another data store.
#### Other validations related to secrets
Initially there will be no validations for the number of secrets a pod
references, or the number of secrets that can be associated with a service
account. These may be added in the future as the finer points of secrets and
resource allocation are fleshed out.
### Secret Volume Source
A new `SecretSource` type of volume source will be added to the `VolumeSource`
struct in the API:
```go
type VolumeSource struct {
	// Other fields omitted

	// SecretSource represents a secret that should be presented in a volume
	SecretSource *SecretSource `json:"secret"`
}

type SecretSource struct {
	Target ObjectReference
}
```
Secret volume sources are validated to ensure that the specified object
reference actually points to an object of type `Secret`.
In the future, the `SecretSource` will be extended to allow:
1. Fine-grained control over which pieces of secret data are exposed in the
volume
2. The paths and filenames for how secret data are exposed
### Secret Volume Plugin
A new Kubelet volume plugin will be added to handle volumes with a secret
source. This plugin will require access to the API server to retrieve secret
data and therefore the volume `Host` interface will have to change to expose a
client interface:
```go
type Host interface {
	// Other methods omitted

	// GetKubeClient returns a client interface
	GetKubeClient() client.Interface
}
```
The secret volume plugin will be responsible for:
1. Returning a `volume.Mounter` implementation from `NewMounter` that:
1. Retrieves the secret data for the volume from the API server
2. Places the secret data onto the container's filesystem
3. Sets the correct security attributes for the volume based on the pod's
`SecurityContext`
2. Returning a `volume.Unmounter` implementation from `NewUnmounter` that
cleans the volume from the container's filesystem
### Kubelet: Node-level secret storage
The Kubelet must be modified to accept a new parameter for the secret storage
size and to create a tmpfs file system of that size to store secret data. Rough
accounting of specific changes:
1. The Kubelet should have a new field added called `secretStorageSize`; units
are megabytes
2. `NewMainKubelet` should accept a value for secret storage size
3. The Kubelet server should have a new flag added for secret storage size
4. The Kubelet's `setupDataDirs` method should be changed to create the secret
storage
### Kubelet: New behaviors for secrets associated with service accounts
For use-cases where the Kubelet's behavior is affected by the secrets associated
with a pod's `ServiceAccount`, the Kubelet will need to be changed. For example,
if secrets of type `kubernetes.io/dockercfg` affect how the pod's images are pulled, the
Kubelet will need to be changed to accommodate this. Subsequent proposals can
address this on a type-by-type basis.
## Examples
For clarity, let's examine some detailed examples of some common use-cases in
terms of the suggested changes. All of these examples are assumed to be created
in a namespace called `example`.
### Use-Case: Pod with ssh keys
To create a pod that uses an ssh key stored as a secret, we first need to create
a secret:
```json
{
"kind": "Secret",
"apiVersion": "v1",
"metadata": {
"name": "ssh-key-secret"
},
"data": {
"id-rsa": "dmFsdWUtMg0KDQo=",
"id-rsa.pub": "dmFsdWUtMQ0K"
}
}
```
**Note:** The serialized JSON and YAML values of secret data are encoded as
base64 strings. Newlines are not valid within these strings and must be
omitted.
Now we can create a pod which references the secret with the ssh key and
consumes it in a volume:
```json
{
"kind": "Pod",
"apiVersion": "v1",
"metadata": {
"name": "secret-test-pod",
"labels": {
"name": "secret-test"
}
},
"spec": {
"volumes": [
{
"name": "secret-volume",
"secret": {
"secretName": "ssh-key-secret"
}
}
],
"containers": [
{
"name": "ssh-test-container",
"image": "mySshImage",
"volumeMounts": [
{
"name": "secret-volume",
"readOnly": true,
"mountPath": "/etc/secret-volume"
}
]
}
]
}
}
```
When the container's command runs, the pieces of the key will be available in:
/etc/secret-volume/id-rsa.pub
/etc/secret-volume/id-rsa
The container is then free to use the secret data to establish an ssh
connection.
### Use-Case: Pods with prod / test credentials
This example illustrates a pod which consumes a secret containing prod
credentials and another pod which consumes a secret with test environment
credentials.
The secrets:
```json
{
"apiVersion": "v1",
"kind": "List",
"items":
[{
"kind": "Secret",
"apiVersion": "v1",
"metadata": {
"name": "prod-db-secret"
},
"data": {
"password": "dmFsdWUtMg0KDQo=",
"username": "dmFsdWUtMQ0K"
}
},
{
"kind": "Secret",
"apiVersion": "v1",
"metadata": {
"name": "test-db-secret"
},
"data": {
"password": "dmFsdWUtMg0KDQo=",
"username": "dmFsdWUtMQ0K"
}
}]
}
```
The pods:
```json
{
"apiVersion": "v1",
"kind": "List",
"items":
[{
"kind": "Pod",
"apiVersion": "v1",
"metadata": {
"name": "prod-db-client-pod",
"labels": {
"name": "prod-db-client"
}
},
"spec": {
"volumes": [
{
"name": "secret-volume",
"secret": {
"secretName": "prod-db-secret"
}
}
],
"containers": [
{
"name": "db-client-container",
"image": "myClientImage",
"volumeMounts": [
{
"name": "secret-volume",
"readOnly": true,
"mountPath": "/etc/secret-volume"
}
]
}
]
}
},
{
"kind": "Pod",
"apiVersion": "v1",
"metadata": {
"name": "test-db-client-pod",
"labels": {
"name": "test-db-client"
}
},
"spec": {
"volumes": [
{
"name": "secret-volume",
"secret": {
"secretName": "test-db-secret"
}
}
],
"containers": [
{
"name": "db-client-container",
"image": "myClientImage",
"volumeMounts": [
{
"name": "secret-volume",
"readOnly": true,
"mountPath": "/etc/secret-volume"
}
]
}
]
}
}]
}
```
The specs for the two pods differ only in the value of the object referred to by
the secret volume source. Both containers will have the following files present
on their filesystems:
/etc/secret-volume/username
/etc/secret-volume/password
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/secrets.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/secrets.md)

View File

@ -1,218 +1 @@
# Security in Kubernetes
Kubernetes should define a reasonable set of security best practices that allows
processes to be isolated from each other, from the cluster infrastructure, and
which preserves important boundaries between those who manage the cluster, and
those who use the cluster.
While Kubernetes today is not primarily a multi-tenant system, the long term
evolution of Kubernetes will increasingly rely on proper boundaries between
users and administrators. The code running on the cluster must be appropriately
isolated and secured to prevent malicious parties from affecting the entire
cluster.
## High Level Goals
1. Ensure a clear isolation between the container and the underlying host it
runs on
2. Limit the ability of the container to negatively impact the infrastructure
or other containers
3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) -
ensure components are only authorized to perform the actions they need, and
limit the scope of a compromise by limiting the capabilities of individual
components
4. Reduce the number of systems that have to be hardened and secured by
defining clear boundaries between components
5. Allow users of the system to be cleanly separated from administrators
6. Allow administrative functions to be delegated to users where necessary
7. Allow applications to be run on the cluster that have "secret" data (keys,
certs, passwords) which is properly abstracted from "public" data.
## Use cases
### Roles
We define "user" as a unique identity accessing the Kubernetes API server, which
may be a human or an automated process. Human users fall into the following
categories:
1. k8s admin - administers a Kubernetes cluster and has access to the underlying
components of the system
2. k8s project administrator - administrates the security of a small subset of
the cluster
3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster
resources
Automated process users fall into the following categories:
1. k8s container user - a user that processes running inside a container (on the
cluster) can use to access other cluster resources independent of the human
users attached to a project
2. k8s infrastructure user - the user that Kubernetes infrastructure components
use to perform cluster functions with clearly defined roles
### Description of roles
* Developers:
* write pod specs.
* make some of their own images, and use some "community" docker images
* know which pods need to talk to which other pods
* decide which pods should share files with other pods, and which should not.
* reason about application level security, such as containing the effects of a
local-file-read exploit in a webserver pod.
* do not often reason about operating system or organizational security.
* are not necessarily comfortable reasoning about the security properties of a
system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.
* Project Admins:
* allocate identity and roles within a namespace
* reason about organizational security within a namespace
* don't give a developer permissions that are not needed for their role.
* protect files on shared storage from unnecessary cross-team access
* are less focused on application security
* Administrators:
* are less focused on application security and more focused on operating
system security.
* protect the node from bad actors in containers, and properly-configured
innocent containers from bad actors in other containers.
* are comfortable reasoning about the security properties of a system at the level
of detail of Linux Capabilities, SELinux, AppArmor, etc.
* decide who can use which Linux Capabilities, run privileged containers, use
hostPath, etc.
* e.g. a team that manages Ceph or a mysql server might be trusted to have
raw access to storage devices in some organizations, but teams that develop the
applications at higher layers would not.
## Proposed Design
A pod runs in a *security context* under a *service account* that is defined by
an administrator or project administrator, and the *secrets* a pod has access to
are limited by that *service account*.
1. The API should authenticate and authorize user actions (see [authn and authz](access.md))
2. All infrastructure components (kubelets, kube-proxies, controllers,
scheduler) should have an infrastructure user that they can authenticate with
and be authorized to perform only the functions they require against the API.
3. Most infrastructure components should use the API as a way of exchanging data
and changing the system, and only the API should have access to the underlying
data store (etcd)
4. When containers run on the cluster and need to talk to other containers or
the API server, they should be identified and authorized clearly as an
autonomous process via a [service account](service_accounts.md)
1. If the user who started a long-lived process is removed from access to
the cluster, the process should be able to continue without interruption
2. If the users who started processes are removed from the cluster,
administrators may wish to terminate their processes in bulk
3. When containers run with a service account, the user that created /
triggered the service account behavior must be associated with the container's
action
5. When container processes run on the cluster, they should run in a
[security context](security_context.md) that isolates those processes via Linux
user security, user namespaces, and permissions.
1. Administrators should be able to configure the cluster to automatically
confine all container processes as a non-root, randomly assigned UID
2. Administrators should be able to ensure that container processes within
the same namespace are all assigned the same unix user UID
3. Administrators should be able to limit which developers and project
administrators have access to higher privilege actions
4. Project administrators should be able to run pods within a namespace
under different security contexts, and developers must be able to specify which
of the available security contexts they may use
5. Developers should be able to run their own images or images from the
community and expect those images to run correctly
6. Developers may need to ensure their images work within higher security
requirements specified by administrators
7. When available, Linux kernel user namespaces can be used to ensure 5.2
and 5.4 are met.
8. When application developers want to share filesystem data via distributed
filesystems, the Unix user ids on those filesystems must be consistent across
different container processes
6. Developers should be able to define [secrets](secrets.md) that are
automatically added to the containers when pods are run
1. Secrets are files injected into the container whose values should not be
displayed within a pod. Examples:
1. An SSH private key for git cloning remote data
2. A client certificate for accessing a remote system
3. A private key and certificate for a web server
4. A .kubeconfig file with embedded cert / token data for accessing the
Kubernetes master
5. A .dockercfg file for pulling images from a protected registry
2. Developers should be able to define the pod spec so that a secret lands
in a specific location
3. Project administrators should be able to limit developers within a
namespace from viewing or modifying secrets (anyone who can launch an arbitrary
pod can view secrets)
4. Secrets are generally not copied from one namespace to another when a
developer's application definitions are copied
### Related design discussion
* [Authorization and authentication](access.md)
* [Secret distribution via files](http://pr.k8s.io/2030)
* [Docker secrets](https://github.com/docker/docker/pull/6697)
* [Docker vault](https://github.com/docker/docker/issues/10310)
* [Service Accounts](service_accounts.md)
* [Secret volumes](http://pr.k8s.io/4126)
## Specific Design Points
### TODO: authorization, authentication
### Isolate the data store from the nodes and supporting infrastructure
Access to the central data store (etcd) in Kubernetes allows an attacker to run
arbitrary containers on hosts, to gain access to any protected information
stored in either volumes or in pods (such as access tokens or shared secrets
provided as environment variables), to intercept and redirect traffic from
running services by inserting middlemen, or to simply delete the entire history
of the cluster.
As a general principle, access to the central data store should be restricted to
the components that need full control over the system and which can apply
appropriate authorization and authentication of change requests. In the future,
etcd may offer granular access control, but that granularity will require an
administrator to understand the schema of the data to properly apply security.
An administrator must be able to properly secure Kubernetes at a policy level,
rather than at an implementation level, and schema changes over time should not
risk unintended security leaks.
Both the Kubelet and Kube Proxy need information related to their specific roles -
for the Kubelet, the set of pods it should be running, and for the Proxy, the
set of services and endpoints to load balance. The Kubelet also needs to provide
information about running pods and historical termination data. The access
pattern for both Kubelet and Proxy to load their configuration is an efficient
"wait for changes" request over HTTP. It should be possible to limit the Kubelet
and Proxy to only access the information they need to perform their roles and no
more.
The controller manager for Replication Controllers and other future controllers
acts on behalf of a user via delegation to perform automated maintenance on
Kubernetes resources. The ability of these controllers to access or modify
resource state should be strictly limited to their intended duties, and they
should be prevented from accessing information not pertinent to their role. For
example, a replication controller needs only to create a copy of a known pod
configuration, to determine the running state of an existing pod, or to delete
an existing pod that it created - it does not need to know the contents or
current state of a pod, nor have access to any data in the pod's attached volumes.
The Kubernetes pod scheduler is responsible for reading data from the pod to fit
it onto a node in the cluster. At a minimum, it needs access to view the ID of a
pod (to craft the binding), its current state, any resource information
necessary to identify placement, and other data relevant to concerns like
anti-affinity, zone or region preference, or custom logic. It does not need the
ability to modify pods or see other resources, only to create bindings. It
should not need the ability to delete bindings unless the scheduler takes
control of relocating components on failed hosts (which could be implemented by
a separate component that can delete bindings but not create them). The
scheduler may need read access to user or project-container information to
determine preferential location (underspecified at this time).
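The least-privilege rules described above can be pictured as a deny-by-default table keyed by component. The following is only a sketch under assumptions: the component names, resource names, and verbs are hypothetical labels chosen for illustration and do not correspond to an existing Kubernetes authorization API.

```go
package main

import "fmt"

// Verb and Resource are hypothetical names used only in this sketch.
type Verb string
type Resource string

// policy lists, per component, the only (resource, verb) pairs it may use.
// The entries paraphrase the paragraphs above; they are not a real policy.
var policy = map[string]map[Resource][]Verb{
	"kubelet":                {"pods": {"watch", "update-status"}},
	"proxy":                  {"services": {"watch"}, "endpoints": {"watch"}},
	"replication-controller": {"pods": {"create", "get-status", "delete"}},
	"scheduler":              {"pods": {"get"}, "bindings": {"create"}},
}

// Allowed reports whether component may perform verb on res.
// Anything that is not listed explicitly is denied.
func Allowed(component string, res Resource, verb Verb) bool {
	for _, v := range policy[component][res] {
		if v == verb {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(Allowed("scheduler", "bindings", "create"))
	fmt.Println(Allowed("scheduler", "bindings", "delete"))
	fmt.Println(Allowed("proxy", "pods", "watch"))
}
```

In this sketch the scheduler can create bindings but cannot delete them, which mirrors the suggestion that deletion could belong to a separate component.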
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/security.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/security.md)

View File

@ -1,192 +1 @@
# Security Contexts
## Abstract
A security context is a set of constraints that are applied to a container in
order to achieve the following goals (from [security design](security.md)):
1. Ensure a clear isolation between container and the underlying host it runs
on
2. Limit the ability of the container to negatively impact the infrastructure
or other containers
## Background
The problem of securing containers in Kubernetes has come up
[before](http://issue.k8s.io/398) and the potential problems with container
security are [well known](http://opensource.com/business/14/7/docker-security-selinux).
Although it is not possible to completely isolate Docker containers from their
hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304)
make it possible to greatly reduce the attack surface.
## Motivation
### Container isolation
In order to improve container isolation from host and other containers running
on the host, containers should only be granted the access they need to perform
their work. To this end it should be possible to take advantage of Docker
features such as the ability to
[add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration)
and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration)
to the container process.
Support for user namespaces has recently been
[merged](https://github.com/docker/libcontainer/pull/304) into Docker's
libcontainer project and should soon surface in Docker itself. It will make it
possible to assign a range of unprivileged uids and gids from the host to each
container, improving the isolation between host and container and between
containers.
### External integration with shared storage
In order to support external integration with shared storage, processes running
in a Kubernetes cluster should be able to be uniquely identified by their Unix
UID, such that a chain of ownership can be established. Processes in pods will
need to have consistent UID/GID/SELinux category labels in order to access
shared disks.
## Constraints and Assumptions
* It is out of the scope of this document to prescribe a specific set of
constraints to isolate containers from their host. Different use cases need
different settings.
* The concept of a security context should not be tied to a particular security
mechanism or platform (e.g. SELinux, AppArmor)
* Applying a different security context to a scope (namespace or pod) requires
a solution such as the one proposed for [service accounts](service_accounts.md).
## Use Cases
In order of increasing complexity, the following are example use cases that would
be addressed with security contexts:
1. Kubernetes is used to run a single cloud application. In order to protect
nodes from containers:
* All containers run as a single non-root user
* Privileged containers are disabled
* All containers run with a particular MCS label
* Kernel capabilities like CHOWN and MKNOD are removed from containers
2. Just like case #1, except that more than one application runs on
the Kubernetes cluster.
* Each application is run in its own namespace to avoid name collisions
* For each application a different uid and MCS label is used
3. Kubernetes is used as the base for a PAAS with multiple projects, each
project represented by a namespace.
* Each namespace is associated with a range of uids/gids on the node that
are mapped to uids/gids on containers using Linux user namespaces.
* Certain pods in each namespace have special privileges to perform system
actions such as talking back to the server for deployment, running Docker builds,
etc.
* External NFS storage is assigned to each namespace and permissions set
using the range of uids/gids assigned to that namespace.
## Proposed Design
### Overview
A *security context* consists of a set of constraints that determine how a
container is secured before getting created and run. A security context resides
on the container and represents the runtime parameters that will be used to
create and run the container via container APIs. A *security context provider*
is passed to the Kubelet so it can have a chance to mutate Docker API calls in
order to apply the security context.
It is recommended that this design be implemented in two phases:
1. Implement the security context provider extension point in the Kubelet
so that a default security context can be applied on container run and creation.
2. Implement a security context structure that is part of a service account. The
default context provider can then be used to apply a security context based on
the service account associated with the pod.
### Security Context Provider
The Kubelet will have an interface that points to a `SecurityContextProvider`.
The `SecurityContextProvider` is invoked before creating and running a given
container:
```go
type SecurityContextProvider interface {
// ModifyContainerConfig is called before the Docker createContainer call.
// The security context provider can make changes to the Config with which
// the container is created.
// An error is returned if it's not possible to secure the container as
// requested with a security context.
ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config) error
// ModifyHostConfig is called before the Docker runContainer call.
// The security context provider can make changes to the HostConfig, affecting
// security options, whether the container is privileged, volume binds, etc.
// An error is returned if it's not possible to secure the container as requested
// with a security context.
ModifyHostConfig(pod *api.Pod, container *api.Container, hostConfig *docker.HostConfig) error
}
```
If the value of the SecurityContextProvider field on the Kubelet is nil, the
kubelet will create and run the container as it does today.
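As an illustration of how a provider might behave, here is a minimal sketch. It assumes simplified stand-in types (`Container`, `Config`, `HostConfig`) instead of the real `api` and `docker` types, drops the `pod` argument for brevity, and returns an error from each method as the interface comments describe. The `simpleProvider` type and its `allowPrivileged` setting are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// SecurityContext, Container, Config and HostConfig are simplified stand-ins
// for the api and docker types named in the interface above.
type SecurityContext struct {
	Privileged *bool
	RunAsUser  *int64
}

type Container struct {
	Name            string
	SecurityContext *SecurityContext
}

type Config struct{ User string }

type HostConfig struct{ Privileged bool }

// simpleProvider is a hypothetical provider that honours RunAsUser and
// refuses privileged containers unless allowPrivileged is set.
type simpleProvider struct{ allowPrivileged bool }

// ModifyContainerConfig copies the requested UID into the create-time config.
func (p simpleProvider) ModifyContainerConfig(c *Container, cfg *Config) error {
	if sc := c.SecurityContext; sc != nil && sc.RunAsUser != nil {
		cfg.User = strconv.FormatInt(*sc.RunAsUser, 10)
	}
	return nil
}

// ModifyHostConfig applies the privileged flag, or fails if it is not allowed.
func (p simpleProvider) ModifyHostConfig(c *Container, hc *HostConfig) error {
	sc := c.SecurityContext
	if sc == nil || sc.Privileged == nil || !*sc.Privileged {
		return nil
	}
	if !p.allowPrivileged {
		return errors.New("privileged containers are not allowed: " + c.Name)
	}
	hc.Privileged = true
	return nil
}

func main() {
	uid, priv := int64(1001), true
	c := &Container{
		Name:            "web",
		SecurityContext: &SecurityContext{Privileged: &priv, RunAsUser: &uid},
	}
	cfg, hc := &Config{}, &HostConfig{}
	p := simpleProvider{allowPrivileged: false}
	fmt.Println(p.ModifyContainerConfig(c, cfg), cfg.User)
	fmt.Println(p.ModifyHostConfig(c, hc), hc.Privileged)
}
```

A Kubelet holding such a provider would call the first method before creating the container and the second before running it, and would abort the start if either call returned an error.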
### Security Context
A security context resides on the container and represents the runtime
parameters that will be used to create and run the container via container APIs.
The following is an example of an initial implementation:
```go
type Container struct {
// ... other fields omitted ...
// Optional: SecurityContext defines the security options the container should be run with
SecurityContext *SecurityContext
}
// SecurityContext holds security configuration that will be applied to a container. SecurityContext
// contains duplication of some existing fields from the Container resource. These duplicate fields
// will be populated based on the Container configuration if they are not set. Defining them on
// both the Container AND the SecurityContext will result in an error.
type SecurityContext struct {
// Capabilities are the capabilities to add/drop when running the container
Capabilities *Capabilities
// Run the container in privileged mode
Privileged *bool
// SELinuxOptions are the labels to be applied to the container
// and volumes
SELinuxOptions *SELinuxOptions
// RunAsUser is the UID to run the entrypoint of the container process.
RunAsUser *int64
}
// SELinuxOptions are the labels to be applied to the container.
type SELinuxOptions struct {
// SELinux user label
User string
// SELinux role label
Role string
// SELinux type label
Type string
// SELinux level label.
Level string
}
```
### Admission
It is up to an admission plugin to determine if the security context is
acceptable or not. At the time of writing, the admission control plugin for
security contexts will only allow a context that defines capabilities or the
privileged flag. Contexts that attempt to define a UID or SELinux options will be
denied by default. In the future the admission plugin will base this decision
upon configurable policies that reside within the [service account](http://pr.k8s.io/2297).
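The default admission behavior described above can be sketched as a single check. This is an assumption-laden illustration: the `Admit` function is hypothetical, and the `Capabilities` struct with `Add` and `Drop` fields is a guess at a shape the proposal does not spell out.

```go
package main

import (
	"errors"
	"fmt"
)

// Capabilities is a hypothetical shape; the proposal only says that
// capabilities can be added or dropped.
type Capabilities struct{ Add, Drop []string }

type SELinuxOptions struct{ User, Role, Type, Level string }

// SecurityContext mirrors the struct shown earlier in this document.
type SecurityContext struct {
	Capabilities   *Capabilities
	Privileged     *bool
	SELinuxOptions *SELinuxOptions
	RunAsUser      *int64
}

// Admit sketches the default rule: capabilities and the privileged flag are
// accepted, while a UID or SELinux options cause the context to be denied.
func Admit(sc *SecurityContext) error {
	if sc == nil {
		return nil
	}
	if sc.RunAsUser != nil {
		return errors.New("security context may not set RunAsUser")
	}
	if sc.SELinuxOptions != nil {
		return errors.New("security context may not set SELinuxOptions")
	}
	return nil
}

func main() {
	priv := false
	uid := int64(0)
	fmt.Println(Admit(&SecurityContext{Privileged: &priv}))
	fmt.Println(Admit(&SecurityContext{RunAsUser: &uid}))
}
```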
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/security_context.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/security_context.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/security_context.md)

View File

@ -1,180 +1 @@
Design
=============
# Goals
Make it really hard to accidentally create a job which has an overlapping
selector, while still making it possible to choose an arbitrary selector, and
without adding complex constraint solving to the APIserver.
# Use Cases
1. user can leave all label and selector fields blank and system will fill in
reasonable ones: non-overlappingness guaranteed.
2. user can put on the pod template some labels that are useful to the user,
without reasoning about non-overlappingness. System adds additional label to
assure not overlapping.
3. If user wants to reparent pods to new job (very rare case) and knows what
they are doing, they can completely disable this behavior and specify explicit
selector.
4. If a controller that makes jobs, like scheduled job, wants to use different
labels, such as the time and date of the run, it can do that.
5. If User reads v1beta1 documentation or reuses v1beta1 Job definitions and
just changes the API group, the user should not automatically be allowed to
specify a selector, since this is very rarely what people want to do and is
error prone.
6. If User downloads an existing job definition, e.g. with
`kubectl get jobs/old -o yaml` and tries to modify and post it, they should not
create an overlapping job.
7. If User downloads an existing job definition, e.g. with
`kubectl get jobs/old -o yaml` and tries to modify and post it, and they
accidentally copy the uniquifying label from the old one, then they should not
get an error from a label-key conflict, nor get erratic behavior.
8. If user reads swagger docs and sees the selector field, they should not be able
to set it without realizing the risks.
9. (Deferred requirement:) If user wants to specify a preferred name for the
non-overlappingness key, they can pick a name.
# Proposed changes
## API
`extensions/v1beta1 Job` remains the same. `batch/v1 Job` changes as
follows.
Field `job.spec.manualSelector` is added. It controls whether selectors are
automatically generated. In automatic mode, user cannot make the mistake of
creating non-unique selectors. In manual mode, certain rare use cases are
supported.
Validation is not changed. A selector must be provided, and it must select the
pod template.
Defaulting changes. Defaulting happens in one of two modes:
### Automatic Mode
- User does not specify `job.spec.selector`.
- User is probably unaware of the `job.spec.manualSelector` field and does not
think about it.
- User optionally puts labels on pod template. User does not think
about uniqueness, just labeling for user's own reasons.
- Defaulting logic sets `job.spec.selector` to
`matchLabels["controller-uid"]="$UIDOFJOB"`
- Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`.
- The first label is `controller-uid=$UIDOFJOB`.
- The second label is `job-name=$NAMEOFJOB`.
### Manual Mode
- User means User or Controller for the rest of this list.
- User does specify `job.spec.selector`.
- User does specify `job.spec.manualSelector=true`
- User puts a unique label or label(s) on pod template (required). User does
think carefully about uniqueness.
- No defaulting of pod labels or the selector happens.
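The two defaulting modes can be sketched as follows. The `Job` struct here is a flattened, hypothetical stand-in for the real API type, with `Selector` playing the role of `matchLabels`. The sketch also assumes that a job in automatic mode arrives without a user-supplied selector.

```go
package main

import "fmt"

// Job is a flattened stand-in for the API type, for illustration only.
type Job struct {
	Name, UID      string
	ManualSelector *bool
	Selector       map[string]string // stands in for selector.matchLabels
	TemplateLabels map[string]string // stands in for .spec.template.metadata.labels
}

// DefaultJob applies automatic-mode defaulting unless manualSelector is true.
func DefaultJob(j *Job) {
	if j.ManualSelector != nil && *j.ManualSelector {
		// Manual mode: neither the selector nor the pod labels are defaulted.
		return
	}
	// Automatic mode: the selector is generated from the job's UID.
	j.Selector = map[string]string{"controller-uid": j.UID}
	if j.TemplateLabels == nil {
		j.TemplateLabels = map[string]string{}
	}
	// Two labels are appended to whatever labels the user already set.
	j.TemplateLabels["controller-uid"] = j.UID
	j.TemplateLabels["job-name"] = j.Name
}

func main() {
	j := &Job{Name: "pi", UID: "1234-abcd", TemplateLabels: map[string]string{"app": "pi"}}
	DefaultJob(j)
	fmt.Println(j.Selector)
	fmt.Println(j.TemplateLabels)
}
```

User-chosen labels such as `app` in the example are left in place; only the two generated keys are added.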
### Rationale
UID is better than Name in that:
- it allows cross-namespace control someday if we need it.
- it is unique across all kinds. `controller-name=foo` does not ensure
uniqueness across Kinds `job` vs `replicaSet`. Even `job-name=foo` has a
problem: you might have a `batch.Job` and a `snazzyjob.io/types.Job` -- the
latter cannot use label `job-name=foo`, though there is a temptation to do so.
- it uniquely identifies the controller across time. This prevents the case
where, for example, someone deletes a job via the REST api or client
(where cascade=false), leaving pods around. We don't want those to be picked up
unintentionally. It also prevents the case where a user looks at an old job that
finished but is not deleted, and tries to select its pods, and gets the wrong
impression that it is still running.
Job name is more user friendly. It is self-documenting.
Commands like `kubectl get pods -l job-name=myjob` should do exactly what is
wanted 99.9% of the time. Automated control loops should still use the
`controller-uid` label.
Using both gets the benefits of both, at the cost of some label verbosity.
The field is a `*bool`. Since false is expected to be much more common,
and since the feature is complex, it is better to leave it unspecified so that
users looking at a stored pod spec do not need to be aware of this field.
### Overriding Unique Labels
If user does specify `job.spec.selector` then the user must also specify
`job.spec.manualSelector`. This ensures the user knows that what they are doing is
not the normal thing to do.
To prevent users from copying the `job.spec.manualSelector` flag from existing
jobs, it will be optional and default to false, which means when you GET an
existing job back that didn't use this feature, you don't even see the
`job.spec.manualSelector` flag, so you are not tempted to wonder if you should
fiddle with it.
## Job Controller
No changes
## Kubectl
No required changes. Suggest moving SELECTOR to wide output of `kubectl get
jobs` since users do not write the selector.
## Docs
Remove examples that use selector and remove labels from pod templates.
Recommend `kubectl get pods -l job-name=name` as the way to find pods of a job.
# Conversion
The following applies to Job, as well as to other types that adopt this pattern:
- Type `extensions/v1beta1` gets a field called `job.spec.autoSelector`.
- Both the internal type and the `batch/v1` type will get
`job.spec.manualSelector`.
- The fields `manualSelector` and `autoSelector` have opposite meanings.
- Each field defaults to false when unset, and so v1beta1 has a different
default than v1 and internal. This is intentional: we want new uses to default
to the less error-prone behavior, and we do not want to change the behavior of
v1beta1.
*Note*: since the internal default is changing, client library consumers that
create Jobs may need to add "job.spec.manualSelector=true" to keep working, or
switch to auto selectors.
Conversion is as follows:
- `extensions/__internal` to `extensions/v1beta1`: the value of
`__internal.Spec.ManualSelector` is defaulted to false if nil, negated,
defaulted to nil if false, and written to `v1beta1.Spec.AutoSelector`.
- `extensions/v1beta1` to `extensions/__internal`: the value of
`v1beta1.Spec.AutoSelector` is defaulted to false if nil, negated, defaulted to
nil if false, and written to `__internal.Spec.ManualSelector`.
This conversion gives the following properties.
1. Users that previously used v1beta1 do not start seeing a new field when they
get back objects.
2. Distinction between originally unset versus explicitly set to false is not
preserved (would have been nice to do so, but requires more complicated
solution).
3. Users who only created v1beta1 examples or v1 examples, will not ever see the
existence of either field.
4. Since v1beta1 objects are convertible to/from v1, the storage location (path in etcd)
does not need to change, allowing scriptable rollforward/rollback.
# Future Work
Follow this pattern for Deployments, ReplicaSet, DaemonSet when going to v1, if
it works well for job.
Docs will be edited to show examples without a `job.spec.selector`.
We probably want as much as possible the same behavior for Job and
ReplicationController.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selector-generation.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/selector-generation.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/selector-generation.md)

View File

@ -1,317 +1 @@
## Abstract
A proposal for enabling containers in a pod to share volumes using a pod level SELinux context.
## Motivation
Many users have a requirement to run pods on systems that have SELinux enabled. Volume plugin
authors should not have to explicitly account for SELinux except for volume types that require
special handling of the SELinux context during setup.
Currently, each container in a pod has an SELinux context. This is not an ideal factoring for
sharing resources using SELinux.
We propose a pod-level SELinux context and a mechanism to support SELinux labeling of volumes in a
generic way.
Goals of this design:
1. Describe the problems with a container SELinux context
2. Articulate a design for generic SELinux support for volumes using a pod level SELinux context
which is backward compatible with the v1.0.0 API
## Constraints and Assumptions
1. We will not support securing containers within a pod from one another
2. Volume plugins should not have to handle setting SELinux context on volumes
3. We will not deal with shared storage
## Current State Overview
### Docker
Docker uses a base SELinux context and calculates a unique MCS label per container. The SELinux
context of a container can be overridden with the `SecurityOpt` api that allows setting the different
parts of the SELinux context individually.
Docker has functionality to relabel bind-mounts with a usable SELinux context and supports two different
use-cases:
1. The `:Z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
SELinux context
2. The `:z` bind-mount flag, which tells Docker to relabel a bind-mount with the container's
SELinux context, but remove the MCS labels, making the volume shareable between containers
We should avoid using the `:z` flag, because it relaxes the SELinux context so that any container
(from an SELinux standpoint) can use the volume.
### rkt
rkt currently reads the base SELinux context to use from `/etc/selinux/*/contexts/lxc_contexts`
and allocates a unique MCS label per pod.
### Kubernetes
There is a [proposed change](https://github.com/kubernetes/kubernetes/pull/9844) to the
EmptyDir plugin that adds SELinux relabeling capabilities to that plugin, which is also carried as a
patch in [OpenShift](https://github.com/openshift/origin). It is preferable to solve the general
problem of handling SELinux in Kubernetes rather than to merge this PR.
A new `PodSecurityContext` type has been added that carries information about security attributes
that apply to the entire pod and that apply to all containers in a pod. See:
1. [Skeletal implementation](https://github.com/kubernetes/kubernetes/pull/13939)
1. [Proposal for inlining container security fields](https://github.com/kubernetes/kubernetes/pull/12823)
## Use Cases
1. As a cluster operator, I want to support securing pods from one another using SELinux when
SELinux integration is enabled in the cluster
2. As a user, I want volume sharing to work correctly amongst containers in pods
#### SELinux context: pod- or container- level?
Currently, SELinux context is specifiable only at the container level. This is an inconvenient
factoring for sharing volumes and other SELinux-secured resources between containers because there
is no way in SELinux to share resources between processes with different MCS labels except to
remove MCS labels from the shared resource. This is a big security risk: _any container_ in the
system can work with a resource which has the same SELinux context as it and no MCS labels. Since
we are also not interested in isolating containers in a pod from one another, the SELinux context
should be shared by all containers in a pod to facilitate isolation from the containers in other
pods and sharing resources amongst all the containers of a pod.
#### Volumes
Kubernetes volumes can be divided into two broad categories:
1. Unshared storage:
1. Volumes created by the kubelet on the host directory: empty directory, git repo, secret,
downward api. All volumes in this category delegate to `EmptyDir` for their underlying
storage.
2. Volumes based on network block devices: AWS EBS, iSCSI, RBD, etc, *when used exclusively
by a single pod*.
2. Shared storage:
1. `hostPath` is shared storage because it is necessarily used by a container and the host
2. Network file systems such as NFS, Glusterfs, Cephfs, etc.
3. Block device based volumes in `ReadOnlyMany` or `ReadWriteMany` modes are shared because
they may be used simultaneously by multiple pods.
For unshared storage, SELinux handling for most volumes can be generalized into running a `chcon`
operation on the volume directory after running the volume plugin's `Setup` function. For these
volumes, the Kubelet can perform the `chcon` operation and keep SELinux concerns out of the volume
plugin code. Some volume plugins may need to use the SELinux context during a mount operation in
certain cases. To account for this, our design must have a way for volume plugins to state that
a particular volume should or should not receive generic label management.
For shared storage, the picture is murkier. Labels for existing shared storage will be managed
outside Kubernetes and administrators will have to set the SELinux context of pods correctly.
The problem of solving SELinux label management for new shared storage is outside the scope for
this proposal.
## Analysis
The system needs to be able to:
1. Model correctly which volumes require SELinux label management
1. Relabel volumes with the correct SELinux context when required
### Modeling whether a volume requires label management
#### Unshared storage: volumes derived from `EmptyDir`
Empty dir and volumes derived from it are created by the system, so Kubernetes must always ensure
that the ownership and SELinux context (when relevant) are set correctly for the volume to be
usable.
#### Unshared storage: network block devices
Volume plugins based on network block devices such as AWS EBS and RBD can be treated the same way
as local volumes. Since inodes are written to these block devices in the same way as `EmptyDir`
volumes, permissions and ownership can be managed on the client side by the Kubelet when used
exclusively by one pod. When the volumes are used outside of a persistent volume, or with the
`ReadWriteOnce` mode, they are effectively unshared storage.
When used by multiple pods, there are many additional use-cases to analyze before we can be
confident that we can support SELinux label management robustly with these file systems. The right
design is one that makes it easy to experiment and develop support for ownership management with
volume plugins to enable developers and cluster operators to continue exploring these issues.
#### Shared storage: hostPath
The `hostPath` volume should only be used by effective-root users, and the permissions of paths
exposed into containers via hostPath volumes should always be managed by the cluster operator. If
the Kubelet managed the SELinux labels for `hostPath` volumes, a user who could create a `hostPath`
volume could effect changes in the state of arbitrary paths within the host's filesystem. This
would be a severe security risk, so we will consider hostPath a corner case that the kubelet should
never perform ownership management for.
#### Shared storage: network
Ownership management of shared storage is a complex topic. SELinux labels for existing shared
storage will be managed externally from Kubernetes. For this case, our API should make it simple to
express whether a particular volume should have these concerns managed by Kubernetes.
We will not attempt to address the concerns of new shared storage in this proposal.
When a network block device is used as a persistent volume in `ReadWriteMany` or `ReadOnlyMany`
modes, it is shared storage, and thus outside the scope of this proposal.
#### API requirements
From the above, we know that label management must be applied:
1. To some volume types always
2. To some volume types never
3. To some volume types *sometimes*
Volumes should be relabeled with the correct SELinux context. Docker has this capability today; it
is desirable for other container runtime implementations to provide similar functionality.
Relabeling should be an optional aspect of a volume plugin to accommodate:
1. volume types for which generalized relabeling support is not sufficient
2. testing for each volume plugin individually
## Proposed Design
Our design should minimize code for handling SELinux labelling required in the Kubelet and volume
plugins.
### Deferral: MCS label allocation
Our short-term goal is to facilitate volume sharing and isolation with SELinux and expose the
primitives for higher level composition; making these automatic is a longer-term goal. Allocating
groups and MCS labels are fairly complex problems in their own right, and so our proposal will not
encompass either of these topics. There are several problems that the solution for allocation
depends on:
1. Users and groups in Kubernetes
2. General auth policy in Kubernetes
3. [security policy](https://github.com/kubernetes/kubernetes/pull/7893)
### API changes
The [inline container security attributes PR (12823)](https://github.com/kubernetes/kubernetes/pull/12823)
adds a `pod.Spec.SecurityContext.SELinuxOptions` field. The change to the API in this proposal is
the addition of the semantics to this field:
* When the `pod.Spec.SecurityContext.SELinuxOptions` field is set, volumes that support ownership
management in the Kubelet have their SELinuxContext set from this field.
```go
package api
type PodSecurityContext struct {
// SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
// SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
//
// This field will be used to set the SELinux context of volumes that support SELinux label management
// by the kubelet.
SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
}
```
The V1 API is extended with the same semantics:
```go
package v1
type PodSecurityContext struct {
// SELinuxOptions captures the SELinux context for all containers in a Pod. If a container's
// SecurityContext.SELinuxOptions field is set, that setting takes precedence for that container.
//
// This field will be used to set the SELinux context of volumes that support SELinux label management
// by the kubelet.
SELinuxOptions *SELinuxOptions `json:"seLinuxOptions,omitempty"`
}
```
#### API backward compatibility
Old pods that do not have the `pod.Spec.SecurityContext.SELinuxOptions` field set will not receive
SELinux label management for their volumes. This is acceptable since old clients won't know about
this field and won't have any expectation of their volumes being managed this way.
The existing backward compatibility semantics for SELinux do not change at all with this proposal.
### Kubelet changes
The Kubelet should be modified to perform SELinux label management when required for a volume. The
criteria to activate the kubelet SELinux label management for volumes are:
1. SELinux integration is enabled in the cluster
2. SELinux is enabled on the node
3. The `pod.Spec.SecurityContext.SELinuxOptions` field is set
4. The volume plugin supports SELinux label management
The `volume.Mounter` interface should have a new method added that indicates whether the plugin
supports SELinux label management:
```go
package volume
type Mounter interface {
// other methods omitted
SupportsSELinux() bool
}
```
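The four activation criteria listed above, combined with the new method, might be evaluated roughly as in the sketch below. Everything except `SupportsSELinux` is an assumption made for illustration: the `ShouldManageLabels` function, its boolean parameters, and the two toy plugin types are hypothetical.

```go
package main

import "fmt"

// Mounter carries only the method proposed above; other methods are omitted.
type Mounter interface {
	SupportsSELinux() bool
}

// emptyDir and hostPath are toy plugins for this sketch.
type emptyDir struct{}

func (emptyDir) SupportsSELinux() bool { return true }

type hostPath struct{}

func (hostPath) SupportsSELinux() bool { return false }

// SELinuxOptions mirrors the labels carried by the pod security context.
type SELinuxOptions struct{ User, Role, Type, Level string }

// ShouldManageLabels returns true only when all four criteria hold:
// cluster integration enabled, SELinux enabled on the node, the pod-level
// SELinuxOptions set, and the volume plugin opting in.
func ShouldManageLabels(clusterEnabled, nodeEnabled bool, podOpts *SELinuxOptions, m Mounter) bool {
	return clusterEnabled && nodeEnabled && podOpts != nil && m.SupportsSELinux()
}

func main() {
	opts := &SELinuxOptions{Level: "s0:c1,c2"}
	fmt.Println(ShouldManageLabels(true, true, opts, emptyDir{}))
	fmt.Println(ShouldManageLabels(true, true, opts, hostPath{}))
	fmt.Println(ShouldManageLabels(true, true, nil, emptyDir{}))
}
```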
Individual volume plugins are responsible for correctly reporting whether they support label
management in the kubelet. In the first round of work, only `hostPath` and `emptyDir` and its
derivations will be tested with ownership management support:
| Plugin Name | SupportsOwnershipManagement |
|-------------------------|-------------------------------|
| `hostPath` | false |
| `emptyDir` | true |
| `gitRepo` | true |
| `secret` | true |
| `downwardAPI` | true |
| `gcePersistentDisk` | false |
| `awsElasticBlockStore` | false |
| `nfs` | false |
| `iscsi` | false |
| `glusterfs` | false |
| `persistentVolumeClaim` | depends on underlying volume and PV mode |
| `rbd` | false |
| `cinder` | false |
| `cephfs` | false |
Ultimately, the matrix will theoretically look like:
| Plugin Name | SupportsOwnershipManagement |
|-------------------------|-------------------------------|
| `hostPath` | false |
| `emptyDir` | true |
| `gitRepo` | true |
| `secret` | true |
| `downwardAPI` | true |
| `gcePersistentDisk` | true |
| `awsElasticBlockStore` | true |
| `nfs` | false |
| `iscsi` | true |
| `glusterfs` | false |
| `persistentVolumeClaim` | depends on underlying volume and PV mode |
| `rbd` | true |
| `cinder` | false |
| `cephfs` | false |
In order to limit the amount of SELinux label management code in Kubernetes, we propose that it be a
function of the container runtime implementations. Initially, we will modify the docker runtime
implementation to correctly set the `:Z` flag on the appropriate bind-mounts in order to accomplish
generic label management for docker containers.
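As a rough picture of what setting the `:Z` flag could look like, the sketch below builds a Docker bind-mount string of the form `host:container[:options]`. The `BindString` helper and its parameters are hypothetical and are not the Kubelet's actual code; the example paths are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// BindString builds a Docker bind-mount specification. When relabel is true
// the "Z" option is added, asking Docker to relabel the mount with the
// container's private SELinux context. The shared "z" option is never used,
// in line with the recommendation earlier in this proposal.
func BindString(hostPath, containerPath string, readOnly, relabel bool) string {
	bind := hostPath + ":" + containerPath
	var opts []string
	if readOnly {
		opts = append(opts, "ro")
	}
	if relabel {
		opts = append(opts, "Z")
	}
	if len(opts) > 0 {
		bind += ":" + strings.Join(opts, ",")
	}
	return bind
}

func main() {
	fmt.Println(BindString("/var/lib/example/vol", "/data", false, true))
	fmt.Println(BindString("/var/lib/example/vol", "/data", true, true))
	fmt.Println(BindString("/etc/example", "/config", true, false))
}
```

The `relabel` argument would be driven by the activation criteria for label management, so volumes such as `hostPath` would always be mounted without the flag.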
Volume types that require SELinux context information at mount must be injected with and respect the
enablement setting for the labeling for the volume type. The proposed `VolumeConfig` mechanism
will be used to carry information about label management enablement to the volume plugins that have
to manage labels individually.
This allows the volume plugins to determine when they do and don't want this type of support from
the Kubelet, and allows the criteria each plugin uses to evolve without changing the Kubelet.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/selinux.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/selinux.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/selinux.md)

@ -1,210 +1 @@
# Service Accounts
## Motivation
Processes in Pods may need to call the Kubernetes API. For example:
- scheduler
- replication controller
- node controller
- a map-reduce type framework which has a controller that then tries to make a
dynamically determined number of workers and watch them
- continuous build and push system
- monitoring system
They also may interact with services other than the Kubernetes API, such as:
- an image repository, such as docker -- both when the images are pulled to
start the containers, and for writing images in the case of pods that generate
images.
- accessing other cloud services, such as blob storage, in the context of a
large, integrated, cloud offering (hosted or private).
- accessing files in an NFS volume attached to the pod
## Design Overview
A service account binds together several things:
- a *name*, understood by users, and perhaps by peripheral systems, for an
identity
- a *principal* that can be authenticated and [authorized](../admin/authorization.md)
- a [security context](security_context.md), which defines the Linux
Capabilities, User IDs, Group IDs, and other capabilities and controls on
interaction with the file system and OS.
- a set of [secrets](secrets.md), which a container may use to access various
networked resources.
## Design Discussion
A new object Kind is added:
```go
type ServiceAccount struct {
	TypeMeta   `json:",inline" yaml:",inline"`
	ObjectMeta `json:"metadata,omitempty" yaml:"metadata,omitempty"`
	username        string
	securityContext ObjectReference   // (reference to a securityContext object)
	secrets         []ObjectReference // (references to secret objects)
}
```
The name ServiceAccount is chosen because it is widely used already (e.g. by
Kerberos and LDAP) to refer to this type of account. Note that it has no
relation to Kubernetes Service objects.
The ServiceAccount object does not include any information that could not be
defined separately:
- username can be defined however users are defined.
- securityContext and secrets are only referenced and are created using the
REST API.
The purpose of the serviceAccount object is twofold:
- to bind usernames to securityContexts and secrets, so that the username can
be used to refer succinctly in contexts where explicitly naming securityContexts
and secrets would be inconvenient
- to provide an interface to simplify allocation of new securityContexts and
secrets.
These features are explained later.
### Names
From the standpoint of the Kubernetes API, a `user` is any principal which can
authenticate to Kubernetes API. This includes a human running `kubectl` on her
desktop and a container in a Pod on a Node making API calls.
There is already a notion of a username in Kubernetes, which is populated into a
request context after authentication. However, there is no API object
representing a user. While this may evolve, it is expected that in mature
installations, the canonical storage of user identifiers will be handled by a
system external to Kubernetes.
Kubernetes does not dictate how to divide up the space of user identifier
strings. User names can be simple Unix-style short usernames, (e.g. `alice`), or
may be qualified to allow for federated identity (`alice@example.com` vs.
`alice@example.org`.) Naming convention may distinguish service accounts from
user accounts (e.g. `alice@example.com` vs.
`build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`), but
Kubernetes does not require this.
Kubernetes also does not require that there be a distinction between human and
Pod users. It will be possible to set up a cluster where Alice the human talks to
the Kubernetes API as username `alice` and starts pods that also talk to the API
as user `alice` and write files to NFS as user `alice`. But, this is not
recommended.
Instead, it is recommended that Pods and Humans have distinct identities, and
reference implementations will make this distinction.
The distinction is useful for a number of reasons:
- the requirements for humans and automated processes are different:
  - Humans need a wide range of capabilities to do their daily activities.
    Automated processes often have more narrowly-defined activities.
  - Humans may better tolerate the exceptional conditions created by
    expiration of a token. Remembering to handle this in a program is more annoying.
    So, either long-lasting credentials or automated rotation of credentials is
    needed.
  - A Human typically keeps credentials on a machine that is not part of the
    cluster and so not subject to automatic management. A VM with a
    role/service-account can have its credentials automatically managed.
- the identity of a Pod cannot in general be mapped to a single human.
  - If policy allows, it may be created by one human, and then updated by
    another, and another, until its behavior cannot be attributed to a single human.
**TODO**: consider getting rid of separate serviceAccount object and just
rolling its parts into the SecurityContext or Pod Object.
The `secrets` field is a list of references to secret objects that a process
started as that service account should have access to, so that it can assert
that role.
The secrets are not inline with the serviceAccount object. This way, most or
all users can have permission to `GET /serviceAccounts` so they can remind
themselves what serviceAccounts are available for use.
Nothing will prevent creation of a serviceAccount with two secrets of type
`SecretTypeKubernetesAuth`, or secrets of two different types. Kubelet and
client libraries will have some behavior, TBD, to handle the case of multiple
secrets of a given type (pick first or provide all and try each in order, etc).
When a serviceAccount and a matching secret exist, then a `User.Info` for the
serviceAccount and a `BearerToken` from the secret are added to the map of
tokens used by the authentication process in the apiserver, and similarly for
other types. (We might have some types that do not do anything on apiserver but
just get pushed to the kubelet.)
### Pods
The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If
this is unset, then a default value is chosen. If it is set, then the
corresponding value of `Pods.Spec.SecurityContext` is set by the Service Account
Finalizer (see below).
TBD: how policy limits which users can make pods with which service accounts.
### Authorization
Kubernetes API Authorization Policies refer to users. Pods created with a
`Pods.Spec.ServiceAccountUsername` typically get a `Secret` which allows them to
authenticate to the Kubernetes APIserver as a particular user. So any policy
that is desired can be applied to them.
A higher level workflow is needed to coordinate creation of serviceAccounts,
secrets and relevant policy objects. Users are free to extend Kubernetes to put
this business logic wherever is convenient for them, though the Service Account
Finalizer is one place where this can happen (see below).
### Kubelet
The kubelet will treat as "not ready to run" (needing a finalizer to act on it)
any Pod which has an empty SecurityContext.
The kubelet will set a default, restrictive, security context for any pods
created from non-Apiserver config sources (http, file).
Kubelet watches apiserver for secrets which are needed by pods bound to it.
**TODO**: how to only let kubelet see secrets it needs to know.
### The service account finalizer
There are several ways to use Pods with SecurityContexts and Secrets.
One way is to explicitly specify the securityContext and all secrets of a Pod
when the pod is initially created, like this:
**TODO**: example of pod with explicit refs.
Another way is with the *Service Account Finalizer*, a plugin process which is
optional, and which handles business logic around service accounts.
The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount
definitions.
First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no
`Pod.Spec.SecurityContext` set, then it copies in the referenced securityContext
and secrets references for the corresponding `serviceAccount`.
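The copy step described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the `ServiceAccount` and `PodSpec` types here are hypothetical stand-ins that hold references as plain strings, and `finalize` is an invented helper name, not part of any Kubernetes API.

```go
package main

import "fmt"

// ServiceAccount is a hypothetical, trimmed-down stand-in for the proposed object.
type ServiceAccount struct {
	Username        string
	SecurityContext string   // name of the referenced securityContext
	Secrets         []string // names of the referenced secrets
}

// PodSpec is a hypothetical stand-in holding only the fields this step touches.
type PodSpec struct {
	ServiceAccountUsername string
	SecurityContext        string
	Secrets                []string
}

// finalize copies the securityContext and secret references from the pod's
// service account into the pod. It only acts on pods that name a service
// account and have no securityContext yet, and reports whether it changed the pod.
func finalize(pod *PodSpec, accounts map[string]ServiceAccount) bool {
	if pod.ServiceAccountUsername == "" || pod.SecurityContext != "" {
		return false
	}
	sa, ok := accounts[pod.ServiceAccountUsername]
	if !ok {
		return false // unknown service account: leave the pod untouched
	}
	pod.SecurityContext = sa.SecurityContext
	pod.Secrets = append(pod.Secrets, sa.Secrets...)
	return true
}

func main() {
	accounts := map[string]ServiceAccount{
		"build-bot": {Username: "build-bot", SecurityContext: "restricted", Secrets: []string{"build-bot-token"}},
	}
	pod := PodSpec{ServiceAccountUsername: "build-bot"}
	fmt.Println(finalize(&pod, accounts), pod.SecurityContext, pod.Secrets)
}
```

Because a finalized pod already has a securityContext, running the step twice leaves the pod unchanged, which is convenient for a process that repeatedly watches pods.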
Second, if ServiceAccount definitions change, it may take some actions.
**TODO**: decide what actions it takes when a serviceAccount definition changes.
Does it stop pods, or just allow someone to list ones that are out of spec? In
general, people may want to customize this?
Third, if a new namespace is created, it may create a new serviceAccount for
that namespace. This may include a new username (e.g.
`NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`),
a new securityContext, a newly generated secret to authenticate that
serviceAccount to the Kubernetes API, and default policies for that service
account.
**TODO**: more concrete example. What are typical default permissions for
default service account (e.g. readonly access to services in the same namespace
and read-write access to events in that namespace?)
Finally, it may provide an interface to automate creation of new
serviceAccounts. In that case, the user may want to GET serviceAccounts to see
what has been created.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service_accounts.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/service_accounts.md)

@ -1,131 +1 @@
## Simple rolling update
This is a lightweight design document for simple
[rolling update](../user-guide/kubectl/kubectl_rolling-update.md) in `kubectl`.
Complete execution flow can be found [here](#execution-details). See the
[example of rolling update](../user-guide/update-demo/) for more information.
### Lightweight rollout
Assume that we have a current replication controller named `foo` and it is
running image `image:v1`
`kubectl rolling-update foo [foo-v2] --image=myimage:v2`
If the user doesn't specify a name for the 'next' replication controller, then
the 'next' replication controller is renamed to
the name of the original replication controller.
There is a race here: if you kill the client between deleting `foo` and
creating the new version of `foo`, you might be surprised about what is
there, but I think that's ok. See [Recovery](#recovery) below.
If the user does specify a name for the 'next' replication controller, then the
'next' replication controller is retained with its existing name, and the old
'foo' replication controller is deleted. For the purposes of the rollout, we add
a unique-ifying label `kubernetes.io/deployment` to both the `foo` and
`foo-next` replication controllers. The value of that label is the hash of the
complete JSON representation of the `foo-next` or `foo` replication controller.
The name of this label can be overridden by the user with the
`--deployment-label-key` flag.
#### Recovery
If a rollout fails or is terminated in the middle, it is important that the user
be able to resume the roll out. To facilitate recovery in the case of a crash of
the updating process itself, we add the following annotations to each
replication controller in the `kubernetes.io/` annotation namespace:
* `desired-replicas` The desired number of replicas for this replication
controller (either N or zero)
* `update-partner` A pointer to the replication controller resource that is
the other half of this update (syntax `<name>`; the namespace is assumed to be
identical to the namespace of this replication controller).
Recovery is achieved by issuing the same command again:
```sh
kubectl rolling-update foo [foo-v2] --image=myimage:v2
```
Whenever the rolling update command executes, the kubectl client looks for
replication controllers called `foo` and `foo-next`. If both exist, an attempt
is made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is
created, and the rollout is a new rollout. If `foo` doesn't exist, then it is
assumed that the rollout is nearly completed, and `foo-next` is renamed to
`foo`. Details of the execution flow are given below.
### Aborting a rollout
An abort is assumed to mean reversing a rollout in progress.
`kubectl rolling-update foo [foo-v2] --rollback`
This is really just syntactic sugar for:
`kubectl rolling-update foo-v2 foo`
With the added detail that it moves the `desired-replicas` annotation from
`foo-v2` to `foo`
### Execution Details
For the purposes of this example, assume that we are rolling from `foo` to
`foo-next` where the only change is an image update from `v1` to `v2`
If the user doesn't specify a `foo-next` name, then it is discovered from
the `update-partner` annotation on `foo`. If that annotation doesn't exist,
then `foo-next` is synthesized using the pattern
`<controller-name>-<hash-of-next-controller-JSON>`
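As a rough illustration of that naming pattern, the sketch below serializes a controller to JSON and appends a hash of the bytes to the old name. The `controller` type is a hypothetical stand-in, and the choice of FNV-1a is an assumption made for brevity; this document does not say which hash function `kubectl` uses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// controller is a hypothetical, heavily trimmed stand-in for a replication controller.
type controller struct {
	Name     string `json:"name"`
	Image    string `json:"image"`
	Replicas int    `json:"replicas"`
}

// nextName builds "<controller-name>-<hash-of-next-controller-JSON>".
// The hash function (32-bit FNV-1a) is an assumption for this sketch.
func nextName(oldName string, next controller) (string, error) {
	data, err := json.Marshal(next)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(data) // Write on a hash.Hash never returns an error
	return fmt.Sprintf("%s-%08x", oldName, h.Sum32()), nil
}

func main() {
	name, err := nextName("foo", controller{Name: "foo", Image: "myimage:v2", Replicas: 3})
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```

Because the name is derived only from the next controller's content, re-running the same command after a crash yields the same `foo-next` name, which is what makes recovery by re-issuing the command possible.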
#### Initialization
* If `foo` and `foo-next` do not exist:
  * Exit, and indicate an error to the user, that the specified controller
    doesn't exist.
* If `foo` exists, but `foo-next` does not:
  * Create `foo-next`, populate it with the `v2` image, and set
    `desired-replicas` to `foo.Spec.Replicas`
  * Goto Rollout
* If `foo-next` exists, but `foo` does not:
  * Assume that we are in the rename phase.
  * Goto Rename
* If both `foo` and `foo-next` exist:
  * Assume that we are in a partial rollout
  * If `foo-next` is missing the `desired-replicas` annotation
    * Populate the `desired-replicas` annotation to `foo-next` using the
      current size of `foo`
  * Goto Rollout
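The initialization rules above amount to a four-way decision on which controllers exist. A minimal sketch, using a hypothetical `Phase` type and `nextPhase` helper that are not part of `kubectl`:

```go
package main

import "fmt"

// Phase names the next step of the rolling update (hypothetical type for this sketch).
type Phase string

const (
	PhaseError   Phase = "Error"
	PhaseRollout Phase = "Rollout"
	PhaseRename  Phase = "Rename"
)

// nextPhase mirrors the Initialization rules: it looks only at which of the
// two replication controllers currently exist.
func nextPhase(fooExists, fooNextExists bool) Phase {
	switch {
	case !fooExists && !fooNextExists:
		return PhaseError // nothing to update: report an error to the user
	case fooExists && !fooNextExists:
		return PhaseRollout // new rollout: foo-next is created first
	case !fooExists && fooNextExists:
		return PhaseRename // rollout finished, only the rename is pending
	default:
		return PhaseRollout // both exist: resume a partial rollout
	}
}

func main() {
	fmt.Println(nextPhase(true, false))
	fmt.Println(nextPhase(false, true))
}
```

The side effects (creating `foo-next`, populating `desired-replicas`) are left out; the sketch only shows how the phase is chosen.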
#### Rollout
* While size of `foo-next` < `desired-replicas` annotation on `foo-next`
  * increase size of `foo-next`
  * if size of `foo` > 0
    decrease size of `foo`
* Goto Rename
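The Rollout loop can be simulated on plain integers to see how the two sizes move. This sketch assumes a step of one replica per iteration, which the text does not specify; `rollout` is a hypothetical helper that touches no cluster.

```go
package main

import "fmt"

// rollout simulates the Rollout loop on in-memory replica counts.
// It returns the final sizes of foo and foo-next and the number of iterations.
// The step size of one replica per iteration is an assumption of this sketch.
func rollout(fooSize, nextSize, desired int) (int, int, int) {
	steps := 0
	for nextSize < desired {
		nextSize++ // increase size of foo-next
		if fooSize > 0 {
			fooSize-- // decrease size of foo
		}
		steps++
	}
	return fooSize, nextSize, steps
}

func main() {
	foo, next, steps := rollout(3, 0, 3)
	fmt.Println(foo, next, steps)
}
```

When resuming a partial rollout, `foo` may still have replicas left when the loop ends; they go away when `foo` is deleted in the Rename phase.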
#### Rename
* delete `foo`
* create `foo` that is identical to `foo-next`
* delete `foo-next`
#### Abort
* If `foo-next` doesn't exist
  * Exit and indicate to the user that they may want to simply do a new
    rollout with the old version
* If `foo` doesn't exist
  * Exit and indicate not found to the user
* Otherwise, `foo-next` and `foo` both exist
  * Set `desired-replicas` annotation on `foo` to match the annotation on
    `foo-next`
  * Goto Rollout with `foo` and `foo-next` trading places.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/simple-rolling-update.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/simple-rolling-update.md)

@ -1,291 +1 @@
# Taints, Tolerations, and Dedicated Nodes
## Introduction
This document describes *taints* and *tolerations*, which constitute a generic
mechanism for restricting the set of pods that can use a node. We also describe
one concrete use case for the mechanism, namely to limit the set of users (or
more generally, authorization domains) who can access a set of nodes (a feature
we call *dedicated nodes*). There are many other uses--for example, a set of
nodes with a particular piece of hardware could be reserved for pods that
require that hardware, or a node could be marked as unschedulable when it is
being drained before shutdown, or a node could trigger evictions when it
experiences hardware or software problems or abnormal node configurations; see
issues [#17190](https://github.com/kubernetes/kubernetes/issues/17190) and
[#3885](https://github.com/kubernetes/kubernetes/issues/3885) for more discussion.
## Taints, tolerations, and dedicated nodes
A *taint* is a new type that is part of the `NodeSpec`; when present, it
prevents pods from scheduling onto the node unless the pod *tolerates* the taint
(tolerations are listed in the `PodSpec`). Note that there are actually multiple
flavors of taints: taints that prevent scheduling on a node, taints that cause
the scheduler to try to avoid scheduling on a node but do not prevent it, taints
that prevent a pod from starting on Kubelet even if the pod's `NodeName` was
written directly (i.e. pod did not go through the scheduler), and taints that
evict already-running pods.
[This comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
has more background on these different scenarios. We will focus on the first
kind of taint in this doc, since it is the kind required for the "dedicated
nodes" use case.
Implementing dedicated nodes using taints and tolerations is straightforward: in
essence, a node that is dedicated to group A gets taint `dedicated=A` and the
pods belonging to group A get toleration `dedicated=A`. (The exact syntax and
semantics of taints and tolerations are described later in this doc.) This keeps
all pods except those belonging to group A off of the nodes. This approach
easily generalizes to pods that are allowed to schedule into multiple dedicated
node groups, and nodes that are a member of multiple dedicated node groups.
Note that because tolerations are at the granularity of pods, the mechanism is
very flexible -- any policy can be used to determine which tolerations should be
placed on a pod. So the "group A" mentioned above could be all pods from a
particular namespace or set of namespaces, or all pods with some other arbitrary
characteristic in common. We expect that any real-world usage of taints and
tolerations will employ an admission controller to apply the tolerations. For
example, to give all pods from namespace A access to dedicated node group A, an
admission controller would add the corresponding toleration to all pods from
namespace A. Or to give all pods that require GPUs access to GPU nodes, an
admission controller would add the toleration for GPU taints to pods that
request the GPU resource.
Everything that can be expressed using taints and tolerations can be expressed
using [node affinity](https://github.com/kubernetes/kubernetes/pull/18261), e.g.
in the example in the previous paragraph, you could put a label `dedicated=A` on
the set of dedicated nodes and a node affinity `dedicated NotIn A` on all pods *not*
belonging to group A. But it is cumbersome to express exclusion policies using
node affinity because every time you add a new type of restricted node, all pods
that aren't allowed to use those nodes need to start avoiding those nodes using
node affinity. This means the node affinity list can get quite long in clusters
with lots of different groups of special nodes (lots of dedicated node groups,
lots of different kinds of special hardware, etc.). Moreover, you need to also
update any Pending pods when you add new types of special nodes. In contrast,
with taints and tolerations, when you add a new type of special node, "regular"
pods are unaffected, and you just need to add the necessary toleration to the
pods you subsequently create that need to use the new type of special nodes. To
put it another way, with taints and tolerations, only pods that use a set of
special nodes need to know about those special nodes; with the node affinity
approach, pods that have no interest in those special nodes need to know about
all of the groups of special nodes.
One final comment: in practice, it is often desirable to not only keep "regular"
pods off of special nodes, but also to keep "special" pods off of regular nodes.
An example in the dedicated nodes case is to not only keep regular users off of
dedicated nodes, but also to keep dedicated users off of non-dedicated (shared)
nodes. In this case, the "non-dedicated" nodes can be modeled as their own
dedicated node group (for example, tainted as `dedicated=shared`), and pods that
are not given access to any dedicated nodes ("regular" pods) would be given a
toleration for `dedicated=shared`. (As mentioned earlier, we expect tolerations
will be added by an admission controller.) In this case taints/tolerations are
still better than node affinity because with taints/tolerations each pod only
needs one special "marking", versus in the node affinity case where every time
you add a dedicated node group (i.e. a new `dedicated=` value), you need to add
a new node affinity rule to all pods (including pending pods) except the ones
allowed to use that new dedicated node group.
## API
```go
// The node this Taint is attached to has the effect "effect" on
// any pod that does not tolerate the Taint.
type Taint struct {
	Key    string      `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
	Value  string      `json:"value,omitempty"`
	Effect TaintEffect `json:"effect"`
}
type TaintEffect string
const (
	// Do not allow new pods to schedule unless they tolerate the taint,
	// but allow all pods submitted to Kubelet without going through the scheduler
	// to start, and allow all already-running pods to continue running.
	// Enforced by the scheduler.
	TaintEffectNoSchedule TaintEffect = "NoSchedule"
	// Like TaintEffectNoSchedule, but the scheduler tries not to schedule
	// new pods onto the node, rather than prohibiting new pods from scheduling
	// onto the node. Enforced by the scheduler.
	TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
	// Do not allow new pods to schedule unless they tolerate the taint,
	// do not allow pods to start on Kubelet unless they tolerate the taint,
	// but allow all already-running pods to continue running.
	// Enforced by the scheduler and Kubelet.
	TaintEffectNoScheduleNoAdmit TaintEffect = "NoScheduleNoAdmit"
	// Do not allow new pods to schedule unless they tolerate the taint,
	// do not allow pods to start on Kubelet unless they tolerate the taint,
	// and try to eventually evict any already-running pods that do not tolerate the taint.
	// Enforced by the scheduler and Kubelet.
	TaintEffectNoScheduleNoAdmitNoExecute TaintEffect = "NoScheduleNoAdmitNoExecute"
)
// The pod this Toleration is attached to tolerates any taint that matches
// the triple <key,value,effect> using the matching operator <operator>.
type Toleration struct {
	Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
	// operator represents a key's relationship to the value.
	// Valid operators are Exists and Equal. Defaults to Equal.
	// Exists is equivalent to wildcard for value, so that a pod can
	// tolerate all taints of a particular category.
	Operator TolerationOperator `json:"operator"`
	Value    string             `json:"value,omitempty"`
	Effect   TaintEffect        `json:"effect"`
	// TODO: For forgiveness (#1574), we'd eventually add at least a grace period
	// here, and possibly an occurrence threshold and period.
}
// A toleration operator is the set of operators that can be used in a toleration.
type TolerationOperator string
const (
	TolerationOpExists TolerationOperator = "Exists"
	TolerationOpEqual  TolerationOperator = "Equal"
)
```
(See [this comment](https://github.com/kubernetes/kubernetes/issues/3885#issuecomment-146002375)
to understand the motivation for the various taint effects.)
We will add:
```go
// Multiple tolerations with the same key are allowed.
Tolerations []Toleration `json:"tolerations,omitempty"`
```
to `PodSpec`. A pod must tolerate all of a node's taints (except taints of type
TaintEffectPreferNoSchedule) in order to be able to schedule onto that node.
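The scheduling rule just stated can be sketched as a pair of small functions. The types below are simplified copies of the API above, and `tolerates` and `canSchedule` are hypothetical helper names. The sketch assumes that a toleration matches a taint only when key and effect are equal and the operator accepts the value, which is one reading of the "triple" wording in the API comments.

```go
package main

import "fmt"

type TaintEffect string

const (
	NoSchedule       TaintEffect = "NoSchedule"
	PreferNoSchedule TaintEffect = "PreferNoSchedule"
)

// Taint and Toleration are simplified copies of the proposed API types.
type Taint struct {
	Key, Value string
	Effect     TaintEffect
}

type Toleration struct {
	Key, Operator, Value string
	Effect               TaintEffect
}

// tolerates reports whether one toleration matches one taint.
// An empty operator is treated as the default, Equal.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Key != t.Key || tol.Effect != t.Effect {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true // wildcard for value
	case "", "Equal":
		return tol.Value == t.Value
	}
	return false
}

// canSchedule requires every taint except PreferNoSchedule ones to be tolerated.
func canSchedule(tols []Toleration, taints []Taint) bool {
	for _, t := range taints {
		if t.Effect == PreferNoSchedule {
			continue // a preference, not a hard requirement
		}
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "A", Effect: NoSchedule}}
	tols := []Toleration{{Key: "dedicated", Operator: "Equal", Value: "A", Effect: NoSchedule}}
	fmt.Println(canSchedule(nil, taints), canSchedule(tols, taints))
}
```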
We will add:
```go
// Multiple taints with the same key are not allowed.
Taints []Taint `json:"taints,omitempty"`
```
to both `NodeSpec` and `NodeStatus`. The value in `NodeStatus` is the union
of the taints specified by various sources. For now, the only source is
the `NodeSpec` itself, but in the future one could imagine a node inheriting
taints from pods (if we were to allow taints to be attached to pods), from
the node's startup configuration, etc. The scheduler should look at the `Taints`
in `NodeStatus`, not in `NodeSpec`.
Taints and tolerations are not scoped to namespace.
## Implementation plan: taints, tolerations, and dedicated nodes
Using taints and tolerations to implement dedicated nodes requires these steps:
1. Add the API described above
1. Add a scheduler predicate function that respects taints and tolerations (for
TaintEffectNoSchedule) and a scheduler priority function that respects taints
and tolerations (for TaintEffectPreferNoSchedule).
1. Add to the Kubelet code to implement the "no admit" behavior of
TaintEffectNoScheduleNoAdmit and TaintEffectNoScheduleNoAdmitNoExecute
1. Implement code in Kubelet that evicts a pod that no longer satisfies
TaintEffectNoScheduleNoAdmitNoExecute. In theory we could do this in the
controllers instead, but since taints might be used to enforce security
policies, it is better to do in kubelet because kubelet can respond quickly and
can guarantee the rules will be applied to all pods. Eviction may need to happen
under a variety of circumstances: when a taint is added, when an existing taint
is updated, when a toleration is removed from a pod, or when a toleration is
modified on a pod.
1. Add a new `kubectl` command that adds/removes taints to/from nodes.
1. (This is the one step that is specific to dedicated nodes.) Implement an
admission controller that adds tolerations to pods that are supposed to be
allowed to use dedicated nodes (for example, based on pod's namespace).
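The admission step in the last item could look roughly like the sketch below. Everything here is hypothetical: the `Pod` and `Toleration` types are trimmed stand-ins, `admit` is an invented name, and the namespace-to-tolerations map is just one possible way to express the policy. It does not use the real admission plugin interface.

```go
package main

import "fmt"

// Toleration and Pod are trimmed, hypothetical stand-ins for the API types.
type Toleration struct{ Key, Operator, Value, Effect string }

type Pod struct {
	Namespace   string
	Tolerations []Toleration
}

// admit appends the tolerations configured for the pod's namespace,
// skipping any the pod already carries.
func admit(pod *Pod, policy map[string][]Toleration) {
	for _, want := range policy[pod.Namespace] {
		found := false
		for _, have := range pod.Tolerations {
			if have == want {
				found = true
				break
			}
		}
		if !found {
			pod.Tolerations = append(pod.Tolerations, want)
		}
	}
}

func main() {
	policy := map[string][]Toleration{
		"banana": {{Key: "dedicated", Operator: "Equal", Value: "banana", Effect: "NoSchedule"}},
	}
	pod := Pod{Namespace: "banana"}
	admit(&pod, policy)
	fmt.Println(pod.Tolerations)
}
```

Pods in namespaces without a policy entry pass through unchanged, so "regular" pods need no knowledge of the dedicated node groups.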
In the future one can imagine a generic policy configuration that configures an
admission controller to apply the appropriate tolerations to the desired class
of pods and taints to Nodes upon node creation. It could be used not just for
policies about dedicated nodes, but also other uses of taints and tolerations,
e.g. nodes that are restricted due to their hardware configuration.
The `kubectl` command to add and remove taints on nodes will be modeled after
`kubectl label`. Example usages:
```sh
# Update node 'foo' with a taint with key 'dedicated' and value 'special-user' and effect 'NoScheduleNoAdmitNoExecute'.
# If a taint with that key already exists, its value and effect are replaced as specified.
$ kubectl taint nodes foo dedicated=special-user:NoScheduleNoAdmitNoExecute
# Remove from node 'foo' the taint with key 'dedicated' if one exists.
$ kubectl taint nodes foo dedicated-
```
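To make the argument syntax in these examples concrete, here is a small parser sketch for the two forms, `key=value:effect` and `key-`. It is an illustration only: `taintArg` and `parseTaintArg` are hypothetical names, and the real `kubectl` parsing and validation may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// taintArg is a hypothetical representation of one command-line taint argument.
type taintArg struct {
	Key, Value, Effect string
	Remove             bool
}

// parseTaintArg accepts "key=value:effect" (add or replace) and "key-" (remove).
func parseTaintArg(arg string) (taintArg, error) {
	if !strings.Contains(arg, "=") {
		// Removal form: the key followed by a trailing dash.
		if key := strings.TrimSuffix(arg, "-"); key != arg && key != "" {
			return taintArg{Key: key, Remove: true}, nil
		}
		return taintArg{}, fmt.Errorf("invalid taint argument %q", arg)
	}
	key, rest, _ := strings.Cut(arg, "=")
	value, effect, ok := strings.Cut(rest, ":")
	if key == "" || !ok || effect == "" {
		return taintArg{}, fmt.Errorf("invalid taint argument %q", arg)
	}
	return taintArg{Key: key, Value: value, Effect: effect}, nil
}

func main() {
	for _, a := range []string{"dedicated=special-user:NoScheduleNoAdmitNoExecute", "dedicated-"} {
		parsed, err := parseTaintArg(a)
		fmt.Println(parsed, err)
	}
}
```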
## Example: implementing a dedicated nodes policy
Let's say that the cluster administrator wants to make nodes `foo`, `bar`, and `baz` available
only to pods in a particular namespace `banana`. First the administrator does
```sh
$ kubectl taint nodes foo dedicated=banana:NoScheduleNoAdmitNoExecute
$ kubectl taint nodes bar dedicated=banana:NoScheduleNoAdmitNoExecute
$ kubectl taint nodes baz dedicated=banana:NoScheduleNoAdmitNoExecute
```
(assuming they want to evict pods that are already running on those nodes if those
pods don't already tolerate the new taint)
Then they ensure that the `PodSpec` for all pods created in namespace `banana` specifies
a toleration with `key=dedicated`, `value=banana`, and `effect=NoScheduleNoAdmitNoExecute`.
In the future, it would be nice to be able to specify the nodes via a `NodeSelector` rather than having
to enumerate them by name.
## Future work
At present, the Kubernetes security model allows any user to add and remove any
taints and tolerations. Obviously this makes it impossible to securely enforce
rules like dedicated nodes. We need some mechanism that prevents regular users
from mutating the `Taints` field of `NodeSpec` (probably we want to prevent them
from mutating any fields of `NodeSpec`) and from mutating the `Tolerations`
field of their pods. [#17549](https://github.com/kubernetes/kubernetes/issues/17549)
is relevant.
Another security vulnerability arises if nodes are added to the cluster before
receiving their taint. Thus we need to ensure that a new node does not become
"Ready" until it has been configured with its taints. One way to do this is to
have an admission controller that adds the taint whenever a Node object is
created.
A quota policy may want to treat nodes differently based on what taints, if any,
they have. For example, if a particular namespace is only allowed to access
dedicated nodes, then it may be convenient to give the namespace unlimited
quota. (To use finite quota, you'd have to size the namespace's quota to the sum
of the sizes of the machines in the dedicated node group, and update it when
nodes are added/removed to/from the group.)
It's conceivable that taints and tolerations could be unified with
[pod anti-affinity](https://github.com/kubernetes/kubernetes/pull/18265).
We have chosen not to do this for the reasons described in the "Future work"
section of that doc.
## Backward compatibility
Old scheduler versions will ignore taints and tolerations. New scheduler
versions will respect them.
Users should not start using taints and tolerations until the full
implementation has been in Kubelet and the master for enough binary versions
that we feel comfortable that we will not need to roll back either Kubelet or
master to a version that does not support them. Longer-term we will use a
programmatic approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
## Related issues
This proposal is based on the discussion in [#17190](https://github.com/kubernetes/kubernetes/issues/17190).
There are a number of other related issues, all of which are linked to from
[#17190](https://github.com/kubernetes/kubernetes/issues/17190).
The relationship between taints and node drains is discussed in [#1574](https://github.com/kubernetes/kubernetes/issues/1574).
The concepts of taints and tolerations were originally developed as part of the
Omega project at Google.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/taint-toleration-dedicated.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/taint-toleration-dedicated.md)

Binary file not shown.

Before

Width:  |  Height:  |  Size: 14 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 20 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 38 KiB

@ -1,174 +1 @@
# Kubernetes API and Release Versioning
Reference: [Semantic Versioning](http://semver.org)
Legend:
* **Kube X.Y.Z** refers to the version (git tag) of Kubernetes that is released.
This versions all components: apiserver, kubelet, kubectl, etc. (**X** is the
major version, **Y** is the minor version, and **Z** is the patch version.)
* **API vX[betaY]** refers to the version of the HTTP API.
## Release versioning
### Minor version scheme and timeline
* Kube X.Y.0-alpha.W, W > 0 (Branch: master)
  * Alpha releases are released roughly every two weeks directly from the master
    branch.
  * No cherrypick releases. If there is a critical bugfix, a new release from
    master can be created ahead of schedule.
* Kube X.Y.Z-beta.W (Branch: release-X.Y)
  * When master is feature-complete for Kube X.Y, we will cut the release-X.Y
    branch 2 weeks prior to the desired X.Y.0 date and cherrypick only PRs essential
    to X.Y.
  * This cut will be marked as X.Y.0-beta.0, and master will be revved to X.(Y+1).0-alpha.0.
  * If we're not satisfied with X.Y.0-beta.0, we'll release other beta releases
    (X.Y.0-beta.W | W > 0) as necessary.
* Kube X.Y.0 (Branch: release-X.Y)
  * Final release, cut from the release-X.Y branch cut two weeks prior.
  * X.Y.1-beta.0 will be tagged at the same commit on the same branch.
  * X.Y.0 occurs 3 to 4 months after X.(Y-1).0.
* Kube X.Y.Z, Z > 0 (Branch: release-X.Y)
  * [Patch releases](#patch-releases) are released as we cherrypick commits into
    the release-X.Y branch (which is at X.Y.Z-beta.W) as needed.
  * X.Y.Z is cut straight from the release-X.Y branch, and X.Y.(Z+1)-beta.0 is
    tagged on the followup commit that updates pkg/version/base.go with the beta
    version.
* Kube X.Y.Z, Z > 0 (Branch: release-X.Y.Z)
  * These are special and different in that the X.Y.Z tag is branched to isolate
    the emergency/critical fix from all other changes that have landed on the
    release branch since the previous tag.
  * Cut release-X.Y.Z branch to hold the isolated patch release
  * Tag release-X.Y.Z branch + fixes with X.Y.(Z+1)
  * Branched [patch releases](#patch-releases) are rarely needed but used for
    emergency/critical fixes to the latest release
  * See [#19849](https://issues.k8s.io/19849) tracking the work that is needed
    for this kind of release to be possible.
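The tagging rules in this list can be summarized as two small helpers. This is only an illustrative sketch, not the release tooling; the function names `branch_cut_tags` and `release_tags` are hypothetical.

```python
def branch_cut_tags(major: int, minor: int) -> dict:
    """Tags created when the release-X.Y branch is cut from master.

    The cut itself becomes X.Y.0-beta.0 and master moves on to the
    first alpha of the next minor version.
    """
    return {
        f"release-{major}.{minor}": f"{major}.{minor}.0-beta.0",
        "master": f"{major}.{minor + 1}.0-alpha.0",
    }


def release_tags(major: int, minor: int, patch: int) -> tuple:
    """Tags created on release-X.Y when X.Y.Z is released.

    The first element is the release tag; the second marks the branch
    as working toward the next patch release.
    """
    return (
        f"{major}.{minor}.{patch}",
        f"{major}.{minor}.{patch + 1}-beta.0",
    )
```

For example, cutting the branch for 1.5 would tag `1.5.0-beta.0` on `release-1.5` and `1.6.0-alpha.0` on master.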
### Major version timeline
There is no mandated timeline for major versions. They only occur when we need
to start the clock on deprecating features. A given major version should be the
latest major version for at least one year from its original release date.
### CI and dev version scheme
* Continuous integration versions also exist, and are versioned off of alpha and
beta releases. X.Y.Z-alpha.W.C+aaaa is C commits after X.Y.Z-alpha.W, with an
additional +aaaa build suffix added; X.Y.Z-beta.W.C+bbbb is C commits after
X.Y.Z-beta.W, with an additional +bbbb build suffix added. Furthermore, builds
made from a dirty build tree (during development, with things in the tree that
are not checked in) have -dirty appended to the version.
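To make the scheme concrete, here is a minimal parsing sketch for version strings of this shape. The regular expression, the `KubeVersion` type, and the optional leading `v` are assumptions made for illustration; they are not taken from the Kubernetes build scripts.

```python
import re
from typing import NamedTuple, Optional

# Assumed grammar: X.Y.Z[-(alpha|beta).W[.C]][+build][-dirty], optional "v" prefix.
VERSION_RE = re.compile(
    r"^v?(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(?:-(?P<stage>alpha|beta)\.(?P<w>\d+)(?:\.(?P<commits>\d+))?)?"
    r"(?:\+(?P<build>[0-9A-Za-z]+))?"
    r"(?P<dirty>-dirty)?$"
)


class KubeVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    stage: Optional[str]              # "alpha", "beta", or None for a release
    stage_number: Optional[int]       # the W in alpha.W / beta.W
    commits_since_tag: Optional[int]  # the C in CI versions
    build: Optional[str]              # the +aaaa build suffix, without "+"
    dirty: bool


def parse_version(text: str) -> KubeVersion:
    """Split a version string into its components, or raise ValueError."""
    m = VERSION_RE.match(text)
    if m is None:
        raise ValueError(f"not a recognized version: {text!r}")

    def to_int(name: str) -> Optional[int]:
        value = m.group(name)
        return int(value) if value is not None else None

    return KubeVersion(
        major=int(m.group("major")),
        minor=int(m.group("minor")),
        patch=int(m.group("patch")),
        stage=m.group("stage"),
        stage_number=to_int("w"),
        commits_since_tag=to_int("commits"),
        build=m.group("build"),
        dirty=m.group("dirty") is not None,
    )
```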
### Supported releases and component skew
We expect users to stay reasonably up-to-date with the versions of Kubernetes
they use in production, but understand that it may take time to upgrade,
especially for production-critical components.
We expect users to be running approximately the latest patch release of a given
minor release; we often include critical bug fixes in
[patch releases](#patch-releases), and so encourage users to upgrade as soon as
possible.
Different components are expected to be compatible across different amounts of
skew, all relative to the master version. Nodes may lag master components by
up to two minor versions but should be at a version no newer than the master; a
client should be skewed no more than one minor version from the master, but may
lead the master by up to one minor version. For example, a v1.3 master should
work with v1.1, v1.2, and v1.3 nodes, and should work with v1.2, v1.3, and v1.4
clients.
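The skew rules reduce to two comparisons on minor version numbers. The sketch below assumes all components share the same major version; the function names are hypothetical.

```python
def node_skew_ok(master_minor: int, node_minor: int) -> bool:
    """Nodes may lag the master by up to two minor versions, never lead it."""
    return 0 <= master_minor - node_minor <= 2


def client_skew_ok(master_minor: int, client_minor: int) -> bool:
    """Clients may lag or lead the master by at most one minor version."""
    return abs(master_minor - client_minor) <= 1
```

With a v1.3 master, `node_skew_ok(3, n)` holds for `n` in 1, 2, 3 and `client_skew_ok(3, c)` holds for `c` in 2, 3, 4, which matches the example above.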
Furthermore, we expect to "support" three minor releases at a time. "Support"
means we expect users to be running that version in production, though we may
not port fixes back before the latest minor version. For example, when v1.3
comes out, v1.0 will no longer be supported: basically, that means that the
reasonable response to the question "my v1.0 cluster isn't working," is, "you
should probably upgrade it, (and probably should have some time ago)". With
minor releases happening approximately every three months, that means a minor
release is supported for approximately nine months.
This policy is in line with
[GKE's supported upgrades policy](https://cloud.google.com/container-engine/docs/clusters/upgrade).
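As a quick illustration of the support window, the following sketch lists the supported minor versions for a given latest minor release and estimates the support period. The helper names and default arguments are assumptions chosen to mirror the numbers in the text.

```python
def supported_minors(latest_minor: int, window: int = 3) -> list:
    """Minor versions still supported when `latest_minor` is the newest release."""
    oldest = max(0, latest_minor - window + 1)
    return list(range(oldest, latest_minor + 1))


def approx_support_months(window: int = 3, months_between_minors: int = 3) -> int:
    """Rough support duration: one release interval per supported minor."""
    return window * months_between_minors
```

When v1.3 is the latest release, `supported_minors(3)` yields `[1, 2, 3]`, so v1.0 has dropped out of the window.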
## API versioning
### Release versions as related to API versions
Here is an example major release cycle:
* **Kube 1.0 should have API v1 without v1beta\* API versions**
* The last version of Kube before 1.0 (e.g. 0.14 or whatever it is) will have
the stable v1 API. This enables you to migrate all your objects off of the beta
API versions of the API and allows us to remove those beta API versions in Kube
1.0 with no effect. There will be tooling to help you detect and migrate any
v1beta\* data versions or calls to v1 before you do the upgrade.
* **Kube 1.x may have API v2beta\***
* The first incarnation of a new (backwards-incompatible) API in HEAD is
v2beta1. By default this will be unregistered in apiserver, so it can change
freely. Once it is available by default in apiserver (which may not happen for
several minor releases), it cannot change ever again because we serialize
objects in versioned form, and we always need to be able to deserialize any
objects that are saved in etcd, even between alpha versions. If further changes
to v2beta1 need to be made, v2beta2 is created, and so on, in subsequent 1.x
versions.
* **Kube 1.y (where y is the last version of the 1.x series) must have final
API v2**
* Before Kube 2.0 is cut, API v2 must be released in 1.x. This enables two
things: (1) users can upgrade to API v2 when running Kube 1.x and then switch
over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can
cleanup and remove all API v2beta\* versions because no one should have
v2beta\* objects left in their database. As mentioned above, tooling will exist
to make sure there are no calls or references to a given API version anywhere
inside someone's kube installation before someone upgrades.
* Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only.
It *may* include the v1 API as well if the burden is not high - this will be
determined on a per-major-version basis.
#### Rationale for API v2 being complete before v2.0's release
It may seem a bit strange to complete the v2 API before v2.0 is released,
but *adding* a v2 API is not a breaking change. *Removing* the v2beta\*
APIs *is* a breaking change, which is what necessitates the major version bump.
There are other ways to do this, but having the major release be the fresh start
of that release's API without the baggage of its beta versions seems most
intuitive out of the available options.
## Patch releases
Patch releases are intended for critical bug fixes to the latest minor version,
such as addressing security vulnerabilities, fixes to problems affecting a large
number of users, severe problems with no workaround, and blockers for products
based on Kubernetes.
They should not contain miscellaneous feature additions or improvements, and
especially no incompatibilities should be introduced between patch versions of
the same minor version (or even major version).
Dependencies, such as Docker or Etcd, should also not be changed unless
absolutely necessary, and also just to fix critical bugs (so, at most patch
version changes, not new major nor minor versions).
## Upgrades
* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a
  rolling upgrade across their cluster. (Rolling upgrade means being able to
  upgrade the master first, then one node at a time. See #4855 for details.)
  * However, we do not recommend upgrading more than two minor releases at a
    time (see [Supported releases](#supported-releases-and-component-skew)), and do not recommend
    running non-latest patch releases of a given minor release.
* No hard breaking changes over version boundaries.
  * For example, if a user is at Kube 1.x, we may require them to upgrade to
    Kube 1.x+y before upgrading to Kube 2.x. In other words, an upgrade across
    major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as
    graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone
    to go from 1.x to 1.x+y before they go to 2.x.
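The recommendation to move at most two minor releases at a time can be turned into a simple hop plan. This is a sketch under the assumption that both versions share a major version; `plan_minor_upgrades` is a hypothetical helper, not an existing tool.

```python
def plan_minor_upgrades(current_minor: int, target_minor: int, max_step: int = 2) -> list:
    """Return the minor versions to pass through, moving at most `max_step` at a time."""
    if target_minor < current_minor:
        raise ValueError("downgrades are not covered by this sketch")
    hops = []
    while current_minor < target_minor:
        current_minor = min(current_minor + max_step, target_minor)
        hops.append(current_minor)
    return hops
```

Going from 1.1 to 1.6 this way would pass through 1.3 and 1.5 before reaching 1.6.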
There is a separate question of how to track the capabilities of a kubelet to
facilitate rolling upgrades. That is not addressed here.
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/versioning.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/versioning.md)

View File

@ -1,523 +1 @@
Kubernetes Snapshotting Proposal
================================
**Authors:** [Cindy Wang](https://github.com/ciwang)
## Background
Many storage systems (GCE PD, Amazon EBS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs).
Typical existing backup solutions offer on demand or scheduled snapshots.
An application developer using a storage system may want to create a snapshot before an update or other major event. Kubernetes does not currently offer a standardized snapshot API for creating, listing, deleting, and restoring snapshots on an arbitrary volume.
Existing solutions for scheduled snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265) and [external storage drivers](http://rancher.com/introducing-convoy-a-docker-volume-driver-for-backup-and-recovery-of-persistent-data/). Some cloud storage volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves.
## Objectives
For the first version of snapshotting support in Kubernetes, only on-demand snapshots will be supported. Features listed in the roadmap for future versions are also nongoals.
* Goal 1: Enable *on-demand* snapshots of Kubernetes persistent volumes by application developers.
* Nongoal: Enable *automatic* periodic snapshotting for direct volumes in pods.
* Goal 2: Expose standardized snapshotting operations Create and List in Kubernetes REST API.
* Nongoal: Support Delete and Restore snapshot operations in API.
* Goal 3: Implement snapshotting interface for GCE PDs.
* Nongoal: Implement snapshotting interface for non GCE PD volumes.
### Feature Roadmap
Major features, in order of priority (bold features are priorities for v1):
* **On demand snapshots**
* **API to create new snapshots and list existing snapshots**
* API to restore a disk from a snapshot and delete old snapshots
* Scheduled snapshots
* Support snapshots for non-cloud storage volumes (i.e. plugins that require actions to be triggered from the node)
## Requirements
### Performance
* Time SLA from issuing a snapshot to completion:
  * The period we are interested in is the time between the scheduled snapshot time and the time the snapshot finishes uploading to its storage location.
  * This should be on the order of a few minutes.
### Reliability
* Data corruption
  * Though it is generally recommended to stop application writes before executing the snapshot command, we will not do this for several reasons:
    * GCE and Amazon can create snapshots while the application is running.
    * Stopping application writes cannot be done from the master and varies by application, so doing so will introduce unnecessary complexity and permission issues in the code.
    * Most file systems and server applications are (and should be) able to restore inconsistent snapshots the same way as a disk that underwent an unclean shutdown.
* Snapshot failure
  * Case: Failure during external process, such as during API call or upload
    * Log error, retry until success (indefinitely)
  * Case: Failure within Kubernetes, such as controller restarts
    * If the master restarts in the middle of a snapshot operation, then the controller does not know whether or not the operation succeeded. However, since the annotation has not been deleted, the controller will retry, which may result in a crash loop if the first operation has not yet completed. This issue will not be addressed in the alpha version, but future versions will need to address it by persisting state.
## Solution Overview
Snapshot operations will be triggered by [annotations](http://kubernetes.io/docs/user-guide/annotations/) on PVC API objects.
* **Create:**
* Key: create.snapshot.volume.alpha.kubernetes.io
* Value: [snapshot name]
* **List:**
* Key: snapshot.volume.alpha.kubernetes.io/[snapshot name]
* Value: [snapshot timestamp]
A new controller responsible solely for snapshot operations will be added to the controllermanager on the master. This controller will watch the API server for new annotations on PVCs. When a create snapshot annotation is added, it will trigger the appropriate snapshot creation logic for the underlying persistent volume type. The list annotations will be populated by the controller and identify only the snapshots created for that PVC by Kubernetes.
The snapshot operation is a no-op for volume plugins that do not support snapshots via an API call (i.e. non-cloud storage).
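To illustrate the annotation scheme, the sketch below models a PVC as a plain dictionary and shows how a create request and the snapshot list might be read and written. The dictionary layout and the helper names `request_snapshot` and `list_snapshots` are hypothetical; only the two annotation keys come from this proposal.

```python
CREATE_KEY = "create.snapshot.volume.alpha.kubernetes.io"
LIST_PREFIX = "snapshot.volume.alpha.kubernetes.io/"


def request_snapshot(pvc: dict, snapshot_name: str) -> None:
    """Add the create annotation that the controller watches for."""
    annotations = pvc.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[CREATE_KEY] = snapshot_name


def list_snapshots(pvc: dict) -> dict:
    """Return {snapshot name: timestamp} from the list annotations on a PVC."""
    annotations = pvc.get("metadata", {}).get("annotations", {})
    return {
        key[len(LIST_PREFIX):]: value
        for key, value in annotations.items()
        if key.startswith(LIST_PREFIX)
    }
```

A user would only ever write the create annotation; the list annotations are written by the controller and read back by fetching the PVC object.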
## Detailed Design
### API
* Create snapshot
  * Usage:
    * Users create annotation with key "create.snapshot.volume.alpha.kubernetes.io", value does not matter
    * When the annotation is deleted, the operation has succeeded. The snapshot will be listed in the value of snapshot-list.
    * API is declarative and guarantees only that it will begin attempting to create the snapshot once the annotation is created and will complete eventually.
  * PVC control loop in master
    * If annotation on new PVC, search for PV of volume type that implements SnapshottableVolumePlugin. If one is available, use it. Otherwise, reject the claim and post an event to the PV.
    * If annotation on existing PVC, if PV type implements SnapshottableVolumePlugin, continue to SnapshotController logic. Otherwise, delete the annotation and post an event to the PV.
* List existing snapshots
  * Only displayed as annotations on PVC object.
  * Only lists unique names and timestamps of snapshots taken using the Kubernetes API.
  * Usage:
    * Get the PVC object
    * Snapshots are listed as key-value pairs within the PVC annotations
### SnapshotController
![Snapshot Controller Diagram](volume-snapshotting.png?raw=true "Snapshot controller diagram")
**PVC Informer:** A shared informer that stores (references to) PVC objects, populated by the API server. The annotations on the PVC objects are used to add items to SnapshotRequests.
**SnapshotRequests:** An in-memory cache of incomplete snapshot requests that is populated by the PVC informer. This maps unique volume IDs to PVC objects. Volumes are added when the create snapshot annotation is added, and deleted when snapshot requests are completed successfully.
**Reconciler:** Simple loop that triggers asynchronous snapshots via the OperationExecutor. Deletes create snapshot annotation if successful.
The controller will have a loop that does the following:
* Fetch State
  * Fetch all PVC objects from the API server.
* Act
  * Trigger snapshot:
    * Loop through SnapshotRequests and trigger create snapshot logic (see below) for any PVCs that have the create snapshot annotation.
* Persist State
  * Once a snapshot operation completes, write the snapshot ID/timestamp to the PVC Annotations and delete the create snapshot annotation in the PVC object via the API server.
Snapshot operations can take a long time to complete, so the primary controller loop should not block on these operations. Instead the reconciler should spawn separate threads for these operations via the operation executor.
The controller will reject snapshot requests if the unique volume ID already exists in the SnapshotRequests. Concurrent operations on the same volume will be prevented by the operation executor.
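The control loop described in this section could be sketched roughly as follows. This is a simplified, in-memory illustration and not the proposed implementation: PVCs are plain dictionaries, `take_snapshot` is a hypothetical callable standing in for the volume plugin, and a thread pool stands in for the operation executor. It does not guard against two overlapping reconcile passes, which the real operation executor is expected to handle.

```python
from concurrent.futures import ThreadPoolExecutor

CREATE_KEY = "create.snapshot.volume.alpha.kubernetes.io"
LIST_PREFIX = "snapshot.volume.alpha.kubernetes.io/"


class SnapshotController:
    def __init__(self, take_snapshot, max_workers: int = 4):
        # Hypothetical plugin hook: (volume_id, snapshot_name) -> timestamp string.
        self._take_snapshot = take_snapshot
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self.requests = {}  # SnapshotRequests: unique volume ID -> PVC object

    def observe(self, volume_id: str, pvc: dict) -> bool:
        """Informer callback: queue the PVC if it carries the create annotation."""
        if CREATE_KEY not in pvc["metadata"].get("annotations", {}):
            return False
        if volume_id in self.requests:
            return False  # a request for this volume is already pending
        self.requests[volume_id] = pvc
        return True

    def reconcile(self) -> list:
        """One pass of the loop: start an asynchronous snapshot per pending request."""
        pending = list(self.requests.items())
        return [self._executor.submit(self._run, vid, pvc) for vid, pvc in pending]

    def _run(self, volume_id: str, pvc: dict) -> None:
        annotations = pvc["metadata"]["annotations"]
        name = annotations[CREATE_KEY]
        # If this raises, the annotation and the request stay in place,
        # so a later reconcile pass retries.
        timestamp = self._take_snapshot(volume_id, name)
        # Persist state: record the snapshot, then clear the request.
        annotations[LIST_PREFIX + name] = timestamp
        del annotations[CREATE_KEY]
        del self.requests[volume_id]
```

Because `reconcile` only submits work and returns futures, the main loop never blocks on a slow snapshot.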
### Create Snapshot Logic
To create a snapshot:
* Acquire operation lock for volume so that no other attach or detach operations can be started for volume.
  * Abort if there is already a pending operation for the specified volume (main loop will retry, if needed).
* Spawn a new thread:
  * Execute the volume-specific logic to create a snapshot of the persistent volume referenced by the PVC.
  * For any errors, log the error, and terminate the thread (the main controller will retry as needed).
  * Once a snapshot is created successfully:
    * Make a call to the API server to delete the create snapshot annotation in the PVC object.
    * Make a call to the API server to add the new snapshot ID/timestamp to the PVC Annotations.
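The locking behaviour in these steps can be sketched with a small executor that tracks pending volumes. This is an assumption-laden illustration: the class name `OperationExecutor` is borrowed from the text, but the `start` method, its arguments, and the use of Python threads are hypothetical.

```python
import logging
import threading

log = logging.getLogger("snapshot-sketch")


class OperationExecutor:
    def __init__(self):
        self._guard = threading.Lock()
        self._pending = set()  # volume IDs with an operation in flight

    def start(self, volume_id, operation, on_success):
        """Run `operation` in a new thread unless one is already pending.

        Returns the started thread, or None if the request was aborted
        because the volume already has a pending operation.
        """
        with self._guard:
            if volume_id in self._pending:
                return None  # abort; the main loop will retry if needed
            self._pending.add(volume_id)

        def run():
            try:
                result = operation()
            except Exception as err:
                # Log and end the thread; the main controller retries later.
                log.error("snapshot of %s failed: %s", volume_id, err)
            else:
                # e.g. delete the create annotation and record the snapshot.
                on_success(result)
            finally:
                with self._guard:
                    self._pending.discard(volume_id)

        thread = threading.Thread(target=run, daemon=True)
        thread.start()
        return thread
```

The per-volume entry is released in `finally`, so a failed snapshot does not block later retries for the same volume.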
*Brainstorming notes below, read at your own risk!*
* * *
Open questions:
* What has more value: scheduled snapshotting or exposing snapshotting/backups as a standardized API?
* It seems that the API route is a bit more feasible in implementation and can also be fully utilized.
* Can the API call methods on VolumePlugins? Yeah via controller
* The scheduler gives users functionality that doesn't already exist, but requires adding an entirely new controller
* Should the list and restore operations be part of v1?
* Do we call them snapshots or backups?
* From the SIG email: "The snapshot should not be suggested to be a backup in any documentation, because in practice is is necessary, but not sufficient, when conducting a backup of a stateful application."
* At what minimum granularity should snapshots be allowed?
* How do we store information about the most recent snapshot in case the controller restarts?
* In case of error, do we err on the side of fewer or more snapshots?
Snapshot Scheduler
1. PVC API Object
A new field, backupSchedule, will be added to the PVC API Object. The value of this field must be a cron expression.
* CRUD operations on snapshot schedules
* Create: Specify a snapshot within a PVC spec as a [cron expression](http://crontab-generator.org/)
* The cron expression provides flexibility to decrease the interval between snapshots in future versions
* Read: Display snapshot schedule to user via kubectl get pvc
* Update: Do not support changing the snapshot schedule for an existing PVC
* Delete: Do not support deleting the snapshot schedule for an existing PVC
* In v1, the snapshot schedule is tied to the lifecycle of the PVC. Update and delete operations are therefore not supported. In future versions, this may be done using kubectl edit pvc/name
* Validation
* Cron expressions must have a 0 in the minutes place and use exact, not interval syntax
* [EBS](http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/TakeScheduledSnapshot.html) appears to be able to take snapshots at the granularity of minutes, GCE PD takes at most minutes. Therefore for v1, we ensure that snapshots are taken at most hourly and at exact times (rather than at time intervals).
* If Kubernetes cannot find a PV that supports snapshotting via its API, reject the PVC and display an error message to the user
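The validation rule for `backupSchedule` might look like the sketch below. It assumes a standard five-field cron expression and interprets "interval syntax" as the `/` step operator; both are assumptions, and `schedule_errors` is a hypothetical helper.

```python
def schedule_errors(expr: str) -> list:
    """Return a list of validation problems; an empty list means the schedule is accepted."""
    fields = expr.split()
    if len(fields) != 5:
        return ["expected 5 cron fields (minute hour day-of-month month day-of-week)"]
    errors = []
    if fields[0] != "0":
        errors.append("minutes field must be 0 so snapshots run on the hour")
    if any("/" in field for field in fields):
        errors.append("interval syntax ('/') is not allowed; list exact times instead")
    return errors
```

Under these assumptions `0 1,5 * * *` (01:00 and 05:00 every day) is accepted, while `30 1 * * *` and `0 */2 * * *` are rejected.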
Objective
Goal: Enable automatic periodic snapshotting (NOTE: A snapshot is a read-only copy of a disk.) for all kubernetes volume plugins.
Goal: Implement snapshotting interface for GCE PDs.
Goal: Protect against data loss by allowing users to restore snapshots of their disks.
Nongoal: Implement snapshotting support on Kubernetes for non GCE PD volumes.
Nongoal: Use snapshotting to provide additional features such as migration.
Background
Many storage systems (GCE PD, Amazon EBS, NFS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs).
Currently, no container orchestration software (i.e. Kubernetes and its competitors) provides snapshot scheduling for application storage.
Existing solutions for automatic snapshotting include [cron jobs](https://forums.aws.amazon.com/message.jspa?messageID=570265)/shell scripts. Some volumes can be configured to take automatic snapshots, but this is specified on the volumes themselves, not via their associated applications. Snapshotting support gives Kubernetes clear competitive advantage for users who want automatic snapshotting on their volumes, and particularly those who want to configure application-specific schedules.
what is the value case? Who wants this? What do we enable by implementing this?
I think it introduces a lot of complexity, so what is the pay off? That should be clear in the document. Do mesos, or swarm or our competition implement this? AWS? Just curious.
Requirements
Functionality
Should this support PVs, direct volumes, or both?
Should we support deletion?
Should we support restores?
Automated schedule -- times or intervals? Before major event?
Performance
Snapshots are supposed to provide timely state freezing. What is the SLA from issuing one to it completing?
* GCE: The snapshot operation takes [a fraction of a second](https://cloudplatform.googleblog.com/2013/10/persistent-disk-backups-using-snapshots.html). If file writes can be paused, they should be paused until the snapshot is created (but can be restarted while it is pending). If file writes cannot be paused, the volume should be unmounted before snapshotting then remounted afterwards.
* Pending = uploading to GCE
* EBS is the same, but if the volume is the root device the instance should be stopped before snapshotting
Reliability
How do we ascertain that deletions happen when we want them to?
For the same reasons that Kubernetes should not expose a direct create-snapshot command, it should also not allow users to delete snapshots for arbitrary volumes from Kubernetes.
We may, however, want to allow users to set a snapshotExpiryPeriod and delete snapshots once they have reached certain age. At this point we do not see an immediate need to implement automatic deletion (re:Saad) but may want to revisit this.
What happens when the snapshot fails as these are async operations?
Retry (for some time period? indefinitely?) and log the error
Other
What is the UI for seeing the list of snapshots?
In the case of GCE PD, the snapshots are uploaded to cloud storage. They are visible and manageable from the GCE console. The same applies for other cloud storage providers (i.e. Amazon). Otherwise, users may need to ssh into the device and access a ./snapshot or similar directory. In other words, users will continue to access snapshots in the same way as they have been while creating manual snapshots.
Overview
Each layer of the implementation has several design options, as follows.
1. **Public API:**
    Users will specify a snapshotting schedule for particular volumes, which Kubernetes will then execute automatically. There are several options for where this specification can happen. In order from most to least invasive:
    1. New Volume API object
        1. Currently, pods, PVs, and PVCs are API objects, but Volume is not. A volume is represented as a field within pod/PV objects and its details are lost upon destruction of its enclosing object.
        2. We define Volume to be a brand new API object, with a snapshot schedule attribute that specifies the time at which Kubernetes should call out to the volume plugin to create a snapshot.
        3. The Volume API object will be referenced by the pod/PV API objects. The new Volume object exists entirely independently of the Pod object.
        4. Pros
            1. Snapshot schedule conflicts: Since a single Volume API object ideally refers to a single volume, each volume has a single unique snapshot schedule. In the case where the same underlying PD is used by different pods which specify different snapshot schedules, we have a straightforward way of identifying and resolving the conflicts. Instead of using extra space to create duplicate snapshots, we can decide to, for example, use the most frequent snapshot schedule.
        5. Cons
            1. Heavyweight codewise; involves changing and touching a lot of existing code.
            2. Potentially bad UX: How is the Volume API object created?
                1. By the user independently of the pod (i.e. with something like my-volume.yaml). In order to create 1 pod with a volume, the user needs to create 2 yaml files and run 2 commands.
                2. When a unique volume is specified in a pod or PV spec.
    2. Directly in volume definition in the pod/PV object
        1. When specifying a volume as part of the pod or PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule.
        2. Pros
            1. Easy for users to implement and understand
        3. Cons
            1. The same underlying PD may be used by different pods. In this case, we need to resolve when and how often to take snapshots. If two pods specify the same snapshot time for the same PD, we should not perform two snapshots at that time. However, there is no unique global identifier for a volume defined in a pod definition--its identifying details are particular to the volume plugin used.
            2. Replica sets have the same pod spec and support needs to be added so that underlying volume used does not create new snapshots for each member of the set.
    3. Only in PV object
        1. When specifying a volume as part of the PV spec, users have the option to include an extra attribute, e.g. ssTimes, to denote the snapshot schedule.
        2. Pros
            1. Slightly cleaner than (b). It logically makes more sense to specify snapshotting at the time of the persistent volume definition (as opposed to in the pod definition) since the snapshot schedule is a volume property.
        3. Cons
            1. No support for direct volumes
            2. Only useful for PVs that do not already have automatic snapshotting tools (e.g. Schedule Snapshot Wizard for iSCSI) -- many do and the same can be achieved with a simple cron job
            3. Same problems as (b) with respect to non-unique resources. We may have 2 PV API objects for the same underlying disk and need to resolve conflicting/duplicated schedules.
    4. Annotations: key value pairs on API object
        1. User experience is the same as (b)
        2. Instead of storing the snapshot attribute on the pod/PV API object, save this information in an annotation. For instance, if we define a pod with two volumes we might have {"ssTimes-vol1": [1,5], "ssTimes-vol2": [2,17]} where the values are slices of integer values representing UTC hours.
        3. Pros
            1. Less invasive to the codebase than (a-c)
        4. Cons
            1. Same problems as (b-c) with non-unique resources. The only difference here is the API object representation.
2. **Business logic:**
    1. Does this go on the master, node, or both?
        1. Where the snapshot is stored
            1. GCE, Amazon: cloud storage
            2. Others stored on volume itself (gluster) or external drive (iSCSI)
        2. Requirements for snapshot operation
            1. Application flush, sync, and fsfreeze before creating snapshot
    2. Suggestion:
        1. New SnapshotController on master
            1. Controller keeps a list of active pods/volumes, schedule for each, last snapshot
            2. If controller restarts and we miss a snapshot in the process, just skip it
                1. Alternatively, try creating the snapshot up to the time + retryPeriod (see 5)
            3. If snapshotting call fails, retry for an amount of time specified in retryPeriod
            4. Timekeeping mechanism: something similar to [cron](http://stackoverflow.com/questions/3982957/how-does-cron-internally-schedule-jobs); keep list of snapshot times, calculate time until next snapshot, and sleep for that period
        2. Logic to prepare the disk for snapshotting on node
            1. Application I/Os need to be flushed and the filesystem should be frozen before snapshotting (on GCE PD)
    3. Alternatives: logic entirely on node
        1. Problems:
            1. If pod moves from one node to another
                1. A different node is now in charge of snapshotting
                2. If the volume plugin requires external memory for snapshots, we need to move the existing data
            2. If the same pod exists on two different nodes, which node is in charge
3. **Volume plugin interface/internal API:**
    1. Allow VolumePlugins to implement the SnapshottableVolumePlugin interface (structure similar to AttachableVolumePlugin)
    2. When logic is triggered for a snapshot by the SnapshotController, the SnapshottableVolumePlugin calls out to volume plugin API to create snapshot
    3. Similar to volume.attach call
4. **Other questions:**
    1. Snapshot period
    2. Time or period
    3. What is our SLO around time accuracy?
        1. Best effort, but no guarantees (depends on time or period) -- if going with time.
    4. What if we miss a snapshot?
        1. We will retry (assuming this means that we failed) -- take at the nearest next opportunity
    5. Will we know when an operation has failed? How do we report that?
        1. Get response from volume plugin API, log in kubelet log, generate Kube event in success and failure cases
    6. Will we be responsible for GCing old snapshots?
        1. Maybe this can be an explicit non-goal; garbage collection can be automated in the future
    7. If the pod dies do we continue creating snapshots?
    8. How to communicate errors (PD doesn't support snapshotting, time period unsupported)
    9. Off schedule snapshotting like before an application upgrade
    10. We may want to take snapshots of encrypted disks. For instance, for GCE PDs, the encryption key must be passed to gcloud to snapshot an encrypted disk. Should Kubernetes handle this?
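The volume plugin interface item in the list above suggests an optional SnapshottableVolumePlugin interface. A rough sketch of that idea is shown here in Python for brevity; the method names, the `trigger_snapshot` helper, and the two example plugins are all hypothetical and do not correspond to existing Kubernetes code.

```python
from abc import ABC, abstractmethod
from typing import Optional


class VolumePlugin(ABC):
    @abstractmethod
    def name(self) -> str:
        """Return the plugin name."""


class SnapshottableVolumePlugin(VolumePlugin):
    """Optional capability, analogous in spirit to an attachable plugin."""

    @abstractmethod
    def create_snapshot(self, volume_id: str, snapshot_name: str) -> str:
        """Create a snapshot and return an identifier for it."""


def trigger_snapshot(plugin: VolumePlugin, volume_id: str, snapshot_name: str) -> Optional[str]:
    """Call the plugin if it supports snapshots; otherwise do nothing."""
    if not isinstance(plugin, SnapshottableVolumePlugin):
        return None  # snapshotting is a no-op for this volume type
    return plugin.create_snapshot(volume_id, snapshot_name)


class InMemorySnapshotPlugin(SnapshottableVolumePlugin):
    """Toy plugin used only to exercise the interface."""

    def __init__(self):
        self.snapshots = []

    def name(self) -> str:
        return "example/in-memory"

    def create_snapshot(self, volume_id: str, snapshot_name: str) -> str:
        snapshot_id = f"{volume_id}/{snapshot_name}"
        self.snapshots.append(snapshot_id)
        return snapshot_id


class PlainPlugin(VolumePlugin):
    """Toy plugin without snapshot support."""

    def name(self) -> str:
        return "example/plain"
```

Keeping snapshot support as a separate interface lets the controller detect capable plugins with a type check instead of a per-plugin flag.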
Options, pros, cons, suggestion/recommendation
Example 1b
During pod creation, a user can specify a pod definition in a yaml file. As part of this specification, users should be able to denote a [list of] times at which an existing snapshot command can be executed on the pod's associated volume.
For a simple example, take the definition of a [pod using a GCE PD](http://kubernetes.io/docs/user-guide/volumes/#example-pod-2):

    apiVersion: v1
    kind: Pod
    metadata:
      name: test-pd
    spec:
      containers:
      - image: gcr.io/google_containers/test-webserver
        name: test-container
        volumeMounts:
        - mountPath: /test-pd
          name: test-volume
      volumes:
      - name: test-volume
        # This GCE PD must already exist.
        gcePersistentDisk:
          pdName: my-data-disk
          fsType: ext4

Introduce a new field into the volume spec:

    apiVersion: v1
    kind: Pod
    metadata:
      name: test-pd
    spec:
      containers:
      - image: gcr.io/google_containers/test-webserver
        name: test-container
        volumeMounts:
        - mountPath: /test-pd
          name: test-volume
      volumes:
      - name: test-volume
        # This GCE PD must already exist.
        gcePersistentDisk:
          pdName: my-data-disk
          fsType: ext4
        ssTimes: [1, 5]

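Given a schedule such as `ssTimes: [1, 5]`, a cron-like timekeeper only needs the delay until the next scheduled hour. The sketch below assumes the values are UTC hours, as in the annotation option discussed earlier; `seconds_until_next_snapshot` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_next_snapshot(ss_times, now: datetime) -> float:
    """Return how long to sleep before the next scheduled snapshot.

    `ss_times` is a list of UTC hours (0-23); `now` must be timezone-aware.
    """
    if not ss_times:
        raise ValueError("ss_times must contain at least one hour")
    now = now.astimezone(timezone.utc)
    candidates = []
    for hour in ss_times:
        run_at = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if run_at <= now:
            run_at += timedelta(days=1)  # already passed today; use tomorrow
        candidates.append(run_at)
    return (min(candidates) - now).total_seconds()
```

At 03:30 UTC with `[1, 5]` the next snapshot is at 05:00, so the controller would sleep for 5400 seconds.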
Caveats
* Snapshotting should not be exposed to the user through the Kubernetes API (via an operation such as create-snapshot) because
* this does not provide value to the user and only adds an extra layer of indirection/complexity.
* ?
Dependencies
* Kubernetes
* Persistent volume snapshot support through API
* POST https://www.googleapis.com/compute/v1/projects/example-project/zones/us-central1-f/disks/example-disk/createSnapshot
This file has moved to [https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-snapshotting.md](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/volume-snapshotting.md)
