Cleanup.
This commit is contained in:
parent 0e889a1afd
commit 8202906055
@@ -31,7 +31,7 @@ Documentation for other releases can be found at
 
 ## Introduction
 
-NOTE: It is useful to read about [node affinity](https://github.com/kubernetes/kubernetes/pull/18261) first.
+NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.
 
 This document describes a proposal for specifying and implementing inter-pod topological affinity and
 anti-affinity. By that we mean: rules that specify that certain pods should be placed
@@ -61,11 +61,11 @@ be met later on at runtime, for one of them the system will try to eventually ev
 while for the other the system may not try to do so. The third variant
 simply provides scheduling-time *hints* that the scheduler will try
 to satisfy but may not be able to. These three variants are directly analogous to the three
-variants of [node affinity](https://github.com/kubernetes/kubernetes/pull/18261).
+variants of [node affinity](nodeaffinity.md).
 
 Note that this proposal is only about *inter-pod* topological affinity and anti-affinity.
 There are other forms of topological affinity and anti-affinity. For example,
-you can use [node affinity](https://github.com/kubernetes/kubernetes/pull/18261) to require (prefer)
+you can use [node affinity](nodeaffinity.md) to require (prefer)
 that a set of pods all be scheduled in some specific zone Z. Node affinity is not
 capable of expressing inter-pod dependencies, and conversely the API
 we describe in this document is not capable of expressing node affinity rules.
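
Note: for orientation, the three variants discussed in this hunk surface in the API as three differently named fields. The Go sketch below is schematic only; the field names come from the roadmap later in this diff, while the exact field types (for example, the weighted form of the preferred variant) are defined in the full proposal, and `placeholderTerm` is an invented stand-in.

```go
package podaffinitysketch

// placeholderTerm stands in for the PodAffinityTerm struct shown later in
// this diff; it is not the real API type.
type placeholderTerm struct{}

// podAntiAffinitySketch illustrates the naming of the three variants only.
type podAntiAffinitySketch struct {
	// Must be satisfied at scheduling time, and the system will try to keep
	// it satisfied at runtime, eventually evicting violating pods.
	RequiredDuringSchedulingRequiredDuringExecution []placeholderTerm

	// Must be satisfied at scheduling time; later violations may be ignored.
	RequiredDuringSchedulingIgnoredDuringExecution []placeholderTerm

	// A scheduling-time hint the scheduler tries, but is not required, to satisfy.
	PreferredDuringSchedulingIgnoredDuringExecution []placeholderTerm
}
```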
@@ -159,7 +159,7 @@ type PodAffinityTerm struct {
 // nil list means "this pod's namespace," empty list means "all namespaces"
 // The json tag here is not "omitempty" since we need to distinguish nil and empty.
 // See https://golang.org/pkg/encoding/json/#Marshal for more details.
-Namespaces []api.Namespace `json:"namespaces"`
+Namespaces []api.Namespace `json:"namespaces,omitempty"`
 // empty topology key is interpreted by the scheduler as "all topologies"
 TopologyKey string `json:"topologyKey,omitempty"`
 }
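
Note: the comment in this hunk hinges on the difference between a nil and an empty `Namespaces` list, which is exactly the distinction `omitempty` erases. The following self-contained Go sketch (stand-in structs for illustration, not the real API types) shows how each tag variant marshals the two cases:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Stand-in structs (not the real Kubernetes API types) that isolate the
// Namespaces field under the two json tags discussed above.
type withOmitempty struct {
	Namespaces []string `json:"namespaces,omitempty"`
}

type withoutOmitempty struct {
	Namespaces []string `json:"namespaces"`
}

func main() {
	nilList := []string(nil) // proposal semantics: "this pod's namespace"
	emptyList := []string{}  // proposal semantics: "all namespaces"

	a, _ := json.Marshal(withoutOmitempty{Namespaces: nilList})
	b, _ := json.Marshal(withoutOmitempty{Namespaces: emptyList})
	fmt.Println(string(a)) // {"namespaces":null} -> nil is still visible
	fmt.Println(string(b)) // {"namespaces":[]}   -> empty is still visible

	c, _ := json.Marshal(withOmitempty{Namespaces: nilList})
	d, _ := json.Marshal(withOmitempty{Namespaces: emptyList})
	fmt.Println(string(c)) // {} -> field dropped
	fmt.Println(string(d)) // {} -> field dropped too; nil and empty become indistinguishable
}
```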
@@ -405,12 +405,27 @@ RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey`
 (i.e. exclusive access to a node), it could charge for the resources of the
 average or largest node in the cluster. Likewise if a pod expresses RequiredDuringScheduling
 anti-affinity for all pods using a "cluster" `TopologyKey`, it could charge for the resources of the
-entire cluster. If a cluster administrator wants to overcommit quota, for
+entire cluster. If node affinity is used to
+constrain the pod to a particular topology domain, then the admission-time quota
+charging should take that into account (e.g. not charge for the average/largest machine
+if the PodSpec constrains the pod to a specific machine with a known size; instead charge
+for the size of the actual machine that the pod was constrained to). In all cases
+once the pod is scheduled, the quota charge should be adjusted down to the
+actual amount of resources allocated (e.g. the size of the actual machine that was
+assigned, not the average/largest). If a cluster administrator wants to overcommit quota, for
 example to allow more than N pods across all users to request exclusive node
 access in a cluster with N nodes, then a priority/preemption scheme should be added
 so that the most important pods run when resource demand exceeds supply.
 
-Our initial implementation will use quota that charges based on opportunity cost.
+An alternative approach, which is a bit of a blunt hammer, is to use a
+capability mechanism to restrict use of RequiredDuringScheduling anti-affinity
+to trusted users. A more complex capability mechanism might only restrict it when
+using a non-"node" TopologyKey.
+
+Our initial implementation will use a variant of the capability approach, which
+requires no configuration: we will simply reject ALL requests, regardless of user,
+that specify "all namespaces" with non-"node" TopologyKey for RequiredDuringScheduling anti-affinity.
+This allows the "exclusive node" use case while prohibiting the more dangerous ones.
 
 A weaker variant of the problem described in the previous paragraph is a pod's ability to use anti-affinity to degrade
 the scheduling quality of another pod, but not completely block it from scheduling.
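
Note: to make the admission-time rule introduced in this hunk concrete, here is a minimal, hypothetical Go sketch of the check, not the actual admission controller; the function and constant names are invented for illustration. It treats a non-nil, empty `Namespaces` list as "all namespaces", per the struct comment shown earlier in this diff, and assumes `kubernetes.io/hostname` stands in for the "node" `TopologyKey`.

```go
package main

import "fmt"

// podAffinityTerm is a trimmed stand-in carrying only the fields the check
// needs; the real PodAffinityTerm type is defined earlier in the proposal.
type podAffinityTerm struct {
	// nil means "this pod's namespace"; empty (non-nil) means "all namespaces".
	Namespaces  []string
	TopologyKey string
}

// nodeTopologyKey is an assumed label key standing in for the "node" TopologyKey.
const nodeTopologyKey = "kubernetes.io/hostname"

// rejectDangerousAntiAffinity applies the rule described above to a pod's
// RequiredDuringScheduling anti-affinity terms: "all namespaces" is only
// allowed together with the node-level topology key.
func rejectDangerousAntiAffinity(requiredAntiAffinity []podAffinityTerm) error {
	for _, term := range requiredAntiAffinity {
		allNamespaces := term.Namespaces != nil && len(term.Namespaces) == 0
		if allNamespaces && term.TopologyKey != nodeTopologyKey {
			return fmt.Errorf("anti-affinity over all namespaces requires topology key %q, got %q",
				nodeTopologyKey, term.TopologyKey)
		}
	}
	return nil
}

func main() {
	// Allowed: the "exclusive node" use case.
	exclusiveNode := []podAffinityTerm{{Namespaces: []string{}, TopologyKey: nodeTopologyKey}}
	// Rejected: would let one pod block scheduling across an entire zone.
	exclusiveZone := []podAffinityTerm{{Namespaces: []string{}, TopologyKey: "failure-domain.beta.kubernetes.io/zone"}}

	fmt.Println(rejectDangerousAntiAffinity(exclusiveNode)) // <nil>
	fmt.Println(rejectDangerousAntiAffinity(exclusiveZone)) // error
}
```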
@@ -505,8 +520,8 @@ affinity).
 2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution`
 affinity and anti-affinity into account. Include a workaround for the issue described at the end of the Affinity section of the Examples section (can't schedule first pod).
 3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account
-4. Implement a quota mechanism that charges for the entire topology domain when `RequiredDuringScheduling` anti-affinity is used. Later
-this should be refined to only apply when it is used to request exclusive access, not when it is used to express conflict with specific pods.
+4. Implement an admission controller that rejects requests that specify "all namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` anti-affinity.
+This admission controller should be enabled by default.
 5. Implement the recommended solution to the "co-existing with daemons" issue
 6. At this point, the feature can be deployed.
 7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity and anti-affinity, and make sure
@@ -565,9 +580,9 @@ specified by adding an integer `Limit` to `PodAffinityTerm` just for the
 system and we do not intend to implement it.
 
 It is likely that the specification and implementation of pod anti-affinity
-can be unified with [taints and tolerations](https://github.com/kubernetes/kubernetes/pull/18263),
+can be unified with [taints and tolerations](taint-toleration-dedicated.md),
 and likewise that the specification and implementation of pod affinity
-can be unified with [node affinity](https://github.com/kubernetes/kubernetes/pull/18261).
+can be unified with [node affinity](nodeaffinity.md).
 The basic idea is that pod labels would be "inherited" by the node, and pods
 would only be able to specify affinity and anti-affinity for a node's labels.
 Our main motivation for not unifying taints and tolerations with