address issue #1488; clean up linewrap and some minor editing issues in the docs/design/* tree
Signed-off-by: mikebrow <brownwm@us.ibm.com>
@@ -38,45 +38,48 @@ Documentation for other releases can be found at
NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.

This document describes a proposal for specifying and implementing inter-pod
topological affinity and anti-affinity. By that we mean: rules that specify that
certain pods should be placed in the same topological domain (e.g. same node,
same rack, same zone, same power domain, etc.) as some other pods, or,
conversely, should *not* be placed in the same topological domain as some other
pods.

Here are a few example rules; we explain how to express them using the API
described in this doc later, in the section "Examples."

* Affinity
  * Co-locate the pods from a particular service or Job in the same availability
zone, without specifying which zone that should be.
  * Co-locate the pods from service S1 with pods from service S2 because S1 uses
S2 and thus it is useful to minimize the network latency between them.
Co-location might mean same nodes and/or same availability zone.
* Anti-affinity
  * Spread the pods of a service across nodes and/or availability zones, e.g. to
reduce correlated failures.
  * Give a pod "exclusive" access to a node to guarantee resource isolation --
it must never share the node with other pods.
  * Don't schedule the pods of a particular service on the same nodes as pods of
another service that are known to interfere with the performance of the pods of
the first service.

For both affinity and anti-affinity, there are three variants. Two variants have
the property of requiring the affinity/anti-affinity to be satisfied for the pod
to be allowed to schedule onto a node; the difference between them is that if
the condition ceases to be met later on at runtime, for one of them the system
will try to eventually evict the pod, while for the other the system may not try
to do so. The third variant simply provides scheduling-time *hints* that the
scheduler will try to satisfy but may not be able to. These three variants are
directly analogous to the three variants of [node affinity](nodeaffinity.md).

Note that this proposal is only about *inter-pod* topological affinity and
anti-affinity. There are other forms of topological affinity and anti-affinity.
For example, you can use [node affinity](nodeaffinity.md) to require (prefer)
that a set of pods all be scheduled in some specific zone Z. Node affinity is
not capable of expressing inter-pod dependencies, and conversely the API we
describe in this document is not capable of expressing node affinity rules. For
simplicity, we will use the terms "affinity" and "anti-affinity" to mean
"inter-pod topological affinity" and "inter-pod topological anti-affinity,"
respectively, in the remainder of this document.

## API

@@ -90,28 +93,28 @@ The `Affinity` type is defined as follows

```go
type Affinity struct {
  PodAffinity *PodAffinity `json:"podAffinity,omitempty"`
  PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"`
}

type PodAffinity struct {
  // If the affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system will try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
  // If the affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system may or may not try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
  // The scheduler will prefer to schedule pods to nodes that satisfy
  // the affinity expressions specified by this field, but it may choose
  // a node that violates one or more of the expressions. The node that is
  // most preferred is the one with the greatest sum of weights, i.e.
@@ -120,27 +123,27 @@ type PodAffinity struct {
  // compute a sum by iterating through the elements of this field and adding
  // "weight" to the sum if the node matches the corresponding MatchExpressions; the
  // node(s) with the highest sum are the most preferred.
  PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type PodAntiAffinity struct {
  // If the anti-affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the anti-affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system will try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
  // If the anti-affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the anti-affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system may or may not try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
  // The scheduler will prefer to schedule pods to nodes that satisfy
  // the anti-affinity expressions specified by this field, but it may choose
  // a node that violates one or more of the expressions. The node that is
  // most preferred is the one with the greatest sum of weights, i.e.
@@ -149,7 +152,7 @@ type PodAntiAffinity struct {
  // compute a sum by iterating through the elements of this field and adding
  // "weight" to the sum if the node matches the corresponding MatchExpressions; the
  // node(s) with the highest sum are the most preferred.
  PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type WeightedPodAffinityTerm struct {
@@ -159,23 +162,25 @@ type WeightedPodAffinityTerm struct {
}

type PodAffinityTerm struct {
  LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
  // namespaces specifies which namespaces the LabelSelector applies to (matches against);
  // nil list means "this pod's namespace," empty list means "all namespaces"
  // The json tag here is not "omitempty" since we need to distinguish nil and empty.
  // See https://golang.org/pkg/encoding/json/#Marshal for more details.
  Namespaces []api.Namespace `json:"namespaces,omitempty"`
  // empty topology key is interpreted by the scheduler as "all topologies"
  TopologyKey string `json:"topologyKey,omitempty"`
}
```

Note that the `Namespaces` field is necessary because normal `LabelSelector` is
scoped to the pod's namespace, but we need to be able to match against all pods
globally.

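For concreteness, here is a minimal, runnable Go sketch that mirrors the shape
of the types above. It uses simplified stand-in types, not the real Kubernetes
API; the label keys, service names, weight value, and the fields of the
stand-in `WeightedPodAffinityTerm` are illustrative assumptions for this sketch.

```go
package main

import "fmt"

// Simplified stand-ins for LabelSelector and the affinity types above.
type LabelSelector struct {
  MatchLabels map[string]string // key=value pairs a pod's labels must contain
}

type PodAffinityTerm struct {
  LabelSelector *LabelSelector
  Namespaces    []string // nil means "this pod's namespace"
  TopologyKey   string   // e.g. "node", "zone"
}

type WeightedPodAffinityTerm struct {
  Weight          int
  PodAffinityTerm PodAffinityTerm
}

type PodAffinity struct {
  RequiredDuringSchedulingIgnoredDuringExecution  []PodAffinityTerm
  PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
}

type Affinity struct {
  PodAffinity *PodAffinity
}

func main() {
  // "Run me on a node that already has a pod from service S1" (required), and
  // "prefer zones that already run pods from service S2" (preferred).
  aff := Affinity{
    PodAffinity: &PodAffinity{
      RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
        {LabelSelector: &LabelSelector{MatchLabels: map[string]string{"service": "S1"}}, TopologyKey: "node"},
      },
      PreferredDuringSchedulingIgnoredDuringExecution: []WeightedPodAffinityTerm{
        {Weight: 5, PodAffinityTerm: PodAffinityTerm{
          LabelSelector: &LabelSelector{MatchLabels: map[string]string{"service": "S2"}},
          TopologyKey:   "zone",
        }},
      },
    },
  }
  fmt.Printf("%+v\n", aff)
}
```
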
To explain how this API works, let's say that the `PodSpec` of a pod `P` has an
`Affinity` that is configured as follows (note that we've omitted and collapsed
some fields for simplicity, but this should sufficiently convey the intent of
the design):

```go
PodAffinity {
@@ -188,130 +193,160 @@ PodAntiAffinity {
}
```

Then when scheduling pod P, the scheduler:
* Can only schedule P onto nodes that are running pods that satisfy `P1`.
(Assumes all nodes have a label with key `node` and value specifying their node
name.)
* Should try to schedule P onto zones that are running pods that satisfy `P2`.
(Assumes all nodes have a label with key `zone` and value specifying their
zone.)
* Cannot schedule P onto any racks that are running pods that satisfy `P3`.
(Assumes all nodes have a label with key `rack` and value specifying their rack
name.)
* Should try not to schedule P onto any power domains that are running pods that
satisfy `P4`. (Assumes all nodes have a label with key `power` and value
specifying their power domain.)

When `RequiredDuringScheduling` has multiple elements, the requirements are
ANDed. For `PreferredDuringScheduling` the weights are added for the terms that
are satisfied for each node, and the node(s) with the highest weight(s) are the
most preferred.

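To make the preferred-term scoring concrete, here is a small, self-contained Go
sketch. It is illustrative only, not scheduler code; the node names, cluster
state, and weights are made-up assumptions, and term satisfaction is reduced to
"a pod of the named service already runs on the node."

```go
package main

import "fmt"

// A preferred term, reduced to a weight plus a predicate reporting whether the
// term is satisfied on a given node.
type weightedTerm struct {
  weight    int
  satisfied func(node string) bool
}

func main() {
  // Made-up cluster state: which services already run on which nodes.
  podsOnNode := map[string][]string{
    "node-a": {"S2"},
    "node-b": {"S1", "S2"},
    "node-c": {},
  }
  runs := func(service string) func(string) bool {
    return func(node string) bool {
      for _, s := range podsOnNode[node] {
        if s == service {
          return true
        }
      }
      return false
    }
  }

  // PreferredDuringScheduling terms for the pod being scheduled.
  terms := []weightedTerm{
    {weight: 5, satisfied: runs("S1")}, // prefer co-location with S1
    {weight: 2, satisfied: runs("S2")}, // prefer co-location with S2
  }

  // Score each node: add a term's weight iff the term is satisfied there.
  best, bestScore := "", -1
  for node := range podsOnNode {
    score := 0
    for _, t := range terms {
      if t.satisfied(node) {
        score += t.weight
      }
    }
    fmt.Printf("%s: score %d\n", node, score)
    if score > bestScore {
      best, bestScore = node, score
    }
  }
  fmt.Println("most preferred:", best)
}
```
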
In reality there are two variants of `RequiredDuringScheduling`: one suffixed
with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`.
For the first variant, if the affinity/anti-affinity ceases to be met at some
point during pod execution (e.g. due to a pod label update), the system will try
to eventually evict the pod from its node. In the second variant, the system may
or may not try to eventually evict the pod from its node.

## A comment on symmetry

One thing that makes affinity and anti-affinity tricky is symmetry.

Imagine a cluster that is running pods from two services, S1 and S2. Imagine
that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not
run me on nodes that are running pods from S2." It is not sufficient just to
check that there are no S2 pods on a node when you are scheduling a S1 pod. You
also need to ensure that there are no S1 pods on a node when you are scheduling
a S2 pod, *even though the S2 pod does not have any anti-affinity rules*.
Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's
RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving
S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling
anti-affinity rule, then:
* if a node is empty, you can schedule S1 or S2 onto the node
* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node

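A minimal Go sketch of this symmetric check (illustrative only: a pod is
modeled as a service label plus an optional "forbidden service"
RequiredDuringScheduling anti-affinity, and the whole node is the topology
domain). A node is feasible for a new pod only if the new pod's rule matches no
existing pod *and* no existing pod's rule matches the new pod.

```go
package main

import "fmt"

// A pod reduced to a service label plus an optional RequiredDuringScheduling
// anti-affinity rule ("do not run me with pods of this service").
type pod struct {
  service      string
  antiAffinity string // "" means no rule
}

// feasible reports whether newPod may be placed on a node already running the
// existing pods, honoring anti-affinity in both directions (symmetry).
func feasible(newPod pod, existing []pod) bool {
  for _, p := range existing {
    // newPod's own rule forbids co-location with p.
    if newPod.antiAffinity != "" && newPod.antiAffinity == p.service {
      return false
    }
    // p's rule forbids co-location with newPod, even though newPod itself
    // expressed no rule.
    if p.antiAffinity != "" && p.antiAffinity == newPod.service {
      return false
    }
  }
  return true
}

func main() {
  s1 := pod{service: "S1", antiAffinity: "S2"} // S1: "do not run me with S2"
  s2 := pod{service: "S2"}                     // S2 has no rule of its own

  fmt.Println(feasible(s1, nil))       // true: empty node
  fmt.Println(feasible(s2, nil))       // true: empty node
  fmt.Println(feasible(s2, []pod{s1})) // false: S1 already there, symmetry applies
  fmt.Println(feasible(s1, []pod{s2})) // false: S1's own rule applies
}
```
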
Note that while RequiredDuringScheduling anti-affinity is symmetric,
RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1
have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2," it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node. More specifically, if S1 has the
aforementioned RequiredDuringScheduling affinity rule, then:
* if a node is empty, you can schedule S2 onto the node
* if a node is empty, you cannot schedule S1 onto the node
* if a node is running S2, you can schedule S1 onto the node
* if a node is running S1+S2 and S1 terminates, S2 continues running
* if a node is running S1+S2 and S2 terminates, the system terminates S1
(eventually)

However, although RequiredDuringScheduling affinity is not symmetric, there is
an implicit PreferredDuringScheduling affinity rule corresponding to every
RequiredDuringScheduling affinity rule: if the pods of S1 have a
RequiredDuringScheduling affinity rule "run me on nodes that are running pods
from S2" then it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node, but it would be better if there are.

PreferredDuringScheduling is symmetric. If the pods of S1 had a
PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that
are running pods from S2" then we would prefer to keep a S1 pod that we are
scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that
we are scheduling off of nodes that are running S1 pods. Likewise if the pods of
S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that
are running pods from S2" then we would prefer to place a S1 pod that we are
scheduling onto a node that is running a S2 pod, and also to place a S2 pod that
we are scheduling onto a node that is running a S1 pod.

## Examples

Here are some examples of how you would express various affinity and
anti-affinity rules using the API we described.

### Affinity

In the examples below, the word "put" is intentionally ambiguous; the rules are
the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
put" (PreferredDuringScheduling)--all that changes is which field the rule goes
into. Also, we only discuss scheduling-time, and ignore the execution-time.
Finally, some of the examples use "zone" and some use "node," just to make the
examples more interesting; any of the examples with "zone" will also work for
"node" if you change the `TopologyKey`, and vice-versa.

* **Put the pod in zone Z**:
Tricked you! It is not possible to express this using the API described here.
For this you should use node affinity.

* **Put the pod in a zone that is running at least one pod from service S**:
`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`

* **Put the pod on a node that is already running a pod that requires a license
for software package P**: Assuming pods that require a license for software
package P have a label `{key=license, value=P}`:
`{LabelSelector: "license" In "P", TopologyKey: "node"}`

* **Put this pod in the same zone as other pods from its same service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`

This last example illustrates a small issue with this API when it is used with a
scheduler that processes the pending queue one pod at a time, like the current
Kubernetes scheduler. The RequiredDuringScheduling rule
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
only "works" once one pod from service S has been scheduled. But if all pods in
service S have this RequiredDuringScheduling rule in their PodSpec, then the
RequiredDuringScheduling rule will block the first pod of the service from ever
scheduling, since it is only allowed to run in a zone with another pod from the
same service. And of course that means none of the pods of the service will be
able to schedule. This problem *only* applies to RequiredDuringScheduling
affinity, not PreferredDuringScheduling affinity or any variant of
anti-affinity. There are at least three ways to solve this problem:
* **short-term**: have the scheduler use a rule that if the
RequiredDuringScheduling affinity requirement matches a pod's own labels, and
there are no other such pods anywhere, then disregard the requirement (see the
sketch after this list). This approach has a corner case when running parallel
schedulers that are allowed to schedule pods from the same replicated set (e.g.
a single PodTemplate): both schedulers may try to schedule pods from the set at
the same time and think there are no other pods from that set scheduled yet
(e.g. they are trying to schedule the first two pods from the set), but by the
time the second binding is committed, the first one has already been committed,
leaving you with two pods running that do not respect their
RequiredDuringScheduling affinity. There is no simple way to detect this
"conflict" at scheduling time given the current system implementation.
* **longer-term**: when a controller creates pods from a PodTemplate, for
exactly *one* of those pods, it should omit any RequiredDuringScheduling
affinity rules that select the pods of that PodTemplate.
* **very long-term/speculative**: controllers could present the scheduler with a
group of pods from the same PodTemplate as a single unit. This is similar to the
first approach described above but avoids the corner case. No special logic is
needed in the controllers. Moreover, this would allow the scheduler to do proper
[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since
it could receive an entire gang simultaneously as a single unit.

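A sketch of the short-term workaround from the first bullet above (illustrative
only, with label selectors reduced to exact-match maps): a required affinity
term is disregarded if it selects the pod's own labels and no other pod in the
cluster matches it.

```go
package main

import "fmt"

type labels map[string]string

// matches reports whether pod labels satisfy an exact-match selector.
func matches(sel, l labels) bool {
  for k, v := range sel {
    if l[k] != v {
      return false
    }
  }
  return true
}

// disregard implements the short-term rule: ignore a RequiredDuringScheduling
// affinity selector if it matches the pod's own labels and matches no other
// pod anywhere in the cluster (so the "first" pod of a set can schedule).
func disregard(selector, self labels, allOtherPods []labels) bool {
  if !matches(selector, self) {
    return false
  }
  for _, other := range allOtherPods {
    if matches(selector, other) {
      return false
    }
  }
  return true
}

func main() {
  selector := labels{"service": "S"}
  self := labels{"service": "S", "pod": "s-0"}

  // No other pods from service S exist yet: the rule is disregarded.
  fmt.Println(disregard(selector, self, nil)) // true

  // Once another S pod exists, the rule is enforced normally.
  fmt.Println(disregard(selector, self, []labels{{"service": "S", "pod": "s-1"}})) // false
}
```
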
### Anti-affinity

As with the affinity examples, the examples here can be RequiredDuringScheduling
or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as
"must not" or as "try not to" depending on whether the rule appears in
`RequiredDuringScheduling` or `PreferredDuringScheduling`.

* **Spread the pods of this service S across nodes and zones**:
`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"},
{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
(note that if this is specified as a RequiredDuringScheduling anti-affinity,
then the first clause is redundant, since the second clause will force the
scheduler to not put more than one pod from S in the same zone, and thus by
definition it will not put more than one pod from S on the same node, assuming
each node is in one zone. This rule is more useful as PreferredDuringScheduling
anti-affinity, e.g. one might expect it to be common in
[Ubernetes](../../docs/proposals/federation.md) clusters.)

* **Don't co-locate pods of this service with pods from service "evilService"**:
@@ -323,25 +358,29 @@ This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one mi

* **Don't co-locate pods of this service with any other pods except other pods of this service**:
Assuming pods from the service have some label `{key=service, value=S}`:
`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
Note that this works because `"service" NotIn "S"` matches pods with no key
"service" as well as pods with key "service" and a corresponding value that is
not "S."

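A tiny Go sketch of the `NotIn` matching behavior relied on above (illustrative,
not the real selector library): `NotIn` matches a pod whose labels either lack
the key entirely or carry a value outside the listed set.

```go
package main

import "fmt"

// notIn reports whether a pod's labels satisfy `key NotIn values`:
// true when the key is absent, or present with a value not in the set.
func notIn(podLabels map[string]string, key string, values ...string) bool {
  v, ok := podLabels[key]
  if !ok {
    return true // no such key at all
  }
  for _, forbidden := range values {
    if v == forbidden {
      return false
    }
  }
  return true
}

func main() {
  fmt.Println(notIn(map[string]string{"service": "S"}, "service", "S")) // false: same service
  fmt.Println(notIn(map[string]string{"service": "T"}, "service", "S")) // true: different service
  fmt.Println(notIn(map[string]string{"app": "batch"}, "service", "S")) // true: no "service" key
}
```
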
## Algorithm

An example algorithm a scheduler might use to implement affinity and
anti-affinity rules is as follows. There are certainly more efficient ways to
do it; this is just intended to demonstrate that the API's semantics are
implementable.

Terminology definition: We say a pod P is "feasible" on a node N if P meets all
of the scheduler predicates for scheduling P onto N. Note that this algorithm is
only concerned about scheduling time, thus it makes no distinction between
RequiredDuringExecution and IgnoredDuringExecution.

To make the algorithm slightly more readable, we use the term "HardPodAffinity"
as shorthand for "RequiredDuringScheduling pod affinity" and "SoftPodAffinity"
as shorthand for "PreferredDuringScheduling pod affinity." Analogously for
"HardPodAntiAffinity" and "SoftPodAntiAffinity."

** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity}
into account; currently it assumes all terms have weight 1. **

```
Z = the pod you are scheduling
```
@@ -389,74 +428,81 @@ foreach node A of {N}

## Special considerations for RequiredDuringScheduling anti-affinity

In this section we discuss three issues with RequiredDuringScheduling
anti-affinity: Denial of Service (DoS), co-existing with daemons, and
determining which pod(s) to kill. See issue #18265 for additional discussion of
these topics.

### Denial of Service

Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity
can intentionally or unintentionally cause various problems for other pods, due
to the symmetry property of anti-affinity.

The most notable danger is the ability for a pod that arrives first to some
topology domain, to block all other pods from scheduling there by stating a
conflict with all other pods. The standard approach to preventing resource
hogging is quota, but simple resource quota cannot prevent this scenario because
the pod may request very little resources. Addressing this using quota requires
a quota scheme that charges based on "opportunity cost" rather than based simply
on requested resources. For example, when handling a pod that expresses
RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey`
(i.e. exclusive access to a node), it could charge for the resources of the
average or largest node in the cluster. Likewise if a pod expresses
RequiredDuringScheduling anti-affinity for all pods using a "cluster"
`TopologyKey`, it could charge for the resources of the entire cluster. If node
affinity is used to constrain the pod to a particular topology domain, then the
admission-time quota charging should take that into account (e.g. not charge for
the average/largest machine if the PodSpec constrains the pod to a specific
machine with a known size; instead charge for the size of the actual machine
that the pod was constrained to). In all cases once the pod is scheduled, the
quota charge should be adjusted down to the actual amount of resources allocated
(e.g. the size of the actual machine that was assigned, not the
average/largest). If a cluster administrator wants to overcommit quota, for
example to allow more than N pods across all users to request exclusive node
access in a cluster with N nodes, then a priority/preemption scheme should be
added so that the most important pods run when resource demand exceeds supply.

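A rough Go sketch of the opportunity-cost charging idea (illustrative only; the
resource numbers, the single "CPU" dimension, and the choice of charging for
the largest node are assumptions, not a specified mechanism):

```go
package main

import "fmt"

// cluster holds node capacities in an imaginary single "CPU" dimension.
type cluster struct {
  nodeCPUs []int
}

func (c cluster) largestNode() int {
  max := 0
  for _, n := range c.nodeCPUs {
    if n > max {
      max = n
    }
  }
  return max
}

func (c cluster) totalCPUs() int {
  sum := 0
  for _, n := range c.nodeCPUs {
    sum += n
  }
  return sum
}

// admissionCharge returns the quota to charge at admission time for a pod that
// expresses RequiredDuringScheduling anti-affinity against all pods at the
// given topology level, rather than charging only its (possibly tiny) request.
func admissionCharge(c cluster, requestCPU int, topologyKey string) int {
  switch topologyKey {
  case "node": // exclusive access to a node: charge a whole (largest) node
    return c.largestNode()
  case "cluster": // exclusive access to the cluster: charge everything
    return c.totalCPUs()
  default: // no blanket anti-affinity: charge what was requested
    return requestCPU
  }
}

func main() {
  c := cluster{nodeCPUs: []int{16, 32, 64}}
  fmt.Println(admissionCharge(c, 1, "node"))    // 64: charged for the largest node
  fmt.Println(admissionCharge(c, 1, "cluster")) // 112: charged for the whole cluster
  fmt.Println(admissionCharge(c, 1, ""))        // 1: ordinary request
}
```
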
An alternative approach, which is a bit of a blunt hammer, is to use a
capability mechanism to restrict use of RequiredDuringScheduling anti-affinity
to trusted users. A more complex capability mechanism might only restrict it
when using a non-"node" TopologyKey.

Our initial implementation will use a variant of the capability approach, which
requires no configuration: we will simply reject ALL requests, regardless of
user, that specify "all namespaces" with non-"node" TopologyKey for
RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use
case while prohibiting the more dangerous ones.

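A sketch of that admission check (illustrative; the types are simplified
stand-ins, and as in the API above, an empty `Namespaces` list is taken to mean
"all namespaces" while nil means "this pod's namespace"):

```go
package main

import (
  "errors"
  "fmt"
)

// Simplified stand-in for the required anti-affinity portion of a pod spec.
type podAffinityTerm struct {
  Namespaces  []string // nil = this pod's namespace, empty non-nil = all namespaces
  TopologyKey string
}

type podAntiAffinity struct {
  RequiredDuringScheduling []podAffinityTerm
}

// validate rejects any required anti-affinity term that selects "all
// namespaces" with a TopologyKey other than "node", regardless of user.
func validate(anti podAntiAffinity) error {
  for _, term := range anti.RequiredDuringScheduling {
    allNamespaces := term.Namespaces != nil && len(term.Namespaces) == 0
    if allNamespaces && term.TopologyKey != "node" {
      return errors.New("required anti-affinity with all namespaces is only allowed with TopologyKey \"node\"")
    }
  }
  return nil
}

func main() {
  // Allowed: the "exclusive node" use case.
  fmt.Println(validate(podAntiAffinity{RequiredDuringScheduling: []podAffinityTerm{
    {Namespaces: []string{}, TopologyKey: "node"},
  }}))
  // Rejected: all namespaces with a broader topology.
  fmt.Println(validate(podAntiAffinity{RequiredDuringScheduling: []podAffinityTerm{
    {Namespaces: []string{}, TopologyKey: "zone"},
  }}))
}
```
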
A weaker variant of the problem described in the previous paragraph is a pod's
ability to use anti-affinity to degrade the scheduling quality of another pod,
but not completely block it from scheduling. For example, a set of pods S1 could
use node affinity to request to schedule onto a set of nodes that some other set
of pods S2 prefers to schedule onto. If the pods in S1 have
RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for
S2, then due to the symmetry property of anti-affinity, they can prevent the
pods in S2 from scheduling onto their preferred nodes if they arrive first (for
sure in the RequiredDuringScheduling case, and with some probability that
depends on the weighting scheme for the PreferredDuringScheduling case). A very
sophisticated priority and/or quota scheme could mitigate this, or alternatively
we could eliminate the symmetry property of the implementation of
PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling
anti-affinity could affect scheduling quality of another pod, and as we
described in the previous paragraph, such pods could be charged quota for the
full topology domain, thereby reducing the potential for abuse.

We won't try to address this issue in our initial implementation; we can
consider one of the approaches mentioned above if it turns out to be a problem
in practice.

### Co-existing with daemons

A cluster administrator may wish to allow pods that express anti-affinity
against all pods, to nonetheless co-exist with system daemon pods, such as those
run by DaemonSet. In principle, we would like the specification for
RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
more other pods (see #18263 for a more detailed explanation of the toleration
concept). There are at least two ways to accomplish this:

* Scheduler special-cases the namespace(s) where daemons live, in the
sense that it ignores pods in those namespaces when it is
@@ -478,147 +524,168 @@ Our initial implementation will use the first approach.

### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)

Because anti-affinity is symmetric, in the case of
RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must
determine which pod(s) to kill when a pod's labels are updated in such a way as
to cause them to conflict with one or more other pods'
RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the
absence of a priority/preemption scheme, our rule will be that the pod with the
anti-affinity rule that becomes violated should be the one killed. A pod should
only specify constraints that apply to namespaces it trusts to not do malicious
things. Once we have priority/preemption, we can change the rule to say that the
lowest-priority pod(s) are killed until all
RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.

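A small Go sketch of that rule (illustrative; pods are reduced to a label map
plus an exact-match required-during-execution anti-affinity selector, and all
the pods shown share one topology domain): after a label update, the pods to
evict are exactly those whose own anti-affinity rule now matches the updated
pod.

```go
package main

import "fmt"

type pod struct {
  name         string
  labels       map[string]string
  antiAffinity map[string]string // RequiredDuringSchedulingRequiredDuringExecution selector; nil = none
}

// matches reports whether labels satisfy a non-empty exact-match selector.
func matches(sel, labels map[string]string) bool {
  for k, v := range sel {
    if labels[k] != v {
      return false
    }
  }
  return len(sel) > 0
}

// victims returns the pods whose anti-affinity rule became violated by the
// updated pod's new labels; under the stated rule, these are the pods killed.
func victims(updated pod, domainPods []pod) []string {
  var out []string
  for _, p := range domainPods {
    if p.name != updated.name && matches(p.antiAffinity, updated.labels) {
      out = append(out, p.name)
    }
  }
  return out
}

func main() {
  a := pod{name: "a", labels: map[string]string{"service": "S2"},
    antiAffinity: map[string]string{"noisy": "true"}} // "don't run me with noisy pods"
  b := pod{name: "b", labels: map[string]string{"service": "S3"}}

  // Pod c's labels were just updated to include noisy=true.
  c := pod{name: "c", labels: map[string]string{"service": "S1", "noisy": "true"}}

  fmt.Println(victims(c, []pod{a, b, c})) // [a]: a's rule is now violated, so a is killed
}
```
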
## Special considerations for RequiredDuringScheduling affinity

The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its
symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with
conflicting pods, and pods that conflict with P cannot schedule onto the node
once P has been scheduled there. The design we have described says that the
symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P
says it can only schedule onto nodes running pod Q, this does not mean Q can
only run on a node that is running P, but the scheduler will try to schedule Q
onto a node that is running P (i.e. treats the reverse direction as preferred).
This raises the same scheduling quality concern as we mentioned at the end of
the Denial of Service section above, and can be addressed in similar ways.

The nature of affinity (as opposed to anti-affinity) means that there is no
issue of determining which pod(s) to kill when a pod's labels change: it is
obviously the pod with the affinity rule that becomes violated that must be
killed. (Killing a pod never "fixes" violation of an affinity rule; it can only
"fix" violation of an anti-affinity rule.) However, affinity does have a
different question related to killing: how long should the system wait before
declaring that RequiredDuringSchedulingRequiredDuringExecution affinity is no
longer met at runtime? For example, if a pod P has such an affinity for a pod Q
and pod Q is temporarily killed so that it can be updated to a new binary
version, should that trigger killing of P? More generally, how long should the
system wait before declaring that P's affinity is violated? (Of course affinity
is expressed in terms of label selectors, not for a specific pod, but the
scenario is easier to describe using a concrete pod.) This is closely related to
the concept of forgiveness (see issue #1574). In theory we could make this time
duration configurable by the user on a per-pod basis, but for the first version
of this feature we will make it a configurable property of whichever component
does the killing and that applies across all pods using the feature. Making it
configurable by the user would require a nontrivial change to the API syntax
(since the field would only apply to
RequiredDuringSchedulingRequiredDuringExecution affinity).

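A minimal sketch of a component-level grace period of the kind described
(illustrative; the duration, the state tracking, and the function names are
assumptions): the affinity is declared violated, and the pod evicted, only once
it has been continuously unmet for longer than the configured window.

```go
package main

import (
  "fmt"
  "time"
)

// violationTracker remembers when a pod's RequiredDuringExecution affinity
// first became unmet and evicts only after a configurable grace period, so a
// brief restart of the peer pod (e.g. a rolling update) does not kill the pod.
type violationTracker struct {
  gracePeriod time.Duration
  unmetSince  map[string]time.Time // pod name -> first time affinity was unmet
}

func (t *violationTracker) shouldEvict(pod string, affinityMet bool, now time.Time) bool {
  if affinityMet {
    delete(t.unmetSince, pod) // violation cleared; reset the timer
    return false
  }
  since, ok := t.unmetSince[pod]
  if !ok {
    t.unmetSince[pod] = now
    return false
  }
  return now.Sub(since) > t.gracePeriod
}

func main() {
  t := &violationTracker{gracePeriod: 5 * time.Minute, unmetSince: map[string]time.Time{}}
  start := time.Now()

  fmt.Println(t.shouldEvict("P", false, start))                     // false: just became unmet
  fmt.Println(t.shouldEvict("P", false, start.Add(2*time.Minute)))  // false: within grace period
  fmt.Println(t.shouldEvict("P", true, start.Add(3*time.Minute)))   // false: Q came back, timer resets
  fmt.Println(t.shouldEvict("P", false, start.Add(4*time.Minute)))  // false: unmet again, timer restarts
  fmt.Println(t.shouldEvict("P", false, start.Add(10*time.Minute))) // true: unmet longer than grace period
}
```
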
## Implementation plan
|
||||
|
||||
1. Add the `Affinity` field to PodSpec and the `PodAffinity` and `PodAntiAffinity` types to the API along with all of their descendant types.
|
||||
2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution`
|
||||
affinity and anti-affinity into account. Include a workaround for the issue described at the end of the Affinity section of the Examples section (can't schedule first pod).
|
||||
3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account
|
||||
4. Implement admission controller that rejects requests that specify "all namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` anti-affinity.
|
||||
This admission controller should be enabled by default.
|
||||
1. Add the `Affinity` field to PodSpec and the `PodAffinity` and
|
||||
`PodAntiAffinity` types to the API along with all of their descendant types.
|
||||
2. Implement a scheduler predicate that takes
|
||||
`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into
|
||||
account. Include a workaround for the issue described at the end of the Affinity
|
||||
section of the Examples section (can't schedule first pod).
|
||||
3. Implement a scheduler priority function that takes
|
||||
`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity
|
||||
into account.
|
||||
4. Implement admission controller that rejects requests that specify "all
|
||||
namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling`
|
||||
anti-affinity. This admission controller should be enabled by default.
|
||||
5. Implement the recommended solution to the "co-existing with daemons" issue
|
||||
6. At this point, the feature can be deployed.
|
||||
7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity and anti-affinity, and make sure
|
||||
the pieces of the system already implemented for `RequiredDuringSchedulingIgnoredDuringExecution` also take
|
||||
`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the scheduler predicate, the quota mechanism,
|
||||
the "co-existing with daemons" solution).
|
||||
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node" `TopologyKey` to Kubelet's admission decision
|
||||
9. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
|
||||
`RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet then only for "node" `TopologyKey`;
|
||||
if controller then potentially for all `TopologyKeys`'s.
|
||||
(see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
|
||||
7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity
and anti-affinity, and make sure the pieces of the system already implemented
for `RequiredDuringSchedulingIgnoredDuringExecution` also take
`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the
scheduler predicate, the quota mechanism, the "co-existing with daemons"
solution).
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node"
`TopologyKey` to Kubelet's admission decision.
9. Implement code in Kubelet *or* the controllers that evicts a pod that no
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet,
then only for "node" `TopologyKey`; if controller, then potentially for all
`TopologyKey`s (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
Do so in a way that addresses the "determining which pod(s) to kill" issue.
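
To make the predicate in step 2 above concrete, the following is a minimal
sketch of the check it performs, written against simplified stand-in types
rather than the real API objects: label selectors are reduced to exact-match
maps, namespaces and the symmetry rules are ignored, and the "can't schedule
first pod" workaround is omitted. It illustrates the idea only; it is not the
actual scheduler code.

```go
package podaffinity

// Simplified stand-in types; the real proposal defines richer API objects.
type PodAffinityTerm struct {
	MatchLabels map[string]string // stand-in for a full LabelSelector
	TopologyKey string            // e.g. a hostname or zone label key
}

type Pod struct {
	Labels            map[string]string
	NodeName          string
	AffinityTerms     []PodAffinityTerm // RequiredDuringSchedulingIgnoredDuringExecution affinity
	AntiAffinityTerms []PodAffinityTerm // RequiredDuringSchedulingIgnoredDuringExecution anti-affinity
}

type Node struct {
	Name   string
	Labels map[string]string
}

// matches reports whether labels satisfy every key/value pair in sel.
func matches(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// sameDomain reports whether two nodes share a value for the topology key.
func sameDomain(a, b *Node, key string) bool {
	av, aok := a.Labels[key]
	bv, bok := b.Labels[key]
	return aok && bok && av == bv
}

// fitsOnNode is a simplified predicate: every affinity term must be satisfied
// by at least one existing pod in the candidate node's topology domain, and no
// anti-affinity term may be satisfied by any existing pod in that domain.
func fitsOnNode(pod *Pod, candidate *Node, allPods []*Pod, nodes map[string]*Node) bool {
	satisfied := func(term PodAffinityTerm) bool {
		for _, other := range allPods {
			n := nodes[other.NodeName]
			if n != nil && matches(term.MatchLabels, other.Labels) && sameDomain(candidate, n, term.TopologyKey) {
				return true
			}
		}
		return false
	}
	for _, term := range pod.AffinityTerms {
		if !satisfied(term) {
			return false // no co-located pod matches the required affinity term
		}
	}
	for _, term := range pod.AntiAffinityTerms {
		if satisfied(term) {
			return false // a matching pod is already in this topology domain
		}
	}
	return true
}
```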
We assume Kubelet publishes labels describing the node's membership in all of
the relevant scheduling domains (e.g. node name, rack name, availability zone
name, etc.). See #9044.
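
As an illustration of what such labels might look like, the sketch below shows
a label set a Kubelet could publish for one node. The hostname, zone, and
region keys shown are the conventional ones, but treat the exact keys, and the
rack label in particular, as assumptions rather than part of this proposal.

```go
package podaffinity

// exampleNodeLabels returns labels a Kubelet might publish for its node, one
// label per scheduling domain. The exact keys are illustrative assumptions;
// "example.com/rack" in particular is hypothetical.
func exampleNodeLabels() map[string]string {
	return map[string]string{
		"kubernetes.io/hostname":                   "node-17",       // node-level domain
		"failure-domain.beta.kubernetes.io/zone":   "us-central1-a", // zone-level domain
		"failure-domain.beta.kubernetes.io/region": "us-central1",   // region-level domain
		"example.com/rack":                         "rack-3",        // hypothetical rack label
	}
}
```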
## Backward compatibility

Old versions of the scheduler will ignore `Affinity`.

Users should not start using `Affinity` until the full implementation has been
in Kubelet and the master for enough binary versions that we feel comfortable
that we will not need to roll back either Kubelet or master to a version that
does not support it. Longer-term we will use a programmatic approach to
enforcing this (#4855).

## Extensibility
The design described here is the result of careful analysis of use cases, a
decade of experience with Borg at Google, and a review of similar features in
other open-source container orchestration systems. We believe that it properly
balances the goal of expressiveness against the goals of simplicity and
efficiency of implementation. However, we recognize that use cases may arise in
the future that cannot be expressed using the syntax described here. Although we
are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
for Kubernetes users to get a consistent experience, etc.), the regular
Kubernetes annotation mechanism can be used to add or replace affinity rules.
The way this would work is:
1. Define one or more annotations to describe the new affinity rule(s).
1. User (or an admission controller) attaches the annotation(s) to pods to
request the desired scheduling behavior (a hypothetical example appears after
this list). If the new rule(s) *replace* one or more fields of `Affinity` then
the user would omit those fields from `Affinity`; if they are *additional
rules*, then the user would fill in `Affinity` as well as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
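
As a purely hypothetical illustration of this flow, the sketch below marshals a
made-up "spread limit" rule into a pod annotation under an invented key.
Neither the annotation key nor the rule schema is part of this proposal; a
custom scheduler or scheduler extension would have to understand them.

```go
package podaffinity

import "encoding/json"

// customSpreadRule is a hypothetical extension rule, used here only to
// illustrate carrying extra affinity semantics in an annotation.
type customSpreadRule struct {
	TopologyKey  string `json:"topologyKey"`
	MaxPerDomain int    `json:"maxPerDomain"`
}

// annotateWithCustomRule attaches the rule to a pod's annotation map under a
// made-up key; a custom scheduler (or scheduler extension) would parse it.
func annotateWithCustomRule(annotations map[string]string, rule customSpreadRule) error {
	raw, err := json.Marshal(rule)
	if err != nil {
		return err
	}
	// Hypothetical annotation key: not a real Kubernetes annotation.
	annotations["example.com/custom-pod-affinity"] = string(raw)
	return nil
}
```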
If some particular new syntax becomes popular, we would consider upstreaming it
by integrating it into the standard `Affinity`.

## Future work and non-work
One can imagine that in the anti-affinity RequiredDuringScheduling case one
might want to associate a number with the rule, for example "do not allow this
pod to share a rack with more than three other pods (in total, or from the same
service as the pod)." We could allow this to be specified by adding an integer
`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case.
However, this flexibility complicates the system and we do not intend to
implement it.
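
Purely to make this rejected alternative concrete, such a field might have
looked like the sketch below. The surrounding fields are simplified stand-ins
for the ones `PodAffinityTerm` already defines, and the sketch is shown for
illustration only, since we do not adopt it.

```go
package podaffinity

// LimitedPodAffinityTerm is a sketch of the rejected alternative only; it is
// not part of the proposal. LabelSelector, Namespaces, and TopologyKey stand
// in for the fields PodAffinityTerm already has; Limit is the hypothetical
// addition.
type LimitedPodAffinityTerm struct {
	LabelSelector map[string]string // simplified stand-in for the real selector type
	Namespaces    []string
	TopologyKey   string

	// Limit would only be meaningful for RequiredDuringScheduling anti-affinity:
	// "do not share a topology domain with more than Limit matching pods."
	Limit int
}
```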
It is likely that the specification and implementation of pod anti-affinity
can be unified with [taints and tolerations](taint-toleration-dedicated.md),
and likewise that the specification and implementation of pod affinity
can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod
labels would be "inherited" by the node, and pods would only be able to specify
affinity and anti-affinity for a node's labels. Our main motivation for not
unifying taints and tolerations with pod anti-affinity is that we foresee taints
and tolerations as being a concept that only cluster administrators need to
understand (and indeed in some setups taints and tolerations wouldn't even be
directly manipulated by a cluster administrator; instead, they would be set only
by an admission controller implementing the administrator's high-level policy
about different classes of special machines and the users who belong to the
groups allowed to access them). Moreover, the concept of nodes "inheriting"
labels from pods seems complicated; it seems conceptually simpler to separate
rules involving relatively static properties of nodes from rules involving which
other pods are running on the same node or larger topology domain.
Data/storage affinity is related to pod affinity, and is likely to draw on some
of the ideas we have used for pod affinity. Today, data/storage affinity is
expressed using node affinity, on the assumption that the pod knows which
node(s) store(s) the data it wants. But a more flexible approach would allow the
pod to name the data rather than the node.

## Related issues

The review for this proposal is in #18265.

The topic of affinity/anti-affinity has generated a lot of discussion. The main
issue is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341,
#1965, and #2906 all have additional discussion and use cases.

As the examples in this document have demonstrated, topological affinity is very
useful in clusters that are spread across availability zones, e.g. to co-locate
pods of a service in the same zone to avoid a wide-area network hop, or to
spread pods across zones for failure tolerance. #17059, #13056, #13063, and
#4235 are relevant.

Issue #15675 describes connection affinity, which is vaguely related.