address issue #1488; clean up linewrap and some minor editing issues in the docs/design/* tree

Signed-off-by: mikebrow <brownwm@us.ibm.com>
mikebrow
2016-04-13 19:55:22 -05:00
parent 4638f2f355
commit 6bdc0bfdb7
39 changed files with 3744 additions and 2375 deletions


@@ -38,45 +38,48 @@ Documentation for other releases can be found at
NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.
This document describes a proposal for specifying and implementing inter-pod
topological affinity and anti-affinity. By that we mean: rules that specify that
certain pods should be placed in the same topological domain (e.g. same node,
same rack, same zone, same power domain, etc.) as some other pods, or,
conversely, should *not* be placed in the same topological domain as some other
pods.
Here are a few example rules; we explain how to express them using the API
described in this doc later, in the section "Examples."
* Affinity
* Co-locate the pods from a particular service or Job in the same availability
zone, without specifying which zone that should be.
* Co-locate the pods from service S1 with pods from service S2 because S1 uses
S2 and thus it is useful to minimize the network latency between them.
Co-location might mean same nodes and/or same availability zone.
* Anti-affinity
* Spread the pods of a service across nodes and/or availability zones, e.g. to
reduce correlated failures.
* Give a pod "exclusive" access to a node to guarantee resource isolation --
it must never share the node with other pods.
* Don't schedule the pods of a particular service on the same nodes as pods of
another service that are known to interfere with the performance of the pods of
the first service.
For both affinity and anti-affinity, there are three variants. Two variants have
the property of requiring the affinity/anti-affinity to be satisfied for the pod
to be allowed to schedule onto a node; the difference between them is that if
the condition ceases to be met later on at runtime, for one of them the system
will try to eventually evict the pod, while for the other the system may not try
to do so. The third variant simply provides scheduling-time *hints* that the
scheduler will try to satisfy but may not be able to. These three variants are
directly analogous to the three variants of [node affinity](nodeaffinity.md).
Note that this proposal is only about *inter-pod* topological affinity and
anti-affinity. There are other forms of topological affinity and anti-affinity.
For example, you can use [node affinity](nodeaffinity.md) to require (prefer)
that a set of pods all be scheduled in some specific zone Z. Node affinity is
not capable of expressing inter-pod dependencies, and conversely the API we
describe in this document is not capable of expressing node affinity rules. For
simplicity, we will use the terms "affinity" and "anti-affinity" to mean
"inter-pod topological affinity" and "inter-pod topological anti-affinity,"
respectively, in the remainder of this document.
## API
@@ -90,28 +93,28 @@ The `Affinity` type is defined as follows
```go
type Affinity struct {
PodAffinity *PodAffinity `json:"podAffinity,omitempty"`
PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"`
}
type PodAffinity struct {
// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system will try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
// If the affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system may or may not try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
// The scheduler will prefer to schedule pods to nodes that satisfy
// the affinity expressions specified by this field, but it may choose
// a node that violates one or more of the expressions. The node that is
// most preferred is the one with the greatest sum of weights, i.e.
@@ -120,27 +123,27 @@ type PodAffinity struct {
// compute a sum by iterating through the elements of this field and adding
// "weight" to the sum if the node matches the corresponding MatchExpressions; the
// node(s) with the highest sum are the most preferred.
PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}
type PodAntiAffinity struct {
// If the anti-affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the anti-affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system will try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
// If the anti-affinity requirements specified by this field are not met at
// scheduling time, the pod will not be scheduled onto the node.
// If the anti-affinity requirements specified by this field cease to be met
// at some point during pod execution (e.g. due to a pod label update), the
// system may or may not try to eventually evict the pod from its node.
// When there are multiple elements, the lists of nodes corresponding to each
// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
// The scheduler will prefer to schedule pods to nodes that satisfy
// the anti-affinity expressions specified by this field, but it may choose
// a node that violates one or more of the expressions. The node that is
// most preferred is the one with the greatest sum of weights, i.e.
@@ -149,7 +152,7 @@ type PodAntiAffinity struct {
// compute a sum by iterating through the elements of this field and adding
// "weight" to the sum if the node matches the corresponding MatchExpressions; the
// node(s) with the highest sum are the most preferred.
PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}
type WeightedPodAffinityTerm struct {
@@ -159,23 +162,25 @@ type WeightedPodAffinityTerm struct {
}
type PodAffinityTerm struct {
LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
// namespaces specifies which namespaces the LabelSelector applies to (matches against);
// nil list means "this pod's namespace," empty list means "all namespaces"
// The json tag here is not "omitempty" since we need to distinguish nil and empty.
// See https://golang.org/pkg/encoding/json/#Marshal for more details.
Namespaces []api.Namespace `json:"namespaces,omitempty"`
// empty topology key is interpreted by the scheduler as "all topologies"
TopologyKey string `json:"topologyKey,omitempty"`
}
```
Note that the `Namespaces` field is necessary because normal `LabelSelector` is
scoped to the pod's namespace, but we need to be able to match against all pods
globally.
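To make the nil-versus-empty distinction concrete, here is a minimal sketch using simplified stand-ins for the types above (a string shorthand instead of a real `LabelSelector`, plain strings instead of `api.Namespace`); it illustrates the three cases rather than the exact API:

```go
package main

import "fmt"

// Simplified stand-in for the PodAffinityTerm type defined above.
type PodAffinityTerm struct {
	LabelSelector string   // shorthand for a selector such as `"service" In "S"`
	Namespaces    []string // nil = this pod's namespace, empty = all namespaces
	TopologyKey   string
}

func main() {
	// nil Namespaces: match only pods in the scheduled pod's own namespace.
	sameNamespace := PodAffinityTerm{LabelSelector: `"service" In "S"`, Namespaces: nil, TopologyKey: "zone"}

	// Empty (non-nil) Namespaces: match pods in all namespaces.
	allNamespaces := PodAffinityTerm{LabelSelector: `"service" In "S"`, Namespaces: []string{}, TopologyKey: "zone"}

	// Explicit list: match pods only in the named namespaces.
	someNamespaces := PodAffinityTerm{LabelSelector: `"service" In "S"`, Namespaces: []string{"prod", "staging"}, TopologyKey: "zone"}

	fmt.Println(sameNamespace, allNamespaces, someNamespaces)
}
```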
To explain how this API works, let's say that the `PodSpec` of a pod `P` has an
`Affinity` that is configured as follows (note that we've omitted and collapsed
some fields for simplicity, but this should sufficiently convey the intent of
the design):
```go
PodAffinity {
@@ -188,130 +193,160 @@ PodAntiAffinity {
}
```
Then when scheduling pod P, the scheduler:
* Can only schedule P onto nodes that are running pods that satisfy `P1`.
(Assumes all nodes have a label with key `node` and value specifying their node
name.)
* Should try to schedule P onto zones that are running pods that satisfy `P2`.
(Assumes all nodes have a label with key `zone` and value specifying their
zone.)
* Cannot schedule P onto any racks that are running pods that satisfy `P3`.
(Assumes all nodes have a label with key `rack` and value specifying their rack
name.)
* Should try not to schedule P onto any power domains that are running pods that
satisfy `P4`. (Assumes all nodes have a label with key `power` and value
specifying their power domain.)
When `RequiredDuringScheduling` has multiple elements, the requirements are
ANDed. For `PreferredDuringScheduling` the weights are added for the terms that
are satisfied for each node, and the node(s) with the highest weight(s) are the
most preferred.
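As an illustration (not the literal example elided above by the diff), here is a hedged sketch of how such an `Affinity` might be populated, using simplified stand-ins for the types from the API section. The `P1`..`P4` selectors are placeholders, and the `node`/`zone`/`rack`/`power` keys are the hypothetical node labels assumed in the bullets above:

```go
package main

import "fmt"

// Simplified stand-ins for the API types defined earlier in this document.
type PodAffinityTerm struct {
	LabelSelector string // shorthand; the real field is a *LabelSelector
	TopologyKey   string
}

type WeightedPodAffinityTerm struct {
	Weight          int
	PodAffinityTerm PodAffinityTerm
}

type PodAffinity struct {
	RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm
	PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
}

type PodAntiAffinity struct {
	RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm
	PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
}

type Affinity struct {
	PodAffinity     *PodAffinity
	PodAntiAffinity *PodAntiAffinity
}

func main() {
	affinity := Affinity{
		PodAffinity: &PodAffinity{
			// Multiple required terms are ANDed: the node must satisfy every term.
			RequiredDuringSchedulingRequiredDuringExecution: []PodAffinityTerm{
				{LabelSelector: "P1", TopologyKey: "node"},
			},
			// Preferred terms contribute their weight when satisfied; the node(s)
			// with the highest total weight are the most preferred.
			PreferredDuringSchedulingIgnoredDuringExecution: []WeightedPodAffinityTerm{
				{Weight: 10, PodAffinityTerm: PodAffinityTerm{LabelSelector: "P2", TopologyKey: "zone"}},
			},
		},
		PodAntiAffinity: &PodAntiAffinity{
			RequiredDuringSchedulingRequiredDuringExecution: []PodAffinityTerm{
				{LabelSelector: "P3", TopologyKey: "rack"},
			},
			PreferredDuringSchedulingIgnoredDuringExecution: []WeightedPodAffinityTerm{
				{Weight: 1, PodAffinityTerm: PodAffinityTerm{LabelSelector: "P4", TopologyKey: "power"}},
			},
		},
	}
	fmt.Printf("%+v\n", affinity)
}
```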
In reality there are two variants of `RequiredDuringScheduling`: one suffixed
with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`.
For the first variant, if the affinity/anti-affinity ceases to be met at some
point during pod execution (e.g. due to a pod label update), the system will try
to eventually evict the pod from its node. In the second variant, the system may
or may not try to eventually evict the pod from its node.
## A comment on symmetry
One thing that makes affinity and anti-affinity tricky is symmetry.
Imagine a cluster that is running pods from two services, S1 and S2. Imagine
that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not
run me on nodes that are running pods from S2." It is not sufficient just to
check that there are no S2 pods on a node when you are scheduling a S1 pod. You
also need to ensure that there are no S1 pods on a node when you are scheduling
a S2 pod, *even though the S2 pod does not have any anti-affinity rules*.
Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's
RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving
S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling
anti-affinity rule, then:
* if a node is empty, you can schedule S1 or S2 onto the node
* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node
Note that while RequiredDuringScheduling anti-affinity is symmetric,
RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1
have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2," it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node. More specifically, if S1 has the
aforementioned RequiredDuringScheduling affinity rule, then:
* if a node is empty, you can schedule S2 onto the node
* if a node is empty, you cannot schedule S1 onto the node
* if a node is running S2, you can schedule S1 onto the node
* if a node is running S1+S2 and S1 terminates, S2 continues running
* if a node is running S1+S2 and S2 terminates, the system terminates S1
(eventually)
However, although RequiredDuringScheduling affinity is not symmetric, there is
an implicit PreferredDuringScheduling affinity rule corresponding to every
RequiredDuringScheduling affinity rule: if the pods of S1 have a
RequiredDuringScheduling affinity rule "run me on nodes that are running pods
from S2" then it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node, but it would be better if there are.
PreferredDuringScheduling is symmetric. If the pods of S1 had a
PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that
are running pods from S2" then we would prefer to keep a S1 pod that we are
scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that
we are scheduling off of nodes that are running S1 pods. Likewise if the pods of
S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that
are running pods from S2" then we would prefer to place a S1 pod that we are
scheduling onto a node that is running a S2 pod, and also to place a S2 pod that
we are scheduling onto a node that is running a S1 pod.
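To make the symmetry argument concrete, here is a rough sketch (plain label maps rather than real selectors, and not the actual scheduler code) of the two-directional check that RequiredDuringScheduling anti-affinity implies:

```go
package main

import "fmt"

// Simplified model: a pod has labels and a list of hard (RequiredDuringScheduling)
// anti-affinity terms; a placed pod also remembers the labels of its node.
type Term struct {
	MatchLabels map[string]string // stands in for a LabelSelector
	TopologyKey string
}

type Pod struct {
	Labels           map[string]string
	HardAntiAffinity []Term
	NodeLabels       map[string]string // labels of the node the pod runs on (if placed)
}

func matches(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// sameDomain reports whether two nodes are in the same topology domain for key.
func sameDomain(a, b map[string]string, key string) bool {
	return a[key] != "" && a[key] == b[key]
}

// hardAntiAffinityAllows checks BOTH directions, which is what makes
// RequiredDuringScheduling anti-affinity symmetric: the candidate's own terms
// must not match any pod already in the domain, and no already-placed pod's
// terms may match the candidate.
func hardAntiAffinityAllows(candidate Pod, nodeLabels map[string]string, placed []Pod) bool {
	for _, p := range placed {
		for _, t := range candidate.HardAntiAffinity {
			if sameDomain(nodeLabels, p.NodeLabels, t.TopologyKey) && matches(t.MatchLabels, p.Labels) {
				return false // candidate's rule would be violated
			}
		}
		for _, t := range p.HardAntiAffinity {
			if sameDomain(nodeLabels, p.NodeLabels, t.TopologyKey) && matches(t.MatchLabels, candidate.Labels) {
				return false // an existing pod's rule would be violated
			}
		}
	}
	return true
}

func main() {
	node := map[string]string{"node": "n1"}
	s1 := Pod{
		Labels:           map[string]string{"service": "S1"},
		HardAntiAffinity: []Term{{MatchLabels: map[string]string{"service": "S2"}, TopologyKey: "node"}},
		NodeLabels:       node,
	}
	s2 := Pod{Labels: map[string]string{"service": "S2"}}
	// S2 has no anti-affinity of its own, but it still cannot land next to S1.
	fmt.Println(hardAntiAffinityAllows(s2, node, []Pod{s1})) // false
}
```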
## Examples
Here are some examples of how you would express various affinity and
anti-affinity rules using the API we described.
### Affinity
In the examples below, the word "put" is intentionally ambiguous; the rules are
the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
put" (PreferredDuringScheduling)--all that changes is which field the rule goes
into. Also, we only discuss scheduling-time, and ignore the execution-time.
Finally, some of the examples use "zone" and some use "node," just to make the
examples more interesting; any of the examples with "zone" will also work for
"node" if you change the `TopologyKey`, and vice-versa.
* **Put the pod in zone Z**:
Tricked you! It is not possible to express this using the API described here.
this you should use node affinity.
* **Put the pod in a zone that is running at least one pod from service S**:
`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`
* **Put the pod on a node that is already running a pod that requires a license
for software package P**: Assuming pods that require a license for software
package P have a label `{key=license, value=P}`:
`{LabelSelector: "license" In "P", TopologyKey: "node"}`
* **Put this pod in the same zone as other pods from its same service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
This last example illustrates a small issue with this API when it is used with a
scheduler that processes the pending queue one pod at a time, like the current
Kubernetes scheduler. The RequiredDuringScheduling rule
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
only "works" once one pod from service S has been scheduled. But if all pods in service
S have this RequiredDuringScheduling rule in their PodSpec, then the RequiredDuringScheduling rule
will block the first
pod of the service from ever scheduling, since it is only allowed to run in a zone with another pod from
the same service. And of course that means none of the pods of the service will be able
to schedule. This problem *only* applies to RequiredDuringScheduling affinity, not
PreferredDuringScheduling affinity or any variant of anti-affinity.
There are at least three ways to solve this problem
* **short-term**: have the scheduler use a rule that if the RequiredDuringScheduling affinity requirement
matches a pod's own labels, and there are no other such pods anywhere, then disregard the requirement.
This approach has a corner case when running parallel schedulers that are allowed to
schedule pods from the same replicated set (e.g. a single PodTemplate): both schedulers may try to
schedule pods from the set
at the same time and think there are no other pods from that set scheduled yet (e.g. they are
trying to schedule the first two pods from the set), but by the time
the second binding is committed, the first one has already been committed, leaving you with
two pods running that do not respect their RequiredDuringScheduling affinity. There is no
simple way to detect this "conflict" at scheduling time given the current system implementation.
* **longer-term**: when a controller creates pods from a PodTemplate, for exactly *one* of those
pods, it should omit any RequiredDuringScheduling affinity rules that select the pods of that PodTemplate.
* **very long-term/speculative**: controllers could present the scheduler with a group of pods from
the same PodTemplate as a single unit. This is similar to the first approach described above but
avoids the corner case. No special logic is needed in the controllers. Moreover, this would allow
the scheduler to do proper [gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845)
since it could receive an entire gang simultaneously as a single unit.
only "works" once one pod from service S has been scheduled. But if all pods in
service S have this RequiredDuringScheduling rule in their PodSpec, then the
RequiredDuringScheduling rule will block the first pod of the service from ever
scheduling, since it is only allowed to run in a zone with another pod from the
same service. And of course that means none of the pods of the service will be
able to schedule. This problem *only* applies to RequiredDuringScheduling
affinity, not PreferredDuringScheduling affinity or any variant of
anti-affinity. There are at least three ways to solve this problem:
* **short-term**: have the scheduler use a rule that if the
RequiredDuringScheduling affinity requirement matches a pod's own labels, and
there are no other such pods anywhere, then disregard the requirement. This
approach has a corner case when running parallel schedulers that are allowed to
schedule pods from the same replicated set (e.g. a single PodTemplate): both
schedulers may try to schedule pods from the set at the same time and think
there are no other pods from that set scheduled yet (e.g. they are trying to
schedule the first two pods from the set), but by the time the second binding is
committed, the first one has already been committed, leaving you with two pods
running that do not respect their RequiredDuringScheduling affinity. There is no
simple way to detect this "conflict" at scheduling time given the current system
implementation. (A sketch of this workaround appears after this list.)
* **longer-term**: when a controller creates pods from a PodTemplate, for
exactly *one* of those pods, it should omit any RequiredDuringScheduling
affinity rules that select the pods of that PodTemplate.
* **very long-term/speculative**: controllers could present the scheduler with a
group of pods from the same PodTemplate as a single unit. This is similar to the
first approach described above but avoids the corner case. No special logic is
needed in the controllers. Moreover, this would allow the scheduler to do proper
[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since
it could receive an entire gang simultaneously as a single unit.
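For illustration, here is a rough sketch of the short-term workaround described above, again with simplified stand-in types; a real implementation would evaluate label selectors against the scheduler's cache of all pods:

```go
package main

import "fmt"

// Simplified stand-ins: a selector is a plain label map, a pod is its labels.
type Term struct {
	MatchLabels map[string]string
	TopologyKey string
}

func matches(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// effectiveHardAffinityTerms drops a RequiredDuringScheduling affinity term when
// (a) it matches the pod's own labels and (b) no pod anywhere in the cluster
// matches it yet -- i.e. the pod being scheduled would be the first member of
// the group it wants to be co-located with.
func effectiveHardAffinityTerms(podLabels map[string]string, terms []Term, allPods []map[string]string) []Term {
	var kept []Term
	for _, t := range terms {
		if matches(t.MatchLabels, podLabels) {
			anyOther := false
			for _, labels := range allPods {
				if matches(t.MatchLabels, labels) {
					anyOther = true
					break
				}
			}
			if !anyOther {
				continue // disregard the term so the first pod can schedule
			}
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	self := map[string]string{"service": "S"}
	terms := []Term{{MatchLabels: map[string]string{"service": "S"}, TopologyKey: "zone"}}
	// No pod from service S is scheduled yet, so the term is disregarded.
	fmt.Println(effectiveHardAffinityTerms(self, terms, nil)) // []
}
```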
### Anti-affinity
As with the affinity examples, the examples here can be RequiredDuringScheduling
or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as
"must not" or as "try not to" depending on whether the rule appears in
`RequiredDuringScheduling` or `PreferredDuringScheduling`.
* **Spread the pods of this service S across nodes and zones**:
`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"},
{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
(note that if this is specified as a RequiredDuringScheduling anti-affinity,
then the first clause is redundant, since the second clause will force the
scheduler to not put more than one pod from S in the same zone, and thus by
definition it will not put more than one pod from S on the same node, assuming
each node is in one zone. This rule is more useful as PreferredDuringScheduling
anti-affinity, e.g. one might expect it to be common in
[Ubernetes](../../docs/proposals/federation.md) clusters.)
* **Don't co-locate pods of this service with pods from service "evilService"**:
@@ -323,25 +358,29 @@ This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one mi
* **Don't co-locate pods of this service with any other pods except other pods of this service**:
Assuming pods from the service have some label `{key=service, value=S}`:
`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
Note that this works because `"service" NotIn "S"` matches pods with no key
"service" as well as pods with key "service" and a corresponding value that is
not "S."
## Algorithm
An example algorithm a scheduler might use to implement affinity and
anti-affinity rules is as follows. There are certainly more efficient ways to
do it; this is just intended to demonstrate that the API's semantics are
implementable.
Terminology definition: We say a pod P is "feasible" on a node N if P meets all
of the scheduler predicates for scheduling P onto N. Note that this algorithm is
only concerned about scheduling time, thus it makes no distinction between
RequiredDuringExecution and IgnoredDuringExecution.
To make the algorithm slightly more readable, we use the term "HardPodAffinity"
as shorthand for "RequiredDuringSchedulingScheduling pod affinity" and
"SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity."
Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."
** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity}
into account; currently it assumes all terms have weight 1. **
```
Z = the pod you are scheduling
```

@@ -389,74 +428,81 @@ foreach node A of {N}
## Special considerations for RequiredDuringScheduling anti-affinity
In this section we discuss three issues with RequiredDuringScheduling
anti-affinity: Denial of Service (DoS), co-existing with daemons, and
determining which pod(s) to kill. See issue #18265 for additional discussion of
these topics.
### Denial of Service
Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity
can intentionally or unintentionally cause various problems for other pods, due
to the symmetry property of anti-affinity.
The most notable danger is the ability for a pod that arrives first to some
topology domain, to block all other pods from scheduling there by stating a
conflict with all other pods. The standard approach to preventing resource
hogging is quota, but simple resource quota cannot prevent this scenario because
the pod may request very little resources. Addressing this using quota requires
a quota scheme that charges based on "opportunity cost" rather than based simply
on requested resources. For example, when handling a pod that expresses
RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey`
(i.e. exclusive access to a node), it could charge for the resources of the
average or largest node in the cluster. Likewise if a pod expresses
RequiredDuringScheduling anti-affinity for all pods using a "cluster"
`TopologyKey`, it could charge for the resources of the entire cluster. If node
affinity is used to constrain the pod to a particular topology domain, then the
admission-time quota charging should take that into account (e.g. not charge for
the average/largest machine if the PodSpec constrains the pod to a specific
machine with a known size; instead charge for the size of the actual machine
that the pod was constrained to). In all cases once the pod is scheduled, the
quota charge should be adjusted down to the actual amount of resources allocated
(e.g. the size of the actual machine that was assigned, not the
average/largest). If a cluster administrator wants to overcommit quota, for
example to allow more than N pods across all users to request exclusive node
access in a cluster with N nodes, then a priority/preemption scheme should be
added so that the most important pods run when resource demand exceeds supply.
An alternative approach, which is a bit of a blunt hammer, is to use a
capability mechanism to restrict use of RequiredDuringScheduling anti-affinity
to trusted users. A more complex capability mechanism might only restrict it
when using a non-"node" TopologyKey.
Our initial implementation will use a variant of the capability approach, which
requires no configuration: we will simply reject ALL requests, regardless of
user, that specify "all namespaces" with non-"node" TopologyKey for
RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use
case while prohibiting the more dangerous ones.
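A minimal sketch of that admission check, assuming simplified stand-in types; the real admission controller would operate on the actual API objects:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for a RequiredDuringScheduling anti-affinity term.
type Term struct {
	Namespaces  []string // nil = pod's own namespace, empty non-nil = all namespaces
	TopologyKey string
}

// validateHardAntiAffinity rejects any term that combines "all namespaces"
// (an empty, non-nil Namespaces list) with a TopologyKey other than "node",
// which is the policy described above for the initial implementation.
func validateHardAntiAffinity(terms []Term) error {
	for _, t := range terms {
		allNamespaces := t.Namespaces != nil && len(t.Namespaces) == 0
		if allNamespaces && t.TopologyKey != "node" {
			return errors.New("RequiredDuringScheduling anti-affinity may not use all namespaces with a non-node TopologyKey")
		}
	}
	return nil
}

func main() {
	// Allowed: the "exclusive node" use case.
	fmt.Println(validateHardAntiAffinity([]Term{{Namespaces: []string{}, TopologyKey: "node"}}))
	// Rejected: would let one pod block an entire zone for everyone.
	fmt.Println(validateHardAntiAffinity([]Term{{Namespaces: []string{}, TopologyKey: "zone"}}))
}
```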
A weaker variant of the problem described in the previous paragraph is a pod's
ability to use anti-affinity to degrade the scheduling quality of another pod,
but not completely block it from scheduling. For example, a set of pods S1 could
use node affinity to request to schedule onto a set of nodes that some other set
of pods S2 prefers to schedule onto. If the pods in S1 have
RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for
S2, then due to the symmetry property of anti-affinity, they can prevent the
pods in S2 from scheduling onto their preferred nodes if they arrive first (for
sure in the RequiredDuringScheduling case, and with some probability that
depends on the weighting scheme for the PreferredDuringScheduling case). A very
sophisticated priority and/or quota scheme could mitigate this, or alternatively
we could eliminate the symmetry property of the implementation of
PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling
anti-affinity could affect scheduling quality of another pod, and as we
described in the previous paragraph, such pods could be charged quota for the
full topology domain, thereby reducing the potential for abuse.
We won't try to address this issue in our initial implementation; we can
consider one of the approaches mentioned above if it turns out to be a problem
in practice.
### Co-existing with daemons
A cluster administrator may wish to allow pods that express anti-affinity
against all pods, to nonetheless co-exist with system daemon pods, such as those
run by DaemonSet. In principle, we would like the specification for
RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
more other pods (see #18263 for a more detailed explanation of the toleration
concept). There are at least two ways to accomplish this:
* Scheduler special-cases the namespace(s) where daemons live, in the
sense that it ignores pods in those namespaces when it is
@@ -478,147 +524,168 @@ Our initial implementation will use the first approach.
### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)
Because anti-affinity is symmetric, in the case of
RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must
determine which pod(s) to kill when a pod's labels are updated in such a way as
to cause them to conflict with one or more other pods'
RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the
absence of a priority/preemption scheme, our rule will be that the pod with the
anti-affinity rule that becomes violated should be the one killed. A pod should
only specify constraints that apply to namespaces it trusts to not do malicious
things. Once we have priority/preemption, we can change the rule to say that the
lowest-priority pod(s) are killed until all
RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.
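A rough sketch of that rule under the same kind of simplified model used earlier (plain label maps, an implied common topology domain); the component doing the killing would of course work with the real API objects:

```go
package main

import "fmt"

// Simplified stand-in: a pod's hard anti-affinity terms are plain label maps
// scoped to a single, implied topology domain.
type Pod struct {
	Name             string
	Labels           map[string]string
	HardAntiAffinity []map[string]string
}

func matches(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// isViolated reports whether some other pod in the domain matches one of p's
// own RequiredDuringSchedulingRequiredDuringExecution anti-affinity terms.
func isViolated(p Pod, domain []Pod) bool {
	for _, sel := range p.HardAntiAffinity {
		for _, other := range domain {
			if other.Name != p.Name && matches(sel, other.Labels) {
				return true
			}
		}
	}
	return false
}

// victims applies the rule above: absent priority/preemption, the pods to kill
// are the ones whose own anti-affinity rules have become violated.
func victims(domain []Pod) []string {
	var out []string
	for _, p := range domain {
		if isViolated(p, domain) {
			out = append(out, p.Name)
		}
	}
	return out
}

func main() {
	a := Pod{Name: "a", Labels: map[string]string{"team": "red"},
		HardAntiAffinity: []map[string]string{{"team": "blue"}}}
	b := Pod{Name: "b", Labels: map[string]string{"team": "blue"}}
	// b's label update made it match a's rule, so a (the rule's owner) is killed.
	fmt.Println(victims([]Pod{a, b})) // [a]
}
```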
## Special considerations for RequiredDuringScheduling affinity
The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its
symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with
conflicting pods, and pods that conflict with P cannot schedule onto the node
once P has been scheduled there. The design we have described says that the
symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P
says it can only schedule onto nodes running pod Q, this does not mean Q can
only run on a node that is running P, but the scheduler will try to schedule Q
onto a node that is running P (i.e. treats the reverse direction as preferred).
This raises the same scheduling quality concern as we mentioned at the end of
the Denial of Service section above, and can be addressed in similar ways.
The nature of affinity (as opposed to anti-affinity) means that there is no
issue of determining which pod(s) to kill when a pod's labels change: it is
obviously the pod with the affinity rule that becomes violated that must be
killed. (Killing a pod never "fixes" violation of an affinity rule; it can only
"fix" violation an anti-affinity rule.) However, affinity does have a different
question related to killing: how long should the system wait before declaring
that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met
at runtime? For example, if a pod P has such an affinity for a pod Q and pod Q
is temporarily killed so that it can be updated to a new binary version, should
that trigger killing of P? More generally, how long should the system wait
before declaring that P's affinity is violated? (Of course affinity is expressed
in terms of label selectors, not for a specific pod, but the scenario is easier
to describe using a concrete pod.) This is closely related to the concept of
forgiveness (see issue #1574). In theory we could make this time duration
configurable by the user on a per-pod basis, but for the first version of this
feature we will make it a configurable property of whichever component does the
killing and that applies across all pods using the feature. Making it
configurable by the user would require a nontrivial change to the API syntax
(since the field would only apply to
RequiredDuringSchedulingRequiredDuringExecution affinity).
## Implementation plan
1. Add the `Affinity` field to PodSpec and the `PodAffinity` and
`PodAntiAffinity` types to the API along with all of their descendant types.
2. Implement a scheduler predicate that takes
`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into
account. Include a workaround for the issue described at the end of the Affinity
section of the Examples section (can't schedule first pod).
3. Implement a scheduler priority function that takes
`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity
into account.
4. Implement an admission controller that rejects requests that specify "all
namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling`
anti-affinity. This admission controller should be enabled by default.
5. Implement the recommended solution to the "co-existing with daemons" issue.
6. At this point, the feature can be deployed.
7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity
and anti-affinity, and make sure the pieces of the system already implemented
for `RequiredDuringSchedulingIgnoredDuringExecution` also take
`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the
scheduler predicate, the quota mechanism, the "co-existing with daemons"
solution).
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node"
`TopologyKey` to Kubelet's admission decision.
9. Implement code in Kubelet *or* the controllers that evicts a pod that no
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet,
then only for "node" `TopologyKey`; if controller, then potentially for all
`TopologyKey`s (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
Do so in a way that addresses the "determining which pod(s) to kill" issue.
We assume Kubelet publishes labels describing the node's membership in all of
the relevant scheduling domains (e.g. node name, rack name, availability zone
name, etc.). See #9044.
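For instance, the node labels assumed throughout this document might look something like the following; the keys `node`, `rack`, `zone`, and `power` are the illustrative keys used in the examples above, not a mandated schema:

```go
package main

import "fmt"

func main() {
	// Hypothetical labels a Kubelet (or cloud-provider integration) might
	// publish on a node to describe its membership in each scheduling domain.
	nodeLabels := map[string]string{
		"node":  "node-17",       // node name
		"rack":  "rack-a",        // rack name
		"zone":  "us-central1-b", // availability zone
		"power": "pdu-3",         // power domain
	}
	fmt.Println(nodeLabels)
}
```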
## Backward compatibility
Old versions of the scheduler will ignore `Affinity`.
Users should not start using `Affinity` until the full implementation has been
in Kubelet and the master for enough binary versions that we feel comfortable
that we will not need to roll back either Kubelet or master to a version that
does not support them. Longer-term we will use a programmatic approach to
enforcing this (#4855).
## Extensibility
The design described here is the result of careful analysis of use cases, a
decade of experience with Borg at Google, and a review of similar features in
other open-source container orchestration systems. We believe that it properly
balances the goal of expressiveness against the goals of simplicity and
efficiency of implementation. However, we recognize that use cases may arise in
the future that cannot be expressed using the syntax described here. Although we
are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
for Kubernetes users to get a consistent experience, etc.), the regular
Kubernetes annotation mechanism can be used to add or replace affinity rules.
The way this would work is:
1. Define one or more annotations to describe the new affinity rule(s).
1. User (or an admission controller) attaches the annotation(s) to pods to
request the desired scheduling behavior. If the new rule(s) *replace* one or
more fields of `Affinity` then the user would omit those fields from `Affinity`;
if they are *additional rules*, then the user would fill in `Affinity` as well
as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
If some particular new syntax becomes popular, we would consider upstreaming it
by integrating it into the standard `Affinity`.
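As a sketch of step 2 above, attaching such a rule as an annotation might look like the following; the annotation key and JSON shape are invented for illustration and are not defined by this proposal:

```go
package main

import "fmt"

func main() {
	// Hypothetical: an out-of-tree affinity rule carried as a pod annotation.
	// The key and value format are made up for this example; a custom scheduler
	// or admission controller would define and interpret its own format.
	annotations := map[string]string{
		"example.com/rack-spreading": `{"maxPodsPerRack": 3, "labelSelector": {"service": "S"}}`,
	}
	fmt.Println(annotations)
}
```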
## Future work and non-work
One can imagine that in the anti-affinity RequiredDuringScheduling case one
might want to associate a number with the rule, for example "do not allow this
pod to share a rack with more than three other pods (in total, or from the same
service as the pod)." We could allow this to be specified by adding an integer
`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case.
However, this flexibility complicates the system and we do not intend to
implement it.
It is likely that the specification and implementation of pod anti-affinity
can be unified with [taints and tolerations](taint-toleration-dedicated.md),
and likewise that the specification and implementation of pod affinity
can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod
labels would be "inherited" by the node, and pods would only be able to specify
affinity and anti-affinity for a node's labels. Our main motivation for not
unifying taints and tolerations with pod anti-affinity is that we foresee taints
and tolerations as being a concept that only cluster administrators need to
understand (and indeed in some setups taints and tolerations wouldn't even be
directly manipulated by a cluster administrator, instead they would only be set
by an admission controller that is implementing the administrator's high-level
policy about different classes of special machines and the users who belong to
the groups allowed to access them). Moreover, the concept of nodes "inheriting"
labels from pods seems complicated; it seems conceptually simpler to separate
rules involving relatively static properties of nodes from rules involving which
other pods are running on the same node or larger topology domain.
Data/storage affinity is related to pod affinity, and is likely to draw on some
of the ideas we have used for pod affinity. Today, data/storage affinity is
expressed using node affinity, on the assumption that the pod knows which
node(s) store(s) the data it wants. But a more flexible approach would allow the
pod to name the data rather than the node.
## Related issues
The review for this proposal is in #18265.
The topic of affinity/anti-affinity has generated a lot of discussion. The main
issue is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341,
#1965, and #2906 all have additional discussion and use cases.
As the examples in this document have demonstrated, topological affinity is very
useful in clusters that are spread across availability zones, e.g. to co-locate
pods of a service in the same zone to avoid a wide-area network hop, or to
spread pods across zones for failure tolerance. #17059, #13056, #13063, and
#4235 are relevant.
Issue #15675 describes connection affinity, which is vaguely related.