address issue #1488; clean up linewrap and some minor editing issues in the docs/design/* tree
Signed-off-by: mikebrow <brownwm@us.ibm.com>
@@ -38,45 +38,48 @@ Documentation for other releases can be found at
NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.

This document describes a proposal for specifying and implementing inter-pod
topological affinity and anti-affinity. By that we mean: rules that specify that
certain pods should be placed in the same topological domain (e.g. same node,
same rack, same zone, same power domain, etc.) as some other pods, or,
conversely, should *not* be placed in the same topological domain as some other
pods.

Here are a few example rules; we explain how to express them using the API
described in this doc later, in the section "Examples."

* Affinity
  * Co-locate the pods from a particular service or Job in the same availability
zone, without specifying which zone that should be.
  * Co-locate the pods from service S1 with pods from service S2 because S1 uses
S2 and thus it is useful to minimize the network latency between them.
Co-location might mean same nodes and/or same availability zone.
* Anti-affinity
  * Spread the pods of a service across nodes and/or availability zones, e.g. to
reduce correlated failures.
  * Give a pod "exclusive" access to a node to guarantee resource isolation --
it must never share the node with other pods.
  * Don't schedule the pods of a particular service on the same nodes as pods of
another service that are known to interfere with the performance of the pods of
the first service.

For both affinity and anti-affinity, there are three variants. Two variants have
the property of requiring the affinity/anti-affinity to be satisfied for the pod
to be allowed to schedule onto a node; the difference between them is that if
the condition ceases to be met later on at runtime, for one of them the system
will try to eventually evict the pod, while for the other the system may not try
to do so. The third variant simply provides scheduling-time *hints* that the
scheduler will try to satisfy but may not be able to. These three variants are
directly analogous to the three variants of [node affinity](nodeaffinity.md).

Note that this proposal is only about *inter-pod* topological affinity and
anti-affinity. There are other forms of topological affinity and anti-affinity.
For example, you can use [node affinity](nodeaffinity.md) to require (prefer)
that a set of pods all be scheduled in some specific zone Z. Node affinity is
not capable of expressing inter-pod dependencies, and conversely the API we
describe in this document is not capable of expressing node affinity rules. For
simplicity, we will use the terms "affinity" and "anti-affinity" to mean
"inter-pod topological affinity" and "inter-pod topological anti-affinity,"
respectively, in the remainder of this document.

## API

@@ -90,28 +93,28 @@ The `Affinity` type is defined as follows

```go
type Affinity struct {
  PodAffinity *PodAffinity `json:"podAffinity,omitempty"`
  PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"`
}

type PodAffinity struct {
  // If the affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system will try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
  // If the affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system may or may not try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
  // The scheduler will prefer to schedule pods to nodes that satisfy
  // the affinity expressions specified by this field, but it may choose
  // a node that violates one or more of the expressions. The node that is
  // most preferred is the one with the greatest sum of weights, i.e.
@@ -120,27 +123,27 @@ type PodAffinity struct {
  // compute a sum by iterating through the elements of this field and adding
  // "weight" to the sum if the node matches the corresponding MatchExpressions; the
  // node(s) with the highest sum are the most preferred.
  PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type PodAntiAffinity struct {
  // If the anti-affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the anti-affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system will try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
  // If the anti-affinity requirements specified by this field are not met at
  // scheduling time, the pod will not be scheduled onto the node.
  // If the anti-affinity requirements specified by this field cease to be met
  // at some point during pod execution (e.g. due to a pod label update), the
  // system may or may not try to eventually evict the pod from its node.
  // When there are multiple elements, the lists of nodes corresponding to each
  // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
  RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
  // The scheduler will prefer to schedule pods to nodes that satisfy
  // the anti-affinity expressions specified by this field, but it may choose
  // a node that violates one or more of the expressions. The node that is
  // most preferred is the one with the greatest sum of weights, i.e.
@@ -149,7 +152,7 @@ type PodAntiAffinity struct {
  // compute a sum by iterating through the elements of this field and adding
  // "weight" to the sum if the node matches the corresponding MatchExpressions; the
  // node(s) with the highest sum are the most preferred.
  PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type WeightedPodAffinityTerm struct {
@@ -159,23 +162,25 @@ type WeightedPodAffinityTerm struct {
}

type PodAffinityTerm struct {
  LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
  // namespaces specifies which namespaces the LabelSelector applies to (matches against);
  // nil list means "this pod's namespace," empty list means "all namespaces"
  // The json tag here is not "omitempty" since we need to distinguish nil and empty.
  // See https://golang.org/pkg/encoding/json/#Marshal for more details.
  Namespaces []api.Namespace `json:"namespaces,omitempty"`
  // empty topology key is interpreted by the scheduler as "all topologies"
  TopologyKey string `json:"topologyKey,omitempty"`
}
```

Note that the `Namespaces` field is necessary because normal `LabelSelector` is
scoped to the pod's namespace, but we need to be able to match against all pods
globally.

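For concreteness, here is a minimal, runnable Go sketch that mirrors the shape
of the types above. It uses simplified stand-in types, not the real Kubernetes
API; the label keys, service names, weight value, and the fields of the
stand-in `WeightedPodAffinityTerm` are illustrative assumptions for this sketch.

```go
package main

import "fmt"

// Simplified stand-ins for LabelSelector and the affinity types above.
type LabelSelector struct {
  MatchLabels map[string]string // key=value pairs a pod's labels must contain
}

type PodAffinityTerm struct {
  LabelSelector *LabelSelector
  Namespaces    []string // nil means "this pod's namespace"
  TopologyKey   string   // e.g. "node", "zone"
}

type WeightedPodAffinityTerm struct {
  Weight          int
  PodAffinityTerm PodAffinityTerm
}

type PodAffinity struct {
  RequiredDuringSchedulingIgnoredDuringExecution  []PodAffinityTerm
  PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm
}

type Affinity struct {
  PodAffinity *PodAffinity
}

func main() {
  // "Run me on a node that already has a pod from service S1" (required), and
  // "prefer zones that already run pods from service S2" (preferred).
  aff := Affinity{
    PodAffinity: &PodAffinity{
      RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
        {LabelSelector: &LabelSelector{MatchLabels: map[string]string{"service": "S1"}}, TopologyKey: "node"},
      },
      PreferredDuringSchedulingIgnoredDuringExecution: []WeightedPodAffinityTerm{
        {Weight: 5, PodAffinityTerm: PodAffinityTerm{
          LabelSelector: &LabelSelector{MatchLabels: map[string]string{"service": "S2"}},
          TopologyKey:   "zone",
        }},
      },
    },
  }
  fmt.Printf("%+v\n", aff)
}
```
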
To explain how this API works, let's say that the `PodSpec` of a pod `P` has an
`Affinity` that is configured as follows (note that we've omitted and collapsed
some fields for simplicity, but this should sufficiently convey the intent of
the design):

```go
PodAffinity {
@@ -188,130 +193,160 @@ PodAntiAffinity {
}
```

Then when scheduling pod P, the scheduler:
* Can only schedule P onto nodes that are running pods that satisfy `P1`.
(Assumes all nodes have a label with key `node` and value specifying their node
name.)
* Should try to schedule P onto zones that are running pods that satisfy `P2`.
(Assumes all nodes have a label with key `zone` and value specifying their
zone.)
* Cannot schedule P onto any racks that are running pods that satisfy `P3`.
(Assumes all nodes have a label with key `rack` and value specifying their rack
name.)
* Should try not to schedule P onto any power domains that are running pods that
satisfy `P4`. (Assumes all nodes have a label with key `power` and value
specifying their power domain.)

When `RequiredDuringScheduling` has multiple elements, the requirements are
ANDed. For `PreferredDuringScheduling` the weights are added for the terms that
are satisfied for each node, and the node(s) with the highest weight(s) are the
most preferred.

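To make the preferred-term scoring concrete, here is a small, self-contained Go
sketch. It is illustrative only, not scheduler code; the node names, cluster
state, and weights are made-up assumptions, and term satisfaction is reduced to
"a pod of the named service already runs on the node."

```go
package main

import "fmt"

// A preferred term, reduced to a weight plus a predicate reporting whether the
// term is satisfied on a given node.
type weightedTerm struct {
  weight    int
  satisfied func(node string) bool
}

func main() {
  // Made-up cluster state: which services already run on which nodes.
  podsOnNode := map[string][]string{
    "node-a": {"S2"},
    "node-b": {"S1", "S2"},
    "node-c": {},
  }
  runs := func(service string) func(string) bool {
    return func(node string) bool {
      for _, s := range podsOnNode[node] {
        if s == service {
          return true
        }
      }
      return false
    }
  }

  // PreferredDuringScheduling terms for the pod being scheduled.
  terms := []weightedTerm{
    {weight: 5, satisfied: runs("S1")}, // prefer co-location with S1
    {weight: 2, satisfied: runs("S2")}, // prefer co-location with S2
  }

  // Score each node: add a term's weight iff the term is satisfied there.
  best, bestScore := "", -1
  for node := range podsOnNode {
    score := 0
    for _, t := range terms {
      if t.satisfied(node) {
        score += t.weight
      }
    }
    fmt.Printf("%s: score %d\n", node, score)
    if score > bestScore {
      best, bestScore = node, score
    }
  }
  fmt.Println("most preferred:", best)
}
```
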
In reality there are two variants of `RequiredDuringScheduling`: one suffixed
with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`.
For the first variant, if the affinity/anti-affinity ceases to be met at some
point during pod execution (e.g. due to a pod label update), the system will try
to eventually evict the pod from its node. In the second variant, the system may
or may not try to eventually evict the pod from its node.

## A comment on symmetry

One thing that makes affinity and anti-affinity tricky is symmetry.

Imagine a cluster that is running pods from two services, S1 and S2. Imagine
that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not
run me on nodes that are running pods from S2." It is not sufficient just to
check that there are no S2 pods on a node when you are scheduling a S1 pod. You
also need to ensure that there are no S1 pods on a node when you are scheduling
a S2 pod, *even though the S2 pod does not have any anti-affinity rules*.
Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's
RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving
S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling
anti-affinity rule, then:
* if a node is empty, you can schedule S1 or S2 onto the node
* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node

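A minimal Go sketch of this symmetric check (illustrative only: a pod is
modeled as a service label plus an optional "forbidden service"
RequiredDuringScheduling anti-affinity, and the whole node is the topology
domain). A node is feasible for a new pod only if the new pod's rule matches no
existing pod *and* no existing pod's rule matches the new pod.

```go
package main

import "fmt"

// A pod reduced to a service label plus an optional RequiredDuringScheduling
// anti-affinity rule ("do not run me with pods of this service").
type pod struct {
  service      string
  antiAffinity string // "" means no rule
}

// feasible reports whether newPod may be placed on a node already running the
// existing pods, honoring anti-affinity in both directions (symmetry).
func feasible(newPod pod, existing []pod) bool {
  for _, p := range existing {
    // newPod's own rule forbids co-location with p.
    if newPod.antiAffinity != "" && newPod.antiAffinity == p.service {
      return false
    }
    // p's rule forbids co-location with newPod, even though newPod itself
    // expressed no rule.
    if p.antiAffinity != "" && p.antiAffinity == newPod.service {
      return false
    }
  }
  return true
}

func main() {
  s1 := pod{service: "S1", antiAffinity: "S2"} // S1: "do not run me with S2"
  s2 := pod{service: "S2"}                     // S2 has no rule of its own

  fmt.Println(feasible(s1, nil))       // true: empty node
  fmt.Println(feasible(s2, nil))       // true: empty node
  fmt.Println(feasible(s2, []pod{s1})) // false: S1 already there, symmetry applies
  fmt.Println(feasible(s1, []pod{s2})) // false: S1's own rule applies
}
```
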
Note that while RequiredDuringScheduling anti-affinity is symmetric,
RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1
have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2," it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node. More specifically, if S1 has the
aforementioned RequiredDuringScheduling affinity rule, then:
* if a node is empty, you can schedule S2 onto the node
* if a node is empty, you cannot schedule S1 onto the node
* if a node is running S2, you can schedule S1 onto the node
* if a node is running S1+S2 and S1 terminates, S2 continues running
* if a node is running S1+S2 and S2 terminates, the system terminates S1
(eventually)

However, although RequiredDuringScheduling affinity is not symmetric, there is
an implicit PreferredDuringScheduling affinity rule corresponding to every
RequiredDuringScheduling affinity rule: if the pods of S1 have a
RequiredDuringScheduling affinity rule "run me on nodes that are running pods
from S2" then it is not required that there be S1 pods on a node in order to
schedule a S2 pod onto that node, but it would be better if there are.

PreferredDuringScheduling is symmetric. If the pods of S1 had a
PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that
are running pods from S2" then we would prefer to keep a S1 pod that we are
scheduling off of nodes that are running S2 pods, and also to keep a S2 pod that
we are scheduling off of nodes that are running S1 pods. Likewise if the pods of
S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that
are running pods from S2" then we would prefer to place a S1 pod that we are
scheduling onto a node that is running a S2 pod, and also to place a S2 pod that
we are scheduling onto a node that is running a S1 pod.

## Examples

Here are some examples of how you would express various affinity and
anti-affinity rules using the API we described.

### Affinity

In the examples below, the word "put" is intentionally ambiguous; the rules are
the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
put" (PreferredDuringScheduling)--all that changes is which field the rule goes
into. Also, we only discuss scheduling-time, and ignore the execution-time.
Finally, some of the examples use "zone" and some use "node," just to make the
examples more interesting; any of the examples with "zone" will also work for
"node" if you change the `TopologyKey`, and vice-versa.

* **Put the pod in zone Z**:
Tricked you! It is not possible to express this using the API described here.
For this you should use node affinity.

* **Put the pod in a zone that is running at least one pod from service S**:
`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`

* **Put the pod on a node that is already running a pod that requires a license
for software package P**: Assuming pods that require a license for software
package P have a label `{key=license, value=P}`:
`{LabelSelector: "license" In "P", TopologyKey: "node"}`

* **Put this pod in the same zone as other pods from its same service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`

This last example illustrates a small issue with this API when it is used with a
scheduler that processes the pending queue one pod at a time, like the current
Kubernetes scheduler. The RequiredDuringScheduling rule
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
only "works" once one pod from service S has been scheduled. But if all pods in
service S have this RequiredDuringScheduling rule in their PodSpec, then the
RequiredDuringScheduling rule will block the first pod of the service from ever
scheduling, since it is only allowed to run in a zone with another pod from the
same service. And of course that means none of the pods of the service will be
able to schedule. This problem *only* applies to RequiredDuringScheduling
affinity, not PreferredDuringScheduling affinity or any variant of
anti-affinity. There are at least three ways to solve this problem:
* **short-term**: have the scheduler use a rule that if the
RequiredDuringScheduling affinity requirement matches a pod's own labels, and
there are no other such pods anywhere, then disregard the requirement (see the
sketch after this list). This approach has a corner case when running parallel
schedulers that are allowed to schedule pods from the same replicated set (e.g.
a single PodTemplate): both schedulers may try to schedule pods from the set at
the same time and think there are no other pods from that set scheduled yet
(e.g. they are trying to schedule the first two pods from the set), but by the
time the second binding is committed, the first one has already been committed,
leaving you with two pods running that do not respect their
RequiredDuringScheduling affinity. There is no simple way to detect this
"conflict" at scheduling time given the current system implementation.
* **longer-term**: when a controller creates pods from a PodTemplate, for
exactly *one* of those pods, it should omit any RequiredDuringScheduling
affinity rules that select the pods of that PodTemplate.
* **very long-term/speculative**: controllers could present the scheduler with a
group of pods from the same PodTemplate as a single unit. This is similar to the
first approach described above but avoids the corner case. No special logic is
needed in the controllers. Moreover, this would allow the scheduler to do proper
[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since
it could receive an entire gang simultaneously as a single unit.

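A sketch of the short-term workaround from the first bullet above (illustrative
only, with label selectors reduced to exact-match maps): a required affinity
term is disregarded if it selects the pod's own labels and no other pod in the
cluster matches it.

```go
package main

import "fmt"

type labels map[string]string

// matches reports whether pod labels satisfy an exact-match selector.
func matches(sel, l labels) bool {
  for k, v := range sel {
    if l[k] != v {
      return false
    }
  }
  return true
}

// disregard implements the short-term rule: ignore a RequiredDuringScheduling
// affinity selector if it matches the pod's own labels and matches no other
// pod anywhere in the cluster (so the "first" pod of a set can schedule).
func disregard(selector, self labels, allOtherPods []labels) bool {
  if !matches(selector, self) {
    return false
  }
  for _, other := range allOtherPods {
    if matches(selector, other) {
      return false
    }
  }
  return true
}

func main() {
  selector := labels{"service": "S"}
  self := labels{"service": "S", "pod": "s-0"}

  // No other pods from service S exist yet: the rule is disregarded.
  fmt.Println(disregard(selector, self, nil)) // true

  // Once another S pod exists, the rule is enforced normally.
  fmt.Println(disregard(selector, self, []labels{{"service": "S", "pod": "s-1"}})) // false
}
```
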
### Anti-affinity

As with the affinity examples, the examples here can be RequiredDuringScheduling
or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as
"must not" or as "try not to" depending on whether the rule appears in
`RequiredDuringScheduling` or `PreferredDuringScheduling`.

* **Spread the pods of this service S across nodes and zones**:
`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"},
{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
(note that if this is specified as a RequiredDuringScheduling anti-affinity,
then the first clause is redundant, since the second clause will force the
scheduler to not put more than one pod from S in the same zone, and thus by
definition it will not put more than one pod from S on the same node, assuming
each node is in one zone. This rule is more useful as PreferredDuringScheduling
anti-affinity, e.g. one might expect it to be common in
[Ubernetes](../../docs/proposals/federation.md) clusters.)

* **Don't co-locate pods of this service with pods from service "evilService"**:
@@ -323,25 +358,29 @@ This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one mi

* **Don't co-locate pods of this service with any other pods except other pods of this service**:
Assuming pods from the service have some label `{key=service, value=S}`:
`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
Note that this works because `"service" NotIn "S"` matches pods with no key
"service" as well as pods with key "service" and a corresponding value that is
not "S."

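A tiny Go sketch of the `NotIn` matching behavior relied on above (illustrative,
not the real selector library): `NotIn` matches a pod whose labels either lack
the key entirely or carry a value outside the listed set.

```go
package main

import "fmt"

// notIn reports whether a pod's labels satisfy `key NotIn values`:
// true when the key is absent, or present with a value not in the set.
func notIn(podLabels map[string]string, key string, values ...string) bool {
  v, ok := podLabels[key]
  if !ok {
    return true // no such key at all
  }
  for _, forbidden := range values {
    if v == forbidden {
      return false
    }
  }
  return true
}

func main() {
  fmt.Println(notIn(map[string]string{"service": "S"}, "service", "S")) // false: same service
  fmt.Println(notIn(map[string]string{"service": "T"}, "service", "S")) // true: different service
  fmt.Println(notIn(map[string]string{"app": "batch"}, "service", "S")) // true: no "service" key
}
```
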
## Algorithm

An example algorithm a scheduler might use to implement affinity and
anti-affinity rules is as follows. There are certainly more efficient ways to
do it; this is just intended to demonstrate that the API's semantics are
implementable.

Terminology definition: We say a pod P is "feasible" on a node N if P meets all
of the scheduler predicates for scheduling P onto N. Note that this algorithm is
only concerned about scheduling time, thus it makes no distinction between
RequiredDuringExecution and IgnoredDuringExecution.

To make the algorithm slightly more readable, we use the term "HardPodAffinity"
as shorthand for "RequiredDuringScheduling pod affinity" and "SoftPodAffinity"
as shorthand for "PreferredDuringScheduling pod affinity." Analogously for
"HardPodAntiAffinity" and "SoftPodAntiAffinity."

** TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity}
into account; currently it assumes all terms have weight 1. **

```
Z = the pod you are scheduling
```
@@ -389,74 +428,81 @@ foreach node A of {N}

## Special considerations for RequiredDuringScheduling anti-affinity

In this section we discuss three issues with RequiredDuringScheduling
anti-affinity: Denial of Service (DoS), co-existing with daemons, and
determining which pod(s) to kill. See issue #18265 for additional discussion of
these topics.

### Denial of Service

Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity
can intentionally or unintentionally cause various problems for other pods, due
to the symmetry property of anti-affinity.

The most notable danger is the ability for a pod that arrives first to some
topology domain, to block all other pods from scheduling there by stating a
conflict with all other pods. The standard approach to preventing resource
hogging is quota, but simple resource quota cannot prevent this scenario because
the pod may request very little resources. Addressing this using quota requires
a quota scheme that charges based on "opportunity cost" rather than based simply
on requested resources. For example, when handling a pod that expresses
RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey`
(i.e. exclusive access to a node), it could charge for the resources of the
average or largest node in the cluster. Likewise if a pod expresses
RequiredDuringScheduling anti-affinity for all pods using a "cluster"
`TopologyKey`, it could charge for the resources of the entire cluster. If node
affinity is used to constrain the pod to a particular topology domain, then the
admission-time quota charging should take that into account (e.g. not charge for
the average/largest machine if the PodSpec constrains the pod to a specific
machine with a known size; instead charge for the size of the actual machine
that the pod was constrained to). In all cases once the pod is scheduled, the
quota charge should be adjusted down to the actual amount of resources allocated
(e.g. the size of the actual machine that was assigned, not the
average/largest). If a cluster administrator wants to overcommit quota, for
example to allow more than N pods across all users to request exclusive node
access in a cluster with N nodes, then a priority/preemption scheme should be
added so that the most important pods run when resource demand exceeds supply.

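A rough Go sketch of the opportunity-cost charging idea (illustrative only; the
resource numbers, the single "CPU" dimension, and the choice of charging for
the largest node are assumptions, not a specified mechanism):

```go
package main

import "fmt"

// cluster holds node capacities in an imaginary single "CPU" dimension.
type cluster struct {
  nodeCPUs []int
}

func (c cluster) largestNode() int {
  max := 0
  for _, n := range c.nodeCPUs {
    if n > max {
      max = n
    }
  }
  return max
}

func (c cluster) totalCPUs() int {
  sum := 0
  for _, n := range c.nodeCPUs {
    sum += n
  }
  return sum
}

// admissionCharge returns the quota to charge at admission time for a pod that
// expresses RequiredDuringScheduling anti-affinity against all pods at the
// given topology level, rather than charging only its (possibly tiny) request.
func admissionCharge(c cluster, requestCPU int, topologyKey string) int {
  switch topologyKey {
  case "node": // exclusive access to a node: charge a whole (largest) node
    return c.largestNode()
  case "cluster": // exclusive access to the cluster: charge everything
    return c.totalCPUs()
  default: // no blanket anti-affinity: charge what was requested
    return requestCPU
  }
}

func main() {
  c := cluster{nodeCPUs: []int{16, 32, 64}}
  fmt.Println(admissionCharge(c, 1, "node"))    // 64: charged for the largest node
  fmt.Println(admissionCharge(c, 1, "cluster")) // 112: charged for the whole cluster
  fmt.Println(admissionCharge(c, 1, ""))        // 1: ordinary request
}
```
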
An alternative approach, which is a bit of a blunt hammer, is to use a
capability mechanism to restrict use of RequiredDuringScheduling anti-affinity
to trusted users. A more complex capability mechanism might only restrict it
when using a non-"node" TopologyKey.

Our initial implementation will use a variant of the capability approach, which
requires no configuration: we will simply reject ALL requests, regardless of
user, that specify "all namespaces" with non-"node" TopologyKey for
RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use
case while prohibiting the more dangerous ones.

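A sketch of that admission check (illustrative; the types are simplified
stand-ins, and as in the API above, an empty `Namespaces` list is taken to mean
"all namespaces" while nil means "this pod's namespace"):

```go
package main

import (
  "errors"
  "fmt"
)

// Simplified stand-in for the required anti-affinity portion of a pod spec.
type podAffinityTerm struct {
  Namespaces  []string // nil = this pod's namespace, empty non-nil = all namespaces
  TopologyKey string
}

type podAntiAffinity struct {
  RequiredDuringScheduling []podAffinityTerm
}

// validate rejects any required anti-affinity term that selects "all
// namespaces" with a TopologyKey other than "node", regardless of user.
func validate(anti podAntiAffinity) error {
  for _, term := range anti.RequiredDuringScheduling {
    allNamespaces := term.Namespaces != nil && len(term.Namespaces) == 0
    if allNamespaces && term.TopologyKey != "node" {
      return errors.New("required anti-affinity with all namespaces is only allowed with TopologyKey \"node\"")
    }
  }
  return nil
}

func main() {
  // Allowed: the "exclusive node" use case.
  fmt.Println(validate(podAntiAffinity{RequiredDuringScheduling: []podAffinityTerm{
    {Namespaces: []string{}, TopologyKey: "node"},
  }}))
  // Rejected: all namespaces with a broader topology.
  fmt.Println(validate(podAntiAffinity{RequiredDuringScheduling: []podAffinityTerm{
    {Namespaces: []string{}, TopologyKey: "zone"},
  }}))
}
```
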
A weaker variant of the problem described in the previous paragraph is a pod's
ability to use anti-affinity to degrade the scheduling quality of another pod,
but not completely block it from scheduling. For example, a set of pods S1 could
use node affinity to request to schedule onto a set of nodes that some other set
of pods S2 prefers to schedule onto. If the pods in S1 have
RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for
S2, then due to the symmetry property of anti-affinity, they can prevent the
pods in S2 from scheduling onto their preferred nodes if they arrive first (for
sure in the RequiredDuringScheduling case, and with some probability that
depends on the weighting scheme for the PreferredDuringScheduling case). A very
sophisticated priority and/or quota scheme could mitigate this, or alternatively
we could eliminate the symmetry property of the implementation of
PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling
anti-affinity could affect scheduling quality of another pod, and as we
described in the previous paragraph, such pods could be charged quota for the
full topology domain, thereby reducing the potential for abuse.

We won't try to address this issue in our initial implementation; we can
consider one of the approaches mentioned above if it turns out to be a problem
in practice.

### Co-existing with daemons

A cluster administrator may wish to allow pods that express anti-affinity
against all pods, to nonetheless co-exist with system daemon pods, such as those
run by DaemonSet. In principle, we would like the specification for
RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
more other pods (see #18263 for a more detailed explanation of the toleration
concept). There are at least two ways to accomplish this:

* Scheduler special-cases the namespace(s) where daemons live, in the
sense that it ignores pods in those namespaces when it is
@@ -478,147 +524,168 @@ Our initial implementation will use the first approach.

### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)

Because anti-affinity is symmetric, in the case of
RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must
determine which pod(s) to kill when a pod's labels are updated in such a way as
to cause them to conflict with one or more other pods'
RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the
absence of a priority/preemption scheme, our rule will be that the pod with the
anti-affinity rule that becomes violated should be the one killed. A pod should
only specify constraints that apply to namespaces it trusts to not do malicious
things. Once we have priority/preemption, we can change the rule to say that the
lowest-priority pod(s) are killed until all
RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.

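A small Go sketch of that rule (illustrative; pods are reduced to a label map
plus an exact-match required-during-execution anti-affinity selector, and all
the pods shown share one topology domain): after a label update, the pods to
evict are exactly those whose own anti-affinity rule now matches the updated
pod.

```go
package main

import "fmt"

type pod struct {
  name         string
  labels       map[string]string
  antiAffinity map[string]string // RequiredDuringSchedulingRequiredDuringExecution selector; nil = none
}

// matches reports whether labels satisfy a non-empty exact-match selector.
func matches(sel, labels map[string]string) bool {
  for k, v := range sel {
    if labels[k] != v {
      return false
    }
  }
  return len(sel) > 0
}

// victims returns the pods whose anti-affinity rule became violated by the
// updated pod's new labels; under the stated rule, these are the pods killed.
func victims(updated pod, domainPods []pod) []string {
  var out []string
  for _, p := range domainPods {
    if p.name != updated.name && matches(p.antiAffinity, updated.labels) {
      out = append(out, p.name)
    }
  }
  return out
}

func main() {
  a := pod{name: "a", labels: map[string]string{"service": "S2"},
    antiAffinity: map[string]string{"noisy": "true"}} // "don't run me with noisy pods"
  b := pod{name: "b", labels: map[string]string{"service": "S3"}}

  // Pod c's labels were just updated to include noisy=true.
  c := pod{name: "c", labels: map[string]string{"service": "S1", "noisy": "true"}}

  fmt.Println(victims(c, []pod{a, b, c})) // [a]: a's rule is now violated, so a is killed
}
```
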
## Special considerations for RequiredDuringScheduling affinity

The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its
symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with
conflicting pods, and pods that conflict with P cannot schedule onto the node
once P has been scheduled there. The design we have described says that the
symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P
says it can only schedule onto nodes running pod Q, this does not mean Q can
only run on a node that is running P, but the scheduler will try to schedule Q
onto a node that is running P (i.e. treats the reverse direction as preferred).
This raises the same scheduling quality concern as we mentioned at the end of
the Denial of Service section above, and can be addressed in similar ways.

The nature of affinity (as opposed to anti-affinity) means that there is no
issue of determining which pod(s) to kill when a pod's labels change: it is
obviously the pod with the affinity rule that becomes violated that must be
killed. (Killing a pod never "fixes" violation of an affinity rule; it can only
"fix" violation of an anti-affinity rule.) However, affinity does have a
different question related to killing: how long should the system wait before
declaring that RequiredDuringSchedulingRequiredDuringExecution affinity is no
longer met at runtime? For example, if a pod P has such an affinity for a pod Q
and pod Q is temporarily killed so that it can be updated to a new binary
version, should that trigger killing of P? More generally, how long should the
system wait before declaring that P's affinity is violated? (Of course affinity
is expressed in terms of label selectors, not for a specific pod, but the
scenario is easier to describe using a concrete pod.) This is closely related to
the concept of forgiveness (see issue #1574). In theory we could make this time
duration configurable by the user on a per-pod basis, but for the first version
of this feature we will make it a configurable property of whichever component
does the killing and that applies across all pods using the feature. Making it
configurable by the user would require a nontrivial change to the API syntax
(since the field would only apply to
RequiredDuringSchedulingRequiredDuringExecution affinity).

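A minimal sketch of a component-level grace period of the kind described
(illustrative; the duration, the state tracking, and the function names are
assumptions): the affinity is declared violated, and the pod evicted, only once
it has been continuously unmet for longer than the configured window.

```go
package main

import (
  "fmt"
  "time"
)

// violationTracker remembers when a pod's RequiredDuringExecution affinity
// first became unmet and evicts only after a configurable grace period, so a
// brief restart of the peer pod (e.g. a rolling update) does not kill the pod.
type violationTracker struct {
  gracePeriod time.Duration
  unmetSince  map[string]time.Time // pod name -> first time affinity was unmet
}

func (t *violationTracker) shouldEvict(pod string, affinityMet bool, now time.Time) bool {
  if affinityMet {
    delete(t.unmetSince, pod) // violation cleared; reset the timer
    return false
  }
  since, ok := t.unmetSince[pod]
  if !ok {
    t.unmetSince[pod] = now
    return false
  }
  return now.Sub(since) > t.gracePeriod
}

func main() {
  t := &violationTracker{gracePeriod: 5 * time.Minute, unmetSince: map[string]time.Time{}}
  start := time.Now()

  fmt.Println(t.shouldEvict("P", false, start))                     // false: just became unmet
  fmt.Println(t.shouldEvict("P", false, start.Add(2*time.Minute)))  // false: within grace period
  fmt.Println(t.shouldEvict("P", true, start.Add(3*time.Minute)))   // false: Q came back, timer resets
  fmt.Println(t.shouldEvict("P", false, start.Add(4*time.Minute)))  // false: unmet again, timer restarts
  fmt.Println(t.shouldEvict("P", false, start.Add(10*time.Minute))) // true: unmet longer than grace period
}
```
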
## Implementation plan
|
||||
|
||||
1. Add the `Affinity` field to PodSpec and the `PodAffinity` and `PodAntiAffinity` types to the API along with all of their descendant types.
|
||||
2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution`
|
||||
affinity and anti-affinity into account. Include a workaround for the issue described at the end of the Affinity section of the Examples section (can't schedule first pod).
|
||||
3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account
|
||||
4. Implement admission controller that rejects requests that specify "all namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` anti-affinity.
|
||||
This admission controller should be enabled by default.
|
||||
1. Add the `Affinity` field to PodSpec and the `PodAffinity` and
|
||||
`PodAntiAffinity` types to the API along with all of their descendant types.
|
||||
2. Implement a scheduler predicate that takes
|
||||
`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into
|
||||
account. Include a workaround for the issue described at the end of the Affinity
|
||||
section of the Examples section (can't schedule first pod).
|
||||
3. Implement a scheduler priority function that takes
|
||||
`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity
|
||||
into account.
|
||||
4. Implement admission controller that rejects requests that specify "all
|
||||
namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling`
|
||||
anti-affinity. This admission controller should be enabled by default.
|
||||
5. Implement the recommended solution to the "co-existing with daemons" issue
|
||||
6. At this point, the feature can be deployed.
|
||||
7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity and anti-affinity, and make sure
|
||||
the pieces of the system already implemented for `RequiredDuringSchedulingIgnoredDuringExecution` also take
|
||||
`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the scheduler predicate, the quota mechanism,
|
||||
the "co-existing with daemons" solution).
|
||||
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node" `TopologyKey` to Kubelet's admission decision
|
||||
9. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
|
||||
`RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet then only for "node" `TopologyKey`;
|
||||
if controller then potentially for all `TopologyKeys`'s.
|
||||
(see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
|
||||
7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity
and anti-affinity, and make sure the pieces of the system already implemented
for `RequiredDuringSchedulingIgnoredDuringExecution` also take
`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the
scheduler predicate, the quota mechanism, the "co-existing with daemons"
solution).
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node"
`TopologyKey` to Kubelet's admission decision.
9. Implement code in Kubelet *or* the controllers that evicts a pod that no
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet,
then only for "node" `TopologyKey`; if controller, then potentially for all
`TopologyKey`s (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
Do so in a way that addresses the "determining which pod(s) to kill" issue.
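
To make the predicate in step 2 above concrete, the following is a minimal
sketch of the check it performs, written against simplified stand-in types
rather than the real API objects: label selectors are reduced to exact-match
maps, namespaces and the symmetry rules are ignored, and the "can't schedule
first pod" workaround is omitted. It illustrates the idea only; it is not the
actual scheduler code.

```go
package podaffinity

// Simplified stand-in types; the real proposal defines richer API objects.
type PodAffinityTerm struct {
	MatchLabels map[string]string // stand-in for a full LabelSelector
	TopologyKey string            // e.g. a hostname or zone label key
}

type Pod struct {
	Labels            map[string]string
	NodeName          string
	AffinityTerms     []PodAffinityTerm // RequiredDuringSchedulingIgnoredDuringExecution affinity
	AntiAffinityTerms []PodAffinityTerm // RequiredDuringSchedulingIgnoredDuringExecution anti-affinity
}

type Node struct {
	Name   string
	Labels map[string]string
}

// matches reports whether labels satisfy every key/value pair in sel.
func matches(sel, labels map[string]string) bool {
	for k, v := range sel {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// sameDomain reports whether two nodes share a value for the topology key.
func sameDomain(a, b *Node, key string) bool {
	av, aok := a.Labels[key]
	bv, bok := b.Labels[key]
	return aok && bok && av == bv
}

// fitsOnNode is a simplified predicate: every affinity term must be satisfied
// by at least one existing pod in the candidate node's topology domain, and no
// anti-affinity term may be satisfied by any existing pod in that domain.
func fitsOnNode(pod *Pod, candidate *Node, allPods []*Pod, nodes map[string]*Node) bool {
	satisfied := func(term PodAffinityTerm) bool {
		for _, other := range allPods {
			n := nodes[other.NodeName]
			if n != nil && matches(term.MatchLabels, other.Labels) && sameDomain(candidate, n, term.TopologyKey) {
				return true
			}
		}
		return false
	}
	for _, term := range pod.AffinityTerms {
		if !satisfied(term) {
			return false // no co-located pod matches the required affinity term
		}
	}
	for _, term := range pod.AntiAffinityTerms {
		if satisfied(term) {
			return false // a matching pod is already in this topology domain
		}
	}
	return true
}
```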
We assume Kubelet publishes labels describing the node's membership in all of
the relevant scheduling domains (e.g. node name, rack name, availability zone
name, etc.). See #9044.
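
As an illustration of what such labels might look like, the sketch below shows
a label set a Kubelet could publish for one node. The hostname, zone, and
region keys shown are the conventional ones, but treat the exact keys, and the
rack label in particular, as assumptions rather than part of this proposal.

```go
package podaffinity

// exampleNodeLabels returns labels a Kubelet might publish for its node, one
// label per scheduling domain. The exact keys are illustrative assumptions;
// "example.com/rack" in particular is hypothetical.
func exampleNodeLabels() map[string]string {
	return map[string]string{
		"kubernetes.io/hostname":                   "node-17",       // node-level domain
		"failure-domain.beta.kubernetes.io/zone":   "us-central1-a", // zone-level domain
		"failure-domain.beta.kubernetes.io/region": "us-central1",   // region-level domain
		"example.com/rack":                         "rack-3",        // hypothetical rack label
	}
}
```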
## Backward compatibility

Old versions of the scheduler will ignore `Affinity`.

Users should not start using `Affinity` until the full implementation has been
in Kubelet and the master for enough binary versions that we feel comfortable
that we will not need to roll back either Kubelet or master to a version that
does not support it. Longer-term we will use a programmatic approach to
enforcing this (#4855).

## Extensibility
The design described here is the result of careful analysis of use cases, a
decade of experience with Borg at Google, and a review of similar features in
other open-source container orchestration systems. We believe that it properly
balances the goal of expressiveness against the goals of simplicity and
efficiency of implementation. However, we recognize that use cases may arise in
the future that cannot be expressed using the syntax described here. Although we
are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
for Kubernetes users to get a consistent experience, etc.), the regular
Kubernetes annotation mechanism can be used to add or replace affinity rules.
The way this would work is:
1. Define one or more annotations to describe the new affinity rule(s).
1. User (or an admission controller) attaches the annotation(s) to pods to
request the desired scheduling behavior (a hypothetical example appears after
this list). If the new rule(s) *replace* one or more fields of `Affinity` then
the user would omit those fields from `Affinity`; if they are *additional
rules*, then the user would fill in `Affinity` as well as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
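
As a purely hypothetical illustration of this flow, the sketch below marshals a
made-up "spread limit" rule into a pod annotation under an invented key.
Neither the annotation key nor the rule schema is part of this proposal; a
custom scheduler or scheduler extension would have to understand them.

```go
package podaffinity

import "encoding/json"

// customSpreadRule is a hypothetical extension rule, used here only to
// illustrate carrying extra affinity semantics in an annotation.
type customSpreadRule struct {
	TopologyKey  string `json:"topologyKey"`
	MaxPerDomain int    `json:"maxPerDomain"`
}

// annotateWithCustomRule attaches the rule to a pod's annotation map under a
// made-up key; a custom scheduler (or scheduler extension) would parse it.
func annotateWithCustomRule(annotations map[string]string, rule customSpreadRule) error {
	raw, err := json.Marshal(rule)
	if err != nil {
		return err
	}
	// Hypothetical annotation key: not a real Kubernetes annotation.
	annotations["example.com/custom-pod-affinity"] = string(raw)
	return nil
}
```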
If some particular new syntax becomes popular, we would consider upstreaming it
by integrating it into the standard `Affinity`.

## Future work and non-work
One can imagine that in the anti-affinity RequiredDuringScheduling case one
might want to associate a number with the rule, for example "do not allow this
pod to share a rack with more than three other pods (in total, or from the same
service as the pod)." We could allow this to be specified by adding an integer
`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case.
However, this flexibility complicates the system and we do not intend to
implement it.
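
Purely to make this rejected alternative concrete, such a field might have
looked like the sketch below. The surrounding fields are simplified stand-ins
for the ones `PodAffinityTerm` already defines, and the sketch is shown for
illustration only, since we do not adopt it.

```go
package podaffinity

// LimitedPodAffinityTerm is a sketch of the rejected alternative only; it is
// not part of the proposal. LabelSelector, Namespaces, and TopologyKey stand
// in for the fields PodAffinityTerm already has; Limit is the hypothetical
// addition.
type LimitedPodAffinityTerm struct {
	LabelSelector map[string]string // simplified stand-in for the real selector type
	Namespaces    []string
	TopologyKey   string

	// Limit would only be meaningful for RequiredDuringScheduling anti-affinity:
	// "do not share a topology domain with more than Limit matching pods."
	Limit int
}
```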
It is likely that the specification and implementation of pod anti-affinity
can be unified with [taints and tolerations](taint-toleration-dedicated.md),
and likewise that the specification and implementation of pod affinity
can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod
labels would be "inherited" by the node, and pods would only be able to specify
affinity and anti-affinity for a node's labels. Our main motivation for not
unifying taints and tolerations with pod anti-affinity is that we foresee taints
and tolerations as being a concept that only cluster administrators need to
understand (and indeed in some setups taints and tolerations wouldn't even be
directly manipulated by a cluster administrator; instead, they would be set only
by an admission controller implementing the administrator's high-level policy
about different classes of special machines and the users who belong to the
groups allowed to access them). Moreover, the concept of nodes "inheriting"
labels from pods seems complicated; it seems conceptually simpler to separate
rules involving relatively static properties of nodes from rules involving which
other pods are running on the same node or larger topology domain.
Data/storage affinity is related to pod affinity, and is likely to draw on some
of the ideas we have used for pod affinity. Today, data/storage affinity is
expressed using node affinity, on the assumption that the pod knows which
node(s) store(s) the data it wants. But a more flexible approach would allow the
pod to name the data rather than the node.

## Related issues

The review for this proposal is in #18265.

The topic of affinity/anti-affinity has generated a lot of discussion. The main
issue is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341,
#1965, and #2906 all have additional discussion and use cases.

As the examples in this document have demonstrated, topological affinity is very
useful in clusters that are spread across availability zones, e.g. to co-locate
pods of a service in the same zone to avoid a wide-area network hop, or to
spread pods across zones for failure tolerance. #17059, #13056, #13063, and
#4235 are relevant.

Issue #15675 describes connection affinity, which is vaguely related.