mirror of
				https://github.com/k3s-io/kubernetes.git
				synced 2025-10-31 13:50:01 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			264 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
		
	
	
	
	
	
| <!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
 | |
| 
 | |
| <!-- BEGIN STRIP_FOR_RELEASE -->
 | |
| 
 | |
| <img src="http://kubernetes.io/img/warning.png" alt="WARNING"
 | |
|      width="25" height="25">
 | |
| <img src="http://kubernetes.io/img/warning.png" alt="WARNING"
 | |
|      width="25" height="25">
 | |
| <img src="http://kubernetes.io/img/warning.png" alt="WARNING"
 | |
|      width="25" height="25">
 | |
| <img src="http://kubernetes.io/img/warning.png" alt="WARNING"
 | |
|      width="25" height="25">
 | |
| <img src="http://kubernetes.io/img/warning.png" alt="WARNING"
 | |
|      width="25" height="25">
 | |
| 
 | |
| <h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
 | |
| 
 | |
| If you are using a released version of Kubernetes, you should
 | |
| refer to the docs that go with that version.
 | |
| 
 | |
| Documentation for other releases can be found at
 | |
| [releases.k8s.io](http://releases.k8s.io).
 | |
| </strong>
 | |
| --
 | |
| 
 | |
| <!-- END STRIP_FOR_RELEASE -->
 | |
| 
 | |
| <!-- END MUNGE: UNVERSIONED_WARNING -->
 | |
| 
 | |
| # Node affinity and NodeSelector
 | |
| 
 | |
| ## Introduction
 | |
| 
 | |
| This document proposes a new label selector representation, called `NodeSelector`,
 | |
| that is similar in many ways to `LabelSelector`, but is a bit more flexible and is
 | |
| intended to be used only for selecting nodes.
 | |
| 
 | |
| In addition, we propose to replace the `map[string]string` in `PodSpec` that the scheduler
 | |
| currently uses as part of restricting the set of nodes onto which a pod is
 | |
| eligible to schedule, with a field of type `Affinity` that contains one or
 | |
| more affinity specifications. In this document we discuss `NodeAffinity`, which
 | |
| contains one or more of the following:
 | |
| * a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be
 | |
| represented by a `NodeSelector`, and thus generalizes the scheduling behavior of
 | |
| the current `map[string]string` but still serves the purpose of restricting
 | |
| the set of nodes onto which the pod can schedule. In addition, unlike the behavior
 | |
| of the current `map[string]string`, when it becomes violated the system will
 | |
| try to eventually evict the pod from its node.
 | |
| * a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is identical
 | |
| to `RequiredDuringSchedulingRequiredDuringExecution` except that the system
 | |
| may or may not try to eventually evict the pod from its node.
 | |
| * a field called `PreferredDuringSchedulingIgnoredDuringExecution` that specifies which nodes are
 | |
| preferred for scheduling among those that meet all scheduling requirements.
 | |
| 
 | |
| (In practice, as discussed later, we will actually *add* the `Affinity` field
 | |
| rather than replacing `map[string]string`, due to backward compatibility requirements.)
 | |
| 
 | |
| The affinity specifications described above allow a pod to request various properties
 | |
| that are inherent to nodes, for example "run this pod on a node with an Intel CPU" or, in a
 | |
| multi-zone cluster, "run this pod on a node in zone Z."
 | |
| ([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
 | |
| some of the properties that a node might publish as labels, which affinity expressions
 | |
| can match against.)
 | |
| They do *not* allow a pod to request to schedule
 | |
| (or not schedule) on a node based on what other pods are running on the node. That
 | |
| feature is called "inter-pod topological affinity/anti-affinity" and is described
 | |
| [here](https://github.com/kubernetes/kubernetes/pull/18265).
 | |
| 
 | |
| ## API
 | |
| 
 | |
| ### NodeSelector
 | |
| 
 | |
| ```go
 | |
| // A node selector represents the union of the results of one or more label queries
 | |
| // over a set of nodes; that is, it represents the OR of the selectors represented
 | |
| // by the nodeSelectorTerms.
 | |
| type NodeSelector struct {
 | |
| 	// nodeSelectorTerms is a list of node selector terms. The terms are ORed.
 | |
| 	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
 | |
| }
 | |
| 
 | |
| // An empty node selector term matches all objects. A null node selector term
 | |
| // matches no objects.
 | |
| type NodeSelectorTerm struct {
 | |
| 	// matchExpressions is a list of node selector requirements. The requirements are ANDed.
 | |
| 	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
 | |
| }
 | |
| 
 | |
| // A node selector requirement is a selector that contains values, a key, and an operator
 | |
| // that relates the key and values.
 | |
| type NodeSelectorRequirement struct {
 | |
| 	// key is the label key that the selector applies to.
 | |
| 	Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
 | |
| 	// operator represents a key's relationship to a set of values.
 | |
| 	// Valid operators are In, NotIn, Exists, DoesNotExist, Gt, and Lt.
 | |
| 	Operator NodeSelectorOperator `json:"operator"`
 | |
| 	// values is an array of string values. If the operator is In or NotIn,
 | |
| 	// the values array must be non-empty. If the operator is Exists or DoesNotExist,
 | |
| 	// the values array must be empty. If the operator is Gt or Lt, the values
 | |
| 	// array must have a single element, which will be interpreted as an integer.
 | |
| 	// This array is replaced during a strategic merge patch.
 | |
| 	Values []string `json:"values,omitempty"`
 | |
| }
 | |
| 
 | |
| // A node selector operator is the set of operators that can be used in
 | |
| // a node selector requirement.
 | |
| type NodeSelectorOperator string
 | |
| 
 | |
| const (
 | |
| 	NodeSelectorOpIn           NodeSelectorOperator = "In"
 | |
| 	NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
 | |
| 	NodeSelectorOpExists       NodeSelectorOperator = "Exists"
 | |
| 	NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
 | |
| 	NodeSelectorOpGt           NodeSelectorOperator = "Gt"
 | |
| 	NodeSelectorOpLt           NodeSelectorOperator = "Lt"
 | |
| )
 | |
| ```
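
The matching semantics described in the comments above (terms are ORed, requirements within a term are ANDed) can be sketched as follows. This is a minimal illustration, not the scheduler's implementation. The helper names (`contains`, `requirementMatches`, `termMatches`, `selectorMatches`) and the `example.com/...` label keys are hypothetical. Two behaviors are assumptions that the proposal does not spell out: a node lacking the key satisfies `NotIn`, and a selector with no terms matches nothing.

```go
package main

import (
	"fmt"
	"strconv"
)

// The types mirror the proposal above, minus JSON tags.
type NodeSelectorOperator string

const (
	NodeSelectorOpIn           NodeSelectorOperator = "In"
	NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
	NodeSelectorOpExists       NodeSelectorOperator = "Exists"
	NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
	NodeSelectorOpGt           NodeSelectorOperator = "Gt"
	NodeSelectorOpLt           NodeSelectorOperator = "Lt"
)

type NodeSelectorRequirement struct {
	Key      string
	Operator NodeSelectorOperator
	Values   []string
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm
}

// contains reports whether v is one of values.
func contains(values []string, v string) bool {
	for _, x := range values {
		if x == v {
			return true
		}
	}
	return false
}

// requirementMatches evaluates one requirement against a node's labels.
func requirementMatches(r NodeSelectorRequirement, labels map[string]string) bool {
	val, ok := labels[r.Key]
	switch r.Operator {
	case NodeSelectorOpIn:
		return ok && contains(r.Values, val)
	case NodeSelectorOpNotIn:
		// Assumption: a node without the key satisfies NotIn.
		return !ok || !contains(r.Values, val)
	case NodeSelectorOpExists:
		return ok
	case NodeSelectorOpDoesNotExist:
		return !ok
	case NodeSelectorOpGt, NodeSelectorOpLt:
		// Gt and Lt take exactly one value, interpreted as an integer.
		if !ok || len(r.Values) != 1 {
			return false
		}
		have, err1 := strconv.Atoi(val)
		want, err2 := strconv.Atoi(r.Values[0])
		if err1 != nil || err2 != nil {
			return false
		}
		if r.Operator == NodeSelectorOpGt {
			return have > want
		}
		return have < want
	}
	return false // unknown operator
}

// termMatches ANDs the requirements; an empty term matches every node.
func termMatches(t NodeSelectorTerm, labels map[string]string) bool {
	for _, r := range t.MatchExpressions {
		if !requirementMatches(r, labels) {
			return false
		}
	}
	return true
}

// selectorMatches ORs the terms; with no terms nothing matches (assumption).
func selectorMatches(s NodeSelector, labels map[string]string) bool {
	for _, t := range s.NodeSelectorTerms {
		if termMatches(t, labels) {
			return true
		}
	}
	return false
}

func main() {
	sel := NodeSelector{NodeSelectorTerms: []NodeSelectorTerm{
		{MatchExpressions: []NodeSelectorRequirement{
			{Key: "example.com/cpu-vendor", Operator: NodeSelectorOpIn, Values: []string{"intel", "amd"}},
			{Key: "example.com/cores", Operator: NodeSelectorOpGt, Values: []string{"8"}},
		}},
		{MatchExpressions: []NodeSelectorRequirement{
			{Key: "example.com/dedicated", Operator: NodeSelectorOpExists},
		}},
	}}
	node := map[string]string{"example.com/cpu-vendor": "amd", "example.com/cores": "16"}
	fmt.Println(selectorMatches(sel, node)) // prints "true": the first term matches
}
```
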
 | |
| 
 | |
| ### NodeAffinity
 | |
| 
 | |
| We will add one field to `PodSpec`:
 | |
| 
 | |
| ```go
 | |
| Affinity *Affinity  `json:"affinity,omitempty"`
 | |
| ```
 | |
| 
 | |
| The `Affinity` type is defined as follows:
 | |
| 
 | |
| ```go
 | |
| type Affinity struct {
 | |
| 	NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
 | |
| }
 | |
| 
 | |
| type NodeAffinity struct {
 | |
| 	// If the affinity requirements specified by this field are not met at
 | |
| 	// scheduling time, the pod will not be scheduled onto the node.
 | |
| 	// If the affinity requirements specified by this field cease to be met
 | |
| 	// at some point during pod execution (e.g. due to a node label update),
 | |
| 	// the system will try to eventually evict the pod from its node.
 | |
| 	RequiredDuringSchedulingRequiredDuringExecution *NodeSelector  `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
 | |
| 	// If the affinity requirements specified by this field are not met at
 | |
| 	// scheduling time, the pod will not be scheduled onto the node.
 | |
| 	// If the affinity requirements specified by this field cease to be met
 | |
| 	// at some point during pod execution (e.g. due to a node label update),
 | |
| 	// the system may or may not try to eventually evict the pod from its node.
 | |
| 	RequiredDuringSchedulingIgnoredDuringExecution  *NodeSelector  `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
 | |
| 	// The scheduler will prefer to schedule pods to nodes that satisfy
 | |
| 	// the affinity expressions specified by this field, but it may choose
 | |
| 	// a node that violates one or more of the expressions. The node that is
 | |
| 	// most preferred is the one with the greatest sum of weights, i.e.
 | |
| 	// for each node that meets all of the scheduling requirements (resource
 | |
| 	// request, RequiredDuringScheduling affinity expressions, etc.),
 | |
| 	// compute a sum by iterating through the elements of this field and adding
 | |
| 	// "weight" to the sum if the node matches the corresponding MatchExpressions; the
 | |
| 	// node(s) with the highest sum are the most preferred.
 | |
| 	PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm  `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
 | |
| }
 | |
| 
 | |
| // An empty preferred scheduling term matches all objects with implicit weight 0
 | |
| // (i.e. it's a no-op). A null preferred scheduling term matches no objects.
 | |
| type PreferredSchedulingTerm struct {
 | |
| 	// weight is in the range 1-100
 | |
| 	Weight int  `json:"weight"`
 | |
| 	// matchExpressions is a list of node selector requirements. The requirements are ANDed.
 | |
| 	MatchExpressions []NodeSelectorRequirement  `json:"matchExpressions,omitempty"`
 | |
| }
 | |
| ```
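
The weighted-sum rule described in the comment on `PreferredDuringSchedulingIgnoredDuringExecution` can be sketched as below. This is an illustration under stated assumptions rather than the scheduler's priority function. The helpers `matchesAll`, `preferenceScore`, and `mostPreferred` are hypothetical names, the label keys are made up, and the matcher handles only the `In` and `Exists` operators to keep the sketch short. The input is assumed to contain only nodes that already meet all scheduling requirements.

```go
package main

import (
	"fmt"
	"sort"
)

type NodeSelectorRequirement struct {
	Key      string
	Operator string // only "In" and "Exists" are handled in this sketch
	Values   []string
}

type PreferredSchedulingTerm struct {
	Weight           int // 1-100 per the proposal
	MatchExpressions []NodeSelectorRequirement
}

// matchesAll ANDs the requirements. Simplification: any operator other
// than In or Exists is treated as non-matching.
func matchesAll(exprs []NodeSelectorRequirement, labels map[string]string) bool {
	for _, r := range exprs {
		val, ok := labels[r.Key]
		switch r.Operator {
		case "In":
			found := false
			for _, v := range r.Values {
				if v == val {
					found = true
				}
			}
			if !ok || !found {
				return false
			}
		case "Exists":
			if !ok {
				return false
			}
		default:
			return false
		}
	}
	return true
}

// preferenceScore adds up the weight of every term the node matches.
func preferenceScore(terms []PreferredSchedulingTerm, labels map[string]string) int {
	sum := 0
	for _, t := range terms {
		if matchesAll(t.MatchExpressions, labels) {
			sum += t.Weight
		}
	}
	return sum
}

// mostPreferred returns the names of the nodes with the highest sum, sorted.
// nodes maps a node name to its labels.
func mostPreferred(terms []PreferredSchedulingTerm, nodes map[string]map[string]string) []string {
	best := -1 // sums are never negative because weights are positive
	var names []string
	for name, labels := range nodes {
		s := preferenceScore(terms, labels)
		if s > best {
			best, names = s, []string{name}
		} else if s == best {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	terms := []PreferredSchedulingTerm{
		{Weight: 50, MatchExpressions: []NodeSelectorRequirement{
			{Key: "example.com/zone", Operator: "In", Values: []string{"a"}}}},
		{Weight: 10, MatchExpressions: []NodeSelectorRequirement{
			{Key: "example.com/ssd", Operator: "Exists"}}},
	}
	nodes := map[string]map[string]string{
		"n1": {"example.com/zone": "a"},
		"n2": {"example.com/zone": "b", "example.com/ssd": "true"},
		"n3": {"example.com/zone": "a", "example.com/ssd": "true"},
	}
	fmt.Println(mostPreferred(terms, nodes))
}
```

A node that matches none of the terms still receives a sum of 0 and remains schedulable, which reflects the "preferred, not required" nature of the field.
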
 | |
| 
 | |
| Unfortunately, the name of the existing `map[string]string` field in PodSpec is `NodeSelector`
 | |
| and we can't change it since this name is part of the API. Hopefully this won't
 | |
| cause too much confusion.
 | |
| 
 | |
| ## Examples
 | |
| 
 | |
| ** TODO: fill in this section **
 | |
| 
 | |
| * Run this pod on a node with an Intel or AMD CPU
 | |
| 
 | |
| * Run this pod on a node in availability zone Z
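
As a rough sketch of how the two requests above might be written with the proposed types, the program below builds an `Affinity` for each and prints it as JSON. The label keys `example.com/cpu-vendor` and `example.com/zone`, their values, and the helper `requireLabelIn` are hypothetical, since the proposal does not fix any label names. `NodeAffinity` is trimmed to a single field for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Types follow the proposal; NodeAffinity is reduced to one field here.
type NodeSelectorRequirement struct {
	Key      string   `json:"key"`
	Operator string   `json:"operator"`
	Values   []string `json:"values,omitempty"`
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
}

type NodeAffinity struct {
	RequiredDuringSchedulingIgnoredDuringExecution *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type Affinity struct {
	NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
}

// requireLabelIn (hypothetical helper) builds an Affinity that requires the
// node label `key` to have one of `values` at scheduling time.
func requireLabelIn(key string, values ...string) *Affinity {
	return &Affinity{NodeAffinity: &NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &NodeSelector{
			NodeSelectorTerms: []NodeSelectorTerm{{
				MatchExpressions: []NodeSelectorRequirement{
					{Key: key, Operator: "In", Values: values},
				},
			}},
		},
	}}
}

func main() {
	// "Run this pod on a node with an Intel or AMD CPU"
	cpu := requireLabelIn("example.com/cpu-vendor", "intel", "amd")
	// "Run this pod on a node in availability zone Z"
	zone := requireLabelIn("example.com/zone", "Z")

	for _, a := range []*Affinity{cpu, zone} {
		out, err := json.MarshalIndent(a, "", "  ")
		if err != nil {
			panic(err)
		}
		fmt.Println(string(out))
	}
}
```

The "Intel or AMD" case uses a single `In` requirement with two values; two ORed `NodeSelectorTerms`, one per vendor, would express the same constraint.
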
 | |
| 
 | |
| 
 | |
| ## Backward compatibility
 | |
| 
 | |
| When we add `Affinity` to PodSpec, we will deprecate, but not remove, the current field in PodSpec:
 | |
| 
 | |
| ```go
 | |
| NodeSelector map[string]string `json:"nodeSelector,omitempty"`
 | |
| ```
 | |
| 
 | |
| Old versions of the scheduler will ignore the `Affinity` field.
 | |
| New versions of the scheduler will apply their scheduling predicates to both `Affinity` and `nodeSelector`,
 | |
| i.e. the pod can only schedule onto nodes that satisfy both sets of requirements. We will not
 | |
| attempt to convert between `Affinity` and `nodeSelector`.
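
The "both sets of requirements" rule for new schedulers can be sketched as a conjunction of two checks. This is only an illustration: `legacyNodeSelectorMatches` and `podFitsNode` are hypothetical names, the label keys are made up, and the affinity check is passed in as a function so that the sketch stays independent of how `NodeSelector` matching is implemented. A `nil` function stands for a pod that sets no `Affinity`.

```go
package main

import "fmt"

// legacyNodeSelectorMatches implements the existing map[string]string
// semantics: every key/value pair must be present on the node.
func legacyNodeSelectorMatches(nodeSelector, labels map[string]string) bool {
	for k, want := range nodeSelector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// podFitsNode requires the node to satisfy both the legacy nodeSelector and
// the required node affinity. No conversion between the two is attempted.
func podFitsNode(labels, nodeSelector map[string]string, affinityMatches func(map[string]string) bool) bool {
	if !legacyNodeSelectorMatches(nodeSelector, labels) {
		return false
	}
	return affinityMatches == nil || affinityMatches(labels)
}

func main() {
	node := map[string]string{"example.com/disk": "ssd", "example.com/zone": "a"}
	legacy := map[string]string{"example.com/disk": "ssd"}
	inZoneB := func(l map[string]string) bool { return l["example.com/zone"] == "b" }

	fmt.Println(podFitsNode(node, legacy, nil))     // legacy selector only
	fmt.Println(podFitsNode(node, legacy, inZoneB)) // affinity also applied
}
```
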
 | |
| 
 | |
| Old versions of non-scheduling clients will not know how to do anything semantically meaningful
 | |
| with `Affinity`, but we don't expect that this will cause a problem.
 | |
| 
 | |
| See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
 | |
| for more discussion.
 | |
| 
 | |
| Users should not start using `NodeAffinity` until the full implementation has been in Kubelet and the master
 | |
| for enough binary versions that we feel comfortable that we will not need to roll back either Kubelet
 | |
| or master to a version that does not support it. Longer term, we will use a programmatic approach to
 | |
| enforcing this (#4855).
 | |
| 
 | |
| ## Implementation plan
 | |
| 
 | |
| 1. Add the `Affinity` field to PodSpec and the `NodeAffinity`, `PreferredDuringSchedulingIgnoredDuringExecution`,
 | |
| and `RequiredDuringSchedulingIgnoredDuringExecution` types to the API
 | |
| 2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution` into account
 | |
| 3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` into account
 | |
| 4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be marked as deprecated
 | |
| 5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API
 | |
| 6. Modify the scheduler predicate from step 2 to also take `RequiredDuringSchedulingRequiredDuringExecution` into account
 | |
| 7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission decision
 | |
| 8. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
 | |
| `RequiredDuringSchedulingRequiredDuringExecution`
 | |
| (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
 | |
| 
 | |
| We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling
 | |
| domains (e.g. node name, rack name, availability zone name, etc.). See #9044.
 | |
| 
 | |
| ## Extensibility
 | |
| 
 | |
| The design described here is the result of careful analysis of use cases, a decade of experience
 | |
| with Borg at Google, and a review of similar features in other open-source container orchestration
 | |
| systems. We believe that it properly balances the goal of expressiveness against the goals of
 | |
| simplicity and efficiency of implementation. However, we recognize that
 | |
| use cases may arise in the future that cannot be expressed using the syntax described here.
 | |
| Although we are not implementing an affinity-specific extensibility mechanism for a variety
 | |
| of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes
 | |
| users to get a consistent experience, etc.), the regular Kubernetes
 | |
| annotation mechanism can be used to add or replace affinity rules. This would work as follows:
 | |
| 
 | |
| 1. Define one or more annotations to describe the new affinity rule(s)
 | |
| 1. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior.
 | |
| If the new rule(s) *replace* one or more fields of `Affinity` then the user would omit those fields
 | |
| from `Affinity`; if they are *additional rules*, then the user would fill in `Affinity` as well as the
 | |
| annotation(s).
 | |
| 1. Scheduler takes the annotation(s) into account when scheduling.
 | |
| 
 | |
| If some particular new syntax becomes popular, we would consider upstreaming it by integrating
 | |
| it into the standard `Affinity`.
 | |
| 
 | |
| ## Future work
 | |
| 
 | |
| Are there any other fields we should convert from `map[string]string` to `NodeSelector`?
 | |
| 
 | |
| ## Related issues
 | |
| 
 | |
| The review for this proposal is in #18261.
 | |
| 
 | |
| The main related issue is #341. Issue #367 is also related. Those issues reference other
 | |
| related issues.
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
| 
 | |
 |