Merge pull request #18724 from derekwaynecarr/eviction_policy_spec

Automatic merge from submit-queue

Kubelet pod eviction proposal

The following is a proposal for how the `kubelet` may pro-actively fail a pod in response to local compute resources being starved. The proposal focuses on memory as a first candidate, and defines a `greedy` strategy for reclaiming starved resources on the node, since it seemed easiest to describe for operators versus other options and probably satisfies a broad set of use case environments. Putting this out now for community feedback, but we anticipate some more refinement around how we report eviction configuration back to users in the `Node API`. /cc @bgrant0607 @smarterclayton @vishh @dchen1107 @kubernetes/rh-cluster-infra @kubernetes/goog-node
This commit is contained in commit 981890160e.
docs/proposals/kubelet-eviction.md (new file, 276 lines)

@@ -0,0 +1,276 @@
# Kubelet - Eviction Policy

**Author**: Derek Carr (@derekwaynecarr)

**Status**: Proposed

This document presents a specification for how the `kubelet` evicts pods when compute resources are too low.
## Goals

The node needs a mechanism to preserve stability when available compute resources are low.

This is especially important when dealing with incompressible compute resources such
as memory or disk. If either resource is exhausted, the node becomes unstable.

The `kubelet` has some support for influencing system behavior in response to a system OOM by
assigning higher OOM score adjust values to containers that have consumed
the largest amount of memory relative to their request. System OOM events are very compute
intensive, and can stall the node until the OOM killing process has completed. In addition,
the system is prone to return to an unstable state, since the containers that are killed due to OOM
are either restarted or a new pod is scheduled onto the node.

Instead, we would prefer a system where the `kubelet` can pro-actively monitor for
and prevent total starvation of a compute resource, and in cases where starvation
appears imminent, pro-actively fail one or more pods so the workload can be
moved and scheduled elsewhere when/if its backing controller creates a new pod.
## Scope of proposal

This proposal defines a pod eviction policy for reclaiming compute resources.

In the first iteration, it focuses on memory; later iterations are expected to cover
other resources like disk. The proposal focuses on a simple default eviction strategy
intended to cover the broadest class of user workloads.
## Eviction Signals

The `kubelet` will support the ability to trigger eviction decisions on the following signals.

| Eviction Signal  | Description                                                                      |
|------------------|----------------------------------------------------------------------------------|
| memory.available | memory.available := node.status.capacity[memory] - node.stats.memory.workingSet |
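For illustration, a minimal sketch of how this signal could be derived from node stats; the types and field names below are hypothetical stand-ins, not the kubelet's actual implementation:

```go
// Illustrative sketch only: deriving memory.available from node capacity and
// the node-level working set reported by cAdvisor. Field names are assumptions.
package main

import "fmt"

// NodeMemoryStats holds the two inputs the formula above depends on, in bytes.
type NodeMemoryStats struct {
	CapacityBytes   uint64 // node.status.capacity[memory]
	WorkingSetBytes uint64 // node.stats.memory.workingSet
}

// memoryAvailable computes memory.available := capacity - workingSet.
func memoryAvailable(s NodeMemoryStats) uint64 {
	if s.WorkingSetBytes > s.CapacityBytes {
		return 0 // clamp: the node is fully starved
	}
	return s.CapacityBytes - s.WorkingSetBytes
}

func main() {
	stats := NodeMemoryStats{CapacityBytes: 8 << 30, WorkingSetBytes: 7 << 30}
	fmt.Printf("memory.available = %d bytes\n", memoryAvailable(stats))
}
```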
## Eviction Thresholds

The `kubelet` will support the ability to specify eviction thresholds.

An eviction threshold is of the following form:

`<eviction-signal><operator><quantity>`

* valid `eviction-signal` tokens are as defined above.
* valid `operator` tokens are `<`
* valid `quantity` tokens must match the quantity representation used by Kubernetes

If threshold criteria are met, the `kubelet` will take pro-active action to attempt
to reclaim the starved compute resource associated with the eviction signal.
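A minimal parsing sketch for this form, assuming only the `memory.available` signal and the `<` operator; the `Threshold` type and `parseThreshold` helper are hypothetical, and a real implementation would validate the quantity with Kubernetes' resource quantity parser:

```go
// Illustrative sketch of parsing the <eviction-signal><operator><quantity> form.
// The quantity is kept as a raw string here.
package main

import (
	"fmt"
	"strings"
)

// Threshold is a hypothetical representation of a parsed eviction threshold.
type Threshold struct {
	Signal   string // e.g. "memory.available"
	Operator string // only "<" is valid per this proposal
	Quantity string // e.g. "1Gi", "100Mi"
}

func parseThreshold(expr string) (Threshold, error) {
	idx := strings.Index(expr, "<")
	if idx <= 0 || idx == len(expr)-1 {
		return Threshold{}, fmt.Errorf("threshold %q is not of the form <signal><operator><quantity>", expr)
	}
	t := Threshold{Signal: expr[:idx], Operator: "<", Quantity: expr[idx+1:]}
	if t.Signal != "memory.available" {
		return Threshold{}, fmt.Errorf("unsupported eviction signal %q", t.Signal)
	}
	return t, nil
}

func main() {
	t, err := parseThreshold("memory.available<100Mi")
	fmt.Println(t, err)
}
```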
The `kubelet` will support soft and hard eviction thresholds.
### Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required
administrator-specified grace period. No action is taken by the `kubelet`
to reclaim resources associated with the eviction signal until that grace
period has been exceeded. If no grace period is provided, the `kubelet` will
error on startup.

In addition, if a soft eviction threshold has been met, an operator can
specify a maximum allowed pod termination grace period to use when evicting
pods from the node. If specified, the `kubelet` will use the lesser of
`pod.Spec.TerminationGracePeriodSeconds` and the maximum allowed grace period.
If not specified, the `kubelet` will kill pods immediately with no graceful
termination.

To configure soft eviction thresholds, the following flags will be supported:

```
--eviction-soft="": A set of eviction thresholds (e.g. memory.available<1.5Gi) that if met over a corresponding grace period would trigger a pod eviction.
--eviction-soft-grace-period="": A set of eviction grace periods (e.g. memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.
--eviction-max-pod-grace-period="0": Maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
```
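A minimal sketch of the grace-period selection described above; `effectiveGracePeriod` is a hypothetical helper, not kubelet code:

```go
// Illustrative sketch: the effective termination grace period is the lesser of
// the pod's TerminationGracePeriodSeconds and --eviction-max-pod-grace-period;
// a zero (unspecified) max means no graceful termination.
package main

import "fmt"

func effectiveGracePeriod(podGracePeriodSeconds, maxPodGracePeriodSeconds int64) int64 {
	if maxPodGracePeriodSeconds <= 0 {
		return 0 // no max specified: kill immediately
	}
	if podGracePeriodSeconds < maxPodGracePeriodSeconds {
		return podGracePeriodSeconds
	}
	return maxPodGracePeriodSeconds
}

func main() {
	fmt.Println(effectiveGracePeriod(60, 30)) // prints 30
}
```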
### Hard Eviction Thresholds

A hard eviction threshold has no grace period, and if observed, the `kubelet`
will take immediate action to reclaim the associated starved resource. If a
hard eviction threshold is met, the `kubelet` will kill the pod immediately
with no graceful termination.

To configure hard eviction thresholds, the following flag will be supported:

```
--eviction-hard="": A set of eviction thresholds (e.g. memory.available<1Gi) that if met would trigger a pod eviction.
```
## Eviction Monitoring Interval

The `kubelet` will initially evaluate eviction thresholds at the same
housekeeping interval as `cAdvisor` housekeeping.

In Kubernetes 1.2, this was defaulted to `10s`.

It is a goal to shrink the monitoring interval to a much shorter window.
This may require changes to `cAdvisor` to let alternate housekeeping intervals
be specified for selected data (https://github.com/google/cadvisor/issues/1247).

For the purposes of this proposal, we expect the monitoring interval to be no
more than `10s` to know when a threshold has been triggered, but we will strive
to reduce that latency, time permitting.
## Node Conditions

The `kubelet` will support a node condition that corresponds to each eviction signal.

If a hard eviction threshold has been met, or a soft eviction threshold has been met
independent of its associated grace period, the `kubelet` will report a condition that
reflects that the node is under pressure.

The following node conditions are defined that correspond to the specified eviction signal.

| Node Condition | Eviction Signal  | Description                                                        |
|----------------|------------------|--------------------------------------------------------------------|
| MemoryPressure | memory.available | Available memory on the node has satisfied an eviction threshold   |

The `kubelet` will continue to report node status updates at the frequency specified by
`--node-status-update-frequency`, which defaults to `10s`.
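For illustration, a sketch of the kind of condition the `kubelet` might report; the struct and reason string below are simplified stand-ins, not the actual node API types:

```go
// Illustrative sketch of a MemoryPressure-style node condition; this mirrors
// the shape of a node condition but is a hypothetical simplification.
package eviction

import "time"

type nodeCondition struct {
	Type               string
	Status             string // "True" or "False"
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// memoryPressureCondition builds the condition reported when memory.available
// has met an eviction threshold. The reason string is a hypothetical example.
func memoryPressureCondition(underPressure bool, now time.Time) nodeCondition {
	status := "False"
	if underPressure {
		status = "True"
	}
	return nodeCondition{
		Type:               "MemoryPressure",
		Status:             status,
		Reason:             "KubeletHasInsufficientMemory",
		Message:            "kubelet has observed memory.available meeting an eviction threshold",
		LastTransitionTime: now,
	}
}
```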
## Eviction scenario

Let's assume the operator started the `kubelet` with the following:

```
--eviction-hard="memory.available<100Mi"
--eviction-soft="memory.available<300Mi"
--eviction-soft-grace-period="memory.available=30s"
```

The `kubelet` will run a sync loop that looks at the available memory
on the node as reported from `cAdvisor` by calculating (capacity - workingSet).
If available memory is observed to drop below `100Mi`, the `kubelet` will immediately
initiate eviction. If available memory is observed as falling below `300Mi`,
it will record when that signal was first observed internally in a cache. If, at the next
sync, that criterion is no longer satisfied, the cache is cleared for that
signal. If the signal is observed as being satisfied for longer than the
specified grace period, the `kubelet` will initiate eviction to attempt to
reclaim the resource that has met its eviction threshold.
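The bookkeeping in this scenario could look roughly like the sketch below; the `thresholdObserver` type and its fields are hypothetical, not the kubelet's implementation:

```go
// Illustrative sketch of the soft/hard threshold bookkeeping described above.
package main

import (
	"fmt"
	"time"
)

type thresholdObserver struct {
	hardBytes       uint64        // e.g. 100Mi
	softBytes       uint64        // e.g. 300Mi
	softGracePeriod time.Duration // e.g. 30s
	softObservedAt  time.Time     // zero value means "not currently observed"
}

// shouldEvict reports whether eviction should be initiated given the
// currently available memory and the current time.
func (o *thresholdObserver) shouldEvict(availableBytes uint64, now time.Time) bool {
	if availableBytes < o.hardBytes {
		return true // hard threshold: act immediately
	}
	if availableBytes >= o.softBytes {
		o.softObservedAt = time.Time{} // signal no longer satisfied: clear the cache
		return false
	}
	if o.softObservedAt.IsZero() {
		o.softObservedAt = now // first observation of the soft threshold
		return false
	}
	return now.Sub(o.softObservedAt) > o.softGracePeriod
}

func main() {
	o := &thresholdObserver{hardBytes: 100 << 20, softBytes: 300 << 20, softGracePeriod: 30 * time.Second}
	now := time.Now()
	fmt.Println(o.shouldEvict(250<<20, now))                     // false: soft threshold first observed
	fmt.Println(o.shouldEvict(250<<20, now.Add(31*time.Second))) // true: grace period exceeded
}
```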
## Eviction of Pods

If an eviction threshold has been met, the `kubelet` will initiate the
process of evicting pods until it has observed the signal has gone below
its defined threshold.

The eviction sequence works as follows:

* for each monitoring interval, if eviction thresholds have been met
 * find candidate pod
 * fail the pod
 * block until pod is terminated on node

If a pod is not terminated because a container does not happen to die
(e.g. processes stuck in disk IO), the `kubelet` may select
an additional pod to fail instead. The `kubelet` will invoke the `KillPod`
operation exposed on the runtime interface. If an error is returned,
the `kubelet` will select a subsequent pod.
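A rough sketch of this sequence; the `podKiller` interface, `pod` type, and `evictUntilRecovered` helper are hypothetical, and only the `KillPod` operation name comes from this proposal:

```go
// Illustrative sketch of the eviction sequence described above.
package eviction

type pod struct{ name string }

type podKiller interface {
	// KillPod asks the runtime to terminate the pod, returning an error on failure.
	KillPod(p pod) error
}

// evictUntilRecovered kills ranked candidate pods one at a time until either
// the starved signal recovers or no candidates remain.
func evictUntilRecovered(candidates []pod, killer podKiller, signalRecovered func() bool) {
	for _, candidate := range candidates {
		if signalRecovered() {
			return // signal is back below its threshold; stop evicting
		}
		if err := killer.KillPod(candidate); err != nil {
			continue // per the proposal, select a subsequent pod on error
		}
		// In the real flow the kubelet blocks here until the pod is observed
		// terminated on the node before re-evaluating the signal.
	}
}
```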
## Eviction Strategy

The `kubelet` will implement a default eviction strategy oriented around
the pod quality of service class.

It will target pods that are the largest consumers of the starved compute
resource relative to their scheduling request. It ranks pods within a
quality of service tier in the following order, as sketched below.

* `BestEffort` pods that consume the most of the starved resource are failed
first.
* `Burstable` pods that consume the greatest amount of the starved resource
relative to their request for that resource are killed first. If no pod
has exceeded its request, the strategy targets the largest consumer of the
starved resource.
* `Guaranteed` pods that consume the greatest amount of the starved resource
relative to their request are killed first. If no pod has exceeded its request,
the strategy targets the largest consumer of the starved resource.
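A sketch of this ranking under simplified assumptions; the QoS classification, `candidate` type, and `rankForEviction` helper are hypothetical stand-ins for the kubelet's internal accounting:

```go
// Illustrative sketch of the ranking described above: lower QoS tiers are
// evicted first; within a tier, pods over their request go first (by the
// greatest amount over request), otherwise the largest consumer goes first.
package eviction

import "sort"

type qosClass int

const (
	bestEffort qosClass = iota // evicted first
	burstable
	guaranteed // evicted last
)

type candidate struct {
	name    string
	qos     qosClass
	usage   uint64 // current consumption of the starved resource (bytes)
	request uint64 // scheduling request for the starved resource (bytes)
}

// usageOverRequest is consumption beyond the request; BestEffort pods have no
// request, so their entire usage counts.
func usageOverRequest(c candidate) int64 {
	return int64(c.usage) - int64(c.request)
}

// rankForEviction orders pods so the best eviction candidate comes first.
func rankForEviction(pods []candidate) {
	sort.SliceStable(pods, func(i, j int) bool {
		a, b := pods[i], pods[j]
		if a.qos != b.qos {
			return a.qos < b.qos
		}
		overA, overB := usageOverRequest(a), usageOverRequest(b)
		exceedsA, exceedsB := overA > 0, overB > 0
		if exceedsA != exceedsB {
			return exceedsA // pods over their request rank ahead of those under it
		}
		if exceedsA && overA != overB {
			return overA > overB // greatest amount over request first
		}
		return a.usage > b.usage // otherwise the largest consumer first
	})
}
```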
A guaranteed pod is guaranteed to never be evicted because of another pod's
resource consumption. That said, guarantees are only as good as the underlying
foundation they are built upon. If a system daemon
(e.g. `kubelet`, `docker`, `journald`) is consuming more resources than
were reserved via `system-reserved` or `kube-reserved` allocations, and the node
only has guaranteed pod(s) remaining, then the node must choose to evict a
guaranteed pod in order to preserve node stability, and to limit the impact
of the unexpected consumption on other guaranteed pod(s).
## Kubelet Admission Control

### Feasibility checks during kubelet admission

The `kubelet` will reject `BestEffort` pods if any of its configured
eviction thresholds has been exceeded, independent of the configured
grace period.

Let's assume the operator started the `kubelet` with the following:

```
--eviction-soft="memory.available<256Mi"
--eviction-soft-grace-period="memory.available=30s"
```

If the `kubelet` sees that it has less than `256Mi` of memory available
on the node, but the `kubelet` has not yet initiated eviction since the
grace period criterion has not yet been met, the `kubelet` will still immediately
fail any incoming best effort pods.

The reasoning for this decision is the expectation that the incoming pod is
likely to further starve the particular compute resource, and the `kubelet` should
return to a steady state before accepting new workloads.
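A minimal sketch of this admission check; `admitPod` and its inputs are hypothetical, and the classification of the incoming pod and the threshold observation are assumed to come from elsewhere in the `kubelet`:

```go
// Illustrative sketch of the admission check described above: a BestEffort pod
// is rejected whenever any eviction threshold is currently exceeded, regardless
// of whether its grace period has elapsed.
package eviction

type admissionResult struct {
	Admit   bool
	Message string
}

// admitPod is a hypothetical helper: isBestEffort classifies the incoming pod,
// and anyThresholdExceeded reports whether any configured eviction threshold is
// currently observed, ignoring grace periods.
func admitPod(isBestEffort bool, anyThresholdExceeded bool) admissionResult {
	if isBestEffort && anyThresholdExceeded {
		return admissionResult{Admit: false, Message: "node is under resource pressure; rejecting BestEffort pod"}
	}
	return admissionResult{Admit: true}
}
```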
## Scheduler

The node will report a condition when a compute resource is under pressure. The
scheduler should view that condition as a signal to dissuade placing additional
best effort pods on the node. In this case, if the `MemoryPressure` condition is
true, the scheduler should avoid placing new best effort pods on the node, since
they would be rejected by the `kubelet` at admission.
## Best Practices

### DaemonSet

It is never desired for a `kubelet` to evict a pod that was derived from
a `DaemonSet`, since the pod will immediately be recreated and rescheduled
back to the same node.

At the moment, the `kubelet` has no ability to distinguish a pod created
from a `DaemonSet` versus any other object. If/when that information is
available, the `kubelet` could pro-actively filter those pods from the
candidate set of pods provided to the eviction strategy.

In general, it is strongly recommended that a `DaemonSet` not
create `BestEffort` pods, to avoid being identified as a candidate pod
for eviction.