Updating QoS policy to be per-pod instead of per-resource.

Signed-off-by: Vishnu kannan <vishnuk@google.com>
Vishnu kannan 2015-10-01 11:57:17 -07:00
parent 9625926852
commit f48c83600c
6 changed files with 338 additions and 230 deletions

docs/design/resource-qos.md (new file, 246 lines)

@@ -0,0 +1,246 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
<!-- BEGIN STRIP_FOR_RELEASE -->
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.
<strong>
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--
<!-- END STRIP_FOR_RELEASE -->
<!-- END MUNGE: UNVERSIONED_WARNING -->
# Resource Quality of Service in Kubernetes
**Author(s)**: Vishnu Kannan (vishh@), Ananya Kumar (@AnanyaKumar)
**Last Updated**: 5/17/2016
**Status**: Implemented
*This document presents the design of resource quality of service for containers in Kubernetes, and describes use cases and implementation details.*
## Introduction
This document describes the way Kubernetes provides different levels of Quality of Service to pods depending on what they *request*.
Pods that need to stay up reliably can request guaranteed resources, while pods with less stringent requirements can use resources with weaker or no guarantee.
Specifically, for each resource, containers specify a request, which is the amount of that resource that the system will guarantee to the container, and a limit, which is the maximum amount that the system will allow the container to use.
The system computes pod-level requests and limits by summing up per-resource requests and limits across all containers.
When request == limit, the resources are guaranteed, and when request < limit, the pod is guaranteed the request but can opportunistically scavenge the difference between request and limit if they are not being used by other containers.
This allows Kubernetes to oversubscribe nodes, which increases utilization, while at the same time maintaining resource guarantees for the containers that need guarantees.
Borg increased utilization by about 20% when it started allowing use of such non-guaranteed resources, and we hope to see similar improvements in Kubernetes.
## Requests and Limits
For each resource, containers can specify a resource request and limit, where `0 <= request <=` [Node Allocatable](../proposals/node-allocatable.md) and `request <= limit <= Infinity`.
If a pod is successfully scheduled, each of its containers is guaranteed the amount of resources it requested.
Scheduling is based on `requests` and not `limits`.
The pod and its containers will not be allowed to exceed the specified limits.
How the request and limit are enforced depends on whether the resource is [compressible or incompressible](resources.md).
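To make the constraints above concrete, here is a minimal validation sketch in Go. The helper name and the use of plain integer quantities are assumptions for illustration only; this is not the actual API validation code.

```go
package main

import "fmt"

// validateResource checks the constraints described above for one resource:
// 0 <= request <= allocatable, and request <= limit when a limit is set.
// A limit of 0 is treated as "unspecified" (no upper bound).
func validateResource(request, limit, allocatable int64) error {
	if request < 0 || request > allocatable {
		return fmt.Errorf("request %d outside [0, allocatable=%d]", request, allocatable)
	}
	if limit != 0 && limit < request {
		return fmt.Errorf("limit %d is lower than request %d", limit, request)
	}
	return nil
}

func main() {
	// 500m CPU requested, 1 CPU limit, on a node with 4 allocatable CPUs (milli-units).
	fmt.Println(validateResource(500, 1000, 4000)) // <nil>
}
```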
### Compressible Resource Guarantees
- For now, we are only supporting CPU.
- Pods are guaranteed to get the amount of CPU they request; they may or may not get additional CPU time (depending on the other jobs running). This isn't fully guaranteed today because CPU isolation is at the container level. Pod-level cgroups will be introduced soon to achieve this goal.
- Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests 600 milli CPUs and container B requests 300 milli CPUs, and that both containers are trying to use as much CPU as they can. Then the spare CPU will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections; a shares-based sketch follows this list).
- Pods will be throttled if they exceed their limit. If limit is unspecified, then the pods can use excess CPU when available.
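To illustrate how the 2:1 split in the example above can fall out of proportional CPU sharing, here is a sketch of a milli-CPU to cgroup `cpu.shares` mapping. The constants and helper name are assumptions for illustration; they are not necessarily the kubelet's exact implementation.

```go
package main

import "fmt"

const (
	sharesPerCPU  = 1024 // cpu.shares granted per full CPU
	milliCPUToCPU = 1000
	minShares     = 2 // kernel-enforced minimum
)

// milliCPUToShares converts a CPU request in milli-CPUs into cpu.shares.
// Containers competing for spare CPU receive it in proportion to their shares.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	// Container A requests 600m and container B requests 300m:
	// 614 vs 307 shares, so spare CPU is split roughly 2:1.
	fmt.Println(milliCPUToShares(600), milliCPUToShares(300))
}
```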
### Incompressible Resource Guarantees
- For now, we are only supporting memory.
- Pods will get the amount of memory they request; if they exceed their memory request, they could be killed (if some other pod needs memory), but if pods consume less memory than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
- When pods use more memory than their limit, the process using the most memory inside one of the pod's containers will be killed by the kernel.
### Admission/Scheduling Policy
- Pods will be admitted by the Kubelet and scheduled by the scheduler based on the sum of requests of their containers. The scheduler and kubelet will ensure that the sum of requests of all containers is within the node's [allocatable](../proposals/node-allocatable.md) capacity (for both memory and CPU); a sketch of this fit check follows this list.
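A minimal sketch of that fit check, assuming plain integer quantities rather than the API's `resource.Quantity` type:

```go
package main

import "fmt"

// containerRequests holds a container's requests in milli-CPUs and bytes.
type containerRequests struct {
	cpuMilli int64
	memBytes int64
}

// podFits sums the requests of all containers in a pod and checks the sums
// against the node's remaining allocatable capacity.
func podFits(containers []containerRequests, freeCPUMilli, freeMemBytes int64) bool {
	var cpu, mem int64
	for _, c := range containers {
		cpu += c.cpuMilli
		mem += c.memBytes
	}
	return cpu <= freeCPUMilli && mem <= freeMemBytes
}

func main() {
	pod := []containerRequests{{100, 256 << 20}, {200, 512 << 20}}
	fmt.Println(podFits(pod, 1000, 1<<30)) // true: 300m CPU and 768Mi fit
}
```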
## QoS Classes
In an overcommitted system (where sum of limits > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority.
The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theoretically, the policy of classifying pods into QoS classes is orthogonal to the requests and limits specified for the container. Hypothetically, users could use an (currently unplanned) API to specify whether a pod is guaranteed or best-effort. However, in the current design, the policy of classifying pods into QoS classes is intimately tied to "Requests and Limits" - in fact, QoS classes are used to implement some of the memory guarantees described in the previous section.
Pods can be of one of 3 different classes (a Go classification sketch follows the examples below):
- If `limits` and optionally `requests` (not equal to `0`) are set for all resources across all containers and they are *equal*, then the pod is classified as **Guaranteed**.
Examples:
```yaml
containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
```
```yaml
containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
      requests:
        cpu: 100m
        memory: 100Mi
```
- If `requests` and optionally `limits` are set (not equal to `0`) for one or more resources across one or more containers, and they are *not equal*, then the pod is classified as **Burstable**.
When `limits` are not specified, they default to the node capacity.
Examples:
Container `bar` has no resources specified.
```yaml
containers:
  - name: foo
    resources:
      limits:
        cpu: 10m
        memory: 1Gi
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
```
Containers `foo` and `bar` have limits set for different resources.
```yaml
containers:
  - name: foo
    resources:
      limits:
        memory: 1Gi
  - name: bar
    resources:
      limits:
        cpu: 100m
```
Container `foo` has no limits set, and `bar` has neither requests nor limits specified.
```yaml
containers:
  - name: foo
    resources:
      requests:
        cpu: 10m
        memory: 1Gi
  - name: bar
```
- If `requests` and `limits` are not set for any of the resources, across all containers, then the pod is classified as **Best-Effort**.
Examples:
```yaml
containers:
  - name: foo
    resources: {}
  - name: bar
    resources: {}
```
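As referenced above, the per-pod classification can be summarized in a small sketch. The simplified types and helper name are illustrative assumptions, not the kubelet's actual implementation.

```go
package main

import "fmt"

type quantities map[string]int64 // resource name -> amount; 0 or absent means unset

type container struct {
	requests quantities
	limits   quantities
}

// podQOSClass derives the pod-level class from its containers: Guaranteed if every
// container sets a limit for every resource and any requests equal those limits;
// Best-Effort if nothing is set anywhere; Burstable otherwise.
func podQOSClass(containers []container, resourceNames []string) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		for _, r := range resourceNames {
			req, lim := c.requests[r], c.limits[r]
			if req != 0 || lim != 0 {
				anySet = true
			}
			if lim == 0 || (req != 0 && req != lim) {
				guaranteed = false
			}
		}
	}
	switch {
	case !anySet:
		return "Best-Effort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	// Mirrors the first Guaranteed example above: both containers set only limits.
	pod := []container{
		{limits: quantities{"cpu": 10, "memory": 1 << 30}},
		{limits: quantities{"cpu": 100, "memory": 100 << 20}},
	}
	fmt.Println(podQOSClass(pod, []string{"cpu", "memory"})) // Guaranteed
}
```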
Pods will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU); they will be temporarily throttled instead.
Memory is an incompressible resource, so let's briefly discuss the semantics of memory management.
- *Best-Effort* pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory.
These containers can use any amount of free memory in the node though.
- *Guaranteed* pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
- *Burstable* pods have some form of minimal resource guarantee, but can use more resources when available.
Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no *Best-Effort* pods exist.
### OOM Score configuration on the node
Pod OOM score configuration
- Note that the OOM score of a process is 10 times the % of memory the process consumes, adjusted by OOM_SCORE_ADJ, barring exceptions (e.g. process is launched by root). Processes with higher OOM scores are killed.
- The base OOM score is between 0 and 1000, so if process A's OOM_SCORE_ADJ - process B's OOM_SCORE_ADJ is over 1000, then process A will always be OOM killed before B.
- The final OOM score of a process is also between 0 and 1000.
*Best-effort*
- Set OOM_SCORE_ADJ: 1000
- So processes in best-effort containers will have an OOM_SCORE of 1000
*Guaranteed*
- Set OOM_SCORE_ADJ: -998
- So processes in guaranteed containers will have an OOM_SCORE of 0 or 1
*Burstable*
- If total memory request > 99.8% of available memory, OOM_SCORE_ADJ: 2
- Otherwise, set OOM_SCORE_ADJ to 1000 - 10 * (% of memory requested); see the sketch at the end of this section
- This ensures that the OOM_SCORE of a burstable pod is > 1
- If memory request is `0`, OOM_SCORE_ADJ is set to `999`.
- So burstable pods will be killed if they conflict with guaranteed pods
- If a burstable pod uses less memory than requested, its OOM_SCORE < 1000
- So best-effort pods will be killed if they conflict with burstable pods using less than requested memory
- If a process in a burstable pod's container uses more memory than the container requested, its OOM_SCORE will be 1000; otherwise, its OOM_SCORE will be < 1000
- Assuming that a container typically has a single big process, if a burstable pod's container that uses more memory than requested conflicts with another burstable pod's container using less memory than requested, the former will be killed
- If burstable pods' containers with multiple processes conflict, then the formula for OOM scores is only a heuristic; it does not ensure "Request and Limit" guarantees.
*Pod infra containers* or *Special Pod init process*
- OOM_SCORE_ADJ: -998
*Kubelet, Docker*
- OOM_SCORE_ADJ: -999 (won't be OOM killed)
- Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user pods into a separate cgroup and set a limit on the memory they can consume.
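Putting the numbers above together, a sketch of the OOM_SCORE_ADJ assignment might look as follows. It mirrors the policy described in this section (the code in this commit differs slightly, e.g. it returns -999 for guaranteed containers).

```go
package main

import "fmt"

// oomScoreAdj returns the OOM_SCORE_ADJ for a container: 1000 for best-effort,
// -998 for guaranteed, and 1000 - 10 * (% of node memory requested) for burstable,
// clamped so the result stays above guaranteed (>= 2) and below best-effort (<= 999).
func oomScoreAdj(class string, memoryRequestBytes, memoryCapacityBytes int64) int {
	switch class {
	case "Best-Effort":
		return 1000
	case "Guaranteed":
		return -998
	default: // Burstable
		if memoryRequestBytes == 0 {
			return 999
		}
		adj := 1000 - int((1000*memoryRequestBytes)/memoryCapacityBytes)
		if adj < 2 {
			return 2
		}
		if adj > 999 {
			return 999
		}
		return adj
	}
}

func main() {
	// A burstable container requesting 10% of node memory gets an adjust of 900.
	fmt.Println(oomScoreAdj("Burstable", 100, 1000))
}
```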
## Known issues and possible improvements
The above implementation provides for basic oversubscription with protection, but there are a few known limitations.
#### Support for Swap
- The current QoS policy assumes that swap is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn't enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior.
## Alternative QoS Class Policy
An alternative is to have user-specified numerical priorities that guide Kubelet on which tasks to kill (if the node runs out of memory, lower priority tasks will be killed).
A strict hierarchy of user-specified numerical priorities is not desirable because:
1. Achieved behavior would be emergent based on how users assigned priorities to their pods. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively.
2. Changes to desired priority bands would require changes to all user pod configurations.
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/resource-qos.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@@ -42,7 +42,7 @@ The purpose of filtering the nodes is to filter out the nodes that do not meet c
- `NoDiskConflict`: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted.
- `NoVolumeZoneConflict`: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.
- `PodFitsResources`: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check [QoS proposal](../proposals/resource-qos.md).
- `PodFitsResources`: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check [QoS proposal](../design/resource-qos.md).
- `PodFitsHostPorts`: Check if any HostPort required by the Pod is already occupied on the node.
- `HostName`: Filter out all nodes except the one specified in the PodSpec's NodeName field.
- `MatchNodeSelector`: Check if the labels of the node match the labels specified in the Pod's `nodeSelector` field and, as of Kubernetes v1.2, also match the `scheduler.alpha.kubernetes.io/affinity` pod annotation if present. See [here](../user-guide/node-selection/) for more details on both.


@@ -39,11 +39,10 @@ and set them before the container is run. This document describes design of the
## Motivation
Since we want to make Kubernetes as simple as possible for its users we don't want to require setting
[Resources](resource-qos.md#resource-specifications)
for container by its owner. On the other hand having Resources filled is critical for scheduling decisions.
Current solution to set up Resources to hardcoded value has obvious drawbacks. We need to implement a component
which will set initial Resources to a reasonable value.
Since we want to make Kubernetes as simple as possible for its users, we don't want to require setting [Resources](../design/resource-qos.md) for a container by its owner.
On the other hand, having Resources filled in is critical for scheduling decisions.
The current solution of setting Resources to a hardcoded value has obvious drawbacks.
We need to implement a component which will set the initial Resources to a reasonable value.
## Design
@@ -51,11 +50,9 @@ InitialResources component will be implemented as an [admission plugin](../../pl
[LimitRanger](https://github.com/kubernetes/kubernetes/blob/7c9bbef96ed7f2a192a1318aa312919b861aee00/cluster/gce/config-default.sh#L91).
For every container without Resources specified it will try to predict the amount of resources that should be sufficient for it.
So that a pod without specified resources will be treated as [Burstable](resource-qos.md#qos-classes).
InitialResources will set only [request](resource-qos.md#resource-specifications)
(independently for each resource type: cpu, memory)
field in the first version to avoid killing containers due to OOM (however the container still may be killed if exceeds requested resources).
InitialResources will set only the [request](../design/resource-qos.md#requests-and-limits) field (independently for each resource type: cpu, memory) in the first version, to avoid killing containers due to OOM (however, the container may still be killed if it exceeds the requested resources).
To make the component work with LimitRanger, the estimated value will be capped by the min and max values if they are defined.
This prevents the pod from being rejected due to a too low or too high estimate; a tiny sketch of this capping step follows.
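A tiny sketch of that capping step, assuming a zero bound means "not defined" (hypothetical helper, not the plugin's actual code):

```go
package main

import "fmt"

// capEstimate clamps an estimated request to the [minVal, maxVal] range defined by a
// LimitRange, so the admission plugin never emits a value the range would reject.
func capEstimate(estimate, minVal, maxVal int64) int64 {
	if minVal != 0 && estimate < minVal {
		return minVal
	}
	if maxVal != 0 && estimate > maxVal {
		return maxVal
	}
	return estimate
}

func main() {
	fmt.Println(capEstimate(50, 100, 2000)) // 100: raised to the LimitRange minimum
}
```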


@@ -1,142 +0,0 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
<!-- BEGIN STRIP_FOR_RELEASE -->
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
width="25" height="25">
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.
<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.2/docs/proposals/resource-qos.md).
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--
<!-- END STRIP_FOR_RELEASE -->
<!-- END MUNGE: UNVERSIONED_WARNING -->
# Resource Quality of Service in Kubernetes
**Author**: Ananya Kumar (@AnanyaKumar) Vishnu Kannan (@vishh)
**Status**: Design & Implementation in progress.
*This document presents the design of resource quality of service for containers in Kubernetes, and describes use cases and implementation details.*
**Quality of Service is still under development. Look [here](resource-qos.md#under-development) for more details**
## Motivation
Kubernetes allocates resources to containers in a simple way. Users can specify resource limits for containers. For example, a user can specify a 1gb memory limit for a container. The scheduler uses resource limits to schedule containers (technically, the scheduler schedules pods comprised of containers). For example, the scheduler will not place 5 containers with a 1gb memory limit onto a machine with 4gb memory. Currently, Kubernetes does not have robust mechanisms to ensure that containers run reliably on an overcommitted system.
In the current implementation, **if users specify limits for every container, cluster utilization is poor**. Containers often don't use all the resources that they request, which leads to a lot of wasted resources. For example, we might have 4 containers, each reserving 1GB of memory in a node with 4GB memory but only using 500MB of memory. Theoretically, we could fit more containers on the node, but Kubernetes will not schedule new pods (with specified limits) on the node.
A possible solution is to launch containers without specified limits - containers that don't ask for any resource guarantees. But **containers with limits specified are not very well protected from containers without limits specified**. If a container without a specified memory limit goes overboard and uses lots of memory, other containers (with specified memory limits) might be killed. This is bad, because users often want a way to launch containers that have resources guarantees, and that stay up reliably.
This proposal provides mechanisms for oversubscribing nodes while maintaining resource guarantees, by allowing containers to specify levels of resource guarantees. Containers will be able to *request* for a minimum resource guarantee. The *request* is different from the *limit* - containers will not be allowed to exceed resource limits. With this change, users can launch *best-effort* containers with 0 request. Best-effort containers use resources only if not being used by other containers, and can be used for resource-scavenging. Supporting best-effort containers in Borg increased utilization by about 20%, and we hope to see similar improvements in Kubernetes.
## Requests and Limits
Note: this section describes the functionality that QoS should eventually provide. Due to implementation issues, providing some of these guarantees, while maintaining our broader goals of efficient cluster utilization, is difficult. Later sections will go into the nuances of how the functionality will be achieved, and limitations of the initial implementation.
For each resource, containers can specify a resource request and limit, 0 <= request <= limit <= Infinity. If the container is successfully scheduled, the container is guaranteed the amount of resource requested. The container will not be allowed to exceed the specified limit. How the request and limit are enforced depends on whether the resource is [compressible or incompressible](../../docs/design/resources.md).
### Compressible Resource Guarantees
- For now, we are only supporting CPU.
- Minimum CPU limit is 10 milli cores (`10m`). This is a limitation of the Linux kernel.
- Containers are guaranteed to get the amount of CPU they request, they may or may not get additional CPU time (depending on the other jobs running).
- Excess CPU resources will be distributed based on the amount of CPU requested. For example, suppose container A requests for 60% of the CPU, and container B requests for 30% of the CPU. Suppose that both containers are trying to use as much CPU as they can. Then the extra 10% of CPU will be distributed to A and B in a 2:1 ratio (implementation discussed in later sections).
- Containers will be throttled if they exceed their limit. If limit is unspecified, then the containers can use excess CPU when available.
### Incompressible Resource Guarantees
- For now, we are only supporting memory.
- Containers will get the amount of memory they request, if they exceed their memory request, they could be killed (if some other container needs memory), but if containers consume fewer resources than requested, they will not be killed (except in cases where system tasks or daemons need more memory).
- Containers will be killed if they use more memory than their limit.
### Kubelet Admission Policy
- Pods will be admitted by Kubelet based on the sum of requests of its containers. The Kubelet will ensure that sum of requests of all containers (over all pods) is within the system's resources (for both memory and CPU).
## QoS Classes
In an overcommitted system (where sum of requests > machine capacity) containers might eventually have to be killed, for example if the system runs out of CPU or memory resources. Ideally, we should kill containers that are less important. For each resource, we divide containers into 3 QoS classes: *Guaranteed*, *Burstable*, and *Best-Effort*, in decreasing order of priority.
The relationship between "Requests and Limits" and "QoS Classes" is subtle. Theoretically, the policy of classifying containers into QoS classes is orthogonal to the requests and limits specified for the container. Hypothetically, users could use an (currently unplanned) API to specify whether a container is guaranteed or best-effort. However, in this proposal, the policy of classifying containers into QoS classes is intimately tied to "Requests and Limits" - in fact, QoS classes are used to implement some of the memory guarantees described in the previous section.
For each resource, containers will be split into 3 different classes
- For now, we will only focus on memory. Containers will not be killed if CPU guarantees cannot be met (for example if system tasks or daemons take up lots of CPU), they will be temporarily throttled.
- Containers with a 0 memory request are classified as memory *Best-Effort*. These containers are not requesting resource guarantees, and will be treated as lowest priority (processes in these containers are the first to get killed if the system runs out of memory).
- Containers with the same request and limit and non-zero request are classified as memory *Guaranteed*. These containers ask for a well-defined amount of the resource and are considered top-priority (with respect to memory usage).
- All other containers are memory *Burstable* - middle priority containers that have some form of minimal resource guarantee, but can use more resources when available.
- In the current policy and implementation, best-effort containers are technically a subset of Burstable containers (where the request is 0), but they are a very important special case. Memory best-effort containers don't ask for any resource guarantees so they can utilize unused resources in a cluster (resource scavenging).
### Alternative QoS Class Policy
An alternative is to have user-specified numerical priorities that guide Kubelet on which tasks to kill (if the node runs out of memory, lower priority tasks will be killed). A strict hierarchy of user-specified numerical priorities is not desirable because:
1. Achieved behavior would be emergent based on how users assigned priorities to their containers. No particular SLO could be delivered by the system, and usage would be subject to gaming if not restricted administratively
2. Changes to desired priority bands would require changes to all user container configurations.
## Under Development
This feature is still under development.
Following are some of the primary issues.
* Our current design supports QoS per-resource.
Given that unified hierarchy is in the horizon, a per-resource QoS cannot be supported.
[#14943](https://github.com/kubernetes/kubernetes/pull/14943) has more information.
* Scheduler does not take usage into account.
The scheduler can pile up BestEffort tasks on a node and cause resource pressure.
[#14081](https://github.com/kubernetes/kubernetes/issues/14081) needs to be resolved for the scheduler to start utilizing node's usage.
The semantics of this feature can change in subsequent releases.
## Implementation Issues and Extensions
The above implementation provides for basic oversubscription with protection, but there are a number of issues. Below is a list of issues and TODOs for each of them. The first iteration of QoS will not solve these problems, but we aim to solve them in subsequent iterations of QoS. This list is not exhaustive. We expect to add issues to the list, and reference issues and PRs associated with items on this list.
Supporting other platforms:
- **RKT**: The proposal focuses on Docker. TODO: add support for RKT.
- **Systemd**: Systemd platforms need to be handled in a different way. Handling distributions of Linux based on systemd is critical, because major Linux distributions like Debian and Ubuntu are moving to systemd. TODO: Add code to handle systemd based operating systems.
Protecting containers and guarantees:
- **Control loops**: The OOM score assignment is not perfect for burstable containers, and system OOM kills are expensive. TODO: Add a control loop to reduce memory pressure, while ensuring guarantees for various containers.
- **Kubelet, Kube-proxy, Docker daemon protection**: If a system is overcommitted with memory guaranteed containers, then all processes will have an OOM_SCORE of 0. So Docker daemon could be killed instead of a container or pod being killed. TODO: Place all user-pods into a separate cgroup, and set a limit on the memory they can consume. Initially, the limits can be based on estimated memory usage of Kubelet, Kube-proxy, and CPU limits, eventually we can monitor the resources they consume.
- **OOM Assignment Races**: We cannot set OOM_SCORE_ADJ of a process until it has launched. This could lead to races. For example, suppose that a memory burstable container is using 70% of the systems memory, and another burstable container is using 30% of the systems memory. A best-effort burstable container attempts to launch on the Kubelet. Initially the best-effort container is using 2% of memory, and has an OOM_SCORE_ADJ of 20. So its OOM_SCORE is lower than the burstable pod using 70% of system memory. The burstable pod will be evicted by the best-effort pod. Short-term TODO: Implement a restart policy where best-effort pods are immediately evicted if OOM killed, but burstable pods are given a few retries. Long-term TODO: push support for OOM scores in cgroups to the upstream Linux kernel.
- **Swap Memory**: The QoS proposal assumes that swap memory is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can start allocating memory on swap space. Eventually, if there isn't enough swap space, processes in the pods might get killed. TODO: ensure that swap space is disabled on our cluster setup scripts.
Killing and eviction mechanics:
- **Killing Containers**: Usually, containers cannot function properly if one of the constituent processes in the container is killed. TODO: When a process in a container is out of resource killed (e.g. OOM killed), kill the entire container.
- **Out of Resource Eviction**: If a container in a multi-container pod fails, we might want restart the entire pod instead of just restarting the container. In some cases (e.g. if a memory best-effort container is out of resource killed), we might change pods to "failed" phase and pods might need to be evicted. TODO: Draft a policy for out of resource eviction and implement it.
Maintaining CPU performance:
- **CPU-sharing Issues** Suppose that a node is running 2 container: a container A requesting for 50% of CPU (but without a CPU limit), and a container B not requesting for resources. Suppose that both pods try to use as much CPU as possible. After the proposal is implemented, A will get 100% of the CPU, and B will get around 0% of the CPU. However, a fairer scheme would give the Burstable container 75% of the CPU and the Best-Effort container 25% of the CPU (since resources past the Burstable containers request are not guaranteed). TODO: think about whether this issue to be solved, implement a solution.
- **CPU kills**: System tasks or daemons like the Kubelet could consume more CPU, and we won't be able to guarantee containers the CPU amount they requested. If the situation persists, we might want to kill the container. TODO: Draft a policy for CPU usage killing and implement it.
- **CPU limits**: Enabling CPU limits can be problematic, because processes might be hard capped and might stall for a while. TODO: Enable CPU limits intelligently using CPU quota and core allocation.
Documentation:
- **Documentation**: TODO: add user docs for resource QoS
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/resource-qos.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->


@@ -26,20 +26,30 @@ const (
KubeProxyOOMScoreAdj int = -999
)
// isMemoryBestEffort returns true if the container's memory requirements are best-effort.
func isMemoryBestEffort(container *api.Container) bool {
// A container is memory best-effort if its memory request is unspecified or 0.
// If a request is specified, then the user expects some kind of resource guarantee.
return container.Resources.Requests.Memory().Value() == 0
// isBestEffort returns true if the container's resource requirements are best-effort.
func isBestEffort(container *api.Container) bool {
// A container is best-effort if any of its resource requests is unspecified or 0.
if container.Resources.Requests.Memory().Value() == 0 ||
container.Resources.Requests.Cpu().Value() == 0 {
return true
}
return false
}
// isMemoryGuaranteed returns true if the container's memory requirements are Guaranteed.
func isMemoryGuaranteed(container *api.Container) bool {
// A container is memory guaranteed if its memory request == memory limit.
// If memory request == memory limit, the user is very confident of resource consumption.
memoryRequest := container.Resources.Requests.Memory()
memoryLimit := container.Resources.Limits.Memory()
return (*memoryRequest).Cmp(*memoryLimit) == 0 && memoryRequest.Value() != 0
// isGuaranteed returns true if the container's resource requirements are Guaranteed.
func isGuaranteed(container *api.Container) bool {
// A container is guaranteed if all its request == limit.
memoryRequest := container.Resources.Requests.Memory().Value()
memoryLimit := container.Resources.Limits.Memory().Value()
cpuRequest := container.Resources.Requests.Cpu().Value()
cpuLimit := container.Resources.Limits.Cpu().Value()
if memoryRequest != 0 &&
cpuRequest != 0 &&
cpuRequest == cpuLimit &&
memoryRequest == memoryLimit {
return true
}
return false
}
// GetContainerOOMAdjust returns the amount by which the OOM score of all processes in the
@@ -48,25 +58,25 @@ func isMemoryGuaranteed(container *api.Container) bool {
// and 1000. Containers with higher OOM scores are killed if the system runs out of memory.
// See https://lwn.net/Articles/391222/ for more information.
func GetContainerOOMScoreAdjust(container *api.Container, memoryCapacity int64) int {
if isMemoryGuaranteed(container) {
// Memory guaranteed containers should be the last to get killed.
if isGuaranteed(container) {
// Guaranteed containers should be the last to get killed.
return -999
} else if isMemoryBestEffort(container) {
// Memory best-effort containers should be the first to be killed.
} else if isBestEffort(container) {
// Best-effort containers should be the first to be killed.
return 1000
} else {
// Burstable containers are a middle tier, between Guaranteed and Best-Effort. Ideally,
// we want to protect Burstable containers that consume less memory than requested.
// The formula below is a heuristic. A container requesting for 10% of a system's
// memory will have an oom score adjust of 900. If a process in container Y
// memory will have an OOM score adjust of 900. If a process in container Y
// uses over 10% of memory, its OOM score will be 1000. The idea is that containers
// which use more than their request will have an OOM score of 1000 and will be prime
// targets for OOM kills.
// Note that this is a heuristic, it won't work if a container has many small processes.
memoryRequest := container.Resources.Requests.Memory().Value()
oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
// A memory guaranteed container using 100% of memory can have an OOM score of 1. Ensure
// that memory burstable containers have a higher OOM score.
// A guaranteed container using 100% of memory can have an OOM score of 1. Ensure
// that burstable containers have a higher OOM score.
if oomScoreAdjust < 2 {
return 2
}


@ -29,11 +29,40 @@ const (
)
var (
zeroRequestMemoryBestEffort = api.Container{
zeroRequestBestEffort = api.Container{
Resources: api.ResourceRequirements{
Limits: api.ResourceList{
api.ResourceName(api.ResourceCPU): resource.MustParse("10"),
},
},
}
edgeBestEffort = api.Container{
Resources: api.ResourceRequirements{
Requests: api.ResourceList{
api.ResourceName(api.ResourceCPU): resource.MustParse("0"),
},
Limits: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse("10G"),
},
},
}
noRequestBestEffort = api.Container{
Resources: api.ResourceRequirements{
Limits: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse("0"),
},
},
}
noLimitBestEffort = api.Container{}
guaranteed = api.Container{
Resources: api.ResourceRequirements{
Requests: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse("10G"),
api.ResourceName(api.ResourceCPU): resource.MustParse("5m"),
api.ResourceName(api.ResourceMemory): resource.MustParse("0G"),
},
Limits: api.ResourceList{
api.ResourceName(api.ResourceCPU): resource.MustParse("5m"),
@ -42,43 +71,11 @@ var (
},
}
edgeMemoryBestEffort = api.Container{
Resources: api.ResourceRequirements{
Requests: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse("0G"),
},
Limits: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse("0G"),
},
},
}
noRequestMemoryBestEffort = api.Container{
Resources: api.ResourceRequirements{
Limits: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse("10G"),
},
},
}
noLimitMemoryBestEffort = api.Container{}
memoryGuaranteed = api.Container{
Resources: api.ResourceRequirements{
Requests: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse("10G"),
},
Limits: api.ResourceList{
api.ResourceName(api.ResourceCPU): resource.MustParse("5m"),
api.ResourceName(api.ResourceMemory): resource.MustParse("10G"),
},
},
}
memoryBurstable = api.Container{
burstable = api.Container{
Resources: api.ResourceRequirements{
Requests: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse(strconv.Itoa(standardMemoryAmount / 2)),
api.ResourceName(api.ResourceCPU): resource.MustParse("5m"),
},
Limits: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse("10G"),
@ -86,41 +83,42 @@ var (
},
}
memoryBurstableNoLimit = api.Container{
burstableNoLimit = api.Container{
Resources: api.ResourceRequirements{
Requests: api.ResourceList{
api.ResourceName(api.ResourceMemory): resource.MustParse(strconv.Itoa(standardMemoryAmount - 1)),
api.ResourceName(api.ResourceCPU): resource.MustParse("5m"),
},
},
}
)
func TestIsMemoryBestEffort(t *testing.T) {
validCases := []api.Container{zeroRequestMemoryBestEffort, noRequestMemoryBestEffort, noLimitMemoryBestEffort, edgeMemoryBestEffort}
func TestIsBestEffort(t *testing.T) {
validCases := []api.Container{zeroRequestBestEffort, noRequestBestEffort, noLimitBestEffort, edgeBestEffort}
for _, container := range validCases {
if !isMemoryBestEffort(&container) {
t.Errorf("container %+v is memory best-effort", container)
if !isBestEffort(&container) {
t.Errorf("container %+v is best-effort", container)
}
}
invalidCases := []api.Container{memoryGuaranteed, memoryBurstable}
invalidCases := []api.Container{guaranteed, burstable}
for _, container := range invalidCases {
if isMemoryBestEffort(&container) {
t.Errorf("container %+v is not memory best-effort", container)
if isBestEffort(&container) {
t.Errorf("container %+v is not best-effort", container)
}
}
}
func TestIsMemoryGuaranteed(t *testing.T) {
validCases := []api.Container{memoryGuaranteed}
func TestIsGuaranteed(t *testing.T) {
validCases := []api.Container{guaranteed}
for _, container := range validCases {
if !isMemoryGuaranteed(&container) {
t.Errorf("container %+v is memory guaranteed", container)
if !isGuaranteed(&container) {
t.Errorf("container %+v is guaranteed", container)
}
}
invalidCases := []api.Container{zeroRequestMemoryBestEffort, noRequestMemoryBestEffort, noLimitMemoryBestEffort, edgeMemoryBestEffort, memoryBurstable}
invalidCases := []api.Container{zeroRequestBestEffort, noRequestBestEffort, noLimitBestEffort, edgeBestEffort, burstable}
for _, container := range invalidCases {
if isMemoryGuaranteed(&container) {
t.Errorf("container %+v is not memory guaranteed", container)
if isGuaranteed(&container) {
t.Errorf("container %+v is not guaranteed", container)
}
}
}
@ -133,46 +131,45 @@ type oomTest struct {
}
func TestGetContainerOOMScoreAdjust(t *testing.T) {
oomTests := []oomTest{
{
container: &zeroRequestMemoryBestEffort,
container: &zeroRequestBestEffort,
memoryCapacity: 4000000000,
lowOOMScoreAdj: 1000,
highOOMScoreAdj: 1000,
},
{
container: &edgeMemoryBestEffort,
container: &edgeBestEffort,
memoryCapacity: 8000000000,
lowOOMScoreAdj: 1000,
highOOMScoreAdj: 1000,
},
{
container: &noRequestMemoryBestEffort,
container: &noRequestBestEffort,
memoryCapacity: 7230457451,
lowOOMScoreAdj: 1000,
highOOMScoreAdj: 1000,
},
{
container: &noLimitMemoryBestEffort,
container: &noLimitBestEffort,
memoryCapacity: 4000000000,
lowOOMScoreAdj: 1000,
highOOMScoreAdj: 1000,
},
{
container: &memoryGuaranteed,
container: &guaranteed,
memoryCapacity: 123456789,
lowOOMScoreAdj: -999,
highOOMScoreAdj: -999,
},
{
container: &memoryBurstable,
container: &burstable,
memoryCapacity: standardMemoryAmount,
lowOOMScoreAdj: 495,
highOOMScoreAdj: 505,
},
{
container: &memoryBurstableNoLimit,
container: &burstableNoLimit,
memoryCapacity: standardMemoryAmount,
lowOOMScoreAdj: 2,
highOOMScoreAdj: 2,