	Merge pull request #12810 from yujuhong/podcache_proposal
Auto commit by PR queue bot
BIN  docs/proposals/pod-cache.png — new file (binary file not shown; size: 50 KiB)
docs/proposals/runtime-pod-cache.md — new file (202 lines)

@@ -0,0 +1,202 @@
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>

--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubelet: Runtime Pod Cache

This proposal builds on top of the Pod Lifecycle Event Generator (PLEG) proposed
in [#12802](https://issues.k8s.io/12802). It assumes that Kubelet subscribes to
the pod lifecycle event stream to eliminate periodic polling of pod
states. Please see [#12802](https://issues.k8s.io/12802) for the motivation and
design concept for PLEG.

The runtime pod cache is an in-memory cache which stores the *status* of
all pods, and is maintained by PLEG. It serves as a single source of
truth for internal pod status, freeing Kubelet from querying the
container runtime.

## Motivation

With PLEG, Kubelet no longer needs to perform comprehensive state
checking for all pods periodically. It only instructs a pod worker to
start syncing when there is a change in its pod's status. Nevertheless,
during each sync, a pod worker still needs to construct the pod status
by examining all containers (whether dead or alive) in the pod, because
previous states are not cached. With the integration of the pod cache,
we can further improve Kubelet's CPU usage by

 1. Lowering the number of concurrent requests to the container
    runtime, since pod workers no longer have to query the runtime
    individually.
 2. Lowering the total number of inspect requests, because there is no
    need to inspect containers with no state changes.

***Don't we already have a
[container runtime cache](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/container/runtime_cache.go)?***

The runtime cache is an optimization that reduces the number of `GetPods()`
calls from the workers. However,

 * The cache does not store all information necessary for a worker to
   complete a sync (e.g., `docker inspect`); workers still need to inspect
   containers individually to generate `api.PodStatus`.
 * Workers sometimes need to bypass the cache in order to retrieve the
   latest pod state.

This proposal generalizes the cache and instructs PLEG to populate the cache, so
that the content is always up-to-date.

**Why can't each worker cache its own pod status?**

The short answer is that they can. The longer answer is that localized
caching limits the use of the cache content -- other components cannot
access it. This often leads to caching at multiple places and/or passing
objects around, complicating the control flow.

## Runtime Pod Cache

![pod cache](pod-cache.png)

The pod cache stores the `PodStatus` for all pods on the node. `PodStatus`
encompasses all the information required from the container runtime to
generate `api.PodStatus` for a pod.

```go
// PodStatus represents the status of the pod and its containers.
// api.PodStatus can be derived from examining PodStatus and api.Pod.
type PodStatus struct {
    ID                types.UID
    Name              string
    Namespace         string
    IP                string
    ContainerStatuses []*ContainerStatus
}

// ContainerStatus represents the status of a container.
type ContainerStatus struct {
    ID           ContainerID
    Name         string
    State        ContainerState
    CreatedAt    time.Time
    StartedAt    time.Time
    FinishedAt   time.Time
    ExitCode     int
    Image        string
    ImageID      string
    Hash         uint64
    RestartCount int
    Reason       string
    Message      string
}
```

`PodStatus` is defined in the container runtime interface, hence is
runtime-agnostic.
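
To make the cache's role concrete, below is a minimal sketch of what such a
cache could look like. This is an illustrative assumption of this proposal
rather than a finalized API: the `Cache` name, its methods, and the import
path are placeholders, and `PodStatus` refers to the type defined above.

```go
// Illustrative sketch only; names and import paths are assumptions.
package kubecontainer

import (
    "sync"

    "k8s.io/kubernetes/pkg/types"
)

// Cache stores the PodStatus of all pods on the node, keyed by pod UID.
type Cache struct {
    lock sync.RWMutex
    pods map[types.UID]*PodStatus
}

func NewCache() *Cache {
    return &Cache{pods: map[types.UID]*PodStatus{}}
}

// Get returns the cached status of a pod, or nil if the pod is unknown.
func (c *Cache) Get(id types.UID) *PodStatus {
    c.lock.RLock()
    defer c.lock.RUnlock()
    return c.pods[id]
}

// Set stores the latest status of a pod; only PLEG writes to the cache.
func (c *Cache) Set(id types.UID, status *PodStatus) {
    c.lock.Lock()
    defer c.lock.Unlock()
    c.pods[id] = status
}

// Delete evicts a pod entry; called by PLEG when the pod is no longer
// visible to the container runtime.
func (c *Cache) Delete(id types.UID) {
    c.lock.Lock()
    defer c.lock.Unlock()
    delete(c.pods, id)
}

// UIDs lists all cached pod UIDs; useful for eviction (see below).
func (c *Cache) UIDs() []types.UID {
    c.lock.RLock()
    defer c.lock.RUnlock()
    ids := make([]types.UID, 0, len(c.pods))
    for id := range c.pods {
        ids = append(ids, id)
    }
    return ids
}
```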

PLEG is responsible for updating the entries in the pod cache, hence always
keeping the cache up-to-date.

1. Detect a change of container state
2. Inspect the pod for details
3. Update the pod cache with the new `PodStatus`
  - If there is no real change to the pod entry, do nothing
  - Otherwise, generate and send out the corresponding pod lifecycle event

Note that in (3), PLEG can check whether there is any disparity between the old
and the new pod entry to filter out duplicated events if needed. A sketch of
these steps follows.
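
As a rough illustration, the sketch below builds on the `Cache` sketch above.
The `GenericPLEG` shape, `GetPodStatus`, `PodLifecycleEvent`, and
`computeEvents` are assumed names for this proposal's purposes; see
[#12802](https://issues.k8s.io/12802) for the actual PLEG design.

```go
// Continuation of the sketch above; all names remain assumptions.
package kubecontainer

import (
    "reflect"

    "k8s.io/kubernetes/pkg/types"
)

// PodLifecycleEvent is a stand-in for the event type proposed in #12802.
type PodLifecycleEvent struct {
    PodID types.UID
    Type  string // e.g., "ContainerStarted", "ContainerDied"
}

// podStatusGetter is the one runtime call this sketch relies on.
type podStatusGetter interface {
    GetPodStatus(id types.UID) (*PodStatus, error)
}

type GenericPLEG struct {
    runtime      podStatusGetter
    cache        *Cache
    eventChannel chan *PodLifecycleEvent
}

// updatePod performs steps (2) and (3) for one pod; step (1), detecting
// the container state change, is what triggered this call.
func (g *GenericPLEG) updatePod(id types.UID) {
    // (2) Inspect the pod for details.
    status, err := g.runtime.GetPodStatus(id)
    if err != nil {
        return // inspection failed; retry on the next relist/event
    }
    // (3) If nothing actually changed, leave the cache alone and emit
    // no events.
    oldStatus := g.cache.Get(id)
    if reflect.DeepEqual(oldStatus, status) {
        return
    }
    g.cache.Set(id, status)
    // Compare the old and new entries to generate de-duplicated events.
    for _, e := range computeEvents(oldStatus, status) {
        g.eventChannel <- e
    }
}

// computeEvents diffs two PodStatus entries to produce lifecycle events;
// the diffing logic is elided in this sketch.
func computeEvents(oldStatus, newStatus *PodStatus) []*PodLifecycleEvent {
    return nil
}
```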

### Evict cache entries

Note that the cache represents all the pods/containers known to the container
runtime. A cache entry should only be evicted if the pod is no longer visible
to the container runtime. PLEG is responsible for deleting entries in the
cache, as in the sketch below.
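
A sketch of the eviction pass, continuing the example above; `Pod` here is a
minimal stand-in for an entry in the runtime's pod list, and the method name
is an assumption:

```go
// Pod is a minimal stand-in for an entry in the runtime's pod list.
type Pod struct {
    ID types.UID
}

// evictStaleEntries drops cache entries for pods that no longer appear
// in the runtime's pod list (e.g., invoked after a relist).
func (g *GenericPLEG) evictStaleEntries(runtimePods []*Pod) {
    visible := make(map[types.UID]bool, len(runtimePods))
    for _, p := range runtimePods {
        visible[p.ID] = true
    }
    for _, id := range g.cache.UIDs() {
        if !visible[id] {
            // The pod is no longer visible to the runtime; evict it.
            g.cache.Delete(id)
        }
    }
}
```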

### Generate `api.PodStatus`

Because the pod cache stores the up-to-date `PodStatus` of the pods, Kubelet
can generate `api.PodStatus` by interpreting the cache entry at any
time. To avoid sending intermediate statuses (e.g., while a pod worker
is restarting a container), we will instruct the pod worker to generate a new
status at the beginning of each sync.
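
The ordering matters here: the status must be derived from the cache snapshot
taken before any container operations. A hedged sketch, where `statusSetter`,
`convertToAPIPodStatus`, and the worker shape are illustrative assumptions:

```go
// Illustrative sketch only; all names below are assumptions.
package kubecontainer

import "k8s.io/kubernetes/pkg/api"

// statusSetter stands in for Kubelet's status manager.
type statusSetter interface {
    SetPodStatus(pod *api.Pod, status api.PodStatus)
}

type podWorker struct {
    cache         *Cache
    statusManager statusSetter
}

// syncPod generates api.PodStatus from the cache entry at the *beginning*
// of the sync, before any container operations, so that intermediate
// states produced by this sync (e.g., a container being restarted) are
// never reported.
func (w *podWorker) syncPod(pod *api.Pod) error {
    cached := w.cache.Get(pod.UID)
    w.statusManager.SetPodStatus(pod, convertToAPIPodStatus(pod, cached))

    // ... proceed with the actual sync (start/kill containers) ...
    return nil
}

// convertToAPIPodStatus interprets a cached PodStatus together with the
// pod spec; the conversion rules are elided in this sketch.
func convertToAPIPodStatus(pod *api.Pod, cached *PodStatus) api.PodStatus {
    return api.PodStatus{}
}
```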

### Cache contention

Cache contention should not be a problem when the number of pods is
small. When Kubelet scales, we can always shard the pods by ID to
reduce contention.
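
Should contention become measurable, sharding is straightforward. A sketch,
with an assumed shard count, that hashes the pod UID to select an
independently locked shard:

```go
// Illustrative sketch only; the shard count and names are assumptions.
package kubecontainer

import (
    "hash/fnv"
    "sync"

    "k8s.io/kubernetes/pkg/types"
)

const numShards = 32 // illustrative; tune to the node's pod count

type shard struct {
    lock sync.RWMutex
    pods map[types.UID]*PodStatus
}

// ShardedCache reduces lock contention by partitioning pods by UID.
type ShardedCache struct {
    shards [numShards]*shard
}

func NewShardedCache() *ShardedCache {
    c := &ShardedCache{}
    for i := range c.shards {
        c.shards[i] = &shard{pods: map[types.UID]*PodStatus{}}
    }
    return c
}

// shardFor hashes the pod UID to pick the shard holding its entry.
func (c *ShardedCache) shardFor(id types.UID) *shard {
    h := fnv.New32a()
    h.Write([]byte(id))
    return c.shards[h.Sum32()%numShards]
}

func (c *ShardedCache) Get(id types.UID) *PodStatus {
    s := c.shardFor(id)
    s.lock.RLock()
    defer s.lock.RUnlock()
    return s.pods[id]
}

func (c *ShardedCache) Set(id types.UID, status *PodStatus) {
    s := c.shardFor(id)
    s.lock.Lock()
    defer s.lock.Unlock()
    s.pods[id] = status
}
```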

### Disk management

The pod cache cannot fulfill the needs of the container/image garbage
collectors, as they may demand more than pod-level information. These
components will still need to query the container runtime directly at
times. We may consider extending the cache for these use cases, but they are
beyond the scope of this proposal.


## Impact on Pod Worker Control Flow

A pod worker may perform various operations (e.g., start/kill a container)
during a sync. It will expect to see the results of such operations reflected
in the cache during the next sync. Alternatively, it could bypass the cache and
query the container runtime directly to get the latest status. However, this
is not desirable, since the cache is introduced exactly to eliminate
unnecessary, concurrent queries. Therefore, a pod worker should be blocked
until all expected results have been updated in the cache by PLEG.

Depending on the type of PLEG (see [#12802](https://issues.k8s.io/12802)) in
use, the method to check whether this requirement is met can differ. For a
PLEG that relies solely on relisting, a pod worker can simply wait until the
relist timestamp is newer than the end of the worker's last sync, as in the
sketch below. On the other hand, if pod workers know what events to expect,
they can block until the events are observed.
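
For the relist-based variant, the blocking rule reduces to "wait until the
cache's relist timestamp is newer than the end of my last sync". The sketch
below is one hedged way to express this, using a condition variable; the
timestamp field and the `GetNewerThan` name are assumptions of this proposal:

```go
// Illustrative sketch only; names and fields are assumptions.
package kubecontainer

import (
    "sync"
    "time"

    "k8s.io/kubernetes/pkg/types"
)

// TimedCache is a Cache variant that records when its content was last
// brought up-to-date by a full relist.
type TimedCache struct {
    lock     sync.Mutex
    cond     *sync.Cond
    relistAt time.Time // when the last completed relist was written back
    pods     map[types.UID]*PodStatus
}

func NewTimedCache() *TimedCache {
    c := &TimedCache{pods: map[types.UID]*PodStatus{}}
    c.cond = sync.NewCond(&c.lock)
    return c
}

// UpdateTime is called by PLEG after each relist's results have been
// written to the cache; it wakes up any blocked pod workers.
func (c *TimedCache) UpdateTime(t time.Time) {
    c.lock.Lock()
    defer c.lock.Unlock()
    c.relistAt = t
    c.cond.Broadcast()
}

// GetNewerThan blocks until the cache reflects the runtime state at some
// point after minTime, then returns the pod's status. A pod worker calls
// this with the end time of its previous sync.
func (c *TimedCache) GetNewerThan(id types.UID, minTime time.Time) *PodStatus {
    c.lock.Lock()
    defer c.lock.Unlock()
    for !c.relistAt.After(minTime) {
        c.cond.Wait()
    }
    return c.pods[id]
}
```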

It should be noted that `api.PodStatus` will only be generated by the pod
worker *after* the cache has been updated. This means that the perceived
responsiveness of Kubelet (when queried via the API server) will be affected by
how soon the cache can be populated. For the pure-relisting PLEG, the relist
period can become the bottleneck. On the other hand, a PLEG that watches the
upstream event stream (and knows what events to expect) is not restricted
by such periods and should improve Kubelet's perceived responsiveness.

## TODOs for v1.2

 - Redefine container runtime types ([#12619](https://issues.k8s.io/12619))
   and introduce `PodStatus`. Refactor dockertools and rkt to use the new
   types.

 - Add the cache and instruct PLEG to populate it.

 - Refactor Kubelet to use the cache.

 - Deprecate the old runtime cache.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->