kubelet dra: add checkpointing mechanism in the DRA Manager

The checkpointing mechanism will repopulate DRA Manager in-memory cache on kubelet restart.
This will ensure that the information needed by the PodResources API is available across
a kubelet restart.

The ClaimInfoState struct represent the DRA Manager in-memory cache state in checkpoint.
It is embedd in the ClaimInfo which also include the annotation field. The separation between
the in-memory cache and the cache state in the checkpoint is so we won't be tied to the in-memory
cache struct which may change in the future. In the ClaimInfoState we save the minimal required fields
to restore the in-memory cache.

Signed-off-by: Moshe Levi <moshele@nvidia.com>
This commit is contained in:
Moshe Levi
2023-02-16 23:27:26 +02:00
parent b99fe0d5b9
commit e7256e08d3
6 changed files with 348 additions and 103 deletions

View File

@@ -315,7 +315,7 @@ func NewContainerManager(mountUtil mount.Interface, cadvisorInterface cadvisor.I
// initialize DRA manager
if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.DynamicResourceAllocation) {
klog.InfoS("Creating Dynamic Resource Allocation (DRA) manager")
cm.draManager, err = dra.NewManagerImpl(kubeClient)
cm.draManager, err = dra.NewManagerImpl(kubeClient, nodeConfig.KubeletRootDir)
if err != nil {
return nil, err
}