Mirror of https://github.com/kata-containers/kata-containers.git, synced 2025-10-31 17:37:20 +00:00

mount-info.json should be mountInfo.json according to the description in the doc. Fixes: #5716 Signed-off-by: Jason Zhang <zhanghj.lc@inspur.com>

# Motivation
Today, a few gaps between the Container Storage Interface (CSI) and virtual machine (VM) based runtimes such as Kata Containers
prevent them from working together smoothly.

First, it’s cumbersome to use a persistent volume (PV) with Kata Containers. Today, for a PV with Filesystem volume mode, Virtio-fs
is the only way to surface it inside a Kata Containers guest VM. But mounting the filesystem (FS) within the guest operating system (OS) is often
preferable because of its performance benefits, the availability of native FS features, and its security benefits over the Virtio-fs mechanism.

Second, it’s difficult, if not impossible, to resize a PV online with Kata Containers. While a PV can be expanded on the host OS,
the updated metadata needs to be propagated to the guest OS in order for the application container to use the expanded volume.
Currently, there is no way to propagate the PV metadata from the host OS to the guest OS without restarting the Pod sandbox.

# Proposed Solution

Because of the OS boundary, these features cannot be implemented in the CSI node driver plugin running on the host OS,
as is normally done for runc containers. Instead, they can be implemented by the Kata Containers agent inside the guest OS,
but this requires the CSI driver to pass the relevant information to the Kata Containers runtime.
An ideal long-term solution would be to have the `kubelet` coordinate the communication between the CSI driver and
the container runtime, as described in [KEP-2857](https://github.com/kubernetes/enhancements/pull/2893/files).
However, as the KEP is still under review, we would like to propose a short/medium-term solution to unblock our use case.

The proposed solution is built on top of a previous [proposal](https://github.com/egernst/kata-containers/blob/da-proposal/docs/design/direct-assign-volume.md)
described by Eric Ernst. The previous proposal has two gaps:

1. Writing a `csiPlugin.json` file to the volume root path introduces a security risk. A malicious user can gain unauthorized
access to a block device by writing their own `csiPlugin.json` to the above location through an ephemeral CSI plugin.

2. The proposal didn't describe how to establish a mapping between a volume and a Kata sandbox, which is needed for
implementing the CSI volume resize and volume stats collection APIs.

This document focuses on how to address these two gaps.

## Assumptions and Limitations
1. The proposal assumes that a block device volume will only be used by one Pod on a node at a time, which we believe
is the most common pattern in Kata Containers use cases. It’s also unsafe to have the same block device attached to more than
one Kata pod. In the context of Kubernetes, the `PersistentVolumeClaim` (PVC) needs to list `ReadWriteOncePod` in its `accessModes`.
2. More advanced Kubernetes volume features such as `fsGroup`, `fsGroupChangePolicy`, and `subPath` are not supported.

## End User Interface

1. The user specifies a PV as a direct-assigned volume. How a PV is specified as a direct-assigned volume is left for each CSI implementation to decide.
There are a few options for reference:
   1. A storage class parameter specifies whether it's a direct-assigned volume. This avoids any lookups of PVC
   or Pod information from the CSI plugin (as the external provisioner takes care of these). However, all PVs in the storage class with the parameter set
   will have host mounts skipped.
   2. Use a PVC annotation. This approach requires the CSI plugin to have `--extra-create-metadata` [set](https://kubernetes-csi.github.io/docs/external-provisioner.html#persistentvolumeclaim-and-persistentvolume-parameters)
   so that it can look up the PVC annotations from the API server. Pro: The API server lookup of annotations is only required during creation of the PV.
   Con: The CSI plugin will always skip host mounting of the PV.
   3. The CSI plugin can also look up the pod `runtimeclass` during `NodePublish`. This approach can be found in the [Alibaba Cloud CSI plugin](https://github.com/kubernetes-sigs/alibaba-cloud-csi-driver/blob/master/pkg/disk/nodeserver.go#L248).
2. The CSI node driver delegates the direct-assigned volume to the Kata Containers runtime. The CSI node driver APIs need to
   be modified to pass the volume mount information to, and collect volume information from, the Kata Containers runtime by invoking `kata-runtime` command-line commands.
   * **NodePublishVolume** -- It invokes `kata-runtime direct-volume add --volume-path [volumePath] --mount-info [mountInfo]`
   to propagate the volume mount information to the Kata Containers runtime so that it can carry out the filesystem mount operation.
   The `volumePath` is the [target_path](https://github.com/container-storage-interface/spec/blob/master/csi.proto#L1364) in the CSI `NodePublishVolumeRequest`.
   The `mountInfo` is a serialized JSON string.
   * **NodeGetVolumeStats** -- It invokes `kata-runtime direct-volume stats --volume-path [volumePath]` to retrieve the filesystem stats of the direct-assigned volume.
   * **NodeExpandVolume** -- It invokes `kata-runtime direct-volume resize --volume-path [volumePath] --size [size]` to send a resize request to the Kata Containers runtime to
   resize the direct-assigned volume.
   * **NodeStageVolume/NodeUnstageVolume** -- It invokes `kata-runtime direct-volume remove --volume-path [volumePath]` to remove the persisted metadata of a direct-assigned volume.

The `mountInfo` object is defined as follows:
```Golang
type MountInfo struct {
    // The type of the volume (i.e. block)
    VolumeType string `json:"volume-type"`
    // The device backing the volume.
    Device string `json:"device"`
    // The filesystem type to be mounted on the volume.
    FsType string `json:"fstype"`
    // Additional metadata to pass to the agent regarding this volume.
    Metadata map[string]string `json:"metadata,omitempty"`
    // Additional mount options.
    Options []string `json:"options,omitempty"`
}
```
Note: given that the `mountInfo` is persisted to disk by the Kata runtime, it shouldn't contain any secrets (such as an SMB mount password).

## Implementation Details

### Kata runtime
Instead of the CSI node driver writing the mount info into a `csiPlugin.json` file under the volume root,
as described in the original proposal, here we propose that the CSI node driver passes the mount information to
the Kata Containers runtime through a new `kata-runtime` command-line command. The `kata-runtime` then writes the mount
information to a `mountInfo.json` file in a predefined location (`/run/kata-containers/shared/direct-volumes/[volume_path]/`).

When the Kata Containers runtime starts a container, it determines whether a volume mount is a direct-assigned volume by checking
whether there is a `mountInfo.json` file under the computed Kata `direct-volumes` directory. If there is, the runtime parses the `mountInfo.json` file
and updates the mount spec with the data in it. The updated mount spec is then passed to the Kata agent in the guest VM together
with the other mounts. The Kata Containers runtime also creates a file named after the sandbox id under the `direct-volumes/[volume_path]/`
directory. The sandbox id file establishes a mapping between the volume and the sandbox using it.
Later, when the Kata Containers runtime handles the `stats` and `resize` commands, it uses the sandbox id to identify
the endpoint of the corresponding `containerd-shim-kata-v2`.

### containerd-shim-kata-v2 changes
`containerd-shim-kata-v2` provides an API for sandbox management through a Unix domain socket. Two new handlers are proposed: `/direct-volume/stats` and `/direct-volume/resize`.

Example:

```bash
$ curl --unix-socket "$shim_socket_path" -i -X GET 'http://localhost/direct-volume/stats/[urlSafeVolumePath]'
$ curl --unix-socket "$shim_socket_path" -i -X POST 'http://localhost/direct-volume/resize' -d '{ "volumePath": "[volumePath]", "Size": "123123" }'
```

The shim then forwards the corresponding request to the `kata-agent` to carry out the operations inside the guest VM. For the `resize` operation,
the Kata runtime also needs to notify the hypervisor to resize the block device (e.g. call `block_resize` in QEMU).

### Kata agent changes

The mount spec of a direct-assigned volume is passed to `kata-agent` through the existing `Storage` gRPC object.
Two new APIs and three new gRPC objects are added to the gRPC protocol between the shim and agent for resizing and getting volume stats:
```protobuf
rpc GetVolumeStats(VolumeStatsRequest) returns (VolumeStatsResponse);
rpc ResizeVolume(ResizeVolumeRequest) returns (google.protobuf.Empty);

message VolumeStatsRequest {
    // The volume path on the guest outside the container
    string volume_guest_path = 1;
}

message ResizeVolumeRequest {
    // Full VM guest path of the volume (outside the container)
    string volume_guest_path = 1;
    uint64 size = 2;
}

// This should be kept in sync with CSI NodeGetVolumeStatsResponse (https://github.com/container-storage-interface/spec/blob/v1.5.0/csi.proto)
message VolumeStatsResponse {
    // This field is OPTIONAL.
    repeated VolumeUsage usage = 1;
    // Information about the current condition of the volume.
    // This field is OPTIONAL.
    // This field MUST be specified if the VOLUME_CONDITION node
    // capability is supported.
    VolumeCondition volume_condition = 2;
}
message VolumeUsage {
    enum Unit {
        UNKNOWN = 0;
        BYTES = 1;
        INODES = 2;
    }
    // The available capacity in specified Unit. This field is OPTIONAL.
    // The value of this field MUST NOT be negative.
    uint64 available = 1;

    // The total capacity in specified Unit. This field is REQUIRED.
    // The value of this field MUST NOT be negative.
    uint64 total = 2;

    // The used capacity in specified Unit. This field is OPTIONAL.
    // The value of this field MUST NOT be negative.
    uint64 used = 3;

    // Units by which values are measured. This field is REQUIRED.
    Unit unit = 4;
}

// VolumeCondition represents the current condition of a volume.
message VolumeCondition {

    // Normal volumes are available for use and operating optimally.
    // An abnormal volume does not meet these criteria.
    // This field is REQUIRED.
    bool abnormal = 1;

    // The message describing the condition of the volume.
    // This field is REQUIRED.
    string message = 2;
}
```

### Step by step walk-through

Given the following definition:
```YAML
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  runtimeClassName: kata-qemu
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: ebs-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    skip-hostmount: "true"
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOncePod
  volumeMode: Filesystem
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: ext4
```

Let’s assume that changes have been made in the `aws-ebs-csi-driver` node driver.

**Node publish volume**
1. In the node CSI driver, the `NodePublishVolume` API invokes: `kata-runtime direct-volume add --volume-path "/kubelet/a/b/c/d/sdf" --mount-info "{\"device\": \"/dev/sdf\", \"fstype\": \"ext4\"}"`.
2. The `kata-runtime` writes the mount-info JSON to a file called `mountInfo.json` under `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.

**Node unstage volume**
1. In the node CSI driver, the `NodeUnstageVolume` API invokes: `kata-runtime direct-volume remove --volume-path "/kubelet/a/b/c/d/sdf"`.
2. The `kata-runtime` deletes the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.

**Use the volume in sandbox**
1. Upon the request to start a container, the `containerd-shim-kata-v2` examines the container spec
and iterates through the mounts. For each mount, if there is a `mountInfo.json` file under `/run/kata-containers/shared/direct-volumes/[mount source path]`,
it generates a `Storage` gRPC object after overwriting the mount spec with the information in `mountInfo.json`.
2. The shim sends the storage objects to the `kata-agent` through TTRPC.
3. The shim writes a file with the sandbox id as the name under `/run/kata-containers/shared/direct-volumes/[mount source path]`.
4. The `kata-agent` mounts the storage objects for the container.

**Node expand volume**
1. In the node CSI driver, the `NodeExpandVolume` API invokes: `kata-runtime direct-volume resize --volume-path "/kubelet/a/b/c/d/sdf" --size 8Gi`.
2. The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
3. The Kata runtime identifies the shim instance through the sandbox id, and sends it a request to resize the volume.
4. The shim handles the request, asks the hypervisor to resize the block device, and sends a gRPC request to the Kata agent to resize the filesystem.
5. The Kata agent receives the request and resizes the filesystem.

**Node get volume stats**
1. In the node CSI driver, the `NodeGetVolumeStats` API invokes: `kata-runtime direct-volume stats --volume-path "/kubelet/a/b/c/d/sdf"`.
2. The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
3. The Kata runtime identifies the shim instance through the sandbox id, and sends it a request to get the volume stats.
4. The shim handles the request and forwards it to the Kata agent.
5. The Kata agent receives the request and returns the filesystem stats.