diff --git a/docs/design/README.md b/docs/design/README.md
index d5dee7095d..ad20cd7204 100644
--- a/docs/design/README.md
+++ b/docs/design/README.md
@@ -11,6 +11,7 @@ Kata Containers design documents:
 - [`Inotify` support](inotify.md)
 - [Metrics(Kata 2.0)](kata-2-0-metrics.md)
 - [Design for Kata Containers `Lazyload` ability with `nydus`](kata-nydus-design.md)
+- [Design for direct-assigned volume](direct-blk-device-assignment.md)

---

diff --git a/docs/design/direct-blk-device-assignment.md b/docs/design/direct-blk-device-assignment.md
new file mode 100644
index 0000000000..9997d0e7a6
--- /dev/null
+++ b/docs/design/direct-blk-device-assignment.md

# Motivation
Today, a few gaps between the Container Storage Interface (CSI) and virtual machine (VM) based runtimes such as Kata Containers
prevent them from working together smoothly.

First, it is cumbersome to use a persistent volume (PV) with Kata Containers. Today, for a PV with the `Filesystem` volume mode, Virtio-fs
is the only way to surface it inside a Kata Containers guest VM. However, mounting the filesystem (FS) within the guest operating system (OS) is
often preferred because of its performance benefits, the availability of native FS features, and its security benefits over the Virtio-fs mechanism.

Second, it is difficult, if not impossible, to resize a PV online with Kata Containers. While a PV can be expanded on the host OS,
the updated metadata needs to be propagated to the guest OS in order for the application container to use the expanded volume.
Currently, there is no way to propagate the PV metadata from the host OS to the guest OS without restarting the Pod sandbox.

# Proposed Solution

Because of the OS boundary, these features cannot be implemented in the CSI node driver plugin running on the host OS,
as is normally done for runc containers.
Instead, they can be implemented by the Kata Containers agent inside the guest OS,
but that requires the CSI driver to pass the relevant information to the Kata Containers runtime.
An ideal long-term solution would be to have the `kubelet` coordinate the communication between the CSI driver and
the container runtime, as described in [KEP-2857](https://github.com/kubernetes/enhancements/pull/2893/files).
However, as the KEP is still under review, we propose a short-to-medium-term solution to unblock our use case.

The proposed solution builds on a previous [proposal](https://github.com/egernst/kata-containers/blob/da-proposal/docs/design/direct-assign-volume.md)
described by Eric Ernst. The previous proposal has two gaps:

1. Writing a `csiPlugin.json` file to the volume root path introduces a security risk. A malicious user can gain unauthorized
access to a block device by writing their own `csiPlugin.json` to that location through an ephemeral CSI plugin.

2. The proposal did not describe how to establish a mapping between a volume and a Kata sandbox, which is needed to
implement the CSI volume resize and volume stats collection APIs.

This document focuses on how to address these two gaps.

## Assumptions and Limitations
1. The proposal assumes that a block device volume is used by only one Pod on a node at a time, which we believe
is the most common pattern in Kata Containers use cases. It is also unsafe to have the same block device attached to more than
one Kata pod. In the context of Kubernetes, the `PersistentVolumeClaim` (PVC) needs to have its `accessMode` set to `ReadWriteOncePod`.
2. More advanced Kubernetes volume features, such as `fsGroup`, `fsGroupChangePolicy`, and `subPath`, are not supported.

## End User Interface

1. The user specifies a PV as a direct-assigned volume. How a PV is specified as a direct-assigned volume is left for each CSI implementation to decide.
There are a few options for reference:
   1. A storage class parameter specifies whether it is a direct-assigned volume. This avoids any lookups of PVC
      or Pod information from the CSI plugin (as the external provisioner takes care of these). However, all PVs in a storage class with the parameter set
      will have host mounts skipped.
   2. Use a PVC annotation. This approach requires the CSI plugin to have `--extra-create-metadata` [set](https://kubernetes-csi.github.io/docs/external-provisioner.html#persistentvolumeclaim-and-persistentvolume-parameters)
      so that it can look up the PVC annotations from the API server. Pro: the API server lookup of annotations is only required during creation of the PV.
      Con: the CSI plugin will always skip host mounting of the PV.
   3. The CSI plugin can also look up the Pod's `runtimeclass` during `NodePublish`. An example of this approach can be found in the [Alibaba CSI plugin](https://github.com/kubernetes-sigs/alibaba-cloud-csi-driver/blob/master/pkg/disk/nodeserver.go#L248).
2. The CSI node driver delegates the direct-assigned volume to the Kata Containers runtime. The CSI node driver APIs need to
   be modified to pass volume mount information to, and collect volume information from, the Kata Containers runtime by invoking `kata-runtime` command-line commands.
   * **NodePublishVolume** -- It invokes `kata-runtime direct-volume add --volume-path [volumePath] --mount-info [mountInfo]`
     to propagate the volume mount information to the Kata Containers runtime for it to carry out the filesystem mount operation.
     The `volumePath` is the [target_path](https://github.com/container-storage-interface/spec/blob/master/csi.proto#L1364) in the CSI `NodePublishVolumeRequest`.
     The `mountInfo` is a serialized JSON string.
   * **NodeGetVolumeStats** -- It invokes `kata-runtime direct-volume stats --volume-path [volumePath]` to retrieve the filesystem stats of a direct-assigned volume.
   * **NodeExpandVolume** -- It invokes `kata-runtime direct-volume resize --volume-path [volumePath] --size [size]` to send a resize request to the Kata Containers runtime to
     resize the direct-assigned volume.
   * **NodeStageVolume/NodeUnstageVolume** -- It invokes `kata-runtime direct-volume remove --volume-path [volumePath]` to remove the persisted metadata of a direct-assigned volume.

The `mountInfo` object is defined as follows:
```go
type MountInfo struct {
	// The type of the volume (e.g. block)
	VolumeType string `json:"volume-type"`
	// The device backing the volume.
	Device string `json:"device"`
	// The filesystem type to be mounted on the volume.
	FsType string `json:"fstype"`
	// Additional metadata to pass to the agent regarding this volume.
	Metadata map[string]string `json:"metadata,omitempty"`
	// Additional mount options.
	Options []string `json:"options,omitempty"`
}
```
Note: since `mountInfo` is persisted to disk by the Kata runtime, it must not contain any secrets (such as an SMB mount password).

## Implementation Details

### Kata runtime
Instead of the CSI node driver writing the mount info into a `csiPlugin.json` file under the volume root,
as described in the original proposal, we propose that the CSI node driver pass the mount information to
the Kata Containers runtime through a new `kata-runtime` command-line command. The `kata-runtime` then writes the mount
information to a `mountInfo.json` file in a predefined location (`/run/kata-containers/shared/direct-volumes/[volume_path]/`).

When the Kata Containers runtime starts a container, it determines whether a volume mount is a direct-assigned volume by checking
whether there is a `mountInfo.json` file under the computed Kata `direct-volumes` directory. If there is, the runtime parses the file and
updates the mount spec with the data in `mountInfo`.
The updated mount spec is then passed to the Kata agent in the guest VM together
with other mounts. The Kata Containers runtime also creates a file named after the sandbox id under the `direct-volumes/[volume_path]/`
directory. This sandbox id file establishes a mapping between the volume and the sandbox using it.
Later, when the Kata Containers runtime handles the `stats` and `resize` commands, it uses the sandbox id to identify
the endpoint of the corresponding `containerd-shim-kata-v2`.

### containerd-shim-kata-v2 changes
`containerd-shim-kata-v2` provides an API for sandbox management through a Unix domain socket. Two new handlers are proposed: `/direct-volume/stats` and `/direct-volume/resize`.

Example:

```bash
$ curl --unix-socket "$shim_socket_path" -X GET 'http://localhost/direct-volume/stats/[urlSafeVolumePath]'
$ curl --unix-socket "$shim_socket_path" -X POST 'http://localhost/direct-volume/resize' -d '{ "volumePath": "[volumePath]", "size": "123123" }'
```

The shim then forwards the corresponding request to the `kata-agent` to carry out the operation inside the guest VM. For the `resize` operation,
the Kata runtime also needs to notify the hypervisor to resize the block device (e.g. call `block_resize` in QEMU).

### Kata agent changes

The mount spec of a direct-assigned volume is passed to the `kata-agent` through the existing `Storage` GRPC object.
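As a non-normative illustration of that translation, the sketch below parses a persisted `mountInfo.json` payload and fills a simplified stand-in for the `Storage` object. The `Storage` struct, the `storageFromMountInfo` helper, and the `blk` driver name are illustrative assumptions, not the actual agent protocol or runtime code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MountInfo mirrors the struct defined earlier in this document.
type MountInfo struct {
	VolumeType string            `json:"volume-type"`
	Device     string            `json:"device"`
	FsType     string            `json:"fstype"`
	Metadata   map[string]string `json:"metadata,omitempty"`
	Options    []string          `json:"options,omitempty"`
}

// Storage is a simplified stand-in for the agent's Storage GRPC object;
// the real object has more fields.
type Storage struct {
	Driver     string   // how the device is attached ("blk" assumed here for virtio-blk)
	Source     string   // device path as seen inside the guest
	Fstype     string   // filesystem to mount inside the guest
	MountPoint string   // where the agent mounts the device in the guest
	Options    []string // extra mount options
}

// storageFromMountInfo parses a persisted mountInfo.json payload and fills a
// Storage object. guestDevice is the device path inside the guest, which the
// runtime learns when it hotplugs the block device; it generally differs from
// the host device recorded in mountInfo.
func storageFromMountInfo(data []byte, guestDevice, mountPoint string) (Storage, error) {
	var mi MountInfo
	if err := json.Unmarshal(data, &mi); err != nil {
		return Storage{}, fmt.Errorf("parsing mountInfo: %w", err)
	}
	return Storage{
		Driver:     "blk",
		Source:     guestDevice,
		Fstype:     mi.FsType,
		MountPoint: mountPoint,
		Options:    mi.Options,
	}, nil
}

func main() {
	raw := []byte(`{"volume-type":"block","device":"/dev/sdf","fstype":"ext4","options":["rw"]}`)
	s, err := storageFromMountInfo(raw, "/dev/vdb", "/run/kata-containers/sandbox/storage/vol1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", s)
}
```

The key point of the sketch is that only the filesystem type and mount options come from `mountInfo`; the device path is rewritten to the guest-side path after hotplug.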
Two new APIs and the supporting GRPC objects are added to the GRPC protocol between the shim and the agent for resizing volumes and getting volume stats:
```protobuf
rpc GetVolumeStats(VolumeStatsRequest) returns (VolumeStatsResponse);
rpc ResizeVolume(ResizeVolumeRequest) returns (google.protobuf.Empty);

message VolumeStatsRequest {
    // The volume path on the guest outside the container.
    string volume_guest_path = 1;
}

message ResizeVolumeRequest {
    // Full VM guest path of the volume (outside the container).
    string volume_guest_path = 1;
    uint64 size = 2;
}

// This should be kept in sync with the CSI NodeGetVolumeStatsResponse (https://github.com/container-storage-interface/spec/blob/v1.5.0/csi.proto)
message VolumeStatsResponse {
    // This field is OPTIONAL.
    repeated VolumeUsage usage = 1;
    // Information about the current condition of the volume.
    // This field is OPTIONAL.
    // This field MUST be specified if the VOLUME_CONDITION node
    // capability is supported.
    VolumeCondition volume_condition = 2;
}

message VolumeUsage {
    enum Unit {
        UNKNOWN = 0;
        BYTES = 1;
        INODES = 2;
    }
    // The available capacity in the specified Unit. This field is OPTIONAL.
    // The value of this field MUST NOT be negative.
    uint64 available = 1;

    // The total capacity in the specified Unit. This field is REQUIRED.
    // The value of this field MUST NOT be negative.
    uint64 total = 2;

    // The used capacity in the specified Unit. This field is OPTIONAL.
    // The value of this field MUST NOT be negative.
    uint64 used = 3;

    // Units by which values are measured. This field is REQUIRED.
    Unit unit = 4;
}

// VolumeCondition represents the current condition of a volume.
message VolumeCondition {
    // Normal volumes are available for use and operating optimally.
    // An abnormal volume does not meet these criteria.
    // This field is REQUIRED.
    string message = 2;
}
```

### Step-by-step walk-through

Given the following definition:
```yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  runtimeClassName: kata-qemu
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: ebs-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    skip-hostmount: "true"
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOncePod
  volumeMode: Filesystem
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: ext4
```
Let's assume that the necessary changes have been made to the `aws-ebs-csi-driver` node driver.

**Node publish volume**
1. In the node CSI driver, the `NodePublishVolume` API invokes: `kata-runtime direct-volume add --volume-path "/kubelet/a/b/c/d/sdf" --mount-info "{\"device\": \"/dev/sdf\", \"fstype\": \"ext4\"}"`.
2. The Kata runtime writes the mount-info JSON to a file called `mountInfo.json` under `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.

**Node unstage volume**
1. In the node CSI driver, the `NodeUnstageVolume` API invokes: `kata-runtime direct-volume remove --volume-path "/kubelet/a/b/c/d/sdf"`.
2. The Kata runtime deletes the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.

**Use the volume in a sandbox**
1. Upon a request to start a container, the `containerd-shim-kata-v2` examines the container spec
and iterates through the mounts.
For each mount, if there is a `mountInfo.json` file under `/run/kata-containers/shared/direct-volumes/[mount source path]`,
it generates a `Storage` GRPC object after overwriting the mount spec with the information in `mountInfo.json`.
2. The shim sends the storage objects to the kata-agent through TTRPC.
3. The shim writes a file with the sandbox id as its name under `/run/kata-containers/shared/direct-volumes/[mount source path]`.
4. The kata-agent mounts the storage objects for the container.

**Node expand volume**
1. In the node CSI driver, the `NodeExpandVolume` API invokes: `kata-runtime direct-volume resize --volume-path "/kubelet/a/b/c/d/sdf" --size 8Gi`.
2. The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
3. The Kata runtime identifies the shim instance through the sandbox id and sends a GRPC request to resize the volume.
4. The shim handles the request, asks the hypervisor to resize the block device, and sends a GRPC request to the Kata agent to resize the filesystem.
5. The Kata agent receives the request and resizes the filesystem.

**Node get volume stats**
1. In the node CSI driver, the `NodeGetVolumeStats` API invokes: `kata-runtime direct-volume stats --volume-path "/kubelet/a/b/c/d/sdf"`.
2. The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
3. The Kata runtime identifies the shim instance through the sandbox id and sends a GRPC request to get the volume stats.
4. The shim handles the request and forwards it to the Kata agent.
5. The Kata agent receives the request and returns the filesystem stats.