mirror of
https://github.com/kata-containers/kata-containers.git
synced 2025-06-25 23:11:57 +00:00
docs: Direct-assigned volume design
Detail design description on direct-assigned volume Fixes: #1468 Signed-off-by: Feng Wang <feng.wang@databricks.com>
This commit is contained in:
parent
bc3f63bf0a
commit
7ffe5a16f2
@ -11,6 +11,7 @@ Kata Containers design documents:
|
|||||||
- [`Inotify` support](inotify.md)
|
- [`Inotify` support](inotify.md)
|
||||||
- [Metrics(Kata 2.0)](kata-2-0-metrics.md)
|
- [Metrics(Kata 2.0)](kata-2-0-metrics.md)
|
||||||
- [Design for Kata Containers `Lazyload` ability with `nydus`](kata-nydus-design.md)
|
- [Design for Kata Containers `Lazyload` ability with `nydus`](kata-nydus-design.md)
|
||||||
|
- [Design for direct-assigned volume](direct-blk-device-assignment.md)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
253
docs/design/direct-blk-device-assignment.md
Normal file
253
docs/design/direct-blk-device-assignment.md
Normal file
@ -0,0 +1,253 @@
|
|||||||
|
# Motivation
|
||||||
|
Today, there exist a few gaps between Container Storage Interface (CSI) and virtual machine (VM) based runtimes such as Kata Containers
|
||||||
|
that prevent them from working together smoothly.
|
||||||
|
|
||||||
|
First, it’s cumbersome to use a persistent volume (PV) with Kata Containers. Today, for a PV with Filesystem volume mode, Virtio-fs
|
||||||
|
is the only way to surface it inside a Kata Container guest VM. But often mounting the filesystem (FS) within the guest operating system (OS) is
|
||||||
|
desired due to performance benefits, availability of native FS features and security benefits over the Virtio-fs mechanism.
|
||||||
|
|
||||||
|
Second, it’s difficult if not impossible to resize a PV online with Kata Containers. While a PV can be expanded on the host OS,
|
||||||
|
the updated metadata needs to be propagated to the guest OS in order for the application container to use the expanded volume.
|
||||||
|
Currently, there is not a way to propagate the PV metadata from the host OS to the guest OS without restarting the Pod sandbox.
|
||||||
|
|
||||||
|
# Proposed Solution
|
||||||
|
|
||||||
|
Because of the OS boundary, these features cannot be implemented in the CSI node driver plugin running on the host OS
|
||||||
|
as is normally done in the runc container. Instead, they can be done by the Kata Containers agent inside the guest OS,
|
||||||
|
but it requires the CSI driver to pass the relevant information to the Kata Containers runtime.
|
||||||
|
An ideal long term solution would be to have the `kubelet` coordinating the communication between the CSI driver and
|
||||||
|
the container runtime, as described in [KEP-2857](https://github.com/kubernetes/enhancements/pull/2893/files).
|
||||||
|
However, as the KEP is still under review, we would like to propose a short/medium term solution to unblock our use case.
|
||||||
|
|
||||||
|
The proposed solution is built on top of a previous [proposal](https://github.com/egernst/kata-containers/blob/da-proposal/docs/design/direct-assign-volume.md)
|
||||||
|
described by Eric Ernst. The previous proposal has two gaps:
|
||||||
|
|
||||||
|
1. Writing a `csiPlugin.json` file to the volume root path introduced a security risk. A malicious user can gain unauthorized
|
||||||
|
access to a block device by writing their own `csiPlugin.json` to the above location through an ephemeral CSI plugin.
|
||||||
|
|
||||||
|
2. The proposal didn't describe how to establish a mapping between a volume and a kata sandbox, which is needed for
|
||||||
|
implementing CSI volume resize and volume stat collection APIs.
|
||||||
|
|
||||||
|
This document particularly focuses on how to address these two gaps.
|
||||||
|
|
||||||
|
## Assumptions and Limitations
|
||||||
|
1. The proposal assumes that a block device volume will only be used by one Pod on a node at a time, which we believe
|
||||||
|
is the most common pattern in Kata Containers use cases. It’s also unsafe to have the same block device attached to more than
|
||||||
|
one Kata pod. In the context of Kubernetes, the `PersistentVolumeClaim` (PVC) needs to have the `accessMode` as `ReadWriteOncePod`.
|
||||||
|
2. More advanced Kubernetes volume features such as, `fsGroup`, `fsGroupChangePolicy`, and `subPath` are not supported.
|
||||||
|
|
||||||
|
## End User Interface
|
||||||
|
|
||||||
|
1. The user specifies a PV as a direct-assigned volume. How a PV is specified as a direct-assigned volume is left for each CSI implementation to decide.
|
||||||
|
There are a few options for reference:
|
||||||
|
1. A storage class parameter specifies whether it's a direct-assigned volume. This avoids any lookups of PVC
|
||||||
|
or Pod information from the CSI plugin (as external provisioner takes care of these). However, all PVs in the storage class with the parameter set
|
||||||
|
will have host mounts skipped.
|
||||||
|
2. Use a PVC annotation. This approach requires the CSI plugins have `--extra-create-metadata` [set](https://kubernetes-csi.github.io/docs/external-provisioner.html#persistentvolumeclaim-and-persistentvolume-parameters)
|
||||||
|
to be able to perform a lookup of the PVC annotations from the API server. Pro: API server lookup of annotations only required during creation of PV.
|
||||||
|
Con: The CSI plugin will always skip host mounting of the PV.
|
||||||
|
3. The CSI plugin can also lookup pod `runtimeclass` during `NodePublish`. This approach can be found in the [ALIBABA CSI plugin](https://github.com/kubernetes-sigs/alibaba-cloud-csi-driver/blob/master/pkg/disk/nodeserver.go#L248).
|
||||||
|
2. The CSI node driver delegates the direct assigned volume to the Kata Containers runtime. The CSI node driver APIs need to
|
||||||
|
be modified to pass the volume mount information and collect volume information to/from the Kata Containers runtime by invoking `kata-runtime` command line commands.
|
||||||
|
* **NodePublishVolume** -- It invokes `kata-runtime direct-volume add --volume-path [volumePath] --mount-info [mountInfo]`
|
||||||
|
to propagate the volume mount information to the Kata Containers runtime for it to carry out the filesystem mount operation.
|
||||||
|
The `volumePath` is the [target_path](https://github.com/container-storage-interface/spec/blob/master/csi.proto#L1364) in the CSI `NodePublishVolumeRequest`.
|
||||||
|
The `mountInfo` is a serialized JSON string.
|
||||||
|
* **NodeGetVolumeStats** -- It invokes `kata-runtime direct-volume stats --volume-path [volumePath]` to retrieve the filesystem stats of direct-assigned volume.
|
||||||
|
* **NodeExpandVolume** -- It invokes `kata-runtime direct-volume resize --volume-path [volumePath] --size [size]` to send a resize request to the Kata Containers runtime to
|
||||||
|
resize the direct-assigned volume.
|
||||||
|
* **NodeStageVolume/NodeUnStageVolume** -- It invokes `kata-runtime direct-volume remove --volume-path [volumePath]` to remove the persisted metadata of a direct-assigned volume.
|
||||||
|
|
||||||
|
The `mountInfo` object is defined as follows:
|
||||||
|
```Golang
|
||||||
|
type MountInfo struct {
|
||||||
|
// The type of the volume (ie. block)
|
||||||
|
VolumeType string `json:"volume-type"`
|
||||||
|
// The device backing the volume.
|
||||||
|
Device string `json:"device"`
|
||||||
|
// The filesystem type to be mounted on the volume.
|
||||||
|
FsType string `json:"fstype"`
|
||||||
|
// Additional metadata to pass to the agent regarding this volume.
|
||||||
|
Metadata map[string]string `json:"metadata,omitempty"`
|
||||||
|
// Additional mount options.
|
||||||
|
Options []string `json:"options,omitempty"`
|
||||||
|
}
|
||||||
|
```
|
||||||
|
Notes: given that the `mountInfo` is persisted to the disk by the Kata runtime, it shouldn't container any secrets (such as SMB mount password).
|
||||||
|
|
||||||
|
## Implementation Details
|
||||||
|
|
||||||
|
### Kata runtime
|
||||||
|
Instead of the CSI node driver writing the mount info into a `csiPlugin.json` file under the volume root,
|
||||||
|
as described in the original proposal, here we propose that the CSI node driver passes the mount information to
|
||||||
|
the Kata Containers runtime through a new `kata-runtime` commandline command. The `kata-runtime` then writes the mount
|
||||||
|
information to a `mount-info.json` file in a predefined location (`/run/kata-containers/shared/direct-volumes/[volume_path]/`).
|
||||||
|
|
||||||
|
When the Kata Containers runtime starts a container, it verifies whether a volume mount is a direct-assigned volume by checking
|
||||||
|
whether there is a `mountInfo` file under the computed Kata `direct-volumes` directory. If it is, the runtime parses the `mountInfo` file,
|
||||||
|
updates the mount spec with the data in `mountInfo`. The updated mount spec is then passed to the Kata agent in the guest VM together
|
||||||
|
with other mounts. The Kata Containers runtime also creates a file named by the sandbox id under the `direct-volumes/[volume_path]/`
|
||||||
|
directory. The reason for adding a sandbox id file is to establish a mapping between the volume and the sandbox using it.
|
||||||
|
Later, when the Kata Containers runtime handles the `get-stats` and `resize` commands, it uses the sandbox id to identify
|
||||||
|
the endpoint of the corresponding `containerd-shim-kata-v2`.
|
||||||
|
|
||||||
|
### containerd-shim-kata-v2 changes
|
||||||
|
`containerd-shim-kata-v2` provides an API for sandbox management through a Unix domain socket. Two new handlers are proposed: `/direct-volume/stats` and `/direct-volume/resize`:
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ curl --unix-socket "$shim_socket_path" -I -X GET 'http://localhost/direct-volume/stats/[urlSafeVolumePath]'
|
||||||
|
$ curl --unix-socket "$shim_socket_path" -I -X POST 'http://localhost/direct-volume/resize' -d '{ "volumePath"": [volumePath], "Size": "123123" }'
|
||||||
|
```
|
||||||
|
|
||||||
|
The shim then forwards the corresponding request to the `kata-agent` to carry out the operations inside the guest VM. For `resize` operation,
|
||||||
|
the Kata runtime also needs to notify the hypervisor to resize the block device (e.g. call `block_resize` in QEMU).
|
||||||
|
|
||||||
|
### Kata agent changes
|
||||||
|
|
||||||
|
The mount spec of a direct-assigned volume is passed to `kata-agent` through the existing `Storage` GRPC object.
|
||||||
|
Two new APIs and three new GRPC objects are added to GRPC protocol between the shim and agent for resizing and getting volume stats:
|
||||||
|
```protobuf
|
||||||
|
|
||||||
|
rpc GetVolumeStats(VolumeStatsRequest) returns (VolumeStatsResponse);
|
||||||
|
rpc ResizeVolume(ResizeVolumeRequest) returns (google.protobuf.Empty);
|
||||||
|
|
||||||
|
message VolumeStatsRequest {
|
||||||
|
// The volume path on the guest outside the container
|
||||||
|
string volume_guest_path = 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
message ResizeVolumeRequest {
|
||||||
|
// Full VM guest path of the volume (outside the container)
|
||||||
|
string volume_guest_path = 1;
|
||||||
|
uint64 size = 2;
|
||||||
|
}
|
||||||
|
|
||||||
|
// This should be kept in sync with CSI NodeGetVolumeStatsResponse (https://github.com/container-storage-interface/spec/blob/v1.5.0/csi.proto)
|
||||||
|
message VolumeStatsResponse {
|
||||||
|
// This field is OPTIONAL.
|
||||||
|
repeated VolumeUsage usage = 1;
|
||||||
|
// Information about the current condition of the volume.
|
||||||
|
// This field is OPTIONAL.
|
||||||
|
// This field MUST be specified if the VOLUME_CONDITION node
|
||||||
|
// capability is supported.
|
||||||
|
VolumeCondition volume_condition = 2;
|
||||||
|
}
|
||||||
|
message VolumeUsage {
|
||||||
|
enum Unit {
|
||||||
|
UNKNOWN = 0;
|
||||||
|
BYTES = 1;
|
||||||
|
INODES = 2;
|
||||||
|
}
|
||||||
|
// The available capacity in specified Unit. This field is OPTIONAL.
|
||||||
|
// The value of this field MUST NOT be negative.
|
||||||
|
uint64 available = 1;
|
||||||
|
|
||||||
|
// The total capacity in specified Unit. This field is REQUIRED.
|
||||||
|
// The value of this field MUST NOT be negative.
|
||||||
|
uint64 total = 2;
|
||||||
|
|
||||||
|
// The used capacity in specified Unit. This field is OPTIONAL.
|
||||||
|
// The value of this field MUST NOT be negative.
|
||||||
|
uint64 used = 3;
|
||||||
|
|
||||||
|
// Units by which values are measured. This field is REQUIRED.
|
||||||
|
Unit unit = 4;
|
||||||
|
}
|
||||||
|
|
||||||
|
// VolumeCondition represents the current condition of a volume.
|
||||||
|
message VolumeCondition {
|
||||||
|
|
||||||
|
// Normal volumes are available for use and operating optimally.
|
||||||
|
// An abnormal volume does not meet these criteria.
|
||||||
|
// This field is REQUIRED.
|
||||||
|
bool abnormal = 1;
|
||||||
|
|
||||||
|
// The message describing the condition of the volume.
|
||||||
|
// This field is REQUIRED.
|
||||||
|
string message = 2;
|
||||||
|
}
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step by step walk-through
|
||||||
|
|
||||||
|
Given the following definition:
|
||||||
|
```YAML
|
||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
|
kind: Pod
|
||||||
|
metadata:
|
||||||
|
name: app
|
||||||
|
spec:
|
||||||
|
runtime-class: kata-qemu
|
||||||
|
containers:
|
||||||
|
- name: app
|
||||||
|
image: centos
|
||||||
|
command: ["/bin/sh"]
|
||||||
|
args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
|
||||||
|
volumeMounts:
|
||||||
|
- name: persistent-storage
|
||||||
|
mountPath: /data
|
||||||
|
volumes:
|
||||||
|
- name: persistent-storage
|
||||||
|
persistentVolumeClaim:
|
||||||
|
claimName: ebs-claim
|
||||||
|
---
|
||||||
|
apiVersion: v1
|
||||||
|
kind: PersistentVolumeClaim
|
||||||
|
metadata:
|
||||||
|
annotations:
|
||||||
|
skip-hostmount: "true"
|
||||||
|
name: ebs-claim
|
||||||
|
spec:
|
||||||
|
accessModes:
|
||||||
|
- ReadWriteOncePod
|
||||||
|
volumeMode: Filesystem
|
||||||
|
storageClassName: ebs-sc
|
||||||
|
resources:
|
||||||
|
requests:
|
||||||
|
storage: 4Gi
|
||||||
|
---
|
||||||
|
kind: StorageClass
|
||||||
|
apiVersion: storage.k8s.io/v1
|
||||||
|
metadata:
|
||||||
|
name: ebs-sc
|
||||||
|
provisioner: ebs.csi.aws.com
|
||||||
|
volumeBindingMode: WaitForFirstConsumer
|
||||||
|
parameters:
|
||||||
|
csi.storage.k8s.io/fstype: ext4
|
||||||
|
|
||||||
|
```
|
||||||
|
Let’s assume that changes have been made in the `aws-ebs-csi-driver` node driver.
|
||||||
|
|
||||||
|
**Node publish volume**
|
||||||
|
1. In the node CSI driver, the `NodePublishVolume` API invokes: `kata-runtime direct-volume add --volume-path "/kubelet/a/b/c/d/sdf" --mount-info "{\"Device\": \"/dev/sdf\", \"fstype\": \"ext4\"}"`.
|
||||||
|
2. The `Kata-runtime` writes the mount-info JSON to a file called `mountInfo.json` under `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
|
||||||
|
|
||||||
|
**Node unstage volume**
|
||||||
|
1. In the node CSI driver, the `NodeUnstageVolume` API invokes: `kata-runtime direct-volume remove --volume-path "/kubelet/a/b/c/d/sdf"`.
|
||||||
|
2. Kata-runtime deletes the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
|
||||||
|
|
||||||
|
**Use the volume in sandbox**
|
||||||
|
1. Upon the request to start a container, the `containerd-shim-kata-v2` examines the container spec,
|
||||||
|
and iterates through the mounts. For each mount, if there is a `mountInfo.json` file under `/run/kata-containers/shared/direct-volumes/[mount source path]`,
|
||||||
|
it generates a `storage` GRPC object after overwriting the mount spec with the information in `mountInfo.json`.
|
||||||
|
2. The shim sends the storage objects to kata-agent through TTRPC.
|
||||||
|
3. The shim writes a file with the sandbox id as the name under `/run/kata-containers/shared/direct-volumes/[mount source path]`.
|
||||||
|
4. The kata-agent mounts the storage objects for the container.
|
||||||
|
|
||||||
|
**Node expand volume**
|
||||||
|
1. In the node CSI driver, the `NodeExpandVolume` API invokes: `kata-runtime direct-volume resize –-volume-path "/kubelet/a/b/c/d/sdf" –-size 8Gi`.
|
||||||
|
2. The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
|
||||||
|
3. The Kata runtime identifies the shim instance through the sandbox id, and sends a GRPC request to resize the volume.
|
||||||
|
4. The shim handles the request, asks the hypervisor to resize the block device and sends a GRPC request to Kata agent to resize the filesystem.
|
||||||
|
5. Kata agent receives the request and resizes the filesystem.
|
||||||
|
|
||||||
|
**Node get volume stats**
|
||||||
|
1. In the node CSI driver, the `NodeGetVolumeStats` API invokes: `kata-runtime direct-volume stats –-volume-path "/kubelet/a/b/c/d/sdf"`.
|
||||||
|
2. The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
|
||||||
|
3. The Kata runtime identifies the shim instance through the sandbox id, and sends a GRPC request to get the volume stats.
|
||||||
|
4. The shim handles the request and forwards it to the Kata agent.
|
||||||
|
5. Kata agent receives the request and returns the filesystem stats.
|
Loading…
Reference in New Issue
Block a user