# Motivation
Today, there exist a few gaps between Container Storage Interface (CSI) and virtual machine (VM) based runtimes such as Kata Containers
that prevent them from working together smoothly.

First, it’s cumbersome to use a persistent volume (PV) with Kata Containers. Today, for a PV with Filesystem volume mode, Virtio-fs
is the only way to surface it inside a Kata Containers guest VM. But mounting the filesystem (FS) within the guest operating system (OS) is
often desired due to performance benefits, availability of native FS features and security benefits over the Virtio-fs mechanism.

Second, it’s difficult if not impossible to resize a PV online with Kata Containers. While a PV can be expanded on the host OS,
the updated metadata needs to be propagated to the guest OS in order for the application container to use the expanded volume.
Currently, there is no way to propagate the PV metadata from the host OS to the guest OS without restarting the Pod sandbox.

# Proposed Solution

Because of the OS boundary, these features cannot be implemented in the CSI node driver plugin running on the host OS
as is normally done for runc containers. Instead, they can be done by the Kata Containers agent inside the guest OS,
but this requires the CSI driver to pass the relevant information to the Kata Containers runtime.
An ideal long-term solution would be to have the `kubelet` coordinate the communication between the CSI driver and
the container runtime, as described in [KEP-2857](https://github.com/kubernetes/enhancements/pull/2893/files).
However, as the KEP is still under review, we would like to propose a short/medium-term solution to unblock our use case.

The proposed solution is built on top of a previous [proposal](https://github.com/egernst/kata-containers/blob/da-proposal/docs/design/direct-assign-volume.md)
described by Eric Ernst. The previous proposal has two gaps:

1. Writing a `csiPlugin.json` file to the volume root path introduces a security risk. A malicious user can gain unauthorized
access to a block device by writing their own `csiPlugin.json` to the above location through an ephemeral CSI plugin.

2. The proposal didn't describe how to establish a mapping between a volume and a Kata sandbox, which is needed for
implementing the CSI volume resize and volume stat collection APIs.

This document focuses on how to address these two gaps.

## Assumptions and Limitations
1. The proposal assumes that a block device volume will only be used by one Pod on a node at a time, which we believe
is the most common pattern in Kata Containers use cases. It’s also unsafe to have the same block device attached to more than
one Kata pod. In the context of Kubernetes, the `PersistentVolumeClaim` (PVC) needs to list `ReadWriteOncePod` in its `accessModes`.
2. More advanced Kubernetes volume features such as `fsGroup`, `fsGroupChangePolicy`, and `subPath` are not supported.

## End User Interface

1. The user specifies a PV as a direct-assigned volume. How a PV is specified as a direct-assigned volume is left for each CSI implementation to decide.
   There are a few options for reference:
    1. A storage class parameter specifies whether it's a direct-assigned volume. This avoids any lookups of PVC
    or Pod information from the CSI plugin (as the external provisioner takes care of these). However, all PVs in the storage class with the parameter set
    will have host mounts skipped.
    2. Use a PVC annotation. This approach requires the CSI plugin to have `--extra-create-metadata` [set](https://kubernetes-csi.github.io/docs/external-provisioner.html#persistentvolumeclaim-and-persistentvolume-parameters)
    to be able to perform a lookup of the PVC annotations from the API server. Pro: API server lookup of annotations is only required during creation of the PV.
    Con: The CSI plugin will always skip host mounting of the PV.
    3. The CSI plugin can also look up the pod `runtimeclass` during `NodePublish`. This approach can be found in the [Alibaba Cloud CSI plugin](https://github.com/kubernetes-sigs/alibaba-cloud-csi-driver/blob/master/pkg/disk/nodeserver.go#L248).
2. The CSI node driver delegates the direct-assigned volume to the Kata Containers runtime. The CSI node driver APIs need to
   be modified to pass the volume mount information to, and collect volume information from, the Kata Containers runtime by invoking `kata-runtime` command-line commands.
    * **NodePublishVolume** -- It invokes `kata-runtime direct-volume add --volume-path [volumePath] --mount-info [mountInfo]`
    to propagate the volume mount information to the Kata Containers runtime for it to carry out the filesystem mount operation.
    The `volumePath` is the [target_path](https://github.com/container-storage-interface/spec/blob/master/csi.proto#L1364) in the CSI `NodePublishVolumeRequest`.
    The `mountInfo` is a serialized JSON string.
    * **NodeGetVolumeStats** -- It invokes `kata-runtime direct-volume stats --volume-path [volumePath]` to retrieve the filesystem stats of the direct-assigned volume.
    * **NodeExpandVolume** -- It invokes `kata-runtime direct-volume resize --volume-path [volumePath] --size [size]` to send a resize request to the Kata Containers runtime to
    resize the direct-assigned volume.
    * **NodeStageVolume/NodeUnstageVolume** -- It invokes `kata-runtime direct-volume remove --volume-path [volumePath]` to remove the persisted metadata of a direct-assigned volume.

The `mountInfo` object is defined as follows:
```Golang
type MountInfo struct {
	// The type of the volume (i.e. block)
	VolumeType string `json:"volume-type"`
	// The device backing the volume.
	Device string `json:"device"`
	// The filesystem type to be mounted on the volume.
	FsType string `json:"fstype"`
	// Additional metadata to pass to the agent regarding this volume.
	Metadata map[string]string `json:"metadata,omitempty"`
	// Additional mount options.
	Options []string `json:"options,omitempty"`
}
```
Note: given that the `mountInfo` is persisted to disk by the Kata runtime, it shouldn't contain any secrets (such as an SMB mount password).

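As an illustration of the `NodePublishVolume` step, the following is a minimal sketch of how a CSI node driver might serialize a `MountInfo` value and assemble the arguments for `kata-runtime direct-volume add`. The helper name `addVolumeArgs` is hypothetical and not part of any driver; the sketch only builds the argument list and leaves executing the command to the caller.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MountInfo mirrors the struct defined in this document.
type MountInfo struct {
	VolumeType string            `json:"volume-type"`
	Device     string            `json:"device"`
	FsType     string            `json:"fstype"`
	Metadata   map[string]string `json:"metadata,omitempty"`
	Options    []string          `json:"options,omitempty"`
}

// addVolumeArgs (hypothetical helper) builds the argument list for
// `kata-runtime direct-volume add`. It does not run the command.
func addVolumeArgs(volumePath string, info MountInfo) ([]string, error) {
	raw, err := json.Marshal(info)
	if err != nil {
		return nil, err
	}
	return []string{
		"direct-volume", "add",
		"--volume-path", volumePath,
		"--mount-info", string(raw),
	}, nil
}

func main() {
	args, err := addVolumeArgs("/kubelet/a/b/c/d/sdf", MountInfo{
		VolumeType: "block",
		Device:     "/dev/sdf",
		FsType:     "ext4",
	})
	if err != nil {
		panic(err)
	}
	// The slice can be handed to exec.Command("kata-runtime", args...).
	fmt.Println(args)
}
```

Passing the JSON as a single argument avoids shell quoting problems, and the `omitempty` tags keep unset `metadata` and `options` out of the persisted file.
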
## Implementation Details

### Kata runtime
Instead of the CSI node driver writing the mount info into a `csiPlugin.json` file under the volume root,
as described in the original proposal, here we propose that the CSI node driver passes the mount information to
the Kata Containers runtime through a new `kata-runtime` command-line command. The `kata-runtime` then writes the mount
information to a `mountInfo.json` file in a predefined location (`/run/kata-containers/shared/direct-volumes/[volume_path]/`).

When the Kata Containers runtime starts a container, it verifies whether a volume mount is a direct-assigned volume by checking
whether there is a `mountInfo.json` file under the computed Kata `direct-volumes` directory. If there is, the runtime parses the `mountInfo.json` file
and updates the mount spec with the data in `mountInfo`. The updated mount spec is then passed to the Kata agent in the guest VM together
with other mounts. The Kata Containers runtime also creates a file named after the sandbox id under the `direct-volumes/[volume_path]/`
directory. The reason for adding a sandbox id file is to establish a mapping between the volume and the sandbox using it.
Later, when the Kata Containers runtime handles the `stats` and `resize` commands, it uses the sandbox id to identify
the endpoint of the corresponding `containerd-shim-kata-v2`.

### containerd-shim-kata-v2 changes
`containerd-shim-kata-v2` provides an API for sandbox management through a Unix domain socket. Two new handlers are proposed: `/direct-volume/stats` and `/direct-volume/resize`.

Example:

```bash
$ curl --unix-socket "$shim_socket_path" -I -X GET 'http://localhost/direct-volume/stats/[urlSafeVolumePath]'
$ curl --unix-socket "$shim_socket_path" -I -X POST 'http://localhost/direct-volume/resize' -d '{ "volumePath": "[volumePath]", "Size": "123123" }'
```

The shim then forwards the corresponding request to the `kata-agent` to carry out the operations inside the guest VM. For the `resize` operation,
the Kata runtime also needs to notify the hypervisor to resize the block device (e.g. call `block_resize` in QEMU).

### Kata agent changes

The mount spec of a direct-assigned volume is passed to `kata-agent` through the existing `Storage` gRPC object.
Two new APIs and three new gRPC objects are added to the gRPC protocol between the shim and agent for resizing and getting volume stats:
```protobuf

rpc GetVolumeStats(VolumeStatsRequest) returns (VolumeStatsResponse);
rpc ResizeVolume(ResizeVolumeRequest) returns (google.protobuf.Empty);

message VolumeStatsRequest {
    // The volume path on the guest outside the container
    string volume_guest_path = 1;
}

message ResizeVolumeRequest {
    // Full VM guest path of the volume (outside the container)
    string volume_guest_path = 1;
    uint64 size = 2;
}

// This should be kept in sync with CSI NodeGetVolumeStatsResponse (https://github.com/container-storage-interface/spec/blob/v1.5.0/csi.proto)
message VolumeStatsResponse {
    // This field is OPTIONAL.
    repeated VolumeUsage usage = 1;
    // Information about the current condition of the volume.
    // This field is OPTIONAL.
    // This field MUST be specified if the VOLUME_CONDITION node
    // capability is supported.
    VolumeCondition volume_condition = 2;
}
message VolumeUsage {
    enum Unit {
        UNKNOWN = 0;
        BYTES = 1;
        INODES = 2;
    }
    // The available capacity in specified Unit. This field is OPTIONAL.
    // The value of this field MUST NOT be negative.
    uint64 available = 1;

    // The total capacity in specified Unit. This field is REQUIRED.
    // The value of this field MUST NOT be negative.
    uint64 total = 2;

    // The used capacity in specified Unit. This field is OPTIONAL.
    // The value of this field MUST NOT be negative.
    uint64 used = 3;

    // Units by which values are measured. This field is REQUIRED.
    Unit unit = 4;
}

// VolumeCondition represents the current condition of a volume.
message VolumeCondition {

    // Normal volumes are available for use and operating optimally.
    // An abnormal volume does not meet these criteria.
    // This field is REQUIRED.
    bool abnormal = 1;

    // The message describing the condition of the volume.
    // This field is REQUIRED.
    string message = 2;
}

```

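As a rough illustration of how the `usage` entries of a `VolumeStatsResponse` could be filled in, the sketch below derives one `BYTES` entry and one `INODES` entry from the counters that a `statfs(2)`-style call reports. The Go types are hand-written stand-ins for the generated protobuf code, and `fsCounters` and `usageFromCounters` are hypothetical names; how the agent actually gathers the counters is not specified here.

```go
package main

import "fmt"

// Unit mirrors the VolumeUsage.Unit enum from the protobuf definition.
type Unit int

const (
	Unknown Unit = iota
	Bytes
	Inodes
)

// VolumeUsage is a hand-written stand-in for the generated message type.
type VolumeUsage struct {
	Available uint64
	Total     uint64
	Used      uint64
	Unit      Unit
}

// fsCounters (hypothetical) holds the values a statfs(2)-style call reports.
type fsCounters struct {
	BlockSize   uint64 // bytes per block
	Blocks      uint64 // total blocks
	BlocksFree  uint64 // free blocks, including reserved ones
	BlocksAvail uint64 // free blocks available to unprivileged users
	Files       uint64 // total inodes
	FilesFree   uint64 // free inodes
}

// usageFromCounters converts raw counters into one BYTES entry and
// one INODES entry. It assumes the free counts never exceed the totals.
func usageFromCounters(c fsCounters) []VolumeUsage {
	return []VolumeUsage{
		{
			Available: c.BlocksAvail * c.BlockSize,
			Total:     c.Blocks * c.BlockSize,
			Used:      (c.Blocks - c.BlocksFree) * c.BlockSize,
			Unit:      Bytes,
		},
		{
			Available: c.FilesFree,
			Total:     c.Files,
			Used:      c.Files - c.FilesFree,
			Unit:      Inodes,
		},
	}
}

func main() {
	usage := usageFromCounters(fsCounters{
		BlockSize: 4096, Blocks: 1000, BlocksFree: 400, BlocksAvail: 350,
		Files: 100, FilesFree: 90,
	})
	fmt.Printf("%+v\n", usage)
}
```

Because some free blocks are reserved for privileged users, `Available` can be smaller than `Total - Used`, which is why the message carries all three values.
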
### Step-by-step walk-through

Given the following definition:
```YAML
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  runtimeClassName: kata-qemu
  containers:
  - name: app
    image: centos
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: ebs-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    skip-hostmount: "true"
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOncePod
  volumeMode: Filesystem
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  csi.storage.k8s.io/fstype: ext4

```
Let’s assume that changes have been made in the `aws-ebs-csi-driver` node driver.

**Node publish volume**
1. In the node CSI driver, the `NodePublishVolume` API invokes: `kata-runtime direct-volume add --volume-path "/kubelet/a/b/c/d/sdf" --mount-info "{\"device\": \"/dev/sdf\", \"fstype\": \"ext4\"}"`.
2. The `kata-runtime` writes the mount-info JSON to a file called `mountInfo.json` under `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.

**Node unstage volume**
1. In the node CSI driver, the `NodeUnstageVolume` API invokes: `kata-runtime direct-volume remove --volume-path "/kubelet/a/b/c/d/sdf"`.
2. The `kata-runtime` deletes the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.

**Use the volume in sandbox**
1. Upon the request to start a container, the `containerd-shim-kata-v2` examines the container spec
and iterates through the mounts. For each mount, if there is a `mountInfo.json` file under `/run/kata-containers/shared/direct-volumes/[mount source path]`,
it generates a `Storage` gRPC object after overwriting the mount spec with the information in `mountInfo.json`.
2. The shim sends the storage objects to the `kata-agent` through ttRPC.
3. The shim writes a file with the sandbox id as the name under `/run/kata-containers/shared/direct-volumes/[mount source path]`.
4. The `kata-agent` mounts the storage objects for the container.

**Node expand volume**
1. In the node CSI driver, the `NodeExpandVolume` API invokes: `kata-runtime direct-volume resize --volume-path "/kubelet/a/b/c/d/sdf" --size 8Gi`.
2. The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
3. The Kata runtime identifies the shim instance through the sandbox id, and sends a gRPC request to resize the volume.
4. The shim handles the request, asks the hypervisor to resize the block device and sends a gRPC request to the Kata agent to resize the filesystem.
5. The Kata agent receives the request and resizes the filesystem.

**Node get volume stats**
1. In the node CSI driver, the `NodeGetVolumeStats` API invokes: `kata-runtime direct-volume stats --volume-path "/kubelet/a/b/c/d/sdf"`.
2. The Kata runtime checks whether there is a sandbox id file under the directory `/run/kata-containers/shared/direct-volumes/kubelet/a/b/c/d/sdf`.
3. The Kata runtime identifies the shim instance through the sandbox id, and sends a gRPC request to get the volume stats.
4. The shim handles the request and forwards it to the Kata agent.
5. The Kata agent receives the request and returns the filesystem stats.