As virtio-scsi has been set as the default block device driver, the
runtime also needs to correctly handle the virtio-scsi info, especially
the SCSI address required by the kata-agent handling logic.
Getting the scsi_addr and assigning it to the kata agent device id
is enough. This commit does just that.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit introduces generic support for running the VMM in rootless mode in runtime-rs:
1. Detect whether the VMM is running in rootless mode.
2. Before starting the VMM process, create a non-root user and launch the VMM with that user’s UID and GID; also add the KVM user's group ID to the VMM process's supplementary groups so the VMM process can access /dev/kvm.
3. Set up the rootless directory under /run/user/<uid>, and turn some path variables into functions that return the path with the rootless directory prefix when running in rootless mode.
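A minimal sketch of step 3, turning a path constant into a function that adds the rootless prefix. The names `DEFAULT_RUN_DIR`, `rootless_dir`, and `run_dir` are hypothetical and only illustrate the idea; the real helpers in runtime-rs may differ.

```rust
use std::path::PathBuf;

/// Hypothetical run directory used when the VMM is not rootless.
const DEFAULT_RUN_DIR: &str = "/run/kata";

/// Returns the per-user rootless directory: /run/user/<uid>.
fn rootless_dir(uid: u32) -> PathBuf {
    PathBuf::from("/run/user").join(uid.to_string())
}

/// Returns the run directory, prefixed with the rootless directory
/// when `rootless` is true.
fn run_dir(rootless: bool, uid: u32) -> PathBuf {
    if rootless {
        // Strip the leading '/' so that join() appends instead of replacing.
        rootless_dir(uid).join(DEFAULT_RUN_DIR.trim_start_matches('/'))
    } else {
        PathBuf::from(DEFAULT_RUN_DIR)
    }
}
```

Callers that used to read a constant would call `run_dir(...)` instead, so the same code path works in both modes.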
Fixes: #11414
Signed-off-by: stevenfryto <sunzitai_1832@bupt.edu.cn>
Correctly set dir's permissions and mode. This update ensures:
The dir_mode field of CopyFileRequest is set to DIR_MODE_PERMS
(equivalent to Go's 0o750 | os.ModeDir), which is primarily used for the
top-level directory creation permissions.
The file_mode field now directly uses metadata.mode() (equivalent to
Go's st.Mode) for the target entry.
This change aims to resolve potential permission issues or inconsistencies
during directory and file creation within the guest environment by precisely
matching the expected mode propagation of the Kata agent.
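The mode handling can be sketched roughly as follows. The helper `copy_modes` and the constant `GO_MODE_DIR` are hypothetical names for illustration; only `DIR_MODE_PERMS` and the use of `metadata.mode()` come from the description above.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// Go's os.ModeDir is the top bit of a 32-bit file mode.
const GO_MODE_DIR: u32 = 1 << 31;
/// Mode for top-level directory creation: 0o750 | os.ModeDir.
const DIR_MODE_PERMS: u32 = 0o750 | GO_MODE_DIR;

/// Returns (dir_mode, file_mode) for a copy request on `path`
/// (hypothetical helper, not the actual runtime-rs function).
fn copy_modes(path: &Path) -> io::Result<(u32, u32)> {
    let metadata = fs::metadata(path)?;
    // file_mode is taken directly from the entry's metadata.
    Ok((DIR_MODE_PERMS, metadata.mode()))
}
```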
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The core purpose of introducing volume_manager to VolumeResource is to
centralize the management of shared file system volumes. By creating a
single VolumeManager instance within VolumeResource, all shared file
volumes are managed by one central entity. This single volume_manager
can accurately track the references of all ShareFsVolume instances to
the shared volumes, ensuring correct reference counting, proper volume
lifecycle management, and preventing issues like volumes being
overwritten.
This new design ensures that all shared volumes are managed by a central
entity, which:
(1) Guarantees correct reference counting.
(2) Manages the volume lifecycle correctly, avoiding issues like volumes
being overwritten.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit integrates the new `VolumeManager` into the `ShareFsVolume`
lifecycle. Instead of directly copying files, `ShareFsVolume::new` now
uses the `VolumeManager` to get a guest path and determine if the volume
needs to be copied. It also updates the `cleanup` function to release
the volume's reference count, allowing the `VolumeManager` to manage its
state and clean up resources when no longer in use.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit introduces a new `VolumeManager` to track the state of shared
volumes, including their reference counts and the corresponding container
ids.
The manager's goal is to handle the lifecycle of shared filesystem volumes,
including:
(1) Volume State Tracking: Tracks the mapping from host source paths to guest
destination paths.
(2) Reference Counting: Manages reference counts for each volume, preventing
premature cleanup when multiple containers share the same source.
(3) Deterministic guest paths: Generates unique guest paths using a random string
to avoid naming conflicts.
(4) Improved Management: Provides a centralized way to handle volume creation,
copying, and release, including aborting file watchers when volumes are no longer
in use.
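The reference counting described above could look roughly like the sketch below. All field and method names are illustrative, the guest path prefix is hypothetical, and a counter stands in for the random string so the example stays dependency-free; aborting file watchers is omitted.

```rust
use std::collections::HashMap;

/// State kept for one shared volume (field names are illustrative).
struct VolumeState {
    guest_path: String,
    ref_count: usize,
    containers: Vec<String>,
}

#[derive(Default)]
struct VolumeManager {
    /// Maps a host source path to its state.
    volumes: HashMap<String, VolumeState>,
    /// Stand-in for the random suffix used by the real implementation.
    next_id: u64,
}

impl VolumeManager {
    /// Returns the guest path for `source` and whether the caller still
    /// has to copy the volume (true only for the first reference).
    fn get_or_create(&mut self, source: &str, cid: &str) -> (String, bool) {
        if let Some(state) = self.volumes.get_mut(source) {
            state.ref_count += 1;
            state.containers.push(cid.to_string());
            return (state.guest_path.clone(), false);
        }
        self.next_id += 1;
        // Hypothetical guest path layout.
        let guest_path = format!("/guest/volumes/vol-{}", self.next_id);
        let state = VolumeState {
            guest_path: guest_path.clone(),
            ref_count: 1,
            containers: vec![cid.to_string()],
        };
        self.volumes.insert(source.to_string(), state);
        (guest_path, true)
    }

    /// Drops one reference held by `cid`; returns true when the last
    /// reference is gone and the volume can be cleaned up.
    fn release(&mut self, source: &str, cid: &str) -> bool {
        let last = match self.volumes.get_mut(source) {
            Some(state) => {
                if let Some(pos) = state.containers.iter().position(|c| c == cid) {
                    state.containers.remove(pos);
                }
                state.ref_count = state.ref_count.saturating_sub(1);
                state.ref_count == 0
            }
            None => return false,
        };
        if last {
            self.volumes.remove(source);
        }
        last
    }
}
```

With this shape, two containers sharing the same source get the same guest path, only the first one triggers a copy, and cleanup happens only when the second one releases it.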
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit refactors the `CopyFile` related code to streamline the
logic for creating guest directories and make the code structure
clearer.
Its main goal is to improve the overall maintainability and facilitate
future feature extensions.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit is designed to perform a full sync before starting monitoring
to ensure that files which exist before monitoring starts are also synced.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit enhances control over block device AIO modes via hotplug.
Previously, hot-plugged block devices always used the default AIO mode (io_uring).
Even if users changed the AIO mode in the configuration file, the change was
not applied to individual block devices.
With this update, users can explicitly configure the AIO mode for hot-plugged
block devices via the configuration, and the setting is applied correctly.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We need more information about the block device, so replace the original
method get_block_driver with get_block_device_info, which returns a
BlockDeviceInfo.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Docker containers support specifying the shm size using the --shm-size
option and support sandbox-level shm volumes, so we've added support for
shm volumes. Since Kubernetes doesn't support specifying the shm size,
it typically uses a memory-based emptydir as the container's shm, and
its size can be specified.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
The previous configuration used overlayfs as the fstype, which is incorrect
and causes agent policy validation failures. It also differs from
runtime-go's configuration. This patch corrects the fstype to overlay,
aligning with runtime-go on this matter.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Mount validation for sealed secret requires the base path to start with
`/run/kata-containers/shared/containers`. Previously, it used
`/run/kata-containers/sandbox/passthrough`, which caused test
failures where volume mounts are used.
This commit renames the path to satisfy the validation check.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
- Set guest Storage.options for block rootfs to empty (do not propagate host mount options).
- Align behavior with Go runtime: only add xfs nouuid when needed.
Signed-off-by: Caspian443 <scrisis843@gmail.com>
Also included (as commented out) is a test that does not pass although
it should. See source code comment for explanation why fixing this seems
beyond the scope of this PR.
Signed-off-by: Pavel Mores <pmores@redhat.com>
This commit focuses purely on the formal change of type. If any subsequent
changes in semantics are needed they are purposely avoided here so that the
commit can be reviewed as a 100% formal and 0% semantic change.
Signed-off-by: Pavel Mores <pmores@redhat.com>
This commit addresses a part of the same problem as PR #7623 did for the
golang runtime. So far we've been rounding up individual containers'
vCPU requests and then summing them up, which can lead to allocation of
excess vCPUs as described in the mentioned PR's cover letter. We address
this by reversing the order of operations: we sum the (possibly fractional)
container requests and only then round up the total.
We also align runtime-rs's behaviour with runtime-go in that we now
include the default vcpu request from the config file ('default_vcpu')
in the total.
We diverge from PR #7623 in that `default_vcpu` is still treated as an
integer (this will be a topic of a separate commit), and that this
implementation avoids relying on 32-bit floating point arithmetic as there
are some potential problems with using f32. For instance, some numbers
commonly used in decimal, notably all of single-decimal-digit numbers
0.1, 0.2 .. 0.9 except 0.5, are periodic in binary and thus fundamentally
not representable exactly. Arithmetic performed on such numbers can lead
to surprising results, e.g. adding 0.1 ten times gives 1.0000001, not 1,
and taking a ceil() results in 2, clearly a wrong answer in vcpu
allocation.
So instead, we take advantage of the fact that container requests happen
to be expressed as a quota/period fraction so we can sum up quotas,
fundamentally integral numbers (possibly fractional only due to the need
to rewrite them with a common denominator) with much less danger of
precision loss.
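The integer-based summation can be sketched as below. This is an illustration under stated assumptions, not the actual runtime-rs code: the function names are hypothetical, each request is assumed to be a `(quota, period)` pair, and `default_vcpu` is assumed to be added to the rounded-up total.

```rust
/// Greatest common divisor, used to keep the running fraction reduced.
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Sums quota/period fractions exactly and rounds up only the total.
fn total_vcpus(requests: &[(u64, u64)], default_vcpu: u64) -> u64 {
    // Running sum kept as an exact fraction num/den.
    let (mut num, mut den) = (0u64, 1u64);
    for &(quota, period) in requests {
        if period == 0 {
            continue; // no CPU limit expressed for this container
        }
        // num/den + quota/period, rewritten with a common denominator.
        num = num * period + quota * den;
        den *= period;
        let g = gcd(num, den);
        num /= g;
        den /= g;
    }
    // Integer ceiling of num/den, plus the configured default.
    default_vcpu + (num + den - 1) / den
}

/// The previous behaviour, for comparison: round each request up, then sum.
fn total_vcpus_round_each(requests: &[(u64, u64)]) -> u64 {
    requests
        .iter()
        .filter(|&&(_, period)| period != 0)
        .map(|&(quota, period)| (quota + period - 1) / period)
        .sum()
}
```

For two containers each requesting 1.5 vCPUs (quota 150000, period 100000), rounding each request first yields 4, while summing first yields 3. Ten requests of 0.1 vCPU sum to exactly 1 with no floating point involved.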
Signed-off-by: Pavel Mores <pmores@redhat.com>
Add full cgroups support on host. Cgroups are managed by `FsManager` and
`SystemdManager`. As the names imply, the `FsManager` manages cgroups
through cgroupfs, while the `SystemdManager` manages cgroups through
systemd. Both managers support cgroup v1 and cgroup v2.
Two types of cgroups path are supported:
1. For colon paths, for example "foo.slice:bar:baz", the runtime manages
cgroups by `SystemdManager`;
2. For relative/absolute paths, the runtime manages cgroups by
`FsManager`.
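The selection between the two managers can be illustrated with the sketch below. The enum and function are hypothetical, and checking for a colon is a simplification of whatever validation the runtime actually performs on colon paths.

```rust
/// Which manager handles a given cgroups path (illustrative type).
#[derive(Debug, PartialEq)]
enum CgroupManagerKind {
    Systemd,
    Fs,
}

/// Picks the manager from the shape of the cgroups path.
fn manager_for(cgroups_path: &str) -> CgroupManagerKind {
    // Colon paths such as "foo.slice:bar:baz" go to the systemd manager;
    // relative and absolute paths go to the cgroupfs manager.
    if cgroups_path.contains(':') {
        CgroupManagerKind::Systemd
    } else {
        CgroupManagerKind::Fs
    }
}
```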
With cgroup v1 + cgroupfs, vCPU threads are added into the sandbox cgroups.
With the other combinations (cgroup v1 + systemd, cgroup v2 + cgroupfs, and
cgroup v2 + systemd), the VMM process is added into the cgroups.
systemd doesn't provide a way to add a thread to a unit, so `add_thread()`
in `SystemdManager` is equivalent to `add_process()`.
Cgroup v2 supports threaded mode. However, we should enable threaded mode
from leaf node to the root node (`/`) iteratively [1]. This means the
runtime needs to modify the cgroups created by container runtime (e.g.
containerd). Considering cgroupfs + cgroup v2 is not a common combination,
its behavior is aligned with systemd + cgroup v2, which does not allow
managing processes at the thread level.
[1]: https://www.kernel.org/doc/html/v4.18/admin-guide/cgroup-v2.html#threads
Fixes: #11356
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
For several reasons, this should first be aligned with runtime-go; this
commit does that work.
Fixes: #11543
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Support handling SCSI devices when the block device driver is `scsi`,
and ensure the storage source is set correctly with the LUN.
Fixes: #11516
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
The init data could not be read properly within kata-agent because the
data length field was omitted, a consequence of a mismatch in the data
write format.
Fixes: #11556
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce a const value `KATA_VIRTUAL_VOLUME_PREFIX` defined in libs/kata-types;
it is better to import the const from there.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
The previous commit 74eccc54e7b31cc4c9abd8b6e4007c3a4c1d4dd4
failed to return the right rootfs volume.
In the is_block_rootfs fn, if the rootfs is based on a
block device such as devicemapper, the volume's source
should be cleared so that the device_manager uses the
dev_id to get the device's host path.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
containerd's Blockfile Snapshotter passes
rootfs mounts with a raw file as the mount source
and mount options with "loop" embedded.
To support this type of rootfs, it is necessary to identify it as a
blockfile rootfs through the "loop" flag, and then use the volume source
of the rootfs as the source of the block device to hot-plug it into
the guest.
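The detection can be sketched as follows. The `Mount` struct and both functions are hypothetical stand-ins for the runtime's own types; the sketch only shows the "loop" check and the reuse of the mount source.

```rust
/// Simplified rootfs mount as passed by the snapshotter (illustrative type).
struct Mount {
    source: String,
    options: Vec<String>,
}

/// A rootfs is treated as a blockfile rootfs when "loop" is among its options.
fn is_blockfile_rootfs(mount: &Mount) -> bool {
    mount.options.iter().any(|opt| opt == "loop")
}

/// Returns the raw file to use as the block device source, if applicable.
fn block_device_source(mount: &Mount) -> Option<&str> {
    if is_blockfile_rootfs(mount) {
        Some(mount.source.as_str())
    } else {
        None
    }
}
```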
Fixes: #11464
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
To correctly manage initdata as a block device, a new InitData
Resource type, inherently a block device, has been introduced
within the ResourceManager. As a component of the Sandbox's
resources, this InitData Resource needs to be appropriately
handled by the Device Manager's handler.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Container image integrity protection is a critical practice involving a
multi-layered defense mechanism. While container images inherently offer
basic integrity verification through Content-Addressable Storage (CAS)
(ensuring pulled content matches stored hashes), a combination of other
measures is crucial for production environments. These layers include:
Encrypted Transport (HTTPS/TLS) to prevent tampering during transfer;
Image Signing to confirm the image originates from a trusted source;
Vulnerability Scanning to ensure the image content is "healthy"; and
Trusted Registries with stringent access controls.
In certain scenarios, such as when container image confidentiality
requirements are not stringent, and integrity is already ensured via the
aforementioned mechanisms (especially CAS and HTTPS/TLS), adopting
"force guest pull" can be a viable option. This implies that even when
pulling images from a container registry, their integrity remains
guaranteed through content hashes and other built-in mechanisms, without
relying on additional host-side verification or specialized transfer
methods.
Since this feature is already available in runtime-go and offers
synergistic benefits with guest pull, we have chosen to support force
guest pull.
Fixes: #10690
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
In CoCo scenarios, images are not pulled on the host side and such
operations are disabled. In other words, no files are shared between
host and guest, in particular for the container rootfs.
We introduce Kata Virtual Volume to help handle such cases:
(1) Introduce is_kata_virtual_volume to check whether the volume is a Kata
virtual volume.
(2) Introduce VirtualVolume handling logic in handle_rootfs for when the
mount is a Kata virtual volume.
Fixes: #10690
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This commit introduces comprehensive support for rootfs mount management
through Kata Virtual Volumes, specifically enabling the guest-pull
mechanism.
It enhances the runtime's ability to:
(1) Extract image references from container annotations (CRI/CRI-O).
(2) Process `KataVirtualVolume` objects, configuring them for guest-pull operations.
(3) Set up the agent's storage for guest-pulled images.
This functionality streamlines the process of pulling container images
directly within the guest for rootfs, aligning with guest-side image management strategies.
Fixes: #10690
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
We're switching to using a rev as it may take some time for the package
to be updated on crates.io.
Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
To delete a cgroup, all of the tasks/procs in the cgroup must first be
moved into the root cgroup. Since cgroup v2 doesn't support moving
threads into the root cgroup, moving the processes instead of the
tasks fixes this issue.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
For security reasons, we have restricted directory copying.
Introduce the `is_allowlisted_copy_volume` function to verify
whether a given volume path is in an allowed copy directory.
This enhances security by ensuring only permitted volumes are
copied.
Currently, only directories under the path
`/var/lib/kubelet/pods/<uid>/volumes/{kubernetes.io~configmap,
kubernetes.io~secret, kubernetes.io~downward-api,
kubernetes.io~projected}` are allowed to be copied into the
guest. Copying of other directories will be prohibited.
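A minimal sketch of such an allowlist check, assuming the path layout given above. The body is an illustration and may not match the actual `is_allowlisted_copy_volume`; rejecting `..` components is an extra precaution added here.

```rust
use std::path::{Component, Path};

/// Volume types whose contents may be copied into the guest.
const ALLOWED_VOLUME_TYPES: [&str; 4] = [
    "kubernetes.io~configmap",
    "kubernetes.io~secret",
    "kubernetes.io~downward-api",
    "kubernetes.io~projected",
];

/// Returns true when `path` lies under
/// /var/lib/kubelet/pods/<uid>/volumes/<allowed type>/.
fn is_allowlisted_copy_volume(path: &Path) -> bool {
    if !path.is_absolute() {
        return false;
    }
    let mut parts: Vec<&str> = Vec::new();
    for component in path.components() {
        match component {
            Component::Normal(s) => match s.to_str() {
                Some(s) => parts.push(s),
                None => return false, // non-UTF-8 component
            },
            // Reject ".." so a path cannot escape the allowed directory.
            Component::ParentDir => return false,
            _ => {}
        }
    }
    // Expected layout: var lib kubelet pods <uid> volumes <type> <name...>
    parts.len() > 7
        && parts.starts_with(&["var", "lib", "kubelet", "pods"])
        && parts[5] == "volumes"
        && ALLOWED_VOLUME_TYPES.contains(&parts[6])
}
```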
Fixes: #11237
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce an event-driven file sync mechanism between host and guest when
sharedfs is disabled, which monitors the host path and syncs file changes
in a timely manner:
1. Introduce FsWatcher to monitor directory changes via inotify;
2. Support recursive watching with configurable filters;
3. Add debounce logic (default 500ms cooldown) to handle burst events;
4. Trigger `copy_dir_recursively` on stable state;
5. Handle CREATE/MODIFY/DELETE/MOVED/CLOSE_WRITE events;
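The debounce step (item 3) can be sketched as below. The `Debouncer` type is hypothetical and takes timestamps as arguments so the logic is easy to follow; the inotify handling and the actual copy are left out.

```rust
use std::time::{Duration, Instant};

/// Collapses a burst of file events into a single sync (illustrative type).
struct Debouncer {
    cooldown: Duration,
    last_event: Option<Instant>,
}

impl Debouncer {
    fn new(cooldown: Duration) -> Self {
        Debouncer { cooldown, last_event: None }
    }

    /// Records a file event; each new event restarts the cooldown.
    fn record_event(&mut self, now: Instant) {
        self.last_event = Some(now);
    }

    /// Returns true once per burst, after the cooldown has passed with no
    /// new events; the caller would then trigger the recursive copy.
    fn should_sync(&mut self, now: Instant) -> bool {
        match self.last_event {
            Some(t) if now.saturating_duration_since(t) >= self.cooldown => {
                self.last_event = None;
                true
            }
            _ => false,
        }
    }
}
```

With a 500ms cooldown, events at 0ms and 200ms followed by silence lead to a single sync at or after 700ms.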
Fixes: #11237
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
In Kubernetes (k8s), while Kata Pods often use virtiofs for injecting
Service Accounts, Secrets, and ConfigMaps, security-sensitive
environments like CoCo disable host-guest sharing. Consequently, when
SharedFs is disabled, we propagate these configurations into the guest
via file copy and bind mount for correct container access.
Fixes: #11237
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
A new resource type `PortDevice` is introduced, dedicated to
handling root ports/switch ports during sandbox (VM) creation.
Fixes: #10361
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>