Commit Graph

1223 Commits

Author SHA1 Message Date
Champ-Goblem
ef642fe890 runtime: fix cgroupv2 deletion when sandbox_cgroup_only=false
Currently, when a new sandbox resource controller is created with cgroupsv2 and sandbox_cgroup_only is disabled,
the cgroup management falls back to cgroupfs. During deletion, `IsSystemdCgroup` checks if the path contains `:`
and tries to delete the cgroup via systemd. However, the cgroup was originally set up via cgroupfs and this process
fails with `lstat /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/....scope: no such file or directory`.

This patch updates the deletion logic to take into account the sandbox_cgroup_only=false option and in this case uses
the cgroupfs delete.
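
The decision described above can be sketched as follows. This is a minimal illustration, not the actual patch: `isSystemdStylePath` and `deleteMethod` are hypothetical helpers, the example path is made up, and the real `IsSystemdCgroup` check and the cgroupsv2 condition are reduced to the `:` test mentioned in the message.

```go
package main

import (
	"fmt"
	"strings"
)

// isSystemdStylePath is a simplified stand-in (assumption) for the runtime's
// IsSystemdCgroup check: it only looks for the ":" separator.
func isSystemdStylePath(cgroupPath string) bool {
	return strings.Contains(cgroupPath, ":")
}

// deleteMethod reports which backend should remove the cgroup.
// When sandboxCgroupOnly is false the cgroup was created via cgroupfs,
// so it is deleted via cgroupfs even if the path looks systemd-style.
func deleteMethod(cgroupPath string, sandboxCgroupOnly bool) string {
	if sandboxCgroupOnly && isSystemdStylePath(cgroupPath) {
		return "systemd"
	}
	return "cgroupfs"
}

func main() {
	path := "kubepods-besteffort.slice:cri-containerd:abc123" // hypothetical path
	fmt.Println(deleteMethod(path, true))
	fmt.Println(deleteMethod(path, false))
}
```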

Fixes: #11036
Signed-off-by: Champ-Goblem <cameron@northflank.com>
2025-05-30 17:51:31 +02:00
Paul Meyer
c4815eb3ad runtime: add option to force guest pull
This enables guest pull via config, without the need of any external
snapshotter. When the config enables runtime.experimental_force_guest_pull, instead of
relying on annotations to select the way to share the root FS, we always
use guest pull.

Co-authored-by: Markus Rudy <mr@edgeless.systems>
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
2025-05-27 12:42:00 +02:00
Fabiano Fidêncio
6c9b199ef1 Merge pull request #11289 from BbolroC/fix-vfio-coldplug
runtime: Preserve hotplug devices for vfio-coldplug mode
2025-05-21 09:48:25 +02:00
Hyounggyu Choi
2fd2cd4a9b runtime: Preserve hotplug devices for vfio-coldplug mode
Fixes: #11288

This commit appends hotplug devices (e.g., persistent volume)
to deviceInfos when `vfio_mode` is `vfio` and `cold_plug_vfio`
is set to any value other than `no-port`. For details, please visit the issue.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-05-19 13:46:49 +02:00
Pradipta Banerjee
9f9841492e runtime: Fix logging for remote hypervisor
Need to use hvLogger

Fixes: #11286

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
2025-05-19 07:01:59 -04:00
Fabiano Fidêncio
219d6e8ea6 Merge pull request #11257 from mythi/coco-guest-hardening
confidential guest kernel hardening changes
2025-05-16 08:52:36 +02:00
Mikko Ylinen
ab29c8c979 runtime: do not add virtio-rng-pci device for confidential guests
Adding:
"-object rng-random,id=rng0,filename=/dev/urandom -device
virtio-rng-pci,rng=rng0"

for confidential guests is not necessary as the RNG source cannot
be trusted and the guest kernel already has the driver disabled as well.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2025-05-12 17:14:51 +03:00
stevenhorsman
17843e50bb runtime: Switch userns packages
Switch imports to resolve:
```
SA1019: "github.com/opencontainers/runc/libcontainer/userns" is deprecated:
use github.com/moby/sys/userns
```

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-05-08 11:04:11 +01:00
stevenhorsman
5472662b33 runtime: Fix Incorrect conversion between integer types
Fix the high severity codeql issue by checking the
value is in bounds before converting
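
A minimal sketch of the general pattern, assuming an int64 to uint32 conversion. The commit does not say which types are involved, and `toUint32` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"math"
)

// toUint32 converts v only when it fits into a uint32; otherwise it
// returns an error instead of silently truncating or wrapping.
func toUint32(v int64) (uint32, error) {
	if v < 0 || v > math.MaxUint32 {
		return 0, fmt.Errorf("value %d out of range for uint32", v)
	}
	return uint32(v), nil
}

func main() {
	for _, v := range []int64{42, -1, math.MaxUint32 + 1} {
		n, err := toUint32(v)
		fmt.Println(n, err)
	}
}
```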

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-05-06 15:18:37 +01:00
stevenhorsman
4de79b9821 runtime: Ignoring deprecated warning.
In the latest oci-spec, the prestart hook is deprecated.
However, the docker & nerdctl tests failed when I switched
to one of the newer hooks which don't run at quite the same time,
so ignore the deprecation warnings for now to unblock the security fix

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-05-06 15:18:37 +01:00
stevenhorsman
3740ce6e7b runtime: Update crio annotations
We've been using the
github.com/containers/podman/v4/pkg/annotations module
to get cri-o annotations, which has some major CVEs in, but
in v5 most of the annotations were moved into crio (from 1.30)
(see https://github.com/cri-o/cri-o/pull/7867). Let's switch
to use the cri-o annotations module instead and remediate
CVE-2024-3056.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-05-06 15:18:37 +01:00
Fabiano Fidêncio
78bf9d7500 Merge pull request #11232 from lifupan/mtu
runtime: add the mtu support for updating routes
2025-05-06 15:55:04 +02:00
ChengyuZhu6
f63ec50ba3 runtime: Add EROFS snapshotter with block device support
- Detection of EROFS options in container rootfs
- Creation of necessary EROFS devices
- Sharing of rootfs with EROFS via overlayfs

Fixes: #11163

Signed-off-by: ChengyuZhu6 <hudson@cyzhu.com>
2025-05-05 23:51:13 +02:00
Fupan Li
492329fc02 runtime: add the mtu support for updating routes
Some CNI plugins set the MTU on individual routes; Cilium, for example,
modifies the MTU of the default route. If the MTU of a route is not set
correctly, it may cause excessive fragmentation or even packet loss.
Therefore, this PR adds support for setting the route MTU: when a route
is obtained and its MTU is set, the MTU is also obtained and applied to
the corresponding route in the guest.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-05-04 23:12:57 +02:00
Shunsuke Kimura
3dba8ddd98 runtime: remove wrong qemu-system-x86_64 option
qemu-system-x86_64 does not support "-machine virt".
(this is only supported by arm,aarch64)
<https://people.redhat.com/~cohuck/2022/01/05/qemu-machine-types.html>

Fixes: #11229

Signed-off-by: Shunsuke Kimura <pbrehpuum@gmail.com>
2025-05-02 04:37:12 +09:00
Shunsuke Kimura
62639c861e runtime: remove wrong xfs options
"data=ordered" and "errors=remount-ro" are wrong options in xfs.
(they are ext4 options)
<https://manpages.ubuntu.com/manpages/focal/man5/xfs.5.html>

Fixes: #11205

Signed-off-by: Shunsuke Kimura <pbrehpuum@gmail.com>
2025-05-01 07:56:39 +09:00
Fabiano Fidêncio
b747f8380e clh: Rework CreateVM to reduce the amount of cycles
Otherwise the static checks will whip us as hard as possible.

Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
2025-04-25 21:30:47 +02:00
Champ-Goblem
9f76467cb7 runtime: clh: Add reclaim_guest_freed_memory [BACKPORT]
We're bringing to *Cloud Hypervisor only* the reclaim_guest_freed_memory
option already present in the runtime-rs.

This allows us to use virtio-balloon for the hypervisor to reclaim
memory freed by the guest.

The reason we're not touching other hypervisors is that we want to
avoid cluttering the Go code at this point, so we'll leave it for
whoever really needs this on other hypervisors (and trust me, we
really do need it for Cloud Hypervisor right now ;-)).

Signed-off-by: Champ-Goblem <cameron@northflank.com>
Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
2025-04-25 21:05:53 +02:00
Alex Lyn
8b49564c01 Merge pull request #10610 from Xynnn007/faet-initdata-rbd
Feat | Implement initdata for bare-metal/qemu hypervisor
2025-04-24 09:59:14 +08:00
Zvonko Kaiser
3946435291 gpu: Handle VFIO devices with DevicePlugin and CDI
We can provide devices during cold-plug with a CDI annotation on the Pod
level and add per-container device information with the device plugin.
Since the sandbox has already attached the VFIO devices, remove them
from consideration and just apply the inner runtime CDI annotation.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-04-23 21:02:06 +00:00
Zvonko Kaiser
486244b292 gpu: Remove unneeded parsing of CDI devices
The addition of CDI devices is now done for single_container,
pod_sandbox and pod_container before the devmanager creates
the deviceinfos, so there is no need for extra parsing.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-04-23 21:02:06 +00:00
Xynnn007
91bb6b7c34 runtime: add support for io.katacontainers.config.runtime.cc_init_data
io.katacontainers.config.runtime.cc_init_data specifies initdata used by
the pod in base64(gzip(initdata toml)) format. The initdata will be
encapsulated into an initdata image and mounted as a raw block device
in the guest.

The initdata image will be aligned to 512 bytes, which is chosen as a
usual sector size supported by different hypervisors like qemu, clh and
dragonball.

Note that this patch only adds support for qemu hypervisor.
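
The annotation encoding and the sector alignment can be illustrated with a short sketch. It assumes standard (padded) base64 and does not reproduce the on-disk layout of the initdata image; `encodeInitdata`, `decodeInitdata` and `padToSector` are hypothetical names, and the TOML content is a placeholder.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"fmt"
	"io"
)

const sectorSize = 512 // alignment named in the commit message

// encodeInitdata produces base64(gzip(toml)).
func encodeInitdata(toml []byte) (string, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(toml); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decodeInitdata reverses encodeInitdata.
func decodeInitdata(annotation string) ([]byte, error) {
	compressed, err := base64.StdEncoding.DecodeString(annotation)
	if err != nil {
		return nil, err
	}
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// padToSector returns a zero-padded copy of data whose length is a
// multiple of sectorSize.
func padToSector(data []byte) []byte {
	padded := (len(data) + sectorSize - 1) / sectorSize * sectorSize
	out := make([]byte, padded)
	copy(out, data)
	return out
}

func main() {
	toml := []byte("version = \"0.1.0\"\n") // placeholder content
	enc, err := encodeInitdata(toml)
	if err != nil {
		panic(err)
	}
	dec, err := decodeInitdata(enc)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(dec) == string(toml), len(padToSector(dec))%sectorSize == 0)
}
```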

Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
2025-04-15 16:35:59 +08:00
Bo Chen
ee84068aed versions: Upgrade to Cloud Hypervisor v45.0
Details of this release can be found in our roadmap project as iteration
v45.0: https://github.com/orgs/cloud-hypervisor/projects/6.

Fixes: #10723

Signed-off-by: Bo Chen <bchen@crusoe.ai>
Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
2025-04-07 20:33:34 +02:00
Zvonko Kaiser
d81a1747bd Merge pull request #11085 from kevinzs2048/fix-virtiomem
runtime-go: qemu: Fix sandbox start failing with virtio-mem enable on arm64
2025-03-31 17:09:43 -04:00
Ruoqing He
96b2d25508 runtime: Define default values for QEMU riscv
Provide default values while invoking QEMU as the hypervisor for Go
runtime on riscv64 platform.

Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
2025-03-27 10:05:36 +08:00
Ruoqing He
1e4963a3b2 runtime: Define availableGuestProtection for riscv64
`GuestProtection` feature is not made available yet, return
`noneProtection` for now.

Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
2025-03-27 09:34:53 +08:00
Ruoqing He
4947938ce8 runtime: Introduce riscv64 template for vm factory
Set `templateDeviceStateSize` to 8 as other architectures did.

Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
2025-03-27 09:28:32 +08:00
Kevin Zhao
211a36559c runtime-go: qemu: Fix sandbox start failing with virtio-mem enable on arm64
Also add CONFIG_VIRTIO_MEM to arm64 platform

Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
2025-03-26 22:31:00 +08:00
Paul Meyer
a994f142d0 runtime: make SNP IDBlock configurable
For a use case, we want to set the SNP IDBlock, which allows
configuring the AMD ASP to enforce parameters like expected launch
digest at launch. The struct with the config that should be enforced
(IDBlock) is signed. The public key is placed in the auth block and
the signature is verified by the ASP before launch. The digest of the
public key is also part of the attestation report (ID_KEY_DIGESTS).

Signed-off-by: Paul Meyer <katexochen0@gmail.com>
2025-03-14 07:50:54 +01:00
Fupan Li
1ade2a874f runtime: add the flags support to the route setting
We should support the flags when adding a route from
host to guest. Otherwise, setting some routes would
fail.

Fixes: #7934

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-03-07 09:56:08 +08:00
Hyounggyu Choi
624f7bfe0b runtime: Remove console=ttysclp0 for s390x
After the introduction of the following kernel parameters (see #6163):

```
CONFIG_SCLP_VT220_TTY=y
CONFIG_SCLP_VT220_CONSOLE=y
```

the system log for Kata components (e.g., the agent) no longer appeared
on the SCLP console (i.e., /dev/ttysclp0). Let's switch to the default
fallback console (likely /dev/console) for logging.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-03-05 15:06:08 +01:00
Greg Kurz
545022f295 Merge pull request #10817 from Jakob-Naucke/virtio-net-ccw
Fix virtio-net-ccw
2025-03-03 17:37:46 +01:00
Jakob Naucke
a084b99324 virtcontainers: Separate PCI/CCW for net devices
On s390x, virtio-net devices should use CCW, alongside a different
device path. Use accordingly.

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2025-02-26 11:36:43 +01:00
Jakob Naucke
2aa523f08a virtcontainers: Fix virtio-net-ccw address format
Hex device number was formatted as hex twice, thus encoding the string
as hex.
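
The class of bug described above can be reproduced in a few lines. This sketch does not use the actual CCW address format from the code base; `formatOnce` and `formatTwice` are hypothetical helpers and the device number is made up.

```go
package main

import "fmt"

// formatOnce renders the device number as four hex digits.
func formatOnce(devno uint16) string {
	return fmt.Sprintf("%04x", devno)
}

// formatTwice applies %x to the already formatted string, which
// hex-encodes the bytes of that string instead of the number.
func formatTwice(devno uint16) string {
	return fmt.Sprintf("%x", formatOnce(devno))
}

func main() {
	fmt.Println(formatOnce(0x1a))  // 001a
	fmt.Println(formatTwice(0x1a)) // 30303161
}
```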

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2025-02-26 11:36:43 +01:00
Jakob Naucke
2a992c4080 virtcontainers: Add CCW device to endpoint
To support virtio-net-ccw for s390x, add CCW devices to the Endpoint
interface. Add respective fields and functions to implementing structs.

Device paths may be empty. PciPath resolves this by being a list that
may be empty, but this design does not map to CcwDevice. Use a pointer
instead.
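
A sketch of the design choice, with hypothetical type and field names (`pciPath`, `ccwDevice`, `endpoint` are not the real definitions): a slice has a natural "empty" value, whereas a fixed-field struct does not, because its zero value could be a valid address. A nil pointer expresses "absent" instead.

```go
package main

import "fmt"

// pciPath: an empty slot list means "no PCI path".
type pciPath struct {
	slots []uint8
}

// ccwDevice: fixed fields; the zero value is indistinguishable from a
// real device at address 0.0000.
type ccwDevice struct {
	ssid  uint8
	devno uint16
}

type endpoint struct {
	pci pciPath
	ccw *ccwDevice // nil means "not a CCW device"
}

func describe(e endpoint) string {
	switch {
	case e.ccw != nil:
		return fmt.Sprintf("ccw %x.%04x", e.ccw.ssid, e.ccw.devno)
	case len(e.pci.slots) > 0:
		return fmt.Sprintf("pci with %d slot(s)", len(e.pci.slots))
	default:
		return "no device path"
	}
}

func main() {
	fmt.Println(describe(endpoint{}))
	fmt.Println(describe(endpoint{ccw: &ccwDevice{devno: 0x1a}}))
	fmt.Println(describe(endpoint{pci: pciPath{slots: []uint8{2}}}))
}
```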

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2025-02-26 11:36:42 +01:00
Jakob Naucke
9935f9ea7e proto: Rename Interface.pciPath to devicePath
Field is being used for both PCI and CCW devices. Name it devicePath
to avoid confusion when the device isn't a PCI device.

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2025-02-26 11:36:42 +01:00
Markus Rudy
1f6833bd0d runtime: add cause to CDI errors
Adding devices by CDI annotation can fail for a variety of reasons. If
that happens, it's helpful to know the root cause of the issue (CDI spec
missing, malformatted, requested device not present, etc.).

This commit adds the root cause of the CDI device addition to the errors
reported back to the caller. Since this error is bubbled up all the way
back to the shimv2 task.Create handler, it will be visible in Kubernetes
logs and enable fixing the root cause.

Signed-off-by: Markus Rudy <mr@edgeless.systems>
2025-02-26 08:36:15 +01:00
Aurélien Bombo
111803e168 runtime: cgroups: Remove commented out code
Doesn't seem like we're going to use this and it's confusing when inspecting
code.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2025-02-21 17:52:17 -06:00
Hyounggyu Choi
58647bb654 Merge pull request #10743 from zvonkok/iommufd-gpu-fix
IOMMUFD GPU enhancement
2025-02-20 23:43:00 +01:00
Zvonko Kaiser
7cca2c4925 gpu: Use a dedicated VFIO group vs iommufd entry
We do not want to abuse the sysfsentry; let's use a dedicated
devfsentry.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-02-20 18:27:52 +00:00
Zvonko Kaiser
9add633258 qemu: Add command line for IOMMUFD
For each IOMMUFD device, create an object and assign
it to the device. We need additional information, which
is now populated correctly, to decide whether we run the old VFIO
or the new VFIO backend.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-02-20 18:27:50 +00:00
Dan Mihai
3fc170788d Merge pull request #10811 from microsoft/cameronbaird/hyp-loglevel-upstream
CLH: config: add hypervisor_loglevel
2025-02-04 11:59:21 -08:00
Zvonko Kaiser
122ad95da6 Merge pull request #10751 from ryansavino/snp-upstream-host-kernel-support
snp: update kata to use latest upstream packages for snp
2025-02-03 11:20:59 -05:00
Cameron Baird
b6b0addd5e config: add hypervisor_loglevel
Implement HypervisorLoglevel config option for clh.

Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
2025-01-31 18:37:03 +00:00
Ryan Savino
e87231edc7 snp: remove snp certs on qemu cmdline
snp standard attestation with the upstream kernel and qemu does not support extended attestation with certs.

Fixes: #10750

Signed-Off-By: Ryan Savino <ryan.savino@amd.com>
2025-01-28 18:09:40 -06:00
Hyounggyu Choi
4a6ba534f1 runtime: Introduce new gRPC device type for VFIO-AP coldplug
This commit introduces a new gRPC device type, `vfio-ap-cold`, to support
VFIO-AP coldplug. This enables the VM guest to handle passthrough devices
differently from VFIO-AP hotplug.
With this new type, the guest no longer needs to wait for events (e.g., device
addition) because the device already exists at the time the device type is checked.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-01-28 10:53:00 +01:00
Hyounggyu Choi
419b5ed715 runtime: Add DeviceInfo to Container for VFIO coldplug configuration
Even though ociSpec.Linux.Devices is preserved when vfio_mode is VFIO,
it has not been updated correctly for coldplug scenarios. This happens
because the device info passed to the agent via CreateContainerRequest
is dropped by the Kata runtime.
This commit ensures that the device info is added to the sandbox's
device manager when vfio_mode is VFIO and coldPlugVFIO is true
(e.g., vfio-ap-cold), allowing ociSpec.Linux.Devices to be properly
updated with the device information before the container is created on
the guest.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-01-28 10:53:00 +01:00
Zvonko Kaiser
e82fdee20f runtime: Add proper IOMMUFD parsing
With newer kernels we have a new backend for VFIO
called IOMMUFD. This is a departure from VFIO IOMMU groups,
since it has only one device associated with an IOMMUFD entry.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-01-15 23:39:33 +00:00
Dan Mihai
2e21f51375 runtime: skip empty Guest console output lines
Skip logging empty lines of text from the Guest console output, if
there are any such lines.

Without this change, the Guest console log from CLH + /dev/pts/0 has
twice as many lines of text. Half of these lines are empty.

Fixes: #10737

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-01-15 00:28:26 +00:00
Sumedh Alok Sharma
ac4f986e3e runtime: Set memory config shared=false when shared_fs=None in CLH.
This commit sets memory config `shared` to false in cloud hypervisor
when creating vm with shared_fs=None && hugePages = false.

Currently, in runtime/virtcontainers/clh.go, the memory config `shared` is set to true by default.
As per the CLH memory documentation:
(a) shared=true is needed in cases such as virtio_fs, since the virtiofs daemon runs as a separate process from clh.
(b) for shared_fs=none + hugePages=false, shared=false can be set to use private anonymous memory for the guest (with no file backing).
(c) Another memory config, thp (use transparent huge pages), is always enabled by default.
As per the documentation, (b) + (c) can be used in combination.
However, with the current CLH implementation, the above combination cannot be used since shared=true is always set.
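
The resulting rule can be sketched as a small predicate. `memoryShared` is a hypothetical helper and the string value "none" for shared_fs is an assumption about how the option is spelled; the real change lives in the CLH memory config setup.

```go
package main

import "fmt"

// memoryShared reports whether guest memory should be configured as
// shared. It is private only when no shared filesystem is in use and
// huge pages are disabled; every other combination keeps shared=true.
func memoryShared(sharedFS string, hugePages bool) bool {
	if sharedFS == "none" && !hugePages {
		return false
	}
	return true
}

func main() {
	fmt.Println(memoryShared("none", false))
	fmt.Println(memoryShared("none", true))
	fmt.Println(memoryShared("virtio-fs", false))
}
```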

Fixes #10547

Signed-off-by: Sumedh Alok Sharma <sumsharma@microsoft.com>
2024-12-06 21:22:51 +05:30