kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-05-14 02:53:02 +00:00

Author	SHA1	Message	Date
Dan Mihai	a0bd9e02ca	tests: policy-job: detect create container errors early During the ${wait_time} for an expected condition, if CreateContainerRequest was NOT expected to fail: detect possible CreateContainerRequest failures early and abort the wait. For example, before this change: not ok 1 Successful job with auto-generated policy in 107111ms ok 2 Policy failure: unexpected environment variable in 7920ms ok 3 Policy failure: unexpected command line argument in 7874ms ok 4 Policy failure: unexpected emptyDir volume in 7823ms ok 5 Policy failure: unexpected projected volume in 7812ms ok 6 Policy failure: unexpected readOnlyRootFilesystem in 7903ms ok 7 Policy failure: unexpected UID = 222 in 7720ms After this change: not ok 1 Successful job with auto-generated policy in 10271ms ok 2 Policy failure: unexpected environment variable in 8018ms ok 3 Policy failure: unexpected command line argument in 7886ms ok 4 Policy failure: unexpected emptyDir volume in 7621ms ok 5 Policy failure: unexpected projected volume in 7843ms ok 6 Policy failure: unexpected readOnlyRootFilesystem in 7632ms ok 7 Policy failure: unexpected UID = 222 in 7619ms Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2025-11-03 15:55:55 +00:00
Dan Mihai	992c91371c	tests: policy-deployment-sc: detect create container errors early During the ${wait_time} for an expected condition, if CreateContainerRequest was NOT expected to fail: detect possible CreateContainerRequest failures early and abort the wait. For example, before this change: ok 1 Successful sc deployment with auto-generated policy and container image volumes in 14769ms ok 2 Successful sc with fsGroup/supplementalGroup deployment with auto-generated policy and container image volumes in 8384ms not ok 3 Successful sc deployment with security context choosing another valid user in 136149ms ok 4 Successful layered sc deployment with auto-generated policy and container image volumes in 8862ms ok 5 Policy failure: unexpected GID = 0 for layered securityContext deployment in 7941ms ok 6 Policy failure: malicious root group added via supplementalGroups deployment in 11612ms After: ok 1 Successful sc deployment with auto-generated policy and container image volumes in 15230ms ok 2 Successful sc with fsGroup/supplementalGroup deployment with auto-generated policy and container image volumes in 9364ms not ok 3 Successful sc deployment with security context choosing another valid user in 11060ms ok 4 Successful layered sc deployment with auto-generated policy and container image volumes in 9124ms ok 5 Policy failure: unexpected GID = 0 for layered securityContext deployment in 7919ms ok 6 Policy failure: malicious root group added via supplementalGroups deployment in 11666ms Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2025-11-03 15:55:55 +00:00
Dan Mihai	704ee76f1e	tests: policy-deployment-sc: reduced redundancy Call common function instead of copy/paste of three commands. Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2025-11-03 15:55:55 +00:00
Dan Mihai	2cafb10a6a	tests: policy-pod: detect create container errors early During the ${wait_time} for an expected condition, if CreateContainerRequest was NOT expected to fail: detect possible CreateContainerRequest failures early and abort the wait. For example, before this change: not ok 1 Successful pod with auto-generated policy in 110801ms not ok 2 Able to read env variables sourced from configmap using envFrom in 94104ms not ok 3 Successful pod with auto-generated policy and runtimeClassName filter in 95838ms not ok 4 Successful pod with auto-generated policy and custom layers cache path in 110712ms ok 5 Policy failure: unexpected container image in 8113ms ok 6 Policy failure: unexpected privileged security context in 7943ms ok 7 Policy failure: unexpected terminationMessagePath in 11530ms ok 8 Policy failure: unexpected hostPath volume mount in 7970ms ok 9 Policy failure: unexpected config map in 7933ms not ok 10 Policy failure: unexpected lifecycle.postStart.exec.command in 112677ms ok 11 RuntimeClassName filter: no policy in 2302ms not ok 12 ExecProcessRequest tests in 93946ms not ok 13 Successful pod: runAsUser having the same value as the UID from the container image in 94003ms ok 14 Policy failure: unexpected UID = 0 in 8016ms ok 15 Policy failure: unexpected UID = 1234 in 7850ms After: not ok 1 Successful pod with auto-generated policy in 12182ms not ok 2 Able to read env variables sourced from configmap using envFrom in 10121ms not ok 3 Successful pod with auto-generated policy and runtimeClassName filter in 11738ms not ok 4 Successful pod with auto-generated policy and custom layers cache path in 26592ms ok 5 Policy failure: unexpected container image in 7742ms ok 6 Policy failure: unexpected privileged security context in 7949ms ok 7 Policy failure: unexpected terminationMessagePath in 7789ms ok 8 Policy failure: unexpected hostPath volume mount in 7887ms ok 9 Policy failure: unexpected config map in 7818ms not ok 10 Policy failure: unexpected lifecycle.postStart.exec.command in 9120ms ok 11 RuntimeClassName filter: no policy in 2081ms not ok 12 ExecProcessRequest tests in 9883ms not ok 13 Successful pod: runAsUser having the same value as the UID from the container image in 9870ms ok 14 Policy failure: unexpected UID = 0 in 11161ms ok 15 Policy failure: unexpected UID = 1234 in 7814ms Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2025-11-03 15:55:55 +00:00
Alex Lyn	897ecfb503	Merge pull request #12014 from fidencio/topic/release-ensure-helm-dependencies-update scripts: release: Run helm dependencies update	2025-11-03 16:34:17 +08:00
Fabiano Fidêncio	c539a9e90e	tests: k8s: parallel: Increase timeout We've seen a few cases where we fail the test due to timeout and when we print the pods we just see that they've been created. With that in mind, let's just increase the timeout a little bit. Example: ``` not ok 1 Parallel jobs in 6250ms (in test file k8s-parallel.bats, line 41) `kubectl wait --for=condition=Ready --timeout=$timeout pod -l jobgroup=${job_name}' failed No resources found in kata-containers-k8s-tests namespace. [bats-exec-test:71] INFO: k8s configured to use runtimeclass job.batch/process-item-test1 created job.batch/process-item-test2 created job.batch/process-item-test3 created NAME STATUS COMPLETIONS DURATION AGE process-item-test1 Running 0/1 0s process-item-test2 Running 0/1 0s process-item-test3 Running 0/1 0s error: no matching resources found No resources found in kata-containers-k8s-tests namespace. No resources found in kata-containers-k8s-tests namespace. DEBUG: system logs of node 'aks-nodepool1-25989463-vmss000000' since test start time (2025-11-01 16:39:03) -- No entries -- job.batch "process-item-test1" deleted job.batch "process-item-test2" deleted job.batch "process-item-test3" deleted ``` Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-11-01 18:09:37 +01:00
Fabiano Fidêncio	8a5ebd5d16	tests: k8s: run QoS tests on a bigger instance It's been failing to start quite regularly on the smaller instance. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-11-01 17:54:58 +01:00
Fabiano Fidêncio	157b2c32ce	scripts: release: Run helm dependencies update Otherwise we'll face issues like: ``` Error: found in Chart.yaml, but missing in charts/ directory: node-feature-discovery ``` Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-11-01 17:54:58 +01:00
Fabiano Fidêncio	c75a46d17f	tests: Do not enable NFD on s390x As we're failing on the uninstall, which seems related to a bug in NFD itself, and I don't have access to an s390x machine to debug it, let's skip the enablement for now and re-enable it once we've experimented with it more on s390x. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-31 16:30:13 +01:00
Fabiano Fidêncio	67e38e0f92	tests: Do not enable NFD on cbl-mariner As we're failing to install NFD on CBL Mariner, let's skip the enablement there and enable it once we've experimented with it more there. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-31 16:30:13 +01:00
Fabiano Fidêncio	1bc873397b	tests: Use NFD as part of the tests As we have the ability to deploy NFD as a sub-chart of our chart, let's make sure we test it during our CI. We had to increase the timeout values, where we had timeouts set, to deploy / undeploy kata, as now NFD is also deployed / undeployed. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-31 16:30:13 +01:00
Fabiano Fidêncio	ebe15d154e	kata-deploy: Add NFD as a dependency Let's ensure that we add NFD as a weak dependency of the kata-deploy helm chart. What we're doing for now is leaving it up to the user / admin to enable it, and if enabled then we do an explicit check for virtualization support (x86_64 only for now). In case NFD is already deployed, we fail the installation (in case it's enabled on the kata-deploy helm chart) with a clear error message to the user. While I know that kata-remote DOES NOT require virtualization, I've left this out (with a comment for when we add a peer-pods dependency on kata-deploy) in order to simplify things for now, as kata-remote is not a deployed shim by default. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-31 16:30:13 +01:00
Fabiano Fidêncio	be05e1370c	kata-deploy: Allow setting the default runtime class name As Kata Containers can be consumed by other helm-charts, hard coding the default runtime class name to `kata` is not optimal. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-31 16:14:53 +01:00
Fabiano Fidêncio	820e6d6351	kata-deploy: Add more per-arch options All the options that take a specific shim as an argument MUST have specific per arch settings, as not all the shims are available for all the arches, leading to issues when setting up multi-arch deployments. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-31 16:14:53 +01:00
Zvonko Kaiser	94abe4fc00	osbuilder: nvrc: Consume NVRC release instead of building it Let's ensure that we consume NVRC releases straight from GitHub instead of building the binaries ourselves. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com> Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-31 12:10:20 +01:00
Zvonko Kaiser	69c76971f3	gpu: Handle VFIO and IOMMUFD We have here either /dev/vfio/<num> or /dev/vfio/devices/vfio<num>; for the IOMMUFD format /dev/vfio/devices/vfio<num>, strip the "vfio" prefix: /dev/vfio/123 -> basename "123" -> vfioNum = "123" -> cdi.k8s.io/vfio123; /dev/vfio/devices/vfio123 -> basename "vfio123" -> strip -> vfioNum = "123" -> cdi.k8s.io/vfio123 Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2025-10-31 09:46:07 +01:00
Fabiano Fidêncio	e30e2b5f45	tests: k8s: Remove tests running on GitHub provided runner We have 2 tests running on GitHub provided runners: * devmapper * CRI-O - devmapper situation For devmapper, we're currently testing devmapper with s390x as part of one of its jobs. More than that, this test has been failing here due to a lack of space in the machine for quite some time, and no action was taken to bring it back either via GARM or some other way. With that said, let's rely on the s390x CI to test devmapper and avoid one extra failure on our CI by removing this one. - cri-o situation CRI-O is being tested with a fixed version of kubernetes that's already reached its EOL, and a CRI-O version that matches that k8s version. There have been attempts to raise issues, and also to provide a PR that does at least part of the work ... leaving the debugging part for the maintainers of the CI. However, there was no action on those from the maintainers. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-30 11:46:59 +01:00
Alex Lyn	fa521220a9	Merge pull request #11816 from jiuyi123/rs-vm-template-kata-ctl-merge kata-ctl: add factory subcommands for VM template management	2025-10-30 18:21:12 +08:00
ssc	551caad4b1	docs: add guide on VM templating usage in runtime-rs - Explained the concept and benefits of VM templating - Provided step-by-step instructions for enabling VM templating - Detailed the setup for using snapshotter in place of VirtioFS for template-based VM creation - Added performance test results comparing template-based and direct VM creation Signed-off-by: ssc <741026400@qq.com>	2025-10-30 15:18:31 +08:00
ssc	5a586e13a1	kata-ctl: add factory subcommands for VM template management - init: initialize the VM template factory - status: check the current factory status - destroy: clean up and remove factory resources These commands provide basic lifecycle management for VM templates. Signed-off-by: ssc <741026400@qq.com>	2025-10-30 10:27:17 +08:00
RuoqingHe	8878c46e8f	Merge pull request #11867 from spectator333/update-rust-vmm-deps dragonball: Bump kvm-ioctls to fix security issue	2025-10-30 00:17:29 +08:00
Siyu Tao	dd444d23b3	dragonball: Bump kvm-ioctls to fix security issue Use `ioctl_with_mut_ref` instead of `ioctl_with_ref` in the `create_device` method as it needs to write to the `kvm_create_device` struct passed to it, which was released in v0.12.1. Signed-off-by: Siyu Tao <taosiyu2024@163.com>	2025-10-29 14:03:29 +00:00
Steve Horsman	0e19a2bf91	Merge pull request #11993 from zvonkok/vectorAdd gpu: Add libs for CC	2025-10-29 13:42:34 +00:00
stevenhorsman	555926ea1a	libs: Fix formatting issue Fix the cargo fmt issues and then we can make the libs tests required again to avoid this regression happening again. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2025-10-29 13:13:50 +01:00
Steve Horsman	dbdd1009af	Merge pull request #11933 from kata-containers/topic/kata-deploy-nfd-dependency-part-I kata-deploy: Automatically deploy NodeFeatureRules for TEEs	2025-10-29 09:50:38 +00:00
Fabiano Fidêncio	103f80c7f5	readme: install: Drop outdated documentation kata-deploy helm chart is THE way to deploy kata-containers on kubernetes environments, and kubernetes environments are basically the only reliably tested deployment we have. For now, let's just drop documentation that is outdated / incorrect, and in the future let's ensure we update the linked docs, as we work on update / upgrade for the helm chart. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-29 09:41:57 +01:00
Zvonko Kaiser	5ff218823c	gpu: Remove unneeded libraries The libs in question were added when moving to developer.nvidia.com, but now that we have switched back to Ubuntu-only builds they are not needed. Remove them to keep the rootfs as minimal as possible. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2025-10-29 08:03:36 +01:00
Zvonko Kaiser	6d9b4059f5	gpu: Add libs for CC In the case of CC we need additional libraries in the rootfs. Add them conditionally if type == confidential. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2025-10-29 08:03:36 +01:00
Xuewei Niu	55d181beb1	Merge pull request #11828 from jiuyi123/rs-vm-template-runtime-rs runtime-rs: introduce VM template lifecycle and integration	2025-10-29 14:03:46 +08:00
Xuewei Niu	8aca32dfa9	Merge pull request #11862 from StevenFryto/rootless_clh runtime-rs: supporting the CLH VMM process running in non-root mode	2025-10-29 13:31:53 +08:00
ssc	16e8cf1a09	runtime-rs: boot vm from template Add build_vm_from_template() that flips boot_from_template flag, wires factory.template_path/{memory,state} into the hypervisor config, and returns ready-to-use hypervisor & agent instances. When factory.template is enabled, VirtContainer bypasses normal creation and directly boots the VM by restoring the template through incoming migration, completing the "create → save → clone" loop. Fixes: #11413 Signed-off-by: ssc <741026400@qq.com>	2025-10-29 12:38:28 +08:00
ssc	550615285c	runtime-rs: add factory, template and vm modules for VM template lifecycle Introduced factory::FactoryConfig with init/destroy/status commands to manage template pools. Added template::Template to fetch, create and persist base VMs. Introduced vm::{VM, VMConfig} exposing create, pause, save, resume, stop, disconnect and migration helpers for sandbox integration. Extended QemuInner to execute QMP incoming migration, pause/resume and status tracking. Fixes: #11413 Signed-off-by: ssc <741026400@qq.com>	2025-10-29 12:38:28 +08:00
ssc	135c84b6cb	kata-types: add VM template and factory configuration Added new fields in Hypervisor struct to support VM template creation, template boot, memory and device state paths, shared path, and store paths. Introduced a Factory struct in config to manage template path, cache endpoint, cache number, and template enable flag. Integrated Factory into TomlConfig for runtime configuration parsing. Fixes: #11413 Signed-off-by: ssc <741026400@qq.com>	2025-10-29 11:49:08 +08:00
stevenfryto	2ceadc5fa3	runtime-rs: supporting the CLH VMM process running in non-root mode This change makes it possible to run the Cloud Hypervisor VMM as a non-root user when the rootless flag is set to true in the configuration. Fixes: #11414 Signed-off-by: stevenfryto <sunzitai_1832@bupt.edu.cn>	2025-10-29 01:55:10 +00:00
stevenfryto	2ddbae3aa6	runtime-rs: pass the tuntap fds down to Cloud Hypervisor Pass the file descriptors of the tuntap device to the Cloud Hypervisor VMM process so that the process could open the device without cap_net_admin Signed-off-by: stevenfryto <sunzitai_1832@bupt.edu.cn>	2025-10-29 01:55:10 +00:00
Fabiano Fidêncio	59883a2d99	actions: Remove unused USING_NFD There's no reason to keep the env var / input as it's never been used and now kata-deploy detects automatically whether NFD is deployed or not. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-28 21:24:27 +01:00
Fabiano Fidêncio	f9825b4e6e	kata-deploy: Automatically deploy NodeFeatureRules for TEEs When the NodeFeatureRule CRD is detected, kata-deploy will: * Create the specific NodeFeatureRules for the x86_64 TEEs * Adapt the TEEs runtime classes to take into account the amount of keys available in the system when spawning the podsandbox. Note, we still do not have NFD as a sub-dependency of the helm chart, and I'm not even sure if we will have. However, it's important to integrate better with the scenarios where the NFD is already present. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-28 21:24:27 +01:00
Manuel Huber	8dc78057d6	ci: Refactor NVIDIA NIM test Change NIM bats file logic to allow skipping test cases which require multiple GPUs. This can be helpful for test clusters where there is only one node with a single GPU, or for local test environments with a single-node cluster with a single GPU. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2025-10-28 19:12:16 +01:00
Manuel Huber	be32b77baf	ci: Add NVIDIA CUDA vectoradd test This change adds a CUDA vectoradd test case and makes enabling NVRC tracing optional and idempotent. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2025-10-28 19:12:16 +01:00
Fabiano Fidêncio	a164693e1a	release: Bump version to 3.22.0 Bump VERSION and helm-chart versions Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> 3.22.0	2025-10-28 16:28:18 +01:00
Steve Horsman	1b46cf43c4	Merge pull request #11989 from Amulyam24/actionpz-ppc64le revert: Enable new ibm runners for ppc64le	2025-10-28 12:09:03 +00:00
Amulyam24	c603094584	revert: Enable new ibm runners for ppc64le Temporarily disables the new runners for building artifacts jobs. Will be re-enabled once they are stable. Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>	2025-10-28 17:09:26 +05:30
Hyounggyu Choi	7d2fe5e187	revert: Enable new ibm runners for s390x This partially reverts `8dcd91c` for the s390x because the CI jobs are currently blocking the release. The new runners will be re-introduced once they are stable and no longer impact critical paths. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2025-10-28 11:11:51 +01:00
Fabiano Fidêncio	754e832cfa	kata-deploy: Allow passing shims / defaultShim per arch This allows us to do a full multi-arch deployment, as the user can easily select which shim can be deployed per arch, as some of the VMMs are not supported on all architectures, which would lead to a broken installation. Now, passing shims per arch we can easily have a heterogeneous deployment where, for instance, we can set qemu-se-runtime-rs for s390x, qemu-cca for aarch64, and qemu-snp / qemu-tdx for x86_64 and call all of those a default kata-confidential ... and have everything working with the same deployment. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-10-27 22:42:37 +01:00
Greg Kurz	ffdc80733a	Merge pull request #11966 from zvonkok/gpu-cc-fix gpu: rootfs fixes	2025-10-27 10:18:13 +01:00
Alex Lyn	418d5f724e	Merge pull request #11971 from lifupan/fupan_blk_ratelimit runtime-rs: Support disk rate limiter for dragonball	2025-10-27 17:12:47 +08:00
Alex Lyn	f86ac595a8	Merge pull request #11973 from Apokleos/enhance-oci-spec runtime-rs: Enhancements for items within OCI Spec	2025-10-27 16:15:00 +08:00
Alex Lyn	690dad5528	runtime-rs: Ensure complete cleanup of stale Device Cgroups The previous procedure failed to reliably ensure that all unused Device Cgroups were completely removed, a failure consistently verified by CI tests. This change introduces a more robust and thorough cleanup mechanism. The goal is to prevent previous issues—likely stemming from improper use of Rust mutable references—that caused the modifications to be ineffective or incomplete. This ensures a clean environment and reliable CI test execution. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2025-10-27 12:47:48 +08:00
Alex Lyn	25ab615da5	Merge pull request #11913 from Apokleos/dedicated-error-rs CI: Add dedicated expected error message for runtime-rs	2025-10-27 10:47:07 +08:00
Zvonko Kaiser	39848e0983	gpu: rootfs fixes Build only from Ubuntu repositories; do not mix with developer.nvidia.com. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com> Update tools/osbuilder/rootfs-builder/nvidia/nvidia_chroot.sh Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-26 19:36:55 +01:00
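The three "detect create container errors early" commits (a0bd9e02ca, 992c91371c, 2cafb10a6a) change how the policy tests wait: when CreateContainerRequest is not expected to fail, the wait for the expected condition also watches for a CreateContainerRequest failure and aborts at once. In the logs quoted in those commits, this cuts failing tests from about 94 to 136 seconds down to about 9 to 27 seconds. The tests themselves are bats scripts; the following is only a minimal Python sketch of that polling pattern, where `wait_or_abort`, `condition_met`, `create_failed` and `CreateContainerFailed` are hypothetical names standing in for the tests' kubectl and log checks.

```python
import time
from typing import Callable


class CreateContainerFailed(Exception):
    """Hypothetical error raised when a create-container failure is seen mid-wait."""


def wait_or_abort(
    condition_met: Callable[[], bool],
    create_failed: Callable[[], bool],
    wait_time: float,
    sleep_time: float = 1.0,
) -> bool:
    """Poll until condition_met() is true or wait_time seconds elapse.

    Returns True when the condition is met and False on timeout. If
    create_failed() reports an error first, raise immediately instead of
    waiting out the rest of wait_time.
    """
    deadline = time.monotonic() + wait_time
    while True:
        if condition_met():
            return True
        # Early exit: the expected condition can no longer be reached.
        if create_failed():
            raise CreateContainerFailed("CreateContainerRequest failed")
        if time.monotonic() >= deadline:
            return False
        time.sleep(sleep_time)
```

Tests that expect a policy failure (the "Policy failure: ..." cases) would keep the plain wait, since a rejected CreateContainerRequest is exactly what they assert; their timings stay in the same range before and after the change.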
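Commit 69c76971f3 maps both VFIO device-node layouts, the legacy group node and the IOMMUFD cdev, to the same CDI name. A minimal Python sketch of that mapping follows; `vfio_num` and `cdi_name` are hypothetical helpers, and rejecting anything that is not all digits after stripping the prefix is an added assumption, not something the commit states.

```python
import os


def vfio_num(dev_path: str) -> str:
    """Return the VFIO number from a device node path.

    Legacy VFIO:  /dev/vfio/<num>              -> basename is "<num>"
    IOMMUFD:      /dev/vfio/devices/vfio<num>  -> basename is "vfio<num>"
    """
    base = os.path.basename(dev_path)
    # IOMMUFD cdev names carry a "vfio" prefix that has to be stripped.
    if base.startswith("vfio"):
        base = base[len("vfio"):]
    # Assumption: anything that is not a plain ASCII number is rejected.
    if not (base.isascii() and base.isdigit()):
        raise ValueError(f"unexpected VFIO device path: {dev_path}")
    return base


def cdi_name(dev_path: str) -> str:
    """Both layouts map to the same name, e.g. cdi.k8s.io/vfio123."""
    return f"cdi.k8s.io/vfio{vfio_num(dev_path)}"
```

Both `/dev/vfio/123` and `/dev/vfio/devices/vfio123` yield `cdi.k8s.io/vfio123`, matching the two examples in the commit message.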
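Commits 754e832cfa and 820e6d6351 make the shim list, the default shim and other shim-specific options settable per architecture, because not every VMM exists on every architecture. The Python sketch below only illustrates that resolution logic under stated assumptions: the `ArchShims` type, the `PER_ARCH` layout, the helper names and the choice of `qemu-snp` as the x86_64 default are all hypothetical and do not reflect the kata-deploy chart's actual values schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ArchShims:
    shims: List[str] = field(default_factory=list)
    default_shim: Optional[str] = None


# Hypothetical per-arch settings modelled on the example in commit 754e832cfa;
# the real kata-deploy helm values use their own keys and layout.
PER_ARCH: Dict[str, ArchShims] = {
    "x86_64": ArchShims(["qemu-snp", "qemu-tdx"], "qemu-snp"),
    "aarch64": ArchShims(["qemu-cca"], "qemu-cca"),
    "s390x": ArchShims(["qemu-se-runtime-rs"], "qemu-se-runtime-rs"),
}


def validate(per_arch: Dict[str, ArchShims]) -> None:
    """Reject a default shim that is not deployed on its own architecture."""
    for arch, cfg in per_arch.items():
        if cfg.default_shim is not None and cfg.default_shim not in cfg.shims:
            raise ValueError(f"default shim {cfg.default_shim!r} is not deployed on {arch}")


def shims_for(arch: str, per_arch: Dict[str, ArchShims]) -> List[str]:
    """Shims to install on a node of this architecture; empty if unconfigured."""
    cfg = per_arch.get(arch)
    return list(cfg.shims) if cfg else []


def default_shim_for(arch: str, per_arch: Dict[str, ArchShims]) -> Optional[str]:
    """Default shim for this architecture, or None if it is not configured."""
    cfg = per_arch.get(arch)
    return cfg.default_shim if cfg else None
```

The per-arch split is what `validate` relies on: with one global default, a value such as `qemu-snp` would be invalid on aarch64 and s390x nodes, which is the kind of broken multi-arch installation the commit message describes.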


17127 Commits