kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-03-17 10:12:24 +00:00

Author	SHA1	Message	Date
Aurélien Bombo	cbfdc4b764	Revert "ci: Implement build step for CSI driver" This partially reverts commit `fb87bf221f`. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-10 13:51:21 -05:00
Aurélien Bombo	b6c60d9229	Merge pull request #10559 from sprt/conf-local-storage coco: Implement trusted ephemeral data storage	2026-03-10 10:39:40 -05:00
Dan Mihai	f9a8eb6ecc	genpolicy: allow_mount improvements for emptyDir 1. Reduce the complexity of the new allow_mount rules for emptyDir. 2. Reverse the order of the two allow_mount versions, as a hint to the rego engine that the first version is more often matching the input. 3. Remove `p_mount.source != ""` from mount_source_allows, because: - Policy rules typically test the values from input, not values read from Policy. - mount_source_allows is no longer called for emptyDir mounts after these changes, so p_mount.source is not empty. Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2026-03-09 14:52:17 -05:00
Fabiano Fidêncio	374b0abe29	tests: Fix kubelet data dir for k0s in trusted ephemeral storage test k0s uses /var/lib/k0s/kubelet instead of /var/lib/kubelet as its kubelet data directory. Introduce get_kubelet_data_dir() in tests_common.sh and use it in k8s-trusted-ephemeral-data-storage.bats instead of hardcoding /var/lib/kubelet. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-09 14:52:17 -05:00
Aurélien Bombo	718632bfe0	build: Add artifacts to .gitignore This adds various files that are generated during development. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Aurélien Bombo	68bdbef676	tests: Improve logging for some tests Use modern test semantics to ease debugging. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Aurélien Bombo	3dd77bf576	tests: Introduce new env variables to ease development It can be useful to set these variables during local testing: * AZ_REGION: Region for the cluster. * AZ_NODEPOOL_TAGS: Node pool tags for the cluster. * GENPOLICY_BINARY: Path to the genpolicy binary. * GENPOLICY_SETTINGS_DIR: Directory holding the genpolicy settings. I've also made it so that tests_common.sh modifies the duplicated genpolicy-settings.json (used for testing) instead of the original git-tracked one. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Aurélien Bombo	aae54f704c	ci: Stop deploying the CSI driver The design moved away from CSI driver so stop deploying that. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Aurélien Bombo	a98e328359	tests: Add test for trusted ephemeral data storage This tests the feature on CoCo machines. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Aurélien Bombo	9fe03fb170	genpolicy: Support trusted ephemeral data storage * Introduces a new cluster_config setting encrypted_emptydir defaulting to true. * Adapts genpolicy for encrypted emptyDirs. Crucially, the rules.rego change checks that the mount and the storage are well-formed together: * i_storage.source matches a known regex. * i_storage.mount_point == $(spath)/BASE64(i_storage.source) * i_storage.mount_point == p_storage.mount_point * i_storage.mount_point == i_mount.source Note that policy enforcement is necessary to prevent rogue device injection. E.g. the agent could not blindly encrypt all block devices as some use cases only need dm-verity. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Aurélien Bombo	eaa711617e	agent: Support trusted ephemeral data storage Handles block-based emptyDirs plugged via virtio-blk and virtio-scsi by encrypting and formatting them. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Aurélien Bombo	a4fd32a29a	runtime: Support trusted ephemeral data storage * Introduces the `emptydir_mode` config flag to allow instructing the runtime to create a block device for emptyDir volumes. * The block device is created in the original emptyDir folder on the host so that Kubelet can monitor its disk usage and evict the pod if it exceeds its sizeLimit. This matches runc and virtio-fs. * The block device's disk image file is sparse to minimize host disk footprint. Fixes: #10560 Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Alex Lyn	fb743a304c	runtime: Support plugging a disk as an image file Some VMMs support plugging a disk as an image file instead of a block device, so we adapt the runtime to support that. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com> Signed-off-by: Aurélien Bombo <abombo@microsoft.com> Co-authored-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-09 14:52:17 -05:00
Alex Lyn	22c4cab237	Merge pull request #12623 from Apokleos/fix-dgb-ut runtime-rs: Fix dragonball's flaky unit tests	2026-03-09 11:38:02 +08:00
Alex Lyn	62b0f63e37	dragonball: Generate unique TAP names to avoid conflicts The vhost-kern net unit test used a fixed TAP interface name ("test_vhosttap"). When tests run in parallel or a previous run leaves the interface behind, TAP creation can fail with EBUSY ("Resource busy"), making CI flaky. Introduce a unique_tap_name() helper in the tests and use it to generate a per-test TAP name (based on pid/thread/counter), avoiding name collisions and stabilizing CI. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 17:33:40 +08:00
Alex Lyn	b2932f963a	Merge pull request #12631 from Apokleos/fix-suffix ci: keep mktemp output suffix stable with .yaml	2026-03-06 14:15:49 +08:00
Alex Lyn	1c8c0089da	dragonball: fix flaky signal_handler test using libc::raise The signal_handler test was intermittently failing because it used kill(pid, sig), which sends signals asynchronously to the process. This created a race condition where the child thread could exit and be joined before the signal was delivered or processed. This fix: 1. Replaces `kill` with `libc::raise` to ensure signals are delivered synchronously to the calling thread. 2. Reorders triggers to verify standard signals before installing seccomp filters. 3. Guarantees that metrics are incremented before the child thread terminates and is joined by the main thread. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 09:28:56 +08:00
Alex Lyn	d0718f6001	dragonball: Fix unnecessary parentheses around type warning: unnecessary parentheses around type --> src/dragonball/dbs_legacy_devices/src/serial.rs:245:39 \| 245 \| let out: Arc<Mutex<Option<Box<(dyn std::io::Write + Send + 'static)>>>> = \| ^ ^ \| = note: `#[warn(unused_parens)]` (part of `#[warn(unused)]`) on by default help: remove these parentheses Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 09:28:56 +08:00
Alex Lyn	b4161198ee	dragonball: Remove unused imports variables in dbs_pci Fix warnings of unused imports as below: ``` warning: unused imports: `DEVICE_ACKNOWLEDGE`, `DEVICE_DRIVER_OK`, `DEVICE_DRIVER`, `DEVICE_FEATURES_OK`, and `DEVICE_INIT` --> src/dragonball/dbs_pci/src/virtio_pci.rs:1177:9 \| 1177 \| DEVICE_ACKNOWLEDGE, DEVICE_DRIVER, DEVICE_DRIVER_OK, DEVICE_FEATURES_OK, DEVICE_INIT, \| ^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^ \| = note: `#[warn(unused_imports)]` (part of `#[warn(unused)]`) on by default ``` Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 09:28:56 +08:00
Alex Lyn	ca4e14086f	runtime-rs: Fix warnings of unformatted code Fix warnings from unformatted code. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 09:28:56 +08:00
Alex Lyn	ce800b7c37	dragonball: Fix flaky test_vhost_user_net_virtio_device_activate hang The vhost-user-net tests could hang in CI because VhostUserNet::new_server() blocks indefinitely on listener.accept() when the slave fails to connect in time (e.g. due to scheduler delays or flaky socket paths). This also caused panics when connect_slave() returned None and the test unwrapped it. Fix the tests by: - using a `/tmp`, absolute, unique unix socket path per test run - retrying slave connect with a deadline - running new_server() in a separate thread and waiting via recv_timeout() to ensure the test never blocks indefinitely Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 09:28:56 +08:00
Alex Lyn	a988b10440	dragonball: Fix flaky test_vhost_user_net_virtio_device_normal hang It aims to fix flaky test hang by implementing thread timeouts. The `test_vhost_user_net_virtio_device_normal` was hanging in CI when master/slave threads drifted. This commit stabilizes the test by: - Using `tempfile` and unique paths to ensure socket isolation. - Adding a 5s deadline for slave connections to handle CI jitter. - Running `new_server` in a separate thread with a `recv_timeout` to prevent the CI pipeline from deadlocking. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 09:28:56 +08:00
Alex Lyn	f36218d566	dragonball: Fix flaky test_inner_stream_timeout in inner backend The `test_inner_stream_timeout` test case was prone to failure due to a race condition between the main thread and the background handler. The test relied on hardcoded `thread::sleep` durations, which could cause the second read operation to time out (150ms window) before the main thread performed its write (after a 300ms sleep) under high system load. This commit stabilizes the test by: 1. Replacing fixed sleep durations with a `Condvar` and a `stage` variable to implement a deterministic state machine. 2. Synchronizing the threads so that the main thread only writes data after the background handler has confirmed it is ready or has completed its previous phase. 3. Ensuring the read timeout is explicitly managed between different validation stages to prevent accidental `TimedOut` errors. This change eliminates the flakiness and ensures the test passes consistently across different CI environments. Fixes #12618 Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 09:28:56 +08:00
Alex Lyn	c8a39ad28d	dragonball: Fix flaky test_epoll_manager by improving synchronization This commit aims to address issues of "Infinite loop in epoll_manager tests" and improve stability. The root causes are: 1. Using `handle_events(-1)` caused the worker thread to block forever if an event was missed or if the internal `kick()` signal was not accounted for correctly. 2. Relying on event counts was unreliable because internal signals could fluctuate the total count, causing it to enter an infinite loop. 3. Using `EventSet::OUT` on an EventFd is often continuously ready, leading to non-deterministic trigger behavior. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 09:28:56 +08:00
Alex Lyn	a35dcf952e	ci: Fix YAML parsing flakiness caused by mktemp random suffixes In some CI runs, `mktemp` generates random characters that accidentally form file extensions like `.cSV` or `.Xml`. This triggers downstream parsing errors because the YAML content is misidentified as CSV/XML. The issue looks like this: ``` '/tmp/bats-run-KodZEA/.../pod-guest-pull-in-trusted-storage.yaml.in.cSV': ... ``` This commit fixes the issue by: 1. Moving the `XXXXXX` placeholder before the `.yaml` extension. 2. Ensuring the generated file always ends in `.yaml`. This prevents format misidentification while maintaining filename uniqueness and security. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-03-06 09:21:29 +08:00
Fabiano Fidêncio	2fff33cfa4	Merge pull request #12628 from stevenhorsman/agent-ctl-bump-aws-lc-rs agent-ctl: Update aws-lc-rs	2026-03-05 20:52:03 +01:00
Fabiano Fidêncio	83a8b257d1	Merge pull request #12265 from fidencio/topic/nvidia-bump-container-toolkit nvidia: Bump nvidia-container-toolkit to 1.18.1	2026-03-05 15:25:15 +01:00
Fabiano Fidêncio	079fac1309	Merge pull request #12591 from fidencio/topic/kernel-add-mmio-back-to-the-unified-kernels kernel: include mmio fragment in unified build for firecracker	2026-03-05 13:45:41 +01:00
Steve Horsman	5df7c4aa9c	Merge pull request #12630 from zachspar/spar/kata-deploy-helm/configurable-pod-overhead kata-deploy: add per-shim configurable pod overhead	2026-03-05 12:42:53 +00:00
Fabiano Fidêncio	e9894c0bd8	nvidia: Bump nvidia-container-toolkit to 1.18.1 Let's update the nvidia-container-toolkit to 1.18.1 (from 1.17.6). We're, from now on, relying on the version set in the versions.yaml file. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-05 11:53:09 +01:00
stevenhorsman	c57f2be18e	agent-ctl: Update aws-lc-rs aws-lc has multiple high severity CVEs: - GHSA-vw5v-4f2q-w9xf - GHSA-65p9-r9h6-22vj - GHSA-hfpc-8r3f-gw53 so try and bump to the latest `aws-lc-rs` crate to pull in the available fixed versions Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-05 10:02:22 +00:00
Zachary Spar	bda9f6491f	kata-deploy: add per-shim configurable pod overhead Allow users to override the default RuntimeClass pod overhead for any shim via shims.<name>.runtimeClass.overhead.{memory,cpu}. When the field is absent the existing hardcoded defaults from the dict are used, so this is fully backward compatible. Signed-off-by: Zachary Spar <zspar@coreweave.com>	2026-03-05 08:00:01 +01:00
Fabiano Fidêncio	8f35c31b30	Merge pull request #12542 from fidencio/topic/genpolicy-distribute-different-settings-rather-than-patching-for-ci genpolicy: settings.d drop-ins and scenario example drop-ins	2026-03-05 07:37:30 +01:00
Fabiano Fidêncio	b5e0a5b7d6	Merge pull request #12555 from fidencio/topic/tests-use-local-pv-pvc-for-policy-tests k8s-policy-pvc: use local PV/PVC when no default StorageClass exists	2026-03-05 07:37:11 +01:00
Dan Mihai	cb97ebd067	Merge pull request #12615 from microsoft/danmihai1/subPathExpr tests: k8s: basic test for subPathExpr	2026-03-04 13:10:57 -08:00
Fabiano Fidêncio	a0b9d965e5	k8s-policy-pvc: use local PV/PVC when no default StorageClass exists Create local block storage (loop device, StorageClass, PV) in the test only when the cluster has no default StorageClass, matching the approach used in k8s-volume.bats. Set our StorageClass as default so the PVC binds to our PV; tear it down after the test. When a default already exists (e.g. AKS), skip creation and cleanup so we do not change the cluster's default storage class. Fixes: #9846 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 21:50:51 +01:00
Fabiano Fidêncio	83dd7dcc75	runtimes: reject virtio-blk-mmio when confidential_guest is true Virtio-mmio transport is not hardened for confidential computing (unlike virtio-pci). Reject config that would use virtio-blk-mmio for rootfs/block when confidential_guest is set, so CoCo guests only use virtio-blk-pci. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 21:41:27 +01:00
Fabiano Fidêncio	cb0d02e40b	kernel: include mmio fragment in unified build for firecracker Remove # !confidential from mmio.conf so CONFIG_VIRTIO_MMIO and CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES are included when building the unified x86_64/s390x kernel with -x. Firecracker requires virtio-mmio for block devices; without it the guest kernel panics (no /dev/vda). Fixes: #12581 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 21:18:35 +01:00
Fabiano Fidêncio	d40afe592c	genpolicy: add settings drop-in directory and RFC 6902 JSON Patch support Allow genpolicy -j to accept a directory instead of a single file. When given a directory, genpolicy loads genpolicy-settings.json from it and applies all genpolicy-settings.d/*.json files (sorted by name) as RFC 6902 JSON Patches. This gives precise control over settings with explicit operations (add, remove, replace, move, copy, test), including array index manipulation and assertions. Ship composable drop-in examples in drop-in-examples/: - 10-* files set platform base settings (non-CoCo, AKS, CBL-Mariner) - 20-* files overlay specific adjustments (OCI version, guest pull) Users copy the combination they need into genpolicy-settings.d/. Replace the old adapt_common_policy_settings_* jq-patching functions in tests_common.sh with install_genpolicy_drop_ins(), which copies the right combination of 10-* and 20-* drop-ins for the CI scenario. Tests still generate 99-test-overrides.json on the fly for per-test request/exec overrides. Packaging installs 10-* and 20-* drop-ins from drop-in-examples/ into the tarball; the default genpolicy-settings.d/ is left empty. Made-with: Cursor Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 20:13:21 +01:00
Dan Mihai	e40d962b13	genpolicy: improve allow_mount logging Add simple -------- text lines separator to the beginning of the allow_mount log output, to help log readers more easily separate the ~30 lines of text generated while verifying each mount. Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2026-03-04 16:28:29 +00:00
Dan Mihai	3f845af9d4	tests: k8s: basic test for subPathExpr Add basic genpolicy test coverage for subPathExpr and corresponding container mounts. Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2026-03-04 16:28:29 +00:00
Steve Horsman	a4a4683ec7	Merge pull request #12626 from kata-containers/topic/kata-deploy-k3s-rke2-use-imports kata-deploy: a bunch of fixes regarding uninstall, rke2 and k3s tests	2026-03-04 14:01:09 +00:00
Steve Horsman	2687ad75c1	Merge pull request #12617 from BbolroC/skip-cgroup-device-check-for-remote runtime: Skip to call sandboxDevices() for remote hypervisor	2026-03-04 14:00:23 +00:00
Steve Horsman	8e11bb2526	Merge pull request #12611 from mythi/coco-kernel-v6.18.15 versions: bump to Linux v6.18.15 (LTS)	2026-03-04 14:00:00 +00:00
Steve Horsman	94f850979f	Merge pull request #12613 from stevenhorsman/tooling-bump-x/net-to-v0.51.0 Tooling bump x/net to v0.51.0	2026-03-04 13:44:22 +00:00
stevenhorsman	8640f27516	ci: Remove SNP tests from required The SNP tests have been unstable on nightlies, but even when these it seems to be manually cleaned up or something as PR tests are consistently failing, so we should skip this from the required list until it is reliable. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-04 14:41:09 +01:00
Fabiano Fidêncio	56c3618c1d	tests: kata-deploy: wait for API recovery after uninstall kata-deploy's SIGTERM cleanup restarts the CRI runtime, which on k3s/rke2 takes down the API server temporarily. The helm uninstall may complete with errors, and the next test suite would start with a dead API. Add a wait loop after uninstall to ensure the API is available before proceeding. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:26:31 +01:00
Fabiano Fidêncio	966d710df5	tests: increase kata-deploy wait timeout to 15 minutes kata-deploy restarts the CRI runtime during install, which can cause the kata-deploy pod to be killed and recreated by the DaemonSet controller. On k3s and rke2 in particular, the restart can take several minutes. Increase the default timeout from 600s (10m) to 900s (15m) to accommodate this. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:26:31 +01:00
Fabiano Fidêncio	ebe75cc3e3	kata-deploy: make verification job resilient to CRI runtime restarts kata-deploy restarts the CRI runtime (k3s/containerd) during install, which can kill the verification job pod or cause transient API server errors. Bump backoffLimit from 0 to 3 so the job can retry after being killed, and add a retry loop around kubectl rollout status to handle transient connection failures. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:26:31 +01:00
Fabiano Fidêncio	7a08ef2f8d	kata-deploy: run cleanup on SIGTERM instead of preStop hook Move the cleanup logic from a preStop lifecycle hook (separate exec) into the main process's SIGTERM handler. This simplifies the architecture: the install process now handles its own teardown when the pod is terminated. The SIGTERM handler is registered before install begins, and tokio::select! races install against SIGTERM so cleanup always runs even if SIGTERM arrives mid-install (e.g. helm uninstall while the container is restarting after a failed install attempt). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:26:31 +01:00
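
Commit 62b0f63e37 above replaces a fixed TAP interface name with a `unique_tap_name()` helper built from pid/thread/counter. The actual helper is not shown in this log, so the following is only a minimal sketch of the idea under stated assumptions: the signature, the `prefix` parameter, the hex format, and the use of pid plus counter alone (no thread id) are all illustrative choices, not the code from the commit. The 15-character cap reflects the Linux interface name limit (IFNAMSIZ is 16 bytes including the terminating NUL).

```rust
use std::process;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Linux interface names must fit in IFNAMSIZ (16 bytes, including the NUL).
const MAX_IFNAME_LEN: usize = 15;

/// Process-wide counter so that every call produces a different suffix.
static TAP_COUNTER: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical helper (not the one from the commit): build a TAP name from
/// an ASCII prefix, the process id, and a counter. The prefix is shortened
/// rather than the suffix, so the unique part is never truncated.
fn unique_tap_name(prefix: &str) -> String {
    let n = TAP_COUNTER.fetch_add(1, Ordering::Relaxed);
    // Hex keeps the suffix short, e.g. "1a2b_0".
    let suffix = format!("{:x}_{:x}", process::id(), n);
    // Room left for the prefix; zero if the suffix alone fills the name.
    let room = MAX_IFNAME_LEN.saturating_sub(suffix.len());
    let short_prefix: String = prefix.chars().take(room).collect();
    format!("{}{}", short_prefix, suffix)
}

fn main() {
    let a = unique_tap_name("test_vhosttap");
    let b = unique_tap_name("test_vhosttap");
    // Two calls in the same process never return the same name.
    assert_ne!(a, b);
    assert!(a.len() <= MAX_IFNAME_LEN && b.len() <= MAX_IFNAME_LEN);
    println!("{} {}", a, b);
}
```

The pid separates concurrently running test processes and the counter separates tests within one process. The length bound only holds while the suffix itself stays under 15 characters, which is the case for any realistic number of calls.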
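
Commits ce800b7c37 and a988b10440 above describe running a blocking `new_server()` call in a separate thread and waiting for it via `recv_timeout()` so that a test can never hang. The pattern can be sketched with the standard library alone. `run_with_timeout` is a hypothetical name introduced here for illustration, and the closures stand in for the real blocking call; this is not the code from those commits.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `f` on a helper thread and wait at most `timeout` for its result.
/// Returns `None` if the deadline passes or the helper thread panics.
/// On timeout the helper thread is left detached and keeps running.
fn run_with_timeout<T, F>(f: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out,
        // so a failed send is deliberately ignored.
        let _ = tx.send(f());
    });
    rx.recv_timeout(timeout).ok()
}

fn main() {
    // A call that returns promptly yields its value.
    let fast = run_with_timeout(|| 42, Duration::from_secs(1));
    assert_eq!(fast, Some(42));

    // A call that blocks past the deadline yields None instead of hanging.
    let slow = run_with_timeout(
        || {
            thread::sleep(Duration::from_millis(500));
            0
        },
        Duration::from_millis(50),
    );
    assert_eq!(slow, None);
}
```

A test built this way fails with a clear timeout instead of stalling the whole CI job, at the cost of possibly leaving one detached thread behind until the test process exits.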

18136 Commits