kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-04-04 19:16:12 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	fd583d833b	kata-deploy: nydus: clean containerd metadata before wiping backend When /var/lib/nydus-snapshotter is removed, containerd's BoltDB (meta.db at /var/lib/containerd/) still holds snapshot records for the nydus snapshotter. On the next install these stale records cause image pulls to fail with: "unable to prepare extraction snapshot: target snapshot \"sha256:...\": already exists" The failure path in core/unpack/unpacker.go: 1. sn.Prepare() → metadata layer finds the target chainID in BoltDB → returns AlreadyExists without touching the nydus backend. 2. sn.Stat() → metadata layer finds the BoltDB record, then calls s.Snapshotter.Stat(bkey) on the nydus gRPC backend → NotFound (backend was wiped). 3. The unpacker treats NotFound as a transient key-collision race and retries 3 times; all 3 attempts hit the same dead end, and the pull is aborted. The commit message of `62ad0814c` ("nydus: Always start from a clean state") assumed "containerd will re-pull/re-unpack when it finds non-existent snapshots", but that is not what happens: the metadata layer intercepts the Prepare call in BoltDB before the backend is ever consulted. Fix: call cleanup_containerd_nydus_snapshots() before stopping the nydus service (and thus before wiping its data directory) in both install_nydus_snapshotter and uninstall_nydus_snapshotter. The cleanup must run while the service is still up because ctr snapshots rm goes through the metadata layer which calls the nydus gRPC backend to physically remove the snapshot; if the service is already stopped the backend call fails and the BoltDB record remains. The cleanup: - Discovers all containerd namespaces via `ctr namespaces ls -q` (falls back to k8s.io if that fails). - Removes containers whose Snapshotter field matches the nydus plugin name; these become dangling references once snapshots are gone and can confuse container reconciliation after an aborted CI run. 
- Removes snapshots round by round (leaf-first) until either the list is empty or no progress can be made (see below). Note: containerd's GC cannot substitute for this explicit cleanup. The image record (a GC root) references content blobs which reference the snapshots via gc.ref labels, keeping the entire chain alive in the GC graph even after the nydus backend is wiped. Snapshot removal rounds ----------------------- Snapshot chains are linear: an image with N layers produces a chain of N snapshots, each parented on the previous. Only the current leaf can be removed each round, so N layers require exactly N rounds. There is no fixed round cap — the loop terminates when either the list reaches zero (success) or a round removes nothing at all (all remaining snapshots are actively in use by running workloads). Active workload safety ---------------------- If active workloads still hold nydus snapshots (e.g. during a live upgrade), no progress is made in a round and cleanup_nydus_snapshots returns false. Both install_nydus_snapshotter and uninstall_nydus_snapshotter gate the fs::remove_dir_all on that return value: - true → proceed as before: stop service, wipe data dir. - false → stop service, skip data dir removal, log a warning. The new nydus instance starts on the existing backend state; running containers are left intact. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-03-24 16:44:25 +01:00
Fabiano Fidêncio	1ec97d25e7	Merge pull request #12704 from stevenhorsman/security-fixes-23-mar-26 Security fixes 23 mar 26	2026-03-23 15:27:07 +01:00
Fabiano Fidêncio	514a2b1a7c	Merge pull request #12264 from fidencio/topic/nvidia-gpu-cc-use-nydus-snapshotter nvidia: cc: Use nydus-snapshotter	2026-03-23 12:50:15 +01:00
dependabot[bot]	8df9cf35df	build(deps): bump rustls-webpki in /tools/packaging/kata-deploy/binary Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.8 to 0.103.10. - [Release notes](https://github.com/rustls/webpki/releases) - [Commits](https://github.com/rustls/webpki/compare/v/0.103.8...v/0.103.10) --- updated-dependencies: - dependency-name: rustls-webpki dependency-version: 0.103.10 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>	2026-03-23 10:34:27 +00:00
Alex Lyn	d2c2ec6e23	Merge pull request #12633 from LandonTClipp/docs_materialx docs: Move to mkdocs-material, port Helm to docs site	2026-03-23 09:29:25 +08:00
Fabiano Fidêncio	6194510e90	nvidia: cc: Use nydus-snapshotter We've been using `experimental_force_guest_pull`, but now that we have a containerd release that should work more reliably with the multi snapshotter setup, we want to give it a try. Note: We need containerd 2.2.2+. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-22 10:10:34 +01:00
Agam Dua	7e3fd74779	kernel: bump config version With debug/ebpf updates in place, let's bump the kata config version. Signed-off-by: Agam Dua <agam_dua@apple.com> Co-authored-by: Eric Ernst <eric_ernst@apple.com>	2026-03-20 15:04:15 -07:00
Agam Dua	f6319da73d	tests: Add eBPF and dwarves to spell check dictionary Add missing terms to the spell check dictionary to fix CI failures for kernel debug documentation: - eBPF - dwarves: Linux package with DWARF/BTF tools (pahole) required for CONFIG_DEBUG_INFO_BTF kernel option Also fix the casing of "ebpf" to "eBPF" in the kernel README to match the official naming convention. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 15:04:08 -07:00
Agam Dua	91d6c39f06	kernel: Fix debug build and add debug symbols to installation Fixed a bug with the debug kernel build where common/ was repeated after the common path variable, resulting in the debug confs never being picked up. This exposed a subsequent bug where the debug conf was included in other builds; this is also fixed by creating a separate directory for debug confs with one file at the moment, debug.conf, that contains debug configurations and bpf specific configs. To enable kernel builds (specifically for bpf) the dwarves package was added to the kernel dockerfile for the pahole package. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
Agam Dua	5ab0744c25	ci: Add pipeline for building and distributing the debug kernel Add the debug kernel to the kata tarball alongside the other kernels. Also update the kernel README documentation to describe the new debug kernel build process. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
Agam Dua	e905b74267	kernel: Add eBPF configs for debug builds Adds a BPF section in the debug.conf kernel configuration options to enable eBPF and BTF support for debug kernel builds. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
LandonTClipp	795869152d	docs: Move to mkdocs-material, port Helm to docs site This supersedes https://github.com/kata-containers/kata-containers/pull/12622. I replaced Zensical with mkdocs-materialx. Materialx is a fork of mkdocs-material created after mkdocs-material was put into maintenance mode. We'll use this platform until Zensical is more feature complete. Added a few of the existing docs into the site to make a more user-friendly flow. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2026-03-20 14:51:39 -05:00
stevenhorsman	44ec815f77	gatekeeper: Add check-spelling to required Add the new job to the list of static checks that runs on doc updates. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-19 10:22:54 +00:00
Aurélien Bombo	352b4cdad2	Merge pull request #12660 from LandonTClipp/ci_docs ci: Don't run CI builds on doc PRs	2026-03-17 12:19:11 -05:00
Manuel Huber	660e3bb653	gpu: Obsolete the NVIDIA initrd build As the NVIDIA stack has shifted to using an image for both the confidential and non-confidential variants, we retire the initrd build. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 21:29:58 -04:00
Aurélien Bombo	f8e234c6f9	Merge pull request #12650 from kata-containers/sprt/remove-csi ci: Stop building/deploying CSI driver	2026-03-16 16:53:02 -05:00
Manuel Huber	5210584f95	release: Bump version to 3.28.0 Bump VERSION and helm-charts versions. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:52:35 -07:00
Manuel Huber	a9b222f91e	gpu: Update chiseled rootfs with new CDH deps With CDH requiring libcryptsetup, mkfs.ext4, dd, and their dependencies, we will need to update the chiseled NVIDIA rootfs accordingly. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:43:17 -07:00
Manuel Huber	169f92ff09	agent: cdh: Update CDH and API With the new CDH version, the secure_mount API changes. Further, the new CDH version no longer uses the luks-encrypt-storage script but utilizes libcryptsetup as well as mkfs.ext4 and dd. Hence, adapt some of the CDH and Kata components build steps. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:43:17 -07:00
Alex Lyn	ef5db0a01f	Merge pull request #12607 from zvonkok/system-map kernel: Ship System.map as part of the kernel build	2026-03-16 09:37:44 +08:00
Zvonko Kaiser	99f32de1e5	kata-deploy: Update RuntimeClass PodOverhead Align the podOverhead with the default_memory updated in the previous commit. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
Zvonko Kaiser	d22c314e91	gpu: Increase dial_timeout=1200 For cold-plug when running with nerdctl the timeouts in the config are being used, increase the dial_timeout (e.g. for CreateSandbox) to match create_container_timeout. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
Zvonko Kaiser	7fe84c8038	gpu: HGX Rootfs Fixes Various smaller fixes to enable HGX systems. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
Joji Mekkattuparamban	1fd66db271	nvidia-gpu: add missing libraries to rootfs Added the missing packages to the nvidia rootfs. Fixes #12534 Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>	2026-03-13 16:24:32 -07:00
Zvonko Kaiser	d382379571	kernel: Ship System.map as part of the kernel build Some use-cases need the System.map of the running kernel, ship it via kernel-artifact. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-13 19:27:18 +00:00
LandonTClipp	d5d741f4e3	ci: Don't run CI builds on doc PRs We disable the Kata artifact builds and testing if the PR is only related to documentation. Regular static checks will remain. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2026-03-12 15:48:05 -05:00
Manuel Huber	8162d15b46	nvidia: fix invalid CTK reference Use proper reference from versions yaml structure. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-11 12:49:29 -07:00
Aurélien Bombo	32444737b5	gatekeeper: Remove csi-kata-directvolume build from required tests Since we don't build that anymore. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-11 12:55:23 -05:00
Aurélien Bombo	64aed13d5f	Revert "ci: Add no-op step to compile CSI driver" This reverts commit `e43c59a2c6`. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-11 12:55:23 -05:00
Aurélien Bombo	d598e0baf1	Revert "ci: Implement build step for CSI driver" This partially reverts commit `fb87bf221f`. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-11 12:55:23 -05:00
Dan Mihai	04f180434e	Merge pull request #12640 from burgerdev/genpolicy-workspace genpolicy: add to Cargo workspace	2026-03-11 09:02:39 -07:00
Markus Rudy	cf7d4c33b3	kata-deploy: fix binary location for genpolicy Moving the genpolicy crate into the root workspace causes the build outputs to go into the root workspace's target directory, instead of src/tools/genpolicy/target, invalidating assumptions made by the kata-deploy-binaries script. This commit adds a special case for the lookup path of the genpolicy binary, and fixes two bugs that made identifying this problem harder. Signed-off-by: Markus Rudy <mr@edgeless.systems>	2026-03-11 09:30:48 +01:00
stevenhorsman	8ae0e36737	versions: bump golang to 1.25.8 Bump the builder image and versions to resolve CVEs: - GO-2026-4601 - GO-2026-4602 - GO-2026-4603 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-09 09:10:01 +00:00
Fabiano Fidêncio	83a8b257d1	Merge pull request #12265 from fidencio/topic/nvidia-bump-container-toolkit nvidia: Bump nvidia-container-toolkit to 1.18.1	2026-03-05 15:25:15 +01:00
Fabiano Fidêncio	079fac1309	Merge pull request #12591 from fidencio/topic/kernel-add-mmio-back-to-the-unified-kernels kernel: include mmio fragment in unified build for firecracker	2026-03-05 13:45:41 +01:00
Fabiano Fidêncio	e9894c0bd8	nvidia: Bump nvidia-container-toolkit to 1.18.1 Let's update the nvidia-container-toolkit to 1.18.1 (from 1.17.6). We're, from now on, relying on the version set in the versions.yaml file. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-05 11:53:09 +01:00
Zachary Spar	bda9f6491f	kata-deploy: add per-shim configurable pod overhead Allow users to override the default RuntimeClass pod overhead for any shim via shims.<name>.runtimeClass.overhead.{memory,cpu}. When the field is absent the existing hardcoded defaults from the dict are used, so this is fully backward compatible. Signed-off-by: Zachary Spar <zspar@coreweave.com>	2026-03-05 08:00:01 +01:00
Fabiano Fidêncio	cb0d02e40b	kernel: include mmio fragment in unified build for firecracker Remove # !confidential from mmio.conf so CONFIG_VIRTIO_MMIO and CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES are included when building the unified x86_64/s390x kernel with -x. Firecracker requires virtio-mmio for block devices; without it the guest kernel panics (no /dev/vda). Fixes: #12581 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 21:18:35 +01:00
Fabiano Fidêncio	d40afe592c	genpolicy: add settings drop-in directory and RFC 6902 JSON Patch support Allow genpolicy -j to accept a directory instead of a single file. When given a directory, genpolicy loads genpolicy-settings.json from it and applies all genpolicy-settings.d/*.json files (sorted by name) as RFC 6902 JSON Patches. This gives precise control over settings with explicit operations (add, remove, replace, move, copy, test), including array index manipulation and assertions. Ship composable drop-in examples in drop-in-examples/: - 10-* files set platform base settings (non-CoCo, AKS, CBL-Mariner) - 20-* files overlay specific adjustments (OCI version, guest pull) Users copy the combination they need into genpolicy-settings.d/. Replace the old adapt_common_policy_settings_* jq-patching functions in tests_common.sh with install_genpolicy_drop_ins(), which copies the right combination of 10-* and 20-* drop-ins for the CI scenario. Tests still generate 99-test-overrides.json on the fly for per-test request/exec overrides. Packaging installs 10-* and 20-* drop-ins from drop-in-examples/ into the tarball; the default genpolicy-settings.d/ is left empty. Made-with: Cursor Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 20:13:21 +01:00
Steve Horsman	a4a4683ec7	Merge pull request #12626 from kata-containers/topic/kata-deploy-k3s-rke2-use-imports kata-deploy: a bunch of fixes regarding uninstall, rke2 and k3s tests	2026-03-04 14:01:09 +00:00
Steve Horsman	8e11bb2526	Merge pull request #12611 from mythi/coco-kernel-v6.18.15 versions: bump to Linux v6.18.15 (LTS)	2026-03-04 14:00:00 +00:00
Steve Horsman	94f850979f	Merge pull request #12613 from stevenhorsman/tooling-bump-x/net-to-v0.51.0 Tooling bump x/net to v0.51.0	2026-03-04 13:44:22 +00:00
stevenhorsman	8640f27516	ci: Remove SNP tests from required The SNP tests have been unstable on nightlies, and although the environment seems to be manually cleaned up or something similar, PR tests are consistently failing, so we should drop them from the required list until they are reliable. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-04 14:41:09 +01:00
Fabiano Fidêncio	ebe75cc3e3	kata-deploy: make verification job resilient to CRI runtime restarts kata-deploy restarts the CRI runtime (k3s/containerd) during install, which can kill the verification job pod or cause transient API server errors. Bump backoffLimit from 0 to 3 so the job can retry after being killed, and add a retry loop around kubectl rollout status to handle transient connection failures. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:26:31 +01:00
Fabiano Fidêncio	7a08ef2f8d	kata-deploy: run cleanup on SIGTERM instead of preStop hook Move the cleanup logic from a preStop lifecycle hook (separate exec) into the main process's SIGTERM handler. This simplifies the architecture: the install process now handles its own teardown when the pod is terminated. The SIGTERM handler is registered before install begins, and tokio::select! races install against SIGTERM so cleanup always runs even if SIGTERM arrives mid-install (e.g. helm uninstall while the container is restarting after a failed install attempt). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:26:31 +01:00
Fabiano Fidêncio	01895bf87e	kata-deploy: use k3s/rke2 drop-in Check the rendered containerd config for the versioned drop-in dir import (config.toml.d or config-v3.toml.d) and bail with a clear error if it is missing. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:08:26 +01:00
Aurélien Bombo	d821d4e572	Merge pull request #12619 from sprt/require-editorconfig gatekeeper: Add EditorConfig checker to required tests	2026-03-03 21:36:32 -06:00
Fabiano Fidêncio	b0345d50e8	build: kernel: Do not expect a modules tarball for vanilla kernel When I added this I had in mind the period when we still relied on the SEV module being generated, which we haven't done for quite a long time. This wrong assumption caused the cache to ALWAYS fail, increasing our build time considerably for no reason. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-03 20:14:42 +01:00
Aurélien Bombo	911742e26e	gatekeeper: Add EditorConfig checker to required tests Now that it's stable and fully configured. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-03 11:34:06 -06:00
Mikko Ylinen	2cf9018e35	versions: bump to Linux v6.18.15 (LTS) Bump to the latest LTS kernel to get a fix for TDX: efi: Fix reservation of unaccepted memory table See details in: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0862438c90487e79822d5647f854977d50381505 Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-03-03 07:56:24 +02:00
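
Note on commit fd583d833b (snapshot removal rounds): the leaf-first loop described in that message can be sketched as below. This is only an illustrative model in Python, not the kata-deploy implementation (which is not shown in this log); the function name `remove_snapshots_leaf_first`, the `parents` mapping, and the `in_use` set are hypothetical stand-ins for what the real cleanup would obtain from containerd.

```python
from typing import Dict, Optional, Set, Tuple


def remove_snapshots_leaf_first(
    parents: Dict[str, Optional[str]], in_use: Set[str]
) -> Tuple[bool, int]:
    """Model of the round-by-round cleanup.

    parents maps each snapshot key to its parent key (None for the base layer).
    in_use holds keys that cannot be removed (assumed: held by running workloads).
    Returns (all_removed, rounds_that_made_progress).
    """
    remaining = dict(parents)
    rounds = 0
    while remaining:
        # A leaf is a snapshot that no remaining snapshot uses as its parent.
        referenced = {p for p in remaining.values() if p is not None}
        removable = [
            key for key in remaining
            if key not in referenced and key not in in_use
        ]
        if not removable:
            # No progress this round: stop and report failure so the caller
            # can skip wiping the backend data directory.
            return False, rounds
        for key in removable:
            del remaining[key]
        rounds += 1
    return True, rounds


# A three-layer image forms a linear chain, so it needs three rounds.
chain = {"layer1": None, "layer2": "layer1", "layer3": "layer2"}
print(remove_snapshots_leaf_first(chain, in_use=set()))
print(remove_snapshots_leaf_first(chain, in_use={"layer3"}))
```

In this model the boolean plays the role of the return value that, per the commit message, gates the data directory removal: true means the backend can be wiped, false means it is left in place.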

2152 Commits