kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-04-03 02:22:55 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	3a1683ccdc	gatekeeper: unrequire kata-deploy k3s tests Those are breaking, and I need time to investigate why. For now, unrequire those tests. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-31 18:32:17 +02:00
Fabiano Fidêncio	4fad88499c	kata-deploy: rename nydus-snapshotter to nydus-for-kata-tee Rename all host-visible names of the nydus-snapshotter instance managed by kata-deploy from the generic "nydus-snapshotter" to "nydus-for-kata-tee". This covers the systemd service name, the containerd proxy plugin key, the runtime class snapshotter field, the data directory (/var/lib/nydus-for-kata-tee), the socket path (/run/nydus-for-kata-tee/), and the host install subdirectory. The rename makes it immediately clear that this nydus-snapshotter instance is the one deployed and managed by kata-deploy specifically for Kata TEE use cases, rather than any general-purpose nydus-snapshotter that might be present on the host. Because the old code operated under a completely separate set of paths (nydus-snapshotter.*), any previously deployed installation continues to run without interference during the transition to this new naming. CI pipelines and operators can upgrade kata-deploy on their own schedule without having to coordinate an atomic cutover: the old service keeps serving its existing workloads until it is explicitly replaced, and the new deployment lands cleanly alongside it. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-27 11:14:54 +01:00
Steve Horsman	8c2b7ed619	Merge pull request #12729 from fidencio/topic/kata-deploy-nydus-dont-touch-data-dir-on-install kata-deploy: nydus: never remove the data dir	2026-03-25 10:28:50 +00:00
Steve Horsman	0d8186ae16	Merge pull request #12730 from fidencio/topic/bump-nydus-snapshotter versions: Bump nydus-snapshotter to v0.15.13	2026-03-25 10:20:23 +00:00
Fabiano Fidêncio	bcfb2354e0	gatekeeper: Unrequire NVIDIA GPU SNP tests till auth is fixed SSIA, the NIM tests are breaking due to authentication issues, and those issues are blocking other PRs. Let's unrequire the test for now, and mark it as required again once we fixed the auth issues. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-25 10:23:53 +01:00
Fabiano Fidêncio	caf6b244e6	versions: Bump nydus-snapshotter to v0.15.13 As this brings in a fix for using images with too many layers. https://github.com/containerd/nydus-snapshotter/releases/tag/v0.15.13 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-25 08:31:48 +01:00
Fabiano Fidêncio	fb5482f647	kata-deploy: nydus: never remove the data directory Removing /var/lib/nydus-snapshotter during install or uninstall creates a split-brain state: the nydus backend starts empty while containerd's BoltDB (meta.db) still holds snapshot records from the previous run. Any subsequent image pull then fails with: "unable to prepare extraction snapshot: target snapshot \"sha256:...\": already exists" An earlier attempt cleaned up containerd's BoltDB via `ctr snapshots rm` before wiping the directory, but that cleanup is inherently fragile: - It requires the nydus gRPC service to be reachable at cleanup time. If the service is stopped, crashed, or not yet running, every `ctr` call silently fails and the stale records remain. - Any workload still actively using a snapshot blocks the entire cleanup, making it impossible to guarantee a clean state. The correct invariant is that meta.db and the nydus backend always agree. Preserving the data directory unconditionally guarantees this: - Fresh install: data directory does not exist, nydus starts empty. - Reinstall: existing snapshots and nydus.db are preserved, meta.db and backend remain in sync, new binary starts cleanly. - After uninstall: containerd is reconfigured without the nydus proxy_plugins entry and restarted, so the snapshot records in meta.db are completely dormant — nothing will use them. If nydus is reinstalled later, the data directory is still present and both sides remain in sync, so no split-brain can occur. Any stale snapshots from previous workloads are garbage-collected by containerd once the images referencing them are removed. This also removes the cleanup_containerd_nydus_snapshots, cleanup_nydus_snapshots, and cleanup_nydus_containers helpers that were introduced by the earlier (fragile) attempt. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-03-25 07:06:41 +01:00
Alex Lyn	46aa318b74	Merge pull request #12716 from lifupan/bump_dragonball_kernel kernel: Bump the kernel to v6.18.15 for dragonball	2026-03-25 11:04:44 +08:00
Aurélien Bombo	ec9c57c595	Merge pull request #12467 from ldoktor/gk-output tools.gatekeeper: Improve output	2026-03-24 17:03:55 -05:00
Fabiano Fidêncio	fd583d833b	kata-deploy: nydus: clean containerd metadata before wiping backend When /var/lib/nydus-snapshotter is removed, containerd's BoltDB (meta.db at /var/lib/containerd/) still holds snapshot records for the nydus snapshotter. On the next install these stale records cause image pulls to fail with: "unable to prepare extraction snapshot: target snapshot \"sha256:...\": already exists" The failure path in core/unpack/unpacker.go: 1. sn.Prepare() → metadata layer finds the target chainID in BoltDB → returns AlreadyExists without touching the nydus backend. 2. sn.Stat() → metadata layer finds the BoltDB record, then calls s.Snapshotter.Stat(bkey) on the nydus gRPC backend → NotFound (backend was wiped). 3. The unpacker treats NotFound as a transient key-collision race and retries 3 times; all 3 attempts hit the same dead end, and the pull is aborted. The commit message of `62ad0814c` ("nydus: Always start from a clean state") assumed "containerd will re-pull/re-unpack when it finds non-existent snapshots", but that is not what happens: the metadata layer intercepts the Prepare call in BoltDB before the backend is ever consulted. Fix: call cleanup_containerd_nydus_snapshots() before stopping the nydus service (and thus before wiping its data directory) in both install_nydus_snapshotter and uninstall_nydus_snapshotter. The cleanup must run while the service is still up because ctr snapshots rm goes through the metadata layer which calls the nydus gRPC backend to physically remove the snapshot; if the service is already stopped the backend call fails and the BoltDB record remains. The cleanup: - Discovers all containerd namespaces via `ctr namespaces ls -q` (falls back to k8s.io if that fails). - Removes containers whose Snapshotter field matches the nydus plugin name; these become dangling references once snapshots are gone and can confuse container reconciliation after an aborted CI run. - Removes snapshots round by round (leaf-first) until either the list is empty or no progress can be made (see below). Note: containerd's GC cannot substitute for this explicit cleanup. The image record (a GC root) references content blobs which reference the snapshots via gc.ref labels, keeping the entire chain alive in the GC graph even after the nydus backend is wiped. Snapshot removal rounds ----------------------- Snapshot chains are linear: an image with N layers produces a chain of N snapshots, each parented on the previous. Only the current leaf can be removed each round, so N layers require exactly N rounds. There is no fixed round cap — the loop terminates when either the list reaches zero (success) or a round removes nothing at all (all remaining snapshots are actively in use by running workloads). Active workload safety ---------------------- If active workloads still hold nydus snapshots (e.g. during a live upgrade), no progress is made in a round and cleanup_nydus_snapshots returns false. Both install_nydus_snapshotter and uninstall_nydus_snapshotter gate the fs::remove_dir_all on that return value: - true → proceed as before: stop service, wipe data dir. - false → stop service, skip data dir removal, log a warning. The new nydus instance starts on the existing backend state; running containers are left intact. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-03-24 16:44:25 +01:00
Fupan Li	6a832dd1f3	kernel: Bump the kernel to v6.18.15 for dragonball Bump the dragonball supported kernel to v6.18.15. Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>	2026-03-24 06:46:43 +08:00
Fabiano Fidêncio	1ec97d25e7	Merge pull request #12704 from stevenhorsman/security-fixes-23-mar-26 Security fixes 23 mar 26	2026-03-23 15:27:07 +01:00
Fabiano Fidêncio	514a2b1a7c	Merge pull request #12264 from fidencio/topic/nvidia-gpu-cc-use-nydus-snapshotter nvidia: cc: Use nydus-snapshotter	2026-03-23 12:50:15 +01:00
dependabot[bot]	8df9cf35df	build(deps): bump rustls-webpki in /tools/packaging/kata-deploy/binary Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.8 to 0.103.10. - [Release notes](https://github.com/rustls/webpki/releases) - [Commits](https://github.com/rustls/webpki/compare/v/0.103.8...v/0.103.10) --- updated-dependencies: - dependency-name: rustls-webpki dependency-version: 0.103.10 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>	2026-03-23 10:34:27 +00:00
Alex Lyn	d2c2ec6e23	Merge pull request #12633 from LandonTClipp/docs_materialx docs: Move to mkdocs-material, port Helm to docs site	2026-03-23 09:29:25 +08:00
Fabiano Fidêncio	6194510e90	nvidia: cc: Use nydus-snapshotter We've been using `experimental_force_guest_pull`, but now that we have a containerd release that should work more reliably with the multi snapshotter setup, we want to give it a try. Note: We need containerd 2.2.2+. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-22 10:10:34 +01:00
Agam Dua	7e3fd74779	kernel: bump config version With debug/ebpf updates in place, let's bump the kata config version. Signed-off-by: Agam Dua <agam_dua@apple.com> Co-authored-by: Eric Ernst <eric_ernst@apple.com>	2026-03-20 15:04:15 -07:00
Agam Dua	f6319da73d	tests: Add eBPF and dwarves to spell check dictionary Add missing terms to the spell check dictionary to fix CI failures for kernel debug documentation: - eBPF - dwarves: Linux package with DWARF/BTF tools (pahole) required for CONFIG_DEBUG_INFO_BTF kernel option Also fix the casing of "ebpf" to "eBPF" in the kernel README to match the official naming convention. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 15:04:08 -07:00
Agam Dua	91d6c39f06	kernel: Fix debug build and add debug symbols to installation Fixed a bug with the debug kernel build where common/ was repeated after the common path variable, resulting in the debug confs never being picked up. This exposed a subsequent bug where the debug conf was included in other builds; this is also fixed by creating a separate directory for debug confs with one file at the moment, debug.conf, which contains debug configurations and bpf specific configs. To enable kernel builds (specifically for bpf) the dwarves package was added to the kernel dockerfile for the pahole package. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
Agam Dua	5ab0744c25	ci: Add pipeline for building and distributing the debug kernel Add the debug kernel to the kata tarball alongside the other kernels. Also update the kernel README documentation to describe the new debug kernel build process. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
Agam Dua	e905b74267	kernel: Add eBPF configs for debug builds Adds a BPF section in the debug.conf kernel configuration options to enable eBPF and BTF support for debug kernel builds. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
LandonTClipp	795869152d	docs: Move to mkdocs-material, port Helm to docs site This supersedes https://github.com/kata-containers/kata-containers/pull/12622. I replaced Zensical with mkdocs-materialx. Materialx is a fork of mkdocs-material created after mkdocs-material was put into maintenance mode. We'll use this platform until Zensical is more feature complete. Added a few of the existing docs into the site to make a more user-friendly flow. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2026-03-20 14:51:39 -05:00
stevenhorsman	44ec815f77	gatekeeper: Add check-spelling to required Add the new job to the list of static checks that runs on doc updates. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-19 10:22:54 +00:00
Aurélien Bombo	352b4cdad2	Merge pull request #12660 from LandonTClipp/ci_docs ci: Don't run CI builds on doc PRs	2026-03-17 12:19:11 -05:00
Manuel Huber	660e3bb653	gpu: Obsolete the NVIDIA initrd build As the NVIDIA stack has shifted to using an image for both the confidential and non-confidential variants, we retire the initrd build. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 21:29:58 -04:00
Aurélien Bombo	f8e234c6f9	Merge pull request #12650 from kata-containers/sprt/remove-csi ci: Stop building/deploying CSI driver	2026-03-16 16:53:02 -05:00
Manuel Huber	5210584f95	release: Bump version to 3.28.0 Bump VERSION and helm-charts versions. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:52:35 -07:00
Manuel Huber	a9b222f91e	gpu: Update chiseled rootfs with new CDH deps With CDH requiring libcryptsetup, mkfs.ext4, dd, and their dependencies, we will need to update the chiseled NVIDIA rootfs accordingly. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:43:17 -07:00
Manuel Huber	169f92ff09	agent: cdh: Update CDH and API With the new CDH version, the secure_mount API changes. Further, the new CDH version no longer uses the luks-encrypt-storage script but utilizes libcryptsetup as well as mkfs.ext4 and dd. Hence, adapt some of the CDH and Kata components build steps Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:43:17 -07:00
Alex Lyn	ef5db0a01f	Merge pull request #12607 from zvonkok/system-map kernel: Ship System.map as part of the kernel build	2026-03-16 09:37:44 +08:00
Zvonko Kaiser	99f32de1e5	kata-deploy: Update RuntimeClass PodOverhead Align the podOverhead with the default_memory updated in the previous commit. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
Zvonko Kaiser	d22c314e91	gpu: Increase dial_timeout=1200 For cold-plug when running with nerdctl the timeouts in the config are being used, increase the dial_timeout (e.g. for CreateSandbox) to match create_container_timeout. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
Zvonko Kaiser	7fe84c8038	gpu: HGX Rootfs Fixes Various smaller fixes to enable HGX systems. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
Joji Mekkattuparamban	1fd66db271	nvidia-gpu: add missing libraries to rootfs Added the missing packages to the nvidia rootfs. Fixes #12534 Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>	2026-03-13 16:24:32 -07:00
Zvonko Kaiser	d382379571	kernel: Ship System.map as part of the kernel build Some use-cases need the System.map of the running kernel, ship it via kernel-artifact. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-13 19:27:18 +00:00
LandonTClipp	d5d741f4e3	ci: Don't run CI builds on doc PRs We disable the Kata artifact builds and testing if the PR is only related to documentation. Regular static checks will remain. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2026-03-12 15:48:05 -05:00
Manuel Huber	8162d15b46	nvidia: fix invalid CTK reference Use proper reference from versions yaml structure. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-11 12:49:29 -07:00
Aurélien Bombo	32444737b5	gatekeeper: Remove csi-kata-directvolume build from required tests Since we don't build that anymore. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-11 12:55:23 -05:00
Aurélien Bombo	64aed13d5f	Revert "ci: Add no-op step to compile CSI driver" This reverts commit `e43c59a2c6`. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-11 12:55:23 -05:00
Aurélien Bombo	d598e0baf1	Revert "ci: Implement build step for CSI driver" This partially reverts commit `fb87bf221f`. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-03-11 12:55:23 -05:00
Dan Mihai	04f180434e	Merge pull request #12640 from burgerdev/genpolicy-workspace genpolicy: add to Cargo workspace	2026-03-11 09:02:39 -07:00
Markus Rudy	cf7d4c33b3	kata-deploy: fix binary location for genpolicy Moving the genpolicy crate into the root workspace causes the build outputs to go into the root workspace's target directory, instead of src/tools/genpolicy/target, invalidating assumptions made by the kata-deploy-binaries script. This commit adds a special case for the lookup path of the genpolicy binary, and fixes two bugs that made identifying this problem harder. Signed-off-by: Markus Rudy <mr@edgeless.systems>	2026-03-11 09:30:48 +01:00
stevenhorsman	8ae0e36737	versions: bump golang to 1.25.8 Bump the builder image and versions to resolve CVEs: - GO-2026-4601 - GO-2026-4602 - GO-2026-4603 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-09 09:10:01 +00:00
Lukáš Doktor	ce65d17276	tools.gatekeeper: Add support for GITHUB_STEP_SUMMARY This should produce a table of failed/running jobs along with links to them. On pass it should only produce a simple line with how many jobs passed. Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>	2026-03-06 12:19:26 -03:00
Lukáš Doktor	27bebfb438	tools.gatekeeper: Print link to the results in status output To simplify analyzing failures, let's print the link to the job result next to the status. Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>	2026-03-06 12:19:26 -03:00
Fabiano Fidêncio	83a8b257d1	Merge pull request #12265 from fidencio/topic/nvidia-bump-container-toolkit nvidia: Bump nvidia-container-toolkit to 1.18.1	2026-03-05 15:25:15 +01:00
Fabiano Fidêncio	079fac1309	Merge pull request #12591 from fidencio/topic/kernel-add-mmio-back-to-the-unified-kernels kernel: include mmio fragment in unified build for firecracker	2026-03-05 13:45:41 +01:00
Fabiano Fidêncio	e9894c0bd8	nvidia: Bump nvidia-container-toolkit to 1.18.1 Let's update the nvidia-container-toolkit to 1.18.1 (from 1.17.6). We're, from now on, relying on the version set in the versions.yaml file. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-05 11:53:09 +01:00
Zachary Spar	bda9f6491f	kata-deploy: add per-shim configurable pod overhead Allow users to override the default RuntimeClass pod overhead for any shim via shims.<name>.runtimeClass.overhead.{memory,cpu}. When the field is absent the existing hardcoded defaults from the dict are used, so this is fully backward compatible. Signed-off-by: Zachary Spar <zspar@coreweave.com>	2026-03-05 08:00:01 +01:00
Fabiano Fidêncio	cb0d02e40b	kernel: include mmio fragment in unified build for firecracker Remove # !confidential from mmio.conf so CONFIG_VIRTIO_MMIO and CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES are included when building the unified x86_64/s390x kernel with -x. Firecracker requires virtio-mmio for block devices; without it the guest kernel panics (no /dev/vda). Fixes: #12581 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 21:18:35 +01:00
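
The "Snapshot removal rounds" logic described in commit fd583d833b (and later dropped again by fb5482f647) can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the kata-deploy code: the function name `remove_snapshots_leaf_first`, the `parents` mapping, and the `in_use` set are hypothetical stand-ins for what the real helper obtained through `ctr`. It shows why a linear chain of N snapshots needs N rounds, and why a round that removes nothing ends the loop with a failure result.

```python
def remove_snapshots_leaf_first(parents, in_use=frozenset()):
    """Simulate leaf-first snapshot removal.

    parents: hypothetical mapping of snapshot name -> parent name (or None).
    in_use:  hypothetical set of snapshots held by running workloads.
    Returns (success, rounds_completed).
    """
    remaining = dict(parents)
    rounds = 0
    while remaining:
        # A leaf is a snapshot that no remaining snapshot names as its parent.
        referenced = {p for p in remaining.values() if p is not None}
        removable = [
            name for name in remaining
            if name not in referenced and name not in in_use
        ]
        if not removable:
            # No progress in this round: the rest is pinned by active workloads.
            return False, rounds
        for name in removable:
            del remaining[name]
        rounds += 1
    return True, rounds


# A three-layer image: each snapshot is parented on the previous one.
chain = {"layer1": None, "layer2": "layer1", "layer3": "layer2"}
print(remove_snapshots_leaf_first(chain))
print(remove_snapshots_leaf_first(chain, in_use={"layer3"}))
```

In the commit's description, the boolean result is what gates the data directory removal: only a successful cleanup allows the directory to be wiped, otherwise it is left in place and a warning is logged.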


2164 Commits