kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-04-09 13:32:08 +00:00

Author	SHA1	Message	Date
Harshitha Gowda	bb1165b23f	tests: Set sev-snp, qemu-snp CIs as required run-k8s-tests-on-tee (sev-snp, qemu-snp) Signed-off-by: Harshitha Gowda <hgowda@amd.com>	2026-04-08 22:36:58 +02:00
Fabiano Fidêncio	21466eb4e5	kata-deploy: Fix clippy warnings across crate Fix all clippy warnings triggered by -D warnings: - install.rs: remove useless .into() conversions on PathBuf values and replace vec! with an array literal where a Vec is not needed - utils/toml.rs: replace while-let-on-iterator with a for loop and drop the now-unnecessary mut on the iterator binding - main.rs: replace match-with-single-pattern with if-let in two places dealing with experimental_setup_snapshotter - utils/yaml.rs: extract repeated serde_yaml::Value::String key into a local variable, removing needless borrows on temporary values Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-08 20:47:59 +02:00
Fabiano Fidêncio	1874d4617b	kata-deploy: Run cargo clippy during build Ensure code formatting and compilation are verified early in the Docker build pipeline, before tests and the release build. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-08 20:47:59 +02:00
Greg Kurz	817580e35d	Merge pull request #12795 from fidencio/topic/kata-deploy-do-not-try-to-install-a-snapshotter-when-using-crio kata-deploy: Skip snapshotter install/uninstall on CRI-O	2026-04-08 17:18:05 +02:00
Fabiano Fidêncio	f27def1a5b	kata-deploy: Skip snapshotter install/uninstall on CRI-O Snapshotters (nydus, erofs) are containerd-specific. The validation code already warned that EXPERIMENTAL_SETUP_SNAPSHOTTER would be ignored on CRI-O, but the actual install/configure and uninstall loops still ran unconditionally, attempting containerd-specific operations on CRI-O nodes. Guard both the install and cleanup snapshotter loops with a `runtime != "crio"` check so the binary itself skips snapshotter work when it detects CRI-O as the container runtime. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-08 14:41:49 +02:00
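The guard described in this commit could look roughly like the sketch below. It is only an illustration: `should_manage_snapshotters` and `install_snapshotters` are hypothetical names, and the real install step is replaced by recording the snapshotter name.

```rust
/// Hypothetical helper: snapshotters are containerd-specific, so they are
/// skipped whenever the detected container runtime is CRI-O.
fn should_manage_snapshotters(runtime: &str) -> bool {
    runtime != "crio"
}

/// Hypothetical stand-in for the install loop. Returns the names of the
/// snapshotters that would have been installed on this node.
fn install_snapshotters(runtime: &str, requested: &[&str]) -> Vec<String> {
    let mut installed = Vec::new();
    if should_manage_snapshotters(runtime) {
        for name in requested {
            // The real code would install and configure the snapshotter here.
            installed.push(name.to_string());
        }
    }
    installed
}
```

The same predicate would guard the cleanup loop, so install and uninstall stay symmetric on CRI-O nodes.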
Fabiano Fidêncio	bc719a66eb	kata-deploy: nvidia: Align force_guest_pull with default values.yaml The default is already false, but let's keep those aligned by explicitly setting the default. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-08 14:41:21 +02:00
Fabiano Fidêncio	78f02f2155	kata-deploy: nvidia: Align labels with default values.yaml Joji's added the labels for the default values.yaml, but we missed adding those to the nvidia specific values.yaml file. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-08 14:41:21 +02:00
Fabiano Fidêncio	f00b589ccd	Revert "kata-deploy: Temporarily comment GPU specific labels" This reverts commit `02c9a4b23c`, as GPU Operator v26.3.0 is out, and becomes a requirement. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-08 14:41:21 +02:00
Alex Lyn	c00f895338	kata-deploy: Fix noise caused by unformatted code Running cargo fmt --all reports some files as not formatted according to `cargo fmt`. This commit only addresses that. For example: ``` // Generate the common drop-in files (shared with standard // runtimes) - write_common_drop_ins(config, &runtime.base_config, &config_d_dir, container_runtime)?; + write_common_drop_ins( + config, + &runtime.base_config, + &config_d_dir, + container_runtime, + )?; ``` Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-04-08 14:39:57 +02:00
Fabiano Fidêncio	a12e0f1204	build: cache: Take NVRC & NVAT version into consideration Without those, we'd end up pulling the same / old rootfs that's cached without re-building it in case of a bump in any of those components. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-08 10:14:11 +02:00
Fabiano Fidêncio	b3ae6ef99c	Merge pull request #12760 from fitzthum/bump-nvat Bump trustee and guest-components to add nvswitch / ppcie support	2026-04-07 19:07:50 +02:00
Fabiano Fidêncio	461907918d	kata-deploy: pin nydus-snapshotter via versions.yaml Resolve externals.nydus-snapshotter version and url in the Docker image build with yq from the repo-root versions.yaml instead of Dockerfile ARG defaults. Drop the redundant workflow that only enforced parity between those two sources. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-07 10:07:06 +08:00
Fabiano Fidêncio	9e1f595160	kata-deploy: add Rust binary to root workspace Add tools/packaging/kata-deploy/binary as a workspace member, inherit shared dependency versions from the root manifest, and refresh Cargo.lock. Build the kata-deploy image from the repository root: copy the workspace layout into the rust-builder stage, run cargo test/build with -p kata-deploy, and adjust artifact and static asset COPY paths. Update the payload build script to invoke docker buildx with -f .../Dockerfile from the repo root. Add a repo-root .dockerignore to keep the Docker build context smaller. Document running unit tests with cargo test -p kata-deploy from the root. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-07 10:07:06 +08:00
Tobin Feldman-Fitzthum	0444d70704	rootfs: add runtime support for NVAT Update NVIDIA rootfs builder to include runtime dependencies for NVAT Rust bindings. The nvattest package does not include the .so file, so we need to build from source. Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>	2026-04-06 17:51:12 +00:00
Tobin Feldman-Fitzthum	78c61459f8	packaging: add build-time support for NVAT The attestation agent will soon rely on the NVAT Rust bindings, which have some build-time dependencies. There is currently no nvattest-dev package, so we need to build from source to get the headers and .so file. Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>	2026-04-06 17:51:12 +00:00
Fabiano Fidêncio	47770daa3b	helm: Align values.yaml with try-kata-nvidia-gpu.values.yaml We've switched to nydus there, but never did for the values.yaml. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-06 18:51:54 +02:00
Fabiano Fidêncio	b4b62417ed	kata-deploy: skip cleanup on pod restart to avoid crashing kata pods When a kata-deploy DaemonSet pod is restarted (e.g. due to a label change or rolling update), the SIGTERM handler runs cleanup which unconditionally removes kata artifacts and restarts containerd. This causes containerd to lose the kata shim binary, crashing all running kata pods on the node. Fix this by implementing a three-stage cleanup decision: 1. If this pod's owning DaemonSet still exists (exact name match via DAEMONSET_NAME env var), this is a pod restart — skip all cleanup. The replacement pod will re-run install, which is idempotent. 2. If this DaemonSet is gone but other kata-deploy DaemonSets still exist (multi-install scenario), perform instance-specific cleanup only (snapshotters, CRI config, artifacts) but skip shared resources (node label removal, CRI restart) to avoid disrupting the other instances. 3. If no kata-deploy DaemonSets remain, perform full cleanup including node label removal and CRI restart. The Helm chart injects a DAEMONSET_NAME environment variable with the exact DaemonSet name (including any multi-install suffix), ensuring instance-aware lookup rather than broadly matching any DaemonSet containing "kata-deploy". Fixes: #12761 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-01 15:20:52 +02:00
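The three-stage decision in this commit message can be sketched as a pure function. This is an assumption-laden illustration, not the actual implementation: `CleanupPlan` and `decide_cleanup` are hypothetical names, and `existing` stands for the kata-deploy DaemonSets that a cluster lookup would have returned.

```rust
/// Hypothetical outcome of the cleanup decision.
#[derive(Debug, PartialEq, Eq)]
enum CleanupPlan {
    /// Owning DaemonSet still exists: this is a pod restart, do nothing.
    Skip,
    /// Other kata-deploy DaemonSets remain: clean only this instance.
    InstanceOnly,
    /// No kata-deploy DaemonSets remain: also remove labels and restart CRI.
    Full,
}

/// `own_daemonset` is the value of the DAEMONSET_NAME variable;
/// `existing` is the (assumed) list of kata-deploy DaemonSets still present.
fn decide_cleanup(own_daemonset: &str, existing: &[&str]) -> CleanupPlan {
    if existing.iter().any(|name| *name == own_daemonset) {
        // Exact name match, not a substring match on "kata-deploy".
        CleanupPlan::Skip
    } else if !existing.is_empty() {
        CleanupPlan::InstanceOnly
    } else {
        CleanupPlan::Full
    }
}
```

Keeping the decision separate from the Kubernetes lookup makes each of the three branches easy to unit test.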
Fabiano Fidêncio	28414a614e	kata-deploy: detect k3s/rke2 via systemd services instead of version string Newer k3s releases (v1.34+) no longer include "k3s" in the containerd version string at all (e.g. "containerd://2.2.2-bd1.34" instead of "containerd://2.1.5-k3s1"). This caused kata-deploy to fall through to the default "containerd" runtime, configuring and restarting the system containerd service instead of k3s's embedded containerd — leaving the kata runtime invisible to k3s. Fix by detecting k3s/rke2 via their systemd service names (k3s, k3s-agent, rke2-server, rke2-agent) rather than parsing the containerd version string. This is more robust and works regardless of how k3s formats its containerd version. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-01 14:24:55 +02:00
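A minimal sketch of service-name based detection follows. It assumes the caller has already collected the active systemd unit names by some means; `detect_runtime` and the returned labels "k3s" and "rke2" are hypothetical, and only the four service names and the "containerd" default come from the commit message.

```rust
/// Hypothetical detector: maps active systemd service names to a runtime
/// label, instead of parsing the containerd version string.
fn detect_runtime(active_services: &[&str]) -> &'static str {
    // Service names listed in the commit message.
    const K3S_SERVICES: [&str; 2] = ["k3s", "k3s-agent"];
    const RKE2_SERVICES: [&str; 2] = ["rke2-server", "rke2-agent"];

    if active_services.iter().any(|s| K3S_SERVICES.contains(s)) {
        "k3s"
    } else if active_services.iter().any(|s| RKE2_SERVICES.contains(s)) {
        "rke2"
    } else {
        // Default when neither distribution's service is present.
        "containerd"
    }
}
```

With this approach a version string such as "containerd://2.2.2-bd1.34" no longer matters, since it is never inspected.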
Fabiano Fidêncio	fe1f804543	kata-deploy: Restart nydus-snapshotter in case of failure Let's ensure that in case nydus-snapshotter crashes for one reason or another, the service is restarted. This follows containerd approach, and avoids manual intervention in the node. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-01 11:00:21 +02:00
Fabiano Fidêncio	789abe6fdf	kata-deploy: Make nydus a soft dep of containerd Let's relax our RequiredBy and use a WantedBy in the nydus systemd unit file as, in case of a nydus crash, containerd would also be put down, causing the node to become NotReady. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-01 10:52:29 +02:00
Fabiano Fidêncio	3a1683ccdc	gatekeeper: unrequire kata-deploy k3s tests Those are breaking, and I need time to investigate why. For now, unrequire those tests. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-31 18:32:17 +02:00
Fabiano Fidêncio	4fad88499c	kata-deploy: rename nydus-snapshotter to nydus-for-kata-tee Rename all host-visible names of the nydus-snapshotter instance managed by kata-deploy from the generic "nydus-snapshotter" to "nydus-for-kata-tee". This covers the systemd service name, the containerd proxy plugin key, the runtime class snapshotter field, the data directory (/var/lib/nydus-for-kata-tee), the socket path (/run/nydus-for-kata-tee/), and the host install subdirectory. The rename makes it immediately clear that this nydus-snapshotter instance is the one deployed and managed by kata-deploy specifically for Kata TEE use cases, rather than any general-purpose nydus-snapshotter that might be present on the host. Because the old code operated under a completely separate set of paths (nydus-snapshotter.*), any previously deployed installation continues to run without interference during the transition to this new naming. CI pipelines and operators can upgrade kata-deploy on their own schedule without having to coordinate an atomic cutover: the old service keeps serving its existing workloads until it is explicitly replaced, and the new deployment lands cleanly alongside it. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-27 11:14:54 +01:00
Steve Horsman	8c2b7ed619	Merge pull request #12729 from fidencio/topic/kata-deploy-nydus-dont-touch-data-dir-on-install kata-deploy: nydus: never remove the data dir	2026-03-25 10:28:50 +00:00
Steve Horsman	0d8186ae16	Merge pull request #12730 from fidencio/topic/bump-nydus-snapshotter versions: Bump nydus-snapshotter to v0.15.13	2026-03-25 10:20:23 +00:00
Fabiano Fidêncio	bcfb2354e0	gatekeeper: Unrequire NVIDIA GPU SNP tests till auth is fixed SSIA, the NIM tests are breaking due to authentication issues, and those issues are blocking other PRs. Let's unrequire the test for now, and mark it as required again once we fixed the auth issues. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-25 10:23:53 +01:00
Fabiano Fidêncio	caf6b244e6	versions: Bump nydus-snapshotter to v0.15.13 As this brings in a fix for using images with too many layers. https://github.com/containerd/nydus-snapshotter/releases/tag/v0.15.13 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-25 08:31:48 +01:00
Fabiano Fidêncio	fb5482f647	kata-deploy: nydus: never remove the data directory Removing /var/lib/nydus-snapshotter during install or uninstall creates a split-brain state: the nydus backend starts empty while containerd's BoltDB (meta.db) still holds snapshot records from the previous run. Any subsequent image pull then fails with: "unable to prepare extraction snapshot: target snapshot \"sha256:...\": already exists" An earlier attempt cleaned up containerd's BoltDB via `ctr snapshots rm` before wiping the directory, but that cleanup is inherently fragile: - It requires the nydus gRPC service to be reachable at cleanup time. If the service is stopped, crashed, or not yet running, every `ctr` call silently fails and the stale records remain. - Any workload still actively using a snapshot blocks the entire cleanup, making it impossible to guarantee a clean state. The correct invariant is that meta.db and the nydus backend always agree. Preserving the data directory unconditionally guarantees this: - Fresh install: data directory does not exist, nydus starts empty. - Reinstall: existing snapshots and nydus.db are preserved, meta.db and backend remain in sync, new binary starts cleanly. - After uninstall: containerd is reconfigured without the nydus proxy_plugins entry and restarted, so the snapshot records in meta.db are completely dormant — nothing will use them. If nydus is reinstalled later, the data directory is still present and both sides remain in sync, so no split-brain can occur. Any stale snapshots from previous workloads are garbage-collected by containerd once the images referencing them are removed. This also removes the cleanup_containerd_nydus_snapshots, cleanup_nydus_snapshots, and cleanup_nydus_containers helpers that were introduced by the earlier (fragile) attempt. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-03-25 07:06:41 +01:00
Alex Lyn	46aa318b74	Merge pull request #12716 from lifupan/bump_dragonball_kernel kernel: Bump the kernel to v6.18.15 for dragonball	2026-03-25 11:04:44 +08:00
Aurélien Bombo	ec9c57c595	Merge pull request #12467 from ldoktor/gk-output tools.gatekeeper: Improve output	2026-03-24 17:03:55 -05:00
Fabiano Fidêncio	fd583d833b	kata-deploy: nydus: clean containerd metadata before wiping backend When /var/lib/nydus-snapshotter is removed, containerd's BoltDB (meta.db at /var/lib/containerd/) still holds snapshot records for the nydus snapshotter. On the next install these stale records cause image pulls to fail with: "unable to prepare extraction snapshot: target snapshot \"sha256:...\": already exists" The failure path in core/unpack/unpacker.go: 1. sn.Prepare() → metadata layer finds the target chainID in BoltDB → returns AlreadyExists without touching the nydus backend. 2. sn.Stat() → metadata layer finds the BoltDB record, then calls s.Snapshotter.Stat(bkey) on the nydus gRPC backend → NotFound (backend was wiped). 3. The unpacker treats NotFound as a transient key-collision race and retries 3 times; all 3 attempts hit the same dead end, and the pull is aborted. The commit message of `62ad0814c` ("nydus: Always start from a clean state") assumed "containerd will re-pull/re-unpack when it finds non- existent snapshots", but that is not what happens: the metadata layer intercepts the Prepare call in BoltDB before the backend is ever consulted. Fix: call cleanup_containerd_nydus_snapshots() before stopping the nydus service (and thus before wiping its data directory) in both install_nydus_snapshotter and uninstall_nydus_snapshotter. The cleanup must run while the service is still up because ctr snapshots rm goes through the metadata layer which calls the nydus gRPC backend to physically remove the snapshot; if the service is already stopped the backend call fails and the BoltDB record remains. The cleanup: - Discovers all containerd namespaces via `ctr namespaces ls -q` (falls back to k8s.io if that fails). - Removes containers whose Snapshotter field matches the nydus plugin name; these become dangling references once snapshots are gone and can confuse container reconciliation after an aborted CI run. 
- Removes snapshots round by round (leaf-first) until either the list is empty or no progress can be made (see below). Note: containerd's GC cannot substitute for this explicit cleanup. The image record (a GC root) references content blobs which reference the snapshots via gc.ref labels, keeping the entire chain alive in the GC graph even after the nydus backend is wiped. Snapshot removal rounds ----------------------- Snapshot chains are linear: an image with N layers produces a chain of N snapshots, each parented on the previous. Only the current leaf can be removed each round, so N layers require exactly N rounds. There is no fixed round cap — the loop terminates when either the list reaches zero (success) or a round removes nothing at all (all remaining snapshots are actively in use by running workloads). Active workload safety ---------------------- If active workloads still hold nydus snapshots (e.g. during a live upgrade), no progress is made in a round and cleanup_nydus_snapshots returns false. Both install_nydus_snapshotter and uninstall_nydus_snapshotter gate the fs::remove_dir_all on that return value: - true → proceed as before: stop service, wipe data dir. - false → stop service, skip data dir removal, log a warning. The new nydus instance starts on the existing backend state; running containers are left intact. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-03-24 16:44:25 +01:00
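The round-by-round removal described above can be modelled in memory as follows. This is a simulation only: it does not call `ctr`, and `chain`, `remove_in_rounds`, and the map-based representation of snapshots are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Builds a linear chain: each snapshot is parented on the previous one.
fn chain(names: &[&str]) -> HashMap<String, Option<String>> {
    let mut parents = HashMap::new();
    let mut previous: Option<String> = None;
    for name in names {
        parents.insert(name.to_string(), previous.clone());
        previous = Some(name.to_string());
    }
    parents
}

/// Removes leaf snapshots round by round. Returns (fully_cleaned, rounds).
/// Stops when nothing is left, or when a round removes nothing because the
/// remaining leaves are in use by running workloads.
fn remove_in_rounds(
    mut parents: HashMap<String, Option<String>>,
    in_use: &HashSet<String>,
) -> (bool, usize) {
    let mut rounds = 0;
    loop {
        if parents.is_empty() {
            return (true, rounds);
        }
        // Every name that is still the parent of some remaining snapshot.
        let has_child: HashSet<String> = parents.values().flatten().cloned().collect();
        // Leaves are snapshots without children; in-use leaves are skipped.
        let removable: Vec<String> = parents
            .keys()
            .filter(|name| !has_child.contains(*name) && !in_use.contains(*name))
            .cloned()
            .collect();
        if removable.is_empty() {
            // No progress: the caller must not wipe the data directory.
            return (false, rounds);
        }
        for name in &removable {
            parents.remove(name);
        }
        rounds += 1;
    }
}
```

The boolean plays the role of the return value that gates data directory removal in the commit message: only a `true` result would allow the wipe.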
Fupan Li	6a832dd1f3	kernel: Bump the kernel to v6.18.15 for dragonball Bump the dragonball supported kernel to v6.18.15. Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>	2026-03-24 06:46:43 +08:00
Fabiano Fidêncio	1ec97d25e7	Merge pull request #12704 from stevenhorsman/security-fixes-23-mar-26 Security fixes 23 mar 26	2026-03-23 15:27:07 +01:00
Fabiano Fidêncio	514a2b1a7c	Merge pull request #12264 from fidencio/topic/nvidia-gpu-cc-use-nydus-snapshotter nvidia: cc: Use nydus-snapshotter	2026-03-23 12:50:15 +01:00
dependabot[bot]	8df9cf35df	build(deps): bump rustls-webpki in /tools/packaging/kata-deploy/binary Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.8 to 0.103.10. - [Release notes](https://github.com/rustls/webpki/releases) - [Commits](https://github.com/rustls/webpki/compare/v/0.103.8...v/0.103.10) --- updated-dependencies: - dependency-name: rustls-webpki dependency-version: 0.103.10 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>	2026-03-23 10:34:27 +00:00
Alex Lyn	d2c2ec6e23	Merge pull request #12633 from LandonTClipp/docs_materialx docs: Move to mkdocs-material, port Helm to docs site	2026-03-23 09:29:25 +08:00
Fabiano Fidêncio	6194510e90	nvidia: cc: Use nydus-snapshotter We've been using `experimental_force_guest_pull`, but now that we have a containerd release that should work more reliably with the multi snapshotter setup, we want to give it a try. Note: We need containerd 2.2.2+. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-22 10:10:34 +01:00
Agam Dua	7e3fd74779	kernel: bump config version With debug/ebpf updates in place, let's bump the kata config version. Signed-off-by: Agam Dua <agam_dua@apple.com> Co-authored-by: Eric Ernst <eric_ernst@apple.com>	2026-03-20 15:04:15 -07:00
Agam Dua	f6319da73d	tests: Add eBPF and dwarves to spell check dictionary Add missing terms to the spell check dictionary to fix CI failures for kernel debug documentation: - eBPF - dwarves: Linux package with DWARF/BTF tools (pahole) required for CONFIG_DEBUG_INFO_BTF kernel option Also fix the casing of "ebpf" to "eBPF" in the kernel README to match the official naming convention. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 15:04:08 -07:00
Agam Dua	91d6c39f06	kernel: Fix debug build and add debug symbols to installation Fixed a bug in the debug kernel build where common/ was repeated after the common path variable, so the debug confs were never picked up. Fixing that exposed a second bug where the debug conf was included in other builds; this is also fixed by creating a separate directory for debug confs, which currently holds a single file, debug.conf, containing debug configurations and bpf-specific configs. To enable kernel builds (specifically for bpf), the dwarves package, which provides pahole, was added to the kernel dockerfile. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
Agam Dua	5ab0744c25	ci: Add pipeline for building and distributing the debug kernel Add the debug kernel to the kata tarball alongside the other kernels. Also update the kernel README documentation to describe the new debug kernel build process. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
Agam Dua	e905b74267	kernel: Add eBPF configs for debug builds Adds a BPF section in the debug.conf kernel configuration options to enable eBPF and BTF support for debug kernel builds. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
LandonTClipp	795869152d	docs: Move to mkdocs-material, port Helm to docs site This supersedes https://github.com/kata-containers/kata-containers/pull/12622. I replaced Zensical with mkdocs-materialx. Materialx is a fork of mkdocs-material created after mkdocs-material was put into maintenance mode. We'll use this platform until Zensical is more feature complete. Added a few of the existing docs into the site to make a more user-friendly flow. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2026-03-20 14:51:39 -05:00
stevenhorsman	44ec815f77	gatekeeper: Add check-spelling to required Add the new job to the list of static checks that runs on doc updates. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-19 10:22:54 +00:00
Aurélien Bombo	352b4cdad2	Merge pull request #12660 from LandonTClipp/ci_docs ci: Don't run CI builds on doc PRs	2026-03-17 12:19:11 -05:00
Manuel Huber	660e3bb653	gpu: Obsolete the NVIDIA initrd build As the NVIDIA stack has shifted to using an image for both the confidential and non-confidential variants, we retire the initrd build. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 21:29:58 -04:00
Aurélien Bombo	f8e234c6f9	Merge pull request #12650 from kata-containers/sprt/remove-csi ci: Stop building/deploying CSI driver	2026-03-16 16:53:02 -05:00
Manuel Huber	5210584f95	release: Bump version to 3.28.0 Bump VERSION and helm-charts versions. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:52:35 -07:00
Manuel Huber	a9b222f91e	gpu: Update chiseled rootfs with new CDH deps With CDH requiring libcryptsetup, mkfs.ext4, dd, and their dependencies, we will need to update the chiseled NVIDIA rootfs accordingly. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:43:17 -07:00
Manuel Huber	169f92ff09	agent: cdh: Update CDH and API With the new CDH version, the secure_mount API changes. Further, the new CDH version no longer uses the luks-encrypt-storage script but utilizes libcryptsetup as well as mkfs.ext4 and dd. Hence, adapt some of the CDH and Kata components build steps Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-16 09:43:17 -07:00
Alex Lyn	ef5db0a01f	Merge pull request #12607 from zvonkok/system-map kernel: Ship System.map as part of the kernel build	2026-03-16 09:37:44 +08:00

2184 Commits