kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-04-02 18:13:57 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	fd583d833b	kata-deploy: nydus: clean containerd metadata before wiping backend

When /var/lib/nydus-snapshotter is removed, containerd's BoltDB (meta.db at /var/lib/containerd/) still holds snapshot records for the nydus snapshotter. On the next install these stale records cause image pulls to fail with:

"unable to prepare extraction snapshot: target snapshot \"sha256:...\": already exists"

The failure path in core/unpack/unpacker.go:

1. sn.Prepare() → metadata layer finds the target chainID in BoltDB → returns AlreadyExists without touching the nydus backend.
2. sn.Stat() → metadata layer finds the BoltDB record, then calls s.Snapshotter.Stat(bkey) on the nydus gRPC backend → NotFound (backend was wiped).
3. The unpacker treats NotFound as a transient key-collision race and retries 3 times; all 3 attempts hit the same dead end, and the pull is aborted.

The commit message of `62ad0814c` ("nydus: Always start from a clean state") assumed "containerd will re-pull/re-unpack when it finds non-existent snapshots", but that is not what happens: the metadata layer intercepts the Prepare call in BoltDB before the backend is ever consulted.

Fix: call cleanup_containerd_nydus_snapshots() before stopping the nydus service (and thus before wiping its data directory) in both install_nydus_snapshotter and uninstall_nydus_snapshotter. The cleanup must run while the service is still up because ctr snapshots rm goes through the metadata layer which calls the nydus gRPC backend to physically remove the snapshot; if the service is already stopped the backend call fails and the BoltDB record remains.

The cleanup:

- Discovers all containerd namespaces via `ctr namespaces ls -q` (falls back to k8s.io if that fails).
- Removes containers whose Snapshotter field matches the nydus plugin name; these become dangling references once snapshots are gone and can confuse container reconciliation after an aborted CI run.
- Removes snapshots round by round (leaf-first) until either the list is empty or no progress can be made (see below).

Note: containerd's GC cannot substitute for this explicit cleanup. The image record (a GC root) references content blobs which reference the snapshots via gc.ref labels, keeping the entire chain alive in the GC graph even after the nydus backend is wiped.

Snapshot removal rounds
-----------------------

Snapshot chains are linear: an image with N layers produces a chain of N snapshots, each parented on the previous. Only the current leaf can be removed each round, so N layers require exactly N rounds. There is no fixed round cap: the loop terminates when either the list reaches zero (success) or a round removes nothing at all (all remaining snapshots are actively in use by running workloads).

Active workload safety
----------------------

If active workloads still hold nydus snapshots (e.g. during a live upgrade), no progress is made in a round and cleanup_nydus_snapshots returns false. Both install_nydus_snapshotter and uninstall_nydus_snapshotter gate the fs::remove_dir_all on that return value:

- true → proceed as before: stop service, wipe data dir.
- false → stop service, skip data dir removal, log a warning. The new nydus instance starts on the existing backend state; running containers are left intact.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-03-24 16:44:25 +01:00
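The round-based, leaf-first removal described in the commit above can be sketched roughly as follows. This is an illustrative model only, not the kata-deploy code: the `Snapshots` map, `linear_chain`, and `cleanup_rounds` are hypothetical names, the snapshot store is an in-memory map rather than containerd's metadata, and the leaves are computed once at the start of each round as a simplification.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory stand-in for the snapshot metadata:
/// snapshot name -> parent name (None for the base layer).
type Snapshots = HashMap<String, Option<String>>;

/// Builds a linear chain layer0 <- layer1 <- ... <- layer(n-1).
fn linear_chain(n: usize) -> Snapshots {
    (0..n)
        .map(|i| {
            let parent = if i == 0 {
                None
            } else {
                Some(format!("layer{}", i - 1))
            };
            (format!("layer{}", i), parent)
        })
        .collect()
}

/// Removes snapshots leaf-first, one round at a time, with no round cap.
/// Returns (clean, rounds): `clean` is true when the map ended up empty,
/// false when a round could not remove anything (remaining snapshots are
/// in use); `rounds` counts the rounds that removed at least one snapshot.
fn cleanup_rounds(snaps: &mut Snapshots, in_use: &HashSet<String>) -> (bool, usize) {
    let mut rounds = 0;
    loop {
        if snaps.is_empty() {
            return (true, rounds);
        }
        // Every name that is still the parent of some remaining snapshot.
        let parents: HashSet<String> = snaps.values().flatten().cloned().collect();
        // Leaves (nobody's parent) that no running workload holds.
        let removable: Vec<String> = snaps
            .keys()
            .filter(|name| !parents.contains(*name) && !in_use.contains(*name))
            .cloned()
            .collect();
        if removable.is_empty() {
            // No progress: the caller should skip wiping the data directory.
            return (false, rounds);
        }
        for name in &removable {
            snaps.remove(name);
        }
        rounds += 1;
    }
}
```

In this model a caller would gate the data-directory wipe on the first element of the returned tuple, mirroring the true/false handling in the commit message: wipe only when `clean` is true, otherwise log a warning and leave the backend state in place.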
Fabiano Fidêncio	eb4ce0e98b	Merge pull request #12676 from manuelh-dev/mahuber/gpu-ci-data-storage tests: gpu: use container data storage feature	2026-03-24 09:59:13 +01:00
Manuel Huber	79efe3e041	tests: gpu: use container data storage feature Use the container data storage feature for the k8s-nvidia-nim.bats test pod manifests. This reduces the pods' memory requirements. For this, enable the block-encrypted emptydir_mode for the NVIDIA GPU TEE handlers. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-23 11:43:11 -07:00
Steve Horsman	2728b493d5	Merge pull request #12681 from manuelh-dev/mahuber/ci-pip-py-venv tests: cc: setup function for python venv	2026-03-23 14:33:30 +00:00
Fabiano Fidêncio	1ec97d25e7	Merge pull request #12704 from stevenhorsman/security-fixes-23-mar-26 Security fixes 23 mar 26	2026-03-23 15:27:07 +01:00
Fabiano Fidêncio	aa6890eae1	Merge pull request #12675 from manuelh-dev/mahuber/cdh-storage-options agent: add mkfs_opts parameter to cdh_secure_mount	2026-03-23 15:18:38 +01:00
Fabiano Fidêncio	fe817bb47b	Merge pull request #12705 from fidencio/topic/tests-nginx-connectibity-2nd-try tests: nginx-connectivity: Use `-O index.html` to override the downloaded file	2026-03-23 13:08:51 +01:00
Fabiano Fidêncio	514a2b1a7c	Merge pull request #12264 from fidencio/topic/nvidia-gpu-cc-use-nydus-snapshotter nvidia: cc: Use nydus-snapshotter	2026-03-23 12:50:15 +01:00
stevenhorsman	2edb588ed9	kata-ctl: Pin micro_http The micro_http crate was just pointing to the main branch and hadn't been updated for around 3 years, so pin to the latest for stability and update to remediate RUSTSEC-2024-0002 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-23 10:34:28 +00:00
stevenhorsman	9871256771	versions: Bump cloud-hypervisor to v51 In v51 the license was added, so try bumping to this version to solve the cargo deny issue Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-23 10:34:28 +00:00
dependabot[bot]	8de7f29981	agent-ctl: Bump aws-lc-rs to 1.16.2 Bump aws-lc-rs, so that aws-lc-sys updates to 0.39.0 to remediate RUSTSEC-2026-0044 and https://osv.dev/vulnerability/RUSTSEC-2026-0048 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-23 10:34:28 +00:00
dependabot[bot]	1c63738b80	build(deps): bump aws-lc-fips-sys in /src/tools/agent-ctl Bumps [aws-lc-fips-sys](https://github.com/aws/aws-lc-rs) from 0.13.12 to 0.13.13. - [Release notes](https://github.com/aws/aws-lc-rs/releases) - [Commits](https://github.com/aws/aws-lc-rs/compare/aws-lc-fips-sys/v0.13.12...aws-lc-fips-sys/v0.13.13) --- updated-dependencies: - dependency-name: aws-lc-fips-sys dependency-version: 0.13.13 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>	2026-03-23 10:34:28 +00:00
dependabot[bot]	6e79a9d6ad	build(deps): bump rustls-webpki in /src/tools/agent-ctl Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.3 to 0.103.10. - [Release notes](https://github.com/rustls/webpki/releases) - [Commits](https://github.com/rustls/webpki/compare/v/0.103.3...v/0.103.10) --- updated-dependencies: - dependency-name: rustls-webpki dependency-version: 0.103.10 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>	2026-03-23 10:34:27 +00:00
dependabot[bot]	8df9cf35df	build(deps): bump rustls-webpki in /tools/packaging/kata-deploy/binary Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.8 to 0.103.10. - [Release notes](https://github.com/rustls/webpki/releases) - [Commits](https://github.com/rustls/webpki/compare/v/0.103.8...v/0.103.10) --- updated-dependencies: - dependency-name: rustls-webpki dependency-version: 0.103.10 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com>	2026-03-23 10:34:27 +00:00
dependabot[bot]	ef32923461	build(deps): bump tar from 0.4.44 to 0.4.45 Bumps [tar](https://github.com/alexcrichton/tar-rs) from 0.4.44 to 0.4.45. - [Commits](https://github.com/alexcrichton/tar-rs/compare/0.4.44...0.4.45) --- updated-dependencies: - dependency-name: tar dependency-version: 0.4.45 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2026-03-23 10:34:27 +00:00
stevenhorsman	85e17c2e77	deps: Bump rustls-webpki Bump rustls-webpki to 0.103.10 to remediate RUSTSEC-2026-0049 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-23 10:34:27 +00:00
stevenhorsman	c3868f8e60	deps: Bump aws-lc-rs to 1.16.2 Bump aws-lc-rs, so that aws-lc-sys updates to 0.39.0 to remediate RUSTSEC-2026-0044 and https://osv.dev/vulnerability/RUSTSEC-2026-0048 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-23 10:34:27 +00:00
stevenhorsman	27417d9d15	ci: Add more crates to dependabot groups Add aws-lc, and rustls-webpki, so that in future the different component bumps are all done together Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-23 10:34:27 +00:00
Fabiano Fidêncio	83f37f4beb	tests: nginx-connectivity: Override index.html (2nd try) We need to explicitly pass `-O index.html` as busybox's wget behaves differently from GNU wget. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-23 11:11:44 +01:00
Fabiano Fidêncio	e44dfccf7a	Revert "tests: nginx-connectivity: Allow overriding the downloded file" This reverts commit `4403289123`. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-23 11:06:23 +01:00
Hyounggyu Choi	1035504492	Merge pull request #12701 from fidencio/topic/tests-arm-nginx-connectivity tests: nginx-connectivity: Allow overriding the downloded file	2026-03-23 10:37:25 +01:00
Steve Horsman	20cb65b1fb	Merge pull request #12624 from lifupan/bump_rust_vmms runtime-rs: Bump rust vmms for dragonball	2026-03-23 08:56:47 +00:00
Fabiano Fidêncio	864f181faf	Merge pull request #12694 from manuelh-dev/mahuber/nv-test-timeout tests: nvidia: Increase run test timeout	2026-03-23 09:13:20 +01:00
Fabiano Fidêncio	642b5661ff	Merge pull request #12651 from manuelh-dev/mahuber/doc-update-nvidia-gpu-op docs: Update NVIDIA GPU passthrough QEMU scenario	2026-03-23 09:01:02 +01:00
Fabiano Fidêncio	4403289123	tests: nginx-connectivity: Allow overriding the downloded file In case a wget fails for one reason or another, it'll leave behind an 'index.html' file. Let's make sure we allow overriding that file so the retry loop doesn't fail for no reason. Fixes: #12670 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-23 04:08:24 +01:00
Alex Lyn	d2c2ec6e23	Merge pull request #12633 from LandonTClipp/docs_materialx docs: Move to mkdocs-material, port Helm to docs site	2026-03-23 09:29:25 +08:00
Fupan Li	608f378bff	dragonball: make sure the nydus's worker thread access network Dragonball's vmm thread has joined the pod's netns, which cannot access the network, so make sure the nydus worker thread joins the netns of runD's main thread, which can access the network. Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>	2026-03-22 22:44:24 +08:00
Fabiano Fidêncio	f14895bdc4	Merge pull request #12673 from manuelh-dev/mahuber/release-doc-update docs: Update release process notes	2026-03-22 13:11:03 +01:00
Fabiano Fidêncio	fd716c017d	Merge pull request #12567 from agamdua/ebpf-confs kernel: Add debug kernel with eBPF configs to static tarball builds	2026-03-22 13:06:54 +01:00
Fabiano Fidêncio	740d380b8e	tests: nvidia: cc: Use nydus-snapshotter So we can test what we just changed in the config files. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-22 10:10:34 +01:00
Fabiano Fidêncio	6194510e90	nvidia: cc: Use nydus-snapshotter We've been using `experimental_force_guest_pull`, but now that we have a containerd release that should work more reliably with the multi snapshotter setup, we want to give it a try. Note: We need containerd 2.2.2+. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-22 10:10:34 +01:00
Agam Dua	7e3fd74779	kernel: bump config version With debug/ebpf updates in place, let's bump the kata config version. Signed-off-by: Agam Dua <agam_dua@apple.com> Co-authored-by: Eric Ernst <eric_ernst@apple.com>	2026-03-20 15:04:15 -07:00
Agam Dua	f6319da73d	tests: Add eBPF and dwarves to spell check dictionary Add missing terms to the spell check dictionary to fix CI failures for kernel debug documentation: - eBPF - dwarves: Linux package with DWARF/BTF tools (pahole) required for CONFIG_DEBUG_INFO_BTF kernel option Also fix the casing of "ebpf" to "eBPF" in the kernel README to match the official naming convention. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 15:04:08 -07:00
Agam Dua	91d6c39f06	kernel: Fix debug build and add debug symbols to installation Fixed a bug with the debug kernel build where common/ was repeated after the common path variable, resulting in the debug confs never being picked up. This exposed a subsequent bug where the debug conf was included in other builds; this is also fixed by creating a separate directory for debug confs, which currently holds one file, debug.conf, containing debug configurations and bpf-specific configs. To enable kernel builds (specifically for bpf), the dwarves package, which provides pahole, was added to the kernel dockerfile. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
Agam Dua	5ab0744c25	ci: Add pipeline for building and distributing the debug kernel Add the debug kernel to the kata tarball alongside the other kernels. Also update the kernel README documentation to describe the new debug kernel build process. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
Agam Dua	e905b74267	kernel: Add eBPF configs for debug builds Adds a BPF section in the debug.conf kernel configuration options to enable eBPF and BTF support for debug kernel builds. Signed-off-by: Agam Dua <agam_dua@apple.com>	2026-03-20 14:50:23 -07:00
LandonTClipp	5333e45313	docs: Fix static-checks.sh when running locally This fixes the test_dir variable in static-checks.sh so that when a --repo-path is provided, the test_dir variable uses that for the location instead of the GOPATH location. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2026-03-20 14:51:45 -05:00
LandonTClipp	795869152d	docs: Move to mkdocs-material, port Helm to docs site This supersedes https://github.com/kata-containers/kata-containers/pull/12622. I replaced Zensical with mkdocs-materialx. Materialx is a fork of mkdocs-material created after mkdocs-material was put into maintenance mode. We'll use this platform until Zensical is more feature complete. Added a few of the existing docs into the site to make a more user-friendly flow. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2026-03-20 14:51:39 -05:00
Manuel Huber	8903b12d34	tests: nvidia: Increase run test timeout Increase the timeout as a few new features and tests are going to be onboarded for the NVIDIA GPU CI. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-20 11:12:52 -07:00
Manuel Huber	476f550977	docs: Update NVIDIA GPU passthrough QEMU scenario With the upcoming GPU operator 26.3 release and recent changes to kata-containers, we adapt this documentation with notes on multi GPU passthrough, support for TDX, changed deployment instructions, and with various other minor improvements. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-20 10:53:14 -07:00
Manuel Huber	ae59cf26a0	kata-deploy: Check kata-tarball size limits For kata tarballs we eventually release to GitHub, check their size against the GitHub size limit. With this, an ongoing release process fails in 'CI | Publish Kata Containers payload' instead of only later on in the 'Release Kata Containers' action, and PR builds fail as well, avoiding this situation altogether. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-20 10:40:55 -07:00
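As a rough illustration of the size check described in the commit above, the following sketch flags tarballs at or above a limit. It is not the actual CI check: `oversized`, `ASSUMED_LIMIT_BYTES`, and the file names in the usage note are hypothetical, and the 2 GiB value is an assumption about GitHub's per-asset release limit that should be confirmed against GitHub's documentation.

```rust
/// Assumed per-asset size limit in bytes (2 GiB); an assumption for
/// illustration, not a value taken from the commit.
const ASSUMED_LIMIT_BYTES: u64 = 2 * 1024 * 1024 * 1024;

/// Returns the names of all tarballs whose size is at or above `limit`.
/// Each entry is a (file name, size in bytes) pair.
fn oversized<'a>(tarballs: &[(&'a str, u64)], limit: u64) -> Vec<&'a str> {
    tarballs
        .iter()
        .filter(|(_, size)| *size >= limit)
        .map(|(name, _)| *name)
        .collect()
}
```

A build step could call `oversized(&[("kata-static.tar.zst", size)], ASSUMED_LIMIT_BYTES)` with sizes obtained from `std::fs::metadata(path)?.len()` and fail the job when the returned list is non-empty.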
RuoqingHe	cfc1836a31	Merge pull request #12672 from stevenhorsman/agent-security-fixes agent: Bump tracing-subscriber	2026-03-20 17:37:16 +08:00
Steve Horsman	7ab6e11e10	Merge pull request #12678 from kata-containers/dependabot/go_modules/src/runtime/google.golang.org/grpc-1.79.3 build(deps): bump google.golang.org/grpc from 1.72.0 to 1.79.3 in /src/runtime	2026-03-20 08:49:35 +00:00
Steve Horsman	e475fb2116	Merge pull request #12680 from kata-containers/dependabot/go_modules/src/tools/csi-kata-directvolume/google.golang.org/grpc-1.79.3 build(deps): bump google.golang.org/grpc from 1.63.2 to 1.79.3 in /src/tools/csi-kata-directvolume	2026-03-20 08:49:27 +00:00
RuoqingHe	f62a6b6ab2	Merge pull request #12677 from stevenhorsman/cspell-action Switch to use cspell for spell checking	2026-03-20 13:27:36 +08:00
Manuel Huber	4afb55154a	docs: Update release process notes Update the Release-Process.md file with some clarifications on the release process. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-03-19 15:14:23 -07:00
stevenhorsman	38a655487f	vsock-exporter: Switch bincode for serde_json bincode is not maintained, so switch to serde_json to resolve RUSTSEC-2025-0141 Assisted-By: Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-19 10:45:17 +00:00
stevenhorsman	e1d7d5bef8	agent: Remove async-std It's a dev-dependency that doesn't seem to be used, so remove it and resolve RUSTSEC-2025-0052 Assisted-By: Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-19 10:45:17 +00:00
stevenhorsman	e4eda5e1d8	agent: Bump tracing-subscriber - Bump tracing-subscriber to 0.3.20 to resolve RUSTSEC-2025-0055 - Switch deprecated `slog_info!` for `slog::info!` Generated-By: Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-19 10:45:17 +00:00
stevenhorsman	e62df07b6a	static-checks: Delete kata-spell-check The old hunspell-based spell-check was causing contributors challenges and proving a barrier to doc updates. We've replaced it with a cspell-based solution, so clean up the old approach. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-03-19 10:22:54 +00:00

18259 Commits