kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-02 07:02:16 +00:00

Author	SHA1	Message	Date
Pradipta Banerjee	1487eaaaa2	kernel: Enable landlock LSM Allows using landlock LSM for the container process Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>	2026-05-27 13:33:46 +02:00
Fabiano Fidêncio	25491fc20c	Merge pull request #13104 from kata-containers/topic/kata-deploy-build-as-an-artefact kata-deploy: prebuild payload-specific component artifacts	2026-05-25 22:56:55 +02:00
Fabiano Fidêncio	c65d64873b	kata-deploy: prebuild payload-specific component artifacts Build and publish the kata-deploy binary and CoCo guest-pull nydus snapshotter as dedicated per-arch artifacts, then consume those tarballs when assembling the kata-deploy image. This avoids rebuilding those components in the payload image (which would happen in serial) path and reduces overall CI build time. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-25 22:13:41 +02:00
Fabiano Fidêncio	3dc02a8604	Merge pull request #13085 from Apokleos/erofs-gpt-vmdk-only runtime-rs: Support erofs snapshotter with gpt vmdk mode	2026-05-25 16:29:59 +02:00
Zvonko Kaiser	6c6c5809f1	Merge pull request #13109 from fidencio/topic/build-validate-measured-rootfs-root-hashes-for-all-shims build: Validate measured-rootfs root hashes all shims	2026-05-25 15:58:35 +02:00
Zvonko Kaiser	aeadb1af35	Merge pull request #12948 from fidencio/topic/numa runtime (go): agent: Add NUMA support for QEMU	2026-05-25 15:33:14 +02:00
Alex Lyn	a359d13476	build: Validate measured-rootfs root hashes all shims The cached shim-v2 tarballs ship per-variant `root_hash_.txt` files embedded in the matching measured-rootfs image. Until now only shim-v2-rust validated those hashes against the freshly built rootfs images on a cache hit; shim-v2-go reused whatever was cached without checking, even though its bundled configuration files contain the `KERNELVERITYPARAMS_` values baked in at build time. When a PR changes the agent (and therefore the rootfs image and its dm-verity hash) but does not touch `src/runtime`, the shim-v2-go cache key stays the same and the stale tarball is reused. The resulting guest cmdline carries a verity hash that no longer matches the new rootfs image, so the VM panics very early in boot: device-mapper: verity: 254:1: metadata block 0 is corrupted erofs (device dm-0): cannot read erofs superblock Kernel panic - not syncing: VFS: Unable to mount root fs ... Generalize the shim-v2-rust cache validation so it also runs for shim-v2-go, push the per-variant root-hash sidecar files for both shims, and fall back to a full rebuild whenever the cached hash is missing or differs from the image one. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:12:52 +08:00
Alex Lyn	fd139a1143	kata-deploy: Reset max_unmerged_layers to "0" within erofs snapshotter we should set max_unmerged_layers = 0 for erofs snapshotter gpt-vmdk mode. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Fabiano Fidêncio	72be31c384	build: Validate measured-rootfs root hashes all shims The cached shim-v2 tarballs ship per-variant `root_hash_.txt` files embedded in the matching measured-rootfs image. Until now only shim-v2-rust validated those hashes against the freshly built rootfs images on a cache hit; shim-v2-go reused whatever was cached without checking, even though its bundled configuration files contain the `KERNELVERITYPARAMS_` values baked in at build time. When a PR changes the agent (and therefore the rootfs image and its dm-verity hash) but does not touch `src/runtime`, the shim-v2-go cache key stays the same and the stale tarball is reused. The resulting guest cmdline carries a verity hash that no longer matches the new rootfs image, so the VM panics very early in boot: device-mapper: verity: 254:1: metadata block 0 is corrupted erofs (device dm-0): cannot read erofs superblock Kernel panic - not syncing: VFS: Unable to mount root fs ... Generalize the shim-v2-rust cache validation so it also runs for shim-v2-go, push the per-variant root-hash sidecar files for both shims, and fall back to a full rebuild whenever the cached hash is missing or differs from the image one. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-25 11:04:08 +02:00
Fabiano Fidêncio	7ddea26137	Merge pull request #13086 from fvichot/flo-kata-monitor-fix kata-monitor: use full URI for connecting to containerd	2026-05-25 10:16:11 +02:00
Fabiano Fidêncio	407a6946f2	Merge pull request #13077 from hdp617/fix-kata-deploy-build packaging: fix parallel kernel build race and kata-deploy script bugs	2026-05-25 09:53:38 +02:00
Fabiano Fidêncio	8d2ecaabb5	versions: Bump QEMU to v11.0.0 For more details see QEMU's release notes: https://www.qemu.org/2026/04/22/qemu-11-0-0/ GPU experimental variants are also using v11.0.0 plus one patch to solve issues related to NUMA mapping. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Florian Vichot	554e8f91b1	kata-monitor: use full URI for connecting to containerd Without the protocol in the URI, grpc-go defaults to the DNS resolver, which results in an error for unix sockets (`name resolver error: produced zero addresses`). We also remove the `getAddressAndDialer(...)` and `dial(...)` functions, as they are no longer necessary, grpc-go supports connecting to unix sockets directly. This also removes the matching tests. This also adds a `Makefile` and tweaks the Dockerfile to simplify building the Docker image. Fixes #12398 Signed-off-by: Florian Vichot <florian.vichot@gmail.com>	2026-05-23 16:47:46 +02:00
Huy Pham	3ec444a7df	kernel: bump config version Bump the Kata Containers kernel configuration version to 195. Signed-off-by: Huy Pham <huypham@google.com>	2026-05-22 12:26:53 -07:00
Huy Pham	c490373a78	kata-deploy: packaging: fix absolute path resolution in merge script The `kata-deploy-merge-builds.sh` script blindly prepended `PWD` to the `kata_versions_yaml_file` argument, assuming it was always a relative path. However, the `Makefile` passes an absolute path using `$(MK_DIR)`. This resulted in invalid double-concatenated paths like `/workspace/...//workspace/...` which failed to copy. Fix this by using `readlink -f` to safely resolve the path. This correctly handles both relative and absolute paths, preventing path corruption. Signed-off-by: Huy Pham <huypham@google.com>	2026-05-22 12:05:56 -07:00
Fabiano Fidêncio	5d3e1e6396	kata-deploy: verify kata-runtime label remains stable on rke2/k3s The retry loop added in `efd468df3f` still allows the install to declare success while inside the kubelet's post-restart re-register window. On rke2/k3s, `systemctl restart rke2-agent` restarts both containerd and the kubelet, but `wait_till_node_is_ready` polls `.status.conditions[Ready]` every 2 s and returns on the first `True` observation it sees. By default the kubelet only publishes node status every ~10 s, so that first `True` is almost always the stale value from before the restart — the kubelet hasn't actually finished restarting yet. `label_node_with_retry` then applies the label, sleeps 1 s, reads back "true" (still stale, kubelet still down), and returns Ok. Install completes, `/readyz` flips to 200, helm releases its `--wait`, and the bats test starts — and only then does the kubelet finish coming up, re-register the node, and clobber the label with its cached set. The lifecycle test sees an empty `katacontainers.io/kata-runtime` and fails: # Node label katacontainers.io/kata-runtime: not ok 1 Kata artifacts are present on host after install A single-shot verification can't distinguish "still stale true" from "truly stable true after kubelet re-register". Replace it with a stability window: after (re)applying the label, require it to remain at the expected value for STABILITY_CHECKS=6 consecutive observations spaced CHECK_INTERVAL=2 s apart (≈ 12 s — comfortably more than the kubelet's status-update period). If the value ever drifts inside the window, re-apply and restart the stability counter. Bounded by MAX_APPLY_ATTEMPTS=12, so worst case is ~3 min; happy path adds ~12 s to install. Also add a short polling loop to the test's own label assertion as belt-and-suspenders for any leftover transient race, matching the existing retry pattern used for the container-runtime version check. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-22 11:53:18 +02:00
Huy Pham	ee4f756b75	kata-deploy: packaging: fix buggy return statements in cache check The `install_cached_tarball_component` function in the binaries packaging script contained syntax errors where it attempted to capture the empty stdout of the `cleanup_and_fail` function inside a return statement (e.g., `return "$(cleanup_and_fail ...)"`). Since `cleanup_and_fail` only returns an exit status and produces no stdout, this evaluated to `return ""`, which is invalid in bash and causes the script to crash with `numeric argument required` instead of returning the failure status. Fix this by replacing the buggy inline returns with proper `if` blocks that call `cleanup_and_fail` and explicitly return `1`. Signed-off-by: Huy Pham <huypham@google.com>	2026-05-21 09:21:05 -07:00
Huy Pham	9ddcc53f6f	kernel: build: resolve race condition in parallel config generation During parallel builds of different kernel variants (e.g., generic, debug, nvidia-gpu), the config generation script wrote to a shared static path: `tools/packaging/kernel/configs/fragments/x86_64/.config`. This caused critical race conditions where concurrent processes would overwrite or delete the `.config` file while another process was reading it, leading to sporadic build failures with "No such file or directory" errors. Resolve this by changing the temporary configuration path to be build-specific, writing it inside the unique kernel build directory (e.g., `kata-linux-.../.config.generated`). The final config is still copied to `.config` in the kernel source tree as before, but the intermediate merge process is now isolated. Signed-off-by: Huy Pham <huypham@google.com>	2026-05-21 09:19:45 -07:00
Fabiano Fidêncio	7536f2c616	Merge pull request #13055 from kata-containers/topic/kata-deploy-only-install-what-will-be-used kata-deploy: only install what will actually be used	2026-05-21 17:53:09 +02:00
Fabiano Fidêncio	efd468df3f	kata-deploy: retry node labeling after CRI restart On rke2/k3s a CRI restart also restarts the kubelet, which may briefly re-register the node with its cached label set and clobber the kata-runtime label that was just applied via the API. Replace the single label_node call with a retry loop that verifies the label value after setting it. If the label is missing or has the wrong value, it is re-applied (up to 10 attempts with 2 s back-off). This fixes a race condition that became more visible after the switch to individual tarball extraction, which made install take slightly longer and shifted the kubelet re-registration timing window. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
Fabiano Fidêncio	291e4d37be	kata-deploy: implement selective tarball extraction in installer Add zstd and tar as Rust dependencies and rewrite the artifact installation logic to extract only the component tarballs required by the enabled runtime classes. extract_component_tarballs reads shim-components.json to determine which kata-static-<name>.tar.zst files are needed for the selected shims and current architecture. Shared components (e.g. kernel, shim-v2-go) are listed by multiple shims and must only be unpacked once per install run. Deduplication is handled with an in-memory set passed through the call, avoiding any risk of stale on-disk state surviving across pod restarts. Within each tarball, opt/kata path prefixes are stripped and absolute symlink / hard-link targets are rewritten to point at the resolved installation directory, correctly handling MULTI_INSTALL_SUFFIX. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
Fabiano Fidêncio	9a0acc6c4c	kata-deploy: ship individual component tarballs; drop merged tarball Update the Dockerfile to copy each kata-static-<name>.tar.zst directly into the image alongside shim-components.json, replacing the old artifact-extractor stage that unpacked a single merged tarball. Update the publish-kata-deploy-payload and release CI workflows to download individual per-component artifacts instead of waiting for a merged tarball, and simplify kata-deploy-build-and-upload-payload.sh accordingly. The kata-deploy image build is no longer blocked on the merge step. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
Fabiano Fidêncio	87e55be4a3	kata-deploy: add shim-components.json component manifest Introduces the human-maintained shim-components.json that maps each runtime class to the list of kata-static-<name>.tar.zst component tarballs it needs per architecture. This is the source of truth read by the installer at deploy time to decide which tarballs to extract. Key design choices encoded here: - shim-v2-go vs shim-v2-rust: explicit per-shim, so a node running only Rust shims never extracts the Go shim binary. - virtiofsd and nydus are both listed for hypervisors that support configurable shared_fs (we cannot know which the user will choose). - fc/firecracker: no virtiofsd or nydus (devmapper only). - remote: only the shim binary (no local hypervisor artifacts). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
Fabiano Fidêncio	c87e327876	kata-deploy: split shim-v2 into shim-v2-go and shim-v2-rust Split the monolithic shim-v2 build target into separate shim-v2-go and shim-v2-rust targets in kata-deploy-binaries.sh, the local-build Makefile, and the four architecture CI workflows. The Go and Rust shims now each produce their own kata-static-<name>.tar.zst artifact, allowing downstream consumers to select only the shim variant they need. MEASURED_ROOTFS is set per-arch for the Rust job in CI. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-20 20:52:36 +02:00
stevenhorsman	3f27052184	kata-deploy: always add HEAD commit SHA tag to all builds Previously, the commit SHA tag was only added for specific components (agent, agent-ctl) by setting artefact_tag in individual install functions. This was inconsistent and error-prone. Now, the HEAD commit SHA is always added as a tag for all builds in the central tagging logic. This ensures: - All components get tagged with the commit SHA - The correct HEAD commit is used (not the last commit that modified a specific path) - Simpler, more maintainable code The git command uses `git -C` to change to the repo directory before running git log, which correctly returns the HEAD commit SHA regardless of which files were modified in recent commits. Assisted-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-20 17:42:09 +01:00
stevenhorsman	76fc847c78	release: correct .cargo/config.toml reference in generate_vendor.sh The script was creating .cargo/config.toml but referencing .cargo/config in the vendor_dir_list, causing tar to fail with 'Cannot stat' error. Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-05-19 18:23:53 +01:00
stevenhorsman	a4cfe32157	release: Bump version to 3.31.0 Bump VERSION and helm-charts versions. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-19 15:32:50 +02:00
Alex Lyn	bbef0a755c	Merge pull request #13005 from stevenhorsman/remove-osbuilder-tests osbuilder: Remove tests	2026-05-19 11:58:27 +08:00
Fabiano Fidêncio	7c971f0c4c	Merge pull request #13069 from fidencio/topic/kata-deploy-prevent-eviction helm-chart: add priorityClassName to prevent kata-deploy eviction	2026-05-18 21:08:45 +02:00
Fabiano Fidêncio	5d40ba66ff	helm-chart: add priorityClassName to prevent kata-deploy eviction kata-deploy is a per-node infrastructure DaemonSet; if it gets evicted under node memory/CPU pressure the node loses its Kata runtime until the pod is rescheduled. Default to system-node-critical so the kubelet evicts lower-priority workloads first. The value is configurable via `priorityClassName` in values.yaml. Fixes: #13068 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-18 15:14:06 +02:00
stevenhorsman	e3a00a2ec2	kata-deploy: fix binary location for agent-ctl Moving agent-ctl into the root workspace moves the target directory, so update this target to be in root, not src/tools Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-18 09:47:15 +01:00
stevenhorsman	2c1aaa8ae7	osbuilder: Remove tests The tests haven't been run at least since we moved to GHA, so in the spirit of lean and mean, let clear them up Fixes: #10957 Assisted-by IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-18 09:46:42 +01:00
Manuel Huber	275a63b266	Revert "gatekeeper: Unrequire NVIDIA GPU test" This reverts commit `edfb6f5716`. The NVIDIA non-TEE CI job has passed again over the last 5 nightly runs after merging PRs #13007 and #13020. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-15 15:20:12 -07:00
Steve Horsman	aade0f5fbe	Merge pull request #12854 from kata-containers/dependabot/go_modules/tools/testing/kata-webhook/github.com/sirupsen/logrus-1.9.4 build(deps): bump github.com/sirupsen/logrus from 1.9.3 to 1.9.4 in /tools/testing/kata-webhook	2026-05-14 13:55:44 +01:00
Manuel Huber	ed4233bf91	rootfs: cdh: Update CDH to new version Update CDH to a newer version and: - adjust the NVIDIA root filesystem build to reflect the change from using libcryptsetup to using the cryptsetup binary. - adjust image-pull test cases to conduct parallel write operations on the /dev/trusted_store backed guest image pull location since issue #12721 has been solved on CDH side. Fixes #12721 Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-13 20:20:45 +02:00
dependabot[bot]	18a13773da	build(deps): bump github.com/sirupsen/logrus Bumps [github.com/sirupsen/logrus](https://github.com/sirupsen/logrus) from 1.9.3 to 1.9.4. - [Release notes](https://github.com/sirupsen/logrus/releases) - [Changelog](https://github.com/sirupsen/logrus/blob/master/CHANGELOG.md) - [Commits](https://github.com/sirupsen/logrus/compare/v1.9.3...v1.9.4) --- updated-dependencies: - dependency-name: github.com/sirupsen/logrus dependency-version: 1.9.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2026-05-13 06:11:16 +00:00
Greg Kurz	d2dc0a923c	Merge pull request #13030 from stevenhorsman/go-1.25.10-bump Go 1.25.10 bump	2026-05-13 08:09:51 +02:00
stevenhorsman	7cc72b933d	versions: bump golang.org/x/net to v0.53.0 Bump golang.org/x/net to resolve CVE: - GO-2026-4918 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Assisted-by: IBM Bob	2026-05-12 11:56:26 +01:00
stevenhorsman	4a65aca9cf	versions: bump golang to 1.25.10 Bump the go version to resolve CVEs: - GO-2026-4918 - GO-2026-4971 - GO-2026-4976 - GO-2026-4977 - GO-2026-4980 - GO-2026-4981 - GO-2026-4982 - GO-2026-4986 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Assisted-by: IBM Bob	2026-05-12 11:56:13 +01:00
Fabiano Fidêncio	93e02944fa	image-builder/nvidia: skip DAX header for virtio-blk-pci images The DAX header (2 MiB of NVDIMM metadata + a duplicate MBR) is unconditionally prepended to every image by set_dax_header(). NVIDIA images use virtio-blk-pci with disable_image_nvdimm=true, so the kernel reads MBR #1 directly and never touches the DAX metadata -- it is dead weight. Add a SKIP_DAX_HEADER environment variable (default "no") that, when set to "yes", skips the DAX header entirely: - Removes the 2 MiB DAX overhead from image size calculations in both the erofs and ext4 paths - Skips the set_dax_header() call, avoiding compilation and execution of the nsdax tool - Passes the variable through to containerised builds Enable SKIP_DAX_HEADER=yes for both install_image_nvidia_gpu() and install_image_nvidia_gpu_confidential() in the build pipeline. All other image builds are unaffected (default remains "no"). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 17:18:05 +02:00
Fabiano Fidêncio	b72bb7243e	image-builder: bump base image from Fedora 42 to 44 Fedora 42 reaches end-of-life in May 2026. Move the image-builder container to Fedora 44, which is the current stable release. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 17:18:05 +02:00
Fabiano Fidêncio	6b802a4e30	nvidia: switch GPU rootfs images to erofs Switch the NVIDIA GPU rootfs images (both standard and confidential) from ext4 to erofs (Enhanced Read-Only File System). Unlike ext4, which is a read-write filesystem mounted read-only by convention, erofs is structurally read-only -- no journal, no write metadata, no superblock write path. This eliminates accidental mutation and reduces the attack surface inside the guest VM, which is particularly important for confidential workloads using dm-verity. Introduce a DEFROOTFSTYPE_NV Makefile variable (set to erofs) for both Go and Rust runtimes, keeping the global DEFROOTFSTYPE as ext4 so non-NVIDIA configurations are unaffected. Update all six NVIDIA GPU configuration templates (base, SNP, TDX for both runtimes) to use @DEFROOTFSTYPE_NV@ instead of the global @DEFROOTFSTYPE@. Export FS_TYPE=erofs in install_image_nvidia_gpu() and install_image_nvidia_gpu_confidential() so the build pipeline produces erofs images via the image builder. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 17:18:05 +02:00
Fabiano Fidêncio	bfcd249f40	image-builder: add erofs dm-verity support and lz4hc compression Add full dm-verity and measured rootfs support to create_erofs_rootfs_image(), bringing it to parity with the ext4 path. Unlike ext4, which is a read-write filesystem mounted read-only by convention, erofs is structurally read-only -- no journal, no write metadata, no superblock write path. This is a natural fit for dm-verity: erofs never attempts writes, so verity never has to reject anything. With ext4, the kernel must skip journal replay on verity-protected devices, which is a fragile assumption. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 17:18:05 +02:00
Fabiano Fidêncio	d2e0555cf0	image-builder: refactor dm-verity setup into shared functions Extract build_kernel_verity_params() and setup_verity() from the inline block inside create_rootfs_image() into top-level functions. This is a pure refactoring with no behavior change. The verity logic is moved verbatim, with the only difference being that build_kernel_verity_params() now takes the image path as an explicit parameter instead of capturing it from the enclosing scope. The extracted functions will be reused by create_erofs_rootfs_image() in a subsequent commit to add dm-verity support for erofs images. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 17:18:05 +02:00
Fabiano Fidêncio	341a0d366c	kata-deploy: Fix containerd debug level path for config schema v4 Containerd 2.3 (config schema v4) uses the top-level [debug] table for log level configuration, not plugins."io.containerd.server.v1.debug" as was the case in the RC builds. Update containerd_debug_level_toml_path() to use .debug.level for all schema versions, matching the released containerd behavior. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-10 12:02:24 +02:00
stevenhorsman	87664c608d	version: Bump to latest 6.18 kernel Pick up the latest kernel that fixes CVE-2026-43284 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-08 17:15:24 +01:00
Fabiano Fidêncio	09bbc70302	Merge pull request #13002 from manuelh-dev/mahuber/unrequire-nim-svc gatekeeper: Unrequire NVIDIA GPU test (temporary)	2026-05-08 10:02:00 +02:00
Manuel Huber	edfb6f5716	gatekeeper: Unrequire NVIDIA GPU test (temporary) Temporarily unrequire the NVIDIA GPU test. We are experiencing situations in which two NIM service instances get deployed almost at the same time into the kata-containers-k8s-tests namespace (expected current context) and into the default namespace. This causes the NIM operator to create two deployments in the two namespaces and to then schedule two pods at the same time. This usually causes the NIM pod in the default namespace to fail and to linger. We can't explain yet why this does not happen in the TEE CI path and why this is happening at all. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-07 14:39:24 +02:00
Fabiano Fidêncio	0f3160276b	ci: k8s: skip no-op Helm uninstall on free runners In cleanup_kata_deploy, bail out early when no kata-deploy Helm release exists so baremetal-* pre-deploy cleanup on fresh clusters does not block on helm uninstall --wait (up to 10m). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00
Fabiano Fidêncio	f5533950e6	kata-deploy: helm: cap container RSS via resources block Plumb a resources block into the kata-deploy DaemonSet container in the Helm chart so the cluster can size its memory footprint predictably. Defaults are sized from real /proc/<pid>/status numbers on an unpatched 3.30.0 build running on a ~220-vCPU GPU node: VmRSS: 9944 kB (~9.7 MiB) <- actual physical memory RssAnon: 2628 kB (~2.6 MiB) <- heap + dirty stack pages VmData: 464668 kB (~454 MiB) <- tokio multi-thread workers' reserved-but-untouched stacks Threads: 225 <- num_cpus()-driven worker pool That VmData number is the source of the original "kata-deploy is using 400 MB" reports: any monitoring layer that surfaces virtual data size, committed memory, or memory.usage_in_bytes on a kernel that includes mapped-but-untouched memory will happily reproduce ~400 MB even though only ~10 MiB is ever made resident. The earlier commits in this series (current_thread tokio, mimalloc, shared kube client, JSONPath removal, post-install re-exec) collapse VmData into the tens of MiB and drop the post-install resident set further. The defaults below are picked accordingly: requests: cpu: 25m # install is mostly I/O wait; the post-install # waiter is genuinely idle memory: 16Mi # ~2x headroom over the unpatched VmRSS we # measured, far more over the patched waiter Operators who hit OOMKilled on unusually large or churny clusters can override `resources` directly in their Helm values (or set it to {} to remove all requests and inherit cluster defaults). Fixes: https://github.com/kata-containers/kata-containers/discussions/12976 Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-07 13:40:55 +02:00

2365 Commits
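
The stability window described in commit 5d3e1e6396 can be sketched roughly as below. This is an illustrative Python sketch, not the project's code: the function name `label_with_stability_window` and the `read_label`, `apply_label`, and `sleep` callbacks are hypothetical stand-ins for the real node-label API calls. Only the three constants (6 checks, 2 s interval, 12 apply attempts) come from the commit message.

```python
import time
from typing import Callable, Optional


def label_with_stability_window(
    read_label: Callable[[], Optional[str]],   # hypothetical: reads the node label
    apply_label: Callable[[], None],           # hypothetical: (re)applies the label
    expected: str = "true",
    stability_checks: int = 6,                 # STABILITY_CHECKS from the commit
    check_interval: float = 2.0,               # CHECK_INTERVAL (seconds) from the commit
    max_apply_attempts: int = 12,              # MAX_APPLY_ATTEMPTS from the commit
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the label held `expected` for `stability_checks`
    consecutive reads; return False if the apply attempts are exhausted."""
    for _ in range(max_apply_attempts):
        apply_label()
        stable = 0
        while stable < stability_checks:
            sleep(check_interval)
            if read_label() != expected:
                # The value drifted inside the window: re-apply the label
                # and restart the stability counter from zero.
                break
            stable += 1
        else:
            # The window completed without any drift.
            return True
    return False
```

Passing `sleep` as a parameter is a choice made for this sketch so the loop can be exercised without real delays. A single read that matches is never enough to return success, which is the property the commit message asks for.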