Allow users to build the Kata Agent using INIT_DATA=no to disable the
detect_initdata_device() code loop and associated debug log output.
Further improvements related to Init Data are tracked in #11532.
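For instance, a minimal sketch of such a build, assuming the agent
Makefile forwards INIT_DATA to the corresponding cargo feature (the
path and invocation are illustrative):

    # Build the agent with the Init Data handling compiled out.
    make -C src/agent INIT_DATA=no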
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
In 69c4fc4e76, I mistakenly changed the nvidia-gpu podOverhead when
I should only have changed the TEE nvidia-gpu ones.
Let's move it back to its original value.
Reported-by: Joji Mekkattuparamban <jojim@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This setting is no longer needed with Ubuntu 25.10
and the newest QEMU releases for TDX provided by Ubuntu.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We've added logic to properly do the bookkeeping of the TEE keys when
using NFD **AND** creating the runtime classes. However, we also need
to account for the case where the runtime classes are created by the
helm template, and in that case we just update what helm has deployed.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Remove the nvrc.smi.srs=1 parameter from the kernel command line.
In CC use cases, the attestation agent is expected to set the GPU
ready state. For the CUDA vectorAdd case, where the attestation agent
is not used, we set the ready state by adding the kernel command line
parameter through an annotation.
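As an illustration, a hedged sketch of setting that parameter through
the standard Kata kernel_params annotation (pod name, runtime class and
image are only examples, and the annotation has to be allow-listed in
the runtime's enable_annotations):

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vectoradd
      annotations:
        # Append the GPU-ready-state parameter to the guest kernel cmdline.
        io.katacontainers.config.hypervisor.kernel_params: "nvrc.smi.srs=1"
    spec:
      runtimeClassName: kata-qemu-nvidia-gpu-snp
      containers:
      - name: cuda-vectoradd
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubi8
    EOF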
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add an allow-all policy for the CC GPU tests and ensure the init-data
device is created (via hypervisor annotations).
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The add_allow_all_policy_to_yaml function in tests_common.sh needs some
improvements so that it can support manifests containing resources of
different kinds. For now, move the Secret definition to the bottom so
that a default policy can be created for the Pod.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The qemu-nvidia-gpu handlers should not cause is_aks_cluster to
return 1. Otherwise, CI logic will assume these hypervisors run on
AKS hosts; see the following message in CI without this change:
INFO: Adapting common policy settings for AKS Hosts
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
We currently start a pod that does a `wget` to the KBS address, and
fails after 5 seconds.
By the time it fails and reports back, we can see that the KBS is
actually running, but the workflow failed because the checker failed. :-/
Let's give the KBS more time to show up, and the flakiness should
go away.
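A minimal sketch of the more patient probe this implies (timeout and
variable name are illustrative, not the actual CI code):

    # Keep probing the KBS for up to ~2 minutes instead of giving up
    # after a single 5-second attempt; KBS_ADDRESS is an assumed,
    # exported variable here.
    timeout 120 sh -c 'until wget -q -O- "${KBS_ADDRESS}"; do sleep 5; done'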
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Utilize Kubelet's Pod Resources API to determine device allocations
for the Pod during sandbox creation. Use CDI files to translate the
device IDs to the corresponding device paths and perform device injection.
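For context, the CDI spec files are what carry the device-ID to
device-node mapping; a hedged sketch of producing and inspecting such
a spec with the NVIDIA Container Toolkit (output path is the CDI
default and only illustrative):

    # Generate a CDI spec for the GPUs on this host and look at the
    # device-node paths recorded for each device ID.
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    grep -B2 -A4 'deviceNodes' /etc/cdi/nvidia.yaml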
Fixes: #12009
Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
Use the pod name variable so that kubectl wait finds the pod. Currently,
kubectl waits for nvidia-nim-llama-3-2-nv-embedqa-1b-v2 instead of
nvidia-nim-llama-3-2-nv-embedqa-1b-v2-tee.
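A minimal sketch of the intended pattern (the condition and timeout
values are illustrative):

    # Wait on the pod we actually deployed, not a hardcoded name.
    pod_name="nvidia-nim-llama-3-2-nv-embedqa-1b-v2-tee"
    kubectl wait --for=condition=Ready "pod/${pod_name}" --timeout=600s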
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Introduce a new devkit parameter which produces a rootfs without
chiselling. This results in a larger rootfs that includes various
additional packages and binaries, enabling, for instance, the use
of the debug console.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Rust packages are cloned and built inside the
tools/packaging/kata-deploy/local-build/build folder, which may mislead
them into thinking they are part of the Kata root workspace.
Exclude the directory to avoid that.
Reported-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
The person who introduced the check, someone named Fabiano Fidêncio,
forgot a `$` in a variable assignment.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The snp CI has not been required for a while and has recently been
broken, so comment it out from the list of required jobs.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The run-nydus tests are not stable and are blocking PRs, so make them
non-required temporarily until they can be looked at.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Enable auto-generated policy on cbl-mariner hosts for
qemu-coco-dev-runtime-rs if the user didn't specify an
AUTO_GENERATE_POLICY value.
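A hedged sketch of the defaulting this describes (variable and helper
names other than AUTO_GENERATE_POLICY are assumptions, not the exact
test code):

    # Default AUTO_GENERATE_POLICY to "yes" on cbl-mariner hosts for the
    # qemu-coco-dev-runtime-rs handler, unless the user already set it.
    # is_cbl_mariner_host is a hypothetical helper.
    if [ "${KATA_HYPERVISOR}" = "qemu-coco-dev-runtime-rs" ] && is_cbl_mariner_host; then
        AUTO_GENERATE_POLICY="${AUTO_GENERATE_POLICY:-yes}"
    fi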
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We will re-enable this one later, once the changes to properly cold
plug multiple GPUs are merged.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Let's just raise the podOverhead to a gigantic value, as we do need pod
sandboxes as big as that, and we've noticed QEMU being OOM-killed with
smaller overheads.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Those need to pull the models inside the guest, and the guest has 50%
of its memory "allowed" to be used as tmpfs, so we have to use the RAM
that we have.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Yes, we're dealing with a combination of large images and image-rs's
handling of concurrent image layers not being optimal.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We cannot use the same format used for docker, as it includes username
and password, while the format expected when using Trustee does not.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Now that we've bumped Trustee to a version that supports the NVIDIA
remote verifier, let's re-enable the tests.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>