kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-01 22:50:54 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	e529ca0292	Merge pull request #13170 from fidencio/topic/kata-deploy-custom-runtimes-podOverhead kata-deploy: inherit custom RuntimeClass overhead from baseConfig	2026-06-05 19:46:17 +02:00
Fabiano Fidêncio	e9ee97f751	kata-deploy: inherit custom RuntimeClass overhead from baseConfig Default custom runtime RuntimeClass overhead.podFixed to the selected baseConfig values, so equivalent runtimes behave consistently without repeating boilerplate. In case the user wants to enforce that no overhead is set on the custom RuntimeClass, disable inheritance with inheritBaseOverhead=false. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-05 17:22:25 +02:00
Steve Horsman	2ac6bb173b	Merge pull request #13036 from stevenhorsman/jaeger-to-otlp-tracing-switch trace-forwarder: migrate from Jaeger to OTLP exporter	2026-06-05 14:30:26 +01:00
Steve Horsman	1624ebe362	Merge pull request #13135 from kata-containers/dependabot/cargo/tar-0.4.46 build(deps): bump tar from 0.4.45 to 0.4.46	2026-06-05 09:44:46 +01:00
stevenhorsman	b737ae48bf	trace-forwarder: migrate from Jaeger to OTLP exporter Migrate trace-forwarder from the deprecated opentelemetry-jaeger exporter to the modern opentelemetry-otlp exporter. This change remediates GHSA-2f9f-gq7v-9h6m (CVE-2026-43868), a medium-severity vulnerability in Apache Thrift. The opentelemetry-jaeger crate is no longer maintained and depends on vulnerable thrift versions (0.13.0 and 0.16.0). The opentelemetry-otlp exporter does not use thrift and is actively maintained. Changes: - Replace opentelemetry-jaeger with opentelemetry-otlp in Cargo.toml - Update tracer.rs to use OTLP exporter instead of Jaeger exporter - Replace --jaeger-host/--jaeger-port flags with --otlp-endpoint flag - Update server.rs to use TracerProvider instead of SpanExporter - Update documentation to reflect OTLP migration - Add examples for common OTLP-compatible collectors Breaking change: Users must update their trace-forwarder invocations to use --otlp-endpoint instead of --jaeger-host and --jaeger-port. Default endpoint: http://localhost:4317 (OTLP gRPC) Generated-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com> Co-authored-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-06-04 19:39:47 +01:00
Dan Mihai	c78ccc2e9f	Merge pull request #13088 from kata-containers/dependabot/cargo/openssl-0.10.80 build(deps): bump openssl from 0.10.79 to 0.10.80	2026-06-04 11:38:08 -07:00
Fabiano Fidêncio	743b0a4839	Merge pull request #13165 from stevenhorsman/bump-go-to-1.25.11 versions: bump golang to 1.25.11	2026-06-04 20:24:57 +02:00
Fabiano Fidêncio	cd21b7b607	Merge pull request #13156 from fidencio/topic/runtime-rs-shim-leftover-on-failure runtime-rs: shut down shim daemon on a failed create	2026-06-04 20:09:28 +02:00
Fabiano Fidêncio	354b85784c	Merge pull request #13166 from stevenhorsman/required-tests/remote-kata-monitor ci: Remove kata-monitor test from required	2026-06-04 20:04:15 +02:00
stevenhorsman	81c7dde0ae	ci: Remove kata-monitor test from required The kata-monitor test is currently failing and is running a very EoL version of cri-o. This area is being actively reworked in #13107, so remove this and then once kata-monitor tests are stable we can re-add the new versions Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-06-04 14:40:17 +01:00
Fabiano Fidêncio	80e2473440	runtime-rs: shut down shim daemon on a failed create When CreateContainer fails before the runtime instance is registered (e.g. a hypervisor/cgroup error), no sandbox exists to drive the normal teardown. containerd's follow-up Shutdown RPC then reaches get_runtime_instance(), fails with "runtime not ready", and returns before the service loop is ever told to stop. Because the shim ignores SIGTERM, the containerd-shim-kata-v2 daemon is left running and orphaned. Make the Shutdown RPC force the daemon to exit when there is no runtime instance, emitting the same Action::Shutdown that sandbox.shutdown() sends on the normal path. This guarantees the shim process is reaped after a failed create instead of leaking. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-04 14:12:01 +02:00
Fabiano Fidêncio	2a1ce7b8c4	Merge pull request #12539 from mythi/no-vcpu-hotplug Disable CPU hotplug when confidential guest setting enabled	2026-06-04 10:56:52 +02:00
dependabot[bot]	4ab63d0a5d	build(deps): bump tar from 0.4.45 to 0.4.46 Bumps [tar](https://github.com/composefs/tar-rs) from 0.4.45 to 0.4.46. - [Release notes](https://github.com/composefs/tar-rs/releases) - [Commits](https://github.com/composefs/tar-rs/compare/0.4.45...0.4.46) --- updated-dependencies: - dependency-name: tar dependency-version: 0.4.46 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2026-06-04 07:52:44 +00:00
dependabot[bot]	d155f1a4ab	build(deps): bump openssl from 0.10.79 to 0.10.80 Bumps [openssl](https://github.com/rust-openssl/rust-openssl) from 0.10.79 to 0.10.80. - [Release notes](https://github.com/rust-openssl/rust-openssl/releases) - [Commits](https://github.com/rust-openssl/rust-openssl/compare/openssl-v0.10.79...openssl-v0.10.80) --- updated-dependencies: - dependency-name: openssl dependency-version: 0.10.80 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>	2026-06-04 07:51:50 +00:00
stevenhorsman	879912be25	versions: bump golang to 1.25.11 Bump the go version to resolve CVEs: - GO-2026-5037 - GO-2026-5038 - GO-2026-5039 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-04 08:49:17 +01:00
Steve Horsman	53c1a627e4	Merge pull request #13143 from stevenhorsman/x/net-0.55-bump bump golang.org/x/dependencies	2026-06-03 16:46:08 +01:00
Mikko Ylinen	018389cb22	tests: enable k8s-sandbox-vcpus-allocation.bats for tdx and coco-dev k8s-sandbox-vcpus-allocation.bats was disabled for qemu-tdx due to errors when moving to use "upstream" TDX KVM code. The failing test is vcpus-less-than-one-with-no-limits pod which ends up getting x86 default MaxCPU = 240 and erroring: Number of hotpluggable cpus requested (240) exceeds the maximum cpus supported by KVM (224) TDX max vcpus is capped to host's logical CPUs so 240 is too much. With the maxcpus logic fixed (=maxcpus not set at all) for configurations where confidential guest is enabled, qemu-tdx can be enabled for k8s-sandox-vcpus-allocation.bats again. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-03 15:27:35 +03:00
Mikko Ylinen	e475d870fb	runtime: qemu: don't set maxcpus when confidential guest is enabled QEMU maxcpus enables CPU hotplug capabilities but it's unused when confidential guest is enabled. Change Go runtime code to skip setting maxcpus QEMU cmdline if CPU hotplug is not needed. Commit `07db945b09` built a relationship between kernel's cmdline nr_cpus and the maxcpus config. Now that maxcpus is dropped for confidential guests, drop nr_cpus from kernel commandline too. This hopefully helps with the reference values computation too. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-03 15:27:35 +03:00
Mikko Ylinen	2e625d0bab	runtime-rs: qemu: don't set maxcpus when confidential guest is enabled QEMU maxcpus enables CPU hotplug capabilities but it's unused when confidential guest is enabled. Change runtime-rs code to skip setting maxcpus QEMU cmdline if CPU hotplug is not needed. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-03 15:27:35 +03:00
stevenhorsman	51eee428f4	testing/webhook: bump golang.org/x dependencies Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys from v0.43.0 to v0.44.0 to resolve CVEs: - GO-2026-5024 - GO-2026-5025 - GO-2026-5026 - GO-2026-5027 - GO-2026-5028 - GO-2026-5029 - GO-2026-5030 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-06-03 09:56:54 +01:00
stevenhorsman	144ab161f1	tetss: bump golang.org/x/sys dependency Bump golang.org/x/sys from v0.19.0 to v0.44.0 to resolve CVE: - GO-2026-5024 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-03 09:56:54 +01:00
stevenhorsman	46d704a7ab	log-parser: bump golang.org/x/sys dependency Bump golang.org/x/sys from v0.1.0 to v0.44.0 to resolve CVE: - GO-2026-5024 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-03 09:56:54 +01:00
stevenhorsman	08ab789d9a	csi-kata-directvolume: bump golang.org/x dependencies Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys from v0.43.0 to v0.44.0 to resolve CVEs: - GO-2026-5024 - GO-2026-5025 - GO-2026-5026 - GO-2026-5027 - GO-2026-5028 - GO-2026-5029 - GO-2026-5030 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-03 09:56:54 +01:00
stevenhorsman	c0f549860e	runtime: bump golang.org/x dependencies Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys from v0.43.0 to v0.44.0 to resolve CVEs: - GO-2026-5024 - GO-2026-5025 - GO-2026-5026 - GO-2026-5027 - GO-2026-5028 - GO-2026-5029 - GO-2026-5030 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-03 09:56:54 +01:00
Fabiano Fidêncio	a2bb3f64b0	Merge pull request #12436 from mythi/tdx-updates-2026-3 runtime(-rs): tdx: use TDX QGS via unix-domain-socket by default	2026-06-03 08:50:26 +02:00
Fabiano Fidêncio	ecd9344dd1	Merge pull request #13144 from stevenhorsman/bump-rust-to-1.94 Bump rust to 1.94	2026-06-02 09:58:56 +02:00
Fabiano Fidêncio	230e01b04e	Merge pull request #13126 from kata-containers/topic/runtimes-introduce-azure-specific-configs runtime/runtime-rs: introduce Azure specific configs	2026-06-02 09:17:09 +02:00
stevenhorsman	b1928cc22f	runtime-rs: run cargo fmt for Rust 1.94 Run cargo fmt on runtime-rs to ensure consistent formatting with Rust 1.94 toolchain. Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-06-01 17:32:06 +01:00
Fabiano Fidêncio	57de50f43c	Merge pull request #13141 from fidencio/topic/kata-deploy-fix-stale-containerd-import kata-deploy: scrub stale containerd import on conf.d migration	2026-06-01 18:13:08 +02:00
Steve Horsman	a3cc016e2f	Merge pull request #13140 from fidencio/topic/fix-besteffort-sandbox-cpu-sizing runtime: oci: Only derive sandbox CPUs from shares when quota is unconstrained	2026-06-01 17:09:12 +01:00
stevenhorsman	f9c95a279e	dragonball: Remove unnecessary unsafe blocks in cpuid Rust 1.94 now warns about unnecessary unsafe blocks around __get_cpuid_max(), __cpuid_count(), and host_cpuid() calls. Remove the unsafe blocks as they are no longer needed. This fixes the following clippy warnings in dbs-arch: - warning: unnecessary `unsafe` block at brand_string.rs:106 - warning: unnecessary `unsafe` block at brand_string.rs:114 - warning: unnecessary `unsafe` block at common.rs:28 - warning: unnecessary `unsafe` block at common.rs:36 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-01 17:07:16 +01:00
stevenhorsman	a63a948b4a	libs: Remove unnecessary unsafe blocks in protection.rs Rust 1.94 now warns about unnecessary unsafe blocks around x86_64::__cpuid() calls. Remove the unsafe blocks as they are no longer needed. This fixes the following clippy warnings: - warning: unnecessary `unsafe` block at line 129 - warning: unnecessary `unsafe` block at line 142 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-01 17:04:43 +01:00
stevenhorsman	9625bf8056	versions: Update MSRV to 1.94 With the bump to 1.94, we are now relying on some 1.94+ apis, so update the MSRV to reflect this Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-06-01 17:02:20 +01:00
stevenhorsman	4987d79e26	versions: Bump rust to 1.94 Now that 1.96 has been released, in compliance with our toolchain guidance we should bump to rust 1.94 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Assisted-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-06-01 16:39:06 +01:00
Greg Kurz	8a49ecb159	Merge pull request #13097 from BbolroC/fix-shim-components-for-s390x ci: Refactor boot-image-se build and update shim components	2026-06-01 11:43:42 +02:00
Fabiano Fidêncio	f788997253	kata-deploy: scrub stale containerd import on conf.d migration Since the conf.d migration (containerd >= 2.2.0), kata-deploy writes its drop-in to the auto-imported /etc/containerd/conf.d/ and no longer manages the main config's `imports` array. A node upgraded from a pre-conf.d kata-deploy keeps the legacy `{dest_dir}/containerd/config.d/kata-deploy.toml` entry in `imports`, since the new code neither adds nor removes it. On uninstall, remove_artifacts() deletes the artifacts dir (including the file that import still points at) and then restarts containerd, which fails to load the now-dangling import and wedges the node: pods get stuck Terminating and new pods cannot start. This broke the lifecycle-manager E2E tests (TC-02..TC-07) which repeatedly upgrade then reinstall across the 3.30.0 -> latest version boundary. Defensively scrub the legacy import from the main containerd config in both configure_containerd (at conf.d migration time) and cleanup_containerd (before artifacts are removed and containerd is restarted). The helper is a no-op when the config is absent, has no `imports` array, or does not contain the legacy entry. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-01 11:07:13 +02:00
Fabiano Fidêncio	9b5b829265	runtime: oci: derive sandbox CPUs from shares only if unconstrained The shares-based fallback added for cpuManagerPolicy=static fired whenever the quota-based CPU count was 0, including for BestEffort sandboxes that have no CPU request. Those sandboxes still carry the cgroup-floor shares value (2), so the fallback derived ceil(2/1024)=1 and inflated every such sandbox by one vCPU. For peer-pods (static resource management) this changed the VM sizing to default_vcpus+1, regressing the libvirt instance-type CI checks. Gate the fallback on the quota being explicitly unconstrained (< 0), which is the actual cpuManagerPolicy=static signal, instead of on numCPU == 0. BestEffort sandboxes (quota 0/absent) now correctly contribute 0 vCPUs while the static-policy case still recovers the CPU count from shares. Add unit tests covering the static-policy, rounding, BestEffort, and explicit-quota cases. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-01 09:50:49 +02:00
Fabiano Fidêncio	02fd572195	Merge pull request #13134 from jojimt/rc-version kata-deploy: Add a version annotation to runtimeclass	2026-06-01 08:21:30 +02:00
manuelh-dev	953b306ff3	Merge pull request #12979 from manuelh-dev/mahuber/erofs-tmpfs-mount runtime-rs/agent: support EROFS snapshots without a rwlayer	2026-05-29 13:50:27 -07:00
Aurélien Bombo	9acef4bc55	Merge pull request #13133 from microsoft/cameronbaird/upstream/revert-macvtap-simple Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"	2026-05-29 14:57:07 -05:00
Fabiano Fidêncio	f349d19bf4	Merge pull request #12956 from zvonkok/nvgpu-tarball-chart build: add kata-deploy-publish target	2026-05-29 21:22:44 +02:00
Fabiano Fidêncio	4e7b49fede	Merge pull request #13103 from fidencio/topic/mlx-coldplug-support-v2 runtime / agent / kernel: fix cold-plug VFIO guest-kernel mode for SR-IOV RoCE/InfiniBand	2026-05-29 21:21:46 +02:00
Joji Mekkattuparamban	8549d71c6f	kata-deploy: Add a version annotation to runtimeclass Enables automations to determine version with a simple read RBAC on the runtime class. Helpful when versions need to match with other tools (e.g. genpolicy) or when simple version determination is needed for other reasons. Fixes #13123 Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>	2026-05-29 10:50:19 -07:00
Cameron Baird	7a9d207ab2	Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair" This reverts commit `2799f7d36b`.	2026-05-29 17:05:40 +00:00
Zvonko Kaiser	7f906ec95d	build: add kata-deploy-publish target Mirror the CI payload publish flow in local builds, including image and helm chart publishing, while reusing the same chart upload helper in payload-after-push to avoid duplicated chart packaging logic. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-29 16:22:12 +02:00
Zvonko Kaiser	fb73ccc352	build: include kata-deploy static artifacts in nvgpu bundle Build and package kata-deploy binary and nydus snapshotter component tarballs as part of nvgpu-tarball so local publish can consume a single kata-static.tar.zst without rebuilding extra artifacts. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-29 16:22:12 +02:00
Fabiano Fidêncio	10e70a2a9f	runtime-rs: expose InfiniBand devices to VFIO containers The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a PCIDEVICE_* environment variable; it does not add the VFIO char device to linux.devices in the OCI spec. As a result the agent's container_has_vfio_device() gate stays closed and expose_guest_infiniband_devices() is never triggered — leaving /dev/infiniband absent from the container even though the guest kernel created the IB devices (mlx5_core.rdma.0 probes successfully). The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network endpoints via host_bdf()) was already present inside handler_devices() but could never be consumed because the LinuxDeviceType::C loop has no entries to iterate over when linux.devices is empty. After that loop, iterate over any unmatched cold-plug BDFs, derive the VFIO group path via bdf_to_vfio_group_path() (reads /sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk ContainerDevice. The vfio_group_to_bdf() short-circuit inside the loop handles the case where the device plugin does add VFIO char devices to linux.devices; it now supports both legacy (/dev/vfio/N) and iommufd (/dev/vfio/devices/vfioN) path formats. Add host_bdf() to the Endpoint trait (default: None) so that PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	60f2878c68	runtime-rs: call network.remove() during resource cleanup network.remove() — which detaches endpoints and rebinds VFs from vfio-pci back to the host driver — was never being called. ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs, swap and ephemeral disks, but completely omitted the network teardown. Call network.remove() at the start of cleanup(), using the already-held self.hypervisor reference. Errors are logged as warnings rather than propagated, so they don't block the rest of the cleanup sequence. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	0b4b51dff6	runtime-rs: always detach endpoints on network removal network_with_netns::remove() bailed out early when network_created=false (i.e. the netns was created by the CNI, not by kata). This caused physical endpoint VFs to remain bound to vfio-pci after pod deletion, because PhysicalEndpoint::detach() — which calls bind_device_to_host() to rebind the VF from vfio-pci back to mlx5_core — was never reached. Separate endpoint detachment from netns deletion: always detach endpoints, but only remove the netns if kata created it. Detach errors are logged as warnings rather than propagated, to mirror the Go runtime's best-effort approach and avoid blocking sandbox teardown. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
Fabiano Fidêncio	be2ec02c9a	runtime-rs: resolve cold-plug VFIO guest PCI path via QMP The PCIe topology pre-computes a wrong path for cold-plugged physical- endpoint VFs because the root port has no explicit addr and QEMU auto- assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] } resolves to 0000:00:00.0 (the Q35 MCH), causing wait_for_pci_net_interface to time out looking for a netdev there. Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait. Implement it in QemuInner using qmp.get_device_by_qdev_id(), which queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00" = slot 5 on pcie.0 / slot 0 on the root port bus). Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the Endpoint trait and add an endpoints() accessor to the Network trait. In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths() before apply_network_to_agent() to populate the correct path from QMP into each PhysicalEndpoint's guest_pci_path field. The field is then consumed by network_with_netns::interfaces() to fill Interface.device_path before update_interface is sent to the agent. This is the runtime-rs counterpart of the Go runtime's ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-29 13:07:45 +02:00
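
Note on commit 9b5b829265 (runtime: oci: derive sandbox CPUs from shares only if unconstrained): the gating it describes can be sketched roughly as below. This is a minimal illustration based only on the commit message, not the runtime's actual code. The function names `sandboxVCPUs` and `vcpusFromQuota`, the parameter types, and the constant name `sharesPerCPU` are all assumptions made for this example.

```go
package main

import "fmt"

// sharesPerCPU is the number of CPU shares treated as one full CPU
// (the commit message computes ceil(2/1024) for the cgroup-floor value).
const sharesPerCPU = 1024

// vcpusFromQuota is a hypothetical helper: it rounds quota/period up to a
// whole number of vCPUs and returns 0 when no positive quota is set.
func vcpusFromQuota(quota int64, period uint64) uint32 {
	if quota <= 0 || period == 0 {
		return 0
	}
	return uint32((uint64(quota) + period - 1) / period)
}

// sandboxVCPUs is a hypothetical sketch of the gated fallback: shares are
// consulted only when the quota is explicitly unconstrained (< 0), which the
// commit message identifies as the cpuManagerPolicy=static signal.
func sandboxVCPUs(quota int64, period, shares uint64) uint32 {
	if quota < 0 {
		// Unconstrained quota: recover the CPU count from shares, rounding up.
		return uint32((shares + sharesPerCPU - 1) / sharesPerCPU)
	}
	// Quota of 0 or absent (BestEffort) contributes 0 vCPUs; a positive
	// quota is converted as usual.
	return vcpusFromQuota(quota, period)
}

func main() {
	fmt.Println(sandboxVCPUs(-1, 100000, 2048))     // static policy
	fmt.Println(sandboxVCPUs(-1, 100000, 1536))     // rounding up
	fmt.Println(sandboxVCPUs(0, 0, 2))              // BestEffort, floor shares
	fmt.Println(sandboxVCPUs(150000, 100000, 1024)) // explicit quota
}
```

Under the earlier behaviour described in the commit message, the BestEffort case would have fallen through to the shares path and gained one vCPU. Gating on `quota < 0` keeps it at 0.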


19241 Commits