Default custom runtime RuntimeClass overhead.podFixed to the selected
baseConfig values, so equivalent runtimes behave consistently without
repeating boilerplate.
In case the user wants to enforce that no overhead is set on the custom
RuntimeClass, disable inheritance with inheritBaseOverhead=false.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Migrate trace-forwarder from the deprecated opentelemetry-jaeger
exporter to the modern opentelemetry-otlp exporter.
This change remediates GHSA-2f9f-gq7v-9h6m (CVE-2026-43868), a
medium-severity vulnerability in Apache Thrift. The opentelemetry-jaeger
crate is no longer maintained and depends on vulnerable thrift versions
(0.13.0 and 0.16.0). The opentelemetry-otlp exporter does not use thrift
and is actively maintained.
Changes:
- Replace opentelemetry-jaeger with opentelemetry-otlp in Cargo.toml
- Update tracer.rs to use OTLP exporter instead of Jaeger exporter
- Replace --jaeger-host/--jaeger-port flags with --otlp-endpoint flag
- Update server.rs to use TracerProvider instead of SpanExporter
- Update documentation to reflect OTLP migration
- Add examples for common OTLP-compatible collectors
Breaking change: Users must update their trace-forwarder invocations
to use --otlp-endpoint instead of --jaeger-host and --jaeger-port.
Default endpoint: http://localhost:4317 (OTLP gRPC)
Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The kata-monitor test is currently failing and is running a very EoL
version of cri-o. This area is being actively reworked in #13107,
so remove this and then once kata-monitor tests are stable we
can re-add the new versions
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
When CreateContainer fails before the runtime instance is registered
(e.g. a hypervisor/cgroup error), no sandbox exists to drive the normal
teardown. containerd's follow-up Shutdown RPC then reaches
get_runtime_instance(), fails with "runtime not ready", and returns
before the service loop is ever told to stop. Because the shim ignores
SIGTERM, the containerd-shim-kata-v2 daemon is left running and orphaned.
Make the Shutdown RPC force the daemon to exit when there is no runtime
instance, emitting the same Action::Shutdown that sandbox.shutdown()
sends on the normal path. This guarantees the shim process is reaped
after a failed create instead of leaking.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
Bump the go version to resolve CVEs:
- GO-2026-5037
- GO-2026-5038
- GO-2026-5039
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
k8s-sandbox-vcpus-allocation.bats was disabled for qemu-tdx due to
errors when moving to use "upstream" TDX KVM code. The failing test
is vcpus-less-than-one-with-no-limits pod which ends up getting
x86 default MaxCPU = 240 and erroring:
Number of hotpluggable cpus requested (240) exceeds the maximum cpus supported by KVM (224)
TDX max vcpus is capped to host's logical CPUs so 240 is too much.
With the maxcpus logic fixed (=maxcpus not set at all) for configurations
where confidential guest is enabled, qemu-tdx can be enabled for
k8s-sandox-vcpus-allocation.bats again.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
QEMU maxcpus enables CPU hotplug capabilities but it's unused when
confidential guest is enabled.
Change Go runtime code to skip setting maxcpus QEMU cmdline if CPU hotplug
is not needed.
Commit 07db945b09 built a relationship between kernel's cmdline nr_cpus and
the maxcpus config. Now that maxcpus is dropped for confidential guests, drop
nr_cpus from kernel commandline too. This hopefully helps with the reference
values computation too.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
QEMU maxcpus enables CPU hotplug capabilities but it's unused when
confidential guest is enabled.
Change runtime-rs code to skip setting maxcpus QEMU cmdline if CPU hotplug
is not needed.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys
from v0.43.0 to v0.44.0 to resolve CVEs:
- GO-2026-5024
- GO-2026-5025
- GO-2026-5026
- GO-2026-5027
- GO-2026-5028
- GO-2026-5029
- GO-2026-5030
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys
from v0.43.0 to v0.44.0 to resolve CVEs:
- GO-2026-5024
- GO-2026-5025
- GO-2026-5026
- GO-2026-5027
- GO-2026-5028
- GO-2026-5029
- GO-2026-5030
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
Run cargo fmt on runtime-rs to ensure consistent formatting
with Rust 1.94 toolchain.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Rust 1.94 now warns about unnecessary unsafe blocks around
__get_cpuid_max(), __cpuid_count(), and host_cpuid() calls.
Remove the unsafe blocks as they are no longer needed.
This fixes the following clippy warnings in dbs-arch:
- warning: unnecessary `unsafe` block at brand_string.rs:106
- warning: unnecessary `unsafe` block at brand_string.rs:114
- warning: unnecessary `unsafe` block at common.rs:28
- warning: unnecessary `unsafe` block at common.rs:36
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
Rust 1.94 now warns about unnecessary unsafe blocks around
x86_64::__cpuid() calls. Remove the unsafe blocks as they are
no longer needed.
This fixes the following clippy warnings:
- warning: unnecessary `unsafe` block at line 129
- warning: unnecessary `unsafe` block at line 142
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
Now that 1.96 has been released, in compliance with our toolchain guidance
we should bump to rust 1.94
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Since the conf.d migration (containerd >= 2.2.0), kata-deploy writes its
drop-in to the auto-imported /etc/containerd/conf.d/ and no longer manages
the main config's `imports` array. A node upgraded from a pre-conf.d
kata-deploy keeps the legacy `{dest_dir}/containerd/config.d/kata-deploy.toml`
entry in `imports`, since the new code neither adds nor removes it.
On uninstall, remove_artifacts() deletes the artifacts dir (including the
file that import still points at) and then restarts containerd, which fails
to load the now-dangling import and wedges the node: pods get stuck
Terminating and new pods cannot start. This broke the lifecycle-manager E2E
tests (TC-02..TC-07) which repeatedly upgrade then reinstall across the
3.30.0 -> latest version boundary.
Defensively scrub the legacy import from the main containerd config in both
configure_containerd (at conf.d migration time) and cleanup_containerd
(before artifacts are removed and containerd is restarted). The helper is a
no-op when the config is absent, has no `imports` array, or does not contain
the legacy entry.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The shares-based fallback added for cpuManagerPolicy=static fired whenever
the quota-based CPU count was 0, including for BestEffort sandboxes that
have no CPU request. Those sandboxes still carry the cgroup-floor shares
value (2), so the fallback derived ceil(2/1024)=1 and inflated every such
sandbox by one vCPU. For peer-pods (static resource management) this
changed the VM sizing to default_vcpus+1, regressing the libvirt
instance-type CI checks.
Gate the fallback on the quota being explicitly unconstrained (< 0), which
is the actual cpuManagerPolicy=static signal, instead of on numCPU == 0.
BestEffort sandboxes (quota 0/absent) now correctly contribute 0 vCPUs
while the static-policy case still recovers the CPU count from shares.
Add unit tests covering the static-policy, rounding, BestEffort, and
explicit-quota cases.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Enables automations to determine version with a simple read RBAC
on the runtime class. Helpful when versions need to match with other
tools (e.g. genpolicy) or when simple version determination is needed
for other reasons.
Fixes#13123
Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
Mirror the CI payload publish flow in local builds, including image and
helm chart publishing, while reusing the same chart upload helper in
payload-after-push to avoid duplicated chart packaging logic.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Build and package kata-deploy binary and nydus snapshotter component
tarballs as part of nvgpu-tarball so local publish can consume a single
kata-static.tar.zst without rebuilding extra artifacts.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec. As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).
The cold_plug_bdfs map (host_bdf → guest_pci_path, built from network
endpoints via host_bdf()) was already present inside handler_devices()
but could never be consumed because the LinuxDeviceType::C loop has
no entries to iterate over when linux.devices is empty.
After that loop, iterate over any unmatched cold-plug BDFs, derive the
VFIO group path via bdf_to_vfio_group_path() (reads
/sys/bus/pci/devices/<bdf>/iommu_group), and push a vfio-pci-gk
ContainerDevice. The vfio_group_to_bdf() short-circuit inside the
loop handles the case where the device plugin does add VFIO char
devices to linux.devices; it now supports both legacy (/dev/vfio/N)
and iommufd (/dev/vfio/devices/vfioN) path formats.
Add host_bdf() to the Endpoint trait (default: None) so that
PhysicalEndpoint can expose its BDF for the cold_plug_bdfs map.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
network.remove() — which detaches endpoints and rebinds VFs from
vfio-pci back to the host driver — was never being called.
ResourceManagerInner::cleanup() handled cgroups, bindmounts, share-fs,
swap and ephemeral disks, but completely omitted the network teardown.
Call network.remove() at the start of cleanup(), using the already-held
self.hypervisor reference. Errors are logged as warnings rather than
propagated, so they don't block the rest of the cleanup sequence.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
network_with_netns::remove() bailed out early when network_created=false
(i.e. the netns was created by the CNI, not by kata). This caused
physical endpoint VFs to remain bound to vfio-pci after pod deletion,
because PhysicalEndpoint::detach() — which calls bind_device_to_host()
to rebind the VF from vfio-pci back to mlx5_core — was never reached.
Separate endpoint detachment from netns deletion: always detach
endpoints, but only remove the netns if kata created it. Detach errors
are logged as warnings rather than propagated, to mirror the Go runtime's
best-effort approach and avoid blocking sandbox teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
The PCIe topology pre-computes a wrong path for cold-plugged physical-
endpoint VFs because the root port has no explicit addr and QEMU auto-
assigns its slot. The pre-computed PciPath { slots: [PciSlot(0)] }
resolves to 0000:00:00.0 (the Q35 MCH), causing
wait_for_pci_net_interface to time out looking for a netdev there.
Add resolve_vfio_device_pci_path(hostdev_id) to the Hypervisor trait.
Implement it in QemuInner using qmp.get_device_by_qdev_id(), which
queries QEMU's query-pci to find the full guest PCIe path (e.g. "05/00"
= slot 5 on pcie.0 / slot 0 on the root port bus).
Store the QEMU device ID (hostdev_id) in PhysicalEndpoint during
attach(). Add vfio_hostdev_id() and set_guest_pci_path() to the
Endpoint trait and add an endpoints() accessor to the Network trait.
In setup_after_start_vm(), call resolve_physical_endpoint_pci_paths()
before apply_network_to_agent() to populate the correct path from QMP
into each PhysicalEndpoint's guest_pci_path field. The field is then
consumed by network_with_netns::interfaces() to fill Interface.device_path
before update_interface is sent to the agent.
This is the runtime-rs counterpart of the Go runtime's
ResolveColdPlugVFIOGuestPciPaths / qomGetPciPath.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>