Commit Graph

19263 Commits

Author SHA1 Message Date
Fabiano Fidêncio
e122d7ffb0 versions: bump containerd to 2.3 and define minimum/latest test matrix
Bump the containerd version used by CI from v1.7.25 to v2.3.0.

Rename the version-range fields in versions.yaml and throughout the
GitHub Actions workflows from lts/active/version/sandbox_api to
minimum/latest to make their meaning self-evident:

  minimum: "v1.7"   # oldest containerd branch under test
  latest:  "v2.3"   # newest containerd branch under test

Drop the bare version field (superseded by the matrix) and the
sandbox_api alias (covered by latest).  Update all containerd_version
matrix entries in the workflow files accordingly, and update
gha-run-k8s-common.sh to resolve the new key names.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
2026-06-08 19:20:14 +02:00
Fabiano Fidêncio
a4138794ea Merge pull request #13183 from fidencio/topic/kata-deploy-custom-kata-drop-in-for-default-runtimes
kata-deploy: support drop-in configs for default runtimes
2026-06-08 18:44:33 +02:00
Fabiano Fidêncio
d6e1b45ce7 Merge pull request #13171 from fidencio/topic/runtime-rs-enforce-sandbox_cgroup_only-and-static_sandbox_resource_mgmt
runtime-rs: default static sizing-related config flags to true
2026-06-08 17:43:37 +02:00
Fabiano Fidêncio
b119b051cb kata-deploy: support drop-in configs for default runtimes
Allow operators to provide per-shim drop-in TOML for built-in runtimes
and reconcile stale override files so upgrades and migrations remain
safe when drop-ins are added or removed.
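
A minimal sketch of the reconcile step. It assumes the operator's drop-ins arrive as a map from shim name to TOML text, and that the installer can list which per-shim override files already exist. The `reconcile` function and the `Action` type are hypothetical names used for illustration, not kata-deploy's actual code:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One filesystem change needed to bring the override directory in sync.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Write (or overwrite) the drop-in for this shim.
    Write { shim: String, toml: String },
    /// Remove a stale drop-in left behind by an earlier install.
    Remove { shim: String },
}

/// Compare the drop-ins the operator requested with the ones already on disk.
pub fn reconcile(
    desired: &BTreeMap<String, String>,
    on_disk: &BTreeSet<String>,
) -> Vec<Action> {
    let mut actions = Vec::new();
    // Every requested drop-in is (re)written so its content is up to date.
    for (shim, toml) in desired {
        actions.push(Action::Write { shim: shim.clone(), toml: toml.clone() });
    }
    // Any file on disk that is no longer requested is stale and removed.
    for shim in on_disk {
        if !desired.contains_key(shim) {
            actions.push(Action::Remove { shim: shim.clone() });
        }
    }
    actions
}
```

The sorted collections keep the action order deterministic, which makes the result easy to compare across upgrades.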

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex
2026-06-08 13:31:03 +02:00
Fabiano Fidêncio
4dc288401e runtime-rs: make sandbox cgroup runtime attach idempotent
The dragonball nerdctl CI job can race when creating and attaching the
runtime process to the sandbox cgroup, surfacing OS error 17
(AlreadyExists) during shim task creation.

Let's retry add_proc once on this pre-existing cgroup condition so
startup remains robust.
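
The retry-once pattern can be sketched with the standard library alone. Here `attach` stands in for the real add_proc call, and the helper name `attach_with_one_retry` is hypothetical:

```rust
use std::io;

/// Run `attach`; if it fails with `AlreadyExists` (os error 17, EEXIST),
/// try exactly once more. Any other error, or a second failure, is returned.
pub fn attach_with_one_retry<F>(mut attach: F) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    match attach() {
        // Only the pre-existing-cgroup condition is worth a second attempt.
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => attach(),
        other => other,
    }
}
```

Retrying only on this error kind keeps real failures (for example, permission errors) visible instead of masking them behind a loop.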

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
2026-06-08 13:11:34 +02:00
Fabiano Fidêncio
4d569c22b4 runtime-rs: enforce a minimum vsock reconnect window
Low-CPU sandboxes can take longer than a few seconds to complete guest
boot and start the agent.

Let's clamp the reconnect timeout to a safe minimum so sandbox startup
does not fail early with transient vsock ECONNRESET.
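
A sketch of the clamp. The commit does not state the minimum, so the 10-second floor below is a hypothetical placeholder:

```rust
use std::time::Duration;

/// Hypothetical floor; the actual minimum used by runtime-rs is not given here.
pub const MIN_RECONNECT_TIMEOUT: Duration = Duration::from_secs(10);

/// Never let a configured reconnect window drop below the floor;
/// larger configured values are kept as-is.
pub fn effective_reconnect_timeout(configured: Duration) -> Duration {
    configured.max(MIN_RECONNECT_TIMEOUT)
}
```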

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
2026-06-08 13:11:34 +02:00
Fabiano Fidêncio
ed34d7811d runtime-rs: supplement static sizing from sandbox annotations
When static sandbox resource management is enabled, CRI CPU/memory
sizing may live only in sandbox annotations and be missing from the OCI
spec.

Let's fill missing sizing fields from annotations before applying static
VM sizing so runtime-rs follows the expected Kubernetes behavior for
constrained pods.
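
A minimal sketch of "fill only what is missing". The `Sizing` struct is a simplified stand-in for the OCI resource fields. The annotation keys are assumed to be the sandbox sizing annotations set by containerd's CRI plugin and should be verified against the containerd version in use:

```rust
use std::collections::HashMap;

// Assumed containerd CRI sandbox annotation keys (verify before relying on them).
pub const SANDBOX_CPU_QUOTA: &str = "io.kubernetes.cri.sandbox-cpu-quota";
pub const SANDBOX_CPU_PERIOD: &str = "io.kubernetes.cri.sandbox-cpu-period";
pub const SANDBOX_MEMORY: &str = "io.kubernetes.cri.sandbox-memory";

/// Simplified stand-in for the sizing fields of the OCI Linux resources.
#[derive(Debug, Default, PartialEq)]
pub struct Sizing {
    pub cpu_quota: Option<i64>,
    pub cpu_period: Option<u64>,
    pub memory_limit: Option<i64>,
}

/// Look up an annotation and parse it; missing or malformed values yield None.
fn parse<T: std::str::FromStr>(annotations: &HashMap<String, String>, key: &str) -> Option<T> {
    annotations.get(key).and_then(|v| v.parse().ok())
}

/// Fill only the fields the spec left empty; values already in the spec win.
pub fn supplement_from_annotations(spec: &mut Sizing, annotations: &HashMap<String, String>) {
    if spec.cpu_quota.is_none() {
        spec.cpu_quota = parse(annotations, SANDBOX_CPU_QUOTA);
    }
    if spec.cpu_period.is_none() {
        spec.cpu_period = parse(annotations, SANDBOX_CPU_PERIOD);
    }
    if spec.memory_limit.is_none() {
        spec.memory_limit = parse(annotations, SANDBOX_MEMORY);
    }
}
```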

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
2026-06-08 13:11:34 +02:00
Fabiano Fidêncio
e93558e810 runtime-rs: default static sizing-related config flags to true
Add top-level runtime-rs Makefile options `DEFSANDBOXCGROUP_ONLY` and
`DEFSTATICRESOURCEMGMT`, both defaulting to true, and use them for the
runtime defaults that previously disabled these paths.

This aligns runtime-rs defaults with static sandbox resource management,
which sizes sandbox memory up front instead of relying on memory hotplug,
helping avoid architecture-specific hotplug limitations.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-08 12:57:40 +02:00
Fupan Li
024c2531a5 Merge pull request #13029 from fidencio/topic/rfc-composable-vm-images
docs: add composable VM images design proposal
2026-06-08 18:40:35 +08:00
Fabiano Fidêncio
9e65f85ccd Merge pull request #13174 from stevenhorsman/cri-o-cve-false-positive
runtime: ignore false positive CRI-O vulnerabilities
2026-06-08 09:13:39 +02:00
Fabiano Fidêncio
5801a87a4b Merge pull request #13182 from fidencio/topic/tests-enable-more-tests-for-tdx-runtime-rs
tests: unskip hard-coded policy tests on qemu-tdx-runtime-rs
2026-06-08 07:24:50 +02:00
Fabiano Fidêncio
2440b5940b docs: add composable VM images design proposal
Add an RFC document describing the composable image architecture that
replaces monolithic guest rootfs images with a lean base image plus
purpose-specific addon images cold-plugged as virtio-blk devices.

The proposal covers the runtime configuration (extra_images), host-side
cold-plugging, guest-side mounting via systemd and dm-verity, agent-side
dynamic path resolution, the image build pipeline, and the security
model.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-06-07 13:58:17 +02:00
Fabiano Fidêncio
57c61e0c2f tests: unskip hard-coded policy tests on qemu-tdx-runtime-rs
Enable the hard-coded init-data policy test gate for qemu-tdx-runtime-rs
so runtime-rs and Go TDX variants exercise the same Kubernetes policy
coverage.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-06 22:48:20 +02:00
Fabiano Fidêncio
43321c7a78 Merge pull request #12931 from mythi/qemu-tdx-tests
tests: fix TDX runtime-rs and initdata tests
2026-06-06 11:42:19 +02:00
Fabiano Fidêncio
1ca7129581 Merge pull request #13176 from Amulyam24/kata-deploy-fix
kata-deploy: add the imports directive explicitly if expected but not found
2026-06-05 22:24:16 +02:00
Fabiano Fidêncio
f6ff9578d4 Merge pull request #13161 from kata-containers/sprt/remove-configure-mariner
ci: remove Mariner annotations and use new config
2026-06-05 20:22:58 +02:00
Fabiano Fidêncio
e529ca0292 Merge pull request #13170 from fidencio/topic/kata-deploy-custom-runtimes-podOverhead
kata-deploy: inherit custom RuntimeClass overhead from baseConfig
2026-06-05 19:46:17 +02:00
Fabiano Fidêncio
e9ee97f751 kata-deploy: inherit custom RuntimeClass overhead from baseConfig
Default custom runtime RuntimeClass overhead.podFixed to the selected
baseConfig values, so equivalent runtimes behave consistently without
repeating boilerplate.

In case the user wants to enforce that no overhead is set on the custom
RuntimeClass, disable inheritance with inheritBaseOverhead=false.
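
A sketch of the selection rule. It assumes that an overhead set explicitly on the custom runtime takes precedence over the inherited one, which the commit does not state. `PodFixed`, `resolve_overhead` and the example quantities are hypothetical:

```rust
/// Fixed pod overhead, as it would appear under RuntimeClass `overhead.podFixed`.
#[derive(Debug, Clone, PartialEq)]
pub struct PodFixed {
    pub cpu: String,
    pub memory: String,
}

/// Pick the overhead for a custom RuntimeClass.
/// - An explicit overhead on the custom runtime wins (assumption).
/// - Otherwise inherit the baseConfig overhead, unless inheritance is disabled.
pub fn resolve_overhead(
    explicit: Option<&PodFixed>,
    base: Option<&PodFixed>,
    inherit_base_overhead: bool,
) -> Option<PodFixed> {
    match explicit {
        Some(o) => Some(o.clone()),
        None if inherit_base_overhead => base.cloned(),
        // inheritBaseOverhead=false: leave the RuntimeClass without overhead.
        None => None,
    }
}
```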

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-05 17:22:25 +02:00
Steve Horsman
2ac6bb173b Merge pull request #13036 from stevenhorsman/jaeger-to-otlp-tracing-switch
trace-forwarder: migrate from Jaeger to OTLP exporter
2026-06-05 14:30:26 +01:00
Amulyam24
b15a5fbe36 kata-deploy: add the imports directive explicitly if expected but not found
For containerd v2.2+, the flow assumes that the imports directive is already present.
Check for it explicitly and add it if it doesn't exist.
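
A minimal sketch of the check-then-add step. It assumes a plain-text scan of the top-level keys is sufficient; a real implementation would likely use a TOML parser. The drop-in path in the usage is a hypothetical placeholder:

```rust
/// True if a top-level `imports = ...` key appears before the first table header.
fn has_imports(config: &str) -> bool {
    for line in config.lines() {
        let line = line.trim_start();
        if line.starts_with('[') {
            // Keys after the first [table] header are not top-level.
            return false;
        }
        if let Some(rest) = line.strip_prefix("imports") {
            // Require `=` next, so keys like `imports_foo` do not match.
            if rest.trim_start().starts_with('=') {
                return true;
            }
        }
    }
    false
}

/// Ensure the config has a top-level imports directive (path assumed unquoted-safe).
pub fn ensure_imports(config: &str, drop_in: &str) -> String {
    if has_imports(config) {
        return config.to_string();
    }
    // Top-level TOML keys must precede the first [table], so prepend.
    format!("imports = [\"{}\"]\n{}", drop_in, config)
}
```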

Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
2026-06-05 18:47:07 +05:30
Mikko Ylinen
013e901f1b tests: re-enable initdata tests for qemu-tdx
The CoCo initdata tests for signature verification and authenticated registry
never worked on qemu-tdx, so they have been disabled since.

Add them back now that all necessary fixes are in place.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-06-05 16:04:05 +03:00
Mikko Ylinen
9313e336b5 tests: set image.image_pull_proxy for CDH initdata
The initdata tests set the kernel arguments to "", which resets the
kernel arguments configured by the Helm install. However, the TDX
runner depends on the agent.https_proxy= kernel argument to pull
images.

For the initdata tests to work on TDX, the same setting needs to
be added to the CDH configuration via image.image_pull_proxy.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-06-05 16:04:05 +03:00
Mikko Ylinen
f3a0ef6a7c tests: use kubectl set to configure KBS env
There is no need to patch YAMLs locally. Also, set RUST_LOG=debug
and enable https_proxy for all TDX targets when the runner
has HTTPS_PROXY set.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-06-05 16:04:05 +03:00
stevenhorsman
7033d56e2c runtime: ignore false positive CRI-O vulnerabilities
Add osv-scanner ignores for GO-2025-3426 (CVE-2025-0750) and
GO-2025-3897 (CVE-2025-4437), which are false positives for
kata-containers.

The vulnerabilities have been open for 10 and 16 months and there
is no indication that the cri-o community has any intention of addressing
the situation. They also only affect the main CRI-O runtime code (log
management and user creation functions), but kata-containers only
imports github.com/cri-o/cri-o/pkg/annotations for string constant
definitions. The vulnerable code paths are not imported or used,
therefore we should just filter these out.

GO-2025-3426: Path traversal in UnMountPodLogs/LinkContainerLogs
GO-2025-3897: Memory exhaustion when reading /etc/passwd

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-05 10:08:06 +01:00
Steve Horsman
1624ebe362 Merge pull request #13135 from kata-containers/dependabot/cargo/tar-0.4.46
build(deps): bump tar from 0.4.45 to 0.4.46
2026-06-05 09:44:46 +01:00
stevenhorsman
b737ae48bf trace-forwarder: migrate from Jaeger to OTLP exporter
Migrate trace-forwarder from the deprecated opentelemetry-jaeger
exporter to the modern opentelemetry-otlp exporter.

This change remediates GHSA-2f9f-gq7v-9h6m (CVE-2026-43868), a
medium-severity vulnerability in Apache Thrift. The opentelemetry-jaeger
crate is no longer maintained and depends on vulnerable thrift versions
(0.13.0 and 0.16.0). The opentelemetry-otlp exporter does not use thrift
and is actively maintained.

Changes:
- Replace opentelemetry-jaeger with opentelemetry-otlp in Cargo.toml
- Update tracer.rs to use OTLP exporter instead of Jaeger exporter
- Replace --jaeger-host/--jaeger-port flags with --otlp-endpoint flag
- Update server.rs to use TracerProvider instead of SpanExporter
- Update documentation to reflect OTLP migration
- Add examples for common OTLP-compatible collectors

Breaking change: Users must update their trace-forwarder invocations
to use --otlp-endpoint instead of --jaeger-host and --jaeger-port.

Default endpoint: http://localhost:4317 (OTLP gRPC)

Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-06-04 19:39:47 +01:00
Dan Mihai
c78ccc2e9f Merge pull request #13088 from kata-containers/dependabot/cargo/openssl-0.10.80
build(deps): bump openssl from 0.10.79 to 0.10.80
2026-06-04 11:38:08 -07:00
Fabiano Fidêncio
743b0a4839 Merge pull request #13165 from stevenhorsman/bump-go-to-1.25.11
versions: bump golang to 1.25.11
2026-06-04 20:24:57 +02:00
Fabiano Fidêncio
cd21b7b607 Merge pull request #13156 from fidencio/topic/runtime-rs-shim-leftover-on-failure
runtime-rs: shut down shim daemon on a failed create
2026-06-04 20:09:28 +02:00
Fabiano Fidêncio
354b85784c Merge pull request #13166 from stevenhorsman/required-tests/remote-kata-monitor
ci: Remove kata-monitor test from required
2026-06-04 20:04:15 +02:00
stevenhorsman
81c7dde0ae ci: Remove kata-monitor test from required
The kata-monitor test is currently failing and runs a long-EoL
version of cri-o. This area is being actively reworked in #13107,
so remove the test from the required list for now; once the
kata-monitor tests are stable, we can re-add the new versions.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-04 14:40:17 +01:00
Fabiano Fidêncio
80e2473440 runtime-rs: shut down shim daemon on a failed create
When CreateContainer fails before the runtime instance is registered
(e.g. a hypervisor/cgroup error), no sandbox exists to drive the normal
teardown. containerd's follow-up Shutdown RPC then reaches
get_runtime_instance(), fails with "runtime not ready", and returns
before the service loop is ever told to stop. Because the shim ignores
SIGTERM, the containerd-shim-kata-v2 daemon is left running and orphaned.

Make the Shutdown RPC force the daemon to exit when there is no runtime
instance, emitting the same Action::Shutdown that sandbox.shutdown()
sends on the normal path. This guarantees the shim process is reaped
after a failed create instead of leaking.
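
A simplified model of the fix using a standard-library channel. `Shim`, `Sandbox` and `run_loop` are hypothetical stand-ins for the runtime-rs service types; only the `Action::Shutdown` name comes from the commit:

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Messages the service loop understands (reduced to the one that matters here).
#[derive(Debug, PartialEq)]
pub enum Action {
    Shutdown,
}

/// Stand-in for a registered sandbox.
pub struct Sandbox;

impl Sandbox {
    /// Normal teardown path: (VM/container teardown omitted) then stop the loop.
    pub fn shutdown(&self, events: &Sender<Action>) -> Result<(), String> {
        events.send(Action::Shutdown).map_err(|e| e.to_string())
    }
}

/// `runtime` is None when CreateContainer failed before registration.
pub struct Shim {
    pub runtime: Option<Sandbox>,
    pub events: Sender<Action>,
}

impl Shim {
    /// Shutdown RPC handler.
    pub fn shutdown(&self) -> Result<(), String> {
        match &self.runtime {
            Some(sandbox) => sandbox.shutdown(&self.events),
            // Previously this case returned "runtime not ready" and the loop kept
            // running; now it emits the same action so the daemon exits.
            None => self.events.send(Action::Shutdown).map_err(|e| e.to_string()),
        }
    }
}

/// Service loop: returns true once Action::Shutdown is received.
pub fn run_loop(rx: Receiver<Action>) -> bool {
    for action in rx {
        if action == Action::Shutdown {
            return true;
        }
    }
    false
}
```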

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
2026-06-04 14:12:01 +02:00
Fabiano Fidêncio
2a1ce7b8c4 Merge pull request #12539 from mythi/no-vcpu-hotplug
Disable CPU hotplug when confidential guest setting enabled
2026-06-04 10:56:52 +02:00
dependabot[bot]
4ab63d0a5d build(deps): bump tar from 0.4.45 to 0.4.46
Bumps [tar](https://github.com/composefs/tar-rs) from 0.4.45 to 0.4.46.
- [Release notes](https://github.com/composefs/tar-rs/releases)
- [Commits](https://github.com/composefs/tar-rs/compare/0.4.45...0.4.46)

---
updated-dependencies:
- dependency-name: tar
  dependency-version: 0.4.46
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-06-04 07:52:44 +00:00
dependabot[bot]
d155f1a4ab build(deps): bump openssl from 0.10.79 to 0.10.80
Bumps [openssl](https://github.com/rust-openssl/rust-openssl) from 0.10.79 to 0.10.80.
- [Release notes](https://github.com/rust-openssl/rust-openssl/releases)
- [Commits](https://github.com/rust-openssl/rust-openssl/compare/openssl-v0.10.79...openssl-v0.10.80)

---
updated-dependencies:
- dependency-name: openssl
  dependency-version: 0.10.80
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-06-04 07:51:50 +00:00
stevenhorsman
879912be25 versions: bump golang to 1.25.11
Bump the go version to resolve CVEs:
- GO-2026-5037
- GO-2026-5038
- GO-2026-5039

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-04 08:49:17 +01:00
Steve Horsman
53c1a627e4 Merge pull request #13143 from stevenhorsman/x/net-0.55-bump
bump golang.org/x dependencies
2026-06-03 16:46:08 +01:00
Aurélien Bombo
de5333f275 ci: remove Mariner annotations and use new config
This is a follow-up to #13126 where we forgot to remove this now-unused code.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-06-03 09:25:12 -05:00
Mikko Ylinen
018389cb22 tests: enable k8s-sandbox-vcpus-allocation.bats for tdx and coco-dev
k8s-sandbox-vcpus-allocation.bats was disabled for qemu-tdx due to
errors when moving to use "upstream" TDX KVM code. The failing test
is vcpus-less-than-one-with-no-limits pod which ends up getting
x86 default MaxCPU = 240 and erroring:

Number of hotpluggable cpus requested (240) exceeds the maximum cpus supported by KVM (224)

TDX max vCPUs are capped at the host's logical CPU count, so 240 is too many.

With the maxcpus logic fixed (maxcpus is not set at all) for configurations
where confidential guest is enabled, qemu-tdx can be enabled for
k8s-sandbox-vcpus-allocation.bats again.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-06-03 15:27:35 +03:00
Mikko Ylinen
e475d870fb runtime: qemu: don't set maxcpus when confidential guest is enabled
QEMU maxcpus enables CPU hotplug, but hotplug is unused when the
confidential guest setting is enabled.

Change the Go runtime code to skip setting maxcpus on the QEMU cmdline if CPU hotplug
is not needed.

Commit 07db945b09 built a relationship between kernel's cmdline nr_cpus and
the maxcpus config. Now that maxcpus is dropped for confidential guests, drop
nr_cpus from kernel commandline too. This hopefully helps with the reference
values computation too.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-06-03 15:27:35 +03:00
Mikko Ylinen
2e625d0bab runtime-rs: qemu: don't set maxcpus when confidential guest is enabled
QEMU maxcpus enables CPU hotplug, but hotplug is unused when the
confidential guest setting is enabled.

Change the runtime-rs code to skip setting maxcpus on the QEMU cmdline if CPU hotplug
is not needed.
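
A sketch of the rule for the `-smp` argument. `smp_arg` is a hypothetical helper, and real runtimes also pass topology fields (sockets, cores, threads) that are omitted here:

```rust
/// Build the value for QEMU's `-smp` option.
/// maxcpus only matters for CPU hotplug, so it is left out entirely when a
/// confidential guest is enabled (hotplug is not used there).
pub fn smp_arg(boot_cpus: u32, max_cpus: u32, confidential_guest: bool) -> String {
    if confidential_guest {
        format!("{}", boot_cpus)
    } else {
        format!("{},maxcpus={}", boot_cpus, max_cpus)
    }
}
```

Leaving maxcpus unset avoids the failure mode described above, where a default of 240 exceeds the vCPU limit KVM reports for TDX.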

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-06-03 15:27:35 +03:00
stevenhorsman
51eee428f4 testing/webhook: bump golang.org/x dependencies
Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys
from v0.43.0 to v0.44.0 to resolve CVEs:
- GO-2026-5024
- GO-2026-5025
- GO-2026-5026
- GO-2026-5027
- GO-2026-5028
- GO-2026-5029
- GO-2026-5030

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-06-03 09:56:54 +01:00
stevenhorsman
144ab161f1 tests: bump golang.org/x/sys dependency
Bump golang.org/x/sys from v0.19.0 to v0.44.0 to resolve CVE:
- GO-2026-5024

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-03 09:56:54 +01:00
stevenhorsman
46d704a7ab log-parser: bump golang.org/x/sys dependency
Bump golang.org/x/sys from v0.1.0 to v0.44.0 to resolve CVE:
- GO-2026-5024

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-03 09:56:54 +01:00
stevenhorsman
08ab789d9a csi-kata-directvolume: bump golang.org/x dependencies
Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys
from v0.43.0 to v0.44.0 to resolve CVEs:
- GO-2026-5024
- GO-2026-5025
- GO-2026-5026
- GO-2026-5027
- GO-2026-5028
- GO-2026-5029
- GO-2026-5030

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-03 09:56:54 +01:00
stevenhorsman
c0f549860e runtime: bump golang.org/x dependencies
Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys
from v0.43.0 to v0.44.0 to resolve CVEs:
- GO-2026-5024
- GO-2026-5025
- GO-2026-5026
- GO-2026-5027
- GO-2026-5028
- GO-2026-5029
- GO-2026-5030

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-03 09:56:54 +01:00
Fabiano Fidêncio
a2bb3f64b0 Merge pull request #12436 from mythi/tdx-updates-2026-3
runtime(-rs): tdx: use TDX QGS via unix-domain-socket by default
2026-06-03 08:50:26 +02:00
Fabiano Fidêncio
ecd9344dd1 Merge pull request #13144 from stevenhorsman/bump-rust-to-1.94
Bump rust to 1.94
2026-06-02 09:58:56 +02:00
Fabiano Fidêncio
230e01b04e Merge pull request #13126 from kata-containers/topic/runtimes-introduce-azure-specific-configs
runtime/runtime-rs: introduce Azure specific configs
2026-06-02 09:17:09 +02:00
stevenhorsman
b1928cc22f runtime-rs: run cargo fmt for Rust 1.94
Run cargo fmt on runtime-rs to ensure consistent formatting
with Rust 1.94 toolchain.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-01 17:32:06 +01:00