Commit Graph

18497 Commits

Author SHA1 Message Date
Fabiano Fidêncio
eda3bc6190 runtime-rs: wire GetDiagnosticData for termination logs
Add runtime-rs support for the GetDiagnosticData RPC. This extends
the Agent trait, types, and protocol translation layer with the new
request/response types.

During container stop, when shared_fs is "none" and the
terminationMessagePolicy annotation is "File", the runtime copies
the termination log from the guest via GetDiagnosticData. The call
is best-effort to avoid blocking container teardown.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-17 13:16:25 +02:00
Fabiano Fidêncio
411f8cf583 genpolicy: policy-gate GetDiagnosticDataRequest
Add policy rules for the new GetDiagnosticDataRequest RPC.
The request is denied by default in genpolicy-generated policies,
ensuring CoCo workloads do not expose diagnostic data unless
explicitly opted in via policy_data.request_defaults.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
2026-04-17 13:16:25 +02:00
Fabiano Fidêncio
64c139208f agent: add GetDiagnosticData RPC with termination log support
Add a new extensible GetDiagnosticData RPC that retrieves diagnostic
information from the guest VM. The request carries a log_type string
field to specify what kind of data is requested, and a container_id
field to identify the target container.

The first supported log_type is "termination_log", which reads the
Kubernetes termination message file from inside the guest. This is
needed for shared_fs=none configurations where the host cannot
directly access the guest filesystem.
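
As a rough illustration of the "extensible" part, the sketch below dispatches on log_type through a handler table, so a new kind of diagnostic data only needs a new map entry. It is written in Go for brevity and is not the agent's code: the struct, the handler table, and the placeholder reader are all hypothetical names.

```go
package main

import "fmt"

// GetDiagnosticDataRequest is a hypothetical Go mirror of the two request
// fields described above (log_type and container_id).
type GetDiagnosticDataRequest struct {
	ContainerID string
	LogType     string
}

// handler produces the diagnostic payload for one container.
type handler func(containerID string) ([]byte, error)

// handlers maps each supported log_type to its handler; supporting a new
// type means adding an entry here.
var handlers = map[string]handler{
	"termination_log": readTerminationLog,
}

// readTerminationLog is a placeholder. A real agent would read the
// termination message file from the guest filesystem instead.
func readTerminationLog(containerID string) ([]byte, error) {
	return []byte("termination message for " + containerID), nil
}

// getDiagnosticData rejects unknown log types instead of guessing.
func getDiagnosticData(req GetDiagnosticDataRequest) ([]byte, error) {
	h, ok := handlers[req.LogType]
	if !ok {
		return nil, fmt.Errorf("unsupported log_type %q", req.LogType)
	}
	return h(req.ContainerID)
}

func main() {
	data, err := getDiagnosticData(GetDiagnosticDataRequest{
		ContainerID: "c1",
		LogType:     "termination_log",
	})
	fmt.Println(string(data), err)

	_, err = getDiagnosticData(GetDiagnosticDataRequest{ContainerID: "c1", LogType: "bogus"})
	fmt.Println(err)
}
```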

On the Go runtime side, the container stop() path now calls
GetDiagnosticData to copy the termination message to the host
when running with NoSharedFS and the terminationMessagePolicy
annotation is set to "File". The call is best-effort: failures
are logged as warnings rather than blocking container teardown.
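
A minimal sketch of the best-effort behaviour described above, assuming a hypothetical diagnosticClient interface in place of the real agent client. The helper names and the plain-string configuration values are illustrative only; the point is that a failed call produces a warning and never an error that could abort teardown.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// diagnosticClient is a hypothetical stand-in for the agent client.
type diagnosticClient interface {
	GetDiagnosticData(containerID, logType string) ([]byte, error)
}

// shouldCopyTerminationLog mirrors the two conditions named above.
func shouldCopyTerminationLog(sharedFS, terminationMessagePolicy string) bool {
	return sharedFS == "none" && terminationMessagePolicy == "File"
}

// copyTerminationLog is best-effort: on failure it logs a warning and
// returns nil, so the caller's stop path can always continue.
func copyTerminationLog(c diagnosticClient, containerID, sharedFS, policy string) []byte {
	if !shouldCopyTerminationLog(sharedFS, policy) {
		return nil
	}
	data, err := c.GetDiagnosticData(containerID, "termination_log")
	if err != nil {
		log.Printf("warning: could not fetch termination log for %s: %v", containerID, err)
		return nil
	}
	return data
}

// failingClient simulates an unreachable agent.
type failingClient struct{}

func (failingClient) GetDiagnosticData(string, string) ([]byte, error) {
	return nil, errors.New("agent unreachable")
}

// okClient simulates a successful call.
type okClient struct{}

func (okClient) GetDiagnosticData(string, string) ([]byte, error) {
	return []byte("exit reason"), nil
}

func main() {
	fmt.Printf("%q\n", copyTerminationLog(okClient{}, "c1", "none", "File"))
	// The failure is logged as a warning; teardown would proceed regardless.
	fmt.Printf("%q\n", copyTerminationLog(failingClient{}, "c1", "none", "File"))
}
```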

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
2026-04-17 13:01:13 +02:00
Steve Horsman
1db12f8ccf Merge pull request #12812 from stevenhorsman/tee-test-refactor
ci: Refactor confidential TEE support
2026-04-17 11:12:13 +01:00
Steve Horsman
e4b3ba56dd Merge pull request #12855 from stevenhorsman/increase-stale-issues-frequency
ci: increase stale issues workflow frequency
2026-04-17 08:37:20 +01:00
stevenhorsman
1dc57c6cef ci: increase stale issues workflow frequency
Update the stale issues workflow to run more frequently:
- Weekdays: Every 4 hours (6x per day) at 00:00, 06:00, 12:00, 18:00 UTC
- Weekends: Every hour (24x per day)

Previously it ran once daily at midnight UTC. This change reduces the
time it takes to get through our backlog. The increase is largest at
the weekend, when less other CI is running that the workflow could
affect through GitHub API rate limiting.

Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-16 20:50:38 +01:00
Fabiano Fidêncio
d9128a58d9 Merge pull request #11611 from Xynnn007/docs-typo
docs: fix nerdctl guest image command
2026-04-16 15:36:37 +02:00
Fabiano Fidêncio
57ce3a1347 Merge pull request #11364 from kata-containers/dependabot/github_actions/tim-actions/wip-check-1.1.0
build(deps): bump tim-actions/w.i.p.-check from 1.0.0 to 1.1.0
2026-04-16 14:11:12 +02:00
Fabiano Fidêncio
78a8133112 Merge pull request #12242 from stevenhorsman/msrv-current-thoughts
doc: Add MSRV comments to toolchain guidance
2026-04-16 14:09:30 +02:00
Fabiano Fidêncio
88ce64819d Merge pull request #12726 from LandonTClipp/doc_annotations
docs: Add annotation config to doc site
2026-04-16 13:07:53 +02:00
stevenhorsman
05430d5690 doc: Add MSRV comments to toolchain guidance
Add some extra clarification about our current position on
MSRV.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-16 12:06:46 +01:00
Fabiano Fidêncio
beb06573fa Merge pull request #12790 from kata-containers/dependabot/cargo/src/tools/kata-ctl/tracing-0d2b5df27c
build(deps): bump tracing from 0.1.41 to 0.1.44 in /src/tools/kata-ctl in the tracing group across 1 directory
2026-04-16 12:52:05 +02:00
dependabot[bot]
c044403409 build(deps): bump tim-actions/wip-check from 1.0.0 to 1.1.0
Bumps [tim-actions/wip-check](https://github.com/tim-actions/wip-check) from 1.0.0 to 1.1.0.
- [Release notes](https://github.com/tim-actions/wip-check/releases)
- [Commits](1c2a1ca6c1...8c84f59872)

---
updated-dependencies:
- dependency-name: tim-actions/wip-check
  dependency-version: 1.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-16 10:48:41 +00:00
Xynnn007
1d806e0cfa docs: fix nerdctl guest image command
The image name is delivered via an annotation rather than a label in
nerdctl >= 2.0.

See the release note
https://github.com/containerd/nerdctl/releases/tag/v2.0.0

and PR
https://github.com/containerd/nerdctl/pull/2906

With an older version of nerdctl (< 2.0), --label still works.

Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
2026-04-16 11:34:03 +02:00
stevenhorsman
ff246f9538 ci: Remove deploy_snapshotter
Snapshotter deployment is a no-op now that
kata-deploy handles this, so clean up this code.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-16 09:21:04 +01:00
stevenhorsman
fce6415865 tests: Use hypervisor helpers
Use the new hypervisor helpers in our CI and test
code to add clarity and reduce duplication.

Note: `kubernetes_dir` is declared as readonly in
tests/integration/kubernetes/setup.sh which is sourced
by tests_common.sh, so we update it to only be set if
unset

Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-16 09:21:04 +01:00
stevenhorsman
2f3fec9727 tests: Add new hypervisor helper script
Add a pure shell script which the CI and integration tests can
use to check for different categories of runtime

Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-16 09:21:04 +01:00
Alex Lyn
c546b3c585 Merge pull request #12843 from microsoft/saul/build-opt
runtime-rs: add build optimization flags
2026-04-16 09:05:20 +08:00
Dan Mihai
c967b45996 Merge pull request #12838 from kata-containers/sprt/new-az-region
ci: Change Azure region to eastus2
2026-04-15 16:08:21 -07:00
Aurélien Bombo
1602e04b2d ci: Change Azure region to eastus2
I'm doing some bookkeeping in the Azure subscription that requires we move
from eastus to eastus2. This should have no user-facing impact.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-04-15 14:37:13 -05:00
Fabiano Fidêncio
19441e5515 Merge pull request #12844 from Apokleos/fix-warning
runtime-rs: Fix unformatted code in runtime-rs
2026-04-15 17:35:03 +02:00
Fabiano Fidêncio
d2fb22edbe Merge pull request #12847 from fidencio/topic/ci-adjust-timeout-for-k8s-tests
ci: k8s: Adjust timeout on free runners
2026-04-15 17:30:51 +02:00
Fabiano Fidêncio
8d6f1d6f34 ci: k8s: Adjust timeout on free runners
I've seen several cases of the CLH tests just being killed due to the
60-minute timeout. Let's bump it to 75 and see how it goes.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-15 17:09:30 +02:00
dependabot[bot]
bbb037e025 build(deps): bump the tracing group across 1 directory with 1 update
Bumps the tracing group with 1 update in the /src/tools/kata-ctl directory: [tracing](https://github.com/tokio-rs/tracing).


Updates `tracing` from 0.1.41 to 0.1.44
- [Release notes](https://github.com/tokio-rs/tracing/releases)
- [Commits](https://github.com/tokio-rs/tracing/compare/tracing-0.1.41...tracing-0.1.44)

---
updated-dependencies:
- dependency-name: tracing
  dependency-version: 0.1.44
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: tracing
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-15 15:06:48 +00:00
LandonTClipp
fd896e4e76 ci: Add kata-dictionary.txt to required_tests.yaml
This makes it so that changes to the kata-dictionary.txt file only trigger the
static checks to run.

Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
2026-04-15 14:48:01 +01:00
LandonTClipp
56cdfa831f docs: Add annotation config to doc site
Adding the pod annotation config to the doc site. A symlink is created
at docs/pod-annotations.md that points to
how-to/how-to-set-sandbox-config-kata.md so that the URL for this file will be
created at `/pod-annotations`. Also adding brief contributing guidelines and
how-tos for running the documentation site locally for previews.

Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
2026-04-15 14:48:01 +01:00
Alex Lyn
2f6319f130 runtime-rs: Fix unformatted code in runtime-rs
When building runtime-rs, one unformatted code block shows up, as below:
```
-        config
-            .hypervisor
-            .entry("qemu".to_owned())
-            .and_modify(|hv| {
-                hv.cpu_info.default_vcpus = default_vcpus;
-                hv.cpu_info.default_maxvcpus = default_maxvcpus;
-                hv.memory_info.default_memory = default_memory;
-                hv.memory_info.default_maxmemory = default_maxmemory;
-            });
+        config.hypervisor.entry("qemu".to_owned()).and_modify(|hv| {
+            hv.cpu_info.default_vcpus = default_vcpus;
+            hv.cpu_info.default_maxvcpus = default_maxvcpus;
+            hv.memory_info.default_memory = default_memory;
+            hv.memory_info.default_maxmemory = default_maxmemory;
+        });
```
Let's format it now.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-15 14:48:23 +02:00
Fabiano Fidêncio
57898de1fe Merge pull request #12845 from fidencio/topic/fix-signed-image-tests
tests: Update images used for signed tests
2026-04-15 14:47:58 +02:00
Fabiano Fidêncio
ba9a02897e genpolicy: make allowed cgroup v2 mount extras configurable
Newer kernels and containerd versions (>= 2.2.3) may add extra mount
options to /sys/fs/cgroup that genpolicy does not embed in the policy
(e.g. nsdelegate, memory_recursiveprot). This causes the Kata agent to
reject CreateContainerRequest with PERMISSION_DENIED because the
check_mount rules require an exact match.

Rather than hard-coding the allowed extras in Rego, make them
configurable via genpolicy-settings.json under
cluster_config.cgroup_mount_extras_allowed. The corresponding Rego rule
(check_mount 4) reads the list from policy_data.cluster_config and
allows only those named options beyond the policy-embedded set.
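
The relaxed check can be sketched as follows, in Go rather than Rego. This is an approximation under stated assumptions: option order is ignored, every policy-embedded option must still be present in the request, and the function name and the example option lists are hypothetical rather than taken from genpolicy.

```go
package main

import "fmt"

// toSet turns a slice of mount options into a set for membership tests.
func toSet(items []string) map[string]struct{} {
	set := make(map[string]struct{}, len(items))
	for _, item := range items {
		set[item] = struct{}{}
	}
	return set
}

// mountOptionsAllowed accepts a request when it carries every option the
// policy embeds and any additional option appears in extrasAllowed.
func mountOptionsAllowed(policyOpts, requestOpts, extrasAllowed []string) bool {
	policy := toSet(policyOpts)
	request := toSet(requestOpts)
	extras := toSet(extrasAllowed)

	for opt := range policy {
		if _, ok := request[opt]; !ok {
			return false // a policy-embedded option is missing
		}
	}
	for opt := range request {
		_, inPolicy := policy[opt]
		_, inExtras := extras[opt]
		if !inPolicy && !inExtras {
			return false // an extra option that is not on the allow list
		}
	}
	return true
}

func main() {
	policy := []string{"nosuid", "noexec", "nodev", "relatime", "ro"}
	extras := []string{"nsdelegate", "memory_recursiveprot"}

	withKnownExtras := append(append([]string{}, policy...), "nsdelegate", "memory_recursiveprot")
	withUnknownExtra := append(append([]string{}, policy...), "made_up_option")

	fmt.Println(mountOptionsAllowed(policy, withKnownExtras, extras))
	fmt.Println(mountOptionsAllowed(policy, withUnknownExtra, extras))
}
```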

To support this, cluster_config is now included in PolicyData so that
it gets serialized into the Rego policy_data object at generation time.

This follows the established pattern of keeping site- and
version-specific tunables in genpolicy-settings.json so they can be
overridden via JSON-Patch drop-ins without touching the Rego source.

A policy test case is added to verify that the default allowed extras
(nsdelegate, memory_recursiveprot) are accepted and that unknown extras
are rejected.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-15 13:24:21 +02:00
Fabiano Fidêncio
d29b77e953 tests: Update images used for signed tests
I've updated the images on the Confidential Containers side, in order to
add arm64 support, but I didn't realize it'd break tests not using
those.

Apologies!

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-15 12:11:37 +02:00
Saul Paredes
9404104aba runtime-rs: add build optimization flags
Enable the following optimizations when building runtime-rs in release mode:
- lto = true
- codegen-units = 1

Setting these reduces the binary size and improves performance at the cost of longer build times.

Without these flags:
- build time: 4m 55s
- binary size: 51 MB

With these flags:
- build time: 7m 21s
- binary size: 38 MB

Per https://github.com/kata-containers/kata-containers/issues/1125 and local experiments,
a smaller binary size leads to a smaller shim memory footprint.

- https://nnethercote.github.io/perf-book/build-configuration.html#codegen-units
- https://nnethercote.github.io/perf-book/build-configuration.html#link-time-optimization

Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
2026-04-14 15:52:38 -07:00
Fabiano Fidêncio
2d57b89857 Merge pull request #12805 from stevenhorsman/stale-bot-improvements
Stale bot improvements
2026-04-14 23:20:41 +02:00
Fabiano Fidêncio
672d3f2b0f workflows: Use docker buildx to build and push auth test image
skopeo copy with --override-arch fails with "authentication required"
during blob existence checks at the destination, regardless of how
credentials are provided (--dest-creds, --authfile, REGISTRY_AUTH_FILE).
This is a known issue with skopeo 1.13.x when copying from manifest
list sources.

Replace the skopeo/buildah approach with docker/build-push-action,
which is already proven in this repo (build-kubectl-image.yaml) and
handles multi-arch builds and Quay pushes reliably. The workflow now
builds a trivial FROM busybox image using buildx with QEMU emulation.

Fixes: b0abe5999 ("workflows: Add workflow to create auth registry test image")
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-14 22:44:30 +02:00
Fabiano Fidêncio
09ef32eaf1 Merge pull request #12827 from fidencio/topic/kata-deploy-custom-containerd-config
kata-deploy: Allow overriding containerd config path and file name
2026-04-14 22:23:33 +02:00
stevenhorsman
5ea30b33ae workflows: stale-issue: Increase operations-per-run
At the default rate of 30 operations per run, with over 1.5k issues, it
will take us over 50 days to do a pass of the issues we have, so increase
operations-per-run as suggested in the workflow by GitHub to
reduce this. Based on the stats of the latest run, we are not too
close to hitting the API rate limit:
```
Github API rate used: 32
Github API rate remaining: 3693; reset at: Thu Apr 09 2026 10:23:31 GMT+0000 (Coordinated Universal Time)
```
so I think this should be okay.
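
The 50-day estimate can be reproduced with a small calculation. This is a simplified sketch that assumes one run per day and one operation per issue, which may understate the real cost per issue; daysForFullPass is a hypothetical helper.

```go
package main

import "fmt"

// daysForFullPass returns how many days are needed to touch every issue
// once, rounding up. It assumes one operation per issue.
func daysForFullPass(issues, opsPerRun, runsPerDay int) int {
	perDay := opsPerRun * runsPerDay
	return (issues + perDay - 1) / perDay
}

func main() {
	// 1500 issues at 30 operations per run, one run per day.
	fmt.Println(daysForFullPass(1500, 30, 1))
}
```

With more than 1500 issues, the result exceeds 50, matching the "over 50 days" figure above.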

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-14 16:25:35 +01:00
stevenhorsman
a0359326e9 workflow: Bump stale action version
v9 is based on Node.js 20 which is deprecated, so update to the
latest to pick up a Node.js 24 version before GitHub removes Node 20.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-14 16:25:35 +01:00
Fabiano Fidêncio
0713b2d5d3 Merge pull request #12828 from kata-containers/dependabot/pip/docs/pillow-12.2.0
build(deps): bump pillow from 12.1.1 to 12.2.0 in /docs
2026-04-14 17:23:07 +02:00
Fabiano Fidêncio
661cfd7efa Merge pull request #12800 from kata-containers/dependabot/go_modules/src/runtime/go.opentelemetry.io/otel/sdk-1.43.0
build(deps): bump go.opentelemetry.io/otel/sdk from 1.40.0 to 1.43.0 in /src/runtime
2026-04-14 17:22:47 +02:00
dependabot[bot]
b54f02aa6c build(deps): bump pillow from 12.1.1 to 12.2.0 in /docs
Bumps [pillow](https://github.com/python-pillow/Pillow) from 12.1.1 to 12.2.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/12.1.1...12.2.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-version: 12.2.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-14 14:40:14 +00:00
Steve Horsman
8289aaf0c7 Merge pull request #12831 from kata-containers/topic/ci-move-out-of-nodejs-20
ci: Update GitHub Actions to Node.js 24 compatible versions
2026-04-14 14:59:03 +01:00
Fabiano Fidêncio
c087eb92ec ci: Update GitHub Actions to Node.js 24 compatible versions
Node.js 20 is deprecated on GitHub Actions runners, and actions will be
forced to run on Node.js 24 starting June 2nd, 2026. Update all affected
actions to versions that natively support Node.js 24:

- actions/upload-artifact: v4.6.2 -> v6.0.0
- actions/download-artifact: v4.3.0 -> v7.0.0
- docker/build-push-action: v5.4.0 -> v7.0.0
- docker/login-action: v3.4.0 -> v4.1.0
- docker/setup-buildx-action: v3.10.0 -> v4.0.0
- docker/setup-qemu-action: v3.6.0 -> v4.0.0
- geekyeggo/delete-artifact: v5.1.0 -> v6.0.0
- azure/login: v2.3.0 -> v3.0.0
- azure/setup-kubectl: v4.0.1 -> v5.0.0
- nick-fields/retry: v3.0.2 -> v4.0.0

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-14 15:48:45 +02:00
Fabiano Fidêncio
7e464f13a5 Merge pull request #12830 from fidencio/topic/workflows-create-auth-registry-image
workflows: Add workflow to create auth registry test image
2026-04-14 11:28:23 +02:00
Fabiano Fidêncio
b0abe59993 workflows: Add workflow to create auth registry test image
Add a manually-triggered workflow that builds and pushes a multi-arch
busybox-based image to quay.io/kata-containers/confidential-containers-auth
for use as an authenticated container image in CI tests.

The workflow uses skopeo to copy per-arch images and buildah to create
and push the multi-arch manifest.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-14 10:59:12 +02:00
Fabiano Fidêncio
b0a87880e7 Merge pull request #12826 from fidencio/topic/fix-concurrent-map-access-in-wait
runtime: Fix concurrent map read/write panic in Wait()
2026-04-14 08:48:52 +02:00
Fabiano Fidêncio
df1d02d3cf kata-deploy: Allow overriding containerd config path and file name
Add two new Helm values under `containerd`:
- `configDir`: overrides the host directory where the containerd
  config lives, taking precedence over the k8sDistribution-based
  auto-detection.
- `configFileName`: overrides the containerd config file name,
  propagated to the kata-deploy binary via the new
  CONTAINERD_CONFIG_FILE_NAME environment variable.

These are useful for non-standard containerd setups that don't match
any of the built-in k8sDistribution presets (k8s, k3s, rke2, k0s,
microk8s).

The config file name override only affects the default runtime branch
in get_containerd_paths(). The k0s/microk8s/k3s/rke2 branches are
left untouched since those runtimes have mandatory file naming
conventions.
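
The precedence rules above can be sketched as below. This is an illustration in Go, not the get_containerd_paths() code. The function names and parameters are hypothetical, and applying the configDir override to every distribution is an assumption drawn from the description.

```go
package main

import (
	"fmt"
	"path"
)

// hasFixedFileName reports whether a distribution mandates its own config
// file naming, per the list in the paragraph above.
func hasFixedFileName(distribution string) bool {
	switch distribution {
	case "k0s", "microk8s", "k3s", "rke2":
		return true
	}
	return false
}

// resolveConfigPath applies the overrides on top of auto-detected values:
// configDir (when set) replaces the detected directory, and configFileName
// (when set) replaces the detected name only on the default branch.
func resolveConfigPath(distribution, detectedDir, detectedName, configDir, configFileName string) string {
	dir := detectedDir
	if configDir != "" {
		dir = configDir
	}
	name := detectedName
	if configFileName != "" && !hasFixedFileName(distribution) {
		name = configFileName
	}
	return path.Join(dir, name)
}

func main() {
	// Default branch: both overrides may apply.
	fmt.Println(resolveConfigPath("k8s", "/etc/containerd", "config.toml", "", "custom.toml"))
	// Fixed-name distribution: only the directory override applies.
	// The detected values here are placeholders, not real k3s paths.
	fmt.Println(resolveConfigPath("k3s", "/detected/dir", "detected-name", "/opt/containerd", "custom.toml"))
}
```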

Also fixes a spurious leading space in the k3s containerdConfPath
branch.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-13 22:31:55 +02:00
Fabiano Fidêncio
b17dd2a902 runtime: Fix concurrent map read/write panic in Wait()
Wait() was releasing s.mu immediately after getContainer(), then
calling getExec() — which reads c.execs — without holding any lock.
Concurrent Exec() or Delete() calls that write to c.execs under s.mu
triggered a "concurrent map read and map write" fatal panic.

Add a dedicated sync.RWMutex to the container struct that protects the
execs map. getExec() now acquires a read lock internally, and all
writes go through new setExec()/deleteExec() helpers that acquire the
write lock. This keeps the locking concern local to the map and avoids
complicating the s.mu usage in Wait().
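
A minimal, self-contained sketch of the locking pattern described above. The container and exec types here are simplified stand-ins for the shim's real structs, and the execsMu field name is an assumption; only the getExec/setExec/deleteExec helper names come from this commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// exec is a simplified stand-in for the shim's per-exec state.
type exec struct {
	id string
}

// container guards its execs map with a dedicated lock, independent of
// any service-wide mutex.
type container struct {
	execsMu sync.RWMutex
	execs   map[string]*exec
}

func newContainer() *container {
	return &container{execs: make(map[string]*exec)}
}

// getExec takes the read lock, so concurrent readers do not block each other.
func (c *container) getExec(id string) (*exec, error) {
	c.execsMu.RLock()
	defer c.execsMu.RUnlock()
	e, ok := c.execs[id]
	if !ok {
		return nil, fmt.Errorf("exec %q not found", id)
	}
	return e, nil
}

// setExec and deleteExec are the only writers; both take the write lock.
func (c *container) setExec(id string, e *exec) {
	c.execsMu.Lock()
	defer c.execsMu.Unlock()
	c.execs[id] = e
}

func (c *container) deleteExec(id string) {
	c.execsMu.Lock()
	defer c.execsMu.Unlock()
	delete(c.execs, id)
}

func main() {
	c := newContainer()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(2)
		go func() { // writer
			defer wg.Done()
			c.setExec("e1", &exec{id: "e1"})
			c.deleteExec("e1")
		}()
		go func() { // reader
			defer wg.Done()
			_, _ = c.getExec("e1")
		}()
	}
	wg.Wait()
	fmt.Println("finished without a concurrent map access panic")
}
```

Running the sketch with `go run -race` exercises the same read/write interleaving; removing the lock calls should make the race detector report the unsynchronized map access.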

Add a regression test (TestConcurrentExecAccess) that exercises
concurrent getExec reads against setExec/deleteExec writes; this
reliably reproduces the panic under the race detector without the fix.

Fixes: #12825

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-13 21:14:28 +02:00
Fabiano Fidêncio
4c567a9c05 ci: Reduce TEE test scope for PR runs
TEE hardware (TDX, SEV-SNP) is very limited in CI. Running the full
test suite on every PR consumes these resources unnecessarily, since
most tests exercise what is already exercised by the -coco-dev CIs.

Introduce a `tee-test-scope` workflow input (small/full) and a new
`baremetal-small-tee` K8S_TEST_HOST_TYPE that runs only the 12 tests
that are TEE-relevant: attestation tests (encrypted/authenticated/
signed image pull, confidential attestation) plus policy and trusted
ephemeral data storage tests.

PR runs default to "small" (12 tests), nightly runs use "full" (59
tests), and manual dispatch offers a dropdown to choose.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-04-13 20:26:46 +02:00
dependabot[bot]
b303600283 build(deps): bump go.opentelemetry.io/otel/sdk in /src/runtime
Bumps [go.opentelemetry.io/otel/sdk](https://github.com/open-telemetry/opentelemetry-go) from 1.40.0 to 1.43.0.
- [Release notes](https://github.com/open-telemetry/opentelemetry-go/releases)
- [Changelog](https://github.com/open-telemetry/opentelemetry-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/open-telemetry/opentelemetry-go/compare/v1.40.0...v1.43.0)

---
updated-dependencies:
- dependency-name: go.opentelemetry.io/otel/sdk
  dependency-version: 1.43.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-13 10:36:44 +00:00
Fabiano Fidêncio
bd6377a038 Merge pull request #12614 from manuelh-dev/mahuber/image-signing-nim
tests: nvidia: Enforce image signing for NIM test
2026-04-11 14:48:04 +02:00
Fabiano Fidêncio
5eb7844183 Merge pull request #12430 from stevenhorsman/cargo-deny-static-checks
static-checks: Rework cargo deny check
2026-04-11 12:05:53 +02:00