The Go runtime's CoCo dev config uses dial_timeout = 45s, but all
runtime-rs confidential VM configs had reconnect_timeout_ms set to
3000ms (3s) or 5000ms (SE). This is too short for confidential VMs,
especially on arm64 where UEFI firmware (AAVMF) adds significant
boot time on top of the measured boot process, causing ECONNRESET
errors on the vsock connection before the agent is ready.
Bump reconnect_timeout_ms to 45000ms across all confidential VM
configs (coco-dev, SNP, TDX, SE) to match the Go runtime.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add qemu-coco-dev-runtime-rs to the arm64 k8s test matrix so that the
CoCo non-TEE configuration is exercised on aarch64 runners.
Also enable auto-generated policy for qemu-coco-dev on aarch64 (matching
the existing x86_64 behavior) and register the new job as a required
gatekeeper check.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add aarch64/arm64 to the list of supported architectures for
qemu-coco-dev and qemu-coco-dev-runtime-rs shims across kata-deploy
configuration, Helm chart values, and test helper scripts.
Note that guest-components and the related build dependencies are not
yet wired for arm64 in these configurations; those will be addressed
separately.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Build coco-guest-components, pause-image, and rootfs-image-confidential
for arm64, which are required by qemu-coco-dev-runtime-rs.
Enable MEASURED_ROOTFS on the arm64 shim-v2 build, add the aarch64 case
to install_kernel() so the default kernel is built as a unified kernel
(with confidential guest support, like x86_64), and adjust the kernel
install naming so only CCA builds get the -confidential suffix.
Also wire rootfs-image-confidential-tarball into the aarch64 local-build
Makefile.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
The arm64 build workflow was missing the tools build entirely.
Add build-tools-asset and create-kata-tools-tarball jobs mirroring
the amd64 workflow so that genpolicy and the other tools are
available for coco-dev tests that need auto-generated policy.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add kernel_verity_params to the qemu-coco-dev-runtime-rs configuration
so the runtime can assemble dm-verity kernel parameters, and remove the
test skip that was disabling measured rootfs tests for this hypervisor.
Fixes: #12851
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add BATS tests for the GetDiagnosticData termination log feature on
CoCo platforms where shared_fs=none.
Three test cases cover:
- Successful exit (exit 0): termination message is propagated when
GetDiagnosticDataRequest is allowed by policy.
- Failed exit (exit 1): termination message is propagated when
GetDiagnosticDataRequest is allowed by policy.
- Policy denied: with default CoCo policy (GetDiagnosticDataRequest
is false), the container stops cleanly but no termination message
is propagated (best-effort behavior).
Tests are skipped on non-CoCo platforms where shared_fs is not "none".
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add runtime-rs support for the GetDiagnosticData RPC. This extends
the Agent trait, types, and protocol translation layer with the new
request/response types.
During container stop, when shared_fs is "none" and the
terminationMessagePolicy annotation is "File", the runtime copies
the termination log from the guest via GetDiagnosticData. The call
is best-effort to avoid blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add policy rules for the new GetDiagnosticDataRequest RPC.
The request is denied by default in genpolicy-generated policies,
ensuring CoCo workloads do not expose diagnostic data unless
explicitly opted in via policy_data.request_defaults.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
Add a new extensible GetDiagnosticData RPC that retrieves diagnostic
information from the guest VM. The request carries a log_type string
field to specify what kind of data is requested, and a container_id
field to identify the target container.
The first supported log_type is "termination_log", which reads the
Kubernetes termination message file from inside the guest. This is
needed for shared_fs=none configurations where the host cannot
directly access the guest filesystem.
On the Go runtime side, the container stop() path now calls
GetDiagnosticData to copy the termination message to the host
when running with NoSharedFS and the terminationMessagePolicy
annotation is set to "File". The call is best-effort: failures
are logged as warnings rather than blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
Update the stale issues workflow to run more frequently:
- Weekdays: Every 4 hours (6x per day) at 00:00, 06:00, 12:00, 18:00 UTC
- Weekends: Every hour (24x per day)
Previously ran once daily at midnight UTC. This change reduces the time
it will take for us to get through our backlog, particularly increasing
the runs at the weekend, when we should have less other CI running,
which it could impact due to GH API rate limiting.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Utilise the new hypervisor helpers in our CI and test
code to help add clarity and reduce duplication
Note: `kubernetes_dir` is declared as readonly in
tests/integration/kubernetes/setup.sh which is sourced
by tests_common.sh, so we update it to only be set if
unset
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add a pure shell script which the CI and integration tests can
use to check for different categories of runtime
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
I'm doing some bookkeeping in the Azure subscription that requires we move
from eastus to eastus2. This should have no user-facing impact.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
I've seen several cases of the CLH tests just being killed due to the 60
minutes timeout. Let's bump it to 75 and see how it goes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Adding the pod annotation config to the doc site. A symlink is created
at docs/pod-annotations.md that points to
how-to/how-to-set-sandbox-config-kata.md so that the URL for this file will be
created at `/pod-annotations`. Also adding brief contrbuting guidelines and
how-to's for running the documentation site locally for local previews.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
Newer kernels and containerd versions (>= 2.2.3) may add extra mount
options to /sys/fs/cgroup that genpolicy does not embed in the policy
(e.g. nsdelegate, memory_recursiveprot). This causes the Kata agent to
reject CreateContainerRequest with PERMISSION_DENIED because the
check_mount rules require an exact match.
Rather than hard-coding the allowed extras in Rego, make them
configurable via genpolicy-settings.json under
cluster_config.cgroup_mount_extras_allowed. The corresponding Rego rule
(check_mount 4) reads the list from policy_data.cluster_config and
allows only those named options beyond the policy-embedded set.
To support this, cluster_config is now included in PolicyData so that
it gets serialized into the Rego policy_data object at generation time.
This follows the established pattern of keeping site- and
version-specific tunables in genpolicy-settings.json so they can be
overridden via JSON-Patch drop-ins without touching the Rego source.
A policy test case is added to verify that the default allowed extras
(nsdelegate, memory_recursiveprot) are accepted and that unknown extras
are rejected.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
I've updaed the images on the Confidential Containers side, in order to
add arm64 support, but I didn't realize it'd break tests not using
those.
Apologies!
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
skopeo copy with --override-arch fails with "authentication required"
during blob existence checks at the destination, regardless of how
credentials are provided (--dest-creds, --authfile, REGISTRY_AUTH_FILE).
This is a known issue with skopeo 1.13.x when copying from manifest
list sources.
Replace the skopeo/buildah approach with docker/build-push-action,
which is already proven in this repo (build-kubectl-image.yaml) and
handles multi-arch builds and Quay pushes reliably. The workflow now
builds a trivial FROM busybox image using buildx with QEMU emulation.
Fixes: b0abe5999 ("workflows: Add workflow to create auth registry test image")
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
At a rate of default 30 per run, with over 1.5k issues, it will take
us over 50 days to do a pass of the issues we have, so increase
operations-per-run as suggested in the workflow by github to
reduce this. Based on the stats of the latest run, we are not too
close to hitting the API rate limit:
```
Github API rate used: 32
Github API rate remaining: 3693; reset at: Thu Apr 09 2026 10:23:31 GMT+0000 (Coordinated Universal Time)
```
so I think this should be okay.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
v9 is based on Node.js 20 which is deprecated, so update to the
latest to pick up a Node.js 24 version before Github removes Node 20
Signed-off-by: stevenhorsman <steven@uk.ibm.com>