The cri-containerd integration tests fail with the shim sandboxer when
running non-runc runtimes (e.g. Kata). The root cause is a bug in
containerd's client/task.go: getRuncOptions() unconditionally tries to
unmarshal the container's stored runtimeOptions into containerd.runc.v1.Options,
but Kata containers store runtimeoptions.v1.Options. This causes:
failed to create containerd task: failed to get runtime v2 options:
can't unmarshal type "runtimeoptions.v1.Options" to output
"containerd.runc.v1.Options"
A fix has been submitted upstream. Until it is merged and released,
clone containerd from the fork that carries the fix so that
`make cri-integration` (which builds and runs its own containerd daemon)
picks up the corrected binary.
TODO: revert once the fix is in an upstream containerd release and
versions.yaml is updated accordingly.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Creating a new container in the same sandbox VM after the previous
container has exited and been removed has never been supported by
kata-containers (neither with the go-based nor the rust-based runtime).
When the last container is removed the kata VM shuts down, so any
attempt to start a new container in the same sandbox fails.
This test exercises a use-case kata does not currently support, and it
has never been part of the passing list for good reason. Mark it
explicitly excluded with a comment so it is clear this is a deliberate
omission rather than an oversight.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The check_daemon_setup function verifies that containerd + runc are
functional before the real kata tests run. Using the shim sandboxer
for this runc check hits a known containerd bug where the OCI spec is
not populated before NewBundle is called, so config.json is never
written and containerd-shim-runc-v2 fails at startup.
See https://github.com/containerd/containerd/issues/11640
The sandboxer choice is irrelevant for this sanity check, so use
podsandbox which works correctly with runc.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
In some CI runs, `mktemp` generates random characters that accidentally
form file extensions like `.cSV` or `.Xml`. This triggers downstream
parsing errors because the YAML content is misidentified as CSV/XML.
The resulting errors look like this:
```
'/tmp/bats-run-KodZEA/.../pod-guest-pull-in-trusted-storage.yaml.in.cSV':
...
```
This commit fixes the issue by:
1. Moving the `XXXXXX` placeholder before the `.yaml` extension.
2. Ensuring the generated file always ends in `.yaml`.
This prevents format misidentification while maintaining filename
uniqueness and security.
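
A minimal sketch of the template change, assuming GNU mktemp (which treats
everything after the last run of X's as a fixed suffix). The helper name
make_temp_yaml and the pod-example prefix are hypothetical, not taken from
the test code:

```bash
# Hypothetical helper: create a unique temp file whose name always ends in .yaml.
# The random XXXXXX part sits before the extension, so it can never be read
# as a file extension itself.
make_temp_yaml() {
    mktemp "${TMPDIR:-/tmp}/$1.XXXXXX.yaml"
}

tmp_yaml="$(make_temp_yaml pod-example)"
echo "${tmp_yaml}"
rm -f "${tmp_yaml}"
```

On mktemp implementations that require the X's to be trailing, this template
form may not work and a different approach would be needed.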
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Create local block storage (loop device, StorageClass, PV) in the test
only when the cluster has no default StorageClass, matching the approach
used in k8s-volume.bats. Set our StorageClass as default so the PVC
binds to our PV; tear it down after the test.
When a default already exists (e.g. AKS), skip creation and cleanup so
we do not change the cluster's default storage class.
Fixes: #9846
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Allow genpolicy -j to accept a directory instead of a single file.
When given a directory, genpolicy loads genpolicy-settings.json from it
and applies all genpolicy-settings.d/*.json files (sorted by name) as
RFC 6902 JSON Patches. This gives precise control over settings with
explicit operations (add, remove, replace, move, copy, test), including
array index manipulation and assertions.
Ship composable drop-in examples in drop-in-examples/:
- 10-* files set platform base settings (non-CoCo, AKS, CBL-Mariner)
- 20-* files overlay specific adjustments (OCI version, guest pull)
Users copy the combination they need into genpolicy-settings.d/.
Replace the old adapt_common_policy_settings_* jq-patching functions
in tests_common.sh with install_genpolicy_drop_ins(), which copies the
right combination of 10-* and 20-* drop-ins for the CI scenario.
Tests still generate 99-test-overrides.json on the fly for per-test
request/exec overrides.
Packaging installs 10-* and 20-* drop-ins from drop-in-examples/ into
the tarball; the default genpolicy-settings.d/ is left empty.
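
The by-name ordering means a 99-* file is applied after the 10-* and 20-*
drop-ins. The sketch below only mimics that ordering with a shell glob (which
expands sorted by name); genpolicy does its own sorting. The 10-* and 20-* file
names used here are made up for illustration, and list_drop_ins is a
hypothetical helper:

```bash
# Build a throwaway settings directory with empty JSON Patch documents.
settings_dir="$(mktemp -d)"
mkdir -p "${settings_dir}/genpolicy-settings.d"
for name in 99-test-overrides.json 20-example-overlay.json 10-example-base.json; do
    echo '[]' > "${settings_dir}/genpolicy-settings.d/${name}"
done

# Print drop-in file names in name order, one per line.
list_drop_ins() {
    for f in "$1"/genpolicy-settings.d/*.json; do
        if [ -e "${f}" ]; then
            basename "${f}"
        fi
    done
}

list_drop_ins "${settings_dir}"
rm -rf "${settings_dir}"
```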
Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
kata-deploy's SIGTERM cleanup restarts the CRI runtime, which on
k3s/rke2 takes down the API server temporarily. The helm uninstall
may complete with errors, and the next test suite would start with
a dead API. Add a wait loop after uninstall to ensure the API is
available before proceeding.
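
A wait loop of this kind could look roughly like the sketch below. The helper
name wait_until, the timeout values, and the probe command in the usage
comment are assumptions for illustration, not the code added by this commit:

```bash
# Retry a command until it succeeds or the timeout (in seconds) expires.
# Usage: wait_until <timeout> <interval> <command...>
wait_until() {
    timeout_s="$1"
    interval_s="$2"
    shift 2
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "${elapsed}" -ge "${timeout_s}" ]; then
            echo "timed out after ${timeout_s}s waiting for: $*" >&2
            return 1
        fi
        sleep "${interval_s}"
        elapsed=$((elapsed + interval_s))
    done
}

# Assumed usage after the uninstall step (values are illustrative):
# wait_until 300 5 kubectl get --raw /readyz
```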
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
kata-deploy restarts the CRI runtime during install, which can cause
the kata-deploy pod to be killed and recreated by the DaemonSet
controller. On k3s and rke2 in particular, the restart can take
several minutes. Increase the default timeout from 600s (10m) to
900s (15m) to accommodate this.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
If the rendered config-v3.toml does not import the drop-in dir, write
the full k3s ContainerdConfigTemplateV3 (with a hardcoded import path) so
kata-deploy can use the drop-in dir.
This allows us to test with K3s/RKE2 before my patch there gets
released.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The default is `info`, and that actually overrides whatever is set in
the drop-in file used by k0s.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Align with kubeadm and bare metal by setting the kubelet CRI
runtime-request-timeout to 600s in deploy functions for k0s (worker
profile), k3s (--kubelet-arg), rke2 (config.yaml), and microk8s
(args/kubelet + restart).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
k3s and rke2 use containerd that expects OCI bundle 1.2.1; otherwise
autogenerated policy tests fail. Add adapt_common_policy_settings_for_k3s_rke2
and call it from adapt_common_policy_settings when KUBERNETES is k3s or rke2.
Tested with k3s v1.34.4+k3s1, rke2 v1.34.4+rke2r1.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The script to install Go is a legacy from Jenkins, so delete it to
reduce the amount of code in our repo.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Reinstate mariner host testing - including the Agent Policy tests on
these hosts - now that a new CLH version brought in the required fixes.
This reverts commit ea53779b90.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Trustee now returns the binary SNP TCB claims as hex rather than base64
(for consistency with other platforms). Fortunately, the sev-snp-measure
tool has a flag for setting the output type of the launch digest.
I think hex is the default, but let's keep the flag here to be explicit.
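
To compare a base64-encoded value with a hex one, both have to be rendered in
the same encoding. The sketch below shows one way to do that with standard
tools; b64_to_hex is a hypothetical helper and is not part of the test code or
of sev-snp-measure:

```bash
# Decode a base64 string and print the raw bytes as lowercase hex.
b64_to_hex() {
    printf '%s' "$1" | base64 -d | od -An -v -tx1 | tr -d ' \n'
}

# "AQID" is the base64 encoding of the three bytes 01 02 03.
b64_to_hex "AQID"
echo
```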
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
Fix `ST1005: error strings should not be capitalized (staticcheck)`.
This complies with Go conventions: errors are normally appended to other
messages, so there would be a spurious capitalisation in the middle of the message.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
strings.ReplaceAll was introduced in Go 1.12 as a more readable and self-documenting way to say "replace everything".
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add a setting to skip the
`ST1005: error strings should not be capitalized (staticcheck)`
rule to avoid changing our existing error strings.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Since runtime-rs added support for virtio-blk-ccw on s390x in #12531,
the assertion in k8s-guest-pull-image.bats should be generalized
to apply to all hypervisors ending with `-runtime-rs`.
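
A suffix match of this kind can be sketched as follows. The helper name
is_runtime_rs is hypothetical, and reading the hypervisor from a
KATA_HYPERVISOR variable is an assumption about the test environment:

```bash
# Succeed for any hypervisor name that ends in -runtime-rs.
is_runtime_rs() {
    case "$1" in
        *-runtime-rs) return 0 ;;
        *) return 1 ;;
    esac
}

# KATA_HYPERVISOR is assumed here; "qemu" is only a fallback for the demo.
if is_runtime_rs "${KATA_HYPERVISOR:-qemu}"; then
    echo "runtime-rs hypervisor: apply the generalized assertion"
else
    echo "other hypervisor: keep the existing assertion"
fi
```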
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Explicitly specify in the security context the runAsUser, runAsGroup,
and supplementalGroups values that are embedded in the image's
/etc/group file. With this, both genpolicy and containerd, which lack
image introspection capabilities when nydus guest-pull is used, use the
same values for user, group, and additional group IDs at policy
generation time and at runtime when the OCI spec is passed.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Extend the timeout for the assert_pod_fail function call for the
test case "Test we cannot pull a large image that pull time exceeds
createcontainer timeout inside the guest" when the experimental
force guest-pull method is being used. In this method, the image is
first pulled on the host before creating the pod sandbox. Because
image pull times can suddenly spike, the assert_pod_fail function
could time out before the image is even pulled on the host.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Set KubeletConfiguration runtimeRequestTimeout to 600s mainly for CoCo
(Confidential Containers) tests, so container creation (attestation,
policy, image pull, VM start) does not hit the default CRI timeout.
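
For reference, the setting corresponds to a KubeletConfiguration fragment like
the one written by this sketch. The helper name and the output path are
hypothetical, and how the fragment is passed to the cluster setup is not shown:

```bash
# Write a KubeletConfiguration fragment that raises the CRI request timeout.
# Usage: write_kubelet_timeout_config <output-file>
write_kubelet_timeout_config() {
    cat > "$1" <<EOF
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
runtimeRequestTimeout: "600s"
EOF
}

out="$(mktemp)"
write_kubelet_timeout_config "${out}"
cat "${out}"
rm -f "${out}"
```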
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Treat waiting.reason RunContainerError and terminated.reason StartError/Error
as container failure, so tests that expect guest image-pull failure (e.g.
wrong credentials) pass when the container fails with those states instead
of only BackOff.
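
The classification described above can be sketched as a small helper. The
function name is_container_failure is hypothetical, and extracting the state
and reason from the pod status is left out:

```bash
# Usage: is_container_failure <state> <reason>
# <state> is "waiting" or "terminated"; <reason> is the matching reason field.
is_container_failure() {
    case "$1:$2" in
        waiting:BackOff|waiting:RunContainerError) return 0 ;;
        terminated:StartError|terminated:Error) return 0 ;;
        *) return 1 ;;
    esac
}

if is_container_failure waiting RunContainerError; then
    echo "container failed"
fi
```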
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Run all CoCo non-TEE variants in a single job on the free runner with an
explicit environment matrix (vmm, snapshotter, pull_type, kbs,
containerd_version).
Here we're testing CoCo only with the "active" version of containerd.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
We were running most of the k8s integration tests on AKS. The ones that
don't actually depend on AKS's environment now run on normal
ubuntu-24.04 GitHub runners instead: we bring up a kubeadm cluster
there, test with both containerd lts and active, and skip attestation
tests since those runtimes don't need them. AKS is left only for the
jobs that do depend on it.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Disable mariner host testing in CI, and auto-generated policy testing
for the temporary replacements of these hosts (based on ubuntu), to work
around missing:
1. cloud-hypervisor/cloud-hypervisor@0a5e79a, that will allow Kata
in the future to disable the nested property of guest VPs. Nested
is enabled by default and doesn't work yet with mariner's MSHV.
2. cloud-hypervisor/cloud-hypervisor@bf6f0f8, exposed by the large
ttrpc replies intentionally produced by the Kata CI Policy tests.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Make `az aks create` command easier to change when needed, by moving the
arguments specific to mariner nodes onto a separate line of this script.
This change also removes the need for `shellcheck disable=SC2046` here.
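
The general pattern is to build the argument list step by step, so optional
arguments sit on their own line and nothing relies on unquoted command
substitution (the pattern SC2046 warns about). In this sketch create_cluster
is a hypothetical stand-in that only prints its arguments, and the flag names
and values are illustrative rather than the ones the script passes:

```bash
# Hypothetical stand-in for the cluster-creation command: print one argument per line.
create_cluster() {
    printf '%s\n' "$@"
}

# Usage: build_and_create <host-os>
build_and_create() {
    host_os="$1"
    # Common arguments (illustrative).
    set -- --name example-cluster --node-count 1
    if [ "${host_os}" = "cbl-mariner" ]; then
        # Host-specific arguments live on their own line.
        set -- "$@" --os-sku AzureLinux
    fi
    create_cluster "$@"
}

build_and_create cbl-mariner
```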
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Issue 10838 is resolved by the prior commit, which enables the -m
option of the kernel build for confidential guests that do not use
the measured rootfs, and by commit 976df22119, which ensures the
relevant user space packages are present.
Not every confidential guest has the measured rootfs option
enabled. In contrast, every confidential guest is assumed to
support CDH's secure storage features.
We also adjust test timeouts to account for occasional spikes on
our bare metal runners (e.g., SNP, TDX, s390x).
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This test uses YAML files from a different directory than the other
k8s CI tests, so annotations have to be added into these separate
files.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>