- Swap out the hard-coded nginx registry and verisons for reading
the test image details for version.yaml
which can also ensure that the quay.io mirror is used
rather than the docker hub versions which can hit pull limits
- Try setting imagePullPoliycy Always to fix issues with the arm CI
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
When tests regress, the CI wait time can increase significantly
with the current kubectly_retry attempt logic. Thus, align with
other tests and remove kubectl_retry invocations. Instead, rely on
proper timeouts.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Containerd configuration syntax (`config.toml`) varies across versions,
requiring per-version logic for fields like `runtime`.
However, testing confirms that containerd LTS (1.7.x) and newer
versions fully support the v3 schema for the nydus remote snapshotter.
This commit changes the previous containerd v1 settings in `config.toml`.
Instead, it introduces a unified v3-style configuration for nydus, which
can be vailid for lts and active containerds.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
In order to have a better way to set things up using a toml editor, we
should take the containerd approach and actually have everything
uncommnted. This will help us to unify how we deal with such values in
the future from the kata-deploy POV.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
With issue 11777 being resolved, this commit enables openvpn
policy testing. The remaining work on the security policy
required to successfully run this test case was to enable UDP
ports for Service kinds and to use the mount path's last component
instead of the volume name to construct the expected storage
source path.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Remove the nvrc.smi.srs=1 parameter from the kernel command line.
In CC use cases, the attestation agent is expected to set the GPU
ready state. For the CUDA vectorAdd case where attestation agent
is not being used, we set the ready state by adding the kernel
command line parameter through an annotation.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add an allow-all policy for the CC GPU tests and ensure the init-data
device is being created (hypervisor annotations).
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The add_allow_all_policy_to_yaml in tests_common.sh needs some
improvements so that this function can support pod manifests with
different resource kinds. For now, moving the Secret definition
to the bottom so that we can create a default policy for the Pod.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The qemu-nvida-gpu handlers should not cause is_aks_cluster to
return 1. Otherwise, CI logic will assume these hypervisors run on
AKS hosts, see the following message in CI w/o this change:
INFO: Adapting common policy settings for AKS Hosts
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
We currently start a pod that does a `wget` to the KBS address, and
fails after 5 seconds.
By the time it fails and reports back, we can see that KBS is actually
running, but the workflow failed as the checker failed. :-/
Let's give it more time for the KBS to show up, and the flakeness should
go away.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Use the pod name variable so that kubectl wait finds the pod. Currently,
kubectl waits for nvidia-nim-llama-3-2-nv-embedqa-1b-v2, not for
nvidia-nim-llama-3-2-nv-embedqa-1b-v2-tee
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The person who introduced the check, someone named Fabiano Fidêncio,
forgot a `$` in a variable assignment.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Enable auto-generate policy on cbl-mariner Hosts for
qemu-coco-dev-runtime-rs if the user didn't specify an
AUTO_GENERATE_POLICY value.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We will re-enable this one later on once the changes to properly cold
plug multi GPUs are merged.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Those need to pull the models inside the guest, and the guest has 50% of
its memory "allowed" to be used as tmpfs, so, we gotta usa the RAM that
we have.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We cannot use the same format used for docker, as it includes username
and password, while what's expected when using Trustee does not.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Now that we've bumped Trustee to a version that supports the NVIDIA
remote verifier, let's re-enable the tests.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This makes the user experience better, as the admin can deploy Kata
Containers without having to download / set up any additional file.
Of course, if the admin wants something more specific, examples are
provided.
Tests and documentation are updated to reflect this change.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add three example values files to make it easier for users to try out
different Kata Containers configurations:
- try-kata.values.yaml: Enables all available shims
- try-kata-tee.values.yaml: Enables only TEE/confidential computing shims
- try-kata-nvidia-gpu.values.yaml: Enables only NVIDIA GPU shims
These files use the new structured configuration format and serve as
ready-to-use examples for common deployment scenarios.
Also update the README.md to document these example files and how to use them.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Update the helm_helper function in gha-run-k8s-common.sh to use the
new structured configuration format instead of the legacy env.* format.
All possible settings have been migrated to the structured format:
- HELM_DEBUG now sets root-level 'debug' boolean
- HELM_SHIMS now enables shims in structured format with automatic
architecture detection based on shim name
- HELM_DEFAULT_SHIM now sets per-architecture defaultShim mapping
- HELM_EXPERIMENTAL_SETUP_SNAPSHOTTER now sets snapshotter.setup array
- HELM_ALLOWED_HYPERVISOR_ANNOTATIONS now sets per-shim allowedHypervisorAnnotations
- HELM_SNAPSHOTTER_HANDLER_MAPPING now sets per-shim containerd.snapshotter
- HELM_AGENT_HTTPS_PROXY and HELM_AGENT_NO_PROXY now set per-shim agent proxy settings
- HELM_PULL_TYPE_MAPPING now sets per-shim forceGuestPull/guestPull settings
- HELM_EXPERIMENTAL_FORCE_GUEST_PULL now sets per-shim forceGuestPull/guestPull
The test helper automatically determines supported architectures for
each shim (e.g., qemu-se supports s390x, qemu-cca supports arm64,
qemu-snp/qemu-tdx support amd64, etc.) and applies per-shim settings
to the appropriate shims based on HELM_SHIMS.
Only HELM_HOST_OS remains in legacy env.* format as it doesn't have
a structured equivalent yet.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Re-enable AUTO_GENERATE_POLICY for coco-dev Hosts, unless PULL_TYPE is
"experimental-force-guest-pull", or the caller specified a different
value for AUTO_GENERATE_POLICY.
Auto-generated Policy has been disabled accidentally and recently for
these Hosts, by a GHA workflow change.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
The formatting wasn't quite right, so the `qemu-coco-dev-runtime-rs`
hypervisor wasn't skipping this test
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The teardown_common will print the description of the running pods, kill
them all and print the system's syslogs afterwards.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We have had those tests broken for months. It's time to get rid of
those.
NOTE that we could easily revert this commit and re-add those tests as
soon as we find someone to maintain and be responsible for such
integration.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This is a bump pre-release, which brings several fixes and some
improvements related to initData, and NVIDIA's remote verifier.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The test case designed to verify policy failures due to an "unexpected
capability" was misconfigured. It was using "CAP_SYS_CHROOT" as the
unexpected capability to be added.
This configuration was flawed for two main reasons:
1.Incorrect Syntax: Kubernetes Pod specs expect capability names without
the "CAP_" prefix (e.g., "SYS_CHROOT", not "CAP_SYS_CHROOT").
This made the test case's premise incorrect from a K8s API perspective.
2.Part of Default Set: "SYS_CHROOT" is already included in the
`default_caps` list for a standard container. Therefore, adding it would
not trigger a policy violation, defeating the purpose of the
"unexpected capability" test.
Furthermore, a related issue was observed where a malformed capability
like "CAP_CAP_SYS_CHROOT" was being generated, causing parsing failures
in the `oci-spec-rs` library. This was a symptom of incorrect string
manipulation when handling capabilities.
This commit corrects the test by selecting "SYS_NICE" as the unexpected
capability. "SYS_NICE" is a more suitable choice because:
- It is a valid Linux capability.
- It is relatively harmless.
- It is **not** part of the default capability set defined in
`genpolicy-settings.json`.
By using "SYS_NICE", the test now accurately simulates a scenario where
a Pod requests a legitimate but non-default capability, which the policy
(generated from a baseline Pod without this capability) should correctly
reject. This change fixes the test's logic and also resolves the
downstream `oci-spec-rs` parsing error by ensuring only valid capability
names are processed.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Thankfully there's only one piece that's still SNP specific (for the
supported TEEs). Let's adjust it so we can have an easy and smooth
execution when adding a TDX CI machine.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
There are several changes needed in order to get this test working with
CC, and yet we still are skipping it.
Basically, we need to:
* Pull an authenticated image inside the guest, which requires:
* Using Trustee to release the credential
* We still depend on a PR to be merged on Trustee side
* https://github.com/confidential-containers/trustee/pull/1035
* We still depend on a Trustee bump (including the PR above) on our
side
Apart from those changes, I ended up "duplicating" the tests by adding a
"-tee" version of those, which already have:
* The proper kbs annotations set up
* Dropped host mounts
* Increases the memory needed
Last but not least, as "bats" probably means "being a terrible script",
I had to re-arrange a few things otherwise the tests would not even run
due to bats-isms that I am sincerely not able to pin-point.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We added the tests using virtio-9p as we knew it'd require incremental
changes to be able to use any kind of guest-pull method.
Now, as in the coming commits we'll be actually ensuring that guest-pull
works and is in use, we can enforce the experimental_force_guest_pull
usage for the nvidia cases.
Note: We're using experimental_force_guest_pull instead of
nydus-snapshotter due to stability concerns with the snapshotter.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
It takes either a shim name or "", but we were treating this (thankfully
only in this specific file) as a boolean.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Adjust output to the setup_file and teardown_file behavior.
With this, we will be able to observe relevant logging rather than
adding to the output variable.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The new image reference has changed to mcr.microsoft.com/oss/v2/kubernetes/pause:3.6
from mcr.microsoft.com/oss/kubernetes/pause:3.6.
The new image uses by default UID=0, GID=0 while the older. The older image had:
UID=65535, GID=65535.
There is a new pause_container_id_policy field in genpolicy-settings.json, informing
genpolicy about the way AdditionalGids gets updated - "v1" for the older behavior
and "v2" for the newer AKS version:
- When using v1, the default value of AdditionalGids is {65535}.
- When using v2, the default value of AdditionalGids is {}.
UID=65535 and GID=65535 are still hard-coded by default in genpolicy-settings.json.
We might be able to remove/ignore these fields in the future, if we'll stop relying
on policy::KataSpec::get_process_fields to use these fields.
A new CI function adapt_common_policy_settings_for_aks() changes the pause container
UID, GID, pause_container_id_policy, and image ref settings values when testing on
AKS Hosts - i.e., when testing coco-dev or mariner Hosts.
The genpolicy workarounds for the unexpected behavior with guest pull enabled have
been improved to use the current container's GID instead of hard-coding GID=0 as the
guest pull default. Also, AdditionalGids gets updated when the current container's GID is
changing, instead of always changing the AdditionalGids at the very end of
policy::AgentPolicy::get_container_process(), when the relevant evolution of the GID
value was no longer available.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Same deal as the previous commut, just enabling the tests here, with the
same list of improvements that we will need to go through in order to
get is working in a perfect way.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
While the primary goal of this change is to detect regressions to the
NVIDIA SNP GPU scenario, various improvements to reflect a more
realistic CC setting are planned in subsequent changes, such as:
* moving away from the overlayfs snapshotter
* disabling filesystem sharing
* applying a pod security policy
* activating the GPUs only after attestation
* using a refined approach for GPU cold-plugging without requiring
annotations
* revisiting pod timeout and overhead parameters (the podOverhead value
was increased due to CUDA vectorAdd requiring about 6Gi of
podOverhead, as well as the inference and embedqa requiring at least
12Gi, respectively, 14Gi of podOverhead to run without invoking the
host's oom-killer. We will revisit this aspect after addressing
points 1. and 2.)
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Since there's something wrong with the cpu hotplug
on qemu-runtime-rs, thus disable this test temporally.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>