We're doing this because the mariner image uses, by default, the
`unified` cgroups hierarchy (aka cgroupsv2).
In order to keep the same initrd behaviour for mariner, let's enforce
that `SYSTEMD_CGROUP_ENABLE_LEGACY_FORCE=1
systemd.legacy_systemd_cgroup_controller=yes
systemd.unified_cgroup_hierarchy=0` is passed to the kernel cmdline, at
least for now.
Other tests that set `kernel_params` are not running on mariner,
so we're safe taking this path as part of this PR.
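A minimal sketch of how the extra parameters could be appended to an existing kernel command line. The helper name `append_cgroup_v1_params` is hypothetical and not taken from the test scripts; only the three parameters come from this commit message.

```shell
# Parameters named in the commit message that force the legacy (v1) hierarchy.
CGROUP_V1_PARAMS="SYSTEMD_CGROUP_ENABLE_LEGACY_FORCE=1 systemd.legacy_systemd_cgroup_controller=yes systemd.unified_cgroup_hierarchy=0"

# Hypothetical helper: print the given kernel params with the cgroup v1
# parameters appended, avoiding a leading space when the input is empty.
append_cgroup_v1_params() {
    local current="${1:-}"
    if [ -z "${current}" ]; then
        echo "${CGROUP_V1_PARAMS}"
    else
        echo "${current} ${CGROUP_V1_PARAMS}"
    fi
}
```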
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
As an image has been added for mariner as part of the commit 63c1f81c2,
let's start using it in the CI, instead of using the initrd.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
In https://github.com/confidential-containers/trustee/pull/521
the overlays logic was modified to add non-SE
s390x support and simplify non-ibm-se platforms.
We need to update the logic in `kbs_k8s_deploy`
to match, and can now remove the dummying of `IBM_SE_CREDS_DIR`
for non-SE.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The behavior of Kata CI doesn't change.
For local testing using kubernetes/gha-run.sh and AUTO_GENERATE_POLICY=yes:
1. Before these changes users were forced to use:
- SEV, SNP, or TDX guests, or
- KATA_HOST_OS=cbl-mariner
2. After these changes users can also use other platforms that are
configured with "shared_fs = virtio-fs" - e.g.,
- KATA_HOST_OS=ubuntu + KATA_HYPERVISOR=qemu
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
The behavior of Kata CI doesn't change.
For local testing using kubernetes/gha-run.sh:
1. Before these changes:
- AUTO_GENERATE_POLICY=yes was always used by the users of SEV, SNP,
TDX, or KATA_HOST_OS=cbl-mariner.
2. After these changes:
- Users of SEV, SNP, TDX, or KATA_HOST_OS=cbl-mariner must specify
AUTO_GENERATE_POLICY=yes if they want to auto-generate policy.
- These users have the option to test just using hard-coded policies
(e.g., using the default policy built into the Guest rootfs) by
using AUTO_GENERATE_POLICY=no. AUTO_GENERATE_POLICY=no is the default
value of this env variable.
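As a sketch of the opt-in behaviour described above, a check like the following treats an unset variable the same as `no`. The function name `auto_generate_policy_enabled` is hypothetical; only the variable name and its default come from this commit message.

```shell
# Hypothetical helper: succeeds only when policy auto-generation was requested.
# AUTO_GENERATE_POLICY defaults to "no" when it is unset or empty.
auto_generate_policy_enabled() {
    [ "${AUTO_GENERATE_POLICY:-no}" = "yes" ]
}
```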
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This PR adds a trap statement to the confidential kbs script
to clean up temporary files and ensure we are not leaving them behind.
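A minimal sketch of the trap pattern, assuming the temporary files live in a directory created with `mktemp -d`. The function name `run_with_tmpdir` is hypothetical and the body is a placeholder for the real work.

```shell
# Hypothetical helper: do some work in a temporary directory and remove it
# when the subshell exits, whether the work succeeded or not.
run_with_tmpdir() (
    tmpdir="$(mktemp -d)"
    trap 'rm -rf "${tmpdir}"' EXIT
    # Placeholder for the real work; here we only report the directory used.
    echo "${tmpdir}"
)
```

The function body runs in a subshell, so the EXIT trap fires when the function finishes rather than when the calling script ends.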
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
The test is disabled for qemu-coco-dev / qemu-tdx, but it doesn't seem
to actually be failing on those. Plus, it's passing on SEV / SNP, which
means that we most likely missed re-enabling this one in the past.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Currently, `qemu-runtime-rs` does not support `virtio-scsi`,
which causes the `k8s-block-volume.bats` test to fail.
We should skip this test until `virtio-scsi` is supported by the runtime.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The nginx container seems to error out when using UID=123.
Depending on the timing between container initialization and "kubectl
wait", the test might have gotten lucky and found the pod briefly in
Ready state before nginx errored out. But on some of the nodes, the pod
never got reported as Ready.
Also, don't block in "kubectl wait --for=condition=Ready" when wrapping
that command in a waitForProcess call, because waitForProcess is
designed for short-lived commands.
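To illustrate why the wrapped command should return quickly, here is a simplified polling helper. It is a hypothetical stand-in (`wait_for_cmd`), not the actual `waitForProcess` implementation.

```shell
# Hypothetical polling helper: run a short-lived command repeatedly until it
# succeeds or the time budget is used up.
#   $1: total wait time in seconds
#   $2: sleep time between attempts in seconds (must be greater than 0)
#   $3...: command to run (should return quickly)
wait_for_cmd() {
    local wait_time="$1"
    local sleep_time="$2"
    shift 2
    while [ "${wait_time}" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        sleep "${sleep_time}"
        wait_time=$((wait_time - sleep_time))
    done
    return 1
}
```

With such a helper, the wrapped `kubectl wait` can be given `--timeout=0s` so that it checks the condition once and returns, leaving the waiting to the outer loop. For example: `wait_for_cmd 120 5 kubectl wait --for=condition=Ready --timeout=0s pod "${pod_name}"`, where `pod_name` is a placeholder.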
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This imports the k8s-block-volume test from the tests repo and modifies
it slightly to set up the host volume on the AKS host.
This is a follow-up to #7132.
Fixes: #7164
Signed-off-by: Salvador Fuentes <salvador.fuentes@intel.com>
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
The ita kustomization for Trustee, as well as the previously used one
(DCAP), doesn't have a $(uname -m) directory after the deployment
directory name.
Let's follow the same logic used for the deploy-kbs script and clean
those up accordingly.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Intel Tiber Trust Services (formerly known as Intel Trust Authority) is
Intel's own attestation service, and we want to take advantage of the
TDX CI in order to ensure ITTS works as expected.
In order to do so, let's replace the formerly used method (DCAP) with
ITTS.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
There are many similar or duplicated code patterns in `teardown()`.
This commit consolidates them into a new function, `teardown_common()`,
which is now called within `teardown()`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The current `exec_host()` accepts a given node name and
creates a node debugger pod, even if the name is invalid.
This could result in the creation of an unnecessary pending
pod (since we are using nodeAffinity; if the given name
does not match any actual node names, the pod won’t be scheduled),
which wastes resources.
This commit introduces validation for the node name to
prevent this situation.
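A minimal sketch of such a validation, assuming the list of known node names is passed in as newline-separated text. The helper name `is_known_node` is hypothetical.

```shell
# Hypothetical helper: succeed only if the requested node name exactly matches
# one line of the newline-separated list passed as the second argument.
is_known_node() {
    local node="$1"
    local known_nodes="$2"
    [ -n "${node}" ] && printf '%s\n' "${known_nodes}" | grep -Fxq -- "${node}"
}
```

The list could come from something like `kubectl get nodes -o name | sed 's|^node/||'`. The exact-line match (`-x`) keeps a partial name such as `worker` from matching `worker-1`.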
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It was observed that the custom node debugger pod is not
cleaned up when a test times out.
This commit ensures the pod is cleaned up by triggering
the cleanup on EXIT, preventing any debugger pods from
being left behind.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
With #10232 merged, we now have a persistent node debugger pod throughout the test.
As a result, there’s no need to spawn another debugger pod using `kubectl debug`,
which could lead to false negatives due to premature pod termination, as reported
in #10081.
This commit removes the `print_node_journal()` call that uses `kubectl debug` and
instead uses `exec_host()` to capture the host journal. The `exec_host()` function
is relocated to `tests/integration/kubernetes/lib.sh` to prevent cyclical dependencies
between `tests_common.sh` and `lib.sh`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
`assert_pod_fail()` currently calls `k8s_create_pod()` to ensure that a pod
does not become ready within the default 120s. However, this delays the test's
completion even if an error message is detected earlier in the journal.
This commit removes the use of `k8s_create_pod()` and modifies `assert_pod_fail()`
to fail as soon as the pod enters a failed state.
All failing pods end up in one of the following states:
- CrashLoopBackOff
- ImagePullBackOff
The function now polls the pod's state every 5 seconds to check for these conditions.
If the pod enters a failed state, the function immediately returns 0. If the pod
does not reach a failed state within 120 seconds, it returns 1.
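The polling loop described above can be sketched as follows. The function name `wait_for_failed_state` and the `get_state` callback are hypothetical; in the real test the state would be read from the cluster.

```shell
# Hypothetical sketch of the polling loop.
#   $1: command that prints the pod's current state
#   $2: timeout in seconds (120 in the commit message)
#   $3: polling interval in seconds (5 in the commit message)
wait_for_failed_state() {
    local get_state="$1"
    local timeout="${2:-120}"
    local interval="${3:-5}"
    local elapsed=0
    local state
    while [ "${elapsed}" -lt "${timeout}" ]; do
        state="$(${get_state})"
        case "${state}" in
            # Either state counts as a failed pod: return success right away.
            CrashLoopBackOff|ImagePullBackOff) return 0 ;;
        esac
        sleep "${interval}"
        elapsed=$((elapsed + interval))
    done
    # The pod never reached a failed state within the timeout.
    return 1
}
```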
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Now that the issue with handling loop devices has been resolved,
this commit re-enables the guest-pull-image tests for `qemu-coco-dev`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Timeouts (e.g. `create_container_timeout` and `wait_time`) occur
when using qemu-coco-dev.
This commit increases these timeouts for the trusted image storage
test cases.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
If the host running the tests is different from the host where the cluster is running,
the *_loop_device() functions do not work as expected because the device is created
on the test host, while the cluster expects the device to be local.
This commit ensures that all commands for the relevant functions are executed via exec_host()
so that the device is handled on a cluster node.
Additionally, it modifies exec_host() to return the exit code of the last executed command
because the existing logic with `kubectl debug` sometimes includes unexpected characters
that are difficult to handle. `kubectl exec` appears to properly return the exit code of
the command passed to it.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Creating and deleting a node debugger pod for every `exec_host()`
call is inefficient.
This commit changes the test suite to create and delete the pod
only once, globally.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
This commit addresses an issue with handling loop devices
via a node debugger due to restricted privileges.
It runs a pod with full privileges, allowing it to mount
the host root to `/host`, similar to the node debugger.
This change enables us to run tests for trusted image storage
using the `qemu-coco-dev` runtime class.
Fixes: #10133
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
We've added s390x test container images, so add support
for using them based on the arch the test is running on.
Fixes: #10302
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
This commit brings in some public parts of the image pulling test series, such as
encrypted image pulling, pulling images from an authenticated registry, and
image verification. This helps reduce the cost of maintenance.
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Close #8120
**Case 1**
Create a pod from an unsigned image, on an insecureAcceptAnything
registry works.
Image: quay.io/prometheus/busybox:latest
Policy rule:
```
"default": [
{
"type": "insecureAcceptAnything"
}
]
```
**Case 2**
Create a pod from an unsigned image, on a 'restricted registry' is
rejected.
Image: ghcr.io/confidential-containers/test-container-image-rs:unsigned
Policy rule:
```
"quay.io/confidential-containers/test-container-image-rs": [
{
"type": "sigstoreSigned",
"keyPath": "kbs:///default/cosign-public-key/test"
}
]
```
**Case 3**
Create a pod from a signed image, on a 'restricted registry' is
successful.
Image: ghcr.io/confidential-containers/test-container-image-rs:cosign-signed
Policy rule:
```
"ghcr.io/confidential-containers/test-container-image-rs": [
{
"type": "sigstoreSigned",
"keyPath": "kbs:///default/cosign-public-key/test"
}
]
```
**Case 4**
Create a pod from a signed image, on a 'restricted registry', but with
the wrong key is rejected
Image:
ghcr.io/confidential-containers/test-container-image-rs:cosign-signed-key2
Policy:
```
"ghcr.io/confidential-containers/test-container-image-rs": [
{
"type": "sigstoreSigned",
"keyPath": "kbs:///default/cosign-public-key/test"
}
]
```
**Case 5**
Create a pod from an unsigned image, on a 'restricted registry' works
if enable_signature_verification is false
Image: ghcr.io/kata-containers/confidential-containers:unsigned
image security enable: false
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Change pod runAsUser value of a Replication Controller after generating
the RC's policy, and verify that the RC pods get rejected due to this
change.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Change pod runAsUser value of a Job after generating the Job's policy,
and verify that the Job gets rejected due to this change.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Change pod runAsUser value of a Deployment after generating the
Deployment's policy, and verify that the Deployment fails due to
this change.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Poll/wait for pod termination instead of sleeping 2 minutes. This
change typically saves ~90 seconds in my test cluster.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Kata-deploy often fails due to a transiently unreachable k8s cluster
for the qemu-coco-dev test on s390x.
(e.g. https://github.com/kata-containers/kata-containers/actions/runs/10831142906/job/30058527098?pr=10009)
This commit introduces a retry mechanism to mitigate these failures by
retrying the command two more times with a 10-second interval as a workaround.
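A minimal sketch of such a retry wrapper. The name `retry_cmd` is hypothetical; the retry count (2) and interval (10 seconds) from this commit message are passed as arguments.

```shell
# Hypothetical helper: run a command, retrying on failure.
#   $1: number of extra attempts after the first failure (2 in the commit message)
#   $2: seconds to wait between attempts (10 in the commit message)
#   $3...: command to run
retry_cmd() {
    local retries="$1"
    local interval="$2"
    shift 2
    local attempt=0
    until "$@"; do
        if [ "${attempt}" -ge "${retries}" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${interval}"
    done
    return 0
}
```

With `retry_cmd 2 10 some_command`, the command runs at most three times in total.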
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
If the CI platform being tested doesn't yet support the prometheus
container image:
- Use busybox instead of prometheus.
- Skip the test cases that depend on the prometheus image.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Remove the workaround for #9928, now that genpolicy is able to
convert user names from container images into the corresponding
UIDs from these images.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Disabling the UID Policy rule was a workaround for #9928. Re-enable
that rule here and add a new test/CI temporary workaround for this
issue. This new test workaround will be removed after fixing #9928.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Change quay.io/prometheus/busybox to quay.io/prometheus/prometheus in
this test. The prometheus image will be helpful for testing the future
fix for #9928 because it specifies user = "nobody".
Also, change:
sh -c "ls -l /"
to:
echo -n "readinessProbe with space characters"
as the test readinessProbe command line. Both include a command line
argument containing space characters, but "sh -c" behaves differently
when using the prometheus container image (causes the readinessProbe
to time out, etc.).
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
With an older version of image-rs, we were getting the following error:
```
Message: failed to create containerd task: failed to create shim task: failed to handle layer: failed to get decrypt key no suitable key found for decrypting layer key:
```
However, with the version of image-rs we are bumping to, the error comes
as:
```
Message: failed to create containerd task: failed to create shim task: failed to handle layer: failed to get decrypt key
Caused by:
no suitable key found for decrypting layer key:
keyprovider: failed to unwrap key by ttrpc
```
Due to this change, I'm splitting the check in two different ones.
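One way to express the split check is to look for each fragment separately, so that both the old single-line and the new multi-line message formats match. The helper name `has_decrypt_key_error` is hypothetical.

```shell
# Hypothetical helper: succeed only if the message contains both fragments,
# whether they share a line or are separated by "Caused by:".
has_decrypt_key_error() {
    local message="$1"
    printf '%s\n' "${message}" | grep -Fq "failed to get decrypt key" &&
        printf '%s\n' "${message}" | grep -Fq "no suitable key found for decrypting layer key"
}
```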
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
In the process of switching the TDX CI machine we've noticed that
`hostname -i` on one of the machines returns a single IP address,
while on another machine it returns a full list of IPs.
As we're only interested in the first one, let's adapt the code to
always return the first one.
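A small sketch of picking the first address, assuming the addresses are separated by whitespace. The helper name `first_ip` is hypothetical.

```shell
# Hypothetical helper: print the first whitespace-separated address read from
# standard input, e.g. from the output of "hostname -i".
first_ip() {
    awk '{ print $1; exit }'
}
```

It would be used as `hostname -i | first_ip`, and behaves the same whether one address or several are returned.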
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
On commit 51690bc157 we switched the installation from kubectl to helm
and used its `--wait` expecting the execution would continue when all
kata-deploy Pods were Ready. It turns out that there is a limitation on
helm install that won't wait properly when the daemonset is made of a
single replica and maxUnavailable=1. In order to fix that issue, let's
revert the changes partially to keep using kubectl and waitForProcess
to hold the execution while Pods aren't Running.
Fixes: #10168
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
To deploy KBS on s390x, the environment variable `IBM_SE_CREDS_DIR`
must be exported, and the corresponding directory must be created.
This commit enables KBS deployment for `qemu-coco-dev`, in addition
to the existing `qemu-se` support on the platform.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
GH-9592 addressed a bug in a previous version of the AKS Mariner host
kernel that blocked the CH v39 upgrade. This bug has now been fixed so
we undo that PR.
Note we also specify a different OCI version for Mariner as it differs
from Ubuntu's.
Fixes: #9594
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
This reverts commit 41b7577f08.
We were seeing a lot of issues in the TDX CI of the nature:
"Error: failed to create containerd container: create instance
470: object with key "470" already exists: unknown"
With the TDX CI, we moved to having the nydus snapshotter pre-installed.
Essentially the `deploy-snapshotter` step was performed once before any
actual CI runs.
We were seeing failures related to the error message above.
On reverting this change, we are no longer seeing errors related to
"key exists" with the TDX CI passing now.
The change reverted here is related to downloading incomplete images, but this
seems to be messing up TDX CI.
It is possible to pass --snapshotter to `ctr image check` but that does
not seem to have any effect on the data set returned.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
In certain environments (e.g., those with lower performance), `k8s_create_pod()`
may require additional wait time, especially when dealing with large images.
Since `k8s_wait_pod_be_ready()` — which is called by `k8s_create_pod()` — already
accepts `wait_time` as a second argument, it makes sense to introduce `wait_time`
to `k8s_create_pod()` and propagate it to the callee.
This commit adds `wait_time` to `k8s_create_pod()` as the 2nd (optional) argument.
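The optional-argument pattern can be sketched as below. Both functions are hypothetical stand-ins that only echo what they would do, and the 120-second default in the callee is an assumption made for illustration.

```shell
# Hypothetical stand-in for the callee: wait_time is optional and defaults
# to 120 seconds here (an assumption for this sketch).
wait_pod_ready_sketch() {
    local pod="$1"
    local wait_time="${2:-120}"
    echo "waiting up to ${wait_time}s for ${pod}"
}

# Hypothetical stand-in for the caller: wait_time is an optional 2nd argument.
create_pod_sketch() {
    local config_file="$1"
    local wait_time="${2:-}"
    # Forward wait_time only when the caller supplied it, so the callee's
    # own default stays in effect otherwise.
    if [ -n "${wait_time}" ]; then
        wait_pod_ready_sketch "${config_file}" "${wait_time}"
    else
        wait_pod_ready_sketch "${config_file}"
    fi
}
```

Existing callers that pass a single argument keep their behaviour, while slower environments can pass a larger value.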
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Some of the tests call set_metadata_annotation() for updating the kernel
parameters. For `kata-qemu-se`, repack_secure_image() is called which is
defined in `lib_se.sh` and sourced by `confidential_kbs.sh`.
This commit ensures that the function call chain for the relevant
`KATA_HYPERVISOR` is properly handled.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
* genpolicy: deny UpdateEphemeralMountsRequest
Deny UpdateEphemeralMountsRequest by default, because paths to
critical Guest components can be redirected using such request.
Signed-off-by: Dan Mihai <Daniel.Mihai@microsoft.com>
This partially reverts commit 94b3348d3c,
as there's more work needed in order to have this one done in a robust
way, and we are taking the safer path of reverting for now, and adding
it back as soon as the release is cut out.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
This partially reverts commit 51690bc157,
as there's more work needed in order to have this one done in a robust
way, and we are taking the safer path of reverting for now, and adding
it back as soon as the release is cut out.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
This reverts commit 1221ab73f9, as there's
more work needed in order to have this one done in a robust way, and we
are taking the safer path of reverting for now, and adding it back as
soon as the release is cut out.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
The deploy-kbs.sh script generates the kbs.key that's used to install
KBS. This same file is used later by kbs-client to authenticate. This change
ensures that the file was created, and fails otherwise.
Another problem solved here is that on bare-metal machines the key doesn't survive
a reboot as it is created in a temporary directory (/tmp/trustee). So let's save
the file to a non-temporary location.
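A minimal sketch of the existence check, assuming the key path is passed as an argument. The helper name `require_key_file` and the error text are hypothetical.

```shell
# Hypothetical helper: fail with a message if the expected key file is missing
# or empty; otherwise print its path.
require_key_file() {
    local key_file="$1"
    if [ ! -s "${key_file}" ]; then
        echo "ERROR: key file not found or empty: ${key_file}" >&2
        return 1
    fi
    echo "${key_file}"
}
```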
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Add bind mounts for volumes defined by docker container images, unless
those mounts have been defined in the input K8s YAML file too.
For example, quay.io/opstree/redis defines two mounts:
/data
/node-conf
Before these changes, if these mounts were not defined in the YAML file
too, the auto-generated policy did not allow this container image to
start.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Instead of deploying and removing the snapshotter on every single run,
let's make sure the snapshotter is always deployed in the TDX case.
We're doing this as an experiment, in order to see if we'll be able to
reduce the failures we've been facing with the nydus snapshotter.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The `exec_host()` function often fails to capture the output of a given command
because the node debugger pod is prematurely terminated. To address this issue,
the function has been refactored to ensure consistent output capture by adjusting
the `kubectl debug` process as follows:
- Keep the node debugger pod running
- Wait until the pod is fully ready
- Execute the command using `kubectl exec`
- Capture the output and terminate the pod
This commit refactors `exec_host()` to implement the above steps, improving its reliability.
Fixes: #10081
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Setting up a loop device with `exec_host` does not work, and that command is
necessary for qemu-coco-dev. See issue #10133.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Update the expected error message for pulling an encrypted image to "failed to get decrypt key no suitable key found for decrypting layer key".
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Add tests for guest pull with a configured timeout:
1) failed case: Test that we cannot pull a large image whose pull time exceeds a short createcontainer timeout (10s) inside the guest
2) successful case: Test that we can pull a large image inside the guest with an increased createcontainer timeout (120s)
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Add tests for pulling images in the guest using trusted storage:
1) failed case: Test we cannot pull an image that exceeds the memory limit inside the guest
2) successful case: Test we can pull an image inside the guest using
trusted ephemeral storage.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
This PR disables the k8s file volume test as we are having random failures
in multiple GHA CIs, mainly because the exec_host function sometimes
does not work properly.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Currently, setting `io.containerd.cri.runtime-handler` annotation in
the yaml is not necessary for pulling images in the guest. All TEE
hypervisors are already running tests with guest-pulling enabled.
Therefore, we can remove some duplicate tests and re-enable the
guest-pull test for running different runtime pods at the same time.
However, to support different containerd versions, I recommend
that we keep setting "io.containerd.cri.runtime-handler".
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
This PR updates the ubuntu image for stress Dockerfile. The main purpose
is to have a more updated image compared with the one that is in libpod
which has not been updated in a while.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
These new "kata-deploy" and "cleanup" actions are equivalent to
"kata-deploy-garm" and "cleanup-garm", respectively, and should be
used in the workflows being migrated from GARM to
GitHub's managed runners.
Eventually "kata-deploy-garm" and "cleanup-garm" won't be used anymore,
and then we will be able to remove them.
See: #9940
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
This PR updates the image that we are using in the kubectl debug command
as part of the exec host function, as the current alpine image does not
allow creating a temporary file, for example, which causes random kubernetes
failures.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Rather than modifying the kata-deploy scripts, let's use Helm and
create a values.yaml that can be used to render the final templates.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
For easier handling of kata-deploy, we can leverage a Helm chart to get
rid of all the base and overlays for the various components.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
After experimenting a little bit with those tests, they seem to be
passing on all the available TEE machines.
With this in mind, let's just enable them for those machines.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>