As #11076 introduces VFIO-AP bind/associate funtions for IBM Secure
Execution (SEL), a new internal nightly test has been established.
This PR adds a new entry `cc-vfio-ap-e2e-tests` to the existing matrix
to share the test result.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
For help with debugging add, logging of the KBS,
like the container system logs if the confidential test fails
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Nydus+guest_pull has specific behavior where it improperly handles image layers on
the host, causing the CRI to not find /etc/passwd and /etc/group files
on container images which have them. The unfortunately causes different
outcomes w.r.t. GID used which we are trying to enforce with policy.
This behavior is observed/explained in https://github.com/kata-containers/kata-containers/issues/11162
Handle this exception with a config.settings.cluster_config.guest_pull
field. When this is true, simply ignore the /etc/* files in the
container image as they will not be parsed by the CRI.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
Add security context testcases for genpolicy, verifying that UID and GID
configurations controlled by the kubernetes security context are
enforced.
Also, fix the other CreateContainerRequest tests' expected contents to
reflect our new genpolicy parsing/enforcement of GIDs.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
Introduce tests to check for policy correctness on a redis deployment
with 1. a pod-level securityContext 2. a container-level securityContext
which shadows the pod-level securityContext 3. a pod-level
securityContext which selects an existing user (nobody), causing a new GID to be selected.
Redis is an interesting container image to test with because it includes
a /etc/passwd file with existing user/group configuration of 1000:1000 baked in.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
With fixes to align policy GID parsing with the CRI behavior, we can now
enable policy verification of GIDs.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
The GID used for the running process in an OCI container is a function of
1. The securityContext.runAsGroup specified in a pod yaml, 2. The UID:GID mapping in
/etc/passwd, if present in the container image layers, 3. Zero, even if
the userstr specifies a GID.
Make our policy engine align with this behavior by:
1. At the registry level, always obtain the GID from the /etc/passwd
file if present. Ignore GIDs specified in the userstr encoded in the
OCI container.
2. After an update to UID due to securityContexts, perform one final check against
the /etc/passwd file if present. The GID used for the running
process is the mapping in this file from UID->GID.
3. Override everything above with the GID of the securityContext
configuration if provided
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
Our policy should cover these fields for securityContexts at the pod or
container level of granularity.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
The get_process logic in registry.rs did not account for all cases
(username:groupname), did not defer to contents of /etc/group,
/etc/passwd when it should, and was difficult to read.
Clean this implementation up, factoring the string parsing for
user/group strings into their own functions. Enable the
registry::Container class to query /etc/passwd and /etc/group, if they
exist.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
By running on "all" host type there are two consequences:
1) run the "normal" tests too (until now, it's only "small" tests), so
increasing the coverage
2) create AKS cluster with larger VMs. This is a new requirement due to
the current ingress controller for the KBS service eating too much
vCPUs and lefting only few for the tests (resulting on failures)
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
_print_instance_type() returns the instance type of the AKS nodes, based
on the host type. Tests are grouped per host type in "small" and "normal"
sets based on the CPU requirements: "small" tests require few CPUs and
"normal" more.
There is an 3rd case: "all" host type maps to the union of "small"
and "normal" tests, which should be handled by _print_instance_type()
properly. In this case, it should return the largest instance type
possible because "normal" tests will be executed too.
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
It's used an AKS managed ingress controller which keeps two nginx pod
replicas where both request 500m of CPU. On small VMs like we've used on
CI for running the CoCo non-TEE tests, it left only a few amount of CPU
for the tests. Actually, one of these pod replicas won't even get
started. So let's patch the ingress controller to have only one replica
of nginx.
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
The Azure AKS addon-http-application-routing add-on is deprecated and
cannot be enabled on new clusters which has caused some CI jobs to fail.
Migrated our code to use approuting instead. Unlike
addon-http-application-routing, this add-on doesn't
configure a managed cluster DNS zone, but the created ingress has a
public IP. To avoid having to deal with DNS setup, we will be using that
address from now on. Thus, some functions no longer used are deleted.
Fixes#11156
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
In #11044, `run-k8s-tests-coco-nontee` was set as requried by mistake.
This PR disables the test again.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
We're bringing to *Cloud Hypervisor only* the reclaim_guest_freed_memory
option already present in the runtime-rs.
This allows us to use virtio-balloon for the hypervisor to reclaim
memory freed by the guest.
The reason we're not touching other hypervisors is because we're very
much aware of avoiding to clutter the go code at this point, so we'll
leave it for whoever really needs this on other hypervisor (and trust
me, we really do need it for Cloud Hypervisor right now ;-)).
Signed-off-by: Champ-Goblem <cameron@northflank.com>
Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
The AKS CLI recently introduced a regression that prevents using
aks-preview extensions (Azure/azure-cli#31345), and hence create
CI clusters.
To address this, we temporarily hardcode the last known good version of
aks-preview.
Note that I removed the comment about this being a Mariner requirement,
as aks-preview is also a requirement of AKS App Routing, which will
be introduced soon in #11164.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Knowing that the upstream project provides a "ready to use" version of
the kernel, it's good to include an easy way to users to monitor
performance, and that's what we're doing by enabling the TASKSTATS (and
related) kernel configs.
This has been present as part of older kernels, but I couldn't
reasonably find the reason why it's been dropped.
Signed-off-by: Champ-Goblem <cameron@northflank.com>
Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
Let's add a RUNTIME_CHOICE env var that can be passed to be build
scripts, which allows the user to select whether they bulld the go
runtime, the rust runtime, or both.
Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
genpolicy is sending more HTTPS requests than other components during
CI so it's more likely to be affected by transient network errors
similar to:
ConnectError(
"dns error",
Custom {
kind: Uncategorized,
error: "failed to lookup address information: Try again",
},
)
Note that genpolicy is not the only component hitting network errors
during CI. Recent example from a different component:
"Message: failed to create containerd task: failed to create shim task:
failed to async pull blob stream HTTP status server error (502 Bad Gateway)"
This CI change might help just with the genpolicy errors.
Fixes: #11182
Signed-off-by: Dan Mihai <dmihai@microsoft.com>