The `Create AKS cluster` step in `run-k8s-tests-on-aks.yaml` is likely
to fail fail since we are trying to issue `PUT` to `aks` in a relatively
high frequency, while the `aks` end has it's limit on `bucket-size` and
`refill-rate`, documented here [1].
Use `nick-fields/retry@v3` to retry in 10 seconds after request fail,
based on observations that AKS were request 7, or 8 second delays
before retry as part of their 429 response
[1] https://learn.microsoft.com/en-us/azure/aks/quotas-skus-regions#throttling-limits-on-aks-resource-provider-apisFixes: #10772
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The behavior of Kata CI doesn't change.
For local testing using kubernetes/gha-run.sh:
1. Before these changes:
- AUTO_GENERATE_POLICY=yes was always used by the users of SEV, SNP,
TDX, or KATA_HOST_OS=cbl-mariner.
2. After these changes:
- Users of SEV, SNP, TDX, or KATA_HOST_OS=cbl-mariner must specify
AUTO_GENERATE_POLICY=yes if they want to auto-generate policy.
- These users have the option to test just using hard-coded policies
(e.g., using the default policy built into the Guest rootfs) by
using AUTO_GENERATE_POLICY=no. AUTO_GENERATE_POLICY=no is the default
value of this env variable.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
22.04 is the default today:
23da668261/README.md
Being more specific will avoid unexpected errors when Github updates the
default.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
The following tests are disabled because they fail (alike with dragonball):
- k8s-cpu-ns.bats
- k8s-number-cpus.bats
- k8s-sandbox-vcpus-allocation.bats
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
The new run-k8s-tests-coco-nontee job should be the home of attestation
tests.
Changed run-k8s-tests-coco-nontee to get KBS installed and by the time the
KBS variable is exported in the environment then the attestation tests
will kick in (likewise they will skip in run-k8s-tests-on-aks).
Fixes#9455
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
`Node.js 19` is deprecated. Bump to a new version based on `Node.js 20`.
This fixes all remaining sites.
Fixes#9245
Signed-off-by: Greg Kurz <groug@kaod.org>
Add GENPOLICY_PULL_METHOD that will be used to test pulling
container images in genpolicy using the oci-distribution crate
and/or the containerd interface.
GENPOLICY_PULL_METHOD will start being used in a future PR.
Fixes: #9384
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Currently we're only running the small instance tests. This adds the
normal instance tests as well.
Fixes: #9298
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
The step to deploy KBS on run-k8s-tests-on-aks workflow should be
increased so that there is enough time for checking the service is
healthy and exposed. Likewise the step that builds the kbs-client
which requires enough time to build the executable.
Fixes#9058
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Changed the "run k8s tests on AKS" workflows to get the CoCo KBS
installed so that we can run attestation tests.
The plan is to run attestation tests only on a subset of non-TEE jobs
initially, so this commit restricts to install KBS only on kata-qemu
configuration. Actually at this point it is added only stubs commands
to tests/integration/kubernetes/gha-run.sh that should be implemented
in a future commit.
Fixes#9058
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
This PR adds the Cloud Hypervisor driver, integrated with the runtime-rs,
as part of the kubernetes tests different with devmapper.
Fixes#8995
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Due to the changes done in the CI, we need to set the correct
subscription to be used with the account from now on, otherwise we'd end
up using CoCo subscription.
Fixes: #8946
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The changes to install and test genpolicy must come later, after CI
picks up these gha changes.
Fixes: #8856
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This reverts commit 44899d4cdf, as we've
decided to keep both golang and rust runtime installable and usable at
the same time.
The decision of having both runtimes installable and usable will help
users to test and easily catch any possible differences between those
runtimes, helping us to get on par with both implementations.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This will be used for selecting the correct runtimes and runtimeclasses
to be deployed with kata-deploy.
Fixes: #8475
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This commit enables StratoVirt hypervisor to be tested in kata GHA,
incluing k8s, metrics, cri-containerd, nydus and so on.
Meanwhile, adding some unit tests for StratoVirt to make sure it works.
Fixes: #7794
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
Let me start with a fair warning that this commit is hard to split into
different parts that could be easily tested (or not tested, just
ignored) without breaking pieces.
Now, about the commit itself, as we're on the run to reduce costs
related to our sponsorship on Azure, we can split the k8s tests we run
in 2 simple groups:
* Tests that can be run in the smaller Azure instance (D2s_v5)
* Tests that required the normal Azure instance (D4s_v5)
With this in mind, we're now passing to the tests which type of host
we're using, which allows us to select to run either one of the two
types of tests, or even both in case of running the tests on a baremetal
system.
Fixes: #7972
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We're changing what's been done as part of ac939c458c, as we've
notcied issues using `github.event.pull_request.merge_commit_sha`.
Basically, whenever a force-push would happen, the reference of
merge_commit_sha wouldn't be updated, leading us to test PRs with the
old code. :-/
In order to get the rebase properly working, we need to ensure we pull
the hash of the commit as part of checkout action, and ensure
fetch-depth is set to 0.
Fixes: #7414
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
So we have a better control on which flavour of kubernetes kata-deploy
is expected to be targetting.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's make sure we can rely on the tests passing down whether they want
to be tested using Node Feataure Discovery or not.
Right now, only the TDX job has this option set to "true", all the other
jobs have this option set to "false".
We can and have to merge this one before merging the NFD related patches
as:
1) It causes no harm in exporting this environment variable, but not
having it used
2) It will allow us to test the NFD after this one is merged, as changes
in the yaml file, in the case of the pull_request_target event, are
not taken into consideration before they're merged
Fixes: #7495
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This splits deploying Kata and running the tests into separate commands
to make it possible to rerun tests locally without having to redeploy
Kata each time.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Let's ensure we're not relying, on any of the called workflows, on event
specific information.
Right now, the two information we've been relying on are:
* PR number, coming from github.event.pull_request.number
* Commit hash, coming from github.event.pull_request.head.sha
As we want to, in the future, add nightly jobs, which will be triggered
by a different event (thus, having different fields populated), we
should ensure that those are not used unless it's in the "top action"
that's trigerred by the event.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Github Actions reads and runs workflow files from the main branch,
rather than from the PR branch. This means that PRs that modify workflow
files aren't being tested with the updated workflows coming from the PR,
but rather with the old workflows from the main branch. AFAIK, this
behavior isn't avoidable for workflow files (but is for other scripts).
This makes it very hard to reliably test workflow changes before they're
actually merged into main and leads to issues that we have to hotifx
(see #6983, #6995).
This PR aims to mitigate that by extracting the commands used in
workflows to a separate script file. The way our CI is set up, those
script files are read from the PR branch and thus changes would be
reflected in the CI checks.
Fixes: #6971
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Full SHA is 40 characters, while AKS cluster name has a limit of 63. Trim the
SHA to 12 characters, which is widely considered to be unique enough and is
short enough to be used in the cluster name
Fixes: #7010
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
We added that to create the cluster name, but I forgot to add that to
the part we get the k8s config file, or to the part where we delete the
AKS cluster.
Fixes: #6999
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We need to do so, otherwise we'll create two clusters for testing Cloud
Hypervisor with exactly the same name, one using Ubuntu, and one using
Mariner.
Fixes: #6999
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
While the Mariner Kata host is in preview, we need the `aks-preview`
extension to enable the `--workload-runtime KataMshvVmIsolation` flag.
Fixes: #6994
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
The current testing setup only supports running Kata on top of an Ubuntu
host. This adds Mariner to the matrix of testable hosts for k8s
tests, with Cloud Hypervisor as a VMM.
As preparation for the upcoming PR that will change only the actual test
code (rather than workflow YAMLs), this also introduces a new file
`setup.sh` that will be used to set host-specific parameters at test
run-time.
Fixes: #6961
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
We're still facing issues related to the time taken to deploy the
kata-deplot daemonset and starting to run the tests.
Ideally, we should solve this with a readiness probe, and that's the
approach we want to take in the future. However, for now, let's just
make sure those tests are not on the way of the community.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We've seen tests being aborted close to the end of the run due to the
timeout. Let's increase it, avoiding to hit such cases again..
Fixes: #6964
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
fa832f4709 increased the timeout, which
helped a lot, mainly in the TEE machines. However, we're still seeing
some failures here and there with the AKS tests.
Let's bump it yet again and, hopefully, those errors to start the tests
will go away.
Fixes: #6905
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We like it or not, every now and then we'll have to deal with flaky
tests, and our tests using GHA are not exempt from that fact.
With this simple commit, we're trying to improve the reliability of the
tests in a few different fronts:
* Giving enough time for the script used by kata-deploy to be executed
* We've hit issues as the kata-deploy pod is considered "Ready" at the
moment it starts running, not when it finishes the needed setup. We
should also be looking on how to solve this on the kata-deploy side
but, for now, let's ensure our tests do not break with the current
kata-deploy behavior.
* Merging the "Deploy kata-deploy" and "Run tests" steps
* We've hit issues re-running tests and seeing even more failures than
the ones we're trying to debug, as a step will simply be taken as
succeeded as part of the re-run, in case it was successful executed
as part of the first run. This causes issues with the kata-deploy
deployment, as the tests would start running before even having the
node set up for running Kata Containers.
Fixes: #6865#6649
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
I should have seen this coming, but currently the "create" and "delete"
AKS workflows cannot be imported and uses as a job's step, resulting on
an error trying to find the correspondent action.yaml file for those.
Fixes: #6630
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit 7855b43062.
Unfortunately we have to revert the PRs related to the switch done to
using `workflow_run` instead of `pull_request_target`. The reason for
that being that we can only mark jobs as required if they are targetting
PRs.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit 85cc5bb534.
Unfortunately we have to revert the PRs related to the switch done to
using `workflow_run` instead of `pull_request_target`. The reason for
that being that we can only mark jobs as required if they are targetting
PRs.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We've been currently using {create,delete}_aks as jobs. However, it
means that if the tests fail we'll end up deleting the AKS cluster (as
expected), but not having a way to recreate the cluster without
re-running all jobs, which is a waste of resources.
Fixes: #6628
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This was missed from the last series, as GHA will use the "target
branch" yaml file to start the workflow.
Basically we changed the name of the cluster created to stop relying on
the PR number, as that's not easily accessible on `workflow_run`.
Fixes: #6611
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're using the `workflow_run` event, the checkout action would
pull the **current target branch** instead of the PR one.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As already done for Cloud Hypervisor and QEMU, let's make sure we can
run the AKS tests using dragonball.
Fixes: #6583
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's switch to using the `ghcr.io` registry for the k8s CI, as this
will save us some troubles on running the CI with PRs coming from forked
repos.
Fixes: #6587
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>