mirror of
https://github.com/kata-containers/kata-containers.git
synced 2025-08-17 07:27:22 +00:00
Add information about using required-tests.yaml as a way to track jobs that are required. Signed-off-by: stevenhorsman <steven@uk.ibm.com>
417 lines
18 KiB
Markdown
417 lines
18 KiB
Markdown
# Kata Containers CI
|
|
|
|
> [!WARNING]
|
|
> While this project's CI has several areas for improvement, it is constantly
|
|
> evolving. This document attempts to describe its current state, but due to
|
|
> ongoing changes, you may notice some outdated information here. Feel free to
|
|
> modify/improve this document as you use the CI and notice anything odd. The
|
|
> community appreciates it!
|
|
|
|
## Introduction
|
|
|
|
The Kata Containers CI relies on [GitHub Actions][gh-actions], where the actions
|
|
themselves can be found in the `.github/workflows` directory, and they may call
|
|
helper scripts, which are located under the `tests` directory, to actually
|
|
perform the tasks required for each test case.
|
|
|
|
## The different workflows
|
|
|
|
There are a few different sets of workflows that are running as part of our CI,
|
|
and here we're going to cover the ones that are less likely to get rotten. With
|
|
this said, it's fair to advise that if the reader finds something that got
|
|
rotten, opening an issue to the project pointing to the problem is a nice way to
|
|
help, and providing a fix for the issue is a very encouraging way to help.
|
|
|
|
### Jobs that run automatically when a PR is raised
|
|
|
|
These are a bunch of tests that will automatically run as soon as a PR is
|
|
opened, they're mostly running on "cost free" runners, and they do some
|
|
pre-checks to evaluate that your PR may be okay to start getting reviewed.
|
|
|
|
Mind, though, that the community expects the contributors to, at least, build
|
|
their code before submitting a PR, which the community sees as a very fair
|
|
request.
|
|
|
|
Without getting into the weeds with details on this, those jobs are the ones
|
|
responsible for ensuring that:
|
|
|
|
- The commit message is in the expected format
|
|
- There's no missing Developer's Certificate of Origin
|
|
- Static checks are passing
|
|
|
|
### Jobs that require a maintainer's approval to run
|
|
|
|
There are some tests, and our so-called "CI". These require a
|
|
maintainer's approval to run as parts of those jobs will be running on "paid
|
|
runners", which are currently using Azure infrastructure.
|
|
|
|
Once a maintainer of the project gives "the green light" (currently by adding an
|
|
`ok-to-test` label to the PR, soon to be changed to commenting "/test" as part
|
|
of a PR review), the following tests will be executed:
|
|
|
|
- Build all the components (runs on free cost runners, or bare-metal depending on the architecture)
|
|
- Create a tarball with all the components (runs on free cost runners, or bare-metal depending on the architecture)
|
|
- Create a kata-deploy payload with the tarball generated in the previous step (runs on free costs runner, or bare-metal depending on the architecture)
|
|
- Run the following tests:
|
|
- Tests depending on the generated tarball
|
|
- Metrics (runs on bare-metal)
|
|
- `docker` (runs on cost free runners)
|
|
- `nerdctl` (runs on cost free runners)
|
|
- `kata-monitor` (runs on cost free runners)
|
|
- `cri-containerd` (runs on cost free runners)
|
|
- `nydus` (runs on cost free runners)
|
|
- `vfio` (runs on cost free runners)
|
|
- Tests depending on the generated kata-deploy payload
|
|
- kata-deploy (runs on cost free runners)
|
|
- Tests are performed using different "Kubernetes flavors", such as k0s, k3s, rke2, and Azure Kubernetes Service (AKS).
|
|
- Kubernetes (runs in Azure small and medium instances depending on what's required by each test, and on TEE bare-metal machines)
|
|
- Tests are performed with different runtime engines, such as CRI-O and containerd.
|
|
- Tests are performed with different snapshotters for containerd, namely OverlayFS and devmapper.
|
|
- Tests are performed with all the supported hypervisors, which are Cloud Hypervisor, Dragonball, Firecracker, and QEMU.
|
|
|
|
For all the tests relying on Azure instances, real money is being spent, so the
|
|
community asks for the maintainers to be mindful about those, and avoid abusing
|
|
them to merely debug issues.
|
|
|
|
## The different runners
|
|
|
|
In the previous section we've mentioned using different runners, now in this section we'll go through each type of runner used.
|
|
|
|
- Cost free runners: Those are the runners provided by GitHub itself, and
|
|
those are fairly small machines with virtualization capabilities enabled.
|
|
- Azure small instances: Those are runners which have virtualization
|
|
capabilities enabled, 2 CPUs, and 8GB of RAM. These runners have a "-smaller"
|
|
suffix to their name.
|
|
- Azure normal instances: Those are runners which have virtualization
|
|
capabilities enabled, 4 CPUs, and 16GB of RAM. These runners are usually
|
|
`garm` ones with no "-smaller" suffix.
|
|
- Bare-metal runners: Those are runners provided by community contributors,
|
|
and they may vary in architecture, size and virtualization capabilities.
|
|
Builder runners don't actually require any virtualization capabilities, while
|
|
runners which will be actually performing the tests must have virtualization
|
|
capabilities and a reasonable amount for CPU and RAM available (at least
|
|
matching the Azure normal instances).
|
|
|
|
## Adding new tests
|
|
|
|
Before someone decides to add a new test, we strongly recommend them to go
|
|
through [GitHub Actions Documentation][gh-actions],
|
|
which will provide you a very sensible background on how to read and understand
|
|
current tests we have, and also become familiar with how to write a new test.
|
|
|
|
On the Kata Containers land, there are basically two sets of tests: "standalone"
|
|
and "part of something bigger".
|
|
|
|
The "standalone" tests, for example the commit message check, won't be covered
|
|
here as they're better covered by the GitHub Actions documentation pasted above.
|
|
|
|
The "part of something bigger" is the more complicated one and not so
|
|
straightforward to add, so we'll be focusing our efforts on describing the
|
|
addition of those.
|
|
|
|
> [!NOTE]
|
|
> TODO: Currently, this document refers to "tests" when it actually means the
|
|
> jobs (or workflows) of GitHub. In an ideal world, except in some specific cases,
|
|
> new tests should be added without the need to add new workflows. In the
|
|
> not-too-distant future (hopefully), we will improve the workflows to support
|
|
> this.
|
|
|
|
### Adding a new test that's "part of something bigger"
|
|
|
|
The first important thing here is to align expectations, and we must say that
|
|
the community strongly prefers receiving tests that already come with:
|
|
|
|
- Instructions how to run them
|
|
- A proven run where it's passing
|
|
|
|
There are several ways to achieve those two requirements, and an example of that
|
|
can be seen in PR #8115.
|
|
|
|
With the expectations aligned, adding a test consists in:
|
|
|
|
- Adding a new yaml file for your test, and ensure it's called from the
|
|
"bigger" yaml. See the [Kata Monitor test example][monitor-ex01].
|
|
|
|
- Adding the helper scripts needed for your test to run. Again, use the [Kata Monitor script as example][monitor-ex02].
|
|
|
|
Following those examples, the community advice during the review, and even
|
|
asking the community directly on Slack are the best ways to get your test
|
|
accepted.
|
|
|
|
## Required tests
|
|
|
|
In our CI we have two categories of jobs - required and non-required:
|
|
- Required jobs need to all pass for a PR to be merged normally and
|
|
should cover all the core features on Kata Containers that we want to
|
|
ensure don't have regressions.
|
|
- The non-required jobs are for unstable tests, or for features that
|
|
are experimental and not-fully supported. We'd like those tests to also
|
|
pass on all PRs ideally, but don't block merging if they don't as it's
|
|
not necessarily an indication of the PR code causing regressions.
|
|
|
|
### Transitioning between required and non-required status
|
|
|
|
Required jobs that fail block merging of PRs, so we want to ensure that
|
|
jobs are stable and maintained before we make them required.
|
|
|
|
The [Kata Containers CI Dashboard](https://kata-containers.github.io/)
|
|
is a useful resource to check when collecting evidence of job stability.
|
|
At time of writing it reports the last ten days of Kata CI nightly test
|
|
results for each job. This isn't perfect as it doesn't currently capture
|
|
results on PRs, but is a good guideline for stability.
|
|
|
|
> [!NOTE]
|
|
> Below are general guidelines about jobs being marked as
|
|
> required/non-required, but they are subject to change and the Kata
|
|
> Architecture Committee may overrule these guidelines at their
|
|
> discretion.
|
|
|
|
#### Initial marking as required
|
|
|
|
For new jobs, or jobs that haven't been marked as required recently,
|
|
the criteria to be initially marked as required is ten days
|
|
of passing tests, with no relevant PR failures reported in that time.
|
|
Required jobs also need one or more nominated maintainers that are
|
|
responsible for the stability of their jobs. Maintainers can be registered
|
|
in [`maintainers.yml`](https://github.com/kata-containers/kata-containers.github.io/blob/main/maintainers.yml)
|
|
and will then show on the CI Dashboard.
|
|
|
|
To add transparency to making jobs required/non-required and to keep the
|
|
GitHub UI in sync with the [Gatekeeper job](../tools/testing/gatekeeper),
|
|
the process to update a job's required state is as follows:
|
|
1. Create a PR to update `maintainers.yml`, if new maintainers are being
|
|
declared on a CI job.
|
|
1. Create a PR which updates
|
|
[`required-tests.yaml`](../tools/testing/gatekeeper/required-tests.yaml)
|
|
adding the new job and listing the evidence that the job meets the
|
|
requirements above. Ensure that all maintainers and
|
|
@kata-containers/architecture-committee are notified to give them the
|
|
opportunity to review the PR. See
|
|
[#11015](https://github.com/kata-containers/kata-containers/pull/11015)
|
|
as an example.
|
|
1. The maintainers and Architecture Committee get a chance to review the PR.
|
|
It can be discussed in an AC meeting to get broader input.
|
|
1. Once the PR has been merged, a Kata Containers admin should be notified
|
|
to ensure that the GitHub UI is updated to reflect the change in
|
|
`required-tests.yaml`.
|
|
|
|
#### Expectation of required job maintainers
|
|
|
|
Due to the nature of the Kata Containers community having contributors
|
|
spread around the world, required jobs being blocked due to infrastructure,
|
|
or test issues can have a big impact on work. As such, the expectation is
|
|
that when a problem with a required job is noticed/reported, the maintainers
|
|
have one working day to acknowledge the issue, perform an initial
|
|
investigation and then either fix it, or get it marked as non-required
|
|
whilst the investigation and/or fix it done.
|
|
|
|
### Re-marking of required status
|
|
|
|
Once a job has been removed from the required list, it requires two
|
|
consecutive successful nightly test runs before being made required
|
|
again.
|
|
|
|
## Running tests
|
|
|
|
### Running the tests as part of the CI
|
|
|
|
If you're a maintainer of the project, you'll be able to kick in the tests by
|
|
yourself. With the current approach, you just need to add the `ok-to-test`
|
|
label and the tests will automatically start. We're moving, though, to use a
|
|
`/test` command as part of a GitHub review comment, which will simplify this
|
|
process.
|
|
|
|
If you're not a maintainer, please, send a message on Slack or wait till one of
|
|
the maintainers reviews your PR. Maintainers will then kick in the tests on
|
|
your behalf.
|
|
|
|
In case a test fails and there's the suspicion it happens due to flakiness in
|
|
the test itself, please, create an issue for us, and then re-run (or asks
|
|
maintainers to re-run) the tests following these steps:
|
|
|
|
- Locate which tests is failing
|
|
- Click in "details"
|
|
- In the top right corner, click in "Re-run jobs"
|
|
- And then in "Re-run failed jobs"
|
|
- And finally click in the green "Re-run jobs" button
|
|
|
|
> [!NOTE]
|
|
> TODO: We need figures here
|
|
|
|
### Running the tests locally
|
|
|
|
In this section, aligning expectations is also something very important, as one
|
|
will not be able to run the tests exactly in the same way the tests are running
|
|
in the CI, as one most likely won't have access to an Azure subscription.
|
|
However, we're trying our best here to provide you with instructions on how to
|
|
run the tests in an environment that's "close enough" and will help you to debug
|
|
issues you find with the current tests, or even provide a proof-of-concept to
|
|
the new test you're trying to add.
|
|
|
|
The basic steps, which we will cover in details down below are:
|
|
|
|
1. Create a VM matching the configuration of the target runner
|
|
2. Generate the artifacts you'll need for the test, or download them from a
|
|
current failed run
|
|
3. Follow the steps provided in the action itself to run the tests.
|
|
|
|
Although the general overview looks easy, we know that some tricks need to be
|
|
shared, and we'll go through the general process of debugging one non-Kubernetes
|
|
and one Kubernetes specific test for educational purposes.
|
|
|
|
One important thing to note is that "Create a VM" can be done in innumerable
|
|
different ways, using the tools of your choice. For the sake of simplicity on
|
|
this guide, we'll be using `kcli`, which we strongly recommend in case you're a
|
|
non-experienced user, and happen to be developing on a Linux box.
|
|
|
|
For both non-Kubernetes and Kubernetes cases, we'll be using PR #8070 as an
|
|
example, which at the time this document is being written serves us very well
|
|
the purpose, as you can see that we have `nerdctl` and Kubernetes tests failing.
|
|
|
|
## Debugging tests
|
|
|
|
### Debugging a non Kubernetes test
|
|
|
|
As shown above, the `nerdctl` test is failing.
|
|
|
|
As a developer you can go ahead to the details of the job, and expand the job
|
|
that's failing in order to gather more information.
|
|
|
|
But when that doesn't help, we need to set up our own environment to debug
|
|
what's going on.
|
|
|
|
Taking a look at the `nerdctl` test, which is located here, you can easily see
|
|
that it runs-on a `garm-ubuntu-2304-smaller` virtual machine.
|
|
|
|
The important parts to understand are `ubuntu-2304`, which is the OS where the
|
|
test is running on; and "smaller", which means we're running it on a machine
|
|
with 2 CPUs and 8GB of RAM.
|
|
|
|
With this information, we can go ahead and create a similar VM locally using `kcli`.
|
|
|
|
```bash
|
|
$ sudo kcli create vm -i ubuntu2304 -P disks=[60] -P numcpus=2 -P memory=8192 -P cpumodel=host-passthrough debug-nerdctl-pr8070
|
|
```
|
|
|
|
In order to run the tests, you'll need the "kata-tarball" artifacts, which you
|
|
can build your own using "make kata-tarball" (see below), or simply get them
|
|
from the PR where the tests failed. To download them, click on the "Summary"
|
|
button that's on the top left corner, and then scroll down till you see the
|
|
artifacts, as shown below.
|
|
|
|
Unfortunately GitHub doesn't give us a link that we can download those from
|
|
inside the VM, but we can download them on our local box, and then `scp` the
|
|
tarball to the newly created VM that will be used for debugging purposes.
|
|
|
|
> [!NOTE]
|
|
> Those artifacts are only available (for 15 days) when all jobs are finished.
|
|
|
|
Once you have the `kata-static.tar.xz` in your VM, you can login to the VM with
|
|
`kcli ssh debug-nerdctl-pr8070`, go ahead and then clone your development branch
|
|
|
|
```bash
|
|
$ git clone --branch feat_add-fc-runtime-rs https://github.com/nubificus/kata-containers
|
|
```
|
|
|
|
Add the upstream as a remote, set up your git, and rebase your branch atop of the upstream main one
|
|
|
|
```bash
|
|
$ git remote add upstream https://github.com/kata-containers/kata-containers
|
|
$ git remote update
|
|
$ git config --global user.email "you@example.com"
|
|
$ git config --global user.name "Your Name"
|
|
$ git rebase upstream/main
|
|
```
|
|
|
|
Now copy the `kata-static.tar.xz` into your `kata-containers/kata-artifacts` directory
|
|
|
|
```bash
|
|
$ mkdir kata-artifacts
|
|
$ cp ../kata-static.tar.xz kata-artifacts/
|
|
```
|
|
|
|
> [!NOTE]
|
|
> If you downloaded the .zip from GitHub you need to uncompress first to see `kata-static.tar.xz`
|
|
|
|
And finally run the tests following what's in the yaml file for the test you're
|
|
debugging.
|
|
|
|
In our case, the `run-nerdctl-tests-on-garm.yaml`.
|
|
|
|
When looking at the file you'll notice that some environment variables are set,
|
|
such as `KATA_HYPERVISOR`, and should be aware that, for this particular example,
|
|
the important steps to follow are:
|
|
|
|
Install the dependencies
|
|
Install kata
|
|
Run the tests
|
|
|
|
Let's now run the steps mentioned above exporting the expected environment variables
|
|
|
|
```bash
|
|
$ export KATA_HYPERVISOR=dragonball
|
|
$ bash ./tests/integration/nerdctl/gha-run.sh install-dependencies
|
|
$ bash ./tests/integration/nerdctl/gha-run.sh install-kata
|
|
$ bash tests/integration/nerdctl/gha-run.sh run
|
|
```
|
|
|
|
And with this you should've been able to reproduce exactly the same issue found
|
|
in the CI, and from now on you can build your own code, use your own binaries,
|
|
and have fun debugging and hacking!
|
|
|
|
### Debugging a Kubernetes test
|
|
|
|
Steps for debugging the Kubernetes tests are very similar to the ones for
|
|
debugging non-Kubernetes tests, with the caveat that what you'll need, this
|
|
time, is not the `kata-static.tar.xz` tarball, but rather a payload to be used
|
|
with kata-deploy.
|
|
|
|
In order to generate your own kata-deploy image you can generate your own
|
|
`kata-static.tar.xz` and then take advantage of the following script. Be aware
|
|
that the image generated and uploaded must be accessible by the VM where you'll
|
|
be performing your tests.
|
|
|
|
In case you want to take advantage of the payload that was already generated
|
|
when you faced the CI failure, which is considerably easier, take a look at the
|
|
failed job, then click in "Deploy Kata" and expand the "Final kata-deploy.yaml
|
|
that is used in the test" section. From there you can see exactly what you'll
|
|
have to use when deploying kata-deploy in your local cluster.
|
|
|
|
> [!NOTE]
|
|
> TODO: WAINER TO FINISH THIS PART BASED ON HIS PR TO RUN A LOCAL CI
|
|
|
|
## Adding new runners
|
|
|
|
Any admin of the project is able to add or remove GitHub runners, and those are
|
|
the folks you should rely on.
|
|
|
|
If you need a new runner added, please, tag @ac in the Kata Containers slack,
|
|
and someone from that group will be able to help you.
|
|
|
|
If you're part of that group and you're looking for information on how to help
|
|
someone, this is simple, and must be done in private. Basically what you have to
|
|
do is:
|
|
|
|
- Go to the kata-containers/kata-containers repo
|
|
- Click on the Settings button, located in the top right corner
|
|
- On the left panel, under "Code and automation", click on "Actions"
|
|
- Click on "Runners"
|
|
|
|
If you want to add a new self-hosted runner:
|
|
|
|
- In the top right corner there's a green button called "New self-hosted runner"
|
|
|
|
If you want to remove a current self-hosted runner:
|
|
|
|
- For each runner there's a "..." menu, where you can just click and the
|
|
"Remove runner" option will show up
|
|
|
|
## Known limitations
|
|
|
|
As the GitHub actions are structured right now we cannot: Test the addition of a
|
|
GitHub action that's not triggered by a pull_request event as part of the PR.
|
|
|
|
[gh-actions]: https://docs.github.com/en/actions
|
|
[monitor-ex01]: https://github.com/kata-containers/kata-containers/commit/a3fb067f1bccde0cbd3fd4d5de12dfb3d8c28b60
|
|
[monitor-ex02]: https://github.com/kata-containers/kata-containers/commit/489caf1ad0fae27cfd00ba3c9ed40e3d512fa492
|