diff --git a/ci/README.md b/ci/README.md new file mode 100644 index 0000000000..e4b0fb10f6 --- /dev/null +++ b/ci/README.md @@ -0,0 +1,343 @@ +# Kata Containers CI + +> [!WARNING] +> While this project's CI has several areas for improvement, it is constantly +> evolving. This document attempts to describe its current state, but due to +> ongoing changes, you may notice some outdated information here. Feel free to +> modify/improve this document as you use the CI and notice anything odd. The +> community appreciates it! + +## Introduction + +The Kata Containers CI relies on [GitHub Actions][gh-actions], where the actions +themselves can be found in the `.github/workflows` directory, and they may call +helper scripts, which are located under the `tests` directory, to actually +perform the tasks required for each test case. + +## The different workflows + +There are a few different sets of workflows that are running as part of our CI, +and here we're going to cover the ones that are less likely to get rotten. With +this said, it's fair to advise that if the reader finds something that got +rotten, opening an issue to the project pointing to the problem is a nice way to +help, and providing a fix for the issue is a very encouraging way to help. + +### Jobs that run automatically when a PR is raised + +These are a bunch of tests that will automatically run as soon as a PR is +opened, they're mostly running on "cost free" runners, and they do some +pre-checks to evaluate that your PR may be okay to start getting reviewed. + +Mind, though, that the community expects the contributors to, at least, build +their code before submitting a PR, which the community sees as a very fair +request. + +Without getting into the weeds with details on this, those jobs are the ones +responsible for ensuring that: + +- The commit message is in the expected format +- There's no missing Developer's Certificate of Origin +- Static checks are passing + +### Jobs that require a maintainer's approval to run + +These are the required tests, and our so-called "CI". These require a +maintainer's approval to run as parts of those jobs will be running on "paid +runners", which are currently using Azure infrastructure. + +Once a maintainer of the project gives "the green light" (currently by adding an +`ok-to-test` label to the PR, soon to be changed to commenting "/test" as part +of a PR review), the following tests will be executed: + +- Build all the components (runs on free cost runners, or bare-metal depending on the architecture) +- Create a tarball with all the components (runs on free cost runners, or bare-metal depending on the architecture) +- Create a kata-deploy payload with the tarball generated in the previous step (runs on free costs runner, or bare-metal depending on the architecture) +- Run the following tests: + - Tests depending on the generated tarball + - Metrics (runs on bare-metal) + - `docker` (runs on Azure small instances) + - `nerdctl` (runs on Azure small instances) + - `kata-monitor` (runs on Azure small instances) + - `cri-containerd` (runs on Azure small instances) + - `nydus` (runs on Azure small instances) + - `vfio` (runs on Azure normal instances) + - Tests depending on the generated kata-deploy payload + - kata-deploy (runs on Azure small instances) + - Tests are performed using different "Kubernetes flavors", such as k0s, k3s, rke2, and Azure Kubernetes Service (AKS). + - Kubernetes (runs in Azure small and medium instances depending on what's required by each test, and on TEE bare-metal machines) + - Tests are performed with different runtime engines, such as CRI-O and containerd. + - Tests are performed with different snapshotters for containerd, namely OverlayFS and devmapper. + - Tests are performed with all the supported hypervisors, which are Cloud Hypervisor, Dragonball, Firecracker, and QEMU. + +For all the tests relying on Azure instances, real money is being spent, so the +community asks for the maintainers to be mindful about those, and avoid abusing +them to merely debug issues. + +## The different runners + +In the previous section we've mentioned using different runners, now in this section we'll go through each type of runner used. + +- Cost free runners: Those are the runners provided by GIthub itself, and + those are fairly small machines with no virtualization capabilities enabled - +- Azure small instances: Those are runners which have virtualization + capabilities enabled, 2 CPUs, and 8GB of RAM. These runners have a "-smaller" + suffix to their name. +- Azure normal instances: Those are runners which have virtualization + capabilities enabled, 4 CPUs, and 16GB of RAM. These runners are usually + `garm` ones with no "-smaller" suffix. +- Bare-metal runners: Those are runners provided by community contributors, + and they may vary in architecture, size and virtualization capabilities. + Builder runners don't actually require any virtualization capabilities, while + runners which will be actually performing the tests must have virtualization + capabilities and a reasonable amount for CPU and RAM available (at least + matching the Azure normal instances). + +## Adding new tests + +Before someone decides to add a new test, we strongly recommend them to go +through [GitHub Actions Documentation][gh-actions], +which will provide you a very sensible background on how to read and understand +current tests we have, and also become familiar with how to write a new test. + +On the Kata Containers land, there are basically two sets of tests: "standalone" +and "part of something bigger". + +The "standalone" tests, for example the commit message check, won't be covered +here as they're better covered by the GitHub Actions documentation pasted above. + +The "part of something bigger" is the more complicated one and not so +straightforward to add, so we'll be focusing our efforts on describing the +addition of those. + +> [!NOTE] +> TODO: Currently, this document refers to "tests" when it actually means the +> jobs (or workflows) of GitHub. In an ideal world, except in some specific cases, +> new tests should be added without the need to add new workflows. In the +> not-too-distant future (hopefully), we will improve the workflows to support +> this. + +### Adding a new test that's "part of something bigger" + +The first important thing here is to align expectations, and we must say that +the community strongly prefers receiving tests that already come with: + +- Instructions how to run them +- A proven run where it's passing + +There are several ways to achieve those two requirements, and an example of that +can be seen in PR #8115. + +With the expectations aligned, adding a test consists in: + +- Adding a new yaml file for your test, and ensure it's called from the + "bigger" yaml. See the [Kata Monitor test example][monitor-ex01]. + +- Adding the helper scripts needed for your test to run. Again, use the [Kata Monitor script as example][monitor-ex02]. + +Following those examples, the community advice during the review, and even +asking the community directly on Slack are the best ways to get your test +accepted. + +## Running tests + +### Running the tests as part of the CI + +If you're a maintainer of the project, you'll be able to kick in the tests by +yourself. With the current approach, you just need to add the `ok-to-test` +label and the tests will automatically start. We're moving, though, to use a +`/test` command as part of a GitHub review comment, which will simplify this +process. + +If you're not a maintainer, please, send a message on Slack or wait till one of +the maintainers reviews your PR. Maintainers will then kick in the tests on +your behalf. + +In case a test fails and there's the suspicion it happens due to flakiness in +the test itself, please, create an issue for us, and then re-run (or asks +maintainers to re-run) the tests following these steps: + +- Locate which tests is failing +- Click in "details" +- In the top right corner, click in "Re-run jobs" +- And then in "Re-run failed jobs" +- And finally click in the green "Re-run jobs" button + +> [!NOTE] +> TODO: We need figures here + +### Running the tests locally + +In this section, aligning expectations is also something very important, as one +will not be able to run the tests exactly in the same way the tests are running +in the CI, as one most likely won't have access to an Azure subscription. +However, we're trying our best here to provide you with instructions on how to +run the tests in an environment that's "close enough" and will help you to debug +issues you find with the current tests, or even provide a proof-of-concept to +the new test you're trying to add. + +The basic steps, which we will cover in details down below are: + + 1. Create a VM matching the configuration of the target runner + 2. Generate the artifacts you'll need for the test, or download them from a + current failed run + 3. Follow the steps provided in the action itself to run the tests. + +Although the general overview looks easy, we know that some tricks need to be +shared, and we'll go through the general process of debugging one non-Kubernetes +and one Kubernetes specific test for educational purposes. + +One important thing to note is that "Create a VM" can be done in innumerable +different ways, using the tools of your choice. For the sake of simplicity on +this guide, we'll be using `kcli`, which we strongly recommend in case you're a +non-experienced user, and happen to be developing on a Linux box. + +For both non-Kubernetes and Kubernetes cases, we'll be using PR #8070 as an +example, which at the time this document is being written serves us very well +the purpose, as you can see that we have `nerdctl` and Kubernetes tests failing. + +## Debugging tests + +### Debugging a non Kubernetes test + +As shown above, the `nerdctl` test is failing. + +As a developer you can go ahead to the details of the job, and expand the job +that's failing in order to gather more information. + +But when that doesn't help, we need to set up our own environment to debug +what's going on. + +Taking a look at the `nerdctl` test, which is located here, you can easily see +that it runs-on a `garm-ubuntu-2304-smaller` virtual machine. + +The important parts to understand are `ubuntu-2304`, which is the OS where the +test is running on; and "smaller", which means we're running it on a machine +with 2 CPUs and 8GB of RAM. + +With this information, we can go ahead and create a similar VM locally using `kcli`. + +```bash +$ sudo kcli create vm -i ubuntu2304 -P disks=[60] -P numcpus=2 -P memory=8192 -P cpumodel=host-passthrough debug-nerdctl-pr8070 +``` + +In order to run the tests, you'll need the "kata-tarball" artifacts, which you +can build your own using "make kata-tarball" (see below), or simply get them +from the PR where the tests failed. To download them, click on the "Summary" +button that's on the top left corner, and then scroll down till you see the +artifacts, as shown below. + +Unfortunately GitHub doesn't give us a link that we can download those from +inside the VM, but we can download them on our local box, and then `scp` the +tarball to the newly created VM that will be used for debugging purposes. + +> [!NOTE] +> Those artifacts are only available (for 15 days) when all jobs are finished. + +Once you have the `kata-static.tar.xz` in your VM, you can login to the VM with +`kcli ssh debug-nerdctl-pr8070`, go ahead and then clone your development branch + +```bash +$ git clone --branch feat_add-fc-runtime-rs https://github.com/nubificus/kata-containers +``` + +Add the upstream as a remote, set up your git, and rebase your branch atop of the upstream main one + +```bash +$ git remote add upstream https://github.com/kata-containers/kata-containers +$ git remote update +$ git config --global user.email "you@example.com" +$ git config --global user.name "Your Name" +$ git rebase upstream/main +``` + +Now copy the `kata-static.tar.xz` into your `kata-containers/kata-artifacts` directory + +```bash +$ mkdir kata-artifacts +$ cp ../kata-static.tar.xz kata-artifacts/ +``` + +> [!NOTE] +> If you downloaded the .zip from GitHub you need to uncompress first to see `kata-static.tar.xz` + +And finally run the tests following what's in the yaml file for the test you're +debugging. + +In our case, the `run-nerdctl-tests-on-garm.yaml`. + +When looking at the file you'll notice that some environment variables are set, +such as `KATA_HYPERVISOR`, and should be aware that, for this particular example, +the important steps to follow are: + +Install the dependencies +Install kata +Run the tests + +Let's now run the steps mentioned above exporting the expected environment variables + +```bash +$ export KATA_HYPERVISOR=dragonball +$ bash ./tests/integration/nerdctl/gha-run.sh install-dependencies +$ bash ./tests/integration/nerdctl/gha-run.sh install-kata +$ bash tests/integration/nerdctl/gha-run.sh run +``` + +And with this you should've been able to reproduce exactly the same issue found +in the CI, and from now on you can build your own code, use your own binaries, +and have fun debugging and hacking! + +### Debugging a Kubernetes test + +Steps for debugging the Kubernetes tests are very similar to the ones for +debugging non-Kubernetes tests, with the caveat that what you'll need, this +time, is not the `kata-static.tar.xz` tarball, but rather a payload to be used +with kata-deploy. + +In order to generate your own kata-deploy image you can generate your own +`kata-static.tar.xz` and then take advantage of the following script. Be aware +that the image generated and uploaded must be accessible by the VM where you'll +be performing your tests. + +In case you want to take advantage of the payload that was already generated +when you faced the CI failure, which is considerably easier, take a look at the +failed job, then click in "Deploy Kata" and expand the "Final kata-deploy.yaml +that is used in the test" section. From there you can see exactly what you'll +have to use when deploying kata-deploy in your local cluster. + +> [!NOTE] +> TODO: WAINER TO FINISH THIS PART BASED ON HIS PR TO RUN A LOCAL CI + +## Adding new runners + +Any admin of the project is able to add or remove GitHub runners, and those are +the folks you should rely on. + +If you need a new runner added, please, tag @ac in the Kata Containers slack, +and someone from that group will be able to help you. + +If you're part of that group and you're looking for information on how to help +someone, this is simple, and must be done in private. Basically what you have to +do is: + +- Go to the kata-containers/kata-containers repo +- Click on the Settings button, located in the top right corner +- On the left panel, under "Code and automation", click on "Actions" +- Click on "Runners" + +If you want to add a new self-hosted runner: + +- In the top right corner there's a green button called "New self-hosted runner" + +If you want to remove a current self-hosted runner: + +- For each runner there's a "..." menu, where you can just click and the + "Remove runner" option will show up + +## Known limitations + +As the GitHub actions are structured right now we cannot: Test the addition of a +GitHub action that's not triggered by a pull_request event as part of the PR. + +[gh-actions]: https://docs.github.com/en/actions +[monitor-ex01]: https://github.com/kata-containers/kata-containers/commit/a3fb067f1bccde0cbd3fd4d5de12dfb3d8c28b60 +[monitor-ex02]: https://github.com/kata-containers/kata-containers/commit/489caf1ad0fae27cfd00ba3c9ed40e3d512fa492 diff --git a/docs/README.md b/docs/README.md index 170e17d1ab..0760c268b7 100644 --- a/docs/README.md +++ b/docs/README.md @@ -40,6 +40,7 @@ Documents that help to understand and contribute to Kata Containers. ### Design and Implementations * [Kata Containers Architecture](design/architecture): Architectural overview of Kata Containers +* [Kata Containers CI](../ci/README.md): Kata Containers CI document * [Kata Containers E2E Flow](design/end-to-end-flow.md): The entire end-to-end flow of Kata Containers * [Kata Containers design](./design/README.md): More Kata Containers design documents * [Kata Containers threat model](./threat-model/threat-model.md): Kata Containers threat model diff --git a/tests/cmd/check-spelling/kata-dictionary.dic b/tests/cmd/check-spelling/kata-dictionary.dic index 9313d39cf5..5be515be8f 100644 --- a/tests/cmd/check-spelling/kata-dictionary.dic +++ b/tests/cmd/check-spelling/kata-dictionary.dic @@ -222,6 +222,7 @@ deliverable/AB deploy dev devicemapper/B +devmapper dialer dialog/A dind/B @@ -333,6 +334,7 @@ serverless signoff/A snapcraft/B snapd/B +snapshotters stalebot/B startup stderr/AB