Commit Graph

11350 Commits

Author SHA1 Message Date
Fabiano Fidêncio
1b7c7901d9 local-build: Remove $HOME/.docker/buildx/activity/default
The file can be removed between builds without causing any issue, and
leaving it around has been causing us some headache due to:
```
ERROR: open /home/runner/.docker/buildx/activity/default: permission denied
```

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 3818bf3311)
2023-09-21 13:26:52 +02:00
Fabiano Fidêncio
6a34bae03d gha: Avoid "fail-fast" in tests that are known to be flaky
Otherwise we'll have to re-run all the tests due to a flaky behaviour in
one of the parts.

Fixes: #7757

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit fb49d5d7ce)
2023-09-21 13:26:26 +02:00
Dan Mihai
17d22cae34 tests: use unique test name
k8s-pid-ns.bats was already using the test name from
k8s-kill-all-process-in-container.bats - probably a copy/paste bug.

Fixes: #7753

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
(cherry picked from commit 183f51d6f6)
2023-09-21 13:26:18 +02:00
Dan Mihai
e8c24fa0b9 tests: delete k8s deployment at the test's end
At the end of k8s-kill-all-process-in-container.bats, delete the
deployment it created.

Fixes: #7752

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
(cherry picked from commit 6a974679f2)
2023-09-21 13:26:06 +02:00
Gabriela Cervantes
3e07c89d39 metrics: Remove unused variable in tensorflow nhwc script
This PR removes unused variable in tensorflow nhwc script.

Fixes #7750

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 32a778b6da)
2023-09-21 13:25:56 +02:00
Fabiano Fidêncio
5b9a69433d kata-deploy: Don't try to remove /opt/kata
The directory is a host path mount and cannot be removed from within the
container.  What we actually want to remove is whatever is inside that
directory.

This may raise errors like:
```
rm: cannot remove '/opt/kata/': Device or resource busy
```

Fixes: #7746

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit d8f3ce6497)
2023-09-21 13:25:48 +02:00
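The distinction drawn in this commit, emptying a mounted directory instead of deleting the mount point itself, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the kata-deploy script: `clear_directory` is a hypothetical helper name, and a temporary directory stands in for `/opt/kata`.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory(root: Path) -> None:
    """Remove everything inside `root` while leaving `root` itself in place."""
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            # Real subdirectories are removed recursively.
            shutil.rmtree(entry)
        else:
            # Files and symlinks; missing_ok mirrors the tolerance of `rm -f`.
            entry.unlink(missing_ok=True)


# Demonstration on a throwaway directory (a stand-in for the host path mount).
with tempfile.TemporaryDirectory() as tmp:
    mount = Path(tmp)
    (mount / "bin").mkdir()
    (mount / "bin" / "tool").write_text("x")
    (mount / "config.toml").write_text("y")

    clear_directory(mount)

    # The directory itself survives; only its contents are gone.
    print(mount.exists(), list(mount.iterdir()))
```

Iterating over the children avoids ever calling a remove operation on the mount point, which is the call that fails with "Device or resource busy".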
Jeremi Piotrowski
e99a13d26c gha: vfio: Run on Ubuntu 23.04 runner
The vfio test requires nested-nested virtualization:

L0 Azure host
-> L1 Ubuntu VM
  -> L2 Fedora VM
    -> L3 Kata

This hits a kernel bug on v5.15 but works quite nicely on the v6.2 kernel
included in Ubuntu 23.04. We can switch back to Ubuntu 22.04 when they roll out
v6.2.

Fixes: #6555
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
(cherry picked from commit 936e8091a7)
2023-09-21 13:25:35 +02:00
Jeremi Piotrowski
394d146b89 local-build: Remove GID before creating group
docker install now creates a group with gid 999 which happens to match what we
need to get docker-in-docker to work. Remove the group first as we don't need
it.

Fixes: #7726
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
(cherry picked from commit 3b881fbc0e)
2023-09-21 13:25:18 +02:00
Gabriela Cervantes
7421737229 metrics: Add TensorFlow ResNet50 fp32 Dockerfile
This PR adds the TensorFlow ResNet50 fp32 Dockerfile for kata metrics.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 959ca49447)
2023-09-21 13:25:10 +02:00
Gabriela Cervantes
9acbf2faf7 metrics: Add TensorFlow ResNet50 FP32 benchmark
This PR adds TensorFlow ResNet50 FP32 benchmark for kata metrics.

Fixes #7735

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 4b7d72c4a8)
2023-09-21 13:25:03 +02:00
Fabiano Fidêncio
4f2c9372c3 kata-deploy: Avoid failing on content removal
We can simply use `rm -f` all over the place and avoid the container
returning any error.

Fixes: #7733

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 5cba38c175)
2023-09-21 13:24:56 +02:00
Gabriela Cervantes
6ea1d3bffd metrics: Add disk link to README
This PR adds disk link to README documentation for kata metrics.

Fixes #7721

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 8afd158cef)
2023-09-21 13:24:35 +02:00
Gabriela Cervantes
ad2036927f metrics: Fix FIO path
This PR fixes the FIO path for the FIO files.

Fixes #7711

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit eee2ee6eeb)
2023-09-21 13:24:06 +02:00
Gabriela Cervantes
abcb225ce3 metrics: Use function from metrics common in pytorch script
This PR uses a function from metrics common in the pytorch script.

Fixes #7709

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 39bc3488f5)
2023-09-21 13:23:58 +02:00
Dan Mihai
508f1bba15 gha: capture additional kata-deploy output
10 lines can be insufficient for diagnostics.

Fixes: #7707

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
(cherry picked from commit 400eb88743)
2023-09-21 13:23:48 +02:00
David Esparza
d46c300608 metrics: Enable kata runtime in K8s for FIO test.
This PR configures the corresponding kata runtime in K8s
based on the tested hypervisor.

This PR also enables FIO metrics test in the kata metrics-ci.

Fixes: #7665

Signed-off-by: David Esparza <david.esparza.borquez@intel.com>
(cherry picked from commit fb571f8be9)
2023-09-21 13:23:36 +02:00
Gabriela Cervantes
3d3882a06a metrics: Update tensorflow name in gha run script
This PR updates the tensorflow name in the gha run script.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 85c02828e1)
2023-09-21 13:23:17 +02:00
Gabriela Cervantes
7d0a3dbf24 metrics: Fix check results for tensorflow benchmark
This PR fixes the check results for the tensorflow benchmark now
that we have changed the name of the test.

Fixes #7684

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit e8a5119343)
2023-09-21 13:23:09 +02:00
Fabiano Fidêncio
3e2a383b7d gha: kata-deploy: Do the runtime class cleanup as part of the cleanup
Instead of doing this as part of the test itself, let's ensure it's done
before running the tests and during the tests cleanup.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 2d896ad12f)
2023-09-21 13:23:02 +02:00
Fabiano Fidêncio
2c5db14a1a gha: kata-deploy: Add the first kata-deploy test
This test, at least for now, only checks whether the runtimeclasses
have been properly created.

This is just a migration from a test we had as part of the k8s suite.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 4ffc2c86f3)
2023-09-21 13:22:56 +02:00
Gabriela Cervantes
0b4fb826de metrics: Remove unused variable in tensorflow mobilenet script
This PR removes unused variable in tensorflow mobilenet script.

Fixes #7679

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 8616c050ae)
2023-09-21 13:22:47 +02:00
Fabiano Fidêncio
b38624e2b3 tests: common: Ensure test_type is used as part of the cluster's name
By doing this we can make sure there won't be any clash on the cluster
name created for either the k8s or the kata-deploy tests.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 285e616b5e)
2023-09-21 13:22:40 +02:00
Fabiano Fidêncio
cdfcd9aba8 tests: common: Don't fail if yq is not part of the cache
This may happen on external runners.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 790bd3548d)
2023-09-21 13:22:33 +02:00
Fabiano Fidêncio
74edbaac96 gha: kata-deploy: Add run-kata-deploy-tests.sh
This will have the same function as run-k8s-tests.sh has, but for
kata-deploy.

Right now it doesn't have any tests, and the command to actually run the
tests is commented out; this is just a placeholder that will be
populated sooner rather than later.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit ce6adecd0a)
2023-09-21 13:22:27 +02:00
Fabiano Fidêncio
d7130f48b0 gha: k8s: Stop running kata-deploy tests as part of the k8s suite
In a follow-up series, we'll add a whole suite for the kata-deploy
tests.  With this in mind, let's already get rid of this one and avoid
having more kata-deploy tests land here.

Fixes: #7642

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit cfc29c11a3)
2023-09-21 13:22:21 +02:00
Aurélien Bombo
810507e8a3 tests: k8s: Call ensure_yq() in setup.sh
It wasn't the `common.bash` import in `run_kubernetes_tests.sh` causing
the yq error so let's try this instead.

Reference: https://github.com/kata-containers/kata-containers/actions/runs/5674941359/job/15379797568#step:10:341

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
(cherry picked from commit f4dd152863)
2023-09-21 13:22:10 +02:00
Aurélien Bombo
915bace795 kata-deploy: Properly create default runtime class
The default `kata` runtime class would get created with the `kata`
handler instead of `kata-$KATA_HYPERVISOR`. This made Kata use the wrong
hypervisor and broke CI.

Fixes: #7663

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
(cherry picked from commit 339569b69c)
2023-09-21 13:22:00 +02:00
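The handler mismatch described in this commit can be illustrated with a minimal sketch. The real kata-deploy logic is not shown in this log, so `default_runtime_class` is a hypothetical helper, and `"qemu"` is only an example value for `KATA_HYPERVISOR`; the manifest fields follow the Kubernetes `RuntimeClass` resource.

```python
def default_runtime_class(hypervisor: str) -> dict:
    """Build a RuntimeClass manifest named `kata` for the given hypervisor."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": "kata"},
        # The handler carries the hypervisor suffix; a bare "kata" handler
        # is the behaviour this commit describes as the bug.
        "handler": f"kata-{hypervisor}",
    }


manifest = default_runtime_class("qemu")  # "qemu" is only an example value
print(manifest["metadata"]["name"], "->", manifest["handler"])
```

The class name stays `kata` so that workloads can keep referring to the default, while the handler selects the hypervisor-specific runtime.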
Gabriela Cervantes
870d8004a0 metrics: Fix MobileNet help me description
This PR fixes MobileNet help me description in the
tensorflow script.

Fixes #7661

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 2a491e9b1f)
2023-09-21 13:21:54 +02:00
Fabiano Fidêncio
145450544d gha: ci: Start running kata-deploy tests
Let's add the tests as part of the ci.yaml, so they can be triggered as
part of each PR.

For this PR those tests won't be triggered, courtesy of the
`pull_request_target` event we rely on.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit d19a75e80c)
2023-09-21 13:21:46 +02:00
Gabriela Cervantes
bd29413721 docs: Fix TensorFlow word across the document
This PR fixes the spelling of TensorFlow so that it is uniform across
the document.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit bade6a5c3b)
2023-09-21 13:21:28 +02:00
Gabriela Cervantes
a845e94139 docs: Add Tensorflow Resnet50 documentation
This PR adds the Tensorflow Resnet50 documentation.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 1a1b207760)
2023-09-21 13:21:21 +02:00
Gabriela Cervantes
6e5a5b8249 metrics: Add Dockerfile for ResNet50 int8
This PR adds the dockerfile for ResNet50 int8 benchmark.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 24baededc0)
2023-09-21 13:21:13 +02:00
Gabriela Cervantes
5d85cac1d6 metrics: Add Tensorflow ResNet50 int8 benchmark
This PR adds the Tensorflow ResNet50 int8 script for kata metrics.

Fixes #7652

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 6d971ba8df)
2023-09-21 13:21:07 +02:00
Fabiano Fidêncio
7474e50ae2 gha: cri-containerd: Enable tests
As the cri-containerd tests have been fully migrated to GHA, let's make
sure we get them running.

Fixes: #6543

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit b3592ab25c)
2023-09-21 13:19:36 +02:00
Fabiano Fidêncio
20be3d93d5 gha: cri-containerd: Add timeout to the crictl calls on testContainerStop
On the runners, we're hitting a timeout that I cannot reproduce at all
when allocating the same instance and running the tests manually.

The default timeout to connect to the server is 2s when using `crictl`.
Let's increase this to 20s.

It's fairly important to mention that in the first tests I used a
timeout of 10s, and that helped but we still hit issues every now and
then.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 84dd02e0f9)
2023-09-21 13:19:28 +02:00
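A minimal sketch of the timeout change follows, assuming the tests invoke `crictl` with its global `--timeout` option. The exact command lines changed by the commit are not shown in this log, so `crictl_command` is a hypothetical helper and the container ID is a placeholder.

```python
import shlex


def crictl_command(args, timeout_seconds=20):
    """Return a crictl argv with an explicit connection timeout."""
    # --timeout is a global crictl option, so it goes before the subcommand.
    return ["crictl", "--timeout", f"{timeout_seconds}s", *args]


# "example-container-id" is a placeholder, not an ID taken from the tests.
cmd = crictl_command(["stop", "example-container-id"])
print(shlex.join(cmd))
```

Keeping the timeout as a parameter makes it straightforward to compare values such as the 10s and 20s mentioned in the commit message.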
Fabiano Fidêncio
10058f718a gha: cri-containerd: Show pod before deleting it
It'll help us to debug failures with the pod stop / pod delete.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit b29782984a)
2023-09-21 13:19:22 +02:00
Fabiano Fidêncio
585d5fba03 gha: cri-containerd: Print kata logs in case of error
We need this to fully understand the issues we're facing.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit ae0930824a)
2023-09-21 13:19:17 +02:00
Fabiano Fidêncio
2fea5a5f8b gha: cri-containerd: Group containerd logs
This improves readability in case of failures by a lot.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 6c8b2ffa60)
2023-09-21 13:19:11 +02:00
Fabiano Fidêncio
3c7597f4ba gha: cri-containerd: Ensure RUNTIME takes KATA_HYPERVISOR into account
Short commit log says it all.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 9e898701f5)
2023-09-21 13:19:04 +02:00
Gabriela Cervantes
738d808cac metrics: Rename tensorflow scripts
This PR renames the tensorflow scripts to include the data format
that is being used as we will have multiple tests with different
data and model formats for tensorflow so this will help us to
distinguish them.

Fixes #7645

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 18a7fd8e4e)
2023-09-21 13:18:52 +02:00
Fabiano Fidêncio
4bb8fcc0c0 tests: kata-deploy: Add placeholder for kata-deploy-tests-on-tdx
This will not be tested as part of the PR, thanks to the
`pull_request_target` event, but we want it to be added so we can build
on top of it in an upcoming series.

Fixes: #7642

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit e55fa93db9)
2023-09-21 13:18:42 +02:00
Fabiano Fidêncio
f5e14ef283 tests: kata-deploy: Add placeholder for kata-deploy-tests-on-aks
This will not be tested as part of the PR, thanks to the
`pull_request_target` event, but we want it to be added so we can build
on top of it in an upcoming series.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit d9ee17aaec)
2023-09-21 13:18:35 +02:00
Fabiano Fidêncio
e812c437fe tests: kata-deploy: Add functional/kata-deploy/gha-run.sh placeholder
Right now this file does nothing, as it's not even called by any GHA.
However, it'll be populated later on as part of a different series,
where we'll have kata-deploy specific tests running here.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit 831e73ff91)
2023-09-21 13:18:21 +02:00
Fabiano Fidêncio
c19cebfa80 tests: Add gha-run-k8s-common.sh
Let's split a good portion of `tests/integration/kubernetes/gha-run.sh`
out, and put it in a place where it can be used by the soon-to-come
kata-deploy specific tests.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
(cherry picked from commit af1b46bbf2)
2023-09-21 13:18:07 +02:00
David Esparza
4e8c512346 metrics: fix the loop used to stop kata components #7629
This PR fixed the loop that stops the kata-shim and the
hypervisors used in metrics checks.

Fixes: #7628

Signed-off-by: David Esparza <david.esparza.borquez@intel.com>
(cherry picked from commit 767434d50a)
2023-09-21 13:17:32 +02:00
Gabriela Cervantes
47f32c4983 metrics: Add cassandra statefulset yaml
This PR adds cassandra statefulset yaml for kata metrics.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 5d0f0d43c7)
2023-09-21 13:17:26 +02:00
Gabriela Cervantes
d5a14449fc metrics: Add cassandra service yaml
This PR adds the cassandra service yaml for the benchmark.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit c1dcc1396f)
2023-09-21 13:17:20 +02:00
Gabriela Cervantes
1292b51092 metrics: Add block loop pvc yaml for cassandra
This PR adds block loop pvc yaml for cassandra test.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 2297a0d1c5)
2023-09-21 13:17:13 +02:00
Gabriela Cervantes
105a556a30 metrics: Add block loop pv yaml for cassandra test
This PR adds the block loop pv yaml for cassandra test.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit e3d511946f)
2023-09-21 13:17:04 +02:00
Gabriela Cervantes
1b126eb4ce metrics: Add block loop pvc for cassandra test
This PR adds the block loop pvc for cassandra test.

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
(cherry picked from commit 9890271594)
2023-09-21 13:16:59 +02:00