2042 Commits

Author SHA1 Message Date
Francesco Romani
077c0aa1be node: graduate CPUManagerPolicyOptions to beta
We graduate the `CPUManagerPolicyOptions` feature to beta
in the 1.23 cycle, and we add new experimental feature gates
to guard new options which are planned in the 1.23 and in the
following cycles.

We introduce two additional feature gates, `CPUManagerPolicyAlphaOptions` and
`CPUManagerPolicyBetaOptions`. The basic idea is to avoid the
cumbersome process of adding a feature gate for each option, and to have
feature gates which track the maturity level of _groups_ of options.
Besides this change, the graduation process, and the process in general,
for adding new policy options is still unchanged.

The `full-pcpus-only` option added in the 1.22 cycle is intentionally
moved into the beta policy options.
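
The grouping idea described above can be sketched roughly as follows. This is
an illustration only, not the kubelet's actual code: the `maturity` type, the
two maps, `checkOption`, and the option name `hypothetical-alpha-option` are
all assumptions made for the example. Only the two gate names and the
placement of `full-pcpus-only` in the beta group come from this message.

```go
package main

import "fmt"

// maturity is a hypothetical type naming the group an option belongs to.
type maturity int

const (
	alpha maturity = iota
	beta
)

// optionMaturity maps each policy option to its maturity group.
// "hypothetical-alpha-option" is made up for illustration.
var optionMaturity = map[string]maturity{
	"full-pcpus-only":           beta,
	"hypothetical-alpha-option": alpha,
}

// groupGate maps each maturity group to the feature gate guarding it.
var groupGate = map[maturity]string{
	alpha: "CPUManagerPolicyAlphaOptions",
	beta:  "CPUManagerPolicyBetaOptions",
}

// checkOption returns an error if the option is unknown or if the gate
// guarding its maturity group is not enabled.
func checkOption(name string, enabledGates map[string]bool) error {
	m, ok := optionMaturity[name]
	if !ok {
		return fmt.Errorf("unknown CPU manager policy option %q", name)
	}
	if gate := groupGate[m]; !enabledGates[gate] {
		return fmt.Errorf("option %q requires feature gate %q", name, gate)
	}
	return nil
}

func main() {
	gates := map[string]bool{"CPUManagerPolicyBetaOptions": true}
	fmt.Println(checkOption("full-pcpus-only", gates))
	fmt.Println(checkOption("hypothetical-alpha-option", gates))
}
```

With this shape, adding a new option means adding one map entry rather than a
new feature gate, and graduating it means moving it to another group.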

For more details:
- KEP: https://github.com/kubernetes/enhancements/pull/2933
- sig-arch discussion:
  https://groups.google.com/u/1/g/kubernetes-sig-architecture/c/Nxsc7pfe5rw

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-09-29 11:40:03 +02:00
Kubernetes Prow Robot
e5c4defa8e Merge pull request #103370 from verb/1.22-cleanup-shareprocesses-e2e
Remove ShareProcessNamespace tags from e2e_node tests
2021-09-23 10:11:14 -07:00
Elana Hashman
47086a6623 Add test for recreating a static pod 2021-09-15 14:01:48 -04:00
Francesco Romani
54c7d8fbb1 e2e: TM: add option to fail instead of skip
The Topology Manager e2e tests want to run on a real multi-NUMA system
and to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available of such devices.

CI machines are neither multi-NUMA nor expose SRIOV devices, so most
of the tests will just skip, and we need to keep it like this until we
figure out how to enable these features.

However, some organizations can and want to run the testsuite on bare metal;
in this case, the current tests skip (rather than fail) on misconfigured
boxes, which reports a misleading result. It would be much better to
fail if the test preconditions aren't met.

To satisfy both needs, we add an option, controlled by an environment
variable, to fail (not skip) if the machine on which the tests run
doesn't meet the expectations (multi-NUMA, 4+ cores per NUMA cell,
exposed SRIOV VFs).
We keep the old behaviour as the default to stay CI friendly.
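
A minimal sketch of the skip-versus-fail decision described above. The
environment variable name `EXAMPLE_TM_FAIL_ON_UNMET`, the `machine` struct, and
the `decide` helper are hypothetical. This message does not name the real
variable, and the actual test code may be structured differently.

```go
package main

import (
	"fmt"
	"os"
)

type outcome string

const (
	outcomeRun  outcome = "run"
	outcomeSkip outcome = "skip"
	outcomeFail outcome = "fail"
)

// machine holds the preconditions listed in the commit message.
type machine struct {
	numaNodes    int
	coresPerNUMA int
	sriovVFs     int
}

// meetsExpectations checks: multi-NUMA, 4+ cores per NUMA cell, SRIOV VFs.
func (m machine) meetsExpectations() bool {
	return m.numaNodes >= 2 && m.coresPerNUMA >= 4 && m.sriovVFs > 0
}

// decide keeps the CI-friendly default (skip) unless failOnUnmet is set.
func decide(m machine, failOnUnmet bool) outcome {
	switch {
	case m.meetsExpectations():
		return outcomeRun
	case failOnUnmet:
		return outcomeFail
	default:
		return outcomeSkip
	}
}

func main() {
	// EXAMPLE_TM_FAIL_ON_UNMET is a hypothetical variable name.
	failOnUnmet := os.Getenv("EXAMPLE_TM_FAIL_ON_UNMET") == "true"
	ciBox := machine{numaNodes: 1, coresPerNUMA: 8, sriovVFs: 0}
	fmt.Println(decide(ciBox, failOnUnmet))
}
```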

Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-09-13 13:23:36 +02:00
Kubernetes Prow Robot
5261433627 Merge pull request #104606 from endocrimes/dani/device-driver-deflake
[Failing Test] Fix GPU Device Driver test in kubelet-serial
2021-09-10 04:20:00 -07:00
Danielle Lancashire
b970bb5fe0 e2e_node: Update GPU tests to reflect reality
In older versions of Kubernetes (at least pre-0.19, the earliest version
this test will run on unmodified), Pods that depended on devices could be
restarted after the device plugin had been removed. Currently, however,
this isn't possible: during ContainerManager.GetResources(), we
call DeviceManager.GetDeviceRunContainerOptions(), which fails because
there's no cached endpoint information for the plugin type.

This commit therefore breaks apart the existing test into two:
- One active test that validates that assignments are maintained across
  restarts
- One skipped test that validates the behaviour after GPUs have been
  removed, in case we decide that this is a bug that should be fixed in
  the future.
2021-09-06 19:03:15 +02:00
Danielle Lancashire
3884dcb909 e2e_node: run gpu pod long enough to become ready 2021-08-26 14:24:23 +02:00
Danielle Lancashire
7d7884c0e6 e2e_node: install gpu pod with PodClient
Prior to this change, the pod was not getting scheduled on the node as
we don't have a running scheduler in e2e_node. PodClient solves this
problem by manually assigning the pod to the node.
2021-08-26 14:22:22 +02:00
Danielle Lancashire
0cc8af82a1 e2e_node: use upstream gpu installer
The current GPU installer was built in 2017, from source that no longer
exists in Kubernetes ([adding commit][1]). The image was built on 2017-06-13.

Unfortunately, this installer no longer appears to work. When debugging
on the same node type as used by test-infra, it failed to build the
driver as the kernel sha was no longer available.

This led to needing to find a new way to install GPUs. The smallest
logical change was switching to [cos-gpu-installer][2]. There is a newer
version of this available on [googlesource][3] that I have not yet tested,
as it's not clear what the state of the project is, and I couldn't find
docs outside of the source itself.

We install things to the same location as previously to avoid needing
extra downstream changes. There are a couple of weird issues here
however, like needing to run the container twice to correctly update the
LD Cache.

[1]: 1e77594958/cluster/gce/gci/nvidia-gpus/Dockerfile
[2]: https://github.com/GoogleCloudPlatform/cos-gpu-installer
[3]: https://cos.googlesource.com/cos/tools/+/refs/heads/master/src/cmd/cos_gpu_installer/
2021-08-26 14:09:45 +02:00
Stephen Augustus
481cf6fbe7 generated: Run hack/update-gofmt.sh
Signed-off-by: Stephen Augustus <foo@auggie.dev>
2021-08-24 15:47:49 -04:00
Kubernetes Prow Robot
499a1f99a9 Merge pull request #104489 from liggitt/signal-buffer
Fix buffered signal channel go vet error
2021-08-20 14:53:58 -07:00
Jordan Liggitt
322bc82777 Fix buffered signal channel go vet error 2021-08-20 16:47:56 -04:00
Antonio Ojea
0cd75e8fec run hack/update-netparse-cve.sh 2021-08-20 10:42:09 +02:00
Kubernetes Prow Robot
40a9914801 Merge pull request #102916 from odinuge/serial-tests
Ensure images are pulled after eviction tests
2021-08-17 11:41:13 -07:00
Elana Hashman
c69f55519e Revert "E2E test for kubelet exit-on-lock-contention" 2021-08-11 10:45:46 -07:00
Imran Pochi
2c2661a411 e2e test: lock-file and exit-on-lock-contention
This commit adds an e2e test for the kubelet flags `--lock-file` and
`--exit-on-lock-contention`. Eventually we would like to move them to the
kubelet configuration file rather than flags.

This test is based on the premise that whenever there is contention
on the lock file (e.g. /var/run/kubelet.lock), the running
kubelet must terminate and wait for the lock on the lock file to
be released before starting again.

In this test we simulate that lock contention: the test
tries to acquire the lock on the lock file.

Success of the test is determined by the kubelet health check: it is
expected to fail while the test holds the lock, and to pass once the lock
on the lock file is released.

Signed-off-by: Imran Pochi <imran@kinvolk.io>
2021-08-09 15:27:54 +05:30
Elana Hashman
d2ed3b28b7 Revert "revert Bump DynamicKubeConfig metric deprecation to 1.23 by delta update" 2021-08-06 08:38:56 -07:00
Kubernetes Prow Robot
d4179be611 Merge pull request #104183 from SergeyKanzhelev/SergeyKanzhelev-node-e2e-approver
Add SergeyKanzhelev to node e2e test approvers
2021-08-05 20:55:28 -07:00
Kubernetes Prow Robot
4d87be3ec4 Merge pull request #104121 from dims/skip-node-e2e-test-for-recovering-from-ip-leak-with-docker
Skip node e2e test for recovering from ip leak with docker/ubuntu
2021-08-05 16:36:46 -07:00
Sergey Kanzhelev
023f6a90db Add SergeyKanzhelev to node e2e test approvers 2021-08-05 21:32:55 +00:00
Kubernetes Prow Robot
7f231f899b Merge pull request #103883 from ehashman/slow-e2es
Mark "update Node.Spec.ConfigSource" node e2es as slow
2021-08-05 14:10:37 -07:00
Kubernetes Prow Robot
01cd315f3e Merge pull request #104106 from ehashman/ehashman-node-e2e-owners
Add ehashman to node e2e test approvers
2021-08-05 08:18:49 -07:00
Kubernetes Prow Robot
3b84cc9e6b Merge pull request #104075 from kerthcet/cleanup/revert-dynamickubeconfig-metric
revert Bump DynamicKubeConfig metric deprecation to 1.23 by delta update
2021-08-05 08:18:40 -07:00
Davanum Srinivas
9351b57def Skip node e2e test for recovering from ip leak with docker
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-08-05 07:11:07 -04:00
kerthcet
8cf10d9a20 set showHiddenMetricsForVersion=1.22 in dynamicKubeletConfiguration test
Signed-off-by: kerthcet <kerthcet@gmail.com>
2021-08-05 01:04:54 +08:00
Elana Hashman
ac076838c8 Add ehashman to node e2e test approvers
List of files raised by matthyx in SIG Node during the 2021-08-03
meeting.
2021-08-03 10:48:06 -07:00
Davanum Srinivas
3463c2dfa9 Skip NVidia GPU test in node e2e CI jobs for containerd and other runtimes
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-08-03 08:48:44 -04:00
Kubernetes Prow Robot
9ff3b7e744 Merge pull request #104047 from ehashman/fix-node-e2e-logs
Log e2e-node kubelet output directly to file
2021-08-02 12:30:19 -07:00
Davanum Srinivas
dab19517e5 Explicitly restart kubelet to stabilize serial-containerd job
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2021-08-02 11:24:11 -04:00
Elana Hashman
a77f4f4c29 Log e2e-node kubelet output directly to file
For some reason when we send them to journald, many log lines are
consistently dropped as soon as the PLEG is started.

If we log directly to file, we don't have this problem. As a bonus, if
the tests crash, the kubelet logs will always be available since they
were already written; otherwise we normally wait until the end of the
test run to collect them from journald, meaning that we often end up
with empty logs.
2021-07-30 15:35:42 -07:00
Ryan Phillips
163e4974b6 e2e node server: fix crash in log line 2021-07-30 12:36:00 -05:00
Elana Hashman
59a7cc12c9 Mark failing node serial tests as flaky
Tracked in:
- https://github.com/kubernetes/kubernetes/issues/103690
- https://github.com/kubernetes/kubernetes/issues/103691
2021-07-28 10:39:30 -07:00
Elana Hashman
93146048b4 Mark "update Node.Spec.ConfigSource" node e2es as slow
- recover to last-known-good ConfigMap.KubeletConfigKey
  ~12m to run in CI, 13m locally
- non-nil last-known-good to a new non-nil last-known-good
  ~24m to run in CI
- recover to last-known-good ConfigMap
  ~12m to run in CI
- state transitions
  ~8m to run in CI
2021-07-23 12:40:24 -07:00
Nabarun Pal
77afa53f9d Add e2e testing manifest bundle to e2e_node test suite
Ref: https://kubernetes.slack.com/archives/C0BP8PW9G/p1627003199187100?thread_ts=1626988113.184100&cid=C0BP8PW9G

Signed-off-by: Nabarun Pal <pal.nabarun95@gmail.com>
2021-07-23 09:49:33 +05:30
David Porter
3af4fe8c9b Use pointer gomega comparison for UsageNanoCores 2021-07-22 01:08:36 -07:00
Kubernetes Prow Robot
ac8dca79af Merge pull request #103566 from wzshiming/fix/e2e-dbus-config-path
Fix dbus config path for GracefulNodeShutdown e2e
2021-07-15 12:39:14 -07:00
Kubernetes Prow Robot
4f9bfb39ad Merge pull request #102169 from odinuge/rlimit-tests
Ensure node-e2e-test can open enough files
2021-07-15 10:20:45 -07:00
Kubernetes Prow Robot
b55c980279 Merge pull request #102395 from odinuge/node_container_manager_test_skip_systemd
Skip node container manager test on systemd
2021-07-09 13:26:54 -07:00
Kubernetes Prow Robot
617064d732 Merge pull request #101432 from swatisehgal/smtaware
node: cpumanager: add options to reject non SMT-aligned workload
2021-07-08 21:04:53 -07:00
Francesco Romani
a2fb8b0039 smtalign: e2e: add tests
Add e2e tests to cover the basic flows for the `full-pcpus-only` option:
negative flow to ensure rejection with proper error message, and
positive flow to verify the actual cpu allocation.

Co-authored-by: Swati Sehgal <swsehgal@redhat.com>
Signed-off-by: Francesco Romani <fromani@redhat.com>
2021-07-08 23:15:37 +02:00
Shiming Zhang
5d80665b0a Fix dbus config path for GracefulNodeShutdown e2e 2021-07-08 10:41:44 +08:00
Sascha Grunert
2d0f99fba1 Fix resource metrics e2e test
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2021-07-05 11:16:05 +02:00
Kubernetes Prow Robot
62503f254e Merge pull request #103413 from mgutierrez98/refactor-whitelist-blacklist
Refactored files containing whitelist/blacklist to allowlist/denylist…
2021-07-01 18:12:25 -07:00
Kubernetes Prow Robot
062bc359ca Merge pull request #102444 from sanwishe/resourceStartTime
Expose container start time in kubelet /metrics/resource endpoint
2021-07-01 14:27:51 -07:00
mgutierrez98
1cfbb0aa25 remove webhook.go to revert changes to conformance test 2021-07-01 20:24:46 +00:00
Kubernetes Prow Robot
044fd6fdf6 Merge pull request #99829 from palnabarun/migrate-to-go-embed
Replace go-bindata with //go:embed
2021-06-30 10:37:03 -07:00
Lee Verberne
c11041ad99 Remove ShareProcessNamespace tags from e2e tests
This feature became GA in 1.17 and its feature gate was removed in 1.19. It
should run unconditionally.
2021-06-30 18:12:30 +02:00
Kubernetes Prow Robot
f2e47502fd Merge pull request #103076 from wzshiming/fix/flake-gracefulnodeshutdown-dbus
Fix the GracefulNodeShutdown e2e test running on dbus that refuses to manually start
2021-06-29 11:19:50 -07:00
Nabarun Pal
bbccf2ecb4 e2e-node: move to embedded test manifests
Signed-off-by: Nabarun Pal <pal.nabarun95@gmail.com>
2021-06-29 19:16:49 +05:30
Nabarun Pal
68b334d02b test: setup embedded file sources for manifests
Signed-off-by: Nabarun Pal <pal.nabarun95@gmail.com>
2021-06-29 19:16:46 +05:30