105399 Commits

Author SHA1 Message Date
Kevin Delgado
b35c444e42 Update fieldValidation godoc 2021-11-29 21:21:28 +00:00
Sergey Kanzhelev
a11453efbc remove ReallyCrashForTesting and clean up some references to HandleCrash behavior 2021-11-29 20:00:10 +00:00
Mike Spreitzer
95964c5b35 Correct Generator calls for executing seat count 2021-11-29 14:50:11 -05:00
Antonio Ojea
85797eba70 bump TestHTTP1DoNotReuseRequestAfterTimeout timeout
The test TestHTTP1DoNotReuseRequestAfterTimeout has to wait for the
request to time out to assert that subsequent requests do not
reuse the TCP connection.

It seems that the current value of 100ms causes issues on some CI
environments, and bumping the timeout seems to solve this flakiness.

We can bump the timeout value because it is really low compared to real
scenarios, and the bump still keeps it in the millisecond order.
2021-11-29 19:11:47 +01:00
menglong.qi
ea31d7b813 refactor: use utilerrors instead of join error msg 2021-11-28 17:16:17 +08:00
wpedrak
d5e1ee4de8 Make writing version.txt more resilient
Writing a file first truncates it and only then writes the new contents. Under disk space pressure this can leave the file empty. To mitigate this, we first create a file with the new version and then move it in place of the old one (to make sure that disk space is available).
2021-11-26 12:44:50 +01:00
Kubernetes Prow Robot
9a75e7b0fd
Merge pull request #106670 from palnabarun/1.23/update-publishing-bot-rules
publishing-bot: add 1.23 rules
2021-11-25 11:23:23 -08:00
Slavik Panasovets
6ba8c86fc3 add gce elb rbs opt-in annotation 2021-11-25 17:04:28 +00:00
HaoJie Liu
1dc1a37294
fix typo in /test/integration 2021-11-25 18:59:31 +08:00
Nabarun Pal
e8b177cfc1
publishing-bot: add 1.23 rules
Signed-off-by: Nabarun Pal <pal.nabarun95@gmail.com>
2021-11-25 11:25:39 +05:30
DingShujie
25cf49770c update k/utils to v0.0.0-20211116205334-6203023598ed 2021-11-25 09:29:03 +08:00
Kubernetes Prow Robot
aff056d8a1
Merge pull request #106660 from liggitt/smd-merge
Revert sigs.k8s.io/structured-merge-diff/v4 to v4.1.2
2021-11-24 13:37:31 -08:00
Kevin Klues
f8511877e2 Add regression test for CPUManager distribute NUMA algorithm
We witnessed this exact allocation attempt in a live cluster and witnessed the
algorithm fail with an accounting error. This test was added to verify that
this case is now handled by the updates to the algorithm and that we don't
regress from it in the future.

"test" description="ensure previous failure encountered on live machine has been fixed (1/1)"
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4 6] distribution=9 remainder=1 available=[14 2 4 4 0 3 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 4] distribution=9 remainder=1 available=[0 3 4 1 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2 6] distribution=9 remainder=1 available=[1 14 2 4 4 0 3 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4 6] distribution=9 remainder=1 available=[1 3 4 0 14 2 4 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[2] distribution=9 remainder=1 available=[4 0 3 4 1 14 2 4] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[4] distribution=9 remainder=1 available=[3 4 0 14 2 4 4 1] balance=4.031
"combo remainderSet balance" combo=[2 4 6] remainderSet=[6] distribution=9 remainder=1 available=[1 13 2 4 4 1 3 4] balance=3.606
"bestCombo found" distribution=9 bestCombo=[2 4 6] bestRemainder=[6]

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 20:49:58 +00:00
Kevin Klues
e284c74d93 Add unit test for CPUManager distribute NUMA algorithm verifying fixes
Before Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 1] distribution=8 remainder=2 available=[-1 -1 0 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 2] distribution=8 remainder=2 available=[-1 0 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[0 3] distribution=8 remainder=2 available=[5 -1 0 0] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 2] distribution=8 remainder=2 available=[0 -1 -1 6] balance=2.915
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[1 3] distribution=8 remainder=2 available=[0 -1 0 5] balance=2.345
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[2 3] distribution=8 remainder=2 available=[0 0 -1 5] balance=2.345
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[0 3]

--- FAIL: TestTakeByTopologyNUMADistributed (0.01s)
    --- FAIL: TestTakeByTopologyNUMADistributed/ensure_bestRemainder_chosen_with_NUMA_nodes_that_have_enough_CPUs_to_satisfy_the_request (0.00s)
        cpu_assignment_test.go:867: unexpected error [accounting error, not enough CPUs allocated, remaining: 1]

After Change:
"test" description="ensure bestRemainder chosen with NUMA nodes that have enough CPUs to satisfy the request"
"combo remainderSet balance" combo=[0 1 2 3] remainderSet=[3] distribution=8 remainder=2 available=[0 0 0 4] balance=1.732
"bestCombo found" distribution=8 bestCombo=[0 1 2 3] bestRemainder=[3]

SUCCESS

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 20:45:37 +00:00
Cheng Xing
4de40e90d4 DelegateFSGroupToCSIDriver e2e: skip tests with chgrp 2021-11-24 11:41:53 -08:00
Kevin Klues
031f11513d Fix accounting bug in CPUManager distribute NUMA policy
Without this fix, the algorithm may decide to allocate "remainder" CPUs from a
NUMA node that has no more CPUs to allocate. Moreover, it was only considering
allocation of remainder CPUs from NUMA nodes such that each NUMA node in the
remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these
two issues in play, one could end up with an accounting error where not enough
CPUs were allocated by the time the algorithm runs to completion.

The updated algorithm will now omit any NUMA nodes that have 0 CPUs left from
the set of NUMA nodes considered for allocating remainder CPUs. Additionally,
we now consider *all* combinations of nodes from the remainder set of size
1..len(remainderSet). This allows us to find a better solution if allocating
CPUs from a smaller set leads to a more balanced allocation. Finally, we loop
through all NUMA nodes 1-by-1 in the remainderSet until all remainder CPUs have
been accounted for and allocated. This ensures that we will not hit an
accounting error later on because we explicitly remove CPUs from the remainder
set until there are none left.

A follow-on commit adds a set of unit tests that will fail before these
changes, but succeed after them.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 19:18:11 +00:00
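The candidate selection described in this commit (drop NUMA nodes with 0 CPUs left, then consider every combination of size 1..len(remainderSet)) can be sketched as below. This is an illustrative sketch only: `remainderCandidates` and its map-based input are hypothetical and are not taken from the CPUManager code.

```go
package main

import (
	"fmt"
	"sort"
)

// remainderCandidates is a hypothetical helper. Given the number of CPUs still
// available on each NUMA node, it drops nodes with nothing left and returns
// every non-empty subset of the remaining nodes.
func remainderCandidates(available map[int]int) [][]int {
	var nodes []int
	for node, cpus := range available {
		if cpus > 0 {
			nodes = append(nodes, node)
		}
	}
	sort.Ints(nodes) // map iteration order is random; sort for stable output

	var out [][]int
	// Each bit of mask selects one node, so masks 1..2^n-1 cover all
	// non-empty subsets of the n usable nodes.
	for mask := 1; mask < 1<<len(nodes); mask++ {
		var combo []int
		for i, node := range nodes {
			if mask&(1<<i) != 0 {
				combo = append(combo, node)
			}
		}
		out = append(out, combo)
	}
	return out
}

func main() {
	// Node 4 has no CPUs left, so it never appears in a candidate set.
	for _, c := range remainderCandidates(map[int]int{2: 5, 4: 0, 6: 3}) {
		fmt.Println(c)
	}
}
```

With three usable nodes this yields seven candidate sets, which matches the seven remainderSet lines logged for combo=[2 4 6] in the regression test above.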
Kevin Klues
5317a2e2ac Fix error handling in CPUManager distribute NUMA tests
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:31 +00:00
Kevin Klues
dc4430b663 Add a sum() helper to the CPUManager cpuassignment logic
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:29 +00:00
Kevin Klues
cfacc22459 Allow the map.Values() function in the CPUManager to take a set of keys
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:28 +00:00
Kevin Klues
a160d9a8cd Fix CPUManager algo to calculate min NUMA nodes needed for distribution
Previously the algorithm was too restrictive because it tried to calculate the
minimum based on the number of *available* NUMA nodes and the number of
*available* CPUs on those NUMA nodes. Since there was no (easy) way to tell how
many CPUs an individual NUMA node happened to have, the average across them was
used. Using this value, however, could result in thinking you need more NUMA
nodes to satisfy a request than you actually do.

By using the *total* number of NUMA nodes and CPUs per NUMA node, we can get
the true minimum number of nodes required to satisfy a request. For a given
"current" allocation this may not be the true minimum, but it's better to start
with fewer and move up than to start with too many and miss out on a better
option.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:26 +00:00
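The difference between the two calculations described in this commit can be illustrated with a small sketch. Both function names are hypothetical, and the averaging version is only a guess at the shape of the old approach based on the description above, not the actual CPUManager code.

```go
package main

import (
	"fmt"
	"math"
)

// minNodesFromAverage is a hypothetical reconstruction of the old approach:
// it divides the request by the average number of *available* CPUs per node.
func minNodesFromAverage(available []int, numCPUs int) int {
	total := 0
	for _, a := range available {
		total += a
	}
	if len(available) == 0 || total == 0 {
		return 0
	}
	avg := float64(total) / float64(len(available))
	return int(math.Ceil(float64(numCPUs) / avg))
}

// minNUMANodes is a hypothetical sketch of the new approach: it uses the
// *total* CPUs per NUMA node, giving a lower bound via ceiling division.
func minNUMANodes(numCPUs, cpusPerNUMANode int) int {
	if numCPUs <= 0 || cpusPerNUMANode <= 0 {
		return 0
	}
	return (numCPUs + cpusPerNUMANode - 1) / cpusPerNUMANode
}

func main() {
	// Two nodes of 20 CPUs each; node 0 has 4 free and node 1 has 20 free.
	// A request for 16 CPUs fits on node 1 alone.
	fmt.Println(minNodesFromAverage([]int{4, 20}, 16)) // average of 12 -> 2 nodes
	fmt.Println(minNUMANodes(16, 20))                  // 1 node
}
```

Starting from the lower bound lets the search move up to more nodes only when a smaller set cannot satisfy the request.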
Kevin Klues
209cd20548 Fix unit tests following bug fix in CPUManager for map functions (2/2)
Now that the algorithm for balancing CPU distributions across NUMA nodes is
correct, this test actually behaves differently for the "packed" vs.
"distributed" allocation algorithms (as it should).

In the "packed" case we need to ensure that CPUs are allocated such that they
are packed onto cores. Since one CPU is already allocated from a core on NUMA
node 0, we want the next CPU to be its hyperthreaded pair (even though the
first available CPU id is on Socket 1).

In the "distributed" case, however, we want to ensure CPUs are allocated such
that we have a balanced distribution of CPUs across all NUMA nodes. This
points to allocating from Socket 1 if the only other CPU allocated has been
done on Socket 0.

To allow CPU allocations to be packed onto full cores, one can allocate them
from the "distributed" algorithm with a 'cpuGroupSize' equal to the number of
hyperthreads per core (in this case 2). We added an explicit test case for this,
demonstrating that we get the same result as the "packed" algorithm does, even
though the "distributed" algorithm is in use.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:24 +00:00
Kevin Klues
67f719cb1d Fix unit tests following bug fix in CPUManager for map functions (1/2)
This fixes two related tests to better test our "balanced" distribution algorithm.

The first test originally provided an input with the following number of CPUs
available on each NUMA node:

Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 20

It then attempted to distribute 48 CPUs across them with an expectation that
each of the first 3 NUMA nodes would have 16 CPUs taken from them (leaving Node
0 with no more CPUs in the end).

This would have resulted in the following number of CPUs left on each node:

Node 0: 0
Node 1: 4
Node 2: 4
Node 3: 20

Which results in a standard deviation of 7.6811

However, a more balanced solution would actually be to pull 16 CPUs from NUMA
nodes 1, 2, and 3, and leave 0 untouched, i.e.:

Node 0: 16
Node 1: 4
Node 2: 4
Node 3: 4

Which results in a standard deviation of 5.1961524227066

To fix this test we changed the original number of available CPUs to start with
4 fewer CPUs on NUMA node 3, and 2 more CPUs on NUMA node 0, i.e.:

Node 0: 18
Node 1: 20
Node 2: 20
Node 3: 16

So that we end up with a result of:

Node 0: 2
Node 1: 4
Node 2: 4
Node 3: 16

Which pulls the CPUs from where we want and results in a standard deviation of 5.5452

For the second test, we simply reverse the number of CPUs available for Nodes 0
and 3 as:

Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 18

Which forces the allocation to happen just as it did for the first test, except
now on NUMA nodes 1, 2, and 3 instead of NUMA nodes 0, 1, and 2.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:23 +00:00
Kevin Klues
4008ea0b4c Fix bug in CPUManager map.Keys() and map.Values() implementations
Previously these would return lists that were too long because we appended to
pre-initialized lists with a specific size.

Since the primary place these functions are used is in the mean and standard
deviation calculations for the NUMA distribution algorithm, it meant that the
results of these calculations were often incorrect.

As a result, some of the unit tests we have are actually incorrect (because the
results we expect do not actually produce the best balanced
distribution of CPUs across all NUMA nodes for the input provided).

These tests will be patched up in subsequent commits.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:21 +00:00
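The class of bug described here, appending to a slice that was created with a non-zero length, can be reproduced in a few lines. The functions below are illustrative stand-ins written for this sketch, not the actual CPUManager map.Keys() implementation.

```go
package main

import "fmt"

// keysBuggy shows the pattern: make() with a length yields a slice already
// holding len(m) zero values, so append adds the real keys after them.
func keysBuggy(m map[int]int) []int {
	keys := make([]int, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	return keys
}

// keysFixed allocates length 0 with capacity len(m), so the result holds
// exactly one entry per key.
func keysFixed(m map[int]int) []int {
	keys := make([]int, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	return keys
}

func main() {
	m := map[int]int{0: 16, 1: 20}
	fmt.Println(len(keysBuggy(m)), len(keysFixed(m))) // 4 2
}
```

The extra zero entries in the buggy version are what would skew any mean or standard deviation computed over the returned list.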
Kevin Klues
446c58e0e7 Ensure we balance across *all* NUMA nodes in NUMA distribution algo
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:19 +00:00
Kevin Klues
c8559bc43e Short-circuit CPUManager distribute NUMA algo for unusable cpuGroupSize
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:16 +00:00
Kevin Klues
b28c1392d7 Round the CPUManager mean and stddev calculations to the nearest 1000th
Signed-off-by: Kevin Klues <kklues@nvidia.com>
2021-11-24 16:51:13 +00:00
Kubernetes Prow Robot
0d3f2ca371
Merge pull request #106657 from liggitt/openapiv3
Unversion and normalize openapi v3 fixtures
2021-11-24 08:36:20 -08:00
Jordan Liggitt
88ab0d03b7 Revert "update expected ordering"
This reverts commit fbc8ac9c96.
2021-11-24 11:19:27 -05:00
Jordan Liggitt
ed68909177 Revert sigs.k8s.io/structured-merge-diff/v4 to v4.1.2 2021-11-24 10:32:24 -05:00
Jordan Liggitt
2588ea76ea Regenerate openapi v3 fixtures 2021-11-24 10:03:45 -05:00
Jordan Liggitt
f30c5738ea Unversion and normalize openapi v3 fixtures 2021-11-24 10:03:36 -05:00
Patrick Ohly
9d98c69075 api/errors: explicitly allow nil error parameters
This was already possible before because the underlying errors.As supports
it. But because it wasn't clear, a lot of code unnecessarily checks for nil
before calling the Is* functions.
2021-11-24 08:39:58 +01:00
Anago GCB
c8c81cbfbb CHANGELOG: Update directory for v1.23.0-rc.0 release 2021-11-24 06:19:11 +00:00
haoyun
eb673cec64 fix: klog flag redefined
Signed-off-by: haoyun <yun.hao@daocloud.io>
2021-11-24 10:03:40 +08:00
Kubernetes Prow Robot
e53cf07724
Merge pull request #106611 from verult/delegate-fsgroup-disable-onrootmismatch-e2e
Delegate FSGroup CSI driver e2e: verify fsgroup is passed to CSI calls
2021-11-23 17:52:20 -08:00
Kubernetes Prow Robot
c3e6b66643
Merge pull request #106533 from haircommander/summary-page-fault-test
test: update major page fault values for summary test
2021-11-23 15:09:45 -08:00
Kubernetes Prow Robot
a5622f3f6e
Merge pull request #106616 from mattcary/pvc-race
Clean up deep copy needed for UpdateStatefulSet
2021-11-23 09:38:17 -08:00
Matthew Cary
0e2b901762 Clean up deep copy needed for UpdateStatefulSet
Change-Id: Id732358183d682d1a945cfee56f83bcaac0d7c31
2021-11-23 06:48:54 -08:00
xuweiwei
9ab5c8a36f Fix typo
depenging -> depending
permssion -> permission

Signed-off-by: xuweiwei <xuweiwei_yewu@cmss.chinamobile.com>
2021-11-23 16:18:13 +08:00
Kubernetes Prow Robot
e31aafc4fd
Merge pull request #106348 from endocrimes/dani/rm-gpu
e2e_node: unify device tests
2021-11-22 19:46:16 -08:00
Andrea Hoffer
f5612f100e Adding an example for kubectl plugin list 2021-11-22 21:33:06 -05:00
Cheng Xing
bca1b79728 Delegate FSGroup CSI driver e2e: verify fsgroup is passed to CSI calls using mock driver tests 2021-11-22 17:00:39 -08:00
Kubernetes Prow Robot
f572e4d5b4
Merge pull request #106518 from SergeyKanzhelev/tryProbeFix
Fix the bug with GRPC probe
2021-11-22 15:38:54 -08:00
Kubernetes Prow Robot
a142f86351
Merge pull request #105764 from jlebon/pr/add-ssh-mode
test/e2e_node/remote: support pure SSH mode
2021-11-22 10:53:33 -08:00
Abu Kashem
41cef06f66
add trace step for transformResponseObject 2021-11-22 13:18:02 -05:00
Jonathan Lebon
3ebd93cd02 test-e2e-node: support pure SSH mode
Right now, `run_remote.go` only supports GCE instances. But actually
running the tests is completely independent of GCE and could work just
as well on any SSH-accessible machine.

This patch adds a new `--mode` switch, which defaults to `gce` for
backwards compatibility, but can be set to `ssh`. In that mode, the GCE
API is not used at all, and we simply connect to the hosts given via
`--hosts`.

This is still better than `run_local.go` because the latter mixes build
environment with test environment, which doesn't fit well with
container-optimized operating systems.

This is part of an effort to set up the e2e node tests on Fedora CoreOS
(see https://github.com/coreos/fedora-coreos-tracker/issues/990).

Patch best viewed with whitespace ignored.
2021-11-22 10:13:15 -05:00
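A mode switch of the kind described in this commit could be wired up roughly as follows with the standard `flag` package. Only the `--mode` and `--hosts` names and the `gce` default come from the commit message; the `parseArgs` function, the validation, and the comma-separated host format are assumptions made for this sketch and do not reflect the real `run_remote.go`.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// parseArgs is a hypothetical helper that parses a --mode switch defaulting
// to "gce" and a --hosts list (assumed here to be comma-separated).
func parseArgs(args []string) (mode string, hosts []string, err error) {
	fs := flag.NewFlagSet("run_remote", flag.ContinueOnError)
	modeFlag := fs.String("mode", "gce", `test mode: "gce" or "ssh"`)
	hostsFlag := fs.String("hosts", "", "hosts to test (assumed comma-separated)")
	if err := fs.Parse(args); err != nil {
		return "", nil, err
	}
	switch *modeFlag {
	case "gce", "ssh":
	default:
		return "", nil, fmt.Errorf("unknown mode %q", *modeFlag)
	}
	if *hostsFlag != "" {
		hosts = strings.Split(*hostsFlag, ",")
	}
	return *modeFlag, hosts, nil
}

func main() {
	mode, hosts, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("mode=%s hosts=%v\n", mode, hosts)
}
```

Defaulting to `gce` keeps existing invocations working unchanged, which is the backwards-compatibility point the commit message makes.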
Jonathan Lebon
e0723c1e64 test-e2e-node: add SSH_OPTIONS
This allows overriding the default options.
2021-11-22 10:13:13 -05:00
Jonathan Lebon
591f4cdb77 run_remote.go: factor out prepareGceImages()
Mostly a pure code move. Only changed the `klog.Fatalf` to `fmt.Errorf`.
Prep for future patch.
2021-11-22 10:12:29 -05:00
Jonathan Lebon
032dbd2063 run_remote.go: move registerGceHostIP() call to testImage()
I.e. don't assume that `testHost` is called on a GCE host. Prep for
future patch.
2021-11-22 10:12:28 -05:00
Jonathan Lebon
36233b985b run_remote.go: factor out registerGceHostIP()
Prep for future patch.
2021-11-22 10:12:28 -05:00