Without this fix, the algorithm may decide to allocate "remainder" CPUs from a
NUMA node that has no more CPUs to allocate. Moreover, it was only considering
allocation of remainder CPUs from NUMA nodes such that each NUMA node in the
remainderSet could only allocate 1 (i.e. 'cpuGroupSize') more CPUs. With these
two issues in play, one could end up with an accounting error where not enough
CPUs were allocated by the time the algorithm runs to completion.
The updated algorithm will now omit any NUMA nodes that have 0 CPUs left from
the set of NUMA nodes considered for allocating remainder CPUs. Additionally,
we now consider *all* combinations of nodes from the remainder set of size
1..len(remainderSet). This allows us to find a better solution if allocating
CPUs from a smaller set leads to a more balanced allocation. Finally, we loop
through all NUMA nodes 1-by-1 in the remainderSet until all rmeainer CPUs have
been accounted for and allocated. This ensure that we will not hit an
accounting error later on because we explicitly remove CPUs from the remainder
set until there are none left.
A follow-on commit adds a set of unit tests that will fail before these
changes, but succeeds after them.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Previously the algorithm was too restrictive because it tried to calculate the
minimum based on the number of *available* NUMA nodes and the number of
*available* CPUs on those NUMA nodes. Since there was no (easy) way to tell how
many CPUs an individual NUMA node happened to have, the average across them was
used. Using this value however, could result in thinking you need more NUMA
nodes to possibly satisfy a request than you actually do.
By using the *total* number of NUMA nodes and CPUs per NUMA node, we can get
the true minimum number of nodes required to satisfy a request. For a given
"current" allocation this may not be the true minimum, but its better to start
with fewer and move up than to start with too many and miss out on a better
option.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Now that the algorithm for balancing CPU distributions across NUMA nodes is
correct, this test actually behaves differently for the "packed" vs.
"distributed" allocation algorithms (as it should).
In the "packed" case we need to ensure that CPUs are allocated such that they
are packed onto cores. Since one CPU is already allocated from a core on NUMA
node 0, we want the next CPU to be its hyperthreaded pair (even though the
first available CPU id is on Socket 1).
In the "distributed" case, however, we want to ensure CPUs are allocated such
that we have an balanced distribution of CPUs across all NUMA nodes. This
points to allocating from Socket 1 if the only other CPU allocated has been
done on Socket 0.
To allow CPUs allocations to be packed onto full cores, one can allocate them
from the "distributed" algorithm with a 'cpuGroupSize' equal to the number of
hypthreads per core (in this case 2). We added an explicit test case for this,
demonstrating that we get the same result as the "packed" algorithm does, even
though the "distributed" algorithm is in use.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This fixes two related tests to better test our "balanced" distribution algorithm.
The first test originally provided an input with the following number of CPUs
available on each NUMA node:
Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 20
It then attempted to distribute 48 CPUs across them with an expectation that
each of the first 3 NUMA nodes would have 16 CPUs taken from them (leaving Node
0 with no more CPUs in the end).
This would have resulted in the following amount of CPUs on each node:
Node 0: 0
Node 1: 4
Node 2: 4
Node 3: 20
Which results in a standard deviation of 7.6811
However, a more balanced solution would actually be to pull 16 CPUs from NUMA
nodes 1, 2, and 3, and leave 0 untouched, i.e.:
Node 0: 16
Node 1: 4
Node 2: 4
Node 3: 4
Which results in a standard deviation of 5.1961524227066
To fix this test we changed the original number of available CPUs to start with
4 less CPUs on NUMA node 3, and 2 more CPUs on NUMA node 0, i.e.:
Node 0: 18
Node 1: 20
Node 2: 20
Node 3: 16
So that we end up with a result of:
Node 0: 2
Node 1: 4
Node 2: 4
Node 3: 16
Which pulls the CPUs from where we want and results in a standard deviation of 5.5452
For the second test, we simply reverse the number of CPUs available for Nodes 0
and 3 as:
Node 0: 16
Node 1: 20
Node 2: 20
Node 3: 18
Which forces the allocation to happen just as it did for the first test, except
now on NUMA nodes 1, 2, and 3 instead of NUMA nodes 0,1, and 2.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Previously these would return lists that were too long because we appended to
pre-initialized lists with a specific size.
Since the primary place these functions are used is in the mean and standard
deviation calculations for the NUMA distribution algorithm, it meant that the
results of these calculations were often incorrect.
As a result, some of the unit tests we have are actually incorrect (because the
results we expect do not actually produce the best balanced
distribution of CPUs across all NUMA nodes for the input provided).
These tests will be patched up in subsequent commits.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
This was already possible before because the underlying errors.As supports
it. But because it wasn't clear, a lot of code unnecessarily checks for nil
before calling the Is* functions.
Right now, `run_remote.go` only supports GCE instances. But actually
running the tests is completely independent of GCE and could work just
as well on any SSH-accessible machine.
This patch adds a new `--mode` switch, which defaults to `gce` for
backwards compatibility, but can be set to `ssh`. In that mode, the GCE
API is not used at all, and we simply connect to the hosts given via
`--hosts`.
This is still better than `run_local.go` because the latter mixes build
environment with test environment, which doesn't fit well with
container-optimized operating systems.
This is part of an effort to setup the e2e node tests on Fedora CoreOS
(see https://github.com/coreos/fedora-coreos-tracker/issues/990).
Patch best viewed with whitespace ignored.