The Topology Manager e2e tests wants to run on real multi-NUMA system
and want to consume real devices supported by device plugins; SRIOV
devices happen to be the most commonly available of such devices.
CI machines aren't multi NUMA nor expose SRIOV devices, so the biggest portion
of the tests will just skip, and we need to keep it like this until we
figure out how to enable these features.
However, some organizations can and want to run the testsuite on bare metal;
in this case, the current test will skip (not fail) with misconfigured
boxes, and this reports a misleading result. It will be much better to
fail if the test preconditions aren't met.
To satisfy both needs, we add an option, controlled by an environment
variable, to fail (not skip) if the machine on which the test run
doesn't meet the expectations (multi-NUMA, 4+ cores per NUMA cell,
expose SRIOV VFs).
We keep the old behaviour as default to keep being CI friendly.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Since the PR to do this deeper in the stack was declined, we'll do it
ourselves. This ensures that we don't accidentally mutate the input and
then compare that mutated form to the result (which caused previous test
failures).
This is the culmination of all the previous commits which made this last
move less dramatic. More tests and cleanup commits will follow.
Background, for future archaeologists:
Service has (had) an "outer" and "inner" REST handler. This is because of how we do IP and port allocations synchronously, but since we don't have API transactions, we need to roll those back in case of a failure. Both layers use the same `Strategy`, but the outer calls into the inner, which causes a lot of complexity in the code (including an open-coded partial reimplementation of a date-unknown snapshot of the generic REST code) and results in `Prepare` and `Validate` hooks being called twice.
The "normal" REST flow seems to be:
```
mutating webhooks
generic REST store Create {
cleanup = BeginCreate
BeforeCreate {
strategy.PrepareForCreate {
dropDisabledFields
}
strategy.Validate
strategy.Canonicalize
}
createValidation (validating webhooks)
storage Create
cleanup
AfterCreate
Decorator
}
```
Service (before this series of commits) does:
```
mutating webhooks
svc custom Create {
BeforeCreate {
strategy.PrepareForCreate {
dropDisabledFields
}
strategy.Validate
strategy.Canonicalize
}
Allocations
inner (generic) Create {
cleanup = BeginCreate
BeforeCreate {
strategy.PrepareForCreate {
dropDisabledFields
}
strategy.Validate
strategy.Canonicalize
}
createValidation (validating webhooks)
storage Create
cleanup
AfterCreate
Decorator
}
}
```
After this:
```
mutating webhooks
generic REST store Create {
cleanup = BeginCreate
Allocations
BeforeCreate {
strategy.PrepareForCreate {
dropDisabledFields
}
strategy.Validate
strategy.Canonicalize
}
createValidation (validating webhooks)
storage Create
cleanup
AfterCreate
Rollback allocations on error
Decorator
}
```
Previously we would try to infer the `ipFamilyPolicy` from `clusterIPs`
and/or `ipFamilies`. That is too tricky. Now you MUST specify
`ipFamilyPolicy` as one of the dual-stack options in order to get a
dual-stack service.
This removes the old rest_tests and adds significantly more coverage.
Maybe too much. The v4,v6 and v6,v4 tables are identical but for the
order of families.
This exposed that `trimFieldsForDualStackDowngrade` is called too late
to do anything (since we don't run strategy twice any more). I moved
similar logic to `PatchAllocatedValues` but I hit on some unclarity.
Specifically, consider a PATCH operation.
Assume I have a valid dual-stack service (with 2 IPs, 2 families, and
policy either require or prefer). What fields can I patch, on their own,
to trigger a downgrade to single-stack?
I think patching policy to "single" is pretty clear intent.
But what if I leave policy and only patch `ipFamilies` down to a single
value (without violating the "can't change first family" rule)?
Or what if I patch `clusterIPs` down to a single IP value?
After much discussion, we decided to make a small API change (OK since
we are beta). When you want a dual-stack Service you MUST specify the
`ipFamilyPolicy`. Now we can infer less and avoid ambiguity.
This commit started as removing FIXME comments, but in doing so I
realized that the IP allocation process was using unvalidated user
input. Before de-layering, validation was called twice - once before
init and once after, which the init code depended on.
Fortunately (or not?) we had duplicative checks that caught errors but
with less friendly messages.
This commit calls validation before initializing the rest of the
IP-related fields.
This also re-organizes that code a bit, cleans up error messages and
comments, and adds a test SPECIFICALLY for the errors in those cases.