Kata Tracing proposals
Overview
This document summarises a set of proposals triggered by the tracing documentation PR.
Required context
This section explains some terminology required to understand the proposals. Further details can be found in the tracing documentation PR.
Agent trace mode terminology
Trace mode | Description | Use-case |
---|---|---|
Static | Trace agent from startup to shutdown | Entire lifespan |
Dynamic | Toggle tracing on/off as desired | On-demand "snapshot" |
Agent trace type terminology
Trace type | Description | Use-case |
---|---|---|
isolated | traces all relate to single component | Observing lifespan |
collated | traces "grouped" (runtime+agent) | Understanding component interaction |
Container lifespan
Lifespan | trace mode | trace type |
---|---|---|
short-lived | static | collated if possible, else isolated? |
long-running | dynamic | collated? (to see interactions) |
Original plan for agent
- Implement all trace types and trace modes for agent.
- Why?
  - Maximum flexibility.
    Counterargument:
    Due to the intrusive nature of adding tracing, we have learnt that landing small incremental changes is simpler and quicker!
  - Compatibility with Kata 1.x tracing.
    Counterargument:
    Agent tracing in Kata 1.x was extremely awkward to set up (to the extent that it's unclear how many users actually used it!)
    This point, coupled with the new architecture for Kata 2.x, suggests that we may not need to supply the same set of tracing features (in fact they may not make sense).
Agent tracing proposals
Agent tracing proposal 1: Don't implement dynamic trace mode
- All tracing will be static.
- Why?
  - Because dynamic tracing will always be "partial".
    In fact, not only would it be only a "snapshot" of activity, it may not even be possible to create a complete "trace transaction". If this is true, the trace output would be partial and would appear "unstructured".
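To illustrate why a dynamically enabled trace may appear "unstructured", here is a minimal Python sketch of a tracer that can be toggled on and off. The `Span` and `ToggleTracer` names and the span names are hypothetical and are not part of Kata or OpenTelemetry. The sketch only shows that a span recorded after tracing is switched on can refer to a parent that was never recorded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    span_id: int
    parent_id: Optional[int]
    name: str


class ToggleTracer:
    """Hypothetical tracer that only records spans while enabled."""

    def __init__(self) -> None:
        self.enabled = False
        self.recorded: List[Span] = []
        self._next_id = 1

    def start_span(self, name: str, parent: Optional[Span] = None) -> Span:
        span = Span(self._next_id, parent.span_id if parent else None, name)
        self._next_id += 1
        if self.enabled:
            self.recorded.append(span)
        return span


def orphans(spans: List[Span]) -> List[Span]:
    """Return spans whose parent is missing from the recorded set."""
    known = {s.span_id for s in spans}
    return [s for s in spans if s.parent_id is not None and s.parent_id not in known]


tracer = ToggleTracer()
root = tracer.start_span("create_container")     # tracing off: not recorded
tracer.enabled = True                            # tracing toggled on mid-request
child = tracer.start_span("mount_rootfs", root)  # recorded, but its parent is not
print([s.name for s in orphans(tracer.recorded)])
```

In this sketch the only recorded span is an orphan, so a trace viewer would have no root to attach it to.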
Agent tracing proposal 2: Simplify handling of trace type
- Agent tracing will be "isolated" by default.
- Agent tracing will be "collated" if runtime tracing is also enabled.
- Why?
  - Offers a graceful fallback for agent tracing if runtime tracing is disabled.
  - Simpler code!
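The selection rule in this proposal can be sketched as a single function. This is an illustrative Python sketch only: the function name `agent_trace_type` and its boolean parameters are assumptions made for the example and do not correspond to actual Kata configuration options.

```python
from typing import Optional


def agent_trace_type(agent_tracing: bool, runtime_tracing: bool) -> Optional[str]:
    """Return the agent trace type under proposal 2, or None if agent tracing is off."""
    if not agent_tracing:
        return None
    # Collated only when the runtime is also tracing; otherwise fall back to isolated.
    return "collated" if runtime_tracing else "isolated"


print(agent_trace_type(agent_tracing=True, runtime_tracing=False))
print(agent_trace_type(agent_tracing=True, runtime_tracing=True))
```

Because the trace type is derived rather than configured, there is no invalid combination of settings to validate.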
Questions to ask yourself (part 1)
- Are your containers long-running or short-lived?
- Would you ever need to turn on tracing "briefly"?
  - If "yes", is a "partial trace" useful or useless?
    Likely to be considered useless as it is a partial snapshot. Alternative tracing methods may be more appropriate than dynamic OpenTelemetry tracing.
Questions to ask yourself (part 2)
- Are you happy to stop a container to enable tracing? If "no", dynamic tracing may be required.
- Would you ever want to trace the agent and the runtime "in isolation" at the same time?
  - If "yes", we need to fully implement trace_mode=isolated
    This seems unlikely though.
Trace collection
The second set of proposals affects the way traces are collected.
Motivation
Currently:
- The runtime sends trace spans to Jaeger directly.
- The agent will send trace spans to the trace-forwarder component.
- The trace forwarder will send trace spans to Jaeger.
Kata agent tracing overview:
+-------------------------------------------+
| Host                                      |
|                                           |
| +-----------+                             |
| | Trace     |                             |
| | Collector |                             |
| +-----+-----+                             |
|       ^                  +--------------+ |
|       | spans            | Kata VM      | |
| +-----+-----+            |              | |
| | Kata      |   spans    |     +-----+  | |
| | Trace     |<-----------------|Kata |  | |
| | Forwarder |   VSOCK    |     |Agent|  | |
| +-----------+   Channel  |     +-----+  | |
|                          +--------------+ |
+-------------------------------------------+
Currently:
- If agent tracing is enabled but the trace forwarder is not running, the agent will error.
- If the trace forwarder is started but Jaeger is not running, the trace forwarder will error.
Goals
- The runtime and agent should:
  - Use the same trace collection implementation.
  - Share as many configuration items as possible.
- Kata should support more trace collection software or SaaS (for example Zipkin, datadog).
- Trace collection should not block normal runtime/agent operations (for example if vsock-exporter/Jaeger is not running, Kata Containers should work normally).
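One way to meet the "should not block" goal is to buffer spans in a bounded queue and drop them when the collector is unavailable. The following Python sketch illustrates that idea under stated assumptions: `NonBlockingExporter`, its `send` callback, and the `max_pending` limit are hypothetical names for the example, not existing Kata or OpenTelemetry APIs.

```python
import queue
from typing import Callable


class NonBlockingExporter:
    """Hypothetical exporter: buffers spans and drops them rather than blocking."""

    def __init__(self, send: Callable[[dict], None], max_pending: int = 1024) -> None:
        self._send = send
        self._pending: "queue.Queue[dict]" = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def export(self, span: dict) -> None:
        """Called on the hot path; never blocks and never raises."""
        try:
            self._pending.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def flush(self) -> int:
        """Try to deliver buffered spans off the hot path; return the number sent."""
        sent = 0
        while True:
            try:
                span = self._pending.get_nowait()
            except queue.Empty:
                return sent
            try:
                self._send(span)
                sent += 1
            except OSError:
                # Collector unreachable: drop the span and keep going.
                self.dropped += 1


def unreachable_collector(span: dict) -> None:
    """Stand-in for a collector that is not running."""
    raise ConnectionRefusedError("collector not running")


exporter = NonBlockingExporter(unreachable_collector, max_pending=2)
for i in range(3):
    exporter.export({"name": f"span-{i}"})
sent = exporter.flush()
print(sent, exporter.dropped)
```

The trade-off in this design is that spans are lost silently when the collector is down, so a real implementation would probably want to expose the drop count as a metric or log message.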
Trace collection proposals
Trace collection proposal 1: Send all spans to the trace forwarder as a span proxy
The Kata runtime and agent both send spans to the trace forwarder. The trace forwarder, acting as a tracing proxy, then sends all spans to a tracing back-end, such as Jaeger or datadog.
Pros:
- Runtime/agent will be simple.
- Could update trace collection target while Kata Containers are running.
Cons:
- Requires the trace forwarder component to be running (an additional operational burden).
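The proxy arrangement, including the ability to retarget collection while containers are running, can be sketched as follows. This is a simplified Python illustration: `SpanProxy`, `set_backend`, and the in-memory lists standing in for back-ends are assumptions made for the example and do not describe the actual trace forwarder.

```python
from typing import Callable, List

Backend = Callable[[dict], None]


class SpanProxy:
    """Hypothetical stand-in for the trace forwarder acting as a span proxy."""

    def __init__(self, backend: Backend) -> None:
        self._backend = backend

    def set_backend(self, backend: Backend) -> None:
        """Retarget the proxy; the runtime and agent are unaffected."""
        self._backend = backend

    def handle(self, source: str, span: dict) -> None:
        """Accept a span from the runtime or agent and pass it to the back-end."""
        self._backend({"source": source, **span})


# In-memory lists stand in for two different tracing back-ends.
jaeger: List[dict] = []
zipkin: List[dict] = []

proxy = SpanProxy(jaeger.append)
proxy.handle("runtime", {"name": "create"})
proxy.set_backend(zipkin.append)  # switch target while the senders keep running
proxy.handle("agent", {"name": "exec"})
print(jaeger, zipkin)
```

The senders only ever know about the proxy, which is why the runtime and agent stay simple under this proposal.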
Trace collection proposal 2: Send spans to collector directly from runtime/agent
Send spans to the collector directly from the runtime/agent. This proposal requires the collector to be reachable over the network from both components.
Pros:
- No additional trace forwarder component needed.
Cons:
- Requires more code/configuration to support all trace collectors.
Future work
- We could add dynamic and fully isolated tracing at a later stage, if required.
Further details
- See the new GitHub project.
- kata-containers-tracing-status gist.
- tracing documentation PR.
Summary
Time line
- 2021-07-01: A summary of the discussion was posted to the mailing list.
- 2021-06-22: These proposals were discussed in the Kata Architecture Committee meeting.
- 2021-06-18: These proposals were announced on the mailing list.
Outcome
- Nobody opposed the agent proposals, so they are being implemented.
- The trace collection proposals are still being considered.