mirror of
				https://github.com/kata-containers/kata-containers.git
				synced 2025-11-03 19:15:58 +00:00 
			
		
		
		
	To make the code directory structure more clear:
└── src
    ├── agent
    ├── libs
    │   └── logging
    ├── runtime
    ├── runtime-rs (to be added)
    └── tools
        ├── agent-ctl
        └── trace-forwarder
Fixes: #3204
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
		
	
		
			
				
	
	
		
			214 lines
		
	
	
		
			6.7 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			214 lines
		
	
	
		
			6.7 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
# Kata Tracing proposals
 | 
						|
 | 
						|
## Overview
 | 
						|
 | 
						|
This document summarises a set of proposals triggered by the
 | 
						|
[tracing documentation PR][tracing-doc-pr].
 | 
						|
 | 
						|
## Required context
 | 
						|
 | 
						|
This section explains some terminology required to understand the proposals.
 | 
						|
Further details can be found in the
 | 
						|
[tracing documentation PR][tracing-doc-pr].
 | 
						|
 | 
						|
### Agent trace mode terminology
 | 
						|
 | 
						|
| Trace mode | Description | Use-case |
 | 
						|
|-|-|-|
 | 
						|
| Static |  Trace agent from startup to shutdown | Entire lifespan |
 | 
						|
| Dynamic | Toggle tracing on/off as desired | On-demand "snapshot" |
 | 
						|
 | 
						|
### Agent trace type terminology
 | 
						|
 | 
						|
| Trace type | Description | Use-case |
 | 
						|
|-|-|-|
 | 
						|
| isolated | traces all relate to single component | Observing lifespan |
 | 
						|
| collated | traces "grouped" (runtime+agent) | Understanding component interaction |
 | 
						|
 | 
						|
### Container lifespan
 | 
						|
 | 
						|
| Lifespan | trace mode | trace type |
 | 
						|
|-|-|-|
 | 
						|
| short-lived | static | collated if possible, else isolated? |
 | 
						|
| long-running | dynamic | collated? (to see interactions) |
 | 
						|
 | 
						|
## Original plan for agent
 | 
						|
 | 
						|
- Implement all trace types and trace modes for agent.
 | 
						|
 | 
						|
- Why?
 | 
						|
  - Maximum flexibility.
 | 
						|
 | 
						|
    > **Counterargument:**
 | 
						|
    >
 | 
						|
    > Due to the intrusive nature of adding tracing, we have
 | 
						|
    > learnt that landing small incremental changes is simpler and quicker!
 | 
						|
 | 
						|
  - Compatibility with [Kata 1.x tracing][kata-1x-tracing].
 | 
						|
 | 
						|
    > **Counterargument:**
 | 
						|
    >
 | 
						|
    > Agent tracing in Kata 1.x was extremely awkward to setup (to the extent
 | 
						|
    > that it's unclear how many users actually used it!)
 | 
						|
    >
 | 
						|
    > This point, coupled with the new architecture for Kata 2.x, suggests
 | 
						|
    > that we may not need to supply the same set of tracing features (in fact
 | 
						|
    > they may not make sense)).
 | 
						|
 | 
						|
## Agent tracing proposals
 | 
						|
 | 
						|
### Agent tracing proposal 1: Don't implement dynamic trace mode
 | 
						|
 | 
						|
- All tracing will be static.
 | 
						|
 | 
						|
- Why?
 | 
						|
  - Because dynamic tracing will always be "partial"
 | 
						|
 | 
						|
    > In fact, not only would it be only a "snapshot" of activity, it may not
 | 
						|
    > even be possible to create a complete "trace transaction". If this is
 | 
						|
    > true, the trace output would be partial and would appear "unstructured".
 | 
						|
 | 
						|
### Agent tracing proposal 2: Simplify handling of trace type
 | 
						|
 | 
						|
- Agent tracing will be "isolated" by default.
 | 
						|
- Agent tracing will be "collated" if runtime tracing is also enabled.
 | 
						|
 | 
						|
- Why?
 | 
						|
  - Offers a graceful fallback for agent tracing if runtime tracing disabled.
 | 
						|
  - Simpler code!
 | 
						|
 | 
						|
## Questions to ask yourself (part 1)
 | 
						|
 | 
						|
- Are your containers long-running or short-lived?
 | 
						|
 | 
						|
- Would you ever need to turn on tracing "briefly"?
 | 
						|
  - If "yes", is a "partial trace" useful or useless?
 | 
						|
 | 
						|
    > Likely to be considered useless as it is a partial snapshot.
 | 
						|
    > Alternative tracing methods may be more appropriate to dynamic
 | 
						|
    > OpenTelemetry tracing.
 | 
						|
 | 
						|
## Questions to ask yourself (part 2)
 | 
						|
 | 
						|
- Are you happy to stop a container to enable tracing?
 | 
						|
  If "no", dynamic tracing may be required.
 | 
						|
 | 
						|
- Would you ever want to trace the agent and the runtime "in isolation" at the
 | 
						|
  same time?
 | 
						|
  - If "yes", we need to fully implement `trace_mode=isolated`
 | 
						|
 | 
						|
    > This seems unlikely though.
 | 
						|
 | 
						|
## Trace collection
 | 
						|
 | 
						|
The second set of proposals affect the way traces are collected.
 | 
						|
 | 
						|
### Motivation
 | 
						|
 | 
						|
Currently:
 | 
						|
 | 
						|
- The runtime sends trace spans to Jaeger directly.
 | 
						|
- The agent will send trace spans to the [`trace-forwarder`][trace-forwarder] component.
 | 
						|
- The trace forwarder will send trace spans to Jaeger.
 | 
						|
 | 
						|
Kata agent tracing overview:
 | 
						|
 | 
						|
```
 | 
						|
+-------------------------------------------+
 | 
						|
| Host                                      |
 | 
						|
|                                           |
 | 
						|
| +-----------+                             |
 | 
						|
| | Trace     |                             |
 | 
						|
| | Collector |                             |
 | 
						|
| +-----+-----+                             |
 | 
						|
|       ^                  +--------------+ |
 | 
						|
|       | spans            | Kata VM      | |
 | 
						|
| +-----+-----+            |              | |
 | 
						|
| | Kata      |    spans   |     +-----+  | |
 | 
						|
| | Trace     |<-----------------|Kata |  | |
 | 
						|
| | Forwarder |    VSOCK   |     |Agent|  | |
 | 
						|
| +-----------+    Channel |     +-----+  | |
 | 
						|
|                          +--------------+ |
 | 
						|
+-------------------------------------------+
 | 
						|
```
 | 
						|
 | 
						|
Currently:
 | 
						|
 | 
						|
- If agent tracing is enabled but the trace forwarder is not running,
 | 
						|
  the agent will error.
 | 
						|
 | 
						|
- If the trace forwarder is started but Jaeger is not running,
 | 
						|
  the trace forwarder will error.
 | 
						|
 | 
						|
### Goals
 | 
						|
 | 
						|
- The runtime and agent should:
 | 
						|
  - Use the same trace collection implementation.
 | 
						|
  - Use the most the common configuration items.
 | 
						|
 | 
						|
- Kata should should support more trace collection software or `SaaS`
 | 
						|
  (for example `Zipkin`, `datadog`).
 | 
						|
 | 
						|
- Trace collection should not block normal runtime/agent operations
 | 
						|
  (for example if `vsock-exporter`/Jaeger is not running, Kata Containers should work normally).
 | 
						|
 | 
						|
### Trace collection proposals
 | 
						|
 | 
						|
#### Trace collection proposal 1: Send all spans to the trace forwarder as a span proxy
 | 
						|
 | 
						|
Kata runtime/agent all send spans to trace forwarder, and the trace forwarder,
 | 
						|
acting as a tracing proxy, sends all spans to a tracing back-end, such as Jaeger or `datadog`.
 | 
						|
 | 
						|
**Pros:**
 | 
						|
 | 
						|
- Runtime/agent will be simple.
 | 
						|
- Could update trace collection target while Kata Containers are running.
 | 
						|
 | 
						|
**Cons:**
 | 
						|
 | 
						|
- Requires the trace forwarder component to be running (that is a pressure to operation).
 | 
						|
 | 
						|
#### Trace collection proposal 2: Send spans to collector directly from runtime/agent
 | 
						|
 | 
						|
Send spans to collector directly from runtime/agent, this proposal need
 | 
						|
network accessible to the collector.
 | 
						|
 | 
						|
**Pros:**
 | 
						|
 | 
						|
- No additional trace forwarder component needed.
 | 
						|
 | 
						|
**Cons:**
 | 
						|
 | 
						|
- Need more code/configuration to support all trace collectors.
 | 
						|
 | 
						|
## Future work
 | 
						|
 | 
						|
- We could add dynamic and fully isolated tracing at a later stage,
 | 
						|
  if required.
 | 
						|
 | 
						|
## Further details
 | 
						|
 | 
						|
- See the new [GitHub project](https://github.com/orgs/kata-containers/projects/28).
 | 
						|
- [kata-containers-tracing-status](https://gist.github.com/jodh-intel/0ee54d41d2a803ba761e166136b42277) gist.
 | 
						|
- [tracing documentation PR][tracing-doc-pr].
 | 
						|
 | 
						|
## Summary
 | 
						|
 | 
						|
### Time line
 | 
						|
 | 
						|
- 2021-07-01: A summary of the discussion was
 | 
						|
  [posted to the mail list](http://lists.katacontainers.io/pipermail/kata-dev/2021-July/001996.html).
 | 
						|
- 2021-06-22: These proposals were
 | 
						|
  [discussed in the Kata Architecture Committee meeting](https://etherpad.opendev.org/p/Kata_Containers_2021_Architecture_Committee_Mtgs).
 | 
						|
- 2021-06-18: These proposals where
 | 
						|
  [announced on the mailing list](http://lists.katacontainers.io/pipermail/kata-dev/2021-June/001980.html).
 | 
						|
 | 
						|
### Outcome
 | 
						|
 | 
						|
- Nobody opposed the agent proposals, so they are being implemented.
 | 
						|
- The trace collection proposals are still being considered.
 | 
						|
 | 
						|
[kata-1x-tracing]: https://github.com/kata-containers/agent/blob/master/TRACING.md
 | 
						|
[trace-forwarder]: /src/tools/trace-forwarder
 | 
						|
[tracing-doc-pr]: https://github.com/kata-containers/kata-containers/pull/1937
 |