docs: Write tracing documentation

Add documentation explaining how to trace the runtime and agent.

Fixes: #1892.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
James O. D. Hunt 2021-05-26 17:55:52 +01:00
parent 9db56ffd85
commit 09a5e03f4a
3 changed files with 241 additions and 7 deletions


@@ -11,6 +11,10 @@ For details of the other Kata Containers repositories, see the
* [Installation guides](./install/README.md): Install and run Kata Containers with Docker or Kubernetes
## Tracing
See the [tracing documentation](tracing.md).
## More User Guides
* [Upgrading](Upgrading.md): how to upgrade from [Clear Containers](https://github.com/clearcontainers) and [runV](https://github.com/hyperhq/runv) to [Kata Containers](https://github.com/kata-containers) and how to upgrade an existing Kata Containers system to the latest version.

docs/tracing.md Normal file

@@ -0,0 +1,214 @@
# Overview
This document explains how to trace Kata Containers components.
# Introduction
The Kata Containers runtime and agent are able to generate
[OpenTelemetry][opentelemetry] trace spans, which allow the administrator to
observe what those components are doing and how much time they are spending on
each operation.
# OpenTelemetry summary
An OpenTelemetry-enabled application creates a number of trace "spans". A span
contains the following attributes:
- A name
- A pair of timestamps (recording the start time and end time of some operation)
- A reference to the span's parent span
All spans need to be *finished*, or *completed*, to allow the OpenTelemetry
framework to generate the final trace information (by effectively closing the
transaction encompassing the initial (root) span and all its children).
For Kata, the root span represents the total amount of time taken to run a
particular component from startup to its shutdown (the "run time").
# Architecture
## Runtime tracing architecture
The runtime, which runs in the host environment, has been modified to
optionally generate trace spans which are sent to a trace collector on the
host.
## Agent tracing architecture
An OpenTelemetry system (such as [Jaeger][jaeger-tracing]) uses a collector to
gather up trace spans from the application for viewing and processing. For an
application to use the collector, it must run in the same context as
the collector.
This poses a problem for tracing the Kata Containers agent since it does not
run in the same context as the collector: it runs inside a virtual machine (VM).
To allow spans from the agent to be sent to the trace collector, Kata provides
a [trace forwarder][trace-forwarder] component. This runs in the same context
as the collector (generally on the host system) and listens on a
[`VSOCK`][vsock] channel for traces generated by the agent, forwarding them on
to the trace collector.
> **Note:**
>
> This design supports agent tracing without requiring any changes to the guest
> image, which means that [custom images][osbuilder] can also benefit from
> agent tracing.
The following diagram summarises the architecture used to trace the Kata
Containers agent:
```
+--------------------------------------------+
| Host |
| |
| +---------------+ |
| | OpenTelemetry | |
| | Trace | |
| | Collector | |
| +---------------+ |
| ^ +---------------+ |
| | spans | Kata VM | |
| +-----+-----+ | | |
| | Kata | spans o +-------+ | |
| | Trace |<-----------------| Kata | | |
| | Forwarder | VSOCK o | Agent | | |
| +-----------+ Channel | +-------+ | |
| +---------------+ |
+--------------------------------------------+
```
# Agent tracing prerequisites
- You must have a trace collector running.
Although the collector normally runs directly on the host, it can also be run
as a Docker container, with the appropriate collector ports published to the
host.
The [Jaeger "all-in-one" Docker image][jaeger-all-in-one] method
is the quickest and simplest way to run the collector for testing (an example
command is shown after the notes below).
- If you wish to trace the agent, you must start the
[trace forwarder][trace-forwarder].
> **Notes:**
>
> - If agent tracing is enabled but the forwarder is not running,
> the agent will log an error (signalling that it cannot generate trace
> spans), but continue to work as normal.
>
> - The trace forwarder requires a trace collector (such as Jaeger) to be
> running before it is started. If a collector is not running, the trace
> forwarder will exit with an error.
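For reference, the following is one way to start the Jaeger all-in-one
collector in Docker for testing. The image name and ports follow the upstream
[Jaeger getting-started guide][jaeger-all-in-one]: port `16686` serves the
Jaeger web UI and port `6831/udp` receives trace data. Check that guide for
the full set of ports your deployment may need:
```bash
# Publish the span ingestion port (6831/udp) and the web UI port (16686).
$ docker run -d --rm --name jaeger \
    -p 6831:6831/udp \
    -p 16686:16686 \
    jaegertracing/all-in-one:latest
```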
# Enable tracing
By default, tracing is disabled for all components. To enable _any_ form of
tracing, an `enable_tracing` option must be set for at least one component.
> **Note:**
>
> Enabling this option will only allow tracing for subsequently
> started containers.
## Enable runtime tracing
To enable runtime tracing, set the tracing option as shown:
```toml
[runtime]
enable_tracing = true
```
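The `[runtime]` section lives in the main Kata Containers configuration file.
The snippet below is a minimal sketch of how to confirm the option is set in
the correct section; it assumes the usual packaged path of
`/etc/kata-containers/configuration.toml` (with defaults under
`/usr/share/defaults/kata-containers/`), so adjust the path for your
installation. Changing `runtime` in the pattern checks another section instead:
```bash
# Print any enable_tracing entry that appears inside the [runtime] section.
$ awk '/^\[runtime\]/{f=1; next} /^\[/{f=0} f && /enable_tracing/' \
    /etc/kata-containers/configuration.toml
```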
## Enable agent tracing
To enable agent tracing, set the tracing option as shown:
```toml
[agent.kata]
enable_tracing = true
```
> **Note:**
>
> If both agent tracing and runtime tracing are enabled, the resulting trace
> spans will be "collated": expanding individual runtime spans in the Jaeger
> web UI will show the agent trace spans resulting from the runtime
> operation.
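Putting the pieces together, a complete agent tracing session looks roughly
like the sketch below. The collector command repeats the earlier Jaeger
example, the forwarder is assumed to have been built and installed as
`kata-trace-forwarder` (it can also be run from its source tree with
`cargo run`), and the container step assumes containerd with the Kata runtime
registered as `io.containerd.kata.v2`; substitute the container engine
invocation you normally use:
```bash
# 1. Start the trace collector (Jaeger all-in-one, as shown earlier).
$ docker run -d --rm --name jaeger \
    -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one:latest
# 2. Start the trace forwarder so that agent spans can reach the collector.
$ kata-trace-forwarder &
# 3. Run a short-lived Kata container (with tracing enabled as shown above).
$ sudo ctr image pull docker.io/library/busybox:latest
$ sudo ctr run --runtime io.containerd.kata.v2 --rm \
    docker.io/library/busybox:latest tracing-test true
# 4. Once the container has exited, inspect the spans in the Jaeger web UI
#    (http://localhost:16686 by default).
```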
# Appendices
## Agent tracing requirements
### Host environment
- The host kernel must support the VSOCK socket type.
This will be available if the kernel is built with the
`CONFIG_VHOST_VSOCK` configuration option.
- The VSOCK kernel module must be loaded:
```
$ sudo modprobe vhost_vsock
```
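A quick way to confirm both host requirements is shown below. The kernel
configuration path assumes a distribution that installs
`/boot/config-$(uname -r)`; adjust it if your distribution stores the kernel
configuration elsewhere:
```bash
# Check the kernel was built with VSOCK support ("=y" built in, "=m" module).
$ grep CONFIG_VHOST_VSOCK "/boot/config-$(uname -r)"
# Check the module is loaded (only needed if the option above shows "=m").
$ lsmod | grep vhost_vsock
```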
### Guest environment
- The guest kernel must support the VSOCK socket type.
This will be available if the kernel is built with the
`CONFIG_VIRTIO_VSOCKETS` configuration option.
> **Note:** The default Kata Containers guest kernel provides this feature.
## Agent tracing limitations
- Agent tracing is only "completed" when the workload and the Kata agent
process have exited.
Although trace information *can* be inspected before the workload and agent
have exited, it is incomplete. This is shown as `<trace-without-root-span>`
in the Jaeger web UI.
If the workload is still running, the trace transaction -- which spans the entire
runtime of the Kata agent -- will not have been completed. To view the complete
trace details, wait for the workload to end, or stop the container.
## Performance impact
[OpenTelemetry][opentelemetry] is designed for high performance. It combines
the best of two previous generation projects (OpenTracing and OpenCensus) and
uses a very efficient mechanism to capture trace spans. Further, the trace
points inserted into the agent are generated at compile time. This
is advantageous since new versions of the agent will automatically benefit
from improvements in the tracing infrastructure. Overall, the impact of
enabling runtime and agent tracing should be extremely low.
## Agent shutdown behaviour
In normal operation, the Kata runtime manages the VM shutdown and performs
certain optimisations to speed up this process. However, if agent tracing is
enabled, the agent itself is responsible for shutting down the VM. This is to
ensure all agent trace transactions are completed. This means there will be a
small performance impact on container shutdown when agent tracing is enabled,
as the runtime must wait for the VM to shut down fully.
## Set up a tracing development environment
If you want to debug, further develop, or test tracing,
[enabling full debug][enable-full-debug]
is highly recommended. For working with the agent, you may also wish to
[enable a debug console][setup-debug-console]
to allow you to access the VM environment.
[agent-ctl]: https://github.com/kata-containers/kata-containers/blob/main/tools/agent-ctl
[enable-full-debug]: https://github.com/kata-containers/kata-containers/blob/main/docs/Developer-Guide.md#enable-full-debug
[jaeger-all-in-one]: https://www.jaegertracing.io/docs/getting-started/
[jaeger-tracing]: https://www.jaegertracing.io
[opentelemetry]: https://opentelemetry.io
[osbuilder]: https://github.com/kata-containers/kata-containers/blob/main/tools/osbuilder
[setup-debug-console]: https://github.com/kata-containers/kata-containers/blob/main/docs/Developer-Guide.md#set-up-a-debug-console
[trace-forwarder]: https://github.com/kata-containers/kata-containers/blob/main/src/trace-forwarder
[vsock]: https://wiki.qemu.org/Features/VirtioVsock


@@ -3,13 +3,24 @@
## Overview
The Kata Containers trace forwarder, `kata-trace-forwarder`, is a component
running on the host system which is used to support
[tracing the agent process][agent-tracing], which runs inside the Kata
Containers virtual machine (VM).
The trace forwarder, which must be started before the container, listens over
[`VSOCK`][vsock] for trace data sent by the agent running inside the VM. The
trace spans are exported to an [OpenTelemetry][opentelemetry] collector (such
as [Jaeger][jaeger-tracing]) running by default on the host.
> **Notes:**
>
> - If agent tracing is enabled but the forwarder is not running,
> the agent will log an error (signalling that it cannot generate trace
> spans), but continue to work as normal.
>
> - The trace forwarder requires a trace collector (such as Jaeger) to be
> running before it is started. If a collector is not running, the trace
> forwarder will exit with an error.
## Quick start
@@ -133,8 +144,13 @@ You can now proceed as normal to create the "foo" Kata container.
## Full details
For further information on how to run the trace forwarder, run:
```bash
$ cargo run -- --help
```
[agent-tracing]: ../../docs/tracing.md
[jaeger-tracing]: https://www.jaegertracing.io
[opentelemetry]: https://opentelemetry.io
[vsock]: https://wiki.qemu.org/Features/VirtioVsock