docs: Write tracing documentation

Add documentation explaining how to trace the runtime and agent.

Fixes: #1892.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
James O. D. Hunt 2021-05-26 17:55:52 +01:00
parent 9db56ffd85
commit 09a5e03f4a
3 changed files with 241 additions and 7 deletions


@@ -11,6 +11,10 @@ For details of the other Kata Containers repositories, see the
* [Installation guides](./install/README.md): Install and run Kata Containers with Docker or Kubernetes
## Tracing
See the [tracing documentation](tracing.md).
## More User Guides
* [Upgrading](Upgrading.md): how to upgrade from [Clear Containers](https://github.com/clearcontainers) and [runV](https://github.com/hyperhq/runv) to [Kata Containers](https://github.com/kata-containers) and how to upgrade an existing Kata Containers system to the latest version.

docs/tracing.md Normal file

@@ -0,0 +1,214 @@
# Overview
This document explains how to trace Kata Containers components.
# Introduction
The Kata Containers runtime and agent are able to generate
[OpenTelemetry][opentelemetry] trace spans, which allow the administrator to
observe what those components are doing and how much time they are spending on
each operation.
# OpenTelemetry summary
An OpenTelemetry-enabled application creates a number of trace "spans". A span
contains the following attributes:
- A name
- A pair of timestamps (recording the start time and end time of some operation)
- A reference to the span's parent span
All spans need to be *finished*, or *completed*, to allow the OpenTelemetry
framework to generate the final trace information (by effectively closing the
transaction encompassing the initial (root) span and all its children).
For Kata, the root span represents the total amount of time taken to run a
particular component from startup to its shutdown (the "run time").
# Architecture
## Runtime tracing architecture
The runtime, which runs in the host environment, has been modified to
optionally generate trace spans which are sent to a trace collector on the
host.
## Agent tracing architecture
An OpenTelemetry system (such as [Jaeger][jaeger-tracing]) uses a collector to
gather up trace spans from the application for viewing and processing. For an
application to use the collector, it must run in the same context as
the collector.
This poses a problem for tracing the Kata Containers agent since it does not
run in the same context as the collector: it runs inside a virtual machine (VM).
To allow spans from the agent to be sent to the trace collector, Kata provides
a [trace forwarder][trace-forwarder] component. This runs in the same context
as the collector (generally on the host system) and listens on a
[`VSOCK`][vsock] channel for traces generated by the agent, forwarding them on
to the trace collector.
> **Note:**
>
> This design supports agent tracing without requiring any changes to the guest
> image, which means that [custom images][osbuilder] can also benefit from
> agent tracing.
The following diagram summarises the architecture used to trace the Kata
Containers agent:
```
+--------------------------------------------+
| Host |
| |
| +---------------+ |
| | OpenTelemetry | |
| | Trace | |
| | Collector | |
| +---------------+ |
| ^ +---------------+ |
| | spans | Kata VM | |
| +-----+-----+ | | |
| | Kata | spans o +-------+ | |
| | Trace |<-----------------| Kata | | |
| | Forwarder | VSOCK o | Agent | | |
| +-----------+ Channel | +-------+ | |
| +---------------+ |
+--------------------------------------------+
```
# Agent tracing prerequisites
- You must have a trace collector running.
Although the collector normally runs directly on the host, it can also be run
as a Docker container, with the appropriate collector ports published to the
host.
The [Jaeger "all-in-one" Docker image][jaeger-all-in-one] method
is the quickest and simplest way to run the collector for testing (an example
command is shown after the notes below).
- If you wish to trace the agent, you must start the
[trace forwarder][trace-forwarder].
> **Notes:**
>
> - If agent tracing is enabled but the forwarder is not running,
> the agent will log an error (signalling that it cannot generate trace
> spans), but continue to work as normal.
>
> - The trace forwarder requires a trace collector (such as Jaeger) to be
> running before it is started. If a collector is not running, the trace
> forwarder will exit with an error.
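For reference, the following is one way to start the Jaeger all-in-one
collector in Docker for testing. The image name and ports follow the upstream
[Jaeger getting-started guide][jaeger-all-in-one]: port `16686` serves the
Jaeger web UI and port `6831/udp` receives trace data. Check that guide for
the full set of ports your deployment may need:
```bash
# Publish the span ingestion port (6831/udp) and the web UI port (16686).
$ docker run -d --rm --name jaeger \
    -p 6831:6831/udp \
    -p 16686:16686 \
    jaegertracing/all-in-one:latest
```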
# Enable tracing
By default, tracing is disabled for all components. To enable _any_ form of
tracing, an `enable_tracing` option must be set for at least one component.
> **Note:**
>
> Enabling this option will only allow tracing for subsequently
> started containers.
## Enable runtime tracing
To enable runtime tracing, set the tracing option as shown:
```toml
[runtime]
enable_tracing = true
```
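The `[runtime]` section lives in the main Kata Containers configuration file.
The snippet below is a minimal sketch of how to confirm the option is set in
the correct section; it assumes the usual packaged path of
`/etc/kata-containers/configuration.toml` (with defaults under
`/usr/share/defaults/kata-containers/`), so adjust the path for your
installation. Changing `runtime` in the pattern checks another section instead:
```bash
# Print any enable_tracing entry that appears inside the [runtime] section.
$ awk '/^\[runtime\]/{f=1; next} /^\[/{f=0} f && /enable_tracing/' \
    /etc/kata-containers/configuration.toml
```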
## Enable agent tracing
To enable agent tracing, set the tracing option as shown:
```toml
[agent.kata]
enable_tracing = true
```
> **Note:**
>
> If both agent tracing and runtime tracing are enabled, the resulting trace
> spans will be "collated": expanding individual runtime spans in the Jaeger
> web UI will show the agent trace spans resulting from the runtime
> operation.
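Putting the pieces together, a complete agent tracing session looks roughly
like the sketch below. The collector command repeats the earlier Jaeger
example, the forwarder is assumed to have been built and installed as
`kata-trace-forwarder` (it can also be run from its source tree with
`cargo run`), and the container step assumes containerd with the Kata runtime
registered as `io.containerd.kata.v2`; substitute the container engine
invocation you normally use:
```bash
# 1. Start the trace collector (Jaeger all-in-one, as shown earlier).
$ docker run -d --rm --name jaeger \
    -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one:latest
# 2. Start the trace forwarder so that agent spans can reach the collector.
$ kata-trace-forwarder &
# 3. Run a short-lived Kata container (with tracing enabled as shown above).
$ sudo ctr image pull docker.io/library/busybox:latest
$ sudo ctr run --runtime io.containerd.kata.v2 --rm \
    docker.io/library/busybox:latest tracing-test true
# 4. Once the container has exited, inspect the spans in the Jaeger web UI
#    (http://localhost:16686 by default).
```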
# Appendices
## Agent tracing requirements
### Host environment
- The host kernel must support the VSOCK socket type.
This will be available if the kernel is built with the
`CONFIG_VHOST_VSOCK` configuration option.
- The VSOCK kernel module must be loaded:
```
$ sudo modprobe vhost_vsock
```
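A quick way to confirm both host requirements is shown below. The kernel
configuration path assumes a distribution that installs
`/boot/config-$(uname -r)`; adjust it if your distribution stores the kernel
configuration elsewhere:
```bash
# Check the kernel was built with VSOCK support ("=y" built in, "=m" module).
$ grep CONFIG_VHOST_VSOCK "/boot/config-$(uname -r)"
# Check the module is loaded (only needed if the option above shows "=m").
$ lsmod | grep vhost_vsock
```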
### Guest environment
- The guest kernel must support the VSOCK socket type.
This will be available if the kernel is built with the
`CONFIG_VIRTIO_VSOCKETS` configuration option.
> **Note:** The default Kata Containers guest kernel provides this feature.
## Agent tracing limitations
- Agent tracing is only "completed" when the workload and the Kata agent
process have exited.
Although trace information *can* be inspected before the workload and agent
have exited, it is incomplete. This is shown as `<trace-without-root-span>`
in the Jaeger web UI.
If the workload is still running, the trace transaction -- which spans the entire
runtime of the Kata agent -- will not have been completed. To view the complete
trace details, wait for the workload to end, or stop the container.
## Performance impact
[OpenTelemetry][opentelemetry] is designed for high performance. It combines
the best of two previous generation projects (OpenTracing and OpenCensus) and
uses a very efficient mechanism to capture trace spans. Further, the trace
points inserted into the agent are generated at compile time. This
is advantageous since new versions of the agent will automatically benefit
from improvements in the tracing infrastructure. Overall, the impact of
enabling runtime and agent tracing should be extremely low.
## Agent shutdown behaviour
In normal operation, the Kata runtime manages the VM shutdown and performs
certain optimisations to speed up this process. However, if agent tracing is
enabled, the agent itself is responsible for shutting down the VM. This is to
ensure all agent trace transactions are completed. This means there will be a
small performance impact on container shutdown when agent tracing is enabled,
as the runtime must wait for the VM to shut down fully.
## Set up a tracing development environment
If you want to debug, further develop, or test tracing,
[enabling full debug][enable-full-debug]
is highly recommended. For working with the agent, you may also wish to
[enable a debug console][setup-debug-console]
to allow you to access the VM environment.
[agent-ctl]: https://github.com/kata-containers/kata-containers/blob/main/tools/agent-ctl
[enable-full-debug]: https://github.com/kata-containers/kata-containers/blob/main/docs/Developer-Guide.md#enable-full-debug
[jaeger-all-in-one]: https://www.jaegertracing.io/docs/getting-started/
[jaeger-tracing]: https://www.jaegertracing.io
[opentelemetry]: https://opentelemetry.io
[osbuilder]: https://github.com/kata-containers/kata-containers/blob/main/tools/osbuilder
[setup-debug-console]: https://github.com/kata-containers/kata-containers/blob/main/docs/Developer-Guide.md#set-up-a-debug-console
[trace-forwarder]: https://github.com/kata-containers/kata-containers/blob/main/src/trace-forwarder
[vsock]: https://wiki.qemu.org/Features/VirtioVsock


@@ -3,13 +3,24 @@
## Overview
The Kata Containers trace forwarder, `kata-trace-forwarder`, is a component
running on the host system which is used to support
[tracing the agent process][agent-tracing], which runs inside the Kata
Containers virtual machine (VM).
The trace forwarder, which must be started before the container, listens over
[`VSOCK`][vsock] for trace data sent by the agent running inside the VM. The
trace spans are exported to an [OpenTelemetry][opentelemetry] collector (such
as [Jaeger][jaeger-tracing]) running by default on the host.
> **Notes:**
>
> - If agent tracing is enabled but the forwarder is not running,
> the agent will log an error (signalling that it cannot generate trace
> spans), but continue to work as normal.
>
> - The trace forwarder requires a trace collector (such as Jaeger) to be
> running before it is started. If a collector is not running, the trace
> forwarder will exit with an error.
## Quick start
@@ -133,8 +144,13 @@ You can now proceed as normal to create the "foo" Kata container.
## Full details
For further information on how to run the trace forwarder, run:
```bash
$ cargo run -- --help
```
[agent-tracing]: ../../docs/tracing.md
[jaeger-tracing]: https://www.jaegertracing.io
[opentelemetry]: https://opentelemetry.io
[vsock]: https://wiki.qemu.org/Features/VirtioVsock