<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.4/docs/proposals/protobuf.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Protobuf serialization and internal storage

@smarterclayton

March 2016

## Proposal and Motivation

The Kubernetes API server is a "dumb server" which offers storage, versioning,
validation, update, and watch semantics on API resources. In a large cluster
the API server must efficiently retrieve, store, and deliver large numbers
of coarse-grained objects to many clients. In addition, Kubernetes traffic is
heavily biased towards intra-cluster traffic - as much as 90% of the requests
served by the APIs are for internal cluster components like nodes, controllers,
and proxies. The primary format for intra-cluster API communication is JSON
today for ease of client construction.

At the current time, the latency of reaction to change in the cluster is
dominated by the time required to load objects from persistent store (etcd),
convert them to an output version, serialize them to JSON over the network, and
then perform the reverse operation in clients. The cost of
serialization/deserialization and the size of the bytes on the wire, as well
as the memory garbage created during those operations, dominate the CPU and
network usage of the API servers.

In order to reach clusters of 10k nodes, we need roughly an order of magnitude
efficiency improvement in a number of areas of the cluster, starting with the
masters but also including API clients like controllers, kubelets, and node
proxies.

We propose to introduce a Protobuf serialization for all common API objects
that can optionally be used by intra-cluster components. Experiments have
demonstrated a 10x reduction in CPU use during serialization and deserialization,
a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the number
of objects created on the heap during serialization. The Protobuf schema
for each object will be automatically generated from the external API Go structs
we use to serialize to JSON.

Benchmarking showed that the time spent on the server in a typical GET
resembles:

           etcd -> decode    -> defaulting -> convert to internal ->
    JSON           50us         5us           15us
    Proto          5us
    JSON           150allocs                  80allocs
    Proto          100allocs

           process -> convert to external -> encode -> client
    JSON              15us                   40us
    Proto                                    5us
    JSON              80allocs               100allocs
    Proto                                    4allocs

Protobuf has a huge benefit on encoding because it does not need to allocate
temporary objects, just one large buffer. Changing to protobuf moves our
hotspot back to conversion, not serialization.

## Design Points

* Generate Protobuf schema from Go structs (like we do for JSON) to avoid
  manual schema update and drift.
* Generate Protobuf schema that is field equivalent to the JSON fields (no
  special types or enumerations), reducing drift for clients across formats.
* Follow our existing API versioning rules (backwards compatible in major
  API versions, breaking changes across major versions) by creating one
  Protobuf schema per API type.
* Continue to use the existing REST API patterns but offer an alternative
  serialization, which means existing client and server tooling can remain
  the same while benefiting from faster decoding.
* Protobuf objects on disk or in etcd will need to be self identifying at
  rest, like JSON, in order for backwards compatibility in storage to work,
  so we must add an envelope with apiVersion and kind to wrap the nested
  object, and make the data format recognizable to clients.
* Use the [gogo-protobuf](https://github.com/gogo/protobuf) Go library to generate marshal/unmarshal
  operations, allowing us to bypass the expensive reflection used by the
  Go JSON operations.

## Alternatives

* We considered JSON compression to reduce size on wire, but that does not
  reduce the amount of memory garbage created during serialization and
  deserialization.
* More efficient formats like Msgpack were considered, but they only offer
  a 2x speedup vs the 10x observed for Protobuf.
* gRPC was considered, but is a larger change that requires more core
  refactoring. This approach does not eliminate the possibility of switching
  to gRPC in the future.
* We considered attempting to improve JSON serialization, but the cost of
  implementing a more efficient serializer library than ugorji is
  significantly higher than creating a protobuf schema from our Go structs.

## Schema

The Protobuf schema for each API group and version will be generated from
the objects in that API group and version. The schema will be named using
the package identifier of the Go package, i.e.

    k8s.io/kubernetes/pkg/api/v1

Each top level object will be generated as a Protobuf message, i.e.:

    type Pod struct { ... }

    message Pod {}

Since the Go structs are designed to be serialized to JSON (with only the
int, string, bool, map, and array primitive types), we will use the
canonical JSON serialization as the protobuf field type wherever possible,
i.e.:

    JSON      Protobuf
    string -> string
    int    -> varint
    bool   -> bool
    array  -> repeated message|primitive

We disallow the use of the Go `int` type in external fields because it is
ambiguous depending on compiler platform, and instead always use `int32` or
`int64`.

We will use maps (a protobuf 3 extension that can serialize to protobuf 2)
to represent JSON maps:

    JSON    Protobuf            Wire (proto2)
    map  -> map<string, ...> -> repeated Message { key string; value bytes }

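As a minimal sketch of the wire representation described above, the following Go program hand-encodes a `map[string]string` as repeated entry messages, with the key in field 1 and the value in field 2 of each entry. The helper names (`appendUvarint`, `appendBytesField`, `encodeStringMap`) are hypothetical and exist only for illustration; real code would rely on the generated marshallers rather than writing bytes by hand.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sort"
)

// appendUvarint appends v in base-128 varint form, the encoding protobuf
// uses for field keys, lengths, and integer values.
func appendUvarint(buf []byte, v uint64) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], v)
	return append(buf, tmp[:n]...)
}

// appendBytesField appends a length-delimited field (wire type 2).
func appendBytesField(buf []byte, fieldNum int, data []byte) []byte {
	buf = appendUvarint(buf, uint64(fieldNum)<<3|2)
	buf = appendUvarint(buf, uint64(len(data)))
	return append(buf, data...)
}

// encodeStringMap writes each map entry as one repeated entry message with
// the key in field 1 and the value in field 2. Keys are sorted only to make
// the output of this example deterministic.
func encodeStringMap(fieldNum int, m map[string]string) []byte {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var out []byte
	for _, k := range keys {
		var entry []byte
		entry = appendBytesField(entry, 1, []byte(k))
		entry = appendBytesField(entry, 2, []byte(m[k]))
		out = appendBytesField(out, fieldNum, entry)
	}
	return out
}

func main() {
	// Prints: 0a 06 0a 01 61 12 01 62
	fmt.Printf("% x\n", encodeStringMap(1, map[string]string{"a": "b"}))
}
```
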
We will not convert known string constants to enumerations, since that
would require extra logic we do not already have in JSON.

To begin with, we will use Protobuf 3 to generate a Protobuf 2 schema, and
in the future investigate a Protobuf 3 serialization. We will introduce
abstractions that let us have more than a single protobuf serialization if
necessary. Protobuf 3 would require us to support message types for
pointer primitive (nullable) fields, which is more complex than Protobuf 2's
support for pointers.

### Example of generated proto IDL

Without gogo extensions:

```
syntax = 'proto2';

package k8s.io.kubernetes.pkg.api.v1;

import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
import "k8s.io/kubernetes/pkg/runtime/generated.proto";
import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";

// Package-wide variables from generator "generated".
option go_package = "v1";

// Represents a Persistent Disk resource in AWS.
//
// An AWS EBS disk must exist before mounting to a container. The disk
// must also be in the same AWS zone as the kubelet. An AWS EBS disk
// can only be mounted as read/write once. AWS EBS volumes support
// ownership management and SELinux relabeling.
message AWSElasticBlockStoreVolumeSource {
  // Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional string volumeID = 1;

  // Filesystem type of the volume that you want to mount.
  // Tip: Ensure that the filesystem type is supported by the host operating system.
  // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  // TODO: how do we prevent errors in the filesystem from compromising the machine
  optional string fsType = 2;

  // The partition in the volume that you want to mount.
  // If omitted, the default is to mount by volume name.
  // Examples: For volume /dev/sda1, you specify the partition as "1".
  // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
  optional int32 partition = 3;

  // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
  // If omitted, the default is "false".
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional bool readOnly = 4;
}

// Affinity is a group of affinity scheduling rules, currently
// only node affinity, but in the future also inter-pod affinity.
message Affinity {
  // Describes node affinity scheduling rules for the pod.
  optional NodeAffinity nodeAffinity = 1;
}
```

With extensions:

```
syntax = 'proto2';

package k8s.io.kubernetes.pkg.api.v1;

import "github.com/gogo/protobuf/gogoproto/gogo.proto";
import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
import "k8s.io/kubernetes/pkg/runtime/generated.proto";
import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";

// Package-wide variables from generator "generated".
option (gogoproto.marshaler_all) = true;
option (gogoproto.sizer_all) = true;
option (gogoproto.unmarshaler_all) = true;
option (gogoproto.goproto_unrecognized_all) = false;
option (gogoproto.goproto_enum_prefix_all) = false;
option (gogoproto.goproto_getters_all) = false;
option go_package = "v1";

// Represents a Persistent Disk resource in AWS.
//
// An AWS EBS disk must exist before mounting to a container. The disk
// must also be in the same AWS zone as the kubelet. An AWS EBS disk
// can only be mounted as read/write once. AWS EBS volumes support
// ownership management and SELinux relabeling.
message AWSElasticBlockStoreVolumeSource {
  // Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional string volumeID = 1 [(gogoproto.customname) = "VolumeID", (gogoproto.nullable) = false];

  // Filesystem type of the volume that you want to mount.
  // Tip: Ensure that the filesystem type is supported by the host operating system.
  // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  // TODO: how do we prevent errors in the filesystem from compromising the machine
  optional string fsType = 2 [(gogoproto.customname) = "FSType", (gogoproto.nullable) = false];

  // The partition in the volume that you want to mount.
  // If omitted, the default is to mount by volume name.
  // Examples: For volume /dev/sda1, you specify the partition as "1".
  // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
  optional int32 partition = 3 [(gogoproto.customname) = "Partition", (gogoproto.nullable) = false];

  // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
  // If omitted, the default is "false".
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional bool readOnly = 4 [(gogoproto.customname) = "ReadOnly", (gogoproto.nullable) = false];
}

// Affinity is a group of affinity scheduling rules, currently
// only node affinity, but in the future also inter-pod affinity.
message Affinity {
  // Describes node affinity scheduling rules for the pod.
  optional NodeAffinity nodeAffinity = 1 [(gogoproto.customname) = "NodeAffinity"];
}
```

## Wire format

In order to make Protobuf serialized objects recognizable in a binary form,
the encoded object must be prefixed by a magic number, and the
non-self-describing Protobuf object must be wrapped in a Protobuf object that
contains schema information. The protobuf object is referred to as the `raw`
object and the encapsulation is referred to as the `wrapper` object.

The simplest serialization is the raw Protobuf object with no identifying
information. In some use cases, we may wish to have the server identify the
raw object type on the wire using a protocol dependent format (gRPC uses
a type HTTP header). This works when all objects are of the same type, but
we occasionally have reasons to encode different object types in the same
context (watches, lists of objects on disk, and API calls that may return
errors).

To identify the type of a wrapped Protobuf object, we wrap it in a message
in package `k8s.io/kubernetes/pkg/runtime` with message name `Unknown`
having the following schema:

    message Unknown {
      optional TypeMeta typeMeta = 1;
      optional bytes value = 2;
      optional string contentEncoding = 3;
      optional string contentType = 4;
    }

    message TypeMeta {
      optional string apiVersion = 1;
      optional string kind = 2;
    }

The `value` field holds an encoded protobuf object that matches the schema
identified by `typeMeta`. The optional `contentType` and `contentEncoding`
fields have the same meaning as in HTTP: if unspecified, `contentType` means
"raw protobuf object" and `contentEncoding` defaults to no encoding. If
`contentEncoding` is specified, the defined transformation should be applied
to `value` before attempting to decode the value.

The `contentType` field is required to support objects without a defined
protobuf schema, like the ThirdPartyResource or templates. Those objects
would have to be encoded as JSON or another structure compatible form
when used with Protobuf. Generic clients must deal with the possibility
that the returned value is not of a known type.

We add the `contentEncoding` field here to preserve room for future
optimizations like encryption-at-rest or compression of the nested content.
Clients should error when receiving an encoding they do not support.
Negotiating encoding is not defined here, but introducing new encodings
is similar to introducing a schema change or new API version.

A client should use the `kind` and `apiVersion` fields to identify the
correct protobuf IDL for that message and version, and then decode the
bytes in the `value` field into that Protobuf message.

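The client-side rules above can be sketched in Go as follows. This is only an illustration under stated assumptions: the `Unknown` struct here is a hand-written stand-in for the generated `runtime.Unknown` type, and `decodeFunc`, `decodeUnknown`, and the `apiVersion/kind` registry key are hypothetical names. The sketch supports no content encodings and, as a simplification, treats any explicitly set `contentType` as "not a raw protobuf object".

```go
package main

import (
	"errors"
	"fmt"
)

// Unknown is a hand-written stand-in for the wrapper message
// (hypothetical; real code would use the generated type).
type Unknown struct {
	APIVersion      string
	Kind            string
	Value           []byte
	ContentEncoding string
	ContentType     string
}

// decodeFunc turns raw protobuf bytes into a typed object.
type decodeFunc func(raw []byte) (interface{}, error)

// errNotProtobuf signals that Value holds some other format, such as JSON.
var errNotProtobuf = errors.New("value is not a raw protobuf object")

// decodeUnknown selects a decoder using apiVersion and kind, then decodes
// the value bytes with it.
func decodeUnknown(u Unknown, decoders map[string]decodeFunc) (interface{}, error) {
	if u.ContentEncoding != "" {
		// This sketch supports no encodings, so any encoding is an error.
		return nil, fmt.Errorf("unsupported contentEncoding %q", u.ContentEncoding)
	}
	if u.ContentType != "" {
		// For example JSON for objects without a protobuf schema.
		return nil, errNotProtobuf
	}
	dec, ok := decoders[u.APIVersion+"/"+u.Kind]
	if !ok {
		return nil, fmt.Errorf("no protobuf schema for apiVersion=%s kind=%s", u.APIVersion, u.Kind)
	}
	return dec(u.Value)
}

func main() {
	decoders := map[string]decodeFunc{
		// Placeholder decoder: a real one would unmarshal into a Pod struct.
		"v1/Pod": func(raw []byte) (interface{}, error) { return len(raw), nil },
	}
	obj, err := decodeUnknown(Unknown{APIVersion: "v1", Kind: "Pod", Value: []byte{0x0a, 0x00}}, decoders)
	fmt.Println(obj, err)
}
```
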
Any Unknown value written to stable storage will be given a 4 byte prefix
`0x6b, 0x38, 0x73, 0x00`, which corresponds to `k8s` followed by a zero byte.
The content-type `application/vnd.kubernetes.protobuf` is defined as
representing the following schema:

    MESSAGE = '0x6b 0x38 0x73 0x00' UNKNOWN
    UNKNOWN = <protobuf serialization of k8s.io/kubernetes/pkg/runtime#Unknown>

A client should check for the first four bytes, then perform a protobuf
deserialization of the remaining bytes into the `runtime.Unknown` type.

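A minimal Go sketch of the prefix handling described above is shown here. The function names `addMagic` and `stripMagic` are hypothetical, and the bytes passed in stand for an already serialized `Unknown` message; decoding those bytes is left to the protobuf library.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// magic is the 4-byte prefix: "k8s" followed by a zero byte.
var magic = []byte{0x6b, 0x38, 0x73, 0x00}

// addMagic returns the stored form of an already serialized Unknown message.
func addMagic(unknown []byte) []byte {
	out := make([]byte, 0, len(magic)+len(unknown))
	out = append(out, magic...)
	return append(out, unknown...)
}

// stripMagic verifies the prefix and returns the bytes that follow it,
// which the caller would then unmarshal as an Unknown message.
func stripMagic(data []byte) ([]byte, error) {
	if !bytes.HasPrefix(data, magic) {
		return nil, errors.New("data does not start with the k8s protobuf prefix")
	}
	return data[len(magic):], nil
}

func main() {
	stored := addMagic([]byte{0x0a, 0x00})
	rest, err := stripMagic(stored)
	// Prints: 6b 38 73 00 0a 00 | 0a 00 | <nil>
	fmt.Printf("% x | % x | %v\n", stored, rest, err)

	// JSON data at rest does not carry the prefix and is rejected.
	_, err = stripMagic([]byte(`{"kind":"Pod"}`))
	fmt.Println(err)
}
```

Checking the prefix first lets a reader tell protobuf data apart from JSON data in the same store before attempting any protobuf decoding.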
## Streaming wire format

While the majority of Kubernetes APIs return single objects that can vary
in type (Pod vs Status, PodList vs Status), the watch APIs return a stream
of identical objects (Events). At the time of this writing, this is the only
current or anticipated streaming RESTful protocol (logging, port-forwarding,
and exec protocols use a binary protocol over Websockets or SPDY).

In JSON, this API is implemented as a stream of JSON objects that are
separated by their syntax (the closing `}` brace is followed by whitespace
and the opening `{` brace starts the next object). There is no formal
specification covering this pattern, nor a unique content-type. Each object
is expected to be of type `watch.Event`, and is currently not self describing.

For expediency and consistency, we define a format for Protobuf watch Events
that is similar. Since protobuf messages are not self describing, we must
identify the boundaries between Events (a `frame`). We do that by prefixing
each frame of N bytes with a 4-byte, big-endian, unsigned integer with the
value N.

    frame = length body
    length = 32-bit unsigned integer in big-endian order, denoting length of
             bytes of body
    body = <bytes>

    # frame containing a single byte 0a
    frame = 00 00 00 01 0a

    # equivalent JSON
    frame = {"type": "added", ...}

The body of each frame is a serialized Protobuf message `Event` in package
`k8s.io/kubernetes/pkg/watch/versioned`. The content type used for this
format is `application/vnd.kubernetes.protobuf;type=watch`.

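The framing above can be sketched in Go with the standard `encoding/binary` package. The function names (`writeFrame`, `readFrame`, `encodeFrames`, `decodeFrames`) are hypothetical and the frame bodies are treated as opaque bytes. A production reader would also want to cap the accepted frame length before allocating, which this sketch omits.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// writeFrame writes a 4-byte big-endian length followed by the body.
func writeFrame(w io.Writer, body []byte) error {
	var header [4]byte
	binary.BigEndian.PutUint32(header[:], uint32(len(body)))
	if _, err := w.Write(header[:]); err != nil {
		return err
	}
	_, err := w.Write(body)
	return err
}

// readFrame reads one frame and returns its body. It returns io.EOF only
// when the stream ends cleanly between frames.
func readFrame(r io.Reader) ([]byte, error) {
	var header [4]byte
	if _, err := io.ReadFull(r, header[:]); err != nil {
		return nil, err
	}
	body := make([]byte, binary.BigEndian.Uint32(header[:]))
	if _, err := io.ReadFull(r, body); err != nil {
		if err == io.EOF {
			// A header with no body is a truncated stream, not a clean end.
			err = io.ErrUnexpectedEOF
		}
		return nil, err
	}
	return body, nil
}

// encodeFrames concatenates the framed form of each body.
func encodeFrames(bodies ...[]byte) []byte {
	var buf bytes.Buffer
	for _, b := range bodies {
		_ = writeFrame(&buf, b) // writes to a bytes.Buffer do not fail
	}
	return buf.Bytes()
}

// decodeFrames splits a byte slice holding zero or more frames into bodies.
func decodeFrames(data []byte) ([][]byte, error) {
	r := bytes.NewReader(data)
	var bodies [][]byte
	for {
		body, err := readFrame(r)
		if err == io.EOF {
			return bodies, nil
		}
		if err != nil {
			return nil, err
		}
		bodies = append(bodies, body)
	}
}

func main() {
	// Prints: 00 00 00 01 0a
	fmt.Printf("% x\n", encodeFrames([]byte{0x0a}))

	bodies, err := decodeFrames(encodeFrames([]byte{0x0a}, []byte("hi")))
	fmt.Println(len(bodies), err)
}
```
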
## Negotiation

To allow clients to request protobuf serialization optionally, the `Accept`
HTTP header is used by callers to indicate which serialization they wish
returned in the response, and the `Content-Type` header is used to tell the
server how to decode the bytes sent in the request (for DELETE/POST/PUT/PATCH
requests). The server will return 406 if the `Accept` header is not
recognized or 415 if the `Content-Type` is not recognized (as defined in
RFC2616).

To be backwards compatible, clients must consider that the server may not
support protobuf serialization. A number of options are possible:

### Preconfigured

Clients can have a configuration setting that instructs them which version
to use. This is the simplest option, but requires intervention when the
component upgrades to protobuf.

### Include serialization information in api-discovery

Servers can define the list of content types they accept and return in
their API discovery docs, and clients can use protobuf if they support it.
This allows dynamic configuration during upgrade if the client is already
using API-discovery.

### Optimistically attempt to send and receive requests using protobuf

Using multiple `Accept` values:

    Accept: application/vnd.kubernetes.protobuf, application/json

clients can indicate their preferences and handle the returned
`Content-Type` using whatever the server responds. On update operations,
clients can try protobuf and if they receive a 415 error, record that and
fall back to JSON. This allows the client to be backwards compatible with
any server, but comes at the cost of some implementation complexity.

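The optimistic option could be tracked on the client roughly as in the Go sketch below. The `negotiator` type and its methods are hypothetical names introduced for illustration; the sketch only records the 415 outcome and chooses headers, and leaves sending the request and re-encoding the body to the caller.

```go
package main

import (
	"fmt"
	"net/http"
)

const (
	protobufType = "application/vnd.kubernetes.protobuf"
	jsonType     = "application/json"
)

// negotiator remembers whether the server has rejected protobuf bodies.
type negotiator struct {
	protobufRejected bool
}

// acceptHeader lists protobuf first and JSON as the fallback.
func (n *negotiator) acceptHeader() string {
	return protobufType + ", " + jsonType
}

// requestContentType is the encoding to use for the next request body.
func (n *negotiator) requestContentType() string {
	if n.protobufRejected {
		return jsonType
	}
	return protobufType
}

// observe records the outcome of a request sent with sentType and reports
// whether the caller should retry the same request encoded as JSON.
func (n *negotiator) observe(sentType string, status int) bool {
	if status == http.StatusUnsupportedMediaType && sentType == protobufType {
		n.protobufRejected = true
		return true
	}
	return false
}

func main() {
	n := &negotiator{}
	fmt.Println("Accept:", n.acceptHeader())

	first := n.requestContentType()
	fmt.Println("first attempt:", first)

	// Pretend an older server answered the first update with 415.
	if n.observe(first, http.StatusUnsupportedMediaType) {
		fmt.Println("retry with:", n.requestContentType())
	}
}
```

Recording the rejection once avoids paying for a failed protobuf attempt on every later update against the same server.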
## Generation process

Generation proceeds in five phases:

1. Generate a gogo-protobuf annotated IDL from the source Go struct.
2. Generate temporary Go structs from the IDL using gogo-protobuf.
3. Generate marshallers/unmarshallers based on the IDL using gogo-protobuf.
4. Take all tag numbers generated for the IDL and apply them as struct tags
   to the original Go types.
5. Generate a final IDL without gogo-protobuf annotations as the canonical IDL.

The output is a `generated.proto` file in each package containing a standard
proto2 IDL, and a `generated.pb.go` file in each package that contains the
generated marshallers/unmarshallers.

The Go struct generated by gogo-protobuf from the first IDL must be identical
to the origin struct - a number of changes have been made to gogo-protobuf
to ensure exact 1-1 conversion. A small number of additions may be necessary
in the future if we introduce more exotic field types (Go type aliases, maps
with aliased Go types, and embedded fields were fixed). If they are identical,
the output marshallers/unmarshallers can then work on the origin struct.

Whenever a new field is added, generation will assign that field a unique tag
and the 4th phase will write that tag back to the origin Go struct as a `protobuf`
struct tag. This ensures subsequent generation passes are stable, even in the
face of internal refactors. The first time a field is added, the author will
need to check in both the new IDL AND the protobuf struct tag changes.

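To illustrate what the written-back tags might look like, the Go sketch below annotates a struct with `protobuf` struct tags and reads the tag numbers back with reflection. The tag layout (`wire type, tag number, label, name=...`) is an assumption based on the struct tag convention used by Go protobuf generators, and `tagNumber` is a hypothetical helper rather than part of the generator.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
	"strings"
)

// AWSElasticBlockStoreVolumeSource mirrors the IDL example, with the tag
// numbers recorded on the Go fields.
// Assumed tag layout: "<wire type>,<tag number>,<label>,name=<field name>".
type AWSElasticBlockStoreVolumeSource struct {
	VolumeID  string `json:"volumeID" protobuf:"bytes,1,opt,name=volumeID"`
	FSType    string `json:"fsType,omitempty" protobuf:"bytes,2,opt,name=fsType"`
	Partition int32  `json:"partition,omitempty" protobuf:"varint,3,opt,name=partition"`
	ReadOnly  bool   `json:"readOnly,omitempty" protobuf:"varint,4,opt,name=readOnly"`
}

// tagNumber returns the protobuf tag recorded on a struct field, or 0 if the
// field is missing or carries no parsable protobuf struct tag.
func tagNumber(v interface{}, field string) int {
	t := reflect.TypeOf(v)
	if t == nil || t.Kind() != reflect.Struct {
		return 0
	}
	f, ok := t.FieldByName(field)
	if !ok {
		return 0
	}
	parts := strings.Split(f.Tag.Get("protobuf"), ",")
	if len(parts) < 2 {
		return 0
	}
	n, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0
	}
	return n
}

func main() {
	for _, name := range []string{"VolumeID", "FSType", "Partition", "ReadOnly"} {
		fmt.Println(name, tagNumber(AWSElasticBlockStoreVolumeSource{}, name))
	}
}
```

Because the number lives on the field itself, renaming or reordering Go fields does not change the tag that a later generation pass assigns.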
The second IDL is generated without gogo-protobuf annotations to allow clients
in other languages to generate easily.

Any errors in the generation process are considered fatal and must be resolved
early (being unable to identify a field type for conversion, duplicate fields,
duplicate tags, protoc errors, etc). The conversion fuzzer is used to ensure
that a Go struct can be round-tripped to protobuf and back, as we do for JSON
and conversion testing.

## Changes to development process

All existing API change rules would still apply. New fields added would be
automatically assigned a tag by the generation process. New API versions will
have a new proto IDL, and field renames and other changes across API versions
would be handled using our existing API change rules. Tags cannot change
within an API version.

Generation would be done by developers and then checked into source control,
like conversions and ugorji JSON codecs.

Because protoc is not packaged well across all platforms, we will add it to
the `kube-cross` Docker image and developers can use that to generate
updated protobufs. Protobuf 3 beta is required.

The generated protobuf will be checked with a verify script before merging.

## Implications

* The generated marshal code is large and will increase build times and binary
  size. We may be able to remove ugorji after protobuf is added, since the
  bulk of our decoding would switch to protobuf.
* The protobuf schema is naive, which means it may not be as minimal as
  possible.
* Debugging of protobuf related errors is harder due to the binary nature of
  the format.
* Migrating API object storage from JSON to protobuf will require that all
  API servers are upgraded before beginning to write protobuf to disk, since
  old servers won't recognize protobuf.
* Transport of protobuf between etcd and the api server will be less efficient
  in etcd2 than etcd3 (since etcd2 must encode binary values returned as JSON).
  It should still be smaller than the current JSON request.
* Third-party API objects must be stored as JSON inside of a protobuf wrapper
  in etcd, and the API endpoints will not benefit from clients that speak
  protobuf. Clients will have to deal with some API objects not supporting
  protobuf.

## Open Questions

* Is supporting stored protobuf files on disk in the kubectl client worth it?
