 
 
 
 
 
PLEASE NOTE: This document applies to the HEAD of the source tree
If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
--
# Protobuf serialization and internal storage
@smarterclayton
March 2016
## Proposal and Motivation
The Kubernetes API server is a "dumb server" which offers storage, versioning,
validation, update, and watch semantics on API resources. In a large cluster
the API server must efficiently retrieve, store, and deliver large numbers
of coarse-grained objects to many clients. In addition, Kubernetes traffic is
heavily biased towards intra-cluster traffic - as much as 90% of the requests
served by the APIs are for internal cluster components like nodes, controllers,
and proxies. The primary format for intercluster API communication is JSON
today for ease of client construction.
At the current time, the latency of reaction to change in the cluster is
dominated by the time required to load objects from persistent store (etcd),
convert them to an output version, serialize them JSON over the network, and
then perform the reverse operation in clients. The cost of
serialization/deserialization and the size of the bytes on the wire, as well
as the memory garbage created during those operations, dominate the CPU and
network usage of the API servers.
In order to reach clusters of 10k nodes, we need roughly an order of magnitude
efficiency improvement in a number of areas of the cluster, starting with the
masters but also including API clients like controllers, kubelets, and node
proxies.
We propose to introduce a Protobuf serialization for all common API objects
that can optionally be used by intra-cluster components. Experiments have
demonstrated a 10x reduction in CPU use during serialization and deserialization,
a 2x reduction in size in bytes on the wire, and a 6-9x reduction in the amount
of objects created on the heap during serialization. The Protobuf schema
for each object will be automatically generated from the external API Go structs
we use to serialize to JSON.
Benchmarking showed that the time spent on the server in a typical GET
resembles:
          etcd -> decode -> defaulting -> convert to internal ->
    JSON          50us      5us           15us
    Proto         5us
    JSON          150allocs               80allocs
    Proto         100allocs
          process -> convert to external -> encode -> client
    JSON             15us                   40us
    Proto                                   5us
    JSON             80allocs               100allocs
    Proto                                   4allocs
 Protobuf has a huge benefit on encoding because it does not need to allocate
 temporary objects, just one large buffer. Changing to protobuf moves our
 hotspot back to conversion, not serialization.
## Design Points
* Generate Protobuf schema from Go structs (like we do for JSON) to avoid
  manual schema update and drift
* Generate Protobuf schema that is field equivalent to the JSON fields (no
  special types or enumerations), reducing drift for clients across formats.
* Follow our existing API versioning rules (backwards compatible in major
  API versions, breaking changes across major versions) by creating one
  Protobuf schema per API type.
* Continue to use the existing REST API patterns but offer an alternative
  serialization, which means existing client and server tooling can remain
  the same while benefiting from faster decoding.
* Protobuf objects on disk or in etcd will need to be self identifying at
  rest, like JSON, in order for backwards compatibility in storage to work,
  so we must add an envelope with apiVersion and kind to wrap the nested
  object, and make the data format recognizable to clients.
* Use the [gogo-protobuf](https://github.com/gogo/protobuf) Golang library to generate marshal/unmarshal
  operations, allowing us to bypass the expensive reflection used by the
  golang JSOn operation
## Alternatives
* We considered JSON compression to reduce size on wire, but that does not
  reduce the amount of memory garbage created during serialization and
  deserialization.
* More efficient formats like Msgpack were considered, but they only offer
  2x speed up vs the 10x observed for Protobuf
* gRPC was considered, but is a larger change that requires more core
  refactoring. This approach does not eliminate the possibility of switching
  to gRPC in the future.
* We considered attempting to improve JSON serialization, but the cost of
  implementing a more efficient serializer library than ugorji is
  significantly higher than creating a protobuf schema from our Go structs.
## Schema
The Protobuf schema for each API group and version will be generated from
the objects in that API group and version. The schema will be named using
the package identifier of the Go package, i.e.
    k8s.io/kubernetes/pkg/api/v1
Each top level object will be generated as a Protobuf message, i.e.:
    type Pod struct { ... }
    message Pod {}
Since the Go structs are designed to be serialized to JSON (with only the
int, string, bool, map, and array primitive types), we will use the
canonical JSON serialization as the protobuf field type wherever possible,
i.e.:
    JSON      Protobuf
    string -> string
    int    -> varint
    bool   -> bool
    array  -> repeating message|primitive
We disallow the use of the Go `int` type in external fields because it is
ambiguous depending on compiler platform, and instead always use `int32` or
`int64`.
We will use maps (a protobuf 3 extension that can serialize to protobuf 2)
to represent JSON maps:
    JSON      Protobuf            Wire (proto2)
    map    -> map -> repeated Message { key string; value bytes }
We will not convert known string constants to enumerations, since that
would require extra logic we do not already have in JSOn.
To begin with, we will use Protobuf 3 to generate a Protobuf 2 schema, and
in the future investigate a Protobuf 3 serialization. We will introduce
abstractions that let us have more than a single protobuf serialization if
necessary. Protobuf 3 would require us to support message types for
pointer primitive (nullable) fields, which is more complex than Protobuf 2's
support for pointers.
### Example of generated proto IDL
Without gogo extensions:
```
syntax = 'proto2';
package k8s.io.kubernetes.pkg.api.v1;
import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
import "k8s.io/kubernetes/pkg/runtime/generated.proto";
import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";
// Package-wide variables from generator "generated".
option go_package = "v1";
// Represents a Persistent Disk resource in AWS.
//
// An AWS EBS disk must exist before mounting to a container. The disk
// must also be in the same AWS zone as the kubelet. An AWS EBS disk
// can only be mounted as read/write once. AWS EBS volumes support
// ownership management and SELinux relabeling.
message AWSElasticBlockStoreVolumeSource {
  // Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional string volumeID = 1;
  // Filesystem type of the volume that you want to mount.
  // Tip: Ensure that the filesystem type is supported by the host operating system.
  // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  // TODO: how do we prevent errors in the filesystem from compromising the machine
  optional string fsType = 2;
  // The partition in the volume that you want to mount.
  // If omitted, the default is to mount by volume name.
  // Examples: For volume /dev/sda1, you specify the partition as "1".
  // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
  optional int32 partition = 3;
  // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
  // If omitted, the default is "false".
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional bool readOnly = 4;
}
// Affinity is a group of affinity scheduling rules, currently
// only node affinity, but in the future also inter-pod affinity.
message Affinity {
  // Describes node affinity scheduling rules for the pod.
  optional NodeAffinity nodeAffinity = 1;
}
```
With extensions:
```
syntax = 'proto2';
package k8s.io.kubernetes.pkg.api.v1;
import "github.com/gogo/protobuf/gogoproto/gogo.proto";
import "k8s.io/kubernetes/pkg/api/resource/generated.proto";
import "k8s.io/kubernetes/pkg/api/unversioned/generated.proto";
import "k8s.io/kubernetes/pkg/runtime/generated.proto";
import "k8s.io/kubernetes/pkg/util/intstr/generated.proto";
// Package-wide variables from generator "generated".
option (gogoproto.marshaler_all) = true;
option (gogoproto.sizer_all) = true;
option (gogoproto.unmarshaler_all) = true;
option (gogoproto.goproto_unrecognized_all) = false;
option (gogoproto.goproto_enum_prefix_all) = false;
option (gogoproto.goproto_getters_all) = false;
option go_package = "v1";
// Represents a Persistent Disk resource in AWS.
//
// An AWS EBS disk must exist before mounting to a container. The disk
// must also be in the same AWS zone as the kubelet. An AWS EBS disk
// can only be mounted as read/write once. AWS EBS volumes support
// ownership management and SELinux relabeling.
message AWSElasticBlockStoreVolumeSource {
  // Unique ID of the persistent disk resource in AWS (Amazon EBS volume).
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional string volumeID = 1 [(gogoproto.customname) = "VolumeID", (gogoproto.nullable) = false];
  // Filesystem type of the volume that you want to mount.
  // Tip: Ensure that the filesystem type is supported by the host operating system.
  // Examples: "ext4", "xfs", "ntfs". Implicitly inferred to be "ext4" if unspecified.
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  // TODO: how do we prevent errors in the filesystem from compromising the machine
  optional string fsType = 2 [(gogoproto.customname) = "FSType", (gogoproto.nullable) = false];
  // The partition in the volume that you want to mount.
  // If omitted, the default is to mount by volume name.
  // Examples: For volume /dev/sda1, you specify the partition as "1".
  // Similarly, the volume partition for /dev/sda is "0" (or you can leave the property empty).
  optional int32 partition = 3 [(gogoproto.customname) = "Partition", (gogoproto.nullable) = false];
  // Specify "true" to force and set the ReadOnly property in VolumeMounts to "true".
  // If omitted, the default is "false".
  // More info: http://releases.k8s.io/HEAD/docs/user-guide/volumes.md#awselasticblockstore
  optional bool readOnly = 4 [(gogoproto.customname) = "ReadOnly", (gogoproto.nullable) = false];
}
// Affinity is a group of affinity scheduling rules, currently
// only node affinity, but in the future also inter-pod affinity.
message Affinity {
  // Describes node affinity scheduling rules for the pod.
  optional NodeAffinity nodeAffinity = 1 [(gogoproto.customname) = "NodeAffinity"];
}
```
## Wire format
In order to make Protobuf serialized objects recognizable in a binary form,
the encoded object must be prefixed by a magic number, and then wrap the
non-self-describing Protobuf object in a Protobuf object that contains
schema information.  The protobuf object is referred to as the `raw` object
and the encapsulation is referred to as `wrapper` object.
The simplest serialization is the raw Protobuf object with no identifying
information. In some use cases, we may wish to have the server identify the
raw object type on the wire using a protocol dependent format (gRPC uses
a type HTTP header). This works when all objects are of the same type, but
we occasionally have reasons to encode different object types in the same
context (watches, lists of objects on disk, and API calls that may return
errors).
To identify the type of a wrapped Protobuf object, we wrap it in a message
in package `k8s.io/kubernetes/pkg/runtime` with message name `Unknown`
having the following schema:
    message Unknown {
      optional TypeMeta typeMeta = 1;
      optional bytes value = 2;
      optional string contentEncoding = 3;
      optional string contentType = 4;
    }
    message TypeMeta {
      optional string apiVersion = 1;
      optional string kind = 2;
    }
The `value` field is an encoded protobuf object that matches the schema
defined in `typeMeta` and has optional `contentType` and `contentEncoding`
fields.  `contentType` and `contentEncoding` have the same meaning as in
HTTP, if unspecified `contentType` means "raw protobuf object", and
`contentEncoding` defaults to no encoding. If `contentEncoding` is
specified, the defined transformation should be applied to `value` before
attempting to decode the value.
The `contentType` field is required to support objects without a defined
protobuf schema, like the ThirdPartyResource or templates. Those objects
would have to be encoded as JSON or another structure compatible form
when used with Protobuf. Generic clients must deal with the possibility
that the returned value is not in the known type.
We add the `contentEncoding` field here to preserve room for future
optimizations like encryption-at-rest or compression of the nested content.
Clients should error when receiving an encoding they do not support.
Negotioting encoding is not defined here, but introducing new encodings
is similar to introducing a schema change or new API version.
A client should use the `kind` and `apiVersion` fields to identify the
correct protobuf IDL for that message and version, and then decode the
`bytes` field into that Protobuf message.
Any Unknown value written to stable storage will be given a 4 byte prefix
`0x6b, 0x38, 0x73, 0x00`, which correspond to `k8s` followed by a zero byte.
The content-type `application/vnd.kubernetes.protobuf` is defined as
representing the following schema:
    MESSAGE = '0x6b 0x38 0x73 0x00' UNKNOWN
    UNKNOWN = 
A client should check for the first four bytes, then perform a protobuf
deserialization of the remaining bytes into the `runtime.Unknown` type.
## Streaming wire format
While the majority of Kubernetes APIs return single objects that can vary
in type (Pod vs Status, PodList vs Status), the watch APIs return a stream
of identical objects (Events). At the time of this writing, this is the only
current or anticipated streaming RESTful protocol (logging, port-forwarding,
and exec protocols use a binary protocol over Websockets or SPDY).
In JSON, this API is implemented as a stream of JSON objects that are
separated by their syntax (the closing `}` brace is followed by whitespace
and the opening `{` brace starts the next object). There is no formal
specification covering this pattern, nor a unique content-type. Each object
is expected to be of type `watch.Event`, and is currently not self describing.
For expediency and consistency, we define a format for Protobuf watch Events
that is similar. Since protobuf messages are not self describing, we must
identify the boundaries between Events (a `frame`). We do that by prefixing
each frame of N bytes with a 4-byte, big-endian, unsigned integer with the
value N.
    frame  = length body
    length = 32-bit unsigned integer in big-endian order, denoting length of
             bytes of body
    body = 
    # frame containing a single byte 0a
    frame = 01 00 00 00 0a
    # equivalent JSON
    frame = {"type": "added", ...}
The body of each frame is a serialized Protobuf message `Event` in package
`k8s.io/kubernetes/pkg/watch/versioned`. The content type used for this
format is `application/vnd.kubernetes.protobuf;type=watch`.
## Negotiation
To allow clients to request protobuf serialization optionally, the `Accept`
HTTP header is used by callers to indicate which serialization they wish
returned in the response, and the `Content-Type` header is used to tell the
server how to decode the bytes sent in the request (for DELETE/POST/PUT/PATCH
requests). The server will return 406 if the `Accept` header is not
recognized or 415 if the `Content-Type` is not recognized (as defined in
RFC2616).
To be backwards compatible, clients must consider that the server does not
support protobuf serialization. A number of options are possible:
### Preconfigured
Clients can have a configuration setting that instructs them which version
to use. This is the simplest option, but requires intervention when the
component upgrades to protobuf.
### Include serialization information in api-discovery
Servers can define the list of content types they accept and return in
their API discovery docs, and clients can use protobuf if they support it.
Allows dynamic configuration during upgrade if the client is already using
API-discovery.
### Optimistically attempt to send and receive requests using protobuf
Using multiple `Accept` values:
    Accept: application/vnd.kubernetes.protobuf, application/json
clients can indicate their preferences and handle the returned
`Content-Type` using whatever the server responds. On update operations,
clients can try protobuf and if they receive a 415 error, record that and
fall back to JSON. Allows the client to be backwards compatible with
any server, but comes at the cost of some implementation complexity.
## Generation process
Generation proceeds in five phases:
1. Generate a gogo-protobuf annotated IDL from the source Go struct.
2. Generate temporary Go structs from the IDL using gogo-protobuf.
3. Generate marshaller/unmarshallers based on the IDL using gogo-protobuf.
4. Take all tag numbers generated for the IDL and apply them as struct tags
   to the original Go types.
5. Generate a final IDL without gogo-protobuf annotations as the canonical IDL.
The output is a `generated.proto` file in each package containing a standard
proto2 IDL, and a `generated.pb.go` file in each package that contains the
generated marshal/unmarshallers.
The Go struct generated by gogo-protobuf from the first IDL must be identical
to the origin struct - a number of changes have been made to gogo-protobuf
to ensure exact 1-1 conversion. A small number of additions may be necessary
in the future if we introduce more exotic field types (Go type aliases, maps
with aliased Go types, and embedded fields were fixed). If they are identical,
the output marshallers/unmarshallers can then work on the origin struct.
Whenever a new field is added, generation will assign that field a unique tag
and the 4th phase will write that tag back to the origin Go struct as a `protobuf`
struct tag. This ensures subsequent generation passes are stable, even in the
face of internal refactors. The first time a field is added, the author will
need to check in both the new IDL AND the protobuf struct tag changes.
The second IDL is generated without gogo-protobuf annotations to allow clients
in other languages to generate easily.
Any errors in the generation process are considered fatal and must be resolved
early (being unable to identify a field type for conversion, duplicate fields,
duplicate tags, protoc errors, etc). The conversion fuzzer is used to ensure
that a Go struct can be round-tripped to protobuf and back, as we do for JSON
and conversion testing.
## Changes to development process
All existing API change rules would still apply. New fields added would be
automatically assigned a tag by the generation process. New API versions will
have a new proto IDL, and field name and changes across API versions would be
handled using our existing API change rules. Tags cannot change within an
API version.
Generation would be done by developers and then checked into source control,
like conversions and ugorji JSON codecs.
Because protoc is not packaged well across all platforms, we will add it to
the `kube-cross` Docker image and developers can use that to generate
updated protobufs. Protobuf 3 beta is required.
The generated protobuf will be checked with a verify script before merging.
## Implications
* The generated marshal code is large and will increase build times and binary
  size. We may be able to remove ugorji after protobuf is added, since the
  bulk of our decoding would switch to protobuf.
* The protobuf schema is naive, which means it may not be as a minimal as
  possible.
* Debugging of protobuf related errors is harder due to the binary nature of
  the format.
* Migrating API object storage from JSON to protobuf will require that all
  API servers are upgraded before beginning to write protobuf to disk, since
  old servers won't recognize protobuf.
* Transport of protobuf between etcd and the api server will be less efficient
  in etcd2 than etcd3 (since etcd2 must encode binary values returned as JSON).
  Should still be smaller than current JSON request.
* Third-party API objects must be stored as JSON inside of a protobuf wrapper
  in etcd, and the API endpoints will not benefit from clients that speak
  protobuf. Clients will have to deal with some API objects not supporting
  protobuf.
## Open Questions
* Is supporting stored protobuf files on disk in the kubectl client worth it?
[]()