<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.3/docs/design/indexed-job.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Design: Indexed Feature of Job object

## Summary

This design extends Kubernetes with user-friendly support for
running embarrassingly parallel jobs.

Here, *parallel* means on multiple nodes, which means multiple pods.
By *embarrassingly parallel*, we mean that the pods
have no dependencies on each other. In particular, neither
ordering between pods nor gang scheduling is supported.

Users already have two other options for running embarrassingly parallel
Jobs (described in the next section), but both have ease-of-use issues.

Therefore, this document proposes extending the Job resource type to support
a third way to run embarrassingly parallel programs, with a focus on
ease of use.

This new style of Job is called an *indexed job*, because each Pod of the Job
is specialized to work on a particular *index* from a fixed-length array of work
items.

## Background

The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports
the embarrassingly parallel use case through *workqueue jobs*.
While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very
flexible, they can be difficult to use. They: (1) typically require running a
message queue or other database service, (2) typically require modifications
to existing binaries and images, and (3) make subtle race conditions easy to
overlook.

Users also have another option for parallel jobs: creating [multiple Job objects
from a template](../../docs/user-guide/jobs.md#job-patterns). For small numbers of
Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job
objects at once. But that approach also has its drawbacks: (1) for large levels
of parallelism (hundreds or thousands of pods) it means that listing
all jobs presents too much information, and (2) users want a single source of
information about the success or failure of what they view as a single
logical process.

Indexed job provides a third option with better ease-of-use for common
use cases.

## Requirements

### User Requirements

- Users want an easy way to run a Pod to completion *for each* item within a
[work list](#example-use-cases).

- Users want to run these pods in parallel for speed, but to vary the level of
parallelism as needed, independent of the number of work items.

- Users want to do this without requiring changes to existing images,
or to source-to-image pipelines.

- Users want a single object that encompasses the lifetime of the parallel
program. Deleting it should delete all dependent objects. It should report the
status of the overall process. Users should be able to wait for it to complete,
and should be able to refer to it from other resource types, such as
[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980).

### Example Use Cases

Here are several examples of *work lists*: lists of command lines that the user
wants to run, each line as its own Pod. Note that in practice a work list may
never be written out in this form, but it exists in the mind of the Job creator,
and it is a useful way to talk about the intent of the user when discussing
alternatives for specifying indexed Jobs.

We will not require the user to express their requirements in work list form;
it is just a format for presenting use cases. Subsequent discussion will
reference these work lists.

#### Work List 1

Process several files with the same program:

```
/usr/local/bin/process_file 12342.dat
/usr/local/bin/process_file 97283.dat
/usr/local/bin/process_file 38732.dat
```

#### Work List 2

Process a matrix (or image, etc.) in rectangular blocks:

```
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 -end_col 15
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 -end_col 15
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 -end_col 31
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 -end_col 31
```

#### Work List 3

Build a program at several different git commits:

```
HASH=3cab5cb4a; git checkout $HASH && make clean && make VERSION=$HASH
HASH=fe97ef90b; git checkout $HASH && make clean && make VERSION=$HASH
HASH=a8b5e34c5; git checkout $HASH && make clean && make VERSION=$HASH
```

#### Work List 4

Render several frames of a movie:

```
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 2
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 3
```

#### Work List 5

Render several blocks of frames (rendering in blocks avoids Pod startup
overhead for every frame):

```
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 101 --frame-end 200
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 201 --frame-end 300
```

## Design Discussion

### Converting Work Lists into Indexed Jobs

Given a work list, like those in the [example use cases](#example-use-cases),
the information from the work list needs to get into each Pod of the Job.

Users will typically not want to create a new image for each job they
run. They will want to use existing images. So, the image is not the place
for the work list.

A work list can be stored on networked storage and mounted by pods of the job.
Also, as a shortcut, for small work lists, it can be included in an annotation on
the Job object, which is then exposed as a volume in the pod via the downward
API.

### What Varies Between Pods of a Job

Pods need to differ in some way to do something different. (They do not differ
in the work-queue style of Job, but that style has ease-of-use issues.)

A general approach would be to allow pods to differ from each other in arbitrary
ways. For example, the Job object could have a list of PodSpecs to run.
However, this is so general that it provides little value. It would mean that:

- the Job Spec becomes very verbose, especially for jobs with thousands of work
items
- Job becomes such a vague concept that it is hard to explain to users
- in practice, we do not see cases where many pods differ across many
fields of their specs and need to run as a group with no ordering constraints
- CLIs and UIs need to support more options for creating a Job
- monitoring and accounting databases want to aggregate data
for pods with the same controller, but pods with very different Specs may
not make sense to aggregate
- profiling, debugging, accounting, auditing and monitoring tools cannot assume
common images/files, behaviors, provenance and so on between Pods of a Job

Also, variety has another cost. Pods which differ in ways that affect scheduling
(node constraints, resource requirements, labels) prevent the scheduler from
treating them as fungible, which is an important optimization for the scheduler.

Therefore, we will not allow Pods from the same Job to differ arbitrarily
(anyway, users can use multiple Job objects for that case). We will try to
allow as little as possible to differ between pods of the same Job, while still
allowing users to express common parallel patterns easily. Users who need to
run jobs which differ in other ways can create multiple Jobs and manage
them as a group using labels.

From the above work lists, we see a need for Pods which differ in their command
lines and in their environment variables. These work lists do not require the
pods to differ in other ways.

Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf)
has shown this model to be applicable to a very broad range of problems, despite
this restriction.

Therefore, we allow pods in the same Job to differ **only** in the following
aspects:

- command line
- environment variables

### Composition of existing images

The docker image that is used in a job may not be maintained by the person
running the job. Over time, the Dockerfile may change the ENTRYPOINT or CMD.
If we require people to specify the complete command line to use an indexed Job,
then they will not automatically pick up changes to the default
command or args.

This needs more thought.

### Running Ad-Hoc Jobs using kubectl

A user should be able to easily start an Indexed Job using `kubectl`. For
example, to run [work list 1](#work-list-1), a user should be able to type
something simple like:

```
kubectl run process-files --image=myfileprocessor \
   --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
   --restart=OnFailure  \
   -- \
   /usr/local/bin/process_file '$F'
```

In the above example:

- `--restart=OnFailure` implies creating a job instead of a replicationController.
- Each pod's command line is `/usr/local/bin/process_file $F`.
- `--per-completion-env=` implies the job's `.spec.completions` is set to the
length of the argument array (3 in the example).
- `--per-completion-env=F=<values>` causes an env var named `F` to be available in
the environment when the command line is evaluated.

How exactly this happens is discussed later in the doc: this is a sketch of the
user experience.

In practice, the list of files might be much longer and stored in a file on the
user's local host, like:

```
$ cat files-to-process.txt
12342.dat
97283.dat
38732.dat
...
```

So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`.

However, `kubectl` should also support a format like:
`--per-completion-env=F=@files-to-process.txt`.
That allows `kubectl` to parse the file, point out any syntax errors, and would
not run up against command-line length limits (2MB is common; as low as 4kB is
POSIX-compliant).

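As an illustration of the `@file` form, here is a minimal sketch (plain shell; the file name comes from the example above, and the behavior attributed to kubectl is an assumption of this sketch) of how the completion count could be derived from such a file rather than from the command line:

```sh
# Build the example work-list file (one item per line).
printf '12342.dat\n97283.dat\n38732.dat\n' > files-to-process.txt

# Reading the file itself means the item count (and thus
# job.spec.completions) comes from the line count, not from
# a possibly length-limited command line.
COMPLETIONS=$(wc -l < files-to-process.txt)
echo "completions: $COMPLETIONS"
```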
One case we do not try to handle is where the file of work is stored on a cloud
filesystem and is not accessible from the user's local host. Then we cannot easily
use an indexed job, because we do not know the number of completions. The user
needs to copy the file locally first, or use the work-queue style of Job (already
supported).

Another case we do not try to handle is where the input file does not exist yet
because this Job is to be run at a future time, or depends on another job. The
workflow and scheduled-job proposals need to consider this case. For that case,
you could use an indexed job which runs a program which shards the input file
(map-reduce style).

#### Multiple parameters

The user may also have multiple parameters, like in [work list 2](#work-list-2).
One way is to just list all the command lines, already expanded, one per line in
a file, like this:

```
$ cat matrix-commandlines.txt
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 -end_col 15
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 -end_col 15
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 -end_col 31
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 -end_col 31
```

and run the Job like this:

```
kubectl run process-matrix --image=my/matrix \
   --per-completion-env=COMMAND_LINE=@matrix-commandlines.txt \
   --restart=OnFailure  \
   -- \
   'eval "$COMMAND_LINE"'
```

However, this may have some subtleties with shell escaping. Also, it depends on
the user knowing all the correct arguments to the docker image being used (more
on this later).

Instead, kubectl should support multiple instances of the `--per-completion-env`
flag. For example, to implement work list 2, a user could do:

```
kubectl run process-matrix --image=my/matrix \
   --per-completion-env=SR="0 16 0 16" \
   --per-completion-env=ER="15 31 15 31" \
   --per-completion-env=SC="0 0 16 16" \
   --per-completion-env=EC="15 15 31 31" \
   --restart=OnFailure  \
   -- \
   /usr/local/bin/process_matrix_block -start_row $SR -end_row $ER -start_col $SC -end_col $EC
```

### Composition With Workflows and ScheduledJob

A user should be able to create a job (indexed or not) which runs at a specific
time or times. For example:

```
$ kubectl run process-files --image=myfileprocessor \
   --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
   --restart=OnFailure  \
   --runAt=2015-07-21T14:00:00Z \
   -- \
   /usr/local/bin/process_file '$F'
created "scheduledJob/process-files-37dt3"
```

Kubectl should build the same JobSpec, then put it into a ScheduledJob
(#11980) and create that.

For [workflow-type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a
complete workflow from a single command line would be messy, because of the need
to specify all the arguments multiple times.

For that use case, the user could create a workflow message by hand. Or the user
could create job templates, and then make a workflow from the templates,
perhaps like this:

```
$ kubectl run process-files --image=myfileprocessor \
   --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
   --restart=OnFailure  \
   --asTemplate \
   -- \
   /usr/local/bin/process_file '$F'
created "jobTemplate/process-files"
$ kubectl run merge-files --image=mymerger \
   --restart=OnFailure  \
   --asTemplate \
   -- \
   /usr/local/bin/mergefiles 12342.out 97283.out 38732.out
created "jobTemplate/merge-files"
$ kubectl create-workflow process-and-merge \
   --job=jobTemplate/process-files \
   --job=jobTemplate/merge-files \
   --dependency=process-files:merge-files
created "workflow/process-and-merge"
```

### Completion Indexes

A JobSpec specifies the number of times a pod needs to complete successfully,
through the `job.Spec.Completions` field. The number of completions will be
equal to the number of work items in the work list.

Each pod that the job controller creates is intended to complete one work item
from the work list. Since a pod may fail, several pods may, serially, attempt to
complete the same index. Therefore, we call it a *completion index* (or just
*index*), not a *pod index*.

For each completion index, in the range 0 to `.job.Spec.Completions - 1`, the job
controller will create a pod with that index, and will keep creating pods on
failure, until each index is completed.

A dense integer index, rather than a sparse string index (e.g. using just
`metadata.generate-name`), makes it easy to use the index to look up parameters
in, for example, an array in shared storage.

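As a sketch of that lookup, assuming (hypothetically) that the work list is a plain text file on shared storage with one item per line, and that the 0-based index has already been placed in `$INDEX`:

```sh
# Simulate the pod's view: a work list on shared storage, one item per line.
WORKLIST=$(mktemp)
printf '12342.dat\n97283.dat\n38732.dat\n' > "$WORKLIST"

# $INDEX would come from the completion-index annotation; hard-coded here.
INDEX=1

# With a dense 0-based integer index, the lookup is just "line INDEX+1".
ITEM=$(sed -n "$((INDEX + 1))p" "$WORKLIST")
echo "processing $ITEM"
```

A sparse string index would not allow this kind of direct positional lookup.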
### Pod Identity and Template Substitution in Job Controller

The JobSpec contains a single pod template. When the job controller creates a
particular pod, it copies the pod template and modifies it in some way to make
that pod distinctive. Whatever is distinctive about that pod is its *identity*.

We consider several options.

#### Index Substitution Only

The job controller substitutes only the *completion index* of the pod into the
pod template when creating it. The JSON it POSTs differs only in a single
field.

We would put the completion index, as a stringified integer, into an annotation
of the pod. The user can extract it from the annotation into an env var via the
downward API, or put it in a file via a downward API volume and parse it
themselves.

Once it is an environment variable in the pod (say `$INDEX`), then one of two
things can happen.

First, the main program can know how to map from an integer index to what it
needs to do. For example, from Work List 4 above:

```
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX
```

Second, a shell script can be prepended to the original command line which maps
the index to one or more string parameters. For example, to implement Work List
5 above, you could do:

```
. /vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME
```

In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX`
and exports `$START_FRAME` and `$END_FRAME` (it is sourced so that the exported
variables are visible to the rest of the command line).

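For Work List 5, such a `setupenv.sh` could be as small as the following sketch. The 100-frames-per-block size comes from the work list; the assumption that `$INDEX` is 0-based and already set is this sketch's, not the proposal's:

```sh
# Map the completion index to a block of 100 frames.
INDEX=2   # would come from the downward API in a real pod

START_FRAME=$((INDEX * 100 + 1))
END_FRAME=$(((INDEX + 1) * 100))
export START_FRAME END_FRAME

# For INDEX=2 this selects the third block of Work List 5 (frames 201-300).
echo "frames $START_FRAME to $END_FRAME"
```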
The shell script could be part of the image but, more usefully, it could be
generated by a program and stuffed in an annotation or a configMap, and from
there added to a volume.

The first approach may require the user to modify an existing image (see the
next section) to be able to accept an `$INDEX` env var or argument. The second
approach requires that the image have a shell. We think that together these two
options cover a wide range of use cases (though not all).

#### Multiple Substitution

In this option, the JobSpec is extended to include a list of values to
substitute, and which fields to substitute them into. For example, a work list
like this:

```
FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds
FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt
FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit
```

can be broken down into a template like this, with three parameters:

```
<custom env var 1>; process-fruit -a -b -c <custom arg 1> <custom arg 2>
```

and a list of parameter tuples, like this:

```
("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds")
("FRUIT_COLOR=yellow", "-f banana.txt", "")
("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit")
```

The JobSpec can be extended to hold a list of parameter tuples (which are more
easily expressed as a list of lists of individual parameters). For example:

```
apiVersion: extensions/v1beta1
kind: Job
...
spec:
  completions: 3
  ...
  template:
    ...
  perCompletionArgs:
    container: 0
      -
        - "-f apple.txt"
        - "-f banana.txt"
        - "-f cherry.txt"
      -
        - "--remove-seeds"
        - ""
        - "--remove-pit"
  perCompletionEnvVars:
    - name: "FRUIT_COLOR"
      - "green"
      - "yellow"
      - "red"
```

However, just providing custom env vars, and not arguments, is sufficient for
many use cases: parameters can be put into env vars, and then substituted on the
command line.

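A small illustration of that last point: an env var set per completion can carry what would otherwise be a per-completion argument, because the shell expands it when the command line is evaluated. The names here are borrowed from the fruit example above; the `FRUIT_ARG` variable is this sketch's invention:

```sh
# A controller would set this differently per completion index;
# here we hard-code the index-1 value.
export FRUIT_ARG="-f banana.txt"

# An unquoted expansion turns the env var back into command arguments.
sh -c 'echo process-fruit -a -b -c $FRUIT_ARG'
```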
#### Comparison

The multiple-substitution approach:

- keeps the *per-completion parameters* in the JobSpec.
- Drawback: makes the job spec large for jobs with thousands of completions.
(But for very large jobs, the work-queue style or another type of controller,
such as map-reduce or spark, may be a better fit.)
- Drawback: is a form of server-side templating, which we want in Kubernetes but
have not fully designed (see the [PetSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).

The index-only approach:

- Requires that the user keep the *per-completion parameters* in separate
storage, such as a configData or networked storage.
- Makes no changes to the JobSpec.
- Drawback: while in separate storage, the parameters could be mutated, which
would have unexpected effects.
- Drawback: logic for using the index to look up parameters needs to be in the
Pod.
- Drawback: CLIs and UIs are limited to using the "index" as the identity of a
pod from a job. They cannot easily say, for example, `repeated failures on the
pod processing banana.txt`.

The index-only approach relies on at least one of the following being true:

1. The image contains a shell and certain shell commands (not all images have
this).
1. The user directly consumes the index from annotations (file or env var) and
maps it to specific behavior in the main program.

Also, using the index-only approach from non-kubectl clients requires that they
mimic the script-generation step, or only use the second style.

#### Decision

It is decided to implement the index-only approach now. Once the server-side
templating design is complete for Kubernetes, and we have feedback from users,
we can consider adding multiple substitution.

## Detailed Design

#### Job Resource Schema Changes

No changes are made to the JobSpec.

The JobStatus is also not changed. The user can gauge the progress of the job by
the `.status.succeeded` count.

#### Job Spec Compatibility

A job spec written before this change will work exactly the same as before with
the new controller. The Pods it creates will have the same environment as
before. They will have a new annotation, but pods are expected to tolerate
unfamiliar annotations.

However, if the job controller version is reverted to a version before this
change, jobs whose pod specs depend on the new annotation will fail.
This is okay for a Beta resource.

#### Job Controller Changes

The Job controller will maintain, for each Job, a data structure which
indicates the status of each completion index. We call this the
*scoreboard* for short. It is an array of length `.spec.completions`.
Elements of the array are of an `enum` type with possible values including
`complete`, `running`, and `notStarted`.

The scoreboard is stored in Job controller memory for efficiency. It can be
reconstructed from watching the pods of the job (such as on a
controller-manager restart); the index of a pod can be extracted from the
pod annotation.

When the Job controller sees that the number of running pods is less than the
desired parallelism of the job, it finds the first index in the scoreboard with
value `notStarted` and creates a pod with that completion index.

When it creates a pod with completion index `i`, it makes a copy of the
`.spec.template`, and sets
`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to
`i`. It does this in both the index-only and multiple-substitution options.

Then it creates the pod.

When the controller notices that a pod is running, has completed, or has
failed, it updates the scoreboard.

When all entries in the scoreboard are `complete`, the job is complete.

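The completion rule can be pictured with a toy scoreboard of one character per index (`n` = notStarted, `r` = running, `c` = complete). This is only an illustration of the rule, not the controller's actual representation:

```sh
SCOREBOARD="crc"   # three completions; index 1 still running

# The job is complete only when no index is notStarted or running.
case "$SCOREBOARD" in
  *[nr]*) echo "job in progress" ;;
  *)      echo "job complete" ;;
esac
```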
#### Downward API Changes

The downward API is changed to support extracting a specific annotation key
into a single environment variable. So, the following would be supported:

```
kind: Pod
version: v1
spec:
  containers:
  - name: foo
    env:
    - name: MY_INDEX
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
```

This requires kubelet changes.

Users who fail to upgrade their kubelets at the same time as they upgrade their
controller manager will see pods fail to run when they are created by the
controller. The kubelet will send an event about the failure to create the pod,
and `kubectl describe job` will show many failed pods.

#### Kubectl Interface Changes

The `--completions` and `--completion-index-var-name` flags are added to
kubectl.

For example, this command:

```
kubectl run say-number --image=busybox \
   --completions=3 \
   --completion-index-var-name=I \
   -- \
   sh -c 'echo "My index is $I" && sleep 5'
```

will run 3 pods to completion, each printing one of the following lines:

```
My index is 1
My index is 2
My index is 0
```

Kubectl would create the following pod:

Kubectl will also support the `--per-completion-env` flag, as described
previously. For example, this command:

```
kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT="apple banana cherry" \
   --per-completion-env=COLOR="green yellow red" \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```

or equivalently:

```
echo "apple banana cherry" > fruits.txt
echo "green yellow red" > colors.txt

kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT="$(cat fruits.txt)" \
   --per-completion-env=COLOR="$(cat colors.txt)" \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```

or similarly:

```
kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT=@fruits.txt \
   --per-completion-env=COLOR=@colors.txt \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```

will all run 3 pods in parallel. The index-0 pod will log:

```
Have a nice green apple
```

and so on.

Notes:

- `--per-completion-env=` is of the form `KEY=VALUES`, where `VALUES` is either
a quoted, space-separated list, or `@` followed by the name of a text file
containing a list.
- `--per-completion-env=` can be specified several times, but all occurrences
must have the same list length.
- `--completions=N`, with `N` equal to the list length, is implied.
- The flag `--completions=3` sets `job.spec.completions=3`.
- The flag `--completion-index-var-name=I` causes an env var named `I` to be
created in each pod, holding that pod's index.
- The flag `--restart=OnFailure` is implied by `--completions` or any
job-specific arguments. The user can also specify `--restart=Never` if they
desire, but may not specify `--restart=Always` with job-related flags.
- Setting any of these flags in turn tells kubectl to create a Job, not a
replicationController.

| #### How Kubectl Creates Job Specs.
 | |
| 
 | |
| To pass in the parameters, kubectl will generate a shell script which
 | |
| can:
 | |
| - parse the index from the annotation
 | |
| - hold all the parameter lists.
 | |
| - lookup the correct index in each parameter list and set an env var.
 | |
| 
 | |
| For example, consider this command:
 | |
| 
 | |
| ```
 | |
| kubectl run say-fruit --image=busybox \
 | |
|    --per-completion-env=FRUIT="apple banana cherry" \
 | |
|    --per-completion-env=COLOR="green yellow red" \
 | |
|    -- \
 | |
|    sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5' 
 | |
| ```
 | |

First, kubectl generates the PodSpec as it normally does for `kubectl run`.

But then it will generate this script:

```sh
#!/bin/sh
# Generated by kubectl run ...
# Check for needed commands
if ! type cat > /dev/null 2>&1
then
  echo "$0: Image does not include required command: cat"
  exit 2
fi
if ! type grep > /dev/null 2>&1
then
  echo "$0: Image does not include required command: grep"
  exit 2
fi
# Check that annotations are mounted from downward API
if [ ! -e /etc/annotations ]
then
  echo "$0: Cannot find /etc/annotations"
  exit 2
fi
# Get our index from annotations file
I=$(grep job.kubernetes.io/index /etc/annotations | cut -f 2 -d '"') || echo "$0: failed to extract index"
export I

# Our parameter lists are stored inline in this script.
FRUIT_0="apple"
FRUIT_1="banana"
FRUIT_2="cherry"
# Extract the right parameter value based on our index.
# This works on any Bourne-based shell.
FRUIT=$(eval echo \$"FRUIT_$I")
export FRUIT

COLOR_0="green"
COLOR_1="yellow"
COLOR_2="red"

COLOR=$(eval echo \$"COLOR_$I")
export COLOR
```
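
The `eval`-based indirection used in the script can be exercised standalone in any Bourne-compatible shell:

```shell
# Standalone demo of the POSIX-sh variable indirection from the script above.
FRUIT_0="apple"
FRUIT_1="banana"
FRUIT_2="cherry"
I=1
FRUIT=$(eval echo \$"FRUIT_$I")   # eval sees: echo $FRUIT_1
echo "$FRUIT"                     # prints: banana
```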

Then it POSTs this script, encoded, inside a ConfigData, and attaches that
volume to the PodSpec.

Then it edits the command line of the Pod to run this script before the rest of
the command line.

Then it appends a DownwardAPI volume to the pod spec to expose the annotations
in a file, and appends the Secret (later configData) volume holding the script.

So, the Pod template that kubectl creates (inside the job template) looks like this:

```
apiVersion: v1
kind: Job
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: c
          image: gcr.io/google_containers/busybox
          command:
            - 'sh'
            - '-c'
            - '/etc/job-params.sh; echo "this is the rest of the command"'
          volumeMounts:
            - name: annotations
              mountPath: /etc
            - name: script
              mountPath: /etc
      volumes:
        - name: annotations
          downwardAPI:
            items:
              - path: "annotations"
                fieldRef:
                  fieldPath: metadata.annotations
        - name: script
          secret:
            secretName: jobparams-abc123
```

###### Alternatives

Kubectl could append a `valueFrom` line like this to
get the index into the environment:

```yaml
apiVersion: extensions/v1beta1
kind: Job
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: foo
        ...
        env:
          # following block added:
          - name: I
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations."kubernetes.io/job-idx"
```

However, in order to inject the other env vars from the parameter lists,
kubectl still needs to edit the command line.

Parameter lists could be passed via a configData volume instead of a secret.
Kubectl can be changed to work that way once the configData implementation is
complete.

Parameter lists could be passed inside an EnvVar.  This would have length
limitations and would pollute the output of `kubectl describe pods` and `kubectl
get pods -o json`.

Parameter lists could be passed inside an annotation.  This would have the same
length and pollution problems.  Also, currently annotations can only be
extracted into a single file.  Complex logic is then needed to filter out
exactly the desired annotation data.

Bash array variables could simplify extraction of a particular parameter from a
list of parameters.  However, some popular base images do not include
`/bin/bash`.  For example, `busybox` uses a compact `/bin/sh` implementation
that does not support array syntax.
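
For contrast, a bash-only sketch of the array form that the proposal avoids (it fails under busybox's `/bin/sh`):

```shell
#!/bin/bash
# Bash-only: arrays make indexed lookup direct, but require /bin/bash in the image.
FRUITS=(apple banana cherry)
I=2
echo "${FRUITS[$I]}"   # prints: cherry
```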

Kubelet does support [expanding variables without a
shell](http://kubernetes.io/kubernetes/v1.1/docs/design/expansion.html).  But it does not
allow for recursive substitution, which is required to extract the correct
parameter from a list based on the completion index of the pod.  The syntax
could be extended, but doing so seems complex and would be an unfamiliar syntax
for users.

Putting all the command-line editing into a script and running that causes
the least pollution to the original command line, and it allows
for complex error handling.

Kubectl could store the script in an [Inline Volume](
https://github.com/kubernetes/kubernetes/issues/13610) if that proposal
is approved. That would remove the need to manage the lifetime of the
configData/secret, and prevent the case where someone changes the
configData mid-job and breaks things in a hard-to-debug way.


## Interactions with other features

#### Supporting Work Queue Jobs too

For Work Queue Jobs, `completions` has no meaning: parallelism may be greater
than it, and pods have no identity. So the job controller should not create a
scoreboard in the JobStatus, just a count.  Therefore, we need to add one of the
following to JobSpec:

- allow unset `.spec.completions` to indicate no scoreboard, and no index for
tasks (identical tasks).
- allow `.spec.completions=-1` to indicate the same.
- add `.spec.indexed` to the job to indicate the need for a scoreboard.

#### Interaction with vertical autoscaling

Since pods of the same job will not be created with different resources,
a vertical autoscaler will need to:

- if it has index-specific initial resource suggestions, suggest those at
admission time; it will need to understand indexes.
- mutate resource requests on already-created pods based on usage trends or
previous container failures.
- modify the job template, affecting all indexes.

#### Comparison to PetSets

The *Index substitution-only* option corresponds roughly to PetSet Proposal 1b.
The `perCompletionArgs` approach is similar to PetSet Proposal 1e, but more
restrictive and thus less verbose.

It would be easier for users if Indexed Job and PetSet were similar where
possible. However, PetSet differs in several key respects:

- PetSet is for ones to tens of instances.  Indexed Job should work with tens of
thousands of instances.
- When you have few instances, you may want to give them pet names. When you
have many instances, integer indexes make more sense.
- When you have thousands of instances, storing the work-list in the JobSpec
is verbose.  For PetSet, this is less of a problem.
- PetSets (apparently) need to differ in more fields than indexed Jobs.

This differs from PetSet in that PetSet uses names and not indexes. PetSet is
intended to support ones to tens of things.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->