# Spark example

Following this example, you will create a functional [Apache
Spark](http://spark.apache.org/) cluster using Kubernetes and
[Docker](http://docker.io).

You will set up a Spark master service and a set of Spark workers using Spark's
[standalone mode](http://spark.apache.org/docs/latest/spark-standalone.html).

For the impatient expert, jump straight to the [tl;dr](#tldr)
section.

### Sources

The Docker images are heavily based on https://github.com/mattf/docker-spark

## Step Zero: Prerequisites

This example assumes you have a Kubernetes cluster installed and
running, and that you have installed the `kubectl` command line
tool somewhere in your path. Please see the [getting started
guides](../../docs/getting-started-guides/) for installation
instructions for your platform.

## Step One: Start your Master service

The Master [service](../../docs/user-guide/services.md) is the controlling
service for a Spark cluster: workers register with it, and jobs are submitted
through it.

Use the
[`examples/spark/spark-master-controller.yaml`](spark-master-controller.yaml)
file to create a
[replication controller](../../docs/user-guide/replication-controller.md)
running the Spark Master service.

```console
$ kubectl create -f examples/spark/spark-master-controller.yaml
replicationcontrollers/spark-master-controller
```
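
For orientation, here is a minimal sketch of what such a replication
controller looks like. The image name is a placeholder assumption (the file in
this repository is authoritative); the ports match what the master logs below
report (7077 for the cluster, 8080 for the WebUI):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-master-controller
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: example/spark-master:1.5.1  # placeholder; see the actual file for the real image
          ports:
            - containerPort: 7077  # Spark cluster port
            - containerPort: 8080  # Spark master WebUI
```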

Then, use the
[`examples/spark/spark-master-service.yaml`](spark-master-service.yaml) file to
create a logical service endpoint that Spark workers can use to access the
Master pod.

```console
$ kubectl create -f examples/spark/spark-master-service.yaml
services/spark-master
```
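
This is a plain cluster-IP service that gives the master pod a stable name. A
minimal sketch, assuming the master pods carry a `component: spark-master`
label as in the controller sketch above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: spark-master
spec:
  ports:
    - port: 7077
      targetPort: 7077
  selector:
    component: spark-master  # routes to the master pod created above
```

With this in place, workers and drivers can reach the master at
`spark://spark-master:7077`, which is exactly the address the master logs
below advertise.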

You can then create a service for the Spark Master WebUI:

```console
$ kubectl create -f examples/spark/spark-webui.yaml
services/spark-webui
```
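
The WebUI service is the same idea applied to the UI port; a sketch under the
same labeling assumption:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: spark-webui
spec:
  ports:
    - port: 8080
      targetPort: 8080
  selector:
    component: spark-master  # the WebUI is served by the master pod itself
```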

### Check to see if Master is running and accessible

```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          8m
```

Check the logs to see the status of the master. (Use the pod name from the
previous output.)

```console
$ kubectl logs spark-master-controller-5u0q5
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.5.1-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master-controller-g0oao.out
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/10/27 21:25:05 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/27 21:25:05 INFO SecurityManager: Changing view acls to: root
15/10/27 21:25:05 INFO SecurityManager: Changing modify acls to: root
15/10/27 21:25:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/27 21:25:06 INFO Slf4jLogger: Slf4jLogger started
15/10/27 21:25:06 INFO Remoting: Starting remoting
15/10/27 21:25:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/10/27 21:25:06 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077
15/10/27 21:25:07 INFO Master: Running Spark version 1.5.1
15/10/27 21:25:07 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/10/27 21:25:07 INFO MasterWebUI: Started MasterWebUI at http://spark-master:8080
15/10/27 21:25:07 INFO Utils: Successfully started service on port 6066.
15/10/27 21:25:07 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```

After you know the master is running, you can use the [cluster
proxy](../../docs/user-guide/accessing-the-cluster.md#using-kubectl-proxy) to
connect to the Spark WebUI:

```console
kubectl proxy --port=8001
```

At which point the UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).

## Step Two: Start your Spark workers

The Spark workers do the heavy lifting in a Spark cluster. They
provide execution resources and data cache capabilities for your
program.

The Spark workers need the Master service to be running.

Use the [`examples/spark/spark-worker-controller.yaml`](spark-worker-controller.yaml) file to create a
[replication controller](../../docs/user-guide/replication-controller.md) that manages the worker pods.

```console
$ kubectl create -f examples/spark/spark-worker-controller.yaml
```
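
Conceptually, the worker controller is the master controller with a higher
replica count and a worker image. A sketch with an assumed image name and
replica count (the output below happens to show three workers):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-worker-controller
spec:
  replicas: 3  # matches the three worker pods shown below
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: example/spark-worker:1.5.1  # placeholder; the worker is expected to connect to spark://spark-master:7077
          ports:
            - containerPort: 8081  # per-worker WebUI (Spark standalone's default port)
```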

### Check to see if the workers are running

If you launched the Spark WebUI, your workers should just appear in the UI when
they're ready. (It may take a little bit to pull the images and launch the
pods.) You can also interrogate the status in the following way:

```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          25m
spark-worker-controller-e8otp   1/1       Running   0          6m
spark-worker-controller-fiivl   1/1       Running   0          6m
spark-worker-controller-ytc7o   1/1       Running   0          6m

$ kubectl logs spark-master-controller-5u0q5
[...]
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM
```

Assuming you still have the `kubectl proxy` running from the previous section,
you should now see the workers in the UI as well. *Note:* The UI links to each
worker's own Web UI, but those links do not work (they point at cluster IPs,
which Kubernetes won't proxy automatically).

## Step Three: Start the Zeppelin UI to launch jobs on your Spark cluster

The Zeppelin UI pod can be used to launch jobs into the Spark cluster, either
via a web notebook frontend or via the traditional Spark command line. See
[Zeppelin](https://zeppelin.incubator.apache.org/) and
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html)
for more details.

Zeppelin needs the Master service to be running.

```console
$ kubectl create -f examples/spark/zeppelin-controller.yaml
replicationcontrollers/zeppelin-controller
```
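
A sketch of the shape of this controller, with a placeholder image; the
`component: zeppelin` label is what the `kubectl get pods -lcomponent=zeppelin`
command below selects on:

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: zeppelin-controller
spec:
  replicas: 1
  selector:
    component: zeppelin
  template:
    metadata:
      labels:
        component: zeppelin
    spec:
      containers:
        - name: zeppelin
          image: example/zeppelin  # placeholder; see the actual file for the real image
          ports:
            - containerPort: 8080  # Zeppelin web UI, the port used with port-forward below
```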

### Check to see if Zeppelin is running

```console
$ kubectl get pods -lcomponent=zeppelin
NAME                        READY     STATUS    RESTARTS   AGE
zeppelin-controller-ja09s   1/1       Running   0          53s
```

## Step Four: Do something with the cluster

Now you have two choices, depending on your predilections. You can do something
graphical with the Spark cluster, or you can stay in the CLI.

### Do something fast with pyspark!

Use `kubectl exec` to connect to the Zeppelin driver and run a pipeline:

```console
$ kubectl exec zeppelin-controller-ja09s -it pyspark
Python 2.7.9 (default, Mar  1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.9 (default, Mar  1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
939193
```

Congratulations, you just counted all of the words in all of the plays of
Shakespeare.

### Do something graphical and shiny!

Take the Zeppelin pod from above and port-forward the WebUI port:

```console
$ kubectl port-forward zeppelin-controller-ja09s 8080:8080
```

This forwards `localhost` port 8080 to container port 8080. You can then find
Zeppelin at [http://localhost:8080/](http://localhost:8080/).

Create a "New Notebook". In there, type:

```
%pyspark
print sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
```

## Result

You now have services and replication controllers for the Spark master, Spark
workers, and Spark driver. You can take this example to the next step and start
using the Apache Spark cluster you just created; see the
[Spark documentation](https://spark.apache.org/documentation.html) for more
information.

## tl;dr

```console
kubectl create -f examples/spark
```

After it's set up:

```console
kubectl get pods # Make sure everything is running
kubectl proxy --port=8001 # Start an application proxy, if you want to see the Spark Master WebUI
kubectl get pods -lcomponent=zeppelin # Get the driver pod to interact with.
```

At which point the Master UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).

You can either interact with the Spark cluster using the traditional
`spark-shell` / `spark-submit` / `pyspark` commands via `kubectl exec` against
the `zeppelin-controller` pod, or if you want to interact with Zeppelin:

```console
kubectl port-forward zeppelin-controller-abc123 8080:8080 &
```

Then visit [http://localhost:8080/](http://localhost:8080/).

## Known Issues With Spark

* This provides a Spark configuration that is restricted to the cluster network,
  meaning the Spark master is only available as a cluster service. If you need
  to submit jobs using an external client other than Zeppelin or `spark-submit`
  on the `zeppelin` pod, you will need to provide a way for your clients to
  reach the service defined by
  [`examples/spark/spark-master-service.yaml`](spark-master-service.yaml); one
  option is sketched below. See [Services](../../docs/user-guide/services.md)
  for more information.
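
For example, on a provider with load balancer support, one option is a second
service of `type: LoadBalancer` selecting the same master pod. This sketch is
not part of the example's manifests, and note that the master logs above show
authentication disabled, so think carefully before exposing the master
externally:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: spark-master-external  # hypothetical name, not a file in this example
spec:
  type: LoadBalancer  # or NodePort, if no cloud load balancer is available
  ports:
    - port: 7077
      targetPort: 7077
  selector:
    component: spark-master
```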

## Known Issues With Zeppelin

* The Zeppelin pod is large, so it may take a while to pull depending on your
  network. The size of the Zeppelin pod is something we're working on; see
  issue #17231.

* Zeppelin may take some time (about a minute) to run this pipeline the first
  time you use it; it seems to spend most of that time loading.

* On GKE, `kubectl port-forward` may not be stable over long periods of time. If
  you see Zeppelin go into `Disconnected` state (there will be a red dot on the
  top right as well), the `port-forward` probably failed and needs to be
  restarted. See #12179.