Mirror of https://github.com/k3s-io/kubernetes.git, synced 2025-09-14 05:36:12 +00:00
Spark: Update to current example standards, add GCS connector

* Pod -> ReplicationController, which also forced me to hack around a hostname issue on the master. (The Spark master sees the incoming slave request addressed to `spark-master` and assumes it isn't meant for it, since its own name is `spark-master-controller-abcdef`.)
* Remove service env dependencies (depend on DNS instead).
* JSON -> YAML.
* Add the GCS connector.
* Make the example do something actually useful: implement wordcount of all of Shakespeare's works, a familiar example to anyone at Google.
* Fix a minor service connection issue in the gluster example.
## Step One: Start your Master service

The Master [service](../../docs/user-guide/services.md) is the master service
for a Spark cluster.

Use the
[`examples/spark/spark-master-controller.yaml`](spark-master-controller.yaml)
file to create a
[replication controller](../../docs/user-guide/replication-controller.md)
running the Spark Master service.

```console
$ kubectl create -f examples/spark/spark-master-controller.yaml
replicationcontrollers/spark-master-controller
```

Then, use the
[`examples/spark/spark-master-service.yaml`](spark-master-service.yaml) file to
create a logical service endpoint that Spark workers can use to access the
Master pod.

```console
$ kubectl create -f examples/spark/spark-master-service.yaml
services/spark-master
```

Optionally, you can create a service for the Spark Master WebUI at this point as
well. If you are running on a cloud provider that supports it, this will create
an external load balancer and open a firewall to the Spark Master WebUI on the
cluster. **Note:** With the existing configuration, there is **ABSOLUTELY NO**
authentication on this WebUI. With slightly more work, it would be
straightforward to put an `nginx` proxy in front to password-protect it.

```console
$ kubectl create -f examples/spark/spark-webui.yaml
services/spark-webui
```

### Check to see if Master is running and accessible

```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
[...]
spark-master-controller-5u0q5   1/1       Running   0          8m
```

Check logs to see the status of the master. (Use the pod name retrieved from the previous output.)

```console
$ kubectl logs spark-master-controller-5u0q5
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.5.1-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master-controller-g0oao.out
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/10/27 21:25:05 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/27 21:25:05 INFO SecurityManager: Changing view acls to: root
15/10/27 21:25:05 INFO SecurityManager: Changing modify acls to: root
15/10/27 21:25:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/27 21:25:06 INFO Slf4jLogger: Slf4jLogger started
15/10/27 21:25:06 INFO Remoting: Starting remoting
15/10/27 21:25:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/10/27 21:25:06 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077
15/10/27 21:25:07 INFO Master: Running Spark version 1.5.1
15/10/27 21:25:07 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/10/27 21:25:07 INFO MasterWebUI: Started MasterWebUI at http://spark-master:8080
15/10/27 21:25:07 INFO Utils: Successfully started service on port 6066.
15/10/27 21:25:07 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```

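Checks like the ones above can also be scripted. A minimal sketch in plain Python (not part of the example; the helper name is ours and `kubectl` itself is not invoked) that parses `kubectl get pods`-style output such as the listing above:

```python
def parse_pod_statuses(kubectl_output):
    """Map pod name -> STATUS from `kubectl get pods` tabular output."""
    # Drop blank lines and "[...]" elision markers.
    lines = [l for l in kubectl_output.strip().splitlines() if l and l != "[...]"]
    header = lines[0].split()
    status_col = header.index("STATUS")
    return {row.split()[0]: row.split()[status_col] for row in lines[1:]}

listing = """\
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          8m
"""
statuses = parse_pod_statuses(listing)
print(statuses["spark-master-controller-5u0q5"])  # Running
```

Feeding this the output of `kubectl get pods` (e.g. via `subprocess`) lets a script wait until the master reports `Running` before proceeding.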
If you created the Spark WebUI and waited sufficient time for the load balancer
to be created, the `spark-webui` service should look something like this:

```console
$ kubectl describe services/spark-webui
Name:                   spark-webui
Namespace:              default
Labels:                 <none>
Selector:               component=spark-master
Type:                   LoadBalancer
IP:                     10.0.152.249
LoadBalancer Ingress:   104.197.147.190
Port:                   <unnamed>  8080/TCP
NodePort:               <unnamed>  31141/TCP
Endpoints:              10.244.1.12:8080
Session Affinity:       None
Events:                 [...]
```

You should now be able to visit `http://104.197.147.190:8080` and see the Spark
Master UI. *Note:* After workers connect, this UI has links to worker Web
UIs. The worker UI links do not work (the links attempt to connect to cluster
IPs).

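The external IP shown under `LoadBalancer Ingress` above can be extracted programmatically. A hedged sketch in plain Python (the helper is ours; it parses `kubectl describe`-shaped text rather than calling `kubectl`):

```python
def describe_to_dict(describe_output):
    """Turn `kubectl describe` 'Key:  value' lines into a dict."""
    fields = {}
    for line in describe_output.strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields

desc = """\
Name:                   spark-webui
Type:                   LoadBalancer
IP:                     10.0.152.249
LoadBalancer Ingress:   104.197.147.190
"""
ingress = describe_to_dict(desc)["LoadBalancer Ingress"]
webui_url = "http://%s:8080" % ingress
print(webui_url)  # http://104.197.147.190:8080
```

(On a real cluster, `kubectl get services spark-webui -o template` style output queries would be more robust than screen-scraping `describe`, but the sketch shows the idea.)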
## Step Two: Start your Spark workers

The Spark workers do the heavy lifting in a Spark cluster. They
[...] program.

The Spark workers need the Master service to be running.

Use the [`examples/spark/spark-worker-controller.yaml`](spark-worker-controller.yaml) file to create a
[replication controller](../../docs/user-guide/replication-controller.md) that manages the worker pods.

```console
$ kubectl create -f examples/spark/spark-worker-controller.yaml
```

### Check to see if the workers are running

If you launched the Spark WebUI, your workers should just appear in the UI when
they're ready. (It may take a little bit to pull the images and launch the
pods.) You can also interrogate the status in the following way:

```console
$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          25m
spark-worker-controller-e8otp   1/1       Running   0          6m
spark-worker-controller-fiivl   1/1       Running   0          6m
spark-worker-controller-ytc7o   1/1       Running   0          6m

$ kubectl logs spark-master-controller-5u0q5
[...]
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM
```

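The `Registering worker` lines in the master log give a quick inventory of cluster capacity. A small sketch in plain Python (our own helper, operating on log text shaped like the excerpt above) that tallies workers, cores, and memory:

```python
import re

def tally_workers(master_log):
    """Sum workers, cores, and GB of RAM from 'Registering worker' log lines."""
    pattern = re.compile(r"Registering worker (\S+) with (\d+) cores, ([\d.]+) GB RAM")
    workers, cores, ram_gb = 0, 0, 0.0
    for match in pattern.finditer(master_log):
        workers += 1
        cores += int(match.group(2))
        ram_gb += float(match.group(3))
    return workers, cores, ram_gb

log = """\
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM
"""
workers, cores, ram_gb = tally_workers(log)
print(workers, cores)  # 3 6
```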
## Step Three: Start your Spark driver to launch jobs on your Spark cluster

The Spark driver is used to launch jobs into the Spark cluster. You can read more about it in the
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html) overview.

```console
$ kubectl create -f examples/spark/spark-driver-controller.yaml
replicationcontrollers/spark-driver-controller
```

The Spark driver needs the Master service to be running.

### Check to see if the driver is running

```console
$ kubectl get pods -lcomponent=spark-driver
NAME                            READY     STATUS    RESTARTS   AGE
spark-driver-controller-vwb9c   1/1       Running   0          1m
```

## Step Four: Do something with the cluster

Use `kubectl exec` to connect to the Spark driver and run a pipeline.

```console
$ kubectl exec spark-driver-controller-vwb9c -it pyspark
Python 2.7.9 (default, Mar  1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.9 (default, Mar  1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
939193
```

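The pyspark one-liner above splits each line on whitespace and sums the per-line word counts. The same logic in plain Python, with no Spark required (the sample text here is just a two-line illustration, not the Shakespeare dataset):

```python
def count_words(lines):
    """Equivalent of .map(lambda s: len(s.split())).sum() over an RDD of lines."""
    return sum(len(line.split()) for line in lines)

sample = [
    "Friends, Romans, countrymen, lend me your ears;",
    "I come to bury Caesar, not to praise him.",
]
print(count_words(sample))  # 16
```

Spark distributes exactly this computation across the workers, which is why the cluster can chew through the full `gs://dataflow-samples/shakespeare/*` corpus quickly.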

Congratulations, you just counted all of the words in all of the plays of
Shakespeare.

## Result

You now have services and replication controllers for the Spark master, Spark
workers, and Spark driver. You can take this example to the next step and start
using the Apache Spark cluster you just created; see the
[Spark documentation](https://spark.apache.org/documentation.html) for more
information.

## tl;dr

```console
kubectl create -f examples/spark/spark-master-controller.yaml
kubectl create -f examples/spark/spark-master-service.yaml
kubectl create -f examples/spark/spark-webui.yaml
kubectl create -f examples/spark/spark-worker-controller.yaml
kubectl create -f examples/spark/spark-driver-controller.yaml
```

After it's set up:

```console
kubectl get pods                            # Make sure everything is running
kubectl get services spark-webui            # Get the IP of the Spark WebUI
kubectl get pods -lcomponent=spark-driver   # Get the driver pod to interact with.
```