Spark: Update to current example standards, add GCS connector

* Pod -> ReplicationController, which also forced me to hack around a
hostname issue on the master. (The Spark master sees the incoming slave
request addressed to spark-master and assumes it's not meant for it, since its
name is spark-master-controller-abcdef. A sketch of the workaround appears
after this list.)

* Remove service env dependencies (depend on DNS instead).

* JSON -> YAML.

* Add GCS connector.

* Make the example do something actually useful: implement wordcount over all
of Shakespeare's works, an example familiar to anyone at Google.

* Fix a minor service connection issue in the gluster example.
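
A minimal sketch of the hostname workaround, assuming it boils down to passing
an explicit `--ip` to the master (the master logs below show it started with
`--ip spark-master`; the actual start script in this commit may differ):

```sh
# Hypothetical master start wrapper (illustrative only; the real script may differ).
# Pass an explicit --ip so the master identifies itself by the "spark-master"
# service name rather than by the pod's generated hostname.
/opt/spark-1.5.1-bin-hadoop2.6/bin/spark-class org.apache.spark.deploy.master.Master \
  --ip spark-master --port 7077 --webui-port 8080
```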
Zach Loafman
2015-10-26 13:42:02 -07:00
parent c0fba82da4
commit 777d3a6ec0
25 changed files with 325 additions and 231 deletions


@@ -57,60 +57,98 @@ instructions for your platform.
## Step One: Start your Master service
The Master [service](../../docs/user-guide/services.md) is the master service
for a Spark cluster.
Use the
[`examples/spark/spark-master-controller.yaml`](spark-master-controller.yaml)
file to create a
[replication controller](../../docs/user-guide/replication-controller.md)
running the Spark Master service.
```console
$ kubectl create -f examples/spark/spark-master-controller.yaml
replicationcontrollers/spark-master-controller
```
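
For reference, a minimal sketch of what `spark-master-controller.yaml` might
contain (the file in the repository is authoritative; the image name here is
an assumption):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-master-controller
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: gcr.io/google_containers/spark-master  # assumed image name
          ports:
            - containerPort: 7077  # Spark master port
```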
Then, use the
[`examples/spark/spark-master-service.yaml`](spark-master-service.yaml) file to
create a logical service endpoint that Spark workers can use to access the
Master pod.
```console
$ kubectl create -f examples/spark/spark-master-service.yaml
services/spark-master
```
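
A sketch of what `spark-master-service.yaml` might look like (again, the
repository file is authoritative; the selector is inferred from the
`kubectl describe` output later in this document):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: spark-master
spec:
  ports:
    - port: 7077        # port workers use to reach the master
      targetPort: 7077
  selector:
    component: spark-master
```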
Optionally, you can create a service for the Spark Master WebUI at this point as
well. If you are running on a cloud provider that supports it, this will create
an external load balancer and open a firewall to the Spark Master WebUI on the
cluster. **Note:** With the existing configuration, there is **ABSOLUTELY NO**
authentication on this WebUI. With slightly more work, it would be
straightforward to put an `nginx` proxy in front to password protect it.
```console
$ kubectl create -f examples/spark/spark-webui.yaml
services/spark-webui
```
### Check to see if Master is running and accessible
```console
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
spark-master-controller-5u0q5 1/1 Running 0 8m
```
Check logs to see the status of the master. (Use the pod retrieved from the previous output.)
```console
$ kubectl logs spark-master-controller-5u0q5
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.5.1-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master-controller-g0oao.out
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/10/27 21:25:05 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/27 21:25:05 INFO SecurityManager: Changing view acls to: root
15/10/27 21:25:05 INFO SecurityManager: Changing modify acls to: root
15/10/27 21:25:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/27 21:25:06 INFO Slf4jLogger: Slf4jLogger started
15/10/27 21:25:06 INFO Remoting: Starting remoting
15/10/27 21:25:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/10/27 21:25:06 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077
15/10/27 21:25:07 INFO Master: Running Spark version 1.5.1
15/10/27 21:25:07 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/10/27 21:25:07 INFO MasterWebUI: Started MasterWebUI at http://spark-master:8080
15/10/27 21:25:07 INFO Utils: Successfully started service on port 6066.
15/10/27 21:25:07 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```
If you created the Spark WebUI and waited sufficient time for the load balancer
to be created, the `spark-webui` service should look something like this:
```console
$ kubectl describe services/spark-webui
Name: spark-webui
Namespace: default
Labels: <none>
Selector: component=spark-master
Type: LoadBalancer
IP: 10.0.152.249
LoadBalancer Ingress: 104.197.147.190
Port: <unnamed> 8080/TCP
NodePort: <unnamed> 31141/TCP
Endpoints: 10.244.1.12:8080
Session Affinity: None
Events: [...]
```
You should now be able to visit `http://104.197.147.190:8080` and see the Spark
Master UI. *Note:* After workers connect, this UI links to the workers' own Web
UIs, but those links do not work (they attempt to connect to cluster IPs).
## Step Two: Start your Spark workers
The Spark workers do the heavy lifting in a Spark cluster. They
@@ -119,96 +157,104 @@ program.
The Spark workers need the Master service to be running.
Use the [`examples/spark/spark-worker-controller.yaml`](spark-worker-controller.yaml) file to create a
[replication controller](../../docs/user-guide/replication-controller.md) that manages the worker pods.
```console
$ kubectl create -f examples/spark/spark-worker-controller.yaml
```
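
A sketch of what `spark-worker-controller.yaml` might contain (the repository
file is authoritative; the image name and worker port are assumptions, while
`replicas: 3` matches the three worker pods shown below):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-worker-controller
spec:
  replicas: 3
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: gcr.io/google_containers/spark-worker  # assumed image name
          ports:
            - containerPort: 8081  # assumed worker WebUI port
```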
### Check to see if the workers are running
If you launched the Spark WebUI, your workers should just appear in the UI when
they're ready. (It may take a little bit to pull the images and launch the
pods.) You can also interrogate the status in the following way:
```console
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
spark-master-controller-5u0q5 1/1 Running 0 25m
spark-worker-controller-e8otp 1/1 Running 0 6m
spark-worker-controller-fiivl 1/1 Running 0 6m
spark-worker-controller-ytc7o 1/1 Running 0 6m
$ kubectl logs spark-master-controller-5u0q5
[...]
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM
```
## Step Three: Start your Spark driver to launch jobs on your Spark cluster
The Spark driver is used to launch jobs into the Spark cluster. You can read more about it in
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html).
```console
$ kubectl create -f examples/spark/spark-driver-controller.yaml
replicationcontrollers/spark-driver-controller
```
The Spark driver needs the Master service to be running.
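
A sketch of what `spark-driver-controller.yaml` might contain (the repository
file is authoritative; the image name is an assumption, while the
`component=spark-driver` label matches the selector used in the next step):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: spark-driver-controller
spec:
  replicas: 1
  selector:
    component: spark-driver
  template:
    metadata:
      labels:
        component: spark-driver
    spec:
      containers:
        - name: spark-driver
          image: gcr.io/google_containers/spark-driver  # assumed image name
```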
### Check to see if the driver is running
```console
$ kubectl get pods -lcomponent=spark-driver
NAME READY STATUS RESTARTS AGE
spark-driver-controller-vwb9c 1/1 Running 0 1m
```
## Step Four: Do something with the cluster
Use `kubectl exec` to connect to the Spark driver and run a pipeline.
```console
$ kubectl exec spark-driver-controller-vwb9c -it pyspark
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.5.1
/_/
Using Python version 2.7.9 (default, Mar 1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
939193
```
Congratulations, you just counted all of the words in all of the plays of
Shakespeare.
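
If you want per-word counts rather than a total, a small variation on the same
pipeline gives the classic map/reduce formulation (a sketch; it assumes the
same `sc` and GCS bucket as above):

```console
>>> counts = (sc.textFile("gs://dataflow-samples/shakespeare/*")
...             .flatMap(lambda s: s.split())       # one record per word
...             .map(lambda w: (w, 1))              # pair each word with a count of 1
...             .reduceByKey(lambda a, b: a + b))   # sum the counts per word
>>> counts.takeOrdered(10, key=lambda kv: -kv[1])   # ten most frequent words
```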
## Result
You now have services and replication controllers for the Spark master, Spark
workers, and Spark driver. You can take this example to the next step and start
using the Apache Spark cluster you just created; see the
[Spark documentation](https://spark.apache.org/documentation.html) for more
information.
## tl;dr
```console
kubectl create -f examples/spark/spark-master-controller.yaml
kubectl create -f examples/spark/spark-master-service.yaml
kubectl create -f examples/spark/spark-webui.yaml
kubectl create -f examples/spark/spark-worker-controller.yaml
kubectl create -f examples/spark/spark-driver-controller.yaml
```
After it's set up:
```console
kubectl get pods # Make sure everything is running
kubectl get services spark-webui # Get the IP of the Spark WebUI
kubectl get pods -lcomponent=spark-driver # Get the driver pod to interact with.
```