Zeppelin: Add Zeppelin image to Spark example

This adds a very basic Zeppelin image that works with the existing
Spark example. As can be seen from the documentation, it has a couple
of warts:

* It requires `kubectl port-forward`, which is unstable over long
periods of time, at least for me on this app (bug incoming).

* I needed to roll my own container (none of the existing containers
exactly matched my needs, or even still built against modern Zeppelin
master, and the rest of the example is Spark 1.5).

The image itself is *huge*. One of the further refinements we need to
look at is how to possibly strip the Maven build for this container
down to just the interpreters we care about, because the deps here
are frankly ridiculous.

This might be a case where, if possible, we want to open an upstream
request to build interpreters dynamically, which would probably cut the
image down considerably. (This might already be possible; need to poke
at whether you can late-bind interpreters.)
This commit is contained in:
Zach Loafman
2015-11-05 14:56:15 -08:00
parent 6a34b2cc04
commit 51817850ba
10 changed files with 239 additions and 29 deletions

@@ -120,8 +120,8 @@ Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```
After you know the master is running, you can use the [cluster
proxy](../../docs/user-guide/accessing-the-cluster.md#using-kubectl-proxy) to
connect to the Spark WebUI:
```console
@@ -129,7 +129,7 @@ kubectl proxy --port=8001
```
At which point the UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).
## Step Two: Start your Spark workers
@@ -172,32 +172,40 @@ you should now see the workers in the UI as well. *Note:* The UI will have links
to worker Web UIs. The worker UI links do not work (the links will attempt to
connect to cluster IPs, which Kubernetes won't proxy automatically).
## Step Three: Start the Zeppelin UI to launch jobs on your Spark cluster
The Zeppelin UI pod can be used to launch jobs into the Spark cluster either via
a web notebook frontend or the traditional Spark command line. See
[Zeppelin](https://zeppelin.incubator.apache.org/) and
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html)
for more details.
```console
$ kubectl create -f examples/spark/zeppelin-controller.yaml
replicationcontrollers/zeppelin-controller
```
Zeppelin needs the Master service to be running.
### Check to see if Zeppelin is running
```console
$ kubectl get pods -lcomponent=zeppelin
NAME READY STATUS RESTARTS AGE
zeppelin-controller-ja09s 1/1 Running 0 53s
```
## Step Four: Do something with the cluster
Now you have two choices, depending on your predilections. You can do something
graphical with the Spark cluster, or you can stay in the CLI.
### Do something fast with pyspark!
Use `kubectl exec` to connect to the Zeppelin driver and run a pipeline.
```console
$ kubectl exec zeppelin-controller-ja09s -it pyspark
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
@@ -217,6 +225,24 @@ SparkContext available as sc, HiveContext available as sqlContext.
Congratulations, you just counted all of the words in all of the plays of
Shakespeare.
### Do something graphical and shiny!
Take the Zeppelin pod from above and port-forward the WebUI port:
```console
$ kubectl port-forward zeppelin-controller-ja09s 8080:8080
```
This forwards local port 8080 to container port 8080. You can then find
Zeppelin at [http://localhost:8080/](http://localhost:8080/).
Create a "New Notebook". In there, type:
```
%pyspark
print sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
```
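For comparison, the notebook pipeline above can be sketched in plain Python without Spark; the sample lines below are made-up stand-ins for the Shakespeare dataset:

```python
# Plain-Python equivalent of the pipeline above: split each line into
# words and sum the per-line counts. The sample lines stand in for the
# contents of gs://dataflow-samples/shakespeare/*.
lines = [
    "To be or not to be",
    "that is the question",
]

# Mirrors .map(lambda s: len(s.split())).sum() in the pyspark version.
total_words = sum(len(line.split()) for line in lines)
print(total_words)  # 6 + 4 = 10 words
```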
## Result
You now have services and replication controllers for the Spark master, Spark
@@ -235,10 +261,46 @@ After it's setup:
```console
kubectl get pods # Make sure everything is running
kubectl proxy --port=8001 # Start an application proxy, if you want to see the Spark Master WebUI
kubectl get pods -lcomponent=zeppelin # Get the driver pod to interact with.
```
At which point the Master UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).
You can either interact with the Spark cluster via the traditional `spark-shell` /
`spark-submit` / `pyspark` commands by using `kubectl exec` against the
`zeppelin-controller` pod, or if you want to interact with Zeppelin:
```console
kubectl port-forward zeppelin-controller-abc123 8080:8080 &
```
Then visit [http://localhost:8080/](http://localhost:8080/).
## Known Issues With Spark
* This provides a Spark configuration that is restricted to the cluster network,
meaning the Spark master is only available as a cluster service. If you need
to submit jobs using an external client other than Zeppelin or `spark-submit` on
the `zeppelin` pod, you will need to provide a way for your clients to reach the
service defined in
[`examples/spark/spark-master-service.yaml`](spark-master-service.yaml). See
[Services](../../docs/user-guide/services.md) for more information.
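One way to expose the master externally (a sketch, not shipped with this example; the `NodePort` value and the `component: spark-master` selector are assumptions about the example's labels) is a `NodePort` service alongside the existing one:

```yaml
# Hypothetical NodePort service making the Spark master reachable on
# every node's port 30077; selector and ports assumed to mirror the
# existing spark-master-service.yaml.
kind: Service
apiVersion: v1
metadata:
  name: spark-master-external
spec:
  type: NodePort
  selector:
    component: spark-master
  ports:
    - port: 7077
      targetPort: 7077
      nodePort: 30077
```

After `kubectl create -f` on a manifest like this, external clients could reach the master at `<any-node-ip>:30077`, subject to firewall rules.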
## Known Issues With Zeppelin
* The Zeppelin pod is large, so it may take a while to pull depending on your
network. The size of the Zeppelin image is something we're working on; see issue #17231.
* Zeppelin may take some time (about a minute) to run this pipeline the first
time; it seems to take considerable time to load.
* On GKE, `kubectl port-forward` may not be stable over long periods of time. If
you see Zeppelin go into `Disconnected` state (there will be a red dot on the
top right as well), the `port-forward` probably failed and needs to be
restarted. See #12179.
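Until that is fixed, one workaround is a small supervisor loop that restarts the forward whenever it exits; this is a sketch, not part of the example, and `run_with_restarts` is a made-up helper:

```python
import subprocess
import time

def run_with_restarts(cmd, max_runs=5, delay=1.0):
    """Run cmd repeatedly, restarting it each time it exits.

    Intended for flaky long-running commands such as
    ["kubectl", "port-forward", "zeppelin-controller-ja09s", "8080:8080"].
    Returns the number of times cmd was run.
    """
    runs = 0
    while runs < max_runs:
        subprocess.run(cmd)  # blocks until the forward dies
        runs += 1
        time.sleep(delay)
    return runs

# Stand-in command instead of kubectl; exits immediately, so this
# simply runs it max_runs times:
# run_with_restarts(["true"], delay=0.0)
```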