Zeppelin: Add Zeppelin image to Spark example

This adds a very basic Zeppelin image that works with the existing
Spark example. As can be seen from the documentation, it has a couple
of warts:

* It requires `kubectl port-forward`, which is unstable over long
periods of time, at least for me on this app (bug incoming).

* I needed to roll my own container (none of the existing containers
exactly matched my needs, or even still built against modern Zeppelin
master, and the rest of the example is Spark 1.5).

The image itself is *huge*. One of the further refinements we need to
look at is how to possibly strip the Maven build for this container
down to just the interpreters we care about, because the deps here
are frankly ridiculous.

This might be a case where, if possible, we want to open an upstream
request to build interpreters dynamically, which would probably cut the
image down considerably. (This might already be possible; need to poke
at whether you can late-bind interpreters.)
This commit is contained in:
Zach Loafman
2015-11-05 14:56:15 -08:00
parent 6a34b2cc04
commit 51817850ba
10 changed files with 239 additions and 29 deletions

@@ -120,8 +120,8 @@ Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
```
After you know the master is running, you can use the [cluster
proxy](../../docs/user-guide/accessing-the-cluster.md#using-kubectl-proxy) to
connect to the Spark WebUI:
```console
@@ -129,7 +129,7 @@ kubectl proxy --port=8001
```
At which point the UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).
## Step Two: Start your Spark workers
@@ -172,32 +172,40 @@ you should now see the workers in the UI as well. *Note:* The UI will have links
to worker Web UIs. The worker UI links do not work (the links will attempt to
connect to cluster IPs, which Kubernetes won't proxy automatically).
## Step Three: Start the Zeppelin UI to launch jobs on your Spark cluster
The Zeppelin UI pod can be used to launch jobs into the Spark cluster either via
a web notebook frontend or the traditional Spark command line. See
[Zeppelin](https://zeppelin.incubator.apache.org/) and
[Spark architecture](https://spark.apache.org/docs/latest/cluster-overview.html)
for more details.
```console
$ kubectl create -f examples/spark/zeppelin-controller.yaml
replicationcontrollers/zeppelin-controller
```
Zeppelin needs the Master service to be running.
### Check to see if Zeppelin is running
```console
$ kubectl get pods -lcomponent=zeppelin
NAME READY STATUS RESTARTS AGE
zeppelin-controller-ja09s 1/1 Running 0 53s
```
## Step Four: Do something with the cluster
Now you have two choices, depending on your predilections. You can do something
graphical with the Spark cluster, or you can stay in the CLI.
### Do something fast with pyspark!
Use `kubectl exec` to connect to the Zeppelin driver and run a pipeline.
```console
$ kubectl exec zeppelin-controller-ja09s -it pyspark
Python 2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
@@ -217,6 +225,24 @@ SparkContext available as sc, HiveContext available as sqlContext.
Congratulations, you just counted all of the words in all of the plays of
Shakespeare.
### Do something graphical and shiny!
Take the Zeppelin pod from above and port-forward the WebUI port:
```console
$ kubectl port-forward zeppelin-controller-ja09s 8080:8080
```
This forwards local port 8080 to container port 8080. You can then find
Zeppelin at [http://localhost:8080/](http://localhost:8080/).
Create a "New Notebook". In there, type:
```
%pyspark
print sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
```
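For comparison, the notebook pipeline above can be sketched in plain Python without Spark; the sample lines below are made-up stand-ins for the Shakespeare dataset:

```python
# Plain-Python equivalent of the pipeline above: split each line into
# words and sum the per-line counts. The sample lines stand in for the
# contents of gs://dataflow-samples/shakespeare/*.
lines = [
    "To be or not to be",
    "that is the question",
]

# Mirrors .map(lambda s: len(s.split())).sum() in the pyspark version.
total_words = sum(len(line.split()) for line in lines)
print(total_words)  # 6 + 4 = 10 words
```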
## Result
You now have services and replication controllers for the Spark master, Spark
@@ -235,10 +261,46 @@ After it's setup:
```console
kubectl get pods # Make sure everything is running
kubectl proxy --port=8001 # Start an application proxy, if you want to see the Spark Master WebUI
kubectl get pods -lcomponent=zeppelin # Get the driver pod to interact with.
```
At which point the Master UI will be available at
[http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/](http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/).
You can either interact with the Spark cluster via the traditional `spark-shell` /
`spark-submit` / `pyspark` commands by using `kubectl exec` against the
`zeppelin-controller` pod, or if you want to interact with Zeppelin:
```console
kubectl port-forward zeppelin-controller-abc123 8080:8080 &
```
Then visit [http://localhost:8080/](http://localhost:8080/).
## Known Issues With Spark
* This provides a Spark configuration that is restricted to the cluster network,
meaning the Spark master is only available as a cluster service. If you need
to submit jobs using an external client other than Zeppelin or `spark-submit` on
the `zeppelin` pod, you will need to provide a way for your clients to reach the
service defined in
[`examples/spark/spark-master-service.yaml`](spark-master-service.yaml). See
[Services](../../docs/user-guide/services.md) for more information.
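One way to expose the master externally (a sketch, not shipped with this example; the `NodePort` value and the `component: spark-master` selector are assumptions about the example's labels) is a `NodePort` service alongside the existing one:

```yaml
# Hypothetical NodePort service making the Spark master reachable on
# every node's port 30077; selector and ports assumed to mirror the
# existing spark-master-service.yaml.
kind: Service
apiVersion: v1
metadata:
  name: spark-master-external
spec:
  type: NodePort
  selector:
    component: spark-master
  ports:
    - port: 7077
      targetPort: 7077
      nodePort: 30077
```

After `kubectl create -f` on a manifest like this, external clients could reach the master at `<any-node-ip>:30077`, subject to firewall rules.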
## Known Issues With Zeppelin
* The Zeppelin pod is large, so it may take a while to pull depending on your
network. The size of the Zeppelin image is something we're working on; see issue #17231.
* Zeppelin may take some time (about a minute) to run this pipeline the first
time; it seems to take considerable time to load.
* On GKE, `kubectl port-forward` may not be stable over long periods of time. If
you see Zeppelin go into `Disconnected` state (there will be a red dot on the
top right as well), the `port-forward` probably failed and needs to be
restarted. See #12179.
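Until that is fixed, one workaround is a small supervisor loop that restarts the forward whenever it exits; this is a sketch, not part of the example, and `run_with_restarts` is a made-up helper:

```python
import subprocess
import time

def run_with_restarts(cmd, max_runs=5, delay=1.0):
    """Run cmd repeatedly, restarting it each time it exits.

    Intended for flaky long-running commands such as
    ["kubectl", "port-forward", "zeppelin-controller-ja09s", "8080:8080"].
    Returns the number of times cmd was run.
    """
    runs = 0
    while runs < max_runs:
        subprocess.run(cmd)  # blocks until the forward dies
        runs += 1
        time.sleep(delay)
    return runs

# Stand-in command instead of kubectl; exits immediately, so this
# simply runs it max_runs times:
# run_with_restarts(["true"], delay=0.0)
```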