Submit Spark job to Kubernetes
About this demo
This demo shows how to deploy a Spark job on minikube. minikube can be used to set up a local Kubernetes cluster.
A directory on the host is mounted into the Kubernetes pods; this directory is used
- to provide the jar file which contains the Spark job
- to store the output of the Spark job
Setup Docker
Please set up Docker first. minikube will detect it and then use its Docker driver.
Setup minikube
Please follow the steps as described on the minikube pages.
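With Docker in place, starting the local cluster can look like this (the --driver flag is usually optional, since minikube auto-detects Docker):
$ minikube start --driver=docker
$ minikube status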
Select or build a Spark distribution with Kubernetes support
Your Spark distribution has Kubernetes support if the directory $SPARK_HOME/kubernetes/ exists. In that case you can skip this section.
Since I am using a Spark distribution which I built myself, I had to add Kubernetes support during the build.
For details on building Spark with Kubernetes support, please have a look at my StackOverflow answer.
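In essence, it boils down to enabling the kubernetes Maven profile when creating the distribution from the Spark source tree, for example (the --name value is arbitrary):
$ ./dev/make-distribution.sh --name spark-k8s --tgz -Pkubernetes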
Create a Spark job with I/O
This demo will mount the host directory /tmp/data into the Kubernetes pod. We will store the jar file containing the Spark job there.
The Spark job therefore has to write its output into a different location; we use the subdirectory /tmp/data/output.
ds.write.format("json").mode("overwrite").save("/tmp/data/output")
Apart from this, there is no difference from a “normal” Spark job.
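The actual demo class is not listed in this post; as a rough sketch, a minimal job of this kind could look like the following (the Dataset contents are made up for illustration):
import org.apache.spark.sql.SparkSession

// Hypothetical sketch, not the actual JsonLocalFsDemo used in this demo
object JsonLocalFsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-k8s-demo").getOrCreate()
    import spark.implicits._

    // Any small Dataset will do; the interesting part is the write below
    val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()

    // /tmp/data is the host directory mounted into the pods;
    // the output goes into a subdirectory so it does not clash with the jar file
    ds.write.format("json").mode("overwrite").save("/tmp/data/output")

    spark.stop()
  }
}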
Prepare Docker image
Generally speaking, Kubernetes requires you to provide an image; in this case we need a Spark image.
Since this demo uses Docker, that means a Spark Docker image. Spark ships with a tool to build such an image. What is more, this tool - namely docker-image-tool.sh - is designed to work with minikube (see the -m argument):
$ docker-image-tool.sh -h
...
-m Use minikube's Docker daemon.
Using minikube when building images will do so directly into
minikube's Docker daemon.
There is no need to push the images into minikube in that case,
they will be automatically available when running applications
inside the minikube cluster.
Check the following documentation for more information on using the
minikube Docker daemon:
https://kubernetes.io/docs/getting-started-guides/minikube/#reusing-the-docker-daemon
Examples:
- Build image in minikube with tag "testing"
docker-image-tool.sh -m -t testing build
Since my hosts are behind a router which does not allow direct access to the Internet, I had to configure the proxy:
$ cd $SPARK_HOME
$ docker-image-tool.sh -m -t spark-k8s-2.4.5 \
-b http_proxy=$http_proxy -b https_proxy=$https_proxy build
When this completes successfully, there should be three artifacts: Docker images for JVM-, Python-, and R-based Spark jobs:
[...]
Successfully tagged spark:spark-k8s-2.4.5
[...]
Successfully tagged spark-py:spark-k8s-2.4.5
[...]
Successfully tagged spark-r:spark-k8s-2.4.5
These artifacts do not show up when you list the Docker images on the host. Instead, they show up in minikube’s Docker daemon. Please make sure that they are really there.
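One way to check is to point the local Docker client at minikube’s Docker daemon (this only affects the current shell):
$ eval $(minikube docker-env)
$ docker images | grep spark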
Mount host directory into the pod
Now let’s mount the host directory into the pod:
$ minikube mount /tmp/data:/tmp/data
* Mounting host path /tmp/data into VM as /tmp/data ...
- Mount type:
- User ID: docker
- Group ID: docker
- Version: 9p2000.L
- Message Size: 262144
- Permissions: 755 (-rwxr-xr-x)
- Options: map[]
- Bind Address: 192.168.49.1:36117
* Userspace file server: ufs starting
* Successfully mounted /tmp/data to /tmp/data
* NOTE: This process must stay alive for the mount to be accessible ...
You can test it like this: “Enter” minikube using ssh:
$ minikube ssh
Last login: Fri Apr 9 07:29:19 2021 from 192.168.49.1
docker@minikube:~$ mount | grep tmp/data
192.168.49.1 on /tmp/data type 9p (rw,relatime,sync,dirsync,dfltgid=999,dfltuid=1000,msize=262144,port=40207,trans=tcp,version=9p2000.L)
docker@minikube:~$
Create Kubernetes service account
The Spark job will be run on behalf of this service account:
kubectl create serviceaccount spark-sa
This account needs permissions:
kubectl create clusterrolebinding spark-rb \
--clusterrole=edit \
--serviceaccount=default:spark-sa \
--namespace=default
The cluster role edit is documented here.
edit means that this service account now has read/write access to most objects in the default namespace.
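You can verify the permissions with kubectl auth can-i, which should answer yes once the binding is in place:
$ kubectl auth can-i get pods \
    --as=system:serviceaccount:default:spark-sa \
    --namespace=default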
Without sufficient permissions, you might get error messages like this:
Caused by: io.fabric8.kubernetes.client.KubernetesClientException:
Failure executing: GET at: https://kubernetes.default.svc/api/v1/namespaces/default/pods/spark-k8s-demo-1617878918322-driver.
Message: Forbidden! Configured service account doesn't have access.
Service account may have been revoked.
pods "spark-k8s-demo-1617878918322-driver" is forbidden:
User "system:serviceaccount:default:default" cannot get
resource "pods" in API group "" in the namespace "default".
Provide jar
Build the jar file containing the Spark job as usual, then copy the jar file into the shared directory:
$ cp .../spark-notebook_2.11-1.0-SNAPSHOT.jar /tmp/data/
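The build step itself depends on your project; judging from the _2.11 suffix in the jar name, this one is an sbt project, so the jar would typically come out of:
$ sbt package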
Get master URL
Now inspect minikube to get the URL:
$ kubectl cluster-info
Kubernetes control plane is running at https://192.168.49.2:8443
KubeDNS is running at https://192.168.49.2:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
This means that https://192.168.49.2:8443 is the address to use in the master URL (prefixed with k8s://, see below).
Submit job
Finally we can submit the job.
You have to adjust the master URL (--master), the name of the class (--class), and the location of the jar file (the last argument).
$ spark-submit \
--master k8s://https://192.168.49.2:8443 \
--deploy-mode cluster \
--name spark-k8s-demo \
--class com.rhaag.spark.notebook.shell.connector.localfs.JsonLocalFsDemo \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
--conf spark.kubernetes.driver.volumes.hostPath.data.mount.path=/tmp/data \
--conf spark.kubernetes.driver.volumes.hostPath.data.options.path=/tmp/data \
--conf spark.kubernetes.driver.volumes.hostPath.data.mount.readOnly=false \
--conf spark.kubernetes.executor.volumes.hostPath.data.mount.path=/tmp/data \
--conf spark.kubernetes.executor.volumes.hostPath.data.options.path=/tmp/data \
--conf spark.kubernetes.executor.volumes.hostPath.data.mount.readOnly=false \
--conf spark.kubernetes.container.image=spark:spark-k8s-2.4.5 \
local:///tmp/data/spark-notebook_2.11-1.0-SNAPSHOT.jar
- The name spark-k8s-demo will turn up in the list provided by kubectl get pods (see below).
- The job is run on behalf of the Kubernetes service account spark-sa which has been created before.
- The container spark:spark-k8s-2.4.5 is the Docker image which has been built for JVM-based Spark jobs.
- The shared directory must be configured both for the driver and the executors. If it is only configured for the driver, there won’t be any output files on the host.
- local:/// means the local filesystem in the pod (so it’s not the filesystem of the host).
While the job is running
… you can watch the status of the Kubernetes pods by using kubectl get pods
When the driver has been started:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hello-minikube-6ddfcc9757-vwkh8 1/1 Running 7 8d
spark-k8s-demo-1617950835735-driver 1/1 Running 0 6s
The NAME starts with the prefix which was defined by the --name argument for spark-submit.
After the executors have been started:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hello-minikube-6ddfcc9757-vwkh8 1/1 Running 7 8d
spark-k8s-demo-1617950835735-driver 1/1 Running 0 10s
spark-k8s-demo-1617950835735-exec-1 1/1 Running 0 3s
spark-k8s-demo-1617950835735-exec-2 1/1 Running 0 3s
When the job has been completed:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hello-minikube-6ddfcc9757-vwkh8 1/1 Running 7 8d
spark-k8s-demo-1617950835735-driver 0/1 Completed 0 32s
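Instead of running kubectl get pods repeatedly, you can also watch the pods continuously:
$ kubectl get pods -w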
Inspect log files
First get a listing of the pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hello-minikube-6ddfcc9757-vwkh8 1/1 Running 6 7d20h
spark-k8s-demo-1617870359056-driver 0/1 Error 0 58m
spark-k8s-demo-1617870410297-driver 0/1 Error 0 57m
spark-k8s-demo-1617872396726-driver 0/1 Error 0 24m
spark-k8s-demo-1617872508802-driver 0/1 Error 0 22m
spark-k8s-demo-1617873367915-driver 0/1 Error 0 8m28s
spark-k8s-demo-1617873805665-driver 0/1 Completed 0 70s
Then using the name from that list you can inspect the log file:
$ kubectl logs spark-k8s-demo-1617873805665-driver
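While the driver is still running, you can also follow the log:
$ kubectl logs -f spark-k8s-demo-1617873805665-driver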
Output
When the job has succeeded, there should be output files in /tmp/data/output on the host:
$ LANGUAGE=en ls -l /tmp/data/output/
total 12
-rw-r--r-- 1 docker-admin docker-admin 46 Apr 8 11:28 part-00000-9e7f7fb5-5982-44cf-bbe7-3b7eaa9b2a9b-c000.json
-rw-r--r-- 1 docker-admin docker-admin 46 Apr 8 11:28 part-00001-9e7f7fb5-5982-44cf-bbe7-3b7eaa9b2a9b-c000.json
-rw-r--r-- 1 docker-admin docker-admin 46 Apr 8 11:28 part-00002-9e7f7fb5-5982-44cf-bbe7-3b7eaa9b2a9b-c000.json
-rw-r--r-- 1 docker-admin docker-admin 0 Apr 8 11:28 _SUCCESS
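The contents of the part files depend on the Dataset the job writes; for a quick look:
$ cat /tmp/data/output/part-*.json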
References
Tutorials
- https://blog.duyet.net/2020/05/spark-on-k8s.html
- https://bartek-blog.github.io/pyspak/python/spark/scala/minikube/2020/12/10/run-spark-in-minikube.html
- https://dzone.com/articles/quickstart-apache-spark-on-kubernetes
- https://jaceklaskowski.github.io/spark-kubernetes-book/demo/spark-and-local-filesystem-in-minikube/
Volume mounts
- https://spark.apache.org/docs/latest/running-on-kubernetes.html#volume-mounts
- https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes