A useful way to implement a CI/CD pipeline is to package code as a Docker image and run it in a Kubernetes (k8s) cluster. One very practical application for data analytics is the notebook-based tool Apache Zeppelin. Every business department requires its own Zeppelin configuration, so the idea is to create a Docker container for each department and run them all in a k8s cluster.
Apache Zeppelin uses Spark as its computational engine for big data, so you need to submit Spark jobs remotely to a YARN cluster. A detailed description of how to do this is here.
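In its simplest form, a remote submission looks roughly like this (the class and jar names are placeholders, and the Hadoop config path depends on your cluster):

```shell
# The container needs a copy of the cluster's Hadoop config files
# (core-site.xml, yarn-site.xml) so spark-submit can find YARN.
export HADOOP_CONF_DIR=/etc/hadoop/conf

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.example.MyJob \
  my-job.jar
```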
Unfortunately, this approach cannot be applied directly. Starting Spark from a Docker container in k8s, driven by Zeppelin, and submitting remotely to YARN has some issues to solve.
Let’s start with an overview of the physical deployment. Apache Zeppelin runs in a k8s pod with a virtual IP address (ip_k8s_pod_cni) assigned by the container network interface. The pod runs on a physical machine with its own IP (ip_k8s_host_lan). YARN is part of a cluster with its own IP range, which belongs to the LAN. Let’s assume one node in that cluster has the address ip_yarn_node_lan.
Assume we start Spark on YARN with deploy mode cluster. In this case the Spark driver is started directly in the YARN cluster and gets an IP there (ip_yarn_node_lan). The problem with this setup is that the Spark driver must talk back to Apache Zeppelin: a callback connection between ip_k8s_pod_cni and ip_yarn_node_lan would have to be established, but there is no way to configure this in Spark, since the container network knows nothing about the LAN.
Now let’s submit Spark on YARN with deploy mode client. In this case the Spark driver is started directly in the k8s pod and must connect to the Spark executors running in the YARN cluster. This looks similar to the previous case: we have to establish a connection between ip_k8s_pod_cni (the IP of the Spark driver) and ip_yarn_node_lan (the IP of a Spark executor). But here Spark provides two additional parameters: spark.driver.host and spark.driver.bindAddress. Executors connect to spark.driver.host, where we put ip_k8s_host_lan; from there the connection is routed to ip_k8s_pod_cni, which we specify in spark.driver.bindAddress.
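Putting the two parameters together, the client-mode submission from inside the pod looks roughly like this (the IP variables stand for the addresses described above; class and jar names are placeholders):

```shell
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.bindAddress=${IP_K8S_POD_CNI} \  # address the driver listens on inside the pod
  --conf spark.driver.host=${IP_K8S_HOST_LAN} \        # address the executors connect back to
  --class org.example.MyJob \
  my-job.jar
```

The key point is the split: bindAddress is what the driver binds locally, host is what it advertises to the executors.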
One open question remains: ports. We have to expose the Spark driver’s ports in Docker, make them reachable through the k8s service, and tell Spark about them. Spark has a parameter, spark.driver.port, where we can put a port number, for example 18080. If this port is busy, Spark tries the next one, 18081, and so on. So we can expose five ports in Docker and in the k8s service: 18080, 18081, 18082, 18083, 18084. This solves the port problem for the Spark driver.
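A sketch of this port setup, assuming the five-port range above (the k8s Service definition would map the same range to the pod):

```shell
# Expose the five candidate driver ports on the container
# (image name is a placeholder).
docker run -p 18080-18084:18080-18084 my-zeppelin-image

# On the Spark side, pin the starting port and cap the retries so the
# driver never leaves the exposed range:
#   spark.driver.port      18080
#   spark.port.maxRetries  4        # tries 18080..18084, then fails
```

Capping spark.port.maxRetries matters: by default Spark retries well beyond five ports, and would happily bind a port the service does not expose.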
There is one drawback to this approach: we use ip_k8s_pod_cni, which changes every time the pod is restarted. We are currently investigating how to use the IP of the corresponding k8s service instead.
One more approach to submitting Spark jobs remotely is Apache Livy. Livy runs inside the YARN cluster and provides a REST API for submitting Spark jobs. The drawback of Livy is dependency handling: all jars must already be available in the YARN cluster.
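A Livy batch submission then looks something like this (host, port, and paths are placeholders; note the jar is referenced from HDFS, i.e. it must already sit inside the cluster):

```shell
curl -X POST http://livy-host:8998/batches \
  -H 'Content-Type: application/json' \
  -d '{
        "file": "hdfs:///jobs/my-job.jar",
        "className": "org.example.MyJob"
      }'
```

Livy responds with a batch id that can be polled via GET /batches/{id} to track the job state.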