{"id":735,"date":"2018-12-23T16:55:26","date_gmt":"2018-12-23T14:55:26","guid":{"rendered":"http:\/\/dekarlab.de\/wp\/?p=735"},"modified":"2020-05-23T15:33:28","modified_gmt":"2020-05-23T13:33:28","slug":"running-apache-zeppelin-in-k8s-cluster-and-integration-with-yarn-cluster","status":"publish","type":"post","link":"https:\/\/dekarlab.de\/wp\/?p=735","title":{"rendered":"Running Apache Zeppelin in K8s cluster and integration with YARN cluster"},"content":{"rendered":"\n<p>A useful way to implement a CI\/CD pipeline is to package code as a Docker image and run it in a K8s cluster. One very practical application for data analytics is the notebook-based tool Apache Zeppelin. Every business department requires its own Zeppelin configuration. Hence the idea of creating a Docker container for every department and running it in the K8s cluster.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>Apache Zeppelin uses Spark as its computational engine for big data, so you need to submit Spark jobs remotely to a YARN cluster. A detailed description of how to do this is <a href=\"https:\/\/dekarlab.de\/wp\/?p=613\">here<\/a>.<\/p>\n\n\n\n<p>Unfortunately, this approach cannot be applied as is. Starting Spark from Zeppelin in a Docker container in K8s and submitting remotely to YARN raises some issues that need to be solved.<\/p>\n\n\n\n<p>Let&#8217;s take a look at the physical deployment. Apache Zeppelin runs in K8s inside a pod with a virtual IP address (<strong>ip_k8s_pod_cni<\/strong>), which is assigned by the container network interface (CNI). The pod runs on a physical machine with its own IP (<strong>ip_k8s_host_lan<\/strong>). YARN is part of a cluster with its own IP range, which is part of the LAN. Let&#8217;s assume that one node in the cluster has the IP <strong>ip_yarn_node_lan<\/strong>.<\/p>\n\n\n\n<p>Assume that we start Spark in <strong>YARN<\/strong> mode with deploy mode <strong>cluster<\/strong>. In this case the Spark driver is started directly in the YARN cluster and has an IP in the YARN cluster (<strong>ip_yarn_node_lan<\/strong>). 
The problem with this setup is that the Spark driver has to talk to Apache Zeppelin. A callback connection between <strong>ip_k8s_pod_cni<\/strong> and <strong>ip_yarn_node_lan<\/strong> would have to be established, but there is no way to specify this in Spark, since the container network does not know about the LAN.<\/p>\n\n\n\n<p>Now let&#8217;s submit Spark with <strong>YARN<\/strong> and deploy mode <strong>client<\/strong>. In this case the Spark driver is started directly in the K8s pod. The Spark driver has to communicate with the Spark workers, which run in the YARN cluster. This looks similar to the previous case: we have to establish a connection between <strong>ip_k8s_pod_cni<\/strong> (IP of the Spark driver) and <strong>ip_yarn_node_lan<\/strong> (IP of a Spark worker). But here Spark provides two additional parameters: <strong>spark.driver.host<\/strong> and <strong>spark.driver.bindAddress<\/strong>. A worker connects to <strong>spark.driver.host<\/strong>, so here we put <strong>ip_k8s_host_lan<\/strong>. From there the connection is routed to <strong>ip_k8s_pod_cni<\/strong>, which we specify in <strong>spark.driver.bindAddress<\/strong>.<\/p>\n\n\n\n<p>One open question is the ports. We have to expose the ports of the Spark driver in Docker, they have to be available in K8s, and Spark has to know about them. Spark has a parameter <strong>spark.driver.port<\/strong> where we can set the port number, for example 18080. If this port is busy, Spark tries the next one, 1808<strong>1<\/strong>. So we can expose 5 ports in Docker and in the K8s service: 18080, 18081, 18082, 18083, 18084. This solves the port problem for the Spark driver.<\/p>\n\n\n\n<p>There is one drawback of this approach: we use <strong>ip_k8s_pod_cni<\/strong>, which changes every time the pod is restarted. We are currently investigating how to use the IP of the corresponding K8s service instead.<\/p>\n\n\n\n<p>One more approach to submitting Spark jobs remotely is to use Apache Livy. 
Apache Livy runs inside the YARN cluster and provides a REST API for submitting Spark jobs. The drawback of Livy is dependency handling: all jars must already be available in the YARN cluster.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A useful way to implement a CI\/CD pipeline is to package code as a Docker image and run it in a K8s cluster. One very practical application for data analytics is the notebook-based tool Apache Zeppelin. Every business department requires its own Zeppelin configuration. Hence the idea of creating a Docker container for every department and running it in the K8s cluster.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0},"categories":[25],"tags":[57,50,33],"_links":{"self":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts\/735"}],"collection":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=735"}],"version-history":[{"count":8,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts\/735\/revisions"}],"predecessor-version":[{"id":776,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts\/735\/revisions\/776"}],"wp:attachment":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=735"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&pos
t=735"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=735"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}