{"id":823,"date":"2020-05-01T21:17:36","date_gmt":"2020-05-01T19:17:36","guid":{"rendered":"http:\/\/dekarlab.de\/wp\/?p=823"},"modified":"2020-05-23T15:32:32","modified_gmt":"2020-05-23T13:32:32","slug":"apache-zeppelin-behind-spark-interpreter","status":"publish","type":"post","link":"https:\/\/dekarlab.de\/wp\/?p=823","title":{"rendered":"Apache Zeppelin: behind spark interpreter"},"content":{"rendered":"\n<p>Here is an overview, what is hidden behind spark interpreter in Apache Zeppelin.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>Source code for Apache Zeppelin is available here: <a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/apache\/zeppelin\" target=\"_blank\">source code.<\/a><a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/apache\/zeppelin\" target=\"_blank\"><\/a><\/p>\n\n\n\n<p>First of all we should look inside of Interpreter launcher: <a href=\"https:\/\/github.com\/apache\/zeppelin\/tree\/master\/zeppelin-zengine\/src\/main\/java\/org\/apache\/zeppelin\/interpreter\/launcher\">source of launchers<\/a>. For spark there is a special launcher <a href=\"https:\/\/github.com\/apache\/zeppelin\/blob\/master\/zeppelin-zengine\/src\/main\/java\/org\/apache\/zeppelin\/interpreter\/launcher\/SparkInterpreterLauncher.java\">SparkInterpreterLauncher.java<\/a>, which extends standard launcher: <a href=\"https:\/\/github.com\/apache\/zeppelin\/blob\/master\/zeppelin-zengine\/src\/main\/java\/org\/apache\/zeppelin\/interpreter\/launcher\/StandardInterpreterLauncher.java\">StandardInterpreterLauncher.java<\/a>. Here we can see configuration for spark command line, like &#8211;conf &lt;key&gt;=&lt;value&gt; and usage of proxy users, plus, check of running mode, like yarn client or cluster.<\/p>\n\n\n\n<p>Depending on selected isolation options in interpreter, spark is started as RemoteInterpreterRunningProcess or RemoteInterpreterManagedProcess. This stays in launch function in StandardInterpreterLauncher. Important is that spark starts as a new process. 
For example, if you use the Spark option --proxy-user, Zeppelin will create a separate Spark process per user. Communication between the current process and the interpreter process is done via RPC in the case of RemoteInterpreterManagedProcess.<\/p>\n\n\n\n<p>Behind the scenes, <a href=\"https:\/\/github.com\/apache\/zeppelin\/blob\/master\/bin\/interpreter.sh\">interpreter.sh<\/a> is started by RemoteInterpreterManagedProcess. Inside the interpreter.sh script, spark-submit from the active Spark distribution is called with <a href=\"https:\/\/github.com\/apache\/zeppelin\/blob\/master\/zeppelin-interpreter\/src\/main\/java\/org\/apache\/zeppelin\/interpreter\/remote\/RemoteInterpreterServer.java\">RemoteInterpreterServer<\/a> as the main class. Apache Thrift RPC is used to communicate with RemoteInterpreterServer from the Zeppelin UI.<\/p>\n\n\n\n<p>This is the trick: Spark is started per spark-submit in a separate JVM, together with a Thrift RPC server that enables communication with the Spark process.<\/p>\n\n\n\n<p>The Spark interpreter (spark-interpreter*.jar) is on the classpath, so RemoteInterpreterServer can then initialize Spark inside its own process.<\/p>\n\n\n\n<p>Next we need to look at what happens inside the Spark interpreter, which is a classical Spark application with a SparkContext for Spark 1 and a SparkSession for Spark 2.<\/p>\n\n\n\n<p>We start with <a href=\"https:\/\/github.com\/apache\/zeppelin\/blob\/master\/spark\/interpreter\/src\/main\/java\/org\/apache\/zeppelin\/spark\/SparkInterpreter.java\">SparkInterpreter<\/a>, which is the driver program for Spark. You can see that the Spark context or Spark session is initialized using reflection. This appears to be done to support both Spark versions and to resolve the needed version at run time, when it becomes clear which version of Spark to start. 
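The reflection technique behind this version resolution can be shown with a small self-contained sketch. It loads a class by name at run time and invokes a method on it without any compile-time dependency; a JDK class (StringBuilder) stands in for the real targets (org.apache.spark.SparkContext or org.apache.spark.sql.SparkSession, which would only be on the classpath inside the interpreter process), and the helper name createByName is an assumption for the example:

```java
import java.lang.reflect.Method;

// Sketch of run-time class resolution via reflection, the technique
// SparkInterpreter uses to support both Spark 1 and Spark 2 contexts.
public class ReflectiveInit {
    // Resolve a class by name at run time and call its (String) constructor.
    static Object createByName(String className, String arg) throws Exception {
        Class<?> cls = Class.forName(className); // no compile-time dependency
        return cls.getConstructor(String.class).newInstance(arg);
    }

    public static void main(String[] args) throws Exception {
        // Pretend a version check decided which class name to instantiate:
        Object sb = createByName("java.lang.StringBuilder", "spark");
        // Methods are likewise looked up and invoked reflectively.
        Method append = sb.getClass().getMethod("append", String.class);
        append.invoke(sb, "-session");
        System.out.println(sb);
    }
}
```

In Zeppelin the class name would be chosen from the detected Spark version instead of being hard-coded, but the mechanism is the same.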
Zeppelin uses different Scala classes depending on the Scala version:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>SparkScala210Interpreter.scala\nSparkScala211Interpreter.scala\nSparkScala212Interpreter.scala<\/code><\/pre>\n\n\n\n<p>These classes extend <a href=\"https:\/\/github.com\/apache\/zeppelin\/blob\/master\/spark\/spark-scala-parent\/src\/main\/scala\/org\/apache\/zeppelin\/spark\/BaseSparkScalaInterpreter.scala\">BaseSparkScalaInterpreter<\/a>, where you can also see the binding of variables and the imports that run during %spark paragraph execution in Zeppelin:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: java; title: ; notranslate\" title=\"\">\n  bind(&quot;sc&quot;, &quot;org.apache.spark.SparkContext&quot;, sc, List(&quot;&quot;&quot;@transient&quot;&quot;&quot;))\n  bind(&quot;sqlContext&quot;, sqlContext.getClass.getCanonicalName, sqlContext, List(&quot;&quot;&quot;@transient&quot;&quot;&quot;))\n\n  scalaInterpret(&quot;import org.apache.spark.SparkContext._&quot;)\n  scalaInterpret(&quot;import sqlContext.implicits._&quot;)\n  scalaInterpret(&quot;import sqlContext.sql&quot;)\n  scalaInterpret(&quot;import org.apache.spark.sql.functions._&quot;)\n<\/pre><\/div>\n\n\n<p>But what are bind() and scalaInterpret()? Here we need to look into the SparkILoop class, which is responsible for the interactive mode of the Spark driver; otherwise it would not be possible to show results in Zeppelin after Spark finishes execution. SparkILoop is part of the Spark distribution. 
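The bind()/interpret() pattern can be imitated with a toy interpreter to make the flow concrete: bind() publishes a named value to the session, and interpret() evaluates a snippet that may reference the bound names. This is a deliberately minimal sketch of the pattern, not the SparkILoop API; the ToyRepl class and its "print" mini-language are invented for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the bind()/interpret() pattern used by SparkILoop.
public class ToyRepl {
    private final Map<String, Object> bindings = new HashMap<>();

    // Like bind("sc", "org.apache.spark.SparkContext", sc, ...):
    // make a named value visible to subsequently interpreted code.
    void bind(String name, Object value) {
        bindings.put(name, value);
    }

    // Like interpret(code): evaluate a snippet against the bindings.
    // This toy only understands "print <name>".
    String interpret(String code) {
        if (code.startsWith("print ")) {
            Object v = bindings.get(code.substring(6).trim());
            return v == null ? "error: not bound" : v.toString();
        }
        return "error: unsupported";
    }

    public static void main(String[] args) {
        ToyRepl repl = new ToyRepl();
        repl.bind("sc", "SparkContext@1234");
        System.out.println(repl.interpret("print sc"));
    }
}
```

SparkILoop does the same thing for real Scala code: once "sc" and "sqlContext" are bound, every later paragraph can refer to them, and the interpreted result is what Zeppelin displays.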
<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: java; title: ; notranslate\" title=\"\">\n...\nsparkILoop = new SparkILoop(None, replOut)\nsparkILoop.settings = settings\nsparkILoop.createInterpreter()\nval in0 = getDeclareField(sparkILoop, &quot;in0&quot;).asInstanceOf&#91;Option&#91;BufferedReader]]\nval reader = in0.fold(sparkILoop.chooseReader(settings))(r =&gt; SimpleReader(r, replOut, interactive = true))\n\nsparkILoop.in = reader\nsparkILoop.initializeSynchronous()\nsparkILoop.in.postInit()\n...\nsparkILoop.bind(name, tpe, value, modifier) \/\/ bind(...) in the code above\n...\nsparkILoop.interpret(code) \/\/ scalaInterpret() in the code above\n...\nsparkILoop.closeInterpreter()\n<\/pre><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Here is an overview of what is hidden behind the Spark interpreter in Apache Zeppelin.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0},"categories":[25],"tags":[57,28,33],"_links":{"self":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts\/823"}],"collection":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=823"}],"version-history":[{"count":10,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts\/823\/revisions"}],"predecessor-version":[{"id":835,"href":"https:\/\/dekar
lab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts\/823\/revisions\/835"}],"wp:attachment":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=823"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=823"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=823"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}