Remote submit is a powerful feature of Apache Spark. Why is it needed? For example, you can experiment with different versions of Spark, independent of what is installed on the cluster. Or, if you have no direct access to the cluster, you can still start your Spark jobs remotely.
How does it work?
First, you need the client configuration files from your cluster; copy them to the remote machine:
- core-site.xml
- hadoop-env.sh
- hdfs-site.xml
- log4j.properties
- mapred-site.xml
In Cloudera you can download them by following the steps in Client Configuration Files.
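Alternatively, if you have SSH access to a cluster node, a minimal sketch of copying the files looks like this (the host name and paths here are hypothetical):

mkdir -p /opt/cluster-conf
scp user@edge-node:/etc/hadoop/conf/{core-site.xml,hadoop-env.sh,hdfs-site.xml,log4j.properties,mapred-site.xml} /opt/cluster-conf/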
Second, you should download a Spark distribution from Apache, or take it from your cluster, and put it on the remote machine.
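For instance, a prebuilt distribution can be fetched from the Apache archive; the version here is just an example and should match your cluster:

wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
tar xzf spark-1.6.0-bin-hadoop2.6.tgz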
Third, before running spark-submit on the remote machine, specify the location of the configuration files:
export HADOOP_CONF_DIR=<folder with conf files>
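For example, if you placed the files in /opt/cluster-conf as in the sketch above:

export HADOOP_CONF_DIR=/opt/cluster-conf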
Fourth, you should prepare a jar with your application.
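A sketch of producing such a jar, assuming an sbt-based project (the build tool and output path are only an example):

sbt package
# produces something like target/scala-2.10/myapp_2.10-1.0.jar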
Hence, on the remote machine you have:
- the client configuration files from the cluster
- a Spark distribution
- a jar with your application
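Putting it together, a spark-submit invocation from the remote machine might look like this; the class name, paths, and Spark version are hypothetical, while --master, --deploy-mode, and --class are standard spark-submit flags:

export HADOOP_CONF_DIR=/opt/cluster-conf
./spark-1.6.0-bin-hadoop2.6/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  app.jar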
After starting spark-submit on the remote machine, you will see in the logs that Spark packs all the needed files, uploads them to the cluster, and after that starts execution:
....
INFO Client: Uploading resource file:/spark-assembly.jar -> hdfs://spark-assembly.jar
INFO Client: Uploading resource file:/app.jar -> hdfs://app.jar
...