When implementing your components as microservices, you may want to use Apache Spark for data retrieval. In this post I will describe several ways to do this.
The classical way is to use Apache Livy and submit Spark jobs via the Livy REST API.
This approach has several limitations:
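To make the Livy route concrete, here is a minimal sketch of building the JSON body for Livy's batch submission endpoint (`POST /batches`). The jar path, class name, and arguments are placeholders, not values from a real deployment.

```python
import json

def livy_batch_payload(jar, main_class, args=None, conf=None):
    """Build a request body for Livy's POST /batches endpoint."""
    payload = {"file": jar, "className": main_class}
    if args:
        payload["args"] = args
    if conf:
        payload["conf"] = conf
    return json.dumps(payload)

body = livy_batch_payload(
    "hdfs:///jobs/retrieval-job.jar",     # placeholder jar location
    "com.example.RetrievalJob",           # placeholder main class
    args=["--date", "2020-01-01"],
    conf={"spark.executor.memory": "2g"},
)
# The body would then be POSTed to http://<livy-host>:8998/batches
# with Content-Type: application/json.
```

Note that the Spark version executing this job is whatever the Livy server provides, which is exactly the coupling discussed below.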
- The Spark version is not bundled with your application but provided by the Livy server. This increases the impact when migrating multiple microservices with Spark applications to a new Spark version.
- There can also be dependency conflicts with the libraries available on the Livy server.
- You have less flexibility to patch Spark code for your application.
Another way is to use SparkLauncher, following the approach implemented in Apache Zeppelin.
With SparkLauncher you can submit Spark in a separate process. But here we face two problems. First, Spark is not submitted in interactive mode, so there is no way to request data on an incoming API call to your microservice. Second, how do we retrieve data from the Spark server process into the client microservice process?
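SparkLauncher is a JVM API, so as a stand-in the following Python sketch uses `subprocess` to show the same idea: the parent (microservice) process spawns a long-lived child process. In a real setup the command would be `spark-submit` with your application jar; here a trivial child process stands in for it.

```python
import subprocess
import sys

# Placeholder command: a real launcher would run spark-submit with the
# application jar, master URL, and deploy mode instead of this stub.
cmd = [sys.executable, "-c", "print('spark process started')"]

proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
out, _ = proc.communicate(timeout=30)
```

The key point is the process boundary: once the driver lives in its own JVM, the microservice needs an interactive session and an RPC channel, which the next paragraphs address.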
The first problem can be solved by using the SparkILoop class from the Spark distribution inside your Spark application. The second can be solved with Apache Thrift RPC: you can implement a client/server pair for communication between the microservice process and the Spark process.
As a result, Spark is started per spark-submit in a separate JVM together with a Thrift RPC server, which enables communication with the Spark process from the microservice.
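In a real implementation Apache Thrift generates the client and server stubs from an IDL file; the stdlib-only sketch below only shows the shape of the interaction under that architecture. The "Spark" side exposes a data-fetching service next to the driver, and the microservice side connects as a client. The `fetch_data` method, the JSON wire format, and the stub result are all illustrative assumptions, not Thrift itself.

```python
import json
import socket
import threading

def serve_once(server_sock):
    """Server side (would live in the Spark JVM next to the driver)."""
    conn, _ = server_sock.accept()
    with conn:
        request = json.loads(conn.recv(4096).decode())
        # A real server would run the query via the interactive Spark
        # session and serialize the result; here we return a stub row.
        reply = {"rows": [["id-1", 42]], "query": request["query"]}
        conn.sendall(json.dumps(reply).encode())

def fetch_data(port, query):
    """Client side (would live in the microservice process)."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(json.dumps({"query": query}).encode())
        return json.loads(sock.recv(65536).decode())

server = socket.socket()
server.bind(("127.0.0.1", 0))   # pick a free port
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,), daemon=True).start()

result = fetch_data(port, "SELECT * FROM events")
```

Thrift replaces the hand-rolled JSON framing with a typed, versioned protocol, which is why it is the better fit for production use between the two processes.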
As an additional benefit, this approach allows the microservice user to act as a proxy user when starting the Spark process. Authorization is then handled by means of the Hadoop environment, so you do not need to take care of it in your microservice.