Hybrid cloud architecture for data lake applications

Big data technologies are mature nowadays. Typically you use HDFS, or another distributed file system such as S3, to store data, Spark as a processing engine, and YARN as a resource manager. The next steps you will probably want to take are implementing CI/CD (continuous integration and delivery) and moving workloads to the cloud on demand.
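As a sketch of such a setup, here is a minimal PySpark batch job that reads raw files from the distributed file system and writes an aggregate back; the bucket and paths are placeholders.

```python
from pyspark.sql import SparkSession

# Minimal Spark batch job over files in a distributed file system.
# On a YARN cluster you would launch it with `spark-submit --master yarn`.
spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

# Placeholder paths: raw JSON events in, aggregated Parquet out.
events = spark.read.json("s3a://my-bucket/raw/events/")
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("s3a://my-bucket/curated/daily_counts/")

spark.stop()
```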

Your first choice here is Docker. You can package your Spark jobs in Docker containers and submit them remotely to YARN. If you want to go a step further, you can run Spark on Kubernetes without using YARN at all. A more advanced step is to organize intelligent distribution of computational processes between the cloud and on-premise infrastructure. Kubernetes can help you here: its recently published Federation concept lets you establish Kubernetes clusters both in the cloud and on premise and intelligently distribute your workload between them on demand.
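For illustration, a remote submit of a containerized job to a Kubernetes cluster might look like the following sketch; the API server address, image name, and job path are placeholders, and the same pattern with `--master yarn` covers the YARN case.

```python
import subprocess

# Submit a containerized Spark job to Kubernetes instead of YARN.
# API server address, image name, and job path are placeholders.
subprocess.run([
    "spark-submit",
    "--master", "k8s://https://my-k8s-apiserver:6443",
    "--deploy-mode", "cluster",
    "--name", "etl-job",
    "--conf", "spark.kubernetes.container.image=registry.example.com/spark-etl:latest",
    "--conf", "spark.executor.instances=4",
    "local:///opt/spark/jobs/etl_job.py",  # path inside the container image
], check=True)
```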

Hybrid cloud architecture

Your computational processes will be implemented as microservices. They should be stateless, which allows them to be scaled up and down.
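Because a stateless service carries no local state, scaling it is just a matter of changing its replica count. A minimal sketch with the official Kubernetes Python client, assuming a Deployment named `etl-service` already exists:

```python
from kubernetes import client, config

# Scale a stateless microservice by patching its replica count.
# "etl-service" is a placeholder Deployment name.
config.load_kube_config()
client.AppsV1Api().patch_namespaced_deployment_scale(
    name="etl-service",
    namespace="default",
    body={"spec": {"replicas": 8}},
)
```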

Moreover, they should be loosely coupled; for this you need a message broker such as Kafka. Your components publish state changes to Kafka, and other components subscribe to them (the Pub/Sub pattern). State-change events can be encoded in the Avro format, which supports schema evolution with backward/forward compatibility. To reconstruct the chain of microservice interactions, you can use a so-called correlation id, which you pass between all microservices in the same chain.
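A minimal sketch of such an event publisher, assuming a kafka-python producer, fastavro encoding, and a hypothetical `state-changes` topic; note that the `status` field carries a default, which is what keeps old and new schema versions compatible when fields are added later:

```python
import io
import uuid
from fastavro import parse_schema, schemaless_writer
from kafka import KafkaProducer  # kafka-python

# Hypothetical event schema; fields added later (like "status") need a
# default so readers with the old schema stay compatible.
schema = parse_schema({
    "type": "record",
    "name": "StateChange",
    "fields": [
        {"name": "correlation_id", "type": "string"},
        {"name": "entity", "type": "string"},
        {"name": "status", "type": "string", "default": "UNKNOWN"},
    ],
})

producer = KafkaProducer(bootstrap_servers="kafka:9092")

def publish_state_change(entity: str, status: str, correlation_id: str | None = None) -> str:
    # Reuse the incoming correlation id, or start a new chain.
    correlation_id = correlation_id or str(uuid.uuid4())
    buf = io.BytesIO()
    schemaless_writer(buf, schema, {
        "correlation_id": correlation_id,
        "entity": entity,
        "status": status,
    })
    producer.send("state-changes", buf.getvalue())
    return correlation_id
```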

You also need self-healing of microservices, which means they should be restartable. Kubernetes can restart microservices automatically when they fail, for example based on a liveness probe.
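A common way to make a service restartable under Kubernetes is to expose a health endpoint that a livenessProbe polls; when the probe fails, the kubelet restarts the container. A minimal standard-library sketch (the `/healthz` path and port are conventions, not requirements):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer 200 on /healthz while the service is alive; if the
        # process hangs or exits, the liveness probe fails and
        # Kubernetes restarts the container.
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

HTTPServer(("", 8080), HealthHandler).serve_forever()
```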

For deployment you can use rolling deployments and restarts, so your application is never interrupted.
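With a Kubernetes Deployment, a rolling update is triggered simply by changing the pod template, for example the image tag; pods are then replaced gradually so the service keeps answering. A sketch with the Kubernetes Python client, again assuming a Deployment named `etl-service`:

```python
from kubernetes import client, config

# Patching the image of an existing Deployment triggers Kubernetes'
# default RollingUpdate strategy: pods are replaced one batch at a time.
config.load_kube_config()
client.AppsV1Api().patch_namespaced_deployment(
    name="etl-service",
    namespace="default",
    body={"spec": {"template": {"spec": {"containers": [
        {"name": "etl-service", "image": "registry.example.com/etl-service:v2"}
    ]}}}},
)
```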

To save costs in the cloud, you can also implement an infrastructure-as-code approach and scale your Kubernetes cluster up and down on demand. You can use Terraform scripts to start servers and Ansible scripts to manage and monitor the servers' software state.
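A minimal sketch of on-demand scaling driven from Python, assuming a hypothetical Terraform workspace under `infra/` that exposes a `node_count` variable for the cluster's node pool, plus placeholder Ansible inventory and playbook names:

```python
import subprocess

def scale_cluster(node_count: int) -> None:
    # Hypothetical Terraform workspace with a "node_count" variable
    # controlling the size of the Kubernetes node pool.
    subprocess.run(
        ["terraform", "apply", "-auto-approve", f"-var=node_count={node_count}"],
        cwd="infra/",
        check=True,
    )
    # Converge the software state of the (new) servers with Ansible;
    # inventory and playbook names are placeholders.
    subprocess.run(
        ["ansible-playbook", "-i", "inventory/cloud.ini", "site.yml"],
        cwd="infra/",
        check=True,
    )

scale_cluster(10)  # scale up before a batch run, back down afterwards
```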

You can easily query data in the distributed file system using SQL engines that execute SQL statements directly on files. Hive, Presto, Impala, or Athena can help you here.
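For example, Spark SQL can run SQL directly over Parquet files without loading them into a database first (bucket and column names are placeholders); Hive, Presto, Impala, and Athena offer the same file-level querying through their own interfaces.

```python
from pyspark.sql import SparkSession

# Run SQL directly over files in the distributed file system.
spark = SparkSession.builder.appName("adhoc-sql").getOrCreate()
spark.read.parquet("s3a://my-bucket/curated/events/").createOrReplaceTempView("events")

spark.sql("""
    SELECT event_type, count(*) AS cnt
    FROM events
    GROUP BY event_type
    ORDER BY cnt DESC
""").show()
```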

Fast access to data is inherently expensive, so it often makes sense to separate hot data from cold data. Hot data, i.e. the most frequently used (usually recent) data, can be served from fast key/value stores such as HBase or in-memory stores such as Redis. Cold data (typically historical data) can be stored on the cheapest distributed file system available, such as HDFS or S3. Combining the two can be done at the microservice level, as recommended by the Lambda architecture.
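A minimal sketch of combining both layers inside a microservice, assuming Redis for the hot layer and a hypothetical `load_from_cold_storage` helper standing in for a batch-layer query over HDFS/S3:

```python
import json
import redis

r = redis.Redis(host="redis", port=6379)

def load_from_cold_storage(user_id: str) -> dict | None:
    # Hypothetical stand-in for the cold path, e.g. a Spark or Presto
    # query over historical data in HDFS/S3.
    return None

def get_user_profile(user_id: str) -> dict | None:
    key = f"user:{user_id}"
    cached = r.get(key)  # hot path: recent data in Redis
    if cached is not None:
        return json.loads(cached)
    profile = load_from_cold_storage(user_id)  # cold path
    if profile is not None:
        r.setex(key, 3600, json.dumps(profile))  # keep it hot for one hour
    return profile
```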