Category: DATA PROCESSING
Here we discuss solutions for data processing and distributed computing, such as HDFS, Spark, and Kubernetes.
Two infrastructure layers for distributed systems
It looks like the separation between the two infrastructure layers is increasing.
Why contracts are important in data-intensive applications with microservices
The main purpose of using a microservices architecture is to increase development velocity and reduce system complexity.
Book notes – Kubernetes: Up and Running: Dive into the Future of Infrastructure
Hybrid cloud architecture for data lake applications
Big data technologies are now very mature. Typically you use HDFS, or another distributed storage system such as S3, for storing data, Spark as the processing engine, and YARN as the resource manager. The next steps you will probably want to take are implementing CI/CD (continuous integration and delivery) and moving workloads to the cloud on demand.
Remote submit of Spark jobs
Remote submit is a powerful feature of Apache Spark. Why is it needed? For example, you can experiment with different versions of Spark, independent of what is installed in the cluster. Or, if you have no direct access to the cluster, you can start your Spark jobs remotely.
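As a rough illustration of the idea, here is a minimal Python sketch that assembles a `spark-submit` invocation from a machine outside the cluster. It assumes a YARN cluster, a locally unpacked Spark distribution of whatever version you want to try, and a local copy of the cluster's Hadoop configuration files. The helper name `build_remote_submit` and all paths are hypothetical and not taken from the post itself.

```python
import os
import shlex


def build_remote_submit(spark_home, hadoop_conf_dir, app, app_args=(), conf=None):
    """Return (argv, env) for running spark-submit from outside the cluster.

    spark_home      -- local Spark distribution to use (hypothetical path);
                       its version may differ from the one on the cluster
    hadoop_conf_dir -- local copy of the cluster's Hadoop/YARN client configs
    app             -- application jar or Python file to submit
    """
    env = dict(os.environ)
    env["SPARK_HOME"] = spark_home
    # spark-submit locates the YARN ResourceManager through these config files.
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir

    argv = [
        f"{spark_home}/bin/spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",  # the driver runs inside the cluster
    ]
    # Extra Spark properties are passed as repeated --conf key=value pairs.
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app)
    argv.extend(app_args)
    return argv, env


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    argv, env = build_remote_submit(
        "/opt/spark-local",
        "/etc/hadoop/conf",
        "job.py",
        app_args=["--date", "2024-01-01"],
        conf={"spark.executor.memory": "4g"},
    )
    print(shlex.join(argv))
    # To actually submit: subprocess.run(argv, env=env, check=True)
```

Keeping the command construction separate from execution makes it easy to inspect the exact command line before anything is sent to the cluster.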
Tuning Spark parameters
Tuning Spark parameters is not a trivial task. In this short post I will explain how to tune some of the important ones.