DATA PROCESSING – Page 2

Big data technologies are by now very mature. Typically you store data in HDFS or another distributed storage system such as S3, use Spark as the processing engine, and YARN as the resource manager. The next steps you will probably want to take are implementing CI/CD (continuous integration and delivery) and moving workloads to the cloud on demand.

Remote submit of spark jobs

1. May 2018 karden DATA PROCESSING

Remote submit is a powerful feature of Apache Spark. Why is it needed? For example, you can experiment with different versions of Spark, independent of the version installed on the cluster. Or, if you have no direct access to the cluster, you can start your Spark jobs remotely.
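
As a rough illustration of the idea, the sketch below assembles a spark-submit call for a YARN cluster from a machine outside it. It assumes a local Spark distribution and a local copy of the cluster's Hadoop/YARN client configuration; the helper names, paths, class name, and jar name are hypothetical and not taken from the post.

```python
import os


def build_spark_submit(spark_home, app_jar, main_class,
                       deploy_mode="cluster", conf=None):
    """Return the argument list for a spark-submit call that targets YARN.

    spark_home points at a locally unpacked Spark distribution, which may be
    a different Spark version than the one installed on the cluster.
    """
    cmd = [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
    ]
    # Extra Spark settings are passed as repeated --conf key=value pairs.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    return cmd


def submit_env(hadoop_conf_dir):
    """Copy the current environment and point HADOOP_CONF_DIR at the local
    copy of the cluster's client configuration (core-site.xml, yarn-site.xml),
    which spark-submit uses to locate the ResourceManager."""
    env = dict(os.environ)
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    return env
```

To actually launch a job, the two results could be handed to `subprocess.run(cmd, env=env, check=True)`, for example with `spark_home="/opt/spark"` and `hadoop_conf_dir="/etc/hadoop/conf"` (both placeholder paths).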

Tuning spark parameters

27. December 2017 karden DATA PROCESSING

Tuning Spark parameters is not a trivial task. In this short post I explain how to tune some of the most important ones.

Book notes – Building Microservices: Designing Fine-Grained Systems

23. October 2017 karden BOOKS, DATA PROCESSING

You can buy this book from amazon.de.

Book notes – Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

23. October 2017 karden BOOKS, DATA PROCESSING

You can buy this book from amazon.de.

Why it is a bad idea to stream data back from HDFS into Kafka

2. October 2017 karden DATA PROCESSING

I think the idea of streaming data back from HDFS into a streaming component such as Kafka comes from the concept of the Enterprise Service Bus (ESB). After some thought, however, I have come to the conclusion that this concept is not useful in the Big Data world.