Why it is a bad idea to stream data back from HDFS into Kafka

I think the idea of streaming data back from HDFS into a streaming component such as Kafka comes from the concept of the Enterprise Service Bus (ESB). After some thought, I have come to the conclusion that this concept is not useful in the Big Data world.

Kafka is clearly a good fit if you want to process data online. That means you start processing data the moment it is created, without waiting until it has landed in the data lake. This is very helpful for use cases where the speed of analysis matters, for example fraud detection during online payments or monitoring of production systems.

But once data has landed in the data lake, you do not need to stream it again on the grounds that it would then be available to other systems through a publish/subscribe model and that this would decouple the components. Why will this not work? Because we have big data. Reading the data from disk, putting it into Kafka, distributing it to the different components, reading it again in each component, processing it there and saving the result takes much more time than simply processing the data in batch, in parallel, using for example Spark, and saving the result back to HDFS.
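To make the argument above concrete, here is a rough back-of-envelope sketch that only counts how many bytes have to be moved in each approach. The function names, the 10 TB data set, the three consuming components and the replication factor of 3 are all assumptions chosen for illustration. The model ignores network topology, compression and the size of the results, so treat it as a way to reason about the extra hops, not as a benchmark.

```python
TB = 10**12  # one terabyte in bytes (decimal), used only for readability


def replay_via_kafka_bytes(dataset_bytes, consumers, replication_factor=3):
    """Bytes moved when the data set is re-streamed from HDFS through Kafka."""
    read_from_hdfs = dataset_bytes                       # read the data once from disk
    write_to_kafka = dataset_bytes * replication_factor  # leader copy plus replica copies
    read_by_consumers = dataset_bytes * consumers        # every component reads the full stream
    return read_from_hdfs + write_to_kafka + read_by_consumers


def batch_in_place_bytes(dataset_bytes, consumers):
    """Bytes moved when every component runs a batch job directly on HDFS."""
    return dataset_bytes * consumers                     # each job reads the data once


# Illustrative numbers only: 10 TB in the lake, 3 consuming components.
replay = replay_via_kafka_bytes(10 * TB, consumers=3)
batch = batch_in_place_bytes(10 * TB, consumers=3)
extra = replay - batch  # bytes moved only because of the detour through Kafka
```

Under these assumptions the detour through Kafka moves 70 TB instead of 30 TB, and the extra 40 TB buys nothing that the batch jobs did not already have.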

There is one exception, though. It is a good idea to publish derived data to Kafka. If you have run a large analytical calculation in batch mode and want its result to be available for processing in other components, then it makes sense to publish that result to Kafka.
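A minimal sketch of this exception could look as follows. It assumes the batch job has already reduced the raw data to a small list of result rows. The topic name `daily-aggregates`, the `customer_id` field and the `send(topic, key, value)` callable are hypothetical; in a real setup `send` would be a thin wrapper around whichever Kafka producer client you use. Here a plain list stands in for the producer so that the example runs without a broker.

```python
import json


def publish_derived(rows, send, topic="daily-aggregates"):
    """Publish each derived row as one small JSON message and return the count.

    `send` is a hypothetical callable taking (topic, key_bytes, value_bytes).
    """
    count = 0
    for row in rows:
        key = str(row["customer_id"]).encode("utf-8")            # key by entity
        value = json.dumps(row, sort_keys=True).encode("utf-8")  # stable JSON payload
        send(topic, key, value)
        count += 1
    return count


# Stand-in for a real producer: collect the messages in memory.
sent = []
rows = [
    {"customer_id": 1, "total": 120.5},  # assumed output of the batch calculation
    {"customer_id": 2, "total": 80.0},
]
published = publish_derived(rows, lambda t, k, v: sent.append((t, k, v)))
```

The point of the design is that only the small, already aggregated result travels through Kafka, while the raw data stays in HDFS.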
