Raw, clean, and derived data in data lakes based on HDFS

You may think that there is no need to structure data in HDFS and that you can organize it later. In my view, this is the wrong approach. We should always keep in mind that there is no free lunch, so it is better to make these decisions at the beginning.

In a data lake it is important to differentiate between raw data and clean data. We store raw data exactly as it arrives, but this data cannot be used for decision making. It has to be cleaned and marked as “analysis ready”. Of course we can use raw data for ad hoc analysis, but we should understand the risks. During cleaning we can also store the data in a format that is optimized for processing, for example the Avro binary format. Avro has very useful properties: it supports compression (less space on disk), it is splittable (one file can be processed by many processes in parallel), and it supports schema evolution (we can read old files with new code, for example after a new column has been added to a file).
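To make this concrete, here is a minimal PySpark sketch of such a raw-to-clean step. It assumes the external spark-avro package (org.apache.spark:spark-avro) is on the classpath, and the HDFS paths and column names are hypothetical, chosen only for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-to-clean").getOrCreate()

# Read raw transactions exactly as they were delivered (CSV in this sketch).
raw = spark.read.option("header", "true").csv("hdfs:///data/raw/transactions/")

# Basic cleaning: cast types, parse timestamps, and drop rows that cannot be
# parsed. Only after these checks do we treat the data as "analysis ready".
clean = (raw
         .withColumn("amount", F.col("amount").cast("double"))
         .withColumn("ts", F.to_timestamp("ts"))
         .dropna(subset=["amount", "ts", "customer_id"]))

# Store the cleaned data as Avro: compressed, splittable, and with a schema
# that can evolve over time.
clean.write.mode("overwrite").format("avro").save("hdfs:///data/clean/transactions/")
```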

The next step after cleaning is to derive something useful from our data. This can be aggregates that speed up data retrieval: for example, from individual transactions we can derive the turnover. Or we can run an optimization procedure and store the optimal solution as derived information. Deriving information from clean data takes time. It is certainly possible to compute it “on the fly”, but then the reaction time for end users will be slow.
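The derived layer can be sketched in the same way: read the clean Avro files, aggregate, and write the result into a derived folder so end users do not pay the aggregation cost on every query. Again, the paths and column names are assumptions made for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-to-derived").getOrCreate()

# Read the "analysis ready" data from the clean zone.
clean = spark.read.format("avro").load("hdfs:///data/clean/transactions/")

# Derive a daily turnover aggregate per customer from the transactions.
turnover = (clean
            .groupBy(F.to_date("ts").alias("day"), "customer_id")
            .agg(F.sum("amount").alias("turnover")))

# Persist the derived result so it can be retrieved quickly.
turnover.write.mode("overwrite").format("avro").save("hdfs:///data/derived/turnover_daily/")
```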

Hence, we can define the following base folders in our file system (a small path-convention sketch follows the list):

  • Raw Data
  • Clean Data
  • Derived Data
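One possible way to keep the convention consistent across jobs is a tiny helper that builds dataset paths inside the three base folders. The base path and folder names below are illustrative assumptions, not a fixed standard.

```python
# Illustrative path convention for the three zones of the data lake.
BASE = "hdfs:///data"

def zone_path(zone: str, dataset: str) -> str:
    """Build a dataset path inside one of the base folders: raw, clean, derived."""
    assert zone in {"raw", "clean", "derived"}, f"unknown zone: {zone}"
    return f"{BASE}/{zone}/{dataset}"

# Example: zone_path("clean", "transactions") -> "hdfs:///data/clean/transactions"
```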