Follow the source: using the CQRS pattern in a data lake

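Microservices architecture has a pattern called Command and Query Responsibility Segregation (CQRS), which separates the model used to write data (commands) from the models used to read it (queries). This pattern helps in designing a multi-purpose data lake.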

We can think of a data lake as data storage with analytical functionality. Different systems (sources) push data into the data lake for further analysis. The approach used before was a data warehouse with ETL: we first extract data from the source system, then transform it, and finally load it into a dimensional model, which typically uses a star or snowflake schema in a relational database.

Hadoop and Hadoop-like technologies have radically changed this. We no longer need to define a target model up front; we can take the model as given by the source system and transform and prepare the data for analytical purposes only when it is needed. ETL becomes ELT. This gives us the main principle: follow the source and save all data one to one in the data lake. A good practice is to convert source data into a Hadoop-compatible format such as Avro, which stores the schema definition in each file.
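To make the "follow the source" principle concrete, here is a minimal sketch of landing records one to one together with their schema. It is not Avro: it uses JSON lines from the Python standard library as a stand-in, and the function names land_raw and read_raw, the file layout, and the sample schema are assumptions made only for illustration.

```python
import json


def land_raw(records, schema, path):
    """Write source records unchanged, with the source schema as the first line.

    Hypothetical stand-in for an Avro file: the schema travels with the data.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"schema": schema}) + "\n")
        for record in records:
            # No transformation here: records are stored exactly as received.
            f.write(json.dumps(record) + "\n")


def read_raw(path):
    """Read back the schema and the untouched records."""
    with open(path, encoding="utf-8") as f:
        header = json.loads(f.readline())
        records = [json.loads(line) for line in f]
    return header["schema"], records
```

Because the schema is stored next to the data, a later transformation can interpret the file without asking the source system again.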

Typically, data arrives either as a stream or in batches. Streaming data can be landed first in Kafka and then offloaded to HDFS at the end of the day. Batch data can be stored directly in HDFS. In both cases the schema is dictated by the source system.
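The end-of-day offload can be pictured as grouping the day's events into one partition per calendar day. The sketch below uses a plain Python list in place of a Kafka topic and returns a dictionary in place of writing to HDFS; the function name partition_by_day, the "ts" field (epoch seconds), and the raw/orders/dt=... path layout are assumptions, not something the pipeline described here prescribes.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_by_day(events):
    """Group events by the UTC calendar day of their 'ts' field (epoch seconds).

    Returns a dict mapping a hypothetical partition path to the events
    that would be offloaded into it, in arrival order and unmodified.
    """
    partitions = defaultdict(list)
    for event in events:
        day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
        partitions[f"raw/orders/dt={day}"].append(event)
    return dict(partitions)
```

The events themselves are not changed; only their physical placement is decided at offload time.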

As you can see, this describes the command part of the CQRS pattern.

What about the query part? Here we have the freedom to experiment. The data is already in the data lake, so we can concentrate on optimizing one query data model, or several different ones, for analytical applications. If we make an error in defining a query data model, we can easily correct it by applying the corrected transformation procedure to the source (raw) data.
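A small sketch of why such a correction is cheap: the query model is just a function of the raw data, so a faulty transformation can be replaced and rerun. The sample records, the field names, and the functions revenue_v1 and revenue_v2 are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict

# Raw data as landed from the source; it is never modified.
RAW_ORDERS = [
    {"order_id": 1, "customer": "a", "amount_cents": 1250, "status": "paid"},
    {"order_id": 2, "customer": "a", "amount_cents": 700, "status": "cancelled"},
    {"order_id": 3, "customer": "b", "amount_cents": 300, "status": "paid"},
]


def revenue_v1(raw):
    """First attempt at a query model: mistakenly counts cancelled orders."""
    totals = defaultdict(int)
    for order in raw:
        totals[order["customer"]] += order["amount_cents"]
    return dict(totals)


def revenue_v2(raw):
    """Corrected query model: only paid orders contribute to revenue."""
    totals = defaultdict(int)
    for order in raw:
        if order["status"] == "paid":
            totals[order["customer"]] += order["amount_cents"]
    return dict(totals)


# Rebuilding the query model is simply rerunning the corrected function
# over the same raw data.
revenue_by_customer = revenue_v2(RAW_ORDERS)
```

Since both versions read the same untouched raw records, several query models can also exist side by side for different analytical applications.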

The approach with RAW/CLN/DRV areas, which we described in previous posts, follows the CQRS pattern exactly.