An ORM (object-relational mapping) analog for data in a data lake

We have started saving data in HDFS in the Avro format. In a previous post we discussed forward and backward compatibility of Avro schemas. How can this concept be put to use?

We develop in Scala and Spark, and the data is stored in Avro format. Scala case classes describing the data model can be generated from the Avro schemas with avrohugger or avro4s. These classes can be packaged as a jar, versioned, and published to a Maven repository. During development you can then depend on this jar and map the data to objects with Spark's as[] notation. This is very similar to an ORM in the traditional relational world.
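A minimal sketch of how this looks in practice, assuming the spark-avro module is on the classpath. The case class, its fields, and the HDFS path are hypothetical stand-ins for whatever avrohugger or avro4s would generate from your actual schema:

```scala
import org.apache.spark.sql.SparkSession

// A case class as avrohugger/avro4s would generate from the Avro schema
// (the name and fields here are made up for illustration)
case class UserEvent(userId: String, eventType: String, timestamp: Long)

object ReadEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-avro-events")
      .getOrCreate()
    import spark.implicits._

    // Read Avro files from HDFS and map each row to the case class with as[]
    val events = spark.read
      .format("avro")
      .load("hdfs:///data/events/")
      .as[UserEvent]

    // From here on you work with typed objects, not untyped rows
    events.filter(_.eventType == "click").show()
  }
}
```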

If you follow the rules for backward/forward compatibility of Avro schemas, you can keep using an old version of the case-class jar even after the schema changes. This makes code maintenance much easier: you do not have to update your application after every schema change.
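A sketch of what such an evolution might look like (the model names and jar versions are hypothetical). A field added with a default value keeps the Avro schema backward compatible, and an application built against the old jar keeps working because as[] binds only the columns its encoder needs and ignores the rest:

```scala
import org.apache.spark.sql.SparkSession

object SchemaEvolution {
  // Version 1 of the generated model, published as e.g. model-1.0.jar
  case class Purchase(id: String, amount: Double)

  // Version 2 adds an optional field with a default value, which keeps
  // the Avro schema backward compatible; published as model-2.0.jar
  case class PurchaseV2(id: String, amount: Double, currency: Option[String])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-evolution")
      .getOrCreate()
    import spark.implicits._

    // An application still depending on the v1 jar can read data written
    // with schema v2: the extra `currency` column is simply ignored
    val purchases = spark.read
      .format("avro")
      .load("hdfs:///data/purchases/")
      .as[Purchase]

    println(purchases.count())
  }
}
```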

To summarize: you can make development easier by providing case classes for the data stored in HDFS. There may be several versions of these models, and an application can use any of them as long as the schema-evolution rules are followed. Case classes are the analog of an ORM in the big data world.