Metadata service and schema registry in a data lake

Maintaining data descriptions is a useful feature of a data lake. Here are some ideas on how to implement it.

Consider a practical case. All your data is stored in Avro format, you use Apache Atlas as metadata storage, and you also access the data through the Hive SQL interface, where folders on the file system are defined as Hive external tables.
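
For illustration, here is a minimal sketch of how such an external table over an Avro folder can be created through Hive JDBC. The connection URL, table name, and HDFS paths are made-up examples; adjust them to your cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateAvroExternalTable {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 URL; the Hive JDBC driver must be on the classpath.
        String url = "jdbc:hive2://hive-server:10000/default";
        String ddl =
            "CREATE EXTERNAL TABLE IF NOT EXISTS events " +
            "STORED AS AVRO " +                                   // Avro SerDe + input/output formats
            "LOCATION 'hdfs:///data/events' " +                    // folder holding the Avro data files
            "TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/events.avsc')"; // schema file in HDFS

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            stmt.execute(ddl);
        }
    }
}
```

The `avro.schema.url` table property is what ties the Hive table to the schema file stored in HDFS, which is exactly the file the questions below revolve around.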

When you read data through the Hive SQL interface, you must provide Hive with a description of the data in the form of an Avro schema. Typically the schema file is stored in a folder in HDFS. This approach leaves several open questions. First, how do you ensure that a new schema is forward/backward compatible with the old one (see the sketch below)? Second, how do you update the metadata in Atlas when a schema changes? Third, how do you create new external tables in Hive so they stay synchronized with the files on the file system? Fourth, how do you give applications the ability to update schemas and Hive tables without adding extra dependencies?
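
The compatibility question can be answered with Avro's own API. Below is a minimal sketch using org.apache.avro.SchemaCompatibility (available since Avro 1.7.7); the Event schema is a made-up example:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class CompatibilityCheck {
    /**
     * Returns true if data written with writerSchema can be read
     * with readerSchema.
     */
    static boolean canRead(Schema readerSchema, Schema writerSchema) {
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema);
        return result.getType() == SchemaCompatibilityType.COMPATIBLE;
    }

    public static void main(String[] args) {
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"}]}");
        // v2 adds an optional field with a default value, which keeps it compatible.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"source\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        System.out.println("new schema reads old data: " + canRead(v2, v1)); // backward compatible
        System.out.println("old schema reads new data: " + canRead(v1, v2)); // forward compatible
    }
}
```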

A solution to the issues above is to create a service with an API. The service should expose methods such as registerNewSchema(avroSchema, pathToDataHdfs, hiveTable) and updateSchema(avroSchema). Inside these methods you can check whether the schema is backward/forward compatible, update the metadata in Atlas, create the Hive external table, and update the schema file in HDFS.
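
A skeleton of such a service might look like the following. This is only a sketch of the idea: the private methods are placeholders standing in for your actual HDFS, Hive JDBC, and Atlas REST integration code.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class SchemaService {

    public void registerNewSchema(Schema avroSchema, String pathToDataHdfs, String hiveTable) {
        writeSchemaFileToHdfs(avroSchema, pathToDataHdfs);       // store the .avsc next to the data
        createHiveExternalTable(hiveTable, pathToDataHdfs);      // CREATE EXTERNAL TABLE ... STORED AS AVRO
        registerInAtlas(avroSchema, pathToDataHdfs, hiveTable);  // create the Atlas entities
    }

    public void updateSchema(Schema newSchema) {
        Schema current = loadCurrentSchemaFromHdfs();            // read the existing .avsc
        // Reject schemas that cannot read already-written data.
        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(newSchema, current);
        if (result.getType() != SchemaCompatibilityType.COMPATIBLE) {
            throw new IllegalArgumentException("New schema is not backward compatible");
        }
        overwriteSchemaFileInHdfs(newSchema);                    // existing tables pick up the new file
        updateAtlasMetadata(newSchema);                          // keep Atlas in sync
    }

    // Placeholders for the actual integrations.
    private void writeSchemaFileToHdfs(Schema schema, String path) { /* HDFS client */ }
    private void overwriteSchemaFileInHdfs(Schema schema) { /* HDFS client */ }
    private void createHiveExternalTable(String table, String path) { /* Hive JDBC */ }
    private void registerInAtlas(Schema schema, String path, String table) { /* Atlas REST API */ }
    private void updateAtlasMetadata(Schema schema) { /* Atlas REST API */ }
    private Schema loadCurrentSchemaFromHdfs() {
        throw new UnsupportedOperationException("HDFS integration goes here");
    }
}
```

Because all applications go through this one service, none of them needs its own Hive or Atlas client dependency, which answers the fourth question above.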

If you already run Kafka, Confluent provides a Schema Registry for Avro schemas. There you get out-of-the-box schema storage with an API, versioning, and compatibility checks.
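
As a sketch, the registry can be used directly through its REST API. The registry URL and the subject name events-value below are made-up examples, and the compatibility check assumes the subject already has at least one registered version:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaRegistryExample {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String registry = "http://schema-registry:8081"; // hypothetical registry URL
        String subject = "events-value";                 // hypothetical subject name
        // The Avro schema is sent as an escaped JSON string in the request body.
        String body = "{\"schema\": \"{\\\"type\\\": \\\"record\\\", \\\"name\\\": \\\"Event\\\","
                    + " \\\"fields\\\": [{\\\"name\\\": \\\"id\\\", \\\"type\\\": \\\"long\\\"}]}\"}";

        // Check compatibility against the latest registered version of the subject.
        HttpRequest check = HttpRequest.newBuilder()
            .uri(URI.create(registry + "/compatibility/subjects/" + subject + "/versions/latest"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        System.out.println(http.send(check, HttpResponse.BodyHandlers.ofString()).body());
        // e.g. {"is_compatible":true}

        // Register the schema as a new version under the subject.
        HttpRequest register = HttpRequest.newBuilder()
            .uri(URI.create(registry + "/subjects/" + subject + "/versions"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        System.out.println(http.send(register, HttpResponse.BodyHandlers.ofString()).body());
        // e.g. {"id":1}
    }
}
```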