Schema evolution and backward and forward compatibility for data in data lakes

We have already discussed the format of clean and derived data in data lakes. One of the popular formats for this purpose is Avro. Here we will talk about why schema evolution matters and how to achieve backward and forward compatibility by designing Avro schemas.

Avro is a binary format. The nice feature of this format is that the data is splittable and compressed at the same time. But we should also think about the evolution of the Avro schema. Let's start with a simple example. In our data lake we have saved the information about customers in Avro format like this:

CUSTOMER
  fact_date=201701
         customer_v1.0.0.avro //version 1.0.0 
  fact_date=201702
         customer_v1.0.1.avro //version 1.0.1 
  fact_date=201703
         customer_v1.0.2.avro //version 1.0.2 
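
To make the example concrete, here is a minimal sketch of what such a version 1.0.0 schema and file could look like. The fields id, name and Phone are assumptions (the article does not list them), and fastavro is used only as one possible Python library for producing the file:

import fastavro

# Assumed Customer schema, version 1.0.0: every field is required (no default values).
customer_v1_0_0 = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "Phone", "type": "string"},
    ],
}

# Write one partition file of the CUSTOMER dataset with this schema.
records = [{"id": 1, "name": "Alice", "Phone": "555-0100"}]
with open("customer_v1.0.0.avro", "wb") as out:
    fastavro.writer(out, fastavro.parse_schema(customer_v1_0_0), records)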

We have changed the format every month. Different applications use our data lake, and in particular the information about customers. The question is: which changes to the Avro definition are allowed without breaking the execution of our applications?
Let's look at the Avro specification, in the section about schema resolution. There are two different schemas involved: the reader's schema, which is what our application expects, and the writer's schema, which is what we actually receive from the data lake.
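
Here is a minimal sketch of this resolution in action, again using fastavro and the assumed Customer record: the data is written with the 1.0.0 (writer's) schema and read back with a 1.0.2 (reader's) schema that promotes a type and adds an optional field.

import io
import fastavro

# Writer's schema: what is actually stored in the data lake (v1.0.0).
writers_schema = {
    "type": "record", "name": "Customer",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "Phone", "type": "string"},
    ],
}

# Reader's schema: what our application expects (v1.0.2).
readers_schema = {
    "type": "record", "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},                        # int is promoted to long
        {"name": "Phone", "type": "string"},
        {"name": "Phone3", "type": "string", "default": ""},   # new optional field
    ],
}

buf = io.BytesIO()
fastavro.writer(buf, fastavro.parse_schema(writers_schema), [{"id": 1, "Phone": "555-0100"}])
buf.seek(0)

# Schema resolution fills the missing Phone3 with its default value.
for record in fastavro.reader(buf, reader_schema=fastavro.parse_schema(readers_schema)):
    print(record)   # {'id': 1, 'Phone': '555-0100', 'Phone3': ''}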

For example, let us experiment independently with the following changes from version 1.0.0 to version 1.0.2 (schema snippets for the first two changes are sketched right after the list):
* Change 1: Field "Phone" was defined as required in 1.0.0, but in 1.0.2 we provide a default value for it (change from required to optional).
* Change 2: Field "Phone" has a default value in 1.0.0, but in 1.0.2 we remove this value (change from optional to required).
* Change 3: We remove the required field "Phone" in 1.0.2.
* Change 4: We rename the field "Phone" to "Phone2" in 1.0.2.
* Change 5: We add a new field "Phone3" in 1.0.2.
* Change 6: We change a field type from int (1.0.0) to long (1.0.2).
* Change 7: We change a field type from long (1.0.0) to int (1.0.2).
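
At the schema level, Changes 1 and 2 only toggle the default value of the field. A sketch (the string type and the empty default are assumptions):

# Version 1.0.0: "Phone" is required, there is no default value.
phone_required = {"name": "Phone", "type": "string"}

# Version 1.0.2 after Change 1: "Phone" becomes optional by adding a default value.
phone_optional = {"name": "Phone", "type": "string", "default": ""}

# Change 2 is simply the opposite direction: the default value is removed again.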

Let’s start with backward compatibility.
Imagine that we have written our application against the version 1.0.2 schema (reader's schema) and we try to read files that were saved with version 1.0.0 (a small sketch for Change 5 follows the list).
* Change 1: No problem here, since this field will always be filled with a value in the old files.
* Change 2: Here we will have a problem with our application: when we try to read a record where this value was not set, we will get an exception.
* Change 3: No problem here, since we no longer expect the field "Phone" in our application.
* Change 4: Here we will have a problem, since "Phone2" is not available in the initial dataset. We can solve this by providing a default value for the new "Phone2" field.
* Change 5: This field is not available in the old files, so we can add it only with a default value. Otherwise we will get an exception.
* Change 6: int can easily be promoted to long without loss of data.
* Change 7: we would lose data by converting from long to int; Avro schema resolution does not allow this promotion.
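
The two outcomes of Change 5 can be reproduced with a short sketch (fastavro again, field names assumed): reading the old data fails when the new field has no default, and works once a default is provided.

import io
import fastavro

# Old writer's schema (v1.0.0) and one file written with it.
old_writer = {"type": "record", "name": "Customer",
              "fields": [{"name": "id", "type": "int"}]}
buf = io.BytesIO()
fastavro.writer(buf, fastavro.parse_schema(old_writer), [{"id": 1}])

# Reader's schema (v1.0.2) adds "Phone3" without a default: schema resolution fails.
new_reader_bad = {"type": "record", "name": "Customer",
                  "fields": [{"name": "id", "type": "int"},
                             {"name": "Phone3", "type": "string"}]}
buf.seek(0)
try:
    list(fastavro.reader(buf, reader_schema=fastavro.parse_schema(new_reader_bad)))
except Exception as exc:   # fastavro raises a schema resolution error here
    print("backward compatibility broken:", exc)

# The same change with a default value resolves cleanly.
new_reader_ok = {"type": "record", "name": "Customer",
                 "fields": [{"name": "id", "type": "int"},
                            {"name": "Phone3", "type": "string", "default": ""}]}
buf.seek(0)
print(list(fastavro.reader(buf, reader_schema=fastavro.parse_schema(new_reader_ok))))
# -> [{'id': 1, 'Phone3': ''}]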

Next is forward compatibility.
Imagine that we have written our application against the version 1.0.0 schema (reader's schema) and we try to read new files that are saved with version 1.0.2 (again, a small sketch follows the list).
* Change 1: We cannot do this, since the reader expects a required value, but it may be missing in the new files.
* Change 2: This is OK; we can fall back on the default value here.
* Change 3: There will be a problem: we still expect this field in our application, but it is no longer present in the new files.
* Change 4: Here we will have a problem, since "Phone" is not available in the new files. This would be OK if the field "Phone" had been declared with a default value in schema 1.0.0.
* Change 5: The old application does not need this field, so it is simply ignored; this should be OK.
* Change 6: we would lose data by converting from long to int; Avro schema resolution does not allow this promotion.
* Change 7: int can easily be promoted to long without loss of data.
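
For example, Change 5 in the forward direction looks like this as a sketch: the extra field in the new file is simply skipped during schema resolution.

import io
import fastavro

# New writer's schema (v1.0.2) as stored in the data lake.
new_writer = {"type": "record", "name": "Customer",
              "fields": [{"name": "id", "type": "int"},
                         {"name": "Phone3", "type": "string"}]}

# Old reader's schema (v1.0.0): the old application does not know about "Phone3".
old_reader = {"type": "record", "name": "Customer",
              "fields": [{"name": "id", "type": "int"}]}

buf = io.BytesIO()
fastavro.writer(buf, fastavro.parse_schema(new_writer), [{"id": 1, "Phone3": "555-0101"}])
buf.seek(0)

# The unknown field is ignored; the old application keeps working.
print(list(fastavro.reader(buf, reader_schema=fastavro.parse_schema(old_reader))))
# -> [{'id': 1}]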

Based on these examples, we can define simple rules you should apply when designing Avro schemas:

When designing a schema from scratch:
# Always think twice when you specify a field as required (i.e. without a default value). You cannot change this in the future: if you provide a default value later, your old applications will not work with the new files, and you lose forward compatibility.
# Always provide meaningful default values for optional fields. This also improves forward compatibility if the field is missing from a newer schema (see the sketch after this list).
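
A hedged sketch of such a "from scratch" schema (the concrete fields and the placeholder defaults are assumptions):

# Only truly mandatory identifiers stay required; everything else gets a meaningful default.
customer_from_scratch = {
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "id", "type": "long"},                                  # required on purpose
        {"name": "name", "type": "string", "default": "unknown"},        # optional with a default
        {"name": "Phone", "type": ["null", "string"], "default": None},  # optional and nullable
    ],
}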

When changing an existing schema:
# Do not delete required fields in a new Avro schema; this will lead to problems when old applications run against the new files. You can do this for optional fields with default values.
# Do not rename required fields in a new Avro schema; this will also break old applications. You can do this for optional fields with default values.
# Do not change field types in a new Avro schema. You will lose backward or forward compatibility.

You can read other articles from the BigData area:
Short note about HDFS or why you need distributed file system
Thoughts about schema-on-write and schema-on-read
Raw, clean, and derived data in data lakes based on HDFS
Improving performance by reading data with Hive for HDFS using subfolders (partitioning)
HBase is the next step in your BigData technology stack

Apache Kylin OLAP:
How to integrate Apache Kylin OLAP In Excel (pivot) [XMLA Connect and Mondrian]
How to implement Kylin Dialect for Mondrian
Authentication and Authorization for XMLA Connect and Mondrian