Why Contracts Are Important in Data-Intensive Applications with Microservices

The main purpose of a microservices architecture is to increase development velocity and reduce system complexity.

Typically, in applications that are not data intensive, this is achieved by giving each microservice its own data storage and letting services communicate through a REST API. (This concerns persistence only; the functionality itself should be stateless so that it can scale.) Developers can then fix the API exposed to the outside world and stay flexible inside the microservice: they can experiment with different tools and freely change the data model. In this case, the API is the contract between the developers and the outside world.
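As a minimal sketch of this idea, the snippet below keeps an internal storage model separate from the published response shape. The names CustomerRow and to_api_response, and the fields involved, are hypothetical and chosen only for illustration; no particular web framework is assumed.

```python
from dataclasses import dataclass


@dataclass
class CustomerRow:
    """Internal storage model (hypothetical); free to change between releases."""
    id: int
    first_name: str
    last_name: str


def to_api_response(row: CustomerRow) -> dict:
    """Map the internal model to the fixed, published response shape."""
    return {
        "id": row.id,
        "name": f"{row.first_name} {row.last_name}",
    }


response = to_api_response(CustomerRow(id=7, first_name="Ada", last_name="Lovelace"))
print(response)
```

If the team later splits or renames the internal columns, only the mapping function changes; the response shape that clients depend on stays the same.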

For data-intensive applications, where large amounts of data must be transferred between microservices, using a REST API as the contract is not practical. The contract has to be defined in another way. If you are moving from a data warehouse to a data lake, you know that the data's journey starts in a staging area. The idea is to have the same area in the data lake, but instead of throwing its contents away, to store them forever and call them raw data. This is the first contract: data is stored exactly as it was received from the outside world, and every data analytics application should use this data. To reduce retrieval complexity, the data can be stored in a format that supports schema evolution, such as Avro.
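To illustrate what schema evolution buys us for raw data, here is a simplified sketch. The Order schemas and their fields are hypothetical, and read_with_schema is a hand-written stand-in that mimics one Avro resolution rule (a reader field missing from the record takes its declared default); a real pipeline would rely on an Avro library rather than this function.

```python
# Two versions of a hypothetical Avro record schema, written as Python dicts.
SCHEMA_V1 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

SCHEMA_V2 = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        # New field: the default lets readers handle records written with V1.
        {"name": "currency", "type": "string", "default": "EUR"},
    ],
}


def read_with_schema(record: dict, reader_schema: dict) -> dict:
    """Simplified stand-in for Avro schema resolution on a decoded record."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field unknown to the writer: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


old_record = {"order_id": "A-1", "amount": 12.5}  # stored long ago with V1
new_view = read_with_schema(old_record, SCHEMA_V2)
print(new_view)
```

Raw data written years ago stays readable by applications that use the newer schema, so the stored files never have to be rewritten.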

After complex calculations, analytical applications produce valuable information that can be used further, but they also produce a lot of by-product data. We should separate the two types of data and fix the schema of the first as a contract for external applications. Developers should explicitly describe the schema of this data and guarantee its compatibility. To separate this fixed schema from the rest, it is recommended to put it in a separate area of the data lake, for example as derived data.
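One way to make such a compatibility guarantee checkable is a small test that compares a candidate schema against the published one. The sketch below assumes a deliberately conservative rule that is not taken from any specific tool: published fields may not be removed or change type, and new fields must carry a default. The function name contract_violations and both example schemas are hypothetical.

```python
def contract_violations(published: dict, candidate: dict) -> list:
    """Return a list of reasons why `candidate` breaks the published schema."""
    old = {f["name"]: f for f in published["fields"]}
    new = {f["name"]: f for f in candidate["fields"]}
    problems = []
    for name, field in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name]["type"] != field["type"]:
            problems.append(f"type changed: {name}")
    for name, field in new.items():
        if name not in old and "default" not in field:
            problems.append(f"new field without default: {name}")
    return problems


# Hypothetical schema of a derived data set that external applications read.
PUBLISHED = {"fields": [
    {"name": "customer_id", "type": "string"},
    {"name": "score", "type": "double"},
]}

# A proposed change that drops one field and adds another without a default.
CANDIDATE = {"fields": [
    {"name": "customer_id", "type": "string"},
    {"name": "segment", "type": "string"},
]}

violations = contract_violations(PUBLISHED, CANDIDATE)
print(violations)
```

Run as part of a build, a check like this stops a release that would break the derived data contract, while by-product data outside the contract can keep changing freely.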