In our previous article we discussed the root structure for HDFS. In this article we will discuss the next level of the file structure, which helps to improve the speed of reading data.
In relational databases we have schemas, tables, and partitions. In HDFS we will reuse the same concepts using subfolders. We will start with our root structure:
- RAW
- CLEAN
- DERIVED
Next, we need to define schemas in the distributed file system. For example, if we acquire data from CRM and ERP systems, we will have the following file structure:
- RAW
  - CRM <- schema 1
  - ERP <- schema 2
- CLEAN
  - CRM <- schema 1
  - ERP <- schema 2
- DERIVED
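The zone/schema layout above can be created with plain mkdir-style commands; on a real cluster the equivalent would be `hdfs dfs -mkdir -p`. A minimal local sketch (the base path `/tmp/datalake` is a placeholder, not from the article):

```shell
# Create the zone/schema layout locally; on HDFS replace `mkdir -p`
# with `hdfs dfs -mkdir -p`.
BASE=/tmp/datalake
for zone in RAW CLEAN; do
  for schema in CRM ERP; do
    mkdir -p "$BASE/$zone/$schema"
  done
done
mkdir -p "$BASE/DERIVED"
ls "$BASE/RAW"
```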
We can extend it further with subfolders which represent the entities, equivalent to tables in relational databases:
- RAW
  - CRM <- schema 1
    - CUSTOMER <- table 1
    - PURCHASES <- table 2
The next step is to divide each table into partitions. This helps us to process only the data we need. In the partitions we put the files with data. For example:
- RAW
  - CRM <- schema 1
    - CUSTOMER <- table 1
      - fact_date=20170601 <- partition 1
        - *.csv <- data files; other formats like Avro also work
      - fact_date=20170602 <- partition 2
        - *.csv
      - fact_date=20170603 <- partition 3
        - *.csv
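The full path of a partition is just the concatenation of zone, schema, table, and partition key. A tiny helper makes this explicit (the function name and root argument are illustrative, not from the article):

```python
def partition_path(root: str, zone: str, schema: str,
                   table: str, fact_date: str) -> str:
    """Build the HDFS-style path for one partition,
    e.g. RAW/CRM/CUSTOMER/fact_date=20170601."""
    return f"{root}/{zone}/{schema}/{table}/fact_date={fact_date}"

print(partition_path("/data", "RAW", "CRM", "CUSTOMER", "20170601"))
# -> /data/RAW/CRM/CUSTOMER/fact_date=20170601
```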
Why is this structure useful? If we need to see customers for 20170601, we do not need to scan the other folders; we can go directly to the folder fact_date=20170601.
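Partition pruning can be sketched locally: instead of walking the whole table, we list only the one partition folder we care about. The directory tree below is created just for the demonstration (file names and contents are made up):

```python
import glob
import os
import tempfile

# Build a tiny local copy of the layout RAW/CRM/CUSTOMER/fact_date=*/
root = tempfile.mkdtemp()
for day in ("20170601", "20170602", "20170603"):
    part = os.path.join(root, "RAW", "CRM", "CUSTOMER", f"fact_date={day}")
    os.makedirs(part)
    with open(os.path.join(part, "data.csv"), "w") as f:
        f.write("custId,name\n1,sample\n")

# Pruned read: only the files under fact_date=20170601 are touched;
# the other two partition folders are never scanned.
wanted = glob.glob(os.path.join(root, "RAW", "CRM", "CUSTOMER",
                                "fact_date=20170601", "*.csv"))
print(len(wanted))  # 1 file instead of 3
```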
Moreover, we can easily create Hive external tables with partitions over the data structure described above.
CREATE EXTERNAL TABLE CUSTOMER (custId INT, name STRING)
PARTITIONED BY (fact_date STRING) -- this is our fact_date folder
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
LOCATION 'RAW/CRM/CUSTOMER';
Important: the partitions have to be defined manually.
alter table CUSTOMER add partition (fact_date='20170601');
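One ALTER TABLE statement per folder registers the remaining partitions. Alternatively, because the folders follow Hive's key=value naming convention, Hive can discover them itself with MSCK REPAIR TABLE (a sketch, assuming the CUSTOMER table defined above):

```sql
alter table CUSTOMER add partition (fact_date='20170602');
alter table CUSTOMER add partition (fact_date='20170603');
-- Alternative: let Hive scan the table location and register
-- every fact_date=... folder it finds.
MSCK REPAIR TABLE CUSTOMER;
```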