Improving performance by reading data with Hive for HDFS using subfolders (partitioning)

In our  previous article we have discussed the root structure for HDFS. In this article we will discuss next level of the file structure, which will help to improve the speed of reading data.

In relational databases we have schemas, tables, and partitions. In HDFS we will reuse the same concept using subfolders. We will start with our root structure:

  • RAW
  • CLEAN
  • DERIVED

Next, what we need we should define schema in distributed file system. For example, we acquire data from CRM and from ERP systems, then we will have the following file strucutre.

  • RAW
    • CRM   <- schema 1
    • ERP    <- schema 2
  • CLEAN
    • CRM   <- schema 1
    • ERP    <-schema 2
  • DERIVED

We can extend it further with subfolders, which represent the entities, which equialent to tables in relational databases:

  • RAW
    • CRM   <- schema 1
      • CUSTOMER <- table 1
      • PURCHASES <- table 2

Next step is to divide each tables in partitions. This help us to process only needed data. In partitions we can put the files with data. For example:

  • RAW
    • CRM   <- schema 1
      • CUSTOMER <- table 1
        • fact_date=20170601  <- partition 1
          • *.csv          <- data files, it can be also other format like avro
        • fact_date=20170602 <- partiotion 2
          • *.csv
        • fact_date=20170603 <- partition 3
          • *.csv

Why this structure is useful? if we need to see customers for 20170601 we do not need to scan other folders. We can go directly to folder with fact_date=20170601.

Moreover, we can easily create hive externals tables with partitions over the data sctructure, which we have described above.

CREATE EXTERNAL TABLE CUSTOMER (custId INT,
name STRING)
PARTITIONED BY(fact_date STRING) -- this is our fact_date folder
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
LOCATION 'RAW/CRM/CUSTOMER'

Important, it is needed to define partitions manually.

alter table CUSTOMER add partition (fact_date='20160601');