{"id":428,"date":"2017-06-06T20:26:46","date_gmt":"2017-06-06T18:26:46","guid":{"rendered":"http:\/\/dekarlab.de\/wp\/?p=428"},"modified":"2020-05-23T15:36:23","modified_gmt":"2020-05-23T13:36:23","slug":"improving-speed-by-reading-data-with-hive-for-hdfs-using-subfolders-partitioning","status":"publish","type":"post","link":"https:\/\/dekarlab.de\/wp\/?p=428","title":{"rendered":"Improving performance by reading data with Hive for HDFS using subfolders (partitioning)"},"content":{"rendered":"<p>In our\u00a0 <a href=\"https:\/\/dekarlab.de\/wp\/?p=386\">previous article<\/a> we have discussed the root structure for HDFS. In this article we will discuss next level of the file structure, which will help to improve the speed of reading data.<\/p>\n<p><!--more--><\/p>\n<p>In relational databases we have schemas, tables, and partitions. In HDFS we will reuse the same concept using subfolders. We will start with our root structure:<\/p>\n<ul>\n<li>RAW<\/li>\n<li>CLEAN<\/li>\n<li>DERIVED<\/li>\n<\/ul>\n<p>Next, what we need we should define schema in distributed file system. For example, we acquire data from CRM and from ERP systems, then we will have the following file strucutre.<\/p>\n<ul>\n<li>RAW\n<ul>\n<li>CRM\u00a0\u00a0 &lt;- schema 1<\/li>\n<li>ERP\u00a0\u00a0\u00a0 &lt;- schema 2<\/li>\n<\/ul>\n<\/li>\n<li>CLEAN\n<ul>\n<li>CRM\u00a0\u00a0 &lt;- schema 1<\/li>\n<li>ERP\u00a0\u00a0\u00a0 &lt;-schema 2<\/li>\n<\/ul>\n<\/li>\n<li>DERIVED<\/li>\n<\/ul>\n<p>We can extend it further with subfolders, which represent the entities, which equialent to tables in relational databases:<\/p>\n<ul>\n<li>RAW\n<ul>\n<li>CRM\u00a0\u00a0 &lt;- schema 1\n<ul>\n<li>CUSTOMER &lt;- table 1<\/li>\n<li>PURCHASES &lt;- table 2<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Next step is to divide each tables in partitions. This help us to process only needed data. In partitions we can put the files with data. For example:<\/p>\n<ul>\n<li>RAW\n<ul>\n<li>CRM\u00a0\u00a0 &lt;- schema 1\n<ul>\n<li>CUSTOMER &lt;- table 1\n<ul>\n<li>fact_date=20170601\u00a0 &lt;- partition 1\n<ul>\n<li>*.csv\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 &lt;- data files, it can be also other format like avro<\/li>\n<\/ul>\n<\/li>\n<li>fact_date=20170602 &lt;- partiotion 2\n<ul>\n<li>*.csv<\/li>\n<\/ul>\n<\/li>\n<li>fact_date=20170603 &lt;- partition 3\n<ul>\n<li>*.csv<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>Why this structure is useful? if we need to see customers for 20170601 we do not need to scan other folders. We can go directly to folder with fact_date=20170601.<\/p>\n<p>Moreover, we can easily create hive externals tables with partitions over the data sctructure, which we have described above.<\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">CREATE EXTERNAL TABLE CUSTOMER (custId INT,\r\nname STRING)\r\nPARTITIONED BY(fact_date STRING) -- this is our fact_date folder\r\nROW FORMAT DELIMITED\r\nFIELDS TERMINATED BY '&#92;&#48;01'\r\nCOLLECTION ITEMS TERMINATED BY '&#92;&#48;02'\r\nMAP KEYS TERMINATED BY '&#92;&#48;03'\r\nSTORED AS TEXTFILE\r\nLOCATION 'RAW\/CRM\/CUSTOMER'<\/pre>\n<p><strong>Important, it is needed to define partitions manually.<\/strong><\/p>\n<pre class=\"brush: sql; title: ; notranslate\" title=\"\">alter table CUSTOMER add partition (fact_date='20160601');<\/pre>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In our\u00a0 previous article we have discussed the root structure for HDFS. In this article we will discuss next level of the file structure, which will help to improve the speed of reading data.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0},"categories":[25],"tags":[57,37,53],"_links":{"self":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts\/428"}],"collection":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=428"}],"version-history":[{"count":6,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts\/428\/revisions"}],"predecessor-version":[{"id":434,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=\/wp\/v2\/posts\/428\/revisions\/434"}],"wp:attachment":[{"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=428"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=428"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dekarlab.de\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=428"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}