HBase is the next step in your big data technology stack

Typically, the basis for distributed computing is a distributed file system (DFS). Such a file system is spread over a large number of servers but presents one logical view. The complexity and distributed nature are hidden from the end user behind a simple file explorer. There are different implementations of a DFS; the best known are the Hadoop Distributed File System (HDFS) and Amazon S3 (strictly speaking an object store, but often used in the same role).

You push files into the DFS and create a so-called data lake. In the next step, these files have to be processed. You can develop and run Map/Reduce jobs or Spark programs on these files to extract the information you need and perform calculations. A huge advantage of this approach is flexibility: you can do whatever you want with the data using general-purpose programming languages such as Java and Scala. The drawback is time: you have to write a program for every task. Sometimes a simple query language such as SQL is enough to do the job. This is also possible.
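To make the Map/Reduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration and only mimic the three stages that a real framework distributes over many servers.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Even for a task as small as counting words you have to write and test several functions, which is exactly the development cost mentioned above.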

If you give these files additional structure, that is, store them in folders following a naming convention, you can define a schema on top of them and use Hive to query the data. Hive automatically translates your SQL statements into Map/Reduce or Spark jobs, which saves the time you would otherwise spend on development. But Hive queries are typically slow because of how Hive works: it translates SQL into Map/Reduce jobs and submits them to a resource manager such as YARN, which then starts their execution.
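The folder naming convention can be sketched as follows. Hive partition directories use the key=value form; the base directory, the table name and the year/month/day partition columns in this example are assumptions chosen for illustration, and the two helper functions are hypothetical.

```python
from datetime import date


def partition_path(base_dir, table, day):
    """Build a Hive-style partition folder (key=value per level) for one day."""
    return (
        f"{base_dir}/{table}"
        f"/year={day.year}/month={day.month:02d}/day={day.day:02d}"
    )


def prune(paths, year, month):
    """Keep only the folders a query for the given year and month must read."""
    wanted = f"/year={year}/month={month:02d}/"
    return [path for path in paths if wanted in path]


folders = [
    partition_path("/data/lake", "sales", date(2020, 3, 7)),
    partition_path("/data/lake", "sales", date(2020, 4, 1)),
]
march_only = prune(folders, 2020, 3)
```

Because the partition values are encoded in the path, a query restricted to one month only has to touch the matching folders instead of the whole table.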

HBase solves this latency problem. HBase is a key/value store that keeps its data in the DFS. Using HBase, you can reduce the latency between a user's request for data and the response. This makes HBase suitable for online analytical processing. Using HBase as the basis for your data lake, however, is not recommended.
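Why a key/value store answers quickly can be illustrated with a toy in-memory model. This is not the HBase client API; ToyKeyValueTable and its methods are hypothetical and only imitate the general idea of rows kept sorted by row key, so that a single row or a key range can be found without starting a batch job.

```python
import bisect


class ToyKeyValueTable:
    """Toy model of a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []  # row keys in sorted order
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert or update one cell."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Direct lookup of one row by its key; None if the row is absent."""
        return self._rows.get(row_key)

    def scan(self, prefix):
        """Return all rows whose key starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result


table = ToyKeyValueTable()
table.put("DE|2020", "amount", 15.0)
table.put("US|2020", "amount", 7.0)
table.put("DE|2019", "amount", 3.0)
german_rows = table.scan("DE|")
```

A lookup here is a dictionary access and a scan is a binary search plus a short walk over neighbouring keys, which is the contrast to Hive, where every query first becomes a scheduled job.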

In the ideal use case, HDFS is your basic data storage. You can query files in HDFS using flexible Spark or Map/Reduce, or the less flexible Hive. For queries that require an immediate response, as is the case for online analytical processing, you can load data on demand from HDFS into HBase. To automate this process, you can use the Apache Kylin OLAP engine.
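The loading step can be pictured as pre-aggregating raw records into key/value pairs that are cheap to look up later. The sketch below is a strong simplification under stated assumptions: the record fields (country, year, amount), the "|" separator in the row key and the function build_aggregates are all hypothetical, and the storage layout of a real OLAP engine such as Kylin is more elaborate.

```python
from collections import defaultdict


def build_aggregates(records, dimensions, measure):
    """Sum one measure per combination of dimension values.

    The returned dict maps a composite row key such as "DE|2020"
    to the aggregated value, ready to be written to a key/value store.
    """
    totals = defaultdict(float)
    for record in records:
        row_key = "|".join(str(record[d]) for d in dimensions)
        totals[row_key] += record[measure]
    return dict(totals)


raw_records = [
    {"country": "DE", "year": 2020, "amount": 10.0},
    {"country": "DE", "year": 2020, "amount": 5.0},
    {"country": "US", "year": 2020, "amount": 7.0},
]
aggregates = build_aggregates(raw_records, ["country", "year"], "amount")
```

The expensive pass over the raw files happens once, as a batch job; the interactive query afterwards is only a key lookup.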

You can read other articles from the Big Data area:
Short Note about HDFS or why you need distributed file system
Thoughts about schema-on-write and schema-on-read
Raw, clean, and derived data in data lakes based on HDFS
Improving performance by reading data with Hive for HDFS using subfolders (partitioning)

Apache Kylin OLAP:
How to integrate Apache Kylin OLAP In Excel (pivot) [XMLA Connect and Mondrian]
How to implement Kylin Dialect for Mondrian
Authentication and Authorization for XMLA Connect and Mondrian
