What is HBase
HBase is the Hadoop database, modeled after Google's Bigtable. It is an Apache Top Level Project, which means it is open source, and it is also embraced and supported by vendors such as IBM. Industry heavy hitters like Facebook and Twitter use it to access BigData. It is written in Java, but there are other APIs to access it. It has the following characteristics:
- Sparse - data is scattered; a row only stores the columns it actually has
- Distributed - spread out over commodity hardware
- Persistent - data will be saved
- Multi-dimensional - there may be multiple versions of the data
- Sorted Map - need a key to access the data
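To make the "sparse, sorted map" idea concrete, here is a minimal in-memory sketch in plain Java. It is not the HBase API; the class name `SortedMapSketch` and its methods are made up for illustration, and it ignores distribution, persistence, and versions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedMapSketch {
    // row key -> (column name -> value); TreeMap keeps both levels sorted by key
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Returns null when the row or column was never written.
    // Sparse: nothing at all is stored for a missing cell.
    String get(String rowKey, String column) {
        Map<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Row keys in sorted order, regardless of insertion order
    List<String> rowKeys() {
        return new ArrayList<>(rows.keySet());
    }

    public static void main(String[] args) {
        SortedMapSketch table = new SortedMapSketch();
        table.put("5678", "info:lName", "Doe");
        table.put("1234", "info:lName", "Smith");
        table.put("1234", "info:fName", "John");

        System.out.println(table.rowKeys());                  // [1234, 5678]
        System.out.println(table.get("5678", "info:fName"));  // null
    }
}
```

Row 5678 never received an `info:fName` value, so the lookup simply finds nothing; there is no placeholder taking up space the way a NULL column would in a fixed schema.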
NoSQL Technology
- HBase is a NoSQL datastore
- NoSQL stands for "Not only SQL"
- Not intended to replace an RDBMS
- Suited for specific business needs that require:
  - Massive scaling to terabytes, petabytes, and larger
  - Commodity hardware for scaling out the solution
  - Not knowing the schema up front
Why HBase
- HBase CAN replace costly implementations of RDBMS for BigData applications, but it is not meant to replace RDBMS entirely because:
  - It doesn't support SQL
  - It is not for transactional processing
  - It does not support table joins
- Horizontal scaling for very large data sets
- Ability to add commodity hardware without interruption of service
- No need to know data types in advance. This allows for a flexible schema.
- RANDOM read/write access to BigData. Reads and writes are very quick and efficient.
- Sharding - splitting the data across nodes
NOTE: Everything is stored as an array of bytes (except timestamp which is stored as a long integer).
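Because values are plain byte arrays, the application decides how to encode and decode them. The sketch below shows one way to do that with only the JDK; the class `BytesSketch` and its helper names are illustrative assumptions. The HBase client library ships its own `Bytes` utility for the same purpose, which is not used here so the example stays self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BytesSketch {
    // Text is encoded as UTF-8 bytes
    static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String toText(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // A long is encoded as 8 bytes, most significant byte first
    static byte[] toBytes(long number) {
        return ByteBuffer.allocate(Long.BYTES).putLong(number).array();
    }

    static long toLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        byte[] name = toBytes("Smith");
        byte[] stamp = toBytes(20130710L);

        System.out.println(name.length);    // 5
        System.out.println(stamp.length);   // 8
        System.out.println(toText(name));   // Smith
        System.out.println(toLong(stamp));  // 20130710
    }
}
```

The store itself never knows that one array holds a name and the other a number. Reading a value back with the wrong decoder gives garbage, so the encoding has to be agreed on by every client.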
HBase vs. RDBMS
Topic | HBase | RDBMS |
---|---|---|
Hardware architecture | Similar to Hadoop. Clustered commodity hardware. Very affordable. | Typically large scalable multi-processor systems. Very expensive. |
Typical Database Size | Terabytes to Petabytes - hundreds of millions to billions of rows | Gigabytes to Terabytes - hundreds of thousands to millions of rows. |
Data Layout | A sparse, distributed, persistent, multi-dimensional, sorted map. | Row- or column-oriented |
Data Types | Bytes only | Rich data type support |
Transactions | ACID support on a single row only | Full ACID compliance across rows and tables |
Query Language | API primitive commands only, unless combined with Hive or other technologies. | SQL |
Indexes | Row-Key only unless combined with other technologies such as Hive or IBM's BigSQL | Yes. On one or more columns. |
Throughput | Millions of queries per second | Thousands of queries per second |
Fault Tolerance | Built into the architecture. Lots of nodes means each is relatively insignificant. No need to worry about individual nodes. | Requires configuration of the HW and the RDBMS with the appropriate high availability options. |
Data Representation Example (RDBMS vs HBase)
An RDBMS table might look something like this:
ID (Primary Key) | LName | FName | Password | Timestamp |
---|---|---|---|---|
1234 | Smith | John | Hello, world! | 20130710 |
5678 | Doe | Jane | wysiwyg | 20120825 |
5678 | Doe | Jane | wisiwig | 20130916 |
Logical View in HBase
Row-Key | Value (Column-Family, Qualifier, Version) |
---|---|
1234 | info {'lName': 'Smith', 'fName': 'John' } pwd {'password': 'Hello, world!' } |
5678 | info {'lName': 'Doe', 'fName': 'Jane' } pwd {'password': 'wisiwig'@ts 20130916, 'password': 'wysiwyg'@ts 20120825 } |
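The versions of a single cell can be pictured as a small map from timestamp to value, sorted newest first. The sketch below is plain Java, not HBase code; `VersionedCellSketch` and its methods are hypothetical names, and the timestamps are the ones from the RDBMS table for ID 5678.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

class VersionedCellSketch {
    // timestamp -> value, ordered so the newest timestamp comes first
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // The newest version, or null if the cell was never written
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // The newest version written at or before the given timestamp.
    // With a reversed ordering, ceilingEntry walks toward older timestamps.
    String asOf(long timestamp) {
        Map.Entry<Long, String> entry = versions.ceilingEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        VersionedCellSketch password = new VersionedCellSketch();
        password.put(20120825L, "wysiwyg");
        password.put(20130916L, "wisiwig");

        System.out.println(password.latest());          // wisiwig
        System.out.println(password.asOf(20130101L));   // wysiwyg
        System.out.println(password.asOf(20100101L));   // null
    }
}
```

A plain read returns the newest version, which is why the two RDBMS rows for ID 5678 collapse into one logical row here while the older password stays retrievable by timestamp.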
HBase Physical (How it is stored on disk)
Logical View to Physical View
Let's assume you want to read Row4. You will need data from both physical files. In the case of CF1, you will get two rows since there are two versions of the data.
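The same idea can be sketched in plain Java using row 5678 from the earlier example instead of Row4. Each stored cell carries its full coordinates, and cells are grouped into one "file" per column family. The class `PhysicalLayoutSketch`, its `Cell` type, and the in-memory lists standing in for files are all assumptions made for illustration; the timestamps on the info cells are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PhysicalLayoutSketch {
    // One stored cell: the coordinates are kept next to every value
    static final class Cell {
        final String rowKey, family, qualifier, value;
        final long timestamp;

        Cell(String rowKey, String family, String qualifier, long timestamp, String value) {
            this.rowKey = rowKey;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Stand-in for the physical files: column family name -> cells stored in it
    static Map<String, List<Cell>> groupByFamily(List<Cell> cells) {
        Map<String, List<Cell>> files = new TreeMap<>();
        for (Cell c : cells) {
            files.computeIfAbsent(c.family, f -> new ArrayList<>()).add(c);
        }
        return files;
    }

    // All stored cells of one row within one column family's "file"
    static List<Cell> readRow(Map<String, List<Cell>> files, String family, String rowKey) {
        List<Cell> hits = new ArrayList<>();
        for (Cell c : files.getOrDefault(family, Collections.emptyList())) {
            if (c.rowKey.equals(rowKey)) {
                hits.add(c);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Cell> logical = new ArrayList<>();
        logical.add(new Cell("5678", "info", "lName", 20120825L, "Doe"));
        logical.add(new Cell("5678", "info", "fName", 20120825L, "Jane"));
        logical.add(new Cell("5678", "pwd", "password", 20120825L, "wysiwyg"));
        logical.add(new Cell("5678", "pwd", "password", 20130916L, "wisiwig"));

        Map<String, List<Cell>> files = groupByFamily(logical);
        System.out.println(files.keySet());                       // [info, pwd]
        System.out.println(readRow(files, "pwd", "5678").size()); // 2
    }
}
```

Reading the whole row means visiting both the info and the pwd group, and the pwd group yields two stored entries because the password has two versions.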
HBase Components
Region
- This is where the rows of a table are stored
- A region holds a contiguous range of a table's rows; within it, each column family is stored separately
- A table's data is automatically sharded across multiple regions when the data gets too large.
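Since row keys are sorted, sharding can be pictured as cutting the key space into ranges and remembering where each range starts. The sketch below is an illustration in plain Java, not HBase internals; the class `RegionLookupSketch`, the split key, and the server names are hypothetical.

```java
import java.util.TreeMap;

class RegionLookupSketch {
    // region start key (inclusive) -> name of the server hosting that region
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    RegionLookupSketch(String firstServer) {
        // A new table starts as one region covering every key, beginning at the empty key
        regionsByStartKey.put("", firstServer);
    }

    // Cut the key space at splitKey; keys from splitKey onward go to the new region
    void split(String splitKey, String server) {
        regionsByStartKey.put(splitKey, server);
    }

    // The owning region is the one with the greatest start key <= rowKey
    String serverFor(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch table = new RegionLookupSketch("server-1");
        table.split("3000", "server-2");

        System.out.println(table.serverFor("1234"));  // server-1
        System.out.println(table.serverFor("3000"));  // server-2
        System.out.println(table.serverFor("5678"));  // server-2
    }
}
```

Keys compare as strings here, which is close in spirit to comparing raw bytes. Real row key design has to take that ordering into account, for example by zero-padding numeric IDs.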
Region Server
- Contains one or more regions
- Hosts the tables, performs reads and writes, buffers, etc
- Clients talk directly to the Region Server for their data.
Master
- Coordinates the Region Servers
- Monitors the status of the Region Servers and rebalances load across them
- Assigns Regions to Region Servers
- Multiple Masters are allowed, but only one is the true master, and the others are only backups.
- Not part of the read/write path
- Highly available with ZooKeeper
ZooKeeper
- Critical component for HBase
- Ensures one Master is running
- Registers Regions and Region Servers
- Integral part of the fault tolerance on HBase
HDFS
- The Hadoop file system is where the data (physical files) is kept
API
- The Java client API.
- You can also use SQL if you use Hive to access your data.
Here is how the components relate to each other.
HBase Shell introduction
Starting HBase Instance
$HBASE_HOME/bin/start-hbase.sh
Stopping HBase Instance
$HBASE_HOME/bin/stop-hbase.sh
Start HBase shell
$HBASE_HOME/bin/hbase shell
HBase Shell Commands
See a list of the tables
list
Create a table
create 'testTable', 'cf'
NOTE: testTable is the name of the table and cf is the name of the column family
Insert data into a table
Insert at rowA, column "cf:columnName" with a value of "val1"
put 'testTable', 'rowA', 'cf:columnName', 'val1'
Retrieve data from a table
Retrieve "rowA" from the table "testTable"
get 'testTable', 'rowA'
Delete data from a table
Delete the cell at rowA, column "cf:columnName", where ts1 stands for the cell's timestamp
delete 'testTable', 'rowA', 'cf:columnName', ts1
Delete a table (it must be disabled first):
disable 'testTable'
drop 'testTable'
HBase Clients
HBase Shell - you can do the above CRUD operations using the HBase Shell. However, it will be limiting for more complicated tasks.
Java - you can do the above CRUD operations and more using Java. Java code can also be run against HBase as a MapReduce job.
NOTE: Some of this material was copied directly from the BigData University online class Using HBase for Real-time Access to your Big Data - Version 2. If you want hands-on labs, more explanation, etc., I suggest you check it out, since all the information in this post comes from there.