Just geeks: January 2015

What is HBase

HBase is the Hadoop database, modeled after Google's Bigtable. It is an Apache Top Level Project, which means it is open source, and it is also embraced and supported by vendors such as IBM. Industry heavy hitters such as Facebook and Twitter use it to access BigData. It is written in Java, but other APIs are available for accessing it. It has the following characteristics:

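Sparse - empty cells are not stored, so rows need not have the same columns
Distributed - spread out over commodity hardware
Persistent - data is saved to disk
Multi-dimensional - a value is addressed by row key, column family, qualifier, and version (timestamp), so there may be multiple versions of the data
Sorted Map - a key is needed to access the data, and rows are kept sorted by row key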
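To make these characteristics concrete, here is a minimal sketch that models the logical data model as nested Python dictionaries. It is only a toy: the helper names put and scan, the table variable, and the sample rows are made up for illustration and are not HBase API calls.

```python
# Toy model of the logical data model: a map of
# row key -> column family -> qualifier -> {timestamp: value}.
# All names here are illustrative, not HBase API calls.

def put(table, row, family, qualifier, value, ts):
    """Store one cell version; only cells that are written take up space."""
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(qualifier, {})
    cell[ts] = value

def scan(table, start_row, stop_row):
    """Yield (row, columns) in row-key order for start_row <= row < stop_row."""
    for row in sorted(table):
        if start_row <= row < stop_row:
            yield row, table[row]

table = {}
put(table, b"row2", b"cf", b"a", b"x", ts=1)
put(table, b"row1", b"cf", b"b", b"y", ts=1)  # row1 has no "a" cell: sparse
put(table, b"row1", b"cf", b"b", b"z", ts=2)  # second version of the same cell

rows = [row for row, _ in scan(table, b"row0", b"row9")]
print(rows)  # [b'row1', b'row2']
```

The rows come back in key order even though row2 was written first, and row1 simply has no entry for the column it never wrote.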

NoSQL Technology

HBase is a NoSQL datastore
NoSQL stands for "Not only SQL"
Not intended to replace an RDBMS
Suited for specific business needs that require:

Massive scaling to terabytes, petabytes, and larger
Commodity hardware for scaling out the solution
Not knowing the schema up front

Why HBase

HBase CAN replace costly RDBMS implementations for BigData applications, but it is not meant to replace the RDBMS entirely because:

It doesn't support SQL
It is not designed for transactional processing
It does not support table joins

Consider HBase when you need:

Horizontal scaling for very large data sets
The ability to add commodity hardware without interrupting service
A flexible schema, because the data types are not known in advance
RANDOM read/write access to BigData, with quick and efficient reads and writes
Sharding - splitting the data across nodes

NOTE: Everything is stored as an array of bytes (except the timestamp, which is stored as a long integer).
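Because values are just bytes, the application has to encode and decode them itself. The sketch below shows one way to do that in Python. The helper names are hypothetical, and the 8-byte big-endian layout for a long integer is an assumption made for this example.

```python
import struct

def long_to_bytes(n):
    """Encode a signed 64-bit integer as 8 big-endian bytes (assumed layout)."""
    return struct.pack(">q", n)

def bytes_to_long(b):
    """Decode 8 big-endian bytes back into a signed 64-bit integer."""
    return struct.unpack(">q", b)[0]

name_bytes = "Smith".encode("utf-8")     # strings become UTF-8 bytes
ts_bytes = long_to_bytes(20130710)       # a long always takes 8 bytes

# Keys that are compared as bytes sort lexicographically, not numerically.
sorted_keys = sorted([b"9", b"10"])
print(sorted_keys)  # [b'10', b'9']
```

The last line is worth remembering when designing row keys: numeric keys stored as text usually need zero-padding to sort in numeric order.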

HBase vs. RDBMS

Topic	HBase	RDBMS
Hardware architecture	Similar to Hadoop. Clustered commodity hardware. Very affordable.	Typically large scalable multi-processor systems. Very expensive.
Typical Database Size	Terabytes to Petabytes - hundreds of millions to billions of rows	Gigabytes to Terabytes - hundreds of thousands to millions of rows.
Data Layout	A sparse, distributed, persistent, multi-dimensional, sorted map.	Row- or column-oriented
Data Types	Bytes only	Rich data type support
Transactions	ACID support on a single row only	Full ACID compliance across rows and tables
Query Language	API primitive commands only, unless combined with Hive or other technologies.	SQL
Indexes	Row-Key only unless combined with other technologies such as Hive or IBM's BigSQL	Yes. On one or more columns.
Throughput	Millions of queries per second	Thousands of queries per second
Fault Tolerance	Built into the architecture. Lots of nodes means each is relatively insignificant. No need to worry about individual nodes.	Requires configuration of the HW and the RDBMS with the appropriate high availability options.

Data Representation Example (RDBMS vs HBase)

RDBMS might look something like this

ID (Primary Key)	LName	FName	Password	Timestamp
1234	Smith	John	Hello, world!	20130710
5678	Doe	Jane	wysiwyg	20120825
5678	Doe	Jane	wisiwig	20130916

Logical View in HBase

Row-Key	Value (Column-Family, Qualifier, Version)
1234	info {'lName': 'Smith', 'fName': 'John' } pwd {'password': 'Hello, world!' }
5678	info {'lName': 'Doe', 'fName': 'Jane' } pwd {'password': 'wisiwig'@ts 20130916, 'password': 'wysiwyg'@ts 20120825 }
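The mapping from the RDBMS rows to the logical view can be sketched in a few lines of Python. This is an illustration only: the function names to_logical_view and get_latest are made up, the data is taken from the RDBMS table above, and real HBase stores everything as bytes rather than Python strings.

```python
# Rows from the RDBMS example: (id, lname, fname, password, timestamp).
rdbms_rows = [
    ("1234", "Smith", "John", "Hello, world!", 20130710),
    ("5678", "Doe", "Jane", "wysiwyg", 20120825),
    ("5678", "Doe", "Jane", "wisiwig", 20130916),
]

def to_logical_view(rows):
    """Group rows by key into column families, keeping one version per timestamp."""
    table = {}
    for row_id, lname, fname, password, ts in rows:
        families = table.setdefault(row_id, {"info": {}, "pwd": {}})
        families["info"].setdefault("lName", {})[ts] = lname
        families["info"].setdefault("fName", {})[ts] = fname
        families["pwd"].setdefault("password", {})[ts] = password
    return table

def get_latest(table, row, family, qualifier):
    """Return the value with the highest timestamp for one cell."""
    versions = table[row][family][qualifier]
    return versions[max(versions)]

table = to_logical_view(rdbms_rows)
print(get_latest(table, "5678", "pwd", "password"))  # wisiwig
```

The two RDBMS rows for 5678 collapse into a single HBase row whose password cell has two versions, and a plain read returns the newest one.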

HBase Physical (How it is stored on disk)

Logical View to Physical View

Let's assume you want to read Row4. You will need data from both physical files. In the case of CF1, you will get two rows since there are two versions of the data.
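The Row4 read can be sketched as follows, assuming one physical file per column family where every entry carries its full coordinates. Row4 and CF1 come from the text above; CF2, the qualifiers, the timestamps, and the values are made up for illustration.

```python
# One "file" per column family; each entry is a full cell:
# (row key, qualifier, timestamp, value). Contents are invented for this sketch.
cf1_file = [
    ("Row1", "c1", 1, "a"),
    ("Row4", "c1", 2, "new"),
    ("Row4", "c1", 1, "old"),
]
cf2_file = [
    ("Row4", "c2", 1, "b"),
]

def read_row(row, files):
    """Collect every cell for `row` from each column family's file."""
    cells = []
    for family, entries in files.items():
        for entry_row, qualifier, ts, value in entries:
            if entry_row == row:
                cells.append((family, qualifier, ts, value))
    return cells

cells = read_row("Row4", {"CF1": cf1_file, "CF2": cf2_file})
for cell in cells:
    print(cell)
```

Reading Row4 touches both files, and CF1 contributes two entries because the cell has two versions.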

HBase Components

Region

This is where a contiguous range of a table's rows is stored
Within a region, each column family is stored separately
A table's data is automatically sharded across multiple regions when the data gets too large.
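Since rows are sorted by key, finding the region for a row can be pictured as a lookup over sorted region boundaries. The sketch below assumes four hypothetical regions identified by their start keys; the region names and split points are invented, and this is not how the real client code is written.

```python
import bisect

# Hypothetical regions, each identified by its start row key. Together they
# cover the whole key space; the first region starts at the empty key.
region_start_keys = [b"", b"g", b"n", b"t"]
region_names = ["region-1", "region-2", "region-3", "region-4"]

def find_region(row_key):
    """Return the region whose key range contains row_key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_names[index]

print(find_region(b"apple"))  # region-1
print(find_region(b"g"))      # region-2
print(find_region(b"zebra"))  # region-4
```

When a region grows too large and is split, the only change in this picture is a new start key in the sorted list.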

Region Server

Contains one or more regions
Hosts the table data and performs reads, writes, buffering, etc.
Clients talk directly to the Region Server for their data.

Master

Coordinates the Region Servers
Monitors the status of the Region Servers and rebalances load across them
Assigns Regions to Region Servers
Multiple Masters are allowed, but only one is the true master, and the others are only backups.
Not part of the read/write path
Highly available with ZooKeeper

ZooKeeper

Critical component for HBase
Ensures one Master is running
Registers Regions and Region Servers
Integral part of the fault tolerance on HBase

HDFS

The Hadoop file system is where the data (physical files) is kept

API

The Java client API.
You can also use SQL if you use Hive to access your data.

Here is how the components relate to each other.

HBase Shell introduction

Starting HBase Instance
$HBASE_HOME/bin/start-hbase.sh

Stopping HBase Instance
$HBASE_HOME/bin/stop-hbase.sh

Start HBase shell
$HBASE_HOME/bin/hbase shell

HBase Shell Commands

See a list of the tables

list

Create a table

create 'testTable', 'cf'

NOTE: testTable is the name of the table and cf is the name of the column family

Insert data into a table
Insert at rowA, column "cf:columnName" with a value of "val1"
put 'testTable', 'rowA', 'cf:columnName', 'val1'

Retrieve data from a table
Retrieve "rowA" from the table "testTable"
get 'testTable', 'rowA'

Delete data from a table
Delete the cell at rowA, column "cf:columnName", where ts1 is the timestamp of the cell
delete 'testTable', 'rowA', 'cf:columnName', ts1

Delete a table:
disable 'testTable'
drop 'testTable'

HBase Clients

HBase Shell - you can do the above CRUD operations using the HBase Shell. However, it will be limiting for more complicated tasks.
Java - you can do the above CRUD operations and more using the Java client API. Java code can also be run as a MapReduce job over HBase data.

NOTE: Some of this material was copied directly from the BigData University online class Using HBase for Real-time Access to your Big Data - Version 2. If you want hands-on labs, more explanation, etc., I suggest you check it out, since all the information in this post comes from there.

Friday, January 16, 2015

HBase Basics