Friday, January 16, 2015

HBase Basics

What is HBase

HBase is the Hadoop database, modeled after Google's Bigtable. It is an Apache Top Level Project, which means it is open source, but it is also embraced and supported by commercial vendors such as IBM. Industry heavy hitters like Facebook and Twitter use it to access BigData. It is written in Java, but there are other APIs for accessing it. It has the following characteristics:

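  • Sparse - most cells are empty, and empty cells take up no storage
  • Distributed - spread out over commodity hardware
  • Persistent - data is saved to disk
  • Multi-dimensional - a value is addressed by row key, column family, column qualifier, and timestamp, so there may be multiple versions of the data
  • Sorted Map - data is stored as key/value pairs sorted by key, so you need a key to access the data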
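
To make these characteristics concrete, here is a minimal sketch in plain Python (not the HBase API). The class name SortedCellMap and its methods are made up for illustration. It keeps cells sorted by (row, column, timestamp), stores nothing for cells that were never written, and supports a scan over a range of row keys.

```python
from bisect import bisect_left


class SortedCellMap:
    """Toy model of a sparse, multi-dimensional sorted map (hypothetical)."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key -> value bytes; unwritten cells use no space

    def put(self, row, column, timestamp, value):
        # Negate the timestamp so that newer versions sort first.
        key = (row, column, -timestamp)
        if key not in self._values:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._values[key] = value

    def scan(self, start_row, stop_row):
        """Yield cells with start_row <= row < stop_row, in key order."""
        i = bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < stop_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


cells = SortedCellMap()
cells.put(b"row2", b"cf:a", 1, b"x")
cells.put(b"row1", b"cf:a", 1, b"old")
cells.put(b"row1", b"cf:a", 2, b"new")
cells.put(b"row3", b"cf:a", 1, b"y")

# Rows come back in sorted order, newest version first within a column.
result = list(cells.scan(b"row1", b"row3"))
```

Because the keys are sorted, a range of rows can be read without looking at the rest of the data.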

NoSQL Technology

  • HBase is a NoSQL datastore
  • NoSQL stands for "Not only SQL"
  • Not intended to replace a RDBMS
  • Suited for specific business needs that require
    • Massive scaling to terabytes and petabytes and larger
    • Commodity Hardware used for scaling out solution
    • Not knowing schema upfront

Why HBase

  • HBase CAN replace costly implementations of RDBMS for BigData applications, but is not meant to replace RDBMS entirely because 
    • It doesn't support SQL
    • Not for transactional processing
    • Does not support table joins
  • Horizontal scaling for very large data sets
  • Ability to add commodity hardware without interruption of service
  • Don't know data types in advance. This allows for a flexible schema.
  • Need RANDOM read/write access to BigData. Reads and writes are very quick and efficient.
  • Sharding - splitting the data between nodes
NOTE: Everything is stored as an array of bytes (except timestamp which is stored as a long integer).
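
Since everything is a byte array, the application decides how to encode and decode its values. The following is a rough sketch in plain Python (not the HBase API); the helper names are made up. It assumes numbers are encoded as 8-byte big-endian integers, which is the same layout as a Java long.

```python
import struct


def long_to_bytes(n):
    # 8-byte big-endian signed integer (assumed encoding, same as a Java long)
    return struct.pack(">q", n)


def bytes_to_long(b):
    return struct.unpack(">q", b)[0]


# Strings have to be encoded by the application as well.
name = "Smith".encode("utf-8")

# Byte arrays compare byte by byte, so the encoding affects sort order:
# as text, "10" sorts before "2"; as fixed-width big-endian numbers, 2 sorts before 10.
text_order = sorted([b"2", b"10"])
number_order = sorted([long_to_bytes(10), long_to_bytes(2)])
```

This matters for row keys in particular, because rows are kept sorted by the bytes of their key.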

HBase vs. RDBMS

Topic | HBase | RDBMS
Hardware architecture | Similar to Hadoop. Clustered commodity hardware. Very affordable. | Typically large scalable multi-processor systems. Very expensive.
Typical Database Size | Terabytes to Petabytes - hundreds of millions to billions of rows | Gigabytes to Terabytes - hundreds of thousands to millions of rows.
Data Layout | A sparse, distributed, persistent, multi-dimensional, sorted map. | Rows or column oriented
Data Types | Bytes only | Rich data type support
Transactions | ACID support on a single row only | Full ACID compliance across rows and tables
Query Language | API primitive commands only, unless combined with Hive or other technologies. | SQL
Indexes | Row-Key only unless combined with other technologies such as Hive or IBM's BigSQL | Yes. On one or more columns.
Throughput | Millions of queries per second | Thousands of queries per second
Fault Tolerance | Built into the architecture. Lots of nodes means each is relatively insignificant. No need to worry about individual nodes. | Requires configuration of the HW and the RDBMS with the appropriate high availability options.


Data Representation Example (RDBMS vs HBase)

RDBMS might look something like this:
ID (Primary Key) | LName | FName | Password | Timestamp
1234 | Smith | John | Hello, world! | 20130710
5678 | Doe | Jane | wysiwyg | 20120825
5678 | Doe | Jane | wisiwig | 20130916
Logical View in HBase
Row-Key | Value (Column-Family, Qualifier, Version)
1234 | info {'lName': 'Smith', 'fName': 'John' }
     | pwd {'password': 'Hello, world!' }
5678 | info {'lName': 'Doe', 'fName': 'Jane' }
     | pwd {'password': 'wisiwig'@ts 20130916,
     |      'password': 'wysiwyg'@ts 20120825 }
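
The logical view can be sketched as nested maps in plain Python (not the HBase API). The passwords and their timestamps come from the RDBMS rows above; the timestamps on Jane Doe's info columns are an assumption, since the example does not state them. The get function is a made-up helper.

```python
# row key -> column family -> qualifier -> {timestamp: value}
table = {
    "1234": {
        "info": {"lName": {20130710: "Smith"}, "fName": {20130710: "John"}},
        "pwd": {"password": {20130710: "Hello, world!"}},
    },
    "5678": {
        # Assumption: the info columns were written once, with the first row.
        "info": {"lName": {20120825: "Doe"}, "fName": {20120825: "Jane"}},
        "pwd": {"password": {20120825: "wysiwyg", 20130916: "wisiwig"}},
    },
}


def get(table, row, family, qualifier, max_versions=1):
    """Return up to max_versions (timestamp, value) pairs, newest first."""
    versions = table[row][family][qualifier]
    newest_first = sorted(versions.items(), reverse=True)
    return newest_first[:max_versions]


current_password = get(table, "5678", "pwd", "password")
password_history = get(table, "5678", "pwd", "password", max_versions=2)
```

Where the RDBMS needs a second row to record the password change, this model keeps one row and adds a second version of a single cell.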


HBase Physical (How it is stored on disk)


Logical View to Physical View

Let's assume you want to read Row4. You will need data from both physical files, one per column family. In the case of CF1, you will get two rows back, since there are two versions of the data.
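
The mapping from the logical view to the physical view can be sketched in plain Python (not the HBase API or its on-disk file format). The sample data below is made up to match the description: Row4 has two versions of one value in CF1 and one value in CF2. The function names and qualifiers are hypothetical.

```python
def to_physical(logical):
    """Flatten {row: {family: {qualifier: {ts: value}}}} into one sorted
    list of (row, qualifier, ts, value) cells per column family."""
    files = {}
    for row, families in logical.items():
        for family, qualifiers in families.items():
            for qualifier, versions in qualifiers.items():
                for ts, value in versions.items():
                    files.setdefault(family, []).append((row, qualifier, ts, value))
    for cells in files.values():
        # Row and qualifier ascending, timestamp descending (newest first).
        cells.sort(key=lambda c: (c[0], c[1], -c[2]))
    return files


def read_row(files, row):
    """Collect the cells for one row from every column family's file."""
    return {family: [c for c in cells if c[0] == row]
            for family, cells in files.items()}


logical = {
    "Row1": {"CF1": {"q1": {1: "a"}}},
    "Row4": {"CF1": {"q1": {1: "old", 2: "new"}}, "CF2": {"q2": {1: "z"}}},
}
files = to_physical(logical)
row4 = read_row(files, "Row4")
```

Reading Row4 touches both lists, and the CF1 list contributes two entries because both versions are stored as separate cells.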

HBase Components

Region

  • This is where the rows of a table are stored; each region holds a contiguous range of row keys
  • Within a region, each column family is stored in its own set of files
  • A table's data is automatically sharded across multiple regions when the data gets too large.
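
Sharding by row key can be sketched as a lookup over sorted split points. This is plain Python, not the HBase API, and the split points and region names are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical split points: region i holds the row keys from
# region_starts[i] up to (but not including) region_starts[i + 1].
region_starts = [b"", b"g", b"p"]
region_names = ["region-0", "region-1", "region-2"]


def locate_region(row_key):
    """Return the name of the region whose key range contains row_key."""
    index = bisect_right(region_starts, row_key) - 1
    return region_names[index]


where_apple = locate_region(b"apple")
where_grape = locate_region(b"grape")
where_zebra = locate_region(b"zebra")
```

Splitting a region that has grown too large then amounts to adding one more split point to the sorted list.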

Region Server

  • Contains one or more regions
  • Hosts the tables, performs reads and writes, buffers, etc
  • Clients talk directly to the Region Server for their data.

Master

  • Coordinates the Region Servers
  • Monitors the status of the Region Servers and handles load rebalancing
  • Assigns Regions to Region Servers
  • Multiple Masters are allowed, but only one is the true master, and the others are only backups.
  • Not part of the read/write path
  • Highly available with ZooKeeper
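
The assignment and rebalancing role can be illustrated with a toy sketch in plain Python. This is not HBase's actual balancer; the greedy strategy, function names, and server names are all assumptions made for illustration.

```python
def assign_regions(regions, servers):
    """Toy balancer: give each region to the least-loaded server."""
    assignment = {server: [] for server in servers}
    for region in regions:
        target = min(servers, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


def handle_failure(assignment, failed_server):
    """Move the regions of a failed server to the remaining servers."""
    orphaned = assignment.pop(failed_server)
    for region in orphaned:
        target = min(assignment, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


assignment = assign_regions(["r1", "r2", "r3", "r4", "r5", "r6"],
                            ["rs1", "rs2", "rs3"])
after_failure = handle_failure(assignment, "rs3")
```

The data itself never passes through this code path, which is consistent with the Master not being part of the read/write path.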

ZooKeeper

  • Critical component for HBase
  • Ensures one Master is running
  • Registers Regions and Region Servers
  • Integral part of the fault tolerance on HBase

HDFS

  • The Hadoop file system is where the data (physical files) is kept

API

  • The Java client API.
  • You can also use SQL if you use Hive to access your data.
Here is how the components relate to each other.



HBase Shell introduction

Starting HBase Instance
$HBASE_HOME/bin/start-hbase.sh

Stopping HBase Instance
$HBASE_HOME/bin/stop-hbase.sh


Start HBase shell
$HBASE_HOME/bin/hbase shell

HBase Shell Commands

See a list of the tables
list

Create a table
create 'testTable', 'cf'
NOTE: testTable is the name of the table and cf is the name of the column family

Insert data into a table
Insert at rowA, column "cf:columnName" with a value of "val1"
put 'testTable', 'rowA', 'cf:columnName', 'val1'

Retrieve data from a table
Retrieve "rowA" from the table "testTable"
get 'testTable', 'rowA'

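Delete data from a table
Delete the cell at rowA, column "cf:columnName" with timestamp ts1 (replace ts1 with the numeric timestamp of the version to delete)
delete 'testTable', 'rowA', 'cf:columnName', ts1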

Delete a table:
disable 'testTable'
drop 'testTable'

HBase Clients

HBase Shell - you can do the above CRUD operations using the HBase Shell. However, it will be limiting for more complicated tasks.
Java - you can do the above CRUD operations and more using Java. Java programs can also be executed as MapReduce jobs over HBase tables.


NOTE: Some of this material was copied directly from the BigData University online class Using HBase for Real-time Access to your Big Data - Version 2. If you want hands-on labs, more explanation, etc., I suggest you check it out, since all the information in this post comes from there.