Friday, January 16, 2015

HBase Basics

What is HBase

HBase is the Hadoop database, modeled after Google's Bigtable. It is an Apache Top Level Project, which means it is open source, but it is also embraced and supported by vendors such as IBM. It is used by industry heavy hitters like Facebook and Twitter to access Big Data. It is written in Java, but there are APIs in other languages to access it. It has the following characteristics:

  • Sparse - empty cells take up no storage
  • Distributed - spread out over commodity hardware
  • Persistent - data will be saved
  • Multi-dimensional - each value may have multiple versions
  • Sorted Map - data is accessed by key, and keys are kept sorted

NoSQL Technology

  • HBase is a NoSQL datastore
  • NoSQL stands for "Not only SQL"
  • Not intended to replace a RDBMS
  • Suited for specific business needs that require
    • Massive scaling to terabytes and petabytes and larger
    • Commodity Hardware used for scaling out solution
    • Not knowing schema upfront

Why HBase

  • HBase CAN replace costly implementations of RDBMS for BigData applications, but is not meant to replace RDBMS entirely because 
    • It doesn't support SQL
    • Not for transactional processing
    • Does not support table joins
  • Horizontal scaling for very large data sets
  • Ability to add commodity hardware without interruption of service
  • Useful when you don't know the data types in advance; this allows for a flexible schema.
  • Need RANDOM read/write access to BigData. Reads and writes are very quick and efficient.
  • Sharding - splitting the data between nodes
NOTE: Everything is stored as an array of bytes (except timestamp which is stored as a long integer).

HBase vs. RDBMS

Topic | HBase | RDBMS
Hardware architecture | Similar to Hadoop: clustered commodity hardware. Very affordable. | Typically large, scalable multi-processor systems. Very expensive.
Typical database size | Terabytes to petabytes (hundreds of millions to billions of rows) | Gigabytes to terabytes (hundreds of thousands to millions of rows)
Data layout | A sparse, distributed, persistent, multi-dimensional, sorted map | Row or column oriented
Data types | Bytes only | Rich data type support
Transactions | ACID support on a single row only | Full ACID compliance across rows and tables
Query language | Primitive API commands only, unless combined with Hive or other technologies | SQL
Indexes | Row key only, unless combined with other technologies such as Hive or IBM's BigSQL | Yes, on one or more columns
Throughput | Millions of queries per second | Thousands of queries per second
Fault tolerance | Built into the architecture; with lots of nodes, each one is relatively insignificant, so there is no need to worry about individual nodes | Requires configuring the hardware and the RDBMS with the appropriate high-availability options


Data Representation Example (RDBMS vs HBase)

RDBMS might look something like this
ID (Primary Key) | LName | FName | Password      | Timestamp
1234             | Smith | John  | Hello, world! | 20130710
5678             | Doe   | Jane  | wisiwig       | 20120825
5678             | Doe   | Jane  | wysiwyg       | 20130916
Logical View in HBase
Row-Key | Column Family | Value (Qualifier: Value @ Version)
1234    | info | {'lName': 'Smith', 'fName': 'John'}
1234    | pwd  | {'password': 'Hello, world!'}
5678    | info | {'lName': 'Doe', 'fName': 'Jane'}
5678    | pwd  | {'password': 'wysiwyg' @ts 20130916,
                  'password': 'wisiwig' @ts 20120825}


HBase Physical (How it is stored on disk)


Logical View to Physical View

Let's assume you want to read Row4. You will need data from both physical files. In the case of CF1, you will get two rows since there are two versions of the data.

HBase Components

Region

  • This is where the rows of a table are stored
  • A region holds a contiguous range of rows; within a region, each column family's data is stored separately
  • A table's data is automatically sharded across multiple regions when the data gets too large.

Region Server

  • Contains one or more regions
  • Hosts the tables, performs reads and writes, buffers, etc
  • Clients talk directly to the Region Server for their data.

Master

  • Coordinates the Region Servers
  • Detects the status of Region Servers and performs load rebalancing across them
  • Assigns Regions to Region Servers
  • Multiple Masters are allowed, but only one is the true master, and the others are only backups.
  • Not part of the read/write path
  • Highly available with ZooKeeper

ZooKeeper

  • Critical component for HBase
  • Ensures one Master is running
  • Keeps track of Region and Region Server registration
  • Integral part of the fault tolerance of HBase

HDFS

  • The Hadoop file system is where the data (physical files) are kept

API

  • The Java client API.
  • You can also use SQL if you use Hive to access your data.
Here is how the components relate to each other.



HBase Shell introduction

Starting HBase Instance
HBASE_HOME/bin/start-hbase.sh

Stopping HBase Instance
HBASE_HOME/bin/stop-hbase.sh


Start HBase shell
HBASE_HOME/bin/hbase shell

HBase Shell Commands

See a list of the tables
list

Create a table
create 'testTable', 'cf'
NOTE: testTable is the name of the table and cf is the name of the column family

Insert data into a table
Insert at rowA, column "cf:columnName" with a value of "val1"
put 'testTable', 'rowA', 'cf:columnName', 'val1'

Retrieve data from a table
Retrieve "rowA" from the table "testTable"
get 'testTable', 'rowA'

Delete data from a table
Delete the value at "rowA", column "cf:columnName", with timestamp ts1 (ts1 is a placeholder for the timestamp of the cell to delete)
delete 'testTable', 'rowA', 'cf:columnName', ts1

Delete a table:
disable 'testTable'
drop 'testTable'

HBase Clients

HBase Shell - you can do the above CRUD operations using the HBase shell. However, it will be limiting for more complicated tasks.
Java - you can do the above CRUD operations and more using the Java client API; HBase tables can also be processed in bulk with MapReduce jobs (see the sketch below).
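As a rough illustration, here is a minimal sketch of the same put/get operations using the classic (pre-1.0) HBase Java client API; the table, column family, and qualifier names simply mirror the shell examples above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TestTableClient {
    public static void main(String[] args) throws Exception {
        // Sketch only: assumes the classic HTable API and an hbase-site.xml on the classpath
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testTable");

        // Equivalent of: put 'testTable', 'rowA', 'cf:columnName', 'val1'
        Put put = new Put(Bytes.toBytes("rowA"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("columnName"), Bytes.toBytes("val1"));
        table.put(put);

        // Equivalent of: get 'testTable', 'rowA'
        Get get = new Get(Bytes.toBytes("rowA"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("columnName"))));

        table.close();
    }
}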


NOTE: Some of this material was copied directly from the BigData University online class Using HBase for Real-time Access to your Big Data - Version 2. If you want hands-on labs, more explanation, etc., I suggest you check it out, since all the information in this post comes from there.

Wednesday, December 17, 2014

Query SQL Server from Powershell using ADO.NET objects

If you want to execute a query against a SQL Server database using Powershell there are a couple of options to consider.

1. SQL Server Invoke-Sqlcmd cmdlet


You can always use the Invoke-Sqlcmd cmdlet, but you will need to have the SQL Server client tools installed, get some paths correct, and add the cmdlet so your script can access it.

2. Use ADO.NET objects.

Since ADO.NET has been part of .NET since the beginning and Powershell can access .NET objects, no additional objects are needed to execute a database query.

Here is a function for reuse.

function Invoke-SQL {
    param(
        [string] $dataSource = ".\SQLEXPRESS",
        [string] $username = $(throw "Please specify a username."),
        [string] $password = $(throw "Please specify a password."),
        [string] $database = "master",
        [string] $sqlCommand = $(throw "Please specify a query.")
      )


    $connectionString = "Data Source=$dataSource; " +
            "User Id=$username; Password=$password; " +
            "Initial Catalog=$database"

    $connection = new-object system.data.SqlClient.SQLConnection($connectionString)
    $command = new-object system.data.sqlclient.sqlcommand($sqlCommand,$connection)
    $connection.Open()

    $adapter = New-Object System.Data.sqlclient.sqlDataAdapter $command
    $dataset = New-Object System.Data.DataSet
    $adapter.Fill($dataSet) | Out-Null

    $connection.Close()
    $dataSet.Tables

}

Here is how you call it.
Invoke-SQL "serverHere" "userHere" "pwdHere" "dbHere" "SELECT GETDATE() AS TimeOfQuery;"  

NOTE: The basis of this script is from here. I did extend it a little bit to work with named users instead of using integrated security.

Friday, November 28, 2014

Pig Basics

What is Pig

Pig is a program that reads files written in PigLatin to generate a MapReduce program automatically. It is generally much easier to use Pig than writing your own MapReduce programs. Grunt is the interactive shell for Pig.

Ways to start Pig

Pig in interactive local mode

Runs in single virtual machine
All files are in the local file system
>pig -x local
grunt>


Pig in Interactive MapReduce mode

Runs in a Hadoop cluster
Is the default mode
>pig (or >pig -x mapreduce)
grunt>

Pig executing Pig script file in local mode

Script is written in PigLatin
>pig -x local myscript.pig


Pig executing Pig script file in MapReduce mode

Script is written in PigLatin
>pig myscript.pig

Pig Data

Tuple - like a row in a file or a row in a table, except that tuples do not all have to have the same number of items. They can contain scalar types such as int, chararray, double, etc., or even bags. Example: (Jeff, {apple, orange, pear}). Parentheses are used to indicate the tuple datatype.
Bag - a bag of tuples. Curly braces are used to indicate the bag datatype.
Relation - an outer bag. Generally it is what you get back when you filter, group, sort, join, etc. the data. In terms of a database, it is kind of like a view or result set.

Data Type | Description | Example
int       | signed 32-bit integer | 300
long      | signed 64-bit integer | 300L or 300l
float     | 32-bit floating point | 3.2F, 3.2f, 3.2e2f, 3.2E2F
double    | 64-bit floating point | 3.2, 3.2e2, 3.2E2
chararray | a string | abcde
bytearray | a blob |
tuple     | ordered set of fields; kind of like a row in a database | (4, Brent, 388.25)
bag       | collection of tuples; kind of like multiple rows in a database. Could also be thought of as an array, list, collection, etc. | {(4, Brent, 388.25), (20, Amanda, 36.7)}
map       | set of key-value pairs |


Pig Latin Basics

Terminate with a semi-colon
/*...*/ comment block
-- single line comment
Names of relations and fields are case sensitive
Function names are case sensitive
keywords such as LOAD, USING, AS, GROUP, BY, etc are NOT case sensitive

Loading Data

A = load '/datadir/datafile' using PigStorage('\t');
NOTE: tab is the default delimiter
NOTE: If the path to a file is a directory then all files in the directory will be loaded

The default is PigStorage, but there are also BinStorage, TextLoader, JsonLoader, and you can code your own loader as well.
You can also define your schema so you can refer to fields by name (f1, f2, f3, etc.):
A = load '/datadir/datafile' using PigStorage('\t') as (f1:int, f2:chararray, f3:float);

If you don't specify the schema you need to use the position. For example, $0 is the first position.
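For example, a quick sketch reusing relation A loaded above without a schema (FOREACH is covered below):

b = foreach A generate $0, $2;   -- project the first and third fields by position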

OUTPUT

STORE is the opposite of LOAD; it writes a relation out to the file system.
Formats: PigStorage(), BinStorage(), PigDump(), JsonStorage()
DUMP writes the results to the screen instead.


Operators

Arithmetic: +, -, *, /, %, and the bincond operator ? :
Boolean: and, or, not
Comparison: ==, !=, <, >, is null, is not null

Parameters

Parameters can be passed into a pig script via a parameter file or the command line.
Parameters are referenced using $
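For example, a sketch (myscript.pig and the year parameter are hypothetical names; a parameter file can be passed similarly with -param_file):

>pig -param year=2014 myscript.pig

and inside the script, $year refers to the value passed in:

b = filter data by pubyear == $year;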

Relational Operators

FILTER

Selects tuples from a relation based on some criteria
b = filter data by pubyear == 2014;

ORDER BY

Sorts a relation on one or more fields
b = order data by author ASC;

FOREACH

Projects fields into a new relation. Under the hood this is just a loop that goes through each tuple in the data. Use it, for example, to return only a subset of the fields, or to calculate new values from fields (see the sketch below).
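A sketch reusing relation A from the load example above (f1:int, f2:chararray, f3:float):

b = foreach A generate f1, f2;        -- keep only a subset of the fields
c = foreach A generate f1, f3 * 2.0;  -- calculate a new value from a field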

GROUP

Groups together tuples that have the same group key; the group key can be a single field or multiple fields (enclose multiple fields with parentheses). The result of a GROUP is a relation that contains one tuple per group. Each tuple has two fields: group (the group key) and a bag containing all of the tuples that have that key.

Example:
data =
(1,2,3)
(4,5,6)
(7,8,9)
(4,3,2)
myGroup = group data by f1;

Result:
(1,{(1,2,3)})
(4,{(4,5,6),(4,3,2)})
(7,{(7,8,9)})

COGROUP

Same as the GROUP operator, but by convention used when grouping multiple (up to 127) relations at the same time. Similar results to GROUP, except the resulting tuple has one group field and then one field for each relation we are cogrouping by. So if we are cogrouping two relations, each resulting tuple would be (group, value for relation1, value for relation2), where the values for relation1 and relation2 are bags of tuples just like with the GROUP operator (see the sketch below).
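For example, a sketch with two hypothetical relations, owners (name, pet) and pets (pet, legs):

grouped = cogroup owners by pet, pets by pet;
-- a resulting tuple looks like: (dog, {(Alice,dog),(Bob,dog)}, {(dog,4)})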

Dereference

Allows us to reference a field in a tuple or bag that is outside the scope of the current operator. This can be used with the FOREACH operator.
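For example, a sketch building on myGroup from the GROUP example above (assuming the tuples in data have the schema f1, f2, f3):

thirds = foreach myGroup generate group, data.f3;   -- data.f3 dereferences field f3 inside the bag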

DISTINCT

Removes duplicate tuples found in a relation

UNION

Merges the contents of two or more relations.
NOTE: The relations do NOT have to have the same number of fields in them like you would in SQL.

SPLIT

Partitions a relation into two or more relations based on some condition

CROSS

Computes the cross product of two or more relations

JOIN / INNER

Performs a join (equijoin) on two or more relations using one or more common field values. Like a SQL join.

JOIN / OUTER (full, right, left)

Performs a join on two or more relations using one or more common fields. Works like you would expect if you are familiar with SQL outer joins.

 

Evaluation Functions 

Requires GROUP ALL or GROUP BY

  • COUNT - Counts the number of elements in a bag (null values are ignored)
  • COUNT_STAR - Computes the number of elements in a bag, including nulls
  • MAX - Computes the maximum value in a single-column bag
  • MIN - Computes the minimum value in a single-column bag
  • SUM - Computes the sum of the numeric values in a single-column bag
  • AVG - Computes the average of the numeric values in a single-column bag

Do NOT require GROUP ALL or GROUP BY

  • CONCAT - Concatenates two columns
  • DIFF - Compares two fields in a tuple
  • IsEmpty - Checks if a bag or map is empty
  • SIZE - Computes the number of elements based on any Pig data type
  • TOKENIZE - splits a string and outputs a bag of words

Math Functions

Based on Java Math class
Samples: ABS, CEIL, etc

String Functions

Based on Java String class
Samples: STRSPLIT, SUBSTRING, REPLACE, REGEX_EXTRACT, REGEX_EXTRACT_ALL etc

Tuple, Bag, and Map Functions

  • TOTUPLE - converts one or more expressions to tuple
  • TOBAG - converts one or more expressions to type bag
  • TOMAP - converts pairs of expressions into a map

External Type Operators

  • MAPREDUCE - Executes native MapReduce jobs inside a Pig script
  • STREAM - Sends data to an external script or program
  • REGISTER - Registers a JAR file so that the UDFs in the file can be used.
  • fs - invokes any FSShell command from within script or the Grunt shell
  • exec - runs a Pig script from within the Grunt shell, e.g. grunt> exec myscript.pig
  • EXPLAIN - displays the execution plan. Used to review the logical, physical, and MapReduce execution plans

  References:

Pig Latin Basics - great reference
Introduction to Pig at Big Data University - a nice free training class. Nearly all the information above is from this class, in some cases copied.

Tuesday, September 30, 2014

Hadoop (HDFS) Command line basics

Once you get Hadoop installed you can open a terminal (aka command line). There is a program called hadoop, and we pass it different switches and arguments to make it do what we want. Most of the hadoop fs (file shell) commands behave like the corresponding UNIX commands. Below are some of the commands you may find useful.

As a general rule, all hadoop filesystem commands start with hadoop fs.

Referencing HDFS Paths
When accessing the HDFS filesystem you need to use the hostname and port associated with the name node. In the examples below, the host is bivm and the name node is running on port 9000. You can also just make a relative reference by leaving off the hdfs://bivm:9000/

For example, to copy a file from the local file system to the HDFS file system we could specify the full path with:

hadoop fs -put file:///home/biadmin/test.txt hdfs://bivm:9000/user/biadmin/test.txt

However, we can also do

hadoop fs -put file:///home/biadmin/test.txt test.txt

This convention applies to all hadoop fs commands.

help - To get help and see all the commands for hadoop fs
hadoop fs -help

help - Get help on a fs command
hadoop fs -help du

ls - Show the files in the current user's directory

hadoop fs -ls
or
hadoop fs -ls /user/biadmin (assuming the user name is biadmin)


ls - Show the files in the user directory

hadoop fs -ls /user

cp - Copy a file from local file system to HDFS
hadoop fs -cp file:///home/biadmin/test.txt hdfs://bivm:9000/user/biadmin/test.txt

put or copyFromLocal - Copy files from the local filesystem (the opposite of copyToLocal)
hadoop fs -put file:///home/biadmin/test.txt hdfs://bivm:9000/user/biadmin/test.txt
hadoop fs -copyFromLocal file:///home/biadmin/test.txt hdfs://bivm:9000/user/biadmin/test.txt


get or copyToLocal - Copies files from HDFS to the local filesystem (the opposite of copyFromLocal)
hadoop fs -copyToLocal hdfs://bivm:9000/user/biadmin/test.txt file:///home/biadmin/test.txt
hadoop fs -get hdfs://bivm:9000/user/biadmin/test.txt file:///home/biadmin/test.txt


tail - View the last few lines of a file
hadoop fs -tail hdfs://bivm:9000/user/biadmin/test.txt

cat - View the entire contents of a file
hadoop fs -cat hdfs://bivm:9000/user/biadmin/test.txt

rm - remove a file
hadoop fs -rm hdfs://bivm:9000/user/biadmin/test.txt

du - find the size of a file
hadoop fs -du hdfs://bivm:9000/user/biadmin/test.txt

du - get the size of all files in a directory
 hadoop fs -du hdfs://bivm:9000/user/biadmin

du - get the total size of all files in a directory
hadoop fs -du -s hdfs://bivm:9000/user/biadmin

mkdir - make a new directory
hadoop fs -mkdir hdfs://bivm:9000/user/biadmin/test

Other Unix-like HDFS Commands

chgrp
chmod
chown
mkdir
mv
rm
stat
setrep - Sets the replication factor of a file, or of an entire tree when used with -R
getmerge - Takes all the files in the directories that match the source pattern and merges them into a single file on the local filesystem (see the examples below)
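For example (sketches using the same bivm host as above; the logs directory and merged.txt file are hypothetical):

hadoop fs -setrep 3 hdfs://bivm:9000/user/biadmin/test.txt
hadoop fs -getmerge hdfs://bivm:9000/user/biadmin/logs /home/biadmin/merged.txt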

Piping
You can pipe the results of these commands to unix commands. For example, we can grep the result of the ls command.

hadoop fs -ls | grep test

The result would be something like:
-rw-r--r--   1 biadmin biadmin          5 2014-09-24 00:48 test.txt

Saturday, September 20, 2014

Big Data Concepts

Why Big Data

Big Data conceptually has been around ever since we had data. The concept is how to store and process data on hardware that is smaller than the data itself. Big Data, as the name implies, is needed when dealing with very large amounts of data. Big is relative, but generally we are dealing with terabytes, petabytes, exabytes, etc. However, there is no real threshold for using Big Data. Consider that a person's DNA sequence is only about 800MB, but it contains 4 billion pieces of information and has lots of patterns in it. The problem is that processing is slow using conventional databases. Big Data would still be a good candidate for this because of the complexity of the data and the processing power needed to analyze it. It is great for unstructured data, but can be used with structured data as well.

In short, the amount of data being generated is growing exponentially, and most of that data is unstructured or semi-structured. To process that data we generally need more power and storage than a single database or server can handle.

The 3 V's of Big Data


  • Velocity - how fast is data being produced
  • Volume - how much data is being produced
  • Variety - how different is the data

What is Hadoop

Apache, the creators of Hadoop say
"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."

I would add a few generalizations of my own:

  • You can kind of think of it as a software RAID in that it spreads data among different disks. Instead of a hardware controller, there is a dedicated name node that does the coordination between computers (instead of between disks on one machine as with RAID). The big advantage is that we are no longer bound by how much we can fit into one server, the processing power of one server, or the IO of a single hard disk, because all requests are done in parallel.
  • It is more than a place to store files, though one part of it is HDFS, the distributed file system. It includes an ever-growing collection of tools to process the data.
  • It is a self-healing technology: if one computer or rack of computers goes down, it will detect the failure and use the other copies that are on other computers. Assuming there is space available somewhere, it will rebuild the lost replicas to reduce the risk if another server goes down.
  • The great thing is that when we run out of space (and in Big Data you will, by the nature of the domain), we can add more computers to the configuration and tell Hadoop to re-balance, and it will move data around to make use of the new space.


The core of many Big Data systems


  • Open source project by The Apache Software Foundation
  • Written in Java
  • Great Performance
  • Reliability provided by replication of data between computers


Optimized to handle


  • Massive amounts of data through parallelism
  • A variety of data (unstructured, semi-structured, and structured)
  • Inexpensive commodity hardware

Projects Associated with Hadoop


  • Eclipse is a popular IDE donated by IBM to the open source community.
  • Lucene is a text search engine library written in Java.
  • HBase is the Hadoop database.
  • Hive provides data warehousing tools to extract, transform and load data, and then query this data stored in Hadoop files.
  • Pig is a high level language that generates MapReduce code to analyze large data sets.
  • Jaql is a query language for JavaScript Object Notation (JSON).
  • ZooKeeper is a centralized configuration service and naming registry for large distributed systems.
  • Avro is a data serialization system.
  • UIMA is the architecture for the development, discovery, composition, and deployment for the analysis of unstructured data.

What it is NOT good for


  • Not designed for OLTP, OLAP. It is not a replacement for RDBMS
  • Not designed for random access such as is the case with an RDBMS
  • Not good for processing lots of little files, but vendors are working to make this work better.
  • Not good for low latency data access
  • Not good for work that must be sequential or cannot be parallelized
  • Not good for complex calculations with little data.

Typical Sources for Big Data



  • RFID Readers
  • Shopping / Transactions
  • Mobile Devices
  • Internet users
  • Twitter
  • Sensor data


Friday, August 15, 2014

Free Converter for Excel files (.xls to .xlsx)

I was pleasantly surprised that Excel 2010 has a command line tool that will convert .xls to .xlsx files. Here is how you use it.

In this example we are converting a file called MyFile.xls in the temp directory to MyFile.xlsx in the same temp directory. I didn't have any luck getting this to work without using absolute paths.

  1. Open a command prompt 
  2. Execute the following (after adjusting it to meet your specific environment):
    "C:\Program Files\Microsoft Office\Office14\excelcnv.exe" -oice "C:\temp\MyFile.xls" "C:\temp\MyFile.xlsx"


Since this is command line driven you can script many files easily.


@ECHO OFF
FOR %%X in (".\*.xls") DO IF NOT %%~xX == .xlsx echo Converting "%%~dpnxX"  & "C:\Program Files\Microsoft Office\Office14\excelcnv.exe"  -nme -oice "%%~dpnxX" "%%~dpnX.xlsx" 


WARNING: For some reason Excel may tell you that it didn't launch correctly and ask you if you want to run it in safemode. I have no clue as to why this is, but clicking No each time seems to work fine. It will do this for each file.

Friday, July 18, 2014

Remove alpha characters from string using SQL

If you have a string that has both numbers and alpha characters in it and want to remove all letters A-Z then this is a simple function that you can use on a column in SQL Server.


create function RemoveAlphas(@Text as nvarchar(255))
returns nvarchar(255)
as
BEGIN
Declare @Result as nvarchar(255)
Set @Result = @Text
Set @Result = Replace(@Result, 'A', '')
Set @Result = Replace(@Result, 'B', '')
Set @Result = Replace(@Result, 'C', '')
Set @Result = Replace(@Result, 'D', '')
Set @Result = Replace(@Result, 'E', '')
Set @Result = Replace(@Result, 'F', '')
Set @Result = Replace(@Result, 'G', '')
Set @Result = Replace(@Result, 'H', '')
Set @Result = Replace(@Result, 'I', '')
Set @Result = Replace(@Result, 'J', '')
Set @Result = Replace(@Result, 'K', '')
Set @Result = Replace(@Result, 'L', '')
Set @Result = Replace(@Result, 'M', '')
Set @Result = Replace(@Result, 'N', '')
Set @Result = Replace(@Result, 'O', '')
Set @Result = Replace(@Result, 'P', '')
Set @Result = Replace(@Result, 'Q', '')
Set @Result = Replace(@Result, 'R', '')
Set @Result = Replace(@Result, 'S', '')
Set @Result = Replace(@Result, 'T', '')
Set @Result = Replace(@Result, 'U', '')
Set @Result = Replace(@Result, 'V', '')
Set @Result = Replace(@Result, 'W', '')
Set @Result = Replace(@Result, 'X', '')
Set @Result = Replace(@Result, 'Y', '')
Set @Result = Replace(@Result, 'Z', '')
return @Result
END

Usage:

select dbo.RemoveAlphas('123abc456DEF')

Returns 123456

NOTE: REPLACE uses the collation of the input, so with the default case-insensitive collation the lowercase letters are removed as well. With a case-sensitive collation you would also need to replace 'a' through 'z'.