Tuesday, September 30, 2014

Hadoop (HDFS) Command line basics

Once you get Hadoop installed you can open the a terminal (aka command line). There is a program called hadoop and we need to pass it different switches and arguments to make it do what we want. Most of the hadoop fs (file shell) commands behave like the corresponding UNIX commands. Below are some of the commands you may find useful.

As a general rule, all hadoop filesystem commands start with hadoop fs.

Referencing HDFS Paths
When accessing the HDFS filesystem you need to use the hostname and port associated with the name node. In the examples below, the host is bivm and the name node is running on port 9000. You can also just make a relative reference by leaving off the hdfs://bivm:9000/

For example, to copy a file from the local file system to the HDFS file system we could specify the full path with:

hadoop fs -put file:///home/biadmin/test.txt hdfs://bivm:9000/user/biadmin/test.txt

However, we can also do

hadoop fs -put file:///home/biadmin/test.txt test.txt

This convention applies to all hadoop fs commands.

help - To get help and see all the commands for hadoop fs
hadoop fs -help

help - Get help on a fs command
hadoop fs -help du

ls - Show the files in the current user's directory

hadoop fs -ls
hadoop fs -ls /user/biadmin(assuming the user name is biadmin)

ls - Show the files in the user directory

hadoop fs -ls /user

cp - Copy a file from local file system to HDFS
hadoop fs -cp file:///home/biadmin/test.txt hdfs://bivm:9000/user/biadmin/test.txt

put or copyFromLocal - Copy files from the local filesystem (the opposite of copyToLocal)
hadoop fs -put file:///home/biadmin/test.txt hdfs://bivm:9000/user/biadmin/test.txt
hadoop fs -copyFromLocal file:///home/biadmin/test.txt hdfs://bivm:9000/user/biadmin/test.txt

get or copyToLocal or get - Copies files from HDFS to the local filesystem (the opposite of copyFromLocal)
hadoop fs -copyToLocal hdfs://bivm:9000/user/biadmin/test.txt file:///home/biadmin/test.txt
hadoop fs -get hdfs://bivm:9000/user/biadmin/test.txt file:///home/biadmin/test.txt

tail - View the last few line of a file
hadoop fs -tail hdfs://bivm:9000/user/biadmin/test.txt

cat -View the entire contents of a file
hadoop fs -cat hdfs://bivm:9000/user/biadmin/test.txt

rm - remove a file
hadoop fs -rm hdfs://bivm:9000/user/biadmin/test.txt

du - find the size of a file
hadoop fs -du hdfs://bivm:9000/user/biadmin/test.txt

du - get the size of all files in a directory
 hadoop fs -du hdfs://bivm:9000/user/biadmin

du - get the total size of all files in a directory
hadoop fs -du -s hdfs://bivm:9000/user/biadmin

mkdir - make a new directory
hadoop fs -mkdir hdfs://bivm:9000/user/biadmin/test

Other Unix-like HDFS Commands

setRep - Sets the replication factor of a file or Sets the replication factor  of a entire tree
getMerge - Gets all files in the directories that match the source pattern and also merges and sorts them into only one file on the local filesystem

You can pipe the results of these commands to unix commands. For example, we can grep the result of the ls command.

hadoop fs -ls | grep test

The result would be something like:
-rw-r--r--   1 biadmin biadmin          5 2014-09-24 00:48 test.txt

Saturday, September 20, 2014

Big Data Concepts

Why Big Data

Big Data conceptually has been around every since we had data. The concept is how do store and process data on hardware that is smaller than the data itself. Big Data as the name implies is needed when dealing with very large amounts of data. Big is relative, but generally we are dealing with terabytes, petabytes, exabytes, etc. However, there is no real threshold for using Big Data. Consider that a person's DNA sequence is only about 800MB, but it contains 4 billion pieces of information and has lots of patterns in it. The problem is that processing is slow using conventional databases. Big data would still be a good candidate for this because of the complexity of the data and processing power needed to analyze it. It is great for unstructured data, but can be used with structured data as well.

In short the amount of data being generated is growing exponentially and most of that data is unstructured or semi-structured. To process that data we generally need more power and storage than a single database, server, etc can handle.

The 3 V's of Big Data

  • Velocity - how fast is data being produced
  • Volume - how much data is being produced
  • Variety - how different is the data

What is Hadoop

Apache, the creators of Hadoop say
"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."

I would add some not generalizations:

  • You can kind of think of it as a software RAID in that it spreads data among different disks. Instead of hardware controller there is a dedicated name server that does the coordination between computers (instead of disks on one machine as with RAID). The big advantage of this is we are no longer bound to how much we can fit into one server and the processing power of one server and the IO of a hard disk because all requests are in parallel.
  • It is more than a place to store files, though one part of it is actually HDFS which the distributed file system. It includes an ever growing collection of tools to process the data.
  • It is a self healing technology such that if one computer or rack of computers goes down then it will detect this failure and use the other copies that are on other computers. Assuming there is available space available somewhere it will rebuild the data to reduce the risk if another server(s) goes down.
  • The great thing is that when we run out of space (and in big data you will by the nature of the domain) that we can add more computers to the configuration and tell Hadoop to re-balance and it will move data around to make use of the new space. 

The core of many Big Data systems

  • Open source project by The Apache Software Foundation
  • Written in Java
  • Great Performance
  • Reliability provided by replication of data between computers

Optimized to handle

  • Massive amounts of data through parallelism
  • A variety of data (unstructured, semi-structured, and structured)
  • Inexpensive commodity hardware

Projects Associated with Hadoop

  • Eclipse is a popular IDE donated by IBM to the open source community. 
  • Lucene is a text search engine library written in Java. 
  • Hbase is the Hadoop database. 
  • Hive provides data warehousing tools to extract, transform and load 
  • data, and then, query this data stored in Hadoop files. 
  • Pig is a high level language that generates MapReduce code to analyze 
  • large data sets. 
  • Jaql is a query language for JavaScript open notation. 
  • ZooKeeper is a centralized configuration service and naming registry for 
  • large distributed systems. 
  • Avro is a data serialization system. 
  • UIMA is the architecture for the development, discovery, composition 
  • and deployment for the analysis of unstructured data. 

What it is NOT good for

  • Not designed for OLTP, OLAP. It is not a replacement for RDBMS
  • Not designed for random access such is the case with RDBMS
  • Not good for processing lots of little files, but vendors are working to make this work better.
  • Not good for low latency data access
  • Not good for work that must be sequential or cannot be parallelized
  • Not good for complex calculations with little data.

Typical Sources for Big Data

  • RFID Readers
  • Shopping / Transactions
  • Mobile Devices
  • Internet users
  • Twitter
  • Sensor data

Friday, August 15, 2014

Free Convert for Excel files (.xls to .xlsx)

I was pleasantly surprised that Excel 2010 has a command line tool that will convert .xls to .xlsx files. Here is how you use it.

In this example we are converting a file called MyFile.xls in the temp directory to MyFle.xlsx in the same temp directory. I didn't have any luck getting this to work without using absolute path.

  1. Open a command prompt 
  2. Execute the following (after adjusting it to meet your specific environment:
    "C:\Program Files\Microsoft Office\Office14\excelcnv.exe" -oice "C:\temp\MyFile.xls" "C:\temp\MyFile.xlsx"

Since this is command line driven you can script many files easily.

FOR %%X in (".\*.xls") DO IF NOT %%~xX == .xlsx echo Converting "%%~dpnxX"  & "C:\Program Files\Microsoft Office\Office14\excelcnv.exe"  -nme -oice "%%~dpnxX" "%%~dpnX.xlsx" 

WARNING: For some reason Excel may tell you that it didn't launch correctly and ask you if you want to run it in safemode. I have no clue as to why this is, but clicking No each time seems to work fine. It will do this for each file.

Friday, July 18, 2014

Remove alpha characters from string using SQL

If you have a string that has both numbers and alpha characters in it and want to remove all letters A-Z then this is a simple function that you can use on a column in SQL Server.

create function RemoveAlphas(@Text as nvarchar(255))
returns nvarchar(255)
Declare @Result as nvarchar(255)
Set @Result = @Text
Set @Result = Replace(@Result, 'A', '')
Set @Result = Replace(@Result, 'B', '')
Set @Result = Replace(@Result, 'C', '')
Set @Result = Replace(@Result, 'D', '')
Set @Result = Replace(@Result, 'E', '')
Set @Result = Replace(@Result, 'F', '')
Set @Result = Replace(@Result, 'G', '')
Set @Result = Replace(@Result, 'H', '')
Set @Result = Replace(@Result, 'I', '')
Set @Result = Replace(@Result, 'J', '')
Set @Result = Replace(@Result, 'K', '')
Set @Result = Replace(@Result, 'L', '')
Set @Result = Replace(@Result, 'M', '')
Set @Result = Replace(@Result, 'N', '')
Set @Result = Replace(@Result, 'O', '')
Set @Result = Replace(@Result, 'P', '')
Set @Result = Replace(@Result, 'Q', '')
Set @Result = Replace(@Result, 'R', '')
Set @Result = Replace(@Result, 'S', '')
Set @Result = Replace(@Result, 'T', '')
Set @Result = Replace(@Result, 'U', '')
Set @Result = Replace(@Result, 'V', '')
Set @Result = Replace(@Result, 'W', '')
Set @Result = Replace(@Result, 'X', '')
Set @Result = Replace(@Result, 'Y', '')
Set @Result = Replace(@Result, 'Z', '')
return @Result


select dbo.RemoveAlphas('123abc456DEF')

Returns 123456

Wednesday, July 9, 2014

Using T-SQL to format date as yyyy-mm-dd

Surprisingly, MS SQL Server doesn't provide custom date formatting. Instead you need to use one of their existing formats or use C# to implement it, but the later seems a bit overkill for our purposes here. The formats are defined here. Below are some ways to get a datetime or date column to print out in the yyyy-mm-dd format.

Method 1
This is simple and straight forward. Interestingly, I don't see it defined here, but it works.

Since it is not documented it is up to you if you want to use it or not. It has been around for many years, but it is unknown if it will be there in the future. Use this option at your own risk.

In this example, the length of 20 is used to show it doesn't matter, but any size could be used since it actually gives us the format we are looking for.


That will give you 2014-07-09.

NOTE: If you would like explore other undocumented formats, check this page out.

Method 2

A clever way is to use the 126 format which has hours, minutes, seconds, etc in it and just take the first 10 characters which is in the format yyyy-mm-dd. We could get a substring, but there really isn't a need since it will be implicitly truncated to 10 characters when we use char(10) or varchar(10) as our datatype we are converting to.

SELECT CONVERT(char(10), GetDate(),126)

That will give you 2014-07-09.

Method 2
If you have slightly different format requirement such as slashes or
select Replace(convert(nvarchar(10), GETDATE(), 102), '.', '-')

This works very simply because 102 is defined in the format yyyy.mm.dd and I am just replacing the periods with dashes. Pretty simple and effective.

Method 3
If you want it in the format yyyymmdd (no dashes) then you can use the 112 format as shown below.

select convert(nvarchar(10), GETDATE(), 112)

That will give you 20140709.

Wednesday, July 2, 2014

Get a list of tables in SQL Server that don't have primary keys defined

In SQL Server it is a best practice for all tables to have primary keys defined. A primary key is really a constraint. Below is a query to get a list of tables and the name of the primary key contstraint that is associated with that table. If the CONSTRAINT_NAME column is null then it doesn't have a primary key defined.

order by TABLE_NAME

You can add a where clause such as

where c.CONSTRAINT_TYPE is null 

to filter the results to just the tables that don't have a primary key.

Find blank rows in a SQL Server Database

After importing data into a SQL Server database there are sometimes blank rows that get created depending on what your data source looks like. Often when using Excel as a data source extra rows will be created with all blank values. Since the table doesn't by default have a primary key all columns can be null. Find what tables have blank rows and then deleting them can be time consuming. The script here will make this much easier.

I have used this script successfully on my databases, but please, please, please make a backup of your database BEFORE you execute the following since it can affect all your tables. I am of course not responsible for any data loss caused by this script. 

Executing the script below does NOT actually do the deletes. You will still need to copy and paste the generated SQL into SSMS and execute it. I highly suggest you read the generated SQL to make sure it is doing what you want it to before you do the final execution of the generated sql.

create table #BlankRowCounts(TableName nvarchar(255), NumBlankRows int)
Declare @SQL as nvarchar(MAX)
    @SQL = ISNULL(@SQL + ' union ' , '')
    + 'select ' +
'''' + TABLE_NAME + ''' as TableName, ' +
'COUNT(1) as NumBlankRows' +
' from ' +
'[' + TABLE_NAME + ']' +
' where ' +
dbo.GetColumnList(TABLE_NAME, 1, ' is null and ') + ' is null'

insert into #BlankRowCounts(TableName, NumBlankRows)
exec sp_executesql @SQL

select TableName, NumBlankRows,
'select ' +
'''' + TableName + ''' as TableName, ' +
dbo.GetColumnList(TableName, 1, ', ') + 
' from ' +
'[' + TableName + ']' +
' where ' +
dbo.GetColumnList(TableName, 1, ' is null and ') + ' is null' as SelectStmt,
'delete from ' +
'[' + TableName + ']' +
' where ' +
dbo.GetColumnList(TableName, 1, ' is null and ') + ' is null' as DeleteStmt
order by NumBlankRows desc
drop table #BlankRowCounts

You will also need to get the code for GetColumnList() function here.

The results of this SQL are simple. There are four columns:
  • TableName - The table for which the statements will affect
  • NumBlankRows - The number of rows in the table (see TableName) that have all blank columns
  • SelectStmt - The select statement you can copy and paste into SSMS to actually see for yourself that the columns are null. You don't have to execute these, but they are here to convince yourself that the data is blank.
  • DeleteStmt - The delete statement you can copy and paste into SSMS to actually do the deleting of the rows that have all the columns as null. I highly recommend reading this BEFORE you execute it. Also, consider backing up your data if the data is important to you.