Saturday, September 20, 2014

Big Data Concepts

Why Big Data

Conceptually, Big Data has been around ever since we have had data. The core problem is how to store and process data on hardware that is smaller than the data itself. As the name implies, Big Data techniques are needed when dealing with very large amounts of data. Big is relative, but generally we are dealing with terabytes, petabytes, exabytes, etc. However, there is no real threshold for using Big Data. Consider that a person's DNA sequence is only about 800MB, but it contains roughly 3 billion base pairs and has lots of patterns in it. Processing it with conventional databases is slow, so it is still a good candidate for Big Data tools because of the complexity of the data and the processing power needed to analyze it. Big Data tools are great for unstructured data, but can be used with structured data as well.

In short, the amount of data being generated is growing exponentially, and most of that data is unstructured or semi-structured. To process that data we generally need more processing power and storage than a single database or server can handle.

The 3 V's of Big Data

  • Velocity - how fast is data being produced
  • Volume - how much data is being produced
  • Variety - how different is the data

What is Hadoop

The Apache Software Foundation, which develops Hadoop, says:
"The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures."

I would add some generalizations:

  • You can think of it as a kind of software RAID, in that it spreads data among different disks. Instead of a hardware controller coordinating disks on one machine, as with RAID, a dedicated name server (the NameNode in HDFS) coordinates between computers. The big advantage is that we are no longer bound by how much fits on one server, by the processing power of one server, or by the IO of a single hard disk, because requests can be served in parallel.
  • It is more than a place to store files, though one part of it is HDFS, which is the distributed file system. It also includes an ever-growing collection of tools to process the data.
  • It is a self-healing technology: if one computer or rack of computers goes down, it detects the failure and uses the copies that are on other computers. Assuming there is free space somewhere, it rebuilds the missing copies to reduce the risk of data loss if another server goes down.
  • The great thing is that when we run out of space (and in big data you will, by the nature of the domain), we can add more computers to the configuration and tell Hadoop to re-balance, and it will move data around to make use of the new space.
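
The replication, self-healing, and re-balancing ideas above can be sketched with a small toy model. This is only an illustration under stated assumptions, not how HDFS is implemented: the class name ToyCluster, its methods, and the replication factor of 3 are all hypothetical choices made for this sketch.

```python
import random

REPLICATION = 3  # assumed number of copies per block for this sketch


class ToyCluster:
    """Hypothetical coordinator that tracks which nodes hold which blocks."""

    def __init__(self, nodes, seed=0):
        self.nodes = set(nodes)
        self.block_map = {}  # block id -> set of nodes holding a copy
        self.rng = random.Random(seed)

    def store(self, block_id):
        # Place copies of the block on distinct nodes.
        copies = min(REPLICATION, len(self.nodes))
        self.block_map[block_id] = set(self.rng.sample(sorted(self.nodes), copies))

    def fail(self, node):
        # Remove the node, then re-replicate every block that lost a copy.
        self.nodes.discard(node)
        for holders in self.block_map.values():
            holders.discard(node)
            spare = sorted(self.nodes - holders)
            while len(holders) < REPLICATION and spare:
                holders.add(spare.pop(self.rng.randrange(len(spare))))

    def add_node(self, node):
        self.nodes.add(node)

    def load(self):
        # Number of block copies stored on each live node.
        counts = {n: 0 for n in self.nodes}
        for holders in self.block_map.values():
            for n in holders:
                counts[n] += 1
        return counts

    def rebalance(self):
        # Move one copy at a time from the fullest node to the emptiest node
        # until their loads differ by at most one.
        while True:
            counts = self.load()
            fullest = max(sorted(counts), key=counts.get)
            emptiest = min(sorted(counts), key=counts.get)
            if counts[fullest] - counts[emptiest] <= 1:
                return
            for block_id in sorted(self.block_map):
                holders = self.block_map[block_id]
                if fullest in holders and emptiest not in holders:
                    holders.remove(fullest)
                    holders.add(emptiest)
                    break


if __name__ == "__main__":
    cluster = ToyCluster(["n1", "n2", "n3", "n4", "n5"])
    for block in range(10):
        cluster.store(block)
    cluster.fail("n2")       # lost copies are rebuilt on the surviving nodes
    cluster.add_node("n6")   # new, empty node
    cluster.rebalance()      # copies move so the new node is used
    print(cluster.load())
```

The point of the sketch is that the coordinator only tracks metadata (who holds what); the data itself lives on the worker nodes, which is why losing one worker does not lose any block as long as other copies exist.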

The core of many Big Data systems

  • Open source project by The Apache Software Foundation
  • Written in Java
  • Great Performance
  • Reliability provided by replication of data between computers

Optimized to handle

  • Massive amounts of data through parallelism
  • A variety of data (unstructured, semi-structured, and structured)
  • Inexpensive commodity hardware

Projects Associated with Hadoop

  • Eclipse is a popular IDE donated by IBM to the open source community.
  • Lucene is a text search engine library written in Java.
  • HBase is the Hadoop database.
  • Hive provides data warehousing tools to extract, transform and load data, and then query this data stored in Hadoop files.
  • Pig is a high level language that generates MapReduce code to analyze large data sets.
  • Jaql is a query language for JavaScript Object Notation (JSON).
  • ZooKeeper is a centralized configuration service and naming registry for large distributed systems.
  • Avro is a data serialization system.
  • UIMA is an architecture for the development, discovery, composition and deployment of analytics for unstructured data.
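
Several of these tools (Pig, for example) ultimately express work as MapReduce jobs. As a rough illustration of that programming model, here is a minimal word count in plain Python. It is a single-process sketch, not Hadoop's Java API and not what Pig actually generates; the function names map_phase, shuffle, reduce_phase, and word_count are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big deal"]))
```

On a real cluster the map calls would run in parallel on the machines that hold each piece of the input, and the shuffle would move data across the network; here everything happens in one process to keep the model visible.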

What it is NOT good for

  • Not designed for OLTP or OLAP. It is not a replacement for an RDBMS
  • Not designed for random access of the kind an RDBMS provides
  • Not good for processing lots of little files, but vendors are working to make this work better.
  • Not good for low latency data access
  • Not good for work that must be sequential or cannot be parallelized
  • Not good for complex calculations with little data.
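
To see why lots of little files are a problem, a back-of-the-envelope calculation helps. The sketch below compares the metadata the name server has to track when the same amount of data is stored as many small files versus a few large ones. The 128 MB block size and the figure of about 150 bytes of name server memory per file or block are assumptions (the latter is a commonly quoted rule of thumb, not a measured value), and the function names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024**2  # assumed HDFS block size (128 MB)
BYTES_PER_OBJECT = 150      # assumed rule-of-thumb memory cost per file or block entry


def metadata_objects(file_count, file_size):
    # Every file needs one file entry plus one entry per block it occupies.
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return file_count * (1 + blocks_per_file)


def metadata_bytes(file_count, file_size):
    return metadata_objects(file_count, file_size) * BYTES_PER_OBJECT


if __name__ == "__main__":
    one_mib, one_gib = 1024**2, 1024**3
    # The same 1 TiB of data, stored two different ways.
    small = metadata_bytes(file_count=1024 * 1024, file_size=one_mib)
    large = metadata_bytes(file_count=1024, file_size=one_gib)
    print(small, large, small / large)
```

Under these assumptions, 1 TiB stored as 1 MiB files costs roughly 300 MB of name server memory, while the same data stored as 1 GiB files costs a little over 1 MB, which is more than 200 times less.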

Typical Sources for Big Data

  • RFID Readers
  • Shopping / Transactions
  • Mobile Devices
  • Internet users
  • Twitter
  • Sensor data
