Introduction to Apache Hadoop and friends

Hadoop is a MapReduce framework for processing large datasets in parallel on clusters of commodity hardware. It is cheaper because it is an open-source solution that runs on commodity hardware while handling petabytes of data, and it is faster on massive data volumes because processing happens in parallel.

A complete Hadoop MapReduce based solution may consist of the following layers.

1. Hadoop core – HDFS (Hadoop Distributed File System)

This is the data storage system, where data is split into large blocks (64 MB or 128 MB by default) and stored across the cluster. It can scale to thousands of nodes and is inspired by the Google File System, which solved the problem of indexing the web. For fault tolerance, HDFS by default keeps 3 replicas of each block it stores; as a consequence, replicas of the same data can occasionally be temporarily inconsistent.
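The arithmetic behind block splitting and replication can be sketched as follows. This is a simplified illustration, not the actual HDFS API; the block size and replication factor shown are the common defaults mentioned above.

```python
# Sketch: how a file is divided into HDFS blocks and how much raw
# storage replication consumes. Defaults mirror common HDFS settings.
import math

def hdfs_storage(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, total raw storage used in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, file_size_mb * replication

blocks, raw_mb = hdfs_storage(1000)   # a 1 GB file
print(blocks, raw_mb)                 # 8 blocks, 3000 MB of raw storage
```

So a 1 GB file occupies 8 blocks, and with 3-way replication it consumes 3 GB of raw cluster storage.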

2. MapReduce API

Allows job-based (batch) parallelizable processing across data in HDFS. The API auto-parallelizes work over huge amounts of data while preserving fault tolerance and adding high availability. It also brings the computing logic to the data, rather than having the data travel across the network to be operated on. The MapReduce model expects hardware failures in the commodity hardware it runs on, so automatic retry is built in.


As shown in the picture above, there are 3 main phases in MapReduce. In the map phase, the incoming files are split into blocks that can be stored in HDFS, and a key-value pair is emitted for each piece of data. This output is then consumed by the shuffle phase, which transfers the data to the reducers. During shuffling, the data can be subjected to optimizations such as sorting or pre-processing that save time at the reducers. Finally, the reducers derive the desired output from the shuffled data according to the defined reduce logic.
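The three phases above can be simulated in a few lines with the classic word-count example. This is a minimal in-memory sketch of the concepts only; the real Hadoop API is Java-based (Mapper and Reducer classes), and the function names here are illustrative.

```python
# A minimal in-memory sketch of the three MapReduce phases (map,
# shuffle, reduce) using word count. Not the actual Hadoop API.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) key-value pair for every word seen.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, so each reducer receives all
    # values for one key together. (Hadoop also sorts keys here.)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: apply the reduce logic (here, summation) per key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["hadoop stores data", "hadoop processes data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In a real cluster, the map calls run on the nodes that hold the data blocks, and the shuffle moves the grouped pairs across the network to the reducer nodes.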

3. Data access

These layers access the data stored in HDFS in ways that are optimized or enhanced for particular domains. The following are a few such layers available for use.

● Pig – A scripting language that provides an ‘Extract, Transform, Load’ (ETL) library for HDFS
● Hive – Provides the Hive Query Language, which is similar to SQL, for querying data in HDFS
● HBase – Inspired by Google BigTable (which was built on GFS), HBase provides a more abstract layer that leverages the distributed data storage of HDFS
● Mahout – A data access layer specifically for running scalable machine learning algorithms on HDFS

4. Tools and libraries

● HUE – Hadoop User Experience, which provides a web interface for the Hadoop platform.
● Sqoop – A tool to efficiently transfer bulk data between Hadoop and structured data stores. This is useful when enterprises using relational databases encounter massive growth, producing volumes of data that can no longer be handled efficiently in a relational database.

5. Monitoring and alerting

Covers cluster-node performance and statistics on currently executing jobs, and is mostly used by administrators.

Though Hadoop is a powerful framework, it is not a solution for every scalability and data-volume problem in the enterprise. It should only be used when batch processing and the latencies involved are acceptable for the application.

For example, Hadoop may not be suitable for real-time analysis where interactivity is needed. Facebook uses Hadoop for ad targeting, and Hadoop is also used for point-of-sale transaction analysis (for relational systems), threat analysis, data sandboxing, recommendation engines, risk modelling and trade surveillance when huge volumes of data are involved.

While Hadoop is an open-source framework, several vendors such as Cloudera, Hortonworks, MapR, AWS and Windows Azure provide a premium layer on top of it at a cost, with value additions and support.
