Hadoop: The Definitive Guide [Book] - Study Notes

Chap-1- Meet Hadoop

  • The need for Hadoop and its adoption at Yahoo.
  • A framework that can scale to the web.
  • Map and Reduce activity, and features like data locality.
  • Can be applied to a variety of algorithms.
  • More data usually beats better algorithms.


Chap-2 - MapReduce

  • The Mapper Java class and Reducer Java class
  • The Job Java class
  • Jobtracker and tasktrackers
  • Hadoop divides the input into fixed-size pieces called input splits, or just splits
  • Map tasks write their intermediate output to local disk rather than HDFS, because it can be discarded once the job is complete.
  • Outputs of Reduce tasks are stored in HDFS
  • A combiner function can be run on the map output; the combiner function's output then forms the input to the reduce function
  • Hadoop Streaming lets you write map and reduce functions in languages other than Java, using standard input and output as the interface
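
The map, combine and reduce steps in the notes above can be illustrated with a small plain-Python simulation. This is only a sketch and does not use the Hadoop API: the record format, the sample data, and the function names (map_fn, combine_fn, reduce_fn, group_by_key, run_job) are all assumptions made for illustration. It finds the maximum temperature per year, a case where the reduce logic can be reused as the combiner because max is associative and commutative.

```python
from collections import defaultdict


def map_fn(line):
    """Map: parse one 'year,temperature' record into a (key, value) pair."""
    year, temp = line.split(",")
    yield year, int(temp)


def reduce_fn(key, values):
    """Reduce: keep the maximum value seen for a key."""
    yield key, max(values)


# max is associative and commutative, so the reducer doubles as the combiner.
combine_fn = reduce_fn


def group_by_key(pairs):
    """Group values by key, as the shuffle does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits):
    """Simulate one job: a map task (plus combiner) per split, then reduce."""
    intermediate = []
    for split in splits:
        map_output = [pair for line in split for pair in map_fn(line)]
        # The combiner only ever sees the output of a single map task.
        for key, values in group_by_key(map_output).items():
            intermediate.extend(combine_fn(key, values))
    result = {}
    for key, values in sorted(group_by_key(intermediate).items()):
        for out_key, out_value in reduce_fn(key, values):
            result[out_key] = out_value
    return result


# Hypothetical input, already divided into two splits.
splits = [
    ["1950,0", "1950,22", "1949,111"],
    ["1950,-11", "1949,78"],
]
print(run_job(splits))  # {'1949': 111, '1950': 22}
```

The combiner does not change the final answer here; it only reduces how many pairs each simulated map task hands to the reduce step (three pairs become two for the first split).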


Chap-3 - The Hadoop Distributed Filesystem

  • Fault-tolerant solution: the same data is written to multiple places (replication).
  • Filesystems that manage the storage across a network of machines are called distributed filesystems.
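  • Blocks - the block size is the minimum amount of data that can be read or written (for HDFS it is 64 MB by default in the releases these notes cover)
  • Namenodes and Datanodes - An HDFS cluster follows a master-worker pattern: one namenode (master) and a number of datanodes (workers). The namenode holds the filesystem metadata, and the datanodes store the blocks. The namenode also knows which datanodes hold each block, but it does not store these locations persistently; they are reconstructed from the datanodes at start time.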
  • HDFS federation
  • HDFS High-Availability
  • On large clusters, the time it takes for a namenode to start from cold can be 30 minutes or more
  • Fencing and failover - When the active namenode fails, an entity called the 'failover controller' switches over to the standby namenode. ZooKeeper is used to ensure that only one namenode is active.
  • Graceful failover - triggered by an administrator
  • Ungraceful failover - in this case it is impossible to be sure that the failed namenode has stopped running, so a mechanism called fencing makes sure it can do no damage. In the worst case it does 'shoot the other node in the head' (STONITH) - a forced shutdown.
  • File Operations in HDFS
  • There are Java APIs for all operations like create, delete, sync
  • Use Flume and Sqoop to move data
  • Copy in parallel with distcp
  • Hadoop archives (HAR files) pack many small files into HDFS blocks more efficiently and can be used as input to MapReduce. The archive itself is not compressed.
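
The block and replication notes above lend themselves to a little arithmetic. The sketch below is plain Python, not an HDFS API: the function names are hypothetical, the 64 MB block size is the default quoted in these notes, and the replication factor of 3 is assumed here as the usual HDFS default rather than taken from the notes.

```python
MB = 1024 * 1024


def block_count(file_size, block_size=64 * MB):
    """Number of blocks needed for a file of file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


def raw_storage(file_size, replication=3):
    """Bytes used across the cluster when every block is replicated."""
    return file_size * replication


# A 200 MB file needs three full 64 MB blocks plus one 8 MB block.
print(block_count(200 * MB))          # 4
# A 1 MB file still needs one block, but that block only holds 1 MB of data.
print(block_count(1 * MB))            # 1
print(raw_storage(200 * MB) // MB)    # 600
```

Note that raw_storage multiplies the file size, not the block size: a file smaller than a block does not take up a full block's worth of underlying storage.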


Chap-4 - Hadoop I/O


  • Compression
  • Reading compressed data
  • Serialization is natively implemented in Hadoop (Writables) for better performance
  • Apache Avro is a project that does this in an improved way and supports multiple languages; it differs from Google Protocol Buffers and Thrift
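
To make the serialization note above concrete, here is a minimal Python sketch of a compact fixed-width integer encoding. It is not Hadoop code: the function names are hypothetical, and the 4-byte big-endian layout is used on the assumption that it mirrors what Java's DataOutput.writeInt produces, which is what Hadoop's IntWritable relies on.

```python
import struct


def serialize_int(value):
    """Encode a 32-bit signed integer as 4 big-endian bytes."""
    return struct.pack(">i", value)


def deserialize_int(data):
    """Decode 4 big-endian bytes back into an integer."""
    (value,) = struct.unpack(">i", data)
    return value


encoded = serialize_int(163)
print(encoded.hex())             # 000000a3
print(len(encoded))              # 4
print(deserialize_int(encoded))  # 163
```

The encoding carries no type or field names, which is part of why such formats are compact; the reader has to know in advance what it is reading.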


Chapter - 5 - Developing a MapReduce Application

Setting up the Environment
- The Configuration API, to read XML resource files etc.
- Writing unit tests with MRUnit
- Running locally on a small dataset
- Using the Tool interface, write a driver to run our MapReduce job (Java file)
- Testing the driver
- Running on a cluster
- Packaging the jar
- Launching a job: run the driver
- Debugging a job
- Running multiple jobs in a particular flow
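
As a rough illustration of the first item in the list above, the sketch below merges Hadoop-style XML resource files in plain Python. It is not the Hadoop Configuration class: load_configuration and the sample resources are hypothetical, and the rules it applies (later resources override earlier ones, except for properties marked final) are stated here as assumptions about how Configuration behaves.

```python
import xml.etree.ElementTree as ET


def load_configuration(resources):
    """Merge XML resources in order; later ones override earlier ones."""
    properties = {}
    final_names = set()
    for xml_text in resources:
        root = ET.fromstring(xml_text.strip())
        for prop in root.findall("property"):
            name = prop.findtext("name")
            value = prop.findtext("value")
            if name in final_names:
                continue  # a property marked final is not overridden
            properties[name] = value
            if prop.findtext("final") == "true":
                final_names.add(name)
    return properties


# Hypothetical resources using the <configuration>/<property> layout.
resource_1 = """
<configuration>
  <property><name>color</name><value>yellow</value></property>
  <property><name>size</name><value>10</value></property>
  <property><name>weight</name><value>heavy</value><final>true</final></property>
</configuration>
"""
resource_2 = """
<configuration>
  <property><name>size</name><value>12</value></property>
  <property><name>weight</name><value>light</value></property>
</configuration>
"""

conf = load_configuration([resource_1, resource_2])
print(conf)  # {'color': 'yellow', 'size': '12', 'weight': 'heavy'}
```

Values stay as strings in this sketch; converting them to numbers or booleans is left to the caller.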

Chapter 6 - How MapReduce Works

Chapter 9 - Chapter 15

Setting Up a Hadoop Cluster
- Manually 
- Using a CDH distribution (See Appendix)

Hadoop Tools:
  • Pig: Aims to provide richer data structures and transformations than plain map and reduce can offer
  • Hive: Made to run queries for people who are weak in Java but strong in SQL
  • HBase: Distributed, column-oriented database built on top of HDFS. It is built to scale.
  • ZooKeeper: A coordination service built to help applications cope with partial failures of requests between nodes.
  • Sqoop: Transfers data between external structured data stores (such as relational databases) and Hadoop. It is focused on data movement.
