Thursday, October 14, 2010
The platform
Hadoop commonly refers to the main component of the platform, the one on top of which the others offer their higher-level services. It is the storage framework together with the processing framework: the Hadoop Distributed Filesystem (HDFS) library, the MapReduce library, and a core library, all working together. This was the first project, and it paved the way for the others.
Those others are HBase (a columnar database), Hive (a data mining tool), Pig (scripting), and Chukwa (log analysis), all of which depend on the availability of the platform. Then we have ZooKeeper (a coordination service), which is independent of Hadoop's availability and is used by HBase, and Avro (serialization/deserialization), designed to support the requirements of the main service components. We'll look at each of them in more detail later.
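
Just to make the core tangible before the tour: here's a minimal sketch of a client talking to the storage framework through the HDFS API. The NameNode address and file path are placeholders of mine, not anything the platform prescribes; normally the address would come from the cluster's configuration files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; normally picked up from core-site.xml.
    conf.set("fs.default.name", "hdfs://master:9000");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/hello.txt");

    // Write: the NameNode decides which DataNodes will hold the blocks.
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello, HDFS");
    out.close();

    // Read back: the client asks the NameNode for the block locations,
    // then streams the data directly from the DataNodes.
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();
  }
}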

I've drawn the components in different layers, showing their functional dependencies, with a short description of each one.



Basic flow: how does it work?
A Hadoop cluster consists of any number of slave nodes, one master node managing them all, and one backup node assisting it.
The slaves are data nodes, each running two processes: the DataNode process and the TaskTracker process. The master node runs the NameNode process, and another, separate node (recommended, though not mandatory) runs the BackupNode process.
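
As a quick sanity check of that layout, a client can ask both master processes what they see: the NameNode for its DataNodes, and the JobTracker for its TaskTrackers. A minimal sketch, assuming the cluster's configuration is on the classpath (0.20-era API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ClusterCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Ask the NameNode which DataNodes it currently knows about.
    FileSystem fs = FileSystem.get(conf);
    if (fs instanceof DistributedFileSystem) {
      DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
      System.out.println("DataNodes reporting: " + nodes.length);
    }

    // Ask the JobTracker how many TaskTrackers are alive.
    JobClient jc = new JobClient(new JobConf(conf));
    ClusterStatus status = jc.getClusterStatus();
    System.out.println("TaskTrackers reporting: " + status.getTaskTrackers());
  }
}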

So in practice the platform deploys five different processes across the cluster: the NameNode, DataNode, and BackupNode from the storage framework (HDFS), and the JobTracker and TaskTracker from the processing framework (MapReduce). In HDFS, the NameNode coordinates almost all read/write and access operations between clients and the cluster's DataNodes; the DataNodes store, read, and write the information; and the BackupNode takes on heavy operations, accelerating boot-up and ensuring failover data recovery, among other things. In MapReduce, the JobTracker coordinates everything about deploying application tasks over the DataNodes, as well as summarizing their results, while the TaskTracker processes running on those nodes receive the tasks and execute them.
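
To watch that task flow from the application side, here's the classic word count written against the MapReduce library: once submitted, the JobTracker schedules the map and reduce tasks onto the TaskTrackers and gathers the result. The input and output paths come from the command line, and the job name is arbitrary.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.length() > 0) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: sum the counts collected for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount"); // job name is arbitrary
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}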

This diagram shows how the processes are distributed within the flow.
