Thursday, October 14, 2010
The platform
Hadoop commonly refers to the main component of the platform, the one on top of which the others offer higher-level services. It combines the storage framework with the processing framework, and is made up of the Hadoop Distributed Filesystem (HDFS) library, the MapReduce library, and a core library, all working together. It was the first project, and it paved the way for the others.
Those others are HBase (a column-oriented database), Hive (a data warehouse with SQL-like queries), Pig (scripting), and Chukwa (log analysis), all of which depend on the availability of the platform. Then there are ZooKeeper (a coordination service), which is independent of Hadoop's availability and is used by HBase, and Avro (serialization/deserialization), designed to support the requirements of the main service components. We'll look at each of them in more detail later.
I drew them in different layers, showing their functional dependencies along with a short description of each component.
Basic flow: how does it work?
A Hadoop cluster consists of any number of slave nodes, one master node managing them all, and one backup node assisting it.
The slave nodes are data nodes, each running two processes: the DataNode process and the TaskTracker process. The master node runs the NameNode process, and the BackupNode process runs on a different node whenever possible, though this is not mandatory.
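The placement described above can be written down as a small lookup table. This is only a sketch: the function name cluster_layout, the host names, and the separate_backup flag are made up for illustration, and putting the JobTracker on the master node is an assumption of the sketch rather than something stated above.

```python
def cluster_layout(num_slaves, separate_backup=True):
    """Return a dict mapping a (hypothetical) host name to the processes it runs."""
    # Assumption: the JobTracker shares the master node with the NameNode.
    layout = {"master": ["NameNode", "JobTracker"]}

    if separate_backup:
        # Preferred case: the BackupNode gets a node of its own.
        layout["backup"] = ["BackupNode"]
    else:
        # Allowed but not ideal: the BackupNode shares the master node.
        layout["master"].append("BackupNode")

    # Every slave node runs both the storage and the processing daemon.
    for i in range(1, num_slaves + 1):
        layout[f"slave{i}"] = ["DataNode", "TaskTracker"]
    return layout


layout = cluster_layout(3)
for host, processes in layout.items():
    print(host, processes)
```

With three slaves and a dedicated backup node, the table has five hosts, and each slave lists the same two processes.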
So, in practice, the platform deploys five different processes across the cluster: the NameNode, DataNode and BackupNode from the storage framework (HDFS), and the JobTracker and TaskTracker from the processing framework (MapReduce). In HDFS, the NameNode coordinates almost all read/write and access operations between clients and the DataNodes in the cluster; the DataNodes store, read and write the data; and the BackupNode is in charge of speeding up some heavy operations such as boot-up and ensuring data recovery on failover, among others. In MapReduce, the JobTracker coordinates the distribution of application tasks across the slave nodes and summarizes their results, while the TaskTracker processes running on those nodes receive the tasks and execute them.
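To make the division of roles more concrete, here is a toy, in-memory simulation. It is not Hadoop code and uses none of its APIs: every class and function name below is hypothetical, there is no replication or networking, and the BackupNode is left out. It also assumes that the JobTracker learns where each block lives by asking the NameNode, and that each task runs on the node holding its block.

```python
from collections import Counter


class NameNode:
    """Toy NameNode: holds only metadata, i.e. which node stores which block."""

    def __init__(self):
        self.block_map = {}  # file name -> list of (block_id, node_name)

    def register(self, filename, block_id, node_name):
        self.block_map.setdefault(filename, []).append((block_id, node_name))

    def locate(self, filename):
        return list(self.block_map[filename])


class SlaveNode:
    """Toy slave node playing both roles: DataNode (storage) and TaskTracker (execution)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> block contents

    def store(self, block_id, data):  # DataNode role
        self.blocks[block_id] = data

    def run_task(self, block_id, map_fn):  # TaskTracker role
        return map_fn(self.blocks[block_id])


class JobTracker:
    """Toy JobTracker: sends one task per block to the node storing it, then sums the results."""

    def __init__(self, namenode, slaves):
        self.namenode = namenode
        self.slaves = {s.name: s for s in slaves}

    def run(self, filename, map_fn):
        total = Counter()
        for block_id, node_name in self.namenode.locate(filename):
            total.update(self.slaves[node_name].run_task(block_id, map_fn))
        return total


def write_file(namenode, slaves, filename, blocks):
    """Spread the blocks round-robin over the slaves and record their locations."""
    for i, data in enumerate(blocks):
        node = slaves[i % len(slaves)]
        block_id = f"{filename}#{i}"
        node.store(block_id, data)
        namenode.register(filename, block_id, node.name)


def count_words(text):
    return Counter(text.split())


namenode = NameNode()
slaves = [SlaveNode("slave1"), SlaveNode("slave2")]
write_file(namenode, slaves, "log.txt", ["a b a", "b c"])
result = JobTracker(namenode, slaves).run("log.txt", count_words)
print(result)
```

The point of the sketch is the separation of concerns: the NameNode object never touches block contents, the slave nodes never know which file a block belongs to, and the JobTracker only schedules work and merges the partial counts.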
The diagram shows how they are distributed within the flow.
6 comments:
Hi, first of all, thanks for the detailed diagrams. I am a newbie to Hadoop and I have a question about the diagram above. Is the Hadoop architecture such that the NameNode and JobTracker never interact with each other? The JobTracker only assigns the tasks, but then how does it know where the chunks of a particular file are stored, so as to allocate mappers appropriately? Thanks for the clarification.