the Hadooper in me: Platform base: HDFS + MR

Wednesday, November 17, 2010

Platform base: HDFS + MR

1:57 AM | Posted by Cristian

Let's explore the core of the platform.
Here are some relevant points about the filesystem:

- Scales linearly over a set of low-budget commodity machines: doubling the number of machines roughly halves the processing time
- Tolerates faults at different levels (network, switches, disks, nodes) by redirecting data traffic to other nodes while maintaining the replication factor
- Scales flexibly: for maintenance tasks, you only have to connect or disconnect computers in the rack
- Lets you store files of any kind and format, although it performs better on files larger than 128 MB
- The results of MapReduce jobs are stored in the filesystem
- There is a single namespace, with an automatic replication scheme administered by the master; it is not possible to target a specific node in the cluster, whether to store files or to execute jobs
- The master itself balances the workload based on execution plans and status reports from the slave nodes
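
To make the storage points above more concrete, here is a minimal sketch of how a file could be cut into fixed-size blocks and how each block could be assigned to several nodes. The 128 MB block size is taken from the figure mentioned above, the replication factor of 3 is an assumption, and the round-robin placement and the function names `split_into_blocks` and `place_replicas` are purely illustrative; they are not the real HDFS placement policy or API.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes
REPLICATION = 3                 # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes would occupy."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign every block to `replication` distinct nodes.

    Round-robin placement, for illustration only: the real master also
    takes racks and node load into account.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication factor")
    return {
        block_id: [nodes[(block_id + i) % len(nodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
    print(len(blocks), "blocks")
    print(place_replicas(len(blocks), ["n1", "n2", "n3", "n4"]))
```

Because every block lives on several nodes, losing one node leaves at least two other copies from which the master can restore the replication factor.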

The picture gives a more practical overview of HDFS responsibilities and execution flows

- Has a continuous, rack-aware replication mechanism per file, which extends the data reliability and data availability guarantees
- Checksums files automatically, with immediate correction
- Works in a master-slave scheme where slaves share nothing between them; they only respond to master requests
- The master coordinates all types of transactions (read/write, replication, restore) and manages the transaction log and the filesystem namespace
- The slaves only take charge of low-level operations on data: read, write, deletion, and transport to and from the client
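
The checksum-and-correction idea above can be sketched in a few lines. This is a simplified illustration, assuming a CRC32 checksum per block and an in-memory dictionary of replicas; the function `read_block` and the node names are hypothetical and do not correspond to any HDFS API.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC32 checksum of a block (assumed checksum algorithm for this sketch)."""
    return zlib.crc32(data)


def read_block(replicas, expected_checksum):
    """Return a healthy copy of the block and repair corrupt replicas in place.

    `replicas` maps a node name to the bytes that node currently stores.
    """
    good = None
    for data in replicas.values():
        if checksum(data) == expected_checksum:
            good = data
            break
    if good is None:
        raise IOError("all replicas are corrupt")
    for node, data in replicas.items():
        if checksum(data) != expected_checksum:
            replicas[node] = good  # overwrite the corrupt copy with a healthy one
    return good


if __name__ == "__main__":
    expected = checksum(b"hello")
    replicas = {"n1": b"hellx", "n2": b"hello", "n3": b"hello"}
    print(read_block(replicas, expected))
    print(replicas)
```

The client never sees the corrupt copy on `n1`: the read is served from a healthy replica and the bad one is replaced as a side effect.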

Let's now move on to the processing model.
Main characteristics of MapReduce:

- It's a processing model for dividing and distributing information in two chained phases: map first, then reduce
- Both phases take a list of key-value pairs as input and produce one as output
- The scheme lets you define the method parameters and your own logic for both phases, as well as your own partitioning system and intermediate storage between phases
- The transactions are handled by a JobTracker daemon, which runs the initial data partitioning and the intermediate data combination by posting Map and Reduce tasks to the TaskTracker daemons (1-n per computer) on the cluster nodes involved, according to the data being processed
- The Reduce phase only starts once the Map phase has finished, because after the Map the resulting keys are combined in order to distribute a sorted list of key-value pairs among the Reducers, whose results can be matched up at the end
- The process is transactional: map or reduce tasks that could not be executed (for example, because of data availability issues) are reattempted a number of times and then redistributed to other nodes
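
The retry-then-redistribute behaviour described in the last point can be sketched as follows. This is only an illustration under assumptions: the function `run_task`, the limit of two attempts per node, and the use of `RuntimeError` to signal a failed attempt are all hypothetical, not the JobTracker's actual scheduling logic.

```python
def run_task(task, nodes, max_attempts_per_node=2):
    """Run `task` on the first node where it succeeds.

    Each node is retried up to `max_attempts_per_node` times before the
    task is handed to the next node. Returns the result together with the
    list of (node, attempt) pairs that were tried.
    """
    attempts = []
    for node in nodes:
        for attempt in range(1, max_attempts_per_node + 1):
            attempts.append((node, attempt))
            try:
                return task(node), attempts
            except RuntimeError:
                continue  # failed attempt: retry, or fall through to the next node
    raise RuntimeError("task failed on every node")


if __name__ == "__main__":
    def flaky_map_task(node):
        # Hypothetical task: the data is unavailable on n1.
        if node == "n1":
            raise RuntimeError("data unavailable on " + node)
        return "done on " + node

    print(run_task(flaky_map_task, ["n1", "n2"]))
```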

The picture shows how these methods interact across the phases.

Detail:

- First, the files are partitioned into parts that are distributed for processing across the cluster nodes
- Each part is parsed into pairs of Key (sortable object) - Value (object), which become the input parameters for the tasks implementing the Map function
- These user-defined tasks (map) read the value object, do something with it, and then build a new key-value list that the framework stores in intermediate files
- Once all the map tasks have finished, all the data to process has been completely read and reordered into this MapReduce model of key-value pairs
- These intermediate key-value results are combined, yielding new key-value pairs that become the input for the subsequent reduce tasks
- These user-defined tasks (reduce) read the value object, do something with it, and then produce the third and last list of key-value pairs, which the framework combines and regroups into a final result
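
The steps above can be imitated on a single machine. The following is a minimal in-memory sketch, not Hadoop code: the helper names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for illustration, and the word-count map and reduce functions simply stand in for the user-defined tasks.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply the user-defined map function to every input key-value pair."""
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    return intermediate


def shuffle(intermediate):
    """Group the intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped, reduce_fn):
    """Apply the user-defined reduce function to every key and its values."""
    output = []
    for key, values in grouped:
        output.extend(reduce_fn(key, values))
    return output


def run_job(records, map_fn, reduce_fn):
    """Chain the phases: map, then shuffle and sort, then reduce."""
    return reduce_phase(shuffle(map_phase(records, map_fn)), reduce_fn)


# User-defined tasks for a word count: the input key is a line offset,
# the input value is the line of text.
def word_count_map(offset, line):
    for word in line.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    yield word, sum(counts)


if __name__ == "__main__":
    records = [(0, "the quick fox"), (14, "the lazy dog")]
    print(run_job(records, word_count_map, word_count_reduce))
```

In a real cluster each phase runs in parallel on many nodes and the intermediate lists live in files, but the data flow between the three steps is the same.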

Let's look at a sample job with an inverted-index function for analyzing a web crawler's output files (just as an example)
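
An inverted index of this kind could be sketched as below. This is an illustrative single-machine version under assumptions: the crawler output is modelled as a dictionary from URL to page text, and the names `index_map`, `index_reduce`, and `build_inverted_index` are hypothetical.

```python
from collections import defaultdict


def index_map(url, text):
    """Map: emit one (word, url) pair for every distinct word on the page."""
    for word in set(text.lower().split()):
        yield word, url


def index_reduce(word, urls):
    """Reduce: collapse all URLs seen for a word into a sorted, unique list."""
    return word, sorted(set(urls))


def build_inverted_index(pages):
    """Run map, group by word, then reduce, all in memory."""
    groups = defaultdict(list)
    for url, text in pages.items():
        for word, page_url in index_map(url, text):
            groups[word].append(page_url)
    return dict(index_reduce(word, urls) for word, urls in sorted(groups.items()))


if __name__ == "__main__":
    pages = {
        "a.example": "hadoop stores data",
        "b.example": "hadoop processes data",
    }
    for word, urls in build_inverted_index(pages).items():
        print(word, urls)
```

The map step turns "page contains words" into "word appears on page", and the reduce step gathers every page for a given word, which is exactly the lookup a search engine needs.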

MapReducing something is about iterating over a huge record collection, extracting something good, and mixing and regrouping the intermediate results. That's all; it may look more complex than it is.

