Wednesday, November 17, 2010
 
Let's explore the core of the platform
Here are some relevant points about the filesystem
  • Scales linearly over a set of low-cost commodity machines: doubling the number of machines cuts the processing time roughly in half
  • Tolerates faults at different levels (network, switches, disks, nodes) by redirecting data traffic to other nodes and maintaining the replication factor
  • Flexible scalability: for maintenance tasks, you only have to connect or disconnect computers in the rack
  • Lets you store files of any kind and format, although it performs better on files larger than 128 MB
  • The results of MapReduce jobs are stored in the filesystem
  • There's a single namespace, with an automatic replication scheme administered by the master; it's not possible to target a specific node in the cluster, either to place files or to execute jobs
  • The master balances the workload itself, based on execution plans and status reports from the slave nodes
The picture gives a more practical overview of HDFS responsibilities and execution flows

  • Has a continuous, rack-aware, per-file replication mechanism that strengthens data reliability and availability guarantees
  • Checksums files automatically and corrects errors immediately
  • Works in a master-slave scheme where slaves share nothing with each other; they only respond to requests from the master
  • The master coordinates all types of transactions (read/write, replication, restore) and manages the transaction log and the filesystem namespace
  • The slaves only take care of low-level data operations: read, write, deletion, and transport to and from the client
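To make the checksum and replication points a bit more concrete, here is a toy sketch in Python. It is not how HDFS is implemented: the replication factor of 3, the CRC32 checksum, and the names store_block and read_block are all assumptions made for illustration. It only shows the idea of verifying a block on read and repairing damaged copies from a healthy replica.

```python
import zlib

REPLICATION_FACTOR = 3  # assumed value for this sketch


def store_block(data: bytes, replication: int = REPLICATION_FACTOR) -> dict:
    """Keep `replication` copies of a block plus the checksum taken at write time."""
    return {
        "checksum": zlib.crc32(data),
        "replicas": [bytearray(data) for _ in range(replication)],
    }


def read_block(block: dict) -> bytes:
    """Return the block from a healthy replica and repair any corrupt replicas."""
    good = None
    for replica in block["replicas"]:
        if zlib.crc32(bytes(replica)) == block["checksum"]:
            good = bytes(replica)
            break
    if good is None:
        raise IOError("all replicas are corrupt")
    # Overwrite every damaged copy with the healthy one.
    for i, replica in enumerate(block["replicas"]):
        if zlib.crc32(bytes(replica)) != block["checksum"]:
            block["replicas"][i] = bytearray(good)
    return good


block = store_block(b"hello hdfs")
block["replicas"][0][0] ^= 0xFF  # simulate a corrupted byte on the first replica
data = read_block(block)
```

After the read, the client gets the original bytes and the damaged replica has been rewritten from a good copy.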
Let's move on now to the processing model
Main characteristics of MapReduce
  • It's a processing model based on dividing and distributing information in two chained phases: map first, then reduce
  • Both phases take a list of key-value pairs as input and produce one as output
  • The scheme lets you define the method parameters and your own logic for both phases, as well as your own partitioning system and intermediate storage between phases.
  • The transactions are handled by a JobTracker daemon, which runs the initial data partitioning and the intermediate data combination by posting Map and Reduce tasks to the TaskTracker daemons (1-n per computer) on the nodes involved in the cluster, according to the data being processed
  • The Reduce phase only starts once the Map phase has finished, because after the Map the resulting keys are combined in order to distribute a sorted list of key-value pairs among the Reducers, whose outputs can be merged at the end.
  • The process is transactional: map or reduce tasks that could not be executed (for example, because of data availability issues) are reattempted a number of times and then redistributed to other nodes.
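The retry behaviour from the last point can be sketched in a few lines of Python. This is only an illustration under assumptions: run_task, flaky_map_task, the node names, and the limit of 3 attempts are hypothetical and are not part of the Hadoop API.

```python
def run_task(task, nodes, max_attempts=3):
    """Try `task` on each node up to `max_attempts` times before moving to the next node."""
    log = []
    for node in nodes:
        for attempt in range(1, max_attempts + 1):
            try:
                result = task(node)
                log.append((node, attempt, "ok"))
                return result, log
            except RuntimeError:
                log.append((node, attempt, "failed"))
    raise RuntimeError("task failed on every node")


def flaky_map_task(node):
    # Hypothetical task: the data is unavailable on node-1 only.
    if node == "node-1":
        raise RuntimeError("data not available")
    return f"map output from {node}"


result, log = run_task(flaky_map_task, ["node-1", "node-2"])
```

Here the task is attempted three times on node-1, fails each time, and is then redistributed to node-2, where it succeeds on the first attempt.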
The picture shows how these methods interact across the phases.
Detail:

- First, the files are split into parts that are distributed across the cluster nodes for processing
- Each part is parsed into pairs of Key (sortable object) - Value (object), which become the input parameters for the tasks implementing the Map function
- These user-defined tasks (map) read the value object, do something with it, and then build a new key-value list that the framework stores in intermediate files.
- Once all the map tasks have finished, all the data to process has been read completely and reorganized into the MapReduce model of key-value pairs.
- These intermediate key-value results are combined, producing new key-value pairs that become the input for the subsequent reduce tasks
- These user-defined tasks (reduce) read the value object, do something with it, and then produce the third and final list of key-value pairs, which the framework combines and regroups into a final result.
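The flow above can be imitated on a single machine with a small Python sketch. It is a simplification under assumptions, not the Hadoop implementation: everything runs in one process, intermediate results live in memory instead of files, and the names split_input, map_task, shuffle, run_job, word_mapper, and count_reducer are made up for this example (a word count).

```python
from collections import defaultdict
from itertools import chain


def split_input(records, num_splits):
    """Partition the input records into parts, one per map task."""
    return [records[i::num_splits] for i in range(num_splits)]


def map_task(split, mapper):
    """Feed each (key, value) record of one part to the user-defined map function."""
    return list(chain.from_iterable(mapper(k, v) for k, v in split))


def shuffle(intermediate):
    """Combine the intermediate pairs: group values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    return sorted(grouped.items())


def run_job(records, mapper, reducer, num_splits=2):
    splits = split_input(records, num_splits)
    # All map tasks finish before any reduce task starts.
    intermediate = list(chain.from_iterable(map_task(s, mapper) for s in splits))
    return list(chain.from_iterable(reducer(k, vs) for k, vs in shuffle(intermediate)))


def word_mapper(offset, line):
    # User-defined map: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    # User-defined reduce: add up the counts collected for one word.
    yield word, sum(counts)


records = list(enumerate(["the map phase", "the reduce phase"]))
result = run_job(records, word_mapper, count_reducer)
```

With these two input lines, result is the sorted list of words with their counts: map 1, phase 2, reduce 1, the 2.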

Let's see a sample job with an inverted-index function, for analyzing a web crawler's output files (just as an example)
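As a rough idea of what such a job computes (an index of this kind is usually called an inverted index), here is a minimal single-process Python sketch. The page URLs and contents are invented sample data, and map_page, reduce_word, and inverted_index are hypothetical names; a real job would run the same two functions as distributed map and reduce tasks.

```python
from collections import defaultdict


def map_page(url, text):
    """Map: emit (word, url) for each distinct word on the page."""
    for word in set(text.lower().split()):
        yield word, url


def reduce_word(word, urls):
    """Reduce: collapse all URLs seen for a word into one sorted list."""
    return word, sorted(set(urls))


def inverted_index(pages):
    grouped = defaultdict(list)
    for url, text in pages.items():
        for word, page_url in map_page(url, text):
            grouped[word].append(page_url)
    # Keys are processed in sorted order, as the reducers would receive them.
    return dict(reduce_word(w, u) for w, u in sorted(grouped.items()))


# Invented sample of crawler output: URL -> page text.
pages = {
    "http://example.com/a": "hadoop stores files",
    "http://example.com/b": "hadoop runs jobs",
}
index = inverted_index(pages)
```

Looking up index["hadoop"] returns both URLs, while index["jobs"] returns only the second page.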

MapReducing something is about iterating over a huge record collection, extracting something good, and mixing and regrouping the intermediate results. That's all; it may look more complex than it is.

 
