Tuesday, November 2, 2010
For newcomers, here is a simple way to understand Hadoop's design and its main goals.
Compare: one of the first mistaken assumptions!
A typical transactional system lives in the range of a few gigabytes of data, accessed randomly. It runs inside an application server, and it accesses and controls a relational database hosted on a separate server, moving data in and out of it.
It interacts with online clients, keeping the amount of data in transit small over shared, limited bandwidth; the workload is mostly continuous reads over small sets of data, combined with some maintenance & CRUD operations.
Larger data processing is done in batches, but the architecture stays the same.
But what about a different scenario?
What happens if we need to process 1 petabyte per week? Or let's say a much smaller volume: 1 terabyte per day, or even just 100 gigabytes per day?
Under the traditional scheme, it would be like walking an elephant across a tightrope thousands of times: the time windows would not allow working with real-time information; you would always be trying to catch up and losing the race.
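A quick back-of-envelope calculation shows why the race is lost. The link speed below is an illustrative assumption (a dedicated gigabit link at roughly 125 MB/s, ignoring protocol overhead and contention), not a figure from the post:

```python
# Rough arithmetic behind the "losing the race" claim.
# Assumption: a dedicated 1 Gb/s link moving data at ~125 MB/s.
LINK_BYTES_PER_SEC = 125e6

def transfer_hours(bytes_per_day: float) -> float:
    """Hours needed to ship one day's worth of data over the link."""
    return bytes_per_day / LINK_BYTES_PER_SEC / 3600

for label, size in [("100 GB/day", 100e9),
                    ("1 TB/day", 1e12),
                    ("1 PB/week", 1e15 / 7)]:
    print(f"{label}: {transfer_hours(size):.1f} h per day of data")
```

At 100 GB/day the transfer is trivial; at 1 TB/day it already eats hours of the window; at petabyte-per-week scale, shipping a single day's data takes far longer than a day, so the system can never catch up.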
Want problems?
The network would saturate, batch jobs would take days to process hours' worth of information, and hard-disk access latency would turn into ever-growing overhead, with a painful impact on overall costs.
The traditional approach would keep the architecture and just change the hardware: ever more sophisticated requirements, each time more expensive and bigger, but on top of an architecture with limited growth. How many times will you redesign your system around non-linearly scaling solutions to keep up with the race of growing data?
The solution lies in changing the focus!
Problem 1: Size of data
Systems handling public data can have huge processing flows of hundreds of terabytes, and public websites have grown their data up to petabytes!
Strategy 1: Distribute and parallelize the processing
If chewing through 200 TB of data takes 11 days on 1 computer, let's divide the whole job across as many machines as needed to finish in minutes or hours!
And if we are going to have lots of computers, let them be cheap and easy to maintain, in a simple add-&-replace node architecture.
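The speedup is simple division. Working backwards from the post's own figure (200 TB in 11 days on one machine implies a sustained rate of roughly 210 MB/s per node, a derived assumption):

```python
# Back-of-envelope parallel speedup, using the post's own figure:
# ~200 TB in ~11 days on one machine implies ~210 MB/s per node.
PER_NODE_BYTES_PER_SEC = 210e6

def processing_days(total_bytes: float, nodes: int) -> float:
    """Days to scan total_bytes with `nodes` machines in parallel,
    assuming the work splits evenly and ignoring coordination overhead."""
    return total_bytes / (PER_NODE_BYTES_PER_SEC * nodes) / 86400

print(round(processing_days(200e12, 1)))        # ~11 days on one machine
print(round(processing_days(200e12, 100) * 24, 1))  # a few hours on 100 nodes
```

Of course real jobs don't split perfectly evenly, which is exactly why the scheduling and coordination belong in the platform, not the application.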
Problem 2: Continuous failure and cluster growth
In a big cluster, lot of machines fail everyday, besides the cluster size cannot be fixed and sometimes cannot be planned, it should grow easily at the speed of data.
Strategy 2: High fault tolerance and high scalability
Any design must have an excelent fault-tolerance mechanism, a very flexible maintenance, and its service availability should keep up naturally to its cluster size.
So we need a distributed storage, atomic, transactional, with perfect linear scalability.
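Replication is the usual mechanism behind that fault tolerance: a block of data is only lost if every node holding a replica dies before the system re-replicates it. A minimal sketch of the arithmetic, where the 1% per-node failure probability per repair window is our illustrative assumption:

```python
# Why replication tames constant machine failure: a block disappears
# only if ALL nodes holding a replica fail before re-replication.
# The 1% failure probability per repair window is an assumed figure.

def block_loss_probability(p_node_failure: float, replicas: int) -> float:
    """Probability that all `replicas` independent copies are lost."""
    return p_node_failure ** replicas

for replicas in (1, 2, 3):
    print(replicas, block_loss_probability(0.01, replicas))
```

Each extra replica multiplies the loss probability by the (small) per-node failure probability, which is why three copies on commodity hardware can beat one copy on expensive hardware.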
Problem 3: Exponential overhead in data transport
Between the data source and its processing location, poor hard-disk latency and saturated networks will eventually blow up.
Strategy 3: Move application logic to the data storage
The platform must handle data and processing separately and transparently: applications should only care about their own business logic, and the platform should automatically ship that logic to the data storage, execute it, and collect the results.
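The programming model that makes shipped-around application logic practical is MapReduce: the developer writes two small functions and the platform runs them next to the data. A toy word count in plain Python, mimicking the model's map / shuffle / reduce phases (this is a conceptual sketch, not the Hadoop API):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word — runs where the data lives."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a final result."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```

Only the tiny map and reduce functions are application code; everything else (splitting input, moving intermediate pairs, retrying failed tasks) is the platform's job, which is precisely Strategy 3.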
So...
- Distribute and parallelize processing
- High fault tolerance, high scalability
- Bring application processing to the data storage
One Solution: Hadoop Distributed Filesystem + Hadoop MapReduce
Still, Hadoop is not bullet-proof. It is actually a very specific solution for certain big-data scenarios; for needs such as real-time processing or random access to data, there are complementary technology choices (see HBase, Hive).
Hadoop is not designed to replace an RDBMS. Rather, it has proven able to handle huge amounts of data far more efficiently, where traditional enterprise database clusters wouldn't come anywhere close at the same overall cost!