Tuesday, August 24, 2010
Apache Hadoop is a family of projects focused on big-data concerns, built around a parallel, distributed computing strategy.
It provides tools for storing, processing, transforming, mining, analyzing, and searching data, at scales ranging from large enterprise clusters down to university labs with a few dozen computers.

When the time comes to evaluate this platform, comparing it with others is unavoidable. Yet beyond the usual hurdles of researching, learning, and deploying open-source software, what stands out are the day-to-day challenges and requirements this platform was born to handle, and the way it accomplishes its objectives. If you get involved in the matter, you'll find a combination of challenges that typical database tools were never required to face before.

Why is it becoming so widely used? It's all about vision. The key to Hadoop's strategy is to scale massively simply by adding nodes to a simple network architecture, generally built from commodity hardware: common computers rather than sophisticated datacenter equipment, and often a diverse mix of machines. One important consequence is that the platform must be prepared for malfunctions and disconnections in ordinary, mass-produced components: hard disks, switches, network cards, and so on.
In contrast to enterprise database products, which demand sophisticated hardware, specialized maintenance and, above all, high-budget software and hardware contracts, Hadoop takes advantage of the full potential of each node in the cluster by distributing the data processing across all of them, avoiding any exponential overhead and thus achieving almost perfectly linear scalability.

To support this strategy, another tactic is pursued: bring the processing to the nodes that hold the data, rather than shipping pieces of data back and forth to dedicated processing machines. This is a core concept of the platform's design and the key value of its whole architecture.
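To make this concrete, below is a minimal sketch of the classic word-count job, roughly in the style of the examples that ship with the Hadoop MapReduce API (org.apache.hadoop.mapreduce); the class names and input/output paths here are only illustrative. The point is that the developer writes just a map and a reduce function; the framework ships that code to the nodes storing the input blocks, so only the small intermediate (word, count) pairs ever cross the network.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map task runs on (or near) the node that stores its input block,
  // so the bulk data never leaves the node; it only emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // The reducer receives all counts for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this would typically be launched with something along the lines of "hadoop jar wordcount.jar WordCount /input /output", against data already sitting in HDFS.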

The need for this vision comes from data sets that grow massively, non-stop, every day. Besides government bodies, public organizations, and established companies, there is a newer kind of organization storing not only huge amounts of data but data that grows at huge rates: websites dealing with thousands or millions of users worldwide and the content they generate, while still needing to access and process data of that size with optimal performance. That's the context in which the core design of Hadoop was born, at the hands of a former Yahoo! engineer, Doug Cutting.
It was inspired by Google's data infrastructure strategies, published in papers a few years earlier.
Yahoo! took an interest in Hadoop and then fully backed it. After some years, Hadoop reached a broader and higher horizon through the open source community, with its author leading its evolution toward more ambitious purposes than those it served inside Yahoo! (as stated publicly).
Saturday, August 21, 2010
 
Hello everyone, I'm sharing some recent graphical results from a Hadoop analysis I did this year.

I’ll be publishing everything related to this platform and other IT stuff.
I'm the author of all the texts and pictures used in these posts; feel free to use them, and leave a comment if you have any corrections.
I'll try to keep my posts from being redundant with the official websites; they won't attempt to be manual material, only my own perspective. There are already good books written by key committers of Apache Hadoop.

Some specific information may become obsolete from time to time, as Hadoop is still a relatively young project. I'll do my best to keep up.

So, stay tuned!

Soon, all pictures in Spanish!

Cristian Gonzalez