Friday, 12 August 2016

Building Your Own Hadoop Cluster At Home: Part 1



Why Even Bother?
When I first became interested in learning Hadoop, I figured out pretty much immediately that I would need to cultivate a deep understanding of the underlying system and its installation and configuration, before I could successfully work up to actually using the system myself. From everything I digested in my early research, it seemed that everyone in the Hadoop community knew the ecosystem from the ground up: it felt imperative that I lay my own groundwork in order to truly understand Hadoop on all its levels. To that end, I decided I would build my own Hadoop cluster at home using cheap components scavenged from eBay. Sure, there are ways to explore Hadoop virtually, but I have always done better learning things in a more hands-on way. I also figured if I had a room full of computer equipment, I would be less likely to get frustrated and wander away.

Now that I’ve finally got my setup pieced together and working, let me give you fair warning: this stuff is hard. Depending on your background it may be a piece of cake, but for me it’s been an arduous adventure filled with nearly endless error messages, tons of googling, piles of reading, and frustration beyond belief, but also great victories. I’ve never been the kind of person to let my lack of knowledge or skills stop me from doing something: if that’s the kind of person you are, then this could be a very rewarding challenge for you. With Hadoop, you’ll quickly find that the distinction between sysadmin, developer, and analyst is a real gray area. Building your own cluster from scratch can certainly give you a head start in beginning to develop and manage those roles.

System Design
Here is the final configuration I ended up with. Each node is just a desktop system. I bought Dell Optiplex 745 SFF Core2Duo PCs on eBay to be my “servers”. They are plentiful, cheap, can handle 8GB of RAM, and have gigabit internet cards built in. I just plugged them all into my network switch, plugged that into my wifi router, and bing-bang-boom: a cluster is born. Please note, the bing-bang-boom took me months to get right but this set up works extremely well without a lot of user intervention. Prior to this set up I tried to run my slaves without internet connections (using local repositories) and let me strongly recommend avoiding that, it’s just a big huge nightmare horror show of a hassle.



What’s NextI’ll continue getting into the nitty gritty details of how to take this Frankenstein of a cluster and actually get Hadoop up and running. I’m going to try and avoid the vague Linux/Hadoop language I came across when trying to learn this myself. A lot of the install instructions, help files, forums, and resources available assume a lot of prior knowledge and can leave you stranded 10 steps into a process. But 3 months ago I knew nothing about Hadoop and yesterday I re-provisioned a cluster and set up a Hue server in less than half a day. So it can be done!  No doubt, you’ll need to learn a lot stuff along the way, there’s really no avoiding it, there is no magic one-click install, but hopefully my pain can make this a little bit easier for anyone looking to jump in head first.

Original Article Written By: Fred LaCrosse

0 comments:

Post a Comment