I recently looked into learning Apache Hadoop. Big data and cloud services are where the internet is headed and having not had experience with big data yet, I thought it’d be a good idea to at least get my feet wet. I found a great article on CTO Vision that walks you through getting hadoop installed and started. Addendum: If you’re having problems with their instructions, like I was, there’s a complete breakdown at pyfunc’s page.
Fortunately, the config is XML-based, so its easy enough to understand once you get the basic syntax down. Unfortunately, the config CTO Vision linked to for Pseudo Distributed mode turned out to be a dead link (404) because Apache removed the Hadoop r0.20.2 documentation from their site sometime in the first half of 2013.
Luckily, I’m resourceful and know of a site other than google that caches web sites, and they just so happened to have a copy of the document I needed.
So in the hopes that Google will pick up the text of the following two URLs; the following pages are dead:
And without any further ado, you can find an archived/cached copy of the page linked to by the two links above at:
That’ll help you complete the initial setup for a test install of hadoop in Pseudo Distributed mode. Beyond that, the rest of the r0.20.2 documentation is there also, but this seems to be a frequently searched for page on since it came up under Google’s Autofill shortly after I started typing the first few characters of the URL.