Using Hadoop

By March 20, 2016

Writing Hadoop MapReduce code in Java is the lowest level way to program against a Hadoop cluster. Hadoop’s libraries do not contain any abstractions, like Spark RDDs or a Hive or Pig-like higher level language. All code must implement the MapReduce paradigm.

Some example Java MapReduce code can be found in the ARC-TS Hadoop examples page. There is an example job – AverageNGramLength – to generate a yearly average of all words in Google’s
NGrams dataset that is in HDFS at `/var/ngrams`. This job can be built and launched from flux-login by running:

git clone https://bitbucket.org/umarcts/hadoop-examples.git
cd hadoop-examples
./gradlew jar
yarn jar build/libs/hadoop-examples-*-all.jar \
  com.alectenharmsel.examples.hadoop.AverageNGramLength \
  -Dmapreduce.job.queuename=<your_queue> \
  /var/ngrams ngrams-out

# List running jobs
yarn application -list

# Once it is done, view the output
hdfs dfs -cat ngrams-out/*

The result of the command will print all of the output to the terminal. This output can be redirected to a file and plotted with any plotting tool, such as R or MATLAB.

The last few lines of output should look like this:

hadoop output

To get those exact lines (if you don’t want to see all the output), the command is:

hdfs dfs -cat ngrams-out/* | tail -5

 

Please note that it can take around 1-5 minutes for everything to build and launch.

Afterwards, it’s smart to remove the directory your output is in, so that you can use the same name in future examples. To do this, run:

hdfs dfs -rm -r -skipTrash /user/<your_uniqname>/ngrams-out
Next Post