Writing Hadoop MapReduce code in Java is the lowest-level way to program against a Hadoop cluster. Hadoop’s libraries do not contain any abstractions, such as Spark RDDs or a higher-level language like Hive or Pig. All code must implement the MapReduce paradigm: a map step emits key–value pairs, the framework groups the pairs by key, and a reduce step aggregates each group.
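As a rough illustration of that paradigm (in plain Java, deliberately not Hadoop's actual Mapper/Reducer API), the sketch below runs a map, shuffle, and reduce phase in a single process to compute an average word length per year. The class name and input records are hypothetical.

```java
import java.util.*;

// A single-process sketch of the MapReduce paradigm
// (plain Java for illustration -- not Hadoop's API).
public class MapReduceSketch {
    // Input lines are hypothetical "word year" records.
    static Map<String, Double> averageByYear(List<String> lines) {
        // Map phase: emit (year, wordLength); shuffle: group values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(" ");
            grouped.computeIfAbsent(parts[1], k -> new ArrayList<>())
                   .add(parts[0].length());
        }
        // Reduce phase: collapse each key's group of values to an average.
        Map<String, Double> result = new TreeMap<>();
        grouped.forEach((year, lengths) -> result.put(year,
            lengths.stream().mapToInt(Integer::intValue).average().orElse(0)));
        return result;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "hadoop 2007", "cluster 2007", "java 2008", "mapreduce 2008");
        averageByYear(input).forEach((year, avg) ->
            System.out.println(year + "\t" + avg)); // prints 2007 -> 6.5, 2008 -> 6.5
    }
}
```

In real Hadoop code, the map and reduce phases would live in separate Mapper and Reducer classes, and the grouping would be done by the framework across the cluster rather than in memory.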
Some example Java MapReduce code can be found on the ARC-TS Hadoop examples page. One example job, AverageNGramLength, generates a yearly average of all words in Google’s NGrams dataset, which is stored in HDFS at `/var/ngrams`. This job can be built and launched from flux-login by running:
git clone https://bitbucket.org/umarcts/hadoop-examples.git
cd hadoop-examples
./gradlew jar

yarn jar build/libs/hadoop-examples-*-all.jar \
  com.alectenharmsel.examples.hadoop.AverageNGramLength \
  -Dmapreduce.job.queuename=<your_queue> \
  /var/ngrams ngrams-out

# List running jobs
yarn application -list

# Once it is done, view the output
hdfs dfs -cat ngrams-out/*
The final command prints all of the job output to the terminal. This output can be redirected to a file and plotted with any plotting tool, such as R or MATLAB.
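If the output is redirected to a file, it can be parsed before plotting. Assuming the job uses Hadoop's default TextOutputFormat, each line is a key and a value separated by a tab; the parser below is a hypothetical sketch under that assumption, with made-up sample values.

```java
import java.util.*;

// Parse job output lines into (year, value) pairs before plotting.
// Assumes Hadoop's default TextOutputFormat: key <TAB> value per line.
public class OutputParser {
    static Map<Integer, Double> parse(List<String> lines) {
        Map<Integer, Double> byYear = new TreeMap<>();
        for (String line : lines) {
            String[] kv = line.split("\t");
            byYear.put(Integer.parseInt(kv[0].trim()),
                       Double.parseDouble(kv[1].trim()));
        }
        return byYear;
    }

    public static void main(String[] args) {
        // Hypothetical lines redirected from `hdfs dfs -cat ngrams-out/*`
        List<String> sample = List.of("2005\t4.9", "2006\t5.1");
        System.out.println(parse(sample)); // prints {2005=4.9, 2006=5.1}
    }
}
```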
The last few lines of output should look like this:
To see just those lines without scrolling through all of the output, the command is:
hdfs dfs -cat ngrams-out/* | tail -5
Please note that it can take around 1–5 minutes for everything to build and launch.
Afterwards, it is a good idea to remove the output directory so that the same name can be reused in later examples; a MapReduce job will fail rather than overwrite an existing output directory. To do this, run:
hdfs dfs -rm -r -skipTrash /user/<your_uniqname>/ngrams-out