Hive


The short tutorial below demonstrates Hive using the Google NGrams dataset, which is available in HDFS at /var/ngrams.

# Open the interactive hive console
hive

# Create a table with the Google NGrams data in /var/ngrams
CREATE EXTERNAL TABLE ngrams_your-uniqname(ngram STRING, year INT, count BIGINT, volumes BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/var/ngrams';

# Look at the schema of the table
DESCRIBE ngrams_your-uniqname;

# Count the total number of rows (should be 1430731493)
SELECT COUNT(*) FROM ngrams_your-uniqname;

# Select the number of words, by year, that have only appeared in a single volume
SELECT year, COUNT(ngram) FROM ngrams_your-uniqname WHERE
volumes = 1
GROUP BY year;

# Optional: delete your ngrams table
DROP TABLE ngrams_your-uniqname;

# Exit the Hive console
QUIT;
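
If you would rather not type each statement into the interactive console, the same HiveQL can also be run non-interactively from the shell. A minimal sketch, assuming you have saved the statements above in a file named ngrams.hql (a name chosen here for illustration):

# Run a saved HiveQL script
hive -f ngrams.hql

# Or run a single statement directly
hive -e "SELECT COUNT(*) FROM ngrams_your-uniqname;"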

Streaming (Other Programming Methods)


It is also possible to write a job in any programming language, such as Python or C, that operates on tab-separated key-value pairs. The same example done above with Hive can also be written in Python and submitted as a Hadoop job using Hadoop Streaming. Submitting a job with Hadoop Streaming requires writing a mapper and a reducer. The mapper reads its input line by line and emits key-value pairs, which the reducer then aggregates into the final result. In our case, the mapper reads each line and, if the ngram on that line appeared in only a single volume, outputs the year as the key and '1' as the value. The Python code to do this is:

(Save this file as map.py)

#!/usr/bin/env python2.7
import fileinput

# Read tab-separated lines: ngram, year, count, volumes
for line in fileinput.input():
    arr = line.split("\t")
    try:
        # Emit "year<TAB>1" for ngrams that appear in only a single volume
        if int(arr[3]) == 1:
            print("\t".join([arr[1], '1']))
    except (IndexError, ValueError):
        # Skip malformed or incomplete lines
        pass
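
Before submitting anything to the cluster, you can sanity-check the mapper on a hand-written sample line (the values below are made up, but follow the ngram, year, count, volumes layout of the dataset):

printf 'flibbertigibbet\t1993\t21\t1\n' | python map.py

This should print 1993 and 1 separated by a tab, since the sample ngram appears in only one volume.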

 

Now that the mapper has done this, the reducer merely needs to sum the values for each key:

(Save this file as red.py)

#!/usr/bin/env python2.7
import fileinput

# Running total for each key (year) read from the mapper's output
data = dict()

for line in fileinput.input():
    arr = line.split("\t")
    if arr[0] not in data:
        data[arr[0]] = int(arr[1])
    else:
        data[arr[0]] = data[arr[0]] + int(arr[1])

# Emit "year<TAB>total" for each year seen
for key in data:
    print("\t".join([key, str(data[key])]))

 

Submit the streaming job by running the following command:

yarn jar $HADOOP_STREAMING \
 -Dmapreduce.job.queuename=<your_queue> \
 -input /var/ngrams/data \
 -output ngrams-out \
 -mapper map.py \
 -reducer red.py \
 -file map.py \
 -file red.py \
 -numReduceTasks 10
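
When the job finishes, its output is written to the ngrams-out directory in your HDFS home directory (the path given to -output above). You can list it first to confirm the reducers produced their output files:

hdfs dfs -ls ngrams-out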


To view the last few lines of the output:

hdfs dfs -cat ngrams-out/* | tail -5

When you are done, delete the output directory:

hdfs dfs -rm -r -skipTrash /user/<your_uniqname>/ngrams-out

Pig


Pig is no longer available as part of our Hadoop software stack due to decisions of the upstream Hadoop project software maintainers.

mrjob


Another way to run Hadoop jobs is through mrjob. mrjob is useful for testing code on smaller datasets on another system (such as your laptop) and then running the same job on something larger, like a Hadoop cluster. To run an mrjob job on your laptop instead, simply remove "-r hadoop" from the command in the example below.

A classic example is a word count, taken from the official mrjob documentation.

Save this file as mrjob_test.py.

"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()

Then, run the following command:

python mrjob_test.py -r hadoop /etc/motd

You should see output containing the word counts for the file /etc/motd. You can also try this with any other text file you have.
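
As noted above, the same script can also be run locally, without Hadoop, by dropping the -r hadoop flag:

python mrjob_test.py /etc/motd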