Spark Submit


The following is a simple example of submitting a Spark job that uses an example JAR all users have access to. It estimates Pi, and the number at the end of the command is the number of iterations used (more iterations produce a more accurate estimate).

export SPARK_MAJOR_VERSION=2
cd /usr/hdp/current/spark2-client
spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master yarn \
   --queue <your_queue> \
examples/jars/spark-examples*.jar 10
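
For context, the SparkPi example estimates Pi by Monte Carlo sampling: it scatters random points over the unit square and counts how many fall inside the quarter circle. Below is a rough PySpark-console sketch of the same idea; it is only an illustration, not the bundled Scala example, and the 10 here is just the number of partitions to spread the work over:

>>> import random
>>> def inside(_):
...     x, y = random.random(), random.random()
...     return x * x + y * y < 1
...
>>> n = 1000000
>>> count = sc.parallelize(range(n), 10).filter(inside).count()
>>> print(4.0 * count / n)

Roughly speaking, the more points sampled, the closer the estimate gets to Pi.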

Gradle is a popular build tool for Java and Scala. The following example is useful if you are getting code from Bitbucket, GitHub, etc. This code can be downloaded and built by logging on to flux-hadoop-login and running:

git clone https://bitbucket.org/umarcts/spark-examples
cd spark-examples
./gradlew jar

The last command, “./gradlew jar”, will download all dependencies, compile the code, run tests, and package all of the code into a Java ARchive (JAR). This JAR is submitted to the cluster to run a job. For example, the AverageNGramLength job can be launched by running:

spark-submit \
   --class com.alectenharmsel.examples.spark.AverageNGramLength \
   --master yarn \
   --executor-memory 3g \
   --num-executors 35 \
 build/libs/spark-examples-*-all.jar /var/ngrams/data ngrams-out

The output will be located in your HDFS home directory in a directory called 'ngrams-out', and can be viewed by running:

hdfs dfs -cat ngrams-out/* | tail -5

The output should look like this:

(screenshot of Spark job output)

Pig


Pig is similar to Hive and can accomplish the same tasks. The Pig code to do this is a little longer due to its design. However, writing long Pig code is generally easier than writing multiple SQL queries that chain together, since Pig's language, Pig Latin, allows for variables and other high-level constructs.

# Open the interactive pig console
pig -Dtez.job.queuename=<your_queue>

# Load the data
ngrams = LOAD '/var/ngrams' USING PigStorage('\t') AS
    (ngram:chararray, year:int, count:long, volumes:long);

# Look at the schema of the ngrams variable
describe ngrams;

# Count the total number of rows (should be 1430731493)
ngrp = GROUP ngrams ALL;
count = FOREACH ngrp GENERATE COUNT(ngrams);
DUMP count;

# Select the number of words, by year, that have only appeared in a single volume
one_volume = FILTER ngrams BY volumes == 1;
by_year = GROUP one_volume BY year;
yearly_count = FOREACH by_year GENERATE group, COUNT(one_volume);
DUMP yearly_count;

The last few lines of output should look like this:

More information on Pig can be found on the Apache website.

Using Hadoop and HDFS


Hadoop consists of two components: HDFS, a filesystem built for high read speeds, and YARN, a resource manager. HDFS is not a POSIX filesystem, so normal command-line tools like "cp" and "mv" will not work. Most of the common tools have been reimplemented for HDFS and can be run using the "hdfs dfs" command. All data must be in HDFS for jobs to be able to read it.

Here are a few basic commands:

# List the contents of your HDFS home directory
hdfs dfs -ls

# Copy local file data.csv to your HDFS home directory
hdfs dfs -put data.csv data.csv

# Copy HDFS file data.csv back to your local home directory
hdfs dfs -get data.csv data2.csv
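
A few more of the reimplemented tools, shown as illustrative examples (the "mydata" directory name is just a placeholder):

# Make a directory in your HDFS home directory
hdfs dfs -mkdir mydata

# Move the uploaded file into that directory
hdfs dfs -mv data.csv mydata/data.csv

# Show how much space your files take up
hdfs dfs -du -h

# Delete a file (add -r to delete a directory)
hdfs dfs -rm mydata/data.csv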

A complete reference of HDFS commands can be found on the Apache website.

Understanding MapReduce And Tez


Writing Hadoop MapReduce code in Java is the lowest-level way to program against a Hadoop cluster. Hadoop's libraries do not contain any abstractions, like Spark RDDs or a higher-level language like Hive or Pig. All code must implement the MapReduce paradigm: map over the input records to emit key-value pairs, then reduce the values grouped under each key.

This video provides a great introduction to MapReduce. This documentation provides a written explanation and an example.
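
To make the paradigm concrete, here is a tiny, cluster-free Python sketch of the idea. It is purely an illustration of the map, shuffle, and reduce phases; it is not code you would submit to Hadoop, and the sample lines are made up:

from collections import defaultdict

# Toy input: each string stands in for one input record (e.g. a line of text).
lines = ["hadoop splits the input", "the mappers emit key-value pairs"]

# Map phase: emit (key, value) pairs from every record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: combine the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)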

However, some of our tools, mainly Hive and Pig, run on Tez rather than MapReduce. Some key points about the differences between Tez and MapReduce (with much credit to this post):

How Tez works (very similar to Spark):

Tez uses a DAG (directed acyclic graph) architecture.
1. Execute the plan, but there is no need to read data from disk yet.
2. Once it is ready to do some calculations (similar to actions in Spark), it gets the data from disk, performs all the steps, and produces the output.

Notice the efficiency introduced by not going to disk multiple times. Intermediate results are stored in memory (not written to disk). On top of that, there is vectorization (processing a batch of rows instead of one row at a time). All of this adds up to faster query times: only one read and one write are needed.

How MapReduce works:
1. Read data from file -> first disk access
2. Run mappers
3. Write map output -> second disk access
4. Run shuffle and sort -> read map output, third disk access
5. Write shuffle and sort output -> write sorted data for the reducers, fourth disk access
6. Run reducers, which read the sorted data -> fifth disk access
7. Write reducer output -> sixth disk access

Hive


To demonstrate Hive, below is a short tutorial. The tutorial uses the Google NGrams dataset, which is available in HDFS in /var/ngrams.

# Open the interactive hive console
hive --hiveconf tez.queue.name=<your_queue>

# Create a table with the Google NGrams data in /var/ngrams
CREATE EXTERNAL TABLE ngrams_<your_uniqname>(ngram STRING, year INT, count BIGINT, volumes BIGINT)
     ROW FORMAT DELIMITED
     FIELDS TERMINATED BY '\t'
     STORED AS TEXTFILE
     LOCATION '/var/ngrams';

# Look at the schema of the table
DESCRIBE ngrams_<your_uniqname>;

# Count the total number of rows (should be 1430731493)
SELECT COUNT(*) FROM ngrams_<your_uniqname>;

# Select the number of words, by year, that have only appeared in a single volume
SELECT year, COUNT(ngram) FROM ngrams_<your_uniqname>
WHERE volumes = 1
GROUP BY year;

# Optional: delete your ngrams table
DROP TABLE ngrams_<your_uniqname>;

# Exit the Hive console
QUIT;

The last few lines of output should look something like this:

More information can be found on the Apache website.

PySpark


Spark comes with an interactive Python console, which can be opened this way:

# Load the pyspark console 
pyspark --master yarn --queue <your_queue>

This interactive console can be used for prototyping or debugging, or just running simple jobs.

The following example runs a simple line count on a text file and also counts the number of lines containing the word "words" in that text file. You can use any text file you have for this example:

>>> textFile = sc.textFile("test.txt")
>>> textFile.count()
>>> textFile.first()
>>> textFile.filter(lambda line: "words" in line).count()

 

You can also submit a job using PySpark without using the interactive console.

Save this file as job.py.

from pyspark import SparkConf, SparkContext
import sys

# This script takes two arguments, an input and output
if len(sys.argv) != 3:
  print('Usage: ' + sys.argv[0] + ' <in> <out>')
  sys.exit(1)

input = sys.argv[1]
output = sys.argv[2]

# Set up the configuration and job context
conf = SparkConf().setAppName('AnnualWordLength')
sc = SparkContext(conf=conf)


# Read in the dataset and immediately transform all the lines in arrays
data = sc.textFile(input).map(lambda line: line.split('\t'))

# Compute the 'length' dataset: for each year, the total number of characters contributed by every word (word length weighted by its occurrence count). The result ends up in 'yearlyLength'.
yearlyLengthAll = data.map(
    lambda arr: (int(arr[1]), float(len(arr[0])) * float(arr[2]))
)
yearlyLength = yearlyLengthAll.reduceByKey(lambda a, b: a + b)

# Compute the 'words' dataset: the total number of word occurrences per year.
yearlyCount = data.map(
    lambda arr: (int(arr[1]), float(arr[2]))
).reduceByKey(
    lambda a, b: a + b
)

# Compute the 'average_length' dataset: divide total characters by total occurrences per year.
yearlyAvg = yearlyLength.join(yearlyCount).map(
    lambda tup: (tup[0], tup[1][0] / tup[1][1])
)

# Save the results in the specified output directory.
yearlyAvg.saveAsTextFile(output)

# Finally, let Spark know that the job is done.
sc.stop()

The above script averages the lengths of words in the NGrams dataset by year. There are two main operations in the code: 'map' and 'reduceByKey'. 'map' applies a function to each RDD element and returns a new RDD containing the results. 'reduceByKey' applies a function to the group of values with the same key, for every key, and returns an RDD with the results.
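
As a toy illustration of those two operations in the PySpark console (the year and value pairs here are made up):

>>> pairs = sc.parallelize([(1900, 4.0), (1900, 6.0), (1901, 5.0)])
>>> pairs.map(lambda kv: (kv[0], kv[1] * 2)).collect()
>>> pairs.reduceByKey(lambda a, b: a + b).collect()

Here 'map' doubles every value independently, while 'reduceByKey' adds together the values that share the same year, which is exactly how the script above combines lengths and counts per year.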

The job can be submitted by running:

spark-submit \
 --master yarn \
 --num-executors 35 \
 --executor-memory 5g \
 --executor-cores 4 \
 job.py /var/ngrams/data ngrams-out


The output can be viewed by running:

hdfs dfs -cat ngrams-out/*

The only required argument in the above job submission command is '--master yarn'. The values passed to the other arguments may be modified in order to get better performance or to conform to the limits of your queue.

*Note: Our default Python is Anaconda 2-5.0.1. If you would like to use Anaconda 3-5.0.1 for your PySpark job, run the following command:

export PYSPARK_PYTHON=/sw/dsi/centos7/x86-64/Anaconda3-5.0.1/bin/python

Creating an Account


Using Flux Hadoop requires a Flux user account but does not require an active Flux allocation. Information on getting a Flux account can be found on the Flux User Guide page. You only need to pay attention to steps 1, 2, and 4 in order to use Flux Hadoop. Please note that in step 4, rather than logging into flux-login.arc-ts.umich.edu, you need to log into flux-hadoop-login.arc-ts.umich.edu. This video will help teach you some basic Linux navigation commands if needed.

When you create an account, you will automatically be added to our default queue. If you are using our Hadoop cluster for a class or another specific purpose, or you need a significant allocation, please open a ticket with us so you can be added to the appropriate queue. In many of the examples in this user guide, there is a "--queue <your_queue>" flag. Please fill in the name of your queue when running these examples, or simply use "default" if you do not have one.
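
For example, if you only have access to the default queue, the SparkPi job from the Spark Submit section above would be submitted with the queue filled in like this:

spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master yarn \
   --queue default \
examples/jars/spark-examples*.jar 10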

Using mrjob


Another way to run Hadoop jobs is through mrjob. mrjob is useful for testing code on smaller data on another system (such as your laptop) and then running it on something larger, like a Hadoop cluster. To run an mrjob job on your laptop, you can simply remove the "-r hadoop" from the command in the example we use here.

A classic example is a word count, taken from the official mrjob documentation here.

Save this file as mrjob_test.py.

"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()

Then, run the following command:

python mrjob_test.py -r hadoop /etc/motd

You should receive output containing the word counts for the file /etc/motd. You can also try this with any other text file you have.
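
As noted above, the same script can be tested locally (for example, on your own machine) by simply dropping the "-r hadoop" flag:

python mrjob_test.py /etc/motd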

Using Pig


Pig is similar to Hive and can accomplish the same tasks. The Pig code to do this is a little longer due to its design. However, writing long Pig code is generally easier than writing multiple SQL queries that chain together, since Pig's language, Pig Latin, allows for variables and other high-level constructs.

# Open the interactive pig console
pig -Dmapreduce.job.queuename=<your_queue>

# Load the data
ngrams = LOAD '/var/ngrams' USING PigStorage('\t') AS
    (ngram:chararray, year:int, count:long, volumes:long);

# Look at the schema of the ngrams variable
describe ngrams;

# Count the total number of rows (should be 1201784959)
ngrp = GROUP ngrams ALL;
count = FOREACH ngrp GENERATE COUNT(ngrams);
DUMP count;

# Select the number of words, by year, that have only appeared in a single volume
one_volume = FILTER ngrams BY volumes == 1;
by_year = GROUP one_volume BY year;
yearly_count = FOREACH by_year GENERATE group, COUNT(one_volume);
DUMP yearly_count;

The last few lines of output should look like this:

(screenshot of Pig output)

More information on Pig can be found on the Apache website.