Using Cavium

The Apache Hadoop software library on the Cavium ThunderX Cluster is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

To search this user guide, use the command + f keyboard shortcut.

Getting Started with Cavium

Overview

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across a cluster of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. (From hadoop.apache.org)

The software available is:

Back To Top

Creating an Account

Using the Cavium Hadoop cluster requires an ARC-TS user account (acquired by filling out this form) but does not require an active Flux allocation.

When you create an account, you will automatically be added to our default queue. If you are using our Hadoop cluster for a class or another specific purpose, or you need a significant allocation, please open a ticket with us so you can be added to the appropriate queue. In many of the examples in this user guide, there is a “–queue <your_queue>” flag. Please fill in the name of your queue when running this examples, or simply “default” if you do not have one.

Back To Top

Logging In

To log in to the Cavium Hadoop cluster, you need a terminal.  Currently the cluster is only accessible via the command line.

If you are trying to log in from off campus, or using the M-Guest wireless network, you have a couple of options:

    • Install VPN software on your computer
    • First ssh to login.itd.umich.edu, then ssh to cavium-thunderx.arc-ts.umich.edu from there.

Here’s what a login looks like using a terminal emulator:

Mac using terminal: Open terminal

Type: ssh -l uniqname cavium-thunderx.arc-ts.umich.edu [replacing your uniqname in the command]

Windows using PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/).

Launch Putty and enter cavium-thunderx.arc-ts.umich.edu as the host name then click open.

For both Mac and Windows:

At the “Enter a passcode or select one of the following options:” prompt, type the number of your preferred choice for Duo authentication.

Back To Top

Hadoop and HDFS Basics

Using Hadoop and HDFS

Hadoop consists of two components; HDFS, a filesystem built for high read speeds, and YARN, a resource manager. HDFS is not a POSIX filesystem, so normal command line tools like “cp” and “mv” will not work. Most of the common tools have been reimplemented for HDFS and can be run using the “hdfs dfs” command. All data must be in HDFS for jobs to be able to read it.

Here are a few basic commands:

# List the contents of your HDFS home directory
hdfs dfs -ls

# Copy local file data.csv to your HDFS home directory
hdfs dfs -put data.csv data.csv

# Copy HDFS file data.csv back to your local home directory
hdfs dfs -get data.csv data2.csv

A complete reference of HDFS commands can be found on the Apache website.

Back To Top

Fuse HDFS

Fuse HDFS allows you use standard posix system commands with HDFS. This may be useful, for example, if you have a program that needs to use data that is stored in HDFS. 

To use Fuse HDFS, change directories to /hadoop-fuse/user/<your_uniqname>

Once in this directory, you can use commands on your HDFS files just as you would on any other files. For example, the ls command will list the contents of your HDFS home directory.

You could also run a Python or R program that uses a file in HDFS.

You can save the below file and run it as you would regularly run a python program to access an example data file we have available to all users in HDFS.

#!/usr/bin/python
f = open("/hadoop-fuse/var/examples/romeojuliet.txt", "r")
data = f.read()
f.close()
d = {}
for word in data.split(' '):
        if word in d:
                d[word] += 1
        else:
                d[word] = 1
for word, count in d.items():
        print word + str(count)

Back To Top

Understanding MapReduce

Writing Hadoop MapReduce code in Java is the lowest level way to program against a Hadoop cluster. Hadoop’s libraries do not contain any abstractions, like Spark RDDs or a Hive or Pig-like higher level language. All code must implement the MapReduce paradigm.

This video provides a great introduction to MapReduce. This documentation provides a written explanation and an example.

Back To Top

Spark

Introduction to Spark

Spark and PySpark utilize a container called Resilient Distributed Dataset (RDD) for storing and operating on data. The most important characteristic of Spark’s RDD is that it is immutable — once created, the data it contains cannot be updated. New RDDs can be created by transforming the data in another RDD, which is how analysis is done with Spark.

Using Spark’s native language, Scala, requires more setup than using PySpark. Some example Scala jobs, including the same example job in the PySpark documentation, can be found on this website. That Spark code has some trivial set up required to run a Spark job, and all of the actual logic is in the ‘run’ function.

Back To Top

PySpark

Spark comes with an interactive Python console, which can be opened this way:

# Load the pyspark console 
pyspark --master yarn --queue <your_queue>

This interactive console can be used for prototyping or debugging, or just running simple jobs.

The following example runs a simple line count on a text file, as well as counts the number of instances of the word “words” in that textfile. You can use any text file you have for this example:

>>> textFile = sc.textFile("test.txt")
>>> textFile.count()
>>> textFile.first()
>>> textFile.filter(lambda line: "words" in line).count()

 

You can also submit a job using PySpark without using the interactive console.

Save this file as job.py.

from pyspark import SparkConf, SparkContext
import sys

# This script takes two arguments, an input and output
if len(sys.argv) != 3:
  print('Usage: ' + sys.argv[0] + ' <in> <out>')
  sys.exit(1)

input = sys.argv[1]
output = sys.argv[2]

# Set up the configuration and job context
conf = SparkConf().setAppName('AnnualWordLength')
sc = SparkContext(conf=conf)


# Read in the dataset and immediately transform all the lines in arrays
data = sc.textFile(input).map(lambda line: line.split('\t'))

# Create the 'length' dataset as mentioned above. This is done using the next two variables, and the 'length' dataset ends up in 'yearlyLength'.
yearlyLengthAll = data.map(
    lambda arr: (int(arr[1]), float(len(arr[0])) * float(arr[2]))
)
yearlyLength = yearlyLengthAll.reduceByKey(lambda a, b: a + b)

# Create the 'words' dataset as mentioned above.
yearlyCount = data.map(
    lambda arr: (int(arr[1]), float(arr[2]))
).reduceByKey(
    lambda a, b: a + b
)

# Create the 'average_length' dataset as mentioned above.
yearlyAvg = yearlyLength.join(yearlyCount).map(
    lambda tup: (tup[0], tup[1][0] / tup[1][1])
)

# Save the results in the specified output directory.
yearlyAvg.saveAsTextFile(output)

# Finally, let Spark know that the job is done.
sc.stop()

This above script averages the lengths of words in the NGrams dataset by year. There are two main operations in the above code: ‘map’ and ‘reduceByKey’. ‘map’ applies a function to each RDD element and returns a new RDD containing the results. ‘reduceByKey’ applies a function to the group of values with the same key – for all keys – and returns an RDD with the result.

The job can be submitted by running:

spark-submit \
 --master yarn \
 --num-executors 35 \
 --executor-memory 5g \
 --executor-cores 4 \
 job.py /var/ngrams/data ngrams-out


hdfs dfs -cat ngrams-out/*

 

The only required argument from the above job submission command is ‘–master yarn’. The values passed to the other arguments may be modified in order to get better performance or conform to the limits of your queue.

*Note: Our default Python is Anaconda 2-5.0.1. If you would like to use Anaconda 3-5.0.1 for your PySpark job, run the following command:

export PYSPARK_PYTHON=/sw/dsi/centos7/x86-64/Anaconda3-5.0.1/bin/python

Back To Top

Spark Shell

Spark has an easy-to-use interactive shell that can be used to learn API and also analyze data interactively. Below is a simple example written in Scala. You can use any text file that you have:

spark-shell --master yarn --queue <your_queue>
scala> val textFile = spark.read.textFile("test.txt")
scala> textFile.count()
scala> textFile.first()
//Count how many lines contain the word "words"
//You can replace "words" with any word you'd like
scala> textFile.filter(line => line.contains("words")).count()

Back To Top

Spark Submit

The following is a simple example of submitting a Spark job that uses an existing jar all users have access to. It estimates Pi, and the number at the end is the number of iterations it uses (more iterations = more accurate).

export SPARK_MAJOR_VERSION=2
cd /usr/hdp/current/spark2-client
spark-submit \
   --class org.apache.spark.examples.SparkPi \
   --master yarn \
   --queue <your_queue> \
examples/jars/spark-examples*.jar 10

Gradle is a popular build tool for Java and Scala. The following example is useful if you may be getting code from bitbucket, github, etc. This code can be downloaded and built by logging on to cavium-thunderx-arc-ts.umich.edu and running:

git clone https://bitbucket.org/umarcts/spark-examples
cd spark-examples
./gradlew jar

The last command, “./gradlew jar”, will download all dependencies, compile the code, run tests, and package all of the code into a Java ARchive (JAR). This JAR is submitted to the cluster to run a job. For example, the AverageNGramLength job can be launched by running:

spark-submit \
   --class com.alectenharmsel.examples.spark.AverageNGramLength \
   --master yarn \
   --executor-memory 3g \
   --num-executors 35 \
 build/libs/spark-examples-*-all.jar /var/ngrams/data ngrams-out

The output will be located in your home directory in a directory called ‘ngrams-out’, and can be viewed by running:

hdfs dfs -cat ngrams-out/* | tail -5

The output should look like this:

spark output

Back To Top

SparkR

SparkR allows users to utilize the ease of data analysis in R while using the speed and capacity of Spark on our Hadoop cluster. Those familiar with R should have no problem utilizing this feature. After opening the SparkR session, simply begin typing out your program in R.

Run this to open a SparkR session, run this:

sparkR --master yarn --queue <your_queue> --num-executors 4 --executor-memory 1g --executor-cores 4

 

The following is an example you can run to get a feel for how SparkR works. This example was taken from the official SparkR documentation, which can be found here, along with other examples.

families <- c("gaussian", "poisson")
train <- function(family) {
 model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
 summary(model)
}
# Return a list of model's summaries
model.summaries <- spark.lapply(families, train)

# Print the summary of each model
print(model.summaries)

Back To Top

Parquet Files

If you’re familiar with Spark, you know that a dataframe is essentially a data structure that contains “tabular” data in memory. That is, it consists of rows and columns of data that can, for example, store the results of an SQL-style query. Dataframes can be saved into HDFS as Parquet files. Parquet files not only preserve the schema information of the dataframe, but will also compress the data when it gets written into HDFS. This means that the saved file will take up less space in HDFS and it will load faster if you read the data again later. Therefore, it is a useful storage format for data you may want to analyze multiple times.

The Pyspark example below uses Reddit data which is available to all Cavium Hadoop users in HDFS ‘/var/reddit’. This data consists of information about all posts made on the popular website Reddit, including their score, subreddit, text body, author, all of which can make for interesting data analysis.

#First, launch the pyspark shell

pyspark --master yarn --queue <your_queue> --num-executors 35 --executor-cores 4 --executor-memory 5g

#Load the reddit data into a dataframe

>>> reddit = sqlContext.read.json("/var/reddit/RS_2016-0*")

#Set compression type to snappy

>>> sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

#Write data into a parquet file - this example puts it into your HDFS home directory as “reddit.parquet”

>>> reddit.write.parquet("reddit.parquet")

#Create a new dataframe from parquet file 

>>> parquetFile = sqlContext.read.parquet("reddit.parquet")

#Register dataframe as a SQL temporary table

>>> parquetFile.registerTempTable(“reddit_table")

#Query the table

#Can really be any query, but this query will find some of the more highly rated posts

>>> ask = sqlContext.sql(“SELECT title FROM reddit_table WHERE score > 1000 and subreddit = ‘AskReddit’”)

#Since we created the dataframe “ask” with the previous query, we can write it out to HDFS as a parquet file so it can be accessed again later

>>> ask.write.parquet(“ask.parquet”)

#Exit the pyspark console - you’ll view the contents of your parquet file after

>>> exit()

 

To view the contents of your Parquet file, use Parquet tools. Parquet tools is a command line tool that aids in the inspection of Parquet files, such as viewing its contents or its schema.

#view the output 

hadoop jar /sw/dsi/noarch/parquet-tools-1.7.0.jar cat \ 
ask.parquet

#view the schema; in this case, just the “title” of the askreddit thread 

hadoop jar /sw/dsi/noarch/parquet-tools-1.7.0.jar schema \ 
ask.parquet 

#to get a full list of all of the options when using Parquet tools  
hadoop jar /sw/dsi/noarch/parquet-tools-1.7.0.jar -h

Back To Top

Other Common Hadoop Tools

Hive

Hive is currently not available on the Cavium ThunderX Cluster. Check back soon for updates.

Back To Top

Streaming (Other Programming Methods)

It is also possible to write a job in any programming language, such as Python or C, that operates on tab-separated key-value pairs. The same example done above with Hive and Pig can also be written in Python and submitted as a Hadoop job using Hadoop Streaming. Submitting a job with Hadoop Streaming requires writing a mapper and a reducer. The mapper reads input line by line and generates key-value pairs for the reducer to “reduce” into some sort of sensible data. For our case, the mapper will read in lines and output the year as the key and a ‘1’ as the value if the ngram in the line it reads has only appeared in a single volume. The python code to do this is:

(Save this file as map.py)

#!/usr/bin/env python2.7
import fileinput
for line in fileinput.input():
 arr = line.split("\t")
 try:
    if int(arr[3]) == 1:
       print("\t".join([arr[1], '1']))
 except IndexError:
       pass
 except ValueError:
       pass

 

Now that the mapper has done this, the reduce merely needs to sum the values based on the key:

(Save this file as red.py)

#!/usr/bin/env python2.7

import fileinput

data = dict()

for line in fileinput.input():
  arr = line.split("\t")
  if arr[0] not in data.keys():
     data[arr[0]] = int(arr[1])
  else:
     data[arr[0]] = data[arr[0]] + int(arr[1])

for key in data:
 print("\t".join([key, str(data[key])]))

 

Submitting this streaming job can be done by running the below command:

yarn jar $HADOOP_STREAMING \
 -Dmapreduce.job.queuename=<your_queue> \
 -input /var/ngrams/data \
 -output ngrams-out \
 -mapper map.py \
 -reducer red.py \
 -file map.py \
 -file red.py \
 -numReduceTasks 10


hdfs dfs -cat ngrams-out/* | tail -5

streaming output
hdfs dfs -rm -r -skipTrash /user/<your_uniqname>/ngrams-out

Back To Top

Pig

Pig is similar to Hive and can do the same thing. The Pig code to do this is a little bit longer due to its design. However, writing long Pig code is generally easier that writing multiple SQL queries that that chain together, since Pig’s language, PigLatin, allows for variables and other high-level constructs.

# Open the interactive pig console
pig -Dtez.job.queuename=<your_queue>

# Load the data
ngrams = LOAD '/var/ngrams' USING PigStorage('\t') AS 
(ngram:chararray,
year:int, count:long, volumes:long);

# Look at the schema of the ngrams variable
describe ngrams;

# Count the total number of rows (should be 1430731493)
ngrp = GROUP ngrams ALL;
count = FOREACH ngrp GENERATE COUNT(ngrams);
DUMP count;

# Select the number of words, by year, that have only appeared in a single volume
one_volume = FILTER ngrams BY volumes == 1;
by_year = GROUP one_volume BY year;
yearly_count = FOREACH by_year GENERATE group, COUNT(one_volume);
DUMP yearly_count;

The last few lines of output should look like this:

More information on Pig can be found on the Apache website.

Back To Top

mrjob

Another way to run Hadoop jobs is through mrjob. Mrjob is useful for testing out smaller data on another system (such as your laptop), and later being able to run it on something larger, like a Hadoop cluster. To run an mrjob on your laptop, you can simply remove the “-r hadoop” from the command in the example we use here.

A classic example is a word count, taken from the official mrjob documentation here.

Save this file as mrjob_test.py.

"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner(self, word, counts):
        yield (word, sum(counts))

    def reducer(self, word, counts):
        yield (word, sum(counts))


if __name__ == '__main__':
     MRWordFreqCount.run()

Then, run the following command:

python mrjob_test.py -r hadoop /etc/motd

You should receive an output with the word count of the file /etc/motd. You can also try this with any other file you have that contains text.

Back To Top

Policies

Order Service

To request an account, send an email to hpc-support@umich.edu. Users of the service must have a U-M uniqname and Duo security authentication to access this service.

When requesting an account, please supply your name, uniqname, and a short summary of what you’d like to do on the cluster.