Introduction to Spark


Spark and PySpark store and operate on data in a container called a Resilient Distributed Dataset (RDD). The most important characteristic of an RDD is that it is immutable: once created, the data it contains cannot be updated. New RDDs are created by transforming the data in an existing RDD, which is how analysis is done with Spark.
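The idea of deriving new datasets instead of updating existing ones can be sketched without Spark at all. The snippet below is only a plain-Python analogy, not Spark code: a tuple stands in for an RDD, and the sample lines are made up for illustration.

```python
# A tuple stands in for an RDD here: like an RDD, it cannot be modified in place.
lines = ("spark is fast", "rdds are immutable", "spark uses rdds")

# Each "transformation" builds a new collection and leaves `lines` untouched.
upper = tuple(line.upper() for line in lines)                  # comparable to rdd.map(...)
with_spark = tuple(line for line in lines if "spark" in line)  # comparable to rdd.filter(...)

print(len(lines), len(with_spark))  # 3 2
```

In Spark the same pattern applies at cluster scale: each transformation returns a new RDD, and the RDD it was derived from remains available for further analysis.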

Using Spark’s native language, Scala, requires more setup than using PySpark. Some example Scala jobs, including the same example job in the PySpark documentation, can be found on this website. That Spark code requires some trivial setup to run a Spark job, and all of the actual logic is in the ‘run’ function.

Parquet Files


If you’re familiar with Spark, you know that a dataframe is essentially a data structure that contains “tabular” data in memory. That is, it consists of rows and columns of data that can, for example, store the results of an SQL-style query. Dataframes can be saved into HDFS as Parquet files. Parquet files not only preserve the schema information of the dataframe, but will also compress the data when it gets written into HDFS. This means that the saved file will take up less space in HDFS and it will load faster if you read the data again later. Therefore, it is a useful storage format for data you may want to analyze multiple times.

The PySpark example below uses Reddit data, which is available to all Flux Hadoop users in HDFS at ‘/var/reddit’. This data consists of information about all posts made on the popular website Reddit, including their score, subreddit, text body, and author, all of which can make for interesting data analysis.

#First, launch the pyspark shell

pyspark --master yarn-client --queue <your_queue> --num-executors 35 --executor-cores 4 --executor-memory 5g

#Load the reddit data into a dataframe

>>> reddit = sqlContext.read.json("/var/reddit/RS_2016-0*")

#Set compression type to snappy

>>> sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

#Write data into a parquet file - this example puts it into your HDFS home directory as “reddit.parquet”

>>> reddit.write.parquet("reddit.parquet")

#Create a new dataframe from parquet file 

>>> parquetFile = sqlContext.read.parquet("reddit.parquet")

#Register dataframe as a SQL temporary table

>>> parquetFile.registerTempTable("reddit_table")

#Query the table

#Can really be any query, but this query will find some of the more highly rated posts

>>> ask = sqlContext.sql("SELECT title FROM reddit_table WHERE score > 1000 AND subreddit = 'AskReddit'")

#Since we created the dataframe “ask” with the previous query, we can write it out to HDFS as a parquet file so it can be accessed again later

>>> ask.write.parquet("ask.parquet")

#Exit the pyspark console - you’ll view the contents of your parquet file after

>>> exit()

 

To view the contents of your Parquet file, use Parquet tools, a command-line tool for inspecting Parquet files, for example by viewing their contents or schema.

#view the output

hadoop parquet.tools.Main cat ask.parquet

#view the schema; in this case, just the “title” of the askreddit thread

hadoop parquet.tools.Main schema ask.parquet

Spark Shell


Spark has an easy-to-use interactive shell that can be used to learn the API and to analyze data interactively. Below is a simple example written in Scala:

spark-shell --master yarn-client
scala> val textFile = spark.read.textFile("test.txt")
scala> textFile.count()
scala> textFile.first()
scala> textFile.filter(line => line.contains("words")).count()

Spark Submit


Gradle is a popular build tool for Java and Scala. The example Scala code can be downloaded and built by logging on to flux-login and running:

git clone https://bitbucket.org/umarcts/spark-examples
cd spark-examples
./gradlew jar

The last command, “./gradlew jar”, will download all dependencies, compile the code, run tests, and package all of the code into a Java ARchive (JAR). This JAR is submitted to the cluster to run a job. For example, the AverageNGramLength job can be launched by running:

spark-submit \
   --class com.alectenharmsel.examples.spark.AverageNGramLength \
   --master yarn-client \
   --executor-memory 3g \
   --num-executors 35 \
   --queue <your_queue> \
 build/libs/spark-examples-*-all.jar /var/ngrams ngrams-out

The output will be located in your home directory in a directory called ‘ngrams-out’, and can be viewed by running:

hdfs dfs -cat ngrams-out/* | tail -5

The output should look like this:

[Image: spark output]

PySpark


Spark comes with an interactive Python console, which can be opened this way:

# Load the pyspark console 
pyspark --master yarn-client --queue <your_queue>

This interactive console can be used for prototyping or debugging, or just running simple jobs.

The following example counts the lines in a text file, returns its first line, and counts the lines that contain the word “words”:

>>> textFile = sc.textFile("test.txt")
>>> textFile.count()
>>> textFile.first()
>>> textFile.filter(lambda line: "words" in line).count()

 


Save the following script as job.py:

from pyspark import SparkConf, SparkContext
import sys

# This script takes two arguments, an input and output
if len(sys.argv) != 3:
  print('Usage: ' + sys.argv[0] + ' <in> <out>')
  sys.exit(1)

input = sys.argv[1]
output = sys.argv[2]

# Set up the configuration and job context
conf = SparkConf().setAppName('AnnualWordLength')
sc = SparkContext(conf=conf)


# Read in the dataset and immediately split each tab-separated line into an array of fields
data = sc.textFile(input).map(lambda line: line.split('\t'))

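# Create the 'length' dataset: for each year (arr[1]), the total number of
# characters, computed as word length (arr[0]) times the value in arr[2]
# (treated here as the word's occurrence count). The per-record pairs are in
# 'yearlyLengthAll' and the per-year totals end up in 'yearlyLength'.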
yearlyLengthAll = data.map(
    lambda arr: (int(arr[1]), float(len(arr[0])) * float(arr[2]))
)
yearlyLength = yearlyLengthAll.reduceByKey(lambda a, b: a + b)

# Create the 'words' dataset: for each year, the sum of the arr[2] values.
yearlyCount = data.map(
    lambda arr: (int(arr[1]), float(arr[2]))
).reduceByKey(
    lambda a, b: a + b
)

# Create the 'average_length' dataset: join the two totals by year and
# divide total length by total count.
yearlyAvg = yearlyLength.join(yearlyCount).map(
    lambda tup: (tup[0], tup[1][0] / tup[1][1])
)

# Save the results in the specified output directory.
yearlyAvg.saveAsTextFile(output)

# Finally, let Spark know that the job is done.
sc.stop()

The script above averages the lengths of words in the NGrams dataset by year. It relies on three main operations: ‘map’, ‘reduceByKey’, and ‘join’. ‘map’ applies a function to each RDD element and returns a new RDD containing the results. ‘reduceByKey’ combines all of the values that share a key using the given function, for every key, and returns an RDD of (key, result) pairs. ‘join’ matches the elements of two RDDs by key and returns an RDD of (key, (value1, value2)) pairs.
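To make the data flow of the job easier to follow, here is a minimal plain-Python sketch of the same computation on three made-up records. It does not use Spark: the helper ‘reduce_by_key’ is a hypothetical stand-in that only mimics what ‘reduceByKey’ computes, and the (word, year, count) layout is assumed from the way the script indexes each line.

```python
from collections import defaultdict
from functools import reduce

# Made-up records in the layout the job assumes: (word, year, count).
records = [
    ("spark", 2000, 3),
    ("rdd", 2000, 1),
    ("hadoop", 2001, 2),
]

def reduce_by_key(pairs, func):
    """Combine the values of each key with func (hypothetical stand-in for reduceByKey)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# 'map' step: emit (year, characters) and (year, occurrences) pairs.
length_pairs = [(year, len(word) * count) for word, year, count in records]
count_pairs = [(year, count) for word, year, count in records]

# 'reduceByKey' step: sum the values for each year.
yearly_length = reduce_by_key(length_pairs, lambda a, b: a + b)
yearly_count = reduce_by_key(count_pairs, lambda a, b: a + b)

# 'join' step: pair up the two totals by year, then divide.
yearly_avg = {
    year: yearly_length[year] / yearly_count[year]
    for year in yearly_length.keys() & yearly_count.keys()
}

print(sorted(yearly_avg.items()))  # [(2000, 4.5), (2001, 6.0)]
```

For the year 2000, the total length is 5 * 3 + 3 * 1 = 18 characters over 4 occurrences, which gives the average of 4.5.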

The job can be submitted by running:

spark-submit \
 --master yarn-client \
 --queue <your_queue> \
 --num-executors 35 \
 --executor-memory 5g \
 --executor-cores 4 \
 job.py /var/ngrams ngrams-out


The output can then be viewed by running:

hdfs dfs -cat ngrams-out/*

 

The only required arguments in the job submission command above are ‘--master yarn-client’ and ‘--queue <your_queue>’. The values passed to the other arguments may be modified to get better performance or to conform to the limits of your queue.
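If you submit the same job repeatedly with different resource settings, a small wrapper can keep the required arguments fixed while the optional ones vary. The sketch below is only one possible approach: ‘build_spark_submit’ is a hypothetical helper, not part of Spark, and the values 10 and 2g are illustrative rather than recommended settings.

```python
def build_spark_submit(script, script_args, queue, **tuning):
    """Return a spark-submit argument list for a PySpark script.

    --master and --queue are always included; each tuning keyword,
    such as num_executors=35, becomes an option like --num-executors 35.
    """
    cmd = ["spark-submit", "--master", "yarn-client", "--queue", queue]
    for name, value in tuning.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd + [script] + list(script_args)

# Replace <your_queue> with the name of your own queue before running.
cmd = build_spark_submit(
    "job.py", ["/var/ngrams", "ngrams-out"], queue="<your_queue>",
    num_executors=10, executor_memory="2g",
)
print(" ".join(cmd))
```

The resulting list can be passed to Python's subprocess.run on a login node, or printed and pasted into a shell once the queue placeholder has been filled in.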

*Note: If you want to use Python 3.5 instead of our default 2.7 in your PySpark job, run the following commands and then submit your job normally using your Python 3.5 code:

export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0

export PYSPARK_PYTHON=/sw/lsa/centos7/python-anaconda3/created-20170424/bin/python