Spark and PySpark store and operate on data in a container called a Resilient Distributed Dataset (RDD). The most important characteristic of an RDD is that it is immutable: once created, the data it contains cannot be updated. Instead, new RDDs are created by transforming the data in an existing RDD, which is how analysis is done with Spark.
Using Spark’s native language, Scala, requires more setup than using PySpark. Some example Scala jobs, including the same example job in the PySpark documentation, can be found on this website. Each job contains a small amount of setup code required to run it on Spark, and all of the actual logic is in the ‘run’ function.
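The layout described above, a thin layer of setup around a ‘run’ function that holds the logic, might look roughly like the following sketch. This is not the code from the example repository: the object name AverageLineLength, its two command-line arguments, and the computation itself are hypothetical and chosen only to illustrate the pattern.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical job, not taken from the spark-examples repository.
object AverageLineLength {

  // All of the actual logic: the average length of the input lines.
  def run(lines: RDD[String]): Double =
    lines.map(_.length.toDouble).mean()

  // Setup only: create a context, call run, and write the result.
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AverageLineLength")
    val sc = new SparkContext(conf)
    try {
      val average = run(sc.textFile(args(0)))   // args(0): input path
      sc.parallelize(Seq(average.toString), 1)
        .saveAsTextFile(args(1))                // args(1): output directory
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the logic in a function that takes an RDD and returns a value means it can be exercised against a small in-memory RDD without going through the command-line setup.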
The spark-examples repository also contains a Gradle build file alongside the example code. Gradle is a popular build tool for Java and Scala. The code can be downloaded and built by logging on to flux-login and running:
git clone https://bitbucket.org/umarcts/spark-examples
cd spark-examples
./gradlew jar
The last command, “./gradlew jar”, will download all dependencies, compile the code, run tests, and package all of the code into a Java ARchive (JAR). This JAR is submitted to the cluster to run a job. For example, the AverageNGramLength job can be launched by running:
spark-submit \
    --class com.alectenharmsel.examples.spark.AverageNGramLength \
    --master yarn-client \
    --executor-memory 3g \
    --num-executors 5 \
    --queue <your_queue> \
    build/libs/spark-examples-*-all.jar /var/ngrams ngrams-out
The output will be written to a directory called ‘ngrams-out’ in your HDFS home directory, and can be viewed by running:
hdfs dfs -cat ngrams-out/*
The last few lines of output should look like this:
You can also run this command to just get those lines:
hdfs dfs -cat ngrams-out/* | tail -5
Similar to PySpark’s interactive Python shell, Spark has an interactive Scala shell for prototyping and debugging. The Spark shell can be launched by running:
spark-shell --master yarn-client --queue <your_queue>
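Once the shell has started, a SparkContext is available as sc. As a minimal sketch of how RDDs are transformed rather than updated, the following lines can be pasted into the shell; the variable names and the small in-memory data set are illustrative assumptions, not part of the example repository.

```scala
// Run inside spark-shell, where `sc` (the SparkContext) is already defined.

// Create an RDD from a small local collection.
val numbers = sc.parallelize(1 to 10)

// Transformations return new RDDs; `numbers` itself is never modified.
val squares = numbers.map(n => n * n)
val evenSquares = squares.filter(_ % 2 == 0)

// collect() brings the results back to the driver for printing.
println(numbers.collect().mkString(", "))      // 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
println(evenSquares.collect().mkString(", "))  // 4, 16, 36, 64, 100
```

Calling collect() is only appropriate for small results like these, since it copies the whole RDD into the driver's memory.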