Overview

Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. (From hadoop.apache.org)

Flux Hadoop is a technology preview and available at no cost. It may have less technical support than the other Flux services. The Flux Hadoop cluster consists of 12 nodes offering 100TB of HDFS space. The software available includes Hadoop, Hive, Pig, Spark, and PySpark; each is described in the “Available software” section below.

Using Flux Hadoop requires a Flux user account but does not require an active Flux allocation. Information on getting a Flux account can be found on the Flux User Guide page. You only need to complete steps 1, 2, and 4 in order to use Flux Hadoop. Please note that in step 4, rather than logging into flux-login.arc-ts.umich.edu, you need to log into flux-hadoop-login.arc-ts.umich.edu. When applying for a Flux account, mention that you would like to use Hadoop and the queue name you would like. To be placed on an already existing queue, the owner/PI of the queue needs to send an email to hpc-support@umich.edu. This video will help teach you some basic Linux navigation commands if needed.

Hadoop consists of two components: HDFS, a filesystem built for high read speeds, and YARN, a resource manager. HDFS is not a POSIX filesystem, so normal command line tools like “cp” and “mv” will not work on it. Most of the common tools have been reimplemented for HDFS and can be run using the “hdfs dfs” command. All data must be in HDFS for jobs to be able to read it.

Here are a few basic commands:

# List the contents of your HDFS home directory
hdfs dfs -ls

# Copy local file data.csv to your HDFS home directory
hdfs dfs -put data.csv data.csv

# Copy HDFS file data.csv back to your local home directory
hdfs dfs -get data.csv data2.csv

A complete reference of HDFS commands can be found on the Apache website.

Available software

There are multiple tools available (listed at the top) to perform analysis. Hive transparently transforms SQL queries into jobs that run on the Hadoop cluster, allowing researchers already familiar with SQL to keep using it as their data grows beyond the capacity of a traditional database such as MySQL or PostgreSQL. Pig is similar to Hive, except that its language, Pig Latin, is slightly more complex than SQL. In exchange, Pig Latin can express long or complicated transformations and queries that are difficult or impossible to write in SQL.
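
As a rough sketch of what this looks like in practice, a Hive query or a Pig script can be launched from the command line. The table name “ngrams”, the script name “script.pig”, and the use of the mapreduce.job.queuename property are assumptions here; the exact launch options for your cluster may differ.

# Run a short HiveQL query against a (hypothetical) table named ngrams,
# sending the resulting job to your queue
hive --hiveconf mapreduce.job.queuename=<your_queue> \
    -e 'SELECT year, SUM(`count`) FROM ngrams WHERE word = "science" GROUP BY year'

# Run a Pig Latin script saved in the file script.pig
pig -Dmapreduce.job.queuename=<your_queue> script.pig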

Hadoop map-reduce jobs may also be written directly, in Java or, using Hadoop Streaming, in nearly any other programming language. Writing a map-reduce job directly gives more control than using Hive or Pig, but requires knowing or learning a programming language. Working with unstructured data generally requires writing a map-reduce job instead of using Hive or Pig, as those tools can’t work with unstructured data.
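
A sketch of a Streaming submission is shown below. The path to the Hadoop Streaming jar varies by installation, and mapper.py and reducer.py are hypothetical scripts you would write yourself (they must be executable, read records from standard input, and write key/value pairs to standard output).

# Submit a Streaming job that runs mapper.py and reducer.py over the
# HDFS directory "ngrams", writing results to "ngrams-output"
yarn jar /path/to/hadoop-streaming.jar \
    -Dmapreduce.job.queuename=<your_queue> \
    -files mapper.py,reducer.py \
    -input ngrams \
    -output ngrams-output \
    -mapper mapper.py \
    -reducer reducer.py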

Spark and PySpark are separate products from Hadoop, even though Spark jobs run on the Hadoop cluster and operate on files stored in HDFS. Spark exposes a single data structure, the Resilient Distributed Dataset (RDD), in the Java, Scala, and Python programming languages. RDDs support multiple operations (map, reduce, filter, etc.), and are held in memory when possible to speed up the analysis. Spark is a good choice over a native Hadoop job or a Streaming job, as Spark contains a number of convenient built-in functions that Hadoop does not have.
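
As a sketch (my_analysis.py is a placeholder for a script you would write, and your cluster may require additional options), PySpark can be used either interactively or by submitting a standalone script to YARN:

# Start an interactive PySpark shell on the cluster
pyspark --master yarn --queue <your_queue>

# Submit a standalone PySpark script to the cluster
spark-submit --master yarn --queue <your_queue> my_analysis.py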

About This User Guide

There are a few things that are important to note about this user guide before you begin:

  • Make sure that you always replace “<your_queue>” with whatever the name of your queue is.
  • A backslash (‘\’) at the end of a line indicates that a single long command has been broken across multiple lines for readability.
  • These examples all use the Google Ngrams Dataset. You may want to familiarize yourself with it in order to better understand the kind of data you are running these jobs on (a quick way to peek at a few rows is shown after this list). Essentially, its schema is:
    1. word – the word of interest
    2. year – the year of the data
    3. count – the number of times the word appeared in books in that year
    4. volumes – the number of volumes the word appeared in for that year
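
For a quick look at the raw data, you can print a few rows directly from HDFS. The path “ngrams” is an assumption; substitute wherever you have copied the dataset.

# Print the first few tab-separated rows of the dataset
hdfs dfs -cat ngrams/* | head -5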