The Data Science Platform is an upgraded Hadoop cluster currently available as a technology preview with no associated charges to U-M researchers. The ARC-TS Hadoop cluster is an on-campus resource that provides a different service level than most cloud-based Hadoop offerings, including:

  • high-bandwidth data transfer to and from other campus data storage locations, with no data transfer costs.
  • very high-speed inter-node connections using 40Gb/s Ethernet.

The cluster provides 112TB of total usable disk space, 40GbE inter-node networking, Hadoop version 2.6.0, and several additional data science tools.

Aside from Hadoop and its Distributed File System, the ARC-TS data science service includes:

  • Pig, a high-level language that enables substantial parallelization, allowing the analysis of very large data sets.
  • Hive, data warehouse software that facilitates querying and managing large datasets residing in distributed storage using a SQL-like language called HiveQL.
  • Sqoop, a tool for transferring data between SQL databases and the Hadoop Distributed File System.
  • Rmr, an extension of the R Statistical Language to support distributed processing of large datasets stored in the Hadoop Distributed File System.
  • Spark, a general-purpose processing engine compatible with Hadoop data.
  • mrjob, a Python library that allows MapReduce jobs written in Python to run on Hadoop.
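
For readers new to the MapReduce model that Hadoop implements and that tools such as mrjob expose in Python, the following is a minimal sketch of a word count written in plain Python. It is an illustration only: it uses no Hadoop or mrjob calls, it runs in a single local process, and the function names (map_words, shuffle, reduce_counts, word_count) are hypothetical. On the cluster, the framework would distribute the map and reduce steps across nodes and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: combine all values for one key into a single total."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in-process on an iterable of text lines."""
    mapped = chain.from_iterable(map_words(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input; real jobs would read from the Hadoop Distributed File System.
    sample = ["big data on Hadoop", "Hadoop stores big files"]
    print(word_count(sample))
```

The map and reduce functions are kept free of shared state, which is the property that lets a framework run many copies of them in parallel on different parts of the data.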

The software versions are as follows:

Software       Version
Hadoop         2.6.0
Hive           1.1.0
Sqoop          1.4.6
Pig            0.12.0
R/rhdfs/rmr    3.0.3
Spark          1.6.0
mrjob          0.4.3-dev, commit 226a741548cf125ecfb549b7c50d52cda932d045

Order Service

Using the Flux Hadoop environment requires a Flux user account (available at no cost), but currently does not require a Flux allocation.

To order:

Email hpc-support@umich.edu.

For more information: data-science-support@umich.edu.
