The cluster provides 112 TB of total usable disk space, 40GbE inter-node networking, Hadoop version 2.6.0, and several additional data science tools.
Aside from Hadoop and its Distributed File System, the ARC-TS data science service includes:
- Pig, a high-level language that enables substantial parallelization, allowing the analysis of very large data sets.
- Hive, data warehouse software that facilitates querying and managing large datasets residing in distributed storage using a SQL-like language called HiveQL.
- Sqoop, a tool for transferring data between SQL databases and the Hadoop Distributed File System.
- Rmr, an extension of the R statistical language that supports distributed processing of large datasets stored in the Hadoop Distributed File System.
- Spark, a general-purpose processing engine compatible with Hadoop data.
- mrjob, a Python library that allows MapReduce jobs written in Python to run on Hadoop.
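
To make the MapReduce model behind several of these tools concrete, here is a minimal word-count sketch in plain Python. It does not use mrjob or Hadoop. The names `mapper`, `reducer`, and `run_local` are hypothetical helpers chosen for illustration, and the shuffle step is simulated in memory on a single machine. A real job on the cluster would be written against the tool's own API and read its input from the Hadoop Distributed File System.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a run of letters or apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce phase: sum all counts collected for a single word."""
    return word, sum(counts)


def run_local(lines):
    """Simulate map, shuffle, and reduce in memory (no Hadoop involved)."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Spark reads Hadoop data"]
    print(run_local(sample))
```

On a cluster, the map and reduce steps run in parallel across nodes and the framework handles the grouping by key, so only the two small functions need to be written by the user.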
The software versions are as follows: