Spark SQL is a way to query your data with a familiar SQL-like language while taking advantage of the speed of Spark, a fast, general-purpose engine for data processing that runs on Hadoop. I wanted to test this out on a dataset I found from Walmart with their stores' weekly sales numbers. I put the CSV into our cluster's HDFS (in /var/walmart), making it accessible to all Flux Hadoop users.
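Here's a minimal sketch of the kind of query this enables, run through PySpark. The path matches the HDFS location above, but the column names (Store, Weekly_Sales) are assumptions about the CSV's header, so adjust them to the actual schema:

```python
# Minimal Spark SQL sketch: total weekly sales per store.
# Assumes the CSV has a header row with Store and Weekly_Sales columns
# (hypothetical names; check the actual file).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("walmart-sales").getOrCreate()

# Read the CSV from the HDFS location mentioned above.
df = spark.read.csv("hdfs:///var/walmart", header=True, inferSchema=True)
df.createOrReplaceTempView("sales")

# Standard SQL over the registered view: top 10 stores by total sales.
spark.sql("""
    SELECT Store, SUM(Weekly_Sales) AS total_sales
    FROM sales
    GROUP BY Store
    ORDER BY total_sales DESC
""").show(10)
```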
Advanced Research Computing – Technology Services (ARC-TS) is pleased to announce an expanded data science computing platform, giving all U-M researchers new capabilities to host structured and unstructured databases, and to ingest, store, query and analyze large datasets.
The new platform features a flexible, robust and scalable database environment, and a set of data pipeline tools that can ingest and process large amounts of data from sensors, mobile devices and wearables, and other sources of streaming data. The platform leverages the advanced virtualization capabilities of ARC-TS’s Yottabyte Research Cloud (YBRC) infrastructure, and is supported by U-M’s Data Science Initiative launched in 2015. YBRC was created through a partnership between Yottabyte and ARC-TS announced last fall.
The following functionalities are immediately available:
- Structured databases: MySQL/MariaDB and PostgreSQL.
- Unstructured databases: Cassandra, MongoDB, InfluxDB, and ElasticSearch, along with Grafana for visualization.
- Data ingestion: Redis, Kafka, RabbitMQ.
- Data processing: Apache Flink, Apache Storm, Node.js and Apache NiFi.
Other types of databases can be created upon request.
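To illustrate the ingestion side, here is a minimal sketch of publishing a sensor reading to Kafka with the kafka-python client; the broker address and topic name are placeholders, not actual YBRC endpoints:

```python
# Hedged sketch: send one JSON-encoded sensor reading to a Kafka topic.
# Broker address and topic name are placeholders; actual connection
# details would come with your provisioned YBRC ingestion service.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example.edu:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("sensor-readings", {"sensor_id": 42, "temp_c": 21.7})
producer.flush()  # block until the message is delivered
```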
These tools are offered to all researchers at the University of Michigan free of charge, provided that certain usage limits are not exceeded. Large-scale users who outgrow the no-cost allotment may purchase additional YBRC resources. All interested parties should contact firstname.lastname@example.org.
At this time, the YBRC platform only accepts unrestricted data. The platform is expected to accommodate restricted data within the next few months.
ARC-TS also operates a separate data science computing cluster available for researchers using the latest Hadoop components. This cluster also will be expanded in the near future.
XSEDE Allocations award eligible users access to compute, visualization, and/or storage resources as well as extended support services.
XSEDE offers several types of allocations, from short-term exploratory requests to year-long projects. You must have an allocation to access XSEDE resources. Submit your allocation requests via the XSEDE Resource Allocation System (XRAS) in the XSEDE User Portal.
ARC-TS consultants can help researchers navigate the XSEDE resources and allocation process. Contact them at email@example.com.
Advanced Research Computing – Technology Services (ARC-TS) will be deploying a new research storage service aimed at serving faculty in the Big Data era. This service, called Locker, complements our existing general-purpose storage service, Turbo, and our planned archive service, Data Den.
Locker is cost-optimized storage for large files and is not suited to general-purpose or small-file use.
Faculty can now buy in at a one-time cost to bootstrap the service. Faculty interested in this option ahead of the general service will need to commit to at least 200 TB un-replicated or 100 TB replicated of space, at a one-time cost of $175/TB un-replicated or $350/TB replicated, covering 5 years. This works out to a minimum purchase of $35,000, with no further costs for 5 years.
If you are interested, please contact ARC-TS by July 10, 2017, at firstname.lastname@example.org.
Q: When will Locker be ready if I contribute funds to its launch?
A: Locker aims to be on site and ready for data by Fall semester (2017).
Q: When will Locker be ready as a monthly service?
A: The current timeline aims for early November 2017.
Q: What if I need less than 100TB replicated or 200TB un-replicated?
A: After the early period, smaller allocations will be available. Contact us at email@example.com to discuss your needs.
Q: Can I keep data beyond 5 years?
A: Yes. Options will exist beyond the first 5 years with new ongoing costs for support of the system.
Q: What is a large file for Locker?
A: Locker is optimized for files of roughly 1 MB or larger; datasets with typical file sizes below that are better served by general-purpose storage such as Turbo.
Q: With what methods does one access Locker?
A: Locker will support NFS (v3/v4) and CIFS/SMB to workstations, servers, and clusters.
Q: Can I use Locker with Sensitive Data such as HIPAA/PHI?
A: Locker comes with encryption at rest and will eventually support HIPAA/PHI data and more. It will NOT support sensitive data during the early user period. Sensitive data clearance work will start once the system is in place, and should be ready 2-3 months later.
Q: Can I pay for Locker monthly rather than up front?
A: Locker will eventually be a monthly service similar to Turbo, but during the early period we are looking for faculty to commit to a minimum amount of storage at a one-time cost (hardware only) to bootstrap the service and keep future prices low.
Q: Can I add more capacity?
A: Yes, you can request more capacity at any time. Because of the design, larger requests will require a few weeks' lead time. To keep costs low, Locker does not keep significant extra capacity idle, but it can grow at any time to sizes in the tens of PB.
Q: What optional features exist?
A: The features are:
- Optional Geographic Replication
- Optional Snapshots
Q: Can I use this for clinical care / enterprise use cases?
A: No. Locker has a 9:00 a.m. – 5:00 p.m. support window and is not architected for enterprise availability. We recommend MiStorage for enterprise use cases, or comparable services from HITS for clinical care.
Q: Does Locker include backups?
A: Locker does not include backups. It does include optional geographic replication and snapshots, which provide some protection against user deletion and major disaster but do not protect against software or administrator error the same way backups do. For backups we recommend MiBackup.
Q: Who should use Locker?
A: Researchers whose datasets' typical file size exceeds 1 MB can use Locker to store their data more cost-efficiently than other options at the University.
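If you are unsure whether your data fits that profile, a quick standard-library Python check like the one below reports the number of files and their median size; the directory path is a placeholder:

```python
# Estimate a dataset's typical file size to gauge Locker suitability.
# The directory path is a placeholder; point it at your own data.
import statistics
from pathlib import Path

sizes = [p.stat().st_size for p in Path("/path/to/dataset").rglob("*") if p.is_file()]
if sizes:
    print(f"{len(sizes)} files, median size {statistics.median(sizes) / 1e6:.2f} MB")
```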
Q: Why should I contribute to the launch of Locker?
A: Locker aims to provide a cost-effective solution for big-data storage. To do this, a minimum amount of space needs to be allocated. By contributing, you secure the option of low-cost storage for your research going forward.
The University of Michigan is beginning the process of building our next generation HPC platform, “Big House.” Flux, the shared HPC cluster, has reached the end of its useful life. Flux has served us well for more than five years, but as we move forward with replacement, we want to make sure we’re meeting the needs of the research community.
ARC-TS will be holding a series of town halls to take input from faculty and researchers on the next HPC platform to be built by the University. These town halls are open to anyone and will be held at:
College of Engineering, Johnson Room, Tuesday, June 20th, 9:00a – 10:00a
NCRC Bldg 300, Room 376, Wednesday, June 21st, 11:00a – 12:00p
LSA #2001, Tuesday, June 27th, 10:00a – 11:00a
3114 Med Sci I, Wednesday, June 28th, 2:00p – 3:00p
Your input will help to ensure that U-M is on course for providing HPC, so we hope you will make time to attend one of these sessions. If you cannot attend, please email firstname.lastname@example.org with any input you want to share.
A series of training workshops in high performance computing will be held May 15, May 17 and May 24, 2017, presented by CSCAR in conjunction with Advanced Research Computing – Technology Services (ARC-TS). All sessions are held at East Hall, Room B254, 530 Church St.
Introduction to the Linux Command Line
This course will familiarize students with the basics of accessing and interacting with Linux computers through the GNU/Linux operating system's Bash shell, also known as the "command line."
• Monday, May 15, 9 a.m. – noon.
Introduction to the Flux cluster and batch computing
This workshop will provide a brief overview of the components of the Flux cluster, including the resource manager and scheduler, and will offer students hands-on experience.
• Wednesday, May 17, 1 – 4:30 p.m.
Advanced batch computing on the Flux cluster
This course will cover advanced areas of cluster computing on the Flux cluster, including common parallel programming models, dependent and array scheduling, and a brief introduction to scientific computing with Python, among other topics.
• Wednesday, May 24, 1 – 5 p.m.
NOTE: Additional workshops may be scheduled if demand warrants. Please sign up for the waiting list if the workshops are full, and you will be given first priority for any additional sessions.
The Center for Human Growth and Development (CHGD) held a workshop on functional near-infrared spectroscopy (fNIRS), a form of neuroimaging, with a special focus on pediatric applications. The workshop was sponsored by units at U-M, as well as units from Eastern Michigan University and Gallaudet University. It was attended by 50 people from as far away as Texas, and included research talks, instructional sessions, and hands-on experience with fNIRS data processing. The workshop was the first of its kind at U-M.
CHGD, ARC-TS, and LSA IT staff collaborated to provide a remote neuroimaging computing environment, including a graphical interface, access to Matlab, and a suite of fNIRS software, which participants accessed from their laptops during the workshop. The attendees rated the practice exercises done via the computing environment one of the most important components of the workshop.
ARC-TS was pleased to be able to contribute to training in computational tools needed for emerging methods in neuroimaging. For more information about the fNIRS workshop, please see http://chgd.umich.edu/facilities-resources/developmental-neuroscience-laboratories/fnirs/fnirs-workshop/
The Institute for Healthcare Policy and Innovation (IHPI) is partnering with Advanced Research Computing (ARC) to bring two commercial claims datasets to campus researchers.
The OptumInsight and Truven MarketScan datasets contain nearly complete insurance claims and other health data on tens of millions of people representing the US private insurance population. Within each dataset, records can be linked longitudinally for over 5 years.
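As a rough illustration of that longitudinal structure, the pandas sketch below measures how long each enrollee is observed in a claims extract; the file name and column names (enrollee_id, service_date) are hypothetical and do not reflect the actual OptumInsight or MarketScan schemas:

```python
# Hypothetical sketch of longitudinal linkage in a claims extract.
# Column and file names are illustrative, not the real dataset schema.
import pandas as pd

claims = pd.read_csv("claims_extract.csv", parse_dates=["service_date"])

# Sort each enrollee's claims chronologically, then measure the span
# of time over which that person appears in the data.
claims = claims.sort_values(["enrollee_id", "service_date"])
span = claims.groupby("enrollee_id")["service_date"].agg(["min", "max"])
span["years_observed"] = (span["max"] - span["min"]).dt.days / 365.25

print(span["years_observed"].describe())
```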
To begin working with the data, researchers should submit a brief analysis plan for review by IHPI staff, who will create extracts or grant access to primary data as appropriate.
CSCAR consultants are available to provide guidance on computational and analytic methods for a variety of research aims, including use of Flux and other U-M computing infrastructure for working with these large and complex repositories.
The data acquisition and availability was funded by IHPI and the U-M Data Science Initiative.