SparkSQL lets people query their data in a SQL-like language while taking advantage of the speed of Spark, a fast, general-purpose data processing engine that runs on top of Hadoop. I wanted to test this out on a dataset I found from Walmart containing their stores’ weekly sales numbers. I put the CSV into our cluster’s HDFS (in /var/walmart), making it accessible to all Flux Hadoop users.
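As a rough illustration of what such a query might look like, here is a minimal PySpark sketch. It assumes the CSV under /var/walmart has a header row with `Store` and `Weekly_Sales` columns; those column names, the `walmart_sales` view name, and both helper functions are assumptions made for illustration, not details from the original test.

```python
def build_sales_query(table="walmart_sales", limit=10):
    """Build a SQL query that totals weekly sales per store.

    The column names Store and Weekly_Sales are assumed, not confirmed.
    """
    return (
        "SELECT Store, SUM(Weekly_Sales) AS total_sales "
        f"FROM {table} "
        "GROUP BY Store "
        "ORDER BY total_sales DESC "
        f"LIMIT {int(limit)}"
    )


def top_stores_by_sales(hdfs_path="/var/walmart", limit=10):
    """Run the query with Spark SQL against the CSV data in HDFS."""
    # Imported here so build_sales_query works without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("walmart-sales-sketch").getOrCreate()

    # Read every CSV file under the directory, inferring column types.
    sales = spark.read.csv(hdfs_path, header=True, inferSchema=True)

    # Register the DataFrame under a name that SQL statements can reference.
    sales.createOrReplaceTempView("walmart_sales")

    return spark.sql(build_sales_query("walmart_sales", limit)).collect()
```

On a cluster node with PySpark available, calling `top_stores_by_sales()` would return a list of rows, one per store, ordered by total sales.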
Researchers interested in using the Android platform for app development may consult with CSCAR about their work, free of charge.
CSCAR consultants with industry experience as Android developers can provide guidance on the capabilities and limitations of Android apps, timelines for app implementation, 3D interaction, game engines, user interface design, and security.
Please contact firstname.lastname@example.org for more information.
Five research teams from the University of Michigan and Shanghai Jiao Tong University in China are sharing $1 million to study data science and its impact on air quality, galaxy clusters, lightweight metals, financial trading and renewable energy.
Since 2009, the two universities have collaborated on a number of research projects that address challenges and opportunities in energy, biomedicine, nanotechnology and data science.
In the latest round of annual grants, the winning projects focus on data science and how it can be applied to chemistry and physics of the universe, as well as finance and economics.
For more, read the University Record article.
For descriptions of the research projects, see the MIDAS/SJTU partnership page.
Please join us for the 2017 Michigan Institute for Data Science Symposium.
The keynote speaker will be Cathy O’Neil, mathematician and best-selling author of “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy.”
Other speakers include:
- Nadya Bliss, Director of the Global Security Initiative, Arizona State University
- Francesca Dominici, Co-Director of the Data Science Initiative and Professor of Biostatistics, Harvard T.H. Chan School of Public Health
- Daniela Witten, Associate Professor of Statistics and Biostatistics, University of Washington
- James Pennebaker, Professor of Psychology, University of Texas
More details, including how to register, will be available soon.
Advanced Research Computing – Technology Services (ARC-TS) is pleased to announce an expanded data science computing platform, giving all U-M researchers new capabilities to host structured and unstructured databases, and to ingest, store, query and analyze large datasets.
The new platform features a flexible, robust and scalable database environment, and a set of data pipeline tools that can ingest and process large amounts of data from sensors, mobile devices and wearables, and other sources of streaming data. The platform leverages the advanced virtualization capabilities of ARC-TS’s Yottabyte Research Cloud (YBRC) infrastructure, and is supported by U-M’s Data Science Initiative launched in 2015. YBRC was created through a partnership between Yottabyte and ARC-TS announced last fall.
The following functionalities are immediately available:
- Structured databases: MySQL/MariaDB, and PostgreSQL.
- Unstructured databases: Cassandra, MongoDB, InfluxDB, Grafana, and ElasticSearch.
- Data ingestion: Redis, Kafka, RabbitMQ.
- Data processing: Apache Flink, Apache Storm, Node.js and Apache NiFi.
Other types of databases can be created upon request.
These tools are offered to all researchers at the University of Michigan free of charge, provided that certain usage limits are not exceeded. Large-scale users who outgrow the no-cost allotment may purchase additional YBRC resources. All interested parties should contact email@example.com.
At this time, the YBRC platform only accepts unrestricted data. The platform is expected to accommodate restricted data within the next few months.
ARC-TS also operates a separate data science computing cluster available for researchers using the latest Hadoop components. This cluster also will be expanded in the near future.
XSEDE Allocations award eligible users access to compute, visualization, and/or storage resources as well as extended support services.
XSEDE offers several types of allocations, from short-term exploratory requests to year-long projects. To access XSEDE resources, you must have an allocation. Submit your allocation requests via the XSEDE Resource Allocation System (XRAS) in the XSEDE User Portal.
ARC-TS consultants can help researchers navigate the XSEDE resources and process. Contact them at firstname.lastname@example.org.
The Big Data in Transportation and Mobility symposium, held June 22-23, 2017, in Ann Arbor, MI, brought together more than 150 data science practitioners from academia, industry and government to explore emerging issues in this expanding field.
Sponsored by the NSF-supported Midwest Big Data Hub (MBDH) and the Michigan Institute for Data Science (MIDAS), the symposium featured lightning talks from transportation research programs around the Midwest; tutorials and breakout sessions on specific issues and methods; a poster session; and a keynote address from two representatives of the Smart Columbus project: Chris Stewart, Ohio State University Associate Professor of Computer Science and Engineering, and Shoreh Elhami, GIS Manager for the city of Columbus.
Speakers and attendees came from organizations across the Midwest, including the University of Michigan, University of Illinois, University of Nebraska, University of North Dakota, North Dakota State University, Ohio State University, Purdue University, Denso International America, Fiat Chrysler, Ford Motor Company, General Motors, IAV Automotive Engineering and Yottabyte.
“This was an extremely valuable opportunity to share information and ideas,” said Carol Flannagan, one of the organizers of the symposium and a researcher at MIDAS and the U-M Transportation Research Institute. “Cross-discipline and cross-institutional collaboration is crucial to the success of Big Data applications, and we took a significant step forward in that vein during this symposium.”
Topics addressed in talks, breakouts, and tutorials included:
- New Analytic Tools for Designing and Managing Transportation Systems
- New Mobility Options for Small and Mid-sized Cities in the Midwest
- Automated and Connected Vehicles
- Transforming Transportation Operations using High Performance Computing
- On-Demand Transit
- Using Big Data for Monitoring Bridges
At the closing session, participants outlined some areas that could be fruitful to focus on going forward, including increasing data-science literacy in the general public; diversity and workforce development in data science; public data-sharing platforms and partners; and privacy issues.
MICDE is pleased to announce the recipients of the 2017-2018 MICDE Fellowships for students enrolled in the PhD in Scientific Computing or the Graduate Certificate in Computational Discovery and Engineering. We had 91 applicants from 25 departments representing 6 schools and colleges. Due to the extraordinary number of high-quality applications, we increased the number of fellowships from 15 to 20. See our Fellowship page for more information.
Diksha Dhawan, Chemistry
Negar Farzaneh, Computational Medicine & Bioinformatics
Kritika Iyer, Biomedical Engineering
Tibin John, Neuroscience
Bikash Kanungo, Mechanical Engineering
Yu-Han Kao, Epidemiology
Steven Kiyabu, Mechanical Engineering
Christiana Mavroyiakoumou, Mathematics
Ehsan Mirzakhalili, Mechanical Engineering
Colten Peterson, Climate and Space Sciences & Engineering
James Proctor, Chemical Engineering
Evan Rogers, Biomedical Engineering
Longxiu Tian, S. Ross School of Business
Jipu Wang, Nuclear Engineering and Radiological Sciences
Yanming Wang, Chemistry
Zhenlin Wang, Mechanical Engineering
Alicia Welden, Chemistry
Anna White, Industrial & Operations Engineering
Chia-Nan Yeh, Physics
Yiling Zhang, Industrial & Operations Engineering
Geunyeong Byeon, Industrial & Operations Engineering
Ayoub Gouasmi, Aerospace Engineering
Joseph Kleinhenz, Physics
Jia Li, Physics
Changjiang Liu, Biophysics
Vo Nguyen, Computational Medicine & Bioinformatics
Everardo Olide, Applied Physics
Qiyun Pan, Industrial & Operations Engineering
Pengchuan Wang, Civil & Environmental Engineering
Xinzhu Wei, Ecology & Evolutionary Biology
Advanced Research Computing – Technology Services (ARC-TS) will be deploying a new research storage service aimed at serving faculty in the Big Data era. This service, called Locker, complements our existing Turbo general-purpose storage service and our planned archive service, Data Den.
Locker is cost-optimized storage for large files and is not well suited to general-purpose or small-file use.
Faculty can now buy in at a one-time cost to bootstrap the service. Faculty interested in this option ahead of the general service will need to commit to at least 200TB un-replicated or 100TB replicated, at a one-time cost of $175/TB un-replicated or $350/TB replicated, covering 5 years. This amounts to a minimum purchase of $35,000, with no further costs for 5 years.
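The buy-in arithmetic above can be checked with a short sketch. The prices and minimum commitments come from this announcement; the function names and the per-year breakdown are illustrative assumptions, not an official ARC-TS calculator.

```python
# One-time price per TB and minimum commitment, as stated in the announcement.
PRICE_PER_TB = {"unreplicated": 175, "replicated": 350}
MIN_TB = {"unreplicated": 200, "replicated": 100}
TERM_YEARS = 5


def early_buy_in_cost(tb, replicated=False):
    """Return the one-time cost in dollars for an early commitment of `tb` terabytes."""
    kind = "replicated" if replicated else "unreplicated"
    if tb < MIN_TB[kind]:
        raise ValueError(
            f"early buy-in requires at least {MIN_TB[kind]}TB {kind}"
        )
    return tb * PRICE_PER_TB[kind]


def cost_per_tb_year(replicated=False):
    """Spread the one-time price evenly over the 5-year term (illustrative only)."""
    kind = "replicated" if replicated else "unreplicated"
    return PRICE_PER_TB[kind] / TERM_YEARS


print(early_buy_in_cost(200))                   # minimum un-replicated commitment
print(early_buy_in_cost(100, replicated=True))  # minimum replicated commitment
print(cost_per_tb_year())                       # un-replicated price per TB per year
```

Both minimum commitments come to the same $35,000, since replication halves the usable capacity while doubling the price per TB.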
If you are interested, please contact ARC-TS by July 10, 2017, at email@example.com.
Q: When will Locker be ready if I contribute funds to its launch?
A: Locker aims to be on site and ready for data by Fall semester (2017).
Q: When will Locker be ready as a monthly service?
A: The current timeline aims for early November 2017.
Q: What if I need less than 100TB replicated or 200TB un-replicated?
A: After the early period, smaller allocations will be available. Contact us at firstname.lastname@example.org to discuss your needs.
Q: Can I keep data beyond 5 years?
A: Yes. Options will exist beyond the first 5 years with new ongoing costs for support of the system.
Q: What is a large file for Locker?
Q: With what methods does one access Locker?
A: Locker will support NFS (v3/v4) and CIFS/SMB to workstations, servers, and clusters.
Q: Can I use Locker with Sensitive Data such as HIPAA/PHI?
A: Locker comes with encryption at rest and will eventually support HIPAA/PHI data and more. It will NOT support sensitive data during the early user period. Sensitive data clearance work will start once the system is in place, and should be ready 2-3 months later.
Q: Can I pay for Locker monthly rather than up front?
A: Locker will eventually be a monthly service similar to Turbo, but during the early period we are looking for faculty to commit to a minimum amount of storage at a one-time cost (hardware only) to bootstrap the service and keep future prices low.
Q: Can I add more capacity?
A: Yes, you can request more capacity at any time. Because of the design, larger requests will require a few weeks’ lead time. To keep costs low, Locker does not keep significant extra capacity idle, but it can grow at any time to tens of petabytes.
Q: What optional features exist?
A: The features are:
- Optional Geographic Replication
- Optional Snapshots
Q: Can I use this for clinical care / enterprise use cases?
A: No. Locker has a 9:00 a.m. – 5:00 p.m. support window and is not architected for enterprise availability. We recommend using MiStorage for enterprise or comparable services in HITS for clinical care.
Q: Does Locker include backups?
A: Locker does not include backups. It does include optional geographic replication and snapshots, which provide some protection against user deletion and major disaster but do not protect against software or administrator error the same way backups do. For backups we recommend MiBackup.
Q: Who should use Locker?
A: Researchers whose datasets have a typical file size exceeding 1 MB can use Locker to store their data more cost-efficiently than with other options at the University.
Q: Why should I contribute to the launch of Locker?
A: Locker aims to provide a cost-effective solution for big data storage. To do this, a minimum amount of space needs to be allocated. By contributing, you secure the option of low-cost storage for research going forward.
The University of Michigan is beginning the process of building our next-generation HPC platform, “Big House.” Flux, the shared HPC cluster, has reached the end of its useful life. Flux has served us well for more than five years, but as we move forward with its replacement, we want to make sure we’re meeting the needs of the research community.
ARC-TS will be holding a series of town halls to take input from faculty and researchers on the next HPC platform to be built by the University. These town halls are open to anyone and will be held at:
College of Engineering, Johnson Room, Tuesday, June 20th, 9:00a – 10:00a
NCRC Bldg 300, Room 376, Wednesday, June 21st, 11:00a – 12:00p
LSA #2001, Tuesday, June 27th, 10:00a – 11:00a
3114 Med Sci I, Wednesday, June 28th, 2:00p – 3:00p
Your input will help to ensure that U-M is on course for providing HPC, so we hope you will make time to attend one of these sessions. If you cannot attend, please email email@example.com with any input you want to share.