
U-M selects Dell EMC, Mellanox and DDN to Supply New “Great Lakes” Computing Cluster

By | Flux, General Interest, Happenings, HPC, News

The University of Michigan has selected Dell EMC as lead vendor to supply its new $4.8 million Great Lakes computing cluster, which will serve researchers across campus. Mellanox Technologies will provide networking solutions, and DDN will supply storage hardware.

Great Lakes will be available to the campus community in the first half of 2019, and over time will replace the Flux supercomputer, which serves more than 2,500 active users at U-M for research ranging from aerospace engineering simulations and molecular dynamics modeling to genomics and cell biology to machine learning and artificial intelligence.

Great Lakes will be the first cluster in the world to use the Mellanox HDR 200 gigabit per second InfiniBand networking solution, enabling faster data transfer speeds and increased application performance.

“High-performance research computing is a critical component of the rich computing ecosystem that supports the university’s core mission,” said Ravi Pendse, U-M’s vice president for information technology and chief information officer. “With Great Lakes, researchers in emerging fields like machine learning and precision health will have access to a higher level of computational power. We’re thrilled to be working with Dell EMC, Mellanox, and DDN; the end result will be improved performance, flexibility, and reliability for U-M researchers.”

“Dell EMC is thrilled to collaborate with the University of Michigan and our technology partners to bring this innovative and powerful system to such a strong community of researchers,” said Thierry Pellegrino, vice president, Dell EMC High Performance Computing. “This Great Lakes cluster will offer an exceptional boost in performance, throughput and response to reduce the time needed for U-M researchers to make the next big discovery in a range of disciplines from artificial intelligence to genomics and bioscience.”

The main components of the new cluster are:

  • Dell EMC PowerEdge C6420 compute nodes, PowerEdge R640 high memory nodes, and PowerEdge R740 GPU nodes
  • Mellanox HDR 200Gb/s InfiniBand ConnectX-6 adapters, Quantum switches and LinkX cables, and InfiniBand gateway platforms
  • DDN GRIDScaler® 14KX® and 100 TB of usable IME® (Infinite Memory Engine) memory

“HDR 200G InfiniBand provides the highest data speed and smart In-Network Computing acceleration engines, delivering HPC and AI applications with the best performance, scalability and efficiency,” said Gilad Shainer, vice president of marketing at Mellanox Technologies. “We are excited to collaborate with the University of Michigan, Dell EMC and DataDirect Networks, in building a leading HDR 200G InfiniBand-based supercomputer, serving the growing demands of U-M researchers.”

“DDN has a long history of working with Dell EMC and Mellanox to deliver optimized solutions for our customers. We are happy to be a part of the new Great Lakes cluster, supporting its mission of advanced research and computing. Partnering with forward-looking thought leaders such as these is always enlightening and enriching,” said Dr. James Coomer, SVP Product Marketing and Benchmarks at DDN.

Great Lakes will provide a significant improvement in computing performance over Flux. For example, each compute node will have more cores, higher maximum clock speeds, and more memory. The cluster will also have improved internet connectivity and file system performance, as well as NVIDIA GPUs with Tensor Cores, which offer far higher performance for machine learning than prior generations of GPUs.

“Users of Great Lakes will have access to more cores, faster cores, faster memory, faster storage, and a more balanced network,” said Brock Palen, Director of Advanced Research Computing – Technology Services (ARC-TS).

The Flux cluster was created approximately 8 years ago, although many of the individual nodes have been added since then. Great Lakes represents an architectural overhaul that will result in better performance and efficiency. Based on extensive input from faculty and other stakeholders across campus, the new Great Lakes cluster will be designed to deliver similar services and capabilities as Flux, including the ability to accommodate faculty purchases of hardware, access to GPUs and large-memory nodes, and improved support for emerging uses such as machine learning and genomics.

ARC-TS will operate and maintain the cluster once it is built. Allocations of computing resources through ARC-TS include access to hundreds of software titles, as well as support and consulting from professional staff with decades of combined experience in research computing.

Updates on the progress of Great Lakes will be available at https://arc-ts.umich.edu/greatlakes/.

Cluster and storage maintenance set for Aug. 5-9

By | Flux, General Interest, Happenings, HPC, News

To accommodate updates to software, hardware, and operating systems, Flux, Armis, ConFlux, Flux Hadoop, and their storage systems (/home and /scratch) will be unavailable starting at 9 a.m. Sunday, August 5th and returning to service on Thursday, August 9th. These updates will improve the performance and stability of ARC-TS services.  We try to encapsulate the required changes into two maintenance periods per year and work to complete these tasks quickly, as we understand the impact of the maintenance on your research.

During this time, the following maintenance tasks are planned:

  • Operating system, compiler, and software updates (All clusters)
  • InfiniBand networking updates (firmware and software) (Flux/Armis/ConFlux)
  • Resource manager and job scheduling software updates (All clusters)
  • Lmod default software version changes (Flux/Armis/ConFlux)
  • Upgrade of HPC systems to CUDA 9.X (Flux/Armis/ConFlux)
  • Software updates for the Lustre file systems that provide /scratch (Flux)
  • Elastic Storage Server updates (ConFlux)
  • Enabling of 32-bit file IDs on home and software volumes (Flux/Armis)
  • Network switch maintenance (Turbo)

For Flux and Armis HPC jobs, you can use the command “maxwalltime” to discover the amount of time remaining until the beginning of the maintenance. Jobs requesting more walltime than remains before the maintenance will be queued and started after the maintenance is completed.
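
For scripting around the window, a minimal sketch along the following lines (in Python) could wrap that check. It is illustrative only, not ARC-TS tooling: it assumes the cluster-provided maxwalltime command simply prints the remaining time as text, and the helper names are hypothetical.

```python
# Minimal sketch, not ARC-TS tooling: query the cluster's "maxwalltime" command
# and decide whether a job's requested walltime fits before the maintenance.
# The exact output format of maxwalltime is an assumption and may differ.
import subprocess

def time_until_maintenance():
    """Return the raw text printed by the maxwalltime command."""
    result = subprocess.run(["maxwalltime"], capture_output=True, text=True, check=True)
    return result.stdout.strip()

def fits_before_maintenance(requested_hours, remaining_hours):
    """True if a job requesting `requested_hours` of walltime can start now."""
    return requested_hours <= remaining_hours

if __name__ == "__main__":
    print("Time remaining until maintenance:", time_until_maintenance())
    # Hypothetical example: a 24-hour job starts now only if at least 24 hours
    # remain; otherwise the scheduler holds it until after the maintenance.
    print("24-hour job can start now:", fits_before_maintenance(24, 72))
```

In practice no manual action is needed, since the scheduler holds jobs that cannot finish in time; the sketch is only useful if you want to script around the maintenance window.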

All Flux, Armis, ConFlux, and Flux Hadoop filesystems will be unavailable during the maintenance. We encourage you to copy any data that might be needed during that time from Flux prior to the start of the maintenance.

Turbo storage will be unavailable starting at 6 a.m. Monday, August 6th and will return to service at 10 a.m.

We will post status updates on our Twitter feed ( https://twitter.com/arcts_um ) throughout the course of the maintenance and send an email to all HPC and Hadoop users when the maintenance has been completed.  Updates will also be compiled at http://arc-ts.umich.edu/summer-2018-maintenance/. Please contact hpc-support@umich.edu if you have any questions.

 

ARC-TS continues to expand Machine Learning and GPU capability

By | Flux, General Interest, Happenings, HPC, News

Advanced Research Computing – Technology Services (ARC-TS) is pleased to announce the addition of 12 new NVIDIA TITAN V (Volta-class) GPUs to our Flux HPC cluster.

The new GPUs are spread across three nodes with four cards each. Each card has 12GB of memory and over 5,100 CUDA cores, and includes NVIDIA’s new Tensor Cores, which deliver over 100 teraflops for certain types of machine learning workloads. The new cards will also provide the highest single- and double-precision performance of any GPU offered on Flux.

The new GPUs will augment our existing K40 and other GPUs, bringing the total GPU count on Flux and Armis to over 50 cards available to the U-M research community. Users of FluxG can access the new TITAN V GPUs using the example on our website; if you have any questions, please contact us at hpc-support@umich.edu.

ARC-TS begins work on new “Great Lakes” cluster to replace Flux

By | Flux, Happenings, HPC, News

Advanced Research Computing – Technology Services (ARC-TS) is starting the process of creating a new, campus-wide computing cluster, “Great Lakes,” that will serve the broad needs of researchers across the University. Over time, Great Lakes will replace Flux, the shared research computing cluster that currently serves over 300 research projects and 2,500 active users.

“Researchers will see improved performance, flexibility and reliability associated with newly purchased hardware, as well as changes in policies that will result in greater efficiencies and ease of use,” said Brock Palen, director of ARC-TS.

The Great Lakes cluster will be available to all researchers on campus for simulation, modeling, machine learning, data science, genomics, and more. The platform will provide a balanced combination of computing power, I/O performance, storage capability, and accelerators.

ARC-TS is in the process of procuring the cluster. Only minimal interruption to ongoing research is expected. A “Beta” cluster will be available to help researchers learn the new system before Great Lakes is deployed in the first half of 2019.

The Flux cluster is approximately 8 years old, although many of the individual nodes are newer. One of the benefits of replacing the cluster is to create a more homogeneous platform.

Based on extensive input from faculty and other stakeholders across campus, the new Great Lakes cluster will be designed to deliver similar services and capabilities as Flux, including the ability to accommodate faculty purchases of hardware, access to GPUs and large-memory nodes, and improved support for emerging uses such as machine learning and genomics. The cluster will consist of approximately 20,000 cores.

For more information, contact hpc-support@umich.edu, and see arc-ts.umich.edu/systems-services/greatlakes, where updates to the project will be posted.

Patches being deployed for Meltdown and Spectre attacks

By | Flux, General Interest, Happenings, News

On January 3, two major vulnerabilities in computer chips made by Intel, AMD, ARM and others were made public. Collectively, the two issues are being referred to as Meltdown and Spectre, and they could allow low-privilege processes to access kernel memory that is allocated to other running programs. Patches for Meltdown have been released by almost every major operating system vendor, and we are in the process of deploying them on our major systems. Deployment of these patches will result in varying performance impacts, depending on your workload. Given the high-profile nature of the Meltdown hardware vulnerability, along with the existence of exploits in the wild, we have no choice but to deploy patches on all systems. Below we list our mitigation strategies for Meltdown for each system. Existing patches also address some known Spectre exploits, but further patches may follow as discovery continues.

For further information regarding Meltdown and Spectre:

ARC-TS Systems:

Flux/Armis:  CentOS has released packages to mitigate against Meltdown.  These packages are being installed during the winter maintenance.

ConFlux: IBM’s PowerPC architecture is not known at this time to be impacted by Meltdown.  The impact of Spectre is still being evaluated.

Flux Hadoop: CentOS has released packages to mitigate against Meltdown.  These packages are being installed during the winter maintenance.

YBRC: We do not anticipate any outages and will use our standard procedure for upgrading. ARC-TS is working closely with Yottabyte and the upstream sources to prepare a patch that mitigates Meltdown impacts for the platform. The only user-facing impact should be a degradation of storage performance and a brief suspension of networking when a VM migrates hosts (usually a single dropped ping). The timeframe for applying the patch is still unknown, but we intend to patch the hosts as soon as patches become available. ARC-TS will send out follow-up notifications before starting the patch process. Applying patches to the various VMs/hosts/containers will require a restart of each affected machine after the patches have been applied.

Turbo: Meltdown impacts to Turbo are low, and we have not received any guidance on Dell/EMC/Isilon procedures at this time.

Potential service disruption for Value Storage maintenance — Dec. 2

By | Flux, General Interest, Happenings, HPC, News

The ITS Storage team will be applying an operating system patch to the MiStorage Silver environment, which provides home directories for both Flux and Flux Hadoop. The ITS maintenance window will be from 11:00 p.m. on December 2nd to 7:00 a.m. on December 3rd (8 hours total). This update may be disruptive to the stability of the nodes and the jobs running on them.

The ITS status page for this incident is here:  http://status.its.umich.edu/report.php?id=141155

For Flux users: we have created a reservation on Flux so no jobs will be running or impacted. We will remove the reservation once the ITS storage team confirms that the update was successful.

For Flux Hadoop users: the scheduler and user logins will be deactivated when the outage starts, and any user currently logged into the cluster will be logged out for the duration of the outage. We will reactivate access once we receive the all-clear from the ITS storage team.

Status updates will be posted on the ARC-TS Twitter feed (https://twitter.com/arcts_um). If you have any questions, please email us at hpc-support@umich.edu.

CSCAR provides walk-in support for new Flux users

By | Data, Educational, Flux, General Interest, HPC, News

CSCAR now provides walk-in support during business hours for students, faculty, and staff seeking assistance in getting started with the Flux computing environment. CSCAR consultants can walk a researcher through the steps of applying for a Flux account, installing and configuring a terminal client, connecting to Flux, learning basic SSH and Unix command-line usage, and obtaining or accessing allocations.

In addition to walk-in support, CSCAR has several staff consultants with expertise in advanced and high performance computing who can work with clients on a variety of topics such as installing, optimizing, and profiling code.  

Support via email is also provided via hpc-support@umich.edu.  

CSCAR is located in room 3550 of the Rackham Building (915 E. Washington St.). Walk-in hours are 9 a.m. – 5 p.m., Monday through Friday, except for noon – 1 p.m. on Tuesdays.

See the CSCAR web site (cscar.research.umich.edu) for more information.

ARC-TS seeks input on next generation HPC cluster

By | Events, Flux, General Interest, Happenings, HPC, News

The University of Michigan is beginning the process of building its next-generation HPC platform, “Big House.” Flux, the shared HPC cluster, has reached the end of its useful life; it has served us well for more than five years, but as we move forward with its replacement, we want to make sure we’re meeting the needs of the research community.

ARC-TS will be holding a series of town halls to take input from faculty and researchers on the next HPC platform to be built by the University.  These town halls are open to anyone and will be held at:

  • College of Engineering, Johnson Room, Tuesday, June 20th, 9:00a – 10:00a
  • NCRC Bldg 300, Room 376, Wednesday, June 21st, 11:00a – 12:00p
  • LSA #2001, Tuesday, June 27th, 10:00a – 11:00a
  • 3114 Med Sci I, Wednesday, June 28th, 2:00p – 3:00p

Your input will help ensure that U-M stays on course in providing HPC resources, so we hope you will make time to attend one of these sessions. If you cannot attend, please email hpc-support@umich.edu with any input you want to share.

HPC maintenance scheduled for January 7 – 9

By | Flux, General Interest, News

To accommodate upgrades to software and operating systems, Flux, Armis, and their storage systems (/home and /scratch) will be unavailable starting at 9 a.m. Saturday, January 7th, returning to service on Monday, January 9th. Additionally, external Turbo mounts will be unavailable from 11 p.m. Saturday, January 7th, until 7 a.m. Sunday, January 8th.

During this time, the following updates are planned:

  • Operating system and software updates (minor updates) on Flux and Armis.  This should not require any changes to user software or processes.
  • Resource manager and job scheduling software updates.
  • Operating system updates on Turbo.

For HPC jobs, you can use the command “maxwalltime” to discover the amount of time before the beginning of the maintenance. Jobs that cannot complete prior to the beginning of the maintenance will be able to start when the clusters are returned to service.

We will post status updates on our Twitter feed ( https://twitter.com/arcts_um ) and send an email to all HPC users when the outage has been completed.

U-M team uses Flux HPC cluster for pre-surgery simulations

By | Flux, General Interest, News

Last summer, Alberto Figueroa’s BME lab at the University of Michigan achieved an important “first” – using computer-generated blood flow simulations to plan a complex cardiovascular procedure.

“I believe this is the first time that virtual surgical planning was done for real and not as a retrospective theoretical exercise,” says Figueroa.

Using a patient’s medical and imaging data, Figueroa was able to create a model of her unique vasculature and blood flow, then use it to guide U-M pediatric cardiologists Aimee Armstrong, Martin Bocks, and Adam Dorfman in placing a graft in her inferior vena cava to help alleviate complications from pulmonary arteriovenous malformations (PAVMs). The simulations were done using the Flux HPC cluster.

Read more…