Modular Data Center Electrical Work

By | Flux, Systems and Services, Uncategorized

[Update 2019-05-17] The MDC electrical work was completed successfully and Flux has been returned to full production.


The Modular Data Center (MDC), which houses Flux, Flux Hadoop, and other HPC resources, has an electrical issue that requires us to bring power usage below 50% on some racks before the problem can be resolved. To do this, we have placed reservations on some nodes to reduce the power draw so the issue can be fixed by ITS Data Centers. Once we hit the target power level and the issue is resolved, we will remove the reservations and return Flux and Flux Hadoop to full production.

Great Lakes Update: March 2019

By | Flux, General Interest, Great Lakes, Happenings, HPC, News

ARC-TS previously shared much of this information in the December 2018 ARC Newsletter and on the ARC-TS website. We have added further detail on the Great Lakes timeline and on how to participate in Early User testing.

What is Great Lakes?

The Great Lakes service is a next generation HPC platform for University of Michigan researchers, which will provide several performance advantages compared to Flux. Great Lakes is built around the latest Intel CPU architecture called Skylake and will have standard, large memory, visualization, and GPU-accelerated nodes.  For more information on the technical aspects of Great Lakes, please see the Great Lakes configuration page.

Key Features:

  • Approximately 13,000 Intel Skylake Gold cores with AVX-512 capability, providing over 1.5 TFlops of performance per node
  • 2 PB scratch storage system providing approximately 80 GB/s performance (compared to 8 GB/s on Flux)
  • New InfiniBand network with improved architecture and 100 Gb/s to each node
  • Each compute node will have significantly faster I/O via SSD-accelerated storage
  • Large Memory Nodes with 1.5 TB memory per node
  • GPU Nodes with NVIDIA Volta V100 GPUs (2 GPUs per node)
  • Visualization Nodes with Tesla P40 GPUs

Great Lakes will be using Slurm as the resource manager and scheduler, which will replace Torque and Moab on Flux. This will be the most immediate difference between the two clusters and will require some work on your part to transition from Flux to Great Lakes.

Another significant change is that we are making Great Lakes easier to use through a simplified accounting structure.  Unlike Flux, where you need a separate account for each resource, on Great Lakes you can use a single account and simply request the resources you need, from GPUs to large memory.
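As a rough illustration of what this looks like in practice, the sketch below submits jobs to different resource types from a single account; the account and partition names are placeholders rather than confirmed Great Lakes values.

```bash
# Hypothetical illustration: one Slurm account, different resource requests per job.
# The account and partition names below are placeholders, not confirmed Great Lakes names.
sbatch --account=example --partition=standard --ntasks=36 job.sh       # standard compute
sbatch --account=example --partition=gpu --gres=gpu:1 job.sh           # a GPU node
sbatch --account=example --partition=largemem --mem=1000g job.sh       # a large-memory node
```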

There will be two primary ways to get access to compute time: 1) the on-demand model, which adds up the account’s job charges (reserved resources multiplied by the time used) and is billed monthly, similar to Flux On-Demand; and 2) node purchases.  In the node purchase model, you will own hardware that resides in Great Lakes for the life of the cluster, and you will receive an equivalent credit which you can use anywhere on the cluster, including on GPU and large-memory nodes. We believe this will be preferable to buying actual hardware in the FOE model, as your daily computational usage can increase and decrease as your research requires. Send us an email at arcts-support@umich.edu if you have any questions or are interested in purchasing hardware on Great Lakes.
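As a hypothetical worked example of how on-demand charges accumulate (actual rates were still being finalized with ITS Finance at the time of writing):

```bash
# Hypothetical example: a job that reserved 4 cores for 10 hours accrues
# 4 x 10 = 40 core-hours toward the account's monthly bill.
cores=4
hours=10
echo "Core-hours billed for this job: $((cores * hours))"
```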

When will Great Lakes be available?

The ARC-TS team will prepare the cluster in April 2019 for an Early User period beginning in May, which will continue for approximately four weeks to ensure sufficient time to address any issues. General availability of Great Lakes should occur in June 2019. We have a timeline for the Great Lakes project with more detail.

How does this impact me? Why Great Lakes?

After being the primary HPC cluster for the University for 8 years, Flux will be retired in September 2019.  Once Great Lakes becomes available to the University community, we will provide a few months to transition from Flux to Great Lakes.  Flux will be retired after that period due to aging hardware as well as expiring service contracts and licenses. We highly recommend preparing to migrate as early as possible so your research will not be interrupted.  Later in this email, we have suggestions for what you can do to make this migration process as easy as possible.

When Great Lakes becomes generally available to the University community, we will no longer be accepting new Flux accounts or allocations.  All new work should be focused on Great Lakes.

You can see the HPC timeline, including Great Lakes, Beta and Flux, here.

What is the current status of Great Lakes?

Today, the Great Lakes HPC compute hardware and high-performance storage system have been fully installed and configured. In parallel with this work, ARC-TS and unit support team members have been readying the new service with new software and modules, as well as developing training to support the transition onto Great Lakes. A key feature of the new Great Lakes service is the just-released HDR InfiniBand from Mellanox. Today, the hardware is installed, but the firmware is still in its final stages of testing with the supplier, with a target delivery of mid-April 2019. Given the delays, ARC-TS and the suppliers have discussed an adjusted plan that allows quicker access to the cluster while supporting the future update once the firmware becomes available.

We are working with ITS Finance to define rates for Great Lakes.  We will update the Great Lakes documentation when we have final rates and let everyone know in subsequent communications.

What should I do to transition to Great Lakes?

We hope the transition from Flux to Great Lakes will be relatively straightforward, but to minimize disruptions to your research, we recommend you do your testing early.  In October 2018, we announced availability of the HPC cluster Beta to help users with this migration. Primarily, it allows users to migrate their PBS/Torque job submission scripts to Slurm.  You can and should also review the new module environments, as they have changed from their current configuration on Flux. Beta uses the same generation of hardware as Flux, so your performance will be similar to that on Flux. You should continue to use Flux for your production work; Beta is only for testing your Slurm job scripts, not for any production work.

Every user on Flux has an account on Beta.  You can log in to Beta at beta.arc-ts.umich.edu.  You will have a new home directory on Beta, so you will need to migrate any scripts and data files you need to test your workloads into this new directory.  Beta should not be used for any PHI, HIPAA, Export Controlled, or other sensitive data!  We highly recommend that you use this time to convert your Torque scripts to Slurm and test that everything works as you expect.
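As a starting point, here is a minimal sketch of a Slurm batch script alongside the Torque directives it replaces; the account, partition, and module names are placeholders, so consult the Beta documentation for the values that apply to your allocation.

```bash
#!/bin/bash
# Minimal Slurm translation of a simple Torque/PBS batch script, for testing on Beta.
# The original Torque directives might have looked like:
#   #PBS -N myjob
#   #PBS -l nodes=1:ppn=4,mem=8gb,walltime=02:00:00
#   #PBS -A example_flux
#
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=8g
#SBATCH --time=02:00:00
#SBATCH --account=example        # placeholder account name
#SBATCH --partition=standard     # placeholder partition name

module load gcc                  # module names may differ on Beta
srun ./my_mpi_program            # MPI codes run under srun; serial codes can be run directly
```

Submit the script with sbatch and monitor it with squeue, in place of qsub and qstat.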

To learn how to use Slurm, we have provided documentation on our Beta website.  Additionally, ARC-TS and academic unit support teams will be offering training sessions around campus. We will post a schedule on the ARC-TS website and announce new sessions through Twitter and email.

If you have compiled software for use on Flux, we highly recommend that you recompile it on Great Lakes once it becomes available.  Great Lakes uses the latest Intel CPUs, and recompiling may yield performance gains by taking advantage of the new CPUs' capabilities.
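For example, a rebuild might look like the sketch below; the module name and compiler flags are assumptions, so check the modules available on Great Lakes and your own build system.

```bash
# Hypothetical rebuild targeting the Skylake nodes; adjust modules and flags to your code.
module load gcc                                     # or the Intel compiler module
gcc -O3 -march=skylake-avx512 -o my_program my_program.c
# When building on a Great Lakes node itself, -march=native targets the same hardware:
# gcc -O3 -march=native -o my_program my_program.c
```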

Questions? Need Assistance?

Contact arcts-support@umich.edu.

Winter HPC maintenance completed

By | Beta, Flux, General Interest, Happenings, HPC, News

Flux, Beta, Armis, Cavium, and ConFlux, and their storage systems (/home and /scratch) are back online after three days of maintenance.  The updates that have been completed will improve the performance and stability of ARC-TS services. 

The following maintenance tasks were done:

  • Preventative maintenance at the Modular Data Center (MDC), which required a full power outage
  • InfiniBand networking updates (firmware and software)
  • Ethernet networking updates (datacenter distribution layer switches)
  • Operating system and software updates
  • Migration of Turbo networking to new switches (affects /home and /sw)
  • Consistency checks on the Lustre file systems that provide /scratch
  • Firmware and software updates for the GPFS file systems (ConFlux, starting 9 a.m., Monday, Jan. 7)
  • Consistency checks on the GPFS file systems that provide /gpfs (ConFlux, starting 9 a.m., Monday, Jan. 7)

Please contact hpc-support@umich.edu if you have any questions.

Great Lakes Update: December 2018

By | Flux, General Interest, Great Lakes, Happenings, News

What is Great Lakes?

The Great Lakes service is a next generation HPC platform for University of Michigan researchers. Great Lakes will provide several performance advantages compared to Flux, primarily in the areas of storage and networking. Great Lakes is built around the latest Intel CPU architecture called Skylake and will have standard, large memory, visualization, and GPU-accelerated nodes. For more information on the technical aspects of Great Lakes, please see the Great Lakes configuration page.

Key Features:

  • Approximately 13,000 Intel Skylake Gold cores with AVX-512 capability, providing over 1.5 TFlops of performance per node
  • 2 PB scratch storage system providing approximately 80 GB/s performance (compared to 8 GB/s on Flux)
  • New InfiniBand network with improved architecture and 100 Gb/s to each node
  • Each compute node will have significantly faster I/O via SSD-accelerated storage
  • Large Memory Nodes with 1.5 TB memory per node
  • GPU Nodes with NVIDIA Volta V100 GPUs (2 GPUs per node)
  • Visualization Nodes with Tesla P40 GPUs

Great Lakes will be using Slurm as the resource manager and scheduler, which will replace Torque and Moab on Flux. This will be the most immediate difference between the two clusters and will require some work on your part to transition from Flux to Great Lakes.

Another significant change is that we are making Great Lakes easier to use through a simplified accounting structure. Unlike Flux, where you need a separate account for each resource, on Great Lakes you can use a single account and simply request the resources you need, from GPUs to large memory.

There will be two primary ways to get access to compute time: 1) the pay-as-you-go model, similar to Flux On-Demand, and 2) node purchases.  Node purchases will give you computational time commensurate with four years multiplied by the number of nodes you buy. We believe this will be preferable to buying actual hardware in the FOE model, as your daily computational usage can increase and decrease as your research requires. Additionally, you will not be limited by hardware failures on your specific nodes, as your jobs can run anywhere on Great Lakes. Send us an email at arcts-support@umich.edu if you have any questions or are interested in purchasing hardware on Great Lakes.
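As a rough, hypothetical illustration of the node-purchase credit described above:

```bash
# Hypothetical example: buying 2 nodes yields roughly 2 nodes x 4 years of compute time,
# usable anywhere on the cluster rather than tied to specific hardware.
nodes=2
hours_per_year=8760
echo "Approximate credit: $((nodes * 4 * hours_per_year)) node-hours"
```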

When will Great Lakes be available?

The ARC-TS team will prepare the cluster in February/March 2019 for an Early User period which will continue for several weeks to ensure sufficient time to address any issues. General availability of Great Lakes should occur in April.

How does this impact me? Why Great Lakes?

After being the primary HPC cluster for the University for 8 years, Flux will be retired in September 2019.  Once Great Lakes becomes available to the University community, we will provide a few months to transition from Flux to Great Lakes. Flux will be retired after that period due to aging hardware as well as expiring service contracts and licenses. We highly recommend preparing to migrate as early as possible so your research will not be interrupted. Later in this email, we have suggestions for what you can do to make this migration process as easy as possible.

When Great Lakes becomes generally available to the University community, we will no longer be accepting new Flux accounts or allocations.  All new work should be focused on Great Lakes.

What is the current status of Great Lakes?

Today, the Great Lakes HPC compute hardware has been fully installed and the high-performance storage system configuration is in progress. In parallel with this work, ARC-TS and unit support team members have been readying the new service with new software and modules, as well as developing training to support the transition onto Great Lakes. A key feature of the new Great Lakes service is the just-released HDR InfiniBand from Mellanox. Today, the hardware is available, but the firmware is still in its final stages of testing with the supplier, with a target delivery date of March 2019. Given the delays, ARC-TS and the suppliers have discussed an adjusted plan that allows quicker access to the cluster while supporting the future update once the firmware becomes available.

What should I do to transition to Great Lakes?

We hope the transition from Flux to Great Lakes will be relatively straightforward, but to minimize disruptions to your research, we recommend you do your testing early.  In October, we announced availability of the HPC cluster Beta to help users with this migration. Primarily, it allows users to migrate their PBS/Torque job submission scripts to Slurm. You can also review the new module environments, as they have changed from their current configuration on Flux. Beta uses the same generation of hardware as Flux, so your performance will be similar to that on Flux. You should continue to use Flux for your production work; Beta is only for testing your Slurm job scripts, not for any production work.

Every user on Flux has an account on Beta.  You can log in to Beta at beta.arc-ts.umich.edu. You will have a new home directory on Beta, so you will need to migrate any scripts and data files you need to test your workloads into this new directory. Beta should not be used for any PHI, HIPAA, Export Controlled, or other sensitive data! We highly recommend that you use this time to convert your Torque scripts to Slurm and test that everything works as you expect.

To learn how to use Slurm, we have provided documentation on our Beta website. Additionally, ARC-TS and academic unit support teams will be offering training sessions around campus. We will post a schedule on the ARC-TS website and announce new sessions through Twitter and email.

If you have compiled software for use on Flux, we highly recommend that you recompile it on Great Lakes once it becomes available. Great Lakes uses the latest Intel CPUs, and recompiling may yield performance gains by taking advantage of the new CPUs' capabilities.

Questions? Need Assistance?

Contact arcts-support@umich.edu.

Winter HPC maintenance scheduled for Jan. 6-9

By | Beta, Flux, General Interest, Happenings, HPC, News

To accommodate updates to software, hardware, and operating systems, Flux, Beta, Armis, Cavium, and ConFlux, and their storage systems (/home and /scratch) will be unavailable starting at 6 a.m. Sunday, January 6th and returning to service on Wednesday, January 9th.  These updates will improve the performance and stability of ARC-TS services. We try to encapsulate the required changes into two maintenance periods per year and work to complete these tasks quickly, as we understand the impact of the maintenance on your research.

During this time, the following maintenance tasks are planned:

  • Preventative maintenance at the Modular Data Center (MDC) which requires a full power outage
  • InfiniBand networking updates (firmware and software)
  • Ethernet networking updates (datacenter distribution layer switches)
  • Operating system and software updates
  • Potential updates to job scheduling software
  • Migration of Turbo networking to new switches (affects /home and /sw)
  • Consistency checks on the Lustre file systems that provide /scratch
  • Firmware and software updates for the GPFS file systems (ConFlux, starting 9 a.m., Monday, Jan. 7)
  • Consistency checks on the GPFS file systems that provide /gpfs (ConFlux, starting 9 a.m., Monday, Jan. 7)

You can use the command “maxwalltime” to discover the amount of time remaining until the beginning of the maintenance. Jobs requesting more walltime than remains before the maintenance will be queued and started after the maintenance is completed.

All filesystems will be unavailable during the maintenance. We encourage you to copy any data you might need during that time off of Flux before the maintenance begins.
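For example, you could check the remaining window and stage data off Flux ahead of time as sketched below; the hostname and paths are placeholders for your own.

```bash
# On a Flux login node: check how much walltime remains before the maintenance window.
maxwalltime

# From your own workstation: copy data you will need during the outage.
# The host and paths below are placeholders -- substitute the login host and
# directories you normally use.
scp -r uniqname@flux-login-host:/scratch/example_flux/uniqname/results ~/local-copy/
```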

We will post status updates on our Twitter feed ( https://twitter.com/arcts_um ) throughout the course of the maintenance and send an email to all users when the maintenance has been completed.  Please contact hpc-support@umich.edu if you have any questions.

Beta cluster available for learning Slurm; new scheduler to be part of upcoming cluster updates

By | Flux, General Interest, Happenings, HPC, News

New HPC resources to replace Flux and updates to Armis are coming.  They will run a new scheduling system (Slurm). You will need to learn the commands in this system and update your batch files to successfully run jobs. Read on to learn the details and how to get training and adapt your files.

In anticipation of these changes, ARC-TS has created the test cluster “Beta,” which will provide a testing environment for the transition to Slurm. Slurm will be used on Great Lakes; the Armis HIPAA-aligned cluster; and a new cluster called “Lighthouse” which will succeed the Flux Operating Environment in early 2019.

Currently, Flux and Armis use the Torque (PBS) resource manager and the Moab scheduling system; when completed, Great Lakes and Lighthouse will use the Slurm scheduler and resource manager, which will enhance the performance and reliability of the new resources. Armis will transition from Torque to Slurm in early 2019.

The Beta test cluster is available to all Flux users, who can log in via ssh to beta.arc-ts.umich.edu. Beta has its own /home directory, so users will need to create or transfer any files they need via scp/sftp or Globus.

Slurm commands will be needed to submit jobs. For a comparison of Slurm and Torque commands, see our Torque to Slurm migration page. For more information, see the Beta home page.
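As a quick, informal reference (the migration page remains the authoritative source), the sketch below shows the most common command translations and a first session on Beta.

```bash
# Common Torque -> Slurm command translations:
#   qsub job.pbs    ->  sbatch job.sh
#   qstat -u $USER  ->  squeue -u $USER
#   qdel <jobid>    ->  scancel <jobid>
#   qsub -I ...     ->  salloc ...        (interactive jobs)

ssh uniqname@beta.arc-ts.umich.edu   # log in to Beta (replace "uniqname" with your own)
sbatch job.sh                        # submit a converted Slurm batch script
squeue -u $USER                      # check its status
```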

Support staff from ARC-TS and individual academic units will conduct several in-person and online training sessions to help users become familiar with Slurm. We have been testing Slurm for several months, and believe the performance gains, user communications, and increased reliability will significantly improve the efficiency and effectiveness of the HPC environment at U-M.

The tentative time frame for replacing or transitioning current ARC-TS resources is:

  • Flux to Great Lakes, first half of 2019
  • Armis from Torque to Slurm, January 2019
  • Flux Operating Environment to Lighthouse, first half of 2019
  • Open OnDemand on Beta, which replaces ARC Connect for web-based job submissions, Jupyter Notebooks, Matlab, and additional software packages, fall 2018

U-M selects Dell EMC, Mellanox and DDN to Supply New “Great Lakes” Computing Cluster

By | Flux, General Interest, Happenings, HPC, News

The University of Michigan has selected Dell EMC as lead vendor to supply its new $4.8 million Great Lakes computing cluster, which will serve researchers across campus. Mellanox Technologies will provide networking solutions, and DDN will supply storage hardware.

Great Lakes will be available to the campus community in the first half of 2019, and over time will replace the Flux supercomputer, which serves more than 2,500 active users at U-M for research ranging from aerospace engineering simulations and molecular dynamics modeling to genomics and cell biology to machine learning and artificial intelligence.

Great Lakes will be the first cluster in the world to use the Mellanox HDR 200 gigabit per second InfiniBand networking solution, enabling faster data transfer speeds and increased application performance.

“High-performance research computing is a critical component of the rich computing ecosystem that supports the university’s core mission,” said Ravi Pendse, U-M’s vice president for information technology and chief information officer. “With Great Lakes, researchers in emerging fields like machine learning and precision health will have access to a higher level of computational power. We’re thrilled to be working with Dell EMC, Mellanox, and DDN; the end result will be improved performance, flexibility, and reliability for U-M researchers.”

“Dell EMC is thrilled to collaborate with the University of Michigan and our technology partners to bring this innovative and powerful system to such a strong community of researchers,” said Thierry Pellegrino, vice president, Dell EMC High Performance Computing. “This Great Lakes cluster will offer an exceptional boost in performance, throughput and response to reduce the time needed for U-M researchers to make the next big discovery in a range of disciplines from artificial intelligence to genomics and bioscience.”

The main components of the new cluster are:

  • Dell EMC PowerEdge C6420 compute nodes, PowerEdge R640 high memory nodes, and PowerEdge R740 GPU nodes
  • Mellanox HDR 200Gb/s InfiniBand ConnectX-6 adapters, Quantum switches and LinkX cables, and InfiniBand gateway platforms
  • DDN GRIDScaler® 14KX® and 100 TB of usable IME® (Infinite Memory Engine) memory

“HDR 200G InfiniBand provides the highest data speed and smart In-Network Computing acceleration engines, delivering HPC and AI applications with the best performance, scalability and efficiency,” said Gilad Shainer, vice president of marketing at Mellanox Technologies. “We are excited to collaborate with the University of Michigan, Dell EMC and DataDirect Networks, in building a leading HDR 200G InfiniBand-based supercomputer, serving the growing demands of U-M researchers.”

“DDN has a long history of working with Dell EMC and Mellanox to deliver optimized solutions for our customers. We are happy to be a part of the new Great Lakes cluster, supporting its mission of advanced research and computing. Partnering with forward-looking thought leaders such as these is always enlightening and enriching,” said Dr. James Coomer, SVP Product Marketing and Benchmarks at DDN.

Great Lakes will provide significant improvement in computing performance over Flux. For example, each compute node will have more cores, higher maximum speed capabilities, and increased memory. The cluster will also have improved internet connectivity and file system performance, as well as NVIDIA GPUs with Tensor Cores, which are very powerful for machine learning compared to prior generations of GPUs.

“Users of Great Lakes will have access to more cores, faster cores, faster memory, faster storage, and a more balanced network,” said Brock Palen, Director of Advanced Research Computing – Technology Services (ARC-TS).

The Flux cluster was created approximately 8 years ago, although many of the individual nodes have been added since then. Great Lakes represents an architectural overhaul that will result in better performance and efficiency. Based on extensive input from faculty and other stakeholders across campus, the new Great Lakes cluster will be designed to deliver similar services and capabilities as Flux, including the ability to accommodate faculty purchases of hardware, access to GPUs and large-memory nodes, and improved support for emerging uses such as machine learning and genomics.

ARC-TS will operate and maintain the cluster once it is built. Allocations of computing resources through ARC-TS include access to hundreds of software titles, as well as support and consulting from professional staff with decades of combined experience in research computing.

Updates on the progress of Great Lakes will be available at https://arc-ts.umich.edu/greatlakes/.

Cluster and storage maintenance set for Aug. 5-9

By | Flux, General Interest, Happenings, HPC, News

To accommodate updates to software, hardware, and operating systems, Flux, Armis, ConFlux, Flux Hadoop, and their storage systems (/home and /scratch) will be unavailable starting at 9 a.m. Sunday, August 5th and returning to service on Thursday, August 9th. These updates will improve the performance and stability of ARC-TS services.  We try to encapsulate the required changes into two maintenance periods per year and work to complete these tasks quickly, as we understand the impact of the maintenance on your research.

During this time, the following maintenance tasks are planned:

  • Operating system, compiler, and software updates (all clusters)
  • InfiniBand networking updates (firmware and software) (Flux/Armis/ConFlux)
  • Resource manager and job scheduling software updates (all clusters)
  • Lmod default software version changes (Flux/Armis/ConFlux)
  • CUDA 9.x upgrades on HPC systems (Flux/Armis/ConFlux)
  • Software updates for the Lustre file systems that provide /scratch (Flux)
  • Elastic Storage Server updates (ConFlux)
  • Enabling 32-bit file IDs on home and software volumes (Flux/Armis)
  • Network switch maintenance (Turbo)

For Flux and Armis HPC jobs, you can use the command “maxwalltime” to discover the amount of time remaining until the beginning of the maintenance. Jobs requesting more walltime than remains before the maintenance will be queued and started after the maintenance is completed.

All Flux, Armis, ConFlux, and Flux Hadoop filesystems will be unavailable during the maintenance. We encourage you to copy any data you might need during that time off of Flux before the maintenance begins.

Turbo storage will be unavailable starting at 6 a.m. Monday, August 6th and will return to service at 10 a.m.

We will post status updates on our Twitter feed ( https://twitter.com/arcts_um ) throughout the course of the maintenance and send an email to all HPC and Hadoop users when the maintenance has been completed.  Updates will also be compiled at http://arc-ts.umich.edu/summer-2018-maintenance/. Please contact hpc-support@umich.edu if you have any questions.


ARC-TS continues to expand Machine Learning and GPU capability

By | Flux, General Interest, Happenings, HPC, News

Advanced Research Computing – Technology Services (ARC-TS) is pleased to announce the addition of 12 new NVIDIA TITAN V Volta-class GPUs to our Flux HPC computing cluster.

The new GPUs are spread across three nodes with four cards each. Each card has 12GB of memory and over 5,100 CUDA cores. The cards also introduce NVIDIA’s new “tensor cores,” which deliver over 100 teraflops per card for certain types of machine learning jobs. The new cards will also provide the highest single- and double-precision performance of any GPU offered on Flux.

The new GPUs will augment our existing K40 and other GPUs, bringing the total GPU count on Flux and Armis to over 50 cards available to the U-M research community. Users of FluxG can access the new TITAN V GPUs using the example on our website, or if you have any questions, please contact us at hpc-support@umich.edu.
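For illustration only, an interactive Torque request for one of the new GPUs might look like the sketch below; the queue and account names are assumptions, so follow the example on our website or contact us for the exact values.

```bash
# Hypothetical interactive GPU request under Torque/Moab on Flux; the queue and account
# names are assumptions -- use the ARC-TS website example for the real ones.
qsub -I -V -l nodes=1:ppn=1:gpus=1,walltime=1:00:00 -q fluxg -A example_fluxg
# Once the job starts, confirm the GPU is visible:
nvidia-smi
```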

ARC-TS begins work on new “Great Lakes” cluster to replace Flux

By | Flux, Happenings, HPC, News

Advanced Research Computing – Technology Services (ARC-TS) is starting the process of creating a new, campus-wide computing cluster, “Great Lakes,” that will serve the broad needs of researchers across the University. Over time, Great Lakes will replace Flux, the shared research computing cluster that currently serves over 300 research projects and 2,500 active users.

“Researchers will see improved performance, flexibility and reliability associated with newly purchased hardware, as well as changes in policies that will result in greater efficiencies and ease of use,” said Brock Palen, director of ARC-TS.

The Great Lakes cluster will be available to all researchers on campus for simulation, modeling, machine learning, data science, genomics, and more. The platform will provide a balanced combination of computing power, I/O performance, storage capability, and accelerators.

ARC-TS is in the process of procuring the cluster. Only minimal interruption to ongoing research is expected. A “Beta” cluster will be available to help researchers learn the new system before Great Lakes is deployed in the first half of 2019.

The Flux cluster is approximately 8 years old, although many of the individual nodes are newer. One of the benefits of replacing the cluster is to create a more homogeneous platform.

Based on extensive input from faculty and other stakeholders across campus, the new Great Lakes cluster will be designed to deliver similar services and capabilities as Flux, including the ability to accommodate faculty purchases of hardware, access to GPUs and large-memory nodes, and improved support for emerging uses such as machine learning and genomics. The cluster will consist of approximately 20,000 cores.

For more information, contact hpc-support@umich.edu, and see arc-ts.umich.edu/systems-services/greatlakes, where updates to the project will be posted.