Tag: maintenance

HPC Emergency 2023 Maintenance: September 15


Due to a critical issue that requires an immediate update, we will be updating Slurm and the underlying libraries that allow parallel jobs to communicate. The login nodes and the rest of the cluster will be updated on the fly, and you should experience only minimal impact when interacting with the clusters.

  • Jobs that are currently running will be allowed to finish. 
  • All new jobs will only be allowed to run on nodes which have been updated. 
  • The login and Open OnDemand nodes will also be updated, which will require a brief interruption in service.

Queued jobs and maintenance reminders

Jobs will remain queued, and will automatically begin after the maintenance is completed. Any parallel jobs using MPI will fail; those jobs may need to be recompiled, as described below. Jobs not using MPI will not be affected by this update.

Jobs will be initially slow to start, as compute nodes are drained of running jobs so they can be updated. We apologize for this inconvenience, and want to assure you that we would not be performing this maintenance during a semester unless it was absolutely necessary.

Software updates

Only one version of OpenMPI (version 4.1.6) will be available; all other versions will be removed. Modules for the removed versions of OpenMPI will warn you that the version is no longer available and prompt you to load openmpi/4.1.6.

When you use the following command, it will default to openmpi/4.1.6:
module load openmpi 

Any software packages you use (provided by ARC/LSA/COE/UMMS or built yourself) will need to be updated to use openmpi/4.1.6. ARC will complete the updates for the software packages it provides; code you compile yourself will need to be updated by you.

Note that openmpi/3.1.6 is being discontinued; its module will warn you to update to openmpi/4.1.6.
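
If you compile your own MPI code, a minimal sketch of rebuilding it against the new OpenMPI might look like the following. The compiler module (gcc) and file names (my_app.c, my_app) are placeholders for illustration; adjust them to match your build.

module purge                     # start from a clean module environment
module load gcc openmpi/4.1.6    # gcc is a placeholder; load whichever compiler module you build with
mpirun --version                 # should now report Open MPI 4.1.6
mpicc -O2 -o my_app my_app.c     # recompile with the OpenMPI compiler wrapper

Inside a Slurm job script, the rebuilt binary can then be launched as usual (for example, with srun ./my_app).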

Status updates

 

System software changes

Great Lakes, Armis2 and Lighthouse

New versions (after maintenance):
  • Slurm 23.02.5, compiled with:
    • PMIx: /opt/pmix/3.2.5 and /opt/pmix/4.2.6
    • hwloc 2.2.0-3 (OS provided)
    • ucx-1.15.0-1.59056 (OFED provided)
    • slurm-libpmi
    • slurm-contribs
  • PMIx LD config: /opt/pmix/3.2.5/lib
  • PMIx versions available in /opt: 3.2.5, 4.2.6
  • OpenMPI: 4.1.6

Old versions (before maintenance):
  • Slurm 23.02.3, compiled with:
    • PMIx: /opt/pmix/2.2.5, /opt/pmix/3.2.3, and /opt/pmix/4.2.3
    • hwloc 2.2.0-3 (OS provided)
    • ucx-1.15.0-1.59056 (OFED provided)
    • slurm-libpmi
    • slurm-contribs
  • PMIx LD config: /opt/pmix/2.2.5/lib
  • PMIx versions available in /opt: 2.2.5, 3.2.3, 4.1.2
  • OpenMPI: 3.1.6 and others

 

How can we help you?

For assistance or questions, contact ARC at arc-support@umich.edu.

Great Lakes, Lighthouse, Armis2 Summer Networking 2023 Maintenance


Due to maintenance, the high-performance computing (HPC) clusters and their storage systems (/home and /scratch) will be unavailable:

  • Great Lakes, Armis2, and Lighthouse: Monday, August 21, 2023, 7am – Tuesday, August 22, 2023, 5pm
  • Turbo, Locker, and Data Den: Monday, August 21, 2023, 7am – Tuesday, August 22, 2023, 5pm

ATTENTION

  • No jobs will run that cannot be completed by the beginning of maintenance. 
  • Before maintenance begins, copy any files in /home or /scratch that you might need during the outage to your local drive using Globus File Transfer or another file transfer tool (see the User Guide for the cluster); a command-line sketch follows this list.
  • All running and queued jobs will be deleted at the start of maintenance.
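
If you prefer the command line to Globus, a minimal sketch of copying a directory to your local machine before maintenance might look like the following. The login host, uniqname, and directory names are placeholders; see the User Guide for your cluster for the correct login host.

mkdir -p ~/maintenance-backup                                                # run on your LOCAL machine
scp -r uniqname@<cluster-login-host>:/home/uniqname/my_project ~/maintenance-backup/
scp -r uniqname@<cluster-login-host>:/scratch/<path-to-your-files> ~/maintenance-backup/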

More details can be found below:

ARC Summer 2023 Network Maintenance

Contact arc-support@umich.edu if you have any questions.

Summer 2023 Network Maintenance: HPC and storage unavailable August 21-22 


During the 2023 summer maintenance, a significant networking software bug was discovered, and ARC was unable to complete the HPC and storage network updates at the MACC Data Center.

ITS has been working with the vendor on a remediation, and it will be implemented on August 21-22.  This will require scheduled maintenance for the HPC clusters Great Lakes, Armis2, and Lighthouse, as well as the ARC storage systems Turbo, Locker, and Data Den. The date was selected to minimize any impact during the fall semester. 

Maintenance dates:

The HPC clusters and their storage systems (/home and /scratch), as well as the ARC storage systems (Turbo, Locker, and Data Den), will be unavailable starting August 21 at 7:00am. The expected completion date is August 22.

Queued jobs and maintenance reminders

Jobs will remain queued and will automatically begin after the maintenance is completed. The command “maxwalltime” will show the amount of time remaining until maintenance begins on each cluster, so you can size your jobs appropriately. The countdown to maintenance will also appear on the ARC homepage.
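
For example, a minimal sketch of sizing a job to fit in the remaining window (the job script name and walltime are illustrative):

maxwalltime                          # shows the time remaining before maintenance on this cluster
sbatch --time=08:00:00 my_job.sbat   # request a walltime shorter than the remaining window

Jobs that request more walltime than remains will simply stay queued and start automatically once maintenance is complete.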

Status updates

How can we help you?

For assistance or questions, contact ARC at arc-support@umich.edu.

Great Lakes, Lighthouse, Armis2 Summer 2023 Maintenance


Due to maintenance, the high-performance computing (HPC) clusters and their storage systems (/home and /scratch) will be unavailable:

  • Great Lakes, Armis2, and Lighthouse: Monday, June 5, 2023, 8am – Friday, June 9, 2023, 5pm

ATTENTION

  • No jobs will run that cannot be completed by the beginning of maintenance. 
  • Before maintenance begins, copy any files in /home or /scratch that you might need during the outage to your local drive using Globus File Transfer or another file transfer tool (see the User Guide for the cluster).
  • All running and queued jobs will be deleted at the start of maintenance.

More details can be found below:

ARC Summer 2023 Maintenance

Contact arc-support@umich.edu if you have any questions.

ARC Summer 2023 Maintenance happening in June


Summer maintenance will be happening earlier this year (June instead of August). Updates will be made to software, hardware, and operating systems to improve the performance and stability of services. ARC works to complete these tasks quickly to minimize the impact of the maintenance on research.

The dates listed below are the weeks during which the work will occur; the exact dates will be revised as planning continues.

HPC clusters and storage systems (/scratch) will be unavailable:

  • June 5-9: Great Lakes, Armis2, and Lighthouse

Storage systems will be unavailable:

  • June 6-7: Turbo, Locker, and Data Den

Queued jobs and maintenance reminders

Jobs will remain queued and will automatically begin after the maintenance is completed. The command “maxwalltime” will show the amount of time remaining until maintenance begins on each cluster, so you can size your jobs appropriately. The countdown to maintenance will also appear on the ARC homepage.

Status updates

How can we help you?

For assistance or questions, contact ARC at arc-support@umich.edu.

Lighthouse Winter 2023 Maintenance


Due to maintenance, the Lighthouse high-performance computing (HPC) cluster and its storage systems (/home and /scratch) will be unavailable Monday, January 9, 2023, 8am – Wednesday, January 11, 2023, 5pm.

ATTENTION

  • No jobs will run that cannot be completed by the beginning of maintenance. 
  • Before maintenance begins, copy any files in /home or /scratch that you might need during the outage to your local drive using Globus File Transfer or another file transfer tool (see the User Guide for the cluster).
  • All running and queued jobs will be deleted at the start of maintenance.

More details can be found below:

ARC Winter 2023 Maintenance

Contact arc-support@umich.edu if you have any questions.

Armis2 Winter 2023 Maintenance


Due to maintenance, the Armis2 high-performance computing (HPC) cluster and its storage systems (/home and /scratch) will be unavailable Monday, January 9, 2023, 8am – Wednesday, January 11, 2023, 5pm.

ATTENTION

  • No jobs will run that cannot be completed by the beginning of maintenance. 
  • Before maintenance begins, copy any files in /home or /scratch that you might need during the outage to your local drive using Globus File Transfer or another file transfer tool (see the User Guide for the cluster).
  • All running and queued jobs will be deleted at the start of maintenance.

More details can be found below:

ARC Winter 2023 Maintenance

Contact arc-support@umich.edu if you have any questions.

Great Lakes Winter 2023 Maintenance


Due to maintenance, the following Great Lakes (HPC) nodes and services will be unavailable Wednesday, January 4, 2023, 8am – Thursday, January 5, 2023, 5pm:

  • Single Precision GPU (spgpu) nodes (Jan 4)
  • On-campus login node (Jan 4)
  • Encore node (Jan 5)

ATTENTION

  • No jobs will run that cannot be completed by the beginning of maintenance. 
  • Before maintenance begins, copy any files in /home or /scratch that you might need during the outage to your local drive using Globus File Transfer or another file transfer tool (see the User Guide for the cluster).
  • All running and queued jobs will be deleted at the start of maintenance.

More details can be found below:

ARC Winter 2023 Maintenance

Contact arc-support@umich.edu if you have any questions.

Summer 2019 Maintenance


To accommodate updates to software, hardware, and operating systems, Flux, Armis, Lighthouse, ConFlux, Cavium HPC, and their storage systems (/home and /scratch) will be unavailable starting on Monday, August 12th at 6:00 AM and returning to service on Wednesday, August 14th at 3:00 PM. These updates will improve the performance and stability of ARC-TS services. We try to encapsulate the required changes into two maintenance periods per year and work to complete these tasks quickly, as we understand the impact of the maintenance on your research.

Planned infrastructure maintenance tasks:

  • Annual preventive maintenance and electrical work at Modular Data Center (Flux, Armis, Lighthouse)
  • InfiniBand networking updates (firmware and software) (Flux/Armis/Lighthouse/ConFlux)
  • Network switch upgrades to improve throughput between ARC-TS services (Turbo, Flux, Armis, Lighthouse, Great Lakes)

Flux, Armis, and Lighthouse maintenance tasks:

  • /scratch storage hardware (Flux only)
  • Hardware maintenance on InfiniBand

ConFlux maintenance tasks:

  • OS Updates (CentOS 7.6)
  • Mellanox OFED
  • GPFS/ESS
  • CUDA updates (version 10.1)
  • Perhaps LSF 10.1

Cavium HPC updates:

  • OS updates
  • Slurm version upgraded to 18.08.7

For Flux, Lighthouse, and Armis HPC jobs, you can use the command “maxwalltime” to discover the amount of time remaining until the beginning of the maintenance. Jobs requesting more walltime than remains before the maintenance will be queued and started after the maintenance is completed.

All Flux, Armis, Lighthouse, ConFlux, and Cavium HPC file systems will be unavailable during the maintenance. We encourage you to copy any data that might be needed during that time prior to the start of the maintenance.

We will post status updates on our Twitter feed throughout the course of the maintenance and send an email to all HPC users when the maintenance has been completed. Updates will also be compiled at http://arc-ts.umich.edu/summer-2019-maintenance/. Please contact hpc-support@umich.edu if you have any questions.

Winter HPC maintenance completed


Flux, Beta, Armis, Cavium, and ConFlux, and their storage systems (/home and /scratch) are back online after three days of maintenance.  The updates that have been completed will improve the performance and stability of ARC-TS services. 

The following maintenance tasks were done:

  • Preventative maintenance at the Modular Data Center (MDC) which requires a full power outage
  • InfiniBand networking updates (firmware and software)
  • Ethernet networking updates (datacenter distribution layer switches)
  • Operating system and software updates
  • Migration of Turbo networking to new switches (affects /home and /sw)
  • Perform consistency checks on the Lustre file systems that provide /scratch
  • Update firmware and software of the GPFS file systems (ConFlux, starting 9 a.m., Monday, Jan. 7)
  • Perform consistency checks on the GPFS file systems that provide /gpfs (ConFlux, starting 9 a.m., Monday, Jan. 7) 

Please contact hpc-support@umich.edu if you have any questions.