For the past several weeks we have been encountering problems with the Infiniband switches serving the HPCC central file systems. These issues have led to job losses and intermittent access to the Lustre storage system.
HPCC staff, working with the Infiniband vendor, have identified several issues contributing
to these problems. Updates have been applied that have stabilized the system until
further work can be carried out during the next planned maintenance shutdown.
Please contact firstname.lastname@example.org if you have any questions.
Planned shutdowns for system maintenance are usually reserved for the 2nd full week of the 2nd month of each calendar quarter. A shutdown may be skipped if it is not needed, or shortened or extended as necessary. In addition, special planned shutdowns will sometimes be needed to carry out upgrades or alterations to the clusters.
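As a rough illustration of the scheduling rule, the following Python sketch computes the 2nd full week of the 2nd month of each calendar quarter. It assumes that a "full week" means a Monday-to-Friday work week lying entirely within the month. That interpretation, and the function name second_full_week, are assumptions made for this example and not an official HPCC definition. Actual shutdown dates can differ from the rule (under this assumption, the February 2022 entry in the schedule below falls one week earlier than the sketch yields), so the published schedule always takes precedence.

```python
from datetime import date, timedelta

# Second month of each calendar quarter: February, May, August, November.
SECOND_MONTHS = (2, 5, 8, 11)


def second_full_week(year, month):
    """Return (Monday, Friday) of the second Mon-Fri week inside the month.

    Assumption: a "full week" is a Monday-to-Friday work week that lies
    entirely within the given month.
    """
    first = date(year, month, 1)
    # Days from the 1st to the first Monday (0 if the 1st is a Monday).
    offset = (7 - first.weekday()) % 7
    first_monday = first + timedelta(days=offset)
    monday = first_monday + timedelta(days=7)
    return monday, monday + timedelta(days=4)


if __name__ == "__main__":
    # Example: candidate maintenance windows for calendar year 2022.
    for month in SECOND_MONTHS:
        start, end = second_full_week(2022, month)
        print(f"{start:%B %d} to {end:%B %d}, {start.year}")
```

Treat the output only as a planning estimate and confirm each window against the announced schedule.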
For Academic Year 2021-2022, the planned shutdowns will be on the following schedule:
1) November 8-12, 2021 *
2) February 7-11, 2022
3) May 9-13, 2022
4) August 8-12, 2022
* Work planned for the November 8-12, 2021 shutdown includes the following:
- Upgrade the Slurm scheduler and apply security patches
- Update firmware on Infiniband switches, cards, and routers
- Upgrade metadata servers for the Lustre storage system to an NVMe drive array
- Improve resilience of the central NFS file service by moving MATLAB files to another server
- Update MATLAB version (to be confirmed with MathWorks)
- Install electrical redundancy for the coolant distribution unit for the liquid-cooled nodes
- Continue the planned purge of older files on the scratch area
- Update firmware on all cluster worker nodes
- Expand the test nodes for the transition of the cluster to Rocky Linux, replacing CentOS 8
We appreciate your patience during this maintenance. Please feel free to contact us at email@example.com if you require assistance or have any further questions.
Maintenance downtime policy for HPCC systems:
- Special periods may be reserved for performing routine maintenance. Users will be notified as early as possible when we are planning to bring the systems down for any reason.
- Systems may also go down at any time to fix security issues, but every effort will be made to give the earliest possible notice.
- Maintenance may also be required without prior notice in the event of system crashes or any unstable behavior.
Users should always keep extra copies of any files that are critical to their research on systems that are outside of the HPCC to guard against possible loss in the event of unforeseen catastrophic failure.
This is good general practice and applies to all critical research files, regardless of where they are stored and whether or not they are located on HPCC resources.
If you have any questions, concerns, or problems, please contact us at firstname.lastname@example.org.