Items on this page:
- Unscheduled Emergency Maintenance
- Planned Maintenance Shutdowns
- New Scratch Storage
- Dealing With Updated HPCC Host Key Messages
- Maintenance downtime policy for HPCC systems
Unscheduled Emergency Maintenance
Tuesday, June 25, 2019: Please be aware that we have received notice of a chilled water issue on the TTU main campus which has necessitated shutdown of chilled water distribution systems campus-wide. We have been advised that there is a high likelihood that HVAC systems will be affected and building interior temperatures will increase. The chilled water supporting cooling for the HPCC clusters is also affected.
At this time we have ramped down power usage on the primary clusters in the Experimental Sciences Building. Our plan is to keep temperatures stable by holding job queue submissions and, if necessary, terminating user jobs, starting with the Hrothgar cluster and proceeding to the Quanah cluster only if needed.
Unfortunately, we have no information yet on when the issue is expected to be resolved. We will send an update as soon as we know more.
We appreciate your patience during this time. For more details, please refer to our web site: http://www.depts.ttu.edu/hpcc/operations/maintenance.php and as always, please feel free to contact us at email@example.com if you require assistance or have any further questions.
Planned Maintenance Shutdowns
There are no planned maintenance shutdowns for the HPCC at this time.
New Scratch Storage
During the most recent maintenance window, the HPCC provisioned a new Lustre storage system that replaces the former /home, /lustre/work, and /lustre/scratch file systems. It gives users faster access to their data and a larger home quota (increased from 150 GiB to 300 GiB). All data in the /home and /lustre/work directories has been migrated to the new storage automatically, with no further action required from you. However, data stored on the old /lustre/scratch has not been migrated; users must migrate any data they need on their own. The old /lustre/scratch storage can be reached at the path /lustre/old-scratch for a short time before we retire the system altogether.
The HPCC has written a migration tool to aid users who wish to migrate some of their data from the old scratch to the new scratch storage space. To use this tool, follow these steps:
- Create a new file named "scratch_migration.dat" in your home folder.
- Example: /home/errees/scratch_migration.dat
- List every file or folder you wish to copy in this file, with one file or directory per line.
- Run the command: /lustre/work/examples/quanah/generateMigrationScript.sh
- This script will use the scratch_migration.dat file to generate an array job for the cluster. Once completed it will print out the qsub command you should run in order to begin the transfer of data.
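As a rough illustration of the steps above, the list file could be created from the command line as sketched below. The two data paths are hypothetical placeholders, and the sketch assumes that entries are given as full paths under /lustre/old-scratch in a directory named after your username; neither detail is confirmed by this page, so adjust them to match where your data actually lives.

```shell
# Sketch only: the paths below are hypothetical examples, not real data.
# Assumption: each entry is a full path under /lustre/old-scratch, one per line.
user_name="${USER:-$(whoami)}"
list_file="$HOME/scratch_migration.dat"

# Write one file or directory per line (this overwrites any existing list).
cat > "$list_file" <<EOF
/lustre/old-scratch/$user_name/project1
/lustre/old-scratch/$user_name/results/run42.tar.gz
EOF

# Review the list before generating the array job.
cat "$list_file"

# Next, on the cluster, run the generator named in the steps above:
#   /lustre/work/examples/quanah/generateMigrationScript.sh
# and then submit the qsub command that it prints.
```

The generator call is left as a comment so that the list can be checked first; the qsub command to run is whatever the script prints, so it is not reproduced here.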
Updated HPCC Host Key Messages After An Upgrade
Note: After returning to operation following a system upgrade, you may notice a message when you connect that states "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!". This occurs when the login nodes for the clusters go through a major upgrade that causes them to adopt new host keys. If you receive this message you will need to remove the old host key from your system using one of the following methods:
- On Linux or Mac systems this is done by editing your .ssh/known_hosts file to remove the old entry for the system you are trying to reach. This can be done using an editor or with a command like the following (replacing "quanah.hpcc.ttu.edu" with the remote system you are having trouble reaching due to its old key):
ssh-keygen -f "$HOME/.ssh/known_hosts" -R quanah.hpcc.ttu.edu
- On Windows systems this is done using the features of your ssh terminal program. A failed connection due to a changed host key is often followed by a prompt that allows you to change to the new host key. Select the option in the popup window that corresponds to changing keys.
Maintenance downtime policy for HPCC systems:
- Special periods may be reserved for performing routine maintenance. Users will be notified as early as possible when we are planning to bring the systems down for any reason.
- Systems may also go down at any time to fix security issues, but every effort will be made to give the earliest possible notice.
- Maintenance may also be required without prior notice in the event of system crashes or any unstable behavior.
Users should always keep extra copies of any files that are critical to their research on systems that are outside of the HPCC to guard against possible loss in the event of unforeseen catastrophic failure.
This policy is just good general practice and applies to all critical research files, regardless of where they are stored or whether or not they are located on HPCC resources.
If you have any questions, concerns, or problems, please contact us at firstname.lastname@example.org.