Below are some common error messages you may see when SSH’ing into our system or in your job logs.

If You Get the Message, “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED …”, when you SSH into the Cluster

If you SSH into the cluster and receive a warning message like this:

Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
Please contact your system administrator.
Add correct host key in /Users/joeuser/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /Users/joeuser/.ssh/known_hosts:1
ECDSA host key for has changed and you have requested strict checking.
Host key verification failed.

This is OK and to be expected. When we upgraded our interactive nodes to RHEL 8.3, the build process generated new SSH host keys that do not match the keys from the previous RHEL 7.5 build. You can resolve this in one of the following ways (on your local workstation or laptop):

  1. Edit ~/.ssh/known_hosts and delete the offending line, which in the above example is line 1 (notice the warning line that ends with "known_hosts:1")
  2. Use the "ssh-keygen" command to remove the offending key for you. For example: ssh-keygen -R "your server hostname or IP"
  3. Simply remove or rename the "known_hosts" file. In the above example, the file you would remove or rename is /Users/joeuser/.ssh/known_hosts
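As a concrete sketch of option 2: the demo below creates a scratch known_hosts file with one stale entry for a hypothetical host, cluster.example.edu (a placeholder, with a placeholder key), then removes it. In practice you would run only the ssh-keygen line, with your real hostname and without -f so it operates on ~/.ssh/known_hosts.

```shell
# Demo setup: a scratch known_hosts file with one stale entry for a
# hypothetical host (cluster.example.edu and the key are placeholders).
demo=$(mktemp)
echo 'cluster.example.edu ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOMqqnkVzrm0SdG6UOoqKLsabgH5C9okWi0dh2l9GKJl' > "$demo"

# Option 2 in action: -R removes every key for the named host and keeps a
# backup copy of the file as <file>.old. Omit "-f" to use ~/.ssh/known_hosts.
ssh-keygen -R cluster.example.edu -f "$demo"
```

The advantage of ssh-keygen -R over hand-editing is that it matches the host by name, so you do not need to count lines, and it leaves a backup in case you remove the wrong entry.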

If you get the Message, “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED …” in Your Job’s Output Log

You may occasionally see a similar message in your job’s output log. This happens when a compute node is rebuilt, which (just as with the interactive nodes) generates new SSH host keys for that node. The fix is the same, except that instead of editing the "known_hosts" file on your local workstation or laptop, you edit the "known_hosts" file in your $HOME directory on the cluster interactive node. Choose one of the three methods listed above, and the messages should disappear from your job’s output log. They may return whenever we rebuild or upgrade compute nodes on the cluster, which is inevitable, but now you know how to resolve the issue.
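For example, run on the interactive node and assuming the rebuilt compute node was "str-c12" (a placeholder name), option 2 would be "ssh-keygen -R str-c12 -f $HOME/.ssh/known_hosts". Option 1, deleting the reported line by number, is sketched below on a scratch file so the commands are safe to copy; on the cluster, the file would be $HOME/.ssh/known_hosts.

```shell
# Option 1 by line number, demonstrated on a scratch file. The warning
# reports the offending line -- e.g. "known_hosts:1" means line 1.
f=$(mktemp)
printf 'stale entry (line 1)\ngood entry (line 2)\n' > "$f"

# Delete line 1 in place (GNU sed, as on our RHEL nodes):
sed -i '1d' "$f"
```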

If you get the “Batch job submission failed” or “Unable to allocate resources” Error message

If you receive either of the above error messages, it most likely means you are submitting a job (with sbatch or srun) to a partition that you do not have permission to use. You can list the partitions you have access to with the "sinfo" command; for example:

$ sinfo -s
PARTITION  AVAIL  TIMELIMIT  NODES(A/I/O/T) NODELIST
Orion*       up 30-00:00:0        97/1/1/99 str-bm[1,5],str-c[1-36,49-69,128-167]
Andromeda    up 30-00:00:0         2/5/4/11 str-abm1,str-ac[1-10]
GPU          up 30-00:00:0         6/7/0/13 str-gpu[1-5,13-20]
Nebula       up 2-00:00:00         8/6/1/15 str-abm2,str-c[74-87]
DTN          up 30-00:00:0          0/1/0/1 dtn-s1
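(The NODES(A/I/O/T) column shows allocated/idle/other/total node counts.) Once "sinfo" shows which partitions your account can use, name one explicitly when submitting. A minimal sketch of a batch script, assuming access to the Orion partition shown above; the resource values are illustrative:

```shell
#!/bin/bash
#SBATCH --partition=Orion    # must be a partition your "sinfo" output lists
#SBATCH --ntasks=1
#SBATCH --time=00:10:00      # stay within the partition's TIMELIMIT

hostname
```

Submit it with "sbatch", or pass the partition directly on the command line, e.g. "srun --partition=Orion hostname". Requesting a partition that is missing from your "sinfo" output is what produces the errors above.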

ERROR: Unable to locate a modulefile

We update the applications on the cluster from time to time. We try to keep at most 3 versions of any given application available for use, and the oldest versions eventually get retired when new versions are released and made available on the cluster. If your submit script uses a version that has been retired, you will receive this error either at the command line or within your job log. For example, we retired ABAQUS 2020 quite some time ago, so if you try to load that environment module at the command line, you will get the following error:

$ module load abaqus/2020
ERROR: Unable to locate a modulefile for 'abaqus/2020'

# use the "module avail" command to see the available ABAQUS modules
$ module avail abaqus
-------------------------- /apps/usr/modules/apps ----------------------------------
abaqus/2021   abaqus/2022(default)