Texas Tech University

Job Submission Guide

Submitting jobs to HPCC resources is done using the qsub command. 

 

Table of Contents

1. Job Submission Script
a. Submission Script Layout
2. Quanah Job Submission Tutorial
3. Hrothgar Job Submission Tutorial
4. Monitoring Jobs
a. Monitoring a queued or running job - qstat
b. Monitoring a failed or completed job - qacct

 

Job Submission Script

Every job that runs requires a submission script that contains parameters for the job scheduler as well as the commands you wish to run on the cluster. The parameters you pass to the job scheduler will depend on the cluster you are submitting to.  Below is a generic submission script used to explain what the various parameters do, followed by example submission scripts for our Quanah and Hrothgar clusters.

Submission Script Layout

#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N <job name>
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q <queue name>
#$ -pe <parallel environment>
#$ -P <cluster name>
hostname #Or any command(s) you wish to run on the cluster.

For the above script, the lines starting with #$ are options for the job scheduler (UGE). Each option is described below:

  • -V instructs the scheduler to use the current environment settings in the batch job.
  • -cwd instructs the scheduler to use the current directory where the job is submitted as the job's working directory.
  • -S /bin/bash instructs the scheduler to use /bin/bash as the shell for the batch session.
  • -N <job name> sets the name for the job.  This can be referred to later in the script using the variable $JOB_NAME.
  • -o $JOB_NAME.o$JOB_ID indicates the name of the standard output file.  $JOB_ID is a unique number that distinguishes the job.
  • -e $JOB_NAME.e$JOB_ID indicates the name of the standard error file. $JOB_ID is a unique number that distinguishes the job.
  • -q <queue name> instructs the scheduler to use the queue defined by <queue name>.
  • -pe <parallel environment> instructs the scheduler to use the parallel environment defined by <parallel environment>. See our guide on "Parallel Environments (PE)" for more information about parallel environments.
  • -P <cluster name> instructs the scheduler to use the project/cluster defined by <cluster name>.

The last line, "hostname", is the actual command to run in your submission script.  Replace it with whatever command(s) you need to run for your project.
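As a concrete illustration, here is the same template filled in for a hypothetical single-core job. The queue, parallel environment, and project names below are only example values; see the cluster-specific tutorials that follow and our "Parallel Environments (PE)" guide for the values that are valid on your cluster.

#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N hostname_test
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q omni
#$ -pe sm 1
#$ -P quanah
hostname #Prints the name of the compute node the job ran on.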


 

Quanah Job Submission Tutorial

Submitting a job on Quanah can be done using the following steps:

Step 1. Log on to quanah.hpcc.ttu.edu using your eRaider account and password.

Step 2. Copy the tutorial job script folder for Quanah to your $HOME area.

#Copy the folder and all of its contents to your $HOME
cp -r /lustre/work/examples/quanah/mpi ~/mpi_tutorial

#Change the current directory to the mpi_tutorial we just copied.
cd ~/mpi_tutorial

Inside the ~/mpi_tutorial folder there should be a file named mpi.sh.  This is the submission script file for this particular tutorial.  To read this file we can use the command cat mpi.sh, which will print out the contents of the script.


quanah:/mpi_tutorial$ cat mpi.sh
#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N MPI_Test_Job
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q omni
#$ -pe mpi 36
#$ -P quanah
module load intel impi
mpirun --machinefile machinefile.$JOB_ID -np $NSLOTS ./mpi_hello_world

This script sets the name of the job (-N) to "MPI_Test_Job", requests the job be run on the "omni" queue (-q), requests the job be assigned 36 cores using the "mpi" parallel environment (-pe), and requests that the job be scheduled to the "quanah" cluster (-P).

Next the script runs the command "module load intel impi".  This executes before the MPI job is run, allowing the correct compiler and MPI implementation to be loaded prior to executing the mpirun command.  For more information on how modules work, please review our "Software Environment Setup" guide.
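If you would like to verify which modules are loaded in your job's environment, you can add commands such as the following to your script or run them interactively (the module names and versions shown by module avail will vary over time):

module list          #Print the modules currently loaded in this session.
module avail intel   #Search the list of available modules for "intel".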

Step 3. We will now submit a job to the Quanah cluster using the qsub command. The qsub command will request the job be scheduled and return the job ID for the job so you can monitor its progress.

qsub mpi.sh
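When the job is accepted, qsub prints a confirmation line containing the assigned job ID. Your job ID will differ from the one shown in this example:

Your job 51516 ("MPI_Test_Job") has been submitted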

Step 4. You can use the qstat command to view information about any jobs you have submitted that have not yet completed. For more information about this or other commands used to monitor your job, please see the Monitoring Jobs section below. Below is an image showing the output of running the commands listed in Step 3 and Step 4.

[Image: quanah job submission demo]

Step 5. Once execution has completed, you can read the results by viewing the output file for our MPI job.  You can expect the output file to be found in the ~/mpi_tutorial directory under the name MPI_Test_Job.o<jobID>.  In our example the file would be found at ~/mpi_tutorial/MPI_Test_Job.o51516.  While your results will likely vary, the output should look something like this:

#1 line referring to the UGE default spool for this process.
#36 lines (likely identical) stating the compute node that each process ran on.
Hello world from processor compute-20-4, rank 2 out of 36 processors
Hello world from processor compute-20-4, rank 4 out of 36 processors
Hello world from processor compute-20-4, rank 5 out of 36 processors
Hello world from processor compute-20-4, rank 11 out of 36 processors
Hello world from processor compute-20-4, rank 12 out of 36 processors
Hello world from processor compute-20-4, rank 17 out of 36 processors
Hello world from processor compute-20-4, rank 18 out of 36 processors
Hello world from processor compute-20-4, rank 19 out of 36 processors
Hello world from processor compute-20-4, rank 20 out of 36 processors
Hello world from processor compute-20-4, rank 25 out of 36 processors
Hello world from processor compute-20-4, rank 26 out of 36 processors
Hello world from processor compute-20-4, rank 28 out of 36 processors
Hello world from processor compute-20-4, rank 33 out of 36 processors
Hello world from processor compute-20-4, rank 34 out of 36 processors
Hello world from processor compute-20-4, rank 35 out of 36 processors
Hello world from processor compute-20-4, rank 0 out of 36 processors
Hello world from processor compute-20-4, rank 1 out of 36 processors
Hello world from processor compute-20-4, rank 3 out of 36 processors
Hello world from processor compute-20-4, rank 6 out of 36 processors
Hello world from processor compute-20-4, rank 7 out of 36 processors
Hello world from processor compute-20-4, rank 8 out of 36 processors
Hello world from processor compute-20-4, rank 9 out of 36 processors
Hello world from processor compute-20-4, rank 10 out of 36 processors
Hello world from processor compute-20-4, rank 13 out of 36 processors
Hello world from processor compute-20-4, rank 14 out of 36 processors
Hello world from processor compute-20-4, rank 15 out of 36 processors
Hello world from processor compute-20-4, rank 16 out of 36 processors
Hello world from processor compute-20-4, rank 21 out of 36 processors
Hello world from processor compute-20-4, rank 22 out of 36 processors
Hello world from processor compute-20-4, rank 23 out of 36 processors
Hello world from processor compute-20-4, rank 24 out of 36 processors
Hello world from processor compute-20-4, rank 27 out of 36 processors
Hello world from processor compute-20-4, rank 29 out of 36 processors
Hello world from processor compute-20-4, rank 30 out of 36 processors
Hello world from processor compute-20-4, rank 31 out of 36 processors
Hello world from processor compute-20-4, rank 32 out of 36 processors

Step 6. Edit the mpi.sh file and change the line "#$ -pe mpi 36" to instead say "#$ -pe mpi 72".  Now rerun steps 3, 4 and 5 and see what changed. Notice that the job now has to run on multiple compute nodes, so you will see messages from multiple machines on the cluster.
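You can make this change with any text editor, or from the command line. One way to do it from the command line (assuming GNU sed, as found on our Linux clusters) is:

#Replace the 36-core request with a 72-core request in mpi.sh
sed -i 's/-pe mpi 36/-pe mpi 72/' mpi.sh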

Step 7. Congratulations, you have now successfully set up and run an MPI job on Quanah!


 

Hrothgar Job Submission Tutorial

Submitting a job on Hrothgar can be done using the following steps:

Step 1. Log on to hrothgar.hpcc.ttu.edu using your eRaider account and password.

Step 2. Copy the tutorial job script folder for Hrothgar to your $HOME area.

#Copy the folder and all of its contents to your $HOME
cp -r /lustre/work/examples/hrothgar/mpi ~/mpi_tutorial
#Change the current directory to the mpi_tutorial we just copied.
cd ~/mpi_tutorial

Inside the ~/mpi_tutorial folder there should be a file named mpi.sh. This is the submission script file for this particular tutorial. To read this file we can use the command cat mpi.sh which will print out the contents of this script.

hrothgar:/mpi_tutorial$ cat mpi.sh
#!/bin/sh
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N MPI_Test_Job
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q west
#$ -pe west 12
#$ -P hrothgar
mpirun --machinefile machinefile.$JOB_ID -np $NSLOTS ./mpi_hello_world

This script sets the name of the job (-N) to "MPI_Test_Job", requests the job be run on the "west" queue (-q), requests the job be assigned 12 cores using the "west" parallel environment (-pe), and requests that the job be scheduled to the "hrothgar" cluster (-P).

Next the script runs the "mpirun" command which will actually execute the mpi_hello_world program.

Step 3. We will now submit a job to the Hrothgar cluster using the qsub command. The qsub command will request the job be scheduled and return the job ID for the job so you can monitor its progress.

qsub mpi.sh

Step 4. You can use the qstat command to view information about any jobs you have submitted that have not yet completed. For more information about this or other commands used to monitor your job, please see the Monitoring Jobs section below. Below is an image showing the output of running the commands listed in Step 3 and Step 4.

[Image: hrothgar job submission demo]

Step 5. Once execution has completed, you can read the results by viewing the output file for our MPI job. You can expect the output file to be found in the ~/mpi_tutorial directory under the name MPI_Test_Job.o<jobID>. In our example the file would be found at ~/mpi_tutorial/MPI_Test_Job.o461607. While your results will likely vary, the output should look something like this:

#1 line referring to the UGE default spool for this process.
#12 lines (likely identical) stating the compute node that each process ran on.
Hello world from processor compute-5-25.local, rank 3 out of 12 processors
Hello world from processor compute-5-25.local, rank 8 out of 12 processors
Hello world from processor compute-5-25.local, rank 10 out of 12 processors
Hello world from processor compute-5-25.local, rank 11 out of 12 processors
Hello world from processor compute-5-25.local, rank 0 out of 12 processors
Hello world from processor compute-5-25.local, rank 1 out of 12 processors
Hello world from processor compute-5-25.local, rank 2 out of 12 processors
Hello world from processor compute-5-25.local, rank 4 out of 12 processors
Hello world from processor compute-5-25.local, rank 5 out of 12 processors
Hello world from processor compute-5-25.local, rank 6 out of 12 processors
Hello world from processor compute-5-25.local, rank 7 out of 12 processors
Hello world from processor compute-5-25.local, rank 9 out of 12 processors

Step 6. Edit the mpi.sh file and change the line "#$ -pe west 12" to instead say "#$ -pe west 24". Now rerun steps 3, 4 and 5 and see what changed. Notice that the job now has to run on multiple compute nodes, so you will see messages from multiple machines on the cluster.

Step 7. Congratulations, you have now successfully set up and run an MPI job on Hrothgar!


 

 

Monitoring Jobs

Once you have submitted a job you may find that you wish to monitor how it is progressing.  The qstat command is used to view information about jobs that are currently queued up to run or that are currently running on a compute node.  The qacct command is used to view information about jobs that have finished, either by completing successfully or by terminating due to some failure.

Monitoring a queued or running job - qstat

The qstat command is used to view information about any jobs you have submitted that have not yet completed execution. This command can be run either with or without parameters, depending on whether you wish to see detailed information about a particular job or a summary regarding all of your currently running jobs.
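In short, the two forms you will use are:

qstat              #Summary of all of your queued and running jobs.
qstat -j <job-ID>  #Detailed information about one specific job.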

 

Viewing a summary of your currently running jobs

In order to view a summary of your currently running jobs, run the qstat command without any parameters. If you have any running jobs, qstat will print out a table similar to the one in the picture below.

Do not run qstat within a watch command! 

While qstat may not seem like a resource-intensive command, running it repeatedly (which is exactly what watch does) slows HPCC resources for all of our users.

[Image: quanah job submission demo]

If you have no running jobs, then qstat will appear to do nothing.  Otherwise you should see a table that contains 10 columns, detailed below (a hypothetical example row follows the list):

  1. job-ID - The job ID for the job. This is the number you will need to know in order to get detailed information about that job.
  2. prior - A decimal number stating that job's current "priority" in the system.
  3. name - The name of the job as set by the -N option when the job was started.
  4. user - The username of the person who submitted the job.
  5. state - The current state of the job, usually denoted as one of the following:
      • qw - Queued and Waiting.  The job is currently pending.
      • r - Running. The job is currently executing.
      • t - Transferring. The job is currently in transfer (typically from qw to r).
      • h - On Hold. The job is currently on hold.
      • Eqw - Error.  The job is in an error state but can be fixed and restarted.
      • d - Deleted. The job is marked for deletion but has not yet been deleted.
  6. submit/start at - The submission or start time and date of the job.  All times are local to Lubbock, TX.
  7. queue - The queue in which the job is running (if it is running).
  8. jclass - The designated job class for this job. You can effectively ignore this column.
  9. slots - The number of slots the job is requesting/using.
  10. ja-task-ID - The job array task ID.  This will be blank for all non-array jobs.
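As a purely hypothetical illustration, a summary row for the tutorial job above might look like the following (the username, date, and node name are made up, and the exact spacing depends on your terminal width):

job-ID   prior     name          user      state  submit/start at      queue              jclass  slots  ja-task-ID
 51516   0.52000   MPI_Test_Job  eraider1  r      01/15/2019 10:15:30  omni@compute-20-4             36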

 

Viewing detailed information regarding a currently running job

Viewing detailed information regarding a running job can be done using either the HPCC-provided "jobstat" program or the qstat command provided by the scheduler. To view a detailed summary of a running job, you can use jobstat as follows:

jobstat -j <job-ID>

In order to view a complete dump of all information regarding one of your currently running jobs you will need to run the qstat command as follows:

qstat -j <job-ID>

Because this command can produce an enormous amount of output depending on your job and environment, we strongly suggest you pipe the output into either more or less, which will make it easier to view the data.  To do this you can use one of the following two commands:

qstat -j <job-ID> | more
qstat -j <job-ID> | less

Do not run qstat or jobstat within a watch command!

While these may not seem like resource-intensive commands, running qstat or jobstat repeatedly (which is exactly what watch does) slows HPCC resources for all of our users.

 

Monitoring a failed or completed job - qacct

Once a job has completed, you will no longer be able to view any information regarding the job using the qstat command.  If you wish to view a summary of information regarding a job that has failed or completed, then you will instead need to use the qacct command.  Using the command "qacct -j <job-ID>" you can see useful information regarding the job.

If you wish to view some sample qacct data, feel free to view the qacct information for the jobs we ran in the tutorials above, or substitute any of your previously completed job IDs.

Quanah example: qacct -j 51516

Hrothgar example: qacct -j 461607

When debugging a failed job or attempting to maximize the efficiency of your slot requests, you should pay particular attention to the following fields (a hypothetical output excerpt follows the list):

  1. qname - Name of the queue the job was scheduled in.
  2. hostname - Name of the (master) compute node the job executed on.
  3. taskid - The task ID (for job arrays).
  4. cwd - The "current working directory" that your script was running from.
  5. granted_pe - The parallel environment that was used.  For more information on parallel environments see our guide "Parallel Environments (PE)".
  6. slots - The number of slots that were requested.  The number of allowed slots depends on the parallel environment that was used. For more information on parallel environments see our guide "Parallel Environments (PE)".
  7. failed - Whether the job failed to complete (1 means the job did fail, 0 means it did not).
  8. exit_status - The job script's exit status (0 means the job completed without issue).
  9. ru_wallclock - The wall clock time that elapsed (in seconds).
  10. ru_utime - The consumed user CPU time as reported by the OS (in seconds).
  11. ru_stime - The consumed system time as reported by the OS (in seconds).  This is usually related to I/O or other system wait time.
  12. cpu - The CPU time as reported by the scheduler (in seconds).  Note that ru_utime + ru_stime might not add up exactly to the cpu value.
  13. mem - The memory usage integrated over CPU time, in GB-seconds.  Dividing mem by cpu gives the mean memory usage in GB.
  14. io - A measure of the I/O operations executed by the job.
  15. maxvmem - The maximum amount of memory used by the job at any time during its execution.
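As a hypothetical illustration, an excerpt of the qacct output for a job like the Quanah tutorial job above might look like the following (all values here are made up for this example):

qname        omni
hostname     compute-20-4.localdomain
granted_pe   mpi
slots        36
failed       0
exit_status  0
ru_wallclock 3
cpu          95.2
maxvmem      1.1G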

 

 

High Performance Computing Center