PBS memory enforcement and monitoring on the SGI Altix

  1. Introduction
  2. Checking memory use
  3. PBS job memory enforcement

1. Introduction

Since Monday, September 25, 2006, PBS jobs on the NCSA SGI Altix must specify a memory request, and that request is enforced. Jobs that exceed their requested memory are typically terminated. See section 3, PBS job memory enforcement, for details.

2. Checking memory use

  • The cobalt cluster monitor web page, co-monitor.ncsa.uiuc.edu, shows job details, including memory use, for running jobs. Select the Jobs link on the left, click your RUNNING job in the list, and look at the vmem field in the resources_used section of the page. The sample display below was captured while a job was running; note the resources_used section near the top of the job info display:

    [Image: clumon job detail display showing memory use]

  • For jobs that have already run, check memory use with the qhist command. The following command lists your jobs from the last 2 days along with their memory use:
    % qhist -g 2 -u $USER -f jobid,jobname,usedmem
    
    Scanning PBS raw accounting records: 09/11/2006 - 08/26/2007
    
    
      JobId  JobName        UsedMem
    -------------------------------
       5230  test             6.86M
       5231  test             9.58M
       5340  test            14.31M
       5401  malloctest       9.23G
    -------------------------------
    Total # jobs = 4
    Total # SUs  = 0.02
    
  • Another way to check processes currently running in batch is the qps command. In this example, the application is named malloc8g:
    % qps
      PID  PPID      COMMAND        HOST     RSS    SIZE  S CPU user time system tm
    _____ _____ ____________ ___________ _______ _______  _ ___ _________ _________
    13640 13587         tcsh   co-login1    3.4M   38.4M  S  27  00:00:00  00:00:00
    16361  8318         tcsh co-compute2    4.2M    7.7M  S 168  00:00:00  00:00:01
    16522 16361     [5548]*1 co-compute2    3.3M    6.2M  S 168  00:00:00  00:00:00
    16523 16522     malloc8g co-compute2    4.0G    4.0G  R 168  00:00:11  00:00:02
     5342  5340         sshd     co-viz8   10.0M   19.2M  S   0  00:00:00  00:00:00
     5343  5342         tcsh     co-viz8    4.0M    6.4M  S   0  00:00:00  00:00:00
     8804 13124        watch     co-viz8    2.2M    3.7M  S   1  00:00:00  00:00:00
     8807  5343          qps     co-viz8   30.3M   34.1M  S   0  00:00:02  00:00:00
     8917  8807       pminfo     co-viz8    1.7M    3.7M  S   4  00:00:00  00:00:00
    12715 12700         sshd     co-viz8   10.0M   19.2M  S   1  00:00:00  00:00:00
    13124 12715         tcsh     co-viz8    4.1M    6.4M  S   1  00:00:00  00:00:00
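
The UsedMem column in the qhist table above uses M and G suffixes, while PBS memory requests are written with kb or gb. Putting both on one scale makes it easier to spot jobs that are close to or past a limit. The following is a minimal sketch, not an NCSA-provided tool: the function names to_kb and over_limit are hypothetical, the three-column layout is assumed to match the table shown above, and the suffixes are assumed to be binary multiples (1G = 1024M = 1048576kb).

```shell
# to_kb SIZE: print SIZE (e.g. 9.23G, 14.31M, 2gb, 2097152kb) in whole kb.
to_kb() {
    echo "$1" | awk '{
        s = tolower($1)
        n = s
        sub(/[a-z]+$/, "", n)             # strip the unit, keep the number
        u = substr(s, length(n) + 1, 1)   # first letter of the unit, if any
        m = 1                             # assumption: no unit or k means kb
        if (u == "m") m = 1024
        if (u == "g") m = 1024 * 1024
        printf "%.0f\n", n * m
    }'
}

# over_limit LIMIT: read a qhist table on stdin and print the JobId, JobName
# and UsedMem of every job whose UsedMem exceeds LIMIT.
over_limit() {
    limit_kb=$(to_kb "$1")
    while read -r jobid jobname usedmem rest; do
        case $jobid in
            ''|*[!0-9]*) continue ;;      # skip headers, rules and totals
        esac
        [ -n "$usedmem" ] || continue
        if [ "$(to_kb "$usedmem")" -gt "$limit_kb" ]; then
            echo "$jobid $jobname $usedmem"
        fi
    done
}
```

A possible use, in an sh-compatible shell, is `qhist -g 2 -u $USER -f jobid,jobname,usedmem | over_limit 2gb`. Against the sample table above, that would list only job 5401 (malloctest).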
    
    

3. PBS job memory enforcement

Jobs are limited to the memory associated with the processors assigned to the job. You can expect your jobs to be terminated if they exceed their requested memory. Some jobs that use small CPU counts and only a couple of gigabytes of memory may occasionally "get away" with using more memory, running to completion one time and getting killed the next. If this happens with your job, request more memory as directed by the email you receive from the job kill daemon.

Here is an example of a job that went over its memory request, showing the memory requested and used, the job's standard error file, and the email sent to the user.
% qsub -lncpus=1,mem=2gb,walltime=00:15:00 -N malloctest myscript.pbs

% qhist 5401

Scanning PBS raw accounting records: 09/08/2005 - 08/26/2007

Compute Host:       co-compute2:ssinodes=1:ncpus=2:mem=12065792kb
JobId:              5401
JobName:            malloctest
User:               arnoldg
Project:            0x8e8dca1e0000025b
Queue:              standard

Job limits:
  wall clock:       00:15:00    
  Requested CPUs:   1        
  Available CPUs:   2        
  Requested Memory: 2097152kb 

Queued:             09/14/06 09:02
Started:            09/14/06 09:03
Ended:              09/14/06 09:07

Usage:
  wall clock:       00:03:20    
     cputime:       00:01:03    
         SUs:       0.02        
      memory:         2.53G     

[arnoldg@co-login1 ~/c]$ cat *.e5401
set_SCR: using existing PBS job directory /scratch/batch/5401
JOB_OVER_MEMORY
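
In the report above, Requested Memory is printed in kb (2097152kb, that is 2gb) while the Usage memory figure is printed with a G suffix (2.53G). Converting both to kb makes the overrun explicit: roughly 2.53 × 1048576 ≈ 2652897 kb used against 2097152 kb requested. The sketch below automates that comparison. It is only an illustration: check_mem is a hypothetical name, the field labels are assumed to match the report shown above, and G is assumed to mean 1048576 kb. Because qhist's memory figure comes from polling, treat the result as a hint rather than proof.

```shell
# check_mem: read a "qhist <jobid>" report on stdin and compare the
# "Requested Memory:" line with the "memory:" line of the Usage section.
check_mem() {
    awk '
    # Convert 2097152kb, 2gb, 2.53G and similar to kb (binary multiples).
    function kb(s,    n, u, m) {
        s = tolower(s)
        n = s
        sub(/[a-z]+$/, "", n)             # numeric part
        u = substr(s, length(n) + 1, 1)   # first letter of the unit
        m = 1
        if (u == "m") m = 1024
        if (u == "g") m = 1024 * 1024
        return n * m
    }
    $1 == "Requested" && $2 == "Memory:" { req = kb($3) }
    $1 == "memory:"                      { used = kb($2) }
    END {
        if (req == "" || used == "") {
            print "memory fields not found"
            exit 2
        }
        printf "requested %.0f kb, used %.0f kb: %s\n", req, used,
               (used > req ? "over request" : "within request")
    }'
}
```

For example, `qhist 5401 | check_mem` run in an sh-compatible shell would report the job above as over its request.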

The following email notification was sent to the user to explain what happened to the job:

Date: Thu, 14 Sep 2006 09:05:23 -0500
To: arnoldg@ncsa.uiuc.edu
Subject: Job 5401.co-master Killed

      arnoldg :
      Your job	 5401.co-master
      on host  co-compute2
      was killed because it attempted to use more memory than exists
      within the requested fraction of the machine.  Please resubmit
      the job, requesting more resources. Modify your batch script 
      to request more memory.  For example, to request 10 Gbytes of
      memory for the job use the following line in your batch script:

	    #PBS -l mem=10gb

      to request 10 Gbytes of memory (total) for your batch job.

      If you used qg03 to submit this job (Gaussian), you can increase
      the memory limit by increasing the value of '%mem= ' keyword in
      your input file.

Note: Because of the way memory use is polled while batch jobs run, the memory usage reported by qhist may not reflect actual usage, and a standard error message about memory overuse may not be available. Even so, both are worth checking, alongside the email notification, when trying to confirm a memory overuse problem.
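
Following the advice in the kill notification, the larger memory request can be placed in the batch script itself instead of on the qsub command line. The script below is a sketch under stated assumptions: the resource values simply combine the qsub options and the mem=10gb line quoted earlier and are not a recommendation for any particular job, and APP with its default ./malloctest is a hypothetical placeholder for your own executable.

```shell
#!/bin/sh
# myscript.pbs: sketch of a batch script that carries its own resource request.
#PBS -N malloctest
#PBS -l ncpus=1,mem=10gb,walltime=00:15:00

# PBS sets PBS_O_WORKDIR to the directory qsub was run from; the fallback
# to "." only matters when trying the script outside PBS.
cd "${PBS_O_WORKDIR:-.}"

# APP is a hypothetical placeholder for your own executable.
APP=${APP:-./malloctest}
if [ -x "$APP" ]; then
    "$APP"
else
    echo "cannot run $APP: not found or not executable" >&2
fi
```

With the request in the script, submission reduces to `qsub myscript.pbs`. Options given on the qsub command line normally take precedence over the #PBS lines, so a one-off change does not require editing the script.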