Programming Environment on NCSA Intel 64 Tesla Cluster

NCSA's Help Desk is available 24 hours a day, seven days a week, 365 days a year:
help.ncsa.illinois.edu
217-244-0710
help@ncsa.illinois.edu

 
  1. Traditional Compilers
  2. Tesla Compilers
    1. NVIDIA CUDA
    2. Portland Group Compiler
    3. Additional Information

1. Traditional Compilers

All compilers and libraries available on abe are also available on the lincoln nodes; please refer to the abe documentation for details. Note that availability on lincoln does not imply GPU capability; for GPU-aware compilers, see Section 2 below.

2. Tesla Compilers

2.1 NVIDIA CUDA

  • CUDA environment

    Add the appropriate softenv keys to your $HOME/.soft file, for example:

        +cuda-3.2
        +nvidia-sdk-cuda-3.2
    
    (For concreteness here, and below, we are using version 3.2; other versions can be listed with softenv.)
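
    The keys can be added with any text editor. For those who prefer to script the step, the following is a minimal sketch for sh-compatible shells. The function name add_soft_key is hypothetical (it is not part of softenv), and placing new keys at the top of the file is a choice made here, not a softenv requirement.

```shell
# Hypothetical helper (not part of softenv): add a key to a .soft file
# unless an identical line is already present. New keys go at the top.
add_soft_key() {
    soft_file=$1
    key=$2
    touch "$soft_file"
    # -x: match the whole line; -F: treat the key as a fixed string
    if grep -qxF -e "$key" -- "$soft_file"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Write the new key first, then the existing contents, and copy back.
    { printf '%s\n' "$key"; cat "$soft_file"; } > "$tmp" && cat "$tmp" > "$soft_file"
    rm -f "$tmp"
}
```

    For example, calling add_soft_key "$HOME/.soft" +nvidia-sdk-cuda-3.2 and then add_soft_key "$HOME/.soft" +cuda-3.2 leaves the two keys at the top of the file in the order shown above. The change takes effect at the next login (or after running softenv's resoft command, where available).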

    The NVIDIA compiler is nvcc, in the path defined by the CUDA softenv key.

    Notes:

    • You can compile on the head nodes, but access to nodes with the Tesla devices is available only via PBS batch jobs.
    • The standard NVIDIA CUDA distribution does not support Fortran; however, PGI and NVIDIA have collaborated on CUDA Fortran (see Section 2.2 below).

  • CUDA SDK

    Example code, documentation and several utilities can be found under /usr/local/NVIDIA_GPU_Computing_SDK-3.2.

    When porting code to CUDA, the examples in the SDK can be quite useful, both as illustration and as templates for certain algorithms (marching cubes, Monte Carlo). In addition, there are further examples and tutorials available on the NVIDIA site at the link below.

    To use the examples, copy the NVIDIA_GPU_Computing_SDK-3.2 directory to your home directory, naming it, say, "nvidia_examples":

        cd $HOME
        cp -r /usr/local/NVIDIA_GPU_Computing_SDK-3.2 nvidia_examples
    

    Some examples require libraries that reside only on the compute nodes; these must be built within an interactive batch job, which can be started as follows:

       qsub -I -V -q lincoln -lwalltime=00:30:00,nodes=1:ppn=8
    

    One can then build the examples as follows:

        setenv CUDA_INSTALL_PATH /usr/local/cuda-3.2
        cd $HOME/nvidia_examples/C
        make
    
    Note: To build the simpleMPI example, also add the softenv key +openmpi-1.3.2-intel to your .soft file so that Open MPI is available.
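
    The setenv line above is csh/tcsh syntax. For users whose shell is sh or bash, a minimal equivalent is sketched below. The check for bin/nvcc assumes the usual CUDA Toolkit layout under the install path; it is only a convenience and not part of the SDK build.

```shell
# sh/bash equivalent of: setenv CUDA_INSTALL_PATH /usr/local/cuda-3.2
CUDA_INSTALL_PATH=/usr/local/cuda-3.2
export CUDA_INSTALL_PATH

# Optional sanity check (assumes nvcc lives in bin/ under the install path).
if [ -x "$CUDA_INSTALL_PATH/bin/nvcc" ]; then
    echo "nvcc found under $CUDA_INSTALL_PATH"
else
    echo "nvcc not found under $CUDA_INSTALL_PATH" >&2
fi
```

    The cd and make steps are then the same as in the csh version.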

    The source code for the examples is in $HOME/nvidia_examples/C/src; the built executables will be in $HOME/nvidia_examples/C/bin/linux/release.

    To run the examples, one can start an interactive batch job as mentioned above, or call them from a job script with the lincoln queue specified:

        #PBS -q lincoln
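
    As an illustration of the job-script option, the following sketch writes a small script that runs deviceQuery from the release directory named above. The file name deviceQuery.pbs and the job name are arbitrary choices, and the resource request simply reuses the values from the interactive example; adjust both as needed.

```shell
# Write a hypothetical job script, deviceQuery.pbs, to the current directory.
cat > deviceQuery.pbs <<'EOF'
#!/bin/sh
#PBS -q lincoln
#PBS -l walltime=00:30:00,nodes=1:ppn=8
#PBS -N deviceQuery
#PBS -j oe

# Run the SDK utility from the directory holding the built executables.
cd "$HOME/nvidia_examples/C/bin/linux/release"
./deviceQuery
EOF
```

    Submit the script with "qsub deviceQuery.pbs". The "-j oe" directive merges standard error into standard output, so the deviceQuery report and any error messages end up in a single file.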
    

    The utility deviceQuery (applicable on the Tesla nodes only) can be used to examine the characteristics of the Tesla devices; sample output below:

    There are 2 devices supporting CUDA
    
    Device 0: "Tesla C1060"
      Major revision number:                         1
      Minor revision number:                         3
      Total amount of global memory:                 4294705152 bytes
      Number of multiprocessors:                     30 
      Number of cores:                               240        
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       16384 bytes
      Total number of registers available per block: 16384
      Warp size:                                     32 
      Maximum number of threads per block:           512 
      Maximum sizes of each dimension of a block:    512 x 512 x 64
      Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
      Maximum memory pitch:                          262144 bytes
      Texture alignment:                             256 bytes
      Clock rate:                                    1.44 GHz
      Concurrent copy and execution:                 Yes
    
    Device 1: "Tesla C1060"
      Major revision number:                         1
      Minor revision number:                         3
      Total amount of global memory:                 4294705152 bytes
      Number of multiprocessors:                     30
      Number of cores:                               240
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       16384 bytes
      Total number of registers available per block: 16384
      Warp size:                                     32
      Maximum number of threads per block:           512 
      Maximum sizes of each dimension of a block:    512 x 512 x 64
      Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
      Maximum memory pitch:                          262144 bytes
      Texture alignment:                             256 bytes
      Clock rate:                                    1.44 GHz
      Concurrent copy and execution:                 Yes
    

  • CUDA Visual Profiler

    The CUDA Toolkit contains a visual profiler that can be invoked from within a batch session to view performance statistics; a list of counters can be found in the (rather terse) README.

    To invoke the profiler, enable X forwarding by first logging into abe with "ssh -Y abe" (trusted X forwarding) and then launching a batch job with the "-X" option; see the man pages for ssh and qsub, respectively.

    A CUDA softenv key, say,

        +cuda-3.2
    
    will add "cudaprof" to the search path; the executable to be profiled should be launched from the cudaprof window.

  • CUDA HOME

2.2 Portland Group Compiler

The Portland Group compilers currently support the NVIDIA Tesla in two ways: the first uses an implicit programming model, the second an explicit one. These are separate efforts with differing objectives.

  • Accelerator model: Starting with version 9.0, the PGI compiler has introduced accelerator directives that may be added to existing code, without restructuring. This model and syntax are reminiscent of OpenMP.

    This model is currently available on x64 with NVIDIA GPUs supporting CUDA and, as a general approach to acceleration, is planned for ATI, Cell, and Larrabee.

  • CUDA Fortran: Beginning with version 9.0.4, there is a beta version of CUDA Fortran, analogous to NVIDIA's CUDA C; this is available solely for x64+NVIDIA GPUs supporting CUDA.

Given the pace of releases, it is recommended that one add the softenv key naming an explicit compiler version; the details below refer to version 10.9 but apply to other versions, mutatis mutandis.

To invoke the PGI compilers, add the softenv key, say,

    @pgi-10.9

to the top of your .soft file.

Examples of use may be found in

    /usr/local/pgi/linux86-64/10.9/etc/samples

As of version 9.0.3, a requested feature was added to the accelerator model: compiling with the flag

    -ta=nvidia,keepgpu

will output the generated GPU kernel code.

As the makefile currently compiles only the accelerator-model examples, one should use the following to compile the CUDA Fortran examples:

   pgfortran -o cufinfo cufinfo.cuf
   
   pgfortran -o sgemm sgemm.cuf

In all cases, the resulting executables should be run on a compute node of Lincoln, accessed through the batch system.

2.3 Additional Information