Overview
The PAPI (Performance Application Programming Interface) library
from the
Innovative Computing Laboratory
at the University of Tennessee-Knoxville is available on the
Dell NVIDIA cluster (Forge) and SGI Altix UV cluster (Ember) at NCSA.
PAPI is an effort
to establish a uniform, standard programming interface for accessing
hardware performance counters on modern microprocessors. The PAPI web
site is located at:
http://icl.cs.utk.edu/papi/
Hardware performance counters can be very useful for tuning the
performance of applications and for evaluating the effectiveness of
the compiler on your application. These counters allow you to
directly measure the actual usage of the hardware as your application runs
and may help you to diagnose bottlenecks in your application's
performance. By using PAPI, you gain the benefit of a cross-platform
interface to the counters, allowing you to maintain a common source
for a wide variety of architectures.
This page is an overview of information of interest to, or specific to,
users of PAPI at NCSA. A number of documents that cover PAPI in more
detail are available at the PAPI web site, including The PAPI Reference,
tailored for the end-user or person new to PAPI. The repository also
includes a link to an in-depth tutorial by members of the PAPI development
team, entitled Performance Tuning Using Hardware Counter Data.
PAPI provides both a simple, high-level interface that may be suitable
for your needs and also a low-level interface that gives you much more
control over PAPI, including access to native hardware events that are
not part of the PAPI standard event definitions. Neither the low-level
interface nor access to native
events through PAPI is covered here; please refer to the PAPI web site
and processor-specific documentation for details.
| System | Kernel PMU Support | PAPI version | Directory |
| Forge | perf_events | 4.4.0 | /usr/apps/tools/papi/4.4.0-forge |
| Ember | perf_events | 4.4.0 | /usr/apps/tools/papi/4.4.0-login (for use on the login node); /usr/apps/tools/papi/4.4.0-cmp (for use on the compute node) |
The PAPI directory contains the compiled libraries, include files, UNIX
manual pages, and example programs from the PAPI distribution. You'll need
to ensure that this directory is on the search path for both include files
and libraries during the compile and link steps (see below).
PAPI include files for Fortran
There are three different Fortran include
files that you can choose from when compiling your PAPI-enabled Fortran
program:
- fpapi.h
  This include file requires C-style preprocessing. Several compilers
  treat a Fortran source file with the suffix ".F" (uppercase F) as a
  file that should be passed through the C preprocessor. Consult the
  documentation for the compiler you are using for specifics.
- f77papi.h
  This is a Fortran 77-style include file. It requires no C
  preprocessing, so you may find it more convenient to use.
- f90papi.h
  This is a Fortran 90-style include file. Like f77papi.h, it requires
  no C preprocessing, so you may find it more convenient to use.
PAPI libraries
If you link with the shared (.so) version of the library, you will have
to specify where the PAPI shared library can be found at runtime. For
example, on the Linux clusters you can add
/usr/apps/tools/papi/<version>/lib to the LD_LIBRARY_PATH environment
variable.
If you link the static version of the PAPI library into your program,
your executable should run without having to modify the
LD_LIBRARY_PATH environment variable. You can cause the static
version to be used by specifying -static at link time,
or by including /usr/apps/tools/papi/<version>/lib/libpapi.a
on your link command.
For all compilers, specify
-I/usr/apps/tools/papi/<version>/include
at compile time, and specify
-L/usr/apps/tools/papi/<version>/lib -lpapi
at the link step.
Using PAPI_flops
Perhaps the easiest way to use the PAPI high-level functions (which
may be sufficient for many users) is to call the routine
PAPI_flops (or in Fortran,
PAPIF_flops).
This routine, which may be called multiple times from a single-threaded
program, is an easy way to measure wall-clock time, CPU time, the number
of floating point instructions executed, and the MFLOP rate.
Here's an example of using PAPI_flops from Fortran:
      include 'f77papi.h'
      real real_time, cpu_time, mflops
      integer*8 fp_ins
      integer ierr

C     Call PAPIF_flops to get things started. This will initialize PAPI
C     and start the counters running. Each of these calls returns an
C     error code in the 'ierr' parameter. See below for details on
C     how to manage this.
      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)

C     Do some computation
      call compute()

C     Read the values in the counters and print them out. Any call to
C     PAPIF_flops with fp_ins set to the value -1 will reinitialize
C     all counters to zero. You might want to do this in order
C     to individually time different portions of your application.
      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)
      write (*,100) real_time, cpu_time, fp_ins, mflops
 100  format('           Real time (secs) :', f15.3,
     +      /'            CPU time (secs) :', f15.3,
     +      /'Floating point instructions :', i15,
     +      /'                     MFLOPS :', f15.3)
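The MFLOP rate reported alongside the instruction count is, in essence, the number of floating point instructions divided by the elapsed time. The sketch below shows only that arithmetic in C, which can be handy if you want a rate over an interval you timed yourself. It is not PAPI's implementation and does not call PAPI; the helper names mflops_from_counts and print_mflops are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper, not part of PAPI: convert a floating point
 * instruction count and an elapsed time in seconds into MFLOPS
 * (millions of floating point instructions per second).
 * Returns 0.0 when the interval is not positive, to avoid dividing
 * by zero. */
double mflops_from_counts(long long fp_ins, double seconds)
{
    if (seconds <= 0.0) {
        return 0.0;
    }
    return (double)fp_ins / seconds / 1.0e6;
}

/* Print a one-line summary of the count, the interval, and the rate. */
void print_mflops(long long fp_ins, double seconds)
{
    printf("%lld floating point instructions in %.3f s = %.3f MFLOPS\n",
           fp_ins, seconds, mflops_from_counts(fp_ins, seconds));
}
```

For instance, 5,000,000 instructions counted over 2.0 seconds correspond to 2.5 MFLOPS.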
Using the general PAPI high-level interface
Here's an example in Fortran
of using the general high-level PAPI API, which allows
you to count any available PAPI events of your choice:
- Include the proper PAPI constant definitions:
      include 'f77papi.h'
- Declare the events you want to count and other error-related
  variables, for example:
      integer events(2), numevents, ierr
      character*(PAPI_MAX_STR_LEN) errorstring
- Declare variables to hold the event counts:
      integer*8 values(2)
- Set each event to the desired type, listed in f77papi.h (or below):
      numevents = 2
      events(1) = PAPI_FP_INS
      events(2) = PAPI_TOT_CYC
- Start and clear the counters:
      call PAPIF_start_counters(events, numevents, ierr)
- Do some computation, then read and reset the counters but leave them
  running:
      call PAPIF_read_counters(values, numevents, ierr)
  A similar routine, PAPIF_accum_counters, accepts the same
  arguments but adds the current values to the running totals already
  contained in the values array.
- Compute some more, then stop the counters and retrieve the values:
      call PAPIF_stop_counters(values, numevents, ierr)
- Each of those calls returns an error code that you can handle this way:
      if ( ierr .ne. PAPI_OK ) then
          call PAPIF_perror(ierr, errorstring, PAPI_MAX_STR_LEN)
          print *, errorstring
      endif
A similar C sequence is:
#include <stdio.h>
#include <papi.h>

#define NUMEVENTS 2

int main(void)
{
    /* PAPI_start_counters() takes an array of int event codes */
    int events[NUMEVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
    int errorcode;
    long long values[NUMEVENTS];
    char errorstring[PAPI_MAX_STR_LEN+1];

    errorcode = PAPI_start_counters(events, NUMEVENTS);
    /* Compute... */
    errorcode = PAPI_read_counters(values, NUMEVENTS);
    /* Compute some more... */
    errorcode = PAPI_stop_counters(values, NUMEVENTS);

    /* Every call returns an error code; only the last is checked here */
    if (errorcode != PAPI_OK) {
        PAPI_perror(errorcode, errorstring, PAPI_MAX_STR_LEN);
        fprintf(stderr, "PAPI error (%d): %s\n", errorcode, errorstring);
        return 1;
    }
    return 0;
}
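Raw counts are often easiest to interpret as ratios. As a minimal sketch, assuming the values array was filled in the same event order as above (PAPI_FP_INS first, PAPI_TOT_CYC second), the hypothetical helper below computes floating point instructions per cycle. It uses only standard C and does not call PAPI; the index names are assumptions made for this illustration.

```c
#include <stddef.h>

/* Position of each event in the values array. This mirrors the order of
 * the events array in the sequence above and is an assumption of this
 * sketch, not something PAPI defines. */
enum { IDX_FP_INS = 0, IDX_TOT_CYC = 1, NUM_VALUES = 2 };

/* Hypothetical helper, not part of PAPI: floating point instructions
 * per processor cycle. Returns 0.0 if no cycles were counted, to avoid
 * dividing by zero. */
double fp_ins_per_cycle(const long long values[NUM_VALUES])
{
    if (values == NULL || values[IDX_TOT_CYC] <= 0) {
        return 0.0;
    }
    return (double)values[IDX_FP_INS] / (double)values[IDX_TOT_CYC];
}
```

A result of 0.5, for example, means that on average one floating point instruction was counted every two cycles over the measured interval.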
You can count up to five (Intel Core 2) or four (AMD Magny-Cours)
individual events at a time, or you can alternatively "multiplex"
the available physical counters over a larger number of events.
Please refer to the PAPI web site for instructions on multiplexing.
Certain native hardware events are restricted to a subset
of the available counters. The details of this are beyond the scope
of this web page; refer to the Intel and AMD manuals for more information.
In general though, you don't have to concern yourself with this when
accessing counters through the PAPI software; the details are taken
care of for you.
Much more detailed information about the hardware performance counters
on Intel Core 2 and AMD Magny-Cours processors, including a
complete listing of all native events available on these processors,
can be found at the vendors' web sites:
- Intel® 64 and IA-32 Architectures Software Developer Manuals
  Intel® 64 and IA-32 Architectures Software Developer's Manual,
  Combined Volumes 3A and 3B: System Programming Guide, Parts 1 and 2
  (Intel IA-32 and x86-64)
- AMD Developer Guides & Manuals
  BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h Processors
  (AMD Magny-Cours processors)
- How many hardware performance counters are there on Intel Core 2
  and AMD Magny-Cours processors?
  There are five counters on the Intel Core 2 processors and four
  on the AMD Magny-Cours processors (Forge cluster).
- Why is PAPIF_flops returning bad numbers for times and MFLOPS?
  I know they're not correct.
  Make sure that you aren't passing in double-precision variables.
  This might happen if you specify the -r8 flag to the Fortran
  compiler, for example. PAPIF_flops expects 32-bit floating point
  numbers for the time and MFLOPS arguments. Try declaring the
  variables you pass to PAPIF_flops as real*4.
- Are there any utilities that allow me to access the performance
  counters without modifying or relinking my code?
  Yes. Here is a synopsis of these utilities:
  Recommended (for ease of use)
  A command-line utility, "psrun", is available on the Forge and Ember
  clusters. psrun uses PAPI as the underlying support for accessing
  the performance counters. psrun was developed by the PerfSuite
  project at NCSA. It supports the option "-h" to access brief online
  help on usage.
For more detailed information about these tools and their use at NCSA:
You can also check the official PAPI FAQ
if we haven't answered your question here.
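The PAPIF_flops question above comes down to a size mismatch: the routine stores 32-bit values, but a variable promoted to double precision is read back as 64 bits. The C sketch below imitates that effect with memcpy. It does not call PAPI, the function name is hypothetical, and it assumes the usual 4-byte float and 8-byte double.

```c
#include <string.h>

/* Hypothetical illustration, not part of PAPI: store a 32-bit float into
 * the first bytes of a zeroed 64-bit double, then read the double back.
 * This mimics a routine writing real*4 results into real*8 variables. */
double read_float_bits_as_double(float written)
{
    double slot = 0.0;
    /* Only sizeof(float) bytes of the double are overwritten */
    memcpy(&slot, &written, sizeof written);
    return slot;
}
```

Calling read_float_bits_as_double(1.5f) does not give back 1.5; the value you see depends on the machine's byte order, which is why the reported times and MFLOPS look wrong in that situation.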