PAPI at NCSA
The PAPI (Performance Application Programming Interface) library
from the
Innovative Computing Laboratory
at the University of Tennessee-Knoxville is available on the Dell Intel 64
cluster (Abe), IBM IA-64
Linux cluster (Mercury), and SGI Altix (Cobalt) at NCSA. PAPI is an effort
to establish a uniform, standard programming interface for accessing
hardware performance counters on modern microprocessors. The PAPI web
site is located at:
http://icl.cs.utk.edu/papi/
Hardware performance counters can be very useful for tuning the
performance of applications and for evaluating the effectiveness of
the compiler on your application. These counters allow you to
directly measure the actual usage of the hardware as your application runs
and may help you to diagnose bottlenecks in your application's
performance. By using PAPI, you gain the benefit of a cross-platform
interface to the counters, allowing you to maintain a common source
for a wide variety of architectures.
This page is an overview of information of interest to, or specific to,
users of PAPI at NCSA. A number of documents that cover PAPI in more
detail are available at the PAPI web site, including The PAPI User's
Guide, written for end users and people new to PAPI. The repository also
links to an in-depth tutorial by members of the PAPI development team,
Performance Tuning Using Hardware Counter Data.
PAPI provides both a simple, high-level interface that may be suitable
for your needs and also a low-level interface that gives you much more
control over PAPI, including access to native hardware events that are
not part of the PAPI standard event definitions. Neither the low-level
interface nor access to native events through PAPI is covered here;
please refer to the PAPI web site and processor-specific documentation
for details.
Note: the PAPI low-level API and mechanism for accessing native events
have changed in PAPI 3. You will have to modify your source code if
you are using these features of PAPI and want to use PAPI 3. Detailed
instructions on converting applications and tools to the PAPI 3 API
are available at the main PAPI web site. If you are using the PAPI
high-level API only, your source code should require no changes to use
PAPI 3.
The PAPI directory (/usr/apps/tools/papi) contains the compiled libraries,
include files, UNIX manual pages, and example programs from the PAPI
distribution. Make sure that its include and lib subdirectories are on the
search paths for include files and libraries when you compile and link (see
below). Add the directory /usr/apps/tools/papi/man to your
MANPATH environment variable if you want the "man"
command to find the PAPI manual pages.
PAPI include files for Fortran
There are three different Fortran include files that you can choose from
when compiling your PAPI-enabled Fortran program:
- fpapi.h: This is an include file that requires C-style preprocessing.
  Several compilers will treat a Fortran source code file with the suffix
  ".F" (uppercase F) as a file that should be passed through the C
  preprocessor. Consult the documentation for the compiler you are using
  for specifics.
- f77papi.h: This is a Fortran 77-style include file. This file requires
  no C preprocessing, so you may find it more convenient to use.
- f90papi.h: This is a Fortran 90-style include file. Like f77papi.h,
  this file requires no C preprocessing, so you may find it more
  convenient to use.
PAPI libraries
If you link with the shared (.so) version of the library, you will have
to specify where the PAPI shared library can be found at runtime. For
example, on the Linux clusters you can add /usr/apps/tools/papi/lib to
your LD_LIBRARY_PATH environment variable.
If you link the static version of the PAPI library into your program,
your executable should run without having to modify the
LD_LIBRARY_PATH environment variable. You can cause the static
version to be used by specifying -static at link time,
or by including /usr/apps/tools/papi/lib/libpapi.a
on your link command.
For all compilers, specify
-I/usr/apps/tools/papi/include
at compile time, and specify
-L/usr/apps/tools/papi/lib -lpapi
at the link step.
If you are using PAPI on the POWER4/AIX system (Copper), you will
also want to append the PMAPI library (-lpmapi) at the
link step, as follows:
-L/usr/apps/tools/papi/lib -lpapi -lpmapi
Using PAPI_flops
Perhaps the easiest way to use the PAPI high-level functions (which
may be sufficient for many users) is to call the routine
PAPI_flops (or in Fortran,
PAPIF_flops).
This routine, which may be called multiple times from a single-threaded
program, is an easy way to measure wall-clock time, CPU time, the number
of floating point instructions executed, and the MFLOP rate.
Here's an example of using PAPI_flops from Fortran:
      include 'f77papi.h'
      real real_time, cpu_time, mflops
      integer*8 fp_ins
      integer ierr

C     Call PAPIF_flops to get things started. This will initialize PAPI
C     and start the counters running. Each of these calls returns an
C     error code in the 'ierr' parameter. See below for details on
C     how to manage this.
      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)

C     Do some computation
      call compute()

C     Read the values in the counters and print them out. Any call to
C     PAPIF_flops with fp_ins set to the value -1 will reinitialize
C     all counters to zero. You might want to do this in order
C     to individually time different portions of your application.
      call PAPIF_flops(real_time, cpu_time, fp_ins, mflops, ierr)
      write (*,100) real_time, cpu_time, fp_ins, mflops
 100  format(' Real time (secs) :', f15.3,
     +       /' CPU time (secs) :', f15.3,
     +       /'Floating point instructions :', i15,
     +       /' MFLOPS :', f15.3)
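To time one portion of an application without resetting the counters, one option is to take the difference between two successive readings. The C sketch below assumes that the times and instruction counts returned by PAPI_flops are cumulative since the first call; the helper name interval_mflops is hypothetical and is not part of PAPI.

```c
/* Hypothetical helper, not part of PAPI: MFLOPS achieved between two
 * successive PAPI_flops readings. Assumes both readings are cumulative
 * totals since the counters were started. */
double interval_mflops(float time_before, long long fp_ins_before,
                       float time_after, long long fp_ins_after)
{
    double seconds = (double)time_after - (double)time_before;

    if (seconds <= 0.0)
        return 0.0;   /* no measurable interval; avoid dividing by zero */

    return (double)(fp_ins_after - fp_ins_before) / (seconds * 1.0e6);
}
```

You would pass in the real_time and fp_ins values saved from the reading taken before the region of interest, followed by the values from the reading taken after it.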
Using the general PAPI high-level interface
Here's an example in Fortran
of using the general high-level PAPI API, which allows
you to count any available PAPI events of your choice:
- Include the proper PAPI constant definitions:
      include 'f77papi.h'
- Declare the events you want to count, along with the variables used
  for error handling, for example:
      integer events(2), numevents, ierr
      character*(PAPI_MAX_STR_LEN) errorstring
- Declare variables to hold the event counts:
      integer*8 values(2)
- Set each event to the desired type, listed in f77papi.h (or below):
      numevents = 2
      events(1) = PAPI_FP_INS
      events(2) = PAPI_TOT_CYC
- Start and clear the counters:
      call PAPIF_start_counters(events, numevents, ierr)
- Do some computation, then read and reset the counters but leave them
  running:
      call PAPIF_read_counters(values, numevents, ierr)
  A similar routine, PAPIF_accum_counters, accepts the same
  arguments but adds the current values to the running totals already
  contained in the values array.
- Compute some more and then stop the counters and retrieve the values:
      call PAPIF_stop_counters(values, numevents, ierr)
- Each of those calls returns an error code that you can handle this way:
      if ( ierr .ne. PAPI_OK ) then
         call PAPIF_perror(ierr, errorstring, PAPI_MAX_STR_LEN)
         print *, errorstring
      endif
A similar C sequence is:
    #include <stdio.h>
    #include <papi.h>

    #define NUMEVENTS 2

    int events[NUMEVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
    int errorcode;
    long long values[NUMEVENTS];
    char errorstring[PAPI_MAX_STR_LEN+1];

    /* Start and clear the counters */
    errorcode = PAPI_start_counters(events, NUMEVENTS);
    /* Compute... */
    /* Read and reset the counters, leaving them running */
    errorcode = PAPI_read_counters(values, NUMEVENTS);
    /* Compute some more... */
    /* Stop the counters and retrieve the final values */
    errorcode = PAPI_stop_counters(values, NUMEVENTS);
    /* Each call returns an error code; check it like this */
    if (errorcode != PAPI_OK) {
        PAPI_perror(errorcode, errorstring, PAPI_MAX_STR_LEN);
        fprintf(stderr, "PAPI error (%d): %s\n", errorcode, errorstring);
    }
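The difference between the read and accumulate calls can be pictured with plain arrays. The sketch below does not call PAPI: fake_read and fake_accum are hypothetical stand-ins that mimic the behaviour described above, with a caller-supplied array playing the role of the hardware counters.

```c
/* Hypothetical stand-ins (not PAPI calls) that mimic the described
 * behaviour of the read and accumulate routines on plain arrays.
 * 'counters' plays the role of the running hardware counters. */

/* Like a read: copy the current counts out, then reset the counters. */
void fake_read(long long *counters, long long *values, int n)
{
    for (int i = 0; i < n; i++) {
        values[i] = counters[i];
        counters[i] = 0;
    }
}

/* Like an accumulate: add the current counts to what 'values' already
 * holds, then reset the counters. */
void fake_accum(long long *counters, long long *values, int n)
{
    for (int i = 0; i < n; i++) {
        values[i] += counters[i];
        counters[i] = 0;
    }
}
```

With the read form, each call reports only the counts since the previous call, so you keep per-phase numbers. With the accumulate form, the values array ends up holding totals across all phases.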
You can count up to two (Pentium III), four (Itanium and Itanium 2),
eighteen (Xeon), or eight (POWER4) individual events at a time.
Alternatively, you can "multiplex" the available physical counters over
a larger number of events. Please refer to the PAPI web site for
instructions on multiplexing.
Certain native hardware events are restricted to a subset
of the available counters. The details are beyond the scope
of this web page; refer to the Intel and IBM manuals for more information.
In general, though, you don't have to concern yourself with this when
accessing counters through the PAPI software; the details are taken
care of for you.
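If multiplexing is not suitable, one alternative is to repeat the run several times and count a different group of events each time. The C sketch below estimates how many runs that takes. The helper runs_needed is hypothetical and not part of PAPI, and because some events are restricted to particular counters, treat the result as a lower bound.

```c
/* Hypothetical helper, not part of PAPI: minimum number of separate runs
 * needed to count 'num_events' events when at most 'num_counters' can be
 * counted at once (for example 2 on Pentium III or 8 on POWER4).
 * Returns -1 for invalid input. */
int runs_needed(int num_events, int num_counters)
{
    if (num_events < 0 || num_counters <= 0)
        return -1;

    /* Integer ceiling of num_events / num_counters. */
    return (num_events + num_counters - 1) / num_counters;
}
```

For example, counting five events on a processor with two counters needs at least three runs under this estimate.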
Below is a table of available hardware performance counter events on
Pentium III, Xeon, Itanium, Itanium 2, and POWER4 that are reported by
PAPI (considered "standard"). This is a subset of all 104 standard
events that are defined by PAPI. Of these events, 45 are
supported on Pentium III, 19 are supported on Xeon, 43 are supported on
Itanium, 56 are supported on Itanium 2, and 22 are supported on POWER4.
They are listed here for convenience in determining what PAPI events you can
measure on the Intel-based Linux clusters and IBM p690 systems at
NCSA. You can find the full listing of PAPI standard events at the
PAPI web site or in the include file
papiStdEventDefs.h.
Note: not all of these events are available on all
platforms. The table indicates which events are available on
each processor, both in tabular form and by color-coding.
Additionally, these listings refer to PAPI 2 with the exception of
Tungsten (Pentium 4), which only supports PAPI 3 beta.
Legend:
Red: available only on Pentium III
Turquoise: available only on Xeon
Blue: available only on Itanium
Plum: available only on Itanium 2
Yellow: available only on POWER4
Green: available on all processors
White: available on some (but not all) processors
"*": available and measured by a single native event
"D": available and is a derived event (calculated from multiple native events)
Standard PAPI Events Available on NCSA Systems
| Name | Description | Platinum (Pentium III) | Tungsten (Xeon), PAPI 3 only | Titan (Itanium), PAPI 2 only | Mercury (Itanium 2) | Copper (POWER4) |
| Conditional Branching |
| PAPI_BR_CN | Conditional branch instructions | | | | | |
| PAPI_BR_INS | Branch instructions | * | * | * | * | |
| PAPI_BR_MSP | Conditional branch instructions mispredicted | * | * | D | D | |
| PAPI_BR_NTK | Conditional branch instructions not taken | D | * | D | | |
| PAPI_BR_PRC | Conditional branch instructions correctly predicted | D | * | * | * | |
| PAPI_BR_TKN | Conditional branch instructions taken | * | * | D | | |
| PAPI_BTAC_M | Branch target address cache misses | * | | | | |
| Cache Requests |
| PAPI_CA_CLN | Requests for exclusive access to clean cache line | * | | | | |
| PAPI_CA_INV | Requests for cache line invalidation | * | | | D | |
| PAPI_CA_ITV | Requests for cache line intervention | * | | | | |
| PAPI_CA_SHR | Requests for exclusive access to shared cache line | * | | | | |
| PAPI_CA_SNP | Requests for a snoop | | | | * | |
| Conditional Store |
| (no events available) |
| Floating Point Operations |
| PAPI_FLOPS | Floating point instructions per second (PAPI 2 only) | D | | D | D | D |
| PAPI_FMA_INS | Floating point multiply-add instructions completed | | | | | * |
| PAPI_FML_INS | Floating point multiply instructions | * | | | | |
| PAPI_FDV_INS | Floating point divide instructions | * | | | | * |
| PAPI_FSQ_INS | Floating point square root instructions | | | | | * |
| PAPI_FP_INS | Floating point instructions | * | D | D | * | * |
| PAPI_FP_OPS | Floating point operations (PAPI 3 only) | * | D | | * | D |
| Instruction Counting |
| PAPI_FXU_IDL | Cycles integer units are idle | | | | | * |
| PAPI_HW_INT | Hardware interrupts | * | | | | * |
| PAPI_INT_INS | Integer instructions | | | | | * |
| PAPI_IPS | Instructions per second (PAPI 2 only) | D | | | | D |
| PAPI_TOT_CYC | Total cycles | * | * | * | * | * |
| PAPI_TOT_IIS | Instructions issued | * | * | | * | * |
| PAPI_TOT_INS | Instructions completed | * | * | * | D | * |
| PAPI_VEC_INS | Vector/SIMD instructions | * | D | | | |
| Cache Access |
| PAPI_L1_DCA | L1 data cache accesses | * | | * | * | D |
| PAPI_L1_DCH | L1 data cache hits | D | | D | D | |
| PAPI_L1_DCR | L1 data cache reads | | | | * | * |
| PAPI_L1_DCM | L1 data cache misses | * | | * | * | D |
| PAPI_L1_DCW | L1 data cache writes | | | | | * |
| PAPI_L1_ICA | L1 instruction cache accesses | * | * | | D | |
| PAPI_L1_ICH | L1 instruction cache hits | D | | | | |
| PAPI_L1_ICM | L1 instruction cache misses | * | * | * | * | |
| PAPI_L1_ICR | L1 instruction cache reads | * | | D | D | |
| PAPI_L1_ICW | L1 instruction cache writes | * | | | | |
| PAPI_L1_LDM | L1 load misses | * | | D | D | * |
| PAPI_L1_STM | L1 store misses | * | | | | * |
| PAPI_L1_TCA | L1 total cache accesses | D | | | | |
| PAPI_L1_TCM | L1 total cache misses | * | | D | D | |
| PAPI_L2_DCA | L2 data cache accesses | D | | * | * | |
| PAPI_L2_DCH | L2 data cache hits | D | | | D | |
| PAPI_L2_DCM | L2 data cache misses | | | D | * | |
| PAPI_L2_DCR | L2 data cache reads | * | | * | * | |
| PAPI_L2_DCW | L2 data cache writes | * | | * | * | |
| PAPI_L2_ICA | L2 instruction cache accesses | * | | | | |
| PAPI_L2_ICM | L2 instruction cache misses | | | * | * | |
| PAPI_L2_ICR | L2 instruction cache reads | * | | D | D | |
| PAPI_L2_LDM | L2 load misses | | | D | * | |
| PAPI_L2_STM | L2 store misses | | | * | * | |
| PAPI_L2_TCA | L2 total cache accesses | * | * | | * | |
| PAPI_L2_TCH | L2 total cache hits | | * | | * | |
| PAPI_L2_TCM | L2 total cache misses | * | * | * | * | |
| PAPI_L2_TCR | L2 total cache reads | D | | | D | |
| PAPI_L2_TCW | L2 total cache writes | * | | | | |
| PAPI_L3_DCA | L3 data cache accesses | | | D | * | |
| PAPI_L3_DCH | L3 data cache hits | | | D | D | |
| PAPI_L3_DCM | L3 data cache misses | | | D | D | |
| PAPI_L3_DCR | L3 data cache reads | | | * | * | |
| PAPI_L3_DCW | L3 data cache writes | | | * | * | |
| PAPI_L3_ICH | L3 instruction cache hits | | | * | * | |
| PAPI_L3_ICM | L3 instruction cache misses | | | * | * | |
| PAPI_L3_ICR | L3 instruction cache reads | | | * | * | |
| PAPI_L3_LDM | L3 load misses | | | D | * | |
| PAPI_L3_STM | L3 store misses | | | * | * | |
| PAPI_L3_TCA | L3 total cache accesses | | * | | * | |
| PAPI_L3_TCH | L3 total cache hits | | * | | D | |
| PAPI_L3_TCM | L3 total cache misses | | * | * | * | |
| PAPI_L3_TCR | L3 total cache reads | | | | * | |
| PAPI_L3_TCW | L3 total cache writes | | | | * | |
| Data Access |
| PAPI_LD_INS | Load instructions | | D | * | * | |
| PAPI_LST_INS | Load/store instructions completed | | D | D | | |
| PAPI_FP_STAL | Cycles the floating point units are stalled | | | | * | |
| PAPI_MEM_SCY | Cycles stalled waiting for memory access | | | * | | |
| PAPI_RES_STL | Cycles stalled on any resource | * | * | | * | |
| PAPI_SR_INS | Store instructions | | D | * | * | |
| PAPI_STL_CCY | Cycles with no instructions completed | | | | * | |
| PAPI_STL_ICY | Cycles with no instruction issue | | | * | * | * |
| TLB Operations |
| PAPI_TLB_DM | Data translation lookaside buffer misses | | * | * | * | * |
| PAPI_TLB_IM | Instruction translation lookaside buffer misses | * | * | * | * | * |
| PAPI_TLB_TL | Total translation lookaside buffer misses | | * | | D | D |
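Raw event counts are often easier to interpret as ratios. As a small illustration, the C sketch below turns counts for PAPI_L1_DCM and PAPI_L1_DCA into an L1 data cache miss ratio. The helper l1_data_miss_ratio is hypothetical and not part of PAPI; it assumes both events were counted over the same region of code.

```c
/* Hypothetical helper, not part of PAPI: fraction of L1 data cache
 * accesses that missed, given counts for PAPI_L1_DCM (misses) and
 * PAPI_L1_DCA (accesses). Returns 0 when no accesses were counted. */
double l1_data_miss_ratio(long long misses, long long accesses)
{
    if (accesses <= 0)
        return 0.0;
    return (double)misses / (double)accesses;
}
```

The same pattern applies to other pairs of events in the table, such as instructions completed per cycle from PAPI_TOT_INS and PAPI_TOT_CYC.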
Much more detailed information about the hardware performance counters
on Pentium III, Pentium 4, Xeon, Itanium, and Itanium 2, including a
complete listing of all native events available on these processors,
can be found at Intel's web site:
- Intel® Architecture Optimization Reference Manual (Pentium III)
- IA-32 Intel® Architecture Optimization Reference Manual (see also
  IA-32 Intel® Architecture Software Developer's Manual, Volume 3:
  System Programming Guide) (Pentium 4, Xeon, Pentium M)
- Intel® Itanium® Processor Reference Manual for Software Development
- Intel® Itanium® 2 Processor Reference Manual for Software
  Development and Optimization
There are currently no reference manuals that list available native events
for the POWER4 architecture from IBM, but you can review the files
within the directory /usr/pmapi/lib to see what events
are available on these processors. In particular, the files:
POWER4.{evs,gps}
provide information about individual
performance events as well as "event group" information.
You may also find the IBM document
PowerPC Architecture Book
helpful.
-
How many hardware performance counters are there on Pentium III, Xeon,
Itanium, Itanium 2, and POWER4 processors?
-
There are two counters on Pentium III, eighteen on Xeon, four on Itanium,
four on Itanium 2, and eight on POWER4.
-
Why is
PAPIF_flops returning bad numbers for times
and MFLOPS? I know they're not correct.
-
Make sure that you aren't passing in double-precision variables.
This might happen if you specify the -r8 flag to the Fortran compiler,
for example. PAPIF_flops expects 32-bit floating point variables
for the time and MFLOPS arguments. Try declaring the variables
you pass to PAPIF_flops as real*4.
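The effect of such a size mismatch can be reproduced in a few lines of C. The sketch below is only an analogy and does not call PAPI: store_float32 is a hypothetical stand-in for a routine that writes a 32-bit value through the address it is given, which is roughly what happens when a Fortran routine expecting real*4 receives a variable that the compiler promoted to 8 bytes.

```c
#include <string.h>

/* Hypothetical stand-in, not a PAPI call: stores a 32-bit float at the
 * address it is given, without knowing the caller's variable type. */
void store_float32(void *dest, float value)
{
    memcpy(dest, &value, sizeof value);
}

/* What the caller sees when it wrongly supplied a 64-bit variable:
 * only part of the 8 bytes is overwritten, so the result is garbage. */
double read_back_as_double(float value)
{
    double wrong = 0.0;
    store_float32(&wrong, value);
    return wrong;
}

/* What the caller sees when it supplied a matching 32-bit variable. */
float read_back_as_float(float value)
{
    float right = 0.0f;
    store_float32(&right, value);
    return right;
}
```

With a matching 32-bit variable the stored value comes back unchanged. With the 64-bit variable the bytes are reinterpreted and the value that comes back is not the one that was stored.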
-
Are the floating point operations reported by PAPI accurate?
-
On Pentium, PAPI bases its count of floating point operations
on the native event
FLOPS. On Itanium, PAPI
calculates the number of floating point operations using the following
formula:
FP_OPS_RETIRED_HI*4 + FP_OPS_RETIRED_LO
On Itanium 2, PAPI uses the native event FP_OPS_RETIRED to count
floating point operations.
These should give you an accurate count of total floating point operations
retired by your code on NCSA Linux clusters (in contrast, the MIPS R10000
counters on the SGI Origin, for example, count a fused multiply-add
instruction as a single floating-point operation).
Note: On Pentium III, SSE/SSE2 vector operations are not included in the
floating point operation count. On Pentium 4, PAPI 3 includes
SSE2 instructions in the floating point operation count, but you may want
to adjust environment variables to count the exact type of vector
floating point operations of interest. Please refer to the PAPI
documentation for more information.
On POWER4, the native event PM_FPU_FIN is used. Be aware that this event
alone does not accurately measure the floating point operation count of
your application; until this is resolved, consider using other tools such
as the HPM Toolkit, or programming directly with PMAPI.
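The Itanium formula quoted above is simple enough to apply by hand if you collect the two native events yourself. A minimal C sketch follows; itanium_fp_ops is a hypothetical helper and not part of PAPI.

```c
/* Hypothetical helper, not part of PAPI: combines the two Itanium native
 * event counts using the formula quoted above,
 * FP_OPS_RETIRED_HI*4 + FP_OPS_RETIRED_LO. */
long long itanium_fp_ops(long long fp_ops_retired_hi,
                         long long fp_ops_retired_lo)
{
    return fp_ops_retired_hi * 4 + fp_ops_retired_lo;
}
```

For instance, a FP_OPS_RETIRED_HI count of 10 and a FP_OPS_RETIRED_LO count of 3 combine to 43 floating point operations.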
-
Where do I find Perfometer and Dynaprof for PAPI on NCSA systems?
-
Neither Perfometer nor Dynaprof is currently installed at NCSA. Please
contact NCSA support staff if you have a need for these tools.
-
Are there any utilities that allow me to access the performance
counters without modifying or relinking my code?
-
Yes. Here is a synopsis of these utilities:
Recommended (for ease-of-use)
A command-line utility, "psrun", is available on the Platinum,
Titan, Tungsten, and Mercury clusters. psrun uses PAPI as the underlying
support for accessing the performance counters. psrun was developed
by the
PerfSuite project at NCSA. It supports the option
"-h" to access brief online help on usage.
On Copper, the utility "hpmcount", written by Luiz DeRose
of IBM ACTC, can be used to measure hardware performance counter
data with an unmodified application. hpmcount supports the option
"-h" to access brief online help on usage.
hpmcount uses the AIX PMAPI kernel interface to access the counters on
the p690, not PAPI. hpmcount is part of IBM's HPM Toolkit.
Also Available (but more complex)
There are two other utilities that require you to work with the performance
counter events native to each system (refer to the
Intel documents listed above for details).
On Platinum and Tungsten, a utility called "perfex"
allows you to measure native events for an arbitrary program from
the command line. perfex was written by Mikael Pettersson of
Uppsala University, Sweden (author of the IA-32 performance counter
driver). Although very flexible, perfex can be rather difficult
to use, so we recommend that you first try psrun on the IA-32
clusters.
On Titan, a similar utility called "pfmon" is available.
pfmon was written by Stephane Eranian of Hewlett-Packard (author of
the IA-64 performance counter driver).
Multiplexing support
Unlike SGI's "perfex", most of the above utilities do not
support multiplexing of the performance counters ("psrun" is
the exception).
For more detailed information about these tools and their use at NCSA:
You can also check the official PAPI FAQ
if we haven't answered your question here.