pfmon: IA-64 Performance Monitoring

The following text is excerpted from the book IA-64 Linux Kernel, by David Mosberger and Stéphane Eranian (Prentice-Hall, 2002). Copyright © 2002 by Hewlett-Packard Company. Used by permission.


9.3.3 Using the perfmon interface: The pfmon example

In this section, we illustrate how perfmon can be used to collect performance information about unmodified binaries. We use pfmon for this purpose [14]. This is a versatile tool that can run on any Linux/ia64 system for which perfmon support has been enabled. It can monitor individual binaries with per-task sessions or the entire machine with a systemwide session. Both event counts and samples can be collected.

Because the PMU is CPU-model specific, pfmon uses the modular architecture shown in Figure 9.28 (figure omitted). The tool itself is built on a library of utility routines called libpfm. This library is also modular, and its interface consists of a set of common and CPU-model-specific routines. For instance, Itanium-specific features such as address range checking are handled by the Itanium module. Every module also contains a table of events supported by that CPU model. On top of this library, pfmon builds its own set of modules, each specialized for a particular CPU. For instance, all the EAR and BTB support is in the Itanium-specific module. At both levels, each CPU-specific module is independent of the others. The library can autodetect the host CPU and activate the appropriate module. When the host CPU is unknown, pfmon can still be used and defaults to generic support, which implements the minimal architected support for the two events "CPU cycles" and "number of instructions executed." In this section we show examples of pfmon running on Itanium.
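
The module selection with a generic fallback described above can be pictured with a small sketch. This is only an illustration of the idea, not the libpfm interface: the `MODULES` table, `select_module`, and `supports` are hypothetical names, and the Itanium event table is deliberately partial, listing only events that appear in this section.

```python
# Hypothetical sketch of CPU-module selection with a generic fallback.
# None of these names come from libpfm; they only illustrate the idea.

# The two architected events available on any CPU model, per the text.
GENERIC_EVENTS = {"CPU_CYCLES", "IA64_INST_RETIRED"}

# Each module carries its own event table (the Itanium table is partial here).
MODULES = {
    "generic": {"events": set(GENERIC_EVENTS)},
    "itanium": {"events": GENERIC_EVENTS | {"MEMORY_CYCLE", "DATA_ACCESS_CYCLE"}},
}


def select_module(cpu_model):
    """Return the module name for cpu_model, or 'generic' if it is unknown."""
    key = cpu_model.lower()
    return key if key in MODULES else "generic"


def supports(cpu_model, event):
    """Check whether the module chosen for cpu_model knows the given event."""
    return event.upper() in MODULES[select_module(cpu_model)]["events"]
```

With this table, an unknown CPU model still resolves to a usable module, but only the two architected events are accepted for it.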

The tool can be configured with command-line options. For instance, the following command counts the number of CPU cycles consumed and instructions executed (retired) by the date command, both at the user and kernel level:

$ pfmon -u -k -e cpu_cycles,ia64_inst_retired /usr/bin/date
293523 CPU_CYCLES
231009 IA64_INST_RETIRED

The -u and -k options activate monitoring at the user and kernel levels, respectively. The -e option specifies the list of events to be monitored. This example shows that no modifications to the date command are required. pfmon creates a perfmon context, then forks a child task that inherits the context. After the fork() but before the execve(), the PMU is enabled, programmed, and protected so that the date command cannot modify the context. Finally, the date command is launched. pfmon then waits for the command to terminate, at which point it receives the SIGCHLD signal. pfmon then uses PFM_READ_PMDS on the child task to extract the counter values before invoking wait4() to clean up the child task.
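
The process-control side of this sequence can be sketched as follows. This is a minimal illustration of the fork, exec, and wait pattern only: it makes no perfmon calls, `run_monitored` is a hypothetical name, and the comments mark where a monitoring tool would program the PMU and read the counters. It assumes a Linux system, where `os.waitid` with `WNOWAIT` lets the parent observe termination without reaping the child.

```python
import os


def run_monitored(argv):
    """Fork, exec argv in the child, then reap the child and return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: a monitoring tool would program and enable the PMU here,
        # after fork() but before the exec, so the target starts out monitored.
        try:
            os.execvp(argv[0], argv)
        except OSError:
            os._exit(127)  # reached only if the exec failed

    # Parent: block until the child has terminated, but do not reap it yet.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # A monitoring tool would read the child's counter values at this point,
    # while the terminated task still exists.

    # Now reap the child; wait4() also returns its resource usage.
    _, status, rusage = os.wait4(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return exit_code, rusage


if __name__ == "__main__":
    code, usage = run_monitored(["/bin/sh", "-c", "exit 0"])
    print("exit code:", code, "user time:", usage.ru_utime)
```

Reading the counters before the final wait matters because reaping the child releases the task, and with it the state the tool wants to inspect.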

Cycle accounting with pfmon

In the introduction to this chapter we mentioned that cycle accounting is a useful tool to determine how the execution time is spent. Later, in Section 9.2.3, we presented what Itanium provides to support this technique. On this CPU, cycles can be classified into eight categories, and nine events are required to get a complete breakdown. Given that Itanium has only four counters and that we need to collect at least nine measurements, we need three runs to collect the information. We can easily use pfmon to collect the counts by invoking it three times. As an example, we use a simple empty loop (nop) program with a core loop as follows:
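
The run count follows from dividing the nine events among four counters and rounding up. The sketch below makes that arithmetic explicit; the event names are placeholders, and the grouping shown is just one possible split, not necessarily the one used for the measurements in this section.

```python
import math

NUM_COUNTERS = 4  # counters available on Itanium, per the text
NUM_EVENTS = 9    # events needed for a complete breakdown, per the text


def plan_runs(events, counters=NUM_COUNTERS):
    """Split the event list into groups that each fit in the available counters."""
    return [events[i:i + counters] for i in range(0, len(events), counters)]


# Placeholder event names; the real event list is not reproduced here.
events = [f"event_{n}" for n in range(1, NUM_EVENTS + 1)]
runs = plan_runs(events)

# The number of groups equals ceil(9 / 4).
assert len(runs) == math.ceil(NUM_EVENTS / NUM_COUNTERS)

for number, group in enumerate(runs, start=1):
    print(f"run {number}: {','.join(group)}")
```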

loop:    [MIB]    nop.m 0x0
                  nop.i 0x0
                  br.cloop.dptk.few loop

Itanium can execute this loop in a single cycle. If we suppose that the program is called noploop and that the number of iterations through the loop is 10^9, then one of the runs is invoked as follows:

$ pfmon -u -e cpu_cycles,memory_cycle,data_access_cycle noploop
1000423602 CPU_CYCLES
5154       MEMORY_CYCLE
5141       DATA_ACCESS_CYCLE

Once all three runs are completed it is possible to calculate the breakdown. Using only three runs to draw conclusions about a program is usually not recommended because there can be fluctuations between runs, but we use it to illustrate how pfmon could be used to construct the breakdown. For noploop, it might look as follows:

                                           cycles          % of cycles
----------------------------------------------------------------------
1. dependency cycles                       1544              0.00%
2. issue limit cycles                      4402              0.00%
3. data access cycles                      5141              0.00%
4. instruction access cycles               422267            0.04%
5. RSE memory cycles                       13                0.00%
6. inherent execution cycles               1000001803       99.93%
7. branch resteer cycles                   20195             0.00%
8. taken branch cycles                     6890              0.00%
----------------------------------------------------------------------
Total                                      1000462255       99.97%

This output confirms that virtually all cycles were spent in the loop itself (inherent execution) and that no major stalls were incurred. Furthermore, we can also use it to verify that the loop was executed in one cycle by dividing the number of cycles by the number of iterations: 1000462255/10^9 ≈ 1.
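
The same check can be written out as a short calculation. The sketch below copies the per-category cycle counts from the table and takes the iteration count as 10^9; the variable names are illustrative only.

```python
# Cycle counts per category, copied from the breakdown table above.
breakdown = {
    "dependency": 1544,
    "issue limit": 4402,
    "data access": 5141,
    "instruction access": 422267,
    "RSE memory": 13,
    "inherent execution": 1000001803,
    "branch resteer": 20195,
    "taken branch": 6890,
}

ITERATIONS = 10**9  # loop iterations of noploop

# Sum of all categories; this matches the Total row of the table.
total = sum(breakdown.values())

# Average cycles spent per loop iteration.
cycles_per_iteration = total / ITERATIONS

# Everything that is not inherent execution counts as stall cycles.
stall_cycles = total - breakdown["inherent execution"]

print("total cycles:        ", total)
print("cycles per iteration:", cycles_per_iteration)
print("stall cycles:        ", stall_cycles)
```

The stall cycles are several orders of magnitude smaller than the inherent execution cycles, which is why the ratio stays so close to one cycle per iteration.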


[14] Stéphane Eranian. The pfmon performance monitoring tool for Linux/ia64, June 2001.
ftp://ftp.hpl.hp.com/pub/linux-ia64/.