The following text is excerpted from the book IA-64 Linux Kernel,
by David Mosberger and Stéphane Eranian (Prentice-Hall, 2002).
Copyright © 2002 by Hewlett-Packard Company. Used by permission.
9.3.3 Using the perfmon interface: The pfmon example
In this section, we illustrate how perfmon can be used to collect
performance information about unmodified binaries. We use pfmon
for this purpose [14]. This is a versatile tool that can run on any
Linux/ia64 system for which perfmon support has been enabled. It can
monitor individual binaries with per-task sessions or the entire
machine with a systemwide session. Both event counts and samples
can be collected.
Because the PMU is CPU model specific, pfmon uses the
modular architecture shown in Figure 9.28 (figure omitted).
The tool itself is built on a library of utility routines called
libpfm. This library is also modular, and its interface consists
of a set of common and CPU-model-specific routines. For instance,
Itanium-specific features such as address range checking are handled
by the Itanium module. Every module also contains a table of events
supported by that CPU model. On top of this library, pfmon
builds its own set of modules specialized for a particular CPU.
For instance, all the EAR and BTB support is in the Itanium-specific
module. At both levels, each CPU-specific module is independent
of the others. The library autodetects the host CPU and
activates the appropriate module. When the host CPU is unknown,
pfmon can still be used and will default to generic
support, which implements the minimal architected support for the
two events "CPU cycles" and "number of instructions
executed." In this section we show examples of pfmon
running on Itanium.
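The module selection described above can be sketched as a small registry with a generic fallback. This is only an illustration of the dispatch idea, not libpfm's actual interface: the class PmuModule, the function select_module, and the identifier "instructions_executed" are hypothetical names, and the Itanium table lists only events that appear in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PmuModule:
    """Hypothetical stand-in for a CPU-specific module and its event table."""
    name: str
    events: frozenset


# Generic support: only the two architected events are available.
# The event identifiers here are illustrative placeholders.
GENERIC = PmuModule("generic", frozenset({"cpu_cycles", "instructions_executed"}))

# Itanium module: a partial event table, limited to names used in this section.
ITANIUM = PmuModule(
    "itanium",
    frozenset({"cpu_cycles", "ia64_inst_retired",
               "memory_cycle", "data_access_cycle"}),
)

# Each CPU-specific module is registered independently of the others.
MODULES = {"itanium": ITANIUM}


def select_module(cpu_model: str) -> PmuModule:
    """Return the module for the detected CPU, or generic support if unknown."""
    return MODULES.get(cpu_model.lower(), GENERIC)


if __name__ == "__main__":
    print(select_module("Itanium").name)
    print(select_module("some-future-cpu").name)
```

Keeping the fallback in one lookup means that adding support for a new CPU model only requires registering another module.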
The tool can be configured with command-line options. For instance,
the following command counts the number of CPU cycles consumed and
instructions executed (retired) by the date command,
both at the user and kernel level:
$ pfmon -u -k -e cpu_cycles,ia64_inst_retired /usr/bin/date
293523 CPU_CYCLES
231009 IA64_INST_RETIRED
The -u and -k options activate monitoring
at the user and kernel levels, respectively. The -e
option specifies the list of events to be monitored. This example
shows that no modifications to the date command are
required. pfmon creates a perfmon context, then
forks a child task that inherits the context. After the fork()
but before the execve(), the PMU is enabled, programmed, and
protected against attempts by the date command to modify
the context. Finally, the date command
is launched. Next, pfmon waits for the command to
terminate, at which point it receives the SIGCHLD signal.
pfmon then uses PFM_READ_PMDS on the child task to
extract the counter values before invoking wait4() to clean
up the child task.
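The process flow just described (fork, set up, exec, wait, read, reap) can be sketched in Python. This is a minimal illustration of the control flow only: it does not touch the PMU, the perfmon steps appear solely as comments, and the function name run_monitored is an assumption made for this sketch. A blocking waitid() with WNOWAIT stands in for the SIGCHLD notification so that the child still exists at the point where the counters would be read.

```python
import os
import sys


def run_monitored(argv):
    """Run argv in a child task and return its exit code (no real PMU access)."""
    # Placeholder: pfmon would create the perfmon context here.
    pid = os.fork()
    if pid == 0:
        # Child: inherits the context. Placeholder for enabling,
        # programming, and protecting the PMU before the exec.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # reached only if the exec failed

    # Parent: wait until the child has terminated, but do not reap it yet.
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # Placeholder: this is where the counters of the terminated child
    # would be read (PFM_READ_PMDS in the text).

    # Finally, reap the child, as pfmon does with wait4().
    _, status, _ = os.wait4(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print(run_monitored([sys.executable, "-c", "pass"]))
```

The order matters: the counters belong to the child task, so they have to be read while it still exists, before the final wait removes it.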
Cycle accounting with pfmon
In the introduction to this chapter we mentioned that cycle accounting
is a useful tool to determine how the execution time is spent. Later,
in Section 9.2.3 we presented what Itanium provides to support this
technique. On this CPU, cycles can be classified into eight
categories, and nine events are required to get a complete breakdown.
Given that Itanium only has four counters and that we need to collect
nine measurements at a minimum, we need three runs to collect the
information. We can easily use pfmon to collect the
counts by invoking it three times. As an example, we use a simple
empty loop (nop) program with a core loop as follows:
loop:   [MIB]   nop.m 0x0
                nop.i 0x0
                br.cloop.dptk.few loop
Itanium can execute this loop in a single cycle. If we suppose that
the program is called noploop and that the number of
iterations through the loop is 10^9, then one of the runs
is invoked as follows:
$ pfmon -u -e cpu_cycles,memory_cycle,data_access_cycle noploop
1000423602 CPU_CYCLES
5154 MEMORY_CYCLE
5141 DATA_ACCESS_CYCLE
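Splitting nine events across runs of at most four counters can be sketched as follows. The helper names plan_runs and pfmon_command are hypothetical, and only the first three event names come from the text; the remaining six are placeholders, since the full list of cycle-accounting events is not given here. The sketch balances the groups, which yields three runs of three events each.

```python
import math


def plan_runs(events, counters=4):
    """Split events into the fewest runs that fit in `counters`, evenly sized."""
    if counters < 1:
        raise ValueError("need at least one counter")
    if not events:
        return []
    runs = math.ceil(len(events) / counters)   # minimum number of runs
    size = math.ceil(len(events) / runs)       # balanced group size
    return [events[i:i + size] for i in range(0, len(events), size)]


def pfmon_command(group, program):
    """Build the argument list for one user-level counting run."""
    return ["pfmon", "-u", "-e", ",".join(group), program]


# First three names are from the text; event_4..event_9 are placeholders.
EVENTS = ["cpu_cycles", "memory_cycle", "data_access_cycle"] + [
    f"event_{n}" for n in range(4, 10)
]

if __name__ == "__main__":
    for group in plan_runs(EVENTS):
        print(" ".join(pfmon_command(group, "noploop")))
```

Any grouping with at most four events per run would work equally well; the balanced split is just one reasonable choice.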
Once all three runs are completed it is possible to calculate the
breakdown. Using only three runs to draw conclusions about a program
is usually not recommended because there can be fluctuations between
runs, but we use it to illustrate how pfmon could be
used to construct the breakdown. For noploop, it might
look as follows:
                                    cycles   % of cycles
----------------------------------------------------------------------
1. dependency cycles                  1544         0.00%
2. issue limit cycles                 4402         0.00%
3. data access cycles                 5141         0.00%
4. instruction access cycles        422267         0.04%
5. RSE memory cycles                    13         0.00%
6. inherent execution cycles    1000001803        99.93%
7. branch resteer cycles             20195         0.00%
8. taken branch cycles                6890         0.00%
----------------------------------------------------------------------
Total                           1000462255        99.97%
This output confirms that, indeed, virtually all cycles were
actually spent in the loop (inherent execution) and that no major
stalls were incurred. Furthermore, we can also use it to verify
that each loop iteration executed in one cycle by dividing the number of
cycles by the number of iterations: 1000462255/10^9 ≈ 1.
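The arithmetic behind this check can be reproduced with a few lines of Python. The cycle counts are copied from the breakdown in the text, and the iteration count of 10^9 is the one assumed for noploop; the variable names are chosen for this sketch only.

```python
# Cycle counts per category, as listed in the breakdown.
BREAKDOWN = {
    "dependency cycles": 1544,
    "issue limit cycles": 4402,
    "data access cycles": 5141,
    "instruction access cycles": 422267,
    "RSE memory cycles": 13,
    "inherent execution cycles": 1000001803,
    "branch resteer cycles": 20195,
    "taken branch cycles": 6890,
}

ITERATIONS = 10**9  # loop iterations assumed for noploop

total_cycles = sum(BREAKDOWN.values())
cycles_per_iteration = total_cycles / ITERATIONS
dominant = max(BREAKDOWN, key=BREAKDOWN.get)

if __name__ == "__main__":
    print(f"total cycles:         {total_cycles}")
    print(f"cycles per iteration: {cycles_per_iteration:.4f}")
    print(f"dominant category:    {dominant}")
```

The eight categories sum to the reported total, and the ratio stays within a fraction of a percent of one cycle per iteration.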
[14] Stéphane Eranian. The pfmon performance monitoring tool for
Linux/ia64, June 2001.
ftp://ftp.hpl.hp.com/pub/linux-ia64/.