PROF(1)
NAME
- prof - analyze SpeedShop performance data
SYNOPSIS
- prof [options] [speedshop_data_file ...]
- prof executable-name [options] [pixie counts file]
DESCRIPTION
- prof analyzes one or more data files generated by the SpeedShop
performance tools and produces a listing. The second form analyzes data
files produced by directly running a pixiefied program.
The current implementation supports the following SpeedShop experiments:
usertime (callstack profiling) causes the program to be interrupted every
30 milliseconds, and to record the callstack at each interrupt. It can
show both inclusive and exclusive time. usertime data is statistical in
nature, and will show some variance from run to run.
[f]pcsamp[x] asks the kernel to look at the user PC every 10
milliseconds, and record a histogram of the value of the program counter
at each clock tick, using 16-bit bins, one for each PC value. It can
only show exclusive data, that is data about where the program counter
was, not the callstack to get there. The [f] prefix specifies 1
millisecond profiling, instead of 10 milliseconds. The [x] suffix
specifies 32-bit count bins, instead of 16-bit. pcsamp data is
statistical in nature, and will show some variance from run to run.
ideal causes the code to be instrumented to count the number of times
each basic block is executed. (A basic block is a region of the
program that can be entered only at the beginning and exited only at the
end.) The data recorded also contains a summary of function pointer
calls. From this data, a machine model is used to compute exclusive time
spent in each function, and, by propagating regular counts to callers of
a routine, inclusive time, as well.
Inclusive basic block counting calculates cycles just like regular basic
block counting, and then propagates it to all its callers
proportionately. The cycles for a procedure obtained using regular basic
block counting ("exclusive cycles"), are divided up amongst its callers
in proportion to the number of times they called this procedure. For
example, if sin(x) takes 1000 cyles, and its callers - procedures foo()
and bar() - call it 25 and 75 times respectively, 250 cycles are
attributed to foo() and 750 to bar(). By propagating cycles this way,
__start() ends up with all the cycles counted in the program. Note that
this assumption may not be reasonable, leading to misleading reports.
For example, if a matrix multiply routine is substituted for sin(x) in
the above example, and bar's calls are for 2X2 matrices, while foo's
calls are for 100X100 matrices, the attribution will still distribute 3/4
of the time to bar, whereas nearly all the time really should be
attributed to foo
Inclusive time is not normally generated by prof; specifying the -gprof
flag will cause prof to perform gprof-like inclusive cycle counting.
*_hwc asks the kernel to look at the user PC every time the hardware
performance counter specified by the experiment overflows, and record a
histogram of the value of the program counter at overflow. These
experiments can only be run on R10000 machines; other machines do not
have the hardware performance counters. There are a number of these
experiments defined; see speedshop(1). They can only show exclusive
data, that is data about where the program counter was, not the callstack
to get there. The particular counter used and its overflow value are
specified in the experiment. Some of the Hardware Performance counter
sampling experiments are statistical in nature, and will show some
variance from run to run; others are exact, provided that the program
executes the exact same sequence of instructions.
fpe asks the floating-point exception library to trace all floating-point
exceptions, with their callstacks. These experiments will show a listing
similar to those for usertime experiments, except the data reported is a
count of floating-point exceptions, rather than projected CPU time.
OPTIONS
- -calipers
- [n1,]n2
flag will cause prof to compute the data between caliper points n1
and n2, rather than for the entire experiment. If n1 >= n2, an
error is reported, otherwise if n1 is negative it is set to the
beginning of the experiment, and if n2 is greater than the maximum
recorded, it is set to the maximum. If n1 is omitted, zero is
assumed.
-gprof
- flag will cause prof to perform gprof-like inclusive cycle counting
for ideal experiments.
-dis[assemble]
- Disassemble and annotate the analyzed object code with cycle times
or number of PC samples.
-S(-source)
- Disassemble and annotate the analyzed object code with cycle times
(or PC samples) and source code.
-h[eavy]
- Reports the most heavily used lines in descending order of use.
-l[ines]
- Like -h[eavy], but group lines by procedure, with procedures sorted
in descending order of use. Within a procedure, lines are listed in
source order.
-q[uit]
- n
Truncates the -p[rocedures] , -h[eavy] , -l[ines] and -gprof
listings. Truncation occurs after the first n procedures or lines
have been listed.
-q[uit] n%
- Truncates the -p[rocedures] , -h[eavy] , -h[eavy] , -l[ines] and
-gprof listings. It only lists those procedures or lines up to the
one which takes more than n percent of the total.
-q[uit] ncum%
- Truncates the -p[rocedures] , -h[eavy] , -l[ines] and -gprof
listings. For -gprof , it behaves the same as -quit n%. For others,
it truncates the listing after the last procedure which brings the
cumulative total to n percent. For example, " -q[uit] 15 "
truncates each part of the listing after 15 lines of text, " -q[uit]
15% " truncates each part after the first line that represents less
than 15 percent of the whole, and " -q[uit] 15cum% " truncates each
part after the line that brought the cumulative percentage above 15
percent.
-den[sity]
- Prints a list of procedures with non-zero instruction cycles sorted
by the instruction density, which is the number of cycles-per-
instruction.
-nocounts
- flag will cause prof to analyze an executable or even a .o file,
using the pixie machine model, and assuming each instruction is
executed once. Please note that this analysis, in general, can not
match any possible real run of any executable which contains one or
more conditional branch instructions.
SELECT PROCEDURES/DSOS
-o[nly] procedure_name
- If you use one or more -o[nly] options, the profile listing includes
only the named procedures, rather than the entire program. If any
option uses an uppercase "O" for -O[nly], prof uses only the named
procedures, rather than the entire program, as the base upon which
it calculates percentages.
-e[xclude] procedure_name
- If you use one or more -e[xclude] options, the profiler omits the
specified procedure from the listing. If any option uses an
uppercase "E" for -E[xclude] , prof also omits that procedure from
the base upon which it calculates percentages.
-dsolist
- List all the DSO's in the program and their start and end text
addresses
-dso dso_name
- Apply the analysis only to the named DSO. Only the basename of the
DSO need be specified.
CPU
prof normally uses the scheduling model for the processor on which it is
being run to perform the analysis. The user can override the default
with any of the following options:
{-r10000 | -r8000 | -r5000 | -r4000 | -r3000}
CYCLE TIME
-clock megahertz
- Set the CPU clock speed to megahertz MHz. Alters the appropriate
parts of the listing to reflect the clock speed. The default value
is the clock speed of the machine on which the experiment was
performed.
-cycle nanosecond
- Set cycle time to nanosecond ns. This is the same as -clock
1000/nanosecond.
MISC
-z[ero]
- Prints a list of procedures that were never invoked.
-feedback
- Produces files with information that can be used to (a) arrange
procedures in the binary in an optimal ordering, and (b) tell the
compiler how to optimize compilation of the program next time.
Users can use cord(1) for procedure-ordering, and/or invoke "cc(1)
-fb <cfb-filename>" for compilation-optimization feedback; see their
respective man pages for further information.
Procedures are normally ordered by their measured invocation counts;
if -gprof is also specified, procedures are ordered using call graph
counts, rather than invocation counts. The cord feedback file is
named <a.out>.fb or <lib*so*>.fb.
This option also produces a file with information that the compiler
system can use to recompile, optimizing by using measured branch
frequencies, etc. The feedback file is named <a.out>.cfb or
<lib*so*>.cfb, and is a binary file. It may be dumped using the
fbdump(1) routine. This option can only be used with pixie (ideal-
time) data files.
-i[nvocations]
- Adds procedure invocation counts to the output report.
-nofilenames
- Removes a.out, DSO, and source file names from the listing; useful
for scripted analysis of prof output.
OBSOLESCENT OPTIONS
- Some additional options are supported:
-pixie
- Turn on pixie basic block counting mode. This option is not usually
needed, as prof can tell from the data file whether or not it is
reading a pixie .Counts file or a SpeedShop ideal-time experiment.
DIAGNOSTICS
- prof prints warnings and fatal errors on stderr. With inclusive cycle
counting, prof prints a list of functions at the end which are called but
not defined. Any functions starting with _rld listed here is normal
behavior. They appear because rld is not instrumented.
One way to sanity-check inclusive cycle counts is to look at the
percentage cycles for __start(). If the value is anything less than 98-
99%, the inclusive report is suspect. Look for other warnings which
indicate that prof didn't take into account certain procedures.
There are a number of known cases when prof fails to list cycles of a
procedure in the inclusive listing. The reasons can be one of the
following:
- init & fini sections, mips stubs are not part of any procedure.
- calls to procedures which don't use a jump and link are
not recognized as procedure calls.
- executions of global procedures with same name in different DSOs.
In this case, only one of them is listed.
All these exceptions are listed at the end of the -gprof listing, under a
separate section.
NOTE
- With both pc-sampling and basic block counting, prof may report times for
procedures named with a prefix of *DF*, for example *DF*_hello.init_2. DF
stands for Dummy Function, and indicates cycles spent in parts of text
which are not in any function, for example, init and fini sections, and
MIPS.stubs sections.
If any of the object files linked into the application have been stripped
of line-number information (with ld -x for example), prof will warn about
the affected procedures. The instruction counts for such procedures are
shown as a procedure total, not on a per-basic-block basis. Where a line
number would normally appear in a report on a function without line
numbers question marks appear instead
prof does not take into account interactions between basic blocks in
ideal time runs. prof computes the cycles for one execution of a basic
block, assuming all registers are free at entry to the block, and it
multiplies this count by the number of times that basic block is
executed. In real programs, a block may be entered with a result not yet
ready in a register, or with a function unit busy, so the cycle count
computed could be either higher or lower than the correct value.
Extending the computations to include inter-block state would be
prohibitively expensive.
When run on a program using shared libraries, prof will somtimes combine
the times of real but anonymous procedures in a shared library into the
preceding (according the library memory layout) externally visible
function. The times reported for your procedures are not affected by
this attribution-error, which is normally minor.
prof cannot be run on programs which have been stripped.
Compiler optimization level 3 will do procedure inlining. This can
result in extremely misleading profiles since the time spent in the
inlined procedure will show up in the profile as time spent in the
procedure into which it was inlined. It is generally better to use
compiler optimization level 2 or less when gathering an execution
profile.
Fortran alternate entry point times are attributed to the main
function/subroutine, since there is no general way for prof to separate
the times for the alternate entries.
SEE ALSO
- speedshop(1), ssrun(1), ssdump(1), pixie(1), fbdump(1)