PROF(1)

NAME

prof - analyze SpeedShop performance data

SYNOPSIS

prof [options] [speedshop_data_file ...]
prof executable-name [options] [pixie counts file]

DESCRIPTION

prof analyzes one or more data files generated by the SpeedShop performance tools and produces a listing. The second form analyzes data files produced by directly running a pixiefied program.

The current implementation supports the following SpeedShop experiments:

usertime (callstack profiling) causes the program to be interrupted every 30 milliseconds, and to record the callstack at each interrupt. It can show both inclusive and exclusive time. usertime data is statistical in nature, and will show some variance from run to run.

[f]pcsamp[x] asks the kernel to look at the user PC every 10 milliseconds, and record a histogram of the value of the program counter at each clock tick, using 16-bit bins, one for each PC value. It can only show exclusive data, that is data about where the program counter was, not the callstack to get there. The [f] prefix specifies 1 millisecond profiling, instead of 10 milliseconds. The [x] suffix specifies 32-bit count bins, instead of 16-bit. pcsamp data is statistical in nature, and will show some variance from run to run.

ideal causes the code to be instrumented to count the number of times each basic block is executed. (A basic block is a region of the program that can be entered only at the beginning and exited only at the end.) The data recorded also contains a summary of function pointer calls. From this data, a machine model is used to compute exclusive time spent in each function, and, by propagating regular counts to callers of a routine, inclusive time, as well.

Inclusive basic block counting calculates cycles just like regular basic block counting, and then propagates it to all its callers proportionately. The cycles for a procedure obtained using regular basic block counting ("exclusive cycles"), are divided up amongst its callers in proportion to the number of times they called this procedure. For example, if sin(x) takes 1000 cyles, and its callers - procedures foo() and bar() - call it 25 and 75 times respectively, 250 cycles are attributed to foo() and 750 to bar(). By propagating cycles this way, __start() ends up with all the cycles counted in the program. Note that this assumption may not be reasonable, leading to misleading reports. For example, if a matrix multiply routine is substituted for sin(x) in the above example, and bar's calls are for 2X2 matrices, while foo's calls are for 100X100 matrices, the attribution will still distribute 3/4 of the time to bar, whereas nearly all the time really should be attributed to foo

Inclusive time is not normally generated by prof; specifying the -gprof flag will cause prof to perform gprof-like inclusive cycle counting.

*_hwc asks the kernel to look at the user PC every time the hardware performance counter specified by the experiment overflows, and record a histogram of the value of the program counter at overflow. These experiments can only be run on R10000 machines; other machines do not have the hardware performance counters. There are a number of these experiments defined; see speedshop(1). They can only show exclusive data, that is data about where the program counter was, not the callstack to get there. The particular counter used and its overflow value are specified in the experiment. Some of the Hardware Performance counter sampling experiments are statistical in nature, and will show some variance from run to run; others are exact, provided that the program executes the exact same sequence of instructions.

fpe asks the floating-point exception library to trace all floating-point exceptions, with their callstacks. These experiments will show a listing similar to those for usertime experiments, except the data reported is a count of floating-point exceptions, rather than projected CPU time.

OPTIONS

-calipers
[n1,]n2 flag will cause prof to compute the data between caliper points n1 and n2, rather than for the entire experiment. If n1 >= n2, an error is reported, otherwise if n1 is negative it is set to the beginning of the experiment, and if n2 is greater than the maximum recorded, it is set to the maximum. If n1 is omitted, zero is assumed.

-gprof
flag will cause prof to perform gprof-like inclusive cycle counting for ideal experiments.

-dis[assemble]
Disassemble and annotate the analyzed object code with cycle times or number of PC samples.

-S(-source)
Disassemble and annotate the analyzed object code with cycle times (or PC samples) and source code.

-h[eavy]
Reports the most heavily used lines in descending order of use.

-l[ines]
Like -h[eavy], but group lines by procedure, with procedures sorted in descending order of use. Within a procedure, lines are listed in source order.

-q[uit]
n Truncates the -p[rocedures] , -h[eavy] , -l[ines] and -gprof listings. Truncation occurs after the first n procedures or lines have been listed.

-q[uit] n%
Truncates the -p[rocedures] , -h[eavy] , -h[eavy] , -l[ines] and -gprof listings. It only lists those procedures or lines up to the one which takes more than n percent of the total.

-q[uit] ncum%
Truncates the -p[rocedures] , -h[eavy] , -l[ines] and -gprof listings. For -gprof , it behaves the same as -quit n%. For others, it truncates the listing after the last procedure which brings the cumulative total to n percent. For example, " -q[uit] 15 " truncates each part of the listing after 15 lines of text, " -q[uit] 15% " truncates each part after the first line that represents less than 15 percent of the whole, and " -q[uit] 15cum% " truncates each part after the line that brought the cumulative percentage above 15 percent.

-den[sity]
Prints a list of procedures with non-zero instruction cycles sorted by the instruction density, which is the number of cycles-per- instruction.

-nocounts
flag will cause prof to analyze an executable or even a .o file, using the pixie machine model, and assuming each instruction is executed once. Please note that this analysis, in general, can not match any possible real run of any executable which contains one or more conditional branch instructions.

SELECT PROCEDURES/DSOS

-o[nly] procedure_name
If you use one or more -o[nly] options, the profile listing includes only the named procedures, rather than the entire program. If any option uses an uppercase "O" for -O[nly], prof uses only the named procedures, rather than the entire program, as the base upon which it calculates percentages.

-e[xclude] procedure_name
If you use one or more -e[xclude] options, the profiler omits the specified procedure from the listing. If any option uses an uppercase "E" for -E[xclude] , prof also omits that procedure from the base upon which it calculates percentages.

-dsolist
List all the DSO's in the program and their start and end text addresses

-dso dso_name
Apply the analysis only to the named DSO. Only the basename of the DSO need be specified.

CPU

prof normally uses the scheduling model for the processor on which it is being run to perform the analysis. The user can override the default with any of the following options:

{-r10000 | -r8000 | -r5000 | -r4000 | -r3000}

CYCLE TIME

-clock megahertz
Set the CPU clock speed to megahertz MHz. Alters the appropriate parts of the listing to reflect the clock speed. The default value is the clock speed of the machine on which the experiment was performed.

-cycle nanosecond
Set cycle time to nanosecond ns. This is the same as -clock 1000/nanosecond.

MISC

-z[ero]
Prints a list of procedures that were never invoked.

-feedback
Produces files with information that can be used to (a) arrange procedures in the binary in an optimal ordering, and (b) tell the compiler how to optimize compilation of the program next time. Users can use cord(1) for procedure-ordering, and/or invoke "cc(1) -fb <cfb-filename>" for compilation-optimization feedback; see their respective man pages for further information. Procedures are normally ordered by their measured invocation counts; if -gprof is also specified, procedures are ordered using call graph counts, rather than invocation counts. The cord feedback file is named <a.out>.fb or <lib*so*>.fb. This option also produces a file with information that the compiler system can use to recompile, optimizing by using measured branch frequencies, etc. The feedback file is named <a.out>.cfb or <lib*so*>.cfb, and is a binary file. It may be dumped using the fbdump(1) routine. This option can only be used with pixie (ideal- time) data files.

-i[nvocations]
Adds procedure invocation counts to the output report.

-nofilenames
Removes a.out, DSO, and source file names from the listing; useful for scripted analysis of prof output.

OBSOLESCENT OPTIONS

Some additional options are supported:

-pixie
Turn on pixie basic block counting mode. This option is not usually needed, as prof can tell from the data file whether or not it is reading a pixie .Counts file or a SpeedShop ideal-time experiment.

DIAGNOSTICS

prof prints warnings and fatal errors on stderr. With inclusive cycle counting, prof prints a list of functions at the end which are called but not defined. Any functions starting with _rld listed here is normal behavior. They appear because rld is not instrumented.

One way to sanity-check inclusive cycle counts is to look at the percentage cycles for __start(). If the value is anything less than 98- 99%, the inclusive report is suspect. Look for other warnings which indicate that prof didn't take into account certain procedures.

There are a number of known cases when prof fails to list cycles of a procedure in the inclusive listing. The reasons can be one of the following:

- init & fini sections, mips stubs are not part of any procedure.

- calls to procedures which don't use a jump and link are not recognized as procedure calls.

- executions of global procedures with same name in different DSOs. In this case, only one of them is listed.

All these exceptions are listed at the end of the -gprof listing, under a separate section.

NOTE

With both pc-sampling and basic block counting, prof may report times for procedures named with a prefix of *DF*, for example *DF*_hello.init_2. DF stands for Dummy Function, and indicates cycles spent in parts of text which are not in any function, for example, init and fini sections, and MIPS.stubs sections.

If any of the object files linked into the application have been stripped of line-number information (with ld -x for example), prof will warn about the affected procedures. The instruction counts for such procedures are shown as a procedure total, not on a per-basic-block basis. Where a line number would normally appear in a report on a function without line numbers question marks appear instead

prof does not take into account interactions between basic blocks in ideal time runs. prof computes the cycles for one execution of a basic block, assuming all registers are free at entry to the block, and it multiplies this count by the number of times that basic block is executed. In real programs, a block may be entered with a result not yet ready in a register, or with a function unit busy, so the cycle count computed could be either higher or lower than the correct value. Extending the computations to include inter-block state would be prohibitively expensive.

When run on a program using shared libraries, prof will somtimes combine the times of real but anonymous procedures in a shared library into the preceding (according the library memory layout) externally visible function. The times reported for your procedures are not affected by this attribution-error, which is normally minor.

prof cannot be run on programs which have been stripped.

Compiler optimization level 3 will do procedure inlining. This can result in extremely misleading profiles since the time spent in the inlined procedure will show up in the profile as time spent in the procedure into which it was inlined. It is generally better to use compiler optimization level 2 or less when gathering an execution profile.

Fortran alternate entry point times are attributed to the main function/subroutine, since there is no general way for prof to separate the times for the alternate entries.

SEE ALSO

speedshop(1), ssrun(1), ssdump(1), pixie(1), fbdump(1)