
Performance Toolbox Version 2 and 3 Guide and Reference


Chapter 17. Response Time Measurement

This chapter provides information about the response time measurement facilities of the Performance Toolbox and the Performance Aide. Except where otherwise noted, the facilities described are available on all platforms supported by the Performance Aide. Monitoring of response times across the network can be done from workstations only.


Introduction

Response time measurement is especially important in a client/server environment and is ideally done on a transaction basis. The problem is that a transaction is an elusive concept. Between client and server, transactions may range from causing a single network transmission with no response to involving a large number of transmissions. In any one customer installation, one or a few typical transactions may be found, and the selected transactions can then be instrumented (possibly through the implementation of the Application Response Measurement API (ARM) described in Application Response Time Measurement (ARM)). Using the response time measurement of a few transaction types representing a large percentage of the actual transactions performed, it is possible to get a feel for the responsiveness of all or most transaction types.

Transaction instrumentation is the most precise vehicle for response time measurement but for this concept to work, the installation must be willing to invest in the analysis of transaction patterns and instrumentation of transaction programs. The installation must also be able to modify the transaction programs, which is not always possible. For example, how does one instrument a standard SQL query program? Because it is expensive, somewhat complex, and often impossible to use the transaction instrumentation concept, other means must be used in an attempt to monitor system responsiveness.

Those other means involve the measurement of the atomic components that, together, add up to the response time of a given transaction. The following is a list of some major steps in a client/server application. Each is followed by some resources that are required by the task and, hence, will influence the response time component if they are scarce.

  1. Client application processes user input (CPU, disk)
  2. Client machine enqueues network request (CPU, adapter, network)
  3. Request is transferred over network (network capacity and speed)
  4. Server enqueues request (CPU, adapter)
  5. Server application processes request (CPU, disk, possibly access to other servers)
  6. Server machine enqueues response (CPU, adapter, network)
  7. Response is transferred over network (network capacity and speed)
  8. Client application processes response (CPU, disk, possibly access to other servers)
  9. Client application sends response to end-user (CPU, terminal network)

PTX can monitor activity counts for all of the resources in the previous list. Disks can also be monitored for the percent of time they are busy, which gives a good feel for their responsiveness, but networks can be monitored only for activity counts. Furthermore, while activity counts for disks can be used to judge how close a disk is to being saturated, the activity counts for one machine's network adapter may have little or no connection to the actual load on the network. Maybe one machine is constantly accessing remote files while another seldom is. The low activity count on the second machine is no guarantee of a fast response when the machine does need remote access.

It only makes the situation worse that, in a typical client/server application, the largest response time component is usually the time it takes to get the request sent and the response back. That is why the IP response time measurement facility was added to PTX. IP response time measurement works by using the low-level Internet Control Message Protocol (ICMP) to send echo requests to selected hosts and measuring the time it takes to get a response back. ICMP was chosen because it does not require an application to be running on the remote host; the protocol is handled by the IP implementation itself.


IP Response Time Measurement

In PTX, IP response time measurement is implemented through a daemon and corresponding contexts in the Spmi data hierarchy. The Spmi starts the daemon as required and dynamically adds contexts for all the remote hosts for which monitoring is started.

IP Response Time Daemon

Measuring of response times is done by a daemon called SpmiResp. If this daemon is not running when the Spmi receives a request for IP response time measurements, it is started by the Spmi library code. The daemon will continue to run until it has been the only user of the Spmi interface for 60 seconds or no data consumer has requested response time data for 300 seconds. When running, the SpmiResp daemon is controlled by an interval timer loop. The interval timer is, by default, set to interrupt the daemon every 10 seconds but the interval value can be changed from the daemon's configuration file /etc/perf/Resptime.cf.

Whenever the daemon is interrupted by the timer, it starts a new cycle, sending one ICMP packet to each host for which response time is being monitored and calculating the response time from the time it takes for the response to come back. The daemon will not attempt to send more frequently than specified by the variable maxrate (see Configuring the SpmiResp Daemon), which defaults to 10 packets per second but can be changed from the daemon's configuration file /etc/perf/Resptime.cf. When sending packets, the daemon attempts to spread its activity evenly over the interval seconds (see Configuring the SpmiResp Daemon) that a cycle lasts. If the number of monitored hosts is too large for all hosts to be contacted within interval seconds without exceeding maxrate, then interruptions by the interval timer are ignored until a full cycle has been completed. The maxrate parameter exists to prevent the measurement of network activity from being distorted by bursts of ICMP packets from the response time monitor.
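The pacing rule in the preceding paragraph can be illustrated with a small calculation. This is only a sketch under stated assumptions, not the daemon's actual code: the function name send_gap_seconds and its parameters are hypothetical, and it assumes the daemon uses the larger of the even-spread gap and the minimum gap permitted by maxrate.

```python
def send_gap_seconds(hosts, interval=10.0, maxrate=10.0):
    """Return (gap, cycle): seconds between packets and total cycle length.

    Hypothetical helper; the defaults mirror the documented defaults of
    interval (10 seconds) and maxrate (10 packets per second).
    """
    if hosts <= 0:
        return 0.0, float(interval)
    even_gap = interval / hosts   # spread packets evenly over the interval
    min_gap = 1.0 / maxrate       # never send faster than maxrate
    gap = max(even_gap, min_gap)
    # A cycle never ends before the interval timer fires, but it may run
    # longer when maxrate prevents all hosts from fitting into one interval.
    cycle = max(float(interval), hosts * gap)
    return gap, cycle
```

With the default values, 50 hosts fit into one 10-second cycle, whereas 200 hosts stretch the cycle to about 20 seconds, so timer interruptions arriving before the cycle completes would be ignored.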

When a response to an ICMP packet is received, the response time is calculated as a fixed point value in milliseconds. In addition, the weighted average response time is calculated as a floating point value using the variable weight (see Configuring the SpmiResp Daemon), which defaults to 75%. The average response time is calculated as weight percent of the previous value of the average plus (100 - weight) percent of the latest response time observation. The value of weight can also be changed from the daemon's configuration file /etc/perf/Resptime.cf.
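The weighted average described above amounts to the following arithmetic. The function name weighted_average is an illustrative choice, not part of PTX; the formula itself is taken directly from the text.

```python
def weighted_average(previous_avg, latest, weight=75.0):
    """Blend the previous average with the latest observation.

    weight is a percentage: weight percent of the previous average plus
    (100 - weight) percent of the latest observation.
    """
    return (weight * previous_avg + (100.0 - weight) * latest) / 100.0
```

For example, with the default weight of 75, a previous average of 40 milliseconds and a new observation of 80 milliseconds give a new average of 50 milliseconds. A higher weight makes the average react more slowly to individual observations.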

IP Response Time Metrics

The following metrics are maintained. Except where noted, all values are floating point values:

resptime The latest observed response time in milliseconds (fixed point value).
respavg The weighted average response time in milliseconds.
below10 The percentage of observations of response time that were less than 10 milliseconds.
below20 The percentage of observations of response time that were less than 20 milliseconds but greater than 10 milliseconds.
below100 The percentage of observations of response time that were less than 100 milliseconds but greater than 20 milliseconds.
above99 The percentage of observations of response time that were at 100 or more milliseconds.
requests A counter value giving the number of ICMP requests sent to the host (fixed point value).
responses A counter value giving the number of ICMP responses received from the host (fixed point value).
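How the four percentage metrics partition a set of observations can be sketched as follows. This is an illustration rather than the daemon's code: the function name is hypothetical, and because the descriptions above do not say where an observation of exactly 10 or exactly 20 milliseconds is counted, the sketch assumes it falls into the next higher bucket.

```python
def bucket_percentages(samples_ms):
    """Return the percentage of observations falling into each bucket."""
    counts = {"below10": 0, "below20": 0, "below100": 0, "above99": 0}
    for rt in samples_ms:
        if rt < 10:
            counts["below10"] += 1
        elif rt < 20:        # assumption: exactly 10 ms is counted here
            counts["below20"] += 1
        elif rt < 100:       # assumption: exactly 20 ms is counted here
            counts["below100"] += 1
        else:                # 100 ms or more
            counts["above99"] += 1
    total = len(samples_ms)
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Under these assumptions the four values always add up to 100 percent once at least one response has been observed.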

Configuring the SpmiResp Daemon

The SpmiResp daemon looks for a configuration file in /etc/perf/Resptime.cf. Three values can be specified in this file. A keyword identifies which value is being set. The keyword must appear in column one of a line and white space must separate the keyword and the value. The three values, as identified by the corresponding keywords are:

interval The interval in seconds between each loop of SpmiResp. Default is 10 second intervals.
maxrate The maximum rate, in packets per second, at which SpmiResp will send ICMP packets. Default is 10 packets per second.
weight The weight a previous value has in finding the weighted average of the response time. Default is 75%.

If no configuration file is found, SpmiResp continues with default control values. The detailed meaning and use of the three values is described in IP Response Time Daemon.
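The file format described above (keyword in column one, white space, then the value) can be read with a few lines of code. The following is a minimal sketch, not the daemon's parser: the function name is hypothetical, and it assumes that unknown keywords, indented lines, and unparsable values are silently ignored, which the text does not specify.

```python
DEFAULTS = {"interval": 10.0, "maxrate": 10.0, "weight": 75.0}

def parse_resptime_cf(text, current=DEFAULTS):
    """Return control values after applying the settings found in text."""
    values = dict(current)
    for line in text.splitlines():
        if not line or line[0].isspace():
            continue                      # keyword must start in column one
        fields = line.split()
        if len(fields) >= 2 and fields[0] in values:
            try:
                values[fields[0]] = float(fields[1])
            except ValueError:
                pass                      # assumption: bad values are ignored
    return values
```

A file containing the two lines "interval 30" and "weight 50" would, under these assumptions, leave maxrate at its default of 10. Passing the currently active values as the second argument mimics a reread of the file, where only the values present in the file are replaced.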

The daemon catches all major signals. All of them, with the exception of SIGHUP, cause the daemon to shut down gracefully. SIGHUP causes the daemon to reread its configuration file. Any value specified in the configuration file replaces whichever corresponding value is currently active.

IP Response Time Contexts

For applications to be able to monitor response time through the Spmi and, ultimately, the RSi interface, the data must be available in context and metric data structures in the Spmi shared memory area. However, it would consume many resources to create these data structures for all hosts in a large network, so the Spmi has been modified to handle response time contexts differently from all other context types. A context for a particular host is not created until some consumer of data refers to that context by its path name.

For a data consumer program to see the IP response time context for a host, either some other program must have created the context, or the program itself must attempt to get to the context through the context's path name. For example, to create (or access if somebody else created it) the context for the host farvel, the application could issue the following call:

SpmiPathGetCx("RTime/LAN/farvel", NULL);

The same effect is achieved by issuing the RSiPathGetCx subroutine call, which ultimately leads to an SpmiPathGetCx subroutine call on the agent host the RSi call is issued against.

This implementation leaves the Data Consumer applications with the responsibility of identifying hosts to monitor for IP response time. This is contrary to all other contexts in the Spmi, which can be instantiated simply by traversing their parent contexts. If an installation wants to make sure all IP response time contexts are created, the sample Data Consumer program in /usr/samples/perfagent/server/iphosts.c, which is also shipped as the executable iphosts, can be executed from the xmservd.res file whenever xmservd starts. This program will take a file with a list of hostnames as input and will issue the SpmiPathGetCx call for each host.

Because IP response time measurement uses the ICMP protocol, the hosts you want to monitor do not need to run the xmservd daemon. All that is required is that they can respond properly to ICMP echo requests. Because of this, response time to any node that talks ICMP - including dedicated routers and gateways - can be measured.

Two PTX applications use their knowledge of hosts in the network to present their users with lists of potential IP response time contexts. The two are xmperf and 3dmon. This is described in Monitoring IP Response Time from xmperf and Monitoring IP Response Time from 3dmon.

Monitoring IP Response Time from xmperf

The xmperf program presents its user with potential IP response time contexts in two situations:

  1. When the user displays the value selection window for the context RTime/LAN.
  2. When a user instantiates a skeleton console that refers to IP response time measurement.

In both cases, the list of potential IP response time contexts includes the hosts whose xmservd daemons responded to invitations from xmperf, as well as hosts for which contexts were added by other means such as by the iphosts program. If the host is known by two names (typically by the short hostname and the full host/domain name) the same host may appear twice in the list. The same may happen if the iphosts program identifies hosts by their IP addresses rather than their hostnames.

Host List in Value Selection Window

When the user wants to add a statistic to an instrument or exchange one statistic value with another, the value selection windows are used to select the new statistic. If the user selects RTime from the top value selection window and from the next window selects LAN, then the new value selection window displayed will show one context for each host that can be monitored. By selecting a host, you automatically create the context for measurement of response time for that host if the context doesn't exist. You then proceed to the value selection window for that context.

Instantiating an IP Response Time Skeleton Console

When a user instantiates an IP response time measurement skeleton console, the list of hosts to select from looks like any other host selection list, except that the list may contain lines that do not show an IP address. Such lines represent hosts for which an IP response time context exists but that have not responded to the xmperf invitation. If instantiation is tried again after the host list has been refreshed, more hosts may have IP addresses shown. This reflects how the host list is created. If the same hostname is available from both the list of hosts that responded to invitations and from existing IP response time contexts, then the entry from the list of responding hosts is used. Also, if a host is a little late responding to an invitation, it may first show up without its IP address. After the response to the invitation finally is received, it will show up with its IP address.

Remounting an IP Response Time Monitor

It is possible to create xmperf instruments that monitor response times for multiple hosts. Even though it looks as if such an instrument receives data_feed packets from multiple hosts, in reality it receives packets from only one host. All values in an instrument must come from the same host and must be defined in the same statset. The Spmi on the measurement host receives ICMP responses (not data feeds) from the monitored hosts and has been instructed to supply the calculated response time data in statsets.

The appearance of an instrument measuring IP response time for multiple hosts, or of a console with multiple instruments that each monitor IP response time for multiple hosts, can be deceiving, though. You might be tempted to change the path of an instrument or the entire console, expecting to be able to select a new list of hosts to monitor. Instead, you will get a list of hosts from which you can select one host; that host will then be the new monitoring (not monitored) host.

Monitoring IP Response Time from 3dmon

The 3dmon program takes a different approach to monitoring IP response time. It does so in a matrix that monitors the response time in both directions. For example, if you elect to monitor response times for three hosts, trist, ked, and nede, then the matrix would look like this:

                        NN
 
                KN              NK
 
        TN              KK              NT
 
nede            TK              KT              nede
 
        ked             TT              ked
 
                trist           trist

The right side of the matrix represents the monitoring hosts, and the left side represents the monitored hosts. Thus, the value represented in the cell marked NT is the response time for a response to an ICMP echo packet sent from host nede to host trist. The response time for a response to an ICMP echo packet sent in the other direction is shown in the cell marked TN. This measuring of response time in both directions allows for the detection of situations where two hosts use different routes to reach each other and it can also be used to pinpoint other anomalies in a network.
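One way to use such bidirectional measurements is to compare the two directions for each pair of hosts. The sketch below is an illustration only and is not something 3dmon does: the function name, the dictionary layout keyed by (monitoring host, monitored host), and the threshold ratio of 2.0 are all assumptions chosen for the example.

```python
def asymmetric_pairs(resp_ms, ratio=2.0):
    """Return host pairs whose two directions differ by at least ratio.

    resp_ms maps (monitoring_host, monitored_host) to a response time in
    milliseconds, so resp_ms[("nede", "trist")] corresponds to cell NT.
    """
    flagged = []
    for (src, dst), forward in sorted(resp_ms.items()):
        if src >= dst:
            continue                      # handle each pair once, skip self
        backward = resp_ms.get((dst, src))
        if backward is None:
            continue                      # only one direction was measured
        low, high = sorted((forward, backward))
        if high > 0 and high >= ratio * low:
            flagged.append((src, dst))
    return flagged
```

A pair flagged this way is only a hint that the two hosts may use different routes to reach each other; other causes, such as load on one of the hosts, are equally possible.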

Because of the way 3dmon monitors IP response time, all of the hosts must be running the xmservd daemon. Therefore, when 3dmon shows a list of hosts to monitor IP response times for, only hosts that responded to invitations from 3dmon are included. The distributed 3dmon.cf sample configuration file includes a configuration set named lanresp for monitoring of IP network response time.


Application Response Time Measurement (ARM)

This section describes the Performance Aide and Performance Toolbox implementation of the Application Response Measurement API (ARM). To see where ARM would be useful, revisit the following list of common steps a client/server transaction goes through:

  1. Client application processes user input (CPU, disk)
  2. Client machine enqueues network request (CPU, adapter, network)
  3. Request is transferred over network (network capacity and speed)
  4. Server enqueues request (CPU, adapter)
  5. Server application processes request (CPU, disk, possibly access to other servers)
  6. Server machine enqueues response (CPU, adapter, network)
  7. Response is transferred over network (network capacity and speed)
  8. Client application processes response (CPU, disk, possibly access to other servers)
  9. Client application sends response to end-user (CPU, terminal network)

An application can be instrumented at many levels. For example, one set of measurements could cover the entire period from the beginning of step 1 to the end of step 9. Another, potentially simultaneous, set of measurements could cover the server side beginning with step 4 or 5 and ending with step 6. The ARM API, in its current, unfinished state, depends on the instrumentation to supply meaningful names to applications and transactions. Without such names, measurement would be impossible.

ARM Contexts in Spmi Data Space

As implemented in PTX, the ARM support is limited by the original design of ARM. The two-level naming structure of PTX, where every context and metric has a short name and a long descriptive name, was not implemented in ARM. This puts a single requirement on how an application is instrumented: the first 31 bytes of the description of an application and of a transaction must uniquely define that application, and that transaction within the application.

Applications are added to the Spmi data hierarchy as contexts. An application with the description "Checking Account Query" will be added as the context:

RTime/ARM/CheckingAccountQuery

A transaction of that application called "Last Check Query" would be added to the previous context as another context level and get a full path name as follows:

RTime/ARM/CheckingAccountQuery/LastCheckQuery

If the transaction was named "Check if any check of this account has bounced during the past 12 months", the full path name of the transaction context would be:

RTime/ARM/CheckingAccountQuery/Checkifanycheckofthisaccounthas

Here the short name is truncated to 31 characters. The full name of the transaction appears in the description of the transaction context, truncated to 63 characters if necessary.
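The naming examples above are consistent with a simple rule: remove the white space from the description and keep the first 31 characters as the short name. The following sketch reproduces the three examples under that assumption. The function names are hypothetical, and how PTX treats characters other than blanks is not stated in the text.

```python
def context_names(description, short_len=31, long_len=63):
    """Return (short_name, long_description) for an ARM description."""
    short = "".join(description.split())[:short_len]
    return short, description[:long_len]

def arm_path(application, transaction=None):
    """Build the Spmi path name of an application or transaction context."""
    parts = ["RTime", "ARM", context_names(application)[0]]
    if transaction is not None:
        parts.append(context_names(transaction)[0])
    return "/".join(parts)
```

Two transactions of one application whose descriptions agree in their first 31 non-blank characters would map to the same path under this rule, which is why the descriptions must be chosen to differ early.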

ARM Transaction Metrics

For each transaction, the following metrics are maintained. All metrics, with the exception of respavg, are fixed point values:

resptime The last measured response time for a successful transaction in milliseconds.
respavg The weighted average response time for successful transactions in milliseconds.
count The number of successful transactions.
aborted The number of aborted transactions.
failed The number of failed transactions.
respmax The maximum response time for successful transactions in milliseconds.
respmin The minimum response time for successful transactions in milliseconds.

To reduce memory usage, no attempt is made to determine the median response time or the 80th percentile.

Implementation Restrictions

The ARM API is not implemented by Performance Aide on the SunOS operating system.

At this point, the arm_update subroutine is implemented as a null function. This is because the current monitors of PTX wouldn't be able to monitor transaction progress in a well-defined manner. This may change in a future version of PTX. Other implementation restrictions are listed under each API function.

For the SpmiArmd daemon to get write access to the Spmi common shared memory, the daemon must be started with root or system group authority. The safest way to make this happen is to make sure the xmservd daemon is always running. This can be accomplished by adding the flag -l0 (lowercase L followed by a zero) to the server's line in /etc/inetd.conf. It is also recommended that the following line be added to the appropriate xmservd.res file, which is used to start data suppliers:

supplier: /usr/bin/SpmiArmd

Because the PTX implementation uses shared memory, and because library code cannot feasibly catch all relevant signals, application instrumentors must make sure that an arm_end call is issued for each active application. If a program exits while applications it defined are still active, the shared memory area will not be released and the SpmiArmd daemon will assume that data is still being supplied and will not attempt to exit. This is not likely to be a major drawback, but if problems arise, it may be necessary to stop the daemon (and all other programs using the Spmi) and clear shared memory manually as described in the PTX Guide and Reference manual.

Library Implementation

The ARM specifications prescribe that the ARM library is shipped as a shared library such as libarm.a or libarm.so. Replacing the installed library with another library with the same interfaces will redirect application subroutine calls to the library installed last. The implementation of ARM in PTX follows these specifications but also allows a customer installation to invoke both an existing ARM library and the PTX implementation of ARM. This is achieved by shipping the following two libraries:

/usr/lib/libarm.a A plain ARM implementation, which does not invoke any pre-existing ARM implementation.
/usr/lib/libarm2.a A replacement library for the plain library.

To use the replacement library, convert the preexisting ARM library by running the following command:

armtoleg /usr/lib/libarm.a /usr/lib/libarmrepl.a > /dev/null

Then copy /usr/lib/libarm2.a over /usr/lib/libarm.a. The replacement library will invoke the ARM functions in /usr/lib/libarmrepl.a before invoking the PTX ARM implementation in the replacement library. This way, a customer installation can continue to use an earlier ARM instrumentation and at the same time take advantage of the PTX implementation of ARM. When the replacement library is used, the behavior of the PTX ARM implementation changes but remains compliant with the ARM specifications. Further information on these subroutines is in the AIX 5L Version 5.1 Technical Reference.

Both PTX libraries depend on the SpmiArmd daemon being started by the Spmi library code. This daemon must run with root authority in order to interface directly with the Spmi common shared memory area. The daemon is described in the SpmiArmd Daemon section.

The library code maintains state in a private shared memory area, which can be accessed with any authority, thus allowing applications to do so through the library calls. The library communicates with the Spmi through the SpmiArmd daemon, which issues standard Spmi subroutine calls. No direct communication takes place between the Spmi and the ARM library.

Run-time Control

The use of two environment variables allows an application to turn both levels of ARM instrumentation on and off. Because they are environment variables, they work for the shell from which the application is executed and have no effect on the execution from other shells. The two environment variables are as follows:


INVOKE_ARM Controls the PTX ARM instrumentation. If the environment variable is not defined or if it has any value other than False, PTX ARM instrumentation is active. If the replacement library is not used, the effect of setting this environment variable to False is that the PTX ARM library will function as a no-operation library, returning zero on all calls.
INVOKE_ARMPREV Controls any ARM instrumentation that can be invoked from the PTX implementation through the replacement library. If the environment variable is not defined or if it has any value other than False, the preexisting ARM instrumentation will be invoked, regardless of whether the PTX ARM instrumentation is active. If the replacement library is not used, this environment variable has no effect.

If both environment variables are set to False, either PTX ARM library will function as a no-operation library, returning zero on all calls.
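The combined effect of the two environment variables and the choice of library can be summarized in a small decision function. This is a sketch of the rules as stated above, not library code: the function name and the returned dictionary are hypothetical, and the comparison assumes that only the exact string False disables a level, because the text does not say whether the value is case-sensitive.

```python
def active_arm_levels(environ, replacement_library):
    """Return which ARM instrumentation levels would be invoked.

    environ is a mapping such as os.environ; replacement_library tells
    whether the replacement library (libarm2.a) is the one installed.
    """
    ptx = environ.get("INVOKE_ARM") != "False"
    # INVOKE_ARMPREV only matters when the replacement library is in use.
    previous = replacement_library and environ.get("INVOKE_ARMPREV") != "False"
    return {"ptx": ptx, "previous": bool(previous)}
```

When both entries of the result are false, the library in use behaves as a no-operation library.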

SpmiArmd Daemon

Collection of application response times is done by a daemon called SpmiArmd. If this daemon is not running when the Spmi receives an SpmiGetCx subroutine call referencing ARM contexts, it is started by the Spmi. The daemon will continue to run until it has been the only user of the Spmi interface for 15 minutes and no data has been received from an instrumented application for 15 minutes. The time to live can be changed from the daemon's configuration file /etc/perf/SpmiArmd.cf. When running, the SpmiArmd daemon is controlled by an interval timer loop. The interval timer is, by default, set to interrupt the daemon every second, but the interval value (see interval in the Configuring the SpmiArmd Daemon section) can be changed from the daemon's configuration file /etc/perf/SpmiArmd.cf.

Whenever the daemon is interrupted by the timer, it empties the entire queue of post structures with the exception of the last entry (see ATake later in this section), using each to update the corresponding metrics. The metrics are updated as follows:

  1. If the post element indicates that the transaction instance completed successfully, response time is calculated. The response time is calculated as a fixed point value in milliseconds. In addition, the weighted average response time is calculated as a floating point value using the variable weight (see Configuring the SpmiArmd Daemon), which defaults to 75%. The average response time is calculated as weight percent of the previous value of the average plus (100 - weight) percent of the latest response time observation. The value of weight can be changed from the SpmiArmd daemon's configuration file /etc/perf/SpmiArmd.cf. In addition, the maximum and minimum response times for this transaction are updated, if required. Finally, the counter of successful transaction executions is incremented.
  2. If the post element doesn't indicate a successful execution, either the aborted or the failed counter is incremented. No other updates occur.
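The two update rules above can be sketched as a single function operating on one transaction's metrics. This is an illustration, not the daemon's code: the function names and the status strings "good", "aborted", and "failed" are hypothetical, and the sketch assumes that the first successful observation seeds respavg, respmax, and respmin, which the text does not state.

```python
def new_metrics():
    """Return an empty metric set for one transaction."""
    return {"resptime": 0, "respavg": 0.0, "count": 0,
            "aborted": 0, "failed": 0, "respmax": 0, "respmin": 0}

def apply_post(metrics, status, elapsed_ms=0, weight=75.0):
    """Update metrics from one post element and return them."""
    if status == "good":
        if metrics["count"] == 0:
            # Assumption: the first observation seeds average, max and min.
            metrics["respavg"] = float(elapsed_ms)
            metrics["respmax"] = metrics["respmin"] = elapsed_ms
        else:
            metrics["respavg"] = (weight * metrics["respavg"]
                                  + (100.0 - weight) * elapsed_ms) / 100.0
            metrics["respmax"] = max(metrics["respmax"], elapsed_ms)
            metrics["respmin"] = min(metrics["respmin"], elapsed_ms)
        metrics["resptime"] = elapsed_ms
        metrics["count"] += 1
    elif status in ("aborted", "failed"):
        metrics[status] += 1              # no other metric is touched
    return metrics
```

Note that aborted and failed transactions leave resptime and respavg unchanged, so those two values describe successful transactions only.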

To free the daemon from the need to lock data in the ARM library's private shared memory area, the following technique is used to control the linked list of post structures. The following three fields in the shared memory control area are used as anchors:

ATake Points to the first post element to be processed by the daemon. After initializing, this field is never updated by the library code. The daemon reads elements starting at ATake and processes them. It stops when the next element has a next pointer of NULL and then sets ATake to point at that element.
AGive Points to an uninitialized post element and is used by the library code to add new post elements. When the shared memory area is first initialized, both ATake and AGive point to an empty element. As new post elements are needed, the library code allocates them or takes them from the AFreePost list. The element pointed to by AGive is then updated. The last step in updating this element is the setting of its next pointer to the newly acquired element, which will have a NULL next pointer.
AFreePost Points to a linked list of post elements. The elements between this pointer and ATake are unused elements and will be reused by the library code. This field is never updated by the daemon. Whenever an element is taken off this list, the AFreePost anchor is updated to point at the next element. Initially, this anchor is set equal to AGive and ATake.
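The pointer discipline of the three anchors can be modeled in a few lines. The sketch below is a single-threaded illustration of the description above, not the PTX implementation: the class and method names are hypothetical, Python object references stand in for shared memory pointers, and a real implementation would also have to consider memory ordering between the library and the daemon.

```python
class Post:
    """One post element; next is None for the uninitialized tail element."""
    def __init__(self):
        self.data = None
        self.next = None

class PostQueue:
    def __init__(self):
        empty = Post()
        # Initially all three anchors point to the same empty element.
        self.a_take = empty
        self.a_give = empty
        self.a_free_post = empty

    def _acquire(self):
        # Library side: reuse an element in front of a_take, else allocate.
        if self.a_free_post is not self.a_take:
            element = self.a_free_post
            self.a_free_post = element.next
            element.data = None
            element.next = None
            return element
        return Post()

    def give(self, data):
        # Library side: fill the element at a_give, then publish it by
        # setting its next pointer as the very last step.
        fresh = self._acquire()
        slot = self.a_give
        slot.data = data
        slot.next = fresh
        self.a_give = fresh

    def take_all(self):
        # Daemon side: process every element except the last one, whose
        # next pointer is still None, and leave a_take pointing at it.
        processed = []
        element = self.a_take
        while element.next is not None:
            processed.append(element.data)
            element = element.next
        self.a_take = element
        return processed
```

Because the library only writes a_give and a_free_post and the daemon only writes a_take, each anchor has a single writer, which is what lets the design avoid locks.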

The daemon catches all viable signals and will exit for all but SIGHUP. When the daemon receives a SIGHUP signal, it rereads the configuration file and re-initializes its control variables to any new values specified in the configuration file.

Configuring the SpmiArmd Daemon

The SpmiArmd daemon looks for a configuration file in /etc/perf/SpmiArmd.cf. Three values can be specified in this file. A keyword identifies which value is being set. The keyword must appear in column one of a line and white space must separate the keyword and the value. The three values, as identified by the corresponding keywords are as follows:

interval The interval in seconds between each loop of SpmiArmd. Default is 1 second.
weight The weight a previous value has in finding the weighted average of the response time. Default is 75%.
timeout The number of seconds the daemon should live with no activity going on. The default is 900 seconds (15 minutes). A value of zero for this parameter will cause the daemon to live forever.

If no configuration file is found, SpmiArmd continues with default control values.
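The effect of the timeout value on the daemon's lifetime can be expressed as a simple predicate. This is a sketch under assumptions, not the daemon's logic: the function name and parameters are hypothetical, and it assumes that the single timeout value applies to both idle conditions described in the SpmiArmd Daemon section.

```python
def should_exit(sole_user_secs, no_data_secs, timeout=900):
    """Decide whether an idle SpmiArmd-like daemon would exit.

    sole_user_secs: seconds the daemon has been the only Spmi user.
    no_data_secs:   seconds since data last arrived from an application.
    timeout:        0 means the daemon lives forever.
    """
    if timeout == 0:
        return False
    return sole_user_secs >= timeout and no_data_secs >= timeout
```

With the default of 900 seconds, the daemon in this model keeps running as long as either another Spmi user is present or an instrumented application has supplied data within the last 15 minutes.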

Monitoring ARM Metrics from xmperf

The xmperf program presents its user with ARM contexts in the following two situations:

  1. When the user displays the value selection window for the context RTime/ARM.
  2. When a user instantiates a skeleton console that refers to measurement of ARM metrics.

ARM Context List in Value Selection Window

When the user wants to add a statistic to an instrument or exchange one statistic value with another, the value selection windows are used to select the new statistic. If the user selects RTime from the top value selection window and from the next window selects ARM, then the new value selection window displayed will show one context for each application that can be monitored. By selecting an application you are taken to the next selection window, which will present a list of the transactions defined for the application. By selecting a transaction context you proceed to the value selection window for that application/transaction context.

Instantiating ARM Skeleton Console

ARM skeleton consoles must be defined for each application you want to monitor because xmperf doesn't support dual wildcards (as does 3dmon). When a user instantiates an ARM skeleton console, a list of transactions within the application is presented for the user to select from. Each line represents an application/transaction context. One or multiple lines can be selected.

Remounting an ARM Monitor

It is not possible to create xmperf instruments that monitor ARM data for multiple hosts. However, consoles can be constructed with instruments that each monitor a different host. ARM instruments can be remounted on different hosts like any other instrument.

Monitoring ARM Metrics from 3dmon

The 3dmon program permits wildcarding in multiple levels. This allows you to create configurations corresponding to those available for file systems. You can choose to be presented first with a list of hosts and then with a list of all application/transaction combinations defined for that host. You can also restrict the list to the application/transaction combinations for a single host.

The resulting 3dmon display will show the host/application/transaction name on the left side and the name of configured metrics on the right side. A configuration set named armresp for monitoring of application response time is available in the distributed 3dmon.cf configuration file.

Sample Applications

Source code for an instrumented application is not supplied with PTX but a modified version of the xmpeek program is shipped as /usr/samples/perfagent/server/armpeek. This program is an instrumented version of the ordinary xmpeek. It creates an application called armpeek and one transaction for each combination of command line flag and hostname. For example, the command armpeek -l myhost will create and measure a transaction called myhost-l; the command armpeek refers to the local host and would create a transaction called localhost; the command armpeek -a server would create the transaction server-a.

