Top problems with the TOP500

11.01.12

William Kramer
Deputy Project Director for Blue Waters
NCSA

The TOP500 list was introduced over 20 years ago to assess supercomputer performance at a time when a major technical challenge was solving dense linear algebra problems on high-end computers. Now, most consider the list a marketing tool or an easy, if simplistic, way to impress politicians, funding agencies, and the naïve, rather than a measure with much technical value.

The TOP500 list ranks systems by floating point performance as measured by a single benchmark, Linpack, which performs a dense matrix calculation. Unfortunately, the TOP500 does not provide comprehensive insight into the achievable sustained performance of real applications on any system. Most compute-intensive applications today stress many features of the computing system, and data-intensive applications require an entirely different set of metrics. Yet many sites feel compelled to submit results to the TOP500 list "because everyone has to."

It is time to ask, "Does everyone have to?" and more specifically, "Why does the HPC community let itself be led by a misleading metric?" Some computer centers, driven by the relentless pressure for a high list ranking, skew the configurations of the computers they acquire to maximize Linpack, to the detriment of the real work to be performed on the system.

The TOP500 list and its associated Linpack benchmark have multiple, serious problems. This column is too short to deal with all of them in detail, but a few of the issues and possible solutions are briefly listed below.

The TOP500 list disenfranchises many important application areas.

All science disciplines use multiple methods to pursue science goals. Linpack deals only with dense linear systems and gives no insight into how well a system works for most of the algorithmic methods (and hence applications) in use today. This lack of relevance to many current methods will get worse as we move from petascale to exascale computers, since the limiting factor for performance on these systems will be memory and interconnect bandwidth.

Possible improvement: Create a new, meaningful suite of benchmarks that better represents achievable application performance. Several Sample Estimation of Relative Performance Of Programs (SERPOP) metrics are in use today, such as the NERSC SSP test series, the DOD Technology Insertion benchmark series, and the NSF/Blue Waters Sustained Petascale Performance (SPP) test. Each uses a composite of a diverse set of full applications to provide a more accurate estimate of sustained performance for general workloads. These composite measures indicate the critical architectural balances a system must have.
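As a rough illustration of how a composite measure can differ from a single-benchmark score, the sketch below combines per-application sustained rates with a geometric mean. The application names, the rates, and the choice of a geometric mean are all assumptions made for illustration; this is not the published definition of SSP, SPP, or any other metric named above.

```python
import math


def composite_sustained_rate(rates_tflops):
    """Geometric mean of per-application sustained rates (TFLOP/s).

    `rates_tflops` maps a hypothetical application name to the sustained
    rate it achieves on the system. A geometric mean is one possible way
    to keep a single fast kernel from dominating the composite.
    """
    if not rates_tflops:
        raise ValueError("need at least one application rate")
    if any(rate <= 0 for rate in rates_tflops.values()):
        raise ValueError("rates must be positive")
    log_sum = sum(math.log(rate) for rate in rates_tflops.values())
    return math.exp(log_sum / len(rates_tflops))


# Hypothetical numbers: system A is tuned for dense linear algebra,
# system B is slower on that kernel but more balanced overall.
system_a = {"dense_solver": 900.0, "climate": 40.0,
            "molecular_dynamics": 60.0, "graph_analytics": 5.0}
system_b = {"dense_solver": 500.0, "climate": 80.0,
            "molecular_dynamics": 100.0, "graph_analytics": 20.0}

for name, rates in (("A", system_a), ("B", system_b)):
    print(name, round(composite_sustained_rate(rates), 1))
```

With these made-up inputs, system A wins on the dense solver alone, while system B scores higher on the composite, which is the kind of reversal a single-benchmark list cannot show.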

There is no relationship between the TOP500 ranking and system usability.

In a number of cases, systems have been listed while being assembled at factories or long before they are ready for full service, leaving a gap of months between when a system is listed and when it is actually usable by scientists and engineers. This undermines the list's claim of historical value and produces misleading reports.

Possible improvement: List only systems that are fully accepted and fully performing their mission in their final configurations.

The TOP500 encourages organizations to make poor choices.

There have been notable examples of systems being poorly configured in order to increase list ranking, leaving organizations with systems that are imbalanced and less efficient. Storage capacity, bandwidth, and memory capacity have been sacrificed in order to increase the number of peak (and therefore Linpack) flops in a system, often limiting the types of applications that it can run and making systems harder for science teams to use. For example, for the same cost, Blue Waters could have been configured to have 3 to 4 times the peak petaflops by using only GPU nodes with very little memory and extremely small storage. This would have made Blue Waters very hard to program for many science teams and severely limited which applications could use it, but it almost certainly would have guaranteed a place at the top of the TOP500 list for quite a while.

Possible improvement: Require sites to fully specify their system capacities and feeds. For example, the amount and speed of memory and the amount and speeds of the I/O subsystems should be reported. This would allow observers to assess how well a system is balanced and would also document how different types of components influence the performance results.
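One way to use such reported capacities is to normalize them by peak floating point rate, as in the sketch below. The `SystemSpec` fields, the ratio names, and the two example configurations are hypothetical and chosen only to show the arithmetic; they do not describe Blue Waters or any real system.

```python
from dataclasses import dataclass


@dataclass
class SystemSpec:
    """Hypothetical set of figures a site might report with a submission."""
    name: str
    peak_pflops: float     # peak floating point rate, PFLOP/s
    memory_pb: float       # total memory capacity, PB
    memory_bw_pbs: float   # aggregate memory bandwidth, PB/s
    io_bw_tbs: float       # aggregate I/O bandwidth, TB/s


def balance_ratios(spec):
    """Capacities and bandwidths per unit of peak floating point rate."""
    peak_flops = spec.peak_pflops * 1e15
    return {
        # bytes of memory per peak flop/s
        "memory_bytes_per_peak_flops": spec.memory_pb * 1e15 / peak_flops,
        # bytes moved from memory per peak flop
        "memory_bw_bytes_per_flop": spec.memory_bw_pbs * 1e15 / peak_flops,
        # bytes of I/O per peak flop
        "io_bytes_per_flop": spec.io_bw_tbs * 1e12 / peak_flops,
    }


# Two made-up configurations at similar cost: one balanced,
# one that trades memory and I/O for peak flops.
balanced = SystemSpec("balanced", 10.0, 1.0, 5.0, 1.0)
flops_heavy = SystemSpec("flops_heavy", 35.0, 0.2, 5.0, 0.1)

for spec in (balanced, flops_heavy):
    print(spec.name, balance_ratios(spec))
```

Lower ratios mean less memory or bandwidth is available to feed each unit of peak compute, which is the imbalance that a peak-driven ranking hides.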

The TOP500 measures the amount of funding for a system—it gives no indication of system value.

The dominant factor for list performance is how much funding a site received for a computer. Who spends the most on a system influences list position as much as (or more than) programming skill, system design, or Moore's Law. Without an expression of cost listed alongside the performance metric it is impossible to understand the relative value of the system, inhibiting meaningful comparisons of systems.
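To make the point concrete, the sketch below ranks a few systems first by raw performance and then by performance per unit cost. The system names, performance figures, and costs are invented for illustration, and "performance per million dollars" is just one possible value measure, not one the list or this column defines.

```python
def rank_by_performance(systems):
    """Names ordered by measured performance, highest first."""
    return [name for name, perf, _ in
            sorted(systems, key=lambda s: s[1], reverse=True)]


def rank_by_value(systems):
    """Names ordered by performance per million dollars, highest first."""
    for name, _, cost in systems:
        if cost <= 0:
            raise ValueError(f"cost for {name} must be positive")
    return [name for name, perf, cost in
            sorted(systems, key=lambda s: s[1] / s[2], reverse=True)]


# Hypothetical entries: (name, performance in PFLOP/s, cost in millions USD)
systems = [
    ("site_x", 10.0, 200.0),   # 0.050 PFLOP/s per $M
    ("site_y", 6.0, 80.0),     # 0.075 PFLOP/s per $M
    ("site_z", 2.0, 20.0),     # 0.100 PFLOP/s per $M
]

print(rank_by_performance(systems))
print(rank_by_value(systems))
```

In this made-up case the two orderings are exact opposites: the largest budget tops the performance ranking, while the smallest system delivers the most performance per dollar.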

Possible improvement: Require all list submissions to provide a system cost. The cost estimate could be the actual cost paid, a cost estimated from pricing tables, or, at worst, a component-wise estimate. The cost of a system contract is often publicly announced by sites, or IDC (or others) can help calculate a typical "street" selling price for most systems. A cost estimate along with a ranking would provide much more insight and value. Remember, in the past every system listing in the NAS Parallel Benchmark reports was required to have a cost estimate.

Adopting these and other improvements would be steps in the right direction if the list continues. However, it is time for the community to agree to replace the TOP500 entirely with new metrics, or multiple lists, that are much more realistically aligned with real application performance. Or the HPC community could just say "No more" and not participate in the list. Many government and industry sites already do this; we just never hear about them (which further limits the use of the list for historical information).

As our HPC community strives for more and more powerful systems, and as we cope with more exotic architectural features that will make sustained application performance harder to achieve on future systems, it is critical that we have measures that guide us and inform our decision making rather than divert our focus and adversely influence our decisions.

Because of the issues discussed here, and with the National Science Foundation's blessing, Blue Waters will not submit to the TOP500 list this fall or any other time. NCSA will continue to pursue new ways to assess sustained performance for computing systems.