Tuning up applications | National Center for Supercomputing Applications at the University of Illinois

Tuning up applications

11.01.10 -

An Illinois team headed by computer science professor Laxmikant Kale is helping scientists tune their applications for Blue Waters, even before the hardware exists. He chatted with Access' Trish Barker about that process.

Q. You've been involved with the Blue Waters project since the beginning, stretching all the way back to the proposal process. Tell me about the aspects of the project you're working on.

A. There are three projects that I'm involved with, and they are all related. The first one has to do with deploying a programming model that the Parallel Programming Laboratory has developed over the years, which is embodied in Charm++ and Adaptive MPI. We are working to efficiently implement this model on Blue Waters.

The second project has to do with the fact that you have a huge machine and people need to tune their applications to that machine, but we don't have access to that machine now, obviously. Usually tuning can't begin until the machine is deployed, so in the early months of deployment applications are running at lower efficiency. To avoid that we are providing a performance-tuning environment based on the programming model we've developed. Using this environment, you can run a program as if it is running on the full machine, even though you may be using just one-tenth the number of processors. This infrastructure is called BigSim. It was developed with support from the National Science Foundation, and our object here is to deploy it for Blue Waters users.

BigSim is not so much a performance prediction system as it is a way to identify performance bottlenecks. We can give predictions of how quickly a simulation can be performed but it's really more useful to see what the potential issues are and how we can work around them.

The molecular dynamics application NAMD is already using BigSim in the process of tuning to the full-scale Blue Waters machine, and several other applications are exploring the use of it. NAMD is special because my group is a co-developer of NAMD. We've been working on it since the 1990s, and it was one of the first applications to use Charm++, and now it's also one of the earliest applications to use BigSim.

The third thing that I'm involved with is support for scaling NAMD to the full machine, especially for large molecular simulations involving tens of millions to hundreds of millions of atoms.

All of this is a good example of how the computer science department and NCSA can work together. My team is able to provide valuable tools and expertise, and Blue Waters gives us an ideal platform to further develop and demonstrate those tools.

Q. Let's talk about the first project, the programming model. What is a programming model?

A. A programming model defines how programmers should think about the machine. It defines an abstraction for the programmer to write to. It also specifies the way in which a computation is divided into its component entities and how they interact with each other. The distinguishing feature of Charm++ and Adaptive MPI is that the programmer does not have to worry about the number of processors, the programmer doesn't write to the processors. Instead they break the simulation into data and work units, and then the runtime system assigns those to the processors. This gives the system flexibility to deal with issues like dynamic load balancing and automatic fault tolerance, without the programmer having to do anything about it.

Charm++ provides a good division of work between the programmer and the system. The programmer should decide how to parallelize, but the system should decide who does what when. This is a contrast to MPI, in which a programmer decides what every processor is doing.

Q. What is the benefit of using this programming model?

A. Mainly it relieves the programmer of resource management issues. It can get much better performance from applications that have load-balance issues. The programmers also don't have to write check-point code to deal with fault tolerance, and much stronger methods of fault tolerance than simple check point/restart become possible automatically.

It also supports efficient interoperability. Basically, what you want is two modules to share the processors in such a way that idle time from one module can be used adaptively for productive computation in the other module. This is not possible if you are statically choreographing what the processor is doing, like you do in MPI. In CHARM++ and Adaptive MPI, the processor is scheduled according to availability of data. Thereby you can use the idle time for productive work by another module.

Because the objects are being moved around by the system, the runtime system can take advantage of its knowledge of the interconnection topology and keep communication on nearby objects, without programming intervention.

Q. Tell more about how you can use BigSim to tune applications to use a machine that doesn't even exist yet.

A. The first step is you have to be able to run your program at scale. If we have only 30,000 processors available, but we have a 300,000-processor job, we can emulate the correct scale on the smaller number of processors. This allows you to identify bugs in your program that appear only at scale.

Doing an accurate simulation while running the user's code would be prohibitively computationally expensive. Instead we do an emulation and record all the dependencies between computation blocks and messages—then you don't need the code or the simulation data anymore. Everything is captured in these traces between computation and messaging. This isn't feasible for every application, but it is feasible for most science and engineering applications.

In addition to NAMD, we're using BigSim for a turbulence code and for MILC.

Q. So instead of using a complete application, you use a skeleton emulation that traces the actions taking place in the real application. Communication A triggers computation Z, etc.

A. Yes. You run in an emulation mode and you get traces from that emulation mode and you run those traces through a simulator, which takes the machine characteristics and communication into account. How fast is the processor, how fast is the interconnect, what is the topology? You can bring all of that information into the simulation, but the application itself has been abstracted into these dependencies.

The simulation model provides a multi-resolution capability to analyze performance. Depending on how much detail you want you can get more accurate modeling of the network that shows contention or a simple model that just shows latency and bandwidth. Obviously the latency-only model is computationally cheaper to run, so if you aren't likely to suffer from contention, you can use the simpler model. We also have a multi-resolution approach to predicting the execution of sequential execution blocks, using simple scaling all the way to cycle-accurate simulators.

Q. What is contention?

A. Usually when you send a message from one node to another, the time it takes is determined by the number of bytes and the bandwidth of the network and a few simple things like that. But when everyone is sending messages to everyone else, the processors contend for the bandwidth. Applications with a large number of messages can suffer from this contention. So the network simulator has been very helpful for that purpose.

Q. What compute resources do you use for the performance simulations?

A. We are using the BluePrint system at NCSA [a cluster composed of 120 IBM POWER5+ nodes, each having 16 cores and 64 GB of memory] and TeraGrid machines at Oak Ridge and the Texas Advanced Computing Center.

Q. How many processors do you need to emulate a run on the full 300,000-core Blue Waters system?

A. We suggest using one-tenth the cores, but it's highly variable depending on how memory-intensive your application is. Because NAMD is not very memory-intensive, we can make do with a modest number of processors. We have tricks in the works for dealing with large memory applications.

Q. How much investment is required to prep applications to use BigSim?

A. For MPI applications they do have to convert to Adaptive MPI. So there is some effort involved on the application side. It depends on the application, but in most applications it's a weeklong effort. For MILC [the MIMD Lattice Computation code, used for studying subatomic particles] it was an afternoon.

Q. Do you expect this type of simulation to be used more widely in the future?

A. We hope that this will become more common and that BigSim will be used for other future machines.

Even when the Blue Waters system is up and running, optimizing an application for a whole machine will be costly and resource intensive. Instead, using BigSim and our techniques will allow better management of resources.

For more information: http://charm.cs.uiuc.edu/

Team members
Eric Bohm
Laxmikant Kale
Celso Mendes
Ryan Mokos
Gengbin Zheng