A better understanding

09.17.10

Building the world's fastest computer for open research also requires developing the applications that will allow researchers to do their science on the machine. Bill Gropp, the chief applications architect for Blue Waters, tells Access' Barbara Jewett what that work involves and how it feeds into the future of high-performance computing.

Let's start with Blue Waters; tell me about your role in that.

I have several roles in Blue Waters. I am one of the co-PIs so I'm involved in monitoring the project and participating in various ways in what we're doing and what we need to do, with particular focus on the software. I'm also supervising a couple of projects that are looking at pieces that will contribute to that. One project I'm leading is porting libraries that either some of the PRAC-approved applications use or, in a couple of cases, libraries that we expect applications to need. PRAC is the National Science Foundation's Petascale Computing Resource Allocation program.

How long does it take to develop a library?

Well, it depends on how much you need to do. If what you want to do is just port and see what trouble you have, that doesn't really take very long. If you really want to tune the library to make good use of the hardware, then that can take months.

We also are looking at I/O libraries. Approaches that work on a thousand nodes, such as having every processor write out its own file, are not a good way to use the I/O system when you have 10,000 nodes and hundreds of thousands of cores. It's also not a good way to do your analysis, with hundreds of thousands of files. Some applications have each process write out one file per variable, and now we're talking about millions of files; there are better ways. There's no real reason to do it that way other than that in the past most I/O systems were not capable of doing better, and if you only had a few hundred or a few thousand files you could get away with doing your I/O that way. There are I/O libraries that are up to the task of providing the application developers with workable alternatives. That's one component of what we try to do: help the applications developers.
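To make the file-count arithmetic and the shared-file alternative concrete, here is a minimal sketch. It is not the I/O library used on Blue Waters: the "ranks" are simulated in a loop inside one Python process, and the names `file_count`, `write_block`, `demo`, and `BLOCK_BYTES` are hypothetical. It only illustrates the idea that each process can write its block at its own offset in one shared file instead of creating a file of its own.

```python
import os
import tempfile

BLOCK_BYTES = 8  # hypothetical fixed-size record written by each process


def file_count(processes, variables):
    """Files created when every process writes one file per variable."""
    return processes * variables


def write_block(path, rank, payload):
    """Write one rank's record at its own offset in a single shared file."""
    if len(payload) != BLOCK_BYTES:
        raise ValueError("payload must be exactly BLOCK_BYTES long")
    with open(path, "r+b") as f:
        f.seek(rank * BLOCK_BYTES)
        f.write(payload)


def demo(num_ranks=4):
    """Simulate num_ranks writers sharing one file; return the file contents."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Pre-size the shared file so every rank has a slot.
        with open(path, "wb") as f:
            f.truncate(num_ranks * BLOCK_BYTES)
        # Write in reverse order to show that arrival order does not matter.
        for rank in reversed(range(num_ranks)):
            write_block(path, rank, bytes([rank]) * BLOCK_BYTES)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    # Assumed figures for illustration: 300,000 processes, 10 variables each.
    print(file_count(300_000, 10), "files with one file per process per variable")
    print(len(demo(4)), "bytes in one shared file written by 4 simulated ranks")
```

On a real machine the same offset calculation would sit behind a parallel I/O library rather than plain `seek` and `write` calls, which is what lets the file system coordinate many writers efficiently.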

The other project that we're looking at is support for programming models. We like to provide people with alternatives so they can make good use of the system. The most obvious is the hybrid programming model that mixes MPI with threads—either OpenMP or pthreads. And we'll also be providing good quality parallel languages that support what's called the partitioned global address space (PGAS) programming model. One of these is an extension to C called Unified Parallel C or UPC. So that you can define, for example, an array that is distributed across the parallel machine and easily access any part of that array. If this distributed data structure fits your application, it actually makes it easier to write different types of codes. Also, because it's a language, the compiler can perform optimizations that a library can't. IBM is doing this with their xlupc compiler and they already have some nice results. Of course, we don't expect users to rewrite their applications in a new language, so it will be possible to mix programming models in a single application. For example, you can write a module in UPC and link it with your MPI program. We'd like to see applications consider some performance-critical piece of code that would be suitable for UPC and try it out, linking with their MPI-based application.
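The distributed array described above can be pictured with a small sketch. This is plain Python, not UPC, and it runs in one process; the class name `BlockDistributedArray`, the term "place" for a node or thread that owns part of the data, and the block layout are all assumptions made for illustration (UPC lets the programmer choose the layout). The point is only that the program uses one global index while the storage is split into per-owner pieces.

```python
class BlockDistributedArray:
    """Toy model of a PGAS-style array: one global index space,
    storage split into contiguous blocks owned by different places."""

    def __init__(self, length, num_places):
        if length <= 0 or num_places <= 0:
            raise ValueError("length and num_places must be positive")
        self.length = length
        self.num_places = num_places
        # Ceiling division: every place owns at most `block` elements.
        self.block = -(-length // num_places)
        self.parts = [[0] * self._local_len(p) for p in range(num_places)]

    def _local_len(self, place):
        start = place * self.block
        return max(0, min(self.block, self.length - start))

    def locate(self, i):
        """Map a global index to (owning place, offset within that place)."""
        if not 0 <= i < self.length:
            raise IndexError(i)
        return divmod(i, self.block)

    def __getitem__(self, i):
        place, offset = self.locate(i)
        return self.parts[place][offset]

    def __setitem__(self, i, value):
        place, offset = self.locate(i)
        self.parts[place][offset] = value


if __name__ == "__main__":
    a = BlockDistributedArray(10, 4)
    a[9] = 42          # global index; the array finds the owner
    print(a.locate(9), a[9])
```

In a real PGAS language the lookup in `locate` is done by the compiler and runtime, and an access to an element owned elsewhere turns into communication. That is where a compiler can apply optimizations a library cannot.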

Is that something new and novel? It sounds like something not commonly done.

It's not commonly done. UPC has been around for quite a while, there are some users, there's a book on it, there are a number of groups that have built compilers and so forth, and it's been used on other supercomputers, but it hasn't really become mainstream. And I'm not sure it will become mainstream, but we want to make it easier and more feasible for groups to experiment with it and see if it will help solve some of their problems.

I wouldn't say that this is the first time support for UPC is available, but I would say this is the first time any supercomputing center has made it a goal that UPC would be available and interoperable with other parallel programming models, including MPI. We can use it to help better understand how we can move past the current MPI barrier in a way that meets the needs of applications.

I'd be the first one to say that it is relatively easy to define a new programming model that is exactly what you need for your application. The reason that MPI has been so successful is that the MPI programming model has been what almost everyone has needed to do their science. It is harder than people sometimes think to create a programming model that is applicable to everyone. By making it possible for people to experiment with other approaches without having to start over, we're probably going to give them programming alternatives that emphasize parallelism. We're going to give everyone, not just developers, a programming model better suited to where we go next.

As we go forward with exascale, the codes are going to have to change, aren't they?

Yes, an exascale system is going to have far more concurrency than the petascale systems. We're looking at 300,000 cores at least for a sustained petascale system. At exascale you're more likely to be looking at tens of millions of threads of concurrency, and you're going to have to manage those in some way. And you are reaching scalability problems. If you think about the fact that messages between different parts of the machine will take different amounts of time to arrive because nodes are not the same distance apart, you start ending up with variations in timing that at a thousand nodes are really irrelevant but at a million nodes can become very important. And in the typical MPI model, programs assume that they have control of all cores all the time, and that each phase of the computation and communication will take the same amount of time.

But with exascale you start having a more dynamic response to when things get scheduled, when things get done. You can do that in MPI but it is not as easy, and there is no reason to still have to do that programming. So we need to start looking for models that will help us forge that track without sacrificing some of the things that MPI has given us: the ability to in fact express the algorithms, to provide information on performance-critical issues like data locality, and so forth. That way we can gain more experience on how we put these things together.

I think we will see more hybrid models of programming so that we don't use one model everywhere. Many programmers don't like the hybrid programming model because it is hard to use. But the reason that the hybrid model is hard to use is that the parts have not been designed to fit well together. So the real problem is not that it is a hybrid model, the problem is that it is not a well-designed hybrid model.

An exascale system is likely to be hybrid hardware, partly because an exascale system will have to be maybe two orders of magnitude more power efficient than Blue Waters, and Blue Waters is pretty power efficient. To do that, we are not going to be able to afford general-purpose cores. We might be able to have lower-powered general-purpose cores, but then we have to have even greater concurrency because that's your tradeoff.

So what I think we will see is that the exascale system is likely to have heterogeneous hardware: there will be hardware that is specialized for, say, control flow, hardware that is specialized for streams of data, hardware that is specialized for vector operations that are different than what you get with streams; there might be hardware that is specialized for minimizing data motion. And you'll have to program all these things. All that sounds pretty frightening, but trying to do it with a uniform programming model that hides everything from the user won't work. So what will work, then, is a programming model that tries to minimize the programmer's pain and makes it as easy as possible for the programmer to work with the different hardware components. I think that is not as bad as some people might think. And we have some beginning experience with this with the work that is going on here and elsewhere on the use of GPUs. I don't expect an exascale system to have GPUs attached to nodes like the current systems, but I wouldn't be surprised if the features of GPUs become part of what is inside an exascale processor chip. It won't be an extra chip on the side, it will be within the processor chip. The software and the tools that we're starting to develop will help us understand what we need to know to use that part of an exascale system.

What are some other things that we're doing here at NCSA and the University of Illinois that go along with this?

There's a lot that is going on. There's fundamental work in computer architecture for the hardware that's going on between the computer science and electrical and computer engineering departments. There's work with programming models, there's work with tools to help you understand the performance that you're getting. One of the other problems to date has been that to a large extent a lot of the software work has been an art rather than engineering. In many applications, there is a trial and error approach to improving performance. So one of the things we are doing as part of the Blue Waters project and in these departments is trying to develop a better understanding of performance, better tools for modeling the performance of applications, better tools for applying the transformations needed to improve performance. With any aspect of engineering there is always an art to it, but it needs to be more systematic, more quantitative.

And so with the Blue Waters project we have several groups helping to develop analytic models of their applications' performance, which can help guide us and identify where the biggest gap is between the performance you should be getting and the performance you are getting. There are other efforts that are looking at tools that could be applied across the whole application so if, for example, you want to change a data structure, you can change it everywhere in your application. Such changes are something that computers are good at. And Wen-mei Hwu, another co-PI and GPU expert, is developing tools for using GPUs so you can use analytic models and understand how a code should perform. That also helps identify where the greater bottlenecks are, which affects how you may design the code. And these tools will become ever more important as the architecture becomes more complex and more specialized.
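As a rough illustration of what an analytic performance model can look like, here is a minimal sketch. It is not one of the Blue Waters groups' models; it assumes a simple bound in which a kernel's run time is limited either by arithmetic or by data movement, whichever is slower. The function names and all the numbers in the usage lines are hypothetical.

```python
def predicted_seconds(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Lower bound on run time: the slower of compute and data movement."""
    compute = flops / peak_flops_per_s
    memory = bytes_moved / bandwidth_bytes_per_s
    return max(compute, memory)


def limiting_resource(flops, bytes_moved, peak_flops_per_s, bandwidth_bytes_per_s):
    """Name the resource that the model says bounds the kernel."""
    compute = flops / peak_flops_per_s
    memory = bytes_moved / bandwidth_bytes_per_s
    return "memory" if memory > compute else "compute"


def performance_gap(measured_seconds, predicted):
    """How many times slower the code runs than the model says it should."""
    return measured_seconds / predicted


if __name__ == "__main__":
    # Hypothetical kernel: 2e9 floating-point operations, 8e9 bytes moved,
    # on a node assumed to peak at 1e10 flop/s and 4e9 bytes/s.
    bound = predicted_seconds(2e9, 8e9, 1e10, 4e9)
    print(limiting_resource(2e9, 8e9, 1e10, 4e9), bound)
    print(performance_gap(5.0, bound))
```

A large gap points at where tuning effort is likely to pay off, and the limiting resource suggests what kind of change (less data motion or fewer operations) to try first.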

Some people say that by the time the Blue Waters operations end in 2016, we'll be ready for exascale, others that we'll be working with petascale for a very long time. Do you want to gaze into your crystal ball and say when we might be moving on?

When operation of Blue Waters reaches its predicted end, we won't be at exascale. Exascale is roughly 100 times faster than Blue Waters, and I just don't think we'll be there. Certainly not with the kind of architecture that we're looking at now.

I think that exascale is possible. There's a lot to do in terms of hardware research and software research. But I think we will go through another turn of an intermediate system, probably around 2015-2016, that might be 100 petaflops. There are people who would like to see an exascale system by 2018. That's very aggressive. It is doable, but you'll have to sacrifice either cost or power.

Sometimes I think Blue Waters may be the last homogeneous general purpose processor system, because to get much past this without more power and without a lot more money you are going to have to give up something. You might give up the homogeneous parts and that would allow you to put a lot more computational power into the same footprint and the same power envelope, but it would be essentially like having a new edition, a new system. But that is going to require the right software tools. At exascale there is a risk of spending all of your energy moving data, not actually doing computations with it. And that will require building algorithms that are very limited in the data they would move, much more so than we do now.

How does IACAT play into some of the things we're doing here at Illinois?

The Institute for Advanced Computing Applications and Technologies provides a way to connect NCSA staff with the rest of campus, looking at some of these questions. We have several projects that are looking in general at advanced computing. Not all of them are looking at exascale, but three of them are looking at the petascale to exascale issues. One of those is looking at use of computers in applications. So that brings the knowledge here, and the access and the expertise in the applications, to the work that's being done in GPU systems, so those tools can actually be applied to real applications in petascale situations. There's another project that is doing similar things, but within a different programming model, that provides a more dynamic approach to the use of several of these processors to address the problems of systems that evolve over time. And again, working with real applications as compared to working with benchmarking and testing codes. And there's a third project that is looking more at algorithms with a focus on multiscale problems. To get to exascale, or even to trans-petascale, we'll need to rethink the algorithms. We need different algorithms, not for the code problems, but for the problems we'll put on an exascale system. And those problems tend to have many components and parts.

We're talking about extreme-scale machines, but not everybody has an extreme-scale type of problem to solve. For those researchers who need computational power to solve their problems, but are not Blue Waters or extreme-scale users, what sort of computing resources will be available in the future for them?

Well, even though they won't be using exascale they'll benefit from solving the power problem for it. The processing cores are not going to get much faster, so the only way you make a processor faster is if you provide more parallelism on it. Your laptop, in a couple of years, might have 32, 64, or even 128 processor cores on it. Doubling will be the only way to get more computational power. And even now I think all laptops have at least two cores, so everybody has to deal with parallel processing. A lot of the techniques that we are developing make use of different levels of parallelism. And when you are looking at parallelism on Blue Waters, it's not equal threads; there is a hierarchy to the parallelism. So we have to understand how to make use of the eight cores on each chip and the 32 cores on each module. All that work will, in a few years' time, be of good use in your laptop.
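The hierarchy mentioned above (eight cores per chip, 32 cores per module, so four chips per module) can be sketched as simple index arithmetic. The flat core numbering and the function names below are hypothetical and are not how Blue Waters actually numbers its cores; the sketch only shows how two cores can be "close" or "far" depending on the level of the hierarchy they share.

```python
CORES_PER_CHIP = 8      # from the interview
CORES_PER_MODULE = 32   # from the interview
CHIPS_PER_MODULE = CORES_PER_MODULE // CORES_PER_CHIP  # derived: 4


def locate_core(flat_index):
    """Map a hypothetical flat core number to (module, chip, core)."""
    module, within_module = divmod(flat_index, CORES_PER_MODULE)
    chip, core = divmod(within_module, CORES_PER_CHIP)
    return module, chip, core


def sharing_level(a, b):
    """Report the closest level of the hierarchy that two cores share."""
    module_a, chip_a, _ = locate_core(a)
    module_b, chip_b, _ = locate_core(b)
    if module_a != module_b:
        return "different modules"
    if chip_a != chip_b:
        return "same module"
    return "same chip"


if __name__ == "__main__":
    print(locate_core(45))
    print(sharing_level(0, 7), "|", sharing_level(0, 8), "|", sharing_level(0, 32))
```

Code that knows which level two threads share can keep tightly coupled work on one chip and push looser coupling out to other modules, which is the same reasoning that applies to a multicore laptop.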

So we'll all benefit from Blue Waters.

We'll all benefit from a better understanding of how to make good use of specific levels of parallelism. Just the very top level of parallelism atop the whole thing is something that only people who have the most demanding problems will have to worry about. But the tools that we're developing will help the whole software stack. Sometimes looking at the big picture gives you a better way to understand how to solve individual pieces.