The best things constantly change

11.08.08

With the National Science Foundation's funding of a sustained-petascale computer system, called Blue Waters, the high-performance computing community embarks on new challenges. Access' Barbara Jewett discussed some of the hardware and software issues with the University of Illinois at Urbana-Champaign's Wen-mei Hwu, professor of electrical and computer engineering, and Marc Snir, director of the Illinois Informatics Institute and former head of the university's computer science department.

Q: Dr. Snir, a report you co-edited a few years ago on the future of supercomputing in the United States [Getting Up to Speed: The Future of Supercomputing, National Academies Press] indicated the country was falling behind in supercomputing. With the Blue Waters award, do you feel like we are now getting back to where we need to be?

SNIR: It certainly is an improvement. The part that is still weak is that there has been no significant investment in research on supercomputing technologies, and that is really the main thing we emphasized—you get continuous improvement in computer technology when you have continuous research.

Q: The committee that wrote that report felt that the federal agencies that rely on supercomputing should be responsible for accelerating the advances. Do you feel that will start happening?

SNIR: Supercomputing is important to science, and supercomputing is important to national security. The two go hand in hand: they use the same machines, a lot of times the same software, oftentimes the same applications. Both science and national security are national priorities, and high-end supercomputers are funded by public funds. But I think we have yet to figure out exactly how to manage the supercomputing enterprise. The kind of question we were asking ourselves when we wrote this report is how you should think of supercomputers. Should you think of them as unique systems, such as a large telescope or a weapon system, that have no commercial market and are developed to unique specifications? Or should the U.S. expect the companies to develop supercomputers, with government just buying whatever they develop to win in the marketplace? The answer is probably somewhere in the middle: It is important for supercomputing platforms to leverage the huge investments in commercial IT, but the government needs to fund unique technologies needed for high-end supercomputing. What became clear to us when working on the Blue Waters project is that there is a big investment in hardware, coming from DARPA's High Productivity Computing Systems program. Investments in software companies are very few and very small. Probably there are two or three companies trying to develop software for high-performance computing. That's still a weakness.

Q: Let's talk about the software for a petascale machine. What is the biggest challenge?

SNIR: Scalability is first and foremost. You want to run and be able to leverage hundreds of thousands of processors. Wen-mei can explain it even better than I. Technology now is evolving in the direction where we can get an increasing number of cores on a chip, and therefore an increasing number of parallel processors in the machine. To be able to increase the performance over the coming years, the answer has to be that we will increase the level of parallelism one uses. And that really affects everything—applications that have to find algorithms that can use a higher level of parallelism, the run times, the operating systems, the services, the file systems. Everything has to run not on thousands, not on tens of thousands of processors, but on hundreds of thousands of cores. I expect to see millions before I retire. It's a problem.

Q: So, Dr. Hwu, how are we going to get these processors to run?

HWU: Each chip will have so many cores, and that creates a variety of technology demands. For example, the compilers have to be able to manage all these cores on the chip, the run-times need to coordinate all these chips, and there's a huge reliability challenge. Whenever you have so many parts in a machine, essentially some part of the machine will be failing all the time. So we need to have mechanisms that allow running the machine with some failed parts and being able to maintain the machine without taking the whole system down. Also, applications developers will need to be able to see what the performance is like when they proof the software, and they need to see it to understand the behavior of that software on the system. We have people in the electrical and computer engineering department, in the computer science department, and at NCSA who have a lot of experience working on various parts of this problem, and we all came together and worked on it.
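
[Editor's note: The remark that "some part of the machine will be failing all the time" can be illustrated with a back-of-the-envelope calculation. The sketch below is not from the interview. It assumes independent, identical components with a constant failure rate, so that failure rates add and the mean time between failures (MTBF) of the whole system is roughly the component MTBF divided by the number of components. The five-year component MTBF and the component counts are hypothetical figures chosen only for illustration.]

```python
def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """Approximate MTBF of a system in which any single component failure counts.

    Assumes independent, identical components with constant failure rates,
    so the system failure rate is n times the component failure rate.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


# Hypothetical figure: each component fails on average once every five years.
COMPONENT_MTBF_HOURS = 5 * 365 * 24

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        hours = system_mtbf_hours(COMPONENT_MTBF_HOURS, n)
        print(f"{n:>7} components: about {hours:.2f} hours between failures")
```

Under these assumed figures, a machine with 100,000 such components would see a failure somewhere in the system roughly every half hour, which is why the system has to keep running with some parts failed.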

Q: What other expertise will your respective departments contribute to this project?

HWU: One of the aspects of this machine is that we are going to build this massive interconnect. Marc actually has a lot of experience building this kind of machine, although probably on a smaller scale, when he was working at IBM. And people that make up the electrical and computer engineering department (ECE) have a lot of experience building this kind of machine. Another aspect of this reliability facet that we talked about is that Ravi Iyer with ECE has more than 20 years' experience working with IBM measuring their mainframe failure rate and their component versus systems reliability. I personally focus much more on the microprocessors. I have worked with numerous companies on various microprocessors, and one of the things I specialize in is how do you actually build these microprocessors so that the compilers can use the parallel execution resources on that chip.

SNIR: We have a lot of experience at Illinois on developing parallel run-times, programming languages, and software for high performance. The computer science department has been involved in parallel applications, and large scale applications, assisting in developing the NAMD code just a few years ago. [Editor's note: NAMD is a molecular dynamics simulation code designed for high-performance simulation of large biomolecular systems (millions of atoms). It was developed through a collaboration of Illinois Theoretical and Computational Biophysics Group and Parallel Programming Laboratory.] We've done a lot of work on multicore systems. We certainly have a strong applications team on our campus whose efforts I think we can use, as well as all sorts of professors and graduate students as we are one of the few places that teach scientific computing and teach high-performance computing. So we have the breadth.

Q: We talked about how, in an ideal world, the software and the applications would be quite far along before starting on the hardware. Does that bring another level of complication to the project?

SNIR: Yes. It's a process. You have to analyze your applications, you have to test them on smaller systems, you have to make sure they run flawlessly. There is a lot of work to be done. The academic departments and NCSA are already working together, and we'll work collaboratively with IBM, but there is going to be a lot of work in the broader community. As much as IBM has a wealth of technology, they cannot provide everything you need to run this system. And as much as you want everything to run on this system, you also want compatibility and portability to other hardware, especially the Track 2 platforms. [Track 2 machines are extremely powerful, but their sustained performance is under a petaflop. They are part of NSF's initiative to greatly increase the availability of computing resources to U.S. researchers.]

HWU: Four years is actually very short for building this machine. The reason it is short is that, in order to achieve the performance target, we need to use technology which will take a good deal of time to mature and get into production mode. With microprocessors today, it takes at least four years from design to completion. A benefit of working with the design team right from the beginning is that we can make the design more suitable for the system. What this means is that we do need to work with other Track 2 systems in order to jumpstart some applications. All of these things will have to take place in the next couple of years.

SNIR: One thing you have to keep in mind is we were in discussions when we were working on the proposal, and we have worked with IBM on other projects. We are now clearly increasing the level of activity, but this is not a fundamentally new endeavor.

Q: Is there a particular application that will be focused on first?

SNIR: NSF or committees nominated by NSF will select the applications that are going to get significant [attention initially]. There are numerous areas that have leveraged supercomputing for years and which we know hunger for more supercomputer power.

Q: NCSA typically puts on the floor mature, or relatively mature, technology. People in your position tend to branch off in various directions that are based on research interests. How do we go about reconciling those two attitudes?

HWU: You're right that the academic departments tend to work more on research interests than on a specific installation. There are, however, enough research challenges in this project, since many technology components are still several years out. So it is very different from the previous deployment projects that NCSA has been used to. And that is part of the reason why everybody involved in this project will need to come together and join forces and really combine our expertise to make this work.

SNIR: When you install this kind of machine, it is always one of a kind. You really need to practice balance: balance between using that which is prototype, that which is solid and practiced, and that which is a new real-world solution for a problem that you deal with because there has been no such machine before. You will need in some cases to have new things, which in some cases may fail and in some cases may succeed. All of us have had the experience of working with industry. I and many others on the team have had experience working with DOE labs on their one-of-a-kind machines. You do your best, but in my opinion you do it very carefully by managing risks. And the way you do this is by carefully categorizing what must run in order to have a machine that is of any use, and making sure that, to the greatest possible extent, it is product-quality technology developed by IBM.

NCSA is a leader in providing high-performance computing resources to researchers, and we have a good percentage of the high-performance computing community here at Illinois, so what you do can also change how business is done at the very high end. NCSA should take advantage of that. You have a lever to influence how high-performance computing is done. Buying the machine and getting it to run is really only the first step, because what is really needed is for NCSA, for us collectively, to really become leaders in driving the complete system of high-performance computing.

Q: A process like this changes the state of affairs for everybody. What are some of the likely candidates for those disruptive moments we are going to encounter?

SNIR: They are likely to come while working on new programs, new programming languages, and new programming models. The big impediment to these changes is: "Will my program run everywhere? Am I willing to invest in order to write my program in languages that will not be supported everywhere?" But if a language is supported by several of the topmost machines in academia, developers will probably make that investment, so we'll need to work with the Track 2 teams.

Q: What are we far along on that will be a benefit to us as we move through the coming years?

HWU: From a hardware point of view, part of the reason we selected IBM as our partner is that IBM has a long history of engineering large-scale, reliable systems. So that's a very good start. IBM is also a huge technology company in the sense that it has access to things, and their experience with managing new technology and managing the more sophisticated technology to achieve some of the goals really will give us a leg up in this process. That's probably the best part in my mind as I look to the future as to what is going to be a big help.

SNIR: The great thing is that we are partnering with a company that has a lot of technology depth and has been building these kinds of machines for more than 10 years. Technology changes and never seems to stop, especially when it comes to the scale of these things, but the best things do constantly change.


Blue Waters is supported by the National Science Foundation through awards ACI-0725070 and ACI-1238993.