Blue Waters one year later: Delivering sustained petascale science

11.09.12

by Thom Dunning, Bill Kramer, and the Blue Waters team

One year ago, the National Science Foundation approved NCSA's request that the sustained petascale computing system being designed and deployed in the Blue Waters project be based on leading-edge technology from Cray, Inc. One year later, we want to bring you up to date on the project and summarize all that has occurred in the last 12 months—a remarkable year that has resulted in the deployment of a computer system that is both the largest ever fielded by Cray and an outstanding resource for the most challenging science and engineering research.

The most basic accomplishment is that manufacturing and installation were completed in approximately nine months. A small development system arrived in late November 2011 and was in use just one week later for a training class. By February 2012, NCSA and Cray had installed a 48-rack Early Science System with both XE (2-socket CPU-CPU) and XK (2-socket CPU-GPU) nodes. Although it was only one-sixth of the full Blue Waters computational system, the Early Science System was more powerful than any other U.S. academic computing system. Fifteen NSF-approved research teams used the Early Science System during its four months of operation, preparing their applications for use on the full Blue Waters system as well as producing new scientific discoveries.

By the beginning of summer, all 315 racks were installed at the University of Illinois' National Petascale Computing Facility—a LEED Gold facility that includes many energy efficient innovations—and full-scale testing began. By the end of September, all the final components were in place and operating. To reach this point, NCSA and Cray had to overcome not only all the usual challenges of an extremely large and complex computing system but also unanticipated challenges, such as the flooding in Thailand that created a worldwide shortage of disk drives.

With all of the equipment in place, everyone's focus turned to scaling the system hardware and software to unprecedented levels. The full Blue Waters system consists of 237 XE6 racks (45,504 AMD Interlagos chips), 32 XK7 racks (3,072 AMD Interlagos chips; 3,072 NVIDIA Kepler chips), and 7 I/O racks. Blue Waters also includes 1.5 PB of high-speed memory: 4 GB per core for the CPU nodes and 6 GB for the GPUs. This unusually large memory makes it easier to use the system for new challenges. If a particular application requires it, all of the memory can be referenced from any core in the system using advanced languages.
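The rack and chip counts above can be cross-checked with a little arithmetic. The sketch below uses only the figures quoted in this article, plus one assumption: that every XE and XK node has two sockets, as described for the Early Science System. Under that assumption the totals come out consistent with the "more than 25,000 nodes" cited for the full-scale benchmarks.

```python
# Figures quoted in the article.
XE6_RACKS, XE6_CPUS = 237, 45_504
XK7_RACKS, XK7_CPUS, XK7_GPUS = 32, 3_072, 3_072
IO_RACKS = 7

# Assumption: every XE and XK node has two sockets, as described for the
# Early Science System (XE = CPU-CPU, XK = CPU-GPU).
SOCKETS_PER_NODE = 2

xe6_chips_per_rack = XE6_CPUS // XE6_RACKS                # 192
xk7_chips_per_rack = (XK7_CPUS + XK7_GPUS) // XK7_RACKS   # 192

xe6_nodes = XE6_CPUS // SOCKETS_PER_NODE                  # 22,752
xk7_nodes = (XK7_CPUS + XK7_GPUS) // SOCKETS_PER_NODE     # 3,072
total_nodes = xe6_nodes + xk7_nodes                       # 25,824

listed_racks = XE6_RACKS + XK7_RACKS + IO_RACKS           # 276
total_cpus = XE6_CPUS + XK7_CPUS                          # 48,576

print(f"chips per rack: XE6={xe6_chips_per_rack}, XK7={xk7_chips_per_rack}")
print(f"nodes: XE6={xe6_nodes}, XK7={xk7_nodes}, total={total_nodes}")
print(f"racks listed: {listed_racks}, CPU chips: {total_cpus}")
```

The three rack types listed add up to 276; the article does not itemize the remaining racks in the 315-rack installation.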

Blue Waters has an interconnect network that is one-third larger than any other deployed Cray system. Blue Waters also has the largest, most robust online storage system in the open research world, with more than 25 petabytes of usable online storage. The sheer size of the online storage system is impressive, but the Cray Sonexion file system also provides sustained average aggregate performance of more than 1.1 TB/s. This unprecedented level of performance is a substantial achievement. The near-line data storage system is also in place and provides more than 300 PB of RAIT-protected data on the floor. To achieve these results, a multi-organizational team of expert computer technologists made hundreds of improvements to existing technologies and developed new capabilities for storage and I/O. Connectivity to the national networks in Chicago began at 55 Gbits/s and will be scaled to hundreds of Gbits/s as needed by the research community using Blue Waters.

Throughout the challenge of fielding a system of this size and complexity, we have been guided by our long-term interactions with the research teams. Consistent with Blue Waters' goal of sustained petascale performance on real science and engineering applications, we are validating the system by using it as a researcher would, with full, real-world applications. We are judging performance based on the elapsed time required to perform real work rather than on simplified benchmarks. Testing is not yet complete, but we would like to highlight some of the amazing accomplishments achieved so far.

Blue Waters has run all the original NSF application benchmarks. Three of these tests are full-scale petascale benchmarks—complete codes whose runtimes range from 16 hours to over 60 hours, on virtually the entire system—more than 25,000 nodes (or 400,000 floating point cores). These benchmarks solve real science problems, and we measure the entire time from the start to the completion of the problem, including all required I/O (both scientific results and defensive I/O) and the time required for checkpoints and restarts if failures occur! We are very pleased to report that Blue Waters is meeting or exceeding expectations for these stringent tests.

The Blue Waters team set an even higher challenge by creating the Sustained Petascale Performance test. The SPP test is a collection of 12 application benchmarks that are complete applications drawn from the research teams that will use the system! Again, these are truly representative tests, as they include all start-up processing, I/O, computation, and post-processing, just as these tasks would be performed for a real science run. We are not timing just parts of the programs or the computationally intensive kernels to judge performance; the SPP is a measure of the real sustained research potential of Blue Waters. The SPP codes run on one-fifth to one-half of the total system, and several also have run at full scale. Four of the codes are used to show that the XK GPU nodes improve sustained performance (and therefore time to solution) and do so using 600 to 1,500 XK7 nodes. Another test, a "small" test by Blue Waters standards, is the largest Weather Research & Forecasting (WRF) simulation ever documented.
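To illustrate why timing the whole run matters, here is a minimal sketch that compares a rate computed over the entire elapsed time with one computed over the compute phase alone. It is not the SPP definition used by the Blue Waters team; the function names, the phase breakdown, and every number are hypothetical, and the only assumption is that a sustained rate is the total floating-point work divided by wall-clock time.

```python
def sustained_rate(total_flop, phase_seconds):
    """Flop/s over the whole run: every phase counts toward elapsed time."""
    return total_flop / sum(phase_seconds.values())


def kernel_rate(total_flop, phase_seconds):
    """Flop/s if only the compute phase were timed."""
    return total_flop / phase_seconds["compute"]


# Hypothetical run, NOT measured Blue Waters data.
phases = {
    "startup": 600.0,       # seconds
    "io": 1800.0,
    "compute": 7200.0,
    "postprocess": 400.0,
}
total_flop = 9.0e18         # hypothetical count of floating-point operations

PFLOPS = 1e15
sustained_pf = sustained_rate(total_flop, phases) / PFLOPS
kernel_pf = kernel_rate(total_flop, phases) / PFLOPS

print(f"whole-run rate:    {sustained_pf:.2f} PF")
print(f"compute-only rate: {kernel_pf:.2f} PF")
```

In this made-up case the compute phase alone would appear to exceed 1 PF, while the rate over the full run stays below it. The whole-run figure is the stricter number, because every second of start-up, I/O, and post-processing counts against it.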

We are pleased to report that four of these codes already run above 1 PF of sustained performance. All 12 of the SPP tests are running at their largest scale, and three are taking good advantage of the XK7 GPU nodes. We expect to report additional achievements and improvements as the SPP benchmarking continues.

Starting in early November, the full Blue Waters system became available to the National Science Foundation-approved science and engineering teams. These "friendly users" have access to the entire system during the availability testing period; their work on the system will help test and evaluate the system and will expedite the teams' ability to use Blue Waters productively as soon as it is in full production.

Blue Waters project staff have also produced over 50 invited and peer-reviewed papers and almost 100 presentations, and have conducted many training and educational activities.

The most impressive achievement, however, is the work of the science and engineering teams with whom Blue Waters collaborates. NSF has now approved 33 science and engineering teams to use Blue Waters through the Petascale Computing Resource Allocations (PRAC) process. All of these teams have been invited to participate in our friendly user period on the full-scale Blue Waters.

As noted previously, 15 of these science and engineering teams used the Blue Waters Early Science System, and many of them achieved substantial scientific results. The teams used the ESS to explore how HIV infects cells, how stars explode, how the most basic constituents of matter behave, and how severe storms occur.

Blue Waters recently began its acceptance testing period to ensure it is ready for all of the challenging use cases of the NSF science and engineering community. We hope to end our deployment stage and enter our full service production phase early next year.

Many people and organizations—including our funders and stakeholders (the National Science Foundation, State of Illinois, and University of Illinois) and our suppliers and sub-awardees (in particular Cray, AMD, Xyratex, NVIDIA, the Great Lakes Consortium for Petascale Computation, INRIA, and the University of Tennessee)—have contributed tremendous efforts to help the Blue Waters project achieve the outstanding progress described above. We have also had enormous help from many of the science teams and look forward to a long and mutually beneficial partnership with them. Finally, we must make a special acknowledgement of the outstanding contributions of all the Blue Waters project staff at NCSA and the University of Illinois who are working 24 by 7 to create the unique Blue Waters system.

For more information, deeper discussions, and future updates about Blue Waters, stop by the NCSA booth at SC12 (#1030), visit http://www.ncsa.illinois.edu/enabling/bluewaters, or follow us on Twitter (@NCSAatIllinois). Also look forward to the Grand Opening Ceremony for Blue Waters on March 28, 2013.