Orchestrated move of scientific data minimizes user impact of Blue Waters upgrade

06.13.17

The National Center for Supercomputing Applications (NCSA) at the University of Illinois Urbana-Champaign recently finished upgrading the parallel disk storage subsystem of the Blue Waters supercomputer, which stores data from ongoing science applications.

The Lustre open-source file system is responsible for writing the data into blocks on the disks, and its latest release changes how the data is physically written—requiring the entire 9.6 petabytes of data on the system to be reorganized.

Engineers at NCSA had four options for orchestrating this reorganization. After consulting with users of the supercomputer, represented by the Blue Waters Science and Engineering Advisory Committee, they picked the one that would take the most effort.

"To be least impactful on the science teams, we took the hardest method for us, one where we had to do a lot of software development," said Bill Kramer, principal investigator for the Blue Waters project. "This software had to move the data multiple times. It had to make sure what we moved was correct and worked properly."

NCSA Senior Technical Program Manager Michelle Butler led the effort. She said the easiest option would have been to do nothing and let the system carry on without the most up-to-date Lustre fixes and features.

"We have a lot of disks (more than 17,000) in this one sub-system on Blue Waters, and sometimes they fail. Part of the upgrade we completed enables NCSA to rebuild the data and regain top performance of the disk environment faster when a disk fails," Butler said. She added that the disk subsystem, with 36 petabytes of raw capacity, can hold up to 25 petabytes of usable scientific data at any one time, and hosts an 11-petabyte RAID-6 data protection scheme to eliminate data loss from single points of failure.

Another option the team did not pursue involved wiping all data on the file system and asking the scientists to recompute it—erasing what in some cases would be a year's worth of work. Option three—moving data to the separate "tape system," wiping the now empty disk system, then moving all the data back onto it—would have taken Blue Waters entirely offline for weeks.

Butler and her team spent almost a year developing and testing software that focused on upgrading the system in sections.

"While the system was up and available for users, so was ALL their data, making the scientists' time productive," Butler said.

The annual review panel convened by the main funding agency for Blue Waters, the National Science Foundation, celebrated the improvement in its 2017 site visit report.

They wrote: "The panel is impressed by the scope and amount of continuing overall hardware and software system improvements in the third period to 'keep the system up-to-date,' especially the challenging Lustre version update."

What the software did

The software that orchestrated the data transfer between the three individual units in the storage disk subsystem started by moving all 120 million files from the first unit, named "Home," into another partition in the second, named "Project." The hardware that normally held one unit now held two, actively storing new scientific results from the supercomputer to be accessed at any time by researchers.

With Home's former space now cleared, it was upgraded, and then the partition living on borrowed space on Project was moved back onto its original hardware. All the data normally stored on Project then went to Home to make way for Project's upgrade, moving about 300 million files in the process.

A program written by Butler's team checked its own work as data was moved, recording a checksum (a number calculated from the data) for each data file before moving it and again afterward. Any bytes whose checksums did not match were recopied, and an additional program then double-checked the results.
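
The checksum-and-recopy idea described above can be illustrated with a minimal Python sketch. This is not NCSA's software: the article does not say which checksum algorithm or tooling the team used, so SHA-256 and the function names `file_checksum`, `copy_with_verification`, and `audit` are assumptions made purely for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()  # assumed algorithm; the article names none
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verification(source: Path, destination: Path,
                           max_attempts: int = 3) -> str:
    """Copy source to destination and recopy if the checksums differ."""
    expected = file_checksum(source)  # checksum before the move
    destination.parent.mkdir(parents=True, exist_ok=True)
    for _ in range(max_attempts):
        shutil.copyfile(source, destination)
        if file_checksum(destination) == expected:  # checksum after the move
            return expected
    raise OSError(
        f"checksum mismatch for {source} after {max_attempts} attempts"
    )


def audit(pairs):
    """Independent second pass: return the pairs whose checksums still differ."""
    return [
        (src, dst) for src, dst in pairs
        if file_checksum(src) != file_checksum(dst)
    ]
```

In this sketch, `copy_with_verification` plays the role of the self-checking mover and `audit` stands in for the separate program that double-checked the results. A production tool working at the scale of hundreds of millions of files would also need parallelism and restart handling, which are omitted here.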

After the majority of data was copied, Blue Waters and all user activity were paused for maintenance to make sure the final portion of data being copied was the most recent.

Once its original hardware was upgraded, the Project partition's data move was stalled to coincide with another move on the third unit, as both movements caused a brief amount of downtime for Blue Waters. Lining them up limited the total downtime experienced by scientists to a few tens of hours.

Transferring 360 million files off the third unit, "Scratch," was a bit more complicated than the process for the first two units. Data was compressed into one half of the unit, and storage hardware was taken away from the other half and set up as a separate and temporary unit. This temporary unit received all the crunched data, Scratch was upgraded, and then all data was expanded and reorganized onto it. Hardware was also reinstalled.

NCSA evaluated whether a "rebalance" of the data on Scratch was required, since one half of the file system might have ended up more heavily loaded with data than the other in the newly upgraded disk environment. After one month, however, the system had balanced itself, with data spread evenly throughout the Scratch subsystem.

National Science Foundation

Blue Waters is supported by the National Science Foundation through awards ACI-0725070 and ACI-1238993.