Access in scientific computing

10.20.15

It's a delight to be invited to write this editorial, especially as NCSA Access magazine’s name is coincidentally appropriate to this topic. Much of the recent policy discussion around computing and scientific discovery has centered on access: access to methods, code, data, results, and even hardware. In this article I’ll argue that access is germane to the scientific research process, and that the ongoing discussion of how we get there is one in which all stakeholders in the supercomputing community need to engage.

The use of computation in scientific research is widespread to the point of near-ubiquity, and the type and sophistication of both the tools and the problems they address are rapidly evolving. The advantages of computation, however, come with a significant increase in the complexity of the research we carry out, and they raise questions about the nature of computational discovery itself.

Computational science needs standards for transparency in dissemination. The scientific discovery process has traditionally been characterized by two methodological branches, the deductive and the empirical. Over the last two decades a third branch, computational science, has been postulated, and a fourth branch for data-driven discovery is now regularly mentioned as well. The structured process of research and dissemination behind the first and second branches is a response to the ubiquity of error: error can creep in anywhere in scientific research, and our primary effort as scientists is to identify and root it out. Consequently, both the deductive and empirical branches have associated standards for the dissemination of results that permit transparency and verification. For example, mathematical results are published with their accompanying proofs, and empirical studies are conducted using the machinery of hypothesis testing and published with a structured methods section. The longstanding check is the independent verification and replication of the findings.

The credibility gap in computational science must be addressed with similar standards of transparency and replication. This means establishing standards of code and data release with published results, and including workflow information, software tests, and execution information necessary for replication.
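As one illustration of what “execution information necessary for replication” might look like in practice, the following minimal Python sketch records a small manifest alongside a result. The function names (`build_manifest`, `sha256_of_bytes`), the manifest fields, and the example command are assumptions made for illustration; they are not an established standard or anything prescribed by the workshop reports discussed here.

```python
import hashlib
import json
import platform
from typing import Dict, Sequence


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(code: bytes, data: bytes, command: Sequence[str]) -> Dict[str, object]:
    """Collect illustrative execution information for a published result.

    The choice of fields is hypothetical: interpreter version, platform
    description, the command that was run, and checksums of the code and
    input data used.
    """
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "command": list(command),
        "code_sha256": sha256_of_bytes(code),
        "data_sha256": sha256_of_bytes(data),
    }


# Stand-in code and data; a real workflow would read these from files.
example_code = b"print('hello')"
example_data = b"1,2,3\n"

manifest = build_manifest(example_code, example_data, ["python", "analysis.py"])
print(json.dumps(manifest, indent=2, sort_keys=True))
```

A manifest like this could be released with the code and data so that a later reader can tell whether they are running the same inputs in a comparable environment. It does not by itself capture library versions, compiler flags, or scheduler settings, which matter at least as much on large systems.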

In the summer of 2014, a workshop I co-organized in conjunction with the XSEDE14 conference attempted to address some of these issues in the high-performance computing context. The resulting workshop report summarized the discussion and identified two core needs: “(1) delivering and maintaining a robust cyberinfrastructure that enables reproducible research at scale; and (2) promoting a culture of reproducibility within the broad community of stakeholders.” You can view workshop information, including the final report, online. The final report builds on a previous workshop report, “Setting the Default to Reproducible: Reproducibility in Computational and Experimental Mathematics,” emerging from a 2012 Institute for Computational and Experimental Research in Mathematics (ICERM) workshop.

As the supercomputing community begins to address these issues, parallel discussions are occurring in Washington, D.C., jumpstarted by the 2013 Executive Memorandum and Executive Order issued by the White House mandating open access to publications and open access to data for federally funded research. In May of this year the National Science Foundation released a report entitled “Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science” aimed at addressing issues of reproducibility in research.

What can we do?

There are a number of current opportunities for input at the national policy level. The Computer Science and Telecommunications Board within the National Academies is conducting an ongoing project “Future Directions for NSF Advanced Computing Infrastructure to Support U.S. Science in 2017-2020,” with an interim report released in late 2014. On July 29, 2015, the White House released another Executive Order (“Creating a National Strategic Computing Initiative”) with a goal of creating a “coordinated Federal strategy in HPC research,” which includes the strategic objective of:

increasing the capacity and capability of an enduring national HPC ecosystem by employing a holistic approach that addresses relevant factors such as networking technology, workflow, downward scaling, foundational algorithms and software, accessibility, and workforce development.

Access to workflows and software, for example, is a vital part of producing reproducible research. We should consider how to further reproducibility of computational science within the outlines of the initiative. For example, we need to provide funding for:

  • the development of appropriate cyberinfrastructure environments that permit the capture of workflows and machine states during the research process, and a better understanding of which portions of data and code to capture and share to enable reproducibility;
  • determining appropriate standards for software tests and documentation to be delivered with code associated with published results;
  • additional allocations on supercomputers specifically to provision for validation and verification of software and models, and uncertainty quantification of model inference and data;
  • converging on standards for the citation of data and code when reused in subsequent research, and on reward and recognition for such citations;
  • the ability to automatically check the reproducibility of computational results (does the code do what the author purports?);
  • understanding intellectual property rights for publicly supported research, including for code and data (see, for example, "The Legal Framework for Reproducible Scientific Research: Licensing and Copyright");
  • and providing support for the independent replication of key computational findings after the research has already been carried out.
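To make the idea of automatically checking reproducibility a little more concrete, here is a minimal Python sketch that recomputes a result and compares it with the published values within a tolerance. Everything in it is an assumption for illustration: the `is_reproduced` helper, the toy `mean_and_variance` computation, the sample data, and the choice of a relative tolerance are hypothetical and do not describe any existing verification service.

```python
import math
from typing import List, Sequence


def is_reproduced(published: Sequence[float],
                  recomputed: Sequence[float],
                  rel_tol: float = 1e-9) -> bool:
    """Return True if every recomputed value matches its published value.

    Values are compared with a relative tolerance because floating-point
    results can legitimately differ slightly across machines and compilers.
    """
    if len(published) != len(recomputed):
        return False
    return all(math.isclose(p, r, rel_tol=rel_tol)
               for p, r in zip(published, recomputed))


def mean_and_variance(samples: Sequence[float]) -> List[float]:
    """Toy stand-in for an author's analysis: sample mean and variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return [mean, variance]


# Values as they might appear, rounded, in a paper (hypothetical).
published = [2.5, 1.6666666667]

# Rerun the released code on the released data.
recomputed = mean_and_variance([1.0, 2.0, 3.0, 4.0])

print(is_reproduced(published, recomputed, rel_tol=1e-6))
```

A check of this kind only establishes that rerunning the released code on the released data yields the reported numbers. Whether the code does what the author purports in a deeper sense still depends on software tests, documentation, and independent replication.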

These are a few ideas for a potential research program to facilitate reproducibility in computational science. I believe such a research program to enable reproducible and robust research is essential to the future of supercomputing.

Victoria Stodden
Associate Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign and NCSA faculty affiliate