Sky yields big data

06.30.10

by Barbara Jewett

It's not uncommon for computational scientists to generate several terabytes of data during the course of a project. But imagine gathering up to a petabyte of data daily. How will you process that much data? Where will you store it? How will you share the data with colleagues? Those are just some of the questions NCSA is helping to answer as the center partners in large cosmology and astronomy projects.

"Data‐driven discovery requires an extensive cyberinfrastructure that supports data collection and transport to storage sites, followed by data cataloging, integration and analysis," says Thom Dunning, director of NCSA. "This often requires extensive computing resources as well as large data storage facilities."

LSST: Wide, fast, deep

The Large Synoptic Survey Telescope (LSST) will add a new capability in astronomy. The LSST is different from other ground-based telescopes in that it is a wide-field survey telescope and camera that can move quickly around the sky and image the entire sky every three days.

Using an 8.4-meter ground-based telescope, the LSST will, for the first time, produce a wide-field astronomical survey of our universe that tracks its changes over time. Its 3-gigapixel camera, the world's largest digital camera, will provide time-lapse digital imaging of faint astronomical objects. The resulting maps can be used to better understand the nature of the mysterious dark energy that is driving the accelerating expansion of the universe. The LSST will also provide a comprehensive census of our solar system, including potentially hazardous near-Earth asteroids.

"The repetitive exposures of the sky will be combined to create the 'deepest' survey to date," says Ray Plante, the senior research scientist who's leading NCSA's effort. "In this case 'deep' means it is the most sensitive, able to capture the faint light of distant galaxies."

The LSST will be located on Cerro Pachón, a mountain in northern Chile. It will deeply survey the entire visible sky twice a week and record it with the camera. The LSST is scheduled to see first light in 2014, to begin doing science in 2015, and to be in full survey operations by 2016.

"NCSA is designing the large-scale computing, storage and networking infrastructure for the data management system," says Plante. "NCSA will also serve as the main repository, where the data will undergo complete processing and re-processing, and will feed the distribution of LSST data to the community."

The sheer amount of data presents some real challenges. The 15 terabytes of daily raw data translates into approximately 150 petabytes of raw data by program end. The degree of precision and automation that the data management system requires to ensure that every bit of information goes where it is supposed to go is an exciting challenge for the team. And with the Blue Waters petascale computer coming online in 2011, there exists even more potential for evaluation of LSST data. NCSA will also host the permanent data archive.

The software telescope

Testing gravity, studying cosmology and dark energy, searching for signals from other civilizations, studying magnetism in the universe and where it came from...these are just a few of the topics scientists hope to study with the Square Kilometer Array (SKA). SKA's primary objective, however, is to address fundamental questions about the origin and evolution of the universe. Scientists will be able to conduct a very large galaxy survey, including looking at how the distribution of galaxies evolves with cosmological time.

"The SKA project builds on collective advances in extreme-scale computing, astronomy, and advanced astronomical data management to enable leading-edge science in the next decade," says Athol Kemball, a professor in the University of Illinois' astronomy department and the Institute for Advanced Computing Applications and Technology, who also oversees NCSA's SKA activities.

NCSA is leading the calibration and processing team, which is addressing the computing challenges associated with deriving images from such a massive array. The signals from the several thousand antennas involved will need to be calibrated and the data properly processed. Doing so requires developing algorithms as well as providing feedback to antenna designers on the performance of the individual antennas.

The computational processing is so important to this project that some people call it a "software telescope": a telescope whose scientific success will depend on extreme-scale computational software.

Understanding dark energy

The possibility that some cosmic 'dark energy' exists that contributes to accelerated expansion of the universe has been recognized for more than a decade. Scientists now estimate that dark energy, despite the fact that we don't know specifically what it is, makes up 70 percent of our universe. The Dark Energy Survey (DES) will make critical observations that can be used to assess theories about the nature of dark energy and cosmic acceleration.

DES science and technologies

The DES project will use four complementary techniques in two coupled surveys: a 5,000 square degree multiband optical survey of the south galactic cap region, and a 40 square degree time-domain search for supernovae. (A square degree is a unit of solid angle, used to measure areas on the celestial sphere.)
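To give a sense of scale for those survey areas, the short Python sketch below converts square degrees into a fraction of the whole sky. It relies only on the standard geometric fact that a full sphere spans 4π steradians, or about 41,253 square degrees; the function and variable names are illustrative and do not come from the DES project.

```python
import math

# Area of the full celestial sphere in square degrees:
# 4*pi steradians times (180/pi)^2 square degrees per steradian.
FULL_SKY_SQ_DEG = 4 * math.pi * (180 / math.pi) ** 2


def sky_fraction(area_sq_deg: float) -> float:
    """Return the fraction of the whole sky covered by the given area."""
    return area_sq_deg / FULL_SKY_SQ_DEG


# Survey areas quoted in the article.
wide_survey = sky_fraction(5000)
supernova_search = sky_fraction(40)

print(f"Full sky: {FULL_SKY_SQ_DEG:,.0f} square degrees")
print(f"Wide survey: {wide_survey:.1%} of the sky")
print(f"Supernova search: {supernova_search:.2%} of the sky")
```

By this estimate the wide survey covers roughly 12 percent of the sky, while the supernova search covers about a tenth of a percent.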

Fermilab is building an extremely red-sensitive 500-megapixel camera with a data acquisition system fast enough to take images in 17 seconds. The cage containing the system will be mounted at the prime focus of the Blanco 4-meter telescope at the Cerro Tololo Inter-American Observatory (CTIO) in Chile, and the instrument will become a general user instrument available to the astronomical community. The build phase for the camera is nearly complete, with survey operations scheduled for 2011-2016.

The DES data management system (DESDM) will be used to process and archive approximately 200 terabytes (TB) of raw imaging data into a few petabytes (PB) of science-ready data products. Under the leadership of NCSA, the DES collaboration is developing—and will deploy and operate—the DESDM system.

NCSA's high-capacity storage infrastructure will house both the raw and the processed data, and the center is also working on critical middleware and an accessible online archive for the DES data.

As part of the development of the data processing pipeline, NCSA and its DESDM collaborators (Fermilab and the Institut d'Astrophysique, Paris) conduct an annual DES Data Challenge. The challenge tests the prototype processing system using simulated data provided by Fermilab. With each successive challenge, the simulated data becomes more sophisticated and more closely resembles the data that will be gathered starting in 2011.

Processing a single season of data alone will require approximately 120 CPU-years and produce more than 100 terabytes of compressed data products, including catalogs of more than 12 billion objects.
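To put 120 CPU-years in perspective, the following Python sketch estimates how long one season would take on machines of different sizes. It is a rough illustration only: it assumes the work divides perfectly across cores, and the core counts (a 4-core desktop and a 1,000-core cluster) are hypothetical examples rather than figures from the DES project.

```python
def wall_clock_years(cpu_years: float, cores: int) -> float:
    """Ideal elapsed time in years, assuming perfectly parallel work."""
    if cores <= 0:
        raise ValueError("cores must be a positive integer")
    return cpu_years / cores


# Figure quoted in the article for one season of DES data.
CPU_YEARS_PER_SEASON = 120

# Hypothetical machine sizes, chosen only for illustration.
machines = [
    ("4-core desktop (assumed)", 4),
    ("1,000-core cluster (assumed)", 1000),
]

for label, cores in machines:
    years = wall_clock_years(CPU_YEARS_PER_SEASON, cores)
    print(f"{label}: {years:.2f} years (about {years * 365:.0f} days)")
```

Under these assumptions, a 4-core desktop would need about 30 years for a single season, while a 1,000-core cluster would finish in a matter of weeks. Real pipelines rarely scale this cleanly, so the numbers are best read as lower bounds.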

"We tend to talk about networks being fast when we can watch a movie online, but one season of DES data is already as big as Netflix's entire catalog. Processing it once on a desktop computer would take an entire career. Building DESDM means rethinking everything—not only how to store data, but how to process and organize it in a way that enables the community of researchers to interactively explore and analyze it. And let's not forget, the telescopes coming online a few years after DES will produce orders of magnitude more data," says Jim Myers, head of NCSA's cyberenvironements and technologies directorate.