Doing with less | National Center for Supercomputing Applications at the University of Illinois

Doing with less

11.09.12

by Barbara Jewett

Researchers used NCSA's Forge to improve computational protein structure modeling.

Proteins, those biological workhorse molecules that catalyze reactions, hold cells together, move muscles, and duplicate DNA, are nothing without their shape. In fact, that three-dimensional shape, called the protein structure, is a major determinant of a protein's function.

"That's why there's an important need to know what the native structures of proteins are so scientists can better understand how they work," explains Justin MacCallum, a junior fellow at the Laufer Center for Physical and Quantitative Biology at Stony Brook University. His team of researchers—including post-doc Alberto Perez and led by Laufer Center director Ken Dill—used NCSA's Forge and its GPU/CPU architecture in their quest to develop computational techniques for determining protein structure. Understanding the mechanisms of proteins helps advance fundamental biology knowledge and can also lead to practical applications, such as improved drug designs and designs of novel enzymes.

With the retirement of Forge, the work will transfer to Keeneland at the Georgia Institute of Technology. Additionally, the Laufer protein folding team received an award for two million hours on the new Titan GPU/CPU supercomputer at Oak Ridge National Laboratory. Both are XSEDE-allocated resources.

Proteins are made up of chains of amino acids held together by peptide bonds. The chains then fold, turning into their functional shape, or structure. While the chain sequences are fairly well understood, scientists still have much to learn about how they fold and their resultant structure.

"A lot of people put a tremendous amount of time and effort into exploring protein structures," MacCallum says. "The main technique they use is X-ray crystallography. But it has one really big limitation. And that is, it's very, very difficult to get many proteins to crystallize. So even though there are extremely interesting proteins that we would like to know more about, such as membrane proteins or proteins containing intrinsically disordered regions, it is often difficult to obtain crystals that can then be examined with X-ray crystallography."

MacCallum and his team hope to make it easier to determine structures in cases where crystals cannot be obtained. They are developing theoretical techniques that can be combined with experimental techniques other than crystallography to get protein structures. The team uses a hybrid approach that combines detailed molecular dynamics simulations with distance restraints derived from bioinformatics, evolution, and experiments. The restraints restrict the size of the conformational space to be searched and make the computation tractable on current computer hardware. They're hoping, he says, to find that sweet spot where researchers can use experiments to tell them something about the structure, and then use computational modeling to fill in the gaps.
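The article does not say what functional form the team's restraints take. As a rough illustration only, the sketch below assumes a flat-bottom harmonic penalty, a common choice in restrained molecular dynamics: the penalty is zero while two atoms sit within an allowed distance window and grows quadratically outside it. The function names, the force constant `k`, and the `(i, j, lower, upper)` restraint layout are all hypothetical and are not taken from the Laufer Center's software.

```python
import math


def flat_bottom_restraint_energy(pos_a, pos_b, lower, upper, k=1.0):
    """Penalty for one distance restraint between two atoms (illustrative).

    Returns 0 when the distance lies inside [lower, upper] and a
    quadratic penalty outside that window. Units are arbitrary here.
    """
    d = math.dist(pos_a, pos_b)  # Euclidean distance (Python 3.8+)
    if d < lower:
        return 0.5 * k * (lower - d) ** 2
    if d > upper:
        return 0.5 * k * (d - upper) ** 2
    return 0.0


def total_restraint_energy(coords, restraints, k=1.0):
    """Sum the penalty over restraints given as (i, j, lower, upper)."""
    return sum(
        flat_bottom_restraint_energy(coords[i], coords[j], lo, hi, k)
        for i, j, lo, hi in restraints
    )


# Hypothetical three-atom example: the first restraint is satisfied,
# the second is violated because the atoms are too far apart.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
restraints = [(0, 1, 2.0, 4.0), (0, 2, 1.0, 2.0)]
penalty = total_restraint_energy(coords, restraints)
```

In a simulation, a term like this would be added to the physical energy function, so that conformations violating the experimental or bioinformatics-derived distances become unfavorable and the search concentrates on a much smaller region of conformational space.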

Challenging folds

Currently, it is extremely difficult to get the structure of an unknown protein working only from its sequence. The problem, says MacCallum, is that even with GPUs, supercomputers still aren't fast enough, as most proteins take longer to fold than can be simulated. Computer simulations can be done in the hundreds of nanoseconds to microseconds timescale, but most proteins take milliseconds or longer to fold.

MacCallum's application runs about 100 times faster on GPUs than CPUs, so Forge "really lets us tackle problems we couldn't address before," he notes, adding that what would take years to run on a CPU cluster is reduced to just weeks on Forge.
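The figures quoted above can be checked with simple arithmetic. The sketch below takes the roughly 100-fold GPU speedup at face value and uses round numbers from the text for the timescales. The two-year CPU runtime is a made-up input for illustration, not a figure reported by the project.

```python
WEEKS_PER_YEAR = 52


def gpu_runtime_weeks(cpu_years, speedup=100.0):
    """Weeks needed on GPUs for a job that takes `cpu_years` on CPUs."""
    return cpu_years * WEEKS_PER_YEAR / speedup


def timescale_gap(fold_time_ns, simulated_time_ns):
    """How many times longer folding takes than the simulated span."""
    return fold_time_ns / simulated_time_ns


# A hypothetical two-year CPU job shrinks to about one week.
weeks = gpu_runtime_weeks(2)

# Folding in 1 millisecond versus 1 microsecond of simulated time:
# the simulation covers only a thousandth of the folding process.
gap = timescale_gap(fold_time_ns=1_000_000, simulated_time_ns=1_000)
```

The second number is why raw computing power alone does not close the gap: even with the speedup, an unrestrained simulation covers only a small fraction of the time a typical protein needs to fold.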

"If we have a good amount of experimental data that's all consistent, we don't need to have much computer power to get a structure, and we don't need to have very accurate modeling tools to get reasonable structures. On the other extreme, if we have no experimental data, then even with the GPU computing power of Forge we don't have enough compute power to get a structure. But there's some sort of inbetween area where, with enough computing power, you can get by with a lot less data than you used to need in order to get a structure. And that's the area we're exploring. We're trying to see how far we can push, and how little data we can put into these calculations, and still get something reasonable out of it. And the faster the computers are, the less data we need," says MacCallum.

"The cost of the experimental work is much, much higher than the cost of the computer time," he continues. "So we can save on computer costs by making this calculation faster, but we can also save an awful lot on bench time if we can make this work. And that's what we're here for, to help the experimentalist."

By giving the computer even just a little bit of data about the protein, though, the tool MacCallum's team is developing can zero in on the structures that are the most likely outcomes. In some cases, says MacCallum, they've been able to get reasonable structures with surprisingly little data. For example, one protein with which they were working had some sparse data available from solid-state nuclear magnetic resonance (NMR) experiments. That protein folds on a millisecond timescale. With the NMR data fed into the tool, the protein folded on a 50-nanosecond timescale.

The CASP challenge

Trying to get something from almost nothing is what makes the CASP (Critical Assessment of protein Structure Prediction) competition so interesting. CASP competitors are given the sequence of a protein and have 21 days to computationally determine a structure. The structures of the proteins have been identified through experimental means, but the structures have not yet been entered into the Protein Data Bank (PDB), so there's no way to see what the correct structure is until the competition ends. Thus the biennial CASP blind prediction competition serves as sort of a benchmarking exercise, says MacCallum, as well as providing information on the current state of the art of protein structure prediction.

For the current CASP10 competition, the sequences of structures recently solved by crystallography or NMR were made available to predictors from April through July 2012. Throughout the fall, independent assessors in each of the prediction categories are processing and evaluating the tens of thousands of models submitted by approximately 200 prediction groups worldwide, including MacCallum's group. The results will be made public and discussed at the CASP10 meeting to be held in December in Gaeta, Italy. About this same time, the structures of the 114 competition proteins will be entered into the PDB.

The PDB archive is the worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. It was established in 1971 at Brookhaven National Laboratory and originally contained seven structures. In 1998, the Research Collaboratory for Structural Bioinformatics at Rutgers became responsible for the management of the PDB, which now contains more than 77,000 structures.

One change to this year's CASP competition was especially useful to the Stony Brook group. For the first time they had some target sequences that came with data. Not real experimental data, says MacCallum, but information that one might be able to get from an experiment. CASP organizers wanted to see if extra information improved the outcomes. That was exciting for MacCallum's team because they've developed their methods around the idea of having extra data to form their predictions. MacCallum says that was a good way to test their tool and they were pleased with the one result they know to date. So they believe they are on the right track.

"After two years of development, we have the tools working pretty well right now," he says. "Obviously, there's still going to be tweaks and improvements to come. We've talked with some experimentalists and hope they'll collaborate with us so we can try our tools on real problems. We've taken a protein where the structure is known and we've proven that we can get the same answer with our technique. Now we want to start applying this to proteins where nobody knows the structure. We hope some really interesting biology and biochemistry will come out of that."

Project at a glance

Team members
Ken Dill
Justin MacCallum
Alberto Perez

Funding
National Institutes of Health
Laufer Center for Physical and Quantitative Biology

For more information
http://laufercenter.org/