No magic wands | National Center for Supercomputing Applications at the University of Illinois
08.27.12
Victor Jongeneel leads the bioinformatics efforts of NCSA and the University of Illinois. He recently spoke with Access' Barbara Jewett about what bioinformatics is—and is not.
Q. When we say bioinformatics, what exactly are we talking about?
A. We're talking about something's that not really a science. Bioinformatics is, in a very broad sense, the application of computer-based techniques to the handling and analysis of biological data. And this can go from the very mundane, like parsing data coming out of an instrument, to the very sophisticated like developing new methods and algorithms for analyzing multiple data sources and extracting various pieces of information from it. Bioinformatics has arisen as kind of a discipline over the last 20-25 years, driven, I would say, by two different imperatives. The first one is that biology used to be a science where very little data was generated; it was mostly observations. People doing experiments and recording—manually, most of the time, or in pictures under the microscope—what they observed for a particular organism and maybe some gels and ultracentrifugation experiments. Some techniques began to come online, particularly DNA sequencing, that produced data that you could no longer paste in a notebook and analyze by hand. So you needed some kind of set of automated procedures to use it.
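Even the "very mundane" end of bioinformatics that Jongeneel describes—parsing instrument output—requires real code. As a hedged illustration (not any specific tool he mentions), here is a minimal sketch of parsing FASTA, a common plain-text format for DNA sequence data; the sequence names and sequences are invented for the example:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict.

    A FASTA record is a '>' header line followed by one or more
    lines of sequence, which may be wrapped at arbitrary widths.
    """
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Starting a new record: store the previous one, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Store the final record.
    if header is not None:
        records[header] = "".join(chunks)
    return records


example = """>seq1 sample read
ACGTACGT
TTGA
>seq2
GGCC"""

print(parse_fasta(example))
# → {'seq1 sample read': 'ACGTACGTTTGA', 'seq2': 'GGCC'}
```

In practice, production pipelines use well-tested libraries (for example, Biopython's parsers) rather than hand-rolled code, but the sketch shows the kind of routine data handling the interview is talking about.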
And the second element is that, unlike for physicists or even chemists, quantitative reasoning and computer science are normally not at all part of the curriculum of the average biologist. So biologists were faced with a problem in that they had data they had to do something with and, for the most part, they had no idea how to handle it. So a small band of people started appearing, some of them biologists with more of an interest in computers and quantitative analysis than some of their colleagues.
Q. This would include you?
A. This would include me. I took my first computer science course in graduate school. I had tried to take a computer science class when I was an undergrad, and was told in no uncertain terms that computer science was absolutely useless for biologists, and that those computer resources were too precious to be wasted! (laughs) Anyway, the confluence of these two things—a science that was beginning to generate data, though at the time not huge volumes, but still appreciable volumes, and the lack of training of most biologists—created a new discipline where there was a real need for people who knew how to use computers and knew enough about biology to figure out how computer skills could be applied to extracting information from biological data. That's really what bioinformatics is all about.
Nowadays, of course, it has become a huge thing because the instruments biologists use are getting more and more sophisticated, and of course everything is digital now, as it is in pretty much all the sciences. So everybody needs at least moderate computer skills to be able to do something with the data they generate.
Also, there are some areas of biology, so-called high throughput biology, where the paradigm has shifted from doing experiments to answer a very specific question, which is the way biology has been done traditionally, to a modality where you take an interesting experimental system and you get lots and lots of data about that system and you hope you will be able to make sense of that data and explain some of the properties of the system. This is biology without a pre-existing experimental hypothesis. I would say that many in high-throughput biology have taken this too far. They just want to end up with lots of data and then hand it over to bioinformaticians who have magic wands and are going to make sense of the data. That is no longer bioinformatics and, unfortunately, that is not how it works.
Bioinformatics is data driven. You have large data sets and you try to make sense of them. First getting information, then knowledge, out of data. Data are by themselves not informative. The first transformation brings raw data to informative data, then the second step goes from informative data to knowledge, where you have a higher level of understanding. Once you understand more about the structure of the data you can better design your next experiment, because you know what data to collect and how to collect it, so you can maximize the amount of information and knowledge that you get out of it. And that makes you a good scientist.
Q. When you joined NCSA and the University of Illinois two years ago, you saw a real need to bring data and computation together in the biomedical field. How are we progressing?
A. I think right now we are in kind of a two-tiered world. There are a couple dozen research centers around the world where there has been the development of very significant computational capability, both from the standpoint of having the right hardware and storage and so forth to having the right kind of competence in bringing together biologists and computational scientists and the technologies to produce the data to drive this forward. A typical example in this country would be the Broad Institute of MIT or a place like Johns Hopkins, or some of our sequencing facilities like Washington University, the DOE's Joint Genome Institute, places like that. Another very well known example is the BGI in China, which is the largest genome sequencing facility in the world right now, and they have an enormous computational capacity. Unfortunately, many other institutions have some data generating ability, but they do not have either the know-how or the facilities to really do large-scale biology and are struggling with how to develop both sides of the equation. And I think many places, even those with decent computational facilities, have not been able to bring them to bear on the biology data deluge.
Here at the University of Illinois, I think we are somewhere in between. We have a very long-standing tradition in large-scale computing, but have not really been serving the biomedical community. On the other hand, there has also been a very significant effort on this campus to develop high throughput biology, through the Biotechnology Center, which makes available the core technologies for doing this kind of work, and through the Institute for Genomic Biology (IGB), which brings the intellectual environment and expertise for doing it. And strangely enough, the high-performance computing and the high throughput biology components have never been working together! So one of the things I have been able to do is to create a new campus unit, which is integrated with the Biotechnology Center. It's called HPCBio, which stands for High-Performance Biological Computing. Basically this is a joint venture between the Biotech Center, IGB, and NCSA to create a unit that serves the needs of the campus in terms of expertise and user support, but also computational resources, to help campus researchers doing high throughput biology. There's actually a pretty sizeable community that was underserved by the structures that existed a couple years ago. So I think this has been a very positive development. It is now operational; we started about two months ago. The main problem we have now is we don't have enough staff, and it is very difficult to find people with the right combination of skills to work in that environment. So for the next few months our problem will be recruiting enough really high-quality people to make this group functional.
I'm hoping that there will be enough innovation at Illinois, and enough synergy between the biological/genomic component and the HPC component, to capitalize on this discipline on campus so that we get to be on par with the, quotation marks, "really big boys." What might help this is our partnership with the Mayo Clinic.
Q. Can you tell us about that collaboration?
A. The Mayo Clinic has a major problem. They are fully committed to integrating genomic information about their patients into the personalized care that they deliver to those patients. Mayo has been known for a very long time as a provider of personalized medicine. They have a model that is patient centric. Each patient is seen by a team of physicians, and that team puts together all the expertise that is needed to serve the patient's specific needs. This is very different from most medical centers, where patients just get kicked around from one specialist to another and at best the specialists share access to that patient's medical record but very rarely get together to make holistic decisions.
At Mayo, they want to put the patient's genome at the center of this decision making process. To do that, they will have to sequence the genome of pretty much every patient coming into the clinic. Currently, Mayo is seeing about half a million patients a year. So even if the genomes are implemented gradually, they still will have a major informatics problem, because with current technology, sequencing even only the exome, which is the part of the genome most likely to have deleterious mutations in it, takes days to analyze on even a relatively large computer system such as they have at Mayo.
Now even if you only want to sequence say a thousand patients a month, you have a problem. If you want to sequence 10,000 patients a month you have an even worse problem. If you want to sequence their whole genomes instead of exomes, currently it takes 10 days to analyze the data—for one patient. So we're working very actively with Mayo to develop not just techniques but maybe hardware solutions that will enable them to actually deliver on this promise. There's a lot of talk about the $1,000 genome, that you can produce all the data to analyze a human genome for $1,000 or less. It is not a reality today, but we are very close. There are already companies that will sequence a human genome for $3,000. The technologies have moved extremely fast. There's technology on the horizon, but nobody knows yet if it will really work, that would dramatically reduce the cost of sequencing a genome not by allowing scientists to create more data, but by allowing them to create much less data but of better quality. So this would also be extremely helpful in resolving the informatics problem, because if the data volume is smaller it is easier to analyze and the amount of computational power needed to solve the problem also diminishes.
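The scale problem Jongeneel describes can be made concrete with back-of-envelope arithmetic. This sketch uses only the illustrative figures from the interview (10 days of analysis per whole genome, 1,000 patients a month); actual runtimes depend entirely on the hardware and pipeline used:

```python
# Back-of-envelope estimate of compute demand for routine
# clinical whole-genome analysis, using the interview's numbers.

days_per_genome = 10        # analysis time for one whole genome
patients_per_month = 1_000  # modest clinical throughput target
days_per_month = 30

# Total compute-days of analysis generated each month:
compute_days_needed = days_per_genome * patients_per_month  # 10,000

# Number of analysis pipelines that must run concurrently
# just to keep pace with intake:
systems_needed = compute_days_needed / days_per_month

print(round(systems_needed))  # → 333
```

Roughly 333 concurrent 10-day pipelines for only 1,000 patients a month—and Mayo sees about half a million patients a year—which is why the interview frames this as an informatics problem, not just a sequencing-cost problem.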
Q. Isn't NCSA developing some of this?
A. The competition we're involved with is The Archon Genomics X Prize presented by Express Scripts®. It is the only competition I'm aware of that challenges companies to sequence and interpret large numbers of human genomes in a limited amount of time. It starts in January 2013, but as we speak we're actually preparing the data that is needed to score the participants and the computer tools that will be needed to analyze the data. The software is being developed by EdgeBio in Maryland, while NCSA is providing the computer infrastructure. And NCSA will also be tasked with scoring the results to see if any of the contestants can get the $10 million prize, since our computers can quickly analyze the data and determine the winner.