Illinois workshop charts path, and roadblocks, toward personalized medicine | National Center for Supercomputing Applications at the University of Illinois

11.23.10

Leaders in human genomics, statistics, bioinformatics, and computational science came together in October to chart a map for the efficient use of high-performance computing in personalized medicine at a two-day workshop at the University of Illinois at Urbana-Champaign. The workshop was co-sponsored by Illinois' National Center for Supercomputing Applications (NCSA) and Institute for Genomic Biology and by Mayo Clinic.

Ten years after the completion of the human genome sequence, the promise of being able to tailor preventive and curative strategies to individual human beings is close to becoming a reality. Data are being collected at unprecedented scales on ever-increasing cohorts of healthy and diseased individuals. Whole genome sequencing, full microbiome analysis, metabolic profiling, and real-time monitoring of functional status in large patient groups are realities. While there is a growing consensus that very large-scale computing will be required to effectively manage and analyze these vast data stores, efforts to employ the world's most powerful computers for personalized medicine are still in their infancy.

"This workshop was an excellent forum to discuss the progress we have made in the past 10 years and to assess the future of computational needs in this rapidly evolving field," said Mayo researcher Peter Li. Presentations and video from the workshop are available online at:

In an open discussion at the end of the workshop, the participants identified a number of bottlenecks to the effective implementation of personalized medicine:

  • the difficulty in aggregating newly acquired and publicly available data in a single location
  • the integration of multiple data types into a common semantic framework
  • the need for analytical work to proceed at a pace comparable to the rate of data production
  • the lack of properly benchmarked and validated software
  • the effort required to adapt software to highly parallel environments, and
  • more generally the difficulty for hardware and software development to keep up with the very rapid changes in enabling technologies for data production.

The participants also agreed that there is only a limited number of problems pertinent to personalized medicine where high-end HPC capabilities are required, but that for those problems (e.g. high-throughput de novo genome assembly) these capabilities would be crucial.

The workshop was divided into four thematic sessions: Data-Driven Biology, Making Sense of Large-Scale Datasets, Clinical Informatics of the Future and Computational Challenges.

In the Data-Driven Biology session:

  • George Weinstock (Washington University Genome Center) said the center routinely analyzes data whose volume doubles every few months. Keeping pace with this data comes at the cost of recurrent major investments in hardware, and the center finds it challenging to perform large-scale analyses of, for example, microbial metagenomics data.
  • Andrey Tovchigrechko (J. Craig Venter Institute) presented a careful analysis of the applicability of high-performance computing (HPC) architectures to sequence analysis problems. He concluded that capacity systems (large commodity clusters) are best adapted to bioinformatics work, but that, with some clever programming, HPC systems such as those offered by the National Science Foundation's TeraGrid could provide additional performance.
  • Mark Chance (Case Western) emphasized that network-level analysis of datasets combining multiple data types from large patient cohorts can provide new diagnostic insights and that exhaustive searches for combinatorial patterns in, for example, genetic or proteomic data will require massive computational resources.
  • Michael Barmada (University of Pittsburgh) further emphasized these points and underlined one of the major difficulties facing personalized medicine—the representation and integrative analysis of multiple heterogeneous datasets.
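To give a rough sense of why exhaustive searches for combinatorial patterns call for massive computational resources, the following is a minimal sketch, not taken from any of the talks. It counts how many marker combinations an exhaustive search would have to examine as the pattern size grows. The panel size of 500,000 markers and the function name are assumptions chosen purely for illustration.

```python
import math


def count_combinations(num_markers, pattern_size):
    """Number of distinct marker subsets of the given size (n choose k)."""
    return math.comb(num_markers, pattern_size)


# Hypothetical panel size, chosen only for illustration.
NUM_MARKERS = 500_000

for pattern_size in (1, 2, 3, 4):
    total = count_combinations(NUM_MARKERS, pattern_size)
    print(f"patterns of size {pattern_size}: {total:.3e} candidates")
```

Each additional marker in the pattern multiplies the search space by roughly the size of the panel, which is why such searches quickly outgrow ordinary workstations.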

In the Making Sense of Large-Scale Datasets session:

  • Sasha Wait Zaranek (Harvard) made a strong plea for openness, from the sharing of genome data in the Personal Genome Project to the sharing of computer resources through "Freegols," virtual machines providing Web services for diverse applications and data. Freegols could become a distributed resource capable of analyzing and comparing hundreds of thousands of human genomes.
  • Wolfgang Huber (EMBL Heidelberg) presented results from large-scale genetic screens using multiple technologies including time-lapse microscopy. He argued that raw computational performance was never a bottleneck, but that moving terabytes of data was, and that therefore "bringing the computation to the data" was a top concern.
  • Pankaj Agarwal (GlaxoSmithKline) reported on the challenges of analyzing electronic health records and pharmacy benefit data, which are providing basic medical information about hundreds of millions of individuals. These are likely to grow both in the number of patients covered and in the richness of the data collected, and their analysis will require sophisticated data mining techniques running on powerful computers.
  • Keith Bisset (Virginia Tech) talked about modeling the behavior of populations based on the networks of relationships between individuals. Such techniques are extremely useful in handling health emergencies but require supercomputers capable of performing quadrillions of calculations every second if applied to realistic populations (e.g. the 300 million individuals in the United States).

In the Clinical Informatics of the Future session:

  • Peter Li (Mayo Clinic) presented the four grand challenges of personalized medicine—individual variation, predictive knowledge, adaptive treatment and prevention, and integration of information—and explained how the representation of patient data at Mayo impacted them.
  • Jonathan Silverstein (U. of Chicago) outlined the enormous challenges, both organizational and informational, encountered in developing a computerized system to support health care delivery in the United States and argued that grid and cloud computing solutions would be required to enable such a system.
  • Dan Masys (Vanderbilt) showed how good decision-support systems for physicians, informed by rich data sources, including genomics, could drastically improve patient outcomes. He also advocated re-analyzing data from GWAS projects to uncover new associations of genetic polymorphisms with disease phenotypes that were not part of the original study design.
  • Barbara Beckerman (Oak Ridge National Laboratory) presented several data-driven clinical informatics projects where integrative analysis supported by large computer systems is generating new knowledge.

The final session was devoted to Computational Challenges:

  • Aimée Dudley (Institute for Systems Biology) showed how she has been able to model the relationship between genotype and phenotype in yeast mutants, and in particular changes in colony morphology linked to cell-cell signaling defects.
  • Alexandros Stamatakis (Heidelberg Institute for Theoretical Studies) undertook a detailed analysis of the computational requirements and bottlenecks of one class of algorithms, designed to reconstruct a phylogenetic tree using a maximum likelihood scoring function from a large number of aligned genomic sequences. He showed how memory size and access, representation of floating-point numbers, performance of the interconnect and basic machine architecture could affect both the size of the problems that can be solved and the performance of the algorithms.
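One well-known way in which the representation of floating-point numbers can limit likelihood-based computations is numerical underflow: multiplying many small per-site probabilities eventually produces a value too small for a double-precision number to hold. The sketch below is an illustration only and is not drawn from the talk. The per-site value of 0.01, the count of 2,000 sites, and the function names are all hypothetical.

```python
import math


def naive_likelihood(site_probs):
    """Multiply per-site probabilities directly (prone to underflow)."""
    result = 1.0
    for p in site_probs:
        result *= p
    return result


def log_likelihood(site_probs):
    """Sum the logarithms of per-site probabilities instead of multiplying."""
    return sum(math.log(p) for p in site_probs)


# Hypothetical input: 2,000 aligned sites, each with per-site likelihood 0.01.
site_probs = [0.01] * 2000

product = naive_likelihood(site_probs)  # far below the smallest positive double
log_lik = log_likelihood(site_probs)    # remains an ordinary finite number

print("naive product: ", product)
print("log-likelihood:", log_lik)
```

Working in log space, or rescaling intermediate values, is a common remedy, and the need for it grows with the number of sequences and sites in the alignment.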

Earlier this year Mayo Clinic and the University of Illinois at Urbana-Champaign launched a research alliance focused on individualized medicine, computational medicine, genomics, and point-of-care nanotechnologies.