Automating Phylogenomic Analysis of Protein Sequence Motifs
H. Rex Gaskins
College: Agricultural, Consumer and Environmental Sciences
Award year: 2003-2004
Recent progress in genomics, proteomics, and bioinformatics has enabled the new field of phylogenomics to create unprecedented opportunities to understand the evolutionary history of extant organisms. The goal of this project is to use approaches provided by D2K [Data to Knowledge], developed by the NCSA Automated Learning Group, to automate the phylogenomic analysis of protein sequence motifs. We will focus on redox-sensitive motifs during (and after) the tenure of the fellowship, although the automated platforms to be built will support exploration of the evolutionary history of any of the multitude of annotated protein sequence motifs. Redox-sensitive motifs contribute to the most basic bioenergetic mechanisms of life and were thus likely key components of the primordial gene pool. We postulate that these ancient motifs were incorporated over time into regulatory genes, and that their functions were co-opted to support the emergence of multicellularity and the differentiation programs characteristic of eukaryotic cells. While this hypothesis is well supported by candidate gene-based data from model organisms, which attest to the importance of redox signaling in the control of eukaryotic gene expression, its validity can be determined most objectively and efficiently through phylogenomic analyses.
We have manually combined computational approaches for a preliminary in silico analysis of redox-sensitive zinc-binding domains, using two online databases, InterPro (http://www2.ebi.ac.uk/interpro/) and EMBL Proteome (http://www.ebi.ac.uk/proteome/), and the theoretical proteomes of eight evolutionarily distant organisms from each of the three domains of life. This effort revealed the breadth of our computational problem and numerous bottlenecks associated with manual processing. Accordingly, we seek to streamline and automate the phylogenomic analysis of redox-sensitive protein motifs, and have devised the following specific objectives through initial consultation with NCSA Automated Learning Group staff member Michael Welge:
- To transfer our manually assembled redox-sensitive protein motif and archaeal, bacterial, and eukaryotic proteome databases to an integrated D2K environment to create module-based application templates and a standardized visual programming interface for rapid phylogenomic analysis.
- To explore other data mining, sequence alignment, and phylogenetic tree-building methods for their ease of integration and computational automation.
- To assemble a final automated, user-friendly platform for rapid phylogenomic analysis of protein motifs and structural domains.
- To validate the automated phylogenomics platform by examining the evolutionary history of the redox-sensitive cytochromes P450 (P450s) and glutathione transferases (GSTs). The P450 and GST gene families, which encode major detoxification enzymes across taxa, are remarkably large with distinct eukaryotic and prokaryotic sequences in each case.
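As a rough illustration of the kind of step these objectives aim to automate, the sketch below tabulates which proteins in each proteome contain a given motif. It is a minimal example under stated assumptions, not the D2K workflow itself: the organism labels, protein IDs, and sequences are made-up toy data, the regular expression is a simplified C2H2-like zinc finger pattern rather than a curated InterPro signature, and `motif_profile` is a hypothetical helper name.

```python
import re

# Simplified C2H2-like zinc finger pattern (illustrative assumption only,
# not a curated database signature): two cysteines followed by two histidines
# with fixed or bounded spacing between them.
MOTIF = re.compile(r"C.{2,4}C.{12}H.{3,5}H")

# Toy "proteomes": hypothetical organism labels, protein IDs, and sequences.
PROTEOMES = {
    "archaeon_A": {"a1": "MKTAYIAKQRQISFVKSHFSRQ"},
    "bacterium_B": {
        "b1": "MSLNGKTVAVLG",
        "b2": "MACPECGKSFSQKSNLQKHQRTHTG",
    },
    "eukaryote_E": {
        "e1": "MLSCAACGKAFSRSDELTRHIRIHTG",
        "e2": "MEEPQSDPSV",
    },
}


def motif_profile(proteomes, pattern):
    """Return {organism: sorted protein IDs whose sequence contains the motif}."""
    return {
        organism: sorted(
            pid for pid, seq in proteins.items() if pattern.search(seq)
        )
        for organism, proteins in proteomes.items()
    }


if __name__ == "__main__":
    # Print a presence/absence summary per organism.
    for organism, hits in motif_profile(PROTEOMES, MOTIF).items():
        print(organism, len(hits), hits)
```

A per-organism table of this kind is one possible input to later alignment and tree-building steps; a real analysis would draw motif assignments from the annotated databases rather than from a hand-written pattern.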
By integrating genomic and proteomic databases and automating the phylogenomic process, the proposed work will create novel and efficient bioinformatic tools. Expedited discovery in redox signaling, an emerging area in developmental cell and cancer biology, will be an immediate and valuable outcome for the Applicant's program. More importantly, the project will broaden the scope of NCSA data mining applications, and thus provide an ideal platform for building stronger ties between NCSA and biological scientists on campus. These interactions will be critical as the University of Illinois positions itself to be a leading institution in the field of post-genomic biology.