J. William Bell
 A Flower's Family Tree 1 2 3 4
 A forest of 14 billion trees


Parsimony may imply efficiency, but, the efforts necessary to actually build reconstructions, clearly leave simplicity far behind.

Example of a phylogenetic tree for the
13 species studied in the Campanulaceae project. The numbers along the lines are an estimate of evolutionary distance, expressed as the expected number of evolutionary events.
click to enlarge.

Each phylogenetic reconstruction, called a tree, represents one possible history of the species. The Campanulaceae project begins with 12 modern species of bluebell and a single species of tobacco. Tobacco is used as an outgroup, a species that is clearly very distant from the others, and is used to identify the root of the tree. To predict the evolutionary history, almost 14 billion trees must be built and compared to one another. Bader, Moret and their colleague Tandy Warnow of the computer science department at the University of Texas at Austin go far beyond constructing the underlying tree and its eventual outcome, also calculating gene order for each predicted ancestor within the trees. That means a whopping 100 billion genomes must be reconstructed.

"We need to build as complete a picture of the ancestry as possible," says Moret. "Computing ancestral gene orders also enables us to put differing evolutionary models to the test."

For these computer scientists, the reconstructions amount to a massive optimization problem—a game of creating all the possible evolutionary scenarios that could have occurred and then narrowing down those billions of options to a single best solution. The process begins with the raw gene data taken from the chloroplast of each species of Campanulaceae. The bluebell chloroplasts' single chromosomes are each made up of 105 gene segments. The genes within the individual genomes are always the same. That is, all 13 genomes have identical genes and identical lengths, but in different species the genes appear along the chromosome in a different order or orientation.

The team's code generates trees one at a time. Once a tree is generated, its internal nodes—the intervening ancestors that come between the individual species and their final common ancestor—are labeled by the software. Labels, which consist of the gene order data for a given node, are derived through a complex optimization process based on the notion of breakpoints.

   Diagram of conversion from
breakpoint median to the travelling
salesperson problem.
click to enlarge.

A breakpoint occurs any time two genes are adjacent in one genome but are not adjacent in a genome to which the first is compared. An internal node's label is derived by finding the gene order that minimizes the number of breakpoints between a node and its three closest neighbors. "This is where the parsimony criterion comes in," say Moret. "We find a label that minimizes the amount of change at this place in the tree." A travelling salesperson problem solver—a common, if expensive, mathematical method of solving optimization problems—is used to find the median, calculating the hypothesized gene order data for each node.

In the initial labeling of the tree, nodes that represent known data are separated by many intervening nodes. As a result, a great deal of approximation is used in the early stages. Multiple passes at a tree refine the approximation. Once the initial labels are assigned, each node has closer neighbors that can be used to find the breakpoint median. The code recalculates the nodes' labels based on these new data, repeating this process again and again and recalculating any node that saw changes in one of its neighbors in the previous pass, until the tree stabilizes.

Labeling and refining the trees is by far the most challenging step. "Computing a single median is intractable in itself, and we solve these over and over. It's a very computationally intensive procedure," Moret says. With this step complete, each tree is scored, using inversions as the metric. These scores show researchers which trees are most parsimonious. Thus, the scores also show which evolutionary history best fits the model and give a plausible snapshot of each genome within that history.

"In completing so many trees, you see very good and very bad ones. Ones that match up very closely and ones that don't," says Bader. "What we look for is commonality. A consensus tree that we can dig in deep on and that will give us a foothold to larger phylogenies."

 

 1 2 3 4 up