Carrine E Blank

Department of Geosciences, University of Montana, Missoula, MT, 9812 USA

1. Introduction

At present, little is known about when major groups of microorganisms and their metabolisms appeared on the Earth. The fossil record provides some limited information; however, this information can be contradictory or ambiguous. For example, lipid biomarkers suggest the presence of Cyanobacteria at ~2.7 Ga (Ga = billions of years ago; Brocks et al., 2003). In contrast, the first unambiguous cyanobacterial microfossils are not seen for another 700 million years (Hofmann, 1976; Tomitani et al., 2006), after the rise in atmospheric oxygen at ~2.32 Ga (Bekker et al., 2004). There are major changes in the mass-dependent and mass-independent fractionation of sulfur isotopes at ~2.4 Ga (Canfield and Raiswell, 1999; Farquhar et al., 2000). While this could be interpreted as the origin of widespread mesophilic sulfate reduction (Blank, 2004), it can also be interpreted as a continued presence of sulfate reducers but a change in sulfate concentrations and/or rates of sulfate reduction (Ohmoto et al., 1993; Habicht et al., 2002). The changes in the mass-independent fractionation can also be interpreted as a change in atmospheric photochemistry (Farquhar et al., 2000). Lastly, the extensive record of carbon isotopic fractionation in organic carbon has long been interpreted as evidence of a diverse biosphere containing photosynthesis as far back as 3.5 Ga (Schidlowski, 1988). However, abiotic synthesis has recently been shown to produce carbon isotopic fractionations in the range of phototrophic bacteria, calling into question the biological interpretation of early Archean carbon isotopic signatures (McCollom and Seewald, 2006).

Although the past decade has seen major advances in our understanding of the earliest record of life on Earth (Brasier et al., 2002), one of the biggest outstanding challenges in the field of Astrobiology is our lack of a deep comprehension of how microorganisms co-evolved with the Earth through time. This is due to a lack of detailed understanding of both the early rock record and how microorganisms evolved. This challenge will have to be overcome if we are to fully appreciate changing planetary habitability through time and if we are to design the most effective strategies for searching for life on other bodies in the solar system.

Phylogenomic dating (Blank, 2004, 2009a) is a method that has the potential to shed some light upon both the ancient rock record and microbial evolution.

It combines well-resolved phylogenies derived from whole genome sequence data, the inferred evolution of metabolic and physiological traits, and patterns seen in the rock record to identify new age constraints for the origin of major prokaryotic groups (i.e., clades). This method is also potentially useful in constructing and testing causal biological hypotheses of major transitions in the early geologic record. Such transitions include changes in the cyanobacterial fossil record, sulfur isotopic fractionation, carbon isotopic fractionation, and the sudden rise in atmospheric oxygen. This paper outlines the methodology behind phylogenomic dating. It also discusses many of the challenges associated with its implementation and how these challenges might be addressed in the future.

2. Prokaryote Phylogenomics

The first requirement for phylogenomic dating is a well-resolved phylogenetic tree - this provides the backbone for all subsequent analyses and inferences. In the past, phylogenetic trees of prokaryotes have been largely constructed using the 16S (also called small subunit, or SSU) ribosomal RNA gene (Woese, 1987). SSU rDNA was instrumental in articulating the three-domain structure of the tree of life (Woese et al., 1990). Soon after, rDNA studies identified a large number of fundamental bacterial lineages, or divisions (Pace, 1997). At the same time, the resolution of the branching relationships between these divisions was poor, with the phylogeny often collapsing into a polytomy. Also, phylogenies of other genes (such as RNA polymerase) sometimes conflicted with the rDNA tree (e.g., Klenk et al., 1999).

There are multiple reasons why a single gene tree (like the 16S rDNA tree) may lack resolution or may conflict with other single gene trees (Eisen, 2000; Gribaldo and Philippe, 2002). Single genes typically lack sufficient characters needed to resolve distant relationships (Brown et al., 2001). Single genes may lack sufficient slowly evolving characters to resolve deep-branching relationships. Even worse, they may have too many fast-evolving characters that can exacerbate systematic error such as long branch attraction (LBA). Long branch attraction or lack of phylogenetic signal can lead to incongruent or unresolved topologies and are a significant concern in prokaryote phylogenetics, given the ancient divergence times for major lineages (Gribaldo and Philippe, 2002). Lateral gene transfer (LGT) is the horizontal movement of genetic material between distantly related lineages (in contrast to the "typical" vertical movement of genetic material between ancestors and descendants). Lateral gene transfer can introduce genes with different histories into the recipient genome, producing phylogenetic trees that are incongruent with other genes in the genome. The frequency of LGT may also play a role in the overall topology of the bacterial tree. It has been proposed that organisms that frequently exchange DNA by LGT may cluster together in trees, leaving the basal lineages to be populated by organisms with lower rates of LGT (Gogarten and Townsend, 2005). Finally, ancient gene duplications followed by loss can also produce phylogenetic incongruence in a process known as lineage sorting.

Resolving the phylogenetic history of prokaryotes is a significant, on-going challenge (e.g., Gupta and Griffiths, 2002). Fortunately, the ever-growing abundance of genome sequences is providing the data that will ultimately contribute toward a more comprehensive understanding of the evolutionary history of prokaryotic genes and genomes. Many approaches have been developed for constructing prokaryote phylogenies using genome data. One uses presence/absence data of orthologous protein families or protein domains (Snel et al., 1999; House and Fitz-Gibbon, 2002; Wang and Caetano-Anolles, 2006). Another uses rare genomic changes, such as the positioning of shared overlapping genes (Luo et al., 2006), genome rearrangements, and mobile genetic elements (Boore, 2006). Yet another uses the presence of shared insertions and deletions within proteins (Gupta, 1998). An increasingly common approach is to use concatenated sequences of highly conserved genes (e.g., Daubin et al., 2002; Ciccarelli et al., 2006; Barion et al., 2007). The most highly conserved genes in a genome are often involved in critical cellular functions such as the processing of information (RNA, DNA, and protein). Genes in a concatenated dataset can be analyzed individually using a "supertree" approach (Sanderson et al., 1998), or trees can be calculated with the entire concatenated dataset using a "supermatrix" approach (de Queiroz and Gatesy, 2006).
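The supermatrix idea can be illustrated with a minimal sketch, assuming each gene has already been aligned separately and that taxa lacking a gene are padded with a missing-data symbol. The function name, the data layout, and the toy sequences are hypothetical and are not taken from any of the studies cited above.

```python
def concatenate_alignments(gene_alignments, missing="?"):
    """Join per-gene alignments into one supermatrix.

    gene_alignments: dict mapping gene name -> {taxon: aligned sequence}.
    Returns (supermatrix, partitions), where partitions records the
    1-based column range that each gene occupies.
    """
    # Every taxon seen in any gene gets a row in the supermatrix.
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    rows = {t: [] for t in taxa}
    partitions = []
    start = 1
    for gene, aln in gene_alignments.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned")
        length = lengths.pop()
        for t in taxa:
            # Taxa without this gene are padded with the missing symbol.
            rows[t].append(aln.get(t, missing * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {t: "".join(parts) for t, parts in rows.items()}
    return supermatrix, partitions


# Toy (hypothetical) data: two genes, one of them missing in Thermus.
toy = {
    "geneA": {"Aquifex": "MKL", "Thermus": "MRL"},
    "geneB": {"Aquifex": "ACGT"},
}
matrix, parts = concatenate_alignments(toy)
```

Keeping the partition boundaries alongside the matrix makes it possible to analyze the same data gene by gene later, which is what a supertree-style comparison would need.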

Early genomic phylogenies constructed with presence/absence of orthologous genes (Fitz-Gibbon and House, 1999; House and Fitz-Gibbon, 2002) generally agreed with those constructed using a supermatrix approach (Hansmann and Martin, 2000; Brown et al., 2001). However, there were some differences, particularly within the Euryarchaeota, a major clade of the archaeal domain of life. The presence/absence phylogenies placed the methanogens Methanococcus and Methanothermobacter sister to Archaeoglobus to the exclusion of the Pyrococci, while the concatenated analyses placed them with the Pyrococci to the exclusion of Archaeoglobus. A study using both gene content trees and concatenated ribosomal proteins (Slesarev et al., 2002) found a similar pattern with the hyperthermophilic methanogen Methanopyrus. Their gene-content tree placed Methanopyrus with Methanococcus-Methanothermobacter-Archaeoglobus while their concatenated analysis placed it with Methanococcus-Methanothermobacter-Pyrococci. Subsequently, a more exhaustive orthologue presence/absence analysis showed that methanogens and Archaeoglobus formed a large monophyletic group (House et al., 2003). The authors questioned previous phylogenies showing non-monophyly of methanogens, but acknowledged that clustering of this group in presence/absence trees could be an artifact of having 81 shared gene groups possibly correlated with a methanogenic lifestyle (64 having unknown function, 11 involved in methanogenesis; Fitz-Gibbon and House, 1999; House et al., 2003). This study also observed the anomalous Thermoplasmas branching before the basal Crenarchaeota and Euryarchaeota split - a placement the authors attribute to highly reduced genomes. Given these potential biases, the supermatrix approach may be more desirable than the presence/absence approach.

At the present time, supermatrix analyses employing large numbers of taxa and different sets of genes appear to be converging upon a "core" archaeal tree. One study (Brochier et al., 2005; Gribaldo and Brochier-Armanet, 2006) analyzed a set of concatenated transcriptional proteins (5,809 characters) and a set of ribosomal proteins (2,213 characters). Both produced trees that were identical except for the placement of Methanopyrus. In the transcriptional set, Methanopyrus branched deep (agreeing with the 16S rRNA tree; Burggraf et al., 1991); however, in the translational set Methanopyrus branched higher up with Methanococcus and Methanothermobacter. The authors proposed that the deep placement was due to LBA to long-branching outgroups, and favored the higher placement of Methanopyrus. A larger supermatrix study (Blank, 2009a) used a different mix of operational genes (involved in metabolism and processes such as cell division and DNA repair) and information processing genes (15,220 characters). The trees constructed with this dataset agreed with the archaeal "core" tree. It was, however, noted that bootstrap support for the basal branching of Methanopyrus depended upon the sets of genes analyzed (e.g., metabolic vs. informational) and the tree reconstruction method used. Nevertheless, most topologies supported Methanopyrus branching deep with moderate to high support. Only some analyses placed Methanopyrus higher up, and these topologies tended to have weaker support than the basal branching position. Methanopyrus does have a number of physiological traits that agree with a deep placement (hyperthermophily; lipids thought to be primitive; Hafenbradl et al., 1993). Regardless, the difficulty of accurately placing this lineage is due to the inherent problems associated with rooting and with long branches (Philippe and Laurent, 1998). Ultimately the issue will not be resolved unless another organism can be found that breaks up this long branch.

Although a "core" tree appears to be emerging for the archaeal domain, this is not necessarily the true tree or the strict organismal tree. Rather, it is an average tree that likely contains conflicting signal from lateral gene transfer, long branch attraction artifacts, lack of phylogenetic signal, and lineage sorting (Susko et al., 2006). The complexity hypothesis (Jain et al., 1999) proposes that the genes predominantly reflecting vertical evolution and the "core" are involved in information processing, while genes that experience more horizontal evolution are involved in metabolism. Indeed, a more complex history involving metabolic genes with vertical, horizontal, and convergent evolutionary components appears to be superimposed upon the archaeal "core" (Blank, 2009a).

Although initial studies conveyed hope for being able to identify a similar "core" for the bacterial domain (e.g., Brown et al., 2001), later studies soon reported unresolved phylogenies or well-supported conflicting phylogenies (Teichmann and Mitchison, 1999; Hansmann and Martin, 2000; Brochier et al., 2002; House and Fitz-Gibbon, 2002; Ciccarelli et al., 2006). The controversy over the shape of the bacterial tree extends not just to branching relationships between mesophilic divisions, but to whether the root lies in the hyperthermophilic bacteria Aquifex and Thermotoga (Bocchetta et al., 2000; Brown et al., 2001; Di Giulio, 2003) or in the mesophilic Planctomycetes (Brochier and Philippe, 2002). The most recent analyses with large sets of taxa do re-affirm support for the thermophilic origin of the bacteria (Teeling et al., 2004; Barion et al., 2007), although there have long been questions about the deeply branching nature of thermophilic bacteria (Woese et al., 1991).

The reasons behind the inability to resolve the bacterial "core" are unclear. Some have hypothesized that high rates of LGT preclude resolution (Doolittle, 1999; Gogarten et al., 2002). Undoubtedly, LGT has played an important role in bacterial evolution, particularly in metabolic genes (Boucher et al., 2003; Bauer et al., 2004; Ma and Zeng, 2004; Mussmann et al., 2005; Pal et al., 2005). However, the extent to which lateral transfer has involved the highly conserved "core" genes is unclear and controversial (Daubin et al., 2003). Recent studies show that although LGT is difficult to distinguish from other phylogenetic causes of incongruence and therefore to enumerate, most bacterial genes appear to evolve vertically rather than horizontally (Lerat et al., 2003; Kunin et al., 2006; Susko et al., 2006). For example, an analysis of 205 gene families in the γ-Proteobacteria (Lerat et al., 2003) identified only two cases of LGT. Using the same dataset, however, and a different approach of statistical tests to compare all possible topologies, about 10% of the genes (18/205) were identified as candidates for LGT (Susko et al., 2006).

Clearly, if there is a bacterial "core" tree it will be a challenge to find. It will require performing analyses on individual genes in the supermatrix to identify and remove taxa and/or characters that contribute to phylogenetic incongruence, whether by LGT, LBA, lack of phylogenetic signal, or lineage sorting. Luckily, not all genes appear to experience LGT at the same rates. It is also likely that not all microorganisms experience LGT to the same extent. For example, lateral transfers appear to be most likely between organisms that share ecological niches (Zhaxybayeva et al., 2004). Large concatenated datasets are not immune to LBA (e.g., Sanchez-Baracaldo et al., 2005). Consequently, measures should be taken in both supermatrix analyses and individual gene analyses to test for LBA (such as "long branch extraction"; Siddall and Whiting, 1999; Pol and Siddall, 2001). Taxa that show LBA behavior should either be removed or additional taxa added to break up the long branch. For divergences that occurred billions of years ago, resolving deep branching relationships will require analyses with large numbers of slowly evolving characters/genes (Keeling et al., 2005). Also, multiple phylogenetic methods should be used, including those that take into account rate heterogeneity among sites with a discrete gamma distribution parameter (Bergsten, 2005).

3. Compartmentalization

Compartmentalization was first proposed by Mishler (1994) as a way to improve resolution of deep-branching phylogenetic relationships. In this method, a global analysis is first performed on the data matrix to identify well-supported monophyletic groups (the compartments). Next, detailed phylogenetic analyses are performed within each compartment to obtain the best local/ingroup topologies. Two advantages of this approach are that it reduces LBA caused by the presence of outgroup taxa (Bergsten, 2005) and that it maximizes the number of characters that can be used to construct the phylogeny (because taxa within a compartment share more characters/genes than taxa between compartments). Once the best topology is obtained within a compartment, the tree is rooted with a closely related outgroup and a constraint tree is generated. Poorly supported relationships are collapsed and the constraints from all compartments are coalesced into a single constraint tree. Finally, global analyses are performed while invoking the constraint statements.
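The last two steps (collapsing poorly supported relationships and coalescing the compartments into one constraint tree) can be sketched as follows. This is only an illustration under stated assumptions: trees are represented as nested tuples of (support, children), the 70% support cutoff is an arbitrary choice, and the toy topologies and support values are hypothetical rather than results from any analysis described here.

```python
def collapse(node, min_support):
    """Collapse weakly supported internal nodes into polytomies.

    A leaf is a taxon name (str); an internal node is (support, [children]).
    The root of the tree passed in is kept, since a compartment is by
    definition a well-supported group.
    """
    if isinstance(node, str):
        return node
    support, children = node
    kept = []
    for child in children:
        child = collapse(child, min_support)
        if not isinstance(child, str) and child[0] < min_support:
            kept.extend(child[1])   # splice grandchildren into this node
        else:
            kept.append(child)
    return (support, kept)


def to_newick(node):
    """Write a topology-only Newick string (no branch lengths or supports)."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(c) for c in node[1]) + ")"


def global_constraint(compartments, other_taxa, min_support=70):
    """Coalesce all compartment constraints into a single constraint tree.

    Relationships between compartments and the remaining taxa are left
    unresolved so that the global analysis is free to resolve them.
    """
    subtrees = [to_newick(collapse(t, min_support)) for t in compartments.values()]
    return "(" + ",".join(subtrees + list(other_taxa)) + ");"


# Hypothetical compartments with made-up bootstrap values.
compartments = {
    "Cyanobacteria": (100, ["Nostoc", (55, ["Synechocystis", "Prochlorococcus"])]),
    "Deinococcus-Thermus": (100, ["Thermus", "Deinococcus"]),
}
constraint = global_constraint(compartments, ["Aquifex", "Thermotoga"])
```

A string of this form could then be supplied as a topological constraint to whichever tree-building program is used for the global analysis; the exact command syntax depends on the program.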

Compartmentalization is uniquely suited for genomic data and deep phylogenetic questions. Different sets of genes can be chosen for global and local levels, to maximize the amount and type of characters needed to resolve the phylogenetic problem at hand. The first large-scale implementation of Compartmentalization used a supermatrix of 53 taxa and 38 genes from bacterial genome sequences (Blank, 2002). The dataset included a mix of information processing genes (17), cellular processing genes (11), and metabolic genes (10) - in total 36 amino acid sequences, the small subunit rRNA gene, and the large subunit rRNA gene. Global analyses using all 17,750 characters were first performed using all taxa with maximum parsimony. (Analyses of mixed datasets are currently at a computational disadvantage. At present, maximum likelihood can only be used on pure amino acid or nucleic acid data. Bayesian methods can be used with partitioned mixed datasets, but are so computationally intensive they require significant pruning of taxa to the extent that they lose meaning. It is well established that phylogenetic resolution requires adequate representation of taxa to break up long branches. These factors effectively limit analyses of large, mixed datasets to parsimony and distance methods, which could be considered less savory.) As expected, the initial global tree was poorly resolved, particularly at the deepest levels, with 6 tree islands and 13 steps to the next tree island (Fig. 1).

Next, branch lengths were measured. Branches with X > 0.100 ("long branches", where X = branch length divided by the total number of characters) included the Chlamydias, Spirochaetes, Mycoplasmas, and the ε-Proteobacteria. Taxon excision experiments showed that random pairs of these taxa were always monophyletic, often branching in locations different from analyses with a single long-branching taxon - classic LBA behavior. Large concatenated datasets are clearly not free from systematic error introduced by mutational saturation, and can result in highly supported incorrect topologies, often regardless of the tree reconstruction method (Jeffroy et al., 2006).
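The long-branch criterion amounts to a simple normalization, sketched below. The threshold of 0.100 and the total of 17,750 characters come from the text; the function name and the branch lengths (in parsimony steps) are hypothetical values chosen only to show the calculation.

```python
def flag_long_branches(branch_lengths, n_characters, threshold=0.100):
    """Return taxa whose normalized branch length X exceeds the threshold.

    X = branch length (in steps) divided by the total number of characters.
    """
    return sorted(
        taxon
        for taxon, steps in branch_lengths.items()
        if steps / n_characters > threshold
    )


# Hypothetical terminal branch lengths for a 17,750-character matrix.
toy_lengths = {"Mycoplasma": 2500, "Borrelia": 2100, "Escherichia": 900}
long_taxa = flag_long_branches(toy_lengths, n_characters=17750)
# Mycoplasma: 2500 / 17750 ~ 0.141 and Borrelia: 2100 / 17750 ~ 0.118 are
# flagged; Escherichia: 900 / 17750 ~ 0.051 is not.
```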

Figure 1. Global analysis using MP with 38 concatenated genes and 53 bacterial taxa. Left: cladogram showing relationship between clades and bootstrap support of nodes. Right: phylogram of one of the best trees showing branch lengths. Aquifex was the chosen outgroup based on analyses using archaeal taxa and concatenated information processing genes (not shown).

There are many ways of compensating for potential systematic error in concatenated phylogenies. One is to remove taxa that exhibit mutational saturation and/or LBA artifacts, and another is to remove fast-evolving positions. In our initial study, we did both. Global trees calculated without the long-branching taxa had marginally better resolution, with 4 tree islands and 18 steps to the next best island (Fig. 2a). Trees with a more conserved dataset of 18 genes and 10,826 characters (identified by examining pairwise distances between taxa) had even better resolution, with 3 tree islands and 68 steps to the next island (Fig. 2b).

Next, local analyses were performed on the four monophyletic groups identified in the global analyses - the Cyanobacteria, High G + C Gram Positives (HGC, or Actinobacteria), Low G + C Gram Positives (LGC, or Firmicutes), and the Proteobacteria (for details see Blank, 2002). Analyses were performed using different subsets of genes, and with taxon excision experiments to identify potential LBA behavior. Next, the trees were rooted using neighboring outgroups and constraint trees constructed in MacClade (Maddison and Maddison, 2005; Fig. 3a).

A variant of Compartmentalization involves inference of ancestral sequences of compartment clades. To do this, the taxa in the compartment and the closest outgroup are selected in PAUP* (Swofford, 2001), the best ingroup topology obtained by Compartmentalization is enforced using a backbone constraint, and the ancestral sequence at the basal node is calculated using the Show Ancestral States command. The taxa in the compartment are then replaced by HTU sequences in the final global analyses. This approach may shorten overall branch

Figure 2. Global analyses using MP without long-branching taxa. Left (a) was performed with all 38 genes; right (b) was performed using the more conserved set of 18 genes.
