8.4.1 Stochastic error and the need for more data

Phylogenies based on a small number of characters (both morphological and molecular) are sensitive to stochastic (random or sampling) error. Consequently, the inferred trees are usually poorly resolved and often yield conflicting results, though the differences are seldom statistically significant. Because the stochastic error is due to the scarcity of substitutions that occurred along internal branches (the aforementioned synapomorphies), the phylogenetic signal can be increased by considering more characters and/or characters with more substitutions. From one or a few characters, as in the first classifications elaborated in the Middle Ages, to tens or a few hundred characters, as in most recent studies based on morphological characters, the rule 'the more characters the better' has always been applied. The advent of large-scale sequencing allowed a gain of about three orders of magnitude, resulting in an enormous improvement in the resolving power of phylogenetic inference. However, the switch from hundreds or a few thousand positions in single-gene phylogenies (e.g., rRNA tree (Woese, 1990)) to hundreds of thousands of positions in phylogenomic studies based on complete genomes is quite recent. Phylogenies with high statistical support for most nodes have recently been obtained for various groups, such as mammals (Madsen et al., 2001; Murphy et al., 2001), angiosperms (Qiu et al., 1999), and eukaryotes (Rodriguez-Ezpeleta et al., 2005).
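The link between the number of characters and the stochastic error can be made concrete with a small numerical sketch. It is not taken from the studies cited above: it assumes that substitutions at each site follow an independent Poisson process, and the internal branch length used (0.001 substitutions per site) is a hypothetical value chosen only for illustration. Under these assumptions, the chance that a short internal branch is supported by no substitution at all shrinks exponentially with the number of positions.

```python
import math


def expected_substitutions(n_sites, branch_length):
    """Expected number of substitutions along a branch over all sites.

    Assumption: branch_length is the expected number of substitutions
    per site, and sites evolve independently.
    """
    if n_sites < 0 or branch_length < 0:
        raise ValueError("n_sites and branch_length must be non-negative")
    return n_sites * branch_length


def prob_no_substitution(n_sites, branch_length):
    """Probability that not a single site changes along the branch.

    Assumption: the number of substitutions is Poisson distributed with
    mean n_sites * branch_length, so P(0 events) = exp(-mean).
    """
    return math.exp(-expected_substitutions(n_sites, branch_length))


if __name__ == "__main__":
    internal_branch = 0.001  # hypothetical short internal branch
    for n in (100, 1_000, 10_000, 100_000):
        mean = expected_substitutions(n, internal_branch)
        p0 = prob_no_substitution(n, internal_branch)
        print(f"{n:>7} sites: expected changes = {mean:7.1f}, "
              f"P(no change) = {p0:.3g}")
```

With a few hundred positions, such a branch will often carry no informative change at all, whereas with hundreds of thousands of positions this becomes practically impossible under the stated assumptions.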

8.4.2 Systematic error and the need for better reconstruction methods

While the use of many characters drastically reduces the stochastic error, it does not necessarily constitute a solution to the problem of tree reconstruction artefacts (Philippe et al., 2005a). Indeed, the addition of more data increases both the phylogenetic and non-phylogenetic signals. Therefore, in the presence of a systematic bias, the latter will eventually become predominant, especially when the phylogenetic signal is rather weak (Jeffroy et al., 2006). For example, a hyperthermophilic lifestyle gives rise to a systematically biased composition of all proteins (Kreil and Ouzounis, 2001), thus potentially leading to an artefactual albeit highly supported tree. Obviously, there is an urgent need to design reconstruction methods that are less sensitive to the systematic error induced by the use of large datasets. Hence, the research in this area is currently very active (Felsenstein, 2004). Although the previously presented MP method is intuitive, its improvement is difficult since it does not make any explicit assumptions about the underlying evolutionary process (see Steel and Penny (2000)). For example, the probability of a substitution event is implicitly assumed to be identical across all branches of the tree, whereas this assumption is clearly violated in cases where branch lengths are unequal, leading to the LBA artefact. In contrast, probabilistic methods such as maximum likelihood (ML) and Bayesian inference (BI) have been designed to take into account branch lengths and are therefore much less sensitive to the LBA artefact. More generally, in a probabilistic framework, the likelihood of a tree is computed using a model of sequence evolution able to handle numerous aspects of the underlying process of sequence evolution. This complex model enhances the extraction of the phylogenetic signal and greatly reduces the impact of non-phylogenetic signal because the probability that multiple substitutions have occurred is explicitly considered.
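To illustrate why explicitly considering multiple substitutions matters, the following sketch uses the simplest nucleotide model, Jukes-Cantor (JC69), which is far simpler than the models used in the studies discussed here and is chosen only because it has a closed form. It compares the observed fraction of differing sites with the actual number of substitutions per site: the observed difference saturates near 0.75, so uncorrected counts increasingly underestimate long branches.

```python
import math


def expected_p_distance(d):
    """Expected fraction of differing sites between two sequences
    separated by d substitutions per site, under the JC69 model."""
    if d < 0:
        raise ValueError("d must be non-negative")
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_distance(p):
    """JC69-corrected distance (substitutions per site) from the
    observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for d in (0.01, 0.1, 0.5, 1.0, 2.0, 5.0):
        p = expected_p_distance(d)
        print(f"true d = {d:4.2f}  observed p = {p:.4f}  "
              f"hidden changes per site = {d - p:.4f}")
```

For short distances the observed and true values nearly coincide; for long ones most substitutions are hidden, which is the situation in which methods that ignore multiple substitutions are most easily misled.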

8.4.3 State of the art in evolutionary models

Currently implemented models of sequence evolution have the following properties: (1) the probabilities of substituting one character state for another are unequal (e.g., transitions are more frequent than transversions); (2) the stationary probabilities of the various character states can be unequal (e.g., A more frequent than T) and are generally estimated from observed frequencies; (3) evolutionary rates can be heterogeneous across sites (i.e., positions), this heterogeneity usually being modelled by a discrete gamma distribution (Yang, 1993). For nucleotides, the probabilities of the different substitutions are directly inferred from the data (GTR model (Lanave et al., 1984)), whereas for amino acids, values previously computed from large alignments of closely related sequences are preferred (e.g., WAG substitution matrix (Whelan and Goldman, 2001)). Taking these evolutionary heterogeneities into account improves the extraction of the phylogenetic signal, as exemplified by the fact that the introduction of the gamma distribution is often associated with changes in the tree topology (Yang, 1996). Other evolutionary features that have been modelled include: heterogeneity of the G+C content through a non-stationary model (Foster, 2004; Galtier and Gouy, 1995; Yang and Roberts, 1995), heterotachy through a covarion model (Galtier, 2001; Huelsenbeck, 2002), and heterogeneous probabilities of substitution types across sites using mixture models (Lartillot and Philippe, 2004; Pagel and Meade, 2004). Although newer models are complex and probabilistic methods have been shown to be robust to model violations, reconstruction artefacts can still interfere with sequence-based phylogenomic analyses, as we will illustrate later in a case study of animal evolution.
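Property (1) can be illustrated with the Kimura two-parameter model, used here only as a minimal stand-in for the richer GTR model mentioned above. In this sketch, `alpha` denotes the transition rate and `beta` the rate of each of the two possible transversions; the numerical values are hypothetical. The function returns the probabilities that a site shows the same base, a transition, or one specific transversion after time `t`.

```python
import math


def k80_probabilities(alpha, beta, t):
    """Kimura two-parameter change probabilities after time t.

    alpha: rate of transitions (A<->G, C<->T)
    beta:  rate of each of the two transversions from a given base
    Returns (same, transition, transversion), where 'transversion' is
    the probability of ONE specific transversion, so that
    same + transition + 2 * transversion == 1.
    """
    if alpha < 0 or beta < 0 or t < 0:
        raise ValueError("alpha, beta and t must be non-negative")
    e1 = math.exp(-4.0 * beta * t)
    e2 = math.exp(-2.0 * (alpha + beta) * t)
    same = 0.25 + 0.25 * e1 + 0.5 * e2
    transition = 0.25 + 0.25 * e1 - 0.5 * e2
    transversion = 0.25 - 0.25 * e1
    return same, transition, transversion


if __name__ == "__main__":
    # Hypothetical rates: transitions four times faster than each transversion.
    for t in (0.05, 0.2, 1.0, 10.0):
        same, ts, tv = k80_probabilities(alpha=2.0, beta=0.5, t=t)
        print(f"t = {t:5.2f}  same = {same:.3f}  "
              f"transition = {ts:.3f}  each transversion = {tv:.3f}")
```

At short times transitions dominate, while at long times all four states approach a probability of 0.25, which is how such a model accounts for sites that have changed several times.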

8.4.4 Current limits of phylogenomic performance

Profiting from the synergistic effects of the massive increase in sequence data and the improvement of tree reconstruction methods, the resolution of the tree of life is rapidly progressing (Delsuc et al., 2005). However, a few important questions are expected to remain difficult to answer in the near future because the phylogenetic signal is either scarce or dominated by strong non-phylogenetic signals. Besides the aforementioned mutational saturation inherent in ancient events, the lack of phylogenetic signal can be due to the existence of short internal branches associated with numerous speciation events concentrated within a short time span (i.e., adaptive radiations). Furthermore, the resolution of ancient events is complicated by a dramatic reduction of the available data, because of the concomitant decrease in the number of orthologous genes (due to duplications and HGTs) and in the number of unambiguously aligned positions (due to considerable sequence divergence).

8.4.5 Corroboration from non-sequence-based phylogenomic methods

Formally defined as the inference of phylogenies from complete genomes, phylogenomics is not limited to primary sequence data. Instead, its principles can also be applied to virtually any heritable genomic feature such as gene content, gene order or intron positions (see Philippe et al. (2005a) for a review). Since the usual methods of tree reconstruction are used, strengths and limitations of phylogenies based on these other types of characters are very similar to those based on primary sequences. However, because these various characters are largely independent, they provide a major source of corroboration, which is of primary importance to validate historical studies (Miyamoto and Fitch, 1995). Indeed, the fact that phylogenies inferred from different character types converge to the same tree topology strongly suggests that the correct organismal tree has been reconstructed. Although such integrated approaches have rarely been applied so far, the first studies indicate a good congruence for Bacteria and Metazoa (for reviews see Delsuc et al. (2005), Philippe et al. (2005b)). Nevertheless, if the same systematic bias (e.g., rate acceleration) simultaneously affects all genomic features, the same reconstruction artefacts are likely to occur.
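As a minimal sketch of how a non-sequence feature such as gene content can be turned into characters for the usual tree reconstruction methods, the function below encodes the presence or absence of gene families as a binary matrix. The species and family names are hypothetical, and real analyses involve orthology assignment and other steps that are not shown here.

```python
def presence_absence_matrix(gene_content):
    """Encode gene content as binary characters.

    gene_content: dict mapping a species name to the set of gene
    families found in its genome (hypothetical input format).
    Returns (families, matrix) where families is the sorted list of all
    families and matrix maps each species to a list of 0/1 states, one
    per family, in the same order.
    """
    families = sorted(set().union(*gene_content.values()))
    matrix = {
        species: [1 if family in genes else 0 for family in families]
        for species, genes in gene_content.items()
    }
    return families, matrix


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    toy = {
        "species_A": {"fam1", "fam2", "fam4"},
        "species_B": {"fam1", "fam2"},
        "species_C": {"fam2", "fam3"},
    }
    families, matrix = presence_absence_matrix(toy)
    print("families:", families)
    for species, row in matrix.items():
        print(species, row)
```

Each column of the resulting matrix can then be treated like any other character, which is why the strengths and limitations discussed for sequences carry over.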

8.4.6 Case study: resolution of metazoan evolution

Because of the general interest in animal evolution, we will present the resolution of this long-standing problem as a case study to illustrate the theoretical concepts explained so far. Before 1997, metazoan taxonomy was essentially based on the presence or absence of true internal body cavities (coelom) (Adoutte et al., 2000), with arthropods and vertebrates grouped, among others, into Coelomata to the exclusion of nematodes (Pseudocoelomata). Suspecting that the early emergence of the generally fast-evolving nematodes was the result of an LBA artefact (Philippe et al., 1994), Aguinaldo et al. (1997) sequenced the SSU rRNA gene from dozens of nematodes, until they identified one slowly evolving species, Trichinella. By using Trichinella they were able to overcome the LBA artefact, revolutionizing the picture of animal evolution by overruling the classical dichotomy between Coelomata and other animals. Instead, they found a new metazoan group named Ecdysozoa (Aguinaldo et al., 1997). Including, among others, arthropods, nematodes, tardigrades, and onychophorans, these animals are characterized by a moult induced by a class of hormones known as the ecdysteroids. Nevertheless, several phylogenomic studies reject the Ecdysozoa hypothesis and find significant support for the classical Coelomata hypothesis (Blair et al., 2002; Dopazo et al., 2004; Philip et al., 2005; Wolf et al., 2004), suggesting that the monophyly of Ecdysozoa represents an rRNA-specific anomaly. These analyses use a large number of characters but only a very limited number of species, i.e., the few completely sequenced model organisms. Furthermore, only species distantly related to animals (fungi, plants, or apicomplexans) are used as outgroups, thus increasing the probability of an LBA artefact.





Fig. 8.4. Avoiding the LBA artefact through the use of a close outgroup. Phylogeny based on 146 genes inferred by the ML method. A. The long branch of the fast-evolving nematodes is attracted by the long branch of the yeast used as a distant outgroup. Because of this LBA artefact, the statistical support for the Coelomata (arthropods + chordates) hypothesis is maximal (100%). B. By breaking the long branch of the yeast, the addition of three basal animals as close outgroups allows the LBA artefact to be avoided and the true topology to be recovered. The statistical support for the Ecdysozoa (nematodes + arthropods) is now nearly maximal (96%). Redrawn from Delsuc et al. (2005).

To address this question, we assembled our own data set, both species- and gene-rich (49 species, 146 proteins, and 35 371 amino acid positions), and our results strongly argue in favour of the group Ecdysozoa (Philippe et al., 2005b). In agreement with previous studies, when a poor species sampling is used, strong support for Coelomata is recovered (Figure 8.4A). In contrast, adding two choanoflagellates and a cnidarian (Hydra) has a dramatic effect since Ecdysozoa are now highly supported (Figure 8.4B). This topological change is not surprising because the longest branch of the tree (leading to the distant outgroup) has been broken (Delsuc et al., 2005), which is known to reduce the impact of the artefact (Hendy and Penny, 1989). However, it is worth noting that this result is only achieved through the use of ML. Indeed, even in the presence of a close outgroup, the use of the LBA-sensitive MP still results in similar support for Ecdysozoa and the artefactual Coelomata (data not shown).

Since reconstruction artefacts are primarily caused by multiple substitutions, eliminating the fastest-evolving characters should improve the quality of the phylogenetic inference, thereby reducing possible LBA artefacts. Indeed, the removal of fast-evolving characters, performed using the SF method (Brinkmann and Philippe, 1999), produces exactly the same result as the addition of a close outgroup (Delsuc et al., 2005), i.e., a topological shift from Coelomata to Ecdysozoa (Figure 8.5). Moreover, a similar result was independently obtained with a larger number of genes and two different data removal approaches (Dopazo and Dopazo, 2005; Philippe et al., 2005b). Therefore, all these analyses demonstrate that Ecdysozoa is a natural clade and that an LBA artefact is responsible for the high statistical support for Coelomata in species-poor studies.

Fig. 8.5. Avoiding the LBA artefact through the elimination of fast-evolving positions. Phylogeny based on 146 genes inferred by the ML method. A. When all positions are considered, the fast-evolving nematodes are artefactually attracted by the long branch of fungi. B. When only the slowest evolving positions are used, nematodes are correctly located as the sister group of arthropods. Redrawn from Delsuc et al. (2005).
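The general idea of discarding fast-evolving positions can be sketched as follows. This is deliberately not the SF method of Brinkmann and Philippe (1999), whose details are not described in this section; it is a crude proxy that assumes the number of distinct residues in an alignment column reflects its evolutionary rate. The function names, the toy alignment, and the threshold are all hypothetical.

```python
def column_variability(column):
    """Number of distinct residues in an alignment column, ignoring
    gaps ('-') and unknown states ('?', 'X'). A crude proxy for rate."""
    return len({c for c in column.upper() if c not in "-?X"})


def keep_slowest_columns(alignment, max_states):
    """Return a copy of the alignment restricted to columns showing at
    most max_states distinct residues.

    alignment: dict mapping species name to an aligned protein sequence;
    all sequences must have the same length.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    kept = [
        i for i, col in enumerate(zip(*seqs))
        if column_variability("".join(col)) <= max_states
    ]
    return {name: "".join(alignment[name][i] for i in kept) for name in names}


if __name__ == "__main__":
    # Hypothetical toy alignment, for illustration only.
    toy = {"taxon1": "MKVA", "taxon2": "MRVA", "taxon3": "MQVG"}
    for threshold in (3, 2, 1):
        print(threshold, keep_slowest_columns(toy, max_states=threshold))
```

Lowering the threshold step by step and re-inferring the tree at each step shows whether a grouping depends on the most variable, and therefore most saturated, positions.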

While this case study shows that sophisticated reconstruction methods may still produce erroneous trees, there are major improvements due to the largely increased resolving power of phylogenomics. This is reflected in the fact that almost all other nodes in the metazoan phylogeny are well supported (Delsuc et al., 2006; Philippe et al., 2005b). In addition, reconstruction errors are drastically reduced as soon as a large number of species is considered, which is likely to become the rule in the next few years. Furthermore, starting from a large phylogenomic data set allows highly saturated characters to be removed, which can be very useful in cases where a close outgroup is not available (Brinkmann et al., 2005; Burleigh and Mathews, 2004; Philippe et al., 2005b).
