Systematic biases and non-phylogenetic signals

Unfortunately, when multiple substitutions accumulate in the data, the situation can be worse than a simple erosion of the phylogenetic signal. Indeed, in the presence of any kind of systematic bias, multiple substitutions generate a non-phylogenetic signal that can result in reconstruction artefacts, as we illustrate with the compositional bias of nucleotide sequences (Lockhart et al., 1992). Let us assume that the guanine plus cytosine (G+C) content in a group is 50%, except in the branches leading to species 1 and 3, for which it is 70% (Figure 8.2A). Because these two species have more often acquired a G or a C at the same position by convergence, while the remaining species have conserved the ancestral A or T, a compositional signal supporting the grouping of species 1 and 3 will be present. Furthermore, if the alignment is so saturated that the phylogenetic signal for grouping species 1 and 2 has been seriously eroded, species 1 and 3 will be erroneously but strongly grouped together because of the non-phylogenetic signal due to the heterogeneous G+C content (Figure 8.2B).
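As a first check for the compositional heterogeneity described above, the G+C content of each sequence in an alignment can be compared with that of the others. The following is a minimal sketch only: the function names, the 10% tolerance, and the toy alignment are assumptions introduced for illustration, and a real analysis would rely on a proper statistical test of compositional homogeneity.

```python
from statistics import median


def gc_content(seq):
    """Fraction of G and C among A, C, G, T (gaps and ambiguity codes ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else float("nan")


def flag_compositional_outliers(alignment, tolerance=0.10):
    """Return taxa whose G+C content departs from the median by more than `tolerance`.

    `alignment` maps taxon names to sequences; `tolerance` is an arbitrary
    threshold chosen for this illustration.
    """
    contents = {name: gc_content(seq) for name, seq in alignment.items()}
    centre = median(contents.values())
    return {name: gc for name, gc in contents.items() if abs(gc - centre) > tolerance}


# Toy alignment (invented): species 1 and 3 are G+C rich (70%), the rest are at 50%.
toy_alignment = {
    "outgroup": "ATGTGCGCAT",
    "species1": "GCGCGGCATA",
    "species2": "GCATGCTCAT",
    "species3": "GGCCGCGTAT",
    "species4": "ATGCATGCGA",
}

outliers = flag_compositional_outliers(toy_alignment)
```

On this toy data, `outliers` contains species 1 and 3, the two lineages that a compositional signal would tend to group together.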

The variation of evolutionary rates across lineages is another important cause of non-phylogenetic signal due to convergent evolution. It results in the long branch attraction (LBA) artefact, in which the two longest branches of a tree tend to be grouped even when they are not closely related (Felsenstein, 1978). A particular albeit frequent case arises whenever a distant outgroup is included to root the tree, i.e., to allow the polarisation of the characters under study. Indeed, if a species of the studied group evolves significantly faster than the others (Figure 8.2C), it will artefactually emerge at the base of the group (Figure 8.2D), because its long branch is attracted by the long branch of the distant outgroup. This phenomenon wrongly suggests an early divergence of the fast species relative to the remaining species.

Fig. 8.2. Effects of non-phylogenetic signals on phylogenetic reconstruction. A. On the true tree, because of a bias of the substitutional process favouring the fixation of G and C versus A and T, the G+C content of lineages 1 and 3 independently increased from 50 to 70%. B. False tree resulting from the compositional signal. Lineages 1 and 3 are artefactually grouped because the convergence of their G+C content implies a high number of homoplastic positions, i.e., shared only by chance. C. On the true tree, after divergence from its common ancestor with lineage 1, lineage 2 was affected by an acceleration of its evolutionary rate. D. False tree inferred from the rate signal. Attracted by the long branch of the distant outgroup O, the long branch of lineage 2 artefactually emerges at the base of the ingroup (species 1-4). This is the well-known LBA artefact.
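The long branch attraction effect can be made concrete with a small calculation on four taxa. The sketch below assumes a two-state symmetric substitution model, which is a deliberate simplification and not the model used in the text. The true tree is ((A,B),(C,D)); A and C carry long branches (for instance a fast-evolving ingroup species and a distant outgroup), while B, D, and the internal branch are short. All names and parameter values are hypothetical. The code sums over the two internal nodes to obtain the probability of each parsimony-informative site pattern.

```python
from itertools import product


def branch_prob(state_a, state_b, change_prob):
    """Probability of observing state_b at one end of a branch given state_a at the other."""
    return change_prob if state_a != state_b else 1.0 - change_prob


def pattern_probability(pattern, long_p, short_q):
    """Probability of tip states (a, b, c, d) on the true tree ((A,B),(C,D)).

    A and C sit on long branches (change probability long_p); B, D, and the
    internal branch are short (change probability short_q).
    """
    a, b, c, d = pattern
    total = 0.0
    for x, y in product((0, 1), repeat=2):  # states of the two internal nodes
        total += (
            0.5  # uniform prior on the state of internal node x
            * branch_prob(x, a, long_p)
            * branch_prob(x, b, short_q)
            * branch_prob(x, y, short_q)
            * branch_prob(y, c, long_p)
            * branch_prob(y, d, short_q)
        )
    return total


def informative_pattern_support(long_p, short_q):
    """Probability of each informative pattern, named after the grouping it supports."""
    # Factor 2 accounts for the mirror-image pattern (states 0 and 1 swapped).
    return {
        "AB|CD": 2 * pattern_probability((0, 0, 1, 1), long_p, short_q),  # true tree
        "AC|BD": 2 * pattern_probability((0, 1, 0, 1), long_p, short_q),  # long branches together
        "AD|BC": 2 * pattern_probability((0, 1, 1, 0), long_p, short_q),
    }


unequal_rates = informative_pattern_support(long_p=0.4, short_q=0.05)
equal_rates = informative_pattern_support(long_p=0.1, short_q=0.1)
```

With equal rates, the pattern supporting the true grouping AB|CD is the most probable. With the unequal rates chosen here, the pattern uniting the two long branches (AC|BD) is expected at roughly 0.14 of sites against roughly 0.04 for the true grouping, so a method that simply counts shared states would favour the wrong tree more strongly as more data are added.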

Early on, rate and compositional biases were identified as major sources of non-phylogenetic signals. An additional confounding bias that has drawn attention more recently is the heterotachous signal, which is due to variation in the evolutionary rate of a given position through time (Kolaczkowski and Thornton, 2004; Lockhart et al., 1996; Philippe and Germot, 2000) and remains to be thoroughly explored.

8.3.3 Gene duplication and horizontal gene transfer

For the sake of simplicity, we have so far assumed that it is sufficient to analyse homologous positions to infer phylogeny. Strictly speaking, this is correct as long as we are interested in gene phylogenies, but as soon as we want to deduce the species phylogenies, we must ensure that the genes under study are orthologous (Fitch, 1970). By definition, orthologous genes originate in speciation events, whereas paralogous genes originate in gene duplication events (Figure 8.3A).

Fig. 8.3. Orthology, paralogy, and xenology and their consequences for phylogenetic inference. A. At some point in its history, a gene is duplicated and gives rise to two paralogous copies. The duplication event is indicated by a star. In the course of the subsequent speciation events, each copy evolves independently to generate a set of three orthologous genes. When a tree including both paralogues from each species (A, B, and C) is inferred, the true species phylogeny is recovered for each paralogue. B. In a tree inferred from different paralogues instead of orthologues, a wrong species phylogeny is recovered. The suboptimal gene sampling can be due to technical reasons (e.g., orthologous gene not yet sequenced) or to biological reasons (e.g., both copies have been differentially lost in the three lineages). C. True tree: during the evolution of lineages A, B, and C, a gene is horizontally transferred from lineage A to lineage C. D. False tree: because of the close similarity between the xenologue in C and the orthologue in A, a wrong organismal phylogeny grouping species A and C is recovered. As for paralogy above, the orthologue in C may be lacking for technical or biological reasons (e.g., the acquired xenologue has replaced the orthologous gene).

Phylogenetic analyses of species must imperatively be based on orthologous sequences, because trees based on cryptic paralogues can be extremely misleading (Figure 8.3B). Another problem is horizontal gene transfer (HGT), in which a gene is transferred from a donor species to a possibly unrelated recipient species (Figure 8.3C). This alien copy, often called a xenologue, causes problems similar to those caused by paralogues, i.e., unrelated species are incorrectly grouped in phylogenetic trees (Figure 8.3D).

Since gene (and even genome) duplications are frequent in eukaryotes, while HGTs are frequent in prokaryotes, the number of genes that are perfectly orthologous when considering all comparable extant organisms is probably close to zero. For instance, only 14 genes were found to exist in one and only one copy in ten completely sequenced eukaryotic genomes (Philip et al., 2005). Fortunately, if a given gene has undergone a single recent duplication event, the latter will be easy to detect and the gene will still be usable (Philippe et al., 2005b) because it does not interfere with the species phylogeny. Although the same reasoning should hold for HGTs, it appears that the impact of transfer events on the phylogenetic structure is much more destructive (Philippe and Douady, 2003), even when only a few HGTs have occurred. While some researchers have suggested that because of transfer events species phylogenies cannot exist (nor can the tree representation be used) (Doolittle, 1999), the current consensus is that a core of genes (a few hundred within Bacteria) have experienced so few HGTs that they can be used to infer species phylogenies (for reviews see Brown (2003), Ochman et al. (2005), Philippe and Douady (2003)).
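A count of genes present in exactly one copy in every genome, as in the survey cited above, can be sketched as a simple filter over a table of copy numbers. The table layout, the function name, and the toy data below are assumptions made for illustration; they do not reproduce the procedure of the cited study.

```python
def single_copy_families(copy_numbers, genomes):
    """Return gene families found in exactly one copy in every listed genome.

    `copy_numbers` maps a family name to a dict {genome: number of copies};
    a genome missing from that dict is treated as having zero copies.
    """
    return sorted(
        family
        for family, counts in copy_numbers.items()
        if all(counts.get(genome, 0) == 1 for genome in genomes)
    )


# Hypothetical copy-number table for three genomes.
toy_counts = {
    "famA": {"sp1": 1, "sp2": 1, "sp3": 1},
    "famB": {"sp1": 2, "sp2": 1, "sp3": 1},  # duplicated in sp1
    "famC": {"sp1": 1, "sp2": 1},            # absent from sp3
    "famD": {"sp1": 1, "sp2": 1, "sp3": 1},
}

candidates = single_copy_families(toy_counts, genomes=["sp1", "sp2", "sp3"])
```

Here `candidates` holds only famA and famD. Passing this filter does not guarantee orthology: differential loss of paralogues, or replacement of an orthologue by a xenologue, also leaves a single copy per genome (Figure 8.3B and D).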
