How to analyse multigene data

In addition to different models and methods for sequence analyses (see Chapter 8 for an overview), many methodologies exist for the analysis of multiple genes. One approach is to combine individual genes into a single dataset to extract phylogenetic information that might be distributed over many gene families; this so-called 'supermatrix approach' is often cited as a way to improve the resolution of the inferred phylogeny (Brown et al., 2001; Delsuc et al., 2005). In this approach, the individual gene alignments are concatenated into one superalignment, provided that the individual genes are found to have compatible (non-conflicting) phylogenetic histories. The superalignment is then analysed as if it were a single gene alignment, with the advantage that it contains a larger number of sites informative for phylogenetic analysis.
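The concatenation step can be illustrated with a minimal sketch. It assumes that each gene alignment is already available as a mapping from taxon name to aligned sequence; the function name `concatenate_alignments`, the gap-filling of taxa missing from a gene, and the toy sequences are illustrative choices, not something prescribed by the text.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Join per-gene alignments (dicts of taxon -> aligned sequence) end to end.

    Taxa absent from a gene are padded with the `missing` character
    (an assumption of this sketch). Returns the superalignment and the
    column range occupied by each gene (0-based, end exclusive).
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    offset = 0
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("all sequences in one alignment must have equal length")
        n_sites = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(aln.get(taxon, missing * n_sites))
        partitions.append((offset, offset + n_sites))
        offset += n_sites
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical toy data: gene_b is missing from sp2.
gene_a = {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACTT"}
gene_b = {"sp1": "GG", "sp3": "GA"}
supermatrix, partitions = concatenate_alignments([gene_a, gene_b])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
print(partitions)
```

Keeping the column ranges alongside the superalignment makes it possible to assign each gene its own substitution model later, should the analysis software support partitioned models.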

A problem with direct concatenation is the selection of data to include. This selection is complicated by the fact that the absence of evidence for gene transfer cannot be taken as evidence for the absence of gene transfer. If one applies a stringent measure for the detection of conflict, nearly all genes agree with each other within the limits of confidence. The amount of conflict detected depends on the chosen limits of confidence and on the extent of taxon sampling (Snel et al., 2002; Daubin et al., 2003; Mirkin et al., 2003; Ge et al., 2005; Kunin et al., 2005).

Testing the compatibility between different trees and the alignments from which these trees were derived, using the Shimodaira-Hasegawa (SH) test (Shimodaira and Hasegawa, 1999) or the approximately unbiased (AU) test (Shimodaira, 2002), has become the preferred tool for assessing potential conflict between individual gene families (e.g., Lerat et al., 2003). In these tests, the fit of alternative topologies to an alignment is evaluated, and the trees under which the data have a significantly lower probability are rejected as incompatible with the data. (The probability of observing the data given a model of evolution and a phylogenetic tree is also known as the likelihood of the phylogenetic tree.) However, the failure to reject a tree should not be mistaken for evidence of congruence with that tree (Bapteste et al., 2004). A gene might indeed have evolved with a history different from the consensus phylogeny, but the individual gene family may contain too little phylogenetically useful information to make the likelihoods of the two phylogenies significantly different. This is analogous to the failure to detect a significant correlation between fat intake and cancer (Prentice et al., 2006): it does not prove that the correlation does not exist; it only says that the correlation was not significant in the dataset, possibly because too small a sample was studied.
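The dependence of such tests on the amount of data can be made concrete with a small sketch. It is not the SH or AU test: it is a much simpler paired z-test on per-site log-likelihood differences, in the spirit of the Kishino-Hasegawa normal approximation, and it ignores the corrections the SH and AU tests apply when several trees are compared. The function name and the per-site values are hypothetical; in practice the per-site log-likelihoods would come from a likelihood program.

```python
import math


def paired_site_test(site_lnl_tree1, site_lnl_tree2):
    """Paired z-test on per-site log-likelihood differences between two trees.

    Simplified illustration only (normal approximation, two trees fixed
    in advance); not the SH or AU test.
    """
    diffs = [a - b for a, b in zip(site_lnl_tree1, site_lnl_tree2)]
    n = len(diffs)
    total = sum(diffs)                      # lnL(tree1) - lnL(tree2)
    mean = total / n
    var_total = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    z = total / math.sqrt(var_total)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return total, z, p_value


# Hypothetical per-site differences that favour tree 1 slightly on average.
site_diffs = [0.5, -0.3, 0.2, -0.2, 0.4, -0.3]


def make_gene(repeats):
    """Build per-site log-likelihoods for a gene of 6 * repeats sites."""
    tree1 = [-3.0] * (len(site_diffs) * repeats)
    tree2 = [-3.0 - d for d in site_diffs * repeats]
    return tree1, tree2


_, z_short, p_short = paired_site_test(*make_gene(5))    # 30 sites
_, z_long, p_long = paired_site_test(*make_gene(100))    # 600 sites
print(f"30 sites:  z = {z_short:.2f}, p = {p_short:.3f}")
print(f"600 sites: z = {z_long:.2f}, p = {p_long:.5f}")
```

With the same average per-site preference for tree 1, the short gene fails to reject tree 2 at the 5% level while the long gene rejects it. The short gene's history is no more congruent with tree 2; it simply carries too little signal.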

Another challenge to inferences based on concatenation may come from hidden, or unrecognized, paralogy in lineages that went through frequent gene duplication and aneupolyploidization (i.e., having multiple, albeit each incomplete, sets of chromosomes). Especially in animals with metameric organization (i.e., with the body divided into a number of similar segments), gene and genome duplications have long been postulated to have created the regulatory complexity necessary for the different body plans (e.g., Nam and Nei, 2005). Multiple whole genome duplications were inferred for the early evolution of plants (Cui et al., 2006) and vertebrates (Escriva et al., 2002; Meyer and Van de Peer, 2005). These gene duplications have led to gene families with often astounding diversity (e.g., Foth et al., 2006).

However, it is not the complexity of the gene families in itself that generates problems in phylogenetic reconstruction; rather, the frequent loss of one or the other paralogue (Hughes and Friedman, 2004; Nam and Nei, 2005) can lead to the inclusion of unrecognized paralogues in the datasets, with the result that some of the events in the genes' histories reflect gene duplication and not speciation events. For example, analysing the homeobox gene superfamily in 11 genomes, Nam and Nei (2005) inferred 88 homeobox genes to have been present in the ancestor of bilateral animals. Thirty to forty of these were completely lost in one of the 11 species analysed, and many of the ones still represented underwent frequent gene duplications, especially in vertebrates, where more than 200 homeobox genes are found per haploid genome. In a study of four animal genomes, Hughes and Friedman (2004) observed massive parallel loss (19-20% of the gene families present in the common ancestor) in Caenorhabditis elegans and Drosophila melanogaster.

An alternative to the supermatrix approach (Delsuc et al., 2005) is to analyse genes individually and to combine the resulting trees (or the bipartitions or embedded quartets constituting the trees) into a consensus signal (i.e., the phylogenetic signal supported by at least a plurality of genes) by using supertree methods (Beiko et al., 2005), bipartition plotting (Zhaxybayeva et al., 2004), or quartet decomposition (Zhaxybayeva et al., 2006). As an example, Figure 9.1A shows a bipartition analysis of 678 gene families present in ten cyanobacterial genomes (see Zhaxybayeva et al. (2004) for more details). A bipartition plot shows all bipartitions significantly supported by at least one gene family and allows us to quickly extract the plurality signal as well as the gene families conflicting with it. Notably, for cyanobacteria, only three compatible bipartitions are supported by the plurality of genes, resulting in a poorly resolved plurality consensus (Figure 9.1B). Even those three bipartitions are strongly contradicted by 13 gene families (i.e., these genes support a conflicting bipartition with >99% bootstrap support; see the Figure 9.1 legend for the list of gene families). An advantage of the bipartition plotting approach is that only the signal retained in the plurality of datasets is used to synthesize the consensus; the disadvantage is that individual gene analyses often suffer from a lack of resolution due to an insufficient number of phylogenetically informative positions.

[Figure 9.1. Axis label: Bipartitions. Taxon labels shown in the figure: Prochlorococcus sp. MIT9313, marine Synechococcus WH8102, Prochlorococcus sp. CCMP1375, Prochlorococcus sp. MED4, Nostoc punctiforme, Anabaena sp. PCC7120, Trichodesmium erythraeum, Synechocystis sp. PCC6803, Thermosynechococcus elongatus, Gloeobacter violaceus.]

Fig. 9.1. Bipartition plot analysis of 678 gene families in ten cyanobacterial genomes. An unrooted phylogenetic tree can be represented as a set of bipartitions (or splits). If a branch of a tree is removed, the tree 'splits' into two sets of leaves. A bipartition of an unrooted phylogenetic tree is defined as a division of the tree into two mutually exclusive sets of leaves. A. Plot of bipartitions with at least 70% bootstrap support. Each column represents the number of gene families that support (columns pointing upwards) or conflict with (columns pointing downwards) a bipartition. The columns are sorted by the number of gene families supporting each bipartition. The level of bootstrap support is coded by different shades of gray. For details of the phylogenetic analyses see Zhaxybayeva et al. (2004). B. Plurality consensus reconstructed from the three most supported bipartitions. The genes that are in conflict with the consensus at >99% bootstrap support encode ribulose bisphosphate carboxylase large subunit, cell division protein FtsH, translation initiation factor IF-2, ferredoxin, geranylgeranyl hydrogenase, amidophosphoribosyltransferase, photosystem II protein D2, photosystem II CP43 protein, photosystem II CP47 protein, photosystem I core protein A2, photosystem I core protein A1, photosystem II manganese-stabilizing protein, and 5'-methylthioadenosine phosphorylase.
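The bookkeeping behind a bipartition analysis can be sketched in a few lines. The sketch assumes that every gene tree contains the same taxa and is given as nested tuples of leaf names, and it treats each gene tree as fully supported; the published analysis instead counts only bipartitions above a bootstrap threshold. All function names and the five-taxon toy trees are hypothetical.

```python
from collections import Counter


def leaves(tree):
    """Return the set of leaf names below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions (splits) of a tree.

    Each split is a frozenset holding the two complementary leaf sets,
    so it does not matter where the nested-tuple tree happens to be rooted.
    """
    all_taxa = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        if len(side) > 1 and len(other) > 1:   # skip single-leaf splits
            splits.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return splits


def compatible(split1, split2):
    """Two splits fit on one tree if some pair of their sides is disjoint."""
    return any(not (a & b) for a in split1 for b in split2)


def support_and_conflict(gene_trees):
    """Count, for each observed split, the gene trees that support it and
    the gene trees that contain at least one incompatible split."""
    tree_splits = [bipartitions(tree) for tree in gene_trees]
    support = Counter(s for splits in tree_splits for s in splits)
    conflict = {
        s: sum(any(not compatible(s, other) for other in splits)
               for splits in tree_splits)
        for s in support
    }
    return support, conflict


# Three hypothetical gene trees on taxa A-E.
gene_trees = [
    (("A", "B"), "C", ("D", "E")),
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "C"), "B", ("D", "E")),
]
support, conflict = support_and_conflict(gene_trees)
for split, count in support.most_common():
    sides = " | ".join("".join(sorted(side)) for side in sorted(split, key=sorted))
    print(f"{sides}: supported by {count}, conflicted by {conflict[split]}")
```

Sorting the splits by support and plotting support upwards and conflict downwards would give the kind of display described in the figure legend; the splits with the highest support that are mutually compatible form the plurality consensus.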

Regardless of the method used to arrive at the consensus, the question still remains: what does the signal inferred from multigene data mean? Does it serve as a proxy for an organismal phylogeny? Does it sometimes reflect grouping by ecotypes? In the next section we explore these questions by considering an example of four marine cyanobacteria.
