## Info

location, that could be compared among individuals. Genetic variation at such nucleotide sites is characterized by the existence of DNA sequences that have different nucleotides (e.g. some individuals have an A and other individuals have a T at the 37th base pair from the start codon in the same gene) and is called nucleotide polymorphism. Nucleotide polymorphisms are sometimes referred to as single nucleotide polymorphisms or SNPs (pronounced "snips"). This section of the chapter covers commonly used measures of divergence and polymorphism estimated from DNA sequence data.

### DNA divergence between species

The most fundamental method to quantify molecular evolution is to compare two DNA sequences. This comparison is a two-step procedure. First, the two DNA sequences must be aligned so that homologous nucleotide sites for each sequence are lined up in the same columns. For example, if two coding genes were sequenced, then one way to align them would be to match up the first three nucleotides that make up the start codon. (Methods of sequence alignment are beyond the scope of this text, but readers can refer to texts such as Page & Holmes (1998) for more details.) The second step is to determine the number of sites that have different nucleotides. The number of nucleotide sites that differ between two DNA sequences divided by the total number of nucleotide sites compared gives the proportion of nucleotide sites that differ, often called a p distance as shorthand for proportion-distance. This is a basic measure of the evolutionary events that have occurred since two DNA sequences descended from a common ancestor, when they were each identical copies of the same sequence.

An example of divergence between a pair of sequences is shown in Fig. 8.4. Consider the two sequences at the far right in the present time after some divergence that has introduced substitutions. There are five nucleotide sites that have different nucleotides out of a total of 16 nucleotide sites. Therefore, the p distance is 5/16 = 0.3125, or 31.25% of the nucleotide sites have diverged.

p distance: The number of nucleotide sites that differ between two DNA sequences divided by the total number of nucleotide sites, a shorthand for proportion-distance. Sometimes symbolized as d for distance.
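The two-step calculation described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the two sequences are already aligned and of equal length. The function name `p_distance` and the two example sequences are hypothetical: they are not the sequences in Fig. 8.4 and were chosen only so that 5 of 16 sites differ.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned nucleotide sites that differ between two sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must contain at least one site")
    # Count the columns of the alignment where the two nucleotides differ.
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


# Hypothetical aligned sequences (not taken from Fig. 8.4): 16 sites, 5 differences.
seq_1 = "ATGCCGTTAGCATCGA"
seq_2 = "ATGTCGATAGGATCCT"

print(p_distance(seq_1, seq_2))  # 5/16 = 0.3125
```

The sketch ignores alignment gaps and ambiguous bases, which real data would require handling before counting differences.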

The p distance between two DNA sequences sampled from completely independent populations should increase over time as substitutions within each population replace the nucleotide that was originally shared at each site due to identity by descent. If the two DNA sequences represent two distinct species or completely isolated populations, then the p distance is a measure of divergence between the two species.

### DNA sequence divergence and saturation

Saturation is the phenomenon where DNA sequence divergence appears to slow and eventually reaches a plateau even as time since divergence continues to increase. Saturation in nucleotide changes over time is caused by substitution occurring multiple times at the same nucleotide site, a phenomenon called multiple hit substitution (see the related topic of multiple hit mutation in Chapter 5). Substitutions that occur repeatedly at the same site have the effect of covering up information about past substitutions, since only the most recent substitution can be observed and measured as divergence between two DNA sequences. Computing the p distance between two sequences therefore underestimates the number of substitutions that have occurred and, in turn, the degree of divergence. The top panel of Fig. 8.7 shows divergence that increases linearly with time since divergence (dashed line) and divergence that exhibits saturation (solid line). Actual DNA sequence data routinely exhibit some degree of saturation, as shown in the bottom panel of Fig. 8.7 for the mitochondrial cytochrome c oxidase subunit II gene sequenced for several bovine species (ungulates including domestic cattle, bison, water buffalo, and yak) that diverged between 2 and 20 million years ago.

Figure 8.7 Substitutions that occur repeatedly at the same nucleotide site lead to saturation of nucleotide changes as time since divergence from a common ancestor increases. The rate of substitutions does not change and the total number of substitutions continues to increase over time, as shown by the dashed line in the top panel representing the true number of substitutions. In contrast, multiple substitutions at the same sites lead to a slowing and leveling off in the estimate of divergence (solid line, top panel). Therefore, the observed amount of divergence gives the impression that the rate of divergence decreases over time. The bottom panel shows divergence and saturation at the mitochondrial cytochrome c oxidase subunit II gene among bovine species (ungulates including domestic cattle, bison, water buffalo, and yak) that diverged between 2 and 20 million years ago. In the top panel $\alpha = 1 \times 10^{-6}$ ($\alpha$ is explained later in this section). The bottom panel data are from Janecek et al. (1996) and the line is a quadratic regression fit.

Saturation can be understood by imagining the process of assembling a DNA sequence at random and comparing it with another existing DNA sequence. Think of drawing individual nucleotides out of a bucket containing equal numbers of A, C, G, and T bases. If a nucleotide site in the existing sequence is an A, there is a 25% chance that a randomly drawn nucleotide will be an A and the sites will match. On the other hand, there is a 75% chance it will not be an A (it will be a T, C, or G) and the two sites will be diverged. Thus, a DNA sequence assembled from random draws of nucleotides at equal frequency should be 75% divergent or 25% identical to another sequence on average. The consequence is that two sequences which originated from a common ancestor do not continue to get increasingly divergent over time. Eventually, the maximum divergence will plateau at 75% as continued mutation essentially randomizes the shared sites between the two sequences.
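The bucket argument can be checked numerically. The sketch below is an illustration only: the function names, the sequence length, and the random seed are assumptions made for the example. It computes the expected mismatch proportion for a given set of base frequencies, then compares two sequences drawn at random with equal base frequencies.

```python
import random


def expected_random_divergence(freqs):
    """Expected proportion of mismatched sites between two random sequences
    whose nucleotides are drawn independently with the given frequencies."""
    # Two sites match when both draws give the same base: sum of f^2 over bases.
    return 1.0 - sum(f * f for f in freqs.values())


def simulated_random_divergence(n_sites, seed=1):
    """Draw two random sequences with equal base frequencies; return their p distance."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    seq_a = [rng.choice("ACGT") for _ in range(n_sites)]
    seq_b = [rng.choice("ACGT") for _ in range(n_sites)]
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / n_sites


equal_freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(expected_random_divergence(equal_freqs))  # 0.75
print(simulated_random_divergence(100_000))     # close to 0.75
```

With unequal base frequencies the same calculation gives a plateau below 75%, which is one reason substitution models carry parameters for base frequencies.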

There are a wide variety of methods to correct the perceived divergence between two DNA sequences to obtain a better estimate of the true divergence after accounting for multiple hits. These correction methods are called nucleotide substitution models and use parameters for DNA base frequencies and substitution rates to obtain a modified estimate of the divergence between two DNA sequences. The simplest of these is the Jukes and Cantor (1969) nucleotide-substitution model, named for its authors. Working through the derivation of the Jukes-Cantor model is worthwhile to gain some insight into how nucleotide-substitution models operate.
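As a preview of where such a derivation leads, the standard Jukes-Cantor result converts an observed p distance into an estimated number of substitutions per site, $d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)$. That formula is not derived in the passage above, so the sketch below should be read as an illustration of the end result under the model's equal-rate assumption; the function name and the example p values are hypothetical.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance (expected substitutions per site)
    from an observed p distance. Undefined once p reaches the 0.75 plateau."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for the correction to be defined")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Illustrative p distances (assumed values, not data from the text).
for p in (0.01, 0.1, 0.3125, 0.6):
    print(f"p = {p:<6} corrected d = {jukes_cantor_distance(p):.4f}")
```

The corrected distance is never smaller than the p distance, and the gap widens as p approaches the 75% plateau, where multiple hits hide the most substitutions.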

The Jukes-Cantor model starts out by assuming that any nucleotide in a DNA sequence is equally likely to be substituted with any of the other three nucleotides. For example, if a site currently has a C then substitution of an A, T, or G all have the same chance of occurring. Figure 8.8 shows three possible events for a nucleotide site. The site may (1) experience one and only one substitution, (2) not experience any substitutions over time, and (3) experience a substitution that changes the nucleotide at the site and then another independent substitution that restores the original nucleotide at the site. In the first situation perceived divergence and actual divergence are the same and no correction is required. The perceived divergence in the second and third cases is the same but very different events have occurred. In the third case, substitutions have occurred for some portion of nucleotide sites that appear to have no divergence. Nucleotide substitution models such as Jukes-Cantor serve to estimate the frequency of nucleotide sites that appear to have not diverged but actually have diverged.

Figure 8.8 The three types of event that a single nucleotide site may experience over two generations. A nucleotide site may initially have a G, for example. In case 1 a single substitution event in generation one changes the G to an A, C, or T nucleotide, the nucleotide also present at generation two. A p distance measure of divergence will accurately count the number of substitutions in case 1. In cases 2 and 3 the nucleotide site still retains the same nucleotide it had initially, giving the impression that there have been no substitutions. In case 2 this impression is accurate. However, in case 3 there have been two substitution events that are not accounted for in a simple p distance measure of divergence.

In the Jukes-Cantor model, the probability that a nucleotide is substituted by one specific other nucleotide in one generation is customarily represented by $\alpha$ (pronounced "alpha"). Since there are three nucleotides that can each be substituted for a nucleotide currently at a site and all three are equally likely to be substituted, the probability of any substitution is $3\alpha$. So if the nucleotide is initially a G at generation zero, the probability that it is also a G one generation later is

$$P_{G(t=1)} = 1 - 3\alpha$$

Since the chance of substitution is independent in each generation, the probability of no substitutions over two generations is

$$(1 - 3\alpha)(1 - 3\alpha) = (1 - 3\alpha)^2$$

This gives the probability that a nucleotide does not change over two generations as shown in case 2 of Fig. 8.8.

We also need to determine the probability that a nucleotide changes twice, as shown in case 3 of Fig. 8.8. From generation zero to generation one, the probability of a substitution is $3\alpha$. This probability can also be written as one minus the probability of no substitution, or $1 - P_{G(t-1)}$. From generation one to generation two, there is only one base that can be substituted to make the site match its initial state. The chance that this occurs is the probability of one specific substitution, or $\alpha$. Bringing these two independent probabilities together gives the probability that a multiple hit nucleotide substitution occurs which restores the nucleotide initially present:

$$\alpha\left[1 - P_{G(t-1)}\right] = \alpha(3\alpha) = 3\alpha^2$$
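The probabilities for cases 2 and 3 of Fig. 8.8 can be evaluated for a concrete substitution probability. The sketch below is an illustration under the Jukes-Cantor assumption of equal substitution probabilities; the function names and the value of `alpha` are assumptions made for the example, not values from the text.

```python
def prob_no_change(alpha: float, generations: int) -> float:
    """Probability that a site experiences no substitution in any generation (case 2)."""
    return (1.0 - 3.0 * alpha) ** generations


def prob_change_and_revert(alpha: float) -> float:
    """Probability of a substitution followed by the one substitution that
    restores the original nucleotide, over two generations (case 3)."""
    prob_first_substitution = 1.0 - prob_no_change(alpha, 1)  # equals 3 * alpha
    return prob_first_substitution * alpha


alpha = 0.01  # assumed value for illustration only
case_2 = prob_no_change(alpha, 2)
case_3 = prob_change_and_revert(alpha)
print(case_2)           # (1 - 3*alpha)^2
print(case_3)           # 3*alpha * alpha
print(case_2 + case_3)  # probability the site shows its original nucleotide after two generations
```

Even with a substitution probability this large, case 3 is rare over two generations. Its contribution accumulates over many generations, which is what produces saturation.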
