## Problem box: Computing Tajima's D from DNA sequence data

To study the population history of Drosophila simulans, Baudry et al. (2006) sampled flies from multiple populations in Africa, Europe, and the Antilles. From these flies they sequenced four genes located on the X chromosome. Using part of their DNA sequence data, test the hypothesis that D. simulans meets the assumptions of the standard neutral coalescent model via Tajima's D.

DNA sequences from the runt locus for flies sampled in Europe and Mayotte (an overseas collectivity of France composed of several islands in the Indian Ocean, between northern Madagascar and northern Mozambique) exhibited the following patterns:

| Population | n sequences | Nucleotide sites | S | π |
|---|---|---|---|---|
| Europe | 15 | 556 | 17 | 0.012436 |
| Mayotte | 15 | 538 | 34 | 0.013525 |

Use the number of segregating sites (S) to calculate the number of segregating sites per nucleotide site (pS) and then estimate θS per site according to equation 8.29. Then compute Tajima's D according to equation 8.54.

What do you conclude about the history of these two D. simulans populations? Note that your estimates of Tajima's D will differ from those in Baudry et al. (2006) because they used only synonymous site polymorphisms whereas you have used polymorphisms at all sites. Why did Baudry et al. (2006) use only synonymous site polymorphisms? The DNA sequence data files are available on the text website.
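The calculation requested in this problem box can be sketched in a few lines of Python. This is a minimal sketch, not the text's own solution: equations 8.29 and 8.54 are not reproduced in this section, so the code assumes they correspond to the standard expressions from Tajima (1989), namely θS = pS / a1 with a1 = Σ 1/i for i = 1 to n − 1, and D equal to the difference between the two estimates of θ divided by the square root of its estimated variance. The function name `tajimas_d` and its argument names are illustrative. The variance constants are defined in terms of whole-locus counts, so the per-site π from the table is multiplied by the number of nucleotide sites before the comparison is made.

```python
import math

def tajimas_d(n, n_sites, seg_sites, pi_per_site):
    """Return (pS, thetaS per site, Tajima's D) from summary statistics.

    Assumes the standard constants of Tajima (1989); these are taken
    to match equations 8.29 and 8.54, which are not shown in this section.
    """
    # Harmonic-type sums over i = 1 .. n-1
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))

    # Segregating sites per nucleotide site and the per-site estimate of theta
    p_s = seg_sites / n_sites
    theta_s = p_s / a1

    # Constants for the variance of the difference between the two estimators
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # Work on the whole-locus scale: mean pairwise differences versus S / a1
    mean_pairwise = pi_per_site * n_sites
    numerator = mean_pairwise - seg_sites / a1
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return p_s, theta_s, numerator / math.sqrt(variance)

# Values copied from the table in this problem box
samples = {
    "Europe": dict(n=15, n_sites=556, seg_sites=17, pi_per_site=0.012436),
    "Mayotte": dict(n=15, n_sites=538, seg_sites=34, pi_per_site=0.013525),
}
for name, stats in samples.items():
    p_s, theta_s, d = tajimas_d(**stats)
    print(f"{name}: pS = {p_s:.6f}, thetaS = {theta_s:.6f}, D = {d:.3f}")
```

The sign of D shows which estimate of θ is larger in each population. Judging whether a value departs significantly from zero still requires comparison with the critical values for the sample size.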

where n is the number of sequences sampled and assuming that there is no recombination.

Although the variance of D is a complex expression, it can still be understood intuitively. The formula quantifies both sampling variance and evolutionary variance. Sampling variance comes from taking a sample of DNA sequences and using them to estimate π and S. As with any finite sample of data from a larger underlying population, repeating the sampling procedure would result in a slightly different estimate of the parameters of interest because the sample is not perfectly representative of the full population. Sampling variance decreases as sample sizes increase since estimates are based on a larger and larger proportion of the underlying population. In contrast, evolutionary variance is caused by the variable outcomes of the random evolutionary process of genetic drift. Evolutionary variance can only be estimated by sampling multiple independent realizations of the same random process, for example by taking samples of DNA sequences from multiple populations that independently experienced genetic drift after being isolated from the same ancestral population. The coalescence process itself has a great deal of evolutionary variance. For example, under the standard coalescent model there is a wide range of coalescence times for k lineages around an average, with variance in coalescence time that is largest for two lineages (see equation 3.76). Both sources of variance are taken into account when determining the standard error of D.
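The evolutionary variance of the coalescent process described above can be illustrated with a short simulation. The sketch below assumes that the waiting time while k lineages are present is exponentially distributed with rate k(k − 1)/2, with time scaled so that a single pair of lineages coalesces at rate 1. That scaling is an assumption made here for convenience, since equation 3.76 is not reproduced in this section. The function name `simulate_waiting_times` is illustrative.

```python
import random
import statistics

def simulate_waiting_times(k, reps, rng):
    """Draw `reps` independent waiting times to the next coalescence
    among k lineages, with time scaled so a pair coalesces at rate 1."""
    rate = k * (k - 1) / 2  # number of lineage pairs that could coalesce
    return [rng.expovariate(rate) for _ in range(reps)]

rng = random.Random(1)  # fixed seed so the run is repeatable
for k in (2, 3, 5, 10):
    times = simulate_waiting_times(k, 10_000, rng)
    print(f"k = {k:2d}: mean = {statistics.mean(times):.4f}, "
          f"variance = {statistics.variance(times):.4f}")
```

Each replicate stands in for an independent realization of the same evolutionary process. The spread among replicates is widest for k = 2, which is why the deepest part of a genealogy contributes most of the evolutionary variance.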

The value of Tajima's D is influenced by changes over time in the size of populations, population structure, and the action of natural selection. Therefore, Tajima's D is not a simple test for the action of natural selection alone, as is sometimes assumed. The null model is based on a constant mutation rate through time (a molecular clock), the infinite sites model of mutation, the Wright-Fisher model with non-overlapping generations, and a panmictic population at drift-mutation equilibrium (see Tajima 1996 on the first two points). Although a large absolute value of D serves to reject the standard coalescent model for a given set of DNA polymorphism data, distinguishing among changes in effective size through time, population structure, and natural selection requires more than just an estimate of D at a single locus. DNA polymorphism in humans, for example, often shows negative values of Tajima's D. These results are considered by many to be caused by a low level of population structure as well as a history of very rapid population growth in the recent past that characterizes human populations rather than widespread balancing selection operating on human loci (Ptak & Przeworski 2002; Tishkoff & Verrelli 2003).

### Mismatch distributions

The previous section explored how a sample of DNA sequences from a population can be used to test neutral expectations by comparing estimates of θ based on polymorphism measured with nucleotide diversity (π) and the number of segregating sites (S). Both π and S summarize patterns of sequence variation into a single number. Specifically, the nucleotide diversity π is really an average of the differences between all pairs of sequences in a sample. Instead of using an average to measure polymorphism, we can directly examine the distribution of all individual pairwise sequence comparisons. This is commonly called the mismatch distribution and it is the frequency distribution of the number of nucleotide sites that differ between all unique pairs of DNA sequences in a sample. The mismatch distribution is a tool that can be used to infer the history of the population that gave rise to a sample of DNA sequences. It can be used to infer past changes in the effective size of a population using selectively neutral DNA sequences. Alternatively, in populations that have maintained a constant size over time, these distributions can be used to identify the action of natural selection.

Mismatch distribution: The frequency distribution of the number of nucleotide sites that differ between all unique pairs of DNA sequences in a sample from a single species. It is also known as the distribution of pairwise differences.

Haplotype frequency distribution: The distribution of the frequency of each sequence haplotype in a population assuming that individuals are haploid or homozygous. It is also known as the site frequency spectrum.
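The mismatch distribution defined above can be tabulated directly from a set of aligned sequences. The following is a minimal sketch that assumes all sequences have equal length and contain no gaps or missing data. The function names and the four toy sequences are hypothetical and are not taken from Baudry et al. (2006) or from any figure in the text.

```python
from collections import Counter
from itertools import combinations

def pairwise_differences(seqs):
    """Number of differing sites for every unique pair of aligned sequences."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    return [sum(x != y for x, y in zip(a, b))
            for a, b in combinations(seqs, 2)]

def mismatch_distribution(seqs):
    """Map each observed number of differences to its relative frequency."""
    diffs = pairwise_differences(seqs)
    counts = Counter(diffs)
    return {d: counts[d] / len(diffs) for d in sorted(counts)}

# Hypothetical toy alignment: two closely related pairs from two distant groups
toy = ["AAAAAAAA", "AAAAAAAT", "TTTTAAAA", "TTTTAAAT"]
diffs = pairwise_differences(toy)
print("mismatch distribution:", mismatch_distribution(toy))
print("mean pairwise differences:", sum(diffs) / len(diffs))
```

The mean of the pairwise differences is the quantity that nucleotide diversity summarizes as a single number, while the distribution keeps the separation between closely related and distantly related pairs visible.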

Let's assume complete neutrality of mutations to focus on using mismatch distributions to develop expectations for patterns of DNA sequence differences in stable, growing, and shrinking populations. The properties of the mismatch distribution arise directly from expected patterns that characterize neutral genealogies. Chapter 3 shows that the last pair of lineages (k = 2) takes the longest average time to coalesce in standard neutral genealogies for populations with constant Ne. When there is mutation, the two oldest lineages in the population also differ by the largest number of mutations since the expected number of mutations is proportional to the length of time a lineage exists. In populations that maintain constant Ne, these oldest two lineages experience numerous mutations and therefore have a high degree of mismatch. This pattern of long lineages having multiple mutations can be seen in the genealogy in Fig. 8.21.

Working from the past to the present, the two oldest lineages in any genealogy give rise to additional lineages. The younger progeny lineages inherit all the mutations that have occurred on the progenitor lineages and may also experience additional new mutations. Since the lineages closer to the present tend to have shorter times to coalescence (the probability of coalescence increases with larger k), they also tend to accumulate fewer mutations. Looking at Fig. 8.21, the three lineages within group A would each inherit the four mutations that occurred on the internal branch that was their ancestor. Because lineages 1, 2, and 3 within group A share the mutations of their ancestral lineage, they also tend to have fewer nucleotide sites that mismatch. For example, lineages 1 and 2 differ by only the two mutations that occurred near the present (mutations at nucleotide sites 17 and 22). Lineages 4, 5, and 6 within group B also have low levels of mismatch by the same logic.

In contrast, the level of sequence mismatch is high when lineages are compared between groups A and B in Fig. 8.21. For example, lineages 1 and 4 differ by nine mutations. This high level of mismatch occurs because sequences from distantly related lineages are separated by much more time since they shared a common ancestral lineage, leading to many more mutational changes that independently altered each DNA sequence. Another way to think of the situation is that closely related lineages differ only by a few young mutations while distantly related lineages differ by more mutations, many of which are old and have been resident in the population for a long time.

The mismatch distribution has distinct patterns depending on the demographic history of the population (Slatkin & Hudson 1991; Rogers & Harpending 1992). Mismatch distributions from populations that have experienced a constant Ne over time tend to have two clusters of values in the mismatch distribution.
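One way to explore how a mismatch distribution arises from a neutral genealogy is to simulate the process. The sketch below is an illustration under stated assumptions rather than the method used in the cited studies. It assumes a standard coalescent with constant Ne, time scaled so that a pair of lineages coalesces at rate 1, the infinite sites model, and mutations arising on each lineage at rate θ/2 per unit of scaled time, so that the expected number of differences between two sequences is θ. The function names and the parameter values are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def poisson(mean, rng):
    """Poisson draw: count unit-rate exponential arrivals before `mean`."""
    count, elapsed = 0, rng.expovariate(1.0)
    while elapsed < mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_pairwise_differences(n, theta, rng):
    """Simulate one genealogy of n sequences and return the number of
    differing sites for every unique pair of sampled sequences."""
    # Each lineage is the set of sampled sequences that descend from it
    lineages = [frozenset([i]) for i in range(n)]
    mutations = []  # one entry per mutation: the samples that carry it
    while len(lineages) > 1:
        k = len(lineages)
        t = rng.expovariate(k * (k - 1) / 2)  # time until next coalescence
        for lineage in lineages:
            # Infinite sites: every mutation hits a new nucleotide site
            mutations.extend([lineage] * poisson(theta / 2 * t, rng))
        i, j = rng.sample(range(k), 2)  # random pair of lineages coalesces
        merged = lineages[i] | lineages[j]
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
    # Two samples differ at a site when exactly one of them carries the mutation
    return [sum((a in m) != (b in m) for m in mutations)
            for a, b in combinations(range(n), 2)]

rng = random.Random(42)  # fixed seed so the run is repeatable
diffs = simulate_pairwise_differences(n=10, theta=5.0, rng=rng)
for d, count in sorted(Counter(diffs).items()):
    print(f"{d:3d} differences: {'*' * count}")
```

Running the simulation with different seeds shows how much the shape of the distribution varies among independent genealogies drawn from the same constant-size population.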
