## Info

The variance estimator shown in equation A.2 is sometimes called the sampling variance, and it is an unbiased estimator of the variance in a very large population (unbiased means that the expected value of the estimator is equal to the true value of the variance). An alternative form of the variance exists where the sum of squares is divided by n rather than n - 1. Dividing by n estimates what is sometimes called the parametric variance, or the variance for a finite population of size n where all n individuals have been used to compute the variance. The parametric variance is employed in idealized situations where every individual in a population can be measured or sampled. In practice, this distinction makes little difference for large n.
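The distinction can be made concrete in a few lines of code. The sketch below computes both forms of the variance; the eight allele frequencies and the function names are hypothetical stand-ins, not the values from Fig. A.2.

```python
# Sample (n - 1) versus parametric (n) variance. The eight allele
# frequencies here are made-up illustrations, not the Fig. A.2 values.

def sample_variance(values):
    """Unbiased estimator: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def parametric_variance(values):
    """Finite-population variance: sum of squared deviations divided by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

freqs = [0.48, 0.52, 0.50, 0.46, 0.54, 0.49, 0.51, 0.47]
print(sample_variance(freqs))      # divides by 7
print(parametric_variance(freqs))  # divides by 8; always slightly smaller
```

Because the two estimators differ only in the divisor, the parametric variance is always the smaller of the two, and the gap shrinks as n grows.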

If we are willing to assume that our average is drawn from a frequency distribution called the normal distribution (which Interact box A.1 suggests is not unreasonable), then we can use the variance in equation A.2 as a measure of uncertainty in our estimate of the average allele frequency. The standard deviation is symbolized by σ (a lower-case sigma) or SD and is simply the square root of the variance (taking the square root returns the variance to the original units of measurement). The SD measures the average deviation of a single observation from the mean. Figure A.3 illustrates the standard deviations that arise from the variances of two different frequency distributions. Normal distributions are very useful because the standard deviation corresponds to the probability that an observation is some distance from the average. For an ideal normal distribution, about 68% of the observations fall within plus or minus one SD of the mean, about 95% of the observations fall within two SDs of the mean, and about 99% of the observations fall within three SDs of the mean.
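As a rough check of these percentages, the sketch below draws a large number of values from an ideal normal distribution and counts the fraction falling within 1, 2, and 3 SDs of the mean.

```python
# Empirical check of the 68/95/99 rule for an ideal normal distribution.
import random

random.seed(1)
mean, sd = 0.0, 1.0
draws = [random.gauss(mean, sd) for _ in range(100_000)]

within = {}
for k in (1, 2, 3):
    within[k] = sum(abs(x - mean) <= k * sd for x in draws) / len(draws)
    print(f"within {k} SD: {within[k]:.3f}")
```

With this many draws the fractions come out close to 0.68, 0.95, and 0.997.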

The standard deviation of a single observation can also be used to quantify uncertainty in the average of all observations. The standard error, or SE, of the sum of all observations (the numerator in equation A.1) is the square root of the sample size multiplied by the SD, or $\sqrt{n}\,\mathrm{SD}$. Since the SD measures the average spread of an observation from the mean, the SE of the sum adds these individual deviations up. The square root of the sample size is the multiplier because the SE grows more slowly than the sample size itself as observations are added to the sum. The SE of the sum can be related to the average by taking the average of the SE of the sum, or

$$\mathrm{SE} = \frac{\sqrt{n}\,\mathrm{SD}}{n} = \frac{\mathrm{SD}}{\sqrt{n}}$$

(the $\sqrt{n}$ term in the numerator of the middle expression cancels because $n = \sqrt{n} \times \sqrt{n}$ in the denominator). Notice that as the sample size increases, the SE of the mean decreases. Like the SD, the SE of the average defines a probability range around the mean due to chance events in the sample. The probability intervals defined by the SE serve to establish what are called confidence intervals. By convention, 95% confidence intervals are frequently used to quantify the chances that the parameter estimate ($\hat{p}$) plus or minus 2 SEs covers the true parameter value in the population ($p$). Again, Fig. A.3 illustrates the standard errors of the means for two frequency distributions.
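The two standard errors can be written as small helper functions. The function names below are my own; the numbers plugged in are the SD = 0.0272 and n = 8 from the mouse example in the main text.

```python
# SE of the sum is sqrt(n) * SD; SE of the mean is SD / sqrt(n).
import math

def se_of_sum(sd, n):
    return math.sqrt(n) * sd

def se_of_mean(sd, n):
    # (sqrt(n) * SD) / n simplifies to SD / sqrt(n).
    return sd / math.sqrt(n)

print(round(se_of_sum(0.0272, 8), 4))   # 0.0769
print(round(se_of_mean(0.0272, 8), 4))  # 0.0096
```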

To return to our mouse population example diagrammed in Fig. A.2, $\hat{p} = 0.4976$, the variance is 0.00074, and the standard deviation is 0.0272. The standard error of the sum is $\sqrt{8} \times 0.0272 = 0.0769$ and the standard error of the average is $(\sqrt{8} \times 0.0272)/8 = 0.0096$. Therefore, the 95% confidence interval for the mean is $0.4976 - (2 \times 0.0096)$ to $0.4976 + (2 \times 0.0096)$, or 0.4784 to 0.5168. Thus, we would expect 95 times out of 100 that the range of allele frequencies between 0.4784 and 0.5168 would include the actual allele frequency of the population, $p$. You will be reassured to note that the values in Fig. A.2 are random numbers generated by computer using a distribution with a mean of 0.50, equivalent to the true parameter value. The value of $\hat{p}$ is very close to this parameter value, and the parameter value also falls inside the 95% confidence interval for $\hat{p}$ defined by plus or minus 2 SEs of the average.
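The confidence interval arithmetic can be reproduced directly with the values from the mouse example:

```python
# Reproducing the 95% confidence interval from the mouse example:
# p_hat +/- 2 SE, with SD = 0.0272 and n = 8 populations.
import math

p_hat, sd, n = 0.4976, 0.0272, 8
se = sd / math.sqrt(n)
lower = p_hat - 2 * se
upper = p_hat + 2 * se
print(round(lower, 4), round(upper, 4))  # 0.4784 0.5168
```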

### Problem box A.1 Estimating the variance

Refer to Fig. A.2 and replace the values in the figure for the allele frequency within each population with 0.6333, 0.5074, 0.4880, 0.3960, 0.5368, 0.3330, 0.5893, and 0.7029. What is the estimate of allele frequency in the entire population and what is the 95% confidence interval? How does the variance in this case compare with the variance for the original values given in Fig. A.2? How does the range of allele frequencies that include the true population parameter compare to the original values in Fig. A.2?
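After working through the problem by hand, one way to check your answer is to let a few lines of code repeat the calculation with the replacement values from the problem:

```python
# A check for Problem box A.1: mean, sample variance, and 95% CI
# computed from the eight replacement allele frequencies.
import math

freqs = [0.6333, 0.5074, 0.4880, 0.3960, 0.5368, 0.3330, 0.5893, 0.7029]
n = len(freqs)
mean = sum(freqs) / n
variance = sum((p - mean) ** 2 for p in freqs) / (n - 1)
se = math.sqrt(variance) / math.sqrt(n)
lower, upper = mean - 2 * se, mean + 2 * se
print(round(mean, 4), round(variance, 5))
print(round(lower, 4), round(upper, 4))
```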

### Interact box A.1 The central limit theorem

The central limit theorem predicts that the distribution of means will approach a normal distribution regardless of the shape of the distribution of the original data sampled to compute the mean. The central limit theorem is the justification for using the standard error as a measure of confidence in a mean such as the average allele frequency (see main text). The central limit theorem also shows that certainty in parameter estimates increases with the size of the samples used to make such estimates. Instead of accepting the central limit theorem as a given, let's use simulation to explore it as a prediction that can be tested.
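If the applet is unavailable, the same prediction can be tested with a short simulation of your own. This sketch samples means of N = 25 draws from a uniform parent distribution, which is clearly non-normal, and checks that the distribution of means centers on the parent mean with spread close to SD/√25.

```python
# Sample means of N = 25 draws from a uniform (non-normal) parent.
# The CLT predicts the means cluster symmetrically around the parent
# mean of 0.5 with spread near (1/sqrt(12)) / sqrt(25).
import random
import statistics

random.seed(42)

def mean_of_sample(n):
    return statistics.fmean(random.random() for _ in range(n))

means = [mean_of_sample(25) for _ in range(1000)]

center = statistics.fmean(means)
spread = statistics.stdev(means)
print(round(center, 3))  # near the parent mean of 0.5
print(round(spread, 3))  # near (1/sqrt(12)) / sqrt(25), about 0.058
```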

Step 1 On the textbook web page under the heading Appendix, click on the link for Central limit theorem simulation.

Step 2 A web page titled Sampling Distributions will open. After a moment to load, a Begin button will appear at the top left of the window. Press the button and a new simulation window will open. Once the Java code for the window loads, you will see four sets of axes with some buttons and pop-out menus down the right side of the window. The top set of axes will contain a normal distribution with the mean, median, and standard deviation ("sd") given in three colors to the left of the distribution and their positions also indicated in those same colors below the x-axis. Please read the Instructions text and refer to the simulation window so that you understand the controls.

Step 3 This simulation samples individual data points from the distribution at the top. Pressing the Animated button on the right will show the sampling process with five data points at a time on the axes labeled Sample Data. Press the button and watch as five bars drop to indicate the five data points. Once the sample is complete a single data point, the mean of the five individual data points, will appear on the axes labeled Distribution of Means. Press the Animated button a few times to get a sense of how the two middle windows display the data. Be sure you can distinguish between the sample size (the N = menu right of the axes) and the number of samples (tabulated as Reps = left of the axes).

Step 4 Set N = 25 (the lower three graphs will clear) and press the Animated button five times (notice how Reps = increases by one each time). Now press the 5 button once. This is like taking five more samples of 25 without the animation. Notice how Reps = is now 10, since the repetitions add up. Click on the Fit normal check box to compare the histogram in the distribution of means to an ideal normal distribution. Ideal normal distributions have skew and kurtosis (measures of asymmetry and peakedness, respectively) values of zero, like the normal distribution at the top.

Step 5 How do the sample size and the number of samples influence the distribution of means? To find out, simulate a range of values of one variable while holding the other variable constant (try N = 2 and 20 with 20, 50, 100, and 1000 reps). Look at the histograms and write down the skew and kurtosis values for each combination of the sample size and number of samples. Do larger samples improve the approach to normality of the distribution of means?

Step 6 Is the distribution of means still a normal distribution even if the parent population (the top-most distribution) is not a normal distribution? Compare the skew and kurtosis values from samples taken from three different parent populations (changed with the pop-out menu to the right of the top-most axes) for a range of identical sample sizes and numbers of samples (try N = 5 with 20, 50, 100, and 1000 reps).

Step 7 Feel free to explore the topic further by clicking on the Exercises link below the Begin button.
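The comparisons in Steps 5 and 6 can also be run without the applet. The sketch below uses a strongly skewed exponential parent distribution and shows that the skew of the distribution of means shrinks as the sample size grows; the skewness function is computed from its moment definition rather than taken from any library.

```python
# Skew of the distribution of means shrinks as sample size N grows,
# even for a strongly skewed (exponential) parent distribution.
import random
import statistics

random.seed(7)

def skewness(xs):
    # Third standardized moment: mean of cubed z-scores.
    m = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return statistics.fmean(((x - m) / sd) ** 3 for x in xs)

skews = {}
for n in (2, 25):
    means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
             for _ in range(2000)]
    skews[n] = skewness(means)
    print(n, round(skews[n], 2))
```

The exponential parent has skew 2; the means of samples of size n should show skew near 2/√n, so larger samples give a better approach to normality.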

If at any point you would like to obtain a new simulation window with default starting values, just close the simulation window and then open a new one with the Begin button.

Note that in this example the allele frequencies for each population (pi) can take on any value between zero and one, and are therefore a continuous variable. The estimate of allele frequency within each individual population is based on counting up alleles that can take only one of two forms, a and A, called a binomial variable.
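A quick simulation makes the distinction concrete: each per-population estimate is built from discrete binomial counts of allele copies, yet the resulting frequency lies anywhere between zero and one. The population size N = 10 and true frequency of 0.5 used here are assumptions chosen only for illustration.

```python
# Each per-population allele frequency estimate comes from discrete
# binomial counts of 2N allele copies (each copy is A with probability
# p), but the estimate itself is a frequency between zero and one.
import random

random.seed(3)
p_true, N = 0.5, 10  # assumed values: 2N = 20 allele copies per population

def estimate_freq():
    # Count how many of the 2N copies are the A allele.
    count_A = sum(random.random() < p_true for _ in range(2 * N))
    return count_A / (2 * N)

p_hats = [estimate_freq() for _ in range(8)]  # eight populations, as in Fig. A.2
print(p_hats)
```

Note that each estimate is necessarily a multiple of 1/(2N), a reminder that the underlying counts are discrete even though frequency itself is continuous.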

## Covariance and correlation

• Quantifying the relationship between two variables.

• Introduction to covariance and correlation.

• The role of covariance in regression analysis.

In population genetics, and in all of biology, there are many situations where our goal is to understand the relationship between two variables. To extend the jelly bean example used above, imagine that each jelly bean was weighed and its length was also measured. Each individual jelly bean then has two values for the two variables that describe it. One question we might ask is whether the length and mass of jelly beans are related, or whether jelly beans of any mass can exhibit any length. These questions about jelly bean mass and length can be answered by determining the covariance and correlation between the two variables.

The spread or scatter of two variables, call them x and y, viewed simultaneously is their joint distribution. Figure A.4 illustrates three different joint distributions of x and y values. The degree to which the values of two variables tend to vary in the same direction is measured by the covariance. The covariance is

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where $\bar{x}$ and $\bar{y}$ are the means of the two variables and the sum is taken over the $n$ pairs of observations.
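The covariance can be computed directly from its definition as the average product of paired deviations from the two means. The jelly bean masses and lengths below are invented for illustration.

```python
# Covariance from its definition: the sum of products of paired
# deviations from the two means, divided by n - 1. The mass (grams)
# and length (cm) values are made up for illustration.
def covariance(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

mass   = [1.1, 1.3, 0.9, 1.5, 1.2]
length = [2.0, 2.2, 1.8, 2.5, 2.1]
print(round(covariance(mass, length), 4))  # positive: heavier beans are longer
```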