Figure A.4 Examples of joint distributions between two variables, x and y. The covariance between x and y is zero in (a), positive in (b), and negative in (c). In all three panels the variance of x remains constant at 0.0912. Trends between the variables can be expressed as correlation coefficients (ρ) or as the slopes of the least-squares regression of y on x (b).


In the covariance (equation A.4), cov(x, y) = Σ(x_i - x̄)(y_i - ȳ)/n, the deviations of each observation from the mean of its variable are multiplied, the products are summed over all observations, and the sum is divided by the number of observations to give an average. Notice that the deviations from the mean are not squared as they are for the variance. This means that the covariance measures the direction of the deviations from the mean as well as their magnitude.
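
The covariance is simple to verify numerically. The sketch below is a minimal Python illustration of equation A.4; the jelly bean lengths and weights are invented for this example and do not come from the text or from Fig. A.4.

```python
# Minimal sketch of equation A.4: covariance as the average product of
# deviations from each variable's mean. The data below are invented.

def covariance(x, y):
    """Average cross-product of deviations from the means (divides by n)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

# Hypothetical lengths (cm) and weights (g) of five jelly beans
length = [2.0, 2.1, 1.9, 2.3, 1.8]
weight = [1.1, 1.2, 1.0, 1.3, 0.9]

print(covariance(length, weight))  # positive: longer beans tend to be heavier
```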

The joint distributions in Fig. A.4 illustrate a range of covariance values even though the variance of the x variable is constant. In Fig. A.4a, the x and y values have zero covariance and the two variables are therefore independent. When a covariance is zero then the scatter of each variable can be described by its variance alone without reference to the other variable. In both Figs A.4b and A.4c the values of x and y are not independent but rather tend to vary together. A positive covariance between x and y is shown in Fig. A.4b, telling us that higher values of one variable tend to be associated with higher values of the other variable. Figure A.4c shows negative covariance between x and y, telling us that higher values of one variable tend to be associated with lower values of the other variable and vice versa. Using the jelly bean analogy, a zero covariance says that length and weight are independent of each other, a positive covariance says that heavier jelly beans tend to be longer, and a negative covariance says that heavier jelly beans tend to be shorter.

The covariance forms the basis of the correlation coefficient, a summary measure of the strength and direction of the linear relationship between two variables. The Pearson product-moment correlation, symbolized by ρ (pronounced "rho"), is given by

ρ = cov(x, y)/(√var(x) √var(y))

where √var(x) is the same thing as the standard deviation of x. The correlation is a dimensionless quantity that takes on values between -1 and +1. A perfect positive linear relationship between x and y gives a correlation of +1, a perfect negative linear relationship gives a correlation of -1, and a correlation of zero indicates independence of x and y. The correlation coefficients are also given for the joint distributions in Fig. A.4. A number of other non-parametric correlation measures exist that are appropriate for data that are not normally distributed, such as Spearman's rank-order correlation.
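
As a quick numerical check on this formula, the Python sketch below computes ρ as the covariance divided by the product of the two standard deviations and compares the result with NumPy's built-in Pearson correlation. The simulated data are illustrative only and are not the values plotted in Fig. A.4.

```python
# Sketch: Pearson correlation as cov(x, y) / (sd(x) * sd(y)).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)  # y tends to increase with x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # equation A.4 (divides by n)
rho = cov_xy / (x.std() * y.std())                 # np.std also divides by n

print(rho)
print(np.corrcoef(x, y)[0, 1])  # NumPy's Pearson correlation, for comparison
```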

It is important to recognize that the correlation coefficient measures only the association between variables; it does not provide any information on how that association came into being. Unfortunately, correlations between variables are commonly misinterpreted as indicating a cause-and-effect relationship. A hypothetical example is a positive correlation between sales of lemonade and sales of baseballs. Consumption of lemonade does not cause people to also buy a baseball. Rather, both events are tied to the weather in a cause-and-effect relationship: in warm weather people drink more lemonade and also play baseball. A classic demonstration of the weakness of the correlation as a summary measure was made by Anscombe (1973), who concocted four data sets with identical means, standard deviations, and correlation coefficients. These four data sets illustrate how non-normality, non-linearity, and outliers in data can produce high correlations that are nevertheless very poor summaries of the relationship between two variables.

The covariance also plays a fundamental role in regression analysis, which is used to estimate the resemblance between parents and offspring as described in Chapter 9. Assume that y is a response or dependent variable and x is an independent variable. The intercept (a) and slope (b) of a regression line for the variables x and y are represented by the equation

y = a + bx

This equation can be rewritten using the average values of x and y:

ȳ = a + bx̄

The difference between any individual observation y and the average of y is then

y - ȳ = (a + bx) - (a + bx̄) = b(x - x̄)   (A.9)

Multiplying both sides of equation A.9 by (x - x̄) gives

(y - ȳ)(x - x̄) = b(x - x̄)²   (A.10)

Notice that the left-hand side looks a lot like the covariance in equation A.4 and the right-hand side looks a lot like the variance in equation A.2. If we sum the quantities on both sides of equation A.10 over all observations and then divide each sum by n to make it an average, A.10 becomes

cov(x, y) = b var(x)

Solving for the slope of the regression line gives

b = cov(x, y)/var(x)

Therefore, we see that the slope of a regression line is the covariance between x and y divided by the variance in x.
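
This relationship is easy to confirm numerically. The sketch below, which uses simulated data chosen purely for illustration, shows that the covariance-to-variance ratio matches the slope returned by NumPy's least-squares polynomial fit.

```python
# Sketch: the least-squares slope equals cov(x, y) / var(x).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 + 0.7 * x + rng.normal(scale=0.3, size=500)  # true slope is 0.7

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
slope_from_cov = cov_xy / np.var(x)

slope_lstsq, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit

print(slope_from_cov, slope_lstsq)  # the two estimates agree
```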

The regression line slopes for the joint distributions in Fig. A.4 can be computed from the covariances of x and y along with the variance in x. In all three panels of Fig. A.4, var(x) = 0.0912. In the top panel (a), cov(x, y) = 0 so that the slope of the regression line is also zero. In the middle panel (b), cov(x, y) = 0.0518 so that b = 0.0518/0.0912 = 0.568. In the bottom panel (c), cov(x, y) = -0.0410 so that b = -0.0410/0.0912 = -0.450.
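
The same arithmetic can be retraced in a few lines of Python, using only the covariance and variance values quoted above.

```python
# Reproducing the slopes quoted for Fig. A.4: b = cov(x, y) / var(x)
var_x = 0.0912
for panel, cov_xy in [("a", 0.0), ("b", 0.0518), ("c", -0.0410)]:
    print(panel, round(cov_xy / var_x, 3))
# prints: a 0.0, b 0.568, c -0.45
```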

Further reading

One approachable beginning statistics text is:

Freedman D, Pisani R, and Purves R. 1997. Statistics, 3rd edn. Norton, New York.

Two classic biologically oriented statistics texts are:

Sokal RR and Rohlf FJ. 1995. Biometry: the Principles and Practice of Statistics in Biological Research, 3rd edn. W. H. Freeman & Company, New York.

Zar JH. 1999. Biostatistical Analysis, 4th edn. Prentice Hall, Upper Saddle River, NJ.

For a humorous take on how basic graphs and statistics can lead to miscommunication and a book that will improve your own presentation of statistical information see:

Huff D. 1954. How to Lie with Statistics. W. W. Norton, New York.

Problem box answers

Problem box A.1 answer

Using the new values produces a mean of 0.5233, a variance of 0.0147, and a standard deviation of 0.1212. The SE of the sum is (√8)(0.1212) = 0.3428 and the standard error of the average is ((√8)(0.1212))/8 = 0.0429. Therefore, the 95% confidence interval for the mean is 0.5233 - (2 x 0.0429) to 0.5233 + (2 x 0.0429), or 0.4375 to 0.6091. Thus, we would expect 95 times out of 100 that the range of allele frequencies between 0.4375 and 0.6091 would include the actual allele frequency of the population, p. The confidence interval about p is much wider in this case because the variance is larger, which is caused by the observations being more spread out around the mean. The allele frequency parameter is therefore estimated with more uncertainty, since the underlying observations used for the estimate have a much greater range of values.
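
For readers who want to retrace this arithmetic, the short Python sketch below uses only the summary values quoted above (mean, standard deviation, and n = 8 observations).

```python
# Retracing Problem box A.1: standard error of the mean and the 95% CI.
import math

n = 8
mean = 0.5233
sd = 0.1212

se_mean = round(math.sqrt(n) * sd / n, 4)  # SE of the mean, 0.0429
ci_low = mean - 2 * se_mean
ci_high = mean + 2 * se_mean

print(se_mean, round(ci_low, 4), round(ci_high, 4))  # 0.0429 0.4375 0.6091
```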
