Figure 9.3 Tag SNP selection. For the Phase I and Phase II HapMap datasets, the number of tag SNPs required to capture genotyped common SNP diversity is shown for different population panels. r² is the square of the correlation coefficient between two SNPs, where r² = 1 indicates complete correlation (Fig. 2.2). Reprinted by permission from Macmillan Publishers Ltd: Nature (Altshuler et al. 2005), copyright 2005; Nature (Frazer et al. 2007), copyright 2007.
recombination map (Fig. 9.4). For some genomic regions, such as those around centromeres, the absence of recombination meant that haplotypes involving more than 100 SNPs were found across megabase regions of DNA. As anticipated, linkage disequilibrium was found to be high close to centromeres and low near telomeres; it correlated with chromosome length but also varied depending on gene density and function. Genes involved in immunity, for example, were associated with regions of low linkage disequilibrium, while strong linkage disequilibrium was associated with genes involved in cell cycling and other fundamental cellular processes.
The HapMap dataset was also a rich source of information about the possible action of natural selection on genetic diversity, with several 'signatures of selection' recognized (Section 10.2). These included SNPs showing extreme variation in frequency between populations, of which 926 were identified. There were also 19 regions showing evidence of a 'selective sweep', in which all diversity had been lost except for a particular allele that had risen to such high frequency that the original mutated allele became 'fixed', the only allele seen in the population. An example is LCT, encoding lactase, which had previously been shown to display such effects (Section 10.4) (Enattah et al. 2002; Bersaglieri et al. 2004). Finally, there were long haplotypes that were candidates for being subject to selection, which might be balancing or not sufficient to lead to fixation (Altshuler et al. 2005).
In 2007, Phase II HapMap was published, making genotyping data available on an additional 2.1 million SNPs for 270 individuals in the four population panels (CEU, YRI, CHB, and JPT; see Box 9.1) and taking the total number of genotypes available for each individual to more than 3.1 million (Frazer et al. 2007). The coverage of common variants was markedly increased. For example, on reanalysis of the resequenced ENCODE regions, the Phase II SNP set captured common variation to a very high degree (the square of the correlation coefficient, r², was 0.90 in YRI and 0.96 in CEU; in Phase I, r² was 0.67 for YRI). The number of tag SNPs required increased only two-fold even though the number of genotyped SNPs increased three-fold (see Fig. 9.3). This remarkable dataset of common SNP variation across the genome was estimated to have a SNP every 1.1 kb and to represent 25-35% of all common SNPs genome-wide based on the assembled genome sequence (Frazer et al. 2007).
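To make the idea of tag SNPs concrete, the following is a minimal Python sketch of how the squared correlation between two SNPs can be computed from phased haplotypes, and how a simple greedy procedure might then pick tag SNPs. It is an illustration only and not necessarily the procedure used by the HapMap Consortium. The function names, the 0/1 allele coding, the toy haplotype panel, and the threshold of 0.8 are all assumptions made for the example.

```python
from itertools import combinations


def r_squared(a, b):
    """Squared correlation between two biallelic SNPs.

    a and b are equal-length lists of alleles coded 0/1, one entry
    per phased haplotype (an assumed encoding for this sketch).
    """
    n = len(a)
    p_a = sum(a) / n                                 # frequency of allele 1 at SNP a
    p_b = sum(b) / n                                 # frequency of allele 1 at SNP b
    p_ab = sum(x * y for x, y in zip(a, b)) / n      # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                             # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom      # monomorphic SNPs give 0


def greedy_tags(haplotypes, threshold=0.8):
    """Pick tag SNPs so every SNP has r^2 >= threshold with some tag.

    haplotypes maps a SNP name to its list of 0/1 alleles.
    """
    names = list(haplotypes)
    # captures[s] holds every SNP (including s itself) that s can stand in for
    captures = {s: {s} for s in names}
    for s, t in combinations(names, 2):
        if r_squared(haplotypes[s], haplotypes[t]) >= threshold:
            captures[s].add(t)
            captures[t].add(s)

    untagged = set(names)
    tags = []
    while untagged:
        # choose the SNP that captures the most SNPs still untagged
        best = max(names, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Hypothetical toy panel: eight haplotypes typed at four SNPs
panel = {
    "snp1": [0, 0, 1, 1, 0, 1, 0, 1],
    "snp2": [0, 0, 1, 1, 0, 1, 0, 1],   # identical to snp1
    "snp3": [1, 1, 0, 0, 1, 0, 1, 0],   # mirror image of snp1
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],   # only weakly correlated with snp1
}

if __name__ == "__main__":
    print(greedy_tags(panel))
```

In this toy panel the first three SNPs are completely correlated with one another, so a single tag can represent all of them, while the fourth needs its own tag. This is the sense in which a larger genotyped SNP set need not require a proportionally larger set of tag SNPs.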
The SNP set provided increased resolution, notably in the YRI population of African ancestry, with finer scale haplotypic structure resolved (Fig. 9.5). Representation of rarer variants was better than in Phase I, as was