Box Phase I HapMap populations

For Phase I of the International HapMap Project, 270 individuals from four populations were recruited. Two groups of individuals of Asian ancestry (in geographic terms) were genotyped: 45 unrelated Han Chinese from Beijing, China, denoted the 'CHB' population, and 45 Japanese from Tokyo, Japan ('JPT'). The other two populations studied, of African and European ancestry, each comprised 30 parent-offspring trios (90 individuals in each): one cohort was from the Yoruba in Ibadan, Nigeria ('YRI') and the other from Utah, USA ('CEU'), the latter being part of the Centre d'Etude du Polymorphisme Humain (CEPH) collection of reference families established in 1984 and used for constructing genetic maps of the human genome (www.cephb.fr).

number that were polymorphic varied between populations, being highest in those of African ancestry (85% polymorphic in YRI) and lowest in those of Asian origin (75% in CHB/JPT); intermediate levels were found among those of European origin (79% in CEU). There were very few fixed differences, where alternate alleles were seen only in particular population panels: for example, 11 such differences were reported between the CEU and YRI populations. The dataset was remarkably accurate (99.7%) and complete (99.3%), with a common SNP (one with a minor allele frequency of greater than 5%) genotyped every 5 kb across most genomic regions. Analysis of parent-offspring trios showed that the statistical methods used for the reconstruction of haplotypes were very accurate, and high-quality, long-range haplotypes were established.

The study complemented the genotyping dataset by publishing the results of resequencing ten regions of 500 kb in 48 individuals (16 YRI, 16 CEU, 8 CHB, 8 JPT) to try to capture all sequence diversity (Altshuler et al. 2005). These ten regions formed part of the genomic regions analysed in the ENCODE (ENCyclopedia Of DNA Elements) Project, which aimed to define the relationship between DNA sequence and functional regulation (www.genome.gov/10005107) (TEPC 2004). A total of 17 944 single nucleotide variants were identified over 5 Mb of sequence, equating to one every 279 bp: most were rare, with 45% having a minor allele frequency of less than 5%, and 9% of variants were found in only a single individual. However, the dataset confirmed that common SNPs (having a minor allele frequency above 5%) comprised 90% of heterozygous sites and that the strategy of analysing a limited set of informative common SNPs would capture the majority of diversity. More low-frequency variants were seen in the individuals of African ancestry (YRI), consistent with earlier reports and the concept of population (genetic) bottlenecks in the non-African populations (Section 8.5.1). The dataset also allowed estimation of recombination rates and hotspots, with one hotspot identified on average per 57 kb and 80% of all recombination estimated to have occurred in 15% of the sequence.
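
The arithmetic behind these figures can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the allele lists are invented toy data rather than HapMap genotypes, and treating a minor allele frequency of exactly 5% as 'common' is an assumption about the boundary, which the text does not specify.

```python
from collections import Counter


def mean_spacing_bp(region_bp, n_variants):
    """Average distance between variants across a resequenced region."""
    return region_bp / n_variants


def minor_allele_frequency(alleles):
    """Frequency of the less common allele at a biallelic site.

    `alleles` holds one allele per chromosome (two per diploid individual).
    """
    counts = Counter(alleles)
    if len(counts) == 1:
        return 0.0  # monomorphic site: no minor allele observed
    if len(counts) != 2:
        raise ValueError("expected a biallelic site")
    return min(counts.values()) / len(alleles)


def classify(maf, threshold=0.05):
    """Label a site; the 5% cut-off follows the text, the >= is an assumption."""
    if maf == 0.0:
        return "monomorphic"
    return "common" if maf >= threshold else "rare"


# 17 944 variants over 5 Mb of sequence: roughly one every 279 bp.
spacing = mean_spacing_bp(5_000_000, 17_944)

# Toy site seen once among 96 chromosomes (48 diploid individuals).
singleton = ["A"] * 95 + ["G"]
singleton_maf = minor_allele_frequency(singleton)  # 1/96, about 1%

# Toy site where the minor allele is carried by 36 of 96 chromosomes.
frequent = ["A"] * 60 + ["G"] * 36
frequent_maf = minor_allele_frequency(frequent)  # 0.375

print(round(spacing), classify(singleton_maf), classify(frequent_maf))
```

With 48 individuals the lowest observable non-zero frequency is 1/96, so a variant carried on a single chromosome necessarily falls in the rare class.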

The resequencing data from the ENCODE regions highlighted the haplotype 'block' structure of linkage disequilibrium (Fig. 9.2), with most sequence lying in blocks of four or more sequence variants. A typical block encompassed many such variants (on average 30-70), with a mean of 4.0 (Asian CHB + JPT) to 5.6 (African YRI) common haplotypes per block (Altshuler et al. 2005). These data validated the strategic approach and utility of the much sparser genome-wide dataset of common SNPs genotyped in Phase I HapMap. In terms of tag SNP selection, when working with the ENCODE set of variants, SNP density could be reduced by 75-90%. If common SNPs were selected progressively until the tag SNP set encompassed SNPs highly correlated with all common SNPs, it was possible to reduce the density of genotyping to one SNP every 2 kb (in YRI) or one every 5 kb for the other populations studied. For the full Phase I HapMap dataset of common SNPs, the number of tag SNPs required to capture common SNP diversity could be reduced by between one-third (Asian panel) and one-half (African population panel) (Fig. 9.3).
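
The idea of selecting SNPs progressively until every common SNP is highly correlated with a chosen tag can be sketched as a greedy pairwise procedure. The Python below is a minimal illustration, not the procedure used in the study: the function names, the SNP names, the toy phased haplotypes and the r² threshold of 0.8 are all assumptions made for the example.

```python
def r_squared(a, b):
    """Pairwise r^2 between two biallelic SNPs.

    `a` and `b` list the allele (0 or 1) carried at each SNP by the same
    set of phased haplotypes, in the same order.
    """
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x & y for x, y in zip(a, b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom


def greedy_tag_snps(snps, threshold=0.8):
    """Pick tags until every SNP has r^2 >= threshold with at least one tag."""
    names = list(snps)
    # SNPs that each candidate would capture; a SNP always captures itself.
    captures = {
        s: {t for t in names if t == s or r_squared(snps[s], snps[t]) >= threshold}
        for s in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        # Choose the SNP that captures the most SNPs still lacking a tag.
        best = max(names, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Hypothetical data: five SNPs typed on eight phased haplotypes.
toy_snps = {
    "snp1": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp2": [0, 0, 0, 0, 1, 1, 1, 1],  # identical to snp1
    "snp3": [1, 1, 1, 1, 0, 0, 0, 0],  # mirror image of snp1, so r^2 = 1
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],  # uncorrelated with snp1
    "snp5": [1, 0, 1, 0, 1, 0, 1, 0],  # mirror image of snp4
}

tags = greedy_tag_snps(toy_snps)
print(tags)  # ['snp1', 'snp4']
```

In this toy example two tags stand in for five SNPs. Raising the threshold, or applying the same procedure where linkage disequilibrium is weaker, leaves more SNPs uncaptured by any neighbour and so increases the number of tags needed.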

The extent of long range haplotypes varied, and the dataset allowed comparison of unique haplotypes to a

[Figure 9.2 panels: ENCODE region 2q37.1 and ENCODE region 7q31.33; panel labels: HapMap, ENCODE]

Figure 9.2 Haplotype block structure revealed in the analysis of linkage disequilibrium and recombination for two ENCODE regions. D' plots are shown for two ENCODE regions according to the geographic ancestry of individuals; D' provides a measure of linkage disequilibrium (Box 2.8), where white indicates D' < 1 with a lod score of < 2 and the darkest shading indicates D' = 1 with a lod score of > 2. Recombination hotspots are indicated by inverted triangles in the lower portion of the figure, where estimated recombination rates are shown. Reprinted by permission from Macmillan Publishers Ltd: Nature (Altshuler et al. 2005), copyright 2005.
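
For readers who want to see how a D' value of the kind plotted here is obtained, the following Python sketch computes the absolute value of D' for a pair of biallelic sites using the standard definition (D divided by its maximum possible magnitude given the allele frequencies). The function name and the example frequencies are assumptions chosen for illustration, and the lod score used for the shading is not computed.

```python
def d_prime(p_ab, p_a, p_b):
    """Absolute D' for two biallelic sites.

    p_ab: frequency of haplotypes carrying allele A at site 1 and B at site 2
    p_a:  frequency of allele A at site 1
    p_b:  frequency of allele B at site 2
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return 0.0 if d_max == 0 else abs(d) / d_max


# Every chromosome carrying B also carries A, so only three of the four
# possible haplotypes are observed and D' reaches its maximum of 1.
complete_ld = d_prime(p_ab=0.25, p_a=0.5, p_b=0.25)

# Alleles combine at random (p_ab equals p_a * p_b), so D' is 0.
no_ld = d_prime(p_ab=0.125, p_a=0.5, p_b=0.25)

print(complete_ld, no_ld)
```

A value of 1 means that at least one of the four possible two-site haplotypes is absent from the sample, which is the pattern expected when no recombination has occurred between the two sites.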


[Figure panel: Phase I HapMap (1 million SNPs); axis label: r² threshold]
