postulated mutation-driven convergence or conversion events between lineages. This low level of polymorphism extended into the region of exon 2 encoding the β sheet, found at the floor of the antigen binding groove. In contrast, the portion of exon 2 encoding the α helix, found at the ridge of the antigen binding groove, was remarkably polymorphic. Here the pattern was consistent with localized gene conversion at a microscopic level, with polymorphism within allelic lineages and nucleotide substitutions shared between lineages (Fig. 12.5). They concluded that recent shuffling of existing alleles by gene conversion generated this highly localized diversity. The low level of intronic diversity was also postulated as being consistent with the single origin hypothesis of human populations with an effective breeding population size of 10 000 over the past million years (estimated based on the very low diversity at non-HLA loci of 0.1%) (Li and Sadler 1991) rather than estimates of 100 000 based on the persistence of HLA allelic lineages over tens of millions of years (Takahata 1993).
More recent analysis based on more extensive sequencing at the DRB1 locus, including full length alleles together with comparisons between primate species of different lineages, suggests a recent origin for alleles (Bontrop 2006; von Salome et al. 2007). The DRB1*03 lineage is found in humans, chimpanzees, bonobos, gorillas, and orangutans, and an ancestral 'proto 03' lineage is postulated. If only the exon 2 sequences are used to construct the phylogenetic tree, all lineages predate the separation of humans and chimpanzees an estimated 5 million years ago. Many lineages are, however, unique to humans (such as *8, *11, *13, *14, *15, and *16) and analysis using the dataset excluding exon 2 is consistent with this. Some of the motifs subject to gene conversion shuffling within the antigen recognition sequences may be very ancient, but it is believed that the alleles are being generated relatively rapidly, with the average age of within lineage diversity estimated at less than 1 million years (von Salome et al. 2007).
The reference DNA sequence for the MHC, like the rest of the human genome, comprised a composite of many haplotypes with DNA fragments derived from different individuals. In 2004 the first results of the MHC Haplotype Project (http://www.sanger.ac.uk/HGP/Chr6/MHC/) were published (Stewart et al. 2004a), providing a resource unique to the MHC whereby a DNA sequence would be available comprising a single contiguous haplotype from one individual. This 'homozygous' DNA was available from consanguineous cell lines each of which comprised a single MHC haplotype. The MHC Haplotype Project aimed to sequence eight different MHC haplotypes using bacterial artificial chromosome (BAC) library clones derived from eight cell lines from the Tenth International Histocompatibility Workshop (Allcock et al. 2002). The cell lines were selected as representing common MHC haplotypes found in north European populations that were known to be associated with susceptibility to disease (Box 12.6). The hope was that this work would help identify the specific genetic variants within the extended haplotypes which were responsible for disease susceptibility, as well as address broader questions about the nature and frequency of genetic diversity, the ancestral origins and relationships of haplotypes, and their structure.
What has the MHC Haplotype Project told us? We now have a contiguous sequence comprising a single haplotype to use as a reference sequence (the DNA sequence of the PGF cell line). The MHC haplotypes are the first long range single haplotypes sequenced in the human genome. Over 44 000 variants were identified by comparing the three haplotypes (Horton et al. 2008). Comparing PGF and QBL, for example, there were 17 695 sequence differences of which 15 345 were single nucleotide substitutions and the remainder insertion/deletion events (indels) (Traherne et al. 2006b). The single nucleotide substitutions were most frequent at classical class I and II loci as expected, with the ratio of nonsynonymous to synonymous substitutions 3 : 1 at these regions (HLA-A, -B, -C, -DR, -DQ) compared to non-classical loci where the ratio was 1 : 1. This is consistent with positive selection acting on genes in the classical loci resulting in the higher frequency of polymorphism altering the amino acid structure of the encoded proteins.
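The PGF versus QBL counts quoted above can be broken down with a few lines of arithmetic. The following is a minimal sketch in Python; the function name `summarize_variants` is a hypothetical helper introduced here for illustration, and only the two input counts come from the text.

```python
def summarize_variants(total_differences, single_nucleotide_substitutions):
    """Split a pairwise haplotype comparison into substitutions and indels.

    Returns the number of indels (the remainder) and the fraction of all
    differences that are single nucleotide substitutions.
    """
    if single_nucleotide_substitutions > total_differences:
        raise ValueError("substitutions cannot exceed total differences")
    indels = total_differences - single_nucleotide_substitutions
    snp_fraction = single_nucleotide_substitutions / total_differences
    return indels, snp_fraction


# Counts reported for the PGF versus QBL comparison (Traherne et al. 2006b).
indels, snp_fraction = summarize_variants(17695, 15345)
print(indels)                  # 2350
print(round(snp_fraction, 3))  # 0.867
```

On these figures, roughly 87% of the differences between the two haplotypes are single nucleotide substitutions and the remaining 2350 are indels.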
The remaining genetic variants were almost all small (<96 bp) indels but 34 of these were large (96-5157 bp), including 15 Alu element insertions (Section 8.4). Most of these insertions were 'young', including five members of the Ya5 and Yb8 families which are found almost exclusively in humans rather than being common to the great apes or other species (Carroll et al. 2001). Ancient variants were also found, including repeats of the HERV (human endogenous retrovirus) sequences, LINEs (long interspersed nuclear elements), long tandem repeats, and MERs (mammalian interspersed repetitive elements).
The three haplotypes differ, for example, at the RCCX locus (Section 12.7): PGF had two copies of the C4 gene (C4A and C4B), both of the 'long' type as they contained an HERVC4 insertion in intron 9; COX had only a single copy of C4, C4B of the short type; and QBL had a single C4A of the long type. Specific variants were also discovered at disease associated loci. For example, at PSORS1C1 (a candidate locus for psoriasis, a skin disorder) in the QBL haplotype, a deletion of 1 nt in the poly(C) tract of exon 5 led to a frameshift in the spliced transcript, a premature stop codon, and shortening of the coding sequence by 266 amino acids.
The first two cell lines to be sequenced, PGF and COX (Stewart et al. 2004a), bore two common MHC haplotypes associated with autoimmune disease susceptibility. HLA-A3-B7-Cw7-DR15, the '7.1' haplotype found in the PGF cell line, is present in 10% of north Europeans and has been associated with protection from type 1 diabetes (relative risk 0.05) and predisposition to multiple sclerosis and systemic lupus erythematosus (relative risks 2-4) (Barcellos et al. 2003; Larsen and Alper 2004), while the '8.1' haplotype HLA-A1-B8-Cw7-DR3 was found in the COX cell line. In 2006 the sequence for a third haplotype was published (Traherne et al. 2006b). This haplotype, found in the QBL cell line (HLA-A26-B18-Cw5-DR3-DQ2; the 18.2 ancestral haplotype), is associated with susceptibility to Graves' disease and type 1 diabetes (Johansson et al. 2003). The MHC Haplotype Project was completed by 2008 with the publication of sequence data and analysis of all eight selected haplotypes (Horton et al. 2008).
The HLA-DRB locus is known to be extremely polymorphic and the highest diversity was observed between PGF and the COX/QBL haplotypes. COX and QBL both have the DR3 haplotype and the detailed analysis provided by the MHC Haplotype Project resolved a 158 kb region with almost no variation, dubbed a 'SNP desert' consistent with a relatively recent ancestry and allowing resolution of the original recombination event (Traherne et al. 2006).
In 2006 the MHC Sequencing Consortium published the most extensive map to date of the patterns of linkage disequilibrium across the MHC (de Bakker et al. 2006). The group performed extensive typing of HLA genes and of over 7500 polymorphisms (SNPs together with indels) within the 7.5 Mb region of the extended MHC. Four different ethnic groups were studied, African, European, Chinese, and Japanese, with a total of 361 individuals. This allowed the recombination rates and hotspots to be estimated, as well as the inferred haplotypes (Fig. 12.6). Overall the small scale haplotypic block structure in the MHC is comparable to elsewhere in the genome but the extent of linkage disequilibrium between blocks is higher, resulting in the observed extended or ancestral haplotypes that are found to be particularly common among individuals of north European origin. Consistent with studies elsewhere in the genome, the linkage disequilibrium was found to be lower in the African cohort with typically shorter haplotypes observed. The recombination rate overall was lower than for other genomic regions (0.44 cM/Mb versus genome-wide average of 1.2 cM/Mb).
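Linkage disequilibrium between two biallelic markers is commonly summarized by the statistics D, D' and r². The following is a minimal sketch in Python of how these can be computed from phased haplotypes. It is not the method used by de Bakker and colleagues; the function name `ld_stats` and the toy haplotypes are assumptions made for illustration.

```python
def ld_stats(haplotypes):
    """Compute D, D' and r^2 for two biallelic markers.

    `haplotypes` is a list of (a, b) tuples, one per phased chromosome,
    with each allele coded 0 or 1 (hypothetical input format).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at marker A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at marker B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    d = p_ab - p_a * p_b                             # raw disequilibrium coefficient
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both markers must be polymorphic")
    r2 = d * d / denom

    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime, r2


# Toy data: alleles at the two markers always travel together.
linked = [(1, 1)] * 4 + [(0, 0)] * 4
# Toy data: all four haplotypes equally frequent, so no association.
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_stats(linked))    # (0.25, 1.0, 1.0)
print(ld_stats(unlinked))  # (0.0, 0.0, 0.0)
```

Pairs of markers with r² near 1 carry almost the same information, which is what allows extended haplotypes to be recognized from a subset of markers.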
Why is such a map important? For those interested in using genetic and HLA markers to resolve disease susceptibility, a dense map such as this provides the possibility of using specific 'tag SNPs' which potentially significantly reduce the number of markers required. As discussed in Section 9.2, this is because the tag SNPs are selected based on their ability to be informative about the underlying allelic architecture: a small number of SNPs may be sufficient to define the common haplotypes for a given gene or region of DNA. How well do they work? In the paper by de Bakker and colleagues, the selected tag SNPs showed high sensitivity and specificity as predictors for HLA alleles in two disease cohort studies (de Bakker et al. 2006). Knowledge of patterns of linkage disequilibrium and haplotypic structure also provides a route map for future studies of evolutionary dynamics and understanding the ancestral origin of MHC polymorphism as well as fine mapping disease associations.
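The idea of tag SNPs can be illustrated with a simple greedy selection: repeatedly pick the SNP that is in strong linkage disequilibrium with the largest number of SNPs not yet covered. The following Python sketch assumes pairwise r² values are already known and uses an r² threshold of 0.8, a common convention rather than a value taken from the text. The function name `select_tag_snps`, the marker names, and the r² values are all hypothetical, and this is not necessarily the procedure used by de Bakker and colleagues.

```python
def select_tag_snps(snps, r2, threshold=0.8):
    """Greedy tag SNP selection.

    `snps` is a list of marker names; `r2` maps frozenset({snp_x, snp_y})
    to the pairwise r^2 value (missing pairs are treated as r^2 = 0).
    Returns a list of tags such that every SNP is either a tag or has
    r^2 >= threshold with at least one tag.
    """
    def tagged_by(snp):
        # A SNP always tags itself, plus any SNP in strong LD with it.
        return {
            other for other in snps
            if other == snp or r2.get(frozenset((snp, other)), 0.0) >= threshold
        }

    untagged = set(snps)
    tags = []
    while untagged:
        # Choose the SNP covering the most still-untagged markers.
        best = max(snps, key=lambda s: len(tagged_by(s) & untagged))
        tags.append(best)
        untagged -= tagged_by(best)
    return tags


# Hypothetical markers and r^2 values, for illustration only.
snps = ["snp1", "snp2", "snp3", "snp4"]
r2 = {
    frozenset(("snp1", "snp2")): 0.95,
    frozenset(("snp2", "snp3")): 0.90,
    frozenset(("snp3", "snp4")): 0.20,
}
print(select_tag_snps(snps, r2))  # ['snp2', 'snp4']
```

In this toy example two tags stand in for four markers: snp2 captures snp1 and snp3, while snp4 is not in strong linkage disequilibrium with any other marker and must be typed directly.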