(2005) for mosquito Msx inclusion); the cross-hatched arrow is the EHGbox cluster (Pollard and Holland 2000), and the grey arrows are the different versions of the 'Extended Hox' subclass: Mox + Evx (Pollard and Holland 2000), Mox + Evx + Gbx + Mnx + En (Castro and Holland 2003), and Mox + Evx + Gbx + Mnx + En + Rough + Dlx (proposed here on the basis of established gene linkages rather than poorly resolved phylogenetic trees, as described in the text, which now becomes the 'Hox-linked' subclass). Lineage-specific genes, such as Nanog (Booth and Holland 2004) and bicoid, are omitted for simplicity. The name MSXLX has been suggested for the CG15696 family based on weak phylogenetic support for a sister group relationship with the Msx family in a Bayesian tree (Ryan et al. 2006). This level of support is low, and so, following the reasoning presented here, the MSXLX name is not at present adopted for the CG15696 family. The subclasses of the ANTP-class that are identified here are thus (1) Hox + ParaHox, (2) Hox-linked, (3) NK cluster, including NK4/tin, NK3/bap, Tlx, Lbx, NK1/slou and Msx, with NK5/Hmx potentially joining following closer examination (Garcia-Fernandez 2005), and (4) NK-linked. There will be one or two families of the ANTP-class that fall outside these subclass definitions that will be clarified by analysis of further genomes. Such families would then also be categorised as subclasses.
ANTP-class that most confusion and discrepancy exists, and so I shall concentrate on this class and attempt to resolve at least some of the confusion. The classification of the groups of homeobox genes that I will use here broadly follows those of Wada et al. (2003), Edvardsen et al. (2005), Booth and Holland (2007) and Ryan et al. (2006), with a group of orthologous genes being a family, e.g. the Msx family, and the major distinctive groups of animal homeobox genes being classes, e.g. PRD class, and then the term 'subclass' being reserved for an intermediate grouping of several families within a class, e.g. the NKL subclass.
This classification of homeobox genes, which is based upon molecular phylogenies and sequence similarities, is very robust at the family level for the majority of gene families (e.g. Hox1/lab, Dlx or Evx), across the animal kingdom (Gauchat et al. 2000, Burglin 2005, Holland and Takahashi 2005). Orthologous genes can usually be recognised even between cnidarians and humans or flies, despite their lineages having diverged over 550 million years ago. There are of course exceptions due to lineage-specific mutation rate elevations, and lineage-specific duplications or losses, which I will come to later. On the whole, however, each homeobox family is united by high support values in phylogenetic trees (e.g. Kamm and Schierwater 2006, Ryan et al. 2006). The families of the ANTP-class and their members from humans and flies, with a few other selected representatives, are given in Table 10.1.
With this clarity of the family-level phylogeny, and the increasing availability of whole genome sequences from around the animal kingdom, our understanding of complete homeobox complements from several animals, and hence lineage-specific gene gains and losses, is improving. Importantly, the possibility of having every homeobox gene from a particular animal means that the uncertainty of whether a particular orthologue has been missed or not is largely eliminated, provided the genome sequencing has been done to sufficient coverage, the assembly is sound, and enough care has been taken with thoroughly searching for all homeobox genes so that problems with inadequate computational gene prediction algorithms are overcome. This last point is clearly illustrated by the different analyses of one of the most carefully sequenced and analysed genomes of all, that of humans. The initial predictions of numbers of homeobox genes were 160 and 267 (Venter et al. 2001, Lander et al. 2001, respectively). Both are wrong, and careful searching and analysis actually reveals 235 homeobox-containing genes in the euchromatin of the human genome, of which 100 are in the ANTP-class (Anne Booth, Peter Holland and Elspeth Bruford, personal communication).
problems, and some solutions
Armed with these large, comprehensive datasets we are now in a position to agree on the nomenclature of homeobox families on a sound, phylogenetically driven basis. This should overcome the inevitable problems generated by orthologous genes, or even the same gene, being given a multitude of different names because of different laboratory traditions or biases (Table 10.1). What cannot be overcome is the problem of different names caused by taxon-specific conventions, e.g. mutant-based names in Drosophila, three-letter and number conventions in nematodes, and acronyms in mammals. But these different taxon conventions can easily be accommodated once a gene from any particular animal is given just a single recognised name.
Ideally this nomenclature convention should be extended throughout a phylum. A convention that is rapidly gaining acceptance, at least in animals for which an alternative convention has not already been established, is naming genes with three letters that denote the species (the first capital letter being the first letter of the genus name, and the following two lower-case letters being the first two letters of the species name) followed by the gene name, which is deduced from phylogenetic analysis and sequence similarity (e.g. de Rosa et al. 1999). The Drosophila melanogaster labial gene thus becomes Dmelab whilst the Tribolium castaneum labial gene is Tcalab.
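The naming convention just described can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text; the function names `species_code` and `gene_name` are assumptions made for this sketch and do not come from any published tool.

```python
def species_code(binomial: str) -> str:
    """Three-letter species code: capitalised first letter of the genus
    followed by the first two letters of the species name, in lower case."""
    genus, species = binomial.split()[:2]
    return genus[0].upper() + species[:2].lower()


def gene_name(binomial: str, family: str) -> str:
    """Prefix a gene (family) name with the species code."""
    return species_code(binomial) + family


print(gene_name("Drosophila melanogaster", "lab"))  # Dmelab
print(gene_name("Tribolium castaneum", "lab"))      # Tcalab
```

The gene name itself ("lab" here) still has to be supplied by phylogenetic analysis and sequence similarity; the sketch covers only the mechanical prefixing step.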
Difficulty arises when the gene phylogeny is not well resolved, as can occur when genes are isolated from a new taxon that is phylogenetically divergent from previously sampled animals, or the newly sampled taxon is 'long-branch' with rapidly evolving sequences. This problem is prevalent in the nematode Caenorhabditis elegans, and is illustrated by the changing views on the affinity of the C. elegans Hox genes. The gene egl-5 was originally designated as the Posterior Hox gene of C. elegans, and hence orthologous to the AbdB gene of flies and the Hox9+ genes of chordates (Burglin 1994). The classification of this gene, based on phylogenies, was never particularly robust, but was accepted because egl-5 was the closest thing to a Posterior Hox gene known from nematodes at that time. Two further Hox genes were subsequently found during the whole genome sequencing of C. elegans, php-3 and nob-1, which had greater similarity to Posterior Hox genes (Van Auken et al. 2000). Now views are appearing that have displaced egl-5 from its designation as a Posterior Hox gene, and pushed it towards being a Central Hox gene. However, such 'shoe-horning' of the gene is not done with great conviction, owing to the poorly resolved phylogenies of the genes (de Rosa et al. 1999, Bürglin 2005).
To avoid gene-naming difficulties such as two species with the same three-letter abbreviation, the homeobox community could adopt some form of hierarchical system, with precedence being given to the first species code used. Also, when phylogenies are poorly resolved, or interpreted in different ways by different authors, the community would need to adopt a set of 'rules' in which a particular level of phylogenetic resolution (with agreed forms of phylogeny building) is taken as warranting classification as a family member unless further lines of evidence (see below) justify otherwise. Incidentally, these regions of phylogenetic ambiguity are often where some of the most interesting biology exists, which we can more easily focus on if clearer gene nomenclature reduces ambiguities and confusion.
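One way to picture the precedence idea is a simple registry in which the first species to use a code keeps it. The sketch below is hypothetical: the class `CodeRegistry` and its `claim` method are invented for illustration, and the text does not specify how a later species with a clashing code should be renamed, so the sketch only reports the clash.

```python
def species_code(binomial):
    """Three-letter code: first letter of genus + first two of species."""
    genus, species = binomial.split()[:2]
    return genus[0].upper() + species[:2].lower()


class CodeRegistry:
    """Hypothetical registry: the first species to use a code keeps it."""

    def __init__(self):
        self._first_user = {}  # code -> species that claimed it first

    def claim(self, binomial):
        """Return (code, has_precedence) for the given species."""
        code = species_code(binomial)
        # setdefault stores the species only if the code is still unclaimed
        holder = self._first_user.setdefault(code, binomial)
        return code, holder == binomial


registry = CodeRegistry()
print(registry.claim("Drosophila melanogaster"))  # ('Dme', True)
print(registry.claim("Drosophila mercatorum"))    # ('Dme', False)
```

A `False` flag marks a species whose natural code is already taken; what alternative code it then receives would be a matter for the community rules discussed above.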
A further gene classification problem can arise when there have been lineage-specific duplications, divergences and even gene losses. Further taxon sampling can often help to resolve such classification problems, effectively breaking the long branches in the molecular phylogenies of difficult-to-classify genes. This has proven to be the case for the zerknüllt (zen) and bicoid (bcd) genes of flies. These genes are in the Hox cluster of drosophilids between the group 2 Hox gene, proboscipedia, and the group 4 Hox gene, Deformed. On the basis of their genomic location they were thought to have probably been derived from a Hox3 gene. However, in phylogenetic trees the zen and bcd genes did not group robustly with the Hox3 genes available at the time. Only once other arthropod taxa were sampled, such as beetles and grasshoppers, and more basal lineages of flies than Drosophila, did stronger phylogenetic signals start to appear (Falciani et al. 1996, Stauber et al. 1999). The conclusion that insect zen and bcd genes are derived from Hox3, zen by sequence divergence and bcd via subsequent duplication and further divergence, is further supported when combined with the observations that the expression of the arthropod Hox3 genes can be seen to have evolved from a typical Hox-like expression (restricted along the anterior-posterior axis) into the derived role of the zen genes in extra-embryonic membrane and dorsal-ventral patterning, followed by the origin of bcd and its role as a maternal morphogen in fly anterior-posterior axis patterning (Damen and Tautz 1998, Telford and Thomas 1998, Stauber et al. 1999).
The zen/bcd example illustrates another important source of information that can be used to supplement phylogenetic information when attempting to classify homeobox genes, namely the genomic location. Regions of synteny can be analysed, and are particularly useful if the syntenic regions do not merely contain orthologous gene neighbours, but also conserve these genes in the same order along the chromosome. In the case of the zen/bcd example these genes are in the location of the Hox3 gene, in between the insect versions of Hox2 and Hox4. The resolution of homeobox evolutionary patterns can also be aided by synteny analysis of neighbouring genes other than homeobox genes. Examples of this have been the resolution of vertebrate ParaHox gene clusters evolving by whole cluster duplication, followed by some gene losses (Brooke et al. 1998, Coulier et al. 2000, Ferrier et al. 2005, Mulley et al. 2006), or the cryptic orthology of the TPRX1 and TPRX2P genes of humans to the Obox genes of mice (Booth and Holland 2007). So, although homeobox phylogenies are immensely valuable in understanding gene orthologies, they sometimes need to be supplemented. Useful supplementary sources of information are the conservation of domains outside the homeodomain (Burglin 2005), further taxon sampling, genomic location and expression data.
There are one or two other families from the list in Table 10.1 that also suffer from some ambiguities at present, which are confounded by poorly resolved phylogenies and inconsistent or confusing nomenclature. The NK2 and NK4/tin genes are one important example, which can clearly be resolved once gene expression and genomic location are also taken into account. Unfortunately the names of the NK4/tin genes of humans are NKX2-3, NKX2-5 and NKX2-6, whilst the NK2 genes proper are NKX2-2, NKX2-4 and NKX2-8 (along with TITF1-2 which has bizarrely escaped the NKX2-X nomenclature theme). These NK4/tin and NK2 genes do indeed group together on phylogenies, suggesting that the two groups are sister groups, or a single multi-gene family if phylogeny alone is used (e.g. Ryan et al. 2006). However, the genomic location of the NK4/tin genes in the NK cluster of flies and mosquitoes, and their tight linkage with the NK3/bap orthologues in chordates as well (Luke et al. 2003), together with their conserved role in 'heart' development (Harvey 1996), indicate that the NK4/tin and NK2 groups are distinct families. Following the logic that the genomic location is of prime importance in the classification of these two particular groups of genes, the genomic position of a sea anemone (Nematostella vectensis) NK4/NK2-like gene next to an NK3/bap gene permitted its identification as the Nematostella NK4/tin orthologue (Chourrout et al. 2006).
Hox-like and NK-like subclasses: are the names misleading?
Despite the lack of phylogenetic resolution between the families of the ANTP-class, the families are usually divided into the Hox-like subclass and NK-like (or NKL) subclass (Burglin 2005, Holland and Takahashi 2005). At the broad scale this can be useful, but there is a serious problem with regard to reconstructing the relationships between these different homeobox families. The support values at the nodes that might define the divergence patterns between different homeobox families are almost always very low, and so these nodes will usually be collapsed if standard molecular phylogenetic 'rules' are applied (such as collapsing or ignoring Neighbour-Joining nodes with less than 70% bootstrap support). When drawing homeobox trees, however, most workers tend to be more liberal, and avoid collapsing poorly supported nodes. Whilst this may be said to be invalid on strict molecular phylogenetic grounds, there is nevertheless some justification for taking this liberal approach when genomic organisation of the genes is also considered. For example, Mox is a 'Hox-like' gene that also is relatively closely linked to the Hox cluster genes in chordate genomes (Minguillon and Garcia-Fernandez 2003).
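The effect of the 'standard rule' mentioned above (collapsing nodes with less than 70% bootstrap support) can be seen in a small sketch. The `Node` class, the `collapse` function, the family labels and the support values are all invented for illustration; this is a toy tree, not a real homeobox phylogeny or the method of any cited study.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""                    # leaf label; empty for internal nodes
    support: Optional[float] = None   # bootstrap support (%) of internal nodes
    children: List["Node"] = field(default_factory=list)


def collapse(node: Node, threshold: float = 70.0) -> Node:
    """Return a copy of the tree in which every internal node whose support
    is below the threshold is dissolved into its parent (a polytomy)."""
    kept = []
    for child in node.children:
        c = collapse(child, threshold)
        weak = bool(c.children) and c.support is not None and c.support < threshold
        if weak:
            kept.extend(c.children)   # hoist the grandchildren up one level
        else:
            kept.append(c)
    return Node(node.name, node.support, kept)


# Toy tree with hypothetical support values of 45% and 95%.
tree = Node(children=[
    Node(support=45, children=[Node("famA"), Node("famB")]),
    Node(support=95, children=[Node("famC"), Node("famD")]),
])
strict = collapse(tree)
print([c.name or "clade" for c in strict.children])  # ['famA', 'famB', 'clade']
```

Under the strict rule the weakly supported grouping of famA and famB disappears, leaving them unresolved at the root, whereas the 'liberal' approach described in the text would keep the grouping as drawn.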
Other 'Hox-like' genes have been designated by several authors as Evx, Mnx, Gbx and possibly En, and have been distinguished as the 'Extended Hox' genes by Pollard and Holland (2000) (who distinguish En, Gbx and Mnx as the EHGbox genes), and Castro and Holland (2003) (who dispense with the EHGbox classification) (Table 10.1), although Burglin (2005) instead classifies Gbx and Mnx as NKL genes. However, the phylogenetic support for such a distinction is weak, or even absent: Evx genes sometimes relocate away from the Hox cluster genes into the so-called NKL genes in phylogenetic trees (Gauchat et al. 2000, Kamm and Schierwater 2006), and En remains so ambiguously placed in the phylogenetic trees that Burglin (2005) groups it with neither the 'Hox-like' genes nor the NKL genes. It is the genomic locations of these genes (Evx, Mox, En, Mnx and Gbx) that have warranted their classification as Hox-like or Extended-Hox genes. This logic, however, creates a problem when we consider the Mega-homeobox cluster hypothesis.
The ancestral Mega-homeobox cluster: fact or fiction, and how will we know?
One of the most pervasive and appealing ideas with regard to animal homeobox evolution is the hypothesis that ancestrally the Hox-like and NKL genes were clustered in a Mega-homeobox cluster, which has subsequently been largely dispersed by inversions and translocations in extant lineages (Pollard and Holland 2000; reviewed in Garcia-Fernandez 2005, and summarised in Figure 10.1). This hypothesis arises from the important deduction that these homeobox genes predominantly originate by tandem duplications. Whilst non-tandem forms of gene duplication clearly are possible, they are much rarer than tandem duplications. Such evolution by tandem duplication has led to genes which group together in phylogenetic trees, such as the ANTP-class, tending to be genomically linked. This linkage does not have to be tight any more, owing to the prevalence of inversions that dissociate genes along a chromosome unless there is a functional reason for them to remain clustered, e.g. the Hox genes. Nevertheless, pooling linkage data from several taxa, and following the reasoning that a tight association of two homeobox genes is more likely to reflect an ancestral tight linkage rather than a chance coming together along a particular lineage, allowed the reconstruction of larger arrays of homeobox genes and a hypothetical ancestral Mega-homeobox cluster (Pollard and Holland 2000).
The key gene in this hypothesis is Dlx. Dlx is classified as an NKL gene because of its location (with weak support) in phylogenetic trees, but it is genomically linked to the Hox cluster of chordates. However, as is the case for all of the 'NKL' genes, the divergence patterns for the families cannot be robustly resolved (Ryan et al. 2006, Kamm and Schierwater 2006). So can we really exclude the possibility that Dlx is not an NKL gene after all, but a Hox-like gene, whose sequence divergence is comparable to the Evx, Mnx or Gbx families, so that it cannot be placed reliably in molecular phylogenies?
The NKL gene Msx was originally thought to also provide evidence for Hox-NKL clustering, but this would have required breakage of the linkage after the genome duplications at the origin of the vertebrates (to accommodate two different break-points, between Dlx-Msx and