Every sequence has an evolutionary history, and the sequences or sequence fragments that were successful in the earliest times of molecular evolution are, perhaps, still around, either in hidden form or even unchanged since those times. The proteomic code described above is an example of such a code of evolutionary record. Modern sequence modules are not the same as their ancestral prototypes, but a certain degree of resemblance to the ancestors is conserved, which allows classification of present-day modules.
The earliest traced sequence elements go back to the very first codons, which are described as the triplets GGU, GCC, and their point mutational versions (Trifonov and Bettecken, 1997). A more detailed reconstruction confirmed this conclusion (Trifonov, 2000b, 2004). According to the reconstruction of the earliest stages of molecular evolution, the very first "genes" had a duplex structure with complementary sequences (GGC)n and (GCC)n, encoding (Gly)n and (Ala)n, respectively. Thus, the mRNA consensus (GCU)n and the consensus (xxC)n of the mRNA binding sites in the ribosome are both fossils of the earliest mRNA sequences (Trifonov, 1987; Lagunez-Otero and Trifonov, 1992; Trifonov and Bettecken, 1997).
The size of the earliest minigenes, as it turns out, can be estimated by distance analysis of modern mRNA sequences (Trifonov et al., 2001). For this purpose the sequences were first rewritten in binary form, in an alphabet of two letters: G for the Gly series of amino acids and codons, and A for the Ala series (see above). The respective codons contain in their middle positions either purines (G-series) or pyrimidines (A-series). From the reconstructed chart of evolution of the codons (Trifonov, 2000b, 2004), it follows that all codons of the G-series are descendants of the GGC codon, with a purine in the middle, while codons of the A-series originate from the GCC codon, with a pyrimidine in the middle. If the products of the very first genes had the structures either Gn or An, of a certain size n, then after fusion of the minigenes alternating patterns GnAnGnAn... may have been formed. Later mutations could, of course, have completely destroyed this pattern, but they did not. Analysis of large ensembles of mRNA sequences showed that the pattern did survive, though in rather hidden form (Berezovsky and Trifonov, 2001; Trifonov et al., 2001), so that estimation of the size of the very first genes became possible: 6-7 codons, encoding hexa- and heptapeptides. This estimate is strongly supported by an independent calculation of the sizes of the most ancient mRNA hairpins, which arrived at the same minigene size (Gabdank et al., 2006; Trifonov et al., 2006b). Moreover, the most conserved oligopeptide sequences, present in every prokaryotic proteome, also have a size of 6-9 amino acids (Sobolevsky and Trifonov, 2005; Sobolevsky et al., submitted).
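The binary rewriting step can be illustrated with a minimal sketch. It assumes an in-frame mRNA string and assigns each codon the letter G (middle purine) or A (middle pyrimidine). The function names `binary_codons` and `run_lengths` are hypothetical, and the run-length tally is only a toy illustration of the GnAnGnAn... idea, not the distance analysis used in the cited studies.

```python
from itertools import groupby

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}


def binary_codons(mrna: str) -> str:
    """Rewrite an in-frame mRNA as one letter per codon over {G, A}.

    'G' marks a codon with a purine in the middle position (G-series),
    'A' marks a codon with a pyrimidine in the middle position (A-series).
    Trailing bases that do not fill a codon are ignored.
    """
    seq = mrna.upper().replace("T", "U")  # accept DNA-style input as well
    letters = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        middle = seq[i + 1]
        if middle in PURINES:
            letters.append("G")
        elif middle in PYRIMIDINES:
            letters.append("A")
        else:
            letters.append("?")  # assumption: flag unknown bases such as N
    return "".join(letters)


def run_lengths(binary: str) -> list[tuple[str, int]]:
    """Collapse a binary string into (letter, run length) pairs."""
    return [(letter, len(list(group))) for letter, group in groupby(binary)]


# Toy "fused minigenes": (GGC)6 (GCC)6 (GGC)6, i.e. G6 A6 G6 in binary form.
toy = "GGC" * 6 + "GCC" * 6 + "GGC" * 6
encoded = binary_codons(toy)
print(encoded)               # GGGGGGAAAAAAGGGGGG
print(run_lengths(encoded))  # [('G', 6), ('A', 6), ('G', 6)]
```

In real mRNA the runs are blurred by later mutations, which is why the original pattern can only be recovered statistically from large ensembles of sequences.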
The ancient conservation of the middle purines and pyrimidines in the codons during the evolution of the codon table has, in fact, largely survived until now. This is confirmed by an analysis of amino acid substitutions in modern proteins (Trifonov, 2006; Gabdank et al., 2006). Every modern protein sequence can thus be written in the A and G alphabet. Such a presentation of a modern sequence in the binary code would suggest the most ancient version of that sequence.
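Writing a protein in the binary alphabet can be sketched by deriving each amino acid's class from the middle bases of its codons in the standard genetic code. This is an illustration under stated assumptions: serine has codons with both a middle pyrimidine (UCN) and a middle purine (AGU, AGC), and since the text does not say how it is assigned, the sketch marks it with a placeholder `x`. The names `middle_base_classes` and `binary_protein` are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def middle_base_classes() -> dict[str, set[str]]:
    """Map each amino acid to the set of binary letters its codons imply."""
    classes: dict[str, set[str]] = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # stop codons encode no amino acid
        kind = "G" if codon[1] in "AG" else "A"  # purine -> G, pyrimidine -> A
        classes.setdefault(aa, set()).add(kind)
    return classes


def binary_protein(protein: str, ambiguous: str = "x") -> str:
    """Rewrite a one-letter protein sequence in the binary G/A alphabet.

    Assumption: residues whose codons fall into both classes (serine),
    and any unrecognized symbols, are written as `ambiguous`.
    """
    classes = middle_base_classes()
    out = []
    for aa in protein.upper():
        kinds = classes.get(aa, set())
        out.append(next(iter(kinds)) if len(kinds) == 1 else ambiguous)
    return "".join(out)


# Gly -> G, Ala -> A, Val -> A, Asp -> G, Ser -> ambiguous
print(binary_protein("GAVDS"))  # GAAGx
```

Note that the binary letters G and A here name the two codon series and should not be confused with the one-letter amino acid codes for glycine and alanine in the input sequence.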
The binary code, the mosaic of A- and G-minigenes, and the proteomic code describe successive stages of protein evolution, from simple to more complex. Today one can also detect the next stage: the combining of closed loop modules into protein folds, or domains.
First, the next level is seen already in protein sizes, which appear to be multiples of 120-150 amino acid units (Berman et al., 1994; Kolker et al., 2002). Since 120-150 codons correspond to 360-450 base pairs, this size is a good match to the optimal DNA ring closure size, about 400 base pairs (Shore et al., 1981). This attractive numerology may well reflect the original formation of modern genes and genomes by fusion of individual DNA circles (genome units) of this standard size (Trifonov, 1995, 2002). This would constitute the genome segmentation code. How this code is expressed in sequence form is not yet specified, except for the preferential appearance of methionines (former translation starts) at distances equal to the genome unit size (Kolker and Trifonov, 1995).