In The Major Transitions in Evolution, John Maynard Smith and Eörs Szathmáry (1995) wrote: 'the origin of the code represents the most perplexing problem in evolutionary biology; the existing translational machinery is at the same time so complex, so universal, and so essential that it is hard to see how it could have come into existence, or how life could have existed without it.' This sentence has been cited many times because it is difficult to state better the essential paradox of the translational apparatus. The standard genetic code is one of the most universal features of life. It is shared by almost every kind of organism, from bacteria to humans, with only a few minor exceptions. In an evolutionary framework, this universality implies that the standard code was used by a common ancestor of all forms of life, and thus that the code assumed its present form in a very short period of time at or very near the supposed beginning of life on Earth, but in any case prior to the oldest discovered microfossils, i.e. bacterial and cyanobacterial structures found in Archaean Apex cherts of the Warrawoona Group in Western Australia (Mojzsis et al., 1996; Schopf, 1993; Weberndorfer, 2002), dated to be at least 3,465 million years old.
Moreover, the standard genetic code seems at present to be an almost frozen version (Crick, 1968). This is because even a minimal change, for example in the amino acid meaning of a single codon, will necessarily produce deleterious consequences for the organism involved. Even a single point mutation in a protein can cause severe disease; this is the case, for example, for the point mutation affecting the synthesis of haemoglobin that leads to the severe human disease microcythemia. An organism of medium complexity synthesizes at least several thousand proteins (of the order of 50,000-60,000 in man, according to the latest results of human genome analysis); thus a change in the meaning of a single codon in the genetic code will necessarily produce changes in many different proteins: all those using this codon to code a specific amino acid in their polypeptide chains, which are typically at least several hundred in number. Ensuring that not one of these modifications in many different key proteins is fatal for the organism involved seems a very difficult, if not impossible, task. Of course, if the original code were very much simpler than the present one, the task would be more plausible. Alternative explanations for different possible ways of facilitating code changes have been proposed; for example, the case in which a particular codon becomes totally absent from all proteins of an organism. If this happens, it is possible to change the meaning of that codon without producing harmful consequences. Such a change is a neutral mutation, meaning that it carries no associated biological advantage. This fact, together with the low probability that a particular codon be completely exhausted in the entire protein library of an organism, makes it highly improbable that significant changes of the genetic code could be produced in this way.
Regarding the case of primitive codes, due to the universality of the present one, any proposed simpler code must necessarily be a predecessor of the code of the common ancestor; thus the evolutionary period required to shape it is included in the short period allowed for the universal standard genetic code to acquire its present form.
Returning to the main characteristics of the code as a fundamental feature of the translational machinery of life: as well as universal, the genetic code is also complex and essential. It is essential because without the code the synthesis of proteins on the basis of genetically stored information is not possible. It is also complex because, despite its apparent simplicity (neglecting for the moment the associated, highly complex synthesis machinery), it can lead to an astronomical combinatorial complexity. This complexity is analogous to the number of possible games that can be played on a 64-square chessboard or, following the mythological origin of the game of chess, to the number of grains of wheat requested from the king as a reward by its inventor: starting with one grain on the first square, and doubling the quantity for each succeeding chessboard square up to the 64th. Curiously, this problem was first solved by Fibonacci, using the fact that the sum of the first n terms of the series is equal to 2^n - 1 (Geronimi, 2006) (this result is the basis of the univocal property of the binary number representation system, as is shown below). The Fibonacci formula for n = 64 gives 18,446,744,073,709,551,615, i.e. approximately 2 × 10^19. This great number - to the dismay of the king, there was not enough wheat in the kingdom to provide the due reward for the ingenuity of the chess inventor - is in any case insignificant compared with the number of possible genetic codes. Assuming, as an independent statement, that the 64 codons code all 20 different amino acids plus the termination signal indicating the end of protein synthesis, each with at least one codon - that is, ignoring for the moment the biochemical level of complexity - some 10^83 arbitrary codes can be generated; this is a purely combinatorial problem (see, e.g. Chechetkin, 2003).
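Both counts can be checked directly. The sketch below (Python, not part of the original text) verifies Fibonacci's chessboard sum and counts, by inclusion-exclusion, the assignments of 64 codons to 21 meanings (20 amino acids plus termination) in which every meaning receives at least one codon; the straightforward surjection count comes out of order 10^83-10^84, the precise figure depending on counting conventions:

```python
from math import comb

# Grains of wheat: 1 + 2 + 4 + ... + 2**63 = 2**64 - 1 (Fibonacci's result).
grains = sum(2**k for k in range(64))
assert grains == 2**64 - 1  # = 18,446,744,073,709,551,615, about 2 x 10^19

# Arbitrary codes: assignments of 64 codons to 21 meanings in which every
# meaning is used at least once, counted by inclusion-exclusion over the
# number of meanings left unused.
codes = sum((-1)**k * comb(21, k) * (21 - k)**64 for k in range(21))
assert 10**83 < codes < 10**85  # of order 10^83-10^84
```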
Thus, the probability of existence of the actual code as a chance event produced in a single evolutionary step is of the order of 1/10^83 = 10^-83, i.e. almost zero. As a comparative example, it may be recalled that the total number of elementary particles in the universe is estimated at about 10^80 - about 10^60 times the number of grains of wheat requested by the chess inventor. This number of possible codes refers only to static complexity; that is, it does not involve the combinatorial complexity of the genetic information encoded by the genetic code. In fact, the degeneracy of the code allows different sequences of bases to encode the same protein. As an example, a not very large protein, consisting of say 200 amino acids, can be coded in approximately 3^200, that is, roughly 10^95, different ways.
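The 3^200 estimate follows from the average degeneracy of the standard code: 61 sense codons shared among 20 amino acids gives roughly three synonymous codons per residue. A minimal check (the uniform degeneracy of 3 at every position is a simplifying assumption):

```python
from math import log10

# Average degeneracy of the standard code: 61 sense codons / 20 amino acids.
avg_degeneracy = 61 / 20        # about 3.05 synonymous codons per residue

# Synonymous encodings of a 200-residue protein, assuming degeneracy 3 at
# every position: 3**200 different base sequences code the same protein.
exponent = 200 * log10(3)       # log10 of 3**200
assert 95 < exponent < 96       # i.e. 3**200 is roughly 10^95
```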
Considering these enormous combinatorial numbers, it is legitimate to ask a very fundamental question: why is the code as it is? Why is a particular code preferred over 10^83 other possibilities? Why is a particular codon sequence coding a protein 200 amino acids in length preferred over its 10^95 alternatives? In order to attempt to answer these questions it is necessary to characterize in some quantitative way the complexity of the code. As said before, code complexity can be understood at least at two different levels of description. The first refers to the biochemical complexity of the code: the code is implemented by a complex synthesis machinery interpreting codons of exactly three non-overlapping mRNA bases as different amino acids and/or start and stop synthesis signals; moreover, the bases and amino acids utilized in the code are a subset of those available in Nature exhibiting specific chemico-physical properties; and so on^1 (Anderson et al., 2004; Mac Donaill, 2002). A second level of complexity refers, instead, to the internal structure of the code, that is, to the description of the degeneracy distribution: the number and the identity of the codons assigned to every amino acid and to the initiation and termination signals (Jimenez-Montano, 1999). Given the almost frozen character of the present standard code, we should not expect a high degree of internal ordering at the degeneracy level; for the reasons described above, the evolutionary forces that have tailored the code have had very little evolutionary time to produce a strong internal organization. Furthermore, it has been demonstrated that the code is in many senses arbitrary, that is, a given tRNA can be modified to bind to different codons, thus changing the meaning of the code (Anderson et al., 2004); this is also shown by the existence of variants of the standard genetic code (Watanabe and Suzuki, 2001; Knight et al., 2001).
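For concreteness, the degeneracy distribution discussed above - the number of codons assigned to each amino acid and to the termination signal - can be tabulated directly. The sketch below uses the canonical nuclear (standard) code:

```python
# Degeneracy of the standard genetic code: codons per amino acid or signal.
degeneracy = {
    "Leu": 6, "Ser": 6, "Arg": 6,
    "Ala": 4, "Gly": 4, "Pro": 4, "Thr": 4, "Val": 4,
    "Ile": 3, "Stop": 3,
    "Phe": 2, "Tyr": 2, "His": 2, "Gln": 2, "Asn": 2,
    "Lys": 2, "Asp": 2, "Glu": 2, "Cys": 2,
    "Met": 1, "Trp": 1,
}
assert sum(degeneracy.values()) == 64                  # all codons accounted for
assert sum(v for k, v in degeneracy.items() if k != "Stop") == 61  # sense codons
```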
If the code is to a great extent arbitrary at the chemical level, what can be the biological advantage of a strong internal structure that does not depend on biochemical constraints? What hidden evolutionary forces could lead to such a tailoring of the code?
This article hopes to contribute to this understanding by showing a deep internal mathematical structure of the genetic code, based on the redundant representation of integer numbers by means of binary strings. This structure reveals symmetry properties of the genetic code whose uncovering may contribute to the understanding of the organization of the genetic information regarding the use of synonymous codons for the coding of specific amino acids in a protein chain. The applications of number theory in the modelling of complex systems have grown significantly in the scientific literature of the last decades. This flourishing is largely associated with the role played by number theory in the mathematical description of many important phenomena in the theory of dynamical systems, and in several technological applications (Schroeder, 1986). In fact, the principal motivation for the present work has been the search for hidden deterministic mechanisms of error detection and correction within the genetic machinery. It is hypothesized that these mechanisms are directly related to the mathematical ordering of the genetic code described here, which, in turn, is grounded in the dynamic properties of the translational apparatus. The results are very promising because, as is shown herein, strong analogies have been found between the organization of the genetic code and that of man-made codes for digital data transmission. The present work thus aims to contribute to the understanding of both the hidden rules of organization of degeneracy in the genetic information, through a mathematical model of the code, and the associated functionality of this ordering, probably connected with a deterministic error detection/correction mechanism. This insight may contribute, in turn, to shedding light on the causes leading to the present structure of the universal genetic code.
1 In fact, it has been demonstrated that the amino acids and bases used, on one hand, and the number of bases used for amino acid coding, on the other, can be arbitrary. For example, in Anderson et al. (2004), an unnatural amino acid is assigned to a four-base codon (AGGA) without altering the synchronization of the reading frame, thus leaving the rest of the composition of a given protein in Escherichia coli essentially unchanged. Assignment of different amino acids to, for example, nonsense codons allows the incorporation of two unnatural amino acids, demonstrating that the code can be expanded arbitrarily not only in the number of amino acids used for protein synthesis (before this experiment, restricted to assignment changes in nonsense codons), but also in the coding rules affecting, for example, the number of bases used to define a codon.
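As a concrete illustration of the binary-string representation of codons invoked above: with two bits per base, every codon corresponds to a six-bit string, i.e. an integer between 0 and 63. The base-to-bits assignment in this sketch is purely illustrative (an assumption for the example, not the ordering derived in this article):

```python
# Illustrative only: one possible 2-bit encoding of the four RNA bases.
BASE_BITS = {"C": 0b00, "U": 0b01, "G": 0b10, "A": 0b11}

def codon_to_int(codon: str) -> int:
    """Map a 3-base codon to an integer in 0..63 (a six-bit binary string)."""
    n = 0
    for base in codon:
        n = (n << 2) | BASE_BITS[base]
    return n

assert codon_to_int("CCC") == 0
assert codon_to_int("AAA") == 63
# The 64 codons map one-to-one onto the integers 0..63.
assert len({codon_to_int(a + b + c)
            for a in "CUGA" for b in "CUGA" for c in "CUGA"}) == 64
```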