All manifestations of life, from elementary biomolecular interactions to human behavior, are tightly associated with, if not fully governed by, sequence-specific interactions. The nucleic acid or protein sequence patterns involved in molecular or higher-level functions constitute the sequence codes for those functions. The genome that carries or encodes all these sequence patterns is thus a compact, intricately organized informational depot. To single out all major sequence codes and trace them in action may be viewed as the major challenge of modern molecular biology, or sequence biology.
Nucleotide sequences thus do more than encode proteins, as an inexperienced reader might assume. Various sequence instructions are read from the DNA, RNA, or protein molecule, each in its own way, via one specific molecular interaction or another, or via a whole network of interactions. In the triplet code the reading device is the ribosome. In gene splicing the sequence signals are recognized by the spliceosome. There are also numerous relatively simple sequence-specific DNA-protein and RNA-protein interactions, where the respective sequences are read by a single protein.
Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa 31905, Israel
M. Barbieri (ed.), The Codes of Life: The Rules of Macroevolution. © Springer 2008
After the triplet code was spectacularly cracked (Ochoa et al., 1963; Khorana et al., 1966; Nirenberg et al., 1966), the impact of this event was such that nobody could even think of other possible codes. The triplet code was even called the "genetic code," in other words the only code, leaving no room for doubt. All early history of bioinformatics revolved around this single code (Trifonov, 2000a). Yet, already in 1968, R. Holliday noted almost en passant that, perhaps, recombination signals in yeast might reside on the same sequence that encodes proteins. This remark introduced not only the notion of other possible codes, but also that of different codes overlapping on the same sequence.
The existence of codes other than the classical translation triplet code is already suggested by the degeneracy of the triplet code (Schaap, 1971). Freedom in the choice of codons allows significant changes in the nucleotide sequence without changing the encoded protein sequence. This makes it possible, in principle, to utilize the interchangeable bases of the mRNA sequence for some additional, different codes. In this case, the codes would coexist in interspersed form as mosaics of two or more "colors." It is known today that a more general and widespread case is when the codes literally overlap, so that some letters in specific positions of a given sequence (nucleotides or amino acids) are simultaneously involved in two or more different codes (sequence patterns). Such is the case with the coexisting triplet code and chromatin code - sequence instructions for nucleosome positioning (Trifonov, 1980; Mengeritsky and Trifonov, 1983). This was the first demonstration of the actual existence (Trifonov, 1981) of the hypothetical overlapping codes.
Sequences that do not encode proteins, despite their traditional classification as noncoding, carry important messages (codes) as well. Especially striking are the cases of sequence conservation in the noncoding regions (Koop and Hood, 1994), suggesting that the so-called noncoding sequences are associated with some function.
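The degeneracy argument above can be made concrete with a short sketch. It assumes the standard genetic code table, and the function names `translate` and `count_synonymous_encodings` are introduced here only for illustration. The sketch shows that two different nucleotide sequences can encode the same peptide, and counts how many such alternative encodings a peptide has.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons available for each amino acid (its degeneracy).
DEGENERACY = Counter(CODON_TABLE.values())


def translate(dna):
    """Translate a DNA coding sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_synonymous_encodings(protein):
    """Count the distinct nucleotide sequences that encode the given peptide."""
    total = 1
    for aa in protein:
        total *= DEGENERACY[aa]
    return total


# Two different nucleotide sequences, one and the same peptide.
print(translate("ATGAAACTG"), translate("ATGAAGTTA"))
print(count_synonymous_encodings("MKL"))
```

The number of alternative encodings grows multiplicatively with peptide length, which is the freedom that additional codes could, in principle, exploit.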
Among the known general sequence codes other than the triplet code are the transcription signals (transcription code) in promoters, such as the TATAAA box in eukaryotes and the TATAAT and TTGACA boxes in bacteria, which code for initiation of transcription. Another broadly known sequence code is the gene splicing code: the GT-AG rule (Breathnach and Chambon, 1981) together with some sequence preferences around the intron-exon junctions. A complex set of sequence rules describes details of DNA shape important for DNA-protein interactions and DNA folding in the cell.
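As a minimal sketch of how such signals can be looked for in a sequence, the following code scans for exact occurrences of a promoter box and checks whether a candidate intron satisfies the GT-AG rule. The helper names `find_motif` and `obeys_gt_ag` and the example sequences are hypothetical. Exact string matching is a deliberate simplification, since real promoter and splice signals tolerate deviations from the consensus.

```python
def find_motif(seq, motif):
    """Return the 0-based start positions of every exact occurrence of motif in seq."""
    seq, motif = seq.upper(), motif.upper()
    return [
        i for i in range(len(seq) - len(motif) + 1)
        if seq.startswith(motif, i)
    ]


def obeys_gt_ag(intron):
    """Check the GT-AG rule: the intron begins with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Made-up example sequences, for illustration only.
print(find_motif("GGTATAATCC", "TATAAT"))
print(obeys_gt_ag("GTAAGTCTCTTCAG"))
```

A more realistic scanner would score positions against a weight matrix rather than demand an exact match, but the principle of a reading device recognizing its own pattern is the same.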
At the level of amino acid sequences, the most important is the protein folding code, which is not yet described as a sequence pattern. One can single out the modular component of the folding code - organization of the globular proteins as a linear succession of modules in the form of loops of 25-30 residues closed at the ends by interactions between hydrophobic residues (Berezovsky et al., 2000; Berezovsky and Trifonov, 2002). The 3D structure of proteins appears to be encoded largely by a binary code (Trifonov et al., 2001; Trifonov, 2006; Gabdank et al., 2006) that, essentially, reduces the 20-letter alphabet to only two letters, for nonpolar and polar residues (more accurately, residues encoded by codons with a pyrimidine or a purine in the middle position). The binary code also suggests the ancestral form for any given sequence.
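The two-letter reduction described above can be sketched at the codon level. This is only an illustration of the middle-base criterion, not the cited authors' procedure. The function name `binary_code` is hypothetical, and the output letters follow the IUPAC convention (Y for pyrimidine, R for purine). Working on codons rather than residues avoids an ambiguity for serine, whose codons fall into both classes (TCN and AGY).

```python
PYRIMIDINES = "CT"
PURINES = "AG"


def binary_code(dna):
    """Map each codon of a coding sequence to 'Y' or 'R' by its middle base.

    'Y' marks a pyrimidine (C or T) in the middle position, 'R' a purine (A or G).
    A trailing partial codon is ignored.
    """
    dna = dna.upper()
    letters = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        middle = dna[i + 1]
        if middle in PYRIMIDINES:
            letters.append("Y")
        elif middle in PURINES:
            letters.append("R")
        else:
            raise ValueError(f"unexpected base {middle!r} at position {i + 1}")
    return "".join(letters)


# ATG (Met), AAA (Lys), CTG (Leu): middle bases T, A, T.
print(binary_code("ATGAAACTG"))
# Two serine codons, TCT and AGT, land in different classes.
print(binary_code("TCTAGT"))
```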
As carriers of instructions, biological sequences may be considered a language. Indeed, according to an appealing definition by the Russian philosopher V. Nalimov (1981), language is a communication tool that carries instructions to the operator at the receiving end. Languages such as computer programs (frequently called "codes" as well) and written or spoken human languages convey instructions expressed in the form of one code, for one reading device that reads consecutively, letter by letter and word by word, until the transmitted command is fully uttered. As mentioned above, a unique property of biological sequences is the superposition of the codes they carry. That is, the same sequence is meant to be read by several reading devices, each geared to its own specific code. Many cases of such overlapping are known (Trifonov, 1981; Normark et al., 1983). The overlapping is possible due to the degeneracy of the codes. There is, of course, an informational limit to such superposition, reached when the freedom of degeneracy becomes insufficient to accommodate additional messages without loss of quality of many or all of the other messages present.
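The superposition of two codes on one sequence can be illustrated with a toy search. This is a sketch under the assumption of the standard genetic code; the function `encodings_with_motif` and the example peptide are hypothetical. Among all synonymous encodings of a short peptide, it keeps only those that also contain a second signal, here the bacterial TATAAT box.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Reverse lookup: amino acid -> list of its synonymous codons.
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))


def encodings_with_motif(protein, motif):
    """List every encoding of the peptide that also contains the given motif.

    The search is exhaustive, so it is practical only for short peptides.
    """
    hits = []
    for codons in product(*(CODONS_FOR[aa] for aa in protein)):
        dna = "".join(codons)
        if motif in dna:
            hits.append(dna)
    return hits


# Met-Tyr-Asn has four encodings; only one of them also spells TATAAT.
print(encodings_with_motif("MYN", "TATAAT"))
# Met-Lys-Lys has no encoding that accommodates the same motif.
print(encodings_with_motif("MKK", "TATAAT"))
```

The second call returns an empty list, a small-scale picture of the informational limit: when degeneracy offers too little freedom, an additional message cannot be accommodated without altering the first.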