A few days ago Kea wrote post #66 in her long and fascinating series of M-theory posts. Seems like she’s giving away the text to a book here. She wrote:
I was quite intrigued when a mathematical biologist at a conference told me recently that no one really knew why DNA had four bases rather than two. Apparently it isn’t clear why self-replicating molecules fail to adopt a binary code in X and Y.
The context of the problem needs some explaining. DNA is a long chain molecule that is built from a series of nucleotides. The strange thing is that exactly four nucleotides are used. This is strange because there are at least a dozen different nucleotides. Why use just four?
[edit Feb 8, 2009]This question is discussed in the scientific literature. For example, I just now found the article: Why Are There Four Letters in the Genetic Alphabet?, which describes the question generally, but I think the reader will find the chemical and information content argument given here to be more precise.[/edit]
If it were a problem in maximizing the amount of information that can be packed into a DNA molecule (bit packing), then the more nucleotides that are used, the more information, and the higher efficiency. On the other hand, if it is a matter of minimizing the complexity of the information description, one might expect that a binary code would be used and DNA would need only two nucleotides. This is the “why 4?” question.
I know little about biology and don’t know any of the literature on this subject. However, I do consider myself the finest digital logic designer on the planet, when designing for efficiency, uh, not to toot my own horn or anything, as this is an objective that is seldom important in digital design. In fact, I tried to not tell my boss that I’ve designed something to be ultra efficient because it is a waste of engineering time. It made digital design into a fairly amusing intellectual game, and kept me happy, but it is not very useful in industry.
The DNA problem reminded me of analyses of the efficiency of certain digital coding schemes that have come up at work over the years. I suspect that these are also the explanation for why four nucleotides are used in DNA.
In an environment where the niches that can support life are few and far between, and last only brief periods, the successful competitor will be the one that can fill the niche as quickly as possible. To reproduce life, useful chemicals must be manufactured and DNA must be duplicated.
DNA is an information carrying substance. The information that a rung of DNA can carry depends on the number of different states that it can exist in. For the DNA that presently exists, that number is four. Information is measured in bits. Since DNA has four choices for a rung, this amounts to log_2( 4 ) = 2 bits. That is, 2 bits of information can choose between four states, 00, 01, 10, and 11. More generally, the nucleotides of DNA come in pairs, so if DNA has N pairs, then there are 2N nucleotides, each rung can be in one of 2N states, and the amount of information encoded is log_2( 2N ) bits per rung. This is the bit packing and it is a slowly increasing function of N.
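As a quick check of the bit packing arithmetic, here is a minimal Python sketch. The function name bits_per_rung is made up for illustration, and the sketch assumes, as above, that N pairs give 2N equally likely states per rung.

```python
import math


def bits_per_rung(num_pairs):
    """Bits carried by one rung when DNA uses num_pairs nucleotide pairs.

    Assumes 2 * num_pairs equally likely states per rung, so the
    information is log2(2 * num_pairs).
    """
    return math.log2(2 * num_pairs)


if __name__ == "__main__":
    # Print pairs, states per rung, and bits per rung for a few values of N.
    for num_pairs in (1, 2, 3, 4, 6):
        states = 2 * num_pairs
        print(num_pairs, states, round(bits_per_rung(num_pairs), 3))
```

With one pair (a binary code) this gives 1 bit per rung, and with two pairs (the existing DNA) it gives 2 bits per rung. Going from 2 pairs to 6 pairs only raises the figure from 2 bits to about 3.6 bits, which is what "slowly increasing" means here.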
If the number of nucleotides used had no effect on the speed at which a rung of DNA can be processed, then the highest information speed will be obtained by the organism that uses the largest number of nucleotides. This would argue for the use of as many nucleotides as possible. But this analysis does not take into account the price one must pay for using a large number of different nucleotides.
The present situation in how DNA is reproduced does not necessarily correspond to the situation back when the coding of DNA was frozen into place. So these comments may not have much to do with coding speed in the present environment. They come just from my experience designing digital systems.
Since DNA is reproduced chemically, one must keep chemical concentrations in the organism (or organelle or whatever) for each of the nucleotides required. In a binary system, one needs just two nucleotides, so these can have relative concentrations of 1/2. The rate at which the process can proceed is proportional to the concentration. And a binary system contains one bit of information per rung. So a binary system has a bit rate of 1-bit x 1/2 = 1/2 bit per unit time.
The existing DNA system requires 4 nucleotides, so their relative concentrations are 1/4 each. Each rung contains 2 bits of information. The bit rate is therefore 2-bits x 1/4 = 1/2 bit per unit time, just the same as the binary case. But 4 nucleotide DNA has twice as much information content per rung as 2 nucleotide DNA. So while 2 and 4 have equal bit rates after taking into account concentration requirements, the 4 nucleotide system is more efficient in bit packing.
For larger numbers of nucleotides, the bit rate decreases. For example, with 6 nucleotides the bit rate is log_2( 6 ) x 1/6 = 0.43, and it decreases further for larger numbers of nucleotides. A similar formula, log_2( N ) / N, where N now counts the tokens rather than the pairs, arises in digital systems that pass N different tokens between each other and that require “time” (or, if one is trying to maximize information flow per transistor, “a transistor count”) proportional to N.
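The bit rate comparison can be tabulated with a few lines of Python. This is only a sketch of the simple model used here: the function name bit_rate is made up for illustration, each of the n nucleotides is assumed to sit at relative concentration 1/n, and only even n is tried because the nucleotides come in pairs.

```python
import math


def bit_rate(num_nucleotides):
    """Bits per unit time in the simple concentration model.

    Each rung carries log2(n) bits, and the copying rate is taken to be
    proportional to the relative concentration 1/n of each nucleotide.
    """
    return math.log2(num_nucleotides) / num_nucleotides


if __name__ == "__main__":
    # Even counts only, since nucleotides are used in complementary pairs.
    for n in (2, 4, 6, 8, 10, 12):
        print(n, round(math.log2(n), 3), round(bit_rate(n), 3))
```

In this model 2 and 4 nucleotides tie at 1/2 bit per unit time, 6 nucleotides give about 0.43, and 8 nucleotides give 3/8, with the rate continuing to fall after that.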
So when comparing 4 nucleotide DNA with 2 nucleotide DNA, the bit rates are the same, but 4 nucleotides is twice as efficient at bit packing. The tie goes to the 4 nucleotide system. And 4 nucleotide DNA has a faster bit rate than any system that uses more nucleotides.
It’s been a long time since I looked at the codon system of DNA. I’ll take a quick look at the Wikipedia article and see if I can come up with anything obvious on it.