# Why does DNA only use 4 nucleotides?

A few days ago Kea wrote post #66 in her long and fascinating series of M-theory posts. Seems like she’s giving away the text to a book here. She wrote:

I was quite intrigued when a mathematical biologist at a conference told me recently that no one really knew why DNA had four bases rather than two. Apparently it isn’t clear why self-replicating molecules fail to adopt a binary code in X and Y.

The context of the problem needs some explaining. DNA is a long chain molecule that is built from a series of nucleotides. The strange thing is that exactly four nucleotides are used. This is strange because there are at least a dozen different nucleotides, so why use just four?

[edit Feb 8, 2009]This question is discussed in the scientific literature. For example, I just now found the article: Why Are There Four Letters in the Genetic Alphabet?, which describes the question generally, but I think the reader will find the chemical and information content argument given here to be more precise.[/edit]

If it were a problem in maximizing the amount of information that can be packed into a DNA molecule (bit packing), then the more nucleotides that are used, the more information, and the higher efficiency. On the other hand, if it is a matter of minimizing the complexity of the information description, one might expect that a binary code would be used and DNA would need only two nucleotides. This is the “why 4?” question.

I know little about biology and don’t know any of the literature on this subject. However, I do consider myself the finest digital logic designer on the planet, when designing for efficiency, uh, not to toot my own horn or anything, as this is an objective that is seldom important in digital design.  In fact, I tried to not tell my boss that I’ve designed something to be ultra efficient because it is a waste of engineering time.  It made digital design into a fairly amusing intellectual game, and kept me happy, but it is not very useful in industry.

The DNA problem reminded me of analyses of the efficiency of certain digital coding schemes that have come up at work over the years.  I suspect that these are also the explanation for why four nucleotides are used in DNA.

In an environment where the niches that can support life are few and far between, and which last only brief periods, the successful competitor will be the one that can fill the niche as quickly as possible. To reproduce life, useful chemicals must be manufactured and DNA must be duplicated.

DNA is an information carrying substance. The information that a rung of DNA can carry depends on the number of different states that it can exist in. For the DNA that presently exists, that number is four.  Information is measured in bits. Since DNA has four choices for a rung, this amounts to log_2( 4 ) = 2 bits. That is, 2 bits of information can choose between four states, 00, 01, 10, and 11.  More generally, the nucleotides of DNA come in pairs, so if DNA has N pairs, then there are 2N states, and the amount of information encoded is log_2( 2N) per rung.  This is the bit packing and it is a slowly increasing function of N.
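As a small illustration of the bit-packing formula above, here is a minimal Python sketch. The function name `bits_per_rung` is made up for this example; the only thing taken from the post is the formula log_2(2N) for N complementary pairs.

```python
import math


def bits_per_rung(num_pairs: int) -> float:
    """Information per rung, log2(2N), for N complementary nucleotide pairs."""
    if num_pairs < 1:
        raise ValueError("need at least one pair")
    return math.log2(2 * num_pairs)


# N = 2 pairs is the DNA that presently exists (4 states, 2 bits per rung).
for n_pairs in (1, 2, 3, 4, 8):
    states = 2 * n_pairs
    print(f"{n_pairs} pair(s): {states} states, {bits_per_rung(n_pairs):.3f} bits per rung")
```

Doubling the number of pairs adds only one bit per rung, which is what "slowly increasing" means here.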

If the number of nucleotides used had no effect on the speed at which a rung of DNA can be processed, then the highest information speed will be obtained by the organism that uses the largest number of nucleotides. This would argue for the use of as many nucleotides as possible. But this analysis does not take into account the price one must pay for using a large number of different nucleotides.

The present situation in how DNA is reproduced does not necessarily correspond to the situation back when the coding of DNA was frozen into place. So these comments may not have much to do with coding speed in the present environment.  They come just from my experience designing digital systems.

Since DNA is reproduced chemically, one must keep chemical concentrations in the organism (or organelle or whatever) for each of the nucleotides required. In a binary system, one needs just two nucleotides, so these can have relative concentrations of 1/2. The rate at which the process can proceed is proportional to the concentration. And a binary system contains one bit of information per rung. So a binary system has a bit rate of 1-bit x 1/2 = 1/2 bit per unit time.

The existing DNA system requires 4 nucleotides, so their relative concentrations are 1/4 each. Each rung contains 2 bits of information. The bit rate is therefore 2-bits x 1/4 = 1/2 bit per unit time, just the same as the binary case. But 4 nucleotide DNA has twice as much information content per rung as 2 nucleotide DNA. So while 2 and 4 have equal bit rates, after taking into account concentration requirements, the 4 nucleotide system is more efficient in bit packing.

For larger numbers of nucleotides, the bit rate decreases. For example, with 6 nucleotides the bit rate is log_2( 6 ) x 1/6 = 0.43, and the speed decreases further for larger numbers of nucleotides. A similar formula, log_2(N) / N, where N now counts the tokens themselves rather than pairs, arises in digital systems which pass N different tokens between each other, and which require “time” (or in the event that one is trying to maximize information flow per transistor, “a transistor count”) proportional to N.
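The comparison is easy to tabulate. The sketch below assumes, as the post does, that every nucleotide is held at an equal relative concentration of 1/n and that the copying rate is proportional to that concentration. Only even n are listed, following the assumption that nucleotides come in complementary pairs. The name `bit_rate` is just for this example.

```python
import math


def bit_rate(num_nucleotides: int) -> float:
    """Bits per rung, log2(n), times the 1/n relative concentration of each nucleotide."""
    return math.log2(num_nucleotides) / num_nucleotides


for n in (2, 4, 6, 8, 10, 12):
    print(f"n={n:2d}  bits/rung={math.log2(n):.3f}  relative bit rate={bit_rate(n):.3f}")
```

The n=2 and n=4 rows tie at 0.5, and every larger even n falls below that.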

So when comparing 4 nucleotide DNA with 2 nucleotide DNA, the bit rates are the same, but 4 nucleotides is twice as efficient at bit packing. The tie goes to the 4 nucleotide system. And 4 nucleotide DNA has a faster bit rate than any system that uses more nucleotides.

It’s been a long time since I looked at the codon system of DNA. I’ll take a quick look at the wikipedia article and see if I can come up with anything obvious on it.

Filed under DNA, engineering

### 15 responses to “Why does DNA only use 4 nucleotides?”

1. Good to see you post this. It sounds very plausible, although I can’t say I know much about the subject. Hopefully some DNA geeks will think about it.

2. Doug

Hi Carl,

Be cautious. There are 4 nucleotides in DNA, but do not forget the 5th in RNA. There are also 2 precursor purines [or 3 with Inosine] and one precursor pyrimidine that I am aware of. Could you please provide a reference to account for the remainder of the possible dozen?

I like the U_Utah website: ‘Purine and Pyrimidine Metabolism’
http://library.med.utah.edu/NetBiochem/pupyr/pp.htm

Nature, Volume 447 Number 7146 pp753-881 [p 799]
has an article on the ENCODE Project for the human genome.
Editor’s Summary 14 June 2007
Decoding the blueprint [editor] and
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project [authors]
http://www.nature.com/nature/journal/v447/n7146/edsumm/e070614-01.html

This 108 page thesis, Alexander Gutfraind, ‘Error-Tolerant Coding and the Genetic Code’ is interesting.

This SCIAM FEATURE ARTICLE, June 2007 issue, BIOLOGY
Robert Shapiro, ‘A Simpler Origin for Life’, p 47-53
“The sudden appearance of a large self-copying molecule such as RNA was exceedingly improbable. Energy-driven networks of small molecules afford better odds as the initiators of life”; primarily discusses replicator-first or metabolism-first.
SA Brenner in a side-bar discusses how boron may have been necessary to stabilize ribose.
http://sciam.com/article.cfm?chanID=sa006&colID=1&articleID=779849FA-E7F2-99DF-3FF8ED5B4D8764FE

DNA replication appears to be a simple complementary operation.
RNA transcription is also complementary, but translation appears to use a noncommutative triple code.

3. carlbrannen

Doug, sorry your comment didn’t post immediately. Evidently it hit the moderation sieve. Maybe too many useful links.

“Could you please provide a reference to account for the remainder of the possible dozen [nucleotides]?”

Uh, I’m not a chemist. My source is wikipedia, which lists 15 nucleotides and 15 deoxynucleotides.

“This 108 page thesis, Alexander Gutfraind, ‘Error-Tolerant Coding and the Genetic Code’ is interesting.”

Theses make great reading, I think. I looked at the title page, it’s actually 186 pages. I’m not going to read it because I’m not much interested in DNA or RNA, but I think that the error tolerant coding is about the codons, which presumably are an addition to DNA after the early evolution.

“There are 4 nucleotides in DNA, but do not forget the 5th in RNA.”
“The sudden appearance of a large self-copying molecule such as RNA was exceedingly improbable.”

This is why some people think that DNA self copying is older (and why I didn’t worry too much about the extra nucleotide in RNA). Heck, I’m not a biologist so I really don’t go to bed at night wondering about why DNA has 4 nucleotides. I think the problem isn’t very important, unless you wanted to redesign some very simple form of life. Why not just stick with the chemicals God used?

When digital designers sit around and talk about what they’re going to do when computers take over digital design, or what they’d study in college if they were 18 again, the subject of biological engineering at the DNA level comes up. But right now I’ve got other fish to fry.

Which reminds me. I still haven’t seen any new Koi fry. I suspect that the biggest fish is down there vacuuming them up because it hasn’t seemed hungry when I’ve fed the fish recently.

4. Doug

Carl,
1 – Thanks for the nucleotide [RE: nucleic acids] reference to Wiki.
Note that nucleotides have 1-3 phosphate groups, each with high energy bonds. When phosphate groups are not present, the term nucleoside is used. A monophosphate nucleotide exchanges the phosphate group as energy when forming nucleic acids, becoming a nucleoside.
Hence there are 15 nucleotides, 3 for each of 5 nucleosides [precursors omitted].
Adenine [A] pairs with Thymine [T] during DNA replication, or with Uracil [U] during RNA transcription.
Thymine is found only in DNA.
Uracil is found only in RNA.
Guanine [G] and Cytosine [C] have a one to one relationship.
Thus technically the triplet genetic code is an RNA code, since U not T is used to manufacture proteins from amino acids.

2 – I made a typo using the QWERTY numbers/characters keys, exchanging 08 by mistake for 86.
My intent was to provide data which you could corroborate or contradict if you pursued the problem stated in your paragraph:
“The DNA problem reminded me of analyses of the efficiency of certain digital coding schemes that have come up at work over the years. I suspect that these are also the explanation for why four nucleotides are used in DNA.”
I do recommend that you at least scan the contents on page v-vii and the list of figures on pages viii-ix. [Especially section 2 Coding and Information Theory and ironically Appendix A The Eigen Model.]

3 – Thomas Cech and Sidney Altman shared the 1989 chemistry Nobel for essentially demonstrating that RNA was more versatile [self-catalysis and enzyme activity] than DNA and likely was present before DNA. Work is required to remove oxygen from ribose.

http://nobelprize.org/nobel_prizes/chemistry/laureates/1989/presentation-speech.html

Manfred Eigen [with Ronald Norrish and George Porter] was awarded the 1967 Chemistry Nobel “for their studies of extremely fast chemical reactions, effected by disturbing the equilibrium by means of very short pulses of energy”.

http://nobelprize.org/nobel_prizes/chemistry/laureates/1967/press.html

4 – I was surprised by your invocation of ‘God’. I do not want to get into a religious debate. I am more comfortable with ‘Nature’s God’ from the US Declaration of Independence. The latter term is non-denominational and even ambiguous. Since the Greek root of physics means nature, this could be “physics’ god”.
Surely the same [or equal] “physics’ god” participated in the evolution of nucleic acids and before that the evolution of existence [HEP, QM and GR]. Both types of evolution are of great, perhaps equal, importance.

5 – “… biological engineering ..” is incomplete since the study of nucleic acids is at least bio_physical_chemical_engineering_with_the_mathematics_of_energy_economics.

6 – I hope the Koi have been well fed.

5. Carl,

Your argument is nice. One considers either RNA or DNA, so you indeed have just 4 bases.

6. About “designing for efficiency”, it is worthwhile to remember that Nature is not allowed to start from scratch, but is forced to rely on random mutations of the previous design.

7. carlbrannen

Matti, for some reason I think that RNA is a little different from DNA, and actually has 5 instead of 4 coding units. I forget, it should be in wikipedia. But DNA is the more basic of the coding schemes.

Alejandro, yes, my point is that there are many more nucleotides than the 4 used in DNA. Nature could naturally have begun with the more general coding method, but the count would have been trimmed to 4 by evolution.

You can work out the calculus easily enough. Let $a_n$ be the percentage of the codes that are of type n so that the sum of $a_n$ is unity. Compute the mixture that will minimize assembly time by using reaction rate equations. You will find that if one of the $a_m$ is less than all the others, the optimal broth will have a smaller percentage of m in it, but the calculated percentage will be larger than $a_m$.

In other words, the organism will have to maintain an abnormally high percentage of m to minimize its average time to reproduce.

That abnormally high percentage falls to zero as $a_m$ falls to zero, so the inefficiency will be reduced by further genetic mutations that reduce the number of codes.
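One way to check this numerically is sketched below. It rests on an assumed toy model that is not spelled out in the comment: the time to attach a code of type n is taken to be proportional to 1/c_n, where c_n is that code's share of the broth, so the average time per rung is the sum of a_n / c_n with the c_n summing to 1. Under that assumption a Lagrange multiplier gives c_n proportional to sqrt(a_n). The function names are invented for the example.

```python
import math


def optimal_broth(code_fractions):
    """Broth shares c_n that minimize sum(a_n / c_n) subject to sum(c_n) == 1.

    Assumed model only: the Lagrange condition gives c_n proportional to sqrt(a_n).
    """
    roots = [math.sqrt(a) for a in code_fractions]
    total = sum(roots)
    return [r / total for r in roots]


def mean_assembly_time(code_fractions, broth):
    """Average time per rung (arbitrary units) when rate is proportional to concentration."""
    return sum(a / c for a, c in zip(code_fractions, broth) if a > 0)


a = [0.3, 0.3, 0.3, 0.1]   # hypothetical usage fractions; the last code is the rare one
c = optimal_broth(a)
for a_n, c_n in zip(a, c):
    print(f"usage a_n={a_n:.2f}  optimal broth share c_n={c_n:.3f}")
print("mean time, optimal broth:", round(mean_assembly_time(a, c), 3))
print("mean time, broth equal to usage:", round(mean_assembly_time(a, a), 3))
```

In this toy model the rare code gets a smaller share of the broth than the common ones, but a larger share than its usage fraction, and that share shrinks to zero as its usage does.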

On the other hand, if you look at the rate at which things can be reproduced, you will find that the question of whether or not the rate increases or decreases with adjustments in the $a_n$ depends on how many $a_n$ there are. I’m claiming that if n>4, then the system is unstable towards reduction in the number of n.

This claim is not much different from what I showed in the blog post. You know that the overall rate is higher for n=4 than for n=5. So that says that among the possible landscape of 4-element codes, there is a neighborhood of n=4 that is faster than any other n=5 solution.

8. 12 04 07

Hey CB:
Matti and I discussed this issue last year when I was trying to figure out a set theoretic way to categorize nucleotides using a p-adic model. I settled on 5-adicity because it includes ATGCU, which is all inclusive…

9. Allow me to type away on the A, T, G, C that occupy any three places on the DNA ladder rungs, as the rungs spiral their way along the helical coils.

Fortunately, at the UC Berkeley Student Cooperative housing at Cloyne Court, less than 100 yards from the Electrical Engineering building on the north side of campus, a physics cousin of a roommate at that time in the 50s-60s invited physicist George Gamow to speak to some eight of us male students.

Gamow stated he calculated that four different types of DNA base, if occupying three positions on a rung, would code for (4)(3)(2) amino acids, which gave 24 amino acids.

That comes close. Was he correct to draw such a conclusion?
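For reference, the count depends on which triplets are treated as distinct. The short sketch below simply enumerates three possibilities for a 4-letter alphabet; it says nothing about which counting rule was used in the talk described above.

```python
from itertools import combinations_with_replacement, permutations, product

BASES = "ATGC"

# Ordered triplets with repeats allowed: 4 * 4 * 4
ordered_with_repeats = len(list(product(BASES, repeat=3)))
# Ordered triplets with no base repeated: 4 * 3 * 2, the count quoted in the comment
ordered_no_repeats = len(list(permutations(BASES, 3)))
# Unordered triplets with repeats allowed
unordered_with_repeats = len(list(combinations_with_replacement(BASES, 3)))

print("ordered, repeats allowed:  ", ordered_with_repeats)
print("ordered, no repeats:       ", ordered_no_repeats)
print("unordered, repeats allowed:", unordered_with_repeats)
```

The (4)(3)(2) = 24 figure corresponds to ordered triplets in which no base repeats. Allowing repeats gives 64 ordered triplets, which is the number of codons in the standard genetic code.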

At any rate, since then he empowered me with the power of some mathematics in the quest for understanding physics.

Best, rmuldavin

10. Michael James Hanford

As a biochemical engineer…I can see both sides. Your mathematical analysis is interesting…it’s nice to see 4 pairs gives higher data packing.

However there are 2 issues that come to mind, despite the late hour:

1. Pairing of DNA bases has more to do with enhancing the stability of DNA in vivo than information theory. Single-stranded (ss) DNA is less stable and more amenable to nuclease (enzyme) degradation than double-stranded (ds) DNA pairs.

“More generally, the nucleotides of DNA come in pairs, so if DNA has N pairs, then there are 2N states, and the amount of information encoded is log_2( 2N) per rung. This is the bit packing and it is a slowly increasing function of N.”

2. I think the reason there are 4 instead of 2 has much more to do with codon/anticodon pairing in protein translation. There are 20 amino acids; each must be encoded by a 3-letter ‘anticodon’. However, with the “wobble” effect there is redundancy in the code… the 3rd letter in the codon isn’t reliable. So, tRNA anticodons CCC, CCU, CCG, CCA all encode for the same amino acid (AA): proline.

So, the redundancy in the code requires that enough permutations exist to cover all 20 AA plus the stop codon (to terminate translation).
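The counting behind this point can be sketched in a few lines. The function below is a hypothetical helper, not anything from the biology literature: it finds the shortest fixed-length codon that gives at least 21 distinct words (20 amino acids plus a stop signal) for a given alphabet size, ignoring wobble and any other chemistry.

```python
def min_codon_length(alphabet_size: int, meanings: int) -> int:
    """Smallest length L such that alphabet_size ** L >= meanings."""
    if alphabet_size < 2 or meanings < 1:
        raise ValueError("need at least 2 letters and 1 meaning")
    length = 1
    while alphabet_size ** length < meanings:
        length += 1
    return length


MEANINGS = 21  # 20 amino acids plus one stop signal, as in the comment above

for letters in (2, 4, 6):
    length = min_codon_length(letters, MEANINGS)
    spare = letters ** length - MEANINGS
    print(f"{letters} letters: codon length {length}, {letters ** length} words, {spare} spare")
```

With 4 letters, two positions give only 16 words, so three are needed, and the resulting 64 words leave plenty of room for the redundancy described above. A 2-letter alphabet would need 5 positions.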

thoughts?

11. jen

Perhaps I am overlooking something obvious here, but isn’t it true that a nucleotide can only bind to one of the other three? A only binds to T and likewise G only binds to C. This implies the packing per rung is really log_2( 2 ) = 1 bit.

12. Carl Brannen

This is a good question. In DNA, the four choices amount to A, C, T, and G. The binding is used to duplicate DNA, not to carry information.

In an organism, a sequence down one strand of DNA might look like ACTATAACTGAGGAG. The other strand has a related sequence by the pairing. One of these carries the information as “sense”, the other as “anti-sense”. Both halves of the molecule carry the same information, but in one of them it’s reversed. So the information content is the same as if there were only one strand. To learn more about this, see the wikipedia article on DNA, which is more readable than a lot of wiki articles. Or see their article on sense: http://en.wikipedia.org/wiki/Sense_%28molecular_biology%29
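To make the point concrete, here is a minimal sketch that rebuilds the partner strand from the example sequence, using the usual A-T and G-C pairing and reading the partner in the opposite direction. The function name is just for illustration.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in the opposite direction."""
    return strand.translate(COMPLEMENT)[::-1]


sense = "ACTATAACTGAGGAG"
antisense = reverse_complement(sense)
print(sense)
print(antisense)
# Applying the operation twice recovers the original strand.
print(reverse_complement(antisense) == sense)
```

Because either strand can be reconstructed exactly from the other, the second strand adds no new information, so each rung still carries one choice out of four.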

13. I’ve been looking for someone like this for years! I am in the opposite camp from you: I know a whole lot about biology, but very little about computers. Yet I’m realizing more and more that a cell’s communication system can totally be treated like a chemical computer. I would love to do some joint project with you.