The ORION : January 2015

GENETIC CODE, MOLECULAR CLONING & APPLICATIONS

CIRCULAR BACTERIAL CHROMOSOME

A circular bacterial chromosome, showing DNA replication proceeding bidirectionally, with two replication forks generated at the "origin". Each half of the chromosome replicated by one replication fork is called a "replichore".

A circular bacterial chromosome is a bacterial chromosome consisting of a circular DNA molecule. Unlike the linear DNA of vertebrates, typical bacterial chromosomes contain circular DNA.

Most bacterial chromosomes contain a circular DNA molecule - there are no free ends to the DNA. Free ends would otherwise create significant challenges to cells with respect to DNA replication and stability. Cells that do contain chromosomes with DNA ends, or telomeres (most eukaryotes), have acquired elaborate mechanisms to overcome these challenges. However, a circular chromosome can provide other challenges for cells. After replication, the two progeny circular chromosomes can sometimes remain interlinked or tangled, and they must be resolved so that each cell inherits one complete copy of the chromosome during cell division.

Replication of a circular bacterial chromosome

Bacterial chromosome replication is best understood in the well-studied bacteria Escherichia coli and Bacillus subtilis. Chromosome replication proceeds in three major stages: initiation, elongation and termination. The initiation stage starts with the ordered assembly of "initiator" proteins at the origin region of the chromosome, called oriC. These assembly stages are regulated to ensure that chromosome replication occurs only once in each cell cycle. During the elongation phase of replication, the enzymes that were assembled at oriC during initiation proceed along each arm ("replichore") of the chromosome, in opposite directions away from the oriC, replicating the DNA to create two identical copies. This process is known as bidirectional replication. The entire assembly of molecules involved in DNA replication on each arm is called a "replisome." At the forefront of the replisome is a DNA helicase that unwinds the two strands of DNA, creating a moving "replication fork". The two unwound single strands of DNA serve as templates for DNA polymerase, which moves with the helicase (together with other proteins) to synthesize a complementary copy of each strand. In this way, two identical copies of the original DNA are created. Eventually, the two replication forks moving around the circular chromosome meet in a specific zone of the chromosome, approximately opposite oriC, called the terminus region. The elongation enzymes then disassemble, and the two "daughter" chromosomes are resolved before cell division is completed.

Initiation

The E. coli replication origin, called oriC, consists of DNA sequences that are recognised by the DnaA protein, which is highly conserved amongst different bacterial species. DnaA binding to the origin initiates the regulated recruitment of other enzymes and proteins that will eventually lead to the establishment of two complete replisomes for bidirectional replication.

DNA sequence elements within oriC that are important for its function include DnaA boxes, 9-mer repeats with a highly conserved consensus sequence 5' - TTATCCACA - 3', that are recognized by the DnaA protein. DnaA protein plays a crucial role in the initiation of chromosomal DNA replication. Bound to ATP, and with the assistance of bacterial histone-like proteins (HU), DnaA unwinds an AT-rich region near the left boundary of oriC, which carries three 13-mer motifs, and opens up the double-stranded DNA for entry of other replication proteins.
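The consensus search described above can be sketched in a few lines of Python. This is only an illustration: the function name `find_dnaa_boxes`, the one-mismatch tolerance and the toy sequence are assumptions made for the example, not the real oriC sequence or a published scoring method.

```python
CONSENSUS = "TTATCCACA"  # DnaA box consensus quoted in the text (5' -> 3')

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))

def find_dnaa_boxes(seq, max_mismatches=1):
    """List (position, strand, mismatches) for 9-mers close to the consensus.

    Both strands are checked, because a box can sit in either orientation.
    The mismatch tolerance is an assumption for this sketch.
    """
    hits = []
    k = len(CONSENSUS)
    targets = (("+", CONSENSUS), ("-", reverse_complement(CONSENSUS)))
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for strand, target in targets:
            mismatches = sum(a != b for a, b in zip(window, target))
            if mismatches <= max_mismatches:
                hits.append((i, strand, mismatches))
    return hits

# Made-up sequence: a perfect box, a reverse-orientation box,
# and a box with one mismatch (TTATACACA).
toy = "GGC" + "TTATCCACA" + "AGT" + "TGTGGATAA" + "CC" + "TTATACACA" + "G"
hits = find_dnaa_boxes(toy)
```

Each hit reports the 0-based start position, the strand and the number of mismatches, so near-consensus boxes can be told apart from perfect ones.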

This region also contains four “GATC” sequences that are recognized by DNA adenine methylase (Dam), an enzyme that modifies the adenine base when this sequence is unmethylated or hemimethylated. The methylation of adenines is important as it alters the conformation of DNA to promote strand separation, and it appears that this region of oriC has a natural tendency to unwind.

Elongation

As the replication forks move around the circle, a structure shaped like the Greek letter theta (θ) is formed. John Cairns demonstrated the theta structure of E. coli chromosomal replication in 1963, using an innovative method to visualize DNA replication. In his experiment, he radioactively labeled the chromosome by growing his cultures in a medium containing 3H-thymidine. The labeled nucleoside was incorporated uniformly into the bacterial chromosome. He then isolated the chromosomes by lysing the cells gently and placed them on an electron micrograph (EM) grid, which he exposed to X-ray film for two months. This experiment clearly demonstrated the theta replication model of circular bacterial chromosomes.

As described above, bacterial chromosomal replication occurs in a bidirectional manner. This was first demonstrated by specifically labelling replicating bacterial chromosomes with radioactive isotopes. The regions of DNA undergoing replication during the experiment were then visualized by using autoradiography and examining the developed film microscopically. This allowed the researchers to see where replication was taking place. The first conclusive observations of bidirectional replication were from studies of B. subtilis. Shortly after, the E. coli chromosome was also shown to replicate bidirectionally.

The E. coli DNA polymerase III holoenzyme is a 900 kD complex with an essentially dimeric structure. Each monomeric unit has a catalytic core, a dimerization subunit, and a processivity component. DNA Pol III uses one set of its core subunits to synthesize the leading strand continuously, while the other set of core subunits cycles from one Okazaki fragment to the next on the looped lagging strand. Leading strand synthesis begins with the synthesis of a short RNA primer at the replication origin by the enzyme primase (DnaG protein).

Deoxynucleotides are then added to this primer by a single DNA polymerase III dimer, in an integrated complex with DnaB helicase. Leading strand synthesis then proceeds continuously, while the DNA is concurrently unwound at the replication fork. In contrast, lagging strand synthesis is accomplished in short Okazaki fragments. First, an RNA primer is synthesized by primase, and, as in leading strand synthesis, DNA Pol III binds to the RNA primer and adds deoxyribonucleotides.

When the synthesis of an Okazaki fragment has been completed, replication halts and the core subunits of DNA Pol III dissociate from the β sliding clamp [the β sliding clamp is the processivity subunit of DNA Pol III]. The RNA primer is removed and replaced with DNA by DNA polymerase I [which also possesses proofreading exonuclease activity], and the remaining nick is sealed by DNA ligase, which thereby joins these fragments to form the lagging strand.

Termination

Termination is the process of fusion of replication forks and disassembly of the replisomes to yield two separate and complete DNA molecules. It occurs in the terminus region, approximately opposite oriC on the chromosome. The terminus region contains several DNA replication terminator sites, or "Ter" sites. A special "replication terminator" protein must be bound at the Ter site for it to pause replication. Each Ter site has polarity of action; that is, it will arrest a replication fork approaching the Ter site from one direction, but will allow unimpeded fork movement through the Ter site from the other direction. The arrangement of the Ter sites forms two opposed groups that force the two forks to meet each other within the region they span. This arrangement is called the "replication fork trap."

Replication of the DNA separating the opposing replication forks leaves the completed chromosomes joined as ‘catenanes’, or topologically interlinked circles. The circles are not covalently linked, but cannot be separated because they are interwound and each is covalently closed. Separating the catenated circles [decatenation] requires the action of topoisomerases. In E. coli, DNA topoisomerase IV plays the major role in the separation of the catenated chromosomes, transiently breaking both DNA strands of one chromosome and allowing the other chromosome to pass through the break.

Genetic code

A series of codons in part of a messenger RNA (mRNA) molecule. Each codon consists of three nucleotides, usually representing a single amino acid. The nucleotides are abbreviated with the letters A, U, G and C. This is mRNA, which uses U (uracil). DNA uses T (thymine) instead. This mRNA molecule will instruct a ribosome to synthesize a protein according to this code.

The genetic code is the set of rules by which information encoded in genetic material (DNA or mRNA sequences) is translated into proteins (amino acid sequences) by living cells.

The code defines how sequences of three nucleotides, called codons, specify which amino acid will be added next during protein synthesis. With some exceptions, a three-nucleotide codon in a nucleic acid sequence specifies a single amino acid. Because the vast majority of genes are encoded with exactly the same code, this particular code is often referred to as the canonical or standard genetic code, or simply the genetic code, though in fact there are many variant codes. For example, protein synthesis in human mitochondria relies on a genetic code that differs from the standard genetic code.

Not all genetic information is stored using the genetic code. All organisms' DNA contains regulatory sequences, intergenic segments, chromosomal structural areas, and other non-coding DNA that can contribute greatly to phenotype. Those elements operate under sets of rules that are distinct from the codon-to-amino acid paradigm underlying the genetic code.

Discovery

The genetic code

After the structure of DNA was discovered by James Watson and Francis Crick, who used the experimental evidence of Maurice Wilkins and Rosalind Franklin (among others), serious efforts to understand the nature of the encoding of proteins began. George Gamow postulated that a three-letter code must be employed to encode the 20 standard amino acids used by living cells to build proteins. With four different nucleotides, a code of 2 nucleotides could only code for a maximum of 4² or 16 amino acids. A code of 3 nucleotides could code for a maximum of 4³ or 64 amino acids.

The fact that codons consist of three DNA bases was first demonstrated in the Crick, Brenner et al. experiment. The first elucidation of a codon was done by Marshall Nirenberg and Heinrich J. Matthaei in 1961 at the National Institutes of Health. They used a cell-free system to translate a poly-uracil RNA sequence (i.e., UUUUU...) and discovered that the polypeptide that they had synthesized consisted of only the amino acid phenylalanine. They thereby deduced that the codon UUU specified the amino acid phenylalanine. This was followed by experiments in the laboratory of Severo Ochoa demonstrating that the poly-adenine RNA sequence (AAAAA...) coded for the polypeptide poly-lysine and that the poly-cytosine RNA sequence (CCCCC...) coded for the polypeptide poly-proline. Therefore the codon AAA specified the amino acid lysine, and the codon CCC specified the amino acid proline. Using different copolymers most of the remaining codons were then determined. Extending this work, Nirenberg and Philip Leder revealed the triplet nature of the genetic code and allowed the codons of the standard genetic code to be deciphered. In these experiments, various combinations of mRNA were passed through a filter that contained ribosomes, the components of cells that translate RNA into protein. Unique triplets promoted the binding of specific tRNAs to the ribosome. Leder and Nirenberg were able to determine the sequences of 54 out of 64 codons in their experiments.

Transfer of information via the genetic code

The genome of an organism is inscribed in DNA, or, in the case of some viruses, RNA. The portion of the genome that codes for a protein or an RNA is called a gene. Those genes that code for proteins are composed of tri-nucleotide units called codons, each coding for a single amino acid. Each nucleotide sub-unit consists of a phosphate, a deoxyribose sugar, and one of the four nitrogenous nucleobases. The purine bases adenine (A) and guanine (G) are larger and consist of two aromatic rings. The pyrimidine bases cytosine (C) and thymine (T) are smaller and consist of only one aromatic ring. In the double-helix configuration, two strands of DNA are joined to each other by hydrogen bonds in an arrangement known as base pairing. These bonds almost always form between an adenine base on one strand and a thymine base on the other strand, or between a cytosine base on one strand and a guanine base on the other. This means that the number of A and T bases will be the same in a given double helix, as will the number of G and C bases. In RNA, thymine (T) is replaced by uracil (U), and the deoxyribose is substituted by ribose.
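The base-pairing rules above lend themselves to a short Python sketch. The strand used here is arbitrary and the function names are chosen only for this example; the point is that building the complementary strand by the A-T and G-C rules forces the A and T counts (and the G and C counts) of the duplex to match.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(strand):
    """Return the opposite strand of a DNA duplex, written 5' -> 3'."""
    return "".join(PAIRS[base] for base in reversed(strand))

def transcribe(coding_strand):
    """Write the RNA equivalent of a DNA coding strand (T replaced by U)."""
    return coding_strand.replace("T", "U")

top = "ATGGCGTTAAC"                 # arbitrary example strand
bottom = complement_strand(top)     # its base-paired partner
duplex = top + bottom
counts = {base: duplex.count(base) for base in "ATGC"}
rna = transcribe(top)
```

Inspecting `counts` shows equal numbers of A and T and equal numbers of G and C, whatever strand is used as input.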

Each protein-coding gene is transcribed into a molecule of the related polymer RNA. In prokaryotes, this RNA functions as messenger RNA or mRNA; in eukaryotes, the transcript needs to be processed to produce a mature mRNA. The mRNA is, in turn, translated on the ribosome into an amino acid chain or polypeptide. The process of translation requires transfer RNAs specific for individual amino acids with the amino acids covalently attached to them, guanosine triphosphate as an energy source, and a number of translation factors. tRNAs have anticodons complementary to the codons in mRNA and can be "charged" covalently with amino acids at their 3' terminal CCA ends. Individual tRNAs are charged with specific amino acids by enzymes known as aminoacyl tRNA synthetases, which have high specificity for both their cognate amino acids and tRNAs. The high specificity of these enzymes is a major reason why the fidelity of protein translation is maintained.

There are 4³ = 64 different codon combinations possible with a triplet codon of three nucleotides; all 64 codons are assigned to either amino acids or stop signals during translation. If, for example, an RNA sequence UUUAAACCC is considered and the reading frame starts with the first U (by convention, 5' to 3'), there are three codons, namely, UUU, AAA, and CCC, each of which specifies one amino acid. This RNA sequence will be translated into an amino acid sequence, three amino acids long. A given amino acid may be encoded by between one and six different codon sequences. A comparison may be made with computer science, where the codon is similar to a word, which is the standard "chunk" for handling data (like one amino acid of a protein), and a nucleotide is similar to a bit, in that it is the smallest unit.
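The UUUAAACCC example can be worked through in Python. The sketch below builds the standard code from a compact 64-letter string of one-letter amino acid symbols (with * for stop); the names `CODON_TABLE` and `translate` are chosen for this illustration only.

```python
from itertools import product

BASES = "UCAG"
# One-letter amino acids for the 64 codons, ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(rna):
    """Split an RNA string into codons from position 0 and translate each one."""
    codons = [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]
    return codons, "".join(CODON_TABLE[c] for c in codons)

codons, peptide = translate("UUUAAACCC")
# codons  -> ['UUU', 'AAA', 'CCC']
# peptide -> 'FKP' (Phe-Lys-Pro)
```

The table has exactly 64 entries, one per possible triplet, which matches the 4³ count in the text.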

The standard genetic code is shown in the following tables. Table 1 shows what amino acid each of the 64 codons specifies. Table 2 shows what codons specify each of the 20 standard amino acids involved in translation. These are called forward and reverse codon tables, respectively. For example, the codon AAU represents the amino acid asparagine, and UGU and UGC represent cysteine (standard three-letter designations, Asn and Cys, respectively).

RNA codon table


		2nd base
		U		C		A		G
1st base	U	UUU	(Phe/F) Phenylalanine	UCU	(Ser/S) Serine	UAU	(Tyr/Y) Tyrosine	UGU	(Cys/C) Cysteine
		UUC	(Phe/F) Phenylalanine	UCC	(Ser/S) Serine	UAC	(Tyr/Y) Tyrosine	UGC	(Cys/C) Cysteine
		UUA	(Leu/L) Leucine	UCA	(Ser/S) Serine	UAA	Stop (Ochre)	UGA	Stop (Opal)
		UUG	(Leu/L) Leucine	UCG	(Ser/S) Serine	UAG	Stop (Amber)	UGG	(Trp/W) Tryptophan
	C	CUU	(Leu/L) Leucine	CCU	(Pro/P) Proline	CAU	(His/H) Histidine	CGU	(Arg/R) Arginine
		CUC	(Leu/L) Leucine	CCC	(Pro/P) Proline	CAC	(His/H) Histidine	CGC	(Arg/R) Arginine
		CUA	(Leu/L) Leucine	CCA	(Pro/P) Proline	CAA	(Gln/Q) Glutamine	CGA	(Arg/R) Arginine
		CUG	(Leu/L) Leucine	CCG	(Pro/P) Proline	CAG	(Gln/Q) Glutamine	CGG	(Arg/R) Arginine
	A	AUU	(Ile/I) Isoleucine	ACU	(Thr/T) Threonine	AAU	(Asn/N) Asparagine	AGU	(Ser/S) Serine
		AUC	(Ile/I) Isoleucine	ACC	(Thr/T) Threonine	AAC	(Asn/N) Asparagine	AGC	(Ser/S) Serine
		AUA	(Ile/I) Isoleucine	ACA	(Thr/T) Threonine	AAA	(Lys/K) Lysine	AGA	(Arg/R) Arginine
		AUG^[A]	(Met/M) Methionine	ACG	(Thr/T) Threonine	AAG	(Lys/K) Lysine	AGG	(Arg/R) Arginine
	G	GUU	(Val/V) Valine	GCU	(Ala/A) Alanine	GAU	(Asp/D) Aspartic acid	GGU	(Gly/G) Glycine
		GUC	(Val/V) Valine	GCC	(Ala/A) Alanine	GAC	(Asp/D) Aspartic acid	GGC	(Gly/G) Glycine
		GUA	(Val/V) Valine	GCA	(Ala/A) Alanine	GAA	(Glu/E) Glutamic acid	GGA	(Gly/G) Glycine
		GUG	(Val/V) Valine	GCG	(Ala/A) Alanine	GAG	(Glu/E) Glutamic acid	GGG	(Gly/G) Glycine

^A The codon AUG both codes for methionine and serves as an initiation site: the first AUG in an mRNA's coding region is where translation into protein begins.

Inverse table
Ala/A	GCU, GCC, GCA, GCG	Leu/L	UUA, UUG, CUU, CUC, CUA, CUG
Arg/R	CGU, CGC, CGA, CGG, AGA, AGG	Lys/K	AAA, AAG
Asn/N	AAU, AAC	Met/M	AUG
Asp/D	GAU, GAC	Phe/F	UUU, UUC
Cys/C	UGU, UGC	Pro/P	CCU, CCC, CCA, CCG
Gln/Q	CAA, CAG	Ser/S	UCU, UCC, UCA, UCG, AGU, AGC
Glu/E	GAA, GAG	Thr/T	ACU, ACC, ACA, ACG
Gly/G	GGU, GGC, GGA, GGG	Trp/W	UGG
His/H	CAU, CAC	Tyr/Y	UAU, UAC
Ile/I	AUU, AUC, AUA	Val/V	GUU, GUC, GUA, GUG
START	AUG	STOP	UAA, UGA, UAG
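The relationship between the forward and inverse tables can be sketched in Python: the inverse table is simply the forward table regrouped by amino acid. The names `FORWARD`, `INVERSE` and `invert` are illustrative, and one-letter symbols are used with * standing for stop.

```python
from collections import defaultdict
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
FORWARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def invert(forward):
    """Group codons by the amino acid (or stop signal) they specify."""
    inverse = defaultdict(list)
    for codon, aa in forward.items():
        inverse[aa].append(codon)
    return dict(inverse)

INVERSE = invert(FORWARD)

# Amino acids with six codons and with a single codon
six_codon = sorted(aa for aa, codons in INVERSE.items() if len(codons) == 6)
single_codon = sorted(aa for aa, codons in INVERSE.items() if len(codons) == 1)
# six_codon    -> ['L', 'R', 'S']
# single_codon -> ['M', 'W']
```

The inverse table has 21 keys: the 20 standard amino acids plus the stop signal.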

DNA codon table

The DNA codon table is essentially identical to that for RNA, but with U replaced by T.

Salient features

Sequence reading frame

A codon is defined by the initial nucleotide from which translation starts. For example, the string GGGAAACCC, if read from the first position, contains the codons GGG, AAA, and CCC; and, if read from the second position, it contains the codons GGA and AAC; if read starting from the third position, GAA and ACC. Every sequence can, thus, be read in three reading frames, each of which will produce a different amino acid sequence (in the given example, Gly-Lys-Pro, Gly-Asn, or Glu-Thr, respectively). With double-stranded DNA, there are six possible reading frames, three in the forward orientation on one strand and three reverse on the opposite strand. The actual frame in which a protein sequence is translated is defined by a start codon, usually the first AUG codon in the mRNA sequence.
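The reading-frame example can be reproduced with a short Python sketch. It treats GGGAAACCC as DNA so that the three reverse frames can be shown as well; the function names are illustrative, and incomplete codons at the end of a frame are simply dropped.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Return the opposite DNA strand, written 5' -> 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[b] for b in reversed(dna))

def translate_frame(dna, frame):
    """Translate a DNA string starting at offset `frame` (0, 1 or 2)."""
    rna = dna.replace("T", "U")
    codons = [rna[i:i + 3] for i in range(frame, len(rna) - 2, 3)]
    return "".join(CODON_TABLE[c] for c in codons)

def six_frames(dna):
    """Three forward frames followed by three frames on the opposite strand."""
    forward = [translate_frame(dna, f) for f in range(3)]
    reverse = [translate_frame(reverse_complement(dna), f) for f in range(3)]
    return forward + reverse

frames = six_frames("GGGAAACCC")
# forward frames -> 'GKP', 'GN', 'ET' (Gly-Lys-Pro, Gly-Asn, Glu-Thr)
```

The three forward frames match the example in the text; the reverse frames come from the strand GGGTTTCCC.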

Start/stop codons

Translation starts with a chain initiation codon (start codon). Unlike stop codons, the codon alone is not sufficient to begin the process. Nearby sequences (such as the Shine-Dalgarno sequence in E. coli) and initiation factors are also required to start translation. The most common start codon is AUG, which is read as methionine or, in bacteria, as formylmethionine. Alternative start codons (depending on the organism), include "GUG" or "UUG"; these codons normally represent valine and leucine, respectively, but, as a start codon, they are translated as methionine or formylmethionine.

The three stop codons have been given names: UAG is amber, UGA is opal (sometimes also called umber), and UAA is ochre. "Amber" was named by discoverers Richard Epstein and Charles Steinberg after their friend Harris Bernstein, whose last name means "amber" in German. The other two stop codons were named "ochre" and "opal" in order to keep the "color names" theme. Stop codons are also called "termination" or "nonsense" codons. They signal release of the nascent polypeptide from the ribosome because there is no cognate tRNA that has anticodons complementary to these stop signals, and so a release factor binds to the ribosome instead.
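The role of start and stop codons can be illustrated with a simplified Python sketch that translates from the first AUG to the first in-frame stop codon. This is a deliberate simplification: as noted above, real initiation also depends on nearby sequences and initiation factors, and alternative start codons are ignored here. The function name and the example mRNA are made up for the illustration.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_from_first_aug(mrna):
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated in this model
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG or UGA: release the chain
            break
        peptide.append(amino_acid)
    return "".join(peptide)

# Made-up mRNA: two leader bases, AUG UUU AAA, then the amber codon UAG
peptide = translate_from_first_aug("GGAUGUUUAAAUAGCCC")
# peptide -> 'MFK'
```

The bases after the stop codon (CCC) are never read, and a sequence with no AUG yields no peptide at all.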

Effect of mutations

During the process of DNA replication, errors occasionally occur in the polymerization of the second strand. These errors, called mutations, can have an impact on the phenotype of an organism, especially if they occur within the protein coding sequence of a gene. Error rates are usually very low—1 error in every 10–100 million bases—due to the "proofreading" ability of DNA polymerases.

Missense mutations and nonsense mutations are examples of point mutations, which can cause genetic diseases such as sickle-cell disease and thalassemia respectively. Clinically important missense mutations generally change the properties of the coded amino acid residue between being basic, acidic, polar or non-polar, whereas nonsense mutations result in a stop codon.
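The distinction between these kinds of point mutation can be sketched in Python by comparing what a codon specifies before and after a single-base change. The function name and the "silent" and "stop-loss" labels are choices made for this example.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_point_mutation(codon, position, new_base):
    """Return (mutant codon, class) for a single-base substitution.

    `position` is 0, 1 or 2 within the codon.
    """
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        kind = "silent"       # same amino acid (or still a stop)
    elif after == "*":
        kind = "nonsense"     # amino acid codon became a stop codon
    elif before == "*":
        kind = "stop-loss"    # stop codon became an amino acid codon
    else:
        kind = "missense"     # one amino acid replaced by another
    return mutant, kind

examples = [
    classify_point_mutation("GAG", 1, "U"),  # Glu -> Val
    classify_point_mutation("CAG", 0, "U"),  # Gln -> stop
    classify_point_mutation("GAA", 2, "G"),  # Glu -> Glu
]
```

The three examples come out as missense, nonsense and silent respectively.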

Mutations that disrupt the reading frame sequence by indels (insertions or deletions) of a non-multiple of 3 nucleotide bases are known as frameshift mutations. These mutations usually result in a completely different translation from the original, and are also very likely to cause a stop codon to be read, which truncates the creation of the protein. These mutations may impair the function of the resulting protein, and are thus rare in in vivo protein-coding sequences. One reason inheritance of frameshift mutations is rare is that, if the protein being translated is essential for growth under the selective pressures the organism faces, absence of a functional protein may cause death before the organism is viable. Frameshift mutations may result in severe genetic diseases such as Tay-Sachs disease.
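The effect of a frameshift can be seen in a small Python sketch that deletes one base, and then three bases, from a made-up coding sequence. The sequence and function names are hypothetical; the point is only that a deletion that is not a multiple of three changes every downstream codon, while a three-base deletion removes one amino acid and leaves the rest intact.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_until_stop(rna):
    """Translate from position 0 until a stop codon or the end of the string."""
    peptide = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)

def delete(rna, start, length):
    """Remove `length` bases beginning at 0-based index `start`."""
    return rna[:start] + rna[start + length:]

original = "AUGGCUGAAUUUCGCUAA"          # AUG GCU GAA UUU CGC UAA (made up)
wild_type = translate_until_stop(original)                  # 'MAEFR'
frameshift = translate_until_stop(delete(original, 4, 1))   # 'MVNFA'
in_frame = translate_until_stop(delete(original, 3, 3))     # 'MEFR'
```

In the one-base deletion the original stop codon is no longer in frame, so this toy translation runs to the end of the string; in a longer sequence a new stop codon would usually be encountered soon after the shift.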

Although most mutations that change protein sequences are harmful or neutral, some mutations have a positive effect on an organism. These mutations may enable the mutant organism to withstand particular environmental stresses better than wild-type organisms, or reproduce more quickly. In these cases a mutation will tend to become more common in a population through natural selection. Viruses that use RNA as their genetic material have rapid mutation rates, which can be an advantage, since these viruses will evolve constantly and rapidly, and thus evade the defensive responses of e.g. the human immune system. In large populations of asexually reproducing organisms, for example, E. coli, multiple beneficial mutations may co-occur. This phenomenon is called clonal interference and causes competition among the mutations.

Degeneracy

Degeneracy is the redundancy of the genetic code. The genetic code has redundancy but no ambiguity. For example, although codons GAA and GAG both specify glutamic acid (redundancy), neither of them specifies any other amino acid (no ambiguity). The codons encoding one amino acid may differ in any of their three positions. For example, glutamic acid is specified by the GAA and GAG codons (difference in the third position), leucine is specified by the UUA, UUG, CUU, CUC, CUA and CUG codons (difference in the first or third position), and serine is specified by UCA, UCG, UCC, UCU, AGU and AGC (difference in the first, second, or third position).

A position of a codon is said to be a fourfold degenerate site if any nucleotide at this position specifies the same amino acid. For example, the third position of the glycine codons (GGA, GGG, GGC, GGU) is a fourfold degenerate site, because all nucleotide substitutions at this site are synonymous; i.e., they do not change the amino acid. Only the third positions of some codons may be fourfold degenerate. A position of a codon is said to be a twofold degenerate site if only two of four possible nucleotides at this position specify the same amino acid. For example, the third position of the glutamic acid codons (GAA, GAG) is a twofold degenerate site. In twofold degenerate sites, the equivalent nucleotides are always either two purines (A/G) or two pyrimidines (C/U), so only transversional substitutions (purine to pyrimidine or pyrimidine to purine) in twofold degenerate sites are nonsynonymous. A position of a codon is said to be a non-degenerate site if any mutation at this position results in amino acid substitution. There is only one threefold degenerate site where changing to three of the four nucleotides may have no effect on the amino acid (depending on what it is changed to), while changing to the fourth possible nucleotide always results in an amino acid substitution. This is the third position of an isoleucine codon: AUU, AUC, or AUA all encode isoleucine, but AUG encodes methionine. In computation this position is often treated as a twofold degenerate site.
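The classification of third-position sites can be checked with a short Python sketch that counts, for a given codon, how many of the four possible third bases keep the same amino acid. A count of 4 corresponds to a fourfold degenerate site, 2 to a twofold site, 3 to the isoleucine case and 1 to a non-degenerate site. The function name is illustrative.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def third_position_degeneracy(codon):
    """Count the third-position bases (original included) that give the same amino acid."""
    amino_acid = CODON_TABLE[codon]
    return sum(CODON_TABLE[codon[:2] + base] == amino_acid for base in BASES)

degeneracy = {codon: third_position_degeneracy(codon)
              for codon in ("GGA", "GAA", "AUU", "AUG", "UGG")}
# GGA -> 4 (glycine), GAA -> 2 (glutamic acid), AUU -> 3 (isoleucine),
# AUG -> 1 (methionine), UGG -> 1 (tryptophan)
```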

There are three amino acids encoded by six different codons: serine, leucine, and arginine. Only two amino acids are specified by a single codon. One of these is the amino-acid methionine, specified by the codon AUG, which also specifies the start of translation; the other is tryptophan, specified by the codon UGG. The degeneracy of the genetic code is what accounts for the existence of synonymous mutations.

Degeneracy results because there are more codons than encodable amino acids. For example, if there were two bases per codon, then only 16 amino acids could be coded for (4²=16). Because at least 21 codes are required (20 amino acids plus stop), and the next largest number of bases is three, then 4³ gives 64 possible codons, meaning that some degeneracy must exist.
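The counting argument above can be written out as a tiny Python sketch: find the shortest codon length whose number of possible codons is at least the number of meanings to be encoded. The function name and parameter names are chosen for this illustration.

```python
def min_codon_length(n_meanings, alphabet_size=4):
    """Smallest codon length L such that alphabet_size ** L >= n_meanings."""
    length = 1
    while alphabet_size ** length < n_meanings:
        length += 1
    return length

required = min_codon_length(21)        # 20 amino acids plus a stop signal
spare = 4 ** required - 21             # codons left over for redundancy
# required -> 3, spare -> 43
```

With 64 codons and only 21 meanings, 43 codons are necessarily redundant, which is the source of the degeneracy described above.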

These properties of the genetic code make it more fault-tolerant for point mutations. For example, in theory, fourfold degenerate codons can tolerate any point mutation at the third position, although codon usage bias restricts this in practice in many organisms; twofold degenerate codons can tolerate one out of the three possible point mutations at the third position. Since transition mutations (purine to purine or pyrimidine to pyrimidine mutations) are more likely than transversion (purine to pyrimidine or vice-versa) mutations, the equivalence of purines or that of pyrimidines at twofold degenerate sites adds a further fault-tolerance.
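The fault-tolerance argument can be made concrete by counting, over all 64 codons of the standard table, how often a third-position substitution leaves the meaning unchanged, separately for transitions and transversions. This Python sketch is only an illustration: it weights every codon equally, ignores codon usage, and counts a stop-to-stop change as synonymous.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}

def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return (a in PURINES) == (b in PURINES)

def third_position_synonymy():
    """Return {kind: [synonymous, total]} over all third-position substitutions."""
    counts = {"transition": [0, 0], "transversion": [0, 0]}
    for codon, amino_acid in CODON_TABLE.items():
        for base in BASES:
            if base == codon[2]:
                continue  # not a substitution
            kind = "transition" if is_transition(codon[2], base) else "transversion"
            counts[kind][1] += 1
            if CODON_TABLE[codon[:2] + base] == amino_acid:
                counts[kind][0] += 1
    return counts

counts = third_position_synonymy()
```

Under these assumptions, 60 of the 64 third-position transitions are synonymous, compared with 68 of the 128 transversions, which is consistent with the point made in the text.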

Despite the redundancy of the genetic code, single-point mutations can still cause dysfunctional proteins. For example, a mutated hemoglobin gene causes sickle-cell disease. In the mutant hemoglobin, a hydrophilic glutamate (Glu) is substituted by the hydrophobic valine (Val); that is, GAA or GAG becomes GUA or GUG. The substitution of glutamate by valine reduces the solubility of β-globin, which causes hemoglobin to form linear polymers linked by the hydrophobic interaction between the valine groups, causing sickle-cell deformation of erythrocytes. In general, sickle-cell disease is not caused by a de novo mutation. It is, rather, selected for in geographic regions where malaria is common (in a way similar to thalassemia), as heterozygous people have some resistance to the malarial Plasmodium parasite (heterozygote advantage).

These variable codes for amino acids are possible because of non-standard pairing at the first base of the tRNA anticodon; the base pair formed there is called a wobble base pair. Wobble pairs include those formed by the modified base inosine and the non-Watson-Crick U-G base pair.

Variations to the standard genetic code

While slight variations on the standard code had been predicted earlier, none were discovered until 1979, when researchers studying human mitochondrial genes discovered they used an alternative code. Many slight variants have been discovered since then, including various alternative mitochondrial codes, and small variants such as translation of the codon UGA as tryptophan in Mycoplasma species and translation of CUG as a serine rather than a leucine in the genus Candida. In bacteria and archaea, GUG and UUG are common start codons, but in rare cases, certain proteins may use alternative start codons not normally used by that species.

In certain proteins, non-standard amino acids are substituted for standard stop codons, depending on associated signal sequences in the messenger RNA. For example, UGA can code for selenocysteine, and UAG can code for pyrrolysine. Selenocysteine is now viewed as the 21st amino acid, and pyrrolysine is viewed as the 22nd.

Despite these differences, all known naturally-occurring codes are very similar to each other, and the coding mechanism is the same for all organisms: three-base codons, tRNA, ribosomes, reading the code in the same direction and translating the code three letters at a time into sequences of amino acids.

Expanded genetic code

Since 2001, 40 non-natural amino acids with diverse physicochemical and biological properties have been added into proteins by creating a unique codon (recoding) and a corresponding tRNA:aminoacyl-tRNA synthetase pair to encode each one. They are used as tools for exploring protein structure and function or for creating novel or enhanced proteins.

Origin

Despite the minor variations that exist, the genetic code used by all known forms of life is nearly universal. However, there is a huge number of possible genetic codes. If amino acids are randomly associated with triplet codons, there are 1.5 × 10⁸⁴ possible genetic codes.
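The text does not say how the figure of 1.5 × 10⁸⁴ is obtained. One reading that gives a number of this size, offered here as an assumption rather than as the original derivation, is to count every way of assigning the 64 codons to 21 meanings (20 amino acids plus stop) such that each meaning receives at least one codon. That count can be computed by inclusion-exclusion, as in the Python sketch below.

```python
from math import comb

def count_surjective_codes(n_codons=64, n_meanings=21):
    """Number of assignments of codons to meanings that use every meaning.

    Inclusion-exclusion over the set of meanings left unused:
    sum over k of (-1)^k * C(m, k) * (m - k)^n.
    """
    return sum((-1) ** k * comb(n_meanings, k) * (n_meanings - k) ** n_codons
               for k in range(n_meanings + 1))

total = count_surjective_codes()
print(f"{total:.2e}")   # on the order of 10**84
```

Dropping the requirement that every meaning is used gives the larger bound 21⁶⁴, so the exact figure depends on which assignments are counted as valid codes.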

Phylogenetic analysis of transfer RNA suggests that tRNA molecules evolved before the present set of aminoacyl-tRNA synthetases.

In theory, the genetic code could be completely random (a "frozen accident"), completely non-random (optimal) or a combination of random and nonrandom. There are enough data to refute the first possibility. For a start, a quick view on the table of the genetic code shows a clustering of amino acid assignments. Furthermore, amino acids that share the same biosynthetic pathway tend to have the same first base in their codons, and amino acids with similar physical properties tend to have similar codons.

There are four themes running through the many theories about the evolution of the genetic code (and hence the origin of these patterns):

Chemical principles govern specific RNA interaction with amino acids. Experiments with aptamers showed that some amino acids have a selective chemical affinity for the base triplets that code for them. Recent experiments show that of the 8 amino acids tested, 6 show some RNA triplet-amino acid association. This has been called the stereochemical code. The stereochemical code could have created an ancient core of assignments. The current complex translation mechanism involving tRNA and associated enzymes may be a later development, and maybe protein sequences were directly templated on base sequences.
Biosynthetic expansion. The standard modern genetic code grew from a simpler earlier code through a process of "biosynthetic expansion". Here the idea is that primordial life "discovered" new amino acids (for example, as by-products of metabolism) and later incorporated some of these into the machinery of genetic coding. Although much circumstantial evidence has been found to suggest that fewer different amino acids were used in the past than today, precise and detailed hypotheses about which amino acids entered the code in what order have proved far more controversial.
Natural selection has led to codon assignments of the genetic code that minimize the effects of mutations. A recent hypothesis suggests that the triplet code was derived from codes that used longer than triplet codons (such as quadruplet codons). Longer than triplet decoding would have a higher degree of codon redundancy and would be more error resistant than triplet decoding. This feature could have allowed accurate decoding in the absence of highly complex translational machinery such as the ribosome, before cells began making ribosomes.
Information channels: Information-theoretic approaches see the genetic code as an error-prone information channel. The inherent noise (that is, errors) in the channel poses a fundamental question for the organism: how to construct a genetic code that can withstand the impact of noise while accurately and efficiently translating information? These “rate-distortion” models suggest that the genetic code originated as a result of the interplay of three conflicting evolutionary forces: the needs for diverse amino acids, for error tolerance and for minimal cost of resources. The code emerges at a coding transition when the mapping of codons to amino acids becomes nonrandom. The emergence of the code is governed by the topology defined by the probable errors and is related to the map coloring problem.

Molecular cloning

Molecular cloning refers to a set of experimental methods in molecular biology that are used to assemble recombinant DNA molecules and to direct their replication within host organisms. The use of the word cloning refers to the fact that the method involves the replication of a single DNA molecule starting from a single living cell to generate a large population of cells containing identical DNA molecules. Molecular cloning generally uses DNA sequences from two different organisms: the species that is the source of the DNA to be cloned, and the species that will serve as the living host for replication of the recombinant DNA. Molecular cloning methods are central to many areas of modern biology and medicine.

In a conventional molecular cloning experiment, the DNA to be cloned is obtained from an organism of interest, then treated with enzymes in the test tube to generate smaller DNA fragments. These fragments are then combined with vector DNA to generate recombinant DNA molecules. The recombinant DNA is then introduced into a host organism. This generates a population of organisms in which recombinant DNA molecules are replicated along with the host DNA. Because they contain foreign DNA fragments, these are transgenic or genetically-modified microorganisms (GMOs). This process takes advantage of the fact that a single bacterial cell can be induced to take up and replicate a single recombinant DNA molecule. This single cell can then be expanded exponentially to generate a large number of bacteria, each of which contains copies of the original recombinant molecule. Thus, both the resulting bacterial population and the recombinant DNA molecule are commonly referred to as "clones". Strictly speaking, recombinant DNA refers to DNA molecules, while molecular cloning refers to the experimental methods used to assemble them.

History of molecular cloning

Prior to the 1970s, our understanding of genetics and molecular biology was severely hampered by an inability to isolate and study individual genes from complex organisms. This changed dramatically with the advent of molecular cloning methods. Microbiologists, seeking to understand the molecular mechanisms through which bacteria restricted the growth of bacteriophage, isolated restriction endonucleases, enzymes that could cleave DNA molecules only when specific DNA sequences were encountered. They showed that restriction enzymes cleaved chromosome-length DNA molecules at specific locations, and that specific sections of the larger molecule could be purified by size fractionation. Using a second enzyme, DNA ligase, fragments generated by restriction enzymes could be joined in new combinations, termed recombinant DNA. By recombining DNA segments of interest with vector DNA, such as bacteriophage or plasmids, which naturally replicate inside bacteria, large quantities of purified recombinant DNA molecules could be produced in bacterial cultures. The first recombinant DNA molecules were generated and studied in 1972.

Molecular cloning takes advantage of the fact that the chemical structure of DNA is fundamentally the same in all living organisms. Therefore, if any segment of DNA from any organism is inserted into a DNA segment containing the molecular sequences required for DNA replication, and the resulting recombinant DNA is introduced into the organism from which the replication sequences were obtained, then the foreign DNA will be replicated along with the host cell's DNA in the transgenic organism.

Molecular cloning is similar to polymerase chain reaction (PCR) in that it permits the replication of a specific DNA sequence. The fundamental difference between the two methods is that molecular cloning involves replication of the DNA in a living microorganism, while PCR replicates DNA in an in vitro solution, free of living cells.

Steps in molecular cloning

In standard molecular cloning experiments, the cloning of any DNA fragment essentially involves seven steps: (1) Choice of host organism and cloning vector, (2) Preparation of vector DNA, (3) Preparation of DNA to be cloned, (4) Creation of recombinant DNA, (5) Introduction of recombinant DNA into host organism, (6) Selection of organisms containing recombinant DNA, (7) Screening for clones with desired DNA inserts and biological properties.

Choice of host organism and cloning vector

Although a very large number of host organisms and molecular cloning vectors are in use, the great majority of molecular cloning experiments begin with a laboratory strain of the bacterium E. coli and a plasmid cloning vector. E. coli and plasmid vectors are in common use because they are technically sophisticated, versatile, widely available, and offer rapid growth of recombinant organisms with minimal equipment. If the DNA to be cloned is exceptionally large (hundreds of thousands to millions of base pairs), then a bacterial artificial chromosome or yeast artificial chromosome vector is often chosen.

Specialized applications may call for specialized host-vector systems. For example, if the experimentalists wish to harvest a particular protein from the recombinant organism, then an expression vector is chosen that contains appropriate signals for transcription and translation in the desired host organism. Alternatively, if replication of the DNA in different species is desired (for example transfer of DNA from bacteria to plants), then a multiple host range vector (also termed shuttle vector) may be selected. In practice, however, specialized molecular cloning experiments usually begin with cloning into a bacterial plasmid, followed by sub-cloning into a specialized vector.

Whatever combination of host and vector is used, the vector almost always contains four DNA segments that are critically important to its function and experimental utility: (1) an origin of DNA replication, which is necessary for the vector (and recombinant sequences linked to it) to replicate inside the host organism, (2) one or more unique restriction endonuclease recognition sites that serve as sites where foreign DNA may be introduced, (3) a selectable genetic marker gene that can be used to enable the survival of cells that have taken up vector sequences, and (4) an additional gene that can be used for screening which cells contain foreign DNA.

Preparation of vector DNA

The cloning vector is treated with a restriction endonuclease to cleave the DNA at the site where foreign DNA will be inserted. The restriction enzyme is chosen to generate a configuration at the cleavage site that is compatible with that at the ends of the foreign DNA. Typically, this is done by cleaving the vector DNA and foreign DNA with the same restriction enzyme, for example EcoRI. Most modern vectors contain a variety of convenient cleavage sites that are unique within the vector molecule (so that the vector can only be cleaved at a single site) and are located within a gene (frequently beta-galactosidase) whose inactivation can be used to distinguish recombinant from non-recombinant organisms at a later step in the process. To improve the ratio of recombinant to non-recombinant organisms, the cleaved vector may be treated with an enzyme (alkaline phosphatase) that dephosphorylates the vector ends, so that the vector cannot rejoin with itself and replicate within cells unless it contains foreign DNA.
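
The cleavage step can be sketched in a few lines of Python. This is only a minimal illustration, assuming that DNA is modeled as a plain string representing one strand of a linear molecule. The EcoRI recognition sequence (GAATTC, cut after the G) is well established, but the function name `digest` and the example sequence are hypothetical.

```python
from typing import List

ECORI_SITE = "GAATTC"   # EcoRI recognition sequence (top strand, 5'->3')
ECORI_CUT_OFFSET = 1    # EcoRI cuts between G and AATTC


def digest(sequence: str, site: str = ECORI_SITE,
           cut_offset: int = ECORI_CUT_OFFSET) -> List[str]:
    """Cut a linear DNA string at every occurrence of `site`.

    Only the top strand is modeled; fragments are returned in 5'->3' order.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # remainder after the last cut
    return fragments


# Hypothetical linear molecule containing two EcoRI sites.
example = "TTAGGAATTCCGCGATGAATTCAAT"
print(digest(example))
```

Every internal fragment begins with AATTC on the modeled strand, which reflects the compatible ends that allow vector and foreign DNA cut with the same enzyme to be joined.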

Preparation of DNA to be cloned

For cloning of genomic DNA, the DNA to be cloned is extracted from the organism of interest. Virtually any tissue source can be used (even tissues from extinct animals), as long as the DNA is not extensively degraded. The DNA is then purified using simple methods to remove contaminating proteins (extraction with phenol), RNA (ribonuclease) and smaller molecules (precipitation and/or chromatography). Polymerase chain reaction (PCR) methods are often used for amplification of specific DNA or RNA (RT-PCR) sequences prior to molecular cloning.

DNA for cloning experiments may also be obtained from RNA using reverse transcriptase (complementary DNA or cDNA cloning), or in the form of synthetic DNA (artificial gene synthesis). cDNA cloning is usually used to obtain clones representative of the mRNA population of the cells of interest, while synthetic DNA is used to obtain any precise sequence defined by the designer.
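
As a rough illustration of the reverse transcription step, the sketch below derives a first-strand cDNA sequence from an mRNA sequence by base pairing alone. It assumes that sequences are plain strings written 5' to 3'. The function name `first_strand_cdna` and the example mRNA are hypothetical, and the practical details of cDNA synthesis (priming, second-strand synthesis) are not modeled.

```python
# Watson-Crick pairing of RNA bases with the DNA bases of the cDNA strand.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA (5'->3') complementary to an mRNA (5'->3').

    The cDNA strand is antiparallel to the template, so the mRNA is read
    in reverse while each base is replaced by its DNA complement.
    """
    return "".join(RNA_TO_CDNA[base] for base in reversed(mrna.upper()))


# Hypothetical six-base mRNA fragment.
print(first_strand_cdna("AUGGCC"))
```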

The purified DNA is then treated with a restriction enzyme to generate fragments with ends capable of being linked to those of the vector. If necessary, short double-stranded segments of DNA containing desired restriction sites may be added to create end structures that are compatible with the vector.

Creation of recombinant DNA with DNA ligase

The creation of recombinant DNA is in many ways the simplest step of the molecular cloning process. DNA prepared from the vector and foreign source are simply mixed together at appropriate concentrations and exposed to an enzyme (DNA ligase) that covalently links the ends together. This joining reaction is often termed ligation. The resulting DNA mixture containing randomly joined ends is then ready for introduction into the host organism.

DNA ligase only recognizes and acts on the ends of linear DNA molecules, usually resulting in a complex mixture of DNA molecules with randomly joined ends. The desired products (vector DNA covalently linked to foreign DNA) will be present, but other sequences (e.g. foreign DNA linked to itself, vector DNA linked to itself and higher-order combinations of vector and foreign DNA) are also usually present. This complex mixture is sorted out in subsequent steps of the cloning process, after the DNA mixture is introduced into cells.
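
The mixture of ligation products described above can be enumerated in a small sketch. It is deliberately simplified: it assumes that all ends are compatible, considers only products made of two fragments, and ignores orientation and higher-order combinations. The function names and the labels "vector" and "insert" are hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Dict, List


def pairwise_ligation_products(fragment_types: List[str]) -> List[str]:
    """List every unordered pairing of fragment types that ligase could join."""
    pairs = combinations_with_replacement(sorted(fragment_types), 2)
    return [f"{a}+{b}" for a, b in pairs]


def classify(products: List[str], vector: str = "vector",
             insert: str = "insert") -> Dict[str, bool]:
    """Mark a product as desired only if it pairs the vector with the insert."""
    return {p: set(p.split("+")) == {vector, insert} for p in products}


products = pairwise_ligation_products(["vector", "insert"])
for product, desired in classify(products).items():
    print(product, "desired" if desired else "by-product")
```

Even in this two-fragment simplification, only one of the three possible pairings is the desired recombinant, which is why later selection and screening steps are needed.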

Introduction of recombinant DNA into host organism

The DNA mixture, previously manipulated in vitro, is moved back into a living cell, referred to as the host organism. The methods used to get DNA into cells are varied, and the name applied to this step in the molecular cloning process will often depend upon the experimental method that is chosen (e.g. transformation, transduction, transfection, electroporation).

When microorganisms are able to take up and replicate DNA from their local environment, the process is termed transformation, and cells that are in a physiological state such that they can take up DNA are said to be competent. In mammalian cell culture, the analogous process of introducing DNA into cells is commonly termed transfection. Both transformation and transfection usually require preparation of the cells through a special growth regime and chemical treatment process that will vary with the specific species and cell types that are used.

Electroporation uses high voltage electrical pulses to translocate DNA across the cell membrane (and cell wall, if present). In contrast, transduction involves the packaging of DNA into virus-derived particles, and using these virus-like particles to introduce the encapsulated DNA into the cell through a process resembling viral infection. Although electroporation and transduction are highly specialized methods, they may be the most efficient methods to move DNA into cells.

Selection of organisms containing vector sequences

Whichever method is used, the introduction of recombinant DNA into the chosen host organism is usually a low efficiency process; that is, only a small fraction of the cells will actually take up DNA. Experimental scientists deal with this issue through a step of artificial genetic selection, in which cells that have not taken up DNA are selectively killed, and only those cells that can actively replicate DNA containing the selectable marker gene encoded by the vector are able to survive.

When bacterial cells are used as host organisms, the selectable marker is usually a gene that confers resistance to an antibiotic that would otherwise kill the cells, typically ampicillin. Cells harboring the vector will survive when exposed to the antibiotic, while those that have failed to take up vector sequences will die. When mammalian cells (e.g. human or mouse cells) are used, a similar strategy is used, except that the marker gene confers resistance to the antibiotic Geneticin.
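
The logic of selection can be illustrated with a toy simulation. This is a sketch under stated assumptions: each cell is reduced to a single boolean (whether it carries the vector and therefore the resistance marker), and the uptake probability of 0.1% is a hypothetical figure chosen for illustration, not a measured efficiency.

```python
import random
from typing import List


def transform(n_cells: int, uptake_probability: float,
              rng: random.Random) -> List[bool]:
    """Simulate DNA uptake: True marks a cell that took up the vector."""
    return [rng.random() < uptake_probability for _ in range(n_cells)]


def select_on_antibiotic(cells: List[bool]) -> List[bool]:
    """Antibiotic selection: only cells carrying the resistance marker survive."""
    return [has_vector for has_vector in cells if has_vector]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
cells = transform(100_000, 0.001, rng)  # hypothetical uptake rate of 0.1%
survivors = select_on_antibiotic(cells)
print(f"{len(survivors)} of {len(cells)} cells survive selection")
```

Note that selection only confirms the presence of vector sequences; it does not show whether the vector carries foreign DNA.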

Screening for clones with desired DNA inserts and biological properties

Modern bacterial cloning vectors (e.g. pUC19 and later derivatives including the pGEM vectors) use the blue-white screening system to distinguish colonies (clones) of transgenic cells from those that contain the parental vector (i.e. vector DNA with no recombinant sequence inserted). In these vectors, foreign DNA is inserted into a sequence that encodes an essential part of beta-galactosidase, an enzyme whose activity results in formation of a blue-colored colony on the culture medium that is used for this work. Insertion of the foreign DNA into the beta-galactosidase coding sequence disables the function of the enzyme, so that colonies containing recombinant plasmids remain colorless (white). Therefore, experimentalists are easily able to identify and conduct further studies on transgenic bacterial clones, while ignoring those that do not contain recombinant DNA.
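
The decision rule behind blue-white screening can be written down as a short sketch. It assumes that every colony has already survived antibiotic selection and so carries the vector. The `Colony` class, its field names and the example colony names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Colony:
    name: str
    has_insert: bool  # True if foreign DNA interrupts the beta-galactosidase sequence


def colony_color(colony: Colony) -> str:
    """Intact beta-galactosidase gives a blue colony; an interrupted one stays white."""
    return "white" if colony.has_insert else "blue"


def pick_recombinants(colonies: List[Colony]) -> List[str]:
    """Return the names of colonies worth studying further (the white ones)."""
    return [c.name for c in colonies if colony_color(c) == "white"]


plate = [Colony("c1", False), Colony("c2", True), Colony("c3", True)]
print(pick_recombinants(plate))
```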

The total population of individual clones obtained in a molecular cloning experiment is often termed a DNA library. Libraries may be highly complex (as when cloning complete genomic DNA from an organism) or relatively simple (as when moving a previously-cloned DNA fragment into a different plasmid), but it is almost always necessary to examine a number of different clones to be sure that the desired DNA construct is obtained. This may be accomplished through a very wide range of experimental methods, including the use of nucleic acid hybridizations, antibody probes, polymerase chain reaction, restriction fragment analysis and/or DNA sequencing.
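
As one example of the checks listed above, restriction fragment analysis compares the fragment sizes obtained from a clone with those expected for the desired construct. The sketch below computes expected sizes for a circular plasmid. The plasmid lengths and cut positions are hypothetical values chosen for illustration, and `circular_fragment_sizes` is not taken from any library.

```python
from typing import List


def circular_fragment_sizes(plasmid_length: int,
                            cut_positions: List[int]) -> List[int]:
    """Fragment sizes (largest first) after cutting a circular plasmid.

    An uncut circle is reported as a single molecule of full length.
    """
    cuts = sorted(set(cut_positions))
    if not cuts:
        return [plasmid_length]
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    # The last fragment wraps around position 0 of the circle.
    sizes.append(plasmid_length - cuts[-1] + cuts[0])
    return sorted(sizes, reverse=True)


# Hypothetical 3000 bp parental vector with a single site at position 400.
print(circular_fragment_sizes(3000, [400]))
# Hypothetical recombinant: a 1000 bp insert flanked by sites at 400 and 1400.
print(circular_fragment_sizes(4000, [400, 1400]))
```

In this hypothetical case the parental vector gives one linear fragment, while the recombinant releases an additional fragment the size of the insert.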

Applications of molecular cloning

Molecular cloning provides scientists with an essentially unlimited quantity of any individual DNA segment derived from any genome. This material can be used for a wide range of purposes, including those in both basic and applied biological science. A few of the more important applications are summarized here.

Genome organization and gene expression

Molecular cloning has led directly to the elucidation of the complete DNA sequence of the genomes of a very large number of species and to an exploration of genetic diversity within individual species. Most of this work has been done by determining the DNA sequence of large numbers of randomly cloned fragments of the genome, and assembling the overlapping sequences.
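
The idea of assembling overlapping sequences can be shown with a toy greedy merge. This is a simplified sketch, not how production assemblers work: it assumes error-free reads from one strand, and the function names, the minimum overlap of 3 bases and the example reads are hypothetical.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; leave separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


# Hypothetical reads from randomly cloned, overlapping fragments.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```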

At the level of individual genes, molecular clones are used to generate probes that are used for examining how genes are expressed, and how that expression is related to other processes in biology, including the metabolic environment, extracellular signals, development, learning, senescence and cell death. Cloned genes can also provide tools to examine the biological function and importance of individual genes, by allowing investigators to inactivate the genes, or make more subtle mutations using regional mutagenesis or site-directed mutagenesis.

Production of recombinant proteins

Obtaining the molecular clone of a gene can lead to the development of organisms that produce the protein product of the cloned genes, termed a recombinant protein. In practice, it is frequently more difficult to develop an organism that produces an active form of the recombinant protein in desirable quantities than it is to clone the gene. This is because the molecular signals for gene expression are complex and variable, and because protein folding, stability and transport can be very challenging.

Many useful proteins are currently available as recombinant products. These include: (1) medically-useful proteins whose administration can correct a defective or poorly-expressed gene (e.g. recombinant factor VIII, a blood-clotting factor deficient in some forms of hemophilia, and recombinant insulin, used to treat some forms of diabetes), (2) proteins that can be administered to assist in a life-threatening emergency (e.g. tissue plasminogen activator, used to treat strokes), and (3) recombinant subunit vaccines, in which a purified protein can be used to immunize patients against infectious diseases, without exposing them to the infectious agent itself (e.g. hepatitis B vaccine).

Transgenic organisms

Once characterized and manipulated to provide signals for appropriate expression, cloned genes may be inserted into organisms, generating transgenic organisms, also termed genetically-modified organisms (GMOs). Although most GMOs are generated for purposes of basic biological research (see for example, transgenic mouse), a number of GMOs have been developed for commercial use, ranging from animals and plants that produce pharmaceuticals or other compounds (pharming) and herbicide-resistant crop plants to fluorescent tropical fish (GloFish) for home entertainment.

Gene therapy

Gene therapy involves supplying a functional gene to cells lacking that function, with the aim of correcting a genetic disorder or acquired disease. Gene therapy can be broadly divided into two categories. The first is alteration of germ cells, that is, sperm or eggs, which results in a permanent genetic change for the whole organism and subsequent generations. This “germ line gene therapy” is considered by many to be unethical in human beings. The second type of gene therapy, “somatic cell gene therapy”, is analogous to an organ transplant. In this case, one or more specific tissues are targeted by direct treatment or by removal of the tissue, addition of the therapeutic gene or genes in the laboratory, and return of the treated cells to the patient. Clinical trials of somatic cell gene therapy began in 1990, mostly for the treatment of cancers and blood, liver, and lung disorders.

Despite a great deal of publicity and promises, the history of human gene therapy has been characterized by relatively limited success. The effect of introducing a gene into cells often promotes only partial and/or transient relief from the symptoms of the disease being treated. Some gene therapy trial patients have suffered adverse consequences of the treatment itself, including deaths. In some cases, the adverse effects result from disruption of essential genes within the patient's genome by insertional inactivation. In others, viral vectors used for gene therapy have been contaminated with infectious virus. Nevertheless, gene therapy is still held to be a promising future area of medicine, and is an area where there is a significant level of research and development activity.

Monday, 19 January 2015