Nucleic Acids and Gene Expression Control

Structure of nucleic acids

Nucleic acids contain the information for determining the amino acid sequence, and hence the structure and function, of all the proteins of a cell. DNA contains all the information required to build the cell. The information encoded in DNA is arranged in units called genes. In the process of transcription, the information stored in DNA is copied into ribonucleic acid (RNA), which is later translated into protein.

DNA and RNA are very similar. Both are composed of monomers called nucleotides. A DNA molecule can be up to 10^9 nucleotides long, while RNA ranges from a few bases up to many thousands. DNA and RNA each consist of four different nucleotides. All nucleotides consist of an organic base linked to a five-carbon sugar that has a phosphate group attached to carbon 5. In RNA the sugar is ribose, in DNA deoxyribose.

A nucleic acid strand has an end-to-end chemical orientation: the 5’ end has a phosphate group and the 3’ end has a hydroxyl group on the 3’ carbon of the sugar. Phosphodiester bonds link nucleotides into long polynucleotides that can twist and fold into three-dimensional conformations stabilized by noncovalent bonds. A phosphodiester bond links the phosphate group to two other molecules through two ester bonds: one to the 3’ oxygen of one sugar and one to the 5’ oxygen of the next sugar.

DNA structure

DNA consists of two associated polynucleotide strands that wind together to form a double helix. The two negatively charged sugar-phosphate backbones are on the outside of the double helix and the bases project into the interior, where they are linked by hydrogen bonds. The strands are antiparallel. Watson-Crick base pairs (bp) are formed between A and T, and between G and C. This complementarity is a consequence of the size, shape and chemical composition of the bases, although other (non-Watson-Crick) pairs can also occur. Thousands of these hydrogen bonds contribute to the stabilization of DNA molecules. The helix usually makes a complete turn every 3.6 nm, with 10.5 bp per turn. This is referred to as the B form of DNA.
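The pairing and geometry rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are invented here, and the helix figures (3.6 nm and 10.5 bp per turn) are simply the B-form values quoted in the text.

```python
# Watson-Crick partners as stated above: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand):
    """Return the partner strand, also written 5'->3'.

    Because the two strands are antiparallel, the partner is read in the
    opposite direction, hence the reversal.
    """
    return "".join(PAIRS[base] for base in reversed(strand.upper()))

def helix_length_nm(n_bp, nm_per_turn=3.6, bp_per_turn=10.5):
    """Approximate length of a B-form helix containing n_bp base pairs."""
    return n_bp / bp_per_turn * nm_per_turn

print(reverse_complement("ATGCCG"))  # CGGCAT
print(helix_length_nm(105))          # 10 turns of 3.6 nm each
```

Applying reverse_complement twice returns the original strand, which is a quick sanity check on the pairing table.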

NCBI Map Viewer

The colors on the chromosome represent G-banding (Giemsa staining, which stains phosphate groups of the chromosome). The more A-T base pairs a region contains, the darker it stains.

The numbers on the left-most side refer to cytogenetic markers. The light-blue lines represent assembled contig sequences. Contig sequences are assemblies of several clones. These clones were directly sequenced, and their sequences can be inspected.

The next gray histogram depicts the number of ESTs aligned to a region.

A UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene), together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location.

The Genes Map includes known and putative genes placed as a result of alignments of mRNAs to the contigs, and gene predictions. Unlike UniGene entries, which are built from ESTs, all of these are identified genes (the difference is clearly visible when zoomed in to 1/100th of the chromosome). Colors (blue, brown, etc.) represent the quality of the alignment between the gene entry and the genomic sequence.

Orientation shows the DNA strand on which the gene is located.

The "genome" of any given individual (except for identical twins) is unique; mapping "the human genome" involves sequencing multiple variations of each gene.

Despite all the popular press articles saying that the genome was "complete", as of 2003 it is still incomplete and it clearly won't be finished for many more years. First, it is important to realize that the central regions of each chromosome, known as centromeres, are highly repetitive DNA sequences that are difficult to sequence using current technology. The centromeres are millions (possibly tens of millions) of base pairs long, and for the most part these are entirely unsequenced. Second, the ends of the chromosomes, called telomeres, are also highly repetitive, and for most of the 46 chromosome ends these too are incomplete. About 92% of the genome has been completed.

Groups from Great Britain, as well as numerous other groups from around the world, broke the genome into large pieces, approximately 150,000 base pairs in length. These pieces are called "bacterial artificial chromosomes", or BACs, because they can be inserted into bacteria where they are copied by the bacterial replication machinery. Each of these pieces was then sequenced separately as a small "shotgun" project and then assembled. The larger, 150,000 base pair chunks were then stitched together to create chromosomes. This is known as the "hierarchical shotgun" approach, because the genome is first broken into relatively large chunks, which are then mapped to chromosomes before being selected for sequencing.
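The hierarchical shotgun idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 24-base "genome", the two mapped "clones" (stand-ins for BACs), the read length, and the simple greedy overlap merge, which is not the assembler actually used by the sequencing centers. The toy genome was chosen so that every 4-base word is unique, which is what lets such a naive merge work; real genomes contain repeats that break this assumption.

```python
import random

GENOME = "ACGTTGCATGGACCTAGTCAAGCT"  # made-up toy sequence, every 4-mer unique
MIN_OVERLAP = 4

def overlap(a, b, min_len=MIN_OVERLAP):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def shotgun_reads(clone, read_len=6, step=2, seed=0):
    """Cut a clone into short overlapping reads and forget their order."""
    reads = [clone[i:i + read_len]
             for i in range(0, len(clone) - read_len + 1, step)]
    random.Random(seed).shuffle(reads)
    return reads

def greedy_assemble(reads):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for n, r in enumerate(reads) if n not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Step 1: large clones whose map positions are known (start, sequence).
clones = [(10, GENOME[10:24]), (0, GENOME[0:14])]

# Step 2: shotgun-sequence and assemble each clone on its own.
contigs = [greedy_assemble(shotgun_reads(seq))[0] for _, seq in sorted(clones)]

# Step 3: stitch the clone sequences together in map order.
assembled = contigs[0]
for contig in contigs[1:]:
    assembled += contig[overlap(assembled, contig):]

print(assembled == GENOME)
```

The point of the hierarchy is visible in step 3: because the clones were mapped before sequencing, the final stitching only has to compare neighbouring clones rather than every read against every other read.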

The Evidence Viewer (ev) displays graphically the GenBank and RefSeq cDNAs that align to the genome in a particular region, along with a density plot for ESTs. The positions of any mismatches or insertions/deletions are marked, the multiple pairwise sequence alignments are provided, and computed translations are shown.

The Sequence Viewer (sv) is the Entrez graphical display option for any nucleotide sequence, focused on the gene indicated. By default, a 2-kb section of sequence is shown below the representation of the features, but that limit can be increased at the bottom of the page. It is also possible to zoom and navigate in the display.

Sequence Download (seq) provides the same function as the Download/View Sequence link provided at the top of the Maps page. The scope of the sequence passed to the tool corresponds to what is being viewed on the page. When connected to a gene feature, the scope corresponds to that gene. The tool allows the user to alter the sequence scope and to select a report format (e.g., FASTA, GenBank, ASN.1). For the human and mouse genomes, a link is also provided to the Human-Mouse Homology Map (hm).

Model Maker (mm) displays the evidence for exons in a genomic region by diagramming the exons predicted from the alignment of cDNAs, from ab initio models (the default), and from alignment of ESTs (after an explicit selection). To facilitate construction of your own model transcript or transcripts, the splice junctions and the exons they connect are displayed, and the coding potential of any combination of exons can quickly be evaluated using ORFfinder. The sequence can also be edited, and the results can be saved or downloaded.

Molecular definition of a gene

A gene is defined as the entire nucleic acid sequence that is necessary for the synthesis of a functional gene product (polypeptide or RNA). A gene therefore includes more than the coding region: it also includes all the DNA sequences required for the synthesis of an RNA transcript. Besides the coding region, a eukaryotic gene consists of enhancers (which can lie up to 50 kb from the coding region), sequences specifying 3’ cleavage and polyadenylation (poly(A) sites), and splice sites.

A transcription unit includes the coding region, which extends from the cap site to the poly(A) site, and its associated control regions. The primary transcript is composed of exons, which will be translated into protein, and introns, which will not be translated and which contain many regulatory sequences for controlling different processes.

Transcription of protein coding genes

Transcription units are ‘copied’ from a DNA strand that serves as a ‘template’. This template determines the order in which ribonucleoside triphosphate (rNTP) monomers are polymerized to form an RNA chain that is complementary to the template. Complementary strands run in opposite directions, so if RNA is synthesized from 5’ to 3’, the DNA template is read from 3’ to 5’.

Bases in the template DNA strand base-pair with complementary incoming rNTPs, which are then joined in a polymerization reaction catalyzed by RNA polymerase II (for pre-mRNA). Polymerization involves formation of a phosphodiester bond between the 3’ oxygen of the growing RNA chain and the alpha-phosphate of the next nucleotide precursor. As a consequence, RNA molecules are always synthesized in the 5’->3’ direction.
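The directionality described above can be made concrete with a small Python sketch. The function name and the example sequence are invented for illustration; the only biology assumed beyond the text is the standard pairing of A in the template with U in RNA.

```python
# Template base -> RNA base that pairs with it (U replaces T in RNA).
RNA_PAIRS = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Build the RNA from a template strand written 3'->5'.

    The polymerase reads the template 3'->5' and adds each new nucleotide
    to the 3' end of the RNA, so the returned string is written 5'->3'.
    """
    rna = ""
    for base in template_3_to_5.upper():
        rna += RNA_PAIRS[base]  # extend the chain at its 3' end
    return rna

template = "TACGGC"              # template strand, written 3'->5'
coding = "ATGCCG"                # the other DNA strand, written 5'->3'
print(transcribe(template))      # AUGCCG
print(transcribe(template) == coding.replace("T", "U"))  # True
```

The second check shows a useful consequence: the RNA has the same sequence as the non-template DNA strand, with U in place of T.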

(1) During transcription initiation, RNA polymerase recognizes and binds to a specific site in dsDNA, called a promoter. Polymerases require various protein factors, called general transcription factors, to help them locate promoters and initiate transcription. (2) After binding to a promoter, RNA polymerase melts the DNA strands in order to make the bases in the template strand available for base pairing with the bases of the rNTPs. Approximately 14 bp are melted around the transcription initiation site, which is located on the template strand within the promoter region. (3) Transcription initiation finishes when the first two ribonucleotides are linked by a phosphodiester bond.

(4) After several nucleotides have been polymerized, RNA polymerase dissociates from the promoter DNA and the general TFs. During the elongation stage, RNA polymerase moves along the template DNA one base at a time, opening the dsDNA in front of it and rehybridizing the strands behind it. One nucleotide at a time is added to the 3’ end of the nascent RNA chain by the polymerase. The 14 bp melted region is called the transcription bubble.

(5) During termination, the completed RNA molecule, or primary transcript, is released from the RNA polymerase and the polymerase dissociates from the template DNA. Once released, an RNA polymerase is free to transcribe a new gene.

Regulation of transcription

Environmental changes induce changes in gene expression. Regulation of transcription initiation is the most common form of gene control in eukaryotes.

The DNA of a human cell, fully extended, is about 2 meters long, and it has to be condensed into a cell nucleus whose diameter is ~10 µm. This packing is highly organized. Transcriptionally active areas of a chromosome are less condensed than inactive areas. The level of condensation determines whether a region can be accessed by transcription factors and polymerases.
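A quick back-of-the-envelope calculation shows the scale of this packing problem. The sketch below uses only the figures quoted in these notes (2 m of DNA, a ~10 µm nucleus, and the B-form values of 3.6 nm and 10.5 bp per turn); the variable names are arbitrary and the results are rough orders of magnitude.

```python
dna_length_m = 2.0            # extended DNA length quoted above
nucleus_diameter_m = 10e-6    # ~10 micrometres

# How many times longer the DNA is than the nucleus is wide.
length_ratio = dna_length_m / nucleus_diameter_m

# Number of base pairs in 2 m of B-form DNA (3.6 nm per 10.5 bp turn).
nm_per_bp = 3.6 / 10.5
base_pairs = dna_length_m * 1e9 / nm_per_bp

print(f"DNA is about {length_ratio:.0f} times longer than the nucleus is wide")
print(f"2 m of B-form DNA is about {base_pairs:.1e} bp")
```

The ratio comes out at roughly 200,000 to 1, which is why several levels of ordered compaction are needed.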

The most abundant proteins associated with DNA are histones, a family of basic proteins present in all nuclei. Histones are rich in positively charged basic amino acids, which interact with the phosphate groups of DNA. Nucleosomes are composed of DNA wrapped around a histone octamer, together with an H1 histone, and are the primary structural unit of chromatin. Nucleosomes are packed into a solenoid arrangement, with six nucleosomes per turn. Loops of solenoid chromatin are attached to a flexible protein chromosome scaffold, which in turn is compacted in a solenoid form.

Inactive genes are assembled into condensed chromatin, which inhibits the binding of RNA polymerases and the general transcription factors required for transcription initiation. Activator proteins bind to control elements near the transcription start site, as well as kilobases away, and promote chromatin decondensation and binding of RNA polymerase to the promoter. Repressor proteins bind to alternative control elements, causing condensation of chromatin.

A DNA sequence that specifies where RNA polymerase binds and initiates transcription of a gene is called a promoter. Transcription from a particular promoter is controlled by DNA-binding proteins, termed transcription factors. TFs regulating expression can bind at regulatory sites tens of thousands of base pairs upstream or downstream from the promoter, and they induce the positioning of RNA polymerase at the start site. As a result, transcription from a single promoter may be regulated by binding of multiple TFs to alternative control elements, permitting complex control of gene expression. There are about 2000 different TFs in the human organism.

In many transcribed genes there is a conserved sequence, the TATA box, located 25-35 bp upstream of the start site. A mutation in this sequence drastically reduces transcription of these genes. Mutating the sequence upstream of the start site and checking the expression level of the gene can reveal which sequences are necessary for gene activation. These are found both proximal and distal to the initiation site.
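As a rough illustration of how one might look for a TATA box in the window described above, the following Python sketch scans 25-35 bp upstream of a given start site. The sequence is invented, the bare motif "TATA" is a simplification of the real consensus, and the function name and parameters are hypothetical.

```python
def find_tata(seq, tss, motif="TATA", min_up=25, max_up=35):
    """Return the upstream distances (in bp) at which motif starts.

    seq is the DNA written 5'->3', and tss is the 0-based index of the
    transcription start site within seq.
    """
    hits = []
    for dist in range(max_up, min_up - 1, -1):
        start = tss - dist
        if start >= 0 and seq[start:start + len(motif)] == motif:
            hits.append(dist)
    return hits

# Made-up promoter: 50 bp upstream region followed by the transcribed part.
upstream = "G" * 20 + "TATAAA" + "C" * 24
seq = upstream + "ATGGCC"
tss = len(upstream)  # index 50

print(find_tata(seq, tss))                      # [30]

# Mimic a mutagenesis experiment: destroy the motif and scan again.
mutant = seq.replace("TATAAA", "GCGCGC")
print(find_tata(mutant, tss))                   # []
```

A real analysis would score matches against a position weight matrix rather than require an exact string, but the windowed scan is the same idea.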

Activators and repressors that bind to specific sites in DNA regulate expression by two general mechanisms. First, they act in concert with other proteins to modulate chromatin structure, thereby influencing the ability of general TFs to bind to promoters. DNA is wrapped around histones, forming nucleosomes. Acetylation of histones influences the relative condensation of chromatin: acetylation of lysine neutralizes the positive charge normally present, reducing the affinity between the histone and the (negatively charged) DNA, which renders the DNA more accessible to transcription factors. Second, activators and repressors interact with a large multiprotein complex called the mediator of transcription complex, or simply mediator. This complex binds to RNA polymerase and directly regulates assembly of the transcription preinitiation complex, which is formed by the polymerase and many other proteins and is necessary for transcription initiation. The mediator also has acetylase activity.

A transcription factor is composed of separable functional domains: a DNA-binding domain, which binds to a specific DNA sequence, and an activation domain, which interacts with other proteins to stimulate transcription from a nearby promoter. Many TFs form heterodimers and interact with DNA in this form. The resulting combinatorial possibilities increase the number of potential DNA sequences that a family of TFs can bind. For example, three factor monomers could combine to form six dimeric factors (three homodimers and three heterodimers). Some inhibitory factors bind to TFs, blocking their binding to DNA. As a result of these kinds of combinations, the 2000 human TFs can differentially regulate the tens of thousands of human genes.
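The count of six dimers from three monomers can be checked by enumeration. In this sketch the monomer names A, B and C are placeholders, and a dimer is treated as an unordered pair in which a monomer may pair with itself.

```python
from itertools import combinations_with_replacement

monomers = ["A", "B", "C"]  # placeholder names for three factor monomers

# Unordered pairs, allowing a monomer to pair with itself (homodimers).
dimers = list(combinations_with_replacement(monomers, 2))
homodimers = [d for d in dimers if d[0] == d[1]]
heterodimers = [d for d in dimers if d[0] != d[1]]

print(dimers)
print(len(homodimers), "homodimers +", len(heterodimers), "heterodimers")

def n_dimers(n):
    """Number of distinct dimers that n monomers can form: n(n+1)/2."""
    return n * (n + 1) // 2

print(n_dimers(3))  # 6
```

Because the count grows as n(n+1)/2, even a modest family of monomers yields many distinct dimers, which is the combinatorial effect the paragraph describes.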