Video: Genome Annotation and Assembly

The entire genome of an organism cannot be sequenced as one continuous sequence - even the newest generation of sequencing technologies produces fragmented data in the form of thousands of short DNA fragments, ranging from 50 to 1000 bp in length.

These short DNA sequences - called reads - need to be assembled to reconstruct the complete sequence of a genome in a process called genome assembly.

There are four main steps in any next-generation genome assembly - raw data analysis, contig assembly, scaffolding, and finally, gap closing.

The first step is to analyze the raw data acquired for quality - and then eliminate any contamination, biased data, or poor quality reads with a large number of unknown nucleotides.

Next, the clean reads are trimmed to remove the adapter sequences from their ends. Any bases at the fragment ends that do not pass the quality threshold are also trimmed.
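The filtering and trimming steps above can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a dedicated read-cleaning tool: the adapter sequence, the 10% limit on unknown bases, the Phred threshold of 20, and the Phred+33 quality encoding are all assumptions chosen for the example, and the function names are hypothetical.

```python
# Assumed adapter sequence for illustration; real library kits differ.
ADAPTER = "AGATCGGAAGAGC"

def passes_filter(seq, max_n_fraction=0.1):
    """Reject empty reads and reads with too many unknown bases (N)."""
    if not seq:
        return False
    return seq.upper().count("N") / len(seq) <= max_n_fraction

def trim_adapter(seq, qual, adapter=ADAPTER):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]

def trim_low_quality_ends(seq, qual, min_q=20, offset=33):
    """Drop bases from both ends while their Phred score is below min_q."""
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(seq)
    while start < end and scores[start] < min_q:
        start += 1
    while end > start and scores[end - 1] < min_q:
        end -= 1
    return seq[start:end], qual[start:end]

def clean_read(seq, qual):
    """Return the cleaned (seq, qual) pair, or None if the read is discarded."""
    if not passes_filter(seq):
        return None
    seq, qual = trim_adapter(seq, qual)
    seq, qual = trim_low_quality_ends(seq, qual)
    return (seq, qual) if seq else None
```

Here a read is first checked for unknown nucleotides, then cut at the adapter, and finally trimmed at both ends until a base passes the quality threshold. Only exact adapter matches are detected in this sketch.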

Then, a well-suited assembly tool is used to assemble the reads into contiguous sequences - called contigs - based on the overlapping DNA segments.

Comparative genome assembly can be used when a reference genome of a closely related organism is available to direct the reconstruction of the new genome. Here, the reads are aligned to the reference genome, and this provides a layout for further steps in the genome assembly.
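As a rough sketch of this idea, the Python below places each read at the reference position with the fewest mismatches and then takes a per-position majority vote to call the new genome. The function names, the brute-force search, and the one-mismatch tolerance are assumptions made for illustration; real aligners use indexed search and also handle insertions and deletions.

```python
from collections import Counter

def best_placement(reference, read, max_mismatches=1):
    """Return the reference position where the read fits with the fewest mismatches."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return None if best is None else best[0]

def reference_guided_consensus(reference, reads, max_mismatches=1):
    """Lay reads out on the reference and call each position by majority vote."""
    columns = [Counter() for _ in reference]
    for read in reads:
        pos = best_placement(reference, read, max_mismatches)
        if pos is None:
            continue  # read could not be placed on the reference
        for offset, base in enumerate(read):
            columns[pos + offset][base] += 1
    # Positions without any read support are reported as unknown (N).
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)
```

Note that the consensus is built from the reads, not copied from the reference, so a position where the new genome differs from its relative is still called from the new data.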

Alternatively, de novo genome assembly needs to be performed in the absence of a reference genome. Here, the overlapping reads are used to orient the sequences into longer contigs.
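One simple way to picture de novo assembly is a greedy merge: repeatedly find the two sequences with the longest suffix-to-prefix overlap and join them. The Python sketch below illustrates only that idea. It assumes error-free reads from a single strand and a minimum overlap of three bases, and its function names are hypothetical; production assemblers rely on overlap or de Bruijn graphs and must cope with errors and repeats.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair until no overlaps remain; return the contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            return contigs
        # Join the pair, writing the shared bases only once.
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
```

Reads that share no sufficient overlap are left as separate contigs, which is why a later scaffolding step is needed.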

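In the next step, paired short reads that overhang the ends of the contigs are used for scaffolding the genome. When the two reads of a pair fall on different contigs, they indicate the order, orientation, and approximate spacing of those contigs.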

The gaps between adjacent contigs are filled with Ns in case of unknown sequences. However, if long reads that are more than 1 kb in length are used to stitch contigs together, the gaps can be filled with actual sequences.
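Gap filling can be illustrated with a short Python sketch. It assumes the contigs are already ordered and oriented, and that each read pair comes from a fragment of known length, so the gap is what remains of that length after subtracting the parts of the fragment lying on each contig. The function names and the choice of the median are assumptions made for the example.

```python
from statistics import median

def estimate_gap(insert_size, pair_spans):
    """Estimate a gap from read pairs that bridge two contigs.

    Each item in pair_spans is (tail_a, head_b): the distance from the start of
    read 1 to the end of the first contig, and from the start of the second
    contig to the end of read 2.
    """
    gaps = [insert_size - tail_a - head_b for tail_a, head_b in pair_spans]
    return max(0, round(median(gaps)))

def build_scaffold(contigs, gap_sizes, fill_sequences=None):
    """Join ordered contigs; gap i lies between contigs[i] and contigs[i + 1]."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size per pair of adjacent contigs")
    fill_sequences = fill_sequences or {}
    parts = [contigs[0]]
    for i, gap in enumerate(gap_sizes):
        # Use real sequence (for example, from a long read) when available, else Ns.
        parts.append(fill_sequences.get(i, "N" * gap))
        parts.append(contigs[i + 1])
    return "".join(parts)
```

For example, `build_scaffold(["ACGT", "TTGA"], [3])` pads the unknown gap with three Ns, while passing `fill_sequences={0: "CCA"}` closes the same gap with actual sequence.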

The result is an assembled genome that then needs to be annotated with the help of automated tools - a process called genome annotation.

The two main aims of genome annotation are gene structure and gene function prediction, commonly known as structural annotation and functional annotation, respectively.

While structural annotation identifies genomic elements such as coding regions and regulatory motifs, functional annotation helps to correctly identify the biological function of these structural elements, especially protein-coding genes.
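A very simple form of structural annotation is scanning for open reading frames (ORFs): stretches that begin with a start codon and run in frame to a stop codon. The Python sketch below is only an illustration of that idea. It assumes the standard stop codons, searches the forward strand only, and ignores introns, so it is closer to a prokaryotic gene finder than to a full annotation tool; the function name and minimum length are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(genome, min_codons=3):
    """Return (start, end) half-open coordinates of forward-strand ORFs."""
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(genome) - 2, 3):
            codon = genome[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Length counts both the start and the stop codon.
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)
```

Each reported interval can be sliced directly from the genome string, for example `genome[start:end]`, to recover the candidate coding sequence.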

Genome annotation tools use available data, including known transcripts, protein or signal sequences, predicted genes from other sequenced genomes, or signatures of conserved domains, as the references for any new annotation.

Once the software aligns this available data to the draft genome, the resulting alignments need to be filtered and polished, either manually or using annotation tools, to obtain a final set of gene annotations.
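The compare-then-filter logic can be sketched as follows. This Python example scores each predicted protein against a small set of known proteins by shared 3-letter subsequences and keeps only the best hit above a cutoff. The similarity measure (Jaccard on k-mers), the 0.5 cutoff, and the function names are assumptions made for illustration, and any sequences passed in are the caller's own; real pipelines use sequence alignment and curated databases.

```python
def kmers(seq, k=3):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=3):
    """Shared k-mers divided by all k-mers seen in either sequence."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def annotate(predicted, known, min_score=0.5):
    """Transfer the label of the best-matching known protein, if it passes the filter."""
    annotations = {}
    for gene_id, protein in predicted.items():
        best_label, best_score = None, 0.0
        for label, ref in known.items():
            score = jaccard(protein, ref)
            if score > best_score:
                best_label, best_score = label, score
        # Weak matches are filtered out and left without an assigned function.
        annotations[gene_id] = best_label if best_score >= min_score else "hypothetical protein"
    return annotations
```

Genes without a convincing match are labeled "hypothetical protein" in this sketch, which mirrors how unsupported predictions are usually kept apart from those backed by evidence.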

The genome refers to all of the genetic material in an organism. It can range from a few million base pairs in microbial cells to several billion base pairs in many eukaryotic organisms. Genome assembly refers to the process of taking the DNA sequencing data and putting it all back together in a correct order to create a close representation of the original genome. This is followed by the identification of functional elements on the newly assembled genome, a process called genome annotation.

Genome assembly is a complicated process. While human genomes in a population can have variable gene copy numbers and repeated sequences that add complexity to genome assembly, the physical location of the genes remains constant. In contrast, bacterial genes are not always in the same location, and multiple copies of the same gene may appear in different locations on the genome. This adds complexity to the assembly of the bacterial genomes. Therefore, a single genome assembly from an organism cannot represent all the diversity within the population of a species.

Furthermore, the possibility of technological or algorithmic errors adds further complexity to the process of genome assembly. As a result, many published genomes are continuously updated as sequencing technologies and assembly and annotation tools advance. For example, while build 37 of the human reference genome (GRCh37) was released in 2009, a new version (build 38, GRCh38) was made available in 2013.

Additionally, the evolution of genome annotation tools in the last few decades has increased their resolution. Genome annotation tools have come a long way from just annotating long protein-coding genes and regulatory elements on genomes to annotating single nucleotides within a population.

Both genome assembly and annotation are essential tools for genome analysis that lead to precise insights into the biology of species, populations, and individuals.

From Chapter 15: Studying DNA and RNA

15.15 : Genome Annotation and Assembly


Copyright © 2025 MyJoVE Corporation. All rights reserved