Review
Evol Appl. 2014 Nov;7(9):1026-42.
doi: 10.1111/eva.12178. Epub 2014 Jun 24.

A field guide to whole-genome sequencing, assembly and annotation


Robert Ekblom et al. Evol Appl. 2014 Nov.

Abstract

Genome sequencing projects were long confined to biomedical model organisms and required the concerted effort of large consortia. Rapid progress in high-throughput sequencing technology and the simultaneous development of bioinformatic tools have democratized the field. It is now within reach for individual research groups in the eco-evolutionary and conservation community to generate de novo draft genome sequences for any organism of choice. Because of the cost and considerable effort involved in such an endeavour, the important first step is to thoroughly consider whether a genome sequence is necessary for addressing the biological question at hand. Once this decision is taken, a genome project requires careful planning with respect to the organism involved and the intended quality of the genome draft. Here, we briefly review the state of the art within this field and provide a step-by-step introduction to the workflow involved in genome sequencing, assembly and annotation with particular reference to large and complex genomes. This tutorial is targeted at scientists with a background in conservation genetics, but more generally, provides useful practical guidance for researchers engaging in whole-genome sequencing projects.

Keywords: bioinformatics; conservation genomics; genome assembly; next generation sequencing; vertebrates; whole-genome sequencing.


Figures

Figure 1
Workflow of a typical de novo whole-genome sequencing project. Black boxes with white text indicate genomic resources becoming available during the course of the project. From the top: wet-lab procedures, de novo assembly bioinformatic pipeline, postassembly analyses of additional population-wide sampling (population genomics), conservation genomic questions to address and analyses to perform (conservation genomic applications). Bullet points within the white star in the bottom part of the figure represent ultimate goals in conservation biology that can be addressed using genomic information combined with high-quality ecological data.
Figure 2
Simplified illustration of the assembly process and terminology. Shotgun sequencing: short fragments of DNA from the target organism are sequenced at random positions across the genome to a given depth of coverage. Fragments can consist of single reads (typically 50–1000 bp) or of paired-end reads of varying insert size (note that paired-end reads can even overlap). Mate-pair libraries span larger genomic regions (∼2–20 kb inserts) with reads generally facing outwards and can be complemented with fosmid-end libraries (∼40 kb inserts). Genome assembly: (A) short-read de novo assemblers extend the dispersed sequence information from the reads into continuous stretches called contigs. Contigs usually reflect the consensus sequence and do not contain any polymorphisms. (B) Paired-end reads provide additional information on whether a read is supported for a given contig. (C) Some assemblers such as ALLPATHS-LG work with overlapping read pairs that are joined into a virtual longer read prior to the assembly. Read pairs from mate-pair or fosmid-end libraries can be used to order and orient contigs into scaffolds. Gap size between contigs is estimated from the expected length of mate-pairs and marked with ‘N’s (indicated by hatched grey boxes). Long reads from single molecule sequencing provide an alternative. Annotation: gene models can be inferred in silico by prediction algorithms, by lifting over information from genomes of related organisms and by using transcriptome data (RNA-seq, expressed sequence tag) from the target organism itself. Spliced reads from RNA-seq data, as indicated at the bottom of the figure, provide valuable evidence for splice junctions and various isoforms of a gene.
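The arithmetic behind two terms used in the Figure 2 caption, depth of coverage and 'N'-padded scaffold gaps, can be illustrated with a minimal Python sketch. This is not the method of any particular assembler. The function names and the simplified gap model are assumptions made for illustration: the insert size is taken as the full distance between the outer ends of a mate pair, and the gap is whatever remains after subtracting the parts of that distance that fall on the two contigs.

```python
def depth_of_coverage(num_reads, read_length, genome_size):
    """Expected average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def estimate_gap(insert_size, span_on_left_contig, span_on_right_contig):
    """Estimate the gap between two contigs linked by one mate pair.

    Assumption (hypothetical, simplified model): insert_size is the expected
    distance between the outer ends of the pair, and each span is the part of
    that distance lying on the respective contig. The estimate is clamped to
    at least 1 so that a join is always marked by at least one 'N'.
    """
    return max(insert_size - span_on_left_contig - span_on_right_contig, 1)


def build_scaffold(contigs, gaps):
    """Join ordered, oriented contigs into one scaffold, padding gaps with 'N'."""
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        parts.append("N" * gap)
        parts.append(contig)
    return "".join(parts)


# Example with made-up numbers: 1 million reads of 100 bp on a 10 Mb genome.
coverage = depth_of_coverage(1_000_000, 100, 10_000_000)

# A 3 kb mate pair with 1200 bp on the left contig and 1500 bp on the right.
gap = estimate_gap(3000, 1200, 1500)

# Toy contigs joined with a 3-base gap.
scaffold = build_scaffold(["ACGT", "GGCC"], [3])
```

Real scaffolders combine many read pairs per contig join and account for the insert-size distribution of each library, so a single-pair estimate like this one is only meant to show where the 'N' run length comes from.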
