PLoS One. 2013;8(1):e53822.
doi: 10.1371/journal.pone.0053822. Epub 2013 Jan 18.

Efficient and comprehensive representation of uniqueness for next-generation sequencing by minimum unique length analyses

Helena Storvall et al. PLoS One. 2013.

Abstract

As next-generation sequencing technologies become more efficient and less expensive, RNA-Seq is becoming a widely used technique for transcriptome studies. Computational analysis of RNA-Seq data often starts with the mapping of millions of short reads back to the genome or transcriptome, a process in which some reads are found to map equally well to multiple genomic locations (multimapping reads). We have developed the Minimum Unique Length Tool (MULTo), a framework for efficient and comprehensive representation of mappability information, through identification of the shortest length required for each genomic coordinate to become unique in the genome and transcriptome. Using the minimum unique length information, we have compared different uniqueness compensation approaches for transcript expression level quantification and demonstrate that the best compensation is achieved by discarding multimapping reads and correctly adjusting gene model lengths. We have also explored uniqueness within specific regions of the mouse genome and in enhancer mapping experiments. Finally, by making MULTo available to the community, we hope to facilitate the use of uniqueness compensation in RNA-Seq analysis and to eliminate the need to make additional mappability files.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Schematic illustration of MULTo file generation.
(A) We defined the minimum unique length (MUL) of a genomic coordinate as the length of the shortest oligonucleotide starting at that coordinate that is needed for it to be unique. To find the MUL value, FASTA files with artificial “reads” of different lengths were iteratively created from whole-chromosome FASTA files and mapped to the genome using Bowtie. When the minimum length needed for uniqueness was found, this value was stored in a binary file. In this example, position 3000091 was unique at 33 base pairs but not at 32, i.e., it has a MUL value of 33. (B) An example showing that MUL values can be retrieved from arbitrary regions in just a few lines of code.
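The MUL definition in panel (A) can be illustrated on a toy sequence. The sketch below is not the MULTo pipeline: it replaces the Bowtie mapping step with exact substring counting on a single strand, ignores mismatches and the reverse complement, and uses hypothetical function names. It relies only on the fact that once an oligonucleotide is unique, every longer one starting at the same coordinate is unique too, so the first unique length found is the minimum.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def minimum_unique_lengths(genome, min_len=1, max_len=255):
    """Return the MUL for every start coordinate of a toy genome string.

    A coordinate gets None when no oligonucleotide of length <= max_len
    starting there is unique (for example, near the end of the sequence).
    """
    n = len(genome)
    muls = [None] * n
    for start in range(n):
        longest = min(max_len, n - start)
        for length in range(min_len, longest + 1):
            oligo = genome[start:start + length]
            if count_occurrences(genome, oligo) == 1:
                muls[start] = length  # first unique length is the minimum
                break
    return muls


if __name__ == "__main__":
    print(minimum_unique_lengths("ACGTACGA"))
    # [4, 3, 2, 1, 4, 3, 2, None]
```

This brute-force version is quadratic or worse in the sequence length and is only meant to make the definition concrete; a genome-scale computation needs an indexed aligner, as in the caption.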
Figure 2. Uniqueness in the transcriptome.
(A, B) We calculated the proportion of unique positions for each transcript, both for single reads and paired-end fragments (mean 500 nt), and then plotted how many transcripts have a certain proportion of unique positions. The y-axis represents the proportion of all transcripts that satisfy the given condition. (A) Gene-level uniqueness of all RefSeq transcripts. (B) Transcript-level uniqueness for all transcripts from multi-isoform genes. (C) Positional plot of the uniqueness proportion across all coding transcripts. We calculated the number of reads of a specific length that pass through each position, and determined what proportion of these were unique. Since transcripts differ in length, we binned positions together so that each region (upstream, downstream, coding sequence, 5′ and 3′ UTR) had the same number of bins for each transcript. The x-axis represents coordinate bins across transcripts.
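The per-position calculation described for panel (C) can be sketched from MUL values alone, under the assumption that a read of length L starting at a coordinate is unique exactly when that coordinate's MUL is at most L. The function name and the list-of-MULs input (with None for coordinates that never become unique) are illustrative choices, not part of MULTo, and the binning across transcript regions is left out.

```python
def uniqueness_density(muls, read_len):
    """For each position, the fraction of reads of length read_len
    overlapping that position whose start coordinate is unique."""
    n = len(muls)
    last_start = n - read_len  # last coordinate where a full read fits
    if read_len < 1 or last_start < 0:
        raise ValueError("read_len must be between 1 and len(muls)")

    # prefix[i] = number of unique read starts among coordinates 0..i-1
    prefix = [0]
    for m in muls[:last_start + 1]:
        is_unique = m is not None and m <= read_len
        prefix.append(prefix[-1] + (1 if is_unique else 0))

    density = []
    for pos in range(n):
        lo = max(0, pos - read_len + 1)   # first read start covering pos
        hi = min(pos, last_start)         # last read start covering pos
        covering = hi - lo + 1
        unique = prefix[hi + 1] - prefix[lo]
        density.append(unique / covering)
    return density
```

Prefix sums keep the cost linear in the number of positions, which matters when the same profile is computed for many read lengths.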
Figure 3. Effects of uniqueness normalization on expression level.
(A) Histogram showing how uniqueness compensation using MULTo affects the RPKM values at different read lengths. The x-axis shows the difference between uniqueness-compensated and uncompensated expression levels. (B) RPKM values for FTH1 before and after uniqueness normalization. (C) Read coverage and uniqueness profile across FTH1 for 25 nt reads. Uniqueness density was calculated as the proportion of unique reads aligning to each genomic coordinate.
Figure 4. Comparison of uniqueness compensation methods for RNA-Seq.
Scatter plots showing how gene expression values (RPKM) are affected by uniqueness compensation for transcripts as a function of increasing proportion of unique positions. (A) Uniqueness compensation with MULTo-corrected transcript lengths is close to the optimal compensation line (y = 1/x). (B) ERANGE uniqueness compensation. (C) Cufflinks uniqueness compensation. (D, E) MA-plots between MUL and ERANGE uniqueness compensation (D) and between MUL and Cufflinks uniqueness compensation (E), showing how gene expression differences correlate with the gene expression average. Short transcripts were colored red.
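The y = 1/x line in panel (A) follows from simple arithmetic if one assumes that the plotted quantity is the ratio of compensated to uncompensated RPKM and that only uniquely mapped reads are counted. The sketch below uses the standard RPKM formula and, as an illustrative assumption, takes the corrected transcript length to be the number of positions whose MUL is at most the read length; the function names are hypothetical and edge effects at transcript ends are ignored.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def unique_length(transcript_muls, read_len):
    """Number of positions at which a read of read_len maps uniquely."""
    return sum(1 for m in transcript_muls if m is not None and m <= read_len)


def compensated_rpkm(unique_reads, transcript_muls, read_len, total_mapped_reads):
    """RPKM from unique reads only, with the length replaced by the number
    of unique positions. Returns None if no position is unique."""
    effective = unique_length(transcript_muls, read_len)
    if effective == 0:
        return None
    return rpkm(unique_reads, effective, total_mapped_reads)


if __name__ == "__main__":
    # Toy transcript: half of its 100 positions are unique for 25 nt reads.
    muls = [10] * 50 + [40] * 50
    plain = rpkm(200, len(muls), 1_000_000)
    fixed = compensated_rpkm(200, muls, 25, 1_000_000)
    x = unique_length(muls, 25) / len(muls)  # proportion of unique positions
    print(x, fixed / plain)  # the ratio equals 1 / x
```

With a proportion x of unique positions, the corrected length is x times the full length, so the compensated value is exactly 1/x times the uncompensated one; deviations from that line in real data reflect how a given method departs from this simple length correction.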
Figure 5. Uniqueness profiles within genomic regions.
The proportions of unique positions within different regions were calculated for read lengths in the range 20–255 nt. (A) Proportion of unique positions in the whole genome, within RefSeq genes, intergenic regions, known p300 binding sites, proximal promoters and CpG islands. (B) Proportion of unique positions within different parts of genes: exons, introns and UTRs. (C, D) Difference in the proportion of unique positions between the regular and bisulfite-converted genome. The y-axis in (C) and (D) represents the uniqueness proportion in the bisulfite genome subtracted from that in the regular genome. The vertical dashed line marks 35 nucleotide reads.
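A profile of this kind can be derived from the MUL values of a region in one pass, since a position counts as unique at read length L whenever its MUL is at most L. The sketch below is an assumed reconstruction of that bookkeeping, not code from MULTo: the function name is hypothetical, the region is passed as a plain list of MULs, and None marks positions that never become unique within the tested range.

```python
from bisect import bisect_right


def uniqueness_profile(region_muls, read_lengths):
    """Map each read length to the proportion of positions in the region
    that are unique at that length."""
    if not region_muls:
        raise ValueError("region must contain at least one position")
    known = sorted(m for m in region_muls if m is not None)
    total = len(region_muls)
    # bisect_right counts how many MUL values are <= the read length
    return {length: bisect_right(known, length) / total
            for length in read_lengths}


if __name__ == "__main__":
    toy_region = [4, 3, 2, 1, 4, 3, 2, None]
    print(uniqueness_profile(toy_region, [1, 2, 3, 4]))
    # {1: 0.125, 2: 0.375, 3: 0.625, 4: 0.875}
```

Sorting once and using a binary search per read length keeps the cost low when the full 20–255 nt range is scanned; the resulting curve is non-decreasing in read length by construction.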
