PLoS One. 2013;8(1):e53822.

doi: 10.1371/journal.pone.0053822. Epub 2013 Jan 18.

Efficient and comprehensive representation of uniqueness for next-generation sequencing by minimum unique length analyses

Helena Storvall¹, Daniel Ramsköld, Rickard Sandberg

PMID: 23349747
PMCID: PMC3548888
DOI: 10.1371/journal.pone.0053822

Helena Storvall et al. PLoS One. 2013.

Affiliation

¹ Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden.

Abstract

As next-generation sequencing technologies become more efficient and less expensive, RNA-Seq is becoming a widely used technique for transcriptome studies. Computational analysis of RNA-Seq data often starts with the mapping of millions of short reads back to the genome or transcriptome, a process in which some reads are found to map equally well to multiple genomic locations (multimapping reads). We have developed the Minimum Unique Length Tool (MULTo), a framework for efficient and comprehensive representation of mappability information that identifies, for each genomic coordinate, the shortest length required for a sequence starting there to become unique in the genome and transcriptome. Using the minimum unique length information, we have compared different uniqueness compensation approaches for transcript expression level quantification and demonstrate that the best compensation is achieved by discarding multimapping reads and correctly adjusting gene model lengths. We have also explored uniqueness within specific regions of the mouse genome and in enhancer mapping experiments. Finally, by making MULTo available to the community, we hope to facilitate the use of uniqueness compensation in RNA-Seq analysis and to eliminate the need to make additional mappability files.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. Schematic illustration of MULTo file generation.**
(A) We defined the minimum unique length (MUL) of a genomic coordinate as the length of the shortest oligonucleotide starting at that coordinate that is unique in the genome. To find the MUL value, FASTA files with artificial “reads” of different lengths were iteratively created from whole-chromosome FASTA files and mapped to the genome using Bowtie. When the minimum length needed for uniqueness was found, this value was stored in a binary file. In this example, position 3000091 was unique at 33 base pairs but not at 32, i.e. it has a MUL value of 33. (B) Example showing that MUL values can be retrieved from arbitrary regions in just a few lines of code.
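
The MUL definition in panel A can be illustrated on a toy sequence. The following is a minimal sketch, not the MULTo implementation: it replaces the Bowtie mapping step with brute-force exact substring counting on the forward strand only, and the function names, the `max_len` default and the use of 0 as a "no unique length found" marker are all assumptions made for illustration.

```python
def count_occurrences(genome: str, oligo: str) -> int:
    """Count exact (possibly overlapping) occurrences of oligo in genome."""
    count = 0
    start = genome.find(oligo)
    while start != -1:
        count += 1
        start = genome.find(oligo, start + 1)
    return count


def minimum_unique_lengths(genome: str, max_len: int = 255) -> list[int]:
    """Return one MUL value per 0-based start position.

    A value of 0 (an assumed sentinel) means no oligonucleotide of length
    1..max_len starting at that position occurs exactly once.
    """
    muls = []
    for pos in range(len(genome)):
        mul = 0
        longest = min(max_len, len(genome) - pos)
        # Try increasing lengths; the first unique one is the MUL.
        for k in range(1, longest + 1):
            if count_occurrences(genome, genome[pos:pos + k]) == 1:
                mul = k
                break
        muls.append(mul)
    return muls


toy_genome = "ACGTACGA"
print(minimum_unique_lengths(toy_genome))  # [4, 3, 2, 1, 4, 3, 2, 0]
```

In the toy sequence, position 0 needs four bases ("ACGT") because "A", "AC" and "ACG" all occur more than once, while the final "A" never becomes unique before the sequence ends. This brute-force approach is only practical for small inputs; the caption describes an iterative, mapper-based procedure for whole chromosomes.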

**Figure 2. Uniqueness in the transcriptome.**
(**A, B**) We calculated the proportion of unique positions for each transcript, both for single reads and paired-end fragments (mean 500 nt), and then plotted how many transcripts have a certain proportion of unique positions. The y-axis represents the proportion of all transcripts that satisfy the given condition. (A) Gene-level uniqueness of all RefSeq transcripts. (B) Transcript-level uniqueness for all transcripts from multi-isoform genes. (C) Positional plot of the uniqueness proportion across all coding transcripts. We calculated the number of reads of a specific length that pass through each position, and determined what proportion of these were unique. Since transcripts differ in length, we binned positions together so that each region (upstream, downstream, coding sequence, 5′ and 3′UTR) had the same number of bins for each transcript. The x-axis represents coordinate bins across transcripts.
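
Given per-position MUL values, the proportion of unique positions for a given read length follows from a simple comparison: a read of length L starting at a position is unique when that position's MUL is at most L. The sketch below assumes MUL values are held in a plain Python list with 0 meaning "never unique"; the function name, the sentinel and the half-open region convention are hypothetical and not taken from MULTo.

```python
def unique_fraction(muls, read_length, start=0, end=None):
    """Fraction of start positions in muls[start:end] that are unique
    for reads of the given length (0 is treated as 'never unique')."""
    end = len(muls) if end is None else end
    region = muls[start:end]
    if not region:
        return 0.0
    unique = sum(1 for m in region if 0 < m <= read_length)
    return unique / len(region)


# Toy MUL values for an 8-position sequence (illustrative numbers only).
toy_muls = [4, 3, 2, 1, 4, 3, 2, 0]
print(unique_fraction(toy_muls, read_length=2))         # 0.375
print(unique_fraction(toy_muls, read_length=4))         # 0.875
print(unique_fraction(toy_muls, read_length=3, end=4))  # 0.75
```

Because one stored value per position answers the question for every read length, the same array can be reused for any read length without building a separate mappability file each time.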

**Figure 3. Effects of uniqueness normalization on expression level.**
(A) Histogram showing how uniqueness compensation using MULTo affects the RPKM values at different read lengths. The x-axis shows the difference in gene expression between uniqueness-compensated and uncompensated expression levels. (B) RPKM values for FTH1 before and after uniqueness normalization. (C) Read coverage and uniqueness profile across FTH1 for 25 nt reads. Uniqueness density was calculated as the proportion of unique reads aligning to each genomic coordinate.

**Figure 4. Comparison of uniqueness compensation methods for RNA-Seq.**
Scatter plots showing how gene expression values (RPKM) are affected by uniqueness compensation, as a function of the proportion of unique positions in each transcript. (A) Uniqueness compensation with MULTo-corrected transcript lengths lies close to the optimal compensation line (y = 1/x). (B) ERANGE uniqueness compensation. (C) Cufflinks uniqueness compensation. (**D,E**) MA-plots between MUL and ERANGE uniqueness compensation (D) and between MUL and Cufflinks uniqueness compensation (E), showing how gene expression differences correlate with the gene expression average. Short transcripts are colored red.
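
The y = 1/x line can be reproduced with a small worked example. RPKM is reads per kilobase of transcript per million mapped reads. The sketch below shows one plausible reading of "adjusting gene model lengths": multimapping reads are discarded and the transcript length is replaced by its number of unique positions. The function names and all numbers are hypothetical, and this is not the MULTo code.

```python
import math


def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def compensated_rpkm(unique_read_count, unique_positions, total_mapped_reads):
    """RPKM using only uniquely mapped reads and the count of unique
    positions as the effective length (assumed compensation scheme)."""
    if unique_positions == 0:
        return math.nan  # expression cannot be estimated without unique positions
    return rpkm(unique_read_count, unique_positions, total_mapped_reads)


# Hypothetical transcript: 2000 nt, of which 1500 positions are unique (x = 0.75).
length_nt, unique_positions = 2000, 1500
unique_reads, total_reads = 300, 10_000_000

uncompensated = rpkm(unique_reads, length_nt, total_reads)
compensated = compensated_rpkm(unique_reads, unique_positions, total_reads)
x = unique_positions / length_nt

print(uncompensated, compensated)                    # 15.0 20.0
print(math.isclose(compensated / uncompensated, 1 / x))  # True
```

With the read count held fixed, shrinking the length from the full transcript to its unique positions scales the RPKM by exactly 1/x, which is why an ideal length-based compensation falls on the y = 1/x line.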

**Figure 5. Uniqueness profiles within genomic regions.**
The proportions of unique positions within different regions were calculated for read lengths in the range 20–255 nt. (A) Proportion of unique positions in the whole genome, within RefSeq genes, intergenic regions, known p300 binding sites, proximal promoters and CpG islands. (B) Proportion of unique positions within different parts of genes: exons, introns and UTRs. (**C,D**) Difference in the proportion of unique positions between the regular and bisulfite-converted genome. The y-axis in (C) and (D) represents the uniqueness proportion in the bisulfite genome subtracted from that in the regular genome. The vertical dashed line marks 35-nucleotide reads.

Cited by

m5C-Atlas: a comprehensive database for decoding and annotating the 5-methylcytosine (m5C) epitranscriptome.
Ma J, Song B, Wei Z, Huang D, Zhang Y, Su J, de Magalhães JP, Rigden DJ, Meng J, Chen K. Nucleic Acids Res. 2022 Jan 7;50(D1):D196-D203. doi: 10.1093/nar/gkab1075. PMID: 34986603.
Mammalian NET-seq analysis defines nascent RNA profiles and associated RNA processing genome-wide.
Nojima T, Gomes T, Carmo-Fonseca M, Proudfoot NJ. Nat Protoc. 2016 Mar;11(3):413-28. doi: 10.1038/nprot.2016.012. Epub 2016 Feb 4. PMID: 26844429.
The in vivo dynamics of antigenic variation in Trypanosoma brucei.
Mugnier MR, Cross GA, Papavasiliou FN. Science. 2015 Mar 27;347(6229):1470-3. doi: 10.1126/science.aaa4502. PMID: 25814582.
Characterizing crosstalk in epigenetic signaling to understand disease physiology.
Lempiäinen JK, Garcia BA. Biochem J. 2023 Jan 13;480(1):57-85. doi: 10.1042/BCJ20220550. PMID: 36630129.
Molecular and functional heterogeneity of IL-10-producing CD4⁺ T cells.
Brockmann L, Soukou S, Steglich B, Czarnewski P, Zhao L, Wende S, Bedke T, Ergen C, Manthey C, Agalioti T, Geffken M, Seiz O, Parigi SM, Sorini C, Geginat J, Fujio K, Jacobs T, Roesch T, Izbicki JR, Lohse AW, Flavell RA, Krebs C, Gustafsson JA, Antonson P, Roncarolo MG, Villablanca EJ, Gagliani N, Huber S. Nat Commun. 2018 Dec 21;9(1):5457. doi: 10.1038/s41467-018-07581-4. PMID: 30575716.
