CRAC: an integrated approach to the analysis of RNA-seq reads

Nicolas Philippe, Mikaël Salson, Thérèse Commes, Eric Rivals

PMID: 23537109
PMCID: PMC4053775
DOI: 10.1186/gb-2013-14-3-r30

Nicolas Philippe et al. Genome Biol. 2013 Mar 28;14(3):R30.

Abstract

A large number of RNA-sequencing studies set out to predict mutations, splice junctions or fusion RNAs. We propose a method, CRAC, that integrates genomic locations and local coverage to enable such predictions to be made directly from RNA-seq read analysis. A k-mer profiling approach detects candidate mutations, indels and splice or chimeric junctions in each single read. CRAC increases precision compared with existing tools, reaching 99.5% for splice junctions, without losing sensitivity. Importantly, CRAC predictions improve with read length. In cancer libraries, CRAC recovered 74% of validated fusion RNAs and predicted novel recurrent chimeric junctions. CRAC is available at http://crac.gforge.inria.fr.

Figures

**Figure 1**
**The CRAC algorithm**. **(a)** Illustration of a break in the location profile. We consider each k-mer of the read and locate it exactly on the genome. In all figures, located k-mers are shown in blue, and unmapped k-mers in light orange. If the read differs from the genome, for example because of an SNV or a sequencing error, then the k-mers containing this position are not located exactly on the genome. The interval of positions of unmapped k-mers is called a break. The end position of the break indicates the error or SNV position. **(b)** The support profile. The support value of a k-mer is the number of reads from the collection in which this k-mer appears at least once. The two plots show the support profile as a black curve on top of the location profile (in blue and orange). The support remains high (left plot) over the break if many reads covering this region are affected by a biological difference (for example, a mutation); it drops in the region of the break when the analyzed read is affected by a sequencing error, in which case we say the support is dropping. **(c)** Rules for differentiating a substitution, a deletion, or an insertion depending on the break. Given the location profile, one can differentiate a substitution, a deletion, or an insertion by computing the difference between the gap in the genome and the gap in the read between k-mers starting before and after the break. **(d)** False locations and mirage breaks. When false locations occur inside or at the edges of a break, they cause mirage breaks. False locations are represented in red. The break verification and break merging procedures correct for the effects of false locations in order to determine the correct break boundaries (and, for example, the correct splice junction boundaries) and to avoid detecting a false chimera (Rule 2a) instead of a deletion. SNV: single nucleotide variant.
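
The ideas in panels (a) to (c) can be illustrated with a small Python sketch. It is not CRAC's implementation: it assumes a toy genome held in a plain string, exact forward-strand matches found with `str.find`, and k-mers that occur at most once in the genome, so the false locations and mirage breaks of panel (d) are ignored. The function names (`location_profile`, `find_breaks`, `classify_break`, `support_profile`) and the toy sequences are hypothetical and chosen only for illustration.

```python
def location_profile(read, genome, k):
    """Genome start of each k-mer of the read, or None if it has no exact match."""
    profile = []
    for i in range(len(read) - k + 1):
        pos = genome.find(read[i:i + k])
        profile.append(pos if pos >= 0 else None)
    return profile


def find_breaks(profile):
    """Maximal runs of unlocated k-mers, as inclusive (first, last) k-mer indices."""
    breaks, start = [], None
    for i, pos in enumerate(profile):
        if pos is None and start is None:
            start = i
        elif pos is not None and start is not None:
            breaks.append((start, i - 1))
            start = None
    if start is not None:
        breaks.append((start, len(profile) - 1))
    return breaks


def classify_break(profile, brk):
    """Compare the gap in the genome with the gap in the read around a break."""
    first, last = brk
    if first == 0 or last == len(profile) - 1:
        return "undetermined"  # no located k-mer on one side of the break
    read_gap = (last + 1) - (first - 1)
    genome_gap = profile[last + 1] - profile[first - 1]
    if genome_gap == read_gap:
        return "substitution"
    return "deletion" if genome_gap > read_gap else "insertion"


def support_profile(read, reads, k):
    """For each k-mer of the read, the number of reads in the collection containing it."""
    return [sum(1 for r in reads if read[i:i + k] in r)
            for i in range(len(read) - k + 1)]


genome = "ACGTACCATGGATTCAGGCTAAGT"  # toy genome in which every 4-mer is unique
read = "ACCATGCATTCAGG"              # genome[4:18] with one substituted base
profile = location_profile(read, genome, k=4)
for brk in find_breaks(profile):
    print(brk, classify_break(profile, brk))  # (3, 6) substitution
```

In this sketch, a support profile that stays high across the break would point to a biological difference, while one that drops there would point to a sequencing error; choosing the threshold that separates the two cases is left open.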

**Figure 3**
**Sensitivity and precision of CRAC predictions by category for human simulated data**. **(A)** Absolute numbers of true and false positives reported by CRAC. These counts are numbers of distinct events (for example, SNVs) reported by CRAC, not numbers of reads containing the same event. False positives represent a small fraction of its output, indicating a high level of precision. **(B)** and **(C)** For each category, the figure shows the proportion of events found by CRAC for the 75 nt and 200 nt datasets. The blue bars are the true positives, while the red bars on top are the false positives. The height of a blue bar gives CRAC's sensitivity, and the relative height of the red part of the bar gives the precision. In all categories, sensitivity increases with the longer reads, while precision in each category varies only slightly. SNV: single nucleotide variant.
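
As a reminder of how the two measures in this figure relate to counts of distinct events, here is a minimal Python sketch. It assumes that predicted and true events can each be represented as a set of hashable keys. The `evaluate` function and the example events are hypothetical and are not taken from the paper's data.

```python
def evaluate(predicted, truth):
    """Sensitivity and precision over distinct events (not over reads)."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # true positives: predicted and real
    fp = len(predicted - truth)   # false positives: predicted but not real
    fn = len(truth - predicted)   # false negatives: real but missed
    sensitivity = tp / (tp + fn) if truth else 0.0
    precision = tp / (tp + fp) if predicted else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "precision": precision}


# Hypothetical SNV events, keyed by (chromosome, position, change).
truth = {("chr1", 100, "A>G"), ("chr1", 250, "C>T"),
         ("chr2", 40, "G>A"), ("chr3", 77, "T>C")}
predicted = {("chr1", 100, "A>G"), ("chr1", 250, "C>T"),
             ("chr2", 40, "G>A"), ("chr5", 9, "A>C")}
result = evaluate(predicted, truth)
print(result["sensitivity"], result["precision"])  # 0.75 0.75
```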

**Figure 4**
**Splice junction detection using human real RNA-seq: comparison and agreement**. The figure shows the detection of splice junctions by CRAC, MapSplice, TopHat, and GSNAP for a human six-tissue RNA-seq library of 75M 100 nt reads (ERR030856). **(a)** Number and percentage of known, new, and other splice junctions detected by each tool with +/−3 nt tolerance for ERR030856. **(b)** Venn diagram showing the agreement among the tools on known RefSeq splice junctions (KJs). Additional file 4 has pending data for novel junctions (NJs) and RefSeq transcripts. **(c)** A read spanning four exons (2 to 5) and three splice junctions of the human TIMM50 gene displayed by the UCSC genome browser. The included exons, numbers 3 and 4, measure 32 and 22 nt, respectively. So exon 3 has exactly the k-mer size used in this experiment. KJ: known splice junction; SJ: splice junction
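
The +/−3 nt tolerance in panel (a) can be sketched as follows. The caption does not say exactly how the tolerance is applied, so this sketch assumes that a predicted junction counts as known when both of its boundaries lie within 3 nt of the same annotated junction on the same chromosome. The function names and coordinates are hypothetical.

```python
def matches_known(junction, known_junctions, tolerance=3):
    """True if both junction boundaries are within `tolerance` nt of a known junction."""
    chrom, start, end = junction
    return any(
        chrom == k_chrom
        and abs(start - k_start) <= tolerance
        and abs(end - k_end) <= tolerance
        for k_chrom, k_start, k_end in known_junctions
    )


def split_by_annotation(predicted, known_junctions, tolerance=3):
    """Partition predicted junctions into those matching the annotation and the rest."""
    matched, unmatched = [], []
    for junction in predicted:
        if matches_known(junction, known_junctions, tolerance):
            matched.append(junction)
        else:
            unmatched.append(junction)
    return matched, unmatched


# Hypothetical junctions as (chromosome, donor position, acceptor position).
known = [("chr19", 1000, 2000), ("chr19", 3000, 4000)]
predicted = [("chr19", 1002, 1999),   # both ends within 3 nt of a known junction
             ("chr19", 3004, 4000),   # donor is 4 nt away: not matched
             ("chr1", 1000, 2000)]    # different chromosome: not matched
matched, unmatched = split_by_annotation(predicted, known)
print(len(matched), len(unmatched))  # 1 2
```

The linear scan over all known junctions keeps the sketch short. For a full annotation, sorting the known junctions per chromosome or indexing them would be the more practical choice.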
