Genome Biol. 2014 Jul 26;15(7):410.
doi: 10.1186/s13059-014-0410-6.

Corset: enabling differential gene expression analysis for de novo assembled transcriptomes

Nadia M Davidson et al. Genome Biol. 2014.

Abstract

Next generation sequencing has made it possible to perform differential gene expression studies in non-model organisms. For these studies, the need for a reference genome is circumvented by performing de novo assembly on the RNA-seq data. However, transcriptome assembly produces a multitude of contigs, which must be clustered into genes prior to differential gene expression detection. Here we present Corset, a method that hierarchically clusters contigs using shared reads and expression, then summarizes read counts to clusters, ready for statistical testing. Using a range of metrics, we demonstrate that Corset out-performs alternative methods. Corset is available from https://code.google.com/p/corset-project/.

Figures

Figure 1
The pipeline for performing a count-based gene-level differential expression analysis on non-model organisms. Cleaned RNA-seq reads are first de novo assembled into contig sequences. Reads are mapped back to the transcriptome and the association between contigs and genes must be established (clustering of contigs). Then the abundance of each gene is estimated. Finally, statistical testing is performed on the count data to determine which genes are differentially expressed. Corset performs the clustering and counting (dashed box) in a single step.
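The counting half of the dashed box can be illustrated with a minimal sketch. It assumes that clustering has already produced a contig-to-cluster map, and that a read is counted once for each cluster it aligns to, no matter how many contigs of that cluster it hits. The function name, the input layout and the counting rule are illustrative assumptions, not Corset's actual implementation.

```python
from collections import Counter


def summarize_counts(read_to_contigs, contig_to_cluster):
    """Count reads per cluster, counting each read at most once per cluster.

    read_to_contigs: dict mapping a read ID to the contigs it aligns to.
    contig_to_cluster: dict mapping a contig ID to its cluster ID.
    Both layouts are hypothetical and chosen for illustration only.
    """
    counts = Counter()
    for contigs in read_to_contigs.values():
        # A set collapses multiple hits within the same cluster into one.
        clusters = {contig_to_cluster[c] for c in contigs if c in contig_to_cluster}
        for cluster in clusters:
            counts[cluster] += 1
    return dict(counts)


# Toy data (made up): read1 multi-maps to two contigs of the same cluster.
read_to_contigs = {
    "read1": ["contig1", "contig2"],
    "read2": ["contig2"],
    "read3": ["contig3"],
}
contig_to_cluster = {
    "contig1": "clusterA",
    "contig2": "clusterA",
    "contig3": "clusterB",
}
cluster_counts = summarize_counts(read_to_contigs, contig_to_cluster)
```

Running this once per sample would give a cluster-by-sample count table of the kind that count-based differential expression tools take as input.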
Figure 2
Corset uses expression information to tease apart contigs from different genes. (A) Assembled contigs from a region of the human genome containing the two genes ATP5J and GABPA. Trinity assembles 8 contigs (bottom track), which are grouped into one cluster if the contig ratio test is not applied. Including this test allows Corset to separate this region into four clusters (boxes). Notably, contigs 4 to 6 are false chimeras, caused by the overlapping UTRs of ATP5J and GABPA. These genes are differentially expressed, as shown by base-level coverage, averaged over replicates (top track). (B) When calculating distances between pairs of contigs during clustering, Corset checks whether the expression ratio is equal between conditions: here we consider the pairs contigs 2 and 3 (top) and contigs 3 and 4 (bottom). The ratio of the number of reads aligning to each contig is plotted for each sample (dots). Contig 2 and contig 3 have the same expression ratio across groups and so are clustered together, whereas contig 3 and contig 4 have different expression ratios between conditions and so are split. This feature helps Corset separate contigs that share sequence but are from different genes.
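The idea in panel (B) can be sketched as follows. The sketch computes a per-sample log2 ratio of read counts for a pair of contigs and flags the pair as "different" when the mean ratio shifts between conditions by more than a cutoff. The fixed cutoff, the pseudocount and all function names are assumptions made for illustration; this simple threshold stands in for the contig ratio test and is not Corset's statistical procedure.

```python
import math


def log_ratios(counts_a, counts_b, pseudocount=1.0):
    """Per-sample log2 ratio of read counts for two contigs.

    The pseudocount (an assumption) avoids division by zero.
    """
    return [
        math.log2((a + pseudocount) / (b + pseudocount))
        for a, b in zip(counts_a, counts_b)
    ]


def ratios_differ(counts_a, counts_b, groups, threshold=1.0):
    """Return True if the mean log2 ratio differs between conditions
    by more than `threshold` (a hypothetical cutoff)."""
    by_group = {}
    for ratio, group in zip(log_ratios(counts_a, counts_b), groups):
        by_group.setdefault(group, []).append(ratio)
    means = [sum(values) / len(values) for values in by_group.values()]
    return max(means) - min(means) > threshold


# Toy counts (made up) for four samples in two conditions.
groups = ["control", "control", "treated", "treated"]
contig2 = [100, 110, 200, 210]
contig3 = [50, 56, 101, 104]   # tracks contig2 at about half its level
contig4 = [48, 52, 400, 410]   # rises much more sharply under treatment

same_gene = not ratios_differ(contig2, contig3, groups)
split_pair = ratios_differ(contig3, contig4, groups)
```

In this toy example, contigs 2 and 3 keep a constant ratio and would stay together, while contigs 3 and 4 would be split.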
Figure 3
A comparison of the performance of different clustering approaches. For the assembler’s own clustering (Trinity or Oases), CD-HIT-EST and Corset, we show the precision against the recall. Precision is the number of true positives divided by the sum of true positives and false positives; recall is the number of true positives divided by the sum of true positives and false negatives. We show the results for six different assemblies: (A) chicken data assembled with Trinity; (B) chicken data assembled with Oases; (C) human data assembled with Trinity; (D) human data assembled with Oases; (E) yeast data assembled with Trinity; and (F) yeast data assembled with Oases. The X indicates perfect clustering.
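The precision and recall defined in this caption can be computed for a clustering as in the sketch below. It assumes that counting is done over pairs of contigs: a true positive is a pair from the same gene placed in the same cluster, a false positive is a pair from different genes placed in the same cluster, and a false negative is a pair from the same gene placed in different clusters. That pairwise convention and the names used are assumptions for illustration.

```python
from itertools import combinations


def pairwise_precision_recall(predicted, truth):
    """Precision and recall of a clustering, counted over contig pairs.

    predicted: dict mapping contig ID to predicted cluster ID.
    truth: dict mapping contig ID to true gene ID.
    Every contig in `truth` must also be present in `predicted`.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(truth), 2):
        same_cluster = predicted[a] == predicted[b]
        same_gene = truth[a] == truth[b]
        if same_cluster and same_gene:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_gene:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example (made up): three contigs from gene g1, one from gene g2.
truth = {"c1": "g1", "c2": "g1", "c3": "g1", "c4": "g2"}
predicted = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
precision, recall = pairwise_precision_recall(predicted, truth)
```

Here one pair is correctly joined, one pair is wrongly joined and two pairs are wrongly split, giving a precision of 1/2 and a recall of 1/3.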
Figure 4
The effect of clustering on differential gene expression rankings. The cumulative number of unique true positive differentially expressed clusters is shown against the number of top-ranked clusters in the de novo analysis. A unique true positive refers to counting only the first instance of a gene that appears multiple times in the ranked list. Corset performed as well as or better than CD-HIT-EST and the assembler’s own clustering in all cases: (A) chicken data assembled with Trinity; (B) chicken data assembled with Oases; (C) human data assembled with Trinity; (D) human data assembled with Oases; (E) yeast data assembled with Trinity; and (F) yeast data assembled with Oases. For comparison, we also show the results of no clustering, where the analysis was performed at the level of contigs rather than clusters.
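The "unique true positive" rule can be made concrete with a short sketch. It assumes each ranked cluster has already been assigned to a gene and that a set of truly differentially expressed genes is known; both inputs and the function name are hypothetical.

```python
def cumulative_unique_true_positives(ranked_genes, true_de_genes):
    """Cumulative count of unique true positives down a ranked list.

    ranked_genes: gene assigned to each cluster, most significant first.
    true_de_genes: set of genes known to be differentially expressed.
    Only the first occurrence of each true gene is counted.
    """
    seen = set()
    total = 0
    curve = []
    for gene in ranked_genes:
        if gene in true_de_genes and gene not in seen:
            seen.add(gene)
            total += 1
        curve.append(total)
    return curve


# Toy ranking (made up): gene g1 appears twice but is counted once.
ranked_genes = ["g1", "g2", "g1", "g3", "g4"]
true_de_genes = {"g1", "g3"}
curve = cumulative_unique_true_positives(ranked_genes, true_de_genes)
```

Plotting `curve` against rank gives a line of the kind shown in this figure; a method that splits one gene into many clusters gains nothing from the repeats.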
Figure 5
The effect of clustering on differential gene expression receiver operating characteristic (ROC) curves. The number of unique true positive differentially expressed clusters is shown against the number of unique false positive clusters in the de novo analysis. A unique positive refers to counting only the first instance of a gene that appears multiple times in the ranked list. Corset performed similarly to or better than CD-HIT-EST and the assembler’s own clustering in all cases: (A) chicken data assembled with Trinity; (B) chicken data assembled with Oases; (C) human data assembled with Trinity; (D) human data assembled with Oases; (E) yeast data assembled with Trinity; and (F) yeast data assembled with Oases. For comparison, we also show the results of no clustering, where the analysis was performed at the level of contigs rather than clusters.
Figure 6
Supplementing a de novo assembly with additional transcriptomes. Supplementing a de novo assembly with contigs from either (A) a partial annotation or (B) related species improves clustering recall of the de novo assembled contigs. We show the recall and precision, calculated for Trinity contigs. (A) We randomly sampled transcripts from the human annotation from Ensembl at 100%, 50%, 25%, 12.5% and 6% of all transcripts to emulate a partial annotation, mapped the human RNA-seq reads to each set and clustered the reads together with those mapped to the Trinity assembly using Corset. (B) We mapped human RNA-seq reads onto the Ensembl annotation for chimp, orangutan, macaque, marmoset and bushbaby, then clustered the reads together with those mapped to the Trinity assembly using Corset. 'None' in both plots indicates the Trinity assembly on its own.
