Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
Multicenter Study
. 2018 Sep;36(8):746-757.
doi: 10.1038/nbt.4183. Epub 2018 Jul 16.

Comprehensive multi-center assessment of small RNA-seq methods for quantitative miRNA profiling

Affiliations
Multicenter Study

Comprehensive multi-center assessment of small RNA-seq methods for quantitative miRNA profiling

Maria D Giraldez et al. Nat Biotechnol. 2018 Sep.

Abstract

RNA-seq is increasingly used for quantitative profiling of small RNAs (for example, microRNAs, piRNAs and snoRNAs) in diverse sample types, including isolated cells, tissues and cell-free biofluids. The accuracy and reproducibility of the currently used small RNA-seq library preparation methods have not been systematically tested. Here we report results obtained by a consortium of nine labs that independently sequenced reference, 'ground truth' samples of synthetic small RNAs and human plasma-derived RNA. We assessed three commercially available library preparation methods that use adapters of defined sequence and six methods using adapters with degenerate bases. Both protocol- and sequence-specific biases were identified, including biases that reduced the ability of small RNA-seq to accurately measure adenosine-to-inosine editing in microRNAs. We found that these biases were mitigated by library preparation methods that incorporate adapters with degenerate bases. MicroRNA relative quantification between samples using small RNA-seq was accurate and reproducible across laboratories and methods.

PubMed Disclaimer

Conflict of interest statement

Competing financial interest

The spouse of Louise Laurent is an employee of Illumina, Inc., the manufacturer of the TruSeq Small RNA Library Preparation Kit. Dr. Laurent and her spouse’s equity interest in Illumina, Inc. represents <<1% of the company. The other authors declare no competing financial interests.

Figures

Figure 1
Figure 1. Overview of study design
(Top) The four primary RNA pools used as common reference samples in the study are shown. Equimolar and ratiometric pools were prepared using chemically-synthesized RNA oligonucleotides to establish “ground-truth” knowledge of absolute and relative abundances, respectively. The equimolar pool consisted of 1,152 synthetic RNAs (15-90 nt) mixed at equal concentration. Downstream analyses focused on the subset of 977 RNAs 16-25 nt in length and with 5’-phosphate modifications. The two ratiometric pools, A and B, consisted of 334 synthetic RNAs, in which subsets of RNAs were varied in relative abundance between the two pools. The RNAs were divided into 15 ratiometric subgroups. The subset of 290 RNAs, 16-25 nt in length, was used for downstream analyses, and the number of RNAs in each ratiometric subgroup is indicated. The plasma RNA pool comprised RNA from 11 healthy males that was centrally isolated and distributed to the participating labs. (Middle) Nine different library preparation protocols were tested. Three commercially available kits with “invariant” adapters and six 4N Random-End protocols were tested. (Bottom) Common reference RNA pools were distributed to 9 participating labs for sequencing in quadruplicate, using a standardized common protocol (TruSeq) and at least one additional method of their choice. The breakdown of the resulting libraries is depicted in the colored grid, with the Lab IDs indicated by columns, and the replicates and pools shown in rows. Solid grey blocks indicate libraries that were not attempted. Grey blocks with a diagonal red line indicate samples where library preparation and sequencing was attempted but was unsuccessful. Sequencing was performed by each lab independently.
Figure 2
Figure 2. Equimolar pool sequencing results across multiple labs and protocols
(a) The heatmap shows expression levels for each synthetic RNA sequence (rows) across all replicate equimolar pool libraries (columns). Expression levels represent log2-scaled counts-per-million (CPM) calculated for 977 equimolar pool sequences which are 16-25 nt in length and 5’-phosphorylated. Hierarchical clustering for rows and columns represents complete linkage clustering on Euclidean distances (the default setting for the R package, pheatmap, used for plotting). Columns are labeled at the bottom to identify replicate samples. Library Size indicates the sequencing depth for each library (log2-scaled). (b) Violin plots summarize the mean CPM observed for each of the 16-25 nt, 5’-phosphorylated equimolar pool sequences (y-axis; log10-scaled), as measured from equimolar pool libraries prepared by different institutions and using different library preparation protocols (x-axis). The width of the violins is proportional to the density of data points at each position. The horizontal lines within each violin represent the 25, 50 and 75th percentiles. The dashed horizontal line shows the expected CPM for each sequence in the equimolar pool (106/977 miRNAs = 1023.5 CPM). Each violin plot and corresponding quantile lines summarize mean CPM values for n=977 distinct equimolar pool sequences. The mean CPM values were calculated from n=4 technical replicate libraries for each lab/library preparation method shown, with the exception of 4N_NEXTflex. Lab7 (n=2). (c) The percentage of equimolar pool sequences sequenced at levels 10x higher (>10,235 CPM) or 10x lower (<102.35 CPM) than expected (y-axis) are plotted for each lab. The dots and whiskers indicate the median and range of values, respectively, measured across the technical replicates for each lab. N=4 technical replicate libraries per lab/library preparation method, with the exception of 4N_NEXTflex (n=2).
Figure 3
Figure 3. Small RNA-seq accuracy and cross-protocol concordance in measuring relative expression levels between samples
(a) Boxplots show the observed ratio (y-axis; log2-scale) vs. expected ratio (x-axis) for miRNAs present in each of the SynthA and SynthB synthetic RNA subpools. Observed ratios for each miRNA were calculated as mean CPM of SynthA/mean CPM of SynthB across technical replicates for each lab + library prep method. Boxes show the median + IQR; upper/lower whiskers indicate the smallest/largest observation less than or equal to 1st/3rd quartile -/+ 1.5 * IQR; outliers are calculated as being < 1st quartile – 1.5 * IQR or > 3rd quartile + 1.5 * IQR. Mean CPM ratios were calculated from n=4 SynthA and n=4 SynthB technical replicate libraries for each lab and library preparation method shown, with the exception of TruSeq Lab8 SynthB (n=3). Those miRNAs with a mean CPM of 0 in SynthA or SynthB are not plotted. The numbers of miRNA not plotted are as follows: Truseq Labs 1,2,3,4,5,6,8: 1; Lab9: 0; CleanTag Lab5: 0; NEBNext Labs 1,3,5,9: 0; Lab4: 1; Lab2:3; 4N_NEXTflex Lab7: 1; other 4N: 0. The number of sequences represented in each boxplot is provided in Supplementary Table 10 (b) Heatmaps show the pairwise, squared Spearman rank correlation coefficients from sequencing the SynthA and SynthB pools. Pairwise correlation coefficients were calculated based on the mean CPM across technical replicates for SynthA samples (left), SynthB samples (middle) and the ratio of SynthA : SynthB (right). The mean CPM value for each ratiometric pool sequence was calculated from n=4 technical replicate libraries per lab, library preparation method and pool. Mean CPM values for n=290 ratiometric pool RNAs were used for calculating each pairwise correlation coefficient. Hierarchical clustering for rows and columns is the same for all heatmaps, and is based on the average pairwise Euclidean distances calculated from the SynthA CPM and SynthB CPM correlation matrices. Column labels indicate the lab ID and library prep method; row labels indicate only lab ID, but are presented in the same order (top to bottom) as columns (left to right).
Figure 4
Figure 4. Reproducibility of small RNA-seq within- and between-labs
(a) Violin plots summarize the technical reproducibility of quantification for all equimolar pool sequences, as calculated from each lab and library preparation method. Reproducibility measurements, percent coefficient of variation (%CV; top) and quartile coefficient of dispersion (QCD; bottom), were calculated from CPM values. Horizontal lines within each violin indicate the 25, 50 and 75th percentiles, calculated from the mean CPM values of n=977 equimolar pool RNA sequences. Mean CPM values were calculated from n=4 technical replicate libraries for each of the lab/library preparation methods shown, with the exception of TruSeq Lab1 (n=3). (b) Boxplots summarize the sequence-specific reproducibility of quantification measured in equimolar pool libraries generated by different labs using the same protocol. Percent CV (top) and QCD (bottom) values were calculated for each equimolar pool sequence across TruSeq (n=8 labs), NEBNext (n=4 labs) and 4N_B (n=4 labs) library preparation protocols. The mean CPM for each sequence across technical replicates (n=4 technical replicates per lab/library preparation method) was used to calculate the between-lab %CV and QCD plotted here. Boxplot statistics and outliers were calculated from %CV or QCD values for n=977 equimolar pool sequences. The overlaid boxes indicate the median and IQR. Whiskers represent the 1st/3rd quartile +/- 1.5 * IQR. Outliers are < 1st quartile – 1.5 * IQR or > 3rd quartile + 1.5 * IQR.
Figure 5
Figure 5. Small RNA-seq of reference plasma RNA by multiple laboratories using multiple library preparation protocols
(a) The heatmap shows CPM (log2 scale) for each sequence (rows) across plasma pool libraries (columns). Only mature miRNAs with a high confidence of detection are shown, requiring a minimum of 100 CPM in 90 percent of samples from at least one protocol (TruSeq, CleanTag, NEBNext or 4N). Hierarchical clustering for rows and columns represents complete linkage clustering on Euclidean distances. “Library Size” indicates the sum of the mature miRNA-mapped read counts prior to filtering for the individual libraries (log2-scaled). (b) Violin plots summarize the technical reproducibility of quantification for miRNAs expressed in plasma pool libraries, as calculated from each lab and library preparation method. Reproducibility measurements, percent coefficient of variation (%CV; top) and quartile coefficient of dispersion (QCD; bottom), were calculated from CPM values. Horizontal lines within each violin indicate the25, 50 and 75th percentiles, calculated from the mean CPM values of n=977 equimolar pool RNA sequences. . For TruSeq Lab1 n=3 technical replicates; for all other lab/protocols n=4. (c) Boxplots summarize the between-lab reproducibility for miRNAs expressed in the plasma pool libraries using TruSeq (n=6 labs), NEBNext (n=4 labs) and 4N_B (n=4 labs) library preparation protocols. Each dot represents %CV or QCD calculated across labs for a single miRNA. The between-lab %CV and QCD were calculated using the mean CPMs for each sequence across technical replicates for each lab. The underlying boxes show the median and IQR. Whiskers represent the 1st/3rd quartile +/- 1.5 * IQR. Outliers are < 1st quartile – 1.5 * IQR or > 3rd quartile + 1.5 * IQR. (d) Boxplots show the number of mature miRNAs detected by each protocol based on downsampling of datasets to the indicated sequencing depths (see Methods for more details). Each box summarizes number of miRNAs detected by each lab for the indicated protocol. The probability of each miRNA being detected was estimated for every sample randomly downsampled to 104, 104.5, 105, or 105.5 total read counts. A miRNA was only counted as detected if the probability of detection was at least 90%. Libraries with total counts less than the indicated sample size were excluded. Boxplots for 4N includes only in-house 4N protocols (4N_A, B, C and D). The number of libraries summarized by each boxplot is as follows: 105.5: TruSeq=19; CleanTag=4; NEBNext=12; 4N=28; 105, 104.5 and 104 : TruSeq=23; CleanTag=4; NEBNext=16; 4N=28. The underlying boxes show the median and IQR. Whiskers represent the 1st/3rd quartile +/- 1.5 * IQR. Outliers are < 1st quartile – 1.5 * IQR or > 3rd quartile + 1.5 * IQR.
Figure 6
Figure 6. Library protocol performance in measuring miRNA A-to-I editing events
(a) A schematic depicting the experimental design for the miRNA A-to-I editing experiments. (left) 10 miRNAs were synthesized with either an Adenosine or Inosine at a single position previously shown to be edited in human cells. The position, relative to the 5’ end of the mature miRNA, is indicated to the right of the respective miRNA IDs, along with the identity of the nucleotide. 277 other unedited human miRNAs were added at a fixed concentration to increase the background complexity of the pools. (middle) Six different editing subpools were generated, using a constant amount of the background (Bk) pool and varying percentages of unedited (Un; adenosine) and edited (Ed; inosine) oligos in each pool. (right) The color-coded grid depicts the library design used in the A-to-I editing experiment. Specifically, the six editing pools were sequenced by three participating labs, using three different library preparation protocols, with each lab generating libraries in triplicate. (b) The observed percent editing (y-axis) is shown for each miRNA in the six A-to-I editing pools, as measured by each of the three labs, using TruSeq, NEBNext and 4N_B protocols. The expected editing percent in each pool is both indicated to the right of each plot group and indicated by the horizontal dotted line within each plot. The library prep method is indicated to the right of each plot. The dots and whiskers represent the median and range of percent editing for each miRNA (x-axis) as measured by the three labs. Individual miRNA % editing is shown for n=3 technical replicate libraries for each lab and library preparation method.

References

    1. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–628. - PubMed
    1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. - PMC - PubMed
    1. Levin JZ, et al. Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nat Methods. 2010;7:709–715. - PMC - PubMed
    1. Jayaprakash AD, Jabado O, Brown BD, Sachidanandam R. Identification and remediation of biases in the activity of RNA ligases in small-RNA deep sequencing. Nucleic Acids Res. 2011;39:e141. - PMC - PubMed
    1. Viollet S, Fuchs RT, Munafo DB, Zhuang F, Robb GB. T4 RNA ligase 2 truncated active site mutants: improved tools for RNA analysis. BMC Biotechnol. 2011;11:72. - PMC - PubMed

Publication types