Nucleic Acids Res. 2018 Jun 1;46(10):5125-5138. doi: 10.1093/nar/gky325.

The fractured landscape of RNA-seq alignment: the default in our STARs


Sara Ballouz et al. Nucleic Acids Res.

Abstract

Many tools are available for RNA-seq alignment and expression quantification, but their comparative value is hard to establish. Benchmarking assessments often highlight methods' good performance, but either focus on model data or fail to explain variation in performance. This leaves us to ask: what is the most meaningful way to assess different alignment choices? And, importantly, where is there room for progress? In this work, we explore the answers to these two questions by performing an exhaustive assessment of the STAR aligner. We assess STAR's performance across a range of alignment parameters using common metrics, and then on biologically focused tasks. We find technical metrics such as fraction mapping or expression profile correlation to be uninformative, capturing properties unlikely to have any role in biological discovery. Surprisingly, we find that changes in alignment parameters within a wide range have little impact on either technical or biological performance. Yet, when performance finally does break, it happens in difficult regions, such as X-Y paralogs and MHC genes. We believe improved reporting by developers will help establish where results are likely to be robust or fragile, providing a better baseline to establish where methodological progress can still occur.


Figures

Figure 1.
Summarizing RNA-seq alignment tools. (A) RNA-seq alignment: a typical experiment and its sources of error. (B) Transcriptomics asks three broad questions: transcript detection, differential gene expression and gene co-expression. (C) Growth and performance of RNA-seq tools: the cumulative number of tools shows steady growth across all tool types.
Figure 2.
Summary of previous studies of RNA-seq alignment evaluations. (A) Fraction of reads aligned across different tools, datasets (real and simulated) and parameters, ordered by number of times assessed. (B) SDs of the fraction mapped, comparing averaging by dataset with averaging by method. (C) Correlations of the relative expression levels of the output of different quantifiers in a subset of 5 benchmarks. (D) Summary of databases and pipelines used in the global assessment; overall, 57 experiments and 3405 samples were used. (E) SDs of the fraction of reads mapped within each database and across the three databases (per sample), and (F) correlations of expression levels between the different databases (SDs per experiment).
Figure 3.
Comparative performances across metrics. (A) Mapping rates for 462 samples from the GEUVADIS dataset varying a single parameter, minAS (--outFilterScoreMinOverLread); the default is 0.66, the most permissive parameter we tested was 0.55, and the most stringent was 0.99. (B) Spearman correlations between a sample and its default counts across the same parameter space. (C) Number of protein-coding genes detected for each alignment parameter. (D) In parallel to (A), the distribution of the fraction-mapped metric for 3405 samples in the Gemma, ARCHS4 and recount2 expression databases. (E) In parallel to (B), the distributions of the correlations between genes for each sample (pairwise across databases) and then (F) averaged per experiment. recount2 has considerably higher mapping rates, but its samples correlate less with the other two databases, as shown in (E) per sample and (F) averaged per experiment.
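The per-sample comparison in panel (B) is a Spearman correlation between a parameter run's counts and the default run's counts. As a hedged illustration (not the paper's actual pipeline), a rank-based correlation can be sketched with the standard library alone; the count vectors below are hypothetical:

```python
def average_ranks(values):
    """Assign 1-based ranks, averaging ranks over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical gene counts for one sample: default vs. a stringent minAS run
default_counts = [120, 35, 0, 980, 46, 3]
stringent_counts = [110, 30, 0, 950, 40, 1]
rho = spearman(default_counts, stringent_counts)
```

In practice a library routine such as `scipy.stats.spearmanr` would be used; the point of the sketch is that the metric depends only on rank order, which is why it barely moves across the parameter range.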
Figure 4.
Differential expression by sex in the GEUVADIS dataset. (A) Overlap of all DE genes (FDR < 0.05 and |log2 FC| > 2) between the default minAS = 0.66 and the extreme minAS = 0.99, and (B) between each parameter and the default. (C) ROCs for each set using the average rank of the fold change and the adjusted P-value (see Materials and Methods). (D) AUROCs across the minimum alignment parameters for the two gene sets tested, for both negative and positive controls, averaged across 10 runs of the analysis, each sampling 200 males and 200 females from the totals.
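The AUROCs in panel (D) score how highly a control gene set ranks in the DE results. An AUROC can be computed directly from scores and binary labels via average ranks (the Mann-Whitney U statistic). This is a stdlib-only sketch, not the paper's code, and the example scores and labels are hypothetical:

```python
def auroc(scores, labels):
    """AUROC from scores and 0/1 labels via average ranks (Mann-Whitney U)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank for ties
        i = j + 1
    pos_ranks = [r for r, l in zip(ranks, labels) if l]
    n_pos = len(pos_ranks)
    n_neg = len(labels) - n_pos
    # U statistic for the positive class, normalized to [0, 1]
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical DE scores (e.g. average rank of fold change and adjusted P)
# with labels marking membership in a positive-control gene set
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
```

An AUROC of 0.5 corresponds to random ranking, which is the expected behavior for the negative-control gene sets in panel (D).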
Figure 5.
Performance details for sex-specific differential expression. (A) Distribution of the number of Y chromosome genes expressed in female (red) and male (blue) samples across all parameters. (B) Average expression of these genes, and a measure of the female-ness and male-ness of the samples (fraction of chromosome Y genes detected), at the default minAS = 0.66; samples are colored by sex (females in red, males in blue). (C) Two sample experiments showing the change in expression and the number of Y genes detected as minAS changes. The top panel is a female sample in which more genes are detected as expressed, at higher expression levels, as minAS decreases (rs = 0.98); this relationship is inverted for a male sample (bottom panel, rs = −0.97). (D) Plot of the correlation between average expression and Y genes detected across minAS for all samples, showing the same trends as in (C); for a few samples these relationships are much weaker.
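The "male-ness/female-ness" measure in panel (B) is simply the fraction of chromosome Y genes detected in a sample. A minimal sketch, assuming per-sample counts are held in a gene-to-count mapping; the gene list and count values here are illustrative, not taken from the paper:

```python
# Illustrative chromosome Y gene list; a real analysis would take this
# from an annotation such as GENCODE
CHRY_GENES = ["RPS4Y1", "DDX3Y", "UTY", "USP9Y", "KDM5D", "EIF1AY"]

def chry_detection_fraction(counts, chry_genes=CHRY_GENES):
    """Fraction of chromosome Y genes with nonzero counts in one sample."""
    detected = sum(1 for g in chry_genes if counts.get(g, 0) > 0)
    return detected / len(chry_genes)

# Hypothetical samples: a male-like and a female-like expression profile
male_like = {"RPS4Y1": 310, "DDX3Y": 120, "UTY": 45, "USP9Y": 60,
             "KDM5D": 80, "EIF1AY": 25}
female_like = {"RPS4Y1": 2, "DDX3Y": 0, "UTY": 0, "USP9Y": 0,
               "KDM5D": 0, "EIF1AY": 0}
```

Under a permissive minAS, spuriously aligned reads would inflate this fraction in female samples, which is the failure mode panels (C) and (D) quantify.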
Figure 6.
Alignment parameter impact on co-expression. (A) Distribution of co-expression scores for the different alignment parameters. (B) Differences in SDs when comparing across parameters or across samples, with more between-sample variance than between-parameter variance. (C) Scatterplot comparing the extreme parameter to the default; each point is the co-expression score of a sample, the grey band is the average SD of the whole experiment, and the identity line is in black. (D) Scatterplots of each parameter against another; each point represents a sample, and each matrix square a comparison between choices of the minimum alignment parameter.
Figure 7.
Four score and ten alignment parameters. Scatterplots comparing the co-expression score to (A) the fraction mapped (rs = −0.12), (B) correlations (rs = 0.001) and (C) gene detection (rs = 0.13). Each color represents the alignment parameter tested; the gray dotted lines represent the quartiles of the scores and metrics. The co-expression scores do not correlate with the technical metrics, but are influenced by the number of genes detected.
Figure 8.
Defining the parameter landscape of STAR. (A) Distributions of scores for the extreme parameters tested on the GEUVADIS dataset; the purple distribution shows the per-sample scores of the most permissive parameter and the grey distribution those of the most stringent. Downsampling reads has little effect on the co-expression scores; filtering lowly expressed genes by counts has the greatest impact, consistent with the SEQC replicability experiments; allowing for mismatches has little impact. The minimum alignment score parameters are also placed for comparison. (B) Interpolated co-expression scores showing alignment and post-alignment filter hotspots (light blue to white); dark areas are least replicable. These results are averaged over all 462 samples for 1000 runs, and interpolated between the dashed lines; contours define interpolated score boundaries.

