BMC Genomics. 2019 Mar 29;20(1):250.
doi: 10.1186/s12864-019-5624-2.

ALFA: annotation landscape for aligned reads

Mathieu Bahin et al. BMC Genomics. 2019.

Abstract

Background: The last 10 years have seen the rise of countless functional genomics studies based on Next-Generation Sequencing (NGS). In the vast majority of cases, whatever the species and whatever the experiment, the first two steps of data analysis consist of a quality control of the raw reads followed by a mapping of those reads to a reference genome/transcriptome. Subsequent steps then depend on the type of study being conducted. While some tools have been proposed for investigating data quality after the mapping step, there is no commonly adopted framework that is easy to use and broadly applicable to any NGS data type.

Results: We present ALFA, a simple but universal tool that can be used after the mapping step on any kind of NGS experiment data for any organism with available genomic annotations. In a single command line, ALFA can compute and display the distribution of reads by categories (exon, intron, UTR, etc.) and biotypes (protein coding, miRNA, etc.) for a given aligned dataset with nucleotide precision. We present applications of ALFA to Ribo-Seq and RNA-Seq on Homo sapiens, CLIP-Seq on Mus musculus, RNA-Seq on Saccharomyces cerevisiae, bisulfite sequencing on Arabidopsis thaliana and ChIP-Seq on Caenorhabditis elegans.

Conclusions: We show that ALFA provides a powerful and broadly applicable approach for post-mapping quality control and for producing a global overview using common or dedicated annotations. It is made available to the community as an easy-to-install command line tool and from the Galaxy Tool Shed.

Keywords: NGS; Post mapping; Quality control; Tool; Universal.
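
The nucleotide-level distribution of reads by category described in the Results can be illustrated with a minimal sketch. This is not ALFA's implementation: the function name `category_counts`, the interval layout and the example coordinates are all hypothetical, and the sketch assumes a single chromosome with non-overlapping features given as half-open intervals.

```python
from collections import Counter

def category_counts(features, reads):
    """Count aligned nucleotides per annotation category.

    features: list of (start, end, category), half-open and non-overlapping
              (hypothetical layout, single chromosome).
    reads:    list of (start, end), half-open aligned intervals.
    Nucleotides not covered by any feature are counted as "intergenic".
    """
    counts = Counter()
    for r_start, r_end in reads:
        covered = 0
        for f_start, f_end, category in features:
            # Length of the intersection between the read and the feature.
            overlap = min(r_end, f_end) - max(r_start, f_start)
            if overlap > 0:
                counts[category] += overlap
                covered += overlap
        uncovered = (r_end - r_start) - covered
        if uncovered > 0:
            counts["intergenic"] += uncovered
    return dict(counts)

# Illustrative values only, not data from the paper.
features = [
    (100, 200, "five_prime_utr"),
    (200, 500, "CDS"),
    (500, 600, "three_prime_utr"),
]
reads = [(180, 230), (480, 520), (590, 640)]
print(category_counts(features, reads))
```

Counting nucleotides rather than whole reads means that a read straddling a boundary, such as the first read above, contributes to both categories in proportion to its overlap with each.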

Conflict of interest statement

Ethics approval and consent to participate

N/A

Consent for publication

N/A

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
ALFA category plots (raw and normalized) for Cross-Linking and ImmunoPrecipitation Sequencing (CLIP-Seq) of eIF4A3 on Mus musculus in three technical replicate samples (unpublished data from HLH available on demand). Here, ALFA highlights that replicate Rep1 seems to be inconsistent with replicates Rep2 and Rep3, as the CDS, 3′-UTR and intergenic categories seem to display different proportions.
Fig. 2
ALFA category plots (raw and normalized) for Bisulfite Sequencing (BS-Seq) data on Arabidopsis thaliana samples (public data available on NCBI: SRA035939 and EBI: ERA051872). Datasets were gathered from two studies performed on the same model but in two different laboratories: Lab1-Rep1 (SRR342381) and Lab1-Rep2 (SRR342391) from [14] and Lab2-Rep1 (ERR046552) and Lab2-Rep2 (ERR046553) from [15]. ALFA highlights laboratory-dependent differences between reads falling in CDS (t-test significant at a 5% level with a p-value of 4 × 10⁻²).
Fig. 3
ALFA category plots (raw and normalized) for Ribosome Profiling (Ribo-Seq) data on Mus musculus samples (unpublished data from HLH available on demand). Unt-Rep1 and Unt-Rep2 are two untreated samples while HA-Rep1 and HA-Rep2 are samples treated with harringtonine. Harringtonine is a drug that inhibits the elongation phase of translation, after initiation. Here, ALFA shows that mRNAs are actively translated in the untreated samples (t-test significant at a 1% level with a p-value of 6 × 10⁻³), while an expected shift towards the translation start site can be observed in the samples treated with harringtonine: reads span the end of the 5′-UTR (p-value = 7 × 10⁻⁵) and the start codons (p-value = 1 × 10⁻³), which is visible thanks to the depth argument set to 4.
Fig. 4
ALFA category plots (raw and normalized) for Ribosome Profiling (Ribo-Seq) data on Homo sapiens (unpublished data from AL available on demand) performed with two different procedures for footprinting: treatment with MNase as in [16] or treatment with RNase I as in [17]. As a preprocessing step, rRNA and mtRNA reads were computationally filtered out. The enrichment of signal in intergenic and 3′-UTR regions shows that treatment with MNase seems to produce a substantial increase in non-protein-coding reads compared to RNase I.
Fig. 5
ALFA biotype plots (raw and normalized) for RNA-Seq data on Homo sapiens samples (public data available on NCBI: SRP058036). This dataset is part of a research work where ribosomal RNA depletion is compared between adult and fetal tissues [18]. This study reported that a large portion of transcripts with mitochondrial ribosomal origin was observed, in particular in colon, heart and kidney samples. For clarity, only lung (SRR2014234 and SRR2014235) and heart (SRR2014232 and SRR2014233) replicates from the study are reported here. With a single command, ALFA quickly confirms that mitochondrial rRNA contamination is higher in the heart samples than in the lung samples (t-test significant at a 1% level with a p-value of 3 × 10⁻³). Moreover, an intergenic contamination, not revealed in the original work, can also be noticed on Lung Adult-Rep2 with no additional work.
Fig. 6
ALFA biotype plots (raw and normalized) for Chromatin Immuno-Precipitation sequencing (ChIP-Seq) of NPP-13 from Caenorhabditis elegans samples (public data available on NCBI: SRA062428). This dataset originates from a study [19] where snoRNA and tRNA genetic loci were found to be enriched in the IP (SRR628901) compared to the Inputs (SRR628899 and SRR628900). Here, ALFA can retrieve this result in a simple call, as highlighted by this plot. Moreover, by providing a global overview without additional work, ALFA seems to indicate enrichments in the IP for other biotypes such as miRNA and ncRNA genetic loci.
Fig. 7
ALFA biotype plots (raw and normalized) for RNA sequencing (RNA-Seq) data from a Saccharomyces cerevisiae sample (public data available on NCBI: SRA030505 - SRR927165). In this example, a customized GTF annotation file was created to highlight the flexibility of ALFA. Dedicated biotypes characterizing various Saccharomyces cerevisiae stable transcripts (SUTs [24]) and unstable transcripts (CUTs [21], NUTs [22], XUTs [23]) were converted from a BED file.
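
The BED-to-GTF conversion mentioned for Fig. 7 can be sketched as follows. This is only an illustration, not the authors' script: the function name `bed_to_gtf_line`, the `custom` source label, the `exon` feature type and the `gene_biotype` attribute key are assumptions, and the attributes that ALFA actually reads should be checked against its documentation. The one firm point is the coordinate shift: BED starts are 0-based and half-open, whereas GTF coordinates are 1-based and inclusive.

```python
def bed_to_gtf_line(bed_line, biotype, source="custom"):
    """Convert one tab-separated BED record into a GTF line carrying a biotype.

    Assumptions (hypothetical, for illustration): the record is written as an
    "exon" feature and the biotype is stored under a "gene_biotype" attribute.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Optional BED columns: name (4th) and strand (6th).
    name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
    strand = fields[5] if len(fields) > 5 else "."
    attributes = (
        f'gene_id "{name}"; transcript_id "{name}"; gene_biotype "{biotype}";'
    )
    # BED start is 0-based, GTF start is 1-based; the end value is unchanged.
    return "\t".join(
        [chrom, source, "exon", str(start + 1), str(end), ".", strand, ".", attributes]
    )

# Illustrative record, not taken from the paper's annotation.
print(bed_to_gtf_line("chrI\t999\t1500\tCUT001\t0\t+", "CUT"))
```

Lines produced this way could be appended to an existing annotation so that the dedicated biotypes appear alongside the standard ones.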

References

    1. FastQC. http://www.bioinformatics.babraham.ac.uk/projects/fastqc.
    2. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
    3. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129(4):823–837. doi: 10.1016/j.cell.2007.05.009.
    4. Yeo GW, Coufal NG, Liang TY, Peng GE, Fu XD, Gage FH. An RNA code for the FOX2 splicing regulator revealed by mapping RNA-protein interactions in stem cells. Nat Struct Mol Biol. 2009;16(2):130–137. doi: 10.1038/nsmb.1545.
    5. Hower V, Starfield R, Roberts A, Pachter L. Quantifying uniformity of mapped reads. Bioinformatics. 2012;28(20):2680–2682. doi: 10.1093/bioinformatics/bts451.