Bioinformatics. 2025 Sep 1;41(9):btaf416.
doi: 10.1093/bioinformatics/btaf416.

Evaluation of sequencing reads at scale using rdeval

Giulio Formenti et al. Bioinformatics. 2025.

Abstract

Motivation: Large sequencing datasets are being produced and deposited into public archives at unprecedented rates. The availability of tools that can reliably and efficiently generate and store sequencing read summary statistics has become critical.

Results: As part of the effort by the Vertebrate Genomes Project (VGP) to generate high-quality reference genomes at scale, we sought to address the community's need for efficient sequence data evaluation by developing rdeval, a standalone tool to quickly compute and interactively display sequencing read metrics. Rdeval can either run on the fly or store key sequence data metrics in tiny read 'snapshot' files. Statistics can then be efficiently recalled from snapshots for additional processing. Rdeval can convert fa*[.gz] files to and from other popular formats, including BAM and CRAM, for better compression. Overall, while CRAM achieves the best compression, the gain compared to BAM is marginal, and BAM achieves the best compromise between data compression and access speed. Rdeval also generates a detailed visual report with multiple data analytics that can be exported in various formats. We showcase rdeval's functionalities using long-read data from different sequencing platforms and species, including human. For PacBio long-read sequencing, our analysis shows dramatic improvements in both read length and quality over time, as well as the benefit of increased coverage for genome assembly, though the magnitude of this benefit varies among taxa.
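To make "sequencing read metrics" concrete, the following is a minimal Python sketch of the kind of summary statistics such a tool reports (read count, total bases, mean length, N50). It is not rdeval's code or interface: the function names `nx` and `summarize` and the toy read lengths are assumptions made for illustration only.

```python
def nx(lengths, x=50):
    """Return the Nx read length: the largest length L such that reads of
    length >= L together contain at least x% of all bases."""
    if not lengths:
        return 0
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest read down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0


def summarize(lengths):
    """Collect basic read-length statistics into a dictionary."""
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "total_bp": total,
        "mean_length": total / len(lengths) if lengths else 0.0,
        "n50": nx(lengths, 50),
        "longest": max(lengths, default=0),
    }


# Hypothetical read lengths, chosen only to keep the arithmetic easy to follow.
toy_lengths = [2, 3, 4, 5, 6]
stats = summarize(toy_lengths)
print(stats)
```

With these toy lengths the total is 20 bases, so the N50 threshold is 10 bases; the two longest reads (6 and 5) already reach it, which gives an N50 of 5.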

Availability and implementation: Rdeval is implemented in C++ for data processing and in R for data visualization. Precompiled releases (Linux, macOS, Windows) and commented source code for rdeval are available under the MIT license at https://github.com/vgl-hub/rdeval. Documentation is available on ReadTheDocs (https://rdeval-documentation.readthedocs.io). Rdeval is also available in Bioconda and in Galaxy (https://usegalaxy.org). An automated test workflow ensures the consistency of software updates.

Figures

Figure 1.
Representative plots from the rdeval report using four HiFi sequencing runs from four species (D1: Vipera latastei, D2: Amazona ochrocephala, D3: Ascaphus truei, and D4: Gallus gallus). (a) Read length violin plots. D1 shows a tighter read length distribution, and D3 is skewed toward shorter read lengths. (b) Read length density plots. Similar to the violin plots, D3 has shorter reads on average, and D4 has the longest read lengths. (c) Read length inverse cumulative distributions. The distribution can be used to assess read coverage at certain read length cutoffs. For instance, in D3, we can observe that there is about 20 Gbp of coverage for reads 10 kb or longer (black arrow). (d) Read length versus average read quality, plotted as a 2D contour plot that highlights data density for the four runs combined. The plot shows that read length correlates inversely with average read quality, to varying degrees, with the highest density of reads in the 20–25 kb read length range and Q25–30 read quality range.
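The inverse cumulative distribution in panel (c) can be sketched in a few lines of Python. This is an illustration of the idea rather than the code behind the rdeval report: the function names and the toy read lengths below are hypothetical.

```python
def bases_at_or_above(lengths, cutoff):
    """Total bases contained in reads whose length is >= cutoff."""
    return sum(length for length in lengths if length >= cutoff)


def inverse_cumulative(lengths):
    """Return (read_length, cumulative_bases) points, longest read first.

    Plotting these points gives an inverse cumulative curve: at each read
    length, the number of bases held by reads at least that long.
    """
    points = []
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        points.append((length, running))
    return points


# Hypothetical read lengths in bp, not taken from any of the D1-D4 runs.
toy_lengths = [5000, 12000, 8000, 20000, 15000]
curve = inverse_cumulative(toy_lengths)
kept = bases_at_or_above(toy_lengths, 10000)
print(curve)
print(kept)
```

Reading the curve at a cutoff answers the question the caption poses for D3: how much sequence would remain if only reads at or above that length were kept.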
Figure 2.
Sequencing read evaluation with rdeval. (a) Schematic of rdeval workflow. Inputs include genome assemblies in fasta, fa*[.gz], BAM, CRAM formats, and include/exclude lists as bed coordinate files for filtering. (b) Length distribution of reads from two representative human genomic datasets (CHM13 and HG002). (c) Quality distribution for the same data sets. Note that PacBio HiFi reads are at least Q20 and are capped at Q40. (d) Comparison of compression levels of 80 sequencing datasets (datasets under 1 Gbp in total read length were excluded) included in the VGP project across different file types (FASTA [.GZ], FASTQ [.GZ], BAM, and CRAM) and sequencing platforms (Illumina and PacBio; Table 2, available as supplementary data at Bioinformatics online). (e) Relationship between original file size and .rd file size in different data sets: the X-axis represents downsampling, while the Y-axis shows .rd file sizes in MB. The test has been performed on five different sequencing runs from human genomic data sets (Table 3, available as supplementary data at Bioinformatics online) generated by different sequencing platforms. All data sets were initially downsampled to 11 Gbp, corresponding to the downsampling fraction of 1. Each downsampled dataset was further downsampled to generate the remaining data points. The .rd size at the first downsampling level (0.05) has been used to estimate theoretical projections at subsequent steps. Projections are shown in dashed lines. (f) Runtime (minutes) versus data set size (Gbp) for 84 sequencing data sets from Illumina (n = 37) and PacBio (n = 47) platforms across various file formats.
Figure 3.
(a) Average read length Nx plot using all long-read VGP datasets (Table 5, available as supplementary data at Bioinformatics online). Most reads were shorter than 100 kb on the Sequel instrument. Read lengths consistently improved with Sequel II. HiFi consensus read lengths were longer on the Sequel II and shorter on the Revio, likely as a consequence of shorter movie times. (b) Average read length versus average read quality across VGP datasets and instruments. Average read quality is calculated as the mean of the BAM QUAL field. Note that the rq tag in the BAM file represents the estimated "read quality" from Revio and Sequel II, and on Revio, rq may be higher than mean(QUAL) in the final output. Datasets are color-grouped by sequencing platform. Datasets generated by mixed sequencing instruments were excluded from the analysis. Q scores are not available for CLR reads downloaded as FASTQ from SRA. HiFi data show an inverse relationship between read length and quality, consistent with fewer passes being available for consensus in longer reads. The difference observed between Sequel II and Revio data is only a representational change, due to capping of Q scores in Revio instruments. (c) Correlation between PacBio HiFi sequencing coverage and contig N50 in VGP genomes. Outliers are highlighted.
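The caption defines average read quality as the mean of the QUAL field. A minimal Python sketch of that calculation is shown below, assuming Phred+33 encoded quality strings as found in FASTQ text. The second function is included only for contrast: it averages error probabilities instead of Q values, which is a different convention and not the one the caption describes. Both function names and the toy quality string are hypothetical and are not taken from rdeval.

```python
import math


def mean_qual(quality_string, offset=33):
    """Arithmetic mean of per-base Phred scores (the mean(QUAL) convention)."""
    if not quality_string:
        raise ValueError("empty quality string")
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


def error_based_quality(quality_string, offset=33):
    """Phred score of the mean per-base error probability (for contrast only)."""
    if not quality_string:
        raise ValueError("empty quality string")
    probabilities = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    mean_error = sum(probabilities) / len(probabilities)
    return -10 * math.log10(mean_error)


# Toy quality string: 'I' is Q40, '5' is Q20, '+' is Q10 under Phred+33.
toy_quality = "II5+"
arithmetic = mean_qual(toy_quality)
error_based = error_based_quality(toy_quality)
print(arithmetic, error_based)
```

The arithmetic mean of Q values is never lower than the error-probability-based value and is usually higher when a read mixes high and low quality bases, so the choice of convention matters when comparing averages across tools.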
