Review
Nat Rev Genet. 2023 Apr;24(4):235-250. doi: 10.1038/s41576-022-00551-z. Epub 2022 Dec 7.

Navigating bottlenecks and trade-offs in genomic data analysis


Bonnie Berger et al. Nat Rev Genet. 2023 Apr.

Abstract

Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell-to-cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that the analytical pipelines often struggle to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever-increasing importance. Recent methodological advances, such as data sketching, accelerators and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.


Competing interests

The authors declare no competing interests.

Figures

Fig. 1 | Overview of genomic analysis pipelines.
a, Biological data sources. Before any analysis takes place, raw data must be gathered. Traditionally, bulk sequencing mixes together DNA or RNA from many cells in a sample — sometimes from a single individual, sometimes from an entire microbiome. More recently, the emerging technology of single-cell RNA sequencing (scRNA-seq) allows for collecting and amplifying genomic samples from individual cells. b, Chemistry-driven primary data generation (sequencing). A sequencer is a machine that takes a physical sample and outputs digital information in the form of a set of limited-length nucleotide strings called ‘reads’ (for example, ACGT) with associated metadata. In the early days, Sanger sequencing produced reads of length ~800 bp, but it was the advent of ‘second-generation sequencing’, with reads of ~100–200 bp, that brought costs down sufficiently to produce analytical bottlenecks. Additionally, newer ‘third-generation’ technologies are still expensive but can access read lengths in the >1,000 bp range. c, Algorithmic secondary processing. Our focus in this Review is on ‘secondary processing’, which is all of the computational methodology used to reconstruct genomic facts about the biological sample from the output of the sequencer — for example, to reconstruct the genome of an individual — but not to specifically resolve biological hypotheses. These secondary processing steps are where the majority of the current computational bottlenecks lie (see Box 2 for more details on the various elements and workflows). d, Statistical tertiary processing. ‘Tertiary processing’ uses the output of secondary processing to answer biological questions. For example, determining the genomic variants present in a sample would be secondary processing, whereas in tertiary processing a researcher may perform a genome-wide association study (GWAS) on those variants to find disease associations. Often, this type of analysis is largely statistical. Although there is substantial computation involved, especially with the rise of deep learning, the computational challenges here are fewer compared with secondary processing.
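To make the ‘reads plus metadata’ output of panel b concrete, the sketch below parses records from a FASTQ stream, the de facto text format for second-generation sequencer output. This is an illustrative toy, not part of the Review; the Read class, the helper names and the Phred+33 quality-encoding assumption are ours.

```python
from dataclasses import dataclass
from typing import Iterator, TextIO

@dataclass
class Read:
    """One sequencer read: a short nucleotide string plus metadata."""
    identifier: str        # instrument/run metadata from the header line
    sequence: str          # the nucleotide string, e.g. 'ACGTACGT'
    qualities: list[int]   # per-base Phred quality scores

def parse_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a FASTQ stream (four lines per record).

    Assumes Phred+33 quality encoding, the convention on modern
    Illumina instruments; older data may use Phred+64.
    """
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of input
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quals = handle.readline().rstrip()
        yield Read(
            identifier=header.removeprefix("@"),
            sequence=seq,
            qualities=[ord(c) - 33 for c in quals],
        )

if __name__ == "__main__":
    import io
    # A two-record toy FASTQ standing in for real sequencer output.
    toy = io.StringIO(
        "@read1 lane=1\nACGTACGT\n+\nIIIIHHHH\n"
        "@read2 lane=1\nTTGGCCAA\n+\nIIII!!!!\n"
    )
    for read in parse_fastq(toy):
        print(read.identifier, read.sequence, read.qualities)
```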
Fig. 2 | Genomic compression and sketching.
There now exist many different processing techniques for reducing transmission requirements, but they involve different kinds of trade-offs, including reductions in transmission size, decompression requirements, compute time and accuracy of the downstream analysis. a, Lossless compression. Data are compressed, creating a smaller squashed version, but the data still need to be decompressed back to the original in order to run computations, which takes a lot of time, as represented by the multiple computers. b, Compressive genomics. Data are compressed so as to operate directly on the compressed representation, increasing speed (represented by the single computer for analysis) without loss of accuracy because the decompression step can be skipped. (Note that there is also a lossy version of compressive genomics, which improves the compression ratio at the cost of accuracy, but here we only illustrate lossless compressive genomics.) c, Lossy compression. This method allows for some error in the reconstruction to reduce the compressed file size (depicted in the green modification to the filing cabinet), which may alter the final results of analysis after decompression (hence the green ‘error’). d, Data sketching. A sketch avoids the need for reconstruction entirely by transforming the data irreversibly (depicted visually as desaturation); this significantly reduces the file size and computation time, but may alter the analysis results (hence the red ‘error’). The central difference between sketching and compression is that compressed data can be decompressed into the same format as the original, so the same downstream analysis tools can be used; for sketching, the downstream tools must be modified to operate on sketched data, but sketching can often achieve greater space savings than even lossy compression.
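To illustrate the sketching trade-off in panel d, here is a minimal MinHash sketch in Python. MinHash is a standard sketching technique in genomics (used, for example, by the tool Mash), but this toy implementation, its parameters (k = 5, 64 hash functions) and the example sequences are our own illustrative assumptions, not the Review's. The sketch irreversibly fingerprints each sequence's k-mer set, so the Jaccard similarity of two sequences can be estimated from tiny sketches at the cost of some estimation error.

```python
import hashlib

def kmers(seq: str, k: int = 5) -> set[str]:
    """Decompose a sequence into its set of overlapping k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def seeded_hash(item: str, seed: int) -> int:
    """A seeded 64-bit hash built from BLAKE2b."""
    data = seed.to_bytes(4, "big") + item.encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash(items: set[str], num_hashes: int = 64) -> list[int]:
    """Build a MinHash sketch: for each seeded hash function, keep only
    the minimum hash value over all items. The sketch is small and
    irreversible -- the original k-mer set cannot be recovered from it."""
    return [min(seeded_hash(item, seed) for item in items)
            for seed in range(num_hashes)]

def estimate_jaccard(a: list[int], b: list[int]) -> float:
    """The fraction of slots on which two sketches agree is an unbiased
    estimate of the Jaccard similarity of the underlying k-mer sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

if __name__ == "__main__":
    s1 = "ACGTACGTGGTTAACCGGTTACGT"
    s2 = "ACGTACGTGGTTAACCGGTTAGGT"
    k1, k2 = kmers(s1), kmers(s2)
    exact = len(k1 & k2) / len(k1 | k2)
    approx = estimate_jaccard(minhash(k1), minhash(k2))
    print(f"exact Jaccard   = {exact:.3f}")
    print(f"sketch estimate = {approx:.3f}  (some error is expected)")
```

The standard error of the estimate scales as sqrt(J(1 - J)/num_hashes), so larger sketches trade memory for accuracy, precisely the accuracy-versus-memory trade-off highlighted in the abstract.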

