F1000Res. 2019 Jul 4:8:1006.
doi: 10.12688/f1000research.19675.1. eCollection 2019.

Large-scale sequence comparisons with sourmash


N Tessa Pierce et al. F1000Res.

Abstract

The sourmash software package uses MinHash-based sketching to create "signatures", compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.

Keywords: MinHash; bioinformatics; k-mer; sequence analysis; sourmash.
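
As a rough illustration of the MinHash idea the abstract describes, the following is a minimal sketch in plain Python, not sourmash's own code or API. It keeps only the smallest hash values of a sequence's k-mers (a "bottom-k" sketch) and estimates Jaccard similarity from two such sketches. The function names (`kmers`, `hash_kmer`, `sketch`, `estimate_jaccard`), the choice of BLAKE2b as the hash, and the default parameters are all assumptions made for this example. Reverse-complement handling is omitted for brevity.

```python
import hashlib


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence (no reverse complements)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (hash choice is illustrative only)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, num_hashes=64):
    """Keep the num_hashes smallest k-mer hashes as a compressed representation."""
    hashes = sorted({hash_kmer(km) for km in kmers(seq, k)})
    return set(hashes[:num_hashes])


def estimate_jaccard(sketch_a, sketch_b, num_hashes=64):
    """Bottom-k estimate: among the smallest hashes of the union,
    count the fraction present in both sketches."""
    union_bottom = sorted(sketch_a | sketch_b)[:num_hashes]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in sketch_a and h in sketch_b)
    return shared / len(union_bottom)


if __name__ == "__main__":
    a = sketch("ACGTACGTAC")
    b = sketch("ACGTACGTTT")
    print(f"estimated Jaccard similarity: {estimate_jaccard(a, b):.3f}")
```

When a sequence has fewer distinct k-mers than `num_hashes`, as in the toy inputs above, the sketch holds every hash and the estimate equals the exact Jaccard similarity. On genome-scale inputs the sketch stays a fixed size, which is what makes comparison fast and low in memory.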

Conflict of interest statement

No competing interests were disclosed.

Figures

Figure 1. The MDS plots produced from the reference-free sourmash compare similarity matrix and the transcript quantification analysis (salmon and edgeR) are similar.
Wild-type S. cerevisiae samples (ERR459011, ERR459102) are in yellow and mutant samples (ERR458584, ERR458829) in blue.

Figure 2. Heatmap and dendrogram generated using sourmash signatures built from scaffolds in the domesticated olive genome.
Two scaffolds are outliers when using tetranucleotide frequency to calculate similarity (highlighted in green on the dendrogram).

Figure 3. Heatmap and dendrogram generated using sourmash signatures built from 50 genomes that contained the word “Escherichia coli”.
One signature is an outlier (highlighted in blue on the dendrogram).
