Bioinformatics. 2024 Jun 3;40(6):btae332.
doi: 10.1093/bioinformatics/btae332.

SWGTS-a platform for stream-based host DNA depletion

Philipp Spohr et al. Bioinformatics.

Abstract

Motivation: Microbial sequencing data from clinical samples is often contaminated with human sequences, which have to be removed prior to sharing. Existing methods for human read removal, however, are applicable only after the target dataset has been retrieved in its entirety, putting the recipient at least temporarily in control of a potentially identifiable genetic dataset with potential implications under regulatory frameworks such as the GDPR. In some instances, the ability to carry out stream-based host depletion as part of the data transfer process may be preferable.

Results: We present SWGTS, a client-server application for the transfer and stream-based host depletion of sequencing reads. SWGTS enforces a robust upper bound on the maximum amount of human genetic data from any one client held in memory at any point in time by storing all incoming sequencing data in a limited-size, client-specific intermediate processing buffer, and by throttling the rate of incoming data if it exceeds the speed of host depletion carried out on the SWGTS server in the background. SWGTS exposes an HTTP-REST interface, is implemented using docker-compose, Redis and traefik, and requires less than 8 Gb of RAM for deployment. We demonstrate high filtering accuracy of SWGTS; incoming data transfer rates of up to 1.65 megabases per second in a conservative configuration; and mitigation of re-identification risks by the ability to limit the number of SNPs present on a popular population-scale genotyping array covered by reads in the SWGTS buffer to a low user-defined number, such as 10 or 100.
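The bounded-buffer idea described above can be illustrated with a minimal Python sketch. It is not taken from the SWGTS code base: the class name `BoundedReadBuffer`, its methods, and the choice to measure the bound in bases are assumptions made purely for illustration. A chunk is accepted only if the buffer stays within its limit; otherwise the sender is expected to back off until the depletion worker has freed space.

```python
from collections import deque
from typing import List, Optional


class BoundedReadBuffer:
    """Hypothetical per-client buffer that never holds more than max_bases."""

    def __init__(self, max_bases: int) -> None:
        self.max_bases = max_bases
        self.held_bases = 0          # bases currently awaiting host depletion
        self._reads = deque()        # FIFO queue of unprocessed reads

    def try_add(self, reads: List[str]) -> bool:
        """Accept the whole chunk, or refuse it if the bound would be exceeded."""
        incoming = sum(len(r) for r in reads)
        if self.held_bases + incoming > self.max_bases:
            return False             # sender should throttle and retry later
        self._reads.extend(reads)
        self.held_bases += incoming
        return True

    def pop(self) -> Optional[str]:
        """Hand one read to the depletion worker, freeing buffer space."""
        if not self._reads:
            return None
        read = self._reads.popleft()
        self.held_bases -= len(read)
        return read


buf = BoundedReadBuffer(max_bases=10)
first = buf.try_add(["ACGT", "ACGT"])   # 8 bases fit within the limit of 10
second = buf.try_add(["ACG"])           # 8 + 3 would exceed 10, so refused
freed = buf.pop()                       # worker consumes one read (4 bases)
third = buf.try_add(["ACG"])            # 4 + 3 fits again
```

Because a chunk is either accepted whole or refused, `held_bases` can never exceed `max_bases`, which is the invariant the upper bound relies on in this sketch.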

Availability and implementation: SWGTS is available on GitHub: https://github.com/AlBi-HHU/swgts (https://doi.org/10.5281/zenodo.10891052). The repository also contains a Jupyter notebook that can be used to reproduce all the benchmarks used in this article. All datasets used for benchmarking are publicly available.

Conflict of interest statement

None declared.

Figures

Figure 1.
(A) Overview of SWGTS. SWGTS is a client–server application for the transfer and stream-based depletion of sequencing data from human reads. An upload to the SWGTS server is initiated by a context creation request; on the SWGTS server, each context corresponds to a unique buffer of unprocessed sequencing data of a defined maximum size (buffer size). After creation of the context, the client sequentially transmits equally sized packets of sequencing data (chunk size) to the server, including the ID of the context just created; the server accepts chunks and adds the reads contained therein to the buffer corresponding to the specified context ID, for as long as doing so would not lead to the buffer exceeding its specified maximum size; otherwise, a “buffer full, retry in x seconds” notification is sent to the client. Single reads that are larger than the buffer are sent but immediately rejected and treated as filtered. A (multi-threaded) background process on the server continuously pulls reads from the buffer and carries out host depletion by aligning the received reads against a combined reference of the human genome and the target pathogen (in “Host-Competitive Pathogen Retention” mode). Reads that map to the pathogen are saved to disk; reads mapping to either the human reference or not at all are discarded. When a context is closed, the server sends a summary containing the IDs of the reads that were not discarded to the client.
(B) Impact of the “buffer size” parameter on potential re-identifiability and performance. (Left panel) Relationship between buffer size and the number of Illumina Infinium Omni2.5-8 Kit Panel SNPs, which is pragmatically treated as a proxy for re-identifiability, covered by reads in the SWGTS buffer under worst-case assumptions (only human reads submitted); shown are empirical distributions based on real Illumina and ONT reads from three human samples (see Section 2; 10 replicates), as well as the expected value of the number of such SNPs under an approximate statistical model (“Binomial Model Expected Value”, see Supplementary Note S1). (Right panel) Relationship between buffer size and mean transfer rate for sequencing datasets between two systems, averaged over the four “contaminated” datasets representing SARS-CoV-2 and MRSA as well as Illumina and ONT (see Section 2) and in “host subtraction” mode.
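The upload flow in panel (A) can be walked through with a small in-memory Python simulation. Everything here is an assumption for illustration only: `InMemoryServer`, `upload`, and the toy `is_pathogen` predicate are hypothetical names, chunk size is counted in reads, and no real SWGTS REST endpoint or aligner is used. The predicate stands in for alignment against the combined human and pathogen reference.

```python
from typing import Callable, List, Tuple

Read = Tuple[str, str]  # (read ID, sequence)


class InMemoryServer:
    """Hypothetical stand-in for one server-side context; not the SWGTS API."""

    def __init__(self, buffer_size: int, is_pathogen: Callable[[str], bool]) -> None:
        self.buffer_size = buffer_size
        self.is_pathogen = is_pathogen
        self.buffer: List[Read] = []
        self.kept_ids: List[str] = []

    def held(self) -> int:
        return sum(len(seq) for _, seq in self.buffer)

    def submit(self, chunk: List[Read]) -> str:
        """Return 'ok' or 'retry'; reads larger than the buffer count as filtered."""
        fitting = [(rid, seq) for rid, seq in chunk if len(seq) <= self.buffer_size]
        size = sum(len(seq) for _, seq in fitting)
        if size > self.buffer_size:
            raise ValueError("chunk can never fit; choose a smaller chunk size")
        if self.held() + size > self.buffer_size:
            return "retry"           # buffer full, client must wait
        self.buffer.extend(fitting)
        return "ok"

    def deplete(self) -> None:
        """Background step: keep pathogen read IDs, discard everything else."""
        for rid, seq in self.buffer:
            if self.is_pathogen(seq):
                self.kept_ids.append(rid)
        self.buffer.clear()


def upload(server: InMemoryServer, reads: List[Read], chunk_size: int) -> List[str]:
    """Send reads chunk by chunk, retrying whenever the buffer is full."""
    for start in range(0, len(reads), chunk_size):
        chunk = reads[start:start + chunk_size]
        while server.submit(chunk) == "retry":
            server.deplete()         # stands in for waiting on the server
    server.deplete()                 # closing the context drains the buffer
    return server.kept_ids           # summary of reads that were not discarded


reads = [("r1", "AAAA"), ("r2", "CCCC"), ("r3", "AAAAAAAAAAAA"), ("r4", "CCCC")]
server = InMemoryServer(buffer_size=8, is_pathogen=lambda s: s.startswith("C"))
kept = upload(server, reads, chunk_size=2)   # -> ['r2', 'r4']
```

In this toy run the second chunk is first refused because the buffer is full, the 12-base read `r3` is dropped as larger than the buffer, and only the IDs of the two reads accepted by the toy predicate are reported back.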
