Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2021 Apr 2;16(1):2.
doi: 10.1186/s13015-021-00181-w.

Fast lightweight accurate xenograft sorting

Affiliations

Fast lightweight accurate xenograft sorting

Jens Zentgraf et al. Algorithms Mol Biol. .

Abstract

Motivation: With an increasing number of patient-derived xenograft (PDX) models being created and subsequently sequenced to study tumor heterogeneity and to guide therapy decisions, there is a similarly increasing need for methods to separate reads originating from the graft (human) tumor and reads originating from the host species' (mouse) surrounding tissue. Two kinds of methods are in use: On the one hand, alignment-based tools require that reads are mapped and aligned (by an external mapper/aligner) to the host and graft genomes separately first; the tool itself then processes the resulting alignments and quality metrics (typically BAM files) to assign each read or read pair. On the other hand, alignment-free tools work directly on the raw read data (typically FASTQ files). Recent studies compare different approaches and tools, with varying results.

Results: We show that alignment-free methods for xenograft sorting are superior concerning CPU time usage and equivalent in accuracy. We improve upon the state of the art sorting by presenting a fast lightweight approach based on three-way bucketed quotiented Cuckoo hashing. Our hash table requires memory comparable to an FM index typically used for read alignment and less than other alignment-free approaches. It allows extremely fast lookups and uses less CPU time than other alignment-free methods and alignment-based methods at similar accuracy. Several engineering steps (e.g., shortcuts for unsuccessful lookups, software prefetching) improve the performance even further.

Availability: Our software xengsort is available under the MIT license at http://gitlab.com/genomeinformatics/xengsort . It is written in numba-compiled Python and comes with sample Snakemake workflows for hash table construction and dataset processing.

Keywords: Alignment-free method; Cuckoo hashing; Xenograft sorting; k-mer.

PubMed Disclaimer

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Fig. 1
Illustration of (3,4) Cuckoo hashing with 3 hash functions and buckets of size 4. Left: Each key-value pair can be stored at one of up to 12 locations in 3 buckets. For key x, the bucket given by f1(x) is full, so bucket f2(x) is attempted, where a free slot is found. Right: If all hb slots are full, key x is placed into one of these slots at random (blue), and the currently present key-value pair is evicted and re-inserted into an alternative slot
Fig. 2
Fig. 2
A k-mer is partitioned into its -prefix, a middle base and its -suffix. Efficient local re-sorting of k-mers according to common -prefix and -suffix yields groups of k-mers that differ only in their middle base
Fig. 3
Fig. 3
Decision rule tree for classifying a DNA fragment from k-mer statistics (h,h,g,g,b,x;n), meaning number of k-mers of type “host” (h), “weak host” (h), “graft” (g), “weak graft” (g), “both” (b), and number of k-mers not present in the key-value store (x), respectively; n is the total number of (valid) k-mers in the fragment. We also use weighted scores Shost:=h+h/2 and Sgraft:=g+g/2 and thresholds Thost:=n/4,Tgraft:=n/4 and Tboth:=n/4. A fragment is thus classified as “host”, “graft”, “both”, “neither”, or “ambiguous”. Category “ambiguous” is chosen if no other rule applies and no “else” rule is present in a node
Fig. 4
Fig. 4
Classification results of different tools (XenofilteR, xenome, xengsort, and partially xengsort with “quick” option) on several datasets: a GIAB human matepair dataset (XenofilteR did not run on this dataset); b Chicken genome; c Human lymphocytic leukemia RNA-seq data; d Patient-derived xenograft (PDX) RNA-seq data. e CPU times on the PDX RNA-seq dataset with different tools and different xengsort parameters (see text)
Fig. 5
Fig. 5
Effect of different prefetching levels on the running time of the adversarial chicken genome dataset: no prefetching (0, default), prefetching the second bucket (1), or prefetching both second and third bucket (2). Times are averages over 4 repeated runs
Fig. 6
Fig. 6
Effect of shortcut bits for unsuccessful k-mer searches on the running time of the adversarial chicken genome dataset: no extra bits (0, default), one extra bit (1) or two extra bits (2); see “Performance engineering” section. Times are averages over 6 repeated runs

References

    1. Jo SY, Kim E, Kim S. Impact of mouse contamination in genomic profiling of patient-derived models and best practice for robust analysis. Genome Biol. 2019;20(1):231. doi: 10.1186/s13059-019-1849-2. - DOI - PMC - PubMed
    1. Kluin RJC, Kemper K, Kuilman T, de Ruiter JR, Iyer V, Forment JV, Cornelissen-Steijger P, de Rink I, Ter Brugge P, Song JY, Klarenbeek S, McDermott U, Jonkers J, Velds A, Adams DJ, Peeper DS, Krijgsman O. XenofilteR: computational deconvolution of mouse and human reads in tumor xenograft sequence data. BMC Bioinform. 2018;19(1):366. doi: 10.1186/s12859-018-2353-5. - DOI - PMC - PubMed
    1. Giner G. XenoSplit. Unpublished; 2019. source code available at https://github.com/goknurginer/XenoSplit.
    1. Khandelwal G, Girotti MR, Smowton C, Taylor S, Wirth C, Dynowski M, Frese KK, Brady G, Dive C, Marais R, Miller C. Next-generation sequencing analysis and algorithms for PDX and CDX models. Mol Cancer Res. 2017;15(8):1012–1016. doi: 10.1158/1541-7786.MCR-16-0431. - DOI - PubMed
    1. Ahdesmäki MJ, Gray SR, Johnson JH, Lai Z. Disambiguate: an open-source application for disambiguating two species in next generation sequencing data from grafted samples. F1000Res. 2016;5:2741. doi: 10.12688/f1000research.10082.1. - DOI - PMC - PubMed