Bioinformatics. 2018 Mar 1;34(5):739-747.
doi: 10.1093/bioinformatics/btx655.

Bartender: a fast and accurate clustering algorithm to count barcode reads

Lu Zhao et al. Bioinformatics.

Abstract

Motivation: Barcode sequencing (bar-seq) is a high-throughput and cost-effective method to assay large numbers of cell lineages or genotypes in complex cell pools. Because of these advantages, applications for bar-seq are growing quickly, from using neutral random barcodes to study the evolution of microbes or cancer, to using pseudo-barcodes, such as shRNAs or sgRNAs, to simultaneously screen large numbers of cell perturbations. However, the computational pipelines for bar-seq clustering are not well developed. Available methods often yield a high frequency of under-clustering artifacts that result in spurious barcodes, or over-clustering artifacts that group distinct barcodes together. Here, we developed Bartender, an accurate clustering algorithm to detect barcodes and their abundances from raw next-generation sequencing data.

Results: In contrast with existing methods that cluster based on sequence similarity alone, Bartender uses a modified two-sample proportion test that also considers cluster size. This modification results in higher accuracy and lower rates of under- and over-clustering artifacts. Additionally, Bartender includes unique molecular identifier handling and a 'multiple time point' mode that matches barcode clusters between different clustering runs for seamless handling of time course data. Bartender is a set of simple-to-use command line tools that can be run on a laptop with run times comparable to those of existing methods.
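To illustrate why a proportion test naturally takes sample size into account, the sketch below implements the standard (textbook) pooled two-sample proportion z-test. It is not Bartender's modified test, whose details are not given in this abstract; the function name `two_proportion_z_test` and the example counts are assumptions chosen for illustration only.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Textbook pooled two-sample proportion z-test (NOT Bartender's modified test).

    x1, x2: number of "successes" observed in each sample.
    n1, n2: sample sizes.
    Returns (z statistic, two-sided p-value) for H0: p1 == p2.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1 = x1 / n1
    p2 = x2 / n2
    # Pooled proportion under the null hypothesis that both samples share one p.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:
        # Both samples are all successes or all failures: no evidence of a difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: the same observed proportions (0.2 vs 0.1) at two sample sizes.
z_small, p_small = two_proportion_z_test(2, 10, 1, 10)
z_large, p_large = two_proportion_z_test(2000, 10000, 1000, 10000)
print(f"small samples: z = {z_small:.2f}, p = {p_small:.3f}")
print(f"large samples: z = {z_large:.2f}, p = {p_large:.3g}")
```

With identical proportions, the difference is not significant for the small samples but is highly significant for the large ones, which is the sense in which a proportion-based test is sensitive to size as well as to the observed difference.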

Availability and implementation: Bartender is available at no charge for non-commercial use at https://github.com/LaoZZZZZ/bartender-1.1.

Contact: sasha.levy@stonybrook.edu or song.wu@stonybrook.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

Fig. 1.
Bartender speed. (a) Running time for Bartender, Starcode, SEED and cd-hit clustering on simulated data with barcodes of different lengths, using four threads (t = 4). Bartender was run with a seed length of 5 (l = 5). SEED clustering of the 26 base pair library took 122 min to complete. (b) Bartender performance on simulated 38 base pair barcodes using a variable number of threads (t) and seed lengths (l). Input/output (I/O) time is shown in dark orange and clustering time is shown in light orange. For t = 4, a laptop equipped with a 3.0 GHz Intel Core i7 and 16 GB memory was used. For t = 12, a desktop equipped with a 3.5 GHz 6-core Intel Xeon E5 and 64 GB memory was used.
Fig. 2.
Bartender accuracy on simulated data. The number of false negatives (a) (missing barcodes) and false positives (b) (spurious barcode clusters) for Bartender, Starcode, SEED and cd-hit clustering algorithms. Numbers above each bar are the barcode cluster counts. (c) Barcode clusters binned by counting error relative to the true counts in the simulated dataset. Perfect indicates no error in barcode counts. (d) A scatter plot of the estimated and true counts of each barcode cluster for Bartender (orange) and Starcode (blue). Low frequency barcodes have high errors with Starcode but not Bartender. Note log scales in all plots.
Fig. 3.
The impact of sequencing depth on Bartender performance. A plot of the barcode count versus the coefficient of variation (CV) for that count on simulated data with a 2% (a) and 0.33% (b) combined error rate of PCR and sequencing (sequencing error). The black lines are theoretical values, which follow the binomial distribution. The blue lines are the CV of sampling (at the sequencer) alone, without sequencing errors or errors introduced by Bartender clustering. The red lines are the CV after running Bartender, and include sampling, sequencing and clustering errors. All lines are smoothed with window size 0.5.
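The theoretical binomial CV mentioned in this caption can be worked out directly. As a minimal sketch, assuming a barcode at relative frequency p is sampled in N reads, its count is Binomial(N, p) with mean Np and variance Np(1 - p), so CV = sqrt((1 - p) / (Np)), which is roughly 1 / sqrt(expected count) for rare barcodes. The function name `binomial_cv` and the example depths are illustrative assumptions, not values from the paper.

```python
import math


def binomial_cv(total_reads, frequency):
    """Coefficient of variation of a Binomial(total_reads, frequency) count.

    CV = standard deviation / mean = sqrt((1 - p) / (N * p)).
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0.0 < frequency <= 1.0:
        raise ValueError("frequency must be in (0, 1]")
    mean = total_reads * frequency
    variance = mean * (1.0 - frequency)
    return math.sqrt(variance) / mean


# Hypothetical example: one barcode at frequency 1e-4 at increasing read depths.
for depth in (10_000, 1_000_000, 100_000_000):
    expected = depth * 1e-4
    print(f"depth = {depth:>11,}  expected count = {expected:>8.0f}  "
          f"CV = {binomial_cv(depth, 1e-4):.4f}")
```

Under these assumptions, each 100-fold increase in depth lowers the sampling CV about 10-fold, which is the floor against which sequencing and clustering errors are compared.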
Fig. 4.
Bartender and Starcode performance on real data. (a) Venn diagram of the number of clusters identified by Bartender and Starcode. (b) Histogram of the number of counts for barcodes identified by Starcode but not Bartender (blue), and Bartender but not Starcode (orange). (c) Position weight matrices of the highlighted clusters from (d) and the count of each cluster. The first cluster is generated by Starcode, is variable at multiple nucleotide positions, matches the sequence of the first Bartender cluster (second black triangle), and incorporates two additional unique clusters that are distinguished by Bartender (red square and red star). All Bartender clusters display low variation at each nucleotide position. (d) Scatter plot of the counts of each barcode detected by either Bartender or Starcode. Highlighted points (black triangle, red star and square) are a representative example of Starcode over-clustering and are further explained in (c).
Fig. 5.
Bartender performance on time course data. A simulation was performed of 100 000 barcoded cells with different fitness coefficients that are evolved in competition for 112 generations. (a) Lineage trajectories of 1000 randomly selected barcodes (colors). Solid lines are trajectories estimated by Bartender and dashed lines are the true trajectories. (b) The number of barcodes present in the pool at a count greater than 1 (grey) and the number detected following Bartender (orange) or simple (blue) merging.

