Bioinformatics. 2018 Mar 1;34(5):739-747.

doi: 10.1093/bioinformatics/btx655.

Bartender: a fast and accurate clustering algorithm to count barcode reads

Lu Zhao¹, Zhimin Liu²,³, Sasha F Levy²,³, Song Wu¹

Affiliations

¹ Department of Applied Mathematics and Statistics.
² Laufer Center for Physical and Quantitative Biology.
³ Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, NY 11794, USA.

PMID: 29069318
PMCID: PMC6049041
DOI: 10.1093/bioinformatics/btx655

Lu Zhao et al. Bioinformatics. 2018.


Abstract

Motivation: Barcode sequencing (bar-seq) is a high-throughput and cost-effective method to assay large numbers of cell lineages or genotypes in complex cell pools. Because of these advantages, applications of bar-seq are growing quickly, from using neutral random barcodes to study the evolution of microbes or cancer to using pseudo-barcodes, such as shRNAs or sgRNAs, to simultaneously screen large numbers of cell perturbations. However, the computational pipelines for bar-seq clustering are not well developed. Available methods often yield a high frequency of under-clustering artifacts that result in spurious barcodes, or over-clustering artifacts that group distinct barcodes together. Here, we developed Bartender, an accurate clustering algorithm to detect barcodes and their abundances from raw next-generation sequencing data.

Results: In contrast with existing methods that cluster based on sequence similarity alone, Bartender uses a modified two-sample proportion test that also considers cluster size. This modification results in higher accuracy and lower rates of under- and over-clustering artifacts. Additionally, Bartender includes unique molecular identifier handling and a 'multiple time point' mode that matches barcode clusters between different clustering runs for seamless handling of time course data. Bartender is a set of simple-to-use command line tools that can be run on a laptop, with run times comparable to those of existing methods.
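To illustrate why a proportion test is sensitive to sample size, the following is a minimal sketch of the textbook pooled two-sample proportion z-test. It is not Bartender's modified test, whose details are not given in this abstract; the function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 == p2 (textbook form, not Bartender's variant).

    x1, x2 are event counts; n1, n2 are the corresponding sample sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0.0:
        # Both samples are all-zero or all-one, so the proportions are equal.
        return 0.0
    return (p1 - p2) / se


def two_sided_p(z):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


# Same observed proportions (10% vs 5%) at two hypothetical sample sizes.
z_small = two_proportion_z(2, 20, 1, 20)
z_large = two_proportion_z(200, 2000, 100, 2000)
print("small samples:", z_small, two_sided_p(z_small))
print("large samples:", z_large, two_sided_p(z_large))
```

With identical observed proportions, the statistic grows with the square root of the sample size, so a difference that is indistinguishable from noise in small samples becomes significant in large ones. This is the general reason a size-aware test can behave differently from a fixed sequence-distance threshold.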

Availability and implementation: Bartender is available at no charge for non-commercial use at https://github.com/LaoZZZZZ/bartender-1.1.

Contact: sasha.levy@stonybrook.edu or song.wu@stonybrook.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

Figures

**Fig. 1.**
Bartender speed. (a) Running time for Bartender, Starcode, SEED and cd-hit clustering on simulated data of barcodes of different lengths using four threads (t = 4). Bartender was run with a seed length of 5 (l = 5). SEED clustering of the 26 base pair library took 122 min to complete. (b) Bartender performance on simulated 38 base pair barcodes using a variable number of threads (t) and seed lengths (l). Input/output (I/O) time is shown in dark orange and clustering time is shown in light orange. For t = 4, a laptop equipped with a 3.0 GHz Intel Core i7 and 16 GB memory was used. For t = 12, a desktop equipped with a 3.5 GHz 6-core Intel Xeon E5 and 64 GB memory was used.

**Fig. 2.**
Bartender accuracy on simulated data. The number of false negatives (a) (missing barcodes) and false positives (b) (spurious barcode clusters) for Bartender, Starcode, SEED and cd-hit clustering algorithms. Numbers above each bar are the barcode cluster counts. (c) Barcode clusters binned by counting error relative to the true counts in the simulated dataset. Perfect indicates no error in barcode counts. (d) A scatter plot of the estimated and true counts of each barcode cluster for Bartender (orange) and Starcode (blue). Low frequency barcodes have high errors with Starcode but not Bartender. Note log scales in all plots.

**Fig. 3.**
The impact of sequencing depth on Bartender performance. The coefficient of variation (CV) of each barcode count plotted against that count, on simulated data with a 2% (a) and 0.33% (b) combined error rate of PCR and sequencing (sequencing error). The black lines are theoretical values, which follow the Binomial distribution. The blue lines are the CV of sampling (at the sequencer) alone, without sequencing errors or errors introduced by Bartender clustering. The red lines are the CV after running Bartender, and include sampling, sequencing and clustering errors. All lines are smoothed with window size 0.5.
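For reference, the binomial CV mentioned in the Fig. 3 caption can be written out as follows. The symbols are assumptions introduced here, not notation from the paper: N is the total number of reads sampled, p is the true frequency of a barcode in the pool, and X is its observed read count.

```latex
\[
X \sim \mathrm{Binomial}(N, p), \qquad
\mathrm{E}[X] = Np, \qquad
\mathrm{Var}[X] = Np(1-p),
\]
\[
\mathrm{CV} = \frac{\sqrt{\mathrm{Var}[X]}}{\mathrm{E}[X]}
            = \sqrt{\frac{1-p}{Np}}
            \approx \frac{1}{\sqrt{\mathrm{E}[X]}} \quad \text{for } p \ll 1 .
\]
```

Under this model the CV from sampling alone falls as the expected count rises, roughly as the inverse square root of the count for rare barcodes.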

**Fig. 4.**
Bartender and Starcode performance on real data. (a) Venn diagram of the number of clusters identified by Bartender and Starcode. (b) Histogram of the number of counts for barcodes identified by Starcode but not Bartender (blue), and Bartender but not Starcode (orange). (c) Position weight matrices of the highlighted clusters from (d) and the count of each cluster. The first cluster is generated by Starcode, is variable at multiple nucleotide positions, matches the sequence of the first Bartender cluster (second black triangle), and incorporates two additional unique clusters that are distinguished by Bartender (red square and red star). All Bartender clusters display low variation at each nucleotide position. (d) Scatter plot of the counts of each barcode detected by either Bartender or Starcode. Highlighted points (black triangle, red star and square) are a representative example of Starcode over-clustering and are further explained in (c).

**Fig. 5.**
Bartender performance on time course data. A simulation was performed of 100 000 barcoded cells with different fitness coefficients that were evolved in competition for 112 generations. (a) Lineage trajectories of 1000 randomly selected barcodes (colors). Solid lines are trajectories estimated by Bartender and dashed lines are the true trajectories. (b) The number of barcodes present in the pool at a count greater than 1 (grey) and the number detected following Bartender (orange) or simple (blue) merging.
