Impact of next-generation sequencing error on analysis of barcoded plasmid libraries of known complexity and sequence

Claire T Deakin¹, Jeffrey J Deakin¹, Samantha L Ginn¹, Paul Young², David Humphreys², Catherine M Suter³, Ian E Alexander⁴, Claus V Hallwirth¹

Affiliations

¹ Gene Therapy Research Unit, Children's Medical Research Institute and The Children's Hospital at Westmead, Westmead, New South Wales 2145, Australia.
² Molecular Genetics Division, Victor Chang Cardiac Research Institute, Sydney, Darlinghurst, New South Wales 2010, Australia.
³ Molecular Genetics Division, Victor Chang Cardiac Research Institute, Sydney, Darlinghurst, New South Wales 2010, Australia; Faculty of Medicine, University of New South Wales, Kensington, New South Wales 2052, Australia.
⁴ Gene Therapy Research Unit, Children's Medical Research Institute and The Children's Hospital at Westmead, Westmead, New South Wales 2145, Australia; Discipline of Paediatrics and Child Health, The Children's Hospital at Westmead Clinical School, The University of Sydney, Westmead, New South Wales 2145, Australia. Correspondence: ian.alexander@health.nsw.gov.au

PMID: 25013183
PMCID: PMC4176369
DOI: 10.1093/nar/gku607

Claire T Deakin et al. Nucleic Acids Res. 2014;42(16):e129. doi: 10.1093/nar/gku607. Epub 2014 Jul 10.

Abstract

Barcoded vectors are promising tools for investigating clonal diversity and dynamics in hematopoietic gene therapy. Analysis of clones marked with barcoded vectors requires accurate identification of potentially large numbers of individually rare barcodes, when the exact number, sequence identity and abundance are unknown. This is an inherently challenging application, and the feasibility of using contemporary next-generation sequencing technologies is unresolved. To explore this potential application empirically, without prior assumptions, we sequenced barcode libraries of known complexity. Libraries containing 1, 10 and 100 Sanger-sequenced barcodes were sequenced using an Illumina platform, with a 100-barcode library also sequenced using a SOLiD platform. Libraries containing 1 and 10 barcodes were distinguished from false barcodes generated by sequencing error by a several log-fold difference in abundance. In 100-barcode libraries, however, expected and false barcodes overlapped and could not be resolved by bioinformatic filtering and clustering strategies. In independent sequencing runs, multiple false-positive barcodes appeared to be represented at higher abundance than known barcodes, despite their confirmed absence from the original library. Such errors, which potentially impact barcoding studies in an application-dependent manner, are consistent with the existence of both stochastic and systematic error, the mechanism of which is yet to be fully resolved.

Figures

**Figure 1.**
Experimental design and analytical workflow for analysis of the Illumina-compatible barcode. (A) Structure and sequence of the Illumina-compatible barcode insert cloned into the NsiI site of the pEF1α.γc lentiviral construct. The insert contained a PstI site, 32 bp of the Illumina adaptor sequence, a 16-bp random sequence that functioned as the lentiviral barcode and an 18-bp known sequence. Numbers indicate the position of every fifth random nucleotide in the barcode. The SOLiD-compatible barcode followed a similar configuration, with the insert containing a PstI site, 23 bp of the P1-T adaptor, a 15-bp random sequence for the lentiviral barcode and the internal adaptor. For both barcode configurations, the barcode regions were amplified with 10 PCR cycles using primers that introduced the adaptor sequences required for the Illumina or SOLiD platforms. (B) Strategy for analyzing sequence data for the Illumina-compatible barcode. Raw sequence reads were filtered using the known sequence immediately following the barcode at positions 17–30 to eliminate indel errors. The lentiviral barcode was trimmed to positions 2–16 to avoid errors at position 1. The number of unique barcode sequences was counted with and without phred score filtering (Q30), and with and without allowing one mismatch. For the SOLiD-compatible barcode, raw sequence reads were filtered using 10 internal adaptor sequences and the number of unique barcode sequences was counted with and without allowing one mismatch.
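The Illumina workflow in panel (B) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: `KNOWN_SEQ` is a made-up 14-nt stand-in for the real flanking sequence, the Q30 filter is assumed to require every barcode base to reach the threshold, and "allowing one mismatch" is assumed to mean merging a barcode into a more abundant barcode at Hamming distance 1. The function names are hypothetical.

```python
from collections import Counter

KNOWN_SEQ = "ACGTACGTACGTAC"  # hypothetical 14-nt stand-in for read positions 17-30


def extract_barcode(read, quals, known_seq=KNOWN_SEQ, min_q=None):
    """Return the trimmed barcode (positions 2-16, 1-based) or None if rejected."""
    # Positions 17-30 must match the known sequence exactly; an indel in the
    # barcode shifts this window, so the read is dropped.
    if len(read) < 30 or read[16:30] != known_seq:
        return None
    # Optional quality filter (ASSUMPTION: every barcode base must reach min_q).
    if min_q is not None and min(quals[1:16]) < min_q:
        return None
    return read[1:16]  # drop position 1, keep positions 2-16


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def collapse_one_mismatch(counts):
    """ASSUMED strategy: greedily merge each barcode into a more abundant one
    that differs from it at exactly one position."""
    merged = Counter()
    for seq, n in counts.most_common():
        parent = next((p for p in merged if hamming(p, seq) == 1), None)
        merged[parent if parent is not None else seq] += n
    return merged


# Toy reads: (sequence, per-base phred scores). All values are invented.
reads = [
    ("AACCGGTTAACCGGTT" + KNOWN_SEQ, [35] * 30),       # clean read
    ("TACCGGTTAACCGGTT" + KNOWN_SEQ, [35] * 30),       # error at position 1 only
    ("AACCAGTTAACCGGTT" + KNOWN_SEQ, [35] * 30),       # substitution at position 5
    ("AACCGGTTAACCGGT" + KNOWN_SEQ + "A", [35] * 30),  # 1-nt deletion in the barcode
    ("AACCGGTTAACCGGTT" + KNOWN_SEQ, [35] * 4 + [12] + [35] * 25),  # low-quality base
]

barcodes = (extract_barcode(r, q) for r, q in reads)
exact = Counter(b for b in barcodes if b)
barcodes_q30 = (extract_barcode(r, q, min_q=30) for r, q in reads)
q30 = Counter(b for b in barcodes_q30 if b)
collapsed = collapse_one_mismatch(exact)
```

In this toy input, the read with the deletion is rejected by the known-sequence check, the position-1 error disappears after trimming, and the position-5 substitution survives as a separate "false" barcode until one mismatch is allowed.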

**Figure 2.**
Distribution of the relative abundance of the 500 most abundant barcode sequences detected following analysis of the defined barcode libraries using different sequencing platforms. Libraries containing (A) 1, (B) 10 and (C) 100 defined Illumina-compatible barcode(s) sequenced in the first sequencing run. For the 100-barcode library, the first 89 most abundant barcodes matched expected sequences, and a point of inflection in the distribution of the relative frequencies of barcodes occurred at the 82nd-most abundant barcode. Six putatively false barcodes that were not in the 100-barcode library were detected within the top 100. (D) Library containing the same 100 defined Illumina-compatible barcodes sequenced in the second sequencing run after an independent amplification. Seven putatively false barcodes were detected within the top 100. A point of inflection occurred at the 79th-most abundant barcode and again, the first 89 most abundant barcodes matched expected sequences. (E) Library containing 100 defined SOLiD-compatible barcodes. The first 82 most abundant barcodes matched expected sequences; however, 13 putatively false barcodes were detected in the top 100. (F) Mean and range of relative abundances of expected and false barcodes, for each sample.
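The quantities reported in this caption (how many leading barcodes match expected sequences, and how many putatively false barcodes fall within the top 100) follow from ranking barcodes by relative abundance against the known library. A minimal sketch, assuming barcode counts are held in a dictionary and the Sanger-confirmed sequences in a set; the function name and toy data are hypothetical:

```python
def rank_barcodes(counts, expected, top_n=100):
    """Rank barcodes by relative abundance and flag those absent from the library.

    Returns (rows, false_in_top, leading_expected) where rows is a list of
    (sequence, relative_abundance, is_expected) in descending abundance.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    rows = [(seq, n / total, seq in expected) for seq, n in ranked]
    # Putatively false barcodes among the top_n most abundant sequences.
    false_in_top = sum(1 for _, _, ok in rows[:top_n] if not ok)
    # Number of top-ranked barcodes that match before the first false one.
    leading_expected = next(
        (i for i, row in enumerate(rows) if not row[2]), len(rows)
    )
    return rows, false_in_top, leading_expected


# Toy data (invented): three library barcodes and one sequencing artefact.
toy_counts = {"AAA": 50, "CCC": 30, "GGG": 15, "TTT": 5}
toy_expected = {"AAA", "CCC", "TTT"}
rows, false_in_top, leading_expected = rank_barcodes(toy_counts, toy_expected, top_n=3)
```

Here the artefact `GGG` outranks the genuine barcode `TTT`, which is the situation the 100-barcode panels describe: abundance alone does not separate expected from false barcodes.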

**Figure 3.**
Analysis of the relative abundance, GC content and likelihood of secondary structure formation for each of the 100 expected Illumina-compatible barcode sequences. (A) Relative abundance of the 100 expected barcode sequences, as detected during the first and second sequencing runs using the Illumina HiSeq 2000 (Pearson r(98) = 0.93, p < 0.0001). (B) Distribution of the relative abundance of each barcode sequence as a function of the percentage GC content of that sequence. (C) Distribution of the relative abundance of each barcode sequence as a function of the MFE value calculated for that sequence. MFE values provide an estimate of the likelihood of secondary structure formation, with lower values associated with a higher likelihood.
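Two of the quantities in this caption are simple to compute: the percentage GC content of a barcode (panel B) and the Pearson correlation between per-barcode abundances in two runs (panel A, where r(98) indicates n - 2 = 98 degrees of freedom for 100 barcodes). The sketch below uses only the standard library and invented toy values; the function names are hypothetical, and MFE calculation is not covered because it requires a dedicated folding tool.

```python
from math import sqrt


def gc_percent(seq):
    """Percentage of G or C bases in a sequence."""
    return 100.0 * sum(base in "GC" for base in seq.upper()) / len(seq)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy relative abundances of the same barcodes in two runs (invented values).
run1 = [1.0, 2.0, 3.0]
run2 = [2.0, 4.0, 6.0]
r = pearson_r(run1, run2)
gc = gc_percent("GGCCAATT")
```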

**Figure 4.**
Analysis of the position and substitution-like type of error for all one-mismatch sequence errors for both Illumina HiSeq 2000 sequencing runs. One-mismatch errors were compared to the known barcode sequences from which they were derived. Errors from the first sequencing run represent the sum of one-mismatch errors after Q30 quality filtering for the one-barcode sample and 10- and 100-barcode libraries, although one-mismatch errors from the 100-barcode library comprise 95.3% of all errors. Errors from the second sequencing run represent one-mismatch errors after Q30 quality filtering for the 100-barcode library. (A) Distribution of one-mismatch errors across each position of the barcode (positions 2–16 of the sequence reads). This distribution differed significantly from an expected even distribution (χ² = 30 064, df = 14, p < 0.0001 for the first sequencing run; χ² = 90 717, df = 14, p < 0.0001 for the second sequencing run). (B) Distribution of each possible substitution-like error type. This distribution also differed significantly from an expected even distribution (χ² = 26 127, df = 11, p < 0.0001 for the first sequencing run; χ² = 82 229, df = 11, p < 0.0001 for the second sequencing run). df, degrees of freedom.
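The degrees of freedom quoted above follow from the number of categories: 15 barcode positions give df = 14, and 12 possible substitution types (4 bases × 3 alternatives) give df = 11. A minimal sketch of locating a one-mismatch error and computing the chi-square goodness-of-fit statistic against an even distribution is shown below. The function names and toy counts are hypothetical, and the p-value is left out because it needs a chi-square distribution function from a statistics library.

```python
SUBSTITUTION_TYPES = [f"{a}>{b}" for a in "ACGT" for b in "ACGT" if a != b]


def describe_mismatch(known, observed, first_position=2):
    """Return (read position, substitution type) for a one-mismatch error.

    first_position is the read position of the first trimmed barcode base
    (2, because the barcode is trimmed to positions 2-16).
    """
    diffs = [i for i, (k, o) in enumerate(zip(known, observed)) if k != o]
    if len(known) != len(observed) or len(diffs) != 1:
        raise ValueError("sequences must differ at exactly one position")
    i = diffs[0]
    return i + first_position, f"{known[i]}>{observed[i]}"


def chi_square_uniform(observed_counts):
    """Chi-square statistic and degrees of freedom against an even distribution."""
    k = len(observed_counts)
    expected = sum(observed_counts) / k
    stat = sum((o - expected) ** 2 / expected for o in observed_counts)
    return stat, k - 1


position, sub_type = describe_mismatch("ACGT", "ACTT")
stat, df = chi_square_uniform([10, 20, 30])  # toy counts, not the paper's data
```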
