Biology (Basel). 2025 Jun 9;14(6):670.
doi: 10.3390/biology14060670.

Analysis of Software Read Cross-Contamination in DNBSEQ Data

Dmitry N Konanov et al. Biology (Basel). 2025.

Abstract

DNA nanoball sequencing (DNBSEQ) is one of the most rapidly developing sequencing technologies and is widely applied in genomic and transcriptomic investigations. Recently, a new PE300 sequencing option, primarily recommended for amplicon analysis, was released for DNBSEQ-G99 and G400 devices. Given their unprecedentedly high data yield per flow cell, the new PE300 kits could be a great choice for various sequencing tasks, but we found that combining different types of DNA libraries in a single run could lead to undesired artifacts in the data. In this study, we investigate the occasional read cross-contamination that we first observed in our DNBSEQ PE300 run. The phenomenon, which we refer to as "software contamination", is not actual contamination but primarily manifests as improper forward/reverse read pairing, improper demultiplexing, or "digital chimeric" reads. Although rare, these artifacts were found in all runs we analyzed, including several MGI demo datasets (both PE100 and PE150). We demonstrate that these artifacts arise primarily from the incorrect resolution of sequencing signals produced by neighboring DNA nanoballs, which leads to mix-ups of forward and reverse reads or to improper demultiplexing. The artifacts occur most frequently in read pairs whose insert sequence is shorter than the read length. Based on a few external NA12878 human exome sequencing datasets, we conclude that the total improper pairing rate in DNBSEQ data is comparable to that in Illumina data. Overall, the problem affects analysis results only when simultaneously sequenced libraries have markedly different insert size distributions or flow cell loading. Additionally, we demonstrate that raw DNBSEQ data might contain ~2% optical duplicates, which result from the same effect of closely neighboring DNB sites in the flow cell.

Keywords: DNBSEQ; data filtering; read duplicates; sequencing artifacts.
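
The demultiplexing check described in the abstract relies on reads whose insert is shorter than the read length, so that the barcode sequence itself appears in the read. A minimal sketch of such a check follows, assuming reads are already loaded as plain strings. The barcode names and sequences are hypothetical toy values, not real MGI barcodes, and the function names are illustrative rather than taken from the authors' pipeline.

```python
from collections import Counter

def count_observed_barcodes(reads, barcodes):
    """Count reads by the known barcode found as an exact substring.

    reads: iterable of read sequences (str)
    barcodes: dict mapping barcode name -> barcode sequence (toy values here)
    Reads with no recognizable barcode, or with several, are skipped.
    """
    counts = Counter()
    for seq in reads:
        hits = [name for name, bc in barcodes.items() if bc in seq]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

def mismatch_rate(counts, target):
    """Fraction of barcode-carrying reads whose barcode differs from the target."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - counts[target] / total

# Hypothetical barcodes and reads that were all demultiplexed as "104".
toy_barcodes = {"104": "ACGTACGTAA", "77": "TTGGCCAATT", "42": "GATTACAGAT"}
toy_reads = [
    "CCC" + toy_barcodes["104"] + "GG",  # carries the expected barcode
    "AA" + toy_barcodes["104"],          # carries the expected barcode
    "TT" + toy_barcodes["77"] + "CC",    # carries a foreign barcode
    "GGGGGGGGGGGG",                      # insert too long, no barcode visible
]
counts = count_observed_barcodes(toy_reads, toy_barcodes)
rate = mismatch_rate(counts, "104")
print(counts, round(rate, 3))
```

The sketch only matches exact barcode sequences on one strand. A real analysis would also need to handle the reverse complement and the position of the barcode relative to the technical sequence.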

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
An example of improper read pairing in barcode 104. (A) The assembly graph obtained from the reads demultiplexed as barcode 104. The target plasmid with the included human papillomavirus was assembled correctly, but there were a number of contaminating contigs from objects sequenced under other barcodes. (B) An example of a read pair demultiplexed as barcode 104, where the forward read contains a non-target object and a non-target barcode 77 sequence. Here, the reverse read seems to be correctly demultiplexed. (C1) The counts of forward reads in barcode 104 that contained exact barcode sequences different from the target. For example, 279 forward reads demultiplexed as barcode 104 actually carried barcode 42. In total, 4.84% of the reads in which the barcode sequence could be observed had incorrect barcodes. (C2) The same analysis performed on the reverse reads. The percentage of improperly demultiplexed reverse reads was 1.25%.
Figure 2
(A1–A4) Improper read demultiplexing in external demo datasets provided by MGI. Four demo read archives representing whole-genome sequencing of the NA12878 reference sample (both PE150 and PE100) were downloaded from CNGB for this analysis. Only reads with very short inserts were considered, so that the reads included the MGI technical sequence. (B) An example of improper pairing, identical to that observed in our PE300 run.
Figure 3
The distribution of estimated insert sizes in read pairs mapped to Densovirinae. (A) The insert size distribution in the target barcode 77 (virome). A visible decrease in the number of inserts shorter than 260 can be observed. (B) The insert size distribution in the non-target barcode 65 (whole-genome sequencing (WGS) metagenome). Most read pairs that mapped to contaminating Densovirinae had an insert size below 260.
Figure 4
(A1,A2) An example of incorrect sequencing signal resolution. The neighboring reverse reads (A2) are duplicates, while the forward reads (A1) with the same IDs share no similarity. Here, the red substring in the read headers is the field of view (FOV), and the blue substring is the unique read identifier. (B1,B2) The distribution of the difference between IDs in duplicated reads. The main mode equals 1, so in most cases duplicated reads have neighboring IDs. Four additional modes might represent more distant neighborhoods of DNA nanoballs in the flow cell. (B1) has the original y-axis limits, while for (B2) the y-axis limits were set so that the four additional modes (indicated by a red rectangle) are pronounced. (C1,C2) The same analysis of an MGI demo dataset (CNR0104869). There, the minor modes were the same but more intense. The four modes marked in red had a doubled difference value compared with those marked in green, probably representing an additional order of neighborhood.
Figure 5
Examples of chimeric and broken forward reads in the DNBSEQ-G400 PE300 run. (A) An example of a chimeric read that contains two different inserts. (B) An example of a read containing two different barcode sequences. (C) An example of a read with a broken technical sequence.
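
The ID-difference analysis described in the Figure 4 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: read headers are assumed to be already parsed into hypothetical `(fov, read_id, sequence)` tuples, duplicates are taken to be identical sequences, and comparison is restricted to reads within the same field of view.

```python
from collections import Counter, defaultdict

def duplicate_id_differences(records):
    """Histogram of ID gaps between duplicated reads in the same field of view.

    records: iterable of (fov, read_id, sequence) tuples, where read_id is an int.
    For each group of identical sequences in one FOV, the IDs are sorted and
    the differences between consecutive IDs are counted.
    """
    groups = defaultdict(list)
    for fov, read_id, seq in records:
        groups[(fov, seq)].append(read_id)

    diffs = Counter()
    for ids in groups.values():
        ids.sort()
        for first, second in zip(ids, ids[1:]):
            diffs[second - first] += 1
    return diffs

# Toy records with hypothetical FOV labels and IDs.
toy_records = [
    ("C001R001", 10, "ACGT"),
    ("C001R001", 11, "ACGT"),  # duplicate of ID 10, neighboring ID
    ("C001R001", 12, "GGCC"),
    ("C001R001", 50, "GGCC"),  # duplicate of ID 12, distant ID
    ("C002R001", 11, "ACGT"),  # same sequence but another FOV, not compared
    ("C001R001", 13, "TTTT"),  # unique read
]
id_diffs = duplicate_id_differences(toy_records)
print(sorted(id_diffs.items()))
```

A dominant count at a difference of 1 in real data would correspond to the main mode described in the caption, with any further peaks corresponding to the minor modes.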
