2024 May 29;20(5):e1012164.
doi: 10.1371/journal.pcbi.1012164. eCollection 2024 May.

Pairtools: From sequencing data to chromosome contacts

Open2C et al. PLoS Comput Biol.

Abstract

The field of 3D genome organization produces large amounts of sequencing data from Hi-C and a rapidly expanding set of other chromosome conformation protocols (3C+). Massive and heterogeneous 3C+ data require high-performance and flexible processing of sequenced reads into contact pairs. To meet these challenges, we present pairtools, a flexible suite of tools for contact extraction from sequencing data. Pairtools provides modular command-line interface (CLI) tools that can be chained into data processing pipelines. The core operations provided by pairtools are parsing of .sam alignments into Hi-C pairs, sorting, and removal of PCR duplicates. In addition, pairtools provides auxiliary tools for building feature-rich 3C+ pipelines, including contact pair manipulation, filtration, and quality control. Benchmarking pairtools against popular 3C+ data pipelines shows advantages of pairtools for high-performance and flexible 3C+ analysis. Finally, pairtools provides protocol-specific tools for restriction-based protocols, haplotype-resolved contacts, and single-cell Hi-C. The combination of CLI tools and tight integration with Python data analysis libraries makes pairtools a versatile foundation for a broad range of 3C+ pipelines.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Processing 3C+ data using pairtools.
a. Outline of 3C+ data processing leveraging pairtools. First, a sequenced DNA library is mapped to the reference genome with sequence alignment software, typically using bwa mem for local alignment. Next, pairtools extracts contacts from the alignments in .sam/.bam format. Pairtools outputs a tab-separated .pairs file that records each contact with additional information about alignments. A .pairs file can be saved as a binned contact matrix of counts with other software, such as cooler. The top row describes the steps of the procedure; the middle row describes the software and chain of files; the bottom row depicts an example of each file type. b. Three main steps of contact extraction by pairtools: parse, sort, and dedup. Parse takes alignments of reads as input and extracts the pairs of contacts. In the illustration, alignments are represented as triangles pointing in the direction of read mapping to the reference genome; each row is a pair extracted from one read. The color represents the genomic position of the alignment with the smallest coordinate, from the leftmost coordinate on the chromosome (orange) to the rightmost coordinate on the chromosome (violet). Sort orders mapped pairs by their position in the reference genome. Before sorting, pairs are ordered by the reads from which they were extracted. After sorting, pairs are ordered by chromosome and genomic coordinate. Dedup removes duplicates (pairs with the same or very close mapping positions). The bracket marks two orange pairs with very close mapping positions, one of which is removed by dedup.
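The sort and dedup steps described in panel b can be illustrated with a minimal pure-Python sketch. It does not call pairtools and is not its implementation: the Pair record, the lexicographic chromosome ordering, the max_mismatch tolerance, and the greedy comparison against already-kept pairs are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Pair(NamedTuple):
    """One contact: a hypothetical record, not the pairtools data model."""
    read_id: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    strand1: str
    strand2: str


def sort_pairs(pairs: List[Pair]) -> List[Pair]:
    """Order pairs by chromosome and genomic coordinate (lexicographic chromosomes)."""
    return sorted(pairs, key=lambda p: (p.chrom1, p.chrom2, p.pos1, p.pos2))


def dedup_pairs(sorted_pairs: List[Pair], max_mismatch: int = 3) -> List[Pair]:
    """Drop pairs whose two sides both map within max_mismatch bp of a kept pair.

    Expects input sorted by sort_pairs, so candidate duplicates are neighbors.
    """
    kept: List[Pair] = []
    for p in sorted_pairs:
        is_duplicate = False
        # Walk back over kept pairs until they are too far away on side 1.
        for q in reversed(kept):
            if (q.chrom1, q.chrom2) != (p.chrom1, p.chrom2):
                break
            if p.pos1 - q.pos1 > max_mismatch:
                break
            same_strands = (q.strand1, q.strand2) == (p.strand1, p.strand2)
            if same_strands and abs(p.pos2 - q.pos2) <= max_mismatch:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(p)
    return kept


# Toy input in read order, as it would come out of a parsing step.
pairs = [
    Pair("r1", "chr2", 500, "chr2", 9000, "+", "-"),
    Pair("r2", "chr1", 100, "chr1", 5000, "+", "-"),
    Pair("r3", "chr1", 101, "chr1", 5002, "+", "-"),  # near-identical to r2
    Pair("r4", "chr1", 100, "chr1", 5000, "-", "+"),  # same positions, other strands
]

unique = dedup_pairs(sort_pairs(pairs))
print([p.read_id for p in unique])
```

In this toy input, r3 is dropped as a duplicate of r2, while r4 is kept because its strands differ. Sorting first is what lets deduplication compare only nearby records instead of all pairs against all pairs.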
Fig 2. Auxiliary tools for building feature-rich pipelines.
a. Header verifies and modifies the .pairs format. b-d. Flip, select, and sample are for pairs manipulation. e-f. Scaling and stats are used for quality control. For scaling, we report scalings for all pair orientations (+-, -+, ++, --) as well as the average trans contact frequency. Orientation convergence distance is calculated as the largest genomic separation at which the scalings for different orientations still differ. g-h. Restrict and phase are protocol-specific tools that extend pairtools usage to multiple 3C+ variants.
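The per-orientation scaling mentioned for panels e-f can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pairtools algorithm: the tuple layout, the decade-wide distance bins, the normalization by bin width only, and the use of the trans fraction as a stand-in for the average trans contact frequency are simplifications introduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical pair layout: (chrom1, pos1, chrom2, pos2, strand1, strand2)
PairTuple = Tuple[str, int, str, int, str, str]


def scaling_by_orientation(
    pairs: List[PairTuple],
) -> Tuple[Dict[str, Dict[int, float]], float]:
    """Count cis pairs per orientation and distance decade; report trans fraction.

    Decade d covers separations in [10**d, 10**(d + 1)) bp. Counts are divided
    by the bin width (9 * 10**d) so that bins of different size are comparable.
    """
    counts: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    n_trans = 0
    for chrom1, pos1, chrom2, pos2, strand1, strand2 in pairs:
        if chrom1 != chrom2:
            n_trans += 1
            continue
        dist = abs(pos2 - pos1)
        if dist == 0:
            continue
        # Read the orientation with the lower-coordinate side first.
        if pos1 > pos2:
            strand1, strand2 = strand2, strand1
        decade = len(str(dist)) - 1  # exact integer floor(log10(dist))
        counts[strand1 + strand2][decade] += 1

    scalings = {
        orientation: {d: c / (9 * 10 ** d) for d, c in sorted(bins.items())}
        for orientation, bins in counts.items()
    }
    trans_fraction = n_trans / len(pairs) if pairs else 0.0
    return scalings, trans_fraction


toy_pairs = [
    ("chr1", 1000, "chr1", 1500, "+", "-"),  # cis, 500 bp apart
    ("chr1", 2000, "chr1", 2900, "+", "-"),  # cis, 900 bp apart
    ("chr1", 9000, "chr1", 4000, "+", "-"),  # cis, 5000 bp apart, sides reversed
    ("chr1", 100, "chr2", 100, "+", "+"),    # trans
]

scalings, trans_fraction = scaling_by_orientation(toy_pairs)
for orientation, bins in scalings.items():
    print(orientation, bins)
print("trans fraction:", trans_fraction)
```

Comparing the resulting curves across the four orientations, bin by bin, is the kind of check that the orientation convergence distance summarizes: at short separations the orientations typically differ, and beyond some separation they agree.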
Fig 3. Benchmark of different Hi-C mapping tools on one million reads over 5 iterations (data from [64]).
a. Runtime per tool and number of cores. The labels at each bar of the time plot indicate the slowdown relative to Chromap [58] with the same number of cores. b. Maximum resident set size for each tool and number of cores. c. Runtime per tool and number of cores compared to the runtime of the corresponding mapper (gray shaded areas). Labels at the bars reflect the percentage of time used by the mapper versus the time used by the pair parsing tool. d. Maximum resident set size for each tool and number of cores compared with that of the corresponding mapper. To make the comparison possible, the analysis for each tool starts with .fastq files, and the time includes both read alignment and pairs parsing. For pairtools, we tested the performance with regular bwa mem [34] and bwa mem2 [35], which is ~2x faster but consumes more memory. Note that for HiC-Pro, we benchmark the original version and not the recently rewritten nextflow [65] version that is part of nf-core [66]. FANC, in contrast to other modular 3C+ pairs processing tools, requires an additional step to sort .bam files before parsing pairs, which we include in the benchmark. For Juicer, we use the “early” mode. Chromap is not included in this comparison because it is an integrated mapper [58].

References

    1. Akgol Oksuz B, Yang L, Abraham S, Venev SV, Krietenstein N, Parsi KM, et al. Systematic evaluation of chromosome conformation capture assays. Nat Methods. 2021;18: 1046–1055. doi: 10.1038/s41592-021-01248-7
    2. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326: 289–293. doi: 10.1126/science.1181369
    3. Cohen NM, Olivares-Chauvet P, Lubling Y, Baran Y, Lifshitz A, Hoichman M, et al. SHAMAN: bin-free randomization, normalization and screening of Hi-C matrices. bioRxiv. 2017. p. 187203. doi: 10.1101/187203
    4. Spill YG, Castillo D, Vidal E, Marti-Renom MA. Binless normalization of Hi-C data provides significant interaction and difference detection independent of resolution. Nat Commun. 2019;10: 1938. doi: 10.1038/s41467-019-09907-2
    5. Abdennur N, Mirny LA. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. 2020;36: 311–316. doi: 10.1093/bioinformatics/btz540