[Preprint]. 2023 Feb 15:2023.02.13.528389.
doi: 10.1101/2023.02.13.528389.

Pairtools: from sequencing data to chromosome contacts


Open2C et al. bioRxiv.

Update in

  • Pairtools: From sequencing data to chromosome contacts.
    Open2C; Abdennur N, Fudenberg G, Flyamer IM, Galitsyna AA, Goloborodko A, Imakaev M, Venev SV. PLoS Comput Biol. 2024 May 29;20(5):e1012164. doi: 10.1371/journal.pcbi.1012164. PMID: 38809952. Free PMC article.

Abstract

The field of 3D genome organization produces large amounts of sequencing data from Hi-C and a rapidly expanding set of other chromosome conformation capture protocols (3C+). Massive and heterogeneous 3C+ data require high-performance and flexible processing of sequenced reads into contact pairs. To meet these challenges, we present pairtools, a suite of tools for contact extraction from sequencing data. Pairtools provides modular command-line interface (CLI) tools that can be chained into data processing pipelines. It includes both core tools and auxiliary tools for building feature-rich 3C+ pipelines, covering contact pair manipulation, filtration, and quality control. Benchmarking pairtools against popular 3C+ data pipelines shows advantages of pairtools for high-performance and flexible 3C+ analysis. Finally, pairtools provides protocol-specific tools for multi-way contacts, haplotype-resolved contacts, and single-cell Hi-C. The combination of CLI tools and tight integration with Python data analysis libraries makes pairtools a versatile foundation for a broad range of 3C+ pipelines.

Figures

Figure 1. Processing 3C+ data using pairtools.
a. Outline of 3C+ data processing leveraging pairtools. First, a sequenced DNA library is mapped to the reference genome with sequence alignment software, typically using bwa mem for local alignment. Next, pairtools extracts contacts from the alignments in .sam/.bam format. Pairtools outputs a tab-separated .pairs file that records each contact together with additional information about its alignments. A .pairs file can then be converted into a binned matrix of contact counts by other software, such as cooler. The top row describes the steps of the procedure; the middle row describes the software and chain of files; the bottom row depicts an example of each file type. b. Three main steps of contact extraction by pairtools: parse, sort, and dedup. Parse takes alignments of reads as input and extracts the pairs of contacts. In the illustration, alignments are represented as triangles pointing in the direction of read mapping to the reference genome; each row is a pair extracted from one read. The color represents the genomic position of the alignment with the smallest coordinate, from the leftmost coordinate on the chromosome (orange) to the rightmost coordinate on the chromosome (violet). Sort orders mapped pairs by their position in the reference genome. Before sorting, pairs are ordered by the reads from which they were extracted. After sorting, pairs are ordered by chromosome and genomic coordinate. Dedup removes duplicates (pairs with the same or very close mapping positions). The bracket marks two orange pairs with very close mapping positions, one of which is removed by dedup.
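The sort and dedup steps described in panel b can be illustrated with a minimal Python sketch. This is a toy model and not the pairtools implementation: the `Pair` record, the `sort_pairs` and `dedup_pairs` helpers, and the `max_mismatch` tolerance of 3 bp are assumptions made for illustration, and the example pairs are invented. Pairs are assumed to be already ordered so that side 1 precedes side 2.

```python
from typing import List, NamedTuple


class Pair(NamedTuple):
    """Hypothetical minimal contact record (not the full .pairs schema)."""
    read_id: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    strand1: str
    strand2: str


def sort_pairs(pairs: List[Pair]) -> List[Pair]:
    """Order pairs by chromosome pair, then by genomic coordinates."""
    return sorted(pairs, key=lambda p: (p.chrom1, p.chrom2, p.pos1, p.pos2))


def dedup_pairs(sorted_pairs: List[Pair], max_mismatch: int = 3) -> List[Pair]:
    """Drop pairs whose two sides both map within max_mismatch bp of a kept pair."""
    kept: List[Pair] = []
    for p in sorted_pairs:
        is_dup = False
        # Walk back through kept pairs while they can still be close to p.
        for q in reversed(kept):
            same_chroms = (q.chrom1, q.chrom2) == (p.chrom1, p.chrom2)
            if not same_chroms or p.pos1 - q.pos1 > max_mismatch:
                break
            same_strands = (q.strand1, q.strand2) == (p.strand1, p.strand2)
            if same_strands and abs(p.pos2 - q.pos2) <= max_mismatch:
                is_dup = True
                break
        if not is_dup:
            kept.append(p)
    return kept


# Invented example input, in read order (as it would come out of parsing).
pairs = [
    Pair("r1", "chr2", 500, "chr2", 900, "+", "-"),
    Pair("r2", "chr1", 100, "chr1", 5000, "+", "-"),
    Pair("r3", "chr1", 101, "chr1", 5002, "+", "-"),  # near-copy of r2
    Pair("r4", "chr1", 100, "chr2", 300, "-", "+"),
]

ordered = sort_pairs(pairs)
deduped = dedup_pairs(ordered)
print([p.read_id for p in ordered])  # ['r2', 'r3', 'r4', 'r1']
print([p.read_id for p in deduped])  # ['r2', 'r4', 'r1']
```

Sorting first is what makes this kind of deduplication cheap: candidate duplicates end up next to each other, so each pair only needs to be compared with a short window of recently kept pairs.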
Figure 2. Parsing contact pairs and walks from alignments.
a. Parsing is the first step of contact extraction and can be done by either parse or parse2 in pairtools. The choice of tool depends on the read length and the abundance of multiple ligation events. For single ligation events, pairtools reports the type of pair (in the example here, both alignments are unique, UU; other types are listed in Supplementary Figure 1a–g). For multiple ligation events, pairtools distinguishes the ligation type of the pair (walk_pair_type: R1 and R2, direct ligations observed on the first or the second side of the read; R1–2, an unobserved ligation that may be indirect) and reports the order of the pair in a walk (walk_pair_index). For other examples, see Supplementary Figure 1d–g. b. Resolving multi-way contacts with 3C+ methods. In a 3C+ library, a multi-way contact is captured as a chimeric DNA molecule. Each end of a DNA fragment can be ligated to its neighbor only once, i.e. hop to another DNA fragment. Contacts between consecutively ligated fragments of the chimera are called “direct” (i.e., directly ligated, 1-hops); those between non-adjacent fragments are “indirect” (2-, 3- and many-hops). In paired-end sequencing, a fraction of the molecule remains unsequenced and may contain DNA fragments (missing contacts). As a result, the contact between the two fragments abutting the gap is called “unobserved”, and it may be either direct or indirect. c. Contact recovery relative to default pairtools settings for different parsing modes of long walks in paired-end 3C+ data with reads of ~150–250 bp.
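The direct, indirect, and unobserved contact types from panel b can be made concrete with a short Python sketch. It is an illustration of the terminology only and not how parse or parse2 work internally: the function `classify_walk`, its `gap_after` argument, and the fragment labels are hypothetical.

```python
from itertools import combinations
from typing import List, Optional, Tuple

Contact = Tuple[str, str, int, str]  # (fragment_a, fragment_b, hops, kind)


def classify_walk(fragments: List[str], gap_after: Optional[int] = None) -> List[Contact]:
    """Enumerate all pairwise contacts in one chimeric molecule.

    fragments: fragment labels in the order they were ligated.
    gap_after: index i such that the unsequenced gap lies between
               fragments[i] and fragments[i + 1]; None if fully sequenced.
    """
    contacts: List[Contact] = []
    for i, j in combinations(range(len(fragments)), 2):
        hops = j - i  # a lower bound when the pair spans the gap
        abuts_gap = gap_after is not None and i == gap_after and j == gap_after + 1
        if abuts_gap:
            kind = "unobserved"  # may be direct or indirect in reality
        elif hops == 1:
            kind = "direct"
        else:
            kind = "indirect"
        contacts.append((fragments[i], fragments[j], hops, kind))
    return contacts


# Paired-end read: A-B seen on side 1, C-D seen on side 2, gap between B and C.
contacts = classify_walk(["A", "B", "C", "D"], gap_after=1)
for contact in contacts:
    print(contact)
```

In this example, A-B and C-D correspond to the R1 and R2 pair types from panel a, and B-C corresponds to R1–2. For pairs that span the gap, the hop count is only a minimum, because unsequenced fragments may sit inside the gap.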
Figure 3. Auxiliary tools for building feature-rich pipelines.
a. Header verifies and modifies the .pairs format. b–d. Flip, select, and sample are for pair manipulation. e–f. Scaling and stats are used for quality control. For scaling, we report scalings for all pair orientations (+−, −+, ++, −−) as well as the average trans contact frequency. Orientation convergence distance is calculated as the largest genomic separation at which the scalings for different orientations still differ. g–h. Restrict and phase are protocol-specific tools that extend pairtools usage to multiple 3C+ variants.
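Flipping and orientation-resolved scaling can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pairtools code: the tuple layout, the `flip` and `orientation_counts` helpers, and the decade-wide distance bins are all hypothetical, the example pairs are invented, and the counts are left unnormalized, whereas a real scaling divides by the number of possible locus pairs per bin.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical record layout: (chrom1, pos1, strand1, chrom2, pos2, strand2)
PairT = Tuple[str, int, str, str, int, str]


def flip(pair: PairT, chrom_order: Dict[str, int]) -> PairT:
    """Swap the two sides so the side with the smaller coordinate comes first."""
    chrom1, pos1, strand1, chrom2, pos2, strand2 = pair
    if (chrom_order[chrom1], pos1) > (chrom_order[chrom2], pos2):
        return (chrom2, pos2, strand2, chrom1, pos1, strand1)
    return pair


def orientation_counts(pairs: List[PairT]) -> Counter:
    """Count cis pairs per (distance decade, orientation); trans pairs are skipped."""
    counts: Counter = Counter()
    for chrom1, pos1, strand1, chrom2, pos2, strand2 in pairs:
        dist = abs(pos2 - pos1)
        if chrom1 != chrom2 or dist == 0:
            continue
        decade = len(str(dist)) - 1  # 100..999 -> 2, 1000..9999 -> 3, ...
        counts[(decade, strand1 + strand2)] += 1
    return counts


chrom_order = {"chr1": 0, "chr2": 1}
raw = [
    ("chr1", 9000, "-", "chr1", 1000, "+"),  # needs flipping
    ("chr1", 100, "+", "chr1", 350, "-"),
    ("chr2", 50, "+", "chr1", 70, "-"),      # trans, needs flipping
    ("chr1", 2000, "+", "chr1", 2500, "-"),
]

flipped = [flip(p, chrom_order) for p in raw]
counts = orientation_counts(flipped)
print(flipped[0])   # ('chr1', 1000, '+', 'chr1', 9000, '-')
print(dict(counts))  # {(3, '+-'): 1, (2, '+-'): 2}
```

Flipping before counting matters because the orientation label depends on which side is listed first: the first pair reads as −+ before flipping and as +− after it.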
Figure 4. Benchmark of different Hi-C mapping tools on one million reads over 5 iterations.
a. Runtime per tool and number of cores. The labels at each bar of the time plot indicate the slowdown relative to Chromap with the same number of cores. b. Maximum resident set size for each tool and number of cores. c. Runtime per tool and number of cores compared to the runtime of the corresponding mapper (gray shaded areas). Labels at the bars reflect the percentage of time used by the mapper versus the time used by the pair parsing tool. d. Maximum resident set size for each tool and number of cores compared with that of the corresponding mapper. To make the comparison possible, the analysis for each tool starts with .fastq files, and the time includes both read alignment and pairs parsing. For pairtools, we tested the performance with regular bwa mem and with bwa mem2, which is ~2x faster but consumes more memory. Note that for HiC-Pro, we benchmark the original version and not the recently rewritten nextflow version that is part of nf-core. FANC, in contrast to other modular 3C+ pairs processing tools, requires an additional step to sort .bam files before parsing pairs, which we include in the benchmark. For Juicer, we use the “early” mode. Chromap is not included in the comparison with mappers because it is itself an integrated mapper.
