Bioinformatics. 2009 Nov 1;25(21):2865-71.
doi: 10.1093/bioinformatics/btp394. Epub 2009 Jun 26.

Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads

Kai Ye et al. Bioinformatics.

Abstract

Motivation: There is a strong demand in the genomic community to develop effective algorithms to reliably identify genomic variants. Indel detection using next-gen data is difficult and identification of long structural variations is extremely challenging.

Results: We present Pindel, a pattern growth approach, to detect breakpoints of large deletions and medium-sized insertions from paired-end short reads. We use both simulated reads and real data to demonstrate the efficiency of the computer program and accuracy of the results.

Availability: The binary code and a short user manual can be freely downloaded from http://www.ebi.ac.uk/~kye/pindel/.

Contact: k.ye@lumc.nl; zn1@sanger.ac.uk.

Figures

Fig. 1.
Detecting deletion events. (a) When paired-end reads are mapped to the reference genome, some reads cannot be mapped, even when a few mismatches are allowed, because they span the break points of a deletion event. If such a read can be split at the right position into two fragments that map separately, the exact break points and the fragment deleted relative to the reference can be computed. The more supporting reads are found, the more likely it is that the deletion is present in the test sample. For simplicity, only one mapped read is depicted (green arrow). (b) The procedure for splitting the unmapped read into two parts at the appropriate position and mapping them separately to the reference genome. The location and direction of the mapped read (green) define the local region in which the unmapped read is split and mapped. The 3′ end of the mapped read is defined as the anchor point. Pattern growth is first used to search for the minimum and maximum unique substrings from the 3′ end of the unmapped read within a range of twice the insert size, starting from the anchor point. Pattern growth is then used again to search for the minimum and maximum unique substrings from the 5′ end of the unmapped read within a range of read length + user-defined maximum deletion size, starting from the already mapped 3′ end of the unmapped read. The minimum and maximum substrings computed from the 3′ and 5′ ends are examined to see whether a complete unmapped read can be assembled. All possible solutions are stored in a database and sorted by break point coordinates. A deletion event is reported if at least two reads support it.
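
The split-read procedure in Fig. 1(b) can be illustrated with a minimal sketch. This is not Pindel's implementation: it works on plain forward-strand strings, ignores mismatches and read orientation, and treats the left end of the read as the end nearest the anchor. The names `unique_growth` and `find_deletion`, the toy sequences, and the 0-based coordinates are all assumptions made for illustration.

```python
def unique_growth(ref, read, lo, hi, step):
    """Grow a match one base at a time from one end of `read`.

    step=+1 grows a prefix rightwards, step=-1 grows a suffix leftwards.
    Start positions are restricted to ref[lo:hi]. Returns
    (ref position of the first matched base, min unique length,
    max matching length), or None if the match never becomes unique.
    """
    first = 0 if step == 1 else len(read) - 1
    cands = [p for p in range(max(lo, 0), min(hi, len(ref)))
             if ref[p] == read[first]]
    length, min_len = 1, None
    while cands:
        if len(cands) == 1 and min_len is None:
            min_len = length  # shortest substring with a single hit
        if length == len(read):
            break
        grown = [p for p in cands
                 if 0 <= p + step * length < len(ref)
                 and ref[p + step * length] == read[first + step * length]]
        if not grown:
            break  # cannot extend further: current length is the maximum
        cands, length = grown, length + 1
    return None if min_len is None else (cands[0], min_len, length)


def find_deletion(ref, read, anchor, insert_size, max_del):
    """Return (first deleted base, first base after the deletion) or None."""
    near = unique_growth(ref, read, anchor, anchor + 2 * insert_size, +1)
    if near is None:
        return None
    start, near_min, near_max = near
    far = unique_growth(ref, read, start, start + len(read) + max_del, -1)
    if far is None:
        return None
    end, far_min, far_max = far
    # Try every split point at which the two fragments cover the whole read.
    for k in range(near_min, near_max + 1):
        rest = len(read) - k
        if far_min <= rest <= far_max:
            del_start, del_end = start + k, end - rest + 1
            if del_end > del_start:
                return del_start, del_end
    return None


# Toy example (hypothetical sequences): the read spans a 12-base deletion.
left, deleted, right = "CATGCAGG", "TACCTGAAGTCT", "ATTCAGGC"
ref = "TTGAC" + left + deleted + right + "TAACGTTCGGATACCGTA"
read = left + right
print(find_deletion(ref, read, anchor=0, insert_size=10, max_del=20))
```

In the caption's procedure, every solution found this way would be stored and sorted by break point coordinates, and a deletion would be reported only when at least two reads agree on it; the sketch handles a single read.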
Fig. 2.
Detecting short insertion events. The procedure for splitting the unmapped read into three parts at the appropriate positions and mapping the two terminal parts separately to the reference genome. The location and direction of the mapped read (green) define the local region in which the unmapped read is split. The 3′ end of the mapped read is defined as the anchor point. Pattern growth is first used to search for the minimum and maximum unique substrings from the 3′ end of the unmapped read within a range of twice the insert size, starting from the anchor point. Pattern growth is then used again to search for the minimum and maximum unique substrings from the 5′ end of the unmapped read within a range of read length – 1, starting from the already mapped 3′ end of the unmapped read. The minimum and maximum substrings computed from the 3′ and 5′ ends are examined to see whether they are adjacent to each other on the reference. The middle fragment is the inserted fragment. All possible solutions are stored in a database and sorted by break point coordinates. An insertion event is reported if at least two reads support it.
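
The three-way split in Fig. 2 can be sketched in the same spirit. To keep the example short, the pattern growth search is replaced here by a brute-force scan with `str.find`, and the search windows, mismatches and read orientation are ignored. The function name `find_insertion`, the `min_frag` parameter and the toy sequences are assumptions for illustration, not part of Pindel.

```python
def find_insertion(ref, read, min_frag=3):
    """Return (junction position in ref, inserted sequence) or None.

    The read is split into left fragment + inserted bases + right fragment.
    The left fragment must match the reference uniquely, and the right
    fragment must match immediately after it, so that the two terminal
    fragments are adjacent on the reference.
    """
    n = len(read)
    # Prefer the longest left fragment, leaving room for at least one
    # inserted base and a right fragment of min_frag bases.
    for a in range(n - min_frag - 1, min_frag - 1, -1):
        left = read[:a]
        pos = ref.find(left)
        if pos == -1 or ref.find(left, pos + 1) != -1:
            continue  # left fragment is absent or not unique
        junction = pos + a
        # Prefer the longest right fragment that starts at the junction.
        for b in range(n - a - 1, min_frag - 1, -1):
            if ref.startswith(read[n - b:], junction):
                return junction, read[a:n - b]
    return None


# Toy example (hypothetical sequences): "AAC" is inserted after ref[12].
ref = "TTGACCATGCAGGTACCTGAAGTCTATTCAGGCTAACGTTCGGATACCGTA"
read = ref[5:13] + "AAC" + ref[13:21]
print(find_insertion(ref, read))
```

Because the inserted bases must lie entirely within one read, this approach can only recover insertions shorter than the read length, which is consistent with the caption's restriction to short insertions.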
Fig. 3.
An example output of Pindel. The type and size of deletion are specified first (D 321). Then the chromosome ID, coordinates of the break points and the number of reads supporting every event are given. The mapping directions of the mapped reads and their 3′ coordinates on the reference are reported for each supporting read.
Fig. 4.
Simulation of paired-end reads from human chromosome X. (a–f) True positive rates for each deletion size in the presence of sequencing errors and SNPs when Max_D_Size is increased from 10 to 1 000 000 bp. The impact of sequencing errors and/or SNPs on the overall true positive rates (g) and false discovery rates (h) when Max_D_Size is set to different values from 10 to 1 000 000 bp. (i) The impact of sequencing errors and/or SNPs on the true positive rates for detecting medium-sized insertions of 1 to 20 bp.
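
The two rates plotted in Fig. 4 can be computed from a set of simulated events and a set of predicted events. The sketch below uses the standard definitions (true positive rate = TP / number of true events, false discovery rate = FP / number of predictions). The caption does not say how a prediction is matched to a simulated event; counting only exact break point matches is an assumption here, as are the function name `tpr_fdr` and the example coordinates.

```python
def tpr_fdr(predicted, truth):
    """Return (true positive rate, false discovery rate).

    Events are (start, end) break point pairs; a prediction counts as a
    true positive only if it matches a simulated event exactly
    (an assumption made for this sketch).
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)
    tpr = tp / len(truth) if truth else 0.0
    fdr = (len(predicted) - tp) / len(predicted) if predicted else 0.0
    return tpr, fdr


# Hypothetical events: four simulated deletions, three predictions.
truth = [(13, 25), (100, 421), (500, 510), (900, 1000)]
predicted = [(13, 25), (100, 421), (700, 705)]
print(tpr_fdr(predicted, truth))
```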
Fig. 5.
Plots of deletion size distribution for NA18507 from 1 to 10 000 bp. (a) The frequency of each deletion size from 1 to 10 000 bp. Adjacent dots are connected. There is a peak around 300 bp, which may contain hundreds of putative SINEs. (b) The sum of frequencies over each 20-base bin is plotted. The peaks for putative SINEs and LINEs are visible.
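
The binning used for panel (b) amounts to summing the per-size frequencies over consecutive 20-base intervals. A minimal sketch follows; the bin boundaries (1–20, 21–40, ...), the function name `bin_deletion_sizes` and the example sizes are assumptions, since the caption does not state where the bins start.

```python
from collections import Counter


def bin_deletion_sizes(sizes, bin_width=20):
    """Map the first size of each bin (1, 21, 41, ...) to its event count."""
    counts = Counter((s - 1) // bin_width * bin_width + 1 for s in sizes)
    return dict(sorted(counts.items()))


# Hypothetical deletion sizes in bp.
sizes = [1, 20, 21, 300, 305, 319, 321]
print(bin_deletion_sizes(sizes))
```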
Fig. 6.
Runtime and memory consumption for Pindel applied to the NA18507 data on a single CPU for mining indels with different Max_D_Size (10 bp, 100 bp, 1 kb and 10 kb). (a) The user runtime for Pindel is divided into three categories: (i) loading the reference genome and the reads into memory; (ii) breaking unmapped reads and mapping the fragments separately using pattern growth; (iii) sorting break points by coordinate and writing the results to disk. (b) Maximum memory consumption for Pindel to process the NA18507 data with different Max_D_Size (10 bp, 100 bp, 1 kb and 10 kb).
