Nat Commun. 2023 Jun 5;14(1):3243.
doi: 10.1038/s41467-023-38870-2.

INSurVeyor: improving insertion calling from short read sequencing data

Ramesh Rajaby et al. Nat Commun.

Abstract

Insertions are one of the major types of structural variations and are defined as the addition of 50 nucleotides or more into a DNA sequence. Several methods exist to detect insertions from next-generation sequencing short read data, but they generally have low sensitivity. Our contribution is two-fold. First, we introduce INSurVeyor, a fast, sensitive and precise method that detects insertions from next-generation sequencing paired-end data. Using publicly available benchmark datasets (both human and non-human), we show that INSurVeyor is not only more sensitive than any individual caller we tested, but also more sensitive than all of them combined. Furthermore, for most types of insertions, INSurVeyor is almost as sensitive as long-read callers. Second, we provide state-of-the-art catalogues of insertions for 1047 Arabidopsis thaliana genomes from the 1001 Genomes Project and 3202 human genomes from the 1000 Genomes Project, both generated with INSurVeyor. We show that they are more complete and precise than existing resources, and that important insertions are missed by existing methods.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Overview of the INSurVeyor method.
The method can be essentially divided into three blocks: (a) INSurVeyor extracts discordant pairs and clipped reads as possible evidence of insertions; (b–d) the evidence extracted by (a) is used to generate the alternative allele sequence, which consists of the predicted inserted sequence along with the two flanking regions shared with the reference genome. This is achieved by three separate modules: the remapping module (b) aims at predicting transpositions; the local assembly module (c) aims at predicting novel insertions, while the consensus overlap module (d) predicts small insertions. (e) This sequence is then remapped to the reference genome to identify the precise boundaries of the predicted insertion, which is finally passed through a series of filters (f) that aim at reducing the number of false positive calls.
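The clipped-read evidence mentioned in block (a) can be illustrated with a small sketch. This is not INSurVeyor's code: it only assumes that alignments are available as SAM-style CIGAR strings, in which the `S` operation marks soft-clipped bases. The function names and the `min_clip` threshold of 5 bases are hypothetical, chosen for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clips(cigar: str) -> tuple[int, int]:
    """Return (left, right) soft-clip lengths of a SAM CIGAR string."""
    # Hard clips (H) may sit outside soft clips, so drop them first.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    if not ops:
        return (0, 0)  # e.g. "*" for an unaligned read
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


def is_clipped(cigar: str, min_clip: int = 5) -> bool:
    """True if either end is soft-clipped by at least min_clip bases.

    min_clip is a hypothetical threshold, not a value from the paper.
    """
    left, right = soft_clips(cigar)
    return max(left, right) >= min_clip


print(soft_clips("20S80M"))   # left-clipped read
print(is_clipped("100M"))     # fully aligned read
```

A read whose clipped end overhangs a candidate breakpoint is the kind of evidence the later modules would consume; discordant pairs would need a separate test on mate position and orientation, which is not sketched here.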
Fig. 2. Performance of the tested tools on the HG002 benchmark.
a Sensitivity, precision and F1-score of individual callers. INSurVeyor has a much higher sensitivity (0.72) than the other tools, and extremely high precision (0.98). This predictably results in the highest F1-score (0.83). b The number of predicted TPs and the running time in minutes for INSurVeyor and for different combinations of existing tools (sorted by number of TPs, top 20 shown). INSurVeyor alone predicts more true positives than all the other tools combined, while using a fraction of the running time. c The number of calls that are uniquely contributed by each caller. Notably, INSurVeyor contributes more than 700 true positives that are missed by all of the other tools. No other tested caller performs similarly. d Sensitivity when using the strict criterion. INSurVeyor is still more sensitive than other tools.
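The F1-score quoted in panel a follows from the reported sensitivity and precision, since F1 is their harmonic mean. A minimal check, using only the two values from the caption (the function name is illustrative):

```python
def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# Values reported for INSurVeyor on the HG002 benchmark.
f1 = f1_score(precision=0.98, sensitivity=0.72)
print(round(f1, 2))  # 0.83
```

Because the harmonic mean is pulled towards the smaller of its two inputs, the F1-score here is limited mainly by sensitivity rather than by precision.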
Fig. 3. Performance of the tested tools on different types of inserted sequences, and comparison with Sniffles2.
The benchmark insertions are partitioned into three types depending on whether the inserted sequence is (a) a mobile element (SINE, LINE or SVA), (b) low complexity or (c) other, i.e., none of the previous categories. The sensitivity of different tools was assessed for each type. INSurVeyor performs better than other short-read-based methods in every single class. Furthermore, with the exception of low complexity sequences, INSurVeyor predicts most of the insertions detected by methods that use long-read datasets.
Fig. 4. Performance of the tested tools on the HGSVC2 benchmark.
a Sensitivity, precision and b F1-score of Manta, MELT, the union of the two and INSurVeyor (10 randomly picked samples are displayed here; a summary for the 34 samples is presented in Supplementary Fig. 8). Results are consistent with what was observed on HG002. c Venn diagram of the true positive calls per sample that are called by different combinations of the tools, averaged over the 34 genomes.
Fig. 5. Performance of IndelEnsembler and INSurVeyor on different plant samples.
a Performance of IndelEnsembler and INSurVeyor in predicting insertions in seven Arabidopsis thaliana genomes. INSurVeyor is more sensitive. Furthermore, it is more precise in all samples except one. b We compared the sensitivity of IndelEnsembler to that of INSurVeyor on two more plant species, B. napus and soybean, at different sequencing depths. In both species, INSurVeyor shows major improvements.
Fig. 6. Properties of the catalogue of insertions called in 1047 samples from the 1001 Genomes Project.
a Size distribution of the insertions discovered in 1047 samples of Arabidopsis thaliana. The most prominent peaks are caused by insertions of transposable elements. b The counts of different TE insertions in genic regions. Genic regions include 1.5 kb upstream of the gene body. c, d A significant locus for flowering time of Spain 2008 (c) and summer 2008 (d). Left, Manhattan plots of insertions genome-wide association studies for flowering time of Spain 2008 (c) and plant summer 2008 (d). Blue and red horizontal lines indicate the significance thresholds of GWAS (5.25 ⋅ 10⁻⁵ and 2.63 ⋅ 10⁻⁶, respectively). The vertical line represents the candidate gene AT3G27570 on chromosome 3. Center, QQ plots for flowering time of Spain 2008 (c) and plant summer 2008 (d). Right, Violin plots showing the flowering time of accessions with different AT3G27570 alleles for Spain 2008 (c) and plant summer 2008 (d) (P-values were determined using two-tailed Student’s t-tests). Flowering time is significantly delayed in samples with the alternative allele compared to those with the reference allele. Boxplots in (c) and (d) show median (inner line) and inner quartiles (box). Whiskers extend to the highest and lowest values no greater than 1.5 times the interquartile range.
Fig. 7. Properties of the catalogue of insertions called in 3202 samples from the 1000 Genomes Project.
a Number of insertions called per superpopulation. Africans consistently have a higher number of insertions than other superpopulations when compared to hg38. The boxes contain values from the lower to the upper quartile, the line within the box is the median and the whiskers extend by 1.5 times the interquartile range. Circles represent data points outside of the whiskers. b Length distribution of the inserted sequences. The ALU, SVA and LINE peaks are all clearly present. c Principal component analysis (PCA) of the distribution of the insertions in the population clearly separates the superpopulations. d Number of private and shared calls between the 1000g-SV and the INSurVeyor callsets. Between parentheses, the validation rates of calls in samples with long reads. Note that we match insertions as long as they are within 500 bp from each other, therefore a single insertion from 1000g-SV can match multiple insertions from INSurVeyor, and vice versa. For this reason, the number of 1000g-SV insertions with a match in INSurVeyor (45,340) is not the same as the number of INSurVeyor insertions with a match in 1000g-SV (53,946). Not only does INSurVeyor have a large number of private events (94,988 compared to 4353 private to 1000g-SV), but it also has a much higher validation rate. e When evaluated sample by sample using HGSVC2, INSurVeyor is consistently more sensitive and precise (10 randomly picked samples are shown here; a summary for the 34 samples is shown in Supplementary Fig. 9).
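The asymmetry between the two match counts in panel d is a direct consequence of distance-based matching, which is many-to-many. The sketch below illustrates this with toy coordinates; it is not the authors' pipeline. It assumes calls are reduced to `(chromosome, position)` pairs, that only position is compared, and that the 500 bp limit is inclusive. The function names and the toy data are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

MAX_DIST = 500  # matching distance stated in the caption


def index_by_chrom(calls):
    """Group call positions by chromosome and sort them."""
    idx = defaultdict(list)
    for chrom, pos in calls:
        idx[chrom].append(pos)
    for positions in idx.values():
        positions.sort()
    return idx


def count_matched(queries, targets, max_dist=MAX_DIST):
    """Count query calls with at least one target call within max_dist bp."""
    idx = index_by_chrom(targets)
    matched = 0
    for chrom, pos in queries:
        positions = idx.get(chrom, [])
        # First target at or after the left edge of the window.
        i = bisect_left(positions, pos - max_dist)
        if i < len(positions) and positions[i] <= pos + max_dist:
            matched += 1
    return matched


# Toy callsets: one call in set_a lies near two calls in set_b.
set_a = [("chr1", 1000)]
set_b = [("chr1", 900), ("chr1", 1300), ("chr2", 5000)]

print(count_matched(set_a, set_b))  # 1
print(count_matched(set_b, set_a))  # 2
```

One call in `set_a` accounts for two matched calls in `set_b`, so the two directions give different totals, as with the 45,340 and 53,946 figures in the caption.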
Fig. 8. Analysis of enriched regions and insertion types in the INSurVeyor dataset.
a Compared to the 1000g-SV dataset, INSurVeyor shows the most enrichment in SINE and low complexity regions (as annotated by RepeatMasker). b We classify insertions in SINE regions by repeat content of the inserted sequence. Most insertions into reference SINEs are by other SINE sequences. The second most frequent category is the insertion of low-complexity sequences, and they are mostly specific to INSurVeyor. c We identify 747 HGSVC2 calls as insertions of a low complexity sequence into a SINE. Our catalogue contains 73% of them, while only 20% are present in the 1000g-SV dataset. Only 3 are uniquely present in 1000g-SV and missed by INSurVeyor. d We observed that most (potentially all) insertions of low complexity sequences into SINE regions are due to STR expansions. One notable example is the expansion of the 3' tail of an AluSx1 element in an intron of the ATXN10 gene. Very large expansions (≥800 copies) of the ATTCT motif result in SCA10. e 34% of the SINE STR expansions are in intronic regions, and most of them are reported by neither 1000g-SV nor ExpansionHunter Denovo, a specialised tool. The validation rate when compared to HGSVC2 is 92%, which suggests that most detected expansions are true positives. Some intronic ALU STR expansions are known to cause neurodegenerative diseases. f Most expansions (74%) happen in the 3' tail of an ALU element. We consider the 3'-most 30 bp of an ALU to be its 3' tail.
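The 3' tail definition in panel f (the 3'-most 30 bp of an ALU) can be written as a small predicate. This is a sketch under stated assumptions, not the authors' implementation: coordinates are taken to be 0-based and half-open, the strand of the annotated element is assumed to be known (for example from a RepeatMasker annotation), and the function name is hypothetical.

```python
TAIL_LEN = 30  # 3' tail length as defined in the caption


def in_three_prime_tail(pos, alu_start, alu_end, strand, tail_len=TAIL_LEN):
    """True if pos lies in the 3'-most tail_len bp of an ALU element.

    Coordinates are assumed 0-based, half-open: [alu_start, alu_end).
    """
    if not (alu_start <= pos < alu_end):
        return False  # position is outside the element
    if strand == "+":
        # On the forward strand the 3' end is at the high coordinate.
        return pos >= max(alu_start, alu_end - tail_len)
    if strand == "-":
        # On the reverse strand the 3' end is at the low coordinate.
        return pos < min(alu_end, alu_start + tail_len)
    raise ValueError(f"unknown strand: {strand!r}")


# A 300 bp element on each strand, with an expansion at position 290.
print(in_three_prime_tail(290, 0, 300, "+"))
print(in_three_prime_tail(290, 0, 300, "-"))
```

Handling the strand explicitly matters because the 3' end of a reverse-strand element sits at its lowest reference coordinate.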
