Nat Commun. 2022 Apr 12;13(1):1948.

doi: 10.1038/s41467-022-29518-8.

Population-scale long-read sequencing uncovers transposable elements associated with gene expression variation and adaptive signatures in Drosophila

Affiliations

¹ Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra), 08003, Barcelona, Spain.
² Université Paris-Saclay, INRAE, URGI, 78026, Versailles, France.
³ Institute of Evolutionary Biology (CSIC-Universitat Pompeu Fabra), 08003, Barcelona, Spain. josefa.gonzalez@ibe.upf-csic.es.

PMID: 35413957
PMCID: PMC9005704
DOI: 10.1038/s41467-022-29518-8

Gabriel E Rech et al. Nat Commun. 2022.

Abstract

High-quality reference genomes are crucial to understanding genome function, structure and evolution. The availability of reference genomes has allowed us to start inferring the role of genetic variation in biology, disease, and biodiversity conservation. However, analyses across organisms demonstrate that a single reference genome is not enough to capture the global genetic diversity present in populations. In this work, we generate 32 high-quality reference genomes for the well-known model species D. melanogaster and focus on the identification and analysis of transposable element variation, as transposable elements are the most common type of structural variant. We show that integrating the genetic variation across natural populations from five climatic regions increases the number of detected insertions by 58%. Moreover, 26% to 57% of the insertions identified using long reads were missed by short-read methods. We also identify hundreds of transposable elements associated with gene expression variation and new TE variants likely to contribute to adaptive evolution in this species. Our results highlight the importance of incorporating the genetic variation present in natural populations into genomic studies, which is essential if we are to understand how genomes function and evolve.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1. Geographical location of the 12 *D. melanogaster* natural populations analyzed in this work.**
The 32 sequenced and assembled genomes correspond to strains obtained from: Tenerife, Spain: TEN (1), Munich, Germany: MUN (6), Gimenells, Spain: GIM (2), Raleigh, USA: RAL (8), Cortes de Baza, Spain: COR (4), Tomelloso, Spain: TOM (2), Jutland, Denmark: JUT (2), Stockholm, Sweden: STO (1), Lund, Sweden: LUN (2), Slankamen, Serbia: SLA (1), Kiev, Ukraine: KIE (1) and Akka, Finland: AKA (2). The number of genomes sequenced from each location is given in brackets. Map colors represent different climatic regions according to the Köppen climate classification (Supplementary Data 1).

**Fig. 2. Comparison between the TE annotation in FlyBase and the TE annotation performed using *REPET* with the MCTE library.**
Copies annotated with REPET are shown in blue and copies annotated by FlyBase in the reference genome are shown in red. a Number of TE copies per family. b Overlap of TE annotations, requiring that the copies belong to the same family and overlap over at least 95% of their lengths (breadth of coverage). TEs shorter than 100 bp, TEs belonging to the INE-1 family, and nested TEs were excluded from the analysis. c Distribution of the number of TE copies by length, in 500 bp bins.

**Fig. 3. Three new TE families in *D. melanogaster*.**
a Schematic representation of the structural features detected by *PASTEC* in the consensus sequences of the three new families identified in this study. b Length ratio (size as proportion of the consensus) distribution for TE copies annotated in the 32 genomes with each of the three new consensus sequences.

**Fig. 4. TE annotations at the superfamily level.**
a Principal component analysis based on TE insertion polymorphisms grouped by continent (colors) and climatic zone (shapes). b The proportion of TE copies annotated for each superfamily. c Per-genome pairwise comparisons of the proportion of copies annotated at the superfamily level. The colors of the matrix squares represent adjusted (FDR) p-values of the two-sided chi-square test. Only one significant result was observed (adjusted p-value = 0.03), between ISO1 and MUN-009. d Representation of the Pearson residuals (r) for each cell (superfamily-genome pair). Cells with the highest residuals contribute the most to the total chi-square score. Positive residuals (red) represent more copies than expected, while negative residuals (blue) represent fewer copies than expected (this does not imply statistical significance). e Distribution of TE insertion identity values classified by superfamily and considering all genomes together. The boxplot shows the median (the horizontal line in the box), 1st and 3rd quartiles (lower and upper bounds of the box, respectively), and minimum and maximum (lower and upper whiskers, respectively). The number of copies analyzed per superfamily is given in Supplementary Data 9c.
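The Pearson residuals mentioned in panel d follow the standard definition for a contingency table: for each cell, r = (observed − expected) / √expected, where expected = row total × column total / grand total, and the chi-square statistic is the sum of the squared residuals. The sketch below illustrates that textbook formula only; it is not the authors' code, the function names `pearson_residuals` and `chi_square` are hypothetical, and the example counts are invented for illustration. It assumes every row and column total is positive.

```python
from math import sqrt


def pearson_residuals(table):
    """Return Pearson residuals for a contingency table given as a list of rows.

    Assumes all row and column totals are positive (illustrative sketch only).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            out.append((observed - expected) / sqrt(expected))
        residuals.append(out)
    return residuals


def chi_square(table):
    """Chi-square statistic as the sum of squared Pearson residuals."""
    return sum(r * r for row in pearson_residuals(table) for r in row)


# Invented toy counts: rows are two genomes, columns are two superfamilies.
toy_counts = [[10, 20], [30, 40]]
toy_residuals = pearson_residuals(toy_counts)
toy_statistic = chi_square(toy_counts)
```

Cells with a large absolute residual dominate the statistic, which is why the caption notes that they contribute the most to the total chi-square score.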

**Fig. 5. TE classification according to three frequency classes: rare (present in <10% of the strains), common (present in ≥10% and ≤95% of the strains) and fixed (present in >95% of the strains).**
a Number of TEs and their classification according to their frequency in the population, using from 5 to 47 strains. The standard deviation was calculated by taking 30 random samples of strains for each case. Data are presented as median values ± standard deviation. b Intersection of the different sets of common TEs identified when taking 10, 20, 30, 40 and 47 strains at random. c Venn diagrams depicting the intersection of orthologous TEs defined by geographic origin. The ALL diagram represents all TEs regardless of their frequency class, while the rare, common and fixed diagrams are defined by the TEs of each of the classes in each set.
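The three frequency classes in the Fig. 5 title can be written as a small decision rule. The sketch below is an illustration under stated assumptions, not the authors' implementation: it reads the rare threshold as 10% of the strains (in line with the percentage bounds given for the other two classes), the function name `classify_te` is hypothetical, and integer arithmetic is used so that values at the boundaries are not affected by floating-point rounding.

```python
def classify_te(n_present: int, n_strains: int) -> str:
    """Classify a TE insertion by the share of strains that carry it.

    Assumed thresholds: rare < 10%, common 10% to 95% inclusive, fixed > 95%.
    """
    if n_strains <= 0 or not 0 <= n_present <= n_strains:
        raise ValueError("need 0 <= n_present <= n_strains and n_strains > 0")
    # Compare n_present / n_strains against 10% and 95% without dividing.
    if n_present * 100 < 10 * n_strains:
        return "rare"
    if n_present * 100 <= 95 * n_strains:
        return "common"
    return "fixed"


# With 47 strains, the class boundaries fall between 4 and 5 strains
# and between 44 and 45 strains.
classes_for_47 = {n: classify_te(n, 47) for n in (4, 5, 44, 45)}
```

Because the class depends on the number of strains sampled, the same insertion can change class as strains are added, which is the effect panel a explores by resampling from 5 to 47 strains.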

**Fig. 6. Gene expression levels in strains with and without TE insertions.**
Gene expression levels in strains without (gray) and with (red) the 13 TE insertions with the most significant association according to our eQTL analysis, and for the *3L_14050243_14050245_pogo* insertion with evidence of selection (last plot). The name of each TE insertion and its genomic location relative to the associated gene are provided. In total, the expression levels of 20 strains are plotted. The boxplot shows the median (the horizontal line in the box), 1st and 3rd quartiles (lower and upper bounds of the box, respectively), and minimum and maximum (lower and upper whiskers, respectively).

**Fig. 7. Significantly enriched terms for genes near 107 TEs showing evidence of selection.**
Each panel shows significantly enriched terms obtained using a different approach. a DAVID GO Biological Process: the horizontal axis represents the DAVID enrichment score. Only significant (score > 1.3) and non-redundant clusters are shown. FlyEnrichr results when using different libraries: b Anatomy GeneRIF Predicted, c Allele LoF Phenotypes from FlyBase, d Putative Regulatory miRNAs from DroID and e Transcription Factors from DroID. Only statistically significant terms are shown (two-sided Fisher test, adjusted p-value < 0.05). The horizontal axis represents the *Enrichr* Combined Score. For Regulatory miRNAs and Transcription Factors, putative biological functions or associated phenotypes were assigned based on FlyBase gene summaries. Bar colors indicate similar biological functions, as specified at the bottom of the figure.
