mBio. 2022 Dec 20;13(6):e0231922.
doi: 10.1128/mbio.02319-22. Epub 2022 Oct 20.

Accessing the Variability of Multicopy Genes in Complex Genomes using Unassembled Next-Generation Sequencing Reads: The Case of Trypanosoma cruzi Multigene Families

João Luís Reis-Cunha et al. mBio. 2022.

Abstract

Repetitive elements cause assembly fragmentation in complex eukaryotic genomes, limiting the study of their variability. The genome of Trypanosoma cruzi, the parasite that causes Chagas disease, has a high repetitive content, including multigene families. Although many T. cruzi multigene families encode surface proteins that play pivotal roles in host-parasite interactions, their variability is currently underestimated, as their high repetitive content results in collapsed gene variants. To estimate sequence variability and copy number variation of multigene families, we developed a read-based approach that is independent of gene-specific read mapping and de novo assembly. This methodology was used to estimate the copy number and variability of MASP, TcMUC, and transsialidase (TS), the three largest T. cruzi multigene families, in 36 strains, including members of all six parasite discrete typing units (DTUs). We found that these three families present a specific pattern of variability and copy number among the distinct parasite DTUs. Inter-DTU hybrid strains presented a higher variability of these families, suggesting that maintaining a larger content of their members could be advantageous. In addition, in a chronic murine model and chronic Chagasic human patients, the immune response was focused on TS antigens, suggesting that targeting TS conserved sequences could be a potential avenue to improve diagnosis and vaccine design against Chagas disease. Finally, the proposed approach can be applied to study multicopy genes in any organism, opening new avenues to access sequence variability in complex genomes.

IMPORTANCE Sequences that have several copies in a genome, such as multicopy-gene families, mobile elements, and microsatellites, are among the most challenging genomic segments to study. They are frequently underestimated in genome assemblies, hampering the correct assessment of these important players in genome evolution and adaptation. Here, we developed a new methodology to estimate variability and copy numbers of repetitive genomic regions and employed it to characterize the T. cruzi multigene families MASP, TcMUC, and transsialidase (TS), which are important virulence factors in this parasite. We showed that multigene families vary in sequence and content among the parasite's lineages, whereas hybrid strains have a higher sequence variability that could be advantageous to the parasite's survivability. By identifying conserved sequences within multigene families, we showed that the mammalian host immune response toward these multigene families is usually focused on the TS multigene family. These TS conserved and immunogenic peptides can be explored in future work as diagnostic targets or vaccine candidates for Chagas disease. Finally, this methodology can be easily applied to any organism of interest, which will aid in our understanding of complex genomic regions.

Keywords: MASP; T. cruzi; antigenicity; complex genomes; copy number variation; mucins; multicopy genes; transsialidases; variability.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

FIG 1
Phylogeny and whole-genome variation comparison of T. cruzi strains and isolates. (A) Unrooted maximum likelihood phylogenetic tree of the 36 T. cruzi strains based on 1,563 single copy genes (51), with 1,000 bootstrap replicates. (B) Zoomed-in view of the TcI branch to ease visualization. (C) PCA based on SNPs of the 36 T. cruzi strains. In this image, the x-axis and y-axis represent 45.28% and 15.96% of the variability observed in the evaluated isolates, respectively. In both images, the T. cruzi DTUs TcI, TcII, TcIII, TcIV, TcV, and TcVI are represented, respectively, by the colors blue, red, pink, orange, purple, and green. The number 6277 corresponds to the sample SRR3676277.
FIG 2
K-mer and cluster variability and copy number within each T. cruzi strain. Each dot corresponds to a T. cruzi isolate. (A) “K-mer.Count,” (B) “Cluster.Count,” and (C) “K-mer.Cluster.ratio” correspond, respectively, to the total number of different k-mers, the total number of different clusters, and the mean number of k-mers in each cluster for each T. cruzi strain. These counts were based only on presence/absence, without accounting for the copy number of each k-mer and cluster. (D) “Sum.cluster.Copies” corresponds to the sum of the coverage of each cluster for a given strain, which was proportional to the multigene family copy number in the genome. Strain-specific values can be seen in Table S2.
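The four per-strain summaries named in this caption can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `summarize_strain`, the input layout (a mapping from cluster ID to the set of k-mers observed in the strain, plus a mapping from cluster ID to an already computed coverage value), and the toy values are all assumptions made for illustration. How clusters and their coverage are actually derived is not described in this section.

```python
from dataclasses import dataclass


@dataclass
class StrainSummary:
    kmer_count: int            # "K-mer.Count": distinct k-mers present
    cluster_count: int         # "Cluster.Count": distinct clusters present
    kmer_cluster_ratio: float  # "K-mer.Cluster.ratio": mean k-mers per cluster
    sum_cluster_copies: float  # "Sum.cluster.Copies": summed cluster coverage


def summarize_strain(cluster_kmers, cluster_coverage):
    """Summarize one strain (hypothetical input layout, see lead-in).

    cluster_kmers: dict mapping cluster ID -> set of k-mers seen in the strain.
    cluster_coverage: dict mapping cluster ID -> coverage-based copy estimate.
    """
    # Presence/absence only: a cluster counts if at least one k-mer was seen.
    present = {c: kmers for c, kmers in cluster_kmers.items() if kmers}
    distinct_kmers = set().union(*present.values())
    cluster_count = len(present)
    ratio = len(distinct_kmers) / cluster_count if cluster_count else 0.0
    # Copy number summary: add up the coverage of every present cluster.
    copies = sum(cluster_coverage.get(c, 0.0) for c in present)
    return StrainSummary(len(distinct_kmers), cluster_count, ratio, copies)


# Toy, invented values for a single strain.
toy_kmers = {"c1": {"AAA", "AAC"}, "c2": {"GGG"}, "c3": set()}
toy_coverage = {"c1": 4.0, "c2": 1.5, "c3": 0.0}
summary = summarize_strain(toy_kmers, toy_coverage)
```

In this toy case the strain has three distinct k-mers in two clusters, so the ratio is 1.5, while the copy number summary depends only on the coverage values.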
FIG 3
Heatmap of the cluster variability and copy number among T. cruzi isolates. Cluster variability was estimated by the Jaccard Coefficient (JC) based on the presence/absence of clusters for each multigene family: (A) TcMUC, (B) MASP, and (C) TS. JC values are represented on a scale from green (low) to white (medium) to red (high) similarity. Cluster copy number variability was estimated by Manhattan distance for each multigene family: (D) TcMUC, (E) MASP, and (F) TS. Manhattan distance values are represented on a scale from green (high) to white (medium) to red (low) distance. In this image, each row and column corresponds to a T. cruzi isolate. The DTU of each isolate is represented by colored lateral strips, where blue, red, pink, orange, purple, and green correspond, respectively, to TcI, TcII, TcIII, TcIV, TcV, and TcVI. Lateral dendrograms were generated by UPGMA clustering. A larger version of each image with the names of each isolate is available in Fig. S2.
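The two pairwise measures named in this caption can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary-based copy number profiles, and the toy values are assumptions, and the caption does not say whether copy numbers were normalized before the Manhattan distance was computed.

```python
def jaccard_coefficient(clusters_a, clusters_b):
    """Jaccard coefficient between two sets of cluster IDs (presence/absence)."""
    union = clusters_a | clusters_b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(clusters_a & clusters_b) / len(union)


def manhattan_distance(copies_a, copies_b):
    """Manhattan distance between two cluster -> copy number profiles.

    Clusters missing from one profile are treated as having zero copies.
    """
    all_clusters = copies_a.keys() | copies_b.keys()
    return sum(abs(copies_a.get(c, 0) - copies_b.get(c, 0)) for c in all_clusters)


# Toy, invented profiles for two isolates.
isolate_1 = {"c1": 2, "c2": 5}
isolate_2 = {"c2": 3, "c4": 1}
jc = jaccard_coefficient(set(isolate_1), set(isolate_2))  # 1 shared of 3 total
dist = manhattan_distance(isolate_1, isolate_2)           # 2 + 2 + 1
```

A higher Jaccard coefficient means two isolates share more clusters, whereas a higher Manhattan distance means their copy number profiles differ more, which is why the two color scales in the figure run in opposite directions.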
FIG 4
Correlation between cluster copy number, variability, and genome size in hybrid and nonhybrid DTUs. (A) Correlation between genome size and cluster copy number in the 36 T. cruzi strains. In this image, each dot corresponds to a T. cruzi strain, the y-axis corresponds to the sum of the copy number of all clusters in each strain, and the x-axis corresponds to the genome size. The correlation between these two axes was estimated using Spearman’s rank order: MASP (rho = 0.393, P = 0.0183); TcMUC (rho = 0.469, P = 0.0042); TS (rho = 0.627, P = 6.14 × 10⁻⁵). (B) Boxplot of the cluster copy number in hybrid (Hyb) and nonhybrid (NH) DTUs. The statistical significance between the groups was estimated using the Mann-Whitney test: MASP (P = 3.19 × 10⁻³); TcMUC (P = 1.29 × 10⁻³); TS (P = 5.26 × 10⁻³). (C) Correlation between genome size and cluster variability in the 36 T. cruzi strains. In this image, each dot corresponds to a T. cruzi strain, the y-axis corresponds to the number of different clusters in each strain, and the x-axis corresponds to the genome size. The correlation between these two axes was estimated using Spearman’s rank order: MASP (rho = 0.749, P = 7.562 × 10⁻⁷); TcMUC (rho = 0.778, P = 2.236 × 10⁻⁸); TS (rho = 0.752, P = 6.941 × 10⁻⁷). (D) Boxplot of the cluster variability in hybrid (Hyb) and nonhybrid (NHyb) DTUs. The statistical significance between the groups was estimated using the Mann-Whitney test: MASP (P = 3.39 × 10⁻⁵); TcMUC (P = 1.39 × 10⁻³); TS (P = 2.39 × 10⁻⁵).
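The two statistics named in this caption can be illustrated with a standard-library sketch. It computes only Spearman's rho and the Mann-Whitney U statistics, not the P values reported above. The function names and every number in the example are invented for illustration and are not data from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b), the Mann-Whitney U statistics for two groups."""
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[: len(group_a)])
    u_a = rank_sum_a - len(group_a) * (len(group_a) + 1) / 2
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Invented toy values: genome size versus summed cluster copies per strain.
genome_size = [40, 45, 50, 55, 60]
cluster_copies = [100, 120, 110, 150, 170]
rho = spearman_rho(genome_size, cluster_copies)

# Invented toy values: summed cluster copies in hybrid and nonhybrid strains.
hybrid = [150, 170, 160]
nonhybrid = [100, 120, 110, 155]
u_hybrid, u_nonhybrid = mann_whitney_u(hybrid, nonhybrid)
```

Both statistics are rank-based, so they make no assumption that copy number or genome size is normally distributed across strains.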
FIG 5
Antigenicity of peptides derived from the multigene families using sera of mice infected with different T. cruzi DTUs. (A) Each dot corresponds to a peptide, and the white boxes in each panel separate the peptides from the MASP (left), TcMUC (middle), and TS (right) multigene families. The reactivity of each peptide is represented on a scale from black (low reactivity) to orange (median reactivity) to white (high reactivity). The panels showing the reactivity of sera from mice in the acute phase are outlined horizontally by a pink box, while those showing sera from mice in the chronic phase are outlined by a cyan box. The plots outlined vertically by white, blue, salmon, and green boxes represent, respectively, the reactivity of the peptides to the sera of noninfected mice (NC) or mice infected with TcI, TcII, or TcVI strains. Venn diagrams represent the number of peptides with above-cutoff reactivity for the pool of sera collected during the acute (B), chronic (C), or both acute and chronic (D) phases of infection. Percentage values correspond to the fraction of the reactive peptides that were observed in each quadrant.
FIG 6
Antigenicity of peptides derived from the multigene families with sera of Chagasic human patients infected with TcII strains. (A) Each dot corresponds to a different peptide, and the white boxes in each panel separate the peptides from the MASP (left), TcMUC (middle), and TS (right) multigene families. The reactivity of each peptide is represented on a scale from black (low reactivity) to orange (median reactivity) to white (high reactivity). (B) Percentage of the peptides from each multigene family that presented reactivity above the cutoff. (C) Top 10 peptides with the highest human sera reactivity.

References

    1. Metzker ML. 2010. Sequencing technologies - the next generation. Nat Rev Genet 11:31–46. doi:10.1038/nrg2626.
    2. McCombie WR, McPherson JD, Mardis ER. 2019. Next-generation sequencing technologies. Cold Spring Harb Perspect Med 9:a036798. doi:10.1101/cshperspect.a036798.
    3. Baptista RP, Kissinger JC. 2019. Is reliance on an inaccurate genome sequence sabotaging your experiments? PLoS Pathog 15:e1007901. doi:10.1371/journal.ppat.1007901.
    4. El-Sayed NM, Myler PJ, Blandin G, Berriman M, Crabtree J, Aggarwal G, Caler E, Renauld H, Worthey EA, Hertz-Fowler C, Ghedin E, Peacock C, Bartholomeu DC, Haas BJ, Tran A-N, Wortman JR, Alsmark UCM, Angiuoli S, Anupama A, Badger J, Bringaud F, Cadag E, Carlton JM, Cerqueira GC, Creasy T, Delcher AL, Djikeng A, Embley TM, Hauser C, Ivens AC, Kummerfeld SK, Pereira-Leal JB, Nilsson D, Peterson J, Salzberg SL, Shallom J, Silva JC, Sundaram J, Westenberger S, White O, Melville SE, Donelson JE, Andersson B, Stuart KD, Hall N. 2005. Comparative genomics of trypanosomatid parasitic protozoa. Science 309:404–409. doi:10.1126/science.1112181.
    5. Reis-Cunha JL, Valdivia HO, Bartholomeu DC. 2018. Gene and chromosomal copy number variations as an adaptive mechanism towards a parasitic lifestyle in trypanosomatids. Curr Genomics 19:87–97. doi:10.2174/1389202918666170911161311.