mBio. 2022 Dec 20;13(6):e0231922.

doi: 10.1128/mbio.02319-22. Epub 2022 Oct 20.

Accessing the Variability of Multicopy Genes in Complex Genomes using Unassembled Next-Generation Sequencing Reads: The Case of Trypanosoma cruzi Multigene Families

João Luís Reis-Cunha¹,², Anderson Coqueiro-Dos-Santos¹, Samuel Alexandre Pimenta-Carvalho¹, Larissa Pinheiro Marques¹, Gabriela F Rodrigues-Luiz³, Rodrigo P Baptista⁴,⁵, Laila Viana de Almeida¹, Nathan Ravi Medeiros Honorato¹, Francisco Pereira Lobo⁶, Vanessa Gomes Fraga¹, Lucia Maria da Cunha Galvão¹,⁷, Lilian Lacerda Bueno¹, Ricardo Toshio Fujiwara¹, Mariana Santos Cardoso¹, Gustavo Coutinho Cerqueira⁸, Daniella C Bartholomeu¹

Affiliations

¹ Departamento de Parasitologia, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil.
² Department of Biology, University of York, York, Yorkshire, United Kingdom.
³ Experimental Medicine Research Cluster (EMRC), University of Campinas (UNICAMP), Campinas, São Paulo, Brazil.
⁴ Center for Tropical and Emerging Global Diseases and Institute of Bioinformatics, The University of Georgia, Athens, Georgia, USA.
⁵ Houston Methodist Research Institute, Houston, Texas, USA.
⁶ Departamento de Genética e Evolução, Instituto de Ciências Biológicas, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil.
⁷ Universidade Federal do Rio Grande do Norte, Centro de Ciências da Saúde, Programa de Pós-Graduação em Ciências Farmacêuticas, Natal, RN, Brasil.
⁸ Personal Genome Diagnostics, Baltimore, Maryland, USA.

PMID: 36264102
PMCID: PMC9765020
DOI: 10.1128/mbio.02319-22

João Luís Reis-Cunha et al. mBio. 2022.

Abstract

Repetitive elements cause assembly fragmentation in complex eukaryotic genomes, limiting the study of their variability. The genome of Trypanosoma cruzi, the parasite that causes Chagas disease, has a high repetitive content, including multigene families. Although many T. cruzi multigene families encode surface proteins that play pivotal roles in host-parasite interactions, their variability is currently underestimated, as their high repetitive content results in collapsed gene variants. To estimate sequence variability and copy number variation of multigene families, we developed a read-based approach that is independent of gene-specific read mapping and de novo assembly. This methodology was used to estimate the copy number and variability of MASP, TcMUC, and Trans-Sialidase (TS), the three largest T. cruzi multigene families, in 36 strains, including members of all six parasite discrete typing units (DTUs). We found that these three families present a specific pattern of variability and copy number among the distinct parasite DTUs. Inter-DTU hybrid strains presented a higher variability of these families, suggesting that maintaining a larger content of their members could be advantageous. In addition, in a chronic murine model and chronic Chagasic human patients, the immune response was focused on TS antigens, suggesting that targeting TS conserved sequences could be a potential avenue to improve diagnosis and vaccine design against Chagas disease. Finally, the proposed approach can be applied to study multicopy genes in any organism, opening new avenues to access sequence variability in complex genomes.

IMPORTANCE Sequences that have several copies in a genome, such as multicopy-gene families, mobile elements, and microsatellites, are among the most challenging genomic segments to study. They are frequently underestimated in genome assemblies, hampering the correct assessment of these important players in genome evolution and adaptation. Here, we developed a new methodology to estimate variability and copy numbers of repetitive genomic regions and employed it to characterize the T. cruzi multigene families MASP, TcMUC, and transsialidase (TS), which are important virulence factors in this parasite. We showed that multigene families vary in sequence and content among the parasite's lineages, whereas hybrid strains have a higher sequence variability that could be advantageous to the parasite's survival. By identifying conserved sequences within multigene families, we showed that the mammalian host immune response toward these multigene families is usually focused on the TS multigene family. These TS conserved and immunogenic peptides can be explored in future work as diagnostic targets or vaccine candidates for Chagas disease. Finally, this methodology can be easily applied to any organism of interest, which will aid in our understanding of complex genomic regions.

Keywords: MASP; T. cruzi; antigenicity; complex genomes; copy number variation; mucins; multicopy genes; transsialidases; variability.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**FIG 1**
Phylogeny and whole-genome variation comparison of T. cruzi strains and isolates. (A) Unrooted maximum likelihood phylogenetic tree of the 36 T. cruzi strains based on 1,563 single-copy genes (51), with 1,000 bootstrap replicates. (B) Zoomed-in view of the TcI branch to ease visualization. (C) PCA based on SNPs of the 36 T. cruzi strains. In this image, the x-axis and y-axis represent 45.28% and 15.96% of the variability observed in the evaluated isolates, respectively. In all panels, the T. cruzi DTUs TcI, TcII, TcIII, TcIV, TcV, and TcVI are represented, respectively, by the colors blue, red, pink, orange, purple, and green. The number 6277 corresponds to the sample SRR3676277.

**FIG 2**
K-mer and cluster variability and copy number within each T. cruzi strain. Each dot corresponds to a T. cruzi isolate. (A) “K-mer.Count,” (B) “Cluster.Count,” and (C) “K-mer.Cluster.ratio” correspond, respectively, to the total number of different k-mers, the total number of different clusters, and the mean number of k-mers per cluster for each T. cruzi strain. These counts were based only on presence/absence, without accounting for the copy number of each k-mer and cluster. (D) “Sum.cluster.Copies” corresponds to the sum of the coverage of each cluster for a given strain, which was proportional to the multigene family copy number in the genome. Strain-specific values can be seen in Table S2.
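As a minimal illustration of the four per-strain summaries named in this caption, the sketch below computes them from a toy input. It is not the authors' pipeline: the `Cluster` record, the `summarize` function, and the assumption that each cluster carries a set of distinct k-mers plus a coverage-derived copy estimate are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Cluster:
    """Hypothetical record for one cluster observed in one strain."""
    kmers: Set[str]   # distinct k-mers assigned to this cluster
    copies: float     # coverage-derived copy estimate for the cluster


def summarize(clusters: List[Cluster]) -> Dict[str, float]:
    """Compute the four summaries described in the FIG 2 caption."""
    present = [c for c in clusters if c.kmers]  # presence/absence only
    all_kmers: Set[str] = set()
    for c in present:
        all_kmers |= c.kmers
    cluster_count = len(present)
    # Mean number of distinct k-mers per cluster (0.0 if no clusters).
    ratio = (sum(len(c.kmers) for c in present) / cluster_count
             if cluster_count else 0.0)
    return {
        "K-mer.Count": len(all_kmers),
        "Cluster.Count": cluster_count,
        "K-mer.Cluster.ratio": ratio,
        "Sum.cluster.Copies": sum(c.copies for c in present),
    }


# Toy strain with two clusters (made-up values).
toy_strain = [
    Cluster(kmers={"AAA", "AAC"}, copies=2.0),
    Cluster(kmers={"CCG"}, copies=1.5),
]
summary = summarize(toy_strain)
```

The first three values ignore copy number by construction, while the last one sums the per-cluster copy estimates, mirroring the distinction the caption draws between panels A to C and panel D.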

**FIG 3**
Heatmap of the cluster variability and copy number among T. cruzi isolates. Cluster variability was estimated by the Jaccard Coefficient (JC) based on the presence/absence of clusters for each multigene family: (A) TcMUC, (B) MASP, and (C) TS. JC values are represented on a scale from green (low) through white (medium) to red (high) similarity. Cluster copy number variability was estimated by Manhattan distance for each multigene family: (D) TcMUC, (E) MASP, and (F) TS. Manhattan distance values are represented on a scale from green (high) through white (medium) to red (low) distances. In this image, each row and column corresponds to a T. cruzi isolate. The DTU of each isolate is represented by colored lateral strips, where blue, red, pink, orange, purple, and green correspond, respectively, to TcI, TcII, TcIII, TcIV, TcV, and TcVI. Lateral dendrograms were generated by UPGMA clustering. A larger version of each image with the names of each isolate is available in Fig. S2.
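The two pairwise measures named in this caption can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: the function names, the representation of a strain as a set of cluster identifiers (for presence/absence) or as a dictionary of copy estimates, and the toy values are all hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient: shared clusters / clusters present in either strain."""
    union = a | b
    # Two empty sets are treated as identical (an assumption of this sketch).
    return len(a & b) / len(union) if union else 1.0


def manhattan(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Manhattan distance between per-cluster copy estimates.

    A cluster missing from one strain is counted as zero copies there.
    """
    return sum(abs(x.get(k, 0.0) - y.get(k, 0.0)) for k in x.keys() | y.keys())


# Toy strains (made-up cluster identifiers and copy estimates).
strain_a = {"c1": 2.0, "c2": 1.0, "c3": 4.0}
strain_b = {"c2": 3.0, "c3": 4.0, "c4": 1.0}

jc = jaccard(set(strain_a), set(strain_b))   # 2 shared of 4 total
md = manhattan(strain_a, strain_b)           # 2 + 2 + 0 + 1
```

Computing either measure for every pair of isolates yields the kind of square matrix that a heatmap with hierarchical clustering, such as the UPGMA dendrograms mentioned above, takes as input.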

**FIG 4**
Correlation between cluster copy number, variability, and genome size in hybrid and nonhybrid DTUs. (A) Correlation between genome size and cluster copy number in the 36 T. cruzi strains. In this image, each dot corresponds to a T. cruzi strain, the y-axis corresponds to the sum of the copy number of all clusters in each strain, and the x-axis corresponds to the genome size. The correlation between these two axes was estimated using Spearman’s rank order: MASP (rho = 0.393, P = 0.0183); TcMUC (rho = 0.469, P = 0.0042); TS (rho = 0.627, P = 6.14 × 10⁻⁵). (B) Boxplot of the cluster copy number in hybrid (Hyb) and nonhybrid (NH) DTUs. The statistical significance between the groups was estimated using the Mann-Whitney test: MASP (P = 3.19 × 10⁻³); TcMUC (P = 1.29 × 10⁻³); TS (P = 5.26 × 10⁻³). (C) Correlation between genome size and cluster variability in the 36 T. cruzi strains. In this image, each dot corresponds to a T. cruzi strain, the y-axis corresponds to the number of different clusters in each strain, and the x-axis corresponds to the genome size. The correlation between these two axes was estimated using Spearman’s rank order: MASP (rho = 0.749, P = 7.562 × 10⁻⁷); TcMUC (rho = 0.778, P = 2.236 × 10⁻⁸); TS (rho = 0.752, P = 6.941 × 10⁻⁷). (D) Boxplot of the cluster variability in hybrid (Hyb) and nonhybrid (NHyb) DTUs. The statistical significance between the groups was estimated using the Mann-Whitney test: MASP (P = 3.39 × 10⁻⁵); TcMUC (P = 1.39 × 10⁻³); TS (P = 2.39 × 10⁻⁵).
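For readers unfamiliar with the rank correlation reported in panels A and C, the sketch below shows one way to compute Spearman's rho with the Python standard library only: values are converted to average ranks (ties share the mean rank) and the Pearson correlation of the ranks is returned. The function names and the toy numbers are hypothetical, the P values in the caption are not reproduced, and the sketch assumes neither input is constant.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)  # assumes non-constant inputs


# Toy data (made-up): genome size versus summed cluster copies.
genome_size = [40.0, 45.0, 50.0, 55.0]
cluster_copies = [100.0, 130.0, 120.0, 200.0]
rho = spearman_rho(genome_size, cluster_copies)
```

Because only ranks enter the calculation, the statistic is insensitive to the scale of either axis, which suits comparisons between genome size and counts of clusters or copies.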

**FIG 5**
Antigenicity of peptides derived from the multigene families using sera of mice infected with different T. cruzi DTUs. (A) Each dot corresponds to a peptide, and the white boxes in each panel separate the peptides from the MASP (left), TcMUC (middle), and TS (right) multigene families. The reactivity of each peptide is represented on a scale from black (low reactivity) through orange (median reactivity) to white (high reactivity). The panels representing the reactivity of the sera from mice in the acute phase are enclosed horizontally by a pink box, and those representing the sera from mice in the chronic phase by a cyan box. The plots enclosed vertically by white, blue, salmon, and green boxes represent, respectively, the reactivity of the peptides to the sera of noninfected mice (NC) or mice infected with TcI, TcII, or TcVI strains. (B to D) Venn diagrams representing the number of peptides with above-cutoff reactivity for the pool of sera collected during the acute (B), chronic (C), or both acute and chronic (D) phases of infection. Percentage values correspond to the fraction of the reactive peptides that were observed in each quadrant.

**FIG 6**
Antigenicity of peptides derived from the multigene families with sera of Chagasic human patients infected with TcII strains. (A) Each dot corresponds to a different peptide, and the white boxes in each panel separate the peptides from the MASP (left), TcMUC (middle), and TS (right) multigene families. The reactivity of each peptide is represented on a scale from black (low reactivity) through orange (median reactivity) to white (high reactivity). (B) Percentage of the peptides from each multigene family that presented reactivity above the cutoff. (C) Top 10 peptides with the highest human sera reactivity.
