Sci Rep. 2022 Dec 19;12(1):21920.
doi: 10.1038/s41598-022-26236-5.

In silico identification of multiple conserved motifs within the control region of Culicidae mitogenomes

Thomas M R Harrison et al. Sci Rep.

Abstract

Mosquitoes are important vectors for human and animal diseases. Genetic markers, like the mitochondrial COI gene, can facilitate the taxonomic classification of disease vectors, vector-borne disease surveillance, and prevention. Within the control region (CR) of the mitochondrial genome, there exists a highly variable and poorly studied non-coding AT-rich area that contains the origin of replication. Although the CR hypervariable region has been used for species differentiation of some animals, few studies have investigated the mosquito CR. In this study, we analyzed mosquito mitogenome CR sequences from 125 species and 17 genera. We discovered four conserved motifs located 80 to 230 bp upstream of the 12S rRNA gene. Two of these motifs were found within all 392 Anopheles (An.) CR sequences, while the other two motifs were identified in all 37 Culex (Cx.) CR sequences. However, only 3 of the 304 non-Culicidae Dipteran mitogenome CR sequences contained these motifs. Interestingly, the short motif found in all 37 Culex sequences had poly-A and poly-T stretches of similar length that are predicted to form a stable hairpin. We show that supervised learning using the frequency chaos game representation of the CR can be used to differentiate mosquito genera from their dipteran relatives.
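
The frequency chaos game representation (FCGR) mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes k = 6 and a particular corner layout: only the placement of A in the top-left corner follows the Figure 6 caption, while the positions of C, G, and T are assumptions made for illustration. The function names `kmer_to_cell` and `fcgr` are hypothetical.

```python
from collections import Counter

# (row bit, column bit) for each base. A in the top-left follows the
# Figure 6 caption; the placement of C, G and T is an assumption.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def kmer_to_cell(kmer):
    """Map a k-mer to its (row, column) cell in a 2^k x 2^k grid.

    The last base sets the most significant bit, so k-mers that share a
    suffix fall into the same square block of the grid.
    """
    row = col = 0
    for i, base in enumerate(kmer):
        r, c = CORNER[base]
        row |= r << i
        col |= c << i
    return row, col


def fcgr(seq, k=6):
    """Return a 2^k x 2^k grid of normalized k-mer frequencies."""
    seq = seq.upper()
    size = 2 ** k
    grid = [[0.0] * size for _ in range(size)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Skip k-mers containing ambiguous bases such as N.
    valid = {kmer: n for kmer, n in counts.items() if set(kmer) <= set(CORNER)}
    total = sum(valid.values())
    for kmer, n in valid.items():
        r, c = kmer_to_cell(kmer)
        grid[r][c] = n / total
    return grid
```

Under these assumptions, every 6-mer ending in AAAA lands in the 4 × 4 block at the top-left of the 64 × 64 grid; for example, `kmer_to_cell("TTAAAA")` returns `(3, 3)`. This matches the super-pixel layout described in the Figure 6 caption, although the k used in the study is not stated in this record.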

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Overview flowchart showing the data processing. Anopheles, Culex, and Aedes sequences were treated separately, but used the same procedure.
Figure 2
(a) Mitochondrial genome of Anopheles gambiae L20934. Yellow regions are the coding sequences of protein coding genes. Red regions are rRNA genes. Blue regions are tRNA genes. Grey region is the control region. Organization of genes is the same in all genera except Sabethes, where the tRNACys and tRNATyr genes are found in the control region, rather than between the ND2 and COX1 CDS. Gene organization for Sabethes belisarioi MF957171 tRNA bordering the control region is shown in an inset. (b) Close up of Anopheles gambiae L20934 control region with motif regions labelled. Generated with Geneious R11.1 (https://www.geneious.com).
Figure 3
Comparison of 27 Culex Short motifs and their mirrored A and T stretches. Note the mirrored mutations in 26. NC_037819 (G in position 4 and C in position 31) and 27. NC_028616 (T in position 9 and A in position 26). Generated with Geneious R11.1 (https://www.geneious.com).
Figure 4
Overview of the deep neural network used to classify mosquito and non-mosquito sequences. (A) A simple overview of how information from normalized FCGRs passes through each branch of the network. Each branch begins with FCGRs being split into patches. The information from each patch then passes through attention layers and a small fully connected feed-forward network. The layers predicting target information (real/synthetic, genera) are the last layers of the network. During training, the loss between predictions and actual targets is minimized by gradually adjusting the weights and biases of each layer. When predicting unknown labels, the layer which classifies each FCGR as either real or synthetic is discarded and only the taxonomic classification layer is used. Colors have been added only to aid in visualization. (B) The meta-classifier creates random training sets using the training data. The weights of each model are initialized randomly. This helps train a diverse set of models which can be used to classify unseen data.
Figure 5
Predicted secondary structure of NC_014574 (Culex quinquefasciatus) Culex Short motif and surrounding bases at 25 °C. Coloured scale represents the probability of bases being in the state represented. Bases are coloured according to the likelihood they are in their shown state. Blue/purple bases are unlikely, cyan/green are somewhat likely, and yellow/red are the most likely. Predictions made with ViennaRNA v 2.4.14 using DNA stacking energies.
Figure 6
Results of the semi-supervised learning investigation using a deep learning model. The model was trained using data collected up to 2019. (A) The organizational structure of the chaos game. An example of the composition of the NAAA super-pixel can be found in the top left corner. (B) Saliency map for each FCGR. Highlighted are 4 × 4 super-pixels of k-mer frequencies corresponding to the different regions of the FCGR presented in (A). For example, the patch found in the top left corner represents a collection of k-mers ending in AAAA. High saliency regions are warmer and are used by the model to differentiate between sequences. (C) This table displays the results of the fivefold stratified cross-validation experiment. Correct predictions are found where the column and row labels are identical. False negative predictions for each genus are found along the rows (e.g., five Anopheles sequences were predicted to be Non-Culicidae dipterans) while false positives are found along the columns.
Figure 7
Precision-Recall curve quantifying the performance of the semi-supervised classifier when trained on the entire dataset. Precision-Recall curves illustrate the ability of our deep learning model to balance the identification of true positive sequences while minimizing the number of false identifications (precision) and false negatives (recall).
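
The way the Figure 6 (C) table is read (false negatives along rows, false positives along columns) and the precision and recall quantities in Figure 7 can be sketched as follows. This is an illustrative sketch only: the function name `per_class_metrics` is hypothetical, and the counts in `toy` are invented for the example and are not results from the study.

```python
def per_class_metrics(labels, matrix):
    """Compute (precision, recall) per class from a confusion matrix.

    matrix[i][j] is the number of sequences whose true class is
    labels[i] and whose predicted class is labels[j]: rows are true
    classes and columns are predicted classes.
    """
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # rest of the row
        fp = sum(row[i] for row in matrix) - tp   # rest of the column
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Hypothetical counts, for illustration only (not data from the paper).
labels = ["Anopheles", "Culex", "Non-Culicidae"]
toy = [
    [8, 0, 2],  # true Anopheles
    [1, 9, 0],  # true Culex
    [0, 1, 9],  # true Non-Culicidae
]
metrics = per_class_metrics(labels, toy)
```

With these toy counts, Anopheles has 8 true positives, 2 false negatives (its row), and 1 false positive (its column), giving a precision of 8/9 and a recall of 0.8.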
