Sci Rep. 2022 Jun 20;12(1):10349.
doi: 10.1038/s41598-022-13790-1.

Compilation of parasitic immunogenic proteins from 30 years of published research using machine learning and natural language processing

Stephen J Goodswen et al. Sci Rep. 2022.

Abstract

The World Health Organisation reported in 2020 that six of the top 10 sources of death in low-income countries are parasites. Parasites are microorganisms in a relationship with a larger organism, the host. They acquire all benefits at the host's expense. A disease develops if the parasitic infection disrupts normal functioning of the host. This disruption can range from mild to severe, including death. Humans and livestock continue to be challenged by established and emerging infectious disease threats. Vaccination is the most efficient tool for preventing current and future threats. Immunogenic proteins sourced from the disease-causing parasite are worthwhile vaccine components (subunits) due to reliable safety and manufacturing capacity. Publications with 'subunit vaccine' in their title have accumulated to thousands over the last three decades. However, there are possibly thousands more reporting immunogenicity results without mentioning 'subunit' and/or 'vaccine'. The exact number is unclear given the non-standardised keywords in publications. The aim of this study is to identify parasite proteins that induce a protective response in an animal model, as reported in the scientific literature within the last 30 years, using machine learning and natural language processing. The source code written to fulfil this aim and the resulting vaccine candidate list are made available.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
A schematic of the pipeline processes that take abstracts as input and provide vaccine candidates as output. PubMed is a database maintained by the National Center for Biotechnology Information (NCBI) and contains over 30 million abstracts on life sciences and biomedical topics. The advanced search query was parasite, vaccine, vaccinated, OR vaccination in Title or Abstract text AND publication year greater than or equal to 1991 and less than 2022. Keywords for the rule-based abstract classification were related to protective immunity, animal models, parasite species, and parasitic diseases. Note that keywords were searched and counted in both title and abstract. The term ‘abstract of interest’ refers to abstracts that potentially contain a protein name of a vaccine candidate. Database searching involves checking for a match of an extracted protein name in an in-house protein and gene database compiled from The Universal Protein Resource (UniProt) and NCBI. Training data consisted of abstracts converted to a vectorised format (i.e., a numerical representation) using the text vectorization technique Bag of Words (BoW). NLP is an acronym for natural language processing. Named entity recognition (NER) is a sub-task of NLP and was used to classify named entities in abstracts into a pre-defined category of protein name. CD-HIT (cluster database at high identity with tolerance) was used to cluster 3731 sequences associated with 403 unique protein names into 1099 clusters, in which each member had a sequence identity greater than 90%. A representative sequence is the longest sequence in a cluster. Exposed candidates are proteins naturally exposed to the immune system, whereas non-exposed candidates are normally located in the pathogen’s interior.
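As an illustration of the Bag of Words step mentioned in the caption, the following is a minimal sketch in plain Python. It is not the study's code: the tokenizer, the function names, and the two toy abstracts are assumptions made for this example only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}


def bag_of_words(text, vocabulary):
    """Return a count vector for one text; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy abstracts, not taken from the study's training data.
abstracts = [
    "Vaccinated mice were protected against challenge",
    "The protein induced protective immunity in mice",
]
vocabulary = build_vocabulary(abstracts)
vectors = [bag_of_words(a, vocabulary) for a in abstracts]
```

Each abstract becomes a fixed-length list of word counts, which is the kind of numerical representation a classifier can be trained on. Word order is discarded, which is the defining simplification of the technique.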
Figure 2
A word cloud showing the 50 most frequent words in the positives training data applied in the classification of abstracts using machine learning. Note that stop words (e.g., “a”, “the”, “is”, “are”) were removed and a standard Porter stemming algorithm was applied to detect and combine similar words; e.g., responses and response, or significant and significantly, are combined (the most frequent of the variants is chosen to represent them). TagCrowd (https://tagcrowd.com/) was used to generate the word cloud.
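The word-frequency ranking behind such a word cloud can be sketched as follows. This is a simplified illustration only: the stop-word list is a small hypothetical one, the sample texts are invented, and Porter stemming is deliberately left out, so word variants are counted separately here.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption; the list used in the
# study is not given in the caption).
STOP_WORDS = {"a", "the", "is", "are", "of", "in", "and", "to", "was", "were"}


def most_frequent_words(texts, top_n=50):
    """Count non-stop-word tokens over all texts and return the top_n."""
    counts = Counter()
    for text in texts:
        for tok in re.findall(r"[a-z]+", text.lower()):
            if tok not in STOP_WORDS:
                counts[tok] += 1
    return counts.most_common(top_n)


# Hypothetical sample texts for demonstration.
sample = [
    "The immune response was significant in the vaccinated mice",
    "Mice are protected and the response is strong",
]
top_words = most_frequent_words(sample, top_n=2)
```

Adding a stemming step before counting would merge variants such as "response" and "responses" into one entry, as the caption describes.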
Figure 3
A bar chart showing the frequency of disease words in classified abstracts over three decades from 1991 to 2021. The classified abstracts are ‘title+abstract’ text output from the machine learning abstract classification stage of the current study; i.e., given an initial input of 332,627 ‘title+abstract’ texts downloaded from PubMed, 64,986 had a classification probability greater than or equal to 50% and were deemed ‘abstracts of interest’ (i.e., abstracts that potentially contain a protein name of a vaccine candidate). Each word or series of words associated with a parasitic disease was counted in the abstracts of interest; e.g., the word ‘malaria’ appears 2162 times and ‘toxocariasis’ 13 times in the abstracts. The bar chart shows that each decade has a greater disease frequency than the decade before, and that the frequency has more than doubled in the last 10 years (except for schistosomiasis and cysticercosis). Note that for brevity, counts of words related to the same or similar diseases were combined; e.g., the diseases Chagas disease, American trypanosomiasis, African trypanosomiasis, and sleeping sickness are all caused by trypanosomes. The word counts associated with these diseases were combined and presented under trypanosomiasis.
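The combined counting described in this caption can be sketched as below. The decade boundaries, the `DISEASE_GROUPS` table, and the sample records are assumptions for illustration; only the trypanosomiasis grouping is taken from the caption.

```python
import re
from collections import defaultdict

# Assumed decade boundaries; the caption only states "1991 to 2021".
DECADES = [(1991, 2000), (2001, 2010), (2011, 2021)]

# Terms counted together under one disease label. The trypanosomiasis
# group follows the caption; the rest of the table is hypothetical.
DISEASE_GROUPS = {
    "trypanosomiasis": ["chagas disease", "american trypanosomiasis",
                        "african trypanosomiasis", "sleeping sickness"],
    "malaria": ["malaria"],
}


def decade_of(year):
    """Return a label such as '1991-2000', or None if out of range."""
    for start, end in DECADES:
        if start <= year <= end:
            return f"{start}-{end}"
    return None


def count_disease_words(records):
    """records: iterable of (year, text). Returns {decade: {disease: n}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for year, text in records:
        decade = decade_of(year)
        if decade is None:
            continue
        lowered = text.lower()
        for disease, terms in DISEASE_GROUPS.items():
            for term in terms:
                pattern = r"\b" + re.escape(term) + r"\b"
                totals[decade][disease] += len(re.findall(pattern, lowered))
    return {d: dict(c) for d, c in totals.items()}


# Hypothetical records for demonstration.
records = [
    (1995, "Chagas disease vaccine trial in mice"),
    (2015, "Sleeping sickness and Chagas disease antigens; malaria control"),
    (2018, "Malaria vaccine: malaria antigens"),
]
counts = count_disease_words(records)
```

Counting occurrences rather than abstracts means a single abstract can contribute several counts to one disease, which matches the caption's wording that a word "appears" a given number of times.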
Figure 4
A bar chart showing the frequency of ‘animal model’ words in classified abstracts over three decades from 1991 to 2021. The classified abstracts are ‘title+abstract’ text output from the machine learning abstract classification stage of the current study; i.e., given an initial input of 332,627 ‘title+abstract’ texts downloaded from PubMed, 64,986 had a classification probability greater than or equal to 50% and were deemed ‘abstracts of interest’ (i.e., abstracts that potentially contain a protein name of a vaccine candidate). Each word or series of words describing an animal was counted in the abstracts of interest; e.g., the word ‘mice’ appears 30,749 times and ‘goats’ 545 times in the abstracts. Note that the automated approach does not distinguish whether the animal words relate to a model for candidate verification or refer to another context such as an animal host. The bar chart shows that each decade has a greater frequency for each ‘animal model’ word than the decade before. The rate of increase in frequency has doubled in the last 10 years for the following (listed in descending rates): pigs, chickens, cattle, birds, goats, dogs, and sheep. Conversely, the rate of increase has slowed for the following (listed in ascending rates): primates, rats, rabbits, mice, and guinea pigs. Note that for brevity, counts of words related to the same or similar animal model were combined; e.g., the ‘cattle’ animal model comprises word counts for cow, cows, calf, calves, and cattle.
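The 50% probability cut-off that defines an ‘abstract of interest’ amounts to a simple filter over classifier scores. The sketch below assumes each text already carries a probability from some classifier; the function name and the scored examples are hypothetical.

```python
def select_abstracts_of_interest(scored_texts, threshold=0.5):
    """Keep texts whose classification probability is >= threshold."""
    return [text for text, prob in scored_texts if prob >= threshold]


# Hypothetical (text, probability) pairs for demonstration.
scored = [
    ("Recombinant antigen protected mice against challenge", 0.91),
    ("Survey of parasite prevalence in wild birds", 0.12),
    ("Immunisation with the protein reduced parasite burden", 0.50),
]
selected = select_abstracts_of_interest(scored)

# Share of downloaded texts retained, using the counts quoted in the caption.
retained_percent = 100.0 * 64986 / 332627
```

With the caption's figures, roughly one in five downloaded texts passes the cut-off. Note that a score of exactly 0.50 is kept, since the criterion is "greater than or equal to".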
Figure 5
A word cloud showing the 15 most reported protein names per organism in the last 30 years of published research for four important parasite species. The size of the name is proportional to the number of publications reporting the protein. These protein names were ‘automatically’ extracted by the current study’s computational pipeline, which is designed to identify, from publication abstracts, parasite proteins that induce a protective response in an animal model. The presented names are from the top four species based on the total number of proteins identified: (A) Plasmodium falciparum, (B) Toxoplasma gondii, (C) Babesia bovis, and (D) Schistosoma japonicum. Wordclouds.com (https://classic.wordclouds.com/) was used to generate the word cloud.
Figure 6
A column graph depicting the number of predicted characteristics in candidate proteins per phylum per genus (A); and a bar graph showing the number of publications associated with the candidates (B). Protein characteristics were predicted from 1099 representative sequences related to protein names extracted from 332,627 PubMed ‘title+abstract’ texts using the presented study’s pipeline. The 1099 proteins are considered here as potential vaccine candidates. The characteristics predicted are accessibility to the immune system by Vacceed, transmembrane (TM) domains by TMHMM, the presence of a signal peptide (SP) by SignalP, and glycosylphosphatidylinositol (GPI) anchors by PredGPI. As an example of how to interpret the graphs, there are 1099 candidates, of which 320 are proteins from the genus Plasmodium (a member of the Apicomplexa phylum). Of these 320 proteins, 257 are predicted to be naturally accessible to the immune system, 173 have at least one TM domain, 204 have SPs, and 76 have GPI anchors. The 320 candidates appear collectively in 4055 publications.
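The Plasmodium example in this caption can be restated as proportions with a few lines of arithmetic. The counts are those quoted in the caption; the variable names are chosen for this sketch only.

```python
# Counts quoted in the Figure 6 caption for the genus Plasmodium.
total_candidates = 1099
plasmodium_candidates = 320
predicted = {
    "exposed": 257,
    "TM domain": 173,
    "signal peptide": 204,
    "GPI anchor": 76,
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of all candidates that come from Plasmodium.
plasmodium_share = percent(plasmodium_candidates, total_candidates)

# Share of Plasmodium candidates with each predicted characteristic.
feature_share = {
    name: percent(count, plasmodium_candidates)
    for name, count in predicted.items()
}
```

The four counts sum to more than 320, so the characteristics are not mutually exclusive and the percentages are not expected to add up to 100.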
