Review
Proc Biol Sci. 2024 Jul;291(2027):20240423.
doi: 10.1098/rspb.2024.0423. Epub 2024 Jul 31.

The changing landscape of text mining: a review of approaches for ecology and evolution

Maxwell J Farrell et al. Proc Biol Sci. 2024 Jul.

Abstract

In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models and interacting with large language models and discuss challenges and possible solutions to implementing these methods in ecology and evolution.

Keywords: Information Extraction; Natural Language Processing; database construction; deep learning; large language models; literature synthesis.

Conflict of interest statement

We declare we have no competing interests.

Figures

Figure 1.
Illustration of common text processing tasks, including sentence segmentation, stop words removal, tokenization, POS tagging, lemmatization/stemming, dependency parsing, named entity recognition and coreference resolution. POS tagging involves marking up tokens with a set of descriptive POS tags, e.g. determiner (DT), proper noun (NNP), adjective (JJ), etc. Dependency parsing creates a tree-like representation of the grammatical relationships between words in a sentence. Note that the order and inclusion of individual steps in a pipeline will depend on the task. Example text derived from Herrera et al. [7].
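A few of the pre-processing steps named in Figure 1 (sentence segmentation, tokenization, stop-word removal and stemming) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the pipeline used in the review: the stop-word list, the suffix-stripping rule and the example sentences are all hypothetical, and real projects would more likely rely on a toolkit such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to", "on"}


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokenization; punctuation and digits are dropped."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Very rough suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    return [
        [crude_stem(t) for t in remove_stop_words(tokenize(s))]
        for s in split_sentences(text)
    ]


# Hypothetical example text (not the Herrera et al. passage used in the figure).
example = "Lemurs are hosts of parasites. The parasites infect lemurs in Madagascar."
result = preprocess(example)
print(result)
```

As the caption notes, the order and inclusion of steps depends on the task; for instance, stop-word removal is often skipped when the text is later passed to a dependency parser.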
Figure 2.
Example steps for three text mining paradigms: (a) frequency-based (bag-of-words) approach, (b) traditional NLP pipeline, and (c) deep learning-based language models. Dashed arrows indicate possible interactions between the paradigms (e.g. text pre-processed using a classical NLP pipeline could be analysed using bag-of-words approaches, or fed into a deep learning-based document classifier). (d) Some examples of outcomes, including quantifying document similarity, topic modelling and training models for document classification, named entity recognition and relation extraction. Note that for steps in the traditional NLP section, there are often additional task-specific external data sources (e.g. word lists, dictionaries, labelled training data) which are not depicted here.
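The frequency-based paradigm in panel (a) and the document-similarity outcome in panel (d) can be illustrated with a short sketch. It assumes raw term counts and cosine similarity, which is one common choice rather than a method prescribed by the review; the three toy documents and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document as term counts, ignoring word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents, invented for illustration only.
docs = {
    "doc_a": "parasites infect lemur hosts",
    "doc_b": "lemur hosts carry parasites",
    "doc_c": "soil nitrogen limits plant growth",
}
bags = {name: bag_of_words(text) for name, text in docs.items()}

sim_ab = cosine_similarity(bags["doc_a"], bags["doc_b"])
sim_ac = cosine_similarity(bags["doc_a"], bags["doc_c"])
print(f"doc_a vs doc_b: {sim_ab:.2f}")
print(f"doc_a vs doc_c: {sim_ac:.2f}")
```

Because word order is discarded, doc_a and doc_b score as similar even though their sentences differ, while doc_c shares no terms with doc_a and scores zero. In practice, counts are often reweighted (e.g. with TF-IDF) before similarity is computed.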

References

    1. Farrell MJ, Brierley L, Willoughby A, Yates A, Mideo N. 2022. Past and future uses of text mining in ecology and evolution. Proc. R. Soc. B 289, 20212721. (10.1098/rspb.2021.2721)
    2. Brandies PA, Hogg CJ. 2021. Ten simple rules for getting started with command-line bioinformatics. PLoS Comput. Biol. 17, e1008645. (10.1371/journal.pcbi.1008645)
    3. Bird S, Klein E, Loper E. 2009. Natural Language Processing with Python: analyzing text with the natural language toolkit. Sebastopol, CA: O’Reilly Media, Inc.
    4. Honnibal M, Montani I. 2017. spaCy 2: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing
    5. Nunez‐Mir GC, Iannone BV, Pijanowski BC, Kong N, Fei S. 2016. Automated content analysis: addressing the big literature challenge in ecology and evolution. Methods Ecol. Evol. 7, 1262–1272. (10.1111/2041-210X.12602)
