Domain adaptation in small-scale and heterogeneous biological datasets

Seyedmehdi Orouji¹, Martin C Liu^{2,3}, Tal Korem^{3,4,5}, Megan A K Peters^{1,5,6}

Affiliations

¹ Department of Cognitive Sciences, University of California Irvine, Irvine, CA, USA.
² Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA.
³ Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA.
⁴ Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA.
⁵ CIFAR Azrieli Global Scholars Program, CIFAR, Toronto, Canada.
⁶ CIFAR Fellow, Program in Brain, Mind, & Consciousness, CIFAR, Toronto, Canada.

PMID: 39705361
PMCID: PMC11661433
DOI: 10.1126/sciadv.adp6040

Review

Seyedmehdi Orouji et al. Sci Adv. 2024 Dec 20;10(51):eadp6040. doi: 10.1126/sciadv.adp6040. Epub 2024 Dec 20.

Abstract

Machine-learning models are key to modern biology, yet models trained on one dataset often fail to generalize to datasets from other cohorts or laboratories because of both technical and biological differences. Domain adaptation, a type of transfer learning, alleviates this problem by aligning different datasets so that models can be applied across them. However, most state-of-the-art domain adaptation methods were designed for large-scale data such as images, whereas biological datasets are smaller, have more features, and are complex and heterogeneous. This Review discusses domain adaptation methods in the context of such biological data to inform biologists and guide future domain adaptation research. We describe the benefits and challenges of domain adaptation in biological research and critically explore some of its objectives, strengths, and weaknesses. We argue for incorporating domain adaptation techniques into the computational biologist's toolkit, alongside further development of customized approaches.

Figures

**Fig. 1. Diagrammatic overview of the machine learning pipeline and modifications needed to engage in transfer learning or domain adaptation (DA).**
(A) In traditional machine learning, each domain has its own model, trained on domain-specific features. The model can therefore make predictions about data from that domain, but transferring it to other domains is typically difficult or even impossible (indicated by red Xs). (B) In transfer learning or DA, data from one or more source domains are aligned (denoted by dashed outlines) with those in the target domain to find common feature spaces with similar statistical distributions, such that a single model can be trained on aggregate source domain data and evaluated on the target domain. This process can produce generalizable knowledge that is not domain specific. Of note, in some cases, target data are used only after the model has been trained and not in the alignment stage (*152*).
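The workflow described in panel B can be sketched in a few lines of Python. This is only an illustrative sketch, assuming NumPy is available. The synthetic two-class data, the per-domain z-scoring used as the alignment step, the nearest-centroid classifier, and all function names (`zscore`, `fit_centroids`, `predict`, `make_domain`) are assumptions chosen for brevity and are not methods taken from the Review.

```python
import numpy as np

def zscore(x):
    """Standardize each feature within one domain (a very simple alignment)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def fit_centroids(x, y):
    """Train a nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(x, classes, centroids):
    """Assign each sample to the class with the closest centroid."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def make_domain(rng, n_per_class, offset):
    """Hypothetical toy domain: two classes plus a domain-specific offset."""
    y = np.repeat([0, 1], n_per_class)
    centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
    noise = rng.normal(scale=0.5, size=(2 * n_per_class, 2))
    return centers[y] + noise + offset, y

rng = np.random.default_rng(0)
x_src, y_src = make_domain(rng, 100, offset=np.array([0.0, 0.0]))
x_tgt, y_tgt = make_domain(rng, 100, offset=np.array([10.0, 0.0]))

# Panel A situation: a source-trained model applied directly to the target.
classes, centroids = fit_centroids(x_src, y_src)
acc_raw = np.mean(predict(x_tgt, classes, centroids) == y_tgt)

# Panel B situation: align both domains first, then train on source only.
classes, centroids = fit_centroids(zscore(x_src), y_src)
acc_aligned = np.mean(predict(zscore(x_tgt), classes, centroids) == y_tgt)

print(f"target accuracy without alignment: {acc_raw:.2f}")
print(f"target accuracy with alignment:    {acc_aligned:.2f}")
```

Target labels are used here only to score predictions, never for training. Per-domain z-scoring implicitly assumes that class proportions are similar in both domains, which may not hold in real cohorts.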

**Fig. 2. A cartoon representation of source and target domains before and after alignment.**
In this cartoon, features vary in their values along two dimensions, and each domain’s features take on a different mean and covariance. Unless the domains are aligned, these differences could both obscure other meaningful variation in the data that is shared across domains and prevent models trained on one domain from generalizing to another.
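One simple way to remove the mean and covariance differences pictured here is to whiten the source features and then recolor them with the target statistics, in the spirit of correlation alignment (CORAL). The sketch below assumes NumPy and synthetic two-dimensional Gaussian data. The function names (`sym_matrix_power`, `align_mean_cov`), the `eps` regularizer, and the example distributions are illustrative assumptions, not the procedure used to draw the figure.

```python
import numpy as np

def sym_matrix_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals ** power) @ eigvecs.T

def align_mean_cov(source, target, eps=1e-6):
    """Transform source samples so their mean and covariance match the target's."""
    n_features = source.shape[1]
    # Small ridge term (an assumption) keeps the covariances invertible.
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(n_features)
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(n_features)
    whiten = sym_matrix_power(cov_s, -0.5)   # removes source covariance
    recolor = sym_matrix_power(cov_t, 0.5)   # imposes target covariance
    centered = source - source.mean(axis=0)
    return centered @ whiten @ recolor + target.mean(axis=0)

# Hypothetical domains with different means and covariances.
rng = np.random.default_rng(0)
source = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200)
target = rng.multivariate_normal([3.0, -1.0], [[2.0, -0.5], [-0.5, 0.5]], size=200)

aligned = align_mean_cov(source, target)
print("mean gap after alignment:",
      np.abs(aligned.mean(axis=0) - target.mean(axis=0)).max())
print("covariance gap after alignment:",
      np.abs(np.cov(aligned, rowvar=False) - np.cov(target, rowvar=False)).max())
```

Matching only the first two moments does not guarantee that class structure is preserved across domains, and estimating a full covariance matrix can be unreliable when features outnumber samples, as is common in biological data.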

References

1. Ross L. N., Bassett D. S., Causation in neuroscience: Keeping mechanism meaningful. Nat. Rev. Neurosci. 25, 81–90 (2024).
1. DeGrave A. J., Janizek J., Lee S.-I., AI for radiographic COVID-19 detection selects shortcuts over signal. Nat. Mach. Intell. 3, 610–619 (2021).
1. Li X., Gu Y., Dvornek N., Staib L. H., Ventola P., Duncan J. S., Multi-site fMRI analysis using privacy-preserving federated learning and domain adaptation: ABIDE results. Med. Image Anal. 65, 101765 (2020).
1. Zizienová M., New OSF metadata to support data sharing policy compliance (2023).
1. Musen M. A., Bean C. A., Cheung K.-H., Dumontier M., Durante K. A., Gevaert O., Gonzalez-Beltran A., Khatri P., Kleinstein S. H., O’Connor M. J., Pouliot Y., Rocca-Serra P., Sansone S.-A., Wiser J. A., CEDAR team, The center for expanded data annotation and retrieval. J. Am. Med. Inform. Assoc. 22, 1148–1152 (2015).
