BMC Med Genomics. 2025 May 15;18(1):87.
doi: 10.1186/s12920-025-02154-z.

Dynamic clustering of genomics cohorts beyond race, ethnicity-and ancestry


Hussein Mohsen et al. BMC Med Genomics.

Abstract

Background: Recent decades have witnessed a steady decrease in the use of race categories in genomic studies. While studies that still include race categories vary in goal and type, these categories already build on a history during which racial color lines have been enforced and adjusted in the service of social and political systems of power and disenfranchisement. For early modern classification systems, data collection was also considerably arbitrary and limited. Fixed, discrete classifications have limited the study of human genomic variation and disrupted widely spread genetic and phenotypic continuums across geographic scales. Relatedly, the use of broad and predefined classification schemes (e.g. continent-based) across traits can risk missing important trait-specific genomic signals.

Methods: To address these issues, we introduce a dynamic approach to clustering human genomics cohorts based on genomic variation in trait-specific loci and without using a set of predefined categories. We tested the approach on whole-exome sequencing datasets in ten cancer types and partitioned them based on germline variants in cancer-relevant genes that could confer cancer type-specific disease predisposition.
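As a rough illustration of the idea in Methods, the sketch below groups samples by their variant profiles at a chosen set of loci, without fixing the number of groups in advance. It is not the authors' pipeline: the binary variant encoding, the Hamming distance, the `max_dist` cutoff, the function names, and the toy data are all assumptions made for this example.

```python
from itertools import combinations


def hamming(a, b):
    """Fraction of positions at which two equal-length profiles differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_by_threshold(genotypes, loci, max_dist):
    """Link samples whose profiles at `loci` differ by at most `max_dist`.

    genotypes: dict mapping sample id -> list of 0/1 variant calls (assumed encoding)
    loci:      indices of the trait-relevant positions to compare
    max_dist:  hypothetical distance cutoff; the number of clusters is not preset
    """
    ids = sorted(genotypes)
    profile = {s: [genotypes[s][i] for i in loci] for s in ids}
    parent = {s: s for s in ids}  # union-find forest over samples

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    # Merge every pair of samples that falls within the cutoff.
    for a, b in combinations(ids, 2):
        if hamming(profile[a], profile[b]) <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for s in ids:
        clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values())


# Toy data: five samples, six loci (purely illustrative).
genotypes = {
    "s1": [1, 0, 1, 0, 0, 1],
    "s2": [1, 0, 1, 0, 1, 0],
    "s3": [0, 1, 0, 1, 0, 1],
    "s4": [0, 1, 0, 1, 1, 1],
    "s5": [1, 1, 1, 1, 0, 0],
}

trait_clusters = cluster_by_threshold(genotypes, loci=[0, 1, 2, 3], max_dist=0.25)
all_loci_clusters = cluster_by_threshold(genotypes, loci=range(6), max_dist=0.25)
print(trait_clusters)
print(all_loci_clusters)
```

In this toy example, restricting the comparison to the first four loci groups s1 with s2, whereas comparing all six loci separates them. That is the kind of locus-dependent partition the trait-specific approach is meant to expose.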

Results: Results demonstrate clustering patterns that transcend discrete continent-based categories across cancer types. Functional analysis based on cancer type-specific clusterings also captures the fundamental biological processes underlying cancer, differentiates between dynamic clusters on a functional level, and identifies novel potential drivers overlooked by a predefined continent-based clustering.

Conclusions: Through a trait-based lens, the dynamic clustering approach reveals genomic patterns that transcend predefined classification categories. We propose that, coupled with diverse data collection, new clustering approaches have the potential to draw a more complete portrait of genomic variation and to address, in parallel, technical and social aspects of its study.

Keywords: Ancestry; Cancer genomics; Classification; Ethnicity; Genetic variation; Race.


Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Multidimensional scaling (MDS) plots of dynamically generated clusters for ten TCGA cancer cohorts. Cancer type-specific dynamic clusters transcend predefined continent-based categories. Dynamic cluster numbers (i.e. C1, C2, … C8) correspond to disjoint sample subsets within each cancer cohort. resulting in single dots each representing a subcluster (e.g. LUSC-HFI in Fig. 2b and KIRC-HFI in Supplementary Fig. 1).
Fig. 2
MDS plots of dynamically generated clusters based on high-functional-impact (HFI) germline variant subsets. a PAAD results demonstrate a higher number of clusters in HFI-based results compared to ones based on all nonsynonymous variants in the COSMIC-based setting in Fig. 1h. b LUSC results demonstrate compact clusters with a high number of samples demonstrating similar or identical variation patterns in HFI subsets
Fig. 3
Algorithmic and human-aided identification of dynamic clusters. K-means with a predefined k = 4 fails to identify COSMIC-based clusters in (a) BRCA, (b) COAD, (c) OV, and (d) PAAD, among other cancer types. Dynamic-k results also demonstrate the failure of (e) K-means, (f) HClust, and (g) DBSCAN to identify COSMIC-based clusters in LIHC, highlighting the need for (h) human aid in cluster identification
Fig. 4
Known and potential driver genes identified based on dynamic clustering. a Dynamic cluster-based genes overlooked by the continent-based scheme. Each of the listed genes was identified in at least one COSMIC- or HFI-based dynamic cluster and none of the clusters based on predefined continent-based categories. b Dynamic cluster-based drivers associate widely with known cancer pathways. Genes overlooked by the continent-based scheme drive a subset of these associations in one (blue border) or both (light green) settings centered on the COSMIC- and HFI-based variant sets
Fig. 5
Dynamic clusters across cancer types highlight clinical and functional associations. a Dynamic cluster 1 (C1) based on the COSMIC subset in LUAD (LUAD-COSMIC) shows a statistically significantly lower age of cancer onset than the second cluster (C2; padj < 0.05). b C1 in LIHC-COSMIC includes samples with higher tumor grade compared to other clusters combined (padj < 0.05). Clusters C5 and C8 include no Grade 4 samples. c C1 shows more advanced tumor stage samples in LIHC-HFI compared to C2 (padj < 0.05). d Genes significantly expressed (FDR < 0.05) in opposite directions among clusters of BRCA-COSMIC, BRCA-HFI and LIHC-COSMIC highlight potential functional roles associated with different clusters. e Gene programs with known association to cancer are collectively expressed in different magnitudes and directions between dynamic clusters across cancer types and settings
Fig. 6
Genes expressed in single clusters within each cancer type and setting highlight related biological and clinical associations beyond cancer (e.g. asthma and lupus in LUSC-COSMIC-C1 and neural development in LIHC-HFI-C1). Resulting associations share a subset of their underlying genes (i.e. edges), and a number of biological processes recurrently emerges across cancer types and settings (i.e. “Across Cancer Types,” top-right)
