NPJ Digit Med. 2025 Aug 20;8(1):536.
doi: 10.1038/s41746-025-01942-2.

A novel sequence-based transformer model architecture for integrating multi-omics data in preterm birth risk prediction


Si Zhou et al. NPJ Digit Med.

Abstract

Preterm birth (PTB) significantly contributes to maternal and perinatal mortality and lifelong morbidity. While large language models (LLMs) offer considerable potential for disease risk prediction and early detection, their application to PTB prediction using multi-omics data remains limited. We developed a novel transformer-based architecture for integrating cell-free DNA (cfDNA) and cell-free RNA (cfRNA) sequencing data for PTB risk prediction. In the test set, the cfDNA LLM model achieved an AUC of 0.822, and the cfRNA LLM model achieved an AUC of 0.851. Integrating cfDNA and cfRNA data within the transformer-based framework outperformed both, reaching an AUC of 0.890, a significant improvement over the single-modality models. Additionally, we explored cfRNA and cfDNA integration using RNA editing and achieved an AUC of 0.82. These results underscore the potential of multi-omics data fusion, with transformer-based architectures providing a powerful framework for disease risk assessment, and demonstrate the potential of AI-driven multi-omics for broader applications in precision obstetrics and biomedicine.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Samples used for cfRNA and cfDNA sequencing in two cohorts.
Fig. 2. Architecture of the multi-modal transformer model for PTB prediction.
Raw cfDNA sequencing reads were processed through a series of steps, including quality control, mapping, and variant calling. The cfDNA variant calls are converted into binary vectors within specific genomic windows, transformed into pseudo-nucleotide sequences, and then segmented into 150-base-pair fragments. Concurrently, cfRNA data were processed to generate normalized gene expression matrices, with expression levels log-transformed and scaled. The normalized expression levels are used to create pseudo-sequences by repeating gene-specific tokens in proportion to their expression abundance, thereby encoding both each gene's identity and its quantity. A critical component of the model is its modality-agnostic integration: both cfDNA and cfRNA are converted into a unified token stream using an identical vocabulary. This deliberate approach avoids modality-specific embeddings, allowing the self-attention mechanism to freely learn cross-modal dependencies. The pre-trained GeneLLM transformer then processes this unified input, employing self-attention and feed-forward networks to generate contextualized embeddings. These representations are further refined by a dedicated Disease Tuning Module before being fed into a final linear layer, which, with a sigmoid activation function, outputs the probability of PTB.
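The encoding steps in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the A/T mapping for binary values, the token budget, and the use of gene symbols as tokens are all assumptions made for illustration, since the caption does not specify them. The last function shows only the generic form of a linear layer followed by a sigmoid.

```python
import math
from typing import Dict, List, Set


def variants_to_binary(positions: Set[int], window_start: int, window_size: int) -> List[int]:
    """Mark each offset of a genomic window with 1 if a variant was called there."""
    return [1 if (window_start + i) in positions else 0 for i in range(window_size)]


def binary_to_pseudo_sequence(bits: List[int], ref: str = "A", alt: str = "T") -> str:
    """Map 0/1 to nucleotide letters. The A/T choice is an assumption, not from the source."""
    return "".join(alt if b else ref for b in bits)


def segment(seq: str, length: int = 150) -> List[str]:
    """Cut a pseudo-sequence into consecutive fragments of at most `length` characters."""
    return [seq[i:i + length] for i in range(0, len(seq), length)]


def expression_to_tokens(expr: Dict[str, float], total_tokens: int) -> List[str]:
    """Repeat each gene token in proportion to its normalized expression.

    `total_tokens` is a hypothetical budget; the source does not state how
    repeat counts are derived from expression values.
    """
    total = sum(expr.values())
    tokens: List[str] = []
    for gene in sorted(expr):
        count = round(total_tokens * expr[gene] / total) if total > 0 else 0
        tokens.extend([gene] * count)
    return tokens


def ptb_probability(embedding: List[float], weights: List[float], bias: float) -> float:
    """Generic linear layer followed by a sigmoid, returning a value in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    bits = variants_to_binary({102, 105}, window_start=100, window_size=8)
    print(binary_to_pseudo_sequence(bits))
    print([len(f) for f in segment("A" * 310)])
    print(expression_to_tokens({"ICAM1": 3.0, "ACTB": 1.0}, total_tokens=8))
```

The gene symbols and values in the usage lines are toy inputs, not data from the study. How gene tokens are mapped into the shared vocabulary used for cfDNA is not described in the caption, so the sketch stops at producing the two token lists.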
Fig. 3. The performance of transformer-based LLMs.
Area under the receiver operating characteristic curve (AUC) for cfDNA (A), cfRNA (B), and combined cfDNA and cfRNA datasets (C) in the LG and FJ cohorts. The lower panels display the probability distributions generated by the model, comparing each sample with its respective control. Each point on the ROC curves represents a single sample.
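For readers unfamiliar with the metric, the AUC equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy labels and scores are illustrative assumptions and are not taken from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (case, control) pairs ranked correctly.

    labels: 1 for a case (e.g. PTB), 0 for a control (e.g. term birth).
    scores: predicted probabilities, one per sample.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # case scored above control
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: two cases and two controls.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large cohorts.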
Fig. 4. Analysis of cfRNA and cfDNA in the prediction of PTB.
A The relative abundance of different RNA types in 672 samples. Data are shown as mean ± SD. B Representative coverages for ACTB in 3 PTB samples and 3 TB samples. C Intron-to-exon ratio in 672 samples. D ROC curves and corresponding 95% CIs were used to quantify the performance of the cfRNA classifiers. E ROC curves and corresponding 95% CIs were used to quantify the performance of the cfDNA classifiers. F Schematic diagram of mutations detected in cfRNA and cfDNA. G Proportion of predicted RNA editing events among the cfRNA-specific mutations. H Among the mutations detected in both cfRNA and cfDNA, the proportion predicted to be DNA mutations. I Box plots show significantly more predicted RNA editing sites among cfRNA-specific mutations. J ROC curves and corresponding 95% CIs were used to quantify the performance of the RNA editing classifiers. Asterisks indicate statistically significant differences. Triple asterisks indicate a significance of P < 0.001.
Fig. 5. Analysis of cfRNA in the pathophysiology of PTB.
A GO enrichment (biological process) of significantly upregulated genes in PTB. B Box plots show significantly different levels of high-sensitivity C-reactive protein (hs-CRP) between PTB and TB in the Longgang cohort. C Chi-squared test shows that increased white blood cell (WBC) count was significantly associated with PTB in the Fujian cohort. D Box plots show significantly different expression levels of TNFRSF10B, ICAM1, hsa-miR-17-5p, KMT2E-AS1, and TP73-AS1 between PTB and TB. E Network of base-pairing interactions between mRNA (ICAM1, TNFRSF10B), miRNA (hsa-miR-17-5p), and lncRNA (KMT2E-AS1, TP73-AS1). Asterisks indicate statistically significant differences. Double asterisks indicate a significance of P < 0.01; triple asterisks indicate a significance of P < 0.001.


