Review
Nat Rev Cancer. 2022 Nov;22(11):625-639.
doi: 10.1038/s41568-022-00502-0. Epub 2022 Sep 5.

Big data in basic and translational cancer research


Peng Jiang et al. Nat Rev Cancer. 2022 Nov.

Abstract

Historically, the primary focus of cancer research has been molecular and clinical studies of a few essential pathways and genes. Recent years have seen the rapid accumulation of large-scale cancer omics data catalysed by breakthroughs in high-throughput technologies. This fast data growth has given rise to an evolving concept of 'big data' in cancer, whose analysis demands large computational resources and can potentially bring novel insights into essential questions. Indeed, the combination of big data, bioinformatics and artificial intelligence has led to notable advances in our basic understanding of cancer biology and to translational advancements. Further advances will require a concerted effort among data scientists, clinicians, biologists and policymakers. Here, we review the current state of the art and future challenges for harnessing big data to advance cancer research and treatment.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Considerations for using big data in translational applications and basic research.
Clinical decisions, basic research and the development of new therapies should consider two orthogonal dimensions when leveraging big-data resources: integrating data across many data modalities, and integrating data from different cohorts, which may include the transfer of knowledge from pre-existing datasets.
Fig. 2. Prospective clinical studies guided by omics data to use off-label drugs.
Recent umbrella clinical trials have focused on multi-omics profiling of the tumours of enrolled patients by generating and analysing genome-wide data — including data from DNA sequencing, gene expression profiling, and copy number profiling — to prioritize treatments. After multi-omics profiling, a multidisciplinary molecular tumour board led by clinicians selects the best therapies on the basis of the current known relationships between drugs, genes and tumour vulnerabilities. For each therapy, the relevant altered vulnerabilities could include direct drug targets, genes in the same pathway, indirect drug targets upregulated or downregulated by drug treatment, or other genes interacting with the drug targets through physical or genetic interactions. This process then results in patients being treated with off-label targeted therapies. The end points for evaluating clinical efficacy include the ratio of the progression-free survival (PFS) associated with omics data-guided therapies (PFS2) and the PFS associated with previous therapy (PFS1), or differences in survival between patients treated with omics data-guided therapies and patients treated with therapies guided by physician’s choice alone.
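The PFS ratio end point described above can be illustrated with a small sketch. The function names, the example durations and the 1.3 benefit threshold are assumptions made for illustration only; none of them comes from the review, and individual trials define their own cut-offs.

```python
def pfs_ratio(pfs1_months, pfs2_months):
    """Return PFS2 / PFS1 for one patient.

    pfs1_months: progression-free survival on the previous therapy (PFS1).
    pfs2_months: progression-free survival on the omics-guided therapy (PFS2).
    """
    if pfs1_months <= 0:
        raise ValueError("PFS1 must be positive to form a ratio")
    return pfs2_months / pfs1_months


def fraction_benefiting(pairs, threshold=1.3):
    """Fraction of patients whose PFS2/PFS1 ratio reaches the threshold.

    The default threshold of 1.3 is a hypothetical choice for this sketch.
    """
    ratios = [pfs_ratio(pfs1, pfs2) for pfs1, pfs2 in pairs]
    return sum(r >= threshold for r in ratios) / len(ratios)


# Hypothetical (PFS1, PFS2) pairs in months, one per patient.
patients = [(4.0, 6.0), (5.0, 5.0), (2.0, 3.0)]
print(fraction_benefiting(patients))
```

Each patient serves as their own control in this design, which is why the ratio is computed per patient before any summary is taken.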
Fig. 3. Data-driven artificial intelligence to support cancer diagnosis.
a | A common artificial intelligence (AI) framework in cancer detection uses a convolutional neural network (CNN) to detect the presence of cancer cells from a diagnostic image. CNNs use convolution (a weighted sum over an image patch) and pooling (summarizing the values in a region as a single value) to encode image regions into low-dimensional numerical vectors that can be analysed by machine learning models. The CNN architecture is typically pretrained with ImageNet data, which is much larger than any cancer biology imaging dataset. To increase the reliability of the AI framework, the input data can be augmented through rotation or blurring of tissue images to increase data size. The data are separated into non-overlapping training, tuning and test sets to train the AI model, tune hyperparameters and estimate the prediction accuracy on new inputs, respectively. False-positive predictions are typically essential data points for retraining the AI model. b | An example of the application of AI in informing clinical decisions, as per the US Food and Drug Administration-approved AI test Paige Prostate. From one needle biopsy sample, the pathologist can decide whether cancer cells are present. If the results are negative (‘no cancer’) or if the physician cannot make a firm diagnosis (‘defer’), the Paige Prostate AI can analyse the image and prompt the pathologist with regard to potential cancer locations if any are detected. The alternative procedure involves evaluating multiple biopsy samples and performing immunohistochemistry tests on prostate cancer markers, independently from the AI test.
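The convolution, pooling and data-splitting steps in panel a can be sketched in plain Python. This is a toy illustration under stated assumptions, not the pipeline of any approved test: the function names, the 70/15/15 split and the tiny integer "image" are hypothetical, and a real system would use a deep learning library with a pretrained network.

```python
import random


def convolve2d(image, kernel):
    """Weighted sum of each image patch with the kernel (no padding, stride 1).

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Summarize each non-overlapping size x size region by its maximum."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in cols
        ]
        for i in rows
    ]


def split_dataset(items, train_frac=0.7, tune_frac=0.15, seed=0):
    """Shuffle once, then cut into non-overlapping train, tune and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_tune = int(len(shuffled) * tune_frac)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    test = shuffled[n_train + n_tune:]
    return train, tune, test


# Toy 5 x 5 "image" whose pixel value is 5 * row + column.
image = [[5 * r + c for c in range(5)] for r in range(5)]
kernel = [[1, 0], [0, 1]]
print(max_pool2d(convolve2d(image, kernel)))  # [[18, 22], [38, 42]]
```

Splitting by shuffled position guarantees that no sample appears in more than one set; in practice the split is often made per patient rather than per image, so that images from one patient do not leak across sets.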
Fig. 4. Design of new kinase inhibitors using a generative artificial intelligence model.
The variational autoencoder, trained with the structures of many compounds, can encode a molecular structure into a latent space of numerical vectors and decode this latent space back into the compound structure. For each target, such as the receptor tyrosine kinase DDR1, the variational autoencoder can create embeddings of compound categories, such as existing kinase inhibitors, patented compounds and non-kinase inhibitors. Sampling the latent space for compounds that are similar to existing on-target inhibitors and not patented compounds or non-kinase inhibitors can generate new candidate kinase inhibitors for downstream experimental validation. Adapted from ref., Springer Nature Limited.
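The latent-space sampling step can be illustrated with a toy filter. This sketch assumes that latent vectors for each compound category are already available from a trained encoder, which is omitted here. The function names, the Gaussian perturbation, the distance rule and the two-dimensional example vectors are all hypothetical choices for illustration and are not the method of the adapted study.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_distance(point, group):
    """Distance from a point to the closest member of a group."""
    return min(distance(point, member) for member in group)


def sample_candidates(on_target, avoid, n_samples=200, noise=0.5,
                      margin=1.0, seed=0):
    """Propose latent points near on-target inhibitors and away from `avoid`.

    on_target: latent vectors of existing on-target inhibitors.
    avoid: latent vectors of patented compounds and non-kinase inhibitors.
    A candidate is kept if it is at least `margin` from every avoided vector
    and closer to an on-target vector than to any avoided one.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        anchor = rng.choice(on_target)
        candidate = [x + rng.gauss(0.0, noise) for x in anchor]
        d_avoid = nearest_distance(candidate, avoid)
        d_target = nearest_distance(candidate, on_target)
        if d_avoid >= margin and d_target < d_avoid:
            kept.append(candidate)
    return kept


# Hypothetical 2-D latent vectors; real latent spaces have many dimensions.
on_target = [[0.0, 0.0], [0.5, 0.2]]
avoid = [[3.0, 3.0], [-3.0, 2.0]]
candidates = sample_candidates(on_target, avoid)
print(len(candidates))
```

In a full pipeline, each retained latent vector would be passed through the decoder to obtain a compound structure for downstream experimental validation.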

