J Am Med Inform Assoc. 2009 Nov-Dec;16(6):759-67.
doi: 10.1197/jamia.M2780. Epub 2009 Aug 28.

Large datasets in biomedicine: a discussion of salient analytic issues

Anshu Sinha et al. J Am Med Inform Assoc. 2009 Nov-Dec.

Abstract

Advances in high-throughput and mass-storage technologies have led to an information explosion in both biology and medicine, presenting novel challenges for analysis and modeling. With regard to multivariate analysis techniques such as clustering, classification, and regression, large datasets present unique and often misunderstood challenges. The authors' goal is to discuss the salient problems encountered in the analysis of large datasets as they relate to modeling and inference, in order to inform a principled and generalizable analysis and to highlight the interdisciplinary nature of these challenges. The authors present a detailed study of germane issues including high dimensionality, multiple testing, scientific significance, dependence, information measurement, and information management, with a focus on appropriate methodologies available to address these concerns. A firm understanding of the challenges and statistical technology involved ultimately contributes to better science. The authors further suggest that the community consider facilitating discussion through interdisciplinary panels, invited papers, and curriculum enhancement to establish guidelines for analysis and reporting.

Figures

Figure 1
Cumulative distribution functions (CDFs) for several important informatics resources. The CDFs show growth rates of: the Entrez Protein database, compiled from a variety of sources including SwissProt, PIR, PRF, and PDB; the Protein Data Bank (PDB), a repository for 3-D structural data of proteins and nucleic acids; PubMed, a service from the U.S. National Library of Medicine that provides citations to biomedical literature; and the Columbia University Medical Center (CUMC) clinical database, measured in number of rows.
