2019 May 7;9(5):1377-1383.
doi: 10.1534/g3.119.400018.

BGData - A Suite of R Packages for Genomic Analysis with Big Data

Alexander Grueneberg et al. G3 (Bethesda).

Abstract

We created a suite of R packages to enable the analysis of extremely large genomic data sets (potentially millions of individuals and millions of molecular markers) within the R environment. The suite offers: (i) a matrix-like interface for .bed files (PLINK's binary format for genotype data); (ii) a novel class of linked arrays that links data stored in multiple files to form a single array accessible from the R computing environment; (iii) parallel computing methods that carry out computations on very large data sets without loading the entire data set into memory; and (iv) a basic set of methods for statistical genetic analyses. The packages are available through CRAN and GitHub. In this note, we describe the classes and methods implemented in each of the packages that make up the suite and illustrate their use with data from the UK Biobank.
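The linked-array idea described above can be illustrated with a minimal conceptual sketch. It is written in Python rather than R and does not reproduce the suite's actual classes or API. The names `ColumnLinkedArray` and `chunked_column_means` are hypothetical, and small in-memory lists stand in for genotype files on disk. The sketch assumes every block is non-empty and shares the same rows (individuals).

```python
from bisect import bisect_right
from typing import List

Matrix = List[List[float]]  # a block stored as a list of rows


class ColumnLinkedArray:
    """Hypothetical illustration: several column blocks seen as one matrix."""

    def __init__(self, blocks: List[Matrix]) -> None:
        n_rows = len(blocks[0])
        if any(len(block) != n_rows for block in blocks):
            raise ValueError("all blocks must have the same number of rows")
        self.blocks = blocks
        self.n_rows = n_rows
        # offsets[k] is the global index of the first column of block k
        self.offsets: List[int] = []
        total = 0
        for block in blocks:
            self.offsets.append(total)
            total += len(block[0])
        self.n_cols = total

    def column(self, j: int) -> List[float]:
        """Return global column j by locating the block that holds it."""
        if not 0 <= j < self.n_cols:
            raise IndexError("column index out of range")
        k = bisect_right(self.offsets, j) - 1  # block containing column j
        local = j - self.offsets[k]            # column index inside block k
        return [row[local] for row in self.blocks[k]]


def chunked_column_means(arr: ColumnLinkedArray, chunk_size: int) -> List[float]:
    """Compute column means while touching only chunk_size columns at a time."""
    means: List[float] = []
    for start in range(0, arr.n_cols, chunk_size):
        stop = min(start + chunk_size, arr.n_cols)
        chunk = [arr.column(j) for j in range(start, stop)]  # one chunk in memory
        means.extend(sum(col) / len(col) for col in chunk)
    return means


# Two blocks (e.g. two chromosomes) with the same four individuals
block_a = [[0, 1], [2, 1], [1, 2], [1, 2]]
block_b = [[2], [2], [2], [0]]
linked = ColumnLinkedArray([block_a, block_b])
print(chunked_column_means(linked, chunk_size=2))
```

In this sketch the caller indexes columns globally and never needs to know which block holds a given marker, and the chunked loop bounds how many columns are held at once. Those are the two properties the abstract attributes to linked arrays and memory-bounded computation.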

Keywords: big data; biobank; distributed computing; genetic analyses; parallel computing.

Figures

Figure 1. Manhattan plot obtained by regressing sex-age adjusted height on variants using data from the training set (n = 222,648, unrelated White British).

Figure 2. Correlation (+/− SE) between sex-adjusted height and predicted height in the testing set, by the number of SNPs used.
