Stat Med. 2022 Apr 15;41(8):1482-1497.
doi: 10.1002/sim.9300. Epub 2022 Jan 6.

Optimal sampling for design-based estimators of regression models

Tong Chen et al. Stat Med.

Abstract

Two-phase designs measure variables of interest on a subcohort when the outcome and covariates are readily available or cheap to collect on all individuals in the cohort. Given limited resources, it is of interest to find an optimal design that includes more informative individuals in the final sample. We explore the optimal designs and efficiencies for analyses based on design-based estimators. Generalized raking estimators form an efficient class of design-based estimators; they improve on the inverse-probability weighted (IPW) estimator by adjusting weights using auxiliary information. We derive a closed-form solution of the optimal design for estimating regression coefficients with generalized raking estimators. We compare it with the optimal design for analysis via the IPW estimator and with other two-phase designs in measurement-error settings. We consider general two-phase designs where the outcome variable and variables of interest can be continuous or discrete. Our results show that the optimal designs for analyses by the two classes of design-based estimators can be very different. The design that is optimal for the IPW estimator typically also gives near-optimal efficiency for generalized raking estimation, though we show there is potential for improvement in some settings.

Keywords: Neyman allocation; generalized raking; influence function; model-assisted sampling; optimal design; residual; two-phase sampling.
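
For readers unfamiliar with two of the terms above, the following is a minimal sketch of classical Neyman allocation and of an inverse-probability weighted (Horvitz-Thompson) total. It is not the closed-form design derived in the paper. The function names, the example strata, and the choice of a population total as the target are assumptions made for illustration. The sketch also ignores integer rounding and the constraint that a stratum sample cannot exceed the stratum size. In a two-phase setting, the per-stratum standard deviation would be that of whatever quantity the design targets; the keywords point to influence functions and residuals, but the exact form is given in the paper and is not reproduced here.

```python
def neyman_allocation(stratum_sizes, stratum_sds, n_total):
    """Classical Neyman allocation: n_h proportional to N_h * S_h.

    stratum_sizes: population size N_h of each stratum
    stratum_sds:   within-stratum standard deviation S_h of the target quantity
    n_total:       total phase-two sample size to distribute
    Returns real-valued (unrounded) sample sizes, one per stratum.
    """
    weights = [N * S for N, S in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all strata have zero size or zero variability")
    return [n_total * w / total for w in weights]


def ipw_total(values, sampling_probs):
    """Horvitz-Thompson (IPW) estimate of a population total.

    Each sampled value is weighted by the inverse of its sampling probability.
    """
    return sum(y / p for y, p in zip(values, sampling_probs))


# Hypothetical strata: the small, highly variable stratum is oversampled
# relative to its share of the population.
allocation = neyman_allocation([600, 300, 100], [1.0, 2.0, 4.0], 160)
estimate = ipw_total([2.0, 4.0], [0.5, 0.25])
```

In this hypothetical example the third stratum holds 10% of the population but receives 25% of the sample, because its standard deviation is four times that of the first stratum.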
