Two-Phase Sampling Designs for Data Validation in Settings with Covariate Measurement Error and Continuous Outcome

Gustavo Amorim¹, Ran Tao¹,², Sarah Lotspeich¹, Pamela A Shaw³, Thomas Lumley⁴, Bryan E Shepherd¹

Affiliations

¹ Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA.
² Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA.
³ Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, PA, USA.
⁴ Department of Statistics, University of Auckland, Auckland, New Zealand.

PMID: 34975235
PMCID: PMC8715909
DOI: 10.1111/rssa.12689

Gustavo Amorim et al. J R Stat Soc Ser A Stat Soc. 2021 Oct;184(4):1368-1389. doi: 10.1111/rssa.12689. Epub 2021 Apr 15.

Abstract

Measurement errors are present in many data collection procedures and can harm analyses by biasing estimates. To correct for measurement error, researchers often validate a subsample of records and then incorporate the information learned from this validation sample into estimation. In practice, the validation sample is often selected using simple random sampling (SRS). However, SRS leads to inefficient estimates because it ignores information on the error-prone variables, which can be highly correlated with the unknown truth. Applying and extending ideas from the two-phase sampling literature, we propose optimal and nearly-optimal designs for selecting the validation sample in the classical measurement-error framework. We target designs to improve the efficiency of model-based and design-based estimators, and show how the resulting designs compare to each other. Our results suggest that sampling schemes that extract more information from the error-prone data are substantially more efficient than SRS, for both design- and model-based estimators. The optimal procedure, however, depends on the analysis method, and can differ substantially. This is supported by theory and simulations. We illustrate the various designs using data from an HIV cohort study.

Keywords: Design-based estimator; Linear Regression; Measurement error; Model-based estimator; Two-phase design.
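To make the contrast between SRS and an error-informed validation design concrete, the following is a minimal simulation sketch, not the authors' procedure. It assumes a single covariate with classical error (X* = X + U), a linear outcome model, three strata cut at the qth and (1 − q)th percentiles of an influence-function score computed from the error-prone data, and a Neyman-style allocation across strata. All function names, parameter values, and the allocation rule are illustrative assumptions; the design-based estimator is an inverse-probability-weighted (IPW) least-squares fit on the validated records.

```python
import numpy as np

def simulate_phase1(N, beta0=1.0, beta1=0.5, sigma_u=0.5, rng=None):
    """Phase 1: y and error-prone x_star are seen for everyone; x is the hidden truth."""
    rng = np.random.default_rng(rng)
    x = rng.normal(size=N)
    x_star = x + rng.normal(scale=sigma_u, size=N)  # classical error: X* = X + U
    y = beta0 + beta1 * x + rng.normal(size=N)
    return y, x, x_star

def naive_influence(y, x_star):
    """Influence-function values for the slope in the naive fit of y on x_star."""
    xc = x_star - x_star.mean()
    slope = (xc @ y) / (xc @ xc)
    resid = y - y.mean() - slope * xc
    return xc * resid / xc.var()

def stratify(score, q=0.2):
    """Three strata (0, 1, 2) with cut-offs at the q-th and (1 - q)-th percentiles."""
    lo, hi = np.quantile(score, [q, 1 - q])
    return np.digitize(score, [lo, hi])

def neyman_allocation(score, strata, n):
    """Allocate roughly n validations proportionally to N_h * SD_h (assumed rule).
    Rounding means the total can differ from n by a few records."""
    labels = np.unique(strata)
    N_h = np.array([(strata == h).sum() for h in labels])
    S_h = np.array([score[strata == h].std(ddof=1) for h in labels])
    raw = n * N_h * S_h / (N_h * S_h).sum()
    n_h = np.minimum(np.maximum(np.round(raw).astype(int), 2), N_h)
    return dict(zip(labels, n_h))

def sample_phase2(strata, n_h, rng):
    """Phase 2: SRS within each stratum; returns selection mask and sampling probabilities."""
    selected = np.zeros(strata.size, dtype=bool)
    prob = np.empty(strata.size)
    for h, m in n_h.items():
        idx = np.flatnonzero(strata == h)
        selected[rng.choice(idx, size=m, replace=False)] = True
        prob[idx] = m / idx.size
    return selected, prob

def ipw_slope(y, x, prob):
    """Weighted least squares with weights 1 / sampling probability; returns the slope."""
    w = 1.0 / prob
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))[1]

def compare_designs(reps=200, N=2000, n=200, q=0.2, seed=0):
    """Empirical variance of the IPW slope under SRS and under the stratified design."""
    rng = np.random.default_rng(seed)
    est = {"srs": [], "stratified": []}
    for _ in range(reps):
        y, x, x_star = simulate_phase1(N, rng=rng)
        # SRS is the special case of a single stratum holding every record.
        sel, p = sample_phase2(np.zeros(N, dtype=int), {0: n}, rng)
        est["srs"].append(ipw_slope(y[sel], x[sel], p[sel]))
        # Stratified: strata and allocation are driven by the error-prone score.
        score = naive_influence(y, x_star)
        strata = stratify(score, q)
        sel, p = sample_phase2(strata, neyman_allocation(score, strata, n), rng)
        est["stratified"].append(ipw_slope(y[sel], x[sel], p[sel]))
    return {k: float(np.var(v, ddof=1)) for k, v in est.items()}

if __name__ == "__main__":
    print(compare_designs())
```

Running the script prints the two empirical variances side by side. The true covariate is used only for the validated records, which mirrors the setting in which the truth is available only in the validation sample.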

Figures

**FIGURE 1**
Empirical variance (×10³) for the IPW estimator for different values of β₁ and different numbers of strata.

**FIGURE 2**
Empirical variance (×10³) for the IPW estimator with 3 strata, for β = (1, .5, 1) and different strata boundaries. We considered symmetrical strata, with cut-off points at the qth and (1 − q)th percentiles.

**FIGURE 3**
Overlap between the true and error-prone influence functions. Grey dots represent observations that were classified into the correct strata, with respect to the unknown true IF, by the error-prone IF.

**FIGURE 4**
Empirical variances for MI, IPW and raking estimators, for all 4 settings. IPW, raking and MI were applied to data collected via the IPW optimal design discussed in Section 4 and are denoted by IPW-IPW, raking-IPW and MI-IPW, respectively. IPW-SRS, raking-SRS and MI-SRS denote IPW, raking and MI applied to data obtained via simple random sampling (SRS), respectively. MI-SFS corresponds to MI applied to data obtained from the model-based SFS design discussed in Section 3.
