Ann Appl Stat. 2021 Sep;15(3):1556-1581.

doi: 10.1214/21-aoas1453. Epub 2021 Sep 23.

ASSESSING SELECTION BIAS IN REGRESSION COEFFICIENTS ESTIMATED FROM NONPROBABILITY SAMPLES WITH APPLICATIONS TO GENETICS AND DEMOGRAPHIC SURVEYS

Brady T West¹, Roderick J Little², Rebecca R Andridge³, Philip S Boonstra², Erin B Ware¹, Anita Pandit², Fernanda Alvarado-Leiton⁴

Affiliations

¹ Survey Research Center, Institute for Social Research, University of Michigan.
² Department of Biostatistics, School of Public Health, University of Michigan.
³ Division of Biostatistics, College of Public Health, Ohio State University.
⁴ Michigan Program in Survey and Data Science, Institute for Social Research, University of Michigan.

PMID: 35237377
PMCID: PMC8887878
DOI: 10.1214/21-aoas1453

Brady T West et al. Ann Appl Stat. 2021 Sep.

Abstract

Selection bias is a serious potential problem for inference about relationships of scientific interest based on samples without well-defined probability sampling mechanisms. Motivated by the potential for selection bias in: (a) estimated relationships of polygenic scores (PGSs) with phenotypes in genetic studies of volunteers and (b) estimated differences in subgroup means in surveys of smartphone users, we derive novel measures of selection bias for estimates of the coefficients in linear and probit regression models fitted to nonprobability samples, when aggregate-level auxiliary data are available for the selected sample and the target population. The measures arise from normal pattern-mixture models that allow analysts to examine the sensitivity of their inferences to assumptions about nonignorable selection in these samples. We examine the effectiveness of the proposed measures in a simulation study and then use them to quantify the selection bias in: (a) estimated PGS-phenotype relationships in a large study of volunteers recruited via Facebook and (b) estimated subgroup differences in mean past-year employment duration in a nonprobability sample of low-educated smartphone users. We evaluate the performance of the measures in these applications using benchmark estimates from large probability samples.

Keywords: Linear regression; National Survey of Family Growth; nonprobability samples; polygenic scores; probit regression; selection bias.
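
To make the abstract's central concern concrete, the following is a minimal simulation sketch of how nonignorable selection (selection that depends on the outcome Y) can distort a regression coefficient estimated from a nonprobability sample. It is not the paper's measure of selection bias and does not implement the pattern-mixture model. The function name, the logistic selection mechanism, and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np


def simulate_selection_bias(n=200_000, beta=0.5, gamma_y=3.0, seed=0):
    """Return the OLS slope of Y on Z in a simulated population and in a
    selected subsample whose inclusion probability depends on Y.

    All names and values here are hypothetical, for illustration only.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                # covariate
    y = beta * z + rng.normal(size=n)     # linear model with unit error variance

    # Assumed selection mechanism: logistic in Y. gamma_y = 0 makes selection
    # independent of Y (ignorable); larger values make it more nonignorable.
    p_select = 1.0 / (1.0 + np.exp(-gamma_y * y))
    selected = rng.random(n) < p_select

    def ols_slope(x, t):
        # Simple-regression slope: sample cov(x, t) / sample var(x)
        return np.cov(x, t)[0, 1] / np.var(x, ddof=1)

    return ols_slope(z, y), ols_slope(z[selected], y[selected])


pop_slope, sample_slope = simulate_selection_bias()
print(f"population slope: {pop_slope:.3f}")
print(f"selected-sample slope: {sample_slope:.3f}")
```

Under these assumptions the slope in the selected sample is attenuated relative to the population slope, whereas setting `gamma_y=0` leaves the two approximately equal. In practice the nonselected units are unobserved, which is why the measures described in the abstract rely on aggregate-level auxiliary data and a sensitivity analysis rather than a direct comparison like this one.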

Figures

**Fig. 1.**
Scatter plots presenting associations between MUBNS and the true differences in coefficients between selected and nonselected units for the Z₁ coefficient. Results are median MUBNS values across 1000 simulated datasets for each of the 1944 combinations of data generation model and selection mechanism; panels are separated by the level of dependence on Y in the selection model (OR_Y; rows) and the correlation between Y and A, given Z₁ and Z₂ (columns). The dotted black line represents the Y = X relationship.

**Fig. 2.**
Side-by-side box plots presenting distributions of the Spearman correlations between MUBNS and the true difference in the coefficients between selected and nonselected units. We estimate each correlation from 1000 replicate populations for each combination of data generation model and selection model. OR_A = odds ratio for A in the selection model; OR_Y = odds ratio for Y in the selection model.

**Fig. 3.**
Side-by-side box plots presenting distributions of the empirical coverage rates for the alternative intervals. We estimate each coverage rate by computing the interval for each coefficient from 1000 replicate populations for each combination of data generation model and selection model. OR_A = odds ratio for A in the selection model; OR_Y = odds ratio for Y in the selection model. The horizontal black line represents 0.95 coverage, for reference.

**Fig. 4.**
Side-by-side box plots presenting distributions of the empirical median widths for the alternative intervals across the different scenarios. We obtain the median width by computing the interval for each coefficient from 1000 replicate populations for each combination of data generation model and selection model. OR_A = odds ratio for A in the selection model; OR_Y = odds ratio for Y in the selection model.

Cited by

Risk of Traumatic Intracranial Hemorrhage After Stroke: A Nationwide Population-Based Cohort Study in Taiwan.
Fang YT, Liao SF, Chen PL, Yeh TS, Chen CI, Piravej K, Wu CC, Chiu WT, Lam C. J Am Heart Assoc. 2024 Oct;13(19):e035725. doi: 10.1161/JAHA.124.035725. Epub 2024 Sep 18. PMID: 39291491

Analyzing Potential Non-Ignorable Selection Bias in an Off-Wave Mail Survey Implemented in a Long-Standing Panel Study.
Schroeder HM, West BT. J Surv Stat Methodol. 2024 Oct 23;13(1):100-127. doi: 10.1093/jssam/smae039. eCollection 2025 Feb. PMID: 39877150

The Role of Weighting Adjustment for Attrition in Longitudinal Trajectory Modeling: A Simulation Study.
West BT, Si Y, Hu Y, McCabe SE, Veliz P. Commun Stat Simul Comput. 2025;54(3):866-888. doi: 10.1080/03610918.2024.2362923. Epub 2024 Jun 7. PMID: 40270979

Evaluating Pre-election Polling Estimates Using a New Measure of Non-ignorable Selection Bias.
West BT, Andridge RR. Public Opin Q. 2023 Jun 8;87(Suppl 1):575-601. doi: 10.1093/poq/nfad018. eCollection 2023. PMID: 37705923
