Ann Appl Stat. 2021 Sep;15(3):1556-1581.
doi: 10.1214/21-aoas1453. Epub 2021 Sep 23.

ASSESSING SELECTION BIAS IN REGRESSION COEFFICIENTS ESTIMATED FROM NONPROBABILITY SAMPLES WITH APPLICATIONS TO GENETICS AND DEMOGRAPHIC SURVEYS

Brady T West et al. Ann Appl Stat. 2021 Sep.

Abstract

Selection bias is a serious potential problem for inference about relationships of scientific interest based on samples without well-defined probability sampling mechanisms. Motivated by the potential for selection bias in: (a) estimated relationships of polygenic scores (PGSs) with phenotypes in genetic studies of volunteers and (b) estimated differences in subgroup means in surveys of smartphone users, we derive novel measures of selection bias for estimates of the coefficients in linear and probit regression models fitted to nonprobability samples, when aggregate-level auxiliary data are available for the selected sample and the target population. The measures arise from normal pattern-mixture models that allow analysts to examine the sensitivity of their inferences to assumptions about nonignorable selection in these samples. We examine the effectiveness of the proposed measures in a simulation study and then use them to quantify the selection bias in: (a) estimated PGS-phenotype relationships in a large study of volunteers recruited via Facebook and (b) estimated subgroup differences in mean past-year employment duration in a nonprobability sample of low-educated smartphone users. We evaluate the performance of the measures in these applications using benchmark estimates from large probability samples.

Keywords: Linear regression; National Survey of Family Growth; nonprobability samples; polygenic scores; probit regression; selection bias.
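The kind of selection bias the abstract describes can be illustrated with a small simulation. The sketch below is not the authors' method or their simulation design: the population coefficients, the threshold-based selection rule, and all names (`ols_slope`, `beta0`, `beta1`) are assumptions chosen only to show how a regression coefficient estimated from a self-selected sample can drift from the population value when selection depends on the outcome Y.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope of y on x for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(2021)
N = 20000
beta0, beta1 = 1.0, 0.5  # hypothetical population coefficients (assumption)

# Hypothetical target population: Y = beta0 + beta1 * X + error
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
ys = [beta0 + beta1 * x + rng.gauss(0.0, 1.0) for x in xs]

# Nonignorable selection (assumed rule): a unit enters the sample
# when a noisy version of its outcome Y exceeds a threshold.
selected = [(x, y) for x, y in zip(xs, ys) if y + rng.gauss(0.0, 0.5) > 1.0]
sel_x = [x for x, _ in selected]
sel_y = [y for _, y in selected]

full_slope = ols_slope(xs, ys)
selected_slope = ols_slope(sel_x, sel_y)

print(f"population slope:      {full_slope:.3f}")
print(f"selected-sample slope: {selected_slope:.3f}")
```

Under this assumed selection rule the slope in the selected sample is pulled toward zero relative to the population slope, which is the type of discrepancy a selection-bias measure is meant to quantify.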

Figures

Fig. 1.
Scatter plots presenting associations between MUBNS and the true differences in coefficients between selected and nonselected units for the Z1 coefficient. Results are median MUBNS values across 1000 simulated datasets for each of the 1944 combinations of data generation model and selection mechanism; panels are separated by the level of dependence on Y in the selection model (ORY; rows) and the correlation between Y and A, given Z1 and Z2 (columns). The dotted black line represents the Y = X relationship.
Fig. 2.
Side-by-side box plots presenting distributions of the Spearman correlations between MUBNS and the true difference in the coefficients between selected and nonselected units. We estimate each correlation from 1000 replicate populations for each combination of data generation model and selection model. ORA = odds ratio for A in the selection model; ORY = odds ratio for Y in the selection model.
Fig. 3.
Side-by-side box plots presenting distributions of the empirical coverage rates for the alternative intervals. We estimate each coverage rate by computing the interval for each coefficient from 1000 replicate populations for each combination of data generation model and selection model. ORA = odds ratio for A in the selection model; ORY = odds ratio for Y in the selection model. The horizontal black line represents 0.95 coverage, for reference.
Fig. 4.
Side-by-side box plots presenting distributions of the empirical median widths for the alternative intervals across the different scenarios. We obtain the median width by computing the interval for each coefficient from 1000 replicate populations for each combination of data generation model and selection model. ORA = odds ratio for A in the selection model; ORY = odds ratio for Y in the selection model.
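The empirical coverage rates and median widths summarized in Figs. 3 and 4 are generic simulation summaries. The sketch below shows how such quantities are typically computed across replicates. It uses a plain normal-theory 95% interval for a mean as a stand-in, not the alternative intervals evaluated in the paper, and the function name, sample size, and true value are hypothetical.

```python
import random
import statistics

def coverage_and_median_width(true_value, n_reps=1000, n=200, sigma=1.0, seed=0):
    """Empirical coverage and median width of a 95% interval over replicates.

    Stand-in interval: estimate +/- 1.96 * standard error for a sample mean.
    """
    rng = random.Random(seed)
    z = 1.96  # approximate 97.5th percentile of the standard normal
    hits = 0
    widths = []
    for _ in range(n_reps):
        # One replicate dataset drawn from the assumed data-generating model
        sample = [rng.gauss(true_value, sigma) for _ in range(n)]
        est = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        lower, upper = est - z * se, est + z * se
        hits += lower <= true_value <= upper  # does the interval cover the truth?
        widths.append(upper - lower)
    return hits / n_reps, statistics.median(widths)

coverage, median_width = coverage_and_median_width(true_value=0.5)
print(f"empirical coverage: {coverage:.3f}")
print(f"median width:       {median_width:.3f}")
```

A coverage rate close to 0.95 indicates that the interval behaves as advertised under the assumed model. Comparing median widths at similar coverage shows which interval is more informative.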
