Biometrics. 2009 Jun;65(2):554-63.
doi: 10.1111/j.1541-0420.2008.01116.x.

Presence-only data and the EM algorithm

Gill Ward et al. Biometrics. 2009 Jun.

Abstract

In ecological modeling of the habitat of a species, it can be prohibitively expensive to determine species absence. Presence-only data consist of a sample of locations with observed presences and a separate group of locations sampled from the full landscape, with unknown presences. We propose an expectation-maximization algorithm to estimate the underlying presence-absence logistic model for presence-only data. This algorithm can be used with any off-the-shelf logistic model. For models with stepwise fitting procedures, such as boosted trees, the fitting process can be accelerated by interleaving expectation steps within the procedure. Preliminary analyses based on sampling from presence-absence records of fish in New Zealand rivers illustrate that this new procedure can reduce both deviance and the shrinkage of marginal effect estimates that occur in the naive model often used in practice. Finally, it is shown that the population prevalence of a species is only identifiable when there is some unrealistic constraint on the structure of the logistic model. In practice, it is strongly recommended that an estimate of population prevalence be provided.
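
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how an EM loop of this kind might look, not the authors' implementation. It assumes that presences are sampled uniformly from occupied locations, that background points are sampled uniformly from the full landscape, and that the prevalence `pi` is supplied by the user. The function names `fit_logistic` and `em_presence_only` are hypothetical. The intercept offset log((n_p + pi n_u) / (pi n_u)) is a case-control correction derived under those sampling assumptions, and the plain Newton solver stands in for "any off-the-shelf logistic model".

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Newton (IRLS) logistic regression; y may be fractional in [0, 1].

    X must already contain an intercept column.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def em_presence_only(X, z, pi, n_em=500, tol=1e-8):
    """Sketch of EM for presence-only data (assumed form, see lead-in).

    X  : (n, d) covariates for presences followed by background points
    z  : (n,) 1 for observed presences, 0 for background points
    pi : assumed population prevalence, supplied by the user
    Returns [intercept, slopes...] on the presence-absence scale.
    """
    n_p = int(z.sum())
    n_u = len(z) - n_p
    # Case-control offset: the sample over-represents presences.
    offset = np.log((n_p + pi * n_u) / (pi * n_u))
    Xd = np.column_stack([np.ones(len(z)), X])
    # Initial guess: every background point is a presence with probability pi.
    y_hat = np.where(z == 1, 1.0, pi)
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_em):
        # M-step: fit a logistic model to the current (fractional) outcomes,
        # then remove the case-control offset from the intercept.
        new_beta = fit_logistic(Xd, y_hat)
        new_beta[0] -= offset
        # E-step: replace unknown background outcomes by their expectation.
        eta = np.clip(Xd @ new_beta, -30.0, 30.0)
        y_hat = np.where(z == 1, 1.0, 1.0 / (1.0 + np.exp(-eta)))
        if np.max(np.abs(new_beta - beta)) < tol:
            beta = new_beta
            break
        beta = new_beta
    return beta
```

In this sketch `pi` is an input rather than an estimated quantity, in line with the abstract's recommendation that an estimate of population prevalence be provided.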

Figures

Figure 1
The EM algorithm for presence-only data.
Figure 2
A simple example where species presence depends only on elevation. A logistic regression model estimated using y (a) and a naive logistic regression model estimated using z (b) are illustrated by the lines. The vertical location of data points (crosses and circles) indicates the value of the outcome as used in the fitting procedure.
Figure 3
Iterations of the EM algorithm using a logistic regression model as applied to the simple elevation example. The vertical location of data points (crosses and circles) indicates the value of the outcome as used in the fitting procedure, and lines indicate models estimated from these data.
Figure 4
The case–control adjusted η_naive from the naive logistic regression is an increasing but nonlinear function of the linear predictor from the true model of interest, η, and the population prevalence π. Thick lines indicate typical values of η for each π.
Figure 5
Parameter estimates for the naive logistic regression model are biased toward the origin, with increased bias for larger π. These estimates are from 100 simulations of presence-only data, generated from the model η(x) = α + x_1 β_1 + x_2 β_2, where β_1 = 1 and β_2 = −2. The x are independent and identically distributed standard normals, with n_p = 300 and n_u = 1000. Note that the variance of the EM estimates increases with π.
Figure 6
The marginal effect of each variable on η for the EM and naive models and for the full model based on the true presences (mean and ±1 pointwise standard errors). The boxplot in (a) indicates the distribution of the summer temperature across all locations in the sample; 15% of these locations had a dam downstream.
Figure 7
The EM model has a lower validation set deviance (a) and less shrinkage in η (b) than the naive model, when predicting the presence of the Longfin Eel. The average validation set deviance is calculated from the likelihood of the true presences (mean and ±1 pointwise standard errors). The predicted η in (b) are the best fitting EM and naive models, versus the best fitting model based on the true presences.
Figure 8
A sensitivity analysis for π indicates that the minimum validation set deviance for the EM model is smallest for π ≈ 0.6 (a). The effect on η of average summer temperature has a consistent shape across all π, but the estimated effect magnitude increases with π (b). The validation set deviance is calculated using the presence-only likelihood, so is not comparable with Figure 7.
Figure 9
The likelihood surface, and a 95% confidence interval (dashed line), for π and the intercept α at the true β for a simulation of 200 observed presences and 1000 background data. The generative model is η(x) = α + βx, where α = −1.0, β = 2, π = 0.34 and the x are independent and identically distributed standard normals.
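
The shrinkage described in the captions for Figures 4 and 5 can be illustrated with a small simulation. Under the assumption that presences are drawn uniformly from occupied locations and background points uniformly from the landscape, the naive logit of z is η − log(1 + e^η) plus a constant. Its slope in η is 1/(1 + e^η), which is below 1, so the fitted coefficients are pulled toward the origin. The sketch below follows the design stated for Figure 5 (β_1 = 1, β_2 = −2, n_p = 300, n_u = 1000, 100 replicates). The intercept α = 0, the finite simulated population, and the helper names `simulate_presence_only` and `naive_logistic` are assumptions made here for illustration, not details from the source.

```python
import numpy as np


def simulate_presence_only(alpha, beta, n_p, n_u, rng, pop_size=100_000):
    """Draw presence-only data from a finite simulated landscape (assumed design)."""
    X = rng.standard_normal((pop_size, len(beta)))
    p = 1.0 / (1.0 + np.exp(-(alpha + X @ beta)))
    y = rng.random(pop_size) < p                      # true presence/absence
    pres = rng.choice(np.flatnonzero(y), size=n_p, replace=False)
    back = rng.choice(pop_size, size=n_u, replace=False)
    X_s = np.vstack([X[pres], X[back]])
    z = np.concatenate([np.ones(n_p), np.zeros(n_u)])  # 1 = observed presence
    return X_s, z


def naive_logistic(X, z, n_iter=25):
    """Naive model: ordinary logistic regression of z on X via Newton steps."""
    Xd = np.column_stack([np.ones(len(z)), X])
    b = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xd @ b, -30.0, 30.0)))
        H = (Xd * (p * (1.0 - p))[:, None]).T @ Xd
        b = b + np.linalg.solve(H, Xd.T @ (z - p))
    return b


rng = np.random.default_rng(1)
beta_true = np.array([1.0, -2.0])
alpha = 0.0  # hypothetical intercept; the caption does not give one

fits = np.array([
    naive_logistic(*simulate_presence_only(alpha, beta_true, 300, 1000, rng))
    for _ in range(100)
])
mean_slopes = fits[:, 1:].mean(axis=0)
print("true slopes:      ", beta_true)
print("mean naive slopes:", mean_slopes)
```

Changing `alpha` changes the prevalence of the simulated species, which gives a rough way to explore how the shrinkage varies with π.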
