2018 Jun 19;115(25):E5716-E5725. doi: 10.1073/pnas.1719367115. Epub 2018 Jun 5.

Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning

Mohammad Sadegh Norouzzadeh et al. Proc Natl Acad Sci U S A.

Abstract

Having accurate, detailed, and up-to-date information about the location and behavior of animals in the wild would improve our ability to study and conserve ecosystems. We investigate the ability to automatically, accurately, and inexpensively collect such data, which could help catalyze the transformation of many fields of ecology, wildlife biology, zoology, conservation biology, and animal behavior into "big data" sciences. Motion-sensor "camera traps" enable collecting wildlife pictures inexpensively, unobtrusively, and frequently. However, extracting information from these pictures remains an expensive, time-consuming, manual task. We demonstrate that such information can be automatically extracted by deep learning, a cutting-edge type of artificial intelligence. We train deep convolutional neural networks to identify, count, and describe the behaviors of 48 species in the 3.2 million-image Snapshot Serengeti dataset. Our deep neural networks automatically identify animals with >93.8% accuracy, and we expect that number to improve rapidly in years to come. More importantly, if our system classifies only images it is confident about, our system can automate animal identification for 99.3% of the data while still performing at the same 96.6% accuracy as that of crowdsourced teams of human volunteers, saving >8.4 y (i.e., >17,000 h at 40 h/wk) of human labeling effort on this 3.2 million-image dataset. Those efficiency gains highlight the importance of using deep neural networks to automate data extraction from camera-trap images, reducing a roadblock for this widely used technology. Our results suggest that deep learning could enable the inexpensive, unobtrusive, high-volume, and even real-time collection of a wealth of information about vast numbers of animals in the wild.
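The key efficiency result in the abstract comes from confidence gating: the network classifies an image automatically only when its top output probability clears a threshold, and defers the rest to human volunteers. A minimal sketch of that idea is below; the `threshold_classify` helper and the toy confidence values are illustrative assumptions, not the paper's actual pipeline or thresholds.

```python
import numpy as np

def threshold_classify(probs, labels, threshold):
    """Automate only predictions whose top confidence reaches `threshold`;
    defer the rest to human labelers.

    probs:  (n, k) array of per-class confidences (e.g., softmax outputs)
    labels: (n,) ground-truth class indices
    Returns (fraction of images automated, accuracy on the automated subset).
    """
    conf = probs.max(axis=1)          # top confidence per image
    preds = probs.argmax(axis=1)      # predicted class per image
    automated = conf >= threshold     # mask of images the model keeps
    if not automated.any():
        return 0.0, float("nan")
    frac = automated.mean()
    acc = (preds[automated] == labels[automated]).mean()
    return float(frac), float(acc)

# Hypothetical confidences for four images over two classes:
probs = np.array([
    [0.99, 0.01],   # confident
    [0.55, 0.45],   # uncertain -> deferred to humans
    [0.97, 0.03],   # confident
    [0.10, 0.90],   # confident
])
labels = np.array([0, 1, 0, 1])
frac, acc = threshold_classify(probs, labels, threshold=0.9)
# Here 3 of 4 images are automated (frac = 0.75), all correctly (acc = 1.0).
```

Raising the threshold trades automation fraction for accuracy on the automated subset, which is how the reported 99.3% automation at 96.6% accuracy operating point would be selected.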

Keywords: artificial intelligence; camera-trap images; deep learning; deep neural networks; wildlife ecology.


Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Deep neural networks (DNNs) can successfully identify, count, and describe animals in camera-trap images. Above the image: The ground-truth, human-provided answer (top line) and the prediction (second line) by a DNN we trained (ResNet-152). The three plots below the image, from left to right, show the neural network’s prediction for the species, number, and behavior of the animals in the image. The horizontal color bars indicate how confident the neural network is about its predictions. All similar images in this work are from the SS dataset (1).
Fig. 2.
Various factors make identifying animals in the wild difficult even for humans (trained volunteers achieve 96.6% accuracy relative to expert labels).
Fig. 3.
DNNs have several layers of abstraction that tend to gradually convert raw data into more abstract concepts. For example, raw pixels at the input layer are first processed to detect edges (first hidden layer), then corners and textures (second hidden layer), then object parts (third hidden layer), and so on if there are more layers, until a final prediction is made by the output layer. Note that which types of features are learned at each layer are not human-specified, but emerge automatically as the network learns how to solve a given task.
Fig. 4.
While we train models on individual images, we only have labels for entire capture events (a set of images taken one after the other within approximately 1 second, e.g., A, B, and C), which we apply to all images in the event. When some images in an event have an animal (e.g., A) and others are empty (B and C in this example), the empty images are labeled with the animal type, which introduces some noise in the training-set labels and thus makes training harder.
Fig. 5.
(Upper) Top-1 and top-5 accuracy of different models on the task of identifying the species of animal present in the image. Although the accuracies of all of the models are similar, the ensemble of models is the best with 94.9% top-1 and 99.1% top-5 accuracy. (Lower) Top-1 accuracy and the percentage of predictions within ±1 bin for counting animals in the images. Again, the ensemble of models is the best with 63.1% top-1 and 84.7% of the predictions within ±1 bin.
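The two metrics in this caption, top-k accuracy and a mean ensemble of model outputs, can be sketched as follows. The helper names and the tiny probability arrays are illustrative assumptions; the paper's ensemble may combine models differently.

```python
import numpy as np

def top_k_accuracy(probs, labels, k):
    """Fraction of examples whose true label is among the k highest-scoring classes."""
    topk = np.argsort(probs, axis=1)[:, -k:]          # indices of the k top classes
    hits = [label in row for row, label in zip(topk, labels)]
    return float(np.mean(hits))

def mean_ensemble(prob_list):
    """Combine several models by averaging their per-class outputs."""
    return np.mean(prob_list, axis=0)

# Two hypothetical models scoring two images over three classes:
model_a = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3]])
model_b = np.array([[0.4, 0.5, 0.1],
                    [0.1, 0.6, 0.3]])
labels = np.array([0, 1])

avg = mean_ensemble([model_a, model_b])
top1 = top_k_accuracy(avg, labels, k=1)   # 1.0 on this toy data
top2 = top_k_accuracy(avg, labels, k=2)   # 1.0 on this toy data
```

Averaging tends to help because the models' individual errors are partly uncorrelated, which is consistent with the ensemble outperforming each single model in the figure.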
Fig. 6.
Shown are nine images the ResNet-152 model labeled correctly. Above each image is a combination of expert-provided labels (for the species type and counts) and volunteer-provided labels (for additional attributes), as well as the model’s prediction for that image. Below each image are the top guesses of the model for different tasks, with the width of the color bars indicating the model’s output for each guess, which can be interpreted as its confidence in that guess.
Fig. 7.
(A–I) Shown are nine images the ResNet-152 model labeled incorrectly. Above each image is a combination of expert-provided labels (for the species type and counts) and volunteer-provided labels (for additional attributes), as well as the model’s prediction for that image. Below each image are the top guesses of the model for different tasks, with the width of the color bars indicating the model’s output for each guess, which can be interpreted as its confidence in that guess. One can see why these images are difficult to get right. G and I contain examples of the noise caused by assigning the label for the capture event to all images in the event. A, B, D, and H show how animals being too far from the camera makes classification difficult.

References

    1. Swanson A, et al. Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna. Sci Data. 2015;2:150026.
    2. Harris G, Thompson R, Childs JL, Sanderson JG. Automatic storage and analysis of camera trap data. Bull Ecol Soc Am. 2010;91:352–360.
    3. O’Connell AF, Nichols JD, Karanth KU. Camera Traps in Animal Ecology: Methods and Analyses. Springer; Tokyo: 2010.
    4. Silveira L, Jacomo AT, Diniz-Filho JAF. Camera trap, line transect census and track surveys: A comparative evaluation. Biol Conserv. 2003;114:351–355.
    5. Bowkett AE, Rovero F, Marshall AR. The use of camera-trap data to model habitat use by antelope species in the Udzungwa mountain forests, Tanzania. Afr J Ecol. 2008;46:479–487.
