. 2022 May 4;17(5):e0267759.
doi: 10.1371/journal.pone.0267759. eCollection 2022.

Deep learning with self-supervision and uncertainty regularization to count fish in underwater images


Penny Tarling et al. PLoS One. 2022.

Abstract

Effective conservation actions require effective population monitoring. However, accurately counting animals in the wild to inform conservation decision-making is difficult. Monitoring populations through image sampling has made data collection cheaper, wide-reaching and less intrusive, but has created a need to process and analyse this data efficiently. Counting animals from such data is challenging, particularly when densely packed in noisy images. Attempting this manually is slow and expensive, while traditional computer vision methods are limited in their generalisability. Deep learning is the state-of-the-art method for many computer vision tasks, but it has yet to be properly explored for counting animals. To this end, we employ deep learning, with a density-based regression approach, to count fish in low-resolution sonar images. We introduce a large dataset of sonar videos, deployed to record wild Lebranche mullet schools (Mugil liza), with a subset of 500 labelled images. We utilise abundant unlabelled data in a self-supervised task to improve the supervised counting task. For the first time in this context, by introducing uncertainty quantification, we improve model training and provide an accompanying measure of prediction uncertainty for more informed biological decision-making. Finally, we demonstrate the generalisability of our proposed counting framework by testing it on a recent benchmark dataset of high-resolution annotated underwater images from varying habitats (DeepFish). From experiments on both contrasting datasets, we demonstrate that our network outperforms the few other deep learning models implemented for solving this task. By providing an open-source framework along with training data, our study puts forth an efficient deep learning template for crowd counting aquatic animals, thereby contributing effective methods to assess natural populations from the ever-increasing visual data.
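The density-based regression approach mentioned in the abstract works by mapping each image to a density map whose integral equals the animal count, rather than detecting individuals directly. A minimal numpy sketch of that idea (not the authors' code; the Gaussian kernel width is an illustrative choice):

```python
import numpy as np

def make_density_map(points, shape, sigma=2.0):
    """Build a density map from point annotations: one unit-mass
    Gaussian per annotated fish, so the map integrates to the count."""
    h, w = shape
    density = np.zeros((h, w), dtype=np.float64)
    yy, xx = np.mgrid[0:h, 0:w]
    for (y, x) in points:
        g = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        g /= g.sum()  # normalise so each fish contributes mass 1
        density += g
    return density

# Three annotated fish in a 64x64 image.
points = [(10, 12), (30, 40), (50, 20)]
dmap = make_density_map(points, (64, 64))
print(round(dmap.sum()))  # integrating the map recovers the count: 3
```

A network trained to regress such maps can then count by summing its output, which copes better with densely packed, overlapping fish than bounding-box detection.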


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1
Fig 1. In situ sampling of Lebranche mullet in turbid waters using a sonar imaging system.
(a) Schematics of the image production by the sonar camera. The Adaptive Resolution Imaging Sonar (ARIS) uses 128 beams to project a wedge-shaped volume of acoustic energy and converts their returning echoes into a digital image that gives an overhead view of the object, here exemplified by a cylinder (reprinted from the ARIS Explorer User Guide Manual 015045_RevB under a CC BY license, with permission from ©Sound Metrics Corp, original copyright 2014; this modified image is similar, but not identical, to the original image, and is therefore for illustrative purposes). (b) In situ sonar sampling during the dolphin-fisher foraging interactions. The traditional cooperative foraging between wild dolphins and artisanal net-casting fishers targeting mullets, in the murky waters of the estuarine canal in Laguna, southern Brazil, seen from land and from a drone. Fishers wait in line at the edge of the canal for the dolphins’ foraging cues (top image: a sudden dive near the coast), which fishers interpret as the moment and place to cast their nets, presumably on top of passing mullet schools. The sonar camera (blue triangle) was deployed to record passing mullet schools at the spatial scale relevant for the interacting dolphins and fishers (6–20 m). (c) Lebranche mullets (Mugil liza). A still image from a real-time underwater sonar video depicting the overhead perspective view of a passing mullet school in front of the line of fishers; a typical mullet caught by the fishers is shown (average body length = 42.9 cm ± 7.00 SD, n = 771 fish measured at the beach). (Photos by M. Cantor, A.M.S. Machado, D.R. Farine; reproduced with permission).
Fig 2
Fig 2. Image pre-processing for assessing mullet abundance from sonar images.
(a) Raw frame depicting dolphins and a large mullet school. (b) Contrast enhancement and background removal. (c) Manual labelling of a sample: the large bounding box marks where the raw image was cropped so all input samples represent a consistent size of geographical area and a consistent distance from the sonar camera. The smaller bounding boxes mark where noise (here, a dolphin) is present. Each point annotation marks the location of an individual mullet. (d-f) Examples of variation in the sonar images in our dataset, to which the density-based deep learning model needs to be adaptable. (d) Frame with high mullet abundance: large number of fish, swimming compactly; (e) low abundance: small number of fish, sparsely distributed; (f) noise: 3 dolphins and a fishing net (note the overhead perspective of a rounded casting net).
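The caption names contrast enhancement and background removal but not the exact operations used. A hedged numpy sketch with temporal-median background subtraction and a percentile contrast stretch as plausible stand-ins (both are assumptions, not the paper's pipeline):

```python
import numpy as np

def remove_background(frames):
    """Estimate a static background as the per-pixel temporal median
    over a stack of frames, then subtract it from each frame."""
    background = np.median(frames, axis=0)
    return np.clip(frames - background, 0.0, None)

def stretch_contrast(img, lo_pct=2, hi_pct=95):
    """Percentile-based contrast stretch to the range [0, 1]."""
    lo, hi = np.percentile(img, [lo_pct, hi_pct])
    return np.clip((img - lo) / max(hi - lo, 1e-8), 0.0, 1.0)

rng = np.random.default_rng(0)
frames = rng.random((10, 32, 32)) * 0.1 + 0.5  # near-static scene
frames[3, 10:14, 10:14] += 0.8                 # a bright moving target
fg = remove_background(frames)                 # static echoes suppressed
enhanced = stretch_contrast(fg[3])             # target now high-contrast
```

Median subtraction suppresses static echoes (seabed, canal walls) while keeping moving targets such as fish, which is why it is a common first step for sonar footage.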
Fig 3
Fig 3. Distribution of labelled dataset by number of fish.
Number of fish plotted in log scale: the subset of data is skewed towards samples with low numbers of fish. This imbalanced distribution is even more pronounced in the complete dataset, a common feature of data collected in the wild.
Fig 4
Fig 4. Pipeline of our final network.
The multi-task network is trained end-to-end to simultaneously regress labelled images to corresponding density maps and rank the unlabelled images in order of fish abundance. The backbone of each branch is a ResNet-50 [63] followed by a 1 × 1 convolutional layer with 2 output filters. A non-learnable Global Average Pooling layer is added to each branch of the Siamese network so the resulting scalar count of the first image in the pair (I, I′) can be subtracted from that of the second image. All parameters are shared (represented by the orange dashed arrows), so incorporating the self-supervised task adds no parameters to the base model. The inclusion of an additional channel in our output tensor to estimate noise variance only roughly doubles the parameters in the head, equivalent to 0.01% of the total number. K is the batch size, where a batch contains K images from the labelled subset of data and K pairs of images from the larger unlabelled pool of data. H and W are the height and width of an input 3-channel RGB image, whereas H′ and W′ are the height and width of the output tensors from the backbone and heads.
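The Siamese ranking branch described in the caption can be sketched as a hinge loss on the difference of the two pooled scalar counts. The pairing rule below (a crop of an image cannot contain more fish than the full image) is a common self-supervised construction for count ranking and is an assumption here, not a detail confirmed by the caption:

```python
import numpy as np

def predicted_count(density_map):
    """Global pooling of a predicted density map yields a scalar count."""
    return float(density_map.sum())

def ranking_loss(count_a, count_b, margin=0.0):
    """Hinge loss on a Siamese pair: penalise the network when the image
    assumed to contain fewer fish (b) is scored above the one assumed
    to contain more (a)."""
    return max(0.0, count_b - count_a + margin)

full = np.full((8, 8), 0.5)  # stand-in predicted density map for image I
crop = full[:4, :4]          # a sub-region of I cannot hold more fish
loss = ranking_loss(predicted_count(full), predicted_count(crop))
print(loss)  # 0.0 -- the ranking constraint is satisfied
```

Because the loss depends only on the ordering of the two counts, it needs no labels, which is how the unlabelled pool can supervise the counting backbone without adding parameters.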
Fig 5
Fig 5. Performance of the deep learning models for counting fish in sonar images.
(a) Error analysis for sample subgroups, categorised by number of fish or noise present. The MAE for a sample has been divided by the average actual count within each subgroup so results are somewhat normalised and can be compared between subgroups (Eq 11). The percentages of samples that fall within each subgroup are: c < 25: 34%, 25 ≤ c < 50: 10%, 50 ≤ c < 150: 14%, c ≥ 150: 9%, Noise: 34%. The reason the proportion with c < 50 (our first class for balance regularization) is altogether lower than 75% is that many of these samples were placed in the “noise” subgroup for this analysis. (b) The relationship between predicted noise variance and absolute error score, for models with AU-reg (iii, viii, ix). (c-f) Four sample images with corresponding ground truth and predicted density maps, from our best performing model, MT + AU-reg (viii). The density maps can be interpreted as a typical heat map, where areas of red indicate dense regions of mullet.
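The subgroup-normalised error in panel (a) divides each subgroup's MAE by its mean true count; this sketch assumes that straightforward form of Eq 11 (the exact equation is in the paper, not in this caption):

```python
import numpy as np

def normalised_mae(y_true, y_pred):
    """MAE divided by the subgroup's mean true count, so error rates
    are comparable between low- and high-abundance subgroups."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean() / y_true.mean()

# Two subgroups with the same raw MAE (5 fish) but different abundance:
low = normalised_mae([10, 20, 30], [15, 25, 35])        # 5 / 20  = 0.25
high = normalised_mae([200, 300, 400], [205, 305, 395])  # 5 / 300 ~ 0.017
```

Without this normalisation, a fixed absolute error of a few fish would dominate the comparison for sparse samples while looking negligible for dense schools.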
Fig 6
Fig 6. Three sample images from the “counting” subset of the DeepFish dataset with point level annotation.
All data were collected with HD resolution digital cameras in 20 different marine habitats in tropical Australia. Mean of 1.2 fish/image, ranging from 0 to 18 individuals. (a) Low algal bed, count: 3, classification: “fish”; (b) Reef trench, count: 2, classification: “fish”; (c) Upper mangrove, count: 0, classification: “no fish”. The images were obtained from the open-source dataset DeepFish [42], licensed under a Creative Commons Attribution 4.0 International License.

