DeepFruits: A Fruit Detection System Using Deep Neural Networks

Inkyu Sa et al.

Sensors (Basel). 2016 Aug 3;16(8):1222. doi: 10.3390/s16081222.

Abstract

This paper presents a novel approach to fruit detection using deep convolutional neural networks. The aim is to build an accurate, fast and reliable fruit detection system, which is a vital element of an autonomous agricultural robotic platform; it is a key element for fruit yield estimation and automated harvesting. Recent work in deep neural networks has led to the development of a state-of-the-art object detector termed Faster Region-based CNN (Faster R-CNN). We adapt this model, through transfer learning, for the task of fruit detection using imagery obtained from two modalities: colour (RGB) and Near-Infrared (NIR). Early and late fusion methods are explored for combining the multi-modal (RGB and NIR) information. This leads to a novel multi-modal Faster R-CNN model, which achieves state-of-the-art results compared to prior work, with the F1 score (which takes into account both precision and recall) improving from 0.807 to 0.838 for the detection of sweet pepper. In addition to improved accuracy, this approach is also much quicker to deploy for new fruits, as it requires bounding box annotation rather than pixel-level annotation (annotating bounding boxes is approximately an order of magnitude quicker to perform). The model is retrained to perform the detection of seven fruits, with the entire process of annotation and training taking four hours per fruit.
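For context, the F1 score quoted above is the harmonic mean of precision and recall. A minimal sketch of the calculation (the 0.838 figure is the sweet-pepper result from the abstract; the code itself is illustrative rather than taken from the paper):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# At an operating point where precision equals recall, F1 reduces to that
# common value, e.g. the 0.838 reported for the multi-modal Faster R-CNN.
print(f1_score(0.838, 0.838))  # -> 0.838 (up to floating-point rounding)
```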

Keywords: agricultural robotics; deep convolutional neural network; harvesting robots; horticulture; multi-modal; rapid training; real-time performance; visual fruit detection.


Figures

Figure 1
Example images of the detection of two fruits. (a) and (b) show a colour (RGB) image and a Near-Infrared (NIR) image of sweet pepper detection, respectively, with detections denoted by red bounding boxes; (c) and (d) show the detection of rock melon.
Figure 2
Pixel-wise (a) and bounding box (b) image annotation.
Figure 3
Illustration of the Faster Region-based Convolutional Neural Network (R-CNN) at test time. There are 13 convolutional layers, 2 fully-connected layers (Fc6 and Fc7) and one softmax classifier layer. N denotes the number of proposals and is set to 300. O_{1:N} is the output that contains the N bounding boxes and their scores. Non-Maximum Suppression (NMS) with a threshold of 0.3 removes duplicate predictions. B_K is the bounding box of the K-th detection, a 4 × 1 vector containing the coordinates of the top-left and bottom-right points. x_K is a scalar representing an object being detected.
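As a rough illustration of the NMS step described above, here is a minimal sketch of greedy non-maximum suppression with the 0.3 overlap threshold; the box layout ([x1, y1, x2, y2]) and function names are assumptions for this example, not code from the paper.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, threshold=0.3):
    """Greedily keep the highest-scoring box and drop overlapping duplicates."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        order = [i for i in order if iou(boxes[best], boxes[i]) < threshold]
    return keep

# Toy usage: the second box heavily overlaps the first and is suppressed.
boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 150]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```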
Figure 4
(a) The 64 Conv1 filters (3 × 3 pixels) of the RGB network from VGG; (b) the input data; and (c) one of the feature activations from the conv5 layer. The cyan boxes in (b) are manually labelled in the data input layer to highlight the corresponding fruits in the feature map.
Figure 5
t-SNE feature visualisation of 3 classes. The 4096-dimensional features are extracted from the Fc7 layer and visualised in 2D. For the visualisation, 86 images are randomly selected from the dataset and passed through the network shown in Figure 3.
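A minimal sketch of this kind of visualisation using scikit-learn's t-SNE, with random placeholder data standing in for the real Fc7 activations (the class labels and array shapes are assumptions for illustration, not values from the paper):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder inputs: one 4096-dimensional Fc7 activation vector per image and
# an integer class label (e.g. 0 = background, 1 = sweet pepper, 2 = rock melon)
# for each of the 86 selected images.
rng = np.random.default_rng(0)
features = rng.normal(size=(86, 4096))
labels = rng.integers(0, 3, size=86)

# Embed the high-dimensional Fc7 features into 2-D for plotting.
embedded = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(features)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=20)
plt.title("t-SNE of Fc7 features")
plt.show()
```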
Figure 6
A diagram of the early and late fusion networks. (a) The early fusion network, which concatenates a 1-channel NIR image with a 3-channel RGB image; (b) the late fusion network, which stacks the outputs, O_{1:2N}^{RGB+NIR}, from two Faster R-CNN networks. O_{1:N}^{RGB} and O_{1:N}^{NIR} represent the outputs containing N = 300 bounding boxes and their scores from the RGB and NIR networks, respectively. K is the number of objects being detected. Note that the Faster R-CNN used for the early fusion is identical to that shown in Figure 3.
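The stacking step described in (b) can be pictured as a simple concatenation of the two networks' outputs before a final suppression/selection stage. A minimal sketch under the assumption that each network returns its N boxes as [x1, y1, x2, y2] rows with a matching score vector (an illustration only, not the paper's fusion code):

```python
import numpy as np

def stack_outputs(boxes_rgb, scores_rgb, boxes_nir, scores_nir):
    """Stack O_{1:N}^{RGB} and O_{1:N}^{NIR} into the combined output
    O_{1:2N}^{RGB+NIR} (2N candidate boxes with their scores)."""
    boxes = np.vstack([boxes_rgb, boxes_nir])            # shape (2N, 4)
    scores = np.concatenate([scores_rgb, scores_nir])    # shape (2N,)
    # A subsequent suppression step (e.g. the NMS sketched for Figure 3)
    # reduces these 2N candidates to the K detected objects.
    return boxes, scores
```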
Figure 7
(a,b) The hand-labelled ground truth using an RGB image and an NIR image respectively; (c) A merged ground truth bounding box. The cyan box displays a bounding box that is correctly annotated using the RGB image, but missed in the NIR image, due to the poor visibility of a fruit.
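One plausible way to obtain such a merged ground truth is to take the union of the RGB and NIR annotation sets, treating boxes that overlap strongly as the same fruit. A minimal sketch under that assumption (the 0.5 IoU threshold and helper names are illustrative, not taken from the paper):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_ground_truth(boxes_rgb, boxes_nir, iou_threshold=0.5):
    """Union of the two annotation sets: keep every RGB box and add any NIR
    box that does not overlap an already-kept box above the IoU threshold."""
    merged = [list(b) for b in boxes_rgb]
    for b in boxes_nir:
        if all(iou(b, g) < iou_threshold for g in merged):
            merged.append(list(b))
    return merged
```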
Figure 8
Precision-recall curves of four networks. The marks indicate the point where precision and recall are identical, and F1 scores are computed at these points.
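A minimal sketch of how such an equilibrium operating point might be located on a sampled precision-recall curve; the paired precision/recall arrays are assumed inputs, and this is not the authors' evaluation code.

```python
import numpy as np

def equilibrium_f1(precision, recall):
    """Locate the sampled point where precision and recall are closest to
    equal and return the F1 score there."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    i = int(np.argmin(np.abs(precision - recall)))
    p, r = precision[i], recall[i]
    return 2 * p * r / (p + r)

# Example with a toy sampled curve: precision falls as recall rises.
print(equilibrium_f1([0.95, 0.90, 0.84, 0.75], [0.60, 0.78, 0.84, 0.90]))  # -> 0.84
```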
Figure 9
Precision-recall curves of the CRF baseline and the early and late fusion networks. All make use of RGB and NIR images as inputs. Due to the limited performance of the CRF, we calculate its F1 score slightly offset from the equilibrium point.
Figure 10
Performance evaluation of the region proposals of four different networks.
Figure 11
(a) Precision-recall curves for varying numbers of training images, denoted by different colours. The marks indicate the points where precision and recall are identical; (b) the F1 scores versus the number of images used for fine-tuning.
Figure 12
Instances of detection performance using the same camera setup and location as the training dataset. Above each detection is the classification confidence output from the DCNN. (a,b) The outputs from the RGB and NIR networks, respectively. It can be seen that there are noticeable false negatives (misses) in the NIR image, and that colour and surface reflections play important roles in detection for this example.
Figure 13
Instances of detection performance using a different camera setup (Kinect 2) and a different location. Above each detection is the classification confidence output from the DCNN. (a,b) Without and with a sun screen shed, respectively. Despite the obvious difference in brightness between the two scenes, the proposed algorithm generalises impressively well to this dataset.
Figure 14
Quantitative performance evaluation for different fruits. The marks indicate the points where the F1 scores are computed.
Figure 15
Four instances of sweet pepper detection. (a) and (b) are obtained from a farm site using a JAI camera, and (c) and (d) are collected using a Kinect 2 camera at a different farm site. Above each detection is the classification confidence output from the DCNN.
Figure 16
Four instances of rock melon detection. (a) and (b) are obtained from a farm site using a JAI camera, and (c) and (d) are from Google Images. Above each detection is the classification confidence output from the DCNN.
Figure 17
Eight instances of red (a,e–h) and green (b–d) apple detection (different varieties). Images are obtained from Google Images. Above each detection is the classification confidence output from the DCNN.
Figure 18
Eight instances, (a–g) and (h), of avocado detection (varying levels of ripeness). Images are obtained from Google Images.
Figure 19
Eight instances, (a–g) and (h), of mango detection (varying levels of ripeness). Images are obtained from Google Images.
Figure 20
Eight instances, (a–g) and (h), of orange detection (varying levels of ripeness). Images are obtained from Google Images.
Figure 21
Eight instances, (a–g) and (h), of strawberry detection (varying levels of ripeness). Images are obtained from Google Images.
Figure 22
Detection result when two fruits are present in the scene. Two images are manually cropped, stitched and then fed to the RGB network.

References

    1. ABARE . Australian Vegetable Growing Farms: An Economic Survey, 2013–14 and 2014–15. Australian Bureau of Agricultural and Resource Economics (ABARE); Canberra, Australia: 2015. Research report.
    1. Kondo N., Monta M., Noguchi N. Agricultural Robots: Mechanisms and Practice. Trans Pacific Press; Balwyn North Victoria, Australia: 2011.
    1. Bac C.W., van Henten E.J., Hemming J., Edan Y. Harvesting Robots for High-Value Crops: State-of-the-Art Review and Challenges Ahead. J. Field Robot. 2014;31:888–911. doi: 10.1002/rob.21525. - DOI
    1. McCool C., Sa I., Dayoub F., Lehnert C., Perez T., Upcroft B. Visual Detection of Occluded Crop: For automated harvesting; Proceedings of the International Conference on Robotics and Automation; Stockholm, Sweden. 16–21 May 2016.
    1. Russakovsky O., Deng J., Su H., Krause J., Satheesh S., Ma S., Huang Z., Karpathy A., Khosla A., Bernstein M., et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015;115:211–252. doi: 10.1007/s11263-015-0816-y. - DOI