Physiol Meas. 2019 Dec 2;40(11):115001. doi: 10.1088/1361-6579/ab525c.

Cardio-respiratory signal extraction from video camera data for continuous non-contact vital sign monitoring using deep learning

Sitthichok Chaichulee et al.

Abstract

Objective: Non-contact vital sign monitoring enables the estimation of vital signs, such as heart rate, respiratory rate and oxygen saturation (SpO2), by measuring subtle color changes on the skin surface using a video camera. For patients in a hospital ward, the main challenges in the development of continuous and robust non-contact monitoring techniques are the identification of time periods and the segmentation of skin regions of interest (ROIs) from which vital signs can be estimated. We propose a deep learning framework to tackle these challenges.

Approach: This paper presents two convolutional neural network (CNN) models. The first network was designed for detecting the presence of a patient and segmenting the patient's skin area. The second network combined the output from the first network with optical flow for identifying time periods of clinical intervention so that these periods can be excluded from the estimation of vital signs. Both networks were trained using video recordings from a clinical study involving 15 pre-term infants conducted in the high dependency area of the neonatal intensive care unit (NICU) of the John Radcliffe Hospital in Oxford, UK.
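As a rough illustration of how the two networks would be chained at run time, the sketch below processes one analysis window; patient_net and intervention_net are hypothetical callables standing in for the two trained models, not the paper's actual interfaces.

    import numpy as np

    def process_window(frames, flow_stack, patient_net, intervention_net):
        """Chain the two networks for one time window (hypothetical API)."""
        skin_maps = []
        for frame in frames:
            present, skin_map = patient_net(frame)   # network 1
            if not present:
                return None            # patient absent: skip estimation
            skin_maps.append(skin_map)
        if intervention_net(np.stack(skin_maps), flow_stack):  # network 2
            return None                # intervention: exclude this window
        return skin_maps               # usable skin ROIs for vital signs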

Main results: Our proposed methods achieved an accuracy of 98.8% for patient detection, a mean intersection-over-union (IOU) score of 88.6% for skin segmentation and an accuracy of 94.5% for clinical intervention detection using two-fold cross validation. Our deep learning models produced accurate results and were robust to different skin tones, changes in light conditions, pose variations and different clinical interventions by medical staff and family visitors.
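For reference, intersection-over-union compares a predicted skin mask against its annotation; below is a minimal NumPy sketch of one common formulation, averaging IOU over evaluation frames, which may differ from the paper's exact protocol.

    import numpy as np

    def iou(pred, target):
        """IOU between two binary skin masks."""
        pred, target = pred.astype(bool), target.astype(bool)
        union = np.logical_or(pred, target).sum()
        if union == 0:
            return 1.0  # both masks empty: count as perfect agreement
        return np.logical_and(pred, target).sum() / union

    def mean_iou(preds, targets):
        """Mean IOU over a set of frames."""
        return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))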

Significance: Our approach allows cardio-respiratory signals to be derived continuously from the patient's skin during periods in which the patient is present and no clinical intervention is undertaken.


Figures

Figure 1. The proposed framework consists of two deep learning networks: the patient detection and skin segmentation network; and the intervention detection network. These networks operate in sequence to identify appropriate time periods and ROIs from which vital signs can be estimated.
Figure 2. Equipment set-up for video recording: (a) camera, recording workstation and incubator; and (b) sample video frame.
Figure 3. The proposed patient detection and skin segmentation network has two output streams. The patient detection stream performs global average pooling over feature maps to predict the presence of the patient in the scene. The skin segmentation stream performs hierarchical upsampling of feature maps across the shared core network to produce a skin label. The network was designed to evaluate the skin segmentation stream only if the infant was present in the scene.
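The caption above describes a shared core network feeding a global-average-pooling detection head and an upsampling segmentation head. A minimal PyTorch sketch of that topology follows; the encoder, layer sizes and fusion scheme are illustrative placeholders, not the architecture used in the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PatientSkinNet(nn.Module):
        """Illustrative two-stream network: shared encoder, two heads."""
        def __init__(self):
            super().__init__()
            # Shared core network (sizes are placeholders).
            self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1),
                                      nn.ReLU(), nn.MaxPool2d(2))
            self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1),
                                      nn.ReLU(), nn.MaxPool2d(2))
            # Detection head: global average pooling + linear classifier.
            self.detect = nn.Linear(64, 2)   # patient present / absent
            # Segmentation head: hierarchical upsampling across stages.
            self.up2 = nn.Conv2d(64, 1, 1)
            self.up1 = nn.Conv2d(32, 1, 1)

        def forward(self, x):
            f1 = self.enc1(x)                # 1/2 resolution features
            f2 = self.enc2(f1)               # 1/4 resolution features
            logits = self.detect(f2.mean(dim=(2, 3)))  # global avg pool
            # Fuse coarse and fine maps, then upsample to input size.
            s = F.interpolate(self.up2(f2), size=f1.shape[2:],
                              mode="bilinear", align_corners=False)
            s = s + self.up1(f1)
            skin = F.interpolate(s, size=x.shape[2:],
                                 mode="bilinear", align_corners=False)
            return logits, torch.sigmoid(skin)

In use, the segmentation output would only be consulted when the detection head predicts that the infant is present, mirroring the conditional evaluation described in the caption.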
Figure 4. Flowchart of semi-automatic skin annotation. Each annotator was asked to label skin areas in the first image of each session. The label was then propagated to the next frame using GMMs. The annotator could interact with seeds (green and red circles corresponding to skin and non-skin areas, respectively) to modify the skin label for the new image frame.
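The GMM-based propagation in this caption can be pictured as two colour models, one for skin and one for background, applied to the next frame; below is a minimal scikit-learn sketch under that assumption (the exact feature space and seeding used in the paper are not specified here).

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def propagate_skin_label(frame, skin_pixels, nonskin_pixels, k=3):
        """Label a new frame with two colour GMMs (illustrative only).

        skin_pixels / nonskin_pixels: (N, 3) colour samples taken from
        the previous frame's label or the annotator's seed circles.
        """
        gmm_skin = GaussianMixture(n_components=k).fit(skin_pixels)
        gmm_bg = GaussianMixture(n_components=k).fit(nonskin_pixels)
        pixels = frame.reshape(-1, 3).astype(float)
        # Assign each pixel to whichever mixture explains it better.
        mask = gmm_skin.score_samples(pixels) > gmm_bg.score_samples(pixels)
        return mask.reshape(frame.shape[:2])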
Figure 5. Lighting augmentation was applied to generate additional training images with different lighting conditions. The histogram of the average lighting components of all training images was divided into four uniform intervals. The mean of each interval was computed (marked with a red asterisk). Three additional images were generated by scaling the lighting component of the original image to the means of intervals 2, 3 and 4, respectively.
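One way to realise this augmentation is to treat the L channel of the Lab colour space as the lighting component; that choice is an assumption of this sketch, since the caption does not name the colour space. interval_means would be the histogram-interval means computed beforehand over the whole training set.

    import numpy as np
    import cv2

    def lighting_augment(image_bgr, interval_means):
        """Rescale the lighting component to each target mean (sketch)."""
        lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
        current = lab[..., 0].mean()
        variants = []
        for target in interval_means:
            out = lab.copy()
            # Scale the L channel so its mean matches the interval mean.
            out[..., 0] = np.clip(out[..., 0] * (target / current), 0, 255)
            variants.append(cv2.cvtColor(out.astype(np.uint8),
                                         cv2.COLOR_LAB2BGR))
        return variants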
Figure 6. The proposed intervention detection network operates on a 5 s time window. The network consists of two input streams. The first input stream (context stream) processes a stack of skin confidence maps produced by the patient detection and skin segmentation network. The second input stream (optical flow stream) handles a stack of dense optical flow. The outputs from both input streams are then combined to predict the occurrence of a clinical intervention in a given time window.
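A compact PyTorch sketch of this two-stream design is given below; the channel counts follow figure 7 (six skin confidence maps; five flows with two components each), and the convolutional stems are illustrative placeholders rather than the paper's layers.

    import torch
    import torch.nn as nn

    class InterventionNet(nn.Module):
        """Illustrative two-stream intervention detector."""
        def __init__(self, context_ch=6, flow_ch=10):
            super().__init__()
            def stream(in_ch):
                # Small convolutional stem; sizes are placeholders.
                return nn.Sequential(
                    nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.context = stream(context_ch)   # skin confidence maps
            self.flow = stream(flow_ch)         # stacked optical flow
            self.head = nn.Linear(128, 2)       # intervention / none

        def forward(self, skin_maps, flows):
            fused = torch.cat([self.context(skin_maps),
                               self.flow(flows)], dim=1)
            return self.head(fused)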
Figure 7. Processing of the input to the optical flow stream. For each 5 s time window, six video frames were taken, one image per second. Five optical flow fields were computed, one from each pair of consecutive video frames. The horizontal and vertical components of each flow field were then stacked together.
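Assembling that input might look like the sketch below; OpenCV's Farneback flow is used as a stand-in, since the caption does not name the flow algorithm.

    import numpy as np
    import cv2

    def flow_stack(frames_1hz):
        """Stack flow components for one 5 s window (sketch).

        frames_1hz: six grayscale frames sampled one per second.
        Returns an array of shape (10, H, W): the horizontal and
        vertical components of the five flows between consecutive
        frame pairs.
        """
        channels = []
        for prev, nxt in zip(frames_1hz[:-1], frames_1hz[1:]):
            flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            channels.extend([flow[..., 0], flow[..., 1]])
        return np.stack(channels)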
Figure 8. Example images for skin segmentation results.
Figure 9. Extraction of PPGi and respiratory signals from the segmented skin area. (a) Video frames with the segmented skin area provided by our proposed framework. (b) Timeline of patient activities over a 60 min segment of a typical recording session, manually annotated on a minute-by-minute basis. (c) Timeline of predicted time periods of infant absence and clinical intervention provided by the proposed algorithms. (d) 60 min time series of the PPGi signal extracted from the mean pixel intensity of the entire segmented skin region in the green channel. (e) 60 min time series of the respiratory signal extracted from the area of the entire segmented skin region. (f) Comparison of non-contact PPGi, contact ECG and contact PPG signals for the area highlighted in (d). Each signal contains 78 peaks, corresponding to a heart rate of 156 beats min−1. (g) Comparison of non-contact respiratory and contact impedance pneumography (IP) signals for the area highlighted in (e). Each signal contains 35 peaks, corresponding to a respiratory rate of 70 breaths min−1.
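Panels (d) and (e) describe two simple per-frame statistics; below is a minimal NumPy sketch of extracting both time series from a sequence of frames and skin masks.

    import numpy as np

    def ppgi_and_resp(frames_bgr, skin_masks):
        """PPGi and respiratory samples from segmented skin (sketch).

        PPGi: mean green-channel intensity over the skin region (d);
        respiratory: area of the skin region in pixels (e).
        """
        ppgi, resp = [], []
        for frame, mask in zip(frames_bgr, skin_masks):
            ppgi.append(frame[..., 1][mask].mean())  # green in BGR order
            resp.append(int(mask.sum()))             # skin area
        return np.asarray(ppgi), np.asarray(resp)

Heart and respiratory rates would then follow from peak counting or spectral analysis of these two series, as in panels (f) and (g).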
Figure 10. Comparisons of non-contact and contact signals from different subjects. (a) Signals extracted from a mixed-race subject. Each cardiac signal contains 27 peaks, corresponding to a heart rate of 162 beats min−1. Each respiratory signal contains 13 peaks, corresponding to a respiratory rate of 78 breaths min−1. (b) Signals extracted from a subject with dark skin. Each cardiac signal contains 24 peaks, corresponding to a heart rate of 144 beats min−1. Each respiratory signal contains nine peaks, corresponding to a respiratory rate of 54 breaths min−1.


