
In this section, the methodology used in this study is presented, from the data processing techniques to the models used to construct the rPPG. A general visualization of the pipeline is presented in Fig. 1.

Fig. 1: From data processing to comparison of the reference photoplethysmogram (PPG) with the remote photoplethysmogram (rPPG) constructed by the model. CV cross-validation, RGB red, green, and blue channels, ML machine learning. Colors: the green signal refers to the rPPG reconstructed by the model, and the black signal refers to the fingertip PPG.

For this study, three public datasets were utilized:

LGI-PPGI: This dataset is published under the CC-BY-4.0 license. The study was supported by the German Federal Ministry of Education and Research (BMBF) under the grant agreement VIVID 01S15024 and by CanControls GmbH Aachen [21]. The LGI-PPGI dataset is a collection of videos featuring six participants, five male and one female. The participants were recorded while performing four activities: Rest, Talk, Gym (exercise on a bicycle ergometer), and Rotation (rotation of the subject's head at different speeds). The videos were captured using a Logitech HD C270 webcam with a frame rate of 25 fps, and cPPG signals were collected using a CMS50E PPG device at a sampling rate of 60 Hz. The videos were shot in varying lighting conditions, with talking scenes recorded outdoors and other activities taking place indoors.

PURE: Access to this dataset is granted upon request. It received support from the Ilmenau University of Technology, the Federal State of Thuringia, and the European Social Fund (OP 2007-2013) under grant agreement N501/2009 for the project SERROGA (project number 2011FGR0107) [26]. The PURE dataset contains videos of 10 participants, eight male and two female, engaged in various activities classified as Steady, Talk, Slow Translation (average speed of 7% of the face height per second), Fast Translation (average speed of 14% of the face height per second), Small Rotation (average head angle of 20°), and Medium Rotation (average head angle of 35°). The videos were captured using a 640×480 pixel eco274CVGE camera by SVS-Vistek GmbH, with a 30 fps frame rate and a 4.8 mm lens. The cPPG signals were collected using a CMS50E PPG device at a sampling rate of 60 Hz. The videos were shot in natural daylight, with the camera positioned at an average distance of 1.1 m from the participants' faces.

MR-NIRP indoor: This dataset is openly accessible without any restrictions. It received funding under the NIH grant 5R01DK113269-02 [27]. The MR-NIRP indoor video dataset comprises videos of eight participants, six male and two female, with different skin tones: one Asian, four Indian, and three Caucasian. The participants were recorded while performing Still and Motion activities, with talking and head movements being part of the latter. The videos were captured using a FLIR Blackfly BFLY-U3-23S6C-C camera with a resolution of 640×640 pixels and a frame rate of 30 fps. The cPPG signals were collected using a CMS 50D+ finger pulse oximeter at a sampling rate of 60 Hz.

Each dataset includes video recordings of participants engaged in various activities, alongside a reference cPPG signal recorded using a pulse oximeter. Table 1 provides detailed characteristics of each dataset.

The datasets used in our research are not only publicly available but are also extensively utilized within the scientific community for various secondary analyses. All datasets received the requisite ethical approvals and informed consents, in accordance with the regulations of their respective academic institutions. This compliance facilitated the publication of the data in academic papers and its availability online. The responsibility for managing ethical compliance was handled by the original data providers. They ensured that these datasets were made available under terms that permit their use and redistribution with appropriate acknowledgment.

Given the extensive use of these datasets across multiple studies, additional IRB approval for secondary analyses of de-identified and publicly accessible data is typically not required. This practice aligns with the policies at ETH Zurich, which do not mandate further IRB approval for the use of publicly available, anonymized data.

A comprehensive description of each dataset, including its source, funding agency, and licensing terms, has been provided in the manuscript. This ensures full transparency and adherence to both ethical and legal standards.

Several steps were necessary to extract the rPPG signal from a single video. First, the regions of interest (RoI) were extracted from the face. We extracted information from the forehead and cheeks using the pyVHR framework [28], which includes the software MediaPipe for the extraction of RoI from a human face [29]. The RoI extracted from every individual were composed of a total of 30 landmarks. Each landmark is a specific region of the face, represented by a number that indicates the location of that region. The landmarks 107, 66, 69, 109, 10, 338, 299, 296, 336, and 9 were extracted from the forehead; the landmarks 118, 119, 100, 126, 209, 49, 129, 203, 205, and 50 were extracted from the left cheek; and the landmarks 347, 348, 329, 355, 429, 279, 358, 423, 425, and 280 were extracted from the right cheek. Every landmark was composed of 30×30 pixels, and the average across the red, green, and blue (RGB) channels was computed for every landmark. The landmark numbers of each area represent approximately evenly spaced regions of that area.
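
The following is an illustrative sketch (not the authors' exact code) of this step: MediaPipe Face Mesh locates the landmark indices listed above, and the mean RGB value of a 30×30-pixel patch around each landmark is computed per frame. The helper name and patch handling are assumptions.

```python
import cv2
import numpy as np
import mediapipe as mp

FOREHEAD = [107, 66, 69, 109, 10, 338, 299, 296, 336, 9]
LEFT_CHEEK = [118, 119, 100, 126, 209, 49, 129, 203, 205, 50]
RIGHT_CHEEK = [347, 348, 329, 355, 429, 279, 358, 423, 425, 280]
LANDMARKS = FOREHEAD + LEFT_CHEEK + RIGHT_CHEEK

face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=False, max_num_faces=1)

def frame_rgb_means(frame_bgr, patch=30):
    """Return a (30, 3) array with the mean RGB value of each landmark patch."""
    h, w, _ = frame_bgr.shape
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = face_mesh.process(rgb)
    if not result.multi_face_landmarks:
        return np.full((len(LANDMARKS), 3), np.nan)  # no face detected in this frame
    lms = result.multi_face_landmarks[0].landmark
    means = []
    for idx in LANDMARKS:
        cx, cy = int(lms[idx].x * w), int(lms[idx].y * h)  # normalized -> pixel coords
        half = patch // 2
        roi = rgb[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
        means.append(roi.reshape(-1, 3).mean(axis=0))      # average over the patch
    return np.array(means)
```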

After all the landmarks were extracted, the RGB signals of each landmark were used as input for the algorithms CHROM, LGI-PPGI, POS, and ICA. These algorithms were chosen because of their effectiveness in separating the color information related to blood flow from the color information not related to blood flow, as well as their ability to extract PPG signals from facial videos. CHROM separates the color information by projecting it onto a set of chrominance basis vectors, LGI-PPGI derives a local representation that is robust to illumination changes, POS projects the RGB signals onto a plane orthogonal to the skin tone to isolate the pulsatile component, and ICA uses blind source separation to separate the PPG signal from the other sources of variation in the video. These methods were chosen based on their performance in previous studies and their ability to extract high-quality PPG signals from facial videos [20,23].
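
As an illustration of how such methods turn an RGB trace into a pulse signal, below is a minimal numpy sketch of the POS projection. The study itself used the pyVHR implementations of POS, CHROM, LGI, and ICA; the window length and epsilon terms here are assumptions.

```python
import numpy as np

def pos_rppg(rgb, fps=30.0, win_sec=1.6):
    """rgb: array of shape (T, 3) with the mean R, G, B values per frame."""
    T = rgb.shape[0]
    l = int(win_sec * fps)
    h = np.zeros(T)
    P = np.array([[0.0, 1.0, -1.0],
                  [-2.0, 1.0, 1.0]])                        # POS projection matrix
    for t in range(T - l + 1):
        c = rgb[t:t + l].T                                   # (3, l) sliding window
        cn = c / (c.mean(axis=1, keepdims=True) + 1e-9)      # temporal normalization
        s = P @ cn                                           # project onto the POS plane
        p = s[0] + (s[0].std() / (s[1].std() + 1e-9)) * s[1] # alpha tuning
        h[t:t + l] += p - p.mean()                           # overlap-add
    return h
```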

For the data processing, the signals used as rPPG are the outputs of the algorithms ICA, CHROM, LGI, and POS, and the cPPG signals were resampled to the same frame rate as the rPPG. First, detrending and bandpass filters were applied to both the rPPG and cPPG signals. The bandpass filter is a sixth-order Butterworth with cutoff frequencies of 0.65 and 4 Hz, a range chosen to filter out noise at both low and high frequencies. Next, low-variance rPPG signals were removed, and the remaining signals were segmented into non-overlapping windows of 10 s, followed by min-max normalization. We applied histogram equalization to the obtained spatiotemporal maps, which yielded a general improvement in the performance of the methods.
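
A hedged sketch of this preprocessing chain is shown below: detrending, a sixth-order Butterworth bandpass (0.65-4 Hz), segmentation into non-overlapping 10-s windows, and min-max normalization. The detrending method and the zero-phase filter implementation are assumptions, not the authors' stated choices.

```python
import numpy as np
from scipy import signal

def preprocess(sig, fs, low=0.65, high=4.0, win_sec=10):
    sig = signal.detrend(sig)                                        # remove slow trend
    sos = signal.butter(6, [low, high], btype="bandpass", fs=fs, output="sos")
    sig = signal.sosfiltfilt(sos, sig)                               # zero-phase bandpass
    n = int(win_sec * fs)
    windows = [sig[i:i + n] for i in range(0, len(sig) - n + 1, n)]  # non-overlapping 10-s windows
    # min-max normalize each window to [0, 1]
    return [(w - w.min()) / (w.max() - w.min() + 1e-9) for w in windows]
```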

Spectral analysis was performed on both the rPPG and cPPG signals by applying Welch's method to each window of the constructed rPPG and cPPG signals. The highest peak in the frequency domain was selected as the estimated HR; alternative methods such as autocorrelation were also tested but showed minimal differences in the absolute beats-per-minute error (ΔHR). Welch's method was deemed useful as it allowed heart rate evaluation in the frequency domain and demonstrated the predictive capability of each channel's rPPG signal.

The model was trained using data sourced from the PURE dataset. The input data contains information from 10 participants. Each participant was captured across 6 distinct videos, engaging in activities categorized as Steady, Talk, Slow Translation, Fast Translation, Small Rotation, and Medium Rotation. This accounts for a total of 60 videos, with an approximate average duration of 1 min. Each video was transformed to RGB signals. Then, every RGB set of signals representing a video was subdivided into 10-s fragments, with each fragment serving as a unit for training data. The dataset used to train the model contains a total of 339 such samples.

Because each fragment lasts 10 s and the frame rate is 30 fps, each sample is represented by three RGB signals of 300 time steps. The RGB signals, serving as training inputs, were transformed into four distinct signals through the application of the POS, CHROM, LGI, and ICA methods. Consequently, each 10-s segment yielded four transformed signals, which were used as input for the model. Before being fed to the model, the signals underwent the preprocessing described above. Then, a 5-fold cross-validation (CV) procedure was conducted, in which the dataset was partitioned into five folds, with approximately 80% of the data used for training and 20% for testing within each fold.
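
A sketch of this split, under the assumption that the training set is an array of 339 ten-second fragments with 300 time steps and the four method outputs (POS, CHROM, LGI, ICA) as channels, could look as follows. The arrays here are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(339, 300, 4)   # placeholder for the transformed rPPG inputs
y = np.random.rand(339, 300)      # placeholder for the resampled reference cPPG

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, X_test = X[train_idx], X[test_idx]   # ~80% / ~20% per fold
    y_train, y_test = y[train_idx], y[test_idx]
    # ... fit the model on (X_train, y_train) and evaluate on (X_test, y_test)
```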

The model's architecture was composed of four blocks of LSTM and dropout, followed by a dense layer, and is shown in Fig. 2. To reduce the number of features in each layer, the number of cells per block decreases from 90 to 1. The learning rate scheduler was ReduceLROnPlateau, and the optimizer was Adam [30]. Finally, the root mean squared error (RMSE) and the Pearson correlation coefficient (r) were used as the loss function.
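
A hypothetical Keras sketch of such an architecture is shown below: four LSTM + dropout blocks with cell counts decreasing from 90 to 1, a final dense layer, the Adam optimizer, and ReduceLROnPlateau. The intermediate cell counts, dropout rate, and the way RMSE and Pearson's r are combined into a single loss are assumptions, not the authors' published configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, callbacks

def combined_loss(y_true, y_pred):
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])
    rmse = tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))          # RMSE term
    xm = y_true - tf.reduce_mean(y_true)
    ym = y_pred - tf.reduce_mean(y_pred)
    r = tf.reduce_sum(xm * ym) / (tf.norm(xm) * tf.norm(ym) + 1e-9)     # Pearson r term
    return rmse + (1.0 - r)                                              # assumed combination

model = tf.keras.Sequential([
    tf.keras.Input(shape=(300, 4)),               # 10-s window, 4 method channels
    layers.LSTM(90, return_sequences=True), layers.Dropout(0.2),
    layers.LSTM(45, return_sequences=True), layers.Dropout(0.2),
    layers.LSTM(10, return_sequences=True), layers.Dropout(0.2),
    layers.LSTM(1, return_sequences=True), layers.Dropout(0.2),
    layers.Dense(1),                              # one rPPG value per time step
])
model.compile(optimizer="adam", loss=combined_loss)
reduce_lr = callbacks.ReduceLROnPlateau(monitor="loss", factor=0.5, patience=5)
# model.fit(X_train, y_train, epochs=..., callbacks=[reduce_lr])
```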

Fig. 2: The model architecture generates a remote photoplethysmogram (rPPG) signal from three regions of interest: the forehead (R1), left cheek (R2), and right cheek (R3). The average value from each region is calculated, and these averages are then combined to produce the overall rPPG signal. The model is composed of four blocks of LSTM and dropout, followed by a dense layer. The methods ICA, LGI, CHROM, and POS were used as input to the model. rPPG remote photoplethysmogram, RGB red, green, and blue channels, LSTM long short-term memory.

To evaluate the signals, we applied four criteria: Dynamic Time Warping (DTW), Pearson's r correlation coefficient, RMSE, and ΔHR. We computed each criterion for every window in each video and then averaged the values over all windows to obtain the final results. This allowed us to analyze the results of every model from different points of view.

DTW [31] is a useful algorithm for measuring the similarity between two time series, especially when they have varying speeds and lengths. DTW is also relevant in this case because the rPPG and its ground truth may sometimes be misaligned, so metrics that rely on matching timestamps are less appropriate. The metric was implemented using the Python package DTAIDistance [32].
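
A small usage example of the DTAIDistance package on one 10-s window is shown below; only the metric call comes from the package, and the signals are placeholders standing in for an rPPG window and the corresponding cPPG window.

```python
import numpy as np
from dtaidistance import dtw

t = np.linspace(0, 10, 300)
rppg_window = np.sin(2 * np.pi * 1.2 * t)           # placeholder rPPG (~72 BPM)
cppg_window = np.sin(2 * np.pi * 1.2 * (t - 0.1))   # slightly shifted reference

distance = dtw.distance(rppg_window, cppg_window)   # lower = more similar
print(f"DTW distance: {distance:.3f}")
```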

The equation below shows how the r coefficient calculates the strength of the relationship between rPPG and cPPG.

$$r=\frac{\sum_{i=1}^{N}(x_{i}-\hat{x})(y_{i}-\hat{y})}{\sqrt{\sum_{i=1}^{N}(x_{i}-\hat{x})^{2}}\,\sqrt{\sum_{i=1}^{N}(y_{i}-\hat{y})^{2}}}$$

(1)

In this equation, \(x_i\) and \(y_i\) are the values of the rPPG and PPG signals at time step i, respectively; \(\hat{x}\) and \(\hat{y}\) are their mean values; and N is the number of values in the discrete signals.

The equation below shows how RMSE calculates the prediction error, which is the difference between the ground truth values and the extracted rPPG signals.

$$\mathrm{RMSE}=\sqrt{\frac{\sum_{i=1}^{N}(x_{i}-y_{i})^{2}}{N}}$$

(2)

In this equation, N is the number of values, and \(x_i\) and \(y_i\) are the values of the rPPG and contact PPG signals at time step i, respectively.

HR was estimated using Welch's method, which computes the power spectral density of a signal; the highest peak in the frequency domain was taken as the HR. The peak was searched within a range of 39-240 beats per minute (BPM), which is the expected range of human heart rates. ΔHR is obtained as the absolute difference between the HR estimated from rPPG and the HR estimated from cPPG.
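
The sketch below illustrates this estimation for a single 10-s window: Welch's power spectral density is computed, the highest peak within 39-240 BPM is converted to a heart rate, and ΔHR is the absolute difference between the rPPG- and cPPG-derived estimates. The Welch parameters (segment length, overlap) are assumptions.

```python
import numpy as np
from scipy.signal import welch

def estimate_hr(window, fs, bpm_range=(39, 240)):
    freqs, psd = welch(window, fs=fs, nperseg=min(len(window), 256))
    lo, hi = bpm_range[0] / 60.0, bpm_range[1] / 60.0     # BPM range in Hz
    band = (freqs >= lo) & (freqs <= hi)
    peak_freq = freqs[band][np.argmax(psd[band])]          # highest spectral peak
    return peak_freq * 60.0                                # Hz -> beats per minute

def delta_hr(rppg_window, cppg_window, fs):
    return abs(estimate_hr(rppg_window, fs) - estimate_hr(cppg_window, fs))
```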

To evaluate the models' performance, we applied non-parametric statistical tests, which make fewer assumptions about the data distribution than parametric ones. Some comparisons involved small sample sizes, such as those with a limited number of subjects.

The Friedman test [33] is appropriate for this study because it compares three or more related groups, where every group is represented by a model. The general procedure is to apply the Friedman test to each group of results; if the p-value is significant, the groups do not all perform equally, and the Nemenyi test [34] is then performed to compare the methods pairwise by calculating the difference in average ranking values and comparing it against a critical distance (CD). The Nemenyi test helps to identify which methods are similar or different in terms of their average ranks. The Bonferroni correction was applied for multiple-comparison correction.
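
An illustrative version of this procedure is sketched below: a Friedman test across the methods, followed by the Nemenyi post-hoc test when the Friedman p-value is significant. The use of the scikit-posthocs package is an assumption, and the per-subject scores are placeholders.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

rng = np.random.default_rng(0)
# rows: subjects/videos, columns: methods (e.g., ICA, CHROM, LGI, POS, ML model)
scores = rng.random((10, 5))

stat, p_value = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
print(f"Friedman statistic={stat:.3f}, p={p_value:.4f}")

if p_value < 0.05:
    # pairwise Nemenyi comparisons on the same (blocks x groups) matrix
    pairwise_p = sp.posthoc_nemenyi_friedman(scores)
    print(pairwise_p)
```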
