


Title:
SYSTEM AND APPARATUS FOR FACE ANTI-SPOOFING VIA AUXILIARY SUPERVISION
Document Type and Number:
WIPO Patent Application WO/2019/152983
Kind Code:
A2
Abstract:
A unified convolutional neural network (CNN) - recurrent neural network (RNN) architecture is provided, which may be trained to distinguish between a presentation attack (PA) and a live face, for a given sequence of image frames (e.g., video). The CNN portion of the network may produce an estimated depth map and an estimated feature map for each image frame in the sequence, while the RNN portion of the network may produce an estimated remote Photoplethysmography (rPPG) signal. A non-rigid registration layer may be coupled between the CNN and the RNN and may frontalize the feature maps using 3D face shape data derived from the sequence of image frames.

Inventors:
LIU XIAOMING (US)
LIU YAOJIE (US)
JOURABLOO AMIN (US)
Application Number:
PCT/US2019/016638
Publication Date:
August 08, 2019
Filing Date:
February 05, 2019
Assignee:
UNIV MICHIGAN STATE (US)
International Classes:
G06T7/20
Attorney, Agent or Firm:
MILHOLLIN, Andrew C. (US)
Claims:
CLAIMS

1. A system comprising:

a computer processor;

a memory device coupled to the computer processor, the memory device storing computer-readable instructions which, when executed by the computer processor, cause the computer processor to:

receive a plurality of image frames;

with a convolutional neural network (CNN) implemented by the computer processor, produce estimated depth maps for the plurality of image frames;

with a recurrent neural network (RNN) implemented by the computer processor, produce an estimated remote photoplethysmography (rPPG) signal for the plurality of image frames;

determine a classification score for the plurality of image frames based on the estimated rPPG signal and a selected estimated depth map of the estimated depth maps; and

classify the plurality of image frames as corresponding to a presentation attack based on the classification score.

2. The system of claim 1, wherein the computer-readable instructions, when executed by the computer processor, further cause the computer processor to:

with the CNN, produce estimated feature maps for the plurality of image frames.

3. The system of claim 2, wherein the computer-readable instructions, when executed by the computer processor, further cause the computer processor to:

with a three-dimensional (3D) face alignment engine implemented by the computer processor, produce estimated 3D face shapes based on the plurality of image frames.

4. The system of claim 3, wherein the computer-readable instructions, when executed by the computer processor, further cause the computer processor to: with a non-rigid registration layer implemented by the computer processor, receive the estimated feature maps;

with the non-rigid registration layer, receive the estimated depth maps;

with the non-rigid registration layer, receive the estimated 3D face shapes; and with the non-rigid registration layer, produce frontalized feature maps based on the estimated 3D face shapes, the estimated depth maps, and the estimated feature maps.

5. The system of claim 4, wherein the estimated rPPG signal is produced based on the frontalized feature maps.

6. The system of claim 5, wherein the CNN comprises:

a plurality of groups of convolutional layers, wherein a first group of convolutional layers of the plurality of groups of convolutional layers receives the plurality of image frames;

a plurality of pooling layers, wherein each pooling layer of the plurality of pooling layers is connected to at least one group of convolutional layers of the plurality of groups of convolutional layers;

a concatenation layer connected to outputs of the plurality of pooling layers;

a first branch of convolutional layers connected to an output of the concatenation layer, wherein the first branch of convolutional layers outputs the estimated feature maps; and

a second branch of convolutional layers connected to the output of the concatenation layer, wherein the second branch of convolutional layers outputs the estimated depth maps.

7. The system of claim 5, wherein the RNN comprises:

a long short-term memory (LSTM) network connected to the non-rigid registration layer;

a fully-connected (FC) layer connected to an output of the LSTM network; and a fast Fourier transform (FFT) layer connected to an output of the FC layer, wherein the FFT layer outputs the estimated rPPG signal.

8. A method comprising:

with a processor, receiving a plurality of image frames;

with a convolutional neural network (CNN) implemented by the processor, producing a sequence of estimated depth maps for the plurality of image frames;

with a recurrent neural network (RNN) implemented by the processor, producing an estimated rPPG signal for the plurality of image frames;

with the processor, determining a classification score for the plurality of image frames based on the estimated rPPG signal and a selected estimated depth map of the sequence of estimated depth maps; and

with the processor, classifying the plurality of image frames as corresponding to a presentation attack based on the classification score.

9. The method of claim 8, further comprising:

with the CNN, producing a sequence of estimated feature maps for the plurality of image frames.

10. The method of claim 9, further comprising:

with a three-dimensional (3D) face alignment engine implemented by the processor, producing a sequence of estimated 3D face shapes based on the plurality of image frames.

11. The method of claim 10, further comprising:

with a non-rigid registration layer implemented by the processor, producing a sequence of frontalized feature maps based on the sequence of estimated 3D face shapes, the sequence of estimated depth maps, and the sequence of estimated feature maps.

12. The method of claim 11, wherein producing a frontalized feature map of the sequence of frontalized feature maps comprises: with the non-rigid registration layer, applying a threshold to an estimated depth map of the sequence of estimated depth maps to produce a binary depth map, such that first pixels of the estimated depth map having first pixel values that exceed or are equal to the threshold are assigned first pixel values of 1 in the binary depth map, and such that second pixels of the estimated depth map having second pixel values that are less than the threshold are assigned second pixel values of 0 in the binary depth map;

with the non-rigid registration layer, calculating masked activation values by determining an inner product of the binary depth map and an estimated feature map of the sequence of estimated feature maps; and

with the non-rigid registration layer, producing the frontalized feature map by frontalizing the masked activation values based on the sequence of estimated 3D face shapes.

13. The method of claim 12, wherein producing the estimated rPPG signal comprises: with the RNN, producing the estimated rPPG signal based on the sequence of frontalized feature maps.

14. The method of claim 13, wherein producing the sequence of estimated depth maps and producing the sequence of estimated feature maps comprises:

with a first group of convolutional layers, receiving the plurality of image frames and producing a first output;

with a first pooling layer, receiving the first output from the first group of convolutional layers and producing a second output;

with a second group of convolutional layers, receiving the second output from the first pooling layer and producing a third output;

with a second pooling layer, receiving the third output from the second group of convolutional layers and producing a fourth output;

with a third group of convolutional layers, receiving the fourth output from the second pooling layer and producing a fifth output; with a third pooling layer, receiving the fifth output from the third group of convolutional layers and producing a sixth output;

with a concatenation layer, receiving the second output from the first pooling layer, the fourth output from the second pooling layer, and the sixth output from the third pooling layer and producing a seventh output;

with a first branch of convolutional layers, receiving the seventh output of the concatenation layer and producing the sequence of estimated feature maps; and

with a second branch of convolutional layers, receiving the seventh output of the concatenation layer and producing the sequence of estimated depth maps.

15. The method of claim 13, wherein producing the estimated rPPG signal comprises: with a long short-term memory (LSTM) network, receiving the sequence of frontalized feature maps from the non-rigid registration layer and producing a first output; with a fully-connected (FC) layer, receiving the first output from the LSTM network and producing a second output; and

with a fast Fourier transform (FFT) layer, receiving the second output from the FC layer and producing the estimated rPPG signal.

16. A system comprising:

a computer processor configured to implement a neural network architecture, the neural network architecture comprising:

a convolutional neural network (CNN) configured to receive a sequence of image frames and produce a sequence of estimated depth maps; and

a recurrent neural network (RNN) configured to produce an estimated remote photoplethysmography (rPPG) signal for the sequence of image frames.

17. The system of claim 16, wherein the CNN comprises:

a plurality of groups of convolutional layers, wherein a first group of convolutional layers of the plurality of groups of convolutional layers is configured to receive a plurality of image frames; a plurality of pooling layers, wherein each pooling layer of the plurality of pooling layers is connected to at least one group of convolutional layers of the plurality of groups of convolutional layers;

a concatenation layer connected to outputs of the plurality of pooling layers;

a first branch of convolutional layers connected to an output of the concatenation layer, wherein the first branch of convolutional layers outputs a sequence of estimated feature maps; and

a second branch of convolutional layers connected to the output of the concatenation layer, wherein the second branch of convolutional layers outputs the sequence of estimated depth maps.

18. The system of claim 17, wherein the neural network architecture further comprises: a non-rigid registration layer configured to receive the sequence of estimated depth maps and the sequence of estimated feature maps from the CNN and to produce a sequence of frontalized feature maps.

19. The system of claim 18, wherein the RNN comprises: a long short-term memory (LSTM) network configured to receive the sequence of frontalized feature maps from the non-rigid registration layer;

a fully-connected (FC) layer configured to receive an output from the LSTM network; and

a fast Fourier transform (FFT) layer configured to receive an output from the FC layer, wherein the FFT layer outputs the estimated rPPG signal.

20. The system of claim 18, further comprising: a three-dimensional (3D) face alignment engine configured to produce estimated 3D face shapes based on the sequence of image frames, wherein the non-rigid registration layer is configured to receive the estimated 3D face shapes from the 3D face alignment engine, and wherein the non-rigid registration layer is configured to produce the sequence of frontalized feature maps further based on the estimated 3D face shapes.

Description:
SYSTEM AND APPARATUS FOR FACE ANTI-SPOOFING VIA AUXILIARY SUPERVISION

CROSS REFERENCE TO RELATED APPLICATIONS

[1] This application claims priority to U.S. Provisional Application No. 62/626,486, filed February 5, 2018, which is incorporated by reference in its entirety for all purposes.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

[2] This invention was made with government support under 2017-17020200004, awarded by the Intelligence Advanced Research Projects Activity (IARPA). The government has certain rights in the invention.

BACKGROUND

[3] Biometric systems are widely used with applications such as phone unlocking, access control, and transportation security. The face, as a unique and easily accessible part of human beings, is a popular biometric modality. While face recognition systems serve as a verification portal of personal devices, attackers may present face spoofs (i.e., presentation attacks (PA)) to these systems in attempts to be authenticated as the genuine user. Face recognition PAs may include printing an image of the face on paper (print attack), replaying a video of the face on a digital device (replay attack), or wearing a mask corresponding to the face (mask attack). To counteract such PA methods, face anti-spoofing methods have been developed to detect a PA prior to a face image being recognized for verification. These anti-spoofing methods are vital to ensuring that face recognition systems are robust to PAs and are secure. However, many traditional face anti-spoofing methods that tend to address the face anti-spoofing problem as a binary classification problem may have poor generalization across various poses, illuminations, and expressions. As a result, such traditional face anti-spoofing methods may fail to distinguish spoof versus live faces under varying conditions. Additionally, traditional binary classification may only generate a binary decision (spoof vs. real) without explanation or rationale for the decision (e.g., without identifying the specific identified spoof characteristics or patterns that lead to the classification of a series of captured face images as a PA).

[4] In light of the above, there remains a need for improved systems and methods for face anti-spoofing that may be used in conjunction with face recognition and verification systems.

SUMMARY

[5] The present disclosure generally relates to face anti-spoofing technology. More specifically, the present disclosure is directed to systems and methods that provide a solution for distinguishing between live faces and presentation attacks (PAs). In one embodiment, these systems and methods may utilize a deep learning approach to classify a sequence of captured images as corresponding to a live face or to a PA based on both spatial and temporal auxiliary information. A neural network architecture including a convolutional neural network (CNN), a non-rigid registration layer, and a recurrent neural network (RNN) may receive a sequence of image frames of a face. The CNN may produce a sequence of estimated depth maps and a sequence of estimated feature maps based on the sequence of image frames. The non-rigid registration layer may produce a sequence of frontalized feature maps based on the estimated feature maps and 3D face shape data corresponding to the sequence of image frames and produced by a 3D face alignment engine. The RNN may output an estimated remote photoplethysmography (rPPG) signal based on the sequence of frontalized feature maps received from the non-rigid registration layer. A computer processor may implement the neural network architecture, may determine a classification score for the sequence of image frames based on the estimated rPPG signal and a selected estimated depth map of the sequence of estimated depth maps, and may classify the sequence of image frames as corresponding to a presentation attack based on the classification score.

BRIEF DESCRIPTION OF THE DRAWINGS

[6] The present invention will hereafter be described with reference to the accompanying drawings, wherein like reference numerals denote like elements. The patent or application file contains at least one drawing executed in color.

[7] FIG. 1 is an illustrative block diagram showing a face recognition system that may implement face anti-spoofing, in accordance with aspects of the present disclosure.

[8] FIG. 2 is an illustrative comparison of using binary supervision vs. auxiliary supervision as a model for classifying a sequence of images as a live face or a presentation attack (PA), in accordance with aspects of the present disclosure.

[9] FIG. 3 is an illustrative block diagram showing a high-level representation of a deep neural network architecture that may be used to classify a sequence of images as corresponding to a live face or to a PA, in accordance with aspects of the present disclosure.

[10] FIG. 4A is an illustrative block diagram showing a lower-level representation of a deep neural network architecture that may be trained and used to classify a sequence of images as corresponding to a live face or to a PA, in accordance with the present disclosure.

[11] FIG. 4B is an illustrative block diagram showing inputs to and operations performed by a non-rigid registration layer in the deep neural network architecture of FIG. 4A, in accordance with the present disclosure.

[12] FIG. 5 is an illustrative process flow chart providing a process by which a sequence of image frames is received and processed by a CNN-RNN neural network architecture to determine whether the sequence of image frames corresponds to a PA or a live face and, optionally, to adjust the CNN-RNN neural network during training, in accordance with the present disclosure.

[13] FIG. 6 is an illustrative process flow providing a process by which an image frame may be processed by a CNN in a CNN-RNN neural network architecture to produce an estimated feature map and an estimated depth map for the image frame, in accordance with aspects of the present disclosure.

[14] FIG. 7 is an illustrative process flow chart providing a process by which a sequence of frontalized feature maps may be processed by a RNN in a CNN-RNN neural network architecture to produce an estimated remote photoplethysmography (rPPG) signal, in accordance with aspects of the present disclosure.

DETAILED DESCRIPTION

[15] The present disclosure relates to systems and methods for face anti-spoofing via the classification of a sequence of images as corresponding to a live face or to a presentation attack (PA).

[16] FIG. 1 shows a block diagram of an example system 100 (e.g., a face recognition system). As an illustrative, non-limiting example, system 100 could be a mobile electronic device (e.g., a mobile phone device, a tablet device, or any other portable electronic device) or may be implemented by a desktop or laptop computer.

[17] System 100 includes a processor 102 (e.g., a computer processor), a memory 104, a communications interface 106, an optional battery and power management module 108, and an image sensor 110. Some or all of these components may communicate over a bus 114. Although bus 114 is illustrated here as a single bus, it may instead be implemented as one or more busses, bridges, or other communication paths used to interconnect components of system 100. Memory 104 may be a non-transitory computer readable storage medium (e.g., read-only memory (ROM), random access memory (RAM), flash memory, etc.). Optional battery and power management module 108 may provide power to components of system 100 in a controlled manner, and may be optionally omitted in embodiments where system 100 is connected to a desktop or laptop computer (e.g., as a peripheral device). Image sensor 110 may include an optical image sensor, such as an RGB image sensor, capable of capturing images and video.

[18] Processor 102 may execute instructions stored in memory 104 to process and classify a sequence of images (e.g., image frames) captured by image sensor 110 as corresponding either to a live face or to a PA. If the sequence of images is classified as corresponding to a live face, processor 102 may subsequently perform facial recognition and verification on the sequence of images. Otherwise, if the sequence of images is classified as a PA, this facial recognition and verification may not be performed and, optionally, an alert may be generated by processor 102 indicating that a PA attempt has been detected. Additionally, processor 102 may execute instructions for implementing a 3D face alignment engine (e.g., 3D face alignment engine 403, FIG. 4B), which receives image frames and produces 3D face shapes corresponding to faces in those image frames.

[19] In some embodiments, image sensor 110 may be external to system 100 and may provide data (e.g., images) to system 100 via a connection to communications interface 106. The processing and classification performed by processor 102 may, for example, be performed using a combined convolutional neural network - recurrent neural network (CNN-RNN), as is described in greater detail below in connection with FIGS. 3-6.

[20] While conventional systems for face anti-spoofing to protect against PAs presently exist, these systems tend to regard the face anti-spoofing problem as a binary (real vs. spoof/PA) classification problem. There are two main issues in developing deep anti-spoofing models with only binary supervision. First, there are different levels of image degradation, namely spoof patterns, between an image of a spoof face and an image of a live face. These patterns include skin detail loss, color distortion, moire patterning, shape deformation, and spoof artifacts (e.g., reflection). A conventional convolutional neural network (CNN) that uses softmax loss might discover a variety of cues that can separate the two classes (live vs. spoof/PA), such as a screen bezel, but may not identify the actual spoof patterns. Thus, when those discovered cues are not present during testing, these binary models would fail to distinguish a PA versus a live face, indicating poor generalization of the model. Second, during testing, models learned with binary supervision only generate a binary decision without explanation or rationale for the decision.

[21] To address the aforementioned issues, a deep learning model may be used that uses supervision from both spatial and temporal auxiliary information, rather than binary supervision, for the purpose of robustly detecting face PA from, for example, a face video. This auxiliary information may be acquired based on domain knowledge about the key differences between live and spoof faces. These differences include two perspectives: spatial and temporal. From the spatial perspective, real faces have face-like depth (e.g., the nose is closer to the camera than the cheek in frontal-view faces), while faces in print or replay attacks have flat or planar depth maps (e.g., all pixels on the image of a paper are perceived by a camera as having the same depth). Hence, depth data may be utilized as auxiliary information to supervise both live faces and spoof faces (e.g., used in a PA). From the temporal perspective, it has been shown that normal remote photoplethysmography (rPPG) signals (i.e., heart pulse signals) are detectable from live face videos, but not in spoof face videos (replay attack). Therefore rPPG signals may be provided as auxiliary information for supervision, which may guide the neural network that implements the deep learning model to learn to distinguish between real and spoof video (e.g., sequences of image frames).

[22] The system 100 may be used in a variety of applications in which facial recognition is used in order to recognize and alert users to presentation attacks and/or other forms of face spoofing. Such applications may include, but are not limited to, security systems, video doorbells, and cell phone (e.g., smart phone) camera applications (e.g., video conferencing applications, facial-recognition-based phone unlocking applications).

[23] FIG. 2 shows a comparison of the results of using two different deep learning models (e.g., which may each be implemented using neural networks) for performing face anti-spoofing. For each model, a captured color image of a live face 206 and a captured color image of a PA 208 (e.g., corresponding to the presentation of a photograph, rather than a live face) are separately assessed. As shown, a conventional binary supervision model 202 provides a binary 0 or 1 to indicate whether a given input image or sequence of images corresponds to a live face or to a PA. The output of binary supervision model 202 is simplistic in that it consists only of a binary 0 or 1. When considering this binary output, it may be difficult to draw conclusions regarding the basis upon which binary supervision model 202 has classified live face 206 as a "0" and PA 208 as a "1".

[24] In contrast, an auxiliary supervision model 204 generates a depth map and an rPPG signal for an input sequence of images, which are subsequently used as a basis for determining whether the sequence of images corresponds to a PA or a live face. By respectively comparing the depth map and rPPG signal of live face 206 to the depth map and rPPG signal of PA 208, the basis upon which auxiliary supervision model 204 classifies a live face versus a PA can be understood (e.g., because the depth map of PA 208 is substantially flat compared to the depth map of live face 206, and because the rPPG signal corresponding to PA 208 is substantially flat and inactive compared to the rPPG signal corresponding to live face 206). In this way, the classification basis of the auxiliary supervision model 204 is readily explainable compared to binary supervision model 202.

[25] As indicated above, neural networks (i.e., artificial neural networks) may be used to implement face anti-spoofing models, such as auxiliary supervision model 204. A neural network is a computational model that is used to approximate functions that are generally unknown. A neural network generally includes multiple layers, with each layer including multiple nodes. Each node of a given layer is connected to at least one other node of another layer, and each connection may be assigned a weight. Additionally, a function (commonly referred to as an activation function) may be applied to the input of a given node to produce an output. Activation functions may, for example, be linear, non-linear, continuously differentiable, monotonic, or any other applicable type of function.

[26] A convolutional neural network (CNN) (sometimes referred to herein in the context of a CNN processing module that implements a CNN model) is a type of deep, feed-forward artificial neural network that typically includes an input layer, an output layer, and multiple hidden layers. These hidden layers typically include convolutional layers, pooling layers, fully connected layers, and normalization layers. A convolutional layer applies a convolution operation as an activation function to its inputs. A pooling layer, for each neuron cluster of multiple neuron clusters of a preceding layer, combines outputs of that neuron cluster into the output of a single neuron of the pooling layer. This combination may, for example, use the maximum value of a cluster of neurons as the output of the corresponding single neuron of the pooling layer (referred to as max pooling), or may alternatively use the average of the values of a cluster of neurons as the output (referred to as average pooling). A fully connected layer provides a 1-to-1 connection between each neuron of a preceding layer and each neuron of the fully connected layer, and may apply an activation function at these nodes if desired. The connections between each layer of a CNN may be weighted. For example, the outputs of a given layer of a CNN may be multiplied by predetermined weight values. These weight values may be adjusted during training of the CNN.
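As an illustrative, non-limiting example, the following Python sketch shows the max pooling and average pooling operations described above applied to 2x2 neuron clusters; the 2x2 cluster size and the sample activation values are chosen purely for illustration.

```python
import numpy as np

def pool2x2(activations, mode="max"):
    """Pool non-overlapping 2x2 clusters of a 2-D activation map.

    mode="max" keeps the maximum of each cluster (max pooling);
    mode="average" keeps the mean of each cluster (average pooling).
    """
    h, w = activations.shape
    clusters = activations[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return clusters.max(axis=(1, 3))
    return clusters.mean(axis=(1, 3))

example = np.array([[1., 2., 0., 1.],
                    [3., 4., 2., 2.],
                    [0., 1., 5., 6.],
                    [1., 1., 7., 8.]])
print(pool2x2(example, "max"))      # [[4. 2.] [1. 8.]]
print(pool2x2(example, "average"))  # [[2.5  1.25] [0.75 6.5 ]]
```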

[27] A recurrent neural network (RNN) (sometimes referred to herein in the context of a RNN processing module that implements a RNN model) is a type of artificial neural network in which connections between units form a directed cycle. A long short-term memory (LSTM) network is a type of RNN that includes an LSTM unit as a building block. An LSTM unit may include four neurons: a memory cell, an input gate, an output gate, and a forget gate. The input gate controls the extent to which a new value flows into the memory cell, the forget gate controls the extent to which a value remains in the memory cell, and the output gate controls the extent to which the value in the memory cell is used to compute the output activation (e.g., activation function) of the LSTM unit. The inputs and outputs of each neuron within an LSTM unit may be assigned weights, which are adjusted during training of the LSTM. LSTM networks may be beneficial due to their ability to handle long-term dependencies because information input to an LSTM network is (at least partially) retained.

[28] Turning now to FIG. 3, a block diagram illustrating a unified CNN-RNN network architecture 300 is shown, which may be trained to perform classification of input sequences of image frames as corresponding to either a live face or to a PA. As shown, a sequence of N image frames 302 may be input to a CNN 304, with image frame 302-1 representing the first image frame in the sequence and image frame 302-N representing the last image frame in the sequence. As shown, the image frames 302 may show a color or grayscale image of a face that is rotated at various different angles across the different image frames. For example, image frame 302-1 shows a face that is oriented substantially toward the center of the field of view of the image capture device (i.e., facing the viewer), while image frame 302-(N-1) shows a face that is angled roughly 45 degrees to the left of the center of the field of view of the image capture device (i.e., the person in the image frame has turned their head slightly), and image frame 302-N shows a face that is angled roughly 90 degrees to the left of the center of the field of view of the image capture device. CNN 304 may produce N estimated depth maps 308 and N estimated frontalized feature maps 306, which are then provided to a RNN 310. Each of estimated depth maps 308 is a representation of the 3D shape of the face in a respective corresponding one of image frames 302. Each of estimated frontalized feature maps 306 is a representation showing the locations of activations of CNN 304 on the normalized frontal 3D face of one input image frame 302. The estimated depth maps 308 may include color or grayscale heat maps, the color of which varies proportionally with estimated depth of the faces shown in the corresponding image frames 302. The estimated frontalized feature maps may include color or grayscale heat maps, in which the faces of the image frames 302 are rotated to be forward facing (i.e., substantially toward the viewer) and in which areas of the image frame 302 that correspond to activations of the CNN 304 are distinguishable by magnitude based on color. These estimated frontalized feature maps 306 are "frontalized" by a non-rigid registration layer coupled between CNN 304 and RNN 310, as will be explained in more detail in connection with FIG. 4B, below. RNN 310 receives all N estimated depth maps 308 and estimated frontalized feature maps 306 and produces a remote photoplethysmography (rPPG) signal 312. A final classification score is then determined based on rPPG signal 312 and on the last depth map 308-N. The final classification score may, for example, be determined according to the following equation:

score = || f ||_2^2 + λ || D ||_2^2,

[29] where f represents the rPPG signal vector, D represents the estimated depth map vector (e.g., the last depth map 308-N), and λ is a constant weight for combining the two responses of the network. Depending on whether the final classification score exceeds a predetermined score threshold value, CNN-RNN network architecture 300 produces an output 314 indicating either that the sequence of images 302 corresponds to a live face or to a spoof/PA.
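As an illustrative, non-limiting example, the following Python sketch shows how such a combined classification score and the subsequent thresholding might be computed. The function names, the default value of lambda_weight, and the score threshold are illustrative assumptions rather than values taken from this disclosure.

```python
import numpy as np

def classification_score(rppg_signal, depth_map, lambda_weight=0.015):
    """Combine the two network responses into a single liveness score.

    rppg_signal   : 1-D array, estimated rPPG frequency response f
    depth_map     : 2-D array, estimated depth map D for the last frame
    lambda_weight : constant weight balancing the two responses (illustrative value)
    """
    return (np.linalg.norm(rppg_signal) ** 2
            + lambda_weight * np.linalg.norm(depth_map) ** 2)

def classify(rppg_signal, depth_map, score_threshold, lambda_weight=0.015):
    """Return "live" if the score exceeds the threshold, else "spoof/PA"."""
    score = classification_score(rppg_signal, depth_map, lambda_weight)
    return "live" if score > score_threshold else "spoof/PA"
```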

[30] FIG. 4A shows a detailed block diagram of a CNN-RNN network architecture 400 (e.g., which may correspond to CNN-RNN network architecture 300 of FIG. 3). As shown, CNN-RNN network architecture 400 includes a CNN 406 (sometimes referred to herein as a CNN processing module 406), a non-rigid registration layer 438, and a RNN 408 (sometimes referred to herein as a RNN processing module 408). CNN 406 receives a sequence of image frames 402 that include color or grayscale images that may include one or more faces, and produces an estimated feature map 434 and an estimated depth map 432, which are provided to non-rigid registration layer 438. Non-rigid registration layer 438 then provides a sequence of estimated frontalized feature maps to RNN 408. RNN 408 then produces an estimated rPPG signal 446 based on the sequence of estimated frontalized feature maps.

[31] CNN 406 may be a fully convolutional network (FCN). CNN 406 includes a first convolutional layer 410, which may include 64 filters, followed by multiple convolutional blocks 412, 414, and 416, a concatenation layer 426, and branches 428 and 430. Each of convolutional blocks 412, 414, and 416 may include three convolutional layers 418, 420, and 422, having 128, 196, and 128 filters, respectively, followed by a pooling layer 424. The outputs of the pooling layers 424-1 and 424-2 of convolutional blocks 412 and 414 are respectively provided to both the next convolutional block in the sequence (414 and 416, respectively) and to concatenation layer 426 via bypass connections. In some embodiments, each of the convolutional layers shown may represent three separate layers, including a convolutional layer followed by an exponential linear layer and a batch normalization layer. As an example, the size of all convolutional filters applied by the convolutional layers in CNN 406 may be 3x3 and the stride of these convolutional filters may be 1. Before the responses of the network at different layers are concatenated, a resizing layer may be used to resize the responses such that each response is the same size. Pooling layer 424-3 of convolutional block 416 provides an output only to concatenation layer 426. In some embodiments, each of pooling layers 424-1, 424-2, and 424-3 may be a max pooling layer having a filter size of 3x3 and a stride of 2. The bypass connections from the outputs of pooling layers 424-1 and 424-2 may allow the network to utilize extracted features from layers with different depths and may help the network to converge faster by avoiding the gradient vanishing/exploding problem. The concatenation layer 426 may concatenate the outputs of each of pooling layers 424 to provide an output to branches 428 and 430. Branch 428 may include three convolutional layers and may be trained to estimate a depth map 432 for a given input image frame I ∈ R^(256x256) of the sequence of image frames 402. Branch 430 may include three convolutional layers and may be trained to estimate a feature map 434 for the given input image frame. Branch 428 includes three convolutional layers which include 128, 64, and 1 filters, respectively. Similarly, branch 430 includes three convolutional layers which include 128, 3, and 1 filters, respectively. It should be noted that the number of filters used in the convolutional layers of CNN 406 may be optimized for a given application. During training of CNN 406, estimated depth map 432 may be fed into a depth map loss function 436, which is minimized by the training process.
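As an illustrative, non-limiting example, the following PyTorch sketch approximates the CNN structure described in paragraph [31] (stem convolution, three pooled convolutional blocks with bypass connections, concatenation of resized pooled responses, and two output branches). The padding choices, the interpolation mode used for resizing, the three-channel input, and the module names are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_unit(in_ch, out_ch):
    # Each "convolutional layer" in paragraph [31] is a 3x3, stride-1 convolution
    # followed by an exponential linear unit and batch normalization.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ELU(),
        nn.BatchNorm2d(out_ch),
    )

class ConvBlock(nn.Module):
    """Three conv units (128, 196, 128 filters) followed by 3x3, stride-2 max pooling."""
    def __init__(self, in_ch):
        super().__init__()
        self.convs = nn.Sequential(conv_unit(in_ch, 128),
                                   conv_unit(128, 196),
                                   conv_unit(196, 128))
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return self.pool(self.convs(x))

class AntiSpoofCNN(nn.Module):
    """Sketch of CNN 406: stem conv, three pooled blocks, concatenation of the
    three pooled responses (resized to a common size), and two output branches."""
    def __init__(self):
        super().__init__()
        self.stem = conv_unit(3, 64)            # first convolutional layer, 64 filters
        self.block1 = ConvBlock(64)
        self.block2 = ConvBlock(128)
        self.block3 = ConvBlock(128)
        # Depth branch (128, 64, 1 filters) and feature-map branch (128, 3, 1 filters).
        self.depth_branch = nn.Sequential(conv_unit(3 * 128, 128), conv_unit(128, 64),
                                          nn.Conv2d(64, 1, 3, padding=1))
        self.feature_branch = nn.Sequential(conv_unit(3 * 128, 128), conv_unit(128, 3),
                                            nn.Conv2d(3, 1, 3, padding=1))

    def forward(self, frame):                   # frame: (B, 3, 256, 256)
        x1 = self.block1(self.stem(frame))      # bypass connection from pooling 424-1
        x2 = self.block2(x1)                    # bypass connection from pooling 424-2
        x3 = self.block3(x2)
        # Resize all pooled responses to the size of the deepest one, then concatenate.
        size = x3.shape[-2:]
        cat = torch.cat([F.interpolate(x1, size), F.interpolate(x2, size), x3], dim=1)
        depth_map = self.depth_branch(cat)      # estimated depth map (32x32 here)
        feature_map = self.feature_branch(cat)  # estimated feature map
        return depth_map, feature_map
```

With a 256x256 input and three stride-2 pooling stages, the concatenated response and both outputs are 32x32, which is consistent with the 32x32 maps used by the non-rigid registration layer below.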

[32] The depth map 432 and feature map 434 are then fed out of CNN 406 and into non-rigid registration layer 438, along with a sequence of 3D face shapes 404, which may be generated by applying dense face alignment (DeFA) methods to the sequence of image frames 402. Each 3D face shape in the sequence of 3D face shapes 404 corresponds to the shape of a respective face shown in a corresponding image frame of image frames 402 from which that 3D face shape is derived. A detailed description of these DeFA methods is provided below in connection with methods for training CNN-RNN network architecture 400.

[33] A detailed block diagram of non-rigid registration layer 438 is shown in FIG. 4B. Non-rigid registration layer 438 takes as inputs a face shape 404, S, from the sequence of 3D face shapes 404 (e.g., estimated 3D face shapes), a mean face shape 441, S0, generated from the sequence of 3D face shapes 404, an estimated depth map 432, D ∈ R^(32x32), output by CNN 406, and an estimated feature map 434, T ∈ R^(32x32), output by CNN 406. The sequence of 3D face shapes 404 is generated by processing the sequence of image frames 402 with a 3D face alignment engine 403 (e.g., implemented by executing instructions on processor 102, FIG. 1). The estimated feature map 434 may include a color or grayscale heat map in which locations of a corresponding one of the image frames 402 that correspond to activations of the CNN 406 vary in color with varying activation magnitude. The estimated depth map 432 may include a color or grayscale heat map in which locations of a corresponding one of the image frames 402 vary in color with varying depth.

[34] Within non-rigid registration layer 438, a threshold is first applied to the estimated depth map 432, D, to generate a binary depth map 435, V ∈ R^(32x32), which may be expressed as:

V = D ≥ threshold.

[35] Any pixel in estimated depth map 432, D, that exceeds or is equal to (e.g., that has a magnitude or pixel value that exceeds or is equal to) the threshold is expressed as a 1 in binary depth map 435, V, and any pixel that is less than the threshold is expressed as a 0. Next, the inner product of the binary depth map 435, V, and the feature map 434, T, is calculated to determine masked activation values 437, U:

U = T ⊙ V

[36] In this way, if the depth value for a given pixel in the feature map is less than the threshold, that pixel is considered invisible in masked activation values 437, U. The masked activation values 437 may include a color or grayscale heat map in which the non-zero activations of the CNN 406 shown in the estimated feature map 434 that overlap with the "1"s of the binary depth map 435 are shown, while any other non-zero activations of the CNN 406 are set to zero. This masking may effectively remove noisy data occurring from CNN activations that are outside of the region of the face in the image frame 402. Finally, masked activation values 437, U, are frontalized using the estimated 3D face shape 404, S:

[37] where m ∈ R^K is the pre-defined list of K indexes of the face area in the mean frontal face shape 441, S0, and m_ij is the corresponding index for the pixel at location (i, j). m is utilized to project the masked activation values 437, U, to mean frontal face shape 441, S0, to create an estimated frontalized feature map 439, F. The frontalized feature map 439 may include a heat map that is a frontalized version of the heat map corresponding to the masked activation values 437, frontalized according to the mean frontal face shape 441, S0. In other words, m is a mapping between the mean frontal face shape 441, S0, and the estimated 3D shape 404, S, of each input frame by utilizing vertex IDs of these two 3D shapes. Frontalized feature map 439, F, is then output to RNN 408.

[38] By using non-rigid registration layer 438 to frontalize the feature maps in this way, RNN 408 may compare different feature maps in the sequence without concern for facial pose or expression. Additionally, non-rigid registration layer 438 helps to remove background area in the feature maps. Hence, the background area would not participate in RNN learning, despite the background area already being utilized in the training of CNN 406.
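As an illustrative, non-limiting example, the following Python sketch shows the thresholding, masking, and frontalization steps of the non-rigid registration layer. Because the projection using the mapping m is not given here in closed form, the sketch assumes that a per-frame integer lookup table (frame_shape_3d) mapping frontal-face pixels to input-frame locations has been precomputed from S and S0; that lookup table, the threshold value, and the function name are assumptions made for the sketch.

```python
import torch

def non_rigid_registration(depth_map, feature_map, frame_shape_3d, mean_shape_indexes,
                           threshold=0.1, out_size=32):
    """Sketch of the non-rigid registration layer of FIG. 4B for a single frame.

    depth_map, feature_map : (32, 32) tensors D and T estimated by the CNN
    frame_shape_3d         : (out_size, out_size, 2) integer tensor giving, for each
                             frontal-face pixel (i, j), the (row, col) location in the
                             input frame's feature map (assumed precomputed from S, S0)
    mean_shape_indexes     : boolean (out_size, out_size) mask of the face area in S0
    threshold              : depth threshold used to build the binary mask V
    """
    # 1. Threshold the depth map into a binary mask V (1 where D >= threshold).
    binary_mask = (depth_map >= threshold).float()
    # 2. Element-wise product of the mask and the feature map: masked activations U.
    masked = feature_map * binary_mask
    # 3. Frontalize U: for every pixel of the mean frontal face, look up the
    #    corresponding location in the input frame via the precomputed mapping m.
    frontalized = torch.zeros(out_size, out_size)
    rows = frame_shape_3d[..., 0].long()
    cols = frame_shape_3d[..., 1].long()
    frontalized[mean_shape_indexes] = masked[rows[mean_shape_indexes],
                                             cols[mean_shape_indexes]]
    return frontalized
```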

[39] Returning to FIG. 4A, RNN 408 includes a long short-term memory (LSTM) network 440, a fully-connected (FC) layer 442, and a fast Fourier transform (FFT) layer 444. LSTM network 440 receives a sequence of estimated frontalized feature maps from non-rigid registration layer 438, which are then provided to FC layer 442. The output of FC layer 442 is provided to FFT layer 444, which applies a fast Fourier transform activation function in order to produce an estimated rPPG signal 446 in the frequency domain. It should be noted that CNN 406 processes each individual image frame of image frames 402, whereas RNN 408 processes a sequence of inputs. Thus, after the feature maps 434 and depth maps 432 have been generated for a predefined number Nf of image frames, a sequence of Nf outputs is generated by concatenating the feature maps 434 and depth maps 432 generated by CNN 406, and this sequence is subsequently provided to RNN 408. During training of RNN 408, estimated rPPG signal 446 may be fed into an rPPG loss function 448, which is minimized by the training process.
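As an illustrative, non-limiting example, the following PyTorch sketch mirrors the LSTM, fully-connected, and FFT structure of RNN 408. The hidden size, the per-frame scalar output of the FC layer, and the use of the magnitude spectrum as the frequency-domain rPPG estimate are assumptions; the disclosure names only the three layers and the frequency-domain output.

```python
import torch
import torch.nn as nn

class AntiSpoofRNN(nn.Module):
    """Sketch of RNN 408: LSTM over the sequence of frontalized feature maps,
    a fully-connected layer, and a fast Fourier transform producing a
    frequency-domain rPPG estimate."""
    def __init__(self, map_size=32, hidden_size=100, rppg_bins=50):
        super().__init__()
        self.lstm = nn.LSTM(input_size=map_size * map_size,
                            hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)     # one value per frame (assumption)
        self.rppg_bins = rppg_bins

    def forward(self, frontalized_maps):
        # frontalized_maps: (batch, Nf, 32, 32) sequence of frontalized feature maps F.
        b, nf, h, w = frontalized_maps.shape
        seq = frontalized_maps.reshape(b, nf, h * w)
        lstm_out, _ = self.lstm(seq)                  # (batch, Nf, hidden)
        per_frame = self.fc(lstm_out).squeeze(-1)     # (batch, Nf) time-domain response
        spectrum = torch.fft.rfft(per_frame, dim=-1)  # FFT layer
        # Keep up to rppg_bins frequency magnitudes (f in R^50 when enough frames exist).
        return spectrum.abs()[..., :self.rppg_bins]
```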

[40] During training of the CNN-RNN network architecture 400, the sequence of image frames 402 may be a sequence of known image frames for which depth maps and rPPG signals have already been estimated.

[41] To estimate the "ground truth" depth map for a given 2D face image for use in training and verification, dense face alignment (DeFA) methods may be used to estimate the 3D shape of the face. A frontal dense 3D face shape S_F ∈ R^(3xQ), with Q vertices, is represented as a linear combination of identity bases S_id and expression bases S_exp:

S_F = S0 + Σ_{i=1}^{199} α_id^i S_id^i + Σ_{i=1}^{29} α_exp^i S_exp^i,

[42] where α_id ∈ R^199 and α_exp ∈ R^29 are the identity and expression parameters, α = [α_id, α_exp] are the shape parameters, and S0 is the mean 3D face shape. By applying the estimated pose parameters P = (s, R, t), where R ∈ R^(3x3) is a rotation matrix, t ∈ R^3 is a 3D translation, and s is a scale, the frontal 3D face shape S_F may be aligned to the 2D face image as:

S = sRS_F + t.

[43] This method for determining the 3D face shape S for a given 2D face image may also be used to generate 3D face shapes 404, described above. Considering the challenge of estimating the absolute depth from a 2D face image, the z-axis values of the 3D vertices may be normalized within [0,1], with the vertex closest to the camera (e.g., the nose) having a depth of one, and the vertex furthest away from the camera having a depth of zero. A Z-buffer algorithm may then be applied to project the normalized z-axis values of the 3D vertices to a 2D plane, which results in an estimated "ground truth" 2D depth map D ∈ R^(32x32) for the given face image.
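As an illustrative, non-limiting example, the following Python sketch normalizes the z-axis values of an aligned 3D face shape to [0, 1] and rasterizes them onto a 32x32 grid with a simple Z-buffer. The per-vertex point splat (in place of a full triangle rasterizer) and the assumption that larger z values are closer to the camera after alignment are simplifications made for the sketch.

```python
import numpy as np

def ground_truth_depth_map(aligned_vertices, out_size=32, image_size=256):
    """Build a "ground truth" depth map from an aligned 3D face shape S.

    aligned_vertices : (Q, 3) array of vertex coordinates in image space,
                       where larger z is assumed to be closer to the camera
                       (flip the sign of z beforehand if the convention differs).
    """
    x, y, z = aligned_vertices[:, 0], aligned_vertices[:, 1], aligned_vertices[:, 2]
    # Normalize depth so the vertex closest to the camera is 1 and the furthest is 0.
    z_norm = (z - z.min()) / (z.max() - z.min() + 1e-8)
    depth = np.zeros((out_size, out_size))
    cols = np.clip((x / image_size * out_size).astype(int), 0, out_size - 1)
    rows = np.clip((y / image_size * out_size).astype(int), 0, out_size - 1)
    for r, c, d in zip(rows, cols, z_norm):
        depth[r, c] = max(depth[r, c], d)   # Z-buffer: keep the closest (largest) depth
    return depth
```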

[44] In order to estimate the "ground truth" rPPG signal for a given sequence of image frames, DeFA is applied to each image frame and dense 3D face shapes corresponding to the face shown in each image frame are estimated across the image frames. The estimated 3D face shape may be utilized for tracking a region on the face. It should be noted that the sequence of image frames used to obtain the "ground truth" rPPG for a given subject (e.g., face) may not include variations in pose, illumination, and expression (PIE). It is assumed that the same subject under different PIE conditions has the same ground truth rPPG signal, since the heartbeat is similar for videos of the same subject that are captured within a short span of time. This consistent supervision may allow the CNN 406 and RNN 408 to be robust to PIE changes.

[45] For a selected region of the face shown in the sequence of image frames, two orthogonal chrominance signals are computed:

x_f = 3r_f - 2g_f,

y_f = 1.5r_f + g_f - 1.5b_f,

[46] where r_f, g_f, and b_f are the bandpass-filtered versions of the r, g, and b channels of the image frames with skin-tone normalization. The ratio of the standard deviations of these chrominance signals may be used for computing blood flow signals, and is defined as:

γ = σ(x_f) / σ(y_f).

[47] The blood flow signal p is calculated as:

p = x_f - γ y_f.

[48] A Fourier transform is applied to p to obtain the "ground truth" rPPG signal f ∈ R^50, which provides a magnitude for each frequency. The "ground truth" rPPG signal f extracted from the constrained sequence of image frames (i.e., without PIE variation) is used to supervise the rPPG loss function for all videos of the same subject.
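As an illustrative, non-limiting example, the following Python sketch computes the chrominance signals, the ratio γ, the blood flow signal p, and the Fourier-transformed "ground truth" rPPG signal. The band-pass filtering and skin-tone normalization are assumed to have been applied to the input traces already, and the function name and epsilon guard are assumptions.

```python
import numpy as np

def ground_truth_rppg(r_f, g_f, b_f, num_bins=50):
    """Estimate the "ground truth" rPPG signal from band-pass filtered,
    skin-tone-normalized r, g, b traces of a tracked face region."""
    x_f = 3.0 * r_f - 2.0 * g_f                  # orthogonal chrominance signal x_f
    y_f = 1.5 * r_f + g_f - 1.5 * b_f            # orthogonal chrominance signal y_f
    gamma = np.std(x_f) / (np.std(y_f) + 1e-8)   # ratio of standard deviations
    p = x_f - gamma * y_f                        # blood flow signal p
    spectrum = np.abs(np.fft.rfft(p))            # Fourier transform of p
    return spectrum[:num_bins]                   # magnitude per frequency, f in R^50
```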

[49] When training the CNN-RNN network 400, CNN 406 is supervised using the estimated "ground truth" depth map D in conjunction with depth map loss function 436 as follows:

Θ_D = argmin_{Θ_D} Σ_{i=1}^{N_D} || CNN_D(I_i; Θ_D) - D_i ||_1^2,

[50] where Θ_D represents the CNN parameters (i.e., all of the trainable parameters in the CNN), N_D is the total number of training images, and D_i is the "ground truth" depth map for the i-th training image. CNN_D(I_i; Θ_D) denotes the estimated depth map for the input image I_i, and "argmin" returns the optimized parameters Θ_D after the loss function has been minimized. For training the CNN, a batch size of 10 may be used with the Adam optimizer.
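As an illustrative, non-limiting example, the following PyTorch sketch accumulates the depth-map supervision loss over a set of training images; the squared 1-norm follows the reconstructed loss above, and the per-image loop is used only for readability.

```python
import torch

def depth_map_loss(cnn, frames, gt_depth_maps):
    """Accumulate the depth-map supervision loss over a set of training images.

    For each training image I_i, the CNN's estimated depth map CNN_D(I_i; Theta_D)
    is compared against the "ground truth" depth map D_i using the squared 1-norm
    (another norm could be substituted).
    """
    loss = torch.zeros(())
    for frame, gt in zip(frames, gt_depth_maps):
        est_depth, _ = cnn(frame.unsqueeze(0))        # CNN_D(I_i; Theta_D)
        loss = loss + torch.norm(est_depth.squeeze() - gt, p=1) ** 2
    return loss
```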

[51] RNN 408 is supervised using the estimated "ground truth" rPPG signal f in conjunction with rPPG signal loss function 448 as follows:

Θ_R = argmin_{Θ_R} Σ_{i=1}^{N_s} || RNN_R({F_j}_{j=1}^{N_f}; Θ_R) - f_i ||_1^2,

[52] where Θ_R represents the RNN parameters, F_j ∈ R^(32x32) is the j-th estimated frontalized feature map 439 in a sequence, N_f is the total number of image frames in a sequence of image frames, and N_s is the total number of sequences. In the RNN loss function, RNN_R({F_j}_{j=1}^{N_f}; Θ_R) represents the estimated rPPG signal utilizing information from N_f frames, and f_i is the corresponding "ground truth" rPPG signal.

[53] CNN-RNN network 400 may be trained end-to-end. The desired training data for CNN 406 may be from diverse subjects so as to make the training procedure more stable and increase the generalizability of the trained model. The training data for RNN 408 may include long sequences of image frames to leverage temporal information across frames. To this end, a two-stream training strategy may be used. The first stream for training CNN 406 may include, as inputs, image frames I and ground truth depth maps D.

The second stream for training RNN 408 may include, as inputs, face sequences {I_j}_{j=1}^{N_f}, ground truth depth maps {D_j}_{j=1}^{N_f}, estimated 3D shapes {S_j}_{j=1}^{N_f}, and corresponding ground truth rPPG signals f. During training, these two streams may be alternated between in order to converge to a model that minimizes both depth map loss and rPPG signal loss. While the first stream only updates the weights of CNN 406, it should be noted that backpropagation from the second stream updates the weights of both CNN 406 and RNN 408 in an end-to-end manner.
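As an illustrative, non-limiting example, the following PyTorch sketch alternates between the two training streams described in paragraph [53]. The registration callable, the loader contents, the equal weighting of the two losses, and the optimizer settings are assumptions; only the alternation itself and the end-to-end backpropagation of the second stream follow the description above.

```python
import torch

def train_two_stream(cnn, rnn, registration, cnn_loader, seq_loader, epochs=10, lr=1e-4):
    """Alternate between the CNN-only stream and the end-to-end CNN->RNN stream.

    `registration` is assumed to be a batched callable implementing the
    non-rigid registration step (estimated depths, feature maps, 3D shapes
    in, frontalized feature maps out).
    """
    opt_cnn = torch.optim.Adam(cnn.parameters(), lr=lr)
    opt_all = torch.optim.Adam(list(cnn.parameters()) + list(rnn.parameters()), lr=lr)
    for _ in range(epochs):
        # Stream 1: individual image frames I and ground truth depth maps D; CNN only.
        for frames, gt_depths in cnn_loader:
            est_depths, _ = cnn(frames)
            loss_d = torch.norm(est_depths.squeeze(1) - gt_depths, p=1) ** 2
            opt_cnn.zero_grad()
            loss_d.backward()
            opt_cnn.step()
        # Stream 2: face sequences, depth maps, 3D shapes, and ground truth rPPG;
        # backpropagation here updates both the CNN and the RNN end-to-end.
        for frames_seq, gt_depths_seq, shapes_seq, gt_rppg in seq_loader:
            b, nf = frames_seq.shape[:2]
            est_depths, feats = cnn(frames_seq.flatten(0, 1))
            fronts = registration(est_depths, feats, shapes_seq)   # (b*nf, 32, 32)
            est_rppg = rnn(fronts.view(b, nf, 32, 32))
            loss = torch.norm(est_rppg - gt_rppg, p=1) ** 2 \
                 + torch.norm(est_depths.squeeze(1) - gt_depths_seq.flatten(0, 1),
                              p=1) ** 2
            opt_all.zero_grad()
            loss.backward()
            opt_all.step()
```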

[54] The quality of the database of image frame sequences (e.g., videos) used to train CNN-RNN network 400 may impact the accuracy and generalizability of the network. First, it may be important for the database to include live subjects of diverse races in order to make the trained CNN-RNN network more generalizable across different races. Second, the videos in the database may be captured with high-quality cameras having different variations in PIE, which may make the trained CNN-RNN network more robust to PAs utilizing high quality images and/or images with varied PIE conditions. The database may include video of print, replay, and mask attacks to help to ensure that the trained CNN-RNN network is resilient against different types of PAs.

[55] FIG. 5 shows the process flow of a method 500 for processing a sequence of image frames with a CNN-RNN network (e.g., CNN-RNN network 400, FIG. 4A) in order to determine whether the sequence of image frames corresponds to a live face or to a PA. Method 500 also includes optional steps for reconfiguring the CNN portion and/or the RNN portion of the CNN-RNN network during training of the CNN-RNN network.

[56] At step 502, a CNN may receive an image frame of a sequence of image frames and may output an estimated feature map and an estimated depth map corresponding to the received image.

[57] At step 504, a non-rigid registration layer may receive the estimated feature map from the CNN, the estimated depth map, and an estimated 3D shape, may produce a frontalized feature map, and may store the frontalized feature map as part of a sequence of frontalized feature maps.

[58] At step 506, the number of stored frontalized feature maps is compared to a predetermined threshold value in order to determine whether additional image frames should be processed, or whether an adequate number of frontalized feature maps have been obtained for rPPG signal estimation. If the number of stored frontalized feature maps is not equal to the predetermined threshold value, method 500 returns to step 502 and another image frame is processed by the CNN and non-rigid registration layer. Alternatively, if the number of stored frontalized feature maps is equal to the predetermined threshold value, method 500 proceeds to step 508.

[59] At step 508, a RNN receives the sequence of frontalized feature maps from the non-rigid registration layer and generates an estimated rPPG signal corresponding to the originally input sequence of image frames.

[60] At step 510, a calculation is performed (e.g., by a computer processor, such as processor 102 of FIG. 1) to determine a classification score for the input sequence of image frames based on the estimated depth map and the estimated rPPG signal.

[61] At step 512, it is determined (e.g., by the processor 102) whether the sequence of image frames corresponds to a presentation attack or to a live face. This determination may be made by comparing the classification score to a predetermined score threshold value. If the classification score exceeds the predetermined score threshold value, the input sequence of image frames is classified as corresponding to a live face. Otherwise, the input sequence of image frames is classified as corresponding to a PA. If the image frames are classified as a PA, an alert may be issued (e.g., by the processor 102), indicating that a PA attempt has been detected. Such an alert may be displayed on a screen of the device on which the CNN-RNN network is implemented (e.g., system 100 of FIG. 1), or may be transmitted to an external device or computer system depending on the application.
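As an illustrative, non-limiting example, the following Python sketch ties steps 502 through 512 together in a single driver loop. The number of accumulated frontalized feature maps (nf), the score threshold, the weight lambda_weight, and the per-frame registration callable are illustrative assumptions rather than values or interfaces taken from this disclosure.

```python
import torch

def process_sequence(cnn, registration, rnn, frames, shapes, nf=5,
                     score_threshold=0.5, lambda_weight=0.015):
    """Sketch of method 500: accumulate nf frontalized feature maps (steps 502-506),
    estimate the rPPG signal (step 508), compute a score (step 510), and classify
    (step 512). `registration` is a callable implementing the non-rigid registration
    step for a single frame."""
    fronts, last_depth = [], None
    for frame, shape in zip(frames, shapes):
        depth, feat = cnn(frame.unsqueeze(0))                                # step 502
        fronts.append(registration(depth.squeeze(), feat.squeeze(), shape))  # step 504
        last_depth = depth
        if len(fronts) == nf:                                                # step 506
            break
    rppg = rnn(torch.stack(fronts).unsqueeze(0))                             # step 508
    score = rppg.norm() ** 2 + lambda_weight * last_depth.norm() ** 2        # step 510
    return "live" if score > score_threshold else "presentation attack"      # step 512
```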

[62] At step 514, if the CNN-RNN network is being trained, the estimated depth map corresponding to the last image frame of the sequence of image frames being processed by the CNN-RNN network may be compared to a ground-truth depth map corresponding to the last image frame. The difference between the estimated depth map and the ground-truth depth map is then used as a basis for adjusting the weights (and optionally other parameters) of the CNN.

[63] At step 516, if the CNN-RNN network is being trained, the estimated rPPG signal may be compared to a ground-truth rPPG signal corresponding to the processed sequence of image frames. The difference between the estimated rPPG signal and the ground-truth rPPG signal is then used as a basis for adjusting the weights (and optionally other parameters) of the RNN.

[64] FIG. 6 shows the process flow of steps that may be performed when executing step 502 of method 500 of FIG. 5 to produce a depth map and a feature map for an input image frame.

[65] At step 602, a first convolutional layer may receive an image frame of a sequence of image frames and may produce a first output.

[66] At step 604, a first convolutional block may receive the first output from the first convolutional layer. A first pooling layer in the first convolutional block may produce a second output that is provided to a second convolutional block and to a concatenation layer.

[67] At step 606, a second convolutional block may receive the second output from the first convolutional block. A second pooling layer in the second convolutional block may produce a third output that is provided to a third convolutional block and to the concatenation layer.

[68] At step 608, a third convolutional block may receive the third output from the second convolutional block. A third pooling layer in the third convolutional block may produce a fourth output that is provided to a fourth convolutional block and to the concatenation layer.

[69] At step 610, the concatenation layer receives and concatenates the second, third, and fourth outputs of the first, second, and third convolutional blocks to produce a concatenated output.

[70] At step 612, a first branch of convolutional layers receives the concatenated output and produces an estimated depth map corresponding to the image frame.

[71] At step 614, a second branch of convolutional layers receives the concatenated output and produces an estimated feature map corresponding to the image frame.

[72] At step 616, the CNN provides the estimated depth map and estimated feature map to the non-rigid registration layer.

[73] FIG. 7 shows the process flow of steps that may be performed when executing step 508 of method 500 of FIG. 5 to produce an estimated rPPG signal.

[74] At step 702, a long short-term memory (LSTM) network may receive the sequence of frontalized feature maps from the non-rigid registration layer and may produce a first output.

[75] At step 704, a fully connected (FC) layer may receive the first output from the LSTM network, and may produce a second output.

[76] At step 706, a fast Fourier transform (FFT) layer may receive the second output from the FC layer, and may perform a fast Fourier transform on the second output to produce the estimated rPPG signal.

[77] The embodiments described herein have generally been presented in the context of distinguishing between the presentation attack method of face spoofing and a live face. It should be understood that this is meant to be illustrative and not limiting, and that the described systems and methods may also be applied in a variety of other ways. Two illustrative examples are now provided, one for distinguishing between live faces and photographs for an autonomous driving system, and one for the detection of software-based face spoofing attempts.

[78] For the first example, a face anti-spoofing system for distinguishing between live faces and pictures of faces in video data captured by an autonomous driving system may be implemented using, for example, the system of FIG. 1 and the system architecture of FIGS. 3-4B, which may perform the methods of FIGS. 5-7. Generally, autonomous driving systems may include image (e.g., optical video) capturing devices that, collectively, scan an area in the vicinity of the vehicle being autonomously driven, generating video data. This video data may be processed by a computer processor of the autonomous driving system so that pedestrians in the scan area may be identified as quickly as possible so that appropriate action (e.g., slowing, swerving, or stopping the vehicle) may be taken. However, there may be instances in which a photograph of a person enters the vicinity of the autonomously driven vehicle. For example, signs and posters that include images of people, or container trucks that include images of people on the shipping containers they're transporting, could be mistaken for live pedestrians by conventional autonomous driving systems. In this first example, an autonomous driving system, when a potential identification of a person in the vicinity of the vehicle has been made, may verify the potential identification by determining whether the face of the person corresponds to a live face or a photograph or other image of a face (e.g., using the system architecture of FIGS. 3-4B via the methods of FIGS. 5-7).

[79] For the second example, a face anti-spoofing system may be provided for detecting software-based face spoofing attempts. For example, software-based face spoofing attempts may generally take the form of modified video data that includes one or more human faces that have been added to the original video data or that have been otherwise modified by a computer. Artificial Intelligence (AI) processing techniques presently exist that are capable of changing (i.e., spoofing) faces of persons shown in videos, sometimes maliciously. For example, a face in a video may be replaced with a face of another person or may be altered to give the impression of speech so that corresponding voice data can be overlaid with the video. The face anti-spoofing system for detecting software-based face spoofing attempts may, for example, be implemented using the system 100 of FIG. 1 and the network architecture 300 shown in FIG. 3, and may be trained to process frames of video data and identify faces within the video data that are "fake" (referring here to a software-generated or -modified face) and faces within the video data that are "real" (referring here to an unmodified face that was present in the originally captured video data). In this way, video data containing one or more faces that have been spoofed using one of these AI processing techniques may be identified.

[80] The present invention has been described in terms of one or more preferred embodiments, and it should be appreciated that many equivalents, alternatives, variations, and modifications, aside from those expressly stated, are possible and within the scope of the invention.