

Title:
METHOD AND SYSTEM FOR HEAD POSE ESTIMATION
Document Type and Number:
WIPO Patent Application WO/2019/145411
Kind Code:
A1
Abstract:
The invention relates to a method for head pose estimation using a monocular camera (2). In order to provide means for reliable and robust real-time head pose estimation, the invention provides that the method comprises: - providing an initial image frame (In) recorded by the camera (2) showing a head (10); - providing a plurality of facial features (F) with defined positions on the head (10); and - performing at least one pose updating loop with the following steps: - identifying and selecting a plurality of salient points (S) of the head (10) having initial 2D coordinates (pi) in the initial image frame (In) within a region of interest (30); - determining initial 3D coordinates (Pi) for the selected salient points (S) using a geometric head model (20) of the head (10), corresponding to an initial head pose; - providing an updated image frame (In+1) recorded by the camera (2) showing the head (10); - identifying within the updated image frame (In+1) at least some previously selected salient points (S) having updated 2D coordinates (qi); - determining a prediction head pose corresponding to the updated 2D coordinates (qi); - attempting to identify facial features (F) in the updated image frame (In+1) and to determine a correction head pose corresponding to facial 2D coordinates of the identified facial features (F); - if a correction head pose has been determined, using the correction head pose to correct the prediction head pose in order to determine the initial head pose for the next pose updating loop, and otherwise, using the prediction head pose as the initial head pose for the next pose updating loop; and - using the updated image frame (In+1) as the initial image frame (In) for the next pose updating loop.

Inventors:
MIRBACH BRUNO (DE)
GARCIA BECERRO FREDERIC (LU)
DIAZ BARROS JILLIAM MARIA (DE)
STRICKER DIDIER (DE)
Application Number:
PCT/EP2019/051736
Publication Date:
August 01, 2019
Filing Date:
January 24, 2019
Assignee:
IEE SA (LU)
TECHNISCHE UNIV KAISERSLAUTERN (DE)
International Classes:
G06K9/00; G06K9/46
Domestic Patent References:
WO2015192369A1, 2015-12-23
Foreign References:
US20160210500A1, 2016-07-21
US9317785B1, 2016-04-19
US20090202114A1, 2009-08-13
Other References:
JUN-SU JANG ET AL: "Robust 3D head tracking by online feature registration", 8TH IEEE INTERNATIONAL CONFERENCE ON AUTOMATIC FACE & GESTURE RECOGNITION, FG '08, 17-19 SEPT. 2008, IEEE, PISCATAWAY, NJ, USA, 17 September 2008 (2008-09-17) - 19 September 2008 (2008-09-19), pages 1 - 6, XP031448343, ISBN: 978-1-4244-2153-4
CHONG EUNJI ET AL: "Visual 3D tracking of child-adult social interactions", 2017 JOINT IEEE INTERNATIONAL CONFERENCE ON DEVELOPMENT AND LEARNING AND EPIGENETIC ROBOTICS (ICDL-EPIROB), IEEE, 18 September 2017 (2017-09-18), pages 399 - 406, XP033342989, DOI: 10.1109/DEVLRN.2017.8329835
WANG YU ET AL: "Head Pose Estimation Based on Head Tracking and the Kalman Filter", PHYSICS PROCEDIA, vol. 22, 2011, pages 420 - 427, XP028354990, ISSN: 1875-3892, [retrieved on 20111227], DOI: 10.1016/J.PHPRO.2011.11.066
HUANG K S ET AL: "Robust real-time detection, tracking, and pose estimation of faces in video streams", PATTERN RECOGNITION, 2004. ICPR 2004. PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON, CAMBRIDGE, UK, AUG. 23-26, 2004, PISCATAWAY, NJ, USA, IEEE, LOS ALAMITOS, CA, USA, vol. 3, 23 August 2004 (2004-08-23), pages 965 - 968, XP010724821, ISBN: 978-0-7695-2128-2, DOI: 10.1109/ICPR.2004.1334689
V. KAZEMI; J. SULLIVAN: "One millisecond face alignment with an ensemble of regression trees", International Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2014, pages 1867 - 1874
J.-Y. BOUGUET: "Pyramidal implementation of the affine Lucas Kanade feature tracker: description of the algorithm", Intel Corporation, 2001, vol. 1, pages 1 - 9
Attorney, Agent or Firm:
BEISSEL, Jean et al. (LU)
Claims

1. A method for head pose estimation using a monocular camera (2), the method comprising:

- providing an initial image frame (In) recorded by the camera (2) showing a head (10);

- providing a plurality of facial features (F) with defined positions on the head (10); and

- performing at least one pose updating loop with the following steps:

- identifying and selecting a plurality of salient points (S) of the head (10) having initial 2D coordinates (pi) in the initial image frame (In) within a region of interest (30);

- determining initial 3D coordinates (Pi) for the selected salient points (S) using a geometric head model (20) of the head (10), corresponding to an initial head pose;

- providing an updated image frame (In+1) recorded by the camera (2) showing the head (10);

- identifying within the updated image frame (In+1) at least some previously selected salient points (S) having updated 2D coordinates (qi);

- determining a prediction head pose corresponding to the updated 2D coordinates (qi),

- attempting to identify facial features (F) in the updated image frame (In+1) and to determine a correction head pose corresponding to facial 2D coordinates of the identified facial features (F),

- if a correction head pose has been determined, using the correction head pose to correct the prediction head pose in order to determine the initial head pose for the next pose updating loop, and otherwise, using the prediction head pose as the initial head pose for the next pose updating loop; and

- using the updated image frame (In+1) as the initial image frame (In) for the next pose updating loop.

2. The method of claim 1, wherein at least one of the prediction head pose and the correction head pose is determined using a perspective-n-point method.

3. The method of claim 1 or 2, wherein before the at least one pose updating loop, the positions of the facial features (F) on the head (10) are set to be standardized positions.

4. The method of any of the preceding claims, wherein before performing the at least one pose updating loop, facial features (F) are identified in the initial image frame (ln) and the initial head pose is determined corresponding to facial 2D coordinates of the identified facial features (F).

5. The method of any of the preceding claims, wherein the head model (20) is an ellipsoidal head model or a cylindrical head model.

6. The method of any of the preceding claims, wherein the prediction head pose and the correction head pose are combined using a Kalman filter to determine the initial head pose for the next pose updating loop.

7. The method of any of the preceding claims, wherein a covariance is assigned to each of the initial head pose, the prediction head pose and the correction head pose.

8. The method of claim 7, wherein the prediction head pose and the correction head pose are combined depending on their respective covariance to determine the initial head pose for the next pose updating loop.

9. The method of claim 7 or 8, wherein the covariance of the prediction head pose is set to be greater than the covariance of the initial head pose.

10. The method of any of claims 7 to 9, wherein the covariance of the initial head pose for the next pose updating loop is set to be less than or equal to both the covariance of the prediction head pose and that of the correction head pose.

11. The method of any of the preceding claims, wherein previously selected salient points (S) are identified using optical flow.

12. The method of any of the preceding claims, wherein the initial 3D coordinates (Pi) are determined by projecting initial 2D coordinates (pi) from an image plane (2.1) of the camera (2) onto a visible head surface (22).

13. The method of any of the preceding claims, wherein the region of interest (30) is defined by projecting the visible head surface (22) onto the image plane (2.1).

14. A system (1) for head pose estimation, comprising a monocular camera (2) and a processing device (3), which is configured to

- receive an initial image frame (In) recorded by the camera (2) showing a head (10);

- provide a plurality of facial features (F) with defined positions on the head (10); and

- perform at least one pose updating loop with the following steps:

- identifying and selecting a plurality of salient points (S) of the head (10) having initial 2D coordinates (pi) in the initial image frame (In) within a region of interest (30);

- determining initial 3D coordinates (Pi) for the selected salient points (S) using a geometric head model (20) of the head (10), corresponding to an initial head pose;

- providing an updated image frame (In+1) recorded by the camera (2) showing the head (10);

- identifying within the updated image frame (In+1) at least some previously selected salient points (S) having updated 2D coordinates (qi);

- determining a prediction head pose corresponding to the updated 2D coordinates (qi),

- attempting to identify facial features (F) in the updated image frame (In+1) and to determine a correction head pose corresponding to facial 2D coordinates of the identified facial features (F),

- if a correction head pose has been determined, using the correction head pose to correct the prediction head pose in order to determine the initial head pose for the next pose updating loop, and otherwise, using the prediction head pose as the initial head pose for the next pose updating loop; and

- using the updated image frame (In+1) as the initial image frame (In) for the next pose updating loop.

15. The system of claim 14, wherein the system (1) is adapted to perform the method of any of claims 2 to 13.

Description:
Method and system for head pose estimation

Technical field

[0001] The present invention relates to a method and a system for head pose estimation.

Background of the Invention

[0002] Head pose estimation (HPE) is required for different kinds of applications. Apart from determining the head pose itself, HPE is often necessary for face recognition, detection of facial expression, gaze or the like. Many of these applications are safety-relevant, e.g. if the head pose of a driver is detected in order to determine whether he is tired or distracted. However, detecting and monitoring the pose of a human head based on camera images is a challenging task. This applies especially if a monocular camera system is used. In general, the head pose can be characterised by 6 degrees of freedom (DOF), namely 3 for translation and 3 for rotation. For most applications, these 6 DOF need to be determined or estimated in real-time. Some of the problems encountered with head pose estimation are that the human head is geometrically rather complex, individual heads differ significantly (in size, proportions, colour etc.) and the illumination may have a significant influence on the appearance of the head.

[0003] In general, HPE approaches intended for monocular camera systems are based on geometric head models and the tracking of feature points on the head model in the image. Feature points may be facial landmarks (e.g. eyes, nose or mouth) or arbitrary points on the person's face. Thus, these approaches rely either on a precise detection of facial landmarks or on a frame-to-frame face detection. The main drawback of these methods is that they may fail at large rotation angles of the head, when facial landmarks become occluded from the camera. Methods based on tracking arbitrary features on the face surface may cope with larger rotations, but tracking these features alone may lead to a drift of the estimated head pose over time due to the accumulation of tracking errors, and tracking may be lost in case of an occlusion of the head, e.g. by a hand. In addition, face detection at large rotation angles is also less reliable than in a frontal view. Although there have been several approaches to address these drawbacks, the fundamental problem remains unsolved so far, namely that a frame-to-frame detection of the face or facial landmarks is required.

Object of the invention

[0004] It is an object of the present invention to provide means for reliable and robust real-time head pose estimation. The object is achieved by a method according to claim 1 and a system according to claim 14.

General Description of the Invention

[0005] The present invention provides a method for head pose estimation using a monocular camera. In the context of this invention, "estimating" the head pose and "determining" the head pose are used synonymously. It is understood that whenever a head pose is determined based on images alone, there is some room for inaccuracy, making this an estimation of the head pose. The method uses a monocular camera, which means that only images from a single viewpoint are available at a time. However, it is conceivable that the monocular camera itself changes its position and/or orientation while the method is performed. "Head" in this context mostly refers to a human head, although it is conceivable to apply the method to HPE of an animal head.

[0006] In a first step, a 2D initial image frame recorded by the camera is provided, which initial image frame shows a head. It is understood that the image frame is normally provided as a sequence of (digital) data representing pixels. The initial image frame may normally be referred to as a 2D initial image frame, i.e. it contains no depth information. The initial image frame represents everything in the field of view of the camera, and a part of the initial image frame is an image of a head. Normally, the initial image frame should show the entire head, although the inventive method may also work if e.g. the person is so close to the camera that only a part of the head (e.g. 80%) is visible. In general, the initial image frame may be monochrome or multicolour.

[0007] Also, a plurality of facial features with defined positions on the head is provided. One might also say that the facial features are "defined" instead of being "provided". The facial features, which may also be referred to as facial landmarks, are disposed in the facial region of the head. They are normally selected so that they are unaffected by or robust to facial expressions, blinking, and non-rigid motions from the face. They should be clearly distinguishable from their surroundings, e.g. due to a clear contrast in colour or brightness. Such facial features may include the corners of the eyes and the points around the nose. The facial features have defined positions on the head, i.e. their 3D coordinates with respect to the head are known. It should be noted that there may be some discrepancy between the defined 3D positions and the actual 3D positions, e.g. the position of an eye corner as defined could differ to some extent from its actual position.

[0008] After the initial image frame has been provided, at least one pose updating loop is performed. However, it should be noted that the pose updating loop does not have to be performed immediately afterwards. For example, if the camera is recording a series of image frames e.g. at 50 frames per second or 100 frames per second, the pose updating loop does not have to be performed for the image frame that follows the initial image frame. Rather it is possible that several frames or even several tens of frames have passed since the initial image frame. Normally, the method is used to monitor a changing head pose over a certain period of time. Thus, it is preferred that a plurality of consecutive pose updating loops are performed. Normally, each pose updating loop comprises the following steps, which do not necessarily have to be performed in the order they are mentioned. However, it would be conceivable to omit some steps in one or the other pose updating loop.

[0009] In one step, a plurality of salient points of the head having 2D coordinates in the initial image frame within a region of interest are identified and selected. Salient points (or salient features) are points that are in some way clearly distinguishable from their surroundings, mostly due to a clear contrast in colour or brightness. Mostly they are part of a textured region. Examples of salient points are corners of an eye or a mouth, features of an ear, birthmarks, piercings or the like. In order to detect these salient points, algorithms known in the art may be employed, e.g. Harris corner detection, SIFT, SURF or FAST. A plurality of such salient points is identified and selected. This includes the possibility that some salient points are identified but not selected (i.e. discarded), for example because they are considered to be less suitable for the following steps of the method. Identifying and selecting the salient points is preferably performed independently of the provided facial features. In other words, it is not taken into account which facial features have been provided when the salient points are identified and selected. One could also say that facial features on the one hand and salient points on the other hand are preferably treated separately by the inventive method. This does not exclude the possibility, though, that a salient point may ("by coincidence") correspond to a facial feature. E.g. a corner of an eye could be a facial feature and a salient point at the same time. The region of interest is that part of the initial image frame that is considered to show the head or at least part of the head. In other words, identification and selection of salient points is restricted to this region of interest. The time interval between recording the initial image frame and selecting the plurality of salient points can be short or long. However, for real-time applications, it is mostly desirable that the time interval is short, e.g. less than 10 ms. In general, identification of the salient points is not restricted to the person's face. For instance, when the head is rotated, the region of interest comprises, at least in one loop, a non-facial region of the head. In that case, at least in one loop, at least one selected salient point is in a non-facial region of the head. Such a salient point may be e.g. a feature of an ear, an ear ring or the like. Not being restricted to detecting facial features is a great advantage of the inventive method, which makes frame-to-frame detection of the face unnecessary.
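By way of illustration only, a minimal sketch of this identification-and-selection step might use OpenCV's FAST detector restricted to an ROI mask; the threshold and point budget below are assumptions, not values from the patent.

```python
# Illustrative sketch: FAST corner detection restricted to the region of interest.
import cv2
import numpy as np

def select_salient_points(frame_gray, roi_polygon, max_points=100):
    """Detect FAST corners and keep only those inside the region of interest."""
    # Build a binary mask from the ROI polygon (the projected visible head surface).
    mask = np.zeros(frame_gray.shape, dtype=np.uint8)
    cv2.fillPoly(mask, [roi_polygon.astype(np.int32)], 255)

    fast = cv2.FastFeatureDetector_create(threshold=25)  # threshold is illustrative
    keypoints = fast.detect(frame_gray, mask=mask)

    # Keep the strongest responses; weak points near the ROI border tend to be
    # less reliable (compare paragraph [0037] on weighting).
    keypoints = sorted(keypoints, key=lambda k: k.response, reverse=True)[:max_points]
    return np.array([k.pt for k in keypoints], dtype=np.float32)  # initial 2D coordinates pi
```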

[0010] After the salient points have been selected, corresponding initial 3D coordinates are determined using a geometric head model of the head, corresponding to an initial head pose. In other words, starting from the initial 2D coordinates (in the initial image frame) of the salient points, initial 3D coordinates in the 3D space (or in the "real world") are determined (or estimated). Of course, without additional information, the initial 3D coordinates would be ambiguous. In order to resolve this ambiguity, a geometric head model is used which defines the size and shape of the head (normally in a simplified way), and an initial head pose is assumed, which defines the 6 DOF of the head, i.e. its position and orientation. It is understood that the (initial) head pose has to be predetermined in some way. There are different ways to do this and the invention is not limited to a specific way. While it is conceivable to approximately determine the position of the head e.g. by assuming an average size and relating this to the size of the initial image, it is rather difficult to estimate the orientation. One possibility is to consider the facial features. Using e.g. a perspective-n-point method (which will also be referred to below), the initial head pose that relates these (3D) facial features with corresponding (2D) facial features detected in the image can be estimated. This initialization requires the detection of a sufficient number of 2D facial features in the image, which is usually possible. If not, a person may be asked to face the camera more or less directly when the initial image frame is recorded so that the facial features are visible. As this step is completed, the salient points are associated with initial 3D coordinates which are located on the head as represented by the (usually simplified) geometric head model.

[0011] In another step, an updated image frame recorded by the camera showing the head is provided. This updated image frame - which normally can be referred to as a 2D updated image frame - has been recorded after the initial image frame, but as mentioned above, it does not have to be the following frame. In contrast to methods known in the art, the inventive method works satisfactorily even if several image frames have passed from the initial image frame to the updated image frame. This of course implies the possibility that the updated image frame differs considerably from the initial image frame and that the pose of the head may have changed significantly.

[0012] After the updated image frame has been provided, at least some previously selected salient points having updated 2D coordinates are identified within the updated image frame. This may also be described as a tracking of the salient points from one frame to another, e.g. the next frame. This may be performed before or after the initial 3D coordinates are determined or at the same time, i.e. in parallel. Normally, since the head pose has changed between the initial image frame and the updated image frame, the updated 2D coordinates differ from the initial 2D coordinates. Also, it is possible that some of the previously selected salient points are not visible in the updated image frame, usually because the person has turned his head so that some salient points are no longer facing the camera or because some salient points are occluded by an object between the camera and the head. However, if enough salient points have been selected before, a sufficient number should still be visible. These salient points are identified along with their updated 2D coordinates. Although this is preferred, it is not necessary to perform a frame-by-frame tracking of the head or the salient points.

[0013] Once the salient points have been identified and the updated 2D coordinates are known, a prediction head pose corresponding to the updated 2D coordinates is determined. The prediction head pose corresponds to updated 3D coordinates of the salient points, which may be determined (i.e. calculated) explicitly. However, assuming that the positions of the salient points with respect to the head model do not change, the updated 3D coordinates are implicitly given by the updated head pose and do not have to be determined explicitly. There are different possibilities how the prediction head pose (and, optionally, the updated 3D coordinates) can be determined based on the updated 2D coordinates, e.g. by a perspective-n-point method as will be explained further below. This step is based on the assumption that the positions of the salient points with respect to the head model do not change (significantly) from the initial image frame to the updated image frame. Therefore, with the updated 2D coordinates of the salient points known, the "updated" head pose, which herein is referred to as the prediction head pose, can be deduced. Of course, determination of the prediction head pose is facilitated by a large number of salient points with updated 2D coordinates. As long as a sufficient number of previously selected salient points can be identified in the updated image frame, the prediction head pose can be determined.

[0014] On the one hand, the above described tracking of salient points from one frame to another frame provides a high degree of flexibility, because even if the head assumes an "extreme" head pose that differs significantly from a frontal head pose, a sufficient number of salient points can normally be identified. The same applies to the situation where a part of the head, especially a part of the face, is occluded by some object. In other words, even if the face of the person is not visible or not fully visible, flexible selection of salient features allows for reliable and robust determination of the current head pose. However, since this tracking of salient points is basically a recursive method, where determination of the prediction head pose is based on the previously determined initial head pose, it is susceptible to a certain degree of drift, resulting from errors adding up from one pose updating loop to another. Therefore, the inventive method provides the possibility of correcting the prediction head pose by the steps described below.

[0015] In another step, it is attempted to identify facial features (i.e. at least some of the facial features) in the updated image frame and to determine a correction head pose corresponding to facial 2D coordinates of the identified facial features. "Attempt to identify" of course means that facial features are looked for or searched for in the updated image frame and are either identified successfully or not (unsuccessful attempt). If a facial feature is identified, it can be characterised by facial 2D coordinates within the updated image frame. Ideally, all facial features are identified, but it is in general sufficient to identify some of the facial features. The positions (i.e. the 3D coordinates) of the facial features with respect to the head have been defined before the (first) pose updating loop, wherefore the correction head pose can be determined based on the corresponding facial 2D coordinates in the respective image frame. Detection of the facial features can e.g. be based on the approach proposed in V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees", International Conference on Computer Vision and Pattern Recognition (CVPR), pages 1867-1874, IEEE, 2014. This method uses an ensemble of regression trees to align the facial features, from a sparse subset of intensity values indexed to an initial estimate of the shape. While the predefined positions of the facial features with respect to the head by definition do not change, their positions with respect to the camera of course depend on the correction head pose (and vice versa). Thus, the correction head pose corresponds to facial 3D coordinates of the facial features (with respect to the camera or some other stationary reference frame), which may be determined explicitly or not. The correction head pose can be determined based on the facial 2D coordinates by the same method as the prediction head pose is determined based on the updated 2D coordinates. However, a different method could be employed.
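As an illustration of this step, the sketch below uses dlib, whose shape_predictor is an implementation of the cited Kazemi-Sullivan regression-tree alignment. The standard 68-landmark model file and the particular subset of stable landmarks (eye corners, nose region) chosen here are assumptions.

```python
# Illustrative sketch: facial-feature detection with dlib's landmark predictor.
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Landmark indices for eye corners and nose points (illustrative choice of
# features robust to facial expressions, per paragraph [0007]).
STABLE_IDS = [36, 39, 42, 45, 30, 31, 33, 35]

def detect_facial_features(frame_gray):
    """Return facial 2D coordinates of the stable landmarks, or None on failure."""
    faces = detector(frame_gray, 0)
    if len(faces) == 0:
        return None  # no correction head pose can be determined in this loop
    shape = predictor(frame_gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in STABLE_IDS],
                    dtype=np.float32)
```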

[0016] If a sufficient number of facial features with their corresponding facial 2D coordinates is identified, the position and orientation of the face can be determined, and thus the pose of the head. The difference between identifying salient points and identifying facial features is that the number and position of the salient points on the head is not limited or predefined as such, whereas the facial features and their relative positions on the head are predefined before the start of the (first) pose updating loop. Also, while determination of the prediction head pose is based recursively on the initial head pose, the determination of the correction head pose is based on information only from the updated image frame and uses no information from the previous image frame. Therefore, the correction head pose cannot be biased by drift or similar effects. However, it should be noted that in some cases, no facial features or an insufficient number of facial features may be identified in the updated image frame, which can make determination of the correction head pose impossible.

[0017] It should be noted that the steps pertaining to the facial features do not have to be performed after the steps pertaining to the salient points, but could also be performed before or simultaneously. This is due to the fact that the facial features and the salient points are treated in a completely independent way, although coincidentally, a facial feature and a salient point could be identical.

[0018] If a correction head pose has been determined successfully, the correction head pose is used to correct the prediction head pose in order to determine the initial head pose for the next pose updating loop. There are various possibilities how the prediction head pose can be corrected, but this normally implies that the initial head pose for the next pose updating loop is based partially on the prediction head pose and the correction head pose, and normally not exclusively on only one of these. The terms "prediction" and "correction" do not imply that the correction head pose is always more accurate than the prediction head pose. E.g. in a situation where only a few facial features have been identified, the correction head pose may possibly be determined with a considerable inaccuracy or error. As will be explained later, the method of correcting the prediction head pose may take such inaccuracy into account. However, the great advantage of the correction head pose is that any error or uncertainty occurring from the previous image frame has no impact for the following image frame. While the prediction head pose is always determined in relation to the initial head pose, wherefore the former cannot (statistically) be more precise than the latter, the correction head pose is uninfluenced by any errors or inaccuracies of the initial head pose.

[0019] If no correction head pose has been determined, the prediction head pose is used as the initial head pose for the next pose updating loop. As explained above, failure to determine the correction head pose may be due to (self-)occlusions or extreme head poses. Normally, the correction head pose can be determined again after a limited number of image frames. However, while this is not possible, a (new) initial head pose for the next updating loop is still provided by simply using the prediction head pose. It should be noted that, in particular if only one or a few image frames have passed since the last correction, the prediction head pose is normally fairly accurate. As mentioned above, the prediction head pose can in some situations even be more accurate than the correction head pose.

[0020] If more than one pose updating loop is performed, the updated image frame is used as the initial image frame for the next loop. In case of a plurality of pose updating loops, some of these loops could be "modified" in that certain steps are omitted. For example, the steps pertaining to the correction head pose could be performed only in every second (third, fourth etc.) pose updating loop.

[0021] The inventive method provides a robust and reliable way of head pose estimation. Its particular strength is that it combines the advantages of two individual methods. On the one hand, it benefits from the robustness of detecting salient points, which is nearly always possible, regardless of the head pose and (limited) occlusions. On the other hand, it benefits from the absolute (i.e. not relative to a previous head pose) determination of the head pose based on facial features, which is unaffected by drift or similar effects. Another great advantage is that the prediction head pose and the correction head pose can be determined independently of each other, i.e. the corresponding calculations can be performed in an arbitrary sequence, e.g. in parallel. This greatly helps to provide fast, real-time head pose estimation.

[0022] Preferably, at least one of the prediction head pose and the correction head pose is determined using a perspective-n-point method. In general, perspective-n-point is the problem of estimating the pose of a calibrated camera given a set of n 3D points in the world and their corresponding 2D projections in the image. However, this is equivalent to the pose of the head being unknown with respect to the camera, when n salient points (or n facial features) of the head with known positions (i.e. 3D coordinates) with respect to the head are given. Of course, this approach is based on the assumption that the positions of the salient points or the facial features, respectively, with respect to the geometric head model do not change significantly. Although the head with its salient points is not completely rigid and the relative positions of the salient points may change to some extent (e.g. due to changes in facial expression), it is generally still possible to solve the perspective-n-point problem, while changes in the relative positions can lead to some discrepancies which can be minimised to determine the most probable head pose. Such problems can be reduced with respect to the facial features by providing facial features in regions that are normally unaffected by talking, facial expressions or the like. Examples for these include the eye corners and features of the nose, e.g. nostrils. In general, the perspective-n-point problem can be formulated as minimising the error between the projection of the 3D coordinates onto the image plane and the corresponding 2D coordinates:

min over R, t: Σi ‖ π(R · Pi + t) − qi ‖², Eq. (1)

wherein π denotes the perspective projection operator, π: ℝ³ → ℝ², and i is the index of the i-th salient point. Referring to the salient points, Pi are the initial 3D coordinates according to the initial image frame, which are transformed to the updated 3D coordinates by a rotation R and a translation t, while qi are the updated 2D coordinates. Referring to the facial features, Pi would be the 3D coordinates, e.g. in a predefined head mesh before the first pose updating loop during initialisation, which are transformed to the (current) facial 3D coordinates by a rotation R and a translation t, while the current facial 2D coordinates take the place of qi. Either way, finding the R and t which minimise the error defines the prediction head pose or the correction head pose, respectively. The big advantage of employing a perspective-n-point method in order to determine the prediction head pose and/or the correction head pose is that this method works even if larger changes occur between the initial image frame and the updated image frame.
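A hedged sketch of how such a perspective-n-point solve might look with OpenCV is given below; the calibrated camera matrix, distortion coefficients, and the use of the previous pose as an initial guess are assumptions, not prescriptions from the patent.

```python
# Illustrative sketch: solving Eq. (1) with OpenCV's iterative PnP solver.
import cv2
import numpy as np

def estimate_pose(points_3d, points_2d, camera_matrix, dist_coeffs,
                  rvec0=None, tvec0=None):
    """Solve for rotation R and translation t mapping model points Pi to image points qi.

    points_3d: (N, 3) float array, points_2d: (N, 2) float array.
    """
    use_guess = rvec0 is not None
    ok, rvec, tvec = cv2.solvePnP(
        points_3d, points_2d, camera_matrix, dist_coeffs,
        rvec=rvec0, tvec=tvec0, useExtrinsicGuess=use_guess,
        flags=cv2.SOLVEPNP_ITERATIVE)  # minimises the reprojection error
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # convert rotation vector to a 3x3 matrix
    return R, tvec
```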

[0023] The head model normally represents a simplified geometric shape. This may be e.g. a plane head model (PHM). According to one embodiment, the head model is a cylindrical head model (CHM). In other words, the shape of the head is approximated as a cylinder. According to another embodiment, the head model is an ellipsoidal head model (EHM), i.e. the shape of the head is approximated as an ellipsoid. While these models are simple and allow for easy identification of the visible portions of the surface, they are still sufficiently good approximations to yield reliable results. However, other, more accurate models may be used to advantage, too.

[0024] While it is conceivable to individually define the facial features for each head, one may also refer to a standardised set of facial features having relative positions that are similar for each individual head and differ mostly by a scaling factor that depends on the dimensions of the head. While such an approach could potentially lead to increased inaccuracies (e.g. if a face has highly irregular proportions), it is usually sufficiently accurate and above all helps to keep the initialisation of the method simple and time saving. According to such an approach, before the at least one pose updating loop, the positions of the facial features on the head are set to be standardised positions. These standardised positions can be based on averaging position data of a multitude of actual faces. The standardised positions represent a three-dimensional mesh that could optionally be scaled, i.e. enlarged or reduced, according to the dimensions of the head model. However, since the dimensions of the actual head do not differ too much (at least for an adult), such scaling is often unnecessary to achieve satisfactory results. The dimensions of the head model may also be chosen according to a standardised head, which preferably corresponds at least approximately to the standardised positions of the facial features. For instance, the ratio of the distance between the facial features representing the outer corners of the eyes and the radius of the CHM (or the small radius of the EHM) should be within a certain range. Also, the dimensions of the head model should be chosen so that the facial features are close to the surface of the head model.

[0025] As mentioned above, the initial head pose has to be determined before the first pose updating loop. This can be achieved in different ways. According to a preferred embodiment, before performing the at least one pose updating loop, facial features are identified in the initial image frame and the initial head pose is determined corresponding to facial 2D coordinates of the identified facial features. In other words, the initial head pose for the first pose updating loop is determined in the same way as the correction head pose. After the initial head pose has been determined, the head model can be aligned according to this head pose.

[0026] Correction of the prediction head pose may be performed in different ways. According to a preferred embodiment, the prediction head pose and the correction head pose are combined using a Kalman filter to determine the initial head pose for the next pose updating loop. In other words, the prediction head pose is associated with the predicted state according to the Kalman filter, whereas the correction head pose is associated with the measurement. If the measurement at time k is denoted by zk and the predicted state by x̂k⁻, the updated state estimate x̂k is given by:

x̂k = x̂k⁻ + Kk (zk − H x̂k⁻), Eq. (2)

where Kk denotes the Kalman gain and H is normally a matrix relating the current state to the measurement. The state estimate x̂k corresponds to the initial head pose for the next pose updating loop. Within this concept, the prediction head pose is treated as the prediction, whereas the correction head pose is treated as the measurement. In general, the measurement zk can be described by the following equation:

zk = H xk + vk, Eq. (3)

where vk denotes the measurement noise in the observation. Preferably, the covariance of the measurement noise is updated in every pose updating loop according to the expected accuracy of the correction head pose using facial features. In addition, the covariance of the process noise may also be updated in every pose updating loop, e.g. according to the number of identified salient features or a reprojection error.
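For illustration, the sketch below shows the correction step of Eq. (2) for a 6-DOF pose state (3 rotation angles, 3 translations) under the simplifying assumption H = I, i.e. the correction head pose directly measures the full pose; all names are illustrative.

```python
# Illustrative sketch: covariance-weighted fusion of prediction and correction pose.
import numpy as np

def fuse_poses(x_pred, P_pred, z_corr, R_corr):
    """Combine the prediction pose x_pred (covariance P_pred) with the
    correction pose z_corr (covariance R_corr), per Eq. (2) with H = I."""
    # The Kalman gain weights the correction by the relative covariances, so the
    # less certain estimate contributes less to the fused pose.
    K = P_pred @ np.linalg.inv(P_pred + R_corr)
    # Rotation components are assumed small enough that plain differences
    # suffice (no angle wrap-around handling in this sketch).
    x_fused = x_pred + K @ (z_corr - x_pred)
    # The fused covariance does not exceed either input (compare claim 10).
    P_fused = (np.eye(len(x_pred)) - K) @ P_pred
    return x_fused, P_fused
```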

[0027] In particular, but not exclusively, if the above described Kalman approach is used, it is preferred that a covariance is assigned to each of the initial head pose, the prediction head pose and the correction head pose. The respective covariance is a measure of the expected error or the uncertainty of the respective head pose. In other words, the greater the reliability of the respective head pose, the smaller the respective covariance should be. In general, the covariance is a matrix, which, in this context, can be assumed to be diagonal (although this assumption may be to some extent inaccurate). When relations like "greater" and "smaller" with respect to a covariance are referred to, this may refer to the respective elements in the covariance matrix or, more generally, to the norm of the covariance matrix. It is also preferred that each of the covariances is adapted or updated in every pose updating loop. In particular, the covariance matrix of the correction head pose is preferably updated in every frame according to the expected accuracy of the correction head pose. The covariance of the prediction head pose is, e.g., updated in every frame depending on the initial head pose or on a computed measure quantifying the accuracy of the prediction head pose.

[0028] Preferably, the prediction head pose and the correction head pose are combined depending on their respective covariance to determine the initial head pose for the next pose updating loop. In particular, when the two head poses are combined, the head pose having the larger covariance can be assigned a smaller weight than the other head pose, thus having less influence on the initial head pose for the next loop. This embodiment may in particular be combined with the application of a Kalman filter. For instance, the Kalman gain K k can depend on the ratio of the covariance of the prediction head pose to the covariance of the correction head pose.

[0029] As mentioned above, the covariances are normally adapted in each pose updating loop. According to one embodiment, the covariance of the prediction head pose is set to be greater than the covariance of the initial head pose. This reflects the fact that the prediction head pose is determined based on the initial head pose, wherefore the expected error of the prediction head pose has to be greater.

[0030] According to another embodiment, the covariance of the initial head pose for the next pose updating loop is set to be less than or equal to both the covariance of the prediction head pose and that of the correction head pose. This reflects the fact that, since the prediction head pose and the correction head pose are determined independently of each other, the expected error of their combination (namely the initial head pose for the next pose updating loop) does not increase with respect to either of the two.

[0031] Besides this, there are more possibilities to update the respective covariances. For instance, the covariance of the prediction head pose may be updated according to the number of identified salient features or a reprojection error. Also, the covariance of the correction head pose may be adapted according to the number of detected facial features (where fewer detected facial features lead to a greater covariance) or according to the correction head pose as such (since a head pose that differs greatly from a frontal view can normally only be detected with a greater error).

[0032] There are different options for how to identify previously selected salient points. The general problem may be regarded as tracking the salient points from the initial image frame to the updated image frame. There are several approaches to such an optical tracking problem. According to one preferred embodiment, previously selected salient points are identified using optical flow. This may be performed, for example, using the Kanade-Lucas-Tomasi (KLT) feature tracker as disclosed in J.-Y. Bouguet, "Pyramidal implementation of the affine Lucas Kanade feature tracker: description of the algorithm", Intel Corporation, 2001, vol. 1, no. 2, pp. 1-9.
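A minimal sketch of this tracking step with OpenCV's pyramidal Lucas-Kanade implementation might look as follows; the window size and pyramid depth are illustrative parameters.

```python
# Illustrative sketch: KLT optical-flow tracking of the selected salient points.
import cv2
import numpy as np

def track_salient_points(prev_gray, next_gray, points_prev):
    """Track 2D points pi from the initial frame to updated coordinates qi.

    points_prev: (N, 2) float32 array of initial 2D coordinates.
    """
    points_next, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, points_prev, None,
        winSize=(21, 21), maxLevel=3)
    # Points lost to occlusion or head rotation are simply dropped; enough
    # survivors suffice to determine the prediction head pose.
    found = status.ravel() == 1
    return points_prev[found], points_next[found]
```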

[0033] Preferably, the initial 3D coordinates are determined by projecting initial 2D coordinates from an image plane of the camera onto a visible head surface. The image plane of the camera may correspond to the position of a CCD element or the like. This may be regarded as the physical location of the image frames. Given the optical characteristics of the camera, it is possible to project or "ray trace" any point on the image plane to its origin, if the surface of the corresponding object is known. In this case, a visible head surface is provided and the initial 3D coordinates correspond to the intersection of a back-traced ray with this visible head surface. The visible head surface represents those parts of the head that are considered to be visible. It is understood that depending on the head model used, the actually visible surface of the (real) head may differ more or less.

[0034] According to a preferred embodiment, the visible head surface is determined by determining the intersection of a boundary plane with a model head surface. The model head surface is a surface of the used geometric head model. In the case of a CHM, the model head surface is a cylindrical surface, in case of an EHM the model head surface is ellipsoidal. The boundary plane is used to separate the part of the model head surface that is considered to be invisible (or occluded) from the part that is considered to be visible. The accuracy of the thus determined visible head surface partially depends on the head model, but for a CHM as well as for an EHM, the result is adequate if the location and orientation of the boundary plane are determined appropriately.

[0035] Preferably, the boundary plane is parallel to an X-axis of the camera and a center axis of the head model. Herein, the X-axis is a horizontal axis perpendicular to the optical axis. In the corresponding coordinate system, the Z-axis corresponds to the optical axis and the Y-axis to the vertical axis. Of course, the respective axes are horizontal/vertical within the reference frame of the camera, and not necessarily with respect to the direction of gravity. The center axis of a cylindrical head model runs through the centers of the two bases of the cylinder. In other words, it is the symmetry axis of the cylinder. The center axis of an ellipsoidal head model coincides with the major axis of the ellipsoid. One can also say that the normal vector of the boundary plane results from the cross product of the X-axis and the center axis. The intersection of this boundary plane and the (cylindrical, ellipsoidal or other) model head surface defines the (three-dimensional) edges of the visible head surface.
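For illustration, a small NumPy sketch of this boundary-plane construction is given below; the orientation convention for the normal (pointing towards the camera) and the variable names are assumptions.

```python
# Illustrative sketch: boundary plane from the camera X-axis and the model's
# center axis, used to classify model surface points as visible or occluded.
import numpy as np

def visible_mask(surface_points, head_center, center_axis, camera_center):
    """Return a boolean mask of model surface points on the visible side."""
    x_axis = np.array([1.0, 0.0, 0.0])      # camera X-axis in camera coordinates
    normal = np.cross(x_axis, center_axis)  # normal of the boundary plane
    normal /= np.linalg.norm(normal)
    # Orient the normal towards the camera so that "visible" means positive sign.
    if np.dot(normal, camera_center - head_center) < 0:
        normal = -normal
    return (surface_points - head_center) @ normal > 0
```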

[0036] According to one embodiment, the region of interest is defined by projecting the visible head surface onto the image plane. The intersection of the boundary plane and the model head surface defines the (three-dimensional) edges of the visible head surface. Projecting these edges onto the image plane of the camera yields the corresponding 2D coordinates in the image. These correspond to the (current or updated) region of interest. As mentioned above, e.g. when the head is rotated, the region of interest comprises, at least in one loop, a non-facial region of the head. In that case, at least in one loop, the visible head surface comprises a non-facial head surface.

[0037] According to a preferred embodiment, the salient points are selected based on an associated weight. One possibility is that this weight depends on the distance to a border of the region of interest. This is based on the assumption that salient points which are close to the border of the region of interest may not belong to the actual head or may be more likely to become occluded even if the head pose changes only slightly. For example, one such salient point could belong to a person's ear and thus be visible when the person is facing the camera, but become occluded even if the person turns his head only slightly. Therefore, if enough salient points are detected further away from the border of the region of interest, salient points closer to the border could be discarded.

[0038] Also, the perspective-n-point method may be performed based on the weight of the salient points. For example, if the result of the perspective-n-point method is inconclusive, those salient points which had been detected closer to the border of the region of interest could be neglected completely or any inconsistencies in the determination of the updated 3D coordinates associated with these salient points could be tolerated. In other words, when determining the updated head pose, the salient points further away from the border are treated as more reliable and with greater weight. This approach can also be referred to as "distance transform".

[0039] Another possibility (which may be combined with the abovementioned option) is that the weight depends on the number of image frames (or the number of pose updating loops, respectively) in which a certain salient point has been detected. This may also be referred to as a "temporal weight". According to this approach, those salient points that have been tracked over several image frames are treated as more reliable. By doing so, new salient features that might appear due to a (self)-occlusion, e.g., passing the hand in front of the face, can be easily discarded. It is understood that this weight may be used for selecting the salient points as well as for performing the perspective-n-point method.
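The two weighting schemes could, for instance, be combined as in the following sketch; the saturation constants and the multiplicative combination are assumptions, not values from the patent.

```python
# Illustrative sketch: distance-transform weight combined with a temporal weight.
import cv2
import numpy as np

def salient_point_weights(points_2d, roi_mask, frames_tracked,
                          d_sat=20.0, n_sat=10.0):
    """Combine distance-to-border and tracking-age weights per salient point.

    roi_mask: uint8 binary mask, nonzero inside the region of interest.
    frames_tracked: array with the number of loops each point has survived.
    """
    # Distance of every ROI pixel to the nearest border pixel ("distance transform").
    dist = cv2.distanceTransform(roi_mask, cv2.DIST_L2, 5)
    d = np.array([dist[int(y), int(x)] for x, y in points_2d])
    w_dist = np.minimum(d / d_sat, 1.0)               # low near the ROI border
    w_time = np.minimum(frames_tracked / n_sat, 1.0)  # low for newly appeared points
    return w_dist * w_time
```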

[0040] If several consecutive pose updating loops are performed, the initially specified region of interest is normally not suitable any more after some time. This would lead to difficulties when updating the salient points because detection would occur in a region of the image frame that does not correspond well with the position of the head. It is therefore preferred that in each pose updating loop, the region of interest is updated. Normally, updating the region of interest is performed after updating the head pose.

[0041] The invention also provides a system for head pose estimation, comprising a monocular camera and a processing device, which is configured to:

- receive an initial image frame recorded by the camera showing a head;

- provide a plurality of facial features with defined positions on the head; and

- perform at least one pose updating loop with the following steps:

- identifying and selecting a plurality of salient points of the head having initial 2D coordinates in the initial image frame within a region of interest;

- determining initial 3D coordinates for the selected salient points using a geometric head model of the head, corresponding to an initial head pose;

- providing an updated image frame recorded by the camera showing the head;

- identifying within the updated image frame at least some previously selected salient points having updated 2D coordinates;

- determining a prediction head pose corresponding to the updated 2D coordinates,

- attempting to identify facial features in the updated image frame and to determine a correction head pose corresponding to facial 2D coordinates of the identified facial features,

- if a correction head pose has been determined, using the correction head pose to correct the prediction head pose in order to determine the initial head pose for the next pose updating loop, and otherwise, using the prediction head pose as the initial head pose for the next pose updating loop; and

- using the updated image frame as the initial image frame for the next pose updating loop.

[0042] The processing device can be connected to the camera with a wired or wireless connection in order to receive image frames from the camera and, optionally, to transmit commands to the camera. It is understood that normally at least some functions of the processing device are software-implemented.

[0043] Other terms and functions performed by the processing device have been described above with respect to the corresponding method and therefore will not be explained again.

[0044] Preferred embodiments of the inventive system correspond to those of the inventive method. In other words, the system, or normally, the processing device of the system, is preferably adapted to perform the preferred embodiments of the inventive method.

Brief Description of the Drawings

[0045] Further details and advantages of the present invention will be apparent from the following detailed description of non-limiting embodiments with reference to the attached drawings, wherein:

Fig. 1 is a schematic representation of an inventive system and a head;

Fig. 2 is a flowchart illustrating an embodiment of the inventive method;

Fig. 3 illustrates two image frames used in the method of fig. 2;

Fig. 4 illustrates a projection onto an image plane of a camera of the system of Fig. 1;

Fig. 5 shows the head with a plurality of facial features; and

Fig. 6 shows the head with a plurality of salient points.

Description of Preferred Embodiments

[0046] Fig. 1 schematically shows a system 1 for head pose estimation according to the invention and a head 10 of a person. The system 1 comprises a monocular camera 2 which may be characterized by a vertical Y-axis, a horizontal Z-axis, which corresponds to the optical axis, and an X-axis which is perpendicular to the drawing plane of fig. 1. The camera 2 is connected (by wire or wirelessly) to a processing device 3, which may receive 2D image frames I0, In, In+1 recorded by the camera 2. The camera 2 is directed towards the head 10. The system 1 is configured to perform a method for head pose estimation, which will now be explained with reference to figs. 2 to 6.

[0047] Fig. 2 is a flowchart illustrating one embodiment of the inventive method. After the start, an initial image frame I0 is recorded by the camera 2 as shown in fig. 4. The "physical location" of any image frame corresponds to an image plane 2.1 of the camera 2. The initial image frame I0 is provided to the processing device 3.

[0048] Also, a plurality of facial features F is defined, which are robust to facial expressions, blinking and other non-rigid motions from the face. These include the corners of the eyes and points around the nose (see fig. 5). The relative 3D positions of these facial features F are predefined based on an "averaged" face model, but could optionally be scaled to match the dimensions of the head 10.

[0049] In another step, facial features F are identified in the initial image frame I0 and an initial head pose is determined corresponding to facial 2D coordinates of the identified facial features F. If a facial feature is identified, it can be characterised by facial 2D coordinates within the initial image frame I0. The initial head pose can then be determined based on the facial 2D coordinates by solving a perspective-n-point problem. The details of this process will be described below with respect to the pose updating loop. The initial head pose describes the position and orientation of the head 10 with respect to the camera 2. Such a pose may be described by a rotation R0 and a translation t0. A covariance is assigned to this initial head pose, which is a measure of the expected error of the head pose.

[0050] As shown in figs. 4 and 6, the real head 10 is approximated by an ellipsoidal head model (EHM) 20, the dimensions of which can be set to standardised values corresponding to average values of real human heads. Optionally, the EHM 20 may be scaled. Once the initial head pose has been determined, the EHM 20 is aligned with the initial head pose and a region of interest 30 is defined by projecting the EHM 20, or rather a visible head surface 22 thereof, onto the image plane 2.1. Determination of the visible head surface 22 is described below with respect to the pose updating loop.

[0051] The steps described so far can be regarded as part of an initialization process. Once this is done, the method continues with the steps referring to the actual head pose estimation, which will now be described partially referring to figs. 3-6. The steps are part of a pose updating loop which is shown in the right half of fig. 2.

[0052] Fig. 3 shows an initial image frame In recorded by the camera 2 and provided to the processing device 3, which may be identical to the image frame I0. According to one step of the method performed by the processing device 3, a plurality of salient points S are identified within the region of interest 30 and selected (see also fig. 6). Such salient points S are located in textured regions of the initial image frame In and may be corners of an eye, a mouth, a nose, an ear or the like. In order to identify the salient points S, a suitable algorithm like FAST may be used. The salient points S are represented by initial 2D coordinates pi in the image frame In. A weight is assigned to each salient point S which depends on the distance of the salient point S from a border 31 of the region of interest 30. The closer the respective salient point S is to the border 31, the lower its weight. Alternatively or additionally, the weight could depend on the number of image frames In, In+1 (or the number of pose updating loops, respectively) in which a certain salient point S has been identified. It is possible that salient points S with the lowest weight are not selected, but discarded as being (rather) unreliable. This may serve to enhance the overall performance of the method. It should be noted that the region of interest 30 comprises, apart from a facial region 32, several non-facial regions, e.g. a neck region 33, a head top region 34, a head side region 35 etc. Identifying and selecting the salient points S is performed independently of the previously defined facial features F, i.e. they are treated separately from the facial features F. However, this does not exclude the possibility that at least one salient point S (e.g. a corner of an eye) may correspond to a facial feature F.

[0053] With the initial 2D coordinates p_i of the selected salient points S known, corresponding initial 3D coordinates P_i are determined. This is achieved by projecting the initial 2D coordinates onto a visible head surface 22 of the EHM 20, as illustrated in fig. 4. The visible head surface 22 is that part of a surface 21 of the EHM 20 that is considered to be visible for the camera 2. The initial 3D coordinates P_i may also be seen as the result of an intersection between a ray 40, starting at an optical centre C of the camera 2 and passing through the respective salient point S at the image plane 2.1, and the visible head surface 22 of the EHM 20. The equation of the ray 40 is defined as P = C + kV, with V being a vector parallel to the line that goes from the camera's optical centre C through p_i. The scalar parameter k is computed by solving the quadratic equation of the geometric model.
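The following sketch illustrates, under simplifying assumptions, how the quadratic equation for k could be solved: it assumes the ray origin C and direction V have already been transformed into a coordinate frame aligned with the ellipsoid axes (in practice by applying the inverse head pose), and it takes the smaller positive root, i.e. the intersection nearest the camera.

```python
import numpy as np

def ray_ellipsoid_intersection(C, V, center, axes):
    """Intersect the ray P(k) = C + k*V with an axis-aligned ellipsoid
    of given center and semi-axes (a, b, c). Substituting the ray into
    ((x-cx)/a)^2 + ((y-cy)/b)^2 + ((z-cz)/c)^2 = 1 yields a quadratic
    A*k^2 + B*k + D = 0 in the scalar k."""
    o = (C - center) / axes            # ray origin, axis-normalised
    d = V / axes                       # ray direction, axis-normalised
    A = np.dot(d, d)
    B = 2.0 * np.dot(o, d)
    D = np.dot(o, o) - 1.0
    disc = B * B - 4.0 * A * D
    if disc < 0:
        return None                    # ray misses the head model
    k = (-B - np.sqrt(disc)) / (2.0 * A)   # smaller root: near surface
    if k < 0:
        return None                    # intersection behind the camera
    return C + k * V                   # initial 3D coordinates P_i
```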

[0054] In another step, a 2D updated image frame I_{n+1} (see fig. 3), which has been recorded by the camera 2, is provided to the processing device 3, and at least some of the previously selected salient points S are identified within this updated image frame I_{n+1}, along with their updated 2D coordinates q_i. This identification may be performed using optical flow. While the flowchart in fig. 2 indicates that the identification within the updated image frame I_{n+1} is performed after determining the initial 3D coordinates P_i corresponding to the initial image frame I_n, the sequence of these steps may be inverted or they may be performed in parallel.
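An optical-flow-based identification could, for instance, use pyramidal Lucas-Kanade tracking as sketched below; the window size and pyramid depth are illustrative parameter choices, not values prescribed by the method.

```python
import numpy as np
import cv2

def track_salient_points(frame_n, frame_n1, points_2d):
    """Re-identify the previously selected salient points S in the
    updated image frame I_{n+1} using pyramidal Lucas-Kanade optical
    flow; returns the surviving points p_i, their updated coordinates
    q_i and the per-point tracking status."""
    pts = points_2d.reshape(-1, 1, 2).astype(np.float32)
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        frame_n, frame_n1, pts, None,
        winSize=(21, 21), maxLevel=3)
    ok = status.ravel().astype(bool)
    # Keep only the points that were successfully tracked.
    return pts[ok].reshape(-1, 2), new_pts[ok].reshape(-1, 2), ok
```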

[0055] In another step, the processing device 3 uses the updated 2D coordinates q_i and the initial 3D coordinates P_i to solve a perspective-n-point problem and thus to update the head pose. The head pose is computed by calculating updated 3D coordinates P'_i resulting from a translation t and rotation R, so that P'_i = R · P_i + t, and by minimising the error between the reprojection of the 3D features onto the image plane and their respective detected 2D features by means of an iterative approach. In other words, the processing device 3 calculates the head rotation R and translation t which minimise the error of

$$e = \sum_i \left\| q_i - \pi(R \cdot P_i + t) \right\|^2 \qquad \text{Eq. (1)}$$

with $\pi(P)$ denoting the perspective projection operator, where $\pi: \mathbb{R}^3 \to \mathbb{R}^2$, and i is the index of the i-th feature point.

[0056] In the definition of the error, it is also possible to take into account the weight associated with the respective salient point S, so that an error resulting from a salient point S with low weight contributes less to the total error. In this case, each of the summands in eq. (1) would have an individual weight factor. Applying the translation t and rotation R to the initial head pose yields the prediction head pose. A covariance is assigned to the prediction head pose, which may depend, among others, on the number of identified salient points S and/or the weights of the salient points S. Also, the covariance of the prediction head pose is set to be greater than the covariance of the initial head pose.
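A minimal sketch of such a weighted minimisation is given below, assuming SciPy's generic least-squares solver as the iterative approach (the disclosure does not prescribe a particular solver) and a Rodrigues-vector parameterisation of the rotation; each residual is scaled by the square root of its weight so that the squared summand of eq. (1) carries the weight factor.

```python
import numpy as np
import cv2
from scipy.optimize import least_squares

def weighted_residuals(pose, P3d, q2d, weights, K):
    """Residuals of the weighted form of Eq. (1): each salient point
    contributes sqrt(w_i) * (q_i - pi(R * P_i + t))."""
    rvec, tvec = pose[:3], pose[3:6]
    proj, _ = cv2.projectPoints(P3d, rvec, tvec, K, None)
    res = (q2d - proj.reshape(-1, 2)) * np.sqrt(weights)[:, None]
    return res.ravel()

def solve_weighted_pnp(P3d, q2d, weights, K, rvec0, tvec0):
    """Iteratively minimise the weighted reprojection error, starting
    from the initial head pose (rvec0, tvec0)."""
    x0 = np.concatenate([rvec0.ravel(), tvec0.ravel()])
    sol = least_squares(weighted_residuals, x0,
                        args=(P3d.astype(np.float64), q2d, weights, K))
    return sol.x[:3], sol.x[3:6]   # updated rotation and translation
```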

[0057] In another step, the region of interest 30 is updated. In this embodiment, the region of interest 30 is defined by the projection of the visible head surface 22 of the EHM 20 onto the image. The visible head surface 22 in turn is defined by the intersection of the head surface 21 with a boundary plane 24 (see figs. 4 and 6). The boundary plane 24 has a normal vector resulting from the cross product between a vector parallel to the X-axis of the camera 2 and a vector parallel to the centre axis 23 of the EHM 20. In other words, the boundary plane 24 is parallel to the X-axis and to the centre axis 23.

[0058] The updated region of interest 30 (see fig. 3) again comprises non-facial regions like the neck region 33, the head top region 34, the head side region 35 etc. In the next loop, salient points from at least one of these non-facial regions 33-35 may be selected. For example, the head side region 35 is now closer to the centre of the region of interest 30, making it likely that a salient point from this region will be selected, e.g. a feature of an ear.
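The construction of the boundary plane 24 of paragraph [0057] could be sketched as follows. The sketch additionally assumes that the plane passes through the centre of the EHM 20 and a particular sign convention for the visible side; both are assumptions made for illustration, not statements of the disclosure.

```python
import numpy as np

def boundary_plane(camera_x_axis, centre_axis, ehm_centre):
    """Boundary plane 24: its normal is the cross product of a vector
    parallel to the camera X-axis and a vector parallel to the centre
    axis 23 of the EHM; here it is assumed to contain the EHM centre."""
    n = np.cross(camera_x_axis, centre_axis)
    n = n / np.linalg.norm(n)          # unit normal of the plane
    d = -np.dot(n, ehm_centre)         # plane equation: n.x + d = 0
    return n, d

def on_visible_surface(point, n, d):
    """A point of the head surface 21 is taken to belong to the visible
    head surface 22 if it lies on the camera side of the boundary plane
    (sign convention assumed for this sketch)."""
    return np.dot(n, point) + d >= 0.0
```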

[0059] In another step, it is attempted to identify facial features F in the updated image frame I_{n+1} and to determine a correction head pose corresponding to facial 2D coordinates of the identified facial features F. If a facial feature is identified, it can be characterised by facial 2D coordinates within the updated image frame I_{n+1}. Detection of the facial features F can e.g. be based on the approach proposed in V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression trees", in International Conference on Computer Vision and Pattern Recognition (CVPR), pages 1867-1874, IEEE, 2014. The correction head pose can then be determined based on the facial 2D coordinates (and the predefined positions of the facial features on the head 10) by solving a perspective-n-point problem equivalent to eq. (1) or eq. (4).
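The cited Kazemi-Sullivan ensemble-of-regression-trees approach is, for example, implemented by dlib's shape predictor, which the following non-limiting sketch uses; the model file name is a placeholder for the separately distributed 68-landmark model, and returning None when no face is found mirrors the case where no correction head pose can be determined.

```python
import dlib

# Placeholder file name; the landmark model is distributed separately.
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

def detect_facial_features(gray_image):
    """Attempt to identify the facial features F in the updated image
    frame; returns their facial 2D coordinates, or None if no face is
    found (in which case no correction head pose can be determined)."""
    faces = detector(gray_image, 0)
    if len(faces) == 0:
        return None
    shape = predictor(gray_image, faces[0])
    return [(p.x, p.y) for p in shape.parts()]
```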

[0060] The difference between the salient points S, which are selected in each pose updating loop, and the facial features F, which are defined before entering the first pose updating loop, becomes evident from figs. 5 and 6, which refer to a situation where the head 10 is turned sideways with respect to the camera 2, thus presenting a profile view. Although some of the facial features F will not be visible for the camera 2 at all and others may be visible but hard to identify, the set of facial features F remains unchanged. In the situation depicted in fig. 5, the correction head pose may be assigned a relatively large covariance, or it may even be impossible to determine the correction head pose. The salient points S shown in fig. 6, however, are selected in each pose updating loop, wherefore, by definition, all of them are visible for the camera 2. As a result of the profile view, a major part of the selected salient points S is not located in the facial region but on the side of the head 10.

[0061 ] If a sufficient number of facial features F with their corresponding facial 2D coordinates is identified, the correction head pose can be successfully determined. A covariance is assigned to the correction head pose, which can e.g. depend on the number of facial features F that were successfully identified and/or the head pose itself (because a frontal head pose, for example, can be determined with greater accuracy than a profile view head pose). However, it is possible that no facial features or an insufficient number of facial features have been identified, which makes determination of the correction head pose impossible.

[0062] It should be noted that the steps pertaining to the facial features and the steps pertaining to the salient points do not have to be performed in the sequence indicated by the flowchart of fig. 2, but could also be performed in another sequence or simultaneously. This is due to the fact that the facial features and the salient points are treated in a completely independent way.

[0063] If a correction head pose has been determined successfully, the prediction head pose is corrected with the correction head pose using a Kalman filter. By way of example, the head rotation R can be represented using a quaternion p = [p_x, p_y, p_z, p_w], where p_w is the scalar part and {p_x, p_y, p_z} the vector part. The head translation t in homogeneous coordinates can be denoted as t = [t_x, t_y, t_z, 1]. Then it is possible to concatenate the rotation and translation of the head into an 8x1 state vector x of the Kalman filter, which comprises the overall estimated pose from the first frame, as in x = [p^T, t^T]^T. This state vector x could be further extended by an 8x1 velocity vector, which would allow the output to be filtered based on a linear motion model.
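Assembling this state vector could, for instance, look as follows; the quaternion-from-matrix conversion shown is the standard one and, for brevity, this sketch ignores the degenerate case where the scalar part p_w approaches zero.

```python
import numpy as np

def make_state_vector(R, t):
    """Concatenate the head rotation (quaternion, scalar part last) and
    the homogeneous head translation into the 8x1 Kalman state
    x = [p^T, t^T]^T."""
    # Standard quaternion extraction (valid while trace(R) > -1).
    p_w = 0.5 * np.sqrt(max(0.0, 1.0 + R[0, 0] + R[1, 1] + R[2, 2]))
    p_x = (R[2, 1] - R[1, 2]) / (4.0 * p_w)
    p_y = (R[0, 2] - R[2, 0]) / (4.0 * p_w)
    p_z = (R[1, 0] - R[0, 1]) / (4.0 * p_w)
    p = np.array([p_x, p_y, p_z, p_w])        # [p_x, p_y, p_z, p_w]
    t_h = np.array([t[0], t[1], t[2], 1.0])   # homogeneous translation
    return np.concatenate([p, t_h])           # 8x1 state vector x
```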

[0064] A linear process model, and hence a linear Kalman filter, can be defined, in which the transition between the initial head pose and the prediction head pose is represented by a state transition matrix A_k (see Eq. (5)). This state transition matrix A_k has to be updated in each pose updating loop.

$$x_k = A_k \, x_{k-1} \qquad \text{Eq. (5)}$$

[0065] Herein, the indexes k, k-1 denote different points in time or different image frames. A_k is an 8x8 matrix given by Eq. (6)

$$A_k = \begin{bmatrix} A_p & 0 \\ 0 & A_t \end{bmatrix} \qquad \text{Eq. (6)}$$

with a normally distributed process noise with covariance Q. This covariance Q is normally assumed to be constant, e.g. a diagonal matrix in which the matrix elements are the expected mean square errors between the estimation and the real state. The covariance $P_k^-$ of the prediction state $\hat{x}_k^-$, corresponding to the prediction head pose, is updated with Q based on the covariance $P_{k-1}$ of the correction state $\hat{x}_{k-1}$:

$$P_k^- = A_k P_{k-1} A_k^T + Q \qquad \text{Eq. (7)}$$

[0066] A_p, the state transition sub-matrix to project the rotation ahead, is defined by the rotation R_SP, represented by the quaternion $p_{SP} = [p_{SP_x}, p_{SP_y}, p_{SP_z}, p_{SP_w}]$ and resulting from the determination of the prediction head pose, i.e.

$$A_p = \begin{bmatrix} p_{SP_w} & -p_{SP_z} & p_{SP_y} & p_{SP_x} \\ p_{SP_z} & p_{SP_w} & -p_{SP_x} & p_{SP_y} \\ -p_{SP_y} & p_{SP_x} & p_{SP_w} & p_{SP_z} \\ -p_{SP_x} & -p_{SP_y} & -p_{SP_z} & p_{SP_w} \end{bmatrix} \qquad \text{Eq. (8)}$$

while the state transition sub-matrix A_t to update the translation is given by the homogeneous transformation corresponding to the prediction head pose, i.e.

$$A_t = \begin{bmatrix} R_{SP} & t_{SP} \\ 0^T & 1 \end{bmatrix}$$
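The prediction step built from these sub-matrices could be sketched as follows; A_t is passed in under the homogeneous-transform form reconstructed above, which is itself an assumption of this sketch.

```python
import numpy as np

def quat_left_matrix(p_sp):
    """4x4 quaternion product matrix A_p of Eq. (8) (scalar part last),
    so that A_p @ p implements the rotation update p_k = p_SP * p_{k-1}."""
    x, y, z, w = p_sp
    return np.array([[ w, -z,  y, x],
                     [ z,  w, -x, y],
                     [-y,  x,  w, z],
                     [-x, -y, -z, w]])

def predict(x_prev, P_prev, p_sp, A_t, Q):
    """Kalman prediction: x_k^- = A_k x_{k-1} (Eq. (5)) and
    P_k^- = A_k P_{k-1} A_k^T + Q (Eq. (7)), with A_k = diag(A_p, A_t)."""
    A_k = np.zeros((8, 8))
    A_k[:4, :4] = quat_left_matrix(p_sp)   # rotation sub-matrix A_p
    A_k[4:, 4:] = A_t                      # translation sub-matrix A_t
    x_pred = A_k @ x_prev
    P_pred = A_k @ P_prev @ A_k.T + Q
    return x_pred, P_pred
```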

[0067] The measurement model is given by Eq. (3):

$$z_k = H \, x_k + v_k \qquad \text{Eq. (3)}$$

[0068] where z_k corresponds to the new measurement at time k, or rather the correction head pose determined by identification of the facial features F. H is a 7x8 matrix, given by H = [I_7 0], which relates the current state to the measurement, and v_k denotes the measurement noise in the observation. The covariance of the measurement noise is updated in each pose updating loop according to the expected accuracy of the correction head pose. In addition, the covariance of the process noise may also be updated, e.g. according to the number of identified salient points S or the reprojection error.

[0069] Subsequently, the state estimate is updated in a correction step using Eq. (2),

$$\hat{x}_k = \hat{x}_k^- + K_k \left( z_k - H \hat{x}_k^- \right) \qquad \text{Eq. (2)}$$

where K_k denotes the Kalman gain and $\hat{x}_k$ the state estimate corresponding to a corrected head pose. The gain depends on the ratio of the covariance of the correction head pose to the covariance of the prediction head pose. In other words, if the covariance of the correction head pose is low with respect to the covariance of the prediction head pose (e.g. in a case where the user faces the camera 2, wherefore all facial features have been detected), the Kalman gain is rather high and the estimate is pulled towards the correction head pose. In a case where the covariance of the correction head pose is high with respect to the covariance of the prediction head pose (e.g. in a case where the camera 2 faces the profile of a user, wherefore only about half of the facial features have been detected), the Kalman gain is rather low.
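A minimal sketch of this correction step is shown below, assuming R_meas is the measurement noise covariance updated per loop as in paragraph [0068]; the quaternion re-normalisation at the end is an implementation detail added for the sketch.

```python
import numpy as np

def correct(x_pred, P_pred, z, R_meas):
    """Kalman correction with the correction head pose z (7x1:
    quaternion plus translation x, y, z). H = [I_7 0] selects the first
    seven state components, dropping the homogeneous 1."""
    H = np.hstack([np.eye(7), np.zeros((7, 1))])   # 7x8 matrix H
    S = H @ P_pred @ H.T + R_meas                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain K_k
    x_corr = x_pred + K @ (z - H @ x_pred)         # Eq. (2)
    P_corr = (np.eye(8) - K @ H) @ P_pred          # estimate covariance
    # Re-normalise the quaternion part so it remains a unit rotation.
    x_corr[:4] /= np.linalg.norm(x_corr[:4])
    return x_corr, P_corr
```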

[0070] The head pose represented by $\hat{x}_k$ is then used as the initial head pose for the next pose updating loop. Its covariance is set to a value that is equal to or less than both the covariance of the prediction head pose and that of the correction head pose.

[0071] If no correction head pose could be determined successfully, the prediction head pose is used as the initial head pose for the next pose updating loop. Of course, the covariance of this initial head pose is set to be the covariance of the prediction head pose.