
Title:
AVATAR PUPPETING IN VIRTUAL OR AUGMENTED REALITY
Document Type and Number:
WIPO Patent Application WO/2021/252343
Kind Code:
A1
Abstract:
Puppeting a representation of a host user as an avatar in a virtual or augmented reality environment includes continually tracking only the user's head and wrist positions and orientations in a 3D space; converting the tracked head and wrist positions and orientations into corresponding head and wrist positions and orientations of the avatar; and transmitting only the avatar head and wrist positions and orientations to a client. The method further includes using an inverse kinematics algorithm to determine a kinematic chain from at least one wrist to the head of the avatar, with the head and wrist positions and orientations as constraints to the inverse kinematics algorithm. Changes in the head and wrist positions and orientations of the user cause corresponding changes in head and wrist positions and orientations of the avatar.

Inventors:
POSOKHOW BRUNO ANDRÉ (US)
Application Number:
PCT/US2021/036146
Publication Date:
December 16, 2021
Filing Date:
June 07, 2021
Assignee:
POSOKHOW BRUNO ANDRE (US)
International Classes:
G06T13/00
Foreign References:
US20160046023A12016-02-18
US20130028476A12013-01-31
US20110228976A12011-09-22
US9381426B12016-07-05
Attorney, Agent or Firm:
STERN, Ronald (US)
Claims:
CLAIMS

1. A method of generating and puppeting a representation of a user of a host device as an avatar in a virtual or augmented reality environment on a client device, the method comprising: detecting position and orientation of a head and a wrist of the host device user in a 3D space, wherein the head position and orientation are determined by a video camera coupled to the host device that captures a face of the user of the host device, wherein the host device runs a machine learning algorithm to transform the head position and orientation into data to be transmitted to the client device over a network, and wherein the wrist position and orientation are determined by one or more of fiducial markers, visual detection, or Inertial Measurement Units (IMU) tracked by controllers integrated with a trackpad or buttons to produce defined hand poses; transcribing the head and wrist positions and orientations into corresponding head and wrist positions and orientations of the avatar in the virtual or augmented reality environment on a client device; and applying the head and wrist positions and orientations as constraints to an inverse kinematics algorithm to determine a position and orientation of segments of a simplified skeleton of the host user such as at least one of hands, feet, head, elbows, knees or torso and recreate an avatar body posture that mimics the host user.

2. The method of claim 1, further comprising transmitting only the avatar head and wrist positions and orientations to the client device.

3. A host method of puppeting a representation of a host user as an avatar in a virtual or augmented reality environment, the method comprising: continually tracking only the user's head and wrist positions and orientations in a 3D space; converting the tracked head and wrist positions and orientations into corresponding head and wrist positions and orientations of the avatar; transmitting only the avatar head and wrist positions and orientations to a client; and with the head and wrist positions and orientations as constraints to an inverse kinematics algorithm, using the inverse kinematics algorithm to determine a kinematic chain from at least one wrist to the head of the avatar, whereby changes in the head and wrist positions and orientations of the user cause corresponding changes in head and wrist positions and orientations of the avatar.

4. The method of claim 3, further comprising selecting a hand pose for the avatar from a limited number of hand poses, and updating the avatar with the selected hand pose.

5. The method of claim 3, further comprising programmatically determining at least one of torso and lower limbs of the avatar.

6. The method of claim 3, further comprising updating a full representation of the avatar, including programmatically updating a torso and lower limbs of the avatar.

7. The method of claim 3, further comprising tracking facial expressions of the user; converting the user's facial expressions to facial expressions of the avatar; and also transmitting numerical values representing the facial expressions to the client.

8. The method of claim 3, wherein tracking the wrist position and orientation includes capturing images of fiducial markers coupled to the user’s wrists; and processing the images to determine the wrist positions and orientations of the user.

9. The method of claim 8, wherein the fiducial markers include different patterns that are coupled to a handheld device and wherein the tracking includes capturing images of the fiducial markers with a video camera as the user holds and manipulates the device; wherein these different patterns are positioned in such a way that at least one pattern is in sight of the video camera at all times.

10. The method of claim 9, wherein a multi-face object is attached to the device, wherein the different patterns are on different faces of the multi-face object.

11. The method of claim 9, wherein the device includes position and orientation sensors; and wherein the processing includes processing both sensor readings and the images to determine the wrist position and orientation.

12. The method of claim 9, wherein the controller is configured to be held in a tight and fixed position; and wherein the wrist position and orientation relative to the detected fiducial markers is deduced.

13. The method of claim 9, wherein the device includes a controller having a number of buttons that are assigned to different hand poses for the avatar; and wherein hand poses are selected via the buttons.

14. The method of claim 9, wherein the device includes a game controller.

15. The method of claim 3, further comprising rendering an object in the environment and programmatically updating a hand pose of the avatar based on the avatar's interaction with the object.

16. The method of claim 3, further comprising hosting the virtual or augmented reality environment in which the avatar is puppeted.

17. A host system comprising: a computer system programmed to host a virtual or augmented reality environment including an avatar of a host user; and a video camera for continually capturing images of the host user, wherein the computer system is further programmed to: process images from the camera to determine head and wrist positions and orientations of the host user, convert the host user's head and wrist positions and orientations into corresponding head and wrist positions and orientations of the avatar, transmit only the avatar head and wrist positions and orientations to a client, and with the head and wrist positions and orientations as constraints to an inverse kinematics algorithm, determine a kinematic chain from at least one wrist to the head of the avatar.

18. The system of claim 17, further comprising a handheld tracking device and fiducial markers carried by the device; wherein these different patterns are positioned in such a way that at least one pattern is in sight of the video camera; wherein the controller is configured to be held in a fixed position to enable the computer system to deduce the wrist position and orientation relative to the detected fiducial markers.

19. The system of claim 18, wherein the device includes orientation sensors; and wherein the computer system is programmed to process both sensor readings and the images to determine the wrist position and orientation.

20. The system of claim 18, wherein the device includes a number of buttons that are assigned to different hand poses for the avatar; and wherein hand poses are communicated to the computer via the buttons.

21. The system of claim 17, wherein the virtual or augmented reality environment comprises at least one of: a virtual trade show, a virtual product demonstration wherein the host user is a salesperson, a training session wherein the host user is a trainer, or a virtual psychotherapy session wherein the host user is a therapist.

22. The system of claim 21, wherein the virtual or augmented reality environment comprises an empathy training session, wherein: the host user takes on a role of a virtual patient visit and a client user takes on a role of a doctor during a first recorded session, and the host user takes on the role of doctor while the client user takes on the role of patient during a second recorded session, wherein the second recorded session is subsequent to the first recorded session.

23. The system of claim 22, wherein the empathy training session comprises transmitting modified body features or demographics (e.g., color of skin, weight or other physical features) to the client user to experience interactions between the client user and the modified avatar.

24. A method of representing a host user as an avatar in a virtual or augmented reality environment, the method comprising: receiving, at a client, position and orientation parameters of the avatar head and wrist; and with the position and orientation parameters of the avatar head and wrist as constraints to an inverse kinematics algorithm, the client using the inverse kinematics algorithm to determine a kinematic chain from at least one wrist to the head of the avatar, whereby changes in head and wrist positions and orientations of the user cause corresponding changes in head and wrist positions and orientations of the avatar displayed by the client.

25. The method of claim 24, wherein using the inverse kinematics algorithm to determine the kinematic chain comprises determining a position and orientation of segments of a simplified skeleton of the user such as at least one of hands, feet, head, elbows, knees or torso and recreate an avatar body posture that mimics the user.

26. The method of claim 25, wherein responsive to determining the simplified skeleton of the user, transmitting a video playback of the avatar for display on the client device.

Description:
AVATAR PUPPETING IN VIRTUAL OR AUGMENTED REALITY

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application No. 63/036,157, filed June 8, 2020, the contents of which are incorporated herein by reference.

BACKGROUND

[0002] An avatar in virtual or augmented reality may be puppeted by tracking physical body motions, including head position and rotation, hand gestures, and facial expressions of a user. Tracking the motion of different body joints and converting those motions into avatar motions is computationally intensive. For certain situations, the tracking also involves the use of high-end hardware. Bandwidth usage and latency are typically high, as information regarding the position and orientation of many different joints is sent to one or more clients.

BRIEF DESCRIPTION OF THE DRAWINGS

[0003] FIG. 1 is an illustration of a method of avatar puppeting in virtual or augmented reality.

[0004] FIG. 2 is an illustration of a method of using the method of FIG. 1 in a virtual reality environment.

[0005] FIG. 3 is an illustration of a particular embodiment of a method of avatar puppeting in virtual or augmented reality.

[0006] FIG. 4 is an illustration of a host system for hosting a virtual or augmented reality environment.

[0007] FIG. 5 is an illustration of a device for tracking wrist position and orientation.

[0008] FIG. 6 is an illustration of a particular embodiment of a virtual or augmented reality system.

[0009] FIG. 7 is an illustration of a method of recognizing facial expressions.

[0010] FIG. 8 is an illustration of a method of performing inverse kinematics with respect to an avatar.

DETAILED DESCRIPTION

[0011] Because sending an entire detailed avatar through a network is extremely resource intensive and leads to bandwidth and latency issues, there is a need for a host system that sends only a very reduced set of information to reconstruct the avatar at the client side. As a result, the bandwidth usage would be drastically lower and the avatar rendering speed improved. Additionally, it would be advantageous to maintain a realistic body language of the avatar during real-time communication between the host and client even though only a reduced set of body motion parameters is transmitted to the client. The embodiments described herein achieve such a solution.

[0012] Reference is made to FIG. 1, which illustrates a method of avatar puppeting in virtual or augmented reality. Certain functions of the method are performed on a host side by a host system (the "host method"). Certain other functions are performed on a client side by a client device (the "client method"). In some embodiments, functions described on the host side and/or the client side may take place in the cloud.

[0013] On the host side, the host method includes hosting a virtual or augmented reality environment (block 100). Virtual reality ("VR") is a computer-generated simulation of a three-dimensional image or environment that can be interacted with in a seemingly real or physical way. The VR environment may be an immersive experience that can be similar to or completely different from the real world. Augmented reality ("AR") is an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, sometimes across multiple sensory modalities, including visual, auditory, haptic, somatosensory and olfactory. As used herein, the term "environment" may refer to a VR environment or an AR environment.

[0014] The environment includes an avatar. An avatar generally refers to a two- or three-dimensional icon or figure representing a particular person. In the method of FIG. 1, the avatar represents a host user or a character or an alter ego of that host user.

[0015] As used herein, puppeting of an avatar refers to tracking the motion of different body joints of a host user and converting those motions into avatar motions. The host user has a large number of joints or segments that can be tracked, including but not limited to, hands, wrists, elbows, shoulders, spine, hips, knees, ankles and feet.

[0016] The host method of FIG. 1 further includes continually tracking only a small subset of those joints and segments: the head and wrist positions and orientations of the host user in a 3D space (block 110). Wrist position and orientation as used herein refers to the position and orientation of a single wrist or the positions and orientations of both wrists. Head position and orientation as used herein refers to the outline or outer boundaries of the head and does not include facial expressions and other details of the face.

[0017] The host method of FIG. 1 further includes converting the tracked head and wrist positions and orientations of the host user into corresponding head and wrist position and orientation parameters of the avatar (block 120). In other embodiments, the tracked head and wrist positions and orientations of the host user may be converted into the avatar head and wrist position and orientation parameters by the client.

[0018] In some embodiments, the host method of FIG. 1 further includes transmitting only the avatar head and wrist positions and orientations to a client (block 130). This is the only tracked information that is transmitted to the client side. In other embodiments, the head and wrist positions and orientations are among other parameters transmitted to the client side. For example, the host device may send the following data to the client device: host voice, host facial expression, host head orientation, host eye gaze direction, and/or host gestures, to name a few.

[0019] The host method of FIG. 1 further includes using an inverse kinematics algorithm to determine a kinematic chain from at least one wrist to the head of the avatar (block 140). In computer animation, an inverse kinematics algorithm is a mathematical algorithm that can take as input a series of mechanical constraints for the position and/or orientation of various body parts. These mechanical constraints associated with a degree of enforcement are the input of the algorithm. These inputs could come from video processing and tracking of body parts but could also come from logical definition (e.g. feet are above the ground). The output of the algorithm is a human body pose that satisfies these mechanical constraints to the best possible result after a series of iterations. At block 140 of FIG. 1, the head and wrist positions and orientations are provided as constraints to the inverse kinematics algorithm, and the inverse kinematics algorithm provides a kinematic chain from avatar wrist to avatar head.

[0020] The host method of FIG. 1 further includes completing a partial or full representation of the avatar (block 150). The host user may select a hand pose for the avatar from a limited number of hand poses. For example, the host user may select a hand pose from two available hand poses (e.g., thumbs up, and finger pointing). The avatar is updated with the selected hand pose. A numerical value indicating the selected hand pose may be transmitted to the client side (block 160).
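
As an illustrative sketch only (the application does not specify a wire format), the reduced per-frame payload of blocks 130 and 160 might be encoded as follows; the field names and the JSON encoding are assumptions:

```python
import json

# Illustrative set of predefined hand poses; the application mentions e.g. thumbs up
# and finger pointing as examples of a limited pose set.
HAND_POSES = ("relaxed", "thumbs_up", "finger_point")

def make_avatar_packet(head_pos, head_quat, wrist_pos, wrist_quat, hand_pose_index):
    """Encode only head/wrist poses plus a single hand-pose index (hypothetical format)."""
    return json.dumps({
        "head":  {"p": head_pos, "q": head_quat},
        "wrist": {"p": wrist_pos, "q": wrist_quat},
        "hand_pose": hand_pose_index,  # one integer instead of per-finger joint data
    })

# Example frame: head at eye height, one wrist raised, thumbs-up pose selected.
packet = make_avatar_packet([0.0, 1.6, 0.0], [0.0, 0.0, 0.0, 1.0],
                            [0.3, 1.2, 0.4], [0.0, 0.0, 0.0, 1.0],
                            HAND_POSES.index("thumbs_up"))
```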

[0021] The selected hand pose is applied to the avatar and the hand of the avatar may be modified to interact with an object in the environment. For example, a hand holding a door handle will need to follow the constraint of the door handle, which has a known position and orientation. These constraints are applied to the position and orientation of the hand.

[0022] The torso or lower limbs or both of the avatar may be created and updated programmatically by the client or host system. For instance, translation of the head results in translation of the torso, and the feet are assigned a logical definition (e.g., always beneath the torso). The head position and orientation may be applied to the inverse kinematics algorithm to determine the position and orientation of the torso and the lower limbs. The client or host system then updates and translates the torso and the lower limbs of the avatar accordingly.

[0023] On the client side, the client method for each client includes receiving the tracking information from the host side (block 170). In the method of FIG. 1, this tracking information consists of the head and wrist positions and orientations.

[0024] Each client then uses its own inverse kinematics algorithm to determine a wrist-to-head kinematic chain of the avatar (block 180). The received head and wrist positions and orientations are provided as constraints to the inverse kinematics algorithm.

[0025] Each client then receives the selected hand pose and completes a partial or full representation of the avatar (block 190). The received hand pose is applied to the avatar and the hand of the avatar may be modified to interact with an object in the environment. Additionally, the torso and/or limbs are updated.

[0026] Changes in the head and wrist positions and orientations of the host user cause the same corresponding changes in head and wrist positions and orientations of the avatar at both the host and client sides. However, an avatar at the client side may appear differently than the corresponding avatar at the host side. For instance, the host system might use a different inverse kinematics algorithm than a client, or the host system might use a different program for creating the torso and lower limbs. However, the hand and head positions and orientations on both sides will be consistent and synchronized.

[0027] Additional reference is made to FIG. 2. There are certain environments in which hand and head synchronization are advantageous. Consider the example of a virtual showroom. The host system creates a virtual reality environment having an avatar and multiple objects (block 200). The avatar represents a virtual salesman, and it is puppeted by the host user. The objects represent different items for sale.

[0028] The host user puppets the avatar (virtual salesman) to interact with the objects (block 210). The virtual salesman points to one of the objects, and that object is moved forward, next to the virtual salesman. The virtual salesman then touches certain portions of the selected object to display certain features, and the object is animated to display those selected features. The avatar's hand may be updated programmatically to touch the surface of the selected object.

[0029] As part of the example, say the objects are different models of washing machines. The virtual salesman is puppeted to point to one of the models at the back of the showroom. The selected model is moved to the front of the showroom, next to the salesman. The salesman is then puppeted to point to various features of the selected model. The virtual salesman may be puppeted to open a door of the selected model, spin a dial of the selected model, depress a button of the selected model, etc. The host system animates the selected model to show a door being opened, a dial being spun, a button being depressed. The selected model can also be updated to display features resulting from these actions (e.g., a drum spinning).

[0030] At the client side, a client observes the virtual salesman, and can communicate with the host user to select a particular object and demonstrate certain features of the selected object. Communications between the host user and a client may be performed by means such as Voice Over IP (VoIP), text messaging, chat or videoconference, to name a few.

[0031] In this example, the virtual salesman is puppeted and is displayed and demonstrated by tracking only head and wrist positions and orientations of the host user. The wrist controls the selection, and the head is used to complete the pose of the virtual salesman.

[0032] The method of FIG. 1 offers several advantages over conventional approaches of avatar puppeting. The method of FIG. 1 is far less computationally intensive, as it allows for only wrist and head positions and orientations to be tracked. This results in a reduction in the amount of data that is transmitted to the client side, and it also results in an increase in rendering speed on both the host and client sides. Yet the method of FIG. 1 still provides a quality representation of an avatar at both the host and client sides where the avatar is rendered in a realistic manner. In other words, the embodiments described herein allow for gesture-based control of the avatar on commodity hardware using vision-only camera information and are capable of rendering fine movements to convey a realistic body language during real-time communication between users (e.g., host and client users or multiple hosts and clients).

[0033] The combination of the reduction in tracking data and parallel processing on the host and client sides also results in better synchronization between the avatar on the host side and the avatar(s) on the client side. Blocks 140 and 150 on the host side may be performed in parallel with blocks 180 and 190 on the client side. Depending on transmission rates from the host system to a client, and processing and graphics capability of a client, the avatar rendered by the host system may be fully synchronized with the avatar rendered by a client.

[0034] The selection of hand poses offers additional advantages. It is computationally inexpensive (when compared to tracking a hand) and very reliable. For certain environments, avatar quality does not suffer.

[0035] Tracking only head and wrist position and orientation avoids the use of high-end tracking equipment and instead enables the use of low-cost, yet accurate tracking equipment. An example of such low-cost equipment for tracking wrist position and orientation will be described below.

[0036] The environment is not limited to a single avatar. An environment may contain multiple avatars that are puppeted by multiple host users. The host system may transmit head and wrist positions and orientations for each avatar. The method of FIG. 1 is especially advantageous because the improvements in data bandwidth, rendering speed and synchronization are realized for each avatar.

[0037] Reference is now made to FIG. 3, which illustrates a particular embodiment of a host method of avatar puppeting in a virtual or augmented reality environment. The method on the host side may be performed by the host system 400 of FIG. 4.

[0038] Position and orientation of a head and a wrist of the host user are detected in a 3D space (block 300). The head position and orientation are determined by a video camera coupled to a host device that captures a face of the user of the host device. Specifically, the host device may capture video frames of the host user via a camera sensor and process one or more of the video frames to detect the head of the host user. The host device may estimate the position and orientation of the head of the host user (in 3D coordinates) based on the processed video frames. In some embodiments, the host device runs a machine learning model that transforms the estimated head position and orientation into data to be transmitted to the client device over a network. In other embodiments, the transformation of the estimated position and orientation of the head is implemented remotely in a cloud computing system and/or at the client side.
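
A minimal sketch of one common way to estimate head position and orientation from detected 2D facial landmarks, assuming OpenCV; the 3D reference points and the landmark detector that supplies the matching 2D points are assumptions for illustration, not the machine learning model described above:

```python
import cv2
import numpy as np

# Generic 3D reference positions (in mm) of a few facial landmarks in a head-centered frame.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),         # nose tip
    (0.0, -63.6, -12.5),     # chin
    (-43.3, 32.7, -26.0),    # left eye outer corner
    (43.3, 32.7, -26.0),     # right eye outer corner
    (-28.9, -28.9, -24.1),   # left mouth corner
    (28.9, -28.9, -24.1),    # right mouth corner
])

def head_pose(image_points, frame_size):
    """Estimate head rotation and translation from six detected 2D landmarks.

    image_points: (6, 2) float array of pixel coordinates matching MODEL_POINTS
    frame_size:   (height, width) of the video frame
    """
    h, w = frame_size
    focal = float(w)  # rough focal-length approximation in pixels
    camera_matrix = np.array([[focal, 0, w / 2.0],
                              [0, focal, h / 2.0],
                              [0, 0, 1.0]])
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS, image_points, camera_matrix,
                                  np.zeros(4), flags=cv2.SOLVEPNP_ITERATIVE)
    return rvec, tvec  # head orientation (Rodrigues vector) and position
```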

[0039] Position and orientation of a wrist of the host device user are detected in the 3D space (block 310). The wrist position and orientation may be determined by one or more of fiducial markers, visual detection, or Inertial Measurement Units (IMU) tracked by controllers integrated with a trackpad or buttons to select defined hand poses.

[0040] The visual detection method comprises processing each video frame in steps. The first step is the segmentation of the hands by removing the background. The user is asked to bring their hands out of, then into, the field of view of the camera with little else moving. By removing the part of the frame that has not changed between the two pictures, it is possible to extract the part of the image that is specific to the hands. By asking the user to present an open palm in front of the camera, it is possible to process the region of the image where the hands were segmented and to apply a deep learning inference model to detect the fingers of the hand.

[0041] The user may then be asked to place the open palm of the hand at a specific distance from the camera to calibrate the size of the features of the hand (e.g., distance from the wrist to the thumb), which will be used subsequently to calculate the distance between the video camera and the hand on subsequent frames by comparing with the calibrated frame. As a result, it is then possible to detect the position of the wrists in 3D in the reference frame of the video camera in real time. For such a detection methodology to be employed, the system requires video frames as inputs and outputs the 3D position and orientation of the wrists in the reference frame of the video camera.
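
A simplified sketch of the frame-differencing segmentation and the calibration-based distance estimate described above, assuming OpenCV; thresholds and the hand-span measure are illustrative:

```python
import cv2
import numpy as np

def segment_hands(frame_without_hands, frame_with_hands, threshold=30):
    """Frame differencing: keep only pixels that changed when the hands entered the view."""
    diff = cv2.absdiff(frame_with_hands, frame_without_hands)
    gray = cv2.cvtColor(diff, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, threshold, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))  # remove speckle
    return cv2.bitwise_and(frame_with_hands, frame_with_hands, mask=mask)

def hand_distance(calibration_span_px, calibration_distance_m, current_span_px):
    """Pinhole scaling: the apparent wrist-to-thumb span shrinks in proportion to distance."""
    return calibration_distance_m * calibration_span_px / current_span_px

# e.g. the hand spanned 120 px at the 0.5 m calibration distance and now spans 80 px:
distance_m = hand_distance(120.0, 0.5, 80.0)  # -> 0.75 m from the camera
```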

[0042] In some embodiments, the use of fiducials can make the detection of the position and orientation of the wrists more reliable. For example, the host user may hold a handheld device in each hand with several fiducial markers on it (e.g., ArUco markers). These markers may be positioned such that there is one fiducial marker in sight of the video camera of the host device at all times. These fiducial markers can be detected in the video frame processing to calculate the position and orientation of one or several fiducial markers in real time. Each fiducial marker is unique for the handheld device. To clarify, for an individual, for example, the fiducial markers of the left and right hands are unique. As a result, the technology is able to reliably calculate the position and orientation of the handheld device in real time and, by deduction, the position and orientation of each wrist of the host user. In such an embodiment, the inputs to the system are numbered ArUco markers attached to a stick and a virtual model of the stick with the markers, while the output is a 3D position of the stick and therefore of the wrists. It will be appreciated that the abovementioned process is advantageous because it is computationally inexpensive.
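
For illustration, marker detection and pose estimation of this kind might look like the following, assuming the legacy cv2.aruco module from opencv-contrib-python (OpenCV versions before 4.7; the newer ArucoDetector class API differs) and placeholder calibration values:

```python
import cv2
import numpy as np

# Placeholder intrinsics; a real system would use the host device's calibrated values.
camera_matrix = np.array([[800.0, 0.0, 320.0],
                          [0.0, 800.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)
MARKER_SIZE_M = 0.03  # physical side length of each printed marker, in meters

aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
detector_params = cv2.aruco.DetectorParameters_create()

def marker_poses(frame_bgr):
    """Return {marker_id: (rvec, tvec)} for every fiducial visible in the frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict, parameters=detector_params)
    if ids is None:
        return {}
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, MARKER_SIZE_M, camera_matrix, dist_coeffs)
    # A fixed per-marker offset, taken from the known geometry of the multi-face holder,
    # would then map each detected marker pose to the corresponding wrist pose.
    return {int(i): (rvecs[k], tvecs[k]) for k, i in enumerate(ids.flatten())}
```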

[0043] It is further noted that the development of the Virtual Reality industry has led to the creation of handheld controllers that integrate an Inertial Measurement Unit (IMU) and can be tracked by external trackers (e.g., HTC Vive controllers). These controllers have buttons and/or a trackpad integrated therein, making it possible to select specific pre-programmed hand poses (e.g., thumbs up, finger pointing, etc.) upon pressing a button, for example. Such technology may, for example, be incorporated into various embodiments to assist in determining the wrist position and orientation.

[0044] Upon determining the head and wrist positions and orientations, the head and wrist positions and orientations are transcribed into corresponding head and wrist positions and orientations of the avatar in the virtual or augmented reality environment on a client device (block 320). In other words, in one embodiment, the host determines the raw head/wrist positions and orientations, and sends this raw data to each client, and each client processes the raw data to determine how the wrist and head of the host and client avatar should move.

[0045] When the position and orientation of the head and both wrists are known to the client device, it is possible to use algorithms known as inverse kinematics to infer the position and orientation of the elbows, shoulders and overall position of the torso based on anatomical mechanical constraints that are consistent with a human body. The head and wrist positions and orientations are applied as constraints to an inverse kinematics algorithm to determine a position and orientation of segments of a simplified skeleton of the host user such as at least one of hands, feet, head, elbows, knees or torso (block 330). An avatar body posture that mimics the host user is recreated (block 340).

[0046] In computer animation, Inverse Kinematics is a mathematical algorithm that can take as input a series of mechanical constraints for the position and/or orientation of various body parts such as limbs, the head, the hands, feet or intermediate parts such as elbows or knees. These mechanical constraints associated with a degree of enforcement are the input of the algorithm. These inputs could come from video processing and detection of body parts but could also come from logical definition (e.g., feet are above the ground). Sometimes these inputs may be determined by a relative position compared to another device such as a joystick or an object tracked in space.

[0047] The output of the algorithm is a human body pose that satisfies these mechanical constraints to the best possible result after a series of iterations. In other words, the algorithm will output the position and orientation of each extremity of each segment for the simplified skeleton: hands, feet, head, elbows, knees and torso.

[0048] The human body model can be a simplified skeleton comprising multiple kinematic chains of segments connected by joints. For example, the shoulder to elbow to wrist to hand is a kinematic chain. Each segment of the chain is rigid and has parameters that constrain the segment position and orientation with respect to neighboring segments in the chain or external elements in the environment (e.g., hand holding a door handle). These constraints can represent muscle forces or external constraints. These constraints can be almost absolute, meaning that they must be satisfied first and foremost. An example of such an absolute constraint is when the hand of the human body must be holding a door handle. Other constraints can be associated with a priority level indicating how important the constraint is and its priority in the degree of satisfaction by the solution. For example, the human head may be required to look at an object, but the position of the head is flexible since the direction of the eyes is more important.

[0049] These constraint parameters can be controlled programmatically during the execution of a software program. For example, a hand holding a door handle will need to follow the constraint of the door handle, which is programmed to open or close following a software input. The program will calculate the position and orientation of the handle during an animation, which therefore determines the constraints applied to the position and orientation of the hand.

[0050] The algorithm will typically initiate with a given body pose that may come from a previous frame or from a random state and will calculate how far each constraint is fulfilled. Using the Jacobian inverse technique, the delta between the current state of the constraints and its goal is then multiplied by the priority factor. The higher the priority factor, the more the delta will affect the overall cost calculation. For each animation frame, the algorithm will iterate several times to reduce the overall cost calculation up to a limit set by the person programming the algorithm. The limit could be a given number of iterations or an overall cost threshold. When the limit is reached, the algorithm stops and outputs the body pose found to satisfy the constraints to the limit of what is possible.

[0051] Other heuristic based algorithms can be used to find an acceptable solution such as Cyclic Coordinate Descent (CCD), and Forward And Backward Reaching Inverse Kinematics (FABRIK).
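
As an illustration of such a heuristic solver (the application does not prescribe one), a minimal FABRIK pass for a single chain anchored at the head/torso and reaching toward a tracked wrist could look like this; the joint layout and segment lengths are assumptions:

```python
import numpy as np

def fabrik(joints, target, lengths, iterations=10, tol=1e-3):
    """One FABRIK solve for a single kinematic chain.

    joints:  (N, 3) current joint positions, e.g. head/torso anchor -> shoulder -> elbow -> wrist
    target:  (3,)   tracked wrist position acting as the end-effector constraint
    lengths: (N-1,) fixed segment lengths between consecutive joints
    The root joints[0] stays fixed, playing the role of the head/torso anchor.
    """
    joints = np.asarray(joints, dtype=float).copy()
    target = np.asarray(target, dtype=float)
    root = joints[0].copy()
    for _ in range(iterations):
        if np.linalg.norm(joints[-1] - target) < tol:
            break
        # Backward pass: put the end effector on the target and walk toward the root.
        joints[-1] = target
        for i in range(len(joints) - 2, -1, -1):
            d = joints[i] - joints[i + 1]
            joints[i] = joints[i + 1] + d / np.linalg.norm(d) * lengths[i]
        # Forward pass: re-anchor the root and walk back toward the end effector.
        joints[0] = root
        for i in range(len(joints) - 1):
            d = joints[i + 1] - joints[i]
            joints[i + 1] = joints[i] + d / np.linalg.norm(d) * lengths[i]
    return joints

# Example: a 3-segment arm chain reaching for a tracked wrist position.
chain = np.array([[0.0, 1.5, 0.0], [0.2, 1.45, 0.0], [0.45, 1.3, 0.0], [0.65, 1.1, 0.0]])
seg_lengths = np.linalg.norm(np.diff(chain, axis=0), axis=1)
solved = fabrik(chain, target=np.array([0.4, 1.6, 0.2]), lengths=seg_lengths)
```

Running a few such passes per animation frame, with the tracked head pose anchoring the root and the tracked wrist as the target, yields intermediate joint positions (elbow, shoulder) consistent with the head and wrist constraints.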

[0052] In some cases, the system may be over-constrained, meaning that no body pose would satisfy all constraints. This typically results in body poses that are contrived or not natural. When the algorithm is used by a skilled engineer, the system is typically under-constrained, meaning that several body poses would satisfy the set of constraints, which typically results in natural body poses.

[0053] In certain instances, the tracking data may be expanded beyond the head and wrists. In another embodiment of a host method, the tracking data may be expanded to include facial expressions and gaze direction of the host user's head. The host system may use a conventional algorithm for recognizing the host user's facial expressions and update the avatar to display the recognized facial expressions. However, the tracking data is not transmitted to the client side. Rather, a numerical value is transmitted to the client side so as not to impact data bandwidth. For instance, there might be a predefined set of expressions (e.g., frown, smile, sneer) and the numerical value would select one of the predefined expressions and the magnitude of that expression (e.g., half smile).

[0054] Reference is now made to FIG. 4, which illustrates a host system 400 for avatar puppeting in virtual or augmented reality. The host system 400 includes a computer system 410 programmed to host a virtual or augmented reality environment including an avatar of a host user. The host system 400 further includes a handheld tracking device 420 that carries fiducial markers. The handheld device 420 is held by the host user during hosting. The host system 400 further includes a camera system 430 that, during hosting, captures images of the host user's head and the fiducial markers on the tracking device 420.

[0055] The computer system 410 is programmed to process images from the camera system 430 to determine head and wrist positions and orientations of the host user, and convert the host user's head and wrist positions and orientations into corresponding head and wrist positions and orientations of the avatar.

[0056] For instance, the computer system 410 executes code for recognizing the host user's head from a sequence of image frames, determining changes in the position and orientation of the user's head, and converting those changes into head motions for the avatar. The code also recognizes the fiducial markers in a sequence of the images, determines changes in the position and orientation of the fiducial markers, deduces changes in position and orientation of the wrists, and converts those changes into wrist motions for the avatar.

[0057] In the alternative, the computer system 410 can be programmed to determine the wrist orientations only from the images. For instance, the image processing may include segmentation of the hands by removing the background. The user is asked to bring their hands out of, then into, the field of view of the camera with little else moving. By removing the part of the frame that has not changed between the two pictures, the part of the image that is specific to the hands is extracted. Images of the host user's open palm can be applied to a machine learning model, which detects the fingers of the hand. The host user then places the open palm of the hand at a specific distance from the camera in order to calibrate the size of the features of the hand (e.g., distance from the wrist to the thumb), which will be used subsequently to calculate the distance between the video camera and the hand on subsequent frames by comparing with the calibrated frame. This enables the position and orientation of the wrists to be detected in real time.

[0058] The computer system 410 is further programmed to transmit only the avatar head and wrist positions and orientations to one or more clients. These positions and orientations may be transmitted via a network interface or other data communications interface.

[0059] The computer system 410 is further programmed to determine a kinematic chain from at least one wrist to the head of the avatar with the head and wrist positions and orientations as constraints. The computer system 410 may also complete a partial or full representation of the avatar.

[0060] As a result of the improvements in data bandwidth and rendering speed, the host system 400 does not require high-end computers or specialized hardware. The computer system 410 may include only a desktop computer or a mobile device such as a smartphone, a tablet, or a laptop.

[0061] The camera system 430 may include only a single video camera that captures images of the host user's head and the fiducial markers (or hands). Alternatively and/or additionally, the single video camera may capture the host user's facial expressions as described above. In the alternative, the camera system 430 may include a first video camera dedicated to capturing images of the host user's head, and a second video camera dedicated to capturing images of the fiducial markers (or the host user's hands). The video feed is supplied to the computer system 410 and may be processed frame-by-frame to extract the tracking information about the position and orientation of the head and wrists.

[0062] Additional reference is made to FIG. 5, which illustrates an example of a handheld tracking device 420. The device 420 carries fiducial markers 500. For example, a multi-face object such as a hexahedron or dodecahedron is attached to a body of the device 420. Different patterns of fiducial markers 500 are visible on different faces of the multi-face object. These different patterns may be positioned such that at least one pattern is in sight of the camera system 430 at all times.

[0063] The handheld tracking device 420 further includes hand grips 510 for both of the host user's hands. The hand grips 510 are configured to be held in a tight and fixed position to enable the computer system 410 to deduce the wrist position and orientation relative to the fiducial markers 500.

[0064] The handheld tracking device 420 may further include orientation sensors 520. The computer system 410 may be programmed to process both sensor readings and the images of the fiducial markers to determine the wrist position and orientation. The device 420 may also include a set of buttons 530 that are assigned to different hand poses for the avatar. The hand poses are communicated to the computer system 410 via the buttons 530.

[0065] The handheld device 420 may communicate with the computer system 410 via a communications interface 540. For instance, the communications interface 540 may include a Bluetooth transmitter. In some embodiments, the handheld device may include a game controller.

[0066] Reference is now made to FIG. 6, which illustrates a particular embodiment of a virtual or augmented reality system 600 including a host device 610 and a client device 650. The host device 610 may be a smartphone, a tablet, a laptop or desktop computer, which includes a microphone 612 for capturing the voice of the host user, and a video camera 614 for capturing the face and parts of the body of the host user. The host device 610 runs code 616 that converts facial expressions, head orientation, eye gaze direction and the body position of the host user into data that can be transmitted to the client device 650 over a peer-to-peer (P2P) network 660.

[0067] The code 616 for detection of the head position and orientation, the position and orientation of the eyes and the detection of the facial expressions can be based on development frameworks made available by hardware manufacturers, such as Apple with the ARKit development framework or Google with the ARCore development framework. In the alternative, such detection could be performed by a deep learning model. The deep learning model could be running locally on the host device or remotely in the cloud 670.

[0068] The client device 650 may be an AR- or VR-enabled smartphone, an AR- or VR-enabled tablet or other VR or AR device. The client device 650 includes a microphone 652 for capturing a client's voice, and it also includes a video camera 654. The client device 650 runs code 656 for tracking its position in space in six degrees of freedom in order to display a 3D image of objects in a virtual environment from the point of view of the client device. The code 656 may be based on AR frameworks such as ARKit and ARCore for mobile devices or using external trackers.

[0069] To initiate a host session, the host device 610 is matched with the client device 650 to meet in a virtual room. Matching criteria are based on parameters provided by both the client device 650 and the host device 610. The cloud 670 may be responsible for relaying some of the data necessary for the experience (e.g., avatar data). After log-ins, the session begins.

[0070] The client device 650 views the scene in 3D, whether as an overlay to the real world (AR) or fully immersed in the 3D scene (VR). The microphone 652 of the client device 650 captures the client's voice and sends it to the host device 610. The client device 650 may offer a user interface (not shown) for enabling the client to provide input driving the experience in the environment.

[0071] Since the client device 650 knows its position and orientation in space in real time, the client device 650 is able to position the avatar at a location and orientation in a way that is consistent over time in comparison with the client and the physical space in the case of AR, or with regard to the virtual space in the case of VR. As a result, the client can have a consistent experience when interacting with the virtual avatar of the host.

[0072] It will be appreciated that the user on the host side could decide to trigger some pre-calculated animations of the avatar. For example, the salesperson (host) may select a button on the host interface that would cause the avatar to stand up or sit down. The point of view to look at the avatar on the client side may be controlled by the way the client is holding his/her device with regards to the position of the avatar. Additionally, according to another example, the host may trigger an animation that would cause the avatar to walk around, thus causing the client to move his/her device to continuously point to the avatar.

[0073] The host device 610 can communicate with the client device via VoIP 618, 658 so as to hear the client's voice and talk to the client in such a way that both sides can have a natural conversation. At the same time, the video camera 614 of the host device 610 is capturing the host face and body parts, converting the host facial expression, head orientation, eye gaze direction and body part position and orientation into corresponding motions of an avatar A, and sending this data to the client device 650. The client device 650 integrates all the received data and re-generates an avatar B that mimics the host very naturally. The lag between the voice data and tracking data is synchronized to ensure that the avatar rendering feels natural to the client.

[0074] The synchronization between the host device 610 and the client device 650 may be performed at blocks 619 and 659. The synchronization ensures that objects and avatars are in the same position and orientation on both the host side and the client side relative to the virtual environment. The synchronization also manages when objects are shown or disappear and compensates for lag between the voice data and the face and body expression data. A synchronization framework may be implemented that ensures objects and avatars are in the same position and orientation on both the host side and the client side of the application. As a result, the host and the client will be able to see the same scene simultaneously for all practical purposes. The synchronization technology also manages when objects appear or disappear.

[0075] For example, one method described above may stream the voice through a different network pipeline than the one used to stream the avatar parameters, especially lip syncing. To ensure a good match between the lips pronouncing a word and the sound of the word, the system can insert synchronization markers in both streams from the host down to the client. When the client receives the corresponding markers, it can calculate the lag between the two streams and send a compensation value back to the host. This allows the host to insert a delay in either stream. As a result, the lip syncing and the voice will stay in sync even through dynamic changes of lag on either stream.
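
A sketch of how such markers and the client-side lag estimate could work; the marker format and stream names are assumptions, not taken from the application:

```python
import time

def make_marker(marker_id):
    """Host side: stamp the same marker id into both the voice and avatar streams."""
    return {"marker": marker_id, "host_time": time.time()}

class LagEstimator:
    """Client side: note when the same marker arrives on each stream and report the lag."""
    def __init__(self):
        self.arrivals = {}  # marker_id -> {"voice": t, "avatar": t}

    def on_marker(self, stream_name, marker_id):
        times = self.arrivals.setdefault(marker_id, {})
        times[stream_name] = time.time()
        if "voice" in times and "avatar" in times:
            # Positive result: the avatar stream arrived later, so the host should delay
            # the voice stream by roughly this amount (the compensation value).
            return times["avatar"] - times["voice"]
        return None

estimator = LagEstimator()
estimator.on_marker("voice", 42)
lag = estimator.on_marker("avatar", 42)  # send back to the host when not None
```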

[0076] The host device 610 may provide a user interface (not shown) that allows the host user to trigger avatar animations, change the parameters of the scene such as lighting conditions or music, or move the experience to a completely different scene. The user interface can supplement the user actions and automatically adjust the scene and the avatar to improve the client's experience. The host can use the user interface to add, move, change or remove objects in the scene. For instance, the host user can use the user interface to present commercial products for sale to the client and be available to describe the features of the product and answer questions in real time. The host can initiate animations to illustrate selling points. The host can remove layers of the product to show the inside of the product.

[0077] The code for the host and client devices 610 and 650 may be developed on top of a 3D game engine such as Unity or Epic's Unreal Engine. It could also potentially be built directly with 3D frameworks such as, for example, those provided by a mobile OS, like SceneKit in iOS. Several technologies may be used and developed on top of the 3D engine, the first one of which is the Augmented Reality framework. In the iOS ecosystem, the AR framework is called ARKit. In the Android ecosystem, the equivalent AR framework is called ARCore.

[0078] These frameworks provide advantageous capabilities. For example, they provide a way for the device running the application to know its location in space in 6 degrees of freedom in real time. This is called Visual Inertial Odometry (VIO). The inputs of the VIO are the video frames from each camera. The output of the VIO is the position and orientation of the device. The VIO algorithm processes each video frame for each of the cameras used by the system (in one embodiment, the front user camera of the smartphone or tablet). For each video frame, the algorithm may look for recognizable visual features using feature detection algorithms (e.g., SIFT, SURF) and will classify the features in a way that makes it possible for the same visual feature to be matched in the following video frame if it is still in the field of view of the camera. As a result, the VIO algorithm will be perceiving changes in the movements of detectable visual features in the field of view of the device camera. This analysis provides information about the movement of the camera itself, which is then combined with the information coming from the device Inertial Measurement Unit (IMU), which outputs the acceleration of the device as well as changes in orientation. Combining all this information in a filter (e.g., an Extended Kalman Filter or EKF), the device is operable to deduce its new position and orientation relative to the original position and orientation when the filter (e.g., EKF) was started.
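
The visual half of such a pipeline can be sketched with OpenCV as follows, using ORB features in place of SIFT/SURF; the intrinsics are placeholders, and a real VIO system would fuse this result with IMU data in a filter such as an EKF to recover scale and absolute motion:

```python
import cv2
import numpy as np

K = np.array([[800.0, 0.0, 320.0],   # placeholder camera intrinsics
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def relative_camera_motion(prev_gray, curr_gray):
    """Estimate rotation and unit-scale translation of the camera between two frames."""
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return None
    matches = matcher.match(des1, des2)
    if len(matches) < 8:
        return None  # not enough correspondences for the essential matrix
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t  # translation is up to scale; IMU data supplies the missing scale
```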

[0079] As a result, the device is operable to produce virtual renderings of a scene with avatars, products and objects that appear in a consistent position with regards to the real world by matching the movements of the device with the movements of the virtual camera used in the 3D gaming engine producing the synthetic frames shown to the user. As a result, a sitting avatar will always appear in the same location with regards to the physical world irrespective of the location of the device held by the end user.

[0080] An algorithm using data provided by these AR frameworks allows the device to detect surfaces and make assumptions with regards to where the floor is located along with walls, and even detect complex objects like a table.

[0081] On the host side of the application, the AR framework may capture the face of the host with the user-facing video camera, which may, for example, include special face recognition sensors such as Face ID on the iOS ecosystem. The application takes the data coming from the video camera or sensor and encodes it into a network message that is sent over the air to the client side to reproduce the host face expression as part of the host avatar shown on the client side of the application.

[0082] Reference is made to FIG. 7, which illustrates a method for recognizing and transmitting facial expressions between the host and the client while maintaining the above-described improvements in data bandwidth, rendering speed, and synchronization realized for the avatar.

[0083] In some embodiments, the processing of the host user's facial features is implemented using a deep learning neural network trained over a body of images annotated for these facial features. The training will take facial images of individuals with a high diversity of origin, skin color, gender and styles that have been annotated to indicate where the eyes, nose, mouth and other facial features are. The training creates a deep learning neural network that is eventually used on new facial images to output the locations of these features.

[0084] By analyzing the changes of positions of these facial features, the algorithm is able to recognize facial expressions such as smiling, frowning, winking, tongue out, etc. Here again, deep learning techniques may be used to analyze a portrait and return the expressions on the face.

[0085] According to embodiments disclosed herein, to optimize the bandwidth usage during the transmission of the virtual avatar from the host device to the client device, these facial expressions can be categorized into known expressions along with an intensity. For example, instead of transmitting the movement of all feature points on the face, the algorithm may produce key/value pairs such as smile = 0.4. The value associated with the expression is an indication of the intensity of the expression. As a result, the payload in the network is much smaller while the expressiveness of the avatar is maintained. Multiple facial expressions can be present on a face simultaneously and communicated for each frame.
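
A minimal sketch of this key/value encoding; the expression names, the 0-1 intensity convention, and the JSON serialization are assumptions for illustration:

```python
import json

def encode_expressions(expressions, min_intensity=0.05):
    """Send only non-negligible expressions as name -> intensity pairs."""
    payload = {name: round(value, 2)
               for name, value in expressions.items() if value >= min_intensity}
    return json.dumps(payload)

# e.g. '{"smile": 0.4, "brow_raise": 0.15}' instead of raw landmark coordinates
message = encode_expressions({"smile": 0.4, "brow_raise": 0.15, "frown": 0.01})
```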

[0086] On the client side, the virtual avatar of the host may be composed of a mesh surface typically composed of triangles or quads (i.e., surfaces with 4 edges). The mesh is typically textured by matching the points of the mesh with a 2D image or texture. It is possible to displace the vertices forming the mesh in such a way as to represent the deformations of the skin of the avatar. These deformations can be prepared in advance to represent various facial expressions (e.g., smile, frown, etc.). These deformations are typically called blendshapes. When a facial expression is received from the host device onto the client device, a blendshape is activated and the mesh of the virtual avatar is modified to match the pre-determined expression proportionally to the intensity value. As a result, the client will see an avatar with the same expressions as the host user substantially close to real time (accounting for network and processing delay).

[0087] It will be appreciated that the inputs of the facial expression system are the video frames of the host device front user camera as well as the avatar model and its blendshapes. The output is the avatar following the expression of the host user as displayed on the client device.
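
On the client side, activating the blendshapes proportionally to the received intensities can be sketched as a simple linear combination over the mesh vertices; the array shapes and names are illustrative:

```python
import numpy as np

def apply_blendshapes(base_vertices, blendshapes, weights):
    """Displace the neutral mesh by each expression delta, scaled by the received intensity.

    base_vertices: (V, 3) neutral mesh vertex positions
    blendshapes:   dict name -> (V, 3) vertex positions of the fully-applied expression
    weights:       dict name -> intensity in [0, 1] received from the host
    """
    out = base_vertices.astype(float).copy()
    for name, intensity in weights.items():
        if name in blendshapes:
            out += intensity * (blendshapes[name] - base_vertices)
    return out
```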

[0088] Reference is now made to FIG. 8, which illustrates a method of performing inverse kinematics. The human body model 800 can be modeled as a simplified skeleton 810 having multiple kinematic chains of segments connected by joints.

[0089] For example, the shoulder to elbow to wrist to hand is a kinematic chain. Each segment of the chain is rigid and has parameters that constrain the segment position and orientation with respect to neighboring segments in the chain or external elements in the environment (e.g., hand holding a door handle). These constraints can represent muscle forces or external constraints. These constraints can be almost absolute, meaning that they must be satisfied first and foremost. An example of such an absolute constraint is when the hand of the human body has to be holding a door handle. Other constraints can be associated with a priority level indicating how important the constraint is and its priority in the degree of satisfaction by the solution. For example, the human head may be required to look at an object, but the position of the head is flexible since the direction of the eyes is more important.

[0090] For example, a hand holding a door handle should follow the constraint of the door handle, which is programmed to open or close following a software input. Position and orientation of the handle can be computed during an animation, which therefore determines the constraints applied to the position and orientation of the hand.

[0091] The inverse kinematics algorithm may initiate with a given body pose that may come from a previous frame or from a random state and then calculate how far each constraint is fulfilled. Using the Jacobian inverse technique, the delta between the current state of the constraints and its goal is then multiplied by the priority factor. The higher the priority factor, the more the delta will affect the overall cost calculation. For each animation frame, the algorithm will iterate several times to reduce the overall cost calculation up to a limit set by the person programming the algorithm. The limit could be a given number of iterations or an overall cost threshold. When the limit is reached, the algorithm stops and outputs the body pose found to satisfy the constraints to the limit of what is possible.

[0092] A method and system herein are particularly useful for environments in which head and hand orientation are critical. Examples of such environments include, but are not limited to, virtual trade shows, virtual product demonstrations, and virtual training sessions. Additional examples include virtual psychotherapy, empathy training sessions, and doctor-patient sessions.

Example 1

[0093] The host user is acting as a sales agent, a product demonstrator or a tech support representative. The host device is providing a user interface for triggering interactions with various elements of the virtual environments in order to demonstrate the functionalities of a product. For instance, the host user can point to specific areas of the product and have the host avatar point its finger towards that area. Specific animations can make the avatar interact with the product (e.g., opening a door of a washing machine product, turning the product on/off, or starting a laundry cycle virtually).

Example 2

[0094] An environment is hosted for professional training where the client is a trainee and the host user is an instructor. For example, hospitality employees could be trained to provide a concierge service for a hotel customer. In this case, the client of the system is the hospitality employee and the host of the system is an actor who plays the part of a customer. In this instance, the hospitality employee might wear a VR headset and interact with the avatar of the customer in a fully immersive VR environment. As such, the employee can be trained to provide a service consistent with the expectations and policies of the hotel chain he works at.

Example 3

[0095] An environment is hosted for a psychotherapy session. In this case, the client is a patient who wears a VR headset. The host is the therapist or an actor who is playing a part simulating or re-creating a traumatic event for the healing benefit of the patient.

Example 4

[0096] An environment is hosted for empathy training. The host user takes on a role of a virtual patient visit and a client takes on a role of a doctor during a first recorded session. During a second recorded session, the host user takes on the role of doctor while the client takes on the role of patient. For example, in the first session, the doctor or the nurse may be the clients of the system and an actor is the host playing the part of a patient. The nurse has an interaction with the virtual patient as she would in a clinical setting. The session is recorded, especially the voice of both the nurse and the virtual patient. The nurse may be wearing a VR headset for higher realism of the simulation.

[0097] In the second session, the nurse may be placed in the position of the patient (e.g., lying on a bed) and the recorded first session is played again while the nurse listens to the interaction from the point of view of the patient. As a result, the nurse experiences him/herself from the point of view of their patient, with the goal of eliciting sympathy or compassion towards the patient and offering an opportunity for self-introspection towards improving the relationship with his/her patients in the future.

Example 5

[0098] An environment is hosted for empathy training. The client experiences the avatar of a live remote actor while wearing a VR headset. In this case, the client is put in the shoes of a completely different person from the standpoint of demographics (e.g., a modern white male could be put through the experience of discrimination while in the shoes of a black girl in the 50s in the South of the United States, or a Palestinian national could experience life as an Israeli and vice-versa).

Definitions:

[0099] Augmented Reality: Augmented reality refers to an interactive experience of a real-world environment where the objects that reside in the real world are enhanced by computer-generated perceptual information, sometimes across multiple sensory modalities, including visual, auditory, haptic, somatosensory and olfactory.

[00100] Avatar refers to a graphical illustration that represents a computer user, or a character or alter ego that represents that user. An avatar can be represented either in three-dimensional form (for example, in games or virtual worlds) or in two-dimensional form as an icon in Internet forums and virtual worlds.

[00101] Visual Inertial Odometry (VIO): VIO means estimating the 3D pose (translation + orientation) of a moving camera relative to its starting position, using visual features along with inertial data provided by an onboard inertial measurement unit (IMU), typically using an accelerometer, a gyroscope and sometimes a magnetometer.

[00102] Virtual Reality (VR): VR is a simulated immersive experience that can be similar to or completely different from the real world.

[00103] Voice Over IP: VoIP, also called IP telephony, is a method and group of technologies for the delivery of voice communications and multimedia sessions over Internet Protocol (IP) networks, such as the Internet.

[00104] Of course, it is to be appreciated that any one of the examples, embodiments or processes described herein may be combined with one or more other examples, embodiments and/or processes, or be separated and/or performed amongst separate devices or device portions in accordance with the present apparatuses, devices and methods.

[00105] Having described several embodiments herein, it will be recognized by those skilled in the art that various modifications, alternative constructions, and equivalents may be used. The various examples and embodiments may be employed separately or they may be mixed and matched in combination to form any iteration of the alternatives. Additionally, a number of well-known processes and elements have not been described in order to avoid unnecessarily obscuring the focus of the present disclosure. Accordingly, the above description should not be taken as limiting the scope of the invention. Those skilled in the art will appreciate that the presently disclosed embodiments teach by way of example and not by limitation. Therefore, the matter contained in the above description or shown in the accompanying drawings should be interpreted as illustrative and not in a limiting sense. The following claims are intended to cover all generic and specific features described herein, as well as all statements of the scope of the present method and system, which, as a matter of language, might be said to fall therebetween.

[00106] For example, the following variations will be appreciated by those of ordinary skill in the art and well within the scope of embodiments described herein:

[00107] Multiple clients: The system is not limited to one client and one host. One example could be one host with many clients, allowing the host to showcase a commercial product to multiple clients at the same time. In this experience, each client can see the host and talk to him and other clients.

[00108] Multiple hosts: The system could be used where two or more hosts could together present a product to one or several clients.

[00109] Automatic facial expressions: An embodiment described above uses the camera to detect the host facial expressions and transcribe them into data that is used on the client side to reproduce these facial expressions on the client's device. Another variant of this embodiment may automatically pick facial expressions based on the tone of the user by comparing the tone to a baseline, most likely leveraging a deep learning model.

[00110] Similarly, facial expressions could be picked by analyzing the language used by the host while talking to the client. This latter variant would leverage speech-to-text technologies along with Natural Language Processing algorithms. The algorithm would classify the language in such a way that it can be matched with one or a combination of facial expressions.