

Title:
AUTOMATIC DATA-DRIVEN HUMAN SKELETON LABELLING
Document Type and Number:
WIPO Patent Application WO/2023/027691
Kind Code:
A1
Abstract:
A computer system obtains a plurality of first images of a scene captured concurrently by a plurality of first cameras. Each first image is captured by a respective first camera that is disposed in a distinct location in the scene. The computer system generates a plurality of two-dimensional (2D) feature maps from the plurality of first images, and each first image corresponds to a respective subset of 2D feature maps. The plurality of feature maps are projected into a plurality of aggregated volumes of the scene. The computer system generates a plurality of three-dimensional (3D) heatmaps corresponding to the plurality of aggregated volumes of the scene using a heatmap neural network. Automatically and without user intervention, the computer system identifies positions of a plurality of key points in the scene from the plurality of 3D heatmaps. Each key point corresponds to a joint of a person in the scene.

Inventors:
LI ZHONG (US)
GUO YULIANG (US)
DU XIANGYU (US)
QUAN SHUXUE (US)
XU YI (US)
Application Number:
PCT/US2021/047317
Publication Date:
March 02, 2023
Filing Date:
August 24, 2021
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06N3/04
Domestic Patent References:
WO2019241667A1, 2019-12-19
Foreign References:
US20200272888A1, 2020-08-27
Attorney, Agent or Firm:
WANG, Jianbai et al. (US)
Claims:
What is claimed is:

1. A method for automatically labelling images, comprising: obtaining a plurality of first images of a scene captured concurrently by a plurality of first cameras, wherein each first image is captured by a respective first camera that is disposed in a distinct location in the scene; generating a plurality of two-dimensional (2D) feature maps from the plurality of first images, each first image corresponding to a respective subset of 2D feature maps; projecting the plurality of feature maps into a plurality of aggregated volumes of the scene; generating a plurality of three-dimensional (3D) heatmaps corresponding to the plurality of aggregated volumes of the scene using a heatmap neural network; and automatically and without user intervention, identifying positions of a plurality of key points in the scene from the plurality of 3D heatmaps, each key point corresponding to a joint of a person in the scene.

2. The method of claim 1, wherein for each first image, the respective subset of 2D feature maps are generated from the respective first image using a respective backbone neural network, and the respective backbone neural network is trained separately from, or jointly end-to-end with, the heatmap neural network and other respective backbone neural networks.

3. The method of claim 1 or 2, wherein identifying the positions of the plurality of key points in the scene from the plurality of 3D heatmaps further comprises: applying a normalized exponential function on each of the plurality of 3D heatmaps to identify a position of a respective one of the plurality of key points in the scene.

4. The method of any of the preceding claims, wherein the plurality of first images are captured concurrently when the plurality of first images are captured within a temporal window.

5. The method of any of the preceding claims, wherein each of the plurality of first cameras includes a time-of-flight camera.

6. The method of any of the preceding claims, wherein the positions of the plurality of key points are identified in a first coordinate of the scene, further comprising: obtaining a second image of the scene captured by a second camera concurrently with the plurality of first images; determining a correlation between the first coordinate of the scene and a second coordinate of the second camera; in accordance with the correlation of the first and second coordinates, converting the positions of the plurality of key points from the first coordinate to the second coordinate; and automatically labelling the second image with the plurality of key points based on the converted positions of the plurality of key points in the second coordinate.

7. The method of claim 6, labelling the second image with the plurality of key points further comprising: interpolating an additional key point position from the converted positions of the plurality of key points in the second coordinate; and associating the additional key point position with an additional key point that is not among the plurality of key points.

8. The method of claim 6, labelling the second image with the plurality of key points further comprising: fitting a subset of the converted positions of the plurality of key points to a human body; identifying a geometric center of the subset of the converted positions of the plurality of key points; and identifying a location of the human body corresponding to the subset of the converted positions of the plurality of key points based on the geometric center.

9. The method of claim 6, further comprising: using the second image labelled with the plurality of key points to train a deep learning model.

10. The method of claim 6, wherein the correlation between the first coordinate and the second coordinate includes a plurality of displacement parameters associated with a 3D displacement between the first and second coordinates and a plurality of rotation parameters associated with a 3D rotation between the first and second coordinates.

11. The method of claim 6, determining the correlation between the first coordinate and the second coordinate further comprising: obtaining, from the plurality of first cameras, a plurality of first test images of the scene; obtaining, from the second camera, one or more second test images of the scene, the second test images captured concurrently with the first test images; detecting a position of a first test point in the first coordinate of the scene from the plurality of first test images; detecting a position of a second test point in the second coordinate of the second camera, the second test point having a known physical position in the first coordinate with respect to the first test point; and in accordance with the known physical position of the second test point in the scene with respect to the first test point, deriving the correlation of the first and second coordinates.

12. The method of claim 6, wherein the second camera is configured to capture a color image, a monochromatic image, or a depth image.

13. The method of claim 6, wherein the second camera is installed on a mobile device or augmented reality (AR) glasses.

14. A computer system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-13.

15. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-13.

Description:
Automatic Data-Driven Human Skeleton Labelling

TECHNICAL FIELD

[0001] This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for generating information of human joints and skeletons from image data.

BACKGROUND

[0001] Human pose estimation needs a large amount of data that label key points of human bodies in images. Such key point labels can be synthesized, manually created, or automatically identified in the images. Automatically identified labels require minimal human labor and computer resources and have a reasonable level of accuracy. However, automatically identified labels are normally a result of applying specific imaging equipment. The specific imaging equipment provides images of limited quality, and is oftentimes used with physical markers that are attached to the surfaces of tracked objects. Physical markers are inconvenient to use, cause data pollution, and can even interfere with an object's movement in some situations. It would be beneficial to have a human pose estimation mechanism for identifying key points of human bodies in images (particularly, images taken by conventional cameras) that is more convenient than the current practice.

SUMMARY

[0003] Accordingly, there is a need for a convenient human pose estimation mechanism for identifying key points of human bodies in images, particularly in images taken by conventional cameras (e.g., a camera of a mobile phone or of augmented reality (AR) glasses). To that end, this application is directed to leveraging first cameras' labelling capability to automatically label key points in an image captured by a second camera (e.g., an RGB camera, a time-of-flight (ToF) camera). The first and second cameras are synchronized in time and, more importantly, calibrated in space to determine a physical correlation between two coordinates of the first and second cameras. The physical correlation is optionally represented by a rotation and translation matrix. The first cameras are distributed in a scene and capture a plurality of first images concurrently. Feature maps and aggregated volumes are derived from the plurality of first images of the scene, and applied to create first key points in the scene. A subset of the first key points are converted to corresponding second key points in a second image captured by the second camera based on the physical correlation of the two coordinates of the first and second cameras. Additional missing key points can be filled in on the second image based on the second key points. As such, the second key points and/or additional missing key points are annotated on the second image automatically and without user intervention.

[0004] In an aspect, a method is implemented at a computer system for automatically labelling images. The method includes obtaining a plurality of first images of a scene captured concurrently by a plurality of first cameras. Each first image is captured by a respective first camera that is disposed in a distinct location in the scene. The method further includes generating a plurality of two-dimensional (2D) feature maps from the plurality of first images, and each first image corresponds to a respective subset of 2D feature maps. The method further includes projecting the plurality of feature maps into a plurality of aggregated volumes of the scene and generating a plurality of three-dimensional (3D) heatmaps corresponding to the plurality of aggregated volumes of the scene using a heatmap neural network. The method further includes, automatically and without user intervention, identifying positions of a plurality of key points in the scene from the plurality of 3D heatmaps. Each key point corresponds to a joint of a person in the scene.
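As an illustration of the last step above, the following minimal Python sketch applies a normalized exponential (softmax) over one 3D heatmap and takes the expectation over voxel coordinates to recover a key point position. The array shapes, voxel grid origin, and voxel size are assumptions chosen for illustration rather than values taken from this application.

import numpy as np

def soft_argmax_3d(heatmap, grid_origin, voxel_size):
    # heatmap:     (D, H, W) array of unnormalized scores for one joint.
    # grid_origin: (3,) scene-coordinate position of voxel (0, 0, 0).
    # voxel_size:  edge length of one cubic voxel.
    flat = heatmap.reshape(-1)
    weights = np.exp(flat - flat.max())          # normalized exponential (softmax)
    weights /= weights.sum()

    d, h, w = heatmap.shape
    zz, yy, xx = np.meshgrid(np.arange(d), np.arange(h), np.arange(w), indexing="ij")
    voxels = np.stack([xx, yy, zz], axis=-1).reshape(-1, 3).astype(float)
    expected_voxel = (weights[:, None] * voxels).sum(axis=0)   # soft-argmax

    return np.asarray(grid_origin) + expected_voxel * voxel_size

# One heatmap per key point, e.g., 17 joints over a 64x64x64 voxel grid.
heatmaps = np.random.rand(17, 64, 64, 64)
key_points = np.array([soft_argmax_3d(h, grid_origin=(0.0, 0.0, 0.0), voxel_size=0.05)
                       for h in heatmaps])
print(key_points.shape)   # (17, 3)

Running one such computation per joint heatmap yields the set of key point positions in the first (scene) coordinate.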

[0005] In some embodiments, the positions of the plurality of key points are identified in a first coordinate of the scene. The method further includes obtaining a second image of the scene captured by a second camera concurrently with the plurality of first images and determining a correlation between the first coordinate of the scene and a second coordinate of the second camera. The method further includes, in accordance with the correlation of the first and second coordinates, converting the positions of the plurality of key points from the first coordinate to the second coordinate. The method further includes automatically labelling the second image with the plurality of key points based on the converted positions of the plurality of key points in the second coordinate.

[0006] In another aspect, some implementations include a computer system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods. This computer system achieves key point labelling and annotation by performing data-driven volumetric triangulation of a 3D human pose, time alignment, and coordinate system calibration across two types of cameras.

[0007] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0009] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0010] Figure 2 is an example local imaging environment having a plurality of client devices, in accordance with some embodiments.

[0011] Figure 3 is an example flow chart of a process for identifying and annotating key points, in accordance with some embodiments.

[0012] Figure 4 is an example flow chart of a process for synchronizing a plurality of first cameras and a second camera in a scene, in accordance with some embodiments.

[0013] Figure 5 is an example flow chart of a process for recording calibration data from a plurality of first cameras and a second camera, in accordance with some embodiments.

[0014] Figures 6A and 6B are two test images applied to calibrate a plurality of first cameras and a second camera in space, in accordance with some embodiments, and Figure 6C is a flow chart of a process for calibrating the first and second cameras in space, in accordance with some embodiments.

[0015] Figure 7 is a flow chart of a process for annotating key points in a second image captured by a second camera based on a plurality of first images captured by a plurality of first cameras, in accordance with some embodiments.

[0016] Figure 8 is a flow chart of a process for identifying key points in a plurality of first images captured by a plurality of first cameras, e.g., using volumetric triangulation, in accordance with some embodiments.

[0017] Figure 9 is a flowchart of a method for annotating images automatically, in accordance with some embodiments.

[0018] Figure 10 is a block diagram illustrating a computer system, in accordance with some embodiments.

[0019] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0020] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.

[0021] Various embodiments of this application are directed to leveraging first cameras’ automatic key point identification and labelling capability to annotate key points in an image captured by a second camera (e.g., an RGB camera, a depth camera). The first and second cameras are automatically synchronized in time and calibrated in space, and a physical correlation is determined between two coordinates of the first and second cameras. In some embodiments, the first cameras are fixed in a scene, and a coordinate associated with a scene is used for each first camera to identify key points captured by the respective first camera. First key points are identified on first images captured by the first cameras and converted to corresponding second key points in a second image captured by the second camera concurrently with the first images, thereby annotating the second key points on the second image fast and accurately. In some embodiments, the first key points are associated with physical markers that are attached to an object and can be easily detected from the first images captured by the first cameras. Conversely, in some embodiments, the first key points are not associated with any physical markers, and the first key points are detected from the first images using image processing algorithms.

[0022] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, an imaging device 104D, a head-mounted display (also called AR glasses) 104E, or intelligent, multi-sensing, network-connected home devices (e.g., a thermostat). Each client device 104 can collect data (e.g., images) or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally (e.g., for training and/or inference) at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104. For example, storage 106 may store video content for training a machine learning model (e.g., a deep learning network) and/or video content obtained by a user to which a trained machine learning model can be applied to determine one or more actions associated with the video content.

[0023] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104D and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104D, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104D in real time and remotely.

[0024] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.

[0025] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C). The client device 104C obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequently to model training, the client device 104C obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A). The server 102A obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The client device 104A obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results from the server 102A, and presents the results on a user interface (e.g., associated with the application). The client device 104A itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B. The server 102B obtains the training data from itself, another server 102 or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104B imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface locally.

[0026] In various embodiments of this application, a data processing model is used to identify key points in images captured by a plurality of cameras. The data processing model is optionally trained in a client device 104 or a server 102, and used in the client device 104 or server 102 for inference of the key points. In some embodiments, an image labelled with the key points is used as training data to train a deep learning model for subsequent use. Training and inference of the deep learning model are also implemented in either the client device 104 or server 102.

[0027] Figure 2 is an example local imaging environment 200 having a plurality of client devices 104, in accordance with some embodiments. The plurality of client devices 104 include a plurality of first imaging devices 202, a second imaging device 204, and a server 102. The first and second imaging devices 202 and 204 are disposed in a scene and configured to capture images of respective fields of view related to the same scene. The first imaging devices 202 are a first type of imaging devices (e.g., an infrared camera), and the second imaging device 204 is a second type of imaging device (e.g., a visible light camera). The second type is distinct from the first type. In some embodiments, the first and second imaging devices 202 and 204 are only communicatively coupled to each other directly via a local area network 110 (e.g., a Bluetooth communication link). Alternatively, in some embodiments, the first and second imaging devices 202 and 204 are communicatively coupled to each other via a remote area network 108. In some embodiments, the first and second imaging devices 202 and 204 are communicatively coupled to the server 102 via a local area network, a remote area network or both. The server 102 is configured to process images captured by the first and second imaging devices 202 and 204 jointly with the first and second imaging devices 202 and 204 and/or communicate the images or related data between the first and second imaging devices 202 and 204. In an example, the server 102 is a local computer machine disposed on the scene and communicates with the first and second imaging devices 202 and 204 via a local area network.

[0028] In some embodiments, the plurality of first imaging devices 202 capture a plurality of first images concurrently (e.g., within a time window) in the scene, and the plurality of first images are used to map the scene into a 3D map of the scene. The 3D map created from the first images captured by the first cameras 202 corresponds to a first coordinate, and the second image corresponds to a second coordinate. A physical correlation between the first and second coordinates can be calibrated and used to convert a location in the first coordinate associated with the first images to a location in the second coordinate of the second image. In some embodiments, the scene and first imaging devices 202 are fixed, and so is the first coordinate. The first coordinate is an absolute coordinate with respect to the scene. The second imaging device 204 moves in the scene, and the second coordinate is a relative coordinate with respect to the scene.

[0029] In some embodiments, the 3D map of the scene includes a plurality of first feature points distributed at different locations of the 3D map. Each first image captured by a respective first camera 202 includes a first subset of the plurality of first feature points. The second image captured by the second camera 204 includes one or more second feature points corresponding to a second subset of the first feature points. Each second feature point corresponds to different coordinate values in the first and second coordinates. The different coordinate values of the one or more second feature points can be used to determine the physical correlation between the first and second coordinates. In some embodiments, the one or more second feature points are defined at known locations, e.g., corners and middle points of a checker board. More details on determination of the physical correlation between the first and second coordinates are explained below with reference to Figures 6A-6C.

[0030] In some situations, a first imaging device 202A has a first field of view 208A, and the second imaging device 204 has a second field of view 210 that shares a common portion with the first field of view 208A. A human body 206 is located in the scene, and appears in both the second field of view 210 of the second camera 204 and the first field of view 208A of the first camera 202A. The human body 206 is captured by both the second camera 204 and the first camera 202A, and can be seen in a first image captured by the first imaging device 202A and a second image captured by the second imaging device 204, albeit from two different perspectives.

[0031] In some embodiments, the human body 206 carries a plurality of physical markers. Each physical marker is optionally attached to a joint of the human body 206 or a body part that has a known position relative to the joint of the human body 206. When a first imaging device 202A captures a first image including the human body 206, the plurality of physical markers are recorded on the first image. Given the physical markers, a plurality of key points corresponding to the human body 206 (specifically, joints of the human body 206) are identified. In some embodiments, the first imaging device 202A and the physical markers facilitate detection of the physical markers based on unique imaging characteristics. For example, each first imaging device 202A includes an infrared camera with infrared emitters, and the physical markers have distinct infrared reflection properties. The physical markers appear differently (e.g., have a higher brightness level) on the first image that is an infrared image, and can be easily and accurately recognized by an infrared image processing algorithm. In some embodiments, the first imaging device 202A is configured to identify the locations of the physical markers on the first image locally. In some embodiments, the first imaging device 202A is configured to provide the first image to the server 102 or second imaging device 204, and the server 102 or second imaging device 204 is configured to identify the locations of the physical markers on the first image.
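For instance, brightness-based marker detection of this kind can be sketched with OpenCV as follows; the threshold value, the single-channel infrared image, and the function name are illustrative assumptions rather than details taken from this application.

import cv2
import numpy as np

def detect_marker_centroids(infrared_image, brightness_threshold=220):
    # Bright, highly reflective markers are isolated by a fixed threshold.
    _, mask = cv2.threshold(infrared_image, brightness_threshold, 255, cv2.THRESH_BINARY)
    # Each connected bright blob is treated as one physical marker.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centroids = []
    for contour in contours:
        m = cv2.moments(contour)
        if m["m00"] > 0:
            centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return np.array(centroids)   # (N, 2) marker centers in pixels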

[0032] In some embodiments, the locations of the physical markers on the first image are converted to locations on the second image based on the physical correlation between the coordinates of the first and second images. The converted locations on the second image are used to identify key points of the human body on the second image. Alternatively, in some embodiments, key points of the human body 206 are identified and tracked based on the locations of the physical markers on the first image captured by the first imaging device 202A. Based on the physical correlation between the coordinates of the first and second images, the key points associated with the physical markers on the first image are converted to key points on the second image captured by the second camera 204. These key points on the second image are connected to generate a skeleton model of the human body 206 for the second image, and therefore, associated with different body parts of the human body 206. As such, these key points are annotated on the second image in association with the body parts (e.g., joints) of the human body 206.

[0033] Additionally, in some embodiments, no physical marker is attached to the human body 206, and the first image captured by the first imaging device 202A is processed directly to identify one or more first key points of the human body 206 in the first image. Optionally, deep learning techniques are applied to identify the one or more first key points in the first image. The one or more first key points in the first image are converted to second key point(s) on the second image captured by the second camera 204 using the physical correlation of the first and second coordinates. These second key point(s) on the second image are connected to generate the skeleton model of the human body 206 for the second image, and therefore, associated with different body parts of the human body 206. In some embodiments, the first imaging device 202A is configured to provide the first image to the server 102 or second imaging device 204, and the server 102 or second imaging device 204 is configured to identify the one or more first key points in the first image, convert the identified first key point(s) to the second key point(s), and generate the skeleton model of the human body 206.

[0034] In some embodiments, the plurality of first imaging devices 202 are fixed at different locations of the scene, and have different first fields of view 208 that optionally overlap with each other, such that the 3D map of the scene can reasonably cover the entire scene. As the human body 206 changes a body location in the scene, the human body 206 stays in the second field of view 210 of the second imaging device 204, e.g., because the second imaging device 204 is adjusted to keep the human body 206 partially or entirely within the second field of view 210. In some situations, the human body 206 is in the first field of view of the first imaging device 202A at a first moment of time. At a second moment of time, the human body 206 does not exist in the first field of view of the first imaging device 202A, and exists in the first field of view of the first imaging device 202B. For the first moment of time, key points associated with the body parts of the human body 206 are identified from the first image captured by the first imaging device 202A, and converted to the second coordinate of the second image captured by the second imaging device concurrently with the first image. For the second moment of time, key points associated with the body parts of the human body 206 are identified from a third image captured by the first imaging device 202B, and converted to the second coordinate of another second image captured by the second imaging device 204 concurrently with the third image.

[0035] As such, when the human body 206 moves in the scene, a sequence of first images sequentially captured by the plurality of first imaging devices 202 are used to identify the key points associated with the body parts of the human body 206 in a sequence of second images that are captured by the second imaging device 204 concurrently with the sequence of first images. For example, the sequence of first images are captured by the first imaging devices 202A, 202B, 202C, and 202D successively, and include a respective number of successive first images for each of the first imaging devices 202A-202D based on a movement speed of the human body. It is noted that, in some context, the plurality of first cameras 202 are collectively called the first camera 202 including a plurality of first camera portions (e.g., first camera portion 202A).

[0036] In some embodiments, the human body 206 appears concurrently in the first fields of view of two first imaging devices 202A and 202B and the second imaging device 204 at a specific moment of time. One of two images captured by the two first imaging devices 202A and 202B is selected according to a device selection criterion to facilitate identification of the key points associated with the body parts of the human body 206 in the second image captured by the second imaging device 204. For example, in accordance with the device selection criterion, the first imaging device 202 whose first field of view overlaps the second field of view more is selected to determine the key points in the second image. In another example, the first imaging device 202 physically closer to the second imaging device 204 is selected to help identify the key points in the second image. In some embodiments, selection of either one of the two first imaging devices 202A and 202B does not change positions of the key points of the human body 206 within the first coordinate associated with the 3D map of the scene. Alternatively, in some embodiments, both of the two images captured by the two first imaging devices 202A and 202B are applied to facilitate identification of the key points associated with the body parts of the human body 206 in the second image captured by the second imaging device 204. In some embodiments, the two images have some common key points and some different key points, and are complementary to each other for identifying the key points of the human body 206. The identified key points of the human body are a collection of the common and different key points of the two images. In some embodiments, the two images have the same key points, which are thereby identified for the human body 206. The first imaging devices 202A and 202B share the same first coordinate (e.g., which is fixed with respect to the scene), and the identified key points in the first coordinate are converted to the second coordinate of the second image captured by the second imaging device 204.

[0037] Figure 3 is an example flow chart of a process 300 for identifying and annotating key points, in accordance with some embodiments. The process 300 is implemented jointly by a plurality of first cameras 202 and a second camera 204. In some embodiments, the process 300 involves a server 102 configured to process images jointly with the first and second cameras 202 and 204 and/or communicate the images or related data among the first and second cameras 202 and 204. In an example, each first camera 202 is an infrared camera, and first images captured by the first cameras 202 are infrared images. Alternatively, in another example, each first camera 202 is a high-end visible light camera system that has a key point detection and annotation capability. However, the first cameras 202 are expensive and not integrated in conventional consumer electronic devices. Conversely, in an example, the second camera 204 is installed on a mobile phone 104C or AR glasses 104E and configured to capture a color image, a monochromatic image, or a depth image. As such, the second images captured by the second camera 204 are labeled with key points based on the key point detection and annotation capability associated with the first camera 202.

[0038] Specifically, the second images captured by the second camera 204 are associated with the first images that are captured by the first cameras 202 and record key points of the human body 206. The first images captured by the first camera 202 are synchronized (302) with the second images captured by the second camera 204. Each image captured by the first and second cameras 202 and 204 is associated with a timestamp keeping track of a time when the respective image is captured. The first images captured by the first camera 202 have a first frame rate (e.g., 240 frames per second (FPS)), and the second images captured by the second camera 204 have a second frame rate (e.g., 30 FPS). In some embodiments, each second image is associated with a temporally closest first image having the key points of the second image regardless of whether the closest first image is captured earlier or later than the second image. Alternatively, in some embodiments, each second image is associated with a temporally closest first image captured earlier than and having the second key points of the second image. Alternatively, in some embodiments, each second image is associated with a temporally closest first image captured later than and having the second key points of the second image. Alternatively, in some embodiments, each second image is associated with two temporally closest first images, one captured earlier than the second image and the other captured later than the second image.
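The timestamp association described above can be sketched as follows in Python; the per-frame timestamps on a common clock and the nearest-neighbor policy (closest first frame, whether earlier or later) correspond to one of the variants mentioned in the preceding paragraph, and the frame rates are the example values given there.

import numpy as np

def match_nearest_first_frame(first_timestamps, second_timestamps):
    # Both inputs are sorted 1D arrays of capture times on a common clock.
    right = np.searchsorted(first_timestamps, second_timestamps)
    right = np.clip(right, 1, len(first_timestamps) - 1)
    left = right - 1
    # Keep whichever neighboring first frame is temporally closer.
    pick_left = (second_timestamps - first_timestamps[left]) <= (
        first_timestamps[right] - second_timestamps)
    return np.where(pick_left, left, right)

first_ts = np.arange(0.0, 1.0, 1.0 / 240)   # first cameras at 240 FPS
second_ts = np.arange(0.0, 1.0, 1.0 / 30)   # second camera at 30 FPS
matched_first_indices = match_nearest_first_frame(first_ts, second_ts)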

[0039] The first cameras 202 are associated with a first coordinate, and the second camera 204 is associated with a second coordinate. The first and second coordinates are physically related according to a physical correlation (e.g., represented by a rotation and translation matrix). An object is located at a first position in the first coordinate and at a second position in the second coordinate, and the first position is related to the second position based on the physical correlation. In some embodiments, the physical correlation between the first coordinate and the second coordinate includes a plurality of displacement parameters associated with a 3D displacement between the first and second coordinates and a plurality of rotation parameters associated with a 3D rotation between the first and second coordinates. Calibration is implemented (304) to determine the physical correlation. Specifically, the same object is captured by both the first and second cameras 202 and 204, and the first and second locations of the same object in the first and second images are collected as calibration data. The calibration data are used to determine (306) the physical correlation between the first and second coordinates. In an example, the object includes a plurality of fixed locations on a checker board (e.g., in Figures 6A and 6B).
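One common way to encode such a correlation is a 4x4 homogeneous rotation-and-translation matrix built from the three rotation and three displacement parameters; the Euler-angle parameterization below is an assumption made for illustration, not a detail of this application.

import numpy as np
from scipy.spatial.transform import Rotation

def correlation_matrix(rotation_xyz_rad, displacement_xyz):
    # Three rotation parameters (radians) and three displacement parameters.
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", rotation_xyz_rad).as_matrix()
    T[:3, 3] = displacement_xyz
    return T

# A point at position (x, y, z) in the first coordinate maps to the second
# coordinate as p_second = T @ [x, y, z, 1].
T = correlation_matrix([0.0, 0.1, 0.0], [0.2, 0.0, 1.5])
p_first = np.array([0.5, 1.0, 2.0, 1.0])
p_second = T @ p_first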

[0040] In some embodiments, a plurality of first images are captured by the first cameras 202, and a plurality of second images are captured by the second camera 204 concurrently with the first images. Human action data are recorded (308) on the first and second images by the first and second cameras 202 and 204, respectively.

[0041] Information of first key points of a human body 206 is extracted from the first images, and includes positions of the first key points in the first coordinate of the first cameras 202. The positions of the first key points in the first coordinate are converted to positions of second key points associated with body parts of the human body 206 in the second coordinate of the second image using the physical correlation. In some embodiments, one or more additional key points are missing. The one or more additional missing key points are derived (e.g., interpolated) from the second key points that are converted from the information of the first key points of the human body 206. As such, data post-processing is implemented (310) to calculate the key points of the first and second images and derive the one or more additional missing key points of the second key points of the second image.
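A minimal sketch of deriving one missing key point is given below; the assumption that the missing joint lies on the segment between two converted key points (e.g., a pelvis between the two hips) and the joint names themselves are illustrative and not taken from this application.

import numpy as np

def interpolate_missing_key_point(key_points, name_a, name_b, missing_name, weight=0.5):
    # key_points maps a joint name to its (x, y, z) position in the second coordinate.
    a = np.asarray(key_points[name_a], dtype=float)
    b = np.asarray(key_points[name_b], dtype=float)
    key_points[missing_name] = (1.0 - weight) * a + weight * b
    return key_points

converted = {"left_hip": (0.1, 0.9, 2.0), "right_hip": (-0.1, 0.9, 2.0)}
converted = interpolate_missing_key_point(converted, "left_hip", "right_hip", "pelvis")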

[0042] Based on the physical correlation of the first and second coordinates of the first and second cameras 202 and 204, the key points in the first images captured by the first cameras 202 are converted to key points associated with the body parts of the human body 206 in the second image captured by the second camera 204. Annotations associated with the key points of the first image are also projected (312) on the second image captured by the second camera 204. By these means, the process 300 is used for generating data for human-related algorithms. As data-driven algorithms become a trend, accurate and automatic data labeling and generation become crucial, and the process 300 provides a data generation solution without extensive manual labeling.

[0043] In some embodiments, a second image is associated with two temporally closest first images, one captured earlier than the second image and the other captured later than the second image. The two temporally closest first images are optionally captured by the same first camera 202 or distinct first cameras 202. The two first images have two separate sets of key points from which a set of key points may be temporally interpolated and converted to the key points associated with the body parts of the human body 206 in the second image captured by the second camera 204. Alternatively, the two separate sets of key points of the two first images are converted to two separate sets of key points, in the second coordinate, from which the key points in the second image are temporally interpolated.

[0044] Figure 4 is an example flow chart of a process 302 for synchronizing a plurality of first cameras 202 and a second camera 204 in a scene, in accordance with some embodiments. A server 102 includes (402) one of a local computer machine located in the scene and a remote server communicatively coupled to the first and second cameras 202 and 204 via the one or more communication networks 108. When the server 102 includes the local computer machine, each of the plurality of first cameras 202 and the second camera 204 is coupled to the local computer machine via a local area network (e.g., a WiFi network) or via wired links. When the server 102 includes the remote server, each of the plurality of first cameras 202 and the second camera 204 is coupled to the remote server 102 via at least a wide area network (e.g., a cellular network). Software applications are executed on the local computer machine or the remote server to receive image data from the plurality of first cameras 202 and the second camera 204 and process the image data if needed.

[0045] System times tracked at the first camera 202 and second camera 204 may not be identical and need to be calibrated to ensure that image data captured by the cameras 202 and 204 are synchronized. In some embodiments, each first camera 202 sends a test signal to the server 102, and the test signal includes a first time stamp recording a sending time tracked based on a first camera time. The server 102 receives the test signal and records a second time stamp recording a receiving time tracked based on a server time. The server 102 then determines a time difference between the first camera time and the server time. In some embodiments, the server 102 determines a latency time for the test signal and deducts the latency from the time difference. In some embodiments, the latency is negligible compared with the time difference. Subsequently, the time difference is used to synchronize (404) the first camera 202 and server 102. In some embodiments, the second camera 204 is similarly synchronized (406) with the server 102, optionally based on a related latency, such that times tracked by the first cameras 202, second camera 204, and server 102 can be calibrated with respect to each other, e.g., based on the server time of the server 102. Alternatively, in some embodiments, the server 102 is synchronized (408) with the second camera 204, optionally based on a related latency, such that times tracked by the first cameras 202, second camera 204, and server 102 can be calibrated with respect to each other, e.g., based on the second camera time of the second camera 204.

[0046] In some embodiments, the first cameras 202 and the second camera 204 communicate data with each other directly and do not involve the server 102 in the process 300. One of the first cameras 202 sends a test signal to the second camera 204, which tracks a receiving time and determines a time difference between the first and second cameras 202 and 204. The second camera 204 optionally deducts a latency from the time difference. The time difference is used to synchronize the first and second cameras 202 and 204. Conversely, in some embodiments, the second camera 204 sends a test signal to one of the first cameras 202, which tracks a receiving time and determines a time difference between the first and second cameras 202 and 204. The time difference, from which a latency is optionally deducted, is used to synchronize the first and second cameras 202 and 204.
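The offset computation behind this synchronization can be sketched as follows; approximating the one-way latency as half of a measured round trip is an assumption for illustration and is not prescribed by this application.

def estimate_clock_offset(send_time_camera, receive_time_server, round_trip_s=0.0):
    # How far the camera clock runs ahead of the server clock. The raw difference
    # between the stamped sending and receiving times understates the offset by the
    # one-way transmission latency, approximated here as half the round-trip time.
    return (send_time_camera - receive_time_server) + round_trip_s / 2.0

offset = estimate_clock_offset(send_time_camera=100.020,
                               receive_time_server=100.005,
                               round_trip_s=0.004)
# A timestamp from that camera is then mapped onto the server time base.
camera_time = 101.500
server_time = camera_time - offset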

[0047] Figure 5 is an example flow chart of a process 304 for recording calibration data from a plurality of first cameras 202 and a second camera 204, in accordance with some embodiments. The plurality of first cameras 202 and the second camera 204 exist in a scene including one or more objects that are associated with a plurality of key points.

[0048] In some embodiments, a physical marker is disposed at a key point or at a known position with respect to the key point, and emits signals at a predefined marking frequency (e.g., 0-200 Hz). After the first camera 202 starts (502) recording first images, the first images are captured to record (504) three-dimensional (3D) positions of the plurality of physical markers. Alternatively, in some embodiments, no physical marker is applied to mark the key points, and the first images include 3D positions of the plurality of key points of the one or more objects in the scene. The first camera 202 sends (504) the first images to the server 102 frame by frame. Each first image includes a first timestamp recording a first frame time when the respective first image is captured. The server 102 receives and stores (506) the first images 508, which are captured by the first camera 202 and include the 3D positions of the physical markers or key points and the respective first timestamp of each first image.

[0049] After the second camera 204 starts (502) recording second images, the second camera 204 captures (510) the second images and sends the second images to the server 102 frame by frame. Each second image 512 includes a second timestamp recording a second frame time when the respective second image is captured by the second camera 204. System times of the first camera 202, second camera 204, and server 102 have been previously calibrated, and therefore, the first timestamp, the second timestamp, or both of them are adjusted to synchronize the first and second images captured by the first and second cameras 202 and 204.

[0050] Figures 6A and 6B are two test images 600 and 620 applied to calibrate a plurality of first cameras 202 and a second camera 204 in space, in accordance with some embodiments, and Figure 6C is a flow chart of a process 306 for calibrating the first and second cameras 202 and 204 in space, in accordance with some embodiments. The first test image 600 is captured (602) by one of the first cameras 202, and the second test image 620 is captured (604) by the second camera 204 concurrently with the first test image 600 in the same scene. Given different locations and orientations of the first and second cameras 202 and 204, the first and second test images 600 and 620 are captured from two different perspectives. Both the first and second test images 600 and 620 include a checker board 606 and have a plurality of key points 608 (e.g., 608A, 608B, 608C, and 608D) located on a plurality of predefined locations of the checker board 606.

[0051] The first test image 600 is associated with a first timestamp recording a first frame time when the first test image 600 is captured. The second test image 620 is associated with a second timestamp recording a second frame time when the second test image 620 is captured. In an example, the first camera 202 is an RGB camera configured to capture test images at a first frame rate of 200 FPS, and the second camera 204 is a camera integrated in a mobile phone 104C configured to capture test images at a second frame rate of 30 FPS. In some embodiments, the first test image 600, first timestamp, second test image 620, and second timestamp are consolidated and processed in a server 102 or the second camera 204 to calibrate the first and second cameras 202 and 204 in space, i.e., determine a physical correlation between coordinate systems of the first camera 202 and the second camera 204.

[0052] The first test image 600 is among a sequence of successive first test images captured by the first camera 202, and the second test image 620 is among a sequence of successive second test images captured by the second camera 204. For each second test image 620, the server 102 identifies (610) a temporally closest first test image (i.e., the first test image 600) that is captured substantially concurrently with and includes the key points of the second test image 620. In some embodiments, each second test image 620 is captured concurrently with a temporally closest first test image 600 regardless of whether the closest first test image 600 is captured earlier or later than the second test image 620. Alternatively, in some embodiments, each second test image 620 is captured concurrently with a temporally closest first test image 600 captured earlier than and including the key points of the second test image 620. Alternatively, in some embodiments, each second test image 620 is captured concurrently with a temporally closest first test image 600 captured later than and including the key points of the second test image 620. Alternatively, in some embodiments, each second test image 620 is associated with two temporally closest first images 600, one captured earlier than the second image and the other captured later than the second image.

[0053] In some embodiments, each long side of the checker board 606 corresponds to two of the key points 608. The locations of the key points 608 are easily recognized from a first coordinate of the first test image 600. In addition to the key points 608, locations can be easily identified for four corners 612 of the checker board 606 and middle points 614 of the key points 608 on the first test image 600 based on the locations of the key points 608. Locations of a subset or all of the key points 608, four corners 612, and middle points 614 are identified on the second test image 620, and compared with the corresponding locations on the first test image 600 to determine the physical correlation between coordinate systems of the first camera 202 and the second camera 204.
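As an illustration of obtaining such test key points, the OpenCV sketch below detects the board's inner corners in a grayscale test image and computes middle points between paired points; the pattern size and the use of chessboard inner corners (rather than dedicated markers) are assumptions made for this sketch.

import cv2
import numpy as np

def detect_board_points(gray_image, pattern_size=(7, 5)):
    # Detect the inner corners of the checker board as test key points.
    found, corners = cv2.findChessboardCorners(gray_image, pattern_size)
    if not found:
        return None
    # Refine the detections to sub-pixel accuracy for a more reliable calibration.
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)
    corners = cv2.cornerSubPix(gray_image, corners, (5, 5), (-1, -1), criteria)
    return corners.reshape(-1, 2)

def middle_points(points_a, points_b):
    # Middle points between paired key points, e.g., the two points on a long side.
    return (np.asarray(points_a, dtype=float) + np.asarray(points_b, dtype=float)) / 2.0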

[0054] In some embodiments, a perspective-n-point (PnP) method is applied to determine the physical correlation between the coordinate systems of the first and second cameras 202 and 204. Alternatively, in some embodiments, a random sample consensus (RANSAC) method is applied (616) to determine the physical correlation between the coordinate systems of the first and second cameras 202 and 204. The physical correlation is optionally represented (618) by a rotation and translation matrix correlating the locations of the key points 608, corners 612, or middle points 614 identified from the first test image 600 to corresponding locations in the second coordinate of the second camera 204.
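A sketch of this estimation with OpenCV's RANSAC-based PnP solver is shown below; it assumes the test key points' 3D positions in the first (scene) coordinate and their pixel locations in a second test image are already available, and the intrinsic matrix of the second camera shown here is illustrative.

import cv2
import numpy as np

def estimate_correlation(points_first_coord, pixels_second_image, camera_matrix):
    # points_first_coord:  (N, 3) test key point positions in the first coordinate.
    # pixels_second_image: (N, 2) corresponding pixel locations in the second test image.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_first_coord.astype(np.float32),
        pixels_second_image.astype(np.float32),
        camera_matrix, None)
    if not ok:
        raise RuntimeError("PnP failed; check the test key point correspondences")
    R, _ = cv2.Rodrigues(rvec)        # 3x3 rotation between the two coordinates
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = tvec.ravel()           # 4x4 rotation and translation matrix
    return T

# Illustrative intrinsics of the second camera (focal lengths and principal point).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])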

[0055] Stated in another way, the physical correlation between the first coordinate and the second coordinate is determined by obtaining, from the first camera 202, one or more first test images 600 of a scene in which a plurality of test key points 608, 612, or 614 exist to be detected from the one or more first test images and obtaining, from the second camera 204, one or more second test images 620 of the scene in which a plurality of second test key points 608, 612, or 614 in Figure 6B exist to be detected from the one or more second test images 620. The second test images 620 are captured concurrently with the first test images 600. The first test key points have known physical positions in the scene with respect to the second test key points. The first and second test key points are detected from the first and second test images 600 and 620, respectively.

[0056] In accordance with the known physical positions of the first test key points in the scene with respect to the second test key points, the physical correlation of the first and second coordinates is determined. In an example, the first test key points detected from the first test images 600 are the key points 608, and the second test key points detected from the second test images 620 are the corners 612 and middle points 614. Locations of these key points 608 in each first test image 600 are applied to derive locations of the corners 612 and middle points 614 in the first test image 600. The derived locations of the corners 612 and middle points 614 in the first test images 600 are compared with the corresponding locations in the second test images 620 to determine the physical correlation. More specifically, the plurality of first test key points includes a first marker unit (e.g., 608A) and a second marker unit (e.g., 608B), and the plurality of second test key points includes a third marker unit (e.g., 614A) that is located at a middle point between positions of the first and second marker units.

[0057] Further, in some embodiments, the one or more first test images 600 include a sequence of first image frames, and each first test marker is included in a subset of the first test images 600. The one or more second test images 620 include a sequence of second image frames, and each second test marker corresponds to a respective subset of first test key points and is included in a subset of the second test images 620. Additionally, in some embodiments, at least a subset of the first and second test key points are attached on a checker board. The checker board is moved among a plurality of board poses in the scene and recorded in both the sequence of first image frames 600 and the sequence of second image frames 620. Every two of the board poses have different positions or different orientations with respect to one another.

[0058] In some embodiments, the one or more first test images include a single first test image 600, and the one or more second test images include a single second test image 620. Each second test key point corresponds to (e.g., is derived from) a respective subset of first test key points. Additionally, in some embodiments not shown in Figures 6A-6C, the first and second test key points are marked on a three-dimensional (3D) box having a plurality of sides, and each side is covered by checker board patterns. The 3D box is recorded in both the first and second test images 600 and 620.

[0059] Figure 7 is a flow chart of a process 700 for annotating key points in a second image captured by a second camera 204 based on a plurality of first images captured by a plurality of first cameras 202, in accordance with some embodiments. The plurality of first cameras 202 captures (702) the plurality of first images, each covering a respective portion of a scene where the first and second cameras 202 and 204 are located. Each first image is a 2D image. In some situations, an object exists in a first subset of the first images, and does not exist in a second subset of the first images due to occlusion. The first and second subsets of the first images are used (704) jointly by way of volumetric triangulation to locate key points of the object accurately in the scene. In some embodiments, the positions of the plurality of first key points are identified (706) in a first coordinate of the scene associated with the plurality of first cameras 202. A second image of the scene is captured (708) by a second camera 204 concurrently with the plurality of first cameras 202. In some embodiments, the second image includes a plurality of second key points corresponding to a subset of the plurality of first key points captured in a subset of the first images. The second image is matched (710) with each first image to identify the subset of the first images. The plurality of second key points are represented in a second coordinate of the second camera 204. More details on volumetric triangulation are explained below with reference to Figure 8.

[0060] A physical correlation is determined between the first coordinate of the scene and the second coordinate of the second camera 204. In some embodiments, the first cameras 202 are fixed in the scene, and the first coordinate of the first cameras 202 is associated with an absolute coordinate that is fixed with respect to the scene. The second camera 204 moves in the scene, and the second coordinate of the second camera 204 includes a relative coordinate that varies with respect to the absolute coordinate. In some embodiments, the physical correlation between the first and second coordinates includes a rotation and translation matrix. In accordance with the physical correlation of the first and second coordinates, the positions of the plurality of first key points are converted (712) from the first coordinate to the second coordinate. The second image is automatically labeled (714) with the plurality of second key points based on the converted positions of the subset of first key points in the second coordinate. In some embodiments, the second image includes an RGB image, a time-of-flight image, or both of them.
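A minimal sketch of the conversion and labeling steps (712, 714) follows, assuming the physical correlation is available as a 4x4 rotation-and-translation matrix and the second camera's intrinsic matrix is known; the perspective projection used to place the 2D labels and the function name are illustrative choices, not details fixed by the disclosure.

```python
import numpy as np

def convert_and_label(key_points_first, T_first_to_second, K_second):
    """Convert 3D key point positions from the first (scene) coordinate to the second
    camera's coordinate using a 4x4 rotation-and-translation matrix, then project them
    with the second camera's intrinsics to obtain 2D labels for the second image.

    key_points_first: (N, 3) positions in the first coordinate.
    T_first_to_second: (4, 4) rotation-and-translation matrix.
    K_second: (3, 3) intrinsic matrix of the second camera.
    Returns (N, 3) positions in the second coordinate and (N, 2) pixel labels.
    """
    pts = np.asarray(key_points_first, dtype=np.float64)
    pts_h = np.c_[pts, np.ones(len(pts))]                  # homogeneous coordinates
    pts_second = (T_first_to_second @ pts_h.T).T[:, :3]    # positions in second coordinate
    proj = (K_second @ pts_second.T).T
    labels_2d = proj[:, :2] / proj[:, 2:3]                 # perspective divide
    return pts_second, labels_2d
```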

[0061] In some embodiments, after the positions of the plurality of key points are identified in the second coordinate, an additional key point position is interpolated from the converted positions of the plurality of key points in the second coordinate, and associated with an additional key point that is not among the plurality of key points. In some embodiments, a subset of the converted positions of the plurality of key points are fitted to a human body 206. A geometric center is identified for the subset of the converted positions of the plurality of key points. A location of the human body is identified for the subset of the converted positions of the plurality of key points based on the geometric center. For example, a subset of the plurality of key points correspond to eyes, ears, mouth, and neck, and are used to identify a geometric center corresponding to a head of the human body 206. In some embodiments, the second image labelled with the key points is used as training data to train a deep learning model for subsequent use. This deep learning model is independent of the data processing model used to identify the first key points in the first images captured by the plurality of first cameras 202.

[0062] In some embodiments, the process 700 is implemented in a markerless manner, i.e., it does not require the human body 206 to wear a specific suit or physical markers. A customized calibration pipeline enables generation of human labeling data from external devices. Specifically, calibration methods conveniently synchronize the clocks of a server 102 and the different cameras 202 and 204, and calibrate the coordinate systems of the cameras 202 and 204. A data-driven volumetric triangulation technique is applied to accurately compute 3D joint locations with first cameras 202 disposed at different viewing angles. In some embodiments, a data-driven deep learning network is applied to fuse multi-view data collected from the plurality of first cameras 202 into accurate 3D human joint labels (e.g., in a 3D map). By these means, a computer system can be coupled to multiple ToF or RGB cameras and applied to automatically and accurately label 2D/3D human joint positions on images captured by unsynchronized camera devices.
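Referring back to the geometric-center example in paragraph [0061], the sketch below interpolates a "head" key point as the geometric center of the converted eye, ear, mouth, and neck key points; the key point indices are hypothetical and depend on the skeleton convention used by the data processing model.

```python
import numpy as np

# Illustrative indices for the face/neck subset of key points; the actual ordering
# depends on the skeleton convention adopted for the labelled data.
FACE_SUBSET = {"left_eye": 0, "right_eye": 1, "left_ear": 2,
               "right_ear": 3, "mouth": 4, "neck": 5}

def interpolate_head_center(key_points_second):
    """Interpolate an additional 'head' key point as the geometric center of the
    converted eye, ear, mouth, and neck key points in the second coordinate."""
    pts = np.asarray(key_points_second, dtype=np.float64)
    subset = pts[list(FACE_SUBSET.values())]
    return subset.mean(axis=0)        # geometric center used as the new key point
```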

[0063] Figure 8 is a flow chart of a process 800 for identifying key points in a plurality of first images captured by a plurality of first cameras 202, e.g., using volumetric triangulation, in accordance with some embodiments. The plurality of first cameras 202 are fixed at different locations of a scene, and capture the plurality of first images concurrently. Image data including the plurality of first images are passed into a 2D backbone (e.g., ResNet-152) to generate a plurality of 2D feature maps. For example, a first camera 202A captures (802) a first image from which a first 2D feature map is extracted (804) using the 2D backbone, and another first camera 202B captures (806) another first image from which a second 2D feature map is extracted (808) using the 2D backbone. The 2D feature maps extracted from the plurality of first images are projected (810) into a plurality of volumes, e.g., with a per-view aggregation. Specifically, in some embodiments, each 2D feature map corresponds to a respective first image captured by a respective first camera 202, and is projected to a volume based on a camera position and orientation associated with the respective first camera 202.

[0064] The volumes are passed (812) into a 3D convolutional neural network (CNN) to produce a plurality of 3D heatmaps. The 3D CNN is an example of a heatmap neural network. A softmax function (e.g., a normalized exponential function) is applied (814) to determine 3D positions of key points from the plurality of 3D heatmaps, thereby improving an accuracy level of the 3D positions of the key points determined from the first images. As such, a data processing model includes the 2D backbone, 3D CNN, and softmax function, and is trained before being applied to infer the 3D positions of key points in the first images. Different neural networks in the data processing model are trained either separately or jointly. In some embodiments, the data processing model is trained in a server 102, and each first image is provided to the server 102 to be processed using the data processing model. Alternatively, in some embodiments, the data processing model is trained in a server 102 and provided to the first or second camera, which processes each first image using the data processing model.
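A simplified PyTorch sketch of the per-view aggregation (810) and the heatmap neural network (812) is given below. It is not the disclosed network: the 2D backbone is omitted, the 3D CNN is reduced to a few layers, the views are fused by simple averaging, and the voxel grid, projection matrices, and tensor names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def unproject_features(feature_maps, projections, grid_xyz):
    """Per-view aggregation: project a 3D voxel grid into each first camera, sample the
    corresponding 2D feature map, and average the per-view volumes.

    feature_maps: list of V tensors, each (C, H, W), one per first camera.
    projections: list of V (3, 4) projection matrices (intrinsics @ [R | t]).
    grid_xyz: (D, Hv, Wv, 3) world coordinates of the voxel centers.
    Returns an aggregated volume of shape (C, D, Hv, Wv).
    """
    D, Hv, Wv, _ = grid_xyz.shape
    pts = torch.cat([grid_xyz.reshape(-1, 3),
                     torch.ones(D * Hv * Wv, 1)], dim=1)              # homogeneous (N, 4)
    volumes = []
    for fmap, P in zip(feature_maps, projections):
        uvw = (P @ pts.T).T                                           # (N, 3)
        uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)                 # pixel coordinates
        C, H, W = fmap.shape
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                            uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
        sampled = F.grid_sample(fmap.unsqueeze(0), grid, align_corners=True)
        volumes.append(sampled.view(C, D, Hv, Wv))
    return torch.stack(volumes).mean(dim=0)                           # average over views

class HeatmapNet3D(nn.Module):
    """A minimal 3D CNN producing one 3D heatmap per key point (J output channels)."""
    def __init__(self, in_channels, num_joints):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv3d(32, num_joints, 1),
        )

    def forward(self, volume):            # volume: (B, C, D, Hv, Wv)
        return self.net(volume)           # (B, J, D, Hv, Wv) 3D heatmaps
```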

[0065] Figure 9 is a flowchart of a method for annotating images automatically, in accordance with some embodiments. For convenience, the method 900 is described as being implemented by a computer system (e.g., a client device 104, a server 102, or a combination thereof). An example of the client device 104 is a mobile phone 104C or AR glasses 104E. In an example, the method 900 is applied to annotate key points of a human body captured in images. The images may be, for example, captured by a second camera (e.g., a camera of a mobile phone 104C or AR glasses 104E) and either annotated locally or streamed to a server 102 (e.g., for storage at storage 106 or a database associated with the server 102) for annotation. The same human body is also contained in one or more first images captured by a subset of a plurality of first cameras, and markers associated with the key points to be annotated, or those key points themselves, can be easily recognized in the first images and used to guide annotation of the key points in the second image.

[0066] Method 900 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 9 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 1006 of the computer system 1000 in Figure 10). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 900 may be combined and/or the order of some operations may be changed.

[0067] The computer system obtains (902) a plurality of first images of a scene captured concurrently by a plurality of first cameras (e.g., RGB cameras, ToF cameras). Each first image is captured by a respective first camera that is disposed in a distinct location in the scene. In some embodiments, the plurality of first images are captured concurrently when the plurality of first images are captured within a temporal window (e.g., within 5 milliseconds). The computer system generates (904) a plurality of two-dimensional (2D) feature maps from the plurality of first images. Each first image corresponds to a respective subset of 2D feature maps. In some embodiments, for each first image, the respective subset of 2D feature maps are generated from the respective first image using a respective backbone neural network, and the respective backbone neural network is trained separately from, or jointly end-to-end with, the heatmap neural network and other respective backbone neural networks.
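A minimal sketch of the 2D feature map generation (904) with a truncated torchvision ResNet-152 (the backbone named as an example in paragraph [0063]) is shown below; the class name, input resolution, and omission of pretrained weights are illustrative choices, and the truncation point is one common way to expose feature maps rather than the specific backbone configuration of the disclosure.

```python
import torch
import torch.nn as nn
from torchvision import models

class Backbone2D(nn.Module):
    """Truncated ResNet-152 used as a 2D backbone: keeps the convolutional stages and
    drops the global pooling and classification head, so the output is a 2D feature map."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(weights=None)          # pretrained weights omitted in this sketch
        self.features = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, image):                            # image: (B, 3, H, W)
        return self.features(image)                      # (B, 2048, H / 32, W / 32)

# One feature map (subset) per first image.
backbone = Backbone2D()
first_image = torch.randn(1, 3, 256, 256)                # placeholder first image
feature_map = backbone(first_image)                      # torch.Size([1, 2048, 8, 8])
```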

[0068] The computer system projects (906) the plurality of feature maps into a plurality of aggregated volumes of the scene and generates (908) a plurality of three-dimensional (3D) heatmaps corresponding to the plurality of aggregated volumes of the scene using a heatmap neural network. In some embodiments, a normalized exponential function is applied on each of the plurality of 3D heatmaps to identify a position of a respective one of the plurality of key points in the scene. Automatically and without user intervention, the computer system identifies (910) positions of a plurality of key points in the scene from the plurality of 3D heatmaps. Each key point corresponds to a joint of a person in the scene.
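One common way to realize step 910 with the normalized exponential function is a soft-argmax over each 3D heatmap, sketched below under the assumption that the world coordinates of the voxel centers are available as a grid; the function name is illustrative.

```python
import torch

def soft_argmax_3d(heatmap, grid_xyz):
    """Apply the normalized exponential function over a 3D heatmap and take the
    expectation of the voxel coordinates, yielding a sub-voxel key point position.

    heatmap: (D, H, W) raw 3D heatmap for one key point.
    grid_xyz: (D, H, W, 3) world coordinates of the voxel centers.
    """
    weights = torch.softmax(heatmap.reshape(-1), dim=0)        # normalized exponential
    coords = grid_xyz.reshape(-1, 3)
    return (weights.unsqueeze(1) * coords).sum(dim=0)          # expected 3D position

# Usage: one position per key point, computed from the per-joint 3D heatmaps.
# positions = torch.stack([soft_argmax_3d(h, grid_xyz) for h in heatmaps_3d])
```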

[0069] In some embodiments, the positions of the plurality of key points are identified (912) in a first coordinate of the scene. The computer system obtains (914) a second image of the scene captured by a second camera 204 concurrently with the plurality of first images, and determines (916) a correlation between the first coordinate of the scene and a second coordinate of the second camera 204. In some embodiments, the second camera is configured to capture a color image, a monochromatic image, or a depth image. In some embodiments, the plurality of first images are captured concurrently with the second image when the plurality of first images are captured within a temporal window (e.g., within 5 milliseconds) of the second image. In accordance with the correlation of the first and second coordinates, the computer system converts (918) the positions of the plurality of key points from the first coordinate to the second coordinate, and automatically labels (920) the second image with the plurality of key points based on the converted positions of the plurality of key points in the second coordinate. Further, in some embodiments, the correlation between the first coordinate and the second coordinate includes a plurality of displacement parameters associated with a 3D displacement between the first and second coordinates and a plurality of rotation parameters associated with a 3D rotation between the first and second coordinates.

[0070] Additionally, in some embodiments, the computer system interpolates an additional key point position from the converted positions of the plurality of key points in the second coordinate, and associates the additional key point position with an additional key point that is not among the plurality of key points. In some embodiments, a subset of the converted positions of the plurality of key points are fitted to a human body. The computer system identifies a geometric center of the subset of the converted positions of the plurality of key points. A location of the human body is identified for the subset of the converted positions of the plurality of key points based on the geometric center. In some embodiments, the computer system uses the second image labelled with the plurality of key points to train a deep learning model.

[0071] Further, in some embodiments, the correlation is determined between the first coordinate and the second coordinate using a plurality of first test images and one or more second test images. The computer system obtains, from the plurality of first cameras 202, the plurality of first test images of the scene and obtains, from the second camera 204, the one or more second test images of the scene. The second test images are captured concurrently with the first test images. The computer system detects a position of a first test point in the first coordinate of the scene from the plurality of first test images, and a position of a second test point in the second coordinate of the second camera 204. The second test point has a known physical position in the first coordinate with respect to the first test point; that is, the second test point overlaps, or has a known displacement with respect to, the first test point. In accordance with the known physical position of the second test point in the scene with respect to the first test point, the computer system derives the correlation of the first and second coordinates. In this application, a test point is also called a test key point.

[0072] It should be understood that the particular order in which the operations in Figure 9 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to annotate key points in images as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 2-8 are also applicable in an analogous manner to the method 900 described above with respect to Figure 9. For brevity, these details are not repeated here.

[0073] Figure 10 is a block diagram illustrating a computer system 1000, in accordance with some embodiments. The computer system 1000 includes a server 102, a client device 104, a storage 106, or a combination thereof. The computer system 1000 is configured to implement any of the above methods in Figures 3-9. The computer system 1000, typically, includes one or more processing units (CPUs) 1002, one or more network interfaces 1004, memory 1006, and one or more communication buses 1008 for interconnecting these components (sometimes called a chipset). The computer system 1000 includes one or more input devices 1010 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the computer system 1000 uses a microphone and voice recognition or a camera and gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras 202 or 204, scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The computer system 1000 also includes one or more output devices 1012 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geolocation receiver, for determining the location of the client device 104.

[0074] Memory 1006 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 1006, optionally, includes one or more storage devices remotely located from one or more processing units 1002. Memory 1006, or alternatively the non-volatile memory within memory 1006, includes a non-transitory computer readable storage medium. In some embodiments, memory 1006, or the non-transitory computer readable storage medium of memory 1006, stores the following programs, modules, and data structures, or a subset or superset thereof:

• Operating system 1014 including procedures for handling various basic system services and for performing hardware dependent tasks;

• Network communication module 1016 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 1004 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

• User interface module 1018 for enabling presentation of information (e.g., a graphical user interface for application(s) 1024, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 1012 (e.g., displays, speakers, etc.);

• Input processing module 1020 for detecting one or more user inputs or interactions from one of the one or more input devices 1010 and interpreting the detected input or interaction;

• Web browser module 1022 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

• One or more user applications 1024 for execution by the computer system 1000 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);

• Model training module 1026 for receiving training data (e.g., training data 1042) and establishing a data processing model (e.g., data processing module 1028) for processing content data (e.g., video data, visual data, audio data) to be collected or obtained by a client device 104;

• Data processing module 1028 for processing content data using data processing models 1044, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 1028 is associated with one of the user applications 1024 to process the content data in response to a user instruction received from the user application 1024;

• Camera calibration module 1030 for calibrating two cameras in time and in space, where a physical correlation is identified between a first coordinate of a plurality of first cameras and a second coordinate of a second camera;

• Keypoint annotation module 1032 for leveraging a first camera’s capability of identifying locations of key points of an object (e.g., a human body 206), converting the identified key point locations to locations of key points in a second image captured by a second camera, and annotating the key points in the second image; and

• One or more databases 1034 for storing at least data including one or more of:
o Device settings 1036 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 1038 for the one or more user applications 1024, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 1040 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
o Training data 1042 for training one or more data processing models 1044;
o Data processing model(s) 1044 for processing content data (e.g., video data, visual data, audio data) using deep learning techniques; and
o Content data and results 1046 that are obtained by and outputted to the client device 104 of the computer system 1000, respectively, where the content data includes images captured by the first and second cameras 202 and 204, locations of key points in the first images captured by the first cameras 202, and/or information of key points annotated in the second images captured by the second camera 204.

[0075] Optionally, the one or more databases 1034 are stored in one of the server 102, client device 104, and storage 106 of the computer system 1000. Optionally, the one or more databases 1034 are distributed in more than one of the server 102, client device 104, and storage 106 of the computer system 1000. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 1044 are stored at the server 102 and storage 106, respectively.

[0076] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 1006, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 1006, optionally, stores additional modules and data structures not described above.

[0077] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[0078] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

[0079] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.

[0080] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.