

Title:
DUAL CAMERA BASED MONOCULAR SLAM ARCHITECTURE
Document Type and Number:
WIPO Patent Application WO/2023/277903
Kind Code:
A1
Abstract:
This application is directed to pose estimation or prediction and image rendering, e.g., in extended reality. An electronic device obtains a sequence of first images captured by a first camera and a sequence of second images captured by a second camera concurrently with the sequence of first images. A plurality of first camera poses of the first camera are determined based on at least the sequence of first images, and each first image corresponds to at least one first camera pose. A plurality of second camera poses of the second camera are determined from the plurality of first camera poses of the first camera based on a pre-determined coordinate correlation. In accordance with the plurality of second camera poses, the electronic device renders display of a virtual object on top of the sequence of second images.

Inventors:
ZHANG FUYAO (US)
CHEN DINGYI (US)
DENG FAN (US)
XIONG WEIFENG (US)
Application Number:
PCT/US2021/039858
Publication Date:
January 05, 2023
Filing Date:
June 30, 2021
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06T7/70; G06T7/246; G06T7/73; G06T7/80
Foreign References:
US20210191421A12021-06-24
US20200367970A12020-11-26
US20130148851A12013-06-13
US20120148145A12012-06-14
US20170214899A12017-07-27
Other References:
GEORGES YOUNES; DANIEL ASMAR; ELIE SHAMMAS; JOHN ZELEK: "Keyframe-based monocular SLAM: design, survey, and future directions", arXiv.org, Cornell University Library, Ithaca, NY, 2 July 2016 (2016-07-02), XP081351338, DOI: 10.1016/j.robot.2017.09.010
Attorney, Agent or Firm:
WANG, Jiambai et al. (US)
Claims:
What is claimed is:

1. A method for rendering an image, comprising: obtaining a sequence of first images captured by a first camera; obtaining a sequence of second images captured by a second camera concurrently with the sequence of first images; determining a plurality of first camera poses of the first camera based on at least the sequence of first images, each first image corresponding to at least one first camera pose; determining a plurality of second camera poses of the second camera from the plurality of first camera poses of the first camera based on a pre-determined coordinate correlation; and in accordance with the plurality of second camera poses, rendering display of a virtual object on top of the sequence of second images.

2. The method of claim 1, further comprising obtaining a plurality of motion sensor data measured concurrently with the sequence of first images and the sequence of second images, wherein the plurality of first camera poses are determined based on the sequence of first images and the plurality of motion sensor data using simultaneous localization and mapping (SLAM).

3. The method of claim 2, wherein each second image corresponds to a respective second camera pose and a second pose time, and the respective second camera pose is determined from a corresponding first camera pose based on the pre-determined coordinate correlation.

4. The method of claim 3, wherein for each second image, the corresponding first camera pose from which the respective second camera pose is determined is selected, interpolated, or extrapolated from the plurality of first camera poses based on the second pose time of the respective second camera pose.

5. The method of claim 1, further comprising obtaining a plurality of motion sensor data measured concurrently with the sequence of first images and the sequence of second images, wherein: the plurality of first camera poses of the first camera are determined based on the sequence of first images; determining the second camera poses of the second camera from the first camera poses of the first camera based on the pre-determined coordinate correlation further includes: converting the first camera poses based on the pre-determined coordinate correlation; and determining a plurality of preliminary camera poses based on the converted first camera poses and the plurality of motion sensor data.

6. The method of claim 5, wherein each second image corresponds to a respective second camera pose and a second pose time, and the respective second camera pose is selected, interpolated, or extrapolated from the plurality of preliminary camera poses based on the second pose time of the respective second camera pose.

7. The method of any of claims 2-6, wherein: the sequence of first images have a first frame rate; the sequence of second images have a second frame rate that is distinct from the first frame rate; and the plurality of motion sensor data have a sensor sampling rate that is greater than the first and second frame rates.

8. The method of any of claims 2-6, wherein: each of the sequence of first images is obtained with a respective first time stamp indicating a respective first image time when the respective first image is captured; each of the sequence of second images is obtained with a respective second time stamp indicating a respective second image time when the respective second image is captured; each motion sensor data is obtained with a respective sensor time stamp indicating a respective sensor time when the respective motion sensor data is measured; and each of the first and second camera poses is associated with a respective pose time determined by one of the respective first image time, second image time, and sensor time.

9. The method of any of the preceding claims, wherein the display of the virtual object is rendered on the top of each second image with a respective second camera pose corresponding to a second pose time that is synchronous to one of (1) a second image time when the respective second image is available for image rendering and (2) a display time when the virtual object is rendered.

10. The method of any of the preceding claims, wherein the sequence of first images are captured by the first camera according to a first camera setting configured to facilitate simultaneous localization and mapping (SLAM) associated with the first camera, and the sequence of second images are captured by the second camera according to a second camera setting configured to facilitate rendering the display of the virtual object.

11. The method of claim 10, wherein the first camera setting does not include automatic adjustment of a focal length and an exposure time of the first camera such that a latency of each first image captured by the first camera is minimized, and the second camera setting includes autofocus or exposure configurations that are selected freely by a user or a program developer, such that an image quality of each second image captured by the second camera is adjusted to render the display of the virtual object.

12. The method of any of the preceding claims, further comprising: obtaining object-oriented information of the virtual object based on at least the plurality of second camera poses of the second camera, wherein the display of the virtual object is rendered on top of the sequence of second images based on the object-oriented information.

13. The method of any of the preceding claims, wherein the first camera and the second camera are integrated in the same electronic device, the first camera being a wide-angle camera, the second camera being an RGB camera.

14. The method of any of the preceding claims, an electronic device including both the first camera and the second camera, the method further comprising: mapping a scene where the first camera is disposed using the sequence of first images, wherein the plurality of first camera poses of the first camera are determined in the scene; and providing mapping information of the scene to a second electronic device.

15. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-14.

16. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-14.

Description:
Dual Camera Based Monocular SLAM Architecture

TECHNICAL FIELD

[0001] This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for localizing a camera in an environment and mapping the environment efficiently in an extended reality application that is executed by an electronic device.

BACKGROUND

[0002] Simultaneous localization and mapping (SLAM) is widely applied in virtual reality (VR), augmented reality (AR), autonomous driving, and navigation. In SLAM, high frequency pose estimation is enabled by sensor fusion. Asynchronous time warping (ATW) is often applied with SLAM in an AR system to warp an image before it is sent to a display to correct for head movement that occurs after the image is rendered. In both SLAM and ATW, relevant image data and inertial sensor data are synchronized and used for estimating and predicting camera poses, while the same images are also used as background images to render virtual objects according to the camera poses. Camera settings are optimized based on requirements of the SLAM and ATW, which unavoidably compromise image qualities of the resulting images. It would be beneficial to have a more efficient approach to achieve both accurate estimation of camera poses and rendering of high quality images.

SUMMARY

[0003] Various embodiments of this application are directed to applying two separate cameras to capture separate images for camera pose estimation and image rendering in an extended reality application. Camera settings of the two separate cameras can be optimized for their separate purposes. A SLAM-based camera pose estimation process is highly sensitive and has special requirements for input image data, and therefore, a first camera (e.g., a wide-angle camera) is applied to meet the special requirements for the SLAM-related input image data. A second camera (e.g., an RGB camera) meets distinct requirements of a high quality camera preview for image rendering at the same time. Camera settings of the first camera are optimized and fixed for SLAM. Camera settings of the second camera can be freely adjusted automatically or by a developer or a user of the extended reality application to provide high quality images on which virtual objects are overlaid. Specifically, the first camera disables autofocus and automatic exposure functions, provides an image refresh rate that matches a maximum image refresh rate acceptable to a SLAM algorithm, and has a relatively large field of view that can give more feature points. Such camera settings of the first camera may result in blurry and/or dim camera preview images and cannot support rendering of a smooth virtual AR scene. In contrast, the second camera can enable the autofocus and automatic exposure functions, provide an image refresh rate that is not limited by the SLAM algorithm, and use a field of view that is not distorted and provides high fidelity images. By these means, application of two separate cameras is an efficient approach to estimate camera poses accurately and render high quality images, and can enhance user experience with extended reality.
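For illustration only, the contrast between the two camera configurations described above can be sketched as follows; the structure, field names, and numeric values are assumptions introduced for this sketch and are not settings defined by this application.

    // Illustrative only: a hypothetical settings structure contrasting the two cameras.
    struct CameraSettings {
        bool  autofocus;        // automatic focal-length adjustment
        bool  autoExposure;     // automatic exposure-time adjustment
        float fieldOfViewDeg;   // horizontal field of view
        int   frameRateFps;     // image refresh rate
    };

    // First camera (e.g., wide-angle): fixed focus and exposure to keep latency low,
    // wide field of view for more feature points, refresh rate capped at the maximum
    // rate acceptable to the SLAM algorithm.
    const CameraSettings kSlamCameraSettings{false, false, 120.0f, 30};

    // Second camera (e.g., RGB): autofocus and automatic exposure enabled, refresh
    // rate not limited by SLAM, undistorted field of view for a high-fidelity preview.
    const CameraSettings kPreviewCameraSettings{true, true, 78.0f, 60};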

[0004] In one aspect, a method is implemented at an electronic device for rendering an image. The method includes obtaining a sequence of first images captured by a first camera and obtaining a sequence of second images captured by a second camera concurrently with the sequence of first images. The method further includes determining a plurality of first camera poses of the first camera based on at least the sequence of first images. Each first image corresponds to at least one first camera pose. The method further includes determining a plurality of second camera poses of the second camera from the plurality of first camera poses of the first camera based on a pre-determined coordinate correlation, and in accordance with the plurality of second camera poses, rendering display of a virtual object on top of the sequence of second images.
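A minimal sketch of this flow, written in C++ with stub types and helper functions that are assumptions introduced only to mirror the recited sequence of operations, is shown below.

    #include <cstddef>
    #include <vector>

    // Illustrative stubs only; these are not components defined by this application.
    struct Image { double timestamp; /* pixel data omitted */ };
    struct Pose  { double timestamp; /* rotation and translation omitted */ };

    Pose estimateFirstCameraPose(const Image& firstImage) {   // stands in for SLAM on the first camera
        return Pose{firstImage.timestamp};
    }
    Pose convertToSecondCamera(const Pose& firstPose) {       // stands in for the pre-determined coordinate correlation
        return firstPose;                                     // a real system applies a fixed extrinsic transform here
    }
    void renderVirtualObject(const Image&, const Pose&) {}    // stands in for pose-based rendering

    void processFrames(const std::vector<Image>& firstImages,
                       const std::vector<Image>& secondImages) {
        std::vector<Pose> firstPoses;
        for (const Image& img : firstImages)                  // at least one first camera pose per first image
            firstPoses.push_back(estimateFirstCameraPose(img));

        for (std::size_t i = 0; i < secondImages.size() && i < firstPoses.size(); ++i) {
            Pose secondPose = convertToSecondCamera(firstPoses[i]);   // second camera pose
            renderVirtualObject(secondImages[i], secondPose);         // virtual object over the second image
        }
    }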

[0005] In some embodiments, the method further includes obtaining a plurality of motion sensor data measured concurrently with the sequence of first images and the sequence of second images. The plurality of first camera poses are determined based on the sequence of first images and the plurality of motion sensor data using simultaneous localization and mapping (SLAM). Alternatively, in some embodiments, a plurality of motion sensor data are measured concurrently with the sequence of first images and the sequence of second images. The plurality of first camera poses of the first camera are determined based on the sequence of first images, e.g., without using the motion sensor data. The second camera poses of the second camera are determined from the first camera poses of the first camera based on the pre-determined coordinate correlation by converting the first camera poses based on the pre-determined coordinate correlation and determining a plurality of preliminary camera poses based on the converted first camera poses and the plurality of motion sensor data. Further, in some embodiments, each second image corresponds to a respective second camera pose and a second pose time, and the respective second camera pose is selected, interpolated, or extrapolated from the plurality of preliminary camera poses based on the second pose time of the respective second camera pose. Additionally, in some embodiments, the first camera of the electronic device includes a wide-angle camera, a LiDAR scanner, or a fisheye camera, which is configured to capture the sequence of first images for use in SLAM.
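The selection or interpolation of a camera pose at a requested second pose time, as described above, can be sketched as follows. Only translation is interpolated here, orientation handling is omitted for brevity, and the TimedPose type and poseAtTime function are illustrative assumptions rather than elements of this application.

    #include <algorithm>
    #include <array>
    #include <vector>

    struct TimedPose {
        double timestamp;
        std::array<double, 3> position;   // orientation omitted in this sketch
    };

    // Preliminary poses are assumed non-empty and sorted by timestamp.
    TimedPose poseAtTime(const std::vector<TimedPose>& poses, double t) {
        auto upper = std::lower_bound(poses.begin(), poses.end(), t,
            [](const TimedPose& p, double time) { return p.timestamp < time; });

        if (upper == poses.begin()) return poses.front();  // before the first pose: select it
        if (upper == poses.end())   return poses.back();   // after the last pose: select the latest
                                                           // (extrapolation is described with Figure 4A)
        const TimedPose& a = *(upper - 1);                 // bracketing poses: interpolate linearly
        const TimedPose& b = *upper;
        double s = (t - a.timestamp) / (b.timestamp - a.timestamp);
        TimedPose out{t, {}};
        for (int i = 0; i < 3; ++i)
            out.position[i] = a.position[i] + s * (b.position[i] - a.position[i]);
        return out;
    }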

[0006] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0007] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0009] Figure 1A is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments. Figure 1B illustrates a pair of augmented reality (AR) glasses (also called a head-mounted display) that can be communicatively coupled to a data processing environment, in accordance with some embodiments.

[0010] Figure 2 is a block diagram illustrating a data processing system, in accordance with some embodiments.

[0011] Figure 3 is a flowchart of a process for processing inertial sensor data and image data of an electronic system (e.g., a server, a client device, or a combination of both) using a SLAM module, in accordance with some embodiments.

[0012] Figure 4A is a temporal diagram illustrating a plurality of parallel temporal threads of inertial sensor data, first image data, second image data, and third image data, in accordance with some embodiments.

[0013] Figure 4B is another temporal diagram illustrating a plurality of parallel temporal threads of inertial sensor data, first image data, second image data, and rendered virtual objects, in accordance with some embodiments.

[0014] Figure 5 is a block diagram illustrating a process of applying two cameras for separate pose estimation and image rendering in an extended reality application, in accordance with some embodiments.

[0015] Figure 6 is a block diagram illustrating a plurality of processing threads implemented in an Android operating system (OS), in accordance with some embodiments.

[0016] Figure 7A is an image rendered based on a single camera monocular SLAM system, in accordance with some embodiments, and Figure 7B is an image rendered based on a dual camera monocular SLAM system of this application, in accordance with some embodiments.

[0017] Figure 8 is a flowchart of a method for rendering an image, in accordance with some embodiments.

[0018] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0019] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic systems with digital video capabilities.

[0020] Various embodiments of this application are directed to using two cameras to render extended reality content (e.g., virtual, augmented, or mixed reality content) on an electronic device. A first camera captures images processed for camera localization and environment mapping, and a second camera captures a distinct set of images that are used as either a background for displaying virtual objects or as reality content to be mixed with a virtual environment. Camera settings of the second camera support zooming and exposure control, allowing users to easily shoot high-quality images to be combined with extended reality content (e.g., a virtual object or a virtual environment). Specifically, this invention uses the first camera and the second camera to meet different requirements for input images in SLAM and image rendering, respectively. The different requirements are inconsistent under some circumstances.

[0021] Specifically, the first camera is configured to provide input images each of which has a latency less than a latency threshold and/or a number of feature points greater than a threshold feature point number. In an example, the first camera includes a wide-angle camera and utilizes hardware performance advantages of the wide-angle camera to obtain a larger field of view angle. Images captured by the wide-angle camera provide prompt response and sufficient feature points required by SLAM. Conversely, the second camera has camera settings that are configurable by an application developer or a user to satisfy needs of computer vision algorithms. For example, an RGB camera is used as the second camera and has camera settings allowing autofocus and automatic exposure adjustment. In some situations, the camera settings of the second camera are predefined by the application developers to incorporate automatic image processing (e.g., a filter function) of the input images captured by the second camera. Such automatic adjustment and/or image processing provides high-quality images, but compromises the latency of each image captured by the second camera. By these means, the computer vision algorithms do not need to share camera resources with the SLAM algorithm, and can easily receive customized high-quality images from the second camera (e.g., the RGB camera) to focus on visual effects and user experience of the high-quality images themselves.
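A minimal sketch of checking the two first-camera input requirements named above (latency below a threshold and a sufficient number of feature points) follows; the FirstImageStats type and the threshold values are placeholders chosen for illustration, not values specified by this application.

    struct FirstImageStats {
        double latencyMs;      // time from capture to availability for SLAM
        int    featurePoints;  // feature points detected in the first image
    };

    bool suitableForSlam(const FirstImageStats& s,
                         double latencyThresholdMs = 35.0,   // placeholder threshold
                         int minFeaturePoints = 100) {       // placeholder threshold
        return s.latencyMs < latencyThresholdMs && s.featurePoints > minFeaturePoints;
    }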

[0022] Figure 1A is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, or intelligent, multi-sensing, network-connected home devices (e.g., a camera). In some implementations, the one or more client devices 104 include a head-mounted display 150 configured to render extended reality content. Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally (e.g., for training and/or for prediction) at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104. For example, storage 106 may store video content (including visual and audio content), static visual content, and/or inertial sensor data for training a machine learning model (e.g., a deep learning network). Alternatively, storage 106 may also store video content, static visual content, and/or inertial sensor data obtained by a client device 104 to which a trained machine learning model can be applied to determine one or more poses associated with the video content, static visual content, and/or inertial sensor data.

[0023] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the head-mounted display 150) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera and a mobile phone 104C. The networked surveillance camera collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera remotely and in real time.

[0024] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other electronic systems that route data and messages.

[0025] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the client device 104C and head-mounted display 150). The client device 104C or head-mounted display 150 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Subsequent to model training, the client device 104C or head-mounted display 150 obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the trained data processing models locally. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and head-mounted display 150). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The client device 104A or head-mounted display 150 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized or predicted device poses) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104A or head-mounted display 150 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and head-mounted display 150), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104B or head-mounted display 150. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104B or head-mounted display 150 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.

[0026] Figure 1B illustrates a pair of augmented reality (AR) glasses 150 (also called a head-mounted display) that can be communicatively coupled to a data processing environment 100, in accordance with some embodiments. The AR glasses 150 include one or more cameras, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera(s) and microphone are configured to capture video and audio data from a scene of the AR glasses 150, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 150. In some situations, the microphone records ambient sound, including a user's voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 150 is processed by the AR glasses 150, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 150 jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 150 itself or interact with an application (e.g., a gaming application) executed by the AR glasses 150. In some embodiments, the display of the AR glasses 150 displays a user interface, and the recognized or predicted device poses are used to render virtual objects with high fidelity or interact with user selectable display items on the user interface.

[0027] In some embodiments, SLAM techniques are applied in the data processing environment 100 to process video data or static image data captured by the AR glasses 150 together with the inertial sensor data. Device poses are recognized and predicted, and a scene in which the AR glasses 150 is located is mapped and updated. The SLAM techniques are optionally implemented by each of the server 102 and AR glasses 150 independently or by both of the server 102 and AR glasses 150 jointly.

[0028] Figure 2 is a block diagram illustrating a data processing system 200, in accordance with some embodiments. The data processing system 200 includes a server 102, a client device 104, a storage 106, or a combination thereof. The data processing system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The data processing system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the data processing system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more cameras 260 (e.g., a first camera 260A and a second camera 260B), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The data processing system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning satellite) or other geo-location receiver, for determining the location of the client device 104. Optionally, the client device 104 includes an inertial measurement unit (IMU) 280 integrating multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space. Examples of the one or more inertial sensors include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.

[0029] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:

• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;

• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);

• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;

• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

• One or more user applications 224 for execution by the data processing system 200 (e.g., games, social network applications, smart home applications, extended reality application, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);

• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video data, visual data, audio data, and inertial sensor data) to be collected or obtained by a client device 104;

• Data processing module 228 for processing content data using data processing models 248, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;

• Pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., AR glasses 150) based on images captured by the first camera 260A and sensor data captured by the IMU 280, where in some embodiments, the pose is determined and predicted jointly by the pose determination and prediction module 230 and the data processing module 228, and the module 230 further includes a SLAM module 232 for mapping a scene where a client device 104 is located and identifying a location of the client device 104 within the scene;

• Pose-based rendering module 234 for rendering virtual objects on top of a field of view of the second camera 260B of the client device 104 in real time;

• Pose data buffer 236 for storing pose data optionally with inertial sensor data and image data for the purposes of determining recent camera poses and predicting subsequent camera poses; and

• One or more databases 238 for storing at least data including one or more of:
o Device settings 240 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 242 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 244 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server and host name;
o Training data 246 for training one or more data processing models 248;
o Data processing model(s) 248 for processing content data (e.g., video data, visual data, audio data) using deep learning techniques; and
o Content data and results 250 that are obtained by and outputted to the client device 104 of the data processing system 200, respectively, including a subset or all of the following data: Historic inertial sensor data 252 that are measured by the IMU 280; Historic image data 254 (e.g., video or static visual data) that are measured by the cameras 260 of the client device 104; and Historic pose data 256 that are determined based on the historic inertial sensor data 252 and historic image data 254.

[0030] Optionally, the one or more databases 238 are stored in one of the server 102, client device 104, and storage 106 of the data processing system 200. Optionally, the one or more databases 238 are distributed in more than one of the server 102, client device 104, and storage 106 of the data processing system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 248 are stored at the server 102 and storage 106, respectively.

[0031] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.

[0032] Figure 3 is a flowchart of a process 300 for processing inertial sensor data and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a SLAM module 232, in accordance with some embodiments. The process 300 includes measurement preprocessing 302, initialization 304, local visual-inertial odometry (VIO) with relocation 306, and global pose graph optimization 308. In measurement preprocessing 302, a first camera 260A captures image data of a scene at an image frame rate (e.g., 30 FPS), and features are detected and tracked (310) from the image data. An IMU 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the first camera 260A capturing the image data, and the inertial sensor data are pre-integrated (312) to provide pose data. In initialization 304, the image data captured by the first camera 260A and the inertial sensor data measured by the IMU 280 are temporally aligned (314). Vision-only structure from motion (SfM) techniques are applied (316) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the first camera 260A.

[0033] After initialization 304 and during relocation 306, a sliding window 318 and associated states from a loop closure 320 are used to optimize (322) a VIO. When the VIO corresponds (324) to a keyframe of a smooth video transition and a corresponding loop is detected (326), features are retrieved (328) and used to generate the associated states from the loop closure 320. In global pose graph optimization 308, a multi-degree-of-freedom (multi-DOF) pose graph is optimized (330) based on the states from the loop closure 320, and a keyframe database 332 is updated with the keyframe associated with the VIO.

[0034] Additionally, the features that are detected and tracked (310) are used to monitor (334) motion of an object in the image data and estimate image-based poses 336, e.g., according to the image frame rate. In some embodiments, the inertial sensor data that are pre-integrated (312) may be propagated (338) based on the motion of the object and used to estimate inertial-based poses 340, e.g., according to a sampling frequency of the IMU 280. The image-based poses 336 and the inertial-based poses 340 are stored in the pose data buffer 236 and used by the module 230 to estimate and predict poses that are used by the pose-based rendering module 234. Alternatively, in some embodiments, the module 230 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 336 to estimate and predict more poses that are further used by the pose-based rendering module 234.

[0035] In SLAM, high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors and the IMU 280. The imaging sensors (e.g., the first camera 260A, a LiDAR scanner) provide image data desirable for pose estimation, and oftentimes operate at a lower frequency (e.g., 30 frames per second) and with a larger latency (e.g., 30 milliseconds) than the IMU 280. Conversely, the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., < 0.1 millisecond). Asynchronous time warping (ATW) is often applied in an AR system to warp an image before it is sent to a display to correct for head movement that occurs after the image is rendered. ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing image frames. In both SLAM and ATW, relevant image data and inertial sensor data are stored locally, such that they can be synchronized and used for pose estimation/prediction. In some embodiments, the image and inertial sensor data are stored in one of multiple STL containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally convenient for use. The image and inertial sensor data are stored in the STL containers with their timestamps, and the timestamps are used for data search, data insertion, and data organization.
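A sketch of such a timestamped container is shown below, assuming a std::deque-based buffer with binary search by timestamp; the ImuBuffer and ImuSample names are illustrative and not part of this application.

    #include <algorithm>
    #include <cstdint>
    #include <deque>

    struct ImuSample {
        std::int64_t timestampUs;   // timestamp in microseconds
        double accel[3];            // accelerometer reading
        double gyro[3];             // gyroscope reading
    };

    class ImuBuffer {
    public:
        void insert(const ImuSample& s) {
            // Samples normally arrive in order; insertion keeps the deque sorted either way.
            auto it = std::upper_bound(buf_.begin(), buf_.end(), s.timestampUs,
                [](std::int64_t t, const ImuSample& x) { return t < x.timestampUs; });
            buf_.insert(it, s);
        }

        // Returns the first sample at or after the given timestamp, or nullptr if none exists.
        const ImuSample* firstAtOrAfter(std::int64_t tUs) const {
            auto it = std::lower_bound(buf_.begin(), buf_.end(), tUs,
                [](const ImuSample& x, std::int64_t t) { return x.timestampUs < t; });
            return it == buf_.end() ? nullptr : &*it;
        }

        // Drops samples older than the given timestamp, e.g., outside the SLAM window.
        void dropBefore(std::int64_t tUs) {
            while (!buf_.empty() && buf_.front().timestampUs < tUs) buf_.pop_front();
        }

    private:
        std::deque<ImuSample> buf_;
    };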

[0036] Figure 4A is a temporal diagram 400 illustrating a plurality of parallel temporal threads of inertial sensor data, first image data, second image data, and third image data, in accordance with some embodiments. A first temporal thread 402 of inertial sensor data and a second temporal thread 404 of first image data are applied to determine a plurality of camera poses of a first camera 260A (e.g., a wide-angle camera) jointly. A third temporal thread 412 of image data is applied to render a virtual environment or virtual objects with second images 416 captured by a second camera 260B (e.g., an RGB camera), thereby forming a fourth thread 414 of third image data. Specifically, the first temporal thread 402 includes a temporally-ordered sequence of inertial sensor samples 406 measured by the IMU 280 at a sampling frequency. The second temporal thread 404 of image data includes a temporally-ordered sequence of first images 408 captured by the first camera 260A at a first image frame rate. The third temporal thread 412 of image data includes a temporally-ordered sequence of second images 416 captured by the second camera 260B at a second image frame rate. The fourth temporal thread 414 of image data includes a temporally-ordered sequence of third images 418 rendered for display on a display at a display refresh rate. Each third image 418 is rendered based on one or more second images 416 that are recently available, while the display refresh rate of the third images is optionally equal to or distinct from the second image frame rate of the second images 416. In each third image 418, a virtual object is rendered on the one or more second images 416, or a virtual environment is mixed with the one or more second images 416.

[0037] Each inertial sensor sample 406 has a sensor latency from being captured by the IMU 280 to being available to be used by a pose determination and prediction module 230. Each first image 408 has a first image latency from being captured by the first camera 260A to being available to be used by the pose determination and prediction module 230. Each second image 416 has a second image latency from being captured by the second camera 260B to being available to be used by the pose-based rendering module 234. A quality of the second image 416 is equal to or better than that of the first images 408. Optionally, the second image frame rate of the second images 416 is greater than the first image frame rate of the first images 408, and the second image latency of the second images 416 is greater than the first image latency of the first images 408. The sampling frequency of the inertial sensor samples 406 is greater than both of the first and second image frame rates, and the sensor latency is less than the first and second image latencies.

[0038] In some embodiments, a temporal position of each of the inertial sensor samples 406 and images 408, 416, and 418 on the temporal threads 402, 404, 412, and 414 corresponds to a respective time when the corresponding inertial sensor sample 406 or image 408 is available to be used for camera pose determination or image rendering. Each of the sequence of first images 408 is obtained with a respective first time stamp indicating a respective first image time when the respective first image is captured. Each of the sequence of second images is obtained with a respective second time stamp indicating a respective second image time when the respective second image is captured. Each motion sensor data is obtained with a respective sensor time stamp indicating a respective sensor time when the respective motion sensor data is measured. Each of the first and second cameras 260A and 260B has a respective camera pose at a respective pose time. If the respective camera pose (i.e., an image-based camera pose) is determined based on a respective first or second image, the respective pose time corresponds to one of the respective first or second image time. When the respective camera pose is integrated from the inertial sensor samples 406, the respective pose time corresponds to a corresponding sensor time.

[0039] In an example, the first images 408 are captured at a frequency of 30 Hz, and the first image frame rate is 30 frames per second. The first image latency of each first image 408 is approximately 30 ms. Thus, every two consecutive first images 408 are temporally separated by 33 ms, while each first image 408 is available the first image latency Δt1 of 30 ms (covering image transfer and processing) after being captured by the first camera 260A. In contrast, the inertial sensor samples 406 are measured by the IMU 280 at a sampling frequency of 1000 Hz, and have the sensor latency that approximates 0.1 ms, which is almost negligible compared with the first image latency Δt1 and a frame separation ΔT between two first images 408. A first image 408A is captured at a prior time t_i, and made available at a first time t_A. At an instant t_j that is between the times t_i and t_A, a prior image 408P that precedes the first image 408A (immediately or not) has been available and used to determine camera pose data of the first camera 260A jointly with the sensor samples 406. However, when the first image 408A is available at t_A, the first image 408A is used to generate an image-based camera pose associated with the prior time t_i retroactively, update the camera pose data at the instant t_j that is between t_i and t_A, and determine subsequent camera pose data that follows the first time t_A and precedes a subsequent time t_B when a subsequent image 408B is available. As such, the image-based camera pose associated with the prior time t_i is determined based on the first image 408A and can be used to determine the camera pose data within a temporal range that lasts a combined duration 422 of the frame separation ΔT and the first image latency Δt1 (e.g., ~63 ms), i.e., from the prior time t_i to the subsequent time t_B.

[0040] In some embodiments, during the duration 422, after the first image 408A arrives and is available, an Extended Kalman Filter (EKF) or Error State Kalman Filter (ESKF) is applied to obtain poses at least between the prior time t_i of capturing the first image 408A and the subsequent time t_B of receiving the subsequent first image 408B. Specifically, the first image 408A arrives and is available at the first time t_A, and the inertial sensor samples 406 (e.g., acceleration, angular velocity, and corresponding bias and noise) captured between the first and prior times t_A and t_i have already been available. For each instant t_j between t_i and t_A, the camera pose is updated using the image-based camera pose that is retroactively generated for the prior time t_i and the inertial sensor samples 406 captured between the current and prior times t_j and t_i, e.g., based on an integration operation using the available inertial sensor samples 406. As more inertial sensor samples 406 are captured, at a subsequent instant t_k between t_A and t_B, the inertial sensor samples 406 captured between t_A and t_k are also made available. For each subsequent instant t_k between t_A and t_B, the camera pose is estimated and determined using the image-based camera pose that is retroactively generated for the prior time t_i and the inertial sensor samples 406 that are captured between the prior and subsequent times t_i and t_k, e.g., based on the integration operation using the available inertial sensor samples 406.
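The propagation step in paragraph [0040] can be sketched as follows, with plain dead-reckoning integration standing in for the full EKF/ESKF update; bias and noise handling are omitted, acceleration is assumed to be expressed in the world frame with gravity removed, and all type names are illustrative.

    #include <array>
    #include <vector>

    struct InertialSample {
        double timestamp;                // seconds
        std::array<double, 3> accel;     // linear acceleration, assumed gravity-compensated and in the world frame
        std::array<double, 3> gyro;      // angular velocity (unused in this position-only sketch)
    };

    struct SimplePose {
        double timestamp;                // pose time, initially the capture time t_i
        std::array<double, 3> position;
        std::array<double, 3> velocity;
    };

    // Propagates the image-based pose at t_i forward to the timestamp of the last sample
    // (an instant t_j), using piecewise-constant acceleration between consecutive samples.
    SimplePose propagate(SimplePose pose, const std::vector<InertialSample>& samples) {
        for (const InertialSample& s : samples) {
            double dt = s.timestamp - pose.timestamp;
            if (dt <= 0.0) continue;                   // skip samples before the current pose time
            for (int i = 0; i < 3; ++i) {
                pose.position[i] += pose.velocity[i] * dt + 0.5 * s.accel[i] * dt * dt;
                pose.velocity[i] += s.accel[i] * dt;
            }
            pose.timestamp = s.timestamp;
        }
        return pose;                                   // camera pose at the instant t_j
    }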

[0041] In some embodiments, the first images 408 and the inertial sensor samples 406 are applied to predict camera poses. Specifically, when the first image 408A is available at t_A, the first image 408A is used to generate an image-based camera pose associated with the prior time t_i retroactively. Each inertial sensor sample 406 captured between t_i and t_B is used to provide a relative pose variation, and can be combined with the image-based camera pose associated with the first image 408A to determine a camera pose at a specific time between t_i and t_B. After at least one camera pose is derived between t_i and t_B based on its corresponding inertial sensor samples 406, the at least one camera pose is used to predict an upcoming camera pose that has not occurred or for which its corresponding inertial sensor samples 406 have not been available yet. For example, after a current camera pose is determined for the instant t_j based on the inertial sensor samples between t_i and t_j, the current camera pose and the image-based camera pose associated with the prior time t_i are applied to derive a subsequent camera pose at the subsequent instant t_k between t_A and t_B, e.g., by linear extrapolation. In another example, a series of camera poses are determined for a series of times between t_i and t_A based on the image-based camera pose associated with the prior time t_i and the inertial sensor samples 406. The series of camera poses are applied to predict a subsequent camera pose at the subsequent instant t_k, e.g., by linear extrapolation.
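The linear-extrapolation example can be sketched as follows, limited to translation for brevity; the PredictedPose type is an illustrative assumption.

    #include <array>

    struct PredictedPose {
        double timestamp;
        std::array<double, 3> position;   // orientation omitted in this sketch
    };

    // Predicts a pose at a later instant t_k from two poses already derived between t_i and t_A,
    // assuming the two input timestamps are distinct.
    PredictedPose extrapolate(const PredictedPose& earlier,
                              const PredictedPose& current,
                              double t_k) {
        double s = (t_k - current.timestamp) / (current.timestamp - earlier.timestamp);
        PredictedPose out{t_k, {}};
        for (int i = 0; i < 3; ++i)
            out.position[i] = current.position[i]
                            + s * (current.position[i] - earlier.position[i]);
        return out;
    }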

[0042] Compared with the second image latency of the second images 416, the first image latency of the first images 408 is substantially small (e.g., less than a latency threshold). The pose determination and prediction module 230 only needs to update the camera poses retroactively for a controlled portion of the duration 422. This allows generation of the camera poses to be substantially synchronized with image rendering without compromising accuracy of the camera poses. However, the substantially small first image latency does not allow a time window for optimizing camera settings in real time. A focal length and an exposure time are predefined and not adjustable. No or little image processing is performed. In some situations, the first images 408 are out of focus and have a limited visual quality to human eyes, while feature points can still be accurately identified from the first images 408 to determine camera poses and map the field of view of the first camera.

[0043] Referring to Figure 4A, a plurality of first camera poses 410 are determined based on the inertial sensor data 406 and first images 408 and marked on the second thread 404. The first camera poses 410 optionally have the same sampling frequency as the inertial sensor data 406, and can be applied to derive more intermediate first camera poses by interpolation if needed. Each third image 418 is rendered based on one or more second images 416 that are recently available. For example, the third image 418A or 418B is rendered with one or more virtual objects overlaid on top of the second image 416A or 416B, respectively. The virtual objects are rendered on the second image 416A or 416B from a perspective of the second camera 260B that has captured the second image 416A or 416B. The first and second cameras 260A and 260B are both integrated within the same electronic device, and have a fixed positional shift that is represented by a pre-determined coordinate correlation. The pre-determined coordinate correlation is calibrated for each type of electronic device before the electronic devices of the respective type are shipped out of the factory. In this situation, the second image 416A or 416B corresponds to an instantaneous second camera pose of the second camera 260B at a second pose time, e.g., the subsequent instant t_k between the times t_A and t_B when the first image 408A and subsequent image 408B are available. The instantaneous second camera pose of the second camera 260B corresponding to the second image 416A or 416B is determined from a first camera pose of the first camera 260A corresponding to the corresponding subsequent instant t_k according to the pre-determined coordinate correlation, while the first camera pose of the first camera 260A corresponding to the corresponding subsequent instant t_k is determined in real time or extrapolated from the image-based camera pose corresponding to the first image 408A and the corresponding inertial sensor samples 406 between t_i and t_k.
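Applying the pre-determined coordinate correlation amounts to composing the first camera's pose with the calibrated rigid transform between the two cameras. A sketch using 4x4 homogeneous matrices is shown below; the names are illustrative, and T_c1_c2 stands for the factory-calibrated transform of the second camera frame relative to the first camera frame.

    #include <array>

    using Mat4 = std::array<std::array<double, 4>, 4>;

    Mat4 multiply(const Mat4& a, const Mat4& b) {
        Mat4 r{};                                     // zero-initialized result
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                for (int k = 0; k < 4; ++k)
                    r[i][j] += a[i][k] * b[k][j];
        return r;
    }

    // World pose of the second camera from the world pose of the first camera and the
    // pre-determined coordinate correlation T_c1_c2 (second camera relative to the first).
    Mat4 secondCameraPose(const Mat4& T_w_c1, const Mat4& T_c1_c2) {
        return multiply(T_w_c1, T_c1_c2);
    }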

[0044] In some situations, the first camera pose of the first camera 260A corresponding to the subsequent instant t_k is retroactively updated after the subsequent image 408B of the first camera 260A is available at the subsequent time t_B, and so is the instantaneous second camera pose of the second camera 260B corresponding to the second image 416A or 416B updated retroactively. In some situations, the third image 418A is being currently displayed, and is immediately updated based on the update of the instantaneous second camera pose corresponding to the second image 416A. Alternatively, in some situations, the third image 418A is being currently displayed and not updated, regardless of the update of the instantaneous second camera pose corresponding to the second image 416A. In contrast, the third image 418B is rendered immediately prior to the third image 418A. The third image 418B has already been refreshed and replaced with the third image 418A, and it is too late to update the third image 418B, when the subsequent image 408B of the first camera 260A is available for updating the first camera pose corresponding to the second image 416B and the third image 418B.

[0045] Additionally, in some embodiments, the display of the virtual object is rendered on top of each second image 416 with a respective second camera pose corresponding to a second pose time that is synchronous to one of a second image time when the respective second image 416 is available for image rendering and a display time tD when the virtual object is rendered. For example, referring to Figure 4A, the third image 418A is obtained by rendering the virtual object on the second image 416A that is made available immediately before the display time tD of the third image 418A. In some situations, the second camera pose used to render the virtual object corresponds to the display time tD of the third image 418A. Alternatively, in some situations, the second camera pose used to render the virtual object corresponds to the second pose time when the second image 416A is available, although the second camera pose may have shifted slightly by the display time tD of the third image 418A.

[0046] Figure 4B is another temporal diagram 450 illustrating a plurality of parallel temporal threads of inertial sensor data, first image data, second image data, and rendered virtual objects, in accordance with some embodiments. When the first image 408A is available at the time tA, the first image 408A is used to generate an image-based camera pose associated with the prior time ti retroactively. The image-based camera pose is then converted to a second camera pose 420-i of the second camera 260B corresponding to the prior time ti based on the pre-determined coordinate correlation between the first and second cameras 260A and 260B. The second camera pose of the second camera 260B corresponding to the prior time ti is combined with the inertial sensor samples 406 to determine in real time, or extrapolate in advance, subsequent second camera poses 420 of the second camera 260B corresponding to the subsequent time tk of the second image 416A or 416B. Stated another way, the second camera pose 420-i of the second camera 260B corresponding to the prior time ti is combined with the inertial sensor samples 406 captured after the prior time ti to determine retroactively and in real time, or extrapolate, a plurality of preliminary camera poses 420-p of the second camera 260B using a SLAM technique. The second image 416A or 416B corresponds to a respective second camera pose and a second pose time (e.g., the subsequent time tk), and the respective second camera pose is one of the plurality of preliminary camera poses corresponding to the second pose time.
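
The extrapolation of preliminary poses from inertial data can be illustrated, under simplifying assumptions, by propagating only the orientation with gyroscope samples; translation propagation from accelerometer data and the full SLAM filtering are omitted. The data types and method names below are hypothetical and do not correspond to the actual interfaces of the SLAM module 232.

    // Propagates a quaternion orientation from a start time forward through a
    // series of gyroscope samples. Each sample holds a timestamp (ns) and
    // angular rates (rad/s) about the x, y, and z body axes. Hypothetical types.
    final class GyroSample {
        final long timeNs;
        final double wx, wy, wz;
        GyroSample(long timeNs, double wx, double wy, double wz) {
            this.timeNs = timeNs; this.wx = wx; this.wy = wy; this.wz = wz;
        }
    }

    final class OrientationPropagator {
        // q = [qw, qx, qy, qz] is the starting orientation at startTimeNs;
        // samples are assumed sorted by time.
        static double[] propagate(double[] q, long startTimeNs,
                                  java.util.List<GyroSample> samples, long endTimeNs) {
            double[] out = q.clone();
            long prev = startTimeNs;
            for (GyroSample s : samples) {
                if (s.timeNs <= startTimeNs) continue;
                long t = Math.min(s.timeNs, endTimeNs);
                double dt = (t - prev) * 1e-9;          // nanoseconds to seconds
                prev = t;
                // Incremental rotation over dt: angle = |w| * dt about axis w / |w|.
                double wNorm = Math.sqrt(s.wx * s.wx + s.wy * s.wy + s.wz * s.wz);
                double angle = wNorm * dt;
                double dw = Math.cos(angle / 2.0);
                double k = wNorm > 1e-12 ? Math.sin(angle / 2.0) / wNorm : 0.0;
                double dx = s.wx * k, dy = s.wy * k, dz = s.wz * k;
                // Quaternion multiplication: out = out * dq (body-frame update).
                double qw = out[0], qx = out[1], qy = out[2], qz = out[3];
                out[0] = qw * dw - qx * dx - qy * dy - qz * dz;
                out[1] = qw * dx + qx * dw + qy * dz - qz * dy;
                out[2] = qw * dy - qx * dz + qy * dw + qz * dx;
                out[3] = qw * dz + qx * dy - qy * dx + qz * dw;
                if (s.timeNs >= endTimeNs) break;
            }
            return out;
        }
    }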

[0047] In some embodiments, the subsequent image 408B that immediately follows the first image 408A is available, and the image-based camera pose of the first camera 260A is updated retroactively. When this occurs, the second camera pose of the second camera 260B corresponding to the second image 416A or 416B is updated retroactively. In some situations, the third image 418A is currently being displayed and is immediately updated based on the update of the second camera pose corresponding to the second image 416A. Alternatively, in some situations, the third image 418A is currently being displayed and is not updated, regardless of the update of the instantaneous second camera pose corresponding to the second image 416A. In contrast, the third image 418B is rendered immediately prior to the third image 418A. By the time the subsequent image 408B of the first camera 260A is available for updating the first camera pose corresponding to the second image 416B and the third image 418B, the third image 418B has already been refreshed and replaced with the third image 418A, and it is too late to update the third image 418B.

[0048] It is noted that camera settings of the second camera 260B are optimized to provide high-quality second images 416. For example, the second camera 260B uses autofocus or exposure configurations that are selected by a developer or a user of an extended reality application (e.g., an AR application). An image quality of each second image 416 captured by the second camera 260B is adjusted to render the display of the virtual object with a desirable visual effect. In some situations, the second image latency of the second images 416 is greater than the first image latency of the first images 408. As long as the second image latency is not detectable to human eyes, the autofocus and exposure configurations are preferred so as to provide the desirable visual effect.

[0049] Figure 5 is a block diagram illustrating a process 500 of applying two cameras for separate pose estimation and image rendering in an extended reality application, in accordance with some embodiments. First images 408 are captured (502) by a first camera 260A using SLAM-related camera settings 504, and each first image 408 has a respective first timestamp 506 recording a time when the respective first image 408 is captured. The first images 408 are processed by a SLAM module 232 to generate a plurality of first camera poses and map a scene where the first camera 260A is located. Each first camera pose is associated with a respective first timestamp 506. Specifically, each first image 408 is used to generate an image-based camera pose of the first camera 260A associated with the respective first timestamp 506 retroactively. In some embodiments, each first image 408 corresponds to a plurality of inertial sensor samples 406 that are captured after the first image 408 is captured. Each of the plurality of inertial sensor samples 406 corresponds to a respective sensor timestamp indicating a respective sensor time when the respective motion sensor data is measured. A first camera pose is determined (e.g., in real time, retroactively, predictively) at the respective sensor time based on the image-based camera pose of the respective first image 408 and a subset of the inertial sensor samples 406. More details on generating the image-based camera pose and first camera poses of the first camera 260A are described above with reference to Figure 4A.

[0050] Second images 416 are captured (510) by a second camera 260B using image-related camera settings 512, and each second image 416 has a respective second timestamp 514 recording a time when the respective second image 416 is captured. In some embodiments, the respective second timestamp 514 is used to identify a first camera pose corresponding to the respective second timestamp 514, and the first camera pose is converted to a second camera pose 516 of the second camera 260B corresponding to the respective second timestamp 514. Both the second image 416 and the second camera pose 516 corresponding to the respective second timestamp 514 are provided to a graphics renderer library 518 to extract object-oriented information 520 of a virtual object including, but not limited to, object-oriented shade, geometry, and texture information. A computer vision module 522 receives the second images 416, the second timestamps 514, the image-related camera settings 512, the second camera pose 516 corresponding to each second image 416, and the object-oriented information 520, and renders the virtual object on top of the second images 416 according to the object-oriented information 520.

[0051] In some embodiments, the first camera 260A includes a wide-angle camera, and the second camera 260B includes a visible light camera (e.g., an RGB camera). The wide-angle camera has a larger field of view than the RGB camera, such that each first image 408 has a larger number of feature points than a corresponding second image 416 to facilitate determining a camera pose and mapping the scene of the first camera 260A. The wide-angle camera does not have automatic adjustment of focus and exposure time based on the SLAM-related camera settings 504, and can capture and process images promptly (e.g., with a latency less than a latency threshold) in response to any pose change of the first camera 260A. Such prompt image capturing and processing allows the first camera poses of the first camera 260A to be determined substantially in real time. The first images 408 have a compromised image quality that is nevertheless sufficient to identify the feature points in the first images 408 for the purposes of pose determination and scene mapping.

[0052] Figure 6 is a block diagram illustrating a plurality of processing threads 600 implemented in an Android operating system (OS), in accordance with some embodiments. The plurality of processing threads 600 include a first wide-angle camera thread 602, a second RGB camera thread 604, and a third graphics library thread 606. An electronic device (e.g., AR glasses 150, mobile phone 104C) includes a first wide-angle camera 260A configured to generate first images 408 and a second RGB camera 260B configured to generate second images 416 concurrently with the first images 408. The electronic device is configured to implement the Android OS on which the first and second images 408 and 416 are processed using the first wide-angle camera thread 602, the second RGB camera thread 604, and the third graphics library (GL) thread 606 and applied to render third images 418 using a computer vision module 522.

[0053] A first camera instance 608 includes SLAM-related camera settings 504, and controls the first wide-angle camera 260A to capture first images 408 according to the SLAM-related camera settings 504 (e.g., image resolution, focal length, exposure time, first image latency, first image frame rate). Each of the first images 408 is rendered on a surface 610 of the wide-angle camera thread 602. During a callback session, the surface 610 provides the first images 408 and associated information (e.g., first timestamps 506, the SLAM-related camera settings 504) to an image reader module 614 via an image reader instance 612. More specifically, the image reader module 614 generates a callback request, collects the first images and associated information, and sends the collected first images and associated information to a SLAM module 232. The SLAM module 232 generates a plurality of first camera poses and maps a field of view of the first camera 260A. The first camera poses of the first camera 260A are converted to second camera poses of the second RGB camera 260B corresponding to each second image 416 based on inertial sensor samples 406 and a predefined coordinate correlation between the cameras 260A and 260B. In some embodiments, the field of view of the first wide-angle camera 260A is converted to the field of view of the second RGB camera 260B to map the field of view of the second RGB camera 260B in real time.

[0054] A second camera instance 616 includes image-related camera settings 512, and controls the second RGB camera 260B to capture second images 416 according to the image-related camera settings 512 (e.g., image resolution, adjustable focal length, adjustable exposure time, second image latency, second image frame rate). The image-related camera settings 512 are optionally predefined by a developer or a user of an AR application of the electronic device. Each of the second images 416 is rendered on two surfaces 618A and 618B of the RGB camera thread 604. A first surface 618A is used to preview the second images 416, and a second surface 618B is used to obtain object-related information. During a callback session, the first surface 618A provides the second images 416 and associated information (e.g., second timestamps 514, the image-related camera settings 512) to an image reader module 622 via an image reader instance 620. More specifically, the image reader module 622 generates a first callback request 624A and collects the second images 416 and associated information. Additionally, the image reader module 622 provides a second timestamp of each second image 416 to the SLAM module 232 and obtains a second camera pose corresponding to the second image 416 from the SLAM module 232. The image reader module 622 generates a second callback request 624B including the second camera pose. In response to this callback request 624B, the GL thread 606 extracts the object-related information and feeds the object-related information to the second surface 618B. As such, the second images 416 and associated information are provided to the computer vision module 522 jointly with the object-related information by way of the image reader instance 620 and the image reader module 622.

[0055] In some embodiments, the plurality of first camera poses include image-based camera poses generated directly from the first images 408. Each image-based camera pose is converted to a second camera pose 420-i of the second RGB camera 260B directly using a pre-determined coordinate correlation between camera poses of the first and second cameras 260A and 260B. The second camera pose 420-i is subsequently combined with a series of inertial sensor samples to determine (in real time or retroactively) or predict a plurality of preliminary camera poses 420-p of the second camera 260B. A second camera pose corresponding to a second pose time when a second image is available for image rendering is selected from the plurality of preliminary camera poses 420-p. Alternatively, in some embodiments, each image-based camera pose is combined with a series of inertial sensor samples to determine (in real time or retroactively) or predict a first camera pose 410 corresponding to a second pose time when a specific second image 416 is captured. This second pose time is between two first images 408, and the first camera pose 410 is converted to a second camera pose of the second camera 260B corresponding to the second pose time when the specific second image 416 is captured, based on the pre-determined coordinate correlation.

[0056] Referring to Figure 6, the plurality of processing threads 600 are established on the Android OS of a smartphone equipped with two or more cameras including the first wide-angle camera 260A and the second RGB camera 260B. The first wide-angle camera thread 602 is created to configure the camera settings applied in the camera instance 608 according to a SLAM algorithm used in the SLAM module 232. Specifically, an ImageReader instance is optionally created via ImageReader.newInstance to get the surface 610 and couple the surface 610 to the camera instance 608. The format and resolution of the first images 408 are optionally configured according to the SLAM-related camera settings 504. The first images 408 and the first timestamps 506 are then provided to the SLAM module 232 through an ImageReader callback session.
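
The ImageReader wiring for the wide-angle camera thread may look roughly like the following sketch, assuming a camera2 capture session is configured elsewhere; the resolution, the format, and the SlamModule interface are illustrative placeholders rather than the SLAM-related camera settings 504 or the actual interface of the SLAM module 232.

    import android.graphics.ImageFormat;
    import android.media.Image;
    import android.media.ImageReader;
    import android.os.Handler;
    import android.view.Surface;

    // Couples a wide-angle camera capture session to the SLAM module via an
    // ImageReader callback. SlamModule and its feedFrame method are hypothetical
    // stand-ins for the interface of the SLAM module 232.
    final class WideAngleImagePipe {
        interface SlamModule {
            void feedFrame(Image image, long timestampNs);
        }

        private final ImageReader reader;
        private final Surface surface;

        WideAngleImagePipe(SlamModule slam, Handler cameraThreadHandler) {
            // Resolution and format here are illustrative; in practice they come
            // from the SLAM-related camera settings 504.
            reader = ImageReader.newInstance(640, 480, ImageFormat.YUV_420_888, 4);
            surface = reader.getSurface();  // target surface for the capture session
            reader.setOnImageAvailableListener(r -> {
                Image image = r.acquireLatestImage();
                if (image == null) return;
                try {
                    // The sensor timestamp serves as the first timestamp 506.
                    slam.feedFrame(image, image.getTimestamp());
                } finally {
                    image.close();  // release the buffer back to the reader
                }
            }, cameraThreadHandler);
        }

        Surface surface() {
            return surface;  // add this surface as a target of the capture request
        }
    }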

[0057] In the RGB camera thread 604, a developer of a corresponding application configures the second RGB camera 260B (i.e., determines the image-related camera settings 512) according to image rendering needs. An ImageReader instance (e.g., the image reader instance 620) is optionally created via ImageReader.newInstance to get a surface 618 and couple the surface 618 to the camera instance 616. The format and resolution of the second images 416 captured by the RGB camera 260B are configured according to the image-related camera settings 512 determined by the image rendering needs. The second images 416 are provided to the corresponding computer vision module 522 or other image consumption modules (not shown) through another ImageReader callback 624. In some situations, the image timestamp 514 of each second image 416 captured by the second camera 260B is sent to an ATW module of the SLAM module 232, and the ATW module returns the second camera pose of the second camera 260B corresponding to this timestamp. Because of the support of the ATW module, the wide-angle camera 260A and the RGB camera 260B may have different image frame rates, e.g., 30 FPS and 60 FPS, respectively.
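
The timestamp-to-pose exchange with the ATW module can be sketched as a simple query interface; this interface is hypothetical, since the application does not specify the ATW module's actual API.

    // Hypothetical interface for the timestamp-to-pose exchange described above;
    // the actual ATW module API within the SLAM module 232 is not specified here.
    interface AtwModule {
        // Returns the second camera pose of the RGB camera 260B for the given
        // second image timestamp 514 (nanoseconds), selected or interpolated
        // from the camera poses maintained by the SLAM module.
        float[] poseForSecondImageTimestamp(long timestampNs);
    }

    // Illustrative use inside the second ImageReader callback:
    //   android.media.Image image = reader.acquireLatestImage();
    //   float[] secondCameraPose = atw.poseForSecondImageTimestamp(image.getTimestamp());
    //   // hand the image and pose to the GL thread for rendering, then image.close()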

[0058] In the GL thread 606, a GLSurfaceRender is created to render an RGB camera preview and a virtual scene based on the second pose data of the second RGB camera 260B provided by the ATW module of the SLAM module 232. For example, an additional surface instance 618B is created to obtain the SurfaceTexture of the GLSurfaceRender and interact with the camera instance 616 in the RGB camera thread 604. A GLSurfaceView 626 is created to draw a rendering result of the GLSurfaceRender on a screen of the electronic device.
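
A rough sketch of this GL-thread wiring is shown below using the standard Android GLSurfaceView and SurfaceTexture classes; the renderer class name is illustrative, the drawing code is omitted, and the actual GLSurfaceRender of the application may be organized differently.

    import android.graphics.SurfaceTexture;
    import android.opengl.GLES11Ext;
    import android.opengl.GLES20;
    import android.opengl.GLSurfaceView;
    import android.view.Surface;
    import javax.microedition.khronos.egl.EGLConfig;
    import javax.microedition.khronos.opengles.GL10;

    // A renderer that owns an external OES texture, exposes a Surface backed by
    // its SurfaceTexture for the RGB camera preview, and draws the preview plus
    // the virtual scene each frame. Drawing code is omitted.
    final class PreviewRenderer implements GLSurfaceView.Renderer {
        private SurfaceTexture cameraTexture;
        private Surface cameraSurface;  // corresponds to the additional surface 618B

        @Override public void onSurfaceCreated(GL10 gl, EGLConfig config) {
            int[] tex = new int[1];
            GLES20.glGenTextures(1, tex, 0);
            GLES20.glBindTexture(GLES11Ext.GL_TEXTURE_EXTERNAL_OES, tex[0]);
            cameraTexture = new SurfaceTexture(tex[0]);
            cameraSurface = new Surface(cameraTexture);  // add as a capture target
        }

        @Override public void onSurfaceChanged(GL10 gl, int width, int height) {
            GLES20.glViewport(0, 0, width, height);
        }

        @Override public void onDrawFrame(GL10 gl) {
            if (cameraTexture != null) cameraTexture.updateTexImage();
            // 1. Draw the camera preview quad textured with cameraTexture.
            // 2. Draw the virtual objects using the second camera pose supplied
            //    by the ATW module for the frame being displayed.
        }
    }

    // In the GL thread setup (e.g., in an Activity):
    //   GLSurfaceView view = new GLSurfaceView(context);
    //   view.setEGLContextClientVersion(2);
    //   view.setRenderer(new PreviewRenderer());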

[0059] In some situations, any cameras involved in the SLAM algorithm of the SLAM module 232 are required to be calibrated to enable their geometric measurability. Intrinsic parameters of both the wide-angle camera 260A and the RGB camera 260B are pre-calibrated. An extrinsic rigid body transformation (e.g., a pre-determined coordinate correlation) between the RGB and wide-angle cameras is also required to be determined to convert the first camera pose determined in a wide-angle camera reference system to the second camera pose in an RGB camera reference system. If the SLAM module 232 requires further inertial sensor samples, e.g., provided by an accelerometer and a gyroscope, noise characteristics of those sensors and their extrinsic rigid body transformations with respect to the cameras 260A and 260B are also determined. Additionally, in some implementations, the wide-angle and RGB cameras 260A and 260B have a spatial separation that is substantially small (e.g., less than a threshold separation) within the same electronic device. In that case, the extrinsic rigid body transformation (e.g., pre-determined coordinate correlation) between the wide-angle and RGB cameras 260A and 260B cannot be determined by camera-only calibration, and additional sensors are used to determine the extrinsic rigid body transformation between the RGB and wide-angle cameras.
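
Once the extrinsic rigid body transformation is known, converting a first camera pose to a second camera pose reduces to a multiplication of homogeneous transforms. The sketch below assumes camera-to-world pose matrices and an extrinsic transform expressed as RGB-frame-to-wide-angle-frame; the actual matrix conventions used by the SLAM module 232 may differ.

    // Converts a wide-angle camera pose to an RGB camera pose using a fixed
    // extrinsic transform. All matrices are 4x4 row-major homogeneous transforms.
    // worldFromWide: first camera pose (wide-angle camera frame -> world frame).
    // wideFromRgb:   pre-determined coordinate correlation (RGB frame -> wide frame).
    // Returns worldFromRgb, the second camera pose (RGB camera frame -> world frame).
    final class PoseConversion {
        static double[] secondPoseFromFirst(double[] worldFromWide, double[] wideFromRgb) {
            return multiply(worldFromWide, wideFromRgb);
        }

        // Row-major 4x4 matrix product a * b.
        static double[] multiply(double[] a, double[] b) {
            double[] c = new double[16];
            for (int row = 0; row < 4; row++) {
                for (int col = 0; col < 4; col++) {
                    double sum = 0.0;
                    for (int k = 0; k < 4; k++) {
                        sum += a[row * 4 + k] * b[k * 4 + col];
                    }
                    c[row * 4 + col] = sum;
                }
            }
            return c;
        }
    }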

[0060] Referring to Figure 6, the wide-angle camera 260A has a wider field of view that helps get more feature points to support the related SLAM algorithm. The RGB camera 260B can be freely configured by a developer or a user of an extended reality application. The RGB camera 260B has a second frame rate (e.g., 60 FPS) and provides images that are rendered by a graphics processing unit (GPU) at up to 4K resolution without needing to be scaled. The ATW module of the SLAM module 232 optionally runs at the second frame rate of the RGB camera 260B (e.g., 60 FPS), while the wide-angle camera 260A has a distinct first frame rate (e.g., 30 FPS). In some situations, the RGB camera 260B operates with a distinct computer vision module from the computer vision module 522. However, when the two cameras 260A and 260B operate concurrently, power consumption increases. For example, a single camera based monocular SLAM architecture consumes a current of 762 mA, while a dual camera based monocular SLAM architecture consumes a current of 986 mA.

[0061] In some embodiments, a first electronic device uses its first camera 260A to obtain a real scene map model of a scene where the first electronic device is located. The first electronic device provides the real scene map model to a second electronic device that is distinct from the first electronic device and located in the same scene. The second electronic device only needs to determine its own device pose using its own SLAM module, without remapping the scene, because the real scene map model is shared by the first electronic device. In some embodiments, the first and second electronic devices share their real-time positions and the real scene map model with each other via one or more communication networks 108 to which both electronic devices are coupled, e.g., Wi-Fi Direct.

[0062] Figure 7A is an image 700 rendered based on a single camera monocular SLAM system, in accordance with some embodiments, and Figure 7B is an image 750 rendered based on a dual camera monocular SLAM system of this application, in accordance with some embodiments. Referring to Figure 7A, a single camera is used to capture a sequence of first images, and each first image 702 is used to determine a camera pose, map a scene (i.e., a field of view) of the single camera, and render a plurality of virtual objects 704. The first image 702 is blurry, because the single camera is out of focus. Operation of the single camera is focused on SLAM and is therefore not optimized for an image quality of the first image 702. Referring to Figure 7B, two distinct cameras are used to capture two sequences of images separately. The two sequences of images may have distinct image frame rates and refresh rates. One of the two sequences of images is used to determine camera poses and map a scene for the cameras, and the other of the two sequences of images 706 is used to render a virtual object 708 and serves as a background for the virtual object 708. The image 706 is in focus and has a better image quality than the first image 702. In some embodiments, the camera (e.g., the first camera 260A) applied for determining camera poses has a fixed focal length, a fixed exposure time, and a fixed frame rate range. In some embodiments, the camera (e.g., the first camera 260A) applied for determining camera poses has a larger field of view and provides more feature points than the other camera (e.g., the second camera 260B) applied to provide background images.

[0063] Figure 8 is a flowchart of a method 800 for rendering an image, in accordance with some embodiments. For convenience, the method 800 is described as being implemented by an electronic device (e.g., a client device 104, a server 102, or a combination thereof). An example of the client device 104 is a head-mount display 150 or a mobile phone 104C. In an example, the method 800 is applied to determine and predict poses, map a scene, and render both virtual and real content concurrently in extended reality (e.g., VR, AR). Method 800 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the electronic system. Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 800 may be combined and/or the order of some operations may be changed.

[0064] The electronic device includes a first camera 260A (e.g., a wide-angle camera) and a second camera 260B. The electronic device obtains (802 and 804) a sequence of first images 408 captured by the first camera 260A and a sequence of second images 416 captured by the second camera 260B concurrently with the sequence of first images 408. The electronic device determines (806) a plurality of first camera poses of the first camera based on at least the sequence of first images 408. Each first image 408 corresponds to at least one first camera pose. In some embodiments, the at least one first camera pose includes image-based camera poses generated directly from the first images and/or camera poses determined from the image-based camera poses and inertial sensor samples 406. The electronic device determines (808) a plurality of second camera poses of the second camera 260B from the plurality of first camera poses of the first camera based on a pre-determined coordinate correlation. In accordance with the plurality of second camera poses, the electronic device renders (810) display of a virtual object on top of the sequence of second images, i.e., generates the third images 418. In some embodiments, the first camera 260A and the second camera 260B are integrated in the same electronic device, while the first camera 260A is a wide-angle camera and the second camera 260B is an RGB camera.
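
The operations 802-810 can be tied together in a simplified processing loop such as the sketch below; the camera, pose-source, and renderer interfaces are hypothetical placeholders for the components in Figures 2-6, the loop reuses the PoseConversion helper sketched earlier, and threading and error handling are omitted.

    // High-level sketch of method 800: obtain concurrent image streams, derive
    // first camera poses, convert them to second camera poses through the
    // pre-determined coordinate correlation, and render the virtual object on
    // top of each second image. All interfaces are illustrative placeholders.
    final class DualCameraRenderingLoop {
        interface FirstCamera  { Frame nextFirstImage(); }     // SLAM camera 260A
        interface SecondCamera { Frame nextSecondImage(); }    // RGB camera 260B
        interface SlamPoseSource {
            void feed(Frame firstImage);                       // input to operation 806
            double[] firstPoseAt(long timestampNs);            // first camera pose
        }
        interface Renderer {
            void render(Frame background, double[] secondCameraPose);  // operation 810
        }
        static final class Frame {
            long timestampNs;
            byte[] pixels;
        }

        void run(FirstCamera first, SecondCamera second,
                 SlamPoseSource slam, Renderer renderer, double[] wideFromRgb) {
            while (true) {
                // Operations 802/804: the two cameras capture concurrently; here
                // the streams are polled sequentially for simplicity.
                Frame firstImage = first.nextFirstImage();
                slam.feed(firstImage);                          // operation 806

                Frame secondImage = second.nextSecondImage();
                double[] worldFromWide = slam.firstPoseAt(secondImage.timestampNs);
                // Operation 808: apply the pre-determined coordinate correlation.
                double[] worldFromRgb =
                        PoseConversion.secondPoseFromFirst(worldFromWide, wideFromRgb);
                renderer.render(secondImage, worldFromRgb);     // operation 810
            }
        }
    }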

[0065] In some embodiments, the electronic device further obtains (812) a plurality of motion sensor data measured concurrently with the sequence of first images and the sequence of second images. Referring to Figure 4A, the plurality of first camera poses are determined based on the sequence of first images 408 and the plurality of motion sensor data 406 using simultaneous localization and mapping (SLAM). Further, in some embodiments, each second image 416 corresponds to a respective second camera pose and a second pose time, and the respective second camera pose is determined from a corresponding first camera pose based on the pre-determined coordinate correlation. Additionally, in some embodiments, for each second image 416, the corresponding first camera pose from which the respective second camera pose is determined is selected, interpolated, or extrapolated from the plurality of first camera poses based on the second pose time of the respective second camera pose. Specifically, the corresponding first camera pose is retroactively determined, determined in real time, or predicted based on one of the first image 408 and a subset of the motion sensor data 406.

[0066] In some embodiments, the electronic device obtains a plurality of motion sensor data 406 measured concurrently with the sequence of first images 408 and the sequence of second images 416. Referring to Figure 4B, the plurality of first camera poses of the first camera 260A (specifically, a plurality of image-based camera poses) are determined based on the sequence of first images, e.g., without using the motion sensor data 406. The second camera poses of the second camera 260B are converted from the first camera poses of the first camera 260A by converting the first camera poses (i.e., the image-based camera poses) based on the pre-determined coordinate correlation and determining a plurality of preliminary camera poses based on the converted first camera poses and the plurality of motion sensor data. Further, in some embodiments, each second image 416 corresponds to a respective second camera pose and a second pose time, and the respective second camera pose is selected, interpolated, or extrapolated from the plurality of preliminary camera poses based on the second pose time of the respective second camera pose.

[0067] In some embodiments, the sequence of first images 408 has a first frame rate (e.g., 30 FPS), and the sequence of second images 416 has a second frame rate (e.g., 60 FPS) that is distinct from the first frame rate. The plurality of motion sensor data 406 have a sensor sampling rate (e.g., 1000 samples per second) that is greater than the first and second frame rates. Further, in some embodiments, the sequence of third images 418 has a display refresh rate that is optionally identical to or distinct from the second frame rate.

[0068] In some embodiments, each of the sequence of first images 408 is obtained with a respective first timestamp 506 indicating a respective first image time when the respective first image 408 is captured. Each of the sequence of second images 416 is obtained with a respective second timestamp 514 indicating a respective second image time when the respective second image 416 is captured. Each motion sensor data 406 is obtained with a respective sensor timestamp indicating a respective sensor time when the respective motion sensor data 406 is measured. Each of the first and second camera poses is associated with a respective pose time determined by one of the respective first image time, second image time, and sensor time.

[0069] In some embodiments, the display of the virtual object is rendered on top of each second image 416 with a respective second camera pose corresponding to a second pose time that is synchronous to one of a second image time when the respective second image is available for image rendering and a display time when the virtual object is rendered. For example, referring to Figure 4A, the third image 418A is obtained by rendering the virtual object on the second image 416A that is made available immediately before the display time of the third image 418A. In some situations, the second camera pose used to render the virtual object corresponds to the display time. Alternatively, in some situations, the second camera pose used to render the virtual object corresponds to the second pose time of the second image 416A.

[0070] In some embodiments, the sequence of first images 408 are captured by the first camera 260A according to a first camera setting (e.g., SLAM-related camera settings 504 in Figure 5) configured to facilitate simultaneous localization and mapping (SLAM) associated with the first camera 260A, and the sequence of second images 416 are captured by the second camera 260B according to a second camera setting (e.g., image-related camera settings 512 in Figure 5) configured to facilitate rendering the display of the virtual object. Further, in some embodiments, the first camera setting does not include automatic adjustment of a focal length and an exposure time of the first camera such that a latency of each first image captured by the first camera is minimized, and the second camera setting includes autofocus or exposure configurations that are selected freely by a user or a program developer, such that an image quality of each second image captured by the second camera is adjusted to render the display of the virtual object.
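
As one possible illustration on Android, the two camera settings could be expressed as camera2 capture requests such as the sketch below; the specific exposure, sensitivity, and focus values are placeholders and do not correspond to the actual SLAM-related camera settings 504 or image-related camera settings 512.

    import android.hardware.camera2.CameraAccessException;
    import android.hardware.camera2.CameraDevice;
    import android.hardware.camera2.CameraMetadata;
    import android.hardware.camera2.CaptureRequest;
    import android.view.Surface;

    // Illustrative camera2 capture-request builders contrasting the two camera
    // settings: fixed focus and exposure for the SLAM (first) camera, automatic
    // focus and exposure for the RGB (second) camera.
    final class DualCameraSettings {
        static CaptureRequest.Builder slamCameraRequest(CameraDevice wideAngleCamera,
                                                        Surface slamSurface)
                throws CameraAccessException {
            CaptureRequest.Builder b =
                    wideAngleCamera.createCaptureRequest(CameraDevice.TEMPLATE_PREVIEW);
            b.addTarget(slamSurface);
            // Disable autofocus and auto-exposure so every frame is delivered with
            // minimal, predictable latency for pose estimation.
            b.set(CaptureRequest.CONTROL_AF_MODE, CameraMetadata.CONTROL_AF_MODE_OFF);
            b.set(CaptureRequest.LENS_FOCUS_DISTANCE, 0.0f);        // focus at infinity
            b.set(CaptureRequest.CONTROL_AE_MODE, CameraMetadata.CONTROL_AE_MODE_OFF);
            b.set(CaptureRequest.SENSOR_EXPOSURE_TIME, 5_000_000L); // 5 ms, fixed
            b.set(CaptureRequest.SENSOR_SENSITIVITY, 400);          // fixed ISO
            return b;
        }

        static CaptureRequest.Builder rgbCameraRequest(CameraDevice rgbCamera,
                                                       Surface previewSurface)
                throws CameraAccessException {
            CaptureRequest.Builder b =
                    rgbCamera.createCaptureRequest(CameraDevice.TEMPLATE_PREVIEW);
            b.addTarget(previewSurface);
            // Let the camera optimize image quality for the background preview.
            b.set(CaptureRequest.CONTROL_AF_MODE,
                  CameraMetadata.CONTROL_AF_MODE_CONTINUOUS_PICTURE);
            b.set(CaptureRequest.CONTROL_AE_MODE, CameraMetadata.CONTROL_AE_MODE_ON);
            return b;
        }
    }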

[0071] In some embodiments, the electronic device obtains object-oriented information (e.g., object-oriented shade, geometry, texture information) of the virtual object from a graphics library based on at least the plurality of second camera poses of the second camera and/or the second images 416. The display of the virtual object is rendered on top of the sequence of second images 416 based on the object-oriented information.

[0072] For the first camera 260A, a wide-angle camera can be applied to provide a larger field of view and more feature points than a visible light camera that has a similar focal length and exposure time. For the second camera 260B, camera settings are not constrained by the requirements of the SLAM module 232 and can be freely configured by the developer to focus on the visual effect and image quality of the second images 416. Autofocus can be used, allowing users to watch a clear preview of the second camera, and an automatic exposure algorithm allows users to watch a bright color camera preview. Such a second camera 260B is directly connected to the computer vision module 522 to avoid hardware resource consumption and processing delay caused by additional image processing. Additionally, based on an asynchronous time warp scheme of SLAM, the method 800 can output, at a substantially high frame rate (e.g., 60 FPS), the third images 418 that include virtual objects overlaid on the second images 416, thereby enhancing user experience with augmented or mixed reality.

[0073] A dual-camera based monocular SLAM architecture as described herein is based on monocular SLAM and is different from a binocular SLAM architecture. In this application, the dual-camera based monocular SLAM architecture is directed to bringing more data benefits to monocular SLAM and related modules based on hardware advantages of multiple cameras (e.g., a wide-angle camera). The requirements of SLAM and of computer vision for input image data are considered separately and addressed by different cameras. Virtual objects in the mobile phone AR scene are better combined with camera previews.

[0074] It should be understood that the particular order in which the operations in Figure 8 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to use two cameras for SLAM and image rendering as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 3-7 are also applicable in an analogous manner to method 800 described above with respect to Figure 8. For brevity, these details are not repeated here.

[0075] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[0076] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

[0077] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.

[0078] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.