Title:
CLOSED-LOOP POSE DETECTION AND MAPPING IN SLAM
Document Type and Number:
WIPO Patent Application WO/2023/229600
Kind Code:
A1
Abstract:
This application is directed to camera pose determination in simultaneous localization and mapping (SLAM). An electronic device obtains a sequence of visual images each captured at a respective room. For a first visual image captured at a current room, the electronic device identifies the current room among a plurality of known rooms. Each known room is mapped by a respective set of former images, and each former image is associated with a respective former pose in the respective known room. In accordance with identification of the current room among the known rooms, the electronic device identifies a first set of former images associated with the current room, selects a first former image from the first set of former images to match the first visual image, and determines a first camera pose associated with the first visual image based on a first former pose associated with the first former image.

Inventors:
LIN YUN-JOU (US)
Application Number:
PCT/US2022/031183
Publication Date:
November 30, 2023
Filing Date:
May 26, 2022
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06T7/20; G06T7/00
Foreign References:
US20190035100A12019-01-31
Attorney, Agent or Firm:
WANG, Jianbai et al. (US)
Claims:
What is claimed is:

1. A method for SLAM, implemented at an electronic device, comprising: obtaining a sequence of visual images each of which is captured at a respective room; for a first visual image, identifying a current room where the first visual image is captured among a plurality of known rooms, wherein each known room corresponds to a respective set of former images that map the respective known room, and each former image is associated with a respective former pose in the respective known room; and in accordance with an identification of the current room among the plurality of known rooms: identifying a first set of former images associated with the current room; selecting a first former image from the first set of former images to match the first visual image; and determining a first camera pose associated with the first visual image based on a first former pose associated with the first former image.

2. The method of claim 1, further comprising: obtaining a sequence of depth images that are synchronous with the sequence of visual images, each visual image corresponding to a respective depth image and associated with a respective camera pose, the sequence of visual images having a first subset of visual images that are captured in the current room and include the first visual image; and building a current three-dimensional (3D) room model of the current room from the first subset of visual images and a first subset of corresponding depth images.

3. The method of claim 2, wherein the first subset of visual images includes a second visual image that follows the first visual image, further comprising: determining a second camera pose associated with the second visual image based on a second former pose associated with a second former image in the first set of former images.

4. The method of claim 2, wherein the first subset of visual images includes an alternative visual image, further comprising: determining an alternative camera pose associated with the alternative visual image based on a camera pose associated with an earlier visual image in the first subset of visual images.

5. The method of any of claims 2-4, wherein the first subset of visual images includes a third visual image that precedes the first visual image, further comprising: determining that a current 3D room model is not complete for the first subset of visual images at a time of obtaining the third visual image, wherein the third visual image is not associated with any known room and corresponding former images of any known room in accordance with a determination that the current 3D room model is not complete for the first subset of visual images.

6. The method of any of claims 2-4, wherein each of the plurality of known rooms corresponds to a respective set of embeddings, the method further comprising: extracting a first set of embeddings for a current 3D room model of the current room, the first set of embeddings including semantic and instance information describing the current room, wherein the first set of embeddings includes a plurality of semantic elements configured to provide semantic information of a plurality of objects or regions located in the current 3D room model.

7. The method of claim 6, further comprising: comparing the first set of embeddings with a respective set of embeddings of each known room, and in accordance with a comparison result, determining that the first set of embeddings satisfies a room identification criterion, thereby identifying the current room among the plurality of known rooms.

8. The method of claim 6, further comprising: for each known room, determining a similarity level of the first set of embeddings and the respective set of embeddings of the respective known room, and determining (1) that a first similarity level of the first set of embeddings of the current room and the respective set of embeddings of one of the plurality of known rooms is greater than a similarity threshold and (2) that the first similarity level is the greatest among a first plurality of similarity levels of the first set of embeddings with the respective sets of embeddings corresponding to the plurality of known rooms, thereby identifying the current room as the one of the plurality of known rooms.

9. The method of claim 1, further comprising: obtaining a sequence of depth images that are synchronous with the sequence of visual images, each visual image corresponding to a respective depth image and associated with a respective camera pose, the sequence of visual images having a second subset of visual images captured in a next room; creating a next 3D room model of the next room from the second subset of visual images and a second subset of corresponding depth images; extracting a second set of embeddings from the next 3D room model, the second set of embeddings including semantic and instance information describing the next room; for each known room, determining a similarity level of the second set of embeddings and the respective set of embeddings of the respective known room; determining (1) that a second similarity level of the second set of embeddings and the respective set of embeddings corresponding to one of the plurality of known rooms is less than a similarity threshold and (2) that the second similarity level is the greatest among a second plurality of similarity levels of the second set of embeddings with the respective sets of embeddings corresponding to the plurality of known rooms; and determining that the next room is not among the plurality of known rooms.

10. The method of claim 9, further comprising: expanding the plurality of known rooms to include the next room, wherein the next room is associated with the second set of embeddings.

11. The method of any of the preceding claims, further comprising: for each of the first set of former images, extracting respective image-based descriptors describing objects or regions existing in the respective former image; wherein selecting the first former image further includes: obtaining a first image-based descriptor describing a first region in the first visual image; comparing the first image-based descriptor with the respective image-based descriptors of the first set of former images; and in accordance with a comparison result, determining that at least one respective image-based descriptor of each of a subset of former images satisfies an image selection criterion, the subset of former images including the first former image.

12. The method of any of claims 1-10, further comprising: for each of the first set of former images, extracting respective image-based descriptors describing objects or regions existing in the respective former image; wherein selecting the first former image further includes: obtaining a first image-based descriptor describing a first region in the first visual image; for each of the first set of former images, determining a respective similarity level of the first image-based descriptor and at least one respective image-based descriptor; determining that a subset of similarity levels between the first image-based descriptor and the at least one respective image-based descriptor of each of a subset of former images are the greatest among a plurality of similarity levels of the first image-based descriptor and the respective image-based descriptors of the first set of former images, the subset of former images including the first former image.

13. The method of claim 11 or 12, wherein the first image-based descriptor is extracted from the first visual image using a You Only Look Once (YOLO) object detection model, and for each of the first set of former images, the respective image-based descriptors are extracted from the YOLO object detection model.

14. The method of claim 11 or 12, wherein the first image-based descriptor is extracted from the first visual image using a maximally stable extremal regions (MSER) recognition model, and for each of the first set of former images, the respective image-based descriptor is extracted from the MSER recognition model.

15. The method of claim 13 or 14, wherein selecting the first former image further comprises: for each of the subset of former images: identifying a respective region associated with the at least one respective image-based descriptor of the respective former image, the respective region matching the first region of the first visual image; and comparing keypoints of the first region of the first visual image with keypoints of the respective region of the respective former image in the subset of former images; and selecting the first former image whose keypoints of the respective region match the keypoints of the first visual image better than the keypoints of the respective region of any other former image in the subset of former images.

16. The method of any of the preceding claims, the sequence of visual images including a plurality of former images, the plurality of former images including the respective set of former images for each known room, the method further comprising: prior to obtaining the first visual image, creating a respective 3D room model for each known room, wherein the current room is identified among the plurality of known rooms based on the respective 3D room model of each known room.

17. The method of any of the preceding claims, wherein each visual image is one of a color image and a monochromatic image.

18. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-17.

19. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform a method of any of claims 1-17.

Description:
Closed-Loop Pose Detection and Mapping in SLAM

TECHNICAL FIELD

[0001] This application relates generally to image data processing including, but not limited to, methods, systems, and non-transitory computer-readable media for determining a camera pose of a camera device in simultaneous localization and mapping (SLAM).

BACKGROUND

[0002] Simultaneous localization and mapping (SLAM) is widely applied in virtual reality (VR), augmented reality (AR), autonomous driving, and navigation. In SLAM, feature points of a scene are detected and applied to map a three-dimensional (3D) virtual space corresponding to the scene. The feature points can be conveniently and accurately mapped for the 3D virtual space using an optical camera and an Inertial Measurement Unit (IMU) that exist in many mobile devices. Additionally, current work detects a historic keyframe that matches a current image based on a bag-of-words technique. A bag-of-words descriptor is extracted from each historic keyframe and saved in a mapping database to represent the respective historic keyframe. Descriptor matching is applied to compare the current image with each of a set of historic keyframes and identify the matching historic keyframe, thereby forming a loop closure between the current image and the matching historic keyframe in SLAM. However, as the mapping database is expanded to include more historic keyframes, the search time increases while search reliability drops. In some situations, data are collected from large spaces corresponding to similar scenes. Many SLAM systems only consider two-dimensional (2D) descriptors, which lack semantic or structure information, and tend to produce false detection results based on this bag-of-words technique. Even if a similar historic keyframe can be detected, image matching may fail during a following stage of comparing feature points. It would be beneficial to have a fast, efficient, and accurate image matching mechanism for a SLAM system.
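For readers unfamiliar with the bag-of-words matching summarized above, the following sketch illustrates the exhaustive scan over historic keyframe descriptors whose cost grows with the mapping database; the cosine-similarity measure, the threshold, and all names are illustrative assumptions rather than part of any system described in this application.

```python
import numpy as np

def match_keyframe_bow(current_hist, keyframe_hists, min_score=0.7):
    """Return the index of the best-matching historic keyframe, or None.

    current_hist: 1-D bag-of-words histogram of the current image.
    keyframe_hists: list of 1-D histograms, one per historic keyframe.
    The linear scan over every keyframe is what becomes slow and unreliable
    as the mapping database grows, which motivates room-level filtering.
    """
    best_idx, best_score = None, 0.0
    for idx, hist in enumerate(keyframe_hists):
        denom = np.linalg.norm(current_hist) * np.linalg.norm(hist)
        score = float(np.dot(current_hist, hist) / denom) if denom > 0 else 0.0
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx if best_score >= min_score else None
```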

SUMMARY

[0003] Various implementations of this application disclose a hierarchy loop closure detection and matching strategy to achieve accurate SLAM results and provide structure-level semantic information. A fast, efficient, and accurate SLAM-based image matching mechanism is applied to identify a matching historic keyframe (also called a matching former image) that matches a current image and determine a camera pose of a camera, e.g., when the camera moves among a plurality of scenes. The current image and the matching historic keyframe form a loop closure in SLAM. Specifically, a mobile device obtains RGB images, wide-angle fisheye images, or depth images captured by the camera that is integrated in the mobile device or another electronic device. The camera is optionally an RGB camera or wide-angle fisheye camera, and is applied jointly with an IMU in a SLAM system. A pose of the camera has six degrees of freedom (6DOF) and is determined by the SLAM system using image and inertial data collected by the camera and IMU. A 3D model is reconstructed for a scene using 3D image data and associated poses. A room detection thread is implemented to check whether the collected data are sufficient to identify a room. In accordance with a determination that a room is identified from the collected data, information of the identified room is stored in a room database and corresponding images are stored as historic keyframes (also called former images) labelled with a corresponding room identification.

[0004] After the room where the current image is captured is identified, room-level matching is conducted to determine whether the room has been visited previously. In accordance with a determination that the identified room is among a plurality of known rooms that were previously visited, the current image is associated with a set of historic keyframes stored with the corresponding known room. Region and/or object matching and keypoint matching follow room-level matching, and are applied to identify a matching historic keyframe from which a former pose is extracted and used to determine a camera pose associated with the current image. As such, room-level matching is added as a new matching level to reduce the number of historic keyframes to which the current image is compared, thereby significantly increasing the speed, efficiency, and accuracy of image matching and pose determination in SLAM.
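The three matching levels described in this summary can be pictured as a cascade. The sketch below is only a schematic of that hierarchy; the callable parameters (identify_room, match_regions, match_keypoints, estimate_pose_from) stand in for the room-level, region/object-level, and keypoint-level steps and are hypothetical placeholders, not functions defined by this application.

```python
from typing import Callable, Optional, Sequence

def hierarchical_loop_closure(
    current_image,
    room_database: dict,
    identify_room: Callable,        # room-level matching (hypothetical callable)
    match_regions: Callable,        # region/object-level matching (hypothetical callable)
    match_keypoints: Callable,      # keypoint-level matching (hypothetical callable)
    estimate_pose_from: Callable,   # derives the camera pose from a former pose
) -> Optional[object]:
    """Cascade: room level -> region/object level -> keypoint level."""
    room_id = identify_room(current_image, room_database)
    if room_id is None:
        return None                                   # current room is not a known room
    candidates: Sequence = room_database[room_id]     # former images of the matching room

    candidates = match_regions(current_image, candidates)
    best_former = match_keypoints(current_image, candidates)
    if best_former is None:
        return None

    # The camera pose of the current image is derived from the matching former pose.
    return estimate_pose_from(best_former, current_image)
```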

[0005] In one aspect, a method is implemented at an electronic device for SLAM. The method includes obtaining a sequence of visual images each of which is captured at a respective room, and for a first visual image (e.g., a current image), identifying a current room where the first visual image is captured among a plurality of known rooms. Each known room corresponds to a respective set of former images that map the respective known room, and each former image is associated with a respective former pose in the respective known room. The method further includes, in accordance with an identification of the current room among the plurality of known rooms, identifying a first set of former images associated with the current room, selecting a first former image from the first set of former images to match the first visual image, and determining a first camera pose associated with the first visual image based on a first former pose associated with the first former image.

[0006] In some embodiments, the method includes obtaining a sequence of depth images that are synchronous with the sequence of visual images. Each visual image corresponds to a respective depth image and is associated with a respective camera pose, and the sequence of visual images has a first subset of visual images that are captured in the current room and include the first visual image. The method further includes building a first three-dimensional (3D) room model of the current room from the first subset of visual images and a first subset of corresponding depth images. Further, in some embodiments, each of the plurality of known rooms corresponds to a respective set of embeddings. The method further includes extracting a first set of embeddings for a current 3D room model of the current room. The first set of embeddings includes semantic and instance information describing the current room. The first set of embeddings includes a plurality of semantic elements configured to provide semantic information of a plurality of objects or regions located in the current 3D room model. Additionally, in some embodiments, the method further includes comparing the first set of embeddings with a respective set of embeddings of each known room, and in accordance with a comparison result, determining that the first set of embeddings satisfies a room identification criterion, thereby identifying the current room among the plurality of known rooms.
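A minimal sketch of the room identification criterion summarized in paragraph [0006] (and echoed in claims 7-8), assuming for simplicity that each room's set of embeddings has been collapsed into a single vector; the cosine similarity and the 0.8 threshold are illustrative assumptions, not values taken from this application.

```python
import numpy as np

def identify_room_by_embeddings(current_embedding, known_room_embeddings, threshold=0.8):
    """Identify the current room among known rooms, or report it as unknown.

    current_embedding: 1-D vector summarizing the current 3D room model.
    known_room_embeddings: mapping of room_id -> 1-D embedding vector.
    Returns the matching room_id, or None if no similarity reaches the threshold
    (in which case the room may later be added as a new known room).
    """
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom > 0 else 0.0

    scores = {rid: cosine(current_embedding, emb)
              for rid, emb in known_room_embeddings.items()}
    if not scores:
        return None
    best_room = max(scores, key=scores.get)
    # Criterion mirrored from claim 8: greatest similarity AND above the threshold.
    return best_room if scores[best_room] >= threshold else None
```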

[0007] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0008] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

[0009] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0011] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0012] Figure 2 is a block diagram illustrating an electronic system for processing data (e.g., image data, motion data), in accordance with some embodiments.

[0013] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.

[0014] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.

[0015] Figure 5 is a flowchart of a process for processing inertial sensor data (also called motion data) and image data of an electronic system (e.g., a server, a client device, or a combination of both) using a SLAM module, in accordance with some embodiments.

[0016] Figure 6 is a flow diagram of a process for determining a camera pose based on room matching, in accordance with some embodiments.

[0017] Figure 7 is a diagram having a plurality of visual images captured as a camera moves among a plurality of rooms, in accordance with some embodiments.

[0018] Figure 8A is a block diagram of an example room having projected wall points on a 2D floor plan, in accordance with some embodiments, and Figure 8B is a block diagram of an example room digitalized to a 2D room map, in accordance with some embodiments.

[0019] Figure 9 is a flow diagram of an example process for identifying a subset of former images that are substantially similar to a first visual image based on region matching, in accordance with some embodiments.

[0020] Figure 10 is a flow diagram of an example process for matching a region 904 in a first visual image with a corresponding region in one of the plurality of known rooms using keypoints, in accordance with some embodiments.

[0021] Figure 11 is a flow diagram of an example method for determining a camera pose associated with a visual image, in accordance with some embodiments.

[0022] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0023] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.

[0024] Extended reality includes augmented reality (AR) in which virtual objects are overlaid on a view of a real physical world, virtual reality (VR) that includes only virtual content, and mixed reality (MR) that combines both AR and VR and in which a user is allowed to interact with real-world and digital objects in real time. Key techniques in AR include SLAM and physical world reconstruction. Image data and inertial motion data are collected and processed using SLAM techniques, and enable a 6DOF tracking system to identify stationary feature points distributed in a scene in many mobile AR systems. In some embodiments, the image data include RGBD images, and each RGBD image has a color image including red, green, and blue band information and a depth image. Optionally, the color image used in SLAM can be used in monocular depth estimation of the depth image using deep learning techniques. SLAM tools are developed to localize a location of a user and build a 3D representation of an environment. A computer model of the 3D appearance of the scene is established from information of the feature points in the scene. A location where a camera has visited is detected in a loop closure detection process, e.g., by establishing correspondence between feature points of a current image and each historic keyframe.

[0025] Errors in estimation of a 3D map of the scene are cumulative and propagative, and a drift is inherent for camera poses determined in SLAM. A quality of the reconstructed 3D map and an accuracy of the camera pose generated from SLAM are determined by loop closure detection and matching techniques, i.e., the camera must return to a location with no or a small mismatch in the corresponding camera poses to enable a desirable quality of the reconstructed 3D map and accuracy of the camera pose. However, many indoor places may appear similar to each other. Ambiguities tend to arise as a corresponding historic keyframe that matches a current frame is identified from a historic keyframe dataset. This increases a difficulty in loop closure detection and results in many false loop closure results. In some embodiments, a hierarchy loop closure detection strategy is applied based on room matching. Room-level information is generated for the current image and matched with corresponding room-level information of a set of historic keyframes of a plurality of known rooms. In some embodiments, a room where the current image is captured is identified in the plurality of known rooms. A known room identified as the room where the current image is captured corresponds to a set of historic keyframes. A historic keyframe is identified from the set of historic keyframes to match the current image based on object or region matching and keypoint matching. By these means, such hierarchy loop closure detection can efficiently reduce false detection results and increase a positive detection rate and a matching speed.

[0026] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, head-mounted displays (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by another client device 104 or the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.

[0027] The one or more servers 102 can enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 can implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console.

[0028] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.

[0029] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequently to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.

[0030] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.

[0031] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D include a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including a user's voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.

[0032] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.

[0033] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200 typically includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, a client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geolocation receiver, for determining the location of the client device 104.

[0034] Optionally, the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space. Examples of the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.

[0035] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:

* Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;

* Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

* User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);

* Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;

* Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

* One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);

* Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;

* Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224;

* Pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., AR glasses 104D), where in some embodiments, the pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228;

* Pose-based rendering module 238 for rendering virtual objects on top of a field of view of the camera 260 of the client device 104 or creating mixed, virtual, or augmented reality content using images captured by the camera 260, where the virtual objects are rendered and the mixed, virtual, or augmented reality content are created from a perspective of the camera 260 based on a camera pose of the camera 260; and

* One or more databases 240 for storing at least data including one or more of:
  o Device settings 242 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
  o User account information 244 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
  o Network parameters 246 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
  o Training data 248 for training one or more data processing models 250;
  o Data processing model(s) 250 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques;
  o Pose data database 252 for storing pose data of the camera 260;
  o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 250 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on the client device 104, and include the candidate images; and
  o Room database 256 for storing information of 3D room models of a plurality of rooms in which image data are captured by the camera 260, where the information of the 3D room models includes at least a respective set of embeddings further including semantic and instance information describing each room.

[0036] In some embodiments, the pose determination and prediction module 230 further includes a SLAM module 232 for mapping a scene where a client device 104 is located and identifying a pose of the client device 104 within the scene using image and IMU sensor data. Additionally, in some embodiments, the SLAM module 232 includes a multilevel mapping module 234 for matching a current image with a historic keyframe on a room level, an object and region level, and a keypoint level. The camera pose of the client device 104 associated with the current image is identified based on a former pose of the client device 104 associated with the historic keyframe.

[0037] Optionally, the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.

[0038] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules, or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.

[0039] Figure 3 is another example data processing system 300 for training and applying a neural network based (NN-based) data processing model 250 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 250 and a data processing module 228 for processing the content data using the data processing model 250. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 250 to the client device 104.

[0040] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 250 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 that is applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 250, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 250 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 250 is provided to the data processing module 228 to process the content data.
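As an illustration of the train-until-the-loss-criterion-is-satisfied loop performed by the model training engine 310 and loss control module 312, the toy gradient-descent example below fits a two-parameter linear model; the model, learning rate, and loss threshold are placeholders and are not the data processing model 250 itself.

```python
import numpy as np

def train_until_converged(x, y, lr=0.01, loss_threshold=1e-3, max_iters=10000):
    """Fit y ~ w*x + b by gradient descent, mimicking the loop in which the
    training engine keeps modifying the model until the monitored loss
    satisfies a loss criterion (here, falling below a threshold)."""
    w, b = 0.0, 0.0
    for _ in range(max_iters):
        pred = w * x + b
        error = pred - y
        loss = float(np.mean(error ** 2))        # loss monitored by the loss control module
        if loss < loss_threshold:                # loss criterion satisfied
            break
        w -= lr * float(np.mean(2 * error * x))  # modify the model to reduce the loss
        b -= lr * float(np.mean(2 * error))
    return w, b, loss

# Example: noisy samples of y = 3x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3 * x + 1 + 0.01 * rng.normal(size=200)
print(train_until_converged(x, y))
```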

[0041] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.

[0042] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 250 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 250. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.

[0043] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 250, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 250 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 250 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.
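To make the propagation function of a single node 420 concrete, the sketch below combines four inputs with weights w1 through w4, adds an optional bias, and applies a sigmoid activation; the sigmoid choice and the numeric values are illustrative assumptions only.

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """One node 420: a non-linear activation applied to a linear weighted
    combination of the node inputs (plus an optional bias term)."""
    z = float(np.dot(weights, inputs)) + bias   # linear weighted combination
    return 1.0 / (1.0 + np.exp(-z))             # sigmoid activation (one possible choice)

# Example with four inputs combined by weights w1..w4, as in Figure 4B.
inputs = np.array([0.5, -1.0, 0.25, 2.0])
weights = np.array([0.1, 0.4, -0.3, 0.2])       # w1, w2, w3, w4
print(node_output(inputs, weights, bias=0.05))
```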

[0044] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers includes a single layer acting as both an input layer and an output layer. Optionally, the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for downsampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.

[0045] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 250 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolutional layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, inpainting, or synthesis.

[0046] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 250 to process content data (particularly, textual and audio data). Nodes in layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
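The receptive-area idea described for the CNN in paragraph [0045] can be shown with a bare-bones 'valid' convolution in which each output node reads only a small patch of the previous layer; this is a generic illustration, not the CNN architecture used in the data processing model 250.

```python
import numpy as np

def conv2d_relu(image, kernel):
    """Minimal 'valid' 2-D convolution followed by ReLU, illustrating how each
    node of a convolutional layer reads only a small receptive area of the
    previous layer rather than the entire layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            receptive_area = image[i:i + kh, j:j + kw]   # small patch of the previous layer
            out[i, j] = np.sum(receptive_area * kernel)  # dot product with shared weights
    return np.maximum(out, 0.0)                          # ReLU activation

# Example: a 3x3 edge-like kernel applied to a random 8x8 single-channel image.
rng = np.random.default_rng(0)
feature_map = conv2d_relu(rng.normal(size=(8, 8)),
                          np.array([[1., 0., -1.]] * 3))
print(feature_map.shape)   # (6, 6)
```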

[0047] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.

[0048] Figure 5 is a flowchart of a process 500 for processing inertial sensor data (also called motion data) and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a SLAM module (e.g., 232 in Figure 2), in accordance with some embodiments. The process 500 includes measurement preprocessing 502, initialization 504, local visual-inertial odometry (VIO) with relocation 506, and global pose graph optimization 508. In measurement preprocessing 502, a camera 260 captures image data of a scene at an image frame rate (e.g., 30 FPS), and features are detected and tracked (510) from the image data. An inertial measurement unit (IMU) 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the camera 260 capturing the image data, and the inertial sensor data are pre-integrated (512) to provide pose data. In initialization 504, the image data captured by the camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (514). Vision-only structure from motion (SfM) techniques are applied (516) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the camera 260.
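A naive sketch of the IMU pre-integration step (512), accumulating gyroscope and accelerometer samples between two image timestamps; it deliberately ignores gravity compensation, sensor biases, and proper rotation handling that a production VIO front end would require, so treat it purely as an illustration of the bookkeeping involved.

```python
import numpy as np

def preintegrate_imu(timestamps, gyro, accel):
    """Naively accumulate IMU samples captured between two camera frames.

    timestamps: (N,) sample times in seconds, sorted in increasing order.
    gyro:       (N, 3) angular velocities in rad/s (body frame).
    accel:      (N, 3) linear accelerations in m/s^2 (body frame).
    Returns accumulated rotation vector, velocity delta, and position delta,
    valid only for small motions and short intervals (illustrative only).
    """
    d_rot = np.zeros(3)
    d_vel = np.zeros(3)
    d_pos = np.zeros(3)
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        d_rot += gyro[i] * dt                          # small-angle rotation accumulation
        d_pos += d_vel * dt + 0.5 * accel[i] * dt ** 2
        d_vel += accel[i] * dt
    return d_rot, d_vel, d_pos
```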

[0049] After initialization 504 and in relocation 506, a sliding window 518 and associated states from a loop closure 520 are used to optimize (522) a VIO. When the VIO corresponds (524) to a keyframe of a smooth video transition and a corresponding loop is detected (526), features are retrieved (528) and used to generate the associated states from the loop closure 520. In global pose graph optimization 508, a multi-degree-of-freedom pose graph is optimized (530) based on the states from the loop closure 520, and a keyframe database 532 is updated with the keyframe associated with the VIO.

[0050] Additionally, the features that are detected and tracked (510) are used to monitor (534) motion of an object in the image data and estimate image-based poses 536, e.g., according to the image frame rate. In some embodiments, the inertial sensor data that are pre-integrated (513) may be propagated (538) based on the motion of the object and used to estimate inertial-based poses 540, e.g., according to the sampling frequency of the IMU 280. The image-based poses 536 and the inertial-based poses 540 are stored in a pose data buffer and used by the SLAM module 232 to estimate and predict poses. Alternatively, in some embodiments, the SLAM module 232 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 536 to estimate and predict more poses.
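
The pose data buffer described in paragraph [0050] can be sketched as a timestamp-sorted container that holds both image-based poses 536 and inertial-based poses 540 and returns the stored pose nearest to a query time. The interface and fields below are illustrative assumptions rather than the module's actual API.

```python
# Minimal sketch of the pose data buffer in [0050]: image-based poses arrive at
# the frame rate, inertial-based poses at the IMU rate, and both are kept with
# timestamps so the SLAM module can look up the pose nearest to a query time.
import bisect
from collections import namedtuple

Pose = namedtuple("Pose", ["t", "position", "orientation", "source"])

class PoseBuffer:
    def __init__(self):
        self._poses = []                       # kept sorted by timestamp

    def push(self, pose):
        bisect.insort(self._poses, pose)       # Pose compares by its first field, t

    def nearest(self, t):
        i = bisect.bisect_left([p.t for p in self._poses], t)
        candidates = self._poses[max(0, i - 1): i + 1]
        return min(candidates, key=lambda p: abs(p.t - t))

buf = PoseBuffer()
buf.push(Pose(0.000, (0, 0, 0), (1, 0, 0, 0), "image"))
buf.push(Pose(0.001, (0, 0, 0), (1, 0, 0, 0), "imu"))
print(buf.nearest(0.0009).source)              # -> "imu"
```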

[0051] In SLAM, high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors and the IMU 280. The imaging sensors (e.g., camera 260, lidars) provide image data desirable for pose estimation, and oftentimes operate at a low frequency (e.g., 30 frames per second) and with a large latency (e.g., 30 milliseconds). Conversely, the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., < 0.1 millisecond). Asynchronous time warping (ATW) is often applied in an AR system to warp an image before it is sent to a display to correct for head movement that occurs after the image is rendered. ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing image frames. In both SLAM and ATW, relevant image data and inertial sensor data are stored locally such that they can be synchronized and used for pose estimation/prediction. In some embodiments, the image and inertial sensor data are stored in one of multiple STL containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally very convenient for use. The image and inertial sensor data are stored in the STL containers with their timestamps, and the timestamps are used for data search, data insertion, and data organization.

[0052] In some embodiments, the camera 260 includes a wide angle fisheye camera or an RGB camera, and the SLAM module 232 utilizes images captured by the camera 260 and IMU data to determine camera poses of the camera 260. Resulting camera poses are more accurate when both the image and IMU data are used than when only image data are involved. In some situations, a gravity direction is derived from the IMU data and applied to reconstruct the 3D model in which a normal of a ground plane is parallel with the gravity direction. The normal of the ground plane helps differentiate a known room from one or more other similar rooms.
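
The timestamp-keyed buffering of paragraph [0051] can be sketched in Python as an analogue of the STL containers mentioned there (std::vector, std::queue, std::list): incoming IMU samples are queued with their timestamps, and the samples falling between two image timestamps are retrieved for pre-integration. The function names and rates are assumptions for the example.

```python
# Minimal sketch of timestamp-keyed buffering as in [0051]: IMU samples are kept
# in a queue and, for each image timestamp, the samples that fall between the
# previous and current image are popped for pre-integration.
from collections import deque

imu_queue = deque()                      # (timestamp, gyro, accel), ~1000 Hz

def on_imu(t, gyro, accel):
    imu_queue.append((t, gyro, accel))

def samples_between(t_prev_image, t_image):
    """Pop and return IMU samples with t_prev_image < t <= t_image."""
    out = []
    while imu_queue and imu_queue[0][0] <= t_image:
        t, gyro, accel = imu_queue.popleft()
        if t > t_prev_image:
            out.append((t, gyro, accel))
    return out

for k in range(100):                     # simulate 100 IMU samples at 1 kHz
    on_imu(k / 1000.0, (0, 0, 0), (0, 0, 9.8))
batch = samples_between(0.0, 1.0 / 30.0) # samples for one 30 FPS frame interval
```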

[0053] Figure 6 is a flow diagram of a process 600 for determining a camera pose based on room matching, in accordance with some embodiments. The process 600 is implemented by a SLAM module 232 of an electronic system 200 (particularly, by the multilevel mapping module 234 in Figure 2). The SLAM module 232 obtains image data 602 and motion data 604. The image data 602 are captured by a camera 260 of an electronic device (e.g., AR glasses 104D), and the motion data 604 are measured by an IMU 280 of the electronic device. The SLAM module 232 reconstructs a 3D room model 606 based on the image data 602. In some embodiments, a subset of the image data 602 includes a first visual image 602A (e.g., a current image that is most recently captured by the camera 260). The 3D room model 606 corresponds to a current room where the subset of the image data 602 is captured. The electronic system 200 detects (608) the current room based on the 3D room model 606, and determines whether the current room is among a plurality of known rooms. If the current room is (610) among the plurality of known rooms, the current room corresponds to a matching known room. Information of the matching known room is extracted from a room database 256, and stored (612) in association with information of the current room (e.g., a room embedding).

[0054] Each known room corresponds to a respective set of former images that map the respective known room. A first set of former images are associated with the matching known room. Given that the current room matches the matching known room, the first set of former images are associated with the current room. A first former image is selected from the first set of former images associated with the matching known room to match the first visual image 602A recently captured by the camera 260. In some embodiments, a subset of the first set of former images are determined to be substantially similar (614) to the first visual image 602A. The first former image is further selected (616) from the subset of the first set of former images based on object or region matching on a coarse level and on keypoint matching on a fine level. Information of the first former image is applied to update the 3D room model 606 of the current room. Further, in some embodiments, the information of the first former image includes a first former pose, and a first camera pose associated with the first visual image is determined based on the first former pose associated with the first former image.

[0055] As the camera 260 consecutively captures images in the image data 602, the 3D room model 606 is constantly updated based on information of the latest image (e.g., feature points extracted from the latest image). Particularly, when the camera 260 initially enters a current room, the 3D room model 606 is created with limited image data 602 and cannot represent the current room completely. As the camera 260 starts to collect more and more image data 602 in the current room, the 3D room model 606 is gradually completed until the 3D room model 606 can represent the current room thoroughly. After the 3D room model 606 is completed, the image data 602 continues to update the 3D room model 606 to improve an accuracy of the 3D room model 606.

[0056] Figure 7 is a diagram having a plurality of visual images 700 captured as a camera moves among a plurality of rooms, in accordance with some embodiments. The plurality of visual images 700 are captured by a camera 260 (e.g., a wide angle fisheye camera, an RGB camera), and include a first subset of visual images 702, a second subset of visual images 704, and a third subset of former images 706 that precede the first subset of visual images 702. Each of the plurality of visual images 700 is captured at a respective room (e.g., a first known room 708A, a second known room 708B, ..., an N-th known room 708N). Each image 700 is optionally a color image or a monochromatic image. The first subset of visual images 702 are captured at a current room 710 and include a first visual image 602A. The third subset of former images 706 includes a plurality of sets of former images 706A, 706B, ..., and 706N corresponding to a plurality of known rooms 708A, 708B, ..., and 708N. The third subset of former images 706 are captured for the known rooms 708 before the first subset of visual images 702 and second subset of visual images 704 are captured. For each known room 708, a respective 3D room model is created based on a respective set of former images 706.

[0057] After the first visual image 602A is captured, the current room 710 is identified among the plurality of known rooms 708. In an example, a first set of former images 706A corresponding to the first known room 708A are identified as being associated with the current room 710, e.g., based on the respective 3D room model of the first known room 708A. A first former image 712 is selected from the first set of former images 706A to match the first visual image 602A. The first former image 712 has a first former pose, and the first visual image 602A has a first camera pose. Given that the first visual image 602A matches the first former image 712, the first camera pose is determined based on the first former pose. As such, a loop closure occurs in pose estimation between this first visual image 602A of the first subset of visual images 702 and the first former image 712A of the first set of former images 706A.

[0058] In some embodiments, a sequence of depth images 720 are synchronous with the sequence of visual images 700. Each visual image corresponds to a respective depth image and is associated with a respective camera pose. A current 3D room model 732 is built for the current room 710 from the first subset of visual images 702 and a first subset of corresponding depth images 720A. Assume that the first subset of visual images 702 starts with an initial image 714 captured when the camera 260 initially enters the current room 710. As the camera 260 consecutively captures images 702, the current 3D room model 732 is constantly updated based on information of each newly captured image (e.g., based on feature points extracted from the first visual image 602A). Particularly, when the camera 260 initially enters a current room, the current 3D room model 732 is created with limited image data 602 and cannot represent the current room 710 completely. As the camera 260 starts to collect more and more image data 702 in the current room 710, the current 3D room model 732 is gradually completed, e.g., before the first visual image 602A is captured. After the 3D room model 732 is completed, the current 3D room model 732 optionally continues to be updated with newly captured images (e.g., a second visual image 602B that follows the first visual image 602A).
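
One step of building the current 3D room model 732 from synchronized visual and depth images, as described in paragraph [0058], can be sketched as back-projecting a depth image through pinhole intrinsics and transforming the points by the camera pose into the world frame. The intrinsics and the identity pose used below are illustrative assumptions.

```python
# Minimal sketch of one model-building step in [0058]: a depth image is
# back-projected through assumed pinhole intrinsics and transformed by the
# camera pose into world coordinates, where the points join the room model.
import numpy as np

def backproject(depth, fx, fy, cx, cy, R_wc, t_wc):
    """depth: (H, W) metres; returns (H*W, 3) points in the world frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    x = (u.ravel() - cx) / fx * z
    y = (v.ravel() - cy) / fy * z
    pts_cam = np.stack([x, y, z], axis=1)
    return pts_cam @ R_wc.T + t_wc              # camera frame -> world frame

room_points = []                                # accumulated point cloud of the room model
depth = np.full((480, 640), 2.0)                # fake 2 m depth image for illustration
pts = backproject(depth, fx=525, fy=525, cx=320, cy=240,
                  R_wc=np.eye(3), t_wc=np.zeros(3))
room_points.append(pts)
```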

[0059] In some embodiments, the first subset of visual images 702 includes a second visual image 602B that follows the first visual image 602A. A second camera pose associated with the second visual image 602B is determined based on a second former pose associated with a second former image 712B in the first set of former images 706A corresponding to the current room that is identified in the plurality of known rooms 708 (e.g., the first known room 708A). The second visual image 602B immediately follows the first visual image 602A or is separated from the first visual image by an integer number of visual images in the first subset. The loop closure occurs in pose estimation between this second visual image 602B of the first subset of visual images 702 and the second former image 712B of the first set of former images 706A. Alternatively, in some embodiments, the current 3D room model 732 is established and allows the current room 710 to be associated with the first known room 708A. But the first or second camera pose of the first or second visual image 602A or 602B is determined based on the current 3D room model 732 and corresponding first subset of visual images 702, rather than on the first set of former images 706A. By these means, the loop closure occurs locally in pose estimation between this second visual image 602B and another earlier visual image of the first subset of visual images 702.

[0060] In some embodiments, the first subset of visual images 702 includes a third visual image 602C that precedes the first visual image 602A. The current 3D room model 732 is not complete for the first subset of visual images 702 at a time of obtaining the third visual image 602C. The third visual image 602C is not associated with any known room 708 or corresponding former images 706 of any known room 708 in accordance with a determination that the current 3D room model 732 is not complete for the first subset of visual images 702. A third camera pose associated with the third visual image 602C is determined based on a camera pose associated with an earlier visual image in the first subset of visual images 702. The third visual image 602C immediately precedes the first visual image 602A or is separated from the first visual image 602A by an integer number of visual images in the first subset 702.

[0061] Stated another way, in some embodiments, the first subset of visual images 702 include an alternative visual image 716 (e.g., the second visual image 602B, the third visual image 602C). An alternative camera pose associated with the alternative visual image 716 is determined based on a camera pose associated with an earlier visual image in the first subset of visual images 702. The alternative visual image 716 immediately follows or precedes the first visual image, or is separated from the first visual image 602A by an integer number of visual images in the first subset of visual images 702.

[0062] After the current 3D room model 732 is completed, a first set of embeddings 734 are extracted from the current 3D room model 732 of the current room 710. The first set of embeddings 734 includes semantic and instance information describing the current room 710. The first set of embeddings 734 includes a plurality of semantic elements configured to provide semantic information of a plurality of objects or regions located in the current 3D room model 732. Also, in some embodiments, each of the plurality of known rooms 708 corresponds to a respective set of embeddings 724, and the first set of embeddings 734 is compared with the respective set of embeddings 724 of each known room 708. In accordance with a comparison result, the first set of embeddings 734 satisfies a room identification criterion 740, thereby identifying the current room 710 among the plurality of known rooms 708, e.g., as the first known room 708A.

[0063] In an example, for each known room 708 (e.g., 708A, 708B, ..., 708N), a similarity level is determined between the first set of embeddings 734 and the respective set of embeddings 724 of the respective known room. The first set of embeddings 734 of the current room 710 have a first similarity level 726A with the respective set of embeddings of one of the plurality of known rooms 708 (e.g., 708A). The first set of embeddings 734 have a first plurality of similarity levels 726 with the respective sets of embeddings 724 corresponding to the plurality of known rooms 708. If the first similarity level 726A is greater than a similarity threshold and is the greatest among the first plurality of similarity levels 726, the current room 710 is identified as the one of the plurality of known rooms 708 (e.g., 708A).
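
The room identification criterion of paragraph [0063] can be sketched as follows, assuming each room is summarized by a single embedding vector compared with cosine similarity. The threshold value and the example embeddings are assumptions made for illustration, not values taken from this application.

```python
# Minimal sketch of the room identification criterion in [0063]: compare the
# current room's embedding against each known room's embedding, and accept the
# best match only if its similarity also exceeds a threshold.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def identify_room(current_embedding, known_rooms, threshold=0.8):
    """known_rooms: dict roomID -> embedding vector. Returns roomID or None."""
    scores = {room_id: cosine(current_embedding, emb)
              for room_id, emb in known_rooms.items()}
    best_id = max(scores, key=scores.get)
    # Identified only if the greatest similarity also clears the threshold.
    return best_id if scores[best_id] > threshold else None

known = {"room_708A": np.array([0.9, 0.1, 0.0]),
         "room_708B": np.array([0.1, 0.9, 0.0])}
print(identify_room(np.array([0.85, 0.2, 0.05]), known))   # -> "room_708A"
```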

[0064] In some embodiments, the sequence of depth images 720 include a second subset of depth images 720B corresponding to the second subset of visual images 704. The second subset of visual images 704 are captured in a next room 718. A next 3D room model 736 of the next room 718 is created from the second subset of visual images 704 and a second subset of corresponding depth images 720B. A second set of embeddings 738 are extracted from the next 3D room model 736, and include semantic and instance information describing the next room 718. For each known room 708, the second set of embeddings have a similarity level 726’ with the respective set of embeddings of the respective known room. The second set of embeddings 738 have a second similarity level 726A’ with the respective set of embeddings 724 corresponding to one of the plurality of known rooms 708. The second set of embeddings 738 have a second plurality of similarity levels 726’ with the respective sets of embeddings 724 corresponding to the plurality of known rooms 708. If the second similarity level 726A’ is the greatest among the second plurality of similarity levels 726’ but is less than a similarity threshold, the next room is not among the plurality of known rooms, i.e., distinct from any of the plurality of known rooms. Additionally, in some embodiments, the next room 718 is added to the plurality of known rooms 708 as a new room. The plurality of known rooms 708 are expanded to include the next room 718, and the next room 718 is associated with the second set of embeddings 738.

[0065] Each known room 708 is associated with a former 3D room model constructed based on a set of former images 706, and the current and next rooms are associated with the current or next 3D room models constructed based on the first set of visual images 702 and the second set of visual images 704, respectively. Each former, current, or next room is associated with a respective set of embeddings including semantic and instance information of the respective 3D room model. In some embodiments, for each former, current, or next room, information of the respective 3D room model is saved to a room database 256, and corresponding images are labeled with a respective room identifier (e.g., roomID). The information of each known room is extracted from the room database 256 based on the respective room identifier. In some situations, the instance information includes an embedding vector to which a corresponding set of embeddings of a 3D room model are organized. If two embedding vectors are similar, the corresponding rooms are regarded as the same/similar room, i.e., matched to each other. Room models are registered in the room database 256 and reviewed to determine whether corresponding rooms are the same room based on the embedding vectors. Additionally, in some embodiments, for two registered room models, two corresponding rooms are successfully matched to each other if feature points of the two rooms are close and semantic labels are identical.
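
The room database bookkeeping of paragraphs [0064] and [0065] can be sketched with an in-memory dictionary standing in for the room database 256: when the best similarity with every registered room stays below the threshold, the room is registered under a new room identifier together with its embedding and images; otherwise the matching entry is reused. The naming scheme and threshold below are illustrative assumptions.

```python
# Minimal sketch of the room database bookkeeping in [0064]-[0065]. The
# dictionary stands in for the room database 256; roomID format and threshold
# are illustrative assumptions.
import numpy as np

room_db = {}                       # roomID -> {"embedding": vector, "images": [...]}

def register_or_match(embedding, images, threshold=0.8):
    best_id, best_sim = None, -1.0
    for room_id, entry in room_db.items():
        e = entry["embedding"]
        sim = float(embedding @ e / (np.linalg.norm(embedding) * np.linalg.norm(e) + 1e-12))
        if sim > best_sim:
            best_id, best_sim = room_id, sim
    if best_id is None or best_sim < threshold:
        room_id = f"room_{len(room_db) + 1:04d}"          # expand the known rooms
        room_db[room_id] = {"embedding": embedding, "images": list(images)}
        return room_id
    room_db[best_id]["images"].extend(images)             # label images with the matched roomID
    return best_id

rid_a = register_or_match(np.array([0.9, 0.1, 0.0]), ["img_001.png"])
rid_b = register_or_match(np.array([0.88, 0.12, 0.0]), ["img_050.png"])   # reuses rid_a
```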

[0066] Figure 8A is a block diagram of an example room 800 having projected wall points on a 2D floor plan 802, in accordance with some embodiments, and Figure 8B is a block diagram of an example room 800 digitized to a 2D room map 850, in accordance with some embodiments. After a 3D room model (e.g., a current room model 732 in Figure 7) is reconstructed, a set of feature points 804 associated with each wall 806 are identified based on 3D instance and semantic information of the respective wall 806. Each wall 806 and the corresponding set of feature points 804 are projected to the 2D floor plan 802 having a normal direction that is aligned with the gravity direction. The projected 2D floor plan 802 is vectorized as the 2D digital room map 850.
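
The projection of paragraph [0066] can be sketched as follows: the wall feature points 804 are projected onto a 2D floor plan whose normal is the gravity direction derived from the IMU data. The particular construction of the two in-plane axes is an illustrative assumption.

```python
# Minimal sketch of the projection in [0066]: wall feature points from the 3D
# room model are projected onto a 2D floor plan whose normal is the gravity
# direction. The choice of in-plane basis is an illustrative assumption.
import numpy as np

def project_to_floor_plan(points_3d, gravity):
    """points_3d: (N, 3) wall points; gravity: 3-vector. Returns (N, 2) plan coordinates."""
    n = gravity / np.linalg.norm(gravity)              # plan normal aligned with gravity
    # Build two in-plane axes orthogonal to the gravity direction.
    helper = np.array([1.0, 0.0, 0.0])
    if abs(n @ helper) > 0.9:
        helper = np.array([0.0, 1.0, 0.0])
    u = np.cross(n, helper); u /= np.linalg.norm(u)
    v = np.cross(n, u)
    return np.stack([points_3d @ u, points_3d @ v], axis=1)

wall_points = np.array([[1.0, 0.2, 2.0], [1.0, 1.8, 2.0], [3.0, 0.2, 2.0]])
plan_xy = project_to_floor_plan(wall_points, gravity=np.array([0.0, 0.0, -9.8]))
```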

[0067] In some embodiments, a room segmentation technique is applied to extract this room 800 (e.g., a current room 710). When the camera 260 initially enters the current room 710, the current 3D room model 732 is created with limited image data 602 and cannot represent the current room 710 completely. As the camera 260 starts to collect more and more image data 602 in the current room 710, the current 3D room model 732 is gradually completed until the current 3D room model 732 can represent the current room 710 thoroughly. Before the 3D room model 732 is completed, a room detection thread adds newly reconstructed wall points 804 to the 2D floor plan 802 and re-detects the corresponding room based on the respective set of embeddings 734 until a complete current 3D room model 732 is extracted.

[0068] Figure 9 is a flow diagram of an example process 900 for identifying a subset of former images 714 that are substantially similar to a first visual image 602A based on region matching, in accordance with some embodiments. The first visual image 602A is captured at a current room 710, and the current room 710 is matched with one of a plurality of known rooms, e.g., a first known room 708A. In some embodiments, each of the current and known rooms 710 and 708 is represented with a respective set of embeddings, and the current room 710 is identified from the plurality of known rooms 708 (e.g., as the first known room 708A) based on the respective sets of embeddings 734 and 724. After room matching, the first visual image 602A is associated with a first subset of former images 706A captured in the first known room 708A. A subset of similar former images 714 are identified from the first subset of former images 706A based on object or region matching on a coarse level.

[0069] For region or object matching, a first image-based descriptor 902 represents a first region or object 904 in the first visual image 602A, and is extracted using a maximally stable extremal regions (MSER) recognition model or a You Only Look Once (YOLO) object detection model, respectively. For each of the first set of former images 706A, one or more respective image-based descriptors 906 represent regions or objects 908 in each former image 706A, and are extracted using the MSER recognition model or YOLO object detection model. For each former image 706A, one or more similarity levels 910 are determined between the first image-based descriptor 902 and the one or more respective image-based descriptors 906 of the respective former image 706A. The first set of former images 706A include a subset of similar former images 714 each having at least a region similar to the first region 904 of the first visual image 602A. Stated another way, the subset of similar images 714 are identified in the former images 706A in accordance with a determination that each similar image 714 has a respective image-based descriptor substantially close to the first image-based descriptor of the first visual image 602A. The substantially close respective image-based descriptors of the subset of similar images correspond to a subset of similarity levels that are the greatest among a plurality of similarity levels 910 of the first image-based descriptor 902 and the respective image-based descriptors 906 of the first set of former images 706A. For example, the subset of former images 714 include three former images 706A having the largest similarity levels among the first set of former images 706A.
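
The coarse region/object matching of paragraph [0069] can be sketched as follows, assuming each image has already been summarized by region or object descriptor vectors (e.g., produced by an MSER- or YOLO-based front end, which is not implemented here): the former images whose best region similarity to the first image-based descriptor is greatest are kept as the subset of similar former images.

```python
# Minimal sketch of the coarse matching in [0069]: rank former images by the best
# similarity between the query's region descriptor and any of their region
# descriptors, and keep the top k as the subset of similar former images.
import numpy as np

def best_region_similarity(query_desc, image_descs):
    """Highest cosine similarity between the query region and any region of one image."""
    sims = [float(query_desc @ d / (np.linalg.norm(query_desc) * np.linalg.norm(d) + 1e-12))
            for d in image_descs]
    return max(sims) if sims else -1.0

def select_similar_former_images(query_desc, former_images, k=3):
    """former_images: dict image_id -> list of region descriptors. Returns top-k image ids."""
    ranked = sorted(former_images,
                    key=lambda img: best_region_similarity(query_desc, former_images[img]),
                    reverse=True)
    return ranked[:k]

former = {"f1": [np.array([1.0, 0.0])], "f2": [np.array([0.0, 1.0])],
          "f3": [np.array([0.7, 0.7])], "f4": [np.array([-1.0, 0.0])]}
print(select_similar_former_images(np.array([0.9, 0.1]), former, k=3))  # ['f1', 'f3', 'f2']
```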

[0070] Figure 10 is a flow diagram of an example process 1000 for matching a region 904 in a first visual image 602A with a corresponding region in one of the plurality of known rooms 708 using keypoints, in accordance with some embodiments. After room matching and coarse region matching, a subset of similar images 714 are identified from a first subset of former images 706A captured in the first known room 708A. A first former image 714A matches the first visual image 602A, and is further selected from the subset of similar images 714 based on keypoint matching on a fine level. In some embodiments, for each of the subset of former images 714, a respective region is associated with at least one respective image-based descriptor of the respective former image 714, and is identified to match the first region 904 of the first visual image 602A. Keypoints 1002 of the first region 904 of the first visual image 602A are compared with keypoints 1004 of the respective region 908 of the respective former image in the subset of former images 714. The keypoints 1004A of the respective region 908A of the first former image 712A match the keypoints 1002 of the first visual image 602A better than the keypoints 1004B of the respective region 908B of any other former image 714B in the subset of former images 714.
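
The fine keypoint matching of paragraph [0070] can be sketched with OpenCV ORB features and a brute-force matcher standing in for the keypoint comparison: the candidate region with the most cross-checked matches to the query region is selected. The use of ORB and of a raw match count as the selection criterion are illustrative assumptions and not features recited by this application.

```python
# Minimal sketch of the fine keypoint matching in [0070], using ORB + brute-force
# matching as a stand-in: the former image whose region has the most consistent
# keypoint matches with the query region is selected as the loop-closure image.
import cv2

orb = cv2.ORB_create(nfeatures=500)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def match_score(query_gray, candidate_gray):
    """Number of cross-checked ORB matches between two grayscale region crops."""
    kp1, des1 = orb.detectAndCompute(query_gray, None)
    kp2, des2 = orb.detectAndCompute(candidate_gray, None)
    if des1 is None or des2 is None:
        return 0
    return len(matcher.match(des1, des2))

def select_first_former_image(query_region, candidate_regions):
    """candidate_regions: dict image_id -> grayscale crop. Returns the best image_id."""
    return max(candidate_regions,
               key=lambda i: match_score(query_region, candidate_regions[i]))
```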

[0071] For example, the subset of similar former images 714 include a first similar former image 714A and a second similar former image 714B. The region 904 of the first visual image 602A is matched to a region 908A in the first similar former image 714A and a region 908B in the second similar former image 714B. The keypoints 1002 of the region 904 are compared to keypoints 1004A of the region 908A and keypoints 1004B of the region 908B. The keypoints 1004B of the region 908B do not match the keypoints 1002 of the region 904, while the keypoints 1004A of the region 908A match the keypoints 1002 of the region 904. The mismatched keypoints 1004B of the region 908B are reduced, and therefore, an inlier rate of image matching is increased.

[0072] In some embodiments, the first visual image 602A includes a plurality of regions 904, and the subset of former images 714 are identified in accordance with a determination that each of the plurality of regions 904 is matched to a respective region 908 in each former image 714 based on their image-based descriptors 902 and 906. Further, in some situations, when the first visual image 602A is compared with each of the subset of former images 714, feature point matching is implemented for each and every one of the plurality of regions 904. Alternatively, in some situations, when the first visual image 602A is compared with each of the subset of former images 714, feature point matching is implemented for a subset of the plurality of regions 904.

[0073] Region matching requires more computational resources than room matching, but fewer computational resources than feature point matching. Room matching reduces the images that match the first visual image 602A to the first set of former images 706A, and region or object matching further reduces the images that match the first visual image 602A to the subset of similar former images 714. Keypoint matching only needs to be applied to the subset of similar former images 714. By these means, region matching and keypoint matching do not need to be applied to an excessive number of historic images (e.g., all former images 706, a subset of visual images in the first subset 702) captured prior to the first visual image 602A, thereby conserving computational resources and expediting image matching in SLAM.

[0074] Figure 11 is a flow diagram of an example method 1100 for determining a camera pose associated with a visual image, in accordance with some embodiments. For convenience, the method 1100 is described as being implemented by an electronic system (e.g., an electronic system 200 including an HMD 104D, a mobile device 104C, a server 102, or a combination thereof). More specifically, the method 1100 is implemented by a multilevel mapping module 234 of the electronic system 200. Method 1100 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 11 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 of the electronic system 200 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1100 may be combined and/or the order of some operations may be changed.

[0075] The electronic system 200 obtains (1102) a sequence of visual images 602 each of which is captured at a respective room. For a first visual image 602A, the electronic system 200 identifies (1104) a current room 710 where the first visual image 602A is captured among a plurality of known rooms 708. Each known room 708 corresponds (1106) to a respective set of former images 706 that map the respective known room 708, and each former image 706 is associated with (1108) a respective former pose of a camera 260 in the respective known room 708. In accordance with an identification of the current room 710 among the plurality of known rooms 708 (1110), the electronic system 200 identifies (1112) a first set of former images associated with the current room 710, selects (1114) a first former image 712A from the first set of former images to match the first visual image 602A, and determines (1116) a first camera pose associated with the first visual image 602A based on a first former pose associated with the first former image 712A. In some embodiments, each visual image is one of a color image and a monochromatic image.

[0076] In some embodiments, the electronic system 200 obtains (1118) a sequence of depth images 720 that are synchronous with the sequence of visual images 700. Each visual image corresponds to a respective depth image and is associated with a respective camera pose. The sequence of visual images have a first subset of visual images 702 that are captured in the current room 710 and include the first visual image 602A. The electronic system 200 builds (1120) a current 3D room model 732 of the current room 710 from the first subset of visual images 702 and a first subset of corresponding depth images 720A.

[0077] Further, in some embodiments, the first subset of visual images 702 includes a second visual image 602B that follows the first visual image 602A. The electronic system 200 determines a second camera pose associated with the second visual image 602B based on a second former pose associated with a second former image 712B in the first set of former images 706A of the current room 710. Optionally, the second visual image 602B immediately follows the first visual image 602A or is separated from the first visual image 602A by an integer number of visual images in the first subset of visual images. In some embodiments, the first subset of visual images 702 includes a third visual image 602C that precedes the first visual image 602A. The electronic system 200 determines that a current 3D room model 732 is not complete for the first subset of visual images 702 at a time of obtaining the third visual image 602C. The third visual image 602C is not associated with any known room 708 and corresponding former images 706 of any known room 708 in accordance with a determination that the current 3D room model 732 is not complete for the first subset of visual images 702. Further, in some embodiments, the electronic system 200 determines a third camera pose associated with the third visual image 602C based on a camera pose associated with an earlier visual image in the first subset of visual images 702. In some embodiments, the first subset of visual images 702 includes an alternative visual image 716. The electronic system 200 determines an alternative camera pose associated with the alternative visual image 716 based on a camera pose associated with an earlier visual image in the first subset of visual images 702. Optionally, the alternative visual image 716 immediately follows or precedes the first visual image 602A or is separated from the first visual image 602A by an integer number of visual images in the first subset 702.

[0078] Additionally, in some embodiments, each of the plurality of known rooms 708 corresponds to a respective set of embeddings 724. The electronic system 200 extracts (1122) a first set of embeddings 734 for a current 3D room model 732 of the current room 710. The first set of embeddings 734 include semantic and instance information describing the current room 710. The first set of embeddings 734 includes a plurality of semantic elements configured to provide semantic information of a plurality of objects or regions located in the current 3D room model 732. In some embodiments, the electronic system 200 compares (1124) the first set of embeddings 734 with a respective set of embeddings 724 of each known room 708, and in accordance with a comparison result, determines (1126) that the first set of embeddings 734 satisfies a room identification criterion 740, thereby identifying the current room 710 among the plurality of known rooms 708. For example, for each known room 708, the electronic system 200 determines a similarity level 726 of the first set of embeddings and the respective set of embeddings 724 of the respective known room 708. The electronic system 200 further determines (1) that a first similarity level 726A of the first set of embeddings 734 of the current room 710 and the respective set of embeddings 724 of one of the plurality of known rooms 708 is greater than a similarity threshold and (2) that the first similarity level 726A is the greatest among a plurality of similarity levels 726 of the first set of embeddings 734 with the respective sets of embeddings 724 corresponding to the plurality of known rooms 708, thereby identifying the current room 710 as the one of the plurality of known rooms 708.

[0079] In some embodiments, the electronic system 200 obtains a sequence of depth images 720 that are synchronous with the sequence of visual images 700. Each visual image corresponds to a respective depth image and is associated with a respective camera pose. The sequence of visual images 700 has a second subset of visual images 704 captured in a next room 718. The electronic system 200 creates a next 3D room model 736 of the next room 718 from the second subset of visual images and a second subset of corresponding depth images 720B. The electronic system 200 extracts a second set of embeddings from the next 3D room model 736, and the second set of embeddings 738 include semantic and instance information describing the next room 718. For each known room 708, the electronic system 200 determines a similarity level 726’ of the second set of embeddings 738 and the respective set of embeddings 724 of the respective known room 708, and determines (1) that a second similarity level 726A’ of the second set of embeddings 738 and the respective set of embeddings 724 corresponding to one of the plurality of known rooms 708 is less than a similarity threshold and (2) that the second similarity level 726A’ is the greatest among a plurality of similarity levels 726’ of the second set of embeddings 738 with the respective sets of embeddings 724 corresponding to the plurality of known rooms 708. The electronic system 200 determines that the next room 718 is not among the plurality of known rooms 708. Further, in some embodiments, the electronic system 200 expands the plurality of known rooms 708 to include the next room 718, and the next room 718 is associated with the second set of embeddings 738.

[0080] In some embodiments, for each of the first set of former images 706A, the electronic system 200 extracts respective image-based descriptors 906 describing objects or regions 908 existing in the respective former image 706A. The first former image 712A is selected by obtaining a first image-based descriptor 902 describing a first region 904 in the first visual image 602A, comparing the first image-based descriptor 902 with the respective image-based descriptors 906 of the first set of former images 706A, and in accordance with a comparison result, determining that at least one respective image-based descriptor 906 of each of a subset of former images 714 satisfies an image selection criterion, the subset of former images 714 including the first former image 712A. For example, for each of the first set of former images 706A, the electronic system 200 determines a respective similarity level 910 of the first image-based descriptor 902 and at least one respective image-based descriptor 906, and determines that a subset of similarity levels 910 between the first image-based descriptor 902 and the at least one respective image-based descriptor 906 of each of a subset of former images 714 are the greatest among a plurality of similarity levels of the first image-based descriptor 902 and the respective image-based descriptors 906 of the first set of former images 706A. The subset of former images 714 includes the first former image 712A.

[0081] Further, in some embodiments, the first image-based descriptor 902 is extracted from the first visual image 602A using a You Only Look Once (YOLO) object detection model. For each of the first set of former images 706A, the respective image-based descriptors 906 are extracted from the YOLO object detection model. Alternatively, in some embodiments, the first image-based descriptor 902 is extracted from the first visual image 602A using a maximally stable extremal regions (MSER) recognition model, and for each of the first set of former images 706A, the respective image-based descriptor 906 is extracted from the MSER recognition model. Additionally, in some embodiments, the first former image 712A is selected by, for each of the subset of former images 714, identifying a respective region 908 associated with the at least one respective image-based descriptor 906 of the respective former image 706A and comparing keypoints of the first region 904 of the first visual image 602A with keypoints of the respective region 908 of the respective former image in the subset of former images 714. The respective region matches the first region 904 of the first visual image 602A. The first former image 712A is selected when its keypoints 1004 of the respective region match the keypoints 1002 of the first visual image 602A better than the keypoints of the respective region of any other former image in the subset of former images 714.

[0082] In some embodiments, the electronic system 200 obtains a plurality of former images 706 including the respective set of former images for each known room 708, and creates a respective 3D room model for each known room 708. The current room 710 is identified among the plurality of known rooms 708 based on the respective 3D room model of each known room 708.

[0083] It should be understood that the particular order in which the operations in Figure 11 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to determine a camera pose as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 5-10 are also applicable in an analogous manner to method 1100 described above with respect to Figure 11. For brevity, these details are not repeated here.

[0084] The method 1100 utilizes depth and pose information to generate a 3D room model. In some embodiments, a mono camera is used to provide image data 602 that are fed into both a SLAM system and a depth generation neural network to generate camera poses and depth. An accuracy level of 6DOF poses captured by the mono camera is lower than that of poses captured by a wide-angle fisheye camera. Alternatively, in some embodiments, a camera captures the image data 602, and a separate depth sensor is used to measure depth information synchronous with the image data 602.

[0085] Various embodiments of this application are implemented to derive accurate camera poses from a SLAM system and build accurate models in an indoor environment with similar scenes. Moreover, room matching reduces a searching time for identifying loop closure images from collected data. Also, data can be collected from different devices and fused, such that reconstructed 3D room models provide both geometry information and semantics/instance information, which can improve a success rate of loop closure detection. The reconstructed semantic 3D room model can be applied in user applications for virtual object displaying and semantic building model reconstruction. Additionally, a hierarchy matching strategy is applied to detect a loop closure frame (e.g., the first former image 712A) from a coarse level to a fine level: room-level detection/matching, similar image detection within a room, region/object-level matching, and finally keypoint-level matching. By these means, a success rate of loop closure detection and an inlier rate of matching are enhanced.

[0086] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[0087] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

[0088] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.

[0089] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.