

Title:
METHODS AND SYSTEMS FOR RETRIEVING IMAGES BASED ON SEMANTIC PLANE FEATURES
Document Type and Number:
WIPO Patent Application WO/2023/091131
Kind Code:
A1
Abstract:
This application is directed to identifying a candidate image for determining a camera pose associated with a query image. An electronic device obtains a query image of a scene and generates a query feature map, a query plane instance feature, and a query plane semantic feature from the query image. The query plane instance feature identifies one or more bounding boxes defining one or more planes in the query image, and the query plane semantic feature includes semantic information of the one or more planes. The electronic device aggregates the query feature map, query plane instance feature, and query plane semantic feature to a plane-assisted global descriptor. For each image in a database, a respective global image descriptor has a respective distance from the plane-assisted global descriptor. The electronic device selects the candidate image from the plurality of images based on the respective distance corresponding to each image.

Inventors:
JI PAN (US)
LIU JIACHEN (US)
XU YI (US)
Application Number:
PCT/US2021/059742
Publication Date:
May 25, 2023
Filing Date:
November 17, 2021
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06T7/162; G06T7/174; G06T13/20
Domestic Patent References:
WO2021173850A1 (2021-09-02)
WO2018224442A1 (2018-12-13)
Foreign References:
US10248663B1 (2019-04-02)
Attorney, Agent or Firm:
WANG, Jianbai et al. (US)
Claims:
What is claimed is:

1. A method, implemented at an electronic system, comprising: obtaining a query image in a scene; generating a query feature map, a query plane instance feature, and a query plane semantic feature from the query image, wherein the query plane instance feature identifies one or more bounding boxes defining one or more planes in the query image, and the query plane semantic feature includes semantic information of the one or more planes; aggregating the query feature map, query plane instance feature, and query plane semantic feature to a plane-assisted global descriptor; for each of a plurality of images in a database, obtaining a respective global image descriptor, and determining a respective distance of the plane-assisted global descriptor and the respective global image descriptor; and selecting a candidate image from the plurality of images based on the respective distance corresponding to each of the plurality of images.

2. The method of claim 1, wherein the one or more bounding boxes are applied to perform an operation of extracting the query feature map, query plane instance feature, and query plane semantic feature having a predefined spatial size, and aggregating the query feature map, query plane instance feature, and query plane semantic feature to the plane-assisted global descriptor further comprises: concatenating the query feature map, query plane instance feature, and query plane semantic feature to form a plane-assisted query feature map; and converting the plane-assisted query feature map to the plane-assisted global descriptor.

3. The method of claim 1, aggregating the query feature map, query plane instance feature, and query plane semantic feature to the plane-assisted global descriptor further comprising: aggregating the query feature map, query plane instance feature, and query plane semantic feature to a plane-assisted query feature map using a feature aggregation network; and converting the plane-assisted query feature map to the plane-assisted global descriptor using a vector aggregation network.

4. The method of claim 3, wherein: the query feature map is generated using a feature backbone network; and the query plane instance feature and query plane semantic feature are generated from a plane detection model and a semantic segmentation model, respectively.

5. The method of claim 4, wherein a subset of the feature backbone network, plane detection model, semantic segmentation model, feature aggregation network, and vector aggregation network is trained based on a triplet loss L using an anchor training image a_i, a positive sample p_i of the anchor training image, and a negative sample n_i of the anchor training image, and the triplet loss L is represented as:

L(a_i, p_i, n_i) = max(d(a_i, p_i) - d(a_i, n_i) + T, 0), where d(a_i, p_i) = ||a_i - p_i||, d(a_i, n_i) = ||a_i - n_i||, and T is a preset margin.

6. The method of claim 1, further comprising: determining that the candidate image is among a predefined number of images having smaller respective distances than remaining images of the plurality of images in the database, wherein the query image has a higher similarity level to the predefined number of images than to the remaining images in the database.

7. The method of any of claims 1-6, for a first image in the database, obtaining the respective global image descriptor further comprising: generating a first feature map, a first plane instance feature, and a first plane semantic feature from the first image, wherein the first plane instance feature identifies one or more first bounding boxes defining one or more planes in the first image, and the first plane semantic feature includes semantic information of the one or more planes in the first image; aggregating the first feature map, first plane instance feature, and first plane semantic feature to the respective global image descriptor of the first image; wherein for the first image, the respective distance is determined between the plane-assisted global descriptor of the query image and the respective global image descriptor of the first image.

8. The method of any of claims 1-6, for a second image in the database, obtaining the respective global image descriptor further comprising extracting the respective global image descriptor of the second image from memory of the electronic system.

9. The method of any of claims 1-8, wherein the plurality of images are captured in the scene where the query image is captured, the method further comprising: extracting, from the database, information of a candidate camera pose at which the candidate image is captured in the scene; and determining that the query image is captured at a query camera pose related to the candidate camera pose of the candidate image.

10. The method of claim 9, wherein determining that the query image is captured at the query camera pose related to the candidate camera pose of the candidate image further comprises: extracting a plurality of query feature points of the query image; identifying a plurality of reference feature points of the candidate image; and comparing the plurality of query feature points and the plurality of reference feature points to determine the query camera pose relative to the candidate camera pose.

11. The method of claim 9, further comprising: generating a plurality of query plane embeddings corresponding to a plurality of query planes in the query image; obtaining a plurality of candidate plane embeddings corresponding to a plurality of candidate planes of the candidate image; comparing the plurality of query plane embeddings and the plurality of candidate plane embeddings to identify a first number of matching plane pairs, each matching plane pair including a respective query plane of the query image and a respective candidate plane of the candidate image that substantially matches the respective query plane; and wherein the first number of matching plane pairs are applied to determine the query camera pose related to the candidate camera pose of the candidate image.

12. The method of claim 11, wherein comparing the plurality of query plane embeddings and the plurality of candidate plane embeddings further comprises: determining a cost matrix including a plurality of matrix elements, each matrix element indicating a distance between a respective query plane embedding and a respective candidate plane embedding; and applying a Hungarian method to the cost matrix to identify the first number of matching plane pairs.

13. The method of claim 11, further comprising: selecting a subset of matching plane pairs; determining a fitting error between the query planes and candidate planes in the subset of matching plane pairs, and determining whether the fitting error is below a threshold error; and in accordance with a determination that the fitting error is below the threshold error: determining a relative pose change between the query camera pose and the candidate camera pose; and determining the query camera pose based on the candidate camera pose of the candidate image and the relative pose change.

14. An electronic system, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-13.

15. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method of any of claims 1-13.

Description:
Methods and Systems for Retrieving Images based on Semantic Plane Features

TECHNICAL FIELD

[0001] This application relates generally to data processing technology including, but not limited to, methods, systems, and non-transitory computer-readable media for localizing a camera in an environment of an extended reality application that is executed by an electronic system.

BACKGROUND

[0002] Simultaneous localization and mapping (SLAM) is widely applied in virtual reality (VR), augmented reality (AR), autonomous driving, and navigation. In SLAM, high frequency pose estimation is enabled by sensor fusion. Asynchronous time warping (ATW) is often applied with SLAM in an AR system to warp a query image before it is sent to a display to correct for head movement that occurs after the query image is rendered. In both SLAM and ATW, relevant image data and inertial sensor data are synchronized and used for localizing a camera in a scene, e.g., estimating and predicting camera poses. The same query image is optionally used as a background image to render virtual objects according to the camera poses. In some situations, the camera pose corresponding to the query image is determined from one or more previously-recorded images of the scene which share one or more sparse features with the query image. The camera poses are not consistently accurate across the scene when camera localization depends on the sparse features. It would be beneficial to have a more accurate and efficient camera localization mechanism than the current practice.

SUMMARY

[0003] Various embodiments of this application are directed to a camera localization process that utilizes planar information and semantic information of a query image to retrieve, from a database, one or more candidate images that have camera poses similar to that of the query image. Image features of the query image and the candidate images in the database are augmented with semantic and planar features to form global descriptors, which are applied for distance metric learning and retrieval of the one or more candidate images. The one or more candidate images are selected as being substantially similar to the query image based on semantic and structural features. The selected candidate image(s) are used to enable place recognition and camera localization, and a query camera pose of the query image is determined based on a camera pose of the selected candidate image(s).

[0004] In one aspect, a camera localization method is implemented at an electronic system. The method includes obtaining a query image, a candidate image, and information of a candidate camera pose at which the candidate image is captured in a scene. The method further includes generating a plurality of query plane embeddings corresponding to a plurality of query planes in the query image, obtaining a plurality of candidate plane embeddings corresponding to a plurality of candidate planes of the candidate image, and comparing the plurality of query plane embeddings and the plurality of candidate plane embeddings to identify a first number of matching plane pairs. Each matching plane pair includes a respective query plane of the query image and a respective candidate plane of the candidate image that substantially matches the respective query plane. The method further includes based on the first number of matching plane pairs, iteratively determining that the query image is captured at a query camera pose related to the candidate camera pose of the candidate image.

[0005] In another aspect, a method is implemented at an electronic system to identify a candidate image. The method includes obtaining a query image in a scene and generating a query feature map, a query plane instance feature, and a query plane semantic feature from the query image. The query plane instance feature identifies one or more bounding boxes defining one or more planes in the query image, and the query plane semantic feature includes semantic information of the one or more planes. The method further includes aggregating the query feature map, query plane instance feature, and query plane semantic feature to a plane-assisted global descriptor. The method further includes for each of a plurality of images in a database, obtaining a respective global image descriptor, and determining a respective distance of the plane-assisted global descriptor and the respective global image descriptor. The method further includes selecting a candidate image from the plurality of images based on the respective distance corresponding to each of the plurality of images.

[0006] In another aspect, some implementations include an electronic system that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0007] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

[0008] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0010] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0011] Figure 2 is a block diagram illustrating an electronic system configured to process content data (e.g., image data), in accordance with some embodiments.

[0012] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.

[0013] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.

[0014] Figure 5A is a flowchart of a process for processing inertial sensor data and image data of an electronic system using a SLAM module, in accordance with some embodiments.

[0015] Figure 5B is a flow diagram of a camera localization process that determines a query camera pose corresponding to a query image, in accordance with some embodiments.

[0016] Figure 6 is a flow diagram of an image retrieval process that retrieves one or more candidate images from a database for a query image, in accordance with some embodiments.

[0017] Figure 7 is a flow diagram of a training process that trains an image retrieval model applied in an image retrieval process, in accordance with some embodiments.

[0018] Figure 8A is a flow diagram of a camera localization process that determines a query camera pose associated with a query image based on a candidate camera pose associated with a candidate image, in accordance with some embodiments.

[0019] Figure 8B is a flow diagram of an iterative camera pose process that determines a query camera pose based on matching plane pairs of a query image and candidate image, in accordance with some embodiments.

[0020] Figure 8C is a flow diagram of another iterative camera pose process that determines a query camera pose based on matching plane pairs of a query image and candidate image, in accordance with some embodiments.

[0021] Figure 9 is a flow diagram of a training process that trains a camera localization model applied in a camera localization process, in accordance with some embodiments.

[0022] Figure 10A is a flow diagram of an image retrieval method, in accordance with some embodiments, and Figure 10B is a flow diagram of a camera localization method, in accordance with some embodiments.

[0023] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0024] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.

[0025] In various embodiments of this application, semantic and structural features are extracted from images for place recognition and camera localization. The camera localization pipeline includes three main functional blocks: (i) image retrieval, (ii) plane retrieval, and (iii) plane Perspective-n-Points (PnP). Specifically, image retrieval techniques are applied to find one or more candidate images from a database based on a query image. An example image retrieval method augments image features with semantic and planar features to form global image descriptors. The augmented global descriptors lead to a discriminative capability for distance metric learning and improve image retrieval performance. After the candidate image(s) are retrieved, a plane retrieval network is trained to obtain information of matching planes in the query and candidate images. A plane PnP method is applied to find a camera pose of the query image based on the information of matching planes in the query and candidate images.

[0026] Augmented reality (AR) enhances images that are viewed on a screen or other display and overlays computer-generated images, sounds, or other data on a real-world environment. Mixed reality (MR) overlays and anchors virtual objects to the real world and allows a user to interact with the virtual or real objects in a field of view. Semantic segmentation is applied to assign a semantic label (e.g., floor, wall, table) to each pixel of an image. Plane detection generates plane masks on a two-dimensional (2D) image. A plane mask is an image of the same size as an input image, and in an example, each pixel in the plane mask is assigned a plane identifier (ID) if the respective pixel belongs to a corresponding plane region. All pixels with the same plane ID form a corresponding plane region. In visual place recognition, it is determined whether a query image corresponds to a portion of a scene that has been scanned and mapped to a three-dimensional (3D) virtual space. In camera localization, a 6-Degree-of-Freedom (6-DoF) camera pose is estimated for the query image with respect to the scene, and the 6-DoF camera pose optionally includes a position and an orientation of a camera in the scene when the query image is captured.

Distance metric learning (also called metric learning) automatically constructs task-specific distance metrics from weakly supervised training data, and the learned distance metrics are used to perform various tasks (e.g., k-NN classification, clustering, information retrieval).

[0027] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, a head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.

[0028] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E remotely and in real time.

[0029] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet, a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational, and other computer systems that route data and messages.

[0030] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. The content data may broadly include inertial sensor data captured by inertial sensor(s) of a client device 104. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequent to model training, the mobile phone 104C or HMD 104D obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.

[0031] In some embodiments, both model training and data processing are implemented locally at each individual client device 104 (e.g., the mobile phone 104C and HMD 104D). The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104 (e.g., the client device 104A and HMD 104D). The server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104 (e.g., the client device 104B and HMD 104D), while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106 and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.

[0032] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The AR glasses 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the AR glasses 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the AR glasses 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including a user's voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the AR glasses 104D is processed by the AR glasses 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and AR glasses 104D jointly to recognize and predict the device poses. The device poses are used to control the AR glasses 104D itself or interact with an application (e.g., a gaming application) executed by the AR glasses 104D. In some embodiments, the display of the AR glasses 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user-selectable display items (e.g., an avatar) on the user interface.

[0033] As explained above, in some embodiments, deep learning techniques are applied in the data processing environment 100 to process video data, static image data, or inertial sensor data captured by the AR glasses 104D. 2D or 3D device poses are recognized and predicted based on such video, static image, and/or inertial sensor data using a first data processing model. Visual content is optionally generated using a second data processing model. Training of the first and second data processing models is optionally implemented by the server 102 or AR glasses 104D. Inference of the device poses and visual content is implemented by each of the server 102 and AR glasses 104D independently or by both of the server 102 and AR glasses 104D jointly.

[0034] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., AR glasses 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geolocation receiver, for determining the location of the client device 104.

[0035] Optionally, the client device 104 includes an inertial measurement unit (IMU) 280 integrating sensor data captured by multi-axes inertial sensors to provide estimation of a location and an orientation of the client device 104 in space. Examples of the one or more inertial sensors of the IMU 280 include, but are not limited to, a gyroscope, an accelerometer, a magnetometer, and an inclinometer.

[0036] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:

• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;

• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);

• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;

• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices);

• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;

• Data processing module 228 for processing content data using data processing models 250, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 to process the content data in response to a user instruction received from the user application 224, and in an example, the data processing module 228 is applied to implement an image retrieval process 600 in Figure 6 or a camera localization process 800 in Figure 8A;

• Pose determination and prediction module 230 for determining and predicting a pose of the client device 104 (e.g., AR glasses 104D), where in some embodiments, the pose is determined and predicted jointly by the pose determination and prediction module 230 and data processing module 228, and the module 230 further includes an SLAM module 232 for mapping a scene where a client device 104 is located and identifying a pose of the client device 104 within the scene using image and IMU sensor data, a camera localization module 234 for determining a query camera pose associated with a query image based on one or more candidate images stored in a database, and an image retrieval module 236 for extracting one or more candidate images from a database to match a query image for camera localization;

• Pose-based rendering module 238 for rendering virtual objects on top of a field of view of the camera 260 of the client device 104 or creating mixed reality content using images captured by the camera 260; and

• One or more databases 240 for storing at least data including one or more of:
o Device settings 242 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 244 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 246 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
o Training data 248 for training one or more data processing models 250;
o Data processing model(s) 250 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 250 include an image retrieval model for implementing an image retrieval process 600 and a camera localization model for implementing a camera localization process 800;
o Pose data database 252 for storing pose data associated with candidate images (e.g., stored in a database 564 in Figure 5B); and
o Content data and results 254 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 250 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on the client device 104, and include the candidate images.

[0037] Optionally, the one or more databases 240 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 240 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 250 are stored at the server 102 and storage 106, respectively.

[0038] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.

[0039] Figure 3 is an example data processing system 300 for training and applying a neural network-based (NN-based) data processing model 250 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 250 and a data processing module 228 for processing the content data using the data processing model 250. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, both of the model training module 226 and the data processing module 228 are located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 250 to the client device 104.

[0040] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 250 is trained according to a type of the content data to be processed. The training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 250, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 250 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 250 is provided to the data processing module 228 to process the content data.
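
By way of illustration only, when the training described above uses the triplet loss recited in claim 5, the loss computation may be sketched as follows. This is a minimal NumPy example assuming L2 distances between global descriptors and a preset margin T; it is not the actual training code of the model training engine 310.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss L(a_i, p_i, n_i) = max(d(a_i, p_i) - d(a_i, n_i) + T, 0),
    with d(., .) the L2 distance between global descriptors and T a preset margin."""
    d_ap = np.linalg.norm(anchor - positive)   # distance to the positive sample
    d_an = np.linalg.norm(anchor - negative)   # distance to the negative sample
    return max(d_ap - d_an + margin, 0.0)

# Toy descriptors: the positive is close to the anchor, the negative is far.
a = np.array([1.0, 0.0, 0.0])
p = np.array([0.9, 0.1, 0.0])
n = np.array([0.0, 1.0, 0.0])
print(triplet_loss(a, p, n))   # small (here zero) loss, since d(a, p) << d(a, n)
```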

[0041] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.

[0042] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is converted to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 250 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing model 250. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
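
By way of illustration only, the pre-processing described above may be sketched as follows, assuming a center crop to an illustrative 224x224 image size and a real-valued Fourier transform for audio; the sizes and function names are assumptions, not requirements of the data pre-processing module 314.

```python
import numpy as np

def center_crop(image, size=224):
    """Crop an H x W x C image to a predefined size x size region around its center."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

def to_frequency_domain(audio):
    """Convert a 1-D audio clip to its magnitude spectrum using a real FFT."""
    return np.abs(np.fft.rfft(audio))

image = np.zeros((480, 640, 3), dtype=np.uint8)                     # toy RGB image
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)          # 1 s of a 440 Hz tone
print(center_crop(image).shape, to_frequency_domain(audio).shape)   # (224, 224, 3) (8001,)
```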

[0043] Figure 4A is an example neural network (NN) 400 applied to process content data in an NN-based data processing model 250, in accordance with some embodiments, and Figure 4B is an example node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 250 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 250 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the one or more node inputs. As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the one or more node inputs are combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. In an example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the one or more node inputs.

[0044] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the one or more layers includes a single layer acting as both an input layer and an output layer. Optionally, the one or more layers includes an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the one or more hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down-sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
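
By way of illustration only, one common reading of the propagation function of a node 420, i.e., a non-linear activation applied to a linear weighted combination of the node inputs plus a bias, together with max pooling between layers, is sketched below; the weights and inputs are arbitrary toy values.

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Propagation function of a single node: activation of the weighted input sum."""
    z = np.dot(weights, inputs) + bias   # linear weighted combination plus bias b
    return max(z, 0.0)                   # ReLU-style non-linear activation

def max_pool(values):
    """Max pooling: keep the maximum of the nodes feeding the next layer."""
    return max(values)

x = np.array([0.2, -0.5, 0.8, 0.1])      # node inputs from the previous layer
w = np.array([0.4, 0.3, -0.2, 0.9])      # weights w1..w4 on the incoming links
print(node_output(x, w, bias=0.05))
print(max_pool([node_output(x, w), node_output(x, -w)]))
```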

[0045] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 250 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The one or more hidden layers of the CNN are convolutional layers convolving with a multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.

[0046] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 250 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. In an example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM network), an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.
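
By way of illustration only, and assuming PyTorch as an implementation framework (an assumption not made by this application), a toy convolutional stack shows how image data is abstracted layer by layer into a progressively smaller, deeper feature map:

```python
import torch
import torch.nn as nn

# A toy convolutional backbone: each stage convolves over a local receptive area
# and downsamples, abstracting the image into a feature map.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

image = torch.rand(1, 3, 224, 224)   # a pre-processed RGB image
feature_map = backbone(image)
print(feature_map.shape)             # torch.Size([1, 64, 28, 28])
```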

[0047] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.

[0048] Figure 5A is a flowchart of a process 500 for processing inertial sensor data and image data of an electronic system (e.g., a server 102, a client device 104, or a combination of both) using a SLAM module 232, in accordance with some embodiments. The process 500 includes measurement preprocessing 502, initialization 504, local visual-inertial odometry (VIO) with relocation 506, and global pose graph optimization 508. In measurement preprocessing 502, an RGB camera 260 captures image data of a scene at an image frame rate (e.g., 30 FPS), and features are detected and tracked (510) from the image data. An IMU 280 measures inertial sensor data at a sampling frequency (e.g., 1000 Hz) concurrently with the RGB camera 260 capturing the image data, and the inertial sensor data are pre-integrated (512) to provide data of a variation of device poses 540. In initialization 504, the image data captured by the RGB camera 260 and the inertial sensor data measured by the IMU 280 are temporally aligned (514). Vision-only structure from motion (SfM) techniques are applied (516) to couple the image data and inertial sensor data, estimate three-dimensional structures, and map the scene of the RGB camera 260.
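
By way of illustration only, a deliberately simplified sketch of the pre-integration (512) of inertial sensor data between two image frames is shown below. It omits sensor bias, noise, gravity compensation, and manifold-correct rotation updates, all of which a practical implementation would handle; the sampling rates mirror the example figures above and the function names are assumptions.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def preintegrate(gyro, accel, dt):
    """Simplified IMU pre-integration between two image frames.

    gyro, accel: (N, 3) angular-velocity (rad/s) and acceleration (m/s^2) samples
    between the frames; dt: sample period (s). Returns rotation, velocity, and
    position deltas accumulated over the interval (bias/gravity terms omitted)."""
    R = np.eye(3)       # accumulated rotation delta
    v = np.zeros(3)     # accumulated velocity delta
    p = np.zeros(3)     # accumulated position delta
    for w, a in zip(gyro, accel):
        a_frame = R @ a                       # acceleration in the first sample's frame
        p += v * dt + 0.5 * a_frame * dt**2
        v += a_frame * dt
        R = R @ (np.eye(3) + skew(w) * dt)    # first-order rotation update
    return R, v, p

# Example: 33 IMU samples at 1000 Hz between two frames captured at about 30 FPS.
rng = np.random.default_rng(0)
dR, dv, dp = preintegrate(rng.normal(0, 0.01, (33, 3)),
                          rng.normal(0, 0.1, (33, 3)), dt=1e-3)
print(dp)
```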

[0049] After initialization 504 and during relocation 506, a sliding window 518 and associated states from a loop closure 520 are used to optimize (522) a VIO. When the VIO corresponds (524) to a keyframe of a smooth video transition and a corresponding loop is detected (526), features are retrieved (528) and used to generate the associated states from the loop closure 520. In global pose graph optimization 508, a multi-degree-of-freedom (multi-DOF) pose graph is optimized (530) based on the states from the loop closure 520, and a keyframe database 532 is updated with the keyframe associated with the VIO.

[0050] Additionally, the features that are detected and tracked (510) are used to monitor (534) motion of an object in the image data and estimate image-based poses 536, e.g., according to the image frame rate. In some embodiments, the inertial sensor data that are pre-integrated (512) may be propagated (538) based on the motion of the object and used to estimate inertial-based poses 540, e.g., according to a sampling frequency of the IMU 280. The image-based poses 536 and the inertial-based poses 540 are stored in the pose data database 252 and used by the module 230 to estimate and predict poses that are used by the pose-based rendering module 238. Alternatively, in some embodiments, the SLAM module 232 receives the inertial sensor data measured by the IMU 280 and obtains image-based poses 536 to estimate and predict more poses 540 that are further used by the pose-based rendering module 238.

[0051] In SLAM, high frequency pose estimation is enabled by sensor fusion, which relies on data synchronization between imaging sensors and the IMU 280. The imaging sensors (e.g., the RGB camera 260, a LiDAR scanner) provide image data desirable for pose estimation, and oftentimes operate at a lower frequency (e.g., 30 frames per second) and with a larger latency (e.g., 30 milliseconds) than the IMU 280. Conversely, the IMU 280 can measure inertial sensor data and operate at a very high frequency (e.g., 1000 samples per second) and with a negligible latency (e.g., < 0.1 millisecond). Asynchronous time warping (ATW) is often applied in an AR system to warp an image before it is sent to a display to correct for head movement and pose variation that occurs after the image is rendered. ATW algorithms reduce a latency of the image, increase or maintain a frame rate, or reduce judders caused by missing image frames. In both SLAM and ATW, relevant image data and inertial sensor data are stored locally, such that they can be synchronized and used for pose estimation/prediction. In some embodiments, the image and inertial sensor data are stored in one of multiple STL containers, e.g., std::vector, std::queue, std::list, etc., or other self-defined containers. These containers are generally convenient for use. The image and inertial sensor data are stored in the STL containers with their timestamps, and the timestamps are used for data search, data insertion, and data organization.
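
The containers mentioned above are C++ STL containers; by way of illustration only, the following Python sketch (an analog of the described container usage, not the described implementation) shows how timestamps can support insertion and nearest-sample search when pairing each image with its closest inertial samples:

```python
import bisect

class TimestampedBuffer:
    """Container keyed by timestamps; supports sorted insertion and nearest-sample search."""
    def __init__(self):
        self.timestamps = []   # kept sorted, analogous to an ordered container
        self.samples = []

    def insert(self, t, sample):
        i = bisect.bisect_left(self.timestamps, t)
        self.timestamps.insert(i, t)
        self.samples.insert(i, sample)

    def nearest(self, t):
        """Return the stored sample whose timestamp is closest to t."""
        i = bisect.bisect_left(self.timestamps, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(self.samples)]
        best = min(candidates, key=lambda j: abs(self.timestamps[j] - t))
        return self.samples[best]

imu = TimestampedBuffer()
for k in range(10):
    imu.insert(k * 0.001, {"gyro": (0.0, 0.0, 0.01 * k)})   # 1000 Hz IMU samples
print(imu.nearest(0.0333))   # IMU sample closest to a 30 FPS image timestamp
```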

[0052] Figure 5B is a flow diagram of a camera localization process 550 that determines a query camera pose 552 corresponding to a query image 554, in accordance with some embodiments. The camera localization process 550 includes an image retrieval stage 556, a plane retrieval stage 558, and a plane perspective-n-point (PnP) stage 560. In the image retrieval stage 556, an electronic system 200 retrieves one or more candidate images 562 from a database 564, and the one or more candidate images 562 substantially match the query image 554. In the plane retrieval stage 558, query planes and candidate planes are retrieved from the query image 554 and each candidate image 562, respectively. The query planes of the query image 554 and the candidate planes of each candidate image 562 are matched to one another if a first number of matching plane pairs are identified in the query image 554 and the respective candidate image 562. In some embodiments, an image in the database 564 is qualified and identified as a candidate image 562 in accordance with a determination that the first number is greater than a predefined matching pair number threshold (e.g., 3), i.e., in accordance with a determination that there are more than the predefined threshold of matching plane pairs in the query image 554 and each candidate image 562. Additionally, in some embodiments, plane embeddings are determined for the query image 554 or candidate image 562 and applied to identify the corresponding matching plane pairs thereof.
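
By way of illustration only, the plane-embedding matching may be sketched as a cost matrix of pairwise embedding distances solved with the Hungarian method, consistent with claim 12; the L2 metric, the distance threshold, and the SciPy-based assignment below are illustrative assumptions rather than the described plane retrieval network.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_planes(query_emb, cand_emb, max_cost=0.5):
    """Match query plane embeddings (Q x D) to candidate plane embeddings (C x D).

    Returns (query_index, candidate_index) pairs whose embedding distance is small;
    max_cost is an illustrative threshold for rejecting poor assignments."""
    # Cost matrix: each element is the distance between one query plane embedding
    # and one candidate plane embedding.
    cost = np.linalg.norm(query_emb[:, None, :] - cand_emb[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # Hungarian assignment
    return [(q, c) for q, c in zip(rows, cols) if cost[q, c] < max_cost]

# Toy data: the candidate planes are a shuffled, slightly perturbed copy of the query planes.
rng = np.random.default_rng(0)
query_emb = rng.random((4, 16))
cand_emb = query_emb[[2, 0, 3, 1]] + 0.01 * rng.random((4, 16))
print(match_planes(query_emb, cand_emb))
```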

[0053] In the plane PnP stage 560, based on the first number of matching plane pairs 566, the electronic system 200 iteratively determines that the query image 554 is captured at a query camera pose 552 related to a known candidate camera pose of the candidate image 562. For example, for each iteration (e.g., iteration 812 in Figure 8B), a second number of matching plane pairs (e.g., plane pairs 806' in Figure 8B) are selected in the first number of matching plane pairs (e.g., plane pairs 806 in Figure 8A). For each of the selected second number of matching plane pairs, it is determined whether the respective query plane and the respective candidate plane are substantially parallel (e.g., do not form any angle or form an angle less than a threshold angle (e.g., 5 or 10 degrees)). In some situations, in accordance with a determination that in each of the selected second number of matching plane pairs the respective query plane and the respective candidate plane are substantially parallel, the query camera pose 552 is determined to be aligned with and equal to the candidate camera pose of the candidate image 562. Alternatively, in some situations, for each iteration, an iteration fitting error (e.g., error 816 in Figure 8B) is determined between the query planes and candidate planes in the second number of matching plane pairs. A smallest fitting error (e.g., error 818 in Figure 8B) is identified among the iteration fitting errors of the plurality of iterations, and used to determine a relative pose change (e.g., pose change 820-1 in Figure 8B) between the query camera pose and the candidate camera pose in the respective iteration associated with the smallest fitting error. The query camera pose 552 is determined based on the candidate camera pose of the candidate image and the relative pose change.

[0054] Additionally and alternatively, in some embodiments, for each iteration (e.g., iteration 812 in Figure 8C), an iteration fitting error (e.g., error 816 in Figure 8C) is determined between the query planes and candidate planes in the second number of matching plane pairs, and compared with a threshold error (e.g., threshold error 838 in Figure 8C). For a last iteration, in accordance with a determination that the iteration fitting error is below the threshold error, the plurality of iterations are terminated, and a relative pose change (e.g., pose change 820-2 in Figure 8C) is determined between the query camera pose and the candidate camera pose and used to determine the query camera pose. More details on the plane PnP stage 560 are explained below with respect to Figures 8A-8C.
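Paragraphs [0053] and [0054] describe an iterative, RANSAC-style selection of matching plane pairs with a per-iteration fitting error and optional early termination. The sketch below illustrates that control flow under stated assumptions: planes are passed as dictionaries with unit "normal" vectors, and solve_pose and fitting_error are hypothetical callables standing in for the closed-form plane-PnP solve and the fitting-error definition (a sketch of the closed-form solve follows equations (3)-(5) below); the iteration count and error threshold are illustrative.

```python
import random
import numpy as np

def normals_parallel(n_q, n_r, max_angle_deg=5.0):
    """Check whether two unit plane normals deviate by less than a threshold angle."""
    cos_angle = np.clip(np.dot(n_q, n_r), -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)) < max_angle_deg

def iterative_plane_pnp(matching_pairs, candidate_pose, solve_pose, fitting_error,
                        num_iterations=100, sample_size=3, error_threshold=0.5):
    """Illustrative RANSAC-style loop over matching plane pairs.

    matching_pairs: list of (query_plane, candidate_plane) dictionaries with 'normal' keys.
    candidate_pose: known 4x4 pose T_r of the retrieved candidate image.
    solve_pose:     hypothetical callable returning a 4x4 relative pose change T_qr.
    fitting_error:  hypothetical callable scoring a relative pose change on the sample.
    """
    best_error, best_rel_pose = np.inf, None
    for _ in range(num_iterations):
        sample = random.sample(matching_pairs, sample_size)
        # If every sampled query/candidate plane pair is (near) parallel, the query
        # pose is taken to coincide with the candidate pose.
        if all(normals_parallel(q["normal"], r["normal"]) for q, r in sample):
            return candidate_pose
        rel_pose = solve_pose(sample)
        err = fitting_error(rel_pose, sample)
        if err < best_error:
            best_error, best_rel_pose = err, rel_pose
        if err < error_threshold:      # early termination, as in the alternative scheme
            break
    if best_rel_pose is None:
        return candidate_pose
    return np.linalg.inv(best_rel_pose) @ candidate_pose   # T_q = T_qr^{-1} * T_r
```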

[0055] Figure 6 is a flow diagram of an image retrieval process 556 that retrieves one or more candidate images 562 from a database 564 for a query image 554, in accordance with some embodiments. The query image 554 is captured in a scene by a camera (e.g., a camera 260 in Figure 2). Optionally, an electronic device (e.g., a mobile phone 104C) includes the camera and is configured to retrieve the one or more candidate images 562 from the database 564 for the query image 554 locally. Optionally, an electronic device (e.g., a mobile phone 104C) does not include the camera, and is configured to receive the query image 554 via one or more communication networks 108 and retrieve the one or more candidate images 562 from the database 564 for the query image 554. The electronic device generates a query feature map 602, a query plane instance feature 604, and a query plane semantic feature 606 from the query image 554. The query plane instance feature 604 identifies one or more bounding boxes defining one or more query planes in the query image 554, and the query plane semantic feature 606 includes semantic information of the one or more query planes of the query image 554. The query feature map 602, query plane instance feature 604, and query plane semantic feature 606 are aggregated to a plane-assisted global descriptor 608. In some embodiments, the one or more bounding boxes associated with the query plane(s) of the query image 554 are applied to perform an operation of extracting the query feature map 602, query plane instance feature 604, and query plane semantic feature 606 having a predefined spatial size, e.g., 14 x 14. The query feature map 602, query plane instance feature 604, and query plane semantic feature 606 are concatenated to form a plane-assisted query feature map 610, which is further converted to the plane-assisted global descriptor 608.
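As one illustration of the concatenation path described above, the sketch below stacks the three ROI-aligned feature tensors channel-wise and pools them into a single descriptor. The tensor shapes and channel counts are assumptions for the example, and the simple average pooling is only a stand-in for the vector aggregation network described later; it is not the patent's exact aggregation.

```python
import torch
import torch.nn.functional as F

def plane_assisted_descriptor(query_feature_map, plane_instance_feature, plane_semantic_feature):
    """Concatenate the three per-plane feature tensors and pool them into one descriptor.

    Each input is assumed to be a tensor of shape (num_planes, C_i, 14, 14),
    i.e., already ROI-aligned to the predefined 14 x 14 spatial size.
    """
    # Channel-wise concatenation forms the plane-assisted query feature map.
    plane_assisted_map = torch.cat(
        [query_feature_map, plane_instance_feature, plane_semantic_feature], dim=1)
    # Average over planes and spatial positions, then L2-normalize; a NetVLAD-style
    # aggregation (sketched later) is a richer alternative.
    pooled = plane_assisted_map.mean(dim=(0, 2, 3))
    return F.normalize(pooled, dim=0)        # plane-assisted global descriptor

# Illustrative shapes: 5 detected planes, 256/64/32 channels per branch.
fmap = torch.randn(5, 256, 14, 14)
inst = torch.randn(5, 64, 14, 14)
sem = torch.randn(5, 32, 14, 14)
descriptor = plane_assisted_descriptor(fmap, inst, sem)   # shape (352,)
```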

[0056] For each of a plurality of images in the database 564, a respective global image descriptor 612 is obtained. In some embodiments, each image is stored with the respective global image descriptor 612 in the database 564. Alternatively, in some embodiments, as the electronic device processes the query image, it extracts an image from the database 564, determines a corresponding feature map, plane instance feature, and plane semantic feature of the extracted image, and aggregates the determined map and features to the respective global image descriptor 612. After obtaining both the plane-assisted global descriptor 608 and respective global image descriptor 612, the electronic device determines a respective distance 614 of these two descriptors 608 and 612. A candidate image 562 is selected from the plurality of images based on the respective distance 614 corresponding to each of the plurality of images in the database 564.

[0057] In some embodiments, the images of the database 564 are ranked based on the respective distance 614 of the global image descriptor 612 of each image from the plane-assisted global descriptor 608, and a predefined number of (Nd) candidate images 562 having the smallest respective distances 614 are identified. The selected candidate image 562 is among the predefined number of (Nd) images. The predefined number of (Nd) images have smaller respective distances 614 than remaining images of the plurality of images in the database 564. The smaller respective distances 614 indicate that the query image 554 has a higher similarity level to the predefined number of (Nd) images than to the remaining images in the database 564. Alternatively, in some embodiments, for each image in the database 564, the respective distance 614 is compared with a predefined distance threshold DTH. In accordance with a determination that the respective distance 614 is less than the predefined distance threshold DTH, the electronic device identifies the respective image corresponding to the respective distance 614 as a candidate image 562. Further, in some embodiments, the electronic device identifies the predefined number of (Nd) candidate images (e.g., 5 candidate images) from the database 564 and aborts processing more images in the database 564. Alternatively and additionally, in some embodiments, the electronic device is not limited by the predefined number (Nd) and identifies in the database 564 all images having the respective distances 614 that satisfy the predefined distance threshold DTH.
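A minimal sketch of the two selection strategies just described (top-Nd ranking by smallest distance, or a distance threshold DTH) is shown below. The L2 distance, array shapes, and default values are assumptions for illustration, since the patent does not fix the metric or the numbers.

```python
import numpy as np

def select_candidates(query_descriptor, database_descriptors, top_k=5, distance_threshold=None):
    """Rank database images by descriptor distance and return candidate indices.

    database_descriptors: array of shape (num_images, descriptor_dim).
    If distance_threshold is given, return every image closer than the threshold;
    otherwise return the top_k images with the smallest distances.
    """
    distances = np.linalg.norm(database_descriptors - query_descriptor[None, :], axis=1)
    if distance_threshold is not None:
        return np.flatnonzero(distances < distance_threshold)
    order = np.argsort(distances)        # ascending: most similar first
    return order[:top_k]

# Usage: pick the 5 nearest database images to the query descriptor.
db = np.random.randn(1000, 352)
query = np.random.randn(352)
candidate_indices = select_candidates(query, db, top_k=5)
```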

[0058] The image retrieval process 556 is enabled by deep learning techniques, e.g., by an image retrieval model, and the image retrieval model includes at least a feature backbone network 616, a plane processing network 618, and a vector aggregation network 620. Both the feature backbone network 616 and the plane processing network 618 are configured to receive and process the query image 554. In some embodiments, the plane processing network 618 is a single network. In some embodiments, the plane processing network 618 includes a plane detection network (e.g., PlaneRCNN) and a semantic segmentation network (e.g., DeepLab). The plane detection network is configured to characterize structural features of the query image 554 with the query plane instance feature 604, and the semantic segmentation network is configured to characterize semantic features of the query image 554 with the query plane semantic feature 606. As such, referring to Figure 6, the query image 554 is passed through three parallel branches, i.e., the feature backbone network 616, plane detection network, and semantic segmentation network.

[0059] In some embodiments, the one or more bounding boxes are generated by the plane detection network of the plane processing network 618, and used to perform a region of interest align (ROIAlign) operation on each of the three parallel branches to extract a respective feature map for each region of interest (e.g., for each bounding box) in the query image 554. In some embodiments, the query feature map 602, query plane instance feature 604, and query plane semantic feature 606 are aggregated to a plane-assisted query feature map 610 using a feature aggregation network 622. The vector aggregation network 620 is configured to generate the plane-assisted global descriptor 608 associated with the query image 554. In an example, a NetVLAD network is applied to handle a large amount of content clutter (e.g., people, cars), changes in viewpoint, and different illumination conditions (e.g., night and daytime illuminations). The NetVLAD network encodes features of different planes defined by the bounding boxes into a high-dimensional vector (i.e., the plane-assisted global descriptor 608) using one or more clustering centers. In some embodiments, the vector aggregation network 620 has a single layer.
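To make the clustering-center aggregation concrete, below is a minimal NetVLAD-style layer in PyTorch: local features are soft-assigned to learned cluster centers, and their residuals are accumulated and normalized into one high-dimensional descriptor. The cluster count, channel dimension, and normalization details are assumptions for illustration and are not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLADLayer(nn.Module):
    """Minimal NetVLAD-style aggregation of a feature map into a global descriptor."""

    def __init__(self, num_clusters=16, dim=352):
        super().__init__()
        self.num_clusters = num_clusters
        self.dim = dim
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)    # soft-assignment logits
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))  # learned cluster centers

    def forward(self, x):
        # x: (batch, dim, H, W) plane-assisted feature map.
        b, d, h, w = x.shape
        soft_assign = F.softmax(self.assign(x), dim=1)               # (b, K, H, W)
        soft_assign = soft_assign.view(b, self.num_clusters, -1)     # (b, K, N)
        features = x.view(b, d, -1)                                  # (b, d, N)
        # Residuals of each local feature to each cluster center, weighted by assignment.
        residuals = features.unsqueeze(1) - self.centers.view(1, self.num_clusters, d, 1)
        vlad = (soft_assign.unsqueeze(2) * residuals).sum(dim=-1)    # (b, K, d)
        vlad = F.normalize(vlad, dim=2)                              # intra-normalization
        vlad = vlad.view(b, -1)
        return F.normalize(vlad, dim=1)                              # descriptor of size K*d

# Example: aggregate one plane-assisted feature map into a global descriptor.
layer = NetVLADLayer(num_clusters=16, dim=352)
descriptor = layer(torch.randn(1, 352, 14, 14))                      # shape (1, 16*352)
```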

[0060] In some embodiments, additional operations are performed to localize the camera that captures the query image 554, i.e., identify a query camera pose including a camera position and a camera orientation of the camera in association with the query image 554. The plurality of images of the database 564 are captured in the same scene where the query image 554 is captured. The electronic device extracts from the database 564 information of a candidate camera pose at which the candidate image 562 is captured in the scene, and determines that the query image is captured at the query camera pose related to the candidate camera pose of the candidate image 562. In some situations, the query camera pose associated with the query image 554 is identical to the candidate camera pose at which the candidate image 562 is captured. Alternatively, in some situations, the query camera pose associated with the query image 554 is moved by a relative pose change with reference to the candidate camera pose.

[0061] Specifically, in some embodiments not shown, a plurality of query feature points are extracted from the query image 554, and a plurality of reference feature points are identified from the candidate image 562. The plurality of query feature points and the plurality of reference feature points are compared to determine the query camera pose relative to the candidate camera pose. In accordance with a determination that the query feature points of the query image 554 and the reference feature points of the candidate image 562 match each other, the query camera pose associated with the query image 554 is identical to the candidate camera pose of the candidate image 562.

[0062] Alternatively, in some embodiments, a plurality of query plane embeddings (e.g., embeddings 802 in Figure 8A) are generated for the plurality of query planes in the query image 554. A plurality of candidate plane embeddings (e.g., embeddings 804 in Figure 8A) correspond to a plurality of candidate planes of the candidate image 562. The plurality of query plane embeddings and the plurality of candidate plane embeddings are compared to identify a first number of matching plane pairs (e.g., plane pairs 806 in Figure 8A). Each matching plane pair includes a respective query plane of the query image and a respective candidate plane of the candidate image that substantially matches the respective query plane, e.g., with a tolerance. The first number of matching plane pairs are applied to determine the query camera pose 552 related to the candidate camera pose of the candidate image 562. Further, in some embodiments, the electronic device determines a cost matrix (e.g., cost matrix 808 in Figure 8A) including a plurality of matrix elements from the plurality of query plane embeddings and the plurality of candidate plane embeddings. Each matrix element indicates a distance between a respective query plane embedding and a respective candidate plane embedding. A Hungarian method (e.g., method 810 in Figure 8A) is applied to the cost matrix to identify the first number of matching plane pairs. Additionally, in some embodiments, a subset of matching plane pairs (e.g., plane pairs 806' in Figure 8A) are selected. A fitting error (e.g., fitting error 818 in Figure 8B or 8C) is determined between the query planes and candidate planes in the subset of matching plane pairs. The electronic device determines whether the fitting error is below a threshold error (e.g., threshold error 838 in Figure 8C). In accordance with a determination that the fitting error is below the threshold error, the electronic device determines a relative pose change (e.g., pose changes 820 in Figures 8B and 8C) between the query camera pose 552 and the candidate camera pose, and determines the query camera pose 552 based on the candidate camera pose of the candidate image 562 and the relative pose change. More details on determining the query camera pose 552 associated with the query image 554 based on the candidate camera pose are explained below with reference to Figures 8A-8C.

[0063] Referring to Figure 6, the image retrieval process 556 leverages structural and semantic information of planes of the query image 554 and leads to accurate and robust image retrieval, particularly for query images 554 that are captured in indoor environments and contain multiple structured planes. In an AR application, a scene is mapped to a 3D virtual space, and the camera is localized in the 3D virtual space. The image retrieval process 556 enables a smooth AR experience. A pre-built map of the 3D virtual space is loaded to determine the camera pose, and once the camera pose is determined in the 3D virtual space, virtual objects are rendered in the 3D virtual space in a realistic manner.

[0064] Figure 7 is a flow diagram of a training process 700 that trains an image retrieval model applied in an image retrieval process 600, in accordance with some embodiments. As explained above, the image retrieval model includes a feature backbone network 616, a plane processing network 618, and a vector aggregation network 620. Both the feature backbone network 616 and the plane processing network 618 are configured to receive and process the query image 554. In some embodiments, the plane processing network 618 is a single network. In some embodiments, the plane processing network 618 includes two separate and parallel networks, a plane detection network (e.g., PlaneRCNN) and a semantic segmentation network (e.g., DeepLab). The plane detection network is configured to characterize structural features of the query image 554 with the query plane instance feature 604, and the semantic segmentation network is configured to characterize semantic features of the query image 554 with the query plane semantic feature 606. In some embodiments, the query feature map 602, query plane instance feature 604, and query plane semantic feature 606 are aggregated to a plane-assisted query feature map 610 without using deep learning techniques. Alternatively, in some embodiments, the query feature map 602, query plane instance feature 604, and query plane semantic feature 606 are aggregated to a plane-assisted query feature map 610 using a feature aggregation network 622. The vector aggregation network 620 is configured to generate the plane-assisted global descriptor 608 associated with the query image 554.

[0065] In some embodiments, each individual network of the image retrieval model is trained separately using a respective data set. Each of the feature backbone network 616, plane detection network, semantic segmentation network, vector aggregation network 620, and feature aggregation network 622 is trained separately. Alternatively, in some embodiments, all of the feature backbone network 616, plane detection model, semantic segmentation model, feature aggregation network 622, and vector aggregation network 620 are trained end-to-end based on a triplet loss $L$ using an anchor training image $a_i$, a positive sample $p_i$ of the anchor training image, and a negative sample $n_i$ of the anchor training image. The triplet loss $L$ is represented as $L = \max\big(d(a_i, p_i) - d(a_i, n_i) + \tau,\ 0\big)$, where $d(x_i, y_i) = \lVert x_i - y_i \rVert$ and $\tau$ is a preset margin.
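A minimal sketch of this triplet loss on batches of descriptors is given below, assuming Euclidean distance between descriptors and an illustrative margin value; the batch size and descriptor dimension are placeholders, not values from the patent.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Triplet loss L = max(||a - p|| - ||a - n|| + margin, 0), averaged over the batch.

    anchor, positive, negative: descriptor tensors of shape (batch, dim) produced by
    the retrieval model for anchor images, positive samples, and negative samples.
    """
    d_pos = F.pairwise_distance(anchor, positive)   # ||a_i - p_i||
    d_neg = F.pairwise_distance(anchor, negative)   # ||a_i - n_i||
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Usage during end-to-end training of the retrieval networks.
a, p, n = (torch.randn(8, 352, requires_grad=True) for _ in range(3))
loss = triplet_loss(a, p, n)
loss.backward()
```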

[0066] Additionally and alternatively, in some embodiments, the image retrieval model includes a first subset of networks and a second subset of networks. Each of the first subset of networks is trained separately using a respective data set, and the second subset of networks are trained jointly in an end-to-end manner. For example, each of PlaneRCNN and DeepLab is pretrained individually and integrated in the image retrieval model. The feature backbone network 616, vector aggregation network 620, and feature aggregation network 622 are trained jointly in an end-to-end manner, e.g., using the triplet loss L.

[0067] In some embodiments, the training process 700 is implemented at a server 102 to train the image retrieval model, and the image retrieval model is provided to a client device 104 that implements the image retrieval process 600. Alternatively, in some embodiments, the training process 700 is implemented at a client device 104, which thereby both trains and applies the image retrieval model. In some embodiments, the training process 700 is implemented at a server 102 to train the image retrieval model. A client device 104 sends the query image 554 to the server 102, and the image retrieval model is implemented by the server 102 to process the query image 554. The candidate image 562 is returned by the server 102 to the client device 104.

[0068] Figure 8A is a flow diagram of a camera localization process 800 that determines a query camera pose 552 associated with a query image 554 based on a candidate camera pose 852 associated with a candidate image 562, in accordance with some embodiments. The camera localization process 800 is implemented at an electronic system (e.g., an electronic system 200 including a client device 104). Information of the candidate camera pose 852 associated with the candidate image 562 is known, e.g., obtained with the candidate image 562. The electronic system generates a plurality of query plane embeddings 802 corresponding to a plurality of query planes in the query image 554 and obtains a plurality of candidate plane embeddings 804 corresponding to a plurality of candidate planes of the candidate image 562. The plurality of candidate plane embeddings 804 of the plurality of candidate planes of the candidate image 562 are optionally extracted from memory of the electronic system (e.g., memory 206 in Figure 2) or generated from the candidate image 562. The plurality of query plane embeddings 802 and the plurality of candidate plane embeddings 804 are compared to identify a first number of matching plane pairs 806. Each matching plane pair 806 includes a respective query plane of the query image 554 and a respective candidate plane of the candidate image 562 that matches the respective query plane. Based on the first number of matching plane pairs 806, the electronic system iteratively determines (850 or 880) that the query image 554 is captured at the query camera pose 552 related to the candidate camera pose 852 of the candidate image 562.

[0069] In some embodiments, a cost matrix 808 is applied to compare the query and candidate plane embeddings 802 and 804. The cost matrix 808 includes a plurality of matrix elements, and each matrix element indicates a distance (i.e., a similarity level) between a respective query plane embedding 802 and a respective candidate plane embedding 804. In an example, each matrix element of the cost matrix 808 measures the distance using a cosine similarity of the respective plane embeddings 802 and 804. A Hungarian method is further applied to the cost matrix 808 to identify the first number of matching plane pairs 806. Under some circumstances, the image retrieval process 600 identifies a predefined number of (e.g., 10) candidate images 562. A first candidate image 562 has only one matching plane pair 806 with the query image 554 based on comparison of their query plane embeddings 802 and 804, and each of the other nine candidate images 562 has at least 5 matching plane pairs 806 with the query image 554 based on comparison of their query plane embeddings 802 and 804. In some embodiments, the query camera pose 552 is determined from the matching plane pairs 806 only when the first number is greater than a predefined matching pair number threshold (e.g., 3). Conversely, in the above example, if there is only one matching plane pair 806, the corresponding first candidate image 562 is not used to determine the query camera pose 552 of the query image 554.
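The cosine-distance cost matrix and Hungarian assignment can be sketched as follows; scipy's linear_sum_assignment is used here as an off-the-shelf Hungarian solver, and the match-acceptance cutoff max_cost is an assumption not stated in the patent.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_planes(query_embeddings, candidate_embeddings, max_cost=0.3):
    """Match query planes to candidate planes with a Hungarian assignment.

    query_embeddings:     (num_query_planes, dim) array.
    candidate_embeddings: (num_candidate_planes, dim) array.
    Returns a list of (query_index, candidate_index) matching plane pairs.
    """
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    c = candidate_embeddings / np.linalg.norm(candidate_embeddings, axis=1, keepdims=True)
    cost_matrix = 1.0 - q @ c.T                      # cosine distance per matrix element
    rows, cols = linear_sum_assignment(cost_matrix)  # Hungarian method
    # Keep only assignments whose cost is small enough to count as a match.
    return [(i, j) for i, j in zip(rows, cols) if cost_matrix[i, j] < max_cost]

pairs = match_planes(np.random.randn(6, 128), np.random.randn(7, 128))
if len(pairs) > 3:                                   # predefined matching pair number threshold
    print(f"{len(pairs)} matching plane pairs; proceed to plane PnP")
```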

[0070] In some embodiments, the respective candidate plane of the candidate image 562 matches the respective query plane of the query image 554 when the respective candidate plane is parallel to the respective query plane. In some embodiments, the respective candidate plane has a candidate plane normal direction that does not align with and deviates slightly from a query plane normal direction of the respective query plane. The respective candidate plane of the candidate image 562 is regarded as matching the respective query plane of the query image 554 when a deviation of the candidate plane normal direction is less than a threshold angle (e.g., less than 5 degrees).

[0071] Referring to Figure 8A, in some embodiments, the plurality of query plane embeddings 802 and the plurality of candidate plane embeddings 804 are generated from the query image 554 and the candidate image 562 using a plane retrieval network 822, respectively. The plane retrieval network 822 includes a feature backbone network 824, a plane detection model 826, and a region of interest (ROI) pooling network 828. A query feature map 830 is generated from the query image 554 using the feature backbone network 824. A plurality of query bounding boxes 832 correspond to the plurality of query planes of the query image 554, and information of the query bounding boxes 832 is generated from the query image 554 using the plane detection model 826. Information of the query bounding boxes 832 includes plane masks and local 3D plane parameters of the query planes of the query image 554. The plurality of query plane embeddings 802 are generated from the information of the plurality of bounding boxes 832 and the query feature map 830 using the ROI pooling network 828.
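As an illustration of pooling one embedding per detected plane from a backbone feature map, the sketch below uses torchvision's roi_align in place of the ROI pooling network 828; the box format, feature stride, output size, and the averaging step that reduces each pooled patch to a vector are assumptions for the example.

```python
import torch
from torchvision.ops import roi_align

def pool_plane_embeddings(feature_map, plane_boxes, output_size=14, spatial_scale=1.0 / 16):
    """Pool one embedding per detected plane from a backbone feature map.

    feature_map: (1, C, H, W) tensor from the feature backbone network.
    plane_boxes: (num_planes, 4) tensor of (x1, y1, x2, y2) boxes in image coordinates.
    spatial_scale maps image coordinates to feature-map coordinates (assumed stride 16).
    """
    pooled = roi_align(feature_map, [plane_boxes], output_size=output_size,
                       spatial_scale=spatial_scale, aligned=True)    # (num_planes, C, 14, 14)
    # Collapse the spatial dimensions so each plane gets a single embedding vector.
    embeddings = pooled.mean(dim=(2, 3))                             # (num_planes, C)
    return torch.nn.functional.normalize(embeddings, dim=1)

features = torch.randn(1, 256, 30, 40)                   # backbone output for a 480x640 image
boxes = torch.tensor([[32.0, 48.0, 256.0, 300.0],
                      [300.0, 100.0, 600.0, 420.0]])
plane_embeddings = pool_plane_embeddings(features, boxes)            # shape (2, 256)
```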

[0072] Likewise, a candidate feature map 834 is generated from the candidate image 562 using the feature backbone network 824. A plurality of candidate bounding boxes 836 correspond to the plurality of candidate planes of the candidate image 562, and information of the candidate bounding boxes 836 is generated from the candidate image 562 using the plane detection model 826. Information of the candidate bounding boxes 836 includes plane masks and local 3D plane parameters of the candidate planes of the candidate image 562. The plurality of candidate plane embeddings 804 are generated from the plurality of bounding boxes 836 and the candidate feature map 834 using the ROI pooling network 828.

[0073] Figure 8B is a flow diagram of an iterative camera pose process 850 that determines a query camera pose 552 based on matching plane pairs 806 of a query image 554 and candidate image 562, in accordance with some embodiments. The query camera pose 552 is determined during a plurality of iterations 812. In some embodiments, for each iteration 812, a second number of matching plane pairs 806' are selected from the first number of matching plane pairs 806, and the second number is less than the first number. For example, each iteration 812 processes 3 matching plane pairs 806' among a total of 7 matching plane pairs 806 of the query image 554 and candidate image 562. Further, in some embodiments, during each iteration 812, for each of the selected second number of matching plane pairs 806', the electronic system determines whether the respective query plane and the respective candidate plane are (814) substantially parallel (e.g., having an angle less than the threshold angle, such as less than 5 degrees). In accordance with a determination that, in each of the selected second number of matching plane pairs 806', the respective query plane and the respective candidate plane are parallel, it is determined that the query camera pose 552 is aligned with and equal to the candidate camera pose 852 of the candidate image 562.

[0074] Additionally, in some embodiments, the electronic system determines an iteration fitting error 816 between the query planes and candidate planes in the second number of matching plane pairs 806' during each iteration 812. The electronic system identifies a smallest fitting error 818 among the iteration fitting errors 816 of the plurality of iterations 812. A relative pose change 820-1 between the query camera pose 552 and the candidate camera pose 852 is determined for the respective iteration 812 associated with the smallest fitting error 818. The query camera pose 552 is determined based on the candidate camera pose 852 of the candidate image 562 and the relative pose change 820-1.

[0075] Figure 8C is a flow diagram of another iterative camera pose process 880 that determines a query camera pose 552 based on matching plane pairs 806 of a query image 554 and candidate image 562, in accordance with some embodiments. In some embodiments, the electronic system determines an iteration fitting error 816 between the query planes and candidate planes in the second number of matching plane pairs 806' and whether the iteration fitting error 816 is below a threshold error 838 during each of a subset of iterations 812. For a last iteration 812 (e.g., Iteration 2), in accordance with a determination that the iteration fitting error 816 is below the threshold error 838, the electronic system terminates the plurality of iterations 812 and determines a relative pose change 820-2 between the query camera pose 552 and the candidate camera pose 852. The query camera pose 552 is determined based on the candidate camera pose 852 of the candidate image 562 and the relative pose change 820-2.

[0076] Figure 9 is a flow diagram of a training process 900 that trains a camera localization model applied in a camera localization process 800 as shown in Figures 8A-8C, in accordance with some embodiments. As explained above, a plurality of query plane embeddings 802 and a plurality of candidate plane embeddings 804 are generated from a query image 554 and a candidate image 562 using a plane retrieval network 822, respectively. The plane retrieval network 822 includes a feature backbone network 824, a plane detection model 826, and an ROI pooling network 828. A query feature map 830 is generated from the query image 554 using the feature backbone network 824. A plurality of query bounding boxes 832 correspond to the plurality of query planes of the query image 554, and information of the query bounding boxes 832 is generated from the query image 554 using the plane detection model 826. Information of the query bounding boxes 832 includes plane masks and local 3D plane parameters of the query planes of the query image 554. The plurality of query plane embeddings 802 are generated from the plurality of bounding boxes 832 and the query feature map 830 using the ROI pooling network 828.

[0077] A candidate feature map 834 is generated from the candidate image 562 using the feature backbone network 824. A plurality of candidate bounding boxes 836 correspond to the plurality of candidate planes of the candidate image 562, and information of the candidate bounding boxes 836 is generated from the candidate image 562 using the plane detection model 826. Information of the candidate bounding boxes 836 includes plane masks and local 3D plane parameters of the candidate planes of the candidate image 562. The plurality of candidate plane embeddings 804 are generated from the plurality of bounding boxes 836 and the candidate feature map 834 using the ROI pooling network 828.

[0078] In some embodiments, based on metric learning, the feature backbone network 824 and ROI pooling network 828 are trained jointly in a supervised manner, while the plane detection model 826 is predefined and configured to generate plane masks and local 3D plane parameters 832 and 836 for planes of the query image 554 and candidate image 562, respectively. Specifically, an electronic system (e.g., a server 102 or a client device 104 configured to implement the process 800) receives a first training image 902, a first ground truth 904 of the first training image 902, a second training image 906, and a second ground truth 908 of the second training image 906. Each of the first and second ground truths 904 and 908 includes respective ground truth plane labels. The ground truth plane labels of paired images (the first and second training images 902 and 906) are leveraged to form positive and negative plane pairs. A first training feature map 910 is generated from the first training image 902 using the feature backbone network 824, and a second training feature map 912 is generated from the second training image 906 using the feature backbone network 824. A plurality of first training plane embeddings 914 are generated from the first ground truth 904 and the first training feature map 910 using the ROI pooling network 828. A plurality of second training plane embeddings 916 are generated from the second ground truth 908 and the second training feature map 912 using the ROI pooling network 828. A triplet loss L_T_EMBD1 is determined between the first and second training plane embeddings 914 and 916. The feature backbone network 824 and the ROI pooling network 828 are trained based on the triplet loss L_T_EMBD1.

[0079] Alternatively, in some embodiments, both the plane detection model 826 and the feature backbone network 824 are predefined. The ROI pooling network 828 is trained in a supervised manner. Specifically, the electronic system receives a first training image 902, a first ground truth 904 of the first training image 902, a second training image 906, and a second ground truth 908 of the second training image 906. Each of the first and second ground truths 904 and 908 includes respective ground truth plane labels. The ground truth plane labels of paired images (the first and second training images 902 and 906) are leveraged to form positive and negative plane pairs. A first training feature map 910 is generated from the first training image 902 using the feature backbone network 824, and a second training feature map 912 is generated from the second training image 906 using the feature backbone network 824. A plurality of first training plane embeddings 914 are generated from the first ground truth 904 and the first training feature map 910 using the ROI pooling network 828. A plurality of second training plane embeddings 916 are generated from the second ground truth 908 and the second training feature map 912 using the ROI pooling network 828. A triplet loss L_T_EMBD2 is determined between the first and second training plane embeddings 914 and 916, and the ROI pooling network 828 is trained based on the triplet loss L_T_EMBD2.

[0080] Stated another way, the ground truth plane labels are used to pool the feature maps 910 and 912. Given the ground truth positive and negative plane pairs of the first and second training images 902 and 906, the triplet loss L_T_EMBD1 or L_T_EMBD2 can be built on those corresponding training plane embeddings 914 and 916. During training with the triplet loss L_T_EMBD1 or L_T_EMBD2, the plane embeddings 914/916 of positive plane pairs become closer to each other. For negative plane pairs, the distance between their plane embeddings 914/916 becomes larger. More specifically, the triplet loss L_T_EMBD1 or L_T_EMBD2 is defined as $L_{T\_EMBD} = \max\big(d(f_{a_i}, f_{p_i}) - d(f_{a_i}, f_{n_i}) + \tau,\ 0\big)$, where $f_{a_i}$, $f_{p_i}$, and $f_{n_i}$ are feature embeddings of the anchor plane $a_i$, the positive plane $p_i$, and the negative plane $n_i$ sampled from the ground truth plane bounding boxes in 904/908, and $\tau$ is a preset margin.
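To illustrate how ground truth plane labels of paired images can form positive and negative plane pairs, the sketch below builds anchor/positive/negative plane-embedding triplets from a list of labeled correspondences; the data layout and function name are assumptions for illustration, and the resulting triplets can feed a triplet loss such as the one sketched earlier.

```python
import torch

def build_plane_triplets(embeddings_a, embeddings_b, correspondences):
    """Form (anchor, positive, negative) plane-embedding triplets from paired images.

    embeddings_a: (num_planes_a, dim) plane embeddings pooled from the first image.
    embeddings_b: (num_planes_b, dim) plane embeddings pooled from the second image.
    correspondences: list of (i, j) pairs labeling plane i in image A and plane j in
                     image B as the same physical plane (from ground truth plane labels).
    """
    anchors, positives, negatives = [], [], []
    for i, j in correspondences:
        # Every plane in image B that is NOT the ground-truth match acts as a negative.
        for k in range(embeddings_b.shape[0]):
            if k == j:
                continue
            anchors.append(embeddings_a[i])
            positives.append(embeddings_b[j])
            negatives.append(embeddings_b[k])
    if not anchors:
        return None
    return torch.stack(anchors), torch.stack(positives), torch.stack(negatives)
```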

[0081] During inference, ground truth plane labels do not exist, and the pretrained plane detection model 826 is applied to generate plane masks and 3D plane parameters for the query image 554 and the candidate image 562 that is optionally retrieved in an image retrieval process 600. An example of the pretrained plane detection model 826 is PlaneRCNN. Given plane masks, the plane embeddings are pooled using the ROI pooling network 828. Referring to Figure 8A, during inference, a cost matrix 808 is formed via a pairwise distance of plane feature embeddings 802 and 804. Plane matches are found by applying (810) a Hungarian method on the cost matrix. In some embodiments, in accordance with a detection of more than 3 matching plane pairs, a plane perspective-n-point (Plane-PnP) method is implemented based on random sample consensus (RANSAC). The electronic system randomly samples three plane pairs from the first number of matching plane pairs 806 and solves a pose candidate in a closed form.

[0082] In some embodiments, in accordance with the Plane-PnP method, the electronic system randomly picks three matching plane pairs 806' during each iteration 812. Each plane is represented as $n^\top x = d$, where $n$ is a normal direction of the respective plane and $d$ is an offset. A transformation matrix corresponding to a relative pose change 820 is $\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$, where $R$ is a rotation matrix describing a rotational transformation between two planes, and $t$ is a translation vector describing a translational shift between the two planes. A query camera pose 552 associated with a query image 554 is $T_q$, and a candidate camera pose 852 of a retrieved candidate image 562 is $T_r$, which is known during a camera localization process 800. A relative pose change 820 ($T_{qr}$) between the query image 554 and the candidate image 562 is represented as $T_{qr} = T_q^{-1} T_r$. As such, relationships among the query camera pose 552, candidate camera pose 852, query planes, and candidate planes are described as follows:

$n_q = R_{qr} n_r$ (3)

$d_q = d_r + n_q^\top t_{qr}$ (4)

where $n_q$ and $n_r$ are normal directions of a query plane of the query image 554 and of a candidate plane of the retrieved candidate image 562, respectively; $d_q$ and $d_r$ are a query offset of the query plane and a candidate offset of the candidate plane, respectively; $R_{qr}$ is a rotation matrix describing a rotational transformation between the query plane of the query image 554 and the candidate plane of the retrieved candidate image 562; and $t_{qr}$ is a translation vector describing the relative pose change 820 between the query plane and the candidate plane. Based on the three matching plane pairs 806', singular value decomposition (SVD) is applied to solve the rotation matrix $R_{qr}$ (i.e., the rotational part of the relative pose change 820) from equation (3). The translation vector $t_{qr}$ is directly solved from equation (4). The relative pose change 820 ($T_{qr}$) is a combination of the rotation matrix $R_{qr}$ and the translation vector $t_{qr}$.

[0083] With the solved $T_{qr}$ and the known $T_r$, the query camera pose $T_q$ (552) is represented as:

$T_q = T_{qr}^{-1} T_r$ (5)

A fitting error $L_e$ is determined between the query planes and the candidate planes in the selected matching plane pairs 806'. If a current fitting error $L_e^i$ is smaller than a previous lowest error $L_e^*$, the electronic system sets $L_e^* = L_e^i$ and $T_{qr}^* = T_{qr}^i$. After running the RANSAC, with the solved $T_{qr}^*$ and the known $T_r$, the query camera pose $T_q$ (552) is derived as in equation (5).
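Under the relations reconstructed above, a closed-form solve for the sampled plane pairs can be sketched as follows: the rotation in equation (3) is fitted with an SVD-based orthogonal Procrustes step, and the translation is obtained from the offset relation in equation (4) by least squares. The plane data layout, the offset sign convention, and the helper names are assumptions for illustration; this is not the patent's exact solver.

```python
import numpy as np

def solve_relative_pose(query_planes, candidate_planes):
    """Solve the relative pose change T_qr from three (or more) matching plane pairs.

    Each plane is a (normal, offset) pair with a unit normal n and offset d such that
    n.T @ x = d. Rotation follows equation (3) (n_q = R_qr @ n_r); translation follows
    the reconstructed offset relation d_q = d_r + n_q.T @ t_qr.
    """
    n_q = np.stack([n for n, _ in query_planes])        # (k, 3)
    n_r = np.stack([n for n, _ in candidate_planes])    # (k, 3)
    d_q = np.array([d for _, d in query_planes])        # (k,)
    d_r = np.array([d for _, d in candidate_planes])    # (k,)

    # Rotation: minimize sum ||n_q - R n_r||^2 via SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(n_q.T @ n_r)
    s = np.diag([1.0, 1.0, np.linalg.det(u @ vt)])      # enforce a proper rotation
    r_qr = u @ s @ vt

    # Translation: rows n_q give a linear system n_q @ t = d_q - d_r.
    t_qr, *_ = np.linalg.lstsq(n_q, d_q - d_r, rcond=None)

    pose = np.eye(4)
    pose[:3, :3] = r_qr
    pose[:3, 3] = t_qr
    return pose                                          # 4x4 transformation matrix T_qr

def query_pose_from_candidate(t_qr, t_r):
    """Equation (5): T_q = T_qr^{-1} @ T_r, with T_r the known candidate camera pose."""
    return np.linalg.inv(t_qr) @ t_r
```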

[0084] Based on the above Plane-PnP method, at least three matching plane pairs 806' are applied to solve the query camera pose $T_q$ (552). In some embodiments, it is determined whether a matching plane pair 806' of a respective query plane of the query image 554 and a respective candidate plane of the candidate image 562 are parallel, e.g., whether normal directions of the respective query and candidate planes form an angle that is less than 10 degrees. In accordance with a determination that the respective query and candidate planes of the matching plane pair 806' are parallel, the query camera pose $T_q$ (552) is equal to the candidate camera pose 852. Conversely, if the respective query and candidate planes of the matching plane pair 806' are not parallel, the Plane-PnP method is implemented to solve the query camera pose $T_q$ (552) associated with the query image 554. In some embodiments, predictions of the query planes from the query image 554 are not sufficiently accurate, and the fitting error is required to be less than 0.5. In accordance with a determination that the fitting error is less than a threshold error 838 (e.g., 0.5), the query camera pose $T_q$ (552) is used. Conversely, in accordance with a determination that the fitting error is greater than the threshold error 838 (e.g., 0.5), the candidate camera pose 852 is used. In some implementations, planar objects with homogeneous color are inserted into the scene. A camera localization system performs much better at localizing a camera after the insertion due to inclusion of planar geometry features.

[0085] In some embodiments, the training process 900 is implemented at a server 102 to train the camera localization model, and the camera localization model is provided to a client device 104 to implement the camera localization process 800. Alternatively, in some embodiments, the training process 900 is implemented to train the camera localization model locally at a client device 104, which applies the camera localization model to implement the camera localization process 800. In some embodiments, the training process 900 is implemented at a server 102 to train the camera localization model. A client device sends the query image 554 to the server 102, and the camera localization model is implemented by the server 102 to process the query image 554.

[0086] Figure 10A is a flow diagram of an image retrieval method 1000, in accordance with some embodiments. For convenience, the method 1000 is described as being implemented by an electronic system 200 (e.g., a client device 104, a server 102, or a combination thereof). In some embodiments, the client device 104 is a mobile phone 104C, AR glasses 104D, smart television device, or drone. Method 1000 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 10A may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1000 may be combined and/or the order of some operations may be changed.

[0087] The electronic system obtains (1002) a query image 554 in a scene and generates (1004) a query feature map 602, a query plane instance feature 604, and a query plane semantic feature 606 from the query image 554. The query plane instance feature 604 identifies (1006) one or more bounding boxes defining one or more planes in the query image 554, and the query plane semantic feature includes semantic information of the one or more planes. The electronic system aggregates (1008) the query feature map 602, query plane instance feature 604, and query plane semantic feature 606 to a plane-assisted global descriptor 608. For each of a plurality of images in a database 564, the electronic system obtains (1010) a respective global image descriptor 612, and determines (1012) a respective distance 614 of the plane-assisted global descriptor 608 and the respective global image descriptor 612. A candidate image 562 is selected (1014) from the plurality of images based on the respective distance 614 corresponding to each of the plurality of images of the database 564. In some embodiments, the candidate image 562 is (1016) among a predefined number of images having smaller respective distances than remaining images of the plurality of images in the database 564. The query image 554 has a higher similarity level to the predefined number of images than to the remaining images in the database 564.

[0088] In some embodiments, the one or more bounding boxes are applied to perform an operation of extracting the query feature map 602, query plane instance feature 604, and query plane semantic feature 606 having a predefined spatial size. For example, each of the query feature map 602, query plane instance feature 604, and query plane semantic feature 606 has a size of 14 x 14. The query feature map 602, query plane instance feature 604, and query plane semantic feature 606 are concatenated to form a plane-assisted query feature map 610. The plane-assisted query feature map 610 is converted to the plane-assisted global descriptor 608.

[0089] In some embodiments, the query feature map 602, query plane instance feature 604, and query plane semantic feature 606 are aggregated to a plane-assisted query feature map 610 using a feature aggregation network 622. The plane-assisted query feature map 610 is converted to the plane-assisted global descriptor 608 using a vector aggregation network 620 (e.g., a NetVLAD layer). Further, in some embodiments, the query feature map 602 is generated using a feature backbone network 616. The query plane instance feature 604 and query plane semantic feature 606 are generated from a plane detection model and a semantic segmentation model, respectively. Additionally, in some embodiments, a subset of the feature backbone network 616, plane detection model, semantic segmentation model, feature aggregation network, and vector aggregation network 620 is trained based on a triplet loss $L$ using an anchor training image $a_i$, a positive sample $p_i$ of the anchor training image, and a negative sample $n_i$ of the anchor training image.

[0090] In some embodiments, for a first image in the database 564, the respective global image descriptor 612 is obtained by generating a first feature map, a first plane instance feature, and a first plane semantic feature from the first image. The first plane instance feature identifies one or more first bounding boxes defining one or more planes in the first image, and the first plane semantic feature includes semantic information of the one or more planes in the first image. The first feature map, first plane instance feature, and first plane semantic feature are aggregated to the respective global image descriptor of the first image. For the first image, the respective distance 614 is determined between the plane-assisted global descriptor of the query image 554 and the respective global image descriptor 612 of the first image. Conversely, in some embodiments, for a second image in the database 564, the respective global image descriptor 612 is extracted from memory of the electronic system.

[0091] In some embodiments, the plurality of images of the database 564 are captured in the scene where the query image 554 is captured. The electronic system extracts, from the database 564, information of a candidate camera pose at which the candidate image 562 is captured in the scene. The query image is captured at a query camera pose related to the candidate camera pose of the candidate image 562. The query camera pose is determined based on the candidate camera pose. Further, in some embodiments, the electronic system extracts a plurality of query feature points of the query image 554, identifies a plurality of reference feature points of the candidate image, and compares the plurality of query feature points and the plurality of reference feature points to determine the query camera pose relative to the candidate camera pose.

[0092] Alternatively, referring to Figures 8A-8C, in some embodiments, the electronic system generates a plurality of query plane embeddings 802 corresponding to a plurality of query planes in the query image 554. A plurality of candidate plane embeddings 804 correspond to a plurality of candidate planes of the candidate image 562. The plurality of query plane embeddings 802 and the plurality of candidate plane embeddings 804 are compared to identify a first number of matching plane pairs 806. Each matching plane pair 806 includes a respective query plane of the query image 554 and a respective candidate plane of the candidate image 562 that substantially matches the respective query plane. The first number of matching plane pairs 806 are applied to determine the query camera pose 552 related to the candidate camera pose 852 of the candidate image 562. Further, in some embodiments, the electronic system determines a cost matrix 808 including a plurality of matrix elements and applies a Hungarian method 810 to the cost matrix to identify the first number of matching plane pairs. Each matrix element indicates a distance between a respective query plane embedding and a respective candidate plane embedding. Additionally, in some embodiments, a subset of matching plane pairs 806’ is selected. A fitting error 816 is determined between the query planes and candidate planes in the subset of matching plane pairs and compared with a threshold error 838. In accordance with a determination that the fitting error 816 is below the threshold error 838, the electronic system determines a relative pose change 820-2 between the query camera pose 552 and the candidate camera pose 852 and determines the query camera pose 552 based on the candidate camera pose 852 of the candidate image 562 and the relative pose change 820-2.

[0093] Figure 10B is a flow diagram of a camera localization method 1050, in accordance with some embodiments. For convenience, the method 1050 is described as being implemented by an electronic system 200 (e.g., a client device 104, a server 102, or a combination thereof). In some embodiments, the client device 104 is a mobile phone 104C, AR glasses 104D, smart television device, or drone. Method 1050 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 10B may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 1050 may be combined and/or the order of some operations may be changed.

[0094] The electronic system obtains (1052) a query image 554, a candidate image 562, and information of a candidate camera pose 852 at which the candidate image 562 is captured in a scene, and generates (1054) a plurality of query plane embeddings 802 corresponding to a plurality of query planes in the query image 554. The electronic system obtains (1056) a plurality of candidate plane embeddings 804 corresponding to a plurality of candidate planes of the candidate image 562. In some embodiments, the plurality of candidate plane embeddings 804 of the plurality of candidate planes of the candidate image 562 are extracted from memory of the electronic system. Alternatively, in some embodiments, the plurality of candidate plane embeddings 804 are generated from the candidate image 562.

[0095] The plurality of query plane embeddings 802 and the plurality of candidate plane embeddings 804 are compared (1058) to identify a first number of matching plane pairs 806. Each matching plane pair 806 includes (1060) a respective query plane of the query image 554 and a respective candidate plane of the candidate image 562 that substantially matches the respective query plane. In some embodiments, the plurality of query plane embeddings 802 and the plurality of candidate plane embeddings 804 are compared by determining (1062) a cost matrix 808 including a plurality of matrix elements and applying (1064) a Hungarian method 810 to the cost matrix 808 to identify the first number of matching plane pairs 806. Each matrix element indicates a distance between a respective query plane embedding 802 and a respective candidate plane embedding 804. Based on the first number of matching plane pairs 806, the electronic device iteratively determines (1066) that the query image 554 is captured at a query camera pose 552 related to the candidate camera pose 852 of the candidate image 562.

[0096] In some embodiments, the electronic system determines that the first number is greater than a predefined matching pair number threshold. The query camera pose 552 is determined in accordance with a determination that the first number is greater than the predefined matching pair number threshold. Further, in some embodiments, the predefined matching pair number threshold is equal to 3, and the query camera pose 552 is determined in accordance with a determination that the query image 554 and candidate image 562 have at least 3 matching plane pairs.

[0097] In some embodiments, the query camera pose 552 is determined during a plurality of iterations 812. For each iteration 812, the electronic system selects (1066) a second number of matching plane pairs 806' in the first number of matching plane pairs 806. The second number is less than the first number. Further, in some embodiments, during each iteration 812, for each of the selected second number of matching plane pairs 806', the electronic system determines (1070) whether the respective query plane and the respective candidate plane are parallel. In accordance with a determination that, in each of the selected second number of matching plane pairs 806', the respective query plane and the respective candidate plane are parallel, the electronic system determines (1072) that the query camera pose 552 is aligned with and equal to the candidate camera pose 852 of the candidate image 562.

[0098] Additionally, in some embodiments, for each iteration 812, the electronic system determines an iteration fitting error 816 between the query planes and candidate planes in the second number of matching plane pairs 806'. A smallest fitting error is identified among the iteration fitting errors 816 of the plurality of iterations 812. A relative pose change 820 is determined between the query camera pose 552 and the candidate camera pose 852 in the respective iteration 812 associated with the smallest fitting error. The query camera pose 552 is determined based on the candidate camera pose 852 of the candidate image and the relative pose change 820.

[0099] In some embodiments, for each of a subset of iterations 812, the electronic system determines an iteration fitting error 816 between the query planes and candidate planes in the second number of matching plane pairs 806’ and whether the iteration fitting error 816 is below a threshold error 838. For a last iteration, in accordance with a determination that the iteration fitting error 816 is below the threshold error 838, the electronic system terminates the plurality of iterations 812, determines a relative pose change 820 between the query camera pose 552 and the candidate camera pose 852, and determines the query camera pose 552 based on the candidate camera pose 852 of the candidate image 562 and the relative pose change 820.

[00100] In some embodiments, the plurality of query plane embeddings 802 and the plurality of candidate plane embeddings 804 are generated from the query image 554 and the candidate image 562 using a plane retrieval network 822, respectively. The plane retrieval network 822 includes a feature backbone network 824, a plane detection model 826, and an ROI pooling network 828. Further, in some embodiments, a query feature map 830 is generated from the query image 554 using the feature backbone network 824. Information of a plurality of bounding boxes 832 corresponding to the plurality of query planes are generated using the plane detection model 826. The plurality of query plane embeddings 802 are generated from the information of the plurality of bounding boxes 832 and the query feature map 830 using the ROI pooling network 828.

[00101] Additionally, in some embodiments, the plane detection model 826 is predefined and configured to generate plane masks and local 3D plane parameters (i.e., the information of the bounding boxes 832) for planes of the query or candidate image 554 or 562. The feature backbone network 824 and ROI pooling network 828 are trained jointly in a supervised manner, e.g., based on a triplet loss L_T_EMBD1. Alternatively, in some embodiments, the plane detection model 826 and the feature backbone network 824 are predefined, and the ROI pooling network 828 is trained in a supervised manner, e.g., based on a triplet loss L_T_EMBD2. More details on training the ROI pooling network 828 are explained above with reference to Figure 9.

[00102] It should be understood that the particular order in which the operations in Figure 10A or 10B have been described are merely exemplary and are not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to retrieve candidate images or determine a camera pose as described herein. Additionally, it should be noted that details of other processes described above with respect to Figures 6-9 are also applicable in an analogous manner to methods 1000 and 1050 described above with respect to Figures 10A and 10B. For brevity, these details are not repeated here.

[00103] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

Additionally, it will be understood that, although the terms "first," "second," etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[00104] As used herein, the term "if" is, optionally, construed to mean "when" or "upon" or "in response to determining" or "in response to detecting" or "in accordance with a determination that," depending on the context. Similarly, the phrase "if it is determined" or "if [a stated condition or event] is detected" is, optionally, construed to mean "upon determining" or "in response to determining" or "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]" or "in accordance with a determination that [a stated condition or event] is detected," depending on the context.

[00105] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.

[00106] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.