
Title:
SLAM-GUIDED MONOCULAR DEPTH REFINEMENT SYSTEM USING SELF-SUPERVISED ONLINE LEARNING
Document Type and Number:
WIPO Patent Application WO/2022/187753
Kind Code:
A1
Abstract:
Systems and methods are provided for refining a depth estimation model. The systems and methods disclosed herein can receive a plurality of image frames comprising scenes of an environment captured by a camera; execute a first module comprising an augmented Simultaneous Localization and Mapping (SLAM), wherein the SLAM is augmented by one of a trained depth network and a learning-based optical flow, the first module configured to generate camera poses, map points of the environment, and depth maps for each image frame based on executing the augmented SLAM on a plurality of image frames; execute a second module comprising an online training platform, the second module configured to compute one or more loss parameters and iteratively update the depth estimation model based on the one or more loss parameters; and generate refined depth maps for the plurality of images using the refined depth estimation model.

Inventors:
JI PAN (US)
YAN QINGAN (US)
MA YUXIN (US)
XU YI (US)
Application Number:
PCT/US2022/020893
Publication Date:
September 09, 2022
Filing Date:
March 18, 2022
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06T7/55; G06T15/10
Foreign References:
US20210065391A12021-03-04
US20190325597A12019-10-24
US20210118184A12021-04-22
Other References:
TEED, ZACHARY; DENG, JIA: "RAFT: Recurrent All-Pairs Field Transforms for Optical Flow", Computer Science / Computer Vision and Pattern Recognition, 26 March 2020 (2020-03-26), pages 402-419, XP047569509, retrieved from the Internet [retrieved on 2022-06-22]
Attorney, Agent or Firm:
AGDEPPA, Hector, A. (US)
Claims:

What is claimed is:

1. A method for depth estimation from monocular images, comprising: receiving an image sequence comprising a plurality of image frames, each of the plurality of image frames comprising one or more features of an environment; generating map points, camera poses, and depth maps based on tracking and mapping a plurality of features included in the plurality of image frames; computing one or more loss parameters from the generated map points, camera poses, and depth maps; and refining a trained depth estimation model based on the one or more loss parameters, wherein the refined depth estimation model predicts depth maps from the image sequence.

2. The method of claim 1, further comprising generating a reconstruction of the environment by fusing the predicted depth maps together.

3. The method of one of claims 1 and 2, wherein generating map points, camera poses, and depth maps based on tracking and mapping a plurality of features included in the plurality of image frames comprises Simultaneous Localization and Mapping (SLAM) augmented by one of a trained depth network and a learning-based optical flow.

4. The method of claim 3, wherein the SLAM is augmented by a learning-based optical flow incorporated into front end tracking of the SLAM, the method comprising: locating depth correspondence of feature points between each image frame of the plurality of image frames and each sequentially preceding image frame, wherein feature points are points of each image that make up a feature of the plurality of features, wherein the map points, camera poses, and depth maps are generated by the SLAM based on the located depth correspondence as an input into the SLAM.

5. The method of one of claims 3 and 4, wherein the learning-based optical flow is a recurrent all-pairs field transforms (RAFT) optical flow.

6. The method of claim 3, wherein the SLAM is augmented by a trained depth network comprising a convolutional neural network (CNN), the method comprising: predicting depth maps for each of the plurality of image frames using the CNN; and inputting the predicted depth maps into the SLAM, wherein at least the map points and camera poses are generated by the SLAM based on the depth maps predicted by the CNN.

7. The method of any one of claims 1-6, wherein the one or more loss parameters comprises one or more of photometric loss, map point loss, depth consistency loss, and edge-aware depth smoothness loss.

8. The method of claim 7, wherein refining a trained depth estimation model based on the one or more loss parameters comprises computing an overall refinement loss (L) as: L = Lp + λsLs + λmLm + λcLc, where Lp represents the photometric loss; Ls represents the edge-aware normalized smoothness loss; Lm represents the map point loss; Lc represents the depth consistency loss; and λs, λm, and λc represent weighting parameters for balancing the contribution of each loss term to the overall refinement loss.

9. The method of any one of claims 1-8, further comprising: selecting keyframes as a subset of the plurality of image frames, wherein the one or more loss parameters are computed based on the selected keyframes.

10. The method of any one of claims 1-8, wherein the image sequence is received by a cloud-based server resident on a network, wherein the generating, computing, and refining are performed at the cloud-based server.

11. A non-transitory computer-readable storage medium storing a plurality of instructions executable by one or more processors, the plurality of instructions when executed by the one or more processors cause the one or more processors to perform a method comprising: receiving an image sequence comprising a plurality of image frames, each of the plurality of image frames comprising one or more features of an environment; generating map points, camera poses, and depth maps based on tracking and mapping a plurality of features included in the plurality of image frames; computing one or more loss parameters from the generated map points, camera poses, and depth maps; and refining a trained depth estimation model based on the one or more loss parameters, wherein the refined depth estimation model predicts depth maps from the image sequence.

12. The non-transitory computer-readable storage medium of claim 11, wherein the method further comprises generating a reconstruction of the environment by fusing the predicted depth maps together.

13. The non-transitory computer-readable storage medium of one of claims 11 and 12, wherein generating map points, camera poses, and depth maps based on tracking and mapping a plurality of features included in the plurality of image frames comprises Simultaneous Localization and Mapping (SLAM) augmented by one of a trained depth network and a learning-based optical flow.

14. The non-transitory computer-readable storage medium of claim 13, wherein the learning-based optical flow is a recurrent all-pairs field transforms (RAFT) optical flow.

15. The non-transitory computer-readable storage medium of claim 13, wherein the SLAM is augmented by a trained depth network comprising a convolutional neural network (CNN).

16. The non-transitory computer-readable storage medium of any one of claims 11-15, wherein the one or more loss parameters comprises photometric loss, map point loss, depth consistency loss, and edge-aware depth smoothness loss.

17. The non-transitory computer-readable storage medium of claim 16, wherein refining a trained depth estimation model based on the one or more loss parameters comprises computing an overall refinement loss (L) as: L = Lp + λsLs + λmLm + λcLc, where Lp represents the photometric loss; Ls represents the edge-aware normalized smoothness loss; Lm represents the map point loss; Lc represents the depth consistency loss; and λs, λm, and λc represent weighting parameters for balancing the contribution of each loss term to the overall refinement loss.

18. A system for refining a depth estimation model, the system comprising: a memory configured to store instructions; and one or more processors communicably coupled to the memory and configured to execute the instructions to: receive a plurality of image frames comprising scenes of an environment captured by a camera; execute a first module comprising an augmented Simultaneous Localization and Mapping (SLAM), wherein the SLAM is augmented by one of a trained depth network and a learning-based optical flow, the first module configured to generate camera poses, map points of the environment, and depth maps for each image frame based on executing the augmented SLAM on a plurality of image frames; execute a second module comprising an online training platform, the second module configured to compute one or more loss parameters and iteratively update the depth estimation model based on the one or more loss parameters; and generate refined depth maps for the plurality of images using the refined depth estimation model.

19. The system of claim 18, wherein the trained depth network is a convolutional neural network (CNN), and the learning-based optical flow is a recurrent all-pairs field transforms (RAFT) optical flow.

20. The system of claim 18, wherein the one or more loss parameters comprises photometric loss, map point loss, depth consistency loss, and edge-aware depth smoothness loss.

Description:
SLAM-GUIDED MONOCULAR DEPTH REFINEMENT SYSTEM USING SELF-

SUPERVISED ONLINE LEARNING

Cross-Reference to Related Applications

[0001] This application claims the benefit of U.S. Provisional Application No. 63/162,920 filed March 18, 2021, titled "SLAM-GUIDED MONOCULAR DEPTH REFINEMENT SYSTEM USING SELF-SUPERVISED ONLINE LEARNING," which is hereby incorporated herein by reference in its entirety.

[0002] This application also claims the benefit of U.S. Provisional Application No. 63/285,574 filed December 3, 2021, titled "GEOREFINE: ONLINE MONOCULAR DEPTH REFINEMENT FOR GEOMETRICALLY CONSISTENT DENSE MAPPING," which is hereby incorporated herein by reference in its entirety.

Technical Field

[0003] The present disclosure relates generally to systems and methods for depth estimation from one or more images and, more particularly, to systems and methods that combine an environmental mapping technique with an online learning scheme to refine a trained depth network for depth estimation.

Description of the Related Art

[0004] 3D reconstruction from monocular images has been an active research area in computer vision for decades. An example application of a 3D reconstruction is in Augmented Reality and/or Virtual Reality applications, where virtual objects may be placed on real-world features of an environment based on the 3D reconstruction of said environment.

[0005] Traditionally, an environment is usually reconstructed in the form of a set of sparse 3D points via geometric techniques, such as Structure-from-Motion (SfM) or Simultaneous Localization and Mapping (SLAM). Over the years, the monocular geometric methods have been continuously improved and become increasingly accurate in recovering 3D map points. Representative open-source systems along this line include COLMAP (an offline SfM system), and ORB-SLAM (an online SLAM system).

[0006] Recently, deep-learning-based methods have been used in predicting a dense depth map from a single image. Those models are either trained in a supervised manner using ground-truth depths, or in a self-supervised way leveraging the photometric consistency between stereo and/or monocular image pairs. During inference, with the prior knowledge learned from data, the depth model can generate dense depth images even in textureless image regions. However, the errors in the predicted depths are still relatively high.

[0007] A few methods aim to reap the best of geometric systems and deep methods. One method lets the monocular SLAM and learning-based depth prediction form a self-improving loop to improve the performance of each module. Another method adopted a test-time fine-tuning strategy to enforce geometric consistency using the outputs from a conventional monocular SfM system (i.e., COLMAP). Nonetheless, both methods require pre-computing and storing the sparse map points and camera poses from SfM or SLAM in an offline manner, which is not applicable to some online applications where data pre-processing is not possible.

Brief Summary of Embodiments

[0008] According to various embodiments of the disclosed technology, systems and methods are provided for estimating a depth map from one or more images in a self-supervised manner.

[0009] In accordance with some embodiments, a method for depth estimation from monocular images is provided. The method comprises receiving an image sequence comprising a plurality of image frames, each of the plurality of image frames comprising one or more features of an environment; generating map points, camera poses, and depth maps based on tracking and mapping a plurality of features included in the plurality of image frames; computing one or more loss parameters from the generated map points, camera poses, and depth maps; and refining a trained depth estimation model based on the one or more loss parameters, wherein the refined depth estimation model predicts depth maps from the image sequence.

[0010] In another aspect, a non-transitory computer-readable storage medium storing a plurality of instructions is provided. The instructions are executable by one or more processors and when executed by the one or more processors cause the one or more processors to perform a method comprising receiving an image sequence comprising a plurality of image frames, each of the plurality of image frames comprising one or more features of an environment; generating map points, camera poses, and depth maps based on tracking and mapping a plurality of features included in the plurality of image frames; computing one or more loss parameters from the generated map points, camera poses, and depth maps; and refining a trained depth estimation model based on the one or more loss parameters, wherein the refined depth estimation model predicts depth maps from the image sequence.

[0011] In another aspect, a system for refining a depth estimation model is provided. The system comprises a memory configured to store instructions and one or more processors communicably coupled to the memory. The one or more processors are configured to execute the instructions to receive a plurality of image frames comprising scenes of an environment captured by a camera; execute a first module comprising an augmented Simultaneous Localization and Mapping (SLAM), wherein the SLAM is augmented by one of a trained depth network and a learning-based optical flow, the first module configured to generate camera poses, map points of the environment, and depth maps for each image frame based on executing the augmented SLAM on a plurality of image frames; execute a second module comprising an online training platform, the second module configured to compute one or more loss parameters and iteratively update the depth estimation model based on the one or more loss parameters; and generate refined depth maps for the plurality of images using the refined depth estimation model.

[0012] Other features and aspects of the disclosed technology will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which illustrate, by way of example, the features in accordance with embodiments of the disclosed technology. The summary is not intended to limit the scope of any inventions described herein, which are defined solely by the claims attached hereto.

Brief Description of the Drawings

[0013] The present disclosure, in accordance with one or more various embodiments, is described in detail with reference to the following figures. The figures are provided for purposes of illustration only and merely depict typical or example embodiments.

[0014] FIG. 1 is a schematic diagram illustrating a depth refinement system according to various embodiments disclosed herein.

[0015] FIG. 2 is a diagram illustrating an architecture for an example depth refinement system according to an embodiment disclosed herein.

[0016] FIG. 3 is a graphical comparison of camera trajectories produced by versions of the depth refinement system of FIG. 2 on the EuRoC Sequence dataset.

[0017] FIG. 4 illustrates a qualitative comparison of depth prediction performed using the depth refinement system of FIG. 2 on the EuRoC Sequence dataset.

[0018] FIG. 5 is a graphical comparison of absolute relative errors achieved using the depth refinement system of FIG. 2 and a prior art system on KITTI Sequence dataset.

[0019] FIG. 6 is a diagram illustrating an architecture for an example depth refinement system 600 according to another embodiment disclosed herein.

[0020] FIG. 7 illustrates a qualitative comparison of depth prediction performed using the depth refinement system of FIG. 6 on the EuRoC Indoor MAV dataset.

[0021] FIG. 8 illustrates an example 3D reconstruction generated using depth maps generated by the depth refinement system of FIG. 6 from the EuRoC Indoor MAV data set.

[0022] FIG. 9 illustrates an example 3D reconstruction generated using depth maps generated by the depth refinement system of FIG. 6 on the TUM-RGBD dataset.

[0023] FIG. 10 illustrates a qualitative comparison of depth prediction performed using the depth refinement system of FIG. 6 on the TUM-RGBD dataset.

[0024] FIG. 11 illustrates an example 3D reconstruction generated using depth maps generated by the depth refinement system of FIG. 6 from the ScanNet dataset.

[0025] FIG. 12 illustrates additional example 3D reconstructions generated using depth maps generated by the depth refinement system of FIG. 6 from the TUM-RGBD dataset.

[0026] FIG. 13 is an example computing component that may be used to implement various features of embodiments described in the present disclosure.

[0027] These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided there.

Detailed Description

[0028] Embodiments of the systems and methods disclosed provide for a depth prediction model that is incrementally refined using a self-supervised monocular depth network.

[0029] As described above, existing geometric and deep learning-based methods for 3D reconstruction of an environment require pre-computing and storage of sparse map points and camera poses from SfM or SLAM in an offline manner, which may not be applicable to some online applications where data pre-processing is not possible. That is, for example, while the SfM or SLAM may be performed online, such processing is performed prior to the 3D reconstruction. The sparse map points and camera poses must be pre-computed and stored, then read in to perform the 3D reconstruction. For example, after an agent (e.g., a robotic system or other image sensor-based system) is deployed into an environment, it may be preferential for the agent to automatically improve its 3D perception capability while moving around the environment. In such a scenario, the pre-stored methods would not be ideal, and the online learning methods disclosed herein are more desirable. For example, sparse points and camera poses can be tracked and processed for mapping in an online environment, while online learning is performed to refine and improve the mapping model.

[0030] Accordingly, embodiments herein provide for methods and systems that combine mapping techniques (e.g., SLAM and/or SfM) with a trained depth model that is refined using an online learning scheme. SLAM and SfM may be used interchangeably, and therefore will be referred to herein as SLAM. However, embodiments herein may be implemented using either SLAM and/or SfM as desired. The depth model can be any model that has been trained either with a self-supervised method or in a supervised fashion using either monocular or stereo images. The depth model is then incrementally refined using a test image sequence in an online platform, which may achieve geometrically consistent depth predictions over the entire image sequence. SLAM is an example mapping technique that can be implemented as an online system that fits the framework of embodiments disclosed herein.

[0031] Embodiments herein provide for a monocular mapping technique that is augmented via deep learning techniques, such as a learning-based depth prediction (e.g., a trained depth network) and/or a learning-based optical flow. The augmented mapping technique processes images from an image sequence (e.g., a sequential collection of images) and outputs prediction data, such as depth predictions (e.g., depth maps), map points, camera poses, and/or pose graphs. The output prediction data is then fed into a training platform that computes loss parameters. The loss parameters may include one or more self-supervised losses (e.g., photometric losses, map point losses, and edge-aware depth smoothness losses) to refine the depth prediction model. Additionally, alone or in combination with the above noted self-supervised losses, loss parameters may also include an occlusion-aware depth consistency loss configured to enforce temporal depth consistency while considering the occlusion regions. Furthermore, the embodiments disclosed herein provide for a keyframe mechanism configured to expedite the model convergence in the early stage of online refinement. The refined model may then be used to process input images and provide refined depth maps that may be fused together to generate reconstructions of the environment in which the images were captured.

[0032] In an example embodiment, SLAM is augmented with a learning-based depth prediction, such as a trained depth network (e.g., convolutional neural network (CNN) according to various embodiments), to bootstrap the performance of the SLAM. Embodiments disclosed herein are not limited to only CNN, but any network architectures desired may be used, for example, recent Transformer based models. The SLAM sparse map points, depth maps, camera poses, and pose graphs are then fed into an online training platform to refine the depth network. The refined depth network can be used to output depth maps that are fused together to provide a 3D reconstruction of an environment. In view of a potential scale shift between the SLAM and depth predictions (such as those from a depth CNN), embodiments herein provide for scale self-correction using map points output from the SLAM. An online training platform disclosed herein may compute loss parameters that include one or more self-supervised losses (e.g., photometric losses, map point losses, and edge-aware depth smoothness losses) and/or occlusion-aware depth consistency loss to refine the depth network. Additionally, alone or in combination with the above noted self-supervised losses, a keyframe mechanism in the form of a keyframe map-point memory revisiting strategy may be employed that is configured to expedite the model convergence in the early stage of online refinement.

[0033] While SLAM may be applicable to online systems, front-end tracking of SLAM may fail under challenging conditions, such as but not limited to fast motion and large rotations (e.g., as may occur in an indoor environment). To facilitate a robust system, another example embodiment disclosed herein may enhance the robustness of the monocular mapping technique by incorporating a learning-based optical flow method. An example of such a method is a recurrent all-pairs field transforms (RAFT) optical flow (also referred to as a RAFT-flow), which is both robust and accurate in a wide range of unseen scenes. While embodiments herein may be described with reference to a RAFT-flow, any deep learning based flow method may be used, for example but not limited to, a global motion aggregation method. Embodiments disclosed herein may also include a learning module configured to refine the depth prediction model with one or more loss parameters (e.g., photometric losses, map point losses, and edge-aware depth smoothness losses) and/or an occlusion-aware depth consistency loss. Additionally, based on careful analysis of failure cases of self-supervised refinement, embodiments disclosed herein may include an effective keyframe mechanism in the form of a keyframe selection strategy configured to make sure that no refinement step worsens the depth results.
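As a simple illustration of how a learned dense flow could be incorporated into front-end tracking, the following Python sketch (NumPy only; the dense flow field is assumed to be predicted elsewhere by a RAFT-style network) propagates feature points from the preceding frame into the current frame to form correspondence pairs that a SLAM front end could consume. The function name and array conventions are illustrative assumptions, not part of the disclosed system.

    import numpy as np

    def propagate_keypoints_with_flow(keypoints_prev, flow):
        # keypoints_prev: (N, 2) array of (x, y) feature locations in frame t-1.
        # flow: (H, W, 2) dense optical flow from frame t-1 to frame t, e.g. the
        # output of a RAFT-style network (assumed to be computed upstream).
        H, W, _ = flow.shape
        xs = np.clip(np.round(keypoints_prev[:, 0]).astype(int), 0, W - 1)
        ys = np.clip(np.round(keypoints_prev[:, 1]).astype(int), 0, H - 1)
        keypoints_curr = keypoints_prev + flow[ys, xs]   # displaced locations in frame t
        in_bounds = ((keypoints_curr[:, 0] >= 0) & (keypoints_curr[:, 0] < W) &
                     (keypoints_curr[:, 1] >= 0) & (keypoints_curr[:, 1] < H))
        # Correspondence pairs handed to the SLAM front end for tracking and mapping.
        return keypoints_prev[in_bounds], keypoints_curr[in_bounds]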

[0034] As used herein, reference to online training or online may refer to a network-based infrastructure, for example, one or more servers and one or more databases resident on a network accessible via a wireless network. Various embodiments disclosed herein provide for methods executed on, and systems in, a cloud-based infrastructure, for example, one or more cloud servers and one or more cloud-based databases resident on a network. For example, embodiments disclosed herein may receive one or more images as a sequence of images from a network edge device, transmitted to one or more servers and one or more databases configured to execute the functions disclosed herein. Servers and/or cloud-based servers may be any computational server.

[0035] It should be noted that the terms "optimize," "optimal", "refined", "improved", and "improvement" and the like as used herein can be used to mean making or achieving performance as effective or perfect as possible. However, as one of ordinary skill in the art reading this document will recognize, perfection cannot always be achieved. Accordingly, these terms can also encompass making or achieving performance as good or effective as possible or practical under the given circumstances or making or achieving performance better than that which can be achieved with other settings or parameters.

A. Technology Overview

1. SLAM

[0036] SLAM is an online geometric system that reconstructs a 3D map consisting of 3D points and simultaneously localizes camera poses, corresponding to images processed by SLAM, with respect to the reconstructed 3D map. SLAM systems can be roughly classified into two categories: (i) direct SLAM, which directly minimizes photometric error between adjacent image frames and optimizes the geometry using semi-dense measurements; and (ii) feature-based (in-direct) SLAM, which extracts and tracks a set of sparse feature points extracted from image frames and then computes the geometry in the back-end using these sparse measurements. Geometric SLAM systems are accurate and robust due to a number of techniques, including robust motion estimation, keyframe procedures, bundle adjustment, and pose-graph optimization.

[0037] Embodiments disclosed herein may utilize any SLAM system and/or SLAM techniques known in the art. For example, various embodiments use a feature-based SLAM system, such as ORB-SLAM (or its variations, such as ORB-SLAM2, ORB-SLAM3, which will be collectively referred to herein as ORB-SLAM), which is a keyframe and feature based monocular SLAM system. ORB-SLAM may deliver accurate 3D reconstructions and may support multiple image sensor modes and types. For example, ORB-SLAM supports both monocular and RGB-D (Red-Green-Blue-Depth) modes, which may provide for seamless integration with depth networks to bootstrap the SLAM.

[0038] Other example SLAM systems include, but are not limited to: CNN-SLAM, which is a hybrid SLAM system that uses CNN depth to bootstrap back-end optimization of sparse geometric SLAM and helps recover a metric scale for 3D reconstruction; DROID-SLAM, which builds SLAM from scratch with a deep learning framework and achieves accuracy in camera poses, but does not have the functionality of dense mapping; CodeSLAM, which is a real-time learning-based SLAM system that optimizes a compact depth code over a conditional variational auto-encoder (VAE) and simultaneously performs dense mapping; DeepFactor, which extends CodeSLAM by using fully-differentiable factor-graph optimization; CodeMapping, which further improves over CodeSLAM via introducing a separate dense mapping thread to ORB-SLAM3 and additionally conditioning the VAE on sparse map points and reprojection errors. Various embodiments disclosed herein may have similarity to one or more of these above SLAM systems (e.g., CodeMapping) in terms of functionalities, but the embodiments herein are significantly different in system design and far more accurate in dense mapping.

[0039] Conventional SLAM methods include a front-end processing and a back-end processing. The front-end processing receives sensor information, such as images or other sensor information, to perform local map point tracking. The back-end processing uses the information from the front-end to compute camera poses and map points from which the back-end can compute an environment geometry and pose graph. For example, in feature-based geometric SLAM techniques, the front-end extracts a set of sparse feature points from image frames and tracks the feature points across each image frame in a sequence. The back-end then computes the geometry and poses using these sparse measurements to output map points of the geometry, depth maps, and camera poses.

2. Monocular Depth Estimation

[0040] Supervised depth estimation methods dominated early approaches. An early deep learning-based method predicted depth maps via a CNN and introduced a set of depth evaluation metrics that are still used today. Another method formulated depth estimation as a continuous conditional random field (CRF) learning problem. Yet another method leveraged a deep ordinal regression loss to train the depth network. A few other methods combine depth estimation with additional tasks, e.g., pose estimation and surface normal regression.

[0041] Self-supervised depth estimation has recently become popular. Early self-supervised methods applied photometric loss between left-right stereo image pairs to train a monocular depth model in an unsupervised/self-supervised way. Another method introduced a pose network to facilitate using a photometric loss across neighboring temporal images. Later self-supervised methods were proposed to improve the photometric self-supervision. Some methods leverage an extra flow network to enforce cross-task consistency. A few others employ new loss terms during training. For example, one method used a temporal depth scale consistency term to induce consistent depth predictions. A second method observed that the depth tends to converge to small values in the monocular setup and used a simple normalization method to regularize the depth. New network architectures have also been introduced to improve the performance. Along this line, recurrent networks were exploited in the pose and/or depth networks. A recent method (e.g., the Monodepth2 method) provided various fine-tuning strategies, including a per-pixel minimum photometric loss, an auto-masking strategy, and a multi-scale framework. Embodiments disclosed herein implement a self-supervised online depth refinement that is compatible with several of the above systems but adds several important modifications in terms of training losses and the training strategy, leading to significant improvements over prior methods.

[0042] Instead of using ground-truth depths, some methods obtain training depth data from off-the-shelf SfM or SLAM. For example, one method performed 3D reconstruction of internet photos via geometric SfM and then used the reconstructed depths to train a depth network. Another method learned the depths of moving people by watching and reconstructing frozen people. Yet another method improved generalization performance of the depth model by training the depth network with various sources, including ground-truth depths and geometrically reconstructed ones. A further method formed a self-improving loop with SLAM and a depth model to improve the performance of each one. One method handled moving objects by unrolling scene flow prediction. Another method adopted a test-time fine-tuning strategy to enforce geometric consistency using COLMAP outputs. Another approach bypassed the need of running COLMAP via the use of deformation splines to estimate camera poses. Most of these methods require a pre-processing step to compute and store 3D reconstructions. In contrast, embodiments disclosed herein are performed in an online manner without a need of performing offline 3D reconstruction.

3. Online Learning

[0043] Online learning remains a less explored area in the context of self-supervised depth estimation. One exception is a method that proposes an online meta-learning algorithm for online adaptation of visual odometry (VO) in a self-supervised manner. This method is implemented based on an SfM learner with a depth network and a pose network, but only the performance of the pose network is evaluated. While this prior method showed improved pose results over previous learning-based VO methods, its performance is still far behind that of a geometric SLAM method, such as ORB-SLAM. Accordingly, embodiments disclosed herein make the best of geometric methods and deep-learning methods in a way that accurate camera poses are obtained via long-studied geometric SfM or SLAM and high-quality dense depths are generated via learning-based models. This goal is achieved in the embodiments by the SLAM-guided incremental depth refinement system described below.

[0044] Another related area is incremental learning for supervised classification tasks. An essential problem faced by incremental learning is the catastrophic forgetting issue, i.e., the neural network tends to forget what it has learned in previous classes or tasks. In the self-supervised online learning according to embodiments disclosed herein, instead of countering catastrophic forgetting, a more important aspect is how to expedite the convergence of the depth model refinement on test sequences. To this end, a simple memory revisiting training strategy is implemented in the embodiments disclosed herein to help the model converge faster than prior art methods at the early stage of refinement.

B. Depth Refinement System Overview

[0045] FIG. 1 is a schematic diagram illustrating a depth refinement system 100 according to various embodiments disclosed herein. The depth refinement system 100 is configured to create and refine (e.g., teach) a depth prediction model to generate geometrically-consistent dense mapping from monocular image sequences, which can be used to create a reconstruction (e.g., 2D or 3D) of an environment. The system may be implemented using, for example, one or more processors and memory elements such as, for example, computer system 1300 of FIG. 13. Embodiments of depth refinement system 100 may also be referred to as GeoRefine according to some implementations.

[0046] Embodiments of the depth refinement system 100 include an augmented SLAM module 110 configured to determine map points and camera poses 112 for a sequence of monocular images frames 114 (e.g., an augmented monocular geometric SLAM). The SLAM module 110 receives input image frames 114 as an image sequence along with a depth map 116 for each image and determines map points and camera poses 112 using the depth maps 116 and image frames 114. The depth maps 116 may be generated by a depth network from the input images 114. For example, the images 114 may be input into a depth network 124 (described below) that creates depth maps 116. The SLAM module 110 is augmented by incorporating deep learning techniques, which makes the SLAM module 110 more robust as compared to conventional SLAM techniques.

[0047] The depth refinement system 100 also includes a depth refinement module 120 configured to implement an online training platform. The depth refinement module 120 selects a set of keyframes 122 according to magnitudes of camera translations (e.g., translational movements). The keyframes 122 are a number of image frames selected from the image frames 114. The selected keyframes 122, together with map points and camera poses 112 from SLAM 110, are input into a depth network 124 (sometimes referred to herein as a depth estimation network or depth estimation model), which is continuously updated to produce refined depth maps 126 for each image frame 114. The depth refinement module 120 executes a training platform to compute self-supervised losses on the selected keyframes 122. The various losses computed by the depth refinement module 120 may be computationally intensive and not currently feasible on edge devices. Thus, an online platform may be beneficial in providing the computational resources necessary to efficiently compute the learning parameters (e.g., losses and keyframe mechanism). However, as edge devices (e.g., edge computing systems such as, but not limited to, personal computers, local servers, etc.) become more powerful in terms of computing resources, the depth refinement module 120 and/or one or more of the remaining modules of the depth refinement system 100 may be executed at one or more edge devices. The depth network, according to some embodiments, may be a depth CNN. Then, a globally consistent dense map 132 can be reconstructed from the refined depth maps 126, for example, using a global dense mapping module 130 executing any fusion method known in the art (e.g., the truncated signed distance function (TSDF) fusion method, the bundle fusion method, and the like). The global dense mapping module 130 is an optional module and may be left out of the depth refinement system 100 in various embodiments. For example, the augmented SLAM module 110 and the depth refinement module 120 may be executed to generate refined depth maps that may be provided to another entity, which may be configured to process the refined depth maps as desired.
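The following Python-style sketch shows one possible top-level loop tying these modules together; every object and method name (slam.track_and_map, slam.is_keyframe, compute_losses, fuser.integrate, and so on) is a hypothetical stand-in for the corresponding block of FIG. 1, not an actual API of the disclosed system.

    def run_online_refinement(frames, depth_net, slam, optimizer, compute_losses, fuser=None):
        # frames: monocular image sequence 114; depth_net: trained depth network 124;
        # slam: augmented SLAM module 110; compute_losses: self-supervised losses of
        # the depth refinement module 120; fuser: optional global dense mapping 130.
        refined_depths = []
        for i, frame in enumerate(frames):
            depth = depth_net(frame)                             # depth map 116 for this frame
            pose, map_points = slam.track_and_map(frame, depth)  # camera pose and map points 112
            if slam.is_keyframe(i):                              # keyframe 122 selected by camera translation
                snippet = slam.frame_pose_snippet(i)             # synchronized image-pose pairs
                loss = compute_losses(depth_net, snippet, map_points, pose)
                optimizer.zero_grad()
                loss.backward()                                  # one online refinement step
                optimizer.step()
            refined = depth_net(frame).detach()                  # refined depth map 126
            refined_depths.append(refined)
            if fuser is not None:
                fuser.integrate(refined, pose)                   # e.g., TSDF fusion into dense map 132
        return refined_depths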

[0048] The augmented SLAM module 110 and depth refinement module 120 may be executed in parallel (e.g., simultaneously) and/or in series (e.g., first executing the augmented SLAM module 110 followed by the depth refinement module 120).

[0049] Embodiments disclosed herein utilize images or image frames captured from cameras (e.g., visible light cameras, IR cameras, thermal cameras, ultrasound cameras, and other cameras) or other image sensors configured to capture video as a plurality of image frames and/or static images of an environment. In various embodiments, images may be captured by monocular image sensors configured to capture monocular videos as a plurality of image frames, each containing a different scene of the environment, in the form of monocular images. As described herein, a "monocular image" is an image from a single (e.g., monocular) camera, and encompasses a field-of-view (FOV) or a scene of a portion of the surrounding environment (e.g., a subregion of the surrounding environment). For example, as the image sensor progresses through an environment, perspectives of objects and features in the environment change, and the depicted objects/features themselves also change, thereby depicting separate scenes of the environment (e.g., particular combinations of objects/features). A monocular image may not include any explicit additional modality indicating depth nor any explicit corresponding image from another camera from which the depth can be derived (e.g., no stereo image sensor pair). In contrast to a stereo image, which may integrate left and right images from separate cameras mounted side-by-side to provide an additional depth channel, a monocular image does not include explicit depth information such as disparity maps derived from comparing the stereo images pixel-by-pixel. Instead, a monocular image may implicitly provide depth information in the relationships of perspective and size of elements depicted therein. The monocular image may be of a forward-facing (e.g., the direction of travel), 60-degree FOV, 90-degree FOV, 120-degree FOV, a rear/side facing FOV, or some other subregion based on the positioning and characteristics of the image sensor.

[0050] In some embodiments, the images or image frames processed by the depth refinement system 100 may be acquired directly or indirectly from a camera. For example, images from a camera may be fed via a wired or wireless connection to the depth refinement system 100 and processed in real-time. In another example, images may be stored in a memory and retrieved for processing. In some examples, one image may be processed in real-time as captured by the camera, while a second image may be retrieved from storage.

[0051] The depth refinement system 100 may be hosted on a server 150 implemented, for example, as the computer system 1300 of FIG. 13. The server 150 may be configured to execute modules 110 and 120, which may be stored in a database or other storage device communicably coupled to the server 150. The server 150 may comprise dedicated servers, or may instead comprise cloud instances, which utilize shared resources of one or more servers. These servers or cloud instances may be collocated and/or geographically distributed. The server may also comprise or be communicatively connected to one or more databases. Any suitable database may be utilized, including cloud-based database instances and proprietary databases. In addition, server 150 may be communicatively connected to one or more edge devices via one or more networks 140. Edge devices may be any device comprising an image sensor or camera configured to capture the image sequence 114. The image sequence may be communicated to the server 150 via network 140. While server 150 is shown in FIG. 1 as configured to execute each of modules 110, 120 and 130, as noted above, module 130 is optional and may not be included in server(s) 150. That is, for example, another server remote from server 150 may be configured to receive refined depth maps from module 120 and then perform global dense mapping on the refined depth maps.

[0052] Network 140 may comprise the Internet, and server 150 may communicate with edge devices through the Internet using standard transmission protocols. The network 140 may be any network known in the art, for example but not limited to, wireless cellular networks, local area network (LAN), wide area network (WAN), and the like.

[0053] Provided below are embodiments of the depth refinement system 100. For example, the following description in connection with FIGS. 2-5 is directed to an example embodiment of the depth refinement system 100 including an augmented SLAM module 110 that incorporates a deep learning-based depth prediction (e.g., a depth network augmented SLAM) and the depth refinement module 120 that incrementally refines the depth prediction model with one or more self-supervised losses based on the output from the augmented SLAM. The description in connection with FIGS. 6-11 is directed to another example embodiment of the depth refinement system 100 including an augmented SLAM module 110 that incorporates a learning-based optical flow (e.g., a RAFT-flow augmented SLAM) in the front end of the SLAM and an online refinement module 120 that incrementally refines the depth prediction model with self-supervised losses based on the output from the optical flow augmented SLAM.

C. Learning Based Depth Prediction Augmented SLAM Embodiments

[0054] Monocular visual SLAM and learning-based depth prediction may be complementary to each other. For example, monocular SLAM may be accurate in reconstructing sparse map points and camera poses, albeit for regions with rich textures. Learning-based depth prediction, in contrast, can output a dense depth map even in textureless regions by leveraging the prior ground truth and/or stereo information in the training data, but tends to have larger depth errors. In light of this, embodiments herein combine SLAM and a self-supervised monocular depth network to incrementally refine the depth prediction model. These embodiments use a dense depth prediction from a trained depth estimation network or model (such as a depth CNN or the like) executed on monocular images to bootstrap the SLAM. Simultaneously, embodiments employ the augmented SLAM outputs (e.g., map points, camera poses, depth maps, etc.), as output prediction data, to refine the depth model via an online training platform that computes self-supervised losses. The online learning also may use one or more of (i) an occlusion-aware depth consistency loss to enforce temporal consistency and (ii) a map-point memory revisiting strategy to expedite the convergence of training a model.

1. Methodology and Architecture

[0055] FIG. 2 is a diagram illustrating an architecture for an example depth refinement system 200 according to an embodiment disclosed herein. The depth refinement system 200 is configured to create and refine a depth prediction model to generate geometrically-consistent dense mapping from monocular image sequences. Depth refinement system 200 is an example implementation of depth refinement system 100 as set forth above. The system may be implemented using, for example, one or more processors and memory elements such as, for example, computer system 1300 of FIG. 13. As described above in connection with FIG. 1, the depth refinement system 200 may be hosted on a server, which may comprise dedicated server infrastructure and/or cloud-based infrastructure.

[0056] The depth refinement system 200 may be configured to execute a geometric SLAM-guided monocular depth refinement method via self-supervised online learning. As shown in FIG. 2, the depth refinement system 200 comprises an augmented SLAM module 210 and a depth refinement module 230. The augmented SLAM module 210 and depth refinement module 230 may be executed in parallel (e.g., simultaneously) and/or in series (e.g., first executing the augmented SLAM module 210 followed by the depth refinement module 230). In various embodiments, the augmented SLAM module 210 comprises a depth network augmented SLAM, where the depth network is configured to perform learning-based depth predictions. For example, a sequence of monocular images 202 is fed into a trained depth network 204 via network 240. The trained depth network 204 (e.g., a depth estimation network or model) outputs a predicted depth map 206 as depth values for each pixel of the input image. The depth network 204 may be a CNN depth network (or the like) and the depth map 206 may include CNN depth values.

[0057] The augmented SLAM module 210 and depth refinement module 230 may be similar to the augmented SLAM module 110 and the depth refinement module 120 of FIG. 1, respectively, except as provided herein. Furthermore, while the embodiments herein are described with reference to a CNN augmented SLAM, embodiments herein are not limited to augmentation via a CNN only. Instead, any learning-based depth network configured to predict depth values from an image sequence may be applied as desired.

[0058] As an illustrative example, the depth network 204 may include a set of neural network layers including convolutional components (e.g., 2D convolutional layers forming an encoder 108) that flow into decoder layers (e.g., 2D convolutional layers with upsampling operators forming a decoder 110). The encoder accepts an image (e.g., one of images 202), as an input and processes the image to extract features therefrom (e.g., feature representations). The features may be aspects of the image that are indicative of spatial information that the image intrinsically encodes. As such, encoding layers that form the encoder function to, for example, fold (i.e., adapt dimensions of a feature map to retain the feature representations included in the feature map) encoded features into separate channels, iteratively reducing spatial dimensions of the image while packing additional channels with information about embedded states of the features. Thus, the addition of the extra channels avoids the lossy nature of the encoding process and facilitates the preservation of more information (e.g., feature details) about the original monocular image.

[0059] The encoder comprises multiple encoding layers formed from a combination of two-dimensional (2D) convolutional layers, packing blocks, and residual blocks. Moreover, the separate encoding layers generate outputs in the form of encoded feature maps (also referred to as tensors), which the encoding layers provide to subsequent layers in the depth network 204. As such, the encoder may include a variety of separate layers that operate on the image, and subsequently on derived/intermediate feature maps that convert the visual information of the image into embedded state information in the form of encoded features of different channels.

[0060] The decoder may unfold (e.g., adapt dimensions of the tensor to extract the features) the previously encoded spatial information in order to derive a depth map 206 for the image according to learned correlations associated with the encoded features. For example, the decoding layers may function to up-sample, through sub-pixel convolutions and other mechanisms, the previously encoded features into the depth map 206, which may be provided at different resolutions. In some embodiments, the decoding layers comprise unpacking blocks, two-dimensional convolutional layers, and inverse depth layers that function as output layers for different scales of the feature/depth map. The depth map may be a data structure corresponding to the input image that indicates distances/depths to objects/features represented therein. Additionally, in various embodiments, the depth map 206 may be a tensor with separate data values indicating depths for corresponding locations in the image on a per-pixel basis.

[0061] The depth network 204 may further include skip connections for providing residual information between the encoder and the decoder to facilitate memory of higher-level features between the separate components. While the depth network 204 is discussed as being a CNN depth network, as previously noted, the depth network 204, in various approaches, may take different forms and generally functions to process the monocular images and provide depth maps that are per-pixel estimates about distances of objects/features depicted in the images.
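For illustration only, the following PyTorch sketch shows a deliberately small encoder-decoder with skip connections that outputs a per-pixel inverse-depth map; the layer widths, ELU activations, and sigmoid head are assumptions chosen for brevity and are not the architecture of depth network 204.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyDepthNet(nn.Module):
        """Minimal encoder-decoder with skip connections producing a per-pixel
        inverse-depth map in (0, 1)."""

        def __init__(self):
            super().__init__()
            self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ELU())
            self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ELU())
            self.enc3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ELU())
            self.dec3 = nn.Sequential(nn.Conv2d(128, 64, 3, padding=1), nn.ELU())
            self.dec2 = nn.Sequential(nn.Conv2d(64 + 64, 32, 3, padding=1), nn.ELU())
            self.dec1 = nn.Sequential(nn.Conv2d(32 + 32, 16, 3, padding=1), nn.ELU())
            self.head = nn.Conv2d(16, 1, 3, padding=1)

        def forward(self, x):
            e1 = self.enc1(x)            # 1/2 resolution, 32 channels
            e2 = self.enc2(e1)           # 1/4 resolution, 64 channels
            e3 = self.enc3(e2)           # 1/8 resolution, 128 channels
            d3 = F.interpolate(self.dec3(e3), scale_factor=2, mode="nearest")
            d2 = F.interpolate(self.dec2(torch.cat([d3, e2], 1)), scale_factor=2, mode="nearest")
            d1 = F.interpolate(self.dec1(torch.cat([d2, e1], 1)), scale_factor=2, mode="nearest")
            return torch.sigmoid(self.head(d1))   # inverse depth in (0, 1)

    # Example conversion of the sigmoid output to a metric-style depth map:
    # depth = 1.0 / (TinyDepthNet()(torch.rand(1, 3, 192, 640)) * 10.0 + 0.01)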

1.1. Augmented SLAM Module

[0062] Monocular visual SLAM may only produce 3D map points at textured regions. To improve the performance of SLAM, one prior method fused supervised-learned depths of distant keyframes with the back-end optimization of Large Scale Direct monocular SLAM (LSD-SLAM). Another method combined LSD-SLAM with CNN depths predicted by a self-supervised model that is trained with stereo image pairs.

[0063] The augmented SLAM module 210 may use depth network predicted depths 206 as input to a SLAM, for example, as inputs into an RGB-D ORB-SLAM in a pseudo-RGBD (pRGBD) manner similar to methods known in the art. As used herein, pRGBD refers to a method of creating depth maps without using a depth sensor (e.g., LiDAR or Time-of-Flight sensor), but from a depth network. Prior art methods used a depth model trained with monocular image sequences, whereas the augmented SLAM module 210 leverages a depth model that is trained with stereo and/or monocular images such that the scale information can be retained in the model. Compared to monocular SLAM, the augmented SLAM module 210 has the following advantages: (i) network predicted depths (particularly CNN depths) enable the system to initialize instantly regardless of camera motions, in contrast to original monocular SLAM, which requires enough camera motion to successfully initialize; and (ii) network predicted depths (particularly CNN depths, but any depth network architecture desired may be used, for example, recent Transformer based models) make the SLAM more robust, especially in frames with large texture-less regions or fast camera motions.

[0064] The augmented SLAM module 210 includes a front-end tracking block 212 that receives the network predicted depths 206 and image sequence 202. Using the network predicted depths 206 and image sequence 202, the front-end tracking block 212 extracts sparse feature points from each image and tracks the sparse feature points across the image sequence 202. Additionally, some embodiments may also receive inertial measurement unit (IMU) data from an IMU sensor as input with the images 202, as described in greater detail below in connection with FIG. 6. Thus, some embodiments may support multiple sensor modes and provide for a minimal sensor setup. For example, using a monocular image sensor with (or without) an IMU sensor facilitates two modes: the monocular and Visual-Inertial (VI) modes. Furthermore, because a depth network 204 is used to infer the depth map for every image 202, a pRGBD mode may also be implemented. Further details with respect to the VI and pRGBD modes are provided below with respect to FIG. 6, which may be similarly implemented in the depth refinement system 200.

[0065] The augmented SLAM module 210 includes a back-end mapping block 214 configured to use the information from the front-end tracking block 212 to compute camera poses 222 and map points 224 that can be used to compute an environment geometry and pose graphs. For example, the map points 224 may include 3D coordinate information in a world coordinate system, from which the back-end computes the geometry and poses using these sparse measurements to output map points 224 of the geometry and camera poses 222. The coordinates for the map points may include depth values, for example, depth from an image sensor based on the world coordinate system. The map points 224 may be stored in a map database 216 (or other storage device).

[0066] Additionally, network predicted depths tend to suffer from a global scale shift across frames. On the other hand, geometric SLAM can help correct the scale shift by ensuring geometric consistency over a large temporal window. Thus, the augmented SLAM module 210 is configured to execute a scale self-correction method to mitigate this issue. For example, first, the SLAM runs on the image sequence for a subset of image frames from the sequence as a warm start, for example, any desired number of sequentially first image frames (e.g., 10, 20, 30, 40, 50, etc.), to permit the SLAM to generate an initial mapping of the environment. This may let the SLAM correct inconsistent network predicted depths using geometric constraints. After the warm start, the mapping block 214 computes a scale shift factor S_correction between the SLAM-determined map point depths and the network predicted depths for those map points in the current frame by:

S_correction = (1/N) Σ_{n=1}^{N} ( D_n^slam / D_n^cnn )    Eq. (1)

[0067] where N is the number of map points visible (e.g., present) in the current frame, D_n^slam is the depth predicted by the SLAM for the n-th map point in the current frame, and D_n^cnn is the corresponding network predicted depth. The network predicted depth of the next sequential frame can then be scale-corrected by multiplying by S_correction before it is input into the SLAM. This scale self-correction method may lead to a significant boost in the accuracy of camera pose estimates and improvements in depth prediction (as described in connection with FIGS. 3 and 4 and Tables 1-3 below).
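A minimal NumPy sketch of the scale self-correction of Eq. (1) is given below, assuming the factor is computed as the mean of per-map-point depth ratios; a median could be substituted for additional robustness to outliers.

    import numpy as np

    def scale_correction_factor(slam_depths, cnn_depths):
        # slam_depths, cnn_depths: length-N arrays holding, for each map point
        # visible in the current frame, the SLAM depth D_n^slam and the
        # corresponding network predicted depth D_n^cnn.
        ratios = np.asarray(slam_depths, dtype=float) / np.asarray(cnn_depths, dtype=float)
        return float(ratios.mean())

    # The next frame's network predicted depth map is then rescaled before being
    # fed back into the SLAM:
    # corrected_depth = scale_correction_factor(d_slam, d_cnn) * next_cnn_depth_map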

1.2. Depth Refinement Module

[0068] At the depth refinement module 230, a depth prediction model is incrementally refined for each incoming image frame with a training platform 235 configured to compute one or more loss parameters. The training platform 235 may comprise one or more of a photometric loss block 236, an edge-aware depth smoothness loss block 238, a map-point loss block 237, and a depth consistency loss block 239. The training platform 235, in various embodiments, is an online training platform.

[0069] To perform online learning, a data loader 231 included in the depth refinement module 230 receives map points 224 and camera poses 222 from the augmented SLAM module 210. The data loader 231 is configured to extract, read, and/or load data from both the augmented SLAM module 210 and the image sequence source (e.g., an image sensor or memory device). The data loader 231 then feeds the map points 224, camera poses 222, and pose graph 226 to frame-pose snippet block 232.

[0070] The frame-pose snippet block 232 is configured to construct frame snippets having synchronized image-pose pairs. A frame snippet is a subset of image frames 202 synchronized with corresponding poses (e.g., by time stamp data included in the metadata for camera poses 222 and image frames 202). The number of frames included in a snippet may be any number desired, for example, three, five, six, etc.
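A minimal sketch of the snippet construction follows, assuming images and SLAM poses are keyed by a shared frame index (or timestamp) and that a 6-frame window such as the one discussed below for the photometric loss is used; the function and parameter names are illustrative assumptions.

    def build_frame_pose_snippet(frames, poses, i, offsets=(-4, -3, -2, -1, 0, 1)):
        # frames: dict mapping frame index -> image; poses: dict mapping frame
        # index -> SLAM camera pose with the matching timestamp.  Returns a list
        # of synchronized (image, pose) pairs centered on the current frame i.
        snippet = []
        for off in offsets:
            j = i + off
            if j in frames and j in poses:      # keep only frames with a matching pose
                snippet.append((frames[j], poses[j]))
        return snippet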

[0071] The training platform 235 receives output from the frame-pose snippet block 232 and uses the output to determine one or more self-supervised losses.

[0072] For example, the photometric loss block 236 may be configured to determine a photometric loss (L_p) for the frame snippet from the frame-pose snippet block 232. The photometric loss may be defined as a difference between a target frame I_i and a synthesized frame I_{j→i} warped from a source frame I_j using the depth image and the relative pose T_{j→i}:

Eq. (2)

[0073] where pe(·) is the photometric loss function computed with an ℓ1 norm (e.g., a summation of the absolute value of each value in the target frame I_i and the synthesized frame I_{j→i}) and a structural similarity index (SSIM) (e.g., a perceptual metric that quantifies image quality degradation caused by processing such as data compression or by losses in data transmission). While some photometric loss functions use a 3-frame snippet to construct the photo-consistency, some embodiments of the depth refinement module 230 may employ a wider-baseline photometric loss by using an n-frame snippet, where n is the number of frames in the snippet. For example, the frame-pose snippet block 232 may construct a 6-frame snippet, where j ∈ A_i = {i − 4, i − 3, i − 2, i − 1, i + 1} and A_i represents the set of frames that are neighbors of the current frame I_i. The 6-frame snippet is empirically defined, and the above example is provided for illustrative purposes only. Additionally, while a 6-frame snippet is provided herein, this is for illustrative purposes only and any n-frame snippet may be used, where n is an empirically set integer. Another important difference of the depth refinement module 230 from prior art systems is that the relative pose T_{j→i} comes from the augmented SLAM module 210 (e.g., as part of camera poses 222), which is more accurate than a pose predicted by a pose network, as known in the art.
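As an illustration of the photometric term just described, the following is a minimal PyTorch sketch of pe(·) combining an ℓ1 difference with an SSIM term, evaluated over a snippet of source frames that have already been warped into the target view. The 0.85/0.15 weighting, the 3x3 SSIM window, and the per-pixel minimum over source frames follow common practice in self-supervised depth training (e.g., Monodepth2) and are assumptions of this sketch rather than values taken from Eq. (2).

    import torch
    import torch.nn.functional as F

    def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
        # Simplified SSIM over 3x3 average-pooled windows; x, y are (B, C, H, W) in [0, 1].
        mu_x = F.avg_pool2d(x, 3, 1, padding=1)
        mu_y = F.avg_pool2d(y, 3, 1, padding=1)
        sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
        sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
        sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
        num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
        den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
        return torch.clamp((1 - num / den) / 2, 0, 1)

    def photometric_error(target, synthesized, alpha=0.85):
        # pe(I_i, I_{j->i}): weighted combination of an SSIM term and an L1 term.
        l1 = torch.abs(target - synthesized).mean(1, keepdim=True)
        return alpha * ssim(target, synthesized).mean(1, keepdim=True) + (1 - alpha) * l1

    def snippet_photometric_loss(target, synthesized_list):
        # Per-pixel minimum of pe over all source frames warped into the target view.
        errors = torch.cat([photometric_error(target, s) for s in synthesized_list], dim=1)
        return errors.min(dim=1).values.mean()

In use, each synthesized frame would be produced by warping the corresponding source frame I_j into the target view using the network-predicted depth and the SLAM-provided relative pose T_{j→i}.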

[0074] The edge-aware normalized smoothness loss block 238 may be configured to determine an edge-aware normalized smoothness loss (L_s) as

Eq. (3)

[0075] where d*_j = d_j / d̄_j is the mean-normalized inverse depth, which prevents the depth scale from diminishing, d represents the depth value from the corresponding depth map D_i, and d_x and d_y represent partial derivatives, where x and y are coordinates within the corresponding depth map D_i. Note that the edge-aware normalized smoothness loss (L_s) is computed for the current depth map D_i.
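A minimal PyTorch sketch of an edge-aware, mean-normalized smoothness term of the kind described above is given below. The exponential down-weighting of the penalty at image edges follows common practice in self-supervised depth training and is an assumption of this sketch, as Eq. (3) is not reproduced in this text.

    import torch

    def edge_aware_smoothness_loss(disp, image):
        # disp:  (B, 1, H, W) inverse depth from the depth network
        # image: (B, 3, H, W) corresponding RGB frame
        # Mean-normalize to prevent the depth scale from shrinking during training.
        norm_disp = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)

        grad_disp_x = torch.abs(norm_disp[:, :, :, :-1] - norm_disp[:, :, :, 1:])
        grad_disp_y = torch.abs(norm_disp[:, :, :-1, :] - norm_disp[:, :, 1:, :])

        grad_img_x = torch.mean(torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]), 1, keepdim=True)
        grad_img_y = torch.mean(torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]), 1, keepdim=True)

        # Edge-aware weights: strong image gradients suppress the smoothness penalty.
        grad_disp_x = grad_disp_x * torch.exp(-grad_img_x)
        grad_disp_y = grad_disp_y * torch.exp(-grad_img_y)

        return grad_disp_x.mean() + grad_disp_y.mean()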

[0076] The map points 224 may have undergone extensive optimization through bundle adjustment within module 210 (e.g., as part of known SLAM techniques in the back-end mapping block 214), so the depth values of these map points 224 may be more accurate than the corresponding network-predicted depths 206. The map point loss block 237 leverages these SLAM-predicted map-point depths to build a map-point loss as a supervision signal to the depth refinement model. The map-point loss may be the difference between the SLAM map points 224 and the corresponding CNN depths 206 as follows,

Eq. (4)

[0077] where there are N_i 3D map points 224 from the augmented SLAM module 210 for the i-th image frame after filtering as described below, D_{i,n} is the depth prediction from the depth network 204 for the n-th map point of the i-th image frame, and D_{i,n}^slam is the corresponding SLAM depth. Note that the map-point loss (L_m) is computed for the depth map D_i of the center frame (e.g., the current image frame). The filtering ensures that accurate map points are used. For example, first, the map point loss block 237 determines whether or not each map point 224 for the current image frame is observed (e.g., present) in a number of SLAM keyframes. The keyframes for this filtering task are determined by the SLAM and are not the same keyframes described below for the keyframe mechanism. While any number of keyframes may be used (e.g., 2, 3, 4, 5, 6, etc.), in various embodiments the number of keyframes may be set to 5. If the map point is not observed in the number of SLAM keyframes, then the map point is discarded. In this way, the map point loss block 237 uses map points that are found in a number of frames in the sequence, and not only a single frame. Second, the map point loss block 237 projects map points from a world coordinate system (e.g., the coordinate system of the geometry of the environment constructed by the mapping block 214) to the coordinate system of the current image frame and determines a projection error for each map point. If the projection error for a given map point exceeds an error threshold, the given map point is discarded. The error threshold may be set as desired based on acceptable tolerances; for example, the threshold may be 3 pixels, 4 pixels, 5 pixels, etc. Additionally, map points whose projected depth differs from the corresponding network-predicted depth 206 by a margin are also discarded. For example, map points are discarded if their projected depth differs from the network-predicted depth 206 by more than 20%. Other margins may be used as desired; for example, stringency may be increased by decreasing the margin (e.g., 10%) or relaxed by increasing the margin (e.g., 30%). The filtering may eliminate bad map-point-to-network-depth correspondence pairs near foreground-background boundaries.
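The following Python sketch illustrates the filtering and map-point loss just described. The field names of the map-point records, the behind-camera check, and the use of a plain absolute depth difference for the loss are assumptions of this sketch; the exact form of Eq. (4) may differ (e.g., it may be normalized by the SLAM depth).

    import numpy as np

    def filter_map_points(points, net_depth, K, T_wc,
                          min_obs=5, max_reproj_err=3.0, rel_margin=0.2):
        # points: list of dicts with 'xyz_world' (3,), 'num_keyframe_obs' (int), and
        # 'uv_obs' (2,), the observed pixel location in the current frame (names hypothetical).
        # net_depth: (H, W) network-predicted depth; K: (3, 3) intrinsics; T_wc: (4, 4) world-to-camera.
        kept = []
        for p in points:
            if p['num_keyframe_obs'] < min_obs:                   # seen in too few SLAM keyframes
                continue
            xyz_c = (T_wc @ np.append(p['xyz_world'], 1.0))[:3]
            if xyz_c[2] <= 0:                                     # behind the camera
                continue
            uv = (K @ xyz_c)[:2] / xyz_c[2]
            if np.linalg.norm(uv - p['uv_obs']) > max_reproj_err:  # projection error too large
                continue
            u, v = int(round(uv[0])), int(round(uv[1]))
            if not (0 <= v < net_depth.shape[0] and 0 <= u < net_depth.shape[1]):
                continue
            d_net = net_depth[v, u]
            if abs(xyz_c[2] - d_net) > rel_margin * xyz_c[2]:     # network/SLAM depth disagreement
                continue
            kept.append((d_net, xyz_c[2]))
        return kept

    def map_point_loss(kept):
        # Absolute difference between network and SLAM depths over the surviving points
        # (the exact form of Eq. (4) may differ, e.g., it may be normalized).
        if not kept:
            return 0.0
        d_net, d_slam = np.asarray(kept, dtype=np.float64).T
        return float(np.mean(np.abs(d_net - d_slam)))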

[0078] In addition to the loss terms above, embodiments herein implement an occlusion-aware depth consistency loss block 239 configured to determine an occlusion-aware depth consistency loss, and a keyframe revisit block 234 configured to execute a keyframe memory revisiting strategy to build onto the online depth refinement pipeline.

1.2.1. Occlusion-Aware Depth Consistency Loss

[0079] The occlusion-aware depth consistency loss block 239 receives, via the data loader 231, depth maps predicted by the depth network 204 and relative poses T = [R|t] included in the camera poses 222 from the augmented SLAM module 210.

[0080] Given the depth maps of two adjacent image frames, e.g., D_i and D_j, and their relative pose T = [R|t], the occlusion-aware depth consistency loss block 239 builds a robust consistency loss between D_i and D_j to make the depth predictions at times i and j consistent with each other. Note that the depth values at corresponding positions of frames i and j are not necessarily equal, as the image sensor used to capture I_i and I_j moves over time.

[0081] For a pixel x_i at frame i, its correspondence with frame j can be computed as follows:

Eq. (5)

[0082] where π(·) is a projection function that maps 3D coordinates [x, y, z]^T to [x/z, y/z]^T, R_{i→j} and t_{i→j} are the rotation (R) and translation (t) from frame i to frame j, x̃_i is the homogeneous form of x_i, and K represents the image sensor intrinsics, which are the same for each image frame of a given image sequence. Since x_{i→j} is non-integer, a bilinear sampling (e.g., an inverse warp) is applied to obtain the depth value for map point x_i in frame j, e.g., D_j⟨x_{i→j}⟩, where ⟨·⟩ denotes the bilinear sampling operator. Further, D_j⟨x_{i→j}⟩ is transferred to the coordinate system of frame i to construct the consistency loss

Eq. (6)

[0083] X_{i→j→i} represents a set of 3D coordinates. The corresponding depth is then the third element of X_{i→j→i}, denoted by [X_{i→j→i}]_3. The initial depth consistency loss is defined as

Eq. (7)

[0084] The depth consistency loss determined in Eq. (7) may include pixels in occluded regions, which may be detrimental to the depth refinement model. To effectively handle such occlusions, the occlusion-aware depth consistency loss block 239 performs a per-pixel photometric loss, as known in the art. The per-pixel photometric loss is not necessarily the same loss computed by the photometric loss block 236. Following the per-pixel photometric loss, a per-pixel depth consistency loss is determined by taking the minimum of the depth consistency loss from Eq. (7), instead of an average over a set of neighboring frames:

Eq. (8)
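The following simplified PyTorch sketch walks through the geometry of Eqs. (5)-(8) for a single image pair: back-project frame i's depth, transform to frame j, bilinearly sample D_j, transform back, and compare the recovered depth with D_i; occlusions are then handled by the per-pixel minimum over neighboring frames. Tensor layouts, helper names, and the omission of the auxiliary per-pixel photometric check are simplifications of this sketch.

    import torch
    import torch.nn.functional as F

    def backproject(depth, K_inv):
        # depth: (1, 1, H, W) -> 3D points (1, 3, H*W) in the camera frame of that depth map.
        _, _, h, w = depth.shape
        ys, xs = torch.meshgrid(torch.arange(h, dtype=depth.dtype, device=depth.device),
                                torch.arange(w, dtype=depth.dtype, device=depth.device),
                                indexing='ij')
        pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(1, 3, -1)  # homogeneous pixels
        return (K_inv @ pix) * depth.reshape(1, 1, -1)                         # scale rays by depth

    def depth_consistency(D_i, D_j, K, T_ij):
        # |1 - D_tilde_i / D_i| per pixel, where D_tilde_i is D_j bilinearly sampled at the
        # projected locations and transformed back into frame i's coordinate system.
        _, _, h, w = D_i.shape
        K_inv = torch.inverse(K)
        R, t = T_ij[:3, :3], T_ij[:3, 3:]

        X_i = backproject(D_i, K_inv)                     # pixels of frame i lifted to 3D
        X_j = R @ X_i + t                                 # moved into frame j
        proj = K @ X_j
        uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)    # x_{i->j}, non-integer pixel coordinates

        # Bilinear sampling of D_j at the projected locations (inverse warp).
        grid = uv.reshape(1, 2, h, w).permute(0, 2, 3, 1).clone()
        grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
        grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
        D_j_sampled = F.grid_sample(D_j, grid, align_corners=True)

        # Lift the sampled depth back to 3D in frame j, then return to frame i (X_{i->j->i}).
        rays_j = K_inv @ torch.cat([uv, torch.ones_like(uv[:, :1])], dim=1)
        X_back = R.transpose(0, 1) @ (rays_j * D_j_sampled.reshape(1, 1, -1) - t)
        D_tilde_i = X_back[:, 2].reshape(1, 1, h, w)      # third element, the recovered depth

        return torch.abs(1.0 - D_tilde_i / D_i.clamp(min=1e-6))

    def occlusion_aware_consistency(D_i, neighbor_depths, K, poses_ij):
        # Per-pixel minimum over neighboring frames instead of an average, to reduce the
        # influence of occluded regions.
        losses = [depth_consistency(D_i, D_j, K, T) for D_j, T in zip(neighbor_depths, poses_ij)]
        return torch.cat(losses, dim=1).min(dim=1).values.mean()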

1.2.2. Keyframe Memory Revisiting

[0085] At the initial stage of depth refinement for a test image sequence, the depth prediction network or model 204 does not converge fast enough to adapt to the current environment, especially when there is a large domain gap between the training and testing sequences. For example, if the depth refinement system 200 (e.g., the depth network 204) is trained on an outdoor image sequence (or multiple different sequences) and then tested and refined on an indoor image sequence, then a large domain gap exists. Similarly, a larger domain gap may exist where translational and/or rotational motion in the training image sequence is significantly less than that in the test-and-refine image sequence. To expedite the model convergence, the depth refinement module 230 includes a memory revisiting strategy to periodically fine-tune the network with the map points from previous keyframes before applying the training platform 235 to the current frame. Keyframe map points are chosen for memory revisiting because they are light-weight (e.g., more memory efficient than the photometric loss) and highly optimized through geometric constraints in local or global bundle adjustment. The revisiting strategy includes constructing the map-point loss (e.g., as in Eq. (4) above) using the most recent N_key keyframes and the corresponding map points, instead of the entire image sequence used above. As an example, the most recent N_key keyframes may be the sequentially most recent 20 frames. However, N_key may be any number desired for revisiting. This strategy may make the model converge much faster and thus leads to better depth refinement.

[0086] The memory revisiting is provided to make full use of the keyframe map points to achieve fast convergence, whereas prior art systems use incremental learning to mitigate the catastrophic forgetting issue.
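A minimal Python sketch of the keyframe memory revisiting strategy is shown below. The storage layout (a dictionary keyed by frame ID, consistent with the bookkeeping described in the implementation details below) and the interfaces of the depth model, optimizer, and map-point loss function are assumptions made for illustration.

    from collections import OrderedDict

    class KeyframeMemory:
        # Stores SLAM keyframe poses and map points, indexed by frame ID.
        def __init__(self, n_key=20):
            self.n_key = n_key
            self.store = OrderedDict()               # frame_id -> (pose, map_points)

        def add(self, frame_id, pose, map_points):
            self.store[frame_id] = (pose, map_points)

        def recent(self):
            # The most recent n_key keyframes used for memory revisiting.
            return list(self.store.items())[-self.n_key:]

    def revisit_step(memory, depth_model, optimizer, map_point_loss_fn, frames):
        # One gradient step on the map-point loss over the revisited keyframes.
        items = memory.recent()
        if not items:
            return
        optimizer.zero_grad()
        loss = 0.0
        for frame_id, (pose, map_points) in items:
            depth = depth_model(frames[frame_id])
            loss = loss + map_point_loss_fn(depth, map_points, pose)
        loss.backward()
        optimizer.step()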

1.2.3. Overall Refinement Strategy

[0087] The overall refinement loss (L) from training platform 235 may be provided as

L = L_p + λ_s · L_s + λ_m · L_m + λ_c · L_c      Eq. (9)

[0088] where λ_s, λ_m, and λ_c are the weighting parameters selected for balancing the contribution of each loss term. The weighting parameters may be tuned on a subset of image frames (e.g., a validation set), for example, by enumerating a set of candidate weights and selecting those that provide optimal results.

[0089] Given a trained depth model 204, the depth refinement system 200 first warms up the augmented SLAM module 210 without applying scale self-correction up to frame t_w, which lets the augmented SLAM module 210 achieve stable reconstructions. Frame t_w may be any empirically chosen image frame, e.g., the 50th or any other selected image frame. Then, after frame t_w, the scale self-correction is initiated. The reconstructed map points 224 and camera poses 222 are simultaneously passed via the data loader 231 to the depth refinement module 230 to execute the training platform 235. Up to frame t_key, the most recent N_key keyframes are revisited, and the map-point loss from those keyframes is used to train the depth network 204 (e.g., refine the depth network 204) for one gradient step. Then, the training platform 235 is conducted for the current frame by minimizing the loss terms in Eq. (9) and performing gradient descent for k steps, as shown in Algorithm 1 below. After that, the refined depth network can be applied to the current image frame to output a refined depth map 240 (D_i).

[0090] An example algorithm is summarized below as Algorithm 1.

Algorithm 1 GeoRefine: Geometric SLAM-Guided Depth Refinement using Self-Supervised Online Learning
1: Pretrain the depth model in a self-supervised manner with stereo and monocular images.
2: for frame i = 0, 1, ··· do
3:     Warm start CNN SLAM up to frame t_w, and apply scale self-correction after t_w;
4:     Perform one-step gradient descent via keyframe memory revisiting; ► apply only if i < t_key
5:     Apply k-step gradient descent by minimizing Eq. (9);
6:     Save the depth map D_i.
7: end for
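The control flow of Algorithm 1 can be sketched in Python as follows. The callables slam.process, estimate_scale, revisit_step, and refine_step are placeholders for the augmented SLAM module 210, an Eq. (1)-style scale correction factor, the keyframe memory revisiting step, and one gradient step on Eq. (9), respectively; their exact interfaces are assumptions of this sketch and not APIs defined in this document.

    def geo_refine(frames, slam, depth_model, estimate_scale, revisit_step, refine_step,
                   t_w=50, t_key=500, k_steps=5):
        refined_depths = []
        scale = 1.0
        for i, frame in enumerate(frames):
            depth = depth_model(frame)
            if i >= t_w:                                   # scale self-correction after warm start
                depth = depth * scale
            pose, map_points = slam.process(frame, depth)  # augmented SLAM (Algorithm 1, line 3)
            if i >= t_w:
                scale = estimate_scale(map_points, depth)  # Eq. (1)-style correction factor
            if i < t_key:
                revisit_step()                             # keyframe memory revisiting (line 4)
            for _ in range(k_steps):                       # k-step gradient descent (line 5)
                refine_step(frame, pose, map_points)
            refined_depths.append(depth_model(frame))      # save the refined depth map (line 6)
        return refined_depths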

2. Experiments

[0091] The depth refinement system 200 of FIG. 2 was evaluated on two datasets: the EuRoC dataset and the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset. Ablation studies are provided that verify the effectiveness of the depth refinement system 200. Quantitative and qualitative results on the EuRoC dataset and the KITTI dataset are provided. For quantitative depth evaluation, the standard error and accuracy metrics are employed, including the Absolute Relative (Abs Rel) error, Squared Relative (Sq Rel) error, root-mean-square error (RMSE), RMSE log, δ < 1.25 (namely δ1), δ < 1.25² (namely δ2), and δ < 1.25³ (namely δ3).

2.1. Implementation Details

[0092] The depth refinement system 200 comprises an augmented SLAM module 210 and a depth refinement module 230. As described above, in some embodiments the augmented SLAM module 210 may be based on ORB-SLAM running with RGB images and CNN depth images in an RGB-D mode. The depth refinement module 230, in some embodiments, may be based on Monodepth2 with the data loader 231 and the training platform 235. In some embodiments, Robot Operating System (ROS) may be used to exchange data between modules for cross-language compatibility. During online learning, k was set to 5 to let the depth network perform gradient descent for 5 steps. The ADAM algorithm was used as the optimizer and the learning rate was set to 1.0e-5. The weighting parameters λ_s and λ_c were set to 1.0e-3 and 1.0e-2, respectively. The weighting parameter λ_m was set to different values for the EuRoC dataset and the KITTI dataset, because the datasets have different depth ranges. For these experiments, λ_m was set to 5.0e-2 for the EuRoC dataset and to 1.0e-3 for the KITTI dataset. Batchnorm layers were frozen in the depth network because the batch size was 1 during online training.
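A minimal PyTorch sketch of the optimizer setup described above (ADAM, a learning rate of 1.0e-5, and frozen batchnorm layers because the online batch size is 1) follows. Whether the batchnorm affine parameters are frozen in addition to the running statistics is an assumption of this sketch.

    import torch

    def build_refinement_optimizer(depth_model, lr=1.0e-5):
        # Freeze BatchNorm layers, then optimize the remaining parameters with ADAM.
        for module in depth_model.modules():
            if isinstance(module, torch.nn.BatchNorm2d):
                module.eval()                      # keep running statistics fixed
                for p in module.parameters():
                    p.requires_grad_(False)        # do not update scale/shift
        params = [p for p in depth_model.parameters() if p.requires_grad]
        return torch.optim.Adam(params, lr=lr)

Note that a later call to model.train() would reset the eval flag on the frozen layers, so the freezing may need to be re-applied before each online refinement step.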

[0093] The SLAM was set to maintain a consistent scale over a large temporal window by correcting the CNN depth scale with map points. A warm start period was provided for the SLAM to initialize the reconstruction. For this implementation, 500 frames were provided for the warm start in the EuRoC Vicon Room sequences. Because KITTI sequences vary greatly in length, the warm start period was set to not exceed 40% of the total frame count.

[0094] Map points were filtered with criteria to ensure a good supervision signal for online depth refinement. First, map points observed in fewer than 5 keyframes were discarded. Then, map points with reprojection errors greater than 1 pixel were discarded. Finally, if a map point depth differed from the original CNN depth by more than 30% of the map point depth, it was also discarded.

[0095] SLAM keyframe poses and corresponding map points were updated and pushed to the online depth refinement module 230 for every frame. Map point filtering was conducted for every frame because some points may have become qualified or disqualified after SLAM back-end optimization. Keyframe poses and map points were stored in a dictionary indexed by the frame ID. The parameter t_key was set to 500 and N_key was set to 24, i.e., the system revisited up to the 24 most recent keyframes for the first 500 frames.

2.2. EuRoC Indoor MAV Dataset

[0096] The EuRoC Micro Air Vehicles (MAV) dataset is an indoor dataset which contains stereo image sequences and camera parameters (both intrinsics and extrinsics). An MAV mounted with global-shutter stereo cameras is used to capture the data in a large machine hall and a Vicon room. Five sequences are recorded in the machine hall and six in the Vicon room. Ground-truth camera poses and depths are obtained with a Vicon device and a Leica MS50 laser scanner. The sequences are characterized as easy, medium, or difficult according to a few factors: with good texture or not, in a bright or dark scene, and with motion blur or not. Five sequences, i.e., MH_01, MH_02, MH_04, V1_01, and V1_02, were used as the training set to pretrain Monodepth2, and the remaining Vicon sequences except V2_03 were used to evaluate the performance of all competing methods. Sequence V2_03 was excluded due to missing frames in one of the cameras. The images were rectified with the provided intrinsics to remove image distortion. To generate ground-truth depths, the laser point cloud was projected onto the image plane of the left camera. The images were resized to 754x480 for online depth refinement.

[0097] Ablation Study

[0098] An ablation study was performed on EuRoC Sequence V2_01 to gauge the contribution of each component to the depth refinement system 200. First, a base model was constructed by running an online learning algorithm with the photometric loss as in Eq. (2) and the depth smoothness loss as in Eq. (3). Note that the photometric loss uses the camera pose from the augmented SLAM module 210 instead of a pose network. We denote this base model as "BaseModel". We then gradually add new components to this base model, including the scale self-correction in the augmented SLAM module 210 ("+ Scale Correction"), the map-point loss ("+ MapPoint Loss"), the occlusion-aware depth consistency loss ("+ Depth Consistency"), and finally the keyframe memory revisiting loss ("+ Keyframe Revisit").

Table 1

[0099] Table 1 above shows the complete ablation results on EuRoC Sequence V2_01. The base model ("BaseModel") reduced the absolute relative depth error from 17.4% (by the pretrained Monodepth2 model) to 12.8%, which verifies the effectiveness of the simple self-supervised online learning method executed by the depth refinement system 200. Adding scale self-correction ("+ Scale Correction") increased the depth accuracy metric δ1 from 84.8% to 86.7%. We also give a visual comparison of camera trajectories by two versions of the augmented SLAM module 210, one with scale self-correction and the other without, in FIG. 3.

[00100] FIG. 3 is a graph 300 plotting a comparison of camera trajectories by versions of the augmented SLAM module 210 according to embodiments disclosed herein. Line 310 represents a ground-truth camera trajectory from EuRoC Sequence V2_01. Line 320 represents a camera trajectory for EuRoC Sequence V2_01 provided by a version of the augmented SLAM module 210 without scale correction. Line 330 represents a camera trajectory for EuRoC Sequence V2_01 provided by a version of the augmented SLAM module 210 with scale correction.

[00101] It is clear from graph 300 that the trajectory with the scale correction aligns better with the ground truth. This shows that SLAM map points after extensive geometric optimization indeed improve the scale information from the initial CNN depths. Using the map-point loss ("+ MapPoint Loss") further reduced Abs Rel from 12.2% to 9.5% and increased δ1 from 86.7% to 91.3%. This demonstrates the benefits of having accurate SLAM map points as supervision. Adding the occlusion-aware depth consistency loss ("+ Depth Consistency") further achieved an improvement of 0.7% in terms of Abs Rel. The final model with keyframe memory revisiting achieved the best results, with Abs Rel of 8.6% and δ1 of 92.0%. From this ablation study, it is evident that each component of our method makes non-trivial contributions to improving depth prediction.

[00102] Quantitative Results

[00103] Quantitative evaluation was conducted on the EuRoC test sequences, i.e., V1_03, V2_01, and V2_02, and the depth evaluation results are provided in Table 2, below. In Table 2, the depth refinement system 200 is represented as DRS. Consistent improvements were observed by the depth refinement system 200 over the baseline Monodepth2 on all three sequences. However, the improvement margin is different for each sequence. On Sequence V1_03, the pretrained Monodepth2 performs well with Abs Rel as low as 9.6%, and the depth refinement system 200 reaches a lower Abs Rel of 8.3%. On Sequence V2_01, much more significant improvements were observed by the depth refinement system 200, and the final depth result (with Abs Rel of 8.6%) on this sequence is comparable to the result on V1_03. The depth refinement system 200 also improved over the pretrained baseline model for V2_02. The pretrained Monodepth2 model has a much larger depth error on V2_02, such that the scale information may not be adequately retained in the pretrained model. Since the depth refinement system 200 performs online learning on a monocular sequence, the scale information cannot be corrected from a poor initial model.

Table 2

[00104] Another quantitative evaluation for the relative depth was performed by aligning the scale of the predicted depth with the ground truth. This quantitative depth evaluation with per-frame scale alignment is shown in Table 3, below. From Table 3, it can be seen that the relative depth evaluation for Sequences V1_03 and V2_01 is consistent with the absolute depth evaluation in Table 2. However, different behavior was observed for Sequence V2_02, i.e., the depth refinement system 200 achieved a significant improvement as well in terms of relative depth evaluation. With the depth refinement system 200, Abs Rel reduces from 19.2% to 10.5% and δ1 increases from 70.6% to 89.0% in the performance of relative depth estimation.

[00105] Qualitative Results

[00106] FIG. 4 illustrates a qualitative comparison of depth maps performed using the depth refinement system 200 on the EuRoC dataset. Input images are shown in the leftmost column, the depth map output for each input image by Monodepth2 is shown in the second (middle) column, and the depth map for each image output by the depth refinement system 200 is shown in the third (rightmost) column. As can be seen in FIG. 4, as compared to Monodepth2, the depth refinement system 200 provides refined depth maps having finer details and clearer edges.

[00107] FIG. 4 clearly shows the qualitative improvements brought by depth refinement system 200. In particular, the depth refinement system 200 can significantly improve the depth quality from a noisy initial depth and refine an initial depth map to have fine-grained depth details.

2.3. KITTI Outdoor Driving Dataset

[00108] The KITTI dataset consists of outdoor driving sequences for road scene understanding. Stereo image sequences and ground-truth depths are captured via calibrated stereo cameras and a Velodyne LiDAR sensor mounted on top of a driving car. The depth refinement system 200 was trained and tested using the rectified images provided with the KITTI dataset and the Eigen split. The test images are sampled from 28 test sequences that have no overlap with the training sequences. The depth refinement system 200 was run on those 28 sequences to generate refined depth maps. The images are resized to 640x192 for the online depth refinement module.

[00109] Quantitative Results

[00110] The depth evaluation results on the KITTI Eigen split test set are shown in Table 4, below. In Table 4, the depth refinement system is indicated as DRS. Furthermore, M represents self-supervised monocular supervision; S represents self-supervised stereo supervision; D represents depth supervision; Y represents Yes; N represents No; and "-" means that the result is not available from the corresponding paper. The best numbers in each block are marked in bold. The depth refinement system 200 starts with a CNN model (e.g., a Monodepth2 model in this implementation) that is trained with monocular and stereo images and is then refined with monocular images only. The depth refinement system 200 is thus marked as "(S)M". The depth refinement system 200 achieves the best performance among all the self-supervised monocular methods, which verifies the effectiveness of the depth refinement system 200 in refining depth in challenging outdoor environments. Compared to the test-time finetuning method by Luo, the depth refinement system 200 can reduce the Abs Rel from 13.0% to 9.9%; compared to the offline refinement method pRGBD-Refined, the depth refinement system 200 can increase δ1 by a considerable margin, from 87.4% to 90.0%. Notably, the depth refinement system 200 achieves the best results in terms of the accuracy metrics (δ1, δ2, δ3) among all the self-supervised methods.

Table 4

[00111] FIG. 5 shows a comparison of embodiments disclosed herein to Monodepth2 evaluated on the KITTI dataset. FIG. 5 shows a per-sequence comparison, where each set of columns represents a sequence from the KITTI dataset. On the listed sequences, clear benefits are observed by applying the depth refinement system 200, whose results are shown as solid black bars, while the results from Monodepth2 are shown as bars with hatching.

[00112] Accordingly, embodiments disclosed herein provide for geometric SLAM guided monocular depth refinement methods and systems that use self-supervised online learning. The embodiments disclosed herein rely on a pretrained model to build an augmented SLAM (e.g., CNN augmented SLAM) comprising scale self-correction. An online learning framework has been proposed to refine the depth, combining the benefits of geometric SLAM and new training losses. The above experimental data demonstrated the state-of-the-art performance of the embodiments disclosed herein on public datasets, including EuRoC and KITTI datasets.

D. Learning-based Optical Flow Augmented SLAM Embodiments

[00113] Embodiments described herein provide a depth refinement system having a hybrid SLAM, i.e., a SLAM augmented with a learning-based optical flow incorporated into the front-end of the SLAM, and a depth refinement module configured to run online depth refinement via a training platform. Various embodiments provide an online system by design that can achieve improved robustness and accuracy via: (i) a robustified augmented SLAM that incorporates learning-based optical flow (such as a RAFT-flow or the like) in the front-end to survive challenging scenarios; (ii) self-supervised losses that leverage outputs from the augmented SLAM module and enforce long-term geometric consistency; and (iii) a system design that avoids degenerate cases in the depth refinement.

[00114] While the embodiments herein are described with reference to augmenting a monocular SLAM module using a RAFT-flow, embodiments herein are not intended to be limited to a RAFT-flow only. Instead, any desired learning-based optical flow may be incorporated into the front end of the monocular SLAM. For example, a global motion aggregation method may be used instead of a RAFT-flow.

1. Methodology and Architecture

[00115] FIG. 6 is a diagram illustrating an architecture for an example depth refinement system 600 according to another embodiment disclosed herein. The depth refinement system 600 is configured to create and refine a depth prediction model to generate geometrically-consistent dense mapping from monocular image sequences. Depth refinement system 600 is another example implementation of depth refinement system 100 as set forth above. The system may be implemented using, for example, one or more processors and memory elements such as, for example, computer system 1300 of FIG. 13. As described above in connection with FIG. 1, the depth refinement system 600 may be hosted on a server 150, which may comprise dedicated server infrastructure and/or cloud-based infrastructure.

1.1. Augmented SLAM Module

[00116] Monocular visual SLAM may have some drawbacks: (i) its front-end often fails to track features under adverse environments, e.g., low texture, fast motion, and large rotation; and (ii) it can only reconstruct an environment up to an unknown global scale. To improve the performance of SLAM, a few methods have been proposed to improve the back-end optimization of direct LSD-SLAM. Instead, the depth refinement system 600 disclosed herein improves the front-end of feature-based monocular geometric SLAM to address front-end tracking loss, which is a common cause of SLAM failures and accuracy decrease. Thus, the depth refinement system 600 comprises a flow-SLAM module 610, in which the SLAM is augmented by incorporating a learning-based optical flow model 612 into the front-end tracking block 616 of the SLAM, while executing a SLAM back-end mapping block 618 as known in the art. The flow-SLAM module 610 may be similar to the augmented SLAM module 110 of FIG. 1, except as provided herein.

[00117] In an example embodiment, the depth refinement system 600 replaces front-end feature matching in traditional SLAM methods (e.g., ORB-SLAM) with an optical flow model 612, while sampling sparse points according to SLAM techniques for robust estimation and mapping in the back-end block 618. RAFT is an example state-of-the-art optical flow method shown to have strong cross-dataset generalization. A RAFT-flow model constructs a correlation volume for all pairs of pixels and uses a gated recurrent unit (GRU) to iteratively update the optical flow. Thus, the flow-SLAM module 610 allows the depth refinement system 600 to reap the advantages of learning-based optical flow along with established robust-estimator techniques of SLAM in one module.

[00118] In various embodiments, the optical flow model 612 may be trained using a deep learning framework. For example, a depth CNN may be used to train the RAFT-flow of the example embodiment disclosed herein. For example, as shown in FIG. 6, the depth model 634 may be used to predict depth maps 642 that may be used to train the flow model 612. The depth model 634 may be a CNN; however, the depth refinement system 600 is not limited to CNNs, and any network architecture desired may be used, for example, recent Transformer-based models. In another example, the flow model 612 may be trained by another depth model, which may be a depth CNN. Depth maps 642 may be unrefined (e.g., a first pass of the depth model) and/or refined depth maps following refinement by the training platform 635.

[00119] In operation, input data 605 is received by the depth refinement system 600 via network 640. The input data 605 comprises at least image sequences, for example, sequences of image frames captured by an image sensor. In some embodiments, the image sequences may be retrieved from memory or processed in real-time as they are captured and fed by the image sensor. The input data 605 may be communicated from an edge device, such as a memory and/or image sensor, to an online server or cloud-based platform (e.g., cloud-based server and database) via wireless communication. The input data 605 may be streamed to the depth refinement system 600 and/or communicated in blocks of data.

[00120] Image frames from the input data 605 are input into the flow model 612 of the flow-SLAM module 610; the flow model 612 may be stored in a database or other storage device and executed by one or more processors. For each incoming image frame I_i, correspondences of feature points from a sequentially preceding image frame (e.g., the immediately preceding image frame) are located by executing the flow model 612, prior to sampling sparse points in the back-end block 618. The learning-based optical flow provides the coordinate displacement vector (d_u, d_v), so the correspondence of a point (x, y) can be found at (x + d_u, y + d_v). Feature points may refer to points (e.g., pixels) in a given image frame that make up a feature included in the image frame. An example feature point may be an ORB feature, which is an example standard feature descriptor used in ORB-SLAM and many other SLAM algorithms and systems. The depth correspondence is communicated to the front-end tracking block 616 as flow data 614. For example, for each incoming image I_i, the flow model 612 constructs a correlation volume for all pairs of pixels from the incoming image I_i and the preceding image frame I_{i-1}. The flow model 612 uses the GRU to iteratively update the optical flow, thereby locating the correspondences of feature points from the previous image frame I_{i-1}. For example, for each feature from the last image frame I_{i-1}, once the respective feature is associated with a map point, its correspondence in the incoming frame I_i is located by adding the flow displacement. If there are multiple candidate points (or pixels) within a predefined radius around the target pixel in I_i, the candidate with the smallest descriptor residual is selected. If there are none, then a new feature is created instead, with the descriptor being copied from I_{i-1}.

[00121] In various embodiments, the flow model 612 uses only N_f = 0.1 × N_t matched candidate pairs of pixels and nearby ORB features for the initial pose calculation, where N_t is the total number of ORB features within the current incoming frame I_i. Constraining the matched candidates in this way may improve the robustness of the flow model 612. Compared to leveraging the entire flow between consecutive image frames, sampling a subset of pixels may be beneficial and improve the accuracy of the flow model 612, because the flow in these regions is generally more accurate. Additionally, a forward-backward consistency check may be performed on predicted flows to obtain a valid flow mask by using a threshold of 1 pixel.
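The forward-backward consistency check mentioned above can be sketched as follows; nearest-neighbor sampling of the backward flow is a simplification (bilinear sampling could be used instead), and the (H, W, 2) flow layout is an assumption of this sketch.

    import numpy as np

    def forward_backward_flow_mask(flow_fw, flow_bw, threshold=1.0):
        # flow_fw, flow_bw: (H, W, 2) arrays of (du, dv) displacements between a frame pair.
        h, w, _ = flow_fw.shape
        xs, ys = np.meshgrid(np.arange(w), np.arange(h))

        # Where each pixel of the first frame lands in the second frame under the forward flow.
        x_fwd = xs + flow_fw[..., 0]
        y_fwd = ys + flow_fw[..., 1]

        # Sample the backward flow at those locations (nearest neighbor for simplicity).
        xi = np.clip(np.round(x_fwd).astype(int), 0, w - 1)
        yi = np.clip(np.round(y_fwd).astype(int), 0, h - 1)
        bw_sampled = flow_bw[yi, xi]

        # Consistent flow should bring each pixel (approximately) back to where it started.
        err = np.sqrt((flow_fw[..., 0] + bw_sampled[..., 0]) ** 2 +
                      (flow_fw[..., 1] + bw_sampled[..., 1]) ** 2)
        return err < threshold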

[00122] Next, the flow-SLAM module 610 performs a local map point tracking step at the front-end tracking block 616. The local map point tracking is configured to densify potential associations from views or scenes other than that of the current incoming frame I_i and to further optimize camera pose estimations. ORB features are combined with the RAFT-flow because traditional ORB features may keep structure information (e.g., edges and/or cornerness), mitigating the drifting caused by flow mapping in long sequential tracking. The front-end tracking block 616 then outputs 2D point matching and tracking results to the back-end mapping block 618.

[00123] The back-end mapping block 618 performs sparse mapping using the output from the front-end tracking block 616 to compute the environment geometry and camera poses for each scene. The back-end mapping block 618 outputs the computed camera poses, map points, and depth maps as output data 620. The back-end mapping block 618 computes sparse 3D points and corresponding depth values, while the depth network 634 provides dense points and depth values.

[00124] In some embodiments, the input data 605 may also include data from inertial measurement unit (IMU) sensors (e.g., sensors that detect and report out angular rate, orientation, force exerted on a body, magnetic fields, etc.). Example IMU sensors include, but are not limited to accelerometers, gyroscopes, magnetometers, and the like. IMU sensors may be coupled to the image sensor and/or part of a system or device comprising the image sensor, thereby providing IMU data indicative of the motion and forces experienced by the image sensor. In some embodiments, IMU data may be used, along with the image sequence, to determine camera pose and map points in the Flow-SLAM module 610.

[00125] The input data 605 may also include metadata, such as a time stamp and/or geographic location information (e.g., as detected by a Global Positioning System (GPS) coupled to the image sensor). Time stamp metadata may indicate a point in time at which each image frame of the image sequence and/or IMU data is captured. Similarly, geographic location information may indicate a geographic location at which the image frame was captured and/or IMU data was measured. The image frames and IMU data may be associated based on the time stamps and/or geographic location (e.g., image frames and IMU data for a given point in time and/or given location may be associated together).

1.1.2. Multiple Sensor Modes

[00126] Embodiments herein may support multiple sensor modes and provide for a minimal sensor setup. For example, using a monocular image sensor with (or without) an IMU sensor (e.g., providing IMU data with the input data 605) facilitates two SLAM modes, i.e., the monocular and Visual-Inertial (VI) modes. Furthermore, because a depth model (e.g., a CNN) is used to infer the depth map for every image, a pseudo-RGBD (pRGBD) mode may also be implemented.

[00127] In embodiments under the monocular mode, the flow-SLAM module 610 reconstructs camera poses and 3D map points in an arbitrary scale at the back-end mapping block 618. As noted above, the flow model 612 may be based on a trained depth model (e.g., a CNN depth model), from which predicted depth maps can be leveraged to adapt the scale of map points and camera poses for the back-end mapping block 618. For example, as shown in FIG. 6, the trained depth network 634 may provide depth maps to the flow-SLAM module 610 for use in adapting the scale of map points and camera poses. This scale alignment step at the back-end mapping block 618 may be beneficial because SLAM outputs will be used in the downstream task of the depth refinement module 630. If the scales between the flow-SLAM module 610 and the depth refinement module 630 differ too much, depth refinement will be sub-optimal or fail. Thus, in the monocular mode, the back-end mapping block 618 performs back-end scale alignment using the outputs from the flow model 612 and the predicted depth maps from the trained depth model. For example, initial map points are constructed and, during an initialization of the depth refinement system 600 (e.g., by the flow-SLAM module 610), the scale may be continuously aligned for a number of steps by solving the following least-squares problem:

Eq. (10)

[00128] where s is the scale alignment factor to be estimated, and d(x) and d̂(x) are the depth values from the trained depth model and from the SLAM map points, respectively. Eq. (10) may be solved over any number of steps; for example, d(x) and d̂(x) may be computed from an initial 5 image frames, an initial 10 image frames, etc. The outputs 620 from the flow-SLAM module 610 can then be scale-adjusted by multiplying by s before input into the depth refinement module 630.
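Under the assumption that Eq. (10) is an ordinary least-squares fit of a single scale factor s such that s·d̂(x) approximates d(x) over the sparse map-point correspondences, the closed-form solution can be sketched as:

    import numpy as np

    def align_scale(cnn_depths, slam_depths):
        # Closed-form least-squares scale s minimizing sum_x (s * d_hat(x) - d(x))^2,
        # where d(x) are network depths and d_hat(x) are SLAM map-point depths.
        d = np.asarray(cnn_depths, dtype=np.float64)
        d_hat = np.asarray(slam_depths, dtype=np.float64)
        return float(np.dot(d_hat, d) / (np.dot(d_hat, d_hat) + 1e-12))

In use, s would be estimated from the map points of the initial frames and then applied to the SLAM outputs 620 (map points and camera translations) before they are passed to the depth refinement module 630.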

[00129] If the scales of the two modules are already of the same order (e.g., when the SLAM runs in the visual-inertial (VI) mode), the above-described alignment step becomes optional.

[00130] VI SLAM is usually more robust than monocular SLAM under challenging environments with low texture, motion blur, and occlusions. Since inertial sensors provide scale information from IMU data, camera poses and 3D map points from VI flow-SLAM can be recovered in metric scale. Thus, in the VI mode, given a scale-aware depth model (e.g., a model that predicts depth in metric scale, for example, from IMU data), some embodiments run the depth refinement module 630 without taking special care of scale discrepancies between the modules 610 and 630.

[00131] In another example, the pRGBD mode may provide a way to incorporate deep depth priors into executing a geometric SLAM. However, this approach may result in sub-optimal SLAM performance if the depth predictions are treated naively as ground truth to run the RGBD mode, due to noisy predictions. In the RGBD mode of ORB-SLAM3, the depth is mainly used in two SLAM stages, i.e., system initialization and bundle adjustment. By using the input depth, the system can initialize instantly from the first frame, without the need to wait until having enough temporal baselines. For each detected feature point, employing the depth and camera parameters, the system creates a virtual right correspondence, which leads to an extra reprojection error term in bundle adjustment. To mitigate the negative impact of the noise in depth predictions, the depth refinement system 600 may be configured to make two changes in the pRGBD mode as compared to the original RGBD mode. For example, (i) the flow-SLAM module 610 may receive the refined depth maps as inputs (e.g., depth maps 642) from the depth refinement module 630 (as described below) to ensure that the input depth maps are more accurate and temporally consistent, and (ii) the reprojection error term for the virtual right points may be removed in the bundle adjustment. Note that the input CNN depth map is still used in the map point initialization and new keypoint insertion, benefiting the robustness of the flow-SLAM module 610.

1.2. Depth Refinement Module

[00132] The depth refinement module 630 receives output data 620, including map points and camera poses from the flow-SLAM module 610. The depth model 634 is then incrementally refined based on one or more loss parameters computed by training platform 635. The depth refinement module 630 may be similar to the depth refinement module 120 of FIG. 1, except as provided herein.

[00133] In various embodiments, the depth model 634 may be a trained CNN depth model, but other depth models may be implemented as well, such as, but not limited to, Transformer-based depth networks. The training platform 635 may comprise one or more of a photometric loss block 636, an edge-aware depth smoothness loss block 638, a map-point loss block 637, and a depth consistency loss block 639. The training platform 635 and the self-supervised losses determined therein may be similar to the training platform 235 of FIG. 2.

[00134] For example, similar to the photometric loss block 236, the photometric loss block 636 may be configured to determine the photometric loss (L_p) according to Eq. (2) above. The photometric loss block 636 may employ a wider-baseline photometric loss by using a 5-frame snippet, for example, defined by j ∈ A_i = {i − 9, i − 6, i − 3, i + 1}. The frame snippet is empirically defined, and the above example is provided for illustrative purposes only. Additionally, while a 5-frame snippet is provided herein, this is for illustrative purposes only and any n-frame snippet may be used, where n is an empirically set integer. Another important difference of the depth refinement module 630 from prior art systems is that the relative pose T_{j→i} comes from the flow-SLAM module 610, which is more accurate than the one predicted by a pose network.

[00135] The edge-aware normalized smoothness loss block 638 may be configured to determine an edge-aware normalized smoothness loss (L_s) according to Eq. (3) above.

[00136] The map points from the flow-SLAM module 610 may have undergone extensive optimization through bundle adjustment in the back-end mapping block 618 (e.g., as part of known SLAM techniques), so the depth values of these map points may be more accurate than the corresponding depth values predicted by the trained depth network 634. The map point loss block 637 leverages these SLAM-predicted map-point depths to build a map-point loss as a supervision signal to the depth refinement model. The map-point loss may be the difference between the SLAM map points in the data 620 and the corresponding depths from the depth model 634, which may be determined using Eq. (4) as set forth above. In this embodiment, there are N_i 3D map points in the data 620 from the flow-SLAM module 610 for the i-th image frame after filtering as described below, D_{i,n} is the depth predicted by the depth model 634 for the n-th map point of the i-th image frame, and D_{i,n}^slam is the corresponding depth from the flow-SLAM module 610.

[00137] Filtering of the map points from the flow-SLAM module 610 is performed as follows. First, the map point loss block 637 determines whether or not each map point for the current image frame is observed (e.g., present) in a number of SLAM keyframes. The keyframes for this filtering task are determined by the SLAM and are not the same keyframes described below for the keyframe mechanism. The number of keyframes may be any desired number, where a keyframe may be inserted every few empirically determined frames. If the map point is not observed in the number of SLAM keyframes, then the map point is discarded. Otherwise, the map point is maintained. In this way, the map point loss block 637 uses map points that are found in a number of frames in the sequence, and not only a single frame. Second, the map point loss block 637 projects map points from a world coordinate system (e.g., the coordinate system of the geometry of the environment constructed by the back-end mapping block 618) to the coordinate system of the current image frame and determines a projection error for each map point. If the projection error for a given map point exceeds an error threshold, the given map point is discarded. The error threshold may be set as desired based on acceptable tolerances; for example, the threshold may be 3 pixels, 4 pixels, 5 pixels, etc. This filtering scheme ensures that only accurate map points are used.

[00138] In addition to the loss terms above, embodiments herein implement an occlusion-aware depth consistency loss block 639 configured to determine an occlusion-aware depth consistency loss and a keyframe selection block 631 configured to execute a keyframe selection strategy to build the online depth refinement pipeline.

1.2.1. Occlusion-Aware Depth Consistency

[00139] The occlusion-aware depth consistency loss block 639 receives the depth maps included with the map points, and the relative poses T = [R|t] included with the camera poses, in the data 620. Given the depth maps of two adjacent image frames, D_i and D_j, and their relative pose T = [R|t], the occlusion-aware depth consistency loss block 639 builds a robust consistency loss between D_i and D_j to make the depth predictions at times i and j consistent with each other. Note that the depth values at corresponding positions of frames i and j are not necessarily equal, as the image sensor used to capture I_i and I_j moves over time. With the camera pose T, the depth map D_j can be warped and then transformed to a depth map D̃_i of frame i, via image warping and coordinate system transformation. For example, an approach similar to that described in connection with Eqs. (5) and (6) above may be used, where D̃_i may represent [X_{i→j→i}]_3. Then the initial depth consistency loss may be provided as

L_c(D_i, D̃_i) = | 1 − D̃_i / D_i |

Eq. (11)

[00140] The depth consistency loss determined in Eq. (11) may include pixels in occluded regions, which may be detrimental to the depth refinement model. Thus, following a per-pixel photometric loss, a per-pixel depth consistency loss is determined by taking the minimum of the depth consistency loss from Eq. (11) using Eq. (8), above, instead of an average over a set of neighboring frames.

1.2.2. Degenerate Cases and Keyframe Mechanism

[00141] In some scenarios, the self-supervised losses are not without degenerate cases. If these cases are not carefully considered, self-supervised training may deteriorate, leading to worse depth predictions. A first degenerate case happens when the camera stays static. One approach to address the first degenerate case is to remove static frames in the image sequence by computing and thresholding the average optical flow of consecutive frames (e.g., from the RAFT-flow). For example, an auto-masking strategy may be used to automatically mask out static pixels when calculating the photometric loss.
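A minimal sketch of the static-frame check described above, which thresholds the average optical-flow magnitude between consecutive frames, is shown below; the threshold value is illustrative and not one specified in this document.

    import numpy as np

    def is_static_frame(flow, mean_flow_threshold=1.0):
        # flow: (H, W, 2) optical flow between consecutive frames (e.g., from the RAFT-flow).
        magnitude = np.linalg.norm(flow, axis=-1)
        return float(magnitude.mean()) < mean_flow_threshold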

[00142] A second degenerate case is when the camera undergoes purely rotational motion. This degeneracy has not been considered in self-supervised depth estimation applications. Under pure rotation, motion recovery using the fundamental matrix (e.g., a matrix depicting the epipolar geometry between corresponding points from two frames) suffers from ambiguity, so homography-based methods may be preferred. In the context of the photometric loss, if the camera motion is pure rotation (e.g., the translational movement t = 0), the view synthesis (or reprojection) step as in Eq. (5) no longer depends on depth (e.g., depth cancels out after applying the projection function). The 2D correspondences are directly related by a homography matrix. So, in this case, as long as the camera motion is accurately given, any arbitrary depth can minimize the photometric loss, which is undesirable when training or finetuning the depth network 634 (as the depth may be arbitrarily wrong).

[00143] To circumvent the degenerate cases described above, the depth refinement module 630 includes a keyframe mechanism in the form of a keyframe selection block 631. The keyframe selection block 631 is configured to facilitate online depth refinement without deterioration. For example, after camera poses are received from the flow-SLAM module 610, the keyframe selection block 631 selects keyframes for depth refinement according to a magnitude of camera translations and outputs the keyframes and related data as keyframe data 632. For example, if the norm of the camera translation is over a set threshold, the corresponding frame is selected as a keyframe (e.g., a candidate for applying the self-supervised losses). The corresponding frame may be identified via a time-stamp synchronizer that synchronizes the data 620 with the data 605 based on time stamp metadata (e.g., association of data 620 with data 605 for a point in time based on time stamp metadata). This ensures that the depth refinement module 630 has enough baselines (e.g., translations) for the photometric loss calculations to be effective.

[00144] The translations may be determined from the camera pose estimations in the data 620 and/or based on the optional IMU data. For example, IMU sensors may be used to detect translations, which may be associated with corresponding image frames based on time stamp metadata (e.g., via the time-stamp synchronizer).
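A minimal sketch of the translation-based keyframe selection is shown below. The 0.05 m default matches the keyframe motion threshold reported in the implementation details below; measuring the baseline from the last selected keyframe (rather than from the immediately preceding frame) is an assumption of this sketch.

    import numpy as np

    def select_keyframe(t_current, t_last_keyframe, translation_threshold=0.05):
        # Selects a frame as a refinement keyframe when the camera has translated far enough.
        baseline = np.linalg.norm(np.asarray(t_current) - np.asarray(t_last_keyframe))
        return baseline > translation_threshold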

[00145] The selected keyframes from the keyframe selection block 631 are then fed into the depth model 634 as keyframe data 632. The keyframe data 632 is indicative of the selected keyframes and related data, which are used to perform the photometric loss calculations by the photometric loss block 636, as described above. In various embodiments, the keyframe selection strategy executed by keyframe selection block 631 may be applied to each of the loss parameters of the training platform 635.

1.2.3. Overall Refinement Strategy

[00146] The overall refinement loss (L) from training platform 635 may be determined according to Eq. (9) above.

[00147] The depth refinement system 600 is configured to refine any trained depth estimation network or model to achieve geometrically-consistent depth prediction for each frame of an image sequence. As the flow-SLAM module 610 runs on separate threads, the keyframe data 632, including image frames, map points, and camera poses, is buffered in a time-synchronized data queue of a fixed size. While the time-synchronized data queue may be set to any fixed size, in an illustrative example (see Section D.2.1 below) the time-synchronized data queue is set to a length of 11.

[00148] Depth refinement for the current keyframe is conducted by minimizing the loss term in Eq. (9) and performing gradient descent for K* steps, where K* is the number of gradient descent steps applied for each keyframe selected by the keyframe selection block 631. After the K* depth refinement steps are performed on the depth model 634, a depth inference may be run using the refined depth model 634 to generate and save a depth map 640 for the current keyframe. Global maps can be finally reconstructed by performing TSDF or bundle fusion, as known in the art.

[00149] In a case where depth refinement is demanded for every image frame (e.g., per-frame depth refinement), a data queue for per-frame data 633 may be maintained. The per-frame data 633 may be used to construct a frame snippet by taking a number of current consecutive frames and a number of most recent keyframes, where the number of keyframes is greater than the number of consecutive frames. In an example embodiment, a 5-frame snippet may be constructed from two current consecutive frames and three of the most recent keyframes. In this case, depth refinement for the current frame is conducted by minimizing the loss term in Eq. (9) and performing gradient descent for K steps. After the K depth refinement steps, a depth inference may be run using the refined depth model 634 to generate and save a refined depth map 640 for the current frame.
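A minimal sketch of the per-frame snippet construction described above (e.g., a 5-frame snippet from two current consecutive frames and three recent keyframes) follows; both queues are assumed to be time-ordered lists of frame records, oldest first.

    def build_frame_snippet(per_frame_queue, keyframe_queue, n_consecutive=2, n_keyframes=3):
        # Take the most recent consecutive frames and the most recent keyframes.
        consecutive = list(per_frame_queue)[-n_consecutive:]
        keyframes = list(keyframe_queue)[-n_keyframes:]
        return keyframes + consecutive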

[00150] Global maps can be finally reconstructed by performing fusion, for example, TSDF fusion, bundle fusion, or the like, as known in the art. Thus, the depth refinement system 600 may include an optional global mapping module 650, which is substantively similar to the global dense mapping module 130 of FIG. 1. The global mapping module 650 receives the refined depth maps 640 and executes a fusion algorithm, such as TSDF fusion, bundle fusion, or the like, to generate a 3D reconstruction of the environment pertaining to the input image sequence.

[00151] An example algorithm is summarized below as Algorithm 2.

Algorithm 2 GeoRefine: self-supervised online depth refinement for geometrically-consistent dense mapping.
1: Pretrain the depth model. ► supervised or self-supervised
2: Run RAFT-SLAM. ► on separate threads
3: Data preparation: buffer time-synchronized keyframe data into a fixed-sized queue Q*; (optionally) form another data queue Q for per-frame data.
4: while True do
5:     Check stop condition. ► stop signal from SLAM
6:     Check SLAM failure signal. ► clear data queues if received
7:     for k ← 1 to K* do ► keyframe refinement
8:         Load data in Q* to GPU. ► batch size as 1
9:         Compute losses as in Eq. (9).
10:        Update depth model via one gradient descent step. ► ADAM optimizer
11:    end for
12:    Run inference and save refined depth for current keyframe.
13:    for k ← 1 to K do ► optional per-frame refinement
14:        Check camera translation from last frame. ► skip if too small
15:        Load data in Q* and Q to GPU. ► batch size as 1
16:        Compute losses as in Eq. (9).
17:        Update depth model via one gradient descent step. ► ADAM optimizer
18:    end for
19:    Run inference and save refined depth for current frame.
20: end while
21: Run global mapping. ► TSDF or bundle fusion
22: Output: refined depth maps and global TSDF meshes.

2. Experiments

[00152] Experiments were conducted on three public datasets: the EuRoC, TUM-RGBD, and ScanNet datasets. Ablation studies were performed to verify the effectiveness of each component in the depth refinement system 600, and quantitative and qualitative results are presented on the three datasets. For quantitative depth evaluation, the standard error and accuracy metrics are used, including the MAE, Abs Rel, RMSE, δ < 1.25 (namely δ1), δ < 1.25² (namely δ2), and δ < 1.25³ (namely δ3).

2.1. Implementation Details

[00153] The depth refinement system 600 comprises a flow-SLAM module 610 (e.g., a RAFT-SLAM module in this implementation) and an online depth refinement module 630. The RAFT-SLAM module is implemented based on ORB-SLAM3, which supports both monocular and VI modes. In the experiments, both modes were tested, and it is shown that the depth refinement system 600 achieves consistent improvements over pretrained models under both modes. The online depth refinement module refines a pre-trained depth model (e.g., a depth CNN model) with a customized data loader (e.g., an interface through which the data 620 is extracted, read, and/or loaded into the depth refinement module 630) and training losses. In the experiments, a self-supervised model (i.e., Monodepth2) and a supervised model (i.e., DensePredictionTransformers or DPT) are compared against to showcase the effectiveness of the depth refinement system 600. A Robot Operating System (ROS) is utilized to exchange data between modules 610 and 630 for cross-language compatibility. ADAM is used as the optimizer and the learning rate is set to 1.0e-5. The weighting parameters λ_s, λ_m, and λ_c are set to 1.0e-4, 5.0e-2, and 1.0e-1, respectively. The batchnorm layers (if present) are frozen in the depth network because the batch size is always 1 during online training. For the DPT model, the decoder layers are frozen as they are relatively well trained.

[00154] Map points are filtered with the criteria set forth above to ensure a good supervision signal for online depth refinement. To this end, map points observed in fewer than 5 keyframes or with reprojection errors greater than 1 pixel are discarded. SLAM poses and corresponding map points are updated and pushed to the online depth refinement module 630 for every frame. Map point filtering is conducted for every frame because some points may have become qualified or disqualified after SLAM back-end optimization. A keyframe data queue of length 11 (e.g., for keyframe data 632) and a per-frame data queue of length 2 (e.g., for per-frame data 633) are maintained. The motion threshold (translation) for keyframe refinement is set to 0.05 m, while the per-frame threshold is set to 0.01 m. The number of refinement steps for keyframes is set to 15 (or 3 for per-frame) for the Monodepth2 model, and to 3 for the DPT model (or 0 for per-frame, as no per-frame refinement is needed for it).

[00155] As noted above, the evaluated implementation of the depth refinement system 600 utilizes ROS as the agent for cross-language communication. Each processed frame is published, as well as its subsequent frame, the currently tracked map points, and the camera pose, if they exist, to the depth refinement module 630. The frames are fed into the RAFT model 612 to get pair-wise flow predictions, including both the forward and the backward flows. A forward-backward consistency check is performed on the predicted flows to obtain a valid flow mask by using a threshold of 1 pixel. To increase efficiency, a down-scaled image with size 256 x 512 was used for ROS communication and flow prediction. In the monocular mode, after the system successfully initializes, the map points and camera poses are continuously aligned to the CNN depth for five steps to make their scales consistent with each other.

[00156] It may be difficult to ensure that the RAFT-SLAM module never encounters failure cases. For example, it may fail occasionally on sequences with strong motion blur and significant rolling-shutter artifacts. To handle such failures, a recovery strategy is employed: after the depth refinement module 630 receives a signal of SLAM failure, the queues for both keyframe and per-frame data are cleared. In this case, the keyframe depth refinement process is paused, but the per-frame depth inference can still run if depth maps for all frames are demanded. This strategy may ensure that the depth refinement system 600 is rarely disrupted and that the system continues to run after the SLAM module recovers.
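
One way to express this recovery strategy in code is sketched below (a non-limiting illustration; the callback parameters and queue variables are assumptions, while the behavior of clearing both queues, pausing keyframe refinement, and keeping per-frame inference available follows the description above).

    from collections import deque

    keyframe_queue, per_frame_queue = deque(maxlen=11), deque(maxlen=2)
    keyframe_refinement_paused = False

    def on_slam_status(slam_failed):
        """React to a SLAM failure/recovery signal from the flow-SLAM module."""
        global keyframe_refinement_paused
        if slam_failed:
            keyframe_queue.clear()                 # discard stale keyframe data
            per_frame_queue.clear()                # discard stale per-frame data
            keyframe_refinement_paused = True
        else:
            keyframe_refinement_paused = False     # resume once SLAM recovers

    def refine_step(refine_keyframes, infer_depth, need_all_frame_depths=True):
        """refine_keyframes / infer_depth are caller-supplied callbacks (assumptions)."""
        if not keyframe_refinement_paused and keyframe_queue:
            refine_keyframes(list(keyframe_queue))     # keyframe refinement runs normally
        if need_all_frame_depths and per_frame_queue:
            infer_depth(per_frame_queue[-1])           # per-frame inference is never blocked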

2.2. EuRoC Indoor MAV Dataset

[00157] As noted above, the EuRoC MAV dataset is an indoor dataset which contains stereo image sequences and camera parameters. An MAV mounted with global-shutter stereo cameras is used to capture the data in a large machine hall and a Vicon room. Five sequences are recorded in the machine hall and six in the Vicon room. The ground-truth camera poses and depths are obtained with a Vicon device and a Leica MS50 laser scanner, so all Vicon sequences were used as test sets. To generate ground-truth depths, the laser point cloud was projected onto the image plane of the left camera. The original images have a size of 480 x 754 and are resized to 256 x 512 or 256 x 320 for Monodepth2 and to 384 x 384 for DPT.
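
Projecting the laser point cloud onto the left camera's image plane to obtain sparse ground-truth depth can be sketched as follows (a minimal, non-limiting Python illustration; the intrinsic/extrinsic variable names are assumptions and no lens-distortion model is applied).

    import numpy as np

    def project_point_cloud(points_world, K, T_cam_world, height, width):
        """Render a sparse depth map by projecting 3D points into a pinhole camera.

        points_world: (N, 3) laser points in world coordinates.
        K:            (3, 3) intrinsic matrix of the left camera.
        T_cam_world:  (4, 4) world-to-camera extrinsic transform.
        """
        pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
        pts_cam = (T_cam_world @ pts_h.T).T[:, :3]
        pts_cam = pts_cam[pts_cam[:, 2] > 0]            # keep points in front of the camera
        uvw = (K @ pts_cam.T).T
        u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
        z = pts_cam[:, 2]
        inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
        depth = np.zeros((height, width), dtype=np.float32)
        # Keep the nearest depth when multiple points project to the same pixel.
        for ui, vi, zi in sorted(zip(u[inside], v[inside], z[inside]), key=lambda t: -t[2]):
            depth[vi, ui] = zi
        return depth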

[00158] Ablation Study

[00159] Without loss of generality, an ablation study was performed on Seq. V2_03 of the EuRoC dataset to gauge the contribution of each component of the depth refinement system 600 under both the monocular and pRGBD modes. First, a base system was constructed by running an online refinement algorithm with the photometric loss as in Eq. (2), the depth smoothness loss as in Eq. (3), and the map-point loss as in Eq. (4). Note that the photometric loss uses camera poses from the RAFT-SLAM module instead of a pose network. This base model is denoted as "BaseSystem". New components are then added to this base model, including the scale alignment strategy in the RAFT-SLAM module ("+ Scale Alignment") and the occlusion-aware depth consistency loss ("+ Depth Consistency"). Under the pRGBD mode, "BaseSystem" takes the pretrained depth as input without using the proposed changes, and this base system uses the depth consistency loss. New components are then added to the base system, i.e., using refined depth from the online depth refinement module ("+ Refined Depth"), using the RAFT-flow in the SLAM front end ("+ RAFT-flow"), and removing the reprojection error term in bundle adjustment ("+ Remove BA Term").
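
For illustration, the combined objective of the base system can be sketched as follows (a non-limiting PyTorch-style sketch that assumes standard self-supervised formulations; the exact forms of Eqs. (2)-(4) are not reproduced here, and the association of the weights λs and λm with particular terms is an assumption).

    import torch

    def edge_aware_smoothness(disp, image):
        """Edge-aware smoothness term (a standard formulation, assumed to correspond to Eq. (3))."""
        grad_disp_x = (disp[:, :, :, :-1] - disp[:, :, :, 1:]).abs()
        grad_disp_y = (disp[:, :, :-1, :] - disp[:, :, 1:, :]).abs()
        grad_img_x = (image[:, :, :, :-1] - image[:, :, :, 1:]).abs().mean(1, keepdim=True)
        grad_img_y = (image[:, :, :-1, :] - image[:, :, 1:, :]).abs().mean(1, keepdim=True)
        return (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
               (grad_disp_y * torch.exp(-grad_img_y)).mean()

    def map_point_loss(pred_depth, sparse_depth, mask):
        """L1 loss against sparse SLAM map-point depths (assumed to correspond to Eq. (4))."""
        return (pred_depth[mask] - sparse_depth[mask]).abs().mean()

    def base_system_loss(photometric, disp, image, depth, sparse_depth, sparse_mask,
                         lambda_s=1.0e-4, lambda_m=5.0e-2):
        # Weighted sum of the base-system terms; photometric is the precomputed
        # scalar corresponding to Eq. (2), using SLAM poses rather than a pose network.
        return (photometric
                + lambda_s * edge_aware_smoothness(disp, image)
                + lambda_m * map_point_loss(depth, sparse_depth, sparse_mask))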

Table 5

[00160] Table 5 above shows a complete set of ablation results. Under the monocular mode, "BaseSystem" reduces the absolute relative depth error from 9.9% (by the pretrained DPT model) to 9.0%, which verifies the effectiveness of the basic self-supervised refinement method executed by the depth refinement system 600. Using RAFT-flow in the SLAM front end makes SLAM more robust and generates more accurate pose estimates, which in turn improves the depth refinement module. Adding scale self-alignment ("+ Scale Alignment") improves the depth quality significantly in all metrics, e.g., Abs Rel decreases from 8.3% to 6.4% and δ1 increases from 91.5% to 95.2%. Adding the occlusion-aware depth consistency loss ("+ Depth Consistency") further achieves an improvement of 1.1% in terms of Abs Rel and 1.8% in terms of δ1. From this ablation study, it is evident that each component of the depth refinement method makes a non-trivial contribution to improving the depth results. Similar conclusions can be drawn under the pRGBD mode.

[00161] Table 6 shows odometry results for the depth refinement system in pRGBD mode. It is evident that the depth refinement system 600 in pRGBD mode outperforms the baseline, e.g., ORB-SLAM3, both in terms of robustness and accuracy, and each proposed new component contributes to the improvement. Note that "BaseSystem" uses only the trained depth from DPT to form a pRGBD mode.

Table 6

[00162] Quantitative Depth Results in the Monocular Mode

[00163] Quantitative evaluation was conducted by running the depth refinement system 600 under monocular RAFT-SLAM on the EuRoC test sequences, and the depth evaluation results are presented in Table 7, below. The depth refinement system 600 is denoted as "DRS" and uses DPT as the starting depth model. Per-frame scale alignment was performed between the depth prediction and the ground truth. From Table 7, below, it can be seen that the depth refinement system 600 achieves consistent and significant improvements over the baseline models on all test sequences. For example, on Seq. V1_01, the depth refinement system 600 reduces Abs Rel from 14.0% (by DPT) to 5.0%, achieving a more than two-fold reduction in depth errors.

Table 7

(Table 7 reports results for Seqs. V1_01, V1_02, V1_03, V2_01, V2_02, and V2_03.)

[00164] Quantitative Depth Results in the Visual-Inertial Mode

[00165] When IMU data is available, the depth refinement system 600 can be run under VI RAFT-SLAM to obtain camera poses and map points directly in metric scale. Note that, in the VI mode, no scale alignment is needed. The quantitative depth results are also provided in Table 7 above, from which it can be seen that the depth refinement system 600 under the VI mode performs on par with the monocular mode even without scale alignment. Compared to a similar dense mapping method, i.e., CodeMapping, the depth refinement system 600 is significantly more accurate with a similar run-time (e.g., around 1 second per keyframe), demonstrating the improvements achieved by the depth refinement system 600.

[00166] Quantitative Depth Results in the pRGBD Mode

[00167] Quantitative depth evaluation under the pRGBD mode is provided in the right columns of Table 7. From Table 7, it can be seen that the pRGBD mode performs slightly better than the other two modes in terms of depth results. This may be attributed to the fact that under this mode, the SLAM and depth refinement modules 610 and 630 form a loosely coupled loop so that each module benefits from the other.

[00168] Qualitative Depth Results

[00169] FIG. 7 illustrates qualitative visual comparisons of depth maps produced on the EuRoC dataset. Particularly, FIG. 7 compares depth maps generated by the "DRS" implementation of the depth refinement system 600 against those generated by DPT.

[00170] From FIG. 7, the qualitative improvements brought by the depth refinement method according to embodiments disclosed herein can be clearly seen. For example, FIG. 7 shows how the depth refinement system 600 can correct inaccurate geometry that is commonly present in the pretrained models. In the first row of FIG. 7, input images 710 illustrate a piece of thin paper 712 lying on the floor. DPT generated a depth map 720 in which pixels corresponding to the piece of thin paper are predicted to have much higher depth values than the neighboring floor pixels (e.g., box 722 compared to box 724), whereas DRS is able to rectify this depth to be consistent with the floor, as shown in depth map 730 in box 732 as compared to box 734.

[00171] FIG. 8 illustrates a global reconstruction of an environment using the depth refinement system 600. Particularly, FIG. 8 illustrates a global map of the EuRoC VICON room using refined depth maps generated by the depth refinement system 600 and fusing the refined depth maps together. As shown in FIG. 8, geometrically consistent reconstruction can be achieved using the refined depth maps.
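
Fusing refined depth maps into a global map can be performed with standard volumetric (TSDF) fusion; the following sketch illustrates one possible way to do so using the Open3D library (a non-limiting illustration, not necessarily the specific fusion method of the embodiments; the voxel size, truncation distance, and variable names are assumptions).

    import numpy as np
    import open3d as o3d

    def fuse_depth_maps(frames, fx, fy, cx, cy, width, height):
        """frames: iterable of (color_uint8_HxWx3, depth_float_HxW_meters, T_world_cam_4x4)."""
        volume = o3d.pipelines.integration.ScalableTSDFVolume(
            voxel_length=0.02,                 # 2 cm voxels (assumed)
            sdf_trunc=0.08,                    # truncation distance (assumed)
            color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)
        intrinsic = o3d.camera.PinholeCameraIntrinsic(width, height, fx, fy, cx, cy)
        for color, depth, T_world_cam in frames:
            rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
                o3d.geometry.Image(color),
                o3d.geometry.Image(depth.astype(np.float32)),
                depth_scale=1.0, depth_trunc=5.0, convert_rgb_to_intensity=False)
            # Open3D expects the world-to-camera extrinsic.
            volume.integrate(rgbd, intrinsic, np.linalg.inv(T_world_cam))
        return volume.extract_triangle_mesh()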

[00172] Odometry Results

[00173] Table 8, below, shows odometry comparisons of the depth refinement system 600 with current state-of-the-art methods on the EuRoC dataset in the monocular mode. The same parameter settings as ORB-SLAM3 were adopted in all experiments. As seen in Table 8, the depth refinement system 600 achieves results comparable to Droid-SLAM (note that Droid-SLAM may be elaborately designed for SLAM, while the embodiments herein need not be so designed) and significantly outperforms other monocular baselines.
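
Odometry accuracy of the kind reported in Table 8 is commonly measured as the absolute trajectory error (ATE) after aligning the estimated trajectory to the ground truth; the sketch below shows one standard way to compute it (a non-limiting illustration using Umeyama alignment, not necessarily the exact evaluation protocol used for Table 8).

    import numpy as np

    def umeyama_alignment(src, dst, with_scale=True):
        """Similarity transform (s, R, t) minimizing ||dst - (s R src + t)||; src/dst are (3, N)."""
        mu_s, mu_d = src.mean(axis=1, keepdims=True), dst.mean(axis=1, keepdims=True)
        src_c, dst_c = src - mu_s, dst - mu_d
        cov = dst_c @ src_c.T / src.shape[1]
        U, D, Vt = np.linalg.svd(cov)
        S = np.eye(3)
        if np.linalg.det(U) * np.linalg.det(Vt) < 0:
            S[2, 2] = -1                                   # handle reflections
        R = U @ S @ Vt
        var_src = (src_c ** 2).sum() / src.shape[1]
        s = np.trace(np.diag(D) @ S) / var_src if with_scale else 1.0
        t = mu_d - s * R @ mu_s
        return s, R, t

    def ate_rmse(est_xyz, gt_xyz, with_scale=True):
        """Root-mean-square absolute trajectory error; est_xyz/gt_xyz are (N, 3) positions."""
        s, R, t = umeyama_alignment(est_xyz.T, gt_xyz.T, with_scale)
        aligned = (s * R @ est_xyz.T + t).T
        return np.sqrt(np.mean(np.sum((aligned - gt_xyz) ** 2, axis=1)))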

Table 8

[00175] Depth results for the depth refinement system 600 using a self-supervised model, e.g., Monodepth2, as the base model on EuRoC are provided. Monocular and stereo images were taken from five sequences (MH_01, MH_02, MH_04, V1_01, and V1_02) as the training set to train the base model Monodepth2. Since stereo images with a known baseline distance are used, the trained Monodepth2 is scale-aware. The quantitative depth results are shown in Table 9, below, from which it can be seen that the depth refinement system 600, denoted as "DRS-MD2", improves over Monodepth2 by a significant margin in all three SLAM modes.

Table 9

[00176] Quantitative depth results for keyframes (e.g., as set by the depth refinement module 630) are provided in Table 10, below. Table 10 shows quantitative depth evaluation for keyframes on EuRoC in the VI mode. Depth maps are predicted in metric scale, e.g., in meters, and no per-frame scale alignment is performed. Compared with the per-frame results, the depth quality of keyframes is slightly better in most sequences. This verifies the effectiveness of the keyframe mechanism disclosed herein and demonstrates the importance of having large-enough baselines between temporal frames during depth refinement.

Table 10

2.3. TUM-RGBD Dataset

[00177] TUM-RGBD is a dataset mainly for benchmarking the performance of RGB-D SLAM or odometry. This dataset was created using a Microsoft Kinect sensor and eight high-speed tracking cameras to capture monocular images, their corresponding depth images, and camera poses. This dataset is particularly difficult for monocular systems as it contains a large amount of motion blur and rolling-shutter distortion caused by fast camera motion. Two monocular sequences from this dataset, i.e., "freiburg3_structure_texture_near" and "freiburg3_structure_texture_far", are used to test the depth refinement system 600, as they satisfy the requirement of sufficient camera translation. Quantitative depth results are presented in Table 11 below, with the depth refinement system 600 represented as DRS. As before, both modes of the depth refinement system 600 improve upon the pretrained DPT model by a significant margin, achieving a 2-4 times reduction in terms of Abs Rel.

Table 11 (sequences: freiburg3_structure_texture_near and freiburg3_structure_texture_far)

[00178] FIG. 9 illustrates a global reconstruction of an environment using the depth refinement system 600 on the TUM-RGBD dataset. Particularly, FIG. 9 illustrates a global map of a TUM-RGBD sequence using refined depth maps generated by the depth refinement system 600 and fusing the refined depth maps together. As shown in FIG. 9, the scene geometry is faithfully recovered.

[00179] Additionally, the depth refinement system 600 was evaluated on two more sequences from the TUM-RGBD dataset, i.e., freiburg3_long_office_household and freiburg3_long_office_household_validation. The same settings were adopted as above, using the DPT model pretrained on NYUv2 as the initial model. The quantitative depth results are shown in Table 12, from which it can be seen that consistent and significant improvements are achieved by the depth refinement system 600 over the pretrained model. In Table 12, the depth refinement system 600 is represented as DRS.

Table 12 (sequences: freiburg3_long_office_household and freiburg3_long_office_household_validation)

[00180] Odometry results on TUM-RGBD (including the sequence freiburg3_nostructure_texture_near_withloop) are provided in Table 13 below, where "X" means no pose output due to system failure and "(X)" means partial pose results. Compared to the baseline ORB-SLAM3, the improved odometry results of the depth refinement system 600 verify that using RAFT makes the SLAM system more robust and accurate. In particular, the depth refinement system 600 in both the monocular (DRS-Mono) and pRGBD (DRS-pRGBD) modes outperforms a recent deep odometry method proposed by Li by a significant margin.

Table 13

[00181] FIG. 10 illustrates qualitative visual comparisons of depth maps generated by the depth refinement system 600 on the TUM-RGBD dataset. Particularly, FIG. 10 illustrates, from left column to right column, input images, depth maps generated by DPT, and depth maps generated by the depth refinement system 600. As can be seen from FIG. 10, the depth refinement system 600 is able to reduce (and even eliminate) many artifacts and erroneous predictions as compared to DPT.

[00182] FIG. 11 illustrates a global reconstruction of another environment using the depth refinement system 600 on the TUM-RGBD dataset. Particularly, FIG. 11 illustrates a global map of the freiburg3_long_office_household sequence using refined depth maps generated by the depth refinement system 600 and fusing the refined depth maps together. As shown in FIG. 11, the scene geometry is faithfully recovered.

2.4. ScanNet Dataset

[00183] ScanNet is an indoor RGB-D dataset consisting of more than 1500 scans. This dataset was captured with a handheld device, so motion blur exists in most of the sequences, posing challenges both for monocular SLAM and for depth refinement. Moreover, camera translations in this dataset are small, as most of the sequences were captured in small rooms (e.g., bathrooms and bedrooms). To test the depth refinement system 600, three sequences that have relatively larger camera translations are sampled and run using the NYUv2-pretrained DPT as the base model. The results are summarized in Table 14, below. The pretrained DPT model performs well on ScanNet, reaching an Abs Rel of 6.3% to 8.0%, probably due to the similarity between the ScanNet and NYUv2 datasets. The depth refinement system 600 continues to improve the depth results in all metrics. In particular, on scene0228_00, the depth refinement system 600 reduces Abs Rel from 8.0% to 5.0% and increases δ1 from 93.1% to 97.9%.

[00184] FIG. 12 illustrates a global reconstruction of an environment using the depth refinement system 600 on the ScanNet dataset. Particularly, FIG. 12 illustrates a global map of the scene0228_00 sequence using refined depth maps generated by the depth refinement system 600 and fusing the refined depth maps together.

Table 14

2.5. KITTI Dataset

[00185] Depth results for the depth refinement system 600 on KITTI are provided in Table 14 above, with the depth refinement system 600 represented as DRS-MD2-Mono. The motion threshold for keyframes was set to 0.25 m (or 0.05 m for per-frame), the weighting parameter λ0 was set to 0.01, and a three-frame snippet (e.g., frames 0, -1, and 1) is used to build the loss; other parameters remain the same as provided above. Compared to the base model Monodepth2, the depth refinement system 600 reduced Abs Rel by 1% and improved δ1 by 2.8%. However, due to moving objects in KITTI, the improvement may not be as significant as in non-dynamic indoor environments.

[00186] Accordingly, embodiments disclosed herein combine geometric SLAM and deep learning in a symbiotic way. The embodiments disclosed herein rely on a robustified SLAM (e.g., flow-SLAM module 610) to compute camera poses and sparse map points. The embodiments disclosed herein further include an online learning framework configured to refine depth predictions with self-supervised training losses. Geometrically-consistent global maps can be reconstructed by fusing the refined depth maps.
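
The overall symbiosis described above can be summarized, at a high level, by the following per-frame processing loop (pseudocode-level Python; the function and attribute names are placeholders standing in for the flow-SLAM module 610, the online depth refinement module 630, and the fusion step, not actual APIs of the embodiments).

    def run_depth_refinement(image_stream, slam, refiner, fuser):
        """High-level per-frame loop: SLAM provides poses/map points, the refiner
        updates the depth model online, and refined depths are fused into a map."""
        for frame in image_stream:
            result = slam.track(frame)                    # flow-SLAM: pose + sparse map points
            if result is None:                            # SLAM failure: clear state, skip refinement
                refiner.reset()
                continue
            pose, map_points, is_keyframe = result
            refiner.push(frame, pose, map_points, is_keyframe)
            steps = refiner.keyframe_steps if is_keyframe else refiner.per_frame_steps
            for _ in range(steps):                        # online self-supervised refinement
                refiner.optimize_once()
            depth = refiner.predict_depth(frame)          # refined depth map for this frame
            fuser.integrate(depth, pose)                  # build the global reconstruction
        return fuser.extract_map()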

E. Example Computing System

[00187] FIG. 13 is an example computing component that may be used to implement various features of embodiments described in the present disclosure.

[00188] FIG. 13 depicts a block diagram of an example computer system 1300 in which various embodiments of the self-supervised depth estimation system 100 described herein may be implemented. The computer system 1300 includes a bus 1302 or other communication mechanism for communicating information, and one or more hardware processors 1304 coupled with bus 1302 for processing information. Hardware processor(s) 1304 may be, for example, one or more general purpose microprocessors.

[00189] The computer system 1300 also includes a main memory 1306, such as a random access memory (RAM), cache and/or other dynamic storage devices, coupled to bus 1302 for storing information and instructions to be executed by processor 1304, for example, instructions for executing the architecture of FIG. 1. Main memory 1306 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 1304. Such instructions, when stored in storage media accessible to processor 1304, render computer system 1300 into a special-purpose machine that is customized to perform the operations specified in the instructions.

[00190] The computer system 1300 further includes a read only memory (ROM) 1308 or other static storage device coupled to bus 1302 for storing static information and instructions for processor 1304. A storage device 1310, such as a magnetic disk, optical disk, or USB thumb drive (Flash drive), etc., is provided and coupled to bus 1302 for storing information and instructions.

[00191] The computer system 1300 may be coupled via bus 1302 to a display 1312, such as a liquid crystal display (LCD) (or touch screen), for displaying information to a computer user. An input device 1314, including alphanumeric and other keys, is coupled to bus 1302 for communicating information and command selections to processor 1304. Another type of user input device is cursor control 1316, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 1304 and for controlling cursor movement on display 1312. In some embodiments, the same direction information and command selections as cursor control may be implemented via receiving touches on a touch screen without a cursor.

[00192] One or more image sensors 1318 may be coupled to bus 1302 for capturing video as a plurality of image frames and/or static images of an environment. Image sensors include any type of camera (e.g., visible light cameras, IR cameras, thermal cameras, ultrasound cameras, and other cameras) or other image sensor configured to capture images. For example, image sensors 1318 may capture images that are processed according to the embodiments disclosed herein (e.g., in FIGS. 2 and 6). In some embodiments, image sensors 1318 communicate information to main memory 1306, ROM 1308, and/or storage 1310 for processing in real-time and/or for storage so as to be processed at a later time. According to some embodiments, image sensors 1318 need not be included, and images for processing may be retrieved from memory.

[00193] The computing system 1300 may include a user interface module to implement a GUI that may be stored in a mass storage device as executable software codes that are executed by the computing device(s). This and other modules may include, by way of example, components, such as software components, object-oriented software components, class components and task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

[00194] In general, the words "component," "engine," "module," "system," "database," "data store," "block," and the like, as used herein, can refer to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example, Java, C or C++. A software component may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpreted programming language such as, for example, BASIC, Perl, or Python. It will be appreciated that software components may be callable from other components or from themselves, and/or may be invoked in response to detected events or interrupts. Software components configured for execution on computing devices may be provided on a computer readable medium, such as a compact disc, digital video disc, flash drive, magnetic disc, or any other tangible medium, or as a digital download (and may be originally stored in a compressed or installable format that requires installation, decompression or decryption prior to execution). Such software code may be stored, partially or fully, on a memory device of the executing computing device, for execution by the computing device. Software instructions may be embedded in firmware, such as an EPROM. It will be further appreciated that hardware components may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors.

[00195] The computer system 1300 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 1300 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 1300 in response to processor(s) 1304 executing one or more sequences of one or more instructions contained in main memory 1306. Such instructions may be read into main memory 1306 from another storage medium, such as storage device 1310. Execution of the sequences of instructions contained in main memory 1306 causes processor(s) 1304 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.

[00196] The term "non-transitory media," and similar terms, as used herein refers to any media that store data and/or instructions that cause a machine to operate in a specific fashion. Such non-transitory media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical or magnetic disks, such as storage device 1310. Volatile media includes dynamic memory, such as main memory 1306. Common forms of non-transitory media include, for example, a floppy disk, a flexible disk, hard disk, solid state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, and EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge, and networked versions of the same.

[00197] Non-transitory media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between non-transitory media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 1302. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.

[00198] The computer system 1300 also includes a communication interface 1318 coupled to bus 1302. Communication interface 1318 provides a two-way data communication coupling to one or more network links that are connected to one or more local networks. For example, communication interface 1318 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 1318 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN (or a WAN component to communicate with a WAN). Wireless links may also be implemented. In any such implementation, communication interface 1318 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.

[00199] A network link typically provides data communication through one or more networks to other data devices. For example, a network link may provide a connection through a local network to a host computer or to data equipment operated by an Internet Service Provider (ISP). The ISP in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet." Local network and Internet both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on the network link and through communication interface 1318, which carry the digital data to and from computer system 1300, are example forms of transmission media.

[00200] The computer system 1300 can send messages and receive data, including program code, through the network(s), network link and communication interface 1318. In the Internet example, a server might transmit a requested code for an application program through the Internet, the ISP, the local network and the communication interface 1318.

[00201] The received code may be executed by processor 1304 as it is received, and/or stored in storage device 1310, or other non-volatile storage for later execution.

[00202] Each of the processes, methods, and algorithms described in the preceding sections may be embodied in, and fully or partially automated by, code components executed by one or more computer systems or computer processors comprising computer hardware. The one or more computer systems or computer processors may also operate to support performance of the relevant operations in a "cloud computing" environment or as a "software as a service" (SaaS). The processes and algorithms may be implemented partially or wholly in application-specific circuitry. The various features and processes described above may be used independently of one another, or may be combined in various ways. Different combinations and sub-combinations are intended to fall within the scope of this disclosure, and certain method or process blocks may be omitted in some implementations. The methods and processes described herein are also not limited to any particular sequence, and the blocks or states relating thereto can be performed in other sequences that are appropriate, or may be performed in parallel, or in some other manner. Blocks or states may be added to or removed from the disclosed example embodiments. The performance of certain of the operations or processes may be distributed among computer systems or computer processors, not only residing within a single machine, but deployed across a number of machines.

[00203] As used herein, a circuit might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a circuit. In implementation, the various circuits described herein might be implemented as discrete circuits or the functions and features described can be shared in part or in total among one or more circuits. Even though various features or elements of functionality may be individually described or claimed as separate circuits, these features and functionality can be shared among one or more common circuits, and such description shall not require or imply that separate circuits are required to implement such features or functionality. Where a circuit is implemented in whole or in part using software, such software can be implemented to operate with a computing or processing system capable of carrying out the functionality described with respect thereto, such as computer system 1300.

[00204] As used herein, the term "or" may be construed in either an inclusive or exclusive sense. Moreover, the description of resources, operations, or structures in the singular shall not be read to exclude the plural. Conditional language, such as, among others, "can," "could," "might," or "may," unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps.

[00205] Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. Adjectives such as "conventional," "traditional," "normal," "standard," "known," and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. The presence of broadening words and phrases such as "one or more," "at least," "but not limited to" or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent.

[00206] As used herein, the terms circuit and component might describe a given unit of functionality that can be performed in accordance with one or more embodiments of the present application. As used herein, a component might be implemented utilizing any form of hardware, software, or a combination thereof. For example, one or more processors, controllers, ASICs, PLAs, PALs, CPLDs, FPGAs, logical components, software routines or other mechanisms might be implemented to make up a component. Various components described herein may be implemented as discrete components or described functions and features can be shared in part or in total among one or more components. In other words, as would be apparent to one of ordinary skill in the art after reading this description, the various features and functionality described herein may be implemented in any given application. They can be implemented in one or more separate or shared components in various combinations and permutations. Although various features or functional elements may be individually described or claimed as separate components, it should be understood that these features/functionality can be shared among one or more common software and hardware elements. Such a description shall not require or imply that separate hardware or software components are used to implement such features or functionality.

[00207] Where components are implemented in whole or in part using software, these software elements can be implemented to operate with a computing or processing component capable of carrying out the functionality described with respect thereto.

[00208] In this document, the terms "computer program medium" and "computer usable medium" are used to generally refer to transitory or non-transitory media. These and other various forms of computer program media or computer usable media may be involved in carrying one or more sequences of one or more instructions to a processing device for execution. Such instructions embodied on the medium, are generally referred to as "computer program code" or a "computer program product" (which may be grouped in the form of computer programs or other groupings). When executed, such instructions might enable a computing component to perform features or functions of the present application as discussed herein.

[00209] It should be understood that the various features, aspects and functionality described in one or more of the individual embodiments are not limited in their applicability to the particular embodiment with which they are described. Instead, they can be applied, alone or in various combinations, to one or more other embodiments, whether or not such embodiments are described and whether or not such features are presented as being a part of a described embodiment. Thus, the breadth and scope of the present application should not be limited by any of the above-described exemplary embodiments.

[00210] Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open ended as opposed to limiting. As examples of the foregoing, the term "including" should be read as meaning "including, without limitation" or the like. The term "example" is used to provide exemplary instances of the item in discussion, not an exhaustive or limiting list thereof. The terms "a" or "an" should be read as meaning "at least one," "one or more" or the like. Adjectives such as "conventional," "traditional," "normal," "standard," "known," and terms of similar meaning should not be construed as limiting the item described to a given time period or to an item available as of a given time, but instead should be read to encompass conventional, traditional, normal, or standard technologies that may be available or known now or at any time in the future. Where this document refers to technologies that would be apparent or known to one of ordinary skill in the art, such technologies encompass those apparent or known to the skilled artisan now or at any time in the future.

[00211] The presence of broadening words and phrases such as "one or more," "at least," "but not limited to" or other like phrases in some instances shall not be read to mean that the narrower case is intended or required in instances where such broadening phrases may be absent. The use of the term "component" does not imply that the aspects or functionality described or claimed as part of the component are all configured in a common package. Indeed, any or all of the various aspects of a component, whether control logic or other components, can be combined in a single package or separately maintained and can further be distributed in multiple groupings or packages or across multiple locations.