

Title:
METHODS AND SYSTEMS FOR DEPTH ESTIMATION USING FISHEYE CAMERAS
Document Type and Number:
WIPO Patent Application WO/2022/271499
Kind Code:
A1
Abstract:
The present invention is directed to extended reality systems and methods thereof. According to a specific embodiment, images captured using a single fisheye lens without rectilinear correction are used for depth estimation. The distortion characteristics of the fisheye lens are used in conjunction with a pretrained model to generate a depth map. There are other embodiments as well.

Inventors:
JI PAN (US)
YAN QINGAN (US)
XU YI (US)
Application Number:
PCT/US2022/033563
Publication Date:
December 29, 2022
Filing Date:
June 15, 2022
Assignee:
INNOPEAK TECH INC (US)
International Classes:
E02F9/20
Foreign References:
US20160180507A12016-06-23
US20160163110A12016-06-09
US20170091535A12017-03-30
US20140316665A12014-10-23
US20100097526A12010-04-22
Other References:
ZHEN CHEN; GEORGIADIS ANTHIMOS: "Parameterized Synthetic Image Data Set for Fisheye Lens", 2018 5TH INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND CONTROL ENGINEERING (ICISCE), IEEE, 20 July 2018 (2018-07-20), pages 370 - 374, XP033501686, DOI: 10.1109/ICISCE.2018.00084
Attorney, Agent or Firm:
BRATSCHUN, Thomas D. et al. (US)
Claims:
WHAT IS CLAIMED IS:

1. A method for generating a pretrained model for a fisheye lens, the method comprising: capturing training images using one or more fisheye lenses, the training images having a fisheye distortion associated with the one or more fisheye lenses; storing a training data model in a memory; determining a geometrical difference associated with the fisheye distortion using the training images; calculating a photometric loss value using at least the geometric difference; generating a training depth map using at least the photometric loss value; obtaining a reference data model; generating a reference depth map using at least the reference data model; calculating a ranking loss value using at least the training depth map and the reference depth map; and updating the training data model using at least the photometric loss value and the ranking loss value.

2. The method of claim 1 further comprising determining a depth value of training images using a parallax between two fisheye lenses.

3. The method of claim 1 further comprising determining a depth value of the training images based on a change of image position over a predetermined time interval.

4. The method of claim 1 further comprising projecting the training images into a rectilinear space.

5. The method of claim 4 further comprising unprojecting a uniform pixel grid into a fisheye space.

6. The method of claim 1 further comprising: calculating a scale invariant loss value using at least the training depth map and the reference depth map; and updating the training data model using the scale invariant loss value.

7. The method of claim 6 further comprising normalizing the training depth map and the reference depth map.

8. The method of claim 6 further comprising calculating a cumulative loss value using the photometric loss value, ranking loss value, and the scale invariant loss value.

9. The method of claim 1 further comprising calculating a radial distortion polynomial associated with the fisheye distortion.

10. The method of claim 9 further comprising generating an intermediate rectilinear depth map using at least a function of the radial distortion polynomial.

11. An extended reality apparatus comprising: a housing comprising a front side and a rear side; a camera module being mounted on the housing and comprising a fisheye lens and a sensor, the fisheye lens being positioned on the front side of the housing and characterized by a distortion characteristics, the camera module being configured to capture a plurality of images, the plurality of images being non-rectilinear; a display being positioned on the rear side of the housing; a storage configured to store a pretrained model, the pretrained model being based on at least the distortion characteristics of the fisheye lens; a memory coupled to the sensor and being configured to store the plurality of images; and a processor coupled to the memory and configured to generate a depth map using the plurality of images and the pretrained model.

12. The apparatus of claim 11 wherein: the processor is further configured to generate an object image; and the display is configured to display a composite image including the object image overlaying the plurality of images.

13. The apparatus of claim 11 wherein the pretrained model further comprises weight values and bias values determined using a photometric loss attributed to the distortion characteristics.

14. The apparatus of claim 11 wherein the processor is further configured to perform rectilinear correction on the plurality of images.

15. The apparatus of claim 11 wherein the processor is further configured to identify relative object depth using the pretrained model.

16. A method for operating an extended reality device, the method comprising: capturing a plurality of images using a fisheye lens, the fisheye lens being characterized by a distortion characteristics, the plurality of images being non-rectilinear; storing the plurality of images in a memory; accessing a pretrained model stored in a storage, the pretrained model being based on at least the distortion characteristics of the fisheye lens; generating a depth map using the pretrained model and the plurality of images; generating a plurality of rectilinear images using the plurality of images; and displaying the plurality of rectilinear images.

17. The method of claim 16 further comprising projecting the depth map to a rectilinear space.

18. The method of claim 16 further comprising: identifying a region using the depth map; generating an object image; and generating an augmented reality image by overlaying the object image over the region.

19. The method of claim 18 further comprising displaying the augmented reality image in a rectilinear space.

20. The method of claim 18 further comprising calculating a headset pose.

Description:
METHODS AND SYSTEMS FOR DEPTH ESTIMATION USING FISHEYE

CAMERAS

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] The present application claims priority to U.S. Provisional Application No. 63/215,397, entitled “FisheyeDistill: Self-Supervised Monocular Depth Estimation with Ordinal Distillation for Fisheye Cameras”, filed June 25, 2021, which is commonly owned and incorporated by reference herein for all purposes.

BACKGROUND OF THE INVENTION

[0002] The present invention is directed to extended reality systems and methods thereof.

[0003] Over the last decade, extended reality (XR) devices, including both augmented reality (AR) devices and virtual reality (VR) devices, have become increasingly popular. Important design considerations and challenges for XR devices include performance, cost, and power consumption. Due to various limitations, existing XR devices have been inadequate in various respects, including but not limited to depth determination, for reasons further explained below.

[0004] It is desired to have new and improved XR systems and methods thereof.

BRIEF SUMMARY OF THE INVENTION

[0005] The present invention is directed to extended reality systems and methods thereof. According to a specific embodiment, images captured using a single fisheye lens without rectilinear correction are used for depth estimation. The distortion characteristics of the fisheye lens are used in conjunction with a pretrained model to generate a depth map. There are other embodiments as well.

[0006] A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by a data processing apparatus, cause the apparatus to perform the actions. One general aspect includes a method for generating a pretrained model for a fisheye lens. The method includes capturing training images using one or more fisheye lenses, the training images having a fisheye distortion associated with the one or more fisheye lenses. The method also includes storing a training data model in a memory. The method also includes determining a geometrical difference associated with the fisheye distortion using the training images. The method also includes calculating a photometric loss value using at least the geometric difference. The method also includes generating a training depth map using at least the photometric loss value. The method also includes obtaining a reference data model. The method also includes generating a reference depth map using at least the reference data model. The method also includes calculating a ranking loss value using at least the training depth map and the reference depth map. The method also includes updating the training data model using at least the photometric loss value and the ranking loss value. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

[0007] Implementations may include one or more of the following features. The method may include determining a depth value of training images using a parallax between two fisheye lenses. The method may include determining a depth value of the training images based on a change of image position over a predetermined time interval. The method may include projecting the training images into a rectilinear space. The method may include unprojecting a uniform pixel grid into a fisheye space. The method may include: calculating a scale invariant loss value using at least the training depth map and the reference depth map, and updating the training data model using the scale invariant loss value. The method may include normalizing the training depth map and the reference depth map. The method may include calculating a cumulative loss value using the photometric loss value, ranking loss value, and the scale invariant loss value. The method may include calculating a radial distortion polynomial associated with the fisheye distortion. The method may include generating an intermediate rectilinear depth map using at least a function of the radial distortion polynomial. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

[0008] One general aspect is directed to an extended reality apparatus, which includes a housing that includes a front side and a rear side. The apparatus also includes a camera module that is mounted on the housing and may include a fisheye lens and a sensor. The fisheye lens is positioned on the front side of the housing and characterized by a distortion characteristics. The camera module is configured to capture a plurality of images, the plurality of images being non-rectilinear. The apparatus also includes a display that is positioned on the rear side of the housing. The apparatus also includes a storage configured to store a pretrained model, which is based on at least the distortion characteristics of the fisheye lens. The apparatus also includes a memory that is coupled to the sensor and configured to store the plurality of images. The apparatus also includes a processor coupled to the memory and configured to generate a depth map using the plurality of images and the pretrained model. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

[0009] Implementations may include one or more of the following features. The processor may be further configured to generate an object image. The display is configured to display a composite image including the object image overlaying the plurality of images. The pretrained model may include weight values and bias values determined using a photometric loss attributed to the distortion characteristics. The processor is further configured to perform rectilinear correction on the plurality of images. The processor is further configured to identify relative object depth using the pretrained model. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

[0010] One general aspect includes a method for operating an extended reality device. The method includes capturing a plurality of images using a fisheye lens, which is characterized by a distortion characteristics. The plurality of images is non-rectilinear. The method also includes storing the plurality of images in a memory. The method also includes accessing a pretrained model stored in a storage, the pretrained model being based on at least the distortion characteristics of the fisheye lens. The method also includes generating a depth map using the pretrained model and the plurality of images. The method also includes generating a plurality of rectilinear images using the plurality of images. The method also includes displaying the plurality of rectilinear images. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.

[0011] Implementations may include one or more of the following features. The method may include projecting the depth map to a rectilinear space. The method may include identifying a region using the depth map, generating an object image, and generating an augmented reality image by overlaying the object image over the region. The method may include displaying the augmented reality image in a rectilinear space. The method may include calculating a headset pose. Implementations of the described techniques may include hardware, a method or process, or computer software on a computer-accessible medium.

[0012] It is to be appreciated that embodiments of the present invention provide many advantages over conventional techniques. Among other things, depth estimation techniques according to the present invention allow for accurate and efficient depth estimation in various scenarios including real-time applications. Depth estimation techniques according to various embodiments of the present invention can be used in a wide variety of systems, including XR devices (e.g., head-mounted displays) that are equipped with a single fisheye lens or camera. For example, the depth estimation methods of the present invention can be implemented with various XR devices to generate optimal depth maps from single monocular image input to provide users with an immersive AR/VR experience. Embodiments of the present invention present a cost-effective approach for depth estimation that is able to preserve accurate scene geometry under various conditions (e.g., low-light, over-exposure, texture-less regions, etc.) and can be adopted into existing XR systems via software or firmware update. There are other benefits as well.

[0013] The present invention achieves these benefits and others in the context of known technology. However, a further understanding of the nature and advantages of the present invention may be realized by reference to the latter portions of the specification and attached drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] Figure 1A is a simplified block diagram illustrating an extended reality device according to embodiments of the present invention.

[0015] Figure 1B is a simplified block diagram illustrating an extended reality device according to embodiments of the present invention.

[0016] Figure 2 is a simplified flow diagram illustrating an operation of an extended reality device according to embodiments of the present invention.

[0017] Figure 3 is a simplified block diagram illustrating a system for generating a depth estimation model according to embodiments of the present invention.

[0018] Figure 4 is a simplified block diagram illustrating a machine learning method according to embodiments of the present invention.

[0019] Figure 5 is a simplified flow diagram illustrating a process for generating a depth estimation model according to embodiments of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0020] The present invention is directed to extended reality systems and methods thereof. According to a specific embodiment, images captured using a single fisheye lens without rectilinear correction are used for depth estimation. The distortion characteristics of the fisheye lens are used in conjunction with a pretrained model to generate a depth map. There are other embodiments as well.

[0021] With the advent of virtual reality and augmented reality applications, depth sensing and estimation schemes are becoming more and more popular. The ability to improve situational awareness by estimating the depths of surrounding environments accurately and efficiently from images promises exciting new applications in immersive virtual and augmented realities, robotic control, self-driving, etc. There has been great progress in recent years, especially with the arrival of deep learning technology. However, it remains a challenging task due to various reasons. For example, time of flight (ToF) devices (e.g., Lidar, Radar) provide ground truth depth information that promotes supervised training processes to improve depth estimation accuracy. ToF devices, while accurate, have various disadvantages, such as high cost, computational complexity, high energy consumption, and others. Certain depth estimation methods implemented with stereo cameras also suffer from challenges including high cost, limited field-of-view, etc. Hence, there is a need for more robust and scalable solutions for depth estimation.

[0022] Embodiments of the present invention provide a complete depth estimation system for XR devices (e.g., AR glasses). In various embodiments, stereo fisheye cameras with low-power sensors are used with fisheye lenses to capture a large field of view (FoV), which is advantageous to provide an immersive AR/VR experience. Implemented with fisheye (or ultrawide angle) cameras, various embodiments of the present invention provide a complete system-wide solution that covers various stages of the depth estimation (e.g., training and inference), which may involve features such as real-time operation on edge devices (e.g., mobile phones, embedded devices), to enable depth estimation based on fisheye images (e.g., monocular and/or stereoscopic) in various situations (e.g., low-light/over-exposure conditions, large homogeneous regions, etc.).

[0023] The following description is presented to enable one of ordinary skill in the art to make and use the invention and to incorporate it in the context of particular applications. Various modifications, as well as a variety of uses in different applications will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to a wide range of embodiments. Thus, the present invention is not intended to be limited to the embodiments presented, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[0024] In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without necessarily being limited to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

[0025] The reader’s attention is directed to all papers and documents which are filed concurrently with this specification and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

[0026] Furthermore, any element in a claim that does not explicitly state “means for” performing a specified function, or “step for” performing a specific function, is not to be interpreted as a “means” or “step” clause as specified in 35 U.S.C. Section 112, Paragraph 6. In particular, the use of “step of” or “act of” in the Claims herein is not intended to invoke the provisions of 35 U.S.C. 112, Paragraph 6.

[0027] Please note, if used, the labels left, right, front, back, top, bottom, forward, reverse, clockwise, and counterclockwise have been used for convenience purposes only and are not intended to imply any particular fixed direction. Instead, they are used to reflect relative locations and/or directions between various portions of an object.

[0028] Figure 1A is a simplified block diagram illustrating an extended reality device according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. It is to be understood that the term “extended reality” (XR) is broadly defined, which includes virtual reality (VR), augmented reality (AR), and/or other similar technologies. For example, XR apparatus 115 as shown can be configured as VR, AR, or others. Depending on the specific implementation, XR apparatus 115 may include a small housing for AR applications or a relatively larger housing for VR applications.

[0029] In various embodiments, XR apparatus 115 may include a pair of stereo cameras (e.g., 180A and 180B) that are configured on the front side of XR apparatus 115. Cameras 180A and 180B are respectively mounted on the left and right side of the XR apparatus 115 and are configured to capture a pair of stereo images. It is to be appreciated that additional cameras may be configured below cameras 180A and 180B to provide an additional field of view and range estimation accuracy. For example, cameras 180A and 180B both include ultrawide angle or fisheye lenses that offer large fields of view, and they share a common field of view. Depending on the implementation, cameras 180A and 180B may be configured with different mounting angles. For example, in VR applications where four or more cameras are mounted on XR apparatus 115, cameras are configured with both vertical and horizontal tilt angles to maximize the total field of view. AR applications usually utilize no more than two cameras, and these two cameras may slightly tilt to the side to increase the horizontal field of view, but vertically they are substantially level when worn. In various embodiments, the vertical camera tilt is limited to less than 10 degrees each way, up or down, to avoid potential blockage. When two or more cameras are utilized, for instance, a depth value of the image captured by either camera (e.g., 180A or 180B) may be associated with and can be determined using a parallax between the two camera lenses.
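As a simple illustration of the parallax relationship referenced above, the sketch below computes depth from disparity for an idealized, rectified pinhole stereo pair; this simplification, and the example numbers in it, are assumptions for illustration only rather than the fisheye geometry treated later in this description.

```python
# Minimal sketch: depth from stereo parallax for an idealized, rectified
# pinhole pair. This is a simplification for illustration only; the apparatus
# described here uses fisheye lenses, whose projection model is treated later.

def depth_from_disparity(focal_length_px: float,
                         baseline_m: float,
                         disparity_px: float) -> float:
    """Return metric depth Z = f * B / d for one matched pixel pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Example (illustrative numbers): 600 px focal length, 6 cm baseline,
# 12 px disparity -> 3.0 m
print(depth_from_disparity(600.0, 0.06, 12.0))
```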

[0030] While the XR apparatus 115 may include multiple cameras, embodiments of the present invention afford depth determination using images captured by a single fisheye camera, whose fisheye distortion characteristics are used, among other things, in depth determination. For example, XR apparatus 115 includes a single camera 180C that comprises a fisheye lens, which may be characterized by distortion characteristics. Camera 180C may further include a sensor (not shown) that facilitates image capture. In various implementations, the sensors of the front cameras may be low-resolution monochrome sensors, which are not only energy-efficient (without color filter and color processing thereof), but also relatively inexpensive, both in terms of device size and cost. Depending on the applications, a single fisheye camera 180C is configured to capture one or more monocular images at a predetermined resolution (e.g., 640 x 400). The monocular image captured by the single fisheye camera 180C may have a fisheye distortion that is associated with the distortion characteristics of the fisheye camera 180C. In various embodiments, multiple fisheye cameras may be implemented, but only a single fisheye camera and its images are used for depth estimation.

[0031] In various implementations, the distorted/non-rectilinear monocular image may be fed into a depth estimation system of XR apparatus 115 (described in further detail below) to generate a depth map without performing image rectification on the input image. In some cases, the depth map can identify the relative object depth of the images. Display 185 is configured on the backside of XR apparatus 115. For example, display 185 may be a semitransparent display that overlays information on an optical lens in AR applications. In VR implementations, display 185 may include a non-transparent display.

[0032] Figure 1B is a simplified block diagram illustrating components of extended reality apparatus 115 according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. In some embodiments, an XR apparatus (e.g., an XR headset) 115 includes a computing system 115n as shown, which might include, without limitation, at least one of memory 140, processor 150, storage 160, communication interface 170, a camera module 180, a display 185, and/or peripheral devices 190, and/or the like. The elements of computing system 115n can be configured together to perform a depth estimation process on an input image/video to produce an output image/video augmented with virtual contents.

[0033] In some instances, the processor 150 might communicatively be coupled (e.g., via a bus, via wired connectors, or via electrical pathways (e.g., traces and/or pads, etc.) of printed circuit boards ("PCBs") or integrated circuits ("ICs"), and/or the like) to each of one or more of memory 140, storage 160, communication interface 170, camera module 180, display 185, and/or peripheral devices 190, and/or the like.

[0034] In various embodiments, camera module 180 is configured to capture images and/or videos of the surrounding environment and is mounted on the housing of the XR apparatus 115. Camera module 180 includes one or more cameras that include their respective lenses (e.g., a fisheye lens 181) and sensors (e.g., a sensor 182) used to capture images and/or video of an area in front of the XR apparatus 115. For example, camera module 180 includes cameras 180A and 180B as shown in Figure 1A, and they are configured respectively on the left and right sides of the housing. In other embodiments, camera module 180 includes a single camera (e.g., camera 180C in Figure 1A), which is equipped with fisheye lens 181 and/or sensor 182. Fisheye lens 181 may be positioned on the front side of the housing. In some cases, fisheye lens 181 is characterized by a distortion characteristics and a field-of-view. It is to be appreciated that the distortion characteristics of the fisheye lens 181 allow camera module 180 to capture one or more images characterized by fisheye distortion (i.e., non-rectilinear images) that provides a large field-of-view. In various implementations, the sensor 182 of the camera module 180 may be a low-resolution monochrome sensor, which is not only energy-efficient (without color filter and color processing thereof), but also relatively inexpensive, both in terms of device size and cost.

[0035] In some embodiments, memory 140 is coupled to sensor 182 and configured to store the images and/or videos captured by the camera module 180. For example, when camera module 180 includes two cameras (e.g., 180A and 180B) that are respectively mounted on two sides of the XR apparatus 115 (e.g., left and right side as shown in Figure 1A), camera module 180 can capture one or more stereoscopic image pairs, which can be temporarily stored at memory 140 for further processing. In some cases, when camera module 180 includes a single camera (e.g., 180C) that has a fisheye lens (e.g., fisheye lens 181), camera module 180 can capture one or more monocular images and/or videos, which can be stored at memory 140 and further processed by processor 150 for depth estimation. In some cases, memory 140 is coupled to processor 150 and configured to store instructions executable by the processor 150. Memory 140 may include, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random-access memory ("RAM"), a read-only memory ("ROM"), and/or non-volatile memory, which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like. In various embodiments, memory 140 may be implemented as a part of the processor 150 in a system-on-chip (SoC) arrangement.

[0036] In various implementations, storage 160 is configured to store a pretrained model (e.g., a depth estimation model as described in further detail below) that is based on at least the distortion characteristics of the fisheye lens. For example, the pretrained model is a depth estimation model that accepts the distorted and/or non-rectilinear image inputs and outputs a depth map in response to executable instructions from processor 150. The pretrained model may comprise weight values and bias values associated with the distortion characteristics of the fisheye lens 181. In some cases, the storage 160 is incorporated within a computer system, such as the computing system 115n. In other embodiments, the storage 160 might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that storage 160 can be used to program, configure, and/or adapt a general-purpose computer with the instructions/code stored thereon.

[0037] In various embodiments, processor 150 is configured to perform various executable instructions for image processing. For example, processor 150 is configured to generate a depth map using one or more images/videos stored in memory 140. In some cases, processor 150 may further retrieve and/or refine the pretrained model stored in storage 160 to identify relative object depth for depth map generation. Depending on the implementations, processor 150 may perform a rectilinear correction on the input images to assist the depth estimation process.

[0038] Processor 150 might include, without limitation, a central processing unit (CPU) 151, a graphical processing unit (GPU) 152, and/or a neural processing unit (NPU) 153. Different types of processing units are optimized for different types of computations. For example, CPU 151 handles various types of system functions, such as moving input images/videos to memory 140, and retrieving and executing executable instructions (e.g., rectilinear correction, depth feature extraction, pose feature extraction, loss term calculation, etc.). In some cases, GPU 152 may be specially designed to manipulate graphic creation and image processing, which is advantageous to process input images/videos for depth estimation and produce output images/videos augmented with virtual contents. For example, the output image/video may be sent by GPU 152 to display 185 to provide an immersive AR/VR experience to a user 120n. In various implementations, NPU 153 is specifically configured to utilize at least the input images/videos and a depth estimation model (e.g., the pretrained model) to perform a depth estimation process, through which a depth map can be obtained, and the parameters of the depth estimation model may be adjusted/updated to further improve the estimation accuracy. It is to be appreciated that processor 150 may be configured as a multi-core processor with one or more processing units, each of which can read and execute program instructions separately for efficient processing and overall power consumption reduction.

[0039] In AR applications, the field of view of camera module 180 overlaps with a field of view of an eye of the user 120n. The display screen(s) and/or projector(s) 185 may be used to display or project the generated image overlays on images or video of the actual area. The communication interface 170 provides wired or wireless communication with other devices and/or networks. For example, communication interface 170 may be connected to a computer for tethered operations, where the computer provides the processing power needed for graphic-intensive applications.

[0040] In various implementations, computing system 115n further includes one or more peripheral devices 190 configured to improve user interaction in various aspects. For example, peripheral devices 190 may include, without limitation, at least one of speaker(s) or earpiece(s), eye-tracking sensor(s), light source(s), audio sensor(s) or microphone(s), touch screen(s), keyboard, mouse, and/or other input/output devices.

[0041] Figure 2 is a simplified flow diagram illustrating an operation of an extended reality device according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.

[0042] At step 202, a plurality of images is captured using a fisheye lens. For example, the fisheye lens is characterized by a distortion characteristics such that the plurality of images captured by the fisheye lens are non-rectilinear. When a pair of stereo cameras (e.g., cameras 180A and 180B in Figure 1A) are provided, the images may be captured in the form of a stereoscopic image pair (e.g., left image and right image). In other embodiments, when a single monocular camera is configured, images may be captured in the form of a plurality of monocular images and/or videos (e.g., a plurality of sequential monocular image frames). The camera may be characterized by a plurality of extrinsic parameters associated with its position and orientation (e.g., camera pose or headset pose) and a plurality of intrinsic parameters associated with its geometric properties (e.g., focal length, aperture, resolution, principal point, field-of-view, fisheye distortion characteristics, etc.). It is to be appreciated that the images captured by the fisheye lens offer a large field-of-view that provides more coverage of the surrounding environment.
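The following is a minimal sketch of how the intrinsic and extrinsic parameters listed above might be grouped for use by a depth estimation pipeline; the field names, the four-coefficient distortion model, and the 640 x 400 default resolution are illustrative assumptions, not structures defined by the patent.

```python
# Illustrative container for the camera parameters mentioned above.
# Field names and the four-coefficient distortion model are assumptions
# for this sketch, not terms defined by the patent.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class FisheyeCamera:
    # Intrinsic parameters (geometric properties)
    fx: float
    fy: float
    cx: float
    cy: float
    width: int = 640
    height: int = 400
    # Polynomial radial distortion coefficients (fisheye distortion characteristics)
    dist_coeffs: np.ndarray = field(default_factory=lambda: np.zeros(4))
    # Extrinsic parameters (camera/headset pose): 4x4 world-from-camera transform
    pose: np.ndarray = field(default_factory=lambda: np.eye(4))

    @property
    def K(self) -> np.ndarray:
        """3x3 intrinsic matrix built from the focal lengths and principal point."""
        return np.array([[self.fx, 0.0, self.cx],
                         [0.0, self.fy, self.cy],
                         [0.0, 0.0, 1.0]])
```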

[0043] At step 204, the plurality of images is stored in a memory. For example, a buffer memory may be used to store the plurality of images. In various embodiments, the captured images are stored temporarily for the purpose of processing, and may first be stored in volatile memory and later transferred to non-volatile memory.

[0044] At step 206, a pretrained model stored in a storage is accessed and/or retrieved by, for example, a processor 150 as shown in Figure 1B. In some cases, the pretrained model is a depth estimation model that is trained with a plurality of stereoscopic image pairs and/or a plurality of temporal monocular images. The pretrained estimation model may be trained by a depth estimation system, which will be described in further detail below. The pretrained model is based on at least the distortion characteristics of the fisheye lens. For example, the pretrained model can utilize the distortion characteristics to generate a depth map. The pretrained model may include weight values and bias values associated with the distortion characteristics. It is to be appreciated that the distorted input images may directly be fed into the pretrained model for depth estimation without going through an image rectification process, which can not only preserve the wide field-of-view of the image, but also significantly save the computational resources and time such that real-time applications based on fisheye depth maps may be realized.

[0045] At step 208, a depth map is generated. For example, the depth map is generated using at least the pretrained model and the plurality of images. In various embodiments, the pretrained model receives the plurality of input images and its metadata including the intrinsic and/or extrinsic parameters of the camera. The pretrained model may then output a depth map corresponding to the image of the scene based on the parameters obtained through a depth estimation training process (described in further detail below). In some cases, the pretrained model may be configured to calculate a camera/headset pose. For example, a parallax between two fisheye lenses may be determined by the pretrained model based on at least the extrinsic and/or intrinsic parameters of the cameras to provide depth information. In other embodiments, the pretrained model may calculate a change of image position between two or more monocular images during a predetermined time interval to obtain depth information. The generated depth map reflects the relative distance of the surrounding objects from the XR apparatus and helps to improve situational awareness and/or virtual content generation, and/or the like. In some cases, the parameters of the pretrained model may be updated to refine the pretrained model via, for example, an NPU to improve depth estimation accuracy and efficiency.

[0046] At step 210, a plurality of rectilinear images is generated. For example, the plurality of rectilinear images is generated using the plurality of images by projecting the depth map to a rectilinear space. In various implementations, the depth map may be configured to identify a region of interest for virtual content generation. For example, the depth map may be used (e.g., by processor 150 as shown in Figure 1B) to generate an object image, which may overlay the identified region to generate a composite image (e.g., an augmented reality image). The augmented reality image may be displayed in a rectilinear space. It is to be appreciated that the rectilinear images provide various benefits such as increased image resolution for improving user experience.
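A hedged sketch of steps 206 and 208 follows: a distorted fisheye frame is fed directly to a pretrained depth network, without rectification, to obtain a depth map. The `DepthNet` architecture and the commented-out checkpoint path are hypothetical placeholders; the patent does not specify a particular network or file format.

```python
# Hedged sketch of steps 206-210: run a pretrained depth model directly on a
# distorted fisheye frame (no rectification), then obtain a depth map.
# `DepthNet` and the checkpoint path are hypothetical stand-ins.
import torch

class DepthNet(torch.nn.Module):
    """Placeholder monocular depth network (an assumption, not the patent's model)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(1, 32, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 1, 3, padding=1), torch.nn.Softplus())

    def forward(self, x):
        return self.net(x)

model = DepthNet()
# model.load_state_dict(torch.load("pretrained_fisheye_depth.pt"))  # hypothetical checkpoint
model.eval()

# A 640x400 monochrome fisheye frame, as described for the low-resolution sensor.
frame = torch.rand(1, 1, 400, 640)
with torch.no_grad():
    depth_map = model(frame)          # step 208: depth map in fisheye space
print(depth_map.shape)                # torch.Size([1, 1, 400, 640])
```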

[0047] At step 212, the plurality of rectilinear images is displayed via, for example, display 185 as shown in Figure 1B. In various implementations, the display provides a user with a constant feed of augmented video/images that includes changes in the scenes as the user/headset moves. For instance, the computing system 115n of the XR apparatus 115 may constantly receive fisheye images as the user/headset changes position; the pretrained model thus iteratively generates depth maps to update the output image including the virtual content (e.g., the object image) shown on the display.

[0048] Figure 3 is a simplified block diagram illustrating a system for generating a depth estimation model according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.

[0049] As shown, a depth estimation training system 300 includes, without limitation, a processor module 305, a CPU 310, a GPU 315, an NPU 320, a communication interface 325, a storage 330, a memory 335, a data bus 340, and/or the like. The elements of depth estimation system 300 can be configured together to perform a depth estimation training process and generate a depth estimation model, as further described below.

[0050] In various embodiments, processor module 305 may communicatively be coupled to communication interface 325, storage 330, and memory 335 via data bus 340. The communication interface 325 provides wired or wireless communication with other devices and/or networks. For example, system 300 may access a plurality of training images captured by one or more fisheye lenses from network 350 and/or a reference data model via communication interface 325 for depth estimation training. Storage 330 may include a dynamic random-access memory (DRAM) and/or non-volatile memory. For example, the plurality of training images may be stored in storage 330 to train a training data model.

[0051] In various implementations, memory 335 is configured to store one or more sequences of instructions executable by processor module 305. For example, the training data model may be stored in the memory and is continually updated in response to processor 305 executing a sequence of instructions (e.g., depth estimation algorithms, pose extraction algorithms, loss term calculation algorithms, and/or the like). In some cases, the training images may be processed by GPU 315 to extract depth and/or pose features from the image during the training process. CPU 310 and/or NPU 320 may iteratively perform a depth estimation training algorithm to refine the training data model.

[0052] Figure 4 is a simplified block diagram illustrating a machine learning method according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. In various implementations, a depth estimation training process 400 implemented with a computing system (e.g., depth estimation training system 300) uses one or more parallel pipelines to train a depth estimation model with various loss terms and/or a reference data model.

[0053] The depth estimation training process 400 first starts with receiving a plurality of training images 405. In various implementations, the plurality of training images 405 is captured by one or more fisheye lenses. For example, the plurality of training images 405 may be captured by a pair of stereo cameras respectively equipped with a fisheye lens and/or a sensor. The plurality of training images 405 may therefore be acquired in the form of a plurality of stereoscopic image pairs. In other cases, the plurality of training images 405 may be captured by a single monocular camera equipped with a fisheye lens and/or a sensor. The plurality of training images 405 may then be acquired in the form of a plurality of temporal monocular images. For example, the training images 405 may be a plurality of sequential frames from a monocular video.

[0054] It is to be appreciated that the camera(s) may be characterized by a plurality of extrinsic parameters associated with its position and orientation (e.g., camera pose) and/or a plurality of intrinsic parameters associated with its geometric properties (e.g., focal length, aperture, resolution, field-of-view, principal point, fisheye distortion characteristics, etc.). For example, the fisheye lens is characterized by a distortion characteristics, which results in a fisheye distortion of the training images captured. In some embodiments, the plurality of training images 405 may be sent to the depth estimation training system 300 via communication interface 325 (as shown in Figure 3). The plurality of extrinsic and/or intrinsic parameters of the camera(s) may be provided to the depth estimation training system 300 along with the plurality of training images 405.

[0055] In various implementations, the depth estimation training system 300 includes a pose network 410 configured to estimate a geometrical difference between the training images 405. For example, when using a plurality of temporal monocular images as training data, pose network 410 may be trained, via a deep learning process, to predict the geometrical difference (e.g., camera pose difference) between a first monocular image characterized by a first timestamp and a second image characterized by a second timestamp. The geometrical difference may be associated with a change of image position over a predetermined time interval (e.g., from the first timestamp to the second timestamp).

[0056] Similarly, when the training images 405 are provided as a plurality of stereoscopic image pairs, pose network 410 estimates a geometrical difference between the pair of stereoscopic images. The geometrical difference may be determined using one or more extrinsic and/or intrinsic parameters (e.g., fisheye distortion characteristics). In various implementations, the pose features extracted by the pose network 410 help to predict a transformation matrix 415 between the training images 405 (e.g., the first monocular image and the second monocular image). In some cases, the pose difference may be fixed. The pose network 410 allows the depth estimation training system 300 to learn the pose relationships among the training images 405, which may be used in later processes such as view synthesis and/or loss term calculation.
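Below is a hedged sketch of a pose network in the spirit of pose network 410: two frames are encoded jointly, a 6-DoF relative pose (axis-angle plus translation) is regressed, and the result is assembled into a 4x4 transformation matrix such as transformation matrix 415. The specific architecture and pose parameterization are assumptions for illustration, not the patent's design.

```python
# Hedged sketch of a pose network: two concatenated monochrome frames are
# encoded, a 6-DoF relative pose is regressed, and Rodrigues' formula turns
# it into a 4x4 transformation matrix. Architecture is an assumption.
import torch

class PoseNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Conv2d(2, 16, 7, stride=2, padding=3), torch.nn.ReLU(),
            torch.nn.Conv2d(16, 32, 5, stride=2, padding=2), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1))
        self.fc = torch.nn.Linear(32, 6)  # 3 rotation (axis-angle) + 3 translation

    def forward(self, img_a, img_b):
        x = torch.cat([img_a, img_b], dim=1)
        feat = self.encoder(x).flatten(1)
        return self.fc(feat) * 0.01       # bias toward small initial motions

def pose_vec_to_matrix(vec):
    """Convert (B,6) axis-angle + translation vectors to (B,4,4) transforms."""
    rot_vec, trans = vec[:, :3], vec[:, 3:]
    theta = rot_vec.norm(dim=1, keepdim=True).clamp(min=1e-8)   # rotation angle
    axis = rot_vec / theta
    skew = torch.zeros(vec.shape[0], 3, 3)
    skew[:, 0, 1], skew[:, 0, 2] = -axis[:, 2], axis[:, 1]
    skew[:, 1, 0], skew[:, 1, 2] = axis[:, 2], -axis[:, 0]
    skew[:, 2, 0], skew[:, 2, 1] = -axis[:, 1], axis[:, 0]
    eye = torch.eye(3).expand(vec.shape[0], 3, 3)
    sin_t = torch.sin(theta)[..., None]
    cos_t = torch.cos(theta)[..., None]
    R = eye + sin_t * skew + (1 - cos_t) * (skew @ skew)        # Rodrigues' formula
    T = torch.eye(4).repeat(vec.shape[0], 1, 1)
    T[:, :3, :3], T[:, :3, 3] = R, trans
    return T
```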

[0057] According to some embodiments, the depth estimation training system 300 includes a depth network 420 configured to generate a training depth map 425. For example, the depth network 420 takes the training images 405 (e.g., a plurality of distorted monocular images) as input and performs a deep learning process to predict a dense depth map as the training depth map 425. In various embodiments, a convolutional neural network is used to extract depth features and generate the training depth map 425 at a predetermined resolution. It is to be appreciated that a depth value of the training images 405 may be determined in various ways. In an example, a depth value of the training images 405 is determined using a parallax between two fisheye lenses that are used to capture the training images. The parallax may be associated with a plurality of extrinsic and/or intrinsic parameters of the fisheye lenses. In other embodiments, a depth value of a training image is determined based on a change of image position over a predetermined time interval.

[0058] In various implementations, the transformation matrix 415 generated by pose network 410 and the training depth map 425 produced by depth network 420 may be utilized to calculate a photometric loss 460 via a combination of unprojection and projection operations. For example, the unprojection and/or projection operations may be performed using a radial distortion polynomial associated with the fisheye distortion.

[0059] In some embodiments, a view synthesis module 445 is configured to perform a view synthesis process to generate a synthesis image 450 from two neighboring fisheye images (e.g., the first monocular image and the second monocular image). The view synthesis process learns a depth and pose relationship from the training images 405 using at least the transformation 415 and the training depth map 425. For example, with a pair of stereoscopic training images, the synthesis image 450 may be generated by projecting the right image onto the left image or vice versa. Similarly, with two temporal monocular images, the synthesis image 450 may be produced by projecting the first monocular image onto the second monocular image or vice versa.

[0060] In some embodiments, the view synthesis module 445 performs an unprojection operation to unproject a uniform pixel grid into a fisheye space. The uniform pixel grid may be characterized by a resolution that is substantially the same as the input training images (e.g., around 640 x 400). In some cases, due to the existence of fisheye distortion, an intermediate rectilinear depth map 440 is generated first using training depth map 425 to transform 3D points into camera coordinates. The intermediate rectilinear depth map 440 may be obtained by projecting the training images 405 into a rectilinear space. Once the intermediate rectilinear depth map 440 is obtained, the unprojection operation is then performed by applying a radial distortion polynomial function.

[0061] In various implementations, the view synthesis module 445 is configured to further perform a projection operation using the intermediate rectilinear depth map 440 and the transformation matrix 415 to generate the distorted synthesis image 450. By adopting view synthesis as a supervision signal, the training data model 490 is trained in a self-supervised manner without requiring large amounts of ground truth data.

[0062] In a specific example, the view synthesis module 445 operates to synthesize virtual target images from neighboring views. To achieve this, a projection function $F$ is introduced to map 3D points $P_i$ in 3D space to image coordinates $p_i$. Accordingly, the corresponding unprojection function $F^{-1}$ is also used to convert image pixels, based on the training depth map $D$, into 3D space in order to acquire color information from other views. For fisheye cameras, given a 3D point $P_i = (X_i, Y_i, Z_i)$ in camera coordinates, the projection from the 3D point $P_i$ to the distorted image pixel $p_i = (u_i, v_i)$ can be obtained through the following mapping equation (Eqn. 1):

$$u_i = f_x \, \rho(\theta_i) \, \frac{X_i}{r_i} + c_x, \qquad v_i = f_y \, \rho(\theta_i) \, \frac{Y_i}{r_i} + c_y, \qquad r_i = \sqrt{X_i^2 + Y_i^2},$$

where $\theta_i = \arctan(r_i / Z_i)$ is the angle of incidence, $\rho(\theta) = \theta \, (1 + k_1\theta^2 + k_2\theta^4 + k_3\theta^6 + k_4\theta^8)$ is the polynomial radial distortion model mapping the incidence angle to the image radius, $(f_x, f_y)$ and $(c_x, c_y)$ stand for the focal length and the principal point derived from the intrinsic matrix $K$, and $k_1, \ldots, k_4$ denote the set of fisheye distortion coefficients.

[0063] For the unprojection operation, due to the existence of fisheye distortion, an intermediate rectified depth map $\tilde{D}$ is generated first from the training depth map $D$ to transform a pixel $p_i$ into camera coordinates, by warping a pixel grid according to Eqn. 1. The rectified depth map $\tilde{D}$ is then used to unproject the grid into 3D by applying the unprojection function (Eqn. 2):

$$P_i = F^{-1}(p_i, \tilde{D}) = \tilde{D}(p_i) \, K^{-1} \tilde{p}_i,$$

where $\tilde{p}_i$ denotes the pixel $p_i$ in homogeneous coordinates.

[0064] In some cases, the view synthesis process includes: (1) unprojecting a uniform pixel grid, which has the same resolution as the input frames, through the unprojection function $F^{-1}$ (see Eqn. 2 above); and (2) projecting those 3D points by the projection function $F$ (see Eqn. 1 above), together with the associated pose information from pose network 410, to obtain distorted synthesis images 450.
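The sketch below illustrates the two-step view synthesis just described, using a polynomial radial distortion model of the kind reconstructed in Eqn. 1; the four-term polynomial, helper names, and bilinear sampling details are assumptions for illustration rather than the patent's exact formulation.

```python
# Hedged sketch of the fisheye projection (Eqn. 1) and the two-step view
# synthesis: (1) unproject a uniform pixel grid with a rectilinear depth map,
# (2) transform and re-project into the source view, then sample bilinearly.
import torch
import torch.nn.functional as F

def project_fisheye(points, K, dist):
    """Project 3D camera-frame points (B,3,N) to distorted pixels (B,2,N)."""
    X, Y, Z = points[:, 0], points[:, 1], points[:, 2]
    r = torch.sqrt(X * X + Y * Y).clamp(min=1e-8)
    theta = torch.atan2(r, Z)                      # angle of incidence
    t2 = theta * theta
    rho = theta * (1 + dist[0] * t2 + dist[1] * t2**2
                     + dist[2] * t2**3 + dist[3] * t2**4)
    u = K[0, 0] * rho * X / r + K[0, 2]
    v = K[1, 1] * rho * Y / r + K[1, 2]
    return torch.stack([u, v], dim=1)

def synthesize_view(src_img, rect_depth, T_src_from_tgt, K, dist):
    """Warp a source fisheye image into the target view (distorted synthesis image).
    rect_depth: (B,1,H,W) intermediate rectilinear depth for the target view."""
    B, _, H, W = src_img.shape
    # 1) Unproject a uniform pixel grid into 3D using the rectilinear depth (Eqn. 2).
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(1, 3, -1)
    cam_pts = torch.inverse(K)[None] @ grid * rect_depth.view(B, 1, -1)
    # 2) Move the points into the source frame and project with Eqn. 1.
    cam_pts_h = torch.cat([cam_pts, torch.ones(B, 1, H * W)], dim=1)
    src_pts = (T_src_from_tgt @ cam_pts_h)[:, :3]
    pix = project_fisheye(src_pts, K, dist)
    # Normalize pixel coordinates to [-1, 1] for bilinear sampling.
    pix_x = 2 * pix[:, 0] / (W - 1) - 1
    pix_y = 2 * pix[:, 1] / (H - 1) - 1
    samp = torch.stack([pix_x, pix_y], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, samp, align_corners=True)
```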

[0065] According to various embodiments, the distorted synthesis image 450 is utilized to calculate a photometric loss 460 between two neighboring fisheye images. For example, photometric loss 460 is calculated by comparing the synthesis image 450 and one image of the training pair using at least the geometric difference. Such a process may be performed iteratively via a deep learning algorithm to minimize the photometric loss and tune parameters for a training data model 490. In this way, the 3D geometry can be preserved by incorporating photometric loss 460 in training the depth estimation model and/or updating the training depth map 425.

[0066] In further embodiments, an edge-aware depth smoothness loss may be introduced by a smoothing module (not shown) to account for color similarity and/or irregularity between neighboring pixels. In some cases, an auto-masking mechanism may be adopted to mask out static pixels when computing the photometric loss 460. For example, one or more static regions may be identified, and a mask may be applied to discount these regions when calculating the photometric loss 460.

[0067] In a specific example, a target fisheye image $I_t$ and a source fisheye image $I_s$ are used to calculate the photometric loss. The depth network 420 and the pose network 410 are jointly trained to predict a dense depth map $D_t$ and a relative transformation matrix $T_{t \to s}$. The per-pixel minimum photometric loss can be calculated as follows:

$$L_p = \min_{s} \; pe\!\left(I_t, \, I_{s \to t}\right), \qquad I_{s \to t} = I_s\!\left\langle \mathrm{proj}\!\left(D_t, T_{t \to s}\right) \right\rangle,$$

where

$$pe(I_a, I_b) = \frac{\alpha}{2}\left(1 - \mathrm{SSIM}(I_a, I_b)\right) + (1 - \alpha)\left\lVert I_a - I_b \right\rVert_1$$

denotes a weighted combination of the L1 and Structured SIMilarity (SSIM) losses, and $\langle \cdot \rangle$ is a bilinear sampling operator. The edge-aware smoothness loss is defined as follows:

$$L_s = \left|\partial_x d^{*}\right| e^{-\left|\partial_x I_t\right|} + \left|\partial_y d^{*}\right| e^{-\left|\partial_y I_t\right|},$$

where $d^{*} = d / \bar{d}$ is a mean-normalized inverse depth.
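A hedged sketch of the photometric and smoothness losses in the form given above (a per-pixel minimum over an SSIM/L1 mix, plus edge-aware smoothness on mean-normalized inverse depth) follows. The simplified average-pooled SSIM and the value alpha = 0.85 are common choices assumed here, not values stated in the patent.

```python
# Hedged sketch of the photometric and edge-aware smoothness losses in the
# reconstructed form above. The 3x3 average-pool SSIM and alpha = 0.85 are
# assumptions for illustration.
import torch
import torch.nn.functional as F

def ssim(x, y, C1=0.01**2, C2=0.03**2):
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sig_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x**2
    sig_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y**2
    sig_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sig_xy + C2)
    den = (mu_x**2 + mu_y**2 + C1) * (sig_x + sig_y + C2)
    return (num / den).clamp(0, 1)

def photometric_error(img_a, img_b, alpha=0.85):
    """pe(I_a, I_b) = alpha/2 * (1 - SSIM) + (1 - alpha) * |I_a - I_b|."""
    l1 = (img_a - img_b).abs().mean(1, keepdim=True)
    ssim_term = (1 - ssim(img_a, img_b)).mean(1, keepdim=True) / 2
    return alpha * ssim_term + (1 - alpha) * l1

def min_photometric_loss(target, synthesized_sources):
    """Per-pixel minimum over all synthesized source views, then mean."""
    errs = torch.stack([photometric_error(target, s) for s in synthesized_sources], 0)
    return errs.min(dim=0).values.mean()

def smoothness_loss(disp, img):
    """Edge-aware smoothness on mean-normalized inverse depth."""
    d = disp / (disp.mean([2, 3], keepdim=True) + 1e-7)
    dx = (d[:, :, :, :-1] - d[:, :, :, 1:]).abs()
    dy = (d[:, :, :-1, :] - d[:, :, 1:, :]).abs()
    ix = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, keepdim=True)
    iy = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()
```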

[0068] In various implementations, a reference data model 430 is introduced to train the training data model 490. For example, the reference data model 430 that is diversely trained across various datasets might be obtained from the network 350 via a communication interface 325 (as shown in Figure 3). The reference data model 430 is well suited for, among other things, predicting relative depth relationships among pixels (as it is trained on large and diverse datasets), which promotes generality and cross-dataset learning in training the depth estimation model. As such, it is desired to introduce the ordinal information between neighboring pixels (e.g., whether a pixel is farther or closer than its neighbors) from the reference data model 430 to train the training data model 490. For example, reference data model 430 receives the training images 405 and generates a reference depth map 435, which includes the depth ordering information (e.g., at per pixel level).

[0069] In various embodiments, a ranking loss 480 is calculated using the training depth map 425 and the reference depth map 435. For example, the ranking loss 480 may be calculated by comparing each pixel to its neighboring pixel (e.g., its left horizontal neighbor, and/or its top vertical neighbor, and/or the like). Ranking loss 480 allows the training data model 490 to learn the depth ordering relationships among pixels from the reference data model 430.

[0070] In some embodiments, a scale invariant loss 470 is calculated using at least the training depth map 425 and the reference depth map 435. The scale invariant loss 470 is configured to strengthen the supervision in texture-less regions (e.g., low-light regions, overexposure regions, homogenous regions in indoor environments, and/or the like). It is to be appreciated that the reference data model 430 may be greater than the training data model in terms of size, complexity, and/or the like. In some cases, the training depth map 425 and the reference depth map 435 are normalized to improve scale awareness of the depth estimation model. In various implementations, a cumulative loss 485 is calculated using two or more of the loss terms to refine the training data model 490. For example, the cumulative loss value is calculated using the photometric loss 460, ranking loss 480, and scale invariant loss 470.

[0071] In a specific example, the training depth map is denoted as $D$ and the reference depth map is denoted as $D^{MiDaS}$. The ranking loss is calculated from each pixel $p$ to its left horizontal neighbor $p_h$ as follows:

$$L_{rank}^{h}(p) = \begin{cases} \log\!\left(1 + e^{\,D(p_h) - D(p)}\right), & \text{if } D^{MiDaS}(p) - D^{MiDaS}(p_h) \geq \alpha, \\ \log\!\left(1 + e^{\,D(p) - D(p_h)}\right), & \text{if } D^{MiDaS}(p) - D^{MiDaS}(p_h) \leq -\alpha, \\ \left(D(p) - D(p_h)\right)^2, & \text{otherwise}, \end{cases}$$

where $\alpha$ is a hyper-parameter controlling the ranking gap. Similarly, the ranking loss $L_{rank}^{v}$ can also be calculated from each pixel to its top vertical neighbor $p_v$. The final ranking loss may be calculated as the sum of the horizontal and vertical terms, $L_{rank} = L_{rank}^{h} + L_{rank}^{v}$.
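One common pairwise formulation consistent with the description above is sketched below: each pixel is compared with its left (and, by transposition, top) neighbor, and the reference depth map supplies the ordinal label. The specific thresholding and the default gap value are assumptions and may differ from the patent's exact formulation.

```python
# Hedged sketch of an ordinal ranking loss: the reference depth map decides
# whether a pixel should be farther than, closer than, or equal to its
# neighbor, and the training depth map is penalized accordingly.
import torch
import torch.nn.functional as F

def ranking_loss_horizontal(pred, ref, alpha=0.03):
    """pred, ref: (B,1,H,W) depth maps. Compare each pixel with its left neighbor."""
    d, d_ref = pred[..., 1:], ref[..., 1:]            # pixel p
    dl, dl_ref = pred[..., :-1], ref[..., :-1]        # left horizontal neighbor of p
    gap = d_ref - dl_ref
    farther = gap > alpha                              # reference says p is farther
    closer = gap < -alpha                              # reference says p is closer
    diff = d - dl
    loss = torch.where(farther, F.softplus(-diff),     # encourage d > dl
           torch.where(closer, F.softplus(diff),       # encourage d < dl
                       diff ** 2))                     # otherwise encourage equality
    return loss.mean()

def ranking_loss(pred, ref, alpha=0.03):
    """Sum of the horizontal and vertical (top-neighbor) ranking terms."""
    vert = ranking_loss_horizontal(pred.transpose(2, 3), ref.transpose(2, 3), alpha)
    return ranking_loss_horizontal(pred, ref, alpha) + vert
```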

[0072] In some cases, to calculate the scale invariant loss 470, the training depth map 425 and the reference depth map 435 may first be normalized to generate normalized depth maps $\hat{D}$ and $\hat{D}^{MiDaS}$, respectively. The scale invariant loss is defined as follows:

$$L_{si} = \frac{1}{N} \sum_{p} \left| \hat{D}(p) - \hat{D}^{MiDaS}(p) \right|,$$

where $N$ is the number of pixels.

[0073] In some embodiments, the cumulative loss 485 is a weighted combination of the photometric loss, the edge-aware depth smoothness loss, the ranking loss, and the scale invariant loss, i.e.,

$$L = L_p + \lambda_s L_s + \lambda_r L_{rank} + \lambda_{si} L_{si},$$

where $\lambda_s$, $\lambda_r$, and $\lambda_{si}$ are weighting coefficients.
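The sketch below pairs a median-based normalization with an L1 comparison as one plausible reading of the scale invariant loss, and combines the four loss terms with illustrative weights; both the normalization scheme and the weighting coefficients are assumptions rather than values given in the patent.

```python
# Hedged sketch of the scale invariant loss and the cumulative loss described
# above. The median/mean-absolute-deviation normalization and the weighting
# coefficients are assumptions for illustration.
import torch

def normalize_depth(d):
    """Median/MAD-style normalization of a (B,1,H,W) depth map (an assumption)."""
    flat = d.flatten(1)
    t = flat.median(dim=1, keepdim=True).values
    s = (flat - t).abs().mean(dim=1, keepdim=True) + 1e-7
    return ((flat - t) / s).view_as(d)

def scale_invariant_loss(pred, ref):
    """Mean absolute difference between the normalized depth maps."""
    return (normalize_depth(pred) - normalize_depth(ref)).abs().mean()

def cumulative_loss(l_photo, l_smooth, l_rank, l_si,
                    w_smooth=1e-3, w_rank=0.1, w_si=0.1):
    """Weighted combination of the four loss terms (weights are illustrative)."""
    return l_photo + w_smooth * l_smooth + w_rank * l_rank + w_si * l_si
```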

[0074] Figure 5 is a simplified flow diagram illustrating a process for generating a depth estimation model according to embodiments of the present invention. This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. For example, one or more steps may be added, removed, repeated, replaced, modified, rearranged, and/or overlapped, and they should not limit the scope of the claims.

[0075] At step 502, a plurality of training images is captured. For example, the plurality of training images is captured by one or more fisheye lenses and have a fisheye distortion associated with the one or more fisheye lenses. In an example, the training images are a plurality of stereoscopic image pairs. In another embodiment, the training images are a plurality of monocular images or a plurality of temporal monocular image frames from a video.

[0076] At step 504, a training data model is stored in a memory. For example, the training data model is first temporarily stored in volatile memory for further processing and can be constantly refined during a training process. The training data model may later be transferred to non-volatile memory once the training process completes.

[0077] At step 506, a geometrical difference associated with the fisheye distortion is determined using the training images. For example, one or more extrinsic and/or intrinsic parameters of the fisheye lens may be used in determining the geometrical difference. The geometrical difference may reflect the pose difference between the training images and can be used in later processes such as view synthesis, which generates a supervision signal for training the depth estimation model.

[0078] At step 508, a photometric loss value is calculated using at least the geometric difference. For example, the photometric loss value may be calculated by comparing a synthesis image generated by view synthesis and at least one of the original training images. In some cases, the photometric loss value is associated with the 3D geometry of the scene embodied in one or more training images.

[0079] At step 510, a training depth map is generated using at least the photometric loss value. For example, the training depth map — generated by a neural network — can be used to calculate one or more loss terms (e.g., photometric loss, and/or the like), which are configured to update the training data model.

[0080] At step 512, a reference data model is obtained. For example, the reference data model is a depth estimation model that has been trained on a large amount of diverse data and can predict the depth ordering relationships among neighboring pixels. In various implementations, the reference data model is larger than the training data model.

[0081] At step 514, a reference depth map is generated using at least the reference data model. For example, the reference data model takes the training images as input and outputs the reference depth map based on the training images. The reference depth map generated by the reference data model may estimate the relative depth relationship among neighboring pixels.

[0082] At step 516, a ranking loss value is calculated using at least the training depth map and the reference depth map. For example, the ranking loss value may be calculated by comparing each pixel to its neighboring pixel (e.g., its left horizontal neighbor, and/or its top vertical neighbor, and/or the like). By minimizing the ranking loss, the training data model gradually learns the depth ordering relationships from the reference data model. Since the training data model is smaller than the reference data model, it is able to predict similar depth ordering relationships with less computational time and resources.

[0083] At step 518, the training data model is updated using at least the photometric loss value and the ranking loss value. For example, the photometric loss value and the ranking loss value may be continually minimized as the training process progresses, and the training data model may thus be constantly updated. In some cases, additional loss terms (e.g., edge-aware depth smoothness loss, scale invariant loss, and/or the like) may also be introduced to facilitate the depth estimation training process.

[0084] This diagram is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications.
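Tying steps 506 through 518 together, the following hedged sketch shows one training iteration that reuses the illustrative helpers defined in the earlier sketches (DepthNet, PoseNet, pose_vec_to_matrix, synthesize_view, min_photometric_loss, smoothness_loss, ranking_loss, scale_invariant_loss, cumulative_loss); `rectify_depth` is an additional hypothetical helper standing in for the intermediate rectilinear depth step.

```python
# Hedged sketch of one training iteration covering steps 506-518, reusing the
# illustrative helpers defined in the earlier sketches. `rectify_depth` is a
# hypothetical placeholder for warping the fisheye depth into rectilinear space.
import torch

def training_step(depth_net, pose_net, reference_model, optimizer,
                  frame_t, frame_s, K, dist):
    optimizer.zero_grad()
    depth_t = depth_net(frame_t)                                 # training depth map
    pose = pose_vec_to_matrix(pose_net(frame_t, frame_s))        # geometrical difference (step 506)
    rect_depth = rectify_depth(depth_t, K, dist)                 # hypothetical rectification helper
    synth = synthesize_view(frame_s, rect_depth, pose, K, dist)  # view synthesis
    l_photo = min_photometric_loss(frame_t, [synth])             # step 508
    l_smooth = smoothness_loss(1.0 / (depth_t + 1e-7), frame_t)
    with torch.no_grad():
        ref_depth = reference_model(frame_t)                     # reference depth map (steps 512-514)
    l_rank = ranking_loss(depth_t, ref_depth)                    # step 516
    l_si = scale_invariant_loss(depth_t, ref_depth)
    loss = cumulative_loss(l_photo, l_smooth, l_rank, l_si)
    loss.backward()                                              # step 518: update the training data model
    optimizer.step()
    return loss.item()
```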

[0085] While the above is a full description of the specific embodiments, various modifications, alternative constructions and equivalents may be used. Therefore, the above description and illustrations should not be taken as limiting the scope of the present invention which is defined by the appended claims.