
Title:
ROBUST CAMERA LOCALIZATION BASED ON A SINGLE COLOR COMPONENT IMAGE AND MULTI-MODAL LEARNING
Document Type and Number:
WIPO Patent Application WO/2021/214540
Kind Code:
A1
Abstract:
Disclosed herein is a system and method for localizing/relocalizing a camera. The process may include receiving a color component image acquired, using a camera, from a scene at a trained image classification deep convolutional neural network (DCNN); recognizing, by the trained image classification DCNN, the scene and predicting a scene class of the scene from a plurality of scenes; receiving the color component image at a trained generative adversarial network (GAN); generating, by the trained GAN, a reconstructed point cloud image based on the color component image; receiving the color component image and the reconstructed point cloud image at a trained multi-modal regression DCNN; and estimating pose and orientation of the camera by the trained multi-modal regression DCNN.

Inventors:
TABATABAIE SEYED MOJTABA (IR)
GHOFRANI ALI (IR)
Application Number:
PCT/IB2020/062451
Publication Date:
October 28, 2021
Filing Date:
December 24, 2020
Assignee:
VAGHEIAT MATLOUB PARS (IR)
International Classes:
G06N3/08
Domestic Patent References:
WO2019153245A1 (2019-08-15)
Foreign References:
US20140267614A1 (2014-09-18)
Other References:
YU XIANG; TANNER SCHMIDT; VENKATRAMAN NARAYANAN; DIETER FOX: "PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes", arXiv.org, Cornell University Library, 1 November 2017 (2017-11-01), XP080833445
Attorney, Agent or Firm:
HAMIAN FANAVARI KARAFAM (IR)
Claims:
What is claimed is:

1. A computer-implemented method for camera localization based on a color component image, comprising: receiving the color component image acquired, using a camera, from a scene at a trained image classification deep convolutional neural network (DCNN); recognizing, by the trained image classification DCNN, the scene from a plurality of scenes, wherein the recognizing includes predicting a scene class that the color component image belongs to; receiving the color component image at a trained generative adversarial network (GAN), the trained GAN including a trained generator and a trained discriminator, wherein the trained generator configured to receive the color component image; generating, by the trained GAN, a reconstructed point cloud image based on the color component image, wherein the trained generator configured to generate the reconstructed point cloud image and the trained discriminator configured to distinguish, based on a ground truth point cloud image, whether the reconstructed point cloud image is real; receiving the color component image and the reconstructed point cloud image at a trained multi-modal regression DCNN, the trained multi-modal regression DCNN including a first trained regression DCNN and a second trained regression DCNN, wherein the first and the second trained regression DCNNs are parallel and fused to each other, the first trained regression DCNN configured to receive the color component image and the second trained regression DCNN configured to receive the reconstructed point cloud image generated by the trained GAN; and estimating pose and orientation of the camera by the trained multi-modal regression DCNN.

2. The method according to claim 1, wherein estimating the pose and the orientation of the camera includes: estimating a Cartesian position of the camera including a position of the camera on X-axis, Y-axis, and Z-axis; and estimating a quaternion information of the camera including an orientation of the camera about elements w, p, q, and r.

3. The method according to claim 1, wherein receiving the color component image acquired, using the camera, from the scene at the trained classification DCNN comprises: receiving an RGB (Red Green Blue) image acquired, using the camera, from an indoor scene at the trained image classification DCNN.

4. The method according to claim 3, further comprising: resizing the RGB image to a predefined size prior to the receiving the RGB image acquired, using the camera, from the indoor scene at the trained image classification DCNN.

5. The method according to claim 1, further comprising a training phase, the training phase comprising: acquiring a plurality of color component images, using a first camera, and a plurality of point cloud images, using a second camera, from the plurality of scenes, the plurality of color component and point cloud images including at least one color component and at least one point cloud image acquired from each respective scene of the plurality of scenes; training, by one or more processors, an image classification DCNN by feeding the plurality of color component images thereto, the image classification DCNN configured to label the plurality of color component images based on said each respective scene and classify them into a plurality of scene classes; training, by the one or more processors, a plurality of GANs, wherein each of the plurality of GANs specified to a specific scene class of the plurality of scene classes, said each of the plurality of GANs including a generator and a discriminator, wherein said each of the plurality of GANs trained by: feeding all color component images classified in the specific scene class of the plurality of scene classes to the generator of the each of the plurality of GANs, wherein the generator configured to generate a reconstructed point cloud image based on each of the all color component images; and feeding all point cloud images corresponding to said all color component images, as ground truth data, to the discriminator of the each of the plurality of GANs, the discriminator configured to determine, based on the ground truth data, whether the reconstructed point cloud image generated based on said each of the all color component images is real; and optimizing the generator of the each of the plurality of GANs based on a feedback of the discriminator of the each of the plurality of GANs; and training, by the one or more processors, a plurality of multi-modal regression DCNN units configured to estimate pose and orientation of the first and the second cameras at the each of the plurality of color component images, each of the plurality of multi-modal regression DCNN units including a first regression DCNN and a second regression DCNN, wherein the first and the second regression DCNNs are parallel and fused to each other, said each of the plurality of multi modal regression DCNN units trained by: feeding the all color component images classified in the specific scene class to the first regression DCNN of the each of the plurality of multi-modal regression DCNN units; and feeding said reconstructed point cloud image generated based on said each of the all color component images to the second regression DCNN of said each of the plurality of multi-modal regression DCNN units.

6. The method according to claim 5, wherein the pose includes: a Cartesian position of the first and the second cameras including a position of the first and the second cameras on X-axis, Y-axis, and Z-axis.

7. The method according to claim 5, wherein the orientation includes: a quaternion information of the first and the second cameras including an orientation of the first and the second cameras about elements w, p, q, and r.

8. The method according to claim 5, wherein acquiring the plurality of color component images, using the first camera, and the plurality of point cloud images, using the second camera, comprises: acquiring a plurality of RGB images and the plurality of point cloud images, from each of a plurality of indoor scenes.

9. The method according to claim 5, wherein feeding the plurality of color component images to the image classification DCNN further comprises: feeding augmented forms of the plurality of color component images to the image classification DCNN, wherein the augmented forms of the plurality of color component images obtained by at least one of brightness variations, contrast variations, noise addition, and a combination thereof.

10. The method according to claim 5, wherein feeding the all color component images classified in the specific scene class to the first regression DCNN of the each of the plurality of multi-modal regression DCNN units further comprises: feeding augmented forms of said all color component images to the first regression DCNN of the each of the plurality of multi-modal regression DCNN units, wherein the augmented forms of said all color component images obtained by at least one of mask insertion, patch removal, brightness variations, contrast variations, noise addition, and a combination thereof.

11. The method according to claim 5, wherein feeding said reconstructed point cloud image generated based on said each of the all color component images to the second regression DCNN of the each of the plurality of multi-modal regression DCNN units further comprises: feeding augmented forms of the reconstructed point cloud image generated based on said each of the all color component images to the second regression DCNN of the each of the plurality of multi-modal regression DCNN units, wherein the augmented forms of the reconstructed point cloud image obtained by mask insertion and patch removal.

12. The method according to claim 5, further comprising: resizing the plurality of color component images to a predefined size prior to feeding the plurality of color component images to the image classification DCNN.

13. The method according to claim 5, wherein acquiring the plurality of color component images, using the first camera, and the plurality of point cloud images, using the second camera, from the plurality of scenes includes: acquiring the at least one color component image, using a Complementary Metal Oxide Semiconductor (CMOS) camera, and the at least one point cloud image, using a Light Detection and Ranging (LiDAR) camera, from the each respective scene such that the CMOS camera and the LiDAR camera are disposed at different poses and orientations.

14. A system for camera localization based on a color component image, comprising: a camera configured to acquire the color component image from a scene; a processor; and a storage device configured to store a set of instructions that when executed by the processor performs a method for camera localization based on the color component image, the method comprising: receiving the color component image acquired, using the camera, from the scene at a trained image classification deep convolutional neural network (DCNN); recognizing, by the trained image classification DCNN, the scene from a plurality of scenes, wherein the recognizing includes predicting a scene class that the color component image belongs to; receiving the color component image at a trained generative adversarial network (GAN), the trained GAN including a trained generator and a trained discriminator, wherein the trained generator configured to receive the color component image; generating, by the trained GAN, a reconstructed point cloud image based on the color component image, wherein the trained generator configured to generate the reconstructed point cloud image and the trained discriminator configured to distinguish, based on a ground truth point cloud image, whether the reconstructed point cloud image is real; receiving the color component image and the reconstructed point cloud image at a trained multi-modal regression DCNN, the trained multi-modal regression DCNN including a first trained regression DCNN and a second trained regression DCNN, wherein the first and the second trained regression DCNNs are parallel and fused to each other, the first trained regression DCNN configured to receive the color component image and the second trained regression DCNN configured to receive the reconstructed point cloud image generated by the trained GAN; and estimating pose and orientation of the camera by the trained multi-modal regression DCNN.

15. The system according to claim 14, further comprising: resizing the color component image to a predefined size prior to the receiving the color component image acquired, using the camera, from the scene at the trained image classification DCNN.

16. The system according to claim 14, wherein estimating the pose and the orientation of the camera includes: estimating a Cartesian position of the camera including a position of the camera on X-axis, Y-axis, and Z-axis; and estimating a quaternion information of the camera including an orientation of the camera about elements w, p, q, and r.

17. A computer-readable medium configured to store a set of instructions that, when executed by a processor, performs a method for camera localization based on a color component image, the method comprising: receiving the color component image acquired, using a camera, from a scene at a trained image classification deep convolutional neural network (DCNN); recognizing, by the trained image classification DCNN, the scene from a plurality of scenes, wherein the recognizing includes predicting a scene class that the color component image belongs to; receiving the color component image at a trained generative adversarial network (GAN), the trained GAN including a trained generator and a trained discriminator, wherein the trained generator configured to receive the color component image; generating, by the trained GAN, a reconstructed point cloud image based on the color component image, wherein the trained generator configured to generate the reconstructed point cloud image and the trained discriminator configured to distinguish, based on a ground truth point cloud image, whether the reconstructed point cloud image is real; receiving the color component image and the reconstructed point cloud image at a trained multi-modal regression DCNN, the trained multi-modal regression DCNN including a first trained regression DCNN and a second trained regression DCNN, wherein the first and the second trained regression DCNNs are parallel and fused to each other, the first trained regression DCNN configured to receive the color component image and the second trained regression DCNN configured to receive the reconstructed point cloud image generated by the trained GAN; and estimating pose and orientation of the camera by the trained multi-modal regression DCNN.

18. The computer-readable medium according to claim 17, further comprising: resizing the color component image to a predefined size prior to the receiving the color component image acquired, using the camera, from the scene at the trained image classification DCNN.

19. The computer-readable medium according to claim 17, wherein estimating the pose and the orientation of the camera includes: estimating a Cartesian position of the camera including a position of the camera on X-axis, Y-axis, and Z-axis; and estimating a quaternion information of the camera including an orientation of the camera about elements w, p, q, and r.

Description:
ROBUST CAMERA LOCALIZATION BASED ON A SINGLE COLOR

COMPONENT IMAGE AND MULTI-MODAL LEARNING

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of priority from U.S. Provisional Patent Application Ser. No. 63/012,485, filed on Apr. 20, 2020, and entitled “Robust Camera Pose Estimation Based on Multi-Modal Learning” which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

[0002] The present disclosure generally relates to a method and system for camera localization/relocalization; and more particularly, to a method and system for camera localization/relocalization based on a single color component image, using a generative adversarial network (GAN) and deep convolutional neural networks (DCNNs) including a multi-modal DCNN. More particularly, the present disclosure relates to a method and system for localizing/relocalizing a camera in an indoor/interior environment that may be robust against environmental variations.

BACKGROUND

[0003] Camera localization, i.e., recovering the position and orientation of a camera, may be considered a critical problem in computer vision. Accurate and robust estimation of camera pose may be key to applications such as autonomous driving, augmented reality, 3D reconstruction, visual Simultaneous Localization and Mapping (visual SLAM), and navigation.

[0004] Localization and navigation in places where no Wi-Fi or GPS data are available, or in places with permanent modifications and object movements, has always been a challenging issue. Current indoor positioning systems employ sensors, communication technologies, or depth imaging devices to locate objects. Such technologies are not client-side and require expensive devices, periodic monitoring, maintenance, and calibration.

[0005] The precision of current camera pose estimation technologies may be sufficient for general indoor/outdoor positioning applications; however, applications such as autonomous driving and augmented reality may require centimeter-level accuracy. There is therefore a need for a low-cost, highly robust system that may be utilized without costly devices.

SUMMARY

[0006] This summary is intended to provide an overview of the subject matter of the present disclosure, and is not intended to identify essential elements or key elements of the subject matter, nor is it intended to be used to determine the scope of the claimed implementations. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later. The proper scope of the present disclosure may be ascertained from the claims set forth below in view of the detailed description below and the drawings.

[0007] In one general aspect, the present disclosure relates to a computer-implemented method for camera localization based on a color component image. In exemplary embodiments, the computer-implemented method may include receiving the color component image acquired, using a camera, from a scene at a trained image classification deep convolutional neural network (DCNN); and recognizing, by the trained image classification DCNN, the scene from a plurality of scenes. The trained image classification DCNN may be configured to predict a scene class that the color component image may belong to. The method may further comprise receiving the color component image at a trained generative adversarial network (GAN) including a trained generator and a trained discriminator. The trained generator may be configured to receive the color component image. The method may further include generating, by the trained GAN, a reconstructed point cloud image based on the color component image. In particular, the trained generator of the trained GAN may be configured to generate the reconstructed point cloud image and the trained discriminator may be configured to distinguish, based on a ground truth point cloud image, whether the reconstructed point cloud image is real. The method may further include receiving the color component image and the reconstructed point cloud image at a trained multi-modal regression DCNN including a first trained regression DCNN and a second trained regression DCNN. The first and the second trained regression DCNNs may be parallel and fused to each other. In exemplary embodiments, the first trained regression DCNN may be configured to receive the color component image and the second trained regression DCNN may be configured to receive the reconstructed point cloud image generated by the trained GAN. The method may further comprise estimating pose and orientation of the camera by the trained multi-modal regression DCNN. In an exemplary embodiment, the pose may be determined by estimating a Cartesian position of the camera that may include a position of the camera on X-axis, Y-axis, and Z-axis. The orientation may be determined by estimating a quaternion information of the camera including an orientation of the camera about elements w, p, q, and r. The computer-implemented method may further comprise resizing the color component image to a predefined size prior to receiving the color component image acquired from the scene at the trained image classification DCNN.
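
For illustration only, the overall inference flow summarized above may be sketched as follows. The model objects, their names, and the call signatures are hypothetical placeholders rather than the disclosed implementation; the sketch merely shows how the classification, generation, and regression stages may be chained.

```python
# Hypothetical sketch of the localization pipeline summarized above; the
# classifier, per-scene-class GANs, and fused regressors are assumed to be
# pre-trained objects with the interfaces shown.
def localize(color_image, classifier, gans, regressors):
    scene_class = classifier.predict(color_image)            # scene recognition
    point_cloud = gans[scene_class].generator(color_image)   # RGB -> reconstructed point cloud image
    xyz, wpqr = regressors[scene_class](color_image, point_cloud)
    return xyz, wpqr  # Cartesian position (X, Y, Z) and quaternion (w, p, q, r)
```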

[0008] In one or more exemplary embodiments, the computer-implemented method may include a training phase. The training phase may include acquiring a plurality of color component images, using a first camera, and a plurality of point cloud images, using a second camera, from the plurality of scenes. The plurality of color component and point cloud images may include at least one color component and at least one point cloud image, acquired from each respective scene of the plurality of scenes. In an exemplary embodiment, the at least one color component image and the at least one point cloud image may be acquired such that the first and the second cameras may be disposed at different poses and orientations. In an exemplary embodiment, the first camera may be a Complementary Metal Oxide Semiconductor (CMOS) or a charge-coupled device (CCD) camera and the second camera may be a Light Detection and Ranging (LiDAR) camera.

[0009] The training phase may further include training, by one or more processors, an image classification DCNN. In an exemplary embodiment, the image classification DCNN may be trained by feeding the plurality of color component images to it. The image classification DCNN may be configured to label the plurality of color component images based on said each respective scene and classify them into a plurality of scene classes.

[00010] The training phase may further include training, by the one or more processors, a plurality of GANs. Each of the plurality of GANs may be specified to a specific scene class of the plurality of scene classes. Each of the plurality of GANs may comprise a generator and a discriminator. In an exemplary embodiment, said each of the plurality of GANs may be trained by feeding all color component images classified in the specific scene class of the plurality of scene classes to the generator of the each of the plurality of GANs. The generator may be configured to generate a reconstructed point cloud image based on each of the all color component images. The each of the plurality of GANs may further be trained by feeding all point cloud images corresponding to the all color component images, as ground truth data, to the discriminator of the each of the plurality of GANs. The discriminator may be configured to determine, based on the ground truth data, whether the reconstructed point cloud image generated based on said each of the all color component images is real. The each of the plurality of GANs may further be trained by optimizing the generator of the each of the plurality of GANs based on a feedback of the discriminator of the each of the plurality of GANs.
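
As a concrete, non-authoritative illustration of this adversarial training loop, the sketch below pairs a scene-specific generator and discriminator in a PyTorch-style update. The conditional (paired) discriminator, the binary cross-entropy adversarial loss, and the optional L1 reconstruction term are assumptions borrowed from common image-to-image GAN practice, not details stated in the disclosure.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, rgb, real_pc):
    """One illustrative adversarial update for a scene-specific GAN; rgb is a
    batch of color component images and real_pc the ground-truth point cloud images."""
    # Discriminator update: real pairs should score 1, reconstructed pairs 0.
    fake_pc = generator(rgb).detach()
    d_real = discriminator(rgb, real_pc)
    d_fake = discriminator(rgb, fake_pc)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: the discriminator's feedback optimizes the generator.
    fake_pc = generator(rgb)
    g_adv = discriminator(rgb, fake_pc)
    g_loss = (F.binary_cross_entropy_with_logits(g_adv, torch.ones_like(g_adv))
              + F.l1_loss(fake_pc, real_pc))   # optional reconstruction term
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```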

[00011] The training phase may further include training, by one or more processors, a plurality of multi-modal regression DCNN units. In an exemplary embodiment, the plurality of multi-modal regression DCNNs may be configured to estimate pose and orientation of the first and the second cameras at the each of the plurality of color component images. Each of the plurality of multi-modal regression DCNN units may include a first regression DCNN and a second regression DCNN, wherein the first and the second regression DCNNs may be parallel and fused to each other. In an exemplary embodiment, said each of the plurality of multi-modal regression DCNN units may be trained by feeding the all color component images classified in the specific scene to the first regression DCNN of the each of the plurality of multi-modal regression DCNN units; and feeding the reconstructed point cloud image generated based on said each of the all color component images to the second regression DCNN of said each of the plurality of multi-modal regression DCNN units.

[00012] In another exemplary embodiment, a system for performing camera localization based on the color component image is disclosed. The system may include a camera for acquiring the color component image from the scene, a processor, and a storage device. The storage device may be configured to store a set of instructions that when executed by the processor may perform said computer-implemented method for camera localization.

[00013] Another aspect of the present disclosure may be directed to a non-transitory computer-readable medium. In exemplary embodiments, the non-transitory computer-readable medium may be configured to store a set of instructions that, when executed by a processor, may perform said computer-implemented method for camera localization.

[00014] This Summary is provided to introduce a selection of concepts in a simplified form; these concepts are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

[00015] The novel features which are believed to be characteristic of the present disclosure, as to its structure, organization, use and method of operation, together with further objectives and advantages thereof, will be better understood from the following drawings in which a presently preferred embodiment of the present disclosure will now be illustrated by way of example. It is expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the present disclosure. Embodiments of the present disclosure will now be described by way of example in association with the accompanying drawings in which:

[00016] FIG. 1 illustrates a flowchart diagram of an exemplary method for camera localization using a single color component image, consistent with one or more embodiments of the present disclosure;

[00017] FIG. 2 illustrates a schematic block diagram of a process for recognizing the color component image by a trained image classification DCNN, consistent with exemplary embodiments of the present disclosure;

[00018] FIG. 3 illustrates a schematic block diagram of a process for generating a reconstructed point cloud image by a trained GAN, consistent with one or more exemplary embodiments of the present disclosure;

[00019] FIG. 4 illustrates an exemplary block diagram of a process for estimating pose and orientation of the camera using a trained multi-modal regression DCNN, consistent with exemplary embodiments of the present disclosure;

[00020] FIG. 5 illustrates a flowchart diagram of an exemplary training phase for performing the camera localization, consistent with exemplary embodiments of the present disclosure;

[00021] FIG. 6 illustrates exemplary camera trajectory styles for acquiring a plurality of color component and a plurality of point cloud images, consistent with exemplary embodiments of the present disclosure;

[00022] FIG. 7 illustrates a block diagram of an exemplary configuration of a computing system for performing the camera localization, consistent with one or more embodiments of the present disclosure;

[00023] FIG. 8 is a block diagram illustrating a configuration example of a localization system, consistent with one or more exemplary embodiments of the present disclosure;

[00024] FIG. 9 is a block diagram illustrating a run-time operation of an application installed on a smartphone, consistent with exemplary embodiments of the present disclosure; and

[00025] FIG. 10 illustrates exemplary augmentations of an exemplary color component image acquired from an exemplary scene, consistent with one or more exemplary embodiments of the present disclosure.

DETAILED DESCRIPTION

[00026] In the following detailed description, numerous specific details are set forth by way of examples to provide a thorough understanding of the relevant teachings related to the exemplary embodiments. However, it should be apparent that the present teachings may be practiced without such details. In other instances, well known methods, procedures, components, and/or circuitry have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present teachings.

[00027] The following detailed description is presented to enable a person skilled in the art to make and use the methods and devices disclosed in exemplary embodiments of the present disclosure. For purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the present disclosure. However, it will be apparent to one skilled in the art that these specific details are not required to practice the disclosed exemplary embodiments. Descriptions of specific exemplary embodiments are provided only as representative examples. Various modifications to the exemplary implementations will be plain to one skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the scope of the present disclosure. The present disclosure is not intended to be limited to the implementations shown, but is to be accorded the widest possible scope consistent with the principles and features disclosed herein.

[00028] It must be noted that, as used in this specification, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise.

[00029] As used herein, the terms “comprising,” “including,” “constituting,” “containing,” “consisting of,” and grammatical equivalents thereof are inclusive or open-ended terms that do not exclude additional, unrecited elements or method steps.

[00030] Reference herein to “one embodiment,” “an embodiment,” “some embodiments,” “one or more embodiments,” “one exemplary embodiment,” “an exemplary embodiment,” “some exemplary embodiments,” and “one or more exemplary embodiments” indicate that a particular feature, structure or characteristic described in connection or association with the embodiment may be included in at least one of such embodiments. However, the appearance of such phrases in various places in the present disclosure do not necessarily refer to a same embodiment or embodiments.

[00031] The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments. Likewise, the term “embodiments of the invention” does not require that all embodiments of the invention include the discussed feature, advantage or mode of operation.

[00032] In addition, terms such as first, second and the like may be used herein to describe components. Each of these terms is not used to define an essence, order or sequence of a corresponding component but used merely to distinguish the corresponding component from other component(s).

[00033] It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two operations shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

[00034] The terms used in this specification may generally have their ordinary meanings in the art, within the context of the disclosure, and in the specific context where each term is used. Certain terms that are used to describe the disclosure are discussed below, or elsewhere in the specification, to provide additional guidance to the practitioner regarding the description of the disclosure. It will be appreciated that a term or a phrase may be said in more than one way.

[00035] Consequently, alternative language and synonyms may be used for any one or more of the terms discussed herein. Nor is any special significance to be placed upon whether or not a term is elaborated or discussed herein. Synonyms for certain terms are provided. A recital of one or more synonyms does not exclude the use of other synonyms. The use of examples anywhere in this specification, including examples of any terms discussed herein, is illustrative only, and is not intended to further limit the scope and meaning of the disclosure or of any exemplified term. Likewise, the disclosure may not be limited to various embodiments given in the present specification.

[00036] Localization and navigation in places where no Wi-Fi or GPS data are available, or in places with permanent modifications and object movements, is a challenging issue in computer vision. A majority of indoor positioning systems may employ sensors, communication technologies, or depth imaging devices to locate objects; such systems may require expensive devices, periodic monitoring, maintenance, and calibration. On the other hand, the technologies developed so far may suffer from low precision that may be acceptable for outdoor positioning; however, for applications such as indoor navigation, autonomous driving, or augmented reality, centimeter-level accuracy may be required.

[00037] The exemplary method and system disclosed herein may employ an end-to-end model including convolutional and generative models to localize an object, such as a smartphone, with high accuracy and without the need for GPS or equivalent satellite-based localization systems. The system and method provided herein may replace known systems including, but not limited to, systems utilizing sensors and communication devices, such as Bluetooth devices. Furthermore, the method and system disclosed in the exemplary embodiments of the present disclosure may localize the object with millimeter accuracy and may be robust against modifications of an environment, such as light changes and the replacement, movement, or addition of an object in the environment. The exemplary aspects and embodiments of the present disclosure may also pertain to applications including, but not limited to, indoor navigation, outdoor navigation, augmented reality, autonomous driving, drone navigation inside tunnels, robot navigation, and the like.

[00038] A number of particular aspects and embodiments are described herein, however many variations and permutations of these aspects fall within the scope of the disclosure. Although some benefits and advantages are mentioned, the scope of the present disclosure is not intended to be limited to particular benefits, uses or objectives. Rather, aspects of the disclosure are intended to be broadly applicable to different technologies, system configurations, networks and protocols, some of which are illustrated by way of example in the figures and in the following description of the preferred aspects. The detailed description and drawings are merely illustrative of the disclosure rather than limiting, the scope of the disclosure being defined by the appended claims and equivalents thereof.

[00039] Exemplary embodiments to be described hereinafter may be provided in various forms of products including, for example, a smartphone, a mobile device, a wearable device, a personal computer (PC), a laptop computer, a tablet computer, an intelligent vehicle, a smart home appliance, an autonomous vehicle, a robot, and the like. For example, exemplary embodiments may be applied to estimate a pose and an orientation of a product such as a smartphone, a mobile device, a wearable device, an autonomous vehicle, a robot, and the like. Exemplary embodiments may be applicable to various services using pose estimation/localization. For example, exemplary embodiments may be applicable to services providing information on an object or a gaze direction of a user by estimating a pose of a wearable device worn by the user.

[00040] It should be noted that if it is described in the specification that one component is “connected,” “coupled,” or “joined” to another component, a third component may be “connected,” “coupled,” and “joined” between the first and second components, although the first component may be directly connected, coupled or joined to the second component.

[00041] As used herein, the term “determining” may encompass a wide variety of actions. For example, “determining” may include calculating, estimating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Additionally, “determining” may include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Furthermore, “determining” may include resolving, selecting, choosing, establishing and the like.

[00042] The terms “computer” and “computing device/system” are to be understood broadly here. They also may include control devices and other processor-based data processing devices.

[00043] Referring now to the figures, FIGs. 1-5 illustrate exemplary methods, processes, and neural networks with respect to camera localization/re-localization, consistent with different aspects and embodiments of the present disclosure. FIG. 1 illustrates a flowchart diagram of an exemplary method 100 for camera localization using a single color component image, consistent with one or more embodiments of the present disclosure. The exemplary method 100 may include, but is not limited to, a plurality of steps consistent with exemplary aspects and embodiments described herein. The following description explains each step with respect to the exemplary processes illustrated in FIGs. 2-4.

[00044] Referring to FIG. 1, step 102 may comprise receiving a color component image 202 (see FIG. 2) from a scene at a trained image classification deep convolutional neural network (DCNN) 201. In an exemplary embodiment, the color component image 202 may include an RGB (Red Green Blue) image captured by a camera (e.g., a Complementary Metal Oxide Semiconductor (CMOS) camera 802 (see FIG. 8)). The color component image 202 may be received from the camera in real-time.

[00045] As to terminology, the term “scene(s)” may refer to a scene/part of an environment, for example an indoor/interior environment, or a scene/part of a structure in which a camera or an object (e.g., a smartphone) may be intended to be localized.

[00046] As used herein, the term “Convolutional Neural Network (CNN)” may generally refer to powerful tools for computer vision tasks. A CNN may be composed of an input layer, a convolutional block, and an output layer. The term “deep CNN” may denote stacked CNNs to obtain higher “representativeness”. Neural networks may be designed with a variety of connectivity patterns. In feed-forward networks, information may be passed from lower to higher layers, with each neuron in a given layer communicating to neurons in higher layers. A hierarchical representation may be built up in successive layers of a feed-forward network, as described above. Neural networks may also have recurrent or feedback (also called top-down) connections. In a recurrent connection, an output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that may span more than one of the input data chunks that may be delivered to the neural network in a sequence. A connection from a neuron in a given layer to a neuron in a lower layer may be called a feedback (or top-down) connection. A network with many feedback connections may be helpful when the recognition of a high-level concept may aid in discriminating particular low-level features of an input. The term "features" may refer to hidden signatures present in an input (e.g., images). In an exemplary embodiment, the trained image classification DCNN 201 may be configured to receive the RGB image acquired, using the camera, from an indoor scene.

[00047] Step 104 may include recognizing, by the trained image classification DCNN 201, the scene from a plurality of scenes. In particular, consistent with exemplary embodiments, the trained image classification DCNN 201 may be adapted for predicting a scene class that the color component image 202 may belong/correspond to, and for estimating a probability that the color component image 202 belongs to said scene class. In an exemplary embodiment, the color component image 202 (e.g., the RGB image) may be resized to a predefined size (e.g., 224*224*3 pixels) prior to being provided to the trained image classification DCNN 201. FIG. 2 illustrates a schematic block diagram of a process 200 for recognizing the color component image 202 by the trained image classification DCNN 201, consistent with exemplary embodiments of the present disclosure. In general, as shown in FIG. 2, a common architecture of a DCNN may include a plurality of convolutional blocks 204, wherein each convolutional block (e.g., a convolutional block 204a) may include a convolution layer 206, a normalization layer (LNorm) 208, and a pooling layer 210. In exemplary embodiments, the input (e.g., the color component image 202) may be received by a first convolutional block from the plurality of convolutional blocks 204. The convolution layer 206 may include one or more convolution filters that may be applied to the input (e.g., the color component image 202) to generate a feature map. In exemplary embodiments, any number of convolutional blocks may be included in a DCNN (e.g., the trained image classification DCNN 201) based on design preference. The normalization layer 208 may be used to normalize an output of the convolution filters. For example, the normalization layer may provide whitening or lateral inhibition. The pooling layer 210 may provide down sampling aggregation over space for local invariance and dimensionality reduction. In exemplary embodiments, the pooling layer 210 may progressively reduce spatial size of large images while preserving the most important information in them. For example, when the input is an image (as in the process 200), the pooling layer 210 may keep a maximum value from each image window and may preserve the best fits of each feature within the window. DCNNs may also include one or more fully connected layers 212 (e.g., FC1 212a and FC2 212b); the one or more fully connected layers 212 may be configured to take the high-level filtered images and translate them into a plurality of categories/classes with labels. The output of each layer may serve as an input of a succeeding layer in a DCNN to learn hierarchical feature representations from input data (e.g., images, audio, video, sensor data and/or other input data) supplied at the first convolutional block.

[00048] In exemplary embodiments, neural networks (i.e., DCNNs) may implement an activation function based on their task. The term “activation function” may refer to mathematical equations that may determine an output of a neural network. Activation functions may be attached to each neuron in the neural network, and may determine whether each neuron should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. Activation functions may also contribute to normalizing an output of each neuron to a range between 0 and 1 or between -1 and 1.
The activation functions may include, but are not limited to, binary step function, linear activation function, and non-linear activation functions including sigmoid/logistic activation function, TanH/hyperbolic tangent activation function, ReLU (Rectified Linear Unit) activation function, leaky ReLU, SELU (Scaled Exponential Linear Unit), parametric ReLU, Softmax activation function, and swish function. For example, in case of an image classification DCNN, such as the trained image classification DCNN 201 shown in FIG. 2, the network may further include a Softmax layer 214 after the one or more fully connected layers 212 that may implement a softmax activation function. The Softmax activation function may be configured to normalize the outputs for each class between 0 and 1 giving a probability that the input value belongs to a class. The softmax activation function may be used for the output layer of neural networks that may classify inputs into multiple classes. In exemplary embodiments, the process 200 may be performed on the at least one processor 702 (as shown in FIGs. 7 and 8) including CPU, GPU, and NPU.
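
To make the architecture just described more tangible, the following PyTorch-style sketch stacks convolution, normalization, and pooling blocks, follows them with two fully connected layers, and applies a softmax output, mirroring the classification DCNN of FIG. 2. The layer counts, channel widths, and 224*224*3 input size are illustrative assumptions, not the disclosed network.

```python
import torch
import torch.nn as nn

class SceneClassifier(nn.Module):
    """Illustrative scene-classification DCNN: conv + normalization + pooling
    blocks, two fully connected layers, and a softmax over scene classes."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # convolution layer
            nn.BatchNorm2d(32),                           # normalization layer
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling layer
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 56 * 56, 256),                 # FC1
            nn.ReLU(),
            nn.Linear(256, num_classes),                  # FC2
        )

    def forward(self, x):                                 # x: (N, 3, 224, 224)
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)               # per-class probabilities
```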

[00049] Step 106 may include receiving the color component image 202 at a trained generative adversarial network (GAN) 302 (see FIG. 3). In exemplary embodiments, there may exist a plurality of trained GANs, each of which is specified to a scene class of the plurality of scene classes. Each of the plurality of trained GANs may be configured to receive all color component images estimated to belong to a specific scene class. For example, the GAN 302 may be specified to the scene class estimated for the color component image 202 (e.g., the scene class “1”). The color component image 202 may include the RGB image captured by the CMOS camera 802. The color component image 202 may be received from the CMOS camera 802 in real-time. In step 108, the trained GAN 302 may be adapted to generate a reconstructed point cloud image 304 based on the color component image 202. Each of the plurality of trained GANs may be capable of generating a separate reconstructed point cloud image based on every color component image that is estimated to belong to a specific scene class. For example, the GAN 302 may be specified to the scene class estimated for the color component image 202 (e.g., the scene class “1”) and may further be capable of generating separate reconstructed point cloud images based on each color component image belonging to that scene class (e.g., the scene class “1”).
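
Purely as an illustration of a generator that maps a single color component image to an image-like point cloud reconstruction, a compact encoder-decoder could look as follows; the depth, channel widths, and single-channel output are assumptions of this sketch rather than the disclosed generator.

```python
import torch.nn as nn

class PointCloudGenerator(nn.Module):
    """Illustrative encoder-decoder generator: color component image in,
    image-like point cloud reconstruction out (output channel count assumed)."""
    def __init__(self, out_channels: int = 1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, out_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, rgb):            # rgb: (N, 3, H, W) with H, W divisible by 4
        return self.decoder(self.encoder(rgb))
```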

[00050] In general, a GAN may be a deep neural network comprising two adversarial nets in a zero-sum game framework. In an embodiment, the GAN may comprise a GAN generator configured to generate new data instances and a GAN discriminator configured to evaluate the new data instances for authenticity. The GAN discriminator may be configured to analyze the new data instances and determine whether each new data instance belongs to the actual training data sets or if it was generated artificially. The GAN generator may be configured to create new images that may be passed to the GAN discriminator and the GAN generator may be trained to generate images that fool the GAN discriminator into determining that an artificial new data instance belongs to the actual training data. FIG. 3 is a schematic block diagram illustrating a process 300 of generating the reconstructed point cloud image 304 by the trained GAN 302, consistent with one or more exemplary embodiments of the present disclosure. The process 300 may include receiving the color component image 202 at a trained GAN generator 302a. The trained GAN generator 302a may be configured to generate the reconstructed point cloud image 304 based on the color component image 202. A trained GAN discriminator 302b may be adapted to receive the reconstructed point cloud image (fake image) 304 generated by the trained GAN generator 302a and compare it to a ground truth point cloud image (real image) 306 that may be provided to a GAN discriminator during an exemplary training phase 500 (see FIG. 5). In particular, the trained GAN generator 302a may be configured to generate the reconstructed point cloud image 304 that may not be distinguished from the ground truth point cloud image (real image) 306 by the trained GAN discriminator 302b. The ground truth point cloud image 306 may be obtained using a light detection and ranging (LiDAR) camera or a 3D laser scanner. As used herein, the term “ground truth” may refer to the information of a known set of pixels reflecting a known set of properties/features for a point cloud image utilized to train a neural network. As discussed above, the trained GAN discriminator 302b may be configured to compare the ground truth point cloud image (the real image) 306 with the reconstructed point cloud image (the fake image) 304 and distinguish real and fake image pairs 308. In exemplary embodiments, the process 300 may be performed on the at least one processor 702 (as shown in FIGs. 7 and 8) including CPU, GPU, and NPU.

[00051] Step 110 may include receiving the color component image 202 (e.g., the RGB image) and the reconstructed point cloud image 304 at a trained multi-modal regression DCNN 402 (see FIG. 4). In exemplary embodiments, there may exist a plurality of trained regression DCNN units, wherein each of the plurality of trained regression DCNN units may be specified to a specific scene class of the plurality of scene classes. Each of the plurality of trained regression DCNN units may be configured to receive all color component images estimated to belong to a specific scene class, and all reconstructed point cloud images corresponding to them. For example, the trained multi-modal regression DCNN 402 may be specified to the scene class estimated for the color component image 202 (e.g., the scene class “1”).
In an exemplary embodiment, the trained multi-modal regression DCNN 402 may include a first trained regression DCNN 402a and a second trained regression DCNN 402b that may be parallel to the first trained regression DCNN 402a. The first trained regression DCNN 402a may be configured to receive the color component image 202 (step 110a) and the second trained regression DCNN 402b may be configured to receive the reconstructed point cloud image 304 (step 110b). In an exemplary embodiment, the first and the second trained regression DCNNs (402a and 402b, respectively) may be connected to each other. The term “connect” may refer to any appropriate connection between layers through features fusion, including a concatenate connection. As used herein, the term “concatenate connection” may refer to connecting features (feature fusion 404) of a substantially same size respectively from layers that may be connected with each other, for example, by means of memory mapping. Vectors having features corresponding to each other may be combined by concatenate connection, doubling the number of channels of the layer containing the features. In step 112, pose 406 (see FIG. 4) and orientation 408 of the camera (i.e., the CMOS camera 802) may be determined/estimated by the trained multi-modal regression DCNN 402. In an exemplary embodiment, the trained multi-modal regression DCNN 402 may be configured to determine/estimate the pose 406 of the camera by estimating its Cartesian position (i.e., position of the camera on X-axis, Y-axis and Z-axis). The trained multi-modal regression DCNN 402 may also be configured to determine/estimate the orientation 408 of the camera by estimating its quaternion information/vector. The quaternion information/vector may comprise four elements including w, p, q, and r. The phrase "Cartesian position" as used herein may refer to a system of coordinates for locating a point on a plane (Cartesian plane) by its distance from each of two intersecting lines, or in space by its distance from each of three planes intersecting at a point. The term "quaternion" may refer to a vector of four elements that may represent a Euclidean rotation. Rotation matrix may be easily obtained using quaternion elements.
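
As a worked example of the last point above, a rotation matrix may be recovered from a unit quaternion with the standard conversion; the snippet below simply applies that textbook formula using the (w, p, q, r) element naming of this disclosure.

```python
import numpy as np

def quaternion_to_rotation_matrix(w, p, q, r):
    """Standard conversion from a unit quaternion (w, p, q, r) to a 3x3
    rotation matrix; the quaternion is normalized first to guard against drift."""
    n = np.sqrt(w * w + p * p + q * q + r * r)
    w, p, q, r = w / n, p / n, q / n, r / n
    return np.array([
        [1 - 2 * (q * q + r * r), 2 * (p * q - w * r),     2 * (p * r + w * q)],
        [2 * (p * q + w * r),     1 - 2 * (p * p + r * r), 2 * (q * r - w * p)],
        [2 * (p * r - w * q),     2 * (q * r + w * p),     1 - 2 * (p * p + q * q)],
    ])
```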

[00052] FIG. 4 illustrates an exemplary block diagram of a process 400 for estimating the pose 406 and orientation 408 of the camera using the trained multi-modal regression DCNN 402, consistent with exemplary embodiments of the present disclosure. As discussed in the steps 110 and 112, consistent with exemplary embodiments of the present disclosure, the trained multi-modal regression DCNN 402 may comprise at least two trained regression DCNNs including the first and the second trained regression DCNNs (402a and 402b, respectively). As discussed in the step 110, the color component image 202 (e.g., the RGB image) may be received by the first trained regression DCNN 402a, and the reconstructed point cloud image 304 generated by the trained GAN 302 may be received by the second trained regression DCNN 402b. Consistent with the general architecture of DCNNs, as discussed earlier in the present disclosure, the first and the second trained regression DCNNs (402a and 402b, respectively) may include a plurality of hidden layers including one or more convolutional blocks (410a and 410b) followed by one or more fully connected layers 412 between input 414 and output 416. Each of the one or more convolutional blocks (410a and 410b) may comprise a convolution layer, a batch normalization layer, and a max pooling layer. After receiving the color component image 202, by the first trained regression DCNN 402a, and the reconstructed point cloud image 304, by the second trained regression DCNN 402b, the convolution layers may be adapted to extract features of the input 414 (i.e., the color component image 202 and the reconstructed point cloud image 304). Multiple filters may be used to extract the maximum of features and characteristics contained in the input 414. After convolving the filters across the pixels of the input 414, each convolutional output may be fed into an activation function. In an exemplary embodiment, each of the one or more convolutional blocks (410a and 410b) may further include a ReLU layer. The ReLU layer may implement the ReLU activation function which may be configured to exchange every negative number of the pooling layer with zero (0). This function may mathematically stabilize a CNN by preventing learned values from staying near zero or exploding toward infinity. The ReLU activation functions may include, but are not limited to, noisy ReLUs, leaky ReLUs, and exponential linear units. The max pooling layers may be adapted to receive an output of the convolution layers and reduce computational load and spatial dimension of said output of the convolution layers. The one or more fully connected layers 412 may be used for regression and may be configured to output the pose 406 (i.e., the Cartesian position (X, Y, Z)) and the orientation 408 (i.e., the quaternion information (wpqr)). In exemplary embodiments, the process 400 may be performed on the at least one processor 702 (as shown in FIGs. 7 and 8) including CPU, GPU, and NPU.
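
The two-branch, feature-fusion arrangement described above may be sketched as follows; the branch depths, the global average pooling used to flatten each branch, and the hidden sizes are assumptions of this illustration, while the concatenation fusion and the two regression heads (Cartesian position and quaternion) follow the description.

```python
import torch
import torch.nn as nn

def conv_branch(in_channels: int) -> nn.Sequential:
    """One regression branch: convolution + batch normalization + ReLU + max pooling blocks."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),           # (N, 64) feature vector per branch
    )

class MultiModalPoseRegressor(nn.Module):
    """Two parallel branches (color component image, reconstructed point cloud
    image) fused by concatenation, followed by fully connected regression heads."""
    def __init__(self, pc_channels: int = 1):
        super().__init__()
        self.rgb_branch = conv_branch(3)
        self.pc_branch = conv_branch(pc_channels)
        self.fc = nn.Sequential(nn.Linear(64 + 64, 128), nn.ReLU())
        self.xyz_head = nn.Linear(128, 3)                # Cartesian position (X, Y, Z)
        self.wpqr_head = nn.Linear(128, 4)               # quaternion (w, p, q, r)

    def forward(self, rgb, pc):
        fused = torch.cat([self.rgb_branch(rgb), self.pc_branch(pc)], dim=1)   # feature fusion
        h = self.fc(fused)
        return self.xyz_head(h), self.wpqr_head(h)
```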

[00053] Another aspect of the present disclosure relates to an exemplary training phase 500 for obtaining the trained neural networks described above including the trained image classification DCNN 201, the trained GAN 302, and the first and the second trained regression DCNNs (402a and 402b, respectively). FIG. 5 illustrates a flowchart diagram of the exemplary training phase 500 for performing the camera localization, consistent with exemplary embodiments of the present disclosure. The exemplary training phase 500 may include, but is not limited to, a plurality of steps consistent with exemplary embodiments of the present disclosure.

[00054] Referring to FIG. 5, step 502 may include acquiring a plurality of color component images, using a first camera, and a plurality of point cloud images, using a second camera, from the plurality of scenes. In an exemplary embodiment, at least one color component and at least one point cloud image may be acquired from each respective scene of the plurality of scenes. The plurality of color component images and the at least one color component image may include a plurality of RGB images and at least one RGB image, respectively. The first camera may include a CMOS or a charge-coupled device (CCD) camera and the second camera may include a LiDAR camera. In an exemplary embodiment, the plurality of color component images (e.g., the plurality of RGB images) and the plurality of point cloud images may be acquired from a plurality of indoor/interior scenes (e.g., in a structure/building). For example, in an exemplary embodiment, the plurality of RGB images and the plurality of point cloud images may be acquired from each of the plurality of indoor/interior scenes. The plurality of RGB images and the plurality of point cloud images may be acquired from the each respective [indoor/interior] scene such that the first and the second cameras are disposed at a plurality of different camera poses and orientations. In an exemplary embodiment, the plurality of different camera poses and orientations may include a plurality of camera trajectory styles (movement of the camera). FIG. 6 illustrates exemplary camera trajectory styles 600 for acquiring the plurality of color component and the plurality of point cloud images, consistent with exemplary embodiments of the present disclosure. In an exemplary embodiment, the first and the second cameras may be moved (while capturing the plurality of color component and the plurality of point cloud images) based on different patterns, some of which are shown in FIG. 6. The highlighted lines in the exemplary pictures illustrated in FIG. 6 represent the pattern of camera movement.

[00055] Step 504 may include feeding the plurality of color component images (e.g., the plurality of RGB images) to an image classification DCNN to obtain the trained image classification DCNN. The image classification DCNN may be configured to label the plurality of color component images based on each respective scene and classify the plurality of color component images into the plurality of scene classes. Each of the plurality of scene classes may include all color component images acquired from a specific scene of the plurality of scenes; for example, the scene class “1” may include all RGB/color component images acquired from the scene “1”. The terms "classify" or "classifying" as used herein may refer to labeling of one or more objects (e.g., images, regions, pixels) into one of a number of predefined categories/classes. In exemplary embodiments, the plurality of color component images (e.g., the plurality of RGB images) may be resized to the predefined size (e.g., 224×224×3 pixels) prior to being provided to the image classification DCNN. In an exemplary embodiment, the image classification DCNN may be trained using one or more processors. In an exemplary embodiment, the image classification DCNN may be a pre-trained image classification model. In an exemplary embodiment, the trained image classification DCNN 201 may be obtained by training the pre-trained image classification model through “transfer learning”. The term “transfer learning” refers to a process of storing the information used in properly or improperly solving one problem to solve another problem of the same or similar nature as the first. Transfer learning may also be known as “inductive transfer”. For example, transfer learning may make use of data from previous tasks.

[00056] The step 504 may further include feeding augmented forms of the plurality of color component images to the image classification DCNN. In an exemplary embodiment, the augmented forms of the plurality of color component images may be obtained by brightness variations, contrast variations, noise addition, and a combination thereof. FIG. 10 illustrates exemplary augmentations 1000 of an exemplary color component image acquired from an exemplary scene, consistent with one or more exemplary embodiments of the present disclosure. Referring to FIG. 10, an un-modified version of the exemplary color component image 1002 is shown. The exemplary augmentations 1000 may include, but are not limited to, changing brightness 1004, addition of contrast 1006, addition of noise 1008, insertion of a patch 1010, and a combination thereof 1012.
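As a non-limiting illustration of step 504, the following sketch fine-tunes a generic pre-trained classification backbone on the resized 224×224×3 RGB images. Python with PyTorch/torchvision is assumed, and the choice of ResNet-18, the number of scene classes, the augmentation parameters, and the learning rate are illustrative assumptions; the disclosure does not name a specific backbone or hyperparameters.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

num_scene_classes = 7  # illustrative: one class per scene of the plurality of scenes

# Resize to the predefined 224x224x3 input size; the brightness/contrast jitter stands in
# for the augmented forms of the color component images described above.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ToTensor(),
])

# Transfer learning: start from a pre-trained image classification model and replace
# its final layer with a head over the scene classes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, num_scene_classes)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, scene_labels):
    """One optimization step on a batch of (optionally augmented) RGB images."""
    optimizer.zero_grad()
    loss = criterion(model(images), scene_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```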

[00057] The image classification DCNN disclosed herein may have an architecture similar to the general architecture of DCNNs/CNNs, as described earlier. To introduce robustness and to avoid overfitting of the image classification DCNN (a model) during the training phase 500, exemplary embodiments of the present disclosure may incorporate different strategies including, but not limited to, reducing complexity of the model (e.g., by removing layers or reducing the number of neurons, etc.), early stopping when the model starts to overfit, using data augmentation, using regularization (i.e., adding a penalty term to a loss function), using dropout regularization, and batch normalization. The term “overfitting” as used herein may refer to a phenomenon in which machine learning may be overly concentrated on the data used in learning, resulting in a loss of generality. For example, when the overfitting phenomenon occurs, a CNN/DCNN may perform fairly well in removing artifacts from the images used in training, but artifacts contained in other images may not be effectively removed. A large number of parameters in a typical multilayer CNN/DCNN may tend to overfit the network even after data augmentation. As such, the model parameters and the training process may preferably be designed to reduce overfitting and promote identification of features that may be critical but scarce in the labeled training images.
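Three of the strategies listed above (dropout, an L2 penalty added to the loss, and early stopping) are illustrated in the short sketch below. PyTorch is assumed, and the layer sizes, weight-decay coefficient, and patience value are illustrative assumptions only.

```python
import torch
import torch.nn as nn

# Dropout regularization inside a reduced-complexity classification head.
head = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(256, 7),
)

# weight_decay adds an L2 penalty term to the loss function.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4, weight_decay=1e-4)

def should_stop(val_losses, patience=5):
    """Early stopping: stop when the validation loss has not improved
    for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```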

[00058] Step 506 may include training a plurality of GANs using one or more processors. Each of the plurality of GANs (same as the general architecture of GANs described in the step 106 of FIG. 1) may include a GAN generator and a GAN discriminator. In an exemplary embodiment, each of the plurality of GANs may be specified to a specific scene class of the plurality of scene classes. In exemplary embodiments, training each of the plurality of GANs may include: feeding all color component images of the specific scene class of the plurality of scene classes to the GAN generator (step 506a); feeding all point cloud images corresponding to said all color component images, as ground truth data, to the GAN discriminator (step 506b); and optimizing the GAN generator based on a feedback of the GAN discriminator (step 506c). The point cloud image(s) may refer to the point cloud image(s) acquired by the LiDAR camera. Each of the plurality of GANs may be configured to generate a reconstructed point cloud image (fake image) based on each of the all color-component images classified in the specific scene class. In an exemplary embodiment, the GAN generator may be configured to generate the reconstructed point cloud image (fake image) such that the GAN discriminator may not distinguish the fake image from a corresponding ground truth point cloud image (real image). The GAN discriminator may be configured to determine, based on the ground truth data, whether the reconstructed point cloud image (fake image) generated based on said each of the all color component images is real. For example, GAN “1” from the plurality of GANs may receive a color component image “x” (e.g., an RGB image “x”) acquired from the scene “1” (the scene class “1”) with a determined camera pose and orientation, and may further be configured to generate a reconstructed point cloud image “x” based on said color component image “x”. In exemplary embodiments, using the ground truth point cloud images may help to improve robustness of the developed model for camera localization against environmental changes, such as replacement, addition or movement of an object in an environment (e.g., a structure or a part of a structure), and light changes, especially in an indoor/interior environment.
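A non-limiting sketch of one training iteration for one scene-specific GAN, covering steps 506a through 506c, is given below. Python with PyTorch is assumed; the generator and discriminator stand for any suitable image-to-image GAN pair, and the binary cross-entropy adversarial loss is an illustrative choice rather than a requirement of the disclosure.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def gan_train_step(generator, discriminator, g_opt, d_opt, rgb, gt_point_cloud):
    # Discriminator update (steps 506b/506c): ground truth point cloud images -> "real",
    # reconstructed point cloud images generated from the RGB input -> "fake".
    d_opt.zero_grad()
    real_logits = discriminator(gt_point_cloud)
    fake_logits = discriminator(generator(rgb).detach())
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    d_loss.backward()
    d_opt.step()

    # Generator update (steps 506a/506c): use the discriminator's feedback so the
    # reconstructed point cloud image becomes indistinguishable from the ground truth.
    g_opt.zero_grad()
    gen_logits = discriminator(generator(rgb))
    g_loss = bce(gen_logits, torch.ones_like(gen_logits))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```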

[00059] In general, during training of a GAN, the GAN generator may receive random variables, z, with a probability distribution P_z(z) and may generate artificial/fake samples (e.g., images or text) based on the received random variables, z. The GAN discriminator may receive real/ground truth samples (e.g., real or observed images or text) and the artificial/fake samples generated by the GAN generator, and the GAN discriminator may predict whether the artificial/fake samples generated by the generator are real samples or artificial samples. The GAN discriminator may output a probability value of 1 when it predicts that the artificial/fake samples may be real samples, and a probability value of 0 when the GAN discriminator predicts that the artificial samples may be artificial/fake samples. During the GAN training process, the GAN generator and the GAN discriminator may be trained together to improve performance of each other in an adversarial manner. A GAN may implement a two-player minimax game with the objective of deriving a Nash equilibrium. The GAN generator and the GAN discriminator may be trained together until an adversarial loss function for the GAN is optimized. In an exemplary embodiment, the plurality of GANs may include a plurality of pre-trained GANs that may be trained in the step 506 through transfer learning.
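The two-player minimax game mentioned above corresponds to the standard GAN objective, which, using the notation P_z(z) for the input distribution of the generator, may be written as:

\[
\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim P_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
\]

where D(x) is the probability assigned by the discriminator to a real sample x and G(z) is the sample generated from z.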

[00060] Step 508 may comprise training, using one or more processors, a plurality of multi-modal regression DCNN units. In exemplary embodiments, each of the plurality of multi-modal regression DCNN units may include a first regression DCNN and a second regression DCNN. The plurality of multi-modal regression DCNN units may be configured to estimate pose and orientation of the first and the second cameras at each of the plurality of color component images. In an exemplary embodiment, the first and the second regression DCNNs may be connected to each other. The term “connect” may refer to any appropriate connection between layers through feature fusion, such as a concatenate connection. As used herein, the term “concatenate connection” may refer to connecting features of a substantially same size respectively from layers that are connected with each other, for example, by means of memory mapping. Vectors having features corresponding to each other may be combined by concatenate connection, doubling the number of channels of the layer containing the features. In an exemplary embodiment, the first and the second regression DCNNs may be clipped from a middle layer including a stack of MLP (Multi-Layer Perceptron) layers and may be concatenated to each other. In exemplary embodiments, training said each of the plurality of multi-modal regression DCNN units may include: feeding the all color component images classified in the specific scene class to the first regression DCNN of the each of the plurality of multi-modal regression DCNN units (step 508a), and feeding the reconstructed point cloud image generated based on said each of the all color component images to the second regression DCNN of the each of the plurality of multi-modal regression DCNN units (step 508b). In exemplary embodiments, training each of the multi-modal regression DCNN units may further include: feeding augmented forms of said all color component images to the first regression DCNN, and feeding augmented form(s) of the reconstructed point cloud image generated based on said each of the all color component images to the second regression DCNN. The augmented forms of said all color component images (as shown in FIG. 10) may be obtained by at least one of mask insertion, patch removal, brightness variations, contrast variations, noise addition, and the combination thereof. The augmented forms of the reconstructed point cloud image generated based on said each of the all color component images may be obtained by mask insertion and patch removal. The first and the second regression DCNNs may have an architecture similar to the general architecture of CNNs/DCNNs described earlier in the present disclosure. In an exemplary embodiment, the first regression DCNN and the second regression DCNN may include pre-trained regression models that may be trained in the step 508 through transfer learning.
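A non-limiting sketch of such a fusion by concatenate connection is shown below, assuming Python with PyTorch. The names MultiModalRegressionUnit, rgb_branch, and cloud_branch and the feature width of 256 are illustrative assumptions; each branch is assumed to have been clipped so that it returns an intermediate feature vector rather than a final prediction.

```python
import torch
import torch.nn as nn

class MultiModalRegressionUnit(nn.Module):
    """Fuses the two regression branches by concatenating their intermediate features,
    which doubles the number of channels, before a shared regression head."""
    def __init__(self, rgb_branch, cloud_branch, feat_dim=256):
        super().__init__()
        self.rgb_branch = rgb_branch      # receives the color component image
        self.cloud_branch = cloud_branch  # receives the reconstructed point cloud image
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(inplace=True))
        self.pose_head = nn.Linear(feat_dim, 3)         # Cartesian position (x, y, z)
        self.orientation_head = nn.Linear(feat_dim, 4)  # quaternion (w, p, q, r)

    def forward(self, rgb, point_cloud):
        fused = torch.cat([self.rgb_branch(rgb), self.cloud_branch(point_cloud)], dim=1)
        h = self.head(fused)
        return self.pose_head(h), self.orientation_head(h)
```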

[00061] In an exemplary embodiment, said each of the plurality of multi-modal regression DCNN units may be configured to determine/estimate the pose of the first and the second cameras, by estimating their Cartesian position (i.e., position of the first and the second cameras on X-axis, Y-axis and Z-axis). Meanwhile, said each of the plurality of multi-modal regression DCNN units may be configured to determine/estimate the orientation of the first and the second cameras by estimating their quaternion information/vector. As discussed earlier in the present disclosure, the quaternion vector may comprise four elements including w, p, q, and r.

[00062] In an exemplary embodiment, a loss function may be used to simultaneously reduce differences between the poses and orientations estimated by each of the plurality of multi-modal regression DCNN units and ground truth poses and orientations. In an exemplary embodiment, the loss function may include the following equation:
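In an exemplary embodiment, the loss function may take a variance-weighted form such as the following (a reconstruction offered here only as an assumption, consistent with the definitions in the next paragraph):

\[
\mathcal{L} = \frac{\left\lVert P_{G.T.} - P_{pred} \right\rVert_2}{\mathrm{Var}_p} + \frac{\left\lVert Q_{G.T.} - Q_{pred} \right\rVert_2}{\mathrm{Var}_q}
\]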

[00063] where P_G.T. and Q_G.T. may be a ground truth position and a ground truth quaternion, respectively; P_pred and Q_pred may be a position and a quaternion estimated by the model, respectively; and Var_p and Var_q may be the variance of position and of quaternion, respectively, in every input batch. The || · ||_2 may refer to the Euclidean distance.

[00064] FIG. 7 illustrates a block diagram of an exemplary configuration of a computing system 700 for performing the camera localization, consistent with one or more embodiments of the present disclosure. In exemplary embodiments, the computing system 700 may include at least one processor 702, at least one memory device 704, at least one interface 706, at least one storage device 708, at least one input/output (I/O) device 710, and a display 712, all of which may be coupled to a communication infrastructure 714 such as a bus, message queue, network, and a multi-core message-passing scheme.

[00065] The term “processor” may include any suitable hardware and/or software system, mechanism, or component that processes data, signals, or other information. The processor may include a general-purpose central processing unit, a multiprocessing unit, a dedicated circuit that implements a specific function, or other systems. The processing may not be limited to a geographic location or have time limits. For example, the processor may perform functions in “real-time,” “offline,” “batch mode,” and the like. Some of the processing may be performed at different times and places by another (or the same) processing system. Examples of processing systems may include servers, clients, end-user devices, routers, switches, network storage, and the like.

[00066] In exemplary embodiments, the at least one processor 702 may include one or more of a central processing unit (CPU), a digital signal processor (DSP), a graphics processing unit (GPU), and a neural processing unit (NPU). The at least one processor 702 may perform or control various processes by executing various programs stored in the at least one memory device 704 or the at least one storage device 708. The at least one processor 702 may control the constituent units of the computing system 700. The at least one memory device 704 may include various computer-readable media, such as volatile memory (e.g., random access memory (RAM)) and/or nonvolatile memory (e.g., read-only memory (ROM)). The at least one memory device 704 may also include rewritable ROM, such as Flash memory. The at least one storage device 708 may include various computer-readable media, such as magnetic tapes, magnetic disks, optical disks, solid-state memory (e.g., Flash memory), and the like. The at least one storage device 708 may include removable storage media 717 and/or non-removable storage media (e.g., a hard disk drive 716). The removable storage media 717 may comprise a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory, and the like. The removable storage media 717 may read from and/or may write to a removable storage unit 720 in a well-known manner. The removable storage unit 720 may comprise a floppy disk, magnetic tape, optical disk, etc., which may be read by and may be written to by the removable storage media. As will be appreciated by persons skilled in the relevant art, the removable storage unit 720 may include a computer-usable storage medium/computer-readable medium having stored therein computer software and/or data.

[00067] In exemplary implementations, the at least one storage device 708 may include other similar means for allowing computer programs or other instructions to be loaded into the computing system 700. Such means may include, for example, a removable storage unit 721 and an interface 718. Examples of such means may include a program cartridge and cartridge interface (such as that found in video game devices), a removable memory chip (such as an EPROM, or PROM) and associated socket, the removable storage unit 721 and the interface 718 which may allow software and data to be transferred from the removable storage unit 721 to the computing system 700. In a hardware implementation, the computer-readable medium may be part of the processing system separate from the at least one processor 702. However, as those skilled in the art will readily appreciate, the computer-readable medium, or any portion thereof, may be external to the processing system. By way of example, the computer-readable medium may include a transmission line, a carrier wave modulated by data, and/or a computer product separate from the device, all of which may be accessed by the at least one processor 702 through the communication infrastructure 714. Alternatively, or in addition, the computer-readable medium, or any portion thereof, may be integrated into the at least one processor 702, such as the case may be with cache and/or general register files. Although the various components discussed may be described as having a specific location, such as a local component, they may also be configured in various ways, such as certain components being configured as part of a distributed computing system.

[00068] The computer-readable medium may include a number of software modules. The software modules may include instructions that, when executed by the at least one processor 702, cause the processing system to perform various functions. The software modules may include a transmission module and a receiving module. Each software module may reside in a single storage device or be distributed across the at least one storage device 708. By way of example, a software module may be loaded into RAM from a hard drive when a triggering event occurs. During execution of the software module, the at least one processor 702 may load some of the instructions into cache to increase access speed. One or more cache lines may then be loaded into a general register file for execution by the processor. When referring to the functionality of a software module below, it will be understood that such functionality is implemented by the at least one processor 702 when executing instructions from that software module. Furthermore, it should be appreciated that aspects of the present disclosure result in improvements to the functioning of the processor, computer, machine, or other systems implementing such aspects. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media may include both the at least one storage device 708 and the communication infrastructure 714 including any medium that facilitates transfer of a computer program from one place to another. As discussed, the at least one storage device 708 may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that may be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Additionally, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared (IR), radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray® disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Thus, in some aspects computer-readable media may include non-transitory computer-readable media (e.g., tangible media). In addition, for other aspects computer-readable media may include transitory computer-readable media (e.g., a signal). Combinations of the above should also be included within the scope of computer-readable media.

[00069] Thus, certain aspects may include a computer program product for performing the operations presented herein. For example, such a computer program product may include a computer-readable medium having instructions stored (and/or encoded) thereon, the instructions being executable by the at least one processor 702 to perform the operations described herein. For certain aspects, the computer program product may include packaging material.

[00070] Referring again to FIG. 7, consistent with exemplary embodiments, the at least one I/O device 710 may include various devices that may allow data and/or other information to be input to or retrieved from the computing system 700. The at least one I/O device 710 may include cursor control devices, keyboards, keypads, microphones, monitors or other display devices, speakers, printers, network interface cards, modems, and the like.

[00071] The at least one interface 706 may include various interfaces that may allow the computing system 700 to interact with other systems, devices, or computing environments. The at least one interface 706 may include a communications interface 723 that may allow software and data to be transferred between the computing system 700 and external devices. The communications interface 723 may include a modem, a network interface 726, a communications port, a PCMCIA slot and card, or the like. Said network interface 726 may include, but is not limited to, interfaces to local area networks (LANs), wide area networks (WANs), fourth generation long term evolution (4G LTE) connectivity, unlicensed Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, and the like. Software and data transferred via the communications interface 723 may be in the form of signals, which may be electronic, electromagnetic, optical, or other signals capable of being received by the communications interface 723. These signals may be provided to the communications interface 723 via a communications path 728. The communications path 728 may carry signals and may be implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link or other communications channels.

[00072] The at least one interface 706 may further include a user interface 722 and a peripheral device interface. The at least one interface 706 may also include one or more peripheral interfaces 724 such as interfaces for printers, pointing devices (mice, track pad, or any suitable user interface now known to those of ordinary skill in the field, or later discovered), keyboards, and the like.

[00073] The display device 712 may include any type of device capable of displaying information to one or more users of the computing system 700. Examples of the display device 712 may include a monitor, a display terminal, a video projection device, and the like.

[00074] The communication infrastructure 714 may allow the at least one processor 702, the at least one memory device 704, the at least one interface 706, the at least one storage device 708, and the at least one I/O device 710 to communicate with one another, as well as other devices or components coupled to the communication infrastructure 714. The communication infrastructure 714 may represent one or more of several types of bus structures, such as a system bus, PCI bus, IEEE bus, USB bus, and so forth.

[00075] If programmable logic is used, such logic may execute on a commercially available processing platform or a special purpose device. One of ordinary skill in the art may appreciate that an embodiment of the disclosed subject matter may be practiced with various computing system configurations, including multi-core multiprocessor systems, minicomputers, mainframe computers, computers linked or clustered with distributed functions, as well as pervasive or miniature computers that may be embedded into virtually any device. For instance, the computing system 700 having the at least one processor 702 and the at least one memory device 704 may be used to implement the above-described embodiments. A processor device may be a single processor, a plurality of processors, or combinations thereof. Processor devices may have one or more processor “cores”.

[00076] An embodiment of the disclosure is described in terms of the exemplary computing system 700. It would be apparent to a person skilled in the relevant art how to implement the disclosure using other computing systems and/or computer architectures. Although operations may be described as a sequential process, some of the operations may in fact be performed in parallel, concurrently, and/or in a distributed environment, and with program code stored locally or remotely for access by single or multi-processor machines. In addition, in some embodiments the order of operations may be rearranged without departing from the spirit of the disclosed subject matter.

[00077] FIG. 8 is a block diagram illustrating a configuration example of a localization system 800, consistent with one or more exemplary embodiments of the present disclosure. In exemplary embodiments, the localization system 800 may include a CMOS camera 802, and a computing device 804. The computing device 804 may include the at least one processor 702, the at least one memory device 704, the at least one storage device 708, the user interface 722, the display device 712, and the communications interface 723. The localization system 800 may be implemented with a single processor or multiple processors. Alternatively, in an exemplary embodiment, the localization system 800 may be implemented by a plurality of modules included in different apparatuses. In such a case, the plurality of modules may be connected through the network interface 726. The localization system 800 may be equipped in various systems and/or computing devices, for example, a smartphone, a mobile device, a wearable device, a PC, a laptop computer, a tablet computer, an intelligent vehicle, a smart home appliance, an autonomous vehicle, a robot, and the like.

[00078] The CMOS camera 802 may be adapted to capture the color component image 202 from the scene in an environment (e.g., an indoor environment). The localization system 800 may retrieve or receive the color component image 202 and may use this image to estimate the pose 406 and the orientation 408 of the CMOS camera 802 utilizing methods described in the exemplary embodiments.

[00079] The communications interface 723 may include the network interface 726. The network interface 726 may include, but is not limited to, interfaces to local area networks (LANs), wide area networks (WANs), fourth generation long term evolution (4G LTE) connectivity, unlicensed Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, and the like. The communications interface 723 may acquire or retrieve the color component image 202. The acquired color component image 202 may be immediately sent to the at least one processor 702 and may be subjected to various processes therein, or may be stored in the at least one memory device 704 and may then be sent to the at least one processor 702 and subjected to various processes therein, if necessary.

[00080] The display device 712 may include any type of device capable of displaying information to one or more users of the localization system 800.

[00081] The user interface 722 may include a touch panel, a pointing device, a keyboard, a voice input, or additional user input devices. The user interface 722 may be configured to receive the color component image 202 from a user of the localization system 800. In exemplary embodiments, the display device 712 may serve as the user interface 722 as well.

[00082] The at least one memory device 704 may include the read-only memory (ROM) or the random access memory (RAM). The at least one memory device 704 may store a variety of information or programs. The variety of information may include, for example, any data generated by the at least one processor 702.

[00083] The at least one processor 702 may include one or more of the central processing unit (CPU), the digital signal processor (DSP), the graphics processing unit (GPU), or the neural processing unit (NPU). The at least one processor 702 may perform or control various processes by executing various programs stored in the at least one memory device 704 and the at least one storage device 708. The at least one processor 702 may control the constituent units of the localization system 800. In exemplary embodiments, the at least one processor 702 may estimate the pose 406 and the orientation 408 of the CMOS camera 802.

[00084] FIG. 9 is a block diagram illustrating a run-time operation 900 of an application installed on a smartphone 902, consistent with exemplary embodiments of the present disclosure. The application may cause the at least one processor 702 (e.g., a CPU, a DSP, a GPU and/or an NPU) to perform supporting computations during the run-time operation 900 of the application. The application may be configured to call functions defined on a user/client-side 904; for example, the application may provide for recognition of a scene indicative of a location (pose and orientation) in which the smartphone 902 currently operates. The application may, for example, use the CMOS camera 802 to provide for recognition of the scene and localization of the smartphone 902. The application may make a request to a compiled program code defined in an application programming interface (API) to provide an estimate of the current scene recognition and camera/smartphone 902 localization. This request may ultimately rely on an output of a deep neural network (e.g., the trained multi-modal regression DCNN 402) configured to provide scene estimates based on the acquired color component (e.g., RGB) image, for example the color component image 202.
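By way of a non-limiting illustration only, such a client request may resemble the sketch below. The disclosure describes the client-side pre-processing in Java and a socket interface; the Python/HTTP version here, the endpoint URL, and the 224×224 resize are illustrative assumptions used only to show the shape of the interaction.

```python
import io
import requests
from PIL import Image

def request_localization(image_path, endpoint="http://example.com/api/localize"):
    """Convert, resize, and send a color component image; the response is expected
    to carry the estimated pose and orientation (hypothetical endpoint and schema)."""
    img = Image.open(image_path).convert("RGB")  # format conversion
    img = img.resize((224, 224))                 # crop/resize to the model input size
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    resp = requests.post(endpoint, files={"image": buf.getvalue()})
    return resp.json()
```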

[00085] Referring to FIG. 9, the application may include a pre-processing unit 906 on the user side 904 that may be configured (using, for example, the JAVA programming language) to convert the format of the color component image 202 and then crop and/or resize the color component image 202. The pre-processed image may then be communicated via a socket interface 908 to a “scene classification and localization” application defined on a server-side 910. The pre-processed image may be received by a web server 912 on the server-side 910. The “scene classification and localization” application may include a run-time engine that may be configured (using, for example, the C programming language) to recognize and classify the acquired color component image 202 based on the scene class that it may belong to. The run-time engine may be configured to further preprocess the image by scaling and cropping. After receiving the pre-processed image by the web server 912, the web server 912 may pass the pre-processed image to one or more servers 918 through message queuing 914. The message queuing 914 may be configured to process a sequence of input data (e.g., color component images) received from a plurality of users/clients in a sequential order before passing them to the one or more servers 918. A load balancer 916 may be used as a reverse proxy and may be configured to distribute the application traffic across a number of servers (i.e., the one or more servers 918). The one or more servers 918 may include the computing system 700 and its corresponding processing elements described above. As discussed, the computing system 700 may include the at least one processor 702 that may perform the method 100 and the processes 200, 300 and 400 by executing the instructions related to the neural networks (i.e., the trained image classification DCNN 201, the trained GAN 302, and the trained multi-modal regression DCNN 402) stored in the at least one storage device 708. In exemplary embodiments, the processes 300 and 400 may output the pose 406 and the orientation 408 of the smartphone 902.

[00086] The steps of a method or algorithm described in connection with the present disclosure may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in any form of storage medium that is known in the art. Software shall be construed broadly to mean instructions, data, or any combination thereof, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Some examples of storage media that may be used include random access memory (RAM), read-only memory (ROM), flash memory, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a CD-ROM and so forth. A software module may include a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media. A storage medium may be coupled to a processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.

[00087] Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including an in-dash vehicle computer, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, various storage devices, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

[00088] Further, it should be appreciated that modules and/or other appropriate means for performing the methods and techniques described herein may be downloaded and/or otherwise obtained by a user terminal and/or base station as applicable. For example, such a device may be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, various methods described herein may be provided via storage means (e.g., RAM, ROM, a physical storage medium such as a compact disc (CD) or floppy disk, etc.), such that a user terminal and/or base station may obtain the various methods upon coupling or providing the storage means to the device. Moreover, any other suitable technique for providing the methods and techniques described herein to a device can be utilized.

[00089] The flowcharts and block diagrams in the Figures may illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, may be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

[00090] The various operations of methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software component(s) and/or module(s), including, but not limited to, a circuit, an application-specific integrated circuit (ASIC), or processor.

[00091] While the foregoing has described what are considered to be the best mode and/or other examples, it is understood that various modifications may be made therein and that the subject matter disclosed herein may be implemented in various forms and examples, and that the teachings may be applied in numerous applications, only some of which have been described herein. It is intended by the following claims to claim any and all applications, modifications and variations that fall within the true scope of the present teachings.

[00092] Unless otherwise stated, all measurements, values, ratings, positions, magnitudes, sizes, and other specifications that are set forth in this specification, including in the claims that follow, are approximate, not exact. They are intended to have a reasonable range that is consistent with the functions to which they relate and with what is customary in the art to which they pertain.

[00093] The scope of protection is limited solely by the claims that now follow. That scope is intended and should be interpreted to be as broad as is consistent with the ordinary meaning of the language that is used in the claims when interpreted in light of this specification and the prosecution history that follows and to encompass all structural and functional equivalents. Notwithstanding, none of the claims are intended to embrace subject matter that fails to satisfy the requirement of Sections 101, 102, or 103 of the Patent Act, nor should they be interpreted in such a way. Any unintended embracement of such subject matter is hereby disclaimed.

[00094] Except as stated immediately above, nothing that has been stated or illustrated is intended or should be interpreted to cause a dedication of any component, step, feature, object, benefit, advantage, or equivalent to the public, regardless of whether it is or is not recited in the claims.

[00095] It will be understood that the terms and expressions used herein have the ordinary meaning as is accorded to such terms and expressions with respect to their corresponding respective areas of inquiry and study except where specific meanings have otherwise been set forth herein. Relational terms such as first and second and the like may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. An element preceded by “a” or “an” does not, without further constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.

[00098] The Abstract of the Disclosure is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it may be seen that various features are grouped together in various implementations. This is for purposes of streamlining the disclosure, and is not to be interpreted as reflecting an intention that the claimed implementations require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed implementation. Thus, the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.

[00099] While various implementations have been described, the description is intended to be exemplary, rather than limiting, and it will be apparent to those of ordinary skill in the art that many more implementations are possible that are within the scope of the implementations. Although many possible combinations of features are shown in the accompanying figures and discussed in this detailed description, many other combinations of the disclosed features are possible. Any feature of any implementation may be used in combination with or substituted for any other feature or element in any other implementation unless specifically restricted. Therefore, it will be understood that any of the features shown and/or discussed in the present disclosure may be implemented together in any suitable combination. Accordingly, the implementations are not to be restricted except in light of the attached claims and their equivalents. Also, various modifications and changes may be made within the scope of the attached claims.