

Title:
JOINT OBJECT DETECTION AND SIMULTANEOUS LOCALIZATION AND MAPPING METHOD FOR AUTOMATED PERCEPTION
Document Type and Number:
WIPO Patent Application WO/2023/118943
Kind Code:
A1
Abstract:
The present application describes a method to improve Automated Perception in autonomous driving vehicles, namely Object Detection (OD) and Simultaneous Localization and Mapping (SLAM), with both procedures being performed simultaneously. The method comprises an initial training process configured to train a neural network model through a set of processing stages comprising input sets of three frames from a sensor; and an inference process phase configured to use the trained neural network to perform inference for OD and SLAM tasks on a stream of unlabeled data frames acquired from the sensor.

Inventors:
DA SILVA SIMÕES CLÁUDIA PATRÍCIA (PT)
DA ROCHA AFONSO JORGE TIAGO (PT)
MONTEIRO RAMOS FERREIRA FILIPA MARÍLIA (PT)
OLIVEIRA GIRÃO PEDRO MIGUEL (PT)
DANTAS CERQUEIRA RICARDO (PT)
LINHARES DA SILVA ANTÓNIO JOSÉ (PT)
CUNHA COSTINHA NÉVOA RAFAEL AUGUSTO (PT)
AZEVEDO FERNANDES DUARTE MANUEL (PT)
Application Number:
PCT/IB2021/062242
Publication Date:
June 29, 2023
Filing Date:
December 23, 2021
Assignee:
BOSCH CAR MULTIMEDIA PORTUGAL SA (PT)
UNIV DO MINHO (PT)
International Classes:
G06N3/04; G01S17/89; G06N3/08
Foreign References:
CN111325794A (2020-06-23)
Other References:
ESKIL J\"ORGENSEN ET AL: "Monocular 3D Object Detection and Box Fitting Trained End-to-End Using Intersection-over-Union Loss", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 19 June 2019 (2019-06-19), XP081377955
Attorney, Agent or Firm:
DA SILVA GUEDELHA NEVES, Ana Isabel (PT)
Claims:
CLAIMS

1. Method for Object Detection and Simultaneous Localization and Mapping Learning and Inference for Automated Perception in autonomous vehicles, comprising an initial training process (1) configured to train a neural network model through a set of processing stages comprising input sets of three frames (11, 12, 13) from a sensor; and an inference process phase (2) configured to use the trained neural network with a data frame representation step (201) and a set of n OD frame processing stages (202, 203, 204) simultaneously operating with a stream of unlabeled data frames (2010) acquired from the sensor; wherein the data frame representation step (201) processes a latest collected data frame t (2010) from the stream of unlabeled frames; and the set of n OD frame processing stages (202, 203, 204) processes the previous t-n remaining data frames from the stream of unlabeled frames from the sensor.

2. Method according to the previous claim, wherein the data frame representation step (201) comprises the steps of: data representing (2011) a frame t (2010) from the stream of unlabeled frames from a sensor; extract the features (2012) of the data representation (2011) of the frame t (2010), and store (2014) them in a feature pyramid (2013) with n levels; detect the object (2015) from the extracted features (2012), particularly in the form of object class, bounding box and orientation (2016).

3. Method according to any of the previous claims, wherein the frame processing stage xt-i (202) comprises the steps of predicting the distribution of relative poses (2025) between the frame xt-i and the current frame xt (202), providing the output to the SLAM backend (210) while updating features (2026) for the given frame.

4. Method according to any of the previous claims, wherein the training of the neural network model comprises an input sensor data step (10) , a data representation step (20) , a feature extraction step (30) , a detection and pose regression step (40) , an output data step (50) , and a multi-task loss function step (60) .

5. Method according to any of the previous claims, wherein the input sets of three frames from a sensor comprise two frames for an SMI procedure, an SMI frame t (11) and an SMI frame t-r (12), and one frame for OD, an OD frame x (13).

6. Method according to any of the previous claims, wherein the data representation step (20) converts the SMI frame t (11) into an SMI frame t Data representation (21), the SMI frame t-r (12) into an SMI frame t-r Data representation (22), and the OD frame x (13) into an OD frame x Data representation (23).

7. Method according to any of the previous claims, wherein the feature extraction step (30) uses the neural network to extract key features from the set of three frames (11, 12, 13), in particular from the data representations (21, 22, 23) of said frames.

8. Method according to any of the previous claims, wherein the Detection and Pose Regression step (40) comprises a confidence level step (41) that will produce an output that defines the confidence level of an intersection between two SMI frames (11, 12), i.e., an intersection confidence (52), and the distribution of transformations between the two SMI frames, i.e., a distribution of relative pose (51).

9. Method according to any of the previous claims, wherein the output data step (50) comprises a distribution of relative pose (51) and an intersection confidence (52) information of the two SMI frames (11, 12), and an estimated object class, bounding boxes and orientations (53) for detected objects in the OD frame.

10. Method according to any of the previous claims, wherein the multi-task loss function step (60) comprises the calculation of a loss value based on: i) the distribution of relative pose (51) and the confidence level of intersection (52) between the two SMI frames (11, 12), and ii) the estimated object class, bounding box and orientation (53) for the detected objects in the OD frame.

11. Computer program, configured to carry out every step of the method described in previous claims 1 to 10.

12. (Non-transitory) Machine-readable storage device, on which the computer program of claim 11 is stored.

13. Data processing system, comprising the necessary physical means for the execution of the computer program of claim 11.

14. Electronic control unit, configured to carry out every step of one of the methods of claims 1 to 10.

Description:
DESCRIPTION

"Joint Object Detection and Simultaneous Localization and Mapping Method for Automated Perception"

Technical Field

The present application describes a Joint Object Detection and Simultaneous Localization and Mapping Learning and Inference Method for Automated Perception in autonomous vehicles.

Background art

Autonomous vehicles are typically equipped with a variety of sensors and computer systems to allow the vehicle to perceive its external environment. Examples of these sensors are LiDAR (Light Detection and Ranging), RADAR (Radio Detection and Ranging), cameras and GPS (Global Positioning System) / IMU (Inertial Measurement Unit).

In particular, the LiDAR (Light Detection and Ranging) sensor is a laser sensor which sends light beams to scan the surrounding environment, allowing the measurement of distances/ranges from the laser sensor to an obstacle present in said surroundings. The output of LiDAR sensors is usually classified as Point Clouds, which are described as sets of points in a three-dimensional (3D) space representing measured distances from the sensor to elements in the target environment. For automated perception, 3D LiDAR sensors are used, which provide valuable information about the environment in the form of 3D Point Clouds. Each point then represents information about sensed distances to elements in the environment, which includes position information (x, y, and z coordinates in the LiDAR coordinate system) and may include additional data such as reflection and intensity values, among others.
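For illustration only, the following minimal Python sketch (with hypothetical names and synthetic values, not taken from the present application) shows one way such a point cloud, with per-point position, intensity and reflection values, could be held in memory:

import numpy as np

def load_dummy_point_cloud(num_points: int = 1000, seed: int = 0) -> np.ndarray:
    """Synthetic stand-in for one LiDAR scan: an N x 5 array of (x, y, z, intensity, reflection)."""
    rng = np.random.default_rng(seed)
    xyz = rng.uniform(-50.0, 50.0, size=(num_points, 3))       # positions in metres, sensor frame
    intensity = rng.uniform(0.0, 1.0, size=(num_points, 1))    # return intensity
    reflection = rng.uniform(0.0, 1.0, size=(num_points, 1))   # reflectivity value
    return np.hstack([xyz, intensity, reflection])

cloud = load_dummy_point_cloud()
ranges = np.linalg.norm(cloud[:, :3], axis=1)                  # distance of each point to the sensor
print(cloud.shape, float(ranges.max()))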

Considering the field of autonomous driving, the vehicle will be called upon to identify and recognize existing objects in its surroundings, particularly in 3D space. At the same time, the vehicle shall build a map of its external environment and determine the position of the vehicle within the map. These two procedures are often called Object Detection (OD) and Simultaneous Localization and Mapping (SLAM), respectively, and are part of the Perception layer. In general, automated perception is the ability to sense an autonomous vehicle's environment and localize the vehicle within that environment.

Particularly, Object Detection (OD) is a crucial procedure for Automated Perception in order to provide vital inputs about the size, location and type of objects present in the scene. In this way, the autonomous vehicle can perceive its environment and make these inputs available to the decision-making layer towards achieving a fully autonomous driving vehicle. Prior state-of-the-art literature proposes that 3D OD models can be decomposed into four major stages, namely Point-Cloud Representation, Data Feature Extractor, Detection Module and Predictions Refinement Network. These four categories define the architecture of these models. There are some state-of-the-art models for 3D Object Detection such as Pixor, SECOND, PointPillars, Fast Point R-CNN and PV-RCNN.

Furthermore, SLAM refers to the challenge of building a map of the environment and determining the current positioning of the autonomous vehicle within the map. A common sub-procedure in SLAM is scan-matching, which estimates the transformation between two scans. In the prior art, there are methods for scan-matching such as Iterative Closest Point (ICP) and Normal Distribution Transform (NDT).

However, the prior art does not contain methods integrated in a neural network able to, besides the scan-matching, also detect objects simultaneously. Another component of SLAM is the backend optimization, where the map is updated for the new autonomous vehicle's pose. Known prior art makes reference to existing methods like iSAM, g2o and gtsam.

In summary, the prior art methods are only focused on detecting objects and performing SLAM in separate and independent procedures. Therefore, the present disclosure describes how to solve two main procedures of Automated Perception in autonomous driving vehicles, namely the Object Detection (OD) and Simultaneous Localization and Mapping (SLAM) procedures. Prior art handles these two procedures separately; however, benefits can be derived if both are carried out simultaneously, as will be shown hereinafter.

Summary

The present invention describes a method for Object Detection and Simultaneous Localization and Mapping Learning and Inference for Automated Perception in autonomous vehicles, comprising an initial training process configured to train a neural network model through a set of processing stages comprising input sets of three frames from a sensor; and an inference process phase configured to use the trained neural network with a data frame representation step and a set of n OD frame processing stages simultaneously operating with a stream of unlabeled data frames acquired from the sensor; wherein the data frame representation step processes a latest collected data frame t from the stream of unlabeled frames; and the set of n OD frame processing stages processes the previous t-n remaining data frames from the stream of unlabeled frames from the sensor.

In a proposed embodiment of present invention, the data frame representation step comprises the steps of: data representing a frame t from the stream of unlabeled frames from a sensor; extracting the features of the data representation of the frame t, and storing them in a feature pyramid with n levels; detecting the object from the extracted features, particularly in the form of object classification, bounding box and orientation.

Yet in another proposed embodiment of present invention, the frame processing stage xt-i comprises the steps of predicting the distribution of relative poses between the frame xt-i and the current frame xt, providing the output to the SLAM backend while updating features for the given frame.

Yet in another proposed embodiment of present invention, the training of the neural network model comprises an input sensor data step, a data representation step, a feature extraction step, a detection and pose regression step, an output data step, and a multi-task loss function step.

Yet in another proposed embodiment of present invention, the input sets of three frames from a sensor comprise two frames for a Scan Matching and Intersection (SMI) procedure, an SMI frame t and an SMI frame t-r, and one frame for OD, an OD frame x.

Yet in another proposed embodiment of present invention, the data representation step converts the SMI frame t into an SMI frame t Data representation, the SMI frame t-r into an SMI frame t-r Data representation, and the OD frame x into an OD frame x Data representation.

Yet in another proposed embodiment of present invention, the feature extraction step uses the neural network to extract key features from the set of three frames, in particular from the data representations of said frames.

Yet in another proposed embodiment of present invention, the Detection and Pose Regression step comprises a confidence level step that will produce an output that defines the confidence level of an intersection between two SMI frames, i.e., an intersection confidence, and the distribution of transformations between the two SMI frames, i.e., a distribution of relative pose.

Yet in another proposed embodiment of present invention, the output data step comprises a distribution of relative pose and an intersection confidence information of the two SMI frames, and an estimated object class, bounding boxes and orientations for detected objects in the OD frame.

Yet in another proposed embodiment of present invention, the multi-task loss function step comprises the calculation of a loss value based on: i) the distribution of relative pose and the confidence level of intersection between the two SMI frames, and ii) the estimated object class, bounding box and orientation for the detected objects in the OD frame.

The present invention further describes a computer program, configured to carry out every step of the method for Object Detection and Simultaneous Localization and Mapping Learning and Inference for Automated Perception in autonomous vehicles.

The present invention further describes a (Non-transitory) Machine-readable storage device, on which the computer program is stored.

The present invention further describes a data processing system, comprising the necessary physical means for the execution of the computer program.

The present invention further describes an electronic control unit, configured to carry out every step of the method for Object Detection and Simultaneous Localization and Mapping Learning and Inference for Automated Perception in autonomous vehicles.

General Description

This invention describes a new method to perform the Object Detection (OD) and Simultaneous Localization and Mapping (SLAM) procedures simultaneously based on LiDAR sensor data, taking advantage of existing ground truth data for each of the procedures in order to improve the final overall performance. Since the OD and SLAM procedures are executed simultaneously, there is also a reduction in run-time when compared with sequential or separate execution of both procedures.

This technological development is a contribution to improving the ability of Advanced Driving Assistant Systems (ADAS) and Autonomous Driving Systems (ADS) to detect objects and to build a map of the environment of an autonomous vehicle.

Generally, this invention relates to the training and inference processes for a new joint pipeline for OD and SLAM procedures for Automated Perception based on a LiDAR sensor. Therefore, this new method includes two main phases: a training phase and an inference phase.

The training phase is responsible for training the neural network model for both OD and SLAM procedures. This model is then used in the inference phase to execute the OD and SLAM procedures having unlabelled sensor data as its input.

The training phase will receive as input sets of three LiDAR frames: two frames for the Scan Matching and Intersection (SMI) procedures, and one frame for the OD procedure. SMI is related to finding an appropriate spatial transformation from a given set of source points to a set of target points. In the point cloud domain, this thus relates directly to finding the optimal rotation and translation that aligns two point clouds. Furthermore, scan intersection relates to the problem of assessing whether the real-world areas covered by any two scans (or point clouds) intersect. For the above-mentioned two SMI frames, corresponding ground truth data is known a priori: the relative transformation between them (i.e. the translation and rotation representation which leads one frame to be aligned with the other frame) and the intersection level between them. The intersection level between any two frames is defined as the overlap between the areas covered by the sensor measurements. Some sufficient level of overlap is required to successfully compute the transformation between the two frames (i.e. frames acquired in close spatial proximity will have a high intersection level, whereas frames acquired in distant areas will have a low or zero intersection level). For the single OD frame, the objects of interest in the scene are individually labelled a priori in the form of bounding boxes, orientations, and object classes.
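Purely to illustrate this input structure, the sketch below (hypothetical container names, not taken from the present application) shows how one training sample, i.e. the two SMI frames with their relative-pose and intersection ground truth plus one labelled OD frame, could be organised in Python:

from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class ObjectLabel:
    object_class: str            # e.g. "car", "pedestrian"
    box_center: np.ndarray       # (x, y, z) of the bounding box centre
    box_size: np.ndarray         # (length, width, height)
    yaw: float                   # orientation around the vertical axis, in radians

@dataclass
class TrainingSample:
    smi_frame_t: np.ndarray          # point cloud at time t        (11)
    smi_frame_t_r: np.ndarray        # point cloud at time t-r      (12)
    od_frame_x: np.ndarray           # labelled OD point cloud      (13)
    gt_relative_pose: np.ndarray     # 4x4 transform aligning frame t-r with frame t
    gt_intersection_level: float     # overlap between the two SMI scans, in [0, 1]
    od_labels: List[ObjectLabel] = field(default_factory=list)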

With the training procedure, the neural network model is trained to: i) compute the distribution of relative positions between the SMI frames, as well as a level of confidence for their intersection; and ii) predict the presence of objects in the scene of the OD frame, along with their object class, bounding box and orientation.

This phase ensures the neural network model is trained to solve both procedures based on a shared multi-task loss and with shared weights.

The inference phase is responsible for making use of the model trained in the training phase to perform OD and SLAM procedures from a stream of unlabelled LiDAR frames. The process of Simultaneous Localization and Mapping (SLAM) is usually divided into two parts:

1) A SLAM frontend, which constructs a graph-based representation of the environment and attempts to perform data association of frames. The SLAM frontend can be considered as a sensor signal processing algorithm. The SLAM frontend alone cannot guarantee accuracy all the time, due to inconsistent estimates and error accumulation. Therefore,

2) A SLAM backend is also considered, where the main goal is then to apply optimization techniques on the graph representation previously built by the SLAM frontend. This SLAM backend, as opposed to the frontend, can be considered sensor-agnostic.

The first stage of the method includes receiving a new frame and transforming it using the same data representation method from the training phase. Then, the trained feature extractor ensures key features are extracted and organizes these features based on the resolution levels defined during the training phase (feature pyramid). Subsequently, the OD procedure can be executed in parallel with the remaining SLAM procedures. For the OD procedure, the model predicts the presence of objects in the new frame, along with their object class, bounding box and orientation. For the SLAM procedure, a pose-only likelihood of intersection between each previously collected and stored frame and the new frame is evaluated. For each of those selected frames, previously extracted features are loaded from memory and used to calculate SMI procedures between the frame and the received frame. Therefore, the model is configured to compute: i) the distribution of relative poses; and ii) the sensor-only confidence level of intersection between the two frames.
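A high-level, non-authoritative sketch of this inference flow is given below; every helper used here (represent, extract_features, detect_objects, pose_only_intersection, smi_head, add_constraint, optimize) is a placeholder name assumed for illustration only and is not defined by the present application:

def process_new_frame(frame, trained_model, stored_frames, slam_backend):
    """Run OD and the SLAM frontend steps for one incoming unlabelled frame (sketch only)."""
    rep = trained_model.represent(frame)                  # same data representation as in training
    pyramid = trained_model.extract_features(rep)         # feature pyramid with n resolution levels
    stored_frames.append(pyramid)                         # keep features for later frame pairs

    detections = trained_model.detect_objects(pyramid)    # OD runs on the newest frame

    for old_pyramid, old_pose in stored_frames.previous():
        # Cheap pose-only gate: skip stored frames unlikely to overlap the new one.
        if not slam_backend.pose_only_intersection(old_pose):
            continue
        confidence, rel_pose_dist = trained_model.smi_head(old_pyramid, pyramid)
        slam_backend.add_constraint(confidence, rel_pose_dist)

    slam_backend.optimize()                               # global pose optimization in the backend
    return detections, slam_backend.current_pose()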

Those outputs are provided to the SLAM backend, which is responsible for the global pose optimization.

Brief description of the drawings

For better understanding of the present application, figures representing preferred embodiments are herein attached which, however, are not intended to limit the technique disclosed herein.

Fig. 1 - discloses a flowchart illustrating the steps for the training process for a new joint OD and SLAM procedures Learning for Automated Perception. The reference numbers are related to:

1 - training process;

10 - Input sensor data step (e.g., LiDAR sensor, RGB camera) / data frames;

11 - SMI frame t;

12 - SMI frame t-r, corresponding to a frame previously acquired;

13 - OD frame x;

20 - Data representation step;

21 - SMI frame t Data representation;

22 - SMI frame t-r Data representation, where the frame t-r corresponds to a frame previously acquired;

23 - OD frame x Data representation;

30 - Feature extraction step;

31 - SMI frame t Feature Extractor;

32 - SMI frame t-r Feature Extractor;

33 - OD frame x Feature Extractor;

40 - Detection and Pose Regression step;

41 - Confidence level step;

42 - Intersection classification step;

43 - Pose Regression step;

44 - intersection;

45 - object detector;

50 - Output data step;

51 - Distribution of relative pose;

52 - Intersection confidence;

53 - estimated object class, bounding box and orientation;

60 - Multi-task loss function step;

61 - loss function calculation;

311 - SMI frame t feature pyramid step;

321 - SMI frame t-r feature pyramid step;

W - weights shared between the feature extractors (31, 32, 33) for both procedures;

Fig. 2 - discloses a flowchart illustrating the steps for the inference process for a new joint OD and SLAM procedures Learning for Automated Perception. The reference numbers are related to:

2 - inference process;

201 - data frame representation step;

202 - OD frame xt-i processing stage;

203 - OD frame xt-2 processing stage;

204 - OD frame xt-n processing stage;

210 - SLAM backend;

2010 - SMI frame t;

2011 - SMI frame t Data representation method;

2012 - SMI frame t Feature Extractor;

2013 - SMI frame t feature pyramid;

2014 - store features;

2015 - object detector;

2016 - object classification, bounding box, orientation;

2017 - Map Pose of Vehicle;

2020 - likely to intersect?;

2021 - retrieve features;

2022 - save features;

2023 - intersection classification & pose;

2024 - intersection confidence, distribution of relative poses;

2025 - predicted relative pose;

2026 - update features;

2031 - retrieve features;

2032 - save features;

2033 - intersection classification & pose;

2034 - intersection confidence, distribution of relative poses;

2035 - predicted relative pose;

2036 - update features;

2040 - likely to intersect?;

2041 - retrieve features;

2042 - save features;

2043 - intersection classification & pose;

2044 - intersection confidence, distribution of relative poses;

2045 - predicted relative pose;

2046 - update features;

20201 - NO;

20202 - YES;

20231 - Not likely to intersect;

20232 - likely to intersect;

2030 - likely to intersect?;

20301 - NO;

20302 - YES;

20331 - Not likely to intersect;

20332 - likely to intersect;

20401 - NO;

20402 - YES;

20431 - Not likely to intersect;

20432 - likely to intersect;

Description of Embodiments

With reference to the figures, some embodiments are now described in more detail, which are however not intended to limit the scope of the present application.

As previously mentioned, the herein disclosed method performs simultaneous procedures based on LiDAR sensor data, in particular Object Detection and Simultaneous Localization and Mapping.

The first stage of the method consists in the execution of a training process (1). The training process (1) comprises six steps: Input sensor data step (10), Data representation step (20), Feature extraction step (30), Detection and Pose Regression step (40), Output data step (50) and multi-task loss function step (60).

The input sensor data step (10) is responsible for acquiring the data frames from the sensor, which include sets of three frames: two frames for the Scan Matching and Intersection (SMI) procedures, i.e., an SMI frame t (11) and an SMI frame t-r (12), and one frame for the OD procedure, i.e., an OD frame x (13). The two SMI frames (11, 12) are taken at time frames t-r and t, where r>0. For this pair of frames, additional corresponding transformation and intersection level ground truth data is available. For the single OD frame x, corresponding object ground truth data is available (position, orientation, bounding box and object class for each individual instance). The data representation step (20) transforms each frame into a format that can be processed by the following steps. This means that the SMI frame t (11) will be converted into an SMI frame t Data representation (21), the SMI frame t-r (12) into an SMI frame t-r Data representation (22), and the OD frame x (13) into an OD frame x Data representation (23). Different data representation approaches such as (but not limited to) Voxel-, Point-, and Frustum-based or 2D Projection can be used.
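As a concrete illustration of one possible 2D Projection representation (the grid bounds, cell size and function name below are assumptions made for this sketch, not parameters of the present application), a point cloud can be projected onto a bird's-eye-view occupancy grid as follows:

import numpy as np

def bev_occupancy_grid(points: np.ndarray,
                       x_range=(-50.0, 50.0),
                       y_range=(-50.0, 50.0),
                       cell_size=0.25) -> np.ndarray:
    """Project a point cloud (N x >=3 array) onto a 2D occupancy grid (illustrative only)."""
    nx = int((x_range[1] - x_range[0]) / cell_size)
    ny = int((y_range[1] - y_range[0]) / cell_size)
    grid = np.zeros((nx, ny), dtype=np.float32)

    ix = ((points[:, 0] - x_range[0]) / cell_size).astype(int)   # cell index along x
    iy = ((points[:, 1] - y_range[0]) / cell_size).astype(int)   # cell index along y
    mask = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)         # discard points outside the grid
    grid[ix[mask], iy[mask]] = 1.0                               # cell contains at least one point
    return grid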

The Feature Extraction step (30) uses a neural network to extract key features from the LiDAR scan frames (11, 12, 13), in particular from the data representations (21, 22, 23) of said frames. Different feature extractors can be used, such as (but not limited to) point, segment, object-wise, 2D CNN backbones or 3D CNN backbones. The same Feature Extraction (30) weights (W) are applied to the three frames (shared weights). In the context of training a Neural Network (NN), this means that, in the training procedure of the NN, the backpropagation step takes into account the losses of both tasks together rather than as separate optimizations or separate NNs. The features may be extracted at different spatial resolution levels. In the case of the SMI frame t Feature Extractor (31), the resulting features will comprise an SMI frame t feature pyramid (311), and similarly, the features of the SMI frame t-r Feature Extractor (32) will comprise an SMI frame t-r feature pyramid (321). For both feature pyramid steps (311, 321), n levels (n >= 1) of extracted features may be laid out in order of increasing spatial resolution.
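The shared-weights idea can be sketched in Python/PyTorch as follows: a single backbone instance (and therefore a single set of weights W) is applied to all three frame representations and returns an n-level feature pyramid. The layer sizes and the input resolution are illustrative assumptions, not the claimed network:

import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    def __init__(self, in_channels: int = 1):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)            # highest spatial resolution
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)           # lowest spatial resolution
        return [f3, f2, f1]            # pyramid laid out in order of increasing resolution

backbone = SharedBackbone()            # one instance, hence one shared set of weights W
rep_t = torch.randn(1, 1, 256, 256)    # SMI frame t Data representation (21), assumed size
rep_t_r = torch.randn(1, 1, 256, 256)  # SMI frame t-r Data representation (22)
rep_od = torch.randn(1, 1, 256, 256)   # OD frame x Data representation (23)
pyramid_t, pyramid_t_r, pyramid_od = (backbone(r) for r in (rep_t, rep_t_r, rep_od))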

The Detection and Pose Regression step (40) of the training process (1) comprises a Confidence level step (41) that will produce an output that defines the confidence level of intersection between the two SMI frames (11, 12), i.e., the Intersection confidence (52), and the distribution of transformations between the two SMI frames, i.e., the Distribution of relative pose (51). This step will be repeated for each of the n levels of the Feature Pyramids (311, 321). The outputs from each level, with regard to the Distribution of relative pose (51) and Intersection confidence (52), are individually forwarded to a loss function calculation (61), comprised in the next step of the process (1), named multi-task loss function step (60). The outputs of the Confidence level step (41) are ensured by an Intersection classification step (42), a Pose Regression step (43) and an intersection classification (44), all comprised herein. In the Intersection classification step (42), the model predicts whether the two SMI frames (11, 12) have enough overlap to predict a transformation between them. The ground truth overlap class (overlap/no overlap) is determined by comparing the level of overlap between the two SMI scans to a pre-determined threshold value. The Pose Regression step (43) estimates the relative positions of the two SMI scans (11, 12) in the form of 3D translation and rotation. This step (43) will only be executed if the two SMI scans overlap, according to the ground truth data, as indicated by the intersection classification (44), i.e., if there is an intersection between the two currently loaded frames (t and t-r); this is part of the ground-truth data, which is only known in the training phase, not in the inference phase.

The Detection and Pose Regression step (40) also comprises an object detector (45) that predicts the presence, class, and oriented bounding boxes of objects of interest in the scene of the OD frame (13) from the features extracted in the OD frame x Feature Extractor (33).

As briefly addressed, the subsequent step is the Output data step (50), which is comprised of the Distribution of relative pose (51) and the Intersection confidence (52) information of the two SMI scans (11, 12), and estimated object classes, bounding boxes and orientations (53) information for each object detected in the OD scan (13). The distribution of relative poses (51) is the output of the Pose Regression step (43) and represents a set of predicted transformations between the two SMI frames (11, 12). The intersection confidence (52) is the output of the Intersection classification step (42) and represents the predicted confidence in the two SMI frames (11, 12) overlapping. The estimated object class, bounding box and orientation (53) represents the output of the object detection block (45) in the form of a set of predicted oriented bounding boxes and object classifications, one for each predicted object in the scene of the OD frame (13).
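For illustration, the heads of the Detection and Pose Regression step (40) could be sketched as below, with an intersection classifier (42), a pose regressor (43) and an object detector (45); the feature dimensions, the use of pooled per-frame feature vectors and the pose parameterisation are assumptions made for this sketch only:

import torch
import torch.nn as nn

class SMIHeads(nn.Module):
    """Pairs the features of two SMI frames and outputs (52) and (51); sketch only."""
    def __init__(self, feat_dim: int = 128, num_pose_modes: int = 8):
        super().__init__()
        self.intersect = nn.Linear(2 * feat_dim, 2)                 # overlap / no overlap logits
        self.pose = nn.Linear(2 * feat_dim, num_pose_modes * 7)     # modes of (tx, ty, tz, qw, qx, qy, qz)
        self.num_pose_modes = num_pose_modes

    def forward(self, feat_t, feat_t_r):
        # feat_t, feat_t_r: assumed globally pooled per-frame feature vectors of size feat_dim.
        joint = torch.cat([feat_t, feat_t_r], dim=-1)
        intersection_logits = self.intersect(joint)                 # (52) intersection confidence
        pose_dist = self.pose(joint).view(-1, self.num_pose_modes, 7)  # (51) relative-pose distribution
        return intersection_logits, pose_dist

class DetectionHead(nn.Module):
    """Dense OD head: per-cell class scores plus (x, y, z, l, w, h, yaw) box parameters; sketch only."""
    def __init__(self, in_channels: int = 128, num_classes: int = 3):
        super().__init__()
        self.head = nn.Conv2d(in_channels, num_classes + 7, kernel_size=1)

    def forward(self, feature_map):
        return self.head(feature_map)                               # (53) class, box and orientation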

Finally, the training process (1) is concluded in the multi-task loss function step (60), which includes the calculation of a loss value based on: i) the distribution of relative pose (51) and the confidence level of intersection (52) between the two SMI frames (11, 12), and ii) the estimated object class, bounding box and orientation (53) for the detected objects.

Considering the detailed computational procedures, the aim of the neural network is to minimize its error when making predictions. The loss value calculation (61) is obtained through computational procedures and reflects the difference between the predicted and the ground truth values (i.e. a numerical representation of how far off the model's prediction is from the truth). During the training process (1), this loss value (61) is backpropagated through the neural network to determine the updates to be applied to the weights (W) of the neural network.
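A hedged sketch of how such a multi-task loss value (61) could be computed is given below; the individual loss functions, the task weights and the masking of the pose term by the ground truth overlap class are assumptions made for illustration, not the loss claimed by the present application:

import torch.nn.functional as F

def multi_task_loss(intersection_logits, gt_intersects,       # (52) logits vs. ground truth overlap class (long tensor, B)
                    pred_rel_pose, gt_rel_pose,                # (51) predicted vs. ground truth transform (B x 7)
                    od_class_logits, gt_classes,               # (53) object class logits vs. labels
                    pred_boxes, gt_boxes,                      # (53) box parameters vs. labelled boxes
                    w_smi: float = 1.0, w_od: float = 1.0):
    loss_intersect = F.cross_entropy(intersection_logits, gt_intersects)
    # The pose term only contributes where the ground truth says the frames intersect,
    # mirroring the gating by the intersection classification (44) during training.
    mask = gt_intersects.float().unsqueeze(-1)
    loss_pose = (F.smooth_l1_loss(pred_rel_pose, gt_rel_pose, reduction="none") * mask).mean()
    loss_cls = F.cross_entropy(od_class_logits, gt_classes)
    loss_box = F.smooth_l1_loss(pred_boxes, gt_boxes)
    return w_smi * (loss_intersect + loss_pose) + w_od * (loss_cls + loss_box)

# Typical update: gradients from loss.backward() flow through the shared weights W,
# and an optimizer such as torch.optim.Adam then applies the weight update.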

The second stage (2) of the method consists in the execution of an inference process phase that makes use of the trained model to perform real-time OD and SLAM from a stream of LiDAR frames (11, 12, 13), being, therefore, responsible for classifying data acquired from the sensor setup to infer a result, i.e., it leverages the trained NN model to perform the processes of OD and SLAM in real-time by analysing only incoming unlabelled sensor frames. The inference process (2) comprises a current frame processing stage (201) and a set of n previous frame processing stages (202, 203, 204). The current frame processing stage (201) is related to the data processing applied to the latest frame collected by the sensor, which comes from a stream of unlabeled LiDAR frames. Therefore, this current frame processing stage (201) comprises a sequence where the frame t (2010) is exposed to a Data representation method (2011), with further processing in a Feature Extractor (2012). The extracted features of the preceding step will comprise the frame t feature pyramid (2013) and be stored (2014) in system memory. Afterwards, an object detector (2015) will predict the presence, class, and oriented bounding boxes of objects of interest (2016) in the scene of the OD frame. Finally, the current frame processing stage (201) will also receive two outputs from the SLAM backend block (210), which comprise the updated map and the history of the preceding positions of the vehicle (2017).

Still in the inference process phase (2), the set of n previous frame processing stages (202, 203, 204) comprises a frame xt-i processing stage (202), a frame xt-2 processing stage (203) and a frame xt-n processing stage (204). The processing stages are all similar in the sense that they comprise the same steps, but relate to different frames. For the particular example of the previous frame processing stage xt-i (202), an estimated/predicted distribution of relative poses (2025) between the frame xt-i and the current frame xt (202) is observed. This prediction (2025) is the output from the SLAM backend (210). This predicted distribution of relative poses (2025), regarding the frame xt-i, will feed the update features step (2026), updating the saved features for the given frame. Higher resolution levels of features are deleted based on the likelihood of the frame being required in the future (i.e. proximity to the current pose of the vehicle) and the available memory, among other criteria.
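An illustrative feature store for the previous-frame stages (202, 203, 204) is sketched below; the class name, the pruning criterion, the distance threshold and the use of 4x4 pose matrices are assumptions made for this sketch:

import numpy as np

class FrameFeatureStore:
    """Keeps the feature pyramid and pose of each past frame; prunes detail for distant frames (sketch)."""
    def __init__(self, prune_distance: float = 100.0):
        self.frames = {}                          # frame id -> {"pose": 4x4 transform, "pyramid": [levels]}
        self.prune_distance = prune_distance

    def save(self, frame_id, pose, pyramid):
        self.frames[frame_id] = {"pose": pose, "pyramid": list(pyramid)}

    def retrieve(self, frame_id):
        return self.frames[frame_id]["pyramid"]   # (2021) load previously saved features

    def update(self, frame_id, new_pose, current_pose):
        """Update the stored pose (2026) and drop higher-resolution levels for distant frames."""
        entry = self.frames[frame_id]
        entry["pose"] = new_pose
        distance = np.linalg.norm(new_pose[:3, 3] - current_pose[:3, 3])
        if distance > self.prune_distance and len(entry["pyramid"]) > 1:
            # Levels are assumed ordered from coarsest to finest; keep only the coarsest one.
            entry["pyramid"] = entry["pyramid"][:1]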

A pose-only intersection likelihood (2020) will be computed based on the predicted relative poses (2025), i.e., between the current and previous frames. Pose-only intersection is defined as the amount of overlap between two circles centered on the LiDAR position in each frame, with a radius equal to a predetermined maximum range of the LiDAR sensor. A minimum overlap level is considered to detect an intersection and defines a maximum distance between intersecting frames. The pose-only likelihood of intersection between frames is the likelihood of the overlap between the frames being above this minimum level, i.e., of the distance between the frames being below the corresponding maximum distance. Frames with low likelihood values are excluded from further processing in the current iteration (20201).
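The pose-only gate can be illustrated with the closed-form overlap of two equal circles whose radius is the LiDAR maximum range; the exact likelihood model used by the method is not detailed here, so the formula, the maximum range and the minimum overlap value below are assumptions for the sketch:

import numpy as np

def circle_overlap_ratio(center_a, center_b, max_range: float) -> float:
    """Overlap area of two equal circles, normalised by the area of one circle."""
    d = float(np.linalg.norm(np.asarray(center_a) - np.asarray(center_b)))
    if d >= 2.0 * max_range:
        return 0.0
    r = max_range
    # Lens area of two intersecting circles of equal radius r at centre distance d.
    lens_area = 2.0 * r**2 * np.arccos(d / (2.0 * r)) - 0.5 * d * np.sqrt(4.0 * r**2 - d**2)
    return lens_area / (np.pi * r**2)

def likely_to_intersect(center_a, center_b, max_range=100.0, min_overlap=0.2) -> bool:
    """Pose-only gate (2020/2030/2040): True if the two frames overlap enough (assumed thresholds)."""
    return circle_overlap_ratio(center_a, center_b, max_range) >= min_overlap

print(likely_to_intersect([0.0, 0.0], [30.0, 0.0]))   # close frames  -> True  (20202)
print(likely_to_intersect([0.0, 0.0], [250.0, 0.0]))  # distant frames -> False (20201)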

Frames with high likelihood values will receive further processing, and in the retrieve features stage (2021), the previously saved features for the given frame will be loaded from system memory. The saved features (2022) may have different resolution levels. The intersection classification and pose step (2023) computes whether the pair of frames, composed of frame xt (2010) and frame xt-i, intersect and, if so, computes the distribution of relative poses using the previously trained model (1). This step takes as input the saved features from each of the frames in the pair. If likely to intersect (20232), the intersection confidence and the distribution of relative poses (2024) will then be forwarded to the SLAM backend (210). The SLAM backend (210) uses these inputs to optimize the pose graph. The same stages are executed in parallel but with regard to frame xt-2 (203) up to frame xt-n (204).
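Finally, the pairwise processing of the stored frames against the new frame can be sketched as below; smi_head, store and backend stand for the hypothetical helpers of the earlier sketches (smi_head is assumed to return a scalar confidence in [0, 1]), and the thread pool merely illustrates that the per-frame stages (202, 203, 204) may run in parallel:

from concurrent.futures import ThreadPoolExecutor

def process_pair(frame_id, new_pyramid, store, smi_head, backend, threshold=0.5):
    """Pair one stored frame with the new frame and, if they intersect, feed the SLAM backend (sketch)."""
    old_pyramid = store.retrieve(frame_id)                            # (2021) retrieve features
    confidence, rel_pose_dist = smi_head(old_pyramid, new_pyramid)    # (2023) classification & pose
    if confidence >= threshold:                                       # (20232) likely to intersect
        backend.add_constraint(frame_id, confidence, rel_pose_dist)   # (2024) forwarded to backend (210)

def process_all_pairs(candidate_ids, new_pyramid, store, smi_head, backend):
    with ThreadPoolExecutor() as pool:
        for frame_id in candidate_ids:                                # stages 202 ... 204 in parallel
            pool.submit(process_pair, frame_id, new_pyramid, store, smi_head, backend)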