

Title:
MODELING CONSISTENCY IN MODALITIES OF DATA FOR SEMANTIC SEGMENTATION
Document Type and Number:
WIPO Patent Application WO/2024/044488
Kind Code:
A1
Abstract:
Techniques and systems are provided for training a machine learning (ML) model. A technique can include generating a first set of features for objects in images, predicting image feature labels for the first set of features, comparing the predicted image feature labels to ground truth image feature labels to evaluate a first loss function, performing a perspective transform on the first set of features to generate birds eye view (BEV) projected image features, combining the BEV projected image features and a first set of flattened features to generate combined image features, generating a segmented BEV map of the environment based on the combined image features, comparing the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function, and training the ML model for generation of segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

Inventors:
BORSE SHUBHANKAR MANGESH (US)
RAVI KUMAR VARUN (US)
UNGER DAVID (US)
YOGAMANI SENTHIL KUMAR (US)
PORIKLI FATIH MURAT (US)
Application Number:
PCT/US2023/072284
Publication Date:
February 29, 2024
Filing Date:
August 16, 2023
Assignee:
QUALCOMM INC (US)
International Classes:
G06V10/44; G06V10/764; G06V10/80; G06V10/82; G06V20/56; G06V20/58
Other References:
ZHIJIAN LIU ET AL: "BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 26 May 2022 (2022-05-26), XP091233652
NATAN OSKAR ET AL: "End-to-End Autonomous Driving With Semantic Depth Cloud Mapping and Multi-Agent", IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, IEEE, vol. 8, no. 1, 21 June 2022 (2022-06-21), pages 557 - 571, XP011932888, ISSN: 2379-8858, [retrieved on 20220621], DOI: 10.1109/TIV.2022.3185303
SHUBHANKAR BORSE ET AL: "X-Align++: cross-modal cross-view alignment for Bird's-eye-view segmentation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 June 2023 (2023-06-06), XP091532138
NOURELDIN HENDY ET AL: "FISHING Net: Future Inference of Semantic Heatmaps In Grids", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 17 June 2020 (2020-06-17), XP081697957
Attorney, Agent or Firm:
AUSTIN, Shelton W. (US)
Claims:
CLAIMS

WHAT IS CLAIMED IS:

1. An apparatus for training machine learning models, comprising: at least one memory comprising instructions; and at least one processor coupled to the at least one memory and configured to: obtain one or more images of an environment; generate, using a first machine learning model, a first set of features for one or more objects in the one or more images; predict, using a second machine learning model, image feature labels for the first set of features; compare the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; perform a perspective transform on the first set of features to generate a birds eye view (BEV) projected image features; combine the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; generate, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; compare the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and train the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

2. The apparatus of claim 1, wherein the at least one processor is further configured to: obtain the first three-dimensional representation of the environment; generate a second set of features for one or more objects in the first three-dimensional representation; and flatten the second set of features from three dimensions to two dimensions to generate the first set of flattened features.

3. The apparatus of claim 1, wherein the first three-dimensional representation of the environment comprises one of a radar point cloud or a lidar point cloud.

4. The apparatus of claim 1, wherein the first three-dimensional representation of the environment comprises a lidar point cloud and wherein the at least one processor is further configured to: predict a predicted first set of flattened features based on the BEV projected image features; compare the predicted first set of flattened features to the first set of flattened features to evaluate a third loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, and the evaluated third loss function.

5. The apparatus of claim 4, wherein the at least one processor is further configured to: obtain a radar point cloud representation of the environment; generate a third set of features for one or more objects in the radar point cloud representation; flatten the third set of features from three dimensions to two dimensions to generate a second set of flattened features; and generate the segmented BEV map of the environment based on the combined BEV projected image features, the first set of flattened features, and the second set of flattened features.

6. The apparatus of claim 5, wherein the at least one processor is further configured to: predict a predicted second set of flattened features based on the BEV projected image features; compare the predicted second set of flattened features to the second set of flattened features to evaluate a fourth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, and the evaluated fourth loss function.

7. The apparatus of claim 6, wherein the at least one processor is further configured to: perform an additional perspective transform on the predicted image feature labels to generate a predicted segmented BEV map; compare the predicted segmented BEV map to the generated segmented BEV map to evaluate a fifth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

8. The apparatus of claim 6, wherein the at least one processor is further configured to: perform an additional perspective transform on the generated segmented BEV map to generate an additional predicted image feature labels; compare the predicted image feature labels to the additional predicted image feature labels to evaluate a fifth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

9. The apparatus of claim 1, wherein the at least one processor is configured to: predict, using the second machine learning model that is separate from the first machine learning model, feature labels for the ground truth image feature labels.

10. The apparatus of claim 1, wherein the at least one processor is configured to train the first machine learning model at least in part by back propagating the evaluated first loss function and the evaluated second loss function.

11. A method for training machine learning models, comprising: obtaining one or more images of an environment; generating, using a first machine learning model, a first set of features for one or more objects in the one or more images; predicting, using a second machine learning model, image feature labels for the first set of features; comparing the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; performing a perspective transform on the first set of features to generate a birds eye view (BEV) projected image features; combining the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; generating, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; comparing the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and training the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

12. The method of claim 11, further comprising: obtaining the first three-dimensional representation of the environment; generating a second set of features for one or more objects in the first three-dimensional representation; and flattening the second set of features from three dimensions to two dimensions to generate the first set of flattened features.

13. The method of claim 11, wherein the first three-dimensional representation of the environment comprises one of a radar point cloud or a lidar point cloud.

14. The method of claim 11, wherein the first three-dimensional representation of the environment comprises a lidar point cloud and further comprising: predicting a predicted first set of flattened features based on the BEV projected image features; comparing the predicted first set of flattened features to the first set of flattened features to evaluate a third loss function; and training the first machine learning model based on the evaluated first loss function, the evaluated second loss function, and the evaluated third loss function.

15. The method of claim 14, further comprising: obtaining a radar point cloud representation of the environment; generating a third set of features for one or more objects in the radar point cloud representation; flattening the third set of features from three dimensions to two dimensions to generate a second set of flattened features; and generating the segmented BEV map of the environment based on the combined BEV projected image features, the first set of flattened features, and the second set of flattened features.

16. The method of claim 15, further comprising: predicting a predicted second set of flattened features based on the BEV projected image features; comparing the predicted second set of flattened features to the second set of flattened features to evaluate a fourth loss function; and training the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, and the evaluated fourth loss function.

17. The method of claim 16, further comprising: performing an additional perspective transform on the predicted image feature labels to generate a predicted segmented BEV map; comparing the predicted segmented BEV map to the generated segmented BEV map to evaluate a fifth loss function; and training the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

18. The method of claim 16, further comprising: performing an additional perspective transform on the generated segmented BEV map to generate an additional predicted image feature labels; comparing the predicted image feature labels to the additional predicted image feature labels to evaluate a fifth loss function; and training the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

19. The method of claim 11, further comprising: predicting, using the second machine learning model that is separate from the first machine learning model, feature labels for the ground truth image feature labels.

20. The method of claim 11, further comprising training the first machine learning model at least in part by back propagating the evaluated first loss function and the evaluated second loss function.

21. A non-transitory computer-readable medium having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to: obtain one or more images of an environment; generate, using a first machine learning model, a first set of features for one or more objects in the one or more images; predict, using a second machine learning model, image feature labels for the first set of features; compare the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; perform a perspective transform on the first set of features to generate a birds eye view (BEV) projected image features; combine the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; generate, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; compare the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and train the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

22. The non-transitory computer-readable medium of claim 21, wherein the instructions cause the at least one processor to: obtain the first three-dimensional representation of the environment; generate a second set of features for one or more objects in the first three-dimensional representation; and flatten the second set of features from three dimensions to two dimensions to generate the first set of flattened features.

23. The non-transitory computer-readable medium of claim 21, wherein the first three-dimensional representation of the environment comprises one of a radar point cloud or a lidar point cloud.

24. The non-transitory computer-readable medium of claim 21, wherein the first three-dimensional representation of the environment comprises a lidar point cloud and wherein the instructions cause the at least one processor to: predict a predicted first set of flattened features based on the BEV projected image features; compare the predicted first set of flattened features to the first set of flattened features to evaluate a third loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, and the evaluated third loss function.

25. The non-transitory computer-readable medium of claim 24, wherein the instructions cause the at least one processor to: obtain a radar point cloud representation of the environment; generate a third set of features for one or more objects in the radar point cloud representation; flatten the third set of features from three dimensions to two dimensions to generate a second set of flattened features; and generate the segmented BEV map of the environment based on the combined BEV projected image features, the first set of flattened features, and the second set of flattened features.

26. The non-transitory computer-readable medium of claim 25, wherein the instructions cause the at least one processor to: predict a predicted second set of flattened features based on the BEV projected image features; compare the predicted second set of flattened features to the second set of flattened features to evaluate a fourth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, and the evaluated fourth loss function.

27. The non-transitory computer-readable medium of claim 26, wherein the instructions cause the at least one processor to: perform an additional perspective transform on the predicted image feature labels to generate a predicted segmented BEV map; compare the predicted segmented BEV map to the generated segmented BEV map to evaluate a fifth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

28. The non-transitory computer-readable medium of claim 26, wherein the instructions cause the at least one processor to: perform an additional perspective transform on the generated segmented BEV map to generate an additional predicted image feature labels; compare the predicted image feature labels to the additional predicted image feature labels to evaluate a fifth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

29. The non-transitory computer-readable medium of claim 21, wherein the instructions cause the at least one processor to: predict, using the second machine learning model that is separate from the first machine learning model, feature labels for the ground truth image feature labels.

30. The non-transitory computer-readable medium of claim 21, wherein the instructions cause the at least one processor to train the first machine learning model at least in part by back propagating the evaluated first loss function and the evaluated second loss function.

Description:
MODELING CONSISTENCY IN MODALITIES OF DATA FOR SEMANTIC SEGMENTATION

FIELD

[0001] The present disclosure generally relates to processing image data to perform semantic segmentation. For example, aspects of the present disclosure include systems and techniques for providing modeling consistency when performing semantic segmentation using different modalities of data (e.g., images or frames including pixels and birds-eye view images or frames).

BACKGROUND

[0002] Increasingly, devices or systems (e.g., vehicles, such as autonomous and semi-autonomous vehicles, drones or unmanned aerial vehicles (UAVs), mobile robots, mobile devices such as mobile phones, extended reality (XR) devices, and other suitable devices or systems) include multiple sensors to gather information about an environment, as well as processing systems to process the sensor information for various purposes, such as route planning, navigation, collision avoidance, etc. One example of a processing system is an Advanced Driver Assistance System (ADAS) for an autonomous or semi-autonomous vehicle. To enable analysis of the sensor data, the data from multiple sensors may be fused into a single view or model of an environment around the autonomous vehicle. One such view or model may be a birds-eye-view (BEV) (e.g., a top-down view) of the environment.

[0003] Generating a semantically segmented BEV may be useful for many applications and systems, including augmented reality (AR), virtual reality (VR), mixed reality (MR), robotic systems, manufacturing systems, quality assurance, automotive and aviation (e.g., manufacturing, autonomous driving or navigation, etc.), three-dimensional scene understanding, object grasping, object tracking, video analytics, security systems, among many others. For instance, the semantically segmented BEV may include an image where each pixel of the image is labeled according to an object category to predict objects in the view. The semantically segmented BEV can facilitate effective operation of various systems. In an illustrative example, an autonomous vehicle can identify shapes and locations of other vehicles based on a BEV representation of an environment around the autonomous vehicle to navigate through traffic.

SUMMARY

[0004] The following presents a simplified summary relating to one or more aspects disclosed herein. Thus, the following summary should not be considered an extensive overview relating to all contemplated aspects, nor should the following summary be considered to identify key or critical elements relating to all contemplated aspects or to delineate the scope associated with any particular aspect. Accordingly, the following summary presents certain concepts relating to one or more aspects relating to the mechanisms disclosed herein in a simplified form to precede the detailed description presented below.

[0005] Systems and techniques are described herein for training machine learning models. In one illustrative example, an apparatus for training machine learning models, comprising: at least one memory; and at least one processor coupled to the at least one memory and configured to: obtain one or more images of an environment; generate, using a first machine learning model, a first set of features for one or more objects in the one or more images; predict, using a second machine learning model, image feature labels for the first set of features; compare the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; perform a perspective transform on the first set of features to generate a birds eye view (BEV) projected image features; combine the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; generate, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; compare the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and train the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

[0006] In another example, a method for training machine learning models, comprising: obtaining one or more images of an environment; generating, using a first machine learning model, a first set of features for one or more objects in the one or more images; predicting, using a second machine learning model, image feature labels for the first set of features; comparing the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; performing a perspective transform on the first set of features to generate a birds eye view (BEV) projected image features; combining the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; generating, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; comparing the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and training the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

[0007] In another example, a non-transitory computer-readable medium having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to: obtain one or more images of an environment; generate, using a first machine learning model, a first set of features for one or more objects in the one or more images; predict, using a second machine learning model, image feature labels for the first set of features; compare the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; perform a perspective transform on the first set of features to generate a birds eye view (BEV) projected image features; combine the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; generate, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; compare the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and train the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

[0008] In another example, an apparatus for training machine learning models, the apparatus including: means for obtaining one or more images of an environment; means for generating, using a first machine learning model, a first set of features for one or more objects in the one or more images; means for predicting, using a second machine learning model, image feature labels for the first set of features; comparing the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; means for performing a perspective transform on the first set of features to generate a birds eye view (BEV) projected image features; means for combining the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; means for generating, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; means for comparing the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and means for training the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.
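As a minimal illustrative sketch of how the two loss functions in the examples above might be evaluated and combined, consider the following. The disclosure does not specify the form of either loss function or how they are combined; the per-pixel cross-entropy, the weighted sum, and the names cross_entropy, total_loss, w_pv, and w_bev are assumptions made here for illustration only.

```python
import math


def cross_entropy(probs, labels):
    """Mean per-pixel cross-entropy (assumed loss form, not from the disclosure).

    probs[i][j] is a list of predicted class probabilities for pixel (i, j);
    labels[i][j] is the ground truth class index for that pixel.
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for pixel_probs, label in zip(prob_row, label_row):
            # Clamp to avoid log(0) when a class is predicted with zero probability.
            total += -math.log(max(pixel_probs[label], 1e-12))
            count += 1
    return total / count


def total_loss(pv_probs, pv_gt, bev_probs, bev_gt, w_pv=1.0, w_bev=1.0):
    """Combine the first (perspective view) and second (BEV) losses.

    The weighted sum and the weights w_pv / w_bev are hypothetical choices.
    """
    first_loss = cross_entropy(pv_probs, pv_gt)     # predicted image feature labels vs. ground truth
    second_loss = cross_entropy(bev_probs, bev_gt)  # segmented BEV map vs. ground truth BEV map
    return w_pv * first_loss + w_bev * second_loss


# Toy 1x2 maps with two classes per pixel.
gt = [[0, 1]]
perfect = [[[1.0, 0.0], [0.0, 1.0]]]
uniform = [[[0.5, 0.5], [0.5, 0.5]]]
print(total_loss(perfect, gt, perfect, gt))
print(total_loss(uniform, gt, uniform, gt))
```

In a training loop, a combined value of this kind would be back propagated to update the first machine learning model; the sketch only evaluates the losses and performs no parameter update.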

[0009] In some aspects, the apparatus is, is part of, and/or includes a vehicle or a computing device or component of a vehicle (e.g., an autonomous or semi-autonomous vehicle), a mobile device (e.g., a mobile telephone or so-called “smart phone” or other mobile device), a wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a server computer, or other device. In some aspects, the apparatus includes a camera or multiple cameras for capturing one or more images. In some aspects, the apparatus further includes a display for displaying one or more images, notifications, and/or other displayable data. In some aspects, the apparatuses described above can include one or more sensors (e.g., one or more inertial measurement units (IMUs), such as one or more gyrometers, one or more accelerometers, any combination thereof, and/or other sensor).

[0010] This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this patent, any or all drawings, and each claim.

[0011] The foregoing, together with other features and aspects, will become more apparent upon referring to the following specification, claims, and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] Illustrative aspects of the present application are described in detail below with reference to the following figures:

[0013] FIGs. 1A and 1B are block diagrams illustrating a vehicle suitable for implementing various aspects, in accordance with aspects of the present disclosure.

[0014] FIG. 1C is a block diagram illustrating components of a vehicle suitable for implementing various aspects, in accordance with aspects of the present disclosure;

[0015] FIG. 1D illustrates an example implementation of a system-on-a-chip (SOC), in accordance with some examples;

[0016] FIG. 2A is a component block diagram illustrating components of an example vehicle management system according to various aspects;

[0017] FIG. 2B is a component block diagram illustrating components of another example vehicle management system according to various aspects;

[0018] FIG. 3A - FIG. 4 are diagrams illustrating examples of neural networks, in accordance with some examples;

[0019] FIG. 5 illustrates a technique for generating segmented BEV, in accordance with aspects of the present disclosure;

[0020] FIG. 6 illustrates a technique for generating segmented BEV with enhanced data consistency, in accordance with aspects of the present disclosure;

[0021] FIG. 7 illustrates a technique for generating segmented BEV with supervised feature detection, in accordance with aspects of the present disclosure;

[0022] FIG. 8 is a flow diagram illustrating a process for training a ML algorithm for generating a segmented BEV map, in accordance with aspects of the present disclosure;

[0023] FIG. 9 illustrates an example computing device architecture of an example computing device which can implement the various techniques described herein.

DETAILED DESCRIPTION

[0024] Certain aspects of this disclosure are provided below. Some of these aspects may be applied independently and some of them may be applied in combination as would be apparent to those of skill in the art. In the following description, for the purposes of explanation, specific details are set forth in order to provide a thorough understanding of aspects of the application. However, it will be apparent that various aspects may be practiced without these specific details. The figures and description are not intended to be restrictive.

[0025] The ensuing description provides example aspects only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the example aspects will provide those skilled in the art with an enabling description for implementing an example aspect. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the application as set forth in the appended claims.

[0026] Systems and techniques are described for training a machine learning based algorithm for generating segmented BEV maps using images of the environment captured by a camera having a perspective view (e.g., a view forward of, to the side of, or behind a vehicle) of the environment combined with a three-dimensional representation of the environment provided, for example, by a lidar or radar. Often in such systems, a single loss function is evaluated based on the generated segmented BEV map. In accordance with aspects of the present disclosure, additional loss functions may be provided to help improve generation of the segmented BEV map during training. While various examples described herein are with respect to vehicles (e.g., a computing system or device of a vehicle), the systems and techniques described herein apply to any computing device that can perform segmentation of images or frames, such as extended reality (XR) devices (e.g., virtual reality (VR) devices, augmented reality (AR) devices, mixed reality (MR) devices, etc.), robotics devices or computing systems or devices thereof, aircraft or computing systems or devices thereof, maritime vessels or computing systems or devices thereof, and/or other devices. Additionally, example aspects related to the systems and techniques described herein are included in Appendix A provided herewith.

[0027] The systems and techniques described herein may be implemented by any type of system or device. One illustrative example of a system that can be used to implement the systems and techniques described herein is a vehicle (e.g., an autonomous or semi-autonomous vehicle) or a system or component (e.g., an ADAS or other system or component) of the vehicle. FIGS. 1A and 1B are diagrams illustrating an example vehicle 100 that may implement the systems and techniques described herein. With reference to FIGS. 1A and 1B, a vehicle 100 may include a control unit 140 and a plurality of sensors 102-138, including satellite geopositioning system receivers (e.g., sensors) 108, occupancy sensors 112, 116, 118, 126, 128, tire pressure sensors 114, 120, cameras 122, 136, microphones 124, 134, impact sensors 130, radar 132, and lidar 138. The plurality of sensors 102-138, disposed in or on the vehicle, may be used for various purposes, such as autonomous and semi-autonomous navigation and control, crash avoidance, position determination, etc., as well as to provide sensor data regarding objects and people in or on the vehicle 100. The sensors 102-138 may include one or more of a wide variety of sensors capable of detecting a variety of information useful for navigation and collision avoidance. Each of the sensors 102-138 may be in wired or wireless communication with a control unit 140, as well as with each other. In particular, the sensors may include one or more cameras 122, 136 or other optical sensors or photo optic sensors. The sensors may further include other types of object detection and ranging sensors, such as radar 132, lidar 138, IR sensors, and ultrasonic sensors. The sensors may further include tire pressure sensors 114, 120, humidity sensors, temperature sensors, satellite geopositioning sensors 108, accelerometers, vibration sensors, gyroscopes, gravimeters, impact sensors 130, force meters, stress meters, strain sensors, fluid sensors, chemical sensors, gas content analyzers, pH sensors, radiation sensors, Geiger counters, neutron detectors, biological material sensors, microphones 124, 134, occupancy sensors 112, 116, 118, 126, 128, proximity sensors, and other sensors.
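As an illustration of the additional loss functions introduced in paragraph [0026], the following minimal sketch evaluates a first loss on predicted image feature labels and a second loss on the predicted segmented BEV map, then combines them into one training objective. The use of cross-entropy, the weighted sum, and the names `cross_entropy`, `combined_loss`, `image_weight`, and `bev_weight` are assumptions made for illustration; this passage does not specify the loss type or how the losses are combined.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class (assumed loss type)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total += -math.log(max(p[y], eps))
    return total / len(labels)


def combined_loss(image_probs, image_labels, bev_probs, bev_labels,
                  image_weight=1.0, bev_weight=1.0):
    """Evaluate both losses and return (total, first_loss, second_loss).

    first_loss:  predicted image feature labels vs. ground truth labels.
    second_loss: predicted segmented BEV map vs. ground truth BEV map.
    The weighted sum is a hypothetical way to train on both losses.
    """
    first_loss = cross_entropy(image_probs, image_labels)
    second_loss = cross_entropy(bev_probs, bev_labels)
    total = image_weight * first_loss + bev_weight * second_loss
    return total, first_loss, second_loss


# One perspective-view sample and one BEV cell, two classes each.
total, first_loss, second_loss = combined_loss(
    image_probs=[[0.5, 0.5]], image_labels=[0],
    bev_probs=[[0.25, 0.75]], bev_labels=[1],
)
```

In a real training loop the total would be back-propagated through the model; that step is omitted here because it depends on the chosen framework.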

[0028] The vehicle control unit 140 may be configured with processor-executable instructions to perform various aspects using information received from various sensors, particularly the cameras 122, 136, radar 132, and lidar 138. In some aspects, the control unit 140 may supplement the processing of camera images using distance and relative position information (e.g., relative bearing angle) that may be obtained from radar 132 and/or lidar 138 sensors. The control unit 140 may further be configured to control steering, braking, and speed of the vehicle 100 when operating in an autonomous or semi-autonomous mode using information regarding other vehicles determined using various aspects.

[0029] FIG. 1C is a component block diagram illustrating a system 150 of components and support systems suitable for implementing various aspects. With reference to FIGS. 1A, 1B, and 1C, a vehicle 100 may include a control unit 140, which may include various circuits and devices used to control the operation of the vehicle 100. In the example illustrated in FIG. 1C, the control unit 140 includes a processor 164, memory 166, an input module 168, an output module 170 and a radio module 172. The control unit 140 may be coupled to and configured to control drive control components 154, navigation components 156, and one or more sensors 158 of the vehicle 100.

[0030] The control unit 140 may include a processor 164 that may be configured with processor-executable instructions to control maneuvering, navigation, and/or other operations of the vehicle 100, including operations of various aspects. The processor 164 may be coupled to the memory 166. The control unit 140 may include the input module 168, the output module 170, and the radio module 172.

[0031] The radio module 172 may be configured for wireless communication. The radio module 172 may exchange signals 182 (e.g., command signals for controlling maneuvering, signals from navigation facilities, etc.) with a network node 180, and may provide the signals 182 to the processor 164 and/or the navigation components 156. In some aspects, the radio module 172 may enable the vehicle 100 to communicate with a wireless communication device 190 through a wireless communication link 192. The wireless communication link 192 may be a bidirectional or unidirectional communication link and may use one or more communication protocols.

[0032] The input module 168 may receive sensor data from one or more vehicle sensors 158 as well as electronic signals from other components, including the drive control components 154 and the navigation components 156. The output module 170 may be used to communicate with or activate various components of the vehicle 100, including the drive control components 154, the navigation components 156, and the sensor(s) 158.

[0033] The control unit 140 may be coupled to the drive control components 154 to control physical elements of the vehicle 100 related to maneuvering and navigation of the vehicle, such as the engine, motors, throttles, steering elements, other control elements, braking or deceleration elements, and the like. The drive control components 154 may also include components that control other devices of the vehicle, including environmental controls (e.g., air conditioning and heating), external and/or interior lighting, interior and/or exterior informational displays (which may include a display screen or other devices to display information), safety devices (e.g., haptic devices, audible alarms, etc.), and other similar devices.

[0034] The control unit 140 may be coupled to the navigation components 156 and may receive data from the navigation components 156. The control unit 140 may be configured to use such data to determine the present position and orientation of the vehicle 100, as well as an appropriate course toward a destination. In various aspects, the navigation components 156 may include or be coupled to a global navigation satellite system (GNSS) receiver system (e.g., one or more Global Positioning System (GPS) receivers) enabling the vehicle 100 to determine its current position using GNSS signals. Alternatively, or in addition, the navigation components 156 may include radio navigation receivers for receiving navigation beacons or other signals from radio nodes, such as Wi-Fi access points, cellular network sites, radio stations, remote computing devices, other vehicles, etc. Through control of the drive control components 154, the processor 164 may control the vehicle 100 to navigate and maneuver. The processor 164 and/or the navigation components 156 may be configured to communicate with a server 184 on a network 186 (e.g., the Internet) using wireless signals 182 exchanged over a cellular data network via network node 180 to receive commands to control maneuvering, receive data useful in navigation, provide real-time position reports, and assess other data.

[0035] The control unit 140 may be coupled to one or more sensors 158. The sensor(s) 158 may include the sensors 102-138 as described, and may be configured to provide a variety of data to the processor 164.

[0036] While the control unit 140 is described as including separate components, in some aspects some or all of the components (e.g., the processor 164, the memory 166, the input module 168, the output module 170, and the radio module 172) may be integrated in a single device or module, such as a system-on-chip (SOC) processing device. Such an SOC processing device may be configured for use in vehicles and be configured, such as with processor-executable instructions executing in the processor 164, to perform operations of various aspects when installed into a vehicle.

[0037] FIG. 1D illustrates an example implementation of a system-on-a-chip (SOC) 105, which may include a central processing unit (CPU) 110 or a multi-core CPU, configured to perform one or more of the functions described herein. In some cases, the SOC 105 may be based on an ARM instruction set. In some cases, CPU 110 may be similar to processor 164. Parameters or variables (e.g., neural signals and synaptic weights), system parameters associated with a computational device (e.g., neural network with weights), delays, frequency bin information, task information, among other information may be stored in a memory block associated with a neural processing unit (NPU) 125, in a memory block associated with a CPU 110, in a memory block associated with a graphics processing unit (GPU) 115, in a memory block associated with a digital signal processor (DSP) 106, in a memory block 185, and/or may be distributed across multiple blocks. Instructions executed at the CPU 110 may be loaded from a program memory associated with the CPU 110 or may be loaded from a memory block 185.

[0038] The SOC 105 may also include additional processing blocks tailored to specific functions, such as a GPU 115, a DSP 106, a connectivity block 135, which may include fifth generation (5G) connectivity, fourth generation long term evolution (4G LTE) connectivity, Wi-Fi connectivity, USB connectivity, Bluetooth connectivity, and the like, and a multimedia processor 145 that may, for example, detect and recognize gestures. In one implementation, the NPU is implemented in the CPU 110, DSP 106, and/or GPU 115. The SOC 105 may also include a sensor processor 155, image signal processors (ISPs) 175, and/or navigation module 195, which may include a global positioning system. In some cases, the navigation module 195 may be similar to navigation components 156 and sensor processor 155 may accept input from, for example, one or more sensors 158. In some cases, the connectivity block 135 may be similar to the radio module 172.

[0039] FIG. 2A illustrates an example of vehicle applications, subsystems, computational elements, or units within a vehicle management system 200, which may be utilized within a vehicle, such as vehicle 100 of FIG. 1A. With reference to FIGS. 1A-2A, in some aspects, the various vehicle applications, computational elements, or units within vehicle management system 200 may be implemented within a system of interconnected computing devices (i.e., subsystems) that communicate data and commands to each other. In other aspects, the vehicle management system 200 may be implemented as a plurality of vehicle applications executing within a single computing device, such as separate threads, processes, algorithms or computational elements. However, the use of the term vehicle applications in describing various aspects is not intended to imply or require that the corresponding functionality is implemented within a single autonomous (or semi-autonomous) vehicle management system computing device, although that is a potential implementation aspect. Rather, the use of the term vehicle applications is intended to encompass subsystems with independent processors, computational elements (e.g., threads, algorithms, subroutines, etc.) running in one or more computing devices, and combinations of subsystems and computational elements.

[0040] In various aspects, the vehicle applications executing in a vehicle management system 200 may include (but are not limited to) a radar perception vehicle application 202, a camera perception vehicle application 204, a positioning engine vehicle application 206, a map fusion and arbitration vehicle application 208, a route planning vehicle application 210, a sensor fusion and road world model (RWM) management vehicle application 212, a motion planning and control vehicle application 214, and a behavioral planning and prediction vehicle application 216. The vehicle applications 202-216 are merely examples of some vehicle applications in one example configuration of the vehicle management system 200. In other configurations consistent with various aspects, other vehicle applications may be included, such as additional vehicle applications for other perception sensors (e.g., LIDAR perception layer, etc.), additional vehicle applications for planning and/or control, additional vehicle applications for modeling, etc., and/or certain of the vehicle applications 202-216 may be excluded from the vehicle management system 200. Each of the vehicle applications 202-216 may exchange data, computational results and commands.

[0041] The vehicle management system 200 may receive and process data from sensors (e.g., radar, lidar, cameras, inertial measurement units (IMU) etc.), navigation systems (e.g., GPS receivers, IMUs, etc.), vehicle networks (e.g., Controller Area Network (CAN) bus), and databases in memory (e.g., digital map data). The vehicle management system 200 may output vehicle control commands or signals to the drive by wire (DBW) system/control unit 220, which is a system, subsystem or computing device that interfaces directly with vehicle steering, throttle and brake controls. The configuration of the vehicle management system 200 and DBW system/control unit 220 illustrated in FIG. 2A is merely an example configuration and other configurations of a vehicle management system and other vehicle components may be used in the various aspects. As an example, the configuration of the vehicle management system 200 and DBW system/control unit 220 illustrated in FIG. 2A may be used in a vehicle configured for autonomous or semi-autonomous operation while a different configuration may be used in a non-autonomous vehicle.

[0042] The radar perception vehicle application 202 may receive data from one or more detection and ranging sensors, such as radar (e.g., 132) and/or lidar (e.g., 138), and process the data to recognize and determine locations of other vehicles and objects within a vicinity of the vehicle 100. The radar perception vehicle application 202 may include use of neural network processing and artificial intelligence methods to recognize objects and vehicles, and pass such information on to the sensor fusion and RWM management vehicle application 212.

[0043] The camera perception vehicle application 204 may receive data from one or more cameras, such as cameras (e.g., 122, 136), and process the data to recognize and determine locations of other vehicles and objects within a vicinity of the vehicle 100. The camera perception vehicle application 204 may include use of neural network processing and artificial intelligence methods to recognize objects and vehicles, and pass such information on to the sensor fusion and RWM management vehicle application 212.

[0044] The positioning engine vehicle application 206 may receive data from various sensors and process the data to determine a position of the vehicle 100. The various sensors may include, but are not limited to, a GPS sensor, an IMU, and/or other sensors connected via a CAN bus. The positioning engine vehicle application 206 may also utilize inputs from one or more cameras, such as cameras (e.g., 122, 136) and/or any other available sensor, such as radars, LIDARs, etc.

[0045] The map fusion and arbitration vehicle application 208 may access data within a high-definition (HD) map database, receive output from the positioning engine vehicle application 206, and process the data to further determine the position of the vehicle 100 within the map, such as location within a lane of traffic, position within a street map, etc. The HD map database may be stored in a memory (e.g., memory 166). For example, the map fusion and arbitration vehicle application 208 may convert latitude and longitude information from GPS into locations within a surface map of roads contained in the HD map database. GPS position fixes include errors, so the map fusion and arbitration vehicle application 208 may function to determine a best guess location of the vehicle 100 within a roadway based upon an arbitration between the GPS coordinates and the HD map data. For example, while GPS coordinates may place the vehicle 100 near the middle of a two-lane road in the HD map, the map fusion and arbitration vehicle application 208 may determine from the direction of travel that the vehicle 100 is most likely aligned with the travel lane consistent with the direction of travel. The map fusion and arbitration vehicle application 208 may pass map-based location information to the sensor fusion and RWM management vehicle application 212.
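The two-lane arbitration example above can be sketched as follows. This is a minimal illustration only: the `Lane` record, the `arbitrate_lane` function, the lateral-offset representation, and the 45-degree heading tolerance are hypothetical and are not taken from the disclosure. The sketch first keeps lanes whose traffic direction is consistent with the direction of travel, then picks the lane whose center is closest to the (error-prone) GPS fix.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    lane_id: str
    center_offset_m: float  # lateral offset of lane center from road centerline
    heading_deg: float      # direction of traffic flow in this lane


def heading_difference_deg(a, b):
    """Smallest absolute angle between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def arbitrate_lane(lanes, gps_offset_m, travel_heading_deg,
                   max_heading_error_deg=45.0):
    """Pick the most likely lane from HD map lanes and a noisy GPS fix."""
    # Keep only lanes consistent with the vehicle's direction of travel.
    candidates = [
        lane for lane in lanes
        if heading_difference_deg(lane.heading_deg, travel_heading_deg)
        <= max_heading_error_deg
    ]
    if not candidates:  # fall back to position only if nothing matches
        candidates = list(lanes)
    # Among the remaining lanes, choose the one nearest the GPS position.
    return min(candidates, key=lambda lane: abs(lane.center_offset_m - gps_offset_m))


lanes = [Lane("northbound", 1.8, 0.0), Lane("southbound", -1.8, 180.0)]
# GPS places the vehicle near the middle of the road, slightly on the wrong side.
best = arbitrate_lane(lanes, gps_offset_m=-0.1, travel_heading_deg=5.0)
```

Here the GPS offset alone would favor the southbound lane, but the direction of travel resolves the ambiguity in favor of the northbound lane.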

[0046] The route planning vehicle application 210 may utilize the HD map, as well as inputs from an operator or dispatcher to plan a route to be followed by the vehicle 100 to a particular destination. The route planning vehicle application 210 may pass map-based location information to the sensor fusion and RWM management vehicle application 212. However, the use of a prior map by other vehicle applications, such as the sensor fusion and RWM management vehicle application 212, etc., is not required. For example, other stacks may operate and/or control the vehicle based on perceptual data alone without a provided map, constructing lanes, boundaries, and the notion of a local map as perceptual data is received.

[0047] The sensor fusion and RWM management vehicle application 212 may receive data and outputs produced by one or more of the radar perception vehicle application 202, camera perception vehicle application 204, map fusion and arbitration vehicle application 208, and route planning vehicle application 210, and use some or all of such inputs to estimate or refine the location and state of the vehicle 100 in relation to the road, other vehicles on the road, and other objects within a vicinity of the vehicle 100. For example, the sensor fusion and RWM management vehicle application 212 may combine imagery data from the camera perception vehicle application 204 with arbitrated map location information from the map fusion and arbitration vehicle application 208 to refine the determined position of the vehicle within a lane of traffic. As another example, the sensor fusion and RWM management vehicle application 212 may combine object recognition and imagery data from the camera perception vehicle application 204 with object detection and ranging data from the radar perception vehicle application 202 to determine and refine the relative position of other vehicles and objects in the vicinity of the vehicle. As another example, the sensor fusion and RWM management vehicle application 212 may receive information from vehicle-to-vehicle (V2V) communications (such as via the CAN bus) regarding other vehicle positions and directions of travel and combine that information with information from the radar perception vehicle application 202 and the camera perception vehicle application 204 to refine the locations and motions of other vehicles. 
The sensor fusion and RWM management vehicle application 212 may output refined location and state information of the vehicle 100, as well as refined location and state information of other vehicles and objects in the vicinity of the vehicle, to the motion planning and control vehicle application 214 and/or the behavior planning and prediction vehicle application 216.

[0048] As a further example, the sensor fusion and RWM management vehicle application 212 may use dynamic traffic control instructions directing the vehicle 100 to change speed, lane, direction of travel, or other navigational element(s), and combine that information with other received information to determine refined location and state information. The sensor fusion and RWM management vehicle application 212 may output the refined location and state information of the vehicle 100, as well as refined location and state information of other vehicles and objects in the vicinity of the vehicle 100, to the motion planning and control vehicle application 214, the behavior planning and prediction vehicle application 216 and/or devices remote from the vehicle 100, such as a data server, other vehicles, etc., via wireless communications, such as through C-V2X connections, other wireless connections, etc.

[0049] As a still further example, the sensor fusion and RWM management vehicle application 212 may monitor perception data from various sensors, such as perception data from a radar perception vehicle application 202, camera perception vehicle application 204, other perception vehicle application, etc., and/or data from one or more sensors themselves to analyze conditions in the vehicle sensor data. The sensor fusion and RWM management vehicle application 212 may be configured to detect conditions in the sensor data, such as sensor measurements being at, above, or below a threshold, certain types of sensor measurements occurring, etc., and may output the sensor data as part of the refined location and state information of the vehicle 100 provided to the behavior planning and prediction vehicle application 216 and/or devices remote from the vehicle 100, such as a data server, other vehicles, etc., via wireless communications, such as through C-V2X connections, other wireless connections, etc.

[0050] The refined location and state information may include vehicle descriptors associated with the vehicle 100 and the vehicle owner and/or operator, such as: vehicle specifications (e.g., size, weight, color, on board sensor types, etc.); vehicle position, speed, acceleration, direction of travel, attitude, orientation, destination, fuel/power level(s), and other state information; vehicle emergency status (e.g., is the vehicle an emergency vehicle or private individual in an emergency); vehicle restrictions (e.g., heavy/wide load, turning restrictions, high occupancy vehicle (HOV) authorization, etc.); capabilities (e.g., all-wheel drive, four-wheel drive, snow tires, chains, connection types supported, on board sensor operating statuses, on board sensor resolution levels, etc.) of the vehicle; equipment problems (e.g., low tire pressure, weak brakes, sensor outages, etc.); owner/operator travel preferences (e.g., preferred lane, roads, routes, and/or destinations, preference to avoid tolls or highways, preference for the fastest route, etc.); permissions to provide sensor data to a data agency server (e.g., 184); and/or owner/operator identification information.

[0051] The behavioral planning and prediction vehicle application 216 of the autonomous vehicle system 200 may use the refined location and state information of the vehicle 100 and location and state information of other vehicles and objects output from the sensor fusion and RWM management vehicle application 212 to predict future behaviors of other vehicles and/or objects. For example, the behavioral planning and prediction vehicle application 216 may use such information to predict future relative positions of other vehicles in the vicinity of the vehicle based on own vehicle position and velocity and other vehicle positions and velocity. Such predictions may take into account information from the HD map and route planning to anticipate changes in relative vehicle positions as host and other vehicles follow the roadway. The behavioral planning and prediction vehicle application 216 may output other vehicle and object behavior and location predictions to the motion planning and control vehicle application 214.
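As a rough illustration of predicting future relative positions from own and other vehicle position and velocity, the sketch below extrapolates both vehicles under a constant-velocity assumption. The constant-velocity motion model, the flat two-dimensional coordinates, and the function name `predict_relative_position` are assumptions for illustration; the disclosure does not state which prediction model is used, and a practical system could also account for the HD map and route as noted above.

```python
def predict_relative_position(own_pos, own_vel, other_pos, other_vel, horizon_s):
    """Predict where another vehicle will be relative to the own vehicle.

    Positions are (x, y) in meters, velocities are (vx, vy) in meters per
    second, and horizon_s is the look-ahead time in seconds. Both vehicles
    are assumed to keep their current velocity over the horizon.
    """
    own_future = (own_pos[0] + own_vel[0] * horizon_s,
                  own_pos[1] + own_vel[1] * horizon_s)
    other_future = (other_pos[0] + other_vel[0] * horizon_s,
                    other_pos[1] + other_vel[1] * horizon_s)
    return (other_future[0] - own_future[0],
            other_future[1] - own_future[1])


# Own vehicle at 20 m/s closing on a lead vehicle 30 m ahead doing 15 m/s.
relative = predict_relative_position(
    own_pos=(0.0, 0.0), own_vel=(20.0, 0.0),
    other_pos=(30.0, 0.0), other_vel=(15.0, 0.0),
    horizon_s=2.0,
)
```

After two seconds the gap in this example shrinks from 30 m to 20 m, which is the kind of output a downstream planner could compare against a spacing requirement.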

[0052] Additionally, the behavior planning and prediction vehicle application 216 may use object behavior in combination with location predictions to plan and generate control signals for controlling the motion of the vehicle 100. For example, based on route planning information, refined location in the roadway information, and relative locations and motions of other vehicles, the behavior planning and prediction vehicle application 216 may determine that the vehicle 100 needs to change lanes and accelerate, such as to maintain or achieve minimum spacing from other vehicles, and/or prepare for a turn or exit. As a result, the behavior planning and prediction vehicle application 216 may calculate or otherwise determine a steering angle for the wheels and a change to the throttle setting to be commanded to the motion planning and control vehicle application 214 and DBW system/control unit 220 along with such various parameters necessary to effectuate such a lane change and acceleration. One such parameter may be a computed steering wheel command angle.

[0053] The motion planning and control vehicle application 214 may receive data and information outputs from the sensor fusion and RWM management vehicle application 212 and other vehicle and object behavior as well as location predictions from the behavior planning and prediction vehicle application 216, and use this information to plan and generate control signals for controlling the motion of the vehicle 100 and to verify that such control signals meet safety requirements for the vehicle 100. For example, based on route planning information, refined location in the roadway information, and relative locations and motions of other vehicles, the motion planning and control vehicle application 214 may verify and pass various control commands or instructions to the DBW system/control unit 220.

[0054] The DBW system/control unit 220 may receive the commands or instructions from the motion planning and control vehicle application 214 and translate such information into mechanical control signals for controlling wheel angle, brake and throttle of the vehicle 100. For example, DBW system/control unit 220 may respond to the computed steering wheel command angle by sending corresponding control signals to the steering wheel controller.

[0055] In various aspects, the vehicle management system 200 may include functionality that performs safety checks or oversight of various commands, planning or other decisions of various vehicle applications that could impact vehicle and occupant safety. Such safety check or oversight functionality may be implemented within a dedicated vehicle application or distributed among various vehicle applications and included as part of the functionality. In some aspects, a variety of safety parameters may be stored in memory and the safety checks or oversight functionality may compare a determined value (e.g., relative spacing to a nearby vehicle, distance from the roadway centerline, etc.) to corresponding safety parameter(s), and issue a warning or command if the safety parameter is or will be violated. For example, a safety or oversight function in the behavior planning and prediction vehicle application 216 (or in a separate vehicle application) may determine the current or future separation distance between another vehicle (as refined by the sensor fusion and RWM management vehicle application 212) and the vehicle 100 (e.g., based on the world model refined by the sensor fusion and RWM management vehicle application 212), compare that separation distance to a safe separation distance parameter stored in memory, and issue instructions to the motion planning and control vehicle application 214 to speed up, slow down or turn if the current or predicted separation distance violates the safe separation distance parameter. As another example, safety or oversight functionality in the motion planning and control vehicle application 214 (or a separate vehicle application) may compare a determined or commanded steering wheel command angle to a safe wheel angle limit or parameter, and issue an override command and/or alarm in response to the commanded angle exceeding the safe wheel angle limit.
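The two safety checks described above can be sketched as simple comparisons against stored parameters. The function names, the action strings, and the choice to clamp an excessive steering command to the limit are hypothetical; the disclosure only states that a determined value is compared to a safety parameter and that a warning, command, or override may be issued.

```python
import math


def check_separation(current_m, predicted_m, safe_separation_m):
    """Compare current and predicted separation to the stored safe value."""
    if current_m < safe_separation_m:
        return "slow_down"  # parameter is violated now
    if predicted_m < safe_separation_m:
        return "warn"       # parameter will be violated
    return "ok"


def check_steering(commanded_deg, safe_limit_deg):
    """Return (angle_to_apply, override_issued) for a steering command.

    Clamping to the limit is an assumed form of override command.
    """
    if abs(commanded_deg) > safe_limit_deg:
        return math.copysign(safe_limit_deg, commanded_deg), True
    return commanded_deg, False


action = check_separation(current_m=20.0, predicted_m=12.0, safe_separation_m=15.0)
angle, overridden = check_steering(commanded_deg=-35.0, safe_limit_deg=30.0)
```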

[0056] Some safety parameters stored in memory may be static (i.e., unchanging over time), such as maximum vehicle speed. Other safety parameters stored in memory may be dynamic in that the parameters are determined or updated continuously or periodically based on vehicle state information and/or environmental conditions. Non-limiting examples of safety parameters include maximum safe speed, maximum brake pressure, maximum acceleration, and the safe wheel angle limit, all of which may be a function of roadway and weather conditions.
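To make the distinction between static and dynamic safety parameters concrete, the sketch below derives a maximum safe speed from a road surface condition while never exceeding a static maximum vehicle speed. The numeric values, the condition names, and the scaling-factor approach are invented for illustration and do not come from the disclosure.

```python
# Static parameter: does not change over time (value is hypothetical).
STATIC_MAX_SPEED_KPH = 130.0

# Hypothetical scaling factors per road surface condition.
CONDITION_FACTOR = {"dry": 1.0, "wet": 0.8, "snow": 0.5, "ice": 0.3}


def max_safe_speed_kph(posted_limit_kph, road_condition):
    """Dynamic parameter: recomputed whenever conditions are updated."""
    # Unknown conditions fall back to the most conservative factor.
    factor = CONDITION_FACTOR.get(road_condition, min(CONDITION_FACTOR.values()))
    return min(STATIC_MAX_SPEED_KPH, posted_limit_kph * factor)


dry_limit = max_safe_speed_kph(100.0, "dry")
snow_limit = max_safe_speed_kph(100.0, "snow")
```

A system following this pattern would re-evaluate the dynamic value continuously or periodically, as the paragraph above describes, and feed it into the same comparison logic used for static parameters.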

[0057] FIG. 2B illustrates an example of vehicle applications, subsystems, computational elements, or units within a vehicle management system 250, which may be utilized within a vehicle 100. With reference to FIGS. 1A-2B, in some aspects, the vehicle applications 202, 204, 206, 208, 210, 212, and 216 of the vehicle management system 200 may be similar to those described with reference to FIG. 2A and the vehicle management system 250 may operate similar to the vehicle management system 200, except that the vehicle management system 250 may pass various data or instructions to a vehicle safety and crash avoidance system 252 rather than the DBW system/control unit 220. For example, the configuration of the vehicle management system 250 and the vehicle safety and crash avoidance system 252 illustrated in FIG. 2B may be used in a non-autonomous vehicle.

[0058] In various aspects, the behavioral planning and prediction vehicle application 216 and/or sensor fusion and RWM management vehicle application 212 may output data to the vehicle safety and crash avoidance system 252. For example, the sensor fusion and RWM management vehicle application 212 may output sensor data as part of refined location and state information of the vehicle 100 provided to the vehicle safety and crash avoidance system 252. The vehicle safety and crash avoidance system 252 may use the refined location and state information of the vehicle 100 to make safety determinations relative to the vehicle 100 and/or occupants of the vehicle 100. As another example, the behavioral planning and prediction vehicle application 216 may output behavior models and/or predictions related to the motion of other vehicles to the vehicle safety and crash avoidance system 252. The vehicle safety and crash avoidance system 252 may use the behavior models and/or predictions related to the motion of other vehicles to make safety determinations relative to the vehicle 100 and/or occupants of the vehicle 100.

[0059] In various aspects, the vehicle safety and crash avoidance system 252 may include functionality that performs safety checks or oversight of various commands, planning, or other decisions of various vehicle applications, as well as human driver actions, that could impact vehicle and occupant safety. In some aspects, a variety of safety parameters may be stored in memory and the vehicle safety and crash avoidance system 252 may compare a determined value (e.g., relative spacing to a nearby vehicle, distance from the roadway centerline, etc.) to corresponding safety parameter(s), and issue a warning or command if the safety parameter is or will be violated. For example, a vehicle safety and crash avoidance system 252 may determine the current or future separation distance between another vehicle (as refined by the sensor fusion and RWM management vehicle application 212) and the vehicle (e.g., based on the world model refined by the sensor fusion and RWM management vehicle application 212), compare that separation distance to a safe separation distance parameter stored in memory, and issue instructions to a driver to speed up, slow down or turn if the current or predicted separation distance violates the safe separation distance parameter. As another example, the vehicle safety and crash avoidance system 252 may compare a human driver's change in steering wheel angle to a safe wheel angle limit or parameter, and issue an override command and/or alarm in response to the steering wheel angle exceeding the safe wheel angle limit.

[0060] As indicated above, different modes of data received from different sensors may be fused into a single data modality which can provide more information about the environment than would be available from a single type of sensor. In some cases, one or more machine learning techniques may be used as a part of sensor fusion.

[0061] A neural network is an example of a machine learning system, and a neural network can include an input layer, one or more hidden layers, and an output layer. Data is provided from input nodes of the input layer, processing is performed by hidden nodes of the one or more hidden layers, and an output is produced through output nodes of the output layer. Deep learning networks typically include multiple hidden layers. Each layer of the neural network can include feature maps or activation maps that can include artificial neurons (or nodes). A feature map can include a filter, a kernel, or the like. The nodes can include one or more weights used to indicate an importance of the nodes of one or more of the layers. In some cases, a deep learning network can have a series of many hidden layers, with early layers being used to determine simple and low level characteristics of an input, and later layers building up a hierarchy of more complex and abstract characteristics.
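The input layer, hidden layer, and output layer structure described above can be sketched as a small forward pass. The layer sizes, weights, and activation functions below are arbitrary illustrative values, and the helper names `dense` and `forward` are hypothetical; the sketch is not the network of the disclosure.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: each node weights every input."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass data from the input nodes through each layer in order."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = dense(activations, weights, biases, activation)
    return activations


layers = [
    # Hidden layer: 2 inputs -> 2 hidden nodes.
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),
    # Output layer: 2 hidden nodes -> 1 output node.
    ([[1.0, 1.0]], [0.0], sigmoid),
]
output = forward([2.0, 1.0], layers)
```

Adding more entries to `layers` yields the deeper networks described above, in which later layers operate on the features computed by earlier ones.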

[0062] A deep learning architecture may learn a hierarchy of features. If presented with visual data, for example, the first layer may learn to recognize relatively simple features, such as edges, in the input stream. In another example, if presented with auditory data, the first layer may learn to recognize spectral power in specific frequencies. The second layer, taking the output of the first layer as input, may learn to recognize combinations of features, such as simple shapes for visual data or combinations of sounds for auditory data. For instance, higher layers may learn to represent complex shapes in visual data or words in auditory data. Still higher layers may learn to recognize common visual objects or spoken phrases.

[0063] Deep learning architectures may perform especially well when applied to problems that have a natural hierarchical structure. For example, the classification of motorized vehicles may benefit from first learning to recognize wheels, windshields, and other features. These features may be combined at higher layers in different ways to recognize cars, trucks, and airplanes.

[0064] Neural networks may be designed with a variety of connectivity patterns. In feed-forward networks, information is passed from lower to higher layers, with each neuron in a given layer communicating to neurons in higher layers. A hierarchical representation may be built up in successive layers of a feed-forward network, as described above. Neural networks may also have recurrent or feedback (also called top-down) connections. In a recurrent connection, the output from a neuron in a given layer may be communicated to another neuron in the same layer. A recurrent architecture may be helpful in recognizing patterns that span more than one of the input data chunks that are delivered to the neural network in a sequence. A connection from a neuron in a given layer to a neuron in a lower layer is called a feedback (or top-down) connection. A network with many feedback connections may be helpful when the recognition of a high-level concept may aid in discriminating the particular low-level features of an input. The connections between layers of a neural network may be fully connected or locally connected. Various examples of neural network architectures are described below with respect to FIG. 3A - FIG. 4.

[0065] As noted previously, some sensor fusion systems utilize neural networks or other machine learning systems to fuse disparate modalities of data, such as image data, radar data, and lidar data.

[0066] The connections between layers of a neural network may be fully connected or locally connected. FIG. 3A illustrates an example of a fully connected neural network 302. In a fully connected neural network 302, a neuron in a first layer may communicate its output to every neuron in a second layer, so that each neuron in the second layer will receive input from every neuron in the first layer. FIG. 3B illustrates an example of a locally connected neural network 304. In a locally connected neural network 304, a neuron in a first layer may be connected to a limited number of neurons in the second layer. More generally, a locally connected layer of the locally connected neural network 304 may be configured so that each neuron in a layer will have the same or a similar connectivity pattern, but with connection strengths that may have different values (e.g., 310, 312, 314, and 316). The locally connected connectivity pattern may give rise to spatially distinct receptive fields in a higher layer, because the higher layer neurons in a given region may receive inputs that are tuned through training to the properties of a restricted portion of the total input to the network.

[0067] One example of a locally connected neural network is a convolutional neural network. FIG. 3C illustrates an example of a convolutional neural network 306. The convolutional neural network 306 may be configured such that the connection strengths associated with the inputs for each neuron in the second layer are shared (e.g., 308). Convolutional neural networks may be well suited to problems in which the spatial location of inputs is meaningful. Convolutional neural network 306 may be used to perform one or more aspects of video compression and/or decompression, according to aspects of the present disclosure.
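The difference between the fully connected, locally connected, and convolutional patterns described above can be illustrated with a small one-dimensional sketch. The sizes, variable names, and random weights below are hypothetical and chosen only for illustration; the sketch is not taken from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, k = 8, 3                 # hypothetical input length and receptive-field width
n_out = n_in - k + 1           # one output per valid window position
x = rng.normal(size=n_in)

# Fully connected: every output neuron receives every input.
w_full = rng.normal(size=(n_out, n_in))
y_full = w_full @ x

# Locally connected: each output sees only k neighbouring inputs,
# and each window has its own (unshared) connection strengths.
w_local = rng.normal(size=(n_out, k))
y_local = np.array([w_local[i] @ x[i:i + k] for i in range(n_out)])

# Convolutional: the same local windows, but one kernel shared by all outputs.
w_conv = rng.normal(size=k)
y_conv = np.array([w_conv @ x[i:i + k] for i in range(n_out)])

param_counts = {"full": w_full.size, "local": w_local.size, "conv": w_conv.size}
```

With these assumed sizes, the three layers produce outputs of the same length while the number of trainable weights shrinks from 48 to 18 to 3, which is the practical effect of local connectivity and weight sharing.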

[0068] One type of convolutional neural network is a deep convolutional network (DCN). FIG. 3D illustrates a detailed example of a DCN 300 designed to recognize visual features from an image 326 input from an image capturing device 330, such as a car-mounted camera. The DCN 300 of the current example may be trained to identify traffic signs and a number provided on the traffic sign. Of course, the DCN 300 may be trained for other tasks, such as identifying lane markings or identifying traffic lights.

[0069] The DCN 300 may be trained with supervised learning. During training, the DCN 300 may be presented with an image, such as the image 326 of a speed limit sign, and a forward pass may then be computed to produce an output 322. The DCN 300 may include a feature extraction section and a classification section. Upon receiving the image 326, a convolutional layer 332 may apply convolutional kernels (not shown) to the image 326 to generate a first set of feature maps 318. As an example, the convolutional kernel for the convolutional layer 332 may be a 5x5 kernel that generates 28x28 feature maps. In the present example, because four different feature maps are generated in the first set of feature maps 318, four different convolutional kernels were applied to the image 326 at the convolutional layer 332. The convolutional kernels may also be referred to as filters or convolutional filters.

[0070] The first set of feature maps 318 may be subsampled by a max pooling layer (not shown) to generate a second set of feature maps 320. The max pooling layer reduces the size of the first set of feature maps 318. That is, a size of the second set of feature maps 320, such as 14x14, is less than the size of the first set of feature maps 318, such as 28x28. The reduced size provides similar information to a subsequent layer while reducing memory consumption. The second set of feature maps 320 may be further convolved via one or more subsequent convolutional layers (not shown) to generate one or more subsequent sets of feature maps (not shown).
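The feature map sizes in this example can be checked with a short sketch. The text does not state the size of the image 326; the sketch assumes a 32x32 single-channel input with stride 1 and no padding, which is one configuration under which a 5x5 kernel yields 28x28 feature maps. The function names and the 2x2 pooling window are likewise assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Cross-correlate a 2D image with a 2D kernel, stride 1, no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling (assumes even height and width)."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))       # assumed input size, not given in the text
kernels = rng.normal(size=(4, 5, 5))    # four 5x5 kernels, one per feature map

first_maps = np.stack([conv2d_valid(image, k) for k in kernels])   # like feature maps 318
second_maps = np.stack([max_pool_2x2(m) for m in first_maps])      # like feature maps 320
```

Under these assumptions, `first_maps` has shape (4, 28, 28) and `second_maps` has shape (4, 14, 14), matching the sizes given for the first and second sets of feature maps.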

[0071] In the example of FIG. 3D, the second set of feature maps 320 is convolved to generate a first feature vector 324. Furthermore, the first feature vector 324 is further convolved to generate a second feature vector 328. Each feature of the second feature vector 328 may include a number that corresponds to a possible feature of the image 326, such as “sign,” “60,” and “100.” A softmax function (not shown) may convert the numbers in the second feature vector 328 to a probability. As such, an output 322 of the DCN 300 is a probability of the image 326 including one or more features.
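As a minimal sketch of the softmax step, the following converts a vector of raw scores into probabilities. For simplicity it covers only the speed-value classes, and the scores are made-up values rather than outputs of the DCN 300.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1 (numerically stable)."""
    shifted = scores - np.max(scores)   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

labels = ["30", "40", "50", "60", "70", "80", "90", "100"]
scores = np.array([0.1, 0.3, 0.2, 4.0, 0.5, 0.2, 0.1, 0.0])  # illustrative values only
probs = softmax(scores)
predicted = labels[int(np.argmax(probs))]
```

Because softmax preserves the ordering of the scores, the class with the largest raw score ("60" in this made-up vector) also receives the largest probability.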

[0072] In the present example, the probabilities in the output 322 for “sign” and “60” are higher than the probabilities of the others of the output 322, such as “30,” “40,” “50,” “70,” “80,” “90,” and “100”. Before training, the output 322 produced by the DCN 300 is likely to be incorrect. Thus, an error may be calculated between the output 322 and a target output. The target output is the ground truth of the image 326 (e.g., “sign” and “60”). The weights of the DCN 300 may then be adjusted so the output 322 of the DCN 300 is more closely aligned with the target output.

[0073] To adjust the weights, a learning algorithm may compute a gradient vector for the weights. The gradient may indicate an amount that an error would increase or decrease if the weight were adjusted. At the top layer, the gradient may correspond directly to the value of a weight connecting an activated neuron in the penultimate layer and a neuron in the output layer. In lower layers, the gradient may depend on the value of the weights and on the computed error gradients of the higher layers. The weights may then be adjusted to reduce the error. This manner of adjusting the weights may be referred to as “back propagation” as it involves a “backward pass” through the neural network.

[0074] In practice, the error gradient of weights may be calculated over a small number of examples, so that the calculated gradient approximates the true error gradient. This approximation method may be referred to as stochastic gradient descent. Stochastic gradient descent may be repeated until the achievable error rate of the entire system has stopped decreasing or until the error rate has reached a target level. After learning, the DCN may be presented with new images and a forward pass through the network may yield an output 322 that may be considered an inference or a prediction of the DCN.
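The following sketch illustrates stochastic gradient descent on a deliberately simple model: a single linear layer trained with a mean squared error, so the backward pass reduces to one gradient expression. The data, learning rate, batch size, and step count are hypothetical choices for illustration, not values from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])          # weights the model should recover
X = rng.normal(size=(256, 2))
y = X @ true_w                          # noiseless targets for a clean demonstration

w = np.zeros(2)
lr, batch_size = 0.1, 16                # assumed hyperparameters
for step in range(500):
    # Estimate the gradient from a small random batch instead of the full set.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    err = xb @ w - yb
    grad = 2.0 * xb.T @ err / batch_size   # gradient of the mean squared error
    w -= lr * grad                         # move the weights against the gradient
```

Each step uses only 16 of the 256 examples, so the computed gradient approximates the true error gradient, yet repeated steps still drive `w` toward `true_w`.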

[0075] Deep belief networks (DBNs) are probabilistic models comprising multiple layers of hidden nodes. DBNs may be used to extract a hierarchical representation of training data sets. A DBN may be obtained by stacking up layers of Restricted Boltzmann Machines (RBMs). An RBM is a type of artificial neural network that can learn a probability distribution over a set of inputs. Because RBMs can learn a probability distribution in the absence of information about the class to which each input should be categorized, RBMs are often used in unsupervised learning. Using a hybrid unsupervised and supervised paradigm, the bottom RBMs of a DBN may be trained in an unsupervised manner and may serve as feature extractors, and the top RBM may be trained in a supervised manner (on a joint distribution of inputs from the previous layer and target classes) and may serve as a classifier.

[0076] Deep convolutional networks (DCNs) are networks of convolutional networks, configured with additional pooling and normalization layers. DCNs have achieved state-of-the-art performance on many tasks. DCNs can be trained using supervised learning in which both the input and output targets are known for many exemplars and are used to modify the weights of the network by use of gradient descent methods.

[0077] DCNs may be feed-forward networks. In addition, as described above, the connections from a neuron in a first layer of a DCN to a group of neurons in the next higher layer are shared across the neurons in the first layer. The feed-forward and shared connections of DCNs may be exploited for fast processing. The computational burden of a DCN may be much less, for example, than that of a similarly sized neural network that comprises recurrent or feedback connections.

[0078] The processing of each layer of a convolutional network may be considered a spatially invariant template or basis projection. If the input is first decomposed into multiple channels, such as the red, green, and blue channels of a color image, then the convolutional network trained on that input may be considered three-dimensional, with two spatial dimensions along the axes of the image and a third dimension capturing color information. The outputs of the convolutional connections may be considered to form a feature map in the subsequent layer, with each element of the feature map (e.g., feature maps 320) receiving input from a range of neurons in the previous layer (e.g., feature maps 318) and from each of the multiple channels. The values in the feature map may be further processed with a non-linearity, such as a rectification, max(0,x). Values from adjacent neurons may be further pooled, which corresponds to down sampling, and may provide additional local invariance and dimensionality reduction.
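A minimal sketch of a multi-channel convolution followed by rectification is shown below. The 6x6 image, 3x3 kernel, and function names are hypothetical; the point is only that each element of the resulting feature map sums over a spatial window and over every input channel.

```python
import numpy as np

def conv_multichannel(image, kernel):
    """Valid cross-correlation of an (H, W, C) image with a (kh, kw, C) kernel."""
    kh, kw, _ = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output element sums over a spatial window and all channels.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernel)
    return out

def relu(x):
    """Rectification non-linearity, max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
rgb = rng.normal(size=(6, 6, 3))        # stand-in for a small red/green/blue image
kernel = rng.normal(size=(3, 3, 3))     # one kernel spanning all three channels
feature_map = relu(conv_multichannel(rgb, kernel))
```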

[0079] FIG. 4 is a block diagram illustrating an example of a deep convolutional network 450. The deep convolutional network 450 may include multiple different types of layers based on connectivity and weight sharing. As shown in FIG. 4, the deep convolutional network 450 includes the convolution blocks 454A, 454B. Each of the convolution blocks 454A, 454B may be configured with a convolution layer (CONV) 456, a normalization layer (LNorm) 458, and a max pooling layer (MAX POOL) 460.

[0080] The convolution layers 456 may include one or more convolutional filters, which may be applied to the input data 452 to generate a feature map. Although only two convolution blocks 454A, 454B are shown, the present disclosure is not so limiting, and instead, any number of convolution blocks (e.g., convolution blocks 454A, 454B) may be included in the deep convolutional network 450 according to design preference. The normalization layer 458 may normalize the output of the convolution filters. For example, the normalization layer 458 may provide whitening or lateral inhibition. The max pooling layer 460 may provide down sampling aggregation over space for local invariance and dimensionality reduction.

[0081] The parallel filter banks, for example, of a deep convolutional network may be loaded on a CPU 110 or GPU 115 of an SOC 105 to achieve high performance and low power consumption. In alternative aspects, the parallel filter banks may be loaded on the DSP 106 or an ISP 175 of an SOC 105. In addition, the deep convolutional network 450 may access other processing blocks that may be present on the SOC 105, such as sensor processor 155 and navigation module 195, dedicated, respectively, to sensors and navigation.

[0082] The deep convolutional network 450 may also include one or more fully connected layers, such as layer 462A (labeled “FC1”) and layer 462B (labeled “FC2”). The deep convolutional network 450 may further include a logistic regression (LR) layer 464. Between each layer 456, 458, 460, 462A, 462B, 464 of the deep convolutional network 450 are weights (not shown) that are to be updated. The output of each of the layers (e.g., 456, 458, 460, 462A, 462B, 464) may serve as an input of a succeeding one of the layers (e.g., 456, 458, 460, 462A, 462B, 464) in the deep convolutional network 450 to learn hierarchical feature representations from input data 452 (e.g., images, audio, video, sensor data and/or other input data) supplied at the first of the convolution blocks 454A. The output of the deep convolutional network 450 is a classification score 466 for the input data 452. The classification score 466 may be a set of probabilities, where each probability is the probability of the input data including a feature from a set of features.
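The layer ordering of FIG. 4 (two convolution blocks, two fully connected layers, and a final scoring layer) can be sketched as a forward pass. Several details below are assumptions made for illustration: a single input channel, per-map standardization as a stand-in for the normalization layer 458, a rectification after each fully connected layer, a softmax over linear scores as a stand-in for the LR layer 464, and all sizes and weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv(x, k):
    """Valid 2D cross-correlation, stride 1."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return np.array([[np.sum(x[i:i + kh, j:j + kw] * k) for j in range(w)]
                     for i in range(h)])

def norm(x, eps=1e-5):
    """Standardize a feature map (assumed stand-in for the LNorm layer)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def max_pool(x):
    """2x2 max pooling; an odd trailing row or column is dropped."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def conv_block(x, k):
    """CONV -> normalization -> MAX POOL, in the order shown for blocks 454A, 454B."""
    return max_pool(norm(conv(x, k)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=(20, 20))                        # stand-in for input data 452
k1, k2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
h = conv_block(conv_block(x, k1), k2)                # 20 -> 18 -> 9 -> 7 -> 3
flat = h.reshape(-1)
w_fc1 = rng.normal(size=(8, flat.size))              # FC1 weights
w_fc2 = rng.normal(size=(4, 8))                      # FC2 weights
w_lr = rng.normal(size=(3, 4))                       # scoring-layer weights
hidden = np.maximum(0.0, w_fc2 @ np.maximum(0.0, w_fc1 @ flat))
classification_score = softmax(w_lr @ hidden)        # one probability per class
```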

[0083] In some cases, sensor data from multiple sensors may be fused to generate a segmented BEV. FIG. 5 illustrates a technique for generating segmented BEV 500, in accordance with aspects of the present disclosure. In technique 500, input camera data 502 (e.g., images captured by cameras such as cameras 122, 136 of FIG. 1B) may be input to a camera data encoder 504. The camera data encoder 504 may include one or more feature extractors. These feature extractors may be ML based and be used to identify certain features in the camera data. As an example, the feature extractors may include one or more layers or transformer blocks which may include feature maps for recognizing certain features. The camera data encoder 504 may output the identified features as intermediate camera features 506. Of note, the input camera data 502 and camera data encoder 504 may operate in a 2D space (e.g., on height and width axes with respect to the camera). A perspective transformation 508 may be applied to the output intermediate camera features which converts the intermediate camera features from, for example, a frontal view of an environment from a vehicle, to BEV projected camera features as if the features were generated based on a camera positioned above the vehicle. In some cases, the perspective transformation 508 may be ML based.
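The perspective transformation 508 may be learned, and the text does not specify its form. For intuition only, the sketch below shows a classic geometric alternative, inverse perspective mapping, which is a different technique from a learned transform. It assumes a pinhole camera with a horizontal optical axis, a flat ground plane, and hypothetical intrinsics; the function and parameter names are illustrative.

```python
import numpy as np

def pixel_to_ground(u, v, focal, cx, cy, camera_height):
    """Intersect the viewing ray of pixel (u, v) with a flat ground plane.

    Camera frame: x right, y down, z forward; the optical axis is horizontal
    and the ground lies `camera_height` metres below the camera.
    Returns (lateral, forward) in metres, or None at or above the horizon.
    """
    ray = np.array([(u - cx) / focal, (v - cy) / focal, 1.0])
    if ray[1] <= 0:
        return None                      # ray never reaches the ground
    scale = camera_height / ray[1]       # stretch the ray until it hits the ground
    return float(scale * ray[0]), float(scale * ray[2])

# Hypothetical camera: 100-pixel focal length, principal point (50, 50), 1.5 m high.
ground_point = pixel_to_ground(u=50, v=150, focal=100.0, cx=50.0, cy=50.0,
                               camera_height=1.5)
above_horizon = pixel_to_ground(u=50, v=40, focal=100.0, cx=50.0, cy=50.0,
                                camera_height=1.5)
```

Applying such a mapping to every pixel places frontal-view content on a top-down grid; a learned transformation can additionally handle non-flat scenes and objects with height, which this flat-ground sketch cannot.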

[0084] In some cases, lidar data 510 may be received, for example as a lidar point cloud, captured by a lidar, such as lidar 138 of FIG. 1B. Lidar may transmit a beam of ultraviolet, visible, or near infrared light into an environment and detect reflections of the beam from objects in the environment. Based on an amount of time needed for the reflections to be detected, distances to objects in the environment may be determined and lidar points may be described based on the point’s location on width, height, and depth axes with respect to the lidar. Thus, the lidar data is three-dimensional data. The lidar data 510 may be input to a lidar data encoder 512. The lidar data encoder 512 may be similar to the camera data encoder 504, but configured (e.g., trained) to operate in a 3D space to identify features in the lidar data and output the identified features as intermediate lidar features 514. The intermediate lidar features 514 may then be flattened 516 to BEV projected lidar features, for example, by removing or averaging the height information (e.g., height axes, height channel, height dimension).
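Two steps of this paragraph lend themselves to a short sketch: converting a round-trip time into a distance, and flattening a 3D feature volume to BEV by averaging over height. The (channels, height, depth, width) layout, the tensor sizes, and the function names are assumptions for illustration.

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def range_from_round_trip(seconds):
    """Distance to a reflector from the round-trip time of a light pulse."""
    # The pulse travels out and back, so the one-way distance is half the path.
    return SPEED_OF_LIGHT * seconds / 2.0

def flatten_to_bev(features):
    """Average out the height axis of a (C, Z, Y, X) volume, giving (C, Y, X)."""
    return features.mean(axis=1)

rng = np.random.default_rng(0)
lidar_features = rng.normal(size=(16, 4, 32, 32))   # hypothetical intermediate features
bev_lidar = flatten_to_bev(lidar_features)
distance_m = range_from_round_trip(1e-6)            # a one-microsecond round trip
```

A one-microsecond round trip corresponds to roughly 150 metres, and the flattened tensor keeps its channels and ground-plane extent while losing the height axis.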

[0085] In some cases, input radar data 518 may be received, for example, as a radar point cloud, captured by a radar, such as radar 132. In some cases, radar operates in a manner similar to lidar, but uses radio frequency waves rather than light. The input radar data 518 may be input to a radar data encoder 520. The radar data encoder 520 may be similar to the lidar data encoder 512 and the radar data encoder 520 may identify features in the radar data and output the identified features as intermediate radar features 522. The intermediate radar features 522 may then be flattened 524 to BEV projected radar features, for example, by removing or averaging the height information (e.g., height axes, height channel, height dimension).

[0086] The BEV projected camera features, BEV projected lidar features, and BEV projected radar features may be combined 526. In some cases, the BEV projected features may be combined by concatenating the BEV projected camera features, the BEV projected lidar features, and the BEV projected radar features. The combined BEV projected features may be input to a decoder 528. The decoder 528 may include ML models to identify and segment (e.g., label) the combined BEV projected features to generate (e.g., predict) a BEV segmented map 530. As a part of training, a loss function may be determined based on the BEV segmented map 530. The loss function may indicate how close a predicted BEV segmented map 530 is to a ground truth map and the loss function, such as a cross-entropy loss function, may be used, for example via back-propagation, to adjust weights in the ML models for the decoder 528, perspective transformation 508, camera data encoder 504, lidar data encoder 512, and/or radar data encoder 520 to improve predictions. However, such a technique does not enforce consistency between pixel-wise semantic segmentation and the BEV transform, nor is there a mechanism to enforce consistency between the different fused modalities (e.g., the camera, lidar, and radar based datasets). In some cases, additional loss functions may be used to provide such consistency.
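The combination and loss steps can be sketched as follows. The channel counts, the 16x16 BEV grid, the three classes, and the per-cell linear map standing in for the decoder 528 are all assumptions; the actual decoder may be any ML model.

```python
import numpy as np

def pixelwise_cross_entropy(logits, target):
    """Mean cross-entropy for (K, H, W) logits and an (H, W) integer label map."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    rows, cols = np.indices(target.shape)
    # Pick, for every cell, the log-probability of its ground truth class.
    return float(-log_probs[target, rows, cols].mean())

rng = np.random.default_rng(0)
bev_camera = rng.normal(size=(8, 16, 16))   # hypothetical channel counts
bev_lidar = rng.normal(size=(4, 16, 16))
bev_radar = rng.normal(size=(2, 16, 16))

# Combine by concatenating along the channel axis; the BEV grids must match.
combined = np.concatenate([bev_camera, bev_lidar, bev_radar], axis=0)

# Stand-in decoder: a per-cell linear map (a 1x1 convolution) to class logits.
num_classes = 3
decoder_w = rng.normal(size=(num_classes, combined.shape[0]))
logits = np.einsum("kc,chw->khw", decoder_w, combined)

ground_truth = rng.integers(0, num_classes, size=(16, 16))
loss = pixelwise_cross_entropy(logits, ground_truth)
```

Concatenation requires the three feature sets to share the same BEV grid, which is one reason alignment between the modalities matters.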

[0087] FIG. 6 illustrates a technique for generating segmented BEV with enhanced data consistency 600, in accordance with aspects of the present disclosure. In some cases, consistency may be enforced between the lidar projected BEV features and/or radar projected BEV features and the camera projected BEV features when combining the BEV projected features. In some cases, two loss functions may be added to technique 500 for improving consistency as between the lidar and/or radar data and the camera data in the BEV space. As discussed above with respect to FIG. 5, the input camera data 502 (e.g., as BEV projected camera features) may be combined 526 (e.g., concatenated or otherwise combined) in BEV space with the input lidar data 510 (e.g., as the BEV projected lidar features) and it may be beneficial to assure that the transformed BEV projected camera features 602 are properly pixel aligned and/or properly overlapped with the flattened 516 BEV projected lidar features and/or flattened 524 BEV projected radar features. This alignment/overlapping may be performed by passing the BEV projected camera features 602 to a lidar mapper 604. The lidar mapper 604 may be a ML model which predicts segmentation using the lidar data. Generally, data points of the lidar data 510 are segmented and the lidar mapper 604 predicts segmentation information based on the BEV projected camera features 602. The actual segmented and flattened BEV projected lidar features 606 may be used as a ground truth reference and a sparse loss function, such as a cross entropy loss, may be used for back-propagation to adjust, for example, weights in the perspective transformation 508, lidar data encoder 512, and/or the camera encoder 504.

[0088] In some cases, the BEV projected camera features 602 may also be pixel aligned and/or overlapped with the flattened BEV projected radar features 608. Similar to the input lidar data 510, data points of the input radar data 518 may also be labeled. The BEV projected camera features 602 may be passed to a radar mapper 610. Similar to the lidar mapper 604, the radar mapper 610 may also be a ML model which predicts segmentation of the flattened BEV projected radar features 608 from the BEV projected camera features 602. The actual segmented and flattened BEV projected radar features 608 may be used as a ground truth reference and another sparse loss function, such as a cross entropy loss, may be used for back-propagation to adjust, for example, weights in the perspective transformation 508, radar data encoder 520, and/or the camera data encoder 504.
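One way to read the sparse loss used with the lidar mapper 604 and the radar mapper 610 is a cross-entropy evaluated only at BEV cells that actually contain a labeled lidar or radar point. The sketch below follows that reading, which is an assumption; the ignore marker, grid size, and names are hypothetical.

```python
import numpy as np

IGNORE = -1  # hypothetical marker for BEV cells with no labeled lidar/radar point

def sparse_cross_entropy(logits, target, ignore=IGNORE):
    """Cross-entropy averaged only over cells whose target is not `ignore`."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    valid = target != ignore
    if not valid.any():
        return 0.0                       # nothing to supervise in this sample
    rows, cols = np.nonzero(valid)
    return float(-log_probs[target[valid], rows, cols].mean())

rng = np.random.default_rng(0)
mapper_logits = rng.normal(size=(3, 8, 8))   # mapper prediction from camera BEV features
lidar_labels = np.full((8, 8), IGNORE)       # most cells hold no lidar return
lidar_labels[2, 3], lidar_labels[5, 1], lidar_labels[6, 6] = 0, 2, 1
loss = sparse_cross_entropy(mapper_logits, lidar_labels)
```

Because empty cells are excluded, predictions there contribute no gradient, so the camera branch is only pulled toward the lidar or radar labels where those sensors actually observed something.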

[0089] FIG. 7 illustrates a technique for generating segmented BEV with supervised feature detection 700, in accordance with aspects of the present disclosure. In some cases, supervised training may be performed on the intermediate camera features 506 to improve feature generation from the input camera data 502. To perform supervised training on the intermediate camera features 506, the intermediate camera features 506 may be passed to a segmentation decoder 702. The segmentation decoder 702 may be a ML model which predicts labels for (e.g., segments) the intermediate camera features 506 to generate segmented camera features 704.

[0090] In cases where ground truth segmented camera features are available, the segmented camera features 704 may be compared to the ground truth segmented camera features and a loss function, such as a cross-entropy loss function, may be used for back-propagation 720 to adjust weights in the camera data encoder 504 to improve the camera data encoder 504 for generating intermediate camera features 506 that are more easily segmented.

[0091] Where ground truth segmented camera features are not available, pseudo-labelling 706 may be used. In pseudo-labelling 706, an expert ML model (e.g., expert system), such as a ML model separately trained to perform segmentation, may be used to predict ground truth labels and generate the ground truth segmented camera features. In some cases, the expert ML model may be a relatively complex ML model that may not be configured to execute within performance constraints that may exist for generating the segmented BEV (e.g., may execute on more powerful hardware resources as a part of a training process as compared to hardware resources that may be available for generating the segmented BEV at inference time). The generated ground truth segmented camera features from the expert ML model may be used in place of the ground truth segmented camera features for determining the loss and back-propagation, as discussed above with respect to when ground truth segmented camera features are available (e.g., the generated ground truth segmented camera features may be used as the ground truth segmented camera features). In cases where the expert ML model labels differ from the labels used for the segmented BEV, a class-wise weighting based on confidence values of predictions from the expert ML model may be used.

[0092] In some cases, to enforce the cyclic consistency between predicted 2D semantic and BEV outputs, the segmented camera features 704 may be passed to a segmented perspective transformer 708. In some cases, the segmented perspective transformer 708 may be similar to the perspective transformation 508. The segmented perspective transformer 708 may also be ML based and may predict a predicted segmented BEV view 710 from the segmented camera features 704. The predicted segmented BEV view 710 may be compared to the BEV segmented map 530, for example, using an L1/L2 loss function based on the resolution and range of predicted BEV logits. Back-propagation of the loss may be performed to adjust weights for camera data encoder 504, lidar data encoder 512, radar data encoder 520, perspective transformation 508, and/or decoder 528.
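The L1 and L2 comparisons mentioned here can be sketched as element-wise losses between two logit tensors of the same shape. The tensors below are synthetic stand-ins, and the sketch assumes both maps already share one resolution and range; any resampling needed to achieve that is not shown.

```python
import numpy as np

def l1_loss(a, b):
    """Mean absolute difference between two equally shaped arrays."""
    return float(np.abs(a - b).mean())

def l2_loss(a, b):
    """Mean squared difference between two equally shaped arrays."""
    return float(((a - b) ** 2).mean())

rng = np.random.default_rng(0)
bev_map_logits = rng.normal(size=(3, 16, 16))   # stands in for BEV segmented map 530
predicted_bev_view = bev_map_logits + 0.1       # stands in for predicted BEV view 710

consistency_l1 = l1_loss(predicted_bev_view, bev_map_logits)
consistency_l2 = l2_loss(predicted_bev_view, bev_map_logits)
```

With a constant offset of 0.1 between the two tensors, the L1 value is 0.1 and the L2 value is 0.01; both fall to zero when the two BEV predictions agree exactly.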

[0093] As an example of enforcing the cyclic consistency between predicted 2D semantic and BEV outputs, the BEV segmented map 530 may be passed to an inverse segmented perspective transformer 712. In some cases, the inverse segmented perspective transformer 712 may invert the perspective transformation 508. The inverse segmented perspective transformer 712 may be ML based and the inverse segmented perspective transformer 712 may predict segmented camera features 714. The predicted segmented camera features 714 may be compared 722 to the segmented camera features 704, for example, using an L1/L2 loss function based on the resolution and range of predicted BEV logits. Back-propagation (not shown) of the loss may be performed to adjust weights for camera data encoder 504, lidar data encoder 512, radar data encoder 520, perspective transformation 508, and/or decoder 528. In certain cases, performance may be measured based on a mean intersection over union in BEV space. In some cases, the techniques for generating a segmented BEV as shown in FIGs. 6 and 7 may be implemented together in a single system (e.g., as a system for generating a segmented BEV utilizing up to five loss functions and back-propagation during training).
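Mean intersection over union on a BEV label map can be computed as in the following sketch. The 4x4 maps and three classes are made-up examples, and skipping classes absent from both maps is one common convention rather than something stated in the text.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue                     # class absent from both maps; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 2, 2],
                   [2, 2, 2, 2]])
pred = target.copy()
pred[0, 2] = 0                           # one class-1 cell mislabeled as class 0
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class values are 4/5 for class 0, 3/4 for class 1, and 1 for class 2, giving a mean of 0.85.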

[0094] FIG. 8 is a flow diagram illustrating a process for training a ML algorithm for generating a segmented BEV map, in accordance with aspects of the present disclosure. The process 800 may be performed by a computing device (or apparatus) or a component (e.g., a chipset, codec, etc.) of the computing device. The computing device may be a vehicle (e.g., vehicle 100) or component or system of a vehicle (e.g., control unit 140 of FIG. 1A and 1C, SOC 105 of FIG. 1D, vehicle management system 200 of FIG. 2A, vehicle management system 250 of FIG. 2B, etc.), mobile device (e.g., a mobile phone), a network-connected wearable such as a watch, an extended reality (XR) device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), laptop, desktop, or other type of computing device (e.g., computing system 900 of FIG. 9). The operations of the process 800 may be implemented as software components that are executed and run on one or more processors (e.g., processor 164 of FIG. 1C, CPU 110, GPU 115, DSP 106, NPU 125 of FIG. 1D, processor 910 of FIG. 9, etc.).

[0095] At operation 802, the computing device (or component thereof) may obtain one or more images of an environment (e.g., from a camera such as camera 122, camera 136 of FIG. 1B, sensor(s) 158 of FIG. 1C, sensors 155 of FIG. 1D, image capturing device 330 of FIG. 3D, input device 945 of FIG. 9, etc.).

[0096] At operation 804, the computing device (or component thereof) may generate, using a first machine learning model, a first set of features (e.g., intermediate camera features 506 of FIGs. 5-7) for one or more objects in the one or more images. For example, a camera data encoder 504 of FIGs. 5-7 may identify and extract features from input images.

[0097] At operation 806, the computing device (or component thereof) may predict, using a second machine learning model (e.g., segmentation decoder 702 of FIG. 7), image feature labels for the first set of features. For example, a segmentation decoder may predict labels for features from input images.

[0098] At operation 808, the computing device (or component thereof) may compare the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function. For example, feature labels output from the segmentation decoder 702 of FIG. 7 may be compared to a ground truth feature label to determine a loss, such as through a cross-entropy loss function, hinge loss function, etc. In some cases, the computing device (or component thereof) may predict, using the second machine learning model that is separate from the first machine learning model, feature labels for the ground truth image feature labels. For example, segmentation decoder 702 of FIG. 7 may be a different ML model as compared to the camera encoder 504 of FIGs. 5-7.

[0099] In some cases, the computing device (or component thereof) may obtain the first three-dimensional representation of the environment, generate a second set of features (e.g., by lidar encoder 512, radar encoder 520 of FIGs. 5-7) for one or more objects in the first three-dimensional representation, and flatten (e.g., flattening 516, flattening 524 of FIGs. 5-7) the second set of features from three dimensions to two dimensions to generate the first set of flattened features (e.g., flattened BEV projected lidar features 606, flattened BEV projected radar features 608 of FIG. 6). In some cases, the first three-dimensional representation of the environment comprises one of a radar point cloud or a lidar point cloud (e.g., intermediate lidar features 514, intermediate radar features 522 of FIGs. 5-7). In some cases, the computing device (or component thereof) may obtain a radar point cloud representation (e.g., input radar data 518 of FIGs. 5-7) of the environment, generate a third set of features (e.g., by radar encoder 520 of FIGs. 5-7) for one or more objects in the radar point cloud representation, flatten (e.g., flattening 524 of FIGs. 5-7) the third set of features from three dimensions to two dimensions to generate a second set of flattened features (e.g., flattened BEV projected radar features 608 of FIG. 6), and generate the segmented BEV map (e.g., BEV segmented map 530 of FIGs. 5-7) of the environment based on the combined (e.g., combination 526 of FIGs. 5-7) BEV projected image features, the first set of flattened features, and the second set of flattened features.

[0100] At operation 810, the computing device (or component thereof) may perform a perspective transform (e.g., perspective transformation 508 of FIGs. 5-7) on the first set of features to generate bird’s eye view (BEV) projected image features (e.g., transformed BEV projected camera features 602 of FIG. 6). In some cases, the computing device (or component thereof) may predict (e.g., by radar mapper 610 of FIG. 6) a predicted second set of flattened features based on the BEV projected image features, compare the predicted second set of flattened features to the second set of flattened features to evaluate a fourth loss function, and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, and the evaluated fourth loss function. In some cases, the computing device (or component thereof) may perform an additional perspective transform (e.g., by segmented perspective transformer 708 of FIG. 7) on the predicted image feature labels to generate a predicted segmented BEV map (e.g., predicted segmented BEV view 710 of FIG. 7), compare the predicted segmented BEV map to the generated segmented BEV map to evaluate a fifth loss function, and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function. In some cases, the computing device (or component thereof) may perform an additional perspective transform (e.g., by inverse segmented perspective transformer 712 of FIG. 7) on the generated segmented BEV map to generate additional predicted image feature labels (e.g., predicted segmented camera features 714 of FIG. 7), compare the predicted image feature labels to the additional predicted image feature labels to evaluate a fifth loss function, and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.
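As a rough geometric intuition for a perspective transform from the camera view to a BEV grid, the sketch below maps a single pixel coordinate through a 3x3 planar homography. This is an assumption made purely for illustration: the disclosure does not state how the perspective transformation is implemented, and a transform operating on feature maps may well be learned rather than a fixed matrix. The function name `apply_homography` and the matrix values are hypothetical.

```python
from typing import List, Tuple

Matrix3 = List[List[float]]


def apply_homography(h: Matrix3, pixel: Tuple[float, float]) -> Tuple[float, float]:
    """Map an image pixel (u, v) to ground-plane coordinates using a 3x3 homography."""
    u, v = pixel
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if w == 0:
        raise ValueError("pixel maps to a point at infinity")
    # Divide by the homogeneous coordinate to return to 2D.
    return (x / w, y / w)


# Hypothetical matrix: scale by 0.5, then shift by (10, 20).
H = [
    [0.5, 0.0, 10.0],
    [0.0, 0.5, 20.0],
    [0.0, 0.0, 1.0],
]
bev_point = apply_homography(H, (100.0, 200.0))
```

In practice such a matrix would be derived from camera calibration; the values here are chosen only so the arithmetic is easy to follow.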

[0101] At operation 812, the computing device (or component thereof) may combine (e.g., combination 526 of FIGs. 5-7) the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment. In some cases, the first three-dimensional representation of the environment comprises a lidar point cloud (e.g., lidar data 510 of FIGs. 5-7). In some cases, the computing device (or component thereof) may predict (e.g., by lidar mapper 604) a predicted first set of flattened features based on the BEV projected image features, compare the predicted first set of flattened features to the first set of flattened features to evaluate a third loss function, and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, and the evaluated third loss function.
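The third loss function described above compares flattened lidar features predicted from the BEV projected image features against the flattened lidar features actually computed. A minimal sketch follows, assuming a mean-squared-error comparison and a hypothetical element-wise linear `lidar_mapper`; the disclosure specifies neither the form of the loss nor the structure of the lidar mapper.

```python
from typing import List

Grid = List[List[float]]  # a single-channel 2D BEV feature map, grid[y][x]


def lidar_mapper(bev_image_features: Grid, scale: float, bias: float) -> Grid:
    """Hypothetical stand-in for a lidar mapper: an element-wise linear map."""
    return [[scale * value + bias for value in row] for row in bev_image_features]


def mse_loss(predicted: Grid, target: Grid) -> float:
    """Mean squared error between two equally shaped grids (assumed loss form)."""
    if len(predicted) != len(target) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(predicted, target)
    ):
        raise ValueError("grids must have the same shape")
    total = 0.0
    count = 0
    for p_row, t_row in zip(predicted, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("grids must be non-empty")
    return total / count


bev_image_features = [[0.0, 1.0], [2.0, 3.0]]
flattened_lidar_features = [[1.0, 3.0], [5.0, 7.0]]

predicted_lidar_features = lidar_mapper(bev_image_features, scale=2.0, bias=1.0)
third_loss = mse_loss(predicted_lidar_features, flattened_lidar_features)
```

A loss of zero here means the camera branch fully predicts the lidar branch for this toy input; a nonzero value would contribute a consistency signal during training.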

[0102] At operation 814, the computing device (or component thereof) may generate, using the first machine learning model, a segmented BEV map (e.g., BEV segmented map 530 of FIGs. 5-7) of the environment based on the combined image features. At operation 816, the computing device (or component thereof) may compare the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function.

[0103] At operation 818, the computing device (or component thereof) may train the first machine learning model for generation of one or more segmented BEV maps (e.g., BEV segmented map 530 of FIGs. 5-7) based on the evaluated first loss function and the evaluated second loss function. In some cases, the computing device (or component thereof) may train the first machine learning model at least in part by back propagating the evaluated first loss function and the evaluated second loss function.
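Training on two evaluated loss functions can be illustrated with a toy model that has a single scalar parameter. The sketch below is not the disclosed model: it assumes the two losses are simply summed, that each is a squared error, and that plain gradient descent is used, none of which the disclosure specifies. The names `train`, `image_sample`, and `bev_sample` are hypothetical.

```python
from typing import Tuple

Sample = Tuple[float, float]  # (input, ground truth)


def squared_error(w: float, x: float, y: float) -> float:
    """Loss for one sample under the toy model prediction w * x."""
    return (w * x - y) ** 2


def squared_error_grad(w: float, x: float, y: float) -> float:
    """Derivative of (w * x - y) ** 2 with respect to w."""
    return 2.0 * x * (w * x - y)


def train(w: float, image_sample: Sample, bev_sample: Sample,
          lr: float = 0.05, steps: int = 200) -> float:
    """Gradient descent on the sum of a first (image) and second (BEV) loss."""
    for _ in range(steps):
        # Back propagate both losses: the gradient of a sum is the sum of gradients.
        grad = (squared_error_grad(w, *image_sample)
                + squared_error_grad(w, *bev_sample))
        w -= lr * grad
    return w


image_sample = (1.0, 2.0)  # stands in for the image feature label comparison
bev_sample = (2.0, 4.0)    # stands in for the segmented BEV map comparison

w = train(0.0, image_sample, bev_sample)
total_loss = squared_error(w, *image_sample) + squared_error(w, *bev_sample)
```

Because both toy samples are consistent with `w = 2`, the shared parameter converges toward that value and both losses shrink together, which mirrors the idea of one model being trained by more than one supervisory signal.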

[0104] In some examples, the processes described herein (e.g., process 800 and/or other process described herein) may be performed by a computing device or apparatus (e.g., SOC 105). In another example, the process 800 may be performed by the vehicle 100 of FIG. 1A.

[0105] FIG. 9 is a diagram illustrating an example of a system for implementing certain aspects of the present technology. In particular, FIG. 9 illustrates an example of computing system 900, which may be for example any computing device making up internal computing system, a remote computing system, a camera, or any component thereof in which the components of the system are in communication with each other using connection 905. Connection 905 may be a physical connection using a bus, or a direct connection into processor 910, such as in a chipset architecture. Connection 905 may also be a virtual connection, networked connection, or logical connection.

[0106] In some aspects, computing system 900 is a distributed system in which the functions described in this disclosure may be distributed within a datacenter, multiple data centers, a peer network, etc. In some aspects, one or more of the described system components represents many such components each performing some or all of the function for which the component is described. In some aspects, the components may be physical or virtual devices.

[0107] Example system 900 includes at least one processing unit (CPU or processor) 910 and connection 905 that communicatively couples various system components including system memory 915, such as read-only memory (ROM) 920 and random access memory (RAM) 925 to processor 910. Computing system 900 may include a cache 912 of high-speed memory connected directly with, in close proximity to, or integrated as part of processor 910.

[0108] Processor 910 may include any general purpose processor and a hardware service or software service, such as services 932, 934, and 936 stored in storage device 930, configured to control processor 910 as well as a special-purpose processor where software instructions are incorporated into the actual processor design. Processor 910 may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc. A multi-core processor may be symmetric or asymmetric.

[0109] To enable user interaction, computing system 900 includes an input device 945, which may represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech, etc. Computing system 900 may also include output device 935, which may be one or more of a number of output mechanisms. In some instances, multimodal systems may enable a user to provide multiple types of input/output to communicate with computing system 900.

[0110] Computing system 900 may include communications interface 940, which may generally govern and manage the user input and system output. The communication interface may perform or facilitate receipt and/or transmission of wired or wireless communications using wired and/or wireless transceivers, including those making use of an audio jack/plug, a microphone jack/plug, a universal serial bus (USB) port/plug, an Apple™ Lightning™ port/plug, an Ethernet port/plug, a fiber optic port/plug, a proprietary wired port/plug, 3G, 4G, 5G and/or other cellular data network wireless signal transfer, a Bluetooth™ wireless signal transfer, a Bluetooth™ low energy (BLE) wireless signal transfer, an IBEACON™ wireless signal transfer, a radio-frequency identification (RFID) wireless signal transfer, near-field communications (NFC) wireless signal transfer, dedicated short range communication (DSRC) wireless signal transfer, 802.11 Wi-Fi wireless signal transfer, wireless local area network (WLAN) signal transfer, Visible Light Communication (VLC), Worldwide Interoperability for Microwave Access (WiMAX), Infrared (IR) communication wireless signal transfer, Public Switched Telephone Network (PSTN) signal transfer, Integrated Services Digital Network (ISDN) signal transfer, ad-hoc network signal transfer, radio wave signal transfer, microwave signal transfer, infrared signal transfer, visible light signal transfer, ultraviolet light signal transfer, wireless signal transfer along the electromagnetic spectrum, or some combination thereof. The communications interface 940 may also include one or more Global Navigation Satellite System (GNSS) receivers or transceivers that are used to determine a location of the computing system 900 based on receipt of one or more signals from one or more satellites associated with one or more GNSS systems. GNSS systems include, but are not limited to, the US-based Global Positioning System (GPS), the Russia-based Global Navigation Satellite System (GLONASS), the China-based BeiDou Navigation Satellite System (BDS), and the Europe-based Galileo GNSS. There is no restriction on operating on any particular hardware arrangement, and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.

[0111] Storage device 930 may be a non-volatile and/or non-transitory and/or computer-readable memory device and may be a hard disk or other types of computer readable media which may store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, a floppy disk, a flexible disk, a hard disk, magnetic tape, a magnetic strip/stripe, any other magnetic storage medium, flash memory, memristor memory, any other solid-state memory, a compact disc read only memory (CD-ROM) optical disc, a rewritable compact disc (CD) optical disc, digital video disk (DVD) optical disc, a blu-ray disc (BDD) optical disc, a holographic optical disk, another optical medium, a secure digital (SD) card, a micro secure digital (microSD) card, a Memory Stick® card, a smartcard chip, an EMV chip, a subscriber identity module (SIM) card, a mini/micro/nano/pico SIM card, another integrated circuit (IC) chip/card, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash EPROM (FLASHEPROM), cache memory (e.g., Level 1 (L1) cache, Level 2 (L2) cache, Level 3 (L3) cache, Level 4 (L4) cache, Level 5 (L5) cache, or other (L#) cache), resistive random-access memory (RRAM/ReRAM), phase change memory (PCM), spin transfer torque RAM (STT-RAM), another memory chip or cartridge, and/or a combination thereof.

[0112] The storage device 930 may include software services, servers, services, etc., that when the code that defines such software is executed by the processor 910, it causes the system to perform a function. In some aspects, a hardware service that performs a particular function may include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as processor 910, connection 905, output device 935, etc., to carry out the function. The term “computer-readable medium” includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instruction(s) and/or data. A computer-readable medium may include a non-transitory medium in which data may be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash memory, memory or memory devices. A computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like.

[0113] Specific details are provided in the description above to provide a thorough understanding of the aspects and examples provided herein, but those skilled in the art will recognize that the application is not limited thereto. Thus, while illustrative aspects of the application have been described in detail herein, it is to be understood that the inventive concepts may be otherwise variously embodied and employed, and that the appended claims are intended to be construed to include such variations, except as limited by the prior art. Various features and aspects of the above-described application may be used individually or jointly. Further, aspects may be utilized in any number of environments and applications beyond those described herein without departing from the broader scope of the specification. The specification and drawings are, accordingly, to be regarded as illustrative rather than restrictive. For the purposes of illustration, methods were described in a particular order. It should be appreciated that in alternate aspects, the methods may be performed in a different order than that described.

[0114] For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software. Additional components may be used other than those shown in the figures and/or described herein. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the aspects in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the aspects.

[0115] Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

[0116] Individual aspects may be described above as a process or method which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations may be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination may correspond to a return of the function to the calling function or the main function.

[0117] Processes and methods according to the above-described examples may be implemented using computer-executable instructions that are stored or otherwise available from computer-readable media. Such instructions may include, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or a processing device to perform a certain function or group of functions. Portions of computer resources used may be accessible over a network. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code. Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to described examples include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on.

[0118] In some aspects the computer-readable storage devices, mediums, and memories may include a cable or wireless signal containing a bitstream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.

[0119] Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof, in some cases depending in part on the particular application, in part on the desired design, in part on the corresponding technology, etc.

[0120] The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed using hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and may take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, the program code or code segments to perform the necessary tasks (e.g., a computer-program product) may be stored in a computer-readable or machine-readable medium. A processor(s) may perform the necessary tasks. Examples of form factors include laptops, smart phones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. Functionality described herein also may be embodied in peripherals or add-in cards. Such functionality may also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.

[0121] The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are example means for providing the functions described in the disclosure.

[0122] The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices such as general purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses including application in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer-readable data storage medium comprising program code including instructions that, when executed, perform one or more of the methods, algorithms, and/or operations described above. The computer-readable data storage medium may form part of a computer program product, which may include packaging materials. The computer-readable medium may comprise memory or data storage media, such as random access memory (RAM) such as synchronous dynamic random access memory (SDRAM), read-only memory (ROM), nonvolatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, and the like. The techniques additionally, or alternatively, may be realized at least in part by a computer-readable communication medium that carries or communicates program code in the form of instructions or data structures and that may be accessed, read, and/or executed by a computer, such as propagated signals or waves.

[0123] The program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Such a processor may be configured to perform any of the techniques described in this disclosure. A general-purpose processor may be a microprocessor; but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or apparatus suitable for implementation of the techniques described herein.

[0124] One of ordinary skill will appreciate that the less than (“<”) and greater than (“>”) symbols or terminology used herein may be replaced with less than or equal to (“≤”) and greater than or equal to (“≥”) symbols, respectively, without departing from the scope of this description.

[0125] Where components are described as being “configured to” perform certain operations, such configuration may be accomplished, for example, by designing electronic circuits or other hardware to perform the operation, by programming programmable electronic circuits (e.g., microprocessors, or other suitable electronic circuits) to perform the operation, or any combination thereof.

[0126] The phrase “coupled to” or “communicatively coupled to” refers to any component that is physically connected to another component either directly or indirectly, and/or any component that is in communication with another component (e.g., connected to the other component over a wired or wireless connection, and/or other suitable communication interface) either directly or indirectly.

[0127] Claim language or other language reciting “at least one of” a set and/or “one or more” of a set indicates that one member of the set or multiple members of the set (in any combination) satisfy the claim. For example, claim language reciting “at least one of A and B” or “at least one of A or B” means A, B, or A and B. In another example, claim language reciting “at least one of A, B, and C” or “at least one of A, B, or C” means A, B, C, or A and B, or A and C, or B and C, A and B and C, or any duplicate information or data (e.g., A and A, B and B, C and C, A and A and B, and so on), or any other ordering, duplication, or combination of A, B, and C. The language “at least one of” a set and/or “one or more” of a set does not limit the set to the items listed in the set. For example, claim language reciting “at least one of A and B” or “at least one of A or B” may mean A, B, or A and B, and may additionally include items not listed in the set of A and B.

[0128] Claim language or other language reciting “at least one processor configured to,” “at least one processor being configured to,” “one or more processors configured to,” “one or more processors being configured to,” or the like indicates that one processor or multiple processors (in any combination) can perform the associated operation(s). For example, claim language reciting “at least one processor configured to: X, Y, and Z” means a single processor can be used to perform operations X, Y, and Z; or that multiple processors are each tasked with a certain subset of operations X, Y, and Z such that together the multiple processors perform X, Y, and Z; or that a group of multiple processors work together to perform operations X, Y, and Z. In another example, claim language reciting “at least one processor configured to: X, Y, and Z” can mean that any single processor may only perform at least a subset of operations X, Y, and Z.

[0129] Where reference is made to one or more elements performing functions (e.g., steps of a method), one element may perform all functions, or more than one element may collectively perform the functions. When more than one element collectively performs the functions, each function need not be performed by each of those elements (e.g., different functions may be performed by different elements) and/or each function need not be performed in whole by only one element (e.g., different elements may perform different sub-functions of a function). Similarly, where reference is made to one or more elements configured to cause another element (e.g., an apparatus) to perform functions, one element may be configured to cause the other element to perform all functions, or more than one element may collectively be configured to cause the other element to perform the functions.

[0130] Where reference is made to an entity (e.g., any entity or device described herein) performing functions or being configured to perform functions (e.g., steps of a method), the entity may be configured to cause one or more elements (individually or collectively) to perform the functions. The one or more components of the entity may include at least one memory, at least one processor, at least one communication interface, another component configured to perform one or more (or all) of the functions, and/or any combination thereof. Where reference is made to the entity performing functions, the entity may be configured to cause one component to perform all functions, or to cause more than one component to collectively perform the functions. When the entity is configured to cause more than one component to collectively perform the functions, each function need not be performed by each of those components (e.g., different functions may be performed by different components) and/or each function need not be performed in whole by only one component (e.g., different components may perform different sub-functions of a function).

[0131] Illustrative aspects of the disclosure include:

[0132] Aspect 1. An apparatus for training machine learning models, comprising: at least one memory comprising instructions; and at least one processor coupled to the at least one memory and configured to: obtain one or more images of an environment; generate, using a first machine learning model, a first set of features for one or more objects in the one or more images; predict, using a second machine learning model, image feature labels for the first set of features; compare the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; perform a perspective transform on the first set of features to generate a birds eye view (BEV) projected image features; combine the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; generate, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; compare the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and train the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

[0133] Aspect 2. The apparatus of claim 1, wherein the at least one processor is further configured to: obtain the first three-dimensional representation of the environment; generate a second set of features for one or more objects in the first three-dimensional representation; and flatten the second set of features from three dimensions to two dimensions to generate the first set of flattened features.

[0134] Aspect 3. The apparatus of any of claims 1-2, wherein the first three-dimensional representation of the environment comprises one of a radar point cloud or a lidar point cloud.

[0135] Aspect 4. The apparatus of any of claims 1-3, wherein the first three-dimensional representation of the environment comprises a lidar point cloud and wherein the at least one processor is further configured to: predict a predicted first set of flattened features based on the BEV projected image features; compare the predicted first set of flattened features to the first set of flattened features to evaluate a third loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, and the evaluated third loss function.

[0136] Aspect 5. The apparatus of claim 4, wherein the at least one processor is further configured to: obtain a radar point cloud representation of the environment; generate a third set of features for one or more objects in the radar point cloud representation; flatten the third set of features from three dimensions to two dimensions to generate a second set of flattened features; and generate the segmented BEV map of the environment based on the combined BEV projected image features, the first set of flattened features, and the second set of flattened features.

[0137] Aspect 6. The apparatus of claim 5, wherein the at least one processor is further configured to: predict a predicted second set of flattened features based on the BEV projected image features; compare the predicted second set of flattened features to the second set of flattened features to evaluate a fourth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, and the evaluated fourth loss function.

[0138] Aspect 7. The apparatus of claim 6, wherein the at least one processor is further configured to: perform an additional perspective transform on the predicted image feature labels to generate a predicted segmented BEV map; compare the predicted segmented BEV map to the generated segmented BEV map to evaluate a fifth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

[0139] Aspect 8. The apparatus of claim 6, wherein the at least one processor is further configured to: perform an additional perspective transform on the generated segmented BEV map to generate an additional predicted image feature labels; compare the predicted image feature labels to the additional predicted image feature labels to evaluate a fifth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

[0140] Aspect 9. The apparatus of any of claims 1-8, wherein the at least one processor is configured to: predict, using a second machine learning model that is separate from the first machine learning model, feature labels for the ground truth image feature labels.

[0141] Aspect 10. The apparatus of any of claims 1-9, wherein the at least one processor is configured to train the first machine learning model at least in part by back propagating the evaluated first loss function and the evaluated second loss function.

[0142] Aspect 11. A method for training machine learning models, comprising: obtaining one or more images of an environment; generating, using a first machine learning model, a first set of features for one or more objects in the one or more images; predicting, using a second machine learning model, image feature labels for the first set of features; comparing the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; performing a perspective transform on the first set of features to generate a birds eye view (BEV) projected image features; combining the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; generating, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; comparing the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and training the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

[0143] Aspect 12. The method of claim 11, further comprising: obtaining the first three-dimensional representation of the environment; generating a second set of features for one or more objects in the first three-dimensional representation; and flattening the second set of features from three dimensions to two dimensions to generate the first set of flattened features.
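One possible way to flatten features from three dimensions to two dimensions, as recited in Aspect 12, is to collapse the height axis of a voxel grid. The sketch below assumes a `[Z][Y][X]` nested-list layout and a maximum over height; both the layout and the reduction are assumptions chosen for illustration, and `flatten_height` is a hypothetical name.

```python
def flatten_height(voxels):
    """Collapse a [Z][Y][X] voxel feature grid to a [Y][X] BEV grid.

    Each output cell holds the maximum feature value found along the
    height (Z) axis. Max pooling is an assumed reduction; a sum or a
    learned projection could be used instead.
    """
    depth = len(voxels)
    rows = len(voxels[0])
    cols = len(voxels[0][0])
    return [
        [max(voxels[z][y][x] for z in range(depth)) for x in range(cols)]
        for y in range(rows)
    ]
```

For example, a grid with two height slices `[[1, 0], [0, 2]]` and `[[0, 3], [4, 1]]` flattens to `[[1, 3], [4, 2]]`.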

[0144] Aspect 13. The method of any of claims 11-12, wherein the first three-dimensional representation of the environment comprises one of a radar point cloud or a lidar point cloud.

[0145] Aspect 14. The method of any of claims 11-13, wherein the first three-dimensional representation of the environment comprises a lidar point cloud and further comprising: predicting a predicted first set of flattened features based on the BEV projected image features; comparing the predicted first set of flattened features to the first set of flattened features to evaluate a third loss function; and training the first machine learning model based on the evaluated first loss function, the evaluated second loss function, and the evaluated third loss function.
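The comparison of Aspect 14 between the predicted first set of flattened features and the first set of flattened features can be illustrated, without limitation, as follows. The sketch assumes a mean squared error as the third loss function and an optionally weighted sum for combining the losses; neither is specified above, and `mean_squared_error` and `weighted_total` are hypothetical names.

```python
def mean_squared_error(predicted, target):
    """Mean squared difference between two equally sized 2D feature grids.

    Using a mean squared error as the third loss function is an
    assumption made for illustration.
    """
    diffs = [
        (p - t) ** 2
        for pred_row, target_row in zip(predicted, target)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(diffs) / len(diffs)


def weighted_total(losses, weights=None):
    """Weighted sum of any number of evaluated loss values.

    Equal weights are used when none are given; the weighting scheme is
    hypothetical.
    """
    if weights is None:
        weights = [1.0] * len(losses)
    if len(weights) != len(losses):
        raise ValueError("one weight is required per loss")
    return sum(w * l for w, l in zip(weights, losses))
```

The same helper extends to the fourth and fifth loss functions of the later aspects by appending their evaluated values to `losses`.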

[0146] Aspect 15. The method of claim 14, further comprising: obtaining a radar point cloud representation of the environment; generating a third set of features for one or more objects in the radar point cloud representation; flattening the third set of features from three dimensions to two dimensions to generate a second set of flattened features; and generating the segmented BEV map of the environment based on the combined BEV projected image features, the first set of flattened features, and the second set of flattened features.

[0147] Aspect 16. The method of claim 15, further comprising: predicting a predicted second set of flattened features based on the BEV projected image features; comparing the predicted second set of flattened features to the second set of flattened features to evaluate a fourth loss function; and training the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, and the evaluated fourth loss function.

[0148] Aspect 17. The method of claim 16, further comprising: performing an additional perspective transform on the predicted image feature labels to generate a predicted segmented BEV map; comparing the predicted segmented BEV map to the generated segmented BEV map to evaluate a fifth loss function; and training the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.
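As a non-limiting illustration of the additional perspective transform and comparison of Aspect 17, the sketch below maps labeled image pixels into BEV grid cells with a 3x3 homography and measures how often the projected labels disagree with a segmented BEV map. The use of a planar homography, the sparse dictionary layout, and the disagreement rate are all assumptions; the disagreement rate is not differentiable and merely stands in for whatever fifth loss function is used in practice. All function names are hypothetical.

```python
import math


def apply_homography(H, u, v):
    """Map image pixel (u, v) to BEV coordinates (x, y) with a 3x3 matrix H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w  # divide by the homogeneous coordinate


def project_labels_to_bev(pixel_labels, H, rows, cols):
    """Project {(u, v): class_id} into a sparse {(row, col): class_id} BEV map.

    Pixels that land outside the rows x cols grid are dropped.
    """
    bev = {}
    for (u, v), label in pixel_labels.items():
        x, y = apply_homography(H, u, v)
        r, c = math.floor(y), math.floor(x)
        if 0 <= r < rows and 0 <= c < cols:
            bev[(r, c)] = label
    return bev


def disagreement(projected, bev_map):
    """Fraction of projected cells whose label differs from bev_map[row][col]."""
    if not projected:
        return 0.0
    mismatches = sum(
        1 for (r, c), label in projected.items() if bev_map[r][c] != label
    )
    return mismatches / len(projected)
```

A value of 0.0 indicates that every projected image label agrees with the generated segmented BEV map, and larger values indicate less consistency between the two views.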

[0149] Aspect 18. The method of claim 16, further comprising: performing an additional perspective transform on the generated segmented BEV map to generate additional predicted image feature labels; comparing the predicted image feature labels to the additional predicted image feature labels to evaluate a fifth loss function; and training the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

[0150] Aspect 19. The method of any of claims 11-18, further comprising: predicting, using a second machine learning model that is separate from the first machine learning model, feature labels for the ground truth image feature labels.

[0151] Aspect 20. The method of any of claims 11-19, further comprising training the first machine learning model at least in part by back propagating the evaluated first loss function and the evaluated second loss function.

[0152] Aspect 21. A non-transitory computer-readable medium having stored thereon instructions that, when executed by at least one processor, cause the at least one processor to: obtain one or more images of an environment; generate, using a first machine learning model, a first set of features for one or more objects in the one or more images; predict, using a second machine learning model, image feature labels for the first set of features; compare the predicted image feature labels for the first set of features to ground truth image feature labels corresponding to the one or more images to evaluate a first loss function; perform a perspective transform on the first set of features to generate birds eye view (BEV) projected image features; combine the BEV projected image features and a first set of flattened features to generate combined image features, wherein the first set of flattened features are generated based on a first three-dimensional representation of the environment; generate, using the first machine learning model, a segmented BEV map of the environment based on the combined image features; compare the segmented BEV map to a ground truth segmented BEV map to evaluate a second loss function; and train the first machine learning model for generation of one or more segmented BEV maps based on the evaluated first loss function and the evaluated second loss function.

[0153] Aspect 22. The non-transitory computer-readable medium of claim 21, wherein the instructions cause the at least one processor to: obtain the first three-dimensional representation of the environment; generate a second set of features for one or more objects in the first three-dimensional representation; and flatten the second set of features from three dimensions to two dimensions to generate the first set of flattened features.

[0154] Aspect 23. The non-transitory computer-readable medium of any of claims 21-22, wherein the first three-dimensional representation of the environment comprises one of a radar point cloud or a lidar point cloud.

[0155] Aspect 24. The non-transitory computer-readable medium of any of claims 21-23, wherein the first three-dimensional representation of the environment comprises a lidar point cloud and wherein the instructions cause the at least one processor to: predict a predicted first set of flattened features based on the BEV projected image features; compare the predicted first set of flattened features to the first set of flattened features to evaluate a third loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, and the evaluated third loss function.

[0156] Aspect 25. The non-transitory computer-readable medium of claim 24, wherein the instructions cause the at least one processor to: obtain a radar point cloud representation of the environment; generate a third set of features for one or more objects in the radar point cloud representation; flatten the third set of features from three dimensions to two dimensions to generate a second set of flattened features; and generate the segmented BEV map of the environment based on the combined BEV projected image features, the first set of flattened features, and the second set of flattened features.

[0157] Aspect 26. The non-transitory computer-readable medium of claim 25, wherein the instructions cause the at least one processor to: predict a predicted second set of flattened features based on the BEV projected image features; compare the predicted second set of flattened features to the second set of flattened features to evaluate a fourth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, and the evaluated fourth loss function.

[0158] Aspect 27. The non-transitory computer-readable medium of claim 26, wherein the instructions cause the at least one processor to: perform an additional perspective transform on the predicted image feature labels to generate a predicted segmented BEV map; compare the predicted segmented BEV map to the generated segmented BEV map to evaluate a fifth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

[0159] Aspect 28. The non-transitory computer-readable medium of claim 26, wherein the instructions cause the at least one processor to: perform an additional perspective transform on the generated segmented BEV map to generate additional predicted image feature labels; compare the predicted image feature labels to the additional predicted image feature labels to evaluate a fifth loss function; and train the first machine learning model based on the evaluated first loss function, the evaluated second loss function, the evaluated third loss function, the evaluated fourth loss function, and the evaluated fifth loss function.

[0160] Aspect 29. The non-transitory computer-readable medium of any of claims 21-28, wherein the instructions cause the at least one processor to: predict, using a second machine learning model that is separate from the first machine learning model, feature labels for the ground truth image feature labels.

[0161] Aspect 30. The non-transitory computer-readable medium of any of claims 21-29, wherein the instructions cause the at least one processor to train the first machine learning model at least in part by back propagating the evaluated first loss function and the evaluated second loss function.

[0162] Aspect 31. An apparatus for training machine learning models, the apparatus comprising one or more means for performing operations according to any of claims 11 to 20.