Title:
SYSTEM FOR OBTAINING A PREDICTION OF AN ACTION OF A VEHICLE AND CORRESPONDING METHOD
Document Type and Number:
WIPO Patent Application WO/2020/174327
Kind Code:
A1
Abstract:
A system (10; 20) for obtaining a prediction of an action (at) of a vehicle (V), including a camera for acquiring a sequence of images (Ft) of the scene dynamic seen by the vehicle (V), in particular in front of the vehicle (V), a convolutional neural network visual encoder (50) configured to obtain for each acquired image (Ft) in said sequence of images (Ft) of the scene dynamic seen by the vehicle (V) at each time step (t) a corresponding visual features vector (vt), one or more sensors (40) configured to obtain a position of the vehicle (st) at the same time step (t), a Recurrent Neural Network, in particular a LSTM, network (65; 70) configured to receive said visual features vector (vt) and position of the vehicle (st) at said time step (t) and to generate a prediction of the action (at) of the vehicle (V). The system (20) is configured to receive as input a set of control commands (C) representing maneuvers of the vehicle (V), said Recurrent Neural Network (70) comprising a plurality of Recurrent Neural Network branches (701, 702, 703, 704) each corresponding to a control command (ci) in said set of control commands (C), said system (20) comprising a command conditioned switch (60) configured upon reception of a control command (ci) to select the corresponding branch (701, 702, 703, 704) of said Recurrent Neural Network (70), said system (20) being then configured to operate said selected corresponding branch (701, 702, 703, 704) to process said visual features vector (vt) and position of the vehicle (st) at said time step (t) to obtain said prediction of the action (at) of the vehicle (V).

Inventors:
DI STEFANO ERIKA (IT)
FURLAN AXEL (IT)
FONTANA DAVIDE (IT)
CHERNUKHA IVAN (IT)
SANGINETO ENVER (IT)
SEBE NICULAE (IT)
Application Number:
PCT/IB2020/051422
Publication Date:
September 03, 2020
Filing Date:
February 20, 2020
Assignee:
MARELLI EUROPE SPA (IT)
International Classes:
B60W50/00; G05D1/02
Other References:
LU CHI ET AL: "Deep Steering: Learning End-to-End Driving Model from Spatial and Temporal Visual Cues", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 August 2017 (2017-08-12), XP080952610
PAUL DREWS ET AL: "Aggressive Deep Driving: Model Predictive Control with a CNN Cost Model", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 17 July 2017 (2017-07-17), XP080777382
PAUL DREWS ET AL: "Vision-Based High Speed Driving with a Deep Dynamic Observer", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 5 December 2018 (2018-12-05), XP080989484
Attorney, Agent or Firm:
CROVINI, Giorgio (IT)
Claims:
CLAIMS

1. A system (10; 20) for obtaining a prediction of an action (at) of a vehicle (V), including

a camera for acquiring a sequence of images (Ft) of the scene dynamic seen by the vehicle (V), in particular in front of the vehicle (V),

a convolutional neural network visual encoder (50) configured to obtain for each acquired image (Ft) in said sequence of images (Ft) of the scene dynamic seen by the vehicle (V) at each time step (t) a corresponding visual features vector (vt),

one or more sensors (40) configured to obtain a position of the vehicle (st) at the same time step (t),

a Recurrent Neural Network, in particular a LSTM, network (65; 70) configured to receive said visual features vector (vt) and position of the vehicle (st) at said time step (t) and to generate a prediction of the action (at) of the vehicle (V) taking into account the previous hidden state (ht-1),

characterized in that

the system (20) is configured to receive as input a set of control commands (C) representing maneuvers of the vehicle (V),

said Recurrent Neural Network (70) comprising a plurality of Recurrent Neural Network branches (701, 702, 703, 704) each corresponding to a control command (ci) in said set of control commands (C),

said system (20) comprising a command conditioned switch (60) configured upon reception of a control command (ci) to select the corresponding branch (701, 702, 703, 704) of said Recurrent Neural Network (70),

said system (20) being then configured to operate said selected corresponding branch (701, 702, 703, 704) to process said visual features vector (vt) and position of the vehicle (st) at said time step (t) to obtain said prediction of the action (at) of the vehicle (V).

2. The system according to claim 1, characterized in that said Recurrent Neural Network includes a LSTM network.

3. The system according to claim 1, characterized in that said convolutional neural network visual encoder (50) is a dilated fully convolutional neural network visual encoder (50).

4. The system according to claim 1, characterized in that said system (20) is configured to operate said corresponding branch (701, 702, 703, 704) to obtain said prediction of the action (at) of the vehicle (V) as a map of said acquired image (Ft), position of the vehicle (st) and control command (ct) at a same given time step (t).

5. The system according to claim 1, characterized in that said maneuvers are included in a navigation path of the vehicle, in particular provided by a navigation system.

6. The system according to claim 1, characterized in that said action includes a steering angle and a vehicle speed.

7. The system according to any of the previous claims, characterized in that it is included in a system for the autonomous driving of the vehicle.

8. A method for predicting an action (at) of a vehicle, including

acquiring a sequence of images (Ft) of the scene dynamic seen by the vehicle (V), in particular in front of the vehicle (V),

obtaining at each time step (t) a visual features vector (vt) by applying a convolutional neural network, in particular dilated fully convolutional neural network, visual encoder (50) to a corresponding acquired image (Ft),

obtaining (40) a position of the vehicle (st) at the same time step (t),

supplying said visual features vector (vt) and position of the vehicle (st) at said time step (t) to a Recurrent Neural Network, in particular a LSTM network (65; 70),

characterized in that it includes

determining a set of control commands (C) representing maneuvers of the vehicle (V),

providing a plurality of branches (701, 702, 703, 704) of said Recurrent Neural Network (70) each corresponding to a control command (ci) in said set of control commands (C),

when a control command (ci) is issued, selecting the corresponding branch (701, 702, 703, 704) of said Recurrent Neural Network (70) and supplying said visual features vector (vt) and position of the vehicle (st) at said time step (t) to said corresponding branch (701, 702, 703, 704),

operating said corresponding branch (701, 702, 703, 704) to obtain said prediction of the action (at) of the vehicle.

9. The method according to claim 8, characterized in that said operating said corresponding branch (701, 702, 703, 704) obtains said prediction of the action (at) of the vehicle (V) as a map of said acquired image (Ft), position of the vehicle (st) and control command (ct) at a same given time step (t).

10. The method according to claim 8, characterized in that said maneuvers are included in a navigation path of the vehicle.

11. The method according to claim 8, characterized in that said action includes a steering angle and a vehicle speed.

Description:
"System for obtaining a prediction of an action of a vehicle and corresponding method"

★★★

TEXT OF THE DESCRIPTION

Technical field

The present description relates to techniques for obtaining a prediction of an action of a vehicle, in particular a road vehicle, including

a camera for acquiring a sequence of images of the scene dynamic seen by the vehicle, in particular in front of the vehicle,

a convolutional neural network visual encoder configured to obtain for each acquired image in said sequence of images of the scene dynamic seen by the vehicle at each time step a corresponding visual features vector,

one or more sensors configured to obtain a position of the vehicle (st) at the same time step,

a Recurrent Neural Network, in particular a LSTM, network configured to receive said visual features vector and position of the vehicle at said time step and to generate a prediction of the action of the vehicle.

Description of the prior art

Most of the deep-learning based autonomous driving methods can be categorized into two major paradigms: mediated perception approaches and behaviour reflex (or end-to-end) methods. The former are composed of different, distinct recognition components such as pedestrian detectors, lane segmentation, traffic light/sign detectors, etc. The corresponding detection outcomes are then combined into an intermediate overall scene representation, which is the knowledge input to a (typically rule-based) decision maker system in order to plan the vehicle's next actions. On the other hand, the behaviour reflex approach is an emerging paradigm consisting in training a deep network in order to directly map raw sensor data into the vehicle's action decisions. A network which takes raw sensor data (e.g., images) as input and outputs the vehicle's actions is also denoted as end-to-end trainable. Modern behaviour reflex approaches use Convolutional Neural Networks (CNNs) to extract visual information from the frames captured by the vehicle's on-board camera, for instance using a simple CNN trained for a regression task: the output neuron predicts the steering angle. One problem with this 'CNN-only' architecture is that every decision depends only on the current frame. There is no "memory" about the observed dynamics of the scene because past frames are not represented at all.

It is known, for instance from the publication of Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell, "End-to-end learning of driving models from large-scale video datasets", in CVPR, pages 3530-3538, 2017, to introduce a dynamics representation into the network using an LSTM (Long Short Term Memory) network.

Specifically, as shown in figure 1, the system 10, which is indeed substantially a neural network, is composed of two main sub-networks. The first is a dilated fully convolutional neural network visual encoder, in short FCN, 50 representing (static) visual information extracted from each frame independently of the other frames. Given a frame Ft input at time step t, the FCN 50 represents the frame Ft using a feature vector vt. Specifically, the feature vector vt corresponds to the vector of neuron activations of the last layer of the FCN 50. Then, the feature vector vt is concatenated with a current vehicle position st (represented using a 2-dimensional vector) and input to a LSTM network 65. This second sub-network predicts the most likely action at taking into account its previous hidden state value.
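
As an illustration only, this prior-art pipeline of figure 1 can be sketched as follows in PyTorch; the feature dimension, hidden size and two-head output used here are assumptions made to keep the example self-contained and are not taken from the cited publication.

```python
# Minimal sketch of the FCN + LSTM baseline of Fig. 1 (sizes are illustrative).
import torch
import torch.nn as nn

class BaselineDrivingLSTM(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=64, num_bins=181):
        super().__init__()
        # Single LSTM over the concatenation of visual features and 2-D position.
        self.lstm = nn.LSTM(feat_dim + 2, hidden_dim, batch_first=True)
        # Two heads: discretized steering angle and speed (see Eq. 1 and 2 below).
        self.angle_head = nn.Linear(hidden_dim, num_bins)
        self.speed_head = nn.Linear(hidden_dim, num_bins)

    def forward(self, v_t, s_t, hidden=None):
        # v_t: (batch, T, feat_dim) visual features, s_t: (batch, T, 2) positions.
        x = torch.cat([v_t, s_t], dim=-1)
        out, hidden = self.lstm(x, hidden)   # "hidden" carries the previous state h_{t-1}
        return self.angle_head(out), self.speed_head(out), hidden
```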

A continuous prediction can be formulated using a regression task and a mean squared error loss. However, it has been widely observed by different works that this loss performs poorly, for instance when the target distribution is multimodal. Therefore, in such solution the regression problem is cast into a classification task using discrete bins which represent the target value's range. In more detail, the possible range of values of the steering angle αt ([-90, 90] degrees) is discretized into N = 181 bins. Similarly, the possible range of values of the vehicle speed mt is discretized into N = 181 bins. Consequently, the number of output neurons of the network is 2N = 362, each neuron corresponding to a "bin-class", and the adopted loss function (see following Eq. 1 and 2) is a standard cross entropy between the predicted and the true class values. In Eq. 1, expressing the steering angle loss H(pa, qa), qa(x) is the network prediction and pa(x) is the training ground truth and, similarly, Eq. 2 refers to the speed loss H(pm, qm).

H(p_a, q_a) = -\sum p_a(F_t, s_t) \log q_a(F_t, s_t)    (1)

H(p_m, q_m) = -\sum p_m(F_t, s_t) \log q_m(F_t, s_t)    (2)

The final loss is an equally-weighted sum of two cross entropy losses.
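
A minimal sketch of the discretization described above is given below, assuming a uniform binning of the [-90, 90] degree steering range; the speed range used here is purely illustrative, since it is not specified in the text.

```python
# Map continuous targets to discrete "bin-classes" (N = 181 bins each,
# 2N = 362 output neurons in total), as described above.
import numpy as np

N_BINS = 181

def to_bin(value, low, high, n_bins=N_BINS):
    """Map a continuous value to a discrete bin index."""
    value = np.clip(value, low, high)
    return int(round((value - low) / (high - low) * (n_bins - 1)))

angle_class = to_bin(12.5, -90.0, 90.0)   # steering angle target class
speed_class = to_bin(8.3, 0.0, 30.0)      # speed target class (range assumed here)
```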

Such a solution, besides suffering from a poorly performing loss, is based only on the sensor data, and thus lacks a high-level control of the network behaviour.

Object and summary

An object of one or more embodiments is to overcome the limitations inherent in the solutions achievable from the prior art.

According to one or more embodiments, that object is achieved thanks to a method having the characteristics specified in claim 1. One or more embodiments may refer to a corresponding system.

The claims form an integral part of the technical teaching provided herein in relation to the various embodiments .

According to the solution described herein, the system is configured to receive as input a set of control commands representing maneuvers of the vehicle, the Recurrent Neural Network comprising a plurality of Recurrent Neural Network branches each corresponding to a control command in said set of control commands,

said system comprising a command conditioned switch configured upon reception of a control command to select the corresponding branch of said Recurrent Neural Network network,

said system being then configured to operate said selected corresponding branch to process said visual features vector and position of the vehicle at said time step to obtain said prediction of the action of the vehicle.

The solution described herein is also directed to a corresponding method for predicting an action of a vehicle .

Brief description of the drawings

The embodiments will now be described purely by way of a non-limiting example with reference to the annexed drawings, in which:

- Figure 1 has already been discussed in the foregoing;

- Figure 2 illustrates a context of application of the solution here described;

- Figure 3 represents a block schematic of the system here described;

- Figure 4 represents in more detail the system of figure 3.

Detailed description of embodiments

The ensuing description illustrates various specific details aimed at an in-depth understanding of the embodiments. The embodiments may be implemented without one or more of the specific details, or with other methods, components, materials, etc. In other cases, known structures, materials, or operations are not illustrated or described in detail so that various aspects of the embodiments will not be obscured.

Reference to "an embodiment" or "one embodiment" in the framework of the present description is meant to indicate that a particular configuration, structure, or characteristic described in relation to the embodiment is comprised in at least one embodiment. Likewise, phrases such as "in an embodiment" or "in one embodiment", that may be present in various points of the present description, do not necessarily refer to the one and the same embodiment. Furthermore, particular conformations, structures, or characteristics can be combined appropriately in one or more embodiments.

The references used herein are intended merely for convenience and hence do not define the sphere of protection or the scope of the embodiments.

In brief, here is described a system and method for obtaining a prediction of an action of a vehicle which is based on a deep-learning based method for autonomous driving based on a network trained "end-to-end". The solution here described jointly models the scene dynamics observed by the vehicle while moving and a command-conditioned decision strategy which takes into account high-level commands representing, for instance, the passenger's goal (i.e. the desired destination). Scene dynamics is modeled using a Recurrent Neural Network (specifically, using an LSTM). The solution here described however regards a system and method that, while modelling scene dynamics, also takes into account the passenger's goal, instead of having a resulting network which is a function of only the sensor data. The solution here described, providing a high-level control of the network behaviour, makes use of "commands", externally provided as an additional input to the network, in order to condition the network behaviour.

By way of example, with reference to figure 2 showing schematically a road map with streets, suppose a passenger of a vehicle V, in particular a road or land vehicle, desires to go from a point A to a point B. The navigator would produce the following sequence of commands corresponding to a path P: turn right (c4, as better detailed in the following) at the next intersection, turn left (c2) at the next intersection, follow the road (c1). The network function is now described by a function of sensors and commands and can be externally controlled. The commands, e.g. left, ask the network to plan a short-term policy, i.e. a vehicle maneuver corresponding to a sequence of actions, which is able to drive the vehicle till the next intersection and then turn left.
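
Purely as an illustration, the command vocabulary and the command sequence produced by the navigator for the path P could be represented as follows; the variable names are illustrative only.

```python
# Command set C = {c1, c2, c3, c4} and the sequence for path P
# (turn right, turn left, follow the road), as described above.
COMMANDS = {"c1": "continue", "c2": "left", "c3": "straight", "c4": "right"}
path_P = ["c4", "c2", "c1"]
```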

A system 20 for obtaining a prediction of an action at of a vehicle V, according to the solution here described, is shown in figure 3.

With 50 is indicated a dilated fully convolutional neural network visual encoder which receives image frames Ft representing the observed dynamics of the scene acquired by a front camera mounted on the vehicle V, not shown, at a time step t, and extracts corresponding visual representations vt, e.g. foreground pixels and background pixels, on the basis of such image frames Ft. As mentioned, the dilated FCN 50 is for instance a CNN extracted from the well known AlexNet, having replaced the last dense layers with 1 x 1 convolutional filters as discussed for instance in Xu et al. It is pre-trained on ImageNet and then fine-tuned together with the rest of the network. The dilated FCN 50 is configured to represent visual information vt extracted from each frame Ft independently of the other frames, i.e. static visual information. More specifically, given a frame Ft input at time step t, the dilated FCN 50 represents such frame Ft using a feature vector, or visual vector, vt obtained by the neuron activations of its last layer.

The dilated FCN 50 outputs the CNN features vector vt to a command conditioned switch block 60, which also receives a current position of the vehicle st. The current vehicle position st = (xt, yt), where xt, yt are the coordinates of the vehicle V at time step t, is obtained using egomotion sensors, represented by an egomotion sensor block 40. Ego-motion sensors are sensors that measure the vehicle motion with respect to an arbitrarily fixed reference frame (e.g., IMU sensors). The trajectory output by these sensors is synchronized with the camera frames in such a way as to get a position at each time step t.

Then the CNN features vt in the command block 60 are concatenated with the vehicle's current position st, represented using a two dimensional vector (xt, yt), into a joint representation (st, vt). A LSTM network 70 includes a plurality of LSTM branches 701, 702, 703, 704, in the example four. The command block 60, upon receiving a control command ct, is configured to switch to the LSTM branch among the plurality of LSTM branches 701, 702, 703, 704 corresponding to such control command ct. Specifically, the control command ct acts as a switch between branches 701, 702, 703, 704. During the "forward-pass", only one of the branches is activated, depending on the input command ct. As a consequence, only the sub-policy corresponding to the current command value is involved when the joint representation (st, vt) is processed. Preferably, the joint representation (st, vt) is input to each LSTM branch, but only the selected branch processes such input.

The control command ct is originated from a predefined set C of commands ci, with i index from 1 to |C|, which for instance can be C = {c1, c2, c3, c4}, where c1, c2, c3, c4 are respectively, in the example here shown, continue, left, straight, right. These control commands ct can be originated, for instance, by a navigator of the vehicle V.

The system 20 outputs, i.e. learns, a map function f(Ft, st, ct) → at, where at is a predicted vehicle's action at time step t, i.e. a map of predicted vehicle's actions as a function of an acquired image Ft, in particular represented by the corresponding CNN features vt, and of the vehicle's current position st and command ct.

Since continuous outputs are used, the predicted vehicle's action at is defined as a pair of steering angle and speed magnitude: at = (αt, mt), where αt is the steering angle in radians, and mt is the vehicle speed value.

It is underlined that control commands are used as input to the system 20 or network in order to choose a short-term policy, while actions are the instantaneous outputs of the system 20, i.e. they compose such short-term policy.

Each LSTM branch 70i predicts the most likely action at taking into account its previous hidden state value ht-1, which represents the visual dynamics. It is important to note that, although figure 3 shows one single hidden state ht-1, actually each LSTM in each branch computes its own hidden state ht-1.

In figure 4 part of the system 20 is shown in a more detailed manner.

Each input frame Ft is resized to a resolution of 360 x 640 pixels and represented with 3 RGB channels. The FCN 50 includes a first convolutional layer CV1 with 96 11x11 filters, then a MaxPool layer MP with 3x3 filters, a second convolutional layer CV2 27x27x256 with 5x5 filters at stride 1, a third convolutional layer CV3 with 384 3x3 filters at stride 1, pad 1, a fourth convolutional layer CV4 with 384 3x3 filters at stride 1, pad 1 [13x13x256], a fifth convolutional layer CV5 with 256 3x3 filters at stride 1, pad 1, then a sixth convolutional layer CV6 and a seventh convolutional layer CV7 with 4096 1x1 filters. Each LSTM branch 70i comprises two stacked LSTM layers, each including 64 neurons. A final output layer 71, in each branch, is composed of 362 neurons; in particular it is a FC Softmax layer 64x362. After the sixth convolutional layer CV6 and the seventh convolutional layer CV7 a Dropout layer with a dropout factor equal to 0.5 may be applied for regularization.
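
The layer structure described above can be sketched as follows; strides, paddings, the dilation factors and the way the 4096-dimensional vector vt is obtained from the last feature maps are not all specified in this passage, so the corresponding values below (e.g. the stride of CV1 and of the max-pooling layer, and the global average pooling) are assumptions made only to keep the example self-contained.

```python
# Sketch of the encoder and of one LSTM branch, following the layer list above.
import torch.nn as nn

class FCNEncoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),                  # CV1
            nn.MaxPool2d(kernel_size=3, stride=2),                                   # MP
            nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),       # CV2
            nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),      # CV3
            nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),      # CV4
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),      # CV5
            nn.Conv2d(256, 4096, kernel_size=1), nn.ReLU(), nn.Dropout(0.5),         # CV6
            nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(), nn.Dropout(0.5),        # CV7
        )

    def forward(self, frame):                 # frame: (batch, 3, 360, 640)
        maps = self.features(frame)
        return maps.mean(dim=(2, 3))          # pool to a 4096-d visual vector v_t (assumed)

class LSTMBranchSketch(nn.Module):
    def __init__(self, feat_dim=4096, hidden_dim=64, out_dim=362):
        super().__init__()
        # Two stacked 64-unit LSTM layers followed by the 362-neuron output layer 71;
        # the softmax is applied by the cross-entropy loss during training.
        self.lstm = nn.LSTM(feat_dim + 2, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, joint, hidden=None):    # joint: (batch, T, feat_dim + 2)
        seq, hidden = self.lstm(joint, hidden)
        return self.out(seq), hidden
```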

Thus, summing up, the system 20 for obtaining a prediction of an action at of a vehicle V, i.e. a road or land vehicle, just described, includes a camera for acquiring a sequence of images Ft of the scene dynamic seen by the vehicle V, which is in particular in front of the vehicle V, i.e. takes the images of what is in front of the vehicle V, a convolutional neural network visual encoder 50, preferably a dilated FCN, configured to obtain for each acquired image Ft at a different time t in said sequence of images Ft of the scene dynamic seen by the vehicle V at each time step t a corresponding visual features vector vt, for instance representing a classification of the pixels of the image according to classes such as foreground, background and others, one or more sensors 40, e.g. egomotion sensors obtained by exploiting the camera acquired image frames, configured to obtain a position of the vehicle st at the same time step, a Recurrent Neural Network, in particular a LSTM, network 70 configured to receive said visual features vector vt and position of the vehicle st at said time step t and to generate a prediction of the action at of the vehicle V taking into account the previous hidden state ht-1, wherein such system 20 is configured to receive as input a set of control commands C representing maneuvers of the vehicle V, in particular corresponding to a sequence of actions, the Recurrent Neural Network 70 comprising a plurality of Recurrent Neural Network branches 701, 702, 703, 704 each corresponding to a control command ci in said set of control commands C, the system 20 comprising a command conditioned switch 60 configured upon reception of a control command ci to select the corresponding branch 701, 702, 703, 704 of said Recurrent Neural Network 70, such system 20 being then configured to operate said selected corresponding branch 701, 702, 703, 704, selected by the switch 60, to obtain said prediction of the action at of the vehicle V by processing the input, i.e. said visual features vector vt and position of the vehicle st at said time step t, in particular as a map of the image acquired Ft, position of the vehicle st and control command ct at a same given time step t. As indicated, the system 20 preferably supplies said visual features vector vt and position of the vehicle st at said time step t to each corresponding branch 701, 702, 703, 704; then only the selected branch processes the input, this representing a simpler implementation than applying the visual features vector vt and position of the vehicle st input only to the selected branch.

The system 20 described is preferably included in an autonomous driving system, for instance to provide predictions of actions to be performed by the vehicle to follow a certain path P.

To better understand the performance of the system 20 of figures 3 and 4, the loss function associated with such system 20 is now briefly discussed.

In equation 3 below the loss function Loss(Ft, st, ct) of the system 20 is shown.
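
Consistently with Eq. 1 and 2 and with the indicator function described below, the loss of Eq. 3 can plausibly be written as follows; the per-branch superscript i is an assumption made here only to mark the predictions of branch 70i.

Loss(F_t, s_t, c_t) = \sum_{i=1}^{|C|} 1(c_i, c_t) \, H(p_a, q_a^i) + \sum_{i=1}^{|C|} 1(c_i, c_t) \, H(p_m, q_m^i)    (3)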

The first term is a sum of cross entropy losses of the steering angle and the second term is a sum of cross entropy losses for the speed m, over the number of commands |C|. As indicated, these are functions of the predictions qa, qm, while pa, pm are the training ground truth. The number of branches corresponds to the number of commands (|C|), thus to the number of loss components in each term. Each branch 70i is responsible for learning from examples corresponding to a command ci. Therefore, one backpropagation pass for a sample associated with a ground-truth command ct should contribute to back-propagate the error only in the branch 70i where ct = ci. In Eq. 3 this is represented by an indicator function 1(ci, ct), which is equal to 1 if and only if ct = ci. For efficiency reasons, the control commands ci are encoded as a one-hot vector.
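
As a sketch only, the branch-masked loss could be computed as follows in PyTorch, assuming that the predictions of all branches are stacked along a command dimension and that the commands are given as one-hot vectors, as mentioned above; the names and shapes are illustrative.

```python
# Command-masked cross-entropy loss in the spirit of Eq. 3: only the branch
# matching the ground-truth command back-propagates the error.
import torch
import torch.nn.functional as F

def command_conditioned_loss(angle_logits, speed_logits, angle_gt, speed_gt, command_onehot):
    # angle_logits, speed_logits: (batch, |C|, 181); angle_gt, speed_gt: (batch,) class indices;
    # command_onehot: (batch, |C|) with a single 1 marking the ground-truth command.
    loss = 0.0
    num_commands = command_onehot.shape[1]
    for i in range(num_commands):
        mask = command_onehot[:, i].float()                     # indicator 1(c_i, c_t)
        h_a = F.cross_entropy(angle_logits[:, i], angle_gt, reduction="none")
        h_m = F.cross_entropy(speed_logits[:, i], speed_gt, reduction="none")
        loss = loss + ((h_a + h_m) * mask).mean()               # only the selected branch contributes
    return loss
```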

Thus, the advantages of the method and system just disclosed are clear. The method and system described allow predicting the action, improving the LSTM operation by using a command conditioned network.

The known solutions which use FCN and LSTM are reflexive systems, where the LSTM supplies the steering angle or other parameters as a reaction to the output of the FCN encoder. The method and system described, by using the command conditioned network, where each LSTM is trained for a specific vehicle maneuver, are able to operate by taking into account the passenger's final destination (which is represented as input to the network as a sequence of commands, the latter provided by the vehicle's Navigator).

The solution here described in addition applies a dynamic input coming from a sequence of images to the command conditioned network of LSTM branches instead of a static input.

Of course, without prejudice to the principle of the embodiments, the details of construction and the embodiments may vary widely with respect to what has been described and illustrated herein purely by way of example, without thereby departing from the scope of the present embodiments, as defined in the ensuing claims.

Of course, the neural networks of the system here described can be implemented by one or more processors or microprocessors or any processing system, in particular any processing system arranged in the vehicle which is able to support such neural networks.