


Title:
METHOD FOR DETERMINING THE DEPTH FROM A SINGLE IMAGE AND SYSTEM THEREOF
Document Type and Number:
WIPO Patent Application WO/2022/201212
Kind Code:
A1
Abstract:
Method for determining the depth from a single image and system thereof

The present invention relates to a computer-implemented method (2) for determining the depth of a digital image (I), wherein said method (2) comprises the step of training (20) a neural network (3), wherein said training step (20) comprises a first training substep (200) and a second training substep (201), wherein said first training substep (200) comprises the following steps: acquiring (200A) at least one digital image (L, R) of a scene (S), said at least one digital image (L, R) having a first spatial resolution; down-sampling (200B) said at least one digital image (L, R) so as to obtain said digital image (I) being constituted by a matrix of pixels and having a predetermined spatial resolution lower than said first spatial resolution of said at least one digital image (L, R); processing (200C) said digital image (I), obtained in said down-sampling step (200B), through said neural network (3) for generating a first depth map correlated to the depth of each pixel and the surrounding pixels of said digital image (I); and wherein said second training substep (201) comprises the following steps: processing (201A) said at least one digital image (L, R) with a matching technique for generating an optimizing depth map (22) correlated with the depth of each pixel of said at least one digital image (L, R); down-sampling (201B) said optimizing depth map (22) so as to obtain a second depth map with said predetermined spatial resolution lower than said first spatial resolution of said at least one digital image (L, R); determining (201C) a loss function between said first depth map obtained from said first training substep (200) and said second depth map obtained from said second training substep (201), for optimizing said first depth map generated by said neural network (3), wherein the error of said first depth map with respect to said second depth map is used to update the weights of said neural network (3) through back-propagation.

Inventors:
POGGI MATTEO (IT)
ALEOTTI FILIPPO (IT)
TOSI FABIO (IT)
MATTOCCIA STEFANO (IT)
PELUSO VALENTINO (IT)
CIPOLLETTA ANTONIO (IT)
CALIMERA ANDREA (IT)
Application Number:
PCT/IT2022/050065
Publication Date:
September 29, 2022
Filing Date:
March 22, 2022
Assignee:
UNIV BOLOGNA ALMA MATER STUDIORUM (IT)
TORINO POLITECNICO (IT)
International Classes:
G06T7/593
Other References:
TOSI FABIO ET AL: "Learning Monocular Depth Estimation Infusing Traditional Stereo Knowledge", 2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE, 15 June 2019 (2019-06-15), pages 9791 - 9801, XP033687286, DOI: 10.1109/CVPR.2019.01003
PELUSO VALENTINO ET AL: "Enabling monocular depth perception at the very edge", 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), IEEE, 14 June 2020 (2020-06-14), pages 1581 - 1583, XP033799180, DOI: 10.1109/CVPRW50498.2020.00204
POGGI MATTEO ET AL: "Towards Real-Time Unsupervised Monocular Depth Estimation on CPU", 2018 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), IEEE, 1 October 2018 (2018-10-01), pages 5848 - 5854, XP033491101, DOI: 10.1109/IROS.2018.8593814
GODARD CLEMENT ET AL: "Unsupervised Monocular Depth Estimation with Left-Right Consistency", 2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), IEEE COMPUTER SOCIETY, US, 21 July 2017 (2017-07-21), pages 6602 - 6611, XP033250025, ISSN: 1063-6919, [retrieved on 20171106], DOI: 10.1109/CVPR.2017.699
M. POGGI ET AL.: "2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)", 2018, IEEE, article "Towards real-time unsupervised monocular depth estimation on cpu", pages: 5848 - 5854
V. PELUSO ET AL.: "2019 Design, Automation & Test in Europe Conference & Exhibition (DATE)", 2019, IEEE, article "Enabling energy-efficient unsupervised monocular depth estimation on armv7-based platforms", pages: 1703 - 1708
R. SANCHEZ-IBORRA: "Tinyml-enabled frugal smart objects: Challenges and opportunities", IEEE CIRCUITS AND SYSTEMS MAGAZINE, vol. 20, no. 3, 2020, pages 4 - 18, XP011804944, DOI: 10.1109/MCAS.2020.3005467
D. WOFK ET AL.: "2019 International Conference on Robotics and Automation (ICRA)", 2019, IEEE, article "Fastdepth: Fast monocular depth estimation on embedded systems", pages: 6101 - 6108
E. CHOU ET AL.: "Privacy-preserving action recognition for smart hospitals using low-resolution depth images", ARXIV:1811.09950, 2018
A. GEIGER ET AL.: "Vision meets robotics: The kitti dataset", THE INTERNATIONAL JOURNAL OF ROBOTICS RESEARCH, vol. 32, no. 11, 2013, pages 1231 - 1237, XP055674191, DOI: 10.1177/0278364913491297
D. EIGEN ET AL.: "Depth map prediction from a single image using a multi-scale deep network", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2014, pages 2366 - 2374
F. LIU ET AL.: "Learning depth from single monocular images using deep convolutional neural fields", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 38, no. 10, 2015, pages 2024 - 2039, XP011621539, DOI: 10.1109/TPAMI.2015.2505283
N. YANG ET AL.: "Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry", PROCEEDINGS OF THE EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV), 2018, pages 817 - 833
F. TOSI ET AL.: "Learning monocular depth estimation infusing traditional stereo knowledge", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2019, pages 9799 - 9809
J. WATSON ET AL.: "Self-supervised monocular depth hints", PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2019, pages 2162 - 2171, XP033723726, DOI: 10.1109/ICCV.2019.00225
C. GODARD ET AL.: "Unsupervised monocular depth estimation with left-right consistency", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2017, pages 270 - 279
H. FU ET AL.: "Deep ordinal regression network for monocular depth estimation", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2018, pages 2002 - 2011, XP033476165, DOI: 10.1109/CVPR.2018.00214
T. BAGAUTDINOV ET AL.: "Probability occupancy maps for occluded depth images", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2015, pages 2829 - 2837, XP032793729, DOI: 10.1109/CVPR.2015.7298900
S. SUN ET AL.: "Benchmark data and method for real-time people counting in cluttered scenes using depth sensors", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, vol. 20, no. 10, 2019, pages 3599 - 3612, XP011748658, DOI: 10.1109/TITS.2019.2911128
V. SRIVASTAV ET AL.: "International Conference on Medical Image Computing and Computer-Assisted Intervention", 2019, SPRINGER, article "Human pose estimation on privacy-preserving low resolution depth images", pages: 583 - 591
M.-R. LEE ET AL.: "Vehicle counting based on a stereo vision depth maps for parking management", MULTIMEDIA TOOLS AND APPLICATIONS, vol. 78, no. 6, 2019, pages 6827 - 6846, XP036755907, DOI: 10.1007/s11042-018-6394-6
R. HG ET AL.: "2012 Eighth International Conference on Signal Image Technology and Internet Based Systems", 2012, IEEE, article "An rgb-d database using microsoft's kinect for windows for face detection", pages: 42 - 46
A. E. ESHRATIFAR ET AL.: "Jointdnn: an efficient training and inference engine for intelligent mobile cloud computing services", IEEE TRANSACTIONS ON MOBILE COMPUTING, 2019
R. SZELISKI: "Computer vision: algorithms and applications", 2010, SPRINGER SCIENCE & BUSINESS MEDIA
D. SCHARSTEIN ET AL.: "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms", INTERNATIONAL JOURNAL OF COMPUTER VISION, vol. 47, no. 1-3, 2002, pages 7 - 42
H. HIRSCHMULLER: "2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)", vol. 2, 2005, IEEE, article "Accurate and efficient stereo processing by semiglobal matching and mutual information", pages: 807 - 814
L. DI STEFANO ET AL.: "A fast area-based stereo matching algorithm", IMAGE AND VISION COMPUTING, vol. 22, no. 12, 2004, pages 983 - 1005, XP004549569, DOI: 10.1016/j.imavis.2004.03.009
T. KANADE ET AL.: "A stereo matching algorithm with an adaptive window: Theory and experiment", vol. 16, 1994, IEEE, pages: 920 - 932
J. ZBONTAR ET AL.: "Computing the stereo matching cost with a convolutional neural network", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2015, pages 1592 - 1599, XP032793569, DOI: 10.1109/CVPR.2015.7298767
N. MAYER ET AL.: "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2016, pages 4040 - 4048, XP033021635, DOI: 10.1109/CVPR.2016.438
A. KENDALL ET AL.: "End-to-end learning of geometry and context for deep stereo regression", PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2017, pages 66 - 75, XP033282860, DOI: 10.1109/ICCV.2017.17
J.-R. CHANG ET AL.: "Pyramid stereo matching network", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2018, pages 5410 - 5418, XP033473454, DOI: 10.1109/CVPR.2018.00567
F. ZHANG ET AL.: "Ga-net: Guided aggregation net for end-to-end stereo matching", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2019, pages 185 - 194, XP033687508, DOI: 10.1109/CVPR.2019.00027
A. SAXENA ET AL.: "Make3d: Learning 3d scene structure from a single still image", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 31, no. 5, 2008, pages 824 - 840, XP011266536, DOI: 10.1109/TPAMI.2008.132
I. LAINA ET AL.: "2016 Fourth international conference on 3D vision (3DV)", 2016, IEEE, article "Deeper depth prediction with fully convolutional residual networks", pages: 239 - 248
Y. CAO ET AL.: "Estimating depth from monocular images as classification using deep fully convolutional residual networks", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 28, no. 11, 2017, pages 3174 - 3182, XP011698882, DOI: 10.1109/TCSVT.2017.2740321
Y. CAO ET AL.: "Monocular depth estimation with augmented ordinal depth relationships", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020
H. MOHAGHEGH ET AL.: "Aggregation of rich depth-aware features in a modified stacked generalization model for single image depth estimation", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 29, no. 3, 2018, pages 683 - 697, XP011714236, DOI: 10.1109/TCSVT.2018.2808682
K. KARSCH ET AL.: "Depth transfer: Depth extraction from video using non-parametric sampling", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 36, no. 11, 2014, pages 2144 - 2158, XP011560105, DOI: 10.1109/TPAMI.2014.2316835
T. ZHOU ET AL.: "Unsupervised learning of depth and ego-motion from video", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2017, pages 1851 - 1858
M. POGGI ET AL.: "2018 International Conference on 3D Vision (3DV", 2018, IEEE, article "Learning monocular depth estimation with unsupervised trinocular assumptions", pages: 324 - 333
R. MAHJOURIAN ET AL.: "Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints", CVPR, 2018
H. KUMAR ET AL.: "Depth map estimation using defocus and motion cues", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 29, no. 5, 2018, pages 1365 - 1379, XP011722700, DOI: 10.1109/TCSVT.2018.2832086
C. WANG ET AL.: "Learning depth from monocular videos using direct methods", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2018, pages 2022 - 2030, XP033476167, DOI: 10.1109/CVPR.2018.00216
Z. YIN ET AL.: "Geonet: Unsupervised learning of dense depth, optical flow and camera pose", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2018, pages 1983 - 1992, XP055583159, DOI: 10.1109/CVPR.2018.00212
M. RASTEGARI ET AL.: "European conference on computer vision", 2016, SPRINGER, article "Xnor-net: Imagenet classification using binary convolutional neural networks", pages: 525 - 542
A. PILZER ET AL.: "Refine and distill: Exploiting cycle-inconsistency and knowledge distillation for unsupervised monocular depth estimation", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2019, pages 9768 - 9777
L. ANDRAGHETTI ET AL.: "2019 International Conference on 3D Vision (3DV)", 2019, IEEE, article "Enhancing self-supervised monocular depth estimation with traditional visual odometry", pages: 424 - 433
J. QIU ET AL.: "Going deeper with embedded fpga platform for convolutional neural network", PROCEEDINGS OF THE 2016 ACM/SIGDA INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS, 2016, pages 26 - 35, XP055423746, DOI: 10.1145/2847263.2847265
H. ALEMDAR ET AL.: "2017 International Joint Conference on Neural Networks (IJCNN)", 2017, IEEE, article "Ternary neural networks for resource-efficient ai applications", pages: 2547 - 2554
M. RUSCI ET AL.: "Quantized nns as the definitive solution for inference on low-power arm mcus? work-in-progress", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON HARDWARE/SOFTWARE CODESIGN AND SYSTEM SYNTHESIS, 2018, pages 1 - 2, XP033441589, DOI: 10.1109/CODESISSS.2018.8525915
L. LAI ET AL.: "Enabling deep learning at the IoT edge", 2018, pages 1 - 6
S. VOGEL ET AL.: "Efficient hardware acceleration of cnns using logarithmic data representation with arbitrary log-base", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTER-AIDED DESIGN, 2018, pages 1 - 8, XP058420986, DOI: 10.1145/3240765.3240803
S. HAN ET AL.: "Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding", ARXIV:1510.00149, 2015
M. GRIMALDI ET AL.: "Optimality assessment of memory-bounded convnets deployed on resource-constrained risc cores", IEEE ACCESS, vol. 7, 2019, pages 599 - 611
J. YU ET AL.: "2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)", 2017, IEEE, article "Scalpel: Customizing dnn pruning to the underlying hardware parallelism", pages: 548 - 560
H. LI ET AL.: "Pruning filters for efficient convnets", 5TH INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS (ICLR, 2017
S. ELKERDAWY ET AL.: "2019 IEEE International Conference on Image Processing (ICIP)", 2019, IEEE, article "Lightweight monocular depth estimation model by joint end-to-end filter pruning", pages: 4290 - 4294
A. TONIONI ET AL.: "Unsupervised adaptation for deep stereo", PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2017, pages 1605 - 1613
A. TONIONI ET AL.: "Unsupervised domain adaptation for depth prediction from images", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019
A. B. OWEN: "A robust hybrid of lasso and ridge regression", CONTEMPORARY MATHEMATICS, vol. 443, no. 7, 2007, pages 59 - 72
X. GUO ET AL.: "Learning monocular depth by distilling cross-domain stereo networks", PROCEEDINGS OF THE EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV), 2018, pages 484 - 500
A. MISHRA ET AL.: "Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2018
B. JACOB ET AL.: "Quantization and training of neural networks for efficient integer-arithmetic-only inference", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2018, pages 2704 - 2713, XP033476237, DOI: 10.1109/CVPR.2018.00286
M. CORDTS ET AL.: "The cityscapes dataset for semantic urban scene understanding", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2016, pages 3213 - 3223, XP033021503, DOI: 10.1109/CVPR.2016.350
C. GODARD ET AL.: "Digging into self-supervised monocular depth estimation", PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION, 2019, pages 3828 - 3838
Y. WANG ET AL.: "Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos", PROCEEDINGS OF THE IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 2019, pages 8071 - 8081
E. ILG ET AL.: "Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation", PROCEEDINGS OF THE EUROPEAN CONFERENCE ON COMPUTER VISION (ECCV), 2018, pages 614 - 630
A. TORRALBA ET AL.: "80 million tiny images: A large data set for nonparametric object and scene recognition", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, vol. 30, no. 11, 2008, pages 1958 - 1970, XP011227515, DOI: 10.1109/TPAMI.2008.128
A. TORRALBA ET AL.: "Object and scene recognition in tiny images", JOURNAL OF VISION, vol. 7, no. 9, 2007, pages 193 - 193
Attorney, Agent or Firm:
TIBURZI, Andrea et al. (IT)
Claims:
CLAIMS

1. A computer-implemented method (2) for determining the depth of a digital image (I), wherein said method (2) comprises the step of training (20) a neural network (3), wherein said training step (20) comprises a first training substep (200) and a second training substep (201), wherein said first training substep (200) comprises the following steps:

- acquiring (200A) at least one digital image (L, R) of a scene (S), said at least one digital image (L, R) having a first spatial resolution;

- down-sampling (200B) said at least one digital image (L, R) so as to obtain said digital image (I) being constituted by a matrix of pixels p and having a predetermined spatial resolution lower than said first spatial resolution of said at least one digital image (L, R);

- processing (200C) said digital image (I), obtained in said down-sampling step (200B), through said neural network (3) for generating a first depth map d correlated to the depth of each pixel p and the surrounding pixels of said digital image (I); and wherein said second training substep (201) comprises the following steps:

- processing (201A) said at least one digital image (L, R) with a matching technique for generating an optimizing depth map (22) correlated with the depth of each pixel p of said at least one digital image (L, R);

- down-sampling (201B) said optimizing depth map (22) so as to obtain a second depth map with said predetermined spatial resolution lower than said first spatial resolution of said at least one digital image (L, R);

- determining (201C) a loss function Linit between said first depth map d obtained from said first training substep (200) and said second depth map d̂ obtained from said second training substep (201), for optimizing said first depth map d generated by said neural network (3), wherein the error of said first depth map d with respect to said second depth map d̂ is used to update the weights of said neural network (3) through back-propagation.

2. Method (2) according to the preceding claim, characterized in that said image acquisition step (200A) and said processing step (201A) are carried out by means of a stereo matching technique, so as to detect and process a left image (L) and a right image (R), in that said processing substep (201A) is carried out so as to compute a left depth map (DL) considering said left image (L) as a reference image and a right depth map (DR) considering said right image (R) as said reference image, and in that said loss function Linit is computed according to the following formula:

Linit = αap (Llap + Lrap) + αps (Llps + Lrps)

wherein αap is set to 1; αps is set to 1; Llap is a first photometric loss function calculated between said left image (L) and a warped right image obtained warping said right image (R) according to said left depth map (DL); Lrap is a second photometric loss function calculated between said right image (R) and a warped left image obtained warping said left image (L) according to said right depth map (DR); Llps is a first proxy supervision loss function correlated with said first depth map d obtained from said first training substep (200) and a second left depth map obtained from said second training substep (201), according to said left depth map (DL); and Lrps is a second proxy supervision loss function correlated with said first depth map d obtained from said first training substep (200) and a second right depth map obtained from said second training substep (201), according to said right depth map (DR).

3. Method (2) according to the preceding claim, characterized in that said first photometric loss function Llap and said second photometric loss function Lrap are computed according to the following formulas, and in that said first Llps and second Lrps proxy supervision loss functions are computed according to the following formulas,

wherein berHu is a reverse Huber loss function and c = α · maxp |d(p) − d̂(p)|, with α = 0.2.

4. Method (2) according to the preceding claim, characterized in that said processing substep (201A) provides a left-right consistency constraint, wherein pixels p having different disparities between said left (DL) and right (DR) depth maps are invalidated according to the following formula:

|DL(p) − DR(p − DL(p))| > ε

wherein ε is a threshold value enforcing consistency over the left and right depth maps, which is set to 3.

5. Method (2) according to any one of the preceding claims, characterized in that it further comprises, after said training step (20), an optimizing step (21) for the deployment of said trained neural network (3) on a low-power device, such as a microcontroller.

6. Method (2) according to the preceding claim, characterized in that said optimizing step (21) comprises the following substeps

- quantizing (210) said trained neural network (3) using 8-bit fixed point representation; and

- transforming (212) said quantized neural network (3) by means of a compiler into an executable in binary format which is compliant with the hardware architecture of said power device.

7. Method (2) according to the preceding claim, characterized in that it comprises, after said quantizing substep (210), the substep of fine tuning (211) based on a knowledge distillation method, for recovering any loss of accuracy caused by said quantizing substep (210).

8. Method (2) according to any one of the preceding claims, characterized in that said predetermined spatial resolution of said digital image (I) and said second depth map is 32 x 32 pixels or 48 x 48 pixels.

9. Method (2) according to any one of the preceding claims, characterized in that said down-sampling step (200B) of said first training substep (200) and said down-sampling step (201B) of said second training substep (201) are executed by means of a nearest neighbor interpolation technique.

10. Imaging system (1) comprising an image detection unit (10), configured to detect at least one image of a scene (S), generating at least one digital image (L, R), at least one processing unit (11), operatively connected to said image detection unit (10), and storage means (12), connected to said at least one processing unit (11), for storing depth data, said system (1) being characterized in that said at least one processing unit (11) is configured to determine the depth of digital images by training a neural network (3) according to any one of claims 1-9.

11. System (1) according to the preceding claim, characterized in that said image detection unit (10) comprises two image detection devices (100, 101) for the acquisition of stereo mode images, wherein a first image detection device (100) detects a left image (L) and a second image detection device (101) detects a right image (R), and a monocular sensor (102) for digital images sampling.

12. System (1) according to the preceding claim, characterized in that said two image detection devices (100, 101) comprise a video camera and/or a camera, mobile and/or active sensors, such as LiDARs, Radar or Time of Flight (ToF) cameras, and said monocular sensor (102) comprises an RGB or IR sensor.

13. System (1) according to any one of claims 10-12, characterized in that said neural network (3) comprises an encoder (30) and a plurality of decoders (31) arranged in sequence to said encoder (30), wherein said encoder (30) comprises a plurality of convolutional layers (300, 301) followed by leaky ReLU activations, and wherein each of said decoders (31) comprises a plurality of convolutional layers (300) followed by leaky ReLU activations, each convolutional layer (300) having a stride factor of 1 and each convolutional layer (301) having a stride factor of 2.

14. A computer-implemented method (2) for training (20) a neural network (3) for determining the depth of a digital image (I), comprising: a first training substep (200) comprising the following steps:

- acquiring (200A) at least one digital image (L, R) of a scene (S), said digital image (L, R) having a first spatial resolution,

- down-sampling (200B) said at least one digital image (L, R) so as to obtain said digital image (I) being constituted by a matrix of pixels p and having a predetermined spatial resolution lower than said first spatial resolution of said at least one digital image (L, R);

- processing (200C) said digital image (I), obtained in said down-sampling step (200B), through said neural network (3) for generating a first depth map d correlated to the depth of each pixel p and the surrounding pixels of said digital image (I); and a second training substep (201) comprising the following steps:

- processing (201A) said at least one digital image (L, R) with a matching technique for generating an optimizing depth map (22) correlated with the depth of each pixel p of said at least one digital image (L, R);

- down-sampling (201B) said optimizing depth map (22) so as to obtain a second depth map d̂ with said predetermined spatial resolution lower than said first spatial resolution of said at least one digital image (L, R);

- determining (201C) a loss function Linit between said first depth map d obtained from said first training substep (200) and said second depth map d̂ obtained from said second training substep (201), so as to train said neural network (3) for optimizing said first depth map d generated by said neural network (3), wherein the error of said first depth map d with respect to said second depth map d̂ is used to update the weights of said neural network (3) through back-propagation.

15. Computer program comprising instructions which, when the program is executed by a processor, cause the execution by the processor of the steps 200B-201C of the method according to any one of claims 1-9 and 14.

16. Storage means readable by a processor comprising instructions which, when executed by a processor, cause the execution by the processor of the method steps 200B-201C according to any one of claims 1-9 and 14.

Description:
Method for determining the depth from a single image and system thereof

The present invention relates to a method for determining the depth from a single image and a system thereof.

Technical field

More specifically, the invention relates to a method for determining the depth from a single digital image, wherein the method has been studied and made in particular for estimating depth, optical flow and semantics from a single image, but it can be used in any electronic system capable of estimating depth cues from images.

The following description will relate to monocular depth estimation, but it is evident that it must not be considered to be limited to that specific use.

Background art

Currently, depth perception is one of the foremost cues for dealing with many real-world applications like autonomous or assisted driving, robotics, safety, and security.

Although for this purpose there exist effective active technologies, such as Light Detection and Ranging (LiDAR), inferring depth from images has several advantages as bulky mechanical parts are no longer needed.

Therefore, it represents a long-standing problem in computer vision, and different approaches, such as stereo vision and multi-view stereo, have been extensively investigated.

The recent spread of machine learning (ML) has opened new frontiers and, in particular, has made it possible to infer depth from a single camera, remarkably simplifying the setup and allowing such a cue to be exploited even in application contexts characterized by severe constraints of cost and size.

A potential drawback of ML-based methods is their complexity. In fact, the Convolutional Neural Networks (CNNs) used as backbone are well known to be highly resource-demanding, both in terms of computing power and memory space, and may call for parallel accelerators like GPU cards.

Nonetheless, as proposed in [1], the PyD-Net architecture, leveraging an appropriate pyramidal network design, enables depth perception with an accuracy comparable to the state of the art but with far lower hardware requirements, making it feasible not only on the high-end CPUs commonly available in desktops, but also on portable devices with a few Watts of power budget, such as flagship smartphones and tablets or wired smart cameras.

As described in a recent work [2], additional hardware-aware optimization strategies applied to the PyD-Net model, mostly aimed at reducing the data type deployed for inference, made it possible to save 33% of energy with an almost equivalent degree of accuracy.

In light of the above, it is natural to ask whether depth perception can be pushed even further, approaching smaller and cheaper off-the-shelf components able to work below the one-Watt mark, like tiny end-nodes powered by Micro Controller Units (MCUs).

This may offer interesting opportunities in the context of new edge applications and services in the Internet of Things (IoT) segment, where distributed visual sensors can evolve from simple image collectors to intelligent hubs able to infer depth locally, namely with no access to the cloud, thereby ensuring portability even in geographical areas with limited data coverage, and a higher quality of service (QoS) thanks to more predictable latency and higher user privacy [3].

Unfortunately, MCUs are orders of magnitude less powerful than embedded CPUs/GPUs, and state-of-the-art models for monocular depth estimation are too large for this purpose.

Even the smallest nets, e.g. [1], [4], are designed for systems with high-speed multi-core architectures and large SRAM banks, thus consuming up to 3.5-10 W of power. Instead, typical MCUs run at a much lower frequency (hundreds of MHz vs. 1-2 GHz) and impose tight memory constraints (hundreds of kB vs. 2-8 GB).

Furthermore, despite the achievements in terms of accuracy, modern state of the art techniques for monocular depth estimation [10], [12], [13] are overkill for many edge applications.

Specifically, while high-resolution dense depth maps are desirable when dealing with tasks such as 3D reconstruction and SLAM, a rough depth estimate suffices in many applications such as object/people counting [14], [15], pose estimation [16], action recognition [5], vehicle detection [17].

Indeed, millimetric depth measurements are not strictly required in these cases to tackle the problem successfully.

Known techniques comprise active sensors with bulky mechanical parts or digital multi-image systems with high power consumption. In fact, the known methods for estimating the depth, optical flow, and semantics of a scene do not allow such tasks to be performed in a reasonable time while keeping power consumption low.

Moreover, known methods are not implementable on low-power devices due to the limited available resources, and some low-cost and low-power systems such as sonar or radar can only provide a limited subset of information.

It is evident that the methods according to the known technique are extremely expensive in computational terms, so that they cannot be easily used and applied.

Scope of the invention

In view of the above, it is therefore a scope of the present invention to overcome the abovementioned drawbacks by providing a method for determining depth, optical flow, and semantics from a single image on low-power devices by processing low-resolution images using a lightweight and highly accurate self-supervised network.

Another scope of the present invention is that of providing a method for obtaining a meaningful coarse depth representation from a tiny image, i.e., down to 32x32 pixels, sufficient to deal with high-level applications.

Further scope of the present invention is that of providing an architecture of a network designed for processing low-resolution images on low-power devices.

Another scope of the present invention is to provide a method that is highly reliable, relatively easy to make, and has competitive costs if compared with the prior art.

Further scope of the present invention is to provide the tools necessary for the execution of the method and the apparatuses which execute the method.

Object of the invention

It is therefore a specific object of the present invention a computer-implemented method for determining the depth of a digital image, wherein said method comprises the step of training a neural network, wherein said training step comprises a first training substep and a second training substep, wherein said first training substep comprises the following steps: acquiring at least one digital image of a scene, said digital image having a first spatial resolution; down-sampling said at least one digital image so as to obtain said digital image being constituted by a matrix of pixels p and having a predetermined spatial resolution lower than said first spatial resolution of said at least one digital image; processing said digital image, obtained in said down-sampling step, through said neural network for generating a first depth map d correlated to the depth of each pixel p and the surrounding pixels of said digital image; and wherein said second training substep comprises the following steps: processing said at least one digital image with a matching technique for generating an optimizing depth map correlated with the depth of each pixel p of said at least one digital image; down-sampling said optimizing depth map so as to obtain a second depth map with said predetermined spatial resolution lower than said first spatial resolution of said at least one digital image; determining a loss function Linit between said first depth map d obtained from said first training substep and said second depth map obtained from said second training substep, for optimizing said first depth map d generated by said neural network, wherein the error of said first depth map d with respect to said second depth map is used to update the weights of said neural network through back-propagation.

Advantageously according to the present invention, said image acquisition step and said processing step may be carried out by means of a stereo matching technique, so as to detect and process a left image and a right image, said processing substep may be carried out so as to compute a left depth map DL considering said left image L as a reference image and a right depth map DR considering said right image R as said reference image, and said loss function Linit may be computed according to the following formula:

Linit = αap (Llap + Lrap) + αps (Llps + Lrps)

wherein αap is set to 1; αps is set to 1; Llap is a first photometric loss function calculated between said left image L and a warped right image obtained warping said right image R according to said left depth map DL; Lrap is a second photometric loss function calculated between said right image R and a warped left image obtained warping said left image L according to said right depth map DR; Llps is a first proxy supervision loss function correlated with said first depth map d obtained from said first training substep and a second left depth map obtained from said second training substep, according to said left depth map DL; and Lrps is a second proxy supervision loss function correlated with said first depth map d obtained from said first training substep and a second right depth map obtained from said second training substep, according to said right depth map DR.

Always according to the present invention, said first photometric loss function Llap and said second photometric loss function Lrap may be computed according to the following formulas, wherein said first Llps and second Lrps proxy supervision loss functions may be computed according to the following formula, wherein berHu is a reverse Huber loss function and c = α · maxp |d(p) − d̂(p)|, with α = 0.2.

Also according to the present invention, said processing substep may provide a left-right consistency constraint, wherein pixels p having different disparities between said left DL and right DR depth maps are invalidated according to the following formula:

|DL(p) − DR(p − DL(p))| > ε

wherein ε is a threshold value enforcing consistency over the left and right depth maps, which is set to 3.

Always according to the present invention, the method may further comprise, after said training step, an optimizing step for the deployment of said trained neural network on a low-power device, such as a microcontroller. Advantageously according to the present invention, said optimizing step may comprise the following substeps: quantizing said trained neural network using 8-bit fixed point representation; and transforming said quantized neural network by means of a compiler into an executable in binary format which is compliant with the hardware architecture of said low-power device.
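By way of illustration only, an 8-bit post-training quantization of this kind can be obtained with off-the-shelf toolchains; the TensorFlow Lite sketch below is one possible instantiation and is not the specific toolchain prescribed by the method. The saved-model directory, the calibration file and the output file name are hypothetical.

```python
# Minimal sketch of 8-bit post-training quantization with TensorFlow Lite.
# "model_dir" and "calibration_images.npy" are hypothetical placeholders.
import numpy as np
import tensorflow as tf

def representative_dataset():
    # A handful of low-resolution RGB images used to calibrate activation ranges.
    for img in np.load("calibration_images.npy")[:100]:
        yield [img[np.newaxis].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # fully integer I/O for MCU deployment
converter.inference_output_type = tf.int8

with open("micropydnet_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

The resulting binary can then be translated by a vendor compiler into an executable compliant with the target microcontroller, as stated above.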

Always according to the present invention, the method may further comprise, after said quantizing substep, the substep of fine tuning based on a knowledge distillation method, for recovering any loss of accuracy caused by said quantizing substep. Always according to the present invention, said predetermined spatial resolution of said digital image and said second depth map d̂ may be 32 x 32 pixels or 48 x 48 pixels.

Always according to the present invention, said down-sampling step of said first training substep and said down-sampling step of said second training substep are executed by means of a nearest neighbor interpolation technique.

It is also specific object of the present invention, an imaging system comprising an image detection unit, configured to detect at least one image of a scene, generating at least one digital image, at least one processing unit, operatively connected to said image detection unit, and storage means, connected to said at least one processing unit, for storing depth data, wherein said at least one processing unit is configured to determine the depth of digital images by training a neural network.

Always according to the present invention, said image detection unit may comprise two image detection devices for the acquisition of stereo mode images, wherein a first image detection device detects a left image and a second image detection device detects a right image, and a monocular sensor for digital images sampling.

Always according to the present invention, said two image detection devices may comprise a video camera and/or a camera, mobile and/or active sensors, such as LiDARs, Radar or Time of Flight (ToF) cameras, and said monocular sensor may comprise an RGB or IR sensor.

Always according to the present invention, said neural network may comprise an encoder and a plurality of decoders arranged in sequence to said encoder, wherein said encoder comprises a plurality of convolutional layers followed by leaky ReLU activations, and wherein each of said decoders comprises a plurality of convolutional layers followed by leaky ReLU activations, each convolutional layer having a stride factor of 1 and each convolutional layer having a stride factor of 2.

It is also object of the present invention a computer-implemented method for training a neural network for determining the depth of a digital image, comprising: a first training substep comprising the following steps: acquiring at least one digital image of a scene, said digital image having a first spatial resolution, down-sampling said at least one digital image so as to obtain said digital image being constituted by a matrix of pixels p and having a predetermined spatial resolution lower than said first spatial resolution of said at least one digital image; processing said digital image, obtained in said down-sampling step, through said neural network for generating a first depth map d correlated to the depth of each pixel p and the surrounding pixels of said digital image; and a second training substep comprising the following steps: processing said at least one digital image with a matching technique for generating an optimizing depth map correlated with the depth of each pixel p of said at least one digital image; down-sampling said optimizing depth map so as to obtain a second depth map d̂ with said predetermined spatial resolution lower than said first spatial resolution of said at least one digital image; determining a loss function Linit between said first depth map d obtained from said first training substep and said second depth map d̂ obtained from said second training substep, so as to train said neural network for optimizing said first depth map d generated by said neural network, wherein the error of said first depth map d with respect to said second depth map d̂ is used to update the weights of said neural network through back-propagation.

It is also object of the present invention a computer program comprising instructions which, when the program is executed by a processor, cause the execution by the processor of the steps of the method.

It is further object of the present invention storage means readable by a processor comprising instructions which, when executed by a processor, cause the execution by the processor of the method steps.

Brief description of the drawings

The invention is now described, by way of example and without limiting the scope of the invention, with reference to the accompanying drawings, which illustrate preferred embodiments of it, in which:

figure 1 shows, in a schematic view, an embodiment of a system for determining the depth from a single image, according to the present invention;

figure 2 shows, in a schematic view, an embodiment of the method for determining the depth from a single image, according to the present invention;

figure 3 shows, in a schematic view, a training step of a neural network, according to the present invention;

figure 4 shows, in a schematic view, an embodiment of the architecture of the neural network, according to the present invention;

figure 5 shows examples of an application of the present method to images concerning traffic monitoring, wherein the high-resolution frame is shown, followed by the 32x32 image processed by said neural network (top-right corner) and its output (mid-right corner) and, in the bottom-right corner, by the differences in the coarse 3D structure of the scene with respect to the structure of the environment itself, which has been acquired in absence of vehicles (top-left example);

figure 6 shows examples of an application of the present method on a testing image of the VAP dataset [18], wherein a) is the original RGB frame, b) the ground-truth depth acquired using Kinect, c) the RGB input image resized to 32x32, d) the maps predicted by PyD-Net, e) the maps predicted by said neural network, and f) the outcome of a super-resolution network fed with the map of said neural network;

figure 7A shows examples of self-sourced proxy labels on 48x48 images: from top to bottom, reference images, disparity maps produced by SGM [22] (on full resolution images, then downsampled by means of nearest neighbor interpolation), and predictions by said neural network, wherein, in the SGM maps, outliers detected by the left-right consistency check are depicted in a darker grey;

figure 7B shows examples of self-sourced proxy labels on 32x32 images: from top to bottom, reference images, disparity maps produced by SGM [22] (on full resolution images, then downsampled by means of nearest neighbor interpolation), and predictions by said neural network, wherein, in the SGM maps, outliers detected by the left-right consistency check are depicted in a darker grey;

figure 8 shows a first table comprising data related to proxy label accuracy on the test set of the KITTI dataset [6] using the split of Eigen et al. [7], with maximum depth set to 80 m;

figure 9 shows a second table comprising data related to an ablation study on the test set of the KITTI dataset [6] using the split of Eigen et al. [7], with maximum depth set to 80 m;

figure 10 shows a third table comprising data related to a quantitative evaluation on the test set of the KITTI dataset [6] using the split of Eigen et al. [7] with maximum depth set to 80 m, wherein methods marked with * run post-processing [12];

figure 11 shows qualitative results concerning traffic monitoring, wherein the high-resolution frame, followed by the 32x32 image processed by said neural network, is shown for each example;

figure 12 shows a fourth table comprising data related to a quantitative evaluation on the test set of the KITTI dataset [6] using the split of Eigen et al. [7] with maximum depth set to 80 m;

figure 13 shows a fifth table comprising data related to an evaluation of said neural network and its quantized variants at different ranges, with a comparison with the state of the art [10] on the same ranges;

figure 14 shows a sixth table comprising data related to a quantitative evaluation on the Make3D dataset [30];

figure 15 shows qualitative results on Make3D: from top to bottom, reference images, inverse depth maps by MonoResMatch [10], and by the 48x48 neural network;

figure 16 shows a seventh table comprising data related to extra-functional metrics of said neural network at different input resolutions on the Nucleo-F767ZI board; and

figure 17 shows the memory breakdown of PyD-Net and said neural network at different input resolutions, wherein a dash indicates that the resolution is not compliant with the network topology.

Similar parts will be indicated in the various drawings with the same numerical references.

Detailed description

With reference to figure 1, there is shown an image detection system, which is referred to with the numerical reference 1, for training a neural network 3, as better specified below, comprising an image detection unit 10, a processing unit 11, operatively connected to said image detection unit 10, and storage means 12, connected to said processing unit 11.

Said image detection unit 10 comprises in its turn two image detection devices 100, 101 which can be a video camera and/or a camera, mobile or fixed with respect to a first and a second position, and/or active sensors, such as LiDARs, Radar or Time of Flight (ToF) cameras and the like.

In the present embodiment, said detection devices 100, 101 are configured for the acquisition of stereo mode images, wherein a first image detection device 100 detects a left digital image L and a second image detection device detects a right digital image R (or vice versa) of the object or scene S to be detected.

As it will be better explained later, each of the two images L, R can be considered as a reference image or a target image for the computation of the right and left disparity or depth maps.

However, in other embodiments of the present invention, the technique used for detecting the images can be different from the stereo matching technique and it is possible to provide a number of detection devices other than two.

Furthermore, as will be explained in more detail below, said image detection unit 10 comprises a monocular sensor 102 for processing single digital images, such as a Red Green Blue (RGB) or Infrared (IR) low-energy sensor.

The data acquired by said image detection unit 10 are processed by said processing unit 11 suitable for accurately determining the depth of a single image I calculated from said digital images L, R by means of the image depth determination method shown in figures 2-4, according to the present invention and as further explained below.

Moreover, said processing unit 11 is configured for running a neural network 3, and said storage means 12 allows the storage of depth data, such as depth maps, optical flow, and other semantics computed by the present method, as well as of said neural network 3 itself.

Referring now to figure 2, the flow chart of the method for determining depth from said single image I according to the present invention, generically indicated by the numerical reference 2, comprises a main step of training 20 said neural network 3, having said digital image I as input, and a further optimizing step 21, for the deployment of said trained neural network 3 on a low-power device, such as a microcontroller.

In particular, with reference also to figures 3 and 4, said training step 20 comprises in its turn a first training substep 200 and a second training substep 201.

The step indicated with reference 200A involves the acquisition of said two digital images L, R of a scene, wherein each of said digital images L, R has a respective first spatial resolution.

Subsequently, a down-sampling of said two digital images L, R is carried out in step 200B. In particular, said digital image I, consisting of a matrix of pixels p and having a predetermined spatial resolution lower than said first spatial resolution of said two digital images L, R, is obtained. As explained in more detail below, the predetermined spatial resolution of said digital image I is a low spatial resolution (i.e., 48x48 pixels or 32x32 pixels) compared to the resolution of said digital images L, R (i.e., 640x480 pixels).

Then, the step of processing 200C said digital image I is carried out. More particularly, said digital image I obtained in said down-sampling step 200B is processed through said neural network 3 for generating a first depth map or predicted depth map d correlated to the depth of each pixel p and the surrounding pixels of said digital image I.

The step of processing 201A involves the processing of said digital images L, R with a stereo matching technique for generating an optimizing depth map 22, which is correlated with the depth of each pixel p of said digital images L, R.

Then, a down-sampling of said optimizing depth map 22 is carried out in step 201B. In particular, a second depth map or proxy depth map d̂ with said predetermined spatial resolution lower than said first spatial resolution of said digital images L, R is obtained.

Subsequently, the step of determining 201C a loss function Linit between said first depth map d obtained from said first training substep 200 and said second depth map d̂ obtained from said second training substep 201 is carried out. In particular, as it will be described below, the computation of said loss function Linit allows the supervision of said neural network 3 through the optimization of said first depth map d generated by said neural network 3. The loss function Linit is computed between the predicted depth map d and the proxy depth map d̂. The error of the former with respect to the latter is used to update the weights of the neural network 3 through back-propagation.
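The following minimal sketch, given purely as an illustration, mirrors the training step just described in PyTorch. The network "net", the stereo-matching routine "sgm_disparity", the tensor shapes and the plain L1 proxy loss are assumptions of this sketch; the berHu-based proxy supervision described later in the text could be substituted for the L1 term.

```python
# Illustrative sketch of training step (20): substep (200) on the top half,
# substep (201) on the bottom half, followed by loss computation and back-propagation.
import torch
import torch.nn.functional as F

def training_step(net, optimizer, left_hr, right_hr, sgm_disparity, size=(32, 32)):
    # First substep (200): down-sample the acquired image and predict depth.
    left_lr = F.interpolate(left_hr, size=size, mode="nearest")          # 200B
    pred = net(left_lr)                                                  # 200C

    # Second substep (201): proxy depth from stereo matching, then down-sample.
    with torch.no_grad():
        proxy_hr, valid_hr = sgm_disparity(left_hr, right_hr)            # 201A
        proxy = F.interpolate(proxy_hr, size=size, mode="nearest")       # 201B
        proxy = proxy * (size[1] / left_hr.shape[-1])   # rescale disparities to the small grid
        valid = F.interpolate(valid_hr.float(), size=size, mode="nearest") > 0

    # 201C: loss between predicted and proxy maps, back-propagated to the weights.
    loss = F.l1_loss(pred[valid], proxy[valid])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```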

Network architecture

In order to accomplish monocular depth estimation under the challenging constraints outlined above, two main factors need to be carefully taken into account to keep both the memory footprint and the execution time manageable: input spatial resolution and network complexity, with the former usually driving most of the design choices linked to the latter.

For instance, pooling and stride parameters in convolution are adjusted to enlarge the receptive field as well as to reduce the computational burden at higher resolutions.

Thus, the first design choice to meet the constraints consists of inferring inverse depth from a small input image I resulting in an extremely compact network model 3, namely microPyD-Net (μPyD-Net).

Figure 4 shows the architecture of said neural network 3. In particular, as mentioned above, said neural network 3 is designed to be able to process low-resolution images, such as 32x32 pixels. This is satisfied by implementing said network 3 with a low number of downsampling layers (i.e., fewer than five layers with a stride factor of 2). In the present embodiment, two downsampling layers are used.

Moreover, the architecture of said neural network 3 is designed to be compatible with microcontrollers (e.g., devices with low memory occupancy, less than 512 kB, as shown in figure 17).

In particular, in the present embodiment, said neural network 3 comprises an encoder 30 and three decoders 31.

The shallow encoder 30 extracts a three-level pyramid of features using six 3x3 convolutional layers 300, 301 followed by Leaky ReLU activations having α = 0.125, producing respectively 8, 8, 16, 16, 32, and 32 features.

The convolutional layers 300, 301 contain a set of filters, with height and width equal to 3. Each filter is convolved with the input activation map to produce the output activation map. The Leaky ReLU is a rectifier function applied to each pixel of the input activation maps. In particular, the pixel value is scaled by α if negative, otherwise it passes unaltered.

According to figure 4, layers 301 apply a stride factor of 2, halving the spatial resolution. The stride factor defines the step used to apply the convolution kernel to the input. A stride factor of 2 means that the kernel is convolved over a patch centered on a given pixel, then it skips one pixel before convolving over the next one.

Then, three decoders 31 made of three convolutional layers, followed by leaky ReLU (except the last one), process each level of the pyramid producing 32 features each.

The output of the last layer is up-sampled through a 2x2 transposed convolution layer 302. The extremely compact architecture, counting barely 100K parameters, is thought to run on tiny resolution images and thus is tailored to low-power devices such as MCUs.
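Purely as an illustrative aid, the PyTorch sketch below mirrors the topology just described (six 3x3 convolutions producing 8, 8, 16, 16, 32 and 32 features with Leaky ReLU, α = 0.125, two stride-2 layers, three small decoders producing 32 features each, and a 2x2 transposed convolution for up-sampling). Which of the six encoder layers apply the stride factor of 2, the wiring between pyramid levels and the sigmoid output head are assumptions of this sketch, not details taken from the text.

```python
# Sketch of a muPyD-Net-like topology; layer placement details are assumptions.
import torch
import torch.nn as nn

def conv(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.LeakyReLU(0.125),
    )

class TinyPyramidNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: a three-level pyramid of features (8, 8, 16, 16, 32, 32 channels).
        self.enc = nn.ModuleList([
            conv(3, 8, 1), conv(8, 8, 1),        # level 0 (full resolution)
            conv(8, 16, 2), conv(16, 16, 1),     # level 1 (1/2 resolution)
            conv(16, 32, 2), conv(32, 32, 1),    # level 2 (1/4 resolution)
        ])
        # One small decoder per pyramid level, each producing 32 features;
        # the last convolution of each decoder carries no activation.
        def decoder(in_ch):
            return nn.Sequential(conv(in_ch, 32, 1), conv(32, 32, 1),
                                 nn.Conv2d(32, 32, 3, padding=1))
        self.dec = nn.ModuleList([decoder(8 + 32), decoder(16 + 32), decoder(32)])
        self.up = nn.ConvTranspose2d(32, 32, kernel_size=2, stride=2)   # 2x2 up-sampling
        self.head = nn.Conv2d(32, 1, 3, padding=1)   # single-channel inverse depth

    def forward(self, x):
        feats, f = [], x
        for i, layer in enumerate(self.enc):
            f = layer(f)
            if i in (1, 3, 5):                   # keep the output of each pyramid level
                feats.append(f)
        # Coarse-to-fine decoding: each level refines the up-sampled estimate
        # coming from the level below it (a simplified pyramid wiring).
        d = self.dec[2](feats[2])
        d = self.dec[1](torch.cat([feats[1], self.up(d)], dim=1))
        d = self.dec[0](torch.cat([feats[0], self.up(d)], dim=1))
        return torch.sigmoid(self.head(d))       # inverse depth map at input resolution
```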

This, coupled with appropriate image resolutions, i.e., 48x48 and 32x32, allows for breaking the 512 kB memory and 1 FPS barriers on such a low-powered device, as it will be shown in detail in the present evaluation.

In particular, adding more layers either to the feature extractor or to the decoders would make one or both of these requirements impossible to meet. With reference to figure 17, it can be noticed that the memory requirements of the current technical solution fall below 512 kB, while known architectures require nearly 2 MB. Furthermore, adding more convolutional layers to the encoder 30 or to the decoders 31 introduces more computations to perform, hence a higher computational complexity, and a larger number of weights to store on-device, hence a higher memory footprint.

Moreover, the experiments will prove how processing 48x48 and 32x32 images allows for deployment on such a family of devices while still sourcing results accurate enough for several high-level applications.

In addition to the issues induced by processing low-resolution images, such as loss of details, providing supervision at such resolution becomes challenging.

In particular, when the annotation is sparse like in the KITTI dataset [6], downsampling such sparse depth data to the small input resolution of the network would make labels no longer reliable because of interpolation. For this reason, during training, a proxy-supervision [10] deploying a traditional stereo algorithm such as Semi-Global Matching (SGM) [22] is used.

Proxy-Supervision

Since sourcing accurate depth labels is expensive and time-consuming, several works replaced the need for accurate ground truth labels using view synthesis [12], [36], deploying pairs of synchronized images acquired by a stereo camera to exploit a re-projection loss for supervision [12].

That is, given a stereo pair made of images L and R, the network 3 is trained to infer from L an inverse depth map (i.e., disparity) DL. Then, R is warped according to DL to obtain a warped image. A first photometric loss function Llap, calculated between L and this warped image as defined in [12], supervises the network 3.

The network 3 can also be trained to infer a synthetic disparity map DR for the right image R, so as to enforce consistency between the two.

In this case, a second photometric loss function Lrap can be obtained by comparing R with the warped left image.
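A minimal sketch of the warping and photometric supervision just described is given below, following the formulation of [12]. The grid-sampling details and the 0.85 SSIM/L1 weighting are assumptions of this sketch (the weighting is the value used in [12], not a value stated in this text), and ssim_fn is a hypothetical SSIM implementation supplied by the caller.

```python
# Illustrative disparity-based warping and photometric (re-projection) loss.
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disp_left):
    """Reconstruct the left view by sampling the right image at x - d(x)."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(right) - disp_left.squeeze(1)          # shift columns by the disparity
    ys = ys.to(right).expand(b, -1, -1)
    grid = torch.stack(((2 * xs / (w - 1)) - 1, (2 * ys / (h - 1)) - 1), dim=-1)
    return F.grid_sample(right, grid, align_corners=True, padding_mode="border")

def photometric_loss(left, warped, ssim_fn, weight=0.85):
    # Weighted combination of structural dissimilarity and absolute difference.
    l1 = (left - warped).abs().mean()
    ssim = ((1.0 - ssim_fn(left, warped)) / 2.0).mean()
    return weight * ssim + (1.0 - weight) * l1
```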

In this field, a further step forward consists of using traditional stereo algorithms [21] to produce noisy disparity estimations and leveraging them for supervision. In particular, this strategy proved to be very effective for self-supervised training of both stereo [56], [57] and monocular [10], [57] networks, outperforming re-projection losses. This strategy has been implemented to avoid downsampling of sparse ground truth labels, which would result in even sparser annotations or in incorrect values introduced by interpolation where the depth data is not available at the reduced resolution.

Purposely, as in [10], the SGM algorithm [22] is used for generating dense proxy labels from stereo pairs. For each pixel p and disparity hypothesis d, a Hamming matching cost C(p, d) is computed between 9x7 census-transformed images; it is then refined according to multiple scanline optimizations, with P1 and P2 two smoothness penalties discouraging disparity gaps between p and the previous pixel p_r along the scanline path. A winner-takes-all strategy is applied after summing up the outcome of each optimization phase.
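
The scanline optimization equation was not reproduced in the text; the standard SGM recursion from [22], which the above description matches, is:

$$
L_{r}(p,d) = C(p,d) + \min\Bigl(L_{r}(p-r,d),\; L_{r}(p-r,d-1)+P_{1},\; L_{r}(p-r,d+1)+P_{1},\; \min_{k}L_{r}(p-r,k)+P_{2}\Bigr) - \min_{k}L_{r}(p-r,k),
$$

where p - r denotes the previous pixel along scanline direction r; the costs aggregated along all scanlines are summed before the winner-takes-all selection.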

Finally, the left-right consistency constraint [21] is enforced to filter out outliers as follows.

By computing disparity maps D^L and D^R with SGM, assuming the left and right image as reference respectively, pixels having different disparities across the two maps are invalidated, wherein ε is a threshold value enforcing consistency over the left and right depth maps, set to 3.
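
The consistency check itself was not reproduced in the text; a plausible reconstruction of the standard left-right check described above is:

$$
D^{L}(p) \leftarrow
\begin{cases}
D^{L}(p) & \text{if } \bigl|D^{L}(p) - D^{R}\bigl(p - D^{L}(p)\bigr)\bigr| \le \varepsilon,\\[4pt]
\text{invalid} & \text{otherwise,}
\end{cases}
$$

with ε = 3 as stated above.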

As said, to validate the method, the high-resolution KITTI dataset [6] has been used. Hence, in order to obtain proxy labels as accurate as possible, SGM has been run on images at the original resolution W x H; the resulting maps have then been downsampled to 48x48 and 32x32 using nearest-neighbor interpolation, and the disparity values scaled accordingly (by the ratio between the target width and W) to effectively obtain proxy labels at the same resolution of the network inputs. It is worth noting that such labels are not entirely dense, since outliers are filtered out by enforcing the left-right consistency constraint.
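
The scaling factor was omitted in the text; since disparity is measured in pixels along the horizontal axis, a consistent reconstruction of the rescaling is:

$$
d_{\downarrow}(p) = \frac{w_{\downarrow}}{W}\,d\bigl(p'\bigr), \qquad w_{\downarrow}\in\{48,32\},
$$

where p' is the full-resolution pixel selected by nearest-neighbor sampling for the low-resolution pixel p.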

Nonetheless, most points survive this process, and each valid value available in the inverse depth map is obtained without any interpolation from nearby points. The obtained labels are then used to provide supervision to μPyD-Net employing a reverse Huber (berHu) loss [58], where d(p) and the proxy annotation are, respectively, the predicted disparity and the proxy label for pixel p, while the threshold c is set proportional to the maximum absolute error, with α = 0.2.
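
The berHu formulation was not reproduced in the text; the standard definition from [58], consistent with the description above (with $\tilde{d}(p)$ denoting the proxy label), is:

$$
L_{ps} = \sum_{p} B\bigl(d(p)-\tilde{d}(p)\bigr), \qquad
B(e) =
\begin{cases}
|e| & \text{if } |e|\le c,\\[4pt]
\dfrac{e^{2}+c^{2}}{2c} & \text{otherwise,}
\end{cases}
\qquad
c = \alpha \max_{p}\bigl|d(p)-\tilde{d}(p)\bigr|,
$$

with α = 0.2 as stated above.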

Therefore, L^l_ps is a first proxy-supervision loss function correlated with said first depth map d obtained from said first training substep 200 and a second left depth map obtained from said second training substep 201, according to said left depth map D^L. L^r_ps is a second proxy-supervision loss function correlated with said first depth map d obtained from said first training substep 200 and a second right depth map obtained from said second training substep 201, according to said right depth map D^R.

The experimental results below will show how, despite the meager resolution of the images processed, μPyD-Net achieves, on the standard KITTI dataset [6], a depth accuracy comparable to seminal works [7], [8], although not on par with the current state of the art [9]-[11].

However, this is not surprising given the much higher resolution and the thousands of times more complex models of the latter methods, out of reach for the MCUs used in applications at the edge.

A qualitative assessment is given through the use-case reported in figure 5, which shows that the coarse maps inferred by μPyD-Net can be used for vehicle detection seamlessly, by simply looking for differences in the coarse 3D structure of the scene with respect to the structure of the environment itself in absence of vehicles.

Intuitively, with lighter input images the hardware requirements reduce substantially, bringing the task within the computing capability of MCUs. Obviously, the limited operating frequency and low parallelism of MCUs prevent real-time performance (i.e., >30 FPS). However, the focus is on applications with relaxed timing constraints, like traffic congestion monitoring, and not on the fast decision making needed, for instance, in autonomous driving.

However, in the case of decision-making systems, the present invention can be easily ported to mobile CPUs in order to gain about 100x performance at the cost of only 10x power consumption.

Moreover, in many circumstances, processing low-resolution images is even desirable. For instance, depth predictions can be used as privacy-preserving features for further processing in cloud systems.

Commonly referred to as collaborative intelligence [19], this strategy makes it possible to preserve user privacy by masking sensitive data present in raw RGB images.

An example is depicted in figure 6, which applies to an image extracted from the VAP dataset [18]. Distinctive features in data managed by the staff in charge of monitoring the environment may violate users’ privacy.

Although low-resolution images partially help to alleviate this issue, sufficiently distinctive clues can still be inferred as it can be noticed from Figure 6 c).

However, by moving to a pure depth domain as in figure 6 d), one can hide details while keeping the relevant information required for the high-level task. Bringing these sensing capabilities to tiny devices is paramount to reduce design costs and power consumption of systems based on RGB-D cameras (e.g., the Kinect in (b)) or on a standard network for monocular depth estimation (d).

The map inferred by μPyD-Net (e) and post-processed by a super-resolution network running on the cloud (f) achieves results comparable to the two baselines. Working with low-resolution inputs is beneficial in terms of the number of operations to be processed, but also of the memory footprint needed to store hidden feature maps. However, an in-depth analysis conducted in the section on hardware-related metrics demonstrates that resolution scaling alone is not enough to fit current models on MCUs and that achieving this goal requires much more design and optimization effort. This motivates the need for inference models like the proposed μPyD-Net which, thanks to its compact topology, a novel training procedure, and the optimization pipeline, ensures prediction quality not far from that of standard techniques processing high-resolution images.

Figures 7A and 7B show, from top to bottom, some qualitative examples of low-resolution images (48 x 48), followed by the proxy labels generated by SGM and the disparity maps estimated by μPyD-Net.

It can be noticed how the network 3 accurately reproduces inverse depth estimations consistent with the self-sourced annotations. Finally, as in [10], the contribution given by the photometric loss is summed with the proxy supervision in order to obtain the final loss L_init.
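
The combined loss equation was not reproduced in the text; a plausible reconstruction, consistent with the weights α_ap and α_ps discussed below and with the per-view terms defined above, is:

$$
L_{init} = \alpha_{ap}\bigl(L^{l}_{ap} + L^{r}_{ap}\bigr) + \alpha_{ps}\bigl(L^{l}_{ps} + L^{r}_{ps}\bigr).
$$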

In particular, α_ap and α_ps have been tuned following [10]. In the present embodiment, both α_ap and α_ps are set to 1. However, their value can vary from 0 (cancellation of the corresponding term) to any positive value, and increasing a weight increases the influence of its term. The second term is the most important, so it is not advisable to favour the first over the second.

Although SGM is effective, additional sources of proxy labels can be stereo networks trained in a self-supervised manner with photometric losses, as shown in [59], or in a supervised manner at the cost of requiring ground truth labels. In the ablation experiments, the impact of the different strategies has been studied.

Optimization step

The optimization step 21 allows the deployment of the network 3 onto Cortex-M MCUs. In particular, as shown in figure 2, the optimization step 21 comprises the following substeps:

- a front-end or quantizing substep 210, wherein the network 3 is quantized using 8-bit fixed-point representation;

- a fine-tuning substep 211; and

- a back-end or transforming substep 212, wherein the high-level description of the quantized network is translated into low-level routines optimized for the target device.

In particular, the quantizing substep 210 is carried out with a linear scheme with power-of-two scalings, in order to efficiently exploit the integer data-path of the Cortex-M architecture.

More in detail, a dynamic approach has been adopted by which the radix point of both feature maps and weights is assigned layer by layer. To calculate the optimal radix point, a simple heuristic has been developed that returns the optimal fraction length for each layer, such that the mean squared error between the original floating-point distribution and the quantized one is minimized.

For the intermediate features, statistics have been collected on a subset of the training set (referred to as the calibration set).
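
A minimal sketch of such a heuristic is given below, under the assumption of 8-bit signed values and symmetric linear quantization with power-of-two scaling; the function and variable names are hypothetical and not part of the original disclosure.

import numpy as np

def quantize(x, frac_bits, total_bits=8):
    """Symmetric linear quantization with a power-of-two scale (radix point at frac_bits)."""
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale  # de-quantized value, used to measure the quantization error

def best_fraction_length(samples, total_bits=8, candidates=range(0, 8)):
    """Return the fraction length minimizing the MSE between float and quantized values."""
    errors = {f: np.mean((samples - quantize(samples, f, total_bits)) ** 2) for f in candidates}
    return min(errors, key=errors.get)

# Usage: 'weights' would be a layer's weight tensor, 'acts' the activations of that layer
# collected on the calibration set.
weights = np.random.randn(1000) * 0.1
acts = np.abs(np.random.randn(5000)) * 4.0
print(best_fraction_length(weights), best_fraction_length(acts))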

Then, the accuracy loss due to the quantizing substep 210 is recovered through the fine-tuning substep 211, based on knowledge distillation [60].

The quantized model, set as the student, is re-trained to mimic the original floating-point network, the teacher. As training loss, the mean squared error between the disparity maps inferred by the two actors (teacher and student) has been adopted.

An important aspect to be noticed is that the re-training of the quantized model encompasses the execution of the integer model. Since GPUs do not support integer arithmetic (at least those available in the setup), an emulation framework has been implemented, built upon the concept of fake-quantization [61] and tuned to be compliant with the arithmetic units of the Cortex-M cores.

It is a software wrapper that converts activations and weights (stored in fixed-point) to 32-bit floating-point; after being processed, results are converted back to fixed-point.
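
A minimal sketch of such a fake-quantization wrapper, under the same assumptions as above (8-bit signed fixed-point values, power-of-two scales; names and the fully-connected example layer are hypothetical), could look as follows:

import numpy as np

def fake_quantize(x, frac_bits, total_bits=8):
    """Emulate fixed-point storage in floating-point: quantize, clip, de-quantize."""
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (total_bits - 1)), 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), qmin, qmax) / scale

def fake_quantized_layer(x, weight, bias, frac_x, frac_w, frac_out):
    """A layer whose inputs, weights and outputs mimic int8 fixed-point arithmetic."""
    xq = fake_quantize(x, frac_x)
    wq = fake_quantize(weight, frac_w)
    y = xq @ wq + bias                 # computed in float32, as on the GPU
    return fake_quantize(y, frac_out)  # results converted back to the output radix point

# Usage with random data and per-tensor fraction lengths chosen by the heuristic above
x, w, b = np.random.randn(4, 16), np.random.randn(16, 8) * 0.1, np.zeros(8)
y = fake_quantized_layer(x, w, b, frac_x=5, frac_w=6, frac_out=4)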

The flow has been experimentally tested, and the results produced by emulation exactly match those collected on the target hardware. After quantization and fine-tuning, the network is ready to be deployed on the target device.

The porting stages leverage the CMSIS-NN library developed by Arm. It is a collection of handwritten routines that ensure efficient processing of integer CNNs. Unfortunately, the CMSIS-NN was mainly designed for simple tasks, like image classification and keyword spotting [49], hence it supports a limited set of operators.

The library has been augmented with optimized routines for the missing operators: deconvolution and leaky ReLU. For deconvolution, the input features are upsampled using a factor equal to the stride, then convolved with unit stride. For the leaky ReLU, the slope is constrained to be a power-of-two, hence it can be implemented with a simple shift operation. It has been observed that this choice achieves better performance with no impact on the final accuracy.
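
As an illustration of the two added operators, an integer-domain sketch in Python is reported below (the actual CMSIS-NN routines are hand-written C and are not reproduced here; the names are hypothetical, and zero-insertion is assumed as one common realization of the up-sampling step):

import numpy as np

def leaky_relu_shift(x_int, shift=3):
    """Leaky ReLU with a power-of-two slope (1/2**shift), realized with an arithmetic shift."""
    return np.where(x_int >= 0, x_int, x_int >> shift)  # e.g., shift=3 -> slope 1/8

def deconv_as_upsample_conv(x, weight, stride=2):
    """Transposed convolution emulated by zero-insertion up-sampling followed by a unit-stride
    convolution. x: (H, W) feature map, weight: (kH, kW) kernel (single channel, for clarity)."""
    H, W = x.shape
    up = np.zeros((H * stride, W * stride), dtype=x.dtype)
    up[::stride, ::stride] = x                              # insert zeros between input samples
    kH, kW = weight.shape
    pad = np.pad(up, ((kH // 2, kH // 2), (kW // 2, kW // 2)), mode="constant")
    out = np.zeros_like(up, dtype=np.int64)
    for i in range(out.shape[0]):                           # naive unit-stride convolution
        for j in range(out.shape[1]):
            out[i, j] = np.sum(pad[i:i + kH, j:j + kW] * weight)
    return out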

Experimental results

In this section, the datasets and implementation details are described, and exhaustive experiments are reported, aimed at assessing the performance of the network according to both functional (i.e., accuracy) and extra-functional (i.e., latency and memory footprint) metrics. The energy consumption is directly proportional to latency. This two-fold evaluation will show how the network performs in terms of depth accuracy compared to much more complex state-of-the-art solutions, highlighting its far superior efficiency and suitability for low-power MCU platforms.

Dataset and training

The quantitative evaluation primarily involves two datasets: KITTI [6] and CityScapes [62]. Moreover, the Make3D dataset [30] is used for additional experiments concerning the generalization of the network and of other state-of-the-art models.

The KITTI stereo dataset [6] is a collection of rectified stereo pairs made up of 61 scenes (more than 42K stereo frames) concerned with driving scenarios and it is the standard dataset for evaluating monocular depth estimation methods as well as for many other purposes. The average image resolution is 1242 x 375 and a LiDAR device, mounted and calibrated in proximity to the left camera, was deployed to measure depth information.

Following other works in this field [7], [12], the overall dataset has been divided into two subsets, composed respectively of 29 and 32 scenes. This subdivision is referred to as the Eigen split. In particular, 697 frames belonging to the first group have been used for testing purposes and 22600 more, taken from the second, for training.

The CityScapes dataset [62] contains stereo pairs concerning about 50 cities in Germany, taken from a moving vehicle in various weather conditions. It consists of 22,973 stereo pairs with a resolution of 2048 x 1024 pixels.

It is often used for pre-training [10], [12], [37] before moving to the KITTI dataset, but not for evaluation, since no ground truth maps are provided (only disparity maps computed with SGM). As in [12], the lowest 20% of each stereo pair is discarded at training time.

The Make3D dataset [30] consists of a set of images and depth maps from a custom-built 3D scanner, collected during daytime in a diverse set of urban and natural areas in the city of Palo Alto and its surrounding regions. It contains 534 images at 1704x2272 resolution. Experiments have been run on the 134 testing images without retraining, as in [12], [63].

Hardware Set-up

The μPyD-Net is tested and validated on a NUCLEO-F767ZI [66] development board manufactured by ST-Microelectronics. It hosts a chipset powered by an Arm Cortex-M7 CPU with 216 MHz clock frequency, 512 kB of SRAM and 2 MB of flash memory. As reported in the datasheet [67], the current consumption is ≈100 mA for a data-intensive application run under the same operating conditions of the experiments, hence a resulting power consumption <400 mW.

The C-code description of the optimized μPyD-Net model is built using the GNU Arm Embedded Toolchain, version 6.3.1, and flashed onto the board using the mbed-cli toolchain.

The optimization framework is run on a workstation powered by a GPU NVIDIA GTX-1080 Ti. Extensive simulations on the KITTI dataset validated the integer emulation engine integrated into the front-end side. Collected traces show 100% accuracy with respect to on-board measurement.

Evaluation - Functional metrics

Predictions are evaluated according to standard functional metrics [7], [12]: Abs Rel, Sq Rel, RMSE and RMSE log represent error measures (the lower, the better), while δ < K is the percentage of predictions whose maximum between the ratio and the inverse ratio with respect to the ground truth is lower than a threshold K (the higher, the better). The detailed formulation of each metric can be found in [7].
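
For the reader's convenience, these metrics are recalled below in their commonly used form (with d_p the predicted depth, d*_p the ground truth at pixel p, and N valid pixels; see [7] for the exact definitions adopted):

$$
\begin{aligned}
\text{Abs Rel} &= \frac{1}{N}\sum_{p}\frac{|d_p - d^{*}_p|}{d^{*}_p}, &
\text{Sq Rel} &= \frac{1}{N}\sum_{p}\frac{(d_p - d^{*}_p)^{2}}{d^{*}_p},\\[4pt]
\text{RMSE} &= \sqrt{\frac{1}{N}\sum_{p}(d_p - d^{*}_p)^{2}}, &
\text{RMSE log} &= \sqrt{\frac{1}{N}\sum_{p}(\log d_p - \log d^{*}_p)^{2}},\\[4pt]
\delta < K &= \frac{1}{N}\Bigl|\Bigl\{p : \max\Bigl(\tfrac{d_p}{d^{*}_p},\tfrac{d^{*}_p}{d_p}\Bigr) < K\Bigr\}\Bigr|. &&
\end{aligned}
$$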

Since the performance of μPyD-Net is bounded by the accuracy of the proxy labels used for supervision, different strategies to source such data, and their behavior at low resolution, have been studied.

In particular, although SGM represents a popular choice and a good trade-off in terms of accuracy and speed, more accurate methods exist. To this aim, labels obtained by means of SGM and by distillation [59] from two state-of-the-art neural networks, trained respectively with self-supervision and with ground truth, are considered.

Specifically, UnOS by Wang et al. [64], currently the state of the art for self-supervised stereo, and DispNet-CSS by Ilg et al. [65] have been chosen.

For both networks, the weights made available by the authors have been used, trained respectively with the photometric loss on the full KITTI dataset (UnOS) or with ground truth on the SceneFlow dataset [26] and fine-tuned on the KITTI 2015 training set (DispNet-CSS).

In particular, since ground truth labels are required to train DispNet-CSS, distilling proxy labels through this model adds some constraints to such a solution.

The first table shown in figure 8 collects results concerning a comparison between the three, conducted on the Eigen test split.

At first, the accuracy of the disparity maps processed at full resolution is reported, to highlight the importance of spatial resolution. SGM and UnOS achieve almost equivalent performance on error metrics, with DispNet-CSS producing better results.

Concerning the delta metrics, SGM produces better accuracy. In particular, considering the 48 x 48 resolution, four main experiments are reported, evaluating respectively the disparity maps obtained by running SGM on 48x48 images and by either SGM, UnOS or DispNet-CSS at full resolution and then downsampled by means of nearest-neighbor interpolation (↓ 48 x 48), i.e., to the resolution used to train μPyD-Net.

UnOS or DispNet-CSS cannot be run at 48 x 48 because of their high compression factor (1/64), which requires larger images. Extremely bad performance is achieved by running SGM on 48 x 48 images, because of the high quantization of pixels at this resolution, making this solution unreliable both for applications on microcontrollers and for training neural networks to run on them.

Conversely, labels computed at full resolution and downsampled to 48 x 48 maintain acceptable performance, with DispNet-CSS labels resulting much more accurate. In general, SGM and UnOS are close in performance, with the former resulting slightly more accurate and thus preferable to train μPyD-Net.

The same behavior can be observed by running experiments at 32 x 32 resolution. Although DispNet-CSS labels show much higher accuracy compared to SGM and UnOS, they need ground truth labels to be obtained. It will be highlighted next how training μPyD-Net on DispNet- CSS rather than SGM achieves only minor improvements, thus making SGM better suited for practical applications.

Finally, the effectiveness of μPyD-Net variants trained with different sources of self-supervision has been studied.

The second table shown in figure 9 collects results of 48 x 48 and 32 x 32 models trained respectively with image reprojection losses [12], proxy labels sourced through SGM algorithm, UnOS and DispNet-CSS.

At first, it is pointed out how supervision from photometric losses performs much worse compared to the use of proxy labels. Although this approach is extremely popular [1], [12], [37], intuitively the image content at such low resolution is much poorer than that available at the original resolution, thus leading to poor supervision.

Exploiting the guidance of accurate disparity maps at training time allows greatly boosting the accuracy achieved by μPyD-Net. Nevertheless, although the proxy labels show different accuracy according to the first table (in particular, comparing rows 4, 5, 6 and 8, 9, 10), sourcing supervision from the SGM algorithm results slightly better than using UnOS, with a minor margin compared to DispNet-CSS, even though DispNet-CSS needs ground truth for training, unlike SGM.

Thus, for practical applications, the SGM solution is preferred because of the moderate margin with respect to DispNet-CSS and its much greater flexibility.

The third table shown in figure 10 compares μPyD-Net with state-of- the-art solutions for monocular depth estimation.

In particular, the upper portion of the table contains complex architectures with millions of trainable parameters, suited only for high-end GPUs (e.g., the NVIDIA Titan XP).

On the other hand, the lower portion of the table lists networks with much lower computational and memory requirements, compatible with a broader range of devices.

Moreover, the resolution of the input image processed is also reported for each network. At first glance, the large gap between the amount of information processed by μPyD-Net and by the other proposals can be noticed.

However, shrinking the image by this extreme factor (down to 1/38 for width, 1/11 for height in the case of 32 x 32 images) has a non-negligible impact on the input image fed to μPyD-Net. This degradation is particularly evident for small objects at longer distances or thin structures such as poles, causing higher errors compared to the ground truths acquired at full resolution.

For this reason, the third table also includes the lightweight PyD-Net [1] network, which processes much larger 256 x 512 images. Nonetheless, it is important to notice that, even stretching the input of other proposals to either 48x48 or 32x32, they would not be able to run on the target hardware device due to excessive memory requirements. Moreover, PyD-Net [1] would not be compatible with such tiny image sizes since its pyramidal structure is too deep. However, as pointed out by previous studies in other fields [68], [69], the image content encoded in such tiny images is still enough to obtain a coarse estimation of the scene, comparable to state-of-the-art techniques proposed just a few years ago [7], [8], with a hundred times fewer parameters and computational requirements. Not surprisingly, 48 x 48 input images yield better results compared to 32 x 32.

To further support that μPyD-Net is effective at extracting most of the knowledge available from low-resolution images, the performance achieved by two state-of-the-art networks, MonoDepth2 [63] and MonoResMatch [10], when processing 32x32 images is shown. Since the latter cannot process such tiny images because of architectural limitations, low-resolution images are simulated by down-sampling the inputs to 32 x 32 (↓ 32 x 32) and then up-sampling (↑) them back to the original resolution.

The fourth table shown in figure 12 collects the outcome of this evaluation, showing how μPyD-Net places in between the two competitors for most metrics (i.e., Sq Rel, RMSE, RMSE log, δ < 1.25, δ < 1.25^3) although counting two orders of magnitude fewer parameters.

This supports the fact that μPyD-Net itself is enough to extract most of the information available from low-resolution content, while keeping complexity low. This latter property is crucial for deployment on the target microcontrollers, on which the parameters of MonoDepth2 and MonoResMatch alone would not fit in the available memory.

The farther points in the scene are those most affected by the degradation introduced by processing tiny images since each pixel senses a larger portion of the real scene.

Therefore, the accuracy of μPyD-Net when sensing the scenes included in the datasets at different ranges is assessed next.

The fifth table shown in figure 13 reports a detailed comparison between μPyD-Net and its optimized counterpart considering different depth ranges, from 0m to 15, 25, 50 and 80m.

In particular, the upper portion reports results obtained by μPyD-Net processing 32 x 32 images, while the middle portion reports results obtained with the same network on 48 x 48 images.

At the very bottom of the same table, the results yielded by the state of the art [10] are reported for comparison. It can be noticed in general how, independently of the input resolution and evaluation range, introducing quantization dramatically drops the performance of μPyD-Net (float32 vs int8 entries), as already observed in [2].

However, by fine-tuning the model after quantization (int8-ft entries), the original performance is restored for most metrics and sometimes even improved.

Focusing on how the metrics change across the different evaluation ranges, it can be perceived that, on nearby measurements, the gap between μPyD-Net and the much more complex state of the art [10] gets lower.

For instance, by looking at the RMSE metric, it can be observed how the difference in terms of average error is about 3 when considering the full evaluation range 0-80m, while it drops to about 0.6 and 0.8 respectively for 48 x 48 and 32 x 32 images when dealing with the 0-15m range.

This behavior suggests that μPyD-Net might not be particularly suited for long-range depth measurements. However, for close-range depth sensing, it provides a valid alternative when a low power budget is paramount.

As said, figure 11 shows a qualitative example of a traffic monitoring system processing images from the KITTI dataset [6] downsampled to 48 x 48 resolution.

In particular, four images acquired from a static point of view, to simulate a monitoring camera placed at a crossroad, are shown. For each one, the downsampled counterpart, the estimated disparity map sourced by μPyD-Net and a segmentation map detecting objects in the scene, superimposed on the original KITTI image, are reported on the right.

To this aim, given the depth layout of the empty scene (i.e., in the absence of vehicles) estimated by μPyD-Net, a simple change-detection algorithm in the depth domain is sufficient to detect nearby cars reliably.
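
A minimal sketch of such a depth-domain change detection is given below; the threshold, the minimum-area filter and the function names are hypothetical, not part of the original disclosure.

import numpy as np

def detect_changes(background_disp, current_disp, threshold=0.1, min_area=4):
    """Flag pixels whose inverse depth deviates from the empty-scene layout,
    then keep the detection only if enough pixels changed (a crude area filter)."""
    mask = np.abs(current_disp - background_disp) > threshold
    return mask if mask.sum() >= min_area else np.zeros_like(mask)

# Usage with 48x48 inverse depth maps inferred by the network
background = np.random.rand(48, 48) * 0.2        # placeholder for the empty-scene map
current = background.copy()
current[20:30, 10:20] += 0.5                     # a nearby object raises the inverse depth
print(detect_changes(background, current).sum()) # number of changed pixels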

Although this application allows for simple traffic monitoring, the depth cue provided by μPyD-Net can be exploited for other purposes (e.g., 3D tracking) and to replace other sensors as well. Therefore, such information could be used in place of other sensors or to enrich other image-based cues such as object detection or semantic segmentation.

In order to assess the generalization properties, the sixth table shown in figure 14 reports results on the Make3D dataset [30] following the evaluation proposed in [63], on a center crop of 2x1 ratio and without applying median scaling (not required when training on stereo pairs).

In particular, it is pointed out how state-of-the-art networks suffer from huge drops when moved to unseen environments as well. μPyD-Net can still provide meaningful predictions, very close to those by MonoDepth when running at 48 x 48.

Figure 15 shows some qualitative examples, showing in particular how the coarse disparity maps by μPyD-Net are often less affected by artifacts with respect to the predictions by MonoResMatch.

Evaluation - Hardware-related metrics

The shift from high-performance GPUs to ultra-low-power MCUs encompasses the evaluation of hardware-related metrics besides accuracy, i.e., latency and memory, in order to assess the portability and efficiency.

As shown in the third table of figure 10, the number of parameters of standard monocular networks exceeds by far the memory constraints of commercial MCUs, preventing the deployment on the edge and therefore a direct comparison with μPyD-Net.

For this reason, this section focuses on the hardware characterization of μPyD-Net, demonstrating that the adopted architectural choices are mandatory to guarantee compliance with the limited resources of the hosting system.

The seventh table shown in figure 16 reports the hardware-related metrics measured at run-time on the NUCLEO-F767ZI board: RAM usage and execution time (averaged on 100 inference runs).

The μPyD-Net reaches a throughput of 3.4 FPS, which can be considered a remarkable result given the limited power budget of the adopted device. Moreover, the collected results demonstrate that input resolution is an effective knob in the accuracy-latency-memory space: a resolution of 32 x 32 enables 38% memory savings and 2.24x higher throughput compared to 48 x 48.

This comes at the cost of some accuracy loss (as already shown in the fifth table). However, this might be a false problem, as errors can be masked by subsequent processing stages. The resolution is indeed a design choice, and it should be weighed depending on the requirements of the downstream application.

Even though the limited computational resources of the hosting MCU prevent real-time processing even for such a compact network, the measured performance meets the requirements of the applications described in Section II.

However, if higher power and area budgets are available, μPyD-Net can be ported to more powerful systems and its application extended to other use cases. To assess the scalability of μPyD-Net from the IoT to the embedded segment, its performance has been tested on the mobile CPUs (Arm Cortex-A53) adopted in a previous work [2].

In this system, μPyD-Net processes up to 320 frames/s, a 94x boost that comes at the cost of 10x power consumption (~4 W).

It might seem like the high efficiency brought by μPyD-Net is simply due to the input rescaling, making the proposal appear a relatively naive approach. A more detailed analysis reveals that the design of μPyD-Net goes beyond this simplistic view.

On the one hand, it is correct that a lower input space contributes to the reduction of the memory footprint as all the inner feature maps get intrinsically smaller. On the other hand, what makes μPyD-Net smaller and faster, hence less energy hungry and able to fit tiny MCUs with marginal accuracy loss, lies in the topology of the network.

Input resolution scaling alone is not enough, but when jointly applied on the structure of μPyD-Net it enables design options that would not be possible otherwise. Like other pyramidal architectures, μPyD-Net applies a coarse to fine strategy where information is processed in a hierarchical manner. Since features of higher semantic level are inferred layer by layer traversing the pyramid bottom-up, it is intuitive to understand that the lower the resolution of the input image, the lower the number of layers needed to achieve a certain accuracy.

This is a general trend also recognized in other deep learning models, but on the specific case of μPyD-Net it has a much higher impact. With smaller inputs it is in fact possible to compress the topology by reducing the number of encoders and decoders, and not just their size, thus achieving aggressive RAM reduction.

To support this analysis, the bar chart in figure 17 shows the memory footprint vs. input resolution of PyD-Net (hatched bars) and μPyD- Net (plain bars), both quantized to 8-bit; the comparison is made splitting the contributions of weights (blue) and inner features (orange). In particular, the horizontal black line marks the RAM constraint (512 kB).

For both networks, the dimensionality of the internal activations is re-scaled according to the input size. It is worth noticing that the minimum input resolution of PyD-Net is 64x64, since the input image is down-sampled by a factor of 2^6 across the pyramidal encoders. For this reason, results below this resolution are not reported.

Therefore, first of all, PyD-Net runs out of space and cannot fit into MCUs because the size of the weights does not scale with the input resolution. Even using the smallest input size (i.e., 64 x 64), the weight storage is about 2 MB, namely 4x the RAM on board. Second, μPyD-Net shows an activations/weights ratio larger than PyD-Net.

For instance, with the highest input resolution (256 x 128), the RAM taken by the features significantly increases, from 2.3 MB (PyD-Net) to 3.2 MB (μPyD-Net).

The reason is that in PyD-Net the size of the feature maps processed by the topmost decoder, which is the most energy-hungry layer, is half of the input resolution, while in μPyD-Net it is not.

Therefore, μPyD-Net is less suited for high resolutions. However, as soon as the inputs are re-scaled to 48x48, the activation footprint scales well and meets the memory constraint. Working with 32x32 images even leaves some free space for other background applications or tasks.

The μPyD-Net works not just because of the lower cardinality of the input space, but precisely because it has been tailored to the requirements of tiny applications.

As said, depth is of paramount importance in countless practical computer vision applications and the compelling results recently obtained by frameworks aimed at inferring this cue from a single camera have increased the interest for this topic.

However, in most cases these methods require high-end GPUs or sufficiently capable embedded devices, precluding their practical deployment in several application contexts characterized by extremely low-power constraints, such as those involving MCUs.

Therefore, according to the present invention, a two-fold strategy to enable monocular depth estimation on MCUs is proposed.

At first, the implementation of a lightweight Convolutional Neural Network 3 based on a pyramidal architecture, trained in a semi-supervised manner leveraging proxy supervision 210 obtained through a conventional stereo algorithm, capable of inferring accurate depth maps d from the tiny input image I fed to the network 3.

Then, the use of optimization strategies 21 aimed at performing computations with quantized 8-bit data and mapping the high-level description of the network to low-level routines suited for the target architecture.

In particular, exhaustive experimental results and an in-depth evaluation with devices belonging to the popular Arm Cortex-M family, confirm that monocular depth estimation is feasible with devices characterized by low-power constraints as MCUs. Therefore, the present method allows fostering the deployment of monocular depth estimation to new application contexts.

In fact, as mentioned above, the new challenge in the field of computer vision and depth perception calls for a paradigm shift: quality is no longer the only objective, and other extra-functional metrics need to be considered at the software level and during the whole compilation flow. Therefore, finding and implementing proper resource-driven optimization strategies becomes paramount.

As described, the first and foremost assumption made with the present invention is that the processing of high-resolution images is no longer feasible, nor useful, for the kind of applications addressed. It suffices to think that a single HD image may take more memory than that available on-chip. Hence, differently from traditional high-quality vision systems, coarse depth estimation from low-resolution images is the way to meet the stringent hardware constraints of low-power MCUs and, at the same time, a strength in many edge applications concerned with privacy issues [5].

In light of the above, the present invention provides a lightweight architecture of the neural network 3, referred to as μPyD-Net, which is specifically designed for processing low-resolution images (i.e., 48 x 48 pixel or 32 x 32 pixel) on a sub-W power envelope with MCUs. In particular, the network 3 maximizes the savings brought by inputs resolution scaling thanks to an internal topology optimized with hardware- conscious techniques.

Moreover, the present invention provides a semi-supervised training flow from low-resolution images and full-resolution disparity maps based on SGM (hence different compared to previous works [1], [2]) able to deliver enough supervision with low requirements and with equivalent performances of more costly systems.

Therefore, the present invention enables monocular depth estimation on low-power MCUs, such as those based on the Arm Cortex-M7 CPU, and allows obtaining a meaningful coarse depth representation from a tiny image, i.e., down to 32 x 32 pixels, sufficient to tackle high-level applications.

The literature concerning monocular depth estimation and deep networks compression are pertinent to the present invention. In particular, relevant documents concerning the depth estimation are [1]-[2], [7]-[8], [10]-[13], [20], [21]-[29] and [30]-[44], those with regard the quantization are [45]-[51] while documents concerning the pruning technique are [51]- [55].

Advantages

A first advantage of the method according to the present invention is that of estimating depth maps, optical flow and semantics from a single image by deep-learning with low-power devices.

Another advantage of the method according to the present invention is that of inferring accurate depth maps from the tiny input image fed to the network.

A further advantage of the method according to the present invention is that of determining depth maps, optical flow and semantics on microcontrollers, such as those based on ARM architecture by processing low-resolution images using a self-supervised, lightweight network.

The present invention can be integrated into any application/device that needs to exploit dense depth, flow and semantic information obtainable at low cost and power consumption. Application contexts of particular interest include contexts where dense 3D estimation of the surrounding environment is required (e.g., proximity control systems, tracking systems, etc).

The present invention is described by way of example only, without limiting the scope of application, according to its preferred embodiments, but it shall be understood that the invention may be modified and/or adapted by experts in the field without thereby departing from the scope of the inventive concept, as defined in the claims herein.

References

[1] M. Poggi et al., “Towards real-time unsupervised monocular depth estimation on cpu,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 5848-5854.

[2] V. Peluso et al., “Enabling energy-efficient unsupervised monocular depth estimation on armv7-based platforms,” in 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2019, pp. 1703-1708.

[3] R. Sanchez-lborra et al., “Tinyml-enabled frugal smart objects: Challenges and opportunities,” IEEE Circuits and Systems Magazine, vol. 20, no. 3, pp. 4-18, 2020.

[4] D. Wofk et al., “Fastdepth: Fast monocular depth estimation on embedded systems,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 6101-6108.

[5] E. Chou et al., “Privacy-preserving action recognition for smart hospitals using low-resolution depth images,” arXiv preprint arXiv:1811.09950, 2018.

[6] A. Geiger et al., “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11 , pp. 1231- 1237, 2013.

[7] D. Eigen et al., “Depth map prediction from a single image using a multi-scale deep network,” in Advances in neural information processing systems, 2014, pp. 2366-2374.

[8] F. Liu et al., “Learning depth from single monocular images using deep convolutional neural fields,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2024-2039, 2015.

[9] N. Yang et al., “Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 817- 833.

[10] F. Tosi et al., “Learning monocular depth estimation infusing traditional stereo knowledge,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9799-9809.

[11] J. Watson et al., “Self-supervised monocular depth hints,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2162-2171.

[12] C. Godard et al., “Unsupervised monocular depth estimation with leftright consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 270-279.

[13] H. Fu et al., “Deep ordinal regression network for monocular depth estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002-2011.

[14] T. Bagautdinov et al., “Probability occupancy maps for occluded depth images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2829-2837.

[15] S. Sun et al., “Benchmark data and method for real-time people counting in cluttered scenes using depth sensors,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 10, pp. 3599-3612, 2019.

[16] V. Srivastav et al., “Human pose estimation on privacy- preserving low resolution depth images,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 583-591.

[17] M.-R. Lee et al., “Vehicle counting based on a stereo vision depth maps for parking management,” Multimedia Tools and Applications, vol. 78, no. 6, pp. 6827-6846, 2019.

[18] R. Hg et al., “An rgb-d database using microsoft’s kinect for windows for face detection,” in 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems. IEEE, 2012, pp. 42—46.

[19] A. E. Eshratifar et al., “Jointdnn: an efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Transactions on Mobile Computing, 2019.

[20] R. Szeliski, Computer vision: algorithms and applications. Springer Science & Business Media, 2010.

[21] D. Scharstein et al., “A taxonomy and evaluation of dense two- frame stereo correspondence algorithms,” International journal of computer vision, vol. 47, no. 1-3, pp. 7-42, 2002.

[22] H. Hirschmuller, “Accurate and efficient stereo processing by semiglobal matching and mutual information,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2. IEEE, 2005, pp. 807-814.

[23] L. Di Stefano et al., “A fast area-based stereo matching algorithm,” Image and vision computing, vol. 22, no. 12, pp. 983-1005, 2004.

[24] T. Kanade et al., “A stereo matching algorithm with an adaptive window: Theory and experiment,” vol. 16, no. 9. IEEE, 1994, pp. 920-932.

[25] J. Zbontar et al., “Computing the stereo matching cost with a convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1592-1599.

[26] N. Mayer et al., “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4040-4048.

[27] A. Kendall et al., “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 66-75.

[28] J.-R. Chang et al., “Pyramid stereo matching network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5410-5418.

[29] F. Zhang et al., “Ga-net: Guided aggregation net for end-to-end stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 185-194.

[30] A. Saxena et al., “Make3d: Learning 3d scene structure from a single still image,” IEEE transactions on pattern analysis and machine intelligence, vol. 31, no. 5, pp. 824-840, 2008.

[31] I. Laina et al., “Deeper depth prediction with fully convolutional residual networks,” in 2016 Fourth international conference on 3D vision (3DV). IEEE, 2016, pp. 239-248.

[32] Y. Cao et al., “Estimating depth from monocular images as classification using deep fully convolutional residual networks,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 11 , pp. 3174-3182, 2017.

[33] Y. Cao et al., “Monocular depth estimation with augmented ordinal depth relationships,” IEEE Transactions on Circuits and Systems for Video Technology, 2020.

[34] H. Mohaghegh et al., “Aggregation of rich depth-aware features in a modified stacked generalization model for single image depth estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 3, pp. 683-697, 2018.

[35] K. Karsch et al., “Depth transfer: Depth extraction from video using non-parametric sampling,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 11 , pp. 2144-2158, 2014.

[36] T. Zhou et al., “Unsupervised learning of depth and ego-motion from video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851-1858.

[37] M. Poggi et al., “Learning monocular depth estimation with unsupervised trinocular assumptions,” in 2018 International Conference on 3D Vision (3DV). IEEE, 2018, pp. 324-333.

[38] R. Mahjourian et al., “Unsupervised learning of depth and ego- motion from monocular video using 3d geometric constraints,” in CVPR, 2018.

[39] H. Kumar et al., “Depth map estimation using defocus and motion cues,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 5, pp. 1365-1379, 2018.

[40] C. Wang et al., “Learning depth from monocular videos using direct methods,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2022-2030.

[41] Z. Yin et al., “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1983-1992.

[42] R. Garg et al., “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in European conference on computer vision. Springer, 2016, pp. 740-756.

[43] A. Pilzer et al., “Refine and distill: Exploiting cycle- inconsistency and knowledge distillation for unsupervised monocular depth estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9768-9777.

[44] L. Andraghetti et al., “Enhancing self-supervised monocular depth estimation with traditional visual odometry,” in 2019 International Conference on 3D Vision (3DV). IEEE, 2019, pp. 424-433.

[45] J. Qiu et al., “Going deeper with embedded fpga platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016, pp. 26-35.

[46] H. Alemdar et al., “Ternary neural networks for resource- efficient ai applications,” in 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017, pp. 2547-2554.

[47] M. Rastegari et al., “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European conference on computer vision. Springer, 2016, pp. 525-542.

[48] M. Rusci et al., “Quantized nns as the definitive solution for inference on low-power arm mcus? work-in-progress,” in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, 2018, pp. 1-2.

[49] L. Lai et al., “Enabling deep learning at the IoT edge,” pp. 1-6, 2018.

[50] S. Vogel et al., “Efficient hardware acceleration of cnns using logarithmic data representation with arbitrary log-base,” in Proceedings of the International Conference on Computer-Aided Design, 2018, pp. 1-8.

[51] S. Han et al., “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.

[52] M. Grimaldi et al., “Optimality assessment of memory-bounded convnets deployed on resource-constrained RISC cores,” IEEE Access, vol. 7, pp. 152599-152611, 2019.

[53] J. Yu et al., “Scalpel: Customizing dnn pruning to the underlying hardware parallelism,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 548-560.

[54] H. Li et al., “Pruning filters for efficient convnets,” in 5th International Conference on Learning Representations (ICLR), 2017.

[55] S. Elkerdawy et al., “Lightweight monocular depth estimation model by joint end-to-end filter pruning,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 4290-4294.

[56] A. Tonioni et al., “Unsupervised adaptation for deep stereo,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1605-1613.

[57] A. Tonioni et al., “Unsupervised domain adaptation for depth prediction from images,” IEEE transactions on pattern analysis and machine intelligence, 2019.

[58] A. B. Owen, “A robust hybrid of lasso and ridge regression,” Contemporary Mathematics, vol. 443, no. 7, pp. 59-72, 2007.

[59] X. Guo et al., “Learning monocular depth by distilling cross- domain stereo networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 484-500.

[60] A. Mishra et al., “Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy,” in International Conference on Learning Representations, 2018.

[61] B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704- 2713.

[62] M. Cordts et al., “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213-3223.

[63] C. Godard et al., “Digging into self-supervised monocular depth estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3828-3838.

[64] Y. Wang et al., “UnOS: Unified unsupervised optical-flow and stereo-depth estimation by watching videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8071-8081.

[65] E. Ilg et al., “Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 614-630.

[66] Nucleo-f767zi. [Online]. Available: https://www.st.com/en/evaluationtools/nucleo-f767zi.html

[67] Stm32f767zit6-datasheet. [Online]. Available: https://www.st.com/resource/en/datasheet/stm32f767zi.pdf

[68] A. Torralba et al., “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 30, no. 11 , pp. 1958-1970, 2008.

[69] A. Torralba et al., “Object and scene recognition in tiny images,” Journal of Vision, vol. 7, no. 9, pp. 193-193, 2007.