Sparse-to-Dense: Depth Prediction from Sparse Depth Samples and a Single Image

· 2021-09-30 · # Paper # Depth Estimation # Deeplearning

1. Abstract

We consider the problem of dense depth prediction from a sparse set of depth measurements and a single RGB image. Since depth estimation from monocular images alone is inherently ambiguous and unreliable, to attain a higher level of robustness and accuracy, we introduce additional sparse depth samples, which are either acquired with a low-resolution depth sensor or computed via visual Simultaneous Localization and Mapping (SLAM) algorithms. We propose the use of a single deep regression network to learn directly from the RGB-D raw data, and explore the impact of number of depth samples on prediction accuracy. Our experiments show that, compared to using only RGB images, the addition of 100 spatially random depth samples reduces the prediction root-mean-square error by 50% on the NYU-Depth-v2 indoor dataset. It also boosts the percentage of reliable prediction from 59% to 92% on the KITTI dataset. We demonstrate two applications of the proposed algorithm: a plug-in module in SLAM to convert sparse maps to dense maps, and super-resolution for LiDARs. Software and video demonstration are publicly available.

2. Introduction

Depth sensing and estimation have a wide range of engineering applications:

Depth sensing and estimation is of vital importance in a wide range of engineering applications, such as robotics, autonomous driving, augmented reality (AR) and 3D mapping.
Sparse depth measurements are easy to obtain:

For instance, low-resolution depth sensors (e.g., low-cost LiDARs) provide such measurements. Sparse depth measurements can also be computed from the output of SLAM and visual-inertial odometry algorithms.

3. Related Work

RGB-based depth prediction

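Laina et al. developed a deep fully convolutional residual network based on ResNet.
Godard et al. treated disparity estimation as an image reconstruction problem, training a neural network to warp the left image so that it matches the right image.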

Depth reconstruction from sparse samples

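Hawe et al. assumed that disparity maps are sparse on a wavelet basis and reconstructed a dense disparity image with a conjugate sub-gradient method.
Liu et al. combined wavelet and contourlet dictionaries to achieve more accurate reconstruction.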

Sensor fusion

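Mancini et al. proposed a CNN that takes both RGB images and optical flow images as input to predict distance.
Liao et al. used a 2D laser scanner to provide an additional depth reference signal as input, achieving higher accuracy than with RGB images alone.
Cadena et al. developed a multi-modal auto-encoder that learns from three input modalities: RGB, depth, and semantic labels. They used depth extracted at FAST corner features as part of the system input to produce a low-resolution depth prediction.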

4. Method

CNN architecture

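Experiments showed that a bottleneck architecture (i.e., an encoder-decoder) performs well, so the deep fully convolutional residual network proposed by Laina et al. was chosen. The network architecture is shown in the figure: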

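The encoding layers of the network use ResNet-18 for the KITTI dataset and ResNet-50 for the NYU-Depth-v2 dataset, with the final average pooling layer and fully connected layer removed, followed by a 3×3 convolution layer. The decoding layers consist of 4 upsampling layers followed by a bilinear upsampling layer.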
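To make the encoder-decoder bottleneck concrete, the sketch below traces how the spatial resolution could change through such a network. It is only an illustration under stated assumptions, not the authors' code: the stride layout is that of a standard ResNet, each of the 4 upsampling layers is assumed to double the resolution, the 3×3 convolution is assumed to use stride 1 and padding 1, and the 228×304 input size and the helper names `conv_out` and `trace_shapes` are hypothetical choices made for this example.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, num_upsample=4):
    """Return (layer_name, height, width) for each stage of the sketch."""
    shapes = [("input", height, width)]
    h, w = height, width

    # Assumed standard ResNet stem: 7x7 conv (stride 2), then 3x3 max-pool (stride 2).
    h, w = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)
    shapes.append(("conv1", h, w))
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    shapes.append(("maxpool", h, w))

    # Stage 1 keeps the resolution; stages 2-4 each halve it.
    for stage in (2, 3, 4):
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
        shapes.append((f"resnet_stage{stage}", h, w))

    # Extra 3x3 convolution (assumed stride 1, padding 1): resolution unchanged.
    h, w = conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1)
    shapes.append(("conv3x3", h, w))

    # Decoder: each upsampling layer is assumed to double the resolution.
    for k in range(1, num_upsample + 1):
        h, w = 2 * h, 2 * w
        shapes.append((f"upsample{k}", h, w))

    # Final bilinear upsampling resizes the map back to the input resolution.
    shapes.append(("bilinear", height, width))
    return shapes


if __name__ == "__main__":
    for name, h, w in trace_shapes(228, 304):
        print(f"{name:>14}: {h} x {w}")
```

Under these assumptions the encoder reduces the feature map to roughly 1/32 of the input resolution and the four upsampling layers recover a factor of 16, which is why a final bilinear resize is still needed to match the input size.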

Depth sampling

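During training, the sparse input depth $D$ is sampled randomly from the ground-truth depth image $D^*$. For any target number of depth samples $m$, a Bernoulli probability $p=\frac{m}{n}$ is computed, where $n$ is the total number of valid depth pixels in $D^*$. Then, for any pixel $(i,j)$:

$$D(i,j)=\begin{cases}D^*(i,j), & \text{with probability } p\\ 0, & \text{otherwise}\end{cases}$$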

With this strategy, the actual number of non-zero depth pixels in each training sample varies around the expected value $m$.
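The sampling strategy above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `sample_sparse_depth`, the nested-list image representation, and the convention that a value of 0 marks an invalid pixel are all assumptions made for this example. A practical version would likely vectorize the same logic with an array library.

```python
import random


def sample_sparse_depth(dense_depth, m, rng=None):
    """Keep each valid pixel of the ground-truth depth with probability p = m / n.

    dense_depth: 2D list of floats (D*), where 0 is assumed to mean "no valid depth".
    m: target number of depth samples.
    rng: optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()

    # n: number of valid depth pixels in D*.
    n = sum(1 for row in dense_depth for d in row if d > 0)
    if n == 0:
        return [[0.0 for _ in row] for row in dense_depth]

    # Bernoulli probability, capped at 1 when m exceeds n.
    p = min(1.0, m / n)

    # D(i, j) = D*(i, j) with probability p, and 0 otherwise.
    return [
        [d if d > 0 and rng.random() < p else 0.0 for d in row]
        for row in dense_depth
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    ground_truth = [[2.0] * 64 for _ in range(64)]  # toy dense depth map
    sparse = sample_sparse_depth(ground_truth, m=100, rng=rng)
    kept = sum(1 for row in sparse for d in row if d > 0)
    print("non-zero samples:", kept)
```

Because each of the $n$ valid pixels is kept independently, the number of non-zero samples follows a binomial distribution with mean $np=m$, so the count differs slightly from one training sample to the next instead of being exactly $m$.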