Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

· 2021-09-27 · # Paper # Depth Estimation # Deeplearning

Paper link
Authors: David Eigen, Christian Puhrsch and Rob Fergus
Email: deigen@cs.nyu.edu, cpuhrsch@nyu.edu, fergus@cs.nyu.edu

I. Abstract

Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.

II. Introduction

Why depth estimation matters:

Estimating depth is an important component of understanding geometric relations within a scene. In turn, such relations help provide richer representations of objects and their environment, often leading to improvements in existing recognition tasks, as well as enabling many further applications such as 3D modeling, physics and support models, robotics, and potentially reasoning about occlusions.

Monocular image use cases:

Potential applications include better understandings of the many images distributed on the web and social media outlets, real estate listings, and shopping sites. These include many examples of both indoor and outdoor scenes.

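III. Related Work

Saxena et al. predict depth from image features using linear regression and a Markov random field (a model more commonly used for image segmentation).
Hoiem et al. do not predict depth explicitly; instead they classify image regions into geometric structures (ground, sky, vertical) from which a simple 3D model is composed.
Ladicky et al. improve performance by combining semantic object labels with monocular depth features.
Karsch et al. use a kNN transfer mechanism based on SIFT Flow to estimate the depth of static backgrounds from a single image.
Scharstein et al. survey and evaluate many two-frame stereo correspondence methods, organized by their matching, aggregation and optimization techniques.
Snavely et al. match many uncalibrated photographs of the same scene across views to build accurate 3D reconstructions of common landmarks.
Konda et al. train an autoencoder on image patches to predict depth from stereo sequences.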

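IV. Method

1. Model architecture

The network is a stack of two parts: a coarse-scale network first predicts the depth of the scene at a global level, and a fine-scale network then refines that prediction within local regions.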
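
The coarse-then-fine data flow can be illustrated with a toy NumPy sketch. Everything concrete here is an assumption made for illustration and not the paper's configuration: the single dense layer and the 3x3 filter, the 16x16 grayscale input, the random untrained weights, the way the coarse map is stacked with the image as an extra input channel, and the names `coarse_stage`, `upsample_nearest` and `fine_stage`.

```python
import numpy as np

def coarse_stage(image, weights, out_hw):
    """Global stage: one dense layer, so every output value depends on the whole image."""
    h, w = out_hw
    return (image.reshape(-1) @ weights).reshape(h, w)

def upsample_nearest(depth, factor):
    """Repeat each coarse value over a factor x factor block."""
    return np.kron(depth, np.ones((factor, factor)))

def fine_stage(image, coarse_up, kernel):
    """Local stage: a 3x3 filter over the image stacked with the coarse map."""
    stacked = np.stack([image, coarse_up])                       # (2, H, W)
    padded = np.pad(stacked, ((0, 0), (1, 1), (1, 1)), mode="edge")
    height, width = image.shape
    out = np.zeros((height, width))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + height, dx:dx + width]    # (2, H, W)
            out += (kernel[:, dy, dx, None, None] * window).sum(axis=0)
    return out

# Hypothetical toy input and untrained random weights.
rng = np.random.default_rng(0)
image = rng.random((16, 16))
dense_weights = rng.normal(scale=0.01, size=(image.size, 4 * 4))
local_kernel = rng.normal(scale=0.1, size=(2, 3, 3))

coarse = coarse_stage(image, dense_weights, (4, 4))
refined = fine_stage(image, upsample_nearest(coarse, 4), local_kernel)
print(coarse.shape, refined.shape)
```

The point of the sketch is only the receptive fields: each coarse value sees the entire image, while each refined value sees a 3x3 neighbourhood of the image together with the coarse prediction at that location.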

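2. Scale-invariant error

Simply finding the average scale of the scene accounts for a large fraction of the total error. The authors therefore propose a scale-invariant error that measures the relationships between points in the scene. For a predicted depth map $y$ and ground truth $y^*$ , each with $n$ pixels indexed by $i$ , the scale-invariant mean squared error (in log space) is defined as:

$$D(y, y^*) = \frac{1}{2n} \sum_{i=1}^{n} \left( \log y_i - \log y_i^* + \alpha(y, y^*) \right)^2$$

where $\alpha(y, y^*) = \frac{1}{n} \sum_i (\log y_i^* - \log y_i)$ is the value of $\alpha$ that minimizes the error for a given ( $y$ , $y^*$ ). It acts somewhat like a regularization term: it is the negative of the mean log error over the whole image, so it accounts for the global scale. Quoting the interpretation of the CSDN blogger "Xuefeng_BUPT":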

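In monocular depth estimation, scale information cannot in theory be recovered from a single view, although deep learning can learn the scale of scenes from large amounts of data. However, if the network is trained directly with an RMSE loss, nothing constrains the image scale, so the estimated depth map may have accurate relative values between pixels while its overall depth differs in scale from the ground truth.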

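Furthermore, setting $d_i = \log y_i - \log y_i^*$ to be the difference between the prediction and the ground truth at pixel $i$ , the error can be written in terms of pairs of pixels (up to a constant factor in the normalization):

$$D(y, y^*) = \frac{1}{2n^2} \sum_{i,j} \left( (\log y_i - \log y_j) - (\log y_i^* - \log y_j^*) \right)^2 = \frac{1}{n} \sum_i d_i^2 - \frac{1}{n^2} \sum_{i,j} d_i d_j = \frac{1}{n} \sum_i d_i^2 - \frac{1}{n^2} \left( \sum_i d_i \right)^2$$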
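
A small NumPy check makes the scale invariance concrete. It is a sketch under stated assumptions: the error is taken as the mean of $d_i^2$ minus the squared mean of $d_i$ , the depth maps are synthetic random data, and the function names are hypothetical. It compares the pairwise form with the expanded form and shows that multiplying the prediction by a constant leaves the error unchanged, while a plain mean squared log error grows.

```python
import numpy as np

def scale_invariant_error(pred, target):
    """Mean of d_i^2 minus the squared mean of d_i, with d_i = log(pred_i) - log(target_i)."""
    d = np.log(pred) - np.log(target)
    n = d.size
    return np.sum(d ** 2) / n - np.sum(d) ** 2 / n ** 2

def pairwise_error(pred, target):
    """Same quantity written over all pixel pairs (i, j)."""
    d = (np.log(pred) - np.log(target)).ravel()
    diff = d[:, None] - d[None, :]          # d_i - d_j for every pair
    return np.sum(diff ** 2) / (2 * d.size ** 2)

def mean_squared_log_error(pred, target):
    """Plain log-space error, which is sensitive to the global scale."""
    d = np.log(pred) - np.log(target)
    return np.mean(d ** 2)

# Synthetic data (an assumption): depths in [1, 10] with multiplicative noise.
rng = np.random.default_rng(0)
target = rng.uniform(1.0, 10.0, size=(8, 8))
pred = target * np.exp(rng.normal(scale=0.1, size=target.shape))

print(scale_invariant_error(pred, target), scale_invariant_error(3.0 * pred, target))
print(mean_squared_log_error(pred, target), mean_squared_log_error(3.0 * pred, target))
```

Scaling the prediction adds the same constant to every $d_i$ , which cancels in every pairwise difference $d_i - d_j$ ; that is why the first line prints two matching values and the second does not.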

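3. Training loss

The scale-invariant error is used as the training loss:

$$L(y, y^*) = \frac{1}{n} \sum_i d_i^2 - \frac{\lambda}{n^2} \left( \sum_i d_i \right)^2$$

where $d_i = \log y_i - \log y_i^*$ and $λ∈[0,1]$ . With $λ=0$ the loss reduces to the elementwise $l_2$ error, and with $λ=1$ it is exactly the scale-invariant error. The authors use $λ=0.5$ , which they found improves the quality of the predictions.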
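
The role of $λ$ can be sketched in NumPy, together with the gradient of the loss. Several things here are assumptions for illustration: the loss is taken as the mean of $d_i^2$ minus $λ$ times the squared mean of $d_i$ , the functions work directly on log-depths for convenience, the data are synthetic, and the names `training_loss` and `training_loss_grad` are hypothetical. The gradient is derived by hand from that expression and checked against a finite difference.

```python
import numpy as np

def training_loss(log_pred, log_target, lam=0.5):
    """Mean of d_i^2 minus lam times the squared mean of d_i."""
    d = log_pred - log_target
    n = d.size
    return np.sum(d ** 2) / n - lam * np.sum(d) ** 2 / n ** 2

def training_loss_grad(log_pred, log_target, lam=0.5):
    """Gradient with respect to log_pred: 2*d_i/n - 2*lam*sum(d)/n^2."""
    d = log_pred - log_target
    n = d.size
    return 2.0 * d / n - 2.0 * lam * np.sum(d) / n ** 2

# Synthetic log-depths (an assumption).
rng = np.random.default_rng(1)
log_target = np.log(rng.uniform(1.0, 10.0, size=20))
log_pred = log_target + rng.normal(scale=0.2, size=20)

# Central finite difference on one pixel to check the analytic gradient.
eps = 1e-6
bump = np.zeros(20)
bump[3] = eps
numeric = (training_loss(log_pred + bump, log_target)
           - training_loss(log_pred - bump, log_target)) / (2 * eps)
analytic = training_loss_grad(log_pred, log_target)[3]
print(numeric, analytic)

for lam in (0.0, 0.5, 1.0):
    print(lam, training_loss(log_pred, log_target, lam))
```

With `lam=1.0` the gradient components sum to zero, so the loss gives no signal about the global scale; `lam=0.5` keeps half of that signal, which is one way to read the compromise described above.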

V. Experiments and Results

1. Metrics and comparisons

2. NYU Depth results

3. KITTI results

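VI. Summary

The paper estimates depth with two neural networks operating at a global and a local scale. This is a large improvement over traditional algorithms, although the predictions are still far from ideal.
The paper also introduces the scale-invariant error as a loss function.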