From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation

· 2021-11-07 · # Paper # Depth Estimation # Deeplearning

Paper link
Authors: Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, Il Hong Suh

1. Abstract

Estimating accurate depth from a single image is challenging because it is an ill-posed problem as infinitely many 3D scenes can be projected to the same 2D scene. However, recent works based on deep convolutional neural networks show great progress with plausible results. The convolutional neural networks are generally composed of two parts: an encoder for dense feature extraction and a decoder for predicting the desired depth. In the encoder-decoder schemes, repeated strided convolution and spatial pooling layers lower the spatial resolution of transitional outputs, and several techniques such as skip connections or multilayer deconvolutional networks are adopted to recover the original resolution for effective dense prediction.

In this paper, for more effective guidance of densely encoded features to the desired depth prediction, we propose a network architecture that utilizes novel local planar guidance layers located at multiple stages in the decoding phase. We show that the proposed method outperforms the state-of-the-art works with significant margin evaluating on challenging benchmarks. We also provide results from an ablation study to validate the effectiveness of the proposed method.

2. Method

Network architecture

Multi-scale local planar guidance

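Existing methods recover the original resolution in the decoding phase with simple nearest-neighbor upsampling layers and skip connections. The authors instead use local planar guidance (LPG) layers, which guide feature maps to full resolution under a local planar assumption, and combine their outputs to obtain the final depth estimate.
An LPG layer is not designed to directly estimate global depth values at its own scale, because the training loss is defined only in terms of the final depth estimate.
The outputs of all LPG layers, together with the output of the reduc1×1 layer, are interpreted as global depth only as parts of a nonlinear combination formed by the final convolutional layers. They can therefore have different ranges, and each can be learned either as a base at a spatial location or as a precise relative compensation from that base.
For a $k \times k$ region, the local planar assumption needs only four parameters for an efficient reconstruction.
Conventional upsampling does not provide detail at the enlarged resolution, whereas the local linear assumption can provide effective guidance.
The authors use ray-plane intersection to convert each estimated 4D plane coefficient into $k \times k$ local depth cues: $\tilde{c}_{i}=\frac{n_{4}}{n_{1} u_{i}+n_{2} v_{i}+n_{3}}$ where $n=\left(n_{1}, n_{2}, n_{3}, n_{4}\right)$ denotes the estimated plane coefficients and $\left(u_{i}, v_{i}\right)$ are the coordinates of pixel $i$, normalized with respect to its $k \times k$ patch.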
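The ray-plane formula above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `local_depth_cues`, the `(4, H, W)` array layout, and in particular the choice of patch-centered coordinates in the range $(-0.5, 0.5)$ for $(u_i, v_i)$ are assumptions made here for the example.

```python
import numpy as np

def local_depth_cues(plane, k):
    """Expand per-cell plane coefficients into k x k local depth cues.

    plane: array of shape (4, H, W) holding (n1, n2, n3, n4) for each coarse cell.
    Returns an array of shape (H * k, W * k).
    """
    _, H, W = plane.shape
    # Repeat every coefficient over its k x k patch (nearest-neighbor expansion).
    n1, n2, n3, n4 = (np.repeat(np.repeat(c, k, axis=0), k, axis=1) for c in plane)
    # Assumed normalization: pixel offsets centered on the patch, divided by k.
    offsets = (np.arange(k) - (k - 1) / 2) / k
    u = np.tile(offsets, W)[None, :]   # varies along the width
    v = np.tile(offsets, H)[:, None]   # varies along the height
    # Ray-plane intersection: c = n4 / (n1 * u + n2 * v + n3).
    return n4 / (n1 * u + n2 * v + n3)

# A fronto-parallel plane (normal along the optical axis) yields constant depth.
flat = np.zeros((4, 2, 3))
flat[2] = 1.0   # n3
flat[3] = 5.0   # n4
cues = local_depth_cues(flat, k=4)
```

With a tilted normal, the cue varies smoothly inside each patch, which is the detail that plain nearest-neighbor upsampling cannot provide.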
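Local planar guidance layer:

A stack of 1×1 convolutions repeatedly reduces the number of channels until it reaches 3. The resulting feature map is then passed through two different paths to estimate the local plane coefficients:
- Because a unit normal vector has only two degrees of freedom (the polar and azimuthal angles), the authors treat the first two channels of the feature map as angles and convert them to a unit normal vector $\left(n_{1}, n_{2}, n_{3}\right)$ : $n_{1}=\sin (\theta) \cos (\phi), n_{2}=\sin (\theta) \sin (\phi), n_{3}=\cos (\theta)$
- The remaining channel goes through a sigmoid function, which defines the perpendicular distance $n_{4}$ between the plane and the origin. The output is then multiplied by the maximum distance $\kappa$ to obtain real depth values
  Finally, the two results are concatenated and used to estimate the local depth cues with the ray-plane equation above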
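The two conversion paths described above can be sketched as a small NumPy function. This is an illustrative sketch under stated assumptions: the name `plane_coefficients` and the `(3, H, W)` layout are hypothetical, and the raw channels are used directly as angles, so any squashing of the angles into a bounded range that a real implementation might apply is omitted.

```python
import numpy as np

def plane_coefficients(feat, kappa):
    """Convert a 3-channel feature map into 4D plane coefficients.

    feat:  array of shape (3, H, W); channels are (theta, phi, raw distance).
    kappa: maximum distance used to scale the sigmoid output.
    Returns an array of shape (4, H, W) holding (n1, n2, n3, n4).
    """
    theta, phi, raw_dist = feat
    # Polar and azimuthal angles -> unit normal vector.
    n1 = np.sin(theta) * np.cos(phi)
    n2 = np.sin(theta) * np.sin(phi)
    n3 = np.cos(theta)
    # Sigmoid, then scale by kappa -> perpendicular distance to the origin.
    n4 = kappa / (1.0 + np.exp(-raw_dist))
    return np.stack([n1, n2, n3, n4])

rng = np.random.default_rng(0)
feat = rng.normal(size=(3, 4, 5))
coeffs = plane_coefficients(feat, kappa=10.0)
```

Parameterizing the normal by two angles guarantees unit length by construction, so no explicit normalization step is needed.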

Training loss

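Eigen et al. introduced the scale-invariant error and, inspired by it, used the following training loss:

$$D(g)=\frac{1}{T} \sum_{i} g_{i}^{2}-\frac{\lambda}{T^{2}}\left(\sum_{i} g_{i}\right)^{2}$$

where $g_{i}=\log \tilde{d}_{i}-\log d_{i}$ is the log-space error between the predicted depth $\tilde{d}_{i}$ and the ground-truth depth $d_{i}$ , $\lambda \in[0,1]$ , and $T$ is the number of pixels with valid ground-truth values.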

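Rewriting the equation above as $D(g)=\frac{1}{T} \sum_{i} g_{i}^{2}-\left(\frac{1}{T} \sum_{i} g_{i}\right)^{2}+(1-\lambda)\left(\frac{1}{T} \sum_{i} g_{i}\right)^{2}$ shows that it is the sum of the variance and a weighted squared mean of the error in log space. Setting a higher $\lambda$ therefore puts more emphasis on minimizing the variance of the error, and the authors use $\lambda = 0.85$ . The authors also found that properly scaling the range of the loss function improves convergence and the final training result. The final training loss is defined as: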

$$L=\alpha \sqrt{D(g)}$$

where $\alpha$ is a constant, set to 10 in the experiments
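The loss defined above can be written in a few lines of NumPy. This is a minimal sketch rather than the authors' code: the function name `silog_loss` is hypothetical, marking invalid ground truth with non-positive values is an assumption, and the clamp to zero is added here only to guard against tiny negative values from floating-point rounding.

```python
import numpy as np

def silog_loss(pred, gt, lam=0.85, alpha=10.0):
    """Scaled scale-invariant log loss: L = alpha * sqrt(D(g))."""
    valid = gt > 0                       # assumption: gt <= 0 marks missing depth
    g = np.log(pred[valid]) - np.log(gt[valid])
    # D(g) = mean(g^2) - lam * mean(g)^2
    d = np.mean(g ** 2) - lam * np.mean(g) ** 2
    return alpha * np.sqrt(np.maximum(d, 0.0))

gt = np.array([[1.0, 2.0, 0.0], [4.0, 8.0, 16.0]])   # 0.0 = no ground truth
loss_perfect = silog_loss(gt + (gt == 0), gt)         # prediction equals gt on valid pixels
loss_scaled = silog_loss(2.0 * gt + (gt == 0), gt)    # prediction off by a global factor of 2
```

With $\lambda = 1$ the loss ignores a global scale error entirely, while with $\lambda = 0.85$ a residual penalty of $(1-\lambda)$ times the squared mean log error remains.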

3. Experiments

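Evaluating the effect of adding the core modules to a base network:
Experiments with several different base networks on two datasets:
- NYU Depth V2 dataset:
- KITTI Eigen split:

Here, DenseNet-161 performs best on the NYU dataset, while ResNet-101 performs best on the KITTI dataset. The authors attribute this to the relatively low variance of the data distribution in the indoor NYU dataset, which leads to a performance drop when very deep models are used in the experiments
Qualitative results: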

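In the results of the KITTI experiments, artifacts can be seen in the sky or in the upper part of the scene. The authors attribute this to the very sparse ground-truth depth data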

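4. Conclusion

The authors design a local planar guidance layer that gives an explicit relation between internal feature maps and the desired prediction
However, in the experiments on the KITTI dataset, frequent artifacts can be observed in the upper part of the scene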