This commit is contained in:
Simon Meister 2017-10-30 15:11:59 +01:00
parent 8edfcbac9f
commit 86c2c12c78
4 changed files with 49 additions and 11 deletions

View File

@@ -1,11 +1,11 @@
\begin{abstract}
Many state of the art energy-minimization approaches to optical flow and scene
-flow estimation rely on a (piecewise) rigid scene model, where the scene is
+flow estimation rely on a rigid scene model, where the scene is
represented as an ensemble of distinct, rigidly moving components, a static
background and a moving camera.
By constraining the optimization problem with a physically sound scene model,
-these approaches enable higly accurate motion estimation.
+these approaches enable state-of-the-art motion estimation.
With the advent of deep learning methods, it has become popular to re-purpose
generic deep networks for classical computer vision problems involving

View File

@@ -8,7 +8,10 @@ visually corresponding pixel in the second frame $I_2$,
thus representing the apparent movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
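The definition above can be stated compactly as the standard brightness-constancy relation between the two frames; this is a sketch, and the symbol names $u$, $v$ for the flow components are our choice, not taken from the source:

```latex
% Brightness constancy: each pixel (x, y) in I_1 is displaced by its
% flow vector (u(x, y), v(x, y)) to the visually corresponding
% location in I_2.
\begin{equation}
  I_1(x, y) \approx I_2\bigl(x + u(x, y),\, y + v(x, y)\bigr)
\end{equation}
```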
-Scene flow is the generalization of optical flow to 3-dimensional space.
+Scene flow is the generalization of optical flow to 3-dimensional space and
+requires estimating dense depth. Generally, stereo input is used for scene flow
+to estimate disparity-based depth, although monocular depth estimation can in
+principle be used.
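For the stereo case mentioned above, disparity-based depth follows the standard pinhole relation; a minimal sketch, where the symbols $f$, $B$, and $d$ for focal length, baseline, and disparity are our naming and are not taken from the source:

```latex
% Depth Z at pixel (x, y) from its stereo disparity d(x, y), given
% the focal length f and baseline B of a rectified stereo pair.
\begin{equation}
  Z(x, y) = \frac{f \, B}{d(x, y)}
\end{equation}
```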
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures

bib.bib
View File

@@ -142,3 +142,15 @@
title = {3D Scene Flow with a Piecewise Rigid Scene Model},
booktitle = {{IJCV}},
year = {2015}}
@inproceedings{MRFlow,
author = {Jonas Wulff and Laura Sevilla-Lara and Michael J. Black},
title = {Optical Flow in Mostly Rigid Scenes},
booktitle = {{CVPR}},
year = {2017}}
@article{SPyNet,
author = {Anurag Ranjan and Michael J. Black},
title = {Optical Flow Estimation using a Spatial Pyramid Network},
journal = {arXiv preprint arXiv:1611.00850},
year = {2016}}

View File

@@ -41,10 +41,15 @@ in parallel to classification and bounding box refinement.
\subsection{Related work}
-\paragraph{Deep networks in optical flow and scene flow}
+\paragraph{Deep networks in optical flow}
\cite{FlowLayers}
\cite{ESI}
End-to-end deep networks for optical flow were recently introduced
based on encoder-decoder networks or CNN pyramids \cite{FlowNet, FlowNet2, SPyNet},
which pose optical flow as a generic pixel-wise estimation problem without making any assumptions
about the regularity and structure of the estimated flow.
Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure
the optical flow estimation, but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components.
\paragraph{Slanted plane methods for 3D scene flow}
The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being
@@ -57,17 +62,35 @@ reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.
In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
-a CNN is used to compute 2D bounding boxes and instance masks, which are then combined
+a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
with depth obtained from a non-learned stereo algorithm to be used as pre-computed
-inputs to the object scene flow model from \cite{KITTI2015}.
+inputs to their slanted plane scene flow model based on \cite{KITTI2015}.
Interestingly, these slanted plane methods achieve the current state-of-the-art
in scene flow \emph{and} optical flow estimation on the KITTI benchmarks \cite{KITTI2012, KITTI2015},
outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
However, the end-to-end deep networks are significantly faster than energy-minimization based slanted plane models,
generally taking a fraction of a second instead of minutes to compute, and they can often be modified to run in real time.
These runtime constraints restrict the applicability of the current slanted plane models in practical applications,
which often require real-time estimation and for which an end-to-end
approach based on learning would be preferable.
Furthermore, in other contexts, the move towards end-to-end deep learning has often led
to significant benefits in terms of accuracy and speed.
As an example, consider the evolution of region-based convolutional networks, which started
out prohibitively slow, with a CNN as only a single component, and
became very fast and much more accurate over the course of their development into
end-to-end deep networks.
Thus, in the context of motion estimation, one could expect end-to-end deep learning to bring large improvements
not only in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation
and the ability of deep networks to learn to handle ambiguity from experience. % TODO instead of experience, talk about compressing large datasets / generalization
However, we think that the current end-to-end deep learning approaches to motion
estimation are limited by a lack of spatial structure and regularity in their estimates,
which stems from the generic nature of the employed networks.
To this end, we aim to combine the modelling benefits of rigid scene decompositions
with the promise of end-to-end deep learning.
%
% In other contexts, the move from
% talk about performance issues with energy-minimization components, draw parallels to evolution of R-CNNs in terms of speed and accuracy when moving towards full end-to-end learning
\paragraph{End-to-end deep networks for 3D rigid motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first introduced with