This commit is contained in:
Simon Meister 2017-10-30 15:11:59 +01:00
parent 8edfcbac9f
commit 86c2c12c78
4 changed files with 49 additions and 11 deletions

View File

@@ -1,11 +1,11 @@
\begin{abstract}
Many state of the art energy-minimization approaches to optical flow and scene
-flow estimation rely on a (piecewise) rigid scene model, where the scene is
+flow estimation rely on a rigid scene model, where the scene is
represented as an ensemble of distinct, rigidly moving components, a static
background and a moving camera.
By constraining the optimization problem with a physically sound scene model,
-these approaches enable higly accurate motion estimation.
+these approaches enable state-of-the-art motion estimation.
With the advent of deep learning methods, it has become popular to re-purpose
generic deep networks for classical computer vision problems involving

View File

@@ -8,7 +8,10 @@ visually corresponding pixel in the second frame $I_2$,
thus representing the apparent movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
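The definition above can be stated compactly as the standard brightness-constancy relation between the two frames; this is a sketch, and the symbol names $u$, $v$ for the flow components are our choice, not taken from the source:

```latex
% Brightness constancy: each pixel (x, y) in I_1 is displaced by its
% flow vector (u(x, y), v(x, y)) to the visually corresponding
% location in I_2.
\begin{equation}
  I_1(x, y) \approx I_2\bigl(x + u(x, y),\, y + v(x, y)\bigr)
\end{equation}
```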
-Scene flow is the generalization of optical flow to 3-dimensional space.
+Scene flow is the generalization of optical flow to 3-dimensional space and
+requires estimating dense depth. Generally, stereo input is used for scene flow
+to estimate disparity-based depth, although monocular depth estimation can in
+principle be used.
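For the stereo case mentioned above, disparity-based depth follows the standard pinhole relation; a minimal sketch, where the symbols $f$, $B$, and $d$ for focal length, baseline, and disparity are our naming and are not taken from the source:

```latex
% Depth Z at pixel (x, y) from its stereo disparity d(x, y), given
% the focal length f and baseline B of a rectified stereo pair.
\begin{equation}
  Z(x, y) = \frac{f \, B}{d(x, y)}
\end{equation}
```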
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures

bib.bib
View File

@@ -142,3 +142,15 @@
title = {3D Scene Flow with a Piecewise Rigid Scene Model},
booktitle = {{IJCV}},
year = {2015}}
@inproceedings{MRFlow,
author = {Jonas Wulff and Laura Sevilla-Lara and Michael J. Black},
title = {Optical Flow in Mostly Rigid Scenes},
booktitle = {{CVPR}},
year = {2017}}
@article{SPyNet,
author = {Anurag Ranjan and Michael J. Black},
title = {Optical Flow Estimation using a Spatial Pyramid Network},
journal = {arXiv preprint arXiv:1611.00850},
year = {2016}}

View File

@@ -41,10 +41,15 @@ in parallel to classification and bounding box refinement.
\subsection{Related work}
-\paragraph{Deep networks in optical flow and scene flow}
+\paragraph{Deep networks in optical flow}
\cite{FlowLayers}
\cite{ESI}
End-to-end deep networks for optical flow were recently introduced
based on encoder-decoder networks or CNN pyramids \cite{FlowNet, FlowNet2, SPyNet},
which pose optical flow as a generic pixel-wise estimation problem without making any assumptions
about the regularity and structure of the estimated flow.
Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure
the optical flow estimation, but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components.
\paragraph{Slanted plane methods for 3D scene flow}
The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being
@@ -57,17 +62,35 @@ reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.
In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
-a CNN is used to compute 2D bounding boxes and instance masks, which are then combined
+a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
with depth obtained from a non-learned stereo algorithm to be used as pre-computed
-inputs to the object scene flow model from \cite{KITTI2015}.
+inputs to their slanted plane scene flow model based on \cite{KITTI2015}.
Interestingly, these slanted plane methods achieve the current state-of-the-art
in scene flow \emph{and} optical flow estimation on the KITTI benchmarks \cite{KITTI2012, KITTI2015},
outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
However, the end-to-end deep networks are significantly faster than energy-minimization based slanted plane models,
generally taking a fraction of a second instead of minutes to compute, and they can often be modified to run in real time.
These runtime constraints restrict the applicability of the current slanted plane models in practical applications,
which often require real-time estimation and for which an end-to-end
approach based on learning would be preferable.
Furthermore, in other contexts, the move towards end-to-end deep learning has often led
to significant benefits in terms of accuracy and speed.
As an example, consider the evolution of region-based convolutional networks, which started
out prohibitively slow, with a CNN as only a single component, and
became very fast and much more accurate over the course of their development into
end-to-end deep networks.
Thus, in the context of motion estimation, one could expect end-to-end deep learning to bring large improvements
not only in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation
and the ability of deep networks to learn to handle ambiguity from experience. % TODO instead of experience, talk about compressing large datasets / generalization
However, we think that the current end-to-end deep learning approaches to motion
estimation are limited by a lack of spatial structure and regularity in their estimates,
which stems from the generic nature of the employed networks.
To this end, we aim to combine the modelling benefits of rigid scene decompositions
with the promise of end-to-end deep learning.
%
% In other contexts, the move from
% talk about performance issues with energy-minimization components, draw parallels to evolution of R-CNNs in terms of speed and accuracy when moving towards full end-to-end learning
\paragraph{End-to-end deep networks for 3D rigid motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first introduced with