From 86c2c12c78d001408e0b379144f76c8f42967dec Mon Sep 17 00:00:00 2001
From: Simon Meister
Date: Mon, 30 Oct 2017 15:11:59 +0100
Subject: [PATCH] WIP

---
 abstract.tex     |  4 ++--
 background.tex   |  5 ++++-
 bib.bib          | 12 ++++++++++++
 introduction.tex | 39 +++++++++++++++++++++++++++++++--------
 4 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/abstract.tex b/abstract.tex
index 3864469..9a0b387 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -1,11 +1,11 @@
 \begin{abstract}
 Many state-of-the-art energy-minimization approaches to optical flow and scene
-flow estimation rely on a (piecewise) rigid scene model, where the scene is
+flow estimation rely on a rigid scene model, where the scene is
 represented as an ensemble of distinct, rigidly moving components, a static
 background and a moving camera.
 By constraining the optimization problem with a physically sound scene model,
-these approaches enable higly accurate motion estimation.
+these approaches enable state-of-the-art motion estimation.
 
 With the advent of deep learning methods, it has become popular to re-purpose
 generic deep networks for classical computer vision problems involving
diff --git a/background.tex b/background.tex
index 6d870dd..3c6c687 100644
--- a/background.tex
+++ b/background.tex
@@ -8,7 +8,10 @@
 visually corresponding pixel in the second frame $I_2$, thus representing the
 apparent movement of brightness patterns between the two frames.
 Optical flow can be regarded as two-dimensional motion estimation.
-Scene flow is the generalization of optical flow to 3-dimensional space.
+Scene flow is the generalization of optical flow to 3-dimensional space and
+additionally requires estimating dense depth. Stereo input is generally used
+to estimate disparity-based depth; however, monocular depth estimation can in
+principle be used as well.
 \subsection{Convolutional neural networks for dense motion estimation}
 Deep convolutional neural network (CNN) architectures
diff --git a/bib.bib b/bib.bib
index 7773af5..17a1c73 100644
--- a/bib.bib
+++ b/bib.bib
@@ -142,3 +142,15 @@
 title = {3D Scene Flow with a Piecewise Rigid Scene Model},
 booktitle = {{IJCV}},
 year = {2015}}
+
+@inproceedings{MRFlow,
+  author = {Jonas Wulff and Laura Sevilla-Lara and Michael J. Black},
+  title = {Optical Flow in Mostly Rigid Scenes},
+  booktitle = {{CVPR}},
+  year = {2017}}
+
+@article{SPyNet,
+  author = {Anurag Ranjan and Michael J. Black},
+  title = {Optical Flow Estimation using a Spatial Pyramid Network},
+  journal = {arXiv preprint arXiv:1611.00850},
+  year = {2016}}
diff --git a/introduction.tex b/introduction.tex
index 0e3a880..83a1ca8 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -41,10 +41,15 @@
 in parallel to classification and bounding box refinement.
 
 \subsection{Related work}
 
-\paragraph{Deep networks in optical flow and scene flow}
+\paragraph{Deep networks in optical flow}
-\cite{FlowLayers}
-\cite{ESI}
+End-to-end deep networks for optical flow were recently introduced,
+based on encoder-decoder networks or CNN pyramids \cite{FlowNet, FlowNet2, SPyNet},
+which pose optical flow as a generic pixel-wise estimation problem without making
+any assumptions about the regularity and structure of the estimated flow.
+Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to
+structure the optical flow estimation, but still require expensive energy
+minimization for each new input, as CNNs are used for only some of the components.
 
 \paragraph{Slanted plane methods for 3D scene flow}
 The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being
@@ -57,17 +62,35 @@
 reducing the number of independently moving segments by allowing multiple
 segments to share the motion of the object they belong to.
 In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
-a CNN is used to compute 2D bounding boxes and instance masks, which are then combined
+a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
 with depth obtained from a non-learned stereo algorithm to be used as pre-computed
-inputs to the object scene flow model from \cite{KITTI2015}.
+inputs to their slanted plane scene flow model based on \cite{KITTI2015}.
 Interestingly, these slanted plane methods achieve the current state-of-the-art
 in scene flow \emph{and} optical flow estimation on the KITTI benchmarks
 \cite{KITTI2012, KITTI2015}, outperforming end-to-end deep networks like
 \cite{FlowNet2, SceneFlowDataset}.
+However, end-to-end deep networks are significantly faster than
+energy-minimization-based slanted plane models, generally taking a fraction of a
+second instead of minutes to compute, and can often be modified to run in real time.
+This runtime gap limits the applicability of current slanted plane models in
+practical applications, which often require real-time estimates and for which an
+end-to-end approach based on learning would be preferable.
+
+Furthermore, in other contexts, the move towards end-to-end deep learning has often
+led to significant benefits in terms of accuracy and speed.
+As an example, consider the evolution of region-based convolutional networks, which
+started out prohibitively slow, with a CNN as only one component, and became
+much faster and more accurate over the course of their development into
+end-to-end deep networks.
+
+Thus, in the context of motion estimation, one could expect end-to-end deep learning
+to bring large improvements not only in speed but also in accuracy, especially
+considering the inherent ambiguity of motion estimation and the ability of deep
+networks to learn to resolve this ambiguity from experience.
% TODO instead of experience, talk about compressing large datasets / generalization
+However, we think that current end-to-end deep learning approaches to motion
+estimation are limited by a lack of spatial structure and regularity in their
+estimates, which stems from the generic nature of the employed networks.
+We therefore aim to combine the modelling benefits of rigid scene decompositions
+with the promise of end-to-end deep learning.
-%
-In other contexts, the move from
-% talk about performance issues with energy-minimization components, draw parallels to evolution of R-CNNs in terms of speed and accuracy when moving towards full end-to-end learning
 
 \paragraph{End-to-end deep networks for 3D rigid motion estimation}
 End-to-end deep learning for predicting rigid 3D object motions was first introduced with