
\subsection{Motivation}

% introduce problem to solve
% mention classical non deep-learning works, then say it would be nice to go end-to-end deep

% Steal intro from behl2017 & FlowLayers

Deep learning research is increasingly moving towards video data.
Motion estimation is an inherently ambiguous problem.
A recent trend is towards end-to-end deep learning systems and away from energy minimization.
Often, however, this leads to a compromise in modelling, as it is more difficult to
formulate an end-to-end deep network architecture for a given problem than it is
to state a feasible energy-minimization problem.
For this reason, many generic models are now applied to domains which previously
employed intricate physical models to simplify optimization.
On the one hand, end-to-end deep learning may bring unique benefits due to the ability
of a learned system to deal with ambiguity.
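On the other hand, generic networks give up the spatial structure and regularity that explicit physical models impose on the estimates.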
%Thus, there is an emerging trend to unify geometry with deep learning by
% THE ABOVE IS VERY DRAFT_LIKE

Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a batch of binary full-image masks specifying the object memberships of individual pixels with a standard encoder-decoder
network for pixel-wise prediction. A fully connected network branching off the encoder predicts a 3D motion for each object.
However, due to the fixed number of object masks, the system can only predict a small number of motions and
often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions.

Thus, this approach is unlikely to scale to dynamic scenes with a potentially
large number of diverse objects, due to the inflexible nature of its instance segmentation technique.

A scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{MaskRCNN}, which inherits from Faster R-CNN
the ability to detect many objects from a large number of classes at once
and additionally predicts pixel-precise segmentation masks for each detected object.

We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
To this end, we integrate 3D motion prediction for individual objects into the per-RoI R-CNN head,
in parallel to classification and bounding box refinement.

\subsection{Related work}

\paragraph{Deep networks in optical flow}

End-to-end deep networks for optical flow were recently introduced
based on encoder-decoder networks or CNN pyramids \cite{FlowNet, FlowNet2, SPyNet},
which pose optical flow as a generic pixel-wise estimation problem without making any assumptions
about the regularity and structure of the estimated flow.
Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure
the optical flow estimation, but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components.

\paragraph{Slanted plane methods for 3D scene flow}
The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being
composed of planar segments. Pixels are assigned to one of the planar segments,
each of which undergoes a rigid motion.
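
As a sketch of what this model implies for the image motion (the notation here is our own assumption and is not taken from \cite{PRSF, PRSM}): let a segment lie on the plane $\mathbf{n}^\top \mathbf{X} = d$ in camera coordinates, let it undergo the rigid motion $\mathbf{X}' = R\mathbf{X} + \mathbf{t}$, and assume a pinhole camera with the same intrinsics $K$ in both frames. Then all pixels of the segment, written in homogeneous coordinates $\tilde{\mathbf{x}}$, are mapped to the next frame by a single plane-induced homography,
\begin{equation}
\tilde{\mathbf{x}}' \sim K \left( R + \frac{\mathbf{t}\,\mathbf{n}^\top}{d} \right) K^{-1} \tilde{\mathbf{x}},
\end{equation}
so that the motion of an entire segment is described by a few plane and motion parameters rather than by one free vector per pixel.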

In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015}
assigns each slanted plane to one rigidly moving object instance, thus
reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.

In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
with depth obtained from a non-learned stereo algorithm to be used as pre-computed
inputs to their slanted plane scene flow model based on \cite{KITTI2015}.

Interestingly, these slanted plane methods achieve the current state-of-the-art
in scene flow \emph{and} optical flow estimation on the KITTI benchmarks \cite{KITTI2012, KITTI2015},
outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
However, end-to-end deep networks are significantly faster than energy-minimization-based slanted plane models,
generally taking a fraction of a second instead of minutes to compute, and they can often be modified to run in real time.
The long runtimes restrict the applicability of current slanted plane models in practice,
as many applications require estimates in real time; for these, an end-to-end
approach based on learning would be preferable.

Furthermore, in other contexts, the move towards end-to-end deep learning has often led
to significant benefits in terms of accuracy and speed.
As an example, consider the evolution of region-based convolutional networks, which started
out prohibitively slow, with a CNN as only a single component, and
became very fast and much more accurate over the course of their development into
end-to-end deep networks.

Thus, in the context of motion estimation, one could expect end-to-end deep learning to bring large improvements
not only in speed but also in accuracy, especially considering the inherent ambiguity of motion estimation
and the ability of deep networks to learn to handle ambiguity from experience. % TODO instead of experience, talk about compressing large datasets / generalization
However, we think that the current end-to-end deep learning approaches to motion
estimation are limited by a lack of spatial structure and regularity in their estimates,
which stems from the generic nature of the employed networks.
To address this limitation, we aim to combine the modelling benefits of rigid scene decompositions
with the promise of end-to-end deep learning.


\paragraph{End-to-end deep networks for 3D rigid motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first introduced with
SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation
of the points into objects together with the 3D motion of each object.
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and
estimates a segmentation of pixels into objects together with their 3D motions between the frames.
In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning.
For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate
with a brightness constancy proxy loss.
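
The composition of optical flow from depth and 3D motions can be sketched as follows. This is a simplified illustration in our own notation and under our own assumptions (pinhole camera with intrinsics $K$, no per-object pivot points, an $\ell_1$ penalty), not the exact formulation of \cite{SfmNet}. A pixel $\mathbf{x}$ with homogeneous coordinates $\tilde{\mathbf{x}}$ and predicted depth $d(\mathbf{x})$ is back-projected to the 3D point
\begin{equation}
\mathbf{X} = d(\mathbf{x}) \, K^{-1} \tilde{\mathbf{x}}.
\end{equation}
Given masks $m_k(\mathbf{x})$ and object motions $(R_k, \mathbf{t}_k)$ for $k = 1, \dots, N$, the point is first moved by the motions of the objects it belongs to and then by the camera motion $(R_c, \mathbf{t}_c)$,
\begin{equation}
\mathbf{X}' = \mathbf{X} + \sum_{k=1}^{N} m_k(\mathbf{x}) \left( R_k \mathbf{X} + \mathbf{t}_k - \mathbf{X} \right),
\qquad
\mathbf{X}'' = R_c \mathbf{X}' + \mathbf{t}_c.
\end{equation}
With the perspective projection $\pi([X, Y, Z]^\top) = [X/Z, Y/Z]^\top$, the optical flow and the brightness constancy loss between the frames $I_t$ and $I_{t+1}$ are then
\begin{equation}
\mathbf{w}(\mathbf{x}) = \pi\left( K \mathbf{X}'' \right) - \mathbf{x},
\qquad
\mathcal{L}_{\mathrm{photo}} = \sum_{\mathbf{x}} \left| I_{t+1}\left( \mathbf{x} + \mathbf{w}(\mathbf{x}) \right) - I_t(\mathbf{x}) \right|.
\end{equation}
Provided that $I_{t+1}$ is sampled in a differentiable way (e.g. bilinearly), every step is differentiable, so the loss can be backpropagated to the depth, mask and motion predictions without requiring ground truth optical flow.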