\begin{abstract}
Many state-of-the-art energy-minimization approaches to optical flow and scene flow estimation
rely on a (piecewise) rigid scene model, in which the scene is represented as an ensemble of distinct,
rigidly moving components, a static background, and a moving camera.
By constraining the optimization problem with a physically sound scene model,
these approaches achieve highly accurate motion estimation.
With the advent of deep learning methods, it has become popular to re-purpose generic deep networks
for classical computer vision problems involving pixel-wise estimation.
Following this trend, many recent end-to-end deep learning approaches to optical flow
and scene flow directly predict full-resolution
depth and flow fields with a generic network for dense, pixel-wise prediction,
thereby ignoring the inherent structure of the underlying motion estimation problem
and any physical constraints within the scene.
We introduce an end-to-end deep learning approach for dense motion estimation
that respects the structure of the scene as being composed of distinct objects,
thus combining the representation learning benefits of end-to-end deep networks
with a physically plausible scene model.
Building on recent advances in region-based convolutional networks (R-CNNs), we integrate motion
estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D camera,
our resulting end-to-end deep network detects objects with accurate per-pixel masks
and estimates the 3D motion of each detected object between the frames.
By additionally estimating the global camera motion in the same network, we compose a dense
optical flow field from the instance-level motion predictions.
We demonstrate the effectiveness of our approach on the KITTI 2015 optical flow benchmark.
\end{abstract}