% bsc-thesis/abstract.tex
\begin{abstract}
Many state-of-the-art energy-minimization approaches to optical flow and scene
flow estimation rely on a (piecewise) rigid scene model, in which the scene is
represented as an ensemble of distinct, rigidly moving components, a static
background, and a moving camera.
By constraining the optimization problem with a physically sound scene model,
these approaches enable highly accurate motion estimation.
With the advent of deep learning methods, it has become popular to repurpose
generic deep networks for classical computer vision problems involving
pixel-wise estimation.
Following this trend, many recent end-to-end deep learning approaches to optical
flow and scene flow directly predict full-resolution flow fields with
a generic network for dense, pixel-wise prediction, thereby ignoring the
inherent structure of the underlying motion estimation problem and any physical
constraints within the scene.
We introduce a scalable end-to-end deep learning approach to dense motion
estimation that respects the structure of the scene as a composition of
distinct objects, thus combining the representation-learning benefits of
end-to-end deep networks with a physically plausible scene model.
Building on recent advances in region-based convolutional networks (R-CNNs),
we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGBD camera,
our resulting end-to-end deep network detects objects with accurate per-pixel
masks and estimates the 3D motion of each detected object between the frames.
By additionally estimating the global camera motion within the same network,
we compose a dense optical flow field from the instance-level and global motion
predictions.
We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
benchmark.
\end{abstract}