\begin{abstract}
% Many state-of-the-art energy-minimization approaches to optical flow and scene
% flow estimation rely on a rigid scene model, where the scene is
% represented as an ensemble of distinct, rigidly moving components, a static
% background and a moving camera.
% By constraining the optimization problem with a physically sound scene model,
% these approaches enable state-of-the-art motion estimation.
With the advent of deep learning methods, it has become popular to re-purpose
generic deep networks for classical computer vision problems involving
pixel-wise estimation.
Following this trend, many recent end-to-end deep learning approaches to
optical flow and scene flow predict full-resolution flow fields with a generic
network for dense, pixel-wise prediction, thereby ignoring the inherent
structure of the underlying motion estimation problem and any physical
constraints within the scene.
We introduce a scalable end-to-end deep learning approach to dense motion
estimation that respects the structure of the scene as being composed of
distinct objects, thus combining the representation learning benefits and
speed of end-to-end deep networks with a physically plausible scene model
inspired by slanted-plane energy-minimization approaches to scene flow.
Building on recent advances in region-based convolutional networks (R-CNNs),
we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D camera, our resulting
end-to-end deep network detects objects with accurate per-pixel masks and
estimates the 3D motion of each detected object between the frames.
By additionally estimating a global camera motion in the same network, we
compose a dense optical flow field from the instance-level and global motion
predictions.
Our network is trained on the synthetic Virtual KITTI dataset, which provides
ground truth for all components of the system.
\end{abstract}

\renewcommand{\abstractname}{Zusammenfassung}
\begin{abstract}
\todo{german abstract}
\end{abstract}