% bsc-thesis/abstract.tex

\begin{abstract}

% Many state-of-the-art energy-minimization approaches to optical flow and scene
% flow estimation rely on a rigid scene model, where the scene is
% represented as an ensemble of distinct, rigidly moving components, a static
% background and a moving camera.
% By constraining the optimization problem with a physically sound scene model,
% these approaches enable state-of-the-art motion estimation.

With the advent of deep learning, it has become popular to re-purpose
generic deep networks for classical computer vision problems involving
pixel-wise estimation.

Following this trend, many recent end-to-end deep learning approaches to optical
flow and scene flow predict complete, high-resolution flow fields with
a generic network for dense, pixel-wise prediction, thereby ignoring the
inherent structure of the underlying motion estimation problem and any physical
constraints within the scene.

We introduce a scalable end-to-end deep learning approach for dense motion estimation
that respects the structure of the scene as being composed of distinct objects,
thus combining the representation learning benefits and speed of end-to-end deep networks
with a physically plausible scene model inspired by slanted-plane energy-minimization approaches to
scene flow.

Building on recent advances in region-based convolutional neural networks (R-CNNs),
we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D camera,
our resulting end-to-end deep network detects objects with precise per-pixel
object masks and estimates the 3D motion of each detected object between the frames.
Additionally, we estimate the camera ego-motion in the same network
and compose a dense optical flow field from the instance-level and global motion
predictions. We train our network on the synthetic Virtual KITTI dataset,
which provides ground truth for all components of our system.

\subsection*{\textbf{Zusammenfassung}}

Mit dem Aufkommen von Deep Learning
ist das Umfunktionieren generischer Deep Networks ein
beliebter Ansatz für klassische Probleme der Computer Vision geworden,
die pixelweise Schätzung erfordern.

Viele aktuelle end-to-end Deep Learning Methoden
für optischen Fluss oder Szenenfluss folgen diesem Trend und berechnen
vollständige und hochauflösende Flussfelder mit generischen
Netzwerken für dichte, pixelweise Schätzung und ignorieren damit die
inhärente Struktur des zugrundeliegenden Bewegungsschätzungsproblems und jegliche physikalische
Randbedingungen innerhalb der Szene.

Wir stellen ein skalierbares end-to-end Deep Learning Verfahren für dichte
Bewegungsschätzung vor,
das die Struktur einer Szene als Zusammensetzung eigenständiger
Objekte respektiert, und kombinieren damit die Repräsentationskraft und Geschwindigkeit
von end-to-end Deep Networks mit einem physikalisch plausiblen Szenenmodell,
das von slanted-plane Energieminimierungsmethoden für Szenenfluss inspiriert ist.

Hierbei bauen wir auf den aktuellen Fortschritten bei regionsbasierten Convolutional
Neural Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentierung.
Bei Eingabe von zwei aufeinanderfolgenden Frames aus einer monokularen RGB-D
Kamera erkennt unser end-to-end Deep Network Objekte mit pixelgenauen Objektmasken
und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den Frames ab.
Zusätzlich schätzen wir im selben Netzwerk die Eigenbewegung der Kamera
und setzen aus den instanzbasierten und globalen Bewegungsschätzungen ein dichtes
optisches Flussfeld zusammen.
Wir trainieren unser Netzwerk auf dem synthetischen Virtual KITTI Datensatz,
der Ground Truth für alle Komponenten unseres Systems bereitstellt.


\end{abstract}