From 2a39cf117476793a59ecf59781f7b7eeb29f253d Mon Sep 17 00:00:00 2001
From: Simon Meister
Date: Mon, 13 Nov 2017 23:45:57 +0100
Subject: [PATCH] WIP

---
 abstract.tex   | 46 ++++++++++++++++++++++++++++++++++++++--------
 approach.tex   | 11 +++++++++--
 background.tex |  2 ++
 3 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/abstract.tex b/abstract.tex
index 52cba34..f600fb4 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -7,12 +7,12 @@
 % By constraining the optimization problem with a physically sound scene model,
 % these approaches enable state-of-the-art motion estimation.
 
-With the advent of deep learning methods, it has become popular to re-purpose
+With the advent of deep learning, it has become popular to re-purpose
 generic deep networks for classical computer vision problems involving
 pixel-wise estimation.
 
 Following this trend, many recent end-to-end deep learning approaches to optical
-flow and scene flow predict full resolution flow fields with
+flow and scene flow predict complete, high-resolution flow fields with
 a generic network for dense, pixel-wise prediction, thereby ignoring the
 inherent structure of the underlying motion estimation problem and any physical
 constraints within the scene.
@@ -23,19 +23,49 @@
 thus combining the representation learning benefits and speed of end-to-end deep
 networks with a physically plausible scene model inspired by slanted plane
 energy-minimization approaches to scene flow.
 
-Building on recent advanced in region-based convolutional networks (R-CNNs),
+Building on recent advances in region-based convolutional networks (R-CNNs),
 we integrate motion estimation with instance segmentation.
 Given two consecutive frames from a monocular RGB-D camera,
-our resulting end-to-end deep network detects objects with accurate per-pixel
-masks and estimates the 3D motion of each detected object between the frames.
+our resulting end-to-end deep network detects objects with precise per-pixel
+object masks and estimates the 3D motion of each detected object between the frames.
 By additionally estimating a global camera motion in the same network,
 we compose a dense optical flow field based on instance-level and global motion
-predictions. Our network is trained on the synthetic Virtual KITTI dataset,
-which provides ground truth for all components of the system.
+predictions. We train our network on the synthetic Virtual KITTI dataset,
+which provides ground truth for all components of our system.
 
 \end{abstract}
 \renewcommand{\abstractname}{Zusammenfassung}
 \begin{abstract}
-\todo{german abstract}
+
+Mit dem Aufkommen von Deep Learning
+ist die Umfunktionierung generischer Deep Networks ein
+beliebter Ansatz für klassische Probleme der Computer Vision geworden,
+die pixelweise Schätzung erfordern.
+
+Diesem Trend folgend berechnen viele aktuelle end-to-end Deep Learning Methoden
+für optischen Fluss oder Szenenfluss vollständige und hochauflösende Flussfelder mit generischen
+Netzwerken für dichte, pixelweise Schätzung, und ignorieren damit die
+inhärente Struktur des zugrundeliegenden Bewegungsschätzungsproblems und jegliche physikalischen
+Randbedingungen innerhalb der Szene.
+
+Wir stellen ein skalierbares end-to-end Deep Learning Verfahren für dichte
+Bewegungsschätzung vor,
+das die Struktur einer Szene als Zusammensetzung eigenständiger
+Objekte respektiert, und kombinieren damit die Repräsentationskraft und Geschwindigkeit
+von end-to-end Deep Networks mit einem physikalisch plausiblen Szenenmodell,
+das von slanted-plane Energieminimierungsmethoden für Szenenfluss inspiriert ist.
+
+Hierbei bauen wir auf den aktuellen Fortschritten in regionsbasierten Convolutional
+Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentierung.
+Bei Eingabe von zwei aufeinanderfolgenden Frames aus einer monokularen RGB-D-Kamera
+erkennt unser end-to-end Deep Network Objekte mit pixelgenauen Objektmasken
+und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den Frames.
+Indem wir zusätzlich im selben Netzwerk die globale Kamerabewegung schätzen,
+setzen wir aus den instanzbasierten und globalen Bewegungsschätzungen dichten
+optischen Fluss zusammen.
+Wir trainieren unser Netzwerk auf dem synthetischen Virtual KITTI Datensatz,
+der Ground Truth für alle Komponenten unseres Systems bereitstellt.
+
+
 \end{abstract}
diff --git a/approach.tex b/approach.tex
index 5d59b63..b291f90 100644
--- a/approach.tex
+++ b/approach.tex
@@ -324,8 +324,15 @@
 loss could benefit motion regression by removing any loss balancing issues between
 rotation, translation and pivot terms \cite{PoseNet2}, which can make it
 interesting even when 3D motion ground truth is available.
-\subsection{Inference}
-\label{ssec:inference}
+\subsection{Training and Inference}
+\label{ssec:training_inference}
+\paragraph{Training}
+We train the Motion R-CNN RPN and RoI heads exactly as described for Mask R-CNN.
+We further compute the camera and instance motion losses and concatenate the
+additional information into the network input, but otherwise leave the training
+procedure unchanged and sample proposals and RoIs as in Mask R-CNN.
+
+\paragraph{Inference}
 During inference, we proceed analogously to Mask R-CNN.
 As for the RoI mask head, we compute the RoI motion head
 from the features extracted with the refined bounding boxes.
diff --git a/background.tex b/background.tex
index db8dcca..1ab4c71 100644
--- a/background.tex
+++ b/background.tex
@@ -569,6 +569,8 @@
 the predicted refined box encoding for class $c_i^*$.
 Additionally, for any foreground RoI, let $m_i$ be the predicted $m \times m$ mask for class $c_i^*$
 and $m_i^*$ the $m \times m$ mask target with values in $\{0,1\}$, where the mask target
 is cropped and resized from the binary ground truth mask using the RPN proposal bounding box.
+In our implementation, we resize the mask targets with nearest-neighbour
+interpolation, which keeps the target values binary.
 Then, the RoI loss is computed as
 \begin{equation}
 L_{RoI} = L_{cls} + L_{box} + L_{mask}
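
Note on the background.tex hunk above: the patch only states that the $m \times m$ mask targets are cropped from the binary ground truth mask with the RPN proposal box and resized with nearest-neighbour interpolation. Below is a minimal NumPy sketch of what that target extraction could look like; the function name, the (y0, x0, y1, x1) box convention and m = 28 are illustrative assumptions, not code from the patched repository.

    # Hypothetical sketch of the mask target extraction described above: crop a
    # binary ground truth mask to a proposal box and resize it to m x m with
    # nearest-neighbour sampling, so the target keeps values in {0, 1}.
    import numpy as np

    def mask_target_nn(gt_mask, box, m=28):
        """gt_mask: (H, W) array with values in {0, 1};
        box: (y0, x0, y1, x1) proposal in pixel coordinates (assumed convention)."""
        y0, x0, y1, x1 = box
        h = max(y1 - y0, 1e-6)
        w = max(x1 - x0, 1e-6)
        # Sample the centre of each of the m x m target cells inside the box.
        ys = y0 + (np.arange(m) + 0.5) / m * h
        xs = x0 + (np.arange(m) + 0.5) / m * w
        # Nearest-neighbour lookup: snap each sample to the closest source pixel,
        # clamped to the image bounds.
        yi = np.clip(np.round(ys - 0.5).astype(int), 0, gt_mask.shape[0] - 1)
        xi = np.clip(np.round(xs - 0.5).astype(int), 0, gt_mask.shape[1] - 1)
        return gt_mask[np.ix_(yi, xi)]

    # Example: a 100 x 100 mask with a filled square, cropped by a proposal box.
    gt = np.zeros((100, 100), dtype=np.uint8)
    gt[30:70, 40:80] = 1
    target = mask_target_nn(gt, (25, 35, 75, 85), m=28)
    assert set(np.unique(target)) <= {0, 1}  # stays strictly binary, no blending

A bilinear resize would blend mask values into fractions, which is presumably why the patch opts for nearest-neighbour resizing of the targets here.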