From 2a39cf117476793a59ecf59781f7b7eeb29f253d Mon Sep 17 00:00:00 2001
From: Simon Meister
Date: Mon, 13 Nov 2017 23:45:57 +0100
Subject: [PATCH] WIP

---
 abstract.tex   | 46 ++++++++++++++++++++++++++++++++++++++--------
 approach.tex   | 11 +++++++++--
 background.tex |  2 ++
 3 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/abstract.tex b/abstract.tex
index 52cba34..f600fb4 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -7,12 +7,12 @@
 % By constraining the optimization problem with a physically sound scene model,
 % these approaches enable state-of-the-art motion estimation.
 
-With the advent of deep learning methods, it has become popular to re-purpose
+With the advent of deep learning, it has become popular to re-purpose
 generic deep networks for classical computer vision problems involving
 pixel-wise estimation.
 
 Following this trend, many recent end-to-end deep learning approaches to optical
-flow and scene flow predict full resolution flow fields with
+flow and scene flow predict complete, high-resolution flow fields with
 a generic network for dense, pixel-wise prediction, thereby ignoring the
 inherent structure of the underlying motion estimation problem and any physical
 constraints within the scene.
@@ -23,19 +23,49 @@
 thus combining the representation learning benefits and speed of end-to-end deep
 networks with a physically plausible scene model inspired by slanted plane
 energy-minimization approaches to scene flow.
 
-Building on recent advanced in region-based convolutional networks (R-CNNs),
+Building on recent advances in region-based convolutional networks (R-CNNs),
 we integrate motion estimation with instance segmentation.
 Given two consecutive frames from a monocular RGB-D camera,
-our resulting end-to-end deep network detects objects with accurate per-pixel
-masks and estimates the 3D motion of each detected object between the frames.
+our resulting end-to-end deep network detects objects with precise per-pixel
+object masks and estimates the 3D motion of each detected object between the frames.
 By additionally estimating a global camera motion in the same network,
 we compose a dense optical flow field based on instance-level and global motion
-predictions. Our network is trained on the synthetic Virtual KITTI dataset,
-which provides ground truth for all components of the system.
+predictions. We train our network on the synthetic Virtual KITTI dataset,
+which provides ground truth for all components of our system.
 
 \end{abstract}
 \renewcommand{\abstractname}{Zusammenfassung}
 \begin{abstract}
-\todo{german abstract}
+
+Mit dem Aufkommen von Deep Learning
+ist die Umfunktionierung generischer Deep Networks ein
+beliebter Ansatz für klassische Probleme der Computer Vision geworden,
+die pixelweise Schätzung erfordern.
+
+Diesem Trend folgend berechnen viele aktuelle end-to-end Deep Learning Methoden
+für optischen Fluss oder Szenenfluss vollständige und hochauflösende Flussfelder mit generischen
+Netzwerken für dichte, pixelweise Schätzung, und ignorieren damit die
+inhärente Struktur des zugrundeliegenden Bewegungsschätzungsproblems und jegliche physikalischen
+Randbedingungen innerhalb der Szene.
+
+Wir stellen ein skalierbares end-to-end Deep Learning Verfahren für dichte
+Bewegungsschätzung vor,
+das die Struktur einer Szene als Zusammensetzung eigenständiger
+Objekte respektiert, und kombinieren damit die Repräsentationskraft und Geschwindigkeit
+von end-to-end Deep Networks mit einem physikalisch plausiblen Szenenmodell,
+das von slanted-plane Energieminimierungsmethoden für Szenenfluss inspiriert ist.
+
+Hierbei bauen wir auf den aktuellen Fortschritten in regionsbasierten Convolutional
+Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentierung.
+Bei Eingabe von zwei aufeinanderfolgenden Frames aus einer monokularen RGB-D-Kamera
+erkennt unser end-to-end Deep Network Objekte mit pixelgenauen Objektmasken
+und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den Frames.
+Indem wir zusätzlich im selben Netzwerk die globale Kamerabewegung schätzen,
+setzen wir aus den instanzbasierten und globalen Bewegungsschätzungen dichten
+optischen Fluss zusammen.
+Wir trainieren unser Netzwerk auf dem synthetischen Virtual KITTI Datensatz,
+der Ground Truth für alle Komponenten unseres Systems bereitstellt.
+
+
 \end{abstract}
diff --git a/approach.tex b/approach.tex
index 5d59b63..b291f90 100644
--- a/approach.tex
+++ b/approach.tex
@@ -324,8 +324,15 @@
 loss could benefit motion regression by removing any loss balancing issues between
 rotation, translation and pivot terms \cite{PoseNet2}, which can make it
 interesting even when 3D motion ground truth is available.
-\subsection{Inference}
-\label{ssec:inference}
+\subsection{Training and Inference}
+\label{ssec:training_inference}
+\paragraph{Training}
+We train the Motion R-CNN RPN and RoI heads exactly as described for Mask R-CNN.
+We further compute the camera and instance motion losses and concatenate the
+additional information into the network input, but otherwise leave the training
+procedure unchanged and sample proposals and RoIs as in Mask R-CNN.
+
+\paragraph{Inference}
 During inference, we proceed analogously to Mask R-CNN.
 As for the RoI mask head, we compute the RoI motion head
 from the features extracted with the refined bounding boxes.
diff --git a/background.tex b/background.tex
index db8dcca..1ab4c71 100644
--- a/background.tex
+++ b/background.tex
@@ -569,6 +569,8 @@
 the predicted refined box encoding for class $c_i^*$.
 Additionally, for any foreground RoI, let $m_i$ be the predicted $m \times m$ mask for class $c_i^*$
 and $m_i^*$ the $m \times m$ mask target with values in $\{0,1\}$, where the mask target
 is cropped and resized from the binary ground truth mask using the RPN proposal bounding box.
+In our implementation, we resize the mask targets with nearest-neighbour
+interpolation, which keeps the target values binary.
 Then, the RoI loss is computed as
 \begin{equation}
 L_{RoI} = L_{cls} + L_{box} + L_{mask}
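
Note on the background.tex hunk above: the patch only states that the $m \times m$ mask targets are cropped from the binary ground truth mask with the RPN proposal box and resized with nearest-neighbour interpolation. Below is a minimal NumPy sketch of what that target extraction could look like; the function name, the (y0, x0, y1, x1) box convention and m = 28 are illustrative assumptions, not code from the patched repository.

    # Hypothetical sketch of the mask target extraction described above: crop a
    # binary ground truth mask to a proposal box and resize it to m x m with
    # nearest-neighbour sampling, so the target keeps values in {0, 1}.
    import numpy as np

    def mask_target_nn(gt_mask, box, m=28):
        """gt_mask: (H, W) array with values in {0, 1};
        box: (y0, x0, y1, x1) proposal in pixel coordinates (assumed convention)."""
        y0, x0, y1, x1 = box
        h = max(y1 - y0, 1e-6)
        w = max(x1 - x0, 1e-6)
        # Sample the centre of each of the m x m target cells inside the box.
        ys = y0 + (np.arange(m) + 0.5) / m * h
        xs = x0 + (np.arange(m) + 0.5) / m * w
        # Nearest-neighbour lookup: snap each sample to the closest source pixel,
        # clamped to the image bounds.
        yi = np.clip(np.round(ys - 0.5).astype(int), 0, gt_mask.shape[0] - 1)
        xi = np.clip(np.round(xs - 0.5).astype(int), 0, gt_mask.shape[1] - 1)
        return gt_mask[np.ix_(yi, xi)]

    # Example: a 100 x 100 mask with a filled square, cropped by a proposal box.
    gt = np.zeros((100, 100), dtype=np.uint8)
    gt[30:70, 40:80] = 1
    target = mask_target_nn(gt, (25, 35, 75, 85), m=28)
    assert set(np.unique(target)) <= {0, 1}  # stays strictly binary, no blending

A bilinear resize would blend mask values into fractions, which is presumably why the patch opts for nearest-neighbour resizing of the targets here.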