Mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git (synced 2026-01-20 20:11:16 +00:00)
commit 2a39cf1174
parent 63663df448

WIP
abstract.tex (46 lines changed)

@@ -7,12 +7,12 @@
 % By constraining the optimization problem with a physically sound scene model,
 % these approaches enable state-of-the-art motion estimation.
 
-With the advent of deep learning methods, it has become popular to re-purpose
+With the advent of deep learning, it has become popular to re-purpose
 generic deep networks for classical computer vision problems involving
 pixel-wise estimation.
 
 Following this trend, many recent end-to-end deep learning approaches to optical
-flow and scene flow predict full resolution flow fields with
+flow and scene flow predict complete, high resolution flow fields with
 a generic network for dense, pixel-wise prediction, thereby ignoring the
 inherent structure of the underlying motion estimation problem and any physical
 constraints within the scene.
@@ -23,19 +23,49 @@ thus combining the representation learning benefits and speed of end-to-end deep
 with a physically plausible scene model inspired by slanted plane energy-minimization approaches to
 scene flow.
 
-Building on recent advanced in region-based convolutional networks (R-CNNs),
+Building on recent advances in region-based convolutional networks (R-CNNs),
 we integrate motion estimation with instance segmentation.
 Given two consecutive frames from a monocular RGB-D camera,
-our resulting end-to-end deep network detects objects with accurate per-pixel
-masks and estimates the 3D motion of each detected object between the frames.
+our resulting end-to-end deep network detects objects with precise per-pixel
+object masks and estimates the 3D motion of each detected object between the frames.
 By additionally estimating a global camera motion in the same network,
 we compose a dense optical flow field based on instance-level and global motion
-predictions. Our network is trained on the synthetic Virtual KITTI dataset,
-which provides ground truth for all components of the system.
+predictions. We train our network on the synthetic Virtual KITTI dataset,
+which provides ground truth for all components of our system.
 
 \end{abstract}
 
 \renewcommand{\abstractname}{Zusammenfassung}
 \begin{abstract}
-\todo{german abstract}
+With the advent of deep learning,
+re-purposing generic deep networks has become
+a popular approach to classical computer vision problems
+that require pixel-wise estimation.
+
+Following this trend, many recent end-to-end deep learning methods
+for optical flow or scene flow compute complete, high-resolution flow fields with generic
+networks for dense, pixel-wise estimation, thereby ignoring the
+inherent structure of the underlying motion estimation problem and any physical
+constraints within the scene.
+
+We present a scalable end-to-end deep learning method for dense
+motion estimation
+that respects the structure of a scene as a composition of independent
+objects, thus combining the representational power and speed
+of end-to-end deep networks with a physically plausible scene model
+inspired by slanted-plane energy minimization methods for scene flow.
+
+Building on recent advances in region-based convolutional
+networks (R-CNNs), we integrate motion estimation with instance segmentation.
+Given two consecutive frames from a monocular RGB-D
+camera, our end-to-end deep network detects objects with pixel-accurate object masks
+and estimates the 3D motion of each detected object between the frames.
+By additionally estimating the global camera motion in the same network,
+we compose dense optical flow from the instance-level and global
+motion estimates.
+We train our network on the synthetic Virtual KITTI dataset,
+which provides ground truth for all components of our system.
 
 \end{abstract}
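To make the flow composition described in the abstract concrete, here is a minimal sketch of how a dense optical flow field could be assembled from a global camera motion and per-instance rigid motions. This is an illustration, not the thesis implementation: it assumes known camera intrinsics K, a depth map for the first frame, one boolean mask and one rigid motion (R, t) per detected instance, and a particular composition order; all names are hypothetical.

import numpy as np

def compose_flow(depth, K, cam_R, cam_t, masks, motions):
    """Compose a dense optical flow field from a global camera motion
    and one rigid motion per detected instance (illustrative only)."""
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    # Back-project all pixels of frame 1 into 3D using the depth map.
    pts = np.linalg.inv(K) @ (pix * depth.reshape(1, -1))

    # Default: every point moves with the global camera motion.
    moved = cam_R @ pts + cam_t[:, None]

    # Points inside an instance mask additionally move with that
    # instance's rigid motion (applied before the camera motion;
    # the composition order is an assumption of this sketch).
    for mask, (R, t) in zip(masks, motions):
        sel = mask.reshape(-1)
        moved[:, sel] = cam_R @ (R @ pts[:, sel] + t[:, None]) + cam_t[:, None]

    # Re-project into frame 2 and read off the pixel displacement.
    proj = K @ moved
    uv = proj[:2] / proj[2:3]
    return uv.T.reshape(H, W, 2) - np.stack([xs, ys], axis=-1)

The property the sketch captures is that every pixel receives its flow vector from one of a handful of rigid motions, rather than from an unconstrained per-pixel regression.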
approach.tex (11 lines changed)
@@ -324,8 +324,15 @@ loss could benefit motion regression by removing any loss balancing issues betwe
 rotation, translation and pivot terms \cite{PoseNet2},
 which can make it interesting even when 3D motion ground truth is available.
 
-\subsection{Inference}
-\label{ssec:inference}
+\subsection{Training and Inference}
+\label{ssec:training_inference}
+\paragraph{Training}
+We train the Motion R-CNN RPN and RoI heads in exactly the same way as described for Mask R-CNN.
+We additionally compute the camera and instance motion losses and concatenate additional
+information into the network input, but otherwise do not modify the training procedure
+and sample proposals and RoIs in exactly the same way.
+
+\paragraph{Inference}
 During inference, we proceed analogously to Mask R-CNN.
 As with the RoI mask head, at test time we apply the RoI motion head
 to the features extracted with the refined bounding boxes.
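A rough sketch of what the training setup in this hunk amounts to: the Mask R-CNN losses are kept unchanged, the camera and instance motion losses are added on top, and the network input is the channel-wise concatenation of both RGB-D frames. The unit loss weights and the exact input layout are assumptions of this sketch, and all names are hypothetical.

import numpy as np

def build_input(rgb1, depth1, rgb2, depth2):
    # Concatenate both RGB-D frames along the channel axis; this is one
    # plausible reading of "concatenate additional information into the
    # network input", not necessarily the layout used in the thesis.
    return np.concatenate([rgb1, depth1[..., None],
                           rgb2, depth2[..., None]], axis=-1)

def total_loss(l_rpn, l_roi, l_cam, l_motion):
    # Mask R-CNN losses unchanged; the two motion losses are simply
    # added on top (unit weights are an assumption, not from the text).
    return l_rpn + l_roi + l_cam + l_motion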
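The two-pass structure of the inference scheme can be sketched as follows; roi_align and the head callables are hypothetical stand-ins for the actual feature extractor and subnetworks, mirroring how Mask R-CNN evaluates its mask head at test time.

def roi_heads_inference(features, proposals, heads, roi_align):
    # First pass: classify the RPN proposals and refine their boxes.
    feats = roi_align(features, proposals)
    scores, refined_boxes = heads.box(feats)

    # Second pass: as with the mask head, evaluate the mask and motion
    # heads on features re-extracted with the refined boxes.
    refined_feats = roi_align(features, refined_boxes)
    masks = heads.mask(refined_feats)
    motions = heads.motion(refined_feats)
    return scores, refined_boxes, masks, motions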
@@ -569,6 +569,8 @@ the predicted refined box encoding for class $c_i^*$.
 Additionally, for any foreground RoI, let $m_i$ be the predicted $m \times m$ mask for class $c_i^*$
 and $m_i^*$ the $m \times m$ mask target with values in $\{0,1\}$, where the mask target is cropped and resized from
 the binary ground truth mask using the RPN proposal bounding box.
+In our implementation, we use nearest-neighbour sampling when resizing the mask
+targets.
 Then, the RoI loss is computed as
 \begin{equation}
 L_{RoI} = L_{cls} + L_{box} + L_{mask}
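The mask-target construction described in this hunk (crop the binary ground-truth mask with the RPN proposal box, then resize to m x m with nearest-neighbour sampling) can be sketched as follows; the box layout and the exact sampling grid are assumptions for illustration.

import numpy as np

def mask_target(gt_mask, box, m):
    # box = (x0, y0, x1, y1) in pixel coordinates of the full image
    # (assumed layout). Sample m x m pixel centers inside the box and
    # round down to the nearest source pixel: a nearest-neighbour
    # resize of the cropped ground-truth mask.
    x0, y0, x1, y1 = box
    ys = y0 + (np.arange(m) + 0.5) * (y1 - y0) / m
    xs = x0 + (np.arange(m) + 0.5) * (x1 - x0) / m
    yi = np.clip(ys.astype(int), 0, gt_mask.shape[0] - 1)
    xi = np.clip(xs.astype(int), 0, gt_mask.shape[1] - 1)
    return gt_mask[yi[:, None], xi[None, :]]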