commit 65dddcc861 (parent e832c23983)
WIP
@@ -90,6 +90,22 @@ The \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each region proposal. % TODO verify that it isn't modified
As in Fast R-CNN, RoI pooling is used to crop one fixed-size feature map for each of the region proposals.
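As a brief illustrative sketch in our own notation (not taken verbatim from the original papers), RoI max pooling divides each RoI into a fixed $H \times W$ grid of bins and takes the channel-wise maximum over the input locations falling into each bin,
\begin{equation}
  y_{c,i,j} = \max_{(u,v) \in \mathrm{bin}(i,j)} x_{c,u,v},
\end{equation}
where $x$ is the feature map region covered by the RoI, $c$ indexes the channels, and $\mathrm{bin}(i,j)$ denotes the set of input positions assigned to output cell $(i,j)$.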
\paragraph{Feature Pyramid Networks}
In Faster R-CNN, a single feature map is used as the source of all RoIs, independent
of the size of the RoI's bounding box.
However, for small objects, the C4 \todo{explain terminology of layers} features
might have lost too much spatial information to properly predict the exact bounding
box and a high-resolution mask. Likewise, for very big objects, the fixed-size
RoI window might be too small to cover the region of the feature map containing
the information for this object.
As a solution to this, the Feature Pyramid Network (FPN) \cite{FPN} enables features
of an appropriate scale to be used, depending on the size of the bounding box.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder. \todo{figure and more details}
Now, during RoI pooling, each RoI is assigned to a pyramid level based on the size of its bounding box.
\todo{show formula}.
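A natural way to fill in this formula is the level assignment rule from the FPN paper \cite{FPN} (stated here for reference; the exact constants would have to match our implementation): an RoI of width $w$ and height $h$ in image coordinates is pooled from pyramid level
\begin{equation}
  k = \left\lfloor k_0 + \log_2\!\left(\frac{\sqrt{w h}}{224}\right) \right\rfloor,
\end{equation}
where $k_0 = 4$ is the level to which an RoI of the canonical ImageNet size $224 \times 224$ is mapped, so that smaller RoIs are pooled from finer, higher-resolution pyramid levels.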
\paragraph{Mask R-CNN}
Faster R-CNN and earlier systems detect and classify objects at bounding-box granularity.
However, it can be helpful to know the class and object (instance) membership of all individual pixels,
@@ -102,11 +118,11 @@ compute a pixel-precise mask for each instance.
In addition to extending the original Faster R-CNN head, Mask R-CNN also introduced a network
variant based on Feature Pyramid Networks \cite{FPN}.
Figure \ref{} compares the two Mask R-CNN head variants.
\todo{RoI Align}
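As a sketch for the planned RoI Align discussion: Mask R-CNN replaces the coordinate quantization of RoI pooling with RoIAlign, which samples the feature map at regularly spaced, real-valued points within each bin using bilinear interpolation and then aggregates the samples by max or average pooling. In our own notation, the bilinear sampling step can be written as
\begin{equation}
  x(p_x, p_y) = \sum_{u,v} x_{u,v} \, \max\!\left(0, 1 - |p_x - u|\right) \max\!\left(0, 1 - |p_y - v|\right),
\end{equation}
where $(p_x, p_y)$ is a real-valued sampling location and $(u, v)$ ranges over the integer grid positions of the feature map.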
\paragraph{Feature Pyramid Networks}
\todo{TODO}

\paragraph{Supervision of the RPN}
\todo{TODO}

\paragraph{Supervision of the RoI head}
\todo{TODO}
bib.bib
@@ -204,3 +204,29 @@
note={Software available from tensorflow.org},
author={Martín Abadi and others},
year={2015}}
@article{LSTM,
author = {Sepp Hochreiter and Jürgen Schmidhuber},
title = {Long Short-Term Memory},
journal = {Neural Computation},
year = {1997}}

@inproceedings{TemporalSF,
author = {Christoph Vogel and Stefan Roth and Konrad Schindler},
title = {View-Consistent 3D Scene Flow Estimation over Multiple Frames},
booktitle = {ECCV},
year = {2014}}

@inproceedings{Cityscapes,
author = {M. Cordts and M. Omran and S. Ramos and T. Rehfeld and
M. Enzweiler and R. Benenson and U. Franke and S. Roth and B. Schiele},
title = {The Cityscapes Dataset for Semantic Urban Scene Understanding},
booktitle = {CVPR},
year = {2016}}

@article{SGD,
author = {Y. LeCun and B. Boser and J. S. Denker and D. Henderson
and R. E. Howard and W. Hubbard and L. D. Jackel},
title = {Backpropagation applied to handwritten zip code recognition},
journal = {Neural Computation},
year = {1989}}
@@ -1,8 +1,21 @@
\subsection{Summary}
We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames from a monocular camera.
In addition to the instance motions, our network estimates the 3D motion of the camera between the frames.
We combine all of these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with all required ground truth data.
During inference, our model does not add any significant computational overhead
over the latest iterations of R-CNNs and is therefore just as fast and equally
interesting for real-time scenarios.
We thus presented a step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, and in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may bring benefits for safety-critical
applications.

\subsection{Future Work}
\paragraph{Predicting depth}
@@ -20,8 +33,8 @@ Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could execute alternating
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses (and the depth loss, if depth prediction is added), as no instance segmentation ground truth exists.
@@ -33,7 +46,7 @@ instance segmentation dataset with unsupervised warping-based proxy losses for t
\paragraph{Temporal consistency}
A next step after the two aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of energy-minimization-based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
@@ -110,7 +110,10 @@ predicted camera motions.
\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI training set.
As the optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
momentum of $0.9$.
As the learning rate, we use $0.25 \cdot 10^{-2}$ for the
first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
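For reference, a minimal sketch of the update rule, assuming the common momentum formulation (the exact sign and scaling conventions of our implementation may differ slightly): with learning rate $\eta$, momentum $\mu = 0.9$, total loss $L$ and parameters $\theta$,
\begin{align}
  v_{t+1} &= \mu \, v_t + \nabla_{\theta} L(\theta_t), \\
  \theta_{t+1} &= \theta_t - \eta \, v_{t+1},
\end{align}
where $v$ is the accumulated velocity.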

\paragraph{R-CNN training parameters}