This commit is contained in:
Simon Meister 2017-11-06 22:04:25 +01:00
parent e832c23983
commit 65dddcc861
4 changed files with 69 additions and 11 deletions

View File

@ -90,6 +90,22 @@ The \emph{second stage} corresponds to the original Fast R-CNN head network, per
and bounding box refinement for each region proposal. % TODO verify that it isn't modified
As in Fast R-CNN, RoI pooling is used to crop one fixed size feature map for each of the region proposals.
\paragraph{Feature Pyramid Networks}
In Faster R-CNN, a single feature map is used as the source for all RoIs, independent
of the size of the RoI bounding box.
However, for small objects, the C4 \todo{explain terminology of layers} features
might have lost too much spatial information to properly predict the exact bounding
box and a high-resolution mask. Likewise, for very large objects, the fixed-size
RoI window might be too small to cover the region of the feature map containing
information for this object.
As a solution, the Feature Pyramid Network (FPN) \cite{FPN} enables features
of an appropriate scale to be used, depending on the size of the bounding box.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder. \todo{figure and more details}
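As a rough sketch of the construction described in \cite{FPN} (omitting the $3 \times 3$
convolutions applied to each merged map), the pyramid levels $P_l$ are obtained from the
ResNet feature maps $C_l$ (the last feature map at stride $2^l$) via
\begin{align}
  P_5 &= \mathrm{conv}_{1 \times 1}(C_5), \\
  P_l &= \mathrm{conv}_{1 \times 1}(C_l) + \mathrm{up}_{\times 2}(P_{l+1}), \qquad l \in \{4, 3, 2\},
\end{align}
where $\mathrm{up}_{\times 2}$ denotes $2\times$ nearest-neighbor upsampling.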
During RoI pooling, the pyramid level from which features are cropped is then
selected based on the size of the RoI. \todo{show formula}
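As a sketch of the assignment rule used in \cite{FPN}, an RoI with width $w$ and
height $h$ (in input image coordinates) is pooled from pyramid level
\begin{equation}
  k = \left\lfloor k_0 + \log_2\!\left(\frac{\sqrt{w h}}{224}\right) \right\rfloor,
\end{equation}
with $k_0 = 4$ and $k$ clipped to the range of available levels ($P_2$ to $P_5$).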
\paragraph{Mask R-CNN}
Faster R-CNN and earlier systems detect and classify objects at bounding box granularity.
However, it can be helpful to know the class and object (instance) membership of individual pixels,
@ -102,11 +118,11 @@ compute a pixel-precise mask for each instance.
In addition to extending the original Faster R-CNN head, Mask R-CNN also introduced a network
variant based on Feature Pyramid Networks \cite{FPN}.
Figure \ref{} compares the two Mask R-CNN head variants.
\todo{RoI Align}
\paragraph{Feature Pyramid Networks}
\todo{TODO}
\paragraph{Supervision of the RPN}
\todo{TODO}
\paragraph{Supervision of the RoI head}
\todo{TODO}

bib.bib
View File

@ -204,3 +204,29 @@
note={Software available from tensorflow.org},
author={Martín Abadi and others},
year={2015}}
@article{LSTM,
author = {Sepp Hochreiter and Jürgen Schmidhuber},
title = {Long Short-Term Memory},
journal = {Neural Computation},
year = {1997}}
@inproceedings{TemporalSF,
author = {Christoph Vogel and Stefan Roth and Konrad Schindler},
title = {View-Consistent 3D Scene Flow Estimation over Multiple Frames},
booktitle = {ECCV},
year = {2014}}
@inproceedings{Cityscapes,
author = {M. Cordts and M. Omran and S. Ramos and T. Rehfeld and
M. Enzweiler and R. Benenson and U. Franke and S. Roth and B. Schiele},
title = {The Cityscapes Dataset for Semantic Urban Scene Understanding},
booktitle = {CVPR},
year = {2016}}
@article{SGD,
author = {Y. LeCun and B. Boser and J. S. Denker and D. Henderson
and R. E. Howard and W. Hubbard and L. D. Jackel},
title = {Backpropagation Applied to Handwritten Zip Code Recognition},
journal = {Neural Computation},
year = {1989}}

View File

@ -1,8 +1,21 @@
\subsection{Summary}
We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames from a monocular camera.
In addition to instance motions, our network estimates the 3D motion of the camera.
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with all required ground truth data.
During inference, our model adds no significant computational overhead
over the latest iterations of R-CNNs and is therefore similarly fast, making it
attractive for real-time scenarios.
We thus presented a step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, and in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may be beneficial for safety-critical
applications.
\subsection{Future Work}
\paragraph{Predicting depth}
@ -20,8 +33,8 @@ Due to the amount of supervision required by the different components of the net
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could execute alternating
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses (and depth prediction if added), as no instance segmentation ground truth exists.
@ -33,7 +46,7 @@ instance segmentation dataset with unsupervised warping-based proxy losses for t
\paragraph{Temporal consistency}
A next step after the two aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of energy-minimization based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.

View File

@ -110,7 +110,10 @@ predicted camera motions.
\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI training set.
As the optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
momentum of $0.9$.
As the learning rate, we use $0.25 \cdot 10^{-2}$ for the
first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
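For illustration, this setup could be expressed with the TensorFlow 1.x API roughly
as follows; this is a minimal sketch, and \texttt{total\_loss} is a placeholder for
our combined training loss rather than our actual code:
\begin{verbatim}
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# Piecewise-constant schedule: 0.25e-2 for the first 144K steps,
# 0.25e-3 for all remaining steps.
learning_rate = tf.train.piecewise_constant(
    global_step, boundaries=[144000], values=[0.25e-2, 0.25e-3])
# SGD with momentum 0.9, as described above.
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
train_op = optimizer.minimize(total_loss, global_step=global_step)
\end{verbatim}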
\paragraph{R-CNN training parameters}