This commit is contained in:
Simon Meister 2017-11-06 22:04:25 +01:00
parent e832c23983
commit 65dddcc861
4 changed files with 69 additions and 11 deletions

View File

@ -90,6 +90,22 @@ The \emph{second stage} corresponds to the original Fast R-CNN head network, per
and bounding box refinement for each region proposal. % TODO verify that it isn't modified
As in Fast R-CNN, RoI pooling is used to crop one fixed size feature map for each of the region proposals.
\paragraph{Feature Pyramid Networks}
In Faster R-CNN, a single feature map is used as the source for all RoIs, independent
of the size of the RoI bounding box.
However, for small objects, the C4 \todo{explain terminology of layers} features
might have lost too much spatial information to properly predict the exact bounding
box and a high-resolution mask. Likewise, for very large objects, the fixed-size
RoI window might be too small to cover the region of the feature map containing
information for this object.
As a solution, the Feature Pyramid Network (FPN) \cite{FPN} enables features
of an appropriate scale to be used, depending on the size of the bounding box.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder. \todo{figure and more details}
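As a rough sketch of the construction described in \cite{FPN} (omitting the $3 \times 3$
convolutions applied to each merged map), the pyramid levels $P_l$ are obtained from the
ResNet feature maps $C_l$ (the last feature map at stride $2^l$) via
\begin{align}
  P_5 &= \mathrm{conv}_{1 \times 1}(C_5), \\
  P_l &= \mathrm{conv}_{1 \times 1}(C_l) + \mathrm{up}_{\times 2}(P_{l+1}), \qquad l \in \{4, 3, 2\},
\end{align}
where $\mathrm{up}_{\times 2}$ denotes $2\times$ nearest-neighbor upsampling.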
During RoI pooling, the pyramid level from which features are cropped is then
selected based on the size of the RoI. \todo{show formula}
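As a sketch of the assignment rule used in \cite{FPN}, an RoI with width $w$ and
height $h$ (in input image coordinates) is pooled from pyramid level
\begin{equation}
  k = \left\lfloor k_0 + \log_2\!\left(\frac{\sqrt{w h}}{224}\right) \right\rfloor,
\end{equation}
with $k_0 = 4$ and $k$ clipped to the range of available levels ($P_2$ to $P_5$).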
\paragraph{Mask R-CNN}
Faster R-CNN and earlier systems detect and classify objects at bounding box granularity.
However, it can be helpful to know the class and object (instance) membership of individual pixels,
@ -102,11 +118,11 @@ compute a pixel-precise mask for each instance.
In addition to extending the original Faster R-CNN head, Mask R-CNN also introduced a network
variant based on Feature Pyramid Networks \cite{FPN}.
Figure \ref{} compares the two Mask R-CNN head variants.
\todo{RoI Align}
\paragraph{Feature Pyramid Networks}
\todo{TODO}
\paragraph{Supervision of the RPN}
\todo{TODO}
\paragraph{Supervision of the RoI head}
\todo{TODO}

bib.bib
View File

@ -204,3 +204,29 @@
note={Software available from tensorflow.org},
author={Martín Abadi and others},
year={2015}}
@article{LSTM,
author = {Sepp Hochreiter and Jürgen Schmidhuber},
title = {Long Short-Term Memory},
journal = {Neural Computation},
year = {1997}}
@inproceedings{TemporalSF,
author = {Christoph Vogel and Stefan Roth and Konrad Schindler},
title = {View-Consistent 3D Scene Flow Estimation over Multiple Frames},
booktitle = {ECCV},
year = {2014}}
@inproceedings{Cityscapes,
author = {M. Cordts and M. Omran and S. Ramos and T. Rehfeld and
M. Enzweiler and R. Benenson and U. Franke and S. Roth and B. Schiele},
title = {The Cityscapes Dataset for Semantic Urban Scene Understanding},
booktitle = {CVPR},
year = {2016}}
@article{SGD,
author = {Y. LeCun and B. Boser and J. S. Denker and D. Henderson
and R. E. Howard and W. Hubbard and L. D. Jackel},
title = {Backpropagation Applied to Handwritten Zip Code Recognition},
journal = {Neural Computation},
year = {1989}}

View File

@ -1,8 +1,21 @@
\subsection{Summary}
We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames from a monocular camera.
In addition to instance motions, our network estimates the 3D motion of the camera.
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with all required ground truth data.
During inference, our model adds no significant computational overhead
over the latest iterations of R-CNNs and is therefore similarly fast, making it
attractive for real-time scenarios.
We thus presented a step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, and in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may be beneficial for safety-critical
applications.
\subsection{Future Work}
\paragraph{Predicting depth}
@ -20,8 +33,8 @@ Due to the amount of supervision required by the different components of the net
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could execute alternating
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses (and depth prediction if added), as no instance segmentation ground truth exists.
@ -33,7 +46,7 @@ instance segmentation dataset with unsupervised warping-based proxy losses for t
\paragraph{Temporal consistency}
A next step after the two aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of energy-minimization based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.

View File

@ -110,7 +110,10 @@ predicted camera motions.
\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI training set.
As the optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
momentum of $0.9$.
As the learning rate, we use $0.25 \cdot 10^{-2}$ for the
first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
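For illustration, this setup could be expressed with the TensorFlow 1.x API roughly
as follows; this is a minimal sketch, and \texttt{total\_loss} is a placeholder for
our combined training loss rather than our actual code:
\begin{verbatim}
import tensorflow as tf

global_step = tf.train.get_or_create_global_step()
# Piecewise-constant schedule: 0.25e-2 for the first 144K steps,
# 0.25e-3 for all remaining steps.
learning_rate = tf.train.piecewise_constant(
    global_step, boundaries=[144000], values=[0.25e-2, 0.25e-3])
# SGD with momentum 0.9, as described above.
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
train_op = optimizer.minimize(total_loss, global_step=global_step)
\end{verbatim}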
\paragraph{R-CNN training parameters}