diff --git a/background.tex b/background.tex
index c6cd6c9..8393850 100644
--- a/background.tex
+++ b/background.tex
@@ -90,6 +90,22 @@ The \emph{second stage} corresponds to the original Fast R-CNN head network, per
 and bounding box refinement for each region proposal. % TODO verify that it isn't modified
 As in Fast R-CNN, RoI pooling is used to crop one fixed size feature map for each of the region proposals.
 
+\paragraph{Feature Pyramid Networks}
+In Faster R-CNN, a single feature map is used as the source of all RoI features,
+independent of the size of the RoI's bounding box.
+However, for small objects, the C4 \todo{explain terminology of layers} features
+may have lost too much spatial information to accurately predict the exact bounding
+box and a high-resolution mask. Likewise, for very large objects, the fixed-size
+RoI window may be too small to cover the full region of the feature map containing
+information about the object.
+As a solution, Feature Pyramid Networks (FPN) \cite{FPN} enable features
+of an appropriate scale to be used, depending on the size of the bounding box.
+For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
+encoder. \todo{figure and more details}
+During RoI pooling, the pyramid level from which the features for an RoI are
+cropped is then selected based on the size of its bounding box. \todo{show formula}
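+% Suggested formulation for the TODO above; this follows the level-assignment
+% heuristic from the FPN paper \cite{FPN} and should be checked against our implementation.
+Following \cite{FPN}, an RoI of width $w$ and height $h$ (in input image coordinates)
+can be assigned to pyramid level
+\begin{equation}
+  k = \left\lfloor k_0 + \log_2\!\left(\sqrt{w h} / 224\right) \right\rfloor,
+\end{equation}
+where $k_0$ is the level to which an RoI of size $224^2$ is mapped (e.g.\ $k_0 = 4$),
+and $k$ is clamped to the range of available pyramid levels.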
+
+
 \paragraph{Mask R-CNN}
 Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
 However, it can be helpful to know class and object (instance) membership of all individual pixels,
@@ -102,11 +118,11 @@ compute a pixel-precise mask for each instance.
 In addition to extending the original Faster R-CNN head, Mask R-CNN also
 introduced a network variant based on Feature Pyramid Networks \cite{FPN}.
 Figure \ref{} compares the two Mask R-CNN head variants.
+\todo{RoI Align}
 
-\paragraph{Feature Pyramid Networks}
-\todo{TODO}
 
 \paragraph{Supervision of the RPN}
 \todo{TODO}
+
 \paragraph{Supervision of the RoI head}
 \todo{TODO}
diff --git a/bib.bib b/bib.bib
index 714dd55..5abd440 100644
--- a/bib.bib
+++ b/bib.bib
@@ -204,3 +204,29 @@
 note={Software available from tensorflow.org},
 author={Martín Abadi and others},
 year={2015}}
+
+@article{LSTM,
+  author = {Sepp Hochreiter and Jürgen Schmidhuber},
+  title = {Long Short-Term Memory},
+  journal = {Neural Computation},
+  year = {1997}}
+
+@inproceedings{TemporalSF,
+  author = {Christoph Vogel and Stefan Roth and Konrad Schindler},
+  title = {View-Consistent 3D Scene Flow Estimation over Multiple Frames},
+  booktitle = {ECCV},
+  year = {2014}}
+
+@inproceedings{Cityscapes,
+  author = {M. Cordts and M. Omran and S. Ramos and T. Rehfeld and
+            M. Enzweiler and R. Benenson and U. Franke and S. Roth and B. Schiele},
+  title = {The Cityscapes Dataset for Semantic Urban Scene Understanding},
+  booktitle = {CVPR},
+  year = {2016}}
+
+@article{SGD,
+  author = {Y. LeCun and B. Boser and J. S. Denker and D. Henderson
+            and R. E. Howard and W. Hubbard and L. D. Jackel},
+  title = {Backpropagation Applied to Handwritten Zip Code Recognition},
+  journal = {Neural Computation},
+  year = {1989}}
diff --git a/conclusion.tex b/conclusion.tex
index 4fc5033..80b641b 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,8 +1,21 @@
 \subsection{Summary}
-We have introduced an extension on top of region-based convolutional networks to enable 3D object motion estimation
-in parallel to instance segmentation, given two consecutive frames. Additionally, our network estimates the 3D
-motion of the camera between frames. Based on this, we compose optical flow from 3D motions in a end.
+We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
+to instance segmentation within the framework of region-based convolutional networks,
+given two consecutive frames from a monocular camera as input.
+In addition to the instance motions, our network estimates the 3D motion of the camera.
+We combine all of these estimates to yield a dense optical flow output from our
+end-to-end deep network.
+Our model is trained on the synthetic Virtual KITTI dataset, which provides
+us with all of the required ground truth data.
+During inference, our model adds no significant computational overhead over the
+latest iterations of R-CNNs and is therefore just as fast, making it suitable
+for real-time scenarios.
+We thus presented a step towards real-time 3D motion estimation based on a
+physically sound scene decomposition. Thanks to its instance-level reasoning, and in
+contrast to previous end-to-end deep networks for dense motion estimation, the output
+of our network is highly interpretable, which may be beneficial for safety-critical
+applications.
 
 \subsection{Future Work}
 
 \paragraph{Predicting depth}
@@ -20,8 +33,8 @@ Due to the amount of supervision required by the different components of the net
 and the complexity of the optimization problem, we trained Motion R-CNN on the
 simple synthetic Virtual KITTI dataset for now.
 A next step will be training on a more realistic dataset.
-For this, we can first pre-train the RPN on an object detection dataset like
-Cityscapes. As soon as the RPN works reliably, we could execute alternating
+For this, we can first pre-train the RPN on an instance segmentation dataset like
+Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could execute alternating
 steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
 On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only
 penalize the motion losses (and depth prediction if added), as no instance segmentation ground truth exists.
@@ -33,7 +46,7 @@ instance segmentation dataset with unsupervised warping-based proxy losses for t
 \paragraph{Temporal consistency}
 A next step after the two aforementioned ones could be to extend our network to exploit more than two
 temporally consecutive frames, which has previously been shown to be beneficial in the
-context of scene flow \cite{TemporalSF}.
+context of energy-minimization-based scene flow \cite{TemporalSF}.
 In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
-into our architecture, we could enable temporally consistent motion estimation
-from image sequences of arbitrary length.
+into our architecture, we could enable temporally consistent motion estimation
+from image sequences of arbitrary length.
diff --git a/experiments.tex b/experiments.tex
index 11e6e49..90bb6eb 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -110,7 +110,10 @@ predicted camera motions.
 \subsection{Training Setup}
 Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
 We train on a single Titan X (Pascal) for a total of 192K iterations on the
-Virtual KITTI training set. As learning rate we use $0.25 \cdot 10^{-2}$ for the
+Virtual KITTI training set.
+As the optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
+momentum of $0.9$.
+As the learning rate, we use $0.25 \cdot 10^{-2}$ for the
 first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
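+% Suggested addition: a compact statement of the step schedule described above,
+% where $t$ denotes the training iteration (the cases environment requires amsmath).
+That is, the learning rate follows the step schedule
+\begin{equation}
+  \eta(t) =
+  \begin{cases}
+    0.25 \cdot 10^{-2} & \text{if } t \le 144\text{K}, \\
+    0.25 \cdot 10^{-3} & \text{if } 144\text{K} < t \le 192\text{K}.
+  \end{cases}
+\end{equation}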
 
 \paragraph{R-CNN training parameters}