diff --git a/approach.tex b/approach.tex
index 451faf4..2c9d9e5 100644
--- a/approach.tex
+++ b/approach.tex
@@ -89,7 +89,7 @@ predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the sam
 
 \subsection{Supervision}
 
-\paragraph{Per-RoI supervision with motion ground truth}
+\paragraph{Per-RoI supervision with 3D motion ground truth}
 The most straightforward way to supervise the object motions is
 by using ground truth motions computed from ground truth object poses,
 which is in general only practical when training on synthetic datasets.
@@ -124,7 +124,7 @@ We supervise the camera motion with ground truth analogously
 to the object motions, with the only difference being that we have
 a rotation and translation, but no pivot term for the camera motion.
 
-\paragraph{Per-RoI supervision \emph{without} motion ground truth}
+\paragraph{Per-RoI supervision \emph{without} 3D motion ground truth}
 A more general way to supervise the object motions is a re-projection loss
 similar to the unsupervised loss in SfM-Net \cite{SfmNet},
 which we can apply to coordinates within the object bounding boxes,
diff --git a/conclusion.tex b/conclusion.tex
index 72076a7..4fc5033 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,7 +1,8 @@
 \subsection{Summary}
-We have introduced an extension on top of region-based convolutional networks to enable object motion estimation
-in parallel to instance segmentation.
-\todo{complete}
+We have introduced an extension on top of region-based convolutional networks to enable 3D object motion estimation
+in parallel to instance segmentation, given two consecutive frames. Additionally, our network estimates the 3D
+motion of the camera between frames. Based on this, we compose optical flow from the estimated 3D motions.
+
 \subsection{Future Work}
 
 \paragraph{Predicting depth}
@@ -28,3 +29,11 @@ On Cityscapes, we could continue training the instance segmentation components to
 improve detection and masks and avoid forgetting instance segmentation.
 As an alternative to this training scheme, we could investigate training on a pure
 instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
+
+\paragraph{Temporal consistency}
+A natural next step after the two aforementioned ones would be to extend our network to exploit more than two
+temporally consecutive frames, which has previously been shown to be beneficial in the
+context of scene flow \cite{TemporalSF}.
+In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
+into our architecture, we could enable temporally consistent motion estimation
+from image sequences of arbitrary length.
diff --git a/experiments.tex b/experiments.tex
index 878ec95..11e6e49 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -5,7 +5,7 @@ computations. To make our code easy to extend and flexible, we build on the
 TensorFlow Object Detection API \cite{TensorFlowObjectDetection}, which provides
 a Faster R-CNN baseline implementation.
 On top of this, we implemented Mask R-CNN and the Feature Pyramid Network (FPN)
-as well all extensions for motion estimation and related evaluations
+as well as all extensions for motion estimation and related evaluations
 and postprocessing.
 In addition, we generated all ground truth for Motion R-CNN in the form of TFRecords
 from the raw Virtual KITTI data to enable fast loading during training.
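
Note on the flow composition added to conclusion.tex above: the patch states that optical flow is composed from the estimated 3D motions, and approach.tex describes object motions as a rotation about a pivot plus a translation. The sketch below illustrates one way such a composition can work under a pinhole camera model. It is a minimal NumPy sketch, not the thesis code; the names (depth, K, R, t, pivot), the exact parameterization, and the example intrinsics are assumptions for illustration.

import numpy as np

def compose_flow(depth, K, R, t, pivot):
    """Compose dense optical flow for the pixels of one rigid object.

    depth: (H, W) depth of the first frame; K: (3, 3) camera intrinsics;
    R: (3, 3) object rotation; t: (3,) translation; pivot: (3,) rotation pivot.
    """
    H, W = depth.shape
    # Back-project every pixel into 3D camera space.
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T.astype(np.float64)
    X = np.linalg.inv(K) @ pix * depth.reshape(-1)
    # Apply the rigid motion: rotate about the pivot, then translate.
    X2 = R @ (X - pivot[:, None]) + pivot[:, None] + t[:, None]
    # Project back to the image plane; flow is the pixel displacement.
    proj = K @ X2
    flow = proj[:2] / proj[2:] - pix[:2]
    return flow.reshape(2, H, W)

# Example with made-up KITTI-like intrinsics: identity rotation, forward translation.
K = np.array([[721.5, 0.0, 620.5], [0.0, 721.5, 187.0], [0.0, 0.0, 1.0]])
flow = compose_flow(np.full((375, 1242), 10.0), K, np.eye(3),
                    np.array([0.0, 0.0, 1.0]), np.zeros(3))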
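Likewise, for the TFRecord generation mentioned in experiments.tex, serializing one training example could look as follows. This is a hypothetical sketch using the standard tf.train.Example API; the feature keys and payload layout are invented and need not match the thesis implementation.

import tensorflow as tf

def encode_example(image_bytes, motion_params):
    # Pack one frame pair's data into a tf.train.Example proto.
    features = tf.train.Features(feature={
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "gt/motions": tf.train.Feature(
            float_list=tf.train.FloatList(value=motion_params)),
    })
    return tf.train.Example(features=features).SerializeToString()

# Write serialized examples to a TFRecord file for fast loading during training.
with tf.io.TFRecordWriter("vkitti_train.tfrecord") as writer:
    writer.write(encode_example(b"\x89PNG...", [0.0, 0.1, 0.0]))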