
\subsection{Summary}
We have introduced an extension of region-based convolutional networks that enables 3D object motion estimation
in parallel to instance segmentation, given two consecutive frames. Additionally, our network estimates the 3D
motion of the camera between the frames. Based on these estimates, we compose dense optical flow from the 3D motions in an end-to-end manner.
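To make the composition explicit, the following is a minimal sketch; the notation is illustrative and assumes known camera intrinsics $K$, a per-pixel depth $d(\mathbf{x})$, and the rigid motion $(R, \mathbf{t})$ assigned to pixel $\mathbf{x}$ (from its object instance, or the camera motion for the background):
\[
\mathbf{X} = d(\mathbf{x})\, K^{-1} \tilde{\mathbf{x}}, \qquad
\mathbf{X}' = R\,\mathbf{X} + \mathbf{t}, \qquad
\mathbf{w}(\mathbf{x}) = \pi\!\left(K \mathbf{X}'\right) - \mathbf{x},
\]
where $\tilde{\mathbf{x}}$ denotes homogeneous pixel coordinates, $\pi$ the perspective division, and $\mathbf{w}$ the resulting optical flow.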
\subsection{Future Work}
\paragraph{Predicting depth}
In most cases, we want to work with raw RGB sequences for which no depth is available.
To do so, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN, as in Figure \ref{}.
Although single-frame monocular depth prediction with deep networks has already been demonstrated
with some success,
our two-frame input should allow the network to exploit epipolar
geometry for a more reliable depth estimate, at least when the camera
is moving.
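As a rough illustration of this idea, the following sketch branches a small depth decoder off the shared backbone features, in parallel to the RPN. It uses TensorFlow/Keras; the layer configuration and all names are assumptions for illustration, not part of our implementation.
\begin{verbatim}
import tensorflow as tf

def depth_branch(backbone_features):
    # Hypothetical decoder: upsample the shared backbone features
    # towards input resolution with transposed convolutions.
    x = backbone_features
    for filters in (256, 128, 64):
        x = tf.keras.layers.Conv2DTranspose(
            filters, kernel_size=4, strides=2, padding='same',
            activation='relu')(x)
    # A softplus activation keeps the per-pixel depth strictly positive.
    return tf.keras.layers.Conv2D(
        1, kernel_size=3, padding='same', activation='softplus')(x)
\end{verbatim}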
\paragraph{Training on real world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset.
For this, we can first pre-train the RPN on an object detection dataset like
Cityscapes. Once the RPN works reliably, we could alternate training
steps between, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation components in testing mode and penalize only
the motion losses (and the depth loss, if depth prediction is added), as no instance segmentation ground truth exists.
On Cityscapes, we could continue training the instance segmentation components to
improve detections and masks and to avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
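A minimal sketch of the alternating scheme follows, written against a hypothetical model interface; all method names are illustrative, not from our implementation.
\begin{verbatim}
def alternating_step(step, model, cityscapes_batches, kitti_batches):
    # Hypothetical interface: alternate supervision sources per step.
    if step % 2 == 0:
        batch = next(cityscapes_batches)
        # Full instance segmentation supervision is available here.
        return model.instance_segmentation_loss(batch)
    batch = next(kitti_batches)
    # No instance masks on KITTI stereo/flow: take detections from the
    # network in testing mode and penalize only the motion
    # (and, if added, depth) losses.
    detections = model.detect(batch)
    return model.motion_loss(batch, detections)
\end{verbatim}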
\paragraph{Temporal consistency}
Once the two aforementioned steps are completed, we could extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
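As a sketch of how recurrence could be integrated, a convolutional LSTM could aggregate per-frame backbone features (TensorFlow/Keras; the shapes and filter counts are illustrative assumptions):
\begin{verbatim}
import tensorflow as tf

# A sequence of per-frame feature maps of shape (batch, time, H, W, C)
# is aggregated by a convolutional LSTM whose hidden state carries
# information across frames; downstream heads could then predict
# temporally consistent motions at every time step.
frames = tf.keras.Input(shape=(None, 24, 78, 256))
features = tf.keras.layers.ConvLSTM2D(
    filters=256, kernel_size=3, padding='same',
    return_sequences=True)(frames)
model = tf.keras.Model(frames, features)
\end{verbatim}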