\subsection{Summary}
We have introduced an extension of region-based convolutional networks that enables 3D object motion
estimation in parallel to instance segmentation, given two consecutive frames. Additionally, our network
estimates the 3D motion of the camera between the frames. Based on these estimates, we compose optical
flow from the 3D motions in a final step.
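To recap this composition in symbols (the notation here is illustrative and abstracts from the exact
parametrization in the main text): a pixel $p$ with depth $d$ is backprojected with the camera
intrinsics $K$, moved by the estimated rigid motion $(R, t)$ with pivot $c$ of the instance it belongs
to, transformed by the estimated camera motion $(R_{\mathrm{cam}}, t_{\mathrm{cam}})$, and reprojected:
\[
X = d\,K^{-1}\tilde{p}, \qquad
X' = R\,(X - c) + c + t, \qquad
w(p) = \pi\!\left(K\,(R_{\mathrm{cam}} X' + t_{\mathrm{cam}})\right) - p,
\]
where $\tilde{p}$ denotes homogeneous coordinates, $\pi$ the perspective division, and $w(p)$ the
optical flow at $p$.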
\subsection{Future Work}
\paragraph{Predicting depth}
In most cases, we want to work with raw RGB sequences for which no depth is available.
To handle this setting, we could integrate depth prediction into our network by branching off a
depth decoder from the backbone in parallel to the RPN, as in Figure \ref{}.
Although single-frame monocular depth prediction with deep networks has already been demonstrated
with some success, our two-frame input should allow the network to exploit epipolar geometry and
make a more reliable depth estimate, at least when the camera is moving.
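A minimal PyTorch-style sketch of such a branch (the module name, channel sizes and the softplus
output are illustrative assumptions, not our implementation):

\begin{verbatim}
import torch
import torch.nn as nn

class DepthBranch(nn.Module):
    """Decoder predicting a dense depth map from the shared
    backbone features, in parallel to the RPN. With two stacked
    frames as network input, these features can carry epipolar cues."""

    def __init__(self, in_channels=256, num_upsamples=4):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(num_upsamples):
            # Each transposed convolution doubles the spatial resolution.
            layers += [nn.ConvTranspose2d(channels, channels // 2,
                                          kernel_size=4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            channels //= 2
        # 1x1 convolution to a single-channel depth map.
        layers.append(nn.Conv2d(channels, 1, kernel_size=1))
        self.decoder = nn.Sequential(*layers)

    def forward(self, backbone_features):
        # Softplus keeps the predicted depth strictly positive.
        return nn.functional.softplus(self.decoder(backbone_features))
\end{verbatim}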
\paragraph{Training on real-world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem, we have so far trained Motion R-CNN
only on the simple, synthetic Virtual KITTI dataset.
A next step will be training on a more realistic dataset.
For this, we can first pre-train the RPN on an object detection dataset like
Cityscapes. As soon as the RPN works reliably, we could execute alternating training
steps on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation components in testing mode
and only penalize the motion losses (and depth prediction, if added), as no instance
segmentation ground truth exists.
On Cityscapes, we could continue training the instance segmentation components to
improve detections and masks and to avoid forgetting instance segmentation.
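A hedged sketch of this alternating scheme (the loss-dictionary contract of the model and the
dataset iterators are assumptions made for illustration):

\begin{verbatim}
def train_alternating(model, cityscapes_loader, kitti_loader,
                      optimizer, num_steps):
    """Alternate between a dataset with instance masks (Cityscapes)
    and one with only geometric ground truth (KITTI stereo/flow).
    `model(batch)` is assumed to return a dict of loss tensors."""
    city_it, kitti_it = iter(cityscapes_loader), iter(kitti_loader)
    for step in range(num_steps):
        if step % 2 == 0:
            # Cityscapes: supervise detection and instance masks so the
            # network does not forget instance segmentation.
            losses = model(next(city_it))
            loss = losses["rpn"] + losses["detection"] + losses["mask"]
        else:
            # KITTI: no instance masks, so run the segmentation head in
            # test mode and penalize only motion (and depth, if added).
            losses = model(next(kitti_it))
            loss = losses["motion"] + losses.get("depth", 0.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
\end{verbatim}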
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised, warping-based proxy losses for the motion
(and depth) prediction.
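One common form of such a proxy loss is a photometric warping loss; the sketch below (tensor
shapes and the robust penalty are assumptions) warps the second frame back to the first using
the optical flow composed from our predictions and penalizes the photometric residual:

\begin{verbatim}
import torch
import torch.nn.functional as F

def photometric_warp_loss(frame1, frame2, flow):
    """frames: (N, 3, H, W); flow: (N, 2, H, W) in pixels."""
    _, _, h, w = frame1.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device, dtype=flow.dtype),
        torch.arange(w, device=flow.device, dtype=flow.dtype),
        indexing="ij")
    # Shift the sampling grid by the flow, then normalize to [-1, 1]
    # as expected by grid_sample.
    gx = 2.0 * (xs.unsqueeze(0) + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (ys.unsqueeze(0) + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)            # (N, H, W, 2)
    warped = F.grid_sample(frame2, grid, align_corners=True)
    # Robust (Charbonnier) penalty on the photometric difference.
    return torch.sqrt((warped - frame1).pow(2) + 1e-6).mean()
\end{verbatim}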
\paragraph{Temporal consistency}
A further step, after the two aforementioned ones, could be to extend our network to exploit
more than two temporally consecutive frames, which has previously been shown to be beneficial
in the context of scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
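One plausible integration point, sketched below, is a convolutional LSTM cell applied to the
backbone features of each frame, so that the motion heads see a hidden state accumulated over
the sequence (the cell itself is a standard construction; its placement here is an assumption):

\begin{verbatim}
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell; applied once per frame,
    its hidden state carries motion context across the sequence."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution computes all four gates at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state=None):
        if state is None:
            h = torch.zeros(x.size(0), self.hidden_channels,
                            x.size(2), x.size(3), device=x.device)
            c = torch.zeros_like(h)
        else:
            h, c = state
        i, f, o, g = torch.chunk(
            self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)
\end{verbatim}

Feeding the returned hidden state into the motion heads instead of the raw per-frame features
would make each estimate depend on all frames seen so far.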