\subsection{Summary}
We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation within the framework of region-based convolutional networks,
given two consecutive frames from a monocular camera as input.
In addition to the instance motions, our network estimates the 3D motion of the camera.
We combine all of these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with all required ground truth data.
During inference, our model does not add any significant computational overhead
over the latest iterations of R-CNNs and is therefore just as fast, making it equally
interesting for real-time scenarios.
We thus presented a step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, and in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may also bring benefits for safety-critical
applications.
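
To illustrate how camera motion, instance motions, and depth are combined into dense
optical flow (the symbols below are introduced only for exposition and need not match
the exact parametrization used in the main text), a pixel $\mathbf{x}$ with predicted
depth $d(\mathbf{x})$ that is assigned to instance $k$ is mapped to its position in the
second frame via
\begin{align*}
X &= d(\mathbf{x})\, K^{-1} \tilde{\mathbf{x}}, \\
X' &= R_{\mathrm{cam}}\bigl(R_k (X - p_k) + p_k + t_k\bigr) + t_{\mathrm{cam}}, \\
\mathbf{w}(\mathbf{x}) &= \pi(K X') - \mathbf{x},
\end{align*}
where $K$ denotes the camera intrinsics, $\tilde{\mathbf{x}}$ the homogeneous pixel
coordinates, $(R_k, t_k)$ the estimated rigid motion of instance $k$ about its pivot
$p_k$, $(R_{\mathrm{cam}}, t_{\mathrm{cam}})$ the estimated camera motion, $\pi$ the
perspective projection (division by the third coordinate), and $\mathbf{w}$ the
resulting optical flow; for background pixels, only the camera motion is applied.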
\subsection{Future Work}
\paragraph{Predicting depth}
In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
However, in many application settings, no depth information is provided; in most cases,
we want to work with raw RGB sequences from one or more simple cameras.
To handle this setting, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN (Table \ref{table:motionrcnn_resnet_fpn_depth}).
Alternatively, we could add a specialized network for end-to-end depth regression
in parallel to the region-based network, e.g. \cite{GCNet}.
Although single-frame monocular depth prediction with deep networks has already been
demonstrated with some success,
our two-frame input should allow the network to make use of epipolar
geometry for a more reliable depth estimate, at least when the camera
is moving.
{
\begin{table}[h]
\centering
\begin{tabular}{llr}
\toprule
\textbf{Layer ID} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Depth Network}}\\
\midrule
& From P$_2$: 3 $\times$ 3 conv, 1024 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 256 \\
& 1 $\times$ 1 conv, 1 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1 \\
& $\times$ 4 bilinear upsample & H $\times$ W $\times$ 1 \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Motions} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\bottomrule
\end{tabular}
\caption{
Preliminary Motion R-CNN ResNet-50-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
\end{table}
}
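
As a rough illustration of such a depth branch, the following PyTorch-style sketch mirrors
the depth rows of Table \ref{table:motionrcnn_resnet_fpn_depth}; the module name, the
intermediate channel width of 256, and the use of a ReLU are assumptions made only for
this sketch.
\begin{verbatim}
import torch.nn as nn
import torch.nn.functional as F

class DepthHead(nn.Module):
    """Dense depth branch on top of the FPN level P_2
    (1/4 resolution, assumed 256 channels)."""
    def __init__(self, in_channels=256, mid_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.pred = nn.Conv2d(mid_channels, 1, kernel_size=1)

    def forward(self, p2):
        x = F.relu(self.conv(p2))
        depth = self.pred(x)  # 1/4 H x 1/4 W x 1
        # bilinearly upsample back to the input resolution
        return F.interpolate(depth, scale_factor=4,
                             mode='bilinear', align_corners=False)
\end{verbatim}
Since such a branch would reuse the shared FPN features, it would add only two convolutions
and one upsampling step on top of the existing backbone.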
\paragraph{Training on real-world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we have so far trained Motion R-CNN on the simple, synthetic Virtual KITTI dataset.
A next step will be training on a more realistic dataset.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could alternate
training steps on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation components in testing mode and only penalize
the motion losses (and the depth loss, if depth prediction is added), as no instance segmentation ground truth exists there.
On Cityscapes, we could continue training the instance segmentation components to
improve detection and masks and to avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised, warping-based proxy losses for the motion (and depth) predictions.
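
The following sketch illustrates how such an alternating scheme with dataset-dependent
loss masking could look; the loss names and the dictionary-of-losses model interface are
hypothetical and only serve to make the idea concrete.
\begin{verbatim}
def alternating_train_step(step, model, optimizer,
                           cityscapes_iter, kitti_iter):
    """One hypothetical alternating training step."""
    if step % 2 == 0:
        # Cityscapes: keep training detection and masks to avoid
        # forgetting instance segmentation.
        batch = next(cityscapes_iter)
        losses = model(batch)  # assumed to return a dict of per-task losses
        loss = losses['rpn'] + losses['box'] + losses['mask']
    else:
        # KITTI stereo/flow: no instance ground truth, so the detection and
        # mask branches run in testing mode and only the motion losses
        # (plus a depth loss, if depth prediction is added) are penalized.
        batch = next(kitti_iter)
        losses = model(batch)
        loss = losses['camera_motion'] + losses['instance_motion']
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
\end{verbatim}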
\paragraph{Temporal consistency}
A further step after the two aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of energy-minimization-based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
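
As a minimal sketch of this idea (the class below, its channel widths, and the simple
convolutional update used in place of a full ConvLSTM cell are assumptions made for
illustration), a recurrent state could be propagated over the per-frame backbone features
and fed to the motion heads:
\begin{verbatim}
import torch
import torch.nn as nn

class RecurrentFeatureAggregator(nn.Module):
    """Propagates a recurrent state over per-frame backbone features so that
    motion estimates can depend on more than two frames."""
    def __init__(self, channels=256, hidden=256):
        super().__init__()
        self.hidden = hidden
        # simple convolutional recurrent update; a ConvLSTM cell would be analogous
        self.update = nn.Conv2d(channels + hidden, hidden,
                                kernel_size=3, padding=1)

    def forward(self, features_per_frame):
        # features_per_frame: list of tensors, each of shape N x C x H x W
        n, _, h, w = features_per_frame[0].shape
        state = features_per_frame[0].new_zeros(n, self.hidden, h, w)
        aggregated = []
        for feat in features_per_frame:
            state = torch.tanh(self.update(torch.cat([feat, state], dim=1)))
            aggregated.append(state)  # per-frame input to the motion heads
        return aggregated
\end{verbatim}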