\subsection{Summary}
We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation within the framework of region-based convolutional networks,
given two consecutive frames from a monocular camera as input.
In addition to the instance motions, our network estimates the 3D motion of the camera.
We combine all of these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with all required ground truth data.
During inference, our model does not add any significant computational overhead
over the latest iterations of R-CNNs and is therefore just as fast, making it equally
interesting for real-time scenarios.
We thus presented a step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, and in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may also bring benefits for safety-critical
applications.
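
To illustrate how camera motion, instance motions, and depth are combined into dense
optical flow (the symbols below are introduced only for exposition and need not match
the exact parametrization used in the main text), a pixel $\mathbf{x}$ with predicted
depth $d(\mathbf{x})$ that is assigned to instance $k$ is mapped to its position in the
second frame via
\begin{align*}
X &= d(\mathbf{x})\, K^{-1} \tilde{\mathbf{x}}, \\
X' &= R_{\mathrm{cam}}\bigl(R_k (X - p_k) + p_k + t_k\bigr) + t_{\mathrm{cam}}, \\
\mathbf{w}(\mathbf{x}) &= \pi(K X') - \mathbf{x},
\end{align*}
where $K$ denotes the camera intrinsics, $\tilde{\mathbf{x}}$ the homogeneous pixel
coordinates, $(R_k, t_k)$ the estimated rigid motion of instance $k$ about its pivot
$p_k$, $(R_{\mathrm{cam}}, t_{\mathrm{cam}})$ the estimated camera motion, $\pi$ the
perspective projection (division by the third coordinate), and $\mathbf{w}$ the
resulting optical flow; for background pixels, only the camera motion is applied.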
\subsection{Future Work}
\paragraph{Predicting depth}
In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
However, in many application settings, no depth information is provided; in most cases,
we want to work with raw RGB sequences from one or more simple cameras.
To handle this setting, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN (Table \ref{table:motionrcnn_resnet_fpn_depth}).
Alternatively, we could add a specialized network for end-to-end depth regression
in parallel to the region-based network, e.g. \cite{GCNet}.
Although single-frame monocular depth prediction with deep networks has already been
demonstrated with some success,
our two-frame input should allow the network to make use of epipolar
geometry for a more reliable depth estimate, at least when the camera
is moving.
{
\begin{table}[h]
\centering
\begin{tabular}{llr}
\toprule
\textbf{Layer ID} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Depth Network}}\\
\midrule
& From P$_2$: 3 $\times$ 3 conv, 1024 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 256 \\
& 1 $\times$ 1 conv, 1 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1 \\
& $\times$ 4 bilinear upsample & H $\times$ W $\times$ 1 \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Motions} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\bottomrule
\end{tabular}
\caption{
Preliminary Motion R-CNN ResNet-50-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
\end{table}
}
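
As a rough illustration of such a depth branch, the following PyTorch-style sketch mirrors
the depth rows of Table \ref{table:motionrcnn_resnet_fpn_depth}; the module name, the
intermediate channel width of 256, and the use of a ReLU are assumptions made only for
this sketch.
\begin{verbatim}
import torch.nn as nn
import torch.nn.functional as F

class DepthHead(nn.Module):
    """Dense depth branch on top of the FPN level P_2
    (1/4 resolution, assumed 256 channels)."""
    def __init__(self, in_channels=256, mid_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.pred = nn.Conv2d(mid_channels, 1, kernel_size=1)

    def forward(self, p2):
        x = F.relu(self.conv(p2))
        depth = self.pred(x)  # 1/4 H x 1/4 W x 1
        # bilinearly upsample back to the input resolution
        return F.interpolate(depth, scale_factor=4,
                             mode='bilinear', align_corners=False)
\end{verbatim}
Since such a branch would reuse the shared FPN features, it would add only two convolutions
and one upsampling step on top of the existing backbone.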
\paragraph{Training on real-world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we have so far trained Motion R-CNN on the simple, synthetic Virtual KITTI dataset.
A next step will be training on a more realistic dataset.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could alternate
training steps on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation components in testing mode and only penalize
the motion losses (and the depth loss, if depth prediction is added), as no instance segmentation ground truth exists there.
On Cityscapes, we could continue training the instance segmentation components to
improve detection and masks and to avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised, warping-based proxy losses for the motion (and depth) predictions.
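
The following sketch illustrates how such an alternating scheme with dataset-dependent
loss masking could look; the loss names and the dictionary-of-losses model interface are
hypothetical and only serve to make the idea concrete.
\begin{verbatim}
def alternating_train_step(step, model, optimizer,
                           cityscapes_iter, kitti_iter):
    """One hypothetical alternating training step."""
    if step % 2 == 0:
        # Cityscapes: keep training detection and masks to avoid
        # forgetting instance segmentation.
        batch = next(cityscapes_iter)
        losses = model(batch)  # assumed to return a dict of per-task losses
        loss = losses['rpn'] + losses['box'] + losses['mask']
    else:
        # KITTI stereo/flow: no instance ground truth, so the detection and
        # mask branches run in testing mode and only the motion losses
        # (plus a depth loss, if depth prediction is added) are penalized.
        batch = next(kitti_iter)
        losses = model(batch)
        loss = losses['camera_motion'] + losses['instance_motion']
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
\end{verbatim}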
\paragraph{Temporal consistency}
A further step after the two aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of energy-minimization-based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
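
As a minimal sketch of this idea (the class below, its channel widths, and the simple
convolutional update used in place of a full ConvLSTM cell are assumptions made for
illustration), a recurrent state could be propagated over the per-frame backbone features
and fed to the motion heads:
\begin{verbatim}
import torch
import torch.nn as nn

class RecurrentFeatureAggregator(nn.Module):
    """Propagates a recurrent state over per-frame backbone features so that
    motion estimates can depend on more than two frames."""
    def __init__(self, channels=256, hidden=256):
        super().__init__()
        self.hidden = hidden
        # simple convolutional recurrent update; a ConvLSTM cell would be analogous
        self.update = nn.Conv2d(channels + hidden, hidden,
                                kernel_size=3, padding=1)

    def forward(self, features_per_frame):
        # features_per_frame: list of tensors, each of shape N x C x H x W
        n, _, h, w = features_per_frame[0].shape
        state = features_per_frame[0].new_zeros(n, self.hidden, h, w)
        aggregated = []
        for feat in features_per_frame:
            state = torch.tanh(self.update(torch.cat([feat, state], dim=1)))
            aggregated.append(state)  # per-frame input to the motion heads
        return aggregated
\end{verbatim}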