\subsection{Summary}

We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation within the framework of region-based convolutional networks,
given two consecutive frames from a monocular camera as input.
In addition to the instance motions, our network estimates the 3D motion of the camera.
We combine all of these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
all required ground truth data, and is evaluated on the same domain.
During inference, our model adds no significant computational overhead
over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore just as fast
and equally suitable for real-time scenarios.
We thus presented a step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, and in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may also be beneficial for safety-critical
applications.
\subsection{Future Work}

\paragraph{Evaluation and finetuning on KITTI 2015}
Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
on which we do not train, but we have yet to evaluate it on a real-world dataset.
The best candidate for evaluating our complete model is the KITTI 2015 dataset \cite{KITTI2015},
which provides depth ground truth to compose an optical flow field from our 3D motion estimates,
and optical flow ground truth to evaluate the composed flow field.
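As a schematic illustration of this composition step (in simplified notation that may deviate
from the exact definitions given earlier in this thesis), a pixel $\mathbf{p}$ with depth
$d(\mathbf{p})$ is back-projected with the camera intrinsics $K$, transformed by the predicted
motion of the instance it belongs to (or left unchanged for the background) and by the predicted
camera motion, and re-projected into the second frame:
\[
\mathbf{X} = d(\mathbf{p}) \, K^{-1} \tilde{\mathbf{p}}, \qquad
\mathbf{w}(\mathbf{p}) = \pi\!\left( K \left( R_{\mathrm{cam}} \left( R_k (\mathbf{X} - \mathbf{c}_k) + \mathbf{c}_k + \mathbf{t}_k \right) + \mathbf{t}_{\mathrm{cam}} \right) \right) - \mathbf{p},
\]
where $\tilde{\mathbf{p}}$ denotes homogeneous pixel coordinates, $\pi$ the perspective projection,
$\{R_k, \mathbf{t}_k, \mathbf{c}_k\}$ the predicted motion and pivot of the instance containing
$\mathbf{p}$, and $\{R_{\mathrm{cam}}, \mathbf{t}_{\mathrm{cam}}\}$ the predicted camera motion.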
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.

As KITTI 2015 also provides object masks for moving objects, we could in principle
fine-tune on KITTI 2015 train alone. However, as long as we cannot evaluate our method on the
KITTI 2015 test set, this makes little sense.
\paragraph{Predicting depth}
In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
However, in many application settings, we are not provided with any depth information.
In most cases, we want to work with raw RGB sequences from one or multiple simple cameras,
from which no depth data is directly available.
To handle this setting, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN (Table \ref{table:motionrcnn_resnet_fpn_depth}).
Alternatively, we could add a specialized network for end-to-end depth regression
in parallel to the region-based network (or before it, to provide XYZ input to the R-CNN), e.g. \cite{GCNet}.
Although single-frame monocular depth prediction with deep networks has already been demonstrated
with some success,
our two-frame input should allow the network to make use of epipolar
geometry for a more reliable depth estimate, at least when the camera
is moving. We could also easily extend our method to stereo input data by concatenating
all of the frames into the input image.
For the stereo case, if we chose to integrate depth prediction directly into
the R-CNN,
this would require a different dataset for training it, as Virtual KITTI does not
provide stereo images.
If we used a specialized depth network instead, we could use stereo data
for depth prediction and still train the R-CNN independently on the monocular Virtual KITTI,
though we would lose the ability to easily train the system in an end-to-end manner.

As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test set,
and also fine-tune on the training set as mentioned in the previous paragraph.
{
\begin{table}[h]
\centering
\begin{tabular}{llr}
\toprule
\textbf{Layer ID} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
 & input image & H $\times$ W $\times$ C \\
\midrule
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Depth Network}}\\
\midrule
 & From P$_2$: 3 $\times$ 3 conv, 1024 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1024 \\
 & 1 $\times$ 1 conv, 1 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1 \\
 & $\times$ 4 bilinear upsample & H $\times$ W $\times$ 1 \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Motions} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\bottomrule
\end{tabular}

\caption {
A possible Motion R-CNN ResNet-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
\end{table}
}
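As an illustration, a minimal sketch of the depth branch from Table \ref{table:motionrcnn_resnet_fpn_depth}
could look as follows in a TensorFlow/Keras-style API. This is not part of our implementation;
the tensor name \texttt{p2} and the feature shapes are only illustrative.
\begin{verbatim}
import tensorflow as tf

def depth_head(p2):
    # p2: FPN feature map at 1/4 input resolution, e.g. [N, H/4, W/4, 256]
    x = tf.keras.layers.Conv2D(1024, 3, padding='same',
                               activation='relu')(p2)
    x = tf.keras.layers.Conv2D(1, 1)(x)  # one depth value per position
    # bilinearly upsample back to the full input resolution
    return tf.keras.layers.UpSampling2D(size=4,
                                        interpolation='bilinear')(x)
\end{verbatim}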
\paragraph{Training on real world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset,
ideally without having to rely on synthetic data at all.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could alternate
training steps on, for example, Cityscapes and the KITTI 2015 stereo and optical flow datasets.
On KITTI 2015 stereo and flow, we could run the instance segmentation components in testing mode and only penalize
the motion losses (and depth prediction, if added), as no complete instance segmentation ground truth exists.
On Cityscapes, we could continue training the instance segmentation components to
improve detection and masks and to avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised, warping-based proxy losses for the motion (and depth)
predictions. Unsupervised deep learning of this kind has already been applied with some success in the optical flow
setting \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.
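As a sketch of such a proxy loss (again in simplified notation), the composed flow $\mathbf{w}$
could be supervised by a robust photometric error between the first frame and the warped second
frame,
\[
L_{\mathrm{photo}} = \sum_{\mathbf{p}} \rho\!\left( I_t(\mathbf{p}) - I_{t+1}\!\left(\mathbf{p} + \mathbf{w}(\mathbf{p})\right) \right),
\]
where $\rho$ is a robust penalty such as the Charbonnier function and $I_{t+1}$ is sampled with
bilinear interpolation, analogously to \cite{UnsupFlownet, UnFlow}.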
\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described an optical flow based loss for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera
motion ground truth.
The 3D camera motion will already be indirectly supervised when it is used in the flow-based
RoI instance motion loss. Still, to use all available information from
the ground truth optical flow and obtain more accurate supervision,
it would likely be beneficial to add a global, flow-based camera motion loss
independent of the RoI supervision.
To do this, one could use a re-projection loss conceptually identical to the one
for supervising instance motions with ground truth flow. However, to account for the
fact that the camera motion can only be accurately supervised with flow at positions where
no object motion occurs, this loss would have to be masked with the ground truth
object masks. Again, we could use this flow-based loss in an unsupervised way.
For training on a dataset without any motion ground truth, e.g.
Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.
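A sketch of such a loss (with simplified notation) could penalize, at all pixels not covered by
any ground truth object mask, the difference between the ground truth flow and the flow induced
by the predicted camera motion alone:
\[
L_{\mathrm{cam}} = \frac{1}{\sum_{\mathbf{p}} \left(1 - m(\mathbf{p})\right)}
\sum_{\mathbf{p}} \left(1 - m(\mathbf{p})\right)
\left\| \mathbf{w}_{\mathrm{cam}}(\mathbf{p}) - \mathbf{w}_{\mathrm{gt}}(\mathbf{p}) \right\|,
\]
where $m(\mathbf{p}) \in \{0, 1\}$ indicates whether $\mathbf{p}$ belongs to any ground truth
object mask and $\mathbf{w}_{\mathrm{cam}}$ is computed from depth and the predicted camera motion
as in the composition sketched above. In the unsupervised variant, the flow error would be
replaced by a photometric term as sketched before.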
\paragraph{Temporal consistency}
A next step after the two aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of energy-minimization based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
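As a purely illustrative sketch (not part of our implementation), recurrence over per-frame
backbone features could be added with a convolutional LSTM in a TensorFlow/Keras-style API;
the feature shape below is only an example.
\begin{verbatim}
import tensorflow as tf

# per-frame backbone features: [batch, time, height, width, channels]
frame_features = tf.keras.Input(shape=(None, 48, 156, 256))
temporal_features = tf.keras.layers.ConvLSTM2D(
    filters=256, kernel_size=3, padding='same',
    return_sequences=True)(frame_features)
\end{verbatim}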
\paragraph{Masking prior to the RoI motion head}
Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
the backbone are integrated over the complete RoI window to yield the features
for motion estimation.
For example, average pooling is applied before the fully-connected layers in the variant without FPN.
Ideally, however, the motion (image matching) information from the backbone should
only be aggregated over positions that belong to the object itself.
To this end, we could use the \emph{predicted} binarized masks for each RoI to mask the
extracted RoI features before passing them into the motion head.
The intuition is that we want to mask out (set to zero) any positions in the
extracted feature window which belong to the background. The RoI motion
head would then aggregate the motion (image matching) information from the backbone
only over positions localized within the object, and not over positions belonging
to the background, which should probably not influence the final object motion estimate.
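A minimal sketch of this masking step (not our implementation; the shapes and the
TensorFlow-style API are only assumptions) could look as follows.
\begin{verbatim}
import tensorflow as tf

def masked_motion_pooling(roi_features, masks):
    # roi_features: [num_rois, h, w, c] features extracted per RoI
    # masks:        [num_rois, h, w, 1] predicted masks, binarized to {0, 1}
    masked = roi_features * masks                    # zero out background
    area = tf.reduce_sum(masks, axis=[1, 2]) + 1e-6  # avoid division by zero
    # average only over foreground positions instead of the full RoI window
    return tf.reduce_sum(masked, axis=[1, 2]) / area
\end{verbatim}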