\subsection{Summary}
We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation in the framework of region-based convolutional networks,
given two consecutive frames (and XYZ point coordinates) from a monocular camera as input.
In addition to instance motions, our network estimates the 3D ego-motion of the camera.
We combine all these estimates to obtain a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with bounding box, instance mask, depth, and 3D motion ground truth,
and evaluated on a validation set created from Virtual KITTI.
During inference, our model does not add any significant computational overhead
over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore just as fast and equally interesting
for real-time scenarios.
Although our system produces first reasonable instance motion predictions,
estimates the camera ego-motion reasonably well,
and achieves high accuracy in classifying objects as moving or non-moving,
the accuracy of the motion predictions is not yet convincing.
More work will thus be required to bring the system (closer) to competitive accuracy,
which includes supervising instance motions with the optical flow loss instead of 3D motion ground truth,
experimenting with the weighting between the different loss terms,
and improving the network architecture, loss design, and training process.
We thus presented a partial step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may also bring benefits for safety-critical
applications.
\subsection{Future Work}
\paragraph{Mask R-CNN baseline}
As our Mask R-CNN re-implementation is not yet as accurate as reported in the
original paper \cite{MaskRCNN}, improving the implementation details of this baseline is
a critical, direct next step. Recently, a highly accurate third-party TensorFlow implementation of Mask
R-CNN was released, which should be studied to this end.
\paragraph{Instance motion supervision with the optical flow re-projection loss}
We developed and implemented a loss for penalizing instance motions with optical flow ground truth,
but due to time constraints could not yet train a network with it.
Conducting experiments with this loss is therefore another immediate next step.
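As a rough illustration, such a re-projection loss for a single instance $k$ could take the schematic form
\[
\mathcal{L}_{flow}^{k} = \frac{1}{|\Omega_k|} \sum_{\mathbf{x} \in \Omega_k}
\left\| \pi\!\left(K \left( R_k\, X_t(\mathbf{x}) + t_k \right)\right) - \mathbf{x} - \mathbf{f}^{gt}_{t \rightarrow t+1}(\mathbf{x}) \right\|_1,
\]
where $\Omega_k$ denotes the pixels covered by the mask of instance $k$,
$X_t(\mathbf{x})$ the 3D point at pixel $\mathbf{x}$ obtained from the depth input,
$(R_k, t_k)$ the predicted rigid instance motion, $K$ the camera intrinsics,
$\pi$ the perspective projection, and $\mathbf{f}^{gt}_{t \rightarrow t+1}$ the ground truth optical flow.
Note that this is only a sketch of the general idea; the formulation used in our implementation may differ,
for example in the choice of penalty function and in how the camera motion is composed with the instance motion.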
\paragraph{Training on all Virtual KITTI sequences}
We only trained our models on the \emph{clone} variants of the Virtual KITTI sequences
to make training faster.
In the future, it would be interesting to train on all variants, as the different
lighting conditions and viewing angles should lead to a more general model.
\paragraph{Evaluation and finetuning on KITTI 2015}
Thus far, we have evaluated our model on a held-out subset of the Virtual KITTI dataset
on which we do not train, but we have yet to evaluate on a real-world dataset.
The best candidate for evaluating our complete model is the KITTI 2015 dataset \cite{KITTI2015},
which provides depth ground truth for composing an optical flow field from our 3D motion estimates,
and optical flow ground truth for evaluating the composed flow field.
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.
As KITTI 2015 also provides instance masks for moving objects, we could in principle
fine-tune on the KITTI 2015 training set alone. However, as long as we cannot evaluate our method
on the KITTI 2015 test set, this makes little sense.
\paragraph{Predicting depth}
In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
However, in many application settings, no depth information is provided,
and we instead have to work with raw RGB sequences from one or multiple ordinary cameras.
To handle this case, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN (Table~\ref{table:motionrcnn_resnet_fpn_depth}; a code sketch is given below the table).
Alternatively, we could add a specialized network for end-to-end depth regression
in parallel to the region-based network (or before it, to provide XYZ input to Motion R-CNN), e.g. \cite{GCNet}.
Although single-frame monocular depth prediction with deep networks has already been
demonstrated with some success,
our two-frame input should allow the network to exploit epipolar
geometry for a more reliable depth estimate, at least when the camera
is moving. We could also easily extend our method to stereo input by concatenating
all frames into the network input.
If we chose to integrate the depth prediction directly into the current backbone,
however, this would require a different dataset for training Motion R-CNN, as Virtual KITTI does not
provide stereo images.
If we used a specialized depth network instead, we could use stereo data
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset,
though we would lose the ability to easily train the system end-to-end.
As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test set,
and also fine-tune on the training set as mentioned in the previous paragraph.
{
\begin{table}[h]
\centering
\begin{tabular}{llr}
\toprule
\textbf{Layer ID} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\
\midrule
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Depth Network}}\\
\midrule
& From P$_2$: 3 $\times$ 3 conv, 1024 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1024 \\
& 1 $\times$ 1 conv, 1 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1 \\
& $\times$ 4 bilinear upsample & H $\times$ W $\times$ 1 \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Motions} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\bottomrule
\end{tabular}
\caption {
A possible Motion R-CNN ResNet-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
\end{table}
}
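As a rough sketch, the depth branch from Table~\ref{table:motionrcnn_resnet_fpn_depth} could be implemented
in TensorFlow along the following lines; the layer names, the ReLU activation, and resizing directly to the
full input resolution are assumptions not fixed by the table:
\begin{verbatim}
import tensorflow as tf

def depth_branch(p2, image_height, image_width):
    # 3x3 convolution with 1024 channels on the 1/4-resolution P2 features
    x = tf.layers.conv2d(p2, 1024, 3, padding='same',
                         activation=tf.nn.relu, name='depth_conv')
    # 1x1 convolution down to a single depth channel
    x = tf.layers.conv2d(x, 1, 1, padding='same', name='depth_pred')
    # bilinear upsampling to the full input resolution
    return tf.image.resize_bilinear(x, [image_height, image_width])
\end{verbatim}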
\paragraph{Training on real world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset,
ideally without having to rely on synthetic data at all.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could alternate
training steps on, for example, Cityscapes and the KITTI 2015 stereo and optical flow datasets.
On KITTI 2015 stereo and flow, we could run the instance segmentation components in testing mode and only penalize
the motion losses (and the depth prediction, if added), as no complete instance segmentation ground truth exists.
On Cityscapes, we could continue training the instance segmentation components to
improve detections and masks and to avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation stereo dataset with unsupervised, warping-based proxy losses for the motion (and depth)
prediction. Unsupervised deep learning of this kind has already been applied with some success to optical flow
estimation \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.
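Such an alternating scheme could be organized along the following lines; this is only a minimal sketch,
and the model interface, the loss names, and the strict alternation between the two datasets are assumptions:
\begin{verbatim}
def alternating_step(step, model, cityscapes_batch, kitti_batch):
    # Alternate between an instance segmentation step on Cityscapes and a
    # motion (and depth) step on KITTI 2015 (hypothetical model interface).
    if step % 2 == 0:
        out = model(cityscapes_batch, train_detection=True)
        # supervise RPN, classification, boxes and masks
        loss = out.rpn_loss + out.cls_loss + out.box_loss + out.mask_loss
    else:
        out = model(kitti_batch, train_detection=False)
        # detection runs in testing mode; penalize only motion and depth
        loss = out.motion_loss + out.depth_loss
    return loss
\end{verbatim}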
\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described a loss based on optical flow for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera
motion ground truth.
The 3D camera motion will already be indirectly supervised when it is used in the flow-based
RoI instance motion loss. Still, to exploit all information available from the
ground truth optical flow and to obtain more accurate supervision,
it would likely be beneficial to add a global, flow-based camera motion loss
independent of the RoI supervision.
To do this, one could use a re-projection loss conceptually similar to the one
for supervising instance motions with ground truth flow,
but computed over the full image instead of over individual RoIs. However, to account for the
fact that the camera motion can only be accurately supervised with flow at positions where
no object motion occurs, this loss would have to be masked with the ground truth
object masks. Again, we could use this flow-based loss in an unsupervised manner,
as long as ground truth instance masks are available.
For training on a single dataset without any motion ground truth, e.g.
Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.
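Schematically, the masked, full-image camera motion loss sketched above could take a form like
\[
\mathcal{L}_{cam} = \frac{1}{|\Omega_{bg}|} \sum_{\mathbf{x} \in \Omega_{bg}}
\left\| \pi\!\left(K \left( R_{cam}\, X_t(\mathbf{x}) + t_{cam} \right)\right) - \mathbf{x} - \mathbf{f}^{gt}_{t \rightarrow t+1}(\mathbf{x}) \right\|_1,
\]
where $\Omega_{bg}$ denotes the set of background pixels obtained by masking out all ground truth object masks,
$(R_{cam}, t_{cam})$ is the predicted camera motion, and the remaining notation follows the instance motion
sketch above. In the unsupervised case, the ground truth flow term would be replaced by a warping-based
photometric penalty. Again, this is only an illustration; the exact penalty function and weighting are
open design choices.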
\paragraph{Temporal consistency}
A next step after the aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of classical energy-minimization-based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
% \paragraph{Masking prior to the RoI motion head}
% Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
% the backbone are integrated over the complete RoI window to yield the features
% for motion estimation.
% For example, average pooling is applied before the fully-connected layers in the variant without FPN.
% However, ideally, the motion (image matching) information from the backbone should
%
% For example, consider
%
% Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
% extracted RoI features before passing them into the motion head.
% The intuition behind that is that we want to mask out (set to zero) any positions in the
% extracted feature window which belong to the background. Then, the RoI motion
% head could aggregate the motion (image matching) information from the backbone
% over positions localized within the object only, but not over positions belonging
% to the background, which should probably not influence the final object motion estimate.