\subsection{Summary}

We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation within the framework of region-based convolutional networks,
given two consecutive frames from a monocular camera as input.
In addition to the instance motions, our network estimates the 3D motion of the camera.
We combine all of these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
all required ground truth data, and is evaluated on the same domain.
During inference, our model adds no significant computational overhead
over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore just as fast
and equally suitable for real-time scenarios.
We thus presented a step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, and in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may also be beneficial for safety-critical
applications.
\subsection{Future Work}

\paragraph{Evaluation and finetuning on KITTI 2015}
Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
on which we do not train, but we have yet to evaluate it on a real-world dataset.
The best candidate for evaluating our complete model is the KITTI 2015 dataset \cite{KITTI2015},
which provides depth ground truth to compose an optical flow field from our 3D motion estimates,
and optical flow ground truth to evaluate the composed flow field.
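As a schematic illustration of this composition step (in simplified notation that may deviate
from the exact definitions given earlier in this thesis), a pixel $\mathbf{p}$ with depth
$d(\mathbf{p})$ is back-projected with the camera intrinsics $K$, transformed by the predicted
motion of the instance it belongs to (or left unchanged for the background) and by the predicted
camera motion, and re-projected into the second frame:
\[
\mathbf{X} = d(\mathbf{p}) \, K^{-1} \tilde{\mathbf{p}}, \qquad
\mathbf{w}(\mathbf{p}) = \pi\!\left( K \left( R_{\mathrm{cam}} \left( R_k (\mathbf{X} - \mathbf{c}_k) + \mathbf{c}_k + \mathbf{t}_k \right) + \mathbf{t}_{\mathrm{cam}} \right) \right) - \mathbf{p},
\]
where $\tilde{\mathbf{p}}$ denotes homogeneous pixel coordinates, $\pi$ the perspective projection,
$\{R_k, \mathbf{t}_k, \mathbf{c}_k\}$ the predicted motion and pivot of the instance containing
$\mathbf{p}$, and $\{R_{\mathrm{cam}}, \mathbf{t}_{\mathrm{cam}}\}$ the predicted camera motion.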
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.

As KITTI 2015 also provides object masks for moving objects, we could in principle
fine-tune on KITTI 2015 train alone. However, as long as we cannot evaluate our method on the
KITTI 2015 test set, this makes little sense.
\paragraph{Predicting depth}
In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
However, in many application settings, we are not provided with any depth information.
In most cases, we want to work with raw RGB sequences from one or multiple simple cameras,
from which no depth data is directly available.
To handle this setting, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN (Table \ref{table:motionrcnn_resnet_fpn_depth}).
Alternatively, we could add a specialized network for end-to-end depth regression
in parallel to the region-based network (or before it, to provide XYZ input to the R-CNN), e.g. \cite{GCNet}.
Although single-frame monocular depth prediction with deep networks has already been demonstrated
with some success,
our two-frame input should allow the network to make use of epipolar
geometry for a more reliable depth estimate, at least when the camera
is moving. We could also easily extend our method to stereo input data by concatenating
all of the frames into the input image.
For the stereo case, if we chose to integrate depth prediction directly into
the R-CNN,
this would require a different dataset for training it, as Virtual KITTI does not
provide stereo images.
If we used a specialized depth network instead, we could use stereo data
for depth prediction and still train the R-CNN independently on the monocular Virtual KITTI,
though we would lose the ability to easily train the system in an end-to-end manner.

As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test set,
and also fine-tune on the training set as mentioned in the previous paragraph.
{
\begin{table}[h]
\centering
\begin{tabular}{llr}
\toprule
\textbf{Layer ID} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
 & input image & H $\times$ W $\times$ C \\
\midrule
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Depth Network}}\\
\midrule
 & From P$_2$: 3 $\times$ 3 conv, 1024 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1024 \\
 & 1 $\times$ 1 conv, 1 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1 \\
 & $\times$ 4 bilinear upsample & H $\times$ W $\times$ 1 \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Motions} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\bottomrule
\end{tabular}

\caption {
A possible Motion R-CNN ResNet-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
\end{table}
}
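As an illustration, a minimal sketch of the depth branch from Table \ref{table:motionrcnn_resnet_fpn_depth}
could look as follows in a TensorFlow/Keras-style API. This is not part of our implementation;
the tensor name \texttt{p2} and the feature shapes are only illustrative.
\begin{verbatim}
import tensorflow as tf

def depth_head(p2):
    # p2: FPN feature map at 1/4 input resolution, e.g. [N, H/4, W/4, 256]
    x = tf.keras.layers.Conv2D(1024, 3, padding='same',
                               activation='relu')(p2)
    x = tf.keras.layers.Conv2D(1, 1)(x)  # one depth value per position
    # bilinearly upsample back to the full input resolution
    return tf.keras.layers.UpSampling2D(size=4,
                                        interpolation='bilinear')(x)
\end{verbatim}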
\paragraph{Training on real world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset,
ideally without having to rely on synthetic data at all.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could alternate
training steps on, for example, Cityscapes and the KITTI 2015 stereo and optical flow datasets.
On KITTI 2015 stereo and flow, we could run the instance segmentation components in testing mode and only penalize
the motion losses (and depth prediction, if added), as no complete instance segmentation ground truth exists.
On Cityscapes, we could continue training the instance segmentation components to
improve detection and masks and to avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised, warping-based proxy losses for the motion (and depth)
predictions. Unsupervised deep learning of this kind has already been applied with some success in the optical flow
setting \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.
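As a sketch of such a proxy loss (again in simplified notation), the composed flow $\mathbf{w}$
could be supervised by a robust photometric error between the first frame and the warped second
frame,
\[
L_{\mathrm{photo}} = \sum_{\mathbf{p}} \rho\!\left( I_t(\mathbf{p}) - I_{t+1}\!\left(\mathbf{p} + \mathbf{w}(\mathbf{p})\right) \right),
\]
where $\rho$ is a robust penalty such as the Charbonnier function and $I_{t+1}$ is sampled with
bilinear interpolation, analogously to \cite{UnsupFlownet, UnFlow}.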
\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described an optical flow based loss for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera
motion ground truth.
The 3D camera motion will already be indirectly supervised when it is used in the flow-based
RoI instance motion loss. Still, to use all available information from
the ground truth optical flow and obtain more accurate supervision,
it would likely be beneficial to add a global, flow-based camera motion loss
independent of the RoI supervision.
To do this, one could use a re-projection loss conceptually identical to the one
for supervising instance motions with ground truth flow. However, to account for the
fact that the camera motion can only be accurately supervised with flow at positions where
no object motion occurs, this loss would have to be masked with the ground truth
object masks. Again, we could use this flow-based loss in an unsupervised way.
For training on a dataset without any motion ground truth, e.g.
Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.
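A sketch of such a loss (with simplified notation) could penalize, at all pixels not covered by
any ground truth object mask, the difference between the ground truth flow and the flow induced
by the predicted camera motion alone:
\[
L_{\mathrm{cam}} = \frac{1}{\sum_{\mathbf{p}} \left(1 - m(\mathbf{p})\right)}
\sum_{\mathbf{p}} \left(1 - m(\mathbf{p})\right)
\left\| \mathbf{w}_{\mathrm{cam}}(\mathbf{p}) - \mathbf{w}_{\mathrm{gt}}(\mathbf{p}) \right\|,
\]
where $m(\mathbf{p}) \in \{0, 1\}$ indicates whether $\mathbf{p}$ belongs to any ground truth
object mask and $\mathbf{w}_{\mathrm{cam}}$ is computed from depth and the predicted camera motion
as in the composition sketched above. In the unsupervised variant, the flow error would be
replaced by a photometric term as sketched before.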
\paragraph{Temporal consistency}
A next step after the two aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of energy-minimization based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
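As a purely illustrative sketch (not part of our implementation), recurrence over per-frame
backbone features could be added with a convolutional LSTM in a TensorFlow/Keras-style API;
the feature shape below is only an example.
\begin{verbatim}
import tensorflow as tf

# per-frame backbone features: [batch, time, height, width, channels]
frame_features = tf.keras.Input(shape=(None, 48, 156, 256))
temporal_features = tf.keras.layers.ConvLSTM2D(
    filters=256, kernel_size=3, padding='same',
    return_sequences=True)(frame_features)
\end{verbatim}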
\paragraph{Masking prior to the RoI motion head}
Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
the backbone are integrated over the complete RoI window to yield the features
for motion estimation.
For example, average pooling is applied before the fully-connected layers in the variant without FPN.
Ideally, however, the motion (image matching) information from the backbone should
only be aggregated over positions that belong to the object itself.
To this end, we could use the \emph{predicted} binarized masks for each RoI to mask the
extracted RoI features before passing them into the motion head.
The intuition is that we want to mask out (set to zero) any positions in the
extracted feature window which belong to the background. The RoI motion
head would then aggregate the motion (image matching) information from the backbone
only over positions localized within the object, and not over positions belonging
to the background, which should probably not influence the final object motion estimate.
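A minimal sketch of this masking step (not our implementation; the shapes and the
TensorFlow-style API are only assumptions) could look as follows.
\begin{verbatim}
import tensorflow as tf

def masked_motion_pooling(roi_features, masks):
    # roi_features: [num_rois, h, w, c] features extracted per RoI
    # masks:        [num_rois, h, w, 1] predicted masks, binarized to {0, 1}
    masked = roi_features * masks                    # zero out background
    area = tf.reduce_sum(masks, axis=[1, 2]) + 1e-6  # avoid division by zero
    # average only over foreground positions instead of the full RoI window
    return tf.reduce_sum(masked, axis=[1, 2]) / area
\end{verbatim}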