\subsection{Summary}
We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames (and XYZ point coordinates) from a monocular camera.
In addition to instance motions, our network estimates the 3D ego-motion of the camera.
We combine all of these estimates to obtain a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with bounding box, instance mask, depth, and 3D motion ground truth,
and evaluated on a validation set created from Virtual KITTI.
During inference, our model does not add any significant computational overhead
over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore
just as fast and equally well suited to real-time scenarios.

Although our system produces reasonable first instance motion predictions,
estimates the camera ego-motion reasonably well,
and achieves high accuracy in classifying objects as moving or non-moving,
the accuracy of the motion predictions is not yet convincing.
More work will thus be required to bring the system closer to competitive accuracy,
which includes penalizing the instance motions with the optical flow loss instead of
with 3D motion ground truth,
experimenting with the weighting between the different loss terms,
and improving the network architecture, loss design, and training process.

We thus presented a partial step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may also bring benefits for safety-critical
applications.
\subsection{Future Work}
\paragraph{Mask R-CNN baseline}
As our Mask R-CNN re-implementation is still not as accurate as reported in the
original paper \cite{MaskRCNN}, working on the implementation details of this baseline would be
a critical and direct next step. Recently, a highly accurate third-party implementation of Mask
R-CNN in TensorFlow was released, which should be studied to this end.
\paragraph{Instance motion supervision with the optical flow re-projection loss}
We developed and implemented a loss for penalizing instance motions with optical flow ground truth,
but could not yet train a network with it due to time constraints. Conducting experiments
with this loss will be another immediate next step.
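To recall the general idea, such a re-projection loss can be sketched as follows; the notation is
illustrative and the exact formulation may differ from our implementation. Let $T_{cam}$ denote the
predicted camera motion and $T_k$ the predicted rigid motion of the instance matched to the RoI,
let $P(p)$ be the 3D point at pixel $p$ (obtained from the XYZ input), and let $K$ be the camera
intrinsics and $\pi$ the perspective projection. The loss then compares the optical flow induced by
the predicted motions with the ground truth flow $f_{gt}$ over the pixels $\mathcal{M}_k$ belonging
to the instance,
\[
\ell_{flow,k} = \frac{1}{|\mathcal{M}_k|} \sum_{p \in \mathcal{M}_k}
\rho\Big( \pi\big(K \, T_{cam}(T_k(P(p)))\big) - p - f_{gt}(p) \Big),
\]
where $\rho$ is a (possibly robust) penalty function.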
\paragraph{Training on all Virtual KITTI sequences}
We only trained our models on the \emph{clone} variants of the Virtual KITTI sequences
to make training faster.
In the future, it would be interesting to train on all variants, as the different
lighting conditions and viewing angles should lead to a more general model.
\paragraph{Evaluation and finetuning on KITTI 2015}
Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
on which we do not train, but we have yet to evaluate on a real-world dataset.
The best candidate for evaluating our complete model is the KITTI 2015 dataset \cite{KITTI2015},
which provides depth ground truth to compose an optical flow field from our 3D motion estimates,
and optical flow ground truth to evaluate the composed flow field.
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.

As KITTI 2015 also provides instance masks for moving objects, we could in principle
fine-tune on the KITTI 2015 train set alone. However, as long as we cannot evaluate our method
on the KITTI 2015 test set, this makes little sense.
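For reference, composing a dense optical flow field from a depth map and the estimated 3D motions
can be sketched as follows. This is a minimal NumPy sketch under simplifying assumptions (pinhole
intrinsics, one rigid motion per instance applied about the camera origin, no occlusion handling);
the function and variable names are placeholders rather than our actual evaluation code.
\begin{verbatim}
import numpy as np

def compose_flow(depth, K, R_cam, t_cam, instance_masks, instance_motions):
    """Sketch: dense optical flow from depth and estimated 3D motions.

    depth:            (H, W) depth map (ground truth or predicted)
    K:                (3, 3) pinhole camera intrinsics
    R_cam, t_cam:     camera rotation (3, 3) and translation (3,)
    instance_masks:   list of (H, W) boolean instance masks
    instance_motions: list of rigid motions (R_k (3, 3), t_k (3,)), one per instance
    """
    H, W = depth.shape
    # back-project every pixel to a 3D point P = Z * K^-1 * (x, y, 1)
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)
    P = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

    # apply the estimated per-instance rigid motions inside the instance masks
    P_moved = P.copy()
    for mask, (R_k, t_k) in zip(instance_masks, instance_motions):
        idx = mask.reshape(-1)
        P_moved[idx] = P[idx] @ R_k.T + t_k

    # apply the estimated camera motion to all points and re-project
    P_cam = P_moved @ R_cam.T + t_cam
    proj = P_cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]

    # optical flow: projected position in the next frame minus the pixel position
    return (proj - pix[:, :2]).reshape(H, W, 2)
\end{verbatim}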
\paragraph{Predicting depth}
In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
However, in many application settings, we are not provided with any depth information.
In most cases, we instead want to work with raw RGB sequences from one or multiple standard cameras,
for which no depth data is available.
To handle this case, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN (Table~\ref{table:motionrcnn_resnet_fpn_depth}).
Alternatively, we could add a specialized network for end-to-end depth regression
in parallel to the region-based network (or before it, to provide XYZ input to Motion R-CNN), e.g. \cite{GCNet}.
Although single-frame monocular depth prediction with deep networks has already been demonstrated
with some success,
our two-frame input should allow the network to make use of epipolar
geometry for a more reliable depth estimate, at least when the camera
is moving. We could also easily extend our method to stereo input data by concatenating
all of the frames into the network input.
If we integrated the depth prediction directly into the current backbone,
the stereo setting would however require a different dataset for training Motion R-CNN,
as Virtual KITTI does not provide stereo images.
If we used a specialized depth network, we could use stereo data
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset,
though we would lose the ability to easily train the system in an end-to-end manner.

As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test set,
and also fine-tune on the training set as mentioned in the previous paragraph.
{
\begin{table}[h]
\centering
\begin{tabular}{llr}
\toprule
\textbf{Layer ID} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\
\midrule
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Depth Network}}\\
\midrule
& From P$_2$: 3 $\times$ 3 conv, 1024 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1024 \\
& 1 $\times$ 1 conv, 1 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 1 \\
& $\times$ 4 bilinear upsample & H $\times$ W $\times$ 1 \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Motions} (Table \ref{table:motionrcnn_resnet_fpn})}\\
\bottomrule
\end{tabular}
\caption{
A possible Motion R-CNN ResNet-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
\end{table}
}
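To make the depth branch in Table~\ref{table:motionrcnn_resnet_fpn_depth} more concrete, the
following minimal sketch shows how it could be implemented on top of the FPN feature map P$_2$,
using the TensorFlow \texttt{tf.layers} API; the layer names and the ReLU activation are
illustrative assumptions rather than part of our current implementation.
\begin{verbatim}
import tensorflow as tf

def depth_head(p2, image_height, image_width):
    """Sketch of the depth branch: P_2 (1/4 H x 1/4 W) -> dense depth (H x W x 1)."""
    # 3x3 convolution with 1024 channels, keeping the 1/4 resolution
    x = tf.layers.conv2d(p2, 1024, 3, padding='same', activation=tf.nn.relu,
                         name='depth_conv3x3')
    # 1x1 convolution down to a single depth channel
    x = tf.layers.conv2d(x, 1, 1, padding='same', name='depth_conv1x1')
    # bilinear upsampling back to the full input resolution
    return tf.image.resize_bilinear(x, (image_height, image_width))
\end{verbatim}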
\paragraph{Training on real-world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset,
ideally without having to rely on synthetic data at all.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could execute alternating
training steps on, for example, Cityscapes and the KITTI 2015 stereo and optical flow datasets.
On KITTI 2015 stereo and flow, we could run the instance segmentation components in testing mode and only penalize
the motion losses (and depth prediction, if added), as no complete instance segmentation ground truth exists.
On Cityscapes, we could continue training the instance segmentation components to
improve detection and masks and avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation stereo dataset with unsupervised warping-based proxy losses for the motion (and depth)
predictions. Unsupervised deep learning of this kind has already been applied with some success in the
optical flow setting \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.
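A minimal sketch of such an alternating schedule is given below; the batch iterators, the loss
callables, and the choice to simply alternate every step are placeholder assumptions rather than a
worked-out training procedure.
\begin{verbatim}
def alternating_training(num_steps, cityscapes_batches, kitti_batches,
                         segmentation_loss, motion_loss, optimizer_step):
    """Alternate instance segmentation supervision (Cityscapes) with motion
    supervision (KITTI 2015 stereo/flow). All arguments are placeholders:
    the iterators yield training batches, the loss callables return scalar
    losses, and optimizer_step applies one gradient update."""
    for step in range(num_steps):
        if step % 2 == 0:
            # Cityscapes step: train the RPN, detection, and mask branches
            loss = segmentation_loss(next(cityscapes_batches))
        else:
            # KITTI step: run detection and masks in testing mode and only
            # penalize the motion (and optional depth) losses
            loss = motion_loss(next(kitti_batches))
        optimizer_step(loss)
\end{verbatim}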
\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described a loss based on optical flow for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera
motion ground truth.
The 3D camera motion will already be supervised indirectly when it is used in the flow-based
RoI instance motion loss. Still, to use all available information from
the ground truth optical flow and obtain more accurate supervision,
it would likely be beneficial to add a global, flow-based camera motion loss
independent of the RoI supervision.
To do this, one could use a re-projection loss conceptually similar to the one
for supervising instance motions with ground truth flow,
but computed on the full image instead of on individual RoIs. However, to account for the
fact that the camera motion can only be accurately supervised with flow at positions where
no object motion occurs, this loss would have to be masked with the ground truth
object masks. Again, we could use this flow-based loss in an unsupervised way,
as long as ground truth instance masks are available.
For training on a single dataset without any motion ground truth, e.g.
Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.
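For illustration, such a masked, full-image loss could take the form
\[
\ell_{flow,cam} = \frac{1}{|\Omega_{bg}|} \sum_{p \in \Omega_{bg}}
\rho\Big( \pi\big(K \, T_{cam}(P(p))\big) - p - f_{gt}(p) \Big),
\]
where $\Omega_{bg}$ is the set of pixels not covered by any ground truth object mask, and the
remaining notation is as in the instance-level re-projection loss sketched above. In this way,
only regions whose apparent motion is explained by the camera motion alone contribute to the loss.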
\paragraph{Temporal consistency}
A next step after the aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of classical energy-minimization-based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
% \paragraph{Masking prior to the RoI motion head}
% Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
% the backbone are integrated over the complete RoI window to yield the features
% for motion estimation.
% For example, average pooling is applied before the fully-connected layers in the variant without FPN.
% However, ideally, the motion (image matching) information from the backbone should
%
% For example, consider
%
% Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
% extracted RoI features before passing them into the motion head.
% The intuition behind that is that we want to mask out (set to zero) any positions in the
% extracted feature window which belong to the background. Then, the RoI motion
% head could aggregate the motion (image matching) information from the backbone
% over positions localized within the object only, but not over positions belonging
% to the background, which should probably not influence the final object motion estimate.