This commit is contained in:
Simon Meister 2017-10-24 15:04:05 +02:00
parent bc34ca9fe5
commit ba500a8aaa
3 changed files with 36 additions and 3 deletions


@ -96,6 +96,7 @@ high resolution instance masks within the bounding boxes of each detected object
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition, Mask R-CNN
Figure \ref{} compares the two Mask R-CNN network variants.
\paragraph{Supervision of the RPN}
\paragraph{Supervision of the RoI head}


@ -1,3 +1,15 @@
We have introduced an extension of region-based convolutional networks that enables object motion estimation
in parallel with instance segmentation.
\subsection{Future Work}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset.
A next step would be training on real-world data.
For example, we could first pre-train the RPN on an object detection dataset such as
Cityscapes. Once the RPN works reliably, we could then alternate
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses and depth prediction, as no instance segmentation ground truth exists. % TODO depth prediction ?!
On Cityscapes, we could continue training the full Mask R-CNN instance segmentation pipeline to
improve detections and masks and to avoid catastrophic forgetting.
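The proposed alternating scheme could be sketched as follows. This is a minimal illustration only; the dataset names and the per-dataset loss toggles are hypothetical assumptions, not part of any released Motion R-CNN code:

```python
# Hypothetical sketch of the proposed alternating training scheme.
# Dataset identifiers and loss flags are illustrative assumptions.

def select_losses(dataset):
    """Return which loss terms to penalize for a batch from the given dataset."""
    if dataset == "cityscapes":
        # Full Mask R-CNN supervision: boxes, classes, and masks,
        # but no motion ground truth.
        return {"detection": True, "mask": True, "motion": False}
    elif dataset == "kitti_stereo_flow":
        # No instance segmentation ground truth: run detection in
        # testing mode and penalize only the motion (and depth) losses.
        return {"detection": False, "mask": False, "motion": True}
    raise ValueError(f"unknown dataset: {dataset}")

def alternating_schedule(num_steps):
    """Alternate between the two datasets from one step to the next."""
    datasets = ["cityscapes", "kitti_stereo_flow"]
    return [datasets[i % 2] for i in range(num_steps)]
```

With this kind of toggle, gradients from KITTI batches would never reach the mask head, while Cityscapes batches keep the detection and mask branches from degrading.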


@ -1,8 +1,28 @@
\subsection{Datasets}
\paragraph{Virtual KITTI}
The synthetic Virtual KITTI dataset is a re-creation of the KITTI driving scenario,
rendered from virtual 3D street scenes.
The dataset is made up of a total of 2126 frames from five different monocular sequences recorded from a camera mounted on
a virtual car.
Each sequence is rendered with varying lighting and weather conditions and from different viewing angles, resulting
in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given a dense depth and optical flow map,
2D and 3D object bounding boxes, instance masks and 3D poses of all cars and vans in the scene,
the camera extrinsics matrix, and various other labels.
This makes the Virtual KITTI dataset ideally suited for developing our joint instance segmentation
and motion estimation system, as it allows us to test individual components in isolation and to
progress incrementally toward the complete prediction task.
\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule.
We train on a single Titan X (Pascal) for a total of 192K iterations.
We use a learning rate of $0.25 \cdot 10^{-2}$ for the first 144K iterations and $0.25 \cdot 10^{-3}$
for all remaining iterations.
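The schedule above is a simple piecewise-constant step decay, which could be written as a function of the iteration count (a sketch; only the two values and the 144K boundary come from the text):

```python
def learning_rate(step):
    """Piecewise-constant schedule: 0.25e-2 for the first 144K
    iterations, then 0.25e-3 until the end of training at 192K."""
    return 0.25e-2 if step < 144_000 else 0.25e-3
```

Equivalently, the rate is decayed by a factor of 10 once, after three quarters of the 192K total iterations.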
\subsection{TODO}
\subsection{Experiments on Virtual KITTI}
\subsection{Results on KITTI}
\subsection{Evaluation on KITTI 2015}