This commit is contained in:
Simon Meister 2017-10-24 15:04:05 +02:00
parent bc34ca9fe5
commit ba500a8aaa
3 changed files with 36 additions and 3 deletions


@ -96,6 +96,7 @@ high resolution instance masks within the bounding boxes of each detected object
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition, Mask R-CNN
Figure \ref{} compares the two Mask R-CNN network variants.
\paragraph{Supervision of the RPN}
\paragraph{Supervision of the RoI head}


@ -1,3 +1,15 @@
We have introduced an extension of region-based convolutional networks that enables object motion estimation
in parallel with instance segmentation.
\subsection{Future Work}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset.
A next step would be training on real-world data.
For example, we could first pre-train the RPN on an object detection dataset such as
Cityscapes. Once the RPN works reliably, we could then alternate
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses and depth prediction, as no instance segmentation ground truth exists. % TODO depth prediction ?!
On Cityscapes, we could continue training the full Mask R-CNN instance segmentation pipeline to
improve detections and masks and to avoid catastrophic forgetting.
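The proposed alternating scheme could be sketched as follows. This is a minimal illustration only; the dataset names and the per-dataset loss toggles are hypothetical assumptions, not part of any released Motion R-CNN code:

```python
# Hypothetical sketch of the proposed alternating training scheme.
# Dataset identifiers and loss flags are illustrative assumptions.

def select_losses(dataset):
    """Return which loss terms to penalize for a batch from the given dataset."""
    if dataset == "cityscapes":
        # Full Mask R-CNN supervision: boxes, classes, and masks,
        # but no motion ground truth.
        return {"detection": True, "mask": True, "motion": False}
    elif dataset == "kitti_stereo_flow":
        # No instance segmentation ground truth: run detection in
        # testing mode and penalize only the motion (and depth) losses.
        return {"detection": False, "mask": False, "motion": True}
    raise ValueError(f"unknown dataset: {dataset}")

def alternating_schedule(num_steps):
    """Alternate between the two datasets from one step to the next."""
    datasets = ["cityscapes", "kitti_stereo_flow"]
    return [datasets[i % 2] for i in range(num_steps)]
```

With this kind of toggle, gradients from KITTI batches would never reach the mask head, while Cityscapes batches keep the detection and mask branches from degrading.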


@ -1,8 +1,28 @@
\subsection{Datasets}
\paragraph{Virtual KITTI}
The synthetic Virtual KITTI dataset is a re-creation of the KITTI driving scenario,
rendered from virtual 3D street scenes.
The dataset is made up of a total of 2126 frames from five different monocular sequences recorded from a camera mounted on
a virtual car.
Each sequence is rendered with varying lighting and weather conditions and from different viewing angles, resulting
in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given a dense depth and optical flow map,
2D and 3D object bounding boxes, instance masks and 3D poses of all cars and vans in the scene,
the camera extrinsics matrix, and various other labels.
This makes the Virtual KITTI dataset ideally suited for developing our joint instance segmentation
and motion estimation system, as it allows us to test individual components in isolation and to
progress incrementally toward the complete prediction task.
\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule.
We train on a single Titan X (Pascal) for a total of 192K iterations.
We use a learning rate of $0.25 \cdot 10^{-2}$ for the first 144K iterations and $0.25 \cdot 10^{-3}$
for all remaining iterations.
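The schedule above is a simple piecewise-constant step decay, which could be written as a function of the iteration count (a sketch; only the two values and the 144K boundary come from the text):

```python
def learning_rate(step):
    """Piecewise-constant schedule: 0.25e-2 for the first 144K
    iterations, then 0.25e-3 until the end of training at 192K."""
    return 0.25e-2 if step < 144_000 else 0.25e-3
```

Equivalently, the rate is decayed by a factor of 10 once, after three quarters of the 192K total iterations.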
\subsection{TODO}
\subsection{Experiments on Virtual KITTI}
\subsection{Results on KITTI}
\subsection{Evaluation on KITTI 2015}