WIP
This commit is contained in:
parent bc34ca9fe5
commit ba500a8aaa
@ -96,6 +96,7 @@ high resolution instance masks within the bounding boxes of each detected object
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition, Mask R-CNN replaces the RoIPool operation of Faster R-CNN with RoIAlign, which avoids coordinate quantization and preserves the pixel-accurate spatial correspondence needed for mask prediction.
Figure \ref{} compares the two Mask R-CNN network variants.
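As an illustrative sketch only (assuming a PyTorch-style implementation, which is not necessarily the framework used in this thesis), such a mask head built from a small stack of convolutions could look as follows:
\begin{verbatim}
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Minimal Mask R-CNN-style mask head: a few 3x3 convolutions
    followed by upsampling and per-class mask logits."""
    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        layers = []
        for _ in range(4):
            layers += [nn.Conv2d(in_channels, 256, 3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.mask_logits = nn.Conv2d(256, num_classes, 1)

    def forward(self, roi_features):
        # roi_features: (num_rois, C, 14, 14) cropped from the feature map
        x = self.convs(roi_features)
        x = nn.functional.relu(self.upsample(x))  # -> (num_rois, 256, 28, 28)
        return self.mask_logits(x)                # one mask per class
\end{verbatim}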
\paragraph{Supervision of the RPN}
\paragraph{Supervision of the RoI head}
@ -1,3 +1,15 @@
We have introduced an extension on top of region-based convolutional networks to enable object motion estimation
in parallel to instance segmentation.
\subsection{Future Work}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset.
A next step would be training on real-world data.
For example, we could first pre-train the RPN on an object detection dataset like
Cityscapes. As soon as the RPN works reliably, we could then alternate between
training steps on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses and depth prediction, as no instance segmentation ground truth exists. % TODO depth prediction ?!
On Cityscapes, we could continue training the full instance segmentation Mask R-CNN to
improve detection and masks and avoid any forgetting effects.
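A hedged sketch of this alternating schedule is given below; the loss names and the assumption that the model returns a dictionary of named losses are hypothetical and only illustrate how the supervision could be switched per dataset.
\begin{verbatim}
def alternating_training(model, optimizer, cityscapes_loader,
                         kitti_loader, num_steps):
    # Hypothetical sketch: model(batch) is assumed to return a dict of
    # named losses; this is not the interface of our actual implementation.
    for step in range(num_steps):
        if step % 2 == 0:
            # Cityscapes step: full instance segmentation supervision.
            losses = model(next(cityscapes_loader))
            loss = losses["rpn"] + losses["detection"] + losses["mask"]
        else:
            # KITTI stereo/flow step: no instance mask ground truth,
            # so only the motion (and depth) losses are penalized.
            losses = model(next(kitti_loader))
            loss = losses["motion"] + losses["depth"]
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
\end{verbatim}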
@ -1,8 +1,28 @@
\subsection{Datasets}
\paragraph{Virtual KITTI}
The synthetic Virtual KITTI dataset is a re-creation of the KITTI driving scenario,
rendered from virtual 3D street scenes.
The dataset is made up of a total of 2126 frames from five different monocular sequences recorded from a camera mounted on
a virtual car.
Each sequence is rendered with varying lighting and weather conditions and from different viewing angles, resulting
in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given dense depth and optical flow maps,
2D and 3D object bounding boxes, instance masks and 3D poses of all cars and vans in the scene,
the camera extrinsics matrix, and various other labels.
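To make the available supervision concrete, a single Virtual KITTI training example could be organized roughly as follows (field names and shapes are illustrative assumptions, not the actual data loader of this work):
\begin{verbatim}
from dataclasses import dataclass
import numpy as np

@dataclass
class VKittiFrame:
    image: np.ndarray       # (H, W, 3) RGB frame
    depth: np.ndarray       # (H, W) dense depth map
    flow: np.ndarray        # (H, W, 2) optical flow to the next frame
    boxes_2d: np.ndarray    # (N, 4) 2D bounding boxes
    boxes_3d: np.ndarray    # (N, 7) 3D bounding boxes
    masks: np.ndarray       # (N, H, W) binary instance masks
    poses: np.ndarray       # (N, 4, 4) 3D poses of the cars and vans
    extrinsics: np.ndarray  # (4, 4) camera extrinsics matrix
\end{verbatim}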
This makes the Virtual KITTI dataset ideally suited for developing our joint instance segmentation
and motion estimation system, as it allows us to test different components in isolation and
progress towards increasingly complete predictions.
\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule.
We train on a single Titan X (Pascal) for a total of 192K iterations.
We use a learning rate of $0.25 \cdot 10^{-2}$ for the first 144K iterations and $0.25 \cdot 10^{-3}$
for all remaining iterations.
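The resulting piecewise-constant schedule can be summarized by the following small helper (a sketch of the schedule stated above, not our training code):
\begin{verbatim}
def learning_rate(step):
    # 0.25e-2 for the first 144K iterations, 0.25e-3 for the
    # remaining 48K iterations (192K iterations in total).
    return 0.25e-2 if step < 144000 else 0.25e-3
\end{verbatim}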
\subsection{TODO}
\subsection{Experiments on Virtual KITTI}
\subsection{Results on KITTI}
\subsection{Evaluation on KITTI 2015}