\subsection{Datasets}

\paragraph{Virtual KITTI}
The synthetic Virtual KITTI dataset is a re-creation of the KITTI driving scenario,
rendered from virtual 3D street scenes.
The dataset consists of a total of 2126 frames from five different monocular sequences recorded from a camera
mounted on a virtual car.
Each sequence is rendered with varying lighting and weather conditions and from different viewing angles,
resulting in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given a dense depth map, a dense optical flow map, and the camera extrinsics matrix.
For all cars and vans in each frame, we are given 2D and 3D object bounding boxes, instance masks, 3D poses,
and various other labels.

This makes the Virtual KITTI dataset ideally suited for developing our joint instance segmentation
and motion estimation system, as it allows us to test individual components in isolation and
to progress toward increasingly complete predictions, up to supervising the full system on a single dataset.

\paragraph{Motion ground truth from 3D poses and camera extrinsics}
For two consecutive frames $I_t$ and $I_{t+1}$, let $[R_t^{cam}|t_t^{cam}]$ and $[R_{t+1}^{cam}|t_{t+1}^{cam}]$ be
the camera extrinsics at the two frames.
We compute the ground truth camera motion $\{R_t^{gt, cam}, t_t^{gt, cam}\} \in SE3$ as
\begin{equation}
R_{t}^{gt, cam} = R_{t+1}^{cam} \cdot inv(R_t^{cam}),
\end{equation}
\begin{equation}
t_{t}^{gt, cam} = t_{t+1}^{cam} - R_{t}^{gt, cam} \cdot t_t^{cam}.
\end{equation}
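
To make the computation concrete, the following NumPy sketch computes the camera motion ground truth
from the two extrinsics; the function name and the representation of the extrinsics as a rotation
matrix and a translation vector are our own illustrative choices, not part of the Virtual KITTI toolkit.
\begin{verbatim}
import numpy as np

def camera_motion_gt(R_cam_t, t_cam_t, R_cam_t1, t_cam_t1):
    """Ground truth camera motion between frames t and t+1.

    Takes the extrinsics [R|t] of both frames and returns
    (R_gt_cam, t_gt_cam) such that a static world point x given in
    camera coordinates at t maps to R_gt_cam @ x + t_gt_cam at t+1.
    """
    # R_t^{gt,cam} = R_{t+1}^{cam} * inv(R_t^{cam});
    # for rotation matrices, the inverse is the transpose.
    R_gt_cam = R_cam_t1 @ R_cam_t.T
    # t_t^{gt,cam} = t_{t+1}^{cam} - R_t^{gt,cam} * t_t^{cam}
    t_gt_cam = t_cam_t1 - R_gt_cam @ t_cam_t
    return R_gt_cam, t_gt_cam
\end{verbatim}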

For any object $k$ visible in both frames, let
$(R_t^k, t_t^k)$ and $(R_{t+1}^k, t_{t+1}^k)$ be its orientation and position in camera space
at $I_t$ and $I_{t+1}$. Note that the pose at $t$ is given with respect to the camera at $t$ and
the pose at $t+1$ is given with respect to the camera at $t+1$.
We define the ground truth pivot as
\begin{equation}
p_{t}^{gt, k} = t_t^k
\end{equation}
and compute the ground truth object motion $\{R_t^{gt, k}, t_t^{gt, k}\} \in SE3$ as
\begin{equation}
R_{t}^{gt, k} = inv(R_{t}^{gt, cam}) \cdot R_{t+1}^k \cdot inv(R_t^k),
\end{equation}
\begin{equation}
t_{t}^{gt, k} = inv(R_{t}^{gt, cam}) \cdot (t_{t+1}^k - t_{t}^{gt, cam}) - t_t^k,
\end{equation}
that is, the translation is the object position at $t+1$ mapped back into the camera frame at $t$,
minus the object position at $t$, so that the object motion is expressed relative to the camera at $t$.
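
Analogously, a minimal NumPy sketch of the object motion ground truth, building on the camera motion
computed above; again, the function name and argument layout are our own illustrative assumptions.
\begin{verbatim}
def object_motion_gt(R_gt_cam, t_gt_cam, R_obj_t, t_obj_t,
                     R_obj_t1, t_obj_t1):
    """Ground truth pivot and motion of one object between t and t+1.

    Object poses are given in the camera frame of their respective
    time step; the returned motion is expressed in the camera frame
    at t, as a rotation about the pivot plus a translation.
    """
    # p_t^{gt,k} = t_t^k: the object position at t serves as pivot.
    p_gt = t_obj_t
    # R_t^{gt,k} = inv(R_t^{gt,cam}) * R_{t+1}^k * inv(R_t^k)
    R_gt_obj = R_gt_cam.T @ R_obj_t1 @ R_obj_t.T
    # t_t^{gt,k}: object position at t+1 mapped back into the
    # camera frame at t, minus the object position at t.
    t_gt_obj = R_gt_cam.T @ (t_obj_t1 - t_gt_cam) - t_obj_t
    return p_gt, R_gt_obj, t_gt_obj
\end{verbatim}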

% TODO change notation in approach to remove t subscript from motion matrices and vectors!

\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule.
We train on a single Titan X (Pascal) for a total of 192K iterations on the Virtual KITTI dataset.
We use a learning rate of $0.25 \cdot 10^{-2}$ for the first 144K iterations and $0.25 \cdot 10^{-3}$
for all remaining iterations.
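
This step schedule can be summarized as follows; the helper below is an illustrative sketch,
not an excerpt from our training code.
\begin{verbatim}
def learning_rate(iteration):
    """Step schedule: 0.25e-2 up to 144K iterations, then 0.25e-3."""
    return 0.25e-2 if iteration < 144000 else 0.25e-3
\end{verbatim}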

\paragraph{R-CNN training parameters}
\subsection{Experiments on Virtual KITTI}

\subsection{Evaluation on KITTI 2015}