\subsection{Datasets}

\paragraph{Virtual KITTI}
The synthetic Virtual KITTI dataset is a re-creation of the KITTI driving scenario,
rendered from virtual 3D street scenes.
The dataset consists of a total of 2126 frames from five different monocular sequences recorded from a camera
mounted on a virtual car.
Each sequence is rendered with varying lighting and weather conditions and from different viewing angles,
resulting in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given a dense depth map, a dense optical flow map, and the camera extrinsics matrix.
For all cars and vans in each frame, we are given 2D and 3D object bounding boxes, instance masks, 3D poses,
and various other labels.

This makes the Virtual KITTI dataset ideally suited for developing our joint instance segmentation
and motion estimation system, as it allows us to test individual components in isolation and
to progress toward increasingly complete predictions, up to supervising the full system on a single dataset.

\paragraph{Motion ground truth from 3D poses and camera extrinsics}
For two consecutive frames $I_t$ and $I_{t+1}$, let $[R_t^{cam}|t_t^{cam}]$ and $[R_{t+1}^{cam}|t_{t+1}^{cam}]$ be
the camera extrinsics at the two frames.
We compute the ground truth camera motion $\{R_t^{gt, cam}, t_t^{gt, cam}\} \in SE3$ as
\begin{equation}
R_{t}^{gt, cam} = R_{t+1}^{cam} \cdot inv(R_t^{cam}),
\end{equation}
\begin{equation}
t_{t}^{gt, cam} = t_{t+1}^{cam} - R_{t}^{gt, cam} \cdot t_t^{cam}.
\end{equation}
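
To make the computation concrete, the following NumPy sketch computes the camera motion ground truth
from the two extrinsics; the function name and the representation of the extrinsics as a rotation
matrix and a translation vector are our own illustrative choices, not part of the Virtual KITTI toolkit.
\begin{verbatim}
import numpy as np

def camera_motion_gt(R_cam_t, t_cam_t, R_cam_t1, t_cam_t1):
    """Ground truth camera motion between frames t and t+1.

    Takes the extrinsics [R|t] of both frames and returns
    (R_gt_cam, t_gt_cam) such that a static world point x given in
    camera coordinates at t maps to R_gt_cam @ x + t_gt_cam at t+1.
    """
    # R_t^{gt,cam} = R_{t+1}^{cam} * inv(R_t^{cam});
    # for rotation matrices, the inverse is the transpose.
    R_gt_cam = R_cam_t1 @ R_cam_t.T
    # t_t^{gt,cam} = t_{t+1}^{cam} - R_t^{gt,cam} * t_t^{cam}
    t_gt_cam = t_cam_t1 - R_gt_cam @ t_cam_t
    return R_gt_cam, t_gt_cam
\end{verbatim}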

For any object $k$ visible in both frames, let
$(R_t^k, t_t^k)$ and $(R_{t+1}^k, t_{t+1}^k)$ be its orientation and position in camera space
at $I_t$ and $I_{t+1}$. Note that the pose at $t$ is given with respect to the camera at $t$ and
the pose at $t+1$ is given with respect to the camera at $t+1$.
We define the ground truth pivot as
\begin{equation}
p_{t}^{gt, k} = t_t^k
\end{equation}
and compute the ground truth object motion $\{R_t^{gt, k}, t_t^{gt, k}\} \in SE3$ as
\begin{equation}
R_{t}^{gt, k} = inv(R_{t}^{gt, cam}) \cdot R_{t+1}^k \cdot inv(R_t^k),
\end{equation}
\begin{equation}
t_{t}^{gt, k} = inv(R_{t}^{gt, cam}) \cdot (t_{t+1}^k - t_{t}^{gt, cam}) - t_t^k,
\end{equation}
that is, the translation is the object position at $t+1$ mapped back into the camera frame at $t$,
minus the object position at $t$, so that the object motion is expressed relative to the camera at $t$.
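
Analogously, a minimal NumPy sketch of the object motion ground truth, building on the camera motion
computed above; again, the function name and argument layout are our own illustrative assumptions.
\begin{verbatim}
def object_motion_gt(R_gt_cam, t_gt_cam, R_obj_t, t_obj_t,
                     R_obj_t1, t_obj_t1):
    """Ground truth pivot and motion of one object between t and t+1.

    Object poses are given in the camera frame of their respective
    time step; the returned motion is expressed in the camera frame
    at t, as a rotation about the pivot plus a translation.
    """
    # p_t^{gt,k} = t_t^k: the object position at t serves as pivot.
    p_gt = t_obj_t
    # R_t^{gt,k} = inv(R_t^{gt,cam}) * R_{t+1}^k * inv(R_t^k)
    R_gt_obj = R_gt_cam.T @ R_obj_t1 @ R_obj_t.T
    # t_t^{gt,k}: object position at t+1 mapped back into the
    # camera frame at t, minus the object position at t.
    t_gt_obj = R_gt_cam.T @ (t_obj_t1 - t_gt_cam) - t_obj_t
    return p_gt, R_gt_obj, t_gt_obj
\end{verbatim}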

% TODO change notation in approach to remove t subscript from motion matrices and vectors!

\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule.
We train on a single Titan X (Pascal) for a total of 192K iterations on the Virtual KITTI dataset.
We use a learning rate of $0.25 \cdot 10^{-2}$ for the first 144K iterations and $0.25 \cdot 10^{-3}$
for all remaining iterations.
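
This step schedule can be summarized as follows; the helper below is an illustrative sketch,
not an excerpt from our training code.
\begin{verbatim}
def learning_rate(iteration):
    """Step schedule: 0.25e-2 up to 144K iterations, then 0.25e-3."""
    return 0.25e-2 if iteration < 144000 else 0.25e-3
\end{verbatim}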

\paragraph{R-CNN training parameters}
\subsection{Experiments on Virtual KITTI}

\subsection{Evaluation on KITTI 2015}