\subsection{Implementation}
Our networks and loss functions are implemented using built-in TensorFlow \cite{TensorFlow}
functions, enabling us to use automatic differentiation for all gradient
computations. To keep our code flexible and easy to extend, we build on
the TensorFlow Object Detection API \cite{TensorFlowObjectDetection}, which provides a Faster R-CNN baseline
implementation.
On top of this, we implemented Mask R-CNN and the Feature Pyramid Network (FPN)
as well as the Motion R-CNN extensions for motion estimation and the related evaluations
and post-processing steps. In addition, we converted all ground truth for
Motion R-CNN from the raw Virtual KITTI data into TFRecords to enable fast
loading during training.
Note that for RoI extraction and bilinear crop and resize operations,
we use the \texttt{tf.image.crop\_and\_resize} TensorFlow function with
the interpolation method set to bilinear.
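As an illustration, the following sketch shows how RoI features can be extracted with this function; the feature map layout, crop size, and normalized box format are assumptions for illustration and do not necessarily reproduce our exact configuration.
\begin{verbatim}
import tensorflow as tf

def extract_roi_features(feature_map, boxes, box_indices, crop_size=14):
    """Bilinearly crop and resize RoI features from a feature map.

    feature_map: [batch, height, width, channels] tensor.
    boxes: [num_rois, 4] tensor of normalized [y1, x1, y2, x2] boxes.
    box_indices: [num_rois] tensor mapping each box to a batch element.
    """
    return tf.image.crop_and_resize(
        feature_map, boxes, box_indices,
        crop_size=[crop_size, crop_size], method='bilinear')
\end{verbatim}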
\subsection{Datasets}
\label{ssec:datasets}
\paragraph{Virtual KITTI}
The synthetic Virtual KITTI dataset \cite{VKITTI} is a re-creation of the KITTI
driving scenario \cite{KITTI2012, KITTI2015}, rendered from virtual 3D street scenes.
The dataset is made up of a total of 2126 frames from five different monocular
sequences recorded from a camera mounted on a virtual car.
Each sequence is rendered under varying lighting and weather conditions and
from different viewing angles, resulting in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given a dense depth map, a dense optical flow map, and the
camera extrinsics matrix. There are two annotated object classes, cars and vans
($N_{cls} = 2$).
For all cars and vans in each frame, we are given 2D and 3D object bounding
boxes, instance masks, 3D poses, and various other labels.

This makes the Virtual KITTI dataset ideally suited for developing our joint
instance segmentation and motion estimation system, as it allows us to test
different components in isolation and progress to more and more complete
predictions, up to supervising the full system on a single dataset.

For our experiments, we use the \emph{clone} sequences, which are rendered in a
way that most closely resembles the original KITTI dataset. We sample 100 examples
to be used as a validation set. From the remaining 2026 examples,
we remove the small number of examples without object instances and use the resulting
data as our training set.
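The split can be sketched as follows; the bookkeeping of examples and instance counts as well as the random sampling are illustrative assumptions rather than our exact procedure.
\begin{verbatim}
import random

def split_vkitti_clone(examples, num_val=100, seed=0):
    """Split Virtual KITTI clone examples into training and validation sets.

    examples: list of (example_id, num_instances) pairs.
    """
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    val = shuffled[:num_val]
    # Drop training examples without any object instances.
    train = [e for e in shuffled[num_val:] if e[1] > 0]
    return train, val
\end{verbatim}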
\paragraph{Motion ground truth from 3D poses and camera extrinsics}
We will now describe how we use the ground truth poses and camera matrices from
Virtual KITTI to compute instance and camera motion ground truth.
For two consecutive frames $I_t$ and $I_{t+1}$,
let $[R_t^{ex}|t_t^{ex}]$ and $[R_{t+1}^{ex}|t_{t+1}^{ex}]$
be the camera extrinsics at the two frames.
We compute the ground truth camera motion
$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in \mathbf{SE}(3)$ as
\begin{equation}
R_{t}^{gt, cam} = R_{t+1}^{ex} \cdot \mathrm{inv}(R_t^{ex}),
\end{equation}
\begin{equation}
t_{t}^{gt, cam} = t_{t+1}^{ex} - R_{t}^{gt, cam} \cdot t_t^{ex}.
\end{equation}
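For concreteness, a minimal NumPy sketch of this computation is shown below; it assumes the extrinsics are given as $3{\times}3$ rotation matrices and translation vectors in world-to-camera convention, and the function name is ours.
\begin{verbatim}
import numpy as np

def camera_motion_gt(R_ex_t, t_ex_t, R_ex_t1, t_ex_t1):
    """Ground truth camera motion between frames t and t+1.

    R_ex_t, R_ex_t1: 3x3 extrinsic rotations (world to camera).
    t_ex_t, t_ex_t1: extrinsic translations as 3-vectors.
    """
    R_cam = R_ex_t1 @ np.linalg.inv(R_ex_t)
    t_cam = t_ex_t1 - R_cam @ t_ex_t
    return R_cam, t_cam
\end{verbatim}
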
Additionally, we define $o_t^{gt, cam} \in \{ 0, 1 \}$,
\begin{equation}
o_t^{gt, cam} =
\begin{cases}
1 &\text{if the camera pose changes between $t$ and $t+1$} \\
0 &\text{otherwise,}
\end{cases}
\end{equation}
which specifies whether the camera is moving between the frames.

For any object $i$ visible in both frames, let
$(R_t^i, t_t^i)$ and $(R_{t+1}^i, t_{t+1}^i)$
be its orientation and position in camera space
at $I_t$ and $I_{t+1}$.
Note that the pose at $t$ is given with respect to the camera at $t$ and
the pose at $t+1$ is given with respect to the camera at $t+1$.

We define the ground truth pivot $p_{t}^{gt, i} \in \mathbb{R}^3$ as
\begin{equation}
p_{t}^{gt, i} = t_t^i
\end{equation}
and compute the ground truth object motion
$\{R_t^{gt, i}, t_t^{gt, i}\} \in \mathbf{SE}(3)$ as
\begin{equation}
R_{t}^{gt, i} = \mathrm{inv}(R_t^{gt, cam}) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i),
\end{equation}
\begin{equation}
t_{t}^{gt, i} = t_{t+1}^{i} - R_t^{gt, cam} \cdot t_t^i.
\end{equation}

As for the camera, we define $o_t^{gt, i} \in \{ 0, 1 \}$,
\begin{equation}
o_t^{gt, i} =
\begin{cases}
1 &\text{if the position of object $i$ changes between $t$ and $t+1$} \\
0 &\text{otherwise,}
\end{cases}
\end{equation}
which specifies whether the object is moving between the frames.
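Analogously to the camera case, the object motion ground truth can be computed as in the following sketch, which simply mirrors the equations above (function and argument names are ours).
\begin{verbatim}
import numpy as np

def object_motion_gt(R_i_t, t_i_t, R_i_t1, t_i_t1, R_cam):
    """Ground truth motion of one object between frames t and t+1.

    R_i_t, t_i_t:   object rotation and position in the camera frame at t.
    R_i_t1, t_i_t1: object rotation and position in the camera frame at t+1.
    R_cam:          ground truth camera rotation between t and t+1.
    """
    pivot = t_i_t
    R_obj = np.linalg.inv(R_cam) @ R_i_t1 @ np.linalg.inv(R_i_t)
    t_obj = t_i_t1 - R_cam @ t_i_t
    return pivot, R_obj, t_obj
\end{verbatim}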
\paragraph{Evaluation metrics with motion ground truth}
To evaluate the 3D instance and camera motions on the Virtual KITTI validation
set, we introduce a few error metrics.
Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
let $i_k$ be the index of the best matching ground truth example,
let $c_k$ be the predicted class,
let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$,
and let $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ be the ground truth motion for the example $i_k$.
Then, assuming there are $N$ such detections,
\begin{equation}
E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right)
\end{equation}
measures the mean angle of the error rotation between the predicted and ground truth rotations,
\begin{equation}
E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R^{k,c_k}) \cdot (t^{gt,i_k} - t^{k,c_k}) \right\rVert_2
\end{equation}
is the mean Euclidean norm of the difference between the predicted and ground truth translations, and
\begin{equation}
E_{p} = \frac{1}{N}\sum_k \left\lVert p^{gt,i_k} - p^{k,c_k} \right\rVert_2
\end{equation}
is the mean Euclidean norm of the difference between the predicted and ground truth pivots.
Moreover, we define precision and recall measures for the detection of moving objects,
where
\begin{equation}
O_{pr} = \frac{tp}{tp + fp}
\end{equation}
is the fraction of objects that are actually moving among all objects classified as moving,
and
\begin{equation}
O_{rc} = \frac{tp}{tp + fn}
\end{equation}
is the fraction of objects correctly classified as moving among all objects that are actually moving.
Here, we use
\begin{equation}
tp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 1],
\end{equation}
\begin{equation}
fp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 0],
\end{equation}
and
\begin{equation}
fn = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
\end{equation}
Analogously, we define the error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for the
predicted camera motions.
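As a reference, these metrics can be computed over the matched detections as in the following sketch; the per-detection data layout is an assumption for illustration, and only detections matched to a ground truth example with IoU of at least $0.5$ are passed in.
\begin{verbatim}
import numpy as np

def motion_metrics(R_pred, t_pred, p_pred, o_pred, R_gt, t_gt, p_gt, o_gt):
    """Compute E_R, E_t, E_p, O_pr and O_rc over matched detections.

    Each argument is a list with one entry per matched detection.
    """
    N = len(R_pred)
    E_R = E_t = E_p = 0.0
    tp = fp = fn = 0
    for k in range(N):
        # Angle of the error rotation between prediction and ground truth.
        R_err = np.linalg.inv(R_pred[k]) @ R_gt[k]
        cos_angle = np.clip((np.trace(R_err) - 1.0) / 2.0, -1.0, 1.0)
        E_R += np.arccos(cos_angle)
        E_t += np.linalg.norm(np.linalg.inv(R_pred[k]) @ (t_gt[k] - t_pred[k]))
        E_p += np.linalg.norm(p_gt[k] - p_pred[k])
        tp += int(o_pred[k] == 1 and o_gt[k] == 1)
        fp += int(o_pred[k] == 1 and o_gt[k] == 0)
        fn += int(o_pred[k] == 0 and o_gt[k] == 1)
    O_pr = tp / (tp + fp) if tp + fp > 0 else 0.0
    O_rc = tp / (tp + fn) if tp + fn > 0 else 0.0
    return E_R / N, E_t / N, E_p / N, O_pr, O_rc
\end{verbatim}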
\subsection{Virtual KITTI: Training setup}
\label{ssec:setup}

For our initial experiments, we concatenate both RGB frames as well as the XYZ
coordinates of both frames as input to the networks.
We train both the Motion R-CNN and Motion R-CNN FPN variants.
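A minimal sketch of this input construction is given below; the tensor names and channel ordering are assumptions for illustration.
\begin{verbatim}
import tensorflow as tf

def build_network_input(rgb_t, rgb_t1, xyz_t, xyz_t1):
    """Concatenate both RGB frames and both XYZ coordinate maps.

    All inputs are [batch, height, width, 3] tensors, resulting in a
    [batch, height, width, 12] network input.
    """
    return tf.concat([rgb_t, rgb_t1, xyz_t, xyz_t1], axis=-1)
\end{verbatim}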
\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI training set.
As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
momentum of $0.9$.
As learning rate, we use $0.25 \cdot 10^{-2}$ for the
first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
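In TensorFlow 1.x style, this schedule can be set up roughly as follows (a sketch, not our exact training code):
\begin{verbatim}
import tensorflow as tf

# Piecewise-constant learning rate: 0.25e-2 for the first 144K
# iterations and 0.25e-3 afterwards, with SGD and momentum 0.9.
global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.piecewise_constant(
    global_step, boundaries=[144000], values=[0.25e-2, 0.25e-3])
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
\end{verbatim}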
\paragraph{R-CNN training parameters}
For training the RPN and RoI heads and during inference,
we use the exact same number of proposals and RoIs as Mask R-CNN in
the ResNet and ResNet-FPN variants, respectively.
\paragraph{Initialization}
For initializing the C$_1$ to C$_5$ weights, we use a pre-trained
ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository.
Following the pre-existing TensorFlow implementation of Faster R-CNN,
we initialize all other hidden layers with He initialization \cite{He}.
For the fully-connected camera and instance motion output layers,
we use a truncated normal initializer with zero mean and a standard
deviation of $0.0001$, truncated at two standard deviations.
Note that a larger initial standard deviation prevented the
angle sine estimates from properly converging to the very small values they
are in general expected to take.
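In TensorFlow 1.x style, this corresponds roughly to the following initializer (a sketch; the variable name is ours):
\begin{verbatim}
import tensorflow as tf

# Zero-mean truncated normal initializer with a very small standard
# deviation; samples beyond two standard deviations are re-drawn.
motion_output_initializer = tf.truncated_normal_initializer(
    mean=0.0, stddev=0.0001)
\end{verbatim}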
\subsection{Virtual KITTI: Evaluation}
\label{ssec:vkitti}

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/vkitti_cam}
\caption{
Visualization of results with XYZ input, camera motion prediction, and 3D motion supervision
with the ResNet (without FPN) architecture.
From left to right, we show the input image with the instance segmentation results as an overlay,
the estimated optical flow, and the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and incorrect estimates in red tones.
}
\label{figure:vkitti}
\end{figure}
{
\begin{table}[t]
\centering
\begin{tabular}{@{}*{13}{c}@{}}
\toprule
\multicolumn{4}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} & \multicolumn{2}{c}{Flow Error} \\
\cmidrule(lr){1-4}\cmidrule(lr){5-9}\cmidrule(l){10-11}\cmidrule(l){12-13}
FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & ? & ? & 0.1 & 0.04 & 6.73 & 26.59\% \\
\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & ? & ? & 0.22 & 0.07 & 12.62 & 46.28\% \\
$\times$ & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\checkmark & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\midrule
$\times$ & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
\checkmark & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
$\times$ & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\checkmark & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\bottomrule
\end{tabular}
\caption{
Comparison of network variants on the Virtual KITTI validation set.
AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
Camera and instance motion errors are averaged over the validation set.
We optionally enable camera motion prediction (cam.),
replace the ResNet backbone with ResNet-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise
object motions (sup.) with 3D motion ground truth (3D) or
with a 2D re-projection loss based on flow ground truth (flow).
Note that for rows where no camera motion is predicted, the optical flow
is composed using the ground truth camera motion and thus the flow error is
only impacted by the predicted 3D object motions.
}
\label{table:vkitti}
\end{table}
}

In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
In Table \ref{table:vkitti}, we compare the performance of the different network variants
on the same validation set.
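For reference, the flow metrics reported in Table \ref{table:vkitti} can be computed as in the following sketch; the array layout is an assumption for illustration.
\begin{verbatim}
import numpy as np

def flow_error_metrics(flow_pred, flow_gt):
    """Average endpoint error (AEE) and Fl-all outlier ratio.

    flow_pred, flow_gt: [height, width, 2] arrays of flow vectors.
    """
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    gt_mag = np.linalg.norm(flow_gt, axis=-1)
    aee = epe.mean()
    # A pixel counts as wrong if its endpoint error exceeds both
    # 3 px and 5% of the ground truth flow magnitude.
    fl_all = np.mean((epe > 3.0) & (epe > 0.05 * gt_mag))
    return aee, fl_all
\end{verbatim}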