commit 7c9344a913 (parent 5165cbec12)
WIP
@@ -214,8 +214,8 @@ $\begin{bmatrix}
 \end{tabular}
 \caption {
 Backbone architecture based on ResNet-50 \cite{ResNet}.
-Operations enclosed in a []$_b$ block make up a single ResNet \enquote{bottleneck}
-block (see Figure \ref{figure:bottleneck}). If the block is denoted as []$_b/2$,
+Operations enclosed in a $[\cdot]_b$ block make up a single ResNet \enquote{bottleneck}
+block (see Figure \ref{figure:bottleneck}). If the block is denoted as $[\cdot]_b/2$,
 the first convolution operation in the block has a stride of 2. Note that the stride
 is only applied to the first block, but not to repeated blocks.
 Batch normalization \cite{BN} is used after every residual unit.
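
To make the $[\cdot]_b/2$ stride convention concrete, here is a minimal sketch of one bottleneck unit, assuming a TensorFlow/Keras-style API (the thesis implementation is in TensorFlow, but the function below is illustrative; the exact placement of batch normalization follows the standard ResNet recipe, not necessarily this code base):

    import tensorflow as tf

    def bottleneck_unit(x, filters, stride=1):
        # 1x1 reduce; a [.]_b/2 block passes stride=2 here for its first
        # unit only, while repeated units in the same block use stride=1.
        y = tf.keras.layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
        y = tf.keras.layers.BatchNormalization()(y)
        y = tf.keras.layers.ReLU()(y)
        # 3x3 spatial convolution.
        y = tf.keras.layers.Conv2D(filters, 3, padding='same', use_bias=False)(y)
        y = tf.keras.layers.BatchNormalization()(y)
        y = tf.keras.layers.ReLU()(y)
        # 1x1 expand to 4 * filters output channels.
        y = tf.keras.layers.Conv2D(4 * filters, 1, use_bias=False)(y)
        y = tf.keras.layers.BatchNormalization()(y)
        # Project the shortcut when the spatial or channel shape changes.
        shortcut = x
        if stride != 1 or x.shape[-1] != 4 * filters:
            shortcut = tf.keras.layers.Conv2D(4 * filters, 1, strides=stride,
                                              use_bias=False)(x)
            shortcut = tf.keras.layers.BatchNormalization()(shortcut)
        return tf.keras.layers.ReLU()(tf.keras.layers.Add()([y, shortcut]))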
@@ -415,7 +415,7 @@ masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
 \end{tabular}
 \caption {
 Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
-Operations enclosed in a []$_p$ block make up a single FPN
+Operations enclosed in a $[\cdot]_p$ block make up a single FPN
 block (see Figure \ref{figure:fpn_block}).
 }
 \label{table:maskrcnn_resnet_fpn}
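
The $[\cdot]_p$ block itself is only referenced here (Figure figure:fpn_block). Assuming it follows the usual FPN top-down merge, one such step would look like the sketch below; the function name and the 256-channel pyramid width are illustrative, not taken from this implementation:

    import tensorflow as tf

    def fpn_merge(coarser, lateral, channels=256):
        # Project the bottom-up (lateral) feature map to the pyramid width.
        lateral = tf.keras.layers.Conv2D(channels, 1)(lateral)
        # Upsample the coarser pyramid level to the lateral resolution.
        up = tf.keras.layers.UpSampling2D(size=2)(coarser)
        merged = tf.keras.layers.Add()([up, lateral])
        # 3x3 convolution to reduce aliasing from the upsampling.
        return tf.keras.layers.Conv2D(channels, 3, padding='same')(merged)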
@@ -7,18 +7,37 @@ In addition to instance motions, our network estimates the 3D motion of the camera
 We combine all these estimates to yield a dense optical flow output from our
 end-to-end deep network.
 Our model is trained on the synthetic Virtual KITTI dataset, which provides
-us with all required ground truth data, and evaluated on a validation set created
-from Virtual KITTI.
+us with bounding box, instance mask, depth, and 3D motion ground truth,
+and evaluated on a validation set created from Virtual KITTI.
 During inference, our model does not add any significant computational overhead
 over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore just as fast and interesting
 for real-time scenarios.
-We thus presented a step towards real time 3D motion estimation based on a
+
+Although our system produces reasonable first instance motion predictions,
+estimates the camera ego-motion reasonably well,
+and achieves high accuracy in classifying between moving and non-moving objects,
+the accuracy of the motion predictions is still not convincing.
+More work will thus be required to bring the system (closer) to competitive accuracy,
+which includes trying penalization with the flow loss instead of 3D motion ground truth,
+and improvements to the network architecture and training process.
+We thus presented a partial step towards real-time 3D motion estimation based on a
 physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
 to previous end-to-end deep networks for dense motion estimation, the output
 of our network is highly interpretable, which may also bring benefits for safety-critical
 applications.
+
+\subsection{Future Work}
+\paragraph{Mask R-CNN baseline}
+As our Mask R-CNN re-implementation is still not as accurate as reported in the
+original paper, working on the implementation details of this baseline would be
+a critical, direct next step. Recently, a highly accurate, third-party implementation of Mask
+R-CNN in TensorFlow was released, which should be studied to this end.
+
+\paragraph{Instance motion supervision with the optical flow re-projection loss}
+We developed and implemented a loss for penalizing instance motions with optical flow
+ground truth (a sketch follows this hunk), but could not yet train a network with it
+due to time constraints. Conducting experiments with this loss will be the next step after that.
+
+\paragraph{Training on all Virtual KITTI sequences}
+We only trained our models on the \emph{clone} variants of the Virtual KITTI sequences
+to make training faster.
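
A minimal sketch of the re-projection loss named in the Future Work paragraph above, assuming per-pixel XYZ points from depth and a pinhole camera; the object pivot and the camera motion are omitted for brevity, and all names are illustrative rather than taken from the implementation:

    import tensorflow as tf

    def flow_reprojection_loss(R, t, xyz, mask, flow_gt, intrinsics):
        # R: (3, 3) predicted instance rotation, t: (3,) translation.
        # xyz: (H, W, 3) frame-1 points; mask: (H, W) instance mask.
        # flow_gt: (H, W, 2); intrinsics: (fx, fy, cx, cy).
        fx, fy, cx, cy = intrinsics
        h, w = xyz.shape[0], xyz.shape[1]
        pts = tf.reshape(xyz, [-1, 3])
        # Move the instance points with the predicted rigid transform.
        moved = tf.matmul(pts, R, transpose_b=True) + t
        # Pinhole projection into the second frame.
        u = fx * moved[:, 0] / moved[:, 2] + cx
        v = fy * moved[:, 1] / moved[:, 2] + cy
        # Flow implied by the motion: projected position minus pixel grid.
        ys, xs = tf.meshgrid(tf.range(h, dtype=tf.float32),
                             tf.range(w, dtype=tf.float32), indexing='ij')
        flow_pred = tf.stack([tf.reshape(u, [h, w]) - xs,
                              tf.reshape(v, [h, w]) - ys], axis=-1)
        # Average L1 difference to the ground truth flow over the instance.
        m = tf.cast(mask, tf.float32)[..., tf.newaxis]
        return tf.reduce_sum(m * tf.abs(flow_pred - flow_gt)) / (
            2.0 * tf.reduce_sum(m) + 1e-8)

Supervising through this projection needs only flow ground truth, which is what makes it attractive when 3D motion labels are unavailable.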
@@ -148,10 +148,6 @@ the predicted camera motion.
 \subsection{Virtual KITTI: Training setup}
 \label{ssec:setup}
 
-For our initial experiments, we concatenate both RGB frames as
-well as the XYZ coordinates for both frames as input to the networks.
-We train both, the Motion R-CNN ResNet and ResNet-FPN variants.
-
 \paragraph{Training schedule}
 Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
 We train for a total of 192K iterations on the Virtual KITTI training set.
@@ -184,10 +180,11 @@ are in general expected to output.
 
 \begin{figure}[t]
 \centering
-\includegraphics[width=\textwidth]{figures/vkitti_cam}
+\includegraphics[width=\textwidth]{figures/results}
 \caption{
-Visualization of results with XYZ input, camera motion prediction and 3D motion supervision
-with the ResNet (without FPN) architecture.
+Visualization of results on Virtual KITTI with XYZ input, camera motion prediction and 3D motion supervision.
+For each example, we show the results with Motion R-CNN ResNet and ResNet-FPN
+in the upper and lower row, respectively.
 From left to right, we show the input image with instance segmentation results as overlay,
 the estimated flow, as well as the flow error map.
 The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
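
The error-map threshold above follows the KITTI flow convention, and the same thresholds define the Fl-all column in the results table of the next hunk. A small NumPy sketch of both metrics (function name illustrative):

    import numpy as np

    def flow_error_metrics(flow_pred, flow_gt):
        # Per-pixel endpoint error and ground truth flow magnitude.
        epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
        mag = np.linalg.norm(flow_gt, axis=-1)
        aee = epe.mean()  # AEE: average endpoint error
        # A pixel counts as wrong when its error is both >= 3 px and
        # >= 5% of the ground truth magnitude; correct pixels are blue.
        wrong = (epe >= 3.0) & (epe >= 0.05 * mag)
        fl_all = 100.0 * wrong.mean()  # Fl-all, in percent
        return aee, fl_all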
@@ -195,46 +192,87 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) i
 \label{figure:vkitti}
 \end{figure}
 
+\begin{figure}[t]
+\centering
+\includegraphics[width=\textwidth]{figures/moving}
+\caption{
+We visually compare a Motion R-CNN ResNet trained without (upper row) and
+with (lower row) classifying the objects into moving and non-moving objects.
+Note that in the selected example, all cars are parked, and thus the predicted
+motion in the first row is an error.
+From left to right, we show the input image with instance segmentation results as overlay,
+the estimated flow, as well as the flow error map.
+The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
+}
+\label{figure:moving}
+\end{figure}
+
 {
 \begin{table}[t]
 \centering
-\begin{tabular}{@{}*{13}{c}@{}}
+\begin{tabular}{@{}*{10}{c}@{}}
 \toprule
-\multicolumn{4}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} & \multicolumn{2}{c}{Flow Error} \\
-\cmidrule(lr){1-4}\cmidrule(lr){5-9}\cmidrule(l){10-11}\cmidrule(l){12-13}
-FPN & cam. & sup. & XYZ & $E_{R} [deg]$ & $E_{t} [m]$ & $E_{p} [m]$ & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam} [deg]$ & $E_{t}^{cam} [m]$ & AEE & Fl-all \\\midrule
-$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & ? & ? & 0.1 & 0.04 & 6.73 & 26.59\% \\
-\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & ? & ? & 0.22 & 0.07 & 12.62 & 46.28\% \\
-$\times$ & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
-\checkmark & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
-\midrule
-$\times$ & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
-\checkmark & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
-$\times$ & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
-\checkmark & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
+\multicolumn{1}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} & \multicolumn{2}{c}{Flow Error} \\
+\cmidrule(lr){1-1}\cmidrule(lr){2-6}\cmidrule(l){7-8}\cmidrule(l){9-10}
+FPN & $E_{R} [deg]$ & $E_{t} [m]$ & $E_{p} [m]$ & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam} [deg]$ & $E_{t}^{cam} [m]$ & AEE & Fl-all \\\midrule
+- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & -\% \\\midrule
+$\times$ & 0.301 & 0.237 & 3.331 & 0.790 & 0.916 & 0.087 & 0.053 & 11.17 & 24.91\% \\
+\checkmark & 0.293 & 0.210 & 1.958 & 0.844 & 0.914 & 0.169 & 0.050 & 8.29 & 45.22\% \\
 \bottomrule
 \end{tabular}
 
 \caption {
-Comparison of network variants on the Virtual KITTI validation set.
+Evaluation of different metrics on the Virtual KITTI validation set.
 AEE: Average Endpoint Error; Fl-all: Ratio of pixels where the flow estimate is
 wrong by both $\geq 3$ pixels and $\geq 5\%$.
-We optionally enable camera motion prediction (cam.),
-replace the ResNet backbone with ResNet-FPN (FPN),
-or input XYZ coordinates into the backbone (XYZ).
-We either supervise
-object motions (sup.) with 3D motion ground truth (3D) or
-with a 2D re-projection loss based on flow ground truth (flow).
-Note that for rows where no camera motion is predicted, the optical flow
-is composed using the ground truth camera motion and thus the flow error is
-only impacted by the predicted 3D object motions.
+We compare network variants with and without FPN.
+Camera and instance motion errors are averaged over the validation set.
+Quantities in parentheses in the first row are the average ground truth values for the estimated
+quantity. For example, we compare the error in camera angle, $E_{R}^{cam} [deg]$,
+to the average rotation angle in the ground truth camera motions.
 }
 \label{table:vkitti}
 \end{table}
 }
 
+For our initial experiments, we concatenate both RGB frames as
+well as the XYZ coordinates for both frames as input to the networks.
+We train both the Motion R-CNN ResNet and ResNet-FPN variants and supervise
+camera and instance motions with 3D motion ground truth.
+
 In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
 results on the Virtual KITTI validation set.
+In Figure \ref{figure:moving}, we visually justify the addition of the classifier
+that decides between moving and still objects.
 In Table \ref{table:vkitti}, we compare the performance of different network variants
 on the Virtual KITTI validation set.
 
+\paragraph{Camera motion}
+Both variants achieve a low error in predicted camera translation, relative to
+the average ground truth camera translation. The camera rotation angle error
+is relatively high compared to the small average ground truth camera rotation.
+Although both variants use the exact same network for predicting the camera motion,
+the FPN variant performs worse here, with the error in rotation angle twice as high.
+One possible explanation that should be investigated in further work is
+that in the FPN variant, all blocks in the backbone are shared between the camera
+motion branch and the feature pyramid. In the variant without FPN, the C$5$ and
+C$6$ blocks are only used in the camera branch, and thus only experience weight
+updates due to the camera motion loss.
+
+\paragraph{Instance motion}
+The object pivots are estimated with relatively high precision in both variants
+(given that the scenes are at a realistic scale), although the FPN variant is
+significantly more precise, which we ascribe to the higher resolution features
+used in this variant.
+
+The predicted 3D object translations and rotations still have a relatively high
+error compared to the average actual (ground truth) translations and rotations,
+which may be due to implementation issues or problems with the current 3D motion
+ground truth loss.
+The FPN variant is only slightly more accurate for these predictions, which suggests
+that there may still be issues with our implementation, as one would expect the
+FPN variant to be more accurate.
+
+\paragraph{Instance segmentation}
+Looking at Figure \ref{figure:vkitti}, our instance segmentation results in
+many cases still lack the accuracy seen in the Mask R-CNN Cityscapes \cite{MaskRCNN} results,
+which is likely due to implementation details.
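
The results discussion above rests on how the dense flow output is composed from the predicted camera motion and the per-instance rigid motions, with the moving/non-moving classifier gating each instance. A NumPy sketch of one plausible composition; the order of applying object and camera motion, the pivot parameterization, and the use of per-pixel XYZ input are assumptions, not necessarily this implementation's exact scheme:

    import numpy as np

    def compose_flow(xyz, K, R_cam, t_cam, instances):
        # xyz: (H, W, 3) camera-space points of frame 1 (e.g. from depth).
        # K: 3x3 camera intrinsics.
        # instances: list of (mask, R_obj, t_obj, pivot, moving) tuples.
        h, w, _ = xyz.shape
        pts = xyz.reshape(-1, 3)
        # Static scene: every point moves only with the camera.
        moved = pts @ R_cam.T + t_cam
        for mask, R_obj, t_obj, pivot, moving in instances:
            if not moving:
                # The moving/non-moving classifier gates the object motion:
                # still objects keep the pure camera-motion displacement.
                continue
            m = mask.reshape(-1)
            # Rigid object motion about its pivot, then the camera motion.
            obj = (pts[m] - pivot) @ R_obj.T + pivot + t_obj
            moved[m] = obj @ R_cam.T + t_cam
        # Project into frame 2 and subtract the frame-1 pixel grid.
        proj = moved @ K.T
        uv2 = proj[:, :2] / proj[:, 2:3]
        ys, xs = np.mgrid[0:h, 0:w]
        uv1 = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float64)
        return (uv2 - uv1).reshape(h, w, 2)

Gating on the classifier is what suppresses the spurious motion on the parked cars in Figure figure:moving, since a non-moving instance falls back to the camera-only displacement.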
BIN figures/moving.pdf (new executable file, binary file not shown)
BIN figures/results.pdf (new executable file, binary file not shown)