WIP
parent 5165cbec12, commit 7c9344a913
@@ -214,8 +214,8 @@ $\begin{bmatrix}
\end{tabular}
\caption {
Backbone architecture based on ResNet-50 \cite{ResNet}.
Operations enclosed in a $[\cdot]_b$ block make up a single ResNet \enquote{bottleneck}
block (see Figure \ref{figure:bottleneck}). If the block is denoted as $[\cdot]_b/2$,
the first convolution operation in the block has a stride of 2. Note that the stride
is only applied to the first block, but not to repeated blocks.
Batch normalization \cite{BN} is used after every residual unit.
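As an illustration of this block structure, the following is a minimal sketch of such a bottleneck block, assuming TensorFlow/Keras; the function name and the handling of the projection shortcut are our illustrative choices, not the actual implementation code of this thesis.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters, stride=1):
    # 1x1 convolution; the stride of a [.]_b/2 block is applied
    # here, to the first convolution of the block only.
    out = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)
    # 3x3 convolution.
    out = layers.Conv2D(filters, 3, padding='same', use_bias=False)(out)
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)
    # 1x1 convolution expanding to four times the bottleneck width.
    out = layers.Conv2D(4 * filters, 1, use_bias=False)(out)
    out = layers.BatchNormalization()(out)
    # Projection shortcut when spatial size or channel count changes;
    # repeated blocks (stride 1, matching channels) use the identity.
    shortcut = x
    if stride != 1 or x.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride,
                                 use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([out, shortcut]))
\end{verbatim}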
@@ -415,7 +415,7 @@ masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
\end{tabular}
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
Operations enclosed in a $[\cdot]_p$ block make up a single FPN
block (see Figure \ref{figure:fpn_block}).
}
\label{table:maskrcnn_resnet_fpn}
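The FPN block follows the standard top-down formulation: the coarser top-down map is upsampled, the backbone feature map enters through a 1$\times$1 lateral convolution, and the sum is smoothed by a 3$\times$3 convolution. A minimal sketch, again assuming TensorFlow/Keras (the exact block in Figure \ref{figure:fpn_block} may differ in details):
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers

def fpn_block(top_down, lateral, channels=256):
    # Reduce the backbone (lateral) feature map to the pyramid width.
    lateral = layers.Conv2D(channels, 1)(lateral)
    # Upsample the coarser top-down map to the lateral resolution.
    top_down = layers.UpSampling2D(size=2)(top_down)
    # Merge by addition and smooth with a 3x3 convolution.
    merged = layers.Add()([top_down, lateral])
    return layers.Conv2D(channels, 3, padding='same')(merged)
\end{verbatim}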
@@ -7,18 +7,37 @@ In addition to instance motions, our network estimates the 3D motion of the camera
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with bounding box, instance mask, depth, and 3D motion ground truth,
and is evaluated on a validation set created from Virtual KITTI.
During inference, our model does not add any significant computational overhead
over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore
just as fast, making it equally interesting for real-time scenarios.
Although our system produces reasonable first instance motion predictions,
estimates the camera ego-motion reasonably well,
and achieves high accuracy in classifying between moving and non-moving objects,
the accuracy of the motion predictions is still not convincing.
More work will therefore be required to bring the system (closer) to competitive accuracy,
which includes trying penalization with the flow loss instead of 3D motion ground truth,
as well as improvements to the network architecture and training process.
We thus presented a partial step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may also bring benefits for safety-critical
applications.

\subsection{Future Work}

\paragraph{Mask R-CNN baseline}
As our Mask R-CNN re-implementation is still not as accurate as reported in the
original paper, working on the implementation details of this baseline would be
a critical, direct next step. Recently, a highly accurate third-party implementation of Mask
R-CNN in TensorFlow was released, which should be studied to this end.

\paragraph{Instance motion supervision with the optical flow re-projection loss}
We developed and implemented a loss for penalizing instance motions with optical flow ground truth,
but could not yet train a network with it due to time constraints. Conducting
experiments with this loss will be the second immediate next step.
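In essence, this loss back-projects the pixels of an instance to 3D using the depth map, applies the predicted rigid motion, re-projects into the second frame, and penalizes the deviation of the induced flow from the ground truth flow. The following sketch conveys the idea for a single instance; it omits the object pivot and the camera motion for brevity, and all names and shapes are illustrative rather than our actual implementation:
\begin{verbatim}
import tensorflow as tf

def flow_reprojection_loss(depth, f, cx, cy, R, t, flow_gt, mask):
    # depth: [H, W] first-frame depth; f, cx, cy: pinhole intrinsics;
    # R: [3, 3], t: [3] predicted rigid motion; flow_gt: [H, W, 2];
    # mask: [H, W] instance mask in {0, 1}.
    h, w = depth.shape
    ys, xs = tf.meshgrid(tf.range(h, dtype=tf.float32),
                         tf.range(w, dtype=tf.float32), indexing='ij')
    # Back-project pixels to 3D camera coordinates.
    X = (xs - cx) / f * depth
    Y = (ys - cy) / f * depth
    points = tf.stack([X, Y, depth], axis=-1)         # [H, W, 3]
    # Apply the predicted rigid instance motion.
    moved = tf.einsum('ij,hwj->hwi', R, points) + t   # [H, W, 3]
    # Re-project into the image plane of the second frame.
    x2 = f * moved[..., 0] / moved[..., 2] + cx
    y2 = f * moved[..., 1] / moved[..., 2] + cy
    flow = tf.stack([x2 - xs, y2 - ys], axis=-1)      # [H, W, 2]
    # Penalize the induced flow inside the instance mask.
    err = tf.reduce_sum(tf.abs(flow - flow_gt), axis=-1) * mask
    return tf.reduce_sum(err) / (tf.reduce_sum(mask) + 1e-8)
\end{verbatim}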

\paragraph{Training on all Virtual KITTI sequences}
We only trained our models on the \emph{clone} variants of the Virtual KITTI sequences
to make training faster.

@@ -148,10 +148,6 @@ the predicted camera motion.
\subsection{Virtual KITTI: Training setup}
\label{ssec:setup}

\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train for a total of 192K iterations on the Virtual KITTI training set.
@@ -184,10 +180,11 @@ are in general expected to output.

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/results}
\caption{
Visualization of results on Virtual KITTI with XYZ input, camera motion prediction and 3D motion supervision.
For each example, we show the results with Motion R-CNN ResNet and ResNet-FPN
in the upper and lower row, respectively.
From left to right, we show the input image with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
@@ -195,46 +192,87 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
\label{figure:vkitti}
\end{figure}

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/moving}
\caption{
We visually compare a Motion R-CNN ResNet trained without (upper row) and
with (lower row) classifying the objects into moving and non-moving objects.
Note that in the selected example, all cars are parked, and thus the predicted
motion in the first row is an error.
From left to right, we show the input image with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
\label{figure:moving}
\end{figure}

{
\begin{table}[t]
\centering
\begin{tabular}{@{}*{10}{c}@{}}
\toprule
\multicolumn{1}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} & \multicolumn{2}{c}{Flow Error} \\
\cmidrule(lr){1-1}\cmidrule(lr){2-6}\cmidrule(l){7-8}\cmidrule(l){9-10}
FPN & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & - \\\midrule
$\times$ & 0.301 & 0.237 & 3.331 & 0.790 & 0.916 & 0.087 & 0.053 & 11.17 & 24.91\% \\
\checkmark & 0.293 & 0.210 & 1.958 & 0.844 & 0.914 & 0.169 & 0.050 & 8.29 & 45.22\% \\
\bottomrule
\end{tabular}

\caption {
Evaluation of different metrics on the Virtual KITTI validation set.
AEE: Average Endpoint Error; Fl-all: ratio of pixels where the flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
We compare network variants with and without FPN.
Camera and instance motion errors are averaged over the validation set.
Quantities in parentheses in the first row are the average ground truth values for the estimated
quantities. For example, we compare the error in the camera rotation angle, $E_{R}^{cam}$ [deg],
to the average rotation angle in the ground truth camera motions.
}
\label{table:vkitti}
\end{table}
}

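For reference, the two flow metrics could be computed as in the following sketch, which directly follows the definitions in the caption (array shapes are our assumption):
\begin{verbatim}
import numpy as np

def flow_metrics(flow_est, flow_gt):
    # flow_est, flow_gt: [H, W, 2] optical flow fields.
    ee = np.linalg.norm(flow_est - flow_gt, axis=-1)  # endpoint error
    aee = ee.mean()                                   # AEE
    gt_mag = np.linalg.norm(flow_gt, axis=-1)
    # A pixel counts as wrong if the error is both >= 3 px
    # and >= 5% of the ground truth flow magnitude.
    fl_all = np.mean((ee >= 3.0) & (ee >= 0.05 * gt_mag))
    return aee, fl_all
\end{verbatim}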
For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both the Motion R-CNN ResNet and ResNet-FPN variants and supervise
camera and instance motions with 3D motion ground truth.

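This input encoding amounts to a simple channel-wise concatenation; a minimal sketch (the channel ordering is our assumption):
\begin{verbatim}
import tensorflow as tf

def make_network_input(rgb1, rgb2, xyz1, xyz2):
    # rgb1, rgb2: [H, W, 3] RGB frames; xyz1, xyz2: [H, W, 3] per-pixel
    # camera-space coordinates. Yields a [H, W, 12] input tensor.
    return tf.concat([rgb1, rgb2, xyz1, xyz2], axis=-1)
\end{verbatim}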
In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
In Figure \ref{figure:moving}, we visually justify the addition of the classifier
that decides between a moving and a still object (see also the gating sketch below).
In Table \ref{table:vkitti}, we compare the performance of different network variants
on the Virtual KITTI validation set.
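One way such a classifier can act at inference time is to suppress the predicted motion of instances classified as non-moving, falling back to the identity motion. The following is a hypothetical sketch of that gating; the actual integration in our network may differ:
\begin{verbatim}
import tensorflow as tf

def gate_instance_motions(R, t, moving_prob, threshold=0.5):
    # R: [N, 3, 3] rotations, t: [N, 3] translations,
    # moving_prob: [N] probability that each instance is moving.
    moving = tf.cast(moving_prob > threshold, R.dtype)
    # Non-moving instances fall back to the identity motion.
    identity = tf.eye(3, batch_shape=tf.shape(R)[:1], dtype=R.dtype)
    w = moving[:, None, None]
    R_gated = w * R + (1. - w) * identity
    t_gated = moving[:, None] * t
    return R_gated, t_gated
\end{verbatim}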

\paragraph{Camera motion}
Both variants achieve a low error in predicted camera translation, relative to
the average ground truth camera translation. The camera rotation angle error
is relatively high compared to the small average ground truth camera rotation.
Although both variants use the exact same network for predicting the camera motion,
the FPN variant performs worse here, with the error in rotation angle twice as high.
One possible explanation that should be investigated in further work is
that in the FPN variant, all blocks in the backbone are shared between the camera
motion branch and the feature pyramid. In the variant without FPN, the C$5$ and
C$6$ blocks are only used in the camera branch, and thus only experience weight
updates due to the camera motion loss.
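For reference, a rotation error of the kind reported as $E_{R}$ and $E_{R}^{cam}$ could be computed as the geodesic angle between the estimated and ground truth rotation matrices; whether this exactly matches our evaluation code is an assumption:
\begin{verbatim}
import numpy as np

def rotation_error_deg(R_est, R_gt):
    # Geodesic distance between two rotation matrices, in degrees.
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
\end{verbatim}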

\paragraph{Instance motion}
The object pivots are estimated with relatively high precision in both variants
(given that the scenes are at a realistic scale), although the FPN variant is
significantly more precise, which we ascribe to the higher resolution features
used in this variant.

The predicted 3D object translations and rotations still have a relatively high
error compared to the average actual (ground truth) translations and rotations,
which may be due to implementation issues or problems with the current 3D motion
ground truth loss.
The FPN variant is only slightly more accurate for these predictions, which suggests
that there may still be issues with our implementation, as one would expect the
FPN variant to be more accurate.

\paragraph{Instance segmentation}
Looking at Figure \ref{figure:vkitti}, our instance segmentation results in many
cases still lack the accuracy seen in the Mask R-CNN Cityscapes \cite{MaskRCNN}
results, which is likely due to implementation details.
New binary files: figures/moving.pdf, figures/results.pdf