commit 7c9344a913 (parent 5165cbec12)
Simon Meister 2017-11-21 22:54:44 +01:00
6 changed files with 93 additions and 36 deletions


@@ -214,8 +214,8 @@ $\begin{bmatrix}
\end{tabular}
\caption {
Backbone architecture based on ResNet-50 \cite{ResNet}.
Operations enclosed in a $[\cdot]_b$ block make up a single ResNet \enquote{bottleneck}
block (see Figure \ref{figure:bottleneck}). If the block is denoted as $[\cdot]_b/2$,
the first convolution operation in the block has a stride of 2. Note that the stride
is only applied to the first block, but not to repeated blocks.
Batch normalization \cite{BN} is used after every residual unit.
@@ -415,7 +415,7 @@ masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
\end{tabular}
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
Operations enclosed in a $[\cdot]_p$ block make up a single FPN
block (see Figure \ref{figure:fpn_block}).
}
\label{table:maskrcnn_resnet_fpn}


@@ -7,18 +7,37 @@ In addition to instance motions, our network estimates the 3D motion of the camera.
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with bounding box, instance mask, depth, and 3D motion ground truth,
and evaluated on a validation set created from Virtual KITTI.
During inference, our model does not add any significant computational overhead
over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore
just as fast, making it attractive for real-time scenarios.
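To make this composition explicit, it can be sketched as follows (a simplified
recapitulation, omitting details of the full formulation; $K$ denotes the camera
intrinsics, $\tilde{\mathbf{x}}$ the homogeneous coordinate of pixel $\mathbf{x}$,
$d(\mathbf{x})$ its depth, $\pi$ the perspective projection, $(R^{cam}, \mathbf{t}^{cam})$
the predicted camera motion, and $(R_k, \mathbf{t}_k, \mathbf{p}_k)$ the predicted rotation,
translation and pivot of the instance containing $\mathbf{x}$):
\begin{align*}
X &= d(\mathbf{x})\, K^{-1} \tilde{\mathbf{x}}, \\
X' &= R^{cam} \left( R_k \left( X - \mathbf{p}_k \right) + \mathbf{p}_k + \mathbf{t}_k \right) + \mathbf{t}^{cam}, \\
\mathbf{w}(\mathbf{x}) &= \pi(K X') - \mathbf{x},
\end{align*}
where points that belong to no instance are transformed by the camera motion alone.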
Although our system produces first reasonable instance motion predictions,
estimates the camera ego-motion reasonably well,
and achieves high accuracy in classifying objects as moving or non-moving,
the accuracy of the motion predictions is not yet convincing.
More work will be required to bring the system closer to competitive accuracy,
including supervision with the optical flow re-projection loss instead of 3D motion ground truth,
and improvements to the network architecture and training process.
We thus presented a partial step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
to previous end-to-end deep networks for dense motion estimation, the output
of our network is highly interpretable, which may also bring benefits for safety-critical
applications.
\subsection{Future Work}
\paragraph{Mask R-CNN baseline}
As our Mask R-CNN re-implementation is still not as accurate as the results reported in the
original paper, working on the implementation details of this baseline is a
critical, direct next step. Recently, a highly accurate third-party implementation of Mask
R-CNN in TensorFlow was released, which should be studied to this end.
\paragraph{Instance motion supervision with the optical flow re-projection loss}
We developed and implemented a loss for penalizing instance motions with optical flow ground truth,
but could not yet train a network with it due to time constraints. Conducting
experiments with this loss is the second direct next step.
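As a sketch, using the composed flow $\mathbf{w}(\mathbf{x})$ from above, such a loss can be written as
\begin{equation*}
L_{flow} = \frac{1}{N} \sum_{\mathbf{x}} \rho\left( \mathbf{w}(\mathbf{x}) - \mathbf{w}^{gt}(\mathbf{x}) \right),
\end{equation*}
where $\mathbf{w}^{gt}$ is the ground truth optical flow, $N$ the number of pixels
considered, and $\rho$ a (robust) penalty, for example the $L_1$ norm; the exact
weighting and normalization are implementation choices.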
\paragraph{Training on all Virtual KITTI sequences}
We only trained our models on the \emph{clone} variants of the Virtual KITTI sequences
to speed up training. Training on all sequence variants, which include different camera
angles, lighting, and weather conditions, could further improve the robustness of our models.


@@ -148,10 +148,6 @@ the predicted camera motion.
\subsection{Virtual KITTI: Training setup}
\label{ssec:setup}
\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train for a total of 192K iterations on the Virtual KITTI training set.
@@ -184,10 +180,11 @@ are in general expected to output.
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/results}
\caption{
Visualization of results on Virtual KITTI with XYZ input, camera motion prediction and 3D motion supervision.
For each example, we show the results with Motion R-CNN ResNet and ResNet-FPN
in the upper and lower row, respectively.
From left to right, we show the input image with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
@@ -195,46 +192,87 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
\label{figure:vkitti}
\end{figure}
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/moving}
\caption{
We visually compare a Motion R-CNN ResNet trained without (upper row) and
with (lower row) the classification of objects into moving and non-moving.
Note that in the selected example, all cars are parked, and thus the predicted
motion in the first row is an error.
From left to right, we show the input image with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
\label{figure:moving}
\end{figure}
{
\begin{table}[t]
\centering
\begin{tabular}{@{}*{10}{c}@{}}
\toprule
\multicolumn{1}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} &\multicolumn{2}{c}{Flow Error} \\
\cmidrule(lr){1-1}\cmidrule(lr){2-6}\cmidrule(l){7-8}\cmidrule(l){9-10}
FPN & $E_{R} [deg]$ & $E_{t} [m]$ & $E_{p} [m] $ & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam} [deg]$ & $E_{t}^{cam} [m]$ & AEE & Fl-all \\\midrule
- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & -\% \\\midrule
$\times$ & 0.301 & 0.237 & 3.331 & 0.790 & 0.916 & 0.087 & 0.053 & 11.17 & 24.91\% \\
\checkmark & 0.293 & 0.210 & 1.958 & 0.844 & 0.914 & 0.169 & 0.050 & 8.29 & 45.22\% \\
\bottomrule
\end{tabular}
\caption {
Evaluation of different metrics on the Virtual KITTI validation set.
AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
We compare network variants with and without FPN.
Camera and instance motion errors are averaged over the validation set.
Quantities in parentheses in the first row are the average ground truth values for the estimated
quantity. For example, we compare the error in camera angle, $E_{R}^{cam} [deg]$,
to the average rotation angle in the ground truth camera motions.
}
\label{table:vkitti}
\end{table}
}
For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both the Motion R-CNN ResNet and ResNet-FPN variants and supervise
camera and instance motions with 3D motion ground truth.
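Here, the XYZ coordinates are obtained by backprojecting the depth maps; assuming a
standard pinhole model with focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$,
a pixel $(u, v)$ with depth $d$ maps to
\begin{equation*}
X = \frac{(u - c_x)\, d}{f_x}, \qquad Y = \frac{(v - c_y)\, d}{f_y}, \qquad Z = d.
\end{equation*}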
In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
In Figure \ref{figure:moving}, we visually justify the addition of the classifier
that distinguishes moving from non-moving objects.
In Table \ref{table:vkitti}, we compare the performance of different network variants
on the Virtual KITTI validation set.
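For reference, the flow metrics in Table \ref{table:vkitti} can be written as follows:
with estimated flow $\mathbf{w}_i$ and ground truth $\mathbf{w}^{gt}_i$ at pixel $i$ of $N$,
\begin{equation*}
\mathrm{AEE} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \mathbf{w}_i - \mathbf{w}^{gt}_i \right\rVert_2,
\end{equation*}
and Fl-all is the percentage of pixels for which this endpoint error is both
$\geq 3$ pixels and $\geq 5\%$ of $\lVert \mathbf{w}^{gt}_i \rVert_2$.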
\paragraph{Camera motion}
Both variants achieve a low error in predicted camera translation, relative to
the average ground truth camera translation. The camera rotation angle error
is relatively high compared to the small average ground truth camera rotation.
Although both variants use the exact same network for predicting the camera motion,
the FPN variant performs worse here, with the error in rotation angle twice as high.
One possible explanation that should be investigated in further work is
that in the FPN variant, all blocks in the backbone are shared between the camera
motion branch and the feature pyramid. In the variant without FPN, the C$5$ and
C$6$ blocks are only used in the camera branch, and thus only experience weight
updates due to the camera motion loss.
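A natural formalization of these two error measures (stated here as a sketch) is the
rotation angle between the predicted rotation $\hat{R}$ and the ground truth $R^{gt}$,
and the Euclidean distance between the translations:
\begin{equation*}
E_{R}^{cam} = \frac{180}{\pi} \arccos\left( \frac{\operatorname{tr}\left( \hat{R} (R^{gt})^{\top} \right) - 1}{2} \right), \qquad
E_{t}^{cam} = \left\lVert \hat{\mathbf{t}} - \mathbf{t}^{gt} \right\rVert_2 .
\end{equation*}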
\paragraph{Instance motion}
The object pivots are estimated with relatively high precision in both variants
(given that the scenes are at a realistic scale), although the FPN variant is significantly more
precise, which we ascribe to the higher-resolution features used in this variant.
The predicted 3D object translations and rotations still have a relatively high
error, compared to the average actual (ground truth) translations and rotations,
which may be due to implementation issues or problems with the current 3D motion
ground truth loss.
The FPN variant is only slightly more accurate for these predictions, which suggests
that there may still be issues with our implementation, as one would expect a
clearer advantage from the FPN here.
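Analogously, a sketch of the instance motion metrics: $E_{R}$ and $E_{t}$ follow the
camera motion definitions, evaluated per object and averaged over all matched instances,
the pivot error is a Euclidean distance, and the moving/non-moving decision is evaluated
as binary classification with precision $O_{pr}$ and recall $O_{rc}$:
\begin{equation*}
E_{p} = \left\lVert \hat{\mathbf{p}} - \mathbf{p}^{gt} \right\rVert_2, \qquad
O_{pr} = \frac{TP}{TP + FP}, \qquad
O_{rc} = \frac{TP}{TP + FN},
\end{equation*}
where a true positive is an object correctly classified as moving.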
\paragraph{Instance segmentation}
As Figure \ref{figure:vkitti} shows, our instance segmentation results in
many cases still lack the accuracy seen in the Mask R-CNN Cityscapes \cite{MaskRCNN} results,
which is likely due to implementation details.

Binary file not shown.

BIN figures/moving.pdf (executable file; binary file not shown)

BIN figures/results.pdf (executable file; binary file not shown)