Simon Meister 2017-11-06 11:01:30 +01:00
parent 32aae94005
commit 0dd8445641
5 changed files with 34 additions and 15 deletions


@ -15,6 +15,8 @@ Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backb
Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
Alternatively, we experiment with additionally concatenating the XYZ coordinates of each frame
into the input.
We do not introduce a separate network for computing region proposals, but use our modified backbone network
as both the first-stage RPN and the second-stage feature extractor for region cropping.
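As a hedged illustration (our own sketch, not the authors' code), the six-channel two-frame input and a correspondingly widened ResNet stem could be set up as follows; layer names, frame sizes, and the optional XYZ channel count are assumptions.
\begin{verbatim}
# Sketch only: depth-concatenated two-frame input for a widened ResNet-50 stem.
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50()
# Replace the 3-channel stem with a 6-channel one for [I_t, I_{t+1}].
backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)

I_t  = torch.randn(1, 3, 375, 1242)   # frame at time t (KITTI-sized, illustrative)
I_t1 = torch.randn(1, 3, 375, 1242)   # frame at time t+1
x = torch.cat([I_t, I_t1], dim=1)     # (1, 6, H, W)
# With optional per-frame XYZ coordinates, the stem would grow to 12 input channels.
stem_out = backbone.conv1(x)
\end{verbatim}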
Technically, our feature encoder network will have to learn a motion representation similar to
@ -25,8 +27,6 @@ from the encoder is integrated for specific objects via RoI cropping and
processed by the RoI head for each object.
\todo{figure of backbone}
\todo{introduce optional XYZ input}
\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
For the $k$-th object proposal, we predict the rigid transformation $(R_t^k, t_t^k) \in \mathbf{SE}(3)$
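As an illustrative sketch (an assumption for exposition, not necessarily the exact parametrization used here), a rigid motion with an object pivot point $p_t^k$, as in SfM-Net, would move a 3D point $X$ on the $k$-th object according to
\begin{equation}
X' = R_t^k \left( X - p_t^k \right) + p_t^k + t_t^k .
\end{equation}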


@ -110,7 +110,19 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\todo{add this}
\subsection{Experiments on Virtual KITTI}
\todo{complete this}
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/vkitti}
\caption{
Visualization of results for the ResNet (without FPN) architecture with XYZ input,
camera motion prediction, and 3D motion supervision.
From left to right, we show the input image with instance segmentation results as an overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
\label{figure:vkitti}
\end{figure}
{
\begin{table}[t]
@ -119,15 +131,9 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\toprule
\multicolumn{4}{c}{Network} & \multicolumn{3}{c}{Instance Motion Error} & \multicolumn{2}{c}{Camera Motion Error} &\multicolumn{2}{c}{Optical Flow Error} \\
\cmidrule(lr){1-4}\cmidrule(lr){5-7}\cmidrule(l){8-9}\cmidrule(l){10-11}
FPN & cam. & sup. & XYZ & $E_{R}$ & $E_{t}$ & $E_{p}$ & $E_{R}^{cam}$ & $E_{t}^{cam}$ & AEE & Fl-all \\\midrule
$\times$ & $\times$ & 3D & & ? & ? & ? & - & - & ? & ?\% \\
\checkmark & $\times$ & 3D & & ? & ? & ? & - & - & ? & ?\% \\
$\times$ & \checkmark & 3D & & ? & ? & ? & ? & ? & ? & ?\% \\
\checkmark & \checkmark & 3D & & ? & ? & ? & ? & ? & ? & ?\% \\
$\times$ & $\times$ & flow & & ? & ? & ? & - & - & ? & ?\% \\
\checkmark & $\times$ & flow & & ? & ? & ? & - & - & ? & ?\% \\
$\times$ & \checkmark & flow & & ? & ? & ? & ? & ? & ? & ?\% \\
\checkmark & \checkmark & flow & & ? & ? & ? & ? & ? & ? & ?\% \\
FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & 0.1 & 0.04 & 6.73 & 26.59\% \\
\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & 0.22 & 0.07 & 12.62 & 46.28\% \\
\bottomrule
\end{tabular}
@ -135,8 +141,10 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
Comparison of network variants on our Virtual KITTI validation set.
AEE: Average Endpoint Error; Fl-all: Ratio of pixels for which the flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$ of the ground-truth flow magnitude.
We optionally train camera motion prediction (cam.)
or replace the ResNet50 backbone with ResNet50-FPN (FPN).
Camera and instance motion errors are averaged over the validation set.
We optionally train camera motion prediction (cam.),
replace the ResNet50 backbone with ResNet50-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise
object motions (sup.) with 3D motion ground truth (3D) or
with a 2D re-projection loss based on flow ground truth (flow).
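For reference, a minimal sketch of how these two flow metrics can be computed from dense flow fields (our own illustration; array shapes follow common conventions, thresholds follow the definitions above):
\begin{verbatim}
# Sketch: AEE and Fl-all over dense flow arrays of shape (H, W, 2).
import numpy as np

def flow_metrics(flow_pred, flow_gt):
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel endpoint error
    mag = np.linalg.norm(flow_gt, axis=-1)              # ground-truth flow magnitude
    aee = err.mean()                                    # average endpoint error
    # Outlier: wrong by both >= 3 px and >= 5% of the ground-truth magnitude.
    fl_all = 100.0 * np.mean((err >= 3.0) & (err >= 0.05 * mag))
    return aee, fl_all
\end{verbatim}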

BIN figures/net_intro.png (new executable file, 71 KiB)

BIN figures/vkitti_cam.png (new executable file, 2.7 MiB)


@ -62,7 +62,7 @@ As a foundation for image matching, we extend the ResNet \cite{ResNet} backbone
two concatenated images as input, similar to FlowNetS \cite{FlowNet}.
This results in a fully integrated end-to-end network architecture for segmenting pixels into instances
and estimating the motion of all detected instances without any limitations
as to the number or variety of object instances.
as to the number or variety of object instances (Figure \ref{figure:net_intro}).
Eventually, we want to extend our method to include depth prediction,
yielding the first end-to-end deep network to perform 3D scene flow estimation
@ -70,6 +70,17 @@ in a principled way from considering individual objects.
For now, we will work with RGB-D frames to break down the problem into
manageable pieces.
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/net_intro}
\caption{
Overview of our network based on Mask R-CNN. For each RoI, we predict the instance motion
in parallel to the class, bounding box, and mask. From the bottleneck, we branch off a
fully connected layer to predict the camera motion.
}
\label{figure:net_intro}
\end{figure}
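As a schematic sketch of this layout (an illustration under our own assumptions, not the actual implementation; layer names, feature dimensions, and the nine motion parameters are placeholders, and the mask branch is omitted for brevity):
\begin{verbatim}
# Sketch: per-RoI motion prediction in parallel to class/box (mask branch omitted),
# plus a camera motion branch off the backbone bottleneck.
import torch.nn as nn

class MotionRoIHead(nn.Module):
    def __init__(self, in_dim=1024, num_classes=81):
        super().__init__()
        self.cls_score   = nn.Linear(in_dim, num_classes)
        self.bbox_pred   = nn.Linear(in_dim, num_classes * 4)
        self.motion_pred = nn.Linear(in_dim, 9)   # e.g. rotation, translation, pivot

    def forward(self, roi_feat):                   # roi_feat: (num_rois, in_dim)
        return (self.cls_score(roi_feat),
                self.bbox_pred(roi_feat),
                self.motion_pred(roi_feat))

class CameraMotionHead(nn.Module):
    def __init__(self, bottleneck_dim=2048):
        super().__init__()
        self.fc = nn.Linear(bottleneck_dim, 6)     # global camera rotation + translation

    def forward(self, bottleneck_feat):            # (N, bottleneck_dim)
        return self.fc(bottleneck_feat)
\end{verbatim}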
\subsection{Related work}
In the following, we will refer to systems which use deep networks for all