Simon Meister 2017-11-06 11:01:30 +01:00
parent 32aae94005
commit 0dd8445641
5 changed files with 34 additions and 15 deletions


@ -15,6 +15,8 @@ Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backb
Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
Alternatively, we experiment with additionally concatenating the XYZ coordinates of each frame
into the input.
We do not introduce a separate network for computing region proposals, but use our modified backbone network
as both the first-stage RPN and the second-stage feature extractor for region cropping.
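As a hedged illustration (our own sketch, not the authors' code), the six-channel two-frame input and a correspondingly widened ResNet stem could be set up as follows; layer names, frame sizes, and the optional XYZ channel count are assumptions.
\begin{verbatim}
# Sketch only: depth-concatenated two-frame input for a widened ResNet-50 stem.
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50()
# Replace the 3-channel stem with a 6-channel one for [I_t, I_{t+1}].
backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)

I_t  = torch.randn(1, 3, 375, 1242)   # frame at time t (KITTI-sized, illustrative)
I_t1 = torch.randn(1, 3, 375, 1242)   # frame at time t+1
x = torch.cat([I_t, I_t1], dim=1)     # (1, 6, H, W)
# With optional per-frame XYZ coordinates, the stem would grow to 12 input channels.
stem_out = backbone.conv1(x)
\end{verbatim}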
Technically, our feature encoder network will have to learn a motion representation similar to
@ -25,8 +27,6 @@ from the encoder is integrated for specific objects via RoI cropping and
processed by the RoI head for each object.
\todo{figure of backbone}
\todo{introduce optional XYZ input}
\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
For the $k$-th object proposal, we predict the rigid transformation $(R_t^k, t_t^k) \in \mathbf{SE}(3)$
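As an illustrative sketch (an assumption for exposition, not necessarily the exact parametrization used here), a rigid motion with an object pivot point $p_t^k$, as in SfM-Net, would move a 3D point $X$ on the $k$-th object according to
\begin{equation}
X' = R_t^k \left( X - p_t^k \right) + p_t^k + t_t^k .
\end{equation}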


@ -110,7 +110,19 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\todo{add this}
\subsection{Experiments on Virtual KITTI}
\todo{complete this}
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/vkitti}
\caption{
Visualization of results for the ResNet (without FPN) architecture with XYZ input,
camera motion prediction, and 3D motion supervision.
From left to right, we show the input image with instance segmentation results as an overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
\label{figure:vkitti}
\end{figure}
{
\begin{table}[t]
@ -119,15 +131,9 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\toprule
\multicolumn{4}{c}{Network} & \multicolumn{3}{c}{Instance Motion Error} & \multicolumn{2}{c}{Camera Motion Error} &\multicolumn{2}{c}{Optical Flow Error} \\
\cmidrule(lr){1-4}\cmidrule(lr){5-7}\cmidrule(l){8-9}\cmidrule(l){10-11}
FPN & cam. & sup. & XYZ & $E_{R}$ & $E_{t}$ & $E_{p}$ & $E_{R}^{cam}$ & $E_{t}^{cam}$ & AEE & Fl-all \\\midrule
$\times$ & $\times$ & 3D & & ? & ? & ? & - & - & ? & ?\% \\
\checkmark & $\times$ & 3D & & ? & ? & ? & - & - & ? & ?\% \\
$\times$ & \checkmark & 3D & & ? & ? & ? & ? & ? & ? & ?\% \\
\checkmark & \checkmark & 3D & & ? & ? & ? & ? & ? & ? & ?\% \\
$\times$ & $\times$ & flow & & ? & ? & ? & - & - & ? & ?\% \\
\checkmark & $\times$ & flow & & ? & ? & ? & - & - & ? & ?\% \\
$\times$ & \checkmark & flow & & ? & ? & ? & ? & ? & ? & ?\% \\
\checkmark & \checkmark & flow & & ? & ? & ? & ? & ? & ? & ?\% \\
FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & 0.1 & 0.04 & 6.73 & 26.59\% \\
\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & 0.22 & 0.07 & 12.62 & 46.28\% \\
\bottomrule
\end{tabular}
@ -135,8 +141,10 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
Comparison of network variants on our Virtual KITTI validation set.
AEE: Average Endpoint Error; Fl-all: Ratio of pixels for which the flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$ of the ground-truth flow magnitude.
We optionally train camera motion prediction (cam.)
or replace the ResNet50 backbone with ResNet50-FPN (FPN).
Camera and instance motion errors are averaged over the validation set.
We optionally train camera motion prediction (cam.),
replace the ResNet50 backbone with ResNet50-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise
object motions (sup.) with 3D motion ground truth (3D) or
with a 2D re-projection loss based on flow ground truth (flow).
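For reference, a minimal sketch of how these two flow metrics can be computed from dense flow fields (our own illustration; array shapes follow common conventions, thresholds follow the definitions above):
\begin{verbatim}
# Sketch: AEE and Fl-all over dense flow arrays of shape (H, W, 2).
import numpy as np

def flow_metrics(flow_pred, flow_gt):
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel endpoint error
    mag = np.linalg.norm(flow_gt, axis=-1)              # ground-truth flow magnitude
    aee = err.mean()                                    # average endpoint error
    # Outlier: wrong by both >= 3 px and >= 5% of the ground-truth magnitude.
    fl_all = 100.0 * np.mean((err >= 3.0) & (err >= 0.05 * mag))
    return aee, fl_all
\end{verbatim}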

BIN figures/net_intro.png (new executable file, 71 KiB)

BIN figures/vkitti_cam.png (new executable file, 2.7 MiB)


@ -62,7 +62,7 @@ As a foundation for image matching, we extend the ResNet \cite{ResNet} backbone
two concatenated images as input, similar to FlowNetS \cite{FlowNet}.
This results in a fully integrated end-to-end network architecture for segmenting pixels into instances
and estimating the motion of all detected instances without any limitations
as to the number or variety of object instances.
as to the number or variety of object instances (Figure \ref{figure:net_intro}).
Eventually, we want to extend our method to include depth prediction,
yielding the first end-to-end deep network to perform 3D scene flow estimation
@ -70,6 +70,17 @@ in a principled way from considering individual objects.
For now, we will work with RGB-D frames to break down the problem into
manageable pieces.
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/net_intro}
\caption{
Overview of our network based on Mask R-CNN. For each RoI, we predict the instance motion
in parallel to the class, bounding box, and mask. From the bottleneck, we branch off a
fully connected layer to predict the camera motion.
}
\label{figure:net_intro}
\end{figure}
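As a schematic sketch of this layout (an illustration under our own assumptions, not the actual implementation; layer names, feature dimensions, and the nine motion parameters are placeholders, and the mask branch is omitted for brevity):
\begin{verbatim}
# Sketch: per-RoI motion prediction in parallel to class/box (mask branch omitted),
# plus a camera motion branch off the backbone bottleneck.
import torch.nn as nn

class MotionRoIHead(nn.Module):
    def __init__(self, in_dim=1024, num_classes=81):
        super().__init__()
        self.cls_score   = nn.Linear(in_dim, num_classes)
        self.bbox_pred   = nn.Linear(in_dim, num_classes * 4)
        self.motion_pred = nn.Linear(in_dim, 9)   # e.g. rotation, translation, pivot

    def forward(self, roi_feat):                   # roi_feat: (num_rois, in_dim)
        return (self.cls_score(roi_feat),
                self.bbox_pred(roi_feat),
                self.motion_pred(roi_feat))

class CameraMotionHead(nn.Module):
    def __init__(self, bottleneck_dim=2048):
        super().__init__()
        self.fc = nn.Linear(bottleneck_dim, 6)     # global camera rotation + translation

    def forward(self, bottleneck_feat):            # (N, bottleneck_dim)
        return self.fc(bottleneck_feat)
\end{verbatim}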
\subsection{Related work}
In the following, we will refer to systems which use deep networks for all