diff --git a/approach.tex b/approach.tex
index ff62dc3..ee520fa 100644
--- a/approach.tex
+++ b/approach.tex
@@ -15,6 +15,8 @@ Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backb
 Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable
 image matching, laying the foundation for our motion estimation.
 Instead of taking a single image as input to the backbone, we depth-concatenate two temporally
 consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
+In addition, we experiment with concatenating the XYZ coordinates of each frame
+into the input.
 We do not introduce a separate network for computing region proposals and use our modified
 backbone network as both first stage RPN and second stage feature extractor for region cropping.
 Technically, our feature encoder network will have to learn a motion representation similar to
@@ -25,8 +27,6 @@ from the encoder is integrated for specific objects via RoI cropping and
 processed by the RoI head for each object.
 \todo{figure of backbone}
-\todo{introduce optional XYZ input}
-
 \paragraph{Per-RoI motion prediction}
 We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
 For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$
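For reference, the six-channel (or, with XYZ input, twelve-channel) backbone input described in approach.tex could be assembled as in the following sketch. It assumes NCHW tensors and hypothetical `xyz_t`/`xyz_tp1` per-pixel coordinate maps; the exact channel layout is not specified in the text above, so treat this as an illustration rather than the actual implementation.

```python
import torch

def build_backbone_input(img_t, img_tp1, xyz_t=None, xyz_tp1=None):
    """Depth-concatenate two consecutive frames (and optionally their
    per-pixel XYZ coordinate maps) along the channel dimension.

    img_t, img_tp1: [N, 3, H, W] RGB frames I_t and I_{t+1}
    xyz_t, xyz_tp1: [N, 3, H, W] per-pixel 3D coordinates (assumed layout)
    returns:        [N, 6, H, W] or [N, 12, H, W] backbone input
    """
    maps = [img_t, img_tp1]
    if xyz_t is not None and xyz_tp1 is not None:
        maps += [xyz_t, xyz_tp1]
    return torch.cat(maps, dim=1)

# The first convolution of the ResNet stem then needs a matching number of
# input channels, e.g. 6 (RGB only) or 12 (RGB + XYZ) instead of the usual 3:
stem_conv = torch.nn.Conv2d(in_channels=6, out_channels=64,
                            kernel_size=7, stride=2, padding=3, bias=False)
```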
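For the per-RoI motion parametrization, the standard action of a rigid transformation on a point of the $k$-th instance would read as below; the choice of reference frame (camera coordinates at time $t$) and the absence of a pivot point are assumptions made for illustration, not statements from the paper.

```latex
% Assumed: X_t is a 3D point on instance k in camera coordinates at time t.
\begin{equation}
  X_{t+1} = R_t^k X_t + t_t^k,
  \qquad R_t^k \in \mathbf{SO}(3), \; t_t^k \in \mathbb{R}^3 .
\end{equation}
```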
diff --git a/experiments.tex b/experiments.tex
index 38faa2d..0e50a12 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -110,7 +110,19 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
 \todo{add this}
 \subsection{Experiments on Virtual KITTI}
-\todo{complete this}
+
+\begin{figure}[t]
+  \centering
+  \includegraphics[width=\textwidth]{figures/vkitti}
+\caption{
+Visualization of results with XYZ input, camera motion prediction, and 3D motion supervision
+with the ResNet (without FPN) architecture.
+From left to right, we show the input image with instance segmentation results as an overlay,
+the estimated flow, and the flow error map.
+The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
+}
+\label{figure:vkitti}
+\end{figure}
 {
 \begin{table}[t]
@@ -119,15 +131,9 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
 \toprule
 \multicolumn{4}{c}{Network} & \multicolumn{3}{c}{Instance Motion Error} & \multicolumn{2}{c}{Camera Motion Error} & \multicolumn{2}{c}{Optical Flow Error} \\
 \cmidrule(lr){1-4}\cmidrule(lr){5-7}\cmidrule(l){8-9}\cmidrule(l){10-11}
-    FPN & cam. & sup. & XYZ & $E_{R}$ & $E_{t}$ & $E_{p}$ & $E_{R}^{cam}$ & $E_{t}^{cam}$ & AEE & Fl-all \\\midrule
-    $\times$ & $\times$ & 3D & & ? & ? & ? & - & - & ? & ?\% \\
-    \checkmark & $\times$ & 3D & & ? & ? & ? & - & - & ? & ?\% \\
-    $\times$ & \checkmark & 3D & & ? & ? & ? & ? & ? & ? & ?\% \\
-    \checkmark & \checkmark & 3D & & ? & ? & ? & ? & ? & ? & ?\% \\
-    $\times$ & $\times$ & flow & & ? & ? & ? & - & - & ? & ?\% \\
-    \checkmark & $\times$ & flow & & ? & ? & ? & - & - & ? & ?\% \\
-    $\times$ & \checkmark & flow & & ? & ? & ? & ? & ? & ? & ?\% \\
-    \checkmark & \checkmark & flow & & ? & ? & ? & ? & ? & ? & ?\% \\
+    FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
+    $\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & 0.1 & 0.04 & 6.73 & 26.59\% \\
+    \checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & 0.22 & 0.07 & 12.62 & 46.28\% \\
 \bottomrule
 \end{tabular}
@@ -135,8 +141,10 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
 Comparison of network variants on our Virtual KITTI validation set.
 AEE: Average Endpoint Error;
 Fl-all: Ratio of pixels where the flow estimate is wrong by both $\geq 3$ pixels and $\geq 5\%$.
-We optionally train camera motion prediction (cam.)
-or replace the ResNet50 backbone with ResNet50-FPN (FPN).
+Camera and instance motion errors are averaged over the validation set.
+We optionally train camera motion prediction (cam.),
+replace the ResNet50 backbone with ResNet50-FPN (FPN),
+or feed XYZ coordinates into the backbone (XYZ).
 We either supervise object motions (sup.) with 3D motion ground truth (3D)
 or with a 2D re-projection loss based on flow ground truth (flow).
diff --git a/figures/net_intro.png b/figures/net_intro.png
new file mode 100755
index 0000000..451b69c
Binary files /dev/null and b/figures/net_intro.png differ
diff --git a/figures/vkitti_cam.png b/figures/vkitti_cam.png
new file mode 100755
index 0000000..d6c0f07
Binary files /dev/null and b/figures/vkitti_cam.png differ
diff --git a/introduction.tex b/introduction.tex
index 475d5e5..b2cb804 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -62,7 +62,7 @@ As a foundation for image matching, we extend the ResNet \cite{ResNet} backbone
 two concatenated images as input, similar to FlowNetS \cite{FlowNet}.
 This results in a fully integrated end-to-end network architecture for segmenting pixels into instances
 and estimating the motion of all detected instances without any limitations
-as to the number or variety of object instances.
+as to the number or variety of object instances (Figure~\ref{figure:net_intro}).
 Eventually, we want to extend our method to include depth prediction,
 yielding the first end-to-end deep network to perform 3D scene flow estimation
@@ -70,6 +70,17 @@ in a principled way from considering individual objects.
 For now, we will work with RGB-D frames to break down the problem into manageable pieces.
 
+\begin{figure}[t]
+  \centering
+  \includegraphics[width=\textwidth]{figures/net_intro}
+\caption{
+Overview of our network based on Mask R-CNN. For each RoI, we predict the instance motion
+in parallel to the class, bounding box, and mask. We branch off a fully connected
+layer from the bottleneck to predict the camera motion.
+}
+\label{figure:net_intro}
+\end{figure}
+
 \subsection{Related work}
 In the following, we will refer to systems which use deep networks for all
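The optical flow metrics in the Virtual KITTI table (AEE and Fl-all) are defined in its caption. A minimal sketch of how they could be computed from dense flow fields follows; array names are illustrative and not taken from the paper's code.

```python
import numpy as np

def flow_errors(flow_est, flow_gt):
    """flow_est, flow_gt: [H, W, 2] arrays of (u, v) optical flow vectors.

    Returns (AEE, Fl-all), where AEE is the mean endpoint error over all
    pixels and Fl-all is the fraction of pixels whose endpoint error is
    both >= 3 px and >= 5% of the ground-truth flow magnitude.
    """
    epe = np.linalg.norm(flow_est - flow_gt, axis=-1)   # per-pixel endpoint error
    gt_mag = np.linalg.norm(flow_gt, axis=-1)
    aee = float(epe.mean())
    fl_all = float(np.mean((epe >= 3.0) & (epe >= 0.05 * gt_mag)))
    return aee, fl_all
```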
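As a reading aid for the overview figure added in introduction.tex, the sketch below shows one plausible shape of the two added heads: a per-RoI fully connected motion head running in parallel to the class, box, and mask heads, and a camera motion branch off the backbone bottleneck. Layer sizes and the 6-DoF (3 rotation + 3 translation parameters) output are assumptions; the paper's actual parametrization may differ.

```python
import torch
import torch.nn as nn

class MotionHeads(nn.Module):
    """Illustrative only: per-RoI instance motion head and a global camera
    motion branch; feature sizes and the 6-DoF output are assumed."""

    def __init__(self, roi_feat_dim=1024, bottleneck_dim=2048):
        super().__init__()
        # Per-RoI head: one rigid motion (3 rotation + 3 translation
        # parameters) per object proposal, predicted alongside class/box/mask.
        self.instance_motion = nn.Linear(roi_feat_dim, 6)
        # Camera branch: a single fully connected layer on the spatially
        # pooled bottleneck feature map predicts the global camera motion.
        self.camera_motion = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(bottleneck_dim, 6),
        )

    def forward(self, roi_feats, bottleneck):
        # roi_feats:  [num_rois, roi_feat_dim] pooled per-RoI features
        # bottleneck: [N, bottleneck_dim, H', W'] backbone bottleneck output
        return self.instance_motion(roi_feats), self.camera_motion(bottleneck)
```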