Mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git (synced 2025-12-12 17:35:51 +00:00)
Commit 0dd8445641 (parent 32aae94005): WIP
@@ -15,6 +15,8 @@ Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backbone
Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input map with six channels.
We also experiment with additionally concatenating the XYZ coordinates of each frame
into the input.
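A minimal NumPy sketch of this input packing (function and array names are illustrative assumptions, not code from the thesis):
\begin{verbatim}
import numpy as np

def build_backbone_input(rgb_t, rgb_t1, xyz_t=None, xyz_t1=None):
    # rgb_t, rgb_t1: (H, W, 3) consecutive frames I_t and I_{t+1}
    # xyz_t, xyz_t1: optional (H, W, 3) per-pixel XYZ coordinate maps
    maps = [rgb_t, rgb_t1]                    # 3 + 3 = 6 channels
    if xyz_t is not None and xyz_t1 is not None:
        maps += [xyz_t, xyz_t1]               # + 3 + 3 = 12 channels
    return np.concatenate(maps, axis=-1)      # depth-concatenate along channels
\end{verbatim}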
We do not introduce a separate network for computing region proposals; instead, we use our modified backbone network
as both first-stage RPN and second-stage feature extractor for region cropping.
Technically, our feature encoder network will have to learn a motion representation similar to
@@ -25,8 +27,6 @@ from the encoder is integrated for specific objects via RoI cropping and
processed by the RoI head for each object.
\todo{figure of backbone}
\todo{introduce optional XYZ input}
\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
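To make this concrete (a sketch; the exact pivot handling in \cite{SfmNet,SE3Nets} and in our network may differ), such a rigid motion with rotation $R \in \mathbf{SO}(3)$ and translation $t \in \mathbb{R}^3$ moves a 3D point $X$ of an object as
\[
X' = R\,(X - p) + p + t,
\]
where $p$ is an optional pivot point (e.g., the object centroid); with $p = 0$ this reduces to $X' = R\,X + t$.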
For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$
@@ -110,7 +110,19 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\todo{add this}
\subsection{Experiments on Virtual KITTI}
\todo{complete this}
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/vkitti}
\caption{
Visualization of results with XYZ input, camera motion prediction and 3D motion supervision
with the ResNet (without FPN) architecture.
From left to right, we show the input image with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones; a sketch of this criterion follows the figure.
}
\label{figure:vkitti}
\end{figure}
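As a reading aid for the error map above and the table below, a minimal NumPy sketch of the two flow metrics, AEE and the 3 px / 5\% outlier ratio (array shapes and names are assumptions, not code from the thesis):
\begin{verbatim}
import numpy as np

def flow_metrics(flow_pred, flow_gt):
    # flow_pred, flow_gt: (H, W, 2) arrays of (u, v) displacements
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel endpoint error
    aee = epe.mean()                                     # Average Endpoint Error

    # Fl-all: a pixel is wrong if its error is both >= 3 px
    # and >= 5% of the ground-truth flow magnitude.
    gt_mag = np.linalg.norm(flow_gt, axis=-1)
    fl_all = ((epe >= 3.0) & (epe >= 0.05 * gt_mag)).mean()
    return aee, fl_all
\end{verbatim}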
{
\begin{table}[t]
@@ -119,15 +131,9 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\toprule
\multicolumn{4}{c}{Network} & \multicolumn{3}{c}{Instance Motion Error} & \multicolumn{2}{c}{Camera Motion Error} &\multicolumn{2}{c}{Optical Flow Error} \\
\cmidrule(lr){1-4}\cmidrule(lr){5-7}\cmidrule(l){8-9}\cmidrule(l){10-11}
FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
$\times$ & $\times$ & 3D & & ? & ? & ? & - & - & ? & ?\% \\
\checkmark & $\times$ & 3D & & ? & ? & ? & - & - & ? & ?\% \\
$\times$ & \checkmark & 3D & & ? & ? & ? & ? & ? & ? & ?\% \\
\checkmark & \checkmark & 3D & & ? & ? & ? & ? & ? & ? & ?\% \\
$\times$ & $\times$ & flow & & ? & ? & ? & - & - & ? & ?\% \\
\checkmark & $\times$ & flow & & ? & ? & ? & - & - & ? & ?\% \\
$\times$ & \checkmark & flow & & ? & ? & ? & ? & ? & ? & ?\% \\
\checkmark & \checkmark & flow & & ? & ? & ? & ? & ? & ? & ?\% \\
$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & 0.1 & 0.04 & 6.73 & 26.59\% \\
\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & 0.22 & 0.07 & 12.62 & 46.28\% \\
\bottomrule
\end{tabular}
@@ -135,8 +141,10 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
Comparison of network variants on our Virtual KITTI validation set.
AEE: Average Endpoint Error; Fl-all: ratio of pixels where the flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
Camera and instance motion errors are averaged over the validation set.
We optionally train camera motion prediction (cam.),
replace the ResNet50 backbone with ResNet50-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise
object motions (sup.) with 3D motion ground truth (3D) or
with a 2D re-projection loss based on flow ground truth (flow).
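As an illustration of the flow-supervised variant, a minimal NumPy sketch of turning a predicted per-object rigid motion into an induced 2D flow and penalizing its deviation from ground-truth flow (pinhole intrinsics assumed; camera motion and occlusions ignored; names and shapes are illustrative, not the thesis implementation):
\begin{verbatim}
import numpy as np

def reprojection_flow_loss(X, uv, R, t, K, flow_gt):
    # X:       (N, 3) 3D points of one object at time t (e.g. from RGB-D)
    # uv:      (N, 2) pixel coordinates of these points at time t
    # R, t:    predicted rotation (3, 3) and translation (3,)
    # K:       (3, 3) camera intrinsics
    # flow_gt: (N, 2) ground-truth flow sampled at uv
    X_new = X @ R.T + t                      # apply the predicted rigid motion
    proj = X_new @ K.T                       # project into the image plane
    uv_new = proj[:, :2] / proj[:, 2:3]
    flow_induced = uv_new - uv               # 2D flow induced by the 3D motion
    return np.abs(flow_induced - flow_gt).mean()   # L1 re-projection loss
\end{verbatim}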
BIN figures/net_intro.png (new executable file, 71 KiB; binary file not shown)
BIN figures/vkitti_cam.png (new executable file, 2.7 MiB; binary file not shown)
@@ -62,7 +62,7 @@ As a foundation for image matching, we extend the ResNet \cite{ResNet} backbone
two concatenated images as input, similar to FlowNetS \cite{FlowNet}.
This results in a fully integrated end-to-end network architecture for segmenting pixels into instances
and estimating the motion of all detected instances without any limitations
as to the number or variety of object instances (Figure \ref{figure:net_intro}).
Eventually, we want to extend our method to include depth prediction,
yielding the first end-to-end deep network to perform 3D scene flow estimation
@@ -70,6 +70,17 @@ in a principled way from considering individual objects.
For now, we will work with RGB-D frames to break down the problem into
manageable pieces.
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/net_intro}
\caption{
Overview of our network based on Mask R-CNN. For each RoI, we predict the instance motion
in parallel to the class, bounding box and mask. From the bottleneck, we branch off a fully connected
layer to predict the camera motion (see the sketch following this figure).
}
\label{figure:net_intro}
\end{figure}
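A minimal PyTorch-style sketch of the kind of fully connected camera-motion branch mentioned in the caption (the pooling, layer sizes, 6-parameter output and class name are assumptions for illustration, not the thesis code):
\begin{verbatim}
import torch.nn as nn

class CameraMotionHead(nn.Module):
    # Hypothetical FC branch predicting camera motion from the backbone bottleneck.
    def __init__(self, in_channels=2048):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # collapse the bottleneck feature map
        self.fc = nn.Linear(in_channels, 6)  # 3 rotation + 3 translation parameters

    def forward(self, bottleneck):           # bottleneck: (N, C, H, W)
        x = self.pool(bottleneck).flatten(1)
        return self.fc(x)                    # (N, 6) camera motion parameters
\end{verbatim}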
\subsection{Related work}
In the following, we will refer to systems which use deep networks for all