Mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git

Commit 32aae94005 (parent 184b8e2efb): WIP
@@ -72,7 +72,7 @@ and $\alpha, \beta, \gamma$ are the rotation angles in radians about the $x,y,z$

We then extend the Mask R-CNN head by adding a fully connected layer in parallel to the fully connected layers for
refined boxes and classes. Figure \ref{fig:motion_rcnn_head} shows the Motion R-CNN RoI head.
As with refined boxes and masks, we make one separate motion prediction for each class.
Each instance motion is predicted as a set of nine scalar parameters,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
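A minimal sketch of how such a parallel per-class motion layer could look, assuming a PyTorch-style head; the module and variable names are hypothetical and the actual thesis implementation may differ:

\begin{verbatim}
import torch
import torch.nn as nn

class MotionHead(nn.Module):
    # Per-class instance motion prediction, added in parallel to the
    # box/class fully connected layers of the Mask R-CNN head (sketch).
    def __init__(self, in_features, num_classes):
        super().__init__()
        # 9 parameters per class: sin(alpha), sin(beta), sin(gamma),
        # the 3D translation t and the 3D pivot p.
        self.fc_motion = nn.Linear(in_features, num_classes * 9)
        self.num_classes = num_classes

    def forward(self, roi_features):
        # roi_features: (num_rois, in_features) pooled RoI descriptors.
        motions = self.fc_motion(roi_features)
        motions = motions.view(-1, self.num_classes, 9)
        # Clip the three sine components to [-1, 1]; translation and
        # pivot are left unconstrained.
        sines = motions[..., :3].clamp(-1.0, 1.0)
        return torch.cat([sines, motions[..., 3:]], dim=-1)
\end{verbatim}

The three sine components are clamped to $[-1, 1]$ as stated above, while the translation $t_t^k$ and pivot $p_t^k$ are left unconstrained.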
@@ -190,7 +190,7 @@ box of a detected object according to the predicted motion of the object.

We first define the \emph{full image} mask $m_t^k$ for object $k$,
which can be computed from the predicted box mask $m_k^b$ by bilinearly resizing
$m_k^b$ to the width and height of the predicted bounding box and then copying the values
of the resized mask into a full-resolution all-zeros map, starting at the top-right coordinate of the predicted bounding box.
Then,
\begin{equation}
P'_{t+1} =
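A minimal sketch of this pasting step, assuming a PyTorch-style float mask tensor and taking the box corner $(x_0, y_0)$ as the paste origin; the function and variable names are hypothetical, and clipping at image borders is omitted:

\begin{verbatim}
import torch
import torch.nn.functional as F

def full_image_mask(box_mask, box, image_height, image_width):
    # box_mask: (h, w) mask predicted inside the RoI, values in [0, 1].
    # box: (x0, y0, x1, y1) predicted bounding box in image coordinates.
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    bh, bw = max(y1 - y0, 1), max(x1 - x0, 1)
    # Bilinearly resize the box mask to the box height and width.
    resized = F.interpolate(box_mask[None, None], size=(bh, bw),
                            mode="bilinear", align_corners=False)[0, 0]
    # Copy the resized values into an all-zeros full-image map,
    # starting at the corner of the predicted bounding box.
    full = torch.zeros(image_height, image_width)
    full[y0:y0 + bh, x0:x0 + bw] = resized
    return full
\end{verbatim}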
@@ -63,6 +63,7 @@ each corresponding to one of the proposal bounding boxes.

The crops are collected into a batch and passed into a small Fast R-CNN
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
This technique is called \emph{RoI pooling}. % TODO explain how RoI pooling converts full image box coords to crop ranges
\todo{more details and figure}
Thus, given region proposals, the per-region computation is reduced to a single pass through the complete network,
speeding up the system by orders of magnitude. % TODO verify that
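A naive sketch of the coordinate conversion the TODO above refers to, assuming a single feature map with a fixed stride; the function name, stride value and output size are illustrative only, and in practice the crops are extracted with a dedicated RoI pooling operator rather than a Python loop:

\begin{verbatim}
import torch
import torch.nn.functional as F

def roi_pool(features, boxes, stride=16, output_size=7):
    # features: (C, H, W) backbone feature map with the given stride.
    # boxes: list of (x0, y0, x1, y1) proposals in image coordinates.
    crops = []
    for x0, y0, x1, y1 in boxes:
        # Full-image coordinates -> feature-map coordinates via the stride.
        fx0, fy0 = int(x0 / stride), int(y0 / stride)
        fx1 = max(int(x1 / stride), fx0 + 1)
        fy1 = max(int(y1 / stride), fy0 + 1)
        crop = features[:, fy0:fy1, fx0:fx1]
        # Pool each crop to a fixed spatial size so all regions can be
        # batched and passed through the head in one forward pass.
        crops.append(F.adaptive_max_pool2d(crop, output_size))
    return torch.stack(crops)
\end{verbatim}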
@@ -80,8 +81,9 @@ Next, the \emph{backbone} output features are passed into a small, fully convolutional

predicts objectness scores and regresses bounding boxes at each of its output positions.
At any position, bounding boxes are predicted as offsets relative to a fixed set of \emph{anchors} with different
aspect ratios.
\todo{more details and figure}
% TODO more about striding & computing the anchors?
For each anchor at a given position, the objectness score tells us how likely this anchor is to correspond to a detection.
The region proposals can then be obtained as the $N$ highest-scoring anchor boxes.
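A minimal sketch of how region proposals could be obtained from the RPN outputs, using the standard Faster R-CNN box parameterization; non-maximum suppression and per-image bookkeeping are omitted, and all names are hypothetical:

\begin{verbatim}
import torch

def select_proposals(anchors, box_deltas, objectness, top_n=300):
    # anchors: (A, 4) anchor boxes (x0, y0, x1, y1) for all positions.
    # box_deltas: (A, 4) predicted offsets (dx, dy, dw, dh) per anchor.
    # objectness: (A,) predicted objectness score per anchor.
    widths = anchors[:, 2] - anchors[:, 0]
    heights = anchors[:, 3] - anchors[:, 1]
    cx = anchors[:, 0] + 0.5 * widths
    cy = anchors[:, 1] + 0.5 * heights
    # Decode boxes as offsets relative to their anchors.
    cx = cx + box_deltas[:, 0] * widths
    cy = cy + box_deltas[:, 1] * heights
    w = widths * torch.exp(box_deltas[:, 2])
    h = heights * torch.exp(box_deltas[:, 3])
    boxes = torch.stack([cx - 0.5 * w, cy - 0.5 * h,
                         cx + 0.5 * w, cy + 0.5 * h], dim=1)
    # Keep the N highest-scoring anchor boxes as region proposals.
    scores, order = objectness.sort(descending=True)
    return boxes[order[:top_n]], scores[:top_n]
\end{verbatim}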

The \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
@@ -98,7 +98,7 @@ E_{p} = \frac{1}{N}\sum_k \lVert p^{gt,i_k} - p^{k,c_k} \rVert

\end{equation}
is the mean Euclidean distance between the predicted and ground truth pivots.
Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
predicted camera motions.
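In code, $E_p$ amounts to a mean of Euclidean distances; a minimal sketch, assuming the matching of detections to ground truth instances ($i_k$) and the selection of the predicted class ($c_k$) have already been applied:

\begin{verbatim}
import torch

def pivot_error(pred_pivots, gt_pivots):
    # pred_pivots, gt_pivots: (N, 3) tensors of matched 3D pivot points.
    # E_p = (1/N) * sum_k || p^{gt,i_k} - p^{k,c_k} ||
    return (pred_pivots - gt_pivots).norm(dim=1).mean()
\end{verbatim}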

\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
BIN figures/teaser.png (new executable file, 536 KiB; binary file not shown)
@@ -1,3 +1,14 @@

\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/teaser}
\caption{
Given two temporally consecutive frames,
our network segments the pixels of the first frame into individual objects
and estimates all 3D object motions between the frames.
}
\label{figure:teaser}
\end{figure}

\subsection{Motivation \& Goals}

When moving in the real world, it is generally desirable to know which objects exist
@@ -21,7 +32,8 @@ and motion estimation.

Thus, in this work, we aim to develop end-to-end deep networks which can, given
sequences of images, segment the image pixels into object instances and estimate
the location and 3D motion of each object instance relative to the camera
(Figure \ref{figure:teaser}).

\subsection{Technical Outline}
@@ -51,7 +63,6 @@ two concatenated images as input, similar to FlowNetS \cite{FlowNet}.

This results in a fully integrated end-to-end network architecture for segmenting pixels into instances
and estimating the motion of all detected instances without any limitations
on the number or variety of object instances.
Figure \ref{} illustrates our concept.
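A minimal sketch of the two-frame input mentioned in the surrounding text, where the backbone receives both frames concatenated along the channel dimension, similar to FlowNetS; the layer configuration and shapes are illustrative only:

\begin{verbatim}
import torch
import torch.nn as nn

# The backbone sees both frames at once: 2 x RGB = 6 input channels.
first_conv = nn.Conv2d(in_channels=6, out_channels=64,
                       kernel_size=7, stride=2, padding=3)

frame_t  = torch.rand(1, 3, 384, 1280)   # image at time t
frame_t1 = torch.rand(1, 3, 384, 1280)   # image at time t+1
backbone_input = torch.cat([frame_t, frame_t1], dim=1)  # (1, 6, H, W)
features = first_conv(backbone_input)
\end{verbatim}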

Eventually, we want to extend our method to include depth prediction,
yielding the first end-to-end deep network to perform 3D scene flow estimation
@@ -85,6 +96,9 @@ the optical flow estimation problem and introduce reasoning at the object level,

but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components.

In contrast, we tackle motion estimation at the instance level with end-to-end
deep networks and derive optical flow from the individual object motions.
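To make the last step concrete, a sketch of how optical flow could be derived for one detected object from its predicted rigid motion; this assumes known per-pixel depth for frame $t$, known camera intrinsics $K$ and a static camera, and all function and variable names are hypothetical:

\begin{verbatim}
import numpy as np

def flow_from_rigid_motion(depth, mask, R, t, p, K):
    # depth: (H, W) depth map of frame t.
    # mask: (H, W) boolean full-image mask of the object.
    # R: (3, 3) rotation, t: (3,) translation, p: (3,) pivot.
    H, W = depth.shape
    ys, xs = np.nonzero(mask)
    z = depth[ys, xs]
    # Back-project the masked pixels to 3D camera coordinates at time t.
    pts = np.linalg.inv(K) @ np.stack([xs * z, ys * z, z])
    # Apply the rigid object motion about the pivot.
    moved = R @ (pts - p[:, None]) + p[:, None] + t[:, None]
    # Project back to the image plane at time t+1.
    proj = K @ moved
    x2, y2 = proj[0] / proj[2], proj[1] / proj[2]
    # Flow is the displacement of each masked pixel.
    flow = np.zeros((H, W, 2))
    flow[ys, xs, 0] = x2 - xs
    flow[ys, xs, 1] = y2 - ys
    return flow
\end{verbatim}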

\paragraph{Slanted plane methods for 3D scene flow}
The slanted plane approach to scene flow \cite{PRSF, PRSM} models a 3D scene as being
composed of planar segments. Pixels are assigned to one of the planar segments,
@@ -109,7 +123,7 @@ and takes minutes to make a prediction.

Interestingly, the slanted plane methods achieve the current state-of-the-art
in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015},
outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts,
generally taking a fraction of a second instead of minutes per prediction, and can often be made to run in real time.
These runtime concerns restrict the applicability of the current slanted plane models in practical settings,
which often require estimations to be done in real time and for which an end-to-end