WIP
This commit is contained in: parent 184b8e2efb, commit 32aae94005
@@ -72,7 +72,7 @@ and $\alpha, \beta, \gamma$ are the rotation angles in radians about the $x,y,z$
 We then extend the Mask R-CNN head by adding a fully connected layer in parallel to the fully connected layers for
 refined boxes and classes. Figure \ref{fig:motion_rcnn_head} shows the Motion R-CNN RoI head.
 Like for refined boxes and masks, we make one separate motion prediction for each class.
-Each motion is predicted as a set of nine scalar motion parameters,
+Each instance motion is predicted as a set of nine scalar parameters,
 $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
 where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
 Here, we assume that motions between frames are relatively small
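
Since the nine parameters per instance are the three sines plus the 3D translation $t_t^k$ and pivot $p_t^k$, the rotation can be decoded from the sines alone under the stated small-motion assumption (angles in $[-\pi/2, \pi/2]$ have non-negative cosines). The NumPy sketch below is illustrative only; the helper name and the $R = R_z R_y R_x$ composition order are assumptions, not taken from the thesis.

    import numpy as np

    def rotation_from_sines(sin_a, sin_b, sin_c):
        # Clip as in the text, then recover cosines via cos(x) = sqrt(1 - sin(x)^2),
        # which is unambiguous because small motions keep the angles in [-pi/2, pi/2].
        sa, sb, sc = np.clip([sin_a, sin_b, sin_c], -1.0, 1.0)
        ca, cb, cc = np.sqrt(1.0 - np.square([sa, sb, sc]))
        Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])   # rotation about x by alpha
        Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])   # rotation about y by beta
        Rz = np.array([[cc, -sc, 0], [sc, cc, 0], [0, 0, 1]])   # rotation about z by gamma
        return Rz @ Ry @ Rx  # assumed composition order
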
@@ -190,7 +190,7 @@ box of a detected object according to the predicted motion of the object.
 We first define the \emph{full image} mask $m_t^k$ for object k,
 which can be computed from the predicted box mask $m_k^b$ by bilinearly resizing
 $m_k^b$ to the width and height of the predicted bounding box and then copying the values
-of the resized mask into a full image map starting at the top-right coordinate of the predicted bounding box.
+of the resized mask into a full resolution all-zeros map, starting at the top-right coordinate of the predicted bounding box.
 Then,
 \begin{equation}
 P'_{t+1} =
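
A minimal sketch of the box-mask-to-full-image-mask step described in this hunk, using NumPy and OpenCV's bilinear resize. It assumes boxes are given as (x1, y1, x2, y2) pixel coordinates lying inside the image and pastes the resized mask starting at the box corner (here the top-left, the common convention for such boxes); the function name and the border clipping are illustrative, not from the thesis.

    import numpy as np
    import cv2  # used here only for the bilinear resize

    def full_image_mask(box_mask, box, image_height, image_width):
        # Resize the fixed-size box mask m_k^b to the predicted box's width and height.
        x1, y1, x2, y2 = [int(round(v)) for v in box]
        w, h = max(x2 - x1, 1), max(y2 - y1, 1)
        resized = cv2.resize(box_mask.astype(np.float32), (w, h),
                             interpolation=cv2.INTER_LINEAR)
        # Copy the resized values into an all-zeros full-resolution map at the box corner.
        full = np.zeros((image_height, image_width), dtype=np.float32)
        h = min(h, image_height - y1)
        w = min(w, image_width - x1)
        full[y1:y1 + h, x1:x1 + w] = resized[:h, :w]
        return full
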
@@ -63,6 +63,7 @@ each corresponding to one of the proposal bounding boxes.
 The crops are collected into a batch and passed into a small Fast R-CNN
 \emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
 This technique is called \emph{RoI pooling}. % TODO explain how RoI pooling converts full image box coords to crop ranges
+\todo{more details and figure}
 Thus, given region proposals, the per-region computation is reduced to a single pass through the complete network,
 speeding up the system by orders of magnitude. % TODO verify that

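
The % TODO above asks how RoI pooling turns full-image box coordinates into crop ranges. The NumPy sketch below shows one plausible version for boxes inside the image, assuming a backbone stride of 16 and a 7x7 output grid (both placeholder values) and simple floor/ceil rounding; actual implementations differ in rounding and interpolation details.

    import numpy as np

    def roi_pool(features, box, stride=16, out_size=7):
        # features: (H, W, C) backbone output; box: (x1, y1, x2, y2) in image pixels.
        fh, fw, c = features.shape
        # Image coordinates -> feature-map coordinates: divide by the backbone stride.
        x1, y1 = max(int(box[0] // stride), 0), max(int(box[1] // stride), 0)
        x2 = min(max(int(np.ceil(box[2] / stride)), x1 + 1), fw)
        y2 = min(max(int(np.ceil(box[3] / stride)), y1 + 1), fh)
        crop = features[y1:y2, x1:x2]
        # Split the crop into a regular grid and max-pool each cell to a fixed size.
        ys = np.linspace(0, crop.shape[0], out_size + 1).astype(int)
        xs = np.linspace(0, crop.shape[1], out_size + 1).astype(int)
        out = np.empty((out_size, out_size, c), dtype=features.dtype)
        for i in range(out_size):
            for j in range(out_size):
                cell = crop[ys[i]:max(ys[i + 1], ys[i] + 1),
                            xs[j]:max(xs[j + 1], xs[j] + 1)]
                out[i, j] = cell.max(axis=(0, 1))
        return out
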
@@ -80,8 +81,9 @@ Next, the \emph{backbone} output features are passed into a small, fully convolu
 predicts objectness scores and regresses bounding boxes at each of its output positions.
 At any position, bounding boxes are predicted as offsets relative to a fixed set of \emph{anchors} with different
 aspect ratios.
+\todo{more details and figure}
 % TODO more about striding & computing the anchors?
-For each anchor at a given position, the objectness score tells us how likely this anchors is to corresponds to a detection.
+For each anchor at a given position, the objectness score tells us how likely this anchor is to correspond to a detection.
 The region proposals can then be obtained as the N highest scoring anchor boxes.

 The \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
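
A minimal NumPy sketch of the anchor grid and the top-N selection described in this hunk. The stride, anchor scales, and aspect ratios are placeholder values (not specified here in the thesis), and in the full pipeline the regressed box offsets would be applied to the selected anchors as well.

    import numpy as np

    def anchor_grid(feat_h, feat_w, stride=16,
                    scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
        # One anchor per (position, scale, aspect ratio), as (x1, y1, x2, y2)
        # boxes in full-image coordinates, centred on the feature-map cells.
        anchors = []
        for y in range(feat_h):
            for x in range(feat_w):
                cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
                for s in scales:
                    for r in ratios:  # r = width / height
                        w, h = s * np.sqrt(r), s / np.sqrt(r)
                        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
        return np.array(anchors)

    def top_n_proposals(anchors, objectness, n=300):
        # Region proposals = the n anchor boxes with the highest objectness scores.
        return anchors[np.argsort(-objectness)[:n]]
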
@@ -98,7 +98,7 @@ E_{p} = \frac{1}{N}\sum_k \lVert p^{gt,i_k} - p^{k,c_k} \rVert
 \end{equation}
 is the mean euclidean norm between predicted and ground truth pivot.
 Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
-predicted camera motion.
+predicted camera motions.

 \subsection{Training Setup}
 Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
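
The pivot metric in this hunk reduces to a mean of per-instance Euclidean norms. A small NumPy version, assuming predictions have already been matched to ground-truth instances (the indices $i_k$ and predicted classes $c_k$) so that row $k$ of both arrays forms a pair:

    import numpy as np

    def pivot_error(pred_pivots, gt_pivots):
        # E_p = (1/N) * sum_k || p^{gt,i_k} - p^{k,c_k} ||_2 over matched pairs.
        # pred_pivots, gt_pivots: (N, 3) arrays of 3D pivot positions.
        return float(np.mean(np.linalg.norm(gt_pivots - pred_pivots, axis=1)))

The camera-motion metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ mentioned above would be computed analogously over the predicted camera motions.
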
BIN figures/teaser.png (new executable file, 536 KiB; binary file not shown)
@@ -1,3 +1,14 @@
+\begin{figure}[t]
+\centering
+\includegraphics[width=\textwidth]{figures/teaser}
+\caption{
+Given two temporally consecutive frames,
+our network segments the pixels of the first frame into individual objects
+and estimates all 3D object motions between the frames.
+}
+\label{figure:teaser}
+\end{figure}
+
 \subsection{Motivation \& Goals}

 For moving in the real world, it is generally desirable to know which objects exists
@@ -21,7 +32,8 @@ and motion estimation.

 Thus, in this work, we aim to develop end-to-end deep networks which can, given
 sequences of images, segment the image pixels into object instances and estimate
-the location and 3D motion of each object instance relative to the camera.
+the location and 3D motion of each object instance relative to the camera
+(Figure \ref{figure:teaser}).

 \subsection{Technical outline}

@@ -51,7 +63,6 @@ two concatenated images as input, similar to FlowNetS \cite{FlowNet}.
 This results in a fully integrated end-to-end network architecture for segmenting pixels into instances
 and estimating the motion of all detected instances without any limitations
 as to the number or variety of object instances.
-Figure \ref{} illustrates our concept.

 Eventually, we want to extend our method to include depth prediction,
 yielding the first end-to-end deep network to perform 3D scene flow estimation
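
The hunk context above mentions feeding two concatenated images to the network, similar to FlowNetS. As a small illustration (assuming H x W x 3 frames), the two frames are simply stacked along the channel axis before entering the backbone:

    import numpy as np

    def stacked_frames(frame_t, frame_t_plus_1):
        # Channel-wise concatenation of two consecutive H x W x 3 frames,
        # giving a single H x W x 6 input tensor.
        assert frame_t.shape == frame_t_plus_1.shape
        return np.concatenate([frame_t, frame_t_plus_1], axis=-1)
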
@@ -85,6 +96,9 @@ the optical flow estimation problem and introduce reasoning at the object level,
 but still require expensive energy minimization for each
 new input, as CNNs are only used for some of the components.

+In contrast, we tackle motion estimation at the instance-level with end-to-end
+deep networks and derive optical flow from the individual object motions.
+
 \paragraph{Slanted plane methods for 3D scene flow}
 The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being
 composed of planar segments. Pixels are assigned to one of the planar segments,
@@ -109,7 +123,7 @@ and takes minutes to make a prediction.
 Interestingly, the slanted plane methods achieve the current state-of-the-art
 in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015},
 outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
-However, the end-to-end deep networks are significantly faster than energy-minimization counterparts,
+However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts,
 generally taking a fraction of a second instead of minutes for prediction and can often be made to run in realtime.
 These concerns restrict the applicability of the current slanted plane models in practical settings,
 which often require estimations to be done in realtime and for which an end-to-end