This commit is contained in:
Simon Meister 2017-10-24 12:04:55 +02:00
parent 497cf9ec70
commit bc34ca9fe5
2 changed files with 56 additions and 18 deletions

View File

@ -58,14 +58,37 @@ Figure \ref{fig:motion_rcnn_head} shows our extended per-RoI head network.
We then extend the Faster R-CNN head by adding a fully-connected layer in parallel to the final fully-connected layers for
predicting refined boxes and classes.
Like for refined boxes and masks, we make one separate motion prediction for each class.
Each motion is predicted as a set of nine scalar motion parameters,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate by no more than 90 degrees in either direction.
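As an illustration (the composition order of the per-axis rotations is a design choice and not fixed by the above), the clipped sines can be decoded into a rotation matrix by recovering each angle via the arcsine, e.g.
\begin{equation}
\alpha = \arcsin(\hat{s}_\alpha), \qquad
\cos\alpha = \sqrt{1 - \hat{s}_\alpha^2}, \qquad
R = R_z(\gamma)\, R_y(\beta)\, R_x(\alpha),
\end{equation}
where $\hat{s}_\alpha$ denotes the network output for $\sin(\alpha)$ (and analogously for $\beta$ and $\gamma$).
Taking the non-negative square root for the cosine is valid precisely because of the assumption that rotations do not exceed 90 degrees per axis.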
\subsection{Supervision}
\paragraph{Per-RoI supervision with motion ground truth}
Given a positive RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
let $R_{c_k}^k, t_{c_k}^k, p_{c_k}^k$ be the predicted motion for class $c_k$
and $R_{gt}^{i_k}, t_{gt}^{i_k}, p_{gt}^{i_k}$ the ground truth motion for the example $i_k$.
We compute the motion loss $L_{motion}^k$ for each RoI as
\begin{equation}
L_{motion}^k =l_{R}^k + l_{t}^k + l_{p}^k,
\end{equation}
where
\begin{equation}
l_{R}^k = \arccos\left(\frac{\operatorname{tr}\left((R_{c_k}^k)^{-1} R_{gt}^{i_k}\right) - 1}{2} \right)
\end{equation}
measures the angle of the error rotation between predicted and ground truth rotation,
\begin{equation}
l_{t}^k = \lVert (R_{c_k}^k)^{-1} (t_{gt}^{i_k} - t_{c_k}^k) \rVert
\end{equation}
is the Euclidean distance between the predicted and ground truth translation, and
\begin{equation}
l_{p}^k = \lVert p_{gt}^{i_k} - p_{c_k}^k \rVert
\end{equation}
is the Euclidean distance between the predicted and ground truth pivot.
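To make the composition of the three terms concrete, the following is a minimal NumPy sketch of the per-RoI motion loss defined above (purely illustrative; the array names, the numerical clipping of the $\arccos$ argument and the use of an explicit matrix inverse are our own choices, not tied to any particular implementation):
\begin{verbatim}
import numpy as np

def motion_loss(R_pred, t_pred, p_pred, R_gt, t_gt, p_gt):
    """Per-RoI motion loss L_motion^k = l_R + l_t + l_p (NumPy sketch)."""
    # Angle of the error rotation between prediction and ground truth.
    cos_angle = (np.trace(np.linalg.inv(R_pred) @ R_gt) - 1.0) / 2.0
    l_R = np.arccos(np.clip(cos_angle, -1.0, 1.0))  # clip for numerical safety
    # Distance between ground truth and predicted translation,
    # rotated into the frame of the predicted rotation.
    l_t = np.linalg.norm(np.linalg.inv(R_pred) @ (t_gt - t_pred))
    # Euclidean distance between predicted and ground truth pivot.
    l_p = np.linalg.norm(p_gt - p_pred)
    return l_R + l_t + l_p
\end{verbatim}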
\subsection{Dense flow from motion}

View File

@ -1,5 +1,4 @@
\subsection{Problem formulation}
\subsection{Object detection, semantic segmentation and instance segmentation}
\subsection{Optical flow, scene flow and structure from motion}
@ -12,7 +11,7 @@ Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to three-dimensional space.
\subsection{Rigid scene model}
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures \cite{} became widely popular
through numerous successes in classification and recognition tasks.
The general structure of a CNN consists of a convolutional encoder, which
@ -22,10 +21,19 @@ and a fully connected prediction network on top of the encoder.
The compressed representations learned by CNNs of these categories do not, however, allow
for prediction of high-resolution output, as spatial detail is lost through sequential applications
of pooling or strides.
Thus, networks for dense prediction introduce a convolutional decoder on top of the representation encoder,
performing upsampling of the compressed features and resulting in an encoder-decoder pyramid.
The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{}, which was recently extended to scene flow estimation \cite{}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is rather generic and is specialized for optical flow only through being trained
with a dense optical flow ground truth loss. The same network could also be used for semantic segmentation if
the number of output channels were adapted from two to the number of classes. % TODO verify
FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to estimate optical flow quite well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated depends only on the number of 2D strides or pooling
operations in the encoder.
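To illustrate how generic this structure is, the following is a minimal encoder-decoder sketch in PyTorch (our own toy example, not FlowNetS itself; the layer sizes are arbitrary): three stride-2 convolutions compress a stacked frame pair, three transposed convolutions upsample back to the input resolution, and the two output channels carry the flow field. Replacing the final two channels by a number of classes would turn the same network into a segmentation network, as noted above.
\begin{verbatim}
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    """Toy encoder-decoder for dense 2D flow (illustrative only)."""
    def __init__(self):
        super().__init__()
        # Encoder: stride-2 convolutions compress the stacked frame pair.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: transposed convolutions upsample back to input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # 2 flow channels
        )

    def forward(self, frame_pair):
        # frame_pair: (N, 6, H, W), two RGB frames stacked along channels.
        return self.decoder(self.encoder(frame_pair))
\end{verbatim}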
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{}. % TODO dense nets for dense flow
% The conclusion should be an understanding of the generic nature of the popular dense prediction networks
% for flow and depth, which primarily stems from the fact that they are quick re-purposing of recognition CNNs.
@ -37,8 +45,8 @@ is the FlowNet family of networs \cite{}, which was recently extended to scene f
% in the resnet backbone.
\subsection{Region-based convolutional networks}
In the following, we give a short review of region-based convolutional networks, which are currently by far the
most popular deep networks for object detection and have recently also been applied to instance segmentation.
\paragraph{R-CNN}
The original region-based convolutional network (R-CNN) uses a non-learned algorithm external to a standard encoder CNN
to generate region proposals.
@ -47,17 +55,21 @@ For each of the region proposals, the input image is cropped at the proposed region and
passed through a CNN, which performs classification of the object (or non-object, if the region shows background).
\paragraph{Fast R-CNN}
The original R-CNN involved computing one forward pass of the deep CNN for each of the region proposals,
which is costly, as there is generally a large number of proposals.
Fast R-CNN significantly reduces computation by performing only a single forward pass with the whole image
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed-size crops are taken from the compressed feature map of the image,
collected into a batch and passed into a small Fast R-CNN
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
This technique is called \emph{RoI pooling}. % TODO explain how RoI pooling converts full image box coords to crop ranges
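As a rough sketch of the mechanism (a quantized toy version in NumPy; the stride value, output size and rounding behavior are illustrative, not those of any particular implementation), RoI pooling maps a box given in input-image coordinates onto the feature map using the total encoder stride and then max-pools a fixed-size grid of cells:
\begin{verbatim}
import numpy as np

def roi_pool(feature_map, box, output_size=7, stride=16):
    """Crop a fixed-size feature for one full-image box (NumPy sketch).

    feature_map: (C, H, W) encoder output, H and W downsampled by `stride`.
    box: (x1, y1, x2, y2) in input-image pixel coordinates (assumed valid).
    """
    # Convert image coordinates to feature-map coordinates via the stride.
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]
    crop = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    # Divide the crop into an output_size x output_size grid, max-pool each cell.
    C, h, w = crop.shape
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    out = np.zeros((C, output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            cell = crop[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out
\end{verbatim}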
Thus, given region proposals, the per-region computation is reduced to a single pass through the complete network,
speeding up the system by orders of magnitude. % TODO verify that
\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system unifies the generation of region proposals and the subsequent box refinement and
classification into a single deep network, leading to faster processing compared to Fast R-CNN
and, again, improved accuracy.
@ -70,7 +82,7 @@ At any position, bounding boxes are predicted as offsets relative to a fixed set
aspect ratios.
% TODO more about striding & computing the anchors?
For each anchor at a given position, the objectness score tells us how likely this anchor is to correspond to a detection.
The region proposals can then be obtained as the N highest scoring anchor boxes.
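As a small sketch of this selection step (NumPy; the offset parameterization shown is the common center/log-size convention and the names are our own; in practice non-maximum suppression is applied as well), the proposals are the boxes obtained by applying the predicted offsets to the N best-scoring anchors:
\begin{verbatim}
import numpy as np

def top_n_proposals(scores, anchors, deltas, n=300):
    """Select the n highest scoring anchors and apply their box offsets.

    scores:  (A,) objectness score per anchor (all positions flattened)
    anchors: (A, 4) anchor boxes as (cx, cy, w, h) in image coordinates
    deltas:  (A, 4) predicted offsets (dx, dy, dw, dh) per anchor
    """
    keep = np.argsort(-scores)[:n]        # indices of the n best anchors
    cx, cy, w, h = anchors[keep].T
    dx, dy, dw, dh = deltas[keep].T
    # Shift the anchor centers and rescale width/height.
    px, py = cx + dx * w, cy + dy * h
    pw, ph = w * np.exp(dw), h * np.exp(dh)
    return np.stack([px, py, pw, ph], axis=1)
\end{verbatim}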
The \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each region proposal.
@ -84,3 +96,6 @@ high resolution instance masks within the bounding boxes of each detected object
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition, Mask R-CNN replaces RoI pooling with \emph{RoIAlign}, which avoids quantizing box coordinates and thereby
preserves the precise spatial alignment needed for pixel-accurate masks.
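A minimal sketch of such a mask head (PyTorch, purely illustrative; the channel counts, the number of convolutions and the RoI feature size are placeholders, not the actual Mask R-CNN configuration):
\begin{verbatim}
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Toy mask head: a small fully convolutional branch on per-RoI features."""
    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
        )
        # Upsample once, then predict one mask (per-pixel logits) per class.
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.mask_logits = nn.Conv2d(256, num_classes, 1)

    def forward(self, roi_features):
        # roi_features: (num_rois, in_channels, 14, 14) pooled RoI features.
        x = self.convs(roi_features)
        x = torch.relu(self.upsample(x))
        return self.mask_logits(x)
\end{verbatim}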
\paragraph{Supervision of the RPN}
\paragraph{Supervision of the RoI head}