WIP

commit bc34ca9fe5 (parent 497cf9ec70)

approach.tex
@@ -58,14 +58,37 @@ Figure \ref{fig:motion_rcnn_head} shows our extended per-RoI head network.
We then extend the Faster R-CNN head by adding a fully-connected layer in parallel to the final fully-connected layers for
predicting refined boxes and classes.
As with refined boxes and masks, we make one separate motion prediction for each class.
Each motion is predicted as a set of nine scalar motion parameters, $\alpha$, $\beta$, $\gamma$, $t_t^k$ and $p_t^k$.
Each motion is predicted as a set of nine scalar motion parameters,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate by no more than 90 degrees in either direction.
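As an illustrative reconstruction (our sketch: the factorization order $R_x R_y R_z$ and the symbol $s_\alpha$ for the raw network output are assumptions, not taken from the text), the rotation could be recovered from the clipped sine predictions as
\begin{equation}
R = R_x(\alpha)\, R_y(\beta)\, R_z(\gamma), \qquad \alpha = \arcsin\!\big(\operatorname{clip}(s_\alpha, -1, 1)\big),
\end{equation}
and analogously for $\beta$ and $\gamma$.
Since $\arcsin$ maps $[-1, 1]$ to $[-90^\circ, 90^\circ]$, the reconstruction is unambiguous exactly under the assumption that objects rotate by no more than 90 degrees in either direction.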

\subsection{Supervision}

\paragraph{Per-RoI motion loss}
For each positive RoI, we

\paragraph{Per-RoI supervision with motion ground truth}
Given a positive RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
let $R_{c_k}^k, t_{c_k}^k, p_{c_k}^k$ be the predicted motion for class $c_k$
and $R_{gt}^{i_k}, t_{gt}^{i_k}, p_{gt}^{i_k}$ the ground truth motion for the example $i_k$.
We compute the motion loss $L_{motion}^k$ for each RoI as
\begin{equation}
L_{motion}^k = l_{R}^k + l_{t}^k + l_{p}^k,
\end{equation}
where
\begin{equation}
l_{R}^k = \arccos\left(\frac{\operatorname{tr}\!\left((R_{c_k}^k)^{-1} R_{gt}^{i_k}\right) - 1}{2}\right)
\end{equation}
measures the angle of the error rotation between the predicted and ground truth rotation
(using the identity $\operatorname{tr}(R) = 1 + 2\cos\theta$ for a rotation $R$ by angle $\theta$),
\begin{equation}
l_{t}^k = \left\lVert (R_{c_k}^k)^{-1} \left(t_{gt}^{i_k} - t_{c_k}^k\right) \right\rVert
\end{equation}
is the Euclidean distance between the predicted and ground truth translation, and
\begin{equation}
l_{p}^k = \left\lVert p_{gt}^{i_k} - p_{c_k}^k \right\rVert
\end{equation}
is the Euclidean distance between the predicted and ground truth pivot.
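For illustration, a minimal NumPy sketch of this per-RoI loss could look as follows (this is our own sketch, not the thesis implementation; names such as \texttt{motion\_loss} are hypothetical, and we assume the predicted rotation is orthonormal so that its inverse is its transpose):
\begin{verbatim}
import numpy as np

def motion_loss(R_pred, t_pred, p_pred, R_gt, t_gt, p_gt):
    # Angle of the error rotation between prediction and ground truth (l_R).
    R_err = R_pred.T @ R_gt                      # R_pred.T == inv(R_pred) for a rotation
    cos_angle = np.clip((np.trace(R_err) - 1.0) / 2.0, -1.0, 1.0)  # guard against numerical drift
    l_R = np.arccos(cos_angle)
    # Euclidean distance between translations, rotated into the predicted frame (l_t).
    l_t = np.linalg.norm(R_pred.T @ (t_gt - t_pred))
    # Euclidean distance between pivot points (l_p).
    l_p = np.linalg.norm(p_gt - p_pred)
    return l_R + l_t + l_p
\end{verbatim}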

\subsection{Dense flow from motion}

@@ -1,5 +1,4 @@

\subsection{Problem formulation}
\subsection{Object detection, semantic segmentation and instance segmentation}

\subsection{Optical flow, scene flow and structure from motion}
@@ -12,7 +11,7 @@ Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to 3-dimensional space.

\subsection{Rigid scene model}
\subsection{Convolutional neural networks for dense estimation tasks}
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures \cite{} became widely popular
through numerous successes in classification and recognition tasks.
The general structure of a CNN consists of a convolutional encoder, which
@@ -22,10 +21,19 @@ and a fully connected prediction network on top of the encoder.
The compressed representations learned by CNNs of these categories do not, however, allow
for prediction of high-resolution output, as spatial detail is lost through repeated application
of pooling or strided convolutions.
Thus, networks for dense prediction introduced a convolutional decoder in addition to the representation encoder,
Thus, networks for dense prediction introduce a convolutional decoder on top of the representation encoder,
performing upsampling of the compressed features and resulting in an encoder-decoder pyramid.
The most popular deep architecture of this kind for end-to-end optical flow prediction
is the FlowNet family of networks \cite{}, which was recently extended to scene flow estimation \cite{}.
The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{}, which was recently extended to scene flow estimation \cite{}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is rather generic and is specialized for optical flow only through being trained
with a dense optical flow ground truth loss. The same network could also be used for semantic segmentation if
the number of output channels were adapted from two to the number of classes. % TODO verify
FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to estimate optical flow quite well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated depends only on the number of strided convolution or pooling
operations in the encoder.
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{}. % TODO dense nets for dense flow
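To make the encoder-decoder idea concrete, the following is a purely illustrative PyTorch sketch (our own, with hypothetical layer sizes; it is not the actual FlowNetS, which additionally uses skip connections and multi-scale predictions):
\begin{verbatim}
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    """Generic encoder-decoder: two stacked RGB frames in, 2-channel flow out."""
    def __init__(self):
        super().__init__()
        # Encoder: strided convolutions compress spatial resolution by 4x.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        # Decoder: transposed convolutions upsample back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, frame_pair):          # frame_pair: (N, 6, H, W)
        return self.decoder(self.encoder(frame_pair))
\end{verbatim}
Changing the two output channels to the number of classes would turn the same generic architecture into a rudimentary semantic segmentation network, in line with the observation above.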

% The conclusion should be an understanding of the generic nature of the popular dense prediction networks
% for flow and depth, which primarily stems from the fact that they are quick re-purposing of recognition CNNs.
@@ -37,8 +45,8 @@ is the FlowNet family of networks \cite{}, which was recently extended to scene f
% in the resnet backbone.

\subsection{Region-based convolutional networks}
In the following, we will quickly review region-based convolutional networks, which are now the standard deep architecture for
object detection, object recognition and instance segmentation.
In the following, we give a short review of region-based convolutional networks, which are currently by far the
most popular deep networks for object detection, and have recently also been applied to instance segmentation.

\paragraph{R-CNN}
The original region-based convolutional network (R-CNN) uses a non-learned algorithm external to a standard encoder CNN
@@ -47,17 +55,21 @@ For each of the region proposals, the input image is cropped at the proposed reg
passed through a CNN, which performs classification of the object (or non-object, if the region shows background).

\paragraph{Fast R-CNN}
The original R-CNN involved computing one forward pass of the CNN for each of the region proposals,
which can be costly, as there may be a large number of proposals.
Fast R-CNN significantly reduces processing time by performing only a single forward pass with the whole image
as input to the CNN (compared to the input of crops in the case of R-CNN).
Then, crops are taken from the compressed feature map of the image, collected into a batch and passed into a small Fast R-CNN
\emph{head} network.
The original R-CNN involved computing one forward pass of the deep CNN for each of the region proposals,
which is costly, as there are generally a large number of proposals.
Fast R-CNN significantly reduces computation by performing only a single forward pass with the whole image
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed-size crops are taken from the compressed feature map of the image,
collected into a batch and passed into a small Fast R-CNN
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
This technique is called \emph{RoI pooling}. % TODO explain how RoI pooling converts full image box coords to crop ranges
Thus, the per-region computation is reduced to a single network pass,
Thus, given region proposals, the per-region computation is reduced to a single pass through the complete network,
speeding up the system by orders of magnitude. % TODO verify that
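As a sketch of what the TODO above refers to, RoI pooling maps a full-image box to bins on the downsampled feature map roughly as follows (our illustration; the stride of 16 and the $7 \times 7$ output size are assumed example values, and the function name is hypothetical):
\begin{verbatim}
import numpy as np

def roi_to_feature_bins(box, stride=16, output_size=7):
    """Map a full-image box (x1, y1, x2, y2) to per-bin ranges on the feature map."""
    x1, y1, x2, y2 = (c / stride for c in box)   # image coordinates -> feature coordinates
    bin_w = (x2 - x1) / output_size
    bin_h = (y2 - y1) / output_size
    bins = []
    for gy in range(output_size):                # each output cell pools one sub-window of the RoI
        for gx in range(output_size):
            bins.append((int(np.floor(y1 + gy * bin_h)), int(np.ceil(y1 + (gy + 1) * bin_h)),
                         int(np.floor(x1 + gx * bin_w)), int(np.ceil(x1 + (gx + 1) * bin_w))))
    return bins
\end{verbatim}
Each of the resulting windows is then max-pooled to produce one cell of the fixed-size crop.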

\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster processing when compared to Fast R-CNN
and, again, improved accuracy.
@@ -70,7 +82,7 @@ At any position, bounding boxes are predicted as offsets relative to a fixed set
aspect ratios.
% TODO more about striding & computing the anchors?
For each anchor at a given position, the objectness score tells us how likely this anchor is to correspond to a detection.
The region proposals are now obtained as the N highest scoring anchor boxes.
The region proposals can then be obtained as the N highest scoring anchor boxes.
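For illustration, proposals could be assembled from the anchors roughly as follows (our sketch; the offset parametrization is the standard Faster R-CNN one, which the text does not spell out, and the array shapes and names are assumptions):
\begin{verbatim}
import numpy as np

def decode_and_select(anchors, deltas, scores, top_n=300):
    """anchors, deltas: (A, 4) arrays as (cx, cy, w, h); scores: (A,) objectness."""
    ax, ay, aw, ah = anchors.T
    tx, ty, tw, th = deltas.T
    cx, cy = ax + tx * aw, ay + ty * ah        # shift centers relative to anchor size
    w, h = aw * np.exp(tw), ah * np.exp(th)    # scale widths and heights
    boxes = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    order = np.argsort(-scores)[:top_n]        # keep the N highest-scoring anchors
    return boxes[order]
\end{verbatim}
In practice, the top-$N$ selection is typically combined with non-maximum suppression on the scored boxes.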

The \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each region proposal.
@@ -84,3 +96,6 @@ high resolution instance masks within the bounding boxes of each detected object
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition, Mask R-CNN

\paragraph{Supervision of the RPN}
\paragraph{Supervision of the RoI head}