This commit is contained in:
Simon Meister 2017-10-24 12:04:55 +02:00
parent 497cf9ec70
commit bc34ca9fe5
2 changed files with 56 additions and 18 deletions

View File

@ -58,14 +58,37 @@ Figure \ref{fig:motion_rcnn_head} shows our extended per-RoI head network.
We then extend the Faster R-CNN head by adding a fully-connected layer in parallel to the final fully-connected layers for
predicting refined boxes and classes.
Like for refined boxes and masks, we make one separate motion prediction for each class.
Each motion is predicted as a set of nine scalar motion parameters,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate by no more than 90 degrees in either direction.
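As an illustration (the composition order of the per-axis rotations is a design choice and not fixed by the above), the clipped sines can be decoded into a rotation matrix by recovering each angle via the arcsine, e.g.
\begin{equation}
\alpha = \arcsin(\hat{s}_\alpha), \qquad
\cos\alpha = \sqrt{1 - \hat{s}_\alpha^2}, \qquad
R = R_z(\gamma)\, R_y(\beta)\, R_x(\alpha),
\end{equation}
where $\hat{s}_\alpha$ denotes the network output for $\sin(\alpha)$ (and analogously for $\beta$ and $\gamma$).
Taking the non-negative square root for the cosine is valid precisely because of the assumption that rotations do not exceed 90 degrees per axis.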
\subsection{Supervision}
\paragraph{Per-RoI supervision with motion ground truth}
Given a positive RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
let $R_{c_k}^k, t_{c_k}^k, p_{c_k}^k$ be the predicted motion for class $c_k$
and $R_{gt}^{i_k}, t_{gt}^{i_k}, p_{gt}^{i_k}$ the ground truth motion for the example $i_k$.
We compute the motion loss $L_{motion}^k$ for each RoI as
\begin{equation}
L_{motion}^k =l_{R}^k + l_{t}^k + l_{p}^k,
\end{equation}
where
\begin{equation}
l_{R}^k = \arccos\left(\frac{\operatorname{tr}\left((R_{c_k}^k)^{-1} R_{gt}^{i_k}\right) - 1}{2} \right)
\end{equation}
measures the angle of the error rotation between predicted and ground truth rotation,
\begin{equation}
l_{t}^k = \lVert (R_{c_k}^k)^{-1} (t_{gt}^{i_k} - t_{c_k}^k) \rVert
\end{equation}
is the Euclidean distance between the predicted and ground truth translation, and
\begin{equation}
l_{p}^k = \lVert p_{gt}^{i_k} - p_{c_k}^k \rVert
\end{equation}
is the Euclidean distance between the predicted and ground truth pivot.
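To make the composition of the three terms concrete, the following is a minimal NumPy sketch of the per-RoI motion loss defined above (purely illustrative; the array names, the numerical clipping of the $\arccos$ argument and the use of an explicit matrix inverse are our own choices, not tied to any particular implementation):
\begin{verbatim}
import numpy as np

def motion_loss(R_pred, t_pred, p_pred, R_gt, t_gt, p_gt):
    """Per-RoI motion loss L_motion^k = l_R + l_t + l_p (NumPy sketch)."""
    # Angle of the error rotation between prediction and ground truth.
    cos_angle = (np.trace(np.linalg.inv(R_pred) @ R_gt) - 1.0) / 2.0
    l_R = np.arccos(np.clip(cos_angle, -1.0, 1.0))  # clip for numerical safety
    # Distance between ground truth and predicted translation,
    # rotated into the frame of the predicted rotation.
    l_t = np.linalg.norm(np.linalg.inv(R_pred) @ (t_gt - t_pred))
    # Euclidean distance between predicted and ground truth pivot.
    l_p = np.linalg.norm(p_gt - p_pred)
    return l_R + l_t + l_p
\end{verbatim}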
\subsection{Dense flow from motion}

View File

@ -1,5 +1,4 @@
\subsection{Problem formulation}
\subsection{Object detection, semantic segmentation and instance segmentation}
\subsection{Optical flow, scene flow and structure from motion}
@ -12,7 +11,7 @@ Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to three-dimensional space.
\subsection{Rigid scene model}
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures \cite{} became widely popular
through numerous successes in classification and recognition tasks.
The general structure of a CNN consists of a convolutional encoder, which
@ -22,10 +21,19 @@ and a fully connected prediction network on top of the encoder.
The compressed representations learned by CNNs of these categories do not, however, allow
for prediction of high-resolution output, as spatial detail is lost through sequential applications
of pooling or strides.
Thus, networks for dense prediction introduce a convolutional decoder on top of the representation encoder,
performing upsampling of the compressed features and resulting in an encoder-decoder pyramid.
The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{}, which was recently extended to scene flow estimation \cite{}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is rather generic and is specialized for optical flow only through being trained
with a dense optical flow ground truth loss. The same network could also be used for semantic segmentation if
the number of output channels were adapted from two to the number of classes. % TODO verify
FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to estimate optical flow quite well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated depends only on the number of 2D strides or pooling
operations in the encoder.
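To illustrate how generic this structure is, the following is a minimal encoder-decoder sketch in PyTorch (our own toy example, not FlowNetS itself; the layer sizes are arbitrary): three stride-2 convolutions compress a stacked frame pair, three transposed convolutions upsample back to the input resolution, and the two output channels carry the flow field. Replacing the final two channels by a number of classes would turn the same network into a segmentation network, as noted above.
\begin{verbatim}
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    """Toy encoder-decoder for dense 2D flow (illustrative only)."""
    def __init__(self):
        super().__init__()
        # Encoder: stride-2 convolutions compress the stacked frame pair.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: transposed convolutions upsample back to input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # 2 flow channels
        )

    def forward(self, frame_pair):
        # frame_pair: (N, 6, H, W), two RGB frames stacked along channels.
        return self.decoder(self.encoder(frame_pair))
\end{verbatim}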
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{}. % TODO dense nets for dense flow
% The conclusion should be an understanding of the generic nature of the popular dense prediction networks
% for flow and depth, which primarily stems from the fact that they are quick re-purposing of recognition CNNs.
@ -37,8 +45,8 @@ is the FlowNet family of networs \cite{}, which was recently extended to scene f
% in the resnet backbone.
\subsection{Region-based convolutional networks}
In the following, we give a short review of region-based convolutional networks, which are currently by far the
most popular deep networks for object detection and have recently also been applied to instance segmentation.
\paragraph{R-CNN}
The original region-based convolutional network (R-CNN) uses a non-learned algorithm external to a standard encoder CNN
to generate region proposals.
@ -47,17 +55,21 @@ For each of the region proposals, the input image is cropped at the proposed region and
passed through a CNN, which performs classification of the object (or non-object, if the region shows background).
\paragraph{Fast R-CNN}
The original R-CNN involved computing one forward pass of the deep CNN for each of the region proposals,
which is costly, as there is generally a large number of proposals.
Fast R-CNN significantly reduces computation by performing only a single forward pass with the whole image
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed-size crops are taken from the compressed feature map of the image,
collected into a batch and passed into a small Fast R-CNN
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
This technique is called \emph{RoI pooling}. % TODO explain how RoI pooling converts full image box coords to crop ranges
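As a rough sketch of the mechanism (a quantized toy version in NumPy; the stride value, output size and rounding behavior are illustrative, not those of any particular implementation), RoI pooling maps a box given in input-image coordinates onto the feature map using the total encoder stride and then max-pools a fixed-size grid of cells:
\begin{verbatim}
import numpy as np

def roi_pool(feature_map, box, output_size=7, stride=16):
    """Crop a fixed-size feature for one full-image box (NumPy sketch).

    feature_map: (C, H, W) encoder output, H and W downsampled by `stride`.
    box: (x1, y1, x2, y2) in input-image pixel coordinates (assumed valid).
    """
    # Convert image coordinates to feature-map coordinates via the stride.
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]
    crop = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    # Divide the crop into an output_size x output_size grid, max-pool each cell.
    C, h, w = crop.shape
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    out = np.zeros((C, output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            cell = crop[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out
\end{verbatim}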
Thus, given region proposals, the per-region computation is reduced to a single pass through the complete network,
speeding up the system by orders of magnitude. % TODO verify that
\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system unifies the generation of region proposals and the subsequent box refinement and
classification into a single deep network, leading to faster processing compared to Fast R-CNN
and, again, improved accuracy.
@ -70,7 +82,7 @@ At any position, bounding boxes are predicted as offsets relative to a fixed set
aspect ratios.
% TODO more about striding & computing the anchors?
For each anchor at a given position, the objectness score tells us how likely this anchor is to correspond to a detection.
The region proposals can then be obtained as the N highest scoring anchor boxes.
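As a small sketch of this selection step (NumPy; the offset parameterization shown is the common center/log-size convention and the names are our own; in practice non-maximum suppression is applied as well), the proposals are the boxes obtained by applying the predicted offsets to the N best-scoring anchors:
\begin{verbatim}
import numpy as np

def top_n_proposals(scores, anchors, deltas, n=300):
    """Select the n highest scoring anchors and apply their box offsets.

    scores:  (A,) objectness score per anchor (all positions flattened)
    anchors: (A, 4) anchor boxes as (cx, cy, w, h) in image coordinates
    deltas:  (A, 4) predicted offsets (dx, dy, dw, dh) per anchor
    """
    keep = np.argsort(-scores)[:n]        # indices of the n best anchors
    cx, cy, w, h = anchors[keep].T
    dx, dy, dw, dh = deltas[keep].T
    # Shift the anchor centers and rescale width/height.
    px, py = cx + dx * w, cy + dy * h
    pw, ph = w * np.exp(dw), h * np.exp(dh)
    return np.stack([px, py, pw, ph], axis=1)
\end{verbatim}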
The \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each region proposal.
@ -84,3 +96,6 @@ high resolution instance masks within the bounding boxes of each detected object
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition, Mask R-CNN replaces RoI pooling with \emph{RoIAlign}, which avoids quantizing box coordinates and thereby
preserves the precise spatial alignment needed for pixel-accurate masks.
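A minimal sketch of such a mask head (PyTorch, purely illustrative; the channel counts, the number of convolutions and the RoI feature size are placeholders, not the actual Mask R-CNN configuration):
\begin{verbatim}
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Toy mask head: a small fully convolutional branch on per-RoI features."""
    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
        )
        # Upsample once, then predict one mask (per-pixel logits) per class.
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.mask_logits = nn.Conv2d(256, num_classes, 1)

    def forward(self, roi_features):
        # roi_features: (num_rois, in_channels, 14, 14) pooled RoI features.
        x = self.convs(roi_features)
        x = torch.relu(self.upsample(x))
        return self.mask_logits(x)
\end{verbatim}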
\paragraph{Supervision of the RPN}
\paragraph{Supervision of the RoI head}