In this section, we give a more detailed description of the previous works
we build on directly, as well as other prerequisites.
\subsection{Optical flow and scene flow}
Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
sequence of images.
The optical flow
$\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$
assigns to each pixel coordinate in the first frame $I_1$ the displacement to the
coordinates of the visually corresponding pixel in the second frame $I_2$,
and can be interpreted as the apparent movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
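Formally, under the common brightness constancy assumption, the flow field
$\mathbf{w} : P \to \mathbb{R}^2$ relates the two frames by
\begin{equation}
I_1(\mathbf{x}) \approx I_2(\mathbf{x} + \mathbf{w}(\mathbf{x}))
\end{equation}
for all pixel coordinates $\mathbf{x} \in P$, i.e., following the displacement
$\mathbf{w}(\mathbf{x}) = (u(\mathbf{x}), v(\mathbf{x}))^T$ from a pixel in $I_1$
leads to a pixel of similar appearance in $I_2$.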
Scene flow is the generalization of optical flow to three-dimensional space and
additionally requires estimating depth for each pixel. Generally, stereo input is used
to estimate disparity-based depth for scene flow; however, monocular depth estimation
with deep networks is becoming popular \cite{DeeperDepth}.
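For the stereo setting, a common parameterization, used for example by the KITTI
scene flow benchmark, represents the scene flow at each pixel of the reference frame
by the optical flow together with the disparities in both frames,
\begin{equation}
\mathbf{s} = (u, v, d_1, d_2)^T,
\end{equation}
from which the corresponding 3D motion vector can be recovered given the camera calibration.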
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures
\cite{ImageNetCNN, VGGNet, ResNet}
became widely popular through numerous successes in classification and recognition tasks.
The typical structure of such a CNN consists of a convolutional encoder, which
learns a spatially compressed but wide (in the number of channels) representation of the input image,
and a fully connected prediction network on top of the encoder.
The compressed representations learned by these networks do not, however, allow
for predicting high-resolution output, as spatial detail is lost through repeated
pooling or strided convolutions.
Networks for dense prediction therefore add a convolutional decoder on top of the encoder,
which upsamples the compressed features, resulting in an encoder-decoder pyramid.
The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is a rather generic encoder-decoder and is specialized for optical flow only through being trained
with supervision from dense optical flow ground truth.
Potentially, the same network could also be used for semantic segmentation if
the number of output channels were changed from two to the number of classes.
Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching reasonably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be estimated correctly is limited by this receptive field,
which grows with the number of strided convolution or pooling operations in the encoder.
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
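To make this encoder-decoder structure concrete, the following PyTorch sketch outlines a
strongly simplified FlowNetS-style network; the layer counts and channel widths are
illustrative and much smaller than in the original architecture.
\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFlowNet(nn.Module):
    """Strongly simplified FlowNetS-style encoder-decoder (illustrative)."""
    def __init__(self):
        super().__init__()
        # Encoder: both frames are stacked along the channel axis (2 x RGB).
        # Each strided convolution halves the spatial resolution.
        self.conv1 = nn.Conv2d(6, 64, 7, stride=2, padding=3)
        self.conv2 = nn.Conv2d(64, 128, 5, stride=2, padding=2)
        self.conv3 = nn.Conv2d(128, 256, 3, stride=2, padding=1)
        # Decoder: transposed convolutions upsample the compressed features.
        self.deconv2 = nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1)
        self.deconv1 = nn.ConvTranspose2d(128 + 128, 64, 4, stride=2, padding=1)
        # Final prediction: two output channels (u, v) per pixel.
        self.predict_flow = nn.Conv2d(64 + 64, 2, 3, padding=1)

    def forward(self, frame1, frame2):
        x = torch.cat([frame1, frame2], dim=1)
        c1 = F.relu(self.conv1(x))    # 1/2 resolution
        c2 = F.relu(self.conv2(c1))   # 1/4 resolution
        c3 = F.relu(self.conv3(c2))   # 1/8 resolution
        d2 = F.relu(self.deconv2(c3))                          # back to 1/4
        # Skip connections reintroduce spatial detail lost in the encoder.
        d1 = F.relu(self.deconv1(torch.cat([d2, c2], dim=1)))  # back to 1/2
        return self.predict_flow(torch.cat([d1, c1], dim=1))

flow = TinyFlowNet()(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(flow.shape)  # torch.Size([1, 2, 32, 32]), i.e. flow at 1/2 resolution
\end{verbatim}
Like the original FlowNetS, the sketch predicts flow at a reduced resolution,
which would then be upsampled bilinearly to the input resolution.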
\subsection{Region-based convolutional networks}
We now give a short review of region-based convolutional networks, which are currently by far the
most popular deep networks for object detection, and have recently also been applied to instance segmentation.
\paragraph{R-CNN}
Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each region proposal, the input image is cropped to the region's bounding box, and the crop is
passed through a CNN, which classifies the object (or background, if the region does not contain an object)
and regresses a refined bounding box.
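The following Python sketch illustrates this per-proposal pattern; the proposal algorithm,
cropping function, and CNN are hypothetical components passed in as arguments.
\begin{verbatim}
def rcnn_detect(image, proposal_algorithm, crop_and_resize, cnn):
    """Sketch of the R-CNN pattern: one full CNN forward pass per proposal."""
    detections = []
    for box in proposal_algorithm(image):     # external, non-learned proposals
        crop = crop_and_resize(image, box)    # warp region to fixed input size
        scores, refined_box = cnn(crop)       # classify (or reject) and refine
        detections.append((scores, refined_box))
    return detections
\end{verbatim}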
\paragraph{Fast R-CNN}
The original R-CNN involves computing one forward pass of the CNN for each of the region proposals,
which is costly, as there is generally a large number of proposals.
Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only a single forward pass with the whole image
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed-size crops are taken from the compressed feature map of the image,
each corresponding to one of the proposal bounding boxes.
To convert a proposal box from image coordinates into a crop range on the feature map,
its coordinates are divided by the total downsampling factor (stride) of the encoder;
the resulting region is then divided into a fixed grid of cells, and the features within
each cell are max-pooled, yielding a fixed-size output regardless of the box size.
This technique is called \emph{RoI pooling}.
The crops are collected into a batch and passed into a small Fast R-CNN
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
Thus, given region proposals, the costly encoder runs only once per image, and the per-region
computation is reduced to the small head network, speeding up the system by roughly two
orders of magnitude \cite{FastRCNN}.
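A simplified sketch of this operation for a single proposal is given below; it uses PyTorch's
adaptive max pooling as a stand-in for the original grid-based max pooling, and assumes
channels-first features and a hypothetical encoder stride of 16.
\begin{verbatim}
import torch
import torch.nn.functional as F

def roi_pool(features, box, stride=16, output_size=7):
    """Simplified RoI pooling for a single box.

    features: encoder feature map of shape (C, H, W).
    box: (x1, y1, x2, y2) in image coordinates.
    stride: total downsampling factor of the encoder.
    """
    # Project the box from image coordinates onto the feature map.
    x1, y1, x2, y2 = [int(round(c / stride)) for c in box]
    region = features[:, y1:y2 + 1, x1:x2 + 1]
    # Divide the region into a fixed grid and max-pool each cell,
    # yielding a fixed-size output regardless of the box size.
    return F.adaptive_max_pool2d(region, output_size)

features = torch.randn(256, 38, 50)  # e.g. a 600x800 image at stride 16
crop = roi_pool(features, box=(128, 96, 400, 320))
print(crop.shape)                    # torch.Size([256, 7, 7])
\end{verbatim}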
\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and the subsequent box refinement and
classification into a single deep network, leading to faster processing compared to Fast R-CNN
and, again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
Next, the \emph{backbone} output features are passed into a small, fully convolutional \emph{Region Proposal Network (RPN)} head, which
predicts objectness scores and regresses bounding boxes at each of its output positions.
At any position, bounding boxes are predicted as offsets relative to a fixed set of \emph{anchors} with different
scales and aspect ratios.
The anchors are centered on a regular grid over the input image whose spacing equals the total stride of the
backbone, so that each RPN output position corresponds to one grid cell.
For each anchor at a given position, the objectness score indicates how likely this anchor is to correspond to an object.
The region proposals can then be obtained as the $N$ highest scoring anchor boxes, after non-maximum suppression
removes near-duplicate boxes.
The \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each region proposal.
As in Fast R-CNN, RoI pooling is used to crop one fixed-size feature map for each of the region proposals.
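The following sketch illustrates one possible anchor generation scheme in image coordinates;
the stride, scales, and aspect ratios follow common choices for a VGG-style backbone, while
details such as the exact centering convention vary between implementations.
\begin{verbatim}
import itertools
import numpy as np

def generate_anchors(fmap_height, fmap_width, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes (x1, y1, x2, y2) in image coordinates.

    Each RPN output position corresponds to one cell of a regular grid
    over the image with spacing `stride`; every position receives the
    same set of len(scales) * len(ratios) anchors centered on its cell.
    """
    anchors = []
    for i, j in itertools.product(range(fmap_height), range(fmap_width)):
        cy, cx = (i + 0.5) * stride, (j + 0.5) * stride
        for scale, ratio in itertools.product(scales, ratios):
            # Width and height with area scale^2 and aspect ratio w/h.
            w, h = scale * np.sqrt(ratio), scale / np.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

anchors = generate_anchors(38, 50)  # e.g. a 600x800 image at stride 16
print(anchors.shape)                # (17100, 4): 38 * 50 positions x 9 anchors
\end{verbatim}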
\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding-box granularity.
However, it can be helpful to know the class and object (instance) membership of every individual pixel,
which generally involves computing a binary mask for each object instance that specifies which pixels belong
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
fixed-resolution instance masks within the bounding box of each detected object.
This is done by extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
As masks require more accurate spatial alignment than bounding boxes, RoI pooling is additionally
replaced by the bilinearly interpolating \emph{RoIAlign} operation.
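As a rough sketch, such a mask head might look as follows; the layer counts, widths, and
resolutions are illustrative and do not reproduce the exact configuration of \cite{MaskRCNN}.
\begin{verbatim}
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Illustrative mask branch: a small FCN applied per RoI feature crop."""
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
            # One upsampling step, e.g. from 14x14 to 28x28 RoI resolution.
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
        )
        # One binary mask per class; at test time, only the mask of the
        # predicted class is kept and resized into the detected box.
        self.mask_logits = nn.Conv2d(256, num_classes, 1)

    def forward(self, roi_features):  # (num_rois, C, 14, 14)
        return self.mask_logits(self.convs(roi_features))

masks = MaskHead()(torch.randn(8, 256, 14, 14))
print(masks.shape)  # torch.Size([8, 80, 28, 28])
\end{verbatim}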
In addition to extending the original Faster R-CNN head, Mask R-CNN also introduced a network
variant based on Feature Pyramid Networks \cite{FPN}.
Figure \ref{} compares the two Mask R-CNN head variants.
\paragraph{Feature Pyramid Networks}
\todo{TODO}
\paragraph{Supervision of the RPN}
\todo{TODO}
\paragraph{Supervision of the RoI head}
\todo{TODO}