diff --git a/approach.tex b/approach.tex
index 972f356..2ae5ffa 100644
--- a/approach.tex
+++ b/approach.tex
@@ -1,5 +1,9 @@
 \subsection{Motion R-CNN architecture}
+
+Building on Mask R-CNN, we enable per-object motion estimation by predicting the 3D motion of each detected object.
+Specifically,
+
 \subsection{Supervision}
 %\subsection{Per-RoI motion loss}
 \subsection{Dense flow from instance-level prediction}
diff --git a/background.tex b/background.tex
index c0332d8..a1e61b7 100644
--- a/background.tex
+++ b/background.tex
@@ -1,7 +1,71 @@
 \subsection{Problem formulation}
+\subsection{Object detection, semantic segmentation and instance segmentation}
+
 \subsection{Optical flow, scene flow and structure from motion}
+Let $I_1, I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a sequence of images.
+The optical flow $\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$ maps pixel coordinates in the first
+frame $I_1$ to the pixel coordinates of the visually corresponding pixel in the second frame $I_2$, thus
+representing the apparent movement of brightness patterns between the two frames.
+Optical flow can be regarded as two-dimensional motion estimation.
+
+Scene flow is the generalization of optical flow to three-dimensional space.
+
 \subsection{Rigid scene model}
-\subsection{Pixel-wise generic CNNs}
-\subsection{Faster R-CNN}
-\subsection{Mask R-CNN}
+\subsection{Convolutional neural networks for dense estimation tasks}
+Deep convolutional neural network (CNN) architectures \cite{} became widely popular
+through numerous successes in classification and recognition tasks.
+The general structure of such a CNN consists of a convolutional encoder, which
+learns a spatially compressed, wide (in the number of channels) representation of the input image,
+and a fully connected prediction network on top of the encoder.
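+% Added illustration (not from the draft): standard convolution output-size arithmetic;
+% the symbols $i$, $k$, $p$, $s$ and $o$ are our own notation.
+Concretely, a convolution or pooling layer with kernel size $k$, padding $p$ and stride $s$ maps an input of spatial size $i$ to an output of spatial size
+\begin{equation}
+o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1,
+\end{equation}
+so each stride-$s$ stage shrinks the spatial resolution roughly by a factor of $s$.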
+
+The compressed representations learned by these classification CNNs do not, however, allow
+for the prediction of high-resolution output, as spatial detail is lost through sequential applications
+of pooling or strides.
+Networks for dense prediction therefore introduce a convolutional decoder in addition to the representation encoder,
+which upsamples the compressed features, resulting in an encoder-decoder pyramid.
+The most popular deep architecture of this kind for end-to-end optical flow prediction
+is the FlowNet family of networks \cite{}, which was recently extended to scene flow estimation \cite{}.
+
+% The conclusion should be an understanding of the generic nature of the popular dense prediction networks
+% for flow and depth, which primarily stems from the fact that they are quick re-purposings of recognition CNNs.
+
+% The reader should understand the limitations of the generic dense-estimator approach!
+
+% Also, it should be emphasized that FlowNet learns to match images with a generic encoder,
+% thus motivating the introduction of our motion head, which should integrate (and regularize) matching information learned
+% in the ResNet backbone.
+
+\subsection{Region-based convolutional networks}
+In the following, we review region-based convolutional networks, the now-classical deep networks for
+object detection and recognition.
+
+\paragraph{R-CNN}
+Region-based convolutional networks (R-CNNs) use a non-learned algorithm, external to a standard encoder CNN,
+to compute \emph{region proposals} in the form of 2D bounding boxes, which represent regions that may contain an object.
+Then, for each region proposal, the image is cropped at the proposed region and the crop is
+passed through a CNN, which classifies the object (or non-object, if the region shows background).
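+% Added sketch (not from the draft): the notation $N$, $T_{\text{CNN}}$ is our own.
+With $N$ region proposals per image, the cost of this procedure is thus roughly
+\begin{equation}
+T_{\text{R-CNN}} \approx N \cdot T_{\text{CNN}},
+\end{equation}
+where $T_{\text{CNN}}$ denotes the cost of one full CNN forward pass; this motivates the Fast R-CNN variant described next.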
+
+\paragraph{Fast R-CNN}
+The original R-CNN involved computing one forward pass of the CNN for each of the region proposals,
+which can be costly, as there may be a large number of proposals.
+Fast R-CNN significantly reduces processing time by performing only a single forward pass with the whole image
+as input to the CNN (compared to the input of crops in the case of R-CNN).
+Then, crops are taken from the compressed feature map of the image, collected into a batch and passed into a small Fast R-CNN
+\emph{head} network; this technique is called \emph{RoI pooling}.
+Thus, the per-region computation is heavily reduced, speeding up the system by orders of magnitude. % TODO verify that
+
+\paragraph{Faster R-CNN: End-to-end deep object detection with region proposal networks}
+The Faster R-CNN object detection system combines the generation of region proposals and the subsequent box refinement and
+classification into a single deep network, leading to faster processing than Fast R-CNN
+and, again, improved accuracy.
+
+\paragraph{Mask R-CNN}
+Combining object detection and semantic segmentation, Mask R-CNN extends the Faster R-CNN system to predict
+high-resolution instance masks within the bounding box of each detected object.
+This is done by extending the Faster R-CNN head with multiple convolutions, which
+compute a pixel-precise mask for each instance.
+In addition, Mask R-CNN replaces RoI pooling with \emph{RoIAlign}, which avoids quantizing region boundaries
+by sampling the feature map with bilinear interpolation.
diff --git a/introduction.tex b/introduction.tex
index a56cbbb..95191b7 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -1,2 +1,6 @@
 \subsection{Motivation \& Goals}
+
+
+
+
 \subsection{Related Work}
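The RoI pooling step from the Fast R-CNN paragraph above can be sketched in code. The following is a minimal max-pooling variant on hypothetical toy inputs; real implementations additionally handle channels, batching and sub-pixel box coordinates:

```python
import numpy as np

def roi_pool(features, box, output_size=2):
    """Max-pool a rectangular feature-map region into a fixed
    output_size x output_size grid, independent of the region's extent."""
    y0, x0, y1, x1 = box
    region = features[y0:y1, x0:x1]
    h, w = region.shape
    # Bin edges that partition the region into an output_size^2 grid.
    ye = np.linspace(0, h, output_size + 1).astype(int)
    xe = np.linspace(0, w, output_size + 1).astype(int)
    out = np.empty((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            out[i, j] = region[ye[i]:ye[i + 1], xe[j]:xe[j + 1]].max()
    return out

# Toy 6x6 feature map and one hypothetical proposal box (y0, x0, y1, x1).
features = np.arange(36, dtype=float).reshape(6, 6)
pooled = roi_pool(features, (0, 0, 4, 4))  # 2x2 output, one max per bin
```

Because every region, regardless of its size, is reduced to the same fixed grid, the pooled crops can be batched and fed into the shared head network.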