diff --git a/approach.tex b/approach.tex
index f398fbb..ff62dc3 100644
--- a/approach.tex
+++ b/approach.tex
@@ -72,7 +72,7 @@ and $\alpha, \beta, \gamma$ are the rotation angles in radians about the $x,y,z$
 We then extend the Mask R-CNN head by adding a fully connected layer in parallel to the fully connected layers for refined boxes and classes.
 Figure \ref{fig:motion_rcnn_head} shows the Motion R-CNN RoI head.
 Like for refined boxes and masks, we make one separate motion prediction for each class.
-Each motion is predicted as a set of nine scalar motion parameters,
+Each instance motion is predicted as a set of nine scalar parameters,
 $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
 where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
 Here, we assume that motions between frames are relatively small
@@ -190,7 +190,7 @@ box of a detected object according to the predicted motion of the object.
 We first define the \emph{full image} mask $m_t^k$ for object k, which can be computed
 from the predicted box mask $m_k^b$ by bilinearly resizing $m_k^b$ to the width and height
 of the predicted bounding box and then copying the values
-of the resized mask into a full image map starting at the top-right coordinate of the predicted bounding box.
+of the resized mask into a full resolution all-zeros map, starting at the top-right coordinate of the predicted bounding box.
 Then,
 \begin{equation}
 P'_{t+1} =
diff --git a/background.tex b/background.tex
index 7e31a9b..c6cd6c9 100644
--- a/background.tex
+++ b/background.tex
@@ -63,6 +63,7 @@ each corresponding to one of the proposal bounding boxes.
 The crops are collected into a batch and passed into a small Fast R-CNN \emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
 This technique is called \emph{RoI pooling}.
 % TODO explain how RoI pooling converts full image box coords to crop ranges
+\todo{more details and figure}
 Thus, given region proposals, the per-region computation is reduced to a single pass
 through the complete network, speeding up the system by orders of magnitude.
 % TODO verify that
@@ -80,8 +81,9 @@ Next, the \emph{backbone} output features are passed into a small, fully convolu
 predicts objectness scores and regresses bounding boxes at each of its output positions.
 At any position, bounding boxes are predicted as offsets relative to a fixed set of \emph{anchors}
 with different aspect ratios.
+\todo{more details and figure}
 % TODO more about striding & computing the anchors?
-For each anchor at a given position, the objectness score tells us how likely this anchors is to corresponds to a detection.
+For each anchor at a given position, the objectness score tells us how likely this anchor is to correspond to a detection.
 The region proposals can then be obtained as the N highest scoring anchor boxes.

 The \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
diff --git a/experiments.tex b/experiments.tex
index ed21958..38faa2d 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -98,7 +98,7 @@ E_{p} = \frac{1}{N}\sum_k \lVert p^{gt,i_k} - p^{k,c_k} \rVert
 \end{equation}
 is the mean euclidean norm between predicted and ground truth pivot.
 Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
-predicted camera motion.
+predicted camera motions.

 \subsection{Training Setup}
 Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
diff --git a/figures/teaser.png b/figures/teaser.png
new file mode 100755
index 0000000..71c0c42
Binary files /dev/null and b/figures/teaser.png differ
diff --git a/introduction.tex b/introduction.tex
index 8fa674d..475d5e5 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -1,3 +1,14 @@
+\begin{figure}[t]
+ \centering
+ \includegraphics[width=\textwidth]{figures/teaser}
+\caption{
+Given two temporally consecutive frames,
+our network segments the pixels of the first frame into individual objects
+and estimates all 3D object motions between the frames.
+}
+\label{figure:teaser}
+\end{figure}
+
 \subsection{Motivation \& Goals}

 For moving in the real world, it is generally desirable to know which objects exists
@@ -21,7 +32,8 @@ and motion estimation.

 Thus, in this work, we aim to develop end-to-end deep networks which can,
 given sequences of images, segment the image pixels into object instances and estimate
-the location and 3D motion of each object instance relative to the camera.
+the location and 3D motion of each object instance relative to the camera
+(Figure \ref{figure:teaser}).

 \subsection{Technical outline}

@@ -51,7 +63,6 @@ two concatenated images as input, similar to FlowNetS \cite{FlowNet}.
 This results in a fully integrated end-to-end network architecture for segmenting
 pixels into instances and estimating the motion of all detected instances
 without any limitations as to the number or variety of object instances.
-Figure \ref{} illustrates our concept.

 Eventually, we want to extend our method to include depth prediction,
 yielding the first end-to-end deep network to perform 3D scene flow estimation
@@ -85,6 +96,9 @@ the optical flow estimation problem and introduce reasoning at the object level,
 but still require expensive energy minimization for each new input,
 as CNNs are only used for some of the components.
+In contrast, we tackle motion estimation at the instance level with end-to-end
+deep networks and derive optical flow from the individual object motions.
+
 \paragraph{Slanted plane methods for 3D scene flow}
 The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene
 as being composed of planar segments.
 Pixels are assigned to one of the planar segments,
@@ -109,7 +123,7 @@ and takes minutes to make a prediction.
 Interestingly, the slanted plane methods achieve the current state-of-the-art in scene flow \emph{and}
 optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015},
 outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
-However, the end-to-end deep networks are significantly faster than energy-minimization counterparts,
+However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts,
 generally taking a fraction of a second instead of minutes for prediction and can often be made to run in realtime.
 These concerns restrict the applicability of the current slanted plane models in practical settings,
 which often require estimations to be done in realtime and for which an end-to-end
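Note (not part of the patch): the approach.tex hunks above describe recovering a per-instance rotation from the clipped sines, pasting the predicted box mask into an all-zeros full-resolution map, and moving the masked 3D points by the predicted motion. The NumPy sketch below shows one plausible reading of those steps. The function names, the Rz Ry Rx composition, and the point transform P' = R (P - p) + p + t are illustrative assumptions (the equation body for P'_{t+1} is cut off in this hunk), not code from the repository, and the paste uses the box's top-left corner under standard image coordinates.

import numpy as np

def rotation_from_sines(sin_a, sin_b, sin_c):
    # Recover cosines from the clipped sines; valid under the small-motion
    # assumption stated in approach.tex (angles within [-pi/2, pi/2]).
    sins = np.clip(np.array([sin_a, sin_b, sin_c]), -1.0, 1.0)
    sa, sb, sc = sins
    ca, cb, cc = np.sqrt(1.0 - sins ** 2)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cc, -sc, 0], [sc, cc, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx  # axis ordering is an assumption, not fixed by the patch

def resize_bilinear(mask, out_h, out_w):
    # Minimal stand-in for the bilinear resize of the predicted box mask.
    in_h, in_w = mask.shape
    ys = np.linspace(0.0, in_h - 1.0, out_h)
    xs = np.linspace(0.0, in_w - 1.0, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, in_h - 1), np.minimum(x0 + 1, in_w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = mask[y0][:, x0] * (1 - wx) + mask[y0][:, x1] * wx
    bot = mask[y1][:, x0] * (1 - wx) + mask[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def full_image_mask(box_mask, box, height, width):
    # Paste the resized box mask into an all-zeros full-resolution map,
    # starting at the box corner (origin at the top-left in image coordinates).
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, width), min(y1, height)
    full = np.zeros((height, width), dtype=np.float32)
    if x1 > x0 and y1 > y0:
        full[y0:y1, x0:x1] = resize_bilinear(box_mask, y1 - y0, x1 - x0)
    return full

def apply_instance_motion(points, mask, R, t, pivot):
    # Rigidly move the 3D points covered by the instance mask:
    # P' = R (P - pivot) + pivot + t; all other points are left unchanged.
    moved = (points.reshape(-1, 3) - pivot) @ R.T + pivot + t
    moved = moved.reshape(points.shape)
    m = mask[..., None]
    return m * moved + (1.0 - m) * points

Consistent with the introduction.tex hunk, optical flow could then be derived by projecting the points before and after the motion and taking the difference of the image coordinates, though that projection step is not shown here.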
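Similarly, the pivot error E_p in the experiments.tex hunk is a mean Euclidean distance over matched instances. A minimal sketch, assuming the matching of predictions to ground truth (the indices i_k and the selected classes c_k) has already been resolved and the pivots are stacked into (N, 3) arrays; the helper names are hypothetical:

import numpy as np

def mean_pivot_error(pred_pivots, gt_pivots):
    # E_p: mean Euclidean norm between matched predicted and ground-truth
    # pivots, both given as arrays of shape (N, 3).
    return float(np.mean(np.linalg.norm(pred_pivots - gt_pivots, axis=1)))

def mean_translation_error(pred_t, gt_t):
    # Analogous translation error; the camera variants E_t^cam and E_R^cam
    # would compare the single predicted camera motion per frame pair instead.
    return float(np.mean(np.linalg.norm(pred_t - gt_t, axis=1)))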