diff --git a/abstract.tex b/abstract.tex
index 0dd926b..1e73166 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -24,9 +24,9 @@ Building on recent advanced in region-based convolutional networks (R-CNNs), we
 estimation with instance segmentation.
 Given two consecutive frames from a monocular RGBD camera, our resulting
 end-to-end deep network detects objects with accurate per-pixel masks
-and estimates the 3d motion of each detected object between the frames.
+and estimates the 3D motion of each detected object between the frames.
 By additionally estimating a global camera motion in the same network, we compose a dense
-optical flow field based on instance-level motion predictions.
+optical flow field based on instance-level and global motion predictions.
 
-We demonstrate the effectiveness of our approach on the KITTI 2015 optical flow benchmark.
+We demonstrate the feasibility of our approach on the KITTI 2015 optical flow benchmark.
 \end{abstract}
diff --git a/approach.tex b/approach.tex
index 0bb0018..7384043 100644
--- a/approach.tex
+++ b/approach.tex
@@ -28,25 +28,25 @@ where
 R_t^{k,x}(\alpha) =
 \begin{pmatrix}
 1 & 0 & 0 \\
- 0 & cos(\alpha) & -sin(\alpha) \\
- 0 & sin(\alpha) & cos(\alpha)
+ 0 & \cos(\alpha) & -\sin(\alpha) \\
+ 0 & \sin(\alpha) & \cos(\alpha)
 \end{pmatrix},
 \end{equation}
 \begin{equation}
 R_t^{k,y}(\beta) =
 \begin{pmatrix}
- cos(\beta) & 0 & sin(\beta) \\
+ \cos(\beta) & 0 & \sin(\beta) \\
 0 & 1 & 0 \\
- -sin(\beta) & 0 & cos(\beta)
+ -\sin(\beta) & 0 & \cos(\beta)
 \end{pmatrix},
 \end{equation}
 \begin{equation}
 R_t^{k,z}(\gamma) =
 \begin{pmatrix}
- cos(\gamma) & -sin(\gamma) & 0 \\
- sin(\gamma) & cos(\gamma) & 0 \\
+ \cos(\gamma) & -\sin(\gamma) & 0 \\
+ \sin(\gamma) & \cos(\gamma) & 0 \\
 0 & 0 & 1
 \end{pmatrix},
 \end{equation}
@@ -59,8 +59,8 @@ We then extend the Faster R-CNN head by adding a fully connected layer in parall
 predicting refined boxes and classes.
 Like for refined boxes and masks, we make one separate motion prediction for each class.
 Each motion is predicted as a set of nine scalar motion parameters,
-$sin(\alpha)$, $sin(\beta)$, $sin(\gamma)$, $t_t^k$ and $p_t^k$,
-where $sin(\alpha)$, $sin(\beta)$ and $sin(\gamma)$ are clipped to $[-1, 1]$.
+$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
+where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
 Here, we assume that motions between frames are relatively small and that objects rotate at most
 90 degrees in either direction along any axis.
@@ -69,7 +69,7 @@ In addition to the object transformations, we optionally predict the camera moti
 between the two frames $I_t$ and $I_{t+1}$.
 For this, we flatten the full output of the backbone and pass it through a fully connected layer.
 We again represent $R_t^{cam}$ using a Euler angle representation and
-predict $sin(\alpha)$, $sin(\beta)$, $sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.
+predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.
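For concreteness, the following NumPy sketch shows how the nine predicted scalars per object (the three sines, the translation $t_t^k$ and the pivot $p_t^k$) could be turned into a rigid transformation. The composition order of the three rotations and the exact role of the pivot are not fixed by the hunks above, so both are labeled as assumptions in the comments; recovering the cosines from the clipped sines relies on the stated assumption that objects rotate at most 90 degrees about any axis.

import numpy as np

def rotation_from_sines(sin_a, sin_b, sin_c):
    """Build a rotation matrix from the three predicted sine values.

    Cosines are recovered as sqrt(1 - sin^2), which is only valid under the
    text's assumption of rotations of at most 90 degrees about any axis.
    The composition order R = Rz @ Ry @ Rx is an assumption of this sketch.
    """
    s = np.clip(np.array([sin_a, sin_b, sin_c], dtype=np.float64), -1.0, 1.0)
    c = np.sqrt(1.0 - s ** 2)
    Rx = np.array([[1, 0, 0],
                   [0, c[0], -s[0]],
                   [0, s[0], c[0]]])
    Ry = np.array([[c[1], 0, s[1]],
                   [0, 1, 0],
                   [-s[1], 0, c[1]]])
    Rz = np.array([[c[2], -s[2], 0],
                   [s[2], c[2], 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

def apply_object_motion(points, sines, t, p):
    """Move Nx3 camera-space points by one predicted object motion.

    The motion model X' = R (X - p) + p + t, with p acting as the pivot of
    the rotation, mirrors the SfM-Net-style parameterization the text refers
    to; it is not spelled out in the excerpt and is therefore an assumption.
    """
    R = rotation_from_sines(*sines)
    return (points - p) @ R.T + p + t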
 
 \subsection{Supervision}
 
@@ -84,7 +84,7 @@ L_{motion}^k =l_{R}^k + l_{t}^k + l_{p}^k,
 \end{equation}
 where
 \begin{equation}
-l_{R}^k = arccos\left(\frac{tr(inv(R_{c_k}^k) \cdot R_{gt}^{i_k}) - 1}{2} \right)
+l_{R}^k = \arccos\left( \min\left\{1, \max\left\{-1, \frac{tr(inv(R_{c_k}^k) \cdot R_{gt}^{i_k}) - 1}{2} \right\}\right\} \right)
 \end{equation}
 measures the angle of the error rotation between predicted and ground truth rotation,
 \begin{equation}
diff --git a/background.tex b/background.tex
index 772077d..c9aac1a 100644
--- a/background.tex
+++ b/background.tex
@@ -1,6 +1,3 @@
-
-\subsection{Object detection, semantic segmentation and instance segmentation}
-
 \subsection{Optical flow, scene flow and structure from motion}
 Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a sequence of images.
 The optical flow $\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$ maps pixel coordinates in the first
@@ -10,7 +7,6 @@
 Optical flow can be regarded as two-dimensional motion estimation.
 Scene flow is the generalization of optical flow to 3-dimensional space.
 
-\subsection{Rigid scene model}
 \subsection{Convolutional neural networks for dense motion estimation}
 Deep convolutional neural network (CNN) architectures \cite{} became widely popular through
 numerous successes in classification and recognition tasks.
@@ -27,9 +23,10 @@ The most popular deep networks of this kind for end-to-end optical flow predicti
 are variants of the FlowNet family \cite{}, which was recently extended to scene flow estimation \cite{}.
 Figure \ref{} shows the classical FlowNetS architecture for optical fow prediction.
 Note that the network itself is rather generic and is specialized for optical flow only through being trained
-with a dense optical flow groundtruth loss. The same network could also be used for semantic segmentation if
+with a dense optical flow groundtruth loss.
+Note that the same network could also be used for semantic segmentation if
 the number of output channels was adapted from two to the number of classes. % TODO verify
-FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to estimate optical flow quite well,
+FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to estimate optical flow arguably well,
 given just two consecutive frames as input and a large enough receptive field at the outputs to cover
 the displacements. Note that the maximum displacement that can be correctly estimated only depends
 on the number of 2D strides or pooling operations in the encoder.
@@ -52,7 +49,7 @@ most popular deep networks for object detection, and have recently also been app
 The original region-based convolutional network (R-CNN) uses a non-learned algorithm external to
 a standard encoder CNN for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent
 regions that may contain an object.
 For each of the region proposals, the input image is cropped at the proposed region and the crop is
-passed through a CNN, which performs classification of the object (or non-object, if the region shows background).
+passed through a CNN, which performs classification of the object (or non-object, if the region shows background). % and box refinement!
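The change to $l_{R}^k$ in the supervision hunk above wraps the arccos argument in a min/max clamp, which keeps the value inside $[-1, 1]$ when floating point error pushes it slightly outside and would otherwise produce NaNs. A minimal NumPy sketch of that loss term (the actual framework implementation is not part of this diff):

import numpy as np

def rotation_angle_loss(R_pred, R_gt):
    """Angle of the error rotation between predicted and ground-truth rotation.

    Computes arccos(clamp((trace(inv(R_pred) @ R_gt) - 1) / 2, -1, 1)).
    For a proper rotation matrix, inv(R_pred) equals R_pred.T; the clamp keeps
    the argument inside the domain of arccos despite numerical noise.
    """
    cos_angle = (np.trace(np.linalg.inv(R_pred) @ R_gt) - 1.0) / 2.0
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))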
 \paragraph{Fast R-CNN}
 The original R-CNN involved computing on forward pass of the deep CNN for each of the region proposals,
@@ -90,9 +87,12 @@ As in Fast R-CNN, RoI pooling is used to crop one fixed size feature map for eac
 
 \paragraph{Mask R-CNN}
-
-Combining object detection and semantic segmentation, Mask R-CNN extends the Faster R-CNN system to predict
-high resolution instance masks within the bounding boxes of each detected object.
+Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
+However, it can be helpful to know class and object (instance) membership of all individual pixels,
+which generally involves computing a binary mask for each object instance specifying which pixels belong
+to that object. This problem is called \emph{instance segmentation}.
+Mask R-CNN extends the Faster R-CNN system to instance segmentation by predicting
+fixed-resolution instance masks within the bounding boxes of each detected object.
 This can be done by simply extending the Faster R-CNN head with multiple convolutions,
 which compute a pixel-precise mask for each instance.
 In addition, Mask R-CNN
diff --git a/conclusion.tex b/conclusion.tex
index fc077d9..a4b4dda 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -2,14 +2,26 @@ We have introduced a extension on top of region-based convolutional networks to
 in parallel to instance segmentation.
 
 \subsection{Future Work}
+\paragraph{Predicting depth}
+In most cases, we want to work with RGB frames without depth available.
+To do so, we could integrate depth prediction into our network by branching off a
+depth network from the backbone in parallel to the RPN, as in Figure \ref{}.
+Although single-frame monocular depth prediction with deep networks has already been
+demonstrated with some success,
+our two-frame input should allow the network to make use of epipolar
+geometry for making a more reliable depth estimate.
+
+\paragraph{Training on real world data}
 Due to the amount of supervision required by the different components of the network
 and the complexity of the optimization problem,
-we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset.
-A next step would be the training on
-For example, we can first pre-train the RPN on a object detection dataset like
-Cityscapes. As soon as the RPN works reliably, we could then do alternating
+we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
+A next step would be training on a more realistic dataset.
+For example, we can first pre-train the RPN on an object detection dataset like
+Cityscapes. As soon as the RPN works reliably, we could execute alternating
 steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
 On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
-the motion losses and depth prediction, as no instance segmentation ground truth exists. % TODO depth prediction ?!
+the motion losses (and depth prediction), as no instance segmentation ground truth exists.
 On Cityscapes, we could continue train the full instance segmentation Mask R-CNN to improve
 detection and masks and avoid any forgetting effects.
+As an alternative to this training scheme, we could investigate training on a pure
+instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
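Both the flow composition mentioned in the abstract and the warping-based proxy losses suggested above for future work rest on the same operation: converting the per-instance rigid motions and the global camera motion into a dense 2D flow field. The sketch below illustrates one way this composition could look under a standard pinhole camera model; the intrinsics handling, the full-image mask format and the order in which object and camera motion are applied are assumptions of this sketch rather than details fixed by the diff.

import numpy as np

def compose_flow(depth, masks, obj_motions, cam_motion, K):
    """Compose dense optical flow from instance-level and camera motion.

    depth:       (H, W) depth map for frame t
    masks:       (N, H, W) binary full-image masks of the detected instances
    obj_motions: list of N (R, t, p) tuples in the camera frame at time t
    cam_motion:  (R_cam, t_cam) global camera motion from t to t+1
    K:           (3, 3) pinhole intrinsics
    Returns an (H, W, 2) flow field in pixels; all conventions are assumptions.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project every pixel to a 3D point in the camera frame at time t.
    X = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)

    # Apply the rigid motion of whichever instance (if any) a pixel belongs to.
    X_moved = X.copy()
    for (R, t, p), mask in zip(obj_motions, masks):
        idx = mask.reshape(-1).astype(bool)
        X_moved[idx] = (X[idx] - p) @ R.T + p + t

    # Apply the global camera motion to all points.
    R_cam, t_cam = cam_motion
    X_next = X_moved @ R_cam.T + t_cam

    # Project into frame t+1; the pixel displacement is the optical flow.
    proj = X_next @ K.T
    uv_next = proj[:, :2] / proj[:, 2:3]
    flow = uv_next - pix[:, :2]
    return flow.reshape(H, W, 2)

A warping-based proxy loss of the kind mentioned in the future work paragraph could then bilinearly sample $I_{t+1}$ at the positions given by pix + flow and penalize the photometric difference to $I_t$, without requiring flow or motion ground truth.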
diff --git a/introduction.tex b/introduction.tex
index 5ab0dd0..e190b78 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -1,6 +1,34 @@
 \subsection{Motivation \& Goals}
+% introduce problem to solve
+% mention classical non deep-learning works, then say it would be nice to go end-to-end deep
 
-% Explain benefits of learning (why deep-nize rigid scene model??)
+Recently, SfM-Net \cite{} introduced an end-to-end deep learning approach for predicting depth
+and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
+SfM-Net predicts a batch of binary full-image masks specifying the object memberships of individual pixels with a standard encoder-decoder
+network for pixel-wise prediction. A fully connected network branching off the encoder predicts a 3D motion for each object.
+However, due to the fixed number of object masks, it can only predict a small number of motions and
+often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions.
+
+Thus, their approach is unlikely to scale to dynamic scenes with a potentially
+large number of diverse objects due to the inflexible nature of their instance segmentation technique.
+
+A scalable approach to instance segmentation based on region-based convolutional networks
+was recently introduced with Mask R-CNN \cite{}, which inherits the ability to detect
+a large number of objects from a large number of classes at once from Faster R-CNN
+and predicts pixel-precise segmentation masks for each detected object.
+
+We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
+Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
+For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head
+in parallel to classification and bounding box refinement.
 
 \subsection{Related Work}
+
+\paragraph{Deep optical flow estimation}
+\paragraph{Deep scene flow estimation}
+\paragraph{Structure from motion}
+SfM-Net, SE3 Nets,
+
+
+Behl2017ICCV