Simon Meister 2017-10-27 21:31:13 +02:00
parent fafd293f99
commit eb7df27e2f
5 changed files with 69 additions and 29 deletions


@ -24,9 +24,9 @@ Building on recent advances in region-based convolutional networks (R-CNNs), we
estimation with instance segmentation.
Given two consecutive frames from a monocular RGBD camera,
our resulting end-to-end deep network detects objects with accurate per-pixel masks
and estimates the 3D motion of each detected object between the frames.
By additionally estimating a global camera motion in the same network, we compose a dense
optical flow field based on instance-level and global motion predictions.
We demonstrate the feasibility of our approach on the KITTI 2015 optical flow benchmark.
\end{abstract}


@ -28,25 +28,25 @@ where
R_t^{k,x}(\alpha) =
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos(\alpha) & -\sin(\alpha) \\
0 & \sin(\alpha) & \cos(\alpha)
\end{pmatrix},
\end{equation}
\begin{equation}
R_t^{k,y}(\beta) =
\begin{pmatrix}
\cos(\beta) & 0 & \sin(\beta) \\
0 & 1 & 0 \\
-\sin(\beta) & 0 & \cos(\beta)
\end{pmatrix},
\end{equation}
\begin{equation}
R_t^{k,z}(\gamma) =
\begin{pmatrix}
\cos(\gamma) & -\sin(\gamma) & 0 \\
\sin(\gamma) & \cos(\gamma) & 0 \\
0 & 0 & 1
\end{pmatrix},
\end{equation}
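For concreteness, the three elementary rotations above can be written as a small NumPy sketch; the composition order used to form the full object rotation $R_t^k$ is an assumption of this sketch and may differ from the convention used in the thesis.
\begin{verbatim}
import numpy as np

def rot_x(a):
    # Rotation about the x-axis by angle a (radians), as in the first matrix.
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, np.cos(a), -np.sin(a)],
                     [0.0, np.sin(a),  np.cos(a)]])

def rot_y(b):
    # Rotation about the y-axis by angle b, as in the second matrix.
    return np.array([[ np.cos(b), 0.0, np.sin(b)],
                     [0.0, 1.0, 0.0],
                     [-np.sin(b), 0.0, np.cos(b)]])

def rot_z(c):
    # Rotation about the z-axis by angle c, as in the third matrix.
    return np.array([[np.cos(c), -np.sin(c), 0.0],
                     [np.sin(c),  np.cos(c), 0.0],
                     [0.0, 0.0, 1.0]])

def euler_to_rotation(alpha, beta, gamma):
    # Assumed composition order R_z * R_y * R_x; the object rotation R_t^k
    # may be composed in a different order.
    return rot_z(gamma) @ rot_y(beta) @ rot_x(alpha)
\end{verbatim}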
@ -59,8 +59,8 @@ We then extend the Faster R-CNN head by adding a fully connected layer in parall
predicting refined boxes and classes.
As with the refined boxes and masks, we make one separate motion prediction for each class.
Each motion is predicted as a set of nine scalar motion parameters,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction along any axis.
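A minimal sketch of how one such nine-parameter prediction could be decoded into a rigid transform, reusing \texttt{euler\_to\_rotation} from the sketch above; the parameter ordering and the role of $p_t^k$ as a rotation pivot are assumptions of this sketch.
\begin{verbatim}
def decode_motion(params):
    # params: nine predicted values for one RoI and class, assumed order
    # [sin(alpha), sin(beta), sin(gamma), t_x, t_y, t_z, p_x, p_y, p_z].
    s = np.clip(params[:3], -1.0, 1.0)
    # Rotations of at most 90 degrees are assumed, so arcsin is unambiguous.
    alpha, beta, gamma = np.arcsin(s)
    R = euler_to_rotation(alpha, beta, gamma)
    t = np.asarray(params[3:6])  # object translation t_t^k
    p = np.asarray(params[6:9])  # object pivot p_t^k (role assumed)
    return R, t, p
\end{verbatim}
A 3D point $X$ on the object could then, for example, be transformed as $R(X - p) + p + t$; whether $p_t^k$ enters exactly in this way is also an assumption of this sketch.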
@ -69,7 +69,7 @@ In addition to the object transformations, we optionally predict the camera moti
between the two frames $I_t$ and $I_{t+1}$.
For this, we flatten the full output of the backbone and pass it through a fully connected layer.
We again represent $R_t^{cam}$ using an Euler angle representation and
predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.
\subsection{Supervision}
@ -84,7 +84,7 @@ L_{motion}^k =l_{R}^k + l_{t}^k + l_{p}^k,
\end{equation}
where
\begin{equation}
l_{R}^k = \arccos\left( \min\left\{ 1, \max\left\{ -1, \frac{\operatorname{tr}\left( (R_{c_k}^k)^{-1} \cdot R_{gt}^{i_k} \right) - 1}{2} \right\} \right\} \right)
\end{equation}
measures the angle of the error rotation between predicted and ground truth rotation,
\begin{equation}
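A minimal NumPy sketch of the rotation error term above, using $R^{-1} = R^T$ for rotation matrices; the clamping mirrors the $\min$/$\max$ in the equation and guards $\arccos$ against numerical noise.
\begin{verbatim}
def rotation_angle_error(R_pred, R_gt):
    # Angle of the error rotation between prediction and ground truth;
    # inv(R) = R^T since R is a rotation matrix.
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clamp to [-1, 1] so arccos stays defined under numerical noise.
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))
\end{verbatim}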


@ -1,6 +1,3 @@
\subsection{Optical flow, scene flow and structure from motion}
Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a sequence of images.
The optical flow $\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$ maps pixel coordinates in the first
@ -10,7 +7,6 @@ Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to three-dimensional space.
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures \cite{} became widely popular
through numerous successes in classification and recognition tasks.
@ -27,9 +23,10 @@ The most popular deep networks of this kind for end-to-end optical flow predicti
are variants of the FlowNet family \cite{}, which was recently extended to scene flow estimation \cite{}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is rather generic and is specialized for optical flow only through being trained
with a dense optical flow ground truth loss.
The same network could also be used for semantic segmentation if
the number of output channels was adapted from two to the number of classes. % TODO verify
FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to estimate optical flow reasonably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated only depends on the number of 2D strides or pooling
operations in the encoder.
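As a rough example, an encoder with $n$ stride-two convolutions or pooling operations reduces the spatial resolution of its feature maps by a factor of $2^n$, so each unit at the bottleneck aggregates information from a correspondingly larger input window; in FlowNetS, with six stride-two convolutions in the encoder, the bottleneck features have $1/64$ of the input resolution.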
@ -52,7 +49,7 @@ most popular deep networks for object detection, and have recently also been app
The original region-based convolutional network (R-CNN) uses a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped at the proposed region and the crop is
passed through a CNN, which performs classification of the object (or non-object, if the region shows background) and provides features for a class-specific bounding box refinement.
\paragraph{Fast R-CNN}
The original R-CNN involved computing one forward pass of the deep CNN for each of the region proposals,
@ -90,9 +87,12 @@ As in Fast R-CNN, RoI pooling is used to crop one fixed size feature map for eac
\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
However, it can be helpful to know the class and object (instance) membership of each individual pixel,
which generally involves computing a binary mask for each object instance specifying which pixels belong
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN extends the Faster R-CNN system to instance segmentation by predicting
fixed-resolution instance masks within the bounding boxes of each detected object.
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
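As a rough illustration of such a mask head, a PyTorch-style sketch with hypothetical layer sizes (the actual Mask R-CNN head may differ):
\begin{verbatim}
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    # Sketch of a convolutional mask head on RoI features; all sizes hypothetical.
    def __init__(self, in_channels=256, num_classes=81):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
        )
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.mask_logits = nn.Conv2d(256, num_classes, 1)

    def forward(self, roi_features):
        # roi_features: (num_rois, in_channels, 14, 14) RoI-pooled features.
        x = self.convs(roi_features)
        x = torch.relu(self.upsample(x))
        # One fixed-resolution mask per class and RoI,
        # e.g. (num_rois, num_classes, 28, 28) logits.
        return self.mask_logits(x)
\end{verbatim}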
In addition, Mask R-CNN


@ -2,14 +2,26 @@ We have introduced an extension on top of region-based convolutional networks to
in parallel to instance segmentation.
\subsection{Future Work}
\paragraph{Predicting depth}
In many practical settings, no depth measurements are available and we have to work with RGB frames alone.
To support this, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN, as in Figure \ref{}.
Although single-frame monocular depth prediction with deep networks has already been demonstrated
with some success,
our two-frame input should allow the network to exploit epipolar
geometry to obtain a more reliable depth estimate.
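A rough sketch of such a depth branch (PyTorch-style, with entirely hypothetical layer sizes), decoding backbone features into a dense depth map:
\begin{verbatim}
import torch.nn as nn
import torch.nn.functional as F

class DepthBranch(nn.Module):
    # Hypothetical depth decoder branching off the backbone, parallel to the RPN.
    def __init__(self, in_channels=256):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, backbone_features):
        # The backbone already sees both frames, so the decoder can in
        # principle exploit the parallax between them for depth.
        return F.softplus(self.decoder(backbone_features))  # positive depth
\end{verbatim}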
\paragraph{Training on real-world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step would be training on a more realistic dataset.
For example, we could first pre-train the RPN on an object detection dataset like
Cityscapes. As soon as the RPN works reliably, we could then perform alternating
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses (and depth prediction), as no instance segmentation ground truth exists.
On Cityscapes, we could continue training the full Mask R-CNN instance segmentation pipeline to
improve detections and masks and avoid any forgetting effects.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
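A rough NumPy/SciPy sketch of such a warping-based proxy loss; here the warping is not differentiable, whereas in the network it would be implemented with a bilinear sampling layer and the flow would be composed from the predicted motions (and depth):
\begin{verbatim}
import numpy as np
from scipy.ndimage import map_coordinates

def photometric_proxy_loss(img_t, img_t1, flow, eps=1e-3):
    # img_t, img_t1: (H, W, 3) consecutive frames; flow: (H, W, 2) in pixels.
    h, w, _ = img_t.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # Sample frame t+1 where the predicted flow points to.
    sample_y = ys + flow[..., 1]
    sample_x = xs + flow[..., 0]
    warped = np.stack([map_coordinates(img_t1[..., c], [sample_y, sample_x],
                                       order=1, mode='nearest')
                       for c in range(3)], axis=-1)
    # Robust (Charbonnier) photometric penalty; no motion ground truth needed.
    return np.mean(np.sqrt((warped - img_t) ** 2 + eps ** 2))
\end{verbatim}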


@ -1,6 +1,34 @@
\subsection{Motivation \& Goals}
% introduce problem to solve
% mention classical non deep-learning works, then say it would be nice to go end-to-end deep
% Explain benefits of learning (why deep-nize rigid scene model??)
Recently, SfM-Net \cite{} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a fixed set of binary full-image masks specifying the object membership of individual pixels with a standard encoder-decoder
network for pixel-wise prediction. A fully connected network branching off the encoder predicts a 3D motion for each object.
However, due to the fixed number of object masks, it can only predict a small number of motions and
often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions.
Thus, this approach is unlikely to scale to dynamic scenes with a potentially
large number of diverse objects due to the inflexible nature of its instance segmentation technique.
A scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{}, which inherits from Faster R-CNN the ability to detect
a large number of objects from a large number of classes at once,
and predicts pixel-precise segmentation masks for each detected object.
We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head
in parallel to classification and bounding box refinement.
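As a rough sketch of this idea (PyTorch-style, all sizes hypothetical), the per-RoI head would simply gain one additional output branch next to classification and box refinement:
\begin{verbatim}
import torch.nn as nn

class MotionRCNNHead(nn.Module):
    # Hypothetical per-RoI head with a motion branch in parallel to the
    # classification and bounding box refinement branches.
    def __init__(self, in_features=1024, num_classes=9, motion_dim=9):
        super().__init__()
        self.cls_score = nn.Linear(in_features, num_classes)
        self.bbox_pred = nn.Linear(in_features, 4 * num_classes)
        self.motion_pred = nn.Linear(in_features, motion_dim * num_classes)

    def forward(self, roi_features):
        # roi_features: (num_rois, in_features) flattened per-RoI features.
        return (self.cls_score(roi_features),
                self.bbox_pred(roi_features),
                self.motion_pred(roi_features))
\end{verbatim}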
\subsection{Related Work}
\paragraph{Deep optical flow estimation}
\paragraph{Deep scene flow estimation}
\paragraph{Structure from motion}
SfM-Net, SE3 Nets,
Behl2017ICCV