This commit is contained in:
Simon Meister 2017-11-04 15:01:43 +01:00
parent 84c5b1e6cd
commit a6311dca56
5 changed files with 104 additions and 46 deletions


@ -31,6 +31,9 @@ By additionally estimating a global camera motion in the same network,
we compose a dense optical flow field based on instance-level and global motion
predictions.
%We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
%benchmark.
\end{abstract}
\renewcommand{\abstractname}{Zusammenfassung}
\begin{abstract}
\todo{german abstract}
\end{abstract}


@ -17,7 +17,15 @@ laying the foundation for our motion estimation. Instead of taking a single image,
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
We do not introduce a separate network for computing region proposals and use our modified backbone network
as both the first-stage RPN and the second-stage feature extractor for region cropping.
Technically, our feature encoder network will have to learn a motion representation similar to
that learned by the FlowNet encoder, but the output will be computed in the
object-centric framework of a region-based convolutional network head with a 3D parametrization.
Thus, in contrast to the dense FlowNet decoder, the estimated dense motion information
from the encoder is integrated for specific objects via RoI cropping and
processed by the RoI head for each object.
\todo{figure of backbone}
\todo{introduce optional XYZ input}
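As an illustration, the two-frame input can be formed by a single concatenation
(a minimal sketch, assuming a TensorFlow implementation; the function name is illustrative):
\begin{verbatim}
import tensorflow as tf

def backbone_input(image_t, image_tp1):
    # Stack two consecutive RGB frames along the channel axis to form
    # the six-channel input map for the modified backbone.
    # image_t, image_tp1: [batch, height, width, 3] float tensors.
    return tf.concat([image_t, image_tp1], axis=3)  # [batch, H, W, 6]
\end{verbatim}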
\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
@ -70,6 +78,7 @@ where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction along any axis.
All predictions are made in camera space, and translation and pivot predictions are in meters.
\todo{figure of head}
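To make the parametrization concrete, the following NumPy sketch recovers a rotation
matrix from the predicted, clipped sines; the Euler angle convention and composition
order are assumptions made for illustration:
\begin{verbatim}
import numpy as np

def rotation_from_sines(sines):
    # sines: predicted (sin(alpha), sin(beta), sin(gamma)), clipped to
    # [-1, 1]; valid under the assumption of rotations of at most
    # 90 degrees in either direction along any axis.
    a, b, c = np.arcsin(np.clip(sines, -1.0, 1.0))
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(a), -np.sin(a)],
                   [0, np.sin(a), np.cos(a)]])
    Ry = np.array([[np.cos(b), 0, np.sin(b)],
                   [0, 1, 0],
                   [-np.sin(b), 0, np.cos(b)]])
    Rz = np.array([[np.cos(c), -np.sin(c), 0],
                   [np.sin(c), np.cos(c), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx  # one possible composition order
\end{verbatim}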
\paragraph{Camera motion prediction}
In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
@ -111,23 +120,53 @@ l_{p}^k = \lVert p^{gt,i_k} - p^{k,c_k} \rVert_1.
\end{equation}
\paragraph{Camera motion supervision}
We supervise the camera motion with ground truth analogously to the
object motions; the only difference is that the camera motion has
a rotation and translation, but no pivot term.
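A minimal sketch of this supervision, assuming $\ell_1$ penalties on each term as for
the pivot loss above (whether the rotation penalty acts on angles or matrix entries is
an implementation choice; matrix entries are used here for illustration):
\begin{verbatim}
import numpy as np

def motion_loss(R_pred, t_pred, R_gt, t_gt, p_pred=None, p_gt=None):
    # l1 penalties on rotation, translation and (for objects) pivot;
    # for the camera motion, the pivot arguments are simply omitted.
    loss = np.abs(R_pred - R_gt).sum() + np.abs(t_pred - t_gt).sum()
    if p_gt is not None:
        loss += np.abs(p_pred - p_gt).sum()
    return loss
\end{verbatim}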
\paragraph{Per-RoI supervision \emph{without} motion ground truth}
A more general way to supervise the object motions is a re-projection
loss similar to the unsupervised loss in SfM-Net \cite{SfmNet},
which can be applied to coordinates within the object bounding boxes
and does not require ground truth 3D object motions.
For any RoI, we generate a uniform 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask. We use the same bounding box
to crop the corresponding region from the dense, full image depth map
and bilinearly resize the depth crop to the same resolution as the mask and point
grid.
We then compute the optical flow at each of the grid points by creating
a 3D point cloud from the point grid and depth crop. To this point cloud, we
apply the RoI's predicted motion, masked by the predicted mask.
Then, we apply the camera motion to the points, project them back to 2D
and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
Note that we batch this computation over all RoIs, so that we only perform
it once per forward pass. The mathematical details are analogous to the
dense, full image flow computation in the following subsection and will not
be repeated here. \todo{add diagram to make it easier to understand}
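Nevertheless, the following unbatched NumPy sketch illustrates the steps just listed
for a single RoI (names are illustrative; the pivot-centered rigid transform follows
the parametrization above):
\begin{verbatim}
import numpy as np

def roi_flow_grid(box, depth_crop, mask, R_o, t_o, pivot, R_c, t_c, cam):
    # box: (x0, y0, x1, y1) RPN proposal; depth_crop, mask: [m, m];
    # cam = (f, c0, c1) are the camera intrinsics.
    f, c0, c1 = cam
    m = mask.shape[0]
    x, y = np.meshgrid(np.linspace(box[0], box[2], m),
                       np.linspace(box[1], box[3], m))
    # back-project the 2D point grid to 3D using the depth crop
    Z = depth_crop
    P = np.stack([(x - c0) * Z / f, (y - c1) * Z / f, Z], axis=-1)
    # apply the RoI's predicted motion, gated by the predicted mask
    P_o = (P - pivot) @ R_o.T + pivot + t_o
    P = mask[..., None] * P_o + (1 - mask[..., None]) * P
    # apply the camera motion and re-project to 2D
    P = P @ R_c.T + t_c
    x2 = f * P[..., 0] / P[..., 2] + c0
    y2 = f * P[..., 1] / P[..., 2] + c1
    # optical flow as the difference of re-projected and initial grids
    return np.stack([x2 - x, y2 - y], axis=-1)
\end{verbatim}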
For each RoI, we can now penalize the optical flow grid to supervise the object motion.
If there is optical flow ground truth available, we can use the RoI bounding box to
crop and resize a region from the ground truth optical flow to match the RoI's
optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss.
However, we can also use the re-projection loss without optical flow ground truth
to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}.
In this case, we use the bounding box to crop and resize a corresponding region
from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$
using the 2D grid displaced with the predicted flow grid. Then, we can penalize the difference
between the resulting image crops, for example, with a census loss \cite{CensusTerm,UnFlow}.
For more details on differentiable bilinear sampling for deep learning, we refer the reader to
\cite{STN}.
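For illustration, a minimal NumPy sketch of bilinear sampling at the displaced grid
(coordinates are assumed to be in-bounds for clarity):
\begin{verbatim}
import numpy as np

def bilinear_sample(img, x, y):
    # img: [H, W, C]; x, y: arrays of real-valued pixel coordinates.
    x0 = np.clip(np.floor(x).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, img.shape[0] - 2)
    wx, wy = x - x0, y - y0
    top = (1 - wx)[..., None] * img[y0, x0] \
        + wx[..., None] * img[y0, x0 + 1]
    bot = (1 - wx)[..., None] * img[y0 + 1, x0] \
        + wx[..., None] * img[y0 + 1, x0 + 1]
    return (1 - wy)[..., None] * top + wy[..., None] * bot

# Warp the crop from the second image to time t with the predicted flow
# grid, then penalize its difference to the crop from the first image:
# warped = bilinear_sample(crop_tp1, x + flow[..., 0], y + flow[..., 1])
\end{verbatim}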
When compared to supervision with motion ground truth, a re-projection
loss could benefit motion regression by removing any loss balancing issues between the
rotation, translation and pivot terms \cite{PoseNet2},
which can make it attractive even when 3D motion ground truth is available.
\subsection{Dense flow from motion}
As a postprocessing step, we compose a dense optical flow map from the outputs of our Motion R-CNN network.
Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$,
where
\begin{equation}
@ -143,6 +182,7 @@ x_t - c_0 \\ y_t - c_1 \\ f
\end{equation}
is the 3D coordinate at $t$ corresponding to the point with pixel coordinates $x_t, y_t$,
which range over all coordinates in $I_t$.
For now, the depth map is always assumed to come from ground truth.
Given $k$ detections with predicted motions as above, we transform all points within the bounding
box of a detected object according to the predicted motion of the object.
@ -166,6 +206,10 @@ X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
= P_{t+1} = R_t^{cam} \cdot P'_{t+1} + t_t^{cam}.
\end{equation}
Note that in our experiments, we either use the ground truth camera motion, to focus
on the object motion predictions, or the predicted camera motion, to estimate the complete
scene motion. We will always state which variant we use in the experimental section.
Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
\begin{equation}
\begin{pmatrix}
@ -191,7 +235,3 @@ u \\ v
x_{t+1} - x_{t} \\ y_{t+1} - y_{t}
\end{pmatrix}.
\end{equation}
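Mirroring the per-RoI sketch above, the dense composition can be summarized as
follows (a NumPy sketch with illustrative names; integer box coordinates are assumed):
\begin{verbatim}
import numpy as np

def dense_flow(depth, detections, R_c, t_c, cam):
    # depth: [H, W] ground truth depth map for I_t;
    # detections: one (box, mask, R_k, t_k, pivot_k) tuple per object;
    # cam = (f, c0, c1) are the camera intrinsics.
    f, c0, c1 = cam
    H, W = depth.shape
    x, y = np.meshgrid(np.arange(W, dtype=float),
                       np.arange(H, dtype=float))
    P = np.stack([(x - c0) * depth / f, (y - c1) * depth / f, depth],
                 axis=-1)
    for (x0, y0, x1, y1), mask, R_k, t_k, pivot_k in detections:
        crop = P[y0:y1, x0:x1]
        moved = (crop - pivot_k) @ R_k.T + pivot_k + t_k
        m = mask[..., None]  # instance mask resized to the box resolution
        P[y0:y1, x0:x1] = m * moved + (1 - m) * crop
    P = P @ R_c.T + t_c  # predicted or ground truth camera motion
    x2 = f * P[..., 0] / P[..., 2] + c0
    y2 = f * P[..., 1] / P[..., 2] + c1
    return np.stack([x2 - x, y2 - y], axis=-1)  # dense optical flow
\end{verbatim}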


@ -1,17 +1,20 @@
Here, we give a more detailed description of the previous works
we directly build on, as well as other prerequisites.
\subsection{Optical flow and scene flow}
Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
sequence of images.
The optical flow
$\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$
maps pixel coordinates in the first frame $I_1$ to pixel coordinates of the
visually corresponding pixel in the second frame $I_2$,
and can be interpreted as the apparent movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
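This correspondence is commonly formalized through the brightness constancy
assumption,
\begin{equation}
I_1(\mathbf{x}) \approx I_2(\mathbf{x} + \mathbf{w}(\mathbf{x})),
\end{equation}
where $\mathbf{x} = (x, y)^T$ ranges over the pixel coordinates $P$.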
Scene flow is the generalization of optical flow to three-dimensional space and
requires estimating depth for each pixel. Generally, stereo input is used for scene flow
to estimate disparity-based depth; however, monocular depth estimation with deep networks is becoming
popular \cite{DeeperDepth}.
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures
@ -30,27 +33,18 @@ The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is a rather generic encoder-decoder network and is specialized for optical flow only through being trained
with supervision from dense optical flow ground truth.
Potentially, the same network could also be used for semantic segmentation if
the number of output channels were adapted from two to the number of classes. % TODO verify
Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching reasonably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated depends only on the number of 2D strides or pooling
operations in the encoder.
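For example, in an encoder with $k$ stride-2 operations, the bottleneck features have
a stride of $2^k$ input pixels, so for $k = 6$ each bottleneck unit aggregates
information over at least $2^6 = 64$ pixels and displacements of this magnitude can,
in principle, still be matched.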
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
\subsection{Region-based convolutional networks}
We now give a short review of region-based convolutional networks, which are currently by far the
most popular deep networks for object detection, and have recently also been applied to instance segmentation.
\paragraph{R-CNN}
@ -101,10 +95,11 @@ which generally involves computing a binary mask for each object instance specific
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
fixed resolution instance masks within the bounding boxes of each detected object.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition to extending the original Faster R-CNN head, Mask R-CNN also introduced a network
variant based on Feature Pyramid Networks \cite{FPN}.
Figure \ref{} compares the two Mask R-CNN head variants.
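As an illustration, consider the following sketch of such a mask head on RoI-aligned
features (layer sizes follow common Mask R-CNN configurations, but are assumptions
here):
\begin{verbatim}
import tensorflow as tf

def mask_head(roi_features, num_classes):
    # roi_features: [num_rois, 14, 14, C] RoI-aligned feature crops.
    x = roi_features
    for _ in range(4):  # a small stack of 3x3 convolutions
        x = tf.keras.layers.Conv2D(256, 3, padding='same',
                                   activation='relu')(x)
    # upsample once and predict one fixed-resolution mask per class
    x = tf.keras.layers.Conv2DTranspose(256, 2, strides=2,
                                        activation='relu')(x)
    return tf.keras.layers.Conv2D(num_classes, 1, activation='sigmoid')(x)
\end{verbatim}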
\paragraph{Supervision of the RPN}
\paragraph{Supervision of the RoI head}

bib.bib

@ -172,3 +172,21 @@
title = {Geometric loss functions for camera pose regression with deep learning},
booktitle = {CVPR},
year = {2017}}
@inproceedings{STN,
author = {Max Jaderberg and Karen Simonyan and Andrew Zisserman and Koray Kavukcuoglu},
title = {Spatial transformer networks},
booktitle = {NIPS},
year = {2015}}
@inproceedings{CensusTerm,
author = {Fridtjof Stein},
title = {Efficient Computation of Optical Flow Using the Census Transform},
booktitle = {DAGM},
year = {2004}}
@inproceedings{DeeperDepth,
author = {Iro Laina and Christian Rupprecht and Vasileios Belagiannis and Federico Tombari and Nassir Navab},
title = {Deeper Depth Prediction with Fully Convolutional Residual Networks},
booktitle = {3DV},
year = {2016}}


@ -12,7 +12,7 @@ of each obstacle, but to also know if and where the obstacle is moving,
and to use sensors that will not make the system too expensive for widespread use.
There are many other applications. %TODO(make motivation wider)
A promising approach to 3D scene understanding in these situations is the use of deep neural
networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification
in still images and are increasingly being applied to video data.
A key benefit of end-to-end deep networks is that they can, in principle,
@ -20,7 +20,7 @@ enable very fast inference on real time video data and generalize
over many training examples to resolve ambiguities inherent in image understanding
and motion estimation.
Thus, in this work, we aim to develop end-to-end deep networks which can, given
sequences of images, segment the image pixels into object instances and estimate
the location and 3D motion of each object instance relative to the camera.
@ -44,19 +44,21 @@ and predicts pixel-precise segmentation masks for each detected object.
We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head
in parallel to classification, bounding box refinement and mask prediction.
For each RoI, we predict a single 3D rigid object motion together with the object
pivot in camera space.
As a foundation for image matching, we extend the ResNet \cite{ResNet} backbone of Mask R-CNN to take
two concatenated images as input, similar to FlowNetS \cite{FlowNet}.
This results in a fully integrated end-to-end network architecture for segmenting pixels into instances
and estimating the motion of all detected instances without any limitations
as to the number or variety of object instances.
Figure \ref{} gives an overview of our network.
Eventually, we want to extend our method to include depth prediction,
yielding the first end-to-end deep network to perform 3D scene flow estimation
in a principled way by considering individual objects.
For now, we will work with RGB-D frames to break down the problem into
manageable pieces.
\subsection{Related work}
@ -75,12 +77,12 @@ image depending on the semantics of each region or pixel, which include whether
pixel belongs to the background, to which object instance it belongs if it is not background,
and the class of the object it belongs to.
Often, failure cases of these methods include motion boundaries or regions with little texture,
where semantics become more important. \todo{elaborate}% TODO make sure this is a grounded statement
Extensions of these approaches to scene flow estimate flow and depth
with similarly generic networks \cite{SceneFlowDataset} and similar limitations.
Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure % TODO cite jun's paper?
the optical flow estimation problem and introduce reasoning at the object level,
but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components.
@ -94,7 +96,7 @@ reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.
In these methods, pixel assignment and motion estimation are formulated
as an energy-minimization problem and optimized for each input data point,
without the use of (deep) learning. % TODO make sure it's ok to say there's no learning
In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
@ -126,8 +128,8 @@ in speed, but also in accuracy, especially considering the inherent ambiguity of
and the ability of deep networks to learn to handle ambiguity from a large variety of training examples.
However, we think that the current end-to-end deep learning approaches to motion
estimation are likely limited by a lack of spatial structure and regularity in their estimates
as explained above, which stems from the generic nature of the employed networks.
To this end, we aim to combine the modelling benefits of rigid scene decompositions
with the promise of end-to-end deep learning.