This commit is contained in:
Simon Meister 2017-11-15 12:53:48 +01:00
parent c157e9e1dd
commit 00039107be
7 changed files with 74 additions and 25 deletions

View File

@ -59,9 +59,9 @@ which is inspired by slanted-plane energy minimization methods for scene flow
Here, we build on recent advances in region-based convolutional
networks (R-CNNs) and integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D
camera, our end-to-end deep network detects objects with pixel-accurate object masks
and estimates the 3D motion of each detected object between the frames.
By additionally estimating the global camera motion within the same network,
we compose a dense optical flow field from the instance-level and global
motion estimates.

View File

@ -279,6 +279,20 @@ penalize rotation and translation. For the camera, the loss is reduced to the
classification term in this case.
\paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth}
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/flow_loss}
\caption{
Overview of the alternative, optical flow based loss for instance motion
supervision without 3D instance motion ground truth.
In contrast to SfM-Net, where a single optical flow field is
composed and penalized to supervise the motion prediction, our loss considers
the motion of each object in isolation and composes a batch of flow windows
for the RoIs.
}
\label{figure:flow_loss}
\end{figure}
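Figure \ref{figure:flow_loss} illustrates this composition. As a minimal sketch of how such a flow window can be obtained from a predicted instance motion (illustrative notation only, ignoring the camera motion; the precise loss is developed in the following), the flow at a pixel $\mathbf{x}$ inside RoI $i$ can be written as
\begin{equation}
\mathbf{w}_i(\mathbf{x}) = \pi\!\left(R_i\left(\pi^{-1}(\mathbf{x}, d(\mathbf{x})) - \mathbf{p}_i\right) + \mathbf{p}_i + \mathbf{t}_i\right) - \mathbf{x},
\end{equation}
where $\pi$ is the camera projection, $d$ the given depth, $(R_i, \mathbf{t}_i)$ the predicted rigid instance motion, and $\mathbf{p}_i$ the predicted pivot; penalizing the photometric error of warping $I_{t+1}$ to $I_t$ with $\mathbf{w}_i$ then supervises the instance motion without 3D ground truth.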
A more general way to supervise the object motions is a re-projection
loss similar to the unsupervised loss in SfM-Net \cite{SfmNet},
which we can apply to coordinates within the object bounding boxes,

View File

@ -59,7 +59,7 @@ Table \ref{table:flownets} shows the classical FlowNetS architecture for optical
& 5 $\times$ 5 deconv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
\multicolumn{3}{c}{...}\\
\midrule
flow & $\times$ 2 bilinear upsample & $\tfrac{1}{1}$ H $\times$ $\tfrac{1}{1}$ W $\times$ 2 \\
flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
\bottomrule
\caption {
@ -78,7 +78,7 @@ Potentially, the same network could also be used for semantic segmentation if
the number of final and intermediate output channels were adapted from two to the number of classes.
Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching reasonably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated only depends on the number of 2D strides or pooling
Note that the maximum displacement that can be correctly estimated depends on the number of 2D convolution strides or pooling
operations in the encoder.
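As a rough, illustrative calculation (assuming the six stride-$2$ convolutions of the FlowNetS encoder): the bottleneck features have a stride of $2^6 = 64$ pixels with respect to the input, so even a single $3 \times 3$ convolution at this level relates image locations that are up to $128$ pixels apart, which gives the order of magnitude of displacements the encoder can match.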
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
@ -364,9 +364,24 @@ which has a stride of $4$ with respect to the input image.
Most importantly, the RoI features can now be extracted at the pyramid level $P_j$ appropriate for a
RoI bounding box with size $h \times w$,
\begin{equation}
j = \log_2(\sqrt{w \cdot h} / 224). \todo{complete}
j = 2 + j_a,
\end{equation}
where
\begin{equation}
j_a = \mathrm{clip}\left(\left[\log_2\left(\frac{\sqrt{w \cdot h}}{s_0}\right)\right], 0, 4\right)
\label{eq:level_assignment}
\end{equation}
is the index of the corresponding anchor box (ordered from the smallest to the largest anchor) and
\begin{equation}
s_0 = 256 \cdot 0.125 = 32
\end{equation}
is the scale of the smallest anchor boxes.
This formula differs slightly from the one used in the FPN paper,
as we want to assign bounding boxes that are at the same scale
as some anchor to exactly the pyramid level from which the RPN predictions for that
anchor are computed. Now, for example, the smallest boxes are cropped from $P_2$,
which is the highest resolution feature map.
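As a further worked example of the level assignment (\ref{eq:level_assignment}), a RoI with $\sqrt{w \cdot h} = 128$ gives $j_a = \log_2(128 / 32) = 2$ and is therefore cropped from $P_4$, while boxes at or above the largest anchor scale of $s_0 \cdot 2^4 = 512$ are clipped to $j_a = 4$.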
{
@ -454,6 +469,8 @@ frequently in the following chapters. For vector or tuple arguments, the sum of
losses is computed.
For classification we define $\ell_{cls}$ as the cross-entropy classification loss.
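Concretely, in its standard form (as used in Fast R-CNN), for a predicted class distribution $p = (p_0, \dots, p_K)$ over $K$ object classes plus background and a ground-truth class $c^\ast$, the loss is
\begin{equation}
\ell_{cls}(p, c^\ast) = -\log p_{c^\ast}.
\end{equation}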
\label{ssec:rcnn_techn}
\paragraph{Bounding box regression}
All bounding boxes predicted by the RoI head or RPN are estimated as offsets

View File

@ -257,11 +257,17 @@
year = {2018}}
@inproceedings{UnsupDepth,
title={Unsupervised CNN for single view depth estimation: Geometry to the rescue},
author={Ravi Garg and BG Vijay Kumar and Gustavo Carneiro and Ian Reid},
booktitle={ECCV},
year={2016}}
@inproceedings{UnsupPoseDepth,
title={Unsupervised Learning of Depth and Ego-Motion from Video},
author={Tinghui Zhou and Matthew Brown and Noah Snavely and David G. Lowe},
booktitle={CVPR},
year={2017}}
@inproceedings{UnsupFlownet,
title={Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},

BIN
figures/flow_loss.png Executable file

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.4 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 72 KiB

After

Width:  |  Height:  |  Size: 76 KiB

View File

@ -57,6 +57,13 @@ Figure taken from \cite{SfmNet}.
Thus, this approach is very unlikely to scale to dynamic scenes with a potentially
large number of diverse objects, due to the inflexible nature of its instance segmentation technique.
Still, we think that the general idea of estimating object-level motion with
end-to-end deep networks instead
of directly predicting a dense flow field, as is common in current end-to-end
deep learning approaches to motion estimation, may significantly benefit motion
estimation by structuring the problem, introducing physical constraints, and reducing
the dimensionality of the estimate.
A scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{MaskRCNN}, which inherits the ability to detect
a large number of objects from a large number of classes at once from Faster R-CNN
@ -71,10 +78,10 @@ on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}.
}
\label{figure:maskrcnn_cs}
\end{figure}
We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head
Inspired by the accurate segmentation results of Mask R-CNN,
we propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end instance-level 3D motion estimation approach introduced with SfM-Net.
For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI Mask R-CNN head
in parallel to classification, bounding box refinement and mask prediction.
In this way, for each RoI, we predict a single rigid 3D object motion together with the object
pivot in camera space.
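That is, each instance motion comprises a rotation $R_i \in SO(3)$ and a translation $\mathbf{t}_i \in \mathbb{R}^3$ which, in the usual pivoted form, map an object point $\mathbf{X}$ in camera space to $R_i(\mathbf{X} - \mathbf{p}_i) + \mathbf{p}_i + \mathbf{t}_i$, where $\mathbf{p}_i \in \mathbb{R}^3$ is the predicted pivot.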
@ -86,8 +93,8 @@ as to the number or variety of object instances (Figure \ref{figure:net_intro}).
Eventually, we want to extend our method to include depth prediction,
yielding the first end-to-end deep network to perform 3D scene flow estimation
in a principled way from considering individual objects.
For now, we will work with RGB-D frames to break down the problem into
in a principled way from the consideration of individual objects.
For now, we will assume that RGB-D frames are given to break down the problem into
manageable pieces.
\begin{figure}[t]
@ -95,7 +102,7 @@ manageable pieces.
\includegraphics[width=\textwidth]{figures/net_intro}
\caption{
Overview of our network based on Mask R-CNN. For each RoI, we predict the instance motion
in parallel to the class, bounding box and mask. We branch off a additionaly
in parallel to the class, bounding box and mask. Additionally, we branch off a
small network for predicting the camera motion from the bottleneck.
}
\label{figure:net_intro}
@ -104,8 +111,8 @@ small network for predicting the camera motion from the bottleneck.
\subsection{Related work}
In the following, we will refer to systems which use deep networks for all
optimization and do not perform time-critical side computation at inference time as
\emph{end-to-end} deep learning systems.
optimization and do not perform time-critical side computation (e.g. numerical optimization)
at inference time as \emph{end-to-end} deep learning systems.
\paragraph{Deep networks in optical flow}
@ -118,14 +125,15 @@ image depending on the semantics of each region or pixel, which include whether
a pixel belongs to the background, to which object instance it belongs if it is not background,
and the class of the object it belongs to.
Often, failure cases of these methods include motion boundaries or regions with little texture,
where semantics become more important. \todo{elaborate}% TODO make sure this is a grounded statement
where semantics become important.
Extensions of these approaches to scene flow estimate flow and depth
with similarly generic networks \cite{SceneFlowDataset} and exhibit similar limitations.
Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure % TODO cite jun's paper?
the optical flow estimation problem and introduce reasoning at the object level,
but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components.
new input, as CNNs are only used for some of the components and numerical
optimization is central to their inference.
In contrast, we tackle motion estimation at the instance-level with end-to-end
deep networks and derive optical flow from the individual object motions.
@ -133,14 +141,16 @@ deep networks and derive optical flow from the individual object motions.
\paragraph{Slanted plane methods for 3D scene flow}
The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being
composed of planar segments. Pixels are assigned to one of the planar segments,
each of which undergoes a independent 3D rigid motion. % TODO explain benefits of this modelling, but unify with explanations above
each of which undergoes an independent 3D rigid motion.
This model simplifies the motion estimation problem significantly by reducing the dimensionality
of the estimate, and thus leads to accurate results.
In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015}
assigns each slanted plane to one rigidly moving object instance, thus
reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.
In all of these methods, pixel assignment and motion estimation are formulated
as an energy minimization problem which is optimized for each input data point,
without the use of (deep) learning. % TODO make sure it's ok to say there's no learning
without the use of (deep) learning.
In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
@ -183,25 +193,27 @@ SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a s
of the points into objects together with the 3D motion of each object.
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and
estimates a segmentation of pixels into objects together with their 3D motions between the frames.
In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning.
In addition, SfM-Net predicts dense depth and camera ego-motion to obtain full 3D scene flow from end-to-end deep learning.
For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate
with a brightness constancy proxy loss.
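In its simplest form (an illustrative sketch that omits the additional regularization used in \cite{SfmNet}), such a proxy loss penalizes the warping error
\begin{equation}
\sum_{\mathbf{x}} \left| I_{t+1}\!\left(\mathbf{x} + \mathbf{w}(\mathbf{x})\right) - I_t(\mathbf{x}) \right|
\end{equation}
of the composed dense flow $\mathbf{w}$ between consecutive frames $I_t$ and $I_{t+1}$.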
Like SfM-Net, we aim to estimate motion and instance segmentation jointly with
Like SfM-Net, we aim to estimate 3D motion and instance segmentation jointly with
end-to-end deep learning.
Unlike SfM-Net, we build on a scalable object detection and instance segmentation
approach with R-CNNs, which provide a strong baseline.
\paragraph{End-to-end deep networks for camera pose estimation}
Deep networks have been used for estimating the 6-DOF camera pose from
a single RGB frame \cite{PoseNet, PoseNet2}. These works are related to
a single RGB frame \cite{PoseNet, PoseNet2}, or for estimating depth and camera ego-motion
from monocular video \cite{UnsupPoseDepth}.
These works are related to
ours in that we also need to output various rotations and translations from a deep network
and thus need to solve similar regression problems and use similar parametrizations
and losses.
\subsection{Outline}
In section \ref{sec:background}, we introduce preliminaries and building
First, in section \ref{sec:background}, we introduce preliminaries and building
blocks from earlier works that serve as a foundation for our networks and losses.
Most importantly, we review the ResNet CNN (\ref{ssec:resnet}) that will serve as our CNN backbone
as well as the developments in region-based CNNs on which we build (\ref{ssec:rcnn}),
@ -212,11 +224,11 @@ followed by our losses and supervision methods for training
the extended region-based CNN (\ref{ssec:supervision}), and
finally the postprocessing steps we use to derive dense flow from our 3D motion estimates
(\ref{ssec:postprocessing}).
In section \ref{sec:experiments}, we introduce the Virtual KITTI dataset we use
Then, in section \ref{sec:experiments}, we introduce the Virtual KITTI dataset we use
for training our networks as well as all preprocessing steps we perform (\ref{ssec:datasets}),
give details of our experimental setup (\ref{ssec:setup}),
and finally describe the experimental results
on Virtual KITTI (\ref{ssec:vkitti}).
In section \ref{sec:conclusion}, we summarize our work and describe future
Finally, in section \ref{sec:conclusion}, we summarize our work and describe future
developments, including depth prediction, training on real world data,
and exploiting frames over longer time intervals.