Simon Meister 2017-10-30 10:45:38 +01:00
parent ae850b0282
commit 8edfcbac9f
6 changed files with 98 additions and 44 deletions


@@ -1,32 +1,36 @@
\begin{abstract}
Many state-of-the-art energy-minimization approaches to optical flow and scene
flow estimation rely on a (piecewise) rigid scene model, where the scene is
represented as an ensemble of distinct, rigidly moving components, a static
background, and a moving camera.
By constraining the optimization problem with a physically sound scene model,
these approaches enable highly accurate motion estimation.
With the advent of deep learning methods, it has become popular to re-purpose
generic deep networks for classical computer vision problems involving
pixel-wise estimation.
Following this trend, many recent end-to-end deep learning approaches to optical
flow and scene flow directly predict full-resolution flow fields with
a generic network for dense, pixel-wise prediction, thereby ignoring the
inherent structure of the underlying motion estimation problem and any physical
constraints within the scene.
We introduce a scalable end-to-end deep learning approach for dense motion estimation
that respects the structure of the scene as being composed of distinct objects,
thus combining the representation learning benefits of end-to-end deep networks
with a physically plausible scene model.
Building on recent advances in region-based convolutional networks (R-CNNs),
we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGBD camera,
our resulting end-to-end deep network detects objects with accurate per-pixel
masks and estimates the 3D motion of each detected object between the frames.
By additionally estimating a global camera motion in the same network,
we compose a dense optical flow field based on instance-level and global motion
predictions.
We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
benchmark.
\end{abstract}


@@ -1,15 +1,19 @@
\subsection{Optical flow, scene flow and structure from motion}
Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
sequence of images.
The optical flow $\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$ maps pixel
coordinates in the first frame $I_1$ to the pixel coordinates of the visually
corresponding pixel in the second frame $I_2$, thus representing the apparent
movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to three-dimensional space.
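One common way to make this correspondence precise is the brightness constancy
assumption (a standard formulation, stated here for illustration rather than
taken from any specific model):
\[
I_1(\mathbf{x}) \approx I_2\bigl(\mathbf{x} + \mathbf{w}(\mathbf{x})\bigr)
\quad \text{for all } \mathbf{x} \in P,
\]
which most energy-minimization and proxy-loss formulations penalize in some
robust norm.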
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures
\cite{ImageNetCNN, VGGNet, ResNet}
became widely popular through numerous successes in classification and recognition tasks.
The general structure of a CNN consists of a convolutional encoder, which
learns a spatially compressed but channel-wise wide representation of the input
image, and a fully connected prediction network on top of the encoder.
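As an illustrative sketch of this structure (all layer sizes are hypothetical
and chosen only for exposition, not the architecture used in this work):
\begin{verbatim}
import torch
import torch.nn as nn

# A minimal classification CNN: the encoder trades spatial resolution for
# channel width; a fully connected head predicts from the encoding.
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1),   # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), # 16x16 -> 8x8
            nn.ReLU())
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, num_classes))

    def forward(self, x):
        return self.head(self.encoder(x))

logits = SimpleCNN()(torch.randn(1, 3, 64, 64))  # shape (1, 10)
\end{verbatim}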
@@ -47,13 +51,13 @@ In the following, we give a short review of region-based convolutional networks,
the most popular deep networks for object detection, which have recently also been applied to instance segmentation.
\paragraph{R-CNN}
Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the form of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped at the proposed region and the crop is
passed through a CNN, which performs classification of the object (or non-object, if the region shows background). % and box refinement!
\paragraph{Fast R-CNN}
The original R-CNN involves computing one forward pass of the CNN for each of the region proposals,
which is costly, as there is generally a large number of proposals.
Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only a single forward pass with the whole image
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
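The computational difference can be sketched as follows (hypothetical shapes,
with a single strided convolution standing in for the encoder; an illustration
only, not the actual detection code):
\begin{verbatim}
import torch
from torchvision.ops import roi_pool

# Hypothetical single-layer "encoder" with total stride 16.
encoder = torch.nn.Conv2d(3, 256, 3, stride=16, padding=1)
image = torch.randn(1, 3, 512, 512)
proposals = torch.tensor([[0., 32., 32., 160., 160.],   # (batch idx, x1, y1, x2, y2)
                          [0., 64., 200., 300., 400.]])

# R-CNN: one encoder pass per cropped proposal (slow for many proposals).
# Fast R-CNN: a single shared encoder pass over the whole image ...
features = encoder(image)                               # (1, 256, 32, 32)
# ... then RoI pooling cuts a fixed-size feature window out of the shared
# feature map for every proposal, so the per-proposal work is tiny.
rois = roi_pool(features, proposals, output_size=(7, 7),
                spatial_scale=1.0 / 16)                 # (2, 256, 7, 7)
# A small per-region head then classifies (and refines) each pooled RoI.
\end{verbatim}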
@@ -66,7 +70,7 @@ speeding up the system by orders of magnitude. % TODO verify that
\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster processing when compared to Fast R-CNN
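Schematically, Faster R-CNN replaces the external proposal algorithm with a
small convolutional region proposal network (RPN) over the same shared features
(continuing the sketch above; k and all shapes are hypothetical):
\begin{verbatim}
# RPN sketch: a conv head slides over the shared feature map and scores
# k anchor boxes per spatial location, plus box refinements for each.
k = 9
rpn_conv = torch.nn.Conv2d(256, 256, 3, padding=1)
rpn_objectness = torch.nn.Conv2d(256, k, 1)       # object vs. background score
rpn_box_deltas = torch.nn.Conv2d(256, 4 * k, 1)   # per-anchor box refinement

hidden = torch.relu(rpn_conv(features))
scores = rpn_objectness(hidden)                   # (1, k, 32, 32)
deltas = rpn_box_deltas(hidden)                   # (1, 4k, 32, 32)
# The top-scoring refined anchors replace the external proposals that were
# previously fed to roi_pool, so the whole detector is one deep network.
\end{verbatim}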

bib.bib

@@ -52,7 +52,7 @@
Booktitle = {{ICCV}},
Year = {2015}}
@inproceedings{InstanceSceneFlow,
Author = {Aseem Behl and Omid Hosseini Jafari and Siva Karthik Mustikovela and
Hassan Abu Alhaija and Carsten Rother and Andreas Geiger},
Title = {Bounding Boxes, Segmentations and Object Coordinates:
@@ -125,8 +125,20 @@
booktitle = {{CVPR}},
year = {2012}}
@inproceedings{KITTI2015,
author = {Moritz Menze and Andreas Geiger},
title = {Object Scene Flow for Autonomous Vehicles},
booktitle = {{CVPR}},
year = {2015}}
@inproceedings{PRSF,
author = {C. Vogel and K. Schindler and S. Roth},
title = {Piecewise Rigid Scene Flow},
booktitle = {{ICCV}},
year = {2013}}
@article{PRSM,
author = {C. Vogel and K. Schindler and S. Roth},
title = {3D Scene Flow with a Piecewise Rigid Scene Model},
journal = {{IJCV}},
year = {2015}}


@@ -22,6 +22,6 @@ steps of training on, for example, Cityscapes and the KITTI stereo and optical f
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses (and depth prediction), as no instance segmentation ground truth exists.
On Cityscapes, we could continue training the full instance segmentation Mask R-CNN to
improve detection and masks and avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.


@@ -1,8 +1,24 @@
\subsection{Motivation}
% introduce problem to solve
% mention classical non deep-learning works, then say it would be nice to go end-to-end deep
% Steal intro from behl2017 & FlowLayers
Deep learning research is increasingly moving from single images towards videos.
Motion estimation is an inherently ambiguous problem, and a recent trend is
towards end-to-end deep learning systems and away from energy minimization.
Often, however, this leads to a compromise in modelling, as it is more
difficult to formulate an end-to-end deep network architecture for a given
problem than it is to state a feasible energy-minimization problem.
For this reason, we see many generic models applied to domains which previously
employed intricate physical models to simplify optimization.
On the one hand, end-to-end deep learning may bring unique benefits due to the
ability of a learned system to deal with ambiguity.
On the other hand, discarding the physical structure of the scene gives up
constraints that energy-minimization approaches exploit for highly accurate
estimation.
%Thus, there is an emerging trend to unify geometry with deep learning by
Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a batch of binary full-image masks specifying the object memberships of individual pixels with a standard encoder-decoder
@@ -23,24 +39,42 @@ Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM
For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head
in parallel to classification and bounding box refinement.
\subsection{Related work}
\paragraph{Deep networks in optical flow and scene flow}
\paragraph{Deep networks for 3D motion estimation}
\cite{FlowLayers}
\cite{ESI}
\paragraph{Slanted plane methods for 3D scene flow}
The slanted plane model for scene flow \cite{PRSF, PRSM} represents a 3D scene as being
composed of planar segments. Pixels are assigned to one of the planar segments,
each of which undergoes a rigid motion.
In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015}
assigns each slanted plane to one rigidly moving object instance, thus
reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.
In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks, which are then combined
with depth obtained from a non-learned stereo algorithm and used as pre-computed
inputs to the Object Scene Flow model from \cite{KITTI2015}.
Interestingly, these slanted plane methods achieve the current state-of-the-art
in scene flow \emph{and} optical flow estimation on the KITTI benchmarks \cite{KITTI2012, KITTI2015},
outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
%
In other contexts, replacing slow, non-learned pipeline components with fully
end-to-end learned networks has improved both speed and accuracy, as the
evolution of R-CNNs outlined above illustrates.
\paragraph{End-to-end deep networks for 3D rigid motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first introduced with
SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation
of the points into objects together with the 3D motion of each object.
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and
estimates a segmentation of pixels into objects together with their 3D motions between the frames.
In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning.
For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate
with a brightness constancy proxy loss.
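To sketch how such a composition works (notation ours and hypothetical, not
taken from \cite{SfmNet}): given a predicted depth $d(\mathbf{x})$, camera
intrinsics $K$, and the rigid motion $(R, \mathbf{t})$ assigned to the object
(or static background) containing pixel $\mathbf{x}$, the pixel is
backprojected to 3D, rigidly moved, and reprojected:
\[
\mathbf{w}(\mathbf{x}) = \pi\!\left(K \left(R\, d(\mathbf{x})\, K^{-1} \tilde{\mathbf{x}} + \mathbf{t}\right)\right) - \mathbf{x},
\]
where $\tilde{\mathbf{x}}$ denotes homogeneous pixel coordinates and $\pi$ the
perspective division. A brightness constancy proxy loss such as
$\sum_{\mathbf{x}} \lVert I_2(\mathbf{x} + \mathbf{w}(\mathbf{x})) - I_1(\mathbf{x}) \rVert_1$
can then supervise the motion, depth, and assignment predictions jointly.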
Recently, deep CNN-based recognition was combined with energy-based 3D scene flow estimation \cite{InstanceSceneFlow}.
\cite{FlowLayers}
\cite{ESI}


@@ -151,7 +151,7 @@
% Use keyword=meinbegriff to print only those entries from your .bib file that are tagged with meinbegriff.
% If a certain keyword must not be contained, use notkeyword=meinbegriff.
\singlespacing
\printbibliography[title=Bibliography, heading=bibliography]
%\printbibliography[title=Literaturverzeichnis, heading=bibliography, keyword=meinbegriff]