commit 8edfcbac9f
parent ae850b0282

WIP
abstract.tex (40 changed lines)
@@ -1,32 +1,36 @@
\begin{abstract}
Many state-of-the-art energy-minimization approaches to optical flow and scene
flow estimation rely on a (piecewise) rigid scene model, where the scene is
represented as an ensemble of distinct, rigidly moving components, a static
background and a moving camera.
By constraining the optimization problem with a physically sound scene model,
these approaches enable highly accurate motion estimation.
With the advent of deep learning methods, it has become popular to re-purpose
generic deep networks for classical computer vision problems involving
pixel-wise estimation.
Following this trend, many recent end-to-end deep learning approaches to optical
flow and scene flow directly predict full-resolution flow fields with
a generic network for dense, pixel-wise prediction, thereby ignoring the
inherent structure of the underlying motion estimation problem and any physical
constraints within the scene.
We introduce a scalable end-to-end deep learning approach for dense motion
estimation that respects the structure of the scene as being composed of
distinct objects, thus combining the representation learning benefits of
end-to-end deep networks with a physically plausible scene model.
Building on recent advances in region-based convolutional networks (R-CNNs),
we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGBD camera,
our resulting end-to-end deep network detects objects with accurate per-pixel
masks and estimates the 3D motion of each detected object between the frames.
By additionally estimating a global camera motion in the same network,
we compose a dense optical flow field based on instance-level and global motion
predictions.
We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
benchmark.
\end{abstract}
@@ -1,15 +1,19 @@
\subsection{Optical flow, scene flow and structure from motion}
Let $I_1, I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
sequence of images.
The optical flow $\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$ maps pixel
coordinates in the first frame $I_1$ to the pixel coordinates of the visually
corresponding pixel in the second frame $I_2$, thus representing the apparent
movement of brightness patterns between the two frames.
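This correspondence is commonly formalized as a brightness constancy
assumption (the standard formulation, stated here for reference):
\begin{equation}
I_1(\mathbf{x}) \approx I_2\big(\mathbf{x} + \mathbf{w}(\mathbf{x})\big)
\quad \text{for all } \mathbf{x} \in P,
\end{equation}
i.e., a pixel and its flow-displaced counterpart are assumed to show the same
scene point with unchanged brightness.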
Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to three-dimensional space.
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures
\cite{ImageNetCNN, VGGNet, ResNet}
became widely popular through numerous successes in classification and
recognition tasks.
The general structure of a CNN consists of a convolutional encoder, which
learns a spatially compressed, wide (in the number of channels) representation
of the input image, and a fully connected prediction network on top of the
encoder.
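To make the encoder/head split concrete, a minimal sketch in PyTorch (layer
counts and sizes are illustrative assumptions, not those of the cited
architectures):

import torch
import torch.nn as nn

# Convolutional encoder: spatial resolution shrinks while the number of
# channels (the "width" of the representation) grows.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
# Fully connected prediction network on top of the encoder.
head = nn.Sequential(nn.Flatten(), nn.Linear(128 * 8 * 8, 10))

x = torch.randn(1, 3, 64, 64)   # dummy 64x64 RGB image
logits = head(encoder(x))       # (1, 10) class scores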
@@ -47,13 +51,13 @@ In the following, we give a short review of region-based convolutional networks,
which are among the most popular deep networks for object detection and have
recently also been applied to instance segmentation.
\paragraph{R-CNN}
Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned
algorithm external to a standard encoder CNN for computing \emph{region
proposals} in the shape of 2D bounding boxes, which represent regions that may
contain an object.
For each of the region proposals, the input image is cropped at the proposed
region and the crop is passed through a CNN, which performs classification of
the object (or non-object, if the region shows background). % and box refinement!
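Schematically, R-CNN inference looks as follows (a sketch only;
propose_regions, crop_and_resize and cnn_classify are hypothetical stand-ins
for the components described above):

def rcnn_detect(image):
    detections = []
    # external, non-learned proposal algorithm (e.g. selective search)
    for box in propose_regions(image):
        crop = crop_and_resize(image, box)   # one crop per proposal
        label = cnn_classify(crop)           # one CNN forward pass per crop
        if label != "background":
            detections.append((box, label))
    return detections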
\paragraph{Fast R-CNN}
The original R-CNN involves computing one forward pass of the CNN for each of
the region proposals, which is costly, as there is generally a large number of
proposals.
Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only
a single forward pass with the whole image as input to the CNN (compared to the
sequential input of crops in the case of R-CNN).
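The enabling operation is extracting a fixed-size feature per proposal from
the shared feature map; a minimal sketch using torchvision's RoI pooling
(shapes are illustrative):

import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)  # shared feature map, one forward pass
# each box: (batch_index, x1, y1, x2, y2) in feature-map coordinates
boxes = torch.tensor([[0., 10., 10., 30., 30.],
                      [0.,  5., 20., 25., 45.]])
pooled = roi_pool(features, boxes, output_size=(7, 7))  # (2, 256, 7, 7)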
@@ -66,7 +70,7 @@ speeding up the system by orders of magnitude. % TODO verify that
\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of
the region proposal algorithm, which has to be run prior to the network passes
and makes up a large portion of the total processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the
generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster processing when
compared to Fast R-CNN.
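For reference, this unified design is what torchvision ships as a ready-made
detector; a minimal usage sketch (the weights argument may differ across
torchvision versions):

import torch
import torchvision

# one network containing the region proposal network (RPN),
# box refinement and classification
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
model.eval()
images = [torch.rand(3, 300, 400)]       # list of CHW tensors in [0, 1]
with torch.no_grad():
    predictions = model(images)          # per-image boxes, labels, scores
print(predictions[0]["boxes"].shape)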
bib.bib (16 changed lines)
@@ -52,7 +52,7 @@
  booktitle = {{ICCV}},
  year = {2015}}

@inproceedings{InstanceSceneFlow,
  author = {Aseem Behl and Omid Hosseini Jafari and Siva Karthik Mustikovela and
            Hassan Abu Alhaija and Carsten Rother and Andreas Geiger},
  title = {Bounding Boxes, Segmentations and Object Coordinates:
@@ -125,8 +125,20 @@
  booktitle = {{CVPR}},
  year = {2012}}

@inproceedings{KITTI2015,
  author = {Moritz Menze and Andreas Geiger},
  title = {Object Scene Flow for Autonomous Vehicles},
  booktitle = {{CVPR}},
  year = {2015}}

@inproceedings{PRSF,
  author = {C. Vogel and K. Schindler and S. Roth},
  title = {Piecewise Rigid Scene Flow},
  booktitle = {{ICCV}},
  year = {2013}}

@article{PRSM,
  author = {C. Vogel and K. Schindler and S. Roth},
  title = {3D Scene Flow with a Piecewise Rigid Scene Model},
  journal = {{IJCV}},
  year = {2015}}
@@ -22,6 +22,6 @@ steps of training on, for example, Cityscapes and the KITTI stereo and optical flow
On KITTI stereo and flow, we could run the instance segmentation component in
testing mode and only penalize the motion losses (and depth prediction), as no
instance segmentation ground truth exists.
On Cityscapes, we could continue training the full instance segmentation
Mask R-CNN to improve detection and masks and avoid forgetting instance
segmentation.
As an alternative to this training scheme, we could investigate training on a
pure instance segmentation dataset with unsupervised warping-based proxy losses
for the motion (and depth) prediction.
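A warping-based proxy loss of this kind penalizes the photometric difference
between the first frame and the second frame warped by the predicted flow; a
minimal sketch in PyTorch (our assumption of the formulation, with flow given
in pixels):

import torch
import torch.nn.functional as F

def photometric_loss(img1, img2, flow):
    # img1, img2: (N, 3, H, W); flow: (N, 2, H, W) in pixels, img1 -> img2
    n, _, h, w = img1.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float()  # (H, W, 2), (x, y) order
    target = base + flow.permute(0, 2, 3, 1)      # pixel positions in img2
    # normalize coordinates to [-1, 1] as expected by grid_sample
    tx = 2.0 * target[..., 0] / (w - 1) - 1.0
    ty = 2.0 * target[..., 1] / (h - 1) - 1.0
    warped = F.grid_sample(img2, torch.stack((tx, ty), dim=-1),
                           align_corners=True)
    return (img1 - warped).abs().mean()           # brightness constancy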
@@ -1,8 +1,24 @@
\subsection{Motivation}

% introduce problem to solve
% mention classical non deep-learning works, then say it would be nice to go end-to-end deep

% Steal intro from behl2017 & FlowLayers

Deep learning research is moving towards videos.
Motion estimation is an inherently ambiguous problem, and
a recent trend is towards end-to-end deep learning systems, away from energy-minimization.
Often, however, this leads to a compromise in modelling, as it is more difficult to
formulate an end-to-end deep network architecture for a given problem than it is
to state a feasible energy-minimization problem.
For this reason, we see many generic models applied to domains which previously
employed intricate physical models to simplify optimization.
On the one hand, end-to-end deep learning may bring unique benefits due to the ability
of a learned system to deal with ambiguity.
On the other hand,
%Thus, there is an emerging trend to unify geometry with deep learning by
% THE ABOVE IS VERY DRAFT_LIKE

Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a batch of binary full-image masks specifying the object memberships of individual pixels with a standard encoder-decoder
@@ -23,24 +39,42 @@ Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net
For this, we naturally integrate 3D motion prediction for individual objects
into the per-RoI R-CNN head, in parallel to classification and bounding box
refinement.
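A sketch of what such a per-RoI head could look like (illustrative only; the
feature size, hidden width, and 6-DoF motion parametrization are our
assumptions, not the thesis architecture):

import torch.nn as nn

class MotionRCNNHead(nn.Module):
    # classification and box refinement as in the R-CNN head,
    # plus a parallel branch predicting a rigid 3D motion per object
    def __init__(self, in_features, num_classes):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_features, 1024), nn.ReLU())
        self.cls_score = nn.Linear(1024, num_classes)      # classification
        self.bbox_pred = nn.Linear(1024, 4 * num_classes)  # box refinement
        self.motion_pred = nn.Linear(1024, 6)              # rotation + translation

    def forward(self, roi_features):        # (num_rois, in_features)
        x = self.fc(roi_features)
        return self.cls_score(x), self.bbox_pred(x), self.motion_pred(x)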
\subsection{Related work}

\paragraph{Deep networks in optical flow and scene flow}

\paragraph{Deep networks for 3D motion estimation}
\cite{FlowLayers}
\cite{ESI}
\paragraph{Slanted plane methods for 3D scene flow}
The slanted plane model for scene flow \cite{PRSF, PRSM} represents a 3D scene
as being composed of planar segments. Pixels are assigned to one of the planar
segments, each of which undergoes a rigid motion.
In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015}
assigns each slanted plane to one rigidly moving object instance, thus
reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.

In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks, which are then
combined with depth obtained from a non-learned stereo algorithm to be used as
pre-computed inputs to the object scene flow model from \cite{KITTI2015}.
Interestingly, these slanted plane methods achieve the current state-of-the-art
in scene flow \emph{and} optical flow estimation on the KITTI benchmarks
\cite{KITTI2012, KITTI2015}, outperforming end-to-end deep networks like
\cite{FlowNet2, SceneFlowDataset}.

%
% In other contexts, the move from
% talk about performance issues with energy-minimization components, draw parallels to evolution of R-CNNs in terms of speed and accuracy when moving towards full end-to-end learning
\paragraph{End-to-end deep networks for 3D rigid motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first
introduced with SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as
input and produce a segmentation of the points into objects together with the
3D motion of each object.
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet}
takes two consecutive frames and estimates a segmentation of pixels into
objects together with their 3D motions between the frames.
In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D
scene flow from end-to-end deep learning.
For supervision, SfM-Net penalizes the dense optical flow composed from all 3D
motions and the depth estimate with a brightness constancy proxy loss.
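In standard notation (our sketch of the idea, not SfM-Net's exact
formulation), the optical flow composed from a depth estimate $d$ and a rigid
motion $(R, \mathbf{t})$ of the object containing pixel $\mathbf{x}$ reads
\begin{equation}
\mathbf{w}(\mathbf{x}) = \pi\big(R \, \pi^{-1}(\mathbf{x}, d(\mathbf{x})) + \mathbf{t}\big) - \mathbf{x},
\end{equation}
where $\pi$ projects a 3D point to pixel coordinates and $\pi^{-1}$
back-projects a pixel with known depth into 3D; penalizing brightness
constancy on this composed flow yields the proxy loss above.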
Recently, deep CNN-based recognition was combined with energy-based 3D scene
flow estimation \cite{InstanceSceneFlow}.

\cite{FlowLayers}
\cite{ESI}
@@ -151,7 +151,7 @@
% Use keyword=meinbegriff to output only those entries from your .bib file that are tagged with meinbegriff.
% To exclude a certain keyword, use notkeyword=meinbegriff.
\singlespacing
\printbibliography[title=Bibliography, heading=bibliography]
%\printbibliography[title=Bibliography, heading=bibliography, keyword=meinbegriff]