This commit is contained in:
Simon Meister 2017-10-29 16:09:17 +01:00
parent eb7df27e2f
commit ae850b0282
10 changed files with 216 additions and 91 deletions

View File

@ -14,8 +14,10 @@ We do not introduce a separate network for computing region proposals and use ou
as both first stage RPN and second stage feature extractor for region cropping.
\paragraph{Per-RoI motion prediction}
We use a rigid motion parametrization similar to the one used by SfM-Net \cite{Byravan:2017:SNL}.
For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in SE3$
We use a rigid 3D motion parametrization similar to the one used by SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$
\footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
and translations: $\{R, t|R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
of the object between the two frames $I_t$ and $I_{t+1}$ as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$.
We parametrize ${R_t^k}$ using an Euler angle representation,
@ -65,7 +67,7 @@ Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction along any axis.
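As an illustration (our own sketch, not part of the original text), a pivot-based rigid motion of this form
would typically act on a 3D point $X$ belonging to object $k$ as
\begin{equation}
X' = R_t^k \left( X - p_t^k \right) + p_t^k + t_t^k,
\end{equation}
i.e., the object is rotated about its pivot $p_t^k$ and then translated by $t_t^k$; we assume here that the
composition follows the SfM-Net convention, with $R_t^k$ assembled from the three predicted Euler angles.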
\paragraph{Camera motion prediction}
In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in SE3$
In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
between the two frames $I_t$ and $I_{t+1}$.
For this, we flatten the full output of the backbone and pass it through a fully connected layer.
We again represent $R_t^{cam}$ using an Euler angle representation and

View File

@ -8,7 +8,7 @@ Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to three-dimensional space.
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures \cite{} became widely popular
Deep convolutional neural network (CNN) architectures \cite{ImageNetCNN, VGGNet, ResNet} became widely popular
through numerous successes in classification and recognition tasks.
The general structure of a CNN consists of a convolutional encoder, which
learns a spatially compressed, wide (in the number of channels) representation of the input image,
@ -20,17 +20,18 @@ of pooling or strides.
Thus, networks for dense prediction introduce a convolutional decoder on top of the representation encoder,
performing upsampling of the compressed features and resulting in an encoder-decoder pyramid.
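To make the encoder-decoder structure concrete, the following is a minimal PyTorch sketch (our illustration; the layer counts, channel widths, and use of transposed convolutions are assumptions, not the architecture discussed in this thesis):

import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Generic encoder-decoder CNN for dense 2D prediction, e.g. optical flow."""
    def __init__(self, out_channels=2):  # two output channels for (u, v) flow
        super().__init__()
        # Encoder: strided convolutions compress spatially and widen the channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),   # two stacked RGB frames
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: transposed convolutions upsample back to the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, frame_pair):
        return self.decoder(self.encoder(frame_pair))

flow = TinyEncoderDecoder()(torch.randn(1, 6, 128, 128))  # dense output of shape (1, 2, 128, 128)

Setting out_channels to the number of classes instead of two would turn the same generic network into a (crude) semantic segmentation model, which is the point made below.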
The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{}, which was recently extended to scene flow estimation \cite{}.
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is rather generic and is specialized for optical flow only through being trained
with a dense optical flow groundtruth loss.
Note that the same network could also be used for semantic segmentation if
with supervision from dense optical flow ground truth.
Potentially, the same network could also be used for semantic segmentation if
the number of output channels was adapted from two to the number of classes. % TODO verify
FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to estimate optical flow arguably well,
FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching arguably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated depends only on the number of strided or pooling
operations in the encoder.
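As a rough back-of-the-envelope illustration (our own arithmetic, not a claim from the text): each stride-2 operation
doubles the downsampling factor, so after $n$ such operations a single bottleneck feature aggregates an input window
on the order of $2^n$ pixels; FlowNetS, for instance, downsamples by a factor of $2^6 = 64$ at its bottleneck and can
therefore in principle match displacements of roughly that magnitude.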
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{}. % TODO dense nets for dense flow
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
% The conclusion should be an understanding of the generic nature of the popular dense prediction networks
% for flow and depth, which primarily stems from the fact that they are quick re-purposing of recognition CNNs.
@ -46,7 +47,7 @@ In the following, we give a short review of region-based convolutional networks,
most popular deep networks for object detection, and have recently also been applied to instance segmentation.
\paragraph{R-CNN}
The original region-based convolutional network (R-CNN) uses a non-learned algorithm external to a standard encoder CNN
The original region-based convolutional network (R-CNN) \cite{RCNN} uses a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped at the proposed region and the crop is
passed through a CNN, which performs classification of the object (or non-object, if the region shows background). % and box refinement!
@ -54,7 +55,7 @@ passed through a CNN, which performs classification of the object (or non-object
\paragraph{Fast R-CNN}
The original R-CNN involved computing one forward pass of the deep CNN for each of the region proposals,
which is costly, as there is generally a large number of proposals.
Fast R-CNN significantly reduces computation by performing only a single forward pass with the whole image
Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only a single forward pass with the whole image
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed size crops are taken from the compressed feature map of the image,
collected into a batch and passed into a small Fast R-CNN
@ -67,7 +68,7 @@ speeding up the system by orders of magnitude. % TODO verify that
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system unifies the generation of region proposals and subsequent box refinement and
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster processing compared to Fast R-CNN
and, once again, improved accuracy.
This unified network operates in two stages.
@ -91,7 +92,7 @@ Faster R-CNN and the earlier systems detect and classify objects at bounding box
However, it can be helpful to know class and object (instance) membership of all individual pixels,
which generally involves computing a binary mask for each object instance specifying which pixels belong
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN extends the Faster R-CNN system to instance segmentation by predicting
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
fixed resolution instance masks within the bounding boxes of each detected object.
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
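As a concrete illustration of the per-RoI feature cropping shared by Fast R-CNN, Faster R-CNN, and Mask R-CNN, here is a short sketch using torchvision's RoIAlign (our choice of operator and parameter values, not necessarily the exact cropping operation used in this thesis):

import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 48, 156)             # backbone feature map at 1/16 input resolution
boxes = torch.tensor([[0., 64., 32., 256., 160.]])  # (batch index, x1, y1, x2, y2) in image pixels
crops = roi_align(features, boxes, output_size=(14, 14), spatial_scale=1.0 / 16)
# crops has shape (num_boxes, 256, 14, 14): fixed-size per-RoI features that are
# batched and passed to the classification, box refinement, and mask heads.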

132
bib.bib Normal file
View File

@ -0,0 +1,132 @@
@inproceedings{FlowNet,
author = {Alexey Dosovitskiy and Philipp Fischer and Eddy Ilg
and Philip H{\"a}usser and Caner Haz{\i}rba{\c{s}} and
Vladimir Golkov and Patrick van der Smagt and Daniel Cremers and Thomas Brox},
title = {{FlowNet}: Learning Optical Flow with Convolutional Networks},
booktitle = {{ICCV}},
year = {2015}}
@inproceedings{FlowNet2,
author = {Eddy Ilg and Nikolaus Mayer and Tonmoy Saikia and
Margret Keuper and Alexey Dosovitskiy and Thomas Brox},
title = {{FlowNet} 2.0: {E}volution of Optical Flow Estimation with Deep Networks},
booktitle = {{CVPR}},
year = {2017}}
@inproceedings{SceneFlowDataset,
author = {Nikolaus Mayer and Eddy Ilg and Philip H{\"a}usser and Philipp Fischer and
Daniel Cremers and Alexey Dosovitskiy and Thomas Brox},
title = {A Large Dataset to Train Convolutional Networks for
Disparity, Optical Flow, and Scene Flow Estimation},
booktitle = {{CVPR}},
year = {2016}}
@article{SfmNet,
author = {Sudheendra Vijayanarasimhan and
Susanna Ricco and
Cordelia Schmid and
Rahul Sukthankar and
Katerina Fragkiadaki},
title = {{SfM-Net}: Learning of Structure and Motion from Video},
journal = {arXiv preprint arXiv:1704.07804},
year = {2017}}
@article{MaskRCNN,
author = {Kaiming He and Georgia Gkioxari and
Piotr Doll\'{a}r and Ross Girshick},
title = {Mask {R-CNN}},
journal = {arXiv preprint arXiv:1703.06870},
year = {2017}}
@inproceedings{FasterRCNN,
author = {Shaoqing Ren and Kaiming He and
Ross Girshick and Jian Sun},
title = {Faster {R-CNN}: Towards Real-Time Object Detection
with Region Proposal Networks},
booktitle = {{NIPS}},
year = {2015}}
@inproceedings{FastRCNN,
author = {Ross Girshick},
title = {Fast {R-CNN}},
booktitle = {{ICCV}},
year = {2015}}
@inproceedings{Behl2017ICCV,
author = {Aseem Behl and Omid Hosseini Jafari and Siva Karthik Mustikovela and
Hassan Abu Alhaija and Carsten Rother and Andreas Geiger},
title = {Bounding Boxes, Segmentations and Object Coordinates:
How Important is Recognition for 3D Scene Flow Estimation
in Autonomous Driving Scenarios?},
booktitle = {{ICCV}},
year = {2017}}
@inproceedings{RCNN,
author = {Ross Girshick and
Jeff Donahue and
Trevor Darrell and
Jitendra Malik},
title = {Rich feature hierarchies for accurate
object detection and semantic segmentation},
booktitle = {{CVPR}},
year = {2014}}
@inproceedings{ImageNetCNN,
title = {ImageNet Classification with Deep Convolutional Neural Networks},
author = {Alex Krizhevsky and Ilya Sutskever and Geoffrey E. Hinton},
booktitle = {{NIPS}},
year = {2012}}
@article{VGGNet,
author = {Karen Simonyan and Andrew Zisserman},
title = {Very Deep Convolutional Networks for Large-Scale Image Recognition},
journal = {arXiv preprint arXiv:1409.1556},
year = {2014}}
@article{ResNet,
author = {Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun},
title = {Deep Residual Learning for Image Recognition},
journal = {arXiv preprint arXiv:1512.03385},
year = {2015}}
@article{DenseNetDenseFlow,
author = {Yi Zhu and Shawn D. Newsam},
title = {DenseNet for Dense Flow},
journal = {arXiv preprint arXiv:1707.06316},
year = {2017}}
@inproceedings{SE3Nets,
author = {Arunkumar Byravan and Dieter Fox},
title = {{SE3-Nets}: Learning Rigid Body Motion using Deep Neural Networks},
booktitle = {{ICRA}},
year = {2017}}
@inproceedings{FlowLayers,
author = {Laura Sevilla-Lara and Deqing Sun and Varun Jampani and Michael J. Black},
title = {Optical Flow with Semantic Segmentation and Localized Layers},
booktitle = {{CVPR}},
year = {2016}}
@inproceedings{ESI,
author = {Min Bai and Wenjie Luo and Kaustav Kundu and Raquel Urtasun},
title = {Exploiting Semantic Information and Deep Matching for Optical Flow},
booktitle = {{ECCV}},
year = {2016}}
@inproceedings{VKITTI,
author = {Adrien Gaidon and Qiao Wang and Yohann Cabon and Eleonora Vig},
title = {Virtual Worlds as Proxy for Multi-Object Tracking Analysis},
booktitle = {{CVPR}},
year = {2016}}
@inproceedings{KITTI2012,
author = {Andreas Geiger and Philip Lenz and Raquel Urtasun},
title = {Are we ready for Autonomous Driving? The {KITTI} Vision Benchmark Suite},
booktitle = {{CVPR}},
year = {2012}}
@inproceedings{KITTI2015,
author = {Moritz Menze and Andreas Geiger},
title = {Object Scene Flow for Autonomous Vehicles},
booktitle = {{CVPR}},
year = {2015}}


View File

@ -3,7 +3,7 @@ in parallel to instance segmentation.
\subsection{Future Work}
\paragraph{Predicting depth}
In most cases, we want to work with RGB frames without depth available.
In most cases, we want to work with raw RGB sequences for which no depth is available.
To do so, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN, as in Figure \ref{}.
Although single-frame monocular depth prediction with deep networks was already done

View File

@ -1,26 +1,32 @@
\subsection{Datasets}
\paragraph{Virtual KITTI}
The synthetic Virtual KITTI dataset is a re-creation of the KITTI driving scenario,
rendered from virtual 3D street scenes.
The dataset is made up of a total of 2126 frames from five different monocular sequences recorded from a camera mounted on
a virtual car.
Each sequence is rendered with varying lighting and weather conditions and from different viewing angles, resulting
in a total of 10 variants per sequence.
The synthetic Virtual KITTI dataset \cite{VKITTI} is a re-creation of the KITTI
driving scenario \cite{KITTI2012, KITTI2015}, rendered from virtual 3D street
scenes.
The dataset is made up of a total of 2126 frames from five different monocular
sequences recorded from a camera mounted on a virtual car.
Each sequence is rendered with varying lighting and weather conditions and
from different viewing angles, resulting in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given a dense depth and optical flow map and the camera extrinsics matrix.
For all cars and vans in the each frame, we are given 2D and 3D object bounding boxes, instance masks, 3D poses,
and various other labels.
For each frame, we are given a dense depth and optical flow map and the camera
extrinsics matrix.
For all cars and vans in each frame, we are given 2D and 3D object bounding
boxes, instance masks, 3D poses, and various other labels.
This makes the Virtual KITTI dataset ideally suited for developing our joint instance segmentation
and motion estimation system, as it allows us to test different components in isolation and
progress to more and more complete predictions up to supervising the full system on a single dataset.
This makes the Virtual KITTI dataset ideally suited for developing our joint
instance segmentation and motion estimation system, as it allows us to test
different components in isolation and progress to increasingly complete
predictions, up to supervising the full system on a single dataset.
\paragraph{Motion ground truth from 3D poses and camera extrinsics}
For two consecutive frames $I_t$ and $I_{t+1}$, let $[R_t^{cam}|t_t^{cam}]$ and $[R_{t+1}^{cam}|t_{t+1}^{cam}]$ be
the camera extrinsics at the two frames.
We compute the ground truth camera motion $\{R_t^{gt, cam}, t_t^{gt, cam}\} \in SE3$ as
For two consecutive frames $I_t$ and $I_{t+1}$,
let $[R_t^{cam}|t_t^{cam}]$
and $[R_{t+1}^{cam}|t_{t+1}^{cam}]$
be the camera extrinsics at the two frames.
We compute the ground truth camera motion
$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in \mathbf{SE}(3)$ as
\begin{equation}
R_{t}^{gt, cam} = R_{t+1}^{cam} \cdot inv(R_t^{cam}),
\end{equation}
@ -28,16 +34,22 @@ R_{t}^{gt, cam} = R_{t+1}^{cam} \cdot inv(R_t^{cam}),
t_{t}^{gt, cam} = t_{t+1}^{cam} - R_{t}^{gt, cam} \cdot t_t^{cam}.
\end{equation}
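The same computation in a few lines of NumPy (a sketch that directly transcribes the two equations above; the variable names and the convention of $3 \times 3$ rotation matrices acting on 3-vectors are our assumptions):

import numpy as np

def gt_camera_motion(R_cam_t, t_cam_t, R_cam_t1, t_cam_t1):
    # R_t^{gt,cam} = R_{t+1}^{cam} * inv(R_t^{cam})
    R_gt = R_cam_t1 @ np.linalg.inv(R_cam_t)
    # t_t^{gt,cam} = t_{t+1}^{cam} - R_t^{gt,cam} * t_t^{cam}
    t_gt = t_cam_t1 - R_gt @ t_cam_t
    return R_gt, t_gt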
For any object $k$ visible in both frames, let
$(R_t^k, t_t^k)$ and $(R_{t+1}^k, t_{t+1}^k)$ be its orientation and position in camera space
at $I_t$ and $I_{t+1}$. Note that the pose at $t$ is given with respect to the camera at $t$ and
$(R_t^k, t_t^k)$ and $(R_{t+1}^k, t_{t+1}^k)$
be its orientation and position in camera space
at $I_t$ and $I_{t+1}$.
Note that the pose at $t$ is given with respect to the camera at $t$ and
the pose at $t+1$ is given with respect to the camera at $t+1$.
We define the ground truth pivot as
\begin{equation}
p_{t}^{gt, k} = t_t^k
\end{equation}
and compute the ground truth object motion $\{R_t^{gt, k}, t_t^{gt, k}\} \in SE3$ as
and compute the ground truth object motion
$\{R_t^{gt, k}, t_t^{gt, k}\} \in \mathbf{SE}(3)$ as
\begin{equation}
R_{t}^{gt, k} = inv(R_{t}^{gt, cam}) \cdot R_{t+1}^k \cdot inv(R_t^k),
\end{equation}
@ -48,9 +60,9 @@ t_{t}^{gt, k} = t_{t+1}^{cam} - R_{gt}^{cam} \cdot t_t.
\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule.
We train on a single Titan X (Pascal) for a total of 192K iterations on the Virtual KITTI dataset.
As learning rate we use $0.25 \cdot 10^{-2}$ for the first 144K iterations and $0.25 \cdot 10^{-3}$
for all remaining iterations.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI dataset. We use a learning rate of $0.25 \cdot 10^{-2}$ for the
first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
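The piecewise-constant schedule just described, written out as a small helper (the values are from the text, the function itself is our sketch):

def learning_rate(step):
    # 0.25e-2 for the first 144K iterations, 0.25e-3 for the rest of the 192K iterations
    return 0.25e-2 if step < 144_000 else 0.25e-3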
\paragraph{R-CNN training parameters}

View File

@ -3,18 +3,18 @@
% introduce problem to solve
% mention classical non deep-learning works, then say it would be nice to go end-to-end deep
Recently, SfM-Net \cite{} introduced an end-to-end deep learning approach for predicting depth
Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a batch of binary full-image masks specifying the object membership of individual pixels with a standard encoder-decoder
network for pixel-wise prediction. A fully connected network branching off the encoder predicts a 3D motion for each object.
However, due to the fixed number of objects masks, it can only predict a small number of motions and
However, due to the fixed number of object masks, the system can only predict a small number of motions and
often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions.
Thus, their approach is very unlikely to scale to dynamic scenes with a potentially
Thus, this approach is very unlikely to scale to dynamic scenes with a potentially
large number of diverse objects, due to the inflexible nature of its instance segmentation technique.
A scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{}, which inherits the ability to detect
was recently introduced with Mask R-CNN \cite{MaskRCNN}, which inherits the ability to detect
a large number of objects from a large number of classes at once from Faster R-CNN
and predicts pixel-precise segmentation masks for each detected object.
@ -25,10 +25,22 @@ in parallel to classification and bounding box refinement.
\subsection{Related Work}
\paragraph{Deep optical flow estimation}
\paragraph{Deep scene flow estimation}
\paragraph{Structure from motion}
SfM-Net, SE3 Nets,
\paragraph{Deep networks for optical flow and scene flow}
\paragraph{Deep networks for 3D motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first introduced with
SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation
of the points into objects together with the 3D motion of each object.
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and
estimates a segmentation of pixels into objects together with their 3D motions between the frames.
In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning.
For supervision, SfM-Net penalizes the dense optical flow composed from the 3D motions and depth estimate
with a brightness constancy proxy loss.
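A typical form of such a brightness constancy proxy loss (our illustration of the standard formulation; the exact loss used by SfM-Net may include robust penalties or occlusion masking) is
\begin{equation}
\mathcal{L}_{photo} = \sum_{x} \left| I_t(x) - I_{t+1}\big(x + w_t(x)\big) \right|,
\end{equation}
where $w_t(x)$ denotes the dense optical flow at pixel $x$ composed from the predicted depth, object motions, and camera motion.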
Behl2017ICCV
Recently, deep CNN-based recognition was combined with energy-based 3D scene flow estimation \cite{Behl2017ICCV}.
\cite{FlowLayers}
\cite{ESI}

0
thesis.aux.bbl Normal file
View File

5
thesis.aux.blg Normal file
View File

@ -0,0 +1,5 @@
[0] Config.pm:343> INFO - This is Biber 2.5
[0] Config.pm:346> INFO - Logfile is 'thesis.aux.blg'
[36] biber:290> INFO - === So Okt 29, 2017, 10:29:33
[108] Utils.pm:165> ERROR - Cannot find control file 'thesis.aux.bcf'! - did you pass the "backend=biber" option to BibLaTeX?
[108] Biber.pm:113> INFO - ERRORS: 1

View File

@ -47,19 +47,21 @@
\usepackage[
backend=biber, % biber is the default backend for biblatex. For backwards compatibility, bibtex or bibtex8 can be selected here instead (see the biblatex documentation)
style=authortitle, %numeric, authortitle, alphabetic etc.
style=numeric, %numeric, authortitle, alphabetic etc.
autocite=footnote, % style used with \autocite
sorting=nty, % sorting: nty = name title year, nyt = name year title, etc.
sorting=ynt, % sorting: nty = name title year, nyt = name year title, etc.
sortcase=false,
url=false,
hyperref=auto,
giveninits=true,
maxbibnames=10
]{biblatex}
\renewbibmacro*{cite:seenote}{} % prevents "(cf. note xy)" from being inserted automatically in footnotes
\DeclareFieldFormat*{citetitle}{\mkbibemph{#1\isdot}} % typeset cited titles in italics
\DeclareFieldFormat*{title}{\mkbibemph{#1\isdot}} % typeset cited titles in italics
\addbibresource{clean.bib} % put the path to your .bib file here
\addbibresource{bib.bib} % put the path to your .bib file here
\nocite{*} % Print all entries of the .bib file in the bibliography, even if they are not cited in the text. Useful for testing the .bib file, but should not be used in general; instead, print selected entries via keywords (see \printbibliography in line 224).
@ -152,6 +154,7 @@
\printbibliography[title=Literaturverzeichnis, heading=bibliography]
%\printbibliography[title=Literaturverzeichnis, heading=bibliography, keyword=meinbegriff]
\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Beginning of the appendix