mirror of
https://github.com/tu-darmstadt-informatik/bsc-thesis.git
synced 2025-12-13 01:45:50 +00:00
WIP
This commit is contained in:
parent
eb7df27e2f
commit
ae850b0282
@ -14,8 +14,10 @@ We do not introduce a separate network for computing region proposals and use ou
as both first stage RPN and second stage feature extractor for region cropping.

\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used by SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$
\footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
and translations: $\{(R, t) \mid R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$.
We parametrize $R_t^k$ using an Euler angle representation,
@ -65,7 +67,7 @@ Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction along any axis.
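For concreteness, the following NumPy sketch (our own illustration, not part of the network code) shows how a predicted Euler-angle motion $\{R_t^k, t_t^k\}$ together with the pivot $p_t^k$ could be applied to 3D points, assuming the rotation acts about the pivot as in SfM-Net:

\begin{verbatim}
import numpy as np

def euler_to_matrix(rx, ry, rz):
    """Compose a rotation matrix from Euler angles (radians): R = Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def apply_rigid_motion(points, angles, translation, pivot):
    """Move 3D points (N, 3) rigidly: rotate about the pivot, then translate."""
    R = euler_to_matrix(*angles)
    return (points - pivot) @ R.T + pivot + translation
\end{verbatim}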

\paragraph{Camera motion prediction}
In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
between the two frames $I_t$ and $I_{t+1}$.
For this, we flatten the full output of the backbone and pass it through a fully connected layer.
We again represent $R_t^{cam}$ using an Euler angle representation and
@ -8,7 +8,7 @@ Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to 3-dimensional space.

\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures \cite{ImageNetCNN, VGGNet, ResNet} became widely popular
through numerous successes in classification and recognition tasks.
The general structure of a CNN consists of a convolutional encoder, which
learns a spatially compressed, wide (in the number of channels) representation of the input image,
@ -20,17 +20,18 @@ of pooling or strides.
Thus, networks for dense prediction introduce a convolutional decoder on top of the representation encoder,
performing upsampling of the compressed features and resulting in an encoder-decoder pyramid.
The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is rather generic and is specialized for optical flow only through being trained
with supervision from dense optical flow ground truth.
Potentially, the same network could also be used for semantic segmentation if
the number of output channels was adapted from two to the number of classes. % TODO verify
FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching arguably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated only depends on the number of 2D strides or pooling
operations in the encoder.
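To make the encoder-decoder structure concrete, the following PyTorch sketch (a toy model of our own, omitting the skip connections and multi-scale losses of FlowNetS) maps a concatenated frame pair to a two-channel flow map:

\begin{verbatim}
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    """Toy encoder-decoder for dense two-channel flow; the input is two RGB
    frames concatenated along the channel axis (6 channels in total)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # each stride-2 convolution halves the resolution
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(  # transposed convolutions upsample back
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # 2 = (u, v) flow
        )

    def forward(self, frame_pair):
        return self.decoder(self.encoder(frame_pair))

flow = TinyFlowNet()(torch.randn(1, 6, 128, 128))  # -> (1, 2, 128, 128)
\end{verbatim}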

Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.

% The conclusion should be an understanding of the generic nature of the popular dense prediction networks
% for flow and depth, which primarily stems from the fact that they are quick re-purposing of recognition CNNs.
@ -46,7 +47,7 @@ In the following, we give a short review of region-based convolutional networks,
most popular deep networks for object detection, and have recently also been applied to instance segmentation.

\paragraph{R-CNN}
The original region-based convolutional network (R-CNN) \cite{RCNN} uses a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped at the proposed region and the crop is
passed through a CNN, which performs classification of the object (or non-object, if the region shows background). % and box refinement!
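As an illustration (our own sketch, with a hypothetical \texttt{classifier} standing in for the CNN), the per-proposal processing could look as follows:

\begin{verbatim}
import torch
import torch.nn.functional as F

def rcnn_classify_proposals(image, proposals, classifier, crop_size=224):
    """image: (3, H, W) tensor; proposals: list of (x0, y0, x1, y1) pixel boxes.
    Each proposal is cropped, resized and classified independently."""
    scores = []
    for x0, y0, x1, y1 in proposals:
        crop = image[:, y0:y1, x0:x1].unsqueeze(0)  # (1, 3, h, w)
        crop = F.interpolate(crop, size=(crop_size, crop_size),
                             mode='bilinear', align_corners=False)
        scores.append(classifier(crop))  # one CNN forward pass per proposal
    return torch.cat(scores, dim=0)
\end{verbatim}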
@ -54,7 +55,7 @@ passed through a CNN, which performs classification of the object (or non-object

\paragraph{Fast R-CNN}
The original R-CNN involved computing one forward pass of the deep CNN for each of the region proposals,
which is costly, as there is generally a large number of proposals.
Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only a single forward pass with the whole image
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed-size crops are taken from the compressed feature map of the image,
collected into a batch and passed into a small Fast R-CNN
@ -67,7 +68,7 @@ speeding up the system by orders of magnitude. % TODO verify that

After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster processing compared to Fast R-CNN
and, again, improved accuracy.
This unified network operates in two stages.
@ -91,7 +92,7 @@ Faster R-CNN and the earlier systems detect and classify objects at bounding box
However, it can be helpful to know class and object (instance) membership of all individual pixels,
which generally involves computing a binary mask for each object instance specifying which pixels belong
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
fixed-resolution instance masks within the bounding boxes of each detected object.
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
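A minimal sketch of such a mask head (our own, with hypothetical layer sizes rather than the exact Mask R-CNN configuration), operating on fixed-size per-RoI feature crops:

\begin{verbatim}
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Per-RoI mask head: a few convolutions, one upsampling step,
    and a 1x1 convolution producing one mask logit map per class."""
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
        )
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)  # 14x14 -> 28x28
        self.mask_logits = nn.Conv2d(256, num_classes, 1)

    def forward(self, roi_features):  # (num_rois, in_channels, 14, 14)
        x = self.convs(roi_features)
        x = torch.relu(self.upsample(x))
        return self.mask_logits(x)  # (num_rois, num_classes, 28, 28)
\end{verbatim}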

132
bib.bib
Normal file
@ -0,0 +1,132 @@
@inproceedings{FlowNet,
  author    = {Alexey Dosovitskiy and Philipp Fischer and Eddy Ilg
               and Philip H{\"a}usser and Caner Haz{\i}rba{\c{s}} and
               Vladimir Golkov and Patrick v.d. Smagt and Daniel Cremers and Thomas Brox},
  title     = {{FlowNet}: Learning Optical Flow with Convolutional Networks},
  booktitle = {{ICCV}},
  year      = {2015}}

@inproceedings{FlowNet2,
  author    = {Eddy Ilg and Nikolaus Mayer and Tonmoy Saikia and
               Margret Keuper and Alexey Dosovitskiy and Thomas Brox},
  title     = {{FlowNet} 2.0: {E}volution of Optical Flow Estimation with Deep Networks},
  booktitle = {{CVPR}},
  year      = {2017}}

@inproceedings{SceneFlowDataset,
  author    = {Nikolaus Mayer and Eddy Ilg and Philip H{\"a}usser and Philipp Fischer and
               Daniel Cremers and Alexey Dosovitskiy and Thomas Brox},
  title     = {A Large Dataset to Train Convolutional Networks for
               Disparity, Optical Flow, and Scene Flow Estimation},
  booktitle = {{CVPR}},
  year      = {2016}}

@article{SfmNet,
  author    = {Sudheendra Vijayanarasimhan and Susanna Ricco and Cordelia Schmid and
               Rahul Sukthankar and Katerina Fragkiadaki},
  title     = {{SfM-Net}: Learning of Structure and Motion from Video},
  journal   = {arXiv preprint arXiv:1704.07804},
  year      = {2017}}

@article{MaskRCNN,
  author    = {Kaiming He and Georgia Gkioxari and Piotr Doll\'{a}r and Ross Girshick},
  title     = {{Mask {R-CNN}}},
  journal   = {arXiv preprint arXiv:1703.06870},
  year      = {2017}}

@inproceedings{FasterRCNN,
  author    = {Shaoqing Ren and Kaiming He and Ross Girshick and Jian Sun},
  title     = {Faster {R-CNN}: Towards Real-Time Object Detection
               with Region Proposal Networks},
  booktitle = {{NIPS}},
  year      = {2015}}

@inproceedings{FastRCNN,
  author    = {Ross Girshick},
  title     = {Fast {R-CNN}},
  booktitle = {{ICCV}},
  year      = {2015}}

@inproceedings{Behl2017ICCV,
  author    = {Aseem Behl and Omid Hosseini Jafari and Siva Karthik Mustikovela and
               Hassan Abu Alhaija and Carsten Rother and Andreas Geiger},
  title     = {Bounding Boxes, Segmentations and Object Coordinates:
               How Important is Recognition for 3D Scene Flow Estimation
               in Autonomous Driving Scenarios?},
  booktitle = {{ICCV}},
  year      = {2017}}

@inproceedings{RCNN,
  author    = {Ross Girshick and Jeff Donahue and Trevor Darrell and Jitendra Malik},
  title     = {Rich feature hierarchies for accurate
               object detection and semantic segmentation},
  booktitle = {{CVPR}},
  year      = {2014}}

@inproceedings{ImageNetCNN,
  author    = {Alex Krizhevsky and Ilya Sutskever and Geoffrey E. Hinton},
  title     = {ImageNet Classification with Deep Convolutional Neural Networks},
  booktitle = {{NIPS}},
  year      = {2012}}

@article{VGGNet,
  author    = {Karen Simonyan and Andrew Zisserman},
  title     = {Very Deep Convolutional Networks for Large-Scale Image Recognition},
  journal   = {arXiv preprint arXiv:1409.1556},
  year      = {2014}}

@article{ResNet,
  author    = {Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun},
  title     = {Deep Residual Learning for Image Recognition},
  journal   = {arXiv preprint arXiv:1512.03385},
  year      = {2015}}

@article{DenseNetDenseFlow,
  author    = {Yi Zhu and Shawn D. Newsam},
  title     = {DenseNet for Dense Flow},
  journal   = {arXiv preprint arXiv:1707.06316},
  year      = {2017}}

@inproceedings{SE3Nets,
  author    = {Arunkumar Byravan and Dieter Fox},
  title     = {{SE3-Nets}: Learning Rigid Body Motion using Deep Neural Networks},
  booktitle = {{ICRA}},
  year      = {2017}}

@inproceedings{FlowLayers,
  author    = {Laura Sevilla-Lara and Deqing Sun and Varun Jampani and Michael J. Black},
  title     = {Optical Flow with Semantic Segmentation and Localized Layers},
  booktitle = {{CVPR}},
  year      = {2016}}

@inproceedings{ESI,
  author    = {Min Bai and Wenjie Luo and Kaustav Kundu and Raquel Urtasun},
  title     = {Exploiting Semantic Information and Deep Matching for Optical Flow},
  booktitle = {{ECCV}},
  year      = {2016}}

@inproceedings{VKITTI,
  author    = {Adrien Gaidon and Qiao Wang and Yohann Cabon and Eleonora Vig},
  title     = {Virtual Worlds as Proxy for Multi-Object Tracking Analysis},
  booktitle = {{CVPR}},
  year      = {2016}}

@inproceedings{KITTI2012,
  author    = {Andreas Geiger and Philip Lenz and Raquel Urtasun},
  title     = {Are we ready for Autonomous Driving? The {KITTI} Vision Benchmark Suite},
  booktitle = {{CVPR}},
  year      = {2012}}

@inproceedings{KITTI2015,
  author    = {Moritz Menze and Andreas Geiger},
  title     = {Object Scene Flow for Autonomous Vehicles},
  booktitle = {{CVPR}},
  year      = {2015}}

42
clean.bib
@ -1,42 +0,0 @@
@ -3,7 +3,7 @@ in parallel to instance segmentation.

\subsection{Future Work}
\paragraph{Predicting depth}
In most cases, we want to work with raw RGB sequences for which no depth is available.
To do so, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN, as in Figure \ref{}.
Although single-frame monocular depth prediction with deep networks was already done
@ -1,26 +1,32 @@

\subsection{Datasets}

\paragraph{Virtual KITTI}
The synthetic Virtual KITTI dataset \cite{VKITTI} is a re-creation of the KITTI
driving scenario \cite{KITTI2012, KITTI2015}, rendered from virtual 3D street
scenes.
The dataset is made up of a total of 2126 frames from five different monocular
sequences recorded from a camera mounted on a virtual car.
Each sequence is rendered with varying lighting and weather conditions and
from different viewing angles, resulting in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given a dense depth and optical flow map and the camera
extrinsics matrix.
For all cars and vans in each frame, we are given 2D and 3D object bounding
boxes, instance masks, 3D poses, and various other labels.

This makes the Virtual KITTI dataset ideally suited for developing our joint
instance segmentation and motion estimation system, as it allows us to test
different components in isolation and progress to more and more complete
predictions up to supervising the full system on a single dataset.

\paragraph{Motion ground truth from 3D poses and camera extrinsics}
For two consecutive frames $I_t$ and $I_{t+1}$,
let $[R_t^{cam}|t_t^{cam}]$
and $[R_{t+1}^{cam}|t_{t+1}^{cam}]$
be the camera extrinsics at the two frames.
We compute the ground truth camera motion
$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in \mathbf{SE}(3)$ as

\begin{equation}
R_{t}^{gt, cam} = R_{t+1}^{cam} \cdot inv(R_t^{cam}),
\end{equation}
@ -28,16 +34,22 @@ R_{t}^{gt, cam} = R_{t+1}^{cam} \cdot inv(R_t^{cam}),
t_{t}^{gt, cam} = t_{t+1}^{cam} - R_{t}^{gt, cam} \cdot t_t^{cam}.
\end{equation}
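
For reference, a minimal NumPy sketch (our own) that composes this ground truth camera motion from the two extrinsics, following the formulas above:

\begin{verbatim}
import numpy as np

def camera_motion_gt(R_cam_t, t_cam_t, R_cam_t1, t_cam_t1):
    """Relative camera motion between frames t and t+1 from the two
    extrinsics [R|t]; for a rotation matrix, inv(R) equals its transpose."""
    R_gt = R_cam_t1 @ np.linalg.inv(R_cam_t)
    t_gt = t_cam_t1 - R_gt @ t_cam_t
    return R_gt, t_gt
\end{verbatim}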

For any object $k$ visible in both frames, let
$(R_t^k, t_t^k)$ and $(R_{t+1}^k, t_{t+1}^k)$
be its orientation and position in camera space
at $I_t$ and $I_{t+1}$.
Note that the pose at $t$ is given with respect to the camera at $t$ and
the pose at $t+1$ is given with respect to the camera at $t+1$.

We define the ground truth pivot as

\begin{equation}
p_{t}^{gt, k} = t_t^k
\end{equation}

and compute the ground truth object motion
$\{R_t^{gt, k}, t_t^{gt, k}\} \in \mathbf{SE}(3)$ as

\begin{equation}
R_{t}^{gt, k} = inv(R_{t}^{gt, cam}) \cdot R_{t+1}^k \cdot inv(R_t^k),
\end{equation}
@ -48,9 +60,9 @@ t_{t}^{gt, k} = t_{t+1}^{cam} - R_{gt}^{cam} \cdot t_t.

\subsection{Training Setup}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI dataset. As learning rate we use $0.25 \cdot 10^{-2}$ for the
first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
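This step schedule can be written as a small helper (illustrative only):

\begin{verbatim}
def learning_rate(step, base_lr=0.25e-2, drop_step=144000, factor=0.1):
    """Base learning rate until 144K iterations, then reduced by a factor of 10."""
    return base_lr if step < drop_step else base_lr * factor
\end{verbatim}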

\paragraph{R-CNN training parameters}
@ -3,18 +3,18 @@

% introduce problem to solve
% mention classical non deep-learning works, then say it would be nice to go end-to-end deep

Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a batch of binary full-image masks specifying the object memberships of individual pixels with a standard encoder-decoder
network for pixel-wise prediction. A fully connected network branching off the encoder predicts a 3D motion for each object.
However, due to the fixed number of object masks, the system can only predict a small number of motions and
often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions.

Thus, this approach is very unlikely to scale to dynamic scenes with a potentially
large number of diverse objects due to the inflexible nature of its instance segmentation technique.

A scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{MaskRCNN}, which inherits the ability to detect
a large number of objects from a large number of classes at once from Faster R-CNN
and predicts pixel-precise segmentation masks for each detected object.
@ -25,10 +25,22 @@ in parallel to classification and bounding box refinement.

\subsection{Related Work}

\paragraph{Deep networks for optical flow and scene flow}

\paragraph{Deep networks for 3D motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first introduced with
SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation
of the points into objects together with the 3D motion of each object.
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and
estimates a segmentation of pixels into objects together with their 3D motions between the frames.
In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning.
For supervision, SfM-Net penalizes the dense optical flow composed from the 3D motions and depth estimate
with a brightness constancy proxy loss.
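As an illustration of such a proxy loss (our own PyTorch sketch, not SfM-Net's implementation), the composed flow can be used to warp $I_{t+1}$ towards $I_t$ and penalize the photometric difference:

\begin{verbatim}
import torch
import torch.nn.functional as F

def brightness_constancy_loss(frame_t, frame_t1, flow):
    """frame_t, frame_t1: (B, 3, H, W); flow: (B, 2, H, W) in pixels.
    Warps frame_t1 back to frame_t with the predicted flow and
    penalizes the mean absolute brightness difference."""
    _, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid_x = xs.to(flow) + flow[:, 0]  # x + u
    grid_y = ys.to(flow) + flow[:, 1]  # y + v
    # normalize sampling coordinates to [-1, 1] for grid_sample
    grid = torch.stack([2 * grid_x / (w - 1) - 1,
                        2 * grid_y / (h - 1) - 1], dim=-1)  # (B, H, W, 2)
    warped = F.grid_sample(frame_t1, grid, align_corners=True)
    return (warped - frame_t).abs().mean()
\end{verbatim}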

% Behl2017ICCV
Recently, deep CNN-based recognition was combined with energy-based 3D scene flow estimation \cite{Behl2017ICCV}.

\cite{FlowLayers}
\cite{ESI}

0
thesis.aux.bbl
Normal file
5
thesis.aux.blg
Normal file
@ -0,0 +1,5 @@
[0] Config.pm:343> INFO - This is Biber 2.5
[0] Config.pm:346> INFO - Logfile is 'thesis.aux.blg'
[36] biber:290> INFO - === So Okt 29, 2017, 10:29:33
[108] Utils.pm:165> ERROR - Cannot find control file 'thesis.aux.bcf'! - did you pass the "backend=biber" option to BibLaTeX?
[108] Biber.pm:113> INFO - ERRORS: 1
@ -47,19 +47,21 @@

\usepackage[
backend=biber, % biber is the standard backend for biblatex; for backwards compatibility, bibtex or bibtex8 can also be chosen here (see the biblatex documentation)
style=numeric, %numeric, authortitle, alphabetic etc.
autocite=footnote, % style used with \autocite
sorting=ynt, % sorting: nty = name title year, nyt = name year title, etc.
sortcase=false,
url=false,
hyperref=auto,
giveninits=true,
maxbibnames=10
]{biblatex}

\renewbibmacro*{cite:seenote}{} % prevents the automatic insertion of "(see note xy)" in footnotes
\DeclareFieldFormat*{citetitle}{\mkbibemph{#1\isdot}} % format cited titles in italics
\DeclareFieldFormat*{title}{\mkbibemph{#1\isdot}} % format titles in italics

\addbibresource{bib.bib} % put the path to your .bib file here
\nocite{*} % Print all entries of the .bib file in the bibliography, even if they are not cited in the text. Good for testing the .bib file, but should not be used in general; instead, print selected entries via keywords (see \printbibliography in line 224).

@ -152,6 +154,7 @@

\printbibliography[title=Literaturverzeichnis, heading=bibliography]
%\printbibliography[title=Literaturverzeichnis, heading=bibliography, keyword=meinbegriff]


\clearpage
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Beginning of the appendix