mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git
synced 2026-04-04 08:35:19 +00:00

save

This commit is contained in:
parent 2e84078d0d
commit 84c5b1e6cd
@@ -19,7 +19,7 @@ constraints within the scene.
 We introduce a scalable end-to-end deep learning approach for dense motion estimation
 that respects the structure of the scene as being composed of distinct objects,
-thus combining the representation learning benefits of end-to-end deep networks
+thus combining the representation learning benefits and speed of end-to-end deep networks
 with a physically plausible scene model.
 
 Building on recent advances in region-based convolutional networks (R-CNNs),
@@ -31,6 +31,6 @@ By additionally estimating a global camera motion in the same network,
 we compose a dense optical flow field based on instance-level and global motion
 predictions.
 
-We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
-benchmark.
+%We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
+%benchmark.
 \end{abstract}
approach.tex
@@ -1,24 +1,30 @@
 
 \subsection{Motion R-CNN architecture}
 
-Building on Mask R-CNN, we enable per-object motion estimation by predicting the 3D motion of each detected object
-in camera space.
+Building on Mask R-CNN \cite{MaskRCNN},
+we estimate per-object motion by predicting the 3D motion of each detected object.
 For this, we extend Mask R-CNN in two straightforward ways.
 First, we modify the backbone network and provide two frames to the R-CNN system
 in order to enable image matching between the consecutive frames.
 Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
 region proposal.
 
 \paragraph{Backbone Network}
-Like Faster R-CNN and Mask R-CNN, we use a ResNet variant as backbone network to compute feature maps from input imagery.
+Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backbone network to compute feature maps from input imagery.
 
-Inspired by FlowNetS, we make one modification to enable image matching within the backbone network,
+Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
 laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
 we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
 We do not introduce a separate network for computing region proposals and use our modified backbone network
 as both first-stage RPN and second-stage feature extractor for region cropping.
 % TODO figures; introduce XYZ inputs
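The two-frame backbone input described above can be sketched as follows; this is an illustrative snippet, not code from the thesis, and the image resolution is an assumption:

```python
import numpy as np

# Two consecutive RGB frames are depth-concatenated into a single
# six-channel map before entering the modified ResNet backbone.
H, W = 192, 640  # hypothetical KITTI-like resolution (assumed)

I_t  = np.random.rand(H, W, 3).astype(np.float32)   # frame at time t
I_t1 = np.random.rand(H, W, 3).astype(np.float32)   # frame at time t+1

backbone_input = np.concatenate([I_t, I_t1], axis=-1)
assert backbone_input.shape == (H, W, 6)  # six-channel input map
```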
 
 \paragraph{Per-RoI motion prediction}
-We use a rigid 3D motion parametrization similar to the one used by SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
+We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
 For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$
 \footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
 and translations: $\{R, t \,|\, R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
-of the object between the two frames $I_t$ and $I_{t+1}$ as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$ .
+of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$.
 We parametrize ${R_t^k}$ using an Euler angle representation,
 
 \begin{equation}
@@ -53,50 +59,71 @@ R_t^{k,z}(\gamma) =
 \end{pmatrix},
 \end{equation}
 
-and $\alpha, \beta, \gamma$ are the rotation angles about the $x,y,z$-axis, respectively.
+and $\alpha, \beta, \gamma$ are the rotation angles in radians about the $x,y,z$-axis, respectively.
 
-Figure \ref{fig:motion_rcnn_head} shows our extended per-RoI head network.
-We then extend the Faster R-CNN head by adding a fully connected layer in parallel to the final fully connected layers for
-predicting refined boxes and classes.
+We then extend the Mask R-CNN head by adding a fully connected layer in parallel to the fully connected layers for
+refined boxes and classes. Figure \ref{fig:motion_rcnn_head} shows the Motion R-CNN RoI head.
 As for refined boxes and masks, we make one separate motion prediction for each class.
 Each motion is predicted as a set of nine scalar motion parameters,
 $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
 where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
 Here, we assume that motions between frames are relatively small
 and that objects rotate at most 90 degrees in either direction about any axis.
 All predictions are made in camera space, and translation and pivot predictions are in meters.
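As a hedged sketch of this parametrization: the network outputs clipped sines, and under the stated assumption that rotations stay within 90 degrees per axis, the angles are recoverable via arcsin. The composition order $R_z R_y R_x$ below is an assumption, since the full equation is elided in this hunk:

```python
import numpy as np

def rotation_from_sines(sin_a, sin_b, sin_g):
    """Assemble an Euler-angle rotation from predicted sines.

    Illustrative only: sines are clipped to [-1, 1] as in the text,
    and angles recovered by arcsin (valid for |angle| <= 90 deg).
    The order Rz @ Ry @ Rx is an assumed convention.
    """
    a, b, g = (np.arcsin(np.clip(s, -1.0, 1.0)) for s in (sin_a, sin_b, sin_g))
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(a), -np.sin(a)],
                   [0, np.sin(a),  np.cos(a)]])
    Ry = np.array([[ np.cos(b), 0, np.sin(b)],
                   [0, 1, 0],
                   [-np.sin(b), 0, np.cos(b)]])
    Rz = np.array([[np.cos(g), -np.sin(g), 0],
                   [np.sin(g),  np.cos(g), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

# zero sines give the identity rotation
assert np.allclose(rotation_from_sines(0.0, 0.0, 0.0), np.eye(3))
```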
 
 \paragraph{Camera motion prediction}
 In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
 between the two frames $I_t$ and $I_{t+1}$.
-For this, we flatten the full output of the backbone and pass it through a fully connected layer.
+For this, we flatten the bottleneck output of the backbone and pass it through a fully connected layer.
 We again represent $R_t^{cam}$ using an Euler angle representation and
 predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.
 
 \subsection{Supervision}
 
 \paragraph{Per-RoI supervision with motion ground truth}
-Given a positive RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
-let $R_{c_k}^k, t_{c_k}^k, p_{c_k}^k$ be the predicted motion for class $c_k$
-and $R_{gt}^{i_k}, t_{gt}^{i_k}, p_{gt}^{i_k}$ the ground truth motion for the example $i_k$.
-We compute the motion loss $L_{motion}^k$ for each RoI as
+The most straightforward way to supervise the object motions is by using ground truth
+motions computed from ground truth object poses, which is in general
+only practical when training on synthetic datasets.
+Given the $k$-th positive RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
+let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ be the predicted motion for class $c_k$
+and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}$ the ground truth motion for the example $i_k$.
+Note that we drop the subscript $t$ to improve readability.
+Inspired by the camera pose regression loss in \cite{PoseNet2},
+we use the $\ell_1$-loss to penalize the differences between ground truth and predicted % TODO actually, we use smooth l1
+rotation, translation and pivot.
+For each RoI, we compute the motion loss $L_{motion}^k$ as a linear sum of
+the individual losses,
 
 \begin{equation}
 L_{motion}^k = l_{R}^k + l_{t}^k + l_{p}^k,
 \end{equation}
 where
 \begin{equation}
-l_{R}^k = \arccos\left( \min\left\{1, \max\left\{-1, \frac{tr(inv(R_{c_k}^k) \cdot R_{gt}^{i_k}) - 1}{2} \right\}\right\} \right)
+l_{R}^k = \lVert R^{gt,i_k} - R^{k,c_k} \rVert_1,
 \end{equation}
-measures the angle of the error rotation between predicted and ground truth rotation,
 
 \begin{equation}
-l_{t}^k = \lVert inv(R_{c_k}^k) \cdot (t_{gt}^{i_k} - t_{c_k}^k) \rVert,
+l_{t}^k = \lVert t^{gt,i_k} - t^{k,c_k} \rVert_1,
 \end{equation}
-is the euclidean norm between predicted and ground truth translation, and
+and
 \begin{equation}
-l_{p}^k = \lVert p_{gt}^{i_k} - p_{c_k}^k \rVert
+l_{p}^k = \lVert p^{gt,i_k} - p^{k,c_k} \rVert_1.
 \end{equation}
-is the euclidean norm between predicted and ground truth pivot.
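A minimal sketch of this loss, using plain elementwise $\ell_1$ differences as in the added equations (the TODO notes a smooth-$\ell_1$ variant is actually used, which is omitted here; function and argument names are hypothetical):

```python
import numpy as np

def motion_loss(R_pred, t_pred, p_pred, R_gt, t_gt, p_gt):
    """Per-RoI motion loss: unweighted sum of l1 differences of
    rotation matrix, translation and pivot (illustrative sketch)."""
    l_R = np.abs(R_gt - R_pred).sum()  # l1 on rotation matrix entries
    l_t = np.abs(t_gt - t_pred).sum()  # l1 on translation (meters)
    l_p = np.abs(p_gt - p_pred).sum()  # l1 on pivot (meters)
    return l_R + l_t + l_p

loss = motion_loss(np.eye(3), np.zeros(3), np.zeros(3),
                   np.eye(3), np.ones(3), np.zeros(3))
assert np.isclose(loss, 3.0)  # only the translation term contributes
```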
 
 \paragraph{Camera motion supervision}
 We supervise the camera motion with ground truth in the same way as the
 object motions.
 
 \paragraph{Per-RoI supervision \emph{without} motion ground truth}
 A more general way to supervise the object motions is a re-projection
 loss applied to coordinates within the object bounding box,
 as used in SfM-Net \cite{SfmNet}. Let
 
 When compared to supervision with motion ground truth, a re-projection
 loss could benefit motion regression by removing any loss balancing issues between the
 rotation, translation and pivot terms \cite{PoseNet2}.
 
 \subsection{Dense flow from motion}
@@ -130,7 +157,7 @@ P'_{t+1} =
 P_t + \sum_1^{k} m_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
 \end{equation}
 
-Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in SE3$, % TODO introduce!
+Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$, % TODO introduce!
 
 \begin{equation}
 \begin{pmatrix}
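The point transformation $P'_{t+1} = P_t + \sum_k m_t^k\{R_t^k(P_t - p_t^k) + p_t^k + t_t^k - P_t\}$ in the hunk above can be sketched as follows; this is a hypothetical helper, not code from the thesis:

```python
import numpy as np

def transform_points(P, masks, Rs, ts, ps):
    """Apply masked per-object rigid motions to camera-space points.

    P:     (N, 3) points at time t
    masks: (K, N) instance membership m_t^k per point
    Rs, ts, ps: per-object rotation (3, 3), translation (3,), pivot (3,)
    """
    P_out = P.copy()
    for m, R, t, p in zip(masks, Rs, ts, ps):
        moved = (P - p) @ R.T + p + t      # rigid motion about the pivot
        P_out += m[:, None] * (moved - P)  # blend by instance mask
    return P_out

P = np.array([[1.0, 0.0, 5.0]])
masks = np.array([[1.0]])
# identity rotation, 1 m translation along x, pivot at the origin
out = transform_points(P, masks, [np.eye(3)],
                       [np.array([1.0, 0.0, 0.0])], [np.zeros(3)])
assert np.allclose(out, [[2.0, 0.0, 5.0]])
```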
bib.bib
@@ -160,3 +160,15 @@
 title = {Feature Pyramid Networks for Object Detection},
 journal = {arXiv preprint arXiv:1612.03144},
 year = {2016}}
+
+@inproceedings{PoseNet,
+author = {Alex Kendall and Matthew Grimes and Roberto Cipolla},
+title = {PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization},
+booktitle = {ICCV},
+year = {2015}}
+
+@inproceedings{PoseNet2,
+author = {Alex Kendall and Roberto Cipolla},
+title = {Geometric loss functions for camera pose regression with deep learning},
+booktitle = {CVPR},
+year = {2017}}
@@ -1,5 +1,6 @@
-We have introduced a extension on top of region-based convolutional networks to enable object motion estimation
+We have introduced an extension on top of region-based convolutional networks to enable object motion estimation
 in parallel to instance segmentation.
+\todo{complete}
 
 \subsection{Future Work}
 \paragraph{Predicting depth}
@@ -9,19 +10,20 @@ depth network from the backbone in parallel to the RPN, as in Figure \ref{}.
 Although single-frame monocular depth prediction with deep networks was already done
 to some level of success,
 our two-frame input should allow the network to make use of epipolar
-geometry for making a more reliable depth estimate.
+geometry for making a more reliable depth estimate, at least when the camera
+is moving.
 
 \paragraph{Training on real world data}
 Due to the amount of supervision required by the different components of the network
 and the complexity of the optimization problem,
 we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
-A next step would be training on a more realistic dataset.
-For example, we can first pre-train the RPN on an object detection dataset like
+A next step will be training on a more realistic dataset.
+For this, we can first pre-train the RPN on an object detection dataset like
 Cityscapes. As soon as the RPN works reliably, we could execute alternating
 steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
 On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
-the motion losses (and depth prediction), as no instance segmentation ground truth exists.
-On Cityscapes, we could continue train the full instance segmentation Mask R-CNN to
+the motion losses (and depth prediction if added), as no instance segmentation ground truth exists.
+On Cityscapes, we could continue training the instance segmentation components to
 improve detection and masks and avoid forgetting instance segmentation.
 As an alternative to this training scheme, we could investigate training on a pure
 instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
@@ -19,53 +19,82 @@ instance segmentation and motion estimation system, as it allows us to test
 different components in isolation and progress to more and more complete
 predictions up to supervising the full system on a single dataset.
 
 For our experiments, we use the \emph{clone} sequences, which are rendered in a
 way that most closely resembles the original KITTI dataset. We sample 100 examples
 to be used as validation set. From the remaining 2026 examples,
 we remove a small number of examples without object instances and use the resulting
 data as training set.
 
 \paragraph{Motion ground truth from 3D poses and camera extrinsics}
 For two consecutive frames $I_t$ and $I_{t+1}$,
-let $[R_t^{cam}|t_t^{cam}]$
-and $[R_{t+1}^{cam}|t_{t+1}^{cam}]$
+let $[R_t^{ex}|t_t^{ex}]$
+and $[R_{t+1}^{ex}|t_{t+1}^{ex}]$
 be the camera extrinsics at the two frames.
 We compute the ground truth camera motion
-$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in SE3$ as
+$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in \mathbf{SE}(3)$ as
 
 \begin{equation}
-R_{t}^{gt, cam} = R_{t+1}^{cam} \cdot inv(R_t^{cam}),
+R_{t}^{gt, cam} = R_{t+1}^{ex} \cdot inv(R_t^{ex}),
 \end{equation}
 \begin{equation}
-t_{t}^{gt, cam} = t_{t+1}^{cam} - R_{gt}^{cam} \cdot t_t^{cam}.
+t_{t}^{gt, cam} = t_{t+1}^{ex} - R_{t}^{gt, cam} \cdot t_t^{ex}.
 \end{equation}
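As a hedged sketch (names are hypothetical, and a world-to-camera extrinsics convention is assumed), the ground truth camera motion can be computed from the extrinsics at the two frames like this:

```python
import numpy as np

def camera_motion_gt(R_ex_t, t_ex_t, R_ex_t1, t_ex_t1):
    """Relative camera motion between frames t and t+1 from
    world-to-camera extrinsics; the translation uses the relative
    rotation so that a static camera yields the identity motion."""
    R_gt = R_ex_t1 @ np.linalg.inv(R_ex_t)
    t_gt = t_ex_t1 - R_gt @ t_ex_t
    return R_gt, t_gt

# a static camera yields the identity motion
R_gt, t_gt = camera_motion_gt(np.eye(3), np.zeros(3), np.eye(3), np.zeros(3))
assert np.allclose(R_gt, np.eye(3)) and np.allclose(t_gt, 0.0)
```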
 
-For any object $k$ visible in both frames, let
-$(R_t^k, t_t^k)$ and $(R_{t+1}^k, t_{t+1}^k)$
+For any object $i$ visible in both frames, let
+$(R_t^i, t_t^i)$ and $(R_{t+1}^i, t_{t+1}^i)$
 be its orientation and position in camera space
 at $I_t$ and $I_{t+1}$.
 Note that the pose at $t$ is given with respect to the camera at $t$ and
 the pose at $t+1$ is given with respect to the camera at $t+1$.
 
-We define the ground truth pivot as
+We define the ground truth pivot $p_{t}^{gt, i} \in \mathbb{R}^3$ as
 
 \begin{equation}
-p_{t}^{gt, k} = t_t^k
+p_{t}^{gt, i} = t_t^i
 \end{equation}
 
 and compute the ground truth object motion
-$\{R_t^{gt, k}, t_t^{gt, k}\} \in SE3$ as
+$\{R_t^{gt, i}, t_t^{gt, i}\} \in \mathbf{SE}(3)$ as
 
 \begin{equation}
-R_{t}^{gt, k} = inv(R_{t}^{gt, cam}) \cdot R_{t+1}^k \cdot inv(R_t^k),
+R_{t}^{gt, i} = inv(R_t^{gt, cam}) \cdot R_{t+1}^i \cdot inv(R_t^i),
 \end{equation}
 \begin{equation}
-t_{t}^{gt, k} = t_{t+1}^{cam} - R_{gt}^{cam} \cdot t_t.
-\end{equation} % TODO
-% TODO change notation in approach to remove t subscript from motion matrices and vectors!
+t_{t}^{gt, i} = t_{t+1}^{i} - R_t^{gt, cam} \cdot t_t^{i}.
+\end{equation}
 
 \paragraph{Evaluation metrics with motion ground truth}
 Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
 let $i_k$ be the index of the best matching ground truth example,
 let $c_k$ be the predicted class,
 let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ be the predicted motion for class $c_k$
 and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}$ the ground truth motion for the example $i_k$.
 Then, assuming there are $N$ such detections,
 \begin{equation}
 E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{tr(inv(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right)
 \end{equation}
 measures the mean angle of the error rotation between predicted and ground truth rotation,
 \begin{equation}
 E_{t} = \frac{1}{N}\sum_k \lVert inv(R^{k,c_k}) \cdot (t^{gt,i_k} - t^{k,c_k}) \rVert,
 \end{equation}
 is the mean Euclidean norm between predicted and ground truth translation, and
 \begin{equation}
 E_{p} = \frac{1}{N}\sum_k \lVert p^{gt,i_k} - p^{k,c_k} \rVert
 \end{equation}
 is the mean Euclidean norm between predicted and ground truth pivot.
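The per-detection rotation term of $E_R$ above can be sketched as follows (an illustrative helper; the clamp of the trace argument to $[-1, 1]$ guards `arccos` against numerical round-off):

```python
import numpy as np

def rotation_angle_error(R_pred, R_gt):
    """Geodesic angle of the error rotation inv(R_pred) @ R_gt,
    with the cosine argument clamped to [-1, 1]."""
    cos = (np.trace(np.linalg.inv(R_pred) @ R_gt) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

# a 90-degree error about the z-axis yields an angle of pi/2
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
assert np.isclose(rotation_angle_error(np.eye(3), Rz90), np.pi / 2)
```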
 
 \subsection{Training Setup}
-Our training schedule is similar to the Mask R-CNN Cityscapes schedule.
+Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
 We train on a single Titan X (Pascal) for a total of 192K iterations on the
-Virtual KITTI dataset. As learning rate we use $0.25 \cdot 10^{-2}$ for the
+Virtual KITTI training set. As learning rate we use $0.25 \cdot 10^{-2}$ for the
 first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
 
 \paragraph{R-CNN training parameters}
 \todo{add this}
 
 \subsection{Experiments on Virtual KITTI}
 \todo{add this}
 
 \subsection{Evaluation on KITTI 2015}
 \todo{add this}
introduction.tex
@@ -1,23 +1,30 @@
-\subsection{Motivation}
+\subsection{Motivation \& Goals}
 
 % introduce problem to solve
 % mention classical non deep-learning works, then say it would be nice to go end-to-end deep
 For moving in the real world, it is generally desirable to know which objects exist
 in the proximity of the moving agent,
 where they are located relative to the agent,
 and where they will be at some point in the future.
 In many cases, it would be preferable to infer such information from video data
 if technically feasible, as camera sensors are cheap and ubiquitous.
 
 % Steal intro from behl2017 & FlowLayers
 For example, in autonomous driving, it is crucial to not only know the position
 of each obstacle, but to also know if and where the obstacle is moving,
 and to use sensors that will not make the system too expensive for widespread use.
 There are many other applications. %TODO(make motivation wider)
 
-Deep learning research is moving towards videos.
-Motion estimation is an inherently ambigous problem and.
-A recent trend is towards end-to-end deep learning systems, away from energy-minimization.
-Often however, this leads to a compromise in modelling as it is more difficult to
-formulate a end-to-end deep network architecture for a given problem than it is
-to state a fesable energy-minimization problem.
-For this reason, we see lots of generic models applied to domains which previously
-employed intricate physical models to simplify optimization.
-On the on hand, end-to-end deep learning may bringe unique benefits due do the ability
-of a learned system to deal with ambiguity.
-On the other hand,
-%Thus, there is an emerging trend to unify geometry with deep learning by
-% THE ABOVE IS VERY DRAFT_LIKE
+A promising approach for 3D scene understanding in these situations may be deep neural
+networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification
+in still images and are more and more often being applied to video data.
+A key benefit of end-to-end deep networks is that they can, in principle,
+enable very fast inference on real-time video data and generalize
+over many training examples to resolve ambiguities inherent in image understanding
+and motion estimation.
 
 Thus, in this work, we aim to develop an end-to-end deep network which can, given
 sequences of images, segment the image pixels into object instances and estimate
 the location and 3D motion of each object instance relative to the camera.
 
 \subsection{Technical outline}
 
 Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
 and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
@@ -46,40 +53,64 @@ This gives us a fully integrated end-to-end network architecture for segmenting
 and estimating the motion of all detected instances without any limitations
 as to the number or variety of object instances.
 
 Eventually, we want to extend our method to include end-to-end depth prediction,
 yielding the first end-to-end deep network to perform 3D scene flow estimation
 in a principled way by considering individual objects.
 For now, we will work with RGB-D frames to break down the problem into manageable pieces.
 
 \subsection{Related work}
 
 In the following, we will refer to systems which use deep networks for all
 optimization and do not perform time-critical side computation at inference time as
 \emph{end-to-end} deep learning systems.
 
 \paragraph{Deep networks in optical flow}
 
 End-to-end deep networks for optical flow were recently introduced
 based on encoder-decoder networks or CNN pyramids \cite{FlowNet, FlowNet2, SPyNet},
-which pose optical flow as generic pixel-wise estimation problem without making any assumptions
+which pose optical flow as a generic, homogeneous pixel-wise estimation problem without making any assumptions
 about the regularity and structure of the estimated flow.
-Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure
-the optical flow estimation, but still require expensive energy minimization for each
+Specifically, such methods ignore that the optical flow varies across an
+image depending on the semantics of each region or pixel, which include whether a
+pixel belongs to the background, to which object instance it belongs if it is not background,
+and the class of the object it belongs to.
+Often, failure cases of these methods include motion boundaries or regions with little texture,
+where semantics become more important. % TODO make sure this is a grounded statement
+Extensions of these approaches to scene flow estimate flow and depth
+with similarly generic networks \cite{SceneFlowDataset} and similar limitations.
+
+Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure % TODO cite jun's paper?
+the optical flow estimation problem and introduce semantics,
+but still require expensive energy minimization for each
 new input, as CNNs are only used for some of the components.
 
 \paragraph{Slanted plane methods for 3D scene flow}
 The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being
 composed of planar segments. Pixels are assigned to one of the planar segments,
-each of which undergoes a rigid motion.
-
+each of which undergoes an independent 3D rigid motion. % TODO explain benefits of this modelling, but unify with explanations above
 In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015}
 assigns each slanted plane to one rigidly moving object instance, thus
 reducing the number of independently moving segments by allowing multiple
 segments to share the motion of the object they belong to.
 In these methods, pixel assignment and motion estimation are formulated
 as an energy-minimization problem and optimized for each input data point,
 without any learning. % TODO make sure it's ok to say there's no learning
 
-In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
+In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
 a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
 with depth obtained from a non-learned stereo algorithm to be used as pre-computed
-inputs to their slanted plane scene flow model based on \cite{KITTI2015}.
+inputs to a slanted plane scene flow model based on \cite{KITTI2015}.
 Most likely due to their use of deep learning for instance segmentation and for some other components, this
 approach outperforms the previous related scene flow methods on public benchmarks.
 Still, the method uses an energy-minimization formulation for the scene flow estimation
 and takes minutes to make a prediction.
 
-Interestingly, these slanted plane methods achieve the current state-of-the-art
-in scene flow \emph{and} optical flow estimation on the KITTI benchmarks \cite{KITTI2012, KITTI2015},
+Interestingly, the slanted plane methods achieve the current state-of-the-art
+in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015},
 outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
-However, the end-to-end deep networks are significantly faster than energy-minimization based slanted plane models,
-generally taking a fraction of a second instead of minutes to compute and can often be modified to run in realtime.
-These concerns restrict the applicability of the current slanted plane models in practical applications,
+However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts,
+generally taking a fraction of a second instead of minutes for prediction, and can often be made to run in realtime.
+These concerns restrict the applicability of the current slanted plane models in practical settings,
 which often require estimations to be done in realtime and for which an end-to-end
 approach based on learning would be preferable.
@@ -92,14 +123,14 @@ end-to-end deep networks.
 
 Thus, in the context of motion estimation, one could expect end-to-end deep learning to not only bring large improvements
 in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation
-and the ability of deep networks to learn to handle ambiguity from experience. % TODO instead of experience, talk about compressing large datasets / generalization
+and the ability of deep networks to learn to handle ambiguity from a large variety of training examples.
 
 However, we think that the current end-to-end deep learning approaches to motion
 estimation are limited by a lack of spatial structure and regularity in their estimates,
-which stems from the generic nature of the employed networks.
+which stems from the generic nature of the employed networks. % TODO move to end-to-end deep nets section
 To this end, we aim to combine the modelling benefits of rigid scene decompositions
 with the promise of end-to-end deep learning.
 
 \paragraph{End-to-end deep networks for 3D rigid motion estimation}
 End-to-end deep learning for predicting rigid 3D object motions was first introduced with
 SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation
@@ -109,3 +140,15 @@ estimates a segmentation of pixels into objects together with their 3D motions b
 In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning.
 For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate
 with a brightness constancy proxy loss.
+
+Like SfM-Net, we aim to estimate motion and instance segmentation jointly with
+end-to-end deep learning.
+Unlike SfM-Net, we build on a scalable object detection and instance segmentation
+approach with R-CNNs, which provide a strong baseline.
+
+\paragraph{End-to-end deep networks for camera pose estimation}
+Deep networks have been used for estimating the 6-DOF camera pose from
+a single RGB frame \cite{PoseNet, PoseNet2}. These works are related to
+ours in that we also need to output various rotations and translations from a deep network
+and thus need to solve similar regression problems and use similar parametrizations
+and losses.
@@ -81,6 +81,9 @@
 \parindent 0em % first-line indentation
 
+
+\newcommand{\todo}[1]{\textbf{\textcolor{red}{#1}}}
+
 
 % front matter
 \author{\myname}
 \thesistitle{\mytitleen}{\mytitlede}