mirror of
https://github.com/tu-darmstadt-informatik/bsc-thesis.git
synced 2025-12-13 09:55:49 +00:00
WIP
This commit is contained in:
parent
84c5b1e6cd
commit
a6311dca56
@@ -31,6 +31,9 @@ By additionally estimating a global camera motion in the same network,
we compose a dense optical flow field based on instance-level and global motion
predictions.

%We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
%benchmark.
\end{abstract}

\renewcommand{\abstractname}{Zusammenfassung}
\begin{abstract}
\todo{german abstract}
\end{abstract}

62 approach.tex
@@ -17,7 +17,15 @@ laying the foundation for our motion estimation. Instead of taking a single image,
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
We do not introduce a separate network for computing region proposals, but instead use our modified backbone network
as both the first-stage RPN and the second-stage feature extractor for region cropping.
% TODO figures; introduce XYZ inputs
Technically, our feature encoder network will have to learn a motion representation similar to
that learned by the FlowNet encoder, but the output will be computed in the
object-centric framework of a region-based convolutional network head with a 3D parametrization.
Thus, in contrast to the dense FlowNet decoder, the estimated dense motion information
from the encoder is integrated for specific objects via RoI cropping and
processed by the RoI head for each object.
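
To make the backbone input concrete, here is a minimal NumPy sketch of the depth-concatenation of the two frames into a six-channel map (an illustration, not our implementation; the array shapes are assumptions):
\begin{verbatim}
import numpy as np

def stack_frames(frame_t, frame_tp1):
    # Depth-concatenate two consecutive RGB frames of shape (H, W, 3)
    # into a single (H, W, 6) input map for the backbone.
    assert frame_t.shape == frame_tp1.shape and frame_t.shape[-1] == 3
    return np.concatenate([frame_t, frame_tp1], axis=-1)

# Illustrative usage with dummy KITTI-sized frames:
I_t   = np.zeros((375, 1242, 3), dtype=np.float32)
I_tp1 = np.zeros((375, 1242, 3), dtype=np.float32)
inputs = stack_frames(I_t, I_tp1)   # shape (375, 1242, 6)
\end{verbatim}
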
\todo{figure of backbone}

\todo{introduce optional XYZ input}

\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
@@ -70,6 +78,7 @@ where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction along any axis.
All predictions are made in camera space, and translation and pivot predictions are in meters.
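
To illustrate this parametrization, the following sketch (an illustration, not our implementation) recovers a rotation matrix from the three predicted, clipped sines; the composition order $R = R_z R_y R_x$ and the positive cosine root are assumptions that only hold because each angle is restricted to $[-90^{\circ}, 90^{\circ}]$:
\begin{verbatim}
import numpy as np

def rotation_from_sines(sin_a, sin_b, sin_c):
    # Predicted sin(alpha), sin(beta), sin(gamma) are clipped to [-1, 1].
    # Since each angle is assumed to lie in [-90 deg, 90 deg], the cosine
    # can be recovered as the positive square root.
    s = np.clip([sin_a, sin_b, sin_c], -1.0, 1.0)
    c = np.sqrt(1.0 - s * s)
    Rx = np.array([[1, 0, 0], [0, c[0], -s[0]], [0, s[0], c[0]]])
    Ry = np.array([[c[1], 0, s[1]], [0, 1, 0], [-s[1], 0, c[1]]])
    Rz = np.array([[c[2], -s[2], 0], [s[2], c[2], 0], [0, 0, 1]])
    return Rz @ Ry @ Rx   # one possible composition order

R = rotation_from_sines(0.05, -0.02, 0.10)   # a small inter-frame rotation
\end{verbatim}
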
\todo{figure of head}

\paragraph{Camera motion prediction}
In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
@@ -111,23 +120,53 @@ l_{p}^k = \lVert p^{gt,i_k} - p^{k,c_k} \rVert_1.
\end{equation}

\paragraph{Camera motion supervision}
We supervise the camera motion with ground truth analogously to the
object motions, the only difference being that the camera motion has
a rotation and a translation, but no pivot term.

\paragraph{Per-RoI supervision \emph{without} motion ground truth}
A more general way to supervise the object motions is a re-projection
loss similar to the unsupervised loss in SfM-Net \cite{SfmNet},
which we can apply to coordinates within the object bounding boxes
and which does not require ground truth 3D object motions.

For any RoI, we generate a uniform 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask. We use the same bounding box
to crop the corresponding region from the dense, full-image depth map
and bilinearly resize the depth crop to the same resolution as the mask and point
grid.
We then compute the optical flow at each of the grid points by creating
a 3D point cloud from the point grid and depth crop. To this point cloud, we
apply the RoI's predicted motion, masked by the predicted mask.
Then, we apply the camera motion to the points, project them back to 2D,
and finally compute the optical flow at each point as the difference between the initial and re-projected 2D grids.
Note that we batch this computation over all RoIs, so that we perform
it only once per forward pass. The mathematical details are analogous to the
dense, full-image flow computation in the following subsection and are not
repeated here. \todo{add diagram to make it easier to understand}
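
To make these steps concrete, the following NumPy sketch computes the flow grid for a single RoI; the batching over RoIs, the exact pivot convention and the soft application of the predicted mask are simplifying assumptions.
\begin{verbatim}
import numpy as np

def roi_flow_grid(box, mask, depth_crop, R_obj, t_obj, pivot,
                  R_cam, t_cam, f, c0, c1):
    # box        : (x0, y0, x1, y1) RPN proposal in pixel coordinates
    # mask       : (m, m) predicted instance mask in [0, 1]
    # depth_crop : (m, m) depth values, already cropped/resized to the box
    # R_obj, t_obj, pivot : predicted rigid object motion (camera space)
    # R_cam, t_cam        : camera motion; f, c0, c1: intrinsics
    m = mask.shape[0]
    x0, y0, x1, y1 = box
    x, y = np.meshgrid(np.linspace(x0, x1, m), np.linspace(y0, y1, m))

    # Back-project the uniform 2D grid to a 3D point cloud via the depth crop.
    Z = depth_crop
    X = (x - c0) * Z / f
    Y = (y - c1) * Z / f
    P = np.stack([X, Y, Z], axis=-1)                 # (m, m, 3)

    # Apply the predicted object motion about its pivot, blended by the mask
    # (pivot-centered transform is one common convention).
    P_obj = (P - pivot) @ R_obj.T + pivot + t_obj
    P_moved = mask[..., None] * P_obj + (1.0 - mask[..., None]) * P

    # Apply the camera motion and project back to pixel coordinates.
    P_cam = P_moved @ R_cam.T + t_cam
    x2 = f * P_cam[..., 0] / P_cam[..., 2] + c0
    y2 = f * P_cam[..., 1] / P_cam[..., 2] + c1

    # Flow at each grid point = re-projected grid minus original grid.
    return np.stack([x2 - x, y2 - y], axis=-1)       # (m, m, 2)
\end{verbatim}
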

For each RoI, we can now penalize the optical flow grid to supervise the object motion.
If optical flow ground truth is available, we can use the RoI bounding box to
crop and resize a region from the ground truth optical flow to match the RoI's
optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss.
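
Assuming the ground truth flow has already been cropped with the RoI box and resized to the grid resolution, the penalty itself reduces to an elementwise $\ell_1$ difference, as in this sketch:
\begin{verbatim}
import numpy as np

def roi_flow_l1(flow_grid, gt_flow_crop):
    # flow_grid, gt_flow_crop: (m, m, 2) predicted and ground truth flow
    # sampled at the same RoI grid points.
    return np.abs(flow_grid - gt_flow_crop).mean()
\end{verbatim}
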

However, we can also use the re-projection loss without optical flow ground truth
to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}.
In this case, we use the bounding box to crop and resize a corresponding region
from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$
using the 2D grid displaced by the predicted flow grid. Then, we can penalize the difference
between the resulting image crops, for example with a census loss \cite{CensusTerm,UnFlow}.
For more details on differentiable bilinear sampling for deep learning, we refer the reader to
\cite{STN}.
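
The warping step could look as follows; this sketch uses a plain $\ell_1$ photometric penalty in place of the census loss and a simplified bilinear sampler standing in for the differentiable sampling of \cite{STN}.
\begin{verbatim}
import numpy as np

def bilinear_sample(img, x, y):
    # Sample img (H, W, C) at continuous coordinates x, y (arrays of equal
    # shape) with bilinear interpolation, clamping to the image border.
    H, W = img.shape[:2]
    x = np.clip(x, 0, W - 1.001)
    y = np.clip(y, 0, H - 1.001)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    top = (1 - wx)[..., None] * img[y0, x0] + wx[..., None] * img[y0, x1]
    bot = (1 - wx)[..., None] * img[y1, x0] + wx[..., None] * img[y1, x1]
    return (1 - wy)[..., None] * top + wy[..., None] * bot

def photometric_loss(crop_t, image_tp1, grid_x, grid_y, flow_grid):
    # Warp I_{t+1} to the RoI grid displaced by the predicted flow and
    # penalize the difference to the crop from I_t (L1 instead of census).
    warped = bilinear_sample(image_tp1,
                             grid_x + flow_grid[..., 0],
                             grid_y + flow_grid[..., 1])
    return np.abs(crop_t - warped).mean()
\end{verbatim}
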

Compared to supervision with motion ground truth, a re-projection
loss could benefit motion regression by removing any loss balancing issues between the
rotation, translation and pivot terms \cite{PoseNet2},
which can make it interesting even when 3D motion ground truth is available.


\subsection{Dense flow from motion}
As a postprocessing step, we compose a dense optical flow map from the outputs of our Motion R-CNN network.
Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$,
where
\begin{equation}
@@ -143,6 +182,7 @@ x_t - c_0 \\ y_t - c_1 \\ f
\end{equation}
is the 3D coordinate at $t$ corresponding to the point with pixel coordinates $x_t, y_t$,
which range over all coordinates in $I_t$.
For now, the depth map is always assumed to come from ground truth.
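
A minimal sketch of this back-projection, assuming the standard pinhole model with focal length $f$ and principal point $(c_0, c_1)$ as suggested by the equation above:
\begin{verbatim}
import numpy as np

def backproject(depth, f, c0, c1):
    # Back-project a dense depth map (H, W) to a 3D point cloud (H, W, 3)
    # in camera space at time t.
    H, W = depth.shape
    x, y = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32))
    X = (x - c0) * depth / f
    Y = (y - c1) * depth / f
    return np.stack([X, Y, depth], axis=-1)
\end{verbatim}
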

Given $k$ detections with predicted motions as above, we transform all points within the bounding
box of a detected object according to the predicted motion of the object.
@@ -166,6 +206,10 @@ X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
= P_{t+1} = R_t^{cam} \cdot P'_{t+1} + t_t^{cam}
\end{equation}.

Note that in our experiments, we either use the ground truth camera motion to focus
on the object motion predictions, or the predicted camera motion to predict the complete
motion. We will always state which variant we use in the experimental section.

Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
\begin{equation}
\begin{pmatrix}
@@ -191,7 +235,3 @@ u \\ v
x_{t+1} - x_{t} \\ y_{t+1} - y_{t}
\end{pmatrix}.
\end{equation}

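Putting the pieces together, the following sketch composes the dense flow map from a back-projected point cloud, the per-object motions and the camera motion; the pivot-centered object transform is again an assumed convention, and each detection is applied box by box as described above.
\begin{verbatim}
import numpy as np

def compose_dense_flow(points, detections, R_cam, t_cam, f, c0, c1):
    # points     : (H, W, 3) back-projected point cloud at time t
    # detections : list of dicts with "box" (x0, y0, x1, y1), "R" (3, 3),
    #              "t" (3,), "pivot" (3,): predicted per-object motions
    H, W, _ = points.shape
    P = points.copy()
    for det in detections:
        x0, y0, x1, y1 = [int(round(v)) for v in det["box"]]
        region = P[y0:y1, x0:x1]
        region = (region - det["pivot"]) @ det["R"].T + det["pivot"] + det["t"]
        P[y0:y1, x0:x1] = region
    # Apply the (predicted or ground truth) camera motion to all points.
    P = P @ R_cam.T + t_cam
    # Project back to pixel coordinates; the dense flow (u, v) is the
    # difference to the original pixel coordinates.
    x, y = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32))
    x2 = f * P[..., 0] / P[..., 2] + c0
    y2 = f * P[..., 1] / P[..., 2] + c1
    return np.stack([x2 - x, y2 - y], axis=-1)   # (H, W, 2)
\end{verbatim}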

%Given the predicted motion as above, a depth map $d_t$ for frame $I_t$ and
%the predicted or ground truth camera motion $\{R_c^k, t_c^k\}\in SE3$.

@@ -1,17 +1,20 @@
\subsection{Optical flow, scene flow and structure from motion}
Here, we give a more detailed description of the previous works
we directly build on, as well as of other prerequisites.

\subsection{Optical flow and scene flow}
Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
sequence of images.
The optical flow
$\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$
maps pixel coordinates in the first frame $I_1$ to pixel coordinates of the
visually corresponding pixel in the second frame $I_2$,
and can be interpreted as the apparent movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.

Scene flow is the generalization of optical flow to 3-dimensional space and
requires estimating depth for each pixel. Generally, stereo input is used for scene flow
to estimate disparity-based depth; however, monocular depth estimation with deep networks is becoming
popular \cite{DeeperDepth}.

\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures
@@ -30,27 +33,18 @@ The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is a rather generic autoencoder and is specialized for optical flow only through being trained
with supervision from dense optical flow ground truth.
Potentially, the same network could also be used for semantic segmentation if
the number of output channels was adapted from two to the number of classes. % TODO verify
Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching arguably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated depends only on the number of 2D strides or pooling
operations in the encoder.
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
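
For illustration, here is a toy PyTorch sketch of the FlowNetS idea (layer counts and channel widths are arbitrary and do not match the original architecture): a generic encoder-decoder mapping two stacked frames to a two-channel flow field, where each stride-2 convolution in the encoder enlarges the receptive field and thus the maximum displacement that can be matched.
\begin{verbatim}
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    # Toy FlowNetS-style encoder-decoder: 6 input channels (two stacked RGB
    # frames), 2 output channels (optical flow u, v).
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1),
        )

    def forward(self, frames):          # frames: (N, 6, H, W)
        return self.decoder(self.encoder(frames))

flow = TinyFlowNet()(torch.zeros(1, 6, 128, 384))   # -> (1, 2, 128, 384)
\end{verbatim}
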

% The conclusion should be an understanding of the generic nature of the popular dense prediction networks
% for flow and depth, which primarily stems from the fact that they are quick re-purposing of recognition CNNs.

% The reader should understand the limitations of the generic dense-estimator approach!

% Also, it should be emphasized that FlowNet learns to match images with a generic encoder,
% thus motivating the introduction of our motion head, which should integrate (and regularize) matching information learned
% in the resnet backbone.

\subsection{Region-based convolutional networks}
We now give a short review of region-based convolutional networks, which are currently by far the
most popular deep networks for object detection, and have recently also been applied to instance segmentation.

\paragraph{R-CNN}
@@ -101,10 +95,11 @@ which generally involves computing a binary mask for each object instance specific
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
fixed-resolution instance masks within the bounding boxes of each detected object.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition to extending the original Faster R-CNN head, Mask R-CNN also introduced a network
variant based on Feature Pyramid Networks \cite{FPN}.
Figure \ref{} compares the two Mask R-CNN head variants.

\paragraph{Supervision of the RPN}
\paragraph{Supervision of the RoI head}

18 bib.bib
@@ -172,3 +172,21 @@
title = {Geometric loss functions for camera pose regression with deep learning},
booktitle = {CVPR},
year = {2017}}

@inproceedings{STN,
author = {Max Jaderberg and Karen Simonyan and Andrew Zisserman and Koray Kavukcuoglu},
title = {Spatial transformer networks},
booktitle = {NIPS},
year = {2015}}

@inproceedings{CensusTerm,
author = {Fridtjof Stein},
title = {Efficient Computation of Optical Flow Using the Census Transform},
booktitle = {DAGM},
year = {2004}}

@inproceedings{DeeperDepth,
author = {Iro Laina and Christian Rupprecht and Vasileios Belagiannis and Federico Tombari and Nassir Navab},
title = {Deeper Depth Prediction with Fully Convolutional Residual Networks},
booktitle = {3DV},
year = {2016}}

@@ -12,7 +12,7 @@ of each obstacle, but to also know if and where the obstacle is moving,
and to use sensors that will not make the system too expensive for widespread use.
There are many other applications. %TODO(make motivation wider)

A promising approach to 3D scene understanding in these situations is deep neural
networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification
in still images and are more and more often being applied to video data.
A key benefit of end-to-end deep networks is that they can, in principle,
@@ -20,7 +20,7 @@ enable very fast inference on real time video data and generalize
over many training examples to resolve ambiguities inherent in image understanding
and motion estimation.

Thus, in this work, we aim to develop end-to-end deep networks which can, given
sequences of images, segment the image pixels into object instances and estimate
the location and 3D motion of each object instance relative to the camera.

@@ -44,19 +44,21 @@ and predicts pixel-precise segmentation masks for each detected object.
We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head,
in parallel to classification, bounding box refinement and mask prediction.
In this way, for each RoI, we predict a single 3D rigid object motion together with the object
pivot in camera space.
As a foundation for image matching, we extend the ResNet \cite{ResNet} backbone of Mask R-CNN to take
two concatenated images as input, similar to FlowNetS \cite{FlowNet}.
This results in a fully integrated end-to-end network architecture for segmenting pixels into instances
and estimating the motion of all detected instances, without any limitations
as to the number or variety of object instances.
Figure \ref{} gives an overview of our network.
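
As a structural sketch only (a hypothetical PyTorch module with assumed feature sizes, not our actual implementation), the per-RoI head could be organized as follows, with the motion branch added in parallel to the classification, box refinement and mask branches; the nine motion outputs correspond to the three rotation sines, the translation and the pivot.
\begin{verbatim}
import torch
import torch.nn as nn

class MotionRCNNHead(nn.Module):
    # Per-RoI head with parallel branches; the 9 motion outputs correspond
    # to sin(alpha), sin(beta), sin(gamma), a 3D translation and a 3D pivot.
    def __init__(self, in_channels=256, roi_size=7, num_classes=81):
        super().__init__()
        feat = in_channels * roi_size * roi_size
        self.fc = nn.Sequential(nn.Linear(feat, 1024), nn.ReLU(),
                                nn.Linear(1024, 1024), nn.ReLU())
        self.cls_score = nn.Linear(1024, num_classes)       # classification
        self.bbox_pred = nn.Linear(1024, 4 * num_classes)   # box refinement
        self.motion_pred = nn.Linear(1024, 9)               # rigid 3D motion
        self.mask_head = nn.Sequential(                     # mask branch
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 2, stride=2), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1))

    def forward(self, roi_features):   # (num_rois, C, roi_size, roi_size)
        x = self.fc(roi_features.flatten(1))
        return (self.cls_score(x), self.bbox_pred(x),
                self.motion_pred(x), self.mask_head(roi_features))
\end{verbatim}
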

Eventually, we want to extend our method to include depth prediction,
yielding the first end-to-end deep network to perform 3D scene flow estimation
in a principled way by considering individual objects.
For now, we will work with RGB-D frames to break down the problem into
manageable pieces.

\subsection{Related work}

@@ -75,12 +77,12 @@ image depending on the semantics of each region or pixel, which include whether a
pixel belongs to the background, to which object instance it belongs if it is not background,
and the class of the object it belongs to.
Often, failure cases of these methods include motion boundaries or regions with little texture,
where semantics become more important. \todo{elaborate}% TODO make sure this is a grounded statement
Extensions of these approaches to scene flow estimate flow and depth
with similarly generic networks \cite{SceneFlowDataset} and similar limitations.

Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure % TODO cite jun's paper?
the optical flow estimation problem and introduce reasoning at the object level,
but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components.

@@ -94,7 +96,7 @@ reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.
In these methods, pixel assignment and motion estimation are formulated
as an energy-minimization problem and optimized for each input data point,
without the use of (deep) learning. % TODO make sure it's ok to say there's no learning

In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
@@ -126,8 +128,8 @@ in speed, but also in accuracy, especially considering the inherent ambiguity of
and the ability of deep networks to learn to handle ambiguity from a large variety of training examples.

However, we think that the current end-to-end deep learning approaches to motion
estimation are likely limited by a lack of spatial structure and regularity in their estimates,
as explained above, which stems from the generic nature of the employed networks.
To this end, we aim to combine the modelling benefits of rigid scene decompositions
with the promise of end-to-end deep learning.
