\subsection{Motivation}
% introduce problem to solve
% mention classical non deep-learning works, then say it would be nice to go end-to-end deep

% Steal intro from behl2017 & FlowLayers
Deep learning research is moving towards videos.
Motion estimation is an inherently ambiguous problem.
A recent trend is towards end-to-end deep learning systems and away from energy minimization.
Often, however, this leads to a compromise in modelling, as it is more difficult to
formulate an end-to-end deep network architecture for a given problem than it is
to state a feasible energy-minimization problem.
For this reason, we see many generic models applied to domains which previously
employed intricate physical models to simplify optimization.
On the one hand, end-to-end deep learning may bring unique benefits due to the ability
of a learned system to deal with ambiguity.
On the other hand,
%Thus, there is an emerging trend to unify geometry with deep learning by

% THE ABOVE IS VERY DRAFT_LIKE
Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a set of binary full-image masks specifying the object memberships of individual pixels with a standard encoder-decoder
network for pixel-wise prediction. A fully connected network branching off the encoder predicts a 3D motion for each object.
However, due to the fixed number of object masks, the system can only predict a small number of motions and
often fails to segment the pixels into the correct masks or assigns background pixels to object motions.

Thus, this approach is very unlikely to scale to dynamic scenes with a potentially
large number of diverse objects, due to the inflexible nature of its instance segmentation technique.

A scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{MaskRCNN}, which inherits from Faster R-CNN the ability to detect
a large number of objects from a large number of classes at once
and predicts a pixel-precise segmentation mask for each detected object.

We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
To this end, we integrate 3D motion prediction for individual objects into the per-RoI R-CNN head
in parallel to classification and bounding box refinement.
\subsection{Related work}
\paragraph{Deep networks in optical flow and scene flow}

\cite{FlowLayers}
\cite{ESI}
\paragraph{Slanted plane methods for 3D scene flow}
The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being
composed of planar segments. Each pixel is assigned to one of the planar segments,
each of which undergoes a rigid motion.

In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015}
assigns each slanted plane to one rigidly moving object instance, thus
reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.

In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks, which are then combined
with depth obtained from a non-learned stereo algorithm and used as pre-computed
inputs to the Object Scene Flow model from \cite{KITTI2015}.

Interestingly, these slanted plane methods achieve the current state of the art
in scene flow \emph{and} optical flow estimation on the KITTI benchmarks \cite{KITTI2012, KITTI2015},
outperforming end-to-end deep networks such as \cite{FlowNet2, SceneFlowDataset}.
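
As a hedged illustration of the slanted plane model referred to above, the following sketch states one common parameterization of a single planar segment.
The symbols $K$, $\mathbf{n}_s$, $R_s$ and $\mathbf{t}_s$ are notational assumptions made for this illustration and are not taken from \cite{PRSF, PRSM}.
Let $K$ be the camera intrinsics, $\tilde{\mathbf{x}} = (u, v, 1)^\top$ a pixel in homogeneous coordinates,
and let segment $s$ lie on the plane $\mathbf{n}_s^\top \mathbf{X} = 1$ in camera coordinates,
which assumes that the plane does not pass through the camera center.
The depth of the pixel and the corresponding 3D point on the plane are then
\begin{equation}
d(\tilde{\mathbf{x}}) = \frac{1}{\mathbf{n}_s^\top K^{-1} \tilde{\mathbf{x}}},
\qquad
\mathbf{X} = d(\tilde{\mathbf{x}}) \, K^{-1} \tilde{\mathbf{x}}.
\end{equation}
If the segment undergoes the rigid motion $\mathbf{X}' = R_s \mathbf{X} + \mathbf{t}_s$, then,
using $\mathbf{n}_s^\top \mathbf{X} = 1$, the corresponding pixel in the second frame follows from a plane-induced homography,
\begin{equation}
\tilde{\mathbf{x}}' \sim K \left( R_s + \mathbf{t}_s \mathbf{n}_s^\top \right) K^{-1} \tilde{\mathbf{x}},
\end{equation}
where $\sim$ denotes equality up to scale.
Under these assumptions, the 2D displacement of every pixel in a segment is determined by the plane parameters and a single rigid motion.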
%

In other contexts, the move from
% talk about performance issues with energy-minimization components, draw parallels to evolution of R-CNNs in terms of speed and accuracy when moving towards full end-to-end learning
\paragraph{End-to-end deep networks for 3D rigid motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first introduced with
SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation
of the points into objects together with the 3D motion of each object.
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and
estimates a segmentation of pixels into objects together with their 3D motions between the frames.
In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning.
For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate
with a brightness constancy proxy loss.
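
The supervision described above can be sketched as follows, in a simplified form that omits details of \cite{SfmNet} such as rotation pivot points.
All symbols are notational assumptions made for this illustration.
Let $d(\mathbf{x})$ be the predicted depth at pixel $\mathbf{x} = (u, v)^\top$ with homogeneous coordinates $\tilde{\mathbf{x}} = (u, v, 1)^\top$,
$K$ the camera intrinsics, $m_k(\mathbf{x}) \in \{0, 1\}$ the mask of object $k$, $(R_k, \mathbf{t}_k)$ its rigid motion,
and $(R_c, \mathbf{t}_c)$ the camera motion.
Each pixel is back-projected and moved by the objects it belongs to,
\begin{equation}
\mathbf{X} = d(\mathbf{x}) \, K^{-1} \tilde{\mathbf{x}},
\qquad
\mathbf{X}' = \mathbf{X} + \sum_{k} m_k(\mathbf{x}) \left( R_k \mathbf{X} + \mathbf{t}_k - \mathbf{X} \right),
\end{equation}
and is then moved by the camera,
\begin{equation}
\mathbf{X}'' = R_c \mathbf{X}' + \mathbf{t}_c .
\end{equation}
With $\pi$ denoting perspective projection, that is, multiplication by $K$ followed by division by the third coordinate and keeping the first two,
the composed optical flow and a brightness constancy loss between the frames $I_t$ and $I_{t+1}$ are
\begin{equation}
\mathbf{w}(\mathbf{x}) = \pi(\mathbf{X}'') - \mathbf{x},
\qquad
\mathcal{L}_{\mathrm{photo}} = \sum_{\mathbf{x}} \left\| I_t(\mathbf{x}) - I_{t+1}\!\left( \mathbf{x} + \mathbf{w}(\mathbf{x}) \right) \right\|_1 .
\end{equation}
The $\ell_1$ penalty is only one possible choice of photometric error.
Because this loss compares image intensities only, it requires no ground truth for flow, depth, or motion.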