\subsection{Motivation \& Goals}
% introduce problem to solve

% mention classical non deep-learning works, then say it would be nice to go end-to-end deep

Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences by estimating the 3D motions of individual objects and of the camera.
SfM-Net predicts a fixed-size batch of binary full-image masks specifying the object membership of each pixel with a standard encoder-decoder
network for pixel-wise prediction, while a fully connected network branching off the encoder predicts a 3D motion for each object.
However, due to the fixed number of object masks, the system can only predict a small number of motions, and it
often fails to segment the pixels into the correct masks or assigns background pixels to object motions.

Thus, this approach is unlikely to scale to dynamic scenes with a potentially
large number of diverse objects, owing to the inflexible nature of its instance segmentation technique.

A scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{MaskRCNN}, which inherits from Faster R-CNN the ability to detect
a large number of objects from many classes at once
and additionally predicts a pixel-precise segmentation mask for each detected object.

We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
To this end, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head,
in parallel to classification and bounding box refinement.

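The parallel-branch structure of such a per-RoI head can be sketched as follows. This is a minimal illustrative sketch in a PyTorch style, not the exact thesis architecture: the layer sizes, the pooled feature size, and the 9-dimensional motion parametrization (3D translation, axis-angle rotation, 3D pivot) are assumptions for the example.

```python
# Sketch of a per-RoI head with a motion branch running in parallel to
# classification and bounding box refinement. All sizes are illustrative.
import torch
import torch.nn as nn

class MotionRCNNHead(nn.Module):
    def __init__(self, in_features=256 * 7 * 7, num_classes=81):
        super().__init__()
        # shared fully connected trunk on the pooled RoI features
        self.fc = nn.Sequential(
            nn.Linear(in_features, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
        )
        self.cls_score = nn.Linear(1024, num_classes)      # classification
        self.bbox_pred = nn.Linear(1024, num_classes * 4)  # box refinement
        self.motion_pred = nn.Linear(1024, 9)              # per-RoI 3D motion

    def forward(self, roi_features):
        x = self.fc(roi_features.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x), self.motion_pred(x)

head = MotionRCNNHead()
rois = torch.randn(4, 256, 7, 7)   # four pooled RoIs of size 256x7x7
scores, boxes, motions = head(rois)
print(scores.shape, boxes.shape, motions.shape)
```

The point of the sketch is only that the motion output is one more linear branch off the shared RoI trunk, so it adds negligible cost on top of the existing classification and box branches.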
\subsection{Related Work}

\paragraph{Deep networks for optical flow and scene flow}

\paragraph{Deep networks for 3D motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first introduced with
SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation
of the points into objects together with the 3D motion of each object.
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and
estimates a segmentation of the pixels into objects together with their 3D motions between the frames.
In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow with end-to-end deep learning.
For supervision, SfM-Net penalizes the dense optical flow composed from the 3D motions and the depth estimate
with a brightness constancy proxy loss.

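This flow composition can be sketched as follows; the notation here is a simplified illustration (SfM-Net's exact parametrization additionally includes object pivot points and mask sharpening). Given camera intrinsics $K$, a pixel $p$ with homogeneous coordinates $\tilde p$, predicted depth $d(p)$, object masks $m_k$, object motions $(R_k, t_k)$, and camera motion $(R_c, t_c)$, each point is backprojected, transformed, and reprojected,
\begin{align}
  X  &= d(p)\, K^{-1} \tilde p, \\
  X' &= R_c \Big( X + \sum_k m_k(p)\, \big( (R_k - I)\, X + t_k \big) \Big) + t_c, \\
  w(p) &= \pi(K X') - p,
\end{align}
where $\pi$ denotes perspective division, and the composed flow $w$ is then penalized with a photometric proxy loss such as
\begin{equation}
  L_{\mathrm{photo}} = \sum_p \big| I_{t+1}(p + w(p)) - I_t(p) \big|,
\end{equation}
which requires no ground truth flow, depth, or motion annotations.
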
Recently, deep CNN-based recognition was combined with energy-based 3D scene flow estimation \cite{Behl2017ICCV}.

% \cite{FlowLayers}
% \cite{ESI}