mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git (synced 2025-12-13 09:55:49 +00:00)
WIP
This commit is contained in: parent fafd293f99, commit eb7df27e2f
@@ -24,9 +24,9 @@ Building on recent advances in region-based convolutional networks (R-CNNs), we
estimation with instance segmentation.
Given two consecutive frames from a monocular RGBD camera,
our resulting end-to-end deep network detects objects with accurate per-pixel masks
-and estimates the 3d motion of each detected object between the frames.
+and estimates the 3D motion of each detected object between the frames.
By additionally estimating a global camera motion in the same network, we compose a dense
-optical flow field based on instance-level motion predictions.
+optical flow field based on instance-level and global motion predictions.

-We demonstrate the effectiveness of our approach on the KITTI 2015 optical flow benchmark.
+We demonstrate the feasibility of our approach on the KITTI 2015 optical flow benchmark.
\end{abstract}
approach.tex
@@ -28,25 +28,25 @@ where
R_t^{k,x}(\alpha) =
\begin{pmatrix}
1 & 0 & 0 \\
-0 & cos(\alpha) & -sin(\alpha) \\
-0 & sin(\alpha) & cos(\alpha)
+0 & \cos(\alpha) & -\sin(\alpha) \\
+0 & \sin(\alpha) & \cos(\alpha)
\end{pmatrix},
\end{equation}

\begin{equation}
R_t^{k,y}(\beta) =
\begin{pmatrix}
-cos(\beta) & 0 & sin(\beta) \\
+\cos(\beta) & 0 & \sin(\beta) \\
0 & 1 & 0 \\
--sin(\beta) & 0 & cos(\beta)
+-\sin(\beta) & 0 & \cos(\beta)
\end{pmatrix},
\end{equation}

\begin{equation}
R_t^{k,z}(\gamma) =
\begin{pmatrix}
-cos(\gamma) & -sin(\gamma) & 0 \\
-sin(\gamma) & cos(\gamma) & 0 \\
+\cos(\gamma) & -\sin(\gamma) & 0 \\
+\sin(\gamma) & \cos(\gamma) & 0 \\
0 & 0 & 1
\end{pmatrix},
\end{equation}
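For illustration, the three Euler rotation matrices above translate directly into the following NumPy sketch. The composition into a single object rotation is not shown in this excerpt, so the order used in compose_rotation (R_z · R_y · R_x) is only an assumption for the example.

import numpy as np

def rot_x(alpha):
    # R_t^{k,x}(alpha): rotation about the x-axis
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[1, 0, 0],
                     [0, c, -s],
                     [0, s, c]])

def rot_y(beta):
    # R_t^{k,y}(beta): rotation about the y-axis
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, 0, s],
                     [0, 1, 0],
                     [-s, 0, c]])

def rot_z(gamma):
    # R_t^{k,z}(gamma): rotation about the z-axis
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[c, -s, 0],
                     [s, c, 0],
                     [0, 0, 1]])

def compose_rotation(alpha, beta, gamma):
    # Hypothetical composition order; the excerpt does not state which order the thesis uses.
    return rot_z(gamma) @ rot_y(beta) @ rot_x(alpha)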
@@ -59,8 +59,8 @@ We then extend the Faster R-CNN head by adding a fully connected layer in parallel
predicting refined boxes and classes.
As with refined boxes and masks, we make one separate motion prediction for each class.
Each motion is predicted as a set of nine scalar motion parameters,
-$sin(\alpha)$, $sin(\beta)$, $sin(\gamma)$, $t_t^k$ and $p_t^k$,
-where $sin(\alpha)$, $sin(\beta)$ and $sin(\gamma)$ are clipped to $[-1, 1]$.
+$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
+where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction along any axis.

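A minimal sketch of how these nine per-class parameters could be decoded on the output side, reusing the rotation helpers from the previous sketch. The 9-vector layout and the recovery of angles via arcsin are assumptions for the example (arcsin is well defined here only because rotations are assumed to stay within 90 degrees), not the thesis implementation.

import numpy as np

def decode_motion(params):
    # params: 9 predicted scalars for one RoI and one class, assumed layout:
    # [sin(alpha), sin(beta), sin(gamma), t_t^k (3), p_t^k (3)]
    sines = np.clip(params[0:3], -1.0, 1.0)    # clipped Euler sines
    alpha, beta, gamma = np.arcsin(sines)      # valid under the +-90 degree assumption
    t = params[3:6]                            # 3D translation t_t^k
    p = params[6:9]                            # 3D pivot p_t^k
    R = compose_rotation(alpha, beta, gamma)   # rotation from the previous sketch
    return R, t, p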
@@ -69,7 +69,7 @@ In addition to the object transformations, we optionally predict the camera motion
between the two frames $I_t$ and $I_{t+1}$.
For this, we flatten the full output of the backbone and pass it through a fully connected layer.
We again represent $R_t^{cam}$ using an Euler angle representation and
-predict $sin(\alpha)$, $sin(\beta)$, $sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.
+predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.

\subsection{Supervision}

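As an illustration of this camera branch, a PyTorch-style sketch; the excerpt only states that the flattened backbone output goes through a fully connected layer, so the feature dimension, output layout and clipping below are assumptions.

import torch
import torch.nn as nn

class CameraMotionHead(nn.Module):
    # Predicts sin(alpha), sin(beta), sin(gamma) and a 3D translation t_t^{cam}
    # from the flattened backbone feature map.
    def __init__(self, backbone_feature_dim):
        super().__init__()
        self.fc = nn.Linear(backbone_feature_dim, 6)

    def forward(self, backbone_features):
        x = torch.flatten(backbone_features, start_dim=1)   # flatten full backbone output
        out = self.fc(x)
        sines = torch.clamp(out[:, 0:3], -1.0, 1.0)          # clipped Euler sines
        t_cam = out[:, 3:6]                                   # camera translation
        return sines, t_cam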
@@ -84,7 +84,7 @@ L_{motion}^k = l_{R}^k + l_{t}^k + l_{p}^k,
\end{equation}
where
\begin{equation}
-l_{R}^k = arccos\left(\frac{tr(inv(R_{c_k}^k) \cdot R_{gt}^{i_k}) - 1}{2} \right)
+l_{R}^k = \arccos\left( \min\left\{1, \max\left\{-1, \frac{tr(inv(R_{c_k}^k) \cdot R_{gt}^{i_k}) - 1}{2} \right\}\right\} \right)
\end{equation}
measures the angle of the error rotation between predicted and ground truth rotation,
\begin{equation}
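A NumPy sketch of the clamped rotation term $l_R^k$ exactly as written above; the translation and pivot terms $l_t^k$, $l_p^k$ are not shown in this excerpt and are therefore omitted here.

import numpy as np

def rotation_loss(R_pred, R_gt):
    # Angle of the error rotation between predicted and ground truth rotation.
    # The trace term is clamped to [-1, 1] so arccos stays defined under numerical noise.
    err = np.linalg.inv(R_pred) @ R_gt
    cos_angle = (np.trace(err) - 1.0) / 2.0
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))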
@@ -1,6 +1,3 @@
-
-\subsection{Object detection, semantic segmentation and instance segmentation}
-
\subsection{Optical flow, scene flow and structure from motion}
Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a sequence of images.
The optical flow $\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$ maps pixel coordinates in the first
@@ -10,7 +7,6 @@ Optical flow can be regarded as two-dimensional motion estimation.

Scene flow is the generalization of optical flow to 3-dimensional space.

-\subsection{Rigid scene model}
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures \cite{} became widely popular
through numerous successes in classification and recognition tasks.
@@ -27,9 +23,10 @@ The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{}, which was recently extended to scene flow estimation \cite{}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is rather generic and is specialized for optical flow only through being trained
-with a dense optical flow groundtruth loss. The same network could also be used for semantic segmentation if
+with a dense optical flow groundtruth loss.
+Note that the same network could also be used for semantic segmentation if
the number of output channels was adapted from two to the number of classes. % TODO verify
-FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to estimate optical flow quite well,
+FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to estimate optical flow arguably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated only depends on the number of 2D strides or pooling
operations in the encoder.
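To make the "generic encoder-decoder" point concrete, here is a deliberately tiny PyTorch sketch in the spirit of FlowNetS; it is not the real architecture, and layer counts and channel widths are made up. The two frames are stacked along the channel axis, the output has 2 channels (u, v), swapping those 2 channels for a number-of-classes output would reuse the same network for segmentation, and each stride-2 convolution in the encoder enlarges the receptive field that bounds the maximum displacement the network can cover.

import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    # Toy encoder-decoder: input is two RGB frames concatenated to 6 channels.
    def __init__(self, out_channels=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),    # /2
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),   # /4
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),  # /8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, frame1, frame2):
        x = torch.cat([frame1, frame2], dim=1)   # stack the two frames channel-wise
        return self.decoder(self.encoder(x))

flow_net = TinyFlowNet(out_channels=2)             # dense optical flow (u, v)
# seg_net = TinyFlowNet(out_channels=num_classes)  # same network reused for segmentation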
@@ -52,7 +49,7 @@ most popular deep networks for object detection, and have recently also been applied
The original region-based convolutional network (R-CNN) uses a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped at the proposed region and the crop is
-passed through a CNN, which performs classification of the object (or non-object, if the region shows background).
+passed through a CNN, which performs classification of the object (or non-object, if the region shows background). % and box refinement!
 
\paragraph{Fast R-CNN}
The original R-CNN involved computing one forward pass of the deep CNN for each of the region proposals,
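A schematic Python sketch of the original R-CNN pipeline as described here; the proposal algorithm, cropping routine and per-crop classifier are passed in as placeholder callables (hypothetical names, not any library API). The point is that every proposal triggers its own full forward pass, which is exactly the cost Fast R-CNN removes.

def rcnn_detect(image, propose_regions, crop_and_resize, classify_crop):
    # propose_regions: non-learned external algorithm returning 2D bounding boxes
    # classify_crop:   CNN assigning an object class or "background" to one crop
    detections = []
    for box in propose_regions(image):
        crop = crop_and_resize(image, box)     # crop the input image at the proposed region
        label, score = classify_crop(crop)     # one full CNN forward pass per proposal
        if label != "background":
            detections.append((box, label, score))
    return detections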
@@ -90,9 +87,12 @@ As in Fast R-CNN, RoI pooling is used to crop one fixed size feature map for each


\paragraph{Mask R-CNN}
-
-Combining object detection and semantic segmentation, Mask R-CNN extends the Faster R-CNN system to predict
-high resolution instance masks within the bounding boxes of each detected object.
+Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
+However, it can be helpful to know class and object (instance) membership of all individual pixels,
+which generally involves computing a binary mask for each object instance specifying which pixels belong
+to that object. This problem is called \emph{instance segmentation}.
+Mask R-CNN extends the Faster R-CNN system to instance segmentation by predicting
+fixed resolution instance masks within the bounding boxes of each detected object.
This can be done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition, Mask R-CNN
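A minimal PyTorch sketch of such a mask branch on top of pooled RoI features; the number of convolutions, channel widths and upsampling factor are illustrative assumptions, not the Mask R-CNN configuration.

import torch.nn as nn

class MaskHead(nn.Module):
    # A few convolutions over the fixed-size RoI feature map, followed by upsampling
    # and a per-class 1x1 convolution producing one mask logit map per class and RoI.
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(),
        )
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)
        self.mask_logits = nn.Conv2d(256, num_classes, 1)

    def forward(self, roi_features):            # [num_rois, in_channels, H, W]
        x = self.convs(roi_features)
        x = nn.functional.relu(self.upsample(x))
        return self.mask_logits(x)              # [num_rois, num_classes, 2H, 2W]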
@@ -2,14 +2,26 @@ We have introduced an extension on top of region-based convolutional networks to
in parallel to instance segmentation.

\subsection{Future Work}
\paragraph{Predicting depth}
In most cases, we want to work with RGB frames without depth available.
To do so, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN, as in Figure \ref{}.
Although single-frame monocular depth prediction with deep networks has already been demonstrated
with some success,
our two-frame input should allow the network to make use of epipolar
geometry for making a more reliable depth estimate.

\paragraph{Training on real world data}
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
-we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset.
-A next step would be the training on
-For example, we can first pre-train the RPN on a object detection dataset like
-Cityscapes. As soon as the RPN works reliably, we could then do alternating
+we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
+A next step would be training on a more realistic dataset.
+For example, we can first pre-train the RPN on an object detection dataset like
+Cityscapes. As soon as the RPN works reliably, we could execute alternating
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
-the motion losses and depth prediction, as no instance segmentation ground truth exists. % TODO depth prediction ?!
+the motion losses (and depth prediction), as no instance segmentation ground truth exists.
On Cityscapes, we could continue training the full instance segmentation Mask R-CNN to
improve detection and masks and avoid any forgetting effects.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
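As a sketch of what such a warping-based proxy loss could look like, the following PyTorch snippet warps the second frame towards the first with a predicted dense flow field and penalizes the photometric difference. This is a generic brightness-constancy loss under assumed conventions (pixel-space flow, normalized grid coordinates, L1 penalty), not a loss actually used in the thesis.

import torch
import torch.nn.functional as F

def photometric_warp_loss(frame1, frame2, flow):
    # frame1, frame2: [B, 3, H, W]; flow: [B, 2, H, W] in pixels, from frame1 to frame2.
    b, _, h, w = frame1.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=frame1.dtype),
                            torch.arange(w, dtype=frame1.dtype), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).to(frame1.device)   # [1, 2, H, W]
    target = grid + flow                                                 # where each pixel moved to
    # Normalize to [-1, 1] for grid_sample and warp frame2 back into frame1's coordinates.
    target_x = 2.0 * target[:, 0] / max(w - 1, 1) - 1.0
    target_y = 2.0 * target[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((target_x, target_y), dim=-1)              # [B, H, W, 2]
    warped = F.grid_sample(frame2, sample_grid, align_corners=True)
    return (warped - frame1).abs().mean()                                # L1 brightness-constancy penalty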
@@ -1,6 +1,34 @@
\subsection{Motivation \& Goals}

% introduce problem to solve
% mention classical non deep-learning works, then say it would be nice to go end-to-end deep

% Explain benefits of learning (why deep-nize rigid scene model??)
Recently, SfM-Net \cite{} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a batch of binary full image masks specifying the object memberships of individual pixels with a standard encoder-decoder
network for pixel-wise prediction. A fully connected network branching off the encoder predicts a 3D motion for each object.
However, due to the fixed number of object masks, it can only predict a small number of motions and
often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions.

Thus, their approach is very unlikely to scale to dynamic scenes with a potentially
large number of diverse objects due to the inflexible nature of their instance segmentation technique.

A scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{}, which inherits from Faster R-CNN the ability to detect
a large number of objects from a large number of classes at once
and predicts pixel-precise segmentation masks for each detected object.

We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head
in parallel to classification and bounding box refinement.

\subsection{Related Work}

\paragraph{Deep optical flow estimation}
\paragraph{Deep scene flow estimation}
\paragraph{Structure from motion}
SfM-Net, SE3 Nets,

Behl2017ICCV