mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git, synced 2025-12-13 09:55:49 +00:00

full editing pass

parent 5b046e41b5
commit bcf3adc60e
@ -28,7 +28,7 @@ we integrate motion estimation with instance segmentation.
|
||||
Given two consecutive frames from a monocular RGB-D camera,
|
||||
our resulting end-to-end deep network detects objects with precise per-pixel
|
||||
object masks and estimates the 3D motion of each detected object between the frames.
|
||||
By additionally estimating a global camera motion in the same network,
|
||||
By additionally estimating the camera ego-motion in the same network,
|
||||
we compose a dense optical flow field based on instance-level and global motion
|
||||
predictions. We train our network on the synthetic Virtual KITTI dataset,
|
||||
which provides ground truth for all components of our system.
|
||||
@ -62,7 +62,7 @@ Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentieru
|
||||
Bei Eingabe von zwei aufeinanderfolgenden Frames aus einer monokularen RGB-D
|
||||
Kamera erkennt unser end-to-end Deep Network Objekte mit pixelgenauen Objektmasken
|
||||
und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den Frames ab.
|
||||
Indem wir zusätzlich im selben Netzwerk die globale Kamerabewegung schätzen,
|
||||
Indem wir zusätzlich im selben Netzwerk die Eigenbewegung der Kamera schätzen,
|
||||
setzen wir aus den instanzbasierten und globalen Bewegungsschätzungen ein dichtes
|
||||
optisches Flussfeld zusammen.
|
||||
Wir trainieren unser Netzwerk auf dem synthetischen Virtual KITTI Datensatz,
|
||||
|
||||
approach.tex (150)
@ -7,7 +7,7 @@ we estimate per-object motion by predicting the 3D motion of each detected objec
|
||||
For this, we extend Mask R-CNN in two straightforward ways.
|
||||
First, we modify the backbone network and provide two frames to the R-CNN system
|
||||
in order to enable image matching between the consecutive frames.
|
||||
Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
|
||||
Second, we extend the Mask R-CNN RoI head to predict a 3D motion and pivot for each
|
||||
region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
|
||||
show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
|
||||
respectively.
|
||||
@ -18,7 +18,7 @@ respectively.
|
||||
\toprule
|
||||
\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
|
||||
\midrule\midrule
|
||||
& input images & H $\times$ W $\times$ C \\
|
||||
& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\
|
||||
\midrule
|
||||
C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
|
||||
\midrule
|
||||
@ -69,7 +69,7 @@ additonally dropout with $p = 0.5$ after all fully-connected hidden layers.
|
||||
\toprule
|
||||
\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
|
||||
\midrule\midrule
|
||||
& input images & H $\times$ W $\times$ C \\
|
||||
& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\
|
||||
\midrule
|
||||
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
|
||||
\midrule
|
||||
@ -121,32 +121,32 @@ Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backb
|
||||
Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
|
||||
laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
|
||||
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
|
||||
Alternatively, we also experiment with concatenating the camera space XYZ coordinates for each frame,
|
||||
Additionally, we also experiment with concatenating the camera space XYZ coordinates for each frame,
|
||||
XYZ$_t$ and XYZ$_{t+1}$, into the input as well.
|
||||
We do not introduce a separate network for computing region proposals and use our modified backbone network
|
||||
as both first stage RPN and second stage feature extractor for extracting the RoI features.
|
||||
Technically, our feature encoder network will have to learn a motion representation similar to
|
||||
Technically, our feature encoder network will have to learn an image matching representation similar to
|
||||
that learned by the FlowNet encoder, but the output will be computed in the
|
||||
object-centric framework of a region based convolutional network head with a 3D parametrization.
|
||||
Thus, in contrast to the dense FlowNet decoder, the estimated dense motion information
|
||||
from the encoder is integrated for specific objects via RoI cropping and
|
||||
Thus, in contrast to the dense FlowNet decoder, the estimated dense image matching information
|
||||
from the encoder is integrated for specific objects via RoI extraction and
|
||||
processed by the RoI head for each object.
|
||||
|
||||
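As a concrete illustration of the backbone input described above, the following minimal NumPy sketch assembles the depth-concatenated input from the two frames and the optional XYZ maps (function and variable names are illustrative, not taken from the thesis code):

```python
import numpy as np

def build_backbone_input(img_t, img_tp1, xyz_t=None, xyz_tp1=None):
    """Depth-concatenate two consecutive frames (and, optionally, their
    camera-space XYZ coordinate maps) along the channel axis."""
    maps = [img_t, img_tp1]                      # H x W x 3 each -> 6 channels
    if xyz_t is not None and xyz_tp1 is not None:
        maps += [xyz_t, xyz_tp1]                 # optionally 12 channels in total
    return np.concatenate(maps, axis=-1)

# Example with random data in place of two RGB frames.
I_t, I_tp1 = np.random.rand(2, 375, 1242, 3)
print(build_backbone_input(I_t, I_tp1).shape)    # (375, 1242, 6)
```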
\paragraph{Per-RoI motion prediction}
|
||||
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
|
||||
For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$
|
||||
For the $k$-th object proposal, we predict the rigid transformation $\{R_k, t_k\}\in \mathbf{SE}(3)$
|
||||
\footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
|
||||
and translations: $\{R, t|R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
|
||||
of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$.
|
||||
We parametrize ${R_t^k}$ using an Euler angle representation,
|
||||
of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_k \in \mathbb{R}^3$ at time $t$.
|
||||
We parametrize ${R_k}$ using an Euler angle representation,
|
||||
|
||||
\begin{equation}
|
||||
R_t^k = R_t^{k,z}(\gamma) \cdot R_t^{k,x}(\alpha) \cdot R_t^{k,y}(\beta),
|
||||
R_k = R_k^z(\gamma) \cdot R_k^x(\alpha) \cdot R_k^y(\beta),
|
||||
\end{equation}
|
||||
|
||||
where
|
||||
\begin{equation}
|
||||
R_t^{k,x}(\alpha) =
|
||||
R_k^x(\alpha) =
|
||||
\begin{pmatrix}
|
||||
1 & 0 & 0 \\
|
||||
0 & \cos(\alpha) & -\sin(\alpha) \\
|
||||
@ -155,7 +155,7 @@ R_t^{k,x}(\alpha) =
|
||||
\end{equation}
|
||||
|
||||
\begin{equation}
|
||||
R_t^{k,y}(\beta) =
|
||||
R_k^y(\beta) =
|
||||
\begin{pmatrix}
|
||||
\cos(\beta) & 0 & \sin(\beta) \\
|
||||
0 & 1 & 0 \\
|
||||
@ -164,7 +164,7 @@ R_t^{k,y}(\beta) =
|
||||
\end{equation}
|
||||
|
||||
\begin{equation}
|
||||
R_t^{k,z}(\gamma) =
|
||||
R_k^z(\gamma) =
|
||||
\begin{pmatrix}
|
||||
\cos(\gamma) & -\sin(\gamma) & 0 \\
|
||||
\sin(\gamma) & \cos(\gamma) & 0 \\
|
||||
@ -179,25 +179,26 @@ prediction in addition to the fully-connected layers for
|
||||
refined boxes and classes and the convolutional network for the masks.
|
||||
Like for refined boxes and masks, we make one separate motion prediction for each class.
|
||||
Each instance motion is predicted as a set of nine scalar parameters,
|
||||
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
|
||||
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_k$ and $p_k$,
|
||||
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
|
||||
Here, we assume that motions between frames are relatively small
|
||||
and that objects rotate at most 90 degrees in either direction along any axis,
|
||||
which is in general a safe assumption for image sequences from videos.
|
||||
which is in general a safe assumption for image sequences from videos,
|
||||
and enables us to obtain unique cosine values from the predicted sine values.
|
||||
All predictions are made in camera space, and translation and pivot predictions are in meters.
|
||||
We additionally predict softmax scores $o_t^k$ for classifying the objects into
|
||||
still and moving objects. As a postprocessing, for any object instance $k$ with predicted moving flag $o_t^k = 0$,
|
||||
we set $\sin(\alpha) = \sin(\beta) = \sin(\gamma) = 0$ and $t_t^k = (0,0,0)^T$,
|
||||
We additionally predict softmax scores $o_k$ for classifying the objects into
|
||||
still and moving objects. As a postprocessing, for any object instance $k$ with predicted moving flag $o_k = 0$,
|
||||
we set $\sin(\alpha) = \sin(\beta) = \sin(\gamma) = 0$ and $t_k = (0,0,0)^T$,
|
||||
and thus predict an identity motion.
|
||||
|
||||
|
||||
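Since the rotations are assumed to stay within $\pm 90$ degrees per axis, the cosines can be recovered from the clipped sine predictions as $\sqrt{1 - \sin^2}$. A minimal NumPy sketch of turning the nine predicted scalars into $(R_k, t_k, p_k)$, including the identity-motion postprocessing for still objects (names and shapes are assumptions for illustration; the thesis implementation is in TensorFlow):

```python
import numpy as np

def rotation_from_sines(sin_a, sin_b, sin_g):
    """Compose R_k = R^z(gamma) R^x(alpha) R^y(beta) from predicted sines,
    assuming |angle| <= 90 deg so that cos = sqrt(1 - sin^2) is unique."""
    sin_a, sin_b, sin_g = np.clip([sin_a, sin_b, sin_g], -1.0, 1.0)
    cos_a, cos_b, cos_g = np.sqrt(1.0 - np.array([sin_a, sin_b, sin_g]) ** 2)
    Rx = np.array([[1, 0, 0], [0, cos_a, -sin_a], [0, sin_a, cos_a]])
    Ry = np.array([[cos_b, 0, sin_b], [0, 1, 0], [-sin_b, 0, cos_b]])
    Rz = np.array([[cos_g, -sin_g, 0], [sin_g, cos_g, 0], [0, 0, 1]])
    return Rz @ Rx @ Ry

def instance_motion(pred, moving_score):
    """pred: 9 scalars (3 sines, translation t_k, pivot p_k).
    A still object (predicted moving flag 0) is mapped to the identity motion."""
    sines, t_k, p_k = pred[:3], pred[3:6], pred[6:9]
    if moving_score == 0:          # postprocessing for still objects
        sines, t_k = np.zeros(3), np.zeros(3)
    return rotation_from_sines(*sines), t_k, p_k
```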
\paragraph{Camera motion prediction}
|
||||
In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
|
||||
In addition to the object transformations, we optionally predict the camera motion $\{R_{cam}, t_{cam}\}\in \mathbf{SE}(3)$
|
||||
between the two frames $I_t$ and $I_{t+1}$.
|
||||
For this, we branch off a small fully-connected network from the bottleneck output of the backbone.
|
||||
We again represent $R_t^{cam}$ using an Euler angle representation and
|
||||
predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.
|
||||
Again, we predict a softmax score $o_t^{cam}$ for differentiating between
|
||||
We again represent $R_{cam}$ using an Euler angle representation and
|
||||
predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_{cam}$ in the same way as for the individual objects.
|
||||
Again, we predict a softmax score $o_{cam}$ for differentiating between
|
||||
a still and moving camera.
|
||||
|
||||
\subsection{Network design}
|
||||
@ -207,19 +208,19 @@ a still and moving camera.
|
||||
In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
|
||||
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
|
||||
feature resolution prior to RoI extraction would be reduced too much.
|
||||
Therefore, in our the variant without FPN, we first pass the $C_4$ features through $C_5$
|
||||
Therefore, in our variant without FPN, we first pass the $C_4$ features through $C_5$
|
||||
and $C_6$ blocks (with weights independent from the $C_5$ block used in the RoI head in this variant)
|
||||
to increase the bottleneck stride prior to the camera network to 64.
|
||||
to increase the bottleneck stride prior to the camera motion network to 64.
|
||||
In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}),
|
||||
the backbone makes use of all blocks through $C_6$ and
|
||||
we can simply branch of our camera network from the $C_6$ bottleneck.
|
||||
the backbone makes use of all blocks through $C_6$, and
|
||||
we can simply branch off our camera motion network from the $C_6$ bottleneck.
|
||||
Then, in both the ResNet and ResNet-FPN variants, we apply an additional
|
||||
convolution to the $C_6$ features to reduce the number of inputs to the following
|
||||
fully-connected layers.
|
||||
fully-connected layers, and thus keep the number of weights reasonably small.
|
||||
Instead of averaging, we use bilinear resizing to bring the convolutional features
|
||||
to a fixed size without losing all spatial information,
|
||||
flatten them, and finally apply multiple fully-connected layers to compute the
|
||||
camera motion prediction.
|
||||
flatten them, and finally apply multiple fully-connected layers to predict the
|
||||
camera motion parameters.
|
||||
|
||||
\paragraph{RoI motion head network}
|
||||
In both of our network variants
|
||||
@ -227,6 +228,15 @@ In both of our network variants
|
||||
we compute the fully-connected network for motion prediction from the
|
||||
flattened RoI features, which are also the basis for classification and
|
||||
bounding box refinement.
|
||||
Note that the features (extracted from the upsampled FPN stage appropriate to the RoI bounding box scales)
|
||||
passed to our ResNet-FPN RoI head went through the $C_6$
|
||||
bottleneck, which has a stride of 64 with respect to the original image.
|
||||
In contrast, the bottleneck for the features passed to our ResNet RoI head
|
||||
is $C_4$ (with a stride of 16). Thus, the ResNet-FPN variant can in principle estimate
|
||||
object motions based on larger displacements than the ResNet variant.
|
||||
Additionally, as smaller bounding boxes use higher resolution features, the
|
||||
motions and pivots of (especially smaller) objects can in principle be more accurately
|
||||
estimated with the FPN variant.
|
||||
|
||||
\subsection{Supervision}
|
||||
\label{ssec:supervision}
|
||||
@ -235,42 +245,41 @@ bounding box refinement.
|
||||
The most straightforward way to supervise the object motions is by using ground truth
|
||||
motions computed from ground truth object poses, which is in general
|
||||
only practical when training on synthetic datasets.
|
||||
Given the $k$-th foreground RoI, let $i_k$ be the index of the matched ground truth example with class $c_k^*$,
|
||||
let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k^*$
|
||||
and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
|
||||
Note that we dropped the subscript $t$ to increase readability.
|
||||
Given the $k$-th foreground RoI (as defined for Mask R-CNN) with ground truth class $c_k^*$,
|
||||
let $R_k, t_k, p_k, o_k$ be the predicted motion for class $c_k^*$ as parametrized above,
|
||||
and $R_k^*, t_k^*, p_k^*, o_k^*$ the ground truth motion for the matched ground truth example.
|
||||
Similar to the camera pose regression loss in \cite{PoseNet2},
|
||||
we use a variant of the $\ell_1$-loss to penalize the differences between ground truth and predicted
|
||||
rotation, translation (and pivot, in our case). We found that the smooth $\ell_1$-loss
|
||||
performs better in our case than the standard $\ell_1$-loss.
|
||||
We then compute the RoI motion loss as
|
||||
We thus compute the RoI motion loss as
|
||||
|
||||
\begin{equation}
|
||||
L_{motion} = \frac{1}{N_{RoI}^{fg}} \sum_k^{N_{RoI}} l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o^{gt,i_k} + l_o^k,
|
||||
L_{motion} = \frac{1}{N_{RoI}^{fg}} \sum_k^{N_{RoI}} l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o_k^* + l_o^k,
|
||||
\end{equation}
|
||||
where
|
||||
\begin{equation}
|
||||
l_{R}^k = \ell_{reg} (R^{gt,i_k} - R^{k,c_k}),
|
||||
l_{R}^k = \ell_{reg} (R_k^* - R_k),
|
||||
\end{equation}
|
||||
\begin{equation}
|
||||
l_{t}^k = \ell_{reg} (t^{gt,i_k} - t^{k,c_k}),
|
||||
l_{t}^k = \ell_{reg} (t_k^* - t_k),
|
||||
\end{equation}
|
||||
\begin{equation}
|
||||
l_{p}^k = \ell_{reg} (p^{gt,i_k} - p^{k,c_k}).
|
||||
l_{p}^k = \ell_{reg} (p_k^* - p_k).
|
||||
\end{equation}
|
||||
are the smooth $\ell_1$-loss terms for the predicted rotation, translation and pivot,
|
||||
are the smooth-$\ell_1$ losses for the predicted rotation, translation and pivot,
|
||||
respectively, and
|
||||
\begin{equation}
|
||||
l_o^k = \ell_{cls}(o_t^k, o^{gt,i_k}).
|
||||
l_o^k = \ell_{cls}(o_k, o_k^*).
|
||||
\end{equation}
|
||||
is the cross-entropy loss for the predicted classification into moving and non-moving objects.
|
||||
|
||||
Note that we do not penalize the rotation and translation for objects with
|
||||
$o^{gt,i_k} = 0$, which do not move between $t$ and $t+1$. We found that the network
|
||||
$o_k^* = 0$, which do not move between $t$ and $t+1$. We found that the network
|
||||
may not reliably predict exact identity motions for still objects, which is
|
||||
numerically more difficult to optimize than performing classification between
|
||||
moving and non-moving objects and discarding the regression for the non-moving
|
||||
ones. Also, analogous to masks and bounding boxes, the estimates for classes
|
||||
ones. Also, analogously to masks and bounding boxes, the estimates for classes
|
||||
other than $c_k^*$ are not penalized.
|
||||
|
||||
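Before composing the full RoI loss, the motion terms defined above can be summarized in a short sketch. The NumPy code below mirrors the gating of the rotation and translation terms by the ground truth moving flag $o_k^*$; shapes and names are assumptions for illustration, not the thesis' TensorFlow implementation:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss (ell_reg), applied elementwise and summed."""
    ax = np.abs(x)
    return float(np.sum(np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)))

def roi_motion_loss(R, t, p, o_logits, R_gt, t_gt, p_gt, o_gt, num_fg):
    """R, t, p: per-RoI motion predictions for the ground truth class;
    o_logits: moving/still logits; *_gt: matched ground truth; num_fg: N_RoI^fg."""
    total = 0.0
    for k in range(len(o_gt)):
        l_p = smooth_l1(p_gt[k] - p[k])
        l_R = smooth_l1(R_gt[k] - R[k])          # rotation parameters
        l_t = smooth_l1(t_gt[k] - t[k])
        probs = np.exp(o_logits[k]) / np.sum(np.exp(o_logits[k]))
        l_o = -np.log(probs[int(o_gt[k])])       # cross-entropy, moving vs. still
        # rotation/translation are only penalized for objects that actually move
        total += l_p + (l_R + l_t) * o_gt[k] + l_o
    return total / num_fg
```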
Now, our modified RoI loss is
|
||||
@ -281,10 +290,10 @@ L_{RoI} = L_{cls} + L_{box} + L_{mask} + L_{motion}.
|
||||
\paragraph{Camera motion supervision}
|
||||
We supervise the camera motion with ground truth analogously to the
|
||||
object motions, with the only difference being that we only have
|
||||
a rotation and translation, but no pivot term for the camera motion.
|
||||
a rotation and translation, but no pivot loss for the camera motion.
|
||||
If the ground truth shows that the camera is not moving, we again do not
|
||||
penalize rotation and translation. For the camera, the loss is reduced to the
|
||||
classification term in this case.
|
||||
penalize rotation and translation. In this case, the camera motion loss is reduced to the
|
||||
classification loss.
|
||||
|
||||
\paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth}
|
||||
\begin{figure}[t]
|
||||
@ -310,19 +319,20 @@ In this case, for any RoI,
|
||||
we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box
|
||||
with the same resolution as the predicted mask.
|
||||
We use the same bounding box
|
||||
to crop the corresponding region from the dense, full image depth map
|
||||
to crop the corresponding region from the dense, full-image depth map
|
||||
and bilinearly resize the depth crop to the same resolution as the mask and point
|
||||
grid.
|
||||
We then compute the optical flow at each of the grid points by creating
|
||||
a 3D point cloud from the point grid and depth crop. To this point cloud, we
|
||||
apply the RoI's predicted motion, masked by the predicted mask.
|
||||
Next, we create a 3D point cloud from the point grid and depth crop. To this point cloud, we
|
||||
apply the object motion predicted for the RoI, masked by the predicted mask.
|
||||
Then, we apply the camera motion to the points, project them back to 2D
|
||||
and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
|
||||
Note that we batch this computation over all RoIs, so that we only perform
|
||||
it once per forward pass. Figure \ref{figure:flow_loss} illustrates the approach.
|
||||
it once per forward pass.
|
||||
Figure \ref{figure:flow_loss} illustrates the approach.
|
||||
|
||||
The mathematical details for the 3D transformations and mappings between 2D and 3D are analogous to the
|
||||
dense, full image flow composition in the following subsection, so we will not
|
||||
include them here. The only differences are that there is no sum over objects during
|
||||
dense, full-image flow composition in the following subsection, so we will not
|
||||
duplicate them here. The only differences are that there is no sum over objects during
|
||||
the point transformation based on instance motion, as we consider the single object
|
||||
corresponding to an RoI in isolation, and that the masks are not resized to the
|
||||
full image resolution, as
|
||||
@ -333,21 +343,22 @@ For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion
|
||||
by penalizing the $m \times m$ optical flow grid.
|
||||
If there is optical flow ground truth available, we can use the RoI bounding box to
|
||||
crop and resize a region from the ground truth optical flow to match the RoI's
|
||||
optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss.
|
||||
optical flow grid and penalize the difference between the flow grids with a (smooth) $\ell_1$-loss.
|
||||
|
||||
However, we can also use the re-projection loss without optical flow ground truth
|
||||
to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}.
|
||||
In this case, we use the bounding box to crop and resize a corresponding region
|
||||
In this case, we can use the bounding box to crop and resize a corresponding region
|
||||
from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$
|
||||
using the 2D grid displaced with the predicted flow grid. Then, we can penalize the difference
|
||||
using the 2D grid displaced with the predicted flow grid (the latter is often called \emph{backward warping}).
|
||||
Then, we can penalize the difference
|
||||
between the resulting image crops, for example, with a census loss \cite{CensusTerm,UnFlow}.
|
||||
For more details on differentiable bilinear sampling for deep learning, we refer the reader to
|
||||
\cite{STN}.
|
||||
|
||||
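The backward warping mentioned above (sampling $I_{t+1}$ at the flow-displaced grid) reduces to differentiable bilinear sampling. A small, non-differentiable NumPy sketch of the sampling step for a single-channel image, with coordinates clamped to the image border (illustrative only):

```python
import numpy as np

def backward_warp(img, flow):
    """Sample img (H x W) at positions displaced by flow (H x W x 2, (u, v)),
    using bilinear interpolation; out-of-bounds samples are clamped."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bot = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bot
```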
When compared to supervision with motion ground truth, a re-projection
|
||||
When compared to supervision with 3D instance motion ground truth, a re-projection
|
||||
loss could benefit motion regression by removing any loss balancing issues between the
|
||||
rotation, translation and pivot terms \cite{PoseNet2},
|
||||
which can make it interesting even when 3D motion ground truth is available.
|
||||
rotation, translation and pivot losses \cite{PoseNet2},
|
||||
which could make it interesting even when 3D motion ground truth is available.
|
||||
|
||||
\subsection{Training and inference}
|
||||
\label{ssec:training_inference}
|
||||
@ -368,7 +379,7 @@ highest scoring class.
|
||||
|
||||
\subsection{Dense flow from motion}
|
||||
\label{ssec:postprocessing}
|
||||
As a postprocessing, we compose a dense optical flow map from the outputs of our Motion R-CNN network.
|
||||
As a postprocessing, we compose the dense optical flow between $I_t$ and $I_{t+1}$ from the outputs of our Motion R-CNN network.
|
||||
Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$,
|
||||
where
|
||||
\begin{equation}
|
||||
@ -383,33 +394,34 @@ x_t - c_0 \\ y_t - c_1 \\ f
|
||||
\end{pmatrix},
|
||||
\end{equation}
|
||||
is the 3D coordinate at $t$ corresponding to the point with pixel coordinates $x_t, y_t$,
|
||||
which range over all coordinates in $I_t$.
|
||||
which range over all coordinates in $I_t$,
|
||||
and $(c_0, c_1, f)$ are the camera intrinsics.
|
||||
For now, the depth map is always assumed to come from ground truth.
|
||||
|
||||
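To make the point cloud construction concrete, the backprojection can be sketched as follows, where depth stands for the ground truth depth map $d_t$ and $(c_0, c_1, f)$ are the intrinsics from the equation above (NumPy; names are illustrative):

```python
import numpy as np

def backproject(depth, c0, c1, f):
    """Create the camera-space point cloud P_t from a dense depth map d_t:
    each pixel (x_t, y_t) is mapped to (d/f) * (x_t - c0, y_t - c1, f)."""
    H, W = depth.shape
    y, x = np.mgrid[0:H, 0:W].astype(np.float64)
    X = (x - c0) * depth / f
    Y = (y - c1) * depth / f
    Z = depth.astype(np.float64)
    return np.stack([X, Y, Z], axis=-1)   # H x W x 3 point cloud
```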
Given $k$ detections with predicted motions as above, we transform all points within the bounding
|
||||
box of a detected object according to the predicted motion of the object.
|
||||
|
||||
We first define the \emph{full image} mask $M_t^k$ for object k,
|
||||
which can be computed from the predicted box mask $m_t^k$ by bilinearly resizing
|
||||
$m_t^k$ to the width and height of the predicted bounding box and then copying the values
|
||||
We first define the \emph{full image} mask $M_k$ for object k,
|
||||
which can be computed from the predicted box mask $m_k$ (for the predicted class) by bilinearly resizing
|
||||
it to the width and height of the predicted bounding box and then copying the values
|
||||
of the resized mask into a full resolution mask initialized with zeros,
|
||||
starting at the top-left coordinate of the predicted bounding box.
|
||||
Then, given the predicted motions $(R_t^k, t_t^k)$ as well as $p_t^k$ for all objects,
|
||||
Then, given the predicted motions $(R_k, t_k)$, as well as $p_k$ for all objects,
|
||||
\begin{equation}
|
||||
P'_{t+1} =
|
||||
P_t + \sum_1^{k} M_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
|
||||
P_t + \sum_1^{k} M_k\left\{ R_k \cdot (P_t - p_k) + p_k + t_k - P_t \right\}
|
||||
\end{equation}
|
||||
These motion predictions are understood to have already taken into account
|
||||
the classification into moving and still objects,
|
||||
and we thus, as described above, have identity motions for all objects with $o_t^k = 0$.
|
||||
and we thus, as described above, have identity motions for all objects with $o_k = 0$.
|
||||
|
||||
Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$,
|
||||
Next, we transform all points given the camera transformation $\{R_{cam}, t_{cam}\} \in \mathbf{SE}(3)$,
|
||||
|
||||
\begin{equation}
|
||||
\begin{pmatrix}
|
||||
X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
|
||||
\end{pmatrix}
|
||||
= P_{t+1} = R_t^c \cdot P'_{t+1} + t_t^c
|
||||
= P_{t+1} = R_{cam} \cdot P'_{t+1} + t_{cam}
|
||||
\end{equation}.
|
||||
|
||||
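Taken together, the flow composition of this subsection amounts to the following NumPy sketch, where masks are the full-image masks $M_k$, motions the per-object $(R_k, t_k, p_k)$, and cam_motion the pair $(R_{cam}, t_{cam})$; ground truth depth and intrinsics are assumed, and this is an illustrative condensation rather than the thesis code:

```python
import numpy as np

def compose_flow(depth, intrinsics, masks, motions, cam_motion):
    """Compose the dense optical flow field (H x W x 2) from per-object motions,
    full-image masks, the camera motion, and the depth map of frame t."""
    c0, c1, f = intrinsics
    H, W = depth.shape
    y, x = np.mgrid[0:H, 0:W].astype(np.float64)
    P_t = np.stack([(x - c0) * depth / f, (y - c1) * depth / f, depth], axis=-1)

    # Apply each instance motion inside its full-image mask (pivot-centred).
    P = P_t.copy()
    for M, (R, t, p) in zip(masks, motions):
        moved = (P_t - p) @ R.T + p + t
        P += M[..., None] * (moved - P_t)

    # Apply the camera motion to all points and project back to the image plane.
    R_cam, t_cam = cam_motion
    P = P @ R_cam.T + t_cam
    x1 = f * P[..., 0] / P[..., 2] + c0
    y1 = f * P[..., 1] / P[..., 2] + c1
    return np.stack([x1 - x, y1 - y], axis=-1)
```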
Note that in our experiments, we either use the ground truth camera motion to focus
|
||||
|
||||
background.tex (122)
@ -8,10 +8,10 @@ The optical flow
|
||||
$\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$
|
||||
maps pixel coordinates in the first frame $I_t$ to pixel coordinates of the
|
||||
visually corresponding pixel in the second frame $I_{t+1}$,
|
||||
and can be interpreted as the apparent movement of brigthness patterns between the two frames.
|
||||
and can be interpreted as the apparent movement of brightness patterns between the two frames.
|
||||
Optical flow can be regarded as two-dimensional motion estimation.
|
||||
|
||||
Scene flow is the generalization of optical flow to 3-dimensional space and
|
||||
Scene flow is the generalization of optical flow to three-dimensional space and additionally
|
||||
requires estimating depth for each pixel. Generally, stereo input is used for scene flow
|
||||
to estimate disparity-based depth; however, monocular depth estimation with deep networks is also becoming
|
||||
popular \cite{DeeperDepth, UnsupPoseDepth}.
|
||||
@ -47,7 +47,7 @@ flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\caption {
|
||||
FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
|
||||
Overview of the FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
|
||||
are used for refinement.
|
||||
}
|
||||
\label{table:flownets}
|
||||
@ -70,21 +70,22 @@ performing upsampling of the compressed features and resulting in a encoder-deco
|
||||
The most popular deep networks of this kind for end-to-end optical flow prediction
|
||||
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
|
||||
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
|
||||
Table \ref{table:flownets} shows the classical FlowNetS architecture for optical flow prediction.
|
||||
Table \ref{table:flownets} gives an overview of the classical FlowNetS architecture for optical flow prediction.
|
||||
|
||||
Note that the network itself is a rather generic autoencoder and is specialized for optical flow only through being trained
|
||||
with supervision from dense optical flow ground truth.
|
||||
Potentially, the same network could also be used for semantic segmentation if
|
||||
the number of output final and intermediate output channels was adapted from two to the number of classes.\
|
||||
the number of final and intermediate output channels was adapted from two to the number of classes.
|
||||
Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching arguably well,
|
||||
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
|
||||
Note that the maximum displacement that can be correctly estimated depends on the number of 2D convolution strides or pooling
|
||||
Note that the maximum displacement that can be correctly estimated depends on the number of strided 2D convolutions (and the stride they use) and pooling
|
||||
operations in the encoder.
|
||||
Recently, other, similarly generic,
|
||||
encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
|
||||
encoder-decoder CNNs have been applied to optical flow prediction as well \cite{DenseNetDenseFlow}.
|
||||
|
||||
\subsection{SfM-Net}
|
||||
Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture.
|
||||
Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture we described
|
||||
in the introduction.
|
||||
Motions and full-image masks for a fixed number N$_{motions}$ of independent objects
|
||||
are predicted in addition to a depth map, and an unsupervised re-projection loss based on
|
||||
image brightness differences penalizes the predictions.
|
||||
@ -103,7 +104,7 @@ image brightness differences penalizes the predictions.
|
||||
& input images $I_t$ and $I_{t+1}$ & H $\times$ W $\times$ 6 \\
|
||||
& Conv-Deconv & H $\times$ W $\times$ 32 \\
|
||||
masks & 1 $\times$1 conv, N$_{motions}$ & H $\times$ W $\times$ N$_{motions}$ \\
|
||||
FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}$ $\times$ 2 & H $\times$ W $\times$ 32 \\
|
||||
FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}$ $\times$ 2 & 1 $\times$ 512 \\
|
||||
object motions & fully connected, $N_{motions} \cdot$ 9 & H $\times$ W $\times$ $N_{motions} \cdot$ 9 \\
|
||||
camera motion & From FC: $\times$ 2 & H $\times$ W $\times$ 6 \\
|
||||
\midrule
|
||||
@ -118,7 +119,7 @@ depth & 1 $\times$1 conv, 1 & H $\times$ W $\times$ 1 \\
|
||||
\end{tabular}
|
||||
|
||||
\caption {
|
||||
SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully convolutional
|
||||
SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully-convolutional
|
||||
encoder-decoder network, where convolutions and deconvolutions with stride 2 are
|
||||
used for downsampling and upsampling, respectively. The stride at the bottleneck
|
||||
with respect to the input image is 32.
|
||||
@ -147,7 +148,7 @@ Note that for the Mask R-CNN architectures we describe below, this is equivalent
|
||||
to the standard ResNet-50 backbone. We now introduce one small extension that
|
||||
will be useful for our Motion R-CNN network.
|
||||
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
|
||||
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
|
||||
input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64.
|
||||
For accurately estimating motions corresponding to larger pixel displacements, a larger
|
||||
stride may be important.
|
||||
Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
|
||||
@ -166,9 +167,9 @@ to increase the bottleneck stride to 64, following FlowNetS.
|
||||
\multicolumn{3}{c}{\textbf{ResNet}}\\
|
||||
\midrule
|
||||
C$_1$ & 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\
|
||||
|
||||
\midrule
|
||||
& 3 $\times$ 3 max pool, stride 2 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 64 \\
|
||||
|
||||
\midrule
|
||||
C$_2$ &
|
||||
$\begin{bmatrix}
|
||||
1 \times 1, 64 \\
|
||||
@ -242,8 +243,8 @@ most popular deep networks for object detection, and have recently also been app
|
||||
\paragraph{R-CNN}
|
||||
Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN
|
||||
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
|
||||
For each of the region proposals, the input image is cropped using the regions bounding box and the crop is
|
||||
passed through a CNN, which performs classification of the object (or non-object, if the region shows background).
|
||||
For each of the region proposals, the input image is cropped using the region bounding box and the crop is
|
||||
passed through the CNN, which performs classification of the object (or non-object, if the region shows background).
|
||||
|
||||
\paragraph{Fast R-CNN}
|
||||
The original R-CNN involves computing one forward pass of the CNN for each of the region proposals,
|
||||
@ -256,8 +257,8 @@ The extracted per-RoI (region of interest) feature maps are collected into a bat
|
||||
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
|
||||
The extraction technique is called \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the full image features
|
||||
is divided into an H $\times$ W grid of cells. For each cell, the values of the underlying
|
||||
full image feature map are max-pooled to yield the output value at this cell.
|
||||
Thus, given region proposals, the per-region computation is reduced to a single pass through the complete network,
|
||||
full-image feature map are max-pooled to yield the output value at the cell.
|
||||
Thus, given region proposals, all computation is reduced to a single pass through the complete network,
|
||||
speeding up the system by two orders of magnitude at inference time and one order of magnitude
|
||||
at training time.
|
||||
|
||||
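As a sketch of the RoI pooling operation described above, the following NumPy function divides an RoI window into a fixed grid of cells and max-pools each cell; it assumes the box lies inside the feature map and uses simple quantized bin boundaries (illustrative, not the original Fast R-CNN code):

```python
import numpy as np

def roi_pool(features, box, out_h=7, out_w=7):
    """features: H x W x C feature map; box: (y0, x0, y1, x1) in feature coordinates.
    Divide the RoI window into out_h x out_w cells and max-pool each cell."""
    y0, x0, y1, x1 = [int(round(v)) for v in box]
    out = np.zeros((out_h, out_w, features.shape[-1]), dtype=features.dtype)
    ys = np.linspace(y0, y1 + 1, out_h + 1).astype(int)
    xs = np.linspace(x0, x1 + 1, out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            cell = features[ys[i]:max(ys[i + 1], ys[i] + 1),
                            xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.reshape(-1, features.shape[-1]).max(axis=0)
    return out
```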
@ -297,15 +298,15 @@ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
|
||||
\midrule
|
||||
M$_0$ & From R$_1$: 2 $\times$ 2 deconv, 256, stride 2 & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\
|
||||
& 1 $\times$ 1 conv, N$_{cls}$ & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ N$_{cls}$ \\
|
||||
masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
|
||||
masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
|
||||
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\caption {
|
||||
Mask R-CNN \cite{MaskRCNN} ResNet \cite{ResNet} architecture.
|
||||
Note that this is equivalent to the Faster R-CNN ResNet architecture if the mask
|
||||
Mask R-CNN \cite{MaskRCNN} ResNet-50 \cite{ResNet} architecture.
|
||||
Note that this is equivalent to the Faster R-CNN ResNet-50 architecture if the mask
|
||||
head is left out. In Mask R-CNN, bilinear sampling is used for RoI extraction,
|
||||
whereas Faster R-CNN used RoI pooling.
|
||||
whereas Faster R-CNN uses RoI pooling.
|
||||
}
|
||||
\label{table:maskrcnn_resnet}
|
||||
\end{table}
|
||||
@ -317,17 +318,17 @@ After streamlining the CNN components, Fast R-CNN is limited by the speed of the
|
||||
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
|
||||
processing time.
|
||||
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
|
||||
classification into a single deep network, leading to faster processing when compared to Fast R-CNN
|
||||
classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN
|
||||
and again, improved accuracy.
|
||||
This unified network operates in two stages.
|
||||
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
|
||||
which is a deep feature encoder CNN with the original image as input.
|
||||
Next, the \emph{backbone} output features are passed into a small, fully convolutional \emph{Region Proposal Network (RPN)} head, which
|
||||
Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
|
||||
predicts objectness scores and regresses bounding boxes at each of its output positions.
|
||||
At any of the $h \times w$ output positions of the RPN head,
|
||||
$N_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $N_a$ \emph{anchors} with different
|
||||
aspect ratios and scales. Thus, there are $N_a \times h \times w$ reference anchors in total.
|
||||
In Faster R-CNN, $N_a = 9$, with 3 scales corresponding
|
||||
In Faster R-CNN, $N_a = 9$, with 3 scales, corresponding
|
||||
to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios,
|
||||
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
|
||||
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).
|
||||
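For reference, a minimal NumPy sketch of generating such an anchor grid (3 areas $\times$ 3 aspect ratios at stride 16); the height/width convention used for the aspect ratios is an assumption of this sketch:

```python
import numpy as np

def generate_anchors(h, w, stride=16, areas=(128**2, 256**2, 512**2),
                     ratios=(0.5, 1.0, 2.0)):
    """Return (h * w * 9) x 4 anchor boxes as (cx, cy, width, height)."""
    shapes = []
    for area in areas:
        for r in ratios:                     # r = height / width
            anchor_w = np.sqrt(area / r)
            shapes.append((anchor_w, anchor_w * r))
    cy, cx = np.mgrid[0:h, 0:w].astype(np.float64) * stride + stride / 2
    anchors = []
    for aw, ah in shapes:
        boxes = np.stack([cx, cy, np.full_like(cx, aw), np.full_like(cy, ah)], -1)
        anchors.append(boxes.reshape(-1, 4))
    return np.concatenate(anchors, axis=0)

print(generate_anchors(2, 3).shape)  # (54, 4) = 2 * 3 * 9 reference anchors
```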
@ -337,8 +338,12 @@ The region proposals can then be obtained as the N highest scoring RPN predictio
|
||||
|
||||
Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
|
||||
and bounding box refinement for each of the region proposals, which are now obtained
|
||||
from the RPN instead of being pre-computed by some external algorithm.
|
||||
As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals.
|
||||
from the RPN instead of being pre-computed by an external algorithm.
|
||||
As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals,
|
||||
and the refined bounding boxes are predicted separately for each object class.
|
||||
|
||||
Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
|
||||
(here, the mask head is ignored).
|
||||
|
||||
\paragraph{Mask R-CNN}
|
||||
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
|
||||
@ -346,18 +351,20 @@ However, it can be helpful to know class and object (instance) membership of all
|
||||
which generally involves computing a binary mask for each object instance specifying which pixels belong
|
||||
to that object. This problem is called \emph{instance segmentation}.
|
||||
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
|
||||
fixed resolution instance masks within the bounding boxes of each detected object.
|
||||
fixed resolution instance masks within the bounding boxes of each detected object,
|
||||
which are then bilinearly resized to fit inside the respective bounding boxes.
|
||||
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
|
||||
compute a pixel-precise binary mask for each instance.
|
||||
The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
|
||||
Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
|
||||
competition between classes for the mask prediction branch.
|
||||
competition between classes in the mask prediction branch.
|
||||
|
||||
One important additional technical aspect of Mask R-CNN is the replacement of RoI pooling with
|
||||
Additionally, an important technical aspect of Mask R-CNN is the replacement of RoI pooling with
|
||||
bilinear sampling for extracting the RoI features, which is much more precise.
|
||||
In the original RoI pooling from Fast R-CNN, the bins for max-pooling are not aligned with the actual pixel
|
||||
boundary of the bounding box, and thus some detail is lost.
|
||||
|
||||
The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
|
||||
|
||||
{
|
||||
\begin{table}[h]
|
||||
\centering
|
||||
@ -367,7 +374,7 @@ boundary of the bounding box, and thus some detail is lost.
|
||||
\midrule\midrule
|
||||
& input image & H $\times$ W $\times$ C \\
|
||||
\midrule
|
||||
C$_5$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
|
||||
C$_5$ & ResNet \{up to C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
|
||||
\midrule
|
||||
\multicolumn{3}{c}{\textbf{Feature Pyramid Network (FPN)}}\\
|
||||
\midrule
|
||||
@ -403,11 +410,11 @@ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
|
||||
M$_1$ & From R$_2$: $\begin{bmatrix}\textrm{3 $\times$ 3 conv} \end{bmatrix}$ $\times$ 4, 256 & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\
|
||||
& 2 $\times$ 2 deconv, 256, stride 2 & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ 256 \\
|
||||
& 1 $\times$ 1 conv, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
|
||||
masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
|
||||
masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
\caption {
|
||||
Mask R-CNN \cite{MaskRCNN} ResNet-FPN \cite{ResNet} architecture.
|
||||
Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
|
||||
Operations enclosed in a []$_p$ block make up a single FPN
|
||||
block (see Figure \ref{figure:fpn_block}).
|
||||
}
|
||||
@ -416,28 +423,29 @@ block (see Figure \ref{figure:fpn_block}).
|
||||
}
|
||||
|
||||
\paragraph{Feature Pyramid Networks}
|
||||
In Faster R-CNN, a single feature map is used as a source of all RoIs, independent
|
||||
of the size of the bounding box of the RoI.
|
||||
However, for small objects, the C$_4$ (see Table \ref{table:maskrcnn_resnet}) features
|
||||
might have lost too much spatial information to properly predict the exact bounding
|
||||
box and a high resolution mask. Likewise, for very big objects, the fixed size
|
||||
RoI window might be too small to cover the region of the feature map containing
|
||||
information for this object.
|
||||
In Faster R-CNN, a single feature map is used as the source of all RoI features during RoI extraction, independent
|
||||
of the size of the bounding box of each RoI.
|
||||
However, for small objects, the C$_4$ (see Table \ref{table:resnet}) features
|
||||
might have lost too much spatial information to allow properly predicting the exact bounding
|
||||
box and a high resolution mask.
|
||||
As a solution to this, the Feature Pyramid Network (FPN) \cite{FPN} enables features
|
||||
of an appropriate scale to be used, depending on the size of the bounding box.
|
||||
of an appropriate scale to be used for RoI extraction, depending on the size of the bounding box of an RoI.
|
||||
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
|
||||
encoder by combining bilinear upsampled feature maps coming from the bottleneck
|
||||
with lateral skip connections from the encoder.
|
||||
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
|
||||
encoder by combining bilinearly upsampled feature maps coming from the bottleneck
|
||||
with lateral skip connections from the encoder (Figure~\ref{figure:fpn_block}).
|
||||
For each consecutive upsampling block, the lateral skip connections are taken from
|
||||
the encoder block with the same output resolution as the upsampled features coming
|
||||
from the bottleneck.
|
||||
|
||||
Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
|
||||
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
|
||||
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$ (see Table \ref{table:maskrcnn_resnet_fpn}).
|
||||
At each output position of the resulting RPN pyramid, bounding boxes are predicted
|
||||
with respect to 3 anchor aspect ratios $\{1:2, 1:1, 2:1\}$ and a single scale ($N_a = 3$).
|
||||
For P$_2$, P$_3$, P$_4$, P$_5$, P$_6$,
|
||||
the scale corresponds to anchor bounding boxes of areas $32^2, 64^2, 128^2, 256^2, 512^2$,
|
||||
respectively.
|
||||
Note that there is no need for multiple anchor scales per anchor position anymore,
|
||||
as the RPN heads themselves correspond to multiple scales.
|
||||
as the RPN heads themselves correspond to different scales.
|
||||
Now, in the RPN, higher resolution feature maps can be used for regressing smaller
|
||||
bounding boxes. For example, boxes of area close to $32^2$ are predicted using P$_2$,
|
||||
which has a stride of $4$ with respect to the input image.
|
||||
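A simplified sketch of routing an RoI to a pyramid level by its scale: here the level is chosen whose anchor area ($32^2$ for P$_2$ up to $512^2$ for P$_6$) is closest to the box area in log scale. This only illustrates the idea and is not necessarily the exact assignment rule used by FPN or by the thesis code:

```python
import numpy as np

def assign_fpn_level(box_h, box_w):
    """Pick the pyramid level P2..P6 whose anchor area is closest (in log scale)
    to the RoI area; smaller boxes are routed to higher-resolution levels."""
    levels = [2, 3, 4, 5, 6]
    anchor_areas = np.array([32.0, 64.0, 128.0, 256.0, 512.0]) ** 2
    area = float(box_h) * float(box_w)
    idx = int(np.argmin(np.abs(np.log2(anchor_areas) - np.log2(max(area, 1.0)))))
    return levels[idx]

print(assign_fpn_level(40, 30))    # small box  -> 2 (P2)
print(assign_fpn_level(500, 600))  # large box  -> 6 (P6)
```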
@ -463,6 +471,8 @@ as some anchor to the exact same pyramid level from which the RPN of this
|
||||
anchor is computed. Now, for example, the smallest boxes are cropped from $P_2$,
|
||||
which is the highest resolution feature map.
|
||||
|
||||
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
|
||||
|
||||
|
||||
\begin{figure}[t]
|
||||
\centering
|
||||
@ -506,7 +516,7 @@ All bounding boxes predicted by the RoI head or RPN are estimated as offsets
|
||||
with respect to a reference bounding box. In the case of the RPN,
|
||||
the reference bounding box is one of the anchors, and refined bounding boxes from the RoI head are
|
||||
predicted relative to the RPN output bounding boxes.
|
||||
Let $(x, y, w, h)$ be the top left coordinates, height and width of the bounding box
|
||||
Let $(x, y, w, h)$ be the top left coordinates, width, and height of the bounding box
|
||||
to be predicted. Likewise, let $(x^*, y^*, w^*, h^*)$ be the ground truth bounding
|
||||
box and let $(x_r, y_r, w_r, h_r)$ be the reference bounding box.
|
||||
The ground truth \emph{box encoding} $b_e^*$ is then defined as
|
||||
@ -561,7 +571,7 @@ w = \exp(b_w) \cdot w_r,
|
||||
h = \exp(b_h) \cdot h_r,
|
||||
\end{equation*}
|
||||
and thus the bounding box is obtained as the reference bounding box adjusted by
|
||||
the predicted relative offsets and scales.
|
||||
the predicted relative offsets and scales encoded in $b_e$.
|
||||
|
||||
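The box encoding and its inverse can be written in a few lines. The decoding follows the equations shown above ($w = \exp(b_w) \cdot w_r$, etc.); the encoding is written as its inverse with offsets normalized by the reference size, which is the usual convention and an assumption here, since the full encoding definition is not repeated in this hunk:

```python
import numpy as np

def decode_box(b, ref):
    """b = (b_x, b_y, b_w, b_h) predicted encoding; ref = (x_r, y_r, w_r, h_r)."""
    b_x, b_y, b_w, b_h = b
    x_r, y_r, w_r, h_r = ref
    return (x_r + b_x * w_r,          # offsets are scaled by the reference size
            y_r + b_y * h_r,
            np.exp(b_w) * w_r,        # sizes are predicted as log-scale factors
            np.exp(b_h) * h_r)

def encode_box(gt, ref):
    """Inverse mapping: encoding of a ground truth box against a reference box."""
    x, y, w, h = gt
    x_r, y_r, w_r, h_r = ref
    return ((x - x_r) / w_r, (y - y_r) / h_r, np.log(w / w_r), np.log(h / h_r))
```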
\paragraph{Supervision of the RPN}
|
||||
A positive RPN proposal is defined as one with an IoU of at least $0.7$ with
|
||||
@ -571,7 +581,7 @@ with at most $50\%$ positive examples (if there are less positive examples,
|
||||
more negative examples are used instead).
|
||||
For examples selected in this way, a regression loss is computed between
|
||||
predicted and ground truth bounding box encoding, and a classification loss
|
||||
is computed for the predicted objectness.
|
||||
is computed for the predicted objectness scores.
|
||||
Specifically, let $s_i^* = 1$ if proposal $i$ is positive and $s_i^* = 0$ if
|
||||
it is negative, let $s_i$ be the predicted objectness score and $b_i$, $b_i^*$ the
|
||||
predicted and ground truth bounding box encodings.
|
||||
@ -588,7 +598,7 @@ L_{box}^{RPN} = \frac{1}{N_{RPN}^{pos}} \sum_{i=1}^{N_{RPN}} s_i^* \cdot \ell_{r
|
||||
\end{equation}
|
||||
and
|
||||
\begin{equation}
|
||||
N_{RPN}^{pos} = \sum_{i=1}^{N_{pos}} s_i^*
|
||||
N_{RPN}^{pos} = \sum_{i=1}^{N_{RPN}} s_i^*
|
||||
\end{equation}
|
||||
is the number of positive examples. Note that the bounding box loss is only
|
||||
active for positive examples, and that the classification loss is computed
|
||||
@ -648,14 +658,14 @@ During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring regio
|
||||
from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
|
||||
and passed through the RoI bounding box refinement and classification heads
|
||||
(but not through the mask head).
|
||||
After this, non-maximum suppression (NMS) is applied to predicted RoIs with predicted non-background class,
|
||||
with a maximum IoU of 0.7.
|
||||
Then, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
|
||||
after again extracting the corresponding features.
|
||||
After this, non-maximum suppression (NMS) is applied to predicted RoIs for which the predicted class is not the background class,
|
||||
with a maximum IoU of 0.7 of the refined boxes.
|
||||
Finally, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
|
||||
after extracting the corresponding features again.
|
||||
Thus, during inference, the features for the mask head are extracted using the refined
|
||||
bounding boxes for the predicted class, instead of the RPN bounding boxes. This is important for not
|
||||
introducing any misalignment, as we want to create the instance mask inside of the
|
||||
more precise, refined detection bounding boxes.
|
||||
introducing any misalignment, as the instance masks are to be created inside of the
|
||||
final, more precise, refined detection bounding boxes.
|
||||
Furthermore, note that bounding box and mask predictions for all classes but the predicted
|
||||
class (the highest scoring class) are discarded, and thus the output bounding
|
||||
box and mask correspond to the highest scoring class.
|
||||
|
||||
@ -18,6 +18,12 @@ of our network is highly interpretable, which may also bring benefits for safety
|
||||
applications.
|
||||
|
||||
\subsection{Future Work}
|
||||
\paragraph{Training on all Virtual KITTI sequences}
|
||||
We only trained our models on the \emph{clone} variants of the Virtual KITTI sequences
|
||||
to make training faster.
|
||||
In the future, it would be interesting to train on all variants, as the different
|
||||
lighting conditions and angles should lead to a more general model.
|
||||
|
||||
\paragraph{Evaluation and finetuning on KITTI 2015}
|
||||
Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
|
||||
on which we do not train, but we have yet to evaluate on a real world dataset.
|
||||
@ -138,19 +144,19 @@ In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
|
||||
into our architecture, we could enable temporally consistent motion estimation
|
||||
from image sequences of arbitrary length.
|
||||
|
||||
\paragraph{Masking prior to the RoI motion head}
|
||||
Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
|
||||
the backbone are integrated over the complete RoI window to yield the features
|
||||
for motion estimation.
|
||||
For example, average pooling is applied before the fully-connected layers in the variant without FPN.
|
||||
However, ideally, the motion (image matching) information from the backbone should
|
||||
|
||||
For example, consider
|
||||
|
||||
Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
|
||||
extracted RoI features before passing them into the motion head.
|
||||
The intuition behind that is that we want to mask out (set to zero) any positions in the
|
||||
extracted feature window which belong to the background. Then, the RoI motion
|
||||
head could aggregate the motion (image matching) information from the backbone
|
||||
over positions localized within the object only, but not over positions belonging
|
||||
to the background, which should probably not influence the final object motion estimate.
|
||||
% \paragraph{Masking prior to the RoI motion head}
|
||||
% Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
|
||||
% the backbone are integrated over the complete RoI window to yield the features
|
||||
% for motion estimation.
|
||||
% For example, average pooling is applied before the fully-connected layers in the variant without FPN.
|
||||
% However, ideally, the motion (image matching) information from the backbone should
|
||||
%
|
||||
% For example, consider
|
||||
%
|
||||
% Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
|
||||
% extracted RoI features before passing them into the motion head.
|
||||
% The intuition behind that is that we want to mask out (set to zero) any positions in the
|
||||
% extracted feature window which belong to the background. Then, the RoI motion
|
||||
% head could aggregate the motion (image matching) information from the backbone
|
||||
% over positions localized within the object only, but not over positions belonging
|
||||
% to the background, which should probably not influence the final object motion estimate.
|
||||
|
||||
@ -1,6 +1,6 @@
|
||||
\subsection{Implementation}
|
||||
Our networks and loss functions are implemented using built-in TensorFlow \cite{TensorFlow}
|
||||
functions, enabling us to use automatic differentiation for all gradient
|
||||
Our networks and loss functions are implemented using built-in TensorFlow
|
||||
functions \cite{TensorFlow}, enabling us to use automatic differentiation for all gradient
|
||||
computations. To make our code easy to extend and flexible, we build on
|
||||
the TensorFlow Object detection API \cite{TensorFlowObjectDetection}, which provides a Faster R-CNN baseline
|
||||
implementation.
|
||||
@ -49,18 +49,18 @@ let $[R_t^{ex}|t_t^{ex}]$
|
||||
and $[R_{t+1}^{ex}|t_{t+1}^{ex}]$
|
||||
be the camera extrinsics at the two frames.
|
||||
We compute the ground truth camera motion
|
||||
$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in \mathbf{SE}(3)$ as
|
||||
$\{R_{cam}^*, t_{cam}^*\} \in \mathbf{SE}(3)$ as
|
||||
|
||||
\begin{equation}
|
||||
R_{t}^{gt, cam} = R_{t+1}^{ex} \cdot \mathrm{inv}(R_t^{ex}),
|
||||
R_{cam}^* = R_{t+1}^{ex} \cdot \mathrm{inv}(R_t^{ex}),
|
||||
\end{equation}
|
||||
\begin{equation}
|
||||
t_{t}^{gt, cam} = t_{t+1}^{ex} - R_{t}^{ex} \cdot t_t^{ex}.
|
||||
t_{cam}^* = t_{t+1}^{ex} - R_{cam}^* \cdot t_t^{ex}.
|
||||
\end{equation}
|
||||
|
||||
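The two equations above translate directly into code. A NumPy sketch of the ground truth camera motion computed from the extrinsics of the two frames, including the moving/still flag defined next (variable names are illustrative):

```python
import numpy as np

def camera_motion_gt(R_ex_t, t_ex_t, R_ex_t1, t_ex_t1):
    """R_cam* = R_ex_{t+1} inv(R_ex_t),  t_cam* = t_ex_{t+1} - R_cam* t_ex_t."""
    R_cam = R_ex_t1 @ np.linalg.inv(R_ex_t)
    t_cam = t_ex_t1 - R_cam @ t_ex_t
    # o_cam* = 1 if the camera pose actually changes between t and t+1
    o_cam = int(not (np.allclose(R_cam, np.eye(3)) and np.allclose(t_cam, 0.0)))
    return R_cam, t_cam, o_cam
```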
Additionally, we define $o_t^{gt, cam} \in \{ 0, 1 \}$,
|
||||
Additionally, we define $o_{cam}^* \in \{ 0, 1 \}$,
|
||||
\begin{equation}
|
||||
o_t^{gt, cam} =
|
||||
o_{cam}^* =
|
||||
\begin{cases}
|
||||
1 &\text{if the camera pose changes between $t$ and $t+1$} \\
|
||||
0 &\text{otherwise,}
|
||||
@ -75,25 +75,25 @@ at $I_t$ and $I_{t+1}$.
|
||||
Note that the pose at $t$ is given with respect to the camera at $t$ and
|
||||
the pose at $t+1$ is given with respect to the camera at $t+1$.
|
||||
|
||||
We define the ground truth pivot $p_{t}^{gt, i} \in \mathbb{R}^3$ as
|
||||
We define the ground truth pivot $p_k^* \in \mathbb{R}^3$ as
|
||||
|
||||
\begin{equation}
|
||||
p_{t}^{gt, i} = t_t^i
|
||||
p_k^* = t_t^i
|
||||
\end{equation}
|
||||
|
||||
and compute the ground truth object motion
|
||||
$\{R_t^{gt, i}, t_t^{gt, i}\} \in \mathbf{SE}(3)$ as
|
||||
$\{R_k^*, t_k^*\} \in \mathbf{SE}(3)$ as
|
||||
|
||||
\begin{equation}
|
||||
R_{t}^{gt, i} = \mathrm{inv}(R_t^{gt, cam}) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i),
|
||||
R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i),
|
||||
\end{equation}
|
||||
\begin{equation}
|
||||
t_{t}^{gt, i} = t_{t+1}^{i} - R_t^{gt, cam} \cdot t_t.
|
||||
t_k^* = t_{t+1}^{i} - R_k^* \cdot t_t^i.
|
||||
\end{equation}
|
||||
|
||||
As for the camera, we define $o_t^{gt, i} \in \{ 0, 1 \}$,
|
||||
As for the camera, we define $o_k^* \in \{ 0, 1 \}$,
|
||||
\begin{equation}
|
||||
o_t^{gt, i} =
|
||||
o_k^* =
|
||||
\begin{cases}
|
||||
1 &\text{if the position of object i changes between $t$ and $t+1$} \\
|
||||
0 &\text{otherwise,}
|
||||
@ -105,21 +105,19 @@ which specifies whether an object is moving in between the frames.
|
||||
To evaluate the 3D instance and camera motions on the Virtual KITTI validation
|
||||
set, we introduce a few error metrics.
|
||||
Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
|
||||
let $i_k$ be the index of the best matching ground truth example,
|
||||
let $c_k$ be the predicted class,
|
||||
let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$
|
||||
and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
|
||||
let $R_k, t_k, p_k, o_k$ be the predicted motion for the predicted class $c_k$
|
||||
and $R_k^*, t_k^*, p_k^*, o_k^*$ the motion ground truth for the best matching example.
|
||||
Then, assuming there are $N$ such detections,
|
||||
\begin{equation}
|
||||
E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right)
|
||||
E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R_k^*) \cdot R_k) - 1}{2} \right\}\right\} \right)
|
||||
\end{equation}
|
||||
measures the mean angle of the error rotation between predicted and ground truth rotation,
|
||||
\begin{equation}
|
||||
E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R^{k,c_k}) \cdot (t^{gt,i_k} - t^{k,c_k}) \right\rVert_2,
|
||||
E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R_k) \cdot (t_k^* - t_k) \right\rVert_2,
|
||||
\end{equation}
|
||||
is the mean Euclidean norm between predicted and ground truth translation, and
|
||||
\begin{equation}
|
||||
E_{p} = \frac{1}{N}\sum_k \left\lVert p^{gt,i_k} - p^{k,c_k} \right\rVert_2
|
||||
E_{p} = \frac{1}{N}\sum_k \left\lVert p_k^* - p_k \right\rVert_2
|
||||
\end{equation}
|
||||
is the mean Euclidean norm between predicted and ground truth pivot.
|
||||
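The rotation and translation metrics can be computed per detection as below and then averaged over the $N$ matched detections; the trace argument is clamped to $[-1, 1]$ exactly as in the formula (NumPy sketch, illustrative names):

```python
import numpy as np

def rotation_angle_error(R_pred, R_gt):
    """Angle (radians) of the error rotation between prediction and ground truth."""
    cos_angle = (np.trace(np.linalg.inv(R_gt) @ R_pred) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def translation_error(R_pred, t_pred, t_gt):
    """Euclidean norm of the translation error, rotated by inv(R_pred) as in E_t."""
    return float(np.linalg.norm(np.linalg.inv(R_pred) @ (t_gt - t_pred)))
```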
Moreover, we define precision and recall measures for the detection of moving objects,
|
||||
@ -135,29 +133,30 @@ O_{rc} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}
|
||||
is the fraction of objects correctly classified as moving among all objects which are actually moving.
|
||||
Here, we used
|
||||
\begin{equation}
|
||||
\mathit{TP} = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 1],
|
||||
\mathit{TP} = \sum_k [o_k = 1 \land o_k^* = 1],
|
||||
\end{equation}
|
||||
\begin{equation}
|
||||
\mathit{FP} = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 0],
|
||||
\mathit{FP} = \sum_k [o_k = 1 \land o_k^* = 0],
|
||||
\end{equation}
|
||||
and
|
||||
\begin{equation}
|
||||
\mathit{FN} = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
|
||||
\mathit{FN} = \sum_k [o_k = 0 \land o_k^* = 1].
|
||||
\end{equation}
|
||||
Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
the predicted camera motion.

\subsection{Virtual KITTI: Training setup}
\label{ssec:setup}

For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both the Motion R-CNN ResNet and ResNet-FPN variants.
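For reference, a minimal sketch of how such an input tensor could be assembled by channel-wise concatenation; the pinhole back-projection of the depth map with camera intrinsics $f_x, f_y, c_x, c_y$ is an illustrative assumption and not necessarily how the XYZ coordinates are provided in practice.
\begin{verbatim}
import numpy as np

def build_network_input(rgb_t, rgb_t1, depth_t, depth_t1, fx, fy, cx, cy):
    # rgb_*: (H, W, 3) images, depth_*: (H, W) depth maps.
    def xyz_from_depth(depth):
        # Back-project every pixel with a pinhole camera model.
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) / fx * depth
        y = (v - cy) / fy * depth
        return np.stack([x, y, depth], axis=-1)  # (H, W, 3)

    # Concatenate I_t, I_{t+1}, XYZ_t, XYZ_{t+1} -> (H, W, 12).
    return np.concatenate(
        [rgb_t, rgb_t1, xyz_from_depth(depth_t), xyz_from_depth(depth_t1)],
        axis=-1)
\end{verbatim}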

\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train for a total of 192K iterations on the Virtual KITTI training set.
For this, we use a single Titan X (Pascal) GPU and a batch size of 1,
which results in approximately one day of training.
As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
momentum of $0.9$.
As learning rate, we use $0.25 \cdot 10^{-2}$ for the

@ -11,26 +11,30 @@ and estimates their 3D locations as well as all 3D object motions between the fr

\subsection{Motivation}

When moving in the real world, it is often desirable to know which objects exist
in the proximity of the moving agent,
where they are located relative to the agent,
and where they will be at some point in the near future.
In many cases, it would be preferable to infer such information from video data
if technically feasible, as camera sensors are cheap and ubiquitous
(compared to, for example, Lidar).

As an example, consider the autonomous driving problem.
Here, it is crucial to not only know the position
of each obstacle, but to also know if and where the obstacle is moving,
and to use sensors that will not make the system too expensive for widespread use.
At the same time, the autonomous driving system has to operate in real time to
react quickly enough for safely controlling the vehicle.

A promising approach to 3D scene understanding in situations such as autonomous driving is deep neural
networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification
in still images and are increasingly being applied to video data.
A key benefit of deep networks is that they can, in principle,
enable very fast inference on real-time video data and generalize
over many training situations to resolve ambiguities inherent in image understanding
and motion estimation.

Thus, in this work, we aim to develop deep neural networks which can, given
sequences of images, segment the image pixels into object instances and estimate
the location and 3D motion of each object instance relative to the camera
(Figure \ref{figure:teaser}).

@ -39,9 +43,12 @@ the location and 3D motion of each object instance relative to the camera

Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
Using a standard encoder-decoder network for pixel-wise dense prediction,
SfM-Net predicts a pre-determined number of binary masks spanning the complete image,
with each mask specifying the membership of the image pixels to one object.
A fully-connected network branching off the encoder then predicts a 3D motion for each object,
as well as the camera ego-motion.
However, due to the fixed number of object masks, the system can in practice only predict a small number of motions and
often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions (Figure \ref{figure:sfmnet_kitti}).

\begin{figure}[t]
\centering
@ -64,9 +71,11 @@ deep learning approaches to motion estimation, may significantly benefit motion

estimation by structuring the problem, creating physical constraints and reducing
the dimensionality of the estimate.

In the context of still images, a
scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{MaskRCNN}.
Mask R-CNN inherits from Faster R-CNN \cite{FasterRCNN} the ability to detect
a large number of objects from a large number of classes at once
and predicts pixel-precise segmentation masks for each detected object (Figure \ref{figure:maskrcnn_cs}).

\begin{figure}[t]
@ -126,7 +135,7 @@ image depending on the semantics of each region or pixel, which include whether

pixel belongs to the background, to which object instance it belongs if it is not background,
and the class of the object it belongs to.
Often, failure cases of these methods include motion boundaries or regions with little texture,
where semantics become very important.
Extensions of these approaches to scene flow estimate flow and depth
with similarly generic networks \cite{SceneFlowDataset} and similar limitations.

@ -171,14 +180,14 @@ These concerns restrict the applicability of the current slanted plane models in

which often require estimation in real time and for which an end-to-end
approach based on learning would be preferable.

By analogy, in other contexts, the move towards end-to-end deep learning has often led
to significant benefits in terms of accuracy and speed.
As an example, consider the evolution of region-based convolutional networks, which started
out as prohibitively slow, with a CNN as only one component of a larger pipeline, and
became very fast and much more accurate over the course of their development into
end-to-end deep networks.

Thus, in the context of motion estimation, one may expect end-to-end deep learning to not only bring large improvements
in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation
and the ability of deep networks to learn to handle ambiguity from a large variety of training examples.

@ -201,15 +210,15 @@ with a brightness constancy proxy loss.

Like SfM-Net, we aim to estimate 3D motion and segment object instances jointly with
end-to-end deep learning.
Unlike SfM-Net, we build on a scalable object detection and instance segmentation
approach with R-CNNs, which provide us with a strong baseline for these tasks.

\paragraph{End-to-end deep networks for camera pose estimation}
Deep networks have been used for estimating the 6-DOF camera pose from
a single RGB frame \cite{PoseNet, PoseNet2}, or for estimating depth and camera ego-motion
from monocular video \cite{UnsupPoseDepth}.
These works are related to
ours in that we also need to output various rotations and translations from a deep network,
and thus need to solve similar regression problems and may be able to use similar parametrizations
and losses.

@ -217,8 +226,8 @@ and losses.

First, in section \ref{sec:background}, we introduce preliminaries and building
blocks from earlier works that serve as a foundation for our networks and losses.
Most importantly, we review the ResNet CNN (\ref{ssec:resnet}) that will serve as our CNN backbone
as well as the developments in region-based CNNs which we build on (\ref{ssec:rcnn}),
specifically Mask R-CNN and the Feature Pyramid Network (FPN) \cite{FPN}.
In section \ref{sec:approach}, we describe our technical contribution, starting
with our motion estimation model and modifications to the Mask R-CNN backbone and head networks (\ref{ssec:model}),
followed by our losses and supervision methods for training

10
thesis.tex
@ -125,39 +125,39 @@

%\pagenumbering{arabic} % Arabic page numbers

\section{Introduction}
\label{sec:introduction}
\parindent 2em
\onehalfspacing

\input{introduction}

\section{Background}
\label{sec:background}
\parindent 2em
\onehalfspacing

\input{background}

\section{Motion R-CNN}
\label{sec:approach}
\parindent 2em
\onehalfspacing

\input{approach}

\section{Experiments}
\label{sec:experiments}
\parindent 2em
\onehalfspacing

\input{experiments}

\section{Conclusion}
\label{sec:conclusion}
\parindent 2em
\onehalfspacing

\input{conclusion}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bibliography with BibLaTeX