full editing pass

This commit is contained in:
Simon Meister 2017-11-20 20:45:26 +01:00
parent 5b046e41b5
commit bcf3adc60e
7 changed files with 234 additions and 198 deletions

View File

@ -28,7 +28,7 @@ we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D camera,
our resulting end-to-end deep network detects objects with precise per-pixel
object masks and estimates the 3D motion of each detected object between the frames.
By additionally estimating a global camera motion in the same network,
By additionally estimating the camera ego-motion in the same network,
we compose a dense optical flow field based on instance-level and global motion
predictions. We train our network on the synthetic Virtual KITTI dataset,
which provides ground truth for all components of our system.
@ -62,7 +62,7 @@ Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentieru
Bei Eingabe von zwei aufeinanderfolgenden Frames aus einer monokularen RGB-D
Kamera erkennt unser end-to-end Deep Network Objekte mit pixelgenauen Objektmasken
und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den Frames ab.
Indem wir zusätzlich im selben Netzwerk die globale Kamerabewegung schätzen,
Indem wir zusätzlich im selben Netzwerk die Eigenbewegung der Kamera schätzen,
setzen wir aus den instanzbasierten und globalen Bewegungsschätzungen ein dichtes
optisches Flussfeld zusammen.
Wir trainieren unser Netzwerk auf dem synthetischen Virtual KITTI Datensatz,

View File

@ -7,7 +7,7 @@ we estimate per-object motion by predicting the 3D motion of each detected objec
For this, we extend Mask R-CNN in two straightforward ways.
First, we modify the backbone network and provide two frames to the R-CNN system
in order to enable image matching between the consecutive frames.
Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
Second, we extend the Mask R-CNN RoI head to predict a 3D motion and pivot for each
region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
respectively.
@ -18,7 +18,7 @@ respectively.
\toprule
\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input images & H $\times$ W $\times$ C \\
& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\
\midrule
C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
\midrule
@ -69,7 +69,7 @@ additionally dropout with $p = 0.5$ after all fully-connected hidden layers.
\toprule
\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input images & H $\times$ W $\times$ C \\
& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\
\midrule
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
@ -121,32 +121,32 @@ Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backb
Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
Alternatively, we also experiment with concatenating the camera space XYZ coordinates for each frame,
Additionally, we experiment with concatenating the camera space XYZ coordinates for each frame,
XYZ$_t$ and XYZ$_{t+1}$, into the input as well.
We do not introduce a separate network for computing region proposals, but use our modified backbone network
as the shared feature extractor for both the first stage RPN and the second stage RoI feature extraction.
Technically, our feature encoder network will have to learn a motion representation similar to
Technically, our feature encoder network will have to learn image matching representations similar to
those learned by the FlowNet encoder, but the output will be computed in the
object-centric framework of a region based convolutional network head with a 3D parametrization.
Thus, in contrast to the dense FlowNet decoder, the estimated dense motion information
from the encoder is integrated for specific objects via RoI cropping and
Thus, in contrast to the dense FlowNet decoder, the estimated dense image matching information
from the encoder is integrated for specific objects via RoI extraction and
processed by the RoI head for each object.
\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$
For the $k$-th object proposal, we predict the rigid transformation $\{R_k, t_k\}\in \mathbf{SE}(3)$
\footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
and translations: $\{R, t|R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$.
We parametrize ${R_t^k}$ using an Euler angle representation,
of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_k \in \mathbb{R}^3$ at time $t$.
We parametrize ${R_k}$ using an Euler angle representation,
\begin{equation}
R_t^k = R_t^{k,z}(\gamma) \cdot R_t^{k,x}(\alpha) \cdot R_t^{k,y}(\beta),
R_k = R_k^z(\gamma) \cdot R_k^x(\alpha) \cdot R_k^y(\beta),
\end{equation}
where
\begin{equation}
R_t^{k,x}(\alpha) =
R_k^x(\alpha) =
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos(\alpha) & -\sin(\alpha) \\
@ -155,7 +155,7 @@ R_t^{k,x}(\alpha) =
\end{equation}
\begin{equation}
R_t^{k,y}(\beta) =
R_k^y(\beta) =
\begin{pmatrix}
\cos(\beta) & 0 & \sin(\beta) \\
0 & 1 & 0 \\
@ -164,7 +164,7 @@ R_t^{k,y}(\beta) =
\end{equation}
\begin{equation}
R_t^{k,z}(\gamma) =
R_k^z(\gamma) =
\begin{pmatrix}
\cos(\gamma) & -\sin(\gamma) & 0 \\
\sin(\gamma) & \cos(\gamma) & 0 \\
@ -179,25 +179,26 @@ prediction in addition to the fully-connected layers for
refined boxes and classes and the convolutional network for the masks.
Like for refined boxes and masks, we make one separate motion prediction for each class.
Each instance motion is predicted as a set of nine scalar parameters,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_k$ and $p_k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction along any axis,
which is in general a safe assumption for image sequences from videos.
which is in general a safe assumption for image sequences from videos,
and enables us to obtain unique cosine values from the predicted sine values.
All predictions are made in camera space, and translation and pivot predictions are in meters.
We additionally predict softmax scores $o_t^k$ for classifying the objects into
still and moving objects. As a postprocessing, for any object instance $k$ with predicted moving flag $o_t^k = 0$,
we set $\sin(\alpha) = \sin(\beta) = \sin(\gamma) = 0$ and $t_t^k = (0,0,0)^T$,
We additionally predict softmax scores $o_k$ for classifying the objects into
still and moving objects. As a postprocessing, for any object instance $k$ with predicted moving flag $o_k = 0$,
we set $\sin(\alpha) = \sin(\beta) = \sin(\gamma) = 0$ and $t_k = (0,0,0)^T$,
and thus predict an identity motion.
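As an illustration, the following NumPy sketch decodes the nine predicted scalars into $R_k$, $t_k$ and $p_k$; the cosines are recovered as $\sqrt{1 - \sin^2}$ under the small-rotation assumption stated above, and the $0.5$ threshold on the moving score as well as the function name are illustrative assumptions rather than the exact implementation.
\begin{verbatim}
import numpy as np

def decode_instance_motion(params, moving_score):
    """Decode the 9 predicted scalars into (R_k, t_k, p_k), gated by the moving flag.

    params: [sin_alpha, sin_beta, sin_gamma, t_x, t_y, t_z, p_x, p_y, p_z]
    moving_score: predicted probability that the object moves (assumed threshold 0.5).
    """
    sin_a, sin_b, sin_g = np.clip(params[0:3], -1.0, 1.0)
    t = np.asarray(params[3:6], dtype=np.float64)
    p = np.asarray(params[6:9], dtype=np.float64)

    if moving_score < 0.5:           # classified as still: force an identity motion
        sin_a = sin_b = sin_g = 0.0
        t = np.zeros(3)

    # Rotations are assumed below 90 degrees, so cos >= 0 is recovered uniquely.
    cos_a, cos_b, cos_g = (np.sqrt(1.0 - s ** 2) for s in (sin_a, sin_b, sin_g))

    R_x = np.array([[1, 0, 0], [0, cos_a, -sin_a], [0, sin_a, cos_a]])
    R_y = np.array([[cos_b, 0, sin_b], [0, 1, 0], [-sin_b, 0, cos_b]])
    R_z = np.array([[cos_g, -sin_g, 0], [sin_g, cos_g, 0], [0, 0, 1]])
    R = R_z @ R_x @ R_y              # R_k = R_k^z * R_k^x * R_k^y as above
    return R, t, p
\end{verbatim}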
\paragraph{Camera motion prediction}
In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
In addition to the object transformations, we optionally predict the camera motion $\{R_{cam}, t_{cam}\}\in \mathbf{SE}(3)$
between the two frames $I_t$ and $I_{t+1}$.
For this, we branch off a small fully-connected network from the bottleneck output of the backbone.
We again represent $R_t^{cam}$ using a Euler angle representation and
predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.
Again, we predict a softmax score $o_t^{cam}$ for differentiating between
We again represent $R_{cam}$ using a Euler angle representation and
predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_{cam}$ in the same way as for the individual objects.
Again, we predict a softmax score $o_{cam}$ for differentiating between
a still and moving camera.
\subsection{Network design}
@ -207,19 +208,19 @@ a still and moving camera.
In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
feature resolution prior to RoI extraction would be reduced too much.
Therefore, in our the variant without FPN, we first pass the $C_4$ features through $C_5$
Therefore, in our variant without FPN, we first pass the $C_4$ features through $C_5$
and $C_6$ blocks (with weights independent from the $C_5$ block used in the RoI head in this variant)
to increase the bottleneck stride prior to the camera network to 64.
to increase the bottleneck stride prior to the camera motion network to 64.
In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}),
the backbone makes use of all blocks through $C_6$ and
we can simply branch of our camera network from the $C_6$ bottleneck.
the backbone makes use of all blocks through $C_6$, and
we can simply branch off our camera motion network from the $C_6$ bottleneck.
Then, in both the ResNet and ResNet-FPN variants, we apply an additional
convolution to the $C_6$ features to reduce the number of inputs to the following
fully-connected layers.
fully-connected layers, and thus keep the number of weights reasonably small.
Instead of averaging, we use bilinear resizing to bring the convolutional features
to a fixed size without losing all spatial information,
flatten them, and finally apply multiple fully-connected layers to compute the
camera motion prediction.
flatten them, and finally apply multiple fully-connected layers to predict the
camera motion parameters.
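For concreteness, a minimal sketch of such a camera motion branch is given below in TensorFlow 2/Keras style; the layer sizes, the fixed resize resolution and the use of Keras layers are assumptions for illustration, not the values of our actual implementation, which builds on the TensorFlow Object Detection API.
\begin{verbatim}
import tensorflow as tf

def camera_motion_head(c6_features, hidden_units=512, resize_hw=(3, 10)):
    """Sketch of the camera motion branch: 1x1 conv to reduce channels,
    bilinear resize to a fixed size, flatten, then fully-connected layers.
    All layer sizes here are illustrative assumptions."""
    x = tf.keras.layers.Conv2D(128, 1, activation='relu')(c6_features)
    x = tf.image.resize(x, resize_hw, method='bilinear')  # keep some spatial layout
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(hidden_units, activation='relu')(x)
    x = tf.keras.layers.Dense(hidden_units, activation='relu')(x)
    sines = tf.clip_by_value(tf.keras.layers.Dense(3)(x), -1.0, 1.0)  # sin(alpha), sin(beta), sin(gamma)
    t_cam = tf.keras.layers.Dense(3)(x)                               # translation in meters
    o_cam = tf.keras.layers.Dense(2)(x)                               # still / moving logits
    return sines, t_cam, o_cam
\end{verbatim}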
\paragraph{RoI motion head network}
In both of our network variants
@ -227,6 +228,15 @@ In both of our network variants
we compute the fully-connected network for motion prediction from the
flattened RoI features, which are also the basis for classification and
bounding box refinement.
Note that the features passed to our ResNet-FPN RoI head (extracted from the upsampled FPN stage
appropriate to the RoI bounding box scale) have passed through the $C_6$
bottleneck, which has a stride of 64 with respect to the original image.
In contrast, the bottleneck for the features passed to our ResNet RoI head
is $C_4$ (with a stride of 16). Thus, the ResNet-FPN variant can in principle estimate
object motions based on larger displacements than the ResNet variant.
Additionally, as smaller bounding boxes use higher resolution features, the
motions and pivots of (especially smaller) objects can in principle be more accurately
estimated with the FPN variant.
\subsection{Supervision}
\label{ssec:supervision}
@ -235,42 +245,41 @@ bounding box refinement.
The most straightforward way to supervise the object motions is by using ground truth
motions computed from ground truth object poses, which is in general
only practical when training on synthetic datasets.
Given the $k$-th foreground RoI, let $i_k$ be the index of the matched ground truth example with class $c_k^*$,
let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k^*$
and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
Note that we dropped the subscript $t$ to increase readability.
Given the $k$-th foreground RoI (as defined for Mask R-CNN) with ground truth class $c_k^*$,
let $R_k, t_k, p_k, o_k$ be the predicted motion for class $c_k^*$ as parametrized above,
and $R_k^*, t_k^*, p_k^*, o_k^*$ the ground truth motion for the matched ground truth example.
Similar to the camera pose regression loss in \cite{PoseNet2},
we use a variant of the $\ell_1$-loss to penalize the differences between ground truth and predicted
rotation, translation (and pivot, in our case). We found that the smooth $\ell_1$-loss
performs better than the standard $\ell_1$-loss.
We then compute the RoI motion loss as
We thus compute the RoI motion loss as
\begin{equation}
L_{motion} = \frac{1}{N_{RoI}^{fg}} \sum_k^{N_{RoI}} l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o^{gt,i_k} + l_o^k,
L_{motion} = \frac{1}{N_{RoI}^{fg}} \sum_k^{N_{RoI}} l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o_k^* + l_o^k,
\end{equation}
where
\begin{equation}
l_{R}^k = \ell_{reg} (R^{gt,i_k} - R^{k,c_k}),
l_{R}^k = \ell_{reg} (R_k^* - R_k),
\end{equation}
\begin{equation}
l_{t}^k = \ell_{reg} (t^{gt,i_k} - t^{k,c_k}),
l_{t}^k = \ell_{reg} (t_k^* - t_k),
\end{equation}
\begin{equation}
l_{p}^k = \ell_{reg} (p^{gt,i_k} - p^{k,c_k}).
l_{p}^k = \ell_{reg} (p_k^* - p_k).
\end{equation}
are the smooth $\ell_1$-loss terms for the predicted rotation, translation and pivot,
are the smooth-$\ell_1$ losses for the predicted rotation, translation and pivot,
respectively, and
\begin{equation}
l_o^k = \ell_{cls}(o_t^k, o^{gt,i_k}).
l_o^k = \ell_{cls}(o_k, o_k^*).
\end{equation}
is the cross-entropy loss for the predicted classification into moving and non-moving objects.
Note that we do not penalize the rotation and translation for objects with
$o^{gt,i_k} = 0$, which do not move between $t$ and $t+1$. We found that the network
$o_k^* = 0$, which do not move between $t$ and $t+1$. We found that the network
may not reliably predict exact identity motions for still objects; it is
numerically easier to classify objects into
moving and non-moving and to discard the regression for the non-moving
ones. Also, analogous to masks and bounding boxes, the estimates for classes
ones. Also, analogously to masks and bounding boxes, the estimates for classes
other than $c_k^*$ are not penalized.
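The following NumPy sketch transcribes this loss; the smooth-$\ell_1$ threshold of $1$ and the explicit cross-entropy computation follow common practice and are assumptions, not values taken from our implementation.
\begin{verbatim}
import numpy as np

def smooth_l1(x):
    """Elementwise smooth-l1 (Huber) penalty, summed over the last axis
    (threshold of 1 is an assumption)."""
    absx = np.abs(x)
    per_elem = np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)
    return per_elem.sum(axis=-1)

def roi_motion_loss(R_pred, t_pred, p_pred, o_logits,
                    R_gt, t_gt, p_gt, o_gt, num_fg):
    """L_motion over all RoIs; rotation and translation terms are gated by o_gt."""
    l_R = smooth_l1((R_gt - R_pred).reshape(len(R_pred), -1))
    l_t = smooth_l1(t_gt - t_pred)
    l_p = smooth_l1(p_gt - p_pred)
    # cross-entropy on the still/moving classification
    log_probs = o_logits - np.log(np.exp(o_logits).sum(axis=-1, keepdims=True))
    l_o = -log_probs[np.arange(len(o_gt)), o_gt]
    per_roi = l_p + (l_R + l_t) * o_gt + l_o
    return per_roi.sum() / num_fg
\end{verbatim}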
Now, our modified RoI loss is
@ -281,10 +290,10 @@ L_{RoI} = L_{cls} + L_{box} + L_{mask} + L_{motion}.
\paragraph{Camera motion supervision}
We supervise the camera motion with ground truth analogously to the
object motions, with the only difference being that we only have
a rotation and translation, but no pivot term for the camera motion.
a rotation and translation, but no pivot loss for the camera motion.
If the ground truth shows that the camera is not moving, we again do not
penalize rotation and translation. For the camera, the loss is reduced to the
classification term in this case.
penalize rotation and translation. In this case, the camera motion loss is reduced to the
classification loss.
\paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth}
\begin{figure}[t]
@ -310,19 +319,20 @@ In this case, for any RoI,
we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask.
We use the same bounding box
to crop the corresponding region from the dense, full image depth map
to crop the corresponding region from the dense, full-image depth map
and bilinearly resize the depth crop to the same resolution as the mask and point
grid.
We then compute the optical flow at each of the grid points by creating
a 3D point cloud from the point grid and depth crop. To this point cloud, we
apply the RoI's predicted motion, masked by the predicted mask.
Next, we create a 3D point cloud from the point grid and depth crop. To this point cloud, we
apply the object motion predicted for the RoI, masked by the predicted mask.
Then, we apply the camera motion to the points, project them back to 2D
and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
Note that we batch this computation over all RoIs, so that we only perform
it once per forward pass. Figure \ref{figure:flow_loss} illustrates the approach.
it once per forward pass.
Figure \ref{figure:flow_loss} illustrates the approach.
The mathematical details for the 3D transformations and mappings between 2D and 3D are analogous to the
dense, full image flow composition in the following subsection, so we will not
include them here. The only differences are that there is no sum over objects during
dense, full-image flow composition in the following subsection, so we will not
duplicate them here. The only differences are that there is no sum over objects during
the point transformation based on instance motion, as we consider the single object
corresponding to an RoI in isolation, and that the masks are not resized to the
full image resolution, as
@ -333,21 +343,22 @@ For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion
by penalizing the $m \times m$ optical flow grid.
If there is optical flow ground truth available, we can use the RoI bounding box to
crop and resize a region from the ground truth optical flow to match the RoI's
optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss.
optical flow grid and penalize the difference between the flow grids with a (smooth) $\ell_1$-loss.
However, we can also use the re-projection loss without optical flow ground truth
to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}.
In this case, we use the bounding box to crop and resize a corresponding region
In this case, we can use the bounding box to crop and resize a corresponding region
from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$
using the 2D grid displaced with the predicted flow grid. Then, we can penalize the difference
using the 2D grid displaced with the predicted flow grid (the latter is often called \emph{backward warping}).
Then, we can penalize the difference
between the resulting image crops, for example, with a census loss \cite{CensusTerm,UnFlow}.
For more details on differentiable bilinear sampling for deep learning, we refer the reader to
\cite{STN}.
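To make the warping step concrete, the following NumPy sketch bilinearly samples the second frame at the flow-displaced grid positions; it is a simplified, non-differentiable stand-in for the bilinear sampling layer of \cite{STN}, not our actual implementation.
\begin{verbatim}
import numpy as np

def backward_warp(image_t1, flow):
    """Sample image_t1 at positions displaced by the predicted flow.

    image_t1: (H, W, C) second frame (or a crop of it).
    flow:     (H, W, 2) predicted optical flow (u, v) on the first frame grid.
    Returns the warped image, which should resemble the first frame wherever the
    flow and the brightness constancy assumption are correct.
    """
    H, W, _ = image_t1.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)

    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]

    # bilinear interpolation of the four neighbouring pixels
    top = (1 - wx) * image_t1[y0, x0] + wx * image_t1[y0, x1]
    bottom = (1 - wx) * image_t1[y1, x0] + wx * image_t1[y1, x1]
    return (1 - wy) * top + wy * bottom
\end{verbatim}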
When compared to supervision with motion ground truth, a re-projection
When compared to supervision with 3D instance motion ground truth, a re-projection
loss could benefit motion regression by removing any loss balancing issues between the
rotation, translation and pivot terms \cite{PoseNet2},
which can make it interesting even when 3D motion ground truth is available.
rotation, translation and pivot losses \cite{PoseNet2},
which could make it interesting even when 3D motion ground truth is available.
\subsection{Training and inference}
\label{ssec:training_inference}
@ -368,7 +379,7 @@ highest scoring class.
\subsection{Dense flow from motion}
\label{ssec:postprocessing}
As a postprocessing, we compose a dense optical flow map from the outputs of our Motion R-CNN network.
As a postprocessing, we compose the dense optical flow between $I_t$ and $I_{t+1}$ from the outputs of our Motion R-CNN network.
Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$,
where
\begin{equation}
@ -383,33 +394,34 @@ x_t - c_0 \\ y_t - c_1 \\ f
\end{pmatrix},
\end{equation}
is the 3D coordinate at $t$ corresponding to the point with pixel coordinates $x_t, y_t$,
which range over all coordinates in $I_t$.
which range over all coordinates in $I_t$,
and $(c_0, c_1, f)$ are the camera intrinsics.
For now, the depth map is always assumed to come from ground truth.
Given $K$ detections with predicted motions as above, we transform all points within the bounding
box of a detected object according to the predicted motion of the object.
We first define the \emph{full image} mask $M_t^k$ for object k,
which can be computed from the predicted box mask $m_t^k$ by bilinearly resizing
$m_t^k$ to the width and height of the predicted bounding box and then copying the values
We first define the \emph{full image} mask $M_k$ for object $k$,
which can be computed from the predicted box mask $m_k$ (for the predicted class) by bilinearly resizing
it to the width and height of the predicted bounding box and then copying the values
of the resized mask into a full resolution mask initialized with zeros,
starting at the top-left coordinate of the predicted bounding box.
Then, given the predicted motions $(R_t^k, t_t^k)$ as well as $p_t^k$ for all objects,
Then, given the predicted motions $(R_k, t_k)$, as well as $p_k$ for all objects,
\begin{equation}
P'_{t+1} =
P_t + \sum_1^{k} M_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
P_t + \sum_{k=1}^{K} M_k\left\{ R_k \cdot (P_t - p_k) + p_k + t_k - P_t \right\}
\end{equation}
These motion predictions are understood to have already taken into account
the classification into moving and still objects,
and we thus, as described above, have identity motions for all objects with $o_t^k = 0$.
and we thus, as described above, have identity motions for all objects with $o_k = 0$.
Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$,
Next, we transform all points given the camera transformation $\{R_{cam}, t_{cam}\} \in \mathbf{SE}(3)$,
\begin{equation}
\begin{pmatrix}
X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
\end{pmatrix}
= P_{t+1} = R_t^c \cdot P'_{t+1} + t_t^c
= P_{t+1} = R_{cam} \cdot P'_{t+1} + t_{cam}.
\end{equation}
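The complete composition can be summarized by the following NumPy sketch (back-projection with the intrinsics, per-object rigid transformations weighted by the full-image masks, the camera transformation, and re-projection to optical flow); it is a reading aid for the notation above rather than the exact implementation.
\begin{verbatim}
import numpy as np

def compose_dense_flow(depth, masks, motions, cam_motion, intrinsics):
    """depth: (H, W); masks: list of (H, W) full-image masks M_k;
    motions: list of (R_k, t_k, p_k), assumed to already encode identity motions
    for objects classified as still; cam_motion: (R_cam, t_cam);
    intrinsics: (c0, c1, f).  Returns the optical flow (H, W, 2)."""
    c0, c1, f = intrinsics
    H, W = depth.shape
    y, x = np.meshgrid(np.arange(H, dtype=np.float64),
                       np.arange(W, dtype=np.float64), indexing='ij')

    # back-project to a 3D point cloud P_t in camera space at time t
    P_t = (depth / f)[..., None] * np.stack([x - c0, y - c1, np.full_like(x, f)], axis=-1)

    # apply each object's rigid motion, weighted by its full-image mask
    P = P_t.copy()
    for (R_k, t_k, p_k), M_k in zip(motions, masks):
        moved = (P_t - p_k) @ R_k.T + p_k + t_k
        P += M_k[..., None] * (moved - P_t)

    # apply the camera motion and re-project to the image plane at t+1
    R_cam, t_cam = cam_motion
    P = P @ R_cam.T + t_cam
    x1 = f * P[..., 0] / P[..., 2] + c0
    y1 = f * P[..., 1] / P[..., 2] + c1
    return np.stack([x1 - x, y1 - y], axis=-1)
\end{verbatim}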
Note that in our experiments, we either use the ground truth camera motion to focus

View File

@ -8,10 +8,10 @@ The optical flow
$\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$
maps pixel coordinates in the first frame $I_t$ to pixel coordinates of the
visually corresponding pixel in the second frame $I_{t+1}$,
and can be interpreted as the apparent movement of brigthness patterns between the two frames.
and can be interpreted as the apparent movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to 3-dimensional space and
Scene flow is the generalization of optical flow to three-dimensional space and additionally
requires estimating depth for each pixel. Generally, stereo input is used for scene flow
to estimate disparity-based depth; however, monocular depth estimation with deep networks is also becoming
popular \cite{DeeperDepth, UnsupPoseDepth}.
@ -47,7 +47,7 @@ flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
\bottomrule
\end{tabular}
\caption {
FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
Overview of the FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
are used for refinement.
}
\label{table:flownets}
@ -70,21 +70,22 @@ performing upsampling of the compressed features and resulting in a encoder-deco
The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
Table \ref{table:flownets} shows the classical FlowNetS architecture for optical flow prediction.
Table \ref{table:flownets} gives an overview of the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is a rather generic autoencoder and is specialized for optical flow only through being trained
with supervision from dense optical flow ground truth.
Potentially, the same network could also be used for semantic segmentation if
the number of output final and intermediate output channels was adapted from two to the number of classes.\
the number of final and intermediate output channels was adapted from two to the number of classes.
Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching arguably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated depends on the number of 2D convolution strides or pooling
Note that the maximum displacement that can be correctly estimated depends on the number of strided 2D convolutions (and the stride they use) and pooling
operations in the encoder.
Recently, other, similarly generic,
encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
encoder-decoder CNNs have been applied to optical flow prediction as well \cite{DenseNetDenseFlow}.
\subsection{SfM-Net}
Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture.
Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture we described
in the introduction.
Motions and full-image masks for a fixed number N$_{motions}$ of independent objects
are predicted in addition to a depth map, and an unsupervised re-projection loss based on
image brightness differences penalizes the predictions.
@ -103,7 +104,7 @@ image brightness differences penalizes the predictions.
& input images $I_t$ and $I_{t+1}$ & H $\times$ W $\times$ 6 \\
& Conv-Deconv & H $\times$ W $\times$ 32 \\
masks & 1 $\times$1 conv, N$_{motions}$ & H $\times$ W $\times$ N$_{motions}$ \\
FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}$ $\times$ 2 & H $\times$ W $\times$ 32 \\
FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}$ $\times$ 2 & 1 $\times$ 512 \\
object motions & fully connected, $N_{motions} \cdot$ 9 & 1 $\times$ $N_{motions} \cdot$ 9 \\
camera motion & From FC: $\times$ 2 & 1 $\times$ 6 \\
\midrule
@ -118,7 +119,7 @@ depth & 1 $\times$1 conv, 1 & H $\times$ W $\times$ 1 \\
\end{tabular}
\caption {
SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully convolutional
SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully-convolutional
encoder-decoder network, where convolutions and deconvolutions with stride 2 are
used for downsampling and upsampling, respectively. The stride at the bottleneck
with respect to the input image is 32.
@ -147,7 +148,7 @@ Note that for the Mask R-CNN architectures we describe below, this is equivalent
to the standard ResNet-50 backbone. We now introduce one small extension that
will be useful for our Motion R-CNN network.
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
@ -166,9 +167,9 @@ to increase the bottleneck stride to 64, following FlowNetS.
\multicolumn{3}{c}{\textbf{ResNet}}\\
\midrule
C$_1$ & 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\
\midrule
& 3 $\times$ 3 max pool, stride 2 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 64 \\
\midrule
C$_2$ &
$\begin{bmatrix}
1 \times 1, 64 \\
@ -242,8 +243,8 @@ most popular deep networks for object detection, and have recently also been app
\paragraph{R-CNN}
Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped using the regions bounding box and the crop is
passed through a CNN, which performs classification of the object (or non-object, if the region shows background).
For each of the region proposals, the input image is cropped using the region's bounding box and the crop is
passed through the CNN, which performs classification of the object (or non-object, if the region shows background).
\paragraph{Fast R-CNN}
The original R-CNN involves computing one forward pass of the CNN for each of the region proposals,
@ -256,8 +257,8 @@ The extracted per-RoI (region of interest) feature maps are collected into a bat
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
The extraction technique is called \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the full image features
is divided into a H $\times$ W grid of cells. For each cell, the values of the underlying
full image feature map are max-pooled to yield the output value at this cell.
Thus, given region proposals, the per-region computation is reduced to a single pass through the complete network,
full-image feature map are max-pooled to yield the output value at the cell.
Thus, given region proposals, all computation is reduced to a single pass through the complete network,
speeding up the system by two orders of magnitude at inference time and one order of magnitude
at training time.
@ -297,15 +298,15 @@ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
\midrule
M$_0$ & From R$_1$: 2 $\times$ 2 deconv, 256, stride 2 & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\
& 1 $\times$ 1 conv, N$_{cls}$ & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ N$_{cls}$ \\
masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
\bottomrule
\end{tabular}
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet \cite{ResNet} architecture.
Note that this is equivalent to the Faster R-CNN ResNet architecture if the mask
Mask R-CNN \cite{MaskRCNN} ResNet-50 \cite{ResNet} architecture.
Note that this is equivalent to the Faster R-CNN ResNet-50 architecture if the mask
head is left out. In Mask R-CNN, bilinear sampling is used for RoI extraction,
whereas Faster R-CNN used RoI pooling.
whereas Faster R-CNN uses RoI pooling.
}
\label{table:maskrcnn_resnet}
\end{table}
@ -317,17 +318,17 @@ After streamlining the CNN components, Fast R-CNN is limited by the speed of the
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster processing when compared to Fast R-CNN
classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN
and again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
Next, the \emph{backbone} output features are passed into a small, fully convolutional \emph{Region Proposal Network (RPN)} head, which
Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
predicts objectness scores and regresses bounding boxes at each of its output positions.
At any of the $h \times w$ output positions of the RPN head,
$N_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $N_a$ \emph{anchors} with different
aspect ratios and scales. Thus, there are $N_a \times h \times w$ reference anchors in total.
In Faster R-CNN, $N_a = 9$, with 3 scales corresponding
In Faster R-CNN, $N_a = 9$, with 3 scales, corresponding
to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios,
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).
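For illustration, the reference anchor shapes implied by these scales and aspect ratios, and the anchor centers on the stride-16 grid, can be enumerated as in the following NumPy sketch; the interpretation of an aspect ratio $r$ as height:width is an assumption here.
\begin{verbatim}
import numpy as np

def anchor_shapes(areas=(128 ** 2, 256 ** 2, 512 ** 2),
                  aspect_ratios=(0.5, 1.0, 2.0)):
    """Enumerate the N_a = 9 anchor (width, height) pairs used in Faster R-CNN.
    An aspect ratio r is interpreted as height:width = r, keeping the area fixed."""
    shapes = []
    for area in areas:
        for r in aspect_ratios:
            w = np.sqrt(area / r)
            h = w * r
            shapes.append((w, h))
    return np.array(shapes)            # shape (9, 2)

def anchor_centers(image_h, image_w, stride=16):
    """Anchor center positions on the stride-16 RPN output grid."""
    ys = (np.arange(image_h // stride) + 0.5) * stride
    xs = (np.arange(image_w // stride) + 0.5) * stride
    return np.stack(np.meshgrid(xs, ys), axis=-1).reshape(-1, 2)
\end{verbatim}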
@ -337,8 +338,12 @@ The region proposals can then be obtained as the N highest scoring RPN predictio
Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each of the region proposals, which are now obtained
from the RPN instead of being pre-computed by some external algorithm.
As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals.
from the RPN instead of being pre-computed by an external algorithm.
As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals,
and the refined bounding boxes are predicted separately for each object class.
Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
(here, the mask head is ignored).
\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
@ -346,18 +351,20 @@ However, it can be helpful to know class and object (instance) membership of all
which generally involves computing a binary mask for each object instance specifying which pixels belong
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
fixed resolution instance masks within the bounding boxes of each detected object.
fixed resolution instance masks for each detected object,
which are then bilinearly resized to fit inside the respective bounding boxes.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise binary mask for each instance.
The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
comptetition between classes for the mask prediction branch.
competition between classes in the mask prediction branch.
One important additional technical aspect of Mask R-CNN is the replacement of RoI pooling with
Additionally, an important technical aspect of Mask R-CNN is the replacement of RoI pooling with
bilinear sampling for extracting the RoI features, which is much more precise.
In the original RoI pooling from Fast R-CNN, the bins for max-pooling are not aligned with the actual pixel
boundary of the bounding box, and thus some detail is lost.
The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
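As an aside, this kind of bilinear RoI extraction can be approximated with a single TensorFlow primitive, as in the following hedged sketch; true RoIAlign averages several bilinear samples per output bin, so this is a simplification for illustration only.
\begin{verbatim}
import tensorflow as tf

def extract_roi_features(feature_map, boxes, box_indices, crop_size=14):
    """Bilinearly sample fixed-size RoI features (RoIAlign-style approximation).

    feature_map: (B, H, W, C) backbone features.
    boxes:       (N, 4) normalized [y1, x1, y2, x2] RoI boxes.
    box_indices: (N,) index of the batch image each box belongs to.
    """
    return tf.image.crop_and_resize(
        feature_map, boxes, box_indices, [crop_size, crop_size],
        method='bilinear')
\end{verbatim}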
{
\begin{table}[h]
\centering
@ -367,7 +374,7 @@ boundary of the bounding box, and thus some detail is lost.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
C$_5$ & ResNet \{up to C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Feature Pyramid Network (FPN)}}\\
\midrule
@ -403,11 +410,11 @@ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
M$_1$ & From R$_2$: $\begin{bmatrix}\textrm{3 $\times$ 3 conv} \end{bmatrix}$ $\times$ 4, 256 & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\
& 2 $\times$ 2 deconv, 256, stride 2 & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ 256 \\
& 1 $\times$ 1 conv, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
\bottomrule
\end{tabular}
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet-FPN \cite{ResNet} architecture.
Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
Operations enclosed in a []$_p$ block make up a single FPN
block (see Figure \ref{figure:fpn_block}).
}
@ -416,28 +423,29 @@ block (see Figure \ref{figure:fpn_block}).
}
\paragraph{Feature Pyramid Networks}
In Faster R-CNN, a single feature map is used as a source of all RoIs, independent
of the size of the bounding box of the RoI.
However, for small objects, the C$_4$ (see Table \ref{table:maskrcnn_resnet}) features
might have lost too much spatial information to properly predict the exact bounding
box and a high resolution mask. Likewise, for very big objects, the fixed size
RoI window might be too small to cover the region of the feature map containing
information for this object.
In Faster R-CNN, a single feature map is used as the source of all RoI features during RoI extraction, independent
of the size of the bounding box of each RoI.
However, for small objects, the C$_4$ (see Table \ref{table:resnet}) features
might have lost too much spatial information to allow properly predicting the exact bounding
box and a high resolution mask.
As a solution to this, the Feature Pyramid Network (FPN) \cite{FPN} enables features
of an appropriate scale to be used, depending of the size of the bounding box.
of an appropriate scale to be used for RoI extraction, depending on the size of the bounding box of the RoI.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinear upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder.
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
encoder by combining bilinearly upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder (Figure~\ref{figure:fpn_block}).
For each consecutive upsampling block, the lateral skip connections are taken from
the encoder block with the same output resolution as the upsampled features coming
from the bottleneck.
Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$ (see Table \ref{table:maskrcnn_resnet_fpn}).
At each output position of the resulting RPN pyramid, bounding boxes are predicted
with respect to 3 anchor aspect ratios $\{1:2, 1:1, 2:1\}$ and a single scale ($N_a = 3$).
For P$_2$, P$_3$, P$_4$, P$_5$, P$_6$,
the scale corresponds to anchor bounding boxes of areas $32^2, 64^2, 128^2, 256^2, 512^2$,
respectively.
Note that there is no need for multiple anchor scales per anchor position anymore,
as the RPN heads themselves correspond to multiple scales.
as the RPN heads themselves correspond to different scales.
Now, in the RPN, higher resolution feature maps can be used for regressing smaller
bounding boxes. For example, boxes of area close to $32^2$ are predicted using P$_2$,
which has a stride of $4$ with respect to the input image.
@ -463,6 +471,8 @@ as some anchor to the exact same pyramid level from which the RPN of this
anchor is computed. Now, for example, the smallest boxes are cropped from $P_2$,
which is the highest resolution feature map.
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
\begin{figure}[t]
\centering
@ -506,7 +516,7 @@ All bounding boxes predicted by the RoI head or RPN are estimated as offsets
with respect to a reference bounding box. In the case of the RPN,
the reference bounding box is one of the anchors, and refined bounding boxes from the RoI head are
predicted relative to the RPN output bounding boxes.
Let $(x, y, w, h)$ be the top left coordinates, height and width of the bounding box
Let $(x, y, w, h)$ be the top left coordinates, width, and height of the bounding box
to be predicted. Likewise, let $(x^*, y^*, w^*, h^*)$ be the ground truth bounding
box and let $(x_r, y_r, w_r, h_r)$ be the reference bounding box.
The ground truth \emph{box encoding} $b_e^*$ is then defined as
@ -561,7 +571,7 @@ w = \exp(b_w) \cdot w_r,
h = \exp(b_h) \cdot h_r,
\end{equation*}
and thus the bounding box is obtained as the reference bounding box adjusted by
the predicted relative offsets and scales.
the predicted relative offsets and scales encoded in $b_e$.
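A small NumPy sketch of the encoding and its inverse is given below; the $x$, $y$ offset terms follow the standard Faster R-CNN convention, which we assume matches the definition above.
\begin{verbatim}
import numpy as np

def encode_box(box, ref):
    """Box encoding b_e of `box` relative to the reference box `ref`.
    Boxes are (x, y, w, h) with (x, y) the top-left corner."""
    x, y, w, h = box
    xr, yr, wr, hr = ref
    return np.array([(x - xr) / wr, (y - yr) / hr,
                     np.log(w / wr), np.log(h / hr)])

def decode_box(b_e, ref):
    """Inverse of encode_box: recover (x, y, w, h) from the encoding."""
    bx, by, bw, bh = b_e
    xr, yr, wr, hr = ref
    return np.array([bx * wr + xr, by * hr + yr,
                     np.exp(bw) * wr, np.exp(bh) * hr])
\end{verbatim}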
\paragraph{Supervision of the RPN}
A positive RPN proposal is defined as one with an IoU of at least $0.7$ with
@ -571,7 +581,7 @@ with at most $50\%$ positive examples (if there are less positive examples,
more negative examples are used instead).
For examples selected in this way, a regression loss is computed between
predicted and ground truth bounding box encoding, and a classification loss
is computed for the predicted objectness.
is computed for the predicted objectness scores.
Specifically, let $s_i^* = 1$ if proposal $i$ is positive and $s_i^* = 0$ if
it is negative, let $s_i$ be the predicted objectness score and $b_i$, $b_i^*$ the
predicted and ground truth bounding box encodings.
@ -588,7 +598,7 @@ L_{box}^{RPN} = \frac{1}{N_{RPN}^{pos}} \sum_{i=1}^{N_{RPN}} s_i^* \cdot \ell_{r
\end{equation}
and
\begin{equation}
N_{RPN}^{pos} = \sum_{i=1}^{N_{pos}} s_i^*
N_{RPN}^{pos} = \sum_{i=1}^{N_{RPN}} s_i^*
\end{equation}
is the number of positive examples. Note that the bounding box loss is only
active for positive examples, and that the classification loss is computed
@ -648,14 +658,14 @@ During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring regio
from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
and passed through the RoI bounding box refinement and classification heads
(but not through the mask head).
After this, non-maximum supression (NMS) is applied to predicted RoIs with predicted non-background class,
with a maximum IoU of 0.7.
Then, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
after again extracting the corresponding features.
After this, non-maximum suppression (NMS) is applied to the predicted RoIs for which the predicted class is not the background class,
using an IoU threshold of 0.7 on the refined boxes.
Finally, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
after extracting the corresponding features again.
Thus, during inference, the features for the mask head are extracted using the refined
bounding boxes for the predicted class, instead of the RPN bounding boxes. This is important for not
introducing any misalignment, as we want to create the instance mask inside of the
more precise, refined detection bounding boxes.
introducing any misalignment, as the instance masks are to be created inside the
final, more precise, refined detection bounding boxes.
Furthermore, note that bounding box and mask predictions for all classes but the predicted
(highest scoring) class are discarded, and thus the output bounding
box and mask correspond to that class.

View File

@ -18,6 +18,12 @@ of our network is highly interpretable, which may also bring benefits for safety
applications.
\subsection{Future Work}
\paragraph{Training on all Virtual KITTI sequences}
We only trained our models on the \emph{clone} variants of the Virtual KITTI sequences
to make training faster.
In the future, it would be interesting to train on all variants, as the different
lighting conditions and angles should lead to a more general model.
\paragraph{Evaluation and finetuning on KITTI 2015}
Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
on which we do not train, but we have yet to evaluate on a real world dataset.
@ -138,19 +144,19 @@ In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
\paragraph{Masking prior to the RoI motion head}
Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
the backbone are integrated over the complete RoI window to yield the features
for motion estimation.
For example, average pooling is applied before the fully-connected layers in the variant without FPN.
However, ideally, the motion (image matching) information from the backbone should
For example, consider
Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
extracted RoI features before passing them into the motion head.
The intuition behind that is that we want to mask out (set to zero) any positions in the
extracted feature window which belong to the background. Then, the RoI motion
head could aggregate the motion (image matching) information from the backbone
over positions localized within the object only, but not over positions belonging
to the background, which should probably not influence the final object motion estimate.
% \paragraph{Masking prior to the RoI motion head}
% Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
% the backbone are integrated over the complete RoI window to yield the features
% for motion estimation.
% For example, average pooling is applied before the fully-connected layers in the variant without FPN.
% However, ideally, the motion (image matching) information from the backbone should
%
% For example, consider
%
% Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
% extracted RoI features before passing them into the motion head.
% The intuition behind that is that we want to mask out (set to zero) any positions in the
% extracted feature window which belong to the background. Then, the RoI motion
% head could aggregate the motion (image matching) information from the backbone
% over positions localized within the object only, but not over positions belonging
% to the background, which should probably not influence the final object motion estimate.

View File

@ -1,6 +1,6 @@
\subsection{Implementation}
Our networks and loss functions are implemented using built-in TensorFlow \cite{TensorFlow}
functions, enabling us to use automatic differentiation for all gradient
Our networks and loss functions are implemented using built-in TensorFlow
functions \cite{TensorFlow}, enabling us to use automatic differentiation for all gradient
computations. To make our code flexible and easy to extend, we build on
the TensorFlow Object Detection API \cite{TensorFlowObjectDetection}, which provides a Faster R-CNN baseline
implementation.
@ -49,18 +49,18 @@ let $[R_t^{ex}|t_t^{ex}]$
and $[R_{t+1}^{ex}|t_{t+1}^{ex}]$
be the camera extrinsics at the two frames.
We compute the ground truth camera motion
$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in \mathbf{SE}(3)$ as
$\{R_{cam}^*, t_{cam}^*\} \in \mathbf{SE}(3)$ as
\begin{equation}
R_{t}^{gt, cam} = R_{t+1}^{ex} \cdot \mathrm{inv}(R_t^{ex}),
R_{cam}^* = R_{t+1}^{ex} \cdot \mathrm{inv}(R_t^{ex}),
\end{equation}
\begin{equation}
t_{t}^{gt, cam} = t_{t+1}^{ex} - R_{t}^{ex} \cdot t_t^{ex}.
t_{cam}^* = t_{t+1}^{ex} - R_{cam}^* \cdot t_t^{ex}.
\end{equation}
Additionally, we define $o_t^{gt, cam} \in \{ 0, 1 \}$,
Additionally, we define $o_{cam}^* \in \{ 0, 1 \}$,
\begin{equation}
o_t^{gt, cam} =
o_{cam}^* =
\begin{cases}
1 &\text{if the camera pose changes between $t$ and $t+1$} \\
0 &\text{otherwise,}
@ -75,25 +75,25 @@ at $I_t$ and $I_{t+1}$.
Note that the pose at $t$ is given with respect to the camera at $t$ and
the pose at $t+1$ is given with respect to the camera at $t+1$.
We define the ground truth pivot $p_{t}^{gt, i} \in \mathbb{R}^3$ as
We define the ground truth pivot $p_k^* \in \mathbb{R}^3$ as
\begin{equation}
p_{t}^{gt, i} = t_t^i
p_k^* = t_t^i
\end{equation}
and compute the ground truth object motion
$\{R_t^{gt, i}, t_t^{gt, i}\} \in \mathbf{SE}(3)$ as
$\{R_k^*, t_k^*\} \in \mathbf{SE}(3)$ as
\begin{equation}
R_{t}^{gt, i} = \mathrm{inv}(R_t^{gt, cam}) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i),
R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i),
\end{equation}
\begin{equation}
t_{t}^{gt, i} = t_{t+1}^{i} - R_t^{gt, cam} \cdot t_t.
t_k^* = t_{t+1}^{i} - R_k^* \cdot t_t^i.
\end{equation}
As for the camera, we define $o_t^{gt, i} \in \{ 0, 1 \}$,
As for the camera, we define $o_k^* \in \{ 0, 1 \}$,
\begin{equation}
o_t^{gt, i} =
o_k^* =
\begin{cases}
1 &\text{if the position of object i changes between $t$ and $t+1$} \\
0 &\text{otherwise,}
@ -105,21 +105,19 @@ which specifies whether an object is moving in between the frames.
To evaluate the 3D instance and camera motions on the Virtual KITTI validation
set, we introduce a few error metrics.
Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
let $i_k$ be the index of the best matching ground truth example,
let $c_k$ be the predicted class,
let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$
and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
let $R_k, t_k, p_k, o_k$ be the predicted motion for the predicted class $c_k$
and $R_k^*, t_k^*, p_k^*, o_k^*$ the motion ground truth for the best matching example.
Then, assuming there are $N$ such detections,
\begin{equation}
E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right)
E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R_k^*) \cdot R_k) - 1}{2} \right\}\right\} \right)
\end{equation}
measures the mean angle of the error rotation between predicted and ground truth rotation,
\begin{equation}
E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R^{k,c_k}) \cdot (t^{gt,i_k} - t^{k,c_k}) \right\rVert_2,
E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R_k) \cdot (t_k^* - t_k) \right\rVert_2,
\end{equation}
is the mean Euclidean norm between predicted and ground truth translation, and
\begin{equation}
E_{p} = \frac{1}{N}\sum_k \left\lVert p^{gt,i_k} - p^{k,c_k} \right\rVert_2
E_{p} = \frac{1}{N}\sum_k \left\lVert p_k^* - p_k \right\rVert_2
\end{equation}
is the mean Euclidean norm between predicted and ground truth pivot.
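These metrics can be transcribed directly, as in the following NumPy sketch of $E_R$, $E_t$ and $E_p$.
\begin{verbatim}
import numpy as np

def motion_errors(R_pred, t_pred, p_pred, R_gt, t_gt, p_gt):
    """E_R, E_t, E_p over N matched detections.
    R_*: (N, 3, 3); t_*, p_*: (N, 3)."""
    # angle of the error rotation inv(R*) . R, recovered from its trace
    rel = np.matmul(np.transpose(R_gt, (0, 2, 1)), R_pred)
    cos_angle = np.clip((np.trace(rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    E_R = np.mean(np.arccos(cos_angle))
    # translation error rotated by inv(R_k), as in the formula above
    t_err = np.einsum('nij,nj->ni', np.transpose(R_pred, (0, 2, 1)), t_gt - t_pred)
    E_t = np.mean(np.linalg.norm(t_err, axis=-1))
    E_p = np.mean(np.linalg.norm(p_gt - p_pred, axis=-1))
    return E_R, E_t, E_p
\end{verbatim}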
Moreover, we define precision and recall measures for the detection of moving objects,
@ -135,29 +133,30 @@ O_{rc} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}
is the fraction of objects correctly classified as moving among all objects which are actually moving.
Here, we used
\begin{equation}
\mathit{TP} = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 1],
\mathit{TP} = \sum_k [o_k = 1 \land o_k^* = 1],
\end{equation}
\begin{equation}
\mathit{FP} = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 0],
\mathit{FP} = \sum_k [o_k = 1 \land o_k^* = 0],
\end{equation}
and
\begin{equation}
\mathit{FN} = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
\mathit{FN} = \sum_k [o_k = 0 \land o_k^* = 1].
\end{equation}
Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
predicted camera motions.
the predicted camera motion.
\subsection{Virtual KITTI: Training setup}
\label{ssec:setup}
For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both, the Motion R-CNN and -FPN variants.
We train both the Motion R-CNN ResNet and ResNet-FPN variants.
\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI training set.
We train for a total of 192K iterations on the Virtual KITTI training set.
For this, we use a single Titan X (Pascal) GPU and a batch size of 1,
which results in approximately one day of training.
As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
momentum of $0.9$.
As learning rate we use $0.25 \cdot 10^{-2}$ for the

View File

@ -11,26 +11,30 @@ and estimates their 3D locations as well as all 3D object motions between the fr
\subsection{Motivation}
For moving in the real world, it is generally desirable to know which objects exists
For moving in the real world, it is often desirable to know which objects exist
in the proximity of the moving agent,
where they are located relative to the agent,
and where they will be at some point in the future.
and where they will be at some point in the near future.
In many cases, it would be preferable to infer such information from video data
if technically feasible, as camera sensors are cheap and ubiquitous.
if technically feasible, as camera sensors are cheap and ubiquitous
(compared to, for example, Lidar).
For example, in autonomous driving, it is crucial to not only know the position
As an example, consider the autonomous driving problem.
Here, it is crucial to not only know the position
of each obstacle, but to also know if and where the obstacle is moving,
and to use sensors that will not make the system too expensive for widespread use.
At the same time, the autonomous driving system has to operate in real time to
react quickly enough for safely controlling the vehicle.
A promising approach for 3D scene understanding in situations like these are deep neural
A promising approach for 3D scene understanding in situations such as autonomous driving is deep neural
networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification
in still images and are more and more often being applied to video data.
A key benefit of end-to-end deep networks is that they can, in principle,
A key benefit of deep networks is that they can, in principle,
enable very fast inference on real time video data and generalize
over many training examples to resolve ambiguities inherent in image understanding
over many training situations to resolve ambiguities inherent in image understanding
and motion estimation.
Thus, in this work, we aim to develop end-to-end deep networks which can, given
Thus, in this work, we aim to develop deep neural networks which can, given
sequences of images, segment the image pixels into object instances and estimate
the location and 3D motion of each object instance relative to the camera
(Figure \ref{figure:teaser}).
@ -39,9 +43,12 @@ the location and 3D motion of each object instance relative to the camera
Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a batch of binary full image masks specyfing the object memberships of individual pixels with a standard encoder-decoder
network for pixel-wise prediction. A fully-connected network branching off the encoder predicts a 3D motion for each object.
However, due to the fixed number of objects masks, the system can only predict a small number of motions and
Using a standard encoder-decoder network for pixel-wise dense prediction,
SfM-Net predicts a pre-determined number of binary masks ranging over the complete image,
with each mask specifying the membership of the image pixels to one object.
A fully-connected network branching off the encoder then predicts a 3D motion for each object,
as well as the camera ego-motion.
However, due to the fixed number of object masks, the system can in practice only predict a small number of motions and
often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions (Figure \ref{figure:sfmnet_kitti}).
\begin{figure}[t]
\centering
@ -64,9 +71,11 @@ deep learning approaches to motion estimation, may significantly benefit motion
estimation by structuring the problem, creating physical constraints and reducing
the dimensionality of the estimate.
A scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{MaskRCNN}, which inherits the ability to detect
a large number of objects from a large number of classes at once from Faster R-CNN
In the context of still images, a
scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{MaskRCNN}.
Mask R-CNN inherits the ability to detect
a large number of objects from a large number of classes at once from Faster R-CNN \cite{FasterRCNN}
and predicts pixel-precise segmentation masks for each detected object (Figure \ref{figure:maskrcnn_cs}).
\begin{figure}[t]
@ -126,7 +135,7 @@ image depending on the semantics of each region or pixel, which include whether
pixel belongs to the background, to which object instance it belongs if it is not background,
and the class of the object it belongs to.
Often, failure cases of these methods include motion boundaries or regions with little texture,
where semantics become important.
where semantics become very important.
Extensions of these approaches to scene flow estimate flow and depth
with similarly generic networks \cite{SceneFlowDataset} and similar limitations.
@ -171,14 +180,14 @@ These concerns restrict the applicability of the current slanted plane models in
which often require estimation to be done in real time and for which an end-to-end
approach based on learning would be preferable.
Futhermore, in other contexts, the move towards end-to-end deep learning has often lead
By analogy, in other contexts, the move towards end-to-end deep learning has often led
to significant benefits in terms of accuracy and speed.
As an example, consider the evolution of region-based convolutional networks, which started
out as prohibitively slow with a CNN as a single component and
became very fast and much more accurate over the course of their development into
end-to-end deep networks.
Thus, in the context of motion estimation, one could expect end-to-end deep learning to not only bring large improvements
Thus, in the context of motion estimation, one may expect end-to-end deep learning to not only bring large improvements
in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation
and the ability of deep networks to learn to handle ambiguity from a large variety of training examples.
@ -201,15 +210,15 @@ with a brightness constancy proxy loss.
Like SfM-Net, we aim to estimate 3D motion and instance segmentation jointly with
end-to-end deep learning.
Unlike SfM-Net, we build on a scalable object detection and instance segmentation
approach with R-CNNs, which provide a strong baseline.
approach with R-CNNs, which provide us with a strong baseline for these tasks.
\paragraph{End-to-end deep networks for camera pose estimation}
Deep networks have been used for estimating the 6-DOF camera pose from
a single RGB frame \cite{PoseNet, PoseNet2}, or for estimating depth and camera ego-motion
from monocular video \cite{UnsupPoseDepth}.
These works are related to
ours in that we also need to output various rotations and translations from a deep network
and thus need to solve similar regression problems and use similar parametrizations
ours in that we also need to output various rotations and translations from a deep network,
and thus need to solve similar regression problems and may be able to use similar parametrizations
and losses.
@ -217,8 +226,8 @@ and losses.
First, in section \ref{sec:background}, we introduce preliminaries and building
blocks from earlier works that serve as a foundation for our networks and losses.
Most importantly, we review the ResNet CNN (\ref{ssec:resnet}) that will serve as our CNN backbone
as well as the developments in region-based CNNs onto which we build (\ref{ssec:rcnn}),
specifically Mask R-CNN and the FPN \cite{FPN}.
as well as the developments in region-based CNNs which we build on (\ref{ssec:rcnn}),
specifically Mask R-CNN and the Feature Pyramid Network (FPN) \cite{FPN}.
In section \ref{sec:approach}, we describe our technical contribution, starting
with our motion estimation model and modifications to the Mask R-CNN backbone and head networks (\ref{ssec:model}),
followed by our losses and supervision methods for training

View File

@ -125,39 +125,39 @@
%\pagenumbering{arabic} % Arabische Seitenzahlen
\section{Introduction}
\label{sec:introduction}
\parindent 2em
\onehalfspacing
\input{introduction}
\label{sec:introduction}
\section{Background}
\label{sec:background}
\parindent 2em
\onehalfspacing
\label{sec:background}
\input{background}
\section{Motion R-CNN}
\label{sec:approach}
\parindent 2em
\onehalfspacing
\label{sec:approach}
\input{approach}
\section{Experiments}
\label{sec:experiments}
\parindent 2em
\onehalfspacing
\input{experiments}
\label{sec:experiments}
\section{Conclusion}
\label{sec:conclusion}
\parindent 2em
\onehalfspacing
\input{conclusion}
\label{sec:conclusion}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bibliografie mit BibLaTeX