full editing pass

Simon Meister 2017-11-20 20:45:26 +01:00
parent 5b046e41b5
commit bcf3adc60e
7 changed files with 234 additions and 198 deletions


@@ -28,7 +28,7 @@ we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D camera,
our resulting end-to-end deep network detects objects with precise per-pixel
object masks and estimates the 3D motion of each detected object between the frames.
By additionally estimating the camera ego-motion in the same network,
we compose a dense optical flow field based on instance-level and global motion
predictions. We train our network on the synthetic Virtual KITTI dataset,
which provides ground truth for all components of our system.
@@ -62,7 +62,7 @@ Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentieru
Bei Eingabe von zwei aufeinanderfolgenden Frames aus einer monokularen RGB-D
Kamera erkennt unser end-to-end Deep Network Objekte mit pixelgenauen Objektmasken
und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den Frames ab.
Indem wir zusätzlich im selben Netzwerk die Eigenbewegung der Kamera schätzen,
setzen wir aus den instanzbasierten und globalen Bewegungsschätzungen ein dichtes
optisches Flussfeld zusammen.
Wir trainieren unser Netzwerk auf dem synthetischen Virtual KITTI Datensatz,


@@ -7,7 +7,7 @@ we estimate per-object motion by predicting the 3D motion of each detected objec
For this, we extend Mask R-CNN in two straightforward ways.
First, we modify the backbone network and provide two frames to the R-CNN system
in order to enable image matching between the consecutive frames.
Second, we extend the Mask R-CNN RoI head to predict a 3D motion and pivot for each
region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
respectively.
@@ -18,7 +18,7 @@ respectively.
\toprule
\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\
\midrule
C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
\midrule
@@ -69,7 +69,7 @@ additionally dropout with $p = 0.5$ after all fully-connected hidden layers.
\toprule
\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\
\midrule
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
@@ -121,32 +121,32 @@ Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backb
Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
Additionally, we experiment with concatenating the camera space XYZ coordinates for each frame,
XYZ$_t$ and XYZ$_{t+1}$, into the input.
We do not introduce a separate network for computing region proposals and use our modified backbone network
as both the first stage RPN and the second stage feature extractor for the RoI features.
Technically, our feature encoder network will have to learn image matching representations similar to
those learned by the FlowNet encoder, but the output will be computed in the
object-centric framework of a region based convolutional network head with a 3D parametrization.
Thus, in contrast to the dense FlowNet decoder, the estimated dense image matching information
from the encoder is integrated for specific objects via RoI extraction and
processed by the RoI head for each object.
\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
For the $k$-th object proposal, we predict the rigid transformation $\{R_k, t_k\}\in \mathbf{SE}(3)$
\footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
and translations: $\{R, t|R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_k \in \mathbb{R}^3$ at time $t$.
We parametrize ${R_k}$ using an Euler angle representation,
\begin{equation}
R_k = R_k^z(\gamma) \cdot R_k^x(\alpha) \cdot R_k^y(\beta),
\end{equation}
where
\begin{equation}
R_k^x(\alpha) =
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos(\alpha) & -\sin(\alpha) \\
@@ -155,7 +155,7 @@ R_t^{k,x}(\alpha) =
\end{equation}
\begin{equation}
R_k^y(\beta) =
\begin{pmatrix}
\cos(\beta) & 0 & \sin(\beta) \\
0 & 1 & 0 \\
@@ -164,7 +164,7 @@ R_t^{k,y}(\beta) =
\end{equation}
\begin{equation}
R_k^z(\gamma) =
\begin{pmatrix}
\cos(\gamma) & -\sin(\gamma) & 0 \\
\sin(\gamma) & \cos(\gamma) & 0 \\
@@ -179,25 +179,26 @@ prediction in addition to the fully-connected layers for
refined boxes and classes and the convolutional network for the masks.
As for refined boxes and masks, we make one separate motion prediction for each class.
Each instance motion is predicted as a set of nine scalar parameters,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_k$ and $p_k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction along any axis,
which is in general a safe assumption for image sequences from videos
and enables us to obtain unique cosine values from the predicted sine values.
All predictions are made in camera space, and translation and pivot predictions are in meters.
We additionally predict softmax scores $o_k$ for classifying the objects into
still and moving objects. As a postprocessing step, for any object instance $k$ with predicted moving flag $o_k = 0$,
we set $\sin(\alpha) = \sin(\beta) = \sin(\gamma) = 0$ and $t_k = (0,0,0)^T$,
and thus predict an identity motion.
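As a concrete illustration of this parametrization, the following NumPy sketch (function and variable names are ours, chosen for illustration only) recovers the rotation matrix from the three predicted, clipped sine values and applies the resulting rigid motion to a set of 3D points:

import numpy as np

def rotation_from_sines(sin_a, sin_b, sin_c):
    # Recover R_k = R_z(gamma) R_x(alpha) R_y(beta) from the predicted sines.
    # Assuming rotations of at most 90 degrees, cos = sqrt(1 - sin^2) is unique.
    cos_a, cos_b, cos_c = (np.sqrt(1.0 - s ** 2) for s in (sin_a, sin_b, sin_c))
    R_x = np.array([[1.0, 0.0, 0.0],
                    [0.0, cos_a, -sin_a],
                    [0.0, sin_a, cos_a]])
    R_y = np.array([[cos_b, 0.0, sin_b],
                    [0.0, 1.0, 0.0],
                    [-sin_b, 0.0, cos_b]])
    R_z = np.array([[cos_c, -sin_c, 0.0],
                    [sin_c, cos_c, 0.0],
                    [0.0, 0.0, 1.0]])
    return R_z @ R_x @ R_y

def apply_instance_motion(P, R, t, pivot, moving):
    # Rotate points P (N x 3, camera space) about the pivot and translate them.
    # For objects classified as still (moving == 0), the identity motion is used.
    if not moving:
        return P
    return (P - pivot) @ R.T + pivot + t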
\paragraph{Camera motion prediction}
In addition to the object transformations, we optionally predict the camera motion $\{R_{cam}, t_{cam}\}\in \mathbf{SE}(3)$
between the two frames $I_t$ and $I_{t+1}$.
For this, we branch off a small fully-connected network from the bottleneck output of the backbone.
We again represent $R_{cam}$ using an Euler angle representation and
predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_{cam}$ in the same way as for the individual objects.
Again, we predict a softmax score $o_{cam}$ for differentiating between
a still and a moving camera.
\subsection{Network design}
@@ -207,19 +208,19 @@ a still and moving camera.
In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
feature resolution prior to RoI extraction would be reduced too much.
Therefore, in our variant without FPN, we first pass the $C_4$ features through $C_5$
and $C_6$ blocks (with weights independent from the $C_5$ block used in the RoI head in this variant)
to increase the bottleneck stride prior to the camera motion network to 64.
In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}),
the backbone makes use of all blocks through $C_6$, and
we can simply branch off our camera motion network from the $C_6$ bottleneck.
Then, in both the ResNet and ResNet-FPN variants, we apply an additional
convolution to the $C_6$ features to reduce the number of inputs to the following
fully-connected layers, and thus keep the number of weights reasonably small.
Instead of averaging, we use bilinear resizing to bring the convolutional features
to a fixed size without losing all spatial information,
flatten them, and finally apply multiple fully-connected layers to predict the
camera motion parameters.
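A minimal PyTorch-style sketch of such a camera motion head is given below; the resize target, hidden width and number of layers are illustrative assumptions rather than the configuration used in our experiments:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraMotionHead(nn.Module):
    # Hypothetical sketch of the camera motion branch described above.
    def __init__(self, in_channels, resize_hw=(3, 10), hidden=512):
        super().__init__()
        self.resize_hw = resize_hw
        flat = in_channels * resize_hw[0] * resize_hw[1]
        self.fc = nn.Sequential(nn.Linear(flat, hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden), nn.ReLU())
        self.sines = nn.Linear(hidden, 3)   # sin(alpha), sin(beta), sin(gamma)
        self.t_cam = nn.Linear(hidden, 3)   # camera translation in meters
        self.o_cam = nn.Linear(hidden, 2)   # still / moving logits

    def forward(self, c6):
        # c6: bottleneck features of shape [N, C, h, w]
        x = F.interpolate(c6, size=self.resize_hw, mode="bilinear", align_corners=False)
        x = torch.flatten(x, start_dim=1)
        x = self.fc(x)
        # the predicted sines are clipped to [-1, 1], as for the object motions
        return torch.clamp(self.sines(x), -1.0, 1.0), self.t_cam(x), self.o_cam(x)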
\paragraph{RoI motion head network}
In both of our network variants
@@ -227,6 +228,15 @@ In both of our network variants
we compute the fully-connected network for motion prediction from the
flattened RoI features, which are also the basis for classification and
bounding box refinement.
Note that the features (extracted from the upsampled FPN stage appropriate to the RoI bounding box scales)
passed to our ResNet-FPN RoI head went through the $C_6$
bottleneck, which has a stride of 64 with respect to the original image.
In contrast, the bottleneck for the features passed to our ResNet RoI head
is $C_4$ (with a stride of 16). Thus, the ResNet-FPN variant can in principle estimate
object motions based on larger displacements than the ResNet variant.
Additionally, as smaller bounding boxes use higher resolution features, the
motions and pivots of (especially smaller) objects can in principle be more accurately
estimated with the FPN variant.
\subsection{Supervision}
\label{ssec:supervision}
@@ -235,42 +245,41 @@ bounding box refinement.
The most straightforward way to supervise the object motions is by using ground truth
motions computed from ground truth object poses, which is in general
only practical when training on synthetic datasets.
Given the $k$-th foreground RoI (as defined for Mask R-CNN) with ground truth class $c_k^*$,
let $R_k, t_k, p_k, o_k$ be the predicted motion for class $c_k^*$ as parametrized above,
and $R_k^*, t_k^*, p_k^*, o_k^*$ the ground truth motion for the matched ground truth example.
Similar to the camera pose regression loss in \cite{PoseNet2},
we use a variant of the $\ell_1$-loss to penalize the differences between ground truth and predicted
rotation, translation (and pivot, in our case). We found that the smooth $\ell_1$-loss
performs better in our case than the standard $\ell_1$-loss.
We thus compute the RoI motion loss as
\begin{equation}
L_{motion} = \frac{1}{N_{RoI}^{fg}} \sum_k^{N_{RoI}} l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o_k^* + l_o^k,
\end{equation}
where
\begin{equation}
l_{R}^k = \ell_{reg} (R_k^* - R_k),
\end{equation}
\begin{equation}
l_{t}^k = \ell_{reg} (t_k^* - t_k),
\end{equation}
\begin{equation}
l_{p}^k = \ell_{reg} (p_k^* - p_k)
\end{equation}
are the smooth-$\ell_1$ losses for the predicted rotation, translation and pivot,
respectively, and
\begin{equation}
l_o^k = \ell_{cls}(o_k, o_k^*)
\end{equation}
is the cross-entropy loss for the predicted classification into moving and non-moving objects.
Note that we do not penalize the rotation and translation for objects with
$o_k^* = 0$, which do not move between $t$ and $t+1$. We found that the network
may not reliably predict exact identity motions for still objects, which is
numerically more difficult to optimize than performing classification between
moving and non-moving objects and discarding the regression for the non-moving
ones. Also, analogously to masks and bounding boxes, the estimates for classes
other than $c_k^*$ are not penalized.
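For concreteness, a NumPy sketch of this loss is given below (here the rotation residual is taken on the predicted sine parameters and background RoIs are masked out; both are assumptions of this sketch, not statements about our implementation):

import numpy as np

def smooth_l1(x):
    # Smooth L1 (Huber) penalty, summed over the parameter axis.
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5).sum(axis=-1)

def roi_motion_loss(sines, sines_gt, t, t_gt, p, p_gt, o_logits, o_gt, is_fg):
    # sines: predicted (sin(alpha), sin(beta), sin(gamma)) per RoI; o_gt is 1 for
    # moving and 0 for still ground truth objects; is_fg marks foreground RoIs.
    l_R = smooth_l1(sines_gt - sines)
    l_t = smooth_l1(t_gt - t)
    l_p = smooth_l1(p_gt - p)
    # cross-entropy for the moving/still classification
    z = o_logits - o_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    l_o = -log_probs[np.arange(len(o_gt)), o_gt]
    per_roi = l_p + (l_R + l_t) * o_gt + l_o
    return (per_roi * is_fg).sum() / max(is_fg.sum(), 1)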
Now, our modified RoI loss is
@@ -281,10 +290,10 @@ L_{RoI} = L_{cls} + L_{box} + L_{mask} + L_{motion}.
\paragraph{Camera motion supervision}
We supervise the camera motion with ground truth analogously to the
object motions, with the only difference being that we only have
a rotation and translation, but no pivot loss for the camera motion.
If the ground truth shows that the camera is not moving, we again do not
penalize rotation and translation. In this case, the camera motion loss is reduced to the
classification loss.
\paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth}
\begin{figure}[t]
@@ -310,19 +319,20 @@ In this case, for any RoI,
we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask.
We use the same bounding box
to crop the corresponding region from the dense, full-image depth map
and bilinearly resize the depth crop to the same resolution as the mask and point
grid.
Next, we create a 3D point cloud from the point grid and depth crop. To this point cloud, we
apply the object motion predicted for the RoI, masked by the predicted mask.
Then, we apply the camera motion to the points, project them back to 2D
and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
Note that we batch this computation over all RoIs, so that we only perform
it once per forward pass.
Figure \ref{figure:flow_loss} illustrates the approach.
The mathematical details for the 3D transformations and mappings between 2D and 3D are analogous to the
dense, full-image flow composition in the following subsection, so we will not
duplicate them here. The only differences are that there is no sum over objects during
the point transformation based on instance motion, as we consider the single object
corresponding to an RoI in isolation, and that the masks are not resized to the
full image resolution, as
@@ -333,21 +343,22 @@ For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion
by penalizing the $m \times m$ optical flow grid.
If there is optical flow ground truth available, we can use the RoI bounding box to
crop and resize a region from the ground truth optical flow to match the RoI's
optical flow grid and penalize the difference between the flow grids with a (smooth) $\ell_1$-loss.
However, we can also use the re-projection loss without optical flow ground truth
to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}.
In this case, we can use the bounding box to crop and resize a corresponding region
from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$
using the 2D grid displaced with the predicted flow grid (the latter is often called \emph{backward warping}).
Then, we can penalize the difference
between the resulting image crops, for example, with a census loss \cite{CensusTerm,UnFlow}.
For more details on differentiable bilinear sampling for deep learning, we refer the reader to
\cite{STN}.
When compared to supervision with 3D instance motion ground truth, a re-projection
loss could benefit motion regression by removing any loss balancing issues between the
rotation, translation and pivot losses \cite{PoseNet2},
which could make it interesting even when 3D motion ground truth is available.
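To make the backward-warping step concrete, here is a small offline NumPy/SciPy sketch of an image-based penalty; it uses a plain $\ell_1$ brightness difference instead of the census loss mentioned above, and SciPy's bilinear interpolation stands in for the differentiable bilinear sampling used inside a network:

import numpy as np
from scipy.ndimage import map_coordinates

def photometric_reprojection_loss(crop_t, image_t1, grid_x, grid_y, flow):
    # crop_t: [m, m, 3] crop of I_t sampled at the grid points (grid_y, grid_x);
    # flow: [m, m, 2] predicted optical flow at the same points.
    x = grid_x + flow[..., 0]
    y = grid_y + flow[..., 1]
    # backward-warp I_{t+1} to frame t by bilinearly sampling at the displaced grid
    warped = np.stack(
        [map_coordinates(image_t1[..., c], [y, x], order=1, mode="nearest")
         for c in range(image_t1.shape[-1])], axis=-1)
    return np.abs(crop_t - warped).mean()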
\subsection{Training and inference}
\label{ssec:training_inference}
@@ -368,7 +379,7 @@ highest scoring class.
\subsection{Dense flow from motion}
\label{ssec:postprocessing}
As a postprocessing step, we compose the dense optical flow between $I_t$ and $I_{t+1}$ from the outputs of our Motion R-CNN network.
Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$,
where
\begin{equation}
@@ -383,33 +394,34 @@ x_t - c_0 \\ y_t - c_1 \\ f
\end{pmatrix},
\end{equation}
is the 3D coordinate at $t$ corresponding to the point with pixel coordinates $x_t, y_t$,
which range over all coordinates in $I_t$,
and $(c_0, c_1, f)$ are the camera intrinsics.
For now, the depth map is always assumed to come from ground truth.
Given $k$ detections with predicted motions as above, we transform all points within the bounding
box of a detected object according to the predicted motion of the object.
We first define the \emph{full image} mask $M_k$ for object $k$,
which can be computed from the predicted box mask $m_k$ (for the predicted class) by bilinearly resizing
it to the width and height of the predicted bounding box and then copying the values
of the resized mask into a full resolution mask initialized with zeros,
starting at the top-left coordinate of the predicted bounding box.
Then, given the predicted motions $(R_k, t_k)$, as well as $p_k$ for all objects,
\begin{equation}
P'_{t+1} =
P_t + \sum_1^{k} M_k\left\{ R_k \cdot (P_t - p_k) + p_k + t_k - P_t \right\}.
\end{equation}
These motion predictions are understood to have already taken into account
the classification into moving and still objects,
and we thus, as described above, have identity motions for all objects with $o_k = 0$.
Next, we transform all points given the camera transformation $\{R_{cam}, t_{cam}\} \in \mathbf{SE}(3)$,
\begin{equation}
\begin{pmatrix}
X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
\end{pmatrix}
= P_{t+1} = R_{cam} \cdot P'_{t+1} + t_{cam}.
\end{equation}
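The complete composition can be summarized in a short NumPy sketch (names are ours; the per-object motions are assumed to already include the moving/still postprocessing described above):

import numpy as np

def compose_flow(depth, intrinsics, masks, motions, cam_motion):
    # depth: [H, W] depth map d_t; intrinsics = (c0, c1, f);
    # masks: list of full-image masks M_k; motions: list of (R_k, t_k, p_k);
    # cam_motion: (R_cam, t_cam).
    c0, c1, f = intrinsics
    H, W = depth.shape
    x_t, y_t = np.meshgrid(np.arange(W, dtype=np.float64),
                           np.arange(H, dtype=np.float64))
    # unproject all pixels to a 3D point cloud P_t in camera space at time t
    P_t = (depth[..., None] / f) * np.stack(
        [x_t - c0, y_t - c1, np.full_like(x_t, f)], axis=-1)
    # apply the masked rigid motion of every detected object
    P = P_t.copy()
    for M, (R, t, p) in zip(masks, motions):
        moved = (P_t - p) @ R.T + p + t
        P += M[..., None] * (moved - P_t)
    # apply the camera motion and project back to the image plane
    R_cam, t_cam = cam_motion
    P = P @ R_cam.T + t_cam
    x_t1 = f * P[..., 0] / P[..., 2] + c0
    y_t1 = f * P[..., 1] / P[..., 2] + c1
    return np.stack([x_t1 - x_t, y_t1 - y_t], axis=-1)  # optical flow (u, v)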
Note that in our experiments, we either use the ground truth camera motion to focus


@@ -8,10 +8,10 @@ The optical flow
$\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$
maps pixel coordinates in the first frame $I_t$ to pixel coordinates of the
visually corresponding pixel in the second frame $I_{t+1}$,
and can be interpreted as the apparent movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to three-dimensional space and additionally
requires estimating depth for each pixel. Generally, stereo input is used for scene flow
to estimate disparity-based depth; however, monocular depth estimation with deep networks is also becoming
popular \cite{DeeperDepth, UnsupPoseDepth}.
@@ -47,7 +47,7 @@ flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
\bottomrule
\end{tabular}
\caption {
Overview of the FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
are used for refinement.
}
\label{table:flownets}
@@ -70,21 +70,22 @@ performing upsampling of the compressed features and resulting in an encoder-deco
The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
Table \ref{table:flownets} gives an overview of the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is a rather generic autoencoder and is specialized for optical flow only through being trained
with supervision from dense optical flow ground truth.
Potentially, the same network could also be used for semantic segmentation if
the number of final and intermediate output channels were adapted from two to the number of classes.
Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching arguably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated depends on the number of strided 2D convolutions (and the stride they use) and pooling
operations in the encoder.
Recently, other, similarly generic,
encoder-decoder CNNs have been applied to optical flow prediction as well \cite{DenseNetDenseFlow}.
\subsection{SfM-Net}
Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture we described
in the introduction.
Motions and full-image masks for a fixed number N$_{motions}$ of independent objects
are predicted in addition to a depth map, and an unsupervised re-projection loss based on
image brightness differences penalizes the predictions.
@@ -103,7 +104,7 @@ image brightness differences penalizes the predictions.
& input images $I_t$ and $I_{t+1}$ & H $\times$ W $\times$ 6 \\
& Conv-Deconv & H $\times$ W $\times$ 32 \\
masks & 1 $\times$1 conv, N$_{motions}$ & H $\times$ W $\times$ N$_{motions}$ \\
FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}$ $\times$ 2 & 1 $\times$ 512 \\
object motions & fully connected, $N_{motions} \cdot$ 9 & 1 $\times$ $N_{motions} \cdot$ 9 \\
camera motion & From FC: $\times$ 2 & 1 $\times$ 6 \\
\midrule
@@ -118,7 +119,7 @@ depth & 1 $\times$1 conv, 1 & H $\times$ W $\times$ 1 \\
\end{tabular}
\caption {
SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully-convolutional
encoder-decoder network, where convolutions and deconvolutions with stride 2 are
used for downsampling and upsampling, respectively. The stride at the bottleneck
with respect to the input image is 32.
@@ -147,7 +148,7 @@ Note that for the Mask R-CNN architectures we describe below, this is equivalent
to the standard ResNet-50 backbone. We now introduce one small extension that
will be useful for our Motion R-CNN network.
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
@@ -166,9 +167,9 @@ to increase the bottleneck stride to 64, following FlowNetS.
\multicolumn{3}{c}{\textbf{ResNet}}\\
\midrule
C$_1$ & 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\
& 3 $\times$ 3 max pool, stride 2 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 64 \\
\midrule
C$_2$ &
$\begin{bmatrix}
1 \times 1, 64 \\
@@ -242,8 +243,8 @@ most popular deep networks for object detection, and have recently also been app
\paragraph{R-CNN}
Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped using the region's bounding box and the crop is
passed through the CNN, which performs classification of the object (or non-object, if the region shows background).
\paragraph{Fast R-CNN}
The original R-CNN involves computing one forward pass of the CNN for each of the region proposals,
@@ -256,8 +257,8 @@ The extracted per-RoI (region of interest) feature maps are collected into a bat
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
The extraction technique is called \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the full image features
is divided into an H $\times$ W grid of cells. For each cell, the values of the underlying
full-image feature map are max-pooled to yield the output value at the cell.
Thus, given region proposals, all computation is reduced to a single pass through the complete network,
speeding up the system by two orders of magnitude at inference time and one order of magnitude
at training time.
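The cell-wise max-pooling can be sketched as follows (a simplified NumPy illustration; the 7 x 7 output size and the assumption that the box lies inside the feature map are ours):

import numpy as np

def roi_pool(features, box, out_h=7, out_w=7):
    # features: [H, W, C] full-image feature map;
    # box = (x0, y0, x1, y1) in feature-map coordinates, assumed to lie inside the map.
    H, W, _ = features.shape
    x0, y0, x1, y1 = box
    xs = np.linspace(x0, x1, out_w + 1)
    ys = np.linspace(y0, y1, out_h + 1)
    out = np.empty((out_h, out_w, features.shape[-1]), dtype=features.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # quantized cell boundaries; the original RoI pooling is not pixel-aligned
            ya = min(int(ys[i]), H - 1)
            yb = min(max(int(np.ceil(ys[i + 1])), ya + 1), H)
            xa = min(int(xs[j]), W - 1)
            xb = min(max(int(np.ceil(xs[j + 1])), xa + 1), W)
            out[i, j] = features[ya:yb, xa:xb].max(axis=(0, 1))
    return out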
@@ -297,15 +298,15 @@ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
\midrule
M$_0$ & From R$_1$: 2 $\times$ 2 deconv, 256, stride 2 & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\
& 1 $\times$ 1 conv, N$_{cls}$ & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ N$_{cls}$ \\
masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
\bottomrule
\end{tabular}
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet-50 \cite{ResNet} architecture.
Note that this is equivalent to the Faster R-CNN ResNet-50 architecture if the mask
head is left out. In Mask R-CNN, bilinear sampling is used for RoI extraction,
whereas Faster R-CNN uses RoI pooling.
}
\label{table:maskrcnn_resnet}
\end{table}
@@ -317,17 +318,17 @@ After streamlining the CNN components, Fast R-CNN is limited by the speed of the
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN
and again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
predicts objectness scores and regresses bounding boxes at each of its output positions.
At any of the $h \times w$ output positions of the RPN head,
$N_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $N_a$ \emph{anchors} with different
aspect ratios and scales. Thus, there are $N_a \times h \times w$ reference anchors in total.
In Faster R-CNN, $N_a = 9$, with 3 scales, corresponding
to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios,
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).
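The anchor grid itself can be generated as in the following NumPy sketch (placing anchor centers at cell centers is an assumption of this illustration):

import numpy as np

def generate_anchors(h, w, stride=16, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Returns [h * w * 9, 4] anchor boxes (x0, y0, x1, y1) in image coordinates,
    # 9 anchors (3 scales x 3 aspect ratios) per RPN output position.
    base = []
    for s in scales:
        for r in ratios:
            # keep the anchor area close to s^2 while varying the aspect ratio w:h = r
            aw, ah = s * np.sqrt(r), s / np.sqrt(r)
            base.append([-aw / 2, -ah / 2, aw / 2, ah / 2])
    base = np.array(base)                                  # [9, 4]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    centers = np.stack([xs, ys, xs, ys], axis=-1).reshape(-1, 1, 4) * stride + stride / 2
    return (centers + base).reshape(-1, 4)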
@@ -337,8 +338,12 @@ The region proposals can then be obtained as the N highest scoring RPN predictio
Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each of the region proposals, which are now obtained
from the RPN instead of being pre-computed by an external algorithm.
As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals,
and the refined bounding boxes are predicted separately for each object class.
Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
(here, the mask head is ignored).
\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
@@ -346,18 +351,20 @@ However, it can be helpful to know class and object (instance) membership of all
which generally involves computing a binary mask for each object instance specifying which pixels belong
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
fixed resolution instance masks within the bounding boxes of each detected object,
which are then bilinearly resized to fit inside the respective bounding boxes.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise binary mask for each instance.
The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
competition between classes in the mask prediction branch.
Additionally, an important technical aspect of Mask R-CNN is the replacement of RoI pooling with
bilinear sampling for extracting the RoI features, which is much more precise.
In the original RoI pooling from Fast R-CNN, the bins for max-pooling are not aligned with the actual pixel
boundary of the bounding box, and thus some detail is lost.
{
\begin{table}[h]
\centering
@@ -367,7 +374,7 @@ boundary of the bounding box, and thus some detail is lost.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet \{up to C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Feature Pyramid Network (FPN)}}\\
\midrule
@@ -403,11 +410,11 @@ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
M$_1$ & From R$_2$: $\begin{bmatrix}\textrm{3 $\times$ 3 conv} \end{bmatrix}$ $\times$ 4, 256 & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\
& 2 $\times$ 2 deconv, 256, stride 2 & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ 256 \\
& 1 $\times$ 1 conv, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
\bottomrule
\end{tabular}
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
Operations enclosed in a []$_p$ block make up a single FPN
block (see Figure \ref{figure:fpn_block}).
}
@@ -416,28 +423,29 @@ block (see Figure \ref{figure:fpn_block}).
}
\paragraph{Feature Pyramid Networks}
In Faster R-CNN, a single feature map is used as the source of all RoI features during RoI extraction, independent
of the size of the bounding box of each RoI.
However, for small objects, the C$_4$ (see Table \ref{table:resnet}) features
might have lost too much spatial information to allow properly predicting the exact bounding
box and a high resolution mask.
As a solution to this, the Feature Pyramid Network (FPN) \cite{FPN} enables features
of an appropriate scale to be used for RoI extraction, depending on the size of the bounding box of an RoI.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinearly upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder (Figure~\ref{figure:fpn_block}).
For each consecutive upsampling block, the lateral skip connections are taken from
the encoder block with the same output resolution as the upsampled features coming
from the bottleneck.
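A minimal PyTorch-style sketch of this top-down pathway is shown below (channel counts are illustrative, and the extra P$_6$ level is omitted):

import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    # Builds [P2, P3, P4, P5] from the encoder outputs [C2, C3, C4, C5].
    def __init__(self, in_channels_list, out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels_list)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels_list)

    def forward(self, feats):
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        out = [laterals[-1]]
        for lat in reversed(laterals[:-1]):
            # bilinearly upsample the coarser map and add the lateral skip connection
            up = F.interpolate(out[0], size=lat.shape[-2:], mode="bilinear", align_corners=False)
            out.insert(0, lat + up)
        return [s(o) for s, o in zip(self.smooth, out)]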
Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$ (see Table \ref{table:maskrcnn_resnet_fpn}).
At each output position of the resulting RPN pyramid, bounding boxes are predicted
with respect to 3 anchor aspect ratios $\{1:2, 1:1, 2:1\}$ and a single scale ($N_a = 3$).
For P$_2$, P$_3$, P$_4$, P$_5$, P$_6$,
the scale corresponds to anchor bounding boxes of areas $32^2, 64^2, 128^2, 256^2, 512^2$,
respectively.
Note that there is no need for multiple anchor scales per anchor position anymore,
as the RPN heads themselves correspond to different scales.
Now, in the RPN, higher resolution feature maps can be used for regressing smaller
bounding boxes. For example, boxes of area close to $32^2$ are predicted using P$_2$,
which has a stride of $4$ with respect to the input image.
@@ -463,6 +471,8 @@ as some anchor to the exact same pyramid level from which the RPN of this
anchor is computed. Now, for example, the smallest boxes are cropped from $P_2$,
which is the highest resolution feature map.
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
\begin{figure}[t]
\centering
All bounding boxes predicted by the RoI head or RPN are estimated as offsets
with respect to a reference bounding box. In the case of the RPN,
the reference bounding box is one of the anchors, and refined bounding boxes from the RoI head are
predicted relative to the RPN output bounding boxes.
Let $(x, y, w, h)$ be the top left coordinates, width, and height of the bounding box
to be predicted. Likewise, let $(x^*, y^*, w^*, h^*)$ be the ground truth bounding
box and let $(x_r, y_r, w_r, h_r)$ be the reference bounding box.
The ground truth \emph{box encoding} $b_e^* = (b_x^*, b_y^*, b_w^*, b_h^*)$ is then defined as
\begin{equation*}
b_x^* = (x^* - x_r) / w_r,
\quad
b_y^* = (y^* - y_r) / h_r,
\quad
b_w^* = \log(w^* / w_r),
\quad
b_h^* = \log(h^* / h_r).
\end{equation*}
Analogously, given a predicted box encoding $b_e = (b_x, b_y, b_w, b_h)$,
the corresponding bounding box is decoded as
\begin{equation*}
x = b_x \cdot w_r + x_r,
\quad
y = b_y \cdot h_r + y_r,
\quad
w = \exp(b_w) \cdot w_r,
\quad
h = \exp(b_h) \cdot h_r,
\end{equation*}
and thus the bounding box is obtained as the reference bounding box adjusted by
the predicted relative offsets and scales encoded in $b_e$.
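As an illustration, the encoding and its inverse can be written as the following small NumPy sketch, using the top left corner convention from above; the box coder of the TensorFlow Object Detection API that we build on may use a slightly different but equivalent parametrization:
\begin{verbatim}
import numpy as np

def encode_box(box, ref):
    """Box encoding b_e = (b_x, b_y, b_w, b_h) of `box` relative to the
    reference box `ref`; both boxes are given as (x, y, w, h)."""
    x, y, w, h = box
    xr, yr, wr, hr = ref
    return np.array([(x - xr) / wr, (y - yr) / hr,
                     np.log(w / wr), np.log(h / hr)])

def decode_box(encoding, ref):
    """Invert encode_box: recover (x, y, w, h) from an encoding and a reference box."""
    bx, by, bw, bh = encoding
    xr, yr, wr, hr = ref
    return np.array([bx * wr + xr, by * hr + yr,
                     np.exp(bw) * wr, np.exp(bh) * hr])

# round trip: decoding the encoding of a box recovers the original box
ref = (8, 16, 64, 64)
assert np.allclose(decode_box(encode_box((10, 20, 50, 80), ref), ref),
                   (10, 20, 50, 80))
\end{verbatim}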
\paragraph{Supervision of the RPN}
A positive RPN proposal is defined as one with an IoU of at least $0.7$ with
some ground truth bounding box.
From all proposals, a set of training examples is sampled
with at most $50\%$ positive examples (if there are fewer positive examples,
more negative examples are used instead).
For examples selected in this way, a regression loss is computed between
predicted and ground truth bounding box encoding, and a classification loss
is computed for the predicted objectness scores.
Specifically, let $s_i^* = 1$ if proposal $i$ is positive and $s_i^* = 0$ if
it is negative, let $s_i$ be the predicted objectness score and $b_i$, $b_i^*$ the
predicted and ground truth bounding box encodings.
The RPN bounding box loss is then given by
\begin{equation}
L_{box}^{RPN} = \frac{1}{N_{RPN}^{pos}} \sum_{i=1}^{N_{RPN}} s_i^* \cdot \ell_{reg}(b_i, b_i^*),
\end{equation}
and
\begin{equation}
N_{RPN}^{pos} = \sum_{i=1}^{N_{RPN}} s_i^*
\end{equation}
is the number of positive examples. Note that the bounding box loss is only
active for positive examples, and that the classification loss is computed
for both positive and negative examples.
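The following NumPy sketch illustrates how these RPN losses could be assembled from the sampled examples; the smooth L1 regression loss and the binary cross-entropy classification loss used here are the standard choices for Faster R-CNN-style RPNs, but the exact loss functions in our implementation are those provided by the TensorFlow Object Detection API:
\begin{verbatim}
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1 (Huber) loss, the usual box regression loss."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def rpn_losses(scores, encodings, labels, gt_encodings):
    """RPN losses over the sampled examples.

    scores:        (N,) predicted objectness probabilities s_i
    encodings:     (N, 4) predicted box encodings b_i
    labels:        (N,) ground truth labels s_i^* in {0, 1}
    gt_encodings:  (N, 4) ground truth box encodings b_i^*
    """
    eps = 1e-7
    # classification loss over both positive and negative examples
    cls = -np.mean(labels * np.log(scores + eps)
                   + (1 - labels) * np.log(1 - scores + eps))
    # box loss only over positive examples, normalized by their number
    n_pos = max(labels.sum(), 1)
    reg = smooth_l1(encodings - gt_encodings).sum(axis=1)
    box = (labels * reg).sum() / n_pos
    return cls, box
\end{verbatim}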
During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring region proposals
from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
and passed through the RoI bounding box refinement and classification heads
(but not through the mask head).
After this, non-maximum suppression (NMS) is applied to predicted RoIs for which the predicted class is not the background class,
with an IoU threshold of $0.7$ on the refined boxes.
Finally, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
after extracting the corresponding features again.
Thus, during inference, the features for the mask head are extracted using the refined
bounding boxes for the predicted class, instead of the RPN bounding boxes. This is important for not
introducing any misalignment, as the instance masks are to be created inside of the
final, more precise, refined detection bounding boxes.
Furthermore, note that bounding box and mask predictions for all classes but the predicted
class (the highest scoring class) are discarded, and thus the output bounding
box and mask correspond to the highest scoring class.
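Non-maximum suppression itself is a simple greedy procedure; the following minimal NumPy version is an illustration rather than our actual implementation, using the IoU threshold of $0.7$ mentioned above:
\begin{verbatim}
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of boxes (y1, x1, y2, x2)
    scores: (N,) detection scores
    Returns the indices of the kept boxes, highest scoring first.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of the current box with all remaining boxes
        y1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        x1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        y2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        x2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, y2 - y1) * np.maximum(0.0, x2 - x1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # drop the remaining boxes that overlap the kept box too much
        order = rest[iou <= iou_threshold]
    return np.array(keep)
\end{verbatim}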
of our network is highly interpretable, which may also bring benefits for safety
applications.
\subsection{Future Work}
\paragraph{Training on all Virtual KITTI sequences}
We only trained our models on the \emph{clone} variants of the Virtual KITTI sequences
to make training faster.
In the future, it would be interesting to train on all variants, as the different
lighting conditions and angles should lead to a more general model.
\paragraph{Evaluation and finetuning on KITTI 2015}
Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
on which we do not train, but we have yet to evaluate on a real-world dataset.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
% \paragraph{Masking prior to the RoI motion head}
% Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
% the backbone are integrated over the complete RoI window to yield the features
% for motion estimation.
% For example, average pooling is applied before the fully-connected layers in the variant without FPN.
% However, ideally, the motion (image matching) information from the backbone should
%
% For example, consider
%
% Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
% extracted RoI features before passing them into the motion head.
% The intuition behind that is that we want to mask out (set to zero) any positions in the
% extracted feature window which belong to the background. Then, the RoI motion
% head could aggregate the motion (image matching) information from the backbone
% over positions localized within the object only, but not over positions belonging
% to the background, which should probably not influence the final object motion estimate.
\subsection{Implementation}
Our networks and loss functions are implemented using built-in TensorFlow
functions \cite{TensorFlow}, enabling us to use automatic differentiation for all gradient
computations. To make our code easy to extend and flexible, we build on
the TensorFlow Object Detection API \cite{TensorFlowObjectDetection}, which provides a Faster R-CNN baseline
implementation.
Let $[R_t^{ex}|t_t^{ex}]$
and $[R_{t+1}^{ex}|t_{t+1}^{ex}]$
be the camera extrinsics at the two frames.
We compute the ground truth camera motion
$\{R_{cam}^*, t_{cam}^*\} \in \mathbf{SE}(3)$ as
\begin{equation}
R_{cam}^* = R_{t+1}^{ex} \cdot \mathrm{inv}(R_t^{ex}),
\end{equation}
\begin{equation}
t_{cam}^* = t_{t+1}^{ex} - R_{cam}^* \cdot t_t^{ex}.
\end{equation}
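A minimal NumPy sketch of this computation, assuming the extrinsics are given as $3 \times 4$ world-to-camera matrices $[R^{ex}|t^{ex}]$ (an assumption about the data format, not a statement about our code), could look as follows:
\begin{verbatim}
import numpy as np

def camera_motion_gt(extrinsics_t, extrinsics_t1):
    """Ground truth camera motion between frames t and t+1.

    extrinsics_t, extrinsics_t1: 3x4 matrices [R^ex | t^ex] mapping world
    coordinates to camera coordinates at the respective frame.
    Returns (R_cam^*, t_cam^*) as defined above.
    """
    R_t, t_t = extrinsics_t[:, :3], extrinsics_t[:, 3]
    R_t1, t_t1 = extrinsics_t1[:, :3], extrinsics_t1[:, 3]
    R_cam = R_t1 @ np.linalg.inv(R_t)   # R_cam^* = R_{t+1}^ex * inv(R_t^ex)
    t_cam = t_t1 - R_cam @ t_t          # t_cam^* = t_{t+1}^ex - R_cam^* * t_t^ex
    return R_cam, t_cam
\end{verbatim}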
Additionally, we define $o_{cam}^* \in \{ 0, 1 \}$,
\begin{equation}
o_{cam}^* =
\begin{cases}
1 &\text{if the camera pose changes between $t$ and $t+1$} \\
0 &\text{otherwise,}
\end{cases}
\end{equation}
which specifies whether the camera is moving in between the frames.
For each ground truth object $i$, let $[R_t^i|t_t^i]$ and $[R_{t+1}^i|t_{t+1}^i]$ be the object poses
at $I_t$ and $I_{t+1}$.
Note that the pose at $t$ is given with respect to the camera at $t$ and
the pose at $t+1$ is given with respect to the camera at $t+1$.
We define the ground truth pivot $p_k^* \in \mathbb{R}^3$ as
\begin{equation}
p_k^* = t_t^i
\end{equation}
and compute the ground truth object motion
$\{R_k^*, t_k^*\} \in \mathbf{SE}(3)$ as
\begin{equation}
R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i),
\end{equation}
\begin{equation}
t_k^* = t_{t+1}^{i} - R_k^* \cdot t_t^i.
\end{equation}
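Analogously, the ground truth pivot and object motion can be computed with a small sketch like the following; the argument and function names are ours and purely illustrative:
\begin{verbatim}
import numpy as np

def object_motion_gt(R_cam, pose_t, pose_t1):
    """Ground truth pivot and motion of one object between frames t and t+1.

    R_cam:   ground truth camera rotation R_cam^* from above
    pose_t:  (R_t^i, t_t^i), object pose w.r.t. the camera at frame t
    pose_t1: (R_{t+1}^i, t_{t+1}^i), object pose w.r.t. the camera at frame t+1
    Returns (p^*, R^*, t^*) as defined above.
    """
    R_ti, t_ti = pose_t
    R_t1i, t_t1i = pose_t1
    pivot = t_ti                                            # p^* = t_t^i
    R_obj = np.linalg.inv(R_cam) @ R_t1i @ np.linalg.inv(R_ti)
    t_obj = t_t1i - R_obj @ t_ti                            # t^* = t_{t+1}^i - R^* t_t^i
    return pivot, R_obj, t_obj
\end{verbatim}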
As for the camera, we define $o_k^* \in \{ 0, 1 \}$,
\begin{equation}
o_k^* =
\begin{cases}
1 &\text{if the position of object $i$ changes between $t$ and $t+1$} \\
0 &\text{otherwise,}
\end{cases}
\end{equation}
which specifies whether an object is moving in between the frames.
To evaluate the 3D instance and camera motions on the Virtual KITTI validation
set, we introduce a few error metrics.
Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
let $R_k, t_k, p_k, o_k$ be the predicted motion for the predicted class $c_k$
and $R_k^*, t_k^*, p_k^*, o_k^*$ the motion ground truth for the best matching example.
Then, assuming there are $N$ such detections,
\begin{equation}
E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R_k^*) \cdot R_k) - 1}{2} \right\}\right\} \right)
\end{equation}
measures the mean angle of the error rotation between predicted and ground truth rotation,
\begin{equation}
E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R_k) \cdot (t_k^* - t_k) \right\rVert_2,
\end{equation}
is the mean Euclidean norm between predicted and ground truth translation, and
\begin{equation}
E_{p} = \frac{1}{N}\sum_k \left\lVert p_k^* - p_k \right\rVert_2
\end{equation}
is the mean Euclidean norm between predicted and ground truth pivot.
Moreover, we define precision and recall measures for the detection of moving objects,
\begin{equation}
O_{pr} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FP}}
\end{equation}
is the fraction of objects correctly classified as moving among all objects classified as moving, and
\begin{equation}
O_{rc} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}}
\end{equation}
is the fraction of objects correctly classified as moving among all objects which are actually moving.
Here, we used
\begin{equation}
\mathit{TP} = \sum_k [o_k = 1 \land o_k^* = 1],
\end{equation}
\begin{equation}
\mathit{FP} = \sum_k [o_k = 1 \land o_k^* = 0],
\end{equation}
and
\begin{equation}
\mathit{FN} = \sum_k [o_k = 0 \land o_k^* = 1].
\end{equation}
Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
the predicted camera motion.
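For illustration, the following NumPy sketch computes these metrics, assuming the matching of detections to ground truth examples (IoU of at least $0.5$) has already been performed; it is not the evaluation code itself:
\begin{verbatim}
import numpy as np

def motion_metrics(pred, gt):
    """Motion error metrics over matched foreground detections.

    pred, gt: lists of (R, t, p, o) tuples for the N matched detections,
    containing rotation matrices, translations, pivots and moving flags.
    Returns (E_R, E_t, E_p, O_pr, O_rc).
    """
    e_rot, e_trans, e_piv = [], [], []
    tp = fp = fn = 0
    for (R, t, p, o), (R_gt, t_gt, p_gt, o_gt) in zip(pred, gt):
        # angle of the error rotation, clipped for numerical safety
        cos = (np.trace(np.linalg.inv(R_gt) @ R) - 1.0) / 2.0
        e_rot.append(np.arccos(np.clip(cos, -1.0, 1.0)))
        e_trans.append(np.linalg.norm(np.linalg.inv(R) @ (t_gt - t)))
        e_piv.append(np.linalg.norm(p_gt - p))
        tp += int(o == 1 and o_gt == 1)
        fp += int(o == 1 and o_gt == 0)
        fn += int(o == 0 and o_gt == 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return np.mean(e_rot), np.mean(e_trans), np.mean(e_piv), precision, recall
\end{verbatim}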
\subsection{Virtual KITTI: Training Setup}
\label{ssec:setup}
For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both the Motion R-CNN ResNet and ResNet-FPN variants.
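A minimal sketch of how such an input tensor could be assembled is shown below; here, the XYZ coordinate maps are back-projected from the depth maps using the camera intrinsics, which is an assumption about the preprocessing and not necessarily identical to our implementation:
\begin{verbatim}
import numpy as np

def xyz_from_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into per-pixel XYZ camera coordinates (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def network_input(rgb_t, rgb_t1, depth_t, depth_t1, intrinsics):
    """Concatenate both RGB frames and both XYZ maps into one (H, W, 12) input."""
    xyz_t = xyz_from_depth(depth_t, *intrinsics)
    xyz_t1 = xyz_from_depth(depth_t1, *intrinsics)
    return np.concatenate([rgb_t, rgb_t1, xyz_t, xyz_t1], axis=-1)
\end{verbatim}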
\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train for a total of 192K iterations on the Virtual KITTI training set.
For this, we use a single Titan X (Pascal) GPU and a batch size of 1,
which results in approximately one day of training.
As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
momentum of $0.9$.
As the learning rate, we use $0.25 \cdot 10^{-2}$ for the
and estimates their 3D locations as well as all 3D object motions between the frames.
\subsection{Motivation}
For moving in the real world, it is often desirable to know which objects exist
in the proximity of the moving agent,
where they are located relative to the agent,
and where they will be at some point in the near future.
In many cases, it would be preferable to infer such information from video data
if technically feasible, as camera sensors are cheap and ubiquitous
(compared to, for example, Lidar).
As an example, consider the autonomous driving problem.
Here, it is crucial to not only know the position
of each obstacle, but to also know if and where the obstacle is moving,
and to use sensors that will not make the system too expensive for widespread use.
At the same time, the autonomous driving system has to operate in real time to
react quickly enough for safely controlling the vehicle.
A promising approach for 3D scene understanding in situations such as autonomous driving is deep neural
networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification
in still images and are increasingly being applied to video data.
A key benefit of deep networks is that they can, in principle,
enable very fast inference on real-time video data and generalize
over many training situations to resolve ambiguities inherent in image understanding
and motion estimation.
Thus, in this work, we aim to develop deep neural networks which can, given
sequences of images, segment the image pixels into object instances and estimate
the location and 3D motion of each object instance relative to the camera
(Figure \ref{figure:teaser}).
Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
Using a standard encoder-decoder network for pixel-wise dense prediction,
SfM-Net predicts a pre-determined number of binary masks ranging over the complete image,
with each mask specifying the membership of the image pixels to one object.
A fully-connected network branching off the encoder then predicts a 3D motion for each object,
as well as the camera ego-motion.
However, due to the fixed number of object masks, the system can in practice only predict a small number of motions and
often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions (Figure \ref{figure:sfmnet_kitti}).
deep learning approaches to motion estimation, may significantly benefit motion
estimation by structuring the problem, creating physical constraints and reducing
the dimensionality of the estimate.
In the context of still images, a
scalable approach to instance segmentation based on region-based convolutional networks
was recently introduced with Mask R-CNN \cite{MaskRCNN}.
Mask R-CNN inherits the ability to detect
a large number of objects from a large number of classes at once from Faster R-CNN \cite{FasterRCNN}
and predicts pixel-precise segmentation masks for each detected object (Figure \ref{figure:maskrcnn_cs}).
image depending on the semantics of each region or pixel, which include whether a
pixel belongs to the background, to which object instance it belongs if it is not background,
and the class of the object it belongs to.
Often, failure cases of these methods include motion boundaries or regions with little texture,
where semantics become very important.
Extensions of these approaches to scene flow estimate flow and depth
with similarly generic networks \cite{SceneFlowDataset} and exhibit similar limitations.
These concerns restrict the applicability of the current slanted plane models in settings
which often require estimates to be computed in real time and for which an end-to-end
approach based on learning would be preferable.
By analogy, in other contexts, the move towards end-to-end deep learning has often led
to significant benefits in terms of accuracy and speed.
As an example, consider the evolution of region-based convolutional networks, which started
out as prohibitively slow pipelines with a CNN as only one component and
became very fast and much more accurate over the course of their development into
end-to-end deep networks.
Thus, in the context of motion estimation, one may expect end-to-end deep learning to not only bring large improvements
in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation
and the ability of deep networks to learn to handle ambiguity from a large variety of training examples.
with a brightness constancy proxy loss.
Like SfM-Net, we aim to estimate 3D motion and instance segmentation jointly with
end-to-end deep learning. Unlike SfM-Net, we build on a scalable object detection and instance segmentation
approach with R-CNNs, which provide us with a strong baseline for these tasks.
\paragraph{End-to-end deep networks for camera pose estimation}
Deep networks have been used for estimating the 6-DOF camera pose from
a single RGB frame \cite{PoseNet, PoseNet2}, or for estimating depth and camera ego-motion
from monocular video \cite{UnsupPoseDepth}.
These works are related to
ours in that we also need to output various rotations and translations from a deep network,
and thus need to solve similar regression problems and may be able to use similar parametrizations
and losses.
First, in section \ref{sec:background}, we introduce preliminaries and building
blocks from earlier works that serve as a foundation for our networks and losses.
Most importantly, we review the ResNet CNN (\ref{ssec:resnet}) that will serve as CNN backbone
as well as the developments in region-based CNNs which we build on (\ref{ssec:rcnn}),
specifically Mask R-CNN and the Feature Pyramid Network (FPN) \cite{FPN}.
In section \ref{sec:approach}, we describe our technical contribution, starting
with our motion estimation model and modifications to the Mask R-CNN backbone and head networks (\ref{ssec:model}),
followed by our losses and supervision methods for training
%\pagenumbering{arabic} % Arabic page numbers
\section{Introduction}
\label{sec:introduction}
\parindent 2em
\onehalfspacing
\input{introduction}
\section{Background}
\label{sec:background}
\parindent 2em
\onehalfspacing
\input{background}
\section{Motion R-CNN}
\label{sec:approach}
\parindent 2em
\onehalfspacing
\input{approach}
\section{Experiments}
\label{sec:experiments}
\parindent 2em
\onehalfspacing
\input{experiments}
\section{Conclusion}
\label{sec:conclusion}
\parindent 2em
\onehalfspacing
\input{conclusion}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Bibliography with BibLaTeX