This commit is contained in:
Simon Meister 2017-11-22 21:46:21 +01:00
parent 9215f296a7
commit 48ed4b4696
6 changed files with 71 additions and 68 deletions


@@ -23,7 +23,7 @@ thus combining the representation learning benefits and speed of end-to-end deep
with a physically plausible scene model inspired by slanted plane energy-minimization approaches to
scene flow.
Building on recent advances in region-based convolutional neural networks (R-CNNs),
we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D camera,
our resulting end-to-end deep network detects objects with precise per-pixel
@@ -54,8 +54,8 @@ respects objects, thus combining the representational power and speed
of end-to-end deep networks with a physically plausible scene model
that is inspired by slanted-plane energy-minimization approaches to scene flow.
Here, we build on recent advances in region-based convolutional
neural networks (R-CNNs) and integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D
camera, our end-to-end deep network detects objects with pixel-accurate object masks
and estimates the 3D motion of each detected object between the frames.


@@ -116,7 +116,7 @@ additionally dropout with $p = 0.5$ after all fully-connected hidden layers.
}
\paragraph{Motion R-CNN backbone}
Like Faster R-CNN and Mask R-CNN, we use a ResNet variant \cite{ResNet} as the backbone network to compute feature maps from input imagery.
Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
@@ -125,15 +125,16 @@ Additionally, we also experiment with concatenating the camera space XYZ coordin
XYZ$_t$ and XYZ$_{t+1}$, into the input as well.
We do not introduce a separate network for computing region proposals and use our modified backbone network
both as the RPN and for extracting the RoI features.
Technically, our feature encoder network will have to learn image matching representations similar to
those learned by the FlowNet encoder, but the output will be computed in the
object-centric framework of a region-based convolutional network head with a 3D parametrization.
Thus, in contrast to the dense FlowNet decoder, the estimated dense image matching information
from the encoder is integrated for specific objects via RoI extraction and subsequently
processed by the RoI head for each object.
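To make the modified input layout concrete, here is a small illustrative sketch (not the thesis implementation) of stacking the two frames and, optionally, the camera-space XYZ coordinate maps along the channel dimension; the channel-last layout and the function name are assumptions.

```python
import numpy as np

def build_backbone_input(frame_t, frame_tp1, xyz_t=None, xyz_tp1=None):
    """Stack two RGB frames (and optional camera-space XYZ maps) channel-wise.

    frame_t, frame_tp1: H x W x 3 images; xyz_t, xyz_tp1: optional H x W x 3 maps.
    Returns H x W x 6 (RGB only) or H x W x 12 (RGB + XYZ).
    """
    inputs = [frame_t, frame_tp1]
    if xyz_t is not None and xyz_tp1 is not None:
        inputs += [xyz_t, xyz_tp1]
    return np.concatenate(inputs, axis=-1)

# Example with KITTI-sized frames.
h, w = 375, 1242
x = build_backbone_input(np.zeros((h, w, 3)), np.zeros((h, w, 3)),
                         np.zeros((h, w, 3)), np.zeros((h, w, 3)))
print(x.shape)  # (375, 1242, 12)
```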
\paragraph{Per-RoI motion prediction}
We use a 3D rigid motion parametrization similar to the one used in SE3-Nets and SfM-Net \cite{SE3Nets, SfmNet}.
For the $k$-th object proposal, we predict the rigid transformation $\{R_k, t_k\}\in \mathbf{SE}(3)$
\footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
and translations: $\{R, t \mid R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
@@ -178,7 +179,7 @@ We then extend the Mask R-CNN head by adding a small fully-connected network for
prediction in addition to the fully-connected layers for
refined boxes and classes and the convolutional network for the masks.
As for refined boxes and masks, we make one separate motion prediction for each class.
Each instance motion is predicted as a set of nine values,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_k$ and $p_k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
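Purely as an illustration, the snippet below decodes the three clipped sine values into a rotation matrix; the arcsine decoding and the XYZ Euler composition order are assumptions, since the exact convention is not shown in this excerpt.

```python
import numpy as np

def rotation_from_sines(sin_alpha, sin_beta, sin_gamma):
    """Decode clipped sine predictions into a rotation matrix R in SO(3).

    Assumes small angles in [-pi/2, pi/2] so that arcsin is unambiguous,
    and an (assumed) composition R = Rz(gamma) @ Ry(beta) @ Rx(alpha).
    """
    a, b, g = (np.arcsin(np.clip(s, -1.0, 1.0)) for s in (sin_alpha, sin_beta, sin_gamma))
    rx = np.array([[1, 0, 0],
                   [0, np.cos(a), -np.sin(a)],
                   [0, np.sin(a),  np.cos(a)]])
    ry = np.array([[ np.cos(b), 0, np.sin(b)],
                   [0, 1, 0],
                   [-np.sin(b), 0, np.cos(b)]])
    rz = np.array([[np.cos(g), -np.sin(g), 0],
                   [np.sin(g),  np.cos(g), 0],
                   [0, 0, 1]])
    return rz @ ry @ rx

R = rotation_from_sines(0.01, -0.02, 0.005)
print(np.allclose(R @ R.T, np.eye(3)))  # True: R is orthonormal
```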
Here, we assume that motions between frames are relatively small
@@ -400,12 +401,12 @@ and $(c_0, c_1, f)$ are the camera intrinsics.
For now, the depth map is always assumed to come from ground truth.
Given $k$ detections with predicted motions as above, we transform all points within the bounding
box and mask of a detected object according to the predicted motion of the object.
For this, we first define the \emph{full image} mask $M_k$ for object $k$,
which can be computed from the predicted box mask $m_k$ (for the predicted class) by bilinearly resizing
it to the width and height of the predicted bounding box and then copying the values
of the resized mask into a full image resolution mask initialized with zeros,
starting at the top-left coordinate of the predicted bounding box.
Again, we binarize masks at a threshold of $0.5$.
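A hedged sketch of this procedure: the box mask is pasted into a zero-initialized full-resolution mask (nearest-neighbour resizing stands in for the bilinear resizing specified in the text), and the 3D points selected by the mask are moved with the predicted motion; the exact point transform with the pivot $p_k$, $X' = R_k(X - p_k) + p_k + t_k$, is an assumption.

```python
import numpy as np

def full_image_mask(box_mask, box, image_hw):
    """Paste an m x m box mask into a full-resolution mask M_k (binarized at 0.5).

    box = (x0, y0, x1, y1) in pixels and is assumed to lie inside the image;
    nearest-neighbour resizing stands in for the bilinear resizing in the text.
    """
    H, W = image_hw
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    bh, bw = max(y1 - y0, 1), max(x1 - x0, 1)
    m = box_mask.shape[0]
    ys = np.arange(bh) * m // bh
    xs = np.arange(bw) * m // bw
    resized = box_mask[ys[:, None], xs[None, :]]
    M = np.zeros((H, W), dtype=bool)
    M[y0:y0 + bh, x0:x0 + bw] = resized >= 0.5
    return M

def transform_masked_points(xyz, M, R, t, p):
    """Apply the predicted rigid motion to all 3D points selected by mask M,
    assuming rotation about the pivot: X' = R (X - p) + p + t."""
    out = xyz.copy()
    out[M] = (xyz[M] - p) @ R.T + p + t
    return out
```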


@@ -8,7 +8,7 @@ The optical flow
$\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$
maps pixel coordinates in the first frame $I_t$ to pixel coordinates of the
visually corresponding pixel in the second frame $I_{t+1}$,
and can be interpreted as the (apparent) movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to three-dimensional space and additionally
@@ -64,7 +64,7 @@ learns a spatially compressed, wide (in the number of channels) representation o
and a fully-connected prediction network on top of the encoder.
The compressed representations learned by CNNs of these categories do not, however, allow
for prediction of high-resolution output, as spatial detail is lost through sequential application
of pooling or strides.
Thus, networks for dense, high-resolution prediction introduce a convolutional decoder on top of the representation encoder,
performing upsampling of the compressed features and resulting in an encoder-decoder pyramid.
@@ -87,7 +87,7 @@ Recently, other, similarly generic,
encoder-decoder CNNs have been applied to optical flow prediction as well \cite{DenseNetDenseFlow}.
\subsection{SfM-Net}
Table \ref{table:sfmnet} shows the SfM-Net architecture \cite{SfmNet} we described
in the introduction.
Motions and full-image masks for a fixed number N$_{motions}$ of independent objects
are predicted in addition to a depth map, and an unsupervised re-projection loss based on
@@ -237,7 +237,7 @@ In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
to increase the bottleneck stride to 64, following FlowNetS.
\subsection{Region-based CNNs}
@@ -246,14 +246,14 @@ We now give an overview of region-based convolutional networks, which are curren
most popular deep networks for object detection, and have recently also been applied to instance segmentation.
\paragraph{R-CNN}
The very first region-based convolutional networks (R-CNNs) \cite{RCNN} used a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped using the region bounding box and the crop is
passed through the CNN, which performs classification of the object (or non-object, if the region shows background).
\paragraph{Fast R-CNN}
The original R-CNN involves computing one forward pass of the CNN for each of the region proposals,
which is costly, as there generally is a large number of proposals.
Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only a single forward pass with the whole image
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed size (H $\times$ W) feature maps are extracted from the compressed feature map of the image,
@@ -273,31 +273,31 @@ After streamlining the CNN components, Fast R-CNN is limited by the speed of the
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster training and test-time processing when compared to Fast R-CNN
and, again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
Next, the output features from the backbone are passed into a small, fully-convolutional \emph{Region Proposal Network} (RPN), which
predicts objectness scores and regresses bounding boxes at each of its output positions.
At any of the $h \times w$ output positions of the RPN,
$\text{N}_a$ bounding boxes with their \emph{objectness} scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total.
In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding
to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels, and 3 aspect ratios,
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).
For each RPN prediction at a given position, the objectness score tells us how likely it is to correspond to a detection.
The region proposals can then be obtained as the N highest scoring RPN predictions.
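As a rough sketch of the anchor grid described above (exact anchor placement and the ratio convention vary between implementations and are assumptions here), the $\text{N}_a = 9$ boxes per position can be generated as follows:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     areas=(128 ** 2, 256 ** 2, 512 ** 2),
                     ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchors as (x_center, y_center, w, h)."""
    shapes = []
    for area in areas:
        for r in ratios:              # r = h / w (assumed convention)
            w = np.sqrt(area / r)
            h = w * r
            shapes.append((w, h))
    shapes = np.array(shapes)         # 9 x 2

    # One set of 9 anchors centered on every RPN output position.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)

    return np.concatenate([np.repeat(centers, len(shapes), axis=0),
                           np.tile(shapes, (len(centers), 1))], axis=1)

print(generate_anchors(2, 3).shape)  # (54, 4): 2 * 3 positions, 9 anchors each
```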
Then, the \emph{second stage} corresponds to the original Fast R-CNN head, performing classification
and bounding box refinement for each of the region proposals, which are now obtained
from the RPN instead of being pre-computed by an external algorithm.
As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals,
and the refined bounding boxes are predicted separately for each object class.
Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet architecture
(for Faster R-CNN, the mask head is ignored).
{
@@ -330,7 +330,7 @@ ave & average pool & N$_{RoI}$ $\times$ 2048 \\
& From ave: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
& From ave: fully connected, N$_{cls}$ + 1 & N$_{RoI}$ $\times$ (N$_{cls}$ + 1) \\
classes & softmax, N$_{cls}$ + 1 & N$_{RoI}$ $\times$ (N$_{cls}$ + 1) \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Masks}}\\
\midrule
@@ -352,21 +352,21 @@ whereas Faster R-CNN uses RoI pooling.
\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
However, it can be helpful to know class and object (instance) membership of individual pixels,
which generally involves computing a binary image mask for each object instance specifying which pixels belong
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN \cite{MaskRCNN} extends Faster R-CNN to instance segmentation by predicting
fixed-resolution instance masks within the bounding box of each detected object,
which are, at test time, bilinearly resized to fit inside the respective bounding boxes.
For this, Mask R-CNN simply extends the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise binary mask for each instance.
Note that the per-class mask \emph{logits} (raw network outputs) are put through a sigmoid layer, and thus there is no
competition between classes in the mask prediction branch.
Additionally, an important technical aspect of Mask R-CNN is the replacement of RoI pooling with
bilinear sampling for extracting the RoI features, which is much more precise.
In the original RoI pooling adopted from Fast R-CNN, the bins for max-pooling are not aligned with the actual pixel
boundaries of the bounding boxes, and thus some detail is lost.
The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
@@ -408,7 +408,7 @@ F$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2
& From F$_1$: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4 \\
boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
& From F$_1$: fully connected, N$_{cls}$ + 1 & N$_{RoI}$ $\times$ (N$_{cls}$ + 1) \\
classes & softmax, N$_{cls}$ + 1 & N$_{RoI}$ $\times$ (N$_{cls}$ + 1) \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Masks}}\\
\midrule
@@ -429,12 +429,12 @@ block (see Figure \ref{figure:fpn_block}).
\paragraph{Feature Pyramid Networks}
In Faster R-CNN, a single feature map is used as the source of all RoI features during RoI extraction, independent
of the size of the bounding box of any specific RoI.
However, for small objects, the C$_4$ (see Table \ref{table:resnet}) features
might have lost too much spatial information to allow properly predicting the exact bounding
box and a high-resolution mask.
As a solution to this, the Feature Pyramid Network (FPN) \cite{FPN} enables features
of an appropriate scale to be used for RoI extraction, depending on the size of the bounding box of the RoI.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinearly upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder (Figure~\ref{figure:fpn_block}).
@@ -454,17 +454,16 @@ as the RPN heads themselves correspond to different scales.
Now, in the RPN, higher resolution feature maps can be used for regressing smaller
bounding boxes. For example, boxes of area close to $32^2$ are predicted using P$_2$,
which has a stride of $4$ with respect to the input image.
Most importantly, the RoI features can now be extracted from the pyramid level P$_j$ appropriate for an
RoI bounding box with size $h \times w$, where
\begin{equation}
j = 2 + j_a,
\end{equation}
\begin{equation}
j_a = \mathrm{clip}\left(\left[\log_2\left(\frac{\sqrt{w \cdot h}}{s_0}\right)\right], 0, 4\right)
\label{eq:level_assignment}
\end{equation}
is the index (from small anchor to large anchor) of the corresponding anchor box, and
\begin{equation}
s_0 = 256 \cdot 0.125
\label{eq:level_assignment}
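To make the level assignment above concrete, here is a minimal sketch; interpreting the bracket $[\cdot]$ as rounding to the nearest integer is an assumption.

```python
import numpy as np

def fpn_level(box_w, box_h, s0=256 * 0.125):
    """Map an RoI of size w x h to a pyramid level P_j with j = 2 + j_a and
    j_a = clip(round(log2(sqrt(w * h) / s0)), 0, 4)."""
    j_a = int(np.clip(np.round(np.log2(np.sqrt(box_w * box_h) / s0)), 0, 4))
    return 2 + j_a

print(fpn_level(32, 32), fpn_level(512, 512))  # 2 6
```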
@@ -513,12 +512,12 @@ $c$ is the output vector from a softmax layer,
$c_{c^*} \in (0,1)$ is the output probability for class $c^*$,
and $\text{C}$ is the number of classes.
Note that for the object category classifier, $\text{C} = \text{N}_{cls} + 1$,
as $\text{N}_{cls}$ does not include the background class.
Finally, for multi-label classification, we define the binary (sigmoid) cross-entropy loss,
\begin{equation}
\ell_{cls*}(y, y^*) = -y^* \cdot \log(y) - (1 - y^*) \cdot \log(1 - y),
\end{equation}
where $y^* \in \{0,1\}$ is a ground truth label and $y \in (0,1)$ is the output of a sigmoid layer.
Note that for the mask loss that will be introduced below, $\ell_{cls*}$ is
the sum of the $\ell_{cls*}$-losses over all 2D positions of the mask.
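For illustration, the two loss terms can be written out as follows; the multi-class term is given in its standard form $-\log(c_{c^*})$, since its full definition lies just outside this excerpt, and no numerical clamping of the logarithms is shown.

```python
import numpy as np

def softmax_cross_entropy(c, c_star):
    """Multi-class term: -log(c_{c*}) for a softmax output vector c (standard form)."""
    return -np.log(c[c_star])

def binary_cross_entropy(y, y_star):
    """Binary (sigmoid) term: -y* log(y) - (1 - y*) log(1 - y)."""
    return -y_star * np.log(y) - (1.0 - y_star) * np.log(1.0 - y)

def mask_loss(mask_probs, mask_targets):
    """Mask loss: sum of the binary term over all 2D positions of the m x m mask."""
    return np.sum(binary_cross_entropy(mask_probs, mask_targets))

print(softmax_cross_entropy(np.array([0.1, 0.7, 0.2]), 1))  # ~0.357
```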
@@ -549,7 +548,7 @@ b_h^* = \log \left( \frac{h^*}{h_r} \right),
which represents the regression target for the bounding box
outputs of the network.
Thus, for bounding box regression, the network predicts the box encoding $b_e$,
\begin{equation}
b_e = (b_x, b_y, b_w, b_h),
\end{equation}
@@ -565,7 +564,7 @@ b_w = \log \left( \frac{w}{w_r} \right),
b_h = \log \left( \frac{h}{h_r} \right).
\end{equation*}
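A small sketch of the encoding and its inversion; only the width and height terms are visible in this excerpt, so normalizing the center offsets by the reference box size (as in Fast R-CNN) is an assumption.

```python
import numpy as np

def encode_box(box, ref):
    """Encode a box (x, y, w, h) relative to its reference box (x_r, y_r, w_r, h_r)."""
    x, y, w, h = box
    xr, yr, wr, hr = ref
    return np.array([(x - xr) / wr,    # b_x (assumed normalization)
                     (y - yr) / hr,    # b_y (assumed normalization)
                     np.log(w / wr),   # b_w = log(w / w_r)
                     np.log(h / hr)])  # b_h = log(h / h_r)

def decode_box(b_e, ref):
    """Invert the encoding at test time to recover b = (x, y, w, h)."""
    bx, by, bw, bh = b_e
    xr, yr, wr, hr = ref
    return np.array([bx * wr + xr, by * hr + yr, wr * np.exp(bw), hr * np.exp(bh)])

ref = (10.0, 20.0, 100.0, 50.0)
box = (14.0, 18.0, 120.0, 40.0)
print(np.allclose(decode_box(encode_box(box, ref), ref), box))  # True
```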
At test time, to convert from a predicted box encoding $b_e$ to the predicted bounding box $b$,
the definitions above can be inverted,
\begin{equation}
b = (x, y, w, h),
@@ -631,14 +630,14 @@ the predicted refined RoI box encoding for class $c_i^*$.
Additionally, for any foreground RoI, let $m_i$ be the predicted $m \times m$ mask for class $c_i^*$
and $m_i^*$ the $m \times m$ mask target with values in $\{0,1\}$, where the mask target is cropped and resized from
the binary ground truth mask using the RPN proposal bounding box.
In our implementation, we use nearest-neighbour resizing for the cropped mask
targets.
Note that values in $m_i$ and $c_i$ are already normalized probabilities from
sigmoid and softmax layers, respectively.
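A hedged sketch of the mask target extraction; the proposal box is assumed to lie inside the image, and $m = 28$ (the Mask R-CNN default) is only a placeholder for the value actually used.

```python
import numpy as np

def mask_target(gt_mask, proposal, m=28):
    """Crop the binary ground truth mask with the proposal box (x0, y0, x1, y1)
    and resize the crop to m x m with nearest-neighbour sampling."""
    x0, y0, x1, y1 = [int(round(v)) for v in proposal]
    crop = gt_mask[y0:y1, x0:x1]
    ys = np.arange(m) * crop.shape[0] // m
    xs = np.arange(m) * crop.shape[1] // m
    return crop[ys[:, None], xs[None, :]].astype(np.float32)

gt = np.zeros((375, 1242), dtype=np.uint8)
gt[100:200, 300:500] = 1
print(mask_target(gt, (280, 90, 520, 210)).shape)  # (28, 28)
```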
Then, the RoI loss is computed as
\begin{equation}
L_{RoI} = L_{cls} + L_{box} + L_{mask},
\end{equation}
where
\begin{equation}
@@ -669,7 +668,7 @@ losses are only enabled for the foreground RoIs. Note that the bounding box and
for all classes other than $c_i^*$ are not penalized.
\paragraph{Inference}
During inference, the 300 (ResNet) or 1000 (ResNet-FPN) highest scoring region proposals
from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
and passed through the RoI bounding box refinement and classification heads
(but not through the mask head).


@@ -4,7 +4,7 @@ We introduced Motion R-CNN, which enables 3D object motion estimation in paralle
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames (and XYZ point coordinates) from a monocular camera.
In addition to instance motions, our network estimates the 3D ego-motion of the camera.
We combine all these estimates to obtain a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with bounding box, instance mask, depth, and 3D motion ground truth,
@@ -75,8 +75,8 @@ geometry for making a more reliable depth estimate, at least when the camera
is moving. We could also easily extend our method to stereo input data by concatenating
all of the frames into the input image.
In case we choose the option of integrating the depth prediction directly into
the current backbone,
this would, however, require using a different dataset for training Motion R-CNN, as Virtual KITTI does not
provide stereo images.
If we used a specialized depth network instead, we could use stereo data
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset,
@@ -149,8 +149,9 @@ RoI instance motion loss. Still, to use all available information from
ground truth optical flow and obtain more accurate supervision,
it would likely be beneficial to add a global, flow-based camera motion loss
independent of the RoI supervision.
To do this, one could use a re-projection loss conceptually similar to the one
for supervising instance motions with ground truth flow,
but computed on the full image instead of for individual RoIs. However, to adjust for the
fact that the camera motion can only be accurately supervised with flow at positions where
no object motion occurs, this loss would have to be masked with the ground truth
object masks. Again, we could use this flow-based loss in an unsupervised way,


@@ -27,7 +27,7 @@ from different viewing angles, resulting in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given a dense depth and optical flow map and the camera
extrinsics matrix. There are two annotated object classes, cars and vans (N$_{cls}$ = 2).
For all cars and vans in each frame, we are given 2D and 3D object bounding
boxes, instance masks, 3D poses, and various other labels.
This makes the Virtual KITTI dataset ideally suited for developing our joint
@@ -85,7 +85,7 @@ $\{R_k^*, t_k^*\} \in \mathbf{SE}(3)$ as
R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^k \cdot \mathrm{inv}(R_t^k),
\end{equation}
\begin{equation}
t_k^* = t_{t+1}^{k} - R_k^* \cdot t_t^k.
\end{equation}
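The two equations above translate directly into code; the coordinate frames of the object poses $(R^k, t^k)$ and the camera rotation $R_{cam}^*$ follow the conventions of the surrounding text, which are not fully visible in this excerpt.

```python
import numpy as np

def gt_object_motion(R_cam_star, R_t_k, t_t_k, R_tp1_k, t_tp1_k):
    """Ground truth object motion:
    R_k^* = inv(R_cam^*) . R_{t+1}^k . inv(R_t^k)
    t_k^* = t_{t+1}^k - R_k^* . t_t^k
    """
    R_k_star = np.linalg.inv(R_cam_star) @ R_tp1_k @ np.linalg.inv(R_t_k)
    t_k_star = t_tp1_k - R_k_star @ t_t_k
    return R_k_star, t_k_star

# A static object seen by a static camera has identity motion.
I = np.eye(3)
R, t = gt_object_motion(I, I, np.ones(3), I, np.ones(3))
print(np.allclose(R, I), np.allclose(t, np.zeros(3)))  # True True
```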
As for the camera, we define $o_k^* \in \{ 0, 1 \}$,
@@ -112,11 +112,11 @@ measures the mean angle of the error rotation between predicted and ground truth
\begin{equation}
E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R_k) \cdot (t_k^* - t_k) \right\rVert_2,
\end{equation}
is the mean Euclidean distance between predicted and ground truth translation, and
\begin{equation}
E_{p} = \frac{1}{N}\sum_k \left\lVert p_k^* - p_k \right\rVert_2
\end{equation}
is the mean Euclidean distance between predicted and ground truth pivot.
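To make the evaluation measures concrete, a sketch over $N$ detections; representing the rotation angle error via the axis-angle magnitude of $\mathrm{inv}(R_k) \cdot R_k^*$ is an assumption about the exact definition of the first measure.

```python
import numpy as np

def motion_errors(R_pred, R_gt, t_pred, t_gt, p_pred, p_gt):
    """Mean rotation-angle, translation and pivot errors over N detections.
    Inputs: sequences of N rotation matrices (3x3) and N vectors (3,)."""
    N = len(R_pred)
    E_R = E_t = E_p = 0.0
    for Rp, Rg, tp, tg, pp, pg in zip(R_pred, R_gt, t_pred, t_gt, p_pred, p_gt):
        R_err = np.linalg.inv(Rp) @ Rg                        # error rotation (assumed form)
        cos = np.clip((np.trace(R_err) - 1.0) / 2.0, -1.0, 1.0)
        E_R += np.arccos(cos)                                 # angle of the error rotation
        E_t += np.linalg.norm(np.linalg.inv(Rp) @ (tg - tp))  # as in the equation above
        E_p += np.linalg.norm(pg - pp)
    return E_R / N, E_t / N, E_p / N
```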
Moreover, we define precision and recall measures for the detection of moving objects,
where
@@ -151,8 +151,8 @@ Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{Mas
We train for a total of 192K iterations on the Virtual KITTI training set.
For this, we use a single Titan X (Pascal) GPU and a batch size of 1,
which results in approximately one day of training for a complete run.
As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with
momentum set to $0.9$.
As the learning rate, we use $0.25 \cdot 10^{-2}$ for the
first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
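The schedule amounts to a simple step function; the sketch below only restates the stated settings and is not the training code itself.

```python
def learning_rate(step, drop_step=144_000):
    """0.25e-2 for the first 144K SGD iterations (momentum 0.9, batch size 1),
    0.25e-3 afterwards, up to 192K iterations in total."""
    return 0.25e-2 if step < drop_step else 0.25e-3

assert learning_rate(0) == 2.5e-3 and learning_rate(150_000) == 2.5e-4
```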
@@ -255,7 +255,7 @@ the average ground truth camera translation. The camera rotation angle error
is still relatively high, compared to the small average ground truth camera rotation.
Although both variants use the exact same network for predicting the camera motion,
the FPN variant performs worse here, with the error in rotation angle twice as high.
One possible explanation that should be investigated in future work is
that in the FPN variant, all blocks in the backbone are shared between the camera
motion branch and the feature pyramid. In the variant without FPN, the C$_5$ and
C$_6$ blocks are only used in the camera branch, and thus only experience weight
@@ -265,8 +265,9 @@ As a remedy, increasing the loss weighting of the camera motion loss may be
helpful.
\paragraph{Instance motion}
The object pivots are estimated with relatively
high accuracy in both variants (given that the scenes are in a realistic scale),
although the FPN variant is significantly more
accurate, which we ascribe to the higher resolution features used in this variant.
The predicted 3D object translations and rotations still have a relatively high


@@ -113,7 +113,7 @@ manageable pieces.
\centering
\includegraphics[width=\textwidth]{figures/net_intro}
\caption{
Overview of our network based on Mask R-CNN \cite{MaskRCNN}. For each region of interest (RoI), we predict the 3D instance motion
in parallel to the class, bounding box and mask. Additionally, we branch off a
small network from the bottleneck for predicting the 3D camera ego-motion.
Novel components in addition to Mask R-CNN are shown in red.
@@ -124,8 +124,8 @@ Novel components in addition to Mask R-CNN are shown in red.
\subsection{Related work}
In the following, we will refer to systems which use deep networks for all
optimization and do not perform time-critical side computation
at inference time (e.g. numerical optimization) as \emph{end-to-end} deep learning systems.
\paragraph{Deep networks in optical flow estimation}
@@ -142,8 +142,9 @@ where semantics become very important.
Extensions of these approaches to scene flow estimate dense flow and dense depth
with similarly generic networks \cite{SceneFlowDataset} and similar limitations.
Other works make use of semantic segmentation to structure
the optical flow estimation problem and introduce reasoning at the object level
\cite{ESI, JOF, FlowLayers, MRFlow},
but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components and numerical
optimization is central to their inference.
@@ -168,20 +169,20 @@ without the use of (deep) learning.
In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
with depth obtained from a non-learned stereo algorithm to be used as pre-computed
inputs to a slanted plane scene flow model based on \cite{KITTI2015}.
Most likely due to their use of deep learning for instance segmentation and for some other components, this
approach outperforms the previous related scene flow methods on relevant public benchmarks \cite{KITTI2012, KITTI2015}.
Still, the method uses an energy-minimization formulation for the scene flow estimation itself
and takes minutes to make a prediction.
Interestingly, the slanted plane methods achieve the current state-of-the-art
in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015},
outperforming end-to-end deep networks like \cite{SceneFlowDataset, FlowNet, FlowNet2}.
However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts,
generally taking a fraction of a second instead of minutes per prediction, and can often be made to run in real time.
These concerns restrict the applicability of the current slanted plane models in practical settings,
which often require estimations to be done in real time (or close to real time) and for which an end-to-end
approach based on learning would be preferable.
Also, by analogy, in other contexts, the move towards end-to-end deep learning has often led