This commit is contained in:
Simon Meister 2017-11-17 19:42:58 +01:00
parent e5f6f23c6b
commit 73470ee4a8
4 changed files with 34 additions and 33 deletions

View File

@ -26,7 +26,8 @@ C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
& From C$_4$: ResNet \{C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
& From C$_4$: ResNet \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
& ResNet \{C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
& 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
& bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@ -76,7 +77,7 @@ C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
& bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@ -104,7 +105,7 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
Motion R-CNN ResNet-FPN architecture based on the Mask R-CNN
ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
To obtain a larger bottleneck stride, we compute the feature pyramid starting
with C$_6$ instead of C$_5$ (thus, the subsampling from P$_5$ to P$_6$) is omitted.
with C$_6$ instead of C$_5$, and thus the subsampling from P$_5$ to P$_6$ is omitted.
The modifications are analogous to those for our Motion R-CNN ResNet,
but we still show the architecture for completeness.
Again, we use ReLU activations after all hidden layers and
@ -206,10 +207,14 @@ a still and moving camera.
In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
feature resolution prior to RoI extraction would be reduced too much.
In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
block to make the camera network of both variants comparable.
Then, in both, the ResNet and ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
convolution to the $C_5$ features to reduce the number of inputs to the following
Therefore, in the variant without FPN, we first pass the $C_4$ features through $C_5$
and $C_6$ blocks (with weights independent of the $C_5$ block used in the RoI head of this variant)
to increase the bottleneck stride prior to the camera motion network to 64.
In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}),
the backbone makes use of all blocks through $C_6$ and
we can simply branch off our camera motion network from the $C_6$ bottleneck.
Then, in both the ResNet and ResNet-FPN variants, we apply an additional
convolution to the $C_6$ features to reduce the number of inputs to the following
fully-connected layers.
Instead of averaging, we use bilinear resizing to bring the convolutional features
to a fixed size without losing all spatial information,
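To make the head concrete, the following is a minimal sketch in TensorFlow 1.x-style Python. The function name and the use of tf.layers are our own illustrative assumptions rather than the reference implementation, and batch normalization and weight initialization details are omitted.

import tensorflow as tf

def camera_motion_head(c6_features):
    # c6_features: backbone bottleneck features, assumed shape
    # [batch, H/64, W/64, 2048] as in the tables above.
    # 1 x 1 convolution reducing the channel count to 512 before the
    # fully-connected layers.
    x = tf.layers.conv2d(c6_features, filters=512, kernel_size=1,
                         activation=tf.nn.relu)
    # Bilinear resizing to a fixed 7 x 7 window; unlike global average
    # pooling, this keeps coarse spatial information.
    x = tf.image.resize_bilinear(x, size=[7, 7])
    # Flatten to [batch, 7 * 7 * 512].
    x = tf.reshape(x, [-1, 7 * 7 * 512])
    # Two fully-connected layers with 1024 units each.
    for _ in range(2):
        x = tf.layers.dense(x, units=1024, activation=tf.nn.relu)
    return x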
@ -356,13 +361,6 @@ and sample proposals and RoIs in the exact same way.
During inference, we proceed analogously to Mask R-CNN.
As for the RoI mask head, at test time we compute the RoI motion head
from the features extracted with the refined bounding boxes.
Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
extracted RoI features before passing them into the motion head.
The intuition behind that is that we want to mask out (set to zero) any positions in the
extracted feature window which belong to the background. Then, the RoI motion
head aggregates the motion (image matching) information from the backbone
over positions localized within the object only, but not over positions belonging
to the background, which should not influence the final object motion estimate.
Again, as for masks and bounding boxes in Mask R-CNN,
the predicted output object motions are the predicted object motions for the

View File

@ -134,10 +134,10 @@ and N$_{motions} = 3$.
\subsection{ResNet}
\label{ssec:resnet}
ResNet \cite{ResNet} was initially introduced as a CNN for image classification, but
ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but
became popular as a basic building block of many deep network architectures for a variety
of tasks. Figure \ref{figure:bottleneck}
shows the fundamental building block of . The additive \emph{residual unit} enables the training
shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
of very deep networks without the gradients becoming too small as the distance
from the output layer increases.
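For concreteness, the bottleneck form of this residual unit can be sketched as follows (TensorFlow 1.x-style Python; batch normalization and the exact activation ordering of \cite{ResNet} are omitted for brevity, and the helper name is ours):

import tensorflow as tf

def bottleneck_unit(x, filters, stride=1):
    # Sketch of a ResNet bottleneck residual unit: y = F(x) + x.
    # filters: channel count of the inner 3 x 3 layer; the output has
    # 4 * filters channels, as in ResNet-50.
    shortcut = x
    # Project the shortcut when the shape changes (stride or channels).
    if stride != 1 or x.shape[-1] != 4 * filters:
        shortcut = tf.layers.conv2d(x, 4 * filters, 1, strides=stride)
    # 1 x 1 -> 3 x 3 -> 1 x 1 bottleneck.
    y = tf.layers.conv2d(x, filters, 1, activation=tf.nn.relu)
    y = tf.layers.conv2d(y, filters, 3, strides=stride, padding='same',
                         activation=tf.nn.relu)
    y = tf.layers.conv2d(y, 4 * filters, 1)
    # Additive residual connection followed by the final activation.
    return tf.nn.relu(y + shortcut)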
@ -147,8 +147,9 @@ is also used in many other region-based convolutional networks.
The initial image data is always passed through the ResNet backbone as a first step to
bootstrap the complete deep network.
Note that for the Mask R-CNN architectures we describe below, this is equivalent
to the standard backbone.
In , the C$_5$ bottleneck has a stride of 32 with respect to the
to the standard ResNet-50 backbone. We now introduce one small extension that
will be useful for our Motion R-CNN network.
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
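As a concrete (hypothetical) example: for a $1024 \times 2048$ input, a bottleneck stride of 32 leaves $32 \times 64$ feature positions, whereas a stride of 64 leaves only $16 \times 32$, so each position summarizes a correspondingly larger image region and can relate pixels that moved further between the two frames.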

View File

@ -130,7 +130,6 @@ For training on a dataset without any motion ground truth, e.g.
Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.
\paragraph{Temporal consistency}
A next step after the two aforementioned extensions could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
@ -139,16 +138,19 @@ In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
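As a rough sketch of this direction (all names and shapes below are hypothetical, and an LSTM over flattened features is only the simplest option; a convolutional LSTM would preserve spatial structure):

import tensorflow as tf

# Hypothetical per-frame feature vectors, e.g. flattened bottleneck
# features, of shape [batch, num_frames, 1024].
frame_features = tf.placeholder(tf.float32, [None, None, 1024])

# The LSTM carries state across time steps, so motion estimates can
# remain consistent over arbitrarily long image sequences.
cell = tf.nn.rnn_cell.LSTMCell(num_units=1024)
outputs, final_state = tf.nn.dynamic_rnn(cell, frame_features,
                                         dtype=tf.float32)
# outputs[:, t] could then replace the two-frame features that feed
# the motion heads at time step t.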
\paragraph{Deeper networks for larger bottleneck strides}
% TODO remove?
Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
For accurately estimating the motion of objects with large displacements between
the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
We could do this easily in both of our network variants by adding one or more additional
ResNet blocks. In the variant without FPN, these blocks would have to be placed
after RoI feature extraction. In the FPN variant, the blocks could simply be
added after the encoder C$_5$ bottleneck.
To save memory, we could, however, also consider modifying the underlying
ResNet architecture to increase the number of blocks while reducing the number
of layers in each block.
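As a sketch of the FPN-variant option (reusing the bottleneck_unit helper sketched in the background chapter; the filter count is illustrative, not the actual configuration):

# c5: encoder bottleneck features, assumed shape [batch, H/32, W/32, 2048].
num_extra_blocks = 1  # one extra block gives stride 64 (C_6), two give 128, ...
x = c5
for _ in range(num_extra_blocks):
    # Each stride-2 bottleneck block doubles the bottleneck stride.
    x = bottleneck_unit(x, filters=512, stride=2)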
\paragraph{Masking prior to the RoI motion head}
Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
the backbone are integrated over the complete RoI window to yield the features
for motion estimation.
For example, average pooling is applied before the fully-connected layers in the variant without FPN.
However, ideally, the motion (image matching) information from the backbone should
For example, consider
Instead, we could use the \emph{predicted} binarized masks for each RoI to mask the
extracted RoI features before passing them into the motion head.
The intuition behind this is that we want to mask out (set to zero) any positions in the
extracted feature window that belong to the background. Then, the RoI motion
head could aggregate the motion (image matching) information from the backbone
over positions localized within the object only, but not over positions belonging
to the background, which should probably not influence the final object motion estimate.
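A minimal sketch of this masking step in TensorFlow 1.x-style Python (function and tensor names are hypothetical; binarizing at 0.5 is an assumed threshold):

import tensorflow as tf

def mask_roi_features(roi_features, mask_probs, threshold=0.5):
    # roi_features: extracted RoI features, [num_rois, h, w, channels].
    # mask_probs: predicted per-RoI mask probabilities, [num_rois, h, w].
    # Binarize the predicted masks and add a channel axis for broadcasting.
    binary_mask = tf.cast(mask_probs > threshold, roi_features.dtype)
    binary_mask = tf.expand_dims(binary_mask, axis=-1)
    # Zero out background positions so the motion head aggregates
    # matching information over the object only.
    return roi_features * binary_mask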

View File

@ -166,7 +166,7 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\paragraph{R-CNN training parameters}
For training the RPN and RoI heads and during inference,
we use the exact same number of proposals and RoIs as Mask R-CNN in
the and -FPN variants, respectively.
the ResNet and ResNet-FPN variants, respectively.
\paragraph{Initialization}
For initializing the C$_1$ to C$_5$ weights, we use a pre-trained