fucked it

This commit is contained in:
Simon Meister 2017-11-16 16:51:54 +01:00
parent a5a014fc57
commit 653b41ee96
6 changed files with 72 additions and 41 deletions


@ -9,7 +9,7 @@ First, we modify the backbone network and provide two frames to the R-CNN system
in order to enable image matching between the consecutive frames.
Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
show our Motion R-CNN networks based on Mask R-CNN ResNet-50 and Mask R-CNN ResNet-50-FPN,
show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
respectively.
{\begin{table}[h]
@ -20,13 +20,13 @@ respectively.
\midrule\midrule
& input images & H $\times$ W $\times$ C \\
\midrule
C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)} (Table \ref{table:maskrcnn_resnet})}\\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
& From C$_4$: ResNet-50 \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
& From C$_4$: ResNet \{C$_5$, C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
& bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@ -52,8 +52,8 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
\end{tabular}
\caption {
Motion R-CNN ResNet-50 architecture based on the Mask R-CNN
ResNet-50 architecture (Table \ref{table:maskrcnn_resnet}).
Motion R-CNN ResNet architecture based on the Mask R-CNN
ResNet architecture (Table \ref{table:maskrcnn_resnet}).
We use ReLU activations after all hidden layers and
additionally apply dropout with $p = 0.5$ after all fully-connected hidden layers.
}
@ -70,13 +70,13 @@ additonally dropout with $p = 0.5$ after all fully-connected hidden layers.
\midrule\midrule
& input images & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
& From C$_5$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
& bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@ -101,9 +101,11 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
\end{tabular}
\caption {
Motion R-CNN ResNet-50-FPN architecture based on the Mask R-CNN
ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
The modifications are analogous to our Motion R-CNN ResNet-50,
Motion R-CNN ResNet-FPN architecture based on the Mask R-CNN
ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
To obtain a larger bottleneck stride, we compute the feature pyramid starting
with C$_6$ instead of C$_5$ (thus, the subsampling from P$_5$ to P$_6$ is omitted).
The modifications are analogous to our Motion R-CNN ResNet,
but we still show the architecture for completeness.
Again, we use ReLU activations after all hidden layers and
additionally apply dropout with $p = 0.5$ after all fully-connected hidden layers.
@ -201,12 +203,12 @@ a still and moving camera.
\label{ssec:design}
\paragraph{Camera motion network}
In our ResNet-50 variant (Table \ref{table:motionrcnn_resnet}), the underlying
In our ResNet variant (Table \ref{table:motionrcnn_resnet}), the underlying
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
feature resolution for RoI extraction would be reduced too much.
In our ResNet-50 variant, we first pass the $C_4$ features through a $C_5$
feature resolution prior to RoI extraction would be reduced too much.
In the ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
block to make the camera network of both variants comparable.
Then, in both, the ResNet-50 and ResNet-50-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
Then, in both the ResNet and ResNet-FPN variants (Table \ref{table:motionrcnn_resnet_fpn}), we apply an additional
convolution to the $C_5$ features to reduce the number of inputs to the following
fully-connected layers.
Instead of averaging, we use bilinear resizing to bring the convolutional features


@ -129,13 +129,25 @@ and N$_{motions} = 3$.
\label{ssec:resnet}
ResNet \cite{ResNet} was initially introduced as a CNN for image classification, but
became popular as basic building block of many deep network architectures for a variety
of different tasks. In Table \ref{table:resnet}, we show the ResNet-50 variant
of different tasks. Figure \ref{figure:bottleneck}
shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
of very deep networks without the gradients becoming too small as the distance
from the output layer increases.
In Table \ref{table:resnet}, we show the ResNet variant
that will serve as the basic CNN backbone of our networks, and
is also used in many other region-based convolutional networks.
The initial image data is always passed through ResNet-50 as a first step to
The initial image data is always passed through the ResNet backbone as a first step to
bootstrap the complete deep network.
Figure \ref{figure:bottleneck}
shows the fundamental building block of ResNet-50.
Note that for the Mask R-CNN architectures we describe below, which do not use C$_6$,
this is equivalent to the standard ResNet-50 backbone.
In ResNet, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
to increase the bottleneck stride to 64, following FlowNetS.
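The stage-wise stride arithmetic can be traced with a small sketch (stage names and channel widths follow the standard ResNet-50 table; the helper itself and its names are ours, not from the thesis):

```python
# Shape bookkeeping only: each strided bottleneck stage halves the resolution,
# so appending a C6 stage doubles the bottleneck stride from 32 to 64.
STAGES = [
    ("C1", 2, 64),     # 7x7 conv, stride 2
    ("pool", 2, 64),   # 3x3 max pool, stride 2 (start of the C2 stage)
    ("C2", 1, 256),    # overall stride 4
    ("C3", 2, 512),    # overall stride 8
    ("C4", 2, 1024),   # overall stride 16
    ("C5", 2, 2048),   # overall stride 32
    ("C6", 2, 2048),   # extra stage: overall stride 64
]

def feature_shapes(h, w):
    """Trace feature map shapes (h, w, channels) through the backbone."""
    shapes = {}
    for name, stride, channels in STAGES:
        h, w = h // stride, w // stride
        shapes[name] = (h, w, channels)
    return shapes
```

For a 512 $\times$ 512 input this yields C$_4$ at 32 $\times$ 32 $\times$ 1024, C$_5$ at 16 $\times$ 16 $\times$ 2048, and the added C$_6$ at 8 $\times$ 8 $\times$ 2048, matching the strides in the table.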
{
%\begin{table}[h]
@ -146,7 +158,7 @@ shows the fundamental building block of ResNet-50.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
\multicolumn{3}{c}{\textbf{ResNet-50}}\\
\multicolumn{3}{c}{\textbf{ResNet}}\\
\midrule
C$_1$ & 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\
@ -182,17 +194,25 @@ $\begin{bmatrix}
3 \times 3, 512 \\
1 \times 1, 2048 \\
\end{bmatrix}_{b/2}$ $\times$ 3
& $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
& $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
\midrule
C$_6$ &
$\begin{bmatrix}
1 \times 1, 512 \\
3 \times 3, 512 \\
1 \times 1, 2048 \\
\end{bmatrix}_{b/2}$ $\times$ 3
& $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\bottomrule
\caption {
ResNet-50 \cite{ResNet} architecture.
Backbone architecture based on ResNet-50 \cite{ResNet}.
Operations enclosed in a []$_b$ block make up a single ResNet \enquote{bottleneck}
block (see Figure \ref{figure:bottleneck}). If the block is denoted as []$_{b/2}$,
the first convolution operation in the block has a stride of 2. Note that the stride
is only applied to the first block, but not to repeated blocks.
Batch normalization \cite{BN} is used after every convolution.
Batch normalization \cite{BN} is used after every convolution.
}
\label{table:resnet}
\end{longtable}
@ -230,7 +250,7 @@ Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed size (H $\times$ W) feature maps are extracted from the compressed feature map of the image,
each corresponding to one of the proposal bounding boxes.
The extracted per-RoI feature maps are collected into a batch and passed into a small Fast R-CNN
The extracted per-RoI (region of interest) feature maps are collected into a batch and passed into a small Fast R-CNN
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
The extraction technique is called \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the full image features
is divided into a H $\times$ W grid of cells. For each cell, the values of the underlying
@ -276,7 +296,7 @@ Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentat
fixed resolution instance masks within the bounding boxes of each detected object.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise binary mask for each instance.
The basic Mask R-CNN ResNet-50 architecture is shown in Table \ref{table:maskrcnn_resnet}.
The basic Mask R-CNN architecture is shown in Table \ref{table:maskrcnn_resnet}.
Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
competition between classes in the mask prediction branch.
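This point can be illustrated with a minimal NumPy sketch (toy shapes and names are ours): with a sigmoid, each class channel is thresholded independently, so the channels do not compete as they would under a softmax.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy per-RoI mask head output: 28 x 28 logits for each of 3 classes.
rng = np.random.default_rng(0)
n_cls, h, w = 3, 28, 28
mask_logits = rng.standard_normal((h, w, n_cls))

# Sigmoid is applied per channel; probabilities are independent per class.
mask_probs = sigmoid(mask_logits)

# The box head's predicted class k merely selects which channel to threshold.
k = 1
instance_mask = mask_probs[:, :, k] > 0.5

# Unlike softmax, the class channels need not sum to one at each pixel.
assert not np.allclose(mask_probs.sum(axis=-1), 1.0)
```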
@ -295,7 +315,7 @@ boundary of the bounding box, and thus some detail is lost.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)}}\\
\midrule
@ -311,7 +331,7 @@ ROI$_{\mathrm{RPN}}$ & sample boxes$_{\mathrm{RPN}}$ and scores$_{\mathrm{RPN}}$
\multicolumn{3}{c}{\textbf{RoI Head}}\\
\midrule
& From C$_4$ with ROI$_{\mathrm{RPN}}$: RoI extraction & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 1024 \\
R$_1$& ResNet-50 \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
R$_1$& ResNet \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
ave & average pool & N$_{RoI}$ $\times$ 2048 \\
& From ave: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
@ -327,8 +347,8 @@ masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}
\bottomrule
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet-50 \cite{ResNet} architecture.
Note that this is equivalent to the Faster R-CNN architecture if the mask
Mask R-CNN \cite{MaskRCNN} ResNet \cite{ResNet} architecture.
Note that this is equivalent to the Faster R-CNN ResNet architecture if the mask
head is left out. In Mask R-CNN, bilinear sampling is used for RoI extraction,
whereas Faster R-CNN used RoI pooling.
}
@ -350,7 +370,7 @@ of an appropriate scale to be used, depending of the size of the bounding box.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinear upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder.
The Mask R-CNN ResNet-50-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
At each output position of the resulting RPN pyramid, bounding boxes are predicted
@ -395,7 +415,7 @@ which is the highest resolution feature map.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
C$_5$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{Feature Pyramid Network (FPN)}}\\
\midrule
@ -435,7 +455,7 @@ masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}
\bottomrule
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
Mask R-CNN \cite{MaskRCNN} ResNet-FPN \cite{ResNet} architecture.
Operations enclosed in a []$_p$ block make up a single FPN
block (see Figure \ref{figure:fpn_block}).
}


@ -273,3 +273,9 @@
author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
booktitle={ECCV Workshop on Brave new ideas for motion representations in videos},
year={2016}}
@article{ImageNet,
title={ImageNet Large Scale Visual Recognition Challenge},
author={Olga Russakovsky and others},
journal={International Journal of Computer Vision},
year={2015}}


@ -66,7 +66,7 @@ and also fine-tune on the training set as mentioned in the previous paragraph.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
@ -85,8 +85,8 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfra
\end{tabular}
\caption {
A possible Motion R-CNN ResNet-50-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
A possible Motion R-CNN ResNet-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
\end{table}
@ -140,14 +140,15 @@ into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
\paragraph{Deeper networks for larger bottleneck strides}
% TODO remove?
Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
For accurately estimating the motion of objects with large displacements between
the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
We could do this easily in both of our network variants by adding one ore multiple additional
We could do this easily in both of our network variants by adding one or multiple additional
ResNet blocks. In the variant without FPN, these blocks would have to be placed
after RoI feature extraction. In the FPN variant, the blocks could be simply
added after the encoder C$_5$ bottleneck.
For saving memory, we could however also consider modifying the underlying
ResNet-50 architecture and increase the number of blocks, but reduce the number
ResNet architecture and increase the number of blocks, but reduce the number
of layers in each block.


@ -152,7 +152,7 @@ predicted camera motions.
For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both, the Motion R-CNN ResNet-50 and ResNet-50-FPN variants.
We train both the Motion R-CNN ResNet and ResNet-FPN variants.
\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
@ -166,11 +166,13 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\paragraph{R-CNN training parameters}
For training the RPN and RoI heads and during inference,
we use the exact same number of proposals and RoIs as Mask R-CNN in
the ResNet-50 and ResNet-50-FPN variants, respectively.
the ResNet and ResNet-FPN variants, respectively.
\paragraph{Initialization}
For initializing the C$_1$ to C$_5$ weights, we use a pre-trained
ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository.
Following the pre-existing TensorFlow implementation of Faster R-CNN,
we initialize all hidden layers with He initialization \cite{He}.
we initialize all other hidden layers with He initialization \cite{He}.
For the fully-connected camera and instance motion output layers,
we use a truncated normal initializer with a standard
deviation of $0.0001$ and zero mean, truncated at two standard deviations.
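As a sketch of the described initializer (mirroring the resampling behavior of TensorFlow's truncated normal initializer; the helper name is ours):

```python
import numpy as np

def truncated_normal(shape, stddev=1e-4, rng=None):
    """Zero-mean normal samples, redrawn until all lie within two stddevs."""
    rng = np.random.default_rng(rng)
    out = rng.normal(0.0, stddev, size=shape)
    bad = np.abs(out) > 2 * stddev
    while bad.any():
        # Redraw only the out-of-range samples until none remain.
        out[bad] = rng.normal(0.0, stddev, size=int(bad.sum()))
        bad = np.abs(out) > 2 * stddev
    return out

# E.g. weights for a hypothetical 1024 -> 9 motion output layer.
w = truncated_normal((1024, 9), stddev=1e-4, rng=0)
assert np.abs(w).max() <= 2e-4
```

Starting the output layers near zero keeps the initial motion predictions close to the identity motion, which is a common choice for regression heads.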
@ -220,7 +222,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
Camera and instance motion errors are averaged over the validation set.
We optionally enable camera motion prediction (cam.),
replace the ResNet-50 backbone with ResNet-50-FPN (FPN),
replace the ResNet backbone with ResNet-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise
object motions (sup.) with 3D motion ground truth (3D) or


@ -73,7 +73,7 @@ and predicts pixel-precise segmentation masks for each detected object (Figure \
\centering
\includegraphics[width=\textwidth]{figures/maskrcnn_cs}
\caption{
Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN}
Instance segmentation results of Mask R-CNN ResNet-FPN \cite{MaskRCNN}
on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}.
}
\label{figure:maskrcnn_cs}