Simon Meister 2017-11-17 13:05:41 +01:00
parent 653b41ee96
commit 9a207a4024
9 changed files with 27 additions and 19 deletions


@ -9,7 +9,7 @@ First, we modify the backbone network and provide two frames to the R-CNN system
in order to enable image matching between the consecutive frames.
Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
show our Motion R-CNN networks based on Mask R-CNN and Mask R-CNN -FPN,
show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
respectively.
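To make the extended RoI head concrete, here is a minimal PyTorch sketch of what such a per-RoI motion branch could look like; the layer sizes, the flattened fully-connected design, and the 9-parameter split into rotation, translation and pivot are illustrative assumptions, not the thesis implementation:

import torch
import torch.nn as nn

class RoIMotionHead(nn.Module):
    # Hypothetical sketch: a small fully-connected branch on the per-RoI
    # features, predicting a 3D motion per proposal in parallel to the
    # class, box and mask branches. All sizes are illustrative.
    def __init__(self, in_channels=256, roi_size=14, hidden=1024):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(in_channels * roi_size * roi_size, hidden)
        self.fc2 = nn.Linear(hidden, 9)  # 3 rotation + 3 translation + 3 pivot

    def forward(self, roi_features):
        # roi_features: (num_rois, in_channels, roi_size, roi_size)
        x = torch.relu(self.fc1(self.flatten(roi_features)))
        rot, trans, pivot = self.fc2(x).split(3, dim=1)
        return rot, trans, pivot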
{\begin{table}[h]
@ -203,12 +203,12 @@ a still and moving camera.
\label{ssec:design}
\paragraph{Camera motion network}
In our variant (Table \ref{table:motionrcnn_resnet}), the underlying
In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
feature resolution prior to RoI extraction would be reduced too much.
In our variant, we therefore first pass the $C_4$ features through a $C_5$
In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
block to make the camera network of both variants comparable.
Then, in both, the and -FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
Then, in both the ResNet and the ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply an additional
convolution to the $C_5$ features to reduce the number of inputs to the following
fully-connected layers.
Instead of averaging, we use bilinear resizing to bring the convolutional features
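As a rough illustration of this camera-motion design, the branch could be sketched as follows in PyTorch; the reduced channel count, the fixed resize target and the hidden size are assumed values, and the 6 outputs stand for 3 rotation and 3 translation parameters:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraMotionHead(nn.Module):
    # Hypothetical sketch: reduce the C_5 channels with an additional
    # convolution, bilinearly resize to a fixed spatial size instead of
    # averaging, then regress a 6-DoF camera motion with fully-connected
    # layers. All sizes are illustrative assumptions.
    def __init__(self, c5_channels=2048, reduced=128, fixed_size=(6, 20), hidden=1024):
        super().__init__()
        self.reduce = nn.Conv2d(c5_channels, reduced, kernel_size=3, padding=1)
        self.fixed_size = fixed_size
        self.fc1 = nn.Linear(reduced * fixed_size[0] * fixed_size[1], hidden)
        self.fc2 = nn.Linear(hidden, 6)  # 3 rotation + 3 translation

    def forward(self, c5):
        # c5: (N, c5_channels, H/32, W/32)
        x = torch.relu(self.reduce(c5))
        x = F.interpolate(x, size=self.fixed_size, mode='bilinear', align_corners=False)
        x = torch.relu(self.fc1(x.flatten(1)))
        return self.fc2(x)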
@ -291,7 +291,7 @@ supervision without 3D instance motion ground truth.
In contrast to SfM-Net, where a single optical flow field is
composed and penalized to supervise the motion prediction, our loss considers
the motion of all objects in isolation and composes a batch of flow windows
for the RoIs.
for the RoIs. Network predictions are shown in red.
}
\label{figure:flow_loss}
\end{figure}
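The geometry behind one such per-RoI flow window can be sketched in NumPy as follows; the rigid motion parameterization (rotation R about a pivot point, then translation t), the availability of a depth window, and the intrinsics K are assumptions made for illustration, and the exact penalty in the thesis may differ:

import numpy as np

def roi_flow_window(depth, K, R, t, pivot):
    # Hypothetical sketch: compose the optical flow inside one RoI window
    # from a rigid 3D instance motion, given a depth window (h x w) and
    # the camera intrinsics K (3 x 3).
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    points = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project to 3D
    moved = R @ (points - pivot[:, None]) + pivot[:, None] + t[:, None]
    proj = K @ moved
    proj = proj[:2] / proj[2:]                               # re-project to pixels
    return (proj - pix[:2]).T.reshape(h, w, 2)               # flow window

def roi_flow_loss(flow_pred, flow_gt):
    # L1 penalty between the composed flow window and the ground-truth
    # flow cropped to the same RoI.
    return np.abs(flow_pred - flow_gt).mean()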


@ -57,6 +57,7 @@ Table \ref{table:flownets} shows the classical FlowNetS architecture for optical
& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Refinement}}\\
\midrule
& 5 $\times$ 5 deconv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
\multicolumn{3}{c}{...}\\
\midrule
@ -64,7 +65,8 @@ flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
\bottomrule
\caption {
FlowNetS \cite{FlowNet} architecture.
FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
are used for refinement.
}
\label{table:flownets}
\end{longtable}
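As an illustration of the refinement part of this table, one upsampling step could be sketched as follows in PyTorch; 4 x 4 transpose-convolution kernels are used here for an exact factor-2 upsampling, whereas the table lists 5 x 5 deconvolutions, and the channel counts are illustrative:

import torch
import torch.nn as nn

class RefinementStep(nn.Module):
    # Illustrative sketch of one FlowNetS-style refinement step: a stride-2
    # transpose convolution upsamples the decoder features, the matching
    # encoder features and the upsampled coarse flow are concatenated, and
    # a 3x3 convolution predicts flow at the finer scale.
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1)
        self.upflow = nn.ConvTranspose2d(2, 2, 4, stride=2, padding=1)
        self.predict = nn.Conv2d(out_ch + skip_ch + 2, 2, 3, padding=1)

    def forward(self, x, skip, coarse_flow):
        x = torch.relu(self.deconv(x))
        x = torch.cat([x, skip, self.upflow(coarse_flow)], dim=1)
        return x, self.predict(x)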
@ -85,7 +87,10 @@ Recently, other, similarly generic,
encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
\subsection{SfM-Net}
Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detail \todo{finish}.
Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture.
Motions and full-image masks for a fixed number $N_{motions}$ of independent objects
are predicted in addition to a depth map, and an unsupervised re-projection loss based on
image brightness differences penalizes the predictions.
{
%\begin{table}[h]
@ -94,8 +99,7 @@ Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detai
\toprule
\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
\multicolumn{3}{c}{\todo{Conv-Deconv}}\\
\midrule
\multicolumn{3}{c}{\textbf{Conv-Deconv}}\\
\midrule
\multicolumn{3}{c}{\textbf{Motion Network}}\\
\midrule
@ -106,7 +110,7 @@ FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}
object motions & fully connected, $N_{motions} \cdot$ 9 & $N_{motions} \cdot$ 9 \\
camera motion & From FC: $\times$ 2 & 6 \\
\midrule
\multicolumn{3}{c}{\textbf{Structure Network} ()}\\
\multicolumn{3}{c}{\textbf{Structure Network}}\\
\midrule
& input image $I_t$ & H $\times$ W $\times$ 3 \\
& Conv-Deconv & H $\times$ W $\times$ 32 \\
@ -114,11 +118,14 @@ depth & 1 $\times$ 1 conv, 1 & H $\times$ W $\times$ 1 \\
\bottomrule
\caption {
SfM-Net \cite{SfmNet} architecture.
SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully convolutional
encoder-decoder network, where convolutions and deconvolutions with stride 2 are
used for downsampling and upsampling, respectively. The total stride at the bottleneck
with respect to the input image is 32.
The Conv-Deconv weights for the structure and motion networks are not shared,
and $N_{motions} = 3$.
}
\label{table:flownets}
\label{table:sfmnet}
\end{longtable}
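To make the unsupervised loss concrete, the following NumPy sketch warps frame t+1 back to frame t along a composed flow field and penalizes brightness differences; the bilinear sampling and the simple border clamping are illustrative choices, not necessarily those of SfM-Net:

import numpy as np

def bilinear_sample(img, x, y):
    # Bilinearly sample img (H x W x C) at float pixel coordinates (x, y),
    # with a crude clamp at the image border.
    h, w = img.shape[:2]
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    dx = np.clip(x - x0, 0.0, 1.0)[:, None]
    dy = np.clip(y - y0, 0.0, 1.0)[:, None]
    return (img[y0, x0] * (1 - dx) * (1 - dy) + img[y0, x0 + 1] * dx * (1 - dy)
            + img[y0 + 1, x0] * (1 - dx) * dy + img[y0 + 1, x0 + 1] * dx * dy)

def reprojection_loss(I_t, I_tp1, flow):
    # Warp frame t+1 back to frame t along the composed flow and penalize
    # the mean absolute brightness difference.
    h, w = I_t.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x = (xs + flow[..., 0]).ravel()
    y = (ys + flow[..., 1]).ravel()
    warped = bilinear_sample(I_tp1, x, y).reshape(I_t.shape)
    return np.abs(I_t - warped).mean()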
@ -201,7 +208,7 @@ $\begin{bmatrix}
1 \times 1, 512 \\
3 \times 3, 512 \\
1 \times 1, 2048 \\
\end{bmatrix}_{b/2}$ $\times$ 3
\end{bmatrix}_{b/2}$ $\times$ 2
& $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\bottomrule
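For reference, each bracketed block in this notation denotes a residual bottleneck unit, with the subscript presumably marking where the stride-2 downsampling occurs. A minimal PyTorch sketch, with BatchNorm placement and the projection shortcut following common ResNet practice:

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # Minimal sketch of the bracketed bottleneck block: 1x1 reduce, 3x3,
    # 1x1 expand, plus an identity (or strided 1x1 projection) shortcut.
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False))

    def forward(self, x):
        return torch.relu(self.body(x) + self.shortcut(x))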
@ -296,7 +303,7 @@ Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
fixed resolution instance masks within the bounding boxes of each detected object.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise binary mask for each instance.
The basic Mask R-CNN architecture is shown in Table \ref{table:maskrcnn_resnet}.
The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
Note that the per-class mask logits are passed through a sigmoid layer, and thus there is no
competition between classes in the mask prediction branch.
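A minimal sketch of this per-class sigmoid design: only the mask logits of each RoI's ground-truth class are penalized, with a per-pixel binary cross-entropy, so the classes never compete (hypothetical PyTorch code, tensor shapes assumed for illustration):

import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    # mask_logits: one m x m mask per class, (num_rois, num_classes, m, m).
    # Only the ground-truth class of each RoI receives a gradient.
    idx = torch.arange(mask_logits.size(0))
    logits = mask_logits[idx, gt_classes]  # (num_rois, m, m)
    return F.binary_cross_entropy_with_logits(logits, gt_masks)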
@ -370,7 +377,7 @@ of an appropriate scale to be used, depending on the size of the bounding box.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinear upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder.
The Mask R-CNN -FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
At each output position of the resulting RPN pyramid, bounding boxes are predicted
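The top-down pathway described here could be sketched as follows in PyTorch; bilinear upsampling follows the text, while the 256 output channels and the strided pooling for P$_6$ follow common FPN practice and are assumptions here:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    # Sketch of the FPN top-down path: each level adds a 1x1 lateral
    # connection from the encoder (C2..C5) to the bilinearly upsampled
    # coarser level and smooths with a 3x3 convolution; P6 is a strided
    # subsampling of P5, used only by the RPN.
    def __init__(self, c_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in c_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in c_channels)

    def forward(self, c2, c3, c4, c5):
        feats = [c2, c3, c4, c5]
        p = self.lateral[3](c5)
        outs = [self.smooth[3](p)]
        for i in (2, 1, 0):
            p = self.lateral[i](feats[i]) + F.interpolate(
                p, scale_factor=2, mode='bilinear', align_corners=False)
            outs.insert(0, self.smooth[i](p))
        p6 = F.max_pool2d(outs[-1], kernel_size=1, stride=2)
        return (*outs, p6)  # P2, P3, P4, P5, P6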


@ -271,7 +271,7 @@
@inproceedings{UnsupFlownet,
title={Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
booktitle={ECCV Workshop on Brave new ideas for motion representations in videos},
booktitle={ECCV Workshops},
year={2016}}
@inproceedings{ImageNet,


@ -150,5 +150,5 @@ ResNet blocks. In the variant without FPN, these blocks would have to be placed
after RoI feature extraction. In the FPN variant, the blocks could be simply
added after the encoder C$_5$ bottleneck.
To save memory, however, we could also consider modifying the underlying
architecture and increase the number of blocks, but reduce the number
ResNet architecture and increase the number of blocks, but reduce the number
of layers in each block.


@ -222,7 +222,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
Camera and instance motion errors are averaged over the validation set.
We optionally enable camera motion prediction (cam.),
replace the backbone with -FPN (FPN),
replace the ResNet backbone with ResNet-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise
object motions (sup.) with 3D motion ground truth (3D) or
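For reference, the two flow error measures from this caption can be computed as in the following NumPy sketch (flow arrays of shape H x W x 2 are assumed):

import numpy as np

def flow_metrics(flow_est, flow_gt):
    # AEE: mean Euclidean distance between estimated and ground-truth flow
    # vectors. Fl-all: fraction of pixels whose endpoint error is both
    # >= 3 pixels and >= 5% of the ground-truth flow magnitude.
    ee = np.linalg.norm(flow_est - flow_gt, axis=-1)
    mag = np.linalg.norm(flow_gt, axis=-1)
    aee = ee.mean()
    fl_all = np.mean((ee >= 3.0) & (ee >= 0.05 * mag))
    return aee, fl_all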


@ -73,7 +73,7 @@ and predicts pixel-precise segmentation masks for each detected object (Figure \
\centering
\includegraphics[width=\textwidth]{figures/maskrcnn_cs}
\caption{
Instance segmentation results of Mask R-CNN -FPN \cite{MaskRCNN}
Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN}
on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}.
}
\label{figure:maskrcnn_cs}
@ -104,6 +104,7 @@ manageable pieces.
Overview of our network based on Mask R-CNN. For each region of interest (RoI), we predict the instance motion
in parallel to the class, bounding box and mask. Additionally, we branch off a
small network for predicting the camera motion from the bottleneck.
Novel components in addition to Mask R-CNN are shown in red.
}
\label{figure:net_intro}
\end{figure}