mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git
synced 2025-12-13 09:55:49 +00:00

commit 653b41ee96 (parent a5a014fc57): fucked it

approach.tex (30 changed lines)
@@ -9,7 +9,7 @@ First, we modify the backbone network and provide two frames to the R-CNN system
 in order to enable image matching between the consecutive frames.
 Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
 region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
-show our Motion R-CNN networks based on Mask R-CNN ResNet-50 and Mask R-CNN ResNet-50-FPN,
+show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
 respectively.

 {\begin{table}[h]
@@ -20,13 +20,13 @@ respectively.
 \midrule\midrule
 & input images & H $\times$ W $\times$ C \\
 \midrule
-C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
+C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
 \midrule
 \multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)} (Table \ref{table:maskrcnn_resnet})}\\
 \midrule
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
-& From C$_4$: ResNet-50 \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+& From C$_4$: ResNet \{C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
 & bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
 & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
 T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@@ -52,8 +52,8 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
 \end{tabular}

 \caption {
-Motion R-CNN ResNet-50 architecture based on the Mask R-CNN
-ResNet-50 architecture (Table \ref{table:maskrcnn_resnet}).
+Motion R-CNN ResNet architecture based on the Mask R-CNN
+ResNet architecture (Table \ref{table:maskrcnn_resnet}).
 We use ReLU activations after all hidden layers and
 additionally dropout with $p = 0.5$ after all fully-connected hidden layers.
 }
@@ -70,13 +70,13 @@ additionally dropout with $p = 0.5$ after all fully-connected hidden layers.
 \midrule\midrule
 & input images & H $\times$ W $\times$ C \\
 \midrule
-C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
 \midrule
 \multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
 \midrule
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
-& From C$_5$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
+& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
 & bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
 & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
 T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@@ -101,9 +101,11 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
 \end{tabular}

 \caption {
-Motion R-CNN ResNet-50-FPN architecture based on the Mask R-CNN
-ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
-The modifications are analogous to our Motion R-CNN ResNet-50,
+Motion R-CNN ResNet-FPN architecture based on the Mask R-CNN
+ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
+To obtain a larger bottleneck stride, we compute the feature pyramid starting
+with C$_6$ instead of C$_5$ (thus, the subsampling from P$_5$ to P$_6$ is omitted).
+The modifications are analogous to our Motion R-CNN ResNet,
 but we still show the architecture for completeness.
 Again, we use ReLU activations after all hidden layers and
 additionally dropout with $p = 0.5$ after all fully-connected hidden layers.
@@ -201,12 +203,12 @@ a still and moving camera.

 \label{ssec:design}
 \paragraph{Camera motion network}
-In our ResNet-50 variant (Table \ref{table:motionrcnn_resnet}), the underlying
+In our ResNet variant (Table \ref{table:motionrcnn_resnet}), the underlying
 ResNet backbone is only computed up to the $C_4$ block, as otherwise the
-feature resolution for RoI extraction would be reduced too much.
-In our ResNet-50 variant, we first pass the $C_4$ features through a $C_5$
+feature resolution prior to RoI extraction would be reduced too much.
+In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
 block to make the camera network of both variants comparable.
-Then, in both, the ResNet-50 and ResNet-50-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
+Then, in both the ResNet and ResNet-FPN variants (Table \ref{table:motionrcnn_resnet_fpn}), we apply an additional
 convolution to the $C_5$ features to reduce the number of inputs to the following
 fully-connected layers.
 Instead of averaging, we use bilinear resizing to bring the convolutional features
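The shape bookkeeping behind this hunk's camera motion head can be sketched in a few lines (a minimal sketch with an assumed 512 x 1024 input resolution; the 512-channel reduction and the 7 x 7 / 1024 sizes come from the architecture tables, everything else is illustrative, not the thesis code):

```python
# Hypothetical shape walkthrough of the camera motion head: C5 bottleneck
# features -> 1x1 conv to 512 channels -> bilinear resize to 7x7 ->
# flatten -> fully-connected layers of width 1024.
H, W = 512, 1024                      # assumed input resolution
c5 = (H // 32, W // 32, 2048)         # C5 features at stride 32
reduced = (c5[0], c5[1], 512)         # after the 1x1 conv, 512 channels
resized = (7, 7, 512)                 # bilinear resize to a fixed 7x7 grid
flat = resized[0] * resized[1] * resized[2]   # inputs to the first FC layer
fc1_params = flat * 1024 + 1024       # weights + biases of the first FC layer

print(flat, fc1_params)  # 25088 25691136
```

The 1x1 convolution matters because flattening 2048 channels directly would multiply the first fully-connected layer's parameter count by four.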
@@ -129,13 +129,25 @@ and N$_{motions} = 3$.
 \label{ssec:resnet}
 ResNet \cite{ResNet} was initially introduced as a CNN for image classification, but
 became popular as a basic building block of many deep network architectures for a variety
-of different tasks. In Table \ref{table:resnet}, we show the ResNet-50 variant
+of different tasks. Figure \ref{figure:bottleneck}
+shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
+of very deep networks without the gradients becoming too small as the distance
+from the output layer increases.
+
+In Table \ref{table:resnet}, we show the ResNet variant
 that will serve as the basic CNN backbone of our networks, and
 is also used in many other region-based convolutional networks.
-The initial image data is always passed through ResNet-50 as a first step to
+The initial image data is always passed through the ResNet backbone as a first step to
 bootstrap the complete deep network.
-Figure \ref{figure:bottleneck}
-shows the fundamental building block of ResNet-50.
+Note that for the Mask R-CNN architectures we describe below, this is equivalent
+to the standard backbone.
+In ResNet, the C$_5$ bottleneck has a stride of 32 with respect to the
+input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
+For accurately estimating motions corresponding to larger pixel displacements, a larger
+stride may be important.
+Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
+to increase the bottleneck stride to 64, following FlowNetS.


 {
 %\begin{table}[h]
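The stride arithmetic motivating the added C$_6$ block can be checked with a short sketch (assumed example resolution; the only fact taken from the text is that each backbone stage halves the spatial size, so C$_5$ sits at stride 32 and C$_6$ at stride 64):

```python
# Each backbone stage C1..C6 halves the spatial resolution, so C5 has
# stride 32 and the added C6 block reaches stride 64, matching FlowNetS.
H, W = 512, 1024  # assumed example input resolution

stride = 1
strides = {}
for stage in ["C1", "C2", "C3", "C4", "C5", "C6"]:
    stride *= 2
    strides[stage] = stride

shapes = {stage: (H // s, W // s) for stage, s in strides.items()}
print(strides["C5"], strides["C6"], shapes["C6"])  # 32 64 (8, 16)
```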
@@ -146,7 +158,7 @@ shows the fundamental building block of ResNet-50.
 \midrule\midrule
 & input image & H $\times$ W $\times$ C \\
 \midrule
-\multicolumn{3}{c}{\textbf{ResNet-50}}\\
+\multicolumn{3}{c}{\textbf{ResNet}}\\
 \midrule
 C$_1$ & 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\

@@ -182,17 +194,25 @@ $\begin{bmatrix}
 3 \times 3, 512 \\
 1 \times 1, 2048 \\
 \end{bmatrix}_{b/2}$ $\times$ 3
-& $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
+& $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+\midrule
+C$_6$ &
+$\begin{bmatrix}
+1 \times 1, 512 \\
+3 \times 3, 512 \\
+1 \times 1, 2048 \\
+\end{bmatrix}_{b/2}$ $\times$ 3
+& $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\

 \bottomrule

 \caption {
-ResNet-50 \cite{ResNet} architecture.
+Backbone architecture based on ResNet-50 \cite{ResNet}.
 Operations enclosed in a []$_b$ block make up a single ResNet \enquote{bottleneck}
 block (see Figure \ref{figure:bottleneck}). If the block is denoted as []$_b/2$,
 the first convolution operation in the block has a stride of 2. Note that the stride
 is only applied to the first block, but not to repeated blocks.
-Batch normalization \cite{BN} is used after every convolution.
+Batch normalization \cite{BN} is used after every residual unit.
 }
 \label{table:resnet}
 \end{longtable}
@@ -230,7 +250,7 @@ Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only
 as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
 Then, fixed size (H $\times$ W) feature maps are extracted from the compressed feature map of the image,
 each corresponding to one of the proposal bounding boxes.
-The extracted per-RoI feature maps are collected into a batch and passed into a small Fast R-CNN
+The extracted per-RoI (region of interest) feature maps are collected into a batch and passed into a small Fast R-CNN
 \emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
 The extraction technique is called \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the full image features
 is divided into an H $\times$ W grid of cells. For each cell, the values of the underlying
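The grid-and-pool step described in this hunk can be sketched as follows (a simplified, integer-aligned version assuming boxes in feature-map coordinates; RoI pooling as used in Fast R-CNN additionally handles the quantization and box snapping this sketch omits):

```python
import numpy as np

def roi_pool(features, box, out_h=7, out_w=7):
    """Max-pool the RoI window of `features` (H x W x C) into an
    out_h x out_w grid; `box` = (y0, x0, y1, x1) in feature-map coords."""
    y0, x0, y1, x1 = box
    h, w = y1 - y0, x1 - x0
    out = np.empty((out_h, out_w, features.shape[2]), features.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Cell bounds; max() guarantees each cell covers >= 1 position.
            ys = y0 + (i * h) // out_h
            ye = y0 + max(((i + 1) * h) // out_h, (i * h) // out_h + 1)
            xs = x0 + (j * w) // out_w
            xe = x0 + max(((j + 1) * w) // out_w, (j * w) // out_w + 1)
            out[i, j] = features[ys:ye, xs:xe].max(axis=(0, 1))
    return out

feats = np.arange(14 * 14, dtype=np.float32).reshape(14, 14, 1)
pooled = roi_pool(feats, (0, 0, 14, 14))
print(pooled.shape)  # (7, 7, 1)
```

Every proposal thus yields the same fixed-size output, which is what lets all RoIs be batched through one head network.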
@@ -276,7 +296,7 @@ Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation,
 fixed resolution instance masks within the bounding boxes of each detected object.
 This is done by simply extending the Faster R-CNN head with multiple convolutions, which
 compute a pixel-precise binary mask for each instance.
-The basic Mask R-CNN ResNet-50 architecture is shown in Table \ref{table:maskrcnn_resnet}.
+The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
 Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
 competition between classes for the mask prediction branch.

@@ -295,7 +315,7 @@ boundary of the bounding box, and thus some detail is lost.
 \midrule\midrule
 & input image & H $\times$ W $\times$ C \\
 \midrule
-C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
+C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
 \midrule
 \multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)}}\\
 \midrule
@@ -311,7 +331,7 @@ ROI$_{\mathrm{RPN}}$ & sample boxes$_{\mathrm{RPN}}$ and scores$_{\mathrm{RPN}}$
 \multicolumn{3}{c}{\textbf{RoI Head}}\\
 \midrule
 & From C$_4$ with ROI$_{\mathrm{RPN}}$: RoI extraction & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 1024 \\
-R$_1$& ResNet-50 \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
+R$_1$& ResNet \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
 ave & average pool & N$_{RoI}$ $\times$ 2048 \\
 & From ave: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
 boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
@@ -327,8 +347,8 @@ masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
 \bottomrule

 \caption {
-Mask R-CNN \cite{MaskRCNN} ResNet-50 \cite{ResNet} architecture.
-Note that this is equivalent to the Faster R-CNN architecture if the mask
+Mask R-CNN \cite{MaskRCNN} ResNet \cite{ResNet} architecture.
+Note that this is equivalent to the Faster R-CNN ResNet architecture if the mask
 head is left out. In Mask R-CNN, bilinear sampling is used for RoI extraction,
 whereas Faster R-CNN uses RoI pooling.
 }
@@ -350,7 +370,7 @@ of an appropriate scale to be used, depending on the size of the bounding box.
 For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
 encoder by combining bilinear upsampled feature maps coming from the bottleneck
 with lateral skip connections from the encoder.
-The Mask R-CNN ResNet-50-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
+The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
 Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
 the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
 At each output position of the resulting RPN pyramid, bounding boxes are predicted
@@ -395,7 +415,7 @@ which is the highest resolution feature map.
 \midrule\midrule
 & input image & H $\times$ W $\times$ C \\
 \midrule
-C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
+C$_5$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
 \midrule
 \multicolumn{3}{c}{\textbf{Feature Pyramid Network (FPN)}}\\
 \midrule
@@ -435,7 +455,7 @@ masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\
 \bottomrule

 \caption {
-Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
+Mask R-CNN \cite{MaskRCNN} ResNet-FPN \cite{ResNet} architecture.
 Operations enclosed in a []$_p$ block make up a single FPN
 block (see Figure \ref{figure:fpn_block}).
 }

bib.bib (6 changed lines)
@@ -273,3 +273,9 @@
 author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
 booktitle={ECCV Workshop on Brave new ideas for motion representations in videos},
 year={2016}}
+
+@article{ImageNet,
+title={ImageNet Large Scale Visual Recognition Challenge},
+author={Olga Russakovsky and others},
+journal={International Journal of Computer Vision (IJCV)},
+year={2015}}
@@ -66,7 +66,7 @@ and also fine-tune on the training set as mentioned in the previous paragraph.
 \midrule\midrule
 & input image & H $\times$ W $\times$ C \\
 \midrule
-C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
 \midrule
 \multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
 \midrule
@@ -85,8 +85,8 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
 \end{tabular}

 \caption {
-A possible Motion R-CNN ResNet-50-FPN architecture with depth prediction,
-based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
+A possible Motion R-CNN ResNet-FPN architecture with depth prediction,
+based on the Mask R-CNN ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
 }
 \label{table:motionrcnn_resnet_fpn_depth}
 \end{table}
@@ -140,14 +140,15 @@ into our architecture, we could enable temporally consistent motion estimation
 from image sequences of arbitrary length.

 \paragraph{Deeper networks for larger bottleneck strides}
+% TODO remove?
 Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
 input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
 For accurately estimating the motion of objects with large displacements between
 the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
-We could do this easily in both of our network variants by adding one ore multiple additional
+We could do this easily in both of our network variants by adding one or multiple additional
 ResNet blocks. In the variant without FPN, these blocks would have to be placed
 after RoI feature extraction. In the FPN variant, the blocks could be simply
 added after the encoder C$_5$ bottleneck.
 For saving memory, we could however also consider modifying the underlying
-ResNet-50 architecture and increase the number of blocks, but reduce the number
+ResNet architecture and increase the number of blocks, but reduce the number
 of layers in each block.

@@ -152,7 +152,7 @@ predicted camera motions.

 For our initial experiments, we concatenate both RGB frames as
 well as the XYZ coordinates for both frames as input to the networks.
-We train both, the Motion R-CNN ResNet-50 and ResNet-50-FPN variants.
+We train both the Motion R-CNN ResNet and ResNet-FPN variants.

 \paragraph{Training schedule}
 Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
@@ -166,11 +166,13 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
 \paragraph{R-CNN training parameters}
 For training the RPN and RoI heads and during inference,
 we use the exact same number of proposals and RoIs as Mask R-CNN in
-the ResNet-50 and ResNet-50-FPN variants, respectively.
+the ResNet and ResNet-FPN variants, respectively.

 \paragraph{Initialization}
 For initializing the C$_1$ to C$_5$ weights, we use a pre-trained
 ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository.
-we initialize all hidden layers with He initialization \cite{He}.
+Following the pre-existing TensorFlow implementation of Faster R-CNN,
+we initialize all other hidden layers with He initialization \cite{He}.
 For the fully-connected camera and instance motion output layers,
 we use a truncated normal initializer with a standard
 deviation of $0.0001$ and zero mean, truncated at two standard deviations.
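The output-layer initializer described in this hunk can be sketched with rejection sampling (an assumed implementation, not the thesis code; the shape `(1024, 6)` below is only an illustrative weight matrix size):

```python
import numpy as np

def truncated_normal(shape, std=1e-4, seed=0):
    """Zero-mean normal draws with standard deviation `std`, redrawing
    any value that falls outside two standard deviations."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, std, size=shape)
    bad = np.abs(x) > 2 * std
    while bad.any():
        x[bad] = rng.normal(0.0, std, size=int(bad.sum()))
        bad = np.abs(x) > 2 * std
    return x

w = truncated_normal((1024, 6))
print(w.shape, float(np.abs(w).max()) <= 2e-4)  # (1024, 6) True
```

Such a tight, truncated initialization keeps the predicted motions near zero at the start of training, which is a reasonable prior for mostly static scenes.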
@@ -220,7 +222,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
 wrong by both $\geq 3$ pixels and $\geq 5\%$.
 Camera and instance motion errors are averaged over the validation set.
 We optionally enable camera motion prediction (cam.),
-replace the ResNet-50 backbone with ResNet-50-FPN (FPN),
+replace the ResNet backbone with ResNet-FPN (FPN),
 or input XYZ coordinates into the backbone (XYZ).
 We either supervise
 object motions (sup.) with 3D motion ground truth (3D) or

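The two flow metrics named in this caption can be computed as follows (a sketch with assumed conventions: flow fields as H x W x 2 arrays, and the 5% threshold taken relative to the ground-truth flow magnitude):

```python
import numpy as np

def flow_metrics(flow_est, flow_gt):
    """AEE: mean endpoint error; Fl-all: fraction of pixels whose error is
    both >= 3 px and >= 5% of the ground-truth flow magnitude."""
    epe = np.linalg.norm(flow_est - flow_gt, axis=-1)  # per-pixel error
    mag = np.linalg.norm(flow_gt, axis=-1)
    aee = float(epe.mean())
    fl_all = float(np.mean((epe >= 3.0) & (epe >= 0.05 * mag)))
    return aee, fl_all

gt = np.zeros((4, 4, 2))
est = np.zeros((4, 4, 2))
est[0, 0] = (3.0, 4.0)  # one pixel off by 5 px
aee, fl_all = flow_metrics(est, gt)
print(aee, fl_all)  # 0.3125 0.0625
```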
@@ -73,7 +73,7 @@ and predicts pixel-precise segmentation masks for each detected object (Figure \ref{figure:maskrcnn_cs}).
 \centering
 \includegraphics[width=\textwidth]{figures/maskrcnn_cs}
 \caption{
-Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN}
+Instance segmentation results of Mask R-CNN ResNet-FPN \cite{MaskRCNN}
 on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}.
 }
 \label{figure:maskrcnn_cs}