Mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git, synced 2025-12-12 17:35:51 +00:00.
Commit 73470ee4a8 (parent e5f6f23c6b): WIP
approach.tex (26 changed lines)
@@ -26,7 +26,8 @@ C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\
 \midrule
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
-& From C$_4$: ResNet \{C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
+& From C$_4$: ResNet \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+& ResNet \{C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
 & bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
 & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
 T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
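To make the camera motion head in the table above concrete, here is a minimal PyTorch sketch of the resize/flatten/fully-connected pipeline. It is our illustration, not the thesis code: the module name is invented, and the 1 x 1 channel reduction from 2048 to 512 is an assumption implied by the 7 x 7 x 512 resize output and by the "additional convolution" the text describes later.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraMotionHead(nn.Module):
    """Resize/flatten/FC head, as a sketch of the table rows above."""
    def __init__(self, in_channels=2048, reduced=512, fc_dim=1024):
        super().__init__()
        # assumed 1x1 reduction: the table's 7x7x512 output implies 2048 -> 512
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)
        self.fc1 = nn.Linear(7 * 7 * reduced, fc_dim)
        self.fc2 = nn.Linear(fc_dim, fc_dim)

    def forward(self, c6):
        # c6: bottleneck features, shape (N, 2048, H/64, W/64)
        x = F.relu(self.reduce(c6))
        # bilinear resize to a fixed 7x7 grid (keeps a coarse spatial layout)
        x = F.interpolate(x, size=(7, 7), mode='bilinear', align_corners=False)
        x = torch.flatten(x, start_dim=1)   # (N, 7*7*512)
        x = F.relu(self.fc1(x))             # [fully connected, 1024] x 2
        x = F.relu(self.fc2(x))
        return x                            # T_0: (N, 1024)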
@@ -76,7 +77,7 @@ C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1
 \midrule
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
-& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
+& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
 & bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
 & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
 T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
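The FPN variant differs only in where the branch is taken: the same head attaches directly to the C$_6$ bottleneck, with the "From C$_6$: 1 x 1 conv, 512" row playing the role of the channel reduction. A hypothetical usage line, reusing the sketch above:

# FPN variant: the camera network branches off the C_6 bottleneck; the 1x1
# conv inside the head corresponds to the "From C_6: 1x1 conv, 512" row.
head = CameraMotionHead(in_channels=2048)   # class from the sketch above
t_2 = head(c6)                              # c6: (N, 2048, H/64, W/64) -> (N, 1024)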
@@ -104,7 +105,7 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
 Motion R-CNN ResNet-FPN architecture based on the Mask R-CNN
 ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
 To obtain a larger bottleneck stride, we compute the feature pyramid starting
-with C$_6$ instead of C$_5$ (thus, the subsampling from P$_5$ to P$_6$) is omitted.
+with C$_6$ instead of C$_5$, and thus, the subsampling from P$_5$ to P$_6$ is omitted.
 The modifications are analogous to our Motion R-CNN ResNet,
 but we still show the architecture for completeness.
 Again, we use ReLU activations after all hidden layers and
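As a sketch of the modified pyramid described here: the top level is taken directly from C$_6$, so no extra subsampling from P$_5$ to P$_6$ is needed. This is a hedged illustration assuming standard FPN 1 x 1 lateral connections and nearest-neighbor upsampling (the usual post-merge 3 x 3 convolutions are omitted for brevity); the channel counts follow the tables, and the class name is ours.

import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Feature pyramid whose top level is taken directly from C_6."""
    def __init__(self, in_channels=(256, 512, 1024, 2048, 2048), out_channels=256):
        super().__init__()
        # one 1x1 lateral connection per backbone block C_2 .. C_6
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

    def forward(self, feats):
        # feats: [C_2, C_3, C_4, C_5, C_6], coarsest last
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        pyramid = [laterals[-1]]             # P_6 comes directly from C_6
        for lat in reversed(laterals[:-1]):  # build P_5 .. P_2 top-down
            up = F.interpolate(pyramid[0], size=lat.shape[-2:], mode='nearest')
            pyramid.insert(0, lat + up)
        return pyramid                       # [P_2, P_3, P_4, P_5, P_6]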
@@ -206,10 +207,14 @@ a still and moving camera.
 In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
 ResNet backbone is only computed up to the $C_4$ block, as otherwise the
 feature resolution prior to RoI extraction would be reduced too much.
-In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
-block to make the camera network of both variants comparable.
-Then, in both, the ResNet and ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
-convolution to the $C_5$ features to reduce the number of inputs to the following
+Therefore, in our variant without FPN, we first pass the $C_4$ features through $C_5$
+and $C_6$ blocks (with weights independent of the $C_5$ block used in the RoI head in this variant)
+to increase the bottleneck stride prior to the camera network to 64.
+In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}),
+the backbone makes use of all blocks through $C_6$, and
+we can simply branch off our camera network from the $C_6$ bottleneck.
+Then, in both the ResNet and ResNet-FPN variants, we apply an additional
+convolution to the $C_6$ features to reduce the number of inputs to the following
 fully-connected layers.
 Instead of averaging, we use bilinear resizing to bring the convolutional features
 to a fixed size without losing all spatial information,
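The choice of bilinear resizing over averaging is easy to see in code: global average pooling collapses the feature map to a single vector, while a bilinear resize to 7 x 7 keeps a coarse spatial layout. A minimal sketch on an arbitrary feature map (shapes are illustrative):

import torch
import torch.nn.functional as F

x = torch.randn(1, 512, 24, 78)                 # feature map of arbitrary spatial size
avg = F.adaptive_avg_pool2d(x, output_size=1)   # (1, 512, 1, 1): all spatial info gone
res = F.interpolate(x, size=(7, 7), mode='bilinear',
                    align_corners=False)        # (1, 512, 7, 7): coarse layout kept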
@@ -356,13 +361,6 @@ and sample proposals and RoIs in the exact same way.
 During inference, we proceed analogously to Mask R-CNN.
 In the same way as the RoI mask head, at test time, we compute the RoI motion head
 from the features extracted with refined bounding boxes.
-Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
-extracted RoI features before passing them into the motion head.
-The intuition behind that is that we want to mask out (set to zero) any positions in the
-extracted feature window which belong to the background. Then, the RoI motion
-head aggregates the motion (image matching) information from the backbone
-over positions localized within the object only, but not over positions belonging
-to the background, which should not influence the final object motion estimate.
 
 Again, as for masks and bounding boxes in Mask R-CNN,
 the predicted output object motions are the predicted object motions for the
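For orientation, the inference order kept by this hunk (the motion head computed, like the mask head, from features re-extracted with the refined boxes) could be summarized as follows; every function name here is a hypothetical placeholder, not the thesis API:

def inference(image_pair, model):
    feats = model.backbone(image_pair)
    proposals = model.rpn(feats)
    boxes, classes = model.roi_box_head(feats, proposals)  # refined boxes
    roi_feats = model.roi_extract(feats, boxes)            # re-extract with refined boxes
    masks = model.roi_mask_head(roi_feats)
    motions = model.roi_motion_head(roi_feats)             # per-RoI motion estimates
    return boxes, classes, masks, motions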
@@ -134,10 +134,10 @@ and N$_{motions} = 3$.
 
 \subsection{ResNet}
 \label{ssec:resnet}
-ResNet \cite{ResNet} was initially introduced as a CNN for image classification, but
+ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but
 became popular as basic building block of many deep network architectures for a variety
 of different tasks. Figure \ref{figure:bottleneck}
-shows the fundamental building block of . The additive \emph{residual unit} enables the training
+shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
 of very deep networks without the gradients becoming too small as the distance
 from the output layer increases.
 
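A minimal sketch of the additive residual unit referenced here, assuming a bottleneck design with matching input and output shapes (batch normalization and the projection shortcut are omitted for brevity):

import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """Additive residual unit: output = input + learned residual."""
    def __init__(self, channels, mid):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        r = F.relu(self.conv1(x))
        r = F.relu(self.conv2(r))
        r = self.conv3(r)
        return F.relu(x + r)   # the identity path keeps gradients flowing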
@@ -147,8 +147,9 @@ is also used in many other region-based convolutional networks.
 The initial image data is always passed through the ResNet backbone as a first step to
 bootstrap the complete deep network.
 Note that for the Mask R-CNN architectures we describe below, this is equivalent
-to the standard backbone.
-In , the C$_5$ bottleneck has a stride of 32 with respect to the
+to the standard ResNet-50 backbone. We now introduce one small extension that
+will be useful for our Motion R-CNN network.
+In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
 input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
 For accurately estimating motions corresponding to larger pixel displacements, a larger
 stride may be important.
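The stride arithmetic behind this extension is simple: each ResNet block halves the resolution, so C$_5$ sits at stride 32 and a single extra block C$_6$ reaches the FlowNetS-style stride of 64. A one-line check:

strides = {f"C_{i}": 2 ** i for i in range(1, 7)}
# {'C_1': 2, 'C_2': 4, 'C_3': 8, 'C_4': 16, 'C_5': 32, 'C_6': 64}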
@@ -130,7 +130,6 @@ For training on a dataset without any motion ground truth, e.g.
 Cityscapes, it may be critical to add this term in addition to an unsupervised
 loss for the instance motions.
-
 
 \paragraph{Temporal consistency}
 A next step after the two aforementioned ones could be to extend our network to exploit more than two
 temporally consecutive frames, which has previously been shown to be beneficial in the
@@ -139,16 +138,19 @@ In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
 into our architecture, we could enable temporally consistent motion estimation
 from image sequences of arbitrary length.
 
-\paragraph{Deeper networks for larger bottleneck strides}
-% TODO remove?
-Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
-input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
-For accurately estimating the motion of objects with large displacements between
-the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
-We could do this easily in both of our network variants by adding one or multiple additional
-ResNet blocks. In the variant without FPN, these blocks would have to be placed
-after RoI feature extraction. In the FPN variant, the blocks could be simply
-added after the encoder C$_5$ bottleneck.
-For saving memory, we could however also consider modifying the underlying
-ResNet architecture and increase the number of blocks, but reduce the number
-of layers in each block.
+\paragraph{Masking prior to the RoI motion head}
+Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
+the backbone are integrated over the complete RoI window to yield the features
+for motion estimation.
+For example, average pooling is applied before the fully-connected layers in the variant without FPN.
+However, ideally, the motion (image matching) information from the backbone should only be aggregated over positions belonging to the object itself.
+
+For example, consider
+
+Additionally, we could use the \emph{predicted} binarized masks for each RoI to mask the
+extracted RoI features before passing them into the motion head.
+The intuition behind that is that we want to mask out (set to zero) any positions in the
+extracted feature window which belong to the background. Then, the RoI motion
+head could aggregate the motion (image matching) information from the backbone
+over positions localized within the object only, but not over positions belonging
+to the background, which should probably not influence the final object motion estimate.
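A minimal sketch of the masking idea proposed in this new paragraph: binarize the predicted per-RoI mask and zero out background positions before the motion head aggregates the features. The tensor names and the 0.5 threshold are our assumptions:

import torch

def mask_roi_features(roi_feats, mask_probs, threshold=0.5):
    # roi_feats:  (N_RoI, C, h, w) features extracted per RoI
    # mask_probs: (N_RoI, 1, h, w) predicted foreground probabilities per RoI
    binary = (mask_probs > threshold).to(roi_feats.dtype)
    return roi_feats * binary   # background positions are zeroed out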
@@ -166,7 +166,7 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
 \paragraph{R-CNN training parameters}
 For training the RPN and RoI heads and during inference,
 we use the exact same number of proposals and RoIs as Mask R-CNN in
-the and -FPN variants, respectively.
+the ResNet and ResNet-FPN variants, respectively.
 
 \paragraph{Initialization}
 For initializing the C$_1$ to C$_5$ weights, we use a pre-trained