WIP
commit 73470ee4a8 (parent e5f6f23c6b)
approach.tex: 26 changed lines
@@ -26,7 +26,8 @@ C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\
 \midrule
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
-& From C$_4$: ResNet \{C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
+& From C$_4$: ResNet \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+& ResNet \{C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
 & bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
 & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
 T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@@ -76,7 +77,7 @@ C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1
 \midrule
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
-& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
+& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
 & bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
 & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
 T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
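For illustration, a minimal PyTorch sketch of the camera motion trunk these tables describe: 1 $\times$ 1 convolution to 512 channels, bilinear resize to 7 $\times$ 7, flatten, two fully-connected layers of width 1024 (with ReLU after all hidden layers, as stated in the caption below). The module and all names are hypothetical; this is not the thesis implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraMotionHead(nn.Module):
    """Sketch: 1x1 conv -> bilinear 7x7 resize -> flatten -> 2x FC-1024."""
    def __init__(self, in_channels=2048):  # C6 channel count from the table
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 512, kernel_size=1)
        self.fc1 = nn.Linear(7 * 7 * 512, 1024)
        self.fc2 = nn.Linear(1024, 1024)

    def forward(self, c6):
        x = F.relu(self.reduce(c6))            # 1/64 H x 1/64 W x 512
        x = F.interpolate(x, size=(7, 7),      # fixed 7 x 7 x 512
                          mode='bilinear', align_corners=False)
        x = x.flatten(start_dim=1)             # 1 x (7 * 7 * 512)
        x = F.relu(self.fc1(x))
        return F.relu(self.fc2(x))             # T features, 1 x 1024

# e.g.: t_feats = CameraMotionHead()(torch.randn(1, 2048, 16, 32))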
@@ -104,7 +105,7 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
 Motion R-CNN ResNet-FPN architecture based on the Mask R-CNN
 ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
 To obtain a larger bottleneck stride, we compute the feature pyramid starting
-with C$_6$ instead of C$_5$ (thus, the subsampling from P$_5$ to P$_6$) is omitted.
+with C$_6$ instead of C$_5$, and thus, the subsampling from P$_5$ to P$_6$ is omitted.
 The modifications are analogous to our Motion R-CNN ResNet,
 but we still show the architecture for completeness.
 Again, we use ReLU activations after all hidden layers and
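A minimal sketch of building the pyramid from C$_6$ downwards as described, so that no extra P$_5$-to-P$_6$ subsampling is needed. The 256 pyramid channels and the per-block channel counts are assumptions taken from the standard FPN/ResNet-50 setup; the module name is hypothetical.

import torch.nn as nn
import torch.nn.functional as F

class PyramidFromC6(nn.Module):
    """Top-down FPN pathway started at the C6 bottleneck."""
    def __init__(self, channels=(256, 512, 1024, 2048, 2048), out_channels=256):
        super().__init__()
        # one 1x1 lateral conv per backbone block C2..C6
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in channels)
        # 3x3 conv to smooth each merged map
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in channels)

    def forward(self, feats):  # feats = [C2, C3, C4, C5, C6], strides 4..64
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # merge top-down from P6
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode='nearest')
        return [s(x) for s, x in zip(self.smooth, laterals)]  # [P2, ..., P6]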
@@ -206,10 +207,14 @@ a still and moving camera.
 In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
 ResNet backbone is only computed up to the $C_4$ block, as otherwise the
 feature resolution prior to RoI extraction would be reduced too much.
-In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
-block to make the camera network of both variants comparable.
-Then, in both, the ResNet and ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
-convolution to the $C_5$ features to reduce the number of inputs to the following
+Therefore, in our variant without FPN, we first pass the $C_4$ features through $C_5$
+and $C_6$ blocks (with weights independent of the $C_5$ block used in the RoI head in this variant)
+to increase the bottleneck stride prior to the camera network to 64.
+In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}),
+the backbone makes use of all blocks through $C_6$ and
+we can simply branch off our camera network from the $C_6$ bottleneck.
+Then, in both the ResNet and ResNet-FPN variants, we apply an additional
+convolution to the $C_6$ features to reduce the number of inputs to the following
 fully-connected layers.
 Instead of averaging, we use bilinear resizing to bring the convolutional features
 to a fixed size without losing all spatial information,
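The difference between the two reductions, shown on a hypothetical C$_6$ feature map:

import torch
import torch.nn.functional as F

feat = torch.randn(1, 512, 16, 32)           # e.g. C6 features of a wide frame

pooled = F.adaptive_avg_pool2d(feat, 1)      # 1 x 512 x 1 x 1: layout averaged away
resized = F.interpolate(feat, size=(7, 7),   # 1 x 512 x 7 x 7: coarse layout kept
                        mode='bilinear', align_corners=False)

print(pooled.shape, resized.shape)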
@@ -356,13 +361,6 @@ and sample proposals and RoIs in the exact same way.
 During inference, we proceed analogously to Mask R-CNN.
 In the same way as the RoI mask head, at test time, we compute the RoI motion head
 from the features extracted with refined bounding boxes.
-Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
-extracted RoI features before passing them into the motion head.
-The intuition behind that is that we want to mask out (set to zero) any positions in the
-extracted feature window which belong to the background. Then, the RoI motion
-head aggregates the motion (image matching) information from the backbone
-over positions localized within the object only, but not over positions belonging
-to the background, which should not influence the final object motion estimate.
 
 Again, as for masks and bounding boxes in Mask R-CNN,
 the predicted output object motions are the predicted object motions for the
@@ -134,10 +134,10 @@ and N$_{motions} = 3$.
 
 \subsection{ResNet}
 \label{ssec:resnet}
-ResNet \cite{ResNet} was initially introduced as a CNN for image classification, but
+ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but
 became popular as a basic building block of many deep network architectures for a variety
 of different tasks. Figure \ref{figure:bottleneck}
-shows the fundamental building block of . The additive \emph{residual unit} enables the training
+shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
 of very deep networks without the gradients becoming too small as the distance
 from the output layer increases.
 
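In equation form, a sketch of the bottleneck residual unit following \cite{ResNet} (with $\sigma$ denoting ReLU and $W_1, W_2, W_3$ the 1$\times$1, 3$\times$3, 1$\times$1 convolutions; batch normalization omitted for brevity):

\[
  y \;=\; \sigma\!\big(x + \mathcal{F}(x)\big),
  \qquad
  \mathcal{F}(x) \;=\; W_3 * \sigma\big(W_2 * \sigma(W_1 * x)\big),
\]

so the block only has to learn the residual $\mathcal{F}(x)$, and the identity shortcut gives gradients a direct path to earlier layers.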
@@ -147,8 +147,9 @@ is also used in many other region-based convolutional networks.
 The initial image data is always passed through the ResNet backbone as a first step to
 bootstrap the complete deep network.
 Note that for the Mask R-CNN architectures we describe below, this is equivalent
-to the standard backbone.
-In , the C$_5$ bottleneck has a stride of 32 with respect to the
+to the standard ResNet-50 backbone. We now introduce one small extension that
+will be useful for our Motion R-CNN network.
+In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
 input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
 For accurately estimating motions corresponding to larger pixel displacements, a larger
 stride may be important.
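Concretely, for a hypothetical $1024 \times 2048$ input (a Cityscapes-sized frame):

\[
  \text{C}_5:\ \tfrac{1024}{32} \times \tfrac{2048}{32} = 32 \times 64,
  \qquad
  \text{C}_6:\ \tfrac{1024}{64} \times \tfrac{2048}{64} = 16 \times 32,
\]

so each position at the C$_6$ bottleneck aggregates a twice as large image region per axis, which helps to match pixels under larger displacements.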
@@ -130,7 +130,6 @@ For training on a dataset without any motion ground truth, e.g.
 Cityscapes, it may be critical to add this term in addition to an unsupervised
 loss for the instance motions.
 
-
 \paragraph{Temporal consistency}
 A next step after the two aforementioned ones could be to extend our network to exploit more than two
 temporally consecutive frames, which has previously been shown to be beneficial in the
@@ -139,16 +138,19 @@ In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
 into our architecture, we could enable temporally consistent motion estimation
 from image sequences of arbitrary length.
 
-\paragraph{Deeper networks for larger bottleneck strides}
-% TODO remove?
-Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
-input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
-For accurately estimating the motion of objects with large displacements between
-the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
-We could do this easily in both of our network variants by adding one or multiple additional
-ResNet blocks. In the variant without FPN, these blocks would have to be placed
-after RoI feature extraction. In the FPN variant, the blocks could be simply
-added after the encoder C$_5$ bottleneck.
-For saving memory, we could however also consider modifying the underlying
-ResNet architecture and increase the number of blocks, but reduce the number
-of layers in each block.
+\paragraph{Masking prior to the RoI motion head}
+Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
+the backbone are integrated over the complete RoI window to yield the features
+for motion estimation.
+For example, average pooling is applied before the fully-connected layers in the variant without FPN.
+However, ideally, the motion (image matching) information from the backbone should
+
+For example, consider
+
+Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
+extracted RoI features before passing them into the motion head.
+The intuition behind that is that we want to mask out (set to zero) any positions in the
+extracted feature window which belong to the background. Then, the RoI motion
+head could aggregate the motion (image matching) information from the backbone
+over positions localized within the object only, but not over positions belonging
+to the background, which should probably not influence the final object motion estimate.
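A minimal sketch of this proposed masking step (hypothetical function, shapes, and threshold; not the thesis code): the predicted per-RoI mask is binarized, resized to the RoI feature window, and used to zero background positions before the motion head.

import torch
import torch.nn.functional as F

def mask_roi_features(roi_feats, mask_logits, threshold=0.5):
    """Zero out background positions in extracted RoI features.

    roi_feats:   (N_RoI, C, h, w) features from RoI extraction
    mask_logits: (N_RoI, 1, hm, wm) predicted per-RoI mask logits
    """
    probs = torch.sigmoid(mask_logits)
    # resize predicted masks to the RoI feature window, then binarize
    probs = F.interpolate(probs, size=roi_feats.shape[-2:],
                          mode='bilinear', align_corners=False)
    binary = (probs > threshold).to(roi_feats.dtype)
    return roi_feats * binary  # background positions are set to zero

# e.g.: motions = motion_head(mask_roi_features(roi_feats, mask_logits))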
@@ -166,7 +166,7 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
 \paragraph{R-CNN training parameters}
 For training the RPN and RoI heads and during inference,
 we use the exact same number of proposals and RoIs as Mask R-CNN in
-the and -FPN variants, respectively.
+the ResNet and ResNet-FPN variants, respectively.
 
 \paragraph{Initialization}
 For initializing the C$_1$ to C$_5$ weights, we use a pre-trained
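A sketch of the step schedule quoted in this hunk's context ($0.25 \cdot 10^{-3}$ after the first 144K iterations); the initial learning rate is not shown on this page, so base_lr is a placeholder:

def learning_rate(step, base_lr):
    # base_lr for the first 144K iterations, 0.25e-3 for all remaining ones
    return base_lr if step < 144_000 else 0.25e-3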