mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git (synced 2025-12-13 01:45:50 +00:00)
WIP
parent 653b41ee96
commit 9a207a4024
approach.tex (10 changes)
@@ -9,7 +9,7 @@ First, we modify the backbone network and provide two frames to the R-CNN system
in order to enable image matching between the consecutive frames.
Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
show our Motion R-CNN networks based on Mask R-CNN and Mask R-CNN -FPN,
show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
respectively.
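To make the extended RoI head concrete, the following is a minimal, hypothetical PyTorch-style sketch of a per-RoI 3D motion branch running in parallel to the existing class, box and mask branches; the layer sizes and the 9-dimensional motion encoding (3 rotation, 3 translation, 3 pivot parameters) are assumptions for illustration, not the exact architecture listed in the tables.

import torch.nn as nn

class InstanceMotionHead(nn.Module):
    """Illustrative per-RoI 3D motion branch (layer sizes are assumed)."""
    def __init__(self, in_channels=256, hidden=1024, motion_dims=9):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # collapse the RoI feature window
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, motion_dims),      # e.g. 3 rotation + 3 translation + 3 pivot
        )

    def forward(self, roi_features):
        # roi_features: (num_rois, in_channels, h, w) from RoI extraction
        return self.fc(self.pool(roi_features))  # (num_rois, motion_dims) per-instance motion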

{\begin{table}[h]

@@ -203,12 +203,12 @@ a still and moving camera.

\label{ssec:design}
\paragraph{Camera motion network}
In our variant (Table \ref{table:motionrcnn_resnet}), the underlying
In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
feature resolution prior to RoI extraction would be reduced too much.
In our variant, we therefore first pass the $C_4$ features through a $C_5$
In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
block to make the camera network of both variants comparable.
Then, in both, the and -FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
Then, in both the ResNet and the ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply an additional
convolution to the $C_5$ features to reduce the number of inputs to the following
fully-connected layers.
Instead of averaging, we use bilinear resizing to bring the convolutional features
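As a rough sketch of this camera motion branch (hypothetical PyTorch-style code; the channel widths, the resized spatial size and the 6-parameter output are assumptions, not the exact layers from the tables):

import torch.nn as nn
import torch.nn.functional as F

class CameraMotionHead(nn.Module):
    """Illustrative camera motion branch on top of the C5 features."""
    def __init__(self, c5_channels=2048, reduced=64, size=8, hidden=512, out_dims=6):
        super().__init__()
        self.reduce = nn.Conv2d(c5_channels, reduced, kernel_size=1)  # cut down FC input size
        self.size = size
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(reduced * size * size, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dims),          # e.g. 3 rotation + 3 translation parameters
        )

    def forward(self, c5):
        x = self.reduce(c5)
        # Bilinear resizing instead of global averaging keeps some spatial layout.
        x = F.interpolate(x, size=(self.size, self.size),
                          mode="bilinear", align_corners=False)
        return self.fc(x)                         # (batch, out_dims) camera motion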

@@ -291,7 +291,7 @@ supervision without 3D instance motion ground truth.
In contrast to SfM-Net, where a single optical flow field is
composed and penalized to supervise the motion prediction, our loss considers
the motion of all objects in isolation and composes a batch of flow windows
for the RoIs.
for the RoIs. Network predictions are shown in red.
}
\label{figure:flow_loss}
\end{figure}
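For intuition, the flow induced by a single rigid 3D motion can be composed along the lines of SfM-Net: with camera intrinsics $K$, per-pixel depth $d(\mathbf{p})$ and a predicted rotation $R$, translation $\mathbf{t}$ and pivot $\mathbf{c}$, one plausible composition (not necessarily the exact formulation used in the thesis) is
\[
\mathbf{X} = d(\mathbf{p})\, K^{-1} \tilde{\mathbf{p}}, \qquad
\mathbf{X}' = R\,(\mathbf{X} - \mathbf{c}) + \mathbf{c} + \mathbf{t}, \qquad
\mathbf{w}(\mathbf{p}) = \pi\!\left(K \mathbf{X}'\right) - \mathbf{p},
\]
where $\pi$ denotes perspective division. Restricting this composition to the pixels inside a RoI yields the per-instance flow windows penalized by the loss.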

@@ -57,6 +57,7 @@ Table \ref{table:flownets} shows the classical FlowNetS architecture for optical
& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Refinement}}\\
\midrule
& 5 $\times$ 5 deconv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
\multicolumn{3}{c}{...}\\
\midrule
@@ -64,7 +65,8 @@ flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
\bottomrule

\caption {
FlowNetS \cite{FlowNet} architecture.
FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
are used for refinement.
}
\label{table:flownets}
\end{longtable}
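As a small illustration of one such refinement step, the hypothetical PyTorch snippet below applies a stride-2 transpose convolution that doubles the spatial resolution of the bottleneck features, mirroring the 5 x 5 deconv rows of the table; the padding choices are assumptions made so that the resolution exactly doubles.

import torch
import torch.nn as nn

# One FlowNetS-style refinement step (illustrative): upsample 1/64 -> 1/32 resolution.
deconv = nn.ConvTranspose2d(1024, 512, kernel_size=5, stride=2,
                            padding=2, output_padding=1)

x = torch.randn(1, 1024, 6, 8)   # bottleneck features at 1/64 of a 384 x 512 input
y = deconv(x)
print(y.shape)                   # torch.Size([1, 512, 12, 16]), i.e. 1/32 resolution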

@@ -85,7 +87,10 @@ Recently, other, similarly generic,
encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.

\subsection{SfM-Net}
Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detail \todo{finish}.
Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture.
Motions and full-image masks for a fixed number N$_{motions}$ of independent objects
are predicted in addition to a depth map, and an unsupervised re-projection loss based on
image brightness differences penalizes the predictions.
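A generic form of such a brightness-based re-projection loss, stated here only for intuition (SfM-Net's full objective contains further terms), penalizes the difference between the first frame and the second frame warped by the composed flow $\mathbf{w}$:
\[
L_{\text{photo}} = \frac{1}{|\Omega|} \sum_{\mathbf{p} \in \Omega}
\left| I_t(\mathbf{p}) - I_{t+1}\!\left(\mathbf{p} + \mathbf{w}(\mathbf{p})\right) \right|,
\]
where the warped frame is evaluated with differentiable bilinear sampling so that the loss can be backpropagated to the motion and depth predictions.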

{
%\begin{table}[h]
@@ -94,8 +99,7 @@ Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detai
\toprule
\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
\multicolumn{3}{c}{\todo{Conv-Deconv}}\\
\midrule
\multicolumn{3}{c}{\textbf{Conv-Deconv}}\\
\midrule
\multicolumn{3}{c}{\textbf{Motion Network}}\\
\midrule
@@ -106,7 +110,7 @@ FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}
object motions & fully connected, $N_{motions} \cdot$ 9 & H $\times$ W $\times$ $N_{motions} \cdot$ 9 \\
camera motion & From FC: $\times$ 2 & H $\times$ W $\times$ 6 \\
\midrule
\multicolumn{3}{c}{\textbf{Structure Network} ()}\\
\multicolumn{3}{c}{\textbf{Structure Network}}\\
\midrule
& input image $I_t$ & H $\times$ W $\times$ 3 \\
& Conv-Deconv & H $\times$ W $\times$ 32 \\
@@ -114,11 +118,14 @@ depth & 1 $\times$ 1 conv, 1 & H $\times$ W $\times$ 1 \\
\bottomrule

\caption {
SfM-Net \cite{SfmNet} architecture.
SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully convolutional
encoder-decoder network, where convolutions and deconvolutions with stride 2 are
used for downsampling and upsampling, respectively. The stride at the bottleneck
with respect to the input image is 32.
The Conv-Deconv weights for the structure and motion networks are not shared,
and N$_{motions} = 3$.
}
\label{table:flownets}
\label{table:sfmnet}
\end{longtable}

@@ -201,7 +208,7 @@ $\begin{bmatrix}
1 \times 1, 512 \\
3 \times 3, 512 \\
1 \times 1, 2048 \\
\end{bmatrix}_{b/2}$ $\times$ 3
\end{bmatrix}_{b/2}$ $\times$ 2
& $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\

\bottomrule

@@ -296,7 +303,7 @@ Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentat
fixed resolution instance masks within the bounding boxes of each detected object.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise binary mask for each instance.
The basic Mask R-CNN architecture is shown in Table \ref{table:maskrcnn_resnet}.
The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
competition between classes for the mask prediction branch.
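A minimal, hypothetical PyTorch-style sketch of such a mask branch is given below: a few convolutions, one upsampling deconvolution, and a final 1 x 1 convolution producing one binary mask logit map per class; the number of layers and channels is illustrative rather than the exact Mask R-CNN configuration.

import torch.nn as nn

class MaskHead(nn.Module):
    """Illustrative Mask R-CNN-style mask branch with per-class sigmoid masks."""
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),   # one mask logit map per class
        )

    def forward(self, roi_features):
        # roi_features: (num_rois, in_channels, 14, 14) -> logits (num_rois, num_classes, 28, 28).
        # Training applies a per-pixel sigmoid and binary cross-entropy, so there is
        # no softmax competition between classes.
        return self.convs(roi_features)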

@@ -370,7 +377,7 @@ of an appropriate scale to be used, depending on the size of the bounding box.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinear upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder.
The Mask R-CNN -FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
At each output position of the resulting RPN pyramid, bounding boxes are predicted
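For illustration, a hypothetical PyTorch-style sketch of the top-down FPN pathway is shown below: 1 x 1 lateral convolutions on the encoder stages, with each coarser level upsampled and added to the next finer one; the channel counts and the use of bilinear upsampling are assumptions of this sketch.

import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Illustrative FPN top-down pathway producing P2..P6 from C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        feats = [lat(c) for lat, c in zip(self.laterals, (c2, c3, c4, c5))]
        # Walk from the coarse bottleneck down to the finest level, upsampling and adding.
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + F.interpolate(feats[i + 1], size=feats[i].shape[-2:],
                                                mode="bilinear", align_corners=False)
        p2, p3, p4, p5 = [s(f) for s, f in zip(self.smooth, feats)]
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)    # extra coarse level for the RPN
        return p2, p3, p4, p5, p6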

bib.bib (2 changes)
@@ -271,7 +271,7 @@
@inproceedings{UnsupFlownet,
title={Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
booktitle={ECCV Workshop on Brave new ideas for motion representations in videos},
booktitle={ECCV Workshops},
year={2016}}

@inproceedings{ImageNet,
@@ -150,5 +150,5 @@ ResNet blocks. In the variant without FPN, these blocks would have to be placed
after RoI feature extraction. In the FPN variant, the blocks could be simply
added after the encoder C$_5$ bottleneck.
For saving memory, we could however also consider modifying the underlying
architecture and increase the number of blocks, but reduce the number
ResNet architecture and increase the number of blocks, but reduce the number
of layers in each block.
@@ -222,7 +222,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
Camera and instance motion errors are averaged over the validation set.
We optionally enable camera motion prediction (cam.),
replace the backbone with -FPN (FPN),
replace the ResNet backbone with ResNet-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise
object motions (sup.) with 3D motion ground truth (3D) or
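For reference, with $\mathbf{w}$ the estimated and $\mathbf{w}^{gt}$ the ground truth flow over the pixel set $\Omega$, these two metrics can be written as
\[
\mathrm{AEE} = \frac{1}{|\Omega|} \sum_{\mathbf{p} \in \Omega}
\left\| \mathbf{w}(\mathbf{p}) - \mathbf{w}^{gt}(\mathbf{p}) \right\|_2,
\qquad
\mathrm{Fl\text{-}all} = \frac{1}{|\Omega|} \left|\left\{ \mathbf{p} \in \Omega :
\left\| \mathbf{w}(\mathbf{p}) - \mathbf{w}^{gt}(\mathbf{p}) \right\|_2 \geq 3
\ \wedge\
\left\| \mathbf{w}(\mathbf{p}) - \mathbf{w}^{gt}(\mathbf{p}) \right\|_2 \geq 0.05 \left\| \mathbf{w}^{gt}(\mathbf{p}) \right\|_2
\right\}\right|.
\]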
Binary file not shown. Before: 1.4 MiB, After: 1.4 MiB
Binary file not shown. Before: 76 KiB, After: 75 KiB
Binary file not shown. Before: 536 KiB, After: 537 KiB
@@ -73,7 +73,7 @@ and predicts pixel-precise segmentation masks for each detected object (Figure \
\centering
\includegraphics[width=\textwidth]{figures/maskrcnn_cs}
\caption{
Instance segmentation results of Mask R-CNN -FPN \cite{MaskRCNN}
Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN}
on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}.
}
\label{figure:maskrcnn_cs}
@@ -104,6 +104,7 @@ manageable pieces.
Overview of our network based on Mask R-CNN. For each region of interest (RoI), we predict the instance motion
in parallel to the class, bounding box and mask. Additionally, we branch off a
small network for predicting the camera motion from the bottleneck.
Novel components in addition to Mask R-CNN are shown in red.
}
\label{figure:net_intro}
\end{figure}