Mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git, synced 2025-12-13 09:55:49 +00:00
WIP
parent 653b41ee96
commit 9a207a4024
10  approach.tex
@@ -9,7 +9,7 @@ First, we modify the backbone network and provide two frames to the R-CNN system
 in order to enable image matching between the consecutive frames.
 Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
 region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
-show our Motion R-CNN networks based on Mask R-CNN and Mask R-CNN -FPN,
+show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
 respectively.

 {\begin{table}[h]
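The hunk above summarizes the two modifications: the backbone receives two consecutive frames so that features can be matched between them, and the RoI head additionally regresses a 3D motion per proposal. As an illustration of the first point only, here is a minimal PyTorch-style sketch of a backbone stem that takes the two frames stacked along the channel axis; whether the thesis stacks channels exactly like this is an assumption of the sketch, borrowed from FlowNetS-style inputs.

```python
import torch
import torch.nn as nn

class TwoFrameStem(nn.Module):
    """Hypothetical backbone stem taking two RGB frames stacked channel-wise."""
    def __init__(self, out_channels=64):
        super().__init__()
        # First convolution widened from 3 to 6 input channels.
        self.conv1 = nn.Conv2d(6, out_channels, kernel_size=7, stride=2, padding=3)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, frame_t, frame_t1):
        # Concatenate the consecutive frames along the channel axis.
        x = torch.cat([frame_t, frame_t1], dim=1)  # N x 6 x H x W
        return self.relu(self.conv1(x))            # N x 64 x H/2 x W/2

stem = TwoFrameStem()
features = stem(torch.randn(1, 3, 256, 832), torch.randn(1, 3, 256, 832))
```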
@@ -203,12 +203,12 @@ a still and moving camera.

 \label{ssec:design}
 \paragraph{Camera motion network}
-In our variant (Table \ref{table:motionrcnn_resnet}), the underlying
+In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
 ResNet backbone is only computed up to the $C_4$ block, as otherwise the
 feature resolution prior to RoI extraction would be reduced too much.
-In our variant, we therefore first pass the $C_4$ features through a $C_5$
+In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
 block to make the camera network of both variants comparable.
-Then, in both, the and -FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
+Then, in both the ResNet and the ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply an additional
 convolution to the $C_5$ features to reduce the number of inputs to the following
 fully-connected layers.
 Instead of averaging, we use bilinear resizing to bring the convolutional features
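A minimal sketch of a camera motion head following the description in this hunk: $C_5$ features, one additional convolution to reduce the channel count, bilinear resizing to a fixed spatial size instead of global averaging, and fully-connected layers regressing a camera motion. The layer sizes and the 6-parameter output are illustrative assumptions, not the exact thesis configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CameraMotionHead(nn.Module):
    """Illustrative camera motion head following the description above."""
    def __init__(self, c5_channels=2048, reduced=128, pooled_size=8):
        super().__init__()
        self.pooled_size = pooled_size
        # Additional convolution reducing the number of inputs to the FC layers.
        self.reduce = nn.Conv2d(c5_channels, reduced, kernel_size=1)
        self.fc1 = nn.Linear(reduced * pooled_size * pooled_size, 512)
        # 6 outputs: 3 rotation + 3 translation parameters (an assumption).
        self.fc2 = nn.Linear(512, 6)

    def forward(self, c5):
        x = F.relu(self.reduce(c5))
        # Bilinear resizing instead of averaging, as stated in the text.
        x = F.interpolate(x, size=(self.pooled_size, self.pooled_size),
                          mode='bilinear', align_corners=False)
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # N x 6 camera motion

head = CameraMotionHead()
motion = head(torch.randn(1, 2048, 16, 48))
```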
@@ -291,7 +291,7 @@ supervision without 3D instance motion ground truth.
 In contrast to SfM-Net, where a single optical flow field is
 composed and penalized to supervise the motion prediction, our loss considers
 the motion of all objects in isolation and composes a batch of flow windows
-for the RoIs.
+for the RoIs. Network predictions are shown in red.
 }
 \label{figure:flow_loss}
 \end{figure}
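The caption describes composing a dense flow window per RoI from the predicted 3D motion, which enables flow-based supervision without 3D motion ground truth. Below is a hedged NumPy sketch of one way to compose such a window from a depth map and a rigid motion (R, t) under a pinhole camera model; the intrinsics, window size, and function name are assumptions for illustration.

```python
import numpy as np

def flow_window_from_rigid_motion(depth, R, t, fx, fy, cx, cy):
    """Illustrative composition of an optical flow window from a rigid 3D
    motion (R, t) applied to points back-projected with a depth map.
    The pinhole model and all names are assumptions for this sketch."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Back-project pixels to 3D camera coordinates.
    X = (xs - cx) / fx * depth
    Y = (ys - cy) / fy * depth
    P = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)
    # Apply the rigid object motion.
    P2 = P @ R.T + t
    # Re-project to the image plane.
    x2 = fx * P2[:, 0] / P2[:, 2] + cx
    y2 = fy * P2[:, 1] / P2[:, 2] + cy
    flow_u = x2.reshape(h, w) - xs
    flow_v = y2.reshape(h, w) - ys
    return np.stack([flow_u, flow_v], axis=-1)  # h x w x 2

# Example: identity rotation, small translation along x.
flow = flow_window_from_rigid_motion(
    np.full((14, 14), 5.0), np.eye(3), np.array([0.1, 0.0, 0.0]),
    fx=720.0, fy=720.0, cx=7.0, cy=7.0)
```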
@@ -57,6 +57,7 @@ Table \ref{table:flownets} shows the classical FlowNetS architecture for optical
 & 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
 \midrule
 \multicolumn{3}{c}{\textbf{Refinement}}\\
+\midrule
 & 5 $\times$ 5 deconv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
 \multicolumn{3}{c}{...}\\
 \midrule
@@ -64,7 +65,8 @@ flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
 \bottomrule

 \caption {
-FlowNetS \cite{FlowNet} architecture.
+FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
+are used for refinement.
 }
 \label{table:flownets}
 \end{longtable}
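The table rows above outline FlowNetS: stride-2 convolutions on the stacked frame pair down to 1/64 resolution, stride-2 transpose convolutions (deconvolutions) for refinement, and a final x2 bilinear upsampling of the flow. A compressed PyTorch-style caricature of that down/up pattern with only two stages; the channel counts are arbitrary, and the skip connections and multi-scale flow outputs of the real FlowNetS are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFlowNetS(nn.Module):
    """Two-stage caricature of the FlowNetS encoder/refinement pattern."""
    def __init__(self):
        super().__init__()
        # Encoder: stride-2 convolutions reduce resolution.
        self.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2)
        # Refinement: stride-2 transpose convolutions ("deconvolutions").
        self.deconv1 = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1)
        self.predict_flow = nn.Conv2d(64, 2, kernel_size=3, padding=1)

    def forward(self, im1, im2):
        x = torch.cat([im1, im2], dim=1)   # stack both frames channel-wise
        x = F.relu(self.conv1(x))          # 1/2 resolution
        x = F.relu(self.conv2(x))          # 1/4 resolution
        x = F.relu(self.deconv1(x))        # back to 1/2 resolution
        flow = self.predict_flow(x)        # 2-channel flow at 1/2 resolution
        # Final x2 bilinear upsampling, as in the table's last row.
        return F.interpolate(flow, scale_factor=2, mode='bilinear',
                             align_corners=False)

net = TinyFlowNetS()
flow = net(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```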
@@ -85,7 +87,10 @@ Recently, other, similarly generic,
 encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.

 \subsection{SfM-Net}
-Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detail \todo{finish}.
+Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture.
+Motions and full-image masks for a fixed number N$_{motions}$ of independent objects
+are predicted in addition to a depth map, and an unsupervised re-projection loss based on
+image brightness differences penalizes the predictions.

 {
 %\begin{table}[h]
@@ -94,8 +99,7 @@ Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detai
 \toprule
 \textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
 \midrule\midrule
-\multicolumn{3}{c}{\todo{Conv-Deconv}}\\
-\midrule
+\multicolumn{3}{c}{\textbf{Conv-Deconv}}\\
 \midrule
 \multicolumn{3}{c}{\textbf{Motion Network}}\\
 \midrule
@@ -106,7 +110,7 @@ FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}
 object motions & fully connected, $N_{motions} \cdot$ 9 & H $\times$ W $\times$ $N_{motions} \cdot$ 9 \\
 camera motion & From FC: $\times$ 2 & H $\times$ W $\times$ 6 \\
 \midrule
-\multicolumn{3}{c}{\textbf{Structure Network} ()}\\
+\multicolumn{3}{c}{\textbf{Structure Network}}\\
 \midrule
 & input image $I_t$ & H $\times$ W $\times$ 3 \\
 & Conv-Deconv & H $\times$ W $\times$ 32 \\
@@ -114,11 +118,14 @@ depth & 1 $\times$1 conv, 1 & H $\times$ W $\times$ 1 \\
 \bottomrule

 \caption {
-SfM-Net \cite{SfmNet} architecture.
+SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully convolutional
+encoder-decoder network, where convolutions and deconvolutions with stride 2 are
+used for downsampling and upsampling, respectively. The stride at the bottleneck
+with respect to the input image is 32.
 The Conv-Deconv weights for the structure and motion networks are not shared,
 and N$_{motions} = 3$.
 }
-\label{table:flownets}
+\label{table:sfmnet}
 \end{longtable}

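The new text describes SfM-Net's unsupervised re-projection loss based on image brightness differences. A minimal PyTorch-style sketch of that idea, warping the second frame towards the first with a flow field and penalizing the absolute brightness difference; this illustrates the principle under stated assumptions and is not SfM-Net's exact loss.

```python
import torch
import torch.nn.functional as F

def brightness_reprojection_loss(frame1, frame2, flow):
    """Warp frame2 towards frame1 with the predicted flow and penalize the
    absolute brightness difference. frame1/frame2: N x 3 x H x W,
    flow: N x 2 x H x W in pixels. Names and shapes are assumptions."""
    n, _, h, w = frame1.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing='ij')
    # Target sampling locations = pixel coordinates + predicted flow.
    x2 = xs.unsqueeze(0) + flow[:, 0]
    y2 = ys.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1] for grid_sample (x, y order).
    grid = torch.stack([2.0 * x2 / (w - 1) - 1.0,
                        2.0 * y2 / (h - 1) - 1.0], dim=-1)
    warped = F.grid_sample(frame2, grid, mode='bilinear',
                           padding_mode='border', align_corners=True)
    return (warped - frame1).abs().mean()

loss = brightness_reprojection_loss(torch.rand(1, 3, 64, 64),
                                    torch.rand(1, 3, 64, 64),
                                    torch.zeros(1, 2, 64, 64))
```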
@@ -201,7 +208,7 @@ $\begin{bmatrix}
 1 \times 1, 512 \\
 3 \times 3, 512 \\
 1 \times 1, 2048 \\
-\end{bmatrix}_{b/2}$ $\times$ 3
+\end{bmatrix}_{b/2}$ $\times$ 2
 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\

 \bottomrule
@@ -296,7 +303,7 @@ Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentat
 fixed resolution instance masks within the bounding boxes of each detected object.
 This is done by simply extending the Faster R-CNN head with multiple convolutions, which
 compute a pixel-precise binary mask for each instance.
-The basic Mask R-CNN architecture is shown in Table \ref{table:maskrcnn_resnet}.
+The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
 Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
 competition between classes for the mask prediction branch.

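Because the mask branch uses a per-class sigmoid rather than a softmax over classes, the mask loss for a RoI is a binary cross-entropy applied only to the logits of the ground-truth class. A short PyTorch-style sketch of that detail; the tensor shapes and names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """mask_logits: R x C x M x M per-class mask logits for R RoIs,
    gt_masks: R x M x M binary targets, gt_classes: R ground-truth labels.
    Per-class sigmoid + binary cross-entropy: classes do not compete,
    only the channel of the ground-truth class is penalized."""
    idx = torch.arange(mask_logits.shape[0])
    logits_for_gt_class = mask_logits[idx, gt_classes]  # R x M x M
    return F.binary_cross_entropy_with_logits(logits_for_gt_class, gt_masks)

loss = mask_loss(torch.randn(4, 80, 28, 28),
                 torch.randint(0, 2, (4, 28, 28)).float(),
                 torch.randint(0, 80, (4,)))
```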
@@ -370,7 +377,7 @@ of an appropriate scale to be used, depending on the size of the bounding box.
 For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
 encoder by combining bilinear upsampled feature maps coming from the bottleneck
 with lateral skip connections from the encoder.
-The Mask R-CNN -FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
+The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
 Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
 the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
 At each output position of the resulting RPN pyramid, bounding boxes are predicted
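The top-down FPN pathway described here, upsampling the coarser pyramid level and merging it with a lateral projection of the encoder feature map, can be sketched as below. The sketch assumes bilinear upsampling as mentioned in the text (the original FPN paper uses nearest-neighbour upsampling) and 256 pyramid channels; channel counts and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    """One FPN-style top-down step: upsample the coarser map and add a
    lateral 1x1 projection of the encoder feature map at the finer scale."""
    def __init__(self, encoder_channels, pyramid_channels=256):
        super().__init__()
        self.lateral = nn.Conv2d(encoder_channels, pyramid_channels, kernel_size=1)
        # 3x3 convolution to smooth the merged map, as in FPN.
        self.smooth = nn.Conv2d(pyramid_channels, pyramid_channels,
                                kernel_size=3, padding=1)

    def forward(self, top_down, encoder_feat):
        lat = self.lateral(encoder_feat)
        up = F.interpolate(top_down, size=lat.shape[-2:], mode='bilinear',
                           align_corners=False)
        return self.smooth(up + lat)

# E.g. building P4 from P5 and the encoder block C4 (channel counts assumed).
merge = TopDownMerge(encoder_channels=1024)
p4 = merge(torch.randn(1, 256, 16, 16), torch.randn(1, 1024, 32, 32))
```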
2  bib.bib
@@ -271,7 +271,7 @@
 @inproceedings{UnsupFlownet,
 title={Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
 author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
-booktitle={ECCV Workshop on Brave new ideas for motion representations in videos},
+booktitle={ECCV Workshops},
 year={2016}}

 @inproceedings{ImageNet,
@@ -150,5 +150,5 @@ ResNet blocks. In the variant without FPN, these blocks would have to be placed
 after RoI feature extraction. In the FPN variant, the blocks could be simply
 added after the encoder C$_5$ bottleneck.
 For saving memory, we could however also consider modifying the underlying
-architecture and increase the number of blocks, but reduce the number
+ResNet architecture and increase the number of blocks, but reduce the number
 of layers in each block.
@@ -222,7 +222,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
 wrong by both $\geq 3$ pixels and $\geq 5\%$.
 Camera and instance motion errors are averaged over the validation set.
 We optionally enable camera motion prediction (cam.),
-replace the backbone with -FPN (FPN),
+replace the ResNet backbone with ResNet-FPN (FPN),
 or input XYZ coordinates into the backbone (XYZ).
 We either supervise
 object motions (sup.) with 3D motion ground truth (3D) or
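The caption defines the two flow metrics used for evaluation. A small NumPy sketch of both, assuming flow fields are given as H x W x 2 arrays; the function name is an illustrative assumption.

```python
import numpy as np

def flow_metrics(flow_est, flow_gt):
    """AEE: mean Euclidean distance between estimated and ground-truth flow.
    Fl-all: fraction of pixels whose endpoint error is both >= 3 px and
    >= 5% of the ground-truth flow magnitude."""
    epe = np.linalg.norm(flow_est - flow_gt, axis=-1)  # per-pixel endpoint error
    gt_mag = np.linalg.norm(flow_gt, axis=-1)
    aee = epe.mean()
    fl_all = np.mean((epe >= 3.0) & (epe >= 0.05 * gt_mag))
    return aee, fl_all

aee, fl_all = flow_metrics(np.zeros((2, 2, 2)), np.ones((2, 2, 2)))
```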
Binary file not shown. Size before: 1.4 MiB, after: 1.4 MiB
Binary file not shown. Size before: 76 KiB, after: 75 KiB
Binary file not shown. Size before: 536 KiB, after: 537 KiB
@@ -73,7 +73,7 @@ and predicts pixel-precise segmentation masks for each detected object (Figure \
 \centering
 \includegraphics[width=\textwidth]{figures/maskrcnn_cs}
 \caption{
-Instance segmentation results of Mask R-CNN -FPN \cite{MaskRCNN}
+Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN}
 on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}.
 }
 \label{figure:maskrcnn_cs}
@@ -104,6 +104,7 @@ manageable pieces.
 Overview of our network based on Mask R-CNN. For each region of interest (RoI), we predict the instance motion
 in parallel to the class, bounding box and mask. Additionally, we branch off a
 small network for predicting the camera motion from the bottleneck.
+Novel components in addition to Mask R-CNN are shown in red.
 }
 \label{figure:net_intro}
 \end{figure}