Mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git, synced 2026-01-15 09:34:32 +00:00

WIP

This commit is contained in:
parent 9a18aac080
commit ce2a7a5253

approach.tex (14 changed lines)
@@ -17,7 +17,7 @@ region proposal. Table \ref{table:motionrcnn_resnet} shows the modified network.
 \toprule
 \textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
 \midrule\midrule
-& input image & H $\times$ W $\times$ C \\
+& input images & H $\times$ W $\times$ C \\
 \midrule
 C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
 \midrule
@@ -26,8 +26,8 @@ C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
 & From C$_4$: ResNet-50 \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
-& 1 $\times$ 1 conv, 1024 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
-& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
+& 1 $\times$ 1 conv, 2048 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+& 3 $\times$ 3 conv, 2048, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
 & average pool & 1 $\times$ 2048 \\
 M$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
 
@@ -65,7 +65,7 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfra
 \toprule
 \textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
 \midrule\midrule
-& input image & H $\times$ W $\times$ C \\
+& input images & H $\times$ W $\times$ C \\
 \midrule
 C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
 \midrule
@@ -73,8 +73,8 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfra
 \midrule
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
-& From C$_5$: 1 $\times$ 1 conv, 1024 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
-& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
+& From C$_5$: 1 $\times$ 1 conv, 2048 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+& 3 $\times$ 3 conv, 2048, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
 & average pool & 1 $\times$ 2048 \\
 M$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
 $R_t^{cam}$& From M$_1$: fully connected, 3 & 1 $\times$ 3 \\
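To make the two Camera Motion Network variants above concrete, here is a minimal Keras sketch of our reading of the tables (the function name, activations, and padding are assumptions, not taken from the thesis code):

import tensorflow as tf
from tensorflow.keras import layers

def camera_motion_head(c5):
    # 1 x 1 conv, 2048 followed by 3 x 3 conv, 2048, stride 2 (rows above)
    x = layers.Conv2D(2048, 1, activation="relu")(c5)
    x = layers.Conv2D(2048, 3, strides=2, padding="same", activation="relu")(x)
    # average pool -> 1 x 2048
    x = layers.GlobalAveragePooling2D()(x)
    # [fully connected, 1024] x 2 -> M_1
    for _ in range(2):
        x = layers.Dense(1024, activation="relu")(x)
    # R_t^cam: fully connected, 3
    return layers.Dense(3)(x)

inputs = tf.keras.Input(shape=(None, None, 2048))  # C5 features at 1/32 resolution
model = tf.keras.Model(inputs, camera_motion_head(inputs))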
@@ -265,7 +265,7 @@ and finally compute the optical flow at each point as the difference of the init
 Note that we batch this computation over all RoIs, so that we only perform
 it once per forward pass. The mathematical details are analogous to the
 dense, full image flow computation in the following subsection and will not
-be repeated here. \todo{add diagram to make it easier to understand}
+be repeated here. \todo{probably better to add the mathematical details, as it may otherwise be confusing at some points}
 
 For each RoI, we can now penalize the optical flow grid to supervise the object motion.
 If there is optical flow ground truth available, we can use the RoI bounding box to
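Since the todo above asks for the mathematical details, here is a hedged sketch of the per-RoI computation in our own notation ($K$ is the camera intrinsics, $d$ the predicted depth at pixel $\mathbf{p}$, $(R, t)$ the predicted object motion, and $\pi$ the perspective projection; none of these symbols are fixed by the diff):
$$X = d(\mathbf{p})\, K^{-1} \tilde{\mathbf{p}}, \qquad X' = R X + t, \qquad \mathbf{w}(\mathbf{p}) = \pi(K X') - \mathbf{p},$$
evaluated over the RoI point grid; the dense, full image computation in the following subsection proceeds identically over all pixels.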
background.tex (119 changed lines)
@@ -2,12 +2,12 @@ In this section, we will give a more detailed description of previous works
 we directly build on and other prerequisites.
 
 \subsection{Optical flow and scene flow}
-Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
+Let $I_t,I_{t+1} : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
 sequence of images.
 The optical flow
-$\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$
-maps pixel coordinates in the first frame $I_1$ to pixel coordinates of the
-visually corresponding pixel in the second frame $I_2$,
+$\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$
+maps pixel coordinates in the first frame $I_t$ to pixel coordinates of the
+visually corresponding pixel in the second frame $I_{t+1}$,
 and can be interpreted as the apparent movement of brightness patterns between the two frames.
 Optical flow can be regarded as two-dimensional motion estimation.
 
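For concreteness, the apparent-motion interpretation above is usually formalized as the brightness constancy assumption (standard background, not quoted from the thesis):
$$I_t(\mathbf{p}) \approx I_{t+1}\big(\mathbf{p} + \mathbf{w}(\mathbf{p})\big) \qquad \text{for all } \mathbf{p} \in P.$$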
@@ -32,11 +32,50 @@ performing upsampling of the compressed features and resulting in an encoder-deco
 The most popular deep networks of this kind for end-to-end optical flow prediction
 are variants of the FlowNet family \cite{FlowNet, FlowNet2},
 which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
-Table \ref{} shows the classical FlowNetS architecture for optical flow prediction.
+Table \ref{table:flownets} shows the classical FlowNetS architecture for optical flow prediction.
 
+{
+%\begin{table}[h]
+%\centering
+\begin{longtable}{llr}
+\toprule
+\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
+\midrule\midrule
+& input images $I_t$ and $I_{t+1}$ & H $\times$ W $\times$ 6 \\
+\midrule
+\multicolumn{3}{c}{\textbf{Encoder}}\\
+\midrule
+& 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\
+& 5 $\times$ 5 conv, 128, stride 2 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 128 \\
+& 5 $\times$ 5 conv, 256, stride 2 & $\tfrac{1}{8}$ H $\times$ $\tfrac{1}{8}$ W $\times$ 256 \\
+& 3 $\times$ 3 conv, 256 & $\tfrac{1}{8}$ H $\times$ $\tfrac{1}{8}$ W $\times$ 256 \\
+& 3 $\times$ 3 conv, 512, stride 2 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 512 \\
+& 3 $\times$ 3 conv, 512 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 512 \\
+& 3 $\times$ 3 conv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
+& 3 $\times$ 3 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
+& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
+\midrule
+\multicolumn{3}{c}{\textbf{Refinement}}\\
+& 5 $\times$ 5 deconv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
+\multicolumn{3}{c}{...}\\
+\midrule
+flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
+\bottomrule
+
+\caption {
+FlowNetS \cite{FlowNet} architecture.
+}
+\label{table:flownets}
+\end{longtable}
+
+
+%\end{table}
+}
+
 Note that the network itself is a rather generic autoencoder and is specialized for optical flow only through being trained
 with supervision from dense optical flow ground truth.
 Potentially, the same network could also be used for semantic segmentation if
-the number of output channels was adapted from two to the number of classes. % TODO verify
+the number of final and intermediate output channels was adapted from two to the number of classes.
 Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching arguably well,
 given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
 Note that the maximum displacement that can be correctly estimated only depends on the number of 2D strides or pooling
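A rough worked example of the displacement claim above (our arithmetic, not from the thesis): the FlowNetS encoder contains six stride-2 convolutions, so features at the bottleneck have an effective stride of $2^6 = 64$ input pixels, and a 3 $\times$ 3 convolution there already relates patterns on the order of a hundred pixels apart; each additional stride-2 operation would roughly double that range.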
@@ -44,8 +83,45 @@ operations in the encoder.
 Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
 
 \subsection{SfM-Net}
-Here, we will describe the SfM-Net \cite{SfmNet} architecture in more detail and show their results
-and some of the issues.
+Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detail \todo{finish}.
 
+{
+%\begin{table}[h]
+%\centering
+\begin{longtable}{llr}
+\toprule
+\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
+\midrule\midrule
+\multicolumn{3}{c}{\todo{Conv-Deconv}}\\
+\midrule
+\midrule
+\multicolumn{3}{c}{\textbf{Motion Network}}\\
+\midrule
+& input images $I_t$ and $I_{t+1}$ & H $\times$ W $\times$ 6 \\
+& Conv-Deconv & H $\times$ W $\times$ 32 \\
+masks & 1 $\times$ 1 conv, N$_{motions}$ & H $\times$ W $\times$ N$_{motions}$ \\
+FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}$ $\times$ 2 & 1 $\times$ 512 \\
+object motions & From FC: fully connected, $N_{motions} \cdot$ 9 & 1 $\times$ $N_{motions} \cdot$ 9 \\
+camera motion & From FC: fully connected, 6 & 1 $\times$ 6 \\
+\midrule
+\multicolumn{3}{c}{\textbf{Structure Network}}\\
+\midrule
+& input image $I_t$ & H $\times$ W $\times$ 3 \\
+& Conv-Deconv & H $\times$ W $\times$ 32 \\
+depth & 1 $\times$ 1 conv, 1 & H $\times$ W $\times$ 1 \\
+\bottomrule
+
+\caption {
+SfM-Net \cite{SfmNet} architecture.
+The Conv-Deconv weights for the structure and motion networks are not shared,
+and N$_{motions} = 3$.
+}
+\label{table:sfmnet}
+\end{longtable}
+
+
+%\end{table}
+}
+
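For reference, the outputs in this table are combined roughly as follows (a hedged paraphrase of the SfM-Net paper; pivot points and the exact order of transformations are simplified here): the depth map is back-projected to a point cloud $X$, and each point is displaced by the mask-weighted object motions and then by the camera motion,
$$X' = R^{cam}\Big(X + \sum_{k=1}^{N_{motions}} m_k(\mathbf{p})\,\big(R_k X + t_k - X\big)\Big) + t^{cam},$$
with optical flow obtained by projecting $X'$ back into the image and subtracting $\mathbf{p}$.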
 \subsection{ResNet}
 \label{ssec:resnet}
@@ -109,7 +185,7 @@ $\begin{bmatrix}
 \bottomrule
 
 \caption {
-ResNet-50 \cite{ResNet} architecture (Figure from \cite{ResNet}).
+ResNet-50 \cite{ResNet} architecture.
 Operations enclosed in a []$_b$ block make up a single ResNet \enquote{bottleneck}
 block (see Figure \ref{figure:bottleneck}). If the block is denoted as []$_b/2$,
 the first conv operation in the block has a stride of 2. Note that the stride
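As a concrete illustration of the []$_b$ bottleneck block described in this caption, a minimal Keras sketch (the batch-norm placement and the projection shortcut follow standard ResNet-v1 conventions and are assumptions, not taken from the thesis code):

import tensorflow as tf
from tensorflow.keras import layers

def bottleneck(x, filters, stride=1):
    # []_b block: 1x1 reduce, 3x3, 1x1 expand, plus a shortcut connection.
    # For a []_b/2 block, the first conv carries the stride of 2.
    shortcut = x
    y = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(4 * filters, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut whenever the spatial size or channel count changes.
    if stride != 1 or x.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride, use_bias=False)(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))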
@@ -220,18 +296,22 @@ C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$
 \midrule
 \multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)}}\\
 \midrule
-& From C$_4$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 512 \\
-& 1 $\times$ 1 conv, 6 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 6 \\
-& flatten & A $\times$ 6 \\
-& decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & A $\times$ 6 \\
-ROI$_{\mathrm{RPN}}$ & sample bounding boxes \& scores & N$_{RoI}$ $\times$ 6 \\
+R$_0$ & From C$_4$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 512 \\
+& 1 $\times$ 1 conv, $N_a \cdot$ 4 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ $N_a \cdot$ 4 \\
+& flatten & A $\times$ 4 \\
+boxes$_{\mathrm{RPN}}$ & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & A $\times$ 4\\
+& From R$_0$: 1 $\times$ 1 conv, $N_a \cdot$ 2 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ $N_a \cdot$ 2 \\
+& flatten & A $\times$ 2 \\
+scores$_{\mathrm{RPN}}$& softmax & A $\times$ 2 \\
+ROI$_{\mathrm{RPN}}$ & sample boxes$_{\mathrm{RPN}}$ and scores$_{\mathrm{RPN}}$ & N$_{RoI}$ $\times$ 6 \\
 \midrule
 \multicolumn{3}{c}{\textbf{RoI Head}}\\
 \midrule
 & From C$_4$ with ROI$_{\mathrm{RPN}}$: RoI extraction & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 1024 \\
 R$_1$& ResNet-50 \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
-ave & average pool & N$_{RPN}$ $\times$ 2048 \\
-boxes& From ave: fully connected, 4 & N$_{RPN}$ $\times$ 4 \\
+ave & average pool & N$_{RoI}$ $\times$ 2048 \\
+& From ave: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
+boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
 & From ave: fully connected, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
 classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
 \midrule
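Eq. \ref{eq:pred_bounding_box} itself is outside this diff; the decoding it refers to is presumably the standard Faster R-CNN box parameterization, in which a predicted offset $(t_x, t_y, t_w, t_h)$ and an anchor or RoI $(x_a, y_a, w_a, h_a)$ give
$$x = x_a + w_a t_x, \qquad y = y_a + h_a t_y, \qquad w = w_a e^{t_w}, \qquad h = h_a e^{t_h}.$$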
@@ -271,7 +351,7 @@ The Mask R-CNN ResNet-50-FPN variant is shown in Table \ref{table:maskrcnn_resne
 Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
 the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
 At each output position of the resulting RPN pyramid, bounding boxes are predicted
-with respect to 3 anchor aspect ratios $\{1:2, 1:1, 2:1\}$ and a single scale.
+with respect to 3 anchor aspect ratios $\{1:2, 1:1, 2:1\}$ and a single scale ($N_a = 3$).
 For P$_2$, P$_3$, P$_4$, P$_5$, P$_6$,
 the scale corresponds to anchor bounding boxes of areas $32^2, 64^2, 128^2, 256^2, 512^2$,
 respectively.
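A small sketch of how the anchor shapes follow from an area and an aspect ratio (the helper is ours, purely illustrative):

import math

def anchor_shape(area, aspect_ratio):
    # aspect_ratio = height / width, so that w * h = area and h / w = aspect_ratio.
    w = math.sqrt(area / aspect_ratio)
    h = w * aspect_ratio
    return w, h

# P2 anchors: area 32^2 at aspect ratios 1:2, 1:1, 2:1.
for ratio in (0.5, 1.0, 2.0):
    print(anchor_shape(32 ** 2, ratio))  # (45.3, 22.6), (32.0, 32.0), (22.6, 45.3)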
@@ -311,7 +391,7 @@ P$_6$ & From P$_5$: 2 $\times$ 2 subsample, 256 & $\tfrac{1}{64}$ H $\times$ $\t
 \midrule
 \multicolumn{3}{c}{$\forall i \in \{2...6\}$}\\
 & From P$_i$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{2^i}$ H $\times$ $\tfrac{1}{2^i}$ W $\times$ 512 \\
-& 1 $\times$ 1 conv, 6 & $\tfrac{1}{2^i}$ H $\times$ $\tfrac{1}{2^i}$ W $\times$ 6 \\
+& 1 $\times$ 1 conv, $N_a \cdot$ 6 & $\tfrac{1}{2^i}$ H $\times$ $\tfrac{1}{2^i}$ W $\times$ $N_a \cdot$ 6 \\
 RPN$_i$& flatten & A$_i$ $\times$ 6 \\
 \midrule
 & From \{RPN$_2$ ... RPN$_6$\}: concatenate & A $\times$ 6 \\
@@ -323,7 +403,8 @@ ROI$_{\mathrm{RPN}}$ & sample bounding boxes \& scores & N$_{RoI}$ $\times$ 6 \\
 R$_2$ & From \{P$_2$ ... P$_6$\} with ROI$_{\mathrm{RPN}}$: RoI extraction (Eq. \ref{eq:level_assignment}) & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\
 & 2 $\times$ 2 max pool & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 256 \\
 F$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & N$_{RoI}$ $\times$ 1024 \\
-boxes& From F$_1$: fully connected, 4 & N$_{RoI}$ $\times$ 4 \\
+& From F$_1$: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4 \\
+boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
 & From F$_1$: fully connected, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
 classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
 \midrule
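Eq. \ref{eq:level_assignment} is likewise not shown in this diff; the standard FPN heuristic it presumably denotes assigns an RoI of width $w$ and height $h$ to pyramid level
$$k = \left\lfloor k_0 + \log_2\!\big(\sqrt{wh}/224\big) \right\rfloor,$$
with $k_0 = 4$ in the FPN paper and $k$ clipped to the available levels.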
@@ -9,7 +9,7 @@ as well as extensions for motion estimation and related evaluations
 and postprocessing steps. In addition, we generated all ground truth for
 Motion R-CNN in the form of TFRecords from the raw Virtual KITTI
 data to enable fast loading during training.
-Note that for RoI pooling and cropping,
+Note that for RoI extraction and cropping operations,
 we use the \texttt{tf.image.crop\_and\_resize} TensorFlow function with
 interpolation set to bilinear.
 
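A minimal usage sketch of that function (shapes and box values are invented for illustration):

import tensorflow as tf

features = tf.random.normal([1, 64, 64, 1024])   # e.g. a C4 feature map, N x H x W x C
boxes = tf.constant([[0.1, 0.2, 0.5, 0.6]])      # one normalized [y1, x1, y2, x2] box per RoI
box_indices = tf.constant([0])                   # which image in the batch each box refers to

# Bilinear RoI extraction onto a fixed grid, as described above.
rois = tf.image.crop_and_resize(features, boxes, box_indices,
                                crop_size=[14, 14], method="bilinear")
print(rois.shape)  # (1, 14, 14, 1024)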
@@ -26,7 +26,7 @@ Each sequence is rendered with varying lighting and weather conditions and
 from different viewing angles, resulting in a total of 10 variants per sequence.
 In addition to the RGB frames, a variety of ground truth is supplied.
 For each frame, we are given a dense depth and optical flow map and the camera
-extrinsics matrix. There are two annotated object classes, cars and vans.
+extrinsics matrix. There are two annotated object classes, cars and vans (N$_{cls}$ = 2).
 For all cars and vans in each frame, we are given 2D and 3D object bounding
 boxes, instance masks, 3D poses, and various other labels.
 