mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git (synced 2025-12-13 01:45:50 +00:00)
WIP
parent 653b41ee96
commit 9a207a4024
approach.tex (10 changes)
@@ -9,7 +9,7 @@ First, we modify the backbone network and provide two frames to the R-CNN system
in order to enable image matching between the consecutive frames.
Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
show our Motion R-CNN networks based on Mask R-CNN and Mask R-CNN -FPN,
show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
respectively.
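To make the extended RoI head concrete, the following is a minimal, hypothetical PyTorch-style sketch of a per-RoI 3D motion branch running in parallel to the existing class, box and mask branches; the layer sizes and the 9-dimensional motion encoding (3 rotation, 3 translation, 3 pivot parameters) are assumptions for illustration, not the exact architecture listed in the tables.

import torch.nn as nn

class InstanceMotionHead(nn.Module):
    """Illustrative per-RoI 3D motion branch (layer sizes are assumed)."""
    def __init__(self, in_channels=256, hidden=1024, motion_dims=9):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # collapse the RoI feature window
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, motion_dims),      # e.g. 3 rotation + 3 translation + 3 pivot
        )

    def forward(self, roi_features):
        # roi_features: (num_rois, in_channels, h, w) from RoI extraction
        return self.fc(self.pool(roi_features))  # (num_rois, motion_dims) per-instance motion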

{\begin{table}[h]

@@ -203,12 +203,12 @@ a still and moving camera.

\label{ssec:design}
\paragraph{Camera motion network}
In our variant (Table \ref{table:motionrcnn_resnet}), the underlying
In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
feature resolution prior to RoI extraction would be reduced too much.
In our variant, we therefore first pass the $C_4$ features through a $C_5$
In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
block to make the camera network of both variants comparable.
Then, in both, the and -FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
Then, in both the ResNet and the ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply an additional
convolution to the $C_5$ features to reduce the number of inputs to the following
fully-connected layers.
Instead of averaging, we use bilinear resizing to bring the convolutional features
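As a rough sketch of this camera motion branch (hypothetical PyTorch-style code; the channel widths, the resized spatial size and the 6-parameter output are assumptions, not the exact layers from the tables):

import torch.nn as nn
import torch.nn.functional as F

class CameraMotionHead(nn.Module):
    """Illustrative camera motion branch on top of the C5 features."""
    def __init__(self, c5_channels=2048, reduced=64, size=8, hidden=512, out_dims=6):
        super().__init__()
        self.reduce = nn.Conv2d(c5_channels, reduced, kernel_size=1)  # cut down FC input size
        self.size = size
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(reduced * size * size, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, out_dims),          # e.g. 3 rotation + 3 translation parameters
        )

    def forward(self, c5):
        x = self.reduce(c5)
        # Bilinear resizing instead of global averaging keeps some spatial layout.
        x = F.interpolate(x, size=(self.size, self.size),
                          mode="bilinear", align_corners=False)
        return self.fc(x)                         # (batch, out_dims) camera motion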

@@ -291,7 +291,7 @@ supervision without 3D instance motion ground truth.
In contrast to SfM-Net, where a single optical flow field is
composed and penalized to supervise the motion prediction, our loss considers
the motion of all objects in isolation and composes a batch of flow windows
for the RoIs.
for the RoIs. Network predictions are shown in red.
}
\label{figure:flow_loss}
\end{figure}
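For intuition, the flow induced by a single rigid 3D motion can be composed along the lines of SfM-Net: with camera intrinsics $K$, per-pixel depth $d(\mathbf{p})$ and a predicted rotation $R$, translation $\mathbf{t}$ and pivot $\mathbf{c}$, one plausible composition (not necessarily the exact formulation used in the thesis) is
\[
\mathbf{X} = d(\mathbf{p})\, K^{-1} \tilde{\mathbf{p}}, \qquad
\mathbf{X}' = R\,(\mathbf{X} - \mathbf{c}) + \mathbf{c} + \mathbf{t}, \qquad
\mathbf{w}(\mathbf{p}) = \pi\!\left(K \mathbf{X}'\right) - \mathbf{p},
\]
where $\pi$ denotes perspective division. Restricting this composition to the pixels inside a RoI yields the per-instance flow windows penalized by the loss.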

@@ -57,6 +57,7 @@ Table \ref{table:flownets} shows the classical FlowNetS architecture for optical
& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Refinement}}\\
\midrule
& 5 $\times$ 5 deconv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
\multicolumn{3}{c}{...}\\
\midrule
@@ -64,7 +65,8 @@ flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
\bottomrule

\caption {
FlowNetS \cite{FlowNet} architecture.
FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
are used for refinement.
}
\label{table:flownets}
\end{longtable}
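As a small illustration of one such refinement step, the hypothetical PyTorch snippet below applies a stride-2 transpose convolution that doubles the spatial resolution of the bottleneck features, mirroring the 5 x 5 deconv rows of the table; the padding choices are assumptions made so that the resolution exactly doubles.

import torch
import torch.nn as nn

# One FlowNetS-style refinement step (illustrative): upsample 1/64 -> 1/32 resolution.
deconv = nn.ConvTranspose2d(1024, 512, kernel_size=5, stride=2,
                            padding=2, output_padding=1)

x = torch.randn(1, 1024, 6, 8)   # bottleneck features at 1/64 of a 384 x 512 input
y = deconv(x)
print(y.shape)                   # torch.Size([1, 512, 12, 16]), i.e. 1/32 resolution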

@@ -85,7 +87,10 @@ Recently, other, similarly generic,
encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.

\subsection{SfM-Net}
Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detail \todo{finish}.
Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture.
Motions and full-image masks for a fixed number N$_{motions}$ of independent objects
are predicted in addition to a depth map, and an unsupervised re-projection loss based on
image brightness differences penalizes the predictions.
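A generic form of such a brightness-based re-projection loss, stated here only for intuition (SfM-Net's full objective contains further terms), penalizes the difference between the first frame and the second frame warped by the composed flow $\mathbf{w}$:
\[
L_{\text{photo}} = \frac{1}{|\Omega|} \sum_{\mathbf{p} \in \Omega}
\left| I_t(\mathbf{p}) - I_{t+1}\!\left(\mathbf{p} + \mathbf{w}(\mathbf{p})\right) \right|,
\]
where the warped frame is evaluated with differentiable bilinear sampling so that the loss can be backpropagated to the motion and depth predictions.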

{
%\begin{table}[h]
@@ -94,8 +99,7 @@ Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detai
\toprule
\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
\multicolumn{3}{c}{\todo{Conv-Deconv}}\\
\midrule
\multicolumn{3}{c}{\textbf{Conv-Deconv}}\\
\midrule
\multicolumn{3}{c}{\textbf{Motion Network}}\\
\midrule
@@ -106,7 +110,7 @@ FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}
object motions & fully connected, $N_{motions} \cdot$ 9 & H $\times$ W $\times$ $N_{motions} \cdot$ 9 \\
camera motion & From FC: $\times$ 2 & H $\times$ W $\times$ 6 \\
\midrule
\multicolumn{3}{c}{\textbf{Structure Network} ()}\\
\multicolumn{3}{c}{\textbf{Structure Network}}\\
\midrule
& input image $I_t$ & H $\times$ W $\times$ 3 \\
& Conv-Deconv & H $\times$ W $\times$ 32 \\
@@ -114,11 +118,14 @@ depth & 1 $\times$ 1 conv, 1 & H $\times$ W $\times$ 1 \\
\bottomrule

\caption {
SfM-Net \cite{SfmNet} architecture.
SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully convolutional
encoder-decoder network, where convolutions and deconvolutions with stride 2 are
used for downsampling and upsampling, respectively. The stride at the bottleneck
with respect to the input image is 32.
The Conv-Deconv weights for the structure and motion networks are not shared,
and N$_{motions} = 3$.
}
\label{table:flownets}
\label{table:sfmnet}
\end{longtable}

@@ -201,7 +208,7 @@ $\begin{bmatrix}
1 \times 1, 512 \\
3 \times 3, 512 \\
1 \times 1, 2048 \\
\end{bmatrix}_{b/2}$ $\times$ 3
\end{bmatrix}_{b/2}$ $\times$ 2
& $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\

\bottomrule

@@ -296,7 +303,7 @@ Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentat
fixed resolution instance masks within the bounding boxes of each detected object.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise binary mask for each instance.
The basic Mask R-CNN architecture is shown in Table \ref{table:maskrcnn_resnet}.
The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
competition between classes for the mask prediction branch.
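A minimal, hypothetical PyTorch-style sketch of such a mask branch is given below: a few convolutions, one upsampling deconvolution, and a final 1 x 1 convolution producing one binary mask logit map per class; the number of layers and channels is illustrative rather than the exact Mask R-CNN configuration.

import torch.nn as nn

class MaskHead(nn.Module):
    """Illustrative Mask R-CNN-style mask branch with per-class sigmoid masks."""
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, kernel_size=1),   # one mask logit map per class
        )

    def forward(self, roi_features):
        # roi_features: (num_rois, in_channels, 14, 14) -> logits (num_rois, num_classes, 28, 28).
        # Training applies a per-pixel sigmoid and binary cross-entropy, so there is
        # no softmax competition between classes.
        return self.convs(roi_features)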

@@ -370,7 +377,7 @@ of an appropriate scale to be used, depending on the size of the bounding box.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinear upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder.
The Mask R-CNN -FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
At each output position of the resulting RPN pyramid, bounding boxes are predicted
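For illustration, a hypothetical PyTorch-style sketch of the top-down FPN pathway is shown below: 1 x 1 lateral convolutions on the encoder stages, with each coarser level upsampled and added to the next finer one; the channel counts and the use of bilinear upsampling are assumptions of this sketch.

import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Illustrative FPN top-down pathway producing P2..P6 from C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        feats = [lat(c) for lat, c in zip(self.laterals, (c2, c3, c4, c5))]
        # Walk from the coarse bottleneck down to the finest level, upsampling and adding.
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + F.interpolate(feats[i + 1], size=feats[i].shape[-2:],
                                                mode="bilinear", align_corners=False)
        p2, p3, p4, p5 = [s(f) for s, f in zip(self.smooth, feats)]
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)    # extra coarse level for the RPN
        return p2, p3, p4, p5, p6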

bib.bib (2 changes)
@@ -271,7 +271,7 @@
@inproceedings{UnsupFlownet,
title={Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
booktitle={ECCV Workshop on Brave new ideas for motion representations in videos},
booktitle={ECCV Workshops},
year={2016}}

@inproceedings{ImageNet,
@@ -150,5 +150,5 @@ ResNet blocks. In the variant without FPN, these blocks would have to be placed
after RoI feature extraction. In the FPN variant, the blocks could be simply
added after the encoder C$_5$ bottleneck.
For saving memory, we could however also consider modifying the underlying
architecture and increase the number of blocks, but reduce the number
ResNet architecture and increase the number of blocks, but reduce the number
of layers in each block.
@@ -222,7 +222,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
Camera and instance motion errors are averaged over the validation set.
We optionally enable camera motion prediction (cam.),
replace the backbone with -FPN (FPN),
replace the ResNet backbone with ResNet-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise
object motions (sup.) with 3D motion ground truth (3D) or
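For reference, with $\mathbf{w}$ the estimated and $\mathbf{w}^{gt}$ the ground truth flow over the pixel set $\Omega$, these two metrics can be written as
\[
\mathrm{AEE} = \frac{1}{|\Omega|} \sum_{\mathbf{p} \in \Omega}
\left\| \mathbf{w}(\mathbf{p}) - \mathbf{w}^{gt}(\mathbf{p}) \right\|_2,
\qquad
\mathrm{Fl\text{-}all} = \frac{1}{|\Omega|} \left|\left\{ \mathbf{p} \in \Omega :
\left\| \mathbf{w}(\mathbf{p}) - \mathbf{w}^{gt}(\mathbf{p}) \right\|_2 \geq 3
\ \wedge\
\left\| \mathbf{w}(\mathbf{p}) - \mathbf{w}^{gt}(\mathbf{p}) \right\|_2 \geq 0.05 \left\| \mathbf{w}^{gt}(\mathbf{p}) \right\|_2
\right\}\right|.
\]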
Binary file not shown. Before: 1.4 MiB, After: 1.4 MiB
Binary file not shown. Before: 76 KiB, After: 75 KiB
Binary file not shown. Before: 536 KiB, After: 537 KiB
@@ -73,7 +73,7 @@ and predicts pixel-precise segmentation masks for each detected object (Figure \
\centering
\includegraphics[width=\textwidth]{figures/maskrcnn_cs}
\caption{
Instance segmentation results of Mask R-CNN -FPN \cite{MaskRCNN}
Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN}
on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}.
}
\label{figure:maskrcnn_cs}
@@ -104,6 +104,7 @@ manageable pieces.
Overview of our network based on Mask R-CNN. For each region of interest (RoI), we predict the instance motion
in parallel to the class, bounding box and mask. Additionally, we branch off a
small network for predicting the camera motion from the bottleneck.
Novel components in addition to Mask R-CNN are shown in red.
}
\label{figure:net_intro}
\end{figure}