fucked it

This commit is contained in:
Simon Meister 2017-11-16 16:51:54 +01:00
parent a5a014fc57
commit 653b41ee96
6 changed files with 72 additions and 41 deletions


@ -9,7 +9,7 @@ First, we modify the backbone network and provide two frames to the R-CNN system
in order to enable image matching between the consecutive frames.
Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
show our Motion R-CNN networks based on Mask R-CNN ResNet-50 and Mask R-CNN ResNet-50-FPN,
show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
respectively.
{\begin{table}[h]
@ -20,13 +20,13 @@ respectively.
\midrule\midrule
& input images & H $\times$ W $\times$ C \\
\midrule
C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)} (Table \ref{table:maskrcnn_resnet})}\\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
& From C$_4$: ResNet-50 \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
& From C$_4$: ResNet \{C$_5$, C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
& bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@ -52,8 +52,8 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
\end{tabular}
\caption {
Motion R-CNN ResNet-50 architecture based on the Mask R-CNN
ResNet-50 architecture (Table \ref{table:maskrcnn_resnet}).
Motion R-CNN ResNet architecture based on the Mask R-CNN
ResNet architecture (Table \ref{table:maskrcnn_resnet}).
We use ReLU activations after all hidden layers and
additionally apply dropout with $p = 0.5$ after all fully-connected hidden layers.
}
@ -70,13 +70,13 @@ additonally dropout with $p = 0.5$ after all fully-connected hidden layers.
\midrule\midrule
& input images & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
& From C$_5$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
& bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@ -101,9 +101,11 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
\end{tabular}
\caption {
Motion R-CNN ResNet-50-FPN architecture based on the Mask R-CNN
ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
The modifications are analogous to our Motion R-CNN ResNet-50,
Motion R-CNN ResNet-FPN architecture based on the Mask R-CNN
ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
To obtain a larger bottleneck stride, we compute the feature pyramid starting
with C$_6$ instead of C$_5$ (thus, the subsampling from P$_5$ to P$_6$ is omitted).
The modifications are analogous to our Motion R-CNN ResNet,
but we still show the architecture for completeness.
Again, we use ReLU activations after all hidden layers and
additionally apply dropout with $p = 0.5$ after all fully-connected hidden layers.
@ -201,12 +203,12 @@ a still and moving camera.
\label{ssec:design}
\paragraph{Camera motion network}
In our ResNet-50 variant (Table \ref{table:motionrcnn_resnet}), the underlying
In our ResNet variant (Table \ref{table:motionrcnn_resnet}), the underlying
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
feature resolution for RoI extraction would be reduced too much.
In our ResNet-50 variant, we first pass the $C_4$ features through a $C_5$
feature resolution prior to RoI extraction would be reduced too much.
In the ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
block to make the camera network of both variants comparable.
Then, in both, the ResNet-50 and ResNet-50-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
Then, in both the ResNet and ResNet-FPN variants (Table \ref{table:motionrcnn_resnet_fpn}), we apply an additional
convolution to the $C_5$ features to reduce the number of inputs to the following
fully-connected layers.
Instead of averaging, we use bilinear resizing to bring the convolutional features


@ -129,13 +129,25 @@ and N$_{motions} = 3$.
\label{ssec:resnet}
ResNet \cite{ResNet} was initially introduced as a CNN for image classification, but
became popular as basic building block of many deep network architectures for a variety
of different tasks. In Table \ref{table:resnet}, we show the ResNet-50 variant
of different tasks. Figure \ref{figure:bottleneck}
shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
of very deep networks without the gradients becoming too small as the distance
from the output layer increases.
In Table \ref{table:resnet}, we show the ResNet variant
that will serve as the basic CNN backbone of our networks, and
is also used in many other region-based convolutional networks.
The initial image data is always passed through ResNet-50 as a first step to
The initial image data is always passed through the ResNet backbone as a first step to
bootstrap the complete deep network.
Figure \ref{figure:bottleneck}
shows the fundamental building block of ResNet-50.
Note that for the Mask R-CNN architectures we describe below, which do not use C$_6$,
this is equivalent to the standard ResNet-50 backbone.
In ResNet, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
to increase the bottleneck stride to 64, following FlowNetS.
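The stage-wise stride arithmetic can be traced with a small sketch (stage names and channel widths follow the standard ResNet-50 table; the helper itself and its names are ours, not from the thesis):

```python
# Shape bookkeeping only: each strided bottleneck stage halves the resolution,
# so appending a C6 stage doubles the bottleneck stride from 32 to 64.
STAGES = [
    ("C1", 2, 64),     # 7x7 conv, stride 2
    ("pool", 2, 64),   # 3x3 max pool, stride 2 (start of the C2 stage)
    ("C2", 1, 256),    # overall stride 4
    ("C3", 2, 512),    # overall stride 8
    ("C4", 2, 1024),   # overall stride 16
    ("C5", 2, 2048),   # overall stride 32
    ("C6", 2, 2048),   # extra stage: overall stride 64
]

def feature_shapes(h, w):
    """Trace feature map shapes (h, w, channels) through the backbone."""
    shapes = {}
    for name, stride, channels in STAGES:
        h, w = h // stride, w // stride
        shapes[name] = (h, w, channels)
    return shapes
```

For a 512 $\times$ 512 input this yields C$_4$ at 32 $\times$ 32 $\times$ 1024, C$_5$ at 16 $\times$ 16 $\times$ 2048, and the added C$_6$ at 8 $\times$ 8 $\times$ 2048, matching the strides in the table.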
{
%\begin{table}[h]
@ -146,7 +158,7 @@ shows the fundamental building block of ResNet-50.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
\multicolumn{3}{c}{\textbf{ResNet-50}}\\
\multicolumn{3}{c}{\textbf{ResNet}}\\
\midrule
C$_1$ & 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\
@ -182,17 +194,25 @@ $\begin{bmatrix}
3 \times 3, 512 \\
1 \times 1, 2048 \\
\end{bmatrix}_{b/2}$ $\times$ 3
& $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
& $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
\midrule
C$_6$ &
$\begin{bmatrix}
1 \times 1, 512 \\
3 \times 3, 512 \\
1 \times 1, 2048 \\
\end{bmatrix}_{b/2}$ $\times$ 3
& $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\bottomrule
\caption {
ResNet-50 \cite{ResNet} architecture.
Backbone architecture based on ResNet-50 \cite{ResNet}.
Operations enclosed in a []$_b$ block make up a single ResNet \enquote{bottleneck}
block (see Figure \ref{figure:bottleneck}). If the block is denoted as []$_{b/2}$,
the first convolution operation in the block has a stride of 2. Note that the stride
is only applied to the first block, but not to repeated blocks.
Batch normalization \cite{BN} is used after every convolution.
Batch normalization \cite{BN} is used after every convolution.
}
\label{table:resnet}
\end{longtable}
@ -230,7 +250,7 @@ Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed size (H $\times$ W) feature maps are extracted from the compressed feature map of the image,
each corresponding to one of the proposal bounding boxes.
The extracted per-RoI feature maps are collected into a batch and passed into a small Fast R-CNN
The extracted per-RoI (region of interest) feature maps are collected into a batch and passed into a small Fast R-CNN
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
The extraction technique is called \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the full image features
is divided into a H $\times$ W grid of cells. For each cell, the values of the underlying
@ -276,7 +296,7 @@ Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentat
fixed resolution instance masks within the bounding boxes of each detected object.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise binary mask for each instance.
The basic Mask R-CNN ResNet-50 architecture is shown in Table \ref{table:maskrcnn_resnet}.
The basic Mask R-CNN architecture is shown in Table \ref{table:maskrcnn_resnet}.
Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
competition between classes in the mask prediction branch.
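This point can be illustrated with a minimal NumPy sketch (toy shapes and names are ours): with a sigmoid, each class channel is thresholded independently, so the channels do not compete as they would under a softmax.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy per-RoI mask head output: 28 x 28 logits for each of 3 classes.
rng = np.random.default_rng(0)
n_cls, h, w = 3, 28, 28
mask_logits = rng.standard_normal((h, w, n_cls))

# Sigmoid is applied per channel; probabilities are independent per class.
mask_probs = sigmoid(mask_logits)

# The box head's predicted class k merely selects which channel to threshold.
k = 1
instance_mask = mask_probs[:, :, k] > 0.5

# Unlike softmax, the class channels need not sum to one at each pixel.
assert not np.allclose(mask_probs.sum(axis=-1), 1.0)
```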
@ -295,7 +315,7 @@ boundary of the bounding box, and thus some detail is lost.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)}}\\
\midrule
@ -311,7 +331,7 @@ ROI$_{\mathrm{RPN}}$ & sample boxes$_{\mathrm{RPN}}$ and scores$_{\mathrm{RPN}}$
\multicolumn{3}{c}{\textbf{RoI Head}}\\
\midrule
& From C$_4$ with ROI$_{\mathrm{RPN}}$: RoI extraction & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 1024 \\
R$_1$& ResNet-50 \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
R$_1$& ResNet \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
ave & average pool & N$_{RoI}$ $\times$ 2048 \\
& From ave: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
@ -327,8 +347,8 @@ masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}
\bottomrule
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet-50 \cite{ResNet} architecture.
Note that this is equivalent to the Faster R-CNN architecture if the mask
Mask R-CNN \cite{MaskRCNN} ResNet \cite{ResNet} architecture.
Note that this is equivalent to the Faster R-CNN ResNet architecture if the mask
head is left out. In Mask R-CNN, bilinear sampling is used for RoI extraction,
whereas Faster R-CNN used RoI pooling.
}
@ -350,7 +370,7 @@ of an appropriate scale to be used, depending of the size of the bounding box.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinear upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder.
The Mask R-CNN ResNet-50-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
At each output position of the resulting RPN pyramid, bounding boxes are predicted
@ -395,7 +415,7 @@ which is the highest resolution feature map.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
C$_5$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{Feature Pyramid Network (FPN)}}\\
\midrule
@ -435,7 +455,7 @@ masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}
\bottomrule
\caption {
Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
Mask R-CNN \cite{MaskRCNN} ResNet-FPN \cite{ResNet} architecture.
Operations enclosed in a []$_p$ block make up a single FPN
block (see Figure \ref{figure:fpn_block}).
}


@ -273,3 +273,9 @@
author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
booktitle={ECCV Workshop on Brave new ideas for motion representations in videos},
year={2016}}
@article{ImageNet,
title={ImageNet Large Scale Visual Recognition Challenge},
author={Olga Russakovsky and others},
journal={International Journal of Computer Vision},
year={2015}}


@ -66,7 +66,7 @@ and also fine-tune on the training set as mentioned in the previous paragraph.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
@ -85,8 +85,8 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfra
\end{tabular}
\caption {
A possible Motion R-CNN ResNet-50-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
A possible Motion R-CNN ResNet-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
\end{table}
@ -140,14 +140,15 @@ into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
\paragraph{Deeper networks for larger bottleneck strides}
% TODO remove?
Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
For accurately estimating the motion of objects with large displacements between
the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
We could do this easily in both of our network variants by adding one ore multiple additional
We could do this easily in both of our network variants by adding one or multiple additional
ResNet blocks. In the variant without FPN, these blocks would have to be placed
after RoI feature extraction. In the FPN variant, the blocks could be simply
added after the encoder C$_5$ bottleneck.
For saving memory, we could however also consider modifying the underlying
ResNet-50 architecture and increase the number of blocks, but reduce the number
ResNet architecture and increase the number of blocks, but reduce the number
of layers in each block.


@ -152,7 +152,7 @@ predicted camera motions.
For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both, the Motion R-CNN ResNet-50 and ResNet-50-FPN variants.
We train both the Motion R-CNN ResNet and ResNet-FPN variants.
\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
@ -166,11 +166,13 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\paragraph{R-CNN training parameters}
For training the RPN and RoI heads and during inference,
we use the exact same number of proposals and RoIs as Mask R-CNN in
the ResNet-50 and ResNet-50-FPN variants, respectively.
the ResNet and ResNet-FPN variants, respectively.
\paragraph{Initialization}
For initializing the C$_1$ to C$_5$ weights, we use a pre-trained
ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository.
Following the pre-existing TensorFlow implementation of Faster R-CNN,
we initialize all hidden layers with He initialization \cite{He}.
we initialize all other hidden layers with He initialization \cite{He}.
For the fully-connected camera and instance motion output layers,
we use a truncated normal initializer with a standard
deviation of $0.0001$ and zero mean, truncated at two standard deviations.
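As a sketch of the described initializer (mirroring the resampling behavior of TensorFlow's truncated normal initializer; the helper name is ours):

```python
import numpy as np

def truncated_normal(shape, stddev=1e-4, rng=None):
    """Zero-mean normal samples, redrawn until all lie within two stddevs."""
    rng = np.random.default_rng(rng)
    out = rng.normal(0.0, stddev, size=shape)
    bad = np.abs(out) > 2 * stddev
    while bad.any():
        # Redraw only the out-of-range samples until none remain.
        out[bad] = rng.normal(0.0, stddev, size=int(bad.sum()))
        bad = np.abs(out) > 2 * stddev
    return out

# E.g. weights for a hypothetical 1024 -> 9 motion output layer.
w = truncated_normal((1024, 9), stddev=1e-4, rng=0)
assert np.abs(w).max() <= 2e-4
```

Starting the output layers near zero keeps the initial motion predictions close to the identity motion, which is a common choice for regression heads.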
@ -220,7 +222,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
Camera and instance motion errors are averaged over the validation set.
We optionally enable camera motion prediction (cam.),
replace the ResNet-50 backbone with ResNet-50-FPN (FPN),
replace the ResNet backbone with ResNet-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise
object motions (sup.) with 3D motion ground truth (3D) or


@ -73,7 +73,7 @@ and predicts pixel-precise segmentation masks for each detected object (Figure \
\centering
\includegraphics[width=\textwidth]{figures/maskrcnn_cs}
\caption{
Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN}
Instance segmentation results of Mask R-CNN ResNet-FPN \cite{MaskRCNN}
on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}.
}
\label{figure:maskrcnn_cs}