diff --git a/approach.tex b/approach.tex
index 150fa55..d583ed1 100644
--- a/approach.tex
+++ b/approach.tex
@@ -9,7 +9,7 @@ First, we modify the backbone network and provide two frames to the R-CNN system
in order to enable image matching between the consecutive frames.
Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each region proposal.
Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
-show our Motion R-CNN networks based on Mask R-CNN ResNet-50 and Mask R-CNN ResNet-50-FPN,
+show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
respectively.
{\begin{table}[h]
@@ -20,13 +20,13 @@ respectively.
\midrule\midrule
& input images & H $\times$ W $\times$ C \\
\midrule
-C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
+C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)} (Table \ref{table:maskrcnn_resnet})}\\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
-& From C$_4$: ResNet-50 \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+& From C$_4$: ResNet \{C$_5$, C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
+& 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
& bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@@ -52,8 +52,8 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
\end{tabular}
\caption
{
-Motion R-CNN ResNet-50 architecture based on the Mask R-CNN
-ResNet-50 architecture (Table \ref{table:maskrcnn_resnet}).
+Motion R-CNN ResNet architecture based on the Mask R-CNN
+ResNet architecture (Table \ref{table:maskrcnn_resnet}).
We use ReLU activations after all hidden layers and
additonally dropout with $p = 0.5$ after all fully-connected hidden layers.
}
@@ -70,13 +70,13 @@ additonally dropout with $p = 0.5$ after all fully-connected hidden layers.
\midrule\midrule
& input images & H $\times$ W $\times$ C \\
\midrule
-C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
-& From C$_5$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
+& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
& bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@@ -101,9 +101,11 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
\end{tabular}
\caption
{
-Motion R-CNN ResNet-50-FPN architecture based on the Mask R-CNN
-ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
+Motion R-CNN ResNet-FPN architecture based on the Mask R-CNN
+ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
+To obtain a larger bottleneck stride, we compute the feature pyramid starting
+with C$_6$ instead of C$_5$ (thus, the subsampling from P$_5$ to P$_6$ is omitted).
+The modifications are analogous to our Motion R-CNN ResNet,
but we still show the architecture for completeness.
Again, we use ReLU activations after all hidden layers and
additonally dropout with $p = 0.5$ after all fully-connected hidden layers.
@@ -201,12 +203,12 @@ a still and moving camera.
\label{ssec:design}
\paragraph{Camera motion network}
-In our ResNet-50 variant (Table \ref{table:motionrcnn_resnet}), the underlying
+In our ResNet variant (Table \ref{table:motionrcnn_resnet}), the underlying
ResNet backbone is only computed up to the $C_4$ block, as otherwise the
-feature resolution for RoI extraction would be reduced too much.
-In our ResNet-50 variant, we first pass the $C_4$ features through a $C_5$
-block to make the camera network of both variants comparable.
-Then, in both, the ResNet-50 and ResNet-50-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
-convolution to the $C_5$ features to reduce the number of inputs to the
+feature resolution prior to RoI extraction would be reduced too much.
+In our ResNet variant, we therefore first pass the $C_4$ features through $C_5$
+and $C_6$ blocks to make the camera network of both variants comparable.
+Then, in both the ResNet and ResNet-FPN variants (Table \ref{table:motionrcnn_resnet_fpn}), we apply an additional
+convolution to the $C_6$ features to reduce the number of inputs to the
following fully-connected layers.
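+For example, flattening the 7 $\times$ 7 resized features with 512 channels
+yields $7 \cdot 7 \cdot 512 = 25088$ inputs to the first fully-connected layer,
+compared to $7 \cdot 7 \cdot 2048 = 100352$ without the channel reduction.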
Instead of averaging, we use bilinear resizing to bring the convolutional features
diff --git a/background.tex b/background.tex
index bd16d1e..6b53a86 100644
--- a/background.tex
+++ b/background.tex
@@ -129,13 +129,25 @@ and N$_{motions} = 3$.
\label{ssec:resnet}
ResNet \cite{ResNet} was initially introduced as a CNN for image classification, but
became popular as basic building block of many deep network architectures for a variety
-of different tasks. In Table \ref{table:resnet}, we show the ResNet-50 variant
+of different tasks. Figure \ref{figure:bottleneck}
+shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
+of very deep networks without the gradients becoming too small as the distance
+from the output layer increases.
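+As a sketch, a residual unit with an identity shortcut computes
+\begin{equation*}
+y = x + \mathcal{F}(x),
+\end{equation*}
+where $\mathcal{F}$ denotes the stacked convolutions of the unit
+(see Figure \ref{figure:bottleneck}), so that gradients can always propagate
+through the identity term.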
+
+In Table \ref{table:resnet}, we show the ResNet variant
that will serve as the basic CNN backbone of our networks, and is also used in many
other region-based convolutional networks.
-The initial image data is always passed through ResNet-50 as a first step to
+The initial image data is always passed through the ResNet backbone as a first step to
bootstrap the complete deep network.
-Figure \ref{figure:bottleneck}
-shows the fundamental building block of ResNet-50.
+Note that for the Mask R-CNN architectures we describe below, this is equivalent
+to the standard ResNet-50 backbone.
+In ResNet, the C$_5$ bottleneck has a stride of 32 with respect to the
+input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
+For accurately estimating motions corresponding to larger pixel displacements, a larger
+stride may be important.
+Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
+to increase the bottleneck stride to 64, following FlowNetS.
+
{
%\begin{table}[h]
@@ -146,7 +158,7 @@ shows the fundamental building block of ResNet-50.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
-\multicolumn{3}{c}{\textbf{ResNet-50}}\\
+\multicolumn{3}{c}{\textbf{ResNet}}\\
\midrule
C$_1$ & 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\
@@ -182,17 +194,25 @@ $\begin{bmatrix}
3 \times 3, 512 \\
1 \times 1, 2048 \\
\end{bmatrix}_{b/2}$ $\times$ 3
-& $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
+& $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+\midrule
+C$_6$ &
+$\begin{bmatrix}
+1 \times 1, 512 \\
+3 \times 3, 512 \\
+1 \times 1, 2048 \\
+\end{bmatrix}_{b/2}$ $\times$ 3
+& $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\bottomrule
\caption
{
-ResNet-50 \cite{ResNet} architecture.
+Backbone architecture based on ResNet-50 \cite{ResNet}.
Operations enclosed in a []$_b$ block make up a single ResNet \enquote{bottleneck}
block (see Figure \ref{figure:bottleneck}).
If the block is denoted as []$_b/2$, the first convolution operation in the block
has a stride of 2.
Note that the stride is only applied to the first block, but not to repeated blocks.
-Batch normalization \cite{BN} is used after every convolution.
+Batch normalization \cite{BN} is used after every convolution in each residual unit.
}
\label{table:resnet}
\end{longtable}
@@ -230,7 +250,7 @@ Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed size (H $\times$ W) feature maps are extracted from the compressed feature map of the image,
each corresponding to one of the proposal bounding boxes.
-The extracted per-RoI feature maps are collected into a batch and passed into a small Fast R-CNN
+The extracted per-RoI (region of interest) feature maps are collected into a batch and passed into a small Fast R-CNN
\emph{head} network, which performs classification and prediction of refined boxes for all
regions in one forward pass. The extraction technique is called \emph{RoI pooling}.
In RoI pooling, the RoI bounding box window over the full image features is divided into a
H $\times$ W grid of cells. For each cell, the values of the underlying
@@ -276,7 +296,7 @@ Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentat
fixed resolution instance masks within the bounding boxes of each detected object.
This is done by simply extending the Faster R-CNN head with multiple convolutions,
which compute a pixel-precise binary mask for each instance.
-The basic Mask R-CNN ResNet-50 architecture is shown in Table \ref{table:maskrcnn_resnet}.
+The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
Note that the per-class masks logits are put through a sigmoid layer,
and thus there is no comptetition between classes for the mask prediction branch.
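+As a sketch, assuming the per-pixel binary cross-entropy of \cite{MaskRCNN},
+the mask loss for a RoI with ground-truth class $c$, mask logits $x_{ij}^c$
+and binary targets $y_{ij}$ is
+\begin{equation*}
+L_{mask} = -\frac{1}{28^2} \sum_{i,j} \left( y_{ij} \log \sigma(x_{ij}^c) + (1 - y_{ij}) \log \left( 1 - \sigma(x_{ij}^c) \right) \right),
+\end{equation*}
+where $\sigma$ denotes the sigmoid; only the mask predicted for the
+ground-truth class contributes to the loss.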
@@ -295,7 +315,7 @@ boundary of the bounding box, and thus some detail is lost.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
-C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
+C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)}}\\
\midrule
@@ -311,7 +331,7 @@ ROI$_{\mathrm{RPN}}$ & sample boxes$_{\mathrm{RPN}}$ and scores$_{\mathrm{RPN}}$
\multicolumn{3}{c}{\textbf{RoI Head}}\\
\midrule
& From C$_4$ with ROI$_{\mathrm{RPN}}$: RoI extraction & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 1024 \\
-R$_1$& ResNet-50 \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
+R$_1$& ResNet \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
ave & average pool & N$_{RoI}$ $\times$ 2048 \\
& From ave: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
@@ -327,8 +347,8 @@ masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}
\bottomrule
\caption
{
-Mask R-CNN \cite{MaskRCNN} ResNet-50 \cite{ResNet} architecture.
-Note that this is equivalent to the Faster R-CNN architecture if the mask
+Mask R-CNN \cite{MaskRCNN} ResNet \cite{ResNet} architecture.
+Note that this is equivalent to the Faster R-CNN ResNet architecture if the mask
head is left out. In Mask R-CNN, bilinear sampling is used for RoI extraction,
whereas Faster R-CNN used RoI pooling.
}
@@ -350,7 +370,7 @@ of an appropriate scale to be used, depending of the size of the bounding box.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinear upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder.
-The Mask R-CNN ResNet-50-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
+The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
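+As a sketch of a single FPN block (see Figure \ref{figure:fpn_block}), assuming
+the standard top-down merge, each pyramid level is computed as
+\begin{equation*}
+P_l = \mathrm{conv}_{3 \times 3} \left( \mathrm{up}_{\times 2} \left( P_{l+1} \right) + \mathrm{conv}_{1 \times 1} \left( C_l \right) \right),
+\end{equation*}
+where $\mathrm{up}_{\times 2}$ denotes the bilinear upsampling mentioned above.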
Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
At each output position of the resulting RPN pyramid, bounding boxes are predicted
@@ -395,7 +415,7 @@ which is the highest resolution feature map.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
-C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
+C$_5$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{Feature Pyramid Network (FPN)}}\\
\midrule
@@ -435,7 +455,7 @@ masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}
\bottomrule
\caption
{
-Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture.
+Mask R-CNN \cite{MaskRCNN} ResNet-FPN \cite{ResNet} architecture.
Operations enclosed in a []$_p$ block make up a single FPN block
(see Figure \ref{figure:fpn_block}).
}
diff --git a/bib.bib b/bib.bib
index 43567ae..1f15dd4 100644
--- a/bib.bib
+++ b/bib.bib
@@ -273,3 +273,9 @@ author={Jason J. Yu and Adam W. Harley and Konstantinos G.
Derpanis},
booktitle={ECCV Workshop on Brave new ideas for motion representations in videos},
year={2016}}
+
+ @article{ImageNet,
+ title={ImageNet Large Scale Visual Recognition Challenge},
+ author={Olga Russakovsky and others},
+ journal={International Journal of Computer Vision (IJCV)},
+ year={2015}}
diff --git a/conclusion.tex b/conclusion.tex
index cac9bd6..83926fb 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -66,7 +66,7 @@ and also fine-tune on the training set as mentioned in the previous paragraph.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
-C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
@@ -85,8 +85,8 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfra
\end{tabular}
\caption
{
-A possible Motion R-CNN ResNet-50-FPN architecture with depth prediction,
-based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
+A possible Motion R-CNN ResNet-FPN architecture with depth prediction,
+based on the Mask R-CNN ResNet-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
\end{table}
@@ -140,14 +140,15 @@ into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
\paragraph{Deeper networks for larger bottleneck strides}
+% TODO remove?
Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
For accurately estimating the motion of objects with large displacements between
the two frames, it might be useful to increase the maximum bottleneck stride in our
backbone network.
-We could do this easily in both of our network variants by adding one ore multiple additional
+We could do this easily in both of our network variants by adding one or multiple additional
ResNet blocks. In the variant without FPN, these blocks would have to be placed after
RoI feature extraction. In the FPN variant, the blocks could be simply added after the
encoder C$_5$ bottleneck. For saving memory, we could however also consider modifying the underlying
-ResNet-50 architecture and increase the number of blocks, but reduce the number
+ResNet architecture and increase the number of blocks, but reduce the number
of layers in each block.
diff --git a/experiments.tex b/experiments.tex
index 6589aa9..1efda83 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -152,7 +152,7 @@ predicted camera motions.
For our initial experiments, we concatenate both RGB frames as well as the XYZ
coordinates for both frames as input to the networks.
-We train both, the Motion R-CNN ResNet-50 and ResNet-50-FPN variants.
+We train both the Motion R-CNN ResNet and ResNet-FPN variants.
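+Assuming three RGB and three XYZ channels per frame, the concatenated network
+input thus has $2 \cdot (3 + 3) = 12$ channels.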
\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
@@ -166,11 +166,13 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
\paragraph{R-CNN training parameters}
For training the RPN and RoI heads and during inference,
we use the exact same number of proposals and RoIs as Mask R-CNN in
-the ResNet-50 and ResNet-50-FPN variants, respectively.
+the ResNet and ResNet-FPN variants, respectively.
\paragraph{Initialization}
+For initializing the C$_1$ to C$_5$ weights, we use a pre-trained
+ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository.
Following the pre-existing TensorFlow implementation of Faster R-CNN,
-we initialize all hidden layers with He initialization \cite{He}.
+we initialize all other hidden layers with He initialization \cite{He}.
For the fully-connected camera and instance motion output layers,
we use a truncated normal initializer with a standard deviation of $0.0001$
and zero mean, truncated at two standard deviations.
@@ -220,7 +222,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
wrong by both $\geq 3$ pixels and $\geq 5\%$.
Camera and instance motion errors are averaged over the validation set.
We optionally enable camera motion prediction (cam.),
-replace the ResNet-50 backbone with ResNet-50-FPN (FPN),
+replace the ResNet backbone with ResNet-FPN (FPN),
or input XYZ coordinates into the backbone (XYZ).
We either supervise object motions (sup.) with 3D motion ground truth (3D) or
diff --git a/introduction.tex b/introduction.tex
index 532ba3f..20cae90 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -73,7 +73,7 @@ and predicts pixel-precise segmentation masks for each detected object (Figure \
\centering
\includegraphics[width=\textwidth]{figures/maskrcnn_cs}
\caption{
-Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN}
+Instance segmentation results of Mask R-CNN ResNet-FPN \cite{MaskRCNN}
on Cityscapes \cite{Cityscapes}.
Figure taken from \cite{MaskRCNN}.
}
\label{figure:maskrcnn_cs}