diff --git a/approach.tex b/approach.tex
index 388e93c..55d1d43 100644
--- a/approach.tex
+++ b/approach.tex
@@ -26,7 +26,8 @@ C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\
 \midrule
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
-& From C$_4$: ResNet \{C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
+& From C$_4$: ResNet \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+& ResNet \{C$_6$\} (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
 & bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
 & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
 T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@@ -76,7 +77,7 @@ C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1
 \midrule
 \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
 \midrule
-& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
+& From C$_6$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 512 \\
 & bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
 & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
 T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
@@ -104,7 +105,7 @@ $\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
 Motion R-CNN ResNet-FPN architecture based on the Mask R-CNN ResNet-FPN
 architecture (Table \ref{table:maskrcnn_resnet_fpn}).
 To obtain a larger bottleneck stride, we compute the feature pyramid starting
-with C$_6$ instead of C$_5$ (thus, the subsampling from P$_5$ to P$_6$) is omitted.
+with C$_6$ instead of C$_5$, and thus the subsampling from P$_5$ to P$_6$ is omitted.
 The modifications are analogous to our Motion R-CNN ResNet, but we still show
 the architecture for completeness.
 Again, we use ReLU activations after all hidden layers and
@@ -206,10 +207,14 @@ a still and moving camera.
 In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}),
 the underlying ResNet backbone is only computed up to the $C_4$ block,
 as otherwise the feature resolution prior to RoI extraction would be reduced too much.
-In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
-block to make the camera network of both variants comparable.
-Then, in both, the ResNet and ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
-convolution to the $C_5$ features to reduce the number of inputs to the following
+Therefore, in our variant without FPN, we first pass the $C_4$ features through $C_5$
+and $C_6$ blocks (with weights independent of the $C_5$ block used in the RoI head in this variant)
+to increase the bottleneck stride prior to the camera network to 64.
+In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}),
+the backbone makes use of all blocks through $C_6$ and
+we can simply branch off our camera network from the $C_6$ bottleneck.
+Then, in both the ResNet and ResNet-FPN variants, we apply an additional
+convolution to the $C_6$ features to reduce the number of inputs to the following
 fully-connected layers.
 Instead of averaging, we use bilinear resizing to bring the convolutional
 features to a fixed size without losing all spatial information,
@@ -356,13 +361,6 @@ and sample proposals and RoIs in the exact same way.
 During inference, we proceed analogously to Mask R-CNN.
 In the same way as the RoI mask head, at test time, we compute the RoI motion
 head from the features extracted with refined bounding boxes.
-Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
-extracted RoI features before passing them into the motion head.
-The intuition behind that is that we want to mask out (set to zero) any positions in the
-extracted feature window which belong to the background. Then, the RoI motion
-head aggregates the motion (image matching) information from the backbone
-over positions localized within the object only, but not over positions belonging
-to the background, which should not influence the final object motion estimate.
 Again, as for masks and bounding boxes in Mask R-CNN,
 the predicted output object motions are the predicted object motions for the
diff --git a/background.tex b/background.tex
index 7189a38..43ca3cd 100644
--- a/background.tex
+++ b/background.tex
@@ -134,10 +134,10 @@ and N$_{motions} = 3$.
 \subsection{ResNet}
 \label{ssec:resnet}

-ResNet \cite{ResNet} was initially introduced as a CNN for image classification, but
+ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but
 became popular as basic building block of many deep network architectures
 for a variety of different tasks. Figure \ref{figure:bottleneck}
-shows the fundamental building block of . The additive \emph{residual unit} enables the training
+shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
 of very deep networks without the gradients becoming too small as the distance
 from the output layer increases.
@@ -147,8 +148,9 @@ is also used in many other region-based convolutional networks.
 The initial image data is always passed through the ResNet backbone as a first
 step to bootstrap the complete deep network.
 Note that for the Mask R-CNN architectures we describe below, this is equivalent
-to the standard backbone.
-In , the C$_5$ bottleneck has a stride of 32 with respect to the
+to the standard ResNet-50 backbone. We now introduce one small extension that
+will be useful for our Motion R-CNN network.
+In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
 input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64.
 For accurately estimating motions corresponding to larger pixel displacements,
 a larger stride may be important.
diff --git a/conclusion.tex b/conclusion.tex
index c69f23d..f4a1f7a 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -130,7 +130,6 @@ For training on a dataset without any motion ground truth, e.g. Cityscapes,
 it may be critical to add this term in addition to an unsupervised loss for the
 instance motions.
-
 \paragraph{Temporal consistency}
 A next step after the two aforementioned ones could be to extend our network
 to exploit more than two temporally consecutive frames, which has previously
 been shown to be beneficial in the
@@ -139,16 +138,17 @@ In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
 into our architecture, we could enable temporally consistent motion estimation
 from image sequences of arbitrary length.

-\paragraph{Deeper networks for larger bottleneck strides}
-% TODO remove?
-Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
-input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
-For accurately estimating the motion of objects with large displacements between
-the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
-We could do this easily in both of our network variants by adding one or multiple additional
-ResNet blocks. In the variant without FPN, these blocks would have to be placed
-after RoI feature extraction. In the FPN variant, the blocks could be simply
-added after the encoder C$_5$ bottleneck.
-For saving memory, we could however also consider modifying the underlying
-ResNet architecture and increase the number of blocks, but reduce the number
-of layers in each block.
+\paragraph{Masking prior to the RoI motion head}
+Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from
+the backbone are integrated over the complete RoI window to yield the features
+for motion estimation.
+For example, average pooling is applied before the fully-connected layers in the variant without FPN.
+However, ideally, the motion (image matching) information from the backbone should
+only be aggregated over positions that actually belong to the object.
+To achieve this, we could use the \emph{predicted} binarized masks for each RoI to mask the
+extracted RoI features before passing them into the motion head.
+The intuition behind that is that we want to mask out (set to zero) any positions in the
+extracted feature window which belong to the background. Then, the RoI motion
+head could aggregate the motion (image matching) information from the backbone
+over positions localized within the object only, but not over positions belonging
+to the background, which should probably not influence the final object motion estimate.
diff --git a/experiments.tex b/experiments.tex
index bbd9992..b33fd57 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -166,7 +166,7 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
 \paragraph{R-CNN training parameters}
 For training the RPN and RoI heads and during inference,
 we use the exact same number of proposals and RoIs as Mask R-CNN in
-the and -FPN variants, respectively.
+the ResNet and ResNet-FPN variants, respectively.

 \paragraph{Initialization}
 For initializing the C$_1$ to C$_5$ weights, we use a pre-trained
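
To make the masking step proposed in the new conclusion paragraph concrete, the following minimal NumPy sketch shows how extracted RoI features could be zeroed at background positions before being aggregated by the motion head. The array shapes, the binarization threshold, and the function name mask_roi_features are illustrative assumptions and not taken from the actual Motion R-CNN implementation.

import numpy as np

def mask_roi_features(roi_features, roi_masks, threshold=0.5):
    # roi_features: (N_RoI, H, W, C) feature windows extracted from the backbone,
    #               e.g. with RoIAlign.
    # roi_masks:    (N_RoI, H, W) predicted per-RoI mask probabilities, resized
    #               to the spatial resolution of the feature windows.
    # Binarize the masks and zero out all background positions in the features.
    binary = (roi_masks >= threshold).astype(roi_features.dtype)
    return roi_features * binary[..., np.newaxis]

# Toy example: 2 RoIs with 7 x 7 feature windows and 512 channels.
feats = np.random.randn(2, 7, 7, 512).astype(np.float32)
masks = np.random.rand(2, 7, 7).astype(np.float32)
masked = mask_roi_features(feats, masks)

# The motion head would then aggregate only over object positions, for example
# by average pooling the masked features before the fully-connected layers.
pooled = masked.mean(axis=(1, 2))
print(pooled.shape)  # (2, 512)

A possible refinement would be to average only over the unmasked positions, i.e. dividing by the number of foreground pixels per RoI, so that small objects are not downweighted by the surrounding zeros.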