diff --git a/approach.tex b/approach.tex
index d583ed1..1ea1833 100644
--- a/approach.tex
+++ b/approach.tex
@@ -9,7 +9,7 @@ First, we modify the backbone network and provide two frames to the R-CNN
 system in order to enable image matching between the consecutive frames.
 Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each region proposal.
 Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
-show our Motion R-CNN networks based on Mask R-CNN and Mask R-CNN -FPN,
+show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN,
 respectively.

 {\begin{table}[h]
@@ -203,12 +203,12 @@
 a still and moving camera.
 \label{ssec:design}
 \paragraph{Camera motion network}
-In our variant (Table \ref{table:motionrcnn_resnet}), the underlying
+In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying
 ResNet backbone is only computed up to the $C_4$ block, as otherwise the
 feature resolution prior to RoI extraction would be reduced too much.
-In our variant, we therefore first pass the $C_4$ features through a $C_5$
+In our ResNet variant, we therefore first pass the $C_4$ features through a $C_5$
 block to make the camera network of both variants comparable.
-Then, in both, the and -FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), we apply a additional
+Then, in both the ResNet and ResNet-FPN variants (Table \ref{table:motionrcnn_resnet_fpn}), we apply an additional
 convolution to the $C_5$ features to reduce the number of inputs to the
 following fully-connected layers.
 Instead of averaging, we use bilinear resizing to bring the convolutional features
@@ -291,7 +291,7 @@ supervision without 3D instance motion ground truth.
 In contrast to SfM-Net, where a single optical flow field is composed
 and penalized to supervise the motion prediction, our loss considers
 the motion of all objects in isolation and composes a batch of flow windows
-for the RoIs.
+for the RoIs. Network predictions are shown in red.
 }
 \label{figure:flow_loss}
 \end{figure}
diff --git a/background.tex b/background.tex
index 6b53a86..7189a38 100644
--- a/background.tex
+++ b/background.tex
@@ -57,6 +57,7 @@ Table \ref{table:flownets} shows the classical FlowNetS architecture for optical
 & 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
 \midrule
 \multicolumn{3}{c}{\textbf{Refinement}}\\
+\midrule
 & 5 $\times$ 5 deconv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\
 \multicolumn{3}{c}{...}\\
 \midrule
@@ -64,7 +65,8 @@ flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\
 \bottomrule
 \caption
 {
-FlowNetS \cite{FlowNet} architecture.
+FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions)
+are used for refinement.
 }
 \label{table:flownets}
 \end{longtable}
@@ -85,7 +87,10 @@ Recently, other, similarly generic, encoder-decoder CNNs have been applied to
 optical flow as well \cite{DenseNetDenseFlow}.

 \subsection{SfM-Net}
-Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detail \todo{finish}.
+Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture.
+Motions and full-image masks for a fixed number $N_{motions}$ of independent objects
+are predicted in addition to a depth map, and an unsupervised re-projection loss based on
+image brightness differences penalizes the predictions.

 {
 %\begin{table}[h]
@@ -94,8 +99,7 @@ Here, we will describe the SfM-Net \cite{SfmNet} architecture in some more detai
 \toprule
 \textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
 \midrule\midrule
-\multicolumn{3}{c}{\todo{Conv-Deconv}}\\
-\midrule
+\multicolumn{3}{c}{\textbf{Conv-Deconv}}\\
 \midrule
 \multicolumn{3}{c}{\textbf{Motion Network}}\\
 \midrule
@@ -106,7 +110,7 @@ FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}
 object motions & fully connected, $N_{motions} \cdot$ 9 & H $\times$ W $\times$ $N_{motions} \cdot$ 9 \\
 camera motion & From FC: $\times$ 2 & H $\times$ W $\times$ 6 \\
 \midrule
-\multicolumn{3}{c}{\textbf{Structure Network} ()}\\
+\multicolumn{3}{c}{\textbf{Structure Network}}\\
 \midrule
 & input image $I_t$ & H $\times$ W $\times$ 3 \\
 & Conv-Deconv & H $\times$ W $\times$ 32 \\
@@ -114,11 +118,14 @@ depth & 1 $\times$1 conv, 1 & H $\times$ W $\times$ 1 \\
 \bottomrule
 \caption
 {
-SfM-Net \cite{SfmNet} architecture.
+SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv denotes a simple fully convolutional
+encoder-decoder network, where convolutions and deconvolutions with stride 2 are
+used for downsampling and upsampling, respectively. The stride at the bottleneck
+with respect to the input image is 32. The Conv-Deconv weights for the structure and motion networks are not shared, and $N_{motions} = 3$.
 }
-\label{table:flownets}
+\label{table:sfmnet}
 \end{longtable}
@@ -201,7 +208,7 @@
 $\begin{bmatrix}
 1 \times 1, 512 \\
 3 \times 3, 512 \\
 1 \times 1, 2048 \\
-\end{bmatrix}_{b/2}$ $\times$ 3
+\end{bmatrix}_{b/2}$ $\times$ 2
 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\
 \bottomrule
@@ -296,7 +303,7 @@ Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentat
 fixed resolution instance masks within the bounding boxes of each detected object.
 This is done by simply extending the Faster R-CNN head with multiple convolutions,
 which compute a pixel-precise binary mask for each instance.
-The basic Mask R-CNN architecture is shown in Table \ref{table:maskrcnn_resnet}.
+The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
 Note that the per-class masks logits are put through a sigmoid layer, and thus
 there is no comptetition between classes for the mask prediction branch.
@@ -370,7 +377,7 @@ of an appropriate scale to be used, depending of the size of the bounding box.
 For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
 encoder by combining bilinear upsampled feature maps coming from the bottleneck
 with lateral skip connections from the encoder.
-The Mask R-CNN -FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
+The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
 Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios,
 the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$.
 At each output position of the resulting RPN pyramid, bounding boxes are predicted
diff --git a/bib.bib b/bib.bib
index 1f15dd4..9fa04b6 100644
--- a/bib.bib
+++ b/bib.bib
@@ -271,7 +271,7 @@
 @inproceedings{UnsupFlownet,
 title={Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
 author={Jason J. Yu and Adam W. Harley and Konstantinos G.
 Derpanis},
- booktitle={ECCV Workshop on Brave new ideas for motion representations in videos},
+ booktitle={ECCV Workshops},
 year={2016}}
 @inproceedings{ImageNet,
diff --git a/conclusion.tex b/conclusion.tex
index 83926fb..c69f23d 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -150,5 +150,5 @@ ResNet blocks. In the variant without FPN, these blocks would have to be placed
 after RoI feature extraction. In the FPN variant, the blocks could be simply
 added after the encoder C$_5$ bottleneck.
 For saving memory, we could however also consider modifying the underlying
- architecture and increase the number of blocks, but reduce the number
+ResNet architecture and increasing the number of blocks, but reducing the number
 of layers in each block.
diff --git a/experiments.tex b/experiments.tex
index 1efda83..bbd9992 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -222,7 +222,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where flow estimate is
 wrong by both $\geq 3$ pixels and $\geq 5\%$.
 Camera and instance motion errors are averaged over the validation set.
 We optionally enable camera motion prediction (cam.),
-replace the backbone with -FPN (FPN),
+replace the ResNet backbone with ResNet-FPN (FPN),
 or input XYZ coordinates into the backbone (XYZ).
 We either supervise object motions (sup.) with 3D motion ground truth (3D) or
diff --git a/figures/flow_loss.png b/figures/flow_loss.png
index ee1c4ec..868a605 100755
Binary files a/figures/flow_loss.png and b/figures/flow_loss.png differ
diff --git a/figures/net_intro.png b/figures/net_intro.png
index 043b7b7..c39a58b 100755
Binary files a/figures/net_intro.png and b/figures/net_intro.png differ
diff --git a/figures/teaser.png b/figures/teaser.png
index ab6685d..225d115 100755
Binary files a/figures/teaser.png and b/figures/teaser.png differ
diff --git a/introduction.tex b/introduction.tex
index 20cae90..3bf2203 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -73,7 +73,7 @@ and predicts pixel-precise segmentation masks for each detected object (Figure \
 \centering
 \includegraphics[width=\textwidth]{figures/maskrcnn_cs}
 \caption{
-Instance segmentation results of Mask R-CNN -FPN \cite{MaskRCNN}
+Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN}
 on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}.
 }
 \label{figure:maskrcnn_cs}
@@ -104,6 +104,7 @@ manageable pieces.
 Overview of our network based on Mask R-CNN. For each region of interest (RoI),
 we predict the instance motion in parallel to the class, bounding box and mask.
 Additionally, we branch off a small network for predicting the camera motion from the bottleneck.
+Novel components relative to Mask R-CNN are shown in red.
 }
 \label{figure:net_intro}
 \end{figure}
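
For context on the re-projection and flow-composition losses referenced in the background.tex and approach.tex hunks above: the sketch below spells out the standard way a predicted rigid motion is turned into optical flow. It is not part of the diff, and the symbols ($\tilde{x}$, $d$, $K$, $R$, $t$, $\pi$, $w$) are illustrative rather than the notation used in the thesis.

% Minimal sketch: back-project a pixel with its depth, apply a rigid motion,
% re-project, and take the displacement as the induced flow.
\begin{align*}
X    &= d(x)\, K^{-1} \tilde{x}, \\   % 3D point for pixel x with depth d(x), intrinsics K
X'   &= R\, X + t, \\                 % rigid transformation (camera or object motion)
w(x) &= \pi\!\left(K X'\right) - x    % optical flow at pixel x
\end{align*}
Here $\tilde{x}$ is the homogeneous pixel coordinate and $\pi([u, v, z]^\top) = [u/z,\, v/z]^\top$ is the perspective projection. A brightness-based re-projection loss then compares $I_{t+1}(x + w(x))$ against $I_t(x)$; SfM-Net composes one full-image flow field from the mask-weighted object motions and the camera motion before penalizing it, whereas the flow loss described in approach.tex composes such a flow window separately for each RoI.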