diff --git a/approach.tex b/approach.tex
index 300e397..4494459 100644
--- a/approach.tex
+++ b/approach.tex
@@ -26,7 +26,6 @@ object-centric framework of a region based convolutional network head with a 3D
 Thus, in contrast to the dense FlowNet decoder, the estimated dense motion
 information from the encoder is integrated for specific objects via RoI cropping
 and processed by the RoI head for each object.
-\todo{figure of backbone}
 \paragraph{Per-RoI motion prediction}
 We use a rigid 3D motion parametrization similar to the one used in SfM-Net
 and SE3-Nets \cite{SfmNet,SE3Nets}.
@@ -90,9 +89,55 @@ between the two frames $I_t$ and $I_{t+1}$.
 For this, we flatten the bottleneck output of the backbone and pass it through
 a fully connected layer.
 We again represent $R_t^{cam}$ using a Euler angle representation and predict
 $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way
 as for the individual objects.
-Again, we predict a softmax score $o_t^k$ for classifying differentiating between
+Again, we predict a softmax score $o_t^{cam}$ for differentiating between
 a still and moving camera.
+{
+\begin{table}[h]
+\centering
+\begin{tabular}{llr}
+layer id & layer operations & output dimensions \\
+\toprule \\
+& input image & H $\times$ W $\times$ C \\
+\midrule \\
+C$_4$ & \textbf{ResNet-50} [up to C$_4$] & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
+\midrule \\
+\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)} (see Table \ref{table:maskrcnn_resnet})}\\
+\midrule \\
+\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
+\midrule \\
+& From C$_4$: \textbf{ResNet-50} [C$_5$ without stride] & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+& average pool & 1 $\times$ 2048 \\
+& fully connected, 1024 & 1 $\times$ 1024 \\
+M$_1$ & fully connected, 1024 & 1 $\times$ 1024 \\
+$R_t^{cam}$& From M$_1$: fully connected, 3 & 1 $\times$ 3 \\
+$t_t^{cam}$& From M$_1$: fully connected, 3 & 1 $\times$ 3 \\
+$o_t^{cam}$& From M$_1$: fully connected, 2 & 1 $\times$ 2 \\
+\midrule \\
+\multicolumn{3}{c}{\textbf{RoI Head} (see Table \ref{table:maskrcnn_resnet})}\\
+\midrule \\
+\multicolumn{3}{c}{\textbf{RoI Head: Masks} (see Table \ref{table:maskrcnn_resnet})}\\
+\midrule \\
+\multicolumn{3}{c}{\textbf{RoI Head: Motions}}\\
+\midrule \\
+& From ave: fully connected, 1024 & N$_{RPN}$ $\times$ 1024 \\
+M$_2$ & fully connected, 1024 & N$_{RPN}$ $\times$ 1024 \\
+$\forall k: R_t^k$ & From M$_2$: fully connected, 3 & N$_{RPN}$ $\times$ 3 \\
+$\forall k: t_t^k$ & From M$_2$: fully connected, 3 & N$_{RPN}$ $\times$ 3 \\
+$\forall k: p_t^k$ & From M$_2$: fully connected, 3 & N$_{RPN}$ $\times$ 3 \\
+$\forall k: o_t^k$ & From M$_2$: fully connected, 2 & N$_{RPN}$ $\times$ 2 \\
+
+\bottomrule
+\end{tabular}
+
+\caption {
+Motion R-CNN ResNet architecture based on the Mask R-CNN
+ResNet architecture (Table \ref{table:maskrcnn_resnet}).
+}
+\label{table:motion_rcnn_resnet}
+\end{table}
+}
+
 \subsection{Supervision}
 \label{ssec:supervision}
@@ -108,8 +153,8 @@ Similar to the camera pose regression loss in \cite{PoseNet2}, we use a variant
 of the $\ell_1$-loss to penalize the differences between ground truth and
 predicted rotation, translation (and pivot, in our case).
 We found that the smooth $\ell_1$-loss performs better in our case than the
 standard $\ell_1$-loss.
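+For reference, $\ell_1^*$ here denotes the smooth $\ell_1$-loss as popularized by
+Fast R-CNN, applied to each component $x$ of the residual with the standard
+threshold of 1,
+\begin{equation}
+\ell_1^{*}(x) =
+\begin{cases}
+0.5 \, x^2 & \text{if } |x| < 1, \\
+|x| - 0.5 & \text{otherwise,}
+\end{cases}
+\end{equation}
+which behaves like the squared loss for small residuals but grows only linearly for
+large ones and is therefore less sensitive to outliers.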
-For each RoI, we compute the motion loss $L_{motion}^k$ as a linear sum of
-the individual losses,
+For each RoI, we compute the total motion loss $L_{motion}^k$ from
+the individual loss terms as
 \begin{equation}
 L_{motion}^k = l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o^{gt,i_k} + l_o^k,
@@ -124,7 +169,7 @@ l_{t}^k = \ell_1^* (t^{gt,i_k} - t^{k,c_k}),
 \begin{equation}
 l_{p}^k = \ell_1^* (p^{gt,i_k} - p^{k,c_k}).
 \end{equation}
-are the smooth $\ell_1$-losses for the predicted rotation, translation and pivot,
+are the smooth $\ell_1$-loss terms for the predicted rotation, translation and pivot,
 respectively and
 \begin{equation}
 l_o^k = \ell_{cls}(o_t^k, o^{gt,i_k}).
 \end{equation}
@@ -226,7 +271,7 @@ Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\}
 \begin{pmatrix} X_{t+1} \\ Y_{t+1} \\ Z_{t+1} \end{pmatrix}
-= P_{t+1} = R_t^c \cdot P'_{t+1} + t_t^k
+= P_{t+1} = R_t^c \cdot P'_{t+1} + t_t^c
 \end{equation}.
 Note that in our experiments, we either use the ground truth camera motion to focus
diff --git a/background.tex b/background.tex
index 5ef049d..13d1e70 100644
--- a/background.tex
+++ b/background.tex
@@ -48,7 +48,7 @@ performing upsampling of the compressed features and resulting in a encoder-deco
 The most popular deep networks of this kind for end-to-end optical flow prediction
 are variants of the FlowNet family \cite{FlowNet, FlowNet2}, which was recently
 extended to scene flow estimation \cite{SceneFlowDataset}.
-Figure \ref{} shows the classical FlowNetS architecture for optical fow prediction.
+Table \ref{} shows the classical FlowNetS architecture for optical flow prediction.
 Note that the network itself is a rather generic autoencoder and is specialized for
 optical flow only through being trained with supervision from dense optical flow ground truth.
 Potentially, the same network could also be used for semantic segmentation if
@@ -60,17 +60,94 @@ operations in the encoder.
 Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.

 \subsection{SfM-Net}
-Here, we will describe the SfM-Net architecture in more detail and show their results
+Here, we will describe the SfM-Net \cite{SfmNet} architecture in more detail and show its results
 and some of the issues.

 \subsection{ResNet}
 \label{ssec:resnet}
-For completeness, we will give a short review of the ResNet \cite{ResNet} architecture we will use
-as a backbone CNN for our network.
+ResNet \cite{ResNet} was initially introduced as a CNN for image classification, but
+became popular as a basic building block of many deep network architectures for a
+variety of different tasks. In Table \ref{table:resnet}, we show the ResNet-50 variant
+that will serve as the basic CNN backbone of our networks and
+is also used in many other region-based convolutional networks.
+The input image is always passed through ResNet-50 as a first step,
+providing the features on which the rest of the network operates.
+Figure \ref{figure:bottleneck}
+shows the fundamental building block of ResNet-50.
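+Concretely, writing the stacked $1 \times 1$, $3 \times 3$, $1 \times 1$ convolutions of
+such a block as $\mathcal{F}(x)$, the block output is obtained by adding the block input
+through a shortcut connection \cite{ResNet},
+\begin{equation}
+y = \mathcal{F}(x) + x,
+\end{equation}
+so that the convolutions only need to learn a residual with respect to the input.
+When the spatial resolution or the number of channels changes between input and output,
+the identity shortcut is replaced by a (strided) 1 $\times$ 1 convolution that matches
+the dimensions.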
+
+{
+\begin{table}[h]
+\centering
+\begin{tabular}{llr}
+ layer id & layer operations & output dimensions \\
+\toprule \\
+ & input image & H $\times$ W $\times$ C \\
+\midrule \\
+\multicolumn{3}{c}{\textbf{ResNet-50}}\\
+\midrule \\
+C$_1$ & 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\
+
+& 3 $\times$ 3 max pool, stride 2 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 64 \\
+
+C$_2$ &
+$\begin{bmatrix}
+1 \times 1, 64 \\
+3 \times 3, 64 \\
+1 \times 1, 256 \\
+\end{bmatrix}_b$ $\times$ 3
+& $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 256 \\
+\midrule \\
+C$_3$ &
+$\begin{bmatrix}
+1 \times 1, 128 \\
+3 \times 3, 128 \\
+1 \times 1, 512 \\
+\end{bmatrix}_{b/2}$ $\times$ 4
+& $\tfrac{1}{8}$ H $\times$ $\tfrac{1}{8}$ W $\times$ 512 \\
+\midrule \\
+C$_4$ &
+$\begin{bmatrix}
+1 \times 1, 256 \\
+3 \times 3, 256 \\
+1 \times 1, 1024 \\
+\end{bmatrix}_{b/2}$ $\times$ 6
+& $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
+\midrule \\
+C$_5$ &
+$\begin{bmatrix}
+1 \times 1, 512 \\
+3 \times 3, 512 \\
+1 \times 1, 2048 \\
+\end{bmatrix}_{b/2}$ $\times$ 3
+& $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
+
+\bottomrule
+\end{tabular}
+
+\caption {
+ResNet-50 \cite{ResNet} architecture.
+Operations enclosed in a []$_b$ block make up a single ResNet \enquote{bottleneck}
+block (see Figure \ref{figure:bottleneck}). If the block is denoted as []$_{b/2}$,
+the first conv operation in the block has a stride of 2. Note that the stride
+is only applied in the first of the repeated blocks.
+}
+\label{table:resnet}
+\end{table}
+}
+
+\begin{figure}[t]
+  \centering
+  \includegraphics[width=0.3\textwidth]{figures/bottleneck}
+\caption{
+ResNet \cite{ResNet} \enquote{bottleneck} block, introduced to reduce computational
+complexity in deeper network variants, shown here with 256 input and output channels.
+}
+\label{figure:bottleneck}
+\end{figure}

 \subsection{Region-based convolutional networks}
 \label{ssec:rcnn}
-We now give a short review of region-based convolutional networks, which are currently by far the
+We now give an overview of region-based convolutional networks, which are currently by far the
 most popular deep networks for object detection, and have recently also been applied to instance segmentation.

 \paragraph{R-CNN}
@@ -146,6 +223,49 @@ variant based on Feature Pyramid Networks \cite{FPN}.
 Figure \ref{} compares the two Mask R-CNN head variants.
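+As a brief reminder of how these heads are trained, Mask R-CNN combines the per-RoI
+objectives into a single multi-task loss
+\begin{equation}
+L = L_{cls} + L_{box} + L_{mask},
+\end{equation}
+where $L_{cls}$ is the classification loss, $L_{box}$ the bounding box regression loss
+and $L_{mask}$ an average binary cross-entropy loss that is only evaluated on the mask
+belonging to the ground truth class of the RoI \cite{MaskRCNN}.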
 \todo{RoI Align}
+{
+\begin{table}[h]
+\centering
+\begin{tabular}{llr}
+layer id & layer operations & output dimensions \\
+\toprule \\
+& input image & H $\times$ W $\times$ C \\
+\midrule \\
+C$_4$ & \textbf{ResNet-50} [up to C$_4$] & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
+\midrule \\
+\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)}}\\
+\midrule \\
+& From C$_4$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 512 \\
+& 1 $\times$ 1 conv, 4 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 4 \\
+& flatten & A $\times$ 4 \\
+ & decode bounding boxes \ref{} & A $\times$ 4 \\
+boxes$_{\mathrm{RPN}}$ & sample bounding boxes \ref{} & N$_{RPN}$ $\times$ 4 \\
+\midrule \\
+\multicolumn{3}{c}{\textbf{RoI Head}}\\
+\midrule \\
+& From C$_4$ with boxes$_{\mathrm{RPN}}$: RoI pooling \ref{} & N$_{RPN}$ $\times$ 7 $\times$ 7 $\times$ 1024 \\
+R$_1$& \textbf{ResNet-50} [C$_5$ without stride] & N$_{RPN}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\
+ave & average pool & N$_{RPN}$ $\times$ 2048 \\
+boxes& From ave: fully connected, 4 & N$_{RPN}$ $\times$ 4 \\
+logits& From ave: fully connected, N$_{cls}$ & N$_{RPN}$ $\times$ N$_{cls}$ \\
+\midrule \\
+\multicolumn{3}{c}{\textbf{RoI Head: Masks}}\\
+\midrule \\
+& From R$_1$: 2 $\times$ 2 deconv, 256, stride 2 & N$_{RPN}$ $\times$ 14 $\times$ 14 $\times$ 256 \\
+masks & 1 $\times$ 1 conv, N$_{cls}$ & N$_{RPN}$ $\times$ 14 $\times$ 14 $\times$ N$_{cls}$ \\
+
+\bottomrule
+\end{tabular}
+
+\caption {
+Mask R-CNN \cite{MaskRCNN} ResNet \cite{ResNet} architecture.
+Note that this is equivalent to the Faster R-CNN architecture if the mask
+head is left out.
+}
+\label{table:maskrcnn_resnet}
+\end{table}
+}
+
 \paragraph{Bounding box regression}
 All bounding boxes predicted by the RoI head or RPN are estimated as offsets with
 respect to a reference bounding box. In the case of the RPN,
diff --git a/experiments.tex b/experiments.tex
index 38c7bee..bca4744 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -176,7 +176,7 @@ AEE: Average Endpoint Error; Fl-all: Ratio of pixels where
 flow estimate is wrong by both $\geq 3$ pixels and $\geq 5\%$.
 Camera and instance motion errors are averaged over the validation set.
 We optionally train camera motion prediction (cam.),
-replace the ResNet50 backbone with ResNet50-FPN (FPN),
+replace the ResNet-50 backbone with ResNet-50-FPN (FPN),
 or input XYZ coordinates into the backbone (XYZ).
 We either supervise object motions (sup.) with 3D motion ground truth (3D) or
diff --git a/figures/bottleneck.png b/figures/bottleneck.png
new file mode 100644
index 0000000..bfa8404
Binary files /dev/null and b/figures/bottleneck.png differ
diff --git a/thesis.tex b/thesis.tex
index cbcd518..50708df 100644
--- a/thesis.tex
+++ b/thesis.tex
@@ -28,6 +28,7 @@
 \usepackage{lipsum} % for generating Lorem Ipsum filler text
 \usepackage[math]{blindtext} % for generating German filler text
 \usepackage{hyperref} % hyperlinks within the document
+\usepackage{csquotes}
 % INFO
 %
 % The hyperref package should be loaded as late as possible, which is why it appears near the end of the preamble.
@@ -82,7 +83,7 @@
 \newcommand{\todo}[1]{\textbf{\textcolor{red}{#1}}}
-
+\setlength{\belowrulesep}{0pt}
 % Front matter
 \author{\myname}