diff --git a/approach.tex b/approach.tex index a63a54a..d518917 100644 --- a/approach.tex +++ b/approach.tex @@ -17,7 +17,7 @@ region proposal. Table \ref{table:motionrcnn_resnet} shows the modified network. \toprule \textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\ \midrule\midrule -& input image & H $\times$ W $\times$ C \\ +& input images & H $\times$ W $\times$ C \\ \midrule C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\ \midrule @@ -26,8 +26,8 @@ C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\ \midrule & From C$_4$: ResNet-50 \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\ -& 1 $\times$ 1 conv, 1024 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\ -& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\ +& 1 $\times$ 1 conv, 2048 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\ +& 3 $\times$ 3 conv, 2048, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\ & average pool & 1 $\times$ 2048 \\ M$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\ @@ -65,7 +65,7 @@ ResNet-50 architecture (Table \ref{table:maskrcnn_resnet}). \toprule \textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\ \midrule\midrule -& input image & H $\times$ W $\times$ C \\ +& input images & H $\times$ W $\times$ C \\ \midrule C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\ \midrule @@ -73,8 +73,8 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfra \midrule \multicolumn{3}{c}{\textbf{Camera Motion Network}}\\ \midrule -& From C$_5$: 1 $\times$ 1 conv, 1024 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\ -& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\ +& From C$_5$: 1 $\times$ 1 conv, 2048 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\ +& 3 $\times$ 3 conv, 2048, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\ & average pool & 1 $\times$ 2048 \\ M$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\ $R_t^{cam}$& From M$_1$: fully connected, 3 & 1 $\times$ 3 \\ @@ -265,7 +265,7 @@ and finally compute the optical flow at each point as the difference of the init Note that we batch this computation over all RoIs, so that we only perform it once per forward pass. The mathematical details are analogous to the dense, full-image flow computation in the following subsection and will not -be repeated here. \todo{add diagram to make it easier to understand} +be repeated here. \todo{probably better to add the mathematical details, as it may otherwise be confusing at some points} For each RoI, we can now penalize the optical flow grid to supervise the object motion. If there is optical flow ground truth available, we can use the RoI bounding box to diff --git a/background.tex b/background.tex index db54de5..21927ad 100644 --- a/background.tex +++ b/background.tex @@ -2,12 +2,12 @@ In this section, we will give a more detailed description of the previous works we directly build on, as well as other prerequisites.
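As a reading aid for the camera motion network rows in the approach.tex tables above, the following is a minimal sketch of how such a head could be written in TensorFlow/Keras. It is only an illustration, not the thesis code: the ReLU activations and the separate 3-parameter translation output (mirroring the visible $R_t^{cam}$ row) are assumptions.
\begin{verbatim}
import tensorflow as tf

def camera_motion_head(c5_features):
    # c5_features: (1/32 H) x (1/32 W) x 2048 backbone features
    x = tf.keras.layers.Conv2D(2048, 1, activation='relu')(c5_features)   # 1 x 1 conv, 2048
    x = tf.keras.layers.Conv2D(2048, 3, strides=2, padding='same',
                               activation='relu')(x)                      # 3 x 3 conv, 2048, stride 2
    x = tf.keras.layers.GlobalAveragePooling2D()(x)                       # average pool -> 1 x 2048
    for _ in range(2):                                                     # [fully connected, 1024] x 2
        x = tf.keras.layers.Dense(1024, activation='relu')(x)
    rotation = tf.keras.layers.Dense(3)(x)     # R_t^cam: 3 parameters
    translation = tf.keras.layers.Dense(3)(x)  # assumed 3-parameter translation head
    return rotation, translation
\end{verbatim}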
\subsection{Optical flow and scene flow} -Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a sequence of images. The optical flow -$\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$ -maps pixel coordinates in the first frame $I_1$ to pixel coordinates of the -visually corresponding pixel in the second frame $I_2$, +$\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$ +maps pixel coordinates in the first frame $I_t$ to pixel coordinates of the +visually corresponding pixel in the second frame $I_{t+1}$, and can be interpreted as the apparent movement of brightness patterns between the two frames. Optical flow can be regarded as two-dimensional motion estimation. @@ -32,11 +32,50 @@ performing upsampling of the compressed features and resulting in an encoder-decoder architecture. The most popular deep networks of this kind for end-to-end optical flow prediction are variants of the FlowNet family \cite{FlowNet, FlowNet2}, which was recently extended to scene flow estimation \cite{SceneFlowDataset}. -Table \ref{} shows the classical FlowNetS architecture for optical flow prediction. +Table \ref{table:flownets} shows the classical FlowNetS architecture for optical flow prediction. + +{ +%\begin{table}[h] +%\centering +\begin{longtable}{llr} +\toprule +\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\ +\midrule\midrule + & input images $I_t$ and $I_{t+1}$ & H $\times$ W $\times$ 6 \\ +\midrule +\multicolumn{3}{c}{\textbf{Encoder}}\\ +\midrule +& 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\ +& 5 $\times$ 5 conv, 128, stride 2 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 128 \\ +& 5 $\times$ 5 conv, 256, stride 2 & $\tfrac{1}{8}$ H $\times$ $\tfrac{1}{8}$ W $\times$ 256 \\ +& 3 $\times$ 3 conv, 256 & $\tfrac{1}{8}$ H $\times$ $\tfrac{1}{8}$ W $\times$ 256 \\ +& 3 $\times$ 3 conv, 512, stride 2 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 512 \\ +& 3 $\times$ 3 conv, 512 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 512 \\ +& 3 $\times$ 3 conv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\ +& 3 $\times$ 3 conv, 512 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\ +& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\ +\midrule +\multicolumn{3}{c}{\textbf{Refinement}}\\ +& 5 $\times$ 5 deconv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\ +\multicolumn{3}{c}{...}\\ +\midrule +flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\ +\bottomrule + +\caption { +FlowNetS \cite{FlowNet} architecture. +} +\label{table:flownets} +\end{longtable} + + +%\end{table} +} + Note that the network itself is a rather generic encoder-decoder network and is specialized for optical flow only through being trained with supervision from dense optical flow ground truth. Potentially, the same network could also be used for semantic segmentation if -the number of output channels was adapted from two to the number of classes. % TODO verify +the number of final and intermediate output channels was adapted from two to the number of classes. Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching arguably well, given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
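To make the encoder rows of Table \ref{table:flownets} concrete, here is a minimal Keras sketch that stacks the listed convolutions. It is purely illustrative rather than the original FlowNetS implementation, and the leaky ReLU activations are an assumption following the FlowNet paper rather than the table.
\begin{verbatim}
import tensorflow as tf

def flownet_s_encoder(image_pair):
    # image_pair: H x W x 6 (I_t and I_{t+1} concatenated along channels)
    specs = [(7, 64, 2), (5, 128, 2), (5, 256, 2), (3, 256, 1),
             (3, 512, 2), (3, 512, 1), (3, 512, 2), (3, 512, 1),
             (3, 1024, 2)]                      # (kernel, filters, stride) per table row
    x = image_pair
    skips = []
    for kernel, filters, stride in specs:
        x = tf.keras.layers.Conv2D(filters, kernel, strides=stride, padding='same')(x)
        x = tf.keras.layers.LeakyReLU(0.1)(x)
        skips.append(x)                         # reused by the refinement (decoder) part
    return x, skips                             # x: (1/64 H) x (1/64 W) x 1024
\end{verbatim}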
Note that the maximum displacement that can be correctly estimated only depends on the number of 2D strides or pooling @@ -44,8 +83,45 @@ operations in the encoder. Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}. \subsection{SfM-Net} -Here, we will describe the SfM-Net \cite{SfmNet} architecture in more detail and show their results -and some of the issues. +Here, we will describe the SfM-Net \cite{SfmNet} architecture in more detail \todo{finish}. + +{ +%\begin{table}[h] +%\centering +\begin{longtable}{llr} +\toprule +\textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\ +\midrule\midrule +\multicolumn{3}{c}{\todo{Conv-Deconv}}\\ +\midrule +\midrule +\multicolumn{3}{c}{\textbf{Motion Network}}\\ +\midrule + & input images $I_t$ and $I_{t+1}$ & H $\times$ W $\times$ 6 \\ + & Conv-Deconv & H $\times$ W $\times$ 32 \\ +masks & 1 $\times$ 1 conv, N$_{motions}$ & H $\times$ W $\times$ N$_{motions}$ \\ +FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}$ $\times$ 2 & 1 $\times$ 512 \\ +object motions & fully connected, $N_{motions} \cdot$ 9 & 1 $\times$ $N_{motions} \cdot$ 9 \\ +camera motion & From FC: $\begin{bmatrix}\textrm{fully connected}, 3\end{bmatrix}$ $\times$ 2 & 1 $\times$ 6 \\ +\midrule +\multicolumn{3}{c}{\textbf{Structure Network}}\\ +\midrule +& input image $I_t$ & H $\times$ W $\times$ 3 \\ +& Conv-Deconv & H $\times$ W $\times$ 32 \\ +depth & 1 $\times$ 1 conv, 1 & H $\times$ W $\times$ 1 \\ +\bottomrule + +\caption { +SfM-Net \cite{SfmNet} architecture. +The Conv-Deconv weights for the structure and motion networks are not shared, +and N$_{motions} = 3$. +} +\label{table:sfmnet} +\end{longtable} + + +%\end{table} +} \subsection{ResNet} \label{ssec:resnet} @@ -109,7 +185,7 @@ $\begin{bmatrix} \bottomrule \caption { -ResNet-50 \cite{ResNet} architecture (Figure from \cite{ResNet}). +ResNet-50 \cite{ResNet} architecture. Operations enclosed in a []$_b$ block make up a single ResNet \enquote{bottleneck} block (see Figure \ref{figure:bottleneck}). If the block is denoted as []$_b/2$, the first conv operation in the block has a stride of 2. Note that the stride @@ -220,18 +296,22 @@ C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ \midrule \multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)}}\\ \midrule -& From C$_4$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 512 \\ -& 1 $\times$ 1 conv, 6 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 6 \\ -& flatten & A $\times$ 6 \\ - & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & A $\times$ 6 \\ -ROI$_{\mathrm{RPN}}$ & sample bounding boxes \& scores & N$_{RoI}$ $\times$ 6 \\ +R$_0$ & From C$_4$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 512 \\ +& 1 $\times$ 1 conv, $N_a \cdot$ 4 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ $N_a \cdot$ 4 \\ +& flatten & A $\times$ 4 \\ +boxes$_{\mathrm{RPN}}$ & decode bounding boxes (Eq.
\ref{eq:pred_bounding_box}) & A $\times$ 4\\ +& From R$_0$: 1 $\times$ 1 conv, $N_a \cdot$ 2 & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ $N_a \cdot$ 2 \\ +& flatten & A $\times$ 2 \\ +scores$_{\mathrm{RPN}}$& softmax & A $\times$ 2 \\ +ROI$_{\mathrm{RPN}}$ & sample boxes$_{\mathrm{RPN}}$ and scores$_{\mathrm{RPN}}$ & N$_{RoI}$ $\times$ 6 \\ \midrule \multicolumn{3}{c}{\textbf{RoI Head}}\\ \midrule & From C$_4$ with ROI$_{\mathrm{RPN}}$: RoI extraction & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 1024 \\ R$_1$& ResNet-50 \{C$_5$ without stride\} (Table \ref{table:resnet}) & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 2048 \\ -ave & average pool & N$_{RPN}$ $\times$ 2048 \\ -boxes& From ave: fully connected, 4 & N$_{RPN}$ $\times$ 4 \\ +ave & average pool & N$_{RoI}$ $\times$ 2048 \\ +& From ave: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\ +boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\ & From ave: fully connected, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\ \midrule @@ -271,7 +351,7 @@ The Mask R-CNN ResNet-50-FPN variant is shown in Table \ref{table:maskrcnn_resne Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios, the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$. At each output position of the resulting RPN pyramid, bounding boxes are predicted -with respect to 3 anchor aspect ratios $\{1:2, 1:1, 2:1\}$ and a single scale. +with respect to 3 anchor aspect ratios $\{1:2, 1:1, 2:1\}$ and a single scale ($N_a = 3$). For P$_2$, P$_3$, P$_4$, P$_5$, P$_6$, the scale corresponds to anchor bounding boxes of areas $32^2, 64^2, 128^2, 256^2, 512^2$, respectively. @@ -311,7 +391,7 @@ P$_6$ & From P$_5$: 2 $\times$ 2 subsample, 256 & $\tfrac{1}{64}$ H $\times$ $\t \midrule \multicolumn{3}{c}{$\forall i \in \{2...6\}$}\\ & From P$_i$: 1 $\times$ 1 conv, 512 & $\tfrac{1}{2^i}$ H $\times$ $\tfrac{1}{2^i}$ W $\times$ 512 \\ -& 1 $\times$ 1 conv, 6 & $\tfrac{1}{2^i}$ H $\times$ $\tfrac{1}{2^i}$ W $\times$ 6 \\ +& 1 $\times$ 1 conv, $N_a \cdot$ 6 & $\tfrac{1}{2^i}$ H $\times$ $\tfrac{1}{2^i}$ W $\times$ $N_a \cdot$ 6 \\ RPN$_i$& flatten & A$_i$ $\times$ 6 \\ \midrule & From \{RPN$_2$ ... RPN$_6$\}: concatenate & A $\times$ 6 \\ @@ -323,7 +403,8 @@ ROI$_{\mathrm{RPN}}$ & sample bounding boxes \& scores & N$_{RoI}$ $\times$ 6 \\ R$_2$ & From \{P$_2$ ... P$_6$\} with ROI$_{\mathrm{RPN}}$: RoI extraction (Eq. \ref{eq:level_assignment}) & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\ & 2 $\times$ 2 max pool & N$_{RoI}$ $\times$ 7 $\times$ 7 $\times$ 256 \\ F$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & N$_{RoI}$ $\times$ 1024 \\ -boxes& From F$_1$: fully connected, 4 & N$_{RoI}$ $\times$ 4 \\ +& From F$_1$: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4 \\ +boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\ & From F$_1$: fully connected, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\ \midrule diff --git a/experiments.tex b/experiments.tex index 5e2557e..d64544b 100644 --- a/experiments.tex +++ b/experiments.tex @@ -9,7 +9,7 @@ as well as extensions for motion estimation and related evaluations and post-processing steps.
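Several rows in the tables above refer to \enquote{decode bounding boxes (Eq. \ref{eq:pred_bounding_box})}. Since the equation itself is not part of this diff, the following sketch assumes the standard Faster R-CNN parameterization (center offsets scaled by the anchor size, log-scale width and height) purely for illustration.
\begin{verbatim}
import numpy as np

def decode_boxes(anchors, deltas):
    # anchors: (A, 4) as (y1, x1, y2, x2); deltas: (A, 4) as (ty, tx, th, tw)
    heights = anchors[:, 2] - anchors[:, 0]
    widths = anchors[:, 3] - anchors[:, 1]
    ctr_y = anchors[:, 0] + 0.5 * heights
    ctr_x = anchors[:, 1] + 0.5 * widths
    ty, tx, th, tw = deltas.T
    pred_ctr_y = ctr_y + ty * heights      # shift the anchor center
    pred_ctr_x = ctr_x + tx * widths
    pred_h = heights * np.exp(th)          # rescale the anchor extent
    pred_w = widths * np.exp(tw)
    return np.stack([pred_ctr_y - 0.5 * pred_h, pred_ctr_x - 0.5 * pred_w,
                     pred_ctr_y + 0.5 * pred_h, pred_ctr_x + 0.5 * pred_w], axis=1)
\end{verbatim}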
In addition, we generated all ground truth for Motion R-CNN in the form of TFRecords from the raw Virtual KITTI data to enable fast loading during training. -Note that for RoI pooling and cropping, +Note that for RoI extraction and cropping operations, we use the \texttt{tf.image.crop\_and\_resize} TensorFlow function with interpolation set to bilinear. @@ -26,7 +26,7 @@ Each sequence is rendered with varying lighting and weather conditions and from different viewing angles, resulting in a total of 10 variants per sequence. In addition to the RGB frames, a variety of ground truth is supplied. For each frame, we are given a dense depth and optical flow map and the camera -extrinsics matrix. There are two annotated object classes, cars, and vans. +extrinsics matrix. There are two annotated object classes, cars and vans (N$_{cls}$ = 2). For all cars and vans in each frame, we are given 2D and 3D object bounding boxes, instance masks, 3D poses, and various other labels.
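To illustrate the RoI extraction mentioned above, here is a minimal sketch of how \texttt{tf.image.crop\_and\_resize} could be invoked. Normalizing the boxes by the image size and the $14 \times 14$ crop size are assumptions for illustration, not necessarily the exact settings of the implementation.
\begin{verbatim}
import tensorflow as tf

def extract_rois(feature_map, boxes, box_indices, image_size, crop_size=(14, 14)):
    # feature_map: (B, h, w, C); boxes: (N_RoI, 4) as (y1, x1, y2, x2) in image pixels
    img_h, img_w = image_size
    normalized = boxes / tf.constant([img_h, img_w, img_h, img_w], tf.float32)
    # crop_and_resize expects boxes in normalized [0, 1] coordinates and
    # bilinearly interpolates each crop to the requested output size
    return tf.image.crop_and_resize(feature_map, normalized, box_indices,
                                    crop_size, method='bilinear')
\end{verbatim}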