\subsection{Motion R-CNN architecture}
\label{ssec:architecture}
Building on Mask R-CNN \cite{MaskRCNN},
we estimate per-object motion by predicting a rigid 3D motion for each detected object.
For this, we extend Mask R-CNN in two straightforward ways.
First, we modify the backbone network and provide two frames to the R-CNN system
in order to enable image matching between the consecutive frames.
Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
region proposal. Table \ref{table:motionrcnn_resnet} shows the modified network.
{
\begin{table}[h]
\centering
\begin{tabular}{llr}
\toprule
\textbf{Layer ID} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_4$ & ResNet-50 \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\
\midrule
\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)} (Table \ref{table:maskrcnn_resnet})}\\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
& From C$_4$: ResNet-50 \{C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
& 1 $\times$ 1 conv, 1024 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
 & average pool & 1 $\times$ 1024 \\
M$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
$R_t^{cam}$& From M$_1$: fully connected, 3 & 1 $\times$ 3 \\
$t_t^{cam}$& From M$_1$: fully connected, 3 & 1 $\times$ 3 \\
$o_t^{cam}$& From M$_1$: fully connected, 2 & 1 $\times$ 2 \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet})}\\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Motions}}\\
\midrule
M$_2$ & From ave: $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & N$_{RPN}$ $\times$ 1024 \\
$\forall k: R_t^k$ & From M$_2$: fully connected, 3 & N$_{RPN}$ $\times$ 3 \\
$\forall k: t_t^k$ & From M$_2$: fully connected, 3 & N$_{RPN}$ $\times$ 3 \\
$\forall k: p_t^k$ & From M$_2$: fully connected, 3 & N$_{RPN}$ $\times$ 3 \\
$\forall k: o_t^k$ & From M$_2$: fully connected, 2 & N$_{RPN}$ $\times$ 2 \\
\bottomrule
\end{tabular}
\caption {
Motion R-CNN ResNet-50 architecture based on the Mask R-CNN
ResNet-50 architecture (Table \ref{table:maskrcnn_resnet}).
}
\label{table:motionrcnn_resnet}
\end{table}
}
{
\begin{table}[h]
\centering
\begin{tabular}{llr}
\toprule
\textbf{Layer ID} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{Camera Motion Network}}\\
\midrule
& From C$_5$: 1 $\times$ 1 conv, 1024 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
& 3 $\times$ 3 conv, 1024, stride 2 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 1024 \\
 & average pool & 1 $\times$ 1024 \\
M$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
$R_t^{cam}$& From M$_1$: fully connected, 3 & 1 $\times$ 3 \\
$t_t^{cam}$& From M$_1$: fully connected, 3 & 1 $\times$ 3 \\
$o_t^{cam}$& From M$_1$: fully connected, 2 & 1 $\times$ 2 \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Motions}}\\
\midrule
M$_2$ & From F$_1$: $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & N$_{RPN}$ $\times$ 1024 \\
$\forall k: R_t^k$ & From M$_2$: fully connected, 3 & N$_{RPN}$ $\times$ 3 \\
$\forall k: t_t^k$ & From M$_2$: fully connected, 3 & N$_{RPN}$ $\times$ 3 \\
$\forall k: p_t^k$ & From M$_2$: fully connected, 3 & N$_{RPN}$ $\times$ 3 \\
$\forall k: o_t^k$ & From M$_2$: fully connected, 2 & N$_{RPN}$ $\times$ 2 \\
\bottomrule
\end{tabular}
\caption {
Motion R-CNN ResNet-50-FPN architecture based on the Mask R-CNN
ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
The modifications are analogous to those in our Motion R-CNN ResNet-50 architecture;
we show the full architecture for completeness.
}
\label{table:motionrcnn_resnet_fpn}
\end{table}
}
\paragraph{Backbone Network}
Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backbone network to compute feature maps from the input imagery.
Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
laying the foundation for our motion estimation. Instead of feeding a single image to the backbone,
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input map with six channels.
Optionally, we experiment with additionally concatenating the camera-space XYZ coordinates of each frame into the input.
We do not introduce a separate network for computing region proposals and use our modified backbone
both as the first-stage RPN and as the second-stage feature extractor for region cropping.
Technically, our feature encoder has to learn a motion representation similar to
the one learned by the FlowNet encoder, but the output is computed in the
object-centric framework of a region-based convolutional network head with a 3D parametrization.
Thus, in contrast to the dense FlowNet decoder, the dense motion information estimated
by the encoder is integrated for specific objects via RoI cropping and
processed by the RoI head for each object.
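
To make this concrete, the following NumPy sketch illustrates how the backbone input is assembled; the helper name and shapes are hypothetical and only illustrate the channel concatenation, not our actual input pipeline.
\begin{verbatim}
import numpy as np

def backbone_input(frame_t, frame_t1, xyz_t=None, xyz_t1=None):
    """Stack two consecutive frames (and optionally their camera-space
    XYZ coordinate maps) along the channel axis."""
    channels = [frame_t, frame_t1]
    if xyz_t is not None and xyz_t1 is not None:
        channels += [xyz_t, xyz_t1]
    return np.concatenate(channels, axis=-1)

# Example: two H x W x 3 RGB frames yield an H x W x 6 backbone input,
# or H x W x 12 if the XYZ maps are concatenated as well.
I_t = np.zeros((384, 1280, 3), dtype=np.float32)
I_t1 = np.zeros((384, 1280, 3), dtype=np.float32)
print(backbone_input(I_t, I_t1).shape)  # (384, 1280, 6)
\end{verbatim}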
\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$
\footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
and translations: $\{(R, t) \mid R \in \mathbf{SO}(3),\; t \in \mathbb{R}^3\}$.}
of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$.
We parametrize ${R_t^k}$ using an Euler angle representation,
\begin{equation}
R_t^k = R_t^{k,z}(\gamma) \cdot R_t^{k,x}(\alpha) \cdot R_t^{k,y}(\beta),
\end{equation}
where
\begin{equation}
R_t^{k,x}(\alpha) =
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos(\alpha) & -\sin(\alpha) \\
0 & \sin(\alpha) & \cos(\alpha)
\end{pmatrix},
\end{equation}
\begin{equation}
R_t^{k,y}(\beta) =
\begin{pmatrix}
\cos(\beta) & 0 & \sin(\beta) \\
0 & 1 & 0 \\
-\sin(\beta) & 0 & \cos(\beta)
\end{pmatrix},
\end{equation}
\begin{equation}
R_t^{k,z}(\gamma) =
\begin{pmatrix}
\cos(\gamma) & -\sin(\gamma) & 0 \\
\sin(\gamma) & \cos(\gamma) & 0 \\
0 & 0 & 1
\end{pmatrix},
\end{equation}
and $\alpha, \beta, \gamma$ are the rotation angles in radians about the $x,y,z$-axis, respectively.
We then extend the Mask R-CNN head by adding fully connected layers in parallel to the fully connected layers for
refined boxes and classes (Table \ref{table:motionrcnn_resnet}).
As for refined boxes and masks, we make one separate motion prediction for each class.
Each instance motion is predicted as a set of nine scalar parameters,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction about any axis,
which is in general a safe assumption for image sequences from videos.
All predictions are made in camera space, and translation and pivot predictions are in meters.
We additionally predict softmax scores $o_t^k$ for classifying each object as
still or moving.
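
As a worked example of this parametrization, the following NumPy sketch reconstructs the rotation matrix from the three predicted sines; it relies on the assumption stated above that rotations stay within 90 degrees, so that all cosines are non-negative. The helper name is hypothetical.
\begin{verbatim}
import numpy as np

def rotation_from_sines(sin_a, sin_b, sin_c):
    """Compose R = R_z(gamma) * R_x(alpha) * R_y(beta) from predicted sines.

    Since rotations are assumed to be at most 90 degrees about any axis,
    the cosines are non-negative and can be recovered as sqrt(1 - sin^2).
    """
    sin_a, sin_b, sin_c = np.clip([sin_a, sin_b, sin_c], -1.0, 1.0)
    cos_a, cos_b, cos_c = np.sqrt(1.0 - np.array([sin_a, sin_b, sin_c]) ** 2)
    R_x = np.array([[1, 0, 0],
                    [0, cos_a, -sin_a],
                    [0, sin_a, cos_a]])
    R_y = np.array([[cos_b, 0, sin_b],
                    [0, 1, 0],
                    [-sin_b, 0, cos_b]])
    R_z = np.array([[cos_c, -sin_c, 0],
                    [sin_c, cos_c, 0],
                    [0, 0, 1]])
    return R_z @ R_x @ R_y
\end{verbatim}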
\paragraph{Camera motion prediction}
In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
between the two frames $I_t$ and $I_{t+1}$.
For this, we process the bottleneck output of the backbone with additional convolutions,
average pool the result and pass it through two fully connected layers
(Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}).
We again represent $R_t^{cam}$ using an Euler angle representation and
predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.
Again, we predict a softmax score $o_t^{cam}$ for classifying the camera as
still or moving.
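
The camera motion network from Table \ref{table:motionrcnn_resnet} can be sketched as follows; this is only an illustrative sketch built from standard tf.keras layers, with activations and function names chosen as plausible assumptions rather than taken from our exact implementation.
\begin{verbatim}
import tensorflow as tf

def camera_motion_head(c5_features):
    """Sketch of the camera motion network on top of the backbone
    bottleneck features (1/32 resolution, 2048 channels)."""
    x = tf.keras.layers.Conv2D(1024, 1, activation='relu')(c5_features)
    x = tf.keras.layers.Conv2D(1024, 3, strides=2, padding='same',
                               activation='relu')(x)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)       # 1024-d vector
    for _ in range(2):                                     # M_1: two FC layers
        x = tf.keras.layers.Dense(1024, activation='relu')(x)
    R_cam = tf.keras.layers.Dense(3)(x)  # sin(alpha), sin(beta), sin(gamma)
    t_cam = tf.keras.layers.Dense(3)(x)  # translation in meters
    o_cam = tf.keras.layers.Dense(2)(x)  # still vs. moving camera logits
    return R_cam, t_cam, o_cam
\end{verbatim}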
\subsection{Supervision}
\label{ssec:supervision}
\paragraph{Per-RoI supervision with 3D motion ground truth}
The most straightforward way to supervise the object motions is by using ground truth
motions computed from ground truth object poses, which is in general
only practical when training on synthetic datasets.
Given the $k$-th positive RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ be the predicted motion for class $c_k$
and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}$ the ground truth motion for the example $i_k$.
Note that we dropped the subscript $t$ to increase readability.
Similar to the camera pose regression loss in \cite{PoseNet2},
we use a variant of the $\ell_1$-loss to penalize the differences between ground truth and predicted
rotation, translation and, in our case, pivot. We found that the smooth $\ell_1$-loss
performs better than the standard $\ell_1$-loss.
For each RoI, we compute the total motion loss $L_{motion}^k$ from
the individual loss terms as
\begin{equation}
L_{motion}^k = l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o^{gt,i_k} + l_o^k,
\end{equation}
where
\begin{equation}
l_{R}^k = \ell_1^* (R^{gt,i_k} - R^{k,c_k}),
\end{equation}
\begin{equation}
l_{t}^k = \ell_1^* (t^{gt,i_k} - t^{k,c_k}),
\end{equation}
\begin{equation}
l_{p}^k = \ell_1^* (p^{gt,i_k} - p^{k,c_k})
\end{equation}
are the smooth $\ell_1$-loss terms for the predicted rotation, translation and pivot,
respectively, and
\begin{equation}
l_o^k = \ell_{cls}(o^{k}, o^{gt,i_k})
\end{equation}
is the cross-entropy loss for the predicted classification into moving and non-moving objects.
Note that we do not penalize the rotation and translation of objects with
$o^{gt,i_k} = 0$, which do not move between $t$ and $t+1$. We found that the network
cannot reliably regress exact identity motions for still objects; classifying objects
as moving or still and discarding the regression terms for the still ones is
numerically easier to optimize.
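
A minimal sketch of this per-RoI loss, assuming the predictions have already been gathered for the matched class and treating the rotation as its vector of predicted sines, could look as follows (helper names are hypothetical):
\begin{verbatim}
import numpy as np

def smooth_l1(x):
    """Smooth l1 (Huber) loss, summed over the last axis."""
    abs_x = np.abs(x)
    return np.sum(np.where(abs_x < 1.0, 0.5 * x ** 2, abs_x - 0.5), axis=-1)

def motion_loss(R_pred, t_pred, p_pred, o_logits, R_gt, t_gt, p_gt, o_gt):
    """Motion loss for one positive RoI.

    o_gt is 1 for moving and 0 for still objects; rotation and translation
    are only penalized for moving objects, while the pivot term and the
    moving/still cross-entropy are always penalized.
    """
    l_R = smooth_l1(R_gt - R_pred)
    l_t = smooth_l1(t_gt - t_pred)
    l_p = smooth_l1(p_gt - p_pred)
    # Softmax cross-entropy over the two still/moving logits.
    log_probs = o_logits - np.log(np.sum(np.exp(o_logits)))
    l_o = -log_probs[o_gt]
    return l_p + (l_R + l_t) * o_gt + l_o
\end{verbatim}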
\paragraph{Camera motion supervision}
We supervise the camera motion with ground truth analogously to the
object motions, the only difference being that the camera motion consists of
a rotation and translation but has no pivot term.
If the ground truth shows that the camera is not moving, we again do not
penalize rotation and translation, so that the camera loss reduces to the
classification term in this case.
\paragraph{Per-RoI supervision \emph{without} 3D motion ground truth}
A more general way to supervise the object motions is a re-projection
loss similar to the unsupervised loss in SfM-Net \cite{SfmNet},
which we can apply to coordinates within the object bounding boxes,
and which does not require ground truth 3D object motions.
In this case, for any RoI, we generate a uniform 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask. We use the same bounding box
to crop the corresponding region from the dense, full image depth map
and bilinearly resize the depth crop to the same resolution as the mask and point
grid.
We then compute the optical flow at each of the grid points by creating
a 3D point cloud from the point grid and depth crop. To this point cloud, we
apply the RoI's predicted motion, masked by the predicted mask.
Then, we apply the camera motion to the points, project them back to 2D
and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
Note that we batch this computation over all RoIs, so that we only perform
it once per forward pass. The mathematical details are analogous to the
dense, full image flow computation in the following subsection and will not
be repeated here. \todo{add diagram to make it easier to understand}
For each RoI, we can now penalize the optical flow grid to supervise the object motion.
If there is optical flow ground truth available, we can use the RoI bounding box to
crop and resize a region from the ground truth optical flow to match the RoI's
optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss.
However, we can also use the re-projection loss without optical flow ground truth
to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}.
In this case, we use the bounding box to crop and resize a corresponding region
from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$
using the 2D grid displaced with the predicted flow grid. Then, we can penalize the difference
between the resulting image crops, for example, with a census loss \cite{CensusTerm,UnFlow}.
For more details on differentiable bilinear sampling for deep learning, we refer the reader to
\cite{STN}.
When compared to supervision with 3D motion ground truth, a re-projection
loss could benefit motion regression by removing the need to balance the
rotation, translation and pivot loss terms against each other \cite{PoseNet2},
which can make it interesting even when 3D motion ground truth is available.
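
To make the unsupervised variant more concrete, the following NumPy sketch warps a region of $I_{t+1}$ back to time $t$ with the predicted flow grid and penalizes the photometric difference; for brevity it uses a plain $\ell_1$ brightness-constancy penalty instead of the census loss, and all helper names are hypothetical.
\begin{verbatim}
import numpy as np

def bilinear_sample(image, x, y):
    """Sample image (H x W x C) at continuous coordinates x, y."""
    H, W = image.shape[:2]
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx = np.clip(x - x0, 0.0, 1.0)[..., None]
    wy = np.clip(y - y0, 0.0, 1.0)[..., None]
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x0 + 1]
    bottom = (1 - wx) * image[y0 + 1, x0] + wx * image[y0 + 1, x0 + 1]
    return (1 - wy) * top + wy * bottom

def photometric_roi_loss(crop_t, image_t1, grid_x, grid_y, flow_u, flow_v):
    """Penalize the difference between the RoI crop of I_t and the
    corresponding region of I_{t+1}, sampled at the RoI point grid
    displaced by the predicted flow."""
    warped = bilinear_sample(image_t1, grid_x + flow_u, grid_y + flow_v)
    return np.mean(np.abs(crop_t - warped))
\end{verbatim}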
\subsection{Dense flow from motion}
\label{ssec:postprocessing}
As a postprocessing step, we compose a dense optical flow map from the outputs of our Motion R-CNN network.
Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$,
where
\begin{equation}
P_t =
\begin{pmatrix}
X_t \\ Y_t \\ Z_t
\end{pmatrix}
=
\frac{d_t}{f}
\begin{pmatrix}
x_t - c_0 \\ y_t - c_1 \\ f
\end{pmatrix},
\end{equation}
is the 3D coordinate at $t$ corresponding to the point with pixel coordinates $x_t, y_t$,
which range over all coordinates in $I_t$. Here, $f$ denotes the focal length and $(c_0, c_1)$ the principal point of the camera.
For now, the depth map is always assumed to come from ground truth.
Given $K$ detections with predicted motions as above, we transform all points within the bounding
box of a detected object according to the predicted motion of that object.
We first define the \emph{full image} mask $m_t^k$ for object $k$,
which is computed from the predicted box mask $m_k^b$ by bilinearly resizing
$m_k^b$ to the width and height of the predicted bounding box and then copying the values
of the resized mask into a full-resolution all-zeros map, starting at the top-left corner of the predicted bounding box.
Then,
\begin{equation}
P'_{t+1} =
P_t + \sum_{k=1}^{K} m_t^k \left( R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right).
\end{equation}
Next, we transform all points with the camera transformation $\{R_t^{cam}, t_t^{cam}\} \in \mathbf{SE}(3)$,
\begin{equation}
\begin{pmatrix}
X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
\end{pmatrix}
= P_{t+1} = R_t^{cam} \cdot P'_{t+1} + t_t^{cam}.
\end{equation}
Note that in our experiments, we either use the ground truth camera motion to focus
on the object motion predictions, or the predicted camera motion to estimate the complete
motion. We always state which variant we use in the experimental section.
Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
\begin{equation}
\begin{pmatrix}
x_{t+1} \\ y_{t+1}
\end{pmatrix}
=
\frac{f}{Z_{t+1}}
\begin{pmatrix}
X_{t+1} \\ Y_{t+1}
\end{pmatrix}
+
\begin{pmatrix}
c_0 \\ c_1
\end{pmatrix}.
\end{equation}
We can now obtain the optical flow between $I_t$ and $I_{t+1}$ at each point as
\begin{equation}
\begin{pmatrix}
u \\ v
\end{pmatrix}
=
\begin{pmatrix}
x_{t+1} - x_{t} \\ y_{t+1} - y_{t}
\end{pmatrix}.
\end{equation}
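
The dense flow composition described above can be summarized by the following NumPy sketch; it follows the equations of this subsection, with hypothetical argument names, and assumes the full-image masks, object motions and camera motion have already been extracted from the network outputs.
\begin{verbatim}
import numpy as np

def dense_flow_from_motion(depth, masks, R_obj, t_obj, p_obj,
                           R_cam, t_cam, f, c0, c1):
    """Compose dense optical flow from depth, per-object motions and
    full-image masks, and the camera motion.

    depth:   H x W depth map d_t (ground truth in our experiments).
    masks:   K x H x W full-image instance masks m_t^k in [0, 1].
    R_obj:   K x 3 x 3 object rotations; t_obj, p_obj: K x 3 translations
             and pivots in camera space (meters).
    R_cam:   3 x 3 camera rotation; t_cam: camera translation.
    f, c0, c1: focal length and principal point.
    """
    H, W = depth.shape
    y, x = np.mgrid[0:H, 0:W].astype(np.float64)
    # Back-project all pixels to the 3D point cloud P_t.
    P = np.stack([(x - c0) * depth / f, (y - c1) * depth / f, depth], axis=-1)
    # Apply each object's rigid motion to the points it covers.
    P_obj = P.copy()
    for k in range(masks.shape[0]):
        moved = (P - p_obj[k]) @ R_obj[k].T + p_obj[k] + t_obj[k]
        P_obj += masks[k][..., None] * (moved - P)
    # Apply the camera motion to all points.
    P_next = P_obj @ R_cam.T + t_cam
    # Project back to pixel coordinates and take the difference as flow.
    x_next = f * P_next[..., 0] / P_next[..., 2] + c0
    y_next = f * P_next[..., 1] / P_next[..., 2] + c1
    return np.stack([x_next - x, y_next - y], axis=-1)
\end{verbatim}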