\subsection{Motion R-CNN architecture}
Building on Mask R-CNN, we enable per-object motion estimation by predicting the 3D motion of each detected object in camera space.
\paragraph{Backbone Network}
Like Faster R-CNN and Mask R-CNN, we use a ResNet variant as the backbone network to compute feature maps from the input imagery.

Inspired by FlowNetS, we make one modification to enable image matching within the backbone network, laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone, we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input map with six channels. We do not introduce a separate network for computing region proposals, but use our modified backbone network both as the first-stage RPN and as the second-stage feature extractor for region cropping.
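The following minimal sketch illustrates the six-channel input scheme (PyTorch; the tensor shapes and the ResNet-style first convolution are assumptions for illustration, not our actual implementation):

\begin{verbatim}
import torch
import torch.nn as nn

# Two consecutive RGB frames; batch size and resolution are
# illustrative assumptions, not the thesis configuration.
I_t  = torch.randn(1, 3, 384, 1280)
I_t1 = torch.randn(1, 3, 384, 1280)

# Depth-concatenate along the channel axis: the backbone
# now sees a six-channel input map.
x = torch.cat([I_t, I_t1], dim=1)  # shape (1, 6, 384, 1280)

# The only backbone change is the first convolution, which
# must accept six input channels instead of three.
conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3)
features = conv1(x)  # shared by the RPN and the RoI head
\end{verbatim}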
\paragraph{Per-RoI motion prediction}
We use a rigid motion parametrization similar to the one used by SfM-Net \cite{Byravan:2017:SNL}. For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\} \in \mathrm{SE}(3)$ of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$. We parametrize $R_t^k$ using an Euler angle representation,
\begin{equation}
R_t^k = R_t^{k,z}(\gamma) \cdot R_t^{k,x}(\alpha) \cdot R_t^{k,y}(\beta),
\end{equation}
where
\begin{equation}
R_t^{k,x}(\alpha) =
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos(\alpha) & -\sin(\alpha) \\
0 & \sin(\alpha) & \cos(\alpha)
\end{pmatrix},
\end{equation}
\begin{equation}
R_t^{k,y}(\beta) =
\begin{pmatrix}
\cos(\beta) & 0 & \sin(\beta) \\
0 & 1 & 0 \\
-\sin(\beta) & 0 & \cos(\beta)
\end{pmatrix},
\end{equation}
\begin{equation}
R_t^{k,z}(\gamma) =
\begin{pmatrix}
\cos(\gamma) & -\sin(\gamma) & 0 \\
\sin(\gamma) & \cos(\gamma) & 0 \\
0 & 0 & 1
\end{pmatrix},
\end{equation}
and $\alpha, \beta, \gamma$ are the rotation angles about the $x$-, $y$- and $z$-axis, respectively.
Figure \ref{fig:motion_rcnn_head} shows our extended per-RoI head network. We extend the Faster R-CNN head by adding a fully-connected layer in parallel to the final fully-connected layers for predicting refined boxes and classes. As with refined boxes and masks, we make one separate motion prediction for each class. Each motion is predicted as a set of nine scalar motion parameters, $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$, where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$. Here, we assume that motions between frames are relatively small and that objects rotate by no more than 90 degrees in either direction, so that each angle is uniquely determined by its sine.
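To make the parametrization concrete, the following NumPy sketch reconstructs $R_t^k$ from the three predicted sines; the function name is ours, for illustration, and the reconstruction relies on the stated assumption that each angle lies in $[-90^\circ, 90^\circ]$:

\begin{verbatim}
import numpy as np

def rotation_from_sines(sin_a, sin_b, sin_g):
    # Valid only under the assumption that each angle lies in
    # [-90, 90] degrees, so each cosine is the non-negative root.
    cos_a = np.sqrt(1.0 - sin_a ** 2)
    cos_b = np.sqrt(1.0 - sin_b ** 2)
    cos_g = np.sqrt(1.0 - sin_g ** 2)
    Rx = np.array([[1, 0, 0],
                   [0, cos_a, -sin_a],
                   [0, sin_a, cos_a]])
    Ry = np.array([[cos_b, 0, sin_b],
                   [0, 1, 0],
                   [-sin_b, 0, cos_b]])
    Rz = np.array([[cos_g, -sin_g, 0],
                   [sin_g, cos_g, 0],
                   [0, 0, 1]])
    # Composition order follows the equation above:
    # R = Rz(gamma) . Rx(alpha) . Ry(beta).
    return Rz @ Rx @ Ry
\end{verbatim}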
\subsection{Supervision}
\paragraph{Per-RoI supervision with motion ground truth}
Given a positive RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$, let $R_{c_k}^k, t_{c_k}^k, p_{c_k}^k$ be the predicted motion for class $c_k$, and let $R_{gt}^{i_k}, t_{gt}^{i_k}, p_{gt}^{i_k}$ be the ground truth motion for the example $i_k$. We compute the motion loss $L_{motion}^k$ for each RoI as
\begin{equation}
L_{motion}^k = l_{R}^k + l_{t}^k + l_{p}^k,
\end{equation}
where
\begin{equation}
l_{R}^k = \arccos\left(\frac{\operatorname{tr}\left((R_{c_k}^k)^{-1} \cdot R_{gt}^{i_k}\right) - 1}{2}\right)
\end{equation}
measures the angle of the error rotation between the predicted and ground truth rotations,
\begin{equation}
l_{t}^k = \lVert (R_{c_k}^k)^{-1} \cdot (t_{gt}^{i_k} - t_{c_k}^k) \rVert
\end{equation}
is the Euclidean distance between the predicted and ground truth translations, and
\begin{equation}
l_{p}^k = \lVert p_{gt}^{i_k} - p_{c_k}^k \rVert
\end{equation}
is the Euclidean distance between the predicted and ground truth pivots.
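Read as code, the three terms amount to the following NumPy sketch (an illustration with assumed array shapes, not the training code; the clipping inside the arccos is an added numerical safeguard):

\begin{verbatim}
import numpy as np

def motion_loss(R_pred, t_pred, p_pred, R_gt, t_gt, p_gt):
    # R_*: (3, 3) rotation matrices; t_*, p_*: (3,) vectors.
    # For a rotation matrix, the inverse equals the transpose.
    R_err = R_pred.T @ R_gt
    cos_angle = (np.trace(R_err) - 1.0) / 2.0
    # Clipping guards against numerical drift outside [-1, 1].
    l_R = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    l_t = np.linalg.norm(R_pred.T @ (t_gt - t_pred))
    l_p = np.linalg.norm(p_gt - p_pred)
    return l_R + l_t + l_p
\end{verbatim}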
\subsection{Dense flow from motion}
We compose a dense optical flow map from the outputs of our Motion R-CNN network. Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$, where
\begin{equation}
P_t =
\begin{pmatrix}
X_t \\ Y_t \\ Z_t
\end{pmatrix}
=
\frac{d_t}{f}
\begin{pmatrix}
x_t - c_0 \\ y_t - c_1 \\ f
\end{pmatrix}
\end{equation}
is the 3D coordinate at $t$ corresponding to the point with pixel coordinates $x_t, y_t$, which range over all coordinates in $I_t$. Here, $f$ denotes the focal length and $(c_0, c_1)$ the principal point of the camera.
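The backprojection can be sketched as follows (NumPy; the function name and argument layout are ours, for illustration):

\begin{verbatim}
import numpy as np

def backproject(d_t, f, c0, c1):
    # Lift a depth map of shape (H, W) to a camera-space
    # point cloud P_t of shape (H, W, 3).
    H, W = d_t.shape
    x_t, y_t = np.meshgrid(np.arange(W), np.arange(H))
    X = (x_t - c0) * d_t / f
    Y = (y_t - c1) * d_t / f
    Z = d_t
    return np.stack([X, Y, Z], axis=-1)
\end{verbatim}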
Given $K$ detections with predicted motions as above, we transform all points within the bounding box of a detected object according to the predicted motion of the object.

We first define the \emph{full image} mask $m_t^k$ for object $k$, which can be computed from the predicted box mask $m_k^b$ by bilinearly resizing $m_k^b$ to the width and height of the predicted bounding box and then copying the values of the resized mask into a full image map starting at the top-left coordinate of the predicted bounding box. Then,
\begin{equation}
P'_{t+1} = P_t + \sum_{k=1}^{K} m_t^k \left( R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right).
\end{equation}
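A sketch of both steps, the mask pasting and the per-object transformation, is given below (NumPy, with OpenCV's bilinear resize standing in for the resizing step; the integer box coordinates and the helper names are assumptions):

\begin{verbatim}
import numpy as np
import cv2

def full_image_mask(m_b, box, H, W):
    # Paste a predicted box mask (e.g. 28 x 28) into a full
    # image map; box = (x0, y0, x1, y1) in integer pixels.
    x0, y0, x1, y1 = box
    resized = cv2.resize(m_b, (x1 - x0, y1 - y0),
                         interpolation=cv2.INTER_LINEAR)
    m_t = np.zeros((H, W), dtype=m_b.dtype)
    m_t[y0:y1, x0:x1] = resized
    return m_t

def apply_object_motions(P_t, masks, Rs, ps, ts):
    # P_t: (H, W, 3); masks: list of (H, W) full image masks;
    # Rs: (3, 3) rotations; ps, ts: (3,) pivots/translations.
    P = P_t.copy()
    for m, R, p, t in zip(masks, Rs, ps, ts):
        moved = (P_t - p) @ R.T + p + t
        P += m[..., None] * (moved - P_t)
    return P
\end{verbatim}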
Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathrm{SE}(3)$, % TODO introduce!
\begin{equation}
\begin{pmatrix}
X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
\end{pmatrix}
= P_{t+1} = R_t^c \cdot P'_{t+1} + t_t^c.
\end{equation}
Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
\begin{equation}
\begin{pmatrix}
x_{t+1} \\ y_{t+1}
\end{pmatrix}
=
\frac{f}{Z_{t+1}}
\begin{pmatrix}
X_{t+1} \\ Y_{t+1}
\end{pmatrix}
+
\begin{pmatrix}
c_0 \\ c_1
\end{pmatrix}.
\end{equation}
We can now obtain the optical flow between $I_t$ and $I_{t+1}$ at each point as
\begin{equation}
\begin{pmatrix}
u \\ v
\end{pmatrix}
=
\begin{pmatrix}
x_{t+1} - x_{t} \\ y_{t+1} - y_{t}
\end{pmatrix}.
\end{equation}
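Combining the camera transformation, the reprojection and the flow composition gives the following end-to-end sketch (NumPy; it assumes the point cloud and intrinsics from the previous sketches, and the function name is ours):

\begin{verbatim}
import numpy as np

def flow_from_motion(P_prime, R_c, t_c, x_t, y_t, f, c0, c1):
    # P_prime: (H, W, 3) object-transformed points P'_{t+1};
    # R_c, t_c: camera rotation (3, 3) and translation (3,);
    # x_t, y_t: (H, W) pixel coordinate grids.
    P_next = P_prime @ R_c.T + t_c   # P_{t+1} = R_c P'_{t+1} + t_c
    X, Y, Z = P_next[..., 0], P_next[..., 1], P_next[..., 2]
    x_next = f * X / Z + c0          # project back to pixels
    y_next = f * Y / Z + c1
    u = x_next - x_t                 # flow = pixel displacement
    v = y_next - y_t
    return np.stack([u, v], axis=-1)
\end{verbatim}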