This commit is contained in:
Simon Meister 2017-11-11 23:08:47 +01:00
parent ce2a7a5253
commit 024af8fede
3 changed files with 52 additions and 20 deletions

View File

@ -196,8 +196,8 @@ a still and moving camera.
The most straightforward way to supervise the object motions is by using ground truth
motions computed from ground truth object poses, which is in general
only practical when training on synthetic datasets.
Given the $k$-th foreground RoI, let $i_k$ be the index of the matched ground truth example with class $c_k^*$,
let $R^{k,c_k^*}, t^{k,c_k^*}, p^{k,c_k^*}, o^{k,c_k^*}$ be the predicted motion for class $c_k^*$
and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
Note that we dropped the subscript $t$ to increase readability.
Similar to the camera pose regression loss in \cite{PoseNet2},
@ -231,7 +231,8 @@ $o^{gt,i_k} = 0$, which do not move between $t$ and $t+1$. We found that the network
may not reliably predict exact identity motions for still objects: regressing
an exact identity motion is numerically harder to optimize than classifying
objects as moving or non-moving and discarding the regression for the non-moving
ones. Also, analogous to masks and bounding boxes, the estimates for classes
other than $c_k^*$ are not penalized.
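As an illustration only (a minimal NumPy sketch with a hypothetical
\texttt{motion\_roi\_loss} helper, not the thesis implementation), the per-RoI
motion loss can be gated on the ground truth moving label as follows:
\begin{verbatim}
import numpy as np

def motion_roi_loss(pred, c_star, gt_motion, o_gt):
    # pred[c] holds (R, t, p, o_logit) for every class c; analogous to
    # masks and boxes, only the prediction for the ground truth class
    # c* is penalized.
    R, t, p, o_logit = pred[c_star]
    R_gt, t_gt, p_gt = gt_motion

    # Binary cross-entropy for the moving / non-moving classification.
    o_prob = 1.0 / (1.0 + np.exp(-o_logit))
    loss = -(o_gt * np.log(o_prob) + (1.0 - o_gt) * np.log(1.0 - o_prob))

    # The regression terms are discarded for non-moving examples
    # (o_gt = 0), so the network never has to regress an exact
    # identity motion.
    if o_gt == 1:
        loss += np.linalg.norm(R - R_gt)  # rotation term
        loss += np.linalg.norm(t - t_gt)  # translation term
        loss += np.linalg.norm(p - p_gt)  # pivot term
    return loss
\end{verbatim}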
Now, our modified RoI loss is
\begin{equation}

View File

@ -600,7 +600,16 @@ is the Iverson bracket indicator function. Thus, the bounding box and mask
losses are only enabled for the foreground RoIs. Note that the bounding box and mask predictions
for all classes other than $c_i^*$ are not penalized.
\paragraph{Inference}
During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring region proposals
from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
and passed through the RoI bounding box refinement and classification heads
(but not through the mask head).
After this, non-maximum suppression (NMS) with an IoU threshold of 0.7 is applied
to the RoIs with a predicted non-background class.
Then, the mask head is applied to the 100 highest-scoring refined boxes remaining
after NMS, for which the corresponding features are extracted again.
Thus, during inference, the features for the mask head are extracted using the refined
bounding boxes instead of the RPN bounding boxes. This avoids introducing any
misalignment, as we want to predict the instance mask inside the more precise,
refined detection bounding boxes.
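For reference, the greedy NMS step with the $0.7$ IoU overlap threshold can be
written as the following self-contained NumPy sketch (a generic implementation,
not necessarily the exact variant used in our code):
\begin{verbatim}
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    # Greedy non-maximum suppression over [x1, y1, x2, y2] boxes:
    # repeatedly keep the highest scoring box and drop all remaining
    # boxes whose IoU with it exceeds the threshold.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) \
                 * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]
    return keep
\end{verbatim}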

View File

@ -110,8 +110,8 @@ set, we introduce a few error metrics.
Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
let $i_k$ be the index of the best matching ground truth example,
let $c_k$ be the predicted class,
let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$
and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
Then, assuming there are $N$ such detections,
\begin{equation}
E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right)
@ -125,6 +125,28 @@ is the mean Euclidean distance between the predicted and ground truth translations, and
E_{p} = \frac{1}{N}\sum_k \left\lVert p^{gt,i_k} - p^{k,c_k} \right\rVert_2
\end{equation}
is the mean Euclidean distance between the predicted and ground truth pivots.
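Numerically, $E_R$ amounts to the geodesic angle between the two rotations, with
the $\arccos$ argument clamped to $[-1, 1]$ exactly as in the formula above. A
minimal NumPy sketch (helper names are ours, not from the thesis code):
\begin{verbatim}
import numpy as np

def rotation_angle_error(R_pred, R_gt):
    # For a rotation matrix, inv(R) = R^T; clipping guards arccos
    # against arguments slightly outside [-1, 1] due to rounding.
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

def euclidean_error(v_pred, v_gt):
    # Euclidean distance, used identically for E_t (translations)
    # and E_p (pivots); E_R, E_t, E_p average these terms over the
    # N matched detections.
    return np.linalg.norm(np.asarray(v_gt) - np.asarray(v_pred))
\end{verbatim}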
Moreover, we define precision and recall measures for the detection of moving objects,
where
\begin{equation}
O_{pr} = \frac{tp}{tp + fp}
\end{equation}
is the fraction of objects which are actually moving among all objects classified as moving,
and
\begin{equation}
O_{rc} = \frac{tp}{tp + fn}
\end{equation}
is the fraction of objects correctly classified as moving among all objects which are actually moving.
Here, we used
\begin{equation}
tp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 1],
\end{equation}
\begin{equation}
fp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 0],
\end{equation}
and
\begin{equation}
fn = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
\end{equation}
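Counting the Iverson brackets above directly yields $O_{pr}$ and $O_{rc}$; as a
minimal sketch over binary label arrays (hypothetical helper name):
\begin{verbatim}
import numpy as np

def moving_object_precision_recall(o_pred, o_gt):
    # o_pred[k] and o_gt[k] are the predicted and ground truth moving
    # labels of the k-th matched detection.
    o_pred = np.asarray(o_pred, dtype=bool)
    o_gt = np.asarray(o_gt, dtype=bool)
    tp = np.sum(o_pred & o_gt)    # classified moving, actually moving
    fp = np.sum(o_pred & ~o_gt)   # classified moving, actually still
    fn = np.sum(~o_pred & o_gt)   # classified still, actually moving
    O_pr = tp / (tp + fp) if tp + fp > 0 else 0.0
    O_rc = tp / (tp + fn) if tp + fn > 0 else 0.0
    return O_pr, O_rc
\end{verbatim}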
Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
predicted camera motions.
@ -160,20 +182,20 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) i
{
\begin{table}[t]
\centering
\begin{tabular}{@{}*{13}{c}@{}}
\toprule
\multicolumn{4}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} &\multicolumn{2}{c}{Flow Error} \\
\cmidrule(lr){1-4}\cmidrule(lr){5-9}\cmidrule(l){10-11}\cmidrule(l){12-13}
FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & ? & ? & 0.1 & 0.04 & 6.73 & 26.59\% \\
\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & ? & ? & 0.22 & 0.07 & 12.62 & 46.28\% \\
$\times$ & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\checkmark & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\midrule
$\times$ & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
\checkmark & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
$\times$ & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\checkmark & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\bottomrule
\end{tabular}