This commit is contained in:
Simon Meister 2017-11-11 23:08:47 +01:00
parent ce2a7a5253
commit 024af8fede
3 changed files with 52 additions and 20 deletions

View File

@ -196,8 +196,8 @@ a still and moving camera.
The most straightforward way to supervise the object motions is by using ground truth
motions computed from ground truth object poses, which is in general
only practical when training on synthetic datasets.
Given the $k$-th foreground RoI, let $i_k$ be the index of the matched ground truth example with class $c_k^*$,
let $R^{k,c_k^*}, t^{k,c_k^*}, p^{k,c_k^*}, o^{k,c_k^*}$ be the predicted motion for class $c_k^*$
and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
Note that we dropped the subscript $t$ to increase readability.
Similar to the camera pose regression loss in \cite{PoseNet2},
@ -231,7 +231,8 @@ $o^{gt,i_k} = 0$, which do not move between $t$ and $t+1$. We found that the network
may not reliably predict exact identity motions for still objects: regressing
an exact identity motion is numerically harder to optimize than classifying
objects as moving or non-moving and discarding the regression for the non-moving
ones. Also, analogous to masks and bounding boxes, the estimates for classes
other than $c_k^*$ are not penalized.
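As an illustration only (a minimal NumPy sketch with a hypothetical
\texttt{motion\_roi\_loss} helper, not the thesis implementation), the per-RoI
motion loss can be gated on the ground truth moving label as follows:
\begin{verbatim}
import numpy as np

def motion_roi_loss(pred, c_star, gt_motion, o_gt):
    # pred[c] holds (R, t, p, o_logit) for every class c; analogous to
    # masks and boxes, only the prediction for the ground truth class
    # c* is penalized.
    R, t, p, o_logit = pred[c_star]
    R_gt, t_gt, p_gt = gt_motion

    # Binary cross-entropy for the moving / non-moving classification.
    o_prob = 1.0 / (1.0 + np.exp(-o_logit))
    loss = -(o_gt * np.log(o_prob) + (1.0 - o_gt) * np.log(1.0 - o_prob))

    # The regression terms are discarded for non-moving examples
    # (o_gt = 0), so the network never has to regress an exact
    # identity motion.
    if o_gt == 1:
        loss += np.linalg.norm(R - R_gt)  # rotation term
        loss += np.linalg.norm(t - t_gt)  # translation term
        loss += np.linalg.norm(p - p_gt)  # pivot term
    return loss
\end{verbatim}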
Now, our modified RoI loss is
\begin{equation}

View File

@ -600,7 +600,16 @@ is the Iverson bracket indicator function. Thus, the bounding box and mask
losses are only enabled for the foreground RoIs. Note that the bounding box and mask predictions
for all classes other than $c_i^*$ are not penalized.
\paragraph{Inference}
During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring region proposals
from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
and passed through the RoI bounding box refinement and classification heads
(but not through the mask head).
After this, non-maximum suppression (NMS) with an IoU threshold of 0.7 is applied
to the RoIs with a predicted non-background class.
Then, the mask head is applied to the 100 highest-scoring refined boxes remaining
after NMS, for which the corresponding features are extracted again.
Thus, during inference, the features for the mask head are extracted using the refined
bounding boxes instead of the RPN bounding boxes. This avoids introducing any
misalignment, as we want to predict the instance mask inside the more precise,
refined detection bounding boxes.
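For reference, the greedy NMS step with the $0.7$ IoU overlap threshold can be
written as the following self-contained NumPy sketch (a generic implementation,
not necessarily the exact variant used in our code):
\begin{verbatim}
import numpy as np

def nms(boxes, scores, iou_threshold=0.7):
    # Greedy non-maximum suppression over [x1, y1, x2, y2] boxes:
    # repeatedly keep the highest scoring box and drop all remaining
    # boxes whose IoU with it exceeds the threshold.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) \
                 * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]
    return keep
\end{verbatim}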

View File

@ -110,8 +110,8 @@ set, we introduce a few error metrics.
Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
let $i_k$ be the index of the best matching ground truth example,
let $c_k$ be the predicted class,
let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$
and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
Then, assuming there are $N$ such detections,
\begin{equation}
E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right)
@ -125,6 +125,28 @@ is the mean Euclidean distance between the predicted and ground truth translations, and
E_{p} = \frac{1}{N}\sum_k \left\lVert p^{gt,i_k} - p^{k,c_k} \right\rVert_2
\end{equation}
is the mean Euclidean distance between the predicted and ground truth pivots.
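Numerically, $E_R$ amounts to the geodesic angle between the two rotations, with
the $\arccos$ argument clamped to $[-1, 1]$ exactly as in the formula above. A
minimal NumPy sketch (helper names are ours, not from the thesis code):
\begin{verbatim}
import numpy as np

def rotation_angle_error(R_pred, R_gt):
    # For a rotation matrix, inv(R) = R^T; clipping guards arccos
    # against arguments slightly outside [-1, 1] due to rounding.
    cos_angle = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.arccos(np.clip(cos_angle, -1.0, 1.0))

def euclidean_error(v_pred, v_gt):
    # Euclidean distance, used identically for E_t (translations)
    # and E_p (pivots); E_R, E_t, E_p average these terms over the
    # N matched detections.
    return np.linalg.norm(np.asarray(v_gt) - np.asarray(v_pred))
\end{verbatim}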
Moreover, we define precision and recall measures for the detection of moving objects,
where
\begin{equation}
O_{pr} = \frac{tp}{tp + fp}
\end{equation}
is the fraction of objects which are actually moving among all objects classified as moving,
and
\begin{equation}
O_{rc} = \frac{tp}{tp + fn}
\end{equation}
is the fraction of objects correctly classified as moving among all objects which are actually moving.
Here, we used
\begin{equation}
tp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 1],
\end{equation}
\begin{equation}
fp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 0],
\end{equation}
and
\begin{equation}
fn = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
\end{equation}
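Counting the Iverson brackets above directly yields $O_{pr}$ and $O_{rc}$; as a
minimal sketch over binary label arrays (hypothetical helper name):
\begin{verbatim}
import numpy as np

def moving_object_precision_recall(o_pred, o_gt):
    # o_pred[k] and o_gt[k] are the predicted and ground truth moving
    # labels of the k-th matched detection.
    o_pred = np.asarray(o_pred, dtype=bool)
    o_gt = np.asarray(o_gt, dtype=bool)
    tp = np.sum(o_pred & o_gt)    # classified moving, actually moving
    fp = np.sum(o_pred & ~o_gt)   # classified moving, actually still
    fn = np.sum(~o_pred & o_gt)   # classified still, actually moving
    O_pr = tp / (tp + fp) if tp + fp > 0 else 0.0
    O_rc = tp / (tp + fn) if tp + fn > 0 else 0.0
    return O_pr, O_rc
\end{verbatim}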
Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
predicted camera motions.
@ -160,20 +182,20 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) i
{
\begin{table}[t]
\centering
\begin{tabular}{@{}*{13}{c}@{}}
\toprule
\multicolumn{4}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} &\multicolumn{2}{c}{Flow Error} \\
\cmidrule(lr){1-4}\cmidrule(lr){5-9}\cmidrule(l){10-11}\cmidrule(l){12-13}
FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & ? & ? & 0.1 & 0.04 & 6.73 & 26.59\% \\
\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & ? & ? & 0.22 & 0.07 & 12.62 & 46.28\% \\
$\times$ & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\checkmark & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\midrule
$\times$ & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
\checkmark & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
$\times$ & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\checkmark & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
\bottomrule
\end{tabular}