diff --git a/approach.tex b/approach.tex
index d518917..56a8b53 100644
--- a/approach.tex
+++ b/approach.tex
@@ -196,8 +196,8 @@ a still and moving camera.
 The most straightforward way to supervise the object motions is by using
 ground truth motions computed from ground truth object poses, which is in
 general only practical when training on synthetic datasets.
-Given the $k$-th foreground RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
-let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$
+Given the $k$-th foreground RoI, let $i_k$ be the index of the matched ground truth example with class $c_k^*$,
+let $R^{k,c_k^*}, t^{k,c_k^*}, p^{k,c_k^*}, o^{k,c_k^*}$ be the predicted motion for class $c_k^*$
 and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
 Note that we dropped the subscript $t$ to increase readability.
 Similar to the camera pose regression loss in \cite{PoseNet2},
@@ -231,7 +231,8 @@ $o^{gt,i_k} = 0$, which do not move between $t$ and $t+1$.
 We found that the net
 may not reliably predict exact identity motions for still objects, which is
 numerically more difficult to optimize than performing classification between
 moving and non-moving objects and discarding the regression for the non-moving
-ones.
+ones. Also, as for the masks and bounding boxes, the estimates for classes
+other than $c_k^*$ are not penalized.
 Now, our modified RoI loss is
 \begin{equation}
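Since the hunks above describe the loss structure only in prose, a short sketch may help. The following NumPy code is a hypothetical illustration, not the paper's implementation: the dictionary layout and all names are invented, and plain L1/L2 norms stand in for the PoseNet2-style regression terms. It shows the two properties the text states: only the channel of the matched ground-truth class $c_k^*$ is penalized, and the $R$/$t$/$p$ regression is disabled for ground-truth-still objects, which are supervised only through the moving/still classification.

import numpy as np

def roi_motion_loss(pred, gt_cls, gt_R, gt_t, gt_p, gt_o):
    """Per-RoI motion loss sketch (all names hypothetical).

    pred: dict of per-class predictions, pred['R']: [K, C, 3, 3],
    pred['t'], pred['p']: [K, C, 3], pred['o']: [K, C] (logits).
    gt_cls: [K] matched ground-truth classes; gt_o: [K] in {0, 1}.
    """
    rows = np.arange(gt_cls.shape[0])
    # Only the channel of the ground-truth class c_k^* is penalized;
    # the estimates for all other classes receive no loss.
    R = pred['R'][rows, gt_cls]                    # [K, 3, 3]
    t = pred['t'][rows, gt_cls]                    # [K, 3]
    p = pred['p'][rows, gt_cls]                    # [K, 3]
    o_logit = pred['o'][rows, gt_cls]              # [K]

    # Moving/still classification: binary cross-entropy on the logit.
    o_prob = 1.0 / (1.0 + np.exp(-o_logit))
    l_cls = -(gt_o * np.log(o_prob + 1e-12)
              + (1.0 - gt_o) * np.log(1.0 - o_prob + 1e-12))

    # Regression terms, masked out for still objects (o^{gt} = 0), so the
    # network never has to regress an exact identity motion; the simple
    # L1/L2 terms below are stand-ins for the actual regression loss.
    moving = gt_o.astype(np.float64)
    l_R = np.abs(R - gt_R).sum(axis=(1, 2))
    l_t = np.linalg.norm(t - gt_t, axis=1)
    l_p = np.linalg.norm(p - gt_p, axis=1)
    return (l_cls + moving * (l_R + l_t + l_p)).mean()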
diff --git a/background.tex b/background.tex
index 21927ad..2625c3a 100644
--- a/background.tex
+++ b/background.tex
@@ -600,7 +600,16 @@ is the Iverson bracket indicator function.
 Thus, the bounding box and mask losses are only enabled for the foreground
 RoIs. Note that the bounding box and mask predictions for all classes other
 than $c_i^*$ are not penalized.
-\paragraph{Test-time operation}
+\paragraph{Inference}
 During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring region proposals
-are selected and passed through the RoI head. After this, non-maximum supression
-is applied to predicted foreground RoIs.
+from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
+and passed through the RoI bounding box refinement and classification heads
+(but not through the mask head).
+After this, non-maximum suppression (NMS) is applied to the RoIs with a predicted non-background class,
+with an IoU threshold of 0.7.
+Then, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
+after again extracting the corresponding features.
+Thus, during inference, the features for the mask head are extracted using the refined
+bounding boxes, instead of the RPN bounding boxes. This is important to avoid
+introducing any misalignment, as we want to predict the instance mask inside the
+more precise, refined detection bounding boxes.
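To make the rewritten inference paragraph concrete, here is a minimal NumPy sketch of the described order of operations. All component names (roi_extract, box_head, mask_head) are placeholders for the actual network parts, and the per-class box refinement and per-class NMS details are omitted; only the structure follows the text: proposal selection, box refinement and classification, NMS at IoU 0.7, selection of the 100 best detections, and mask prediction on features re-extracted at the refined boxes.

import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression; boxes are [N, 4] as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return np.asarray(keep, dtype=np.int64)

def inference(backbone_feats, proposals, proposal_scores,
              roi_extract, box_head, mask_head,
              num_proposals=1000, num_detections=100):
    # 1) Keep the highest-scoring RPN proposals (300 without FPN, 1000 with).
    top = np.argsort(proposal_scores)[::-1][:num_proposals]
    boxes = proposals[top]
    # 2) Box refinement and classification on features extracted at the
    #    RPN boxes; the mask head is not run at this stage.
    cls_scores, refined = box_head(roi_extract(backbone_feats, boxes))
    labels = cls_scores.argmax(axis=1)
    scores = cls_scores.max(axis=1)
    fg = labels > 0                                # drop background class 0
    refined, scores, labels = refined[fg], scores[fg], labels[fg]
    # 3) NMS at IoU 0.7 on the refined boxes, then keep the 100 best.
    keep = nms(refined, scores, iou_thresh=0.7)[:num_detections]
    refined, labels = refined[keep], labels[keep]
    # 4) The mask head sees features re-extracted at the *refined* boxes,
    #    so the masks are predicted inside the more precise detections.
    masks = mask_head(roi_extract(backbone_feats, refined))
    return refined, labels, masks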
diff --git a/experiments.tex b/experiments.tex
index d64544b..263084d 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -110,8 +110,8 @@ set, we introduce a few error metrics.
 Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground
 truth example, let $i_k$ be the index of the best matching ground truth
 example, let $c_k$ be the predicted class,
-let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ be the predicted motion for class $c_k$
-and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}$ the ground truth motion for the example $i_k$.
+let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$
+and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
 Then, assuming there are $N$ such detections,
 \begin{equation}
 E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{tr(\mathrm{inv}(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right)
@@ -125,6 +125,28 @@ is the mean euclidean norm between predicted and ground truth translation, and
 E_{p} = \frac{1}{N}\sum_k \left\lVert p^{gt,i_k} - p^{k,c_k} \right\rVert_2
 \end{equation}
 is the mean euclidean norm between predicted and ground truth pivot.
+Moreover, we define precision and recall measures for the detection of moving objects,
+where
+\begin{equation}
+O_{pr} = \frac{tp}{tp + fp}
+\end{equation}
+is the fraction of objects which are actually moving among all objects classified as moving,
+and
+\begin{equation}
+O_{rc} = \frac{tp}{tp + fn}
+\end{equation}
+is the fraction of objects correctly classified as moving among all objects which are actually moving.
+Here, we used
+\begin{equation}
+tp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 1],
+\end{equation}
+\begin{equation}
+fp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 0],
+\end{equation}
+and
+\begin{equation}
+fn = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
+\end{equation}
 
 Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$
 for predicted camera motions.
@@ -160,20 +182,20 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) i
 {
 \begin{table}[t]
 \centering
-\begin{tabular}{@{}*{11}{c}@{}}
+\begin{tabular}{@{}*{13}{c}@{}}
 \toprule
-\multicolumn{4}{c}{Network} & \multicolumn{3}{c}{Instance Motion Error} & \multicolumn{2}{c}{Camera Motion Error} &\multicolumn{2}{c}{Optical Flow Error} \\
- \cmidrule(lr){1-4}\cmidrule(lr){5-7}\cmidrule(l){8-9}\cmidrule(l){10-11}
-FPN & cam. & sup. & XYZ & $E_{R} [deg]$ & $E_{t} [m]$ & $E_{p} [m] $ & $E_{R}^{cam} [deg]$ & $E_{t}^{cam} [m]$ & AEE & Fl-all \\\midrule
-$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & 0.1 & 0.04 & 6.73 & 26.59\% \\
-\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & 0.22 & 0.07 & 12.62 & 46.28\% \\
-$\times$ & $\times$ & 3D & \checkmark & ? & ? & ? & - & - & ? & ? \% \\
-\checkmark & $\times$ & 3D & \checkmark & ? & ? & ? & - & - & ? & ? \% \\
+\multicolumn{4}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} & \multicolumn{2}{c}{Flow Error} \\
+\cmidrule(lr){1-4}\cmidrule(lr){5-9}\cmidrule(lr){10-11}\cmidrule(l){12-13}
+FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
+$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & ? & ? & 0.1 & 0.04 & 6.73 & 26.59\% \\
+\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & ? & ? & 0.22 & 0.07 & 12.62 & 46.28\% \\
+$\times$ & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
+\checkmark & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
 \midrule
-$\times$ & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? \% \\
-\checkmark & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? \% \\
-$\times$ & $\times$ & flow & \checkmark & ? & ? & ? & - & - & ? & ? \% \\
-\checkmark & $\times$ & flow & \checkmark & ? & ? & ? & - & - & ? & ? \% \\
+$\times$ & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
+\checkmark & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ? \% \\
+$\times$ & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
+\checkmark & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ? \% \\
 \bottomrule
 \end{tabular}
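The metric definitions added above translate directly into code. The sketch below is a plain NumPy transcription, assuming the per-detection quantities have already been matched and gathered (detections with an IoU of at least 0.5, best-matching ground truth index $i_k$); the names are illustrative, and the rotation error is converted to degrees to match the units reported in the table.

import numpy as np

def motion_metrics(R, t, p, o, R_gt, t_gt, p_gt, o_gt):
    """R, R_gt: [N, 3, 3]; t, p and their gt: [N, 3]; o, o_gt: [N] in {0, 1}."""
    # E_R: for rotation matrices inv(R) = R^T; the trace term is clamped to
    # [-1, 1] before arccos, as in the formula, to guard against numerical
    # noise, and the angle is converted to degrees.
    cos = (np.trace(np.transpose(R, (0, 2, 1)) @ R_gt,
                    axis1=1, axis2=2) - 1.0) / 2.0
    E_R = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()
    E_t = np.linalg.norm(t_gt - t, axis=1).mean()  # mean translation error [m]
    E_p = np.linalg.norm(p_gt - p, axis=1).mean()  # mean pivot error [m]
    # Precision and recall of the moving/still classification.
    tp = np.sum((o == 1) & (o_gt == 1))
    fp = np.sum((o == 1) & (o_gt == 0))
    fn = np.sum((o == 0) & (o_gt == 1))
    O_pr = tp / max(tp + fp, 1)                    # guard empty denominators
    O_rc = tp / max(tp + fn, 1)
    return E_R, E_t, E_p, O_pr, O_rc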