Mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git

Commit 024af8fede (parent ce2a7a5253): WIP
@@ -196,8 +196,8 @@ a still and moving camera.
 The most straightforward way to supervise the object motions is by using ground truth
 motions computed from ground truth object poses, which is in general
 only practical when training on synthetic datasets.
-Given the $k$-th foreground RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
-let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$
+Given the $k$-th foreground RoI, let $i_k$ be the index of the matched ground truth example with class $c_k^*$,
+let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k^*$
 and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
 Note that we dropped the subscript $t$ to increase readability.
 Similar to the camera pose regression loss in \cite{PoseNet2},
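To make the per-class supervision concrete: only the motion predicted for the matched ground-truth class $c_k^*$ enters the loss, just as for masks and boxes. A minimal PyTorch-style sketch, where the tensor names and the packed motion dimension D are illustrative assumptions rather than the thesis code:

    import torch

    def select_gt_class_motion(motion_pred, gt_class):
        # motion_pred: [K, C, D] per-RoI, per-class motion outputs
        #              (D packs R, t, p and the moving score o; assumed layout)
        # gt_class:    [K] index c_k^* of the matched ground-truth class per RoI
        rois = torch.arange(motion_pred.shape[0], device=motion_pred.device)
        return motion_pred[rois, gt_class]  # [K, D]; other classes stay unpenalized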
@@ -231,7 +231,8 @@ $o^{gt,i_k} = 0$, which do not move between $t$ and $t+1$. We found that the network
 may not reliably predict exact identity motions for still objects, which is
 numerically more difficult to optimize than performing classification between
 moving and non-moving objects and discarding the regression for the non-moving
-ones.
+ones. Also, analogous to masks and bounding boxes, the estimates for classes
+other than $c_k^*$ are not penalized.

 Now, our modified RoI loss is
 \begin{equation}
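The moving/still split described above can be sketched as a gated loss: classify first, regress motion only for the actual movers. The names, the binary cross-entropy classification term, and the L1 regression term are stand-in assumptions; the thesis's actual RoI loss follows in the (truncated) equation below.

    import torch
    import torch.nn.functional as F

    def gated_motion_loss(o_logit, motion_pred, o_gt, motion_gt):
        # o_logit:     [K]    predicted moving/still logit per matched RoI
        # motion_pred: [K, D] predicted motion for the ground-truth class
        # o_gt:        [K]    ground-truth flags; 0 = still between t and t+1
        # motion_gt:   [K, D] ground-truth motions
        cls_term = F.binary_cross_entropy_with_logits(o_logit, o_gt.float())
        moving = o_gt == 1                  # gate: regress only actual movers
        if moving.any():
            reg_term = F.l1_loss(motion_pred[moving], motion_gt[moving])
        else:
            reg_term = o_logit.new_zeros(())
        return cls_term + reg_term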
@@ -600,7 +600,16 @@ is the Iverson bracket indicator function. Thus, the bounding box and mask
 losses are only enabled for the foreground RoIs. Note that the bounding box and mask predictions
 for all classes other than $c_i^*$ are not penalized.

-\paragraph{Test-time operation}
+\paragraph{Inference}
 During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring region proposals
-are selected and passed through the RoI head. After this, non-maximum suppression
-is applied to predicted foreground RoIs.
+from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
+and passed through the RoI bounding box refinement and classification heads
+(but not through the mask head).
+After this, non-maximum suppression (NMS) is applied to the RoIs with a predicted non-background class,
+using a maximum IoU of 0.7.
+Then, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
+after again extracting the corresponding features.
+Thus, during inference, the features for the mask head are extracted using the refined
+bounding boxes instead of the RPN bounding boxes. This is important to avoid
+introducing any misalignment, as we want to create the instance mask inside the
+more precise, refined detection bounding boxes.
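Since the inference procedure is an ordered pipeline, a hedged Python sketch may help; `crop_and_resize` and `nms` are hypothetical placeholders for RoIAlign-style feature extraction and standard non-maximum suppression, and `nms` is assumed to return score-sorted indices:

    def run_inference(image, backbone, rpn, box_head, mask_head,
                      num_proposals=1000, nms_iou=0.7, max_dets=100):
        feats = backbone(image)
        proposals = rpn(feats)[:num_proposals]   # top-scoring RPN boxes
                                                 # (1000 with FPN, 300 without)

        # box refinement + classification on features cropped at the RPN boxes
        roi_feats = crop_and_resize(feats, proposals)
        classes, scores, boxes = box_head(roi_feats)

        # NMS with maximum IoU 0.7 over detections with a non-background class
        fg = classes > 0
        keep = nms(boxes[fg], scores[fg], iou_threshold=nms_iou)
        dets = boxes[fg][keep][:max_dets]        # 100 best after NMS

        # the mask head sees features re-extracted at the *refined* boxes,
        # so masks are aligned with the final detection boxes
        masks = mask_head(crop_and_resize(feats, dets))
        return dets, masks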
@@ -110,8 +110,8 @@ set, we introduce a few error metrics.
 Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
 let $i_k$ be the index of the best matching ground truth example,
 let $c_k$ be the predicted class,
-let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ be the predicted motion for class $c_k$
-and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}$ the ground truth motion for the example $i_k$.
+let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$
+and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$.
 Then, assuming there are $N$ such detections,
 \begin{equation}
 E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right)
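The summand of $E_{R}$ is the geodesic angle between predicted and ground-truth rotations; since $\mathrm{inv}(R) = R^\top$ for a rotation matrix, a minimal NumPy sketch with the same clamping is:

    import numpy as np

    def rotation_error_deg(R_pred, R_gt):
        # tr(R_pred^T R_gt) = 1 + 2 cos(theta); clipping guards against
        # floating-point values slightly outside [-1, 1]
        cos_theta = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
        return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

The conversion to degrees matches the $[deg]$ units used in the results table.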
@@ -125,6 +125,28 @@ is the mean Euclidean norm between predicted and ground truth translation, and
 E_{p} = \frac{1}{N}\sum_k \left\lVert p^{gt,i_k} - p^{k,c_k} \right\rVert_2
 \end{equation}
 is the mean Euclidean norm between predicted and ground truth pivot.
+Moreover, we define precision and recall measures for the detection of moving objects,
+where
+\begin{equation}
+O_{pr} = \frac{tp}{tp + fp}
+\end{equation}
+is the fraction of objects which are actually moving among all objects classified as moving,
+and
+\begin{equation}
+O_{rc} = \frac{tp}{tp + fn}
+\end{equation}
+is the fraction of objects correctly classified as moving among all objects which are actually moving.
+Here, we used
+\begin{equation}
+tp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 1],
+\end{equation}
+\begin{equation}
+fp = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 0],
+\end{equation}
+and
+\begin{equation}
+fn = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
+\end{equation}
 Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
 predicted camera motions.
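The Iverson-bracket counts translate directly into code. An illustrative NumPy sketch, with one 0/1 moving flag per matched detection $k$:

    import numpy as np

    def moving_object_pr(o_pred, o_gt):
        # o_pred, o_gt: 0/1 moving flags over all matched detections k
        o_pred, o_gt = np.asarray(o_pred), np.asarray(o_gt)
        tp = np.sum((o_pred == 1) & (o_gt == 1))
        fp = np.sum((o_pred == 1) & (o_gt == 0))
        fn = np.sum((o_pred == 0) & (o_gt == 1))
        O_pr = tp / (tp + fp) if tp + fp else 0.0   # precision
        O_rc = tp / (tp + fn) if tp + fn else 0.0   # recall
        return O_pr, O_rc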
@@ -160,20 +182,20 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in
 {
 \begin{table}[t]
 \centering
-\begin{tabular}{@{}*{11}{c}@{}}
+\begin{tabular}{@{}*{13}{c}@{}}
 \toprule
-\multicolumn{4}{c}{Network} & \multicolumn{3}{c}{Instance Motion Error} & \multicolumn{2}{c}{Camera Motion Error} & \multicolumn{2}{c}{Optical Flow Error} \\
-\cmidrule(lr){1-4}\cmidrule(lr){5-7}\cmidrule(l){8-9}\cmidrule(l){10-11}
-FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
-$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & 0.1 & 0.04 & 6.73 & 26.59\% \\
-\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & 0.22 & 0.07 & 12.62 & 46.28\% \\
-$\times$ & $\times$ & 3D & \checkmark & ? & ? & ? & - & - & ? & ?\% \\
-\checkmark & $\times$ & 3D & \checkmark & ? & ? & ? & - & - & ? & ?\% \\
+\multicolumn{4}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} & \multicolumn{2}{c}{Flow Error} \\
+\cmidrule(lr){1-4}\cmidrule(lr){5-9}\cmidrule(l){10-11}\cmidrule(l){12-13}
+FPN & cam. & sup. & XYZ & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
+$\times$ & \checkmark & 3D & \checkmark & 0.4 & 0.49 & 17.06 & ? & ? & 0.1 & 0.04 & 6.73 & 26.59\% \\
+\checkmark & \checkmark & 3D & \checkmark & 0.35 & 0.38 & 11.87 & ? & ? & 0.22 & 0.07 & 12.62 & 46.28\% \\
+$\times$ & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ?\% \\
+\checkmark & $\times$ & 3D & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ?\% \\
 \midrule
-$\times$ & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ?\% \\
-\checkmark & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ?\% \\
-$\times$ & $\times$ & flow & \checkmark & ? & ? & ? & - & - & ? & ?\% \\
-\checkmark & $\times$ & flow & \checkmark & ? & ? & ? & - & - & ? & ?\% \\
+$\times$ & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ?\% \\
+\checkmark & \checkmark & flow & \checkmark & ? & ? & ? & ? & ? & ? & ? & ? & ?\% \\
+$\times$ & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ?\% \\
+\checkmark & $\times$ & flow & \checkmark & ? & ? & ? & ? & ? & - & - & ? & ?\% \\
 \bottomrule
 \end{tabular}