Commit 9215f296a7 (parent f0050801a4): final

background.tex (135 changed lines)
@@ -17,6 +17,8 @@ to estimate disparity-based depth, however monocular depth estimation with deep
popular \cite{DeeperDepth, UnsupPoseDepth}.
In this preliminary work, we will assume per-pixel depth to be given.

\subsection{CNNs for dense motion estimation}

{
\begin{table}[h]
\centering

@@ -54,7 +56,6 @@ are used for refinement.
\end{table}
}

Deep convolutional neural network (CNN) architectures
\cite{ImageNetCNN, VGGNet, ResNet}
became widely popular through numerous successes in classification and recognition tasks.

@@ -134,30 +135,17 @@ and N$_{motions} = 3$.

\subsection{ResNet}
\label{ssec:resnet}

\begin{figure}[t]
\centering
\includegraphics[width=0.3\textwidth]{figures/bottleneck}
\caption{
ResNet \cite{ResNet} \enquote{bottleneck} convolutional block introduced to reduce computational
complexity in deeper network variants, shown here with 256 input and output channels.
Figure taken from \cite{ResNet}.
}
\label{figure:bottleneck}
\end{figure}

{
\begin{table}[h]

@@ -228,16 +216,29 @@ Batch normalization \cite{BN} is used after every residual unit.
\end{table}
}

ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but
became popular as a basic building block of many deep network architectures for a variety
of different tasks. Figure \ref{figure:bottleneck}
shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
of very deep networks without the gradients becoming too small as the distance
from the output layer increases.

In Table \ref{table:resnet}, we show the ResNet variant
that will serve as the basic CNN backbone of our networks and
is also used in many other region-based convolutional networks.
The initial image data is always passed through the ResNet backbone as a first step to
bootstrap the complete deep network.
Note that for the Mask R-CNN architectures we describe below, the architecture shown is equivalent
to the standard ResNet-50 backbone.

We additionally introduce one small extension that
will be useful for our Motion R-CNN network.
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
to increase the bottleneck stride to 64, following FlowNetS.
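
To make the residual unit and the stride-64 extension concrete, here is a small illustrative sketch using the Keras API; it is not taken from the thesis implementation, and the exact placement of strides, the C$_6$ channel width, and all names are our own assumptions. With the default of 64 bottleneck channels, the unit has 256 output channels, matching the block in Figure \ref{figure:bottleneck}.

    from tensorflow.keras import layers

    def bottleneck(x, channels=64, stride=1):
        # Additive ResNet "bottleneck" unit: 1x1 reduce, 3x3, 1x1 expand, plus skip connection.
        shortcut = x
        out = layers.Conv2D(channels, 1, strides=stride, use_bias=False)(x)
        out = layers.BatchNormalization()(out)
        out = layers.ReLU()(out)
        out = layers.Conv2D(channels, 3, padding="same", use_bias=False)(out)
        out = layers.BatchNormalization()(out)
        out = layers.ReLU()(out)
        out = layers.Conv2D(4 * channels, 1, use_bias=False)(out)
        out = layers.BatchNormalization()(out)
        if stride != 1 or shortcut.shape[-1] != 4 * channels:
            # project the skip connection when resolution or channel width changes
            shortcut = layers.Conv2D(4 * channels, 1, strides=stride, use_bias=False)(shortcut)
            shortcut = layers.BatchNormalization()(shortcut)
        return layers.ReLU()(layers.Add()([out, shortcut]))

    # Hypothetical extra C_6 stage: one further stride-2 bottleneck stacked on the
    # C_5 output of ResNet-50 (overall stride 32), raising the bottleneck stride to 64.
    c5 = layers.Input(shape=(None, None, 2048))
    c6 = bottleneck(c5, channels=512, stride=2)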

\subsection{Region-based CNNs}
\label{ssec:rcnn}

@@ -266,6 +267,39 @@ Thus, given region proposals, all computation is reduced to a single pass throug
speeding up the system by two orders of magnitude at inference time and one order of magnitude
at training time.

\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster test-time processing compared to Fast R-CNN
and, again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
predicts objectness scores and regresses bounding boxes at each of its output positions.
At any of the $h \times w$ output positions of the RPN head,
$\text{N}_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total.
In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding
to anchor boxes with areas of $\{128^2, 256^2, 512^2\}$ pixels, and 3 aspect ratios,
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).

For each RPN prediction at a given position, the objectness score tells us how likely it is to correspond to a detection.
The region proposals can then be obtained as the $N$ highest-scoring RPN predictions.

Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each of the region proposals, which are now obtained
from the RPN instead of being pre-computed by an external algorithm.
As in Fast R-CNN, RoI pooling is used to extract one fixed-size feature map for each of the region proposals,
and the refined bounding boxes are predicted separately for each object class.

Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
(for Faster R-CNN, the mask head is ignored).
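
As a concrete illustration of the anchor layout described above, the following NumPy sketch (ours, not the thesis code) enumerates the $\text{N}_a \times h \times w$ reference anchors for an $h \times w$ RPN output with stride 16, the three areas $\{128^2, 256^2, 512^2\}$, and the three aspect ratios $\{1:2, 1:1, 2:1\}$; the exact centering convention is an assumption:

    import numpy as np

    def anchor_grid(h, w, stride=16, areas=(128**2, 256**2, 512**2),
                    ratios=(0.5, 1.0, 2.0)):
        # Return all N_a * h * w anchors as (cx, cy, width, height) boxes in image coordinates.
        base = []
        for area in areas:
            for ratio in ratios:            # ratio = height / width
                aw = np.sqrt(area / ratio)
                ah = aw * ratio
                base.append((aw, ah))
        base = np.array(base)               # (N_a, 2), here N_a = 9

        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        centers = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)  # (h, w, 2)

        anchors = np.concatenate(
            [np.broadcast_to(centers[:, :, None, :], (h, w, len(base), 2)),
             np.broadcast_to(base, (h, w, len(base), 2))], axis=-1)
        return anchors.reshape(-1, 4)       # (N_a * h * w, 4)

    # e.g. a hypothetical 600x1000 input with stride 16 gives a 38x63 RPN output
    print(anchor_grid(38, 63).shape)        # (21546, 4) = 9 * 38 * 63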

{
\begin{table}[t]
\centering

@@ -316,39 +350,6 @@ whereas Faster R-CNN uses RoI pooling.
\end{table}
}

\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
However, it can be helpful to know class and object (instance) membership of all individual pixels,

bib.bib (1 changed line)

@@ -330,6 +330,7 @@
volume={1},
number={4},
pages={541-551},
month = dec,
journal = neco,
year = {1989}}

@@ -2,8 +2,8 @@

We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames (and XYZ point coordinates) from a monocular camera.
In addition to instance motions, our network estimates the 3D ego-motion of the camera.
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
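
To illustrate how such a composition can work in principle, here is a standalone NumPy sketch with hypothetical names (ours, not the thesis code); it shows only the flow induced by the camera motion, and the per-instance rigid motions would be applied analogously to the pixels inside each predicted instance mask:

    import numpy as np

    def flow_from_camera_motion(depth, K, R, t):
        # Dense optical flow induced by a rigid camera motion (R, t), given per-pixel
        # depth in the first frame and the camera intrinsics K.
        h, w = depth.shape
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

        X = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)   # back-project to 3D
        X2 = X @ R.T + t                                          # apply the camera motion
        p2 = (K @ X2.T).T
        p2 = p2[:, :2] / p2[:, 2:3]                               # project to the second frame

        return (p2 - pix[:, :2]).reshape(h, w, 2)

    # a motion-free camera yields zero flow
    depth = np.ones((4, 5))
    K = np.array([[100.0, 0.0, 2.5], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
    print(np.allclose(flow_from_camera_motion(depth, K, np.eye(3), np.zeros(3)), 0.0))  # True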

Our model is trained on the synthetic Virtual KITTI dataset, which provides

@@ -20,7 +20,8 @@ the accuracy of the motion predictions is still not convincing.
More work will thus be required to bring the system (closer) to competitive accuracy,
which includes trying penalization with the flow loss instead of 3D motion ground truth,
experimenting with the weighting between different loss terms,
and improvements to the network architecture, loss design, and training process.

We thus presented a partial step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
to previous end-to-end deep networks for dense motion estimation, the output

@@ -30,7 +31,7 @@ applications.
\subsection{Future Work}
\paragraph{Mask R-CNN baseline}
As our Mask R-CNN re-implementation is still not as accurate as reported in the
original paper \cite{MaskRCNN}, working on the implementation details of this baseline would be
a critical, direct next step. Recently, a highly accurate, third-party implementation of Mask
R-CNN in TensorFlow was released, which should be studied to this end.

@@ -54,7 +55,7 @@ and optical flow ground truth to evaluate the composed flow field.
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.

As KITTI 2015 also provides instance masks for moving objects, we could in principle
fine-tune on KITTI 2015 train alone. As long as we cannot evaluate our method on the
KITTI 2015 test set, this makes little sense, though.
|
|||||||
this would however require using a different dataset for training it, as Virtual KITTI does not
|
this would however require using a different dataset for training it, as Virtual KITTI does not
|
||||||
provide stereo images.
|
provide stereo images.
|
||||||
If we would use a specialized depth network, we could use stereo data
|
If we would use a specialized depth network, we could use stereo data
|
||||||
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI,
|
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset,
|
||||||
though we would loose the ability to easily train the system in an end-to-end manner.
|
though we would loose the ability to easily train the system in an end-to-end manner.
|
||||||
|
|
||||||
As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test,
|
As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test,
|
||||||

@@ -138,7 +139,7 @@ setting \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.

\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described a loss based on optical flow for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera

@@ -159,9 +160,9 @@ Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.

\paragraph{Temporal consistency}
A next step after the aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of classical energy-minimization-based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.

@@ -160,11 +160,12 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
For training the RPN and RoI heads and during inference,
we use the exact same number of proposals and RoIs as Mask R-CNN in
the ResNet and ResNet-FPN variants, respectively.
All losses (the original ones and our new motion losses)
are added up without additional weighting between the loss terms,
as in Mask R-CNN.
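
A minimal sketch of this unweighted combination (the loss names and values below are placeholders, not the actual terms or magnitudes):

    def total_loss(losses):
        # Sum all loss terms without extra weighting, as in Mask R-CNN.
        return sum(losses.values())

    losses = {"rpn_objectness": 0.4, "rpn_box": 0.2, "cls": 0.5,
              "box": 0.3, "mask": 0.6, "camera_motion": 0.1, "instance_motion": 0.2}
    print(round(total_loss(losses), 2))  # 2.3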

\paragraph{Initialization}
For initializing the C$_1$ to C$_5$ (see Table~\ref{table:resnet}) weights, we use a pre-trained
ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository.
Following the pre-existing TensorFlow implementation of Faster R-CNN,
we initialize all other hidden layers with He initialization \cite{He}.
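
He initialization \cite{He} draws weights from a zero-mean Gaussian with standard deviation $\sqrt{2 / n}$, where $n$ is the fan-in of the layer; a small illustrative sketch (hypothetical shapes, not the initializer call actually used):

    import numpy as np

    def he_init(shape, rng=None):
        # He (MSRA) initialization for a conv kernel of shape (kh, kw, c_in, c_out):
        # zero-mean Gaussian with std sqrt(2 / fan_in), fan_in = kh * kw * c_in.
        rng = rng or np.random.default_rng(0)
        fan_in = shape[0] * shape[1] * shape[2]
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

    w = he_init((3, 3, 256, 256))
    print(round(w.std(), 4))  # about 0.0295 = sqrt(2 / (3 * 3 * 256))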

@@ -185,7 +186,7 @@ are in general expected to output.
Visualization of results on Virtual KITTI with XYZ input, camera motion prediction and 3D motion supervision.
For each example, we show the results with Motion R-CNN ResNet and ResNet-FPN
in the upper and lower row, respectively.
From left to right, we show the first input frame with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}

@@ -200,7 +201,7 @@ We visually compare a Motion R-CNN ResNet trained without (upper row) and
with (lower row) classifying the objects into moving and non-moving objects.
Note that in the selected example, all cars are parked, and thus the predicted
motion in the first row is an error.
From left to right, we show the first input frame with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
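
The moving vs. non-moving classification referred to in this caption can be thought of as a gate on the predicted per-instance rigid motion; a simplified sketch with hypothetical names (our illustration, not the thesis code):

    import numpy as np

    def gate_instance_motion(p_moving, R, t, threshold=0.5):
        # Replace the predicted rigid motion of an instance classified as
        # non-moving with the identity motion.
        if p_moving < threshold:
            return np.eye(3), np.zeros(3)
        return R, t

    # a parked car with a low moving-probability keeps zero motion
    R_pred, t_pred = gate_instance_motion(0.1, np.eye(3), np.array([0.5, 0.0, 1.2]))
    print(t_pred)  # [0. 0. 0.]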

@@ -215,7 +216,7 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) i
\multicolumn{1}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} & \multicolumn{2}{c}{Flow Error} \\
\cmidrule(lr){1-1}\cmidrule(lr){2-6}\cmidrule(l){7-8}\cmidrule(l){9-10}
FPN & $E_{R} [deg]$ & $E_{t} [m]$ & $E_{p} [m]$ & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam} [deg]$ & $E_{t}^{cam} [m]$ & AEE & Fl-all \\\midrule
- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & - \\\midrule
$\times$ & 0.301 & 0.237 & 3.331 & 0.790 & 0.916 & 0.087 & 0.053 & 11.17 & 24.91\% \\
\checkmark & 0.293 & 0.210 & 1.958 & 0.844 & 0.914 & 0.169 & 0.050 & 8.29 & 45.22\% \\
\bottomrule
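
For reference, the two flow error measures in the last columns can be computed as in the following NumPy sketch (our own, using the standard definitions: AEE is the mean endpoint error, and Fl-all counts a pixel as an outlier if its endpoint error exceeds both 3 px and 5\% of the ground truth flow magnitude, mirroring the correctness criterion of the error maps above):

    import numpy as np

    def flow_errors(flow_pred, flow_gt):
        # Average endpoint error (AEE) and KITTI-style outlier percentage (Fl-all).
        epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)   # per-pixel endpoint error
        mag = np.linalg.norm(flow_gt, axis=-1)
        outlier = (epe > 3.0) & (epe > 0.05 * mag)           # wrong if > 3 px AND > 5 %
        return epe.mean(), 100.0 * outlier.mean()

    flow_gt = np.zeros((2, 2, 2))
    flow_pred = np.full((2, 2, 2), 4.0)      # every pixel is off by sqrt(32) > 3 px
    aee, fl_all = flow_errors(flow_pred, flow_gt)
    print(round(aee, 2), fl_all)             # 5.66 100.0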

@@ -237,14 +238,15 @@ to the average rotation angle in the ground truth camera motions.

For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both the Motion R-CNN ResNet and ResNet-FPN variants, and supervise
camera and instance motions with 3D motion ground truth.
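
A minimal sketch of this input layout (shapes and channel order are our assumptions, not necessarily those of the implementation):

    import numpy as np

    def network_input(im1, im2, xyz1, xyz2):
        # Concatenate both RGB frames and both per-pixel XYZ coordinate maps
        # along the channel axis, giving an (H, W, 12) input tensor.
        return np.concatenate([im1, im2, xyz1, xyz2], axis=-1)

    h, w = 192, 640  # hypothetical input resolution
    x = network_input(np.zeros((h, w, 3)), np.zeros((h, w, 3)),
                      np.zeros((h, w, 3)), np.zeros((h, w, 3)))
    print(x.shape)  # (192, 640, 12)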

In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
In Figure \ref{figure:moving}, we visually justify the addition of the classifier
that decides between a moving and a still object.
In Table \ref{table:vkitti}, we compare various metrics for the Motion R-CNN
ResNet and ResNet-FPN network variants
on the Virtual KITTI validation set.

\paragraph{Camera motion}

@@ -264,8 +266,8 @@ helpful.

\paragraph{Instance motion}
The object pivots are estimated with relatively high accuracy in both variants
(given that the scenes are at a realistic scale), although the FPN variant is significantly more
accurate, which we ascribe to the higher-resolution features used in this variant.

The predicted 3D object translations and rotations still have a relatively high
error, compared to the average actual (ground truth) translations and rotations,