Mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git, synced 2025-12-13 09:55:49 +00:00

commit 48ed4b4696 (parent 9215f296a7)
Commit message: final
@@ -23,7 +23,7 @@ thus combining the representation learning benefits and speed of end-to-end deep
 with a physically plausible scene model inspired by slanted plane energy-minimization approaches to
 scene flow.

-Building on recent advances in region-based convolutional networks (R-CNNs),
+Building on recent advances in region-based convolutional neural networks (R-CNNs),
 we integrate motion estimation with instance segmentation.
 Given two consecutive frames from a monocular RGB-D camera,
 our resulting end-to-end deep network detects objects with precise per-pixel
@@ -54,8 +54,8 @@ Objekte respektiert, und kombinieren damit die Repräsentationskraft und Geschwi
 von end-to-end Deep Networks mit einem physikalisch plausiblen Szenenmodell,
 das von slanted-plane Energieminimierungsmethoden für Szenenfluss inspiriert ist.

-Hierbei bauen wir auf den aktuellen Fortschritten in regionsbasierten Convolutional
-Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentierung.
+Hierbei bauen wir auf den aktuellen Fortschritten bei regionsbasierten Convolutional
+Neural Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentierung.
 Bei Eingabe von zwei aufeinanderfolgenden Frames aus einer monokularen RGB-D
 Kamera erkennt unser end-to-end Deep Network Objekte mit pixelgenauen Objektmasken
 und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den Frames ab.
approach.tex (19 changed lines)
@@ -116,7 +116,7 @@ additonally dropout with $p = 0.5$ after all fully-connected hidden layers.
 }

 \paragraph{Motion R-CNN backbone}
-Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backbone network to compute feature maps from input imagery.
+Like Faster R-CNN and Mask R-CNN, we use a ResNet variant \cite{ResNet} as backbone network to compute feature maps from input imagery.

 Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
 laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
@@ -125,15 +125,16 @@ Additionally, we also experiment with concatenating the camera space XYZ coordin
 XYZ$_t$ and XYZ$_{t+1}$, into the input as well.
 We do not introduce a separate network for computing region proposals and use our modified backbone network
 as both RPN and for extracting the RoI features.

 Technically, our feature encoder network will have to learn image matching representations similar to
-that learned by the FlowNet encoder, but the output will be computed in the
-object-centric framework of a region based convolutional network head with a 3D parametrization.
+those learned by the FlowNet encoder, but the output will be computed in the
+object-centric framework of a region-based convolutional network head with a 3D parametrization.
 Thus, in contrast to the dense FlowNet decoder, the estimated dense image matching information
-from the encoder is integrated for specific objects via RoI extraction and
+from the encoder is integrated for specific objects via RoI extraction and subsequently
 processed by the RoI head for each object.

 \paragraph{Per-RoI motion prediction}
-We use a rigid 3D motion parametrization similar to the one used in SE3-Nets and SfM-Net \cite{SE3Nets, SfmNet}.
+We use a 3D rigid motion parametrization similar to the one used in SE3-Nets and SfM-Net \cite{SE3Nets, SfmNet}.
 For the $k$-th object proposal, we predict the rigid transformation $\{R_k, t_k\}\in \mathbf{SE}(3)$
 \footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
 and translations: $\{R, t|R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
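To make the backbone modification in the hunk above concrete, here is a minimal NumPy sketch of assembling the matching-aware input by concatenating the two frames (and, optionally, the XYZ$_t$ and XYZ$_{t+1}$ coordinate maps) along the channel axis; the function and argument names are placeholders chosen for this illustration, not identifiers from the thesis code.

```python
import numpy as np

def build_backbone_input(frame_t, frame_t1, xyz_t=None, xyz_t1=None):
    """Stack two RGB frames (and optionally the camera-space XYZ maps)
    along the channel axis to form the matching-aware backbone input.

    frame_t, frame_t1: H x W x 3 arrays; xyz_t, xyz_t1: H x W x 3 arrays or None.
    """
    channels = [frame_t, frame_t1]
    if xyz_t is not None and xyz_t1 is not None:
        channels += [xyz_t, xyz_t1]
    return np.concatenate(channels, axis=-1)  # H x W x 6, or H x W x 12 with XYZ
```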
@@ -178,7 +179,7 @@ We then extend the Mask R-CNN head by adding a small fully-connected network for
 prediction in addition to the fully-connected layers for
 refined boxes and classes and the convolutional network for the masks.
 Like for refined boxes and masks, we make one separate motion prediction for each class.
-Each instance motion is predicted as a set of nine scalar parameters,
+Each instance motion is predicted as a set of nine values,
 $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_k$ and $p_k$,
 where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
 Here, we assume that motions between frames are relatively small
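The nine predicted values can be turned into a rigid transformation as sketched below; recovering the cosines as sqrt(1 - sin^2) under the small-motion assumption and the XYZ composition order are assumptions of this sketch, not details stated in the hunk above.

```python
import numpy as np

def decode_instance_motion(pred):
    """Split the nine predicted values into (R, t, p).

    pred: array of shape (9,) = [sin_a, sin_b, sin_g, t_x, t_y, t_z, p_x, p_y, p_z].
    Under the small-motion assumption the cosines are recovered as
    sqrt(1 - sin^2); the XYZ composition order is an assumption of this sketch.
    """
    sin_a, sin_b, sin_g = np.clip(pred[:3], -1.0, 1.0)
    t, p = pred[3:6], pred[6:9]
    cos_a, cos_b, cos_g = np.sqrt(1.0 - np.array([sin_a, sin_b, sin_g]) ** 2)
    Rx = np.array([[1, 0, 0], [0, cos_a, -sin_a], [0, sin_a, cos_a]])
    Ry = np.array([[cos_b, 0, sin_b], [0, 1, 0], [-sin_b, 0, cos_b]])
    Rz = np.array([[cos_g, -sin_g, 0], [sin_g, cos_g, 0], [0, 0, 1]])
    R = Rz @ Ry @ Rx
    return R, t, p
```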
@@ -400,12 +401,12 @@ and $(c_0, c_1, f)$ are the camera intrinsics.
 For now, the depth map is always assumed to come from ground truth.

 Given $k$ detections with predicted motions as above, we transform all points within the bounding
-box of a detected object according to the predicted motion of the object.
+box and mask of a detected object according to the predicted motion of the object.

-We first define the \emph{full image} mask $M_k$ for object k,
+For this, we first define the \emph{full image} mask $M_k$ for object k,
 which can be computed from the predicted box mask $m_k$ (for the predicted class) by bilinearly resizing
 it to the width and height of the predicted bounding box and then copying the values
-of the resized mask into a full resolution mask initialized with zeros,
+of the resized mask into a full (image) resolution mask initialized with zeros,
 starting at the top-left coordinate of the predicted bounding box.
 Again we binarize masks at a threshold of $0.5$.

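A minimal sketch of the full-image mask construction described in the hunk above, assuming OpenCV is available for the bilinear resize and ignoring boxes that extend past the image border:

```python
import numpy as np
import cv2  # used only for bilinear resizing

def full_image_mask(box_mask, box, image_height, image_width, threshold=0.5):
    """Paste a predicted box mask m_k into a full-resolution mask M_k.

    box_mask: (m, m) float array of mask probabilities for the predicted class.
    box: (x0, y0, x1, y1) predicted bounding box in pixel coordinates.
    """
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    w, h = max(x1 - x0, 1), max(y1 - y0, 1)
    resized = cv2.resize(box_mask.astype(np.float32), (w, h),
                         interpolation=cv2.INTER_LINEAR)  # bilinear resize to box size
    M = np.zeros((image_height, image_width), dtype=np.uint8)
    # paste at the top-left coordinate of the box (boundary clipping omitted for brevity)
    M[y0:y0 + h, x0:x0 + w] = (resized >= threshold).astype(np.uint8)
    return M
```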
@@ -8,7 +8,7 @@ The optical flow
 $\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$
 maps pixel coordinates in the first frame $I_t$ to pixel coordinates of the
 visually corresponding pixel in the second frame $I_{t+1}$,
-and can be interpreted as the apparent movement of brightness patterns between the two frames.
+and can be interpreted as the (apparent) movement of brightness patterns between the two frames.
 Optical flow can be regarded as two-dimensional motion estimation.

 Scene flow is the generalization of optical flow to three-dimensional space and additionally
@@ -64,7 +64,7 @@ learns a spatially compressed, wide (in the number of channels) representation o
 and a fully-connected prediction network on top of the encoder.

 The compressed representations learned by CNNs of these categories do not, however, allow
-for prediction of high-resolution output, as spatial detail is lost through sequential applications
+for prediction of high-resolution output, as spatial detail is lost through sequential application
 of pooling or strides.
 Thus, networks for dense, high-resolution, prediction introduce a convolutional decoder on top of the representation encoder,
 performing upsampling of the compressed features and resulting in a encoder-decoder pyramid.
@@ -87,7 +87,7 @@ Recently, other, similarly generic,
 encoder-decoder CNNs have been applied to optical flow prediction as well \cite{DenseNetDenseFlow}.

 \subsection{SfM-Net}
-Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture we described
+Table \ref{table:sfmnet} shows the SfM-Net architecture \cite{SfmNet} we described
 in the introduction.
 Motions and full-image masks for a fixed number N$_{motions}$ of independent objects
 are predicted in addition to a depth map, and a unsupervised re-projection loss based on
@@ -237,7 +237,7 @@ In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
 input image resolution. In FlowNetS \cite{FlowNet}, their bottleneck stride is 64.
 For accurately estimating motions corresponding to larger pixel displacements, a larger
 stride may be important.
-Thus, we add a additional C$_6$ block to be used in the Motion R-CNN ResNet variants
+Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
 to increase the bottleneck stride to 64, following FlowNetS.

 \subsection{Region-based CNNs}
@@ -246,14 +246,14 @@ We now give an overview of region-based convolutional networks, which are curren
 most popular deep networks for object detection, and have recently also been applied to instance segmentation.

 \paragraph{R-CNN}
-Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN
+The very first region-based convolutional networks (R-CNNs) \cite{RCNN} used a non-learned algorithm external to a standard encoder CNN
 for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
 For each of the region proposals, the input image is cropped using the region bounding box and the crop is
 passed through the CNN, which performs classification of the object (or non-object, if the region shows background).

 \paragraph{Fast R-CNN}
 The original R-CNN involves computing one forward pass of the CNN for each of the region proposals,
-which is costly, as there is generally a large number of proposals.
+which is costly, as there generally is a large number of proposals.
 Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only a single forward pass with the whole image
 as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
 Then, fixed size (H $\times$ W) feature maps are extracted from the compressed feature map of the image,
@@ -273,31 +273,31 @@ After streamlining the CNN components, Fast R-CNN is limited by the speed of the
 algorithm, which has to be run prior to the network passes and makes up a large portion of the total
 processing time.
 The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
-classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN
+classification into a single deep network, leading to faster training and test-time processing when compared to Fast R-CNN
 and again, improved accuracy.
 This unified network operates in two stages.
 In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
 which is a deep feature encoder CNN with the original image as input.
-Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
+Next, the output features from the backbone are passed into a small, fully-convolutional \emph{Region Proposal Network} network (RPN), which
 predicts objectness scores and regresses bounding boxes at each of its output positions.
-At any of the $h \times w$ output positions of the RPN head,
-$\text{N}_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
+At any of the $h \times w$ output positions of the RPN,
+$\text{N}_a$ bounding boxes with their \emph{objectness} scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
 aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total.
 In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding
-to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios,
+to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels, and 3 aspect ratios,
 $\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
 with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).

 For each RPN prediction at a given position, the objectness score tells us how likely it is to correspond to a detection.
 The region proposals can then be obtained as the N highest scoring RPN predictions.

-Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
+Then, the \emph{second stage} corresponds to the original Fast R-CNN head, performing classification
 and bounding box refinement for each of the region proposals, which are now obtained
 from the RPN instead of being pre-computed by an external algorithm.
 As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals,
 and the refined bounding boxes are predicted separately for each object class.

-Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
+Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet architecture
 (for Faster R-CNN, the mask head is ignored).

 {
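The anchor grid described in the hunk above (N_a = 9 anchors per RPN output position, 3 areas and 3 aspect ratios, stride 16) can be enumerated as in the following NumPy sketch; placing anchor centers at the cell centers of the stride-16 grid is an assumption of this example.

```python
import numpy as np

def generate_anchors(h, w, stride=16,
                     areas=(128 ** 2, 256 ** 2, 512 ** 2),
                     ratios=(0.5, 1.0, 2.0)):
    """Enumerate the N_a * h * w reference anchors of the RPN.

    Returns an array of shape (h * w * N_a, 4) with boxes given as
    (x_center, y_center, width, height) in input-image pixels.
    """
    base = []
    for area in areas:
        for ratio in ratios:  # ratio = height / width
            aw = np.sqrt(area / ratio)
            ah = aw * ratio
            base.append((aw, ah))
    base = np.array(base)                      # (N_a, 2) anchor sizes
    ys, xs = np.mgrid[0:h, 0:w]
    centers = np.stack([(xs + 0.5) * stride,   # anchor centers on the
                        (ys + 0.5) * stride],  # stride-16 output grid
                       axis=-1).reshape(-1, 1, 2)
    sizes = np.broadcast_to(base, (h * w, len(base), 2))
    anchors = np.concatenate([np.broadcast_to(centers, sizes.shape), sizes], axis=-1)
    return anchors.reshape(-1, 4)
```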
@@ -330,7 +330,7 @@ ave & average pool & N$_{RoI}$ $\times$ 2048 \\
 & From ave: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
 boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
 & From ave: fully connected, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
-classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
+classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls} + 1$ \\
 \midrule
 \multicolumn{3}{c}{\textbf{RoI Head: Masks}}\\
 \midrule
@@ -352,21 +352,21 @@ whereas Faster R-CNN uses RoI pooling.

 \paragraph{Mask R-CNN}
 Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
-However, it can be helpful to know class and object (instance) membership of all individual pixels,
+However, it can be helpful to know class and object (instance) membership of individual pixels,
 which generally involves computing a binary image mask for each object instance specifying which pixels belong
 to that object. This problem is called \emph{instance segmentation}.
-Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
+Mask R-CNN \cite{MaskRCNN} extends Faster R-CNN to instance segmentation by predicting
 fixed resolution instance masks within the bounding boxes of each detected object,
 which are, at test-time, bilinearly resized to fit inside the respective bounding boxes.
 For this, Mask R-CNN simply extends the Faster R-CNN head with multiple convolutions, which
 compute a pixel-precise binary mask for each instance.
-Note that the per-class masks logits are put through a sigmoid layer, and thus there is no
+Note that the per-class masks \emph{logits} (raw network outputs) are put through a sigmoid layer, and thus there is no
 comptetition between classes in the mask prediction branch.

 Additionally, an important technical aspect of Mask R-CNN is the replacement of RoI pooling with
 bilinear sampling for extracting the RoI features, which is much more precise.
-In the original RoI pooling from Fast R-CNN, the bins for max-pooling are not aligned with the actual pixel
-boundary of the bounding box, and thus some detail is lost.
+In the original RoI pooling adopted from Fast R-CNN, the bins for max-pooling are not aligned with the actual pixel
+boundaries of the bounding boxes, and thus some detail is lost.

 The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.

@@ -408,7 +408,7 @@ F$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2
 & From F$_1$: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4 \\
 boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
 & From F$_1$: fully connected, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
-classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
+classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls} + 1$ \\
 \midrule
 \multicolumn{3}{c}{\textbf{RoI Head: Masks}}\\
 \midrule
@@ -429,12 +429,12 @@ block (see Figure \ref{figure:fpn_block}).

 \paragraph{Feature Pyramid Networks}
 In Faster R-CNN, a single feature map is used as the source of all RoI features during RoI extraction, independent
-of the size of the bounding box of each RoI.
+of the size of the bounding box of any specific RoI.
 However, for small objects, the C$_4$ (see Table \ref{table:resnet}) features
 might have lost too much spatial information to allow properly predicting the exact bounding
 box and a high resolution mask.
 As a solution to this, the Feature Pyramid Network (FPN) \cite{FPN} enables features
-of an appropriate scale to be used for RoI extraction, depending of the size of the bounding box of an RoI.
+of an appropriate scale to be used for RoI extraction, depending on the size of the bounding box of the RoI.
 For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
 encoder by combining bilinearly upsampled feature maps coming from the bottleneck
 with lateral skip connections from the encoder (Figure~\ref{figure:fpn_block}).
@@ -454,17 +454,16 @@ as the RPN heads themselves correspond to different scales.
 Now, in the RPN, higher resolution feature maps can be used for regressing smaller
 bounding boxes. For example, boxes of area close to $32^2$ are predicted using P$_2$,
 which has a stride of $4$ with respect to the input image.
-Most importantly, the RoI features can now be extracted at the pyramid level P$_j$ appropriate for a
-RoI bounding box with size $h \times w$,
+Most importantly, the RoI features can now be extracted from the pyramid level P$_j$ appropriate for a
+RoI bounding box with size $h \times w$, where
 \begin{equation}
 j = 2 + j_a,
 \end{equation}
-where
 \begin{equation}
 j_a = \mathrm{clip}\left(\left[\log_2\left(\frac{\sqrt{w \cdot h}}{s_0}\right)\right], 0, 4\right)
 \label{eq:level_assignment}
 \end{equation}
-is the index (from small anchor to large anchor) of the corresponding anchor box and
+is the index (from small anchor to large anchor) of the corresponding anchor box, and
 \begin{equation}
 s_0 = 256 \cdot 0.125
 \label{eq:level_assignment}
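A small sketch of the level assignment from the hunk above, with s_0 = 256 * 0.125 = 32 and the bracket in the equation interpreted as rounding to the nearest integer (an assumption of this sketch):

```python
import numpy as np

def fpn_level(w, h, s0=256 * 0.125, num_levels=5):
    """Assign an RoI of size w x h to a pyramid level P_j.

    Follows j = 2 + clip([log2(sqrt(w * h) / s0)], 0, 4); the bracket is
    interpreted here as rounding to the nearest integer.
    """
    j_a = int(np.clip(np.rint(np.log2(np.sqrt(w * h) / s0)), 0, num_levels - 1))
    return 2 + j_a

# e.g. a 32 x 32 box maps to P_2, a 512 x 512 box to P_6
assert fpn_level(32, 32) == 2
assert fpn_level(512, 512) == 6
```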
@@ -513,12 +512,12 @@ $c$ is the output vector from a softmax layer,
 $c_{c^*} \in (0,1)$ is the output probability for class $c^*$,
 and $\text{C}$ is the number of classes.
 Note that for the object category classifier, $\text{C} = \text{N}_{cls} + 1$,
-as in $\text{N}_{cls}$, we do not count the background class.
+as $\text{N}_{cls}$ does not include the background class.
 Finally, for multi-label classification, we define the binary (sigmoid) cross-entropy loss,
 \begin{equation}
 \ell_{cls*}(y, y^*) = -y^* \cdot \log(y) - (1 - y^*) \cdot \log(1 - y),
 \end{equation}
-where $y^* \in \{0,1\}$ is a label and $y \in (0,1)$ is the output from a sigmoid layer.
+where $y^* \in \{0,1\}$ is a ground truth label and $y \in (0,1)$ is the output of a sigmoid layer.
 Note that for the mask loss that will be introduced below, $\ell_{cls*}$ is
 the sum of the $\ell_{cls*}$-losses for all 2D positions over the mask.

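The binary cross-entropy and the summed per-pixel mask loss can be written out directly from the definitions in the hunk above (the epsilon clamp is a numerical-stability assumption of this sketch):

```python
import numpy as np

def binary_cross_entropy(y, y_star, eps=1e-7):
    """Per-element sigmoid cross-entropy l_cls*(y, y*); y in (0,1), y* in {0,1}."""
    y = np.clip(y, eps, 1.0 - eps)
    return -y_star * np.log(y) - (1.0 - y_star) * np.log(1.0 - y)

def mask_loss(pred_mask, target_mask):
    """Mask loss as described above: the sum of the per-pixel losses over the m x m mask."""
    return binary_cross_entropy(pred_mask, target_mask).sum()
```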
@@ -549,7 +548,7 @@ b_h^* = \log \left( \frac{h^*}{h_r} \right),
 which represents the regression target for the bounding box
 outputs of the network.

-Thus, for each bounding box prediction, the network predicts the box encoding $b_e$,
+Thus, for bounding box regression, the network predicts the box encoding $b_e$,
 \begin{equation}
 b_e = (b_x, b_y, b_w, b_h),
 \end{equation}
@@ -565,7 +564,7 @@ b_w = \log \left( \frac{w}{w_r} \right),
 b_h = \log \left( \frac{h}{h_r} \right).
 \end{equation*}

-At test time, to get from a predicted box encoding $b_e$ to the predicted bounding box $b$,
+At test time, to convert from a predicted box encoding $b_e$ to the predicted bounding box $b$,
 the definitions above can be inverted,
 \begin{equation}
 b = (x, y, w, h),
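Putting the encoding and its inversion together as a sketch: the log terms for width and height follow the equations in the hunks above, while the center offsets are assumed to take the usual Faster R-CNN form (x - x_r)/w_r and (y - y_r)/h_r, which the surrounding text does not spell out here.

```python
import numpy as np

def encode_box(box, ref):
    """Encode a box (x, y, w, h) relative to a reference box (x_r, y_r, w_r, h_r)."""
    x, y, w, h = box
    xr, yr, wr, hr = ref
    return np.array([(x - xr) / wr, (y - yr) / hr, np.log(w / wr), np.log(h / hr)])

def decode_box(b_e, ref):
    """Invert the encoding to recover the predicted box b = (x, y, w, h)."""
    bx, by, bw, bh = b_e
    xr, yr, wr, hr = ref
    return np.array([bx * wr + xr, by * hr + yr, wr * np.exp(bw), hr * np.exp(bh)])
```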
@@ -631,14 +630,14 @@ the predicted refined RoI box encoding for class $c_i^*$.
 Additionally, for any foreground RoI, let $m_i$ be the predicted $m \times m$ mask for class $c_i^*$
 and $m_i^*$ the $m \times m$ mask target with values in $\{0,1\}$, where the mask target is cropped and resized from
 the binary ground truth mask using the RPN proposal bounding box.
-In our implementation, we use nearest neighbour resizing for resizing the mask
+In our implementation, we use nearest neighbour resizing for resizing the cropped mask
 targets.
 Note that values in $m_i$ and $c_i$ are already normalized probabilities from
 sigmoid and softmax layers, respectively.

 Then, the ROI loss is computed as
 \begin{equation}
-L_{RoI} = L_{cls} + L_{box} + L_{mask}
+L_{RoI} = L_{cls} + L_{box} + L_{mask},
 \end{equation}
 where
 \begin{equation}
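A sketch of how a mask target could be cropped and resized with nearest-neighbour sampling as described in the hunk above; the target size m = 28 and the exact placement of the sampling grid are assumptions of this example, not values taken from the thesis.

```python
import numpy as np

def mask_target(gt_mask, proposal_box, m=28):
    """Crop the binary ground truth mask with the RPN proposal box and resize
    it to m x m with nearest neighbour sampling.

    gt_mask: (H, W) array with values in {0, 1}; proposal_box: (x0, y0, x1, y1).
    """
    x0, y0, x1, y1 = proposal_box
    # nearest neighbour: sample the crop at m x m evenly spaced positions
    ys = np.clip(np.round(np.linspace(y0, y1, m)).astype(int), 0, gt_mask.shape[0] - 1)
    xs = np.clip(np.round(np.linspace(x0, x1, m)).astype(int), 0, gt_mask.shape[1] - 1)
    return gt_mask[np.ix_(ys, xs)]
```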
@@ -669,7 +668,7 @@ losses are only enabled for the foreground RoIs. Note that the bounding box and
 for all classes other than $c_i^*$ are not penalized.

 \paragraph{Inference}
-During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring region proposals
+During inference, the 300 (ResNet) or 1000 (ResNet-FPN) highest scoring region proposals
 from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
 and passed through the RoI bounding box refinement and classification heads
 (but not through the mask head).
@@ -4,7 +4,7 @@ We introduced Motion R-CNN, which enables 3D object motion estimation in paralle
 to instance segmentation in the framework of region-based convolutional networks,
 given an input of two consecutive frames (and XYZ point coordinates) from a monocular camera.
 In addition to instance motions, our network estimates the 3D ego-motion of the camera.
-We combine all these estimates to yield a dense optical flow output from our
+We combine all these estimates to obtain a dense optical flow output from our
 end-to-end deep network.
 Our model is trained on the synthetic Virtual KITTI dataset, which provides
 us with bounding box, instance mask, depth, and 3D motion ground truth,
@@ -75,8 +75,8 @@ geometry for making a more reliable depth estimate, at least when the camera
 is moving. We could also extend our method to stereo input data easily by concatenating
 all of the frames into the input image.
 In case we choose the option of integrating the depth prediction directly into
-the R-CNN,
-this would however require using a different dataset for training it, as Virtual KITTI does not
+the current backbone,
+this would however require using a different dataset for training Motion R-CNN, as Virtual KITTI does not
 provide stereo images.
 If we would use a specialized depth network, we could use stereo data
 for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset,
@@ -149,8 +149,9 @@ RoI instance motion loss. Still, to use all available information from
 ground truth optical flow and obtain more accurate supervision,
 it would likely be beneficial to add a global, flow-based camera motion loss
 independent of the RoI supervision.
-To do this, one could use a re-projection loss conceptually identical to the one
-for supervising instance motions with ground truth flow. However, to adjust for the
+To do this, one could use a re-projection loss conceptually similar to the one
+for supervising instance motions with ground truth flow,
+but computed on the full image instead of for individual RoIs. However, to adjust for the
 fact that the camera motion can only be accurately supervised with flow at positions where
 no object motion accurs, this loss would have to be masked with the ground truth
 object masks. Again, we could use this flow-based loss in an unsupervised way,
@@ -27,7 +27,7 @@ from different viewing angles, resulting in a total of 10 variants per sequence.
 In addition to the RGB frames, a variety of ground truth is supplied.
 For each frame, we are given a dense depth and optical flow map and the camera
 extrinsics matrix. There are two annotated object classes, cars, and vans (N$_{cls}$ = 2).
-For all cars and vans in the each frame, we are given 2D and 3D object bounding
+For all cars and vans in each frame, we are given 2D and 3D object bounding
 boxes, instance masks, 3D poses, and various other labels.

 This makes the Virtual KITTI dataset ideally suited for developing our joint
@@ -85,7 +85,7 @@ $\{R_k^*, t_k^*\} \in \mathbf{SE}(3)$ as
 R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^k \cdot \mathrm{inv}(R_t^k),
 \end{equation}
 \begin{equation}
-t_k^* = t_{t+1}^{k} - R_k^* \cdot t_t.
+t_k^* = t_{t+1}^{k} - R_k^* \cdot t_t^k.
 \end{equation}

 As for the camera, we define $o_k^* \in \{ 0, 1 \}$,
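The two ground-truth-motion equations in the hunk above translate directly into a short sketch; argument names are placeholders, and all rotations are taken to be 3 x 3 matrices with translations as 3-vectors.

```python
import numpy as np

def gt_object_motion(R_cam_star, R_t, t_t, R_t1, t_t1):
    """Ground truth object motion {R_k*, t_k*} from the object poses at t and t+1,
    following R_k* = inv(R_cam*) R_{t+1}^k inv(R_t^k) and t_k* = t_{t+1}^k - R_k* t_t^k."""
    R_k_star = np.linalg.inv(R_cam_star) @ R_t1 @ np.linalg.inv(R_t)
    t_k_star = t_t1 - R_k_star @ t_t
    return R_k_star, t_k_star
```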
@@ -112,11 +112,11 @@ measures the mean angle of the error rotation between predicted and ground truth
 \begin{equation}
 E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R_k) \cdot (t_k^* - t_k) \right\rVert_2,
 \end{equation}
-is the mean euclidean norm between predicted and ground truth translation, and
+is the mean Euclidean distance between predicted and ground truth translation, and
 \begin{equation}
 E_{p} = \frac{1}{N}\sum_k \left\lVert p_k^* - p_k \right\rVert_2
 \end{equation}
-is the mean euclidean norm between predicted and ground truth pivot.
+is the mean Euclidean distance between predicted and ground truth pivot.

 Moreover, we define precision and recall measures for the detection of moving objects,
 where
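The translation and pivot error metrics from the hunk above, written out for a batch of N detections as a NumPy sketch:

```python
import numpy as np

def motion_errors(R_pred, t_pred, t_gt, p_pred, p_gt):
    """Translation and pivot error metrics E_t and E_p as defined above.

    R_pred: (N, 3, 3) predicted rotations; t_pred, t_gt, p_pred, p_gt: (N, 3).
    """
    R_inv = np.linalg.inv(R_pred)                          # inv(R_k) for each k
    diffs = np.einsum('nij,nj->ni', R_inv, t_gt - t_pred)  # inv(R_k) (t_k* - t_k)
    E_t = np.linalg.norm(diffs, axis=1).mean()
    E_p = np.linalg.norm(p_gt - p_pred, axis=1).mean()
    return E_t, E_p
```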
@@ -151,8 +151,8 @@ Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{Mas
 We train for a total of 192K iterations on the Virtual KITTI training set.
 For this, we use a single Titan X (Pascal) GPU and a batch size of 1,
 which results in approximately one day of training for a complete run.
-As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
-momentum of $0.9$.
+As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with
+momentum set to $0.9$.
 As learning rate we use $0.25 \cdot 10^{-2}$ for the
 first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.

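The resulting step schedule from the hunk above, as a small helper (the framework-specific optimizer setup is left out):

```python
def learning_rate(step):
    """Step learning-rate schedule described above:
    0.25e-2 for the first 144K iterations, 0.25e-3 afterwards (192K total)."""
    return 0.25e-2 if step < 144_000 else 0.25e-3

# SGD with momentum 0.9 would then consume learning_rate(step) at each iteration.
```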
@@ -255,7 +255,7 @@ the average ground truth camera translation. The camera rotation angle error
 is still relatively high, compared to the small average ground truth camera rotation.
 Although both variants use the exact same network for predicting the camera motion,
 the FPN variant performs worse here, with the error in rotation angle twice as high.
-One possible explanations that should be investigated in futher work is
+One possible explanations that should be investigated in future work is
 that in the FPN variant, all blocks in the backbone are shared between the camera
 motion branch and the feature pyramid. In the variant without FPN, the C$5$ and
 C$6$ blocks are only used in the camera branch, and thus only experience weight
@@ -265,8 +265,9 @@ As a remedy, increasing the loss weighting of the camera motion loss may be
 helpful.

 \paragraph{Instance motion}
-The object pivots are estimated with relatively (given that the scenes are in a realistic scale)
-high accuracy in both variants, although the FPN variant is significantly more
+The object pivots are estimated with relatively
+high accuracy in both variants (given that the scenes are in a realistic scale),
+although the FPN variant is significantly more
 accurate, which we ascribe to the higher resolution features used in this variant.

 The predicted 3D object translations and rotations still have a relatively high
@@ -113,7 +113,7 @@ manageable pieces.
 \centering
 \includegraphics[width=\textwidth]{figures/net_intro}
 \caption{
-Overview of our network based on Mask R-CNN. For each region of interest (RoI), we predict the 3D instance motion
+Overview of our network based on Mask R-CNN \cite{MaskRCNN}. For each region of interest (RoI), we predict the 3D instance motion
 in parallel to the class, bounding box and mask. Additionally, we branch off a
 small network from the bottleneck for predicting the 3D camera ego-motion.
 Novel components in addition to Mask R-CNN are shown in red.
@@ -124,8 +124,8 @@ Novel components in addition to Mask R-CNN are shown in red.
 \subsection{Related work}

 In the following, we will refer to systems which use deep networks for all
-optimization and do not perform time-critical side computation (e.g. numerical optimization)
-at inference time as \emph{end-to-end} deep learning systems.
+optimization and do not perform time-critical side computation
+at inference time (e.g. numerical optimization) as \emph{end-to-end} deep learning systems.

 \paragraph{Deep networks in optical flow estimation}

@@ -142,8 +142,9 @@ where semantics become very important.
 Extensions of these approaches to scene flow estimate dense flow and dense depth
 with similarly generic networks \cite{SceneFlowDataset} and similar limitations.

-Other works \cite{ESI, JOF, FlowLayers, MRFlow} make use of semantic segmentation to structure
-the optical flow estimation problem and introduce reasoning at the object level,
+Other works make use of semantic segmentation to structure
+the optical flow estimation problem and introduce reasoning at the object level
+\cite{ESI, JOF, FlowLayers, MRFlow},
 but still require expensive energy minimization for each
 new input, as CNNs are only used for some of the components and numerical
 optimization is central to their inference.
@@ -168,20 +169,20 @@ without the use of (deep) learning.

 In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
 a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
-with depth obtained from a non-learned stereo algorithm, to be used as pre-computed
+with depth obtained from a non-learned stereo algorithm to be used as pre-computed
 inputs to a slanted plane scene flow model based on \cite{KITTI2015}.
 Most likely due to their use of deep learning for instance segmentation and for some other components, this
-approach outperforms the previous related scene flow methods on public benchmarks.
+approach outperforms the previous related scene flow methods on relevant public benchmarks \cite{KITTI2012, KITTI2015}.
 Still, the method uses a energy-minimization formulation for the scene flow estimation itself
 and takes minutes to make a prediction.

 Interestingly, the slanted plane methods achieve the current state-of-the-art
 in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015},
-outperforming end-to-end deep networks like \cite{SceneFlowDataset, FlowNet2}.
+outperforming end-to-end deep networks like \cite{SceneFlowDataset, FlowNet, FlowNet2}.
 However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts,
 generally taking a fraction of a second instead of minutes for prediction and can often be made to run in realtime.
 These concerns restrict the applicability of the current slanted plane models in practical settings,
-which often require estimations to be done in realtime (or close to realtime) and for which an end-to-end
+which often require estimations to be done in real time (or close to real time) and for which an end-to-end
 approach based on learning would be preferable.

 Also, by analogy, in other contexts, the move towards end-to-end deep learning has often lead