commit 48ed4b4696 (parent 9215f296a7): final
@@ -23,7 +23,7 @@ thus combining the representation learning benefits and speed of end-to-end deep
with a physically plausible scene model inspired by slanted plane energy-minimization approaches to
scene flow.

-Building on recent advances in region-based convolutional networks (R-CNNs),
+Building on recent advances in region-based convolutional neural networks (R-CNNs),
we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D camera,
our resulting end-to-end deep network detects objects with precise per-pixel
@@ -54,8 +54,8 @@ Objekte respektiert, und kombinieren damit die Repräsentationskraft und Geschwi
of end-to-end deep networks with a physically plausible scene model
that is inspired by slanted-plane energy-minimization methods for scene flow.

-Here we build on recent advances in region-based convolutional
-networks (R-CNNs) and integrate motion estimation with instance segmentation.
+Here we build on recent advances in region-based convolutional
+neural networks (R-CNNs) and integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGB-D
camera as input, our end-to-end deep network detects objects with pixel-accurate object masks
and estimates the 3D motion of each detected object between the frames.
approach.tex | 19
@@ -116,7 +116,7 @@ additionally dropout with $p = 0.5$ after all fully-connected hidden layers.
}

\paragraph{Motion R-CNN backbone}
-Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backbone network to compute feature maps from input imagery.
+Like Faster R-CNN and Mask R-CNN, we use a ResNet variant \cite{ResNet} as backbone network to compute feature maps from input imagery.

Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
@@ -125,15 +125,16 @@ Additionally, we also experiment with concatenating the camera space XYZ coordin
XYZ$_t$ and XYZ$_{t+1}$, into the input as well.
We do not introduce a separate network for computing region proposals and use our modified backbone network
as both RPN and for extracting the RoI features.
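To make the stacked input concrete, here is a minimal sketch of concatenating the two frames (and optionally the XYZ coordinate maps) along the channel dimension before the backbone; the tensor layout and the function name are illustrative assumptions, not the thesis implementation.

import torch

def build_backbone_input(img_t, img_tp1, xyz_t=None, xyz_tp1=None):
    """Stack both frames (and optionally the camera-space XYZ maps) along the
    channel axis so the backbone can learn image matching features.
    All inputs are assumed to be N x C x H x W tensors of equal spatial size."""
    tensors = [img_t, img_tp1]                  # 3 + 3 channels
    if xyz_t is not None and xyz_tp1 is not None:
        tensors += [xyz_t, xyz_tp1]             # + 3 + 3 channels
    return torch.cat(tensors, dim=1)            # N x 6 (or 12) x H x W

The first convolution of the backbone then has to accept 6 (or 12) input channels instead of 3.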
Technically, our feature encoder network will have to learn image matching representations similar to
-that learned by the FlowNet encoder, but the output will be computed in the
-object-centric framework of a region based convolutional network head with a 3D parametrization.
+those learned by the FlowNet encoder, but the output will be computed in the
+object-centric framework of a region-based convolutional network head with a 3D parametrization.
Thus, in contrast to the dense FlowNet decoder, the estimated dense image matching information
-from the encoder is integrated for specific objects via RoI extraction and
+from the encoder is integrated for specific objects via RoI extraction and subsequently
processed by the RoI head for each object.

\paragraph{Per-RoI motion prediction}
-We use a rigid 3D motion parametrization similar to the one used in SE3-Nets and SfM-Net \cite{SE3Nets, SfmNet}.
+We use a 3D rigid motion parametrization similar to the one used in SE3-Nets and SfM-Net \cite{SE3Nets, SfmNet}.
For the $k$-th object proposal, we predict the rigid transformation $\{R_k, t_k\}\in \mathbf{SE}(3)$
\footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
and translations: $\{R, t|R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
@@ -178,7 +179,7 @@ We then extend the Mask R-CNN head by adding a small fully-connected network for
prediction in addition to the fully-connected layers for
refined boxes and classes and the convolutional network for the masks.
Like for refined boxes and masks, we make one separate motion prediction for each class.
-Each instance motion is predicted as a set of nine scalar parameters,
+Each instance motion is predicted as a set of nine values,
$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_k$ and $p_k$,
where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
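As an illustration, the nine predicted values can be mapped back to a rigid motion as sketched below; the per-axis Euler decomposition, the X-Y-Z composition order, and the non-negative cosines (valid under the small-motion assumption) are assumptions made for this sketch only.

import numpy as np

def motion_from_prediction(sines, t_k, p_k):
    """Recover R_k from the three predicted (clipped) sines of alpha, beta, gamma,
    and return it together with the predicted translation t_k and pivot p_k."""
    sa, sb, sc = np.clip(sines, -1.0, 1.0)
    ca, cb, cc = np.sqrt(1 - sa**2), np.sqrt(1 - sb**2), np.sqrt(1 - sc**2)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cc, -sc, 0], [sc, cc, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx, np.asarray(t_k), np.asarray(p_k)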
@@ -400,12 +401,12 @@ and $(c_0, c_1, f)$ are the camera intrinsics.
For now, the depth map is always assumed to come from ground truth.

Given $k$ detections with predicted motions as above, we transform all points within the bounding
-box of a detected object according to the predicted motion of the object.
+box and mask of a detected object according to the predicted motion of the object.

-We first define the \emph{full image} mask $M_k$ for object k,
+For this, we first define the \emph{full image} mask $M_k$ for object k,
which can be computed from the predicted box mask $m_k$ (for the predicted class) by bilinearly resizing
it to the width and height of the predicted bounding box and then copying the values
-of the resized mask into a full resolution mask initialized with zeros,
+of the resized mask into a full (image) resolution mask initialized with zeros,
starting at the top-left coordinate of the predicted bounding box.
Again we binarize masks at a threshold of $0.5$.
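A minimal sketch of the full-image mask computation described above; the (x1, y1, x2, y2) box convention, the use of OpenCV for the bilinear resizing, and the assumption that boxes are already clipped to the image are illustrative choices rather than the thesis implementation.

import numpy as np
import cv2

def full_image_mask(m_k, box, image_hw, thresh=0.5):
    """Resize the predicted box mask m_k to the box size, paste it into a
    zero-initialized full-resolution mask at the box location, binarize at 0.5."""
    H, W = image_hw
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    bw, bh = max(x2 - x1, 1), max(y2 - y1, 1)
    resized = cv2.resize(m_k.astype(np.float32), (bw, bh),
                         interpolation=cv2.INTER_LINEAR)
    M_k = np.zeros((H, W), dtype=np.float32)
    M_k[y1:y1 + bh, x1:x1 + bw] = resized       # copy from the top-left corner
    return (M_k >= thresh).astype(np.uint8)

The transformation of the 3D points belonging to object k can then be restricted to the pixels where this full-image mask is 1.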
@@ -8,7 +8,7 @@ The optical flow
$\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$
maps pixel coordinates in the first frame $I_t$ to pixel coordinates of the
visually corresponding pixel in the second frame $I_{t+1}$,
-and can be interpreted as the apparent movement of brightness patterns between the two frames.
+and can be interpreted as the (apparent) movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
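The correspondence described here is the standard brightness-constancy relation (not spelled out in the excerpt), which can be written as
\begin{equation}
I_{t+1}\bigl(x + u(x, y),\, y + v(x, y)\bigr) \approx I_t(x, y).
\end{equation}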
Scene flow is the generalization of optical flow to three-dimensional space and additionally
@@ -64,7 +64,7 @@ learns a spatially compressed, wide (in the number of channels) representation o
and a fully-connected prediction network on top of the encoder.

The compressed representations learned by CNNs of these categories do not, however, allow
-for prediction of high-resolution output, as spatial detail is lost through sequential applications
+for prediction of high-resolution output, as spatial detail is lost through sequential application
of pooling or strides.
Thus, networks for dense, high-resolution prediction introduce a convolutional decoder on top of the representation encoder,
performing upsampling of the compressed features and resulting in an encoder-decoder pyramid.
@@ -87,7 +87,7 @@ Recently, other, similarly generic,
encoder-decoder CNNs have been applied to optical flow prediction as well \cite{DenseNetDenseFlow}.

\subsection{SfM-Net}
-Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture we described
+Table \ref{table:sfmnet} shows the SfM-Net architecture \cite{SfmNet} we described
in the introduction.
Motions and full-image masks for a fixed number N$_{motions}$ of independent objects
are predicted in addition to a depth map, and an unsupervised re-projection loss based on
@@ -237,7 +237,7 @@ In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
-Thus, we add a additional C$_6$ block to be used in the Motion R-CNN ResNet variants
+Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
to increase the bottleneck stride to 64, following FlowNetS.

\subsection{Region-based CNNs}
@@ -246,14 +246,14 @@ We now give an overview of region-based convolutional networks, which are curren
most popular deep networks for object detection, and have recently also been applied to instance segmentation.

\paragraph{R-CNN}
-Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN
+The very first region-based convolutional networks (R-CNNs) \cite{RCNN} used a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped using the region bounding box and the crop is
passed through the CNN, which performs classification of the object (or non-object, if the region shows background).

\paragraph{Fast R-CNN}
The original R-CNN involves computing one forward pass of the CNN for each of the region proposals,
-which is costly, as there is generally a large number of proposals.
+which is costly, as there generally is a large number of proposals.
Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only a single forward pass with the whole image
as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
Then, fixed size (H $\times$ W) feature maps are extracted from the compressed feature map of the image,
@@ -273,31 +273,31 @@ After streamlining the CNN components, Fast R-CNN is limited by the speed of the
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
-classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN
+classification into a single deep network, leading to faster training and test-time processing when compared to Fast R-CNN
and again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
-Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
+Next, the output features from the backbone are passed into a small, fully-convolutional \emph{Region Proposal Network} (RPN), which
predicts objectness scores and regresses bounding boxes at each of its output positions.
-At any of the $h \times w$ output positions of the RPN head,
-$\text{N}_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
+At any of the $h \times w$ output positions of the RPN,
+$\text{N}_a$ bounding boxes with their \emph{objectness} scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total.
In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding
-to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios,
+to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels, and 3 aspect ratios,
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).
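As a concrete illustration of the anchor set, the $\text{N}_a = 9$ reference boxes can be enumerated as below; the corner convention and the aspect-ratio definition (ratio = h/w) are assumptions, and the exact Faster R-CNN rounding is omitted.

import numpy as np

def reference_anchors(areas=(128**2, 256**2, 512**2), ratios=(0.5, 1.0, 2.0)):
    """N_a = 9 reference anchors (x1, y1, x2, y2) centred at the origin.
    Shifting them by the stride-16 grid positions of the h x w RPN output
    yields the N_a * h * w anchors in total."""
    anchors = []
    for area in areas:
        for ratio in ratios:                    # ratio = height / width
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)                    # shape (9, 4)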
For each RPN prediction at a given position, the objectness score tells us how likely it is to correspond to a detection.
The region proposals can then be obtained as the N highest scoring RPN predictions.

-Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
+Then, the \emph{second stage} corresponds to the original Fast R-CNN head, performing classification
and bounding box refinement for each of the region proposals, which are now obtained
from the RPN instead of being pre-computed by an external algorithm.
As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals,
and the refined bounding boxes are predicted separately for each object class.

-Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
+Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet architecture
(for Faster R-CNN, the mask head is ignored).

{
@@ -330,7 +330,7 @@ ave & average pool & N$_{RoI}$ $\times$ 2048 \\
& From ave: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
& From ave: fully connected, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
-classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
+classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls} + 1$ \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Masks}}\\
\midrule
@@ -352,21 +352,21 @@ whereas Faster R-CNN uses RoI pooling.

\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
-However, it can be helpful to know class and object (instance) membership of all individual pixels,
+However, it can be helpful to know class and object (instance) membership of individual pixels,
which generally involves computing a binary image mask for each object instance specifying which pixels belong
to that object. This problem is called \emph{instance segmentation}.
-Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
+Mask R-CNN \cite{MaskRCNN} extends Faster R-CNN to instance segmentation by predicting
fixed resolution instance masks within the bounding boxes of each detected object,
which are, at test-time, bilinearly resized to fit inside the respective bounding boxes.
For this, Mask R-CNN simply extends the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise binary mask for each instance.
-Note that the per-class masks logits are put through a sigmoid layer, and thus there is no
+Note that the per-class mask \emph{logits} (raw network outputs) are put through a sigmoid layer, and thus there is no
competition between classes in the mask prediction branch.

Additionally, an important technical aspect of Mask R-CNN is the replacement of RoI pooling with
bilinear sampling for extracting the RoI features, which is much more precise.
-In the original RoI pooling from Fast R-CNN, the bins for max-pooling are not aligned with the actual pixel
-boundary of the bounding box, and thus some detail is lost.
+In the original RoI pooling adopted from Fast R-CNN, the bins for max-pooling are not aligned with the actual pixel
+boundaries of the bounding boxes, and thus some detail is lost.

The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}.
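The difference between quantized RoI pooling and bilinear RoI extraction can be illustrated with the torchvision operators below; the shapes and the stride-16 spatial scale are illustrative, and the thesis implementation is not assumed to use these libraries.

import torch
from torchvision.ops import roi_align, roi_pool

features = torch.randn(1, 256, 50, 64)                   # stride-16 feature map
boxes = torch.tensor([[0, 48.0, 96.0, 304.0, 224.0]])    # (batch_idx, x1, y1, x2, y2) in image coordinates

pooled = roi_pool(features, boxes, output_size=(14, 14), spatial_scale=1.0 / 16)
aligned = roi_align(features, boxes, output_size=(14, 14), spatial_scale=1.0 / 16,
                    sampling_ratio=2)                     # bilinear sampling, no bin quantization
print(pooled.shape, aligned.shape)                        # both: (1, 256, 14, 14)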
@@ -408,7 +408,7 @@ F$_1$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2
& From F$_1$: fully connected, N$_{cls}$ $\cdot$ 4 & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4 \\
boxes & decode bounding boxes (Eq. \ref{eq:pred_bounding_box}) & N$_{RoI}$ $\times$ N$_{cls}$ $\cdot$ 4\\
& From F$_1$: fully connected, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
-classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\
+classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls} + 1$ \\
\midrule
\multicolumn{3}{c}{\textbf{RoI Head: Masks}}\\
\midrule
@@ -429,12 +429,12 @@ block (see Figure \ref{figure:fpn_block}).

\paragraph{Feature Pyramid Networks}
In Faster R-CNN, a single feature map is used as the source of all RoI features during RoI extraction, independent
-of the size of the bounding box of each RoI.
+of the size of the bounding box of any specific RoI.
However, for small objects, the C$_4$ (see Table \ref{table:resnet}) features
might have lost too much spatial information to allow properly predicting the exact bounding
box and a high resolution mask.
As a solution to this, the Feature Pyramid Network (FPN) \cite{FPN} enables features
-of an appropriate scale to be used for RoI extraction, depending of the size of the bounding box of an RoI.
+of an appropriate scale to be used for RoI extraction, depending on the size of the bounding box of the RoI.
For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet}
encoder by combining bilinearly upsampled feature maps coming from the bottleneck
with lateral skip connections from the encoder (Figure~\ref{figure:fpn_block}).
@@ -454,17 +454,16 @@ as the RPN heads themselves correspond to different scales.
Now, in the RPN, higher resolution feature maps can be used for regressing smaller
bounding boxes. For example, boxes of area close to $32^2$ are predicted using P$_2$,
which has a stride of $4$ with respect to the input image.
-Most importantly, the RoI features can now be extracted at the pyramid level P$_j$ appropriate for a
-RoI bounding box with size $h \times w$,
+Most importantly, the RoI features can now be extracted from the pyramid level P$_j$ appropriate for a
+RoI bounding box with size $h \times w$, where
\begin{equation}
j = 2 + j_a,
\end{equation}
-where
\begin{equation}
j_a = \mathrm{clip}\left(\left[\log_2\left(\frac{\sqrt{w \cdot h}}{s_0}\right)\right], 0, 4\right)
\label{eq:level_assignment}
\end{equation}
-is the index (from small anchor to large anchor) of the corresponding anchor box and
+is the index (from small anchor to large anchor) of the corresponding anchor box, and
\begin{equation}
s_0 = 256 \cdot 0.125
\label{eq:level_assignment}
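A small sketch of the level assignment, assuming the square bracket in Eq. \ref{eq:level_assignment} denotes rounding to the nearest integer.

import math

def fpn_level(w, h, s0=256 * 0.125):
    """Assign an RoI of size w x h to pyramid level P_j,
    j = 2 + clip([log2(sqrt(w * h) / s0)], 0, 4)."""
    j_a = int(round(math.log2(math.sqrt(w * h) / s0)))
    return 2 + max(0, min(4, j_a))

print(fpn_level(224, 224))   # sqrt(224 * 224) / 32 = 7, log2(7) ~ 2.8 -> j_a = 3 -> P_5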
@@ -513,12 +512,12 @@ $c$ is the output vector from a softmax layer,
$c_{c^*} \in (0,1)$ is the output probability for class $c^*$,
and $\text{C}$ is the number of classes.
Note that for the object category classifier, $\text{C} = \text{N}_{cls} + 1$,
-as in $\text{N}_{cls}$, we do not count the background class.
+as $\text{N}_{cls}$ does not include the background class.
Finally, for multi-label classification, we define the binary (sigmoid) cross-entropy loss,
\begin{equation}
\ell_{cls*}(y, y^*) = -y^* \cdot \log(y) - (1 - y^*) \cdot \log(1 - y),
\end{equation}
-where $y^* \in \{0,1\}$ is a label and $y \in (0,1)$ is the output from a sigmoid layer.
+where $y^* \in \{0,1\}$ is a ground truth label and $y \in (0,1)$ is the output of a sigmoid layer.
Note that for the mask loss that will be introduced below, $\ell_{cls*}$ is
the sum of the $\ell_{cls*}$-losses for all 2D positions over the mask.
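A direct transcription of the binary cross-entropy and of its summation over mask positions; pred_mask is assumed to hold sigmoid outputs in (0, 1) and gt_mask binary targets.

import numpy as np

def binary_ce(y, y_star, eps=1e-7):
    """Element-wise l_cls*(y, y*) for sigmoid outputs y and labels y*."""
    y = np.clip(y, eps, 1.0 - eps)
    return -y_star * np.log(y) - (1.0 - y_star) * np.log(1.0 - y)

def mask_loss_term(pred_mask, gt_mask):
    """For the mask loss, the per-position losses are summed over the m x m mask."""
    return binary_ce(pred_mask, gt_mask).sum()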
@@ -549,7 +548,7 @@ b_h^* = \log \left( \frac{h^*}{h_r} \right),
which represents the regression target for the bounding box
outputs of the network.

-Thus, for each bounding box prediction, the network predicts the box encoding $b_e$,
+Thus, for bounding box regression, the network predicts the box encoding $b_e$,
\begin{equation}
b_e = (b_x, b_y, b_w, b_h),
\end{equation}
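A sketch of the encoding and of its inversion at test time; only the width and height components are spelled out in the excerpt, so the centre offsets below follow the usual Faster R-CNN convention (b_x = (x - x_r) / w_r, b_y = (y - y_r) / h_r), which is an assumption here.

import math

def encode_box(box, ref):
    """Encode a box (x, y, w, h) relative to its reference box (x_r, y_r, w_r, h_r)."""
    x, y, w, h = box
    xr, yr, wr, hr = ref
    return ((x - xr) / wr, (y - yr) / hr, math.log(w / wr), math.log(h / hr))

def decode_box(enc, ref):
    """Invert the encoding to recover the predicted box b = (x, y, w, h)."""
    bx, by, bw, bh = enc
    xr, yr, wr, hr = ref
    return (bx * wr + xr, by * hr + yr, wr * math.exp(bw), hr * math.exp(bh))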
@@ -565,7 +564,7 @@ b_w = \log \left( \frac{w}{w_r} \right),
b_h = \log \left( \frac{h}{h_r} \right).
\end{equation*}

-At test time, to get from a predicted box encoding $b_e$ to the predicted bounding box $b$,
+At test time, to convert from a predicted box encoding $b_e$ to the predicted bounding box $b$,
the definitions above can be inverted,
\begin{equation}
b = (x, y, w, h),
@@ -631,14 +630,14 @@ the predicted refined RoI box encoding for class $c_i^*$.
Additionally, for any foreground RoI, let $m_i$ be the predicted $m \times m$ mask for class $c_i^*$
and $m_i^*$ the $m \times m$ mask target with values in $\{0,1\}$, where the mask target is cropped and resized from
the binary ground truth mask using the RPN proposal bounding box.
-In our implementation, we use nearest neighbour resizing for resizing the mask
+In our implementation, we use nearest neighbour resizing for resizing the cropped mask
targets.
Note that values in $m_i$ and $c_i$ are already normalized probabilities from
sigmoid and softmax layers, respectively.

Then, the RoI loss is computed as
\begin{equation}
-L_{RoI} = L_{cls} + L_{box} + L_{mask}
+L_{RoI} = L_{cls} + L_{box} + L_{mask},
\end{equation}
where
\begin{equation}
@@ -669,7 +668,7 @@ losses are only enabled for the foreground RoIs. Note that the bounding box and
for all classes other than $c_i^*$ are not penalized.

\paragraph{Inference}
-During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring region proposals
+During inference, the 300 (ResNet) or 1000 (ResNet-FPN) highest scoring region proposals
from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes,
and passed through the RoI bounding box refinement and classification heads
(but not through the mask head).
@@ -4,7 +4,7 @@ We introduced Motion R-CNN, which enables 3D object motion estimation in paralle
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames (and XYZ point coordinates) from a monocular camera.
In addition to instance motions, our network estimates the 3D ego-motion of the camera.
-We combine all these estimates to yield a dense optical flow output from our
+We combine all these estimates to obtain a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with bounding box, instance mask, depth, and 3D motion ground truth,
@@ -75,8 +75,8 @@ geometry for making a more reliable depth estimate, at least when the camera
is moving. We could also extend our method to stereo input data easily by concatenating
all of the frames into the input image.
In case we choose the option of integrating the depth prediction directly into
-the R-CNN,
-this would however require using a different dataset for training it, as Virtual KITTI does not
+the current backbone,
+this would however require using a different dataset for training Motion R-CNN, as Virtual KITTI does not
provide stereo images.
If we used a specialized depth network, we could use stereo data
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset,
@@ -149,8 +149,9 @@ RoI instance motion loss. Still, to use all available information from
ground truth optical flow and obtain more accurate supervision,
it would likely be beneficial to add a global, flow-based camera motion loss
independent of the RoI supervision.
-To do this, one could use a re-projection loss conceptually identical to the one
-for supervising instance motions with ground truth flow. However, to adjust for the
+To do this, one could use a re-projection loss conceptually similar to the one
+for supervising instance motions with ground truth flow,
+but computed on the full image instead of for individual RoIs. However, to adjust for the
fact that the camera motion can only be accurately supervised with flow at positions where
no object motion occurs, this loss would have to be masked with the ground truth
object masks. Again, we could use this flow-based loss in an unsupervised way,
@@ -27,7 +27,7 @@ from different viewing angles, resulting in a total of 10 variants per sequence.
In addition to the RGB frames, a variety of ground truth is supplied.
For each frame, we are given a dense depth and optical flow map and the camera
extrinsics matrix. There are two annotated object classes, cars and vans (N$_{cls}$ = 2).
-For all cars and vans in the each frame, we are given 2D and 3D object bounding
+For all cars and vans in each frame, we are given 2D and 3D object bounding
boxes, instance masks, 3D poses, and various other labels.

This makes the Virtual KITTI dataset ideally suited for developing our joint
@@ -85,7 +85,7 @@ $\{R_k^*, t_k^*\} \in \mathbf{SE}(3)$ as
R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^k \cdot \mathrm{inv}(R_t^k),
\end{equation}
\begin{equation}
-t_k^* = t_{t+1}^{k} - R_k^* \cdot t_t.
+t_k^* = t_{t+1}^{k} - R_k^* \cdot t_t^k.
\end{equation}
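Transcribed into NumPy, assuming R_t^k, R_{t+1}^k and t_t^k, t_{t+1}^k are the object's ground truth rotations and translations in the two frames and R_cam^* is the ground truth camera rotation.

import numpy as np

def gt_object_motion(R_cam_star, R_t_k, R_tp1_k, t_t_k, t_tp1_k):
    """Ground truth rigid object motion {R_k^*, t_k^*} between frames t and t+1."""
    R_k_star = np.linalg.inv(R_cam_star) @ R_tp1_k @ np.linalg.inv(R_t_k)
    t_k_star = t_tp1_k - R_k_star @ t_t_k
    return R_k_star, t_k_star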
As for the camera, we define $o_k^* \in \{ 0, 1 \}$,
@@ -112,11 +112,11 @@ measures the mean angle of the error rotation between predicted and ground truth
\begin{equation}
E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R_k) \cdot (t_k^* - t_k) \right\rVert_2,
\end{equation}
-is the mean euclidean norm between predicted and ground truth translation, and
+is the mean Euclidean distance between predicted and ground truth translation, and
\begin{equation}
E_{p} = \frac{1}{N}\sum_k \left\lVert p_k^* - p_k \right\rVert_2
\end{equation}
-is the mean euclidean norm between predicted and ground truth pivot.
+is the mean Euclidean distance between predicted and ground truth pivot.
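The two error measures E_t and E_p transcribed for N detections; the array shapes are illustrative.

import numpy as np

def translation_and_pivot_errors(R_pred, t_pred, t_gt, p_pred, p_gt):
    """E_t and E_p as defined above.
    R_pred: (N, 3, 3); t_pred, t_gt, p_pred, p_gt: (N, 3)."""
    diffs = np.einsum('nij,nj->ni', np.linalg.inv(R_pred), t_gt - t_pred)
    E_t = np.linalg.norm(diffs, axis=1).mean()
    E_p = np.linalg.norm(p_gt - p_pred, axis=1).mean()
    return E_t, E_p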
Moreover, we define precision and recall measures for the detection of moving objects,
where
@@ -151,8 +151,8 @@ Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{Mas
We train for a total of 192K iterations on the Virtual KITTI training set.
For this, we use a single Titan X (Pascal) GPU and a batch size of 1,
which results in approximately one day of training for a complete run.
-As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
-momentum of $0.9$.
+As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with
+momentum set to $0.9$.
As learning rate we use $0.25 \cdot 10^{-2}$ for the
first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
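The stated schedule, written out in PyTorch style for illustration only (the thesis implementation is not assumed to use this library; the single placeholder parameter stands in for the model).

import torch

params = [torch.nn.Parameter(torch.zeros(1))]               # placeholder for the model parameters
optimizer = torch.optim.SGD(params, lr=0.25e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[144_000], gamma=0.1)

for step in range(192_000):                                 # batch size 1, one sample per iteration
    # forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()                                        # drops the learning rate to 0.25e-3 at 144K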
@@ -255,7 +255,7 @@ the average ground truth camera translation. The camera rotation angle error
is still relatively high, compared to the small average ground truth camera rotation.
Although both variants use the exact same network for predicting the camera motion,
the FPN variant performs worse here, with the error in rotation angle twice as high.
-One possible explanations that should be investigated in futher work is
+One possible explanation that should be investigated in future work is
that in the FPN variant, all blocks in the backbone are shared between the camera
motion branch and the feature pyramid. In the variant without FPN, the C$_5$ and
C$_6$ blocks are only used in the camera branch, and thus only experience weight
@@ -265,8 +265,9 @@ As a remedy, increasing the loss weighting of the camera motion loss may be
helpful.

\paragraph{Instance motion}
-The object pivots are estimated with relatively (given that the scenes are in a realistic scale)
-high accuracy in both variants, although the FPN variant is significantly more
+The object pivots are estimated with relatively
+high accuracy in both variants (given that the scenes are in a realistic scale),
+although the FPN variant is significantly more
accurate, which we ascribe to the higher resolution features used in this variant.

The predicted 3D object translations and rotations still have a relatively high
@@ -113,7 +113,7 @@ manageable pieces.
\centering
\includegraphics[width=\textwidth]{figures/net_intro}
\caption{
-Overview of our network based on Mask R-CNN. For each region of interest (RoI), we predict the 3D instance motion
+Overview of our network based on Mask R-CNN \cite{MaskRCNN}. For each region of interest (RoI), we predict the 3D instance motion
in parallel to the class, bounding box and mask. Additionally, we branch off a
small network from the bottleneck for predicting the 3D camera ego-motion.
Novel components in addition to Mask R-CNN are shown in red.
@@ -124,8 +124,8 @@ Novel components in addition to Mask R-CNN are shown in red.

\subsection{Related work}

In the following, we will refer to systems which use deep networks for all
-optimization and do not perform time-critical side computation (e.g. numerical optimization)
-at inference time as \emph{end-to-end} deep learning systems.
+optimization and do not perform time-critical side computation
+at inference time (e.g. numerical optimization) as \emph{end-to-end} deep learning systems.

\paragraph{Deep networks in optical flow estimation}
@@ -142,8 +142,9 @@ where semantics become very important.
Extensions of these approaches to scene flow estimate dense flow and dense depth
with similarly generic networks \cite{SceneFlowDataset} and similar limitations.

-Other works \cite{ESI, JOF, FlowLayers, MRFlow} make use of semantic segmentation to structure
-the optical flow estimation problem and introduce reasoning at the object level,
+Other works make use of semantic segmentation to structure
+the optical flow estimation problem and introduce reasoning at the object level
+\cite{ESI, JOF, FlowLayers, MRFlow},
but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components and numerical
optimization is central to their inference.
@@ -168,20 +169,20 @@ without the use of (deep) learning.

In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
-with depth obtained from a non-learned stereo algorithm, to be used as pre-computed
+with depth obtained from a non-learned stereo algorithm to be used as pre-computed
inputs to a slanted plane scene flow model based on \cite{KITTI2015}.
Most likely due to their use of deep learning for instance segmentation and for some other components, this
-approach outperforms the previous related scene flow methods on public benchmarks.
+approach outperforms the previous related scene flow methods on relevant public benchmarks \cite{KITTI2012, KITTI2015}.
Still, the method uses an energy-minimization formulation for the scene flow estimation itself
and takes minutes to make a prediction.

Interestingly, the slanted plane methods achieve the current state-of-the-art
in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015},
-outperforming end-to-end deep networks like \cite{SceneFlowDataset, FlowNet2}.
+outperforming end-to-end deep networks like \cite{SceneFlowDataset, FlowNet, FlowNet2}.
However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts,
generally taking a fraction of a second instead of minutes for prediction, and can often be made to run in real time.
These concerns restrict the applicability of the current slanted plane models in practical settings,
-which often require estimations to be done in realtime (or close to realtime) and for which an end-to-end
+which often require estimations to be done in real time (or close to real time) and for which an end-to-end
approach based on learning would be preferable.

Also, by analogy, in other contexts, the move towards end-to-end deep learning has often led