commit 9215f296a7 (parent f0050801a4)
Author: Simon Meister
Date: 2017-11-22 19:32:13 +01:00
7 changed files with 90 additions and 85 deletions


@ -17,6 +17,8 @@ to estimate disparity-based depth, however monocular depth estimation with deep
popular \cite{DeeperDepth, UnsupPoseDepth}.
In this preliminary work, we will assume per-pixel depth to be given.
\subsection{CNNs for dense motion estimation}
{
\begin{table}[h]
\centering
@ -54,7 +56,6 @@ are used for refinement.
\end{table}
}
Deep convolutional neural network (CNN) architectures
\cite{ImageNetCNN, VGGNet, ResNet}
became widely popular through numerous successes in classification and recognition tasks.
@ -134,30 +135,17 @@ and N$_{motions} = 3$.
\subsection{ResNet}
\label{ssec:resnet}
{
\begin{table}[h]
@ -228,16 +216,29 @@ Batch normalization \cite{BN} is used after every residual unit.
\end{table}
}
\begin{figure}[t]
\centering
\includegraphics[width=0.3\textwidth]{figures/bottleneck}
\caption{
The ResNet \cite{ResNet} \enquote{bottleneck} convolutional block, introduced to reduce computational
complexity in deeper network variants, shown here with 256 input and output channels.
Figure taken from \cite{ResNet}.
}
\label{figure:bottleneck}
\end{figure}
ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but
became popular as a basic building block of many deep network architectures for a variety
of tasks. Figure \ref{figure:bottleneck}
shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
of very deep networks without the gradients becoming too small as the distance
from the output layer increases.
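Concretely, writing $\mathbf{x}$ for the input of a residual unit and $\mathcal{F}(\mathbf{x})$ for the residual mapping learned by its stacked convolutions, the unit computes
\[
\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x},
\]
following the formulation of \cite{ResNet}, so that the identity shortcut provides a direct path along which gradients can propagate towards earlier layers.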
In Table \ref{table:resnet}, we show the ResNet variant
that will serve as the basic CNN backbone of our networks and
is also used in many other region-based convolutional networks.
The input images are always passed through the ResNet backbone as a first step,
yielding the features on which the remaining network components operate.
Note that for the Mask R-CNN architectures we describe below, the architecture shown is equivalent
to the standard ResNet-50 backbone.
We additionally introduce one small extension that
will be useful for our Motion R-CNN network.
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution, whereas the bottleneck of FlowNetS \cite{FlowNet} has a stride of 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, following FlowNetS, we add an additional C$_6$ block to the Motion R-CNN ResNet variants
to increase the bottleneck stride to 64.
\subsection{Region-based CNNs}
\label{ssec:rcnn}
@ -266,6 +267,39 @@ Thus, given region proposals, all computation is reduced to a single pass throug
speeding up the system by two orders of magnitude at inference time and one order of magnitude
at training time.
\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and the subsequent box refinement and
classification into a single deep network, leading to faster test-time processing than Fast R-CNN
and, once again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
predicts objectness scores and regresses bounding boxes at each of its output positions.
At any of the $h \times w$ output positions of the RPN head,
$\text{N}_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total.
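For reference, the box offsets follow the parameterization of \cite{FasterRCNN}: with $(x, y, w, h)$ denoting the center coordinates, width and height of a predicted box and $(x_a, y_a, w_a, h_a)$ those of its anchor, the regression targets are
\[
t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}.
\]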
In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding
to anchor boxes with areas of $\{128^2, 256^2, 512^2\}$ pixels, and 3 aspect ratios,
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).
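For example, assuming a hypothetical padded input resolution of $1024 \times 320$ pixels, the RPN output has $h \times w = 64 \times 20$ positions, and the predictions are made relative to
\[
\text{N}_a \cdot h \cdot w = 9 \cdot 64 \cdot 20 = 11520
\]
reference anchors.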
For each RPN prediction, the objectness score tells us how likely the predicted box is to contain an object.
The region proposals are then obtained as the N highest-scoring RPN predictions.
Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each of the region proposals, which are now obtained
from the RPN instead of being pre-computed by an external algorithm.
As in Fast R-CNN, RoI pooling is used to extract a fixed-size feature map for each of the region proposals,
and the refined bounding boxes are predicted separately for each object class.
Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
(for Faster R-CNN, the mask head is ignored).
{
\begin{table}[t]
\centering
@ -316,39 +350,6 @@ whereas Faster R-CNN uses RoI pooling.
\end{table}
}
\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
However, it can be helpful to know class and object (instance) membership of all individual pixels,


@ -330,6 +330,7 @@
volume={1},
number={4},
pages={541-551},
month = dec,
journal = neco,
year = {1989}}


@ -2,8 +2,8 @@
We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames (and XYZ point coordinates) from a monocular camera.
In addition to instance motions, our network estimates the 3D ego-motion of the camera.
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
@ -20,7 +20,8 @@ the accuracy of the motion predictions is still not convincing.
More work will thus be required to bring the system (closer) to competitive accuracy,
which includes trying penalization with the flow loss instead of 3D motion ground truth,
experimenting with the weighting between different loss terms,
and improvements to the network architecture, loss design, and training process.
We thus presented a partial step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
to previous end-to-end deep networks for dense motion estimation, the output
@ -30,7 +31,7 @@ applications.
\subsection{Future Work}
\paragraph{Mask R-CNN baseline}
As our Mask R-CNN re-implementation is still not as accurate as reported in the
original paper \cite{MaskRCNN}, working on the implementation details of this baseline would be
a critical, direct next step. Recently, a highly accurate, third-party implementation of Mask
R-CNN in TensorFlow was released, which should be studied to this end.
@ -54,7 +55,7 @@ and optical flow ground truth to evaluate the composed flow field.
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.
As KITTI 2015 also provides instance masks for moving objects, we could in principle
fine-tune on the KITTI 2015 train set alone. However, as long as we cannot evaluate our method on the
KITTI 2015 test set, this makes little sense.
@ -78,7 +79,7 @@ the R-CNN,
this would however require using a different dataset for training it, as Virtual KITTI does not
provide stereo images.
If we used a specialized depth network, we could use stereo data
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset,
though we would lose the ability to easily train the system in an end-to-end manner.
As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test,
@ -138,7 +139,7 @@ setting \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.
\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described a loss based on optical flow for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera
@ -159,9 +160,9 @@ Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.
\paragraph{Temporal consistency}
A next step after the aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of classical energy-minimization-based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.


@ -160,11 +160,12 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
For training the RPN and RoI heads and during inference,
we use the exact same number of proposals and RoIs as Mask R-CNN in
the ResNet and ResNet-FPN variants, respectively.
All losses (the original ones and our new motion losses)
are added up without additional weighting between the loss terms,
as in Mask R-CNN.
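Schematically, and with the individual term names chosen here purely for illustration, the total training objective thus has the form
\[
L = L_{\mathrm{RPN}} + L_{\mathrm{cls}} + L_{\mathrm{box}} + L_{\mathrm{mask}} + L_{\mathrm{cam}} + L_{\mathrm{inst}},
\]
where the first four terms stand for the original Mask R-CNN losses and the last two for the camera and instance motion losses, all weighted equally.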
\paragraph{Initialization}
For initializing the C$_1$ to C$_5$ (see Table~\ref{table:resnet}) weights, we use a pre-trained
ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository.
Following the pre-existing TensorFlow implementation of Faster R-CNN,
we initialize all other hidden layers with He initialization \cite{He}.
@ -185,7 +186,7 @@ are in general expected to output.
Visualization of results on Virtual KITTI with XYZ input, camera motion prediction and 3D motion supervision.
For each example, we show the results with Motion R-CNN ResNet and ResNet-FPN
in the upper and lower row, respectively.
From left to right, we show the first input frame with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
@ -200,7 +201,7 @@ We visually compare a Motion R-CNN ResNet trained without (upper row) and
with (lower row) classifying the objects into moving and non-moving objects.
Note that in the selected example, all cars are parked, and thus the predicted
motion in the first row is an error.
From left to right, we show the first input frame with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
@ -215,7 +216,7 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) i
\multicolumn{1}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} &\multicolumn{2}{c}{Flow Error} \\
\cmidrule(lr){1-1}\cmidrule(lr){2-6}\cmidrule(l){7-8}\cmidrule(l){9-10}
FPN & $E_{R}$ [deg] & $E_{t}$ [m] & $E_{p}$ [m] & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam}$ [deg] & $E_{t}^{cam}$ [m] & AEE & Fl-all \\\midrule
- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & - \\\midrule
$\times$ & 0.301 & 0.237 & 3.331 & 0.790 & 0.916 & 0.087 & 0.053 & 11.17 & 24.91\% \\
\checkmark & 0.293 & 0.210 & 1.958 & 0.844 & 0.914 & 0.169 & 0.050 & 8.29 & 45.22\% \\
\bottomrule
@ -237,14 +238,15 @@ to the average rotation angle in the ground truth camera motions.
For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both the Motion R-CNN ResNet and ResNet-FPN variants and supervise
camera and instance motions with 3D motion ground truth.
In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
In Figure \ref{figure:moving}, we visually justify the addition of the classifier
that decides between a moving and still object.
In Table \ref{table:vkitti}, we compare various metrics for the Motion R-CNN
ResNet and ResNet-FPN network variants
on the Virtual KITTI validation set.
\paragraph{Camera motion}
@ -264,8 +266,8 @@ helpful.
\paragraph{Instance motion}
The object pivots are estimated with relatively high accuracy in both variants
(given that the scenes are at a realistic scale), although the FPN variant is significantly more
accurate, which we ascribe to the higher-resolution features used in this variant.
The predicted 3D object translations and rotations still have a relatively high
error, compared to the average actual (ground truth) translations and rotations,
