Simon Meister 2017-11-14 19:35:43 +01:00
parent 2a39cf1174
commit b9e4173f7f
6 changed files with 155 additions and 50 deletions


@ -43,8 +43,9 @@ the repurposing of generic deep networks has become
a popular approach for classical computer vision problems
that require pixelwise estimation.
Following this trend, many current end-to-end deep learning methods
for optical flow or scene flow compute complete, high-resolution flow fields with generic
Many current end-to-end deep learning methods
for optical flow or scene flow follow this trend and compute
complete, high-resolution flow fields with generic
networks for dense, pixelwise estimation, thereby ignoring the
inherent structure of the underlying motion estimation problem and any physical
constraints within the scene.


@ -209,7 +209,7 @@ Then, in both, the ResNet-50 and ResNet-50-FPN variant (Table \ref{table:motionr
convolution to the $C_5$ features to reduce the number of inputs to the following
fully-connected layers.
Instead of averaging, we use bilinear resizing to bring the convolutional features
to a fixed size without losing spatial information,
to a fixed size without losing all spatial information,
flatten them, and finally apply multiple fully-connected layers to compute the
camera motion prediction.
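As a rough illustration, a minimal TensorFlow sketch of such a head (the tensor names, the resize target and the fully-connected layer widths are assumptions for illustration, not the exact configuration used in this work):
\begin{verbatim}
import tensorflow as tf

def camera_motion_head(c5_features, resize_hw=(7, 7), fc_units=1024):
    # Bilinear resize instead of global average pooling, so that some
    # spatial layout is preserved before flattening.
    x = tf.image.resize_bilinear(c5_features, resize_hw)
    flat_dim = resize_hw[0] * resize_hw[1] * int(c5_features.shape[-1])
    x = tf.reshape(x, [-1, flat_dim])
    x = tf.layers.dense(x, fc_units, activation=tf.nn.relu)
    x = tf.layers.dense(x, fc_units, activation=tf.nn.relu)
    # 3 rotation + 3 translation parameters and one moving/still logit
    # (this output parameterization is an assumption here).
    return tf.layers.dense(x, 7, activation=None)
\end{verbatim}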
@ -217,19 +217,13 @@ camera motion prediction.
In both of our network variants
(Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}),
we compute the fully-connected network for motion prediction from the
convolutional mask features, branching off right before the mask upsampling
deconvolution. The intuition behind this is that the final mask features contain
high resolution, spatial information about which positions belong to the object and
which belong to the background. Thus, we allow the motion estimation network to
make use of this data and ideally integrate the motion (image matching) information
localized within the object, but not that belonging to the background,
into the final object motion estimate.
flattened RoI features, which are also the basis for classification and
bounding box refinement.
\subsection{Supervision}
\label{ssec:supervision}
\paragraph{Per-RoI supervision with 3D motion ground truth}
\paragraph{Per-RoI instance motion supervision with 3D instance motion ground truth}
The most straightforward way to supervise the object motions is by using ground truth
motions computed from ground truth object poses, which is in general
only practical when training on synthetic datasets.
@ -284,14 +278,16 @@ If the ground truth shows that the camera is not moving, we again do not
penalize rotation and translation. For the camera, the loss is reduced to the
classification term in this case.
\paragraph{Per-RoI supervision \emph{without} 3D motion ground truth}
\paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth}
A more general way to supervise the object motions is a re-projection
loss similar to the unsupervised loss in SfM-Net \cite{SfmNet},
which we can apply to coordinates within the object bounding boxes,
and which does not require ground truth 3D object motions.
In this case, for any RoI, we generate a uniform 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask. We use the same bounding box
In this case, for any RoI,
we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask.
We use the same bounding box
to crop the corresponding region from the dense, full image depth map
and bilinearly resize the depth crop to the same resolution as the mask and point
grid.
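A minimal sketch of how this grid and the matching depth crop could be constructed (names are placeholders; boxes are assumed to be given in the normalized [y1, x1, y2, x2] convention expected by TensorFlow's crop-and-resize operation):
\begin{verbatim}
import tensorflow as tf

def roi_grid_and_depth(boxes, box_ind, depth, m=14):
    # boxes:   [R, 4] normalized RPN proposal boxes (y1, x1, y2, x2)
    # box_ind: [R] batch index of the image each box belongs to
    # depth:   [N, H, W, 1] dense full image depth map
    y1, x1, y2, x2 = tf.unstack(boxes, axis=1)
    # Uniform m x m grid of normalized (x, y) positions inside each box.
    xs, ys = tf.meshgrid(tf.linspace(0.0, 1.0, m), tf.linspace(0.0, 1.0, m))
    grid_x = x1[:, None, None] + xs[None] * (x2 - x1)[:, None, None]
    grid_y = y1[:, None, None] + ys[None] * (y2 - y1)[:, None, None]
    grid = tf.stack([grid_x, grid_y], axis=-1)       # [R, m, m, 2]
    # Depth crop at the same resolution as the predicted m x m mask.
    depth_crop = tf.image.crop_and_resize(depth, boxes, box_ind, [m, m])
    return grid, depth_crop
\end{verbatim}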
@ -301,11 +297,18 @@ apply the RoI's predicted motion, masked by the predicted mask.
Then, we apply the camera motion to the points, project them back to 2D
and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
Note that we batch this computation over all RoIs, so that we only perform
it once per forward pass. The mathematical details are analogous to the
dense, full image flow computation in the following subsection and will not
be repeated here. \todo{probably better to add the mathematical details, as it may otherwise be confusing at some points}
it once per forward pass. Figure \ref{figure:flow_loss} illustrates the approach.
The mathematical details of the 3D transformations and the mappings between 2D and 3D are analogous to the
dense, full image flow composition in the following subsection, so we do not
repeat them here. There are only two differences: there is no sum over objects during
the point transformation based on the instance motion, as we consider the single object
corresponding to an RoI in isolation, and the masks are not resized to the
full image resolution, as
the depth crops and the 2D point grid already have the same resolution as the predicted
$m \times m$ mask.
For each RoI, we can now penalize the optical flow grid to supervise the object motion.
For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion
by penalizing the $m \times m$ optical flow grid.
If there is optical flow ground truth available, we can use the RoI bounding box to
crop and resize a region from the ground truth optical flow to match the RoI's
optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss.
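A sketch of this per-RoI flow loss under the same assumptions as above (the ground truth flow is assumed to be given as a dense [N, H, W, 2] map; averaging is used here for simplicity and may differ from the exact normalization in our implementation):
\begin{verbatim}
def roi_flow_loss(pred_flow_grid, gt_flow, boxes, box_ind, m=14):
    # pred_flow_grid: [R, m, m, 2] optical flow composed per RoI
    # gt_flow:        [N, H, W, 2] dense ground truth optical flow
    gt_crop = tf.image.crop_and_resize(gt_flow, boxes, box_ind, [m, m])
    return tf.reduce_mean(tf.abs(pred_flow_grid - gt_crop))
\end{verbatim}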
@ -336,6 +339,17 @@ and sample proposals and RoIs in the exact same way.
During inference, we proceed analogously to Mask R-CNN.
In the same way as the RoI mask head, at test time, we compute the RoI motion head
from the features extracted with refined bounding boxes.
Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
extracted RoI features before passing them into the motion head.
The intuition behind this is that we want to mask out (set to zero) all positions in the
extracted feature window that belong to the background. The RoI motion
head then aggregates the motion (image matching) information from the backbone
only over positions localized within the object, and not over background positions,
which should not influence the final object motion estimate.
Again, as for masks and bounding boxes in Mask R-CNN,
the object motions we output are those predicted for the
highest scoring class.
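A minimal sketch of this feature masking (hypothetical names; the predicted mask is assumed to already be resized to the spatial resolution of the extracted RoI features):
\begin{verbatim}
def mask_roi_features(roi_features, mask_logits, threshold=0.5):
    # roi_features: [R, h, w, C] features extracted with the refined boxes
    # mask_logits:  [R, h, w, 1] mask logits of the highest scoring class
    binary_mask = tf.cast(tf.sigmoid(mask_logits) > threshold, tf.float32)
    # Zero out background positions; broadcasts over the channel axis.
    return roi_features * binary_mask
\end{verbatim}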
\subsection{Dense flow from motion}
\label{ssec:postprocessing}
@ -360,17 +374,21 @@ For now, the depth map is always assumed to come from ground truth.
Given $k$ detections with predicted motions as above, we transform all points within the bounding
box of a detected object according to the predicted motion of the object.
We first define the \emph{full image} mask $m_t^k$ for object k,
which can be computed from the predicted box mask $m_k^b$ by bilinearly resizing
$m_k^b$ to the width and height of the predicted bounding box and then copying the values
of the resized mask into a full resolution all-zeros map, starting at the top-right coordinate of the predicted bounding box.
Then,
We first define the \emph{full image} mask $M_t^k$ for object k,
which can be computed from the predicted box mask $m_t^k$ by bilinearly resizing
$m_t^k$ to the width and height of the predicted bounding box and then copying the values
of the resized mask into a full resolution mask initialized with zeros,
starting at the top-left coordinate of the predicted bounding box.
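A sketch of this pasting step for a single detection (hypothetical helper; box coordinates are assumed to be integer pixel coordinates lying inside the image):
\begin{verbatim}
def full_image_mask(box_mask, box, image_hw):
    # box_mask: [m, m] predicted box mask, box: (y1, x1, y2, x2) in pixels
    y1, x1, y2, x2 = box
    h, w = y2 - y1, x2 - x1
    resized = tf.image.resize_bilinear(
        box_mask[None, :, :, None], [h, w])[0, :, :, 0]
    # Zero-padding places the resized mask at the box location inside an
    # all-zeros full resolution map.
    return tf.pad(resized, [[y1, image_hw[0] - y2], [x1, image_hw[1] - x2]])
\end{verbatim}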
Then, given the predicted motions $(R_t^k, t_t^k)$ as well as $p_t^k$ for all objects,
\begin{equation}
P'_{t+1} =
P_t + \sum_1^{k} m_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
P_t + \sum_1^{k} M_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
\end{equation}
These motion predictions are understood to have already taken into account
the classification into moving and still objects,
so that, as described above, we use identity motions for all objects with $o_t^k = 0$.
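A batched sketch of the above point transformation (placeholder names; rotations are assumed to be given as matrices and all tensors to live in the camera frame at time $t$):
\begin{verbatim}
def apply_instance_motions(P_t, masks, R, t, p):
    # P_t: [H, W, 3] points, masks: [K, H, W] full image masks M_t^k,
    # R: [K, 3, 3], t: [K, 3], p: [K, 3] (pivots)
    P = tf.reshape(P_t, [-1, 3])                                  # [HW, 3]
    moved = tf.einsum('kij,knj->kni', R, P[None] - p[:, None, :])
    moved = moved + p[:, None, :] + t[:, None, :]                 # [K, HW, 3]
    delta = tf.reshape(masks, [tf.shape(masks)[0], -1, 1]) * (moved - P[None])
    return P_t + tf.reshape(tf.reduce_sum(delta, axis=0), tf.shape(P_t))
\end{verbatim}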
Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$, % TODO introduce!
Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$,
\begin{equation}
\begin{pmatrix}
@ -380,8 +398,8 @@ X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
\end{equation}.
Note that in our experiments, we either use the ground truth camera motion to focus
on the object motion predictions or the predicted camera motion to predict complete
motion. We will always state which variant we use in the experimental section.
on evaluating the object motion predictions or the predicted camera motion to evaluate
the complete motion estimates. We will always state which variant we use in the experimental section.
Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
\begin{equation}


@ -364,7 +364,7 @@ which has a stride of $4$ with respect to the input image.
Most importantly, the RoI features can now be extracted at the pyramid level $P_j$ appropriate for an
RoI bounding box with size $h \times w$,
\begin{equation}
j = \log_2(\sqrt{w \cdot h} / 224). %TODO complete
j = \log_2(\sqrt{w \cdot h} / 224). \todo{complete}
\label{eq:level_assignment}
\end{equation}
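For reference, the level assignment marked as incomplete above presumably corresponds to the standard formula from the FPN paper of Lin et al.; the floor and the offset $j_0 = 4$ for the canonical level below are taken from that paper, not from this draft:
\begin{equation}
j = \left\lfloor j_0 + \log_2\left(\sqrt{w \cdot h} / 224\right) \right\rfloor, \qquad j_0 = 4.
\end{equation}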
@ -613,6 +613,9 @@ with a maximum IoU of 0.7.
Then, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
after again extracting the corresponding features.
Thus, during inference, the features for the mask head are extracted using the refined
bounding boxes, instead of the RPN bounding boxes. This is important for not
bounding boxes for the predicted class, instead of the RPN bounding boxes. This is important for not
introducing any misalignment, as we want to create the instance mask inside the
more precise, refined detection bounding boxes.
Furthermore, note that bounding box and mask predictions for all classes but the predicted
class (the highest scoring class) are discarded, and thus the output bounding
box and mask correspond to the highest scoring class.

bib.bib

@ -249,3 +249,21 @@
title = {Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification},
booktitle = {ICCV},
year = {2015}}
@inproceedings{UnFlow,
  author = {Simon Meister and Junhwa Hur and Stefan Roth},
  title = {UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss},
  booktitle = {AAAI},
  year = {2018}}
@inproceedings{UnsupDepth,
  author = {Ravi Garg and BG Vijay Kumar and Gustavo Carneiro and Ian Reid},
  title = {Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue},
  booktitle = {ECCV},
  year = {2016}}
@inproceedings{UnsupFlownet,
  author = {Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
  title = {Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
  booktitle = {ECCV Workshop on Brave New Ideas for Motion Representations in Videos},
  year = {2016}}


@ -7,9 +7,9 @@ In addition to instance motions, our network estimates the 3D motion of the came
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with all required ground truth data.
us with all required ground truth data, and evaluated on the same domain.
During inference, our model does not add any significant computational overhead
over the latest iterations of R-CNNs and is therefore just as fast and interesting
over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore just as fast and interesting
for real-time scenarios.
We have thus presented a step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
@ -18,6 +18,19 @@ of our network is highly interpretable, which may also bring benefits for safety
applications.
\subsection{Future Work}
\paragraph{Evaluation and fine-tuning on KITTI 2015}
Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
on which we do not train, but we have yet to evaluate on a real world dataset.
The best candidate to evaluate our complete model is the KITTI 2015 dataset \cite{KITTI2015},
which provides depth ground truth to compose an optical flow field from our 3D motion estimates,
and optical flow ground truth to evaluate the composed flow field.
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.
As KITTI 2015 also provides object masks for moving objects, we could in principle
fine-tune on the KITTI 2015 training set alone. However, this makes little sense as long as
we cannot evaluate our method on the KITTI 2015 test set.
\paragraph{Predicting depth}
In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
However, in many application settings, we are not provided with any depth information.
@ -26,15 +39,23 @@ from which no depth data is available.
To do so, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN (Figure \ref{table:motionrcnn_resnet_fpn_depth}).
Alternatively, we could add a specialized network for end-to-end depth regression
in parallel to the region-based network, e.g. \cite{GCNet}.
in parallel to the region-based network (or before, to provide XYZ input to the R-CNN), e.g. \cite{GCNet}.
Although single-frame monocular depth prediction with deep networks has already been
demonstrated with some success,
our two-frame input should allow the network to exploit epipolar
geometry for a more reliable depth estimate, at least when the camera
is moving. We could also extend our method to stereo input data easily by concatenating
all of the frames into the input image, which
would however require using a different dataset for training, as Virtual KITTI does not
all of the frames into the input image.
If we chose to integrate the depth prediction directly into
the R-CNN, however,
this would require using a different dataset for training it, as Virtual KITTI does not
provide stereo images.
If we instead used a specialized depth network, we could use stereo data
for depth prediction and still train the R-CNN independently on the monocular Virtual KITTI,
though we would lose the ability to easily train the system in an end-to-end manner.
As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test set,
and also fine-tune on the training set as mentioned in the previous paragraph.
{
\begin{table}[h]
@ -45,7 +66,7 @@ provide stereo images.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
@ -64,7 +85,7 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfra
\end{tabular}
\caption {
Preliminary Motion R-CNN ResNet-50-FPN architecture with depth prediction,
A possible Motion R-CNN ResNet-50-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
@ -74,16 +95,41 @@ based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_re
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset.
A next step will be training on a more realistic dataset,
ideally without having to rely on synthetic data at all.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could execute alternating
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses (and depth prediction if added), as no instance segmentation ground truth exists.
steps of training on, for example, Cityscapes and the KITTI 2015 stereo and optical flow datasets.
On KITTI 2015 stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses (and depth prediction, if added), as no complete instance segmentation ground truth exists.
On Cityscapes, we could continue training the instance segmentation components to
improve detection and masks and avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth)
prediction. Unsupervised deep learning of this kind has already shown some success in the optical flow
setting \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.
\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described an optical flow based loss for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera
motion ground truth.
The 3D camera motion is already indirectly supervised when it is used in the flow-based
RoI instance motion loss. Still, to use all available information from
ground truth optical flow and obtain more accurate supervision,
it would likely be beneficial to add a global, flow-based camera motion loss
independent of the RoI supervision.
To do this, one could use a re-projection loss conceptually identical to the one
for supervising instance motions with ground truth flow. However, to account for the
fact that the camera motion can only be accurately supervised with flow at positions where
no object motion occurs, this loss would have to be masked with the ground truth
object masks. Again, we could use this flow-based loss in an unsupervised way.
For training on a dataset without any motion ground truth, e.g.
Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.
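A sketch of such a masked camera flow loss (placeholder names; the flow composed from the camera motion alone and the ground truth object masks are assumed to be given at full image resolution as float tensors):
\begin{verbatim}
def camera_flow_loss(cam_flow, gt_flow, gt_object_masks):
    # cam_flow: [H, W, 2] flow composed from the predicted camera motion only
    # gt_flow:  [H, W, 2], gt_object_masks: [K, H, W] float {0, 1} masks
    background = 1.0 - tf.minimum(tf.reduce_sum(gt_object_masks, axis=0), 1.0)
    diff = tf.reduce_sum(tf.abs(cam_flow - gt_flow), axis=-1)
    # Average the l1 flow error over background positions only.
    return tf.reduce_sum(background * diff) / (tf.reduce_sum(background) + 1e-8)
\end{verbatim}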
\paragraph{Temporal consistency}
A next step after the two aforementioned ones could be to extend our network to exploit more than two
@ -92,3 +138,16 @@ context of energy-minimization based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
\paragraph{Deeper networks for larger bottleneck strides}
Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
For accurately estimating the motion of objects with large displacements between
the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
We could do this easily in both of our network variants by adding one or more additional
ResNet blocks. In the variant without FPN, these blocks would have to be placed
after RoI feature extraction. In the FPN variant, the blocks could be simply
added after the encoder C$_5$ bottleneck.
To save memory, however, we could also consider modifying the underlying
ResNet-50 architecture to increase the number of blocks while reducing the number
of layers in each block.


@ -5,11 +5,11 @@ computations. To make our code easy to extend and flexible, we build on
the TensorFlow Object detection API \cite{TensorFlowObjectDetection}, which provides a Faster R-CNN baseline
implementation.
On top of this, we implemented Mask R-CNN and the Feature Pyramid Network (FPN)
as well as extensions for motion estimation and related evaluations
as well as the Motion R-CNN extensions for motion estimation and related evaluations
and postprocessing. In addition, we generated all ground truth for
Motion R-CNN in the form of TFRecords from the raw Virtual KITTI
data to enable fast loading during training.
Note that for RoI extraction and cropping operations,
Note that for RoI extraction and bilinear crop and resize operations,
we use the \texttt{tf.image.crop\_and\_resize} TensorFlow function with
interpolation set to bilinear.
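For illustration, a typical call as used for RoI feature extraction (the $14 \times 14$ crop size and tensor names are assumptions; bilinear interpolation is the function's default):
\begin{verbatim}
# feature_map: [N, H', W', C] backbone features,
# boxes: [R, 4] normalized (y1, x1, y2, x2), box_ind: [R] batch indices
roi_features = tf.image.crop_and_resize(
    feature_map, boxes, box_ind, crop_size=[14, 14])
\end{verbatim}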
@ -147,8 +147,14 @@ fn = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
predicted camera motions.
\subsection{Training Setup}
\subsection{Virtual KITTI training setup}
\label{ssec:setup}
For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both the Motion R-CNN ResNet-50 and ResNet-50-FPN variants.
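As a small sketch (hypothetical tensor names), the network input is then a 12-channel image:
\begin{verbatim}
# Two RGB frames plus per-frame XYZ coordinate maps -> [N, H, W, 12].
net_input = tf.concat([rgb_t, rgb_tp1, xyz_t, xyz_tp1], axis=-1)
\end{verbatim}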
\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI training set.
@ -172,8 +178,7 @@ Note that a larger weight prevented the
angle sine estimates from properly converging to the very small values they
are in general expected to output.
\subsection{Experiments on Virtual KITTI}
\subsection{Virtual KITTI evaluation}
\label{ssec:vkitti}
\begin{figure}[t]
@ -227,7 +232,8 @@ only impacted by the predicted 3D object motions.
\label{table:vkitti}
\end{table}
}
Figure \ref{figure:vkitti} visualizes instance segmentation and optical flow
In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
Table \ref{table:vkitti} compares the performance of different network variants on the Virtual KITTI validation
set.
In Table \ref{table:vkitti}, we compare the performance of different network variants
on the Virtual KITTI validation set.