Repository mirror: https://github.com/tu-darmstadt-informatik/bsc-thesis.git
Commit b9e4173f7f (parent 2a39cf1174): WIP
@@ -43,8 +43,9 @@ ...the repurposing of generic Deep Networks has become a
 popular approach for classic computer vision problems
 that require per-pixel estimation.
 
-Following this trend, many current end-to-end deep learning methods
-for optical flow or scene flow compute complete, high-resolution flow fields with generic
+Many current end-to-end deep learning methods
+for optical flow or scene flow follow this trend and compute
+complete, high-resolution flow fields with generic
 networks for dense, per-pixel estimation, and thereby ignore the
 inherent structure of the underlying motion estimation problem and any physical
 constraints within the scene.
approach.tex: 70 lines changed
@@ -209,7 +209,7 @@ Then, in both, the ResNet-50 and ResNet-50-FPN variant
 convolution to the $C_5$ features to reduce the number of inputs to the following
 fully-connected layers.
 Instead of averaging, we use bilinear resizing to bring the convolutional features
-to a fixed size without losing spatial information,
+to a fixed size without losing all spatial information,
 flatten them, and finally apply multiple fully-connected layers to compute the
 camera motion prediction.
 
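To make the camera motion head described in this hunk concrete, the following sketch shows the 1x1 convolution, bilinear-resize, flatten and fully-connected structure in TensorFlow. All names, layer widths and the output parameterization (three rotation sines, three translations and one moving/still logit) are illustrative assumptions, not the thesis implementation.

    import tensorflow as tf

    def camera_motion_head(c5_features, reduced_channels=128, fixed_size=(7, 7)):
        """c5_features: [N, H/32, W/32, C] backbone feature map."""
        # 1x1 convolution to reduce the number of inputs to the fully-connected layers.
        x = tf.keras.layers.Conv2D(reduced_channels, 1, activation="relu")(c5_features)
        # Bilinear resizing to a fixed size keeps coarse spatial information,
        # unlike global average pooling.
        x = tf.image.resize(x, fixed_size, method="bilinear")
        x = tf.keras.layers.Flatten()(x)
        x = tf.keras.layers.Dense(1024, activation="relu")(x)
        x = tf.keras.layers.Dense(1024, activation="relu")(x)
        # Assumed output: 3 rotation (angle sine) + 3 translation parameters
        # plus one moving/still logit for the camera.
        return tf.keras.layers.Dense(3 + 3 + 1)(x)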
@@ -217,19 +217,13 @@ camera motion prediction.
 In both of our network variants
 (Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}),
 we compute the fully-connected network for motion prediction from the
-convolutional mask features, branching off right before the mask upsampling
-deconvolution. The intuition behind this is that the final mask features contain
-high resolution, spatial information about which positions belong to the object and
-which belong to the background. Thus, we allow the motion estimation network to
-make use of this data and ideally integrate the motion (image matching) information
-localized within the object, but not that belonging to the background,
-into the final object motion estimate.
+flattened RoI features, which are also the basis for classification and
+bounding box refinement.
 
 \subsection{Supervision}
 \label{ssec:supervision}
 
-\paragraph{Per-RoI supervision with 3D motion ground truth}
+\paragraph{Per-RoI instance motion supervision with 3D instance motion ground truth}
 The most straightforward way to supervise the object motions is by using ground truth
 motions computed from ground truth object poses, which is in general
 only practical when training on synthetic datasets.
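The RoI motion head introduced above can be pictured analogously: a small fully-connected network on top of the flattened RoI features that are shared with the classification and box refinement heads, producing per-class motion parameters. The parameterization (rotation sines, translation, pivot point and a moving/still logit per class) and all sizes in this sketch are assumptions for illustration only.

    import tensorflow as tf

    def roi_motion_head(flat_roi_features, num_classes):
        """flat_roi_features: [num_rois, D], shared with the cls/box heads."""
        x = tf.keras.layers.Dense(1024, activation="relu")(flat_roi_features)
        x = tf.keras.layers.Dense(1024, activation="relu")(x)
        # Assumed per-class parameterization: 3 rotation sines, 3 translations,
        # 3 pivot coordinates and 1 moving/still logit.
        params_per_class = 3 + 3 + 3 + 1
        x = tf.keras.layers.Dense(num_classes * params_per_class)(x)
        return tf.reshape(x, [-1, num_classes, params_per_class])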
@@ -284,14 +278,16 @@ If the ground truth shows that the camera is not moving, we again do not
 penalize rotation and translation. For the camera, the loss is reduced to the
 classification term in this case.
 
-\paragraph{Per-RoI supervision \emph{without} 3D motion ground truth}
+\paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth}
 A more general way to supervise the object motions is a re-projection
 loss similar to the unsupervised loss in SfM-Net \cite{SfmNet},
 which we can apply to coordinates within the object bounding boxes,
 and which does not require ground truth 3D object motions.
 
-In this case, for any RoI, we generate a uniform 2D grid of points inside the RPN proposal bounding box
-with the same resolution as the predicted mask. We use the same bounding box
+In this case, for any RoI,
+we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box
+with the same resolution as the predicted mask.
+We use the same bounding box
 to crop the corresponding region from the dense, full image depth map
 and bilinearly resize the depth crop to the same resolution as the mask and point
 grid.
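The $m \times m$ point grid and the matching depth crop for one RoI could be obtained roughly as below. Box coordinates are assumed to be normalized [y1, x1, y2, x2] as expected by tf.image.crop_and_resize, and all names are hypothetical.

    import tensorflow as tf

    def roi_grid_and_depth(depth, box, m, image_height, image_width):
        """depth: [1, H, W, 1] full-image depth map; box: [4] normalized coords."""
        y1, x1, y2, x2 = tf.unstack(box)
        # Uniform m x m grid of pixel coordinates inside the proposal box.
        ys = tf.linspace(y1, y2, m) * tf.cast(image_height - 1, tf.float32)
        xs = tf.linspace(x1, x2, m) * tf.cast(image_width - 1, tf.float32)
        grid_y, grid_x = tf.meshgrid(ys, xs, indexing="ij")   # each [m, m]
        # Depth crop at the same m x m resolution, bilinearly resized.
        depth_crop = tf.image.crop_and_resize(depth, box[tf.newaxis], [0], [m, m])
        return grid_x, grid_y, depth_crop[0, :, :, 0]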
@@ -301,11 +297,18 @@ apply the RoI's predicted motion, masked by the predicted mask.
 Then, we apply the camera motion to the points, project them back to 2D
 and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
 Note that we batch this computation over all RoIs, so that we only perform
-it once per forward pass. The mathematical details are analogous to the
-dense, full image flow computation in the following subsection and will not
-be repeated here. \todo{probably better to add the mathematical details, as it may otherwise be confusing at some points}
+it once per forward pass. Figure \ref{figure:flow_loss} illustrates the approach.
+The mathematical details for the 3D transformations and mappings between 2D and 3D are analogous to the
+dense, full image flow composition in the following subsection, so we will not
+include them here. The only differences are that there is no sum over objects during
+the point transformation based on instance motion, as we consider the single object
+corresponding to an RoI in isolation, and that the masks are not resized to the
+full image resolution, as
+the depth crops and 2D point grid are at the same resolution as the predicted
+$m \times m$ mask.
 
-For each RoI, we can now penalize the optical flow grid to supervise the object motion.
+For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion
+by penalizing the $m \times m$ optical flow grid.
 If there is optical flow ground truth available, we can use the RoI bounding box to
 crop and resize a region from the ground truth optical flow to match the RoI's
 optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss.
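Putting the pieces of this hunk together, a simplified single-RoI version of the re-projection loss could look as follows, using the grid and depth crop from the earlier sketch. The pinhole projection, the motion parameterization and the absence of batching over RoIs are simplifying assumptions, and the ground truth flow crop is assumed to be prepared as described above.

    import tensorflow as tf

    def roi_flow_loss(grid_x, grid_y, depth, mask, R_obj, t_obj, pivot,
                      R_cam, t_cam, gt_flow_crop, fx, fy, cx, cy):
        """All grids are [m, m]; R_* are [3, 3]; t_*, pivot are [3]."""
        # Backproject the 2D grid to 3D camera coordinates using the depth crop.
        X = (grid_x - cx) / fx * depth
        Y = (grid_y - cy) / fy * depth
        P = tf.stack([X, Y, depth], axis=-1)                     # [m, m, 3]
        # Apply the predicted instance motion, weighted by the predicted soft mask.
        moved = tf.einsum("ij,hwj->hwi", R_obj, P - pivot) + pivot + t_obj
        P = P + mask[..., tf.newaxis] * (moved - P)
        # Apply the (predicted or ground truth) camera motion.
        P = tf.einsum("ij,hwj->hwi", R_cam, P) + t_cam
        # Project back to 2D; the flow is the difference to the initial grid.
        x2 = fx * P[..., 0] / P[..., 2] + cx
        y2 = fy * P[..., 1] / P[..., 2] + cy
        flow = tf.stack([x2 - grid_x, y2 - grid_y], axis=-1)     # [m, m, 2]
        return tf.reduce_mean(tf.abs(flow - gt_flow_crop))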
@@ -336,6 +339,17 @@ and sample proposals and RoIs in the exact same way.
 During inference, we proceed analogously to Mask R-CNN.
 In the same way as the RoI mask head, at test time, we compute the RoI motion head
 from the features extracted with refined bounding boxes.
+Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
+extracted RoI features before passing them into the motion head.
+The intuition behind this is that we want to mask out (set to zero) any positions in the
+extracted feature window that belong to the background. Then, the RoI motion
+head aggregates the motion (image matching) information from the backbone
+over positions localized within the object only, but not over positions belonging
+to the background, which should not influence the final object motion estimate.
+
+Again, as for masks and bounding boxes in Mask R-CNN,
+the output object motions are the predicted object motions for the
+highest scoring class.
 
 \subsection{Dense flow from motion}
 \label{ssec:postprocessing}
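The test-time masking of RoI features with the predicted, binarized masks amounts to a single element-wise multiplication. In this sketch, the mask is assumed to already be resized to the resolution of the extracted feature window; names and shapes are assumptions.

    import tensorflow as tf

    def mask_roi_features(roi_features, predicted_masks, threshold=0.5):
        """roi_features: [N, h, w, C]; predicted_masks: [N, h, w] soft masks."""
        # Zero out background positions before the features enter the motion head.
        binary = tf.cast(predicted_masks > threshold, roi_features.dtype)
        return roi_features * binary[..., tf.newaxis]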
@@ -360,17 +374,21 @@ For now, the depth map is always assumed to come from ground truth.
 Given $k$ detections with predicted motions as above, we transform all points within the bounding
 box of a detected object according to the predicted motion of the object.
 
-We first define the \emph{full image} mask $m_t^k$ for object k,
-which can be computed from the predicted box mask $m_k^b$ by bilinearly resizing
-$m_k^b$ to the width and height of the predicted bounding box and then copying the values
-of the resized mask into a full resolution all-zeros map, starting at the top-right coordinate of the predicted bounding box.
-Then,
+We first define the \emph{full image} mask $M_t^k$ for object $k$,
+which can be computed from the predicted box mask $m_t^k$ by bilinearly resizing
+$m_t^k$ to the width and height of the predicted bounding box and then copying the values
+of the resized mask into a full resolution mask initialized with zeros,
+starting at the top-left coordinate of the predicted bounding box.
+Then, given the predicted motions $(R_t^k, t_t^k)$ as well as $p_t^k$ for all objects,
 \begin{equation}
 P'_{t+1} =
-P_t + \sum_1^{k} m_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
+P_t + \sum_1^{k} M_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
 \end{equation}
+These motion predictions are understood to have already taken into account
+the classification into moving and still objects,
+and we thus, as described above, have identity motions for all objects with $o_t^k = 0$.
 
-Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$, % TODO introduce!
+Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$,
 
 \begin{equation}
 \begin{pmatrix}
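A NumPy sketch of the full image mask construction and the per-object point transformation from this hunk; the resizing helper, all shapes and the bounds handling are simplifying assumptions.

    import numpy as np
    from scipy.ndimage import zoom  # spline-based resizing; an assumption, any bilinear resizer works

    def full_image_mask(box_mask, box, height, width):
        """box_mask: [m, m] soft mask; box: (y1, x1, y2, x2) in pixel coordinates."""
        y1, x1, y2, x2 = [int(round(v)) for v in box]
        bh, bw = max(y2 - y1, 1), max(x2 - x1, 1)
        # Resize the box mask to the box size, then paste it into an all-zeros
        # full-resolution map at the box's top-left corner.
        resized = zoom(box_mask, (bh / box_mask.shape[0], bw / box_mask.shape[1]), order=1)
        M = np.zeros((height, width), dtype=np.float32)
        rh, rw = resized.shape
        M[y1:y1 + rh, x1:x1 + rw] = resized[:height - y1, :width - x1]
        return M

    def apply_object_motions(P_t, masks, rotations, translations, pivots):
        """P_t: [H, W, 3] 3D points; masks: list of [H, W] full-image masks."""
        P = P_t.copy()
        for M, R, t, p in zip(masks, rotations, translations, pivots):
            # R (P - p) + p + t, applied per point and weighted by the object mask,
            # accumulating the sum over objects from the equation above.
            moved = (P_t - p) @ R.T + p + t
            P = P + M[..., None] * (moved - P_t)
        return P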
@@ -380,8 +398,8 @@ X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
 \end{equation}.
 
 Note that in our experiments, we either use the ground truth camera motion to focus
-on the object motion predictions or the predicted camera motion to predict complete
-motion. We will always state which variant we use in the experimental section.
+on evaluating the object motion predictions or the predicted camera motion to evaluate
+the complete motion estimates. We will always state which variant we use in the experimental section.
 
 Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
 \begin{equation}
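Continuing the NumPy sketch from above, the camera transformation and the final projection back to pixel coordinates could look as follows, assuming a standard pinhole camera with focal lengths $f_x, f_y$ and principal point $(c_x, c_y)$.

    import numpy as np

    def camera_and_project(P, R_cam, t_cam, fx, fy, cx, cy):
        """P: [H, W, 3] points after the object transforms; returns flow [H, W, 2]."""
        # Apply the camera motion, then project with the assumed pinhole model.
        P_cam = P @ R_cam.T + t_cam
        x = fx * P_cam[..., 0] / P_cam[..., 2] + cx
        y = fy * P_cam[..., 1] / P_cam[..., 2] + cy
        H, W = P.shape[:2]
        grid_x, grid_y = np.meshgrid(np.arange(W), np.arange(H))
        # Dense optical flow as the difference to the original pixel grid.
        return np.stack([x - grid_x, y - grid_y], axis=-1)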
@@ -364,7 +364,7 @@ which has a stride of $4$ with respect to the input image.
 Most importantly, the RoI features can now be extracted at the pyramid level $P_j$ appropriate for a
 RoI bounding box with size $h \times w$,
 \begin{equation}
-j = \log_2(\sqrt{w \cdot h} / 224). %TODO complete
+j = \log_2(\sqrt{w \cdot h} / 224). \todo{complete}
 \label{eq:level_assignment}
 \end{equation}
 
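For reference, one way the \todo{complete} above could be resolved is the level assignment from the original FPN formulation, with a canonical RoI size of 224 and $j_0 = 4$, the result being clamped to the available pyramid levels in practice:

\begin{equation}
  j = \left\lfloor j_0 + \log_2\!\left(\sqrt{w \cdot h} \, / \, 224\right) \right\rfloor,
  \qquad j_0 = 4.
\end{equation}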
@@ -613,6 +613,9 @@ with a maximum IoU of 0.7.
 Then, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
 after again extracting the corresponding features.
 Thus, during inference, the features for the mask head are extracted using the refined
-bounding boxes, instead of the RPN bounding boxes. This is important for not
+bounding boxes for the predicted class, instead of the RPN bounding boxes. This is important for not
 introducing any misalignment, as we want to create the instance mask inside of the
 more precise, refined detection bounding boxes.
+Furthermore, note that bounding box and mask predictions for all classes but the predicted
+class (the highest scoring class) are discarded, and thus the output bounding
+box and mask correspond to the highest scoring class.
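Keeping only the predictions of the highest scoring class, as described above, is a simple gather over the class dimension; the shapes below are assumptions.

    import tensorflow as tf

    def select_predicted_class(class_scores, boxes_per_class, masks_per_class):
        """class_scores: [N, K]; boxes_per_class: [N, K, 4]; masks_per_class: [N, K, m, m]."""
        predicted = tf.argmax(class_scores, axis=1)                       # [N]
        boxes = tf.gather(boxes_per_class, predicted, axis=1, batch_dims=1)
        masks = tf.gather(masks_per_class, predicted, axis=1, batch_dims=1)
        return predicted, boxes, masks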
bib.bib: 18 lines changed
@@ -249,3 +249,21 @@
 title = {Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification},
 booktitle = {ICCV},
 year = {2015}}
+
+@inproceedings{UnFlow,
+author = {Simon Meister and Junhwa Hur and Stefan Roth},
+title = {UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss},
+booktitle = {AAAI},
+year = {2018}}
+
+@inproceedings{UnsupDepth,
+title = {Unsupervised CNN for single view depth estimation: Geometry to the rescue},
+author = {Ravi Garg and BG Vijay Kumar and Gustavo Carneiro and Ian Reid},
+booktitle = {ECCV},
+year = {2016}}
+
+@inproceedings{UnsupFlownet,
+title = {Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
+author = {Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
+booktitle = {ECCV Workshop on Brave new ideas for motion representations in videos},
+year = {2016}}
@@ -7,9 +7,9 @@ In addition to instance motions, our network estimates the 3D motion of the camera.
 We combine all these estimates to yield a dense optical flow output from our
 end-to-end deep network.
 Our model is trained on the synthetic Virtual KITTI dataset, which provides
-us with all required ground truth data.
+us with all required ground truth data, and evaluated on the same domain.
 During inference, our model does not add any significant computational overhead
-over the latest iterations of R-CNNs and is therefore just as fast and interesting
+over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore just as fast and interesting
 for real time scenarios.
 We thus presented a step towards real time 3D motion estimation based on a
 physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
@@ -18,6 +18,19 @@ of our network is highly interpretable, which may also bring benefits for safety
 applications.
 
 \subsection{Future Work}
+\paragraph{Evaluation and finetuning on KITTI 2015}
+Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
+on which we do not train, but we have yet to evaluate on a real world dataset.
+The best candidate for evaluating our complete model is the KITTI 2015 dataset \cite{KITTI2015},
+which provides depth ground truth to compose an optical flow field from our 3D motion estimates,
+and optical flow ground truth to evaluate the composed flow field.
+Note that with our current model, we can only evaluate on the \emph{train} set
+of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.
+
+As KITTI 2015 also provides object masks for moving objects, we could in principle
+fine-tune on KITTI 2015 train alone. As long as we cannot evaluate our method on the
+KITTI 2015 test set, this makes little sense, though.
+
 \paragraph{Predicting depth}
 In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
 However, in many application settings, we are not provided with any depth information.
@@ -26,15 +39,23 @@ from which no depth data is available.
 To do so, we could integrate depth prediction into our network by branching off a
 depth network from the backbone in parallel to the RPN (Figure \ref{table:motionrcnn_resnet_fpn_depth}).
 Alternatively, we could add a specialized network for end-to-end depth regression
-in parallel to the region-based network, e.g. \cite{GCNet}.
+in parallel to the region-based network (or before, to provide XYZ input to the R-CNN), e.g. \cite{GCNet}.
 Although single-frame monocular depth prediction with deep networks was already done
 to some level of success,
 our two-frame input should allow the network to make use of epipolar
 geometry for making a more reliable depth estimate, at least when the camera
 is moving. We could also extend our method to stereo input data easily by concatenating
-all of the frames into the input image, which
-would however require using a different dataset for training, as Virtual KITTI does not
+all of the frames into the input image.
+If we choose to integrate the depth prediction directly into
+the R-CNN,
+this would however require using a different dataset for training it, as Virtual KITTI does not
 provide stereo images.
+If we were to use a specialized depth network, we could use stereo data
+for depth prediction and still train the R-CNN independently on the monocular Virtual KITTI,
+though we would lose the ability to easily train the system in an end-to-end manner.
+
+As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test set,
+and also fine-tune on the training set as mentioned in the previous paragraph.
 
 {
 \begin{table}[h]
@@ -45,7 +66,7 @@ provide stereo images.
 \midrule\midrule
 & input image & H $\times$ W $\times$ C \\
 \midrule
-C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
+C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
 \midrule
 \multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
 \midrule
@@ -64,7 +85,7 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet})
 \end{tabular}
 
 \caption {
-Preliminary Motion R-CNN ResNet-50-FPN architecture with depth prediction,
+A possible Motion R-CNN ResNet-50-FPN architecture with depth prediction,
 based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
 }
 \label{table:motionrcnn_resnet_fpn_depth}
@@ -74,16 +95,41 @@ based on the Mask R-CNN ResNet-50-FPN architecture
 Due to the amount of supervision required by the different components of the network
 and the complexity of the optimization problem,
 we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
-A next step will be training on a more realistic dataset.
+A next step will be training on a more realistic dataset,
+ideally without having to rely on synthetic data at all.
 For this, we can first pre-train the RPN on an instance segmentation dataset like
 Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could execute alternating
-steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
-On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
-the motion losses (and depth prediction if added), as no instance segmentation ground truth exists.
+steps of training on, for example, Cityscapes and the KITTI 2015 stereo and optical flow datasets.
+On KITTI 2015 stereo and flow, we could run the instance segmentation component in testing mode and only penalize
+the motion losses (and depth prediction, if added), as no complete instance segmentation ground truth exists.
 On Cityscapes, we could continue training the instance segmentation components to
 improve detection and masks and avoid forgetting instance segmentation.
 As an alternative to this training scheme, we could investigate training on a pure
-instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
+instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth)
+prediction. Unsupervised deep learning of this kind was already done to some level of success in the optical flow
+setting \cite{UnsupFlownet, UnFlow},
+and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.
+
+\paragraph{Supervising the camera motion without 3D camera motion ground truth}
+We already described an optical flow based loss for supervising instance motions
+when we do not have 3D instance motion ground truth, or when we do not have
+any motion ground truth at all.
+However, it would also be useful to train our model without access to 3D camera
+motion ground truth.
+The 3D camera motion will already be indirectly supervised when it is used in the flow-based
+RoI instance motion loss. Still, to use all available information from
+ground truth optical flow and obtain more accurate supervision,
+it would likely be beneficial to add a global, flow-based camera motion loss
+independent of the RoI supervision.
+To do this, one could use a re-projection loss conceptually identical to the one
+for supervising instance motions with ground truth flow. However, to account for the
+fact that the camera motion can only be accurately supervised with flow at positions where
+no object motion occurs, this loss would have to be masked with the ground truth
+object masks. Again, we could use this flow-based loss in an unsupervised way.
+For training on a dataset without any motion ground truth, e.g.
+Cityscapes, it may be critical to add this term in addition to an unsupervised
+loss for the instance motions.
+
 \paragraph{Temporal consistency}
 A next step after the two aforementioned ones could be to extend our network to exploit more than two
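A rough sketch of how such a global, flow-based camera motion loss could be masked with the ground truth object masks, reusing the pinhole assumptions of the earlier sketches; this is a speculative illustration of the future-work idea, not an implemented loss.

    import numpy as np

    def camera_flow_loss(P_t, R_cam, t_cam, gt_flow, gt_object_masks, fx, fy, cx, cy):
        """P_t: [H, W, 3]; gt_flow: [H, W, 2]; gt_object_masks: [num_objects, H, W]."""
        # Flow induced by the predicted camera motion alone.
        P_cam = P_t @ R_cam.T + t_cam
        x = fx * P_cam[..., 0] / P_cam[..., 2] + cx
        y = fy * P_cam[..., 1] / P_cam[..., 2] + cy
        H, W = P_t.shape[:2]
        grid_x, grid_y = np.meshgrid(np.arange(W), np.arange(H))
        cam_flow = np.stack([x - grid_x, y - grid_y], axis=-1)
        # Only penalize positions not covered by any ground truth object mask.
        background = 1.0 - np.clip(gt_object_masks.sum(axis=0), 0.0, 1.0)
        diff = np.abs(cam_flow - gt_flow).sum(axis=-1)
        return (background * diff).sum() / max(background.sum(), 1.0)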
@@ -92,3 +138,16 @@ context of energy-minimization based scene flow \cite{TemporalSF}.
 In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
 into our architecture, we could enable temporally consistent motion estimation
 from image sequences of arbitrary length.
+
+\paragraph{Deeper networks for larger bottleneck strides}
+Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
+input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
+For accurately estimating the motion of objects with large displacements between
+the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
+We could do this easily in both of our network variants by adding one or multiple additional
+ResNet blocks. In the variant without FPN, these blocks would have to be placed
+after RoI feature extraction. In the FPN variant, the blocks could simply be
+added after the encoder C$_5$ bottleneck.
+To save memory, we could however also consider modifying the underlying
+ResNet-50 architecture by increasing the number of blocks while reducing the number
+of layers in each block.
@@ -5,11 +5,11 @@ computations. To make our code easy to extend and flexible, we build on
 the TensorFlow Object detection API \cite{TensorFlowObjectDetection}, which provides a Faster R-CNN baseline
 implementation.
 On top of this, we implemented Mask R-CNN and the Feature Pyramid Network (FPN)
-as well as extensions for motion estimation and related evaluations
+as well as the Motion R-CNN extensions for motion estimation and related evaluations
 and postprocessings. In addition, we generated all ground truth for
 Motion R-CNN in the form of TFRecords from the raw Virtual KITTI
 data to enable fast loading during training.
-Note that for RoI extraction and cropping operations,
+Note that for RoI extraction and bilinear crop and resize operations,
 we use the \texttt{tf.image.crop\_and\_resize} TensorFlow function with
 interpolation set to bilinear.
 
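A minimal, self-contained example of the bilinear crop and resize operation mentioned here; the feature map shape, box values and crop size are placeholders, not the values used in the thesis.

    import tensorflow as tf

    features = tf.random.normal([1, 64, 64, 256])      # [batch, H, W, C] feature map
    boxes = tf.constant([[0.1, 0.2, 0.6, 0.8]])        # normalized [y1, x1, y2, x2]
    box_indices = tf.constant([0], dtype=tf.int32)     # batch element for each box
    # Default interpolation method is bilinear.
    roi_features = tf.image.crop_and_resize(features, boxes, box_indices, crop_size=[14, 14])
    print(roi_features.shape)                          # (1, 14, 14, 256)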
@@ -147,8 +147,14 @@ fn = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
 Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
 predicted camera motions.
 
-\subsection{Training Setup}
+\subsection{Virtual KITTI training setup}
 \label{ssec:setup}
 
+For our initial experiments, we concatenate both RGB frames as
+well as the XYZ coordinates for both frames as input to the networks.
+We train both the Motion R-CNN ResNet-50 and ResNet-50-FPN variants.
+
+\paragraph{Training schedule}
 Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
 We train on a single Titan X (Pascal) for a total of 192K iterations on the
 Virtual KITTI training set.
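The input described above, two RGB frames plus the XYZ coordinate maps of both frames, is a channel-wise concatenation; the tensor names below are assumptions.

    import tensorflow as tf

    def build_input(rgb_t, rgb_t1, xyz_t, xyz_t1):
        """Each argument: [batch, H, W, 3]; returns a [batch, H, W, 12] input tensor."""
        return tf.concat([rgb_t, rgb_t1, xyz_t, xyz_t1], axis=-1)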
@@ -172,8 +178,7 @@ Note that a larger weight prevented the
 angle sine estimates from properly converging to the very small values they
 are in general expected to output.
 
-
-\subsection{Experiments on Virtual KITTI}
+\subsection{Virtual KITTI evaluation}
 \label{ssec:vkitti}
 
 \begin{figure}[t]
@@ -227,7 +232,8 @@ only impacted by the predicted 3D object motions.
 \label{table:vkitti}
 \end{table}
 }
-Figure \ref{figure:vkitti} visualizes instance segmentation and optical flow
+
+In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
 results on the Virtual KITTI validation set.
-Table \ref{table:vkitti} compares the performance of different network variants on the Virtual KITTI validation
-set.
+In Table \ref{table:vkitti}, we compare the performance of different network variants
+on the Virtual KITTI validation set.