mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git

commit 9215f296a7 (parent f0050801a4)
final
135 background.tex
@@ -17,6 +17,8 @@ to estimate disparity-based depth, however monocular depth estimation with deep
popular \cite{DeeperDepth, UnsupPoseDepth}.
In this preliminary work, we will assume per-pixel depth to be given.

\subsection{CNNs for dense motion estimation}

{
\begin{table}[h]
\centering
@@ -54,7 +56,6 @@ are used for refinement.
\end{table}
}

\subsection{CNNs for dense motion estimation}
Deep convolutional neural network (CNN) architectures
\cite{ImageNetCNN, VGGNet, ResNet}
became widely popular through numerous successes in classification and recognition tasks.
@@ -134,30 +135,17 @@ and N$_{motions} = 3$.

\subsection{ResNet}
\label{ssec:resnet}
ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but
became popular as basic building block of many deep network architectures for a variety
of different tasks. Figure \ref{figure:bottleneck}
shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
of very deep networks without the gradients becoming too small as the distance
from the output layer increases.

In Table \ref{table:resnet}, we show the ResNet variant
that will serve as the basic CNN backbone of our networks, and
is also used in many other region-based convolutional networks.
The initial image data is always passed through the ResNet backbone as a first step to
bootstrap the complete deep network.
Note that for the Mask R-CNN architectures we describe below, the architecture shown is equivalent
to the standard ResNet-50 backbone.

We additionally introduce one small extension that
will be useful for our Motion R-CNN network.
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, their bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, we add a additional C$_6$ block to be used in the Motion R-CNN ResNet variants
to increase the bottleneck stride to 64, following FlowNetS.

\begin{figure}[t]
\centering
\includegraphics[width=0.3\textwidth]{figures/bottleneck}
\caption{
ResNet \cite{ResNet} \enquote{bottleneck} convolutional block introduced to reduce computational
complexity in deeper network variants, shown here with 256 input and output channels.
Figure taken from \cite{ResNet}.
}
\label{figure:bottleneck}
\end{figure}

{
\begin{table}[h]
@@ -228,16 +216,29 @@ Batch normalization \cite{BN} is used after every residual unit.
\end{table}
}

\begin{figure}[t]
\centering
\includegraphics[width=0.3\textwidth]{figures/bottleneck}
\caption{
ResNet \cite{ResNet} \enquote{bottleneck} convolutional block introduced to reduce computational
complexity in deeper network variants, shown here with 256 input and output channels.
Figure taken from \cite{ResNet}.
}
\label{figure:bottleneck}
\end{figure}
ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but
became popular as a basic building block of many deep network architectures for a variety
of different tasks. Figure \ref{figure:bottleneck}
shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
of very deep networks without the gradients becoming too small as the distance
from the output layer increases.
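
To make the structure of such a unit concrete, the following sketch shows one
\enquote{bottleneck} residual unit in TensorFlow/Keras. It is only an illustration
under assumed layer choices (channel sizes, placement of batch normalization),
not the exact implementation used in this thesis.
\begin{verbatim}
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_unit(x, channels=256, reduced=64, stride=1):
    """Additive bottleneck unit: 1x1 reduce, 3x3, 1x1 expand, plus shortcut."""
    shortcut = x
    y = layers.Conv2D(reduced, 1, strides=stride, padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(reduced, 3, strides=1, padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(channels, 1, strides=1, padding='same')(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when the spatial or channel shape changes.
    if stride != 1 or x.shape[-1] != channels:
        shortcut = layers.Conv2D(channels, 1, strides=stride,
                                 padding='same')(x)
        shortcut = layers.BatchNormalization()(shortcut)
    return layers.ReLU()(y + shortcut)  # additive residual connection
\end{verbatim}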

In Table \ref{table:resnet}, we show the ResNet variant
that will serve as the basic CNN backbone of our networks, and
is also used in many other region-based convolutional networks.
The initial image data is always passed through the ResNet backbone as a first step to
bootstrap the complete deep network.
Note that for the Mask R-CNN architectures we describe below, the architecture shown is equivalent
to the standard ResNet-50 backbone.

We additionally introduce one small extension that
will be useful for our Motion R-CNN network.
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
to increase the bottleneck stride to 64, following FlowNetS.
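
As a sketch of this extension, the additional C$_6$ stage could be appended after
C$_5$ as one stride-2 bottleneck unit followed by stride-1 units, doubling the
backbone stride from 32 to 64. The number of units and channel width below are
assumptions for illustration, reusing the \texttt{bottleneck\_unit} sketch above.
\begin{verbatim}
def c6_stage(c5_features, channels=2048, num_units=3):
    """Assumed C6 extension: stride-2 first unit, then stride-1 units."""
    x = bottleneck_unit(c5_features, channels=channels, stride=2)
    for _ in range(num_units - 1):
        x = bottleneck_unit(x, channels=channels, stride=1)
    return x
\end{verbatim}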

\subsection{Region-based CNNs}
\label{ssec:rcnn}
@@ -266,6 +267,39 @@ Thus, given region proposals, all computation is reduced to a single pass throug
speeding up the system by two orders of magnitude at inference time and one order of magnitude
at training time.

\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN
and again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
predicts objectness scores and regresses bounding boxes at each of its output positions.
At any of the $h \times w$ output positions of the RPN head,
$\text{N}_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total.
In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding
to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios,
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).
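
To illustrate how this fixed anchor grid is laid out, the following NumPy sketch
enumerates all $\text{N}_a \times h \times w$ anchors; the exact box parametrization
and rounding used in Faster R-CNN implementations may differ.
\begin{verbatim}
import numpy as np

def generate_anchors(h, w, stride=16,
                     areas=(128**2, 256**2, 512**2),
                     ratios=(0.5, 1.0, 2.0)):
    """Return all N_a * h * w anchors as (cx, cy, width, height)."""
    shapes = []
    for area in areas:
        for ratio in ratios:           # ratio = height / width
            width = np.sqrt(area / ratio)
            shapes.append((width, width * ratio))
    shapes = np.array(shapes)          # (N_a, 2) base shapes per position
    # Anchor centers lie on the stride-16 grid of RPN output positions.
    cx, cy = np.meshgrid((np.arange(w) + 0.5) * stride,
                         (np.arange(h) + 0.5) * stride)
    centers = np.stack([cx, cy], axis=-1).reshape(-1, 1, 2)       # (h*w, 1, 2)
    sizes = np.broadcast_to(shapes, (h * w,) + shapes.shape)      # (h*w, N_a, 2)
    anchors = np.concatenate(
        [np.broadcast_to(centers, sizes.shape), sizes], axis=-1)  # (h*w, N_a, 4)
    return anchors.reshape(-1, 4)
\end{verbatim}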

For each RPN prediction at a given position, the objectness score tells us how likely it is to correspond to a detection.
The region proposals can then be obtained as the N highest scoring RPN predictions.
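
A minimal sketch of this proposal selection is given below; it decodes the regressed
offsets relative to the anchors and keeps the top scoring boxes. The offset
parametrization follows the usual Faster R-CNN convention, while the number of kept
proposals and the omission of non-maximum suppression and box clipping are
simplifications.
\begin{verbatim}
import numpy as np

def select_proposals(anchors, deltas, objectness, num_proposals=300):
    """anchors, deltas: (N_a*h*w, 4) as (cx, cy, w, h); objectness: (N_a*h*w,)."""
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(deltas[:, 2])
    h = anchors[:, 3] * np.exp(deltas[:, 3])
    boxes = np.stack([cx, cy, w, h], axis=-1)
    # Keep the N highest scoring predictions as region proposals.
    keep = np.argsort(-objectness)[:num_proposals]
    return boxes[keep], objectness[keep]
\end{verbatim}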

Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each of the region proposals, which are now obtained
from the RPN instead of being pre-computed by an external algorithm.
As in Fast R-CNN, RoI pooling is used to extract one fixed-size feature map for each of the region proposals,
and the refined bounding boxes are predicted separately for each object class.
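
As a rough illustration of this fixed-size feature extraction, the sketch below
crops and bilinearly resizes backbone features for each proposal with
\texttt{tf.image.crop\_and\_resize}; this behaves like a simple bilinear RoIAlign
rather than max-pooled RoI pooling, and the crop size and box format are
assumptions.
\begin{verbatim}
import tensorflow as tf

def extract_roi_features(features, boxes, image_size, crop_size=14):
    """features: (1, H/16, W/16, C) backbone output;
    boxes: (N, 4) proposals as (y1, x1, y2, x2) in input-image pixels."""
    img_h, img_w = image_size
    # crop_and_resize expects box coordinates normalized to [0, 1].
    norm_boxes = boxes / tf.constant([img_h, img_w, img_h, img_w], tf.float32)
    box_indices = tf.zeros(tf.shape(boxes)[0], dtype=tf.int32)  # single image
    # One fixed-size (crop_size x crop_size) feature map per proposal.
    return tf.image.crop_and_resize(features, norm_boxes, box_indices,
                                    crop_size=[crop_size, crop_size])
\end{verbatim}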

Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
(for Faster R-CNN, the mask head is ignored).

{
\begin{table}[t]
\centering
@@ -316,39 +350,6 @@ whereas Faster R-CNN uses RoI pooling.
\end{table}
}

\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN
and again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
predicts objectness scores and regresses bounding boxes at each of its output positions.
At any of the $h \times w$ output positions of the RPN head,
$\text{N}_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total.
In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding
to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios,
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).

For each RPN prediction at a given position, the objectness score tells us how likely it is to correspond to a detection.
The region proposals can then be obtained as the N highest scoring RPN predictions.

Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each of the region proposals, which are now obtained
from the RPN instead of being pre-computed by an external algorithm.
As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals,
and the refined bounding boxes are predicted separately for each object class.

Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
(for Faster R-CNN, the mask head is ignored).

\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
However, it can be helpful to know class and object (instance) membership of all individual pixels,
1 bib.bib
@@ -330,6 +330,7 @@
volume={1},
number={4},
pages={541-551},
month = dec,
journal = neco,
year = {1989}}
@@ -2,8 +2,8 @@

We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames from a monocular camera.
In addition to instance motions, our network estimates the 3D motion of the camera.
given an input of two consecutive frames (and XYZ point coordinates) from a monocular camera.
In addition to instance motions, our network estimates the 3D ego-motion of the camera.
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
@@ -20,7 +20,8 @@ the accuracy of the motion predictions is still not convincing.
More work will thus be required to bring the system (closer) to competitive accuracy,
which includes trying penalization with the flow loss instead of 3D motion ground truth,
experimenting with the weighting between different loss terms,
and improvements to the network architecture and training process.
and improvements to the network architecture, loss design, and training process.

We thus presented a partial step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
to previous end-to-end deep networks for dense motion estimation, the output
@@ -30,7 +31,7 @@ applications.
\subsection{Future Work}
\paragraph{Mask R-CNN baseline}
As our Mask R-CNN re-implementation is still not as accurate as reported in the
original paper, working on the implementation details of this baseline would be
original paper \cite{MaskRCNN}, working on the implementation details of this baseline would be
a critical, direct next step. Recently, a highly accurate, third-party implementation of Mask
R-CNN in TensorFlow was released, which should be studied to this end.
@@ -54,7 +55,7 @@ and optical flow ground truth to evaluate the composed flow field.
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.

As KITTI 2015 also provides object masks for moving objects, we could in principle
As KITTI 2015 also provides instance masks for moving objects, we could in principle
fine-tune on KITTI 2015 train alone. As long as we cannot evaluate our method on the
KITTI 2015 test set, this makes little sense, though.
@@ -78,7 +79,7 @@ the R-CNN,
this would however require using a different dataset for training it, as Virtual KITTI does not
provide stereo images.
If we used a specialized depth network, we could use stereo data
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI,
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset,
though we would lose the ability to easily train the system in an end-to-end manner.

As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test,
@@ -138,7 +139,7 @@ setting \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.

\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described an optical flow based loss for supervising instance motions
We already described a loss based on optical flow for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera
@@ -159,9 +160,9 @@ Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.

\paragraph{Temporal consistency}
A next step after the two aforementioned ones could be to extend our network to exploit more than two
A next step after the aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of classical energy-minimization based scene flow \cite{TemporalSF}.
context of classical energy-minimization-based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.
@@ -160,11 +160,12 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
For training the RPN and RoI heads and during inference,
we use the exact same number of proposals and RoIs as Mask R-CNN in
the ResNet and ResNet-FPN variants, respectively.
All losses are added up without additional weighting between the loss terms,
All losses (the original ones and our new motion losses)
are added up without additional weighting between the loss terms,
as in Mask R-CNN.
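
As a minimal illustration, this unweighted combination simply sums whatever loss
terms (detection, mask, and motion losses) are passed in; the helper below is a
sketch, not the exact code of our implementation.
\begin{verbatim}
import tensorflow as tf

def total_loss(loss_terms):
    """Sum of all detection, mask, and motion loss terms, without weighting."""
    return tf.add_n(list(loss_terms))
\end{verbatim}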

\paragraph{Initialization}
For initializing the C$_1$ to C$_5$ weights, we use a pre-trained
For initializing the C$_1$ to C$_5$ (see Table~\ref{table:resnet}) weights, we use a pre-trained
ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository.
Following the pre-existing TensorFlow implementation of Faster R-CNN,
we initialize all other hidden layers with He initialization \cite{He}.
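
A sketch of this initialization scheme is given below; the layer shown and the use
of \texttt{keras.applications} as the source of ImageNet weights are stand-ins for
the actual checkpoint and layers in our implementation.
\begin{verbatim}
import tensorflow as tf

# Newly added hidden layers: He (variance-scaling) initialization.
he_init = tf.keras.initializers.HeNormal()
new_layer = tf.keras.layers.Conv2D(256, 3, padding='same',
                                   kernel_initializer=he_init)

# C1-C5 backbone weights: restored from an ImageNet-pretrained ResNet-50.
backbone = tf.keras.applications.ResNet50(include_top=False,
                                          weights='imagenet')
\end{verbatim}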
@@ -185,7 +186,7 @@ are in general expected to output.
Visualization of results on Virtual KITTI with XYZ input, camera motion prediction and 3D motion supervision.
For each example, we show the results with Motion R-CNN ResNet and ResNet-FPN
in the upper and lower row, respectively.
From left to right, we show the input image with instance segmentation results as overlay,
From left to right, we show the first input frame with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
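
For reference, the per-pixel criterion used in these error maps can be computed as
in the following NumPy sketch (a simplified reading of the KITTI outlier
definition).
\begin{verbatim}
import numpy as np

def flow_correct(flow_pred, flow_gt):
    """A pixel is correct if its endpoint error is <= 3 px
    or <= 5% of the ground-truth flow magnitude."""
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # endpoint error
    mag = np.linalg.norm(flow_gt, axis=-1)
    return (epe <= 3.0) | (epe <= 0.05 * mag)
\end{verbatim}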
@@ -200,7 +201,7 @@ We visually compare a Motion R-CNN ResNet trained without (upper row) and
with (lower row) classifying the objects into moving and non-moving objects.
Note that in the selected example, all cars are parked, and thus the predicted
motion in the first row is an error.
From left to right, we show the input image with instance segmentation results as overlay,
From left to right, we show the first input frame with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
@@ -215,7 +216,7 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) i
\multicolumn{1}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} &\multicolumn{2}{c}{Flow Error} \\
\cmidrule(lr){1-1}\cmidrule(lr){2-6}\cmidrule(l){7-8}\cmidrule(l){9-10}
FPN & $E_{R} [deg]$ & $E_{t} [m]$ & $E_{p} [m] $ & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam} [deg]$ & $E_{t}^{cam} [m]$ & AEE & Fl-all \\\midrule
- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & -\% \\\midrule
- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & - \\\midrule
$\times$ & 0.301 & 0.237 & 3.331 & 0.790 & 0.916 & 0.087 & 0.053 & 11.17 & 24.91\% \\
\checkmark & 0.293 & 0.210 & 1.958 & 0.844 & 0.914 & 0.169 & 0.050 & 8.29 & 45.22\% \\
\bottomrule
@@ -237,14 +238,15 @@ to the average rotation angle in the ground truth camera motions.

For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both, the Motion R-CNN ResNet and ResNet-FPN variants and supervise
We train both the Motion R-CNN ResNet and ResNet-FPN variants, and supervise
camera and instance motions with 3D motion ground truth.
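
A minimal sketch of this input layout (the channel ordering is an assumption):
\begin{verbatim}
import tensorflow as tf

def make_network_input(rgb_t0, rgb_t1, xyz_t0, xyz_t1):
    """Concatenate both RGB frames and both per-pixel XYZ coordinate maps
    along the channel axis, giving an (H, W, 12) input tensor."""
    return tf.concat([rgb_t0, rgb_t1, xyz_t0, xyz_t1], axis=-1)
\end{verbatim}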

In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
In Figure \ref{figure:moving}, we visually justify the addition of the classifier
that decides between a moving and still object.
In Table \ref{table:vkitti}, we compare the performance of different network variants
In Table \ref{table:vkitti}, we compare various metrics for the Motion R-CNN
ResNet and ResNet-FPN network variants
on the Virtual KITTI validation set.

\paragraph{Camera motion}
@@ -264,8 +266,8 @@ helpful.

\paragraph{Instance motion}
The object pivots are estimated with relatively (given that the scenes are in a realistic scale)
high precision in both variants, although the FPN variant is significantly more
precise, which we ascribe to the higher resolution features used in this variant.
high accuracy in both variants, although the FPN variant is significantly more
accurate, which we ascribe to the higher resolution features used in this variant.

The predicted 3D object translations and rotations still have a relatively high
error, compared to the average actual (ground truth) translations and rotations,