Commit 9215f296a7 (parent f0050801a4): final

background.tex (135 changed lines)
@@ -17,6 +17,8 @@ to estimate disparity-based depth, however monocular depth estimation with deep
popular \cite{DeeperDepth, UnsupPoseDepth}.
In this preliminary work, we will assume per-pixel depth to be given.

\subsection{CNNs for dense motion estimation}

{
\begin{table}[h]
\centering

@@ -54,7 +56,6 @@ are used for refinement.
\end{table}
}

Deep convolutional neural network (CNN) architectures
\cite{ImageNetCNN, VGGNet, ResNet}
became widely popular through numerous successes in classification and recognition tasks.

@@ -134,30 +135,17 @@ and N$_{motions} = 3$.

\subsection{ResNet}
\label{ssec:resnet}

\begin{figure}[t]
\centering
\includegraphics[width=0.3\textwidth]{figures/bottleneck}
\caption{
ResNet \cite{ResNet} \enquote{bottleneck} convolutional block introduced to reduce computational
complexity in deeper network variants, shown here with 256 input and output channels.
Figure taken from \cite{ResNet}.
}
\label{figure:bottleneck}
\end{figure}

{
\begin{table}[h]

@@ -228,16 +216,29 @@ Batch normalization \cite{BN} is used after every residual unit.
\end{table}
}

ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but
became popular as a basic building block of many deep network architectures for a variety
of different tasks. Figure \ref{figure:bottleneck}
shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training
of very deep networks without the gradients becoming too small as the distance
from the output layer increases.

In Table \ref{table:resnet}, we show the ResNet variant
that will serve as the basic CNN backbone of our networks and
is also used in many other region-based convolutional networks.
The initial image data is always passed through the ResNet backbone as a first step to
bootstrap the complete deep network.
Note that for the Mask R-CNN architectures we describe below, the architecture shown is equivalent
to the standard ResNet-50 backbone.

We additionally introduce one small extension that
will be useful for our Motion R-CNN network.
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64.
For accurately estimating motions corresponding to larger pixel displacements, a larger
stride may be important.
Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants
to increase the bottleneck stride to 64, following FlowNetS.
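
To make the residual unit and the stride-64 extension concrete, here is a small illustrative sketch using the Keras API; it is not taken from the thesis implementation, and the exact placement of strides, the C$_6$ channel width, and all names are our own assumptions. With the default of 64 bottleneck channels, the unit has 256 output channels, matching the block in Figure \ref{figure:bottleneck}.

    from tensorflow.keras import layers

    def bottleneck(x, channels=64, stride=1):
        # Additive ResNet "bottleneck" unit: 1x1 reduce, 3x3, 1x1 expand, plus skip connection.
        shortcut = x
        out = layers.Conv2D(channels, 1, strides=stride, use_bias=False)(x)
        out = layers.BatchNormalization()(out)
        out = layers.ReLU()(out)
        out = layers.Conv2D(channels, 3, padding="same", use_bias=False)(out)
        out = layers.BatchNormalization()(out)
        out = layers.ReLU()(out)
        out = layers.Conv2D(4 * channels, 1, use_bias=False)(out)
        out = layers.BatchNormalization()(out)
        if stride != 1 or shortcut.shape[-1] != 4 * channels:
            # project the skip connection when resolution or channel width changes
            shortcut = layers.Conv2D(4 * channels, 1, strides=stride, use_bias=False)(shortcut)
            shortcut = layers.BatchNormalization()(shortcut)
        return layers.ReLU()(layers.Add()([out, shortcut]))

    # Hypothetical extra C_6 stage: one further stride-2 bottleneck stacked on the
    # C_5 output of ResNet-50 (overall stride 32), raising the bottleneck stride to 64.
    c5 = layers.Input(shape=(None, None, 2048))
    c6 = bottleneck(c5, channels=512, stride=2)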

\subsection{Region-based CNNs}
\label{ssec:rcnn}

@@ -266,6 +267,39 @@ Thus, given region proposals, all computation is reduced to a single pass throug
speeding up the system by two orders of magnitude at inference time and one order of magnitude
at training time.

\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time.
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and
classification into a single deep network, leading to faster test-time processing compared to Fast R-CNN
and, again, improved accuracy.
This unified network operates in two stages.
In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network,
which is a deep feature encoder CNN with the original image as input.
Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which
predicts objectness scores and regresses bounding boxes at each of its output positions.
At any of the $h \times w$ output positions of the RPN head,
$\text{N}_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different
aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total.
In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding
to anchor boxes with areas of $\{128^2, 256^2, 512^2\}$ pixels, and 3 aspect ratios,
$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16
with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}).

For each RPN prediction at a given position, the objectness score tells us how likely it is to correspond to a detection.
The region proposals can then be obtained as the $N$ highest-scoring RPN predictions.

Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification
and bounding box refinement for each of the region proposals, which are now obtained
from the RPN instead of being pre-computed by an external algorithm.
As in Fast R-CNN, RoI pooling is used to extract one fixed-size feature map for each of the region proposals,
and the refined bounding boxes are predicted separately for each object class.

Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
(for Faster R-CNN, the mask head is ignored).
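
As a concrete illustration of the anchor layout described above, the following NumPy sketch (ours, not the thesis code) enumerates the $\text{N}_a \times h \times w$ reference anchors for an $h \times w$ RPN output with stride 16, the three areas $\{128^2, 256^2, 512^2\}$, and the three aspect ratios $\{1:2, 1:1, 2:1\}$; the exact centering convention is an assumption:

    import numpy as np

    def anchor_grid(h, w, stride=16, areas=(128**2, 256**2, 512**2),
                    ratios=(0.5, 1.0, 2.0)):
        # Return all N_a * h * w anchors as (cx, cy, width, height) boxes in image coordinates.
        base = []
        for area in areas:
            for ratio in ratios:            # ratio = height / width
                aw = np.sqrt(area / ratio)
                ah = aw * ratio
                base.append((aw, ah))
        base = np.array(base)               # (N_a, 2), here N_a = 9

        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        centers = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)  # (h, w, 2)

        anchors = np.concatenate(
            [np.broadcast_to(centers[:, :, None, :], (h, w, len(base), 2)),
             np.broadcast_to(base, (h, w, len(base), 2))], axis=-1)
        return anchors.reshape(-1, 4)       # (N_a * h * w, 4)

    # e.g. a hypothetical 600x1000 input with stride 16 gives a 38x63 RPN output
    print(anchor_grid(38, 63).shape)        # (21546, 4) = 9 * 38 * 63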

{
\begin{table}[t]
\centering

@@ -316,39 +350,6 @@ whereas Faster R-CNN uses RoI pooling.
\end{table}
}

\paragraph{Mask R-CNN}
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
However, it can be helpful to know class and object (instance) membership of all individual pixels,

bib.bib (1 changed line)

@@ -330,6 +330,7 @@
volume={1},
number={4},
pages={541-551},
month = dec,
journal = neco,
year = {1989}}

@@ -2,8 +2,8 @@

We introduced Motion R-CNN, which enables 3D object motion estimation in parallel
to instance segmentation in the framework of region-based convolutional networks,
given an input of two consecutive frames (and XYZ point coordinates) from a monocular camera.
In addition to instance motions, our network estimates the 3D ego-motion of the camera.
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
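
To illustrate how such a composition can work in principle, here is a standalone NumPy sketch with hypothetical names (ours, not the thesis code); it shows only the flow induced by the camera motion, and the per-instance rigid motions would be applied analogously to the pixels inside each predicted instance mask:

    import numpy as np

    def flow_from_camera_motion(depth, K, R, t):
        # Dense optical flow induced by a rigid camera motion (R, t), given per-pixel
        # depth in the first frame and the camera intrinsics K.
        h, w = depth.shape
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).astype(np.float64)

        X = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)   # back-project to 3D
        X2 = X @ R.T + t                                          # apply the camera motion
        p2 = (K @ X2.T).T
        p2 = p2[:, :2] / p2[:, 2:3]                               # project to the second frame

        return (p2 - pix[:, :2]).reshape(h, w, 2)

    # a motion-free camera yields zero flow
    depth = np.ones((4, 5))
    K = np.array([[100.0, 0.0, 2.5], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
    print(np.allclose(flow_from_camera_motion(depth, K, np.eye(3), np.zeros(3)), 0.0))  # True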

Our model is trained on the synthetic Virtual KITTI dataset, which provides

@@ -20,7 +20,8 @@ the accuracy of the motion predictions is still not convincing.
More work will thus be required to bring the system (closer) to competitive accuracy,
which includes trying penalization with the flow loss instead of 3D motion ground truth,
experimenting with the weighting between different loss terms,
and improvements to the network architecture, loss design, and training process.

We thus presented a partial step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
to previous end-to-end deep networks for dense motion estimation, the output

@@ -30,7 +31,7 @@ applications.
\subsection{Future Work}
\paragraph{Mask R-CNN baseline}
As our Mask R-CNN re-implementation is still not as accurate as reported in the
original paper \cite{MaskRCNN}, working on the implementation details of this baseline would be
a critical, direct next step. Recently, a highly accurate, third-party implementation of Mask
R-CNN in TensorFlow was released, which should be studied to this end.

@@ -54,7 +55,7 @@ and optical flow ground truth to evaluate the composed flow field.
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.

As KITTI 2015 also provides instance masks for moving objects, we could in principle
fine-tune on KITTI 2015 train alone. As long as we cannot evaluate our method on the
KITTI 2015 test set, this makes little sense, though.
|
|||||||
this would however require using a different dataset for training it, as Virtual KITTI does not
|
this would however require using a different dataset for training it, as Virtual KITTI does not
|
||||||
provide stereo images.
|
provide stereo images.
|
||||||
If we would use a specialized depth network, we could use stereo data
|
If we would use a specialized depth network, we could use stereo data
|
||||||
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI,
|
for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset,
|
||||||
though we would loose the ability to easily train the system in an end-to-end manner.
|
though we would loose the ability to easily train the system in an end-to-end manner.
|
||||||
|
|
||||||
As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test,
|
As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test,
|
||||||

@@ -138,7 +139,7 @@ setting \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.

\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described a loss based on optical flow for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera

@@ -159,9 +160,9 @@ Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.

\paragraph{Temporal consistency}
A next step after the aforementioned ones could be to extend our network to exploit more than two
temporally consecutive frames, which has previously been shown to be beneficial in the
context of classical energy-minimization-based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.

@@ -160,11 +160,12 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
For training the RPN and RoI heads and during inference,
we use the exact same number of proposals and RoIs as Mask R-CNN in
the ResNet and ResNet-FPN variants, respectively.
All losses (the original ones and our new motion losses)
are added up without additional weighting between the loss terms,
as in Mask R-CNN.
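
A minimal sketch of this unweighted combination (the loss names and values below are placeholders, not the actual terms or magnitudes):

    def total_loss(losses):
        # Sum all loss terms without extra weighting, as in Mask R-CNN.
        return sum(losses.values())

    losses = {"rpn_objectness": 0.4, "rpn_box": 0.2, "cls": 0.5,
              "box": 0.3, "mask": 0.6, "camera_motion": 0.1, "instance_motion": 0.2}
    print(round(total_loss(losses), 2))  # 2.3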

\paragraph{Initialization}
For initializing the C$_1$ to C$_5$ (see Table~\ref{table:resnet}) weights, we use a pre-trained
ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository.
Following the pre-existing TensorFlow implementation of Faster R-CNN,
we initialize all other hidden layers with He initialization \cite{He}.
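
He initialization \cite{He} draws weights from a zero-mean Gaussian with standard deviation $\sqrt{2 / n}$, where $n$ is the fan-in of the layer; a small illustrative sketch (hypothetical shapes, not the initializer call actually used):

    import numpy as np

    def he_init(shape, rng=None):
        # He (MSRA) initialization for a conv kernel of shape (kh, kw, c_in, c_out):
        # zero-mean Gaussian with std sqrt(2 / fan_in), fan_in = kh * kw * c_in.
        rng = rng or np.random.default_rng(0)
        fan_in = shape[0] * shape[1] * shape[2]
        return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

    w = he_init((3, 3, 256, 256))
    print(round(w.std(), 4))  # about 0.0295 = sqrt(2 / (3 * 3 * 256))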

@@ -185,7 +186,7 @@ are in general expected to output.
Visualization of results on Virtual KITTI with XYZ input, camera motion prediction and 3D motion supervision.
For each example, we show the results with Motion R-CNN ResNet and ResNet-FPN
in the upper and lower row, respectively.
From left to right, we show the first input frame with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}

@@ -200,7 +201,7 @@ We visually compare a Motion R-CNN ResNet trained without (upper row) and
with (lower row) classifying the objects into moving and non-moving objects.
Note that in the selected example, all cars are parked, and thus the predicted
motion in the first row is an error.
From left to right, we show the first input frame with instance segmentation results as overlay,
the estimated flow, as well as the flow error map.
The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
}
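
The moving vs. non-moving classification referred to in this caption can be thought of as a gate on the predicted per-instance rigid motion; a simplified sketch with hypothetical names (our illustration, not the thesis code):

    import numpy as np

    def gate_instance_motion(p_moving, R, t, threshold=0.5):
        # Replace the predicted rigid motion of an instance classified as
        # non-moving with the identity motion.
        if p_moving < threshold:
            return np.eye(3), np.zeros(3)
        return R, t

    # a parked car with a low moving-probability keeps zero motion
    R_pred, t_pred = gate_instance_motion(0.1, np.eye(3), np.array([0.5, 0.0, 1.2]))
    print(t_pred)  # [0. 0. 0.]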

@@ -215,7 +216,7 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) i
\multicolumn{1}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} & \multicolumn{2}{c}{Flow Error} \\
\cmidrule(lr){1-1}\cmidrule(lr){2-6}\cmidrule(l){7-8}\cmidrule(l){9-10}
FPN & $E_{R} [deg]$ & $E_{t} [m]$ & $E_{p} [m]$ & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam} [deg]$ & $E_{t}^{cam} [m]$ & AEE & Fl-all \\\midrule
- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & - \\\midrule
$\times$ & 0.301 & 0.237 & 3.331 & 0.790 & 0.916 & 0.087 & 0.053 & 11.17 & 24.91\% \\
\checkmark & 0.293 & 0.210 & 1.958 & 0.844 & 0.914 & 0.169 & 0.050 & 8.29 & 45.22\% \\
\bottomrule
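
For reference, the two flow error measures in the last columns can be computed as in the following NumPy sketch (our own, using the standard definitions: AEE is the mean endpoint error, and Fl-all counts a pixel as an outlier if its endpoint error exceeds both 3 px and 5\% of the ground truth flow magnitude, mirroring the correctness criterion of the error maps above):

    import numpy as np

    def flow_errors(flow_pred, flow_gt):
        # Average endpoint error (AEE) and KITTI-style outlier percentage (Fl-all).
        epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)   # per-pixel endpoint error
        mag = np.linalg.norm(flow_gt, axis=-1)
        outlier = (epe > 3.0) & (epe > 0.05 * mag)           # wrong if > 3 px AND > 5 %
        return epe.mean(), 100.0 * outlier.mean()

    flow_gt = np.zeros((2, 2, 2))
    flow_pred = np.full((2, 2, 2), 4.0)      # every pixel is off by sqrt(32) > 3 px
    aee, fl_all = flow_errors(flow_pred, flow_gt)
    print(round(aee, 2), fl_all)             # 5.66 100.0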

@@ -237,14 +238,15 @@ to the average rotation angle in the ground truth camera motions.

For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both the Motion R-CNN ResNet and ResNet-FPN variants, and supervise
camera and instance motions with 3D motion ground truth.
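
A minimal sketch of this input layout (shapes and channel order are our assumptions, not necessarily those of the implementation):

    import numpy as np

    def network_input(im1, im2, xyz1, xyz2):
        # Concatenate both RGB frames and both per-pixel XYZ coordinate maps
        # along the channel axis, giving an (H, W, 12) input tensor.
        return np.concatenate([im1, im2, xyz1, xyz2], axis=-1)

    h, w = 192, 640  # hypothetical input resolution
    x = network_input(np.zeros((h, w, 3)), np.zeros((h, w, 3)),
                      np.zeros((h, w, 3)), np.zeros((h, w, 3)))
    print(x.shape)  # (192, 640, 12)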

In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
In Figure \ref{figure:moving}, we visually justify the addition of the classifier
that decides between a moving and a still object.
In Table \ref{table:vkitti}, we compare various metrics for the Motion R-CNN
ResNet and ResNet-FPN network variants
on the Virtual KITTI validation set.

\paragraph{Camera motion}

@@ -264,8 +266,8 @@ helpful.

\paragraph{Instance motion}
The object pivots are estimated with relatively high accuracy in both variants
(given that the scenes are at a realistic scale), although the FPN variant is significantly more
accurate, which we ascribe to the higher-resolution features used in this variant.

The predicted 3D object translations and rotations still have a relatively high
error, compared to the average actual (ground truth) translations and rotations,