diff --git a/background.tex b/background.tex index 86ace88..4d64c71 100644 --- a/background.tex +++ b/background.tex @@ -17,6 +17,8 @@ to estimate disparity-based depth, however monocular depth estimation with deep popular \cite{DeeperDepth, UnsupPoseDepth}. In this preliminary work, we will assume per-pixel depth to be given. +\subsection{CNNs for dense motion estimation} + { \begin{table}[h] \centering @@ -54,7 +56,6 @@ are used for refinement. \end{table} } -\subsection{CNNs for dense motion estimation} Deep convolutional neural network (CNN) architectures \cite{ImageNetCNN, VGGNet, ResNet} became widely popular through numerous successes in classification and recognition tasks. @@ -134,30 +135,17 @@ and N$_{motions} = 3$. \subsection{ResNet} \label{ssec:resnet} -ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but -became popular as basic building block of many deep network architectures for a variety -of different tasks. Figure \ref{figure:bottleneck} -shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training -of very deep networks without the gradients becoming too small as the distance -from the output layer increases. - -In Table \ref{table:resnet}, we show the ResNet variant -that will serve as the basic CNN backbone of our networks, and -is also used in many other region-based convolutional networks. -The initial image data is always passed through the ResNet backbone as a first step to -bootstrap the complete deep network. -Note that for the Mask R-CNN architectures we describe below, the architecture shown is equivalent -to the standard ResNet-50 backbone. - -We additionally introduce one small extension that -will be useful for our Motion R-CNN network. -In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the -input image resolution. In FlowNetS \cite{FlowNet}, their bottleneck stride is 64. -For accurately estimating motions corresponding to larger pixel displacements, a larger -stride may be important. -Thus, we add a additional C$_6$ block to be used in the Motion R-CNN ResNet variants -to increase the bottleneck stride to 64, following FlowNetS. +\begin{figure}[t] + \centering + \includegraphics[width=0.3\textwidth]{figures/bottleneck} +\caption{ +ResNet \cite{ResNet} \enquote{bottleneck} convolutional block introduced to reduce computational +complexity in deeper network variants, shown here with 256 input and output channels. +Figure taken from \cite{ResNet}. +} +\label{figure:bottleneck} +\end{figure} { \begin{table}[h] \centering @@ -228,16 +216,29 @@ Batch normalization \cite{BN} is used after every residual unit. \end{table} } -\begin{figure}[t] - \centering - \includegraphics[width=0.3\textwidth]{figures/bottleneck} -\caption{ -ResNet \cite{ResNet} \enquote{bottleneck} convolutional block introduced to reduce computational -complexity in deeper network variants, shown here with 256 input and output channels. -Figure taken from \cite{ResNet}. -} -\label{figure:bottleneck} -\end{figure} +ResNet (Residual Network) \cite{ResNet} was initially introduced as a CNN for image classification, but +became popular as a basic building block of many deep network architectures for a variety +of different tasks. Figure \ref{figure:bottleneck} +shows the fundamental building block of ResNet. The additive \emph{residual unit} enables the training +of very deep networks without the gradients becoming too small as the distance +from the output layer increases.
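To make the additive residual unit described in the re-added paragraph above concrete, a bottleneck block with the 256-to-64-to-256 channel layout of the bottleneck figure could be sketched as follows in Python. This is an illustration only (PyTorch-style; the class name, layer hyperparameters, and placement of batch normalization are assumptions), not the thesis' actual TensorFlow backbone code.

import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    # 1x1 reduce -> 3x3 -> 1x1 expand, with an additive identity skip connection
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(reduced)
        self.conv2 = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(reduced)
        self.conv3 = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        # the skip connection lets gradients reach earlier layers directly
        return self.relu(out + identity)

out = Bottleneck()(torch.randn(1, 256, 56, 56))  # output shape equals input shape

The identity path is what keeps gradients from becoming too small as depth grows, which is the property the surrounding text relies on when stacking many such blocks.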
+ +In Table \ref{table:resnet}, we show the ResNet variant +that will serve as the basic CNN backbone of our networks, and +is also used in many other region-based convolutional networks. +The initial image data is always passed through the ResNet backbone as a first step to +bootstrap the complete deep network. +Note that for the Mask R-CNN architectures we describe below, the architecture shown is equivalent +to the standard ResNet-50 backbone. + +We additionally introduce one small extension that +will be useful for our Motion R-CNN network. +In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the +input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64. +For accurately estimating motions corresponding to larger pixel displacements, a larger +stride may be important. +Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants +to increase the bottleneck stride to 64, following FlowNetS. \subsection{Region-based CNNs} \label{ssec:rcnn} @@ -266,6 +267,39 @@ Thus, given region proposals, all computation is reduced to a single pass throug speeding up the system by two orders of magnitude at inference time and one order of magnitude at training time. + +\paragraph{Faster R-CNN} +After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal +algorithm, which has to be run prior to the network passes and makes up a large portion of the total +processing time. +The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and +classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN +and, again, improved accuracy. +This unified network operates in two stages. +In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network, +which is a deep feature encoder CNN with the original image as input. +Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which +predicts objectness scores and regresses bounding boxes at each of its output positions. +At any of the $h \times w$ output positions of the RPN head, +$\text{N}_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different +aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total. +In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding +to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios, +$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16 +with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}). + +For each RPN prediction at a given position, the objectness score tells us how likely it is to correspond to a detection. +The region proposals can then be obtained as the N highest-scoring RPN predictions. + +Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification +and bounding box refinement for each of the region proposals, which are now obtained +from the RPN instead of being pre-computed by an external algorithm. +As in Fast R-CNN, RoI pooling is used to extract one fixed-size feature map for each of the region proposals, +and the refined bounding boxes are predicted separately for each object class.
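The anchor bookkeeping described in the paragraph above (9 anchors per position, from the 3 listed areas and 3 aspect ratios, on a stride-16 grid) can be made concrete with a few lines of Python. The function name, the half-cell centering offset, and the example grid size are illustrative assumptions rather than details of the Faster R-CNN implementation.

import numpy as np

def generate_anchors(h, w, stride=16,
                     areas=(128**2, 256**2, 512**2),
                     ratios=(0.5, 1.0, 2.0)):
    """Return all N_a * h * w reference anchors as (cx, cy, width, height)."""
    base = []
    for area in areas:
        for ratio in ratios:                      # ratio = height / width
            width = np.sqrt(area / ratio)
            base.append((width, width * ratio))
    base = np.array(base)                         # (N_a, 2) anchor sizes, N_a = 9
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    centers = np.stack([(xs + 0.5) * stride, (ys + 0.5) * stride], axis=-1)
    centers = np.broadcast_to(centers[:, :, None, :], (h, w, len(base), 2))
    sizes = np.broadcast_to(base, (h, w, len(base), 2))
    return np.concatenate([centers, sizes], axis=-1).reshape(-1, 4)

anchors = generate_anchors(h=40, w=128)           # an illustrative RPN output grid size
assert anchors.shape == (9 * 40 * 128, 4)

The RPN head then scores each of these reference anchors for objectness and regresses box offsets relative to them, which is exactly the N_a x h x w bookkeeping used in the text.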
+ +Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture +(for Faster R-CNN, the mask head is ignored). + { \begin{table}[t] \centering @@ -316,39 +350,6 @@ whereas Faster R-CNN uses RoI pooling. \end{table} } - -\paragraph{Faster R-CNN} -After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal -algorithm, which has to be run prior to the network passes and makes up a large portion of the total -processing time. -The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and -classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN -and again, improved accuracy. -This unified network operates in two stages. -In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network, -which is a deep feature encoder CNN with the original image as input. -Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which -predicts objectness scores and regresses bounding boxes at each of its output positions. -At any of the $h \times w$ output positions of the RPN head, -$\text{N}_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $\text{N}_a$ \emph{anchors} with different -aspect ratios and scales. Thus, there are $\text{N}_a \times h \times w$ reference anchors in total. -In Faster R-CNN, $\text{N}_a = 9$, with 3 scales, corresponding -to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios, -$\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16 -with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}). - -For each RPN prediction at a given position, the objectness score tells us how likely it is to correspond to a detection. -The region proposals can then be obtained as the N highest scoring RPN predictions. - -Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification -and bounding box refinement for each of the region proposals, which are now obtained -from the RPN instead of being pre-computed by an external algorithm. -As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals, -and the refined bounding boxes are predicted separately for each object class. - -Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture -(for Faster R-CNN, the mask head is ignored). - \paragraph{Mask R-CNN} Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity. However, it can be helpful to know class and object (instance) membership of all individual pixels, diff --git a/bib.bib b/bib.bib index 8f4ee20..d02cc5e 100644 --- a/bib.bib +++ b/bib.bib @@ -330,6 +330,7 @@ volume={1}, number={4}, pages={541-551}, + month = dec, journal = neco, year = {1989}} diff --git a/conclusion.tex b/conclusion.tex index 8a73061..b3c00cb 100644 --- a/conclusion.tex +++ b/conclusion.tex @@ -2,8 +2,8 @@ We introduced Motion R-CNN, which enables 3D object motion estimation in parallel to instance segmentation in the framework of region-based convolutional networks, -given an input of two consecutive frames from a monocular camera. -In addition to instance motions, our network estimates the 3D motion of the camera. 
+given an input of two consecutive frames (and XYZ point coordinates) from a monocular camera. +In addition to instance motions, our network estimates the 3D ego-motion of the camera. We combine all these estimates to yield a dense optical flow output from our end-to-end deep network. Our model is trained on the synthetic Virtual KITTI dataset, which provides @@ -20,7 +20,8 @@ the accuracy of the motion predictions is still not convincing. More work will be thus required to bring the system (closer) to competetive accuracy, which includes trying penalization with the flow loss instead of 3D motion ground truth, experimenting with the weighting between different loss terms, -and improvements to the network architecture and training process. +and improvements to the network architecture, loss design, and training process. + We thus presented a partial step towards real time 3D motion estimation based on a physically sound scene decomposition. Thanks to instance-level reasoning, in contrast to previous end-to-end deep networks for dense motion estimation, the output @@ -30,7 +31,7 @@ applications. \subsection{Future Work} \paragraph{Mask R-CNN baseline} As our Mask R-CNN re-implementation is still not as accurate as reported in the -original paper, working on the implementation details of this baseline would be +original paper \cite{MaskRCNN}, working on the implementation details of this baseline would be a critical, direct next step. Recently, a highly accurate, third-party implementation of Mask R-CNN in TensorFlow was released, which should be studied to this end. @@ -54,7 +55,7 @@ and optical flow ground truth to evaluate the composed flow field. Note that with our current model, we can only evaluate on the \emph{train} set of KITTI 2015, as there is no public depth ground truth for the \emph{test} set. -As KITTI 2015 also provides object masks for moving objects, we could in principle +As KITTI 2015 also provides instance masks for moving objects, we could in principle fine-tune on KITTI 2015 train alone. As long as we can not evaluate our method on the KITTI 2015 test set, this makes little sense, though. @@ -78,7 +79,7 @@ the R-CNN, this would however require using a different dataset for training it, as Virtual KITTI does not provide stereo images. If we would use a specialized depth network, we could use stereo data -for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI, +for depth prediction and still train Motion R-CNN independently on the monocular Virtual KITTI dataset, though we would loose the ability to easily train the system in an end-to-end manner. As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test, @@ -138,7 +139,7 @@ setting \cite{UnsupFlownet, UnFlow}, and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}. \paragraph{Supervising the camera motion without 3D camera motion ground truth} -We already described an optical flow based loss for supervising instance motions +We already described a loss based on optical flow for supervising instance motions when we do not have 3D instance motion ground truth, or when we do not have any motion ground truth at all. However, it would also be useful to train our model without access to 3D camera @@ -159,9 +160,9 @@ Cityscapes, it may be critical to add this term in addition to an unsupervised loss for the instance motions. 
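The paragraphs above refer to the dense flow field that is composed from the given depth and the estimated 3D motions. As a rough sketch of the rigid-motion part of such a composition in Python (the sign conventions for R and t, the intrinsics matrix K, and the handling of per-instance motions and occlusions are assumptions here, not details taken from the thesis code):

import numpy as np

def flow_from_rigid_motion(depth, K, R, t):
    """Dense optical flow induced by a rigid motion (R, t), given per-pixel depth.

    depth: (H, W) depth map of frame 1, K: (3, 3) camera intrinsics,
    R, t: rotation and translation taking frame-1 camera coordinates to frame 2.
    """
    H, W = depth.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T.astype(float)
    points = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project to 3D
    points2 = R @ points + np.asarray(t).reshape(3, 1)       # apply the rigid motion
    proj = K @ points2
    proj = proj[:2] / proj[2:3]                               # project into frame 2
    return (proj - pix[:2]).T.reshape(H, W, 2)                # per-pixel displacement

Applying the predicted ego-motion everywhere and the combined camera-and-instance motion inside each predicted mask is, presumably, how the composed flow field evaluated above is obtained; the exact composition used by the model is not shown in this diff.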
\paragraph{Temporal consistency} -A next step after the two aforementioned ones could be to extend our network to exploit more than two +A next step after the aforementioned ones could be to extend our network to exploit more than two temporally consecutive frames, which has previously been shown to be beneficial in the -context of classical energy-minimization based scene flow \cite{TemporalSF}. +context of classical energy-minimization-based scene flow \cite{TemporalSF}. In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM}, into our architecture, we could enable temporally consistent motion estimation from image sequences of arbitrary length. diff --git a/experiments.tex b/experiments.tex index 5c01e61..bd1d58c 100644 --- a/experiments.tex +++ b/experiments.tex @@ -160,11 +160,12 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations. For training the RPN and RoI heads and during inference, we use the exact same number of proposals and RoIs as Mask R-CNN in the ResNet and ResNet-FPN variants, respectively. -All losses are added up without additional weighting between the loss terms, +All losses (the original ones and our new motion losses) +are added up without additional weighting between the loss terms, as in Mask R-CNN. \paragraph{Initialization} -For initializing the C$_1$ to C$_5$ weights, we use a pre-trained +For initializing the weights of C$_1$ to C$_5$ (see Table~\ref{table:resnet}), we use a pre-trained ImageNet \cite{ImageNet} checkpoint from the official TensorFlow repository. Following the pre-existing TensorFlow implementation of Faster R-CNN, we initialize all other hidden layers with He initialization \cite{He}. @@ -185,7 +186,7 @@ are in general expected to output. Visualization of results on Virtual KITTI with XYZ input, camera motion prediction and 3D motion supervision. For each example, we show the results with Motion R-CNN ResNet and ResNet-FPN in the upper and lower row, respectively. -From left to right, we show the input image with instance segmentation results as overlay, +From left to right, we show the first input frame with instance segmentation results as an overlay, the estimated flow, as well as the flow error map. The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones. } @@ -200,7 +201,7 @@ We visually compare a Motion R-CNN ResNet trained without (upper row) and with (lower row) classifying the objects into moving and non-moving objects. Note that in the selected example, all cars are parking, and thus the predicted motion in the first row is an error. -From left to right, we show the input image with instance segmentation results as overlay, +From left to right, we show the first input frame with instance segmentation results as an overlay, the estimated flow, as well as the flow error map. The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) in blue and wrong estimates in red tones.
} @@ -215,7 +216,7 @@ The flow error map depicts correct estimates ($\leq 3$ px or $\leq 5\%$ error) i \multicolumn{1}{c}{Network} & \multicolumn{5}{c}{Instance Motion} & \multicolumn{2}{c}{Camera Motion} &\multicolumn{2}{c}{Flow Error} \\ \cmidrule(lr){1-1}\cmidrule(lr){2-6}\cmidrule(l){7-8}\cmidrule(l){9-10} FPN & $E_{R} [deg]$ & $E_{t} [m]$ & $E_{p} [m] $ & $O_{pr}$ & $O_{rc}$ & $E_{R}^{cam} [deg]$ & $E_{t}^{cam} [m]$ & AEE & Fl-all \\\midrule -- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & -\% \\\midrule +- & (0.279) & (0.442) & - & - & - & (0.220) & (0.684) & - & - \\\midrule $\times$ & 0.301 & 0.237 & 3.331 & 0.790 & 0.916 & 0.087 & 0.053 & 11.17 & 24.91\% \\ \checkmark & 0.293 & 0.210 & 1.958 & 0.844 & 0.914 & 0.169 & 0.050 & 8.29 & 45.22\% \\ \bottomrule @@ -237,14 +238,15 @@ to the average rotation angle in the ground truth camera motions. For our initial experiments, we concatenate both RGB frames as well as the XYZ coordinates for both frames as input to the networks. -We train both, the Motion R-CNN ResNet and ResNet-FPN variants and supervise +We train both the Motion R-CNN ResNet and ResNet-FPN variants and supervise camera and instance motions with 3D motion ground truth. In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow results on the Virtual KITTI validation set. In Figure \ref{figure:moving}, we visually justify the addition of the classifier that decides between a moving and still object. -In Table \ref{table:vkitti}, we compare the performance of different network variants +In Table \ref{table:vkitti}, we compare various metrics for the Motion R-CNN +ResNet and ResNet-FPN network variants on the Virtual KITTI validation set. \paragraph{Camera motion} @@ -264,8 +266,8 @@ helpful. \paragraph{Instance motion} The object pivots are estimated with relatively (given that the scenes are in a realistic scale) -high precision in both variants, although the FPN variant is significantly more -precise, which we ascribe to the higher resolution features used in this variant. +high accuracy in both variants, although the FPN variant is significantly more +accurate, which we ascribe to the higher-resolution features used in this variant. The predicted 3D object translations and rotations still have a relatively high error, compared to the average actual (ground truth) translations and rotations, diff --git a/figures/flow_loss.pdf b/figures/flow_loss.pdf index ae73758..5f3a84d 100755 Binary files a/figures/flow_loss.pdf and b/figures/flow_loss.pdf differ diff --git a/figures/moving.pdf b/figures/moving.pdf index 16d91ee..3656e83 100755 Binary files a/figures/moving.pdf and b/figures/moving.pdf differ diff --git a/figures/net_intro.pdf b/figures/net_intro.pdf index 264eabd..61b41dd 100755 Binary files a/figures/net_intro.pdf and b/figures/net_intro.pdf differ
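For reference, the AEE and Fl-all columns of the results table edited above follow the usual KITTI convention, under which a flow estimate at a pixel counts as wrong only if its endpoint error exceeds both 3 px and 5% of the ground-truth magnitude (consistent with the error-map captions). A small Python sketch of that evaluation, assuming dense (H, W, 2) flow fields and no validity mask (the thesis' actual evaluation code is not shown in this diff):

import numpy as np

def flow_error_metrics(flow_est, flow_gt):
    """Average endpoint error (AEE) and outlier percentage (Fl-all)."""
    epe = np.linalg.norm(flow_est - flow_gt, axis=-1)    # per-pixel endpoint error
    mag = np.linalg.norm(flow_gt, axis=-1)               # ground-truth flow magnitude
    aee = float(epe.mean())
    outliers = (epe > 3.0) & (epe > 0.05 * mag)          # wrong: > 3 px and > 5 % error
    fl_all = 100.0 * float(outliers.mean())
    return aee, fl_all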