diff --git a/abstract.tex b/abstract.tex index f600fb4..fb37a8d 100644 --- a/abstract.tex +++ b/abstract.tex @@ -43,8 +43,9 @@ ist die Umfunktionierung generischer Deep Networks ein beliebter Ansatz für klassische Probleme der Computer Vision geworden, die pixelweise Schätzung erfordern. -Diesem Trend folgend berechnen viele aktuelle end-to-end Deep Learning Methoden -für optischen Fluss oder Szenenfluss vollständige und hochauflösende Flussfelder mit generischen +Viele aktuelle end-to-end Deep Learning Methoden +für optischen Fluss oder Szenenfluss folgen diesem Trend und berechnen +vollständige und hochauflösende Flussfelder mit generischen Netzwerken für dichte, pixelweise Schätzung, und ignorieren damit die inhärente Struktur des zugrundeliegenden Bewegungschätzungsproblems und jegliche physikalische Randbedingungen innerhalb der Szene. diff --git a/approach.tex b/approach.tex index b291f90..8afdbb8 100644 --- a/approach.tex +++ b/approach.tex @@ -209,7 +209,7 @@ Then, in both, the ResNet-50 and ResNet-50-FPN variant (Table \ref{table:motionr convolution to the $C_5$ features to reduce the number of inputs to the following fully-connected layers. Instead of averaging, we use bilinear resizing to bring the convolutional features -to a fixed size without losing spatial information, +to a fixed size without losing all spatial information, flatten them, and finally apply multiple fully-connected layers to compute the camera motion prediction. @@ -217,19 +217,13 @@ camera motion prediction. In both of our network variants (Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}), we compute the fully-connected network for motion prediction from the -convolutional mask features, branching off right before the mask upsampling -deconvolution. The intuition behind this is that the final mask features contain -high resolution, spatial information about which positions belong to the object and -which belong to the background. Thus, we allow the motion estimation network to -make use of this data and ideally integrate the motion (image matching) information -localized within the object, but not that belonging to the background, -into the final object motion estimate. - +flattened RoI features, which are also the basis for classification and +bounding box refinement. \subsection{Supervision} \label{ssec:supervision} -\paragraph{Per-RoI supervision with 3D motion ground truth} +\paragraph{Per-RoI instance motion supervision with 3D instance motion ground truth} The most straightforward way to supervise the object motions is by using ground truth motions computed from ground truth object poses, which is in general only practical when training on synthetic datasets. @@ -284,14 +278,16 @@ If the ground truth shows that the camera is not moving, we again do not penalize rotation and translation. For the camera, the loss is reduced to the classification term in this case. -\paragraph{Per-RoI supervision \emph{without} 3D motion ground truth} +\paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth} A more general way to supervise the object motions is a re-projection loss similar to the unsupervised loss in SfM-Net \cite{SfmNet}, which we can apply to coordinates within the object bounding boxes, and which does not require ground truth 3D object motions. -In this case, for any RoI, we generate a uniform 2D grid of points inside the RPN proposal bounding box -with the same resolution as the predicted mask. 
We use the same bounding box
+In this case, for any RoI,
+we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box
+with the same resolution as the predicted mask.
+We use the same bounding box
to crop the corresponding region from the dense, full image depth map
and bilinearly resize the depth crop to the same resolution as the mask and point grid.
@@ -301,11 +297,18 @@ apply the RoI's predicted motion, masked by the predicted mask. Then, we apply
the camera motion to the points, project them back to 2D and finally compute the optical
flow at each point as the difference of the initial and re-projected 2D grids.
Note that we batch this computation over all RoIs, so that we only perform
-it once per forward pass. The mathematical details are analogous to the
-dense, full image flow computation in the following subsection and will not
-be repeated here. \todo{probably better to add the mathematical details, as it may otherwise be confusing at some points}
+it once per forward pass. Figure \ref{figure:flow_loss} illustrates the approach.
+The mathematical details for the 3D transformations and mappings between 2D and 3D are analogous to the
+dense, full image flow composition in the following subsection, so we will not
+include them here. The only differences are that there is no sum over objects during
+the point transformation based on instance motion, as we consider the single object
+corresponding to an RoI in isolation, and that the masks are not resized to the
+full image resolution, as
+the depth crops and 2D point grid are at the same resolution as the predicted
+$m \times m$ mask.
-For each RoI, we can now penalize the optical flow grid to supervise the object motion.
+For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion
+by penalizing the $m \times m$ optical flow grid.
If there is optical flow ground truth available, we can use the RoI bounding box to crop and resize
a region from the ground truth optical flow to match the RoI's optical flow grid
and penalize the difference between the flow grids with an $\ell_1$-loss.
@@ -336,6 +339,17 @@ and sample proposals and RoIs in the exact same way.
During inference, we proceed analogously to Mask R-CNN.
In the same way as the RoI mask head, at test time, we compute the RoI motion head from the
features extracted with refined bounding boxes.
+Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
+extracted RoI features before passing them into the motion head.
+The intuition behind this is that we want to mask out (set to zero) any positions in the
+extracted feature window that belong to the background. Then, the RoI motion
+head aggregates the motion (image matching) information from the backbone
+only over positions localized within the object, so that positions belonging
+to the background do not influence the final object motion estimate.
+
+Again, as for masks and bounding boxes in Mask R-CNN,
+the output object motion is the motion predicted for the
+highest scoring class.
\subsection{Dense flow from motion}
\label{ssec:postprocessing}
@@ -360,17 +374,21 @@ For now, the depth map is always assumed to come from ground truth.
Given $k$ detections with predicted motions as above, we transform all points within
the bounding box of a detected object according to the predicted motion of the object.
-We first define the \emph{full image} mask $m_t^k$ for object k,
-which can be computed from the predicted box mask $m_k^b$ by bilinearly resizing
-$m_k^b$ to the width and height of the predicted bounding box and then copying the values
-of the resized mask into a full resolution all-zeros map, starting at the top-right coordinate of the predicted bounding box.
-Then,
+We first define the \emph{full image} mask $M_t^k$ for object $k$,
+which can be computed from the predicted box mask $m_t^k$ by bilinearly resizing
+$m_t^k$ to the width and height of the predicted bounding box and then copying the values
+of the resized mask into a full resolution mask initialized with zeros,
+starting at the top-left coordinate of the predicted bounding box.
+Then, given the predicted motions $(R_t^k, t_t^k)$ as well as $p_t^k$ for all objects,
\begin{equation}
P'_{t+1} =
-P_t + \sum_1^{k} m_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
+P_t + \sum_1^{k} M_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
\end{equation}
+These motion predictions are understood to already take into account
+the classification into moving and still objects,
+so that, as described above, all objects with $o_t^k = 0$ are assigned the identity motion.
-Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$, % TODO introduce!
+Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$,
\begin{equation}
\begin{pmatrix}
@@ -380,8 +398,8 @@ X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
\end{equation}.
Note that in our experiments, we either use the ground truth camera motion to focus
-on the object motion predictions or the predicted camera motion to predict complete
-motion. We will always state which variant we use in the experimental section.
+on evaluating the object motion predictions or the predicted camera motion to evaluate
+the complete motion estimates. We will always state which variant we use in the experimental section.
Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
\begin{equation}
diff --git a/background.tex b/background.tex
index 1ab4c71..6787544 100644
--- a/background.tex
+++ b/background.tex
@@ -364,7 +364,7 @@ which has a stride of $4$ with respect to the input image.
Most importantly, the RoI features can now be extracted at the pyramid level $P_j$
appropriate for a RoI bounding box with size $h \times w$,
\begin{equation}
-j = \log_2(\sqrt{w \cdot h} / 224). %TODO complete
+j = \log_2(\sqrt{w \cdot h} / 224). \todo{complete}
\label{eq:level_assignment}
\end{equation}
@@ -613,6 +613,9 @@ with a maximum IoU of 0.7.
Then, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
after again extracting the corresponding features.
Thus, during inference, the features for the mask head are extracted using the refined
-bounding boxes, instead of the RPN bounding boxes. This is important for not
+bounding boxes for the predicted class, instead of the RPN bounding boxes. This is important for not
introducing any misalignment, as we want to create the instance mask inside of the
more precise, refined detection bounding boxes.
+Furthermore, note that bounding box and mask predictions for all classes other than the
+predicted (highest scoring) class are discarded, and thus the output bounding
+box and mask always correspond to the predicted class.
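As a worked illustration of the dense flow composition in the approach.tex hunk above, the following NumPy sketch back-projects the depth map to 3D points, applies the masked per-object rigid motions and the camera motion, and re-projects to obtain optical flow. It is a minimal sketch assuming a simple pinhole camera with focal length f and principal point (cx, cy); all names and array shapes are illustrative and not taken from the actual implementation, and points with zero depth are ignored. The per-RoI flow loss described earlier performs the same steps on $m \times m$ crops, without the sum over objects.

import numpy as np

def compose_dense_flow(depth, masks, R_obj, t_obj, pivots, R_cam, t_cam, f, cx, cy):
    # depth:  (H, W) ground truth depth at time t
    # masks:  (K, H, W) full image masks M_t^k for the K detections
    # R_obj:  (K, 3, 3) rotations R_t^k; t_obj, pivots: (K, 3) translations t_t^k and pivots p_t^k
    # R_cam:  (3, 3) camera rotation R_t^c; t_cam: (3,) camera translation t_t^c
    H, W = depth.shape
    x, y = np.meshgrid(np.arange(W, dtype=np.float64), np.arange(H, dtype=np.float64))
    # Back-project all pixels to 3D points P_t with the pinhole model.
    P = np.stack([(x - cx) / f * depth, (y - cy) / f * depth, depth], axis=-1)
    # Apply each object's rigid motion inside its full image mask (sum over objects).
    P_moved = P.copy()
    for k in range(masks.shape[0]):
        delta = (P - pivots[k]) @ R_obj[k].T + pivots[k] + t_obj[k] - P
        P_moved = P_moved + masks[k][..., None] * delta
    # Apply the camera motion to all points.
    P_next = P_moved @ R_cam.T + t_cam
    # Re-project to pixel coordinates; the flow is the difference to the original grid.
    x_next = f * P_next[..., 0] / P_next[..., 2] + cx
    y_next = f * P_next[..., 1] / P_next[..., 2] + cy
    return np.stack([x_next - x, y_next - y], axis=-1)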
diff --git a/bib.bib b/bib.bib
index 35976d7..ed50ad8 100644
--- a/bib.bib
+++ b/bib.bib
@@ -249,3 +249,21 @@
title = {Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification},
booktitle = {ICCV},
year = {2015}}
+
+@inproceedings{UnFlow,
+  author = {Simon Meister and Junhwa Hur and Stefan Roth},
+  title = {UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss},
+  booktitle = {AAAI},
+  year = {2018}}
+
+@inproceedings{UnsupDepth,
+  title = {Unsupervised CNN for single view depth estimation: Geometry to the rescue},
+  author = {Ravi Garg and BG Vijay Kumar and Gustavo Carneiro and Ian Reid},
+  booktitle = {ECCV},
+  year = {2016}}
+
+@inproceedings{UnsupFlownet,
+  title = {Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
+  author = {Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
+  booktitle = {ECCV Workshop on Brave new ideas for motion representations in videos},
+  year = {2016}}
diff --git a/conclusion.tex b/conclusion.tex
index b2cc6e4..cac9bd6 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -7,9 +7,9 @@ In addition to instance motions, our network estimates the 3D motion of the came
We combine all these estimates to yield a dense optical flow output from our end-to-end deep
network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
-us with all required ground truth data.
+us with all required ground truth data, and evaluated on the same domain.
During inference, our model does not add any significant computational overhead
-over the latest iterations of R-CNNs and is therefore just as fast and interesting
+over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore just as fast and interesting
for real time scenarios.
We thus presented a step towards real time 3D motion estimation based on
a physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
@@ -18,6 +18,19 @@ of our network is highly interpretable, which may also bring benefits for
safety applications.

\subsection{Future Work}
+\paragraph{Evaluation and fine-tuning on KITTI 2015}
+Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
+on which we do not train, but we have yet to evaluate on a real-world dataset.
+The best candidate to evaluate our complete model is the KITTI 2015 dataset \cite{KITTI2015},
+which provides depth ground truth to compose an optical flow field from our 3D motion estimates,
+and optical flow ground truth to evaluate the composed flow field.
+Note that with our current model, we can only evaluate on the \emph{train} set
+of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.
+
+As KITTI 2015 also provides object masks for moving objects, we could in principle
+fine-tune on the KITTI 2015 training set alone. As long as we cannot evaluate our method on the
+KITTI 2015 test set, however, this makes little sense.
+
\paragraph{Predicting depth}
In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
However, in many applications settings, we are not provided with any depth information.
@@ -26,15 +39,23 @@ from which no depth data is available.
To do so, we could integrate depth prediction into our network by
branching off a depth network from the backbone in parallel to the RPN
(Figure \ref{table:motionrcnn_resnet_fpn_depth}).
Alternatively, we could add a specialized network for end-to-end depth regression
-in parallel to the region-based network, e.g. \cite{GCNet}.
+in parallel to the region-based network (or before, to provide XYZ input to the R-CNN), e.g. \cite{GCNet}.
Although single-frame monocular depth prediction with deep networks was already
done to some level of success, our two-frame input should allow the network to make
use of epipolar geometry for making a more reliable depth estimate, at least when the camera
is moving.
We could also extend our method to stereo input data easily by concatenating
-all of the frames into the input image, which
-would however require using a different dataset for training, as Virtual KITTI does not
+all of the frames into the input image.
+If we choose the option of integrating the depth prediction directly into
+the R-CNN,
+this would, however, require using a different dataset for training it, as Virtual KITTI does not
provide stereo images.
+If we instead used a specialized depth network, we could use stereo data
+for depth prediction and still train the R-CNN independently on the monocular Virtual KITTI,
+though we would lose the ability to easily train the system in an end-to-end manner.
+
+As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test set,
+and also fine-tune on the training set as mentioned in the previous paragraph.
{
\begin{table}[h]
@@ -45,7 +66,7 @@ provide stereo images.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\ \midrule
-C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
+C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\ \midrule
@@ -64,7 +85,7 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfra
\end{tabular}
\caption
{
-Preliminary Motion R-CNN ResNet-50-FPN architecture with depth prediction,
+A possible Motion R-CNN ResNet-50-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
@@ -74,16 +95,41 @@ based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_re
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem, we trained Motion R-CNN on the simple
synthetic Virtual KITTI dataset for now.
-A next step will be training on a more realistic dataset.
+A next step will be training on a more realistic dataset,
+ideally without having to rely on synthetic data at all.
For this, we can first pre-train the RPN on an
instance segmentation dataset like Cityscapes \cite{Cityscapes}.
As soon as the RPN works reliably, we could execute alternating
-steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
-On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
-the motion losses (and depth prediction if added), as no instance segmentation ground truth exists.
+steps of training on, for example, Cityscapes and the KITTI 2015 stereo and optical flow datasets.
+On KITTI 2015 stereo and flow, we could run the instance segmentation component in testing mode and only penalize
+the motion losses (and depth prediction, if added), as no complete instance segmentation ground truth exists.
On Cityscapes, we could continue train the instance segmentation components to improve detection and masks
and avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
-instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
+instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth)
+prediction. Unsupervised deep learning of this kind has already been applied with some success in the optical flow
+setting \cite{UnsupFlownet, UnFlow},
+and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.
+
+\paragraph{Supervising the camera motion without 3D camera motion ground truth}
+We already described an optical-flow-based loss for supervising instance motions
+when we do not have 3D instance motion ground truth, or when we do not have
+any motion ground truth at all.
+However, it would also be useful to train our model without access to 3D camera
+motion ground truth.
+The 3D camera motion is already supervised indirectly when it is used in the flow-based
+RoI instance motion loss. Still, to use all available information from
+ground truth optical flow and obtain more accurate supervision,
+it would likely be beneficial to add a global, flow-based camera motion loss
+independent of the RoI supervision.
+To do this, one could use a re-projection loss conceptually identical to the one
+for supervising instance motions with ground truth flow. However, to adjust for the
+fact that the camera motion can only be accurately supervised with flow at positions where
+no object motion occurs, this loss would have to be masked with the ground truth
+object masks. Again, we could use this flow-based loss in an unsupervised way.
+For training on a dataset without any motion ground truth, e.g.
+Cityscapes, it may be critical to add this term in addition to an unsupervised
+loss for the instance motions.
+
\paragraph{Temporal consistency}
A next step after the two aforementioned ones could be to extend our network to exploit more than two
@@ -92,3 +138,16 @@ context of energy-minimization based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM}, into our
architecture, we could enable temporally consistent motion estimation from image sequences
of arbitrary length.
+
+\paragraph{Deeper networks for larger bottleneck strides}
+Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
+input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
+To accurately estimate the motion of objects with large displacements between
+the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
+We could do this easily in both of our network variants by adding one or more additional
+ResNet blocks. In the variant without FPN, these blocks would have to be placed
+after RoI feature extraction. In the FPN variant, the blocks could simply be
+added after the encoder C$_5$ bottleneck.
+To save memory, however, we could also consider modifying the underlying
+ResNet-50 architecture by increasing the number of blocks while reducing the number
+of layers in each block.
diff --git a/experiments.tex b/experiments.tex
index a3cea47..9e82b24 100644
--- a/experiments.tex
+++ b/experiments.tex
@@ -5,11 +5,11 @@ computations.
To make our code easy to extend and flexible, we build on the TensorFlow Object detection API \cite{TensorFlowObjectDetection},
which provides a Faster R-CNN baseline implementation.
On top of this, we implemented Mask R-CNN and the Feature Pyramid Network (FPN)
-as well as extensions for motion estimation and related evaluations
+as well as the Motion R-CNN extensions for motion estimation and related evaluations
and postprocessings.
In addition, we generated all ground truth for Motion R-CNN in the form of TFRecords
from the raw Virtual KITTI data to enable fast loading during training.
-Note that for RoI extraction and cropping operations,
+Note that for RoI extraction and bilinear crop and resize operations,
we use the \texttt{tf.crop\_and\_resize} TensorFlow function with interpolation set to
bilinear.
@@ -147,8 +147,14 @@ fn = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for predicted camera
motions.

-\subsection{Training Setup}
+\subsection{Virtual KITTI training setup}
\label{ssec:setup}
+
+For our initial experiments, we concatenate both RGB frames as
+well as the XYZ coordinates for both frames as input to the networks.
+We train both the Motion R-CNN ResNet-50 and ResNet-50-FPN variants.
+
+\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI training set.
@@ -172,8 +178,7 @@ Note that a larger weight prevented the angle sine estimates from properly
converging to the very small values they are in general expected to output.

-
-\subsection{Experiments on Virtual KITTI}
+\subsection{Virtual KITTI evaluation}
\label{ssec:vkitti}

\begin{figure}[t]
@@ -227,7 +232,8 @@ only impacted by the predicted 3D object motions.
\label{table:vkitti}
\end{table}
}
-Figure \ref{figure:vkitti} visualizes instance segmentation and optical flow
+
+In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
-Table \ref{table:vkitti} compares the performance of different network variants on the Virtual KITTI validation
-set.
+In Table \ref{table:vkitti}, we compare the performance of different network variants
+on the Virtual KITTI validation set.
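The experiments.tex hunk above states that the network input concatenates both RGB frames with the XYZ coordinates of both frames, and that RoI features are extracted with a bilinear crop and resize. The following TensorFlow sketch illustrates these two steps; the function names, shapes, and the crop size of 14 are assumptions for illustration rather than the actual implementation (which builds on the TensorFlow Object Detection API), and the operation written as \texttt{tf.crop\_and\_resize} in the text appears here under its public name tf.image.crop_and_resize.

import tensorflow as tf

def assemble_input(rgb_t, rgb_tp1, xyz_t, xyz_tp1):
    # Concatenate both RGB frames and both XYZ coordinate maps along the
    # channel axis, e.g. four (N, H, W, 3) tensors -> (N, H, W, 12).
    return tf.concat([rgb_t, rgb_tp1, xyz_t, xyz_tp1], axis=-1)

def extract_roi_features(feature_map, boxes, box_indices, crop_size=14):
    # Bilinear crop and resize of backbone features to a fixed spatial size,
    # with boxes given in normalized [y1, x1, y2, x2] coordinates.
    return tf.image.crop_and_resize(feature_map, boxes, box_indices,
                                    [crop_size, crop_size])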