diff --git a/abstract.tex b/abstract.tex index 49a0bf0..958867c 100644 --- a/abstract.tex +++ b/abstract.tex @@ -59,9 +59,9 @@ das von slanted-plane Energieminimierungsmethoden für Szenenfluss inspiriert is Hierbei bauen wir auf den aktuellen Fortschritten in regionsbasierten Convolutional Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentierung. -Bei Eingabe von zwei aufeinanderfolgenden frames aus einer monokularen RGB-D +Bei Eingabe von zwei aufeinanderfolgenden Frames aus einer monokularen RGB-D Kamera erkennt unser end-to-end Deep Network Objekte mit pixelgenauen Objektmasken -und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den frames ab. +und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den Frames ab. Indem wir zusätzlich im selben Netzwerk die globale Kamerabewegung schätzen, setzen wir aus den instanzbasierten und globalen Bewegungsschätzungen ein dichtes optisches Flussfeld zusammen. diff --git a/approach.tex b/approach.tex index 0fab62d..d0d1df1 100644 --- a/approach.tex +++ b/approach.tex @@ -279,6 +279,20 @@ penalize rotation and translation. For the camera, the loss is reduced to the classification term in this case. \paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth} +\begin{figure}[t] + \centering + \includegraphics[width=\textwidth]{figures/flow_loss} +\caption{ +Overview of the alternative, optical-flow-based loss for instance motion +supervision without 3D instance motion ground truth. +In contrast to SfM-Net, where a single optical flow field is +composed and penalized to supervise the motion prediction, our loss considers +the motion of all objects in isolation and composes a batch of flow windows +for the RoIs. +} +\label{figure:flow_loss} +\end{figure} + A more general way to supervise the object motions is a re-projection loss similar to the unsupervised loss in SfM-Net \cite{SfmNet}, which we can apply to coordinates within the object bounding boxes, diff --git a/background.tex b/background.tex index 44b0e7d..3de37d8 100644 --- a/background.tex +++ b/background.tex @@ -59,7 +59,7 @@ Table \ref{table:flownets} shows the classical FlowNetS architecture for optical & 5 $\times$ 5 deconv, 512, stride 2 & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 512 \\ \multicolumn{3}{c}{...}\\ \midrule -flow & $\times$ 2 bilinear upsample & $\tfrac{1}{1}$ H $\times$ $\tfrac{1}{1}$ W $\times$ 2 \\ +flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\ \bottomrule \caption { @@ -78,7 +78,7 @@ Potentially, the same network could also be used for semantic segmentation if the number of final and intermediate output channels were adapted from two to the number of classes.\ Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching reasonably well, given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements. -Note that the maximum displacement that can be correctly estimated only depends on the number of 2D strides or pooling +Note that the maximum displacement that can be correctly estimated depends on the number of 2D convolution strides or pooling operations in the encoder. Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}. @@ -364,9 +364,24 @@ which has a stride of $4$ with respect to the input image.
Most importantly, the RoI features can now be extracted at the pyramid level $P_j$ appropriate for a RoI bounding box with size $h \times w$, \begin{equation} -j = \log_2(\sqrt{w \cdot h} / 224). \todo{complete} +j = 2 + j_a, +\end{equation} +where +\begin{equation} +j_a = \mathrm{clip}\left(\left[\log_2\left(\frac{\sqrt{w \cdot h}}{s_0}\right)\right], 0, 4\right) \label{eq:level_assignment} \end{equation} +is the index of the corresponding anchor box (ordered from the smallest to the largest anchor scale) and +\begin{equation} +s_0 = 256 \cdot 0.125 +\label{eq:anchor_scale} +\end{equation} +is the scale of the smallest anchor boxes. +This formula differs slightly from the one used in the FPN paper, +as we want to assign bounding boxes that have the same scale +as some anchor to exactly the pyramid level from which the RPN predictions for this +anchor are computed. Now, for example, the smallest boxes are cropped from $P_2$, +which is the highest-resolution feature map. { @@ -454,6 +469,8 @@ frequently in the following chapters. For vector or tuple arguments, the sum of losses is computed. For classification we define $\ell_{cls}$ as the cross-entropy classification loss. +\todo{formally define cross-entropy losses?} + \label{ssec:rcnn_techn} \paragraph{Bounding box regression} All bounding boxes predicted by the RoI head or RPN are estimated as offsets diff --git a/bib.bib b/bib.bib index ed50ad8..43567ae 100644 --- a/bib.bib +++ b/bib.bib @@ -257,11 +257,17 @@ year = {2018}} @inproceedings{UnsupDepth, title={Unsupervised CNN for single view depth estimation: Geometry to the rescue}, author={Ravi Garg and BG Vijay Kumar and Gustavo Carneiro and Ian Reid}, booktitle={ECCV}, year={2016}} +@inproceedings{UnsupPoseDepth, + title={Unsupervised Learning of Depth and Ego-Motion from Video}, + author={Tinghui Zhou and Matthew Brown and Noah Snavely and David G. Lowe}, + booktitle={CVPR}, + year={2017}} + @inproceedings{UnsupFlownet, title={Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness}, author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis}, diff --git a/figures/flow_loss.png b/figures/flow_loss.png new file mode 100755 index 0000000..ee1c4ec Binary files /dev/null and b/figures/flow_loss.png differ diff --git a/figures/net_intro.png b/figures/net_intro.png index 259fbc1..043b7b7 100755 Binary files a/figures/net_intro.png and b/figures/net_intro.png differ diff --git a/introduction.tex b/introduction.tex index 3b93756..84cf59d 100644 --- a/introduction.tex +++ b/introduction.tex @@ -57,6 +57,13 @@ Figure taken from \cite{SfmNet}. Thus, this approach is very unlikely to scale to dynamic scenes with a potentially large number of diverse objects due to the inflexible nature of their instance segmentation technique. +Still, we think that the general idea of estimating object-level motion with +end-to-end deep networks instead +of directly predicting a dense flow field, as is common in current end-to-end +deep learning approaches to motion estimation, may significantly benefit motion +estimation by structuring the problem, creating physical constraints and reducing +the dimensionality of the estimate.
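To make this idea concrete, consider (as a rough sketch only, with ad-hoc notation and under the assumption of known camera intrinsics $K$ and a given depth map; the exact composition we use is described in \ref{ssec:postprocessing}) a pixel $\mathbf{x}$ with depth $Z$ that is assigned to object instance $k$ with estimated rigid motion $(R_k, \mathbf{t}_k)$ about a pivot $\mathbf{p}_k$, and let $(R_c, \mathbf{t}_c)$ denote the estimated camera motion. The optical flow at $\mathbf{x}$ can then be composed as
\begin{equation}
\mathbf{X} = Z \, K^{-1} \tilde{\mathbf{x}}, \qquad
\mathbf{X}' = R_c \left( R_k (\mathbf{X} - \mathbf{p}_k) + \mathbf{p}_k + \mathbf{t}_k \right) + \mathbf{t}_c, \qquad
\mathbf{w}(\mathbf{x}) = \pi(K \mathbf{X}') - \mathbf{x},
\end{equation}
where $\tilde{\mathbf{x}}$ is $\mathbf{x}$ in homogeneous coordinates and $\pi$ denotes the perspective division; for background pixels, only the camera motion is applied. Estimating a handful of rigid motion parameters per object in this way replaces the estimation of two unconstrained flow values per pixel.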
+ A scalable approach to instance segmentation based on region-based convolutional networks was recently introduced with Mask R-CNN \cite{MaskRCNN}, which inherits the ability to detect a large number of objects from a large number of classes at once from Faster R-CNN @@ -71,10 +78,10 @@ on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}. } \label{figure:maskrcnn_cs} \end{figure} - -We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of -Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net. -For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head +Inspired by the accurate segmentation results of Mask R-CNN, +we propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of +Mask R-CNN with the end-to-end instance-level 3D motion estimation approach introduced with SfM-Net. +For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI Mask R-CNN head in parallel to classification, bounding box refinement and mask prediction. In this way, for each RoI, we predict a single 3D rigid object motion together with the object pivot in camera space. @@ -86,8 +93,8 @@ as to the number or variety of object instances (Figure \ref{figure:net_intro}). Eventually, we want to extend our method to include depth prediction, yielding the first end-to-end deep network to perform 3D scene flow estimation -in a principled way from considering individual objects. +in a principled way from the consideration of individual objects. -For now, we will work with RGB-D frames to break down the problem into +For now, we will assume that RGB-D frames are given to break down the problem into manageable pieces. \begin{figure}[t] \centering \includegraphics[width=\textwidth]{figures/net_intro} \caption{ Overview of our network based on Mask R-CNN. For each RoI, we predict the instance motion -in parallel to the class, bounding box and mask. We branch off a additionaly +in parallel to the class, bounding box and mask. Additionally, we branch off a small network for predicting the camera motion from the bottleneck. } \label{figure:net_intro} \end{figure} \subsection{Related work} In the following, we will refer to systems which use deep networks for all -optimization and do not perform time-critical side computation at inference time as -\emph{end-to-end} deep learning systems. +optimization and do not perform time-critical side computation (e.g. numerical optimization) +at inference time as \emph{end-to-end} deep learning systems. \paragraph{Deep networks in optical flow} @@ -118,14 +125,15 @@ image depending on the semantics of each region or pixel, which include whether a pixel belongs to the background, to which object instance it belongs if it is not background, and the class of the object it belongs to. Often, failure cases of these methods include motion boundaries or regions with little texture, -where semantics become more important. \todo{elaborate}% TODO make sure this is a grounded statement +where semantics become important. Extensions of these approaches to scene flow estimate flow and depth with similarly generic networks \cite{SceneFlowDataset} and suffer from similar limitations. Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure % TODO cite jun's paper?
the optical flow estimation problem and introduce reasoning at the object level, but still require expensive energy minimization for each -new input, as CNNs are only used for some of the components. +new input, as CNNs are only used for some of the components and numerical +optimization is central to their inference. In contrast, we tackle motion estimation at the instance level with end-to-end deep networks and derive optical flow from the individual object motions. \paragraph{Slanted plane methods for 3D scene flow} The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being composed of planar segments. Pixels are assigned to one of the planar segments, -each of which undergoes a independent 3D rigid motion. % TODO explain benefits of this modelling, but unify with explanations above +each of which undergoes an independent 3D rigid motion. +This model simplifies the motion estimation problem significantly by reducing the dimensionality +of the estimate, and thus leads to accurate results. In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015} assigns each slanted plane to one rigidly moving object instance, thus reducing the number of independently moving segments by allowing multiple segments to share the motion of the object they belong to. In all of these methods, pixel assignment and motion estimation are formulated as an energy-minimization problem which is optimized for each input data point, -without the use of (deep) learning. % TODO make sure it's ok to say there's no learning +without the use of (deep) learning. In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow}, a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined @@ -183,25 +193,27 @@ SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a s of the points into objects together with the 3D motion of each object. Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and estimates a segmentation of pixels into objects together with their 3D motions between the frames. -In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning. +In addition, SfM-Net predicts dense depth and camera ego-motion to obtain full 3D scene flow from end-to-end deep learning. For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate with a brightness constancy proxy loss. -Like SfM-Net, we aim to estimate motion and instance segmentation jointly with +Like SfM-Net, we aim to estimate 3D motion and instance segmentation jointly with end-to-end deep learning. Unlike SfM-Net, we build on a scalable object detection and instance segmentation approach with R-CNNs, which provide a strong baseline. \paragraph{End-to-end deep networks for camera pose estimation} Deep networks have been used for estimating the 6-DOF camera pose from -a single RGB frame \cite{PoseNet, PoseNet2}. These works are related to +a single RGB frame \cite{PoseNet, PoseNet2}, or for estimating depth and camera ego-motion +from monocular video \cite{UnsupPoseDepth}. +These works are related to ours in that we also need to output various rotations and translations from a deep network and thus need to solve similar regression problems and use similar parametrizations and losses.
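As one illustrative example of such a parametrization (a sketch only; the specific parametrization and losses we adopt are described in the approach chapter), a network can regress the translation $\mathbf{t} \in \mathbb{R}^3$ directly and the rotation either as a quaternion that is normalized to unit length, as in PoseNet, or as an axis-angle vector $\boldsymbol{\omega} \in \mathbb{R}^3$ that is mapped to a rotation matrix with the Rodrigues formula
\begin{equation}
R = I + \sin(\theta) \, [\mathbf{n}]_\times + (1 - \cos(\theta)) \, [\mathbf{n}]_\times^2,
\qquad \theta = \lVert \boldsymbol{\omega} \rVert, \quad \mathbf{n} = \boldsymbol{\omega} / \theta,
\end{equation}
where $[\mathbf{n}]_\times$ is the skew-symmetric cross-product matrix of $\mathbf{n}$. Both choices keep the regression target low-dimensional while guaranteeing a valid rotation matrix.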
\subsection{Outline} -In section \ref{sec:background}, we introduce preliminaries and building +First, in section \ref{sec:background}, we introduce preliminaries and building blocks from earlier works that serve as a foundation for our networks and losses. Most importantly, we review the ResNet CNN (\ref{ssec:resnet}) that will serve as our CNN backbone as well as the developments in region-based CNNs on which we build (\ref{ssec:rcnn}), @@ -212,11 +224,11 @@ followed by our losses and supervision methods for training the extended region-based CNN (\ref{ssec:supervision}), and finally the postprocessing steps we use to derive dense flow from our 3D motion estimates (\ref{ssec:postprocessing}). -In section \ref{sec:experiments}, we introduce the Virtual KITTI dataset we use +Then, in section \ref{sec:experiments}, we introduce the Virtual KITTI dataset we use for training our networks as well as all preprocessing steps we perform (\ref{ssec:datasets}), give details of our experimental setup (\ref{ssec:setup}), and finally describe the experimental results on Virtual KITTI (\ref{ssec:vkitti}). -In section \ref{sec:conclusion}, we summarize our work and describe future +Finally, in section \ref{sec:conclusion}, we summarize our work and describe future developments, including depth prediction, training on real-world data, and exploiting frames over longer time intervals.