diff --git a/abstract.tex b/abstract.tex
index 1e73166..3864469 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -1,32 +1,36 @@
\begin{abstract}
-Many state of the art energy-minimization approaches to optical flow and scene flow estimation
-rely on a (piecewise) rigid scene model, where the scene is represented as an ensemble of distinct,
-rigidly moving components, a static background and a moving camera.
+Many state-of-the-art energy-minimization approaches to optical flow and scene
+flow estimation rely on a (piecewise) rigid scene model, where the scene is
+represented as an ensemble of distinct, rigidly moving components, a static
+background, and a moving camera.
By constraining the optimization problem with a physically sound scene model,
these approaches enable highly accurate motion estimation.
-With the advent of deep learning methods, it has become popular to re-purpose generic deep networks
-for classical computer vision problems involving pixel-wise estimation.
+With the advent of deep learning methods, it has become popular to re-purpose
+generic deep networks for classical computer vision problems involving
+pixel-wise estimation.

-Following this trend, many recent end-to-end deep learning approaches to optical flow
-and scene flow directly predict full resolution
-depth and flow fields with a generic network for dense, pixel-wise prediction,
-thereby ignoring the inherent structure of the underlying motion estimation problem
-and any physical constraints within the scene.
+Following this trend, many recent end-to-end deep learning approaches to optical
+flow and scene flow directly predict full-resolution flow fields with
+a generic network for dense, pixel-wise prediction, thereby ignoring the
+inherent structure of the underlying motion estimation problem and any physical
+constraints within the scene.

-We introduce an end-to-end deep learning approach for dense motion estimation
+We introduce a scalable end-to-end deep learning approach for dense motion estimation
that respects the structure of the scene as being composed of distinct objects,
thus combining the representation learning benefits of end-to-end deep networks
with a physically plausible scene model.
-Building on recent advanced in region-based convolutional networks (R-CNNs), we integrate motion
-estimation with instance segmentation.
+Building on recent advances in region-based convolutional networks (R-CNNs),
+we integrate motion estimation with instance segmentation.
Given two consecutive frames from a monocular RGBD camera,
-our resulting end-to-end deep network detects objects with accurate per-pixel masks
-and estimates the 3D motion of each detected object between the frames.
-By additionally estimating a global camera motion in the same network, we compose a dense
-optical flow field based on instance-level and global motion predictions.
+our resulting end-to-end deep network detects objects with accurate per-pixel
+masks and estimates the 3D motion of each detected object between the frames.
+By additionally estimating a global camera motion in the same network,
+we compose a dense optical flow field from instance-level and global motion
+predictions.

-We demonstrate the feasibility of our approach on the KITTI 2015 optical flow benchmark.
+We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
+benchmark.
\end{abstract}
diff --git a/background.tex b/background.tex
index f59977d..6d870dd 100644
--- a/background.tex
+++ b/background.tex
@@ -1,15 +1,19 @@
\subsection{Optical flow, scene flow and structure from motion}
-Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a sequence of images.
-The optical flow $\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$ maps pixel coordinates in the first
-frame $I_1$ to pixel coordinates of the visually corresponding pixel in the second frame $I_2$, thus
-representing the apparent movement of brigthness patterns between the two frames.
+Let $I_1, I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
+sequence of images.
+The optical flow
+$\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$
+maps pixel coordinates in the first frame $I_1$ to pixel coordinates of the
+visually corresponding pixel in the second frame $I_2$,
+thus representing the apparent movement of brightness patterns between the two frames.
+Under the common brightness constancy assumption, the flow satisfies
+$I_1(\mathbf{x}) \approx I_2(\mathbf{x} + \mathbf{w}(\mathbf{x}))$ for every pixel $\mathbf{x} \in P$.
Optical flow can be regarded as two-dimensional motion estimation.
Scene flow is the generalization of optical flow to three-dimensional space.

\subsection{Convolutional neural networks for dense motion estimation}
-Deep convolutional neural network (CNN) architectures \cite{ImageNetCNN, VGGNet, ResNet} became widely popular
-through numerous successes in classification and recognition tasks.
+Deep convolutional neural network (CNN) architectures
+\cite{ImageNetCNN, VGGNet, ResNet}
+became widely popular through numerous successes in classification and recognition tasks.
The general structure of a CNN consists of a convolutional encoder, which learns a spatially
compressed, wide (in the number of channels) representation of the input image, and a fully
connected prediction network on top of the encoder.
@@ -47,13 +51,13 @@
In the following, we give a short review of region-based convolutional networks, which are among the
most popular deep networks for object detection and have recently also been applied to instance segmentation.
\paragraph{R-CNN}
-The original region-based convolutional network (R-CNN) \cite{RCNN} uses a non-learned algorithm external to a standard encoder CNN
+Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN
for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped at the proposed region and the crop is passed
through a CNN, which performs classification of the object (or non-object, if the region shows background)
and refines the bounding box coordinates.
\paragraph{Fast R-CNN}
-The original R-CNN involved computing on forward pass of the deep CNN for each of the region proposals,
+The original R-CNN involves computing one forward pass of the CNN for each of the region proposals,
which is costly, as there is generally a large number of proposals.
Fast R-CNN \cite{FastRCNN} significantly reduces computation by performing only a single forward pass
with the whole image as input to the CNN (compared to the sequential input of crops in the case of R-CNN).
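+To make the subsequent per-RoI computation concrete, we briefly recall RoI max
+pooling as introduced in \cite{FastRCNN} (notation ours, with the grid size
+$h \times w$ a hyperparameter of the architecture): inside each region of
+interest $r$, the shared feature map $F$ is divided into a fixed $h \times w$
+grid of bins, and each output cell takes the channel-wise maximum over its bin,
+\begin{equation}
+  \mathrm{RoIPool}(F, r)_{c,i,j} = \max_{(x, y) \in \mathrm{bin}_r(i, j)} F_c(x, y),
+\end{equation}
+which maps every proposal to a feature map of fixed spatial size that can be fed
+to the fully connected layers for classification and bounding box refinement.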
@@ -66,7 +70,7 @@
speeding up the system by orders of magnitude.
% TODO verify that
\paragraph{Faster R-CNN}
After streamlining the CNN components, Fast R-CNN is limited by the speed of the region proposal
-algorith, which has to be run prior to the network passes and makes up a large portion of the total
+algorithm, which has to be run prior to the network passes and makes up a large portion of the total
processing time. The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals
and subsequent box refinement and classification into a single deep network, leading to faster processing when compared to Fast R-CNN
diff --git a/bib.bib b/bib.bib
index 7d07bcc..7773af5 100644
--- a/bib.bib
+++ b/bib.bib
@@ -52,7 +52,7 @@
Booktitle = {{ICCV}},
Year = {2015}}

-@inproceedings{Behl2017ICCV,
+@inproceedings{InstanceSceneFlow,
Author = {Aseem Behl and Omid Hosseini Jafari and Siva Karthik Mustikovela and
Hassan Abu Alhaija and Carsten Rother and Andreas Geiger},
Title = {Bounding Boxes, Segmentations and Object Coordinates:
@@ -125,8 +125,20 @@
booktitle = {{CVPR}},
year = {2012}}

-@INPROCEEDINGS{KITTI2015,
+@inproceedings{KITTI2015,
author = {Moritz Menze and Andreas Geiger},
title = {Object Scene Flow for Autonomous Vehicles},
booktitle = {{CVPR}},
year = {2015}}
+
+@inproceedings{PRSF,
+ author = {C. Vogel and K. Schindler and S. Roth},
+ title = {Piecewise Rigid Scene Flow},
+ booktitle = {{ICCV}},
+ year = {2013}}
+
+@article{PRSM,
+ author = {C. Vogel and K. Schindler and S. Roth},
+ title = {3D Scene Flow with a Piecewise Rigid Scene Model},
+ journal = {{IJCV}},
+ year = {2015}}
diff --git a/conclusion.tex b/conclusion.tex
index 86df7d3..436aa3f 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -22,6 +22,6 @@
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and
only penalize the motion losses (and depth prediction), as no instance segmentation ground truth exists.
On Cityscapes, we could continue training the full instance segmentation Mask R-CNN to
-improve detection and masks and avoid any forgetting effects.
+improve detection and masks and avoid forgetting the instance segmentation task.
As an alternative to this training scheme, we could investigate training on a pure instance
segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
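+One possible form of such a warping-based proxy loss, sketched here in the notation
+of the preceding chapters and mirroring the brightness constancy proxy loss used by
+SfM-Net \cite{SfmNet}, penalizes the photometric error between the first frame and
+the second frame warped by the composed flow $\mathbf{w}$,
+\begin{equation}
+  \mathcal{L}_{\mathrm{photo}} = \frac{1}{|P|} \sum_{\mathbf{x} \in P}
+  \rho\big( \lVert I_1(\mathbf{x}) - I_2(\mathbf{x} + \mathbf{w}(\mathbf{x})) \rVert_1 \big),
+\end{equation}
+where $I_2$ is sampled with bilinear interpolation so that the loss remains
+differentiable, and $\rho$ is a robust penalty, for example the Charbonnier
+function $\rho(a) = \sqrt{a^2 + \epsilon^2}$.
+Occlusion handling and smoothness regularization are omitted in this sketch.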
diff --git a/introduction.tex b/introduction.tex
index c5786ab..0e3a880 100644
--- a/introduction.tex
+++ b/introduction.tex
@@ -1,8 +1,24 @@
-\subsection{Motivation \& Goals}
+\subsection{Motivation}
% introduce problem to solve
% mention classical non deep-learning works, then say it would be nice to go end-to-end deep
+% cf. the introductions of behl2017 & FlowLayers
+
+Deep learning research is increasingly moving towards video understanding, where
+motion estimation is a central and inherently ambiguous problem.
+A recent trend is towards end-to-end deep learning systems and away from
+energy minimization.
+Often, however, this leads to a compromise in modelling, as it is more difficult
+to formulate an end-to-end deep network architecture for a given problem than it
+is to state a feasible energy-minimization problem.
+For this reason, generic models are applied to many domains that previously
+employed intricate physical models to simplify optimization.
+On the one hand, end-to-end deep learning may bring unique benefits due to the
+ability of a learned system to deal with ambiguity.
+On the other hand, a generic network discards the physical constraints that make
+energy-minimization approaches accurate in the first place.
+Thus, there is an emerging trend to unify geometric scene models with end-to-end
+deep learning.
+
Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth and dense
optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
SfM-Net predicts a batch of binary full image masks specifying the object memberships of individual pixels with a standard encoder-decoder
@@ -23,24 +39,42 @@
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head
in parallel to classification and bounding box refinement.
-\subsection{Related Work}
+\subsection{Related work}
-\paragraph{Deep networks for optical flow and scene flow}
+\paragraph{Deep networks in optical flow and scene flow}
-\paragraph{Deep networks for 3D motion estimation}
+\cite{FlowLayers}
+\cite{ESI}
+
+\paragraph{Slanted plane methods for 3D scene flow}
+The slanted plane model for scene flow \cite{PRSF, PRSM} represents a 3D scene as being
+composed of planar segments. Pixels are assigned to one of the planar segments,
+each of which undergoes a rigid motion.
+
+In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015}
+assigns each slanted plane to one rigidly moving object instance, thus
+reducing the number of independently moving segments by allowing multiple
+segments to share the motion of the object they belong to.
+
+In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
+a CNN is used to compute 2D bounding boxes and instance masks, which are then combined
+with depth from a non-learned stereo algorithm and used as pre-computed
+inputs to the object scene flow model of \cite{KITTI2015}.
+
+Interestingly, these slanted plane methods achieve the current state-of-the-art
+in scene flow \emph{and} optical flow estimation on the KITTI benchmarks \cite{KITTI2012, KITTI2015},
+outperforming end-to-end deep networks such as \cite{FlowNet2, SceneFlowDataset}.
+
+However, the energy-minimization components of these methods remain computationally
+expensive at test time.
+This parallels the evolution of region-based convolutional networks, where moving
+more and more hand-crafted components into a single end-to-end network improved
+both speed and accuracy, and suggests a similar potential for end-to-end learning
+in motion estimation.
+
+\paragraph{End-to-end deep networks for 3D rigid motion estimation}
End-to-end deep learning for predicting rigid 3D object motions was first introduced with SE3-Nets \cite{SE3Nets},
which take raw 3D point clouds as input and produce a segmentation of the points into objects together with
the 3D motion of each object.
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and
estimates a segmentation of pixels into objects together with their 3D motions between the frames.
In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning.
-For supervision, SfM-Net penalizes the dense optical flow composed from the 3D motions and depth estimate
+For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate
with a brightness constancy proxy loss.
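+To make this composition explicit, consider a pixel $\mathbf{x}$ with homogeneous
+coordinate $\tilde{\mathbf{x}}$, predicted depth $d(\mathbf{x})$, camera intrinsics $K$,
+and perspective projection $\pi$ (notation ours).
+If $(R, \mathbf{t})$ denotes the total rigid transformation assigned to the pixel,
+composed from the motion of its object and the camera motion, the resulting optical
+flow is
+\begin{equation}
+  \mathbf{w}(\mathbf{x}) = \pi\big( K ( R \, d(\mathbf{x}) \, K^{-1} \tilde{\mathbf{x}} + \mathbf{t} ) \big) - \mathbf{x},
+\end{equation}
+i.e., the pixel is back-projected to 3D, rigidly transformed, and re-projected
+into the second frame.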
-
-
-Recently, deep CNN-based recognition was combined with energy-based 3D scene flow estimation \cite{Behl2017ICCV}.
-
-
-\cite{FlowLayers}
-\cite{ESI}
diff --git a/thesis.tex b/thesis.tex
index ee0087a..9065236 100644
--- a/thesis.tex
+++ b/thesis.tex
@@ -151,7 +151,7 @@
% Use keyword=meinbegriff to output only the entries from your .bib file that are tagged with meinbegriff.
% If a certain keyword must not be included, use notkeyword=meinbegriff.
\singlespacing
-\printbibliography[title=Literaturverzeichnis, heading=bibliography]
+\printbibliography[title=Bibliography, heading=bibliography]
%\printbibliography[title=Literaturverzeichnis, heading=bibliography, keyword=meinbegriff]