This commit is contained in:
Simon Meister 2017-11-04 15:01:43 +01:00
parent 84c5b1e6cd
commit a6311dca56
5 changed files with 104 additions and 46 deletions


@ -31,6 +31,9 @@ By additionally estimating a global camera motion in the same network,
we compose a dense optical flow field based on instance-level and global motion
predictions.
%We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
%benchmark.
\end{abstract}
\renewcommand{\abstractname}{Zusammenfassung}
\begin{abstract}
\todo{german abstract}
\end{abstract}


@ -17,7 +17,15 @@ laying the foundation for our motion estimation. Instead of taking a single image,
we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
We do not introduce a separate network for computing region proposals and use our modified backbone network
as both the first-stage RPN and the second-stage feature extractor for region cropping.
Technically, our feature encoder network will have to learn a motion representation similar to
that learned by the FlowNet encoder, but the output will be computed in the
object-centric framework of a region-based convolutional network head with a 3D parametrization.
Thus, in contrast to the dense FlowNet decoder, the estimated dense motion information
from the encoder is integrated for specific objects via RoI cropping and
processed by the RoI head for each object.
\todo{figure of backbone}
\todo{introduce optional XYZ input}
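As an illustration, the two-frame input can be formed by a single concatenation
(a minimal sketch, assuming a TensorFlow implementation; the function name is illustrative):
\begin{verbatim}
import tensorflow as tf

def backbone_input(image_t, image_tp1):
    # Stack two consecutive RGB frames along the channel axis to form
    # the six-channel input map for the modified backbone.
    # image_t, image_tp1: [batch, height, width, 3] float tensors.
    return tf.concat([image_t, image_tp1], axis=3)  # [batch, H, W, 6]
\end{verbatim}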
\paragraph{Per-RoI motion prediction}
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
@ -70,6 +78,7 @@ where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
Here, we assume that motions between frames are relatively small
and that objects rotate at most 90 degrees in either direction along any axis.
All predictions are made in camera space, and translation and pivot predictions are in meters.
\todo{figure of head}
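To make the parametrization concrete, the following NumPy sketch recovers a rotation
matrix from the predicted, clipped sines; the Euler angle convention and composition
order are assumptions made for illustration:
\begin{verbatim}
import numpy as np

def rotation_from_sines(sines):
    # sines: predicted (sin(alpha), sin(beta), sin(gamma)), clipped to
    # [-1, 1]; valid under the assumption of rotations of at most
    # 90 degrees in either direction along any axis.
    a, b, c = np.arcsin(np.clip(sines, -1.0, 1.0))
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(a), -np.sin(a)],
                   [0, np.sin(a), np.cos(a)]])
    Ry = np.array([[np.cos(b), 0, np.sin(b)],
                   [0, 1, 0],
                   [-np.sin(b), 0, np.cos(b)]])
    Rz = np.array([[np.cos(c), -np.sin(c), 0],
                   [np.sin(c), np.cos(c), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx  # one possible composition order
\end{verbatim}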
\paragraph{Camera motion prediction}
In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
@ -111,23 +120,53 @@ l_{p}^k = \lVert p^{gt,i_k} - p^{k,c_k} \rVert_1.
\end{equation}
\paragraph{Camera motion supervision}
We supervise the camera motion with ground truth analogously to the
object motions; the only difference is that the camera motion has
a rotation and translation, but no pivot term.
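A minimal sketch of this supervision, assuming $\ell_1$ penalties on each term as for
the pivot loss above (whether the rotation penalty acts on angles or matrix entries is
an implementation choice; matrix entries are used here for illustration):
\begin{verbatim}
import numpy as np

def motion_loss(R_pred, t_pred, R_gt, t_gt, p_pred=None, p_gt=None):
    # l1 penalties on rotation, translation and (for objects) pivot;
    # for the camera motion, the pivot arguments are simply omitted.
    loss = np.abs(R_pred - R_gt).sum() + np.abs(t_pred - t_gt).sum()
    if p_gt is not None:
        loss += np.abs(p_pred - p_gt).sum()
    return loss
\end{verbatim}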
\paragraph{Per-RoI supervision \emph{without} motion ground truth}
A more general way to supervise the object motions is a re-projection
loss similar to the unsupervised loss in SfM-Net \cite{SfmNet},
which can be applied to coordinates within the object bounding boxes
and does not require ground truth 3D object motions.
For any RoI, we generate a uniform 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask. We use the same bounding box
to crop the corresponding region from the dense, full image depth map
and bilinearly resize the depth crop to the same resolution as the mask and point
grid.
We then compute the optical flow at each of the grid points by creating
a 3D point cloud from the point grid and depth crop. To this point cloud, we
apply the RoI's predicted motion, masked by the predicted mask.
Then, we apply the camera motion to the points, project them back to 2D
and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
Note that we batch this computation over all RoIs, so that we only perform
it once per forward pass. The mathematical details are analogous to the
dense, full image flow computation in the following subsection and will not
be repeated here. \todo{add diagram to make it easier to understand}
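Nevertheless, the following unbatched NumPy sketch illustrates the steps just listed
for a single RoI (names are illustrative; the pivot-centered rigid transform follows
the parametrization above):
\begin{verbatim}
import numpy as np

def roi_flow_grid(box, depth_crop, mask, R_o, t_o, pivot, R_c, t_c, cam):
    # box: (x0, y0, x1, y1) RPN proposal; depth_crop, mask: [m, m];
    # cam = (f, c0, c1) are the camera intrinsics.
    f, c0, c1 = cam
    m = mask.shape[0]
    x, y = np.meshgrid(np.linspace(box[0], box[2], m),
                       np.linspace(box[1], box[3], m))
    # back-project the 2D point grid to 3D using the depth crop
    Z = depth_crop
    P = np.stack([(x - c0) * Z / f, (y - c1) * Z / f, Z], axis=-1)
    # apply the RoI's predicted motion, gated by the predicted mask
    P_o = (P - pivot) @ R_o.T + pivot + t_o
    P = mask[..., None] * P_o + (1 - mask[..., None]) * P
    # apply the camera motion and re-project to 2D
    P = P @ R_c.T + t_c
    x2 = f * P[..., 0] / P[..., 2] + c0
    y2 = f * P[..., 1] / P[..., 2] + c1
    # optical flow as the difference of re-projected and initial grids
    return np.stack([x2 - x, y2 - y], axis=-1)
\end{verbatim}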
For each RoI, we can now penalize the optical flow grid to supervise the object motion.
If there is optical flow ground truth available, we can use the RoI bounding box to
crop and resize a region from the ground truth optical flow to match the RoI's
optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss.
However, we can also use the re-projection loss without optical flow ground truth
to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}.
In this case, we use the bounding box to crop and resize a corresponding region
from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$
using the 2D grid displaced with the predicted flow grid. Then, we can penalize the difference
between the resulting image crops, for example, with a census loss \cite{CensusTerm,UnFlow}.
For more details on differentiable bilinear sampling for deep learning, we refer the reader to
\cite{STN}.
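For illustration, a minimal NumPy sketch of bilinear sampling at the displaced grid
(coordinates are assumed to be in-bounds for clarity):
\begin{verbatim}
import numpy as np

def bilinear_sample(img, x, y):
    # img: [H, W, C]; x, y: arrays of real-valued pixel coordinates.
    x0 = np.clip(np.floor(x).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, img.shape[0] - 2)
    wx, wy = x - x0, y - y0
    top = (1 - wx)[..., None] * img[y0, x0] \
        + wx[..., None] * img[y0, x0 + 1]
    bot = (1 - wx)[..., None] * img[y0 + 1, x0] \
        + wx[..., None] * img[y0 + 1, x0 + 1]
    return (1 - wy)[..., None] * top + wy[..., None] * bot

# Warp the crop from the second image to time t with the predicted flow
# grid, then penalize its difference to the crop from the first image:
# warped = bilinear_sample(crop_tp1, x + flow[..., 0], y + flow[..., 1])
\end{verbatim}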
When compared to supervision with motion ground truth, a re-projection
loss could benefit motion regression by removing any loss balancing issues between the
rotation, translation and pivot terms \cite{PoseNet2},
which can make it attractive even when 3D motion ground truth is available.
\subsection{Dense flow from motion}
As a postprocessing step, we compose a dense optical flow map from the outputs of our Motion R-CNN network.
Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$,
where
\begin{equation}
@ -143,6 +182,7 @@ x_t - c_0 \\ y_t - c_1 \\ f
\end{equation}
is the 3D coordinate at $t$ corresponding to the point with pixel coordinates $x_t, y_t$,
which range over all coordinates in $I_t$.
For now, the depth map is always assumed to come from ground truth.
Given $k$ detections with predicted motions as above, we transform all points within the bounding
box of a detected object according to the predicted motion of the object.
@ -166,6 +206,10 @@ X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
= P_{t+1} = R_t^{cam} \cdot P'_{t+1} + t_t^{cam}.
\end{equation}
Note that in our experiments, we either use the ground truth camera motion, to focus
on the object motion predictions, or the predicted camera motion, to estimate the complete
scene motion. We will always state which variant we use in the experimental section.
Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
\begin{equation}
\begin{pmatrix}
@ -191,7 +235,3 @@ u \\ v
x_{t+1} - x_{t} \\ y_{t+1} - y_{t}
\end{pmatrix}.
\end{equation}
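Mirroring the per-RoI sketch above, the dense composition can be summarized as
follows (a NumPy sketch with illustrative names; integer box coordinates are assumed):
\begin{verbatim}
import numpy as np

def dense_flow(depth, detections, R_c, t_c, cam):
    # depth: [H, W] ground truth depth map for I_t;
    # detections: one (box, mask, R_k, t_k, pivot_k) tuple per object;
    # cam = (f, c0, c1) are the camera intrinsics.
    f, c0, c1 = cam
    H, W = depth.shape
    x, y = np.meshgrid(np.arange(W, dtype=float),
                       np.arange(H, dtype=float))
    P = np.stack([(x - c0) * depth / f, (y - c1) * depth / f, depth],
                 axis=-1)
    for (x0, y0, x1, y1), mask, R_k, t_k, pivot_k in detections:
        crop = P[y0:y1, x0:x1]
        moved = (crop - pivot_k) @ R_k.T + pivot_k + t_k
        m = mask[..., None]  # instance mask resized to the box resolution
        P[y0:y1, x0:x1] = m * moved + (1 - m) * crop
    P = P @ R_c.T + t_c  # predicted or ground truth camera motion
    x2 = f * P[..., 0] / P[..., 2] + c0
    y2 = f * P[..., 1] / P[..., 2] + c1
    return np.stack([x2 - x, y2 - y], axis=-1)  # dense optical flow
\end{verbatim}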


@ -1,17 +1,20 @@
Here, we give a more detailed description of the previous works
we directly build on, as well as other prerequisites.
\subsection{Optical flow and scene flow}
Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
sequence of images.
The optical flow
$\mathbf{w} = (u, v)^T$ from $I_1$ to $I_2$
maps pixel coordinates in the first frame $I_1$ to pixel coordinates of the
visually corresponding pixel in the second frame $I_2$,
and can be interpreted as the apparent movement of brightness patterns between the two frames.
Optical flow can be regarded as two-dimensional motion estimation.
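This correspondence is commonly formalized through the brightness constancy
assumption,
\begin{equation}
I_1(\mathbf{x}) \approx I_2(\mathbf{x} + \mathbf{w}(\mathbf{x})),
\end{equation}
where $\mathbf{x} = (x, y)^T$ ranges over the pixel coordinates $P$.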
Scene flow is the generalization of optical flow to three-dimensional space and
requires estimating depth for each pixel. Generally, stereo input is used for scene flow
to estimate disparity-based depth; however, monocular depth estimation with deep networks is becoming
popular \cite{DeeperDepth}.
\subsection{Convolutional neural networks for dense motion estimation}
Deep convolutional neural network (CNN) architectures
@ -30,27 +33,18 @@ The most popular deep networks of this kind for end-to-end optical flow prediction
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
Figure \ref{} shows the classical FlowNetS architecture for optical flow prediction.
Note that the network itself is a rather generic encoder-decoder network and is specialized for optical flow only through being trained
with supervision from dense optical flow ground truth.
Potentially, the same network could also be used for semantic segmentation if
the number of output channels were adapted from two to the number of classes. % TODO verify
Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching reasonably well,
given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements.
Note that the maximum displacement that can be correctly estimated depends only on the number of 2D strides or pooling
operations in the encoder.
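For example, in an encoder with $k$ stride-2 operations, the bottleneck features have
a stride of $2^k$ input pixels, so for $k = 6$ each bottleneck unit aggregates
information over at least $2^6 = 64$ pixels and displacements of this magnitude can,
in principle, still be matched.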
Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}.
\subsection{Region-based convolutional networks}
We now give a short review of region-based convolutional networks, which are currently by far the
most popular deep networks for object detection, and have recently also been applied to instance segmentation.
\paragraph{R-CNN}
@ -101,10 +95,11 @@ which generally involves computing a binary mask for each object instance specific
to that object. This problem is called \emph{instance segmentation}.
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
fixed resolution instance masks within the bounding boxes of each detected object.
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
compute a pixel-precise mask for each instance.
In addition to extending the original Faster R-CNN head, Mask R-CNN also introduced a network
variant based on Feature Pyramid Networks \cite{FPN}.
Figure \ref{} compares the two Mask R-CNN head variants.
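As an illustration, consider the following sketch of such a mask head on RoI-aligned
features (layer sizes follow common Mask R-CNN configurations, but are assumptions
here):
\begin{verbatim}
import tensorflow as tf

def mask_head(roi_features, num_classes):
    # roi_features: [num_rois, 14, 14, C] RoI-aligned feature crops.
    x = roi_features
    for _ in range(4):  # a small stack of 3x3 convolutions
        x = tf.keras.layers.Conv2D(256, 3, padding='same',
                                   activation='relu')(x)
    # upsample once and predict one fixed-resolution mask per class
    x = tf.keras.layers.Conv2DTranspose(256, 2, strides=2,
                                        activation='relu')(x)
    return tf.keras.layers.Conv2D(num_classes, 1, activation='sigmoid')(x)
\end{verbatim}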
\paragraph{Supervision of the RPN}
\paragraph{Supervision of the RoI head}

bib.bib

@ -172,3 +172,21 @@
title = {Geometric loss functions for camera pose regression with deep learning},
booktitle = {CVPR},
year = {2017}}
@inproceedings{STN,
author = {Max Jaderberg and Karen Simonyan and Andrew Zisserman and Koray Kavukcuoglu},
title = {Spatial transformer networks},
booktitle = {NIPS},
year = {2015}}
@inproceedings{CensusTerm,
author = {Fridtjof Stein},
title = {Efficient Computation of Optical Flow Using the Census Transform},
booktitle = {DAGM},
year = {2004}}
@inproceedings{DeeperDepth,
author = {Iro Laina and Christian Rupprecht and Vasileios Belagiannis and Federico Tombari and Nassir Navab},
title = {Deeper Depth Prediction with Fully Convolutional Residual Networks},
booktitle = {3DV},
year = {2016}}


@ -12,7 +12,7 @@ of each obstacle, but to also know if and where the obstacle is moving,
and to use sensors that will not make the system too expensive for widespread use.
There are many other applications. %TODO(make motivation wider)
A promising approach to 3D scene understanding in these situations is the use of deep neural
networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification
in still images and are increasingly being applied to video data.
A key benefit of end-to-end deep networks is that they can, in principle,
@ -20,7 +20,7 @@ enable very fast inference on real time video data and generalize
over many training examples to resolve ambiguities inherent in image understanding
and motion estimation.
Thus, in this work, we aim to develop end-to-end deep networks which can, given
sequences of images, segment the image pixels into object instances and estimate
the location and 3D motion of each object instance relative to the camera.
@ -44,19 +44,21 @@ and predicts pixel-precise segmentation masks for each detected object.
We propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of
Mask R-CNN with the end-to-end 3D motion estimation approach introduced with SfM-Net.
For this, we naturally integrate 3D motion prediction for individual objects into the per-RoI R-CNN head
in parallel to classification, bounding box refinement and mask prediction.
For each RoI, we predict a single 3D rigid object motion together with the object
pivot in camera space.
As a foundation for image matching, we extend the ResNet \cite{ResNet} backbone of Mask R-CNN to take
two concatenated images as input, similar to FlowNetS \cite{FlowNet}.
This results in a fully integrated end-to-end network architecture for segmenting pixels into instances
and estimating the motion of all detected instances without any limitations
as to the number or variety of object instances.
Figure \ref{} gives an overview of our network.
Eventually, we want to extend our method to include depth prediction,
yielding the first end-to-end deep network to perform 3D scene flow estimation
in a principled way by considering individual objects.
For now, we will work with RGB-D frames to break down the problem into
manageable pieces.
\subsection{Related work}
@ -75,12 +77,12 @@ image depending on the semantics of each region or pixel, which include whether
pixel belongs to the background, to which object instance it belongs if it is not background,
and the class of the object it belongs to.
Often, failure cases of these methods include motion boundaries or regions with little texture,
where semantics become more important. \todo{elaborate}% TODO make sure this is a grounded statement
Extensions of these approaches to scene flow estimate flow and depth
with similarly generic networks \cite{SceneFlowDataset} and similar limitations.
Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure % TODO cite jun's paper?
the optical flow estimation problem and introduce reasoning at the object level,
but still require expensive energy minimization for each
new input, as CNNs are only used for some of the components.
@ -94,7 +96,7 @@ reducing the number of independently moving segments by allowing multiple
segments to share the motion of the object they belong to.
In these methods, pixel assignment and motion estimation are formulated
as an energy-minimization problem and optimized for each input data point,
without the use of (deep) learning. % TODO make sure it's ok to say there's no learning
In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
@ -126,8 +128,8 @@ in speed, but also in accuracy, especially considering the inherent ambiguity of
and the ability of deep networks to learn to handle ambiguity from a large variety of training examples.
However, we think that the current end-to-end deep learning approaches to motion
estimation are likely limited by a lack of spatial structure and regularity in their estimates
as explained above, which stems from the generic nature of the employed networks.
To this end, we aim to combine the modelling benefits of rigid scene decompositions
with the promise of end-to-end deep learning.