mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git
synced 2026-04-04 08:35:19 +00:00

save

This commit is contained in:
parent 2e84078d0d
commit 84c5b1e6cd
@@ -19,7 +19,7 @@ constraints within the scene.
 We introduce a scalable end-to-end deep learning approach for dense motion estimation
 that respects the structure of the scene as being composed of distinct objects,
-thus combining the representation learning benefits of end-to-end deep networks
+thus combining the representation learning benefits and speed of end-to-end deep networks
 with a physically plausible scene model.
 
 Building on recent advances in region-based convolutional networks (R-CNNs),
@@ -31,6 +31,6 @@ By additionally estimating a global camera motion in the same network,
 we compose a dense optical flow field based on instance-level and global motion
 predictions.
 
-We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
-benchmark.
+%We demonstrate the feasibility of our approach on the KITTI 2015 optical flow
+%benchmark.
 \end{abstract}
approach.tex
@@ -1,24 +1,30 @@
 
 \subsection{Motion R-CNN architecture}
 
-Building on Mask R-CNN, we enable per-object motion estimation by predicting the 3D motion of each detected object
-in camera space.
+Building on Mask R-CNN \cite{MaskRCNN},
+we estimate per-object motion by predicting the 3D motion of each detected object.
 For this, we extend Mask R-CNN in two straightforward ways.
 First, we modify the backbone network and provide two frames to the R-CNN system
 in order to enable image matching between the consecutive frames.
 Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each
 region proposal.
 
 \paragraph{Backbone Network}
-Like Faster R-CNN and Mask R-CNN, we use a ResNet variant as backbone network to compute feature maps from input imagery.
+Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backbone network to compute feature maps from input imagery.
 
-Inspired by FlowNetS, we make one modification to enable image matching within the backbone network,
+Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching,
 laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone,
 we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
 We do not introduce a separate network for computing region proposals and use our modified backbone network
 as both first-stage RPN and second-stage feature extractor for region cropping.
 % TODO figures; introduce XYZ inputs
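The two-frame backbone input described above can be sketched as follows; this is an illustrative snippet, not code from the thesis, and the image resolution is an assumption:

```python
import numpy as np

# Two consecutive RGB frames are depth-concatenated into a single
# six-channel map before entering the modified ResNet backbone.
H, W = 192, 640  # hypothetical KITTI-like resolution (assumed)

I_t  = np.random.rand(H, W, 3).astype(np.float32)   # frame at time t
I_t1 = np.random.rand(H, W, 3).astype(np.float32)   # frame at time t+1

backbone_input = np.concatenate([I_t, I_t1], axis=-1)
assert backbone_input.shape == (H, W, 6)  # six-channel input map
```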
 
 \paragraph{Per-RoI motion prediction}
-We use a rigid 3D motion parametrization similar to the one used by SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
+We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
 For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$
 \footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
 and translations: $\{R, t \,|\, R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
-of the object between the two frames $I_t$ and $I_{t+1}$ as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$ .
+of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$.
 We parametrize ${R_t^k}$ using an Euler angle representation,
 
 \begin{equation}
@@ -53,50 +59,71 @@ R_t^{k,z}(\gamma) =
 \end{pmatrix},
 \end{equation}
 
-and $\alpha, \beta, \gamma$ are the rotation angles about the $x,y,z$-axis, respectively.
+and $\alpha, \beta, \gamma$ are the rotation angles in radians about the $x,y,z$-axis, respectively.
 
-Figure \ref{fig:motion_rcnn_head} shows our extended per-RoI head network.
-We then extend the Faster R-CNN head by adding a fully connected layer in parallel to the final fully connected layers for
-predicting refined boxes and classes.
+We then extend the Mask R-CNN head by adding a fully connected layer in parallel to the fully connected layers for
+refined boxes and classes. Figure \ref{fig:motion_rcnn_head} shows the Motion R-CNN RoI head.
 As for refined boxes and masks, we make one separate motion prediction for each class.
 Each motion is predicted as a set of nine scalar motion parameters,
 $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$,
 where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$.
 Here, we assume that motions between frames are relatively small
 and that objects rotate at most 90 degrees in either direction about any axis.
 All predictions are made in camera space, and translation and pivot predictions are in meters.
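As a hedged sketch of this parametrization: the network outputs clipped sines, and under the stated assumption that rotations stay within 90 degrees per axis, the angles are recoverable via arcsin. The composition order $R_z R_y R_x$ below is an assumption, since the full equation is elided in this hunk:

```python
import numpy as np

def rotation_from_sines(sin_a, sin_b, sin_g):
    """Assemble an Euler-angle rotation from predicted sines.

    Illustrative only: sines are clipped to [-1, 1] as in the text,
    and angles recovered by arcsin (valid for |angle| <= 90 deg).
    The order Rz @ Ry @ Rx is an assumed convention.
    """
    a, b, g = (np.arcsin(np.clip(s, -1.0, 1.0)) for s in (sin_a, sin_b, sin_g))
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(a), -np.sin(a)],
                   [0, np.sin(a),  np.cos(a)]])
    Ry = np.array([[ np.cos(b), 0, np.sin(b)],
                   [0, 1, 0],
                   [-np.sin(b), 0, np.cos(b)]])
    Rz = np.array([[np.cos(g), -np.sin(g), 0],
                   [np.sin(g),  np.cos(g), 0],
                   [0, 0, 1]])
    return Rz @ Ry @ Rx

# zero sines give the identity rotation
assert np.allclose(rotation_from_sines(0.0, 0.0, 0.0), np.eye(3))
```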
 
 \paragraph{Camera motion prediction}
 In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$
 between the two frames $I_t$ and $I_{t+1}$.
-For this, we flatten the full output of the backbone and pass it through a fully connected layer.
+For this, we flatten the bottleneck output of the backbone and pass it through a fully connected layer.
 We again represent $R_t^{cam}$ using an Euler angle representation and
 predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects.
 
 \subsection{Supervision}
 
 \paragraph{Per-RoI supervision with motion ground truth}
-Given a positive RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
-let $R_{c_k}^k, t_{c_k}^k, p_{c_k}^k$ be the predicted motion for class $c_k$
-and $R_{gt}^{i_k}, t_{gt}^{i_k}, p_{gt}^{i_k}$ the ground truth motion for the example $i_k$.
-We compute the motion loss $L_{motion}^k$ for each RoI as
+The most straightforward way to supervise the object motions is by using ground truth
+motions computed from ground truth object poses, which is in general
+only practical when training on synthetic datasets.
+Given the $k$-th positive RoI, let $i_k$ be the index of the matched ground truth example with class $c_k$,
+let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ be the predicted motion for class $c_k$
+and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}$ the ground truth motion for the example $i_k$.
+Note that we drop the subscript $t$ to improve readability.
+Inspired by the camera pose regression loss in \cite{PoseNet2},
+we use the $\ell_1$-loss to penalize the differences between ground truth and predicted % TODO actually, we use smooth l1
+rotation, translation and pivot.
+For each RoI, we compute the motion loss $L_{motion}^k$ as a linear sum of
+the individual losses,
 
 \begin{equation}
 L_{motion}^k = l_{R}^k + l_{t}^k + l_{p}^k,
 \end{equation}
 where
 \begin{equation}
-l_{R}^k = \arccos\left( \min\left\{1, \max\left\{-1, \frac{tr(inv(R_{c_k}^k) \cdot R_{gt}^{i_k}) - 1}{2} \right\}\right\} \right)
+l_{R}^k = \lVert R^{gt,i_k} - R^{k,c_k} \rVert_1,
 \end{equation}
-measures the angle of the error rotation between predicted and ground truth rotation,
 
 \begin{equation}
-l_{t}^k = \lVert inv(R_{c_k}^k) \cdot (t_{gt}^{i_k} - t_{c_k}^k) \rVert,
+l_{t}^k = \lVert t^{gt,i_k} - t^{k,c_k} \rVert_1,
 \end{equation}
-is the euclidean norm between predicted and ground truth translation, and
+and
 \begin{equation}
-l_{p}^k = \lVert p_{gt}^{i_k} - p_{c_k}^k \rVert
+l_{p}^k = \lVert p^{gt,i_k} - p^{k,c_k} \rVert_1.
 \end{equation}
-is the euclidean norm between predicted and ground truth pivot.
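A minimal sketch of this loss, using plain elementwise $\ell_1$ differences as in the added equations (the TODO notes a smooth-$\ell_1$ variant is actually used, which is omitted here; function and argument names are hypothetical):

```python
import numpy as np

def motion_loss(R_pred, t_pred, p_pred, R_gt, t_gt, p_gt):
    """Per-RoI motion loss: unweighted sum of l1 differences of
    rotation matrix, translation and pivot (illustrative sketch)."""
    l_R = np.abs(R_gt - R_pred).sum()  # l1 on rotation matrix entries
    l_t = np.abs(t_gt - t_pred).sum()  # l1 on translation (meters)
    l_p = np.abs(p_gt - p_pred).sum()  # l1 on pivot (meters)
    return l_R + l_t + l_p

loss = motion_loss(np.eye(3), np.zeros(3), np.zeros(3),
                   np.eye(3), np.ones(3), np.zeros(3))
assert np.isclose(loss, 3.0)  # only the translation term contributes
```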
 
 \paragraph{Camera motion supervision}
 We supervise the camera motion with ground truth in the same way as the
 object motions.
 
 \paragraph{Per-RoI supervision \emph{without} motion ground truth}
 A more general way to supervise the object motions is a re-projection
 loss applied to coordinates within the object bounding box,
 as used in SfM-Net \cite{SfmNet}. Let
 
 When compared to supervision with motion ground truth, a re-projection
 loss could benefit motion regression by removing any loss balancing issues between the
 rotation, translation and pivot terms \cite{PoseNet2}.
 
 \subsection{Dense flow from motion}
@@ -130,7 +157,7 @@ P'_{t+1} =
 P_t + \sum_1^{k} m_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
 \end{equation}
 
-Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in SE3$, % TODO introduce!
+Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$, % TODO introduce!
 
 \begin{equation}
 \begin{pmatrix}
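The point transformation $P'_{t+1} = P_t + \sum_k m_t^k\{R_t^k(P_t - p_t^k) + p_t^k + t_t^k - P_t\}$ in the hunk above can be sketched as follows; this is a hypothetical helper, not code from the thesis:

```python
import numpy as np

def transform_points(P, masks, Rs, ts, ps):
    """Apply masked per-object rigid motions to camera-space points.

    P:     (N, 3) points at time t
    masks: (K, N) instance membership m_t^k per point
    Rs, ts, ps: per-object rotation (3, 3), translation (3,), pivot (3,)
    """
    P_out = P.copy()
    for m, R, t, p in zip(masks, Rs, ts, ps):
        moved = (P - p) @ R.T + p + t      # rigid motion about the pivot
        P_out += m[:, None] * (moved - P)  # blend by instance mask
    return P_out

P = np.array([[1.0, 0.0, 5.0]])
masks = np.array([[1.0]])
# identity rotation, 1 m translation along x, pivot at the origin
out = transform_points(P, masks, [np.eye(3)],
                       [np.array([1.0, 0.0, 0.0])], [np.zeros(3)])
assert np.allclose(out, [[2.0, 0.0, 5.0]])
```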
bib.bib
@@ -160,3 +160,15 @@
 title = {Feature Pyramid Networks for Object Detection},
 journal = {arXiv preprint arXiv:1612.03144},
 year = {2016}}
+
+@inproceedings{PoseNet,
+author = {Alex Kendall and Matthew Grimes and Roberto Cipolla},
+title = {PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization},
+booktitle = {ICCV},
+year = {2015}}
+
+@inproceedings{PoseNet2,
+author = {Alex Kendall and Roberto Cipolla},
+title = {Geometric loss functions for camera pose regression with deep learning},
+booktitle = {CVPR},
+year = {2017}}
@@ -1,5 +1,6 @@
-We have introduced a extension on top of region-based convolutional networks to enable object motion estimation
+We have introduced an extension on top of region-based convolutional networks to enable object motion estimation
 in parallel to instance segmentation.
+\todo{complete}
 
 \subsection{Future Work}
 \paragraph{Predicting depth}
@@ -9,19 +10,20 @@ depth network from the backbone in parallel to the RPN, as in Figure \ref{}.
 Although single-frame monocular depth prediction with deep networks was already done
 to some level of success,
 our two-frame input should allow the network to make use of epipolar
-geometry for making a more reliable depth estimate.
+geometry for making a more reliable depth estimate, at least when the camera
+is moving.
 
 \paragraph{Training on real world data}
 Due to the amount of supervision required by the different components of the network
 and the complexity of the optimization problem,
 we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
-A next step would be training on a more realistic dataset.
-For example, we can first pre-train the RPN on an object detection dataset like
+A next step will be training on a more realistic dataset.
+For this, we can first pre-train the RPN on an object detection dataset like
 Cityscapes. As soon as the RPN works reliably, we could execute alternating
 steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
 On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
-the motion losses (and depth prediction), as no instance segmentation ground truth exists.
-On Cityscapes, we could continue train the full instance segmentation Mask R-CNN to
+the motion losses (and depth prediction if added), as no instance segmentation ground truth exists.
+On Cityscapes, we could continue training the instance segmentation components to
 improve detection and masks and avoid forgetting instance segmentation.
 As an alternative to this training scheme, we could investigate training on a pure
 instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
@@ -19,53 +19,82 @@ instance segmentation and motion estimation system, as it allows us to test
 different components in isolation and progress to more and more complete
 predictions up to supervising the full system on a single dataset.
 
 For our experiments, we use the \emph{clone} sequences, which are rendered in a
 way that most closely resembles the original KITTI dataset. We sample 100 examples
 to be used as validation set. From the remaining 2026 examples,
 we remove a small number of examples without object instances and use the resulting
 data as training set.
 
 \paragraph{Motion ground truth from 3D poses and camera extrinsics}
 For two consecutive frames $I_t$ and $I_{t+1}$,
-let $[R_t^{cam}|t_t^{cam}]$
-and $[R_{t+1}^{cam}|t_{t+1}^{cam}]$
+let $[R_t^{ex}|t_t^{ex}]$
+and $[R_{t+1}^{ex}|t_{t+1}^{ex}]$
 be the camera extrinsics at the two frames.
 We compute the ground truth camera motion
-$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in SE3$ as
+$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in \mathbf{SE}(3)$ as
 
 \begin{equation}
-R_{t}^{gt, cam} = R_{t+1}^{cam} \cdot inv(R_t^{cam}),
+R_{t}^{gt, cam} = R_{t+1}^{ex} \cdot inv(R_t^{ex}),
 \end{equation}
 \begin{equation}
-t_{t}^{gt, cam} = t_{t+1}^{cam} - R_{gt}^{cam} \cdot t_t^{cam}.
+t_{t}^{gt, cam} = t_{t+1}^{ex} - R_{t}^{gt, cam} \cdot t_t^{ex}.
 \end{equation}
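As a hedged sketch (names are hypothetical, and a world-to-camera extrinsics convention is assumed), the ground truth camera motion can be computed from the extrinsics at the two frames like this:

```python
import numpy as np

def camera_motion_gt(R_ex_t, t_ex_t, R_ex_t1, t_ex_t1):
    """Relative camera motion between frames t and t+1 from
    world-to-camera extrinsics; the translation uses the relative
    rotation so that a static camera yields the identity motion."""
    R_gt = R_ex_t1 @ np.linalg.inv(R_ex_t)
    t_gt = t_ex_t1 - R_gt @ t_ex_t
    return R_gt, t_gt

# a static camera yields the identity motion
R_gt, t_gt = camera_motion_gt(np.eye(3), np.zeros(3), np.eye(3), np.zeros(3))
assert np.allclose(R_gt, np.eye(3)) and np.allclose(t_gt, 0.0)
```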
 
-For any object $k$ visible in both frames, let
-$(R_t^k, t_t^k)$ and $(R_{t+1}^k, t_{t+1}^k)$
+For any object $i$ visible in both frames, let
+$(R_t^i, t_t^i)$ and $(R_{t+1}^i, t_{t+1}^i)$
 be its orientation and position in camera space
 at $I_t$ and $I_{t+1}$.
 Note that the pose at $t$ is given with respect to the camera at $t$ and
 the pose at $t+1$ is given with respect to the camera at $t+1$.
 
-We define the ground truth pivot as
+We define the ground truth pivot $p_{t}^{gt, i} \in \mathbb{R}^3$ as
 
 \begin{equation}
-p_{t}^{gt, k} = t_t^k
+p_{t}^{gt, i} = t_t^i
 \end{equation}
 
 and compute the ground truth object motion
-$\{R_t^{gt, k}, t_t^{gt, k}\} \in SE3$ as
+$\{R_t^{gt, i}, t_t^{gt, i}\} \in \mathbf{SE}(3)$ as
 
 \begin{equation}
-R_{t}^{gt, k} = inv(R_{t}^{gt, cam}) \cdot R_{t+1}^k \cdot inv(R_t^k),
+R_{t}^{gt, i} = inv(R_t^{gt, cam}) \cdot R_{t+1}^i \cdot inv(R_t^i),
 \end{equation}
 \begin{equation}
-t_{t}^{gt, k} = t_{t+1}^{cam} - R_{gt}^{cam} \cdot t_t.
-\end{equation} % TODO
-% TODO change notation in approach to remove t subscript from motion matrices and vectors!
+t_{t}^{gt, i} = t_{t+1}^{i} - R_t^{gt, cam} \cdot t_t^{i}.
+\end{equation}
 
 \paragraph{Evaluation metrics with motion ground truth}
 Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
 let $i_k$ be the index of the best matching ground truth example,
 let $c_k$ be the predicted class,
 let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ be the predicted motion for class $c_k$
 and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}$ the ground truth motion for the example $i_k$.
 Then, assuming there are $N$ such detections,
 \begin{equation}
 E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{tr(inv(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right)
 \end{equation}
 measures the mean angle of the error rotation between predicted and ground truth rotation,
 \begin{equation}
 E_{t} = \frac{1}{N}\sum_k \lVert inv(R^{k,c_k}) \cdot (t^{gt,i_k} - t^{k,c_k}) \rVert,
 \end{equation}
 is the mean Euclidean norm between predicted and ground truth translation, and
 \begin{equation}
 E_{p} = \frac{1}{N}\sum_k \lVert p^{gt,i_k} - p^{k,c_k} \rVert
 \end{equation}
 is the mean Euclidean norm between predicted and ground truth pivot.
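The per-detection rotation term of $E_R$ above can be sketched as follows (an illustrative helper; the clamp of the trace argument to $[-1, 1]$ guards `arccos` against numerical round-off):

```python
import numpy as np

def rotation_angle_error(R_pred, R_gt):
    """Geodesic angle of the error rotation inv(R_pred) @ R_gt,
    with the cosine argument clamped to [-1, 1]."""
    cos = (np.trace(np.linalg.inv(R_pred) @ R_gt) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

# a 90-degree error about the z-axis yields an angle of pi/2
Rz90 = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
assert np.isclose(rotation_angle_error(np.eye(3), Rz90), np.pi / 2)
```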
 
 \subsection{Training Setup}
-Our training schedule is similar to the Mask R-CNN Cityscapes schedule.
+Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
 We train on a single Titan X (Pascal) for a total of 192K iterations on the
-Virtual KITTI dataset. As learning rate we use $0.25 \cdot 10^{-2}$ for the
+Virtual KITTI training set. As learning rate we use $0.25 \cdot 10^{-2}$ for the
 first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations.
 
 \paragraph{R-CNN training parameters}
 \todo{add this}
 
 \subsection{Experiments on Virtual KITTI}
 \todo{add this}
 
 \subsection{Evaluation on KITTI 2015}
 \todo{add this}
introduction.tex
@@ -1,23 +1,30 @@
-\subsection{Motivation}
+\subsection{Motivation \& Goals}
 
 % introduce problem to solve
 % mention classical non deep-learning works, then say it would be nice to go end-to-end deep
 For moving in the real world, it is generally desirable to know which objects exist
 in the proximity of the moving agent,
 where they are located relative to the agent,
 and where they will be at some point in the future.
 In many cases, it would be preferable to infer such information from video data
 if technically feasible, as camera sensors are cheap and ubiquitous.
 
 % Steal intro from behl2017 & FlowLayers
 For example, in autonomous driving, it is crucial to not only know the position
 of each obstacle, but to also know if and where the obstacle is moving,
 and to use sensors that will not make the system too expensive for widespread use.
 There are many other applications. %TODO(make motivation wider)
 
-Deep learning research is moving towards videos.
-Motion estimation is an inherently ambigous problem and.
-A recent trend is towards end-to-end deep learning systems, away from energy-minimization.
-Often however, this leads to a compromise in modelling as it is more difficult to
-formulate a end-to-end deep network architecture for a given problem than it is
-to state a fesable energy-minimization problem.
-For this reason, we see lots of generic models applied to domains which previously
-employed intricate physical models to simplify optimization.
-On the on hand, end-to-end deep learning may bringe unique benefits due do the ability
-of a learned system to deal with ambiguity.
-On the other hand,
-%Thus, there is an emerging trend to unify geometry with deep learning by
-% THE ABOVE IS VERY DRAFT_LIKE
+A promising approach for 3D scene understanding in these situations may be deep neural
+networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification
+in still images and are more and more often being applied to video data.
+A key benefit of end-to-end deep networks is that they can, in principle,
+enable very fast inference on real-time video data and generalize
+over many training examples to resolve ambiguities inherent in image understanding
+and motion estimation.
 
 Thus, in this work, we aim to develop an end-to-end deep network which can, given
 sequences of images, segment the image pixels into object instances and estimate
 the location and 3D motion of each object instance relative to the camera.
 
 \subsection{Technical outline}
 
 Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
 and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
@@ -46,40 +53,64 @@ This gives us a fully integrated end-to-end network architecture for segmenting
 and estimating the motion of all detected instances without any limitations
 as to the number or variety of object instances.
 
 Eventually, we want to extend our method to include end-to-end depth prediction,
 yielding the first end-to-end deep network to perform 3D scene flow estimation
 in a principled way by considering individual objects.
 For now, we will work with RGB-D frames to break down the problem into manageable pieces.
 
 \subsection{Related work}
 
 In the following, we will refer to systems which use deep networks for all
 optimization and do not perform time-critical side computation at inference time as
 \emph{end-to-end} deep learning systems.
 
 \paragraph{Deep networks in optical flow}
 
 End-to-end deep networks for optical flow were recently introduced
 based on encoder-decoder networks or CNN pyramids \cite{FlowNet, FlowNet2, SPyNet},
-which pose optical flow as generic pixel-wise estimation problem without making any assumptions
+which pose optical flow as a generic, homogeneous pixel-wise estimation problem without making any assumptions
 about the regularity and structure of the estimated flow.
-Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure
-the optical flow estimation, but still require expensive energy minimization for each
+Specifically, such methods ignore that the optical flow varies across an
+image depending on the semantics of each region or pixel, which include whether a
+pixel belongs to the background, to which object instance it belongs if it is not background,
+and the class of the object it belongs to.
+Often, failure cases of these methods include motion boundaries or regions with little texture,
+where semantics become more important. % TODO make sure this is a grounded statement
+Extensions of these approaches to scene flow estimate flow and depth
+with similarly generic networks \cite{SceneFlowDataset} and similar limitations.
+
+Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure % TODO cite jun's paper?
+the optical flow estimation problem and introduce semantics,
+but still require expensive energy minimization for each
 new input, as CNNs are only used for some of the components.
 
 \paragraph{Slanted plane methods for 3D scene flow}
 The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as being
 composed of planar segments. Pixels are assigned to one of the planar segments,
-each of which undergoes a rigid motion.
-
+each of which undergoes an independent 3D rigid motion. % TODO explain benefits of this modelling, but unify with explanations above
 In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015}
 assigns each slanted plane to one rigidly moving object instance, thus
 reducing the number of independently moving segments by allowing multiple
 segments to share the motion of the object they belong to.
 In these methods, pixel assignment and motion estimation are formulated
 as an energy-minimization problem and optimized for each input data point,
 without any learning. % TODO make sure it's ok to say there's no learning
 
-In a recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
+In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
 a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
 with depth obtained from a non-learned stereo algorithm to be used as pre-computed
-inputs to their slanted plane scene flow model based on \cite{KITTI2015}.
+inputs to a slanted plane scene flow model based on \cite{KITTI2015}.
 Most likely due to their use of deep learning for instance segmentation and for some other components, this
 approach outperforms the previous related scene flow methods on public benchmarks.
 Still, the method uses an energy-minimization formulation for the scene flow estimation
 and takes minutes to make a prediction.
 
-Interestingly, these slanted plane methods achieve the current state-of-the-art
-in scene flow \emph{and} optical flow estimation on the KITTI benchmarks \cite{KITTI2012, KITTI2015},
+Interestingly, the slanted plane methods achieve the current state-of-the-art
+in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015},
 outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
-However, the end-to-end deep networks are significantly faster than energy-minimization based slanted plane models,
-generally taking a fraction of a second instead of minutes to compute and can often be modified to run in realtime.
-These concerns restrict the applicability of the current slanted plane models in practical applications,
+However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts,
+generally taking a fraction of a second instead of minutes for prediction, and can often be made to run in realtime.
+These concerns restrict the applicability of the current slanted plane models in practical settings,
 which often require estimations to be done in realtime and for which an end-to-end
 approach based on learning would be preferable.
@@ -92,14 +123,14 @@ end-to-end deep networks.
 
 Thus, in the context of motion estimation, one could expect end-to-end deep learning to not only bring large improvements
 in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation
-and the ability of deep networks to learn to handle ambiguity from experience. % TODO instead of experience, talk about compressing large datasets / generalization
+and the ability of deep networks to learn to handle ambiguity from a large variety of training examples.
 
 However, we think that the current end-to-end deep learning approaches to motion
 estimation are limited by a lack of spatial structure and regularity in their estimates,
-which stems from the generic nature of the employed networks.
+which stems from the generic nature of the employed networks. % TODO move to end-to-end deep nets section
 To this end, we aim to combine the modelling benefits of rigid scene decompositions
 with the promise of end-to-end deep learning.
 
 \paragraph{End-to-end deep networks for 3D rigid motion estimation}
 End-to-end deep learning for predicting rigid 3D object motions was first introduced with
 SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation
@@ -109,3 +140,15 @@ estimates a segmentation of pixels into objects together with their 3D motions b
 In addition, SfM-Net predicts dense depth and camera motion to obtain full 3D scene flow from end-to-end deep learning.
 For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate
 with a brightness constancy proxy loss.
+
+Like SfM-Net, we aim to estimate motion and instance segmentation jointly with
+end-to-end deep learning.
+Unlike SfM-Net, we build on a scalable object detection and instance segmentation
+approach with R-CNNs, which provide a strong baseline.
+
+\paragraph{End-to-end deep networks for camera pose estimation}
+Deep networks have been used for estimating the 6-DOF camera pose from
+a single RGB frame \cite{PoseNet, PoseNet2}. These works are related to
+ours in that we also need to output various rotations and translations from a deep network
+and thus need to solve similar regression problems and use similar parametrizations
+and losses.
@@ -81,6 +81,9 @@
 \parindent 0em % first-line indentation
 
+
+\newcommand{\todo}[1]{\textbf{\textcolor{red}{#1}}}
+
 
 % front matter
 \author{\myname}
 \thesistitle{\mytitleen}{\mytitlede}