diff --git a/approach.tex b/approach.tex index 2c9d9e5..300e397 100644 --- a/approach.tex +++ b/approach.tex @@ -1,5 +1,6 @@ \subsection{Motion R-CNN architecture} +\label{ssec:architecture} Building on Mask R-CNN \cite{MaskRCNN}, we estimate per-object motion by predicting the 3D motion of each detected object. @@ -76,8 +77,11 @@ Each instance motion is predicted as a set of nine scalar parameters, $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$, where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$. Here, we assume that motions between frames are relatively small -and that objects rotate at most 90 degrees in either direction along any axis. +and that objects rotate at most 90 degrees in either direction along any axis, +which is generally a safe assumption for image sequences from videos. All predictions are made in camera space, and translation and pivot predictions are in meters. +We additionally predict softmax scores $o_t^k$ for classifying each detected object as +still or moving. \todo{figure of head} \paragraph{Camera motion prediction} @@ -86,8 +90,11 @@ between the two frames $I_t$ and $I_{t+1}$. For this, we flatten the bottleneck output of the backbone and pass it through a fully connected layer. We again represent $R_t^{cam}$ using a Euler angle representation and predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects. +Again, we predict a softmax score $o_t^{cam}$ for differentiating between +a still and a moving camera. \subsection{Supervision} +\label{ssec:supervision} \paragraph{Per-RoI supervision with 3D motion ground truth} The most straightforward way to supervise the object motions is by using ground truth @@ -97,32 +104,47 @@ Given the $k$-th positive RoI, let $i_k$ be the index of the matched ground trut let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ be the predicted motion for class $c_k$ and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}$ the ground truth motion for the example $i_k$. Note that we dropped the subscript $t$ to increase readability. -Inspired by the camera pose regression loss in \cite{PoseNet2}, -we use an $\ell_1$-loss to penalize the differences between ground truth and predicted % TODO actually, we use smooth l1 -rotation, translation and pivot. +Similarly to the camera pose regression loss in \cite{PoseNet2}, +we use a variant of the $\ell_1$-loss to penalize the differences between ground truth and predicted +rotation, translation and, in our case, pivot. We found that the smooth $\ell_1$-loss +performs better here than the standard $\ell_1$-loss. For each RoI, we compute the motion loss $L_{motion}^k$ as a linear sum of the individual losses, \begin{equation} -L_{motion}^k =l_{R}^k + l_{t}^k + l_{p}^k, +L_{motion}^k = l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o^{gt,i_k} + l_o^k, \end{equation} where \begin{equation} -l_{R}^k = \lVert R^{gt,i_k} - R^{k,c_k} \rVert _1, +l_{R}^k = \ell_1^* (R^{gt,i_k} - R^{k,c_k}), \end{equation} +\begin{equation} +l_{t}^k = \ell_1^* (t^{gt,i_k} - t^{k,c_k}), +\end{equation} +\begin{equation} +l_{p}^k = \ell_1^* (p^{gt,i_k} - p^{k,c_k}) +\end{equation} +are the smooth $\ell_1$-losses for the predicted rotation, translation and pivot, +respectively, and +\begin{equation} +l_o^k = \ell_{cls}(o_t^k, o^{gt,i_k}) +\end{equation} +is the cross-entropy loss for the predicted classification into moving and non-moving objects.
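+To make the composition of the per-RoI motion loss concrete, the following
+TensorFlow-style sketch computes it for a batch of $N$ RoIs. It is only an
+illustration under simplifying assumptions, not our exact implementation:
+all tensor names and shapes are hypothetical, the rotation is assumed to be
+represented by its three predicted sines, and the selection of the per-class
+prediction $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ is assumed to have already taken place.
+\begin{verbatim}
+import tensorflow as tf
+
+def smooth_l1(x):
+    # elementwise smooth l1, summed over the last (parameter) dimension
+    abs_x = tf.abs(x)
+    elems = tf.where(abs_x < 1.0, 0.5 * tf.square(x), abs_x - 0.5)
+    return tf.reduce_sum(elems, axis=-1)
+
+def motion_loss(sin_pred, t_pred, p_pred, o_logits, sin_gt, t_gt, p_gt, o_gt):
+    # sin_*: [N, 3] predicted / ground truth sin(alpha), sin(beta), sin(gamma)
+    # t_*, p_*: [N, 3] translations and pivots in meters (camera space)
+    # o_logits: [N, 2] moving/still scores, o_gt: [N] integer labels in {0, 1}
+    l_R = smooth_l1(sin_gt - sin_pred)
+    l_t = smooth_l1(t_gt - t_pred)
+    l_p = smooth_l1(p_gt - p_pred)
+    l_o = tf.nn.sparse_softmax_cross_entropy_with_logits(
+        labels=o_gt, logits=o_logits)
+    moving = tf.cast(o_gt, tf.float32)
+    # rotation and translation terms are only active for moving objects
+    return tf.reduce_mean(l_p + (l_R + l_t) * moving + l_o)
+\end{verbatim}
+Averaging over the RoIs in the last line is only one possible normalization
+and is not prescribed by the per-RoI loss definition above.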
-\begin{equation} -l_{t}^k = \lVert t^{gt,i_k} - t^{k,c_k} \rVert_1, -\end{equation} -and -\begin{equation} -l_{p}^k = \lVert p^{gt,i_k} - p^{k,c_k} \rVert_1. -\end{equation} +Note that we do not penalize the rotation and translation of objects with +$o^{gt,i_k} = 0$, i.e.\ objects that do not move between $t$ and $t+1$. We found that the network +may not reliably predict exact identity motions for still objects; it is +numerically easier to classify objects as moving or non-moving and to discard +the regression terms for the non-moving +ones. \paragraph{Camera motion supervision} We supervise the camera motion with ground truth analogously to the object motions, with the only difference being that we only have a rotation and translation, but no pivot term for the camera motion. +If the ground truth shows that the camera is not moving, we again do not +penalize rotation and translation; in this case, the camera loss reduces to the +classification term. \paragraph{Per-RoI supervision \emph{without} 3D motion ground truth} A more general way to supervise the object motions is a re-projection @@ -166,6 +188,7 @@ which can make it interesting even when 3D motion ground truth is available. \subsection{Dense flow from motion} +\label{ssec:postprocessing} As a postprocessing, we compose a dense optical flow map from the outputs of our Motion R-CNN network. Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$, where diff --git a/background.tex b/background.tex index 8393850..5ef049d 100644 --- a/background.tex +++ b/background.tex @@ -1,6 +1,22 @@ In this section, we will give a more detailed description of previous works we directly build on and other prerequisites. +\subsection{Basic definitions} +For regression, we define the smooth $\ell_1$-loss as +\begin{equation} +\ell_1^*(x) = +\begin{cases} +0.5x^2 &\text{if } |x| < 1 \\ +|x| - 0.5 &\text{otherwise,} +\end{cases} +\end{equation} +which provides a certain robustness to outliers and will be used +frequently in the following sections. When its argument is a vector or matrix, +we apply the loss elementwise and sum over all elements. +For classification, we define the cross-entropy loss for a predicted discrete +probability distribution $p$ and a ground truth class $u$ as +\begin{equation} +\ell_{cls}(p, u) = -\log p_u. +\end{equation} + \subsection{Optical flow and scene flow} Let $I_1,I_2 : P \to \mathbb{R}^3$ be two temporally consecutive frames in a sequence of images. @@ -43,7 +59,17 @@ Note that the maximum displacement that can be correctly estimated only depends operations in the encoder. Recently, other encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}. +\subsection{SfM-Net} +Here, we describe the SfM-Net architecture in more detail, show its results, +and discuss some of its limitations. + +\subsection{ResNet} +\label{ssec:resnet} +For completeness, we give a short review of the ResNet \cite{ResNet} architecture, which we use +as the backbone CNN for our network. + \subsection{Region-based convolutional networks} +\label{ssec:rcnn} We now give a short review of region-based convolutional networks, which are currently by far the most popular deep networks for object detection, and have recently also been applied to instance segmentation. @@ -51,7 +77,7 @@ most popular deep networks for object detection, and have recently also been app Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object.
For each of the region proposals, the input image is cropped using the region's bounding box and the crop is -passed through a CNN, which performs classification of the object (or non-object, if the region shows background). % and box refinement! +passed through a CNN, which performs classification of the object (or non-object, if the region shows background). \paragraph{Fast R-CNN} The original R-CNN involves computing one forward pass of the CNN for each of the region proposals, @@ -120,6 +146,69 @@ variant based on Feature Pyramid Networks \cite{FPN}. Figure \ref{} compares the two Mask R-CNN head variants. \todo{RoI Align} +\paragraph{Bounding box regression} +All bounding boxes predicted by the RoI head or RPN are estimated as offsets +with respect to a reference bounding box. In the case of the RPN, +the reference bounding box is one of the anchors, and refined bounding boxes from the RoI head are +predicted relative to the RPN output bounding boxes. +Let $(x, y, w, h)$ be the top left coordinates, width and height of the bounding box +to be predicted. Likewise, let $(x^*, y^*, w^*, h^*)$ be the ground truth bounding +box and let $(x_r, y_r, w_r, h_r)$ be the reference bounding box. +We then define the ground truth \emph{box encoding} $b^*$ as +\begin{equation*} +b^* = (b_x^*, b_y^*, b_w^*, b_h^*), +\end{equation*} +where +\begin{equation*} +b_x^* = \frac{x^* - x_r}{w_r}, +\end{equation*} +\begin{equation*} +b_y^* = \frac{y^* - y_r}{h_r}, +\end{equation*} +\begin{equation*} +b_w^* = \log \left( \frac{w^*}{w_r} \right), +\end{equation*} +\begin{equation*} +b_h^* = \log \left( \frac{h^*}{h_r} \right), +\end{equation*} +which represents the regression target for the bounding box refinement +outputs of the network. + +In the same way, we define the predicted box encoding $b$ as +\begin{equation*} +b = (b_x, b_y, b_w, b_h), +\end{equation*} +where +\begin{equation*} +b_x = \frac{x - x_r}{w_r}, +\end{equation*} +\begin{equation*} +b_y = \frac{y - y_r}{h_r}, +\end{equation*} +\begin{equation*} +b_w = \log \left( \frac{w}{w_r} \right), +\end{equation*} +\begin{equation*} +b_h = \log \left( \frac{h}{h_r} \right). +\end{equation*} + +At test time, to get from a predicted box encoding $(b_x, b_y, b_w, b_h)$ to the actual bounding box $(x, y, w, h)$, +we invert the definitions above, +\begin{equation*} +x = b_x \cdot w_r + x_r, +\end{equation*} +\begin{equation*} +y = b_y \cdot h_r + y_r, +\end{equation*} +\begin{equation*} +w = \exp(b_w) \cdot w_r, +\end{equation*} +\begin{equation*} +h = \exp(b_h) \cdot h_r, +\end{equation*} +and thus obtain the bounding box as the reference bounding box adjusted by +the predicted relative offsets and scales. + \paragraph{Supervision of the RPN} \todo{TODO} diff --git a/conclusion.tex b/conclusion.tex index 80b641b..465bd7a 100644 --- a/conclusion.tex +++ b/conclusion.tex @@ -14,12 +14,15 @@ for real time scenarios. We thus presented a step towards real time 3D motion estimation based on a physically sound scene decomposition. Thanks to instance-level reasoning, in contrast to previous end-to-end deep networks for dense motion estimation, the output -of our network is highly interpretable, which may bring benefits for safety-critical +of our network is highly interpretable, which may also bring benefits for safety-critical applications. \subsection{Future Work} \paragraph{Predicting depth} -In most cases, we want to work with raw RGB sequences for which no depth is available.
+In this work, we focused on motion estimation when RGB-D frames with dense depth are available. +However, in many application settings, we are not provided with any depth information +and instead have to work with raw RGB sequences from one or more simple cameras, +for which no depth data is available. To do so, we could integrate depth prediction into our network by branching off a depth network from the backbone in parallel to the RPN, as in Figure \ref{}. Although single-frame monocular depth prediction with deep networks was already done diff --git a/experiments.tex b/experiments.tex index 90bb6eb..38c7bee 100644 --- a/experiments.tex +++ b/experiments.tex @@ -14,6 +14,7 @@ we use the \texttt{tf.crop\_and\_resize} TensorFlow function with interpolation set to bilinear. \subsection{Datasets} +\label{ssec:datasets} \paragraph{Virtual KITTI} The synthetic Virtual KITTI dataset \cite{VKITTI} is a re-creation of the KITTI @@ -54,12 +55,22 @@ We compute the ground truth camera motion $\{R_t^{gt, cam}, t_t^{gt, cam}\} \in \mathbf{SE}(3)$ as \begin{equation} -R_{t}^{gt, cam} = R_{t+1}^{ex} \cdot inv(R_t^{ex}), +R_{t}^{gt, cam} = R_{t+1}^{ex} \cdot \mathrm{inv}(R_t^{ex}), \end{equation} \begin{equation} t_{t}^{gt, cam} = t_{t+1}^{ex} - R_{t}^{ex} \cdot t_t^{ex}. \end{equation} +Additionally, we define $o_t^{gt, cam} \in \{ 0, 1 \}$, +\begin{equation} +o_t^{gt, cam} = +\begin{cases} +1 &\text{if the camera pose changes between $t$ and $t+1$} \\ +0 &\text{otherwise,} +\end{cases} +\end{equation} +which specifies whether the camera is moving between the frames. + For any object $i$ visible in both frames, let $(R_t^i, t_t^i)$ and $(R_{t+1}^i, t_{t+1}^i)$ be its orientation and position in camera space @@ -77,12 +88,22 @@ and compute the ground truth object motion $\{R_t^{gt, i}, t_t^{gt, i}\} \in \mathbf{SE}(3)$ as \begin{equation} -R_{t}^{gt, i} = inv(R_t^{gt, cam}) \cdot R_{t+1}^i \cdot inv(R_t^i), +R_{t}^{gt, i} = \mathrm{inv}(R_t^{gt, cam}) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i), \end{equation} \begin{equation} t_{t}^{gt, i} = t_{t+1}^{i} - R_t^{gt, cam} \cdot t_t. \end{equation} +As for the camera, we define $o_t^{gt, i} \in \{ 0, 1 \}$, +\begin{equation} +o_t^{gt, i} = +\begin{cases} +1 &\text{if the position of object $i$ changes between $t$ and $t+1$} \\ +0 &\text{otherwise,} +\end{cases} +\end{equation} +which specifies whether object $i$ is moving between the frames. + \paragraph{Evaluation metrics with motion ground truth} To evaluate the 3D instance and camera motions on the Virtual KITTI validation set, we introduce a few error metrics. @@ -93,21 +114,22 @@ let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}$ be the predicted motion for class $c_k$ and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}$ the ground truth motion for the example $i_k$.
Then, assuming there are $N$ such detections, \begin{equation} -E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{tr(inv(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right) +E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right) \end{equation} measures the mean angle of the error rotation between predicted and ground truth rotation, \begin{equation} -E_{t} = \frac{1}{N}\sum_k \lVert inv(R^{k,c_k}) \cdot (t^{gt,i_k} - t^{k,c_k}) \rVert, +E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R^{k,c_k}) \cdot (t^{gt,i_k} - t^{k,c_k}) \right\rVert_2, \end{equation} is the mean Euclidean norm between predicted and ground truth translation, and \begin{equation} -E_{p} = \frac{1}{N}\sum_k \lVert p^{gt,i_k} - p^{k,c_k} \rVert +E_{p} = \frac{1}{N}\sum_k \left\lVert p^{gt,i_k} - p^{k,c_k} \right\rVert_2 \end{equation} is the mean Euclidean norm between predicted and ground truth pivot. Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for predicted camera motions. \subsection{Training Setup} +\label{ssec:setup} Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}. We train on a single Titan X (Pascal) for a total of 192K iterations on the Virtual KITTI training set. @@ -120,6 +142,7 @@ first 144K iterations and $0.25 \cdot 10^{-3}$ for all remaining iterations. \todo{add this} \subsection{Experiments on Virtual KITTI} +\label{ssec:vkitti} \begin{figure}[t] \centering diff --git a/introduction.tex b/introduction.tex index 4034dfd..b26713c 100644 --- a/introduction.tex +++ b/introduction.tex @@ -9,7 +9,7 @@ and estimates their 3D locations as well as all 3D object motions between the fr \label{figure:teaser} \end{figure} -\subsection{Motivation \& Goals} +\subsection{Motivation} For moving in the real world, it is generally desirable to know which objects exist in the proximity of the moving agent, @@ -35,7 +35,7 @@ sequences of images, segment the image pixels into object instances and estimate the location and 3D motion of each object instance relative to the camera (Figure \ref{figure:teaser}). -\subsection{Technical outline} +\subsection{Technical goals} Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera. @@ -178,3 +178,25 @@ a single RGB frame \cite{PoseNet, PoseNet2}. These works are related to ours in that we also need to output various rotations and translations from a deep network and thus need to solve similar regression problems and use similar parametrizations and losses. + + +\subsection{Outline} +In section \ref{sec:background}, we introduce preliminaries and building +blocks from earlier works that serve as a foundation for our networks and losses. +Most importantly, we review the ResNet CNN (\ref{ssec:resnet}) that will serve as the CNN backbone, +as well as the developments in region-based CNNs on which we build (\ref{ssec:rcnn}), +specifically Mask R-CNN and the FPN \cite{FPN}.
+In section \ref{sec:approach}, we describe our technical contribution, starting +with our modifications to the Mask R-CNN backbone and head networks (\ref{ssec:architecture}), +followed by our losses and supervision methods for training +the extended region-based CNN (\ref{ssec:supervision}), and +finally the postprocessing steps we use to derive dense flow from our 3D motion estimates +(\ref{ssec:postprocessing}). +In section \ref{sec:experiments}, we introduce the Virtual KITTI dataset we use +for training our networks as well as all preprocessing steps we perform (\ref{ssec:datasets}), +give details of our experimental setup (\ref{ssec:setup}), +and finally describe the experimental results +on Virtual KITTI (\ref{ssec:vkitti}). +In section \ref{sec:conclusion}, we summarize our work and describe future +developments, including depth prediction, training on real-world data, +and exploiting frames over longer time intervals. diff --git a/thesis.tex b/thesis.tex index c232239..cbcd518 100644 --- a/thesis.tex +++ b/thesis.tex @@ -123,17 +123,20 @@ \onehalfspacing +\label{sec:introduction} \input{introduction} \section{Background} \parindent 2em \onehalfspacing +\label{sec:background} \input{background} \section{Motion R-CNN} \parindent 2em \onehalfspacing +\label{sec:approach} \input{approach} \section{Experiments} @@ -141,12 +144,14 @@ \onehalfspacing +\label{sec:experiments} \input{experiments} \section{Conclusion} \parindent 2em \onehalfspacing +\label{sec:conclusion} \input{conclusion} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Bibliografie mit BibLaTeX