diff --git a/abstract.tex b/abstract.tex index 958867c..bcb1cb1 100644 --- a/abstract.tex +++ b/abstract.tex @@ -28,7 +28,7 @@ we integrate motion estimation with instance segmentation. Given two consecutive frames from a monocular RGB-D camera, our resulting end-to-end deep network detects objects with precise per-pixel object masks and estimates the 3D motion of each detected object between the frames. -By additionally estimating a global camera motion in the same network, +By additionally estimating the camera ego-motion in the same network, we compose a dense optical flow field based on instance-level and global motion predictions. We train our network on the synthetic Virtual KITTI dataset, which provides ground truth for all components of our system. @@ -62,7 +62,7 @@ Networks (R-CNNs) auf und integrieren Bewegungsschätzung mit Instanzsegmentieru Bei Eingabe von zwei aufeinanderfolgenden Frames aus einer monokularen RGB-D Kamera erkennt unser end-to-end Deep Network Objekte mit pixelgenauen Objektmasken und schätzt die 3D-Bewegung jedes erkannten Objekts zwischen den Frames ab. -Indem wir zusätzlich im selben Netzwerk die globale Kamerabewegung schätzen, +Indem wir zusätzlich im selben Netzwerk die Eigenbewegung der Kamera schätzen, setzen wir aus den instanzbasierten und globalen Bewegungsschätzungen ein dichtes optisches Flussfeld zusammen. Wir trainieren unser Netzwerk auf dem synthetischen Virtual KITTI Datensatz, diff --git a/approach.tex b/approach.tex index 55d1d43..158c0dd 100644 --- a/approach.tex +++ b/approach.tex @@ -7,7 +7,7 @@ we estimate per-object motion by predicting the 3D motion of each detected objec For this, we extend Mask R-CNN in two straightforward ways. First, we modify the backbone network and provide two frames to the R-CNN system in order to enable image matching between the consecutive frames. -Second, we extend the Mask R-CNN RoI head to predict a 3D motion for each +Second, we extend the Mask R-CNN RoI head to predict a 3D motion and pivot for each region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn} show our Motion R-CNN networks based on Mask R-CNN ResNet and Mask R-CNN ResNet-FPN, respectively. @@ -18,7 +18,7 @@ respectively. \toprule \textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\ \midrule\midrule -& input images & H $\times$ W $\times$ C \\ +& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\ \midrule C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 1024 \\ \midrule @@ -69,7 +69,7 @@ additonally dropout with $p = 0.5$ after all fully-connected hidden layers. \toprule \textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\ \midrule\midrule -& input images & H $\times$ W $\times$ C \\ +& input images $I_t$, $I_{t+1}$, and (optional) XYZ$_{t}$, XYZ$_{t+1}$ & H $\times$ W $\times$ C \\ \midrule C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 2048 \\ \midrule @@ -121,32 +121,32 @@ Like Faster R-CNN and Mask R-CNN, we use a ResNet \cite{ResNet} variant as backb Inspired by FlowNetS \cite{FlowNet}, we make one modification to the ResNet backbone to enable image matching, laying the foundation for our motion estimation. Instead of taking a single image as input to the backbone, we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yielding an input image map with six channels.
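As an illustration of this input construction (and of the optional XYZ concatenation discussed next), a minimal TensorFlow-style sketch is given below; the tensor and function names are hypothetical and not taken from the actual implementation.
\begin{verbatim}
import tensorflow as tf

def backbone_input(image_t, image_tp1, xyz_t=None, xyz_tp1=None):
    """Depth-concatenate consecutive frames (and optional XYZ maps).

    image_t, image_tp1: [N, H, W, 3] RGB frames I_t and I_{t+1}.
    xyz_t, xyz_tp1:     [N, H, W, 3] camera space XYZ maps (optional).
    Returns an [N, H, W, 6] or [N, H, W, 12] input tensor for the backbone.
    """
    inputs = [image_t, image_tp1]
    if xyz_t is not None and xyz_tp1 is not None:
        inputs += [xyz_t, xyz_tp1]
    return tf.concat(inputs, axis=3)
\end{verbatim}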
-Alternatively, we also experiment with concatenating the camera space XYZ coordinates for each frame, +Additionally, we experiment with concatenating the camera space XYZ coordinates for each frame, XYZ$_t$ and XYZ$_{t+1}$, into the input as well. We do not introduce a separate network for computing region proposals and use our modified backbone network as both first stage RPN and second stage feature extractor for extracting the RoI features. -Technically, our feature encoder network will have to learn a motion representation similar to +Technically, our feature encoder network will have to learn image matching representations similar to those learned by the FlowNet encoder, but the output will be computed in the object-centric framework of a region-based convolutional network head with a 3D parametrization. -Thus, in contrast to the dense FlowNet decoder, the estimated dense motion information -from the encoder is integrated for specific objects via RoI cropping and +Thus, in contrast to the dense FlowNet decoder, the estimated dense image matching information +from the encoder is integrated for specific objects via RoI extraction and processed by the RoI head for each object. \paragraph{Per-RoI motion prediction} We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}. -For the $k$-th object proposal, we predict the rigid transformation $\{R_t^k, t_t^k\}\in \mathbf{SE}(3)$ +For the $k$-th object proposal, we predict the rigid transformation $\{R_k, t_k\}\in \mathbf{SE}(3)$ \footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations and translations: $\{R, t|R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$} -of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_t^k \in \mathbb{R}^3$ at time $t$. -We parametrize ${R_t^k}$ using an Euler angle representation, +of the object between the two frames $I_t$ and $I_{t+1}$, as well as the object pivot $p_k \in \mathbb{R}^3$ at time $t$. +We parametrize ${R_k}$ using an Euler angle representation, \begin{equation} -R_t^k = R_t^{k,z}(\gamma) \cdot R_t^{k,x}(\alpha) \cdot R_t^{k,y}(\beta), +R_k = R_k^z(\gamma) \cdot R_k^x(\alpha) \cdot R_k^y(\beta), \end{equation} where \begin{equation} -R_t^{k,x}(\alpha) = +R_k^x(\alpha) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\alpha) & -\sin(\alpha) \\ @@ -155,7 +155,7 @@ R_t^{k,x}(\alpha) = \end{equation} \begin{equation} -R_t^{k,y}(\beta) = +R_k^y(\beta) = \begin{pmatrix} \cos(\beta) & 0 & \sin(\beta) \\ 0 & 1 & 0 \\ @@ -164,7 +164,7 @@ R_t^{k,y}(\beta) = \end{equation} \begin{equation} -R_t^{k,z}(\gamma) = +R_k^z(\gamma) = \begin{pmatrix} \cos(\gamma) & -\sin(\gamma) & 0 \\ \sin(\gamma) & \cos(\gamma) & 0 \\ @@ -179,25 +179,26 @@ prediction in addition to the fully-connected layers for refined boxes and classes and the convolutional network for the masks. Like for refined boxes and masks, we make one separate motion prediction for each class. Each instance motion is predicted as a set of nine scalar parameters, -$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_t^k$ and $p_t^k$, +$\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$, $t_k$ and $p_k$, where $\sin(\alpha)$, $\sin(\beta)$ and $\sin(\gamma)$ are clipped to $[-1, 1]$. Here, we assume that motions between frames are relatively small and that objects rotate at most 90 degrees in either direction along any axis, -which is in general a safe assumption for image sequences from videos.
+which is in general a safe assumption for image sequences from videos, +and enables us to obtain unique cosine values from the predicted sine values. All predictions are made in camera space, and translation and pivot predictions are in meters. -We additionally predict softmax scores $o_t^k$ for classifying the objects into -still and moving objects. As a postprocessing, for any object instance $k$ with predicted moving flag $o_t^k = 0$, -we set $\sin(\alpha) = \sin(\beta) = \sin(\gamma) = 0$ and $t_t^k = (0,0,0)^T$, +We additionally predict softmax scores $o_k$ for classifying the objects into +still and moving objects. As a postprocessing, for any object instance $k$ with predicted moving flag $o_k = 0$, +we set $\sin(\alpha) = \sin(\beta) = \sin(\gamma) = 0$ and $t_k = (0,0,0)^T$, and thus predict an identity motion. \paragraph{Camera motion prediction} -In addition to the object transformations, we optionally predict the camera motion $\{R_t^{cam}, t_t^{cam}\}\in \mathbf{SE}(3)$ +In addition to the object transformations, we optionally predict the camera motion $\{R_{cam}, t_{cam}\}\in \mathbf{SE}(3)$ between the two frames $I_t$ and $I_{t+1}$. For this, we branch off a small fully-connected network from the bottleneck output of the backbone. -We again represent $R_t^{cam}$ using a Euler angle representation and -predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_t^{cam}$ in the same way as for the individual objects. -Again, we predict a softmax score $o_t^{cam}$ for differentiating between +We again represent $R_{cam}$ using an Euler angle representation and +predict $\sin(\alpha)$, $\sin(\beta)$, $\sin(\gamma)$ and $t_{cam}$ in the same way as for the individual objects. +Again, we predict a softmax score $o_{cam}$ for differentiating between a still and moving camera. \subsection{Network design} @@ -207,19 +208,19 @@ a still and moving camera. In our ResNet variant without FPN (Table \ref{table:motionrcnn_resnet}), the underlying ResNet backbone is only computed up to the $C_4$ block, as otherwise the feature resolution prior to RoI extraction would be reduced too much. -Therefore, in our the variant without FPN, we first pass the $C_4$ features through $C_5$ +Therefore, in our variant without FPN, we first pass the $C_4$ features through $C_5$ and $C_6$ blocks (with weights independent of the $C_5$ block used in the RoI head in this variant) -to increase the bottleneck stride prior to the camera network to 64. +to increase the bottleneck stride prior to the camera motion network to 64. In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), -the backbone makes use of all blocks through $C_6$ and -we can simply branch of our camera network from the $C_6$ bottleneck. +the backbone makes use of all blocks through $C_6$, and +we can simply branch off our camera motion network from the $C_6$ bottleneck. Then, in both the ResNet and ResNet-FPN variants, we apply an additional convolution to the $C_6$ features to reduce the number of inputs to the following -fully-connected layers. +fully-connected layers, and thus keep the number of weights reasonably small. Instead of averaging, we use bilinear resizing to bring the convolutional features to a fixed size without losing all spatial information, -flatten them, and finally apply multiple fully-connected layers to compute the -camera motion prediction. +flatten them, and finally apply multiple fully-connected layers to predict the +camera motion parameters.
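The camera motion branch described above could be sketched as follows with TensorFlow 1.x-style layers; the channel counts, the fixed resize resolution and the number of fully-connected layers are hypothetical placeholders rather than the values used in the actual implementation.
\begin{verbatim}
import tensorflow as tf

def camera_motion_head(c6_features, resize_hw=(3, 10), num_fc=2, fc_units=512):
    """Sketch of the camera motion network branching off the C_6 bottleneck."""
    # Additional convolution to reduce the number of inputs to the FC layers.
    x = tf.layers.conv2d(c6_features, filters=128, kernel_size=1)
    # Bilinear resizing to a fixed size instead of average pooling,
    # so that some spatial information is retained.
    x = tf.image.resize_bilinear(x, resize_hw)
    x = tf.layers.flatten(x)
    for _ in range(num_fc):
        x = tf.layers.dense(x, fc_units, activation=tf.nn.relu)
    sines = tf.clip_by_value(tf.layers.dense(x, 3), -1.0, 1.0)  # sin(alpha), sin(beta), sin(gamma)
    translation = tf.layers.dense(x, 3)                         # t_cam in meters
    moving_logits = tf.layers.dense(x, 2)                       # o_cam: still vs. moving camera
    return sines, translation, moving_logits
\end{verbatim}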
\paragraph{RoI motion head network} In both of our network variants @@ -227,6 +228,15 @@ In both of our network variants we compute the fully-connected network for motion prediction from the flattened RoI features, which are also the basis for classification and bounding box refinement. +Note that the features (extracted from the upsampled FPN stage appropriate to the RoI bounding box scales) +passed to our ResNet-FPN RoI head have gone through the $C_6$ +bottleneck, which has a stride of 64 with respect to the original image. +In contrast, the bottleneck for the features passed to our ResNet RoI head +is $C_4$ (with a stride of 16). Thus, the ResNet-FPN variant can in principle estimate +object motions based on larger displacements than the ResNet variant. +Additionally, as smaller bounding boxes use higher resolution features, the +motions and pivots of (especially smaller) objects can in principle be more accurately +estimated with the FPN variant. \subsection{Supervision} \label{ssec:supervision} @@ -235,42 +245,41 @@ bounding box refinement. The most straightforward way to supervise the object motions is by using ground truth motions computed from ground truth object poses, which is in general only practical when training on synthetic datasets. -Given the $k$-th foreground RoI, let $i_k$ be the index of the matched ground truth example with class $c_k^*$, -let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k^*$ -and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$. -Note that we dropped the subscript $t$ to increase readability. +Given the $k$-th foreground RoI (as defined for Mask R-CNN) with ground truth class $c_k^*$, +let $R_k, t_k, p_k, o_k$ be the predicted motion for class $c_k^*$ as parametrized above, +and $R_k^*, t_k^*, p_k^*, o_k^*$ the ground truth motion for the matched ground truth example. Similar to the camera pose regression loss in \cite{PoseNet2}, we use a variant of the $\ell_1$-loss to penalize the differences between ground truth and predicted rotation, translation (and pivot, in our case). We found that the smooth $\ell_1$-loss performs better in our case than the standard $\ell_1$-loss. -We then compute the RoI motion loss as +We thus compute the RoI motion loss as \begin{equation} -L_{motion} = \frac{1}{N_{RoI}^{fg}} \sum_k^{N_{RoI}} l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o^{gt,i_k} + l_o^k, +L_{motion} = \frac{1}{N_{RoI}^{fg}} \sum_k^{N_{RoI}} l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o_k^* + l_o^k, \end{equation} where \begin{equation} -l_{R}^k = \ell_{reg} (R^{gt,i_k} - R^{k,c_k}), +l_{R}^k = \ell_{reg} (R_k^* - R_k), \end{equation} \begin{equation} -l_{t}^k = \ell_{reg} (t^{gt,i_k} - t^{k,c_k}), +l_{t}^k = \ell_{reg} (t_k^* - t_k), \end{equation} \begin{equation} -l_{p}^k = \ell_{reg} (p^{gt,i_k} - p^{k,c_k}). +l_{p}^k = \ell_{reg} (p_k^* - p_k). \end{equation} -are the smooth $\ell_1$-loss terms for the predicted rotation, translation and pivot, +are the smooth-$\ell_1$ losses for the predicted rotation, translation and pivot, respectively, and \begin{equation} -l_o^k = \ell_{cls}(o_t^k, o^{gt,i_k}). +l_o^k = \ell_{cls}(o_k, o_k^*). \end{equation} is the cross-entropy loss for the predicted classification into moving and non-moving objects. Note that we do not penalize the rotation and translation for objects with -$o^{gt,i_k} = 0$, which do not move between $t$ and $t+1$.
We found that the network may not reliably predict exact identity motions for still objects, which is numerically more difficult to optimize than performing classification between moving and non-moving objects and discarding the regression for the non-moving -ones. Also, analogous to masks and bounding boxes, the estimates for classes +ones. Also, analogously to masks and bounding boxes, the estimates for classes other than $c_k^*$ are not penalized. Now, our modified RoI loss is @@ -281,10 +290,10 @@ L_{RoI} = L_{cls} + L_{box} + L_{mask} + L_{motion}. \paragraph{Camera motion supervision} We supervise the camera motion with ground truth analogously to the object motions, with the only difference being that we only have -a rotation and translation, but no pivot term for the camera motion. +a rotation and translation, but no pivot loss for the camera motion. If the ground truth shows that the camera is not moving, we again do not -penalize rotation and translation. For the camera, the loss is reduced to the -classification term in this case. +penalize rotation and translation. In this case, the camera motion loss is reduced to the +classification loss. \paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth} \begin{figure}[t] @@ -310,19 +319,20 @@ In this case, for any RoI, we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box with the same resolution as the predicted mask. We use the same bounding box -to crop the corresponding region from the dense, full image depth map +to crop the corresponding region from the dense, full-image depth map and bilinearly resize the depth crop to the same resolution as the mask and point grid. -We then compute the optical flow at each of the grid points by creating -a 3D point cloud from the point grid and depth crop. To this point cloud, we -apply the RoI's predicted motion, masked by the predicted mask. +Next, we create a 3D point cloud from the point grid and depth crop. To this point cloud, we +apply the object motion predicted for the RoI, masked by the predicted mask. Then, we apply the camera motion to the points, project them back to 2D and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids. Note that we batch this computation over all RoIs, so that we only perform -it once per forward pass. Figure \ref{figure:flow_loss} illustrates the approach. +it once per forward pass. +Figure \ref{figure:flow_loss} illustrates the approach. + The mathematical details for the 3D transformations and mappings between 2D and 3D are analogous to the -dense, full image flow composition in the following subsection, so we will not -include them here. The only differences are that there is no sum over objects during +dense, full-image flow composition in the following subsection, so we will not +duplicate them here. The only differences are that there is no sum over objects during the point transformation based on instance motion, as we consider the single object corresponding to an RoI in isolation, and that the masks are not resized to the full image resolution, as @@ -333,21 +343,22 @@ For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion by penalizing the $m \times m$ optical flow grid. 
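For a single RoI, the re-projection described above can be summarized by the following NumPy sketch (hypothetical variable names; positive depth is assumed, and the actual computation is batched over all RoIs with differentiable TensorFlow operations).
\begin{verbatim}
import numpy as np

def roi_flow_grid(box, depth_crop, mask, R_obj, t_obj, pivot,
                  R_cam, t_cam, intrinsics):
    """Compute an m x m optical flow grid for one RoI.

    box:        (x0, y0, x1, y1) RPN proposal in pixel coordinates.
    depth_crop: [m, m] depth values cropped and resized to the mask resolution.
    mask:       [m, m] predicted instance mask.
    R_obj, t_obj, pivot: predicted object motion; R_cam, t_cam: camera motion.
    intrinsics: (c0, c1, f) principal point and focal length.
    """
    c0, c1, f = intrinsics
    m = depth_crop.shape[0]
    x0, y0, x1, y1 = box
    # Uniform 2D grid of points inside the proposal bounding box.
    x_t, y_t = np.meshgrid(np.linspace(x0, x1, m), np.linspace(y0, y1, m))
    # Back-project the grid to a 3D point cloud using the depth crop.
    Z = depth_crop
    X = (x_t - c0) * Z / f
    Y = (y_t - c1) * Z / f
    P = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    w = mask.reshape(-1, 1)
    # Apply the predicted object motion, masked by the predicted mask.
    P_obj = P + w * ((P - pivot) @ R_obj.T + pivot + t_obj - P)
    # Apply the camera motion to all points.
    P_cam = P_obj @ R_cam.T + t_cam
    # Project back to 2D and take the difference to the initial grid.
    x_tp1 = f * P_cam[:, 0] / P_cam[:, 2] + c0
    y_tp1 = f * P_cam[:, 1] / P_cam[:, 2] + c1
    flow_u = x_tp1.reshape(m, m) - x_t
    flow_v = y_tp1.reshape(m, m) - y_t
    return np.stack([flow_u, flow_v], axis=-1)
\end{verbatim}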
If there is optical flow ground truth available, we can use the RoI bounding box to crop and resize a region from the ground truth optical flow to match the RoI's -optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss. +optical flow grid and penalize the difference between the flow grids with a (smooth) $\ell_1$-loss. However, we can also use the re-projection loss without optical flow ground truth to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}. -In this case, we use the bounding box to crop and resize a corresponding region +In this case, we can use the bounding box to crop and resize a corresponding region from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$ -using the 2D grid displaced with the predicted flow grid. Then, we can penalize the difference +using the 2D grid displaced with the predicted flow grid (the latter is often called \emph{backward warping}). +Then, we can penalize the difference between the resulting image crops, for example, with a census loss \cite{CensusTerm,UnFlow}. For more details on differentiable bilinear sampling for deep learning, we refer the reader to \cite{STN}. -When compared to supervision with motion ground truth, a re-projection +When compared to supervision with 3D instance motion ground truth, a re-projection loss could benefit motion regression by removing any loss balancing issues between the -rotation, translation and pivot terms \cite{PoseNet2}, -which can make it interesting even when 3D motion ground truth is available. +rotation, translation and pivot losses \cite{PoseNet2}, +which could make it interesting even when 3D motion ground truth is available. \subsection{Training and inference} \label{ssec:training_inference} @@ -368,7 +379,7 @@ highest scoring class. \subsection{Dense flow from motion} \label{ssec:postprocessing} -As a postprocessing, we compose a dense optical flow map from the outputs of our Motion R-CNN network. +As a postprocessing, we compose the dense optical flow between $I_t$ and $I_{t+1}$ from the outputs of our Motion R-CNN network. Given the depth map $d_t$ for frame $I_t$, we first create a 3D point cloud in camera space at time $t$, where \begin{equation} @@ -383,33 +394,34 @@ x_t - c_0 \\ y_t - c_1 \\ f \end{pmatrix}, \end{equation} is the 3D coordinate at $t$ corresponding to the point with pixel coordinates $x_t, y_t$, -which range over all coordinates in $I_t$. +which range over all coordinates in $I_t$, +and $(c_0, c_1, f)$ are the camera intrinsics. For now, the depth map is always assumed to come from ground truth. Given $k$ detections with predicted motions as above, we transform all points within the bounding box of a detected object according to the predicted motion of the object. -We first define the \emph{full image} mask $M_t^k$ for object k, -which can be computed from the predicted box mask $m_t^k$ by bilinearly resizing -$m_t^k$ to the width and height of the predicted bounding box and then copying the values +We first define the \emph{full image} mask $M_k$ for object k, +which can be computed from the predicted box mask $m_k$ (for the predicted class) by bilinearly resizing +it to the width and height of the predicted bounding box and then copying the values of the resized mask into a full resolution mask initialized with zeros, starting at the top-left coordinate of the predicted bounding box. 
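A small NumPy/OpenCV sketch of this full-image mask construction is given below (hypothetical names; the bounding box is assumed to lie inside the image).
\begin{verbatim}
import cv2
import numpy as np

def full_image_mask(box_mask, box, image_hw):
    """Paste a predicted box mask m_k into a full-resolution mask M_k.

    box_mask: [m, m] mask predicted inside the bounding box.
    box:      (x0, y0, x1, y1) predicted bounding box in pixel coordinates.
    image_hw: (H, W) of the input image.
    """
    H, W = image_hw
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    bw, bh = max(x1 - x0, 1), max(y1 - y0, 1)
    # Bilinearly resize the box mask to the width and height of the bounding box.
    resized = cv2.resize(box_mask.astype(np.float32), (bw, bh),
                         interpolation=cv2.INTER_LINEAR)
    # Copy the resized mask into a zero-initialized full-resolution mask,
    # starting at the top-left coordinate of the bounding box.
    M = np.zeros((H, W), dtype=np.float32)
    M[y0:y0 + bh, x0:x0 + bw] = resized
    return M
\end{verbatim}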
-Then, given the predicted motions $(R_t^k, t_t^k)$ as well as $p_t^k$ for all objects, +Then, given the predicted motions $(R_k, t_k)$ and pivots $p_k$ for all objects, we compute the transformed points as \begin{equation} P'_{t+1} = -P_t + \sum_1^{k} M_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\} +P_t + \sum_{k} M_k\left\{ R_k \cdot (P_t - p_k) + p_k + t_k - P_t \right\} \end{equation} These motion predictions are understood to have already taken into account the classification into moving and still objects, -and we thus, as described above, have identity motions for all objects with $o_t^k = 0$. +and we thus, as described above, have identity motions for all objects with $o_k = 0$. -Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$, +Next, we transform all points given the camera transformation $\{R_{cam}, t_{cam}\} \in \mathbf{SE}(3)$, \begin{equation} \begin{pmatrix} X_{t+1} \\ Y_{t+1} \\ Z_{t+1} \end{pmatrix} -= P_{t+1} = R_t^c \cdot P'_{t+1} + t_t^c += P_{t+1} = R_{cam} \cdot P'_{t+1} + t_{cam} \end{equation}. Note that in our experiments, we either use the ground truth camera motion to focus diff --git a/background.tex b/background.tex index 00b9540..a0b4aa6 100644 --- a/background.tex +++ b/background.tex @@ -8,10 +8,10 @@ The optical flow $\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$ maps pixel coordinates in the first frame $I_t$ to pixel coordinates of the visually corresponding pixel in the second frame $I_{t+1}$, -and can be interpreted as the apparent movement of brigthness patterns between the two frames. +and can be interpreted as the apparent movement of brightness patterns between the two frames. Optical flow can be regarded as two-dimensional motion estimation. -Scene flow is the generalization of optical flow to 3-dimensional space and +Scene flow is the generalization of optical flow to three-dimensional space and additionally requires estimating depth for each pixel. Generally, stereo input is used for scene flow to estimate disparity-based depth, however monocular depth estimation with deep networks is also becoming popular \cite{DeeperDepth, UnsupPoseDepth}. @@ -47,7 +47,7 @@ flow & $\times$ 2 bilinear upsample & H $\times$ W $\times$ 2 \\ \bottomrule \end{tabular} \caption { -FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions) +Overview of the FlowNetS \cite{FlowNet} architecture. Transpose convolutions (deconvolutions) are used for refinement. } \label{table:flownets} @@ -70,21 +70,22 @@ performing upsampling of the compressed features and resulting in a encoder-deco The most popular deep networks of this kind for end-to-end optical flow prediction are variants of the FlowNet family \cite{FlowNet, FlowNet2}, which was recently extended to scene flow estimation \cite{SceneFlowDataset}. -Table \ref{table:flownets} shows the classical FlowNetS architecture for optical flow prediction. +Table \ref{table:flownets} gives an overview of the classical FlowNetS architecture for optical flow prediction. Note that the network itself is a rather generic autoencoder and is specialized for optical flow only through being trained with supervision from dense optical flow ground truth. Potentially, the same network could also be used for semantic segmentation if -the number of output final and intermediate output channels was adapted from two to the number of classes.\ +the number of final and intermediate output channels was adapted from two to the number of classes.
Still, FlowNetS demonstrates that a generic deep encoder-decoder CNN can learn to perform image matching arguably well, given just two consecutive frames as input and a large enough receptive field at the outputs to cover the displacements. -Note that the maximum displacement that can be correctly estimated depends on the number of 2D convolution strides or pooling +Note that the maximum displacement that can be correctly estimated depends on the number of strided 2D convolutions (and the stride they use) and pooling operations in the encoder. Recently, other, similarly generic, -encoder-decoder CNNs have been applied to optical flow as well \cite{DenseNetDenseFlow}. +encoder-decoder CNNs have been applied to optical flow prediction as well \cite{DenseNetDenseFlow}. \subsection{SfM-Net} -Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture. +Table \ref{table:sfmnet} shows the SfM-Net \cite{SfmNet} architecture we described +in the introduction. Motions and full-image masks for a fixed number N$_{motions}$ of independent objects are predicted in addition to a depth map, and an unsupervised re-projection loss based on image brightness differences penalizes the predictions. @@ -103,7 +104,7 @@ image brightness differences penalizes the predictions. & input images $I_t$ and $I_{t+1}$ & H $\times$ W $\times$ 6 \\ & Conv-Deconv & H $\times$ W $\times$ 32 \\ masks & 1 $\times$1 conv, N$_{motions}$ & H $\times$ W $\times$ N$_{motions}$ \\ -FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}$ $\times$ 2 & H $\times$ W $\times$ 32 \\ +FC & From bottleneck: $\begin{bmatrix}\textrm{fully connected}, 512\end{bmatrix}$ $\times$ 2 & 1 $\times$ 512 \\ object motions & fully connected, $N_{motions} \cdot$ 9 & 1 $\times$ $N_{motions} \cdot$ 9 \\ camera motion & From FC: $\times$ 2 & 1 $\times$ 6 \\ \midrule @@ -118,7 +119,7 @@ depth & 1 $\times$1 conv, 1 & H $\times$ W $\times$ 1 \\ \end{tabular} \caption { -SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully convolutional +SfM-Net \cite{SfmNet} architecture. Here, Conv-Deconv is a simple fully-convolutional encoder-decoder network, where convolutions and deconvolutions with stride 2 are used for downsampling and upsampling, respectively. The stride at the bottleneck with respect to the input image is 32. @@ -147,7 +148,7 @@ Note that for the Mask R-CNN architectures we describe below, this is equivalent to the standard ResNet-50 backbone. We now introduce one small extension that will be useful for our Motion R-CNN network. In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the -input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride is 64. +input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64. For accurately estimating motions corresponding to larger pixel displacements, a larger stride may be important. Thus, we add an additional C$_6$ block to be used in the Motion R-CNN ResNet variants @@ -166,9 +167,9 @@ to increase the bottleneck stride to 64, following FlowNetS.
\multicolumn{3}{c}{\textbf{ResNet}}\\ \midrule C$_1$ & 7 $\times$ 7 conv, 64, stride 2 & $\tfrac{1}{2}$ H $\times$ $\tfrac{1}{2}$ W $\times$ 64 \\ - +\midrule & 3 $\times$ 3 max pool, stride 2 & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 64 \\ - +\midrule C$_2$ & $\begin{bmatrix} 1 \times 1, 64 \\ @@ -242,8 +243,8 @@ most popular deep networks for object detection, and have recently also been app \paragraph{R-CNN} Region-based convolutional networks (R-CNNs) \cite{RCNN} use a non-learned algorithm external to a standard encoder CNN for computing \emph{region proposals} in the shape of 2D bounding boxes, which represent regions that may contain an object. -For each of the region proposals, the input image is cropped using the regions bounding box and the crop is -passed through a CNN, which performs classification of the object (or non-object, if the region shows background). +For each of the region proposals, the input image is cropped using the region bounding box and the crop is +passed through the CNN, which performs classification of the object (or non-object, if the region shows background). \paragraph{Fast R-CNN} The original R-CNN involves computing one forward pass of the CNN for each of the region proposals, @@ -256,8 +257,8 @@ The extracted per-RoI (region of interest) feature maps are collected into a bat \emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass. The extraction technique is called \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the full image features is divided into a H $\times$ W grid of cells. For each cell, the values of the underlying -full image feature map are max-pooled to yield the output value at this cell. -Thus, given region proposals, the per-region computation is reduced to a single pass through the complete network, +full-image feature map are max-pooled to yield the output value at the cell. +Thus, given region proposals, all computation is reduced to a single pass through the complete network, speeding up the system by two orders of magnitude at inference time and one order of magnitude at training time. @@ -297,15 +298,15 @@ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\ \midrule M$_0$ & From R$_1$: 2 $\times$ 2 deconv, 256, stride 2 & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\ & 1 $\times$ 1 conv, N$_{cls}$ & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ N$_{cls}$ \\ -masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\ +masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\ \bottomrule \end{tabular} \caption { -Mask R-CNN \cite{MaskRCNN} ResNet \cite{ResNet} architecture. -Note that this is equivalent to the Faster R-CNN ResNet architecture if the mask +Mask R-CNN \cite{MaskRCNN} ResNet-50 \cite{ResNet} architecture. +Note that this is equivalent to the Faster R-CNN ResNet-50 architecture if the mask head is left out. In Mask R-CNN, bilinear sampling is used for RoI extraction, -whereas Faster R-CNN used RoI pooling. +whereas Faster R-CNN uses RoI pooling. } \label{table:maskrcnn_resnet} \end{table} @@ -317,17 +318,17 @@ After streamlining the CNN components, Fast R-CNN is limited by the speed of the algorithm, which has to be run prior to the network passes and makes up a large portion of the total processing time. 
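For comparison, the bilinear RoI extraction used by Mask R-CNN in place of max-pooling-based RoI pooling (see the table caption above) is available in TensorFlow as a single operation; a minimal sketch with hypothetical names:
\begin{verbatim}
import tensorflow as tf

def extract_roi_features(feature_map, boxes, box_indices, crop_size=(14, 14)):
    """Bilinear RoI feature extraction.

    feature_map: [N, h, w, C] backbone feature map.
    boxes:       [num_rois, 4] boxes as (y0, x0, y1, x1), normalized to [0, 1].
    box_indices: [num_rois] index of the batch element each box belongs to.
    Returns [num_rois, crop_size[0], crop_size[1], C] features.
    """
    return tf.image.crop_and_resize(feature_map, boxes, box_indices, crop_size)
\end{verbatim}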
The Faster R-CNN object detection system \cite{FasterRCNN} unifies the generation of region proposals and subsequent box refinement and -classification into a single deep network, leading to faster processing when compared to Fast R-CNN +classification into a single deep network, leading to faster test-time processing when compared to Fast R-CNN and again, improved accuracy. This unified network operates in two stages. In the \emph{first stage}, one forward pass is performed on the \emph{backbone} network, which is a deep feature encoder CNN with the original image as input. -Next, the \emph{backbone} output features are passed into a small, fully convolutional \emph{Region Proposal Network (RPN)} head, which +Next, the \emph{backbone} output features are passed into a small, fully-convolutional \emph{Region Proposal Network (RPN)} head, which predicts objectness scores and regresses bounding boxes at each of its output positions. At any of the $h \times w$ output positions of the RPN head, $N_a$ bounding boxes with their objectness scores are predicted as offsets relative to a fixed set of $N_a$ \emph{anchors} with different aspect ratios and scales. Thus, there are $N_a \times h \times w$ reference anchors in total. -In Faster R-CNN, $N_a = 9$, with 3 scales corresponding +In Faster R-CNN, $N_a = 9$, with 3 scales, corresponding to anchor boxes of areas of $\{128^2, 256^2, 512^2\}$ pixels and 3 aspect ratios, $\{1:2, 1:1, 2:1\}$. For the ResNet Faster R-CNN backbone, we generally have a stride of 16 with respect to the input image at the RPN output (Table \ref{table:maskrcnn_resnet}). @@ -337,8 +338,12 @@ The region proposals can then be obtained as the N highest scoring RPN predictio Then, the \emph{second stage} corresponds to the original Fast R-CNN head network, performing classification and bounding box refinement for each of the region proposals, which are now obtained -from the RPN instead of being pre-computed by some external algorithm. -As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals. +from the RPN instead of being pre-computed by an external algorithm. +As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for each of the region proposals, +and the refined bounding boxes are predicted separately for each object class. + +Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture +(here, the mask head is ignored). \paragraph{Mask R-CNN} Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity. @@ -346,18 +351,20 @@ However, it can be helpful to know class and object (instance) membership of all which generally involves computing a binary mask for each object instance specifying which pixels belong to that object. This problem is called \emph{instance segmentation}. Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting -fixed resolution instance masks within the bounding boxes of each detected object. +fixed resolution instance masks within the bounding boxes of each detected object, +which are then bilinearly resized to fit inside the respective bounding boxes. This is done by simply extending the Faster R-CNN head with multiple convolutions, which compute a pixel-precise binary mask for each instance. -The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}. 
Note that the per-class mask logits are put through a sigmoid layer, and thus there is no -comptetition between classes for the mask prediction branch. +competition between classes in the mask prediction branch. -One important additional technical aspect of Mask R-CNN is the replacement of RoI pooling with +Additionally, an important technical aspect of Mask R-CNN is the replacement of RoI pooling with bilinear sampling for extracting the RoI features, which is much more precise. In the original RoI pooling from Fast R-CNN, the bins for max-pooling are not aligned with the actual pixel boundary of the bounding box, and thus some detail is lost. +The basic Mask R-CNN ResNet architecture is shown in Table \ref{table:maskrcnn_resnet}. + { \begin{table}[h] \centering \begin{tabular}{lll} \toprule \textbf{Output} & \textbf{Layer Operations} & \textbf{Output Dimensions} \\ \midrule\midrule & input image & H $\times$ W $\times$ C \\ \midrule -C$_5$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\ +C$_5$ & ResNet \{up to C$_5$\} (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\ \midrule \multicolumn{3}{c}{\textbf{Feature Pyramid Network (FPN)}}\\ \midrule @@ -403,11 +410,11 @@ classes& softmax, N$_{cls}$ & N$_{RoI}$ $\times$ N$_{cls}$ \\ M$_1$ & From R$_2$: $\begin{bmatrix}\textrm{3 $\times$ 3 conv} \end{bmatrix}$ $\times$ 4, 256 & N$_{RoI}$ $\times$ 14 $\times$ 14 $\times$ 256 \\ & 2 $\times$ 2 deconv, 256, stride 2 & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ 256 \\ & 1 $\times$ 1 conv, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\ -masks & sigmoid, N$_{cls}$ & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\ +masks & sigmoid & N$_{RoI}$ $\times$ 28 $\times$ 28 $\times$ N$_{cls}$ \\ \bottomrule \end{tabular} \caption { -Mask R-CNN \cite{MaskRCNN} ResNet-FPN \cite{ResNet} architecture. +Mask R-CNN \cite{MaskRCNN} ResNet-50-FPN \cite{ResNet} architecture. Operations enclosed in a []$_p$ block make up a single FPN block (see Figure \ref{figure:fpn_block}). } \label{table:maskrcnn_resnet_fpn} \end{table} } \paragraph{Feature Pyramid Networks} -In Faster R-CNN, a single feature map is used as a source of all RoIs, independent -of the size of the bounding box of the RoI. -However, for small objects, the C$_4$ (see Table \ref{table:maskrcnn_resnet}) features -might have lost too much spatial information to properly predict the exact bounding -box and a high resolution mask. Likewise, for very big objects, the fixed size -RoI window might be too small to cover the region of the feature map containing -information for this object. +In Faster R-CNN, a single feature map is used as the source of all RoI features during RoI extraction, independent +of the size of the bounding box of each RoI. +However, for small objects, the C$_4$ (see Table \ref{table:resnet}) features +might have lost too much spatial information to allow properly predicting the exact bounding +box and a high resolution mask. As a solution to this, the Feature Pyramid Network (FPN) \cite{FPN} enables features -of an appropriate scale to be used, depending of the size of the bounding box. +of an appropriate scale to be used for RoI extraction, depending on the size of the bounding box of an RoI. For this, a pyramid of feature maps is created on top of the ResNet \cite{ResNet} -encoder by combining bilinear upsampled feature maps coming from the bottleneck -with lateral skip connections from the encoder.
-The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}. +encoder by combining bilinearly upsampled feature maps coming from the bottleneck +with lateral skip connections from the encoder (Figure~\ref{figure:fpn_block}). +For each consecutive upsampling block, the lateral skip connections are taken from +the encoder block with the same output resolution as the upsampled features coming +from the bottleneck. + Instead of a single RPN head with anchors at 3 scales and 3 aspect ratios, -the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$. +the FPN variant has one RPN head after each of the pyramid levels P$_2$ ... P$_6$ (see Table \ref{table:maskrcnn_resnet_fpn}). At each output position of the resulting RPN pyramid, bounding boxes are predicted with respect to 3 anchor aspect ratios $\{1:2, 1:1, 2:1\}$ and a single scale ($N_a = 3$). For P$_2$, P$_3$, P$_4$, P$_5$, P$_6$, the scale corresponds to anchor bounding boxes of areas $32^2, 64^2, 128^2, 256^2, 512^2$, respectively. Note that there is no need for multiple anchor scales per anchor position anymore, -as the RPN heads themselves correspond to multiple scales. +as the RPN heads themselves correspond to different scales. Now, in the RPN, higher resolution feature maps can be used for regressing smaller bounding boxes. For example, boxes of area close to $32^2$ are predicted using P$_2$, which has a stride of $4$ with respect to the input image. @@ -463,6 +471,8 @@ as some anchor to the exact same pyramid level from which the RPN of this anchor is computed. Now, for example, the smallest boxes are cropped from $P_2$, which is the highest resolution feature map. +The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}. + \begin{figure}[t] \centering @@ -506,7 +516,7 @@ All bounding boxes predicted by the RoI head or RPN are estimated as offsets with respect to a reference bounding box. In the case of the RPN, the reference bounding box is one of the anchors, and refined bounding boxes from the RoI head are predicted relative to the RPN output bounding boxes. -Let $(x, y, w, h)$ be the top left coordinates, height and width of the bounding box +Let $(x, y, w, h)$ be the top left coordinates, width, and height of the bounding box to be predicted. Likewise, let $(x^*, y^*, w^*, h^*)$ be the ground truth bounding box and let $(x_r, y_r, w_r, h_r)$ be the reference bounding box. The ground truth \emph{box encoding} $b_e^*$ is then defined as @@ -561,7 +571,7 @@ w = \exp(b_w) \cdot w_r, h = \exp(b_h) \cdot h_r, \end{equation*} and thus the bounding box is obtained as the reference bounding box adjusted by -the predicted relative offsets and scales. +the predicted relative offsets and scales encoded in $b_e$. \paragraph{Supervision of the RPN} A positive RPN proposal is defined as one with a IoU of at least $0.7$ with @@ -571,7 +581,7 @@ with at most $50\%$ positive examples (if there are less positive examples, more negative examples are used instead). For examples selected in this way, a regression loss is computed between predicted and ground truth bounding box encoding, and a classification loss -is computed for the predicted objectness. +is computed for the predicted objectness scores. Specifically, let $s_i^* = 1$ if proposal $i$ is positive and $s_i^* = 0$ if it is negative, let $s_i$ be the predicted objectness score and $b_i$, $b_i^*$ the predicted and ground truth bounding box encodings. 
@@ -588,7 +598,7 @@ L_{box}^{RPN} = \frac{1}{N_{RPN}^{pos}} \sum_{i=1}^{N_{RPN}} s_i^* \cdot \ell_{r \end{equation} and \begin{equation} -N_{RPN}^{pos} = \sum_{i=1}^{N_{pos}} s_i^* +N_{RPN}^{pos} = \sum_{i=1}^{N_{RPN}} s_i^* \end{equation} is the number of positive examples. Note that the bounding box loss is only active for positive examples, and that the classification loss is computed @@ -648,14 +658,14 @@ During inference, the 300 (without FPN) or 1000 (with FPN) highest scoring regio from the RPN are selected. The corresponding features are extracted from the backbone, as during training, by using the RPN bounding boxes, and passed through the RoI bounding box refinement and classification heads (but not through the mask head). -After this, non-maximum supression (NMS) is applied to predicted RoIs with predicted non-background class, -with a maximum IoU of 0.7. -Then, the mask head is applied to the 100 highest scoring (after NMS) refined boxes, -after again extracting the corresponding features. +After this, non-maximum suppression (NMS) is applied to predicted RoIs for which the predicted class is not the background class, +with an IoU threshold of 0.7 on the refined boxes. +Finally, the mask head is applied to the 100 highest scoring (after NMS) refined boxes, +after extracting the corresponding features again. Thus, during inference, the features for the mask head are extracted using the refined bounding boxes for the predicted class, instead of the RPN bounding boxes. This is important for not -introducing any misalignment, as we want to create the instance mask inside of the -more precise, refined detection bounding boxes. +introducing any misalignment, as the instance masks are to be created inside the +final, more precise, refined detection bounding boxes. Furthermore, note that bounding box and mask predictions for all classes but the predicted class (the highest scoring class) are discarded, and thus the output bounding box and mask correspond to the highest scoring class. diff --git a/conclusion.tex b/conclusion.tex index f4a1f7a..e394c46 100644 --- a/conclusion.tex +++ b/conclusion.tex @@ -18,6 +18,12 @@ of our network is highly interpretable, which may also bring benefits for safety applications. \subsection{Future Work} +\paragraph{Training on all Virtual KITTI sequences} +We only trained our models on the \emph{clone} variants of the Virtual KITTI sequences +to make training faster. +In the future, it would be interesting to train on all variants, as the different +lighting conditions and angles should lead to a more general model. + \paragraph{Evaluation and finetuning on KITTI 2015} Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset on which we do not train, but we have yet to evaluate on a real world dataset. @@ -138,19 +144,19 @@ In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM}, into our architecture, we could enable temporally consistent motion estimation from image sequences of arbitrary length. -\paragraph{Masking prior to the RoI motion head} -Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from -the backbone are integrated over the complete RoI window to yield the features -for motion estimation. -For example, average pooling is applied before the fully-connected layers in the variant without FPN.
-However, ideally, the motion (image matching) information from the backbone should - -For example, consider - -Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the -extracted RoI features before passing them into the motion head. -The intuition behind that is that we want to mask out (set to zero) any positions in the -extracted feature window which belong to the background. Then, the RoI motion -head could aggregate the motion (image matching) information from the backbone -over positions localized within the object only, but not over positions belonging -to the background, which should probably not influence the final object motion estimate. +% \paragraph{Masking prior to the RoI motion head} +% Currently, in the Motion R-CNN RoI motion head, the RoI features extracted from +% the backbone are integrated over the complete RoI window to yield the features +% for motion estimation. +% For example, average pooling is applied before the fully-connected layers in the variant without FPN. +% However, ideally, the motion (image matching) information from the backbone should +% +% For example, consider +% +% Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the +% extracted RoI features before passing them into the motion head. +% The intuition behind that is that we want to mask out (set to zero) any positions in the +% extracted feature window which belong to the background. Then, the RoI motion +% head could aggregate the motion (image matching) information from the backbone +% over positions localized within the object only, but not over positions belonging +% to the background, which should probably not influence the final object motion estimate. diff --git a/experiments.tex b/experiments.tex index 87d2853..3ee2e57 100644 --- a/experiments.tex +++ b/experiments.tex @@ -1,6 +1,6 @@ \subsection{Implementation} -Our networks and loss functions are implemented using built-in TensorFlow \cite{TensorFlow} -functions, enabling us to use automatic differentiation for all gradient +Our networks and loss functions are implemented using built-in TensorFlow +functions \cite{TensorFlow}, enabling us to use automatic differentiation for all gradient computations. To make our code easy to extend and flexible, we build on the TensorFlow Object detection API \cite{TensorFlowObjectDetection}, which provides a Faster R-CNN baseline implementation. @@ -49,18 +49,18 @@ let $[R_t^{ex}|t_t^{ex}]$ and $[R_{t+1}^{ex}|t_{t+1}^{ex}]$ be the camera extrinsics at the two frames. We compute the ground truth camera motion -$\{R_t^{gt, cam}, t_t^{gt, cam}\} \in \mathbf{SE}(3)$ as +$\{R_{cam}^*, t_{cam}^*\} \in \mathbf{SE}(3)$ as \begin{equation} -R_{t}^{gt, cam} = R_{t+1}^{ex} \cdot \mathrm{inv}(R_t^{ex}), +R_{cam}^* = R_{t+1}^{ex} \cdot \mathrm{inv}(R_t^{ex}), \end{equation} \begin{equation} -t_{t}^{gt, cam} = t_{t+1}^{ex} - R_{t}^{ex} \cdot t_t^{ex}. +t_{cam}^* = t_{t+1}^{ex} - R_{cam}^* \cdot t_t^{ex}. \end{equation} -Additionally, we define $o_t^{gt, cam} \in \{ 0, 1 \}$, +Additionally, we define $o_{cam}^* \in \{ 0, 1 \}$, \begin{equation} -o_t^{gt, cam} = +o_{cam}^* = \begin{cases} 1 &\text{if the camera pose changes between $t$ and $t+1$} \\ 0 &\text{otherwise,} @@ -75,25 +75,25 @@ at $I_t$ and $I_{t+1}$. Note that the pose at $t$ is given with respect to the camera at $t$ and the pose at $t+1$ is given with respect to the camera at $t+1$. 
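The camera motion ground truth defined above amounts to a few lines of NumPy; the following sketch uses hypothetical variable names, and the moving flag is derived by comparing the relative pose to the identity.
\begin{verbatim}
import numpy as np

def camera_motion_ground_truth(R_ex_t, t_ex_t, R_ex_tp1, t_ex_tp1, eps=1e-6):
    """Ground truth camera motion between t and t+1 from the camera extrinsics.

    [R_ex_t   | t_ex_t  ]: extrinsics at frame t.
    [R_ex_tp1 | t_ex_tp1]: extrinsics at frame t+1.
    """
    R_cam = R_ex_tp1 @ np.linalg.inv(R_ex_t)
    t_cam = t_ex_tp1 - R_cam @ t_ex_t
    # o_cam: 1 if the camera pose changes between the two frames, else 0.
    moving = int(not (np.allclose(R_cam, np.eye(3), atol=eps)
                      and np.allclose(t_cam, 0.0, atol=eps)))
    return R_cam, t_cam, moving
\end{verbatim}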
-We define the ground truth pivot $p_{t}^{gt, i} \in \mathbb{R}^3$ as +We define the ground truth pivot $p_k^* \in \mathbb{R}^3$ as \begin{equation} -p_{t}^{gt, i} = t_t^i +p_k^* = t_t^i \end{equation} and compute the ground truth object motion -$\{R_t^{gt, i}, t_t^{gt, i}\} \in \mathbf{SE}(3)$ as +$\{R_k^*, t_k^*\} \in \mathbf{SE}(3)$ as \begin{equation} -R_{t}^{gt, i} = \mathrm{inv}(R_t^{gt, cam}) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i), +R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i), \end{equation} \begin{equation} -t_{t}^{gt, i} = t_{t+1}^{i} - R_t^{gt, cam} \cdot t_t. +t_k^* = t_{t+1}^{i} - R_k^* \cdot t_t^i. \end{equation} -As for the camera, we define $o_t^{gt, i} \in \{ 0, 1 \}$, +As for the camera, we define $o_k^* \in \{ 0, 1 \}$, \begin{equation} -o_t^{gt, i} = +o_k^* = \begin{cases} 1 &\text{if the position of object i changes between $t$ and $t+1$} \\ 0 &\text{otherwise,} @@ -105,21 +105,19 @@ which specifies whether an object is moving in between the frames. To evaluate the 3D instance and camera motions on the Virtual KITTI validation set, we introduce a few error metrics. Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example, -let $i_k$ be the index of the best matching ground truth example, -let $c_k$ be the predicted class, -let $R^{k,c_k}, t^{k,c_k}, p^{k,c_k}, o^{k,c_k}$ be the predicted motion for class $c_k$ -and $R^{gt,i_k}, t^{gt,i_k}, p^{gt,i_k}, o^{gt,i_k}$ the ground truth motion for the example $i_k$. +let $R_k, t_k, p_k, o_k$ be the predicted motion for the predicted class $c_k$ +and $R_k^*, t_k^*, p_k^*, o_k^*$ the motion ground truth for the best matching example. Then, assuming there are $N$ such detections, \begin{equation} -E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R^{k,c_k}) \cdot R^{gt,i_k}) - 1}{2} \right\}\right\} \right) +E_{R} = \frac{1}{N}\sum_k \arccos\left( \min\left\{1, \max\left\{-1, \frac{\mathrm{tr}(\mathrm{inv}(R_k^*) \cdot R_k) - 1}{2} \right\}\right\} \right) \end{equation} measures the mean angle of the error rotation between predicted and ground truth rotation, \begin{equation} -E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R^{k,c_k}) \cdot (t^{gt,i_k} - t^{k,c_k}) \right\rVert_2, +E_{t} = \frac{1}{N}\sum_k \left\lVert \mathrm{inv}(R_k) \cdot (t_k^* - t_k) \right\rVert_2, \end{equation} is the mean Euclidean distance between predicted and ground truth translation, and \begin{equation} -E_{p} = \frac{1}{N}\sum_k \left\lVert p^{gt,i_k} - p^{k,c_k} \right\rVert_2 +E_{p} = \frac{1}{N}\sum_k \left\lVert p_k^* - p_k \right\rVert_2 \end{equation} is the mean Euclidean distance between predicted and ground truth pivot. Moreover, we define precision and recall measures for the detection of moving objects, @@ -135,29 +133,30 @@ O_{rc} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}} is the fraction of objects correctly classified as moving among all objects which are actually moving. Here, we used \begin{equation} -\mathit{TP} = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 1], +\mathit{TP} = \sum_k [o_k = 1 \land o_k^* = 1], \end{equation} \begin{equation} -\mathit{FP} = \sum_k [o^{k,c_k} = 1 \land o^{gt,i_k} = 0], +\mathit{FP} = \sum_k [o_k = 1 \land o_k^* = 0], \end{equation} and \begin{equation} -\mathit{FN} = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1]. +\mathit{FN} = \sum_k [o_k = 0 \land o_k^* = 1]. \end{equation} Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for -predicted camera motions.
+the predicted camera motion. \subsection{Virtual KITTI: Training setup} \label{ssec:setup} For our initial experiments, we concatenate both RGB frames as well as the XYZ coordinates for both frames as input to the networks. -We train both, the Motion R-CNN and -FPN variants. +We train both the Motion R-CNN ResNet and ResNet-FPN variants. \paragraph{Training schedule} Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}. -We train on a single Titan X (Pascal) for a total of 192K iterations on the -Virtual KITTI training set. +We train for a total of 192K iterations on the Virtual KITTI training set. +For this, we use a single Titan X (Pascal) GPU and a batch size of 1, +which results in approximately one day of training. As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a momentum of $0.9$. As learning rate we use $0.25 \cdot 10^{-2}$ for the diff --git a/introduction.tex b/introduction.tex index 3bf2203..c4b9936 100644 --- a/introduction.tex +++ b/introduction.tex @@ -11,26 +11,30 @@ and estimates their 3D locations as well as all 3D object motions between the fr \subsection{Motivation} -For moving in the real world, it is generally desirable to know which objects exists +For moving in the real world, it is often desirable to know which objects exist in the proximity of the moving agent, where they are located relative to the agent, -and where they will be at some point in the future. +and where they will be at some point in the near future. In many cases, it would be preferable to infer such information from video data -if technically feasible, as camera sensors are cheap and ubiquitous. +if technically feasible, as camera sensors are cheap and ubiquitous +(compared to, for example, Lidar). -For example, in autonomous driving, it is crucial to not only know the position +As an example, consider the autonomous driving problem. +Here, it is crucial to not only know the position of each obstacle, but to also know if and where the obstacle is moving, and to use sensors that will not make the system too expensive for widespread use. +At the same time, the autonomous driving system has to operate in real time to +react quickly enough for safely controlling the vehicle. -A promising approach for 3D scene understanding in situations like these are deep neural +A promising approach for 3D scene understanding in situations such as autonomous driving is deep neural networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification in still images and are more and more often being applied to video data. -A key benefit of end-to-end deep networks is that they can, in principle, +A key benefit of deep networks is that they can, in principle, enable very fast inference on real-time video data and generalize -over many training examples to resolve ambiguities inherent in image understanding +over many training situations to resolve ambiguities inherent in image understanding and motion estimation. -Thus, in this work, we aim to develop end-to-end deep networks which can, given +Thus, in this work, we aim to develop deep neural networks which can, given sequences of images, segment the image pixels into object instances and estimate the location and 3D motion of each object instance relative to the camera (Figure \ref{figure:teaser}).
@@ -39,9 +43,12 @@ the location and 3D motion of each object instance relative to the camera Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera. -SfM-Net predicts a batch of binary full image masks specyfing the object memberships of individual pixels with a standard encoder-decoder -network for pixel-wise prediction. A fully-connected network branching off the encoder predicts a 3D motion for each object. -However, due to the fixed number of objects masks, the system can only predict a small number of motions and +Using a standard encoder-decoder network for pixel-wise dense prediction, +SfM-Net predicts a pre-determined number of binary masks ranging over the complete image, +with each mask specifying the membership of the image pixels to one object. +A fully-connected network branching off the encoder then predicts a 3D motion for each object, +as well as the camera ego-motion. +However, due to the fixed number of object masks, the system can in practice only predict a small number of motions and often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions (Figure \ref{figure:sfmnet_kitti}). \begin{figure}[t] \centering @@ -64,9 +71,11 @@ deep learning approaches to motion estimation, may significantly benefit motion estimation by structuring the problem, creating physical constraints and reducing the dimensionality of the estimate. -A scalable approach to instance segmentation based on region-based convolutional networks -was recently introduced with Mask R-CNN \cite{MaskRCNN}, which inherits the ability to detect -a large number of objects from a large number of classes at once from Faster R-CNN +In the context of still images, a +scalable approach to instance segmentation based on region-based convolutional networks +was recently introduced with Mask R-CNN \cite{MaskRCNN}. +Mask R-CNN inherits the ability to detect +a large number of objects from a large number of classes at once from Faster R-CNN \cite{FasterRCNN} and predicts pixel-precise segmentation masks for each detected object (Figure \ref{figure:maskrcnn_cs}). \begin{figure}[t] \centering @@ -126,7 +135,7 @@ image depending on the semantics of each region or pixel, which include whether pixel belongs to the background, to which object instance it belongs if it is not background, and the class of the object it belongs to. Often, failure cases of these methods include motion boundaries or regions with little texture, -where semantics become important. +where semantics become very important. Extensions of these approaches to scene flow estimate flow and depth with similarly generic networks \cite{SceneFlowDataset} and similar limitations. @@ -171,14 +180,14 @@ These concerns restrict the applicability of the current slanted plane models in which often require estimations to be done in realtime and for which an end-to-end approach based on learning would be preferable. -Futhermore, in other contexts, the move towards end-to-end deep learning has often lead +By analogy, in other contexts, the move towards end-to-end deep learning has often led to significant benefits in terms of accuracy and speed.
As an example, consider the evolution of region-based convolutional networks, which started out as prohibitively slow with a CNN as a single component and became very fast and much more accurate over the course of their development into end-to-end deep networks. -Thus, in the context of motion estimation, one could expect end-to-end deep learning to not only bring large improvements +Thus, in the context of motion estimation, one may expect end-to-end deep learning to bring large improvements not only in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation and the ability of deep networks to learn to handle ambiguity from a large variety of training examples. @@ -201,15 +210,15 @@ with a brightness constancy proxy loss. Like SfM-Net, we aim to estimate 3D motion and instance segmentation jointly with end-to-end deep learning. Unlike SfM-Net, we build on a scalable object detection and instance segmentation -approach with R-CNNs, which provide a strong baseline. +approach with R-CNNs, which provide us with a strong baseline for these tasks. \paragraph{End-to-end deep networks for camera pose estimation} Deep networks have been used for estimating the 6-DOF camera pose from a single RGB frame \cite{PoseNet, PoseNet2}, or for estimating depth and camera ego-motion from monocular video \cite{UnsupPoseDepth}. These works are related to -ours in that we also need to output various rotations and translations from a deep network -and thus need to solve similar regression problems and use similar parametrizations +ours in that we also need to output various rotations and translations from a deep network, +and thus need to solve similar regression problems and may be able to use similar parametrizations and losses. @@ -217,8 +226,8 @@ and losses. First, in section \ref{sec:background}, we introduce preliminaries and building blocks from earlier works that serve as a foundation for our networks and losses. Most importantly, we review the ResNet CNN (\ref{ssec:resnet}) that will serve as CNN backbone -as well as the developments in region-based CNNs onto which we build (\ref{ssec:rcnn}), -specifically Mask R-CNN and the FPN \cite{FPN}. +as well as the developments in region-based CNNs which we build on (\ref{ssec:rcnn}), +specifically Mask R-CNN and the Feature Pyramid Network (FPN) \cite{FPN}. In section \ref{sec:approach}, we describe our technical contribution, starting with our motion estimation model and modifications to the Mask R-CNN backbone and head networks (\ref{ssec:model}), followed by our losses and supervision methods for training the motion estimation (\ref{ssec:supervision}). diff --git a/thesis.tex b/thesis.tex index 78e4f33..3c12476 100644 --- a/thesis.tex +++ b/thesis.tex @@ -125,39 +125,39 @@ %\pagenumbering{arabic} % Arabische Seitenzahlen \section{Introduction} +\label{sec:introduction} \parindent 2em \onehalfspacing \input{introduction} -\label{sec:introduction} \section{Background} +\label{sec:background} \parindent 2em \onehalfspacing -\label{sec:background} \input{background} \section{Motion R-CNN} +\label{sec:approach} \parindent 2em \onehalfspacing -\label{sec:approach} \input{approach} \section{Experiments} +\label{sec:experiments} \parindent 2em \onehalfspacing \input{experiments} -\label{sec:experiments} \section{Conclusion} +\label{sec:conclusion} \parindent 2em \onehalfspacing \input{conclusion} -\label{sec:conclusion} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Bibliografie mit BibLaTeX