diff --git a/approach.tex b/approach.tex index d3da89d..5160329 100644 --- a/approach.tex +++ b/approach.tex @@ -5,7 +5,7 @@ Building on Mask R-CNN \cite{MaskRCNN}, we estimate per-object motion by predicting the 3D motion of each detected object. For this, we extend Mask R-CNN in two straightforward ways. -First, we modify the backbone network and provide two frames to the R-CNN system +First, we modify the backbone network and provide it with two frames in order to enable image matching between the consecutive frames. Second, we extend the Mask R-CNN RoI head to predict a 3D motion and pivot for each region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn} @@ -32,10 +32,10 @@ C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\ & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\ T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\ -$R_t^{cam}$& From T$_0$: fully connected, 3 & 1 $\times$ 3 \\ -$t_t^{cam}$& From T$_0$: fully connected, 3 & 1 $\times$ 3 \\ +$R_{cam}$& From T$_0$: fully connected, 3 & 1 $\times$ 3 \\ +$t_{cam}$& From T$_0$: fully connected, 3 & 1 $\times$ 3 \\ & From T$_0$: fully connected, 2 & 1 $\times$ 2 \\ -$o_t^{cam}$& softmax, 2 & 1 $\times$ 2 \\ +$o_{cam}$& softmax, 2 & 1 $\times$ 2 \\ \midrule \multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet})}\\ \midrule @@ -43,11 +43,11 @@ $o_t^{cam}$& softmax, 2 & 1 $\times$ 2 \\ \midrule %& From M$_0$: flatten & N$_{RoI}$ $\times$ 7 $\cdot$ 7 $\cdot$ 256 \\ T$_1$ & From ave: $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & N$_{RoI}$ $\times$ 1024 \\ -$\forall k: R_t^k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ -$\forall k: t_t^k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ -$\forall k: p_t^k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ +$\forall k: R_k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ +$\forall k: t_k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ +$\forall k: p_k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ & From T$_1$: fully connected, 2 & N$_{RoI}$ $\times$ 2 \\ -$\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\ +$\forall k: o_k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\ \bottomrule \end{tabular} @@ -81,10 +81,10 @@ C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1 & bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\ & flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\ T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\ -$R_t^{cam}$& From T$_2$: fully connected, 3 & 1 $\times$ 3 \\ -$t_t^{cam}$& From T$_2$: fully connected, 3 & 1 $\times$ 3 \\ +$R_{cam}$& From T$_2$: fully connected, 3 & 1 $\times$ 3 \\ +$t_{cam}$& From T$_2$: fully connected, 3 & 1 $\times$ 3 \\ & From T$_2$: fully connected, 2 & 1 $\times$ 2 \\ -$o_t^{cam}$& softmax, 2 & 1 $\times$ 2 \\ +$o_{cam}$& softmax, 2 & 1 $\times$ 2 \\ \midrule \multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet_fpn})} \\ \midrule @@ -92,11 +92,11 @@ $o_t^{cam}$& softmax, 2 & 1 $\times$ 2 \\ \midrule %& From M$_1$: flatten & N$_{RoI}$ $\times$ 14 $\cdot$ 14 $\cdot$ 256 \\ T$_3$ & From F$_1$: $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & N$_{RoI}$ $\times$ 1024 \\ -$\forall k: R_t^k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ -$\forall k: 
t_t^k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ -$\forall k: p_t^k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ +$\forall k: R_k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ +$\forall k: t_k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ +$\forall k: p_k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\ & From T$_2$: fully connected, 2 & N$_{RoI}$ $\times$ 2 \\ -$\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\ +$\forall k: o_k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\ \bottomrule \end{tabular} @@ -124,7 +124,7 @@ we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yiel Additionally, we also experiment with concatenating the camera space XYZ coordinates for each frame, XYZ$_t$ and XYZ$_{t+1}$, into the input as well. We do not introduce a separate network for computing region proposals and use our modified backbone network -as both first stage RPN and second stage feature extractor for extracting the RoI features. +as both RPN and for extracting the RoI features. Technically, our feature encoder network will have to learn image matching representations similar to that learned by the FlowNet encoder, but the output will be computed in the object-centric framework of a region based convolutional network head with a 3D parametrization. @@ -133,7 +133,7 @@ from the encoder is integrated for specific objects via RoI extraction and processed by the RoI head for each object. \paragraph{Per-RoI motion prediction} -We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}. +We use a rigid 3D motion parametrization similar to the one used in SE3-Nets and SfM-Net \cite{SE3Nets, SfmNet}. For the $k$-th object proposal, we predict the rigid transformation $\{R_k, t_k\}\in \mathbf{SE}(3)$ \footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations and translations: $\{R, t|R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$} @@ -214,7 +214,7 @@ to increase the bottleneck stride prior to the camera motion network to 64. In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}), the backbone makes use of all blocks through $C_6$, and we can simply branch off our camera motion network from the $C_6$ bottleneck. -Then, in both, the ResNet and ResNet-FPN variant, we apply a additional +Then, in both, the ResNet and ResNet-FPN variant, we apply one additional convolution to the $C_6$ features to reduce the number of inputs to the following fully-connected layers, and thus keep the number of weights reasonably small. Instead of averaging, we use bilinear resizing to bring the convolutional features @@ -255,7 +255,7 @@ performs better in our case than the standard $\ell_1$-loss. We thus compute the RoI motion loss as \begin{equation} -L_{motion} = \frac{1}{\text{N}_{RoI}^{\mathit{fg}}} \sum_k^{\text{N}_{RoI}} l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o_k^* + l_o^k, +L_{motion} = \frac{1}{\text{N}_{RoI}^{\mathit{fg}}} \sum_k^{\text{N}_{RoI}} (l_{R}^k + l_{t}^k) \cdot o_k^* + l_{p}^k + l_o^k, \end{equation} where \begin{equation} @@ -272,7 +272,7 @@ respectively and \begin{equation} l_o^k = \ell_{cls}(o_k, o_k^*). \end{equation} -is the cross-entropy loss for the predicted classification into moving and non-moving objects. +is the (categorical) cross-entropy loss for the predicted classification into moving and non-moving objects. 
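For concreteness, the RoI motion loss can be sketched as follows in NumPy-style code; all names are illustrative, and a smooth-$\ell_1$ penalty stands in for the regression terms $l_R^k$, $l_t^k$, $l_p^k$, which may differ from the exact penalty used in our implementation.
\begin{verbatim}
import numpy as np

def smooth_l1(x):
    # Elementwise smooth-l1 penalty, summed over the last axis.
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5).sum(axis=-1)

def cross_entropy(p, labels):
    # Categorical cross-entropy for softmax outputs p of shape (N_RoI, 2).
    return -np.log(p[np.arange(len(labels)), labels] + 1e-8)

def motion_loss(R, t, p, o, R_gt, t_gt, p_gt, o_gt, n_fg):
    """R, t, p: (N_RoI, 3) motion predictions; o: (N_RoI, 2) softmax outputs;
    *_gt: matched targets; n_fg: number of foreground RoIs (N_RoI^fg)."""
    l_R = smooth_l1(R - R_gt)
    l_t = smooth_l1(t - t_gt)
    l_p = smooth_l1(p - p_gt)
    l_o = cross_entropy(o, o_gt)
    # Rotation and translation are only penalized for objects labeled as moving
    # (o_gt == 1); the pivot and the moving/non-moving classification always are.
    per_roi = (l_R + l_t) * o_gt + l_p + l_o
    return per_roi.sum() / max(n_fg, 1)
\end{verbatim}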
Note that we do not penalize the rotation and translation for objects with $o_k^* = 0$, which do not move between $t$ and $t+1$. We found that the network @@ -300,7 +300,7 @@ classification loss. \centering \includegraphics[width=\textwidth]{figures/flow_loss} \caption{ -Overview of the alternative, optical flow based loss for instance motion +Overview of the alternative, flow-based loss for instance motion supervision without 3D instance motion ground truth. In contrast to SfM-Net \cite{SfmNet}, where a single optical flow field is composed and penalized to supervise the motion prediction, our loss considers @@ -316,15 +316,16 @@ which we can apply to coordinates within the object bounding boxes, and which does not require ground truth 3D object motions. In this case, for any RoI, -we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box +we generate a uniform $m \times m$ grid of 2D points inside the RPN proposal bounding box with the same resolution as the predicted mask. +Note that the predicted mask we use here was binarized at a threshold of $0.5$. We use the same bounding box to crop the corresponding region from the dense, full-image depth map and bilinearly resize the depth crop to the same resolution as the mask and point grid. -Next, we create a 3D point cloud from the point grid and depth crop. To this point cloud, we +Next, we create a grid of 3D points (point cloud) from the grid of 2D points and depth crop. To this point cloud, we apply the object motion predicted for the RoI, masked by the predicted mask. -Then, we apply the camera motion to the points, project them back to 2D +Then, we apply the camera motion to the 3D points, project them back to 2D and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids. Note that we batch this computation over all RoIs, so that we only perform it once per forward pass. @@ -336,7 +337,7 @@ duplicate them here. The only differences are that there is no sum over objects the point transformation based on instance motion, as we consider the single object corresponding to an RoI in isolation, and that the masks are not resized to the full image resolution, as -the depth crops and 2D point grid are at the same resolution as the predicted +the depth crop and the grid of 2D points are at the same resolution as the predicted $m \times m$ mask. For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion @@ -345,12 +346,12 @@ If there is optical flow ground truth available, we can use the RoI bounding box crop and resize a region from the ground truth optical flow to match the RoI's optical flow grid and penalize the difference between the flow grids with a (smooth) $\ell_1$-loss. -However, we can also use the re-projection loss without optical flow ground truth +However, we could also use the re-projection loss without optical flow ground truth to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}. -In this case, we can use the bounding box to crop and resize a corresponding region -from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$ -using the 2D grid displaced with the predicted flow grid (the latter is often called \emph{backward warping}). 
-Then, we can penalize the difference +In this case, we could use the bounding box to crop and bilinearly resize the corresponding region +from the first image $I_t$ and bilinearly sample the corresponding region from the second image $I_{t+1}$, +using the 2D point grid displaced with the predicted flow grid (which is often called \emph{backward warping}). +Then, we could penalize the difference between the resulting image crops, for example, with a census loss \cite{CensusTerm,UnFlow}. For more details on differentiable bilinear sampling for deep learning, we refer the reader to \cite{STN}. @@ -364,8 +365,8 @@ which could make it interesting even when 3D motion ground truth is available. \label{ssec:training_inference} \paragraph{Training} We train the Motion R-CNN RPN and RoI heads in the exact same way as described for Mask R-CNN. -We additionally compute the camera and instance motion losses and concatenate additional -information into the network input, but otherwise do not modify the training procedure +We additionally compute the camera and instance motion losses and concatenate the additional +frame (and, optionally, XYZ coordinates) into the network input, but otherwise do not modify the training procedure and sample proposals and RoIs in the exact same way. \paragraph{Inference} @@ -374,7 +375,7 @@ In the same way as the RoI mask head, at test time, we compute the RoI motion he from the features extracted with refined bounding boxes. Again, as for masks and bounding boxes in Mask R-CNN, -the predicted output object motions are the predicted object motions for the +the predicted output object motion is the predicted object motion for the highest scoring class. \subsection{Dense flow from 3D motion} @@ -406,29 +407,32 @@ which can be computed from the predicted box mask $m_k$ (for the predicted class it to the width and height of the predicted bounding box and then copying the values of the resized mask into a full resolution mask initialized with zeros, starting at the top-left coordinate of the predicted bounding box. -Then, given the predicted motions $(R_k, t_k)$, as well as $p_k$ for all objects, +Again we binarize masks at a threshold of $0.5$. + +Then, given the predicted motions $(R_k, t_k)$ and pivots $p_k$ for all objects, \begin{equation} P'_{t+1} = -P_t + \sum_1^{k} M_k\left\{ R_k \cdot (P_t - p_k) + p_k + t_k - P_t \right\} +P_t + \sum_1^{\text{N}} M_k\left\{ R_k \cdot (P_t - p_k) + p_k + t_k - P_t \right\}, \end{equation} -These motion predictions are understood to have already taken into account +where N is the number of detections. +The motion predictions are understood to have already taken into account the classification into moving and still objects, -and we thus, as described above, have identity motions for all objects with $o_k = 0$. +and we thus have, as described above, identity motions for all objects with $o_k = 0$. -Next, we transform all points given the camera transformation $\{R_{cam}, t_{cam}\} \in \mathbf{SE}(3)$, +Next, we transform points given the camera transformation $\{R_{cam}, t_{cam}\} \in \mathbf{SE}(3)$, \begin{equation} \begin{pmatrix} X_{t+1} \\ Y_{t+1} \\ Z_{t+1} \end{pmatrix} -= P_{t+1} = R_{cam} \cdot P'_{t+1} + t_{cam} -\end{equation}. += P_{t+1} = R_{cam} \cdot P'_{t+1} + t_{cam}. +\end{equation} -Note that in our experiments, we either use the ground truth camera motion to focus -on evaluating the object motion predictions or the predicted camera motion to evaluate -the complete motion estimates. 
We will always state which variant we use in the experimental section. +%Note that in our experiments, we either use the ground truth camera motion to focus +%on evaluating the object motion predictions or the predicted camera motion to evaluate +%the complete motion estimates. We will always state which variant we use in the experimental section. -Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again, +Finally, we project the transformed 3D points at time $t+1$ to 2D pixel coordinates again, \begin{equation} \begin{pmatrix} x_{t+1} \\ y_{t+1} @@ -443,7 +447,7 @@ X_{t+1} \\ Y_{t+1} c_0 \\ c_1 \end{pmatrix}. \end{equation} -We can now obtain the optical flow between $I_t$ and $I_{t+1}$ at each point as +We now obtain the optical flow between $I_t$ and $I_{t+1}$ at each point as \begin{equation} \begin{pmatrix} u \\ v diff --git a/background.tex b/background.tex index f11b970..86ace88 100644 --- a/background.tex +++ b/background.tex @@ -2,7 +2,7 @@ In this section, we will give a more detailed description of previous works we directly build on and other prerequisites. \subsection{Optical flow and scene flow} -Let $I_t,I_{t+1} : P \to \mathbb{R}^3$ be two temporally consecutive frames in a +Let $I_t,I_{t+1} : P \to \mathbb{R}^3$ be two temporally consecutive frames from a sequence of images. The optical flow $\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$ @@ -65,8 +65,10 @@ and a fully-connected prediction network on top of the encoder. The compressed representations learned by CNNs of these categories do not, however, allow for prediction of high-resolution output, as spatial detail is lost through sequential applications of pooling or strides. -Thus, networks for dense prediction introduce a convolutional decoder on top of the representation encoder, +Thus, networks for dense, high-resolution, prediction introduce a convolutional decoder on top of the representation encoder, performing upsampling of the compressed features and resulting in a encoder-decoder pyramid. +In most cases, skip connections from the encoder part are used to combine high-resolution +detail with abstract, expressive features coming from the bottleneck (the last layer of the encoder). The most popular deep networks of this kind for end-to-end optical flow prediction are variants of the FlowNet family \cite{FlowNet, FlowNet2}, which was recently extended to scene flow estimation \cite{SceneFlowDataset}. @@ -144,8 +146,10 @@ that will serve as the basic CNN backbone of our networks, and is also used in many other region-based convolutional networks. The initial image data is always passed through the ResNet backbone as a first step to bootstrap the complete deep network. -Note that for the Mask R-CNN architectures we describe below, this is equivalent -to the standard ResNet-50 backbone. We now introduce one small extension that +Note that for the Mask R-CNN architectures we describe below, the architecture shown is equivalent +to the standard ResNet-50 backbone. + +We additionally introduce one small extension that will be useful for our Motion R-CNN network. In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the input image resolution. In FlowNetS \cite{FlowNet}, their bottleneck stride is 64. @@ -255,9 +259,9 @@ Then, fixed size (H $\times$ W) feature maps are extracted from the compressed f each corresponding to one of the proposal bounding boxes. 
The extracted per-RoI (region of interest) feature maps are collected into a batch and passed into a small Fast R-CNN \emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass. -The extraction technique is called \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the full image features +Feature extraction is performed using \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the backbone features is divided into a H $\times$ W grid of cells. For each cell, the values of the underlying -full-image feature map are max-pooled to yield the output value at the cell. +feature map are max-pooled to yield the output value at the cell. Thus, given region proposals, all computation is reduced to a single pass through the complete network, speeding up the system by two orders of magnitude at inference time and one order of magnitude at training time. @@ -343,17 +347,17 @@ As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for and the refined bounding boxes are predicted separately for each object class. Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture -(here, the mask head is ignored). +(for Faster R-CNN, the mask head is ignored). \paragraph{Mask R-CNN} Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity. However, it can be helpful to know class and object (instance) membership of all individual pixels, -which generally involves computing a binary mask for each object instance specifying which pixels belong +which generally involves computing a binary image mask for each object instance specifying which pixels belong to that object. This problem is called \emph{instance segmentation}. Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting fixed resolution instance masks within the bounding boxes of each detected object, -which are then bilinearly resized to fit inside the respective bounding boxes. -This is done by simply extending the Faster R-CNN head with multiple convolutions, which +which are, at test-time, bilinearly resized to fit inside the respective bounding boxes. +For this, Mask R-CNN simply extends the Faster R-CNN head with multiple convolutions, which compute a pixel-precise binary mask for each instance. Note that the per-class masks logits are put through a sigmoid layer, and thus there is no comptetition between classes in the mask prediction branch. @@ -382,7 +386,7 @@ P$_5$ & From C$_5$: 1 $\times$ 1 conv, 256 & $\tfrac{1}{32}$ H $\times$ $\tfrac{ P$_4$ & $\begin{bmatrix}\textrm{skip from C$_4$}\end{bmatrix}_p$ & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 256 \\ P$_3$ & $\begin{bmatrix}\textrm{skip from C$_3$}\end{bmatrix}_p$ & $\tfrac{1}{8}$ H $\times$ $\tfrac{1}{8}$ W $\times$ 256 \\ P$_2$ & $\begin{bmatrix}\textrm{skip from C$_2$}\end{bmatrix}_p$ & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 256 \\ -P$_6$ & From P$_5$: 2 $\times$ 2 subsample, 256 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 256 \\ +P$_6$ & From P$_5$: 2 $\times$ 2 subsample & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 256 \\ \midrule \multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)}}\\ \midrule @@ -449,7 +453,7 @@ as the RPN heads themselves correspond to different scales. Now, in the RPN, higher resolution feature maps can be used for regressing smaller bounding boxes. 
For example, boxes of area close to $32^2$ are predicted using P$_2$, which has a stride of $4$ with respect to the input image. -Most importantly, the RoI features can now be extracted at the pyramid level $P_j$ appropriate for a +Most importantly, the RoI features can now be extracted at the pyramid level P$_j$ appropriate for a RoI bounding box with size $h \times w$, \begin{equation} j = 2 + j_a, @@ -468,7 +472,7 @@ is the scale of the smallest anchor boxes. This formula is slightly different from the one used in the FPN paper, as we want to assign the bounding boxes which are at the same scale as some anchor to the exact same pyramid level from which the RPN of this -anchor is computed. Now, for example, the smallest boxes are cropped from $P_2$, +anchor is computed. Now, for example, the smallest boxes are cropped from P$_2$, which is the highest resolution feature map. The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}. @@ -508,14 +512,14 @@ $c$ is the output vector from a softmax layer, $c_{c^*} \in (0,1)$ is the output probability for class $c^*$, and $\text{C}$ is the number of classes. Note that for the object category classifier, $\text{C} = \text{N}_{cls} + 1$, -as $\text{N}_{cls}$ does not include the background class. +as in $\text{N}_{cls}$, we do not count the background class. Finally, for multi-label classification, we define the binary (sigmoid) cross-entropy loss, \begin{equation} \ell_{cls*}(y, y^*) = -y^* \cdot \log(y) - (1 - y^*) \cdot \log(1 - y), \end{equation} where $y^* \in \{0,1\}$ is a label and $y \in (0,1)$ is the output from a sigmoid layer. Note that for the mask loss that will be introduced below, $\ell_{cls*}$ is -the sum of the $\ell_{cls*}$-losses for all 2D positions in the mask. +the sum of the $\ell_{cls*}$-losses for all 2D positions over the mask. \label{ssec:rcnn_techn} \paragraph{Bounding box regression} @@ -618,16 +622,19 @@ a ground truth bounding box, and a background example is defined as one with a maximum IoU in $[0.1, 0.5)$. A total of 64 (without FPN) or 512 (with FPN) RoIs are sampled, with at most $25\%$ foreground examples. -Now, let $c_i^*$ be the ground truth object class, where $c_i = 0$ -for background examples and $c_i \in \{1, ..., \text{N}_{cls}\}$ for foreground examples, -and let $c_i$ be the class prediction. +Now, let $c_i^*$ be the ground truth object class, where $c_i^* = 0$ +for background examples and $c_i^* \in \{1, ..., \text{N}_{cls}\}$ for foreground examples, +and let $c_i$ be the RoI class prediction. Then, for any foreground RoI, let $b_i^*$ be the ground truth bounding box encoding and $b_i$ -the predicted refined box encoding for class $c_i^*$. +the predicted refined RoI box encoding for class $c_i^*$. Additionally, for any foreground RoI, let $m_i$ be the predicted $m \times m$ mask for class $c_i^*$ and $m_i^*$ the $m \times m$ mask target with values in $\{0,1\}$, where the mask target is cropped and resized from the binary ground truth mask using the RPN proposal bounding box. In our implementation, we use nearest neighbour resizing for resizing the mask targets. +Note that values in $m_i$ and $c_i$ are already normalized probabilities from +sigmoid and softmax layers, respectively. 
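The mask target construction described above can be sketched as follows; the box format, the helper name, and the default resolution $m = 28$ are illustrative assumptions rather than the exact implementation.
\begin{verbatim}
import numpy as np

def mask_target(gt_mask, proposal_box, m=28):
    """Crop the full-image binary ground-truth mask with the RPN proposal box
    and resize it to m x m via nearest-neighbour sampling.
    gt_mask: (H, W) array in {0, 1}; proposal_box: (y0, x0, y1, x1) in pixels."""
    y0, x0, y1, x1 = proposal_box
    ys = np.clip(np.round(np.linspace(y0, y1, m)).astype(int), 0, gt_mask.shape[0] - 1)
    xs = np.clip(np.round(np.linspace(x0, x1, m)).astype(int), 0, gt_mask.shape[1] - 1)
    # Index the m sampled rows and columns to obtain the m x m target in {0, 1}.
    return gt_mask[np.ix_(ys, xs)].astype(np.float32)
\end{verbatim}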
+ Then, the ROI loss is computed as \begin{equation} L_{RoI} = L_{cls} + L_{box} + L_{mask} @@ -636,7 +643,7 @@ where \begin{equation} L_{cls} = \frac{1}{\text{N}_{RoI}} \sum_{i=1}^{\text{N}_{RoI}} \ell_{cls}(c_i, c_i^*), \end{equation} -is the average cross-entropy classification loss, +is the average (categorical) cross-entropy classification loss, \begin{equation} L_{box} = \frac{1}{\text{N}_{RoI}^{\mathit{fg}}} \sum_{i=1}^{\text{N}_{RoI}} [c_i^* \geq 1] \cdot \ell_{reg}(b_i^* - b_i) \end{equation} @@ -644,7 +651,7 @@ is the average smooth-$\ell_1$ bounding box regression loss, \begin{equation} L_{mask} = \frac{1}{\text{N}_{RoI}^{\mathit{fg}}} \sum_{i=1}^{\text{N}_{RoI}} [c_i^* \geq 1] \cdot \ell_{cls*}(m_i,m_i^*) \end{equation} -is the average binary cross-entropy mask loss, +is the average (binary) cross-entropy mask loss, \begin{equation} \text{N}_{RoI}^{\mathit{fg}} = \sum_{i=1}^{\text{N}_{RoI}} [c_i^* \geq 1] \end{equation} diff --git a/bib.bib b/bib.bib index c4dddd7..8f4ee20 100644 --- a/bib.bib +++ b/bib.bib @@ -393,7 +393,8 @@ @inproceedings{UnsupFlownet, title={Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness}, author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis}, - booktitle={ECCV 2016 Workshops}, + booktitle={1st Workshop on Brave new Ideas for Motion Representations in Videos}, + Note = {jointly with ECCV 2016}, Pages = {3--10}, Publisher = eccv-2016-pub, Series = eccv-2016-ser, @@ -411,3 +412,16 @@ pages={211--252}, journal=ijcv, year={2015}} + +@inproceedings{JOF, + Author = {Junhwa Hur and Stefan Roth}, + Booktitle = {4th Workshop on Computer Vision for Road Scene Understanding and Autonomous Driving}, + Editor = {Gang Hua and Herv{\'e} J{\'e}gou}, + Note = {jointly with ECCV 2016}, + Pages = {163--177}, + Publisher = eccv-2016-pub, + Series = eccv-2016-ser, + Sortmonth = eccv-2016-srtmon, + Title = {Joint Optical Flow and Temporally Consistent Semantic Segmentation}, + Volume = {9913}, + Year = eccv-2016-yr} diff --git a/experiments.tex b/experiments.tex index bfa1c5e..5c01e61 100644 --- a/experiments.tex +++ b/experiments.tex @@ -66,29 +66,26 @@ o_{cam}^* = 0 &\text{otherwise,} \end{cases} \end{equation} -which specifies the camera is moving in between the frames. +which specifies whether the camera is moving in between the frames. -For any object $i$ visible in both frames, let -$(R_t^i, t_t^i)$ and $(R_{t+1}^i, t_{t+1}^i)$ +For any object $k$ visible in both frames, let +$(R_t^k, t_t^k)$ and $(R_{t+1}^k, t_{t+1}^k)$ be its orientation and position in camera space -at $I_t$ and $I_{t+1}$. +at $I_t$ and $I_{t+1}$, respectively. Note that the pose at $t$ is given with respect to the camera at $t$ and the pose at $t+1$ is given with respect to the camera at $t+1$. We define the ground truth pivot $p_k^* \in \mathbb{R}^3$ as - \begin{equation} -p_k^* = t_t^i +p_k^* = t_t^k \end{equation} - and compute the ground truth object motion $\{R_k^*, t_k^*\} \in \mathbf{SE}(3)$ as - \begin{equation} -R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i), +R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^k \cdot \mathrm{inv}(R_t^k), \end{equation} \begin{equation} -t_k^* = t_{t+1}^{i} - R_k^* \cdot t_t. +t_k^* = t_{t+1}^{k} - R_k^* \cdot t_t. \end{equation} As for the camera, we define $o_k^* \in \{ 0, 1 \}$, @@ -105,7 +102,7 @@ which specifies whether an object is moving in between the frames. 
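The ground truth motion targets can be sketched as follows; the threshold for the moving/non-moving label is a made-up stand-in for the exact criterion, and $t_t$ is read as the object position $t_t^k$.
\begin{verbatim}
import numpy as np

def object_motion_targets(R_t_k, t_t_k, R_tp1_k, t_tp1_k, R_cam_gt, eps=1e-3):
    """Ground-truth motion targets for object k from its camera-space poses at t
    and t+1 (R_*_k: 3x3 rotations, t_*_k: 3-vectors) and the ground-truth camera
    rotation R_cam_gt."""
    p_k = t_t_k                                  # pivot: object position at time t
    R_k = np.linalg.inv(R_cam_gt) @ R_tp1_k @ np.linalg.inv(R_t_k)
    t_k = t_tp1_k - R_k @ t_t_k
    # Label the object as moving if its rigid motion is not (close to) the identity.
    o_k = int(np.linalg.norm(t_k) > eps or not np.allclose(R_k, np.eye(3), atol=eps))
    return R_k, t_k, p_k, o_k
\end{verbatim}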
To evaluate the 3D instance and camera motions on the Virtual KITTI validation set, we introduce a few error metrics. Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example, -let $R_k, t_k, p_k, o_k$ be the predicted motion for the predicted class $c_k$ +let $R_k, t_k, p_k, o_k$ be the predicted (and postprocessed) motion for the predicted class $c_k$ and $R_k^*, t_k^*, p_k^*, o_k^*$ the motion ground truth for the best matching example. Then, assuming there are $N$ such detections, \begin{equation} @@ -120,6 +117,7 @@ is the mean euclidean norm between predicted and ground truth translation, and E_{p} = \frac{1}{N}\sum_k \left\lVert p_k^* - p_k \right\rVert_2 \end{equation} is the mean euclidean norm between predicted and ground truth pivot. + Moreover, we define precision and recall measures for the detection of moving objects, where \begin{equation} @@ -152,7 +150,7 @@ the predicted camera motion. Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}. We train for a total of 192K iterations on the Virtual KITTI training set. For this, we use a single Titan X (Pascal) GPU and a batch size of 1, -which results in approximately one day of training. +which results in approximately one day of training for a complete run. As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a momentum of $0.9$. As learning rate we use $0.25 \cdot 10^{-2}$ for the diff --git a/introduction.tex b/introduction.tex index cc95d37..2c4edd2 100644 --- a/introduction.tex +++ b/introduction.tex @@ -15,7 +15,7 @@ For moving in the real world, it is often desirable to know which objects exists in the proximity of the moving agent, where they are located relative to the agent, and where they will be at some point in the near future. -In many cases, it would be preferable to infer such information from video data +In many cases, it would be preferable to infer such information from video data, if technically feasible, as camera sensors are cheap and ubiquitous (compared to, for example, Lidar). @@ -27,28 +27,29 @@ At the same time, the autonomous driving system has to operate in real time to react quickly enough for safely controlling the vehicle. A promising approach for 3D scene understanding in situations such as autonomous driving are deep neural -networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification -in still images and are more and more often being applied to video data. +networks, which have recently achieved breakthroughs in object detection, instance segmentation, and classification +in still images, and are more and more often being applied to video data. A key benefit of deep networks is that they can, in principle, enable very fast inference on real time video data and generalize over many training situations to resolve ambiguities inherent in image understanding and motion estimation. Thus, in this work, we aim to develop deep neural networks which can, given -sequences of images, segment the image pixels into object instances and estimate +sequences of images, segment the image pixels into object instances, and estimate the location and 3D motion of each object instance relative to the camera (Figure \ref{figure:teaser}). 
\subsection{Technical goals} -Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth -and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera. +Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting dense depth +and dense optical flow from monocular image sequences, +based on estimating the 3D motion of individual objects and the camera. Using a standard encoder-decoder network for pixel-wise dense prediction, SfM-Net predicts a pre-determined number of binary masks ranging over the complete image, with each mask specifying the membership of the image pixels to one object. -A fully-connected network branching off the encoder then predicts a 3D motion for each object, -as well as the camera ego-motion. -However, due to the fixed number of objects masks, the system can in practice only predict a small number of motions and +A fully-connected network branching off the encoder then predicts a 3D motion for each object +and the camera ego-motion. +However, due to the fixed number of objects masks, the system can in practice only predict a small number of motions, and often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions (Figure \ref{figure:sfmnet_kitti}). \begin{figure}[t] \centering @@ -56,19 +57,21 @@ often fails to properly segment the pixels into the correct masks or assigns bac \caption{ Results of SfM-Net \cite{SfmNet} on KITTI \cite{KITTI2015}. From left to right, we show their instance segmentation into up to 3 independent objects, -ground truth instance masks for the segmented objects, composed optical flow and ground truth optical flow. +ground truth instance masks for the segmented objects, composed optical flow, +and ground truth optical flow. Figure taken from \cite{SfmNet}. } \label{figure:sfmnet_kitti} \end{figure} -Thus, this approach is very unlikely to scale to dynamic scenes with a potentially -large number of diverse objects due to the inflexible nature of their instance segmentation technique. +Thus, due to the inflexible nature of their instance segmentation technique, +their approach is very unlikely to scale to dynamic scenes with a potentially +large number of diverse objects. Still, we think that the general idea of estimating object-level motion with end-to-end deep networks instead of directly predicting a dense flow field, as is common in current end-to-end deep learning approaches to motion estimation, may significantly benefit motion -estimation by structuring the problem, creating physical constraints and reducing +estimation by structuring the problem, creating physical constraints, and reducing the dimensionality of the estimate. In the context of still images, a @@ -102,7 +105,7 @@ as to the number or variety of object instances (Figure \ref{figure:net_intro}). Eventually, we want to extend our method to include depth prediction, yielding the first end-to-end deep network to perform 3D scene flow estimation -in a principled way from the consideration of individual objects. +in a principled and scalable way from the consideration of individual objects. For now, we will assume that RGB-D frames are given to break down the problem into manageable pieces. @@ -110,9 +113,9 @@ manageable pieces. \centering \includegraphics[width=\textwidth]{figures/net_intro} \caption{ -Overview of our network based on Mask R-CNN. 
For each region of interest (RoI), we predict the instance motion +Overview of our network based on Mask R-CNN. For each region of interest (RoI), we predict the 3D instance motion in parallel to the class, bounding box and mask. Additionally, we branch off a -small network for predicting the camera motion from the bottleneck. +small network from the bottleneck for predicting the 3D camera ego-motion. Novel components in addition to Mask R-CNN are shown in red. } \label{figure:net_intro} @@ -128,7 +131,7 @@ at inference time as \emph{end-to-end} deep learning systems. End-to-end deep networks for optical flow were recently introduced based on encoder-decoder networks or CNN pyramids \cite{FlowNet, FlowNet2, SPyNet}, -which pose optical flow as generic, homogenous pixel-wise estimation problem without making any assumptions +which pose optical flow as generic (and homogeneous) pixel-wise estimation problem without making any assumptions about the regularity and structure of the estimated flow. Specifically, such methods ignore that the optical flow varies across an image depending on the semantics of each region or pixel, which include whether a @@ -136,10 +139,10 @@ pixel belongs to the background, to which object instance it belongs if it is no and the class of the object it belongs to. Often, failure cases of these methods include motion boundaries or regions with little texture, where semantics become very important. -Extensions of these approaches to scene flow estimate flow and depth +Extensions of these approaches to scene flow estimate dense flow and dense depth with similarly generic networks \cite{SceneFlowDataset} and similar limitations. -Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure % TODO cite jun's paper? +Other works \cite{ESI, JOF, FlowLayers, MRFlow} make use of semantic segmentation to structure the optical flow estimation problem and introduce reasoning at the object level, but still require expensive energy minimization for each new input, as CNNs are only used for some of the components and numerical @@ -153,7 +156,8 @@ The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as be composed of planar segments. Pixels are assigned to one of the planar segments, each of which undergoes a independent 3D rigid motion. This model simplifies the motion estimation problem significantly by reducing the dimensionality -of the estimate, and thus leads to accurate results. +of the estimate, and thus can lead to more accurate results than the direct estimation +of a homogenous motion field. In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015} assigns each slanted plane to one rigidly moving object instance, thus reducing the number of independently moving segments by allowing multiple @@ -164,23 +168,23 @@ without the use of (deep) learning. In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow}, a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined -with depth obtained from a non-learned stereo algorithm to be used as pre-computed +with depth obtained from a non-learned stereo algorithm, to be used as pre-computed inputs to a slanted plane scene flow model based on \cite{KITTI2015}. Most likely due to their use of deep learning for instance segmentation and for some other components, this approach outperforms the previous related scene flow methods on public benchmarks. 
-Still, the method uses a energy-minimization formulation for the scene flow estimation +Still, the method uses a energy-minimization formulation for the scene flow estimation itself and takes minutes to make a prediction. Interestingly, the slanted plane methods achieve the current state-of-the-art in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015}, -outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}. +outperforming end-to-end deep networks like \cite{SceneFlowDataset, FlowNet2}. However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts, generally taking a fraction of a second instead of minutes for prediction and can often be made to run in realtime. These concerns restrict the applicability of the current slanted plane models in practical settings, -which often require estimations to be done in realtime and for which an end-to-end +which often require estimations to be done in realtime (or close to realtime) and for which an end-to-end approach based on learning would be preferable. -By analogy, in other contexts, the move towards end-to-end deep learning has often lead +Also, by analogy, in other contexts, the move towards end-to-end deep learning has often lead to significant benefits in terms of accuracy and speed. As an example, consider the evolution of region-based convolutional networks, which started out as prohibitively slow with a CNN as a single component and @@ -188,7 +192,7 @@ became very fast and much more accurate over the course of their development int end-to-end deep networks. Thus, in the context of motion estimation, one may expect end-to-end deep learning to not only bring large improvements -in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation +in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation, and the ability of deep networks to learn to handle ambiguity from a large variety of training examples. However, we think that the current end-to-end deep learning approaches to motion @@ -200,10 +204,10 @@ with the promise of end-to-end deep learning. \paragraph{End-to-end deep networks for 3D rigid motion estimation} End-to-end deep learning for predicting rigid 3D object motions was first introduced with SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation -of the points into objects together with the 3D motion of each object. +of the points into objects together with 3D motions for each object. Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and estimates a segmentation of pixels into objects together with their 3D motions between the frames. -In addition, SfM-Net predicts dense depth and camera ego-motion to obtain full 3D scene flow from end-to-end deep learning. +In addition, SfM-Net predicts dense depth and camera ego-motion to obtain full 3D scene flow with end-to-end deep learning. For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate with a brightness constancy proxy loss. @@ -218,8 +222,8 @@ a single RGB frame \cite{PoseNet, PoseNet2}, or for estimating depth and camera from monocular video \cite{UnsupPoseDepth}. 
These works are related to ours in that we also need to output various rotations and translations from a deep network, -and thus need to solve similar regression problems and may be able to use similar parametrizations -and losses. +and thus need to solve similar regression problems, +and may be able to use similar parametrizations and losses. \subsection{Outline}