wip
This commit is contained in: parent d0d7e6b176, commit f0050801a4

approach.tex
@ -5,7 +5,7 @@
|
||||
Building on Mask R-CNN \cite{MaskRCNN},
|
||||
we estimate per-object motion by predicting the 3D motion of each detected object.
|
||||
For this, we extend Mask R-CNN in two straightforward ways.
|
||||
First, we modify the backbone network and provide two frames to the R-CNN system
|
||||
First, we modify the backbone network and provide it with two frames
|
||||
in order to enable image matching between the consecutive frames.
|
||||
Second, we extend the Mask R-CNN RoI head to predict a 3D motion and pivot for each
|
||||
region proposal. Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}
|
||||
@ -32,10 +32,10 @@ C$_4$ & ResNet \{up to C$_4$\} (Table \ref{table:resnet}) & $\tfrac{1}{16}$ H $\
|
||||
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
|
||||
T$_0$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
|
||||
|
||||
$R_t^{cam}$& From T$_0$: fully connected, 3 & 1 $\times$ 3 \\
|
||||
$t_t^{cam}$& From T$_0$: fully connected, 3 & 1 $\times$ 3 \\
|
||||
$R_{cam}$& From T$_0$: fully connected, 3 & 1 $\times$ 3 \\
|
||||
$t_{cam}$& From T$_0$: fully connected, 3 & 1 $\times$ 3 \\
|
||||
& From T$_0$: fully connected, 2 & 1 $\times$ 2 \\
|
||||
$o_t^{cam}$& softmax, 2 & 1 $\times$ 2 \\
|
||||
$o_{cam}$& softmax, 2 & 1 $\times$ 2 \\
|
||||
\midrule
|
||||
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet})}\\
|
||||
\midrule
|
||||
@ -43,11 +43,11 @@ $o_t^{cam}$& softmax, 2 & 1 $\times$ 2 \\
|
||||
\midrule
|
||||
%& From M$_0$: flatten & N$_{RoI}$ $\times$ 7 $\cdot$ 7 $\cdot$ 256 \\
|
||||
T$_1$ & From ave: $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & N$_{RoI}$ $\times$ 1024 \\
|
||||
$\forall k: R_t^k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: t_t^k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: p_t^k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: R_k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: t_k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: p_k$ & From T$_1$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
& From T$_1$: fully connected, 2 & N$_{RoI}$ $\times$ 2 \\
|
||||
$\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
|
||||
$\forall k: o_k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
|
||||
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
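To make the per-RoI motion rows of the table above concrete, the following NumPy sketch mirrors the shape flow of T$_1$ and the heads branching off it. The layer helpers, the random weights and the 1024-dimensional pooled input are illustrative assumptions only, not the thesis implementation.

import numpy as np

def dense(x, out_dim, rng):
    """Fully connected layer with random weights (shape sketch only)."""
    return x @ (rng.standard_normal((x.shape[-1], out_dim)) * 0.01)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_roi = 64
feats = rng.standard_normal((n_roi, 1024))      # pooled per-RoI features (dimension assumed)

t1 = dense(dense(feats, 1024, rng), 1024, rng)  # T_1: [fully connected, 1024] x 2
R_k = dense(t1, 3, rng)                         # per-RoI rotation parameters
t_k = dense(t1, 3, rng)                         # per-RoI translation
p_k = dense(t1, 3, rng)                         # per-RoI pivot
o_k = softmax(dense(t1, 2, rng))                # moving / non-moving probabilities
print(R_k.shape, t_k.shape, p_k.shape, o_k.shape)   # (64, 3) (64, 3) (64, 3) (64, 2)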
@ -81,10 +81,10 @@ C$_6$ & ResNet (Table \ref{table:resnet}) & $\tfrac{1}{64}$ H $\times$ $\tfrac{1
|
||||
& bilinear resize, 7 $\times$ 7 & 7 $\times$ 7 $\times$ 512 \\
|
||||
& flatten & 1 $\times$ 7 $\cdot$ 7 $\cdot$ 512 \\
|
||||
T$_2$ & $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & 1 $\times$ 1024 \\
|
||||
$R_t^{cam}$& From T$_2$: fully connected, 3 & 1 $\times$ 3 \\
|
||||
$t_t^{cam}$& From T$_2$: fully connected, 3 & 1 $\times$ 3 \\
|
||||
$R_{cam}$& From T$_2$: fully connected, 3 & 1 $\times$ 3 \\
|
||||
$t_{cam}$& From T$_2$: fully connected, 3 & 1 $\times$ 3 \\
|
||||
& From T$_2$: fully connected, 2 & 1 $\times$ 2 \\
|
||||
$o_t^{cam}$& softmax, 2 & 1 $\times$ 2 \\
|
||||
$o_{cam}$& softmax, 2 & 1 $\times$ 2 \\
|
||||
\midrule
|
||||
\multicolumn{3}{c}{\textbf{RoI Head \& RoI Head: Masks} (Table \ref{table:maskrcnn_resnet_fpn})} \\
|
||||
\midrule
|
||||
@ -92,11 +92,11 @@ $o_t^{cam}$& softmax, 2 & 1 $\times$ 2 \\
|
||||
\midrule
|
||||
%& From M$_1$: flatten & N$_{RoI}$ $\times$ 14 $\cdot$ 14 $\cdot$ 256 \\
|
||||
T$_3$ & From F$_1$: $\begin{bmatrix}\textrm{fully connected}, 1024\end{bmatrix}$ $\times$ 2 & N$_{RoI}$ $\times$ 1024 \\
|
||||
$\forall k: R_t^k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: t_t^k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: p_t^k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: R_k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: t_k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
$\forall k: p_k$ & From T$_3$: fully connected, 3 & N$_{RoI}$ $\times$ 3 \\
|
||||
& From T$_2$: fully connected, 2 & N$_{RoI}$ $\times$ 2 \\
|
||||
$\forall k: o_t^k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
|
||||
$\forall k: o_k$ & softmax, 2 & N$_{RoI}$ $\times$ 2 \\
|
||||
|
||||
\bottomrule
|
||||
\end{tabular}
|
||||
@ -124,7 +124,7 @@ we depth-concatenate two temporally consecutive frames $I_t$ and $I_{t+1}$, yiel
|
||||
Additionally, we also experiment with concatenating the camera space XYZ coordinates for each frame,
|
||||
XYZ$_t$ and XYZ$_{t+1}$, into the input as well.
|
||||
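A minimal sketch of the network input described here, assuming arbitrary placeholder frame dimensions: the two frames (and, optionally, their camera-space XYZ maps) are simply depth-concatenated along the channel axis.

import numpy as np

H, W = 192, 640                                  # placeholder resolution
I_t     = np.zeros((H, W, 3), dtype=np.float32)  # frame at t
I_tp1   = np.zeros((H, W, 3), dtype=np.float32)  # frame at t+1
XYZ_t   = np.zeros((H, W, 3), dtype=np.float32)  # optional camera-space coordinates at t
XYZ_tp1 = np.zeros((H, W, 3), dtype=np.float32)  # optional camera-space coordinates at t+1

x_rgb  = np.concatenate([I_t, I_tp1], axis=-1)                    # H x W x 6
x_full = np.concatenate([I_t, I_tp1, XYZ_t, XYZ_tp1], axis=-1)    # H x W x 12
print(x_rgb.shape, x_full.shape)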
We do not introduce a separate network for computing region proposals and use our modified backbone network
|
||||
as both first stage RPN and second stage feature extractor for extracting the RoI features.
|
||||
as both RPN and for extracting the RoI features.
|
||||
Technically, our feature encoder network will have to learn image matching representations similar to
|
||||
that learned by the FlowNet encoder, but the output will be computed in the
|
||||
object-centric framework of a region based convolutional network head with a 3D parametrization.
|
||||
@ -133,7 +133,7 @@ from the encoder is integrated for specific objects via RoI extraction and
|
||||
processed by the RoI head for each object.
|
||||
|
||||
\paragraph{Per-RoI motion prediction}
|
||||
We use a rigid 3D motion parametrization similar to the one used in SfM-Net and SE3-Nets \cite{SfmNet,SE3Nets}.
|
||||
We use a rigid 3D motion parametrization similar to the one used in SE3-Nets and SfM-Net \cite{SE3Nets, SfmNet}.
|
||||
For the $k$-th object proposal, we predict the rigid transformation $\{R_k, t_k\}\in \mathbf{SE}(3)$
|
||||
\footnote{$\mathbf{SE}(3)$ refers to the Special Euclidean Group representing 3D rotations
|
||||
and translations: $\{R, t|R \in \mathbf{SO}(3), t \in \mathbb{R}^3\}$}
|
||||
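As a worked illustration of this parametrization, the sketch below applies a pivoted rigid motion to camera-space points, i.e. it rotates about the pivot and then translates. The Euler-angle encoding of the three predicted rotation parameters is an assumption for illustration; the excerpt only fixes that the rotation is output as three numbers.

import numpy as np

def euler_to_matrix(angles):
    """Rotation matrix from three Euler angles (X, Y, Z order); the Euler
    parametrization itself is an assumption for this sketch."""
    ax, ay, az = angles
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def apply_object_motion(P, angles, t, p):
    """Rotate points P (N x 3) about the pivot p, then translate by t."""
    R = euler_to_matrix(angles)
    return (P - p) @ R.T + p + t

P = np.array([[1.0, 2.0, 10.0]])   # a 3D point in camera space
print(apply_object_motion(P, [0.0, 0.1, 0.0], np.array([0.0, 0.0, 0.5]), np.array([0.0, 0.0, 9.0])))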
@ -214,7 +214,7 @@ to increase the bottleneck stride prior to the camera motion network to 64.
|
||||
In our ResNet-FPN variant (Table \ref{table:motionrcnn_resnet_fpn}),
|
||||
the backbone makes use of all blocks through $C_6$, and
|
||||
we can simply branch off our camera motion network from the $C_6$ bottleneck.
|
||||
Then, in both, the ResNet and ResNet-FPN variant, we apply a additional
|
||||
Then, in both the ResNet and ResNet-FPN variants, we apply one additional
|
||||
convolution to the $C_6$ features to reduce the number of inputs to the following
|
||||
fully-connected layers, and thus keep the number of weights reasonably small.
|
||||
Instead of averaging, we use bilinear resizing to bring the convolutional features
|
||||
@ -255,7 +255,7 @@ performs better in our case than the standard $\ell_1$-loss.
|
||||
We thus compute the RoI motion loss as
|
||||
|
||||
\begin{equation}
|
||||
L_{motion} = \frac{1}{\text{N}_{RoI}^{\mathit{fg}}} \sum_k^{\text{N}_{RoI}} l_{p}^k + (l_{R}^k + l_{t}^k) \cdot o_k^* + l_o^k,
|
||||
L_{motion} = \frac{1}{\text{N}_{RoI}^{\mathit{fg}}} \sum_k^{\text{N}_{RoI}} (l_{R}^k + l_{t}^k) \cdot o_k^* + l_{p}^k + l_o^k,
|
||||
\end{equation}
|
||||
where
|
||||
\begin{equation}
|
||||
@ -272,7 +272,7 @@ respectively and
|
||||
\begin{equation}
|
||||
l_o^k = \ell_{cls}(o_k, o_k^*).
|
||||
\end{equation}
|
||||
is the cross-entropy loss for the predicted classification into moving and non-moving objects.
|
||||
is the (categorical) cross-entropy loss for the predicted classification into moving and non-moving objects.
|
||||
|
||||
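A minimal sketch of this RoI motion loss, assuming per-component smooth-$\ell_1$ penalties as placeholders for $l_R^k$, $l_t^k$ and $l_p^k$ (the exact regression penalty is not restated in this excerpt):

import numpy as np

def smooth_l1(x):
    """Placeholder per-component regression penalty, summed over the 3 components."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5).sum(axis=-1)

def cross_entropy(prob, label):
    """Categorical cross-entropy for the 2-way moving / non-moving softmax output."""
    return -np.log(prob[np.arange(len(label)), label] + 1e-8)

def motion_loss(R, t, p, o_prob, R_gt, t_gt, p_gt, o_gt, n_fg):
    """L_motion: rotation and translation terms are gated by o*_k, pivot and
    classification terms are always added, normalised by the foreground count."""
    l_R, l_t, l_p = smooth_l1(R - R_gt), smooth_l1(t - t_gt), smooth_l1(p - p_gt)
    l_o = cross_entropy(o_prob, o_gt)          # o_gt: integer labels in {0, 1}
    per_roi = (l_R + l_t) * o_gt + l_p + l_o
    return per_roi.sum() / max(n_fg, 1)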
Note that we do not penalize the rotation and translation for objects with
|
||||
$o_k^* = 0$, which do not move between $t$ and $t+1$. We found that the network
|
||||
@ -300,7 +300,7 @@ classification loss.
|
||||
\centering
|
||||
\includegraphics[width=\textwidth]{figures/flow_loss}
|
||||
\caption{
|
||||
Overview of the alternative, optical flow based loss for instance motion
|
||||
Overview of the alternative, flow-based loss for instance motion
|
||||
supervision without 3D instance motion ground truth.
|
||||
In contrast to SfM-Net \cite{SfmNet}, where a single optical flow field is
|
||||
composed and penalized to supervise the motion prediction, our loss considers
|
||||
@ -316,15 +316,16 @@ which we can apply to coordinates within the object bounding boxes,
|
||||
and which does not require ground truth 3D object motions.
|
||||
|
||||
In this case, for any RoI,
|
||||
we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box
|
||||
we generate a uniform $m \times m$ grid of 2D points inside the RPN proposal bounding box
|
||||
with the same resolution as the predicted mask.
|
||||
Note that the predicted mask we use here was binarized at a threshold of $0.5$.
|
||||
We use the same bounding box
|
||||
to crop the corresponding region from the dense, full-image depth map
|
||||
and bilinearly resize the depth crop to the same resolution as the mask and point
|
||||
grid.
|
||||
Next, we create a 3D point cloud from the point grid and depth crop. To this point cloud, we
|
||||
Next, we create a grid of 3D points (point cloud) from the grid of 2D points and depth crop. To this point cloud, we
|
||||
apply the object motion predicted for the RoI, masked by the predicted mask.
|
||||
Then, we apply the camera motion to the points, project them back to 2D
|
||||
Then, we apply the camera motion to the 3D points, project them back to 2D
|
||||
and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
|
||||
Note that we batch this computation over all RoIs, so that we only perform
|
||||
it once per forward pass.
|
||||
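The per-RoI flow composition described above might be sketched as follows; the pinhole intrinsics (fx, fy, cx, cy) and the rotation-matrix form of the motion inputs are assumptions for illustration, not the thesis implementation.

import numpy as np

def roi_flow(box, depth_crop, mask, R_obj, t_obj, p_obj, R_cam, t_cam, K):
    """Sketch of the per-RoI reprojection flow: box = (x0, y0, x1, y1) in pixels,
    depth_crop and mask are m x m arrays aligned with the box, K = (fx, fy, cx, cy)."""
    m = mask.shape[0]
    x0, y0, x1, y1 = box
    gx, gy = np.meshgrid(np.linspace(x0, x1, m), np.linspace(y0, y1, m))  # uniform 2D grid

    fx, fy, cx, cy = K
    Z = depth_crop
    X = (gx - cx) / fx * Z                              # back-project grid to 3D
    Y = (gy - cy) / fy * Z
    P = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

    moved = (P - p_obj) @ R_obj.T + p_obj + t_obj       # object motion about its pivot
    inside = (mask >= 0.5).reshape(-1, 1)               # binarised predicted mask
    P = np.where(inside, moved, P)                      # apply motion only inside the mask

    P = P @ R_cam.T + t_cam                             # camera motion
    u = fx * P[:, 0] / np.maximum(P[:, 2], 1e-6) + cx   # project back to 2D
    v = fy * P[:, 1] / np.maximum(P[:, 2], 1e-6) + cy
    flow = np.stack([u - gx.ravel(), v - gy.ravel()], axis=-1)
    return flow.reshape(m, m, 2)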
@ -336,7 +337,7 @@ duplicate them here. The only differences are that there is no sum over objects
|
||||
the point transformation based on instance motion, as we consider the single object
|
||||
corresponding to an RoI in isolation, and that the masks are not resized to the
|
||||
full image resolution, as
|
||||
the depth crops and 2D point grid are at the same resolution as the predicted
|
||||
the depth crop and the grid of 2D points are at the same resolution as the predicted
|
||||
$m \times m$ mask.
|
||||
|
||||
For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion
|
||||
@ -345,12 +346,12 @@ If there is optical flow ground truth available, we can use the RoI bounding box
|
||||
crop and resize a region from the ground truth optical flow to match the RoI's
|
||||
optical flow grid and penalize the difference between the flow grids with a (smooth) $\ell_1$-loss.
|
||||
|
||||
However, we can also use the re-projection loss without optical flow ground truth
|
||||
However, we could also use the re-projection loss without optical flow ground truth
|
||||
to train the motion prediction in an unsupervised manner, similar to \cite{SfmNet}.
|
||||
In this case, we can use the bounding box to crop and resize a corresponding region
|
||||
from the first image $I_t$ and bilinearly sample a region from the second image $I_{t+1}$
|
||||
using the 2D grid displaced with the predicted flow grid (the latter is often called \emph{backward warping}).
|
||||
Then, we can penalize the difference
|
||||
In this case, we could use the bounding box to crop and bilinearly resize the corresponding region
|
||||
from the first image $I_t$ and bilinearly sample the corresponding region from the second image $I_{t+1}$,
|
||||
using the 2D point grid displaced with the predicted flow grid (which is often called \emph{backward warping}).
|
||||
Then, we could penalize the difference
|
||||
between the resulting image crops, for example, with a census loss \cite{CensusTerm,UnFlow}.
|
||||
For more details on differentiable bilinear sampling for deep learning, we refer the reader to
|
||||
\cite{STN}.
|
||||
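For reference, a small NumPy sketch of backward warping by bilinear sampling, as used in this unsupervised loss variant; the helper names are hypothetical.

import numpy as np

def bilinear_sample(img, x, y):
    """Sample img (H x W x C) at float coordinates x, y with bilinear interpolation."""
    H, W = img.shape[:2]
    x = np.clip(x, 0, W - 1.001)
    y = np.clip(y, 0, H - 1.001)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    top = img[y0, x0] * (1 - wx)[..., None] + img[y0, x1] * wx[..., None]
    bot = img[y1, x0] * (1 - wx)[..., None] + img[y1, x1] * wx[..., None]
    return top * (1 - wy)[..., None] + bot * wy[..., None]

def backward_warp(I_tp1, grid_x, grid_y, flow):
    """Backward warping: sample I_{t+1} at the 2D grid displaced by the predicted flow."""
    return bilinear_sample(I_tp1, grid_x + flow[..., 0], grid_y + flow[..., 1])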
@ -364,8 +365,8 @@ which could make it interesting even when 3D motion ground truth is available.
|
||||
\label{ssec:training_inference}
|
||||
\paragraph{Training}
|
||||
We train the Motion R-CNN RPN and RoI heads in the exact same way as described for Mask R-CNN.
|
||||
We additionally compute the camera and instance motion losses and concatenate additional
|
||||
information into the network input, but otherwise do not modify the training procedure
|
||||
We additionally compute the camera and instance motion losses and concatenate the additional
|
||||
frame (and, optionally, XYZ coordinates) into the network input, but otherwise do not modify the training procedure
|
||||
and sample proposals and RoIs in the exact same way.
|
||||
|
||||
\paragraph{Inference}
|
||||
@ -374,7 +375,7 @@ In the same way as the RoI mask head, at test time, we compute the RoI motion he
|
||||
from the features extracted with refined bounding boxes.
|
||||
|
||||
Again, as for masks and bounding boxes in Mask R-CNN,
|
||||
the predicted output object motions are the predicted object motions for the
|
||||
the predicted output object motion is the predicted object motion for the
|
||||
highest scoring class.
|
||||
|
||||
\subsection{Dense flow from 3D motion}
|
||||
@ -406,29 +407,32 @@ which can be computed from the predicted box mask $m_k$ (for the predicted class
|
||||
it to the width and height of the predicted bounding box and then copying the values
|
||||
of the resized mask into a full resolution mask initialized with zeros,
|
||||
starting at the top-left coordinate of the predicted bounding box.
|
||||
Then, given the predicted motions $(R_k, t_k)$, as well as $p_k$ for all objects,
|
||||
Again, we binarize masks at a threshold of $0.5$.
|
||||
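A sketch of this mask pasting step, using scipy's zoom as a stand-in for the bilinear resize and assuming the predicted box lies inside the image:

import numpy as np
from scipy.ndimage import zoom

def paste_mask(box_mask, box, image_shape, threshold=0.5):
    """Paste an m x m box mask into a full-resolution zero mask.
    box = (x0, y0, x1, y1) in pixel coordinates, image_shape = (H, W)."""
    H, W = image_shape
    x0, y0, x1, y1 = [int(round(v)) for v in box]
    h, w = max(y1 - y0, 1), max(x1 - x0, 1)
    m = box_mask.shape[0]
    resized = zoom(box_mask, (h / m, w / m), order=1)   # resize to the box size
    full = np.zeros((H, W), dtype=np.float32)
    rh, rw = resized.shape
    full[y0:y0 + rh, x0:x0 + rw] = resized              # copy into the full-resolution mask
    return (full >= threshold).astype(np.float32)       # binarise at 0.5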
|
||||
Then, given the predicted motions $(R_k, t_k)$ and pivots $p_k$ for all objects,
|
||||
\begin{equation}
|
||||
P'_{t+1} =
|
||||
P_t + \sum_1^{k} M_k\left\{ R_k \cdot (P_t - p_k) + p_k + t_k - P_t \right\}
|
||||
P_t + \sum_{k=1}^{\text{N}} M_k\left\{ R_k \cdot (P_t - p_k) + p_k + t_k - P_t \right\},
|
||||
\end{equation}
|
||||
These motion predictions are understood to have already taken into account
|
||||
where N is the number of detections.
|
||||
The motion predictions are understood to have already taken into account
|
||||
the classification into moving and still objects,
|
||||
and we thus, as described above, have identity motions for all objects with $o_k = 0$.
|
||||
and we thus have, as described above, identity motions for all objects with $o_k = 0$.
|
||||
|
||||
Next, we transform all points given the camera transformation $\{R_{cam}, t_{cam}\} \in \mathbf{SE}(3)$,
|
||||
Next, we transform points given the camera transformation $\{R_{cam}, t_{cam}\} \in \mathbf{SE}(3)$,
|
||||
|
||||
\begin{equation}
|
||||
\begin{pmatrix}
|
||||
X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
|
||||
\end{pmatrix}
|
||||
= P_{t+1} = R_{cam} \cdot P'_{t+1} + t_{cam}
|
||||
\end{equation}.
|
||||
= P_{t+1} = R_{cam} \cdot P'_{t+1} + t_{cam}.
|
||||
\end{equation}
|
||||
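Putting the two transformations together (including the projection to pixel coordinates described below), a NumPy sketch of the dense flow composition might look as follows; the pinhole intrinsics and the rotation-matrix form of the motion inputs are assumptions for illustration.

import numpy as np

def compose_flow(depth, masks, motions, cam_motion, K):
    """Dense flow from the predicted 3D motions, following the equations above.
    depth: H x W, masks: list of full-resolution binary masks M_k,
    motions: list of (R_k, t_k, p_k) with R_k a 3 x 3 rotation matrix,
    cam_motion: (R_cam, t_cam), K = (fx, fy, cx, cy) pinhole intrinsics."""
    fx, fy, cx, cy = K
    H, W = depth.shape
    gx, gy = np.meshgrid(np.arange(W, dtype=np.float32), np.arange(H, dtype=np.float32))

    # Back-project every pixel to a camera-space point P_t.
    X = (gx - cx) / fx * depth
    Y = (gy - cy) / fy * depth
    P = np.stack([X, Y, depth], axis=-1)

    # Apply each object's pivoted rigid motion inside its mask.
    P1 = P.copy()
    for M, (R, t, p) in zip(masks, motions):
        moved = (P - p) @ R.T + p + t
        P1 += M[..., None] * (moved - P)

    # Apply the camera motion and project back to pixel coordinates.
    R_cam, t_cam = cam_motion
    P1 = P1 @ R_cam.T + t_cam
    u = fx * P1[..., 0] / np.maximum(P1[..., 2], 1e-6) + cx
    v = fy * P1[..., 1] / np.maximum(P1[..., 2], 1e-6) + cy
    return np.stack([u - gx, v - gy], axis=-1)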
|
||||
Note that in our experiments, we either use the ground truth camera motion to focus
|
||||
on evaluating the object motion predictions or the predicted camera motion to evaluate
|
||||
the complete motion estimates. We will always state which variant we use in the experimental section.
|
||||
%Note that in our experiments, we either use the ground truth camera motion to focus
|
||||
%on evaluating the object motion predictions or the predicted camera motion to evaluate
|
||||
%the complete motion estimates. We will always state which variant we use in the experimental section.
|
||||
|
||||
Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
|
||||
Finally, we project the transformed 3D points at time $t+1$ to 2D pixel coordinates again,
|
||||
\begin{equation}
|
||||
\begin{pmatrix}
|
||||
x_{t+1} \\ y_{t+1}
|
||||
@ -443,7 +447,7 @@ X_{t+1} \\ Y_{t+1}
|
||||
c_0 \\ c_1
|
||||
\end{pmatrix}.
|
||||
\end{equation}
|
||||
We can now obtain the optical flow between $I_t$ and $I_{t+1}$ at each point as
|
||||
We now obtain the optical flow between $I_t$ and $I_{t+1}$ at each point as
|
||||
\begin{equation}
|
||||
\begin{pmatrix}
|
||||
u \\ v
|
||||
|
||||
@ -2,7 +2,7 @@ In this section, we will give a more detailed description of previous works
|
||||
we directly build on and other prerequisites.
|
||||
|
||||
\subsection{Optical flow and scene flow}
|
||||
Let $I_t,I_{t+1} : P \to \mathbb{R}^3$ be two temporally consecutive frames in a
|
||||
Let $I_t,I_{t+1} : P \to \mathbb{R}^3$ be two temporally consecutive frames from a
|
||||
sequence of images.
|
||||
The optical flow
|
||||
$\mathbf{w} = (u, v)^T$ from $I_t$ to $I_{t+1}$
|
||||
@ -65,8 +65,10 @@ and a fully-connected prediction network on top of the encoder.
|
||||
The compressed representations learned by CNNs of these categories do not, however, allow
|
||||
for prediction of high-resolution output, as spatial detail is lost through sequential applications
|
||||
of pooling or strides.
|
||||
Thus, networks for dense prediction introduce a convolutional decoder on top of the representation encoder,
|
||||
Thus, networks for dense, high-resolution prediction introduce a convolutional decoder on top of the representation encoder,
performing upsampling of the compressed features and resulting in an encoder-decoder pyramid.
|
||||
In most cases, skip connections from the encoder part are used to combine high-resolution
|
||||
detail with abstract, expressive features coming from the bottleneck (the last layer of the encoder).
|
||||
The most popular deep networks of this kind for end-to-end optical flow prediction
|
||||
are variants of the FlowNet family \cite{FlowNet, FlowNet2},
|
||||
which was recently extended to scene flow estimation \cite{SceneFlowDataset}.
|
||||
@ -144,8 +146,10 @@ that will serve as the basic CNN backbone of our networks, and
|
||||
is also used in many other region-based convolutional networks.
|
||||
The initial image data is always passed through the ResNet backbone as a first step to
|
||||
bootstrap the complete deep network.
|
||||
Note that for the Mask R-CNN architectures we describe below, this is equivalent
|
||||
to the standard ResNet-50 backbone. We now introduce one small extension that
|
||||
Note that for the Mask R-CNN architectures we describe below, the architecture shown is equivalent
|
||||
to the standard ResNet-50 backbone.
|
||||
|
||||
We additionally introduce one small extension that
|
||||
will be useful for our Motion R-CNN network.
|
||||
In ResNet-50, the C$_5$ bottleneck has a stride of 32 with respect to the
|
||||
input image resolution. In FlowNetS \cite{FlowNet}, the bottleneck stride is 64.
|
||||
@ -255,9 +259,9 @@ Then, fixed size (H $\times$ W) feature maps are extracted from the compressed f
|
||||
each corresponding to one of the proposal bounding boxes.
|
||||
The extracted per-RoI (region of interest) feature maps are collected into a batch and passed into a small Fast R-CNN
|
||||
\emph{head} network, which performs classification and prediction of refined boxes for all regions in one forward pass.
|
||||
The extraction technique is called \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the full image features
|
||||
Feature extraction is performed using \emph{RoI pooling}. In RoI pooling, the RoI bounding box window over the backbone features
|
||||
is divided into an H $\times$ W grid of cells. For each cell, the values of the underlying
|
||||
full-image feature map are max-pooled to yield the output value at the cell.
|
||||
feature map are max-pooled to yield the output value at the cell.
|
||||
Thus, given region proposals, all computation is reduced to a single pass through the complete network,
|
||||
speeding up the system by two orders of magnitude at inference time and one order of magnitude
|
||||
at training time.
|
||||
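A naive NumPy sketch of RoI max pooling as described here, assuming the box is given in feature-map coordinates and lies within bounds (real implementations quantize and batch this differently):

import numpy as np

def roi_pool(features, box, out_h=7, out_w=7):
    """Divide the RoI window over the backbone features into an out_h x out_w grid
    of cells and max-pool the feature values inside each cell.
    features: H x W x C, box = (x0, y0, x1, y1) in feature-map coordinates."""
    x0, y0, x1, y1 = box
    xs = np.linspace(x0, x1, out_w + 1)
    ys = np.linspace(y0, y1, out_h + 1)
    out = np.zeros((out_h, out_w, features.shape[-1]), dtype=features.dtype)
    for i in range(out_h):
        for j in range(out_w):
            ya, yb = int(ys[i]), max(int(np.ceil(ys[i + 1])), int(ys[i]) + 1)
            xa, xb = int(xs[j]), max(int(np.ceil(xs[j + 1])), int(xs[j]) + 1)
            out[i, j] = features[ya:yb, xa:xb].max(axis=(0, 1))
    return out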
@ -343,17 +347,17 @@ As in Fast R-CNN, RoI pooling is used to extract one fixed size feature map for
|
||||
and the refined bounding boxes are predicted separately for each object class.
|
||||
|
||||
Table~\ref{table:maskrcnn_resnet} includes an overview of the Faster R-CNN ResNet network architecture
|
||||
(here, the mask head is ignored).
|
||||
(for Faster R-CNN, the mask head is ignored).
|
||||
|
||||
\paragraph{Mask R-CNN}
|
||||
Faster R-CNN and the earlier systems detect and classify objects at bounding box granularity.
|
||||
However, it can be helpful to know class and object (instance) membership of all individual pixels,
|
||||
which generally involves computing a binary mask for each object instance specifying which pixels belong
|
||||
which generally involves computing a binary image mask for each object instance specifying which pixels belong
|
||||
to that object. This problem is called \emph{instance segmentation}.
|
||||
Mask R-CNN \cite{MaskRCNN} extends the Faster R-CNN system to instance segmentation by predicting
|
||||
fixed resolution instance masks within the bounding boxes of each detected object,
|
||||
which are then bilinearly resized to fit inside the respective bounding boxes.
|
||||
This is done by simply extending the Faster R-CNN head with multiple convolutions, which
|
||||
which are, at test-time, bilinearly resized to fit inside the respective bounding boxes.
|
||||
For this, Mask R-CNN simply extends the Faster R-CNN head with multiple convolutions, which
|
||||
compute a pixel-precise binary mask for each instance.
|
||||
Note that the per-class mask logits are put through a sigmoid layer, and thus there is no
competition between classes in the mask prediction branch.
|
||||
@ -382,7 +386,7 @@ P$_5$ & From C$_5$: 1 $\times$ 1 conv, 256 & $\tfrac{1}{32}$ H $\times$ $\tfrac{
|
||||
P$_4$ & $\begin{bmatrix}\textrm{skip from C$_4$}\end{bmatrix}_p$ & $\tfrac{1}{16}$ H $\times$ $\tfrac{1}{16}$ W $\times$ 256 \\
|
||||
P$_3$ & $\begin{bmatrix}\textrm{skip from C$_3$}\end{bmatrix}_p$ & $\tfrac{1}{8}$ H $\times$ $\tfrac{1}{8}$ W $\times$ 256 \\
|
||||
P$_2$ & $\begin{bmatrix}\textrm{skip from C$_2$}\end{bmatrix}_p$ & $\tfrac{1}{4}$ H $\times$ $\tfrac{1}{4}$ W $\times$ 256 \\
|
||||
P$_6$ & From P$_5$: 2 $\times$ 2 subsample, 256 & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 256 \\
|
||||
P$_6$ & From P$_5$: 2 $\times$ 2 subsample & $\tfrac{1}{64}$ H $\times$ $\tfrac{1}{64}$ W $\times$ 256 \\
|
||||
\midrule
|
||||
\multicolumn{3}{c}{\textbf{Region Proposal Network (RPN)}}\\
|
||||
\midrule
|
||||
@ -449,7 +453,7 @@ as the RPN heads themselves correspond to different scales.
|
||||
Now, in the RPN, higher resolution feature maps can be used for regressing smaller
|
||||
bounding boxes. For example, boxes of area close to $32^2$ are predicted using P$_2$,
|
||||
which has a stride of $4$ with respect to the input image.
|
||||
Most importantly, the RoI features can now be extracted at the pyramid level $P_j$ appropriate for a
|
||||
Most importantly, the RoI features can now be extracted at the pyramid level P$_j$ appropriate for a
|
||||
RoI bounding box with size $h \times w$,
|
||||
\begin{equation}
|
||||
j = 2 + j_a,
|
||||
@ -468,7 +472,7 @@ is the scale of the smallest anchor boxes.
|
||||
This formula is slightly different from the one used in the FPN paper,
|
||||
as we want to assign the bounding boxes which are at the same scale
|
||||
as some anchor to the exact same pyramid level from which the RPN of this
|
||||
anchor is computed. Now, for example, the smallest boxes are cropped from $P_2$,
|
||||
anchor is computed. Now, for example, the smallest boxes are cropped from P$_2$,
|
||||
which is the highest resolution feature map.
|
||||
|
||||
The Mask R-CNN ResNet-FPN variant is shown in Table \ref{table:maskrcnn_resnet_fpn}.
|
||||
@ -508,14 +512,14 @@ $c$ is the output vector from a softmax layer,
|
||||
$c_{c^*} \in (0,1)$ is the output probability for class $c^*$,
|
||||
and $\text{C}$ is the number of classes.
|
||||
Note that for the object category classifier, $\text{C} = \text{N}_{cls} + 1$,
|
||||
as $\text{N}_{cls}$ does not include the background class.
|
||||
as in $\text{N}_{cls}$, we do not count the background class.
|
||||
Finally, for multi-label classification, we define the binary (sigmoid) cross-entropy loss,
|
||||
\begin{equation}
|
||||
\ell_{cls*}(y, y^*) = -y^* \cdot \log(y) - (1 - y^*) \cdot \log(1 - y),
|
||||
\end{equation}
|
||||
where $y^* \in \{0,1\}$ is a label and $y \in (0,1)$ is the output from a sigmoid layer.
|
||||
Note that for the mask loss that will be introduced below, $\ell_{cls*}$ is
|
||||
the sum of the $\ell_{cls*}$-losses for all 2D positions in the mask.
|
||||
the sum of the $\ell_{cls*}$-losses for all 2D positions over the mask.
|
||||
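The two loss definitions above translate into a few lines; this is a plain NumPy sketch, not the thesis code:

import numpy as np

def l_cls(c, c_star):
    """Categorical cross-entropy: c is a softmax output vector, c_star a label index."""
    return -np.log(c[c_star] + 1e-8)

def l_cls_star(y, y_star):
    """Binary (sigmoid) cross-entropy for probabilities y and labels y* in {0, 1}."""
    return -y_star * np.log(y + 1e-8) - (1 - y_star) * np.log(1 - y + 1e-8)

def l_mask(m, m_star):
    """Mask loss for one RoI: the binary cross-entropies summed over all m x m positions."""
    return l_cls_star(m, m_star).sum()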
|
||||
\label{ssec:rcnn_techn}
|
||||
\paragraph{Bounding box regression}
|
||||
@ -618,16 +622,19 @@ a ground truth bounding box, and a background example is defined as
|
||||
one with a maximum IoU in $[0.1, 0.5)$.
|
||||
A total of 64 (without FPN) or 512 (with FPN) RoIs are sampled, with
|
||||
at most $25\%$ foreground examples.
|
||||
Now, let $c_i^*$ be the ground truth object class, where $c_i = 0$
|
||||
for background examples and $c_i \in \{1, ..., \text{N}_{cls}\}$ for foreground examples,
|
||||
and let $c_i$ be the class prediction.
|
||||
Now, let $c_i^*$ be the ground truth object class, where $c_i^* = 0$
|
||||
for background examples and $c_i^* \in \{1, ..., \text{N}_{cls}\}$ for foreground examples,
|
||||
and let $c_i$ be the RoI class prediction.
|
||||
Then, for any foreground RoI, let $b_i^*$ be the ground truth bounding box encoding and $b_i$
|
||||
the predicted refined box encoding for class $c_i^*$.
|
||||
the predicted refined RoI box encoding for class $c_i^*$.
|
||||
Additionally, for any foreground RoI, let $m_i$ be the predicted $m \times m$ mask for class $c_i^*$
|
||||
and $m_i^*$ the $m \times m$ mask target with values in $\{0,1\}$, where the mask target is cropped and resized from
|
||||
the binary ground truth mask using the RPN proposal bounding box.
|
||||
In our implementation, we use nearest neighbour resizing for the mask
targets.
|
||||
Note that values in $m_i$ and $c_i$ are already normalized probabilities from
|
||||
sigmoid and softmax layers, respectively.
|
||||
|
||||
Then, the RoI loss is computed as
|
||||
\begin{equation}
|
||||
L_{RoI} = L_{cls} + L_{box} + L_{mask}
|
||||
@ -636,7 +643,7 @@ where
|
||||
\begin{equation}
|
||||
L_{cls} = \frac{1}{\text{N}_{RoI}} \sum_{i=1}^{\text{N}_{RoI}} \ell_{cls}(c_i, c_i^*),
|
||||
\end{equation}
|
||||
is the average cross-entropy classification loss,
|
||||
is the average (categorical) cross-entropy classification loss,
|
||||
\begin{equation}
|
||||
L_{box} = \frac{1}{\text{N}_{RoI}^{\mathit{fg}}} \sum_{i=1}^{\text{N}_{RoI}} [c_i^* \geq 1] \cdot \ell_{reg}(b_i^* - b_i)
|
||||
\end{equation}
|
||||
@ -644,7 +651,7 @@ is the average smooth-$\ell_1$ bounding box regression loss,
|
||||
\begin{equation}
|
||||
L_{mask} = \frac{1}{\text{N}_{RoI}^{\mathit{fg}}} \sum_{i=1}^{\text{N}_{RoI}} [c_i^* \geq 1] \cdot \ell_{cls*}(m_i,m_i^*)
|
||||
\end{equation}
|
||||
is the average binary cross-entropy mask loss,
|
||||
is the average (binary) cross-entropy mask loss,
|
||||
\begin{equation}
|
||||
\text{N}_{RoI}^{\mathit{fg}} = \sum_{i=1}^{\text{N}_{RoI}} [c_i^* \geq 1]
|
||||
\end{equation}
|
||||
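A compact sketch of the combined RoI loss with the normalisations above; the per-RoI box and mask losses are passed in precomputed:

import numpy as np

def roi_loss(c, c_star, l_box, l_mask):
    """L_RoI = L_cls + L_box + L_mask.
    c: N_RoI x C softmax outputs, c_star: ground truth class indices (0 = background),
    l_box, l_mask: precomputed per-RoI box regression and mask losses."""
    fg = (c_star >= 1).astype(np.float64)
    n_fg = max(fg.sum(), 1.0)
    L_cls = -np.log(c[np.arange(len(c_star)), c_star] + 1e-8).mean()  # average over all RoIs
    L_box = (fg * l_box).sum() / n_fg                                 # foreground RoIs only
    L_mask = (fg * l_mask).sum() / n_fg
    return L_cls + L_box + L_mask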
|
||||
bib.bib
@ -393,7 +393,8 @@
|
||||
@inproceedings{UnsupFlownet,
|
||||
title={Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
|
||||
author={Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
|
||||
booktitle={ECCV 2016 Workshops},
|
||||
booktitle={1st Workshop on Brave New Ideas for Motion Representations in Videos},
|
||||
Note = {jointly with ECCV 2016},
|
||||
Pages = {3--10},
|
||||
Publisher = eccv-2016-pub,
|
||||
Series = eccv-2016-ser,
|
||||
@ -411,3 +412,16 @@
|
||||
pages={211--252},
|
||||
journal=ijcv,
|
||||
year={2015}}
|
||||
|
||||
@inproceedings{JOF,
|
||||
Author = {Junhwa Hur and Stefan Roth},
|
||||
Booktitle = {4th Workshop on Computer Vision for Road Scene Understanding and Autonomous Driving},
|
||||
Editor = {Gang Hua and Herv{\'e} J{\'e}gou},
|
||||
Note = {jointly with ECCV 2016},
|
||||
Pages = {163--177},
|
||||
Publisher = eccv-2016-pub,
|
||||
Series = eccv-2016-ser,
|
||||
Sortmonth = eccv-2016-srtmon,
|
||||
Title = {Joint Optical Flow and Temporally Consistent Semantic Segmentation},
|
||||
Volume = {9913},
|
||||
Year = eccv-2016-yr}
|
||||
|
||||
@ -66,29 +66,26 @@ o_{cam}^* =
|
||||
0 &\text{otherwise,}
|
||||
\end{cases}
|
||||
\end{equation}
|
||||
which specifies the camera is moving in between the frames.
|
||||
which specifies whether the camera is moving in between the frames.
|
||||
|
||||
For any object $i$ visible in both frames, let
|
||||
$(R_t^i, t_t^i)$ and $(R_{t+1}^i, t_{t+1}^i)$
|
||||
For any object $k$ visible in both frames, let
|
||||
$(R_t^k, t_t^k)$ and $(R_{t+1}^k, t_{t+1}^k)$
|
||||
be its orientation and position in camera space
|
||||
at $I_t$ and $I_{t+1}$.
|
||||
at $I_t$ and $I_{t+1}$, respectively.
|
||||
Note that the pose at $t$ is given with respect to the camera at $t$ and
|
||||
the pose at $t+1$ is given with respect to the camera at $t+1$.
|
||||
|
||||
We define the ground truth pivot $p_k^* \in \mathbb{R}^3$ as
|
||||
|
||||
\begin{equation}
|
||||
p_k^* = t_t^i
|
||||
p_k^* = t_t^k
|
||||
\end{equation}
|
||||
|
||||
and compute the ground truth object motion
|
||||
$\{R_k^*, t_k^*\} \in \mathbf{SE}(3)$ as
|
||||
|
||||
\begin{equation}
|
||||
R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^i \cdot \mathrm{inv}(R_t^i),
|
||||
R_k^* = \mathrm{inv}(R_{cam}^*) \cdot R_{t+1}^k \cdot \mathrm{inv}(R_t^k),
|
||||
\end{equation}
|
||||
\begin{equation}
|
||||
t_k^* = t_{t+1}^{i} - R_k^* \cdot t_t.
|
||||
t_k^* = t_{t+1}^{k} - R_k^* \cdot t_t^k.
|
||||
\end{equation}
|
||||
|
||||
As for the camera, we define $o_k^* \in \{ 0, 1 \}$,
|
||||
@ -105,7 +102,7 @@ which specifies whether an object is moving in between the frames.
|
||||
To evaluate the 3D instance and camera motions on the Virtual KITTI validation
|
||||
set, we introduce a few error metrics.
|
||||
Given a foreground detection $k$ with an IoU of at least $0.5$ with a ground truth example,
|
||||
let $R_k, t_k, p_k, o_k$ be the predicted motion for the predicted class $c_k$
|
||||
let $R_k, t_k, p_k, o_k$ be the predicted (and postprocessed) motion for the predicted class $c_k$
|
||||
and $R_k^*, t_k^*, p_k^*, o_k^*$ the motion ground truth for the best matching example.
|
||||
Then, assuming there are $N$ such detections,
|
||||
\begin{equation}
|
||||
@ -120,6 +117,7 @@ is the mean euclidean norm between predicted and ground truth translation, and
|
||||
E_{p} = \frac{1}{N}\sum_k \left\lVert p_k^* - p_k \right\rVert_2
|
||||
\end{equation}
|
||||
is the mean euclidean norm between predicted and ground truth pivot.
|
||||
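A sketch of these metrics over matched detections; the translation and pivot errors follow the equations above, while the rotation error shown here (mean relative rotation angle) is only an assumption, since its definition is not restated in this excerpt:

import numpy as np

def motion_errors(R_pred, R_gt, t_pred, t_gt, p_pred, p_gt):
    """Mean errors over all N matched detections; inputs are stacked as N x 3 x 3
    rotation matrices and N x 3 vectors."""
    E_t = np.mean(np.linalg.norm(t_gt - t_pred, axis=-1))     # mean translation error
    E_p = np.mean(np.linalg.norm(p_gt - p_pred, axis=-1))     # mean pivot error
    # Angle of the relative rotation R_gt^T R_pred per detection (illustrative only).
    traces = np.einsum('nij,nij->n', R_gt, R_pred)
    E_R = np.mean(np.degrees(np.arccos(np.clip((traces - 1.0) / 2.0, -1.0, 1.0))))
    return E_R, E_t, E_p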
|
||||
Moreover, we define precision and recall measures for the detection of moving objects,
|
||||
where
|
||||
\begin{equation}
|
||||
@ -152,7 +150,7 @@ the predicted camera motion.
|
||||
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
|
||||
We train for a total of 192K iterations on the Virtual KITTI training set.
|
||||
For this, we use a single Titan X (Pascal) GPU and a batch size of 1,
|
||||
which results in approximately one day of training.
|
||||
which results in approximately one day of training for a complete run.
|
||||
As optimizer, we use stochastic gradient descent (SGD) \cite{SGD} with a
|
||||
momentum of $0.9$.
|
||||
As learning rate we use $0.25 \cdot 10^{-2}$ for the
|
||||
|
||||
@ -15,7 +15,7 @@ For moving in the real world, it is often desirable to know which objects exists
|
||||
in the proximity of the moving agent,
|
||||
where they are located relative to the agent,
|
||||
and where they will be at some point in the near future.
|
||||
In many cases, it would be preferable to infer such information from video data
|
||||
In many cases, it would be preferable to infer such information from video data,
|
||||
if technically feasible, as camera sensors are cheap and ubiquitous
|
||||
(compared to, for example, Lidar).
|
||||
|
||||
@ -27,28 +27,29 @@ At the same time, the autonomous driving system has to operate in real time to
|
||||
react quickly enough for safely controlling the vehicle.
|
||||
|
||||
A promising approach for 3D scene understanding in situations such as autonomous driving are deep neural
|
||||
networks, which have recently achieved breakthroughs in object detection, instance segmentation and classification
|
||||
in still images and are more and more often being applied to video data.
|
||||
networks, which have recently achieved breakthroughs in object detection, instance segmentation, and classification
|
||||
in still images, and are more and more often being applied to video data.
|
||||
A key benefit of deep networks is that they can, in principle,
|
||||
enable very fast inference on real time video data and generalize
|
||||
over many training situations to resolve ambiguities inherent in image understanding
|
||||
and motion estimation.
|
||||
|
||||
Thus, in this work, we aim to develop deep neural networks which can, given
|
||||
sequences of images, segment the image pixels into object instances and estimate
|
||||
sequences of images, segment the image pixels into object instances, and estimate
|
||||
the location and 3D motion of each object instance relative to the camera
|
||||
(Figure \ref{figure:teaser}).
|
||||
|
||||
\subsection{Technical goals}
|
||||
|
||||
Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting depth
|
||||
and dense optical flow in monocular image sequences based on estimating the 3D motion of individual objects and the camera.
|
||||
Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting dense depth
|
||||
and dense optical flow from monocular image sequences,
|
||||
based on estimating the 3D motion of individual objects and the camera.
|
||||
Using a standard encoder-decoder network for pixel-wise dense prediction,
|
||||
SfM-Net predicts a pre-determined number of binary masks ranging over the complete image,
|
||||
with each mask specifying the membership of the image pixels to one object.
|
||||
A fully-connected network branching off the encoder then predicts a 3D motion for each object,
|
||||
as well as the camera ego-motion.
|
||||
However, due to the fixed number of objects masks, the system can in practice only predict a small number of motions and
|
||||
A fully-connected network branching off the encoder then predicts a 3D motion for each object
|
||||
and the camera ego-motion.
|
||||
However, due to the fixed number of object masks, the system can in practice only predict a small number of motions, and
|
||||
often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions (Figure \ref{figure:sfmnet_kitti}).
|
||||
\begin{figure}[t]
|
||||
\centering
|
||||
@ -56,19 +57,21 @@ often fails to properly segment the pixels into the correct masks or assigns bac
|
||||
\caption{
|
||||
Results of SfM-Net \cite{SfmNet} on KITTI \cite{KITTI2015}.
|
||||
From left to right, we show their instance segmentation into up to 3 independent objects,
|
||||
ground truth instance masks for the segmented objects, composed optical flow and ground truth optical flow.
|
||||
ground truth instance masks for the segmented objects, composed optical flow,
|
||||
and ground truth optical flow.
|
||||
Figure taken from \cite{SfmNet}.
|
||||
}
|
||||
\label{figure:sfmnet_kitti}
|
||||
\end{figure}
|
||||
Thus, this approach is very unlikely to scale to dynamic scenes with a potentially
|
||||
large number of diverse objects due to the inflexible nature of their instance segmentation technique.
|
||||
Thus, due to the inflexible nature of their instance segmentation technique,
|
||||
their approach is very unlikely to scale to dynamic scenes with a potentially
|
||||
large number of diverse objects.
|
||||
|
||||
Still, we think that the general idea of estimating object-level motion with
|
||||
end-to-end deep networks instead
|
||||
of directly predicting a dense flow field, as is common in current end-to-end
|
||||
deep learning approaches to motion estimation, may significantly benefit motion
|
||||
estimation by structuring the problem, creating physical constraints and reducing
|
||||
estimation by structuring the problem, creating physical constraints, and reducing
|
||||
the dimensionality of the estimate.
|
||||
|
||||
In the context of still images, a
|
||||
@ -102,7 +105,7 @@ as to the number or variety of object instances (Figure \ref{figure:net_intro}).
|
||||
|
||||
Eventually, we want to extend our method to include depth prediction,
|
||||
yielding the first end-to-end deep network to perform 3D scene flow estimation
|
||||
in a principled way from the consideration of individual objects.
|
||||
in a principled and scalable way from the consideration of individual objects.
|
||||
For now, we will assume that RGB-D frames are given to break down the problem into
|
||||
manageable pieces.
|
||||
|
||||
@ -110,9 +113,9 @@ manageable pieces.
|
||||
\centering
|
||||
\includegraphics[width=\textwidth]{figures/net_intro}
|
||||
\caption{
|
||||
Overview of our network based on Mask R-CNN. For each region of interest (RoI), we predict the instance motion
|
||||
Overview of our network based on Mask R-CNN. For each region of interest (RoI), we predict the 3D instance motion
|
||||
in parallel to the class, bounding box and mask. Additionally, we branch off a
|
||||
small network for predicting the camera motion from the bottleneck.
|
||||
small network from the bottleneck for predicting the 3D camera ego-motion.
|
||||
Novel components in addition to Mask R-CNN are shown in red.
|
||||
}
|
||||
\label{figure:net_intro}
|
||||
@ -128,7 +131,7 @@ at inference time as \emph{end-to-end} deep learning systems.
|
||||
|
||||
End-to-end deep networks for optical flow were recently introduced
|
||||
based on encoder-decoder networks or CNN pyramids \cite{FlowNet, FlowNet2, SPyNet},
|
||||
which pose optical flow as generic, homogenous pixel-wise estimation problem without making any assumptions
|
||||
which pose optical flow as a generic (and homogeneous) pixel-wise estimation problem without making any assumptions
|
||||
about the regularity and structure of the estimated flow.
|
||||
Specifically, such methods ignore that the optical flow varies across an
|
||||
image depending on the semantics of each region or pixel, which include whether a
|
||||
@ -136,10 +139,10 @@ pixel belongs to the background, to which object instance it belongs if it is no
|
||||
and the class of the object it belongs to.
|
||||
Often, failure cases of these methods include motion boundaries or regions with little texture,
|
||||
where semantics become very important.
|
||||
Extensions of these approaches to scene flow estimate flow and depth
|
||||
Extensions of these approaches to scene flow estimate dense flow and dense depth
|
||||
with similarly generic networks \cite{SceneFlowDataset} and similar limitations.
|
||||
|
||||
Other works \cite{FlowLayers, ESI, MRFlow} make use of semantic segmentation to structure % TODO cite jun's paper?
|
||||
Other works \cite{ESI, JOF, FlowLayers, MRFlow} make use of semantic segmentation to structure
|
||||
the optical flow estimation problem and introduce reasoning at the object level,
|
||||
but still require expensive energy minimization for each
|
||||
new input, as CNNs are only used for some of the components and numerical
|
||||
@ -153,7 +156,8 @@ The slanted plane model for scene flow \cite{PRSF, PRSM} models a 3D scene as be
|
||||
composed of planar segments. Pixels are assigned to one of the planar segments,
|
||||
each of which undergoes an independent 3D rigid motion.
|
||||
This model simplifies the motion estimation problem significantly by reducing the dimensionality
|
||||
of the estimate, and thus leads to accurate results.
|
||||
of the estimate, and thus can lead to more accurate results than the direct estimation
|
||||
of a homogeneous motion field.
|
||||
In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015}
|
||||
assigns each slanted plane to one rigidly moving object instance, thus
|
||||
reducing the number of independently moving segments by allowing multiple
|
||||
@ -164,23 +168,23 @@ without the use of (deep) learning.
|
||||
|
||||
In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow},
|
||||
a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined
|
||||
with depth obtained from a non-learned stereo algorithm to be used as pre-computed
|
||||
with depth obtained from a non-learned stereo algorithm, to be used as pre-computed
|
||||
inputs to a slanted plane scene flow model based on \cite{KITTI2015}.
|
||||
Most likely due to their use of deep learning for instance segmentation and for some other components, this
|
||||
approach outperforms the previous related scene flow methods on public benchmarks.
|
||||
Still, the method uses a energy-minimization formulation for the scene flow estimation
|
||||
Still, the method uses an energy-minimization formulation for the scene flow estimation itself
|
||||
and takes minutes to make a prediction.
|
||||
|
||||
Interestingly, the slanted plane methods achieve the current state-of-the-art
|
||||
in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015},
|
||||
outperforming end-to-end deep networks like \cite{FlowNet2, SceneFlowDataset}.
|
||||
outperforming end-to-end deep networks like \cite{SceneFlowDataset, FlowNet2}.
|
||||
However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts,
|
||||
generally taking a fraction of a second instead of minutes for prediction, and can often be made to run in realtime.
|
||||
These concerns restrict the applicability of the current slanted plane models in practical settings,
|
||||
which often require estimations to be done in realtime and for which an end-to-end
|
||||
which often require estimations to be done in realtime (or close to realtime) and for which an end-to-end
|
||||
approach based on learning would be preferable.
|
||||
|
||||
By analogy, in other contexts, the move towards end-to-end deep learning has often lead
|
||||
Also, by analogy, in other contexts, the move towards end-to-end deep learning has often led
|
||||
to significant benefits in terms of accuracy and speed.
|
||||
As an example, consider the evolution of region-based convolutional networks, which started
|
||||
out as prohibitively slow with a CNN as a single component and
|
||||
@ -188,7 +192,7 @@ became very fast and much more accurate over the course of their development int
|
||||
end-to-end deep networks.
|
||||
|
||||
Thus, in the context of motion estimation, one may expect end-to-end deep learning to not only bring large improvements
|
||||
in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation
|
||||
in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation,
|
||||
and the ability of deep networks to learn to handle ambiguity from a large variety of training examples.
|
||||
|
||||
However, we think that the current end-to-end deep learning approaches to motion
|
||||
@ -200,10 +204,10 @@ with the promise of end-to-end deep learning.
|
||||
\paragraph{End-to-end deep networks for 3D rigid motion estimation}
|
||||
End-to-end deep learning for predicting rigid 3D object motions was first introduced with
|
||||
SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation
|
||||
of the points into objects together with the 3D motion of each object.
|
||||
of the points into objects together with 3D motions for each object.
|
||||
Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and
|
||||
estimates a segmentation of pixels into objects together with their 3D motions between the frames.
|
||||
In addition, SfM-Net predicts dense depth and camera ego-motion to obtain full 3D scene flow from end-to-end deep learning.
|
||||
In addition, SfM-Net predicts dense depth and camera ego-motion to obtain full 3D scene flow with end-to-end deep learning.
|
||||
For supervision, SfM-Net penalizes the dense optical flow composed from all 3D motions and the depth estimate
|
||||
with a brightness constancy proxy loss.
|
||||
|
||||
@ -218,8 +222,8 @@ a single RGB frame \cite{PoseNet, PoseNet2}, or for estimating depth and camera
|
||||
from monocular video \cite{UnsupPoseDepth}.
|
||||
These works are related to
|
||||
ours in that we also need to output various rotations and translations from a deep network,
|
||||
and thus need to solve similar regression problems and may be able to use similar parametrizations
|
||||
and losses.
|
||||
and thus need to solve similar regression problems,
|
||||
and may be able to use similar parametrizations and losses.
|
||||
|
||||
|
||||
\subsection{Outline}
|
||||
|
||||