mirror of https://github.com/tu-darmstadt-informatik/bsc-thesis.git
synced 2025-12-13 09:55:49 +00:00

WIP

This commit is contained in:
  parent 2a39cf1174
  commit b9e4173f7f
@@ -43,8 +43,9 @@ the repurposing of generic deep networks has become a
popular approach for classical problems of computer vision
that require pixelwise estimation.

Following this trend, many recent end-to-end deep learning methods
for optical flow or scene flow compute complete and high-resolution flow fields with generic
Many recent end-to-end deep learning methods
for optical flow or scene flow follow this trend and compute
complete and high-resolution flow fields with generic
networks for dense, pixelwise estimation, thereby ignoring the
inherent structure of the underlying motion estimation problem and any physical
constraints within the scene.

approach.tex (70 changed lines)
@@ -209,7 +209,7 @@ Then, in both the ResNet-50 and ResNet-50-FPN variants (Table \ref{table:motionr
convolution to the $C_5$ features to reduce the number of inputs to the following
fully-connected layers.
Instead of averaging, we use bilinear resizing to bring the convolutional features
to a fixed size without losing spatial information,
to a fixed size without losing all spatial information,
flatten them, and finally apply multiple fully-connected layers to compute the
camera motion prediction.
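To make the aggregation concrete, the following Python/TensorFlow sketch shows one possible form of such a camera motion head; the layer widths, the fixed resize resolution and the output size are illustrative assumptions rather than the exact values used in the thesis.

\begin{verbatim}
import tensorflow as tf

def camera_motion_head(c5_features, fixed_size=(12, 40), num_outputs=7):
    # Assumed sketch: 1x1 convolution to reduce the number of channels
    # before the fully-connected layers.
    x = tf.layers.conv2d(c5_features, filters=128, kernel_size=1,
                         activation=tf.nn.relu)
    # Bilinear resizing to a fixed spatial size instead of global averaging,
    # so that coarse spatial information is preserved.
    x = tf.image.resize_bilinear(x, fixed_size)
    # Flatten and apply multiple fully-connected layers.
    x = tf.layers.flatten(x)
    x = tf.layers.dense(x, 1024, activation=tf.nn.relu)
    x = tf.layers.dense(x, 1024, activation=tf.nn.relu)
    # Final linear layer regressing the camera motion parameters
    # (e.g. rotation, translation and a moving/still logit).
    return tf.layers.dense(x, num_outputs)
\end{verbatim}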
@@ -217,19 +217,13 @@ camera motion prediction.
In both of our network variants
(Tables \ref{table:motionrcnn_resnet} and \ref{table:motionrcnn_resnet_fpn}),
we compute the fully-connected network for motion prediction from the
convolutional mask features, branching off right before the mask upsampling
deconvolution. The intuition behind this is that the final mask features contain
high-resolution spatial information about which positions belong to the object and
which belong to the background. Thus, we allow the motion estimation network to
make use of this data and ideally integrate the motion (image matching) information
localized within the object, but not that belonging to the background,
into the final object motion estimate.

flattened RoI features, which are also the basis for classification and
bounding box refinement.
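As a rough illustration of the branching point, a per-RoI motion head attached to the mask features could look as follows in Python/TensorFlow; the layer widths and the per-class output parametrization are assumptions for illustration only.

\begin{verbatim}
import tensorflow as tf

def roi_motion_head(mask_features, num_classes, num_motion_params=9):
    # mask_features: [num_rois, m, m, channels], taken right before the
    # mask upsampling deconvolution. num_motion_params is a placeholder
    # for the per-class rotation, translation, pivot and moving/still outputs.
    x = tf.layers.flatten(mask_features)
    x = tf.layers.dense(x, 1024, activation=tf.nn.relu)
    x = tf.layers.dense(x, 1024, activation=tf.nn.relu)
    # One motion prediction per class, analogous to per-class boxes and masks.
    x = tf.layers.dense(x, num_classes * num_motion_params)
    return tf.reshape(x, [-1, num_classes, num_motion_params])
\end{verbatim}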

\subsection{Supervision}
\label{ssec:supervision}

\paragraph{Per-RoI supervision with 3D motion ground truth}
\paragraph{Per-RoI instance motion supervision with 3D instance motion ground truth}
The most straightforward way to supervise the object motions is by using ground truth
motions computed from ground truth object poses, which is in general
only practical when training on synthetic datasets.

@@ -284,14 +278,16 @@ If the ground truth shows that the camera is not moving, we again do not
penalize rotation and translation. For the camera, the loss is reduced to the
classification term in this case.
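A hedged sketch of a motion loss with such a moving/still classification term is given below; the sigmoid cross-entropy and the plain $\ell_1$ regression terms are plausible placeholders, not necessarily the exact terms used in the thesis.

\begin{verbatim}
import tensorflow as tf

def motion_loss(pred_moving_logit, pred_rot, pred_trans,
                gt_moving, gt_rot, gt_trans):
    # gt_moving: float tensor in {0, 1}; shapes are [batch, ...].
    cls_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=gt_moving, logits=pred_moving_logit)
    rot_loss = tf.reduce_sum(tf.abs(pred_rot - gt_rot), axis=-1)
    trans_loss = tf.reduce_sum(tf.abs(pred_trans - gt_trans), axis=-1)
    # If the ground truth says "not moving", only the classification
    # term remains, as described above.
    reg_loss = gt_moving * (rot_loss + trans_loss)
    return tf.reduce_mean(cls_loss + reg_loss)
\end{verbatim}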

\paragraph{Per-RoI supervision \emph{without} 3D motion ground truth}
\paragraph{Per-RoI instance motion supervision \emph{without} 3D instance motion ground truth}
A more general way to supervise the object motions is a re-projection
loss similar to the unsupervised loss in SfM-Net \cite{SfmNet},
which we can apply to coordinates within the object bounding boxes,
and which does not require ground truth 3D object motions.

In this case, for any RoI, we generate a uniform 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask. We use the same bounding box
In this case, for any RoI,
we generate a uniform $m \times m$ 2D grid of points inside the RPN proposal bounding box
with the same resolution as the predicted mask.
We use the same bounding box
to crop the corresponding region from the dense, full image depth map
and bilinearly resize the depth crop to the same resolution as the mask and point
grid.
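To illustrate this construction, the following sketch builds the $m \times m$ point grid and the matching depth crop with TensorFlow's crop-and-resize operation; the use of normalized box coordinates here is an assumption.

\begin{verbatim}
import tensorflow as tf

def roi_grid_and_depth(depth_map, boxes, m):
    # depth_map: [1, H, W, 1] dense depth of the first frame.
    # boxes: [num_rois, 4] as (y1, x1, y2, x2) in normalized [0, 1] coordinates.
    num_rois = tf.shape(boxes)[0]
    y1, x1, y2, x2 = tf.unstack(boxes, axis=1)
    # Uniform m x m grid of (x, y) coordinates inside each box.
    lin = tf.linspace(0.0, 1.0, m)
    gy, gx = tf.meshgrid(lin, lin, indexing='ij')
    grid_y = y1[:, None, None] + gy[None] * (y2 - y1)[:, None, None]
    grid_x = x1[:, None, None] + gx[None] * (x2 - x1)[:, None, None]
    # Depth crop at the same m x m resolution, bilinearly resized.
    depth_crops = tf.image.crop_and_resize(
        depth_map, boxes, box_ind=tf.zeros([num_rois], dtype=tf.int32),
        crop_size=[m, m])
    return grid_x, grid_y, depth_crops[..., 0]
\end{verbatim}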
@@ -301,11 +297,18 @@ apply the RoI's predicted motion, masked by the predicted mask.
Then, we apply the camera motion to the points, project them back to 2D
and finally compute the optical flow at each point as the difference of the initial and re-projected 2D grids.
Note that we batch this computation over all RoIs, so that we only perform
it once per forward pass. The mathematical details are analogous to the
dense, full image flow computation in the following subsection and will not
be repeated here. \todo{probably better to add the mathematical details, as it may otherwise be confusing at some points}
it once per forward pass. Figure \ref{figure:flow_loss} illustrates the approach.
The mathematical details for the 3D transformations and mappings between 2D and 3D are analogous to the
dense, full image flow composition in the following subsection, so we will not
include them here. The only differences are that there is no sum over objects during
the point transformation based on instance motion, as we consider the single object
corresponding to an RoI in isolation, and that the masks are not resized to the
full image resolution, as
the depth crops and 2D point grid are at the same resolution as the predicted
$m \times m$ mask.

For each RoI, we can now penalize the optical flow grid to supervise the object motion.
For each RoI, we can now compute $L_{RoI}$ and thus supervise the object motion
by penalizing the $m \times m$ optical flow grid.
If there is optical flow ground truth available, we can use the RoI bounding box to
crop and resize a region from the ground truth optical flow to match the RoI's
optical flow grid and penalize the difference between the flow grids with an $\ell_1$-loss.
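A minimal numpy sketch of this per-RoI flow loss for a single RoI is shown below, using a simple pinhole model with focal length $f$ and principal point $(c_x, c_y)$; the exact parametrization and masking details are assumptions.

\begin{verbatim}
import numpy as np

def roi_flow_loss(grid_x, grid_y, depth, mask, R, t, pivot,
                  R_cam, t_cam, f, cx, cy, gt_flow):
    # grid_x, grid_y, depth, mask: [m, m]; (R, t, pivot): instance motion;
    # (R_cam, t_cam): camera motion; gt_flow: [m, m, 2] cropped ground truth.
    # Back-project the 2D grid to 3D camera coordinates.
    X = (grid_x - cx) * depth / f
    Y = (grid_y - cy) * depth / f
    P = np.stack([X, Y, depth], axis=-1)
    # Apply the instance motion, weighted by the predicted mask.
    moved = (P - pivot) @ R.T + pivot + t
    P = P + mask[..., None] * (moved - P)
    # Apply the camera motion to all points.
    P = P @ R_cam.T + t_cam
    # Project back to 2D and take the difference to the original grid.
    x_proj = f * P[..., 0] / P[..., 2] + cx
    y_proj = f * P[..., 1] / P[..., 2] + cy
    flow = np.stack([x_proj - grid_x, y_proj - grid_y], axis=-1)
    return np.mean(np.abs(flow - gt_flow))
\end{verbatim}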
@@ -336,6 +339,17 @@ and sample proposals and RoIs in the exact same way.
During inference, we proceed analogously to Mask R-CNN.
In the same way as the RoI mask head, at test time, we compute the RoI motion head
from the features extracted with refined bounding boxes.
Additionally, we use the \emph{predicted} binarized masks for each RoI to mask the
extracted RoI features before passing them into the motion head.
The intuition behind this is that we want to mask out (set to zero) any positions in the
extracted feature window that belong to the background. Then, the RoI motion
head aggregates the motion (image matching) information from the backbone
over positions localized within the object only, but not over positions belonging
to the background, which should not influence the final object motion estimate.

Again, as for masks and bounding boxes in Mask R-CNN,
the predicted output object motions are the predicted object motions for the
highest scoring class.
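A short sketch of the inference-time feature masking; the binarization threshold of 0.5 is an assumption.

\begin{verbatim}
import tensorflow as tf

def mask_roi_features(roi_features, predicted_masks, threshold=0.5):
    # roi_features: [num_rois, h, w, c];
    # predicted_masks: [num_rois, h, w] with values in [0, 1].
    # Zero out feature positions the predicted mask assigns to background.
    binary = tf.cast(predicted_masks > threshold, roi_features.dtype)
    return roi_features * binary[..., None]
\end{verbatim}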

\subsection{Dense flow from motion}
\label{ssec:postprocessing}
@@ -360,17 +374,21 @@ For now, the depth map is always assumed to come from ground truth.
Given $k$ detections with predicted motions as above, we transform all points within the bounding
box of a detected object according to the predicted motion of the object.

We first define the \emph{full image} mask $m_t^k$ for object $k$,
which can be computed from the predicted box mask $m_k^b$ by bilinearly resizing
$m_k^b$ to the width and height of the predicted bounding box and then copying the values
of the resized mask into a full resolution all-zeros map, starting at the top-right coordinate of the predicted bounding box.
Then,
We first define the \emph{full image} mask $M_t^k$ for object $k$,
which can be computed from the predicted box mask $m_t^k$ by bilinearly resizing
$m_t^k$ to the width and height of the predicted bounding box and then copying the values
of the resized mask into a full resolution mask initialized with zeros,
starting at the top-left coordinate of the predicted bounding box.
Then, given the predicted motions $(R_t^k, t_t^k)$ as well as $p_t^k$ for all objects,
\begin{equation}
P'_{t+1} =
P_t + \sum_1^{k} m_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
P_t + \sum_1^{k} M_t^k\left\{ R_t^k \cdot (P_t - p_t^k) + p_t^k + t_t^k - P_t \right\}
\end{equation}
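The composition above can be written compactly as a numpy sketch operating on the full-resolution 3D point map; array shapes and names are illustrative.

\begin{verbatim}
import numpy as np

def compose_points(P_t, masks, rotations, translations, pivots):
    # P_t: [H, W, 3] 3D points at time t; masks: [K, H, W] full image masks;
    # rotations: [K, 3, 3]; translations, pivots: [K, 3].
    P_next = P_t.copy()
    for M, R, t, p in zip(masks, rotations, translations, pivots):
        moved = (P_t - p) @ R.T + p + t
        P_next += M[..., None] * (moved - P_t)
    return P_next
\end{verbatim}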
These motion predictions are understood to have already taken into account
the classification into moving and still objects,
and we thus, as described above, have identity motions for all objects with $o_t^k = 0$.

Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$, % TODO introduce!
Next, we transform all points given the camera transformation $\{R_t^c, t_t^c\} \in \mathbf{SE}(3)$,

\begin{equation}
\begin{pmatrix}
@@ -380,8 +398,8 @@ X_{t+1} \\ Y_{t+1} \\ Z_{t+1}
\end{equation}.
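For reference, applied to the object-transformed points $P'_{t+1}$, this camera transformation presumably takes a form along the lines of the following sketch (assumed notation, not verbatim from the thesis),
\begin{equation}
\begin{pmatrix} X_{t+1} \\ Y_{t+1} \\ Z_{t+1} \end{pmatrix}
= R_t^c \cdot P'_{t+1} + t_t^c.
\end{equation}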

Note that in our experiments, we either use the ground truth camera motion to focus
on the object motion predictions or the predicted camera motion to predict complete
motion. We will always state which variant we use in the experimental section.
on evaluating the object motion predictions or the predicted camera motion to evaluate
the complete motion estimates. We will always state which variant we use in the experimental section.

Finally, we project the transformed 3D points at time $t+1$ to pixel coordinates again,
\begin{equation}

@@ -364,7 +364,7 @@ which has a stride of $4$ with respect to the input image.
Most importantly, the RoI features can now be extracted at the pyramid level $P_j$ appropriate for a
RoI bounding box with size $h \times w$,
\begin{equation}
j = \log_2(\sqrt{w \cdot h} / 224). %TODO complete
j = \log_2(\sqrt{w \cdot h} / 224). \todo{complete}
\label{eq:level_assignment}
\end{equation}
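For reference, the level assignment rule in the original FPN formulation, which the equation above presumably abbreviates, reads
\begin{equation}
j = \left\lfloor j_0 + \log_2\left(\sqrt{w \cdot h} / 224\right) \right\rfloor,
\end{equation}
with $j_0 = 4$ as the level onto which an RoI of size $224^2$ is mapped.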

@@ -613,6 +613,9 @@ with a maximum IoU of 0.7.
Then, the mask head is applied to the 100 highest scoring (after NMS) refined boxes,
after again extracting the corresponding features.
Thus, during inference, the features for the mask head are extracted using the refined
bounding boxes, instead of the RPN bounding boxes. This is important for not
bounding boxes for the predicted class, instead of the RPN bounding boxes. This is important for not
introducing any misalignment, as we want to create the instance mask inside the
more precise, refined detection bounding boxes.
Furthermore, note that bounding box and mask predictions for all classes but the predicted
class (the highest scoring class) are discarded, and thus the output bounding
box and mask correspond to the highest scoring class.

bib.bib (18 changed lines)
@@ -249,3 +249,21 @@
title = {Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification},
booktitle = {ICCV},
year = {2015}}

@inproceedings{UnFlow,
author = {Simon Meister and Junhwa Hur and Stefan Roth},
title = {UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss},
booktitle = {AAAI},
year = {2018}}

@inproceedings{UnsupDepth,
title = {Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue},
author = {Ravi Garg and BG Vijay Kumar and Gustavo Carneiro and Ian Reid},
booktitle = {ECCV},
year = {2016}}

@inproceedings{UnsupFlownet,
title = {Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness},
author = {Jason J. Yu and Adam W. Harley and Konstantinos G. Derpanis},
booktitle = {ECCV Workshop on Brave New Ideas for Motion Representations in Videos},
year = {2016}}

@@ -7,9 +7,9 @@ In addition to instance motions, our network estimates the 3D motion of the came
We combine all these estimates to yield a dense optical flow output from our
end-to-end deep network.
Our model is trained on the synthetic Virtual KITTI dataset, which provides
us with all required ground truth data.
us with all required ground truth data, and evaluated on the same domain.
During inference, our model does not add any significant computational overhead
over the latest iterations of R-CNNs and is therefore just as fast and interesting
over the latest iterations of R-CNNs (Faster R-CNN, Mask R-CNN) and is therefore just as fast and interesting
for real-time scenarios.
We thus presented a step towards real-time 3D motion estimation based on a
physically sound scene decomposition. Thanks to instance-level reasoning, in contrast
@@ -18,6 +18,19 @@ of our network is highly interpretable, which may also bring benefits for safety
applications.

\subsection{Future Work}
\paragraph{Evaluation and finetuning on KITTI 2015}
Thus far, we have evaluated our model on a subset of the Virtual KITTI dataset
on which we do not train, but we have yet to evaluate on a real-world dataset.
The best candidate to evaluate our complete model is the KITTI 2015 dataset \cite{KITTI2015},
which provides depth ground truth to compose an optical flow field from our 3D motion estimates,
and optical flow ground truth to evaluate the composed flow field.
Note that with our current model, we can only evaluate on the \emph{train} set
of KITTI 2015, as there is no public depth ground truth for the \emph{test} set.

As KITTI 2015 also provides object masks for moving objects, we could in principle
fine-tune on KITTI 2015 train alone. As long as we cannot evaluate our method on the
KITTI 2015 test set, however, this makes little sense.

\paragraph{Predicting depth}
In this work, we focused on motion estimation when RGB-D frames with dense depth are available.
However, in many application settings, we are not provided with any depth information.
@@ -26,15 +39,23 @@ from which no depth data is available.
To do so, we could integrate depth prediction into our network by branching off a
depth network from the backbone in parallel to the RPN (Table \ref{table:motionrcnn_resnet_fpn_depth}).
Alternatively, we could add a specialized network for end-to-end depth regression
in parallel to the region-based network, e.g. \cite{GCNet}.
in parallel to the region-based network (or before, to provide XYZ input to the R-CNN), e.g. \cite{GCNet}.
Although single-frame monocular depth prediction with deep networks has already been done
with some success,
our two-frame input should allow the network to make use of epipolar
geometry for making a more reliable depth estimate, at least when the camera
is moving. We could also easily extend our method to stereo input data by concatenating
all of the frames into the input image, which
would however require using a different dataset for training, as Virtual KITTI does not
all of the frames into the input image.
In case we choose the option of integrating the depth prediction directly into
the R-CNN,
this would however require using a different dataset for training it, as Virtual KITTI does not
provide stereo images.
If we were to use a specialized depth network, we could use stereo data
for depth prediction and still train the R-CNN independently on the monocular Virtual KITTI,
though we would lose the ability to easily train the system in an end-to-end manner.

As soon as we can predict depth, we can evaluate our model on the KITTI 2015 test set,
and also fine-tune on the training set as mentioned in the previous paragraph.

{
\begin{table}[h]
@@ -45,7 +66,7 @@ provide stereo images.
\midrule\midrule
& input image & H $\times$ W $\times$ C \\
\midrule
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 1024 \\
C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfrac{1}{32}$ W $\times$ 2048 \\
\midrule
\multicolumn{3}{c}{\textbf{RPN \& FPN} (Table \ref{table:maskrcnn_resnet_fpn})} \\
\midrule
@@ -64,7 +85,7 @@ C$_5$ & ResNet-50 (Table \ref{table:resnet}) & $\tfrac{1}{32}$ H $\times$ $\tfra
\end{tabular}

\caption{
Preliminary Motion R-CNN ResNet-50-FPN architecture with depth prediction,
A possible Motion R-CNN ResNet-50-FPN architecture with depth prediction,
based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_resnet_fpn}).
}
\label{table:motionrcnn_resnet_fpn_depth}
@@ -74,16 +95,41 @@ based on the Mask R-CNN ResNet-50-FPN architecture (Table \ref{table:maskrcnn_re
Due to the amount of supervision required by the different components of the network
and the complexity of the optimization problem,
we trained Motion R-CNN on the simple synthetic Virtual KITTI dataset for now.
A next step will be training on a more realistic dataset.
A next step will be training on a more realistic dataset,
ideally without having to rely on synthetic data at all.
For this, we can first pre-train the RPN on an instance segmentation dataset like
Cityscapes \cite{Cityscapes}. As soon as the RPN works reliably, we could execute alternating
steps of training on, for example, Cityscapes and the KITTI stereo and optical flow datasets.
On KITTI stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses (and depth prediction if added), as no instance segmentation ground truth exists.
steps of training on, for example, Cityscapes and the KITTI 2015 stereo and optical flow datasets.
On KITTI 2015 stereo and flow, we could run the instance segmentation component in testing mode and only penalize
the motion losses (and depth prediction, if added), as no complete instance segmentation ground truth exists.
On Cityscapes, we could continue training the instance segmentation components to
improve detection and masks and avoid forgetting instance segmentation.
As an alternative to this training scheme, we could investigate training on a pure
instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth) prediction.
instance segmentation dataset with unsupervised warping-based proxy losses for the motion (and depth)
prediction. Unsupervised deep learning of this kind has already been done with some success in the optical flow
setting \cite{UnsupFlownet, UnFlow},
and was recently also applied to monocular depth networks trained on the KITTI dataset \cite{UnsupDepth}.

\paragraph{Supervising the camera motion without 3D camera motion ground truth}
We already described an optical flow based loss for supervising instance motions
when we do not have 3D instance motion ground truth, or when we do not have
any motion ground truth at all.
However, it would also be useful to train our model without access to 3D camera
motion ground truth.
The 3D camera motion will already be indirectly supervised when it is used in the flow-based
RoI instance motion loss. Still, to use all available information from
ground truth optical flow and obtain more accurate supervision,
it would likely be beneficial to add a global, flow-based camera motion loss
independent of the RoI supervision.
To do this, one could use a re-projection loss conceptually identical to the one
for supervising instance motions with ground truth flow. However, to adjust for the
fact that the camera motion can only be accurately supervised with flow at positions where
no object motion occurs, this loss would have to be masked with the ground truth
object masks. Again, we could use this flow-based loss in an unsupervised way.
For training on a dataset without any motion ground truth, e.g.
Cityscapes, it may be critical to add this term in addition to an unsupervised
loss for the instance motions.
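A brief numpy sketch of such a background-masked camera flow loss; the camera-induced flow is assumed to be composed beforehand via the re-projection described above.

\begin{verbatim}
import numpy as np

def masked_camera_flow_loss(flow_cam, flow_gt, object_masks):
    # flow_cam, flow_gt: [H, W, 2]; object_masks: [K, H, W] ground truth masks.
    # Restrict the l1 flow error to background pixels, i.e. positions
    # not covered by any ground truth object.
    background = 1.0 - np.clip(object_masks.sum(axis=0), 0.0, 1.0)
    diff = np.abs(flow_cam - flow_gt).sum(axis=-1)
    return (background * diff).sum() / np.maximum(background.sum(), 1.0)
\end{verbatim}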

\paragraph{Temporal consistency}
A next step after the two aforementioned ones could be to extend our network to exploit more than two
@@ -92,3 +138,16 @@ context of energy-minimization based scene flow \cite{TemporalSF}.
In fact, by incorporating recurrent neural networks, e.g. LSTMs \cite{LSTM},
into our architecture, we could enable temporally consistent motion estimation
from image sequences of arbitrary length.

\paragraph{Deeper networks for larger bottleneck strides}
Our current ResNet C$_5$ bottleneck has a stride of 32 with respect to the
input image resolution. In FlowNetS \cite{FlowNet}, this bottleneck stride was 64.
For accurately estimating the motion of objects with large displacements between
the two frames, it might be useful to increase the maximum bottleneck stride in our backbone network.
We could do this easily in both of our network variants by adding one or more additional
ResNet blocks. In the variant without FPN, these blocks would have to be placed
after RoI feature extraction. In the FPN variant, the blocks could simply be
added after the encoder C$_5$ bottleneck.
To save memory, we could however also consider modifying the underlying
ResNet-50 architecture and increasing the number of blocks, but reducing the number
of layers in each block.

@@ -5,11 +5,11 @@ computations. To make our code easy to extend and flexible, we build on
the TensorFlow Object Detection API \cite{TensorFlowObjectDetection}, which provides a Faster R-CNN baseline
implementation.
On top of this, we implemented Mask R-CNN and the Feature Pyramid Network (FPN)
as well as extensions for motion estimation and related evaluations
as well as the Motion R-CNN extensions for motion estimation and related evaluations
and postprocessing. In addition, we generated all ground truth for
Motion R-CNN in the form of TFRecords from the raw Virtual KITTI
data to enable fast loading during training.
Note that for RoI extraction and cropping operations,
Note that for RoI extraction and bilinear crop and resize operations,
we use the \texttt{tf.crop\_and\_resize} TensorFlow function with
interpolation set to bilinear.
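For reference, a minimal usage sketch of the underlying operation, assuming it refers to \texttt{tf.image.crop\_and\_resize}; the shapes and box values are illustrative.

\begin{verbatim}
import tensorflow as tf

features = tf.placeholder(tf.float32, [1, 48, 156, 256])  # backbone feature map
boxes = tf.constant([[0.1, 0.2, 0.5, 0.6]])               # normalized (y1, x1, y2, x2)
box_ind = tf.constant([0])                                 # batch index per box
roi_features = tf.image.crop_and_resize(
    features, boxes, box_ind, crop_size=[14, 14], method='bilinear')
\end{verbatim}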

@@ -147,8 +147,14 @@ fn = \sum_k [o^{k,c_k} = 0 \land o^{gt,i_k} = 1].
Analogously, we define error metrics $E_{R}^{cam}$ and $E_{t}^{cam}$ for
predicted camera motions.

\subsection{Training Setup}
\subsection{Virtual KITTI training setup}
\label{ssec:setup}

For our initial experiments, we concatenate both RGB frames as
well as the XYZ coordinates for both frames as input to the networks.
We train both the Motion R-CNN ResNet-50 and ResNet-50-FPN variants.

\paragraph{Training schedule}
Our training schedule is similar to the Mask R-CNN Cityscapes schedule \cite{MaskRCNN}.
We train on a single Titan X (Pascal) for a total of 192K iterations on the
Virtual KITTI training set.
@@ -172,8 +178,7 @@ Note that a larger weight prevented the
angle sine estimates from properly converging to the very small values they
are in general expected to output.

\subsection{Experiments on Virtual KITTI}
\subsection{Virtual KITTI evaluation}
\label{ssec:vkitti}

\begin{figure}[t]
@@ -227,7 +232,8 @@ only impacted by the predicted 3D object motions.
\label{table:vkitti}
\end{table}
}
Figure \ref{figure:vkitti} visualizes instance segmentation and optical flow

In Figure \ref{figure:vkitti}, we visualize instance segmentation and optical flow
results on the Virtual KITTI validation set.
Table \ref{table:vkitti} compares the performance of different network variants on the Virtual KITTI validation
set.
In Table \ref{table:vkitti}, we compare the performance of different network variants
on the Virtual KITTI validation set.