\begin{figure}[t] \centering \includegraphics[width=\textwidth]{figures/teaser} \caption{ Given two temporally consecutive frames, our network segments the pixels of the first frame into individual objects and estimates their 3D locations as well as all 3D object motions between the frames. } \label{figure:teaser} \end{figure} \subsection{Motivation} For an agent moving in the real world, it is often desirable to know which objects exist in its proximity, where they are located relative to the agent, and where they will be at some point in the near future. In many cases, it would be preferable to infer such information from video data, if technically feasible, as camera sensors are cheap and ubiquitous (compared to, for example, Lidar).

As an example, consider the autonomous driving problem. Here, it is crucial not only to know the position of each obstacle, but also whether and where the obstacle is moving, and to use sensors that will not make the system too expensive for widespread use. At the same time, the autonomous driving system has to operate in real time to react quickly enough for safely controlling the vehicle.

A promising approach to 3D scene understanding in situations such as autonomous driving is offered by deep neural networks, which have recently achieved breakthroughs in object detection, instance segmentation, and classification in still images, and are increasingly being applied to video data. A key benefit of deep networks is that they can, in principle, enable very fast inference on real-time video data and generalize over many training situations to resolve ambiguities inherent in image understanding and motion estimation. Thus, in this work, we aim to develop deep neural networks which can, given sequences of images, segment the image pixels into object instances and estimate the location and 3D motion of each object instance relative to the camera (Figure \ref{figure:teaser}).

\subsection{Technical goals} Recently, SfM-Net \cite{SfmNet} introduced an end-to-end deep learning approach for predicting dense depth and dense optical flow from monocular image sequences, based on estimating the 3D motion of individual objects and the camera. Using a standard encoder-decoder network for pixel-wise dense prediction, SfM-Net predicts a pre-determined number of binary masks spanning the complete image, with each mask specifying the membership of the image pixels to one object. A fully-connected network branching off the encoder then predicts a 3D motion for each object as well as the camera ego-motion. However, due to the fixed number of object masks, the system can in practice only predict a small number of motions, and it often fails to properly segment the pixels into the correct masks or assigns background pixels to object motions (Figure \ref{figure:sfmnet_kitti}). \begin{figure}[t] \centering \includegraphics[width=\textwidth]{figures/sfmnet_kitti} \caption{ Results of SfM-Net \cite{SfmNet} on KITTI \cite{KITTI2015}. From left to right, we show their instance segmentation into up to 3 independent objects, ground truth instance masks for the segmented objects, composed optical flow, and ground truth optical flow. Figure taken from \cite{SfmNet}. } \label{figure:sfmnet_kitti} \end{figure} Thus, due to the inflexible nature of their instance segmentation technique, their approach is very unlikely to scale to dynamic scenes with a potentially large number of diverse objects.
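To make the object-level decomposition underlying such an approach concrete, the dense optical flow can, schematically, be composed from per-object rigid motions, the depth, and the camera motion. The following is a simplified sketch in our own notation rather than the exact SfM-Net formulation:
\begin{align*}
\mathbf{X} &= d(\mathbf{x})\, K^{-1} \tilde{\mathbf{x}},\\
\mathbf{X}' &= \mathbf{X} + \sum_{k} m_k(\mathbf{x}) \left( R_k \left( \mathbf{X} - \mathbf{p}_k \right) + \mathbf{p}_k + \mathbf{t}_k - \mathbf{X} \right),\\
\mathbf{w}(\mathbf{x}) &= \pi\!\left( K \left( R_{\mathrm{cam}} \mathbf{X}' + \mathbf{t}_{\mathrm{cam}} \right) \right) - \mathbf{x},
\end{align*}
where $d(\mathbf{x})$ is the depth at pixel $\mathbf{x}$, $\tilde{\mathbf{x}}$ its homogeneous coordinate, $K$ the camera intrinsics, $m_k(\mathbf{x}) \in [0, 1]$ the mask value assigning the pixel to object $k$, $(R_k, \mathbf{t}_k)$ the rigid motion of object $k$ about its pivot $\mathbf{p}_k$, $(R_{\mathrm{cam}}, \mathbf{t}_{\mathrm{cam}})$ the camera ego-motion, and $\pi$ the perspective projection.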
Despite the shortcomings of SfM-Net, we think that this general idea of estimating object-level motion with end-to-end deep networks, instead of directly predicting a dense flow field as is common in current end-to-end deep learning approaches to motion estimation, may significantly benefit motion estimation by structuring the problem, introducing physical constraints, and reducing the dimensionality of the estimate.

In the context of still images, Mask R-CNN \cite{MaskRCNN} recently introduced a scalable approach to instance segmentation based on region-based convolutional networks. Mask R-CNN inherits from Faster R-CNN \cite{FasterRCNN} the ability to detect a large number of objects from a large number of classes at once, and predicts a pixel-precise segmentation mask for each detected object (Figure \ref{figure:maskrcnn_cs}). \begin{figure}[t] \centering \includegraphics[width=\textwidth]{figures/maskrcnn_cs} \caption{ Instance segmentation results of Mask R-CNN ResNet-50-FPN \cite{MaskRCNN} on Cityscapes \cite{Cityscapes}. Figure taken from \cite{MaskRCNN}. } \label{figure:maskrcnn_cs} \end{figure} Inspired by the accurate segmentation results of Mask R-CNN, we thus propose \emph{Motion R-CNN}, which combines the scalable instance segmentation capabilities of Mask R-CNN with the end-to-end, instance-level 3D motion estimation approach introduced with SfM-Net. To this end, we integrate 3D motion prediction for individual objects into the per-RoI Mask R-CNN head, in parallel to classification, bounding box refinement, and mask prediction. In this way, we predict for each RoI a single 3D rigid object motion together with the object pivot in camera space. As a foundation for image matching, we extend the ResNet \cite{ResNet} backbone of Mask R-CNN to take two concatenated images as input, similar to FlowNetS \cite{FlowNet}. This results in a fully integrated end-to-end network architecture for segmenting pixels into instances and estimating the motion of all detected instances, without any limitation on the number or variety of object instances (Figure \ref{figure:net_intro}). Eventually, we want to extend our method to include depth prediction, yielding the first end-to-end deep network to perform 3D scene flow estimation in a principled and scalable way by reasoning about individual objects. For now, to break the problem down into manageable pieces, we assume that RGB-D frames are given. \begin{figure}[t] \centering \includegraphics[width=\textwidth]{figures/net_intro} \caption{ Overview of our network based on Mask R-CNN \cite{MaskRCNN}. For each region of interest (RoI), we predict the 3D instance motion in parallel to the class, bounding box and mask. Additionally, we branch off a small network from the bottleneck for predicting the 3D camera ego-motion. Novel components in addition to Mask R-CNN are shown in red. } \label{figure:net_intro} \end{figure} \subsection{Related work} In the following, we refer to systems that use deep networks for all optimization and do not perform time-critical side computations (e.g., numerical optimization) at inference time as \emph{end-to-end} deep learning systems. \paragraph{Deep networks in optical flow estimation} End-to-end deep networks for optical flow were recently introduced based on encoder-decoder networks or CNN pyramids \cite{FlowNet, FlowNet2, SPyNet}, which pose optical flow as a generic (and homogeneous) pixel-wise estimation problem without making any assumptions about the regularity and structure of the estimated flow.
Specifically, such methods ignore that the optical flow varies across an image depending on the semantics of each region or pixel, which include whether a pixel belongs to the background, which object instance it belongs to if it does not, and the class of that object. Failure cases of these methods therefore often include motion boundaries or regions with little texture, where semantics become very important. Extensions of these approaches to scene flow estimate dense flow and dense depth with similarly generic networks \cite{SceneFlowDataset}, and share similar limitations. Other works make use of semantic segmentation to structure the optical flow estimation problem and introduce reasoning at the object level \cite{ESI, JOF, FlowLayers, MRFlow}, but still require expensive energy minimization for each new input, as CNNs are only used for some of the components and numerical optimization is central to their inference. In contrast, we tackle motion estimation at the instance level with end-to-end deep networks and derive optical flow from the individual object motions. \paragraph{Slanted plane methods for 3D scene flow} The slanted plane model for scene flow \cite{PRSF, PRSM} represents a 3D scene as being composed of planar segments. Pixels are assigned to one of the planar segments, each of which undergoes an independent 3D rigid motion. This model simplifies the motion estimation problem significantly by reducing the dimensionality of the estimate, and can thus lead to more accurate results than the direct estimation of a homogeneous motion field. In contrast to \cite{PRSF, PRSM}, the Object Scene Flow method \cite{KITTI2015} assigns each slanted plane to one rigidly moving object instance, thus reducing the number of independently moving segments by allowing multiple segments to share the motion of the object they belong to. In all of these methods, pixel assignment and motion estimation are formulated as an energy-minimization problem that is optimized for each input, without the use of (deep) learning. In a more recent approach termed Instance Scene Flow \cite{InstanceSceneFlow}, a CNN is used to compute 2D bounding boxes and instance masks for all objects in the scene, which are then combined with depth obtained from a non-learned stereo algorithm and used as pre-computed inputs to a slanted plane scene flow model based on \cite{KITTI2015}. Most likely due to its use of deep learning for instance segmentation and some other components, this approach outperforms the previous related scene flow methods on relevant public benchmarks \cite{KITTI2012, KITTI2015}. Still, the method uses an energy-minimization formulation for the scene flow estimation itself and takes minutes to compute a prediction. Interestingly, the slanted plane methods achieve the current state-of-the-art in scene flow \emph{and} optical flow estimation on the challenging KITTI benchmarks \cite{KITTI2012, KITTI2015}, outperforming end-to-end deep networks like \cite{SceneFlowDataset, FlowNet, FlowNet2}. However, the end-to-end deep networks are significantly faster than their energy-minimization counterparts, generally taking a fraction of a second instead of minutes per prediction, and can often be made to run in real time. This high runtime restricts the applicability of the current slanted plane models in practical settings, which often require estimates in (or close to) real time and for which an end-to-end approach based on learning would therefore be preferable.
Moreover, in other contexts, the move towards end-to-end deep learning has often led to significant benefits in terms of accuracy and speed. As an example, consider the evolution of region-based convolutional networks, which started out prohibitively slow, with a CNN as only one component in a larger pipeline, and became very fast and much more accurate over the course of their development into end-to-end deep networks. Thus, in the context of motion estimation, one may expect end-to-end deep learning to bring large improvements not only in speed, but also in accuracy, especially considering the inherent ambiguity of motion estimation and the ability of deep networks to learn to resolve such ambiguity from a large variety of training examples. However, as explained above, we think that the current end-to-end deep learning approaches to motion estimation are likely limited by a lack of spatial structure and regularity in their estimates, which stems from the generic nature of the employed networks. We therefore aim to combine the modelling benefits of rigid scene decompositions with the promise of end-to-end deep learning. \paragraph{End-to-end deep networks for 3D rigid motion estimation} End-to-end deep learning for predicting rigid 3D object motions was first introduced with SE3-Nets \cite{SE3Nets}, which take raw 3D point clouds as input and produce a segmentation of the points into objects together with a 3D motion for each object. Bringing this idea to the context of image sequences, SfM-Net \cite{SfmNet} takes two consecutive frames and estimates a segmentation of pixels into objects together with their 3D motions between the frames. In addition, SfM-Net predicts dense depth and camera ego-motion to obtain full 3D scene flow with end-to-end deep learning. For supervision, SfM-Net applies a brightness constancy proxy loss to the dense optical flow composed from all 3D motions and the depth estimate. Like SfM-Net, we aim to estimate 3D motion and instance segmentation jointly with end-to-end deep learning. Unlike SfM-Net, we build on a scalable object detection and instance segmentation approach based on R-CNNs, which provides us with a strong baseline for these tasks. \paragraph{End-to-end deep networks for camera pose estimation} Deep networks have been used for estimating the 6-DOF camera pose from a single RGB frame \cite{PoseNet, PoseNet2}, or for estimating depth and camera ego-motion from monocular video \cite{UnsupPoseDepth}. These works are related to ours in that we also need to predict rotations and translations with a deep network, and thus face similar regression problems and may be able to use similar parametrizations and losses. \subsection{Outline} First, in section \ref{sec:background}, we introduce preliminaries and building blocks from earlier works that serve as a foundation for our networks and losses. Most importantly, we review the ResNet CNN (\ref{ssec:resnet}), which will serve as our CNN backbone, as well as the developments in region-based CNNs that we build on (\ref{ssec:rcnn}), specifically Mask R-CNN and the Feature Pyramid Network (FPN) \cite{FPN}.
In section \ref{sec:approach}, we describe our technical contribution, starting with our motion estimation model and our modifications to the Mask R-CNN backbone and head networks (\ref{ssec:model}), followed by the losses and supervision methods for training the extended region-based CNN (\ref{ssec:supervision}), and finally the post-processing we use to derive dense flow from our 3D motion estimates (\ref{ssec:postprocessing}). Then, in section \ref{sec:experiments}, we introduce the Virtual KITTI dataset we use for training our networks as well as the preprocessing we perform (\ref{ssec:datasets}), give details of our experimental setup (\ref{ssec:setup}), and describe the experimental results on Virtual KITTI (\ref{ssec:vkitti}). Finally, in section \ref{sec:conclusion}, we summarize our work and describe future developments, including depth prediction, training on real-world data, and exploiting frames over longer time intervals.