Monocular reconstruction from image features abstracts the problem, which reduces its complexity and allows it to be tackled in real time. However, this abstraction introduces two significant drawbacks.
Dense SLAM methods take advantage of all the available image information by working directly on the images for both mapping and tracking. The world is modeled as a dense surface while new frames are tracked using whole-image alignment. These methods increase tracking accuracy and robustness, but rely on powerful GPUs to run in real time. The same approach can be combined with RGB-D cameras (which provide a depth for each pixel) or with stereo camera rigs to simplify the problem.
A semi-dense depth map is a depth map that does not contain a depth estimate for every pixel of the image, but only for a subset of them, namely those lying in regions with sufficient image gradient.
The depth map is propagated from frame to frame and refined with new stereo depth measurements. Depth is computed by performing per-pixel, adaptive-baseline stereo comparisons, allowing accurate depth estimation for both close-by and far-away image regions. One inverse depth hypothesis is maintained per pixel, represented as a Gaussian probability distribution.
In stereo matching there is always a trade-off between precision and accuracy. A first approach consists of accumulating the respective cost functions over many frames. Engel et al., however, introduce a probabilistic approach that takes advantage of the fact that in a video, small-baseline frames are available before large-baseline frames [@engel2013slam].
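As an illustration, each refinement can be viewed as multiplying two Gaussians over inverse depth: the current hypothesis and a new stereo observation. The sketch below is a minimal reading of this update; the function name `fuse` and the numbers in the usage example are ours, and in practice the observation variance would come from the stereo error model.

```python
def fuse(prior_d: float, prior_var: float, obs_d: float, obs_var: float):
    """Fuse a propagated inverse depth prior N(prior_d, prior_var) with a
    new stereo observation N(obs_d, obs_var). The product of the two
    Gaussians is again Gaussian, with strictly smaller variance."""
    var = prior_var * obs_var / (prior_var + obs_var)
    d = (obs_var * prior_d + prior_var * obs_d) / (prior_var + obs_var)
    return d, var

# Small-baseline frames arrive first and give imprecise but unambiguous
# observations; fusing them narrows the search range for the precise,
# large-baseline observations that follow.
d, var = fuse(prior_d=0.50, prior_var=0.04, obs_d=0.55, obs_var=0.01)
print(d, var)  # the estimate is pulled toward the more certain observation
```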
The depth map is updated as follows:
A subset of pixels is selected for which the accuracy of a disparity search is sufficiently high. To this end, three efficient local criteria determine for which pixels a stereo update is worth the computational cost.
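As a rough sketch of this selection step: a pixel is worth a stereo update only where the disparity search can be accurate, which in the simplest reading means a sufficiently strong image gradient. The gradient-only test and the threshold below are simplifications of ours, not the paper's three criteria.

```python
import numpy as np

def select_update_candidates(image: np.ndarray, grad_thresh: float = 5.0) -> np.ndarray:
    """Boolean mask of pixels worth a stereo update: keep only pixels
    whose local image gradient is strong enough for the disparity
    search to be sufficiently accurate."""
    gy, gx = np.gradient(image.astype(np.float64))
    grad_mag = np.hypot(gx, gy)
    return grad_mag > grad_thresh
```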
For each selected pixel, a suitable reference frame is chosen in which to perform a one-dimensional disparity search, according to the pixel's age: starting from the oldest frame in which the pixel was observed, the first frame is picked for which the disparity search range and the observation angle do not exceed given thresholds.
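A possible implementation of this search, under assumed per-frame bookkeeping (the `search_range` and `angle` fields are hypothetical, not from the paper):

```python
from collections import namedtuple

# Hypothetical per-frame bookkeeping, relative to the current frame.
Frame = namedtuple("Frame", ["idx", "search_range", "angle"])

def select_reference_frame(frames, max_search_range, max_angle):
    """Return the oldest frame (frames ordered oldest-first) in which the
    pixel was observed and whose disparity search range and observation
    angle stay below the given thresholds; None if no frame qualifies."""
    for frame in frames:
        if frame.search_range <= max_search_range and frame.angle <= max_angle:
            return frame
    return None  # skip the stereo update for this pixel
```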
Once the camera position of the next frame has been estimated, the estimated inverse depth $d_0$ is propagated to that frame. The corresponding 3D point is calculated from $d_0$ and projected into the new frame, providing the initial inverse depth estimate $d_1$ in the new frame. The hypothesis is then assigned to the closest integer pixel position (maintaining the sub-pixel accurate location to avoid accumulating discretization errors). Assuming that the camera rotation is small, $d_1$ can be approximated by:

$$d_1(d_0) = \left(d_0^{-1} - t_z\right)^{-1}$$
where $t_z$ is the camera translation along the optical axis. The variance of $d_1$ is defined as:

$$\sigma_{d_1}^2 = J_{d_1} \sigma_{d_0}^2 J_{d_1}^\top + \sigma_p^2 = \left(\frac{d_1}{d_0}\right)^4 \sigma_{d_0}^2 + \sigma_p^2$$
where $\sigma_p^2$ is the prediction uncertainty (analogous to the prediction step of an extended Kalman filter). This can be seen as keeping the variance of the z-coordinate of a point fixed, i.e. $\sigma_{z_0}^2 = \sigma_{z_1}^2$.
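Putting the two formulas together, the propagation step might look as follows; `var_pred` stands for the prediction uncertainty $\sigma_p^2$:

```python
def propagate_hypothesis(d0: float, var0: float, t_z: float, var_pred: float):
    """Propagate an inverse depth hypothesis to the next frame under the
    small-rotation approximation: the point's depth decreases by the
    camera translation t_z along the optical axis."""
    d1 = 1.0 / (1.0 / d0 - t_z)
    # First-order error propagation: the Jacobian of d1 with respect to
    # d0 is (d1 / d0)^2, hence the fourth power on the variance, plus the
    # prediction uncertainty sigma_p^2 (var_pred).
    var1 = (d1 / d0) ** 4 * var0 + var_pred
    return d1, var1
```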
When two inverse depth hypotheses are propagated to the same pixel, there are two alternatives: if they are statistically similar, they are merged; otherwise, the hypothesis further away from the camera is removed, as it is likely occluded.
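A sketch of this case distinction, assuming hypotheses are (mean, variance) pairs and a $2\sigma$ compatibility test:

```python
import math

def resolve_collision(h_a, h_b, n_sigma: float = 2.0):
    """Two hypotheses (mean, variance) propagated onto the same pixel:
    merge if statistically similar, otherwise drop the farther one."""
    d_a, var_a = h_a
    d_b, var_b = h_b
    if abs(d_a - d_b) <= n_sigma * math.sqrt(var_a + var_b):
        # Statistically similar: fuse as a product of Gaussians.
        var = var_a * var_b / (var_a + var_b)
        d = (var_b * d_a + var_a * d_b) / (var_a + var_b)
        return d, var
    # Statistically different: the smaller inverse depth lies farther
    # from the camera and is likely occluded, so it is removed.
    return h_a if d_a > d_b else h_b
```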
A regularization iteration is performed for each frame, which computes each inverse depth value as the average of the surrounding inverse depths, weighted by their respective variances. When two adjacent inverse depth values are statistically different (further apart than $2\sigma$), they are not merged, in order to preserve sharp edges.
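A direct, if naive, implementation of one such iteration over a 4-neighborhood might look like this (the dense-array layout and loop structure are our choices; the paper does not prescribe one):

```python
import numpy as np

def regularize(d: np.ndarray, var: np.ndarray, valid: np.ndarray,
               n_sigma: float = 2.0) -> np.ndarray:
    """One regularization iteration: replace each inverse depth by the
    inverse-variance weighted average of its 4-neighborhood, excluding
    neighbors that differ by more than n_sigma standard deviations,
    which preserves sharp depth edges."""
    out = d.copy()
    h, w = d.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if not valid[y, x]:
                continue
            num, den = 0.0, 0.0
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if not valid[ny, nx]:
                    continue
                # Skip statistically different neighbors (sharp edges).
                if abs(d[ny, nx] - d[y, x]) > n_sigma * np.sqrt(var[ny, nx] + var[y, x]):
                    continue
                weight = 1.0 / var[ny, nx]  # inverse-variance weighting
                num += weight * d[ny, nx]
                den += weight
            if den > 0:
                out[y, x] = num / den
    return out
```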
The validity of each inverse depth hypothesis is represented by the probability that it is an outlier (e.g., due to occlusion or a moving object). This probability is decreased after each successful stereo observation, and increased after each failed stereo search, i.e., when the respective intensity changes significantly on propagation or when the absolute image gradient falls below a given threshold.
If the probability that all contributing neighbors are outliers rises above a given threshold, the hypothesis is removed. Equally, if this probability drops below another threshold for a pixel without a hypothesis, a new one is created from the neighbors. This fills holes arising from the forward-warping nature of the propagation step and dilates the depth map to a small neighborhood around sharp edges, which significantly increases tracking and mapping robustness.
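This bookkeeping can be sketched as a bounded per-hypothesis score; the additive step size below is an assumption, as the paper does not fix a particular update rule here:

```python
def update_outlier_probability(p: float, success: bool, step: float = 0.05) -> float:
    """Decrease the outlier probability after a successful stereo
    observation, increase it after a failed search; clamp to [0, 1]."""
    p = p - step if success else p + step
    return min(max(p, 0.0), 1.0)
```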
The semi-dense inverse depth map for the current camera image can be used to estimate the camera pose of the next frame. Dense tracking is performed using dense image alignment [REF], based on the direct minimization of the photometric error:

$$r_i(\xi) := \left( I_2\big(w(x_i, d_i, \xi)\big) - I_1(x_i) \right)^2$$
where the warp function

$$w: \Omega_1 \times \mathbb{R} \times \mathbb{R}^6 \rightarrow \Omega_2$$
maps each point $x_i \in \Omega_1$ in the reference image $I_1$ to the respective point $w(x_i, d_i, \xi) \in \Omega_2$ in the new image $I_2$. As input it requires only the 3D pose of the camera $\xi \in \mathbb{R}^6$ and uses the estimated inverse depth $d_i \in \mathbb{R}$ of the pixel in $I_1$. The final energy term to minimize is:

$$E(\xi) := \sum_i \alpha\big(r_i(\xi)\big)$$
where $\alpha: \mathbb{R} \rightarrow \mathbb{R}$ weights each residual.
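To make the notation concrete, the following sketch evaluates $E(\xi)$ for a given pose; the `warp` function (camera intrinsics, $\mathfrak{se}(3)$ exponential) and sub-pixel sampling are assumed rather than implemented, and in practice the energy is minimized over $\xi$ with iteratively reweighted Gauss-Newton rather than merely evaluated.

```python
def photometric_energy(I1, I2, points, inv_depths, xi, warp,
                       alpha=lambda r: r):
    """Evaluate E(xi) = sum_i alpha(r_i(xi)) with
    r_i(xi) = (I2(w(x_i, d_i, xi)) - I1(x_i))^2."""
    E = 0.0
    for (u1, v1), d in zip(points, inv_depths):
        u2, v2 = warp((u1, v1), d, xi)
        # Nearest-neighbor lookup for brevity; a real implementation
        # samples I2 bilinearly at the sub-pixel warped location.
        r = float(I2[int(round(v2)), int(round(u2))]) - float(I1[v1, u1])
        E += alpha(r * r)
    return E
```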