Real Time 3D Reconstruction from Monocular Video

Duality for Recognition and Reconstruction

Duality in vector spaces

Given an $m$-dimensional real vector space $V$, the dual vector space $V^{*}$ is, by definition, the vector space of linear functions $\varphi: V \rightarrow \mathbb{R}$. If $\mathbf{e}_{1}, \ldots, \mathbf{e}_{m}$ is a basis of $V$, the dual basis is given by $\mathbf{f}_{1}, \ldots, \mathbf{f}_{m}$, where $\mathbf{f}_{i}(\mathbf{e}_{j}) = \delta_{ij}$ and $\delta_{ij}$ is the Kronecker delta, defined as

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$

The canonical basis of $V$ is given by the vectors $(0, \ldots, 1, \ldots, 0)$ with the $1$ in the $i$-th position, where $1 \leq i \leq m$. The same notation is used for the dual basis, although its meaning is different.
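As a minimal sketch of the dual basis in coordinates: if the basis vectors $\mathbf{e}_{j}$ are written as the columns of an invertible matrix $E$, then the coefficient rows of the dual forms $\mathbf{f}_{i}$ are the rows of $E^{-1}$, because $E^{-1}E = I$ is exactly the condition $\mathbf{f}_{i}(\mathbf{e}_{j}) = \delta_{ij}$. The helper names (`invert`, `dual_basis`, `apply_form`) and the sample basis are illustrative assumptions, and exact rational arithmetic is used only so that the check is exact.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination over the rationals."""
    n = len(matrix)
    # Augment the matrix with the identity: [E | I].
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Look for a row with a nonzero pivot in this column.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("the vectors do not form a basis")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Clear the column in every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def dual_basis(basis):
    """Return the dual forms f_i as coefficient rows, given basis vectors e_j."""
    n = len(basis)
    # Basis vectors become the columns of E; the rows of E^{-1} are the f_i.
    E = [[basis[j][i] for j in range(n)] for i in range(n)]
    return invert(E)


def apply_form(form, vector):
    """Evaluate the linear form with the given coefficients on a vector."""
    return sum(c * x for c, x in zip(form, vector))


basis = [(1, 1), (0, 2)]  # e_1 = (1, 1), e_2 = (0, 2), a hypothetical basis
forms = dual_basis(basis)
# pairing[i][j] = f_i(e_j), which should be the Kronecker delta.
pairing = [[apply_form(f, e) for e in basis] for f in forms]
```

For the canonical basis the matrix $E$ is the identity, so the dual forms have the same coordinates as the basis vectors, which is why the same notation can be used for both.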

If $V$ is a finite-dimensional vector space, then $V = (V^{*})^{*}$, i.e. the bidual is the original space (up to canonical isomorphism) and $V$ is self-dual$^{1}$.

1. In arbitrary (possibly infinite) dimension this result does not necessarily hold, but we can restrict ourselves to reflexive spaces, which are characterized by this self-duality condition.

The space of linear forms separates points according to the sign of each linear form $\varphi$$^{2}$.

2. This remark is a key criterion for Discriminant Analysis, and it is ubiquitous in separation problems.

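If $A: V \rightarrow V$ is a regular transformation of $V$ (e.g. an element of $GL(V)$), then composition with $\varphi \in V^{*} = \mathrm{Hom}(V, \mathbb{R})$ induces a dual map $$A^{*}: V^{*} \rightarrow V^{*}, \qquad A^{*}(\varphi) = \varphi \circ A,$$ whose matrix with respect to the dual basis is $M_{A^{*}} = A^{T}$; when $A$ is orthogonal, this matrix coincides with $A^{-1}$.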
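The relation between the dual map and the transpose can be checked numerically. The following is a small sketch, assuming that vectors and linear forms are both stored as coefficient lists with respect to a basis and its dual basis; the matrix, the form and the vector are arbitrary illustrative values. It verifies that the form with coefficients $A^{T}\varphi$ takes on $v$ the same value as $\varphi$ takes on $Av$.

```python
def mat_vec(A, v):
    """Multiply the matrix A (list of rows) by the column vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def transpose(A):
    """Return the transpose of the matrix A."""
    return [list(col) for col in zip(*A)]


def pair(phi, v):
    """Evaluate the linear form with coefficients phi on the vector v."""
    return sum(c * x for c, x in zip(phi, v))


A = [[2, 1], [0, 3]]   # a regular (invertible) transformation
phi = [1, -1]          # the linear form phi(x, y) = x - y
v = [4, 5]

# Coefficients of the pulled-back form phi o A in the dual basis.
pulled_back = mat_vec(transpose(A), phi)

lhs = pair(pulled_back, v)        # (A* phi)(v)
rhs = pair(phi, mat_vec(A, v))    # phi(A v)
```

Both sides evaluate the same number, so in these coordinates the dual map acts on coefficient vectors through the transposed matrix.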

Duality in fiber bundles

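Let $M$ be an $m$-dimensional real manifold; then its tangent space $T_{x}M$ at a point $x$ is generated by the values at $x$ of $m$ linearly independent vector fields. If $(x_{1}, \ldots, x_{m})$ is a local coordinate system centered at $x \in U \subset M$, where $U$ is an open neighborhood of $x$, then the dual space $T_{x}^{*}M$ is defined as the set of linear forms $\varphi: T_{x}M \rightarrow \mathbb{R}$, and it is called the cotangent space of $M$ at $x$. Often, it is denoted as $\Omega^1_{M,x}$. With the above notation,

$$\frac{\partial}{\partial x_{1}}, \ldots, \frac{\partial}{\partial x_{m}}$$

is the standard basis for $T_{x}M$, whereas

$$dx_{1}, \ldots, dx_{m}$$

is the standard basis for the dual space $T_{x}^{*}M$. In other words, the families

$$\left\{ \frac{\partial}{\partial x_{j}} \right\}_{1 \leq j \leq m} \quad \text{and} \quad \left\{ dx_{i} \right\}_{1 \leq i \leq m}, \qquad \text{with } dx_{i}\left( \frac{\partial}{\partial x_{j}} \right) = \delta_{ij},$$

are the above bases for the tangent and cotangent spaces.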

We can use the topological structure of $M$ to glue together all tangent spaces $T_{x}M$ and all cotangent spaces $T_{x}^{*}M$, and obtain the tangent fiber bundle $\tau_{M} = (TM, \pi_{\tau_{M}}, M)$ and the cotangent fiber bundle $\Omega^{1}_{M} = (T^{*}M, \pi_{\Omega_M}, M)$.

If $(M, ds^2)$ is a Riemannian manifold (a manifold equipped with a positive-definite metric), then the structural group $GL(m, \mathbb{R})$ of the tangent fiber bundle can be reduced to the orthogonal group $O(m)$; this is locally equivalent to Gram-Schmidt orthogonalization. Hence, the structural group of the cotangent bundle can also be reduced to the orthogonal group $O(m)$; in this case, the matrices representing coordinate changes in the cotangent bundle $\Omega^{1}_{M}$ are the transposes of the original matrices giving coordinate changes between fibers of the tangent bundle (for orthogonal matrices, the transpose coincides with the inverse).
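The local step behind this reduction can be sketched as follows: a frame of a single tangent space is replaced by an orthonormal frame, so that changes between such frames are orthogonal matrices. In this sketch the Euclidean dot product stands in for the metric $ds^2$ at one point, which is a simplifying assumption, and the function names and the sample frame are illustrative.

```python
import math


def dot(u, v):
    """Euclidean inner product, standing in for the metric at one point."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors (modified Gram-Schmidt)."""
    ortho = []
    for v in vectors:
        w = list(v)
        # Remove the components along the vectors already orthonormalized.
        for q in ortho:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("vectors are linearly dependent")
        ortho.append([wi / norm for wi in w])
    return ortho


frame = gram_schmidt([(1.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)])
# Gram matrix of the new frame; it should be the identity up to rounding.
gram = [[dot(p, q) for q in frame] for p in frame]
```

Since the Gram matrix of the resulting frame is the identity, the matrix whose columns are the frame vectors is orthogonal, and its transpose is its inverse.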

As a first conclusion, in the regular case there exists a perfect duality between the tangent and cotangent bundles, even when we reduce the structural group in the presence of a Riemannian metric.

Evaluation and criticisms

A comparison between two 3D point clouds recovers the Riemannian framework by introducing the Fisher metric on the density functions associated to the cloud distributions. This approach is idealized, since it presupposes complete and stable information. In the presence of incomplete and noisy information, jumps in the dimension of the feature vectors, and uncertainty about the matching between points (the "mise en correspondance"), it requires a larger framework to be applicable to realistic scenarios.
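To make the idealized comparison concrete, the sketch below treats a heavily simplified case that is not taken from the text: each cloud is reduced to a single coordinate (for instance, depth values) and modeled by a univariate Gaussian. Under these assumptions the Fisher metric is $ds^{2} = (d\mu^{2} + 2\,d\sigma^{2})/\sigma^{2}$, a rescaled hyperbolic metric, and the geodesic distance has a closed form. The function names and the sample values are hypothetical.

```python
import math


def fit_gaussian(samples):
    """Return (mean, standard deviation) of a list of scalar samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, math.sqrt(var)


def fisher_rao_distance(g1, g2):
    """Fisher-Rao distance between two univariate Gaussians (mean, std)."""
    (m1, s1), (m2, s2) = g1, g2
    num = (m1 - m2) ** 2 + 2.0 * (s1 - s2) ** 2
    den = (m1 - m2) ** 2 + 2.0 * (s1 + s2) ** 2
    return 2.0 * math.sqrt(2.0) * math.atanh(math.sqrt(num / den))


# Hypothetical depth samples taken from two clouds.
cloud_a = [1.0, 2.0, 3.0]
cloud_b = [2.0, 3.0, 4.0]
d_ab = fisher_rao_distance(fit_gaussian(cloud_a), fit_gaussian(cloud_b))
```

The sketch already shows the limitation discussed above: it needs a well-defined density for each cloud, with strictly positive variance, and it says nothing about missing points or about which points correspond to each other.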

Application to the recognition problem

The static case

Let us consider an object $B^{\alpha}$ with projection $b^{\alpha}_{i} := \pi_{\mathbf{C}_{i}}(B^{\alpha})$ from the center $\mathbf{C}_{i}$. In order to detect $B^{\alpha}$ from some of its projections, the object is described by a set of image features grouped in a vector $\mathbf{v}_{0}^{\alpha}$, whose values may change due to changes in illumination or camera position. These changes are represented by another vector $\mathbf{v'}_{0}^{\alpha}$, which can be recovered by applying transformations to the initial vector $\mathbf{v}_{0}^{\alpha}$. Continuous changes in these features are represented using vector fields linked to the "temporal evolution" (e.g. linked to the trajectories performed by a mobile camera).

Objects will be classified from a feature vector field $\mathbf{v}_{0,i}^{\alpha}$ by evaluating the behavior of the vector field with respect to different linear functionals; the values taken by these linear forms allow us to discriminate between features.
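A minimal sketch of this evaluation, assuming that a feature vector is a plain list of numbers and that each linear functional is given by its coefficients: the signs of the functionals on a feature vector form a pattern, and two vectors with different patterns are separated by at least one of the forms. The functionals and the feature values below are hypothetical and serve only as an illustration.

```python
def evaluate(form, features):
    """Evaluate the linear functional with coefficients `form` on `features`."""
    return sum(c * x for c, x in zip(form, features))


def sign_pattern(features, forms):
    """Return the tuple of signs (+1, 0 or -1) of each functional on `features`."""
    pattern = []
    for form in forms:
        value = evaluate(form, features)
        pattern.append((value > 0) - (value < 0))
    return tuple(pattern)


# Two hypothetical linear functionals on a 3-dimensional feature space.
forms = [[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]]

v_a = [0.9, 0.2, 0.1]          # features of one object
v_a_changed = [0.8, 0.3, 0.2]  # same object after a small change of illumination
v_b = [0.1, 0.9, 0.3]          # features of another object

pattern_a = sign_pattern(v_a, forms)
pattern_a_changed = sign_pattern(v_a_changed, forms)
pattern_b = sign_pattern(v_b, forms)
```

In this toy case the small change of the first feature vector keeps the sign pattern, whereas the second object falls on the other side of the first functional.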

The dynamic case

This case relates image flow and scene flow in the terms defined by @vedula1999sceneflow. The problem has two main components:

  • The transformation of a scene flow into an image flow is performed by contracting 3D forms (products of 3 linearly independent forms) along the vector field representing the projection. This operation can be defined intrinsically using the contraction operator in Cartan's framework, or computed with conventional integrals (unfortunately, the latter expression is not intrinsic and depends on the camera's location, but it is relevant for estimation issues).

  • The transformation of an image flow into a scene flow requires passing from the original distribution of 2 linearly independent vector fields to a 2-dimensional system of differential forms, and multiplying it by the differential 1-form corresponding to the depth direction (which may change with the camera motion). Finally, to recover the usual description of the scene flow, it is necessary to compute again the 3D distribution of vector fields which is dual to the differential 3-form.

  • TO BE DEVELOPED!
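The contraction used in the first component can be illustrated pointwise. The sketch below assumes the standard volume form $dx \wedge dy \wedge dz$ on $\mathbb{R}^{3}$, whose value on three vectors is their determinant, and a constant projection direction along the depth axis; both choices are simplifying assumptions, and the function names are illustrative. Contracting the 3-form with a vector $X$ gives the 2-form $(u, v) \mapsto \det(X, u, v)$.

```python
def det3(a, b, c):
    """Value of dx^dy^dz on (a, b, c): the determinant with rows a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def contract(X):
    """Contract the volume form with X; the result is a 2-form on pairs (u, v)."""
    def two_form(u, v):
        return det3(X, u, v)
    return two_form


# Projection direction along the depth axis (an assumption for this sketch).
X = (0.0, 0.0, 1.0)
image_form = contract(X)

u = (1.0, 0.0, 5.0)
v = (0.0, 1.0, 7.0)
value = image_form(u, v)      # equals u_x * v_y - u_y * v_x
swapped = image_form(v, u)    # antisymmetry changes the sign
```

With this choice of $X$ the contracted form is $dx \wedge dy$: it ignores the depth components of $u$ and $v$, which is the pointwise counterpart of passing from the scene to the image plane.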

An extension to the affine case

Motivation: it is unavoidable to treat the affine case, since the projections onto the image plane, which distort the object, are modeled using affine transformations. This approach was partially included in the KLT algorithm developed by @tomasi1992shape, based on the basic framework introduced by @lucas1981iterative.

Structural groups allow us to connect with KLT. The affine structural group is defined as the semidirect product of the general linear group and the translation group. If the original manifold can be approximated by a Riemannian manifold, it can be reduced to the semidirect product of the orthogonal group and the translation group. Therefore, the motion can be represented by the semidirect product of the Lie algebras of the orthogonal and the translation groups. The elements of this group are $(n+1)\times (n+1)$ block matrices with an orthogonal upper-left $n \times n$ block, a last column containing the translation, and a last row of the form $(0\ 0\ \ldots \ 1)$; the elements of the corresponding Lie algebra have an antisymmetric upper-left block, an arbitrary last column and a zero last row.
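The block structure of these matrices can be sketched in the planar case $n = 2$, which is an assumption made here to keep the example short: a rotation by an angle $\theta$ and a translation $(t_x, t_y)$ are stored in a single $3 \times 3$ matrix, and the matrix product reproduces the semidirect product law $(R_1, t_1)(R_2, t_2) = (R_1 R_2,\ R_1 t_2 + t_1)$. The function names are illustrative.

```python
import math


def rigid(theta, tx, ty):
    """Homogeneous 3x3 matrix of a planar rotation followed by a translation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx],
            [s, c, ty],
            [0.0, 0.0, 1.0]]


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def act(g, point):
    """Apply the transformation g to a point of the plane."""
    x, y = point
    out = matmul(g, [[x], [y], [1.0]])
    return (out[0][0], out[1][0])


g1 = rigid(math.pi / 2, 1.0, 0.0)  # quarter turn, then shift by (1, 0)
g2 = rigid(0.0, 0.0, 2.0)          # pure translation by (0, 2)
g = matmul(g1, g2)

# Translation part of the product: R1 * t2 + t1 = (-2, 0) + (1, 0).
translation = (g[0][2], g[1][2])
image_of_origin = act(g, (0.0, 0.0))
```

The last row of the product is again $(0\ 0\ 1)$, so the set of such matrices is closed under composition, and the translation part of the product is rotated by the first factor, which is what distinguishes the semidirect product from a direct product.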