Real Time 3D Reconstruction from Monocular Video

A general framework to feedback Recognition with Reconstruction

First, to provide a clearer guidance throught this document we provide defintions for some basic concepts arising from differential topology and algebra:

Topological space: A collection of open sets which is closed by arbitrary union and finite intersections. In discrete terms it will be represented by a set of points belonging to a set of neighborhoods for each point allowing the definition of concepts such as continuity, connectedness, and convergence.
Topological group: is a group equipped with a topology which is compatible. The internal law (binary operation on the group) and the inverse of each element are continuous functions with respect the topology. In a topological group $G$ algebraic operations can be performed at the same time we can study the continuity of the group.
Fiber bundle: a kind of fibration space $\xi = (E; \pi, B, F)$ with total space $E$, base space $B$, projection map $\pi: E \rightarrow B$ and fiber $F$, verifying a locally trivial condition: Each point base $b\in B$ has a neighborhood $U$ such that $\pi^{-1}(U)$ is topologically equivalent to $U \times F$, and this topological equivalence is restricted to an isomorphism between the generic fiber $F$ and the particular fiber $F_{b} := \pi^{-1}(b)$ at $b\in B$. Hence, a fiber bundle is locally given as a product space but globally may have a different topological structure. A trivial bundle is globally a product, i.e. $E = B \times F$
Fibration: a generalization of a fiber bundle in which the fibers can display different structures, but they must be homotopy equivalent.
Vector fiber bundle: a fiber bundle $\xi = (E, \pi, B, F)$ whose fiber $F_{b} = \pi^{-1}(b)$ at $b$ is a vector space of dimension $r$ for each $b\in B$. We shall denote the generic fiber by $F = \mathbb{R}^{r}$ and we shall say that $\xi$ is a vector bundle of rank $r$
Principal fiber bundle: a fiber bundle attached to a group $G$ whose free and transitive action is applied over each fiber so that each one is a principal homogeneous space. The group is also the structure group of the bundle. Every vector bundle has an associated principal fiber bundle, whose structural group is a subgroup $H$ of the general linear group $GL(r, \mathbb{R})$; in the real case usually, $H$ is a "classical" subgroup of $GL(r, \mathbb{R})$ such as the orthogonal $O(r)$, special linear $SL(r)$ or special orthogonal group $SO(r) = O(r)\cap SL(r)$. However, a principal group is not necessarily a vector bundle, because its fiber can be a group, a quotient of groups (a grassmannian, e.g.) or other objects associated to references (feature vectors, e.g.)
Homeomorphism: a biyective and bicontinous application between two topological groups. It can be viewed just as a continuous homeomorphism between the same two groups. They are mappings that preserve the topological properties of a given space.
Diffeomorphism: an homeomorphism with the constraint of smoothness in the source and target topological space between manifolds.

There are two different transformation groups:

Classical transformation groups $\mathbb{L}$ are used in reconstruction linked to euclidean space such as the classical Lie groups (SO, SE, GL).
Gauge deformation groups $\mathbb{G}$ are used in recognition to represent regular smooth functions over a continuos space.

Before being in more detailed explanation it is necessary to define two common concepts used in recognition and reconstruction.

Pattern: used in recognition, related with topology and whose observations are managed with fibrations. A patterm is a canonical simplified representation of a set of similar objects. They allow to classify new samples both in supervised and in unsupervised learning. Each pattern provides a description of facts or a situation.
Model: used in reconstruction, related with geometry and whose observations are managed with vector fiber bundles. A model is an abstract conceptualization of the reality to infer or derive new unpredicted situations. It can be build using some knowledge of the reality to explain when the underlying theoretical background is hidden or using a priori knowledge about the facts to explain. Each model provides an explanation to partial aspects of a pattern so each model can be compatible with several patterns.

There are two more concepts commonly used in pattern recognition:

Detector: is an algorithm able of analyze an image and extract the most important features using both the color and the geometry of the pixels. It is represented by (scalar or vector) functions or distributions which can be applied on a topological space which can be considered as a base space $B$. Performance and accuracy are the most valuable properties for detectors. The most famous detectors are SIFT, SURF, CenSurE (STAR), ORB, etc.
Descriptor: is a compact representation of an image feature. Very often they are represented by feature vectors, i.e., an array of attributes and parameters representing the current values of functions and/or distributions on a base point $b\in B$. A typical example is given by a vector $\mathbf{v}{b}\in F{b} = \pi^{-1}(b)$ belonging to the fiber of a vector bundle $\xi$.They allow to pack information about not only the position and/or orientation of the feature but also about the properties which uniquely define the feature, up to the group action on the fiber $F_b$. Hence, the group action on the fiber allows searching similarities on an image dataset using only a bag of these descriptors (bag of words metaphor). Vector quantization is commonly used to compress this information using the less possible space trying to achieve the higher possible recognition rate. Some examples of descriptors are SIFT, SURF, BRIEF, FAST, etc.

Contributions

The idea consists of defining the supervised learning by using functions (escalar or vector) and/or distributions (vector or tensor) defined on manifolds. Hence, our approach is a natural extension of Learning Manifolds strategies which allows to incorproate differential methods to the Statistical Learning Theory. Distributions are collections of vector fields which are defined over manifolds which parametrize states of the system; so, the manifold learning framework can be included in our approach. In our case, distributions are estimated from statistical properties evaluated on collections of small displacements (vector fields) arising from tracking control points in video sequences. Fibrations and fiber bundles are mandatory in order to link these heterogeneous elements in a more compact framework. Furthermore, this approach introduces some relevant advantages:

Detectors are functionals over the base space $B$ of a fibration, while descriptors are functionals which are defined onto the fiber $F_b$ of the fibration at each point $b\in B$. In this way, patterns and models share information using morphisms between fibrations and fiber bundles.
A basic hierarchy between patterns and models is formulated in terms of groupoids on the space base (given by homeomorphisms, typically) and transformation groups on the fiber $F_b$, respectively. Gauge groups allow to deal with both of them since they can be applied both continuous mechanics and classical mechanics including Hamilton (PDE systems) and Euler formulations (energy minimization)
It allows to model dynamic phenomena. A vector field is a geometric representation of a linear differential equation; it proveides a model for motion's particle at a control point $b\in B$ in the optical flow of a video; its behavior is extended to the neighbors $b'\in U_b$ for a small open set $U_b$ centered at $b\in B$ ^vector-field. In a similar way, a PDE (Partial Differential Equations) provides models for deformations and their extensions (propagation or restoration models, e.g.), which can also be viewed as a section of a fibration. Let us remark that a fibration supports discontinuities at topological behavior of the fiber attached to near base points of the fibration, which are not allowed for fiber vector bundles (v.b.). Hence, fibrations are more flexible than v.b.

^vector-field. A vector field can be modeled as a local section $S: U \rightarrow E$ of a vector fiber bundle (a local section is a smooth application with the feature vectors $Fb$ as values such that $\pi{\xi}\circ s = 1_{B}$ where $1_B$ denotes the identity map of $B$ (this condition is introduced to guarantee compatibility between data en in the base and the fiber). ↩

Space-temporal variations in observable "geometric quantities"' can be represented as a displacement along a manifold which is modified along the path. The displacement is formally given by a covariant derivative defined on a v.b. or their extensions to fibrations. In order to achieve a feedback between recognition and 3D reconstruction we would need a functor which assigns to each fibration (the locus where live patterns for recognition) corresponding to a collection of observable features in a fiber bundle (ideal model given by configurations of vectors) representing a simplified vision of reality. Inversely, the v.b. structure can be considered as PL-approach to more general fibrations structures. This claim can be extended to principal groups and gauge fibrations, because the linearisation of the group of homeomorphisms fixing a base point $b\in B$ is $GL()$.
Global Optimization procedures based in energy functionals, such as Total Variation (TV) from @chan2005aspects, can be applied over fibrations in a "probabilistic" context and over fiber bundles space in a "deterministic" context. The process consists of convoluting the reconstruction function using the flux divergence of the simultaneous recognition process. Then, the "probabilistic" version is equivalent to minimize an energy functional using a Hausdorff distance over a fibration to prevent uncontrolled variations in shape irregularities. The "deterministic" version is equivalent to minimize an energy functional using a weighted sum of a L1-norm (for scalar component linked to potential energy) and a L2 norm (for vector component linked to kinetic energy) over a vectorial fiber bundle. The use of L2-norm on the tangent space $T_{b}B$ is equivalent to the reduction of the general linear group to the euclidean group in the tangent space at each base point $b\in B$.

General idea

This section summarizes the idea behind the general framework using concepts and results from local differential topology. The visualization of a 3D reconstruction can be represented as a transformation belonging to a classical group. For instance,

In the cartesian framework, the introduction of the euclidean metric is equivalent to the reduction of the general linear group $GL(3; \mathbb{R})$ to the special orthogonal group $SO()$ (Gram-Schmid orthonormalization) which gives the structural group for the associated principal bundle;
In the euclidean framework, the structural group is the euclidean group $SE(3)$. i.e. the semidirect product of the special linear group $SO(3)$ and the group or translations $\mathbb{R}^{3}$);
In the affine framework, the structural group is the affine group $A(3)$, i.e. the semidirect product of the general linear group $GL(3; \mathbb{R})$ and the group or translations $\mathbb{R}^{3}$);

The use of structural groups for each framework allows to represent the displacement of references linked to feature vectors (elements of a fiber $Fb$) and its interpretation in the euclidean or the affine framework (including apparent deformations due to projections onto the image plane). But these representations provide a support for kinematic aspects, because the tangent space $T{b}B\ := \pi^{-1}_{\xi}(b) \simeq \mathbb{R}^{n}$ for the tangent space to descriptors is a cartesian space which provides the support for euclidean and affine structures. The initial idea of SFM methods is modeled on k-tuples of points in the ordinary space $\mathbb{R}^{3}$ which are tracked along $m$ images giving a spatial configuration and its planar projections. The spatial configuration is represented by $n = 3k$ data giving a point of $\mathbb{R}^{n}$, and planar configurations are represented by $p = 2km$ data giving a point of $\mathbb{R}^{p}$. Hence the correspondence between spatial configurations and their planar projections is ideally represented by a map $\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^p$ which we suppose locally continuous, i.e. $\mathbf{f} \in C^{0} (n, p)$.

In these terms, the most general groupoids acting on source and target space are given by the homeomorphisms $Homeom{\underline{x}} (\mathbb{R}^{n})$ preserving a base point $\underline{x}\in \mathbb{R}^{n}$ (given by 3D configuration of control points) and the homeomorphisms $Homeom{\underline{y}} (\mathbb{R}^{p})$ preserving the base point $\underline{y}\in \mathbb{R}^{p}$ corresponding to the projection of the configuration $\underline{x}$ on each image plane. The linearization of each one of such homeomorphism is the general lineal group, i.e. $T{e} Homeom{\underline{x}} (\mathbb{R}^{n}) = GL(n; \mathbb{R})$ and similarly for the target space. This simple remark is the key for connecting (apparent or real) deformations with the linear approach based in feature vectors; in other words for connecting recognition with reconstruction issues.

Furthermore, this approach can be extended in a natural way from static to dynamic case. Indeed, motion kinematics associated to the camera motion can be read in terms of differential calculus linked to the map $\mathbf{f} \in C^{0} (n, p)$, including aspects such as optical flow or others. Let us remark that this approach is compatible with rigid motions in the source space and apparent deformations on images for the whole configurations.

To ease the management of the above information is convenient to linearize the above application. This transformation is motivated by a reformulation of the optica flow in terms of control points to be tracked. Following the idea of the optical flow from @tomasi1992shape and extended by @ma2001optimization, we can define a base application

$\mathbf{f}: \mathbb{R}^n \rightarrow \mathbb{R}^p$

Its differential provides a linearization represented by a transformation using the fiber tangent bundles of the global cartesian spaces

$d\mathbf{f}: \mathbb{TR}^n \rightarrow \mathbb{TR}^p$

The tangent space $\mathbb{TR}^n$ at each point is the fiber $T\underline{x} \mathbb{R}^{n+}$; in particular, each section on the fiber is represented by the position and the current velocity of the base configuration, given by all the tracked feature points $\mathbf{P}{1}, \ldots, \mathbf{P}_{k}$ with their velocities. The fiber tangent bundle of the cartesian space is trivial:

$\mathbb{TR}^n \approx \mathbb{R}^n \times \mathbb{R}^n$

The above arguments can be easiliy extended for "true" representations (euclidean framework) and to allow apparent deformations (arising from projections). It suffices to consider superimposed structures (euclidean, affine) to the cartesian space and extend the general linear group to the semidirect product by the group of traslations.

In the general case of arbitrary deformations (to compare similar but not identical shapes), it is necessary to work with gauge transformations on fibrations which provide the natural extension of structural groups (classical groups including euclidean, affine, similariity, projective groups) on vector bundles. In the extended case, the tangent space can be viewed as the fiber of a fibration whose base is a general deformation of the configurations space. The latter space can be linearized to recover structure from motion and viceversa in a more general framework including possible deformations. The fibration approach introduces two advantages:

It is a natural extension of other well-known approaches, and consequently is valid in euclidean and affine spaces
The base application $f$ is not necessarily regular, i.e. $range(\mathbb{J}_f) < inf(n, p)$, which is crucial fact to include pathologies linked to partial occlusions, incomplete information, degenerated configurations and uncertainty conditions.

As resume a fibration or fiber bundle extends the notion of a vector fiber bundle including different options such as: the fiber of the fibration can be a variety (not necessarily a manifold, including singularities and/or discrete collections of points, e.g.) of arbitrary dimension, such as the geometric elements recovered from 3D reconstruction (points, lines, planes, curves, etc). The same idea is used extensively in the field of manifold learning for non-linear dimensionality reduction (a generalization of the Principal Component Analysis method). In general, the fibers of a fibration can be varieties of arbitrary dimension and shape which can be patched together by lifting the information arising from the base space.

The introduction of classical groups and their linearization on tangent vector bundles (as linear approach to manifolds) has been already performed by @ma2001optimization with a remarkable simplification for optimization issues in 3D Reconstruction processes. In this work, we propose a similar scheme for gauge transformations acting on varieties which can be "fibered" by locally trivial fibrations. Any classical group can be used to recognize objects from their projections; however, this approach is strongly limited by small variations of objects to be recognized, because data are not related between them through a classical transformation.

Our scheme, allows an extension of the structural viewpoint (compatible with apparent deformations) with a based-appearances approaches linked to configurations of points and/or feature vectors. A key idea is to extend the action of the structural group of the principal fiber bundle for reconstructed objects to gauge group acting on a fibration. So, two simple homeomorphisms acting on source and target spaces convert the space of vector maps $\mathbf{f} \in C^{0}(n, p)$ in a Gauge space as prototype for active recognition and reconstruction from motion.