Visual tracking is the central task to solve in Simultaneous Localization and Mapping (SLAM). It is also required for the 3D reconstruction of a scene, since stereo-based depth computation depends on knowing the relative camera poses between frames.
An early approach was introduced by @drummond2002real, based on active contour tracking of real objects using prior knowledge of the tracked object's CAD model. They used a simplified formulation to achieve a frame rate of 25 fps.
One of the most recent open-source implementations is CoSLAM, developed by @zou2013coslam. Here is a video showing the final results:
The work of @salas2013slampp points in the same direction as ours, establishing a feedback loop between the reconstruction and the recognition phases, as they show in their latest work, demoed in this video:
As the description of the video shows, they use an RGB-D camera to quickly estimate the depth of the scene. Then, the 6-DOF camera pose can be estimated by aligning consecutive frames with an ICP (Iterative Closest Point) method. The pose graph allows closing large loops while also recovering from bad pose estimations.
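To make the frame-to-frame alignment step concrete, here is a minimal sketch of point-to-point ICP in NumPy: iterate nearest-neighbour correspondence search and a closed-form (SVD-based) rigid alignment. This is an illustration under simplifying assumptions (small clouds, brute-force matching), not the implementation used by @salas2013slampp:

```python
import numpy as np

def best_rigid_transform(src, dst):
    # Least-squares rotation R and translation t mapping src -> dst (Kabsch)
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # avoid a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t

def icp(src, dst, iters=30, tol=1e-9):
    # Minimal point-to-point ICP: alternate matching and alignment
    R_tot, t_tot = np.eye(3), np.zeros(3)
    cur = src.copy()
    prev_err = np.inf
    for _ in range(iters):
        # Brute-force nearest neighbours (fine for small clouds)
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        nn = d2.argmin(axis=1)
        R, t = best_rigid_transform(cur, dst[nn])
        cur = cur @ R.T + t
        # Compose the increment into the accumulated 6-DOF pose
        R_tot, t_tot = R @ R_tot, R @ t_tot + t
        err = np.sqrt(d2[np.arange(len(cur)), nn].mean())
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R_tot, t_tot
```

In a real RGB-D pipeline the brute-force matching would be replaced by a k-d tree (or projective association), and a point-to-plane error metric usually converges faster than the point-to-point one shown here.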