Natural Feature Tracking Using Markers
A fiducial marker is an easily detected feature, as discussed above; it can be knowingly and intentionally placed, or may exist naturally in a scene. Natural Feature Tracking (NFT) is the idea of recognizing and tracking a scene that does not deliberately use markers. NFT can use items that are embedded in a natural pictorial view (e.g., a known statue, building, or perhaps a tree) as tracking points and regions within the view. The result can be seemingly markerless tracking, since the markers may not be known to the user.
Using NFT can be less computationally expensive than a full SLAM system, and it is more practical for mobile devices. However, there is a practical limit on the number of unique markers that can be distinguished at any one time. Thus, if a large number of images needs to be tracked, a SLAM system or fiducial markers could be more efficient.
A critical component of augmented reality is knowing where you are and what is around you. One of the technologies that enables such capabilities is simultaneous localization and mapping (SLAM), a system and process whereby a device creates a map of its surroundings and orients itself within that map in real time.
SLAM starts with an unknown environment in which the augmented reality device tries to generate a map and localize itself. Through a series of complex computations and algorithms, the system uses IMU sensor data to construct a map of the unknown environment while simultaneously using that map to identify where it is located (Fig. 8.64).
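The predict/correct cycle at the heart of that loop can be sketched in a few lines. The following is a deliberately simplified one-dimensional illustration (all function names and values are hypothetical, not from any real SLAM library): IMU data predicts motion, and an observation of an already-mapped landmark corrects the accumulated drift. Real systems work with full 6-DoF poses and probabilistic filters or graph optimization.

```python
def predict(pose, imu_velocity, dt):
    """Dead-reckon the pose forward from IMU-derived velocity."""
    return pose + imu_velocity * dt

def correct(pose, landmark, measured_range, gain=0.5):
    """Pull the pose toward agreement with a range measurement of a
    landmark already in the map (a crude stand-in for a filter update)."""
    expected_range = landmark - pose
    innovation = measured_range - expected_range
    return pose - gain * innovation

landmark = 10.0                     # position of a mapped landmark
pose_est = pose_true = 0.0
for _ in range(5):
    pose_true += 1.0                # the device really moves 1 m per step
    pose_est = predict(pose_est, imu_velocity=1.05, dt=1.0)  # biased IMU
    measured = landmark - pose_true                # ideal range measurement
    pose_est = correct(pose_est, landmark, measured)

# With correction, the error stays bounded near 0.05 m instead of
# growing by 0.05 m every step as pure dead reckoning would.
```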
For some outdoor applications, the need for SLAM is almost entirely removed by high-precision differential GPS sensors. From a SLAM perspective, these may be viewed as location sensors whose likelihoods are so sharp that they completely dominate the inference.
For SLAM to work, however, the system needs to build a map of its surroundings and then orient itself within this map to refine it (Fig. 8.65).
There are several algorithms for establishing pose using SLAM. One technique uses a keyframe-based solution that assists with building room-sized 3D models of a particular scene. The system runs a computationally intensive non-linear optimization called bundle adjustment to ensure that models with a high level of accuracy are generated. This optimization is significantly accelerated by high-performance parallel single-instruction, multiple-data (SIMD) GPU processors, which help ensure smooth operation on mobile devices.
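The quantity bundle adjustment minimizes can be illustrated with a toy pinhole camera model (the focal length and principal point below are arbitrary illustrative values): the sum of squared differences between the observed 2D features and the projections of the estimated 3D points. Real bundle adjustment jointly optimizes camera poses and point positions, typically with a Levenberg-Marquardt solver; this sketch only shows the cost being minimized.

```python
def project(point3d, focal=500.0, cx=320.0, cy=240.0):
    """Toy pinhole projection of a 3D camera-frame point to pixels."""
    x, y, z = point3d
    return (focal * x / z + cx, focal * y / z + cy)

def reprojection_error(points3d, observations):
    """Total squared pixel error between observed and projected features
    (the objective that bundle adjustment drives toward zero)."""
    total = 0.0
    for p, (u_obs, v_obs) in zip(points3d, observations):
        u, v = project(p)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total

points = [(0.1, 0.2, 2.0), (-0.3, 0.1, 4.0)]
obs = [project(p) for p in points]          # perfect observations
print(reprojection_error(points, obs))      # 0.0 at the optimum
```

Perturbing any 3D point (or, in a full system, any camera pose) raises this error, which is what the optimizer exploits.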
Fig. 8.64 Block diagram of the IMU and SLAM relationship to Pose and the 3D map
Fig. 8.65 Using SLAM technology, the augmented reality device’s camera, in conjunction with its gyroscope and other location devices, assigns coordinates to objects in the FoV
During fast motion, tracking failures are typically experienced in an augmented reality system. To recover from such tracking failures, the system needs a relocalization routine that quickly computes an approximate camera pose when images are blurred or otherwise corrupted. The system must also deliver a triangulated 3D mesh of the environment so the application can deliver a realistic augmented experience that is blended with the real scene.
Markerless location mapping is not a new concept; it was explored in earlier work by Mann (Video Orbits) for featureless augmented reality tracking. Markerless vision-based tracking was also combined with gyro tracking.
In markerless augmented reality, finding the camera pose requires significant processing capability and more complex and sophisticated image-processing algorithms, such as disparity mapping, feature detection, optical flow, object classification, and real-time high-speed computation.
Fig. 8.66 Block diagram for an augmented reality system with tracking
In her report, “Theory and applications of marker-based augmented reality,” for the VTT Technical Research Centre of Finland, Sanni Siltanen generated a block diagram for a simple augmented reality system which has been adopted by the industry; it is presented here slightly redrawn for simplification.
As illustrated in Fig. 8.66, the acquisition module captures the image from the camera sensor. The tracking module calculates the correct location and orientation for virtual overlay. The rendering module combines the original image and the virtual components using the calculated pose and then renders the augmented image on the display.
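The three modules can be expressed as a simple pipeline skeleton. The function names, stub pose, and data shapes below are illustrative only, not any real API; the point is the flow of data from sensor to display.

```python
def acquisition(sensor):
    """Capture a raw image frame from the camera sensor."""
    return sensor()

def tracking(frame):
    """Calculate the camera pose (location and orientation)
    to use for the virtual overlay.  Stubbed here."""
    return {"position": (0.0, 0.0, 0.0),
            "orientation": (0.0, 0.0, 0.0, 1.0)}   # identity quaternion

def rendering(frame, pose, virtual_objects):
    """Combine the original image and the virtual components
    using the calculated pose, ready for display."""
    return {"frame": frame, "pose": pose, "overlay": virtual_objects}

# One trip through the pipeline with a fake sensor:
fake_sensor = lambda: "raw-image-bytes"
frame = acquisition(fake_sensor)
pose = tracking(frame)
augmented = rendering(frame, pose, ["virtual-teapot"])
```

In a real system this loop runs once per frame, and the tracking stage is where nearly all of the latency pressure discussed below lands.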
The tracking module is “the heart” of a non-head-up-display augmented reality system; it calculates the relative pose of the camera in real time, and is critical in an augmented reality system with always-on vision processing. “Always-on vision” is not the same thing as simply being able to take a picture with an augmented reality device. For example, Google Glass has no vision processing but can take pictures.
A more detailed view of the tracking system can be seen in the following diagram, which is an expanded version of the block diagram of a typical augmented reality device shown in Fig. 7.3.
Please note that Figs. 8.66 and 8.67 are labeled in red to illustrate the matching components.
SoCs designed for smartphones, with CPUs, GPUs, DSPs, and ISPs, are often employed in augmented reality systems to do vision processing. However, they may not be fast enough, so specialized vision processors from companies like CogniVue and Synopsys, or custom devices based on field-programmable gate arrays (FPGAs), have been developed to speed up critical algorithm execution.
For the kind of fast updating and image acquisition needed in mission-critical augmented reality systems, there is only 50 ms from sensor to display: acquisition typically takes 17.7 ms and rendering uses up 23.3 ms, leaving only about 9 ms for the tracking section, which really isn’t much, or enough.
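As a quick check of that budget (strict subtraction of the quoted figures leaves 9 ms for tracking):

```python
# End-to-end latency budget, in milliseconds, from sensor to display.
frame_budget = 50.0
acquisition = 17.7
rendering = 23.3
tracking_budget = round(frame_budget - acquisition - rendering, 1)
print(tracking_budget)   # 9.0 ms remain for tracking
```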
The specialized vision processors that handle the tracking problem can typically process the data in 2 ms.
Marker-based systems are less demanding, but unsuitable in many scenarios (e.g., outdoor tracking). Markerless tracking depends on identifying natural features rather than fiducial markers.
Fig. 8.67 Detailed diagram of an augmented reality system with tracking
Augmented reality systems use sensors for tracking and location (e.g., GPS), or hybrid schemes, e.g., GPS and a MEMS gyroscope for position combined with visual tracking for orientation.
Mission-critical augmented reality systems, however, need to use some other tracking method. Visual tracking methods that estimate the camera’s pose (camera-based tracking, optical tracking, or natural feature tracking) are often the solution.
Tracking and registration become more complex with Natural Feature Tracking (NFT).
Markerless augmented reality apps with NFT will ultimately be widely adopted. However, NFT requires a great deal of processing, and it must be done in 50 ms or less.
Fig. 8.68 An external 3D sensor attached to an Android phone assembly (Source: Van Gogh Imaging)
NFT involves what are known as interest point detectors (IPDs). Before tracking, features or key points must be detected. Typically, algorithms found in the OpenCV image-processing library are used, such as Harris corner detection, GFTT (good features to track), and FAST.
FAST (Features from Accelerated Segment Test) has been preferred for mobile apps because it requires less processor performance, but it is not necessarily the best for accuracy and precision.
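The core of the FAST criterion is easy to state: a pixel is a corner if enough contiguous pixels on a circle around it are all brighter, or all darker, than the center by some threshold. A pure-Python sketch of the segment test (the FAST-9 variant; in practice one would use an optimized implementation such as OpenCV’s):

```python
# Offsets of the 16-pixel Bresenham circle (radius 3) used by FAST.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def is_fast_corner(img, x, y, threshold=20, n=9):
    """Segment test: n contiguous circle pixels all brighter than
    center + threshold, or all darker than center - threshold."""
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    for sign in (1, -1):                  # brighter pass, then darker pass
        flags = [sign * (q - center) > threshold for q in ring]
        run = 0
        for f in flags + flags:           # list doubled to wrap the circle
            run = run + 1 if f else 0
            if run >= n:
                return True
    return False

# Toy 13x13 image: a bright square whose corner sits at pixel (6, 6).
img = [[255 if xx >= 6 and yy >= 6 else 0 for xx in range(13)]
       for yy in range(13)]
print(is_fast_corner(img, 6, 6))   # True: corner of the bright square
print(is_fast_corner(img, 3, 3))   # False: flat dark region
```

Real implementations add non-maximum suppression and a learned ordering of the 16 comparisons so most candidate pixels are rejected after only a few tests, which is where FAST’s speed comes from.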
Mobile augmented reality (and robotics) also involves pose estimation, which uses random sample consensus (RANSAC), an iterative method for estimating the parameters of a mathematical model from a set of observed data that contains outliers.
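A minimal RANSAC sketch for line fitting shows the idea (the data here is hypothetical; pose estimation applies the same loop to a perspective camera model): repeatedly fit a model to a random minimal sample, and keep the model that the most points agree with, so the outliers never corrupt the fit.

```python
import random

def ransac_line(points, iterations=200, tolerance=0.5, seed=0):
    """Fit y = a*x + b to points containing outliers via RANSAC."""
    rng = random.Random(seed)
    best_model, best_inliers = None, 0
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)   # minimal sample
        if x1 == x2:
            continue
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Score the candidate by how many points lie within tolerance.
        inliers = sum(1 for x, y in points if abs(y - (a * x + b)) < tolerance)
        if inliers > best_inliers:
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# 20 points on y = 2x + 1, plus 5 gross outliers.
points = [(x, 2 * x + 1) for x in range(20)]
points += [(3, 40), (7, -15), (11, 60), (15, -30), (18, 80)]
(a, b), n = ransac_line(points)
print(a, b, n)   # recovers a = 2, b = 1 with 20 inliers
```

A least-squares fit on the same data would be dragged far off the true line by the five outliers; RANSAC simply ignores them.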
Feature description and matching is where the most processing is required. An example is the Scale-Invariant Feature Transform (SIFT), a computer-vision algorithm that detects and describes local features in images.
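Descriptor matching itself is simple to sketch; the cost comes from doing it for thousands of 128-dimensional descriptors every frame. Below is a toy nearest-neighbor matcher with Lowe’s ratio test, the standard way ambiguous SIFT matches are rejected (4-D descriptors are used here for brevity):

```python
import math

def dist(d1, d2):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def match(descs1, descs2, ratio=0.8):
    """Match each descriptor in descs1 to its nearest neighbor in descs2,
    keeping it only if clearly better than the second-best (ratio test)."""
    matches = []
    for i, d in enumerate(descs1):
        ranked = sorted((dist(d, e), j) for j, e in enumerate(descs2))
        if len(ranked) >= 2 and ranked[0][0] < ratio * ranked[1][0]:
            matches.append((i, ranked[0][1]))
    return matches

image1 = [(1, 0, 0, 0), (0, 1, 0, 0), (0.5, 0.5, 0, 0)]
image2 = [(0, 1, 0, 0), (1, 0, 0, 0)]
print(match(image1, image2))   # third descriptor is ambiguous and dropped
```

The brute-force loop here is O(n²) in the number of features; production pipelines replace it with approximate nearest-neighbor search, which is part of why dedicated hardware helps.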
SIFT is often used as a benchmark for detection/extraction performance. A CPU by itself typically experiences long processing latency when running a SIFT program. Using a GPU for acceleration gives about a 5× to 10× improvement. Specialized tracking processors can often produce a 50× improvement and are typically 100× better in terms of performance per watt.
Real-time performance is considered 50 ms or less from image acquisition to display (glasses); therefore, feature detection, tracking, and matching have to complete in 10 ms or less, and in under 5 ms for low-power operation.
NFT processing for “always-on” mobile augmented reality needs a greater than 100× improvement in performance per watt for 1 MP sensors.
SLAM algorithms are being carefully tailored to run on mobile devices, which requires efficient use of processing power, memory, and battery life. The picture above (Fig. 8.68) is an example of a 3D sensor attached to an Android phone. This type of configuration is going to be short-lived, since smart tablets and phones will have 3D sensors integrated directly into the hardware.
Wearable augmented reality applications need this performance to enable power-efficient, always-on augmented reality and vision applications.
So augmented reality smart-glasses manufacturers have to weigh the tradeoffs of performance versus component costs. Adding an extra vision processor may increase costs and make the device uncompetitive.
GPS Markerless Tracking
Markerless augmented reality typically uses the GPS feature of a smartphone to locate and interact with augmented reality resources. Pokémon GO, however, doesn’t use SLAM; it relies on the phone’s orientation tracking, since the Pokémon GO augmented reality feature doesn’t work on devices without a gyroscope.