We've discussed what augmented reality is, but how does it work? As we said earlier, AR requires that we combine the real environment with a computer-generated virtual environment. The graphics are registered to the real 3D world. And, this must be done in real time.
There are a number of ways to accomplish this. In this book, we will consider just two. The first is the most common and accessible method: using a handheld mobile device such as a smartphone or tablet. Its camera captures the environment, and the computer graphics are rendered on the device's screen.
A second technique, using wearable AR smartglasses, is just emerging in commercial devices, such as Microsoft HoloLens and Metavision's Meta 2. This is an optical see-through of the real world, with computer graphics shown on a wearable near-eye display.
Using a handheld mobile device, such as a smartphone or tablet, augmented reality uses the device's camera to capture the video of the real world and combine it with virtual objects.
As illustrated in the following image, running an AR app on a mobile device, you simply point its camera to a target in the real world and the app will recognize the target and render a 3D computer graphic registered to the target's position and orientation. This is handheld mobile video see-through augmented reality:
We use the words handheld and mobile because we're using a handheld mobile device. We use video see-through because we're using the device's camera to capture reality, which will be combined with computer graphics. The AR video image is displayed on the device's flat screen.
Mobile devices have features important for AR, including the following:
- Untethered and battery-powered
- Flat panel graphic display touchscreen input
- Rear-facing camera
- CPU (main processor), GPU (graphics processor), and memory
- Motion sensors, namely accelerometer for detecting linear motion and gyroscope for rotational motion
- GPS and/or other position sensors for geolocation and wireless and/or Wi-Fi data connection to the internet
Let's chat about each of these. First of all, mobile devices are... mobile.... Yeah, I know you get that. No wires. But what this really means is that like you, mobile devices are free to roam the real world. They are not tethered to a PC or other console. This is natural for AR because AR experiences take place in the real world, while moving around in the real world.
Mobile devices sport a flat panel color graphic display with excellent resolution and pixel density sufficient for handheld viewing distances. And, of course, the killer feature that helped catapult the iPhone revolution is the multitouch input sensor on the display that is used for interacting with the displayed images with your fingers.
A rear-facing camera is used to capture video from the real world and display it in real time on the screen. This video data is digital, so your AR app can modify it and combine virtual graphics in real time as well. This is a monocular image, captured from a single camera and thus a single viewpoint. Correspondingly, the computer graphics use a single viewpoint to render the virtual objects that go with it.
Today's mobile devices are quite powerful computers, including CPU (main processor) and GPU (graphics processor), both of which are critical for AR to recognize targets in the video, process sensor, and user input, and render the combined video on the screen. We continue to see these requirements and push hardware manufacturers to try ever harder to deliver higher performance.
Built-in sensors that measure motion, orientation, and other conditions are also key to the success of mobile AR. An accelerometer is used for detecting linear motion along three axes and a gyroscope for detecting rotational motion around the three axes. Using real-time data from the sensors, the software can estimate the device's position and orientation in real 3D space at any given time. This data is used to determine the specific view the device's camera is capturing and uses this 3D transformation to register the computer-generated graphics in 3D space as well.
In addition, GPS sensor can be used for applications that need to map where they are on the globe, for example, the use of AR to annotate a street view or mountain range or find a rogue Pokémon.
Last but not least, mobile devices are enabled with wireless communication and/or Wi-Fi connections to the internet. Many AR apps require an internet connection, especially when a database of recognition targets or metadata needs to be accessed online.
In contrast to handheld mobiles, AR devices worn like eyeglasses or futuristic visors, such as Microsoft HoloLens and Metavision Meta, may be referred to as optical see-through eyewear augmented reality devices, or simply, smartglasses. As illustrated in the following image, they do not use video to capture and render the real world. Instead, you look directly through the visor and the computer graphics are optically merged with the scene:
The display technologies used to implement optical see-through AR vary from vendor to vendor, but the principles are similar. The glass that you look through while wearing the device is not a basic lens material that might be prescribed by your optometrist. It uses a combiner lens much like a beam splitter, with an angled surface that redirects a projected image coming from the side toward your eye.
An optical see-through display will mix the light from the real world with the virtual objects. Thus, brighter graphics are more visible and effective; darker areas may get lost. Black pixels are transparent. For similar reasons, these devices do not work great in brightly lit environments. You don't need a very dark room but dim lighting is more effective.
We can refer to these displays as binocular. You look through the visor with both eyes. Like VR headsets, there will be two separate views generated, one for each eye to account for parallax and enhance the perception of 3D. In real life, each eye sees a slightly different view in front, offset by the inter-pupillary distance between your eyes. The augmented computer graphics must also be drawn separately for each eye with similar offset viewpoints.
One such device is Microsoft HoloLens, a standalone mobile unit; Metavision Meta 2 can be tethered to a PC using its processing resources. Wearable AR headsets are packed with hardware, yet they must be in a form factor that is lightweight and ergonomic so they can be comfortably worn as you move around. The headsets typically include the following:
- Lens optics, with a specific field of view
- Forward-facing camera
- Depth sensors for positional tracking and hand recognition
- Accelerometer and gyroscope for linear and rotational motion detection and near-ear audio speakers
- Microphone
Furthermore, as a standalone device, you could say that HoloLens is like wearing a laptop wrapped around your head--hopefully, not for the weight but the processing capacity! It runs Windows 10 and must handle all the spatial and graphics processing itself. To assist, Microsoft developed a custom chip called holographic processing unit (HPU) to complement the CPU and GPU.
Instead of headphones, wearable AR headsets often include near-ear speakers that don't block out environmental sounds. While handheld AR could also emit audio, it would come from the phone's speaker or the headphones you may have inserted into your ears. In either case, the audio would not be registered with the graphics. With wearable near-eye visual augmentation, it's safe to assume that your ears are close to your eyes. This enables the use of spatial audio for more convincing and immersive AR experiences.
The following image illustrates a more traditional target-based AR. The device camera captures a frame of video. The software analyzes the frame looking for a familiar target, such as a pre-programmed marker, using a technique called photogrammetry. As part of target detection, its deformation (for example, size and skew) is analyzed to determine its distance, position, and orientation relative to the camera in a three-dimensional space.
From that, the camera pose (position and orientation) in 3D space is determined. These values are then used in the computer graphics calculations to render virtual objects. Finally, the rendered graphics are merged with the video frame and displayed to the user:
iOS and Android phones typically have a refresh rate of 60Hz. This means the image on your screen is updated 60 times a second, or 1.67 milliseconds per frame. A lot of work goes into this quick update. Also, much effort has been invested in optimizing the software to minimize any wasted calculations, eliminate redundancy, and other tricks that improve performance without negatively impacting user experience. For example, once a target has been recognized, the software will try to simply track and follow as it appears to move from one frame to the next rather than re-recognizing the target from scratch each time.
To interact with virtual objects on your mobile screen, the input processing required is a lot like any mobile app or game. As illustrated in the following image, the app detects a touch event on the screen. Then, it determines which object you intended to tap by mathematically casting a ray from the screen's XY position into 3D space, using the current camera pose. If the ray intersects a detectable object, the app may respond to the tap (for example, move or modify the geometry). The next time the frame is updated, these changes will be rendered on the screen:
A distinguishing characteristic of handheld mobile AR is that you experience it from an arm's length viewpoint. Holding the device out in front of you, you look through its screen like a portal to the augmented real world. The field of view is defined by the size of the device screen and how close you're holding it to your face. And it's not entirely a hands-free experience because unless you're using a tripod or something to hold the device, you're using one or two hands to hold the device at all times.
Snapchat's popular augmented reality selfies go even further. Using the phone's front-facing camera, the app analyzes your face using complex AI pattern matching algorithms to identify significant points, or nodes, that correspond to the features of your face--eyes, nose, lips, chin, and so on. It then constructs a 3D mesh, like a mask of your face. Using that, it can apply alternative graphics that match up with your facial features and even morph and distort your actual face for play and entertainment. See this video for a detailed explanation from Snapchat's Vox engineers: https://www.youtube.com/watch?v=Pc2aJxnmzh0. The ability to do all of this in real time is remarkably fun and serious business:
Perhaps, by the time you are reading this book, there will be mobile devices with built-in depth sensors, including Google Project Tango and Intel RealSense technologies, capable of scanning the environment and building a 3D spatial map mesh that could be used for more advanced tracking and interactions. We will explain these capabilities in the next topic and explore them in this book in the context of wearable AR headsets, but they may apply to new mobile devices too.
Handheld mobile AR described in the previous topic is mostly about augmenting 2D video with regard to the phone camera's location in 3D space. Optical wearable AR devices are completely about 3D data. Yes, like mobile AR, wearable AR devices can do target-based tracking using its built-in camera. But wait, there's more, much more!
These devices include depth sensors that scan your environment and construct a spatial map (3D mesh) of your environment. With this, you can register objects to specific surfaces without the need for special markers or a database of target images for tracking.
A depth sensor measures the distance of solid surfaces from you, using an infrared (IR) camera and projector. It projects IR dots into the environment (not visible to the naked eye) in a pattern that is then read by its IR camera and analyzed by the software (and/or hardware). On nearer objects, the dot pattern spread is different than further ones; depth is calculated using this displacement. Analysis is not performed on just a single snapshot but across multiple frames over time to provide more accuracy, so the spatial model can be continuously refined and updated.
A visible light camera may also be used in conjunction with the depth sensor data to further improve the spatial map. Using photogrammetry techniques, visible features in the scene are identified as a set of points (nodes) and tracked across multiple video frames. The 3D position of each node is calculated using triangulation.
From this, we get a good 3D mesh representation of the space, including the ability to discern separate objects that may occlude (be in front of) other objects. Other sensors locate the user's actual head in the real world, providing the user's own position and view of the scene. This technique is called SLAM. Originally developed for robotics applications, the 2002 seminal paper on this topic by Andrew Davison, University of Oxford, can be found at https://www.doc.ic.ac.uk/~ajd/Publications/davison_cml2002.pdf.
A cool thing about present day implementations of SLAP is how the data is continuously updated in response to real time sensor readings in your device.
The following illustration shows what occurs during each update frame. The device uses current readings from its sensors to maintain the spatial map and calculate the virtual camera pose. This camera transformation is then used to render views of the virtual objects registered to the mesh. The scene is rendered twice, for the left and right eye views. The computer graphics are displayed on the head-mounted visor glass and will be visible to the user as if it were really there--virtual objects sharing space with real world physical objects:
That said, spatial mapping is not limited to devices with depth sensing cameras. Using clever photogrammetry techniques, much can be accomplished in software alone. The Apple iOS ARKit, for example, uses just the video camera of the mobile device, processing each frame together with its various positional and motion sensors to fuse the data into a 3D point cloud representation of the environment. Google ARCore works similarly. The Vuforia SDK has a similar tool, albeit more limited, called Smart Terrain.
Developing AR with spatial mapping
Spatial mapping is the representation of all of the information the app has from its sensors about the real world. It is used to render virtual AR world objects. Specifically, spatial mapping is used to do the following:
- Help virtual objects or characters navigate around the room
- Have virtual objects occlude a real object or be occluded by a real object to interact with something, such as bouncing off the floor
- Place a virtual object onto a real object
- Show the user a visualization of the room they are in
In video game development, a level designer's job is to create the fantasy world stage, including terrains, buildings, passageways, obstacles, and so on. The Unity game development platform has great tools to constrain the navigation of objects and characters within the physical constraints of the level. Game developers, for example, add simplified geometry, or navmesh, derived from a detailed level design; it is used to constrain the movement of characters within a scene. In many ways, the AR spatial map acts like a navmesh for your virtual AR objects.
A spatial map, while just a mesh, is 3D and does represent the surfaces of solid objects, not just walls and floors but furniture. When your virtual object moves behind a real object, the map can be used to occlude virtual objects with real-world objects when it's rendered on the display. Normally, occlusion is not possible without a spatial map.
When a spatial map has collider properties, it can be used to interact with virtual objects, letting them bump into or bounce off real-world surfaces.
Lastly, a spatial map could be used to transform physical objects directly. For example, since we know where the walls are, we can paint them a different color in AR.
This can get pretty complicated. A spatial map is just a triangular mesh. How can your application code determine physical objects from that? It's difficult but not an unsolvable problem. In fact, the HoloLens toolkit, for example, includes a spatialUnderstanding module that analyzes the spatial map and does higher level identification, such as identification of floor, ceiling, and walls, using techniques such as ray casting, topology queries, and shape queries.
Spatial mapping can encompass a whole lot of data that could overwhelm the processing resources of your device and deliver an underwhelming user experience. HoloLens, for example, mitigates this by letting you subdivide your physical space into what they call spatial surface observers, which in turn contain a set of spatial surfaces. An observer is a bounding volume that defines a region of space with mapping data as one or more surfaces. A surface is a triangle 3D mesh in real-world 3D space. Organizing and partitioning space reduces the dataset needed to be tracked, analyzed, and rendered for a given interaction.
Ordinarily AR eyewear devices neither use a game controller or clicker nor positionally tracked hand controllers. Instead, you use your hands. Hand gesture recognition is another challenging AI problem for computer vision and image processing.
In conjunction with tracking, where the user is looking (gaze), gestures are used to trigger events such as select, grab, and move. Assuming the device does not support eye tracking (moving your eyes without moving your head), the gaze reticle is normally at the center of your gaze. You must move your head to point to the object of interest that you want to interact with:
More advanced interactions could be enabled with true hand tracking, where the user's gaze is not necessarily used to identify the object to interact; however, you can reach out and touch the virtual objects and use your fingers to push, grab, or move elements in the scene. Voice command input is being increasingly used in conjunction with true hand tracking, instead of hand gestures.
Other AR display techniques
In addition to handheld video see-through and wearable optical see-through, there are other AR display techniques as well.
A monocular headset shows a single image in one eye, allowing the other eye to view the real world unaugmented. It tends to be lightweight and used more as a heads-up display (HUD), as if information were projected on the front of a helmet rather than registered to the 3D world. An example of this is Google Glass. While the technology can be useful in some applications, we are not considering it in this book.
Wearable video see-through uses a head-mounted display (HMD) with a camera and combines real-world video with virtual graphics on its near-eye display. This may be possible on VR headsets such as HTC Vive and Samsung GearVR, with camera passthrough enabled, but it has a few problems. First, these VR devices do not have depth sensors to scan the environment, preventing the registration of graphics with the real 3D world.
The camera on such devices is monoscopic, yet the VR display is stereoscopic. Both the eyes see the same image, or what is called bi-ocular. This will cause issues in correctly rendering the graphics and registering to the real world.
Another problem is that the device's camera is offset from your actual eyes in front by an inch or more. The viewpoint of the camera is not the same as your eyes; the graphics would need to be registered accordingly.
For these reasons, wearable video see-through AR presently can look pretty weird, feel uncomfortable, and generally not work very well. But if you have one of these devices, feel free to try the projects in this book on it and see how it works. Also, we can expect new devices to come on the market soon which will position themselves as combined VR + AR and hopefully solve these issues, perhaps with dual stereo cameras, optical correction, or other solutions.