We're at the dawn of a whole new computing platform, preceded by personal computers, the internet, and mobile device revolutions. Augmented reality (AR) is the future, today!
Let's help invent this future where your daily world is augmented by digital information, assistants, communication, and entertainment. As it emerges, there is a booming need for developers and other skilled makers to design and build these applications.
This book aims to educate you about the underlying AR technologies, best practices, and steps for making AR apps, using some of the most powerful and popular 3D development tools available, including Unity with Vuforia, Apple ARKit, Google ARCore, Microsoft HoloLens, and the open source ARToolkit. We will guide you through the making of quality content appropriate to a variety of AR devices and platforms and their intended uses.
In this first chapter, we introduce you to AR and talk about how it works and how it can be used. We will explore some of the key concepts and technical achievements that define the state of the art today. We then show examples of effective AR applications, and introduce the devices, platforms, and development tools that will be covered throughout this book.
Welcome to the future!
We will cover the following topics in this chapter:
- Augmented reality versus virtual reality
- How AR works
- Types of markers
- Technical issues with augmented reality
- Applications of augmented reality
- The focus of this book
Simply put, AR is the combination of digital data and real-world human sensory input in real-time that is apparently attached (registered) to the physical space.
AR is most often associated with visual augmentation, where computer graphics are combined with actual world imagery. Using a mobile device, such as a smartphone or tablet, AR combines graphics with video. We refer to this as handheld video see-through. The following is an image of the Pokémon Go game that brought AR to the general public in 2016:
AR is not really new; it has been explored in research labs, the military, and other industries since the 1990s. Software toolkits for desktop PCs have been available as both open source and proprietary platforms since the late 1990s. The proliferation of smartphones and tablets has accelerated industrial and consumer interest in AR. And certainly, opportunities for handheld AR have not yet reached their full potential, with Apple only recently entering the fray with its release of ARKit for iOS in June 2017 and Google's release of ARCore SDK for Android in August 2017.
Much of today's interest and excitement for AR is moving toward wearable eyewear AR with optical see-through tracking. These sophisticated devices, such as Microsoft HoloLens and Metavision's Meta headsets, and yet-to-be-revealed (as of this writing) devices from Magic Leap and others use depth sensors to scan and model your environment and then register computer graphics to the real-world space. The following is a depiction of a HoloLens device used in a classroom:
However, AR doesn't necessarily need to be visual. Consider a blind person using computer-generated auditory feedback to help guide them through natural obstacles. Even for a sighted person, a system like that, which augments the perception of your real-world surroundings with auditory assistance, is very useful. Inversely, consider a deaf person using an AR device that listens to the sounds and words going on around them and displays them visually.
Also, consider tactile displays as augmented reality for touch. A simple example is the Apple Watch with a mapping app that taps you on your wrist with haptic vibrations to remind you it's time to turn at the next intersection. Bionics is another example of this. It's not hard to consider the current advances in prosthetics for amputees as AR for the body, augmenting kinesthesia, the perception of body position and movement.
Then, there's this idea of augmenting spatial cognition and wayfinding. In 2004, researcher Udo Wachter built and wore a belt on his waist, lined with haptic vibrators (buzzers) attached every few inches. The buzzer facing north at any given moment would vibrate, letting him constantly know what direction he was facing. Udo's sense of direction improved dramatically over a period of weeks (https://www.wired.com/2007/04/esp/):
Can AR apply to smell or taste? I don't really know, but researchers have been exploring these possibilities as well.
"What is real? How do you define 'real'? If you're talking about what you can feel, what you can smell, and what you can taste and see, then 'real' is simply electrical signals interpreted by your brain." ~ Morpheus in The Matrix (1999)
OK, this may be getting weird and very science fictiony. (Have you read Ready Player One and Snow Crash?) But let's play along a little bit more before we get into the crux of this specific book.
According to the Merriam-Webster dictionary (https://www.merriam-webster.com), the word augment is defined as, to make greater, more numerous, larger, or more intense. And reality is defined as, the quality or state of being real. Take a moment to reflect on this. You will realize that augmented reality, at its core, is about taking what is real and making it greater, more intense, and more useful.
Apart from this literal definition, augmented reality is a technology and, more importantly, a new medium whose purpose is to improve human experiences, whether they be directed tasks, learning, communication, or entertainment. We use the word real a lot when talking about AR: real-world, real-time, realism, really cool!
As human flesh and blood, we experience the real world through our senses: eyes, ears, nose, tongue, and skin. Through the miracle of life and consciousness, our brains integrate these different types of input, giving us vivid living experiences. Using human ingenuity and invention, we have built increasingly powerful and intelligent machines (computers) that can also sense the real world, however humbly. These computers crunch data much faster and more reliably than us. AR is the technology where we allow machines to present to us a data-processed representation of the world to enhance our knowledge and understanding.
In this way, AR uses a lot of artificial intelligence (AI) technologies. One way AR crosses with AI is in the area of computer vision. Computer vision is seen as a part of AI because it utilizes techniques for pattern recognition and machine learning. AR uses computer vision to recognize targets in your field of view, whether through specific coded markers, natural feature tracking (NFT), or other techniques to recognize objects or text. Once your app recognizes a target and establishes its location and orientation in the real world, it can generate computer graphics that align with those real-world transforms, overlaid on top of the real-world imagery.
However, augmented reality is not just the combining of computer data with human senses. There's more to it than that. In his acclaimed 1997 research report, A Survey of Augmented Reality (http://www.cs.unc.edu/~azuma/ARpresence.pdf), Ronald Azuma proposed that AR meet the following characteristics:
- Combines real and virtual
- Interactive in real time
- Registered in 3D
AR is experienced in real time, not pre-recorded. Cinematic special effects, for example, that combine real action with computer graphics do not count as AR.
Also, the computer-generated display must be registered to the real 3D world. 2D overlays do not count as AR. By this definition, various head-up displays, such as in Iron Man or even Google Glass, are not AR. In AR, the app is aware of its 3D surroundings and graphics are registered to that space. From the user's point of view, AR graphics could actually be real objects physically sharing the space around them.
Throughout this book, we will emphasize these three characteristics of AR. Later in this chapter, we will explore the technologies that enable this fantastic combination of real and virtual, real-time interactions, and registration in 3D.
As wonderful as this AR future may seem, before moving on, it would be remiss not to highlight the alternative possible dystopian future of augmented reality! If you haven't seen it yet, we strongly recommend watching the Hyper-Reality video produced by artist Keiichi Matsuda (https://vimeo.com/166807261). This depiction of an incredible, frightening, yet very possible potential future infected with AR, as the artist explains, presents a provocative and kaleidoscopic new vision of the future, where physical and virtual realities have merged, and the city is saturated in media. But let's not worry about that right now. A screenshot of the video is as follows:
Virtual reality (VR) is a sister technology of AR. As described, AR augments your current experience in the real world by adding digital data to it. In contrast, VR magically, yet convincingly, transports you to a different (computer-generated) world. VR is intended to be a totally immersive experience in which you are no longer in the current environment. The sense of presence and immersion are critical for VR's success.
AR does not carry that burden of creating an entire world. For AR, it is sufficient for computer-generated graphics to be added to your existing world space. Although, as we'll see, that is not an easy accomplishment either and in some ways is much more difficult than VR. They have much in common, but AR and VR have contrasting technical challenges, market opportunities, and useful applications.
Although financial market projections change from month to month, analysts consistently agree that the combined VR/AR market will be huge, as much as $108 billion by 2021 (http://www.digi-capital.com/news/2017/01/after-mixed-year-mobile-ar-to-drive-108-billion-vrar-market-by-2021) with AR representing over 75 percent of that. This is in no way a rebuff of VR; its market will continue to be very big and growing, but it is projected to be dwarfed by AR.
Since VR is so immersive, its applications are inherently limited. As a user, the decision to put on a VR headset and enter into a VR experience is, well, a commitment. Seriously! You are deciding to move yourself from where you are now and to a different place.
AR, however, brings virtual stuff to you. You physically stay where you are and augment that reality. This is a safer, less engaging, and more subtle transaction. It carries a lower barrier for market adoption and user acceptance.
VR headsets visually block off the real world. This is very intentional. No external light should seep into the view. In VR, everything you see is designed and produced by the application developer to create the VR experience. The technology design and development implications of this requirement are immense. A fundamental problem with VR is motion-to-photon latency. When you move your head, the VR image must update quickly, within about 11 milliseconds at 90 frames per second, or you risk experiencing motion sickness. There are multiple theories as to why this happens (see https://en.wikipedia.org/wiki/Virtual_reality_sickness).
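The 11-millisecond figure follows directly from the frame rate. As a minimal sketch of the arithmetic (the function name is ours, for illustration only):

```python
def frame_budget_ms(frames_per_second: float) -> float:
    """Time available to produce a single frame, in milliseconds."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return 1000.0 / frames_per_second


for fps in (60, 90):
    # One second (1000 ms) divided evenly among the frames shown in it
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
```

Everything the app does for a frame, including reading sensors, updating the scene, and rendering, has to fit inside that budget.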
In AR, latency is much less of a problem because most of the visual field is the real world, either a video or optical see-through. You're less likely to experience vertigo when most of what you see is real world. Generally, there's a lot less graphics to render and less physics to calculate in each AR frame.
VR also imposes huge demands on your device's CPU and GPU processors to generate the 3D view for both left and right eyes. VR generates graphics for the entire scene as well as physics, animations, audio, and other processing requirements. Not as much rendering power is required by AR.
On the other hand, AR has an extra burden not borne by VR. AR must register its graphics with the real world. This can be quite complicated, computationally. When based on video processing, AR must engage image processing pattern recognition in real time to find and follow the target markers. More complex devices use depth sensors to build and track a scanned model of your physical space in real time (Simultaneous Localization and Mapping, or SLAM). As we'll see, there are a number of ways AR applications manage this complexity, using simple target shapes or clever image recognition and matching algorithms with predefined natural images. On wearable devices, custom depth-sensing hardware and semiconductors, along with geolocation sensors, are used to calculate a 3D mesh of the user's environment in real time. This, in turn, is used to register the position and orientation of the computer graphics superimposed on the real-world visuals.
VR headsets ordinarily include headphones that, like the visual display, preferably block outside sounds in the real world so you can be fully immersed in the virtual one using spatial audio. In contrast, AR headsets provide open-back headphones or small speakers (instead of headphones) that allow the mix of real-world sounds with the spatial audio coming from the virtual scene.
Because of these inherent differences between AR and VR, the applications of these technologies can be quite different. In our opinion, a lot of applications presently being explored for VR will eventually find their home in AR instead. Even in cases where it's ambiguous whether the application could either augment the real world versus transport the user to a virtual space, the advantage of AR not isolating you from the real world will be key to the acceptance of these applications. Gaming will be prevalent with both AR and VR, albeit the games will be different. Cinematic storytelling and experiences that require immersive presence will continue to thrive in VR. But all other applications of 3D computer simulations may find their home in the AR market.
For developers, a key difference between VR and AR, especially when considering head-mounted wearable devices, is that VR is presently available in the form of consumer devices, such as Oculus Rift, HTC Vive, PlayStation VR, and Google Daydream, with millions of devices already in consumers' hands. Wearable AR devices are still in Beta release and quite expensive. That makes VR business opportunities more realistic and measurable. As a result, AR is largely confined to handheld (phone or tablet-based) apps for consumers, or if you delve into wearables, it's an internal corporate project, experimental project, or speculative product R&D.
We've discussed what augmented reality is, but how does it work? As we said earlier, AR requires that we combine the real environment with a computer-generated virtual environment. The graphics are registered to the real 3D world. And, this must be done in real time.
There are a number of ways to accomplish this. In this book, we will consider just two. The first is the most common and accessible method: using a handheld mobile device such as a smartphone or tablet. Its camera captures the environment, and the computer graphics are rendered on the device's screen.
A second technique, using wearable AR smartglasses, is just emerging in commercial devices, such as Microsoft HoloLens and Metavision's Meta 2. This is an optical see-through of the real world, with computer graphics shown on a wearable near-eye display.
Using a handheld mobile device, such as a smartphone or tablet, augmented reality uses the device's camera to capture the video of the real world and combine it with virtual objects.
As illustrated in the following image, running an AR app on a mobile device, you simply point its camera to a target in the real world and the app will recognize the target and render a 3D computer graphic registered to the target's position and orientation. This is handheld mobile video see-through augmented reality:
We use the words handheld and mobile because we're using a handheld mobile device. We use video see-through because we're using the device's camera to capture reality, which will be combined with computer graphics. The AR video image is displayed on the device's flat screen.
Mobile devices have features important for AR, including the following:
- Untethered and battery-powered
- Flat panel graphic display with touchscreen input
- Rear-facing camera
- CPU (main processor), GPU (graphics processor), and memory
- Motion sensors, namely accelerometer for detecting linear motion and gyroscope for rotational motion
- GPS and/or other position sensors for geolocation
- Wireless and/or Wi-Fi data connection to the internet
Let's chat about each of these. First of all, mobile devices are... mobile.... Yeah, I know you get that. No wires. But what this really means is that like you, mobile devices are free to roam the real world. They are not tethered to a PC or other console. This is natural for AR because AR experiences take place in the real world, while moving around in the real world.
Mobile devices sport a flat panel color graphic display with excellent resolution and pixel density sufficient for handheld viewing distances. And, of course, the killer feature that helped catapult the iPhone revolution is the multitouch input sensor on the display that is used for interacting with the displayed images with your fingers.
A rear-facing camera is used to capture video from the real world and display it in real time on the screen. This video data is digital, so your AR app can modify it and combine virtual graphics in real time as well. This is a monocular image, captured from a single camera and thus a single viewpoint. Correspondingly, the computer graphics use a single viewpoint to render the virtual objects that go with it.
Today's mobile devices are quite powerful computers, including a CPU (main processor) and GPU (graphics processor), both of which are critical for AR to recognize targets in the video, process sensor and user input, and render the combined video on the screen. These requirements continue to push hardware manufacturers to deliver ever higher performance.
Built-in sensors that measure motion, orientation, and other conditions are also key to the success of mobile AR. An accelerometer is used for detecting linear motion along three axes and a gyroscope for detecting rotational motion around the three axes. Using real-time data from the sensors, the software can estimate the device's position and orientation in real 3D space at any given time. This data is used to determine the specific view the device's camera is capturing, and the resulting 3D transformation is used to register the computer-generated graphics in 3D space as well.
In addition, a GPS sensor can be used for applications that need to map where they are on the globe, for example, the use of AR to annotate a street view or mountain range or find a rogue Pokémon.
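To make the idea of combining motion sensors more concrete, here is a minimal sketch of a complementary filter, one common way to blend gyroscope and accelerometer readings into a single tilt angle. It is a simplified illustration for one axis only, not the algorithm used by any particular device or SDK; the axis convention, the function names, and the `alpha` weighting are all assumptions made for this example.

```python
import math


def accel_pitch(ax: float, ay: float, az: float) -> float:
    """Pitch (radians) implied by the direction of gravity in accelerometer data.

    Assumes the device is not accelerating, so the reading is gravity alone.
    """
    return math.atan2(-ax, math.hypot(ay, az))


def complementary_filter(pitch, gyro_rate, accel, dt, alpha=0.98):
    """Blend a gyro-integrated pitch with the accelerometer's pitch estimate.

    pitch      previous pitch estimate, in radians
    gyro_rate  angular velocity around the pitch axis, in radians per second
    accel      (ax, ay, az) accelerometer reading
    dt         time since the previous update, in seconds
    alpha      how much to trust the gyro (assumed value, tune per device)
    """
    gyro_estimate = pitch + gyro_rate * dt   # smooth, but drifts over time
    accel_estimate = accel_pitch(*accel)     # noisy, but does not drift
    return alpha * gyro_estimate + (1.0 - alpha) * accel_estimate


# A device lying flat and still: a wrong initial estimate decays toward zero.
pitch = 0.5
for _ in range(500):
    pitch = complementary_filter(pitch, 0.0, (0.0, 0.0, 9.81), dt=0.01)
print(f"pitch after 500 updates: {pitch:.5f} rad")
```

The gyroscope gives smooth short-term motion while the accelerometer slowly corrects accumulated drift, which is why the two sensors are usually used together.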
Last but not least, mobile devices are enabled with wireless communication and/or Wi-Fi connections to the internet. Many AR apps require an internet connection, especially when a database of recognition targets or metadata needs to be accessed online.
In contrast to handheld mobiles, AR devices worn like eyeglasses or futuristic visors, such as Microsoft HoloLens and Metavision Meta, may be referred to as optical see-through eyewear augmented reality devices, or simply, smartglasses. As illustrated in the following image, they do not use video to capture and render the real world. Instead, you look directly through the visor and the computer graphics are optically merged with the scene:
The display technologies used to implement optical see-through AR vary from vendor to vendor, but the principles are similar. The glass that you look through while wearing the device is not a basic lens material that might be prescribed by your optometrist. It uses a combiner lens much like a beam splitter, with an angled surface that redirects a projected image coming from the side toward your eye.
An optical see-through display will mix the light from the real world with the virtual objects. Thus, brighter graphics are more visible and effective; darker areas may get lost. Black pixels are transparent. For similar reasons, these devices do not work great in brightly lit environments. You don't need a very dark room but dim lighting is more effective.
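The reason black pixels appear transparent can be sketched with a very simplified additive model, in which the light reaching your eye is the real-world light plus whatever the display emits. This is an idealization for illustration (real optics also attenuate the incoming light), and the function name and the 0.0 to 1.0 intensity scale are assumptions.

```python
def optical_blend(real_rgb, virtual_rgb):
    """Perceived color when display light is added to real-world light.

    Intensities are on an assumed 0.0 to 1.0 scale and clamped at 1.0.
    """
    return tuple(min(1.0, r + v) for r, v in zip(real_rgb, virtual_rgb))


dim_room = (0.1, 0.1, 0.1)
bright_room = (0.9, 0.9, 0.9)
red_graphic = (0.6, 0.0, 0.0)
black_graphic = (0.0, 0.0, 0.0)

# A black pixel adds no light, so the real world shows through unchanged.
print(optical_blend(dim_room, black_graphic))
# In a dim room the red graphic stands out strongly against its surroundings.
print(optical_blend(dim_room, red_graphic))
# In a bright room the same graphic barely changes what you see.
print(optical_blend(bright_room, red_graphic))
```

Because the display can only add light, it cannot darken anything in the scene, which is why contrast suffers as the surroundings get brighter.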
We can refer to these displays as binocular. You look through the visor with both eyes. Like VR headsets, there will be two separate views generated, one for each eye to account for parallax and enhance the perception of 3D. In real life, each eye sees a slightly different view in front, offset by the inter-pupillary distance between your eyes. The augmented computer graphics must also be drawn separately for each eye with similar offset viewpoints.
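As a rough sketch of what "similar offset viewpoints" means, the two virtual cameras can be placed half the inter-pupillary distance to either side of the head position. The function name and the default distance of 63 millimeters are assumptions for illustration; actual devices calibrate this per user.

```python
def eye_positions(head_center, right_direction, ipd_m=0.063):
    """Return (left_eye, right_eye) positions in meters.

    head_center      (x, y, z) midpoint between the eyes
    right_direction  unit vector pointing to the wearer's right
    ipd_m            inter-pupillary distance (assumed typical value)
    """
    half = ipd_m / 2.0
    left = tuple(c - half * r for c, r in zip(head_center, right_direction))
    right = tuple(c + half * r for c, r in zip(head_center, right_direction))
    return left, right


left_eye, right_eye = eye_positions((0.0, 1.6, 0.0), (1.0, 0.0, 0.0))
print("left eye: ", left_eye)
print("right eye:", right_eye)
```

The scene is then rendered once from each of these positions, producing the slightly different images that give the impression of depth.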
One such device, Microsoft HoloLens, is a standalone mobile unit; Metavision Meta 2, by contrast, is tethered to a PC and uses the PC's processing resources. Wearable AR headsets are packed with hardware, yet they must be in a form factor that is lightweight and ergonomic so they can be comfortably worn as you move around. The headsets typically include the following:
- Lens optics, with a specific field of view
- Forward-facing camera
- Depth sensors for positional tracking and hand recognition
- Accelerometer and gyroscope for linear and rotational motion detection
- Near-ear audio speakers
Furthermore, as a standalone device, you could say that HoloLens is like wearing a laptop wrapped around your head--hopefully, not for the weight but the processing capacity! It runs Windows 10 and must handle all the spatial and graphics processing itself. To assist, Microsoft developed a custom chip called holographic processing unit (HPU) to complement the CPU and GPU.
Instead of headphones, wearable AR headsets often include near-ear speakers that don't block out environmental sounds. While handheld AR could also emit audio, it would come from the phone's speaker or the headphones you may have inserted into your ears. In either case, the audio would not be registered with the graphics. With wearable near-eye visual augmentation, it's safe to assume that your ears are close to your eyes. This enables the use of spatial audio for more convincing and immersive AR experiences.
The following image illustrates a more traditional target-based AR. The device camera captures a frame of video. The software analyzes the frame looking for a familiar target, such as a pre-programmed marker, using a technique called photogrammetry. As part of target detection, its deformation (for example, size and skew) is analyzed to determine its distance, position, and orientation relative to the camera in a three-dimensional space.
From that, the camera pose (position and orientation) in 3D space is determined. These values are then used in the computer graphics calculations to render virtual objects. Finally, the rendered graphics are merged with the video frame and displayed to the user:
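To give a feel for how a target's apparent size relates to its distance, here is a minimal sketch using an idealized pinhole camera model with the target squarely facing the camera. Real trackers solve for the full pose, including orientation from the skew, so this covers only the distance part; the function name and the example numbers are hypothetical.

```python
def distance_to_target(focal_length_px, real_width_m, apparent_width_px):
    """Estimate camera-to-target distance in meters with a pinhole model.

    focal_length_px    camera focal length expressed in pixels
    real_width_m       known physical width of the printed target
    apparent_width_px  width of the target as measured in the video frame
    """
    if apparent_width_px <= 0:
        raise ValueError("target must be visible in the frame")
    return focal_length_px * real_width_m / apparent_width_px


# A 10 cm wide marker seen with an assumed 800-pixel focal length:
for width_px in (160, 80, 40):
    d = distance_to_target(800, 0.10, width_px)
    print(f"{width_px:3d} px wide -> about {d:.2f} m away")
```

Halving the apparent width doubles the estimated distance, which is why the physical size of a marker must be known in advance for the scale to come out right.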
iOS and Android phones typically have a refresh rate of 60Hz. This means the image on your screen is updated 60 times a second, or about 16.7 milliseconds per frame. A lot of work goes into this quick update. Also, much effort has been invested in optimizing the software to minimize wasted calculations, eliminate redundancy, and apply other tricks that improve performance without negatively impacting user experience. For example, once a target has been recognized, the software will try to simply track and follow it as it appears to move from one frame to the next rather than re-recognizing the target from scratch each time.
To interact with virtual objects on your mobile screen, the input processing required is a lot like any mobile app or game. As illustrated in the following image, the app detects a touch event on the screen. Then, it determines which object you intended to tap by mathematically casting a ray from the screen's XY position into 3D space, using the current camera pose. If the ray intersects a detectable object, the app may respond to the tap (for example, move or modify the geometry). The next time the frame is updated, these changes will be rendered on the screen:
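The ray casting step can be sketched in a few lines. This is a simplified illustration, not the implementation of any particular engine: it assumes a pinhole camera at the origin looking along the +z axis, a focal length given in pixels, and a sphere standing in for the tappable object. All names here are hypothetical.

```python
import math


def screen_point_to_ray(x_px, y_px, width, height, focal_px):
    """Unit direction of the ray through a screen pixel (camera at origin, +z forward)."""
    dx = x_px - width / 2.0
    dy = y_px - height / 2.0
    length = math.sqrt(dx * dx + dy * dy + focal_px * focal_px)
    return (dx / length, dy / length, focal_px / length)


def ray_hits_sphere(origin, direction, center, radius):
    """Distance along the ray to the nearest hit on the sphere, or None if it misses.

    The direction must be a unit vector.
    """
    oc = tuple(o - c for o, c in zip(origin, center))
    b = 2.0 * sum(a * d for a, d in zip(oc, direction))
    c = sum(a * a for a in oc) - radius * radius
    discriminant = b * b - 4.0 * c
    if discriminant < 0:
        return None  # the ray passes beside the sphere
    root = math.sqrt(discriminant)
    for t in ((-b - root) / 2.0, (-b + root) / 2.0):
        if t >= 0:
            return t  # nearest intersection in front of the camera
    return None  # the sphere is entirely behind the camera


# A tap in the middle of a 1080x1920 screen, with a sphere 5 m straight ahead.
ray = screen_point_to_ray(540, 960, 1080, 1920, focal_px=1000)
print(ray_hits_sphere((0, 0, 0), ray, center=(0, 0, 5), radius=1))
```

In a real app the ray would first be transformed by the current camera pose, and it would be tested against every tappable object, with the nearest hit winning.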
A distinguishing characteristic of handheld mobile AR is that you experience it from an arm's length viewpoint. Holding the device out in front of you, you look through its screen like a portal to the augmented real world. The field of view is defined by the size of the device screen and how close you're holding it to your face. And it's not entirely a hands-free experience because unless you're using a tripod or something to hold the device, you're using one or two hands to hold the device at all times.
Snapchat's popular augmented reality selfies go even further. Using the phone's front-facing camera, the app analyzes your face using complex AI pattern matching algorithms to identify significant points, or nodes, that correspond to the features of your face--eyes, nose, lips, chin, and so on. It then constructs a 3D mesh, like a mask of your face. Using that, it can apply alternative graphics that match up with your facial features and even morph and distort your actual face for play and entertainment. See this video from Vox for a detailed explanation of how Snapchat's engineers do it: https://www.youtube.com/watch?v=Pc2aJxnmzh0. The ability to do all of this in real time is remarkably fun, and it is serious business:
Perhaps, by the time you are reading this book, there will be mobile devices with built-in depth sensors, including Google Project Tango and Intel RealSense technologies, capable of scanning the environment and building a 3D spatial map mesh that could be used for more advanced tracking and interactions. We will explain these capabilities in the next topic and explore them in this book in the context of wearable AR headsets, but they may apply to new mobile devices too.
Handheld mobile AR, described in the previous topic, is mostly about augmenting 2D video with regard to the phone camera's location in 3D space. Optical wearable AR devices are completely about 3D data. Yes, like mobile AR, wearable AR devices can do target-based tracking using their built-in cameras. But wait, there's more, much more!
These devices include depth sensors that scan your environment and construct a spatial map (3D mesh) of your environment. With this, you can register objects to specific surfaces without the need for special markers or a database of target images for tracking.
A depth sensor measures the distance of solid surfaces from you, using an infrared (IR) camera and projector. It projects IR dots into the environment (not visible to the naked eye) in a pattern that is then read by its IR camera and analyzed by the software (and/or hardware). On nearer objects, the dot pattern spread is different from that on farther ones; depth is calculated using this displacement. Analysis is not performed on just a single snapshot but across multiple frames over time to provide more accuracy, so the spatial model can be continuously refined and updated.
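The relationship between dot displacement and depth can be sketched with the standard triangulation formula for a projector and camera separated by a known baseline. This is an idealized model for illustration; the function name and all of the numbers below are hypothetical, and actual sensors apply calibration and filtering on top of it.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth in meters of a projected dot, from its observed shift in the IR image.

    focal_length_px  IR camera focal length expressed in pixels
    baseline_m       distance between the IR projector and the IR camera
    disparity_px     how far the dot appears shifted, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


# Nearer surfaces shift the dots more; farther surfaces shift them less.
for disparity in (58.0, 29.0, 14.5):
    z = depth_from_disparity(580.0, 0.075, disparity)
    print(f"shift of {disparity:4.1f} px -> depth of about {z:.2f} m")
```

Because depth is inversely proportional to the shift, small measurement errors matter more for distant surfaces, which is one reason readings are refined across multiple frames.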
A visible light camera may also be used in conjunction with the depth sensor data to further improve the spatial map. Using photogrammetry techniques, visible features in the scene are identified as a set of points (nodes) and tracked across multiple video frames. The 3D position of each node is calculated using triangulation.
From this, we get a good 3D mesh representation of the space, including the ability to discern separate objects that may occlude (be in front of) other objects. Other sensors locate the user's actual head in the real world, providing the user's own position and view of the scene. This technique is called SLAM. It was originally developed for robotics applications; the seminal 2002 paper on this topic by Andrew Davison, University of Oxford, can be found at https://www.doc.ic.ac.uk/~ajd/Publications/davison_cml2002.pdf.
A cool thing about present-day implementations of SLAM is how the data is continuously updated in response to real-time sensor readings in your device.
"As the HoloLens gathers new data about the environment, and as changes to the environment occur, spatial surfaces will appear, disappear and change." (https://developer.microsoft.com/en-us/windows/holographic/spatial_mapping)
The following illustration shows what occurs during each update frame. The device uses current readings from its sensors to maintain the spatial map and calculate the virtual camera pose. This camera transformation is then used to render views of the virtual objects registered to the mesh. The scene is rendered twice, for the left and right eye views. The computer graphics are displayed on the head-mounted visor glass and will be visible to the user as if they were really there--virtual objects sharing space with real-world physical objects:
That said, spatial mapping is not limited to devices with depth sensing cameras. Using clever photogrammetry techniques, much can be accomplished in software alone. The Apple iOS ARKit, for example, uses just the video camera of the mobile device, processing each frame together with its various positional and motion sensors to fuse the data into a 3D point cloud representation of the environment. Google ARCore works similarly. The Vuforia SDK has a similar tool, albeit more limited, called Smart Terrain.
Spatial mapping is the representation of all of the information the app has from its sensors about the real world. It is used to render virtual AR world objects. Specifically, spatial mapping is used to do the following:
- Help virtual objects or characters navigate around the room
- Have virtual objects occlude a real object or be occluded by a real object
- Have virtual objects interact with something, such as bouncing off the floor
- Place a virtual object onto a real object
- Show the user a visualization of the room they are in
In video game development, a level designer's job is to create the fantasy world stage, including terrains, buildings, passageways, obstacles, and so on. The Unity game development platform has great tools to constrain the navigation of objects and characters within the physical constraints of the level. Game developers, for example, add simplified geometry, or navmesh, derived from a detailed level design; it is used to constrain the movement of characters within a scene. In many ways, the AR spatial map acts like a navmesh for your virtual AR objects.
A spatial map, while just a mesh, is 3D and does represent the surfaces of solid objects, not just walls and floors but furniture. When your virtual object moves behind a real object, the map can be used to occlude virtual objects with real-world objects when it's rendered on the display. Normally, occlusion is not possible without a spatial map.
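Conceptually, occlusion comes down to a per-pixel depth comparison between the virtual object and the real surface recorded in the spatial map. The following is a minimal sketch of that comparison over a row of pixels; the function name and data layout are assumptions, and real renderers do this on the GPU with a depth buffer.

```python
def visible_virtual_pixels(virtual_depths, real_depths):
    """Return a list of booleans: True where the virtual object should be drawn.

    virtual_depths  distance to the virtual object per pixel, or None if the
                    object does not cover that pixel
    real_depths     distance to the nearest real surface per pixel, taken
                    from the spatial map
    """
    mask = []
    for virtual, real in zip(virtual_depths, real_depths):
        # Draw only when the virtual surface is nearer than the real one.
        mask.append(virtual is not None and virtual < real)
    return mask


# A virtual object 2 m away, partly behind a real chair that is 1.5 m away.
virtual_row = [2.0, 2.0, None]
real_row = [3.0, 1.5, 3.0]
print(visible_virtual_pixels(virtual_row, real_row))
```

Without the spatial map there are no real-world depths to compare against, so virtual objects would always be drawn on top.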
When a spatial map has collider properties, it can be used to interact with virtual objects, letting them bump into or bounce off real-world surfaces.
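As an illustration of what bouncing off a real-world surface involves, here is a minimal sketch of reflecting a velocity about a surface normal. In practice a physics engine handles this for you; the function name and the `restitution` parameter (how much speed survives the bounce) are assumptions for this example.

```python
def bounce(velocity, surface_normal, restitution=0.5):
    """Velocity after hitting a surface with the given unit normal.

    The component along the normal is reversed and scaled by restitution;
    the component along the surface is left unchanged.
    """
    along_normal = sum(v * n for v, n in zip(velocity, surface_normal))
    if along_normal >= 0:
        return tuple(velocity)  # already moving away from the surface
    scale = (1.0 + restitution) * along_normal
    return tuple(v - scale * n for v, n in zip(velocity, surface_normal))


# A virtual ball falling onto a floor whose normal points straight up.
floor_normal = (0.0, 1.0, 0.0)
print(bounce((1.0, -2.0, 0.0), floor_normal))                   # loses half its vertical speed
print(bounce((1.0, -2.0, 0.0), floor_normal, restitution=1.0))  # perfectly elastic
```

The surface normal would come from the triangle of the spatial map that the object collided with.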
Lastly, a spatial map could be used to transform physical objects directly. For example, since we know where the walls are, we can paint them a different color in AR.
This can get pretty complicated. A spatial map is just a triangular mesh. How can your application code determine physical objects from that? It's difficult but not an unsolvable problem. In fact, the HoloLens toolkit, for example, includes a spatialUnderstanding module that analyzes the spatial map and does higher level identification, such as identification of floor, ceiling, and walls, using techniques such as ray casting, topology queries, and shape queries.
Spatial mapping can encompass a whole lot of data that could overwhelm the processing resources of your device and deliver an underwhelming user experience. HoloLens, for example, mitigates this by letting you subdivide your physical space into what they call spatial surface observers, which in turn contain a set of spatial surfaces. An observer is a bounding volume that defines a region of space with mapping data as one or more surfaces. A surface is a triangle 3D mesh in real-world 3D space. Organizing and partitioning space reduces the dataset needed to be tracked, analyzed, and rendered for a given interaction.
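The benefit of partitioning can be sketched with a simple bounding box that selects only the surfaces relevant to a region of interest. To be clear, this is not the HoloLens API; the class, function, and surface names below are hypothetical and only illustrate the idea of limiting work to a bounding volume.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Triangle = Tuple[Point, Point, Point]


@dataclass
class BoundingBox:
    """Axis-aligned region of space that we care about."""
    min_corner: Point
    max_corner: Point

    def contains(self, point: Point) -> bool:
        return all(lo <= v <= hi
                   for lo, v, hi in zip(self.min_corner, point, self.max_corner))


def surfaces_in_volume(surfaces: Dict[str, List[Triangle]],
                       box: BoundingBox) -> List[str]:
    """Ids of the surfaces that have at least one vertex inside the box."""
    selected = []
    for surface_id, triangles in surfaces.items():
        if any(box.contains(vertex) for tri in triangles for vertex in tri):
            selected.append(surface_id)
    return selected


surfaces = {
    "table": [((0.0, 0.7, 1.0), (1.0, 0.7, 1.0), (0.0, 0.7, 2.0))],
    "far_wall": [((10.0, 0.0, 10.0), (11.0, 0.0, 10.0), (10.0, 2.0, 10.0))],
}
nearby = BoundingBox((-2.0, 0.0, -2.0), (2.0, 2.0, 2.0))
print(surfaces_in_volume(surfaces, nearby))
```

Only the surfaces returned for the volume need to be tracked, analyzed, or rendered for a given interaction.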
For more information on spatial mapping with HoloLens and Unity, refer tohttps://developer.microsoft.com/en-us/windows/mixed-reality/spatial_mapping andhttps://developer.microsoft.com/en-us/windows/mixed-reality/spatial_mapping_in_unity.
Ordinarily AR eyewear devices neither use a game controller or clicker nor positionally tracked hand controllers. Instead, you use your hands. Hand gesture recognition is another challenging AI problem for computer vision and image processing.
In conjunction with tracking where the user is looking (gaze), gestures are used to trigger events such as select, grab, and move. Assuming the device does not support eye tracking (moving your eyes without moving your head), the gaze reticle is normally at the center of your gaze. You must move your head to point to the object of interest that you want to interact with:
More advanced interactions could be enabled with true hand tracking, where the user's gaze is not necessarily used to identify the object to interact with; instead, you can reach out and touch the virtual objects and use your fingers to push, grab, or move elements in the scene. Voice command input is being increasingly used in conjunction with true hand tracking, instead of hand gestures.
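The gaze-based selection described above boils down to casting a ray from the head position along the view direction and finding the nearest object it hits. The following sketch shows the idea with objects approximated as spheres. It is an assumption-laden illustration, not any SDK's API; the function names and the scene contents are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ray_sphere_distance(origin, direction, center, radius):
    """Distance along the ray to the sphere, or None if the ray misses it.

    direction must be a nonzero vector; it is normalized here.
    """
    length = math.sqrt(_dot(direction, direction))
    unit = tuple(d / length for d in direction)
    to_center = tuple(c - o for c, o in zip(center, origin))
    along = _dot(to_center, unit)  # distance to the closest approach
    perp_sq = _dot(to_center, to_center) - along * along
    if perp_sq > radius * radius:
        return None  # the ray passes beside the sphere
    half_chord = math.sqrt(radius * radius - perp_sq)
    if along + half_chord < 0:
        return None  # the sphere is entirely behind the viewer
    return max(along - half_chord, 0.0)


def pick_gaze_target(head_position, gaze_direction, targets):
    """Return the name of the nearest target under the gaze reticle, or None."""
    best_name, best_distance = None, None
    for name, (center, radius) in targets.items():
        distance = ray_sphere_distance(head_position, gaze_direction, center, radius)
        if distance is not None and (best_distance is None or distance < best_distance):
            best_name, best_distance = name, distance
    return best_name


# Hypothetical scene: positions in meters, each target is (center, radius).
scene = {
    "lamp": ((0.0, 0.0, 3.0), 0.5),
    "chair": ((0.0, 0.0, 5.0), 1.0),
    "door": ((3.0, 0.0, 3.0), 0.5),
}
print(pick_gaze_target((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene))
```

Looking straight ahead selects the lamp, because it is the closest object along the gaze ray even though the chair lies on the same line.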
In addition to handheld video see-through and wearable optical see-through, there are other AR display techniques as well.
A monocular headset shows a single image in one eye, allowing the other eye to view the real world unaugmented. It tends to be lightweight and used more as a heads-up display (HUD), as if information were projected on the front of a helmet rather than registered to the 3D world. An example of this is Google Glass. While the technology can be useful in some applications, we are not considering it in this book.
Wearable video see-through uses a head-mounted display (HMD) with a camera and combines real-world video with virtual graphics on its near-eye display. This may be possible on VR headsets such as HTC Vive and Samsung GearVR, with camera passthrough enabled, but it has a few problems. First, these VR devices do not have depth sensors to scan the environment, preventing the registration of graphics with the real 3D world.
Second, the camera on such devices is monoscopic, yet the VR display is stereoscopic. Both eyes see the same image, or what is called bi-ocular. This will cause issues in correctly rendering the graphics and registering to the real world.
Another problem is that the device's camera is offset from your actual eyes in front by an inch or more. The viewpoint of the camera is not the same as your eyes; the graphics would need to be registered accordingly.
For these reasons, wearable video see-through AR presently can look pretty weird, feel uncomfortable, and generally not work very well. But if you have one of these devices, feel free to try the projects in this book on it and see how it works. Also, we can expect new devices to come on the market soon which will position themselves as combined VR + AR and hopefully solve these issues, perhaps with dual stereo cameras, optical correction, or other solutions.
As we've seen and discussed, the essence of AR is that your device recognizes objects in the real world and renders the computer graphics registered to the same 3D space, providing the illusion that the virtual objects are in the same physical space with you.
Since augmented reality was first invented decades ago, the types of targets the software can recognize have progressed from very simple markers to images and natural feature tracking to full spatial map meshes. There are many AR development toolkits available; some of them are more capable than others of supporting a range of targets.
The following is a survey of various target types. We will go into more detail in later chapters, as we use different targets in different projects.
The most basic target is a simple marker with a wide border. The advantage of marker targets is that they're readily recognized by the software with very little processing overhead, and they minimize the risk of the app not working, for example, due to inconsistent ambient lighting or other environmental conditions. The following is the Hiro marker used in example projects in ARToolkit:
Taking simple markers to the next level, areas within the border can be reserved for 2D barcode patterns. This way, a single family of markers can be reused to pop up many different virtual objects by changing the encoded pattern. For example, a children's book may have an AR pop-up on each page, using the same marker shape, but the barcode directs the app to show only the objects relevant to that page in the book.
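As a toy illustration of how an encoded pattern can select content, the sketch below reads a grid of black and white cells as the bits of an integer ID and looks up what to show. This is not ARToolkit's or Vuforia's actual encoding (real schemes also deal with marker orientation and error checking); the grid layout and the page table are hypothetical.

```python
def decode_marker(cells):
    """Read a grid of cells (1 = black, 0 = white) row by row as a binary number."""
    marker_id = 0
    for row in cells:
        for cell in row:
            marker_id = (marker_id << 1) | (1 if cell else 0)
    return marker_id


# Hypothetical mapping from marker ID to the content for a page of the book.
PAGE_CONTENT = {
    5: "page 1: castle",
    6: "page 2: dragon",
}


def content_for_marker(cells):
    """Return the content for a marker, or None if its ID is unknown."""
    return PAGE_CONTENT.get(decode_marker(cells))


page_one = [
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
]
print(decode_marker(page_one), content_for_marker(page_one))
```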
The following is a set of very simple coded markers from ARToolkit:
Vuforia includes a powerful marker system called VuMark that makes it very easy to create branded markers, as illustrated in the following image. As you can see, while the marker styles vary for specific marketing purposes, they share common characteristics, including a reserved area within an outer border for the 2D code:
The ability to recognize and track arbitrary images is a tremendous boost to AR applications as it avoids the requirement of creating and distributing custom markers paired with specific apps. Image tracking falls into the category of natural feature tracking (NFT). There are characteristics that make a good target image, including having a well-defined border (preferably eight percent of the image width), irregular asymmetrical patterns, and good contrast. When an image is incorporated in your AR app, it's first analyzed and a feature map (2D node mesh) is stored and used to match real-world image captures, say, in frames of video from your phone.
It is worth noting that apps may be set up to see not just one marker in view but multiple markers. With multitargets, you can have virtual objects pop up for each marker in the scene simultaneously.
Similarly, markers can be printed and folded or pasted on geometric objects, such as product labels or toys. The following is an example cereal box target:
If a marker can include a 2D bar code, then why not just read text? Some AR SDKs allow you to configure your app (train) to read text in specified fonts. Vuforia goes further with a word list library and the ability to add your own words.
Your AR app can be configured to recognize basic shapes such as a cuboid or cylinder with specific relative dimensions. It's not just the shape but its measurements that may distinguish one target from another: Rubik's Cube versus a shoe box, for example. A cuboid may have width, height, and length. A cylinder may have a length and different top and bottom diameters (for example, a cone). In Vuforia's implementation of basic shapes, the texture patterns on the shaped object are not considered; anything with a similar shape will match. But when you point your app to a real-world object with that shape, it should have enough textured surface for good edge detection; a solid white cube would not be easily recognized.
The ability to recognize and track complex 3D objects is similar but goes beyond 2D image recognition. While planar images are appropriate for flat surfaces, books or simple product packaging, you may need object recognition for toys or consumer products without their packaging. Vuforia, for example, offers Vuforia Object Scanner to create object data files that can be used in your app for targets. The following is an example of a toy car being scanned by Vuforia Object Scanner:
Earlier, we introduced spatial maps and dynamic spatial location via SLAM. SDKs that support spatial maps may implement their own solutions and/or expose access to a device's own support. For example, the HoloLens SDK Unity package supports its native spatial maps, of course. Vuforia's spatial maps (called Smart Terrain) do not use depth sensing like HoloLens; rather, they use the visible light camera to construct the environment mesh using photogrammetry. Apple ARKit and Google ARCore also map your environment using the camera video fused with other sensor data.
A bit of an outlier, but worth mentioning, AR apps can also use just the device's GPS sensor to identify its location in the environment and use that information to annotate what is in view. I use the word annotate because GPS tracking is not as accurate as any of the techniques we have mentioned, so it wouldn't work for close-up views of objects. But it can work just fine, say, standing atop a mountain and holding your phone up to see the names of other peaks within the view or walking down a street to look up Yelp! reviews of restaurants within range. You can even use it for locating and capturing Pokémon.
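To sketch how GPS-based annotation can work, the following example computes the distance and compass bearing from the device to a point of interest, then checks whether that bearing falls within the camera's horizontal field of view. It uses the standard haversine and initial-bearing formulas on a spherical Earth; the function names are hypothetical, and we assume the device heading comes from a compass sensor.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def distance_and_bearing(lat1, lon1, lat2, lon2):
    """Great-circle distance (meters) and initial bearing (degrees from north)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    distance = 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
    y = math.sin(d_lambda) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(d_lambda)
    bearing = (math.degrees(math.atan2(y, x)) + 360.0) % 360.0
    return distance, bearing


def is_in_view(bearing_deg, heading_deg, fov_deg):
    """True if the bearing lies within the camera's horizontal field of view."""
    difference = (bearing_deg - heading_deg + 180.0) % 360.0 - 180.0
    return abs(difference) <= fov_deg / 2.0


# A point one degree of longitude east of the device, on the equator.
distance, bearing = distance_and_bearing(0.0, 0.0, 0.0, 1.0)
print(round(distance), round(bearing), is_in_view(bearing, heading_deg=80.0, fov_deg=60.0))
```

An app would run this test for each nearby point of interest and draw a label only for those in view, perhaps scaling the label by distance.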
As an introduction to developing for augmented reality, this book focuses on all kinds of target tracking. This way, each of our projects can be built using either handheld or eyewear AR devices. Where a project's user experience can be enhanced on a more advanced device or technique, we'll try to include suggestions and instructions for supporting that too.
In this section, we do a brief survey of some of the tricky issues that AR researchers have struggled with in the past and present, and are likely to struggle with in the future.
In a theatre, on a computer screen, in a handheld mobile device, or in an AR headset, the angle from one edge of the view area to the opposite is commonly referred to as the angle of view or field of view (FOV). For example, a typical movie theatre screen is about 54 degrees; an IMAX theatre screen is 70 degrees. The human eye has about 160 degrees horizontal (monocular) or 200 degrees combined binocular FOV. The HTC Vive VR headset has about 100 degrees. Note that sometimes FOV is reported as separate horizontal and vertical; other times, it's a (better-sounding) diagonal measure.
Although not commonly discussed this way, when you hold a mobile device in front of you, it does have a field of view, determined by the size of the screen and how far away you're holding it. So, a phone held at arm's length, about 18 inches away, covers just 10 degrees or so. This is why you often see people preferring a large screen tablet for mobile AR rather than a phone.
When it comes to wearables, the expectations are greater. The Microsoft HoloLens FOV is only about 35 degrees, equivalent to holding your smartphone about 6 to 8 inches in front of your face or using a 15-inch computer monitor on your desk. Fortunately, despite the limitation, users report that you get used to it and learn to move your head instead of your eyes to discover and follow objects of interest in AR. Metavision Meta 2 does better; its FOV is 90 degrees (diagonal).
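The relationship between screen size, viewing distance, and field of view is simple trigonometry. As a quick sketch (the function name and the 3-inch screen width are assumptions for illustration), the angle subtended by a flat screen viewed head-on is twice the arctangent of half its extent divided by the distance:

```python
import math


def angular_size_deg(extent, distance):
    """Angle (degrees) subtended by a flat screen of the given extent.

    extent and distance must use the same units. extent can be the width,
    height, or diagonal, depending on which FOV you want to report.
    """
    return math.degrees(2.0 * math.atan((extent / 2.0) / distance))


# A phone screen assumed to be 3 inches wide, held 18 inches away.
print(round(angular_size_deg(3.0, 18.0), 1))
```

For these assumed numbers, the result is roughly 9.5 degrees. Bringing the same screen closer, or using a wider screen at the same distance, increases the angle.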
The following image illustrates the effect of FOV when you wear a HoloLens device (image by Adam Dachis/NextReality):
The rendered image needs to satisfy the expectations of our visual perceptions to the extent that the goal of AR is to display virtual objects so they could realistically seem to reside in our physical environment. If the AR is just an overlay or annotation of the real world, then this may not be as important.
When rendering objects for 3D view, the views from the left and right eyes are offset slightly, based on your interpupillary distance (distance between the eyes), called parallax. This is not a problem and is handled in every VR and wearable AR device, but it's still worth mentioning.
Virtual AR objects coexisting in the real world that are in front of real objects should hide the objects behind them. That's easy; just draw the object on top. The opposite is not as simple. When the virtual object is behind a real-world one, say your virtual pet runs under a table or behind a sofa, it should be partially or completely hidden. This requires a spatial map of the environment; its mesh is used to occlude the computer graphics when rendering the scene.
An even more difficult problem comes up with photorealistic rendering of virtual objects. Ideally, you'd want the lighting on the object to match the lighting in the room itself. Suppose in the real world, the only light is a lamp in the corner of the room, but your AR object is lit from the opposite side. That would be conspicuously inconsistent and artificial. Apple ARKit and Google ARCore address this issue by capturing the ambient light color, intensity, and direction and then adjusting the virtual scene lighting accordingly, even offering the ability to calculate shadows from your virtual objects. This provides a more realistic render of your objects in the real world.
Photographers have known about depth of field since the beginning of photography. When a lens is focused on an object, things more in the foreground or further away may be out of focus; the range of distances that stays in focus is called the depth of field. Your eye has a lens too, and it adjusts to focus on near versus far objects, called accommodation. We can actually feel our eyes changing focus, and this oculomotor cue of the eye muscles tensing and relaxing also contributes to our depth perception.
However, using near-eye displays (in VR as well as AR), all the rendered objects are in focus, regardless of their distance perceived via parallax. Furthermore, the angle between your eyes changes when you're focused on something close up versus something further away, called vergence. So, we get mixed signals, focus (accommodation) on one distance and vergence at another. This results in what is called an accommodation-vergence conflict. This disparity can become tiring, at best, and inhibits the illusion of realism. This is a problem with both wearable AR and VR devices.
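To put rough numbers on this conflict, the sketch below computes the vergence angle for an object at a given distance and expresses the mismatch between where the eyes converge and where they focus in diopters (the reciprocal of distance in meters). The interpupillary distance of 63 mm and the fixed display focal distance of 2 m are assumptions for illustration, not specifications of any device.

```python
import math


def vergence_angle_deg(ipd_m, distance_m):
    """Angle between the two eyes' lines of sight when fixating at distance_m."""
    return math.degrees(2.0 * math.atan((ipd_m / 2.0) / distance_m))


def conflict_diopters(rendered_distance_m, focal_distance_m):
    """Mismatch between vergence distance and focus distance, in diopters."""
    return abs(1.0 / rendered_distance_m - 1.0 / focal_distance_m)


IPD_M = 0.063  # assumed interpupillary distance
DISPLAY_FOCAL_M = 2.0  # assumed fixed focal distance of the near-eye display

for rendered in (0.5, 2.0, 10.0):
    print(
        rendered,
        round(vergence_angle_deg(IPD_M, rendered), 2),
        round(conflict_diopters(rendered, DISPLAY_FOCAL_M), 2),
    )
```

Under these assumptions, the conflict vanishes for an object rendered at the display's focal distance and grows quickly as virtual objects are brought closer to the face.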
Potential solutions may emerge using eye tracking to adjust the rendered image according to your vergence. There is also the promise of advanced light-field technology that more accurately merges computer-generated graphics with real-world light patterns (see Magic Leap at https://www.magicleap.com).
No discussion of AR displays would be complete without mentioning pixels, those tiny colored dots that make up the image on the display screen. The more, the better. It is measured in terms of resolution. What's more important, perhaps, is pixel density, pixels per inch, as well as the color depth. Higher density displays produce a crisper image. Greater color depth, such as HDR displays (high dynamic range), provides more bits per pixel so there could be a more natural and linear range of brightness.
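Pixel density is straightforward to compute from a display's resolution and physical size. The sketch below uses a hypothetical 1920 x 1080 screen with a 5.5-inch diagonal. It also shows pixels per degree, a related measure that matters for near-eye displays; that second calculation is a simple average across the field of view and ignores lens distortion.

```python
import math


def pixels_per_inch(width_px, height_px, diagonal_in):
    """Pixel density from resolution and physical diagonal size."""
    return math.sqrt(width_px ** 2 + height_px ** 2) / diagonal_in


def pixels_per_degree(width_px, horizontal_fov_deg):
    """Average horizontal pixels per degree of field of view."""
    return width_px / horizontal_fov_deg


# Hypothetical handheld display.
print(round(pixels_per_inch(1920, 1080, 5.5), 1))
# Hypothetical near-eye display: 1280 pixels spread across 40 degrees.
print(pixels_per_degree(1280, 40.0))
```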
We also talk about motion-to-photon latency. This is the time it takes for the AR device to detect changes in location and orientation and have them represented on the screen. High latency is not just unrealistic and sloppy-feeling; it can result in motion sickness, especially in wearable displays. Depending on the device, the screen may refresh at 60 frames per second or more. All the sensor readings, analysis, geometry, and graphics rendering must be performed within that tiny timeframe or the user will experience latency.
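To see how tight that timeframe is, here is a small sketch that converts a refresh rate into a per-frame time budget and checks whether a set of pipeline stages fits within it. The stage names and timings are invented for illustration; real measurements would come from a profiler.

```python
def frame_budget_ms(refresh_hz):
    """Time available per frame, in milliseconds."""
    return 1000.0 / refresh_hz


def fits_in_frame(stage_times_ms, refresh_hz):
    """True if the summed stage times fit within one frame."""
    return sum(stage_times_ms.values()) <= frame_budget_ms(refresh_hz)


# Hypothetical per-frame costs in milliseconds.
stages = {
    "sensor read": 2.0,
    "tracking": 5.0,
    "app logic": 3.0,
    "render": 8.0,
}

print(round(frame_budget_ms(60), 2), fits_in_frame(stages, 60))
```

At 60 frames per second, the budget is under 17 milliseconds, so the 18 milliseconds of work assumed here would miss the frame and show up as lag.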
Finally, the look and feel and comfort of devices is critical to their market acceptance and usefulness in practical situations. Handheld devices continue to get thinner and more lightweight. That's awesome, provided they continue to be important for AR.
Most of us agree that eventually all of this will move into wearable eyewear. Unless you're in an industrial hardhat environment, we all look forward to the day when AR eyewear becomes as lightweight and comfortable as a pair of sunglasses. Then, we'll wish for AR contact lenses (then, retinal implants?).
In 2009, Rolf Hainich described the ultimate display in his book The End of Hardware: Augmented Reality and Beyond as follows:
a nonintrusive, comfortable, high-resolution, wide-FOV, near-eye display with high dynamic range and perfect tracking.
Why augmented reality? In today's world, we are flooded with vast amounts of information through 24/7 media, internet connectivity, and mobile devices. The problem is not whether we have enough information, but that we have too much. The challenge is how to filter, process, and use valuable information and ignore redundant, irrelevant, and incorrect information. This is explained by Schmalstieg and Höllerer in their book, Augmented Reality: Principles and Practice (Addison-Wesley, 2016):
"augmented reality holds the promise of creating direct, automatic, and actionable links between the physical world and electronic information. It provides a simple and immediate user interface to an electronically enhanced physical world. The immense potential of augmented reality as a paradigm-shifting user interface metaphor becomes apparent when we review the most recent few milestones in human-computer interaction: the emergence of the World Wide Web, the social web, and the mobile device revolution." - Augmented Reality: Principles and Practice, Schmalstieg & Höllerer
What kinds of applications can benefit from this? Well, just about every human endeavor that presently uses digital information of any kind. Here are a few examples that we will further illustrate as actual projects throughout this book.
AR markers printed on product packaging with a companion app could provide additional details about the product, testimonials, or marketing media to augment the product. AR business cards are a way to show off how cool you are. See Chapter 4, Augmented Business Cards, for more details. Just as you may see QR codes in advertising today to take you to a website, AR markers in advertising may be a thing in the near future.
For years, AR has been used in children's books to bring stories to life. Older students studying more serious subjects may find more augmentation of their educational textbooks and media resources, bringing more immersive and interactive content to the curriculum. In Chapter 5, AR Solar System, we will build a sample educational project, a simulation of our Solar System.
AR-based how-to-fix-it apps have been demonstrated to improve technical training and reduce mistakes. How many reams of paper training manuals have already been digitized? But seriously, just converting them into PDFs or web pages is only a little bit better.
Instructional videos go a bit further. With AR, we can have the benefits of more interactive 3D graphics, personal coaching, and hands-on tutorials. In Chapter 7, Augmenting the Instruction Manual, we will illustrate techniques for building an industrial training in AR, showing you how to change a tire on your car.
Have you seen the video of a woman standing in front of a smart mirror trying on clothes, interacting with the system using hand gestures? For example, see Oak Labs at http://www.oaklabs.is/. The Wayfair online furniture store uses AR to help you visualize new furniture in your home before you purchase (https://www.wayfair.com). In Chapter 8, Room Decoration with AR, we will build a little app that will let you decorate your room with framed photos.
Can you say Pokémon? Of course, there's more to AR gaming than that, but let's give credit where it's due. Niantic did bring AR into popular culture with it. We won't describe all the possibilities of AR-based gaming. But, in Chapter 9, Poke The Ball Game, we will build a little AR ball game.
In engineering and other design disciplines, 3D artists and CAD engineers still build stuff in 3D on a 2D screen with a mouse. When is that going to change? It's time to get your hands into the virtual world and make things happen.
Music, cinema, storytelling, journalism, and so on will all benefit from the popular adoption of augmented reality. The possibilities are as infinite as the human imagination.
This book is for developers who are interested in learning and gaining real hands-on experience by building AR applications. We do not assume you have knowledge of the Unity game engine, C# language, 3D graphics, or previous experience with VR or AR development, although any prior experience, from novice to seasoned expert, will be helpful.
Our focus will be on visual augmented reality. We will include some audio, which can be very important to complete an AR experience, but not other senses.
Mobile devices are natural platforms for AR, both mobile phones and tablets, and Android and iOS. We refer to this as handheld AR with video see-through tracking. All the projects in the book can be built and run on mobile devices, including iPhone, iPad, and Android phones and tablets.
Much of today's interest and excitement about AR is the wearable eyewear AR with optical see-through tracking, such as Microsoft HoloLens. Most of the projects in this book can also be built and run on eyewear AR devices. Cases where changes are required in the user interface and interactivity, for example, will be called out.
There is a risk in trying to cover such a wide range of target devices and platforms in one book. We will do our best to separate any device-specific dependencies in our step-by-step tutorial instructions and make it easy for you to follow the instructions relevant to your setup and skip those that do not pertain to you.
We've included topics regarding setting up your Windows or Mac development machine to build AR applications (sorry, Linux not included) and uploading the app onto your device.
For app development, we use the Unity 3D game development platform (https://unity3d.com/), which provides a powerful graphics engine and full-featured editor that you can drive using C# language programming. There are many sources that review and discuss the virtues and benefits of using Unity. Suffice to say, Unity includes native support for photorealistic rendering of computer graphics, humanoid and object animations, physics and collision, user interface and input event systems, and more. With Unity, you can create a project and then build it for any number of supported target platforms, including Windows, Android, iOS, as well as many other popular consoles and mobile devices.
For our AR toolkit, we will teach you how to use the popular and professional Vuforia AR SDK (https://www.vuforia.com/). AR development requires some sophisticated software algorithms and device management, much of which is handled quite elegantly by Vuforia. It was first published in 2008 by Daniel Wagner in a paper titled Robust and unobtrusive marker tracking on mobile phones, then it grew into the award-winning Vuforia by Qualcomm and was later acquired by PTC in 2015. Today, Vuforia supports a wide range of devices, from handheld mobiles to wearable eyewear, such as HoloLens. As we will see throughout this book, the SDK supports many types of tracking targets, including markers, images, objects, and surfaces; therefore, it can be used for many diverse applications. They also provide tools and cloud-based infrastructure for managing your AR assets.
Vuforia requires that you have a license key in each of your apps. At the time of writing this, licenses are free for the first 1,000 downloads of your app, although it displays a watermark in the corner of the display. Paid licenses start at $499 per app.
An alternative to Vuforia is the free and open source ARToolkit SDK (http://artoolkit.org/). ARToolkit was perhaps the first open source AR SDK and certainly the one that lasted the longest; it was first demonstrated in 1999 and released as open source in 2001. ARToolkit is now owned by DAQRI (https://daqri.com/), a leading industrial AR device and platform manufacturer.
As of this writing, the current ARToolkit 5 version is focused on marker and image-based targeting, and it is more limited than Vuforia. (ARToolkit version 6 is in Beta and it promises exciting new features and an internal architecture, but it is not covered in this book.) It does not support as wide a range of devices as Vuforia out of the box. However, since it is open source and has a sizable community, there are plugins available for just about any device (provided you're willing to tinker with it). The Unity package for ARToolkit does not necessarily support all the features of the native SDK.
Apple's ARKit for iOS (https://developer.apple.com/arkit/) is also in Beta and requires iOS 11 (also in Beta as of this writing). ARKit works on any iPhone and iPad using an Apple A9 or A10 processor. Unity provides an asset package that simplifies how to use ARKit and provides example scenes (https://bitbucket.org/Unity-Technologies/unity-arkit-plugin).
We are pleased to provide an introduction to Google ARCore (https://developers.google.com/ar/) in this book, but only an introduction. ARCore is brand new and at the time of writing this, it is in early preview only. The documentation and demo scene they provide are very bare-bones. The setup will likely be different when Unity supports ARCore in the final release. Things such as installing a preview of AR services APK will change. The list of supported Android devices is very short. Please refer to the GitHub repository for this book for new implementation notes and code using Google ARCore for Android: https://github.com/ARUnityBook/. The principles are very similar to ARKit, but the Unity SDK and components are different.
Microsoft HoloLens is a Windows 10 MR (mixed reality) device. See https://www.microsoft.com/en-us/hololens/developers for more information. Its companion MixedRealityToolkit (formerly HoloToolkit) components (https://github.com/Microsoft/MixedRealityToolkit) facilitate development using this fascinating AR device.
AR technology is moving ahead quickly. We really want you to learn the concepts and principles behind AR and its best practices, apart from specific SDKs and devices.
The following table sorts out the various combinations of platforms, devices, and tools that will be covered in this book:
In this chapter, we introduced you to augmented reality, trying to define and describe what AR is, and what it is not, including comparing AR to its sister technology, namely virtual reality. Then, we described how AR works by separating handheld mobile AR from optical eyewear AR devices. In both cases, we described the typical features of such devices and why they're necessary for AR applications. Traditionally, AR is accomplished using video see-through and preprogrammed targets, such as markers or images. Wearable eyewear AR and emerging mobile devices use 3D spatial maps to model the environment and combine virtual objects more realistically because they can do things such as occlusion and physics between the real-world map and virtual objects. We then reviewed the many types of targets, including coded markers, images, and complex objects, and summarized many of the technical issues with AR, including field of view, visual perception, and display resolution. Finally, we looked at some real applications of AR, including those illustrated with projects in this book.
In the next chapter, we get to work. Our first step will be to install Unity and the major AR development toolkits (Vuforia, ARToolkit, Apple ARKit, Google ARCore, and Microsoft MixedRealityToolkit) on your development machine, either Windows or macOS. Let's get to it!