Welcome to Microsoft HoloLens By Example. Join me on a journey through this book to learn about Microsoft's HoloLens device and, more generally, mixed reality (MR) applications through a series of exciting examples, each uncovering an important concept related to developing mixed reality applications, including:
- Understanding the environment and context of the user through image recognition and spatial mapping
- Projecting and placing holograms into the real world
- Allowing the user to interact with holograms using a variety of interaction modes, including gaze, gesture, and voice
- Sharing the experience across devices
We will start our journey by briefly peering into the past to see how our relationship with computers has changed over the past couple of decades, before defining MR with respect to virtual reality (VR) and augmented reality (AR). We will then cover some of the core building blocks of MR applications before wrapping up this chapter.
A good place to start is always at the beginning, so let's start there.
Let's start our journey by briefly peering into the past, specifically at how our interaction with computers has changed over time and how it might be in the near future.
Early computers, setting aside the era when people were employed as computers, were large mechanical machines, and their size and cost meant that they were fixed to a single location and limited to a specific task. The limitation of a single function was soon resolved with the Electronic Numerical Integrator And Computer (ENIAC), one of the world's first general-purpose electronic computers. Due to its size and cost, it was still fixed to a single location, and it was programmed through a complex process of rearranging physical switches on a switchboard. Mainframes followed and introduced a new form of interaction, the Command Line Interface (CLI). The user would interact with the computer by issuing a series of instructions and have the response returned to them via a terminal screen. Once again, these computers were expensive, large, and complex to use.
The following screenshot is an example of DOS, an operating system that dominated the personal computing market in the late 1980s:
PC DOS 1.10 screenshot, Credit: Leyo, Source: https://commons.wikimedia.org/wiki/File:PC_DOS_1.10_screenshot.png
The era of direct manipulation interfaces and personal computing followed. With reduced size and cost, and the introduction of the Graphical User Interface (GUI), computers finally became more accessible to the ordinary person. During this era, the internet was born, and computers became platforms that allowed people to connect and collaborate with one another, to be entertained, and to augment their skills by making data and information more accessible. However, these computers were, despite their name, still far from being personal, neglecting the user and their current context. They forced the user to work in a digital representation of their world, fixing them to a single location: the desk.
The following photograph shows the Apple Macintosh, seen as one of the innovators of the GUI:
Apple LISA II Macintosh-XL - Credit: Gerhard GeWalt Walter, Source: https://commons.wikimedia.org/wiki/File:Apple-LISA-Macintosh-XL.jpg
BlackBerry Limited (then known as Research In Motion) released its first smartphone around 2000, which edged us toward the mobile computer. The tipping point came in 2007, when Steve Jobs revealed the iPhone. Smartphones became ubiquitous; personal computing had finally arrived, on a platform that was inherently personal. The constraints of the technology provided the catalyst to rethink what the role of the application was. Since then, we have been incrementally improving on this paradigm with more capable devices and smarter services. For a lot of us, our smartphone is our primary device; being always on and always connected has given us superpowers our ancestors could only dream about.
The following is a photograph of the late Steve Jobs presenting the iPhone 4 at the 2010 Worldwide Developers Conference:
Steve Jobs shows off the iPhone 4 at the 2010 Worldwide Developers Conference - Credit: Matthew Yohe, Source: https://en.wikipedia.org/w/index.php?title=File:Steve_Jobs_Headshot_2010.JPG
Taking a bird's-eye view, we can see the general (and obvious) trend of the following things:
- Miniaturization of hardware
- Increase in utility
- Moving toward more natural ways of interacting
- The shift from us being in the computer world toward computers being in our world
So, what does the next paradigm look like? I believe that HoloLens gives us a glimpse into what computers will look like in the near future, and even in its infancy, I believe that it provides us with a platform to start exploring and defining what the future will look like. Let's continue by clarifying the differences and similarities of virtual reality (VR), augmented reality (AR), and MR.
The following is a photograph illustrating an example of how MR seamlessly blends holograms into the real world:
Win10 HoloLens Minecraft, Credit: Microsoft Sweden, Source: https://www.flickr.com/photos/microsoftsweden/15716942894
Let's begin by first defining and contrasting three similar, but frequently misused, paradigms: VR, AR, and MR:
- VR describes technology and experiences where the user is fully immersed in a virtual environment.
- AR can be described as technology and techniques used to superimpose digital content onto the real world.
- MR can be considered as a blend of VR and AR. It uses the physical environment to add realism to holograms, which may or may not have any physical reference point (as AR does).
The differences between VR, AR, and MR are not so much in the technology but in the experience you are trying to create. Let's illustrate it through an example--imagine that you were given a brief to help manage children's anxiety when requiring hospital treatment.
With VR, you might create a story or game in space, with likeable characters that represent the staff of the hospital in role and character, but in a more interesting and engaging form. This experience will gently introduce the patient to the concepts, procedures, and routines required. On the other end of the spectrum, you can use AR to deliver fun facts based on contextual triggers; for example, the child might glance (with their phone or glasses) at the medicine packaging to discover which famous people had their condition. MR, as the name suggests, mixes both approaches: our solution can involve a friend, such as a teddy, for the child to converse with, expressing their concerns and fears. Their friend will accompany them at home and in the hospital, being contextually sensitive and responding appropriately.
As these hypothetical examples highlight, the three are not mutually exclusive; they differ in the degree to which they preserve reality. This spectrum is termed the reality-virtuality continuum, introduced by Paul Milgram and his co-authors in the 1994 paper Augmented Reality: A class of displays on the reality-virtuality continuum. It shows the spectrum between the extremes of real and virtual, and how MR encompasses both AR and augmented virtuality (AV). The following figure by Paul Milgram and Fumio Kishino illustrates the concept: at the far left you have reality and at the far right you have virtuality, with MR straddling the space between these two extremes:
Representation of reality-virtuality continuum by Paul Milgram and Fumio Kishino
Our focus in this book, and the proposition of HoloLens, is MR (also referred to as Holographic). Next, we will look into the principles and building blocks that make up MR experiences.
The emergence of new technologies is always faced with the question of doing old things in a new way or doing new things in new ways. When the TV was first introduced, the early programs were adapted from radio, where the presenter read in front of a camera, neglecting the visual element of the medium. A similar phenomenon happened with computers, the web, and mobile. I would encourage you to think about the purpose of what you're trying to achieve, rather than the process of how it is currently achieved, to free yourself to create new and innovative solutions.
In this section, we will go over some basic design principles related to building MR experiences and the accompanying building blocks available on HoloLens. Keep in mind that this medium is still in its infancy, so the following principles are a work in progress.
As discussed earlier, the degree of reality you want to preserve is up to you--the application designer. It is important to establish where your experience fits early on as it will impact how you design and also the implementation of the experience you are building. Microsoft outlines three types of experiences:
- Enhanced environment apps: These are applications that respect the real world and supplement it with holographic content. An example of this can be pinning a weather poster near the front door, ensuring that you don't forget your umbrella when the forecast is for rain.
- Blended environment apps: These applications are aware of the environment, but will replace parts of it with virtual content. An application that lets the user replace fittings and furniture is an example.
- Virtual environment apps: These types of applications will disregard the environment and replace it completely with a virtual alternative. An application that converts your room into a jungle, with trees and bushes replacing the walls and the floor, can be taken as an example.
As with so many things, there is no right answer, just a good answer for a specific user, in a specific context, at a specific time. For example, a weather app designed for a professional might pin the forecast to the door so that she sees it just before leaving for work, while for a younger audience it might be more useful to present the same information through a holographic rain cloud.
In the next section, we will continue our discussion on the concepts of MR, specifically looking at how HoloLens makes sense of the environment.
One of the most compelling features of HoloLens is its ability to place and track virtual/digital content in the real world. It does this using a process known as spatial mapping, whereby the device actively scans the environment, building its digital representation in memory. In addition, it marks important points in the world using a concept called spatial anchors. Spatial anchors are defined in reference to the world origin; holograms are positioned relative to these spatial anchors, and the anchors are also used to join multiple spaces for handling larger environments.
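To make the idea of positioning holograms relative to spatial anchors more concrete, here is a minimal Python sketch. It is not the HoloLens API: the `Anchor` and `Hologram` classes are hypothetical, and the anchor pose is assumed to be translation-only (a real anchor also carries an orientation).

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Anchor:
    """Hypothetical spatial anchor: a named point relative to the world origin."""
    name: str
    position: Vec3


@dataclass
class Hologram:
    """Hypothetical hologram stored as an offset from its anchor."""
    anchor: Anchor
    offset: Vec3

    def world_position(self) -> Vec3:
        # Resolve the anchor-relative offset into world coordinates.
        ax, ay, az = self.anchor.position
        ox, oy, oz = self.offset
        return (ax + ox, ay + oy, az + oz)


door = Anchor("front_door", (2.0, 0.0, 5.0))
poster = Hologram(door, (0.5, 1.5, 0.0))
print(poster.world_position())  # (2.5, 1.5, 5.0)

# If tracking later refines the anchor's position, the hologram follows it.
door.position = (2.25, 0.0, 5.0)
print(poster.world_position())  # (2.75, 1.5, 5.0)
```

Storing the offset rather than an absolute position is the design choice worth noticing: when the anchor is corrected, everything attached to it moves with it.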
The effectiveness of the scanning process will determine the quality of the experience; therefore, it is important to understand this process in order to create an experience that effectively captures sufficient data about the environment. One technique commonly used is digital painting; during this phase, the user is asked to paint the environment. As the user glances around, the scanned surfaces are visualized (or painted over), providing feedback to the user that the surface has been scanned.
However, scanning and capturing the environment is just one part of understanding the environment; the other is making use of it. Some uses include the following:
- Occlusion: One of the shortfalls of creating immersive MR experiences with single-camera devices (such as smartphones) is that they cannot understand the surfaces around them well enough to hide virtual content when a real object is in front of it. Seeing holograms through objects is a quick way to break the illusion for the user; with HoloLens, occluding holograms with the real world is easy.
- Visualization: Sometimes, visualizing the scanned surfaces is desirable, typically to feed back to the user which parts of the environment have been scanned.
- Placement: Similar to occlusion in that it creates a compelling illusion, holograms should behave like the real objects that they are impersonating. Once the environment is scanned, further processing can be performed to gain greater knowledge of the environment, such as the types of surfaces available. With this knowledge, we can better infer where objects belong and how they should behave. In addition to creating more compelling illusions, matching the experience with the user's mental model of where things belong makes the experience more familiar, thus easing adoption by making it easier and more intuitive to use.
- Physics: HoloLens makes the scanned surfaces accessible as plain geometry data, which means we can leverage the existing physics simulation software to reinforce the presence of holograms in the user's environment. For example, if I throw a virtual ball, I expect it to bounce off the walls and onto the floor before settling down.
- Navigation: In game development, we have devised effective methods for path planning. Having a digital representation of our real world allows us to use these same techniques in the real world. Imagine offering a visually impaired person an opportunity to effectively navigate an environment independently or assisting a parent to find their lost child in a busy store.
- Recognition: Recognition refers to the ability of the computer to classify what objects are in the environment; this can be used to create a more immersive experience, such as having virtual characters sit on seats, or to provide a utility, such as helping teach a new language or assisting visually impaired people so that they can better understand their environment.
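As a rough illustration of the navigation use case, the following Python sketch runs a breadth-first search over a small occupancy grid. The grid is a hypothetical stand-in for a floor plan derived from scanned surfaces; how such a grid would be produced from the spatial mapping data is not covered here, and game engines typically use more sophisticated planners.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search on a 2D occupancy grid.

    grid[row][col] == 1 marks a blocked cell, 0 a walkable one.
    Returns the list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in came_from:
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# Hypothetical floor plan: a wall with a gap on the right-hand side.
floor_plan = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = shortest_path(floor_plan, (0, 0), (2, 0))
print(path)  # [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
```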
The luxury of designing for screen-based experiences is that your problem is simplified. In most cases, we own the screen and have a good understanding of it; we lose these luxuries with MR experiences, but gain more in terms of flexibility and therefore opportunity for new, innovative experiences. So it becomes even more important to understand your users and in what context they will be using your application, such as the following:
- Will they be sitting or standing?
- Will they be moving or stationary?
- Is the experience time dependent?
Some common practices when embedding holograms in the real world include the following:
- Place holograms in convenient places--places that are intuitive, easily discovered, and in reach, especially if they are interactive.
- Design for the constraints of the platform, but keep in mind that we are developing for a platform that will rapidly advance in the next few years. At the time of writing, Microsoft recommends placing holograms between 1.25 meters and 5 meters away from the device, with an optimum viewing distance of 2 meters. Find ways of gracefully fading content in and out when it gets too close or too far, so as not to jar the user into an unexpected experience.
- As mentioned earlier, placing holograms on contextually relevant surfaces and using shadows creates more immersive experiences, giving a better illusion that the hologram exists in the real world.
- Avoid locking content to the camera; this can quickly become an annoyance to the user. Instead, use a gentler alternative; one approach being adopted has the interface dragged, in an elastic-like manner, by the user's gaze.
- Make use of spatial sound to improve immersion and assist in hologram discovery. If you have ever listened to Virtual Barber Shop Hair Cut (https://www.youtube.com/watch?v=8IXm6SuUigI), you will appreciate how effective 3D sound can be in creating an immersive experience. Just as holograms should mimic the behavior of the objects they impersonate, use the real-world sounds that the user will expect from the hologram.
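The advice about fading content near the edges of the recommended viewing range can be sketched as a simple opacity function. The 1.25 m and 5 m limits come from the recommendation quoted above; the width of the fade band (`FADE_MARGIN_M`) and the linear ramp are assumptions made purely for illustration.

```python
NEAR_LIMIT_M = 1.25   # recommended minimum hologram distance, in meters
FAR_LIMIT_M = 5.0     # recommended maximum hologram distance, in meters
FADE_MARGIN_M = 0.25  # hypothetical width of the fade band outside each limit


def hologram_opacity(distance_m: float) -> float:
    """Return 1.0 inside the recommended range, fading linearly to 0.0 outside it."""
    if distance_m < NEAR_LIMIT_M:
        overshoot = NEAR_LIMIT_M - distance_m
    elif distance_m > FAR_LIMIT_M:
        overshoot = distance_m - FAR_LIMIT_M
    else:
        return 1.0
    return max(0.0, 1.0 - overshoot / FADE_MARGIN_M)


for d in (2.0, 1.125, 1.0, 6.0):
    print(d, hologram_opacity(d))
```

At the optimum distance of 2 meters the hologram is fully opaque; halfway through the near fade band (1.125 meters) it is half transparent, and beyond the band it disappears.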
Spatial sound (3D sound) adds another dimension to how sound is perceived. Sounds are normally played back in stereo, meaning that the sound has no spatial position; that is, the user won't be able to infer where in space the sound comes from. Spatial sound is a set of techniques that mimic how sound behaves in the real world. This has many advantages, from adding realism to your experience to helping the user locate content.
Of course, this list is not comprehensive, but it offers a few practices to consider when building MR applications. Next, we will look at ways in which the user can interact with holograms.
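To give a feel for what "giving a sound a position" means, here is a deliberately simplified Python sketch that derives left and right channel gains from the position of a source relative to the listener. It uses inverse-distance attenuation and constant-power panning, which are stand-ins chosen for brevity; real spatial audio engines typically rely on head-related transfer functions, and this toy model cannot tell front from back. The function name and coordinate convention are assumptions.

```python
import math


def stereo_gains(listener, facing_deg, source):
    """Return (left_gain, right_gain) for a source on the floor plane.

    listener and source are (x, z) positions in meters; facing_deg is the
    listener's heading, where 0 looks along +z and positive turns toward +x.
    """
    dx = source[0] - listener[0]
    dz = source[1] - listener[1]
    # Clamp the distance so the gain never exceeds 1.0 for very close sources.
    distance = max(math.hypot(dx, dz), 1.0)
    attenuation = 1.0 / distance

    # Angle of the source relative to where the listener is facing.
    azimuth = math.atan2(dx, dz) - math.radians(facing_deg)
    pan = math.sin(azimuth)  # -1.0 = fully left, +1.0 = fully right

    # Constant-power panning keeps the perceived loudness steady across the arc.
    angle = (pan + 1.0) * math.pi / 4.0
    return attenuation * math.cos(angle), attenuation * math.sin(angle)


left, right = stereo_gains((0.0, 0.0), 0.0, (2.0, 0.0))   # source to the right
print(round(left, 3), round(right, 3))

ahead_left, ahead_right = stereo_gains((0.0, 0.0), 0.0, (0.0, 2.0))  # straight ahead
print(round(ahead_left, 3), round(ahead_right, 3))
```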
With the introduction of any new computing paradigm comes new ways of interacting with it and, as highlighted in the opening paragraph, history has shown that we are moving from an interface that is natural to the computer toward an interface that is more natural to people. For the most part, HoloLens removes dedicated input devices and relies on inferred intent, gestures, and voice. I would argue that this constraint is the second most compelling offering that HoloLens gives us; it is an opportunity to invent more natural and seamless experiences that can be accessible to everyone. Microsoft refers to three main forms of input: Gaze, Gesture, and Voice (GGV); let's examine each of these in turn.
Gaze refers to tracking what the user is looking at; from this, we can infer their interest (and intent). For example, I will normally look at a person before I speak to them, hopefully, signalling that I have something to say to them. Similarly, during the conversation, I may gaze at an object, signalling to the other person that the object that I'm gazing at is the subject I'm speaking about.
This concept is heavily used in HoloLens applications for selecting and interacting with holograms. Gaze is accompanied by a cursor; the cursor provides a visual representation of the user's gaze, giving visual feedback on what the user is looking at. It can additionally be used to show the state of the application or of the object the user is currently gazing at; for example, the cursor can visually change to signal whether the hologram the user is gazing at is interactive or not. On the official developer site, Microsoft has listed the design principles; I have paraphrased and listed them here for convenience:
- Always present: The cursor is, in some sense, akin to the mouse pointer of a GUI; it helps the users understand the environment and the current state of the application.
- Cursor scale: As the cursor is used for selecting and interacting with holograms, its size should be no bigger than the objects the user can interact with. Scale can also be used to assist the users' understanding of depth; for example, the cursor will be larger when on nearby surfaces than when on surfaces farther away.
- Look and feel: Using a directionless shape means that you avoid implying any specific direction with the cursor; the shape commonly used is a donut or torus. Making the cursor hug the surfaces gives the user a sense that the system is aware of their surroundings.
- Visual cues: As mentioned earlier, the cursor is a great way of communicating to the user what is important as well as relaying the current state of the application. In addition to signalling to the user what is interactive and what is not, it can also be used to present additional information (possible actions) or the current state, such as showing the user that their hand has been detected.
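The way gaze targets a hologram and drives the cursor can be sketched as a ray cast from the user's head. The following Python example is only an illustration under simplifying assumptions: holograms are approximated as spheres, the `SphereHologram` class and `gaze_hit` function are hypothetical names, and the gaze direction is assumed to be a unit vector.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SphereHologram:
    """Hypothetical hologram approximated by a bounding sphere."""
    name: str
    center: Vec3
    radius: float
    interactive: bool


def gaze_hit(origin, direction, holograms):
    """Return (hologram, distance) for the nearest sphere hit by the gaze ray.

    direction must be a unit vector. Returns None when nothing is hit.
    """
    best = None
    for hologram in holograms:
        to_center = tuple(c - o for c, o in zip(hologram.center, origin))
        # Length of the projection of to_center onto the gaze ray.
        along = sum(t * d for t, d in zip(to_center, direction))
        if along < 0:
            continue  # the hologram is behind the user
        # Squared distance from the sphere center to the closest point on the ray.
        closest_sq = sum(t * t for t in to_center) - along * along
        if closest_sq > hologram.radius ** 2:
            continue  # the ray passes outside the sphere
        distance = along - math.sqrt(hologram.radius ** 2 - closest_sq)
        if best is None or distance < best[1]:
            best = (hologram, distance)
    return best


holograms = [
    SphereHologram("button", (0.0, 0.0, 2.0), 0.25, interactive=True),
    SphereHologram("decoration", (0.0, 0.0, 4.0), 0.5, interactive=False),
]

hit = gaze_hit((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), holograms)
if hit is not None:
    target, distance = hit
    # The cursor would sit at the hit point and change appearance for interactive targets.
    cursor_state = "interactive" if target.interactive else "default"
    print(target.name, distance, cursor_state)  # button 1.75 interactive
```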
While gazing provides the mechanism for targeting objects, gestures and voice provide the means to interact with them. Gestures can be either discrete or continuous. Discrete gestures execute a specific action; for example, the air-tap gesture is equivalent to a click on a mouse or a tap on the screen. In contrast, continuous gestures are entered and exited, and while active, they provide continuous updates to their state. An example of this is the manipulation gesture, whereby the user enters the gesture by holding their finger down (called the hold gesture); once active, this will continuously provide updates of the position of the tracked hand until the gesture is exited with the finger being lifted. This is equivalent to dragging items on desktop and touch devices with the addition of depth.
HoloLens recognizes and tracks hands in either the ready state (back of hand facing you with the index finger up) or pressed state (back of hand facing you with the index finger down) and makes the current position and state of the currently tracked hands available, allowing you to devise your own gestures in addition to providing some standard gestures, some of which are reserved for the operating system. The following gestures are available:
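The difference between discrete and continuous gestures can be illustrated with a small state machine fed by press, update, and release events. This is a sketch rather than how HoloLens implements its recognizers: the `GestureRecognizer` class, the event names, and the 0.3 second tap threshold are all assumptions.

```python
TAP_THRESHOLD_S = 0.3  # hypothetical cut-off between a tap and a hold, in seconds


class GestureRecognizer:
    """Turns finger press/release timing into discrete and continuous gesture events."""

    def __init__(self):
        self.events = []
        self._pressed_at = None
        self._holding = False

    def press(self, now):
        self._pressed_at = now
        self._holding = False

    def update(self, now, hand_position):
        if self._pressed_at is None:
            return  # finger is up, nothing to track
        self._start_hold_if_due(now)
        if self._holding:
            # Continuous gesture: report the hand position while it is active.
            self.events.append(("hold_updated", hand_position))

    def release(self, now):
        if self._pressed_at is None:
            return
        self._start_hold_if_due(now)
        # Discrete gesture: a quick press-and-release is a single tap event.
        self.events.append(("hold_completed",) if self._holding else ("tap",))
        self._pressed_at = None
        self._holding = False

    def _start_hold_if_due(self, now):
        if not self._holding and now - self._pressed_at >= TAP_THRESHOLD_S:
            self._holding = True
            self.events.append(("hold_started",))


recognizer = GestureRecognizer()
recognizer.press(0.0)
recognizer.release(0.1)                      # quick release: a tap
recognizer.press(1.0)
recognizer.update(1.5, (0.1, 0.0, 0.0))      # held past the threshold
recognizer.update(1.6, (0.2, 0.0, 0.0))
recognizer.release(1.7)
for event in recognizer.events:
    print(event)
```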
- Air-tap: This is when the user presses (finger down) and releases (finger up), and is performed within a certain threshold. This interaction is commonly associated with selecting holograms (as mentioned earlier).
- Bloom: Reserved for the operating system, bloom is performed by holding your hand in front of you with your fingers closed, and then opening your hand up. When detected, HoloLens will redirect the user to the
- Manipulation: As mentioned earlier, manipulation is a continuous gesture entered when the user presses their finger down and holds it down, and exited when hand tracking is lost or the user releases their finger. When active, the user's hand is tracked with the intention of using the absolute position to manipulate the targeted hologram.
- Navigation: This is similar to the manipulation gesture, except for its intended use. Instead of mapping the absolute position changes of the user's hand to the hologram, as with manipulation, navigation provides a standard range of -1 to 1 on each axis (x, y, and z); this is useful (and often used) when interacting with user interfaces, such as scrolling or panning.
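The contrast between manipulation and navigation can be shown with two small functions that interpret the same hand movement differently. The sketch below is illustrative only: the function names are hypothetical, and `NAVIGATION_RANGE_M`, the hand travel that maps to full deflection, is an assumed value.

```python
NAVIGATION_RANGE_M = 0.25  # hypothetical hand travel (meters) that maps to +/-1.0


def manipulation_delta(start, current):
    """Manipulation: the raw hand displacement since the gesture started."""
    return tuple(c - s for s, c in zip(start, current))


def navigation_value(start, current):
    """Navigation: the same displacement, normalized and clamped to -1..1 per axis."""
    return tuple(
        max(-1.0, min(1.0, (c - s) / NAVIGATION_RANGE_M))
        for s, c in zip(start, current)
    )


start = (0.0, 0.0, 0.5)
current = (0.125, -0.5, 0.5)

print(manipulation_delta(start, current))  # (0.125, -0.5, 0.0)
print(navigation_value(start, current))    # (0.5, -1.0, 0.0)
```

The manipulation result could be added directly to a hologram's position, while the navigation result behaves more like a joystick reading that drives a scroll or pan speed.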
The last dominant form of interacting with HoloLens, and one I'm particularly excited about, is voice. In recent times, we have seen the rise of the Conversational User Interface (CUI), so it's timely to introduce a platform where one of its dominant inputs is voice. In addition to being a vision we have had since before the advent of computers, it also provides the following benefits:
- Hands free (obviously important for a device like HoloLens)
- More efficient and requires less effort to achieve a task; this is true for data entry and navigating deeply nested menus
- Reduces cognitive load; when done well, it should be intuitive and natural, with minimal learning required
However, how voice is used really depends on your application; it can simply supplement gestures, such as allowing the user to say the Select keyword (a reserved keyword) to select the object they are currently gazing at, or it can support complex requests, such as answering free-form questions. Voice also has some weaknesses, including these:
- Difficulty with handling ambiguity in language; for example, how do you handle the request of
- Manipulating things in physical space is also cumbersome
- Social acceptance and privacy are also considerations that need to be taken into account
With the success of Machine Learning (ML) and adoption of services such as Amazon's Echo, it is likely that these weaknesses will be short-lived.
So far, we have discussed a lot of high-level concepts; let's now wrap this chapter up before moving on and putting these concepts into practice through a series of examples.
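At the simpler end of the voice spectrum described above, keyword commands that supplement gaze can be sketched as a lookup from a recognized phrase to an action on the gazed-at object. The example assumes speech has already been transcribed to text by some recognizer; the `handle_phrase` function and the "remove" command are hypothetical, and only the Select keyword comes from the discussion above.

```python
def handle_phrase(phrase, gazed_hologram, commands):
    """Apply the command matching the phrase to whatever the user is gazing at.

    Returns the command's result, or None when the phrase is unknown
    or the user is not gazing at anything.
    """
    action = commands.get(phrase.strip().lower())
    if action is None or gazed_hologram is None:
        return None
    return action(gazed_hologram)


commands = {
    "select": lambda hologram: f"selected {hologram}",
    "remove": lambda hologram: f"removed {hologram}",  # hypothetical app-defined keyword
}

print(handle_phrase("Select", "lamp", commands))  # selected lamp
print(handle_phrase("dance", "lamp", commands))   # None
```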
I hope you're as excited as I am and, with this book, join me in shaping the future. This book consists of a series of chapters, each walking through a "toy" example used to demonstrate a specific concept or feature of the HoloLens. As you work your way through this book, I encourage you to dream of what is possible, looking past some of the current limitations, knowing that they will be resolved in the near future. I would also discourage creating horseless carriages, a phrase used by the notable designer Don Norman in reference to how the car was modeled on (and named after) horse-drawn carriages, highlighting how new technology always starts out looking like the old technology. So, rather than adapting from existing apps, be inspired to adapt from the real world. With that said, let's make a start with our first example.