Conversational user interface (UI) is changing the way that we interact. Intelligent assistants, chatbots, and voice-enabled devices, such as Amazon Alexa and Google Home, offer a new, natural, and intuitive human-machine interaction and open up a whole new world for us as humans. Chatbots and voicebots ease, speed up, and improve daily tasks. They increase our efficiency and, compared to humans, they are also very cost-effective for the businesses employing them.
This chapter will address the concept of conversational UIs by initially exploring what they are, how they evolved, what they offer, their challenges, and how they will develop in the future. The chapter provides an introduction to the conversational world. We will take a look at how UI has developed over the years and the difference between voice control, chatbots, virtual assistants and conversational solutions.
Broadly speaking, conversational UI is a new form of interaction with computers that tries to mimic a "natural human conversation." To understand what this means, we can turn to the good old Oxford Dictionary and search for the definition of a conversation:
A talk, especially an informal one, between two or more people, in which news and ideas are exchanged.
On Wikipedia (https://en.wikipedia.org/wiki/Conversation), I found some interesting additions. There, conversation is defined a little more broadly: "An interactive communication between two or more people… the development of conversational skills and etiquette is an important part of socialization."
The development of conversational skills in a new language is a frequent focus of language teaching and learning. If we sum up the two definitions, we can agree that a conversation must be:
Some type of communication (a talk)
Between two or more people
Interactive: ideas and thoughts must be exchanged
Part of a socialization process
Focused on learning and teaching
Conversational UI, as opposed to the preceding definition:
Doesn't have to be oral: it could be in writing (for example, chatbots).
Is not just between people and is limited to two sides: in conversational UI, we have at least one form of a computer involved, and the conversation is limited to only two participants. Rarely does conversational UI involve more than two participants.
Is less interactive and it's hard to say whether ideas are exchanged between the two participants.
Is thought of as unsocialized, since we are dealing with computers and not people.
However, two main components of a conversation are already there. Conversational UI:
Is a medium of communication that enables natural conversation between two entities.
Is about learning and teaching by leveraging Natural Language Understanding (NLU), Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL), as computers continue to learn and develop their understanding capabilities.
The gaps that we identified above represent the future evolution of conversational UI. While it seems like there is a long way to go for us to actually be able to truly replace human-to-human interaction, with today's and future technologies, those gaps will close sooner than we think. However, let's start by taking a look at how human-computer interaction evolved over the last 50 years, before we try to predict the future.
Conversational UI is part of a long evolution of human-machine interaction. The interface of this communication has evolved tremendously over the years, mostly thanks to technology improvements, but also through the imagination and vision of humans.
Science fiction books and movies have predicted different forms of humanized interaction with machines for decades (some of the best-known examples are Star Wars, 2001: A Space Odyssey, and Star Trek). However, computing power was extremely scarce and expensive, so spending it on UIs wasn't a high priority. Today, when our smartphones have more computing power than the supercomputers of the past, human-machine interaction can be much more natural and intuitive. In this chapter, we will review the evolution of computer UI, from the textual through to the graphical and all the way to the conversational UI.
A good example of a common use of textual interaction is search engines. Today, if I type a sentence such as search for a hotel in NYC on Google or Bing (or any other search engine, for that matter), the search engine will provide me with a list of relevant hotels in NYC.
With a graphical user interface (GUI), for example, to enable or disable an action or a specific capability, we click a button on the screen using a mouse (instead of typing a textual command), mimicking the mechanical action of turning a real device on or off.
The GUI became extremely popular during the 1990s, with the introduction of Microsoft Windows, which became the most popular operating system for personal computers. The next evolution of GUIs came with the introduction of touchscreen devices, which eliminated the need for intermediaries, such as the mouse, and provided a more direct and natural way of interacting with a computer.
The latest evolution of computer-human interaction is the conversational UI. As defined above, a conversational interaction is a new form of communication between humans and machines that includes a series of questions and answers, if not an actual exchange of thoughts.
In the conversational interface, we experience, once again, a form of two-sided communication, where the user asks a question and the computer responds with an answer. In many ways, this is similar to the textual interface we introduced earlier (see the example of the search engine). However, in this case, the end user is not searching for information on the internet but is instead interacting in a one-to-one format with someone who delivers the answer. That someone is a humanized computer entity called a bot.
The conversational UI mimics a text/voice interaction with a friend/service provider. Though still not a true conversation as defined in the Oxford Dictionary, it provides a free and natural experience that gets the closest to a human-human interaction that we have seen yet.
A sub-category in the field of conversational UI is voice-enabled conversational UI. Whereas the shift from textual to GUI and then from GUI to conversational is defined as evolution, conversational voice interaction is a full paradigm shift. This new way to interact with machines, using nothing but our voice – our most basic communication and expression tool – takes human-machine relationships to a whole new level.
Computers are now capable of recognizing our voice, "understanding" our requests, responding back, and even replying with suggestions and recommendations. Being a natural interaction method for humans, voice makes it easy for young people and adults to engage with computers, in a limit-free environment.
Speech recognition (for voicebots)
In this section, we will walk through the "journey" of a conversational interaction along the conversational stack.
Voice recognition (also known as speech recognition or speech-to-text) transcribes voice into text. The computer captures our voice with a microphone and provides a text transcription of the words. Using a simple level of text processing, we can develop a voice control feature with simple commands, such as "turn left" or "call John." Leading providers of speech recognition today include Nuance, Amazon, IBM Watson, Google, Microsoft, and Apple.
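The "simple level of text processing" mentioned above can be sketched in a few lines. This is only an illustration: the function name parse_command and the two supported commands are assumptions taken from the examples in the text, and the transcript is assumed to arrive as plain text from whichever speech recognition provider is used.

```python
def parse_command(transcript: str):
    """Map a transcribed utterance to an (action, argument) pair, or None.

    Hypothetical helper: only the two example commands from the text
    ("turn left" and "call John") are supported.
    """
    # Normalize case and drop trailing punctuation added by the transcriber.
    text = transcript.lower().strip().rstrip(".!?")
    if text in ("turn left", "turn right"):
        return ("turn", text.split()[1])
    if text.startswith("call "):
        return ("call", text[len("call "):].strip())
    # Anything else is outside what simple voice control can handle.
    return None


print(parse_command("Turn left"))
print(parse_command("Call John."))
print(parse_command("I need to fly to NYC"))
```

The last call returns None: a fixed command list cannot cope with free-form requests, which is where the NLU layer comes in.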
To achieve a higher level of understanding, beyond simple commands, we must include a layer of NLU. NLU fulfills the task of reading comprehension. The computer "reads the text" (in a voicebot, it will be the transcribed text from the speech recognition) and then tries to grasp the user's intent behind it and translate it into concrete steps.
Let's take a look at a travel bot as an example. The system identifies two individual intents:
Flight booking – BookFlight
Hotel booking – BookHotel
When a user asks to book a flight, the NLU layer is what helps the bot to understand that the intent behind the user's request is BookFlight. However, since people don't talk like computers, and since our goal is to create a humanized experience (and not a computerized one), the NLU layer should understand or be able to connect various requests to a specific intent.
Another example is when a user says, I need to fly to NYC. The NLU layer is expected to understand that the user's intent is to book a flight. A more complex request for our NLU to understand would be when a user says, I'm travelling again.
Similarly, the NLU should connect the user's sentence to the BookFlight intent. This is a much more complex task, since the bot can't identify the word flight in the sentence or a destination out of a list of cities or states. Therefore, the sentence is more difficult for the bot to understand.
Computer science considers NLU to be a "hard AI problem" (Turing Test as a Defining Feature of AI-Completeness in Artificial Intelligence, Evolutionary Computation and Metaheuristics (AIECM), Roman V. Yampolskiy), meaning that even with AI (powered by deep learning), developers are still struggling to provide a high-quality solution. To call a problem AI-hard means that it cannot be solved by a simple, specific algorithm, and that solving it means dealing with the unexpected circumstances of any real-world problem. In NLU, those unexpected circumstances are the various configurations of words and sentences in an endless number of languages and dialects. Some leading providers of NLU are Dialogflow (previously api.ai, acquired by Google), wit.ai (acquired by Facebook), Amazon, IBM Watson, and Microsoft.
To build a good NLU layer that can understand people, we must provide a broad and comprehensive sample set of concepts and categories in a subject area or domain. Simply put, we need to provide a list of associated samples or, even better, a collection of possible sentences for each single intent (request) that a user can activate on our bot. If we go back to our travel example, we would need to build a comprehensive dictionary, as you can see in the following table:
Intent | User says (samples)
BookFlight | I want to book my travel
BookFlight | I want to book a flight
BookFlight | I need a flight
BookHotel | Please book a hotel room
BookHotel | I need accommodation
Building these dictionaries, or sets of samples, can be a tough and Sisyphean task. It is domain-specific and language-specific, and, as such, requires different configurations and tweaks from one use case to another, and from one language to another. Unlike the GUI, where the user is restricted to choosing from the web screen, the conversational UI is unique, since it offers the user an unlimited experience. However, as such, it is also very difficult to pre-configure to a level of perfection (see the AI-hard problem above). Therefore, the more samples we provide, the better the bot's NLU layer will be able to understand different requests from a user. Beware of the catch-22 in this case: the more intents we build, the more samples are required, and all those samples can easily lead to intents overlapping. For example, when a user says, I need help, they might mean they want to contact support, but they also might require help on how to use the app.
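To make the idea of sample sets more concrete, here is a minimal sketch of matching an utterance against the samples of each intent. It is not how commercial NLU engines work: the word-overlap (Jaccard) score, the threshold value, and the names SAMPLES and detect_intent are all assumptions chosen for illustration. The sample sentences are taken from the examples in this section.

```python
import re

# Sample sentences per intent, taken from the travel bot example.
SAMPLES = {
    "BookFlight": [
        "I want to book a flight",
        "I need a flight",
        "I need to fly to NYC",
    ],
    "BookHotel": [
        "Please book a hotel room",
        "I need accommodation",
    ],
}


def tokens(sentence):
    """Lowercase the sentence and return its set of words."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def detect_intent(utterance, samples=SAMPLES, threshold=0.3):
    """Return (intent, score) for the closest sample, or (None, score)."""
    words = tokens(utterance)
    best_intent, best_score = None, 0.0
    for intent, sentences in samples.items():
        for sentence in sentences:
            score = jaccard(words, tokens(sentence))
            if score > best_score:
                best_intent, best_score = intent, score
    if best_score < threshold:
        # Assumed fallback: below the threshold we admit we don't know.
        return None, best_score
    return best_intent, best_score


print(detect_intent("I need a flight to Boston"))
print(detect_intent("Can you book me a hotel room"))
print(detect_intent("I'm travelling again"))
```

The last utterance shares no words with any sample, so this naive matcher returns no intent at all. Adding more samples helps, but, as noted above, it also raises the risk of overlapping intents.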
Contextual conversation is one of the toughest challenges in conversational interaction. Being able to understand context is what makes a bot's interaction a humanized one. As mentioned previously, at its minimum, conversational UI is a series of questions and answers. However, adding a contextual aspect to it is what makes it a "true" conversational experience. By enabling context understanding, the bot can keep track of the conversation in its different stages and relate, and make a connection between, different requests. The entire flow of the conversation is taken into consideration and not just the last request.
In every conversational bot we build – either as a chatbot or a voicebot – the interaction will have two sides:
The end user will ask, Can I book a flight?
The bot will respond, Yes. The bot might also add, Do you want to fly international?
The end user can then approve this or respond by saying, No, domestic.
A contextual conversation is very different from a simple Q&A. For the preceding scenario, there were multiple different ways the user could have responded and the bot must be able to deal with all those different flows.
One methodology for dealing with different flows is to use a state machine methodology. This popular and simple way to describe context connects each state (phase) of the conversation to the next state, depending on the user's reaction.
However, the advantage of a state machine is also its disadvantage. This methodology forces us to map every possible conversational flow in advance. While it is very easy to use for building simple use cases, it is extremely difficult to understand and maintain over time, and it's impossible to use for more complicated flows (flight booking, for example, is a complex flow that can't be supported using a state machine). Another problem with the state machine method is that, even for simple use cases, supporting multiple flows that lead to the same response still requires us to duplicate much of the work.
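A state machine for the short exchange above could be sketched as follows. The state names, the reaction labels, and the closing replies are hypothetical, and the sketch assumes that an NLU layer has already turned the user's sentence into a reaction label such as book_flight.

```python
# Each state maps a recognized user reaction to (next_state, bot_reply).
# State names, reaction labels, and the final replies are illustrative only.
TRANSITIONS = {
    "start": {
        "book_flight": ("ask_scope", "Yes. Do you want to fly international?"),
    },
    "ask_scope": {
        "yes": ("international", "Okay, an international flight."),
        "no_domestic": ("domestic", "Okay, a domestic flight."),
    },
}


def step(state, reaction):
    """Return (next_state, reply); stay in place on an unmapped reaction."""
    try:
        return TRANSITIONS[state][reaction]
    except KeyError:
        return state, "Sorry, I didn't understand that."


state = "start"
for reaction in ("book_flight", "no_domestic"):
    state, reply = step(state, reaction)
    print(reply)
```

Every reaction that is not listed for the current state ends in the fallback reply, which shows why each possible flow has to be mapped in advance.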
The event-driven contextual approach is a more suitable method for today's conversational UI. It lets the users express themselves in an unlimited flow and doesn't force them through a specific flow. Understanding that it's impossible to map the entire conversational flow in advance, the event-driven contextual approach focuses on the context of the user's request to gather all the information it needs in an unstructured way by minimizing all other options.
Using this methodology, the user leads the conversation and the machine analyzes the data and completes the flow in the background. This method allows us to depart from the restrictive GUI-style state machine flow and provide human-level interaction.
In this example, the machine knows that it needs the following parameters to complete a flight booking: the departure location, the destination, the date, and the airline.
The user in this case can fluently say, I want to book a flight to NYC, or I want to fly from SF to NYC tomorrow, or I want to fly with Delta.
For each of these flows, the machine will return to the user to collect the missing information:
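User says | Information bot collects | Information bot requests
I want to book a flight to NYC | Destination: NYC | Departure location, date, and airline (user replies: Tomorrow, from SF with Delta)
I want to fly from SF to NYC tomorrow | Departure: SF; Destination: NYC; Date: tomorrow | Airline
I want to fly with Delta to NYC | Destination: NYC; Airline: Delta | Departure location and date (user replies: From NY, tomorrow)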
By building a conversational flow in an event-driven contextual approach, we succeed in mimicking our interaction with a human agent. When booking a flight with a travel agent, I start the conversation and provide the details that I know. The agent, in return, will ask me only for the missing details and won't force me to state each detail at a certain time.
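The event-driven contextual approach can be sketched as a small "slot filling" loop: the bot collects whatever parameters each sentence contains and asks only for what is still missing. Everything here is an assumption made for illustration, including the four parameter names (inferred from the examples above), the toy word lists that stand in for a real NLU layer, and the class name FlightBooking.

```python
import re

# Parameters the bot needs before it can complete a booking (assumed names).
REQUIRED = ("origin", "destination", "date", "airline")

# Toy vocabularies; a real bot would rely on its NLU layer instead.
CITIES = {"sf", "nyc", "ny"}
AIRLINES = {"delta"}
DATES = {"today", "tomorrow"}


def extract(utterance):
    """Pull whatever flight parameters the utterance happens to contain."""
    words = re.findall(r"[a-z]+", utterance.lower())
    found = {}
    for i, word in enumerate(words):
        prev = words[i - 1] if i > 0 else ""
        if word in CITIES and prev == "from":
            found["origin"] = word.upper()
        elif word in CITIES and prev == "to":
            found["destination"] = word.upper()
        elif word in AIRLINES:
            found["airline"] = word.capitalize()
        elif word in DATES:
            found["date"] = word
    return found


class FlightBooking:
    """Keeps the context of one booking across several user turns."""

    def __init__(self):
        self.slots = {}

    def handle(self, utterance):
        """Store new information and return the parameters still missing."""
        self.slots.update(extract(utterance))
        return [name for name in REQUIRED if name not in self.slots]


booking = FlightBooking()
print(booking.handle("I want to book a flight to NYC"))
print(booking.handle("Tomorrow, from SF with Delta"))
print(booking.slots)
```

The user can supply the details in any order and in any number of turns; the bot never forces a fixed sequence of questions.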
At this stage, I think we can agree that building a conversational UI is not an easy task. In fact, many bots today don't use NLU and avoid free-speech interaction. We had great expectations of chatbots and with those high expectations came a great disappointment. This is why many chatbots and voicebots today provide mostly simple Q&A flows.
Most of those bots have a limited offering, and the business logic is connected to two or three specific use cases, such as opening hours or a phone number, no matter what the user is asking for. In other very popular chat interfaces, bots still lean on the GUI, offering a menu selection and eliminating free text.
However, if we are building a true conversational communication between our bot and our users, we must make sure that we connect it to a dynamic business logic. So, after we have enabled speech recognition, worked on our NLU, built samples, and developed an event-driven contextual flow, it is time to connect our bot to dynamic data. To reach real-time data, and to be able to run transactions, our bot needs to connect to the business logic of our application. This can be done through the usage of APIs to your backend systems.
Going back to our flight booking bot, we would need to retrieve real-time data on when the next flight from SF to NYC is, what seats are still available, and what the price is for the flight. Our APIs can also help us to complete the order and approve a payment. If you are lacking APIs for some of the needed data and functions, you can develop new ones or use screen-scraping techniques to avoid a complex development.
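As a rough sketch of this last step, the bot's reply can be built from data returned by a backend call. No real airline API is used here: fake_search_flights is a hypothetical stand-in that returns canned data, and the Flight fields and the wording of the reply are assumptions. In a real bot, the search function would call your own backend API.

```python
from dataclasses import dataclass


@dataclass
class Flight:
    origin: str
    destination: str
    departs: str      # departure time as "HH:MM"
    seats_left: int
    price_usd: float


def fake_search_flights(origin, destination):
    """Stand-in for a real backend API call; returns canned sample data."""
    catalogue = [
        Flight("SF", "NYC", "08:15", 4, 320.0),
        Flight("SF", "NYC", "13:40", 0, 280.0),
    ]
    return [f for f in catalogue
            if f.origin == origin and f.destination == destination]


def next_flight_reply(origin, destination, search=fake_search_flights):
    """Turn backend data into a bot reply about the next available flight."""
    available = [f for f in search(origin, destination) if f.seats_left > 0]
    if not available:
        return f"Sorry, I found no available flights from {origin} to {destination}."
    flight = min(available, key=lambda f: f.departs)
    return (f"The next flight from {origin} to {destination} leaves at "
            f"{flight.departs}, with {flight.seats_left} seats left, "
            f"for ${flight.price_usd:.2f}.")


print(next_flight_reply("SF", "NYC"))
```

Passing the search function as a parameter keeps the conversational layer separate from the business logic, so the canned data can later be swapped for a real API client without changing how the reply is composed.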
Conversational UI is still new to us and, as such, there are still challenges and gaps that prevent it from reaching its full potential. Technology has improved greatly over the years to get us to where we are, but, although we are far from HAL 9000 (from the movie 2001: A Space Odyssey, in which a computer program interacts freely with the ship's astronaut crew and controls the systems of the Discovery One spacecraft using thinking and feeling), we must keep in mind that even HAL had some malfunctions. In this section, I will list the five main challenges that technology and bot designers will have to address in the next few years.
The more sophisticated, natural, and humanized human-machine interaction becomes, the harder it is to build and develop. While a simple command-line, text-based interface can be created by any developer, a high-quality UI in the form of a chatbot or voicebot requires many experts, including chat and voice designers and NLU specialists, both of whom are very hard to find.
Natural language understanding is the attempt to mimic reading comprehension by a machine. It is a subtopic of AI and, as mentioned earlier, it is an AI-hard (or AI-complete) problem. An AI-hard problem is equivalent to solving the central AI problem: making computers as intelligent as people (https://en.wikipedia.org/wiki/AI-complete). Why is it so difficult? As discussed above, when responding to a conversational UI, there is an infinite number of unknown and unexpected features in the input, within an infinite number of options of syntactic and semantic schemes to apply to it. This means that when we chat or talk to a bot, just as when we talk to another person, we are unlimited in what we can say. We are not restricted to keeping to a specific GUI path: we are free to ask about anything and everything.
One way to tackle the NLU AI-hard issue is to focus and limit the computer's understanding to a specific theme, subject, or use case. When I go to the doctor, I'm probably not going to consult with him about the return I will yield when investing in the NY stock exchange. When I visit the doctor, I am within a specific context: I don't feel well, I need a new prescription for a medication, and so on. In fact, just within a doctor scenario, there are so many use cases to predefine that it makes sense to break them down into sub-use cases, to help improve our NLU in sub-domain contexts (pediatrics, gynecology, oncology, and so on).
If we go back to our travel example, we can train the NLU layer of our bot to be able to respond to everything related to the booking of flights. In this case, we mimic a possible conversation between the user and a travel agent. While a human travel agent can help us with additional tasks, such as finding a hotel, planning our trip, and more, in this use case we will stay within the context of booking flights to maximize the experience and the responses.
A major derivative of the NLU problem is the accuracy level of the conversation. Even when limiting our bot to a specific use case, the need to cover all possible requests, in each form of language, makes it very hard to create a good user experience (UX). In fact, one 2017 report put the failure rate of Facebook's chatbots at about 70% (https://www.fool.com/investing/2017/02/28/facebook-incs-chatbots-hit-a-70-failure-rate.aspx). While users are willing to try to address their needs quickly with an automated system, they are unforgiving once the system fails to serve them.
The accuracy of the level of understanding is dependent on the number of preconfigured samples in the bot. Those samples are sentences that users say that represent their request or intent. The bot, thereafter, translates them into actions. For every request, there are hundreds of such sentences. For complex requests, where there are also many parameters involved (such as our flight booking bot example), there are thousands, if not tens of thousands of them. This remains an unsolved problem today and, as a result, many bots today offer a poor experience to their users, which stays within very limited boundaries.
The transition from GUI to conversational UI (CUI), as well as to conversational user experience (CUX), and voice user experience (VUX) introduces many challenges within this paradigm shift that we are witnessing. Beyond the unlimited options that we discussed above, as part of the AI-hard problem raised around NLU, when building a conversational UI, and especially a voice UI and UX, there is a challenge of exposing the user to your offer in a screenless environment.
When I go to the store, I can see all the items I can choose from and purchase, and I can ask the salesperson for more help. A good salesperson will help me and recommend items that they think I should be made aware of in the store. When I shop online, I can view all the items that are available for me to purchase and can also search for something specific and browse through the various results. Here, as well, I can get recommendations, sometimes based on my previous purchases, in different graphical forms such as pop-ups or newsletters. Exposing the user to your offering within a text or a voice conversational UI is extremely difficult. Just as a conversational UI is limited in nature (focusing on specific use cases, within a certain context), the ways to expose the user to what you offer, or how you can help him/her, are limited as well.
Many chatbots offer a menu-based interaction, providing options to choose from. This way, the conversation is limited to a specific flow (state machine supported), but the added value is that the user can be exposed to additional information. The problem with this solution is that it inherits the GUI experience into the CUI and very often offers very little value.
In the case of voicebots, we often witness a "help" section, which provides the user with a list of actions they can perform when talking to the bot. This will be in the form of an introduction to the application, offering a few examples of what the user can ask. Going back to our flight example, imagine that a user says, Ok Google, open travel bot. The first response can be Welcome to Travel Bot! How can I help you? You can ask me: what is the next flight to NYC from SF? In addition, voice-enabled devices, such as Amazon Alexa and Google Home, provide users with an instruction card that gives some examples of questions. The companies also send out a weekly newsletter with new capabilities.
I mentioned a couple of times the need to build contextual conversational UI and UX, and I will dedicate a full chapter (Chapter 3, Building a Killer Conversational App) to this in the book. Being a major challenge in today's conversational UI development, I believe that it deserves one more mention in this section.
We expect bots to replace humans – not computers. The conversational UI mimics my interaction with a human, whether through text or voice. Even when we limit the interaction to a specific use case and include all possible sample sentences that could prompt a question, there is one thing that is very difficult to predict within a contextual conversation: non-explicit requests.
If I call my travel agent and excitedly tell her that my daughter's 6th birthday is coming up, she might "do the math" and understand that we are planning a family trip to Disneyland. She will then extract all the parameters needed to complete my request:
Number of people/adults/kids
Hotels for the dates
Allergies and more…
Even though I haven't explicitly requested her help to plan a trip to Disneyland, the travel agent will be able to connect the dots and respond to my request. Training a machine to do that, that is, to react to non-explicit requests, remains a huge challenge in today's technology stack. However, the good news is that AI technologies and, more specifically, machine learning and deep learning, will become very useful in the next couple of years for tackling this challenge.
One very controversial aspect when discussing chatbots and voicebots is security and, more specifically, the privacy around it. In today's world, chatbot and voicebot platforms are controlled by some of the leading corporations and our data and information become their assets. Although Google, Amazon, and Facebook have been collecting private data for quite a while (whenever we searched the web, purchased items on Amazon, or just posted something on Facebook), now those companies "listen" to us outside of the web/app environment: they are in our homes and in every private message. Recently, Amazon Alexa was accused of recording a private conversation of a man at his home and sending it to his boss, without that person's consent.
The "constantly listening" functionality reminds many of George Orwell's 1984 and the party-monitoring telescreen that was designed to simultaneously broadcast entertainment and listen in to people's conversations to detect disorders. Although Orwell's telescreen was used by a tyranny to control its people, whereas today's solutions are owned by commercial corporations, one cannot help but wonder what the implications of using such devices will be in the future.
Conversational channels controlled by the above corporations have also become a challenge for businesses that are forced into running their customers' interactions through third-party channels. Five years ago, businesses were reluctant to shift their data centers to the cloud; today, that reluctance means little when data is being transferred through additional channels anyway.
This is important for us to understand when we design our chatbots and voicebots. Mainly, we should protect our customers' data and, where needed, obey the relevant country's/state's regulations. We should make sure we are not asking for specific data, such as SSN or credit card numbers and, for the time being, use complementary ways to get that, such as rerouting the user to a secure site to complete registration.
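One possible safeguard, sketched below, is to screen incoming messages for sensitive numbers before they are logged or passed on, and to send the user to a secure page instead. The patterns are deliberately simple and will miss many formats, the URL is a placeholder, and the function names are hypothetical. This is an illustration and not a compliance solution.

```python
import re

# Simplified patterns (assumptions): US SSN written as 123-45-6789, and
# 13 to 16 digits optionally separated by spaces or dashes for card numbers.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

SECURE_URL = "https://example.com/secure"  # placeholder address


def contains_sensitive_data(message):
    """True if the message appears to contain an SSN or a card number."""
    return bool(SSN_PATTERN.search(message) or CARD_PATTERN.search(message))


def redact(message):
    """Replace sensitive numbers before the message is stored or logged."""
    message = CARD_PATTERN.sub("[card number removed]", message)
    return SSN_PATTERN.sub("[SSN removed]", message)


def reply_to(message):
    """Reroute the user to a secure site if they shared sensitive data."""
    if contains_sensitive_data(message):
        return f"Please don't share that here. You can continue at {SECURE_URL}."
    return "How can I help you?"


print(redact("My card is 4111 1111 1111 1111"))
print(reply_to("My SSN is 123-45-6789"))
```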
Intelligent assistants, chatbots, voicebots, and voice-enabled devices, such as Amazon Echo and Google Home, have stormed into our lives, offering many ways to improve daily tasks through natural human-computer communication. In fact, some of the applications that we use today already take advantage of voice/chat-enabled interaction to ease our lives. Whether we are turning the lights on and off in our living room with a simple voice command or shopping online with a Facebook Messenger bot, conversational UI makes our interactions more focused and efficient.
Fast-forwarding from today, we can assume that conversational UI, and more specifically voice-enabled communication, will replace all interactions with computers. In the movie Her (2013), written and directed by Spike Jonze, an unseen computer bot communicates with the main character using voice. This voicebot (played by Scarlett Johansson) assists, guides, and advises the main character on any possible matter. It is a personal assistant on steroids.
Its knowledge is unlimited, it continues to learn all the time, it can create a conversation (a true exchange of ideas), and, in the end, it can even understand feelings (although it still doesn't feel itself). However, as we've seen above, with current technology, real-life conversational UI still lacks many of the components seen in Her and faces unsolved challenges and open questions. The experience is limited for the user, as it's still mostly non-contextual, and bots are far from understanding feelings or social situations.
Nevertheless, with all the limitations we experience today, creating a supercomputer that knows everything is more within reach than creating a super-knowledgeable person. Technology, whether in the form of advanced AI, ML, or DL methodologies, will solve most of those challenges and make the progress needed to build successful bot assistants.
What might take a bit more time to transform is human skepticism: conversational UI is also limited because its users are still very skeptical of it. Aware of its limitations, we stick to what works best and tend not to challenge it too much. When comparing children's interaction with bots to that of adults, it is clear that while adults stay within specific boundaries of usage, children interact with the bots as if they were real adult humans – knowledgeable about almost everything. It might be a classic chicken-or-the-egg dilemma, but one thing is for sure: the journey has started and there's no going back.
Turing Test as a Defining Feature of AI-Completeness in Artificial Intelligence, Evolutionary Computation and Metaheuristics (AIECM), Roman V. Yampolskiy