In this chapter, we will explore the emerging avenues of deep learning on mobile devices. We will briefly discuss the basic concepts of machine learning and deep learning, and we'll introduce the various options available for integrating deep learning with Android and iOS. This chapter also introduces implementations of deep learning projects using native and cloud-based learning methodologies.
In this chapter, we will cover the following topics:
- Growth of artificial intelligence (AI)-powered mobile devices
- Understanding machine learning and deep learning
- Introducing some common deep learning architectures
- Introducing reinforcement learning and natural language processing (NLP)
- Methods of integrating AI on Android and iOS
AI is becoming more mobile than it used to be, as smaller devices are being packed with more computational power. Mobile devices, which were once simply used to make phone calls and send text messages, have been transformed into smartphones with the introduction of AI. These devices are now capable of leveraging the ever-increasing power of AI to learn user behavior and preferences, enhance photographs, carry out full-fledged conversations, and much more. The capabilities of AI-powered smartphones are expected to only grow day by day. According to Gartner, by 2022, 80% of smartphones will be AI-enabled.
To cope with the high computational demands of AI, cellphone hardware has undergone regular changes and enhancements to give these devices the ability to think and act. Mobile manufacturing companies have been constantly upgrading hardware support on mobile devices to provide a seamless and personalized user experience.
Huawei has launched the Kirin 970 SoC, which enables on-device AI experiences using a dedicated neural network processing unit. Apple devices are fitted with an AI chip called the Neural Engine, which is part of the A11 Bionic chip. It is dedicated to machine learning and deep learning tasks such as facial and voice recognition, recording Animojis, and object detection while capturing a picture. Qualcomm and MediaTek have released their own chips that enable on-device AI solutions. The Exynos 9810, announced by Samsung, is a neural network-based chip comparable to Qualcomm's Snapdragon 845. The 2018 Samsung devices, the Galaxy S9 and S9+, included one of these chips depending on the country where they were marketed. With its Galaxy S9, the company made it pretty evident that it would integrate AI to improve the functioning of the device's camera and the translation of text in real time. The latest Samsung Galaxy S10 series is powered by the Qualcomm Snapdragon 855 to support on-device AI computations.
The real-time translation feature was developed using Google Translate's Word Lens and the Bixby personal assistant. With these technologies in place, the device can translate up to 54 languages. The phones, which are smart enough to switch between apertures of f/2.4 and f/1.5, are well suited to capturing photographs in low-light conditions. Google's Pixel 2 leverages the power of machine learning to integrate eight image processing units using its coprocessor, the Pixel Visual Core.
The incorporation of AI chips has not only helped to achieve greater efficiency and computational power, but it has also preserved the user's data and privacy. The advantages of including AI chips on mobile devices can be listed as follows:
- Performance: The CPUs of today's mobile devices are unsuited to the demands of machine learning. Attempts to deploy machine learning models on these devices often result in slow service and faster battery drain, leading to a bad user experience. This is because CPUs lack the efficiency to perform the enormous number of small calculations that AI computations require. AI chips, somewhat similar to the Graphics Processing Units (GPUs) that are responsible for handling graphics on devices, provide a separate space to perform calculations exclusively related to machine learning and deep learning processes. This allows the CPU to focus its time on other important tasks. With the incorporation of specialized AI hardware, the performance and battery life of devices have improved.
- User privacy: The hardware also ensures the increased safety of the user's privacy and security. In traditional mobile devices, data analysis and machine learning processes would require chunks of the user's data to be sent to the cloud, posing a threat to the user's data privacy and security of mobile devices. With the on-device AI chips in action, all of the required analyses and calculations can be performed offline on the device itself. This incorporation of dedicated hardware in mobile devices has tremendously reduced the risks of a user's data getting hacked or leaked.
- Efficiency: In the real world, tasks such as image recognition and processing can be a lot faster with the incorporation of AI chips. The neural network processing unit by Huawei is a well-suited example here. It can recognize images at a rate of 2,000 pictures per second, which the company claims is 20 times faster than a standard CPU. When working with 16-bit floating-point numbers, it can perform 1.92 teraflops, or 1.92 trillion floating-point operations per second. The Neural Engine by Apple can handle around 600 billion operations per second.
- Economy: On-device AI chips reduce the need to send data off to the cloud. This capability empowers users to access services offline and save on data usage. It also means that developers no longer need to pay for servers to run these computations, which is advantageous to users as well as developers.
Let's look at a brief overview of how AI on mobile devices has impacted the way we interact with our smartphones.
The use of AI has greatly enhanced the user experience on mobile devices. This can be broadly grouped into the following categories.
Personalization primarily means modifying a service or a product to suit a specific individual's preferences, sometimes related to clusters of individuals. On mobile devices, the use of AI helps to improve user experience by making the device and apps adapt to a user's habits and their unique profile instead of generic profile-oriented applications. The AI algorithms on mobile devices leverage the available user-specific data, such as location, purchase history, and behavior patterns, to predict and personalize present and future interactions such as a user's preferred activity or music during a particular time of the day.
For instance, AI collects data on the user's purchase history and compiles it with the other data that is obtained from online traffic, mobile devices, sensors embedded in electronic devices, and vehicles. This compiled data is then used to analyze the user's behavior and allow brands to take necessary actions to enhance the user engagement rate. Therefore, users can leverage the benefits of AI-empowered applications to get personalized results, which will reduce their scrolling time and let them explore more products and services.
The best examples out there are recommendation systems running through shopping platforms such as Walmart, Amazon, or media platforms such as YouTube or Netflix.
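The core idea behind such recommendation systems can be sketched in a few lines of Python. The following is a minimal, hypothetical example of item-based filtering using cosine similarity; the rating matrix and item indices are invented for illustration:

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated".
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

def cosine(u, v):
    # Cosine similarity: 1.0 means identical rating patterns.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def most_similar_item(item):
    # Compare the item's rating column against every other item's.
    sims = [cosine(ratings[:, item], ratings[:, j]) if j != item else -1
            for j in range(ratings.shape[1])]
    return int(np.argmax(sims))

# Users who liked item 0 also liked item 1, so they look similar.
assert most_similar_item(0) == 1
```

In practice, platforms such as Netflix and Amazon combine many such signals at a vastly larger scale, but the principle of matching users to items with similar interaction patterns is the same.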
A virtual assistant is an application that understands voice commands and completes tasks for the user. Virtual assistants are able to interpret human speech using Natural Language Understanding (NLU) and generally respond via synthesized voices. You might use a virtual assistant for nearly all of the tasks that a real personal assistant would do for you, that is, making calls to people on your behalf, taking notes that you dictate, turning the lights in your home or office on or off via home automation, playing music for you, or even simply talking to you about any topic you'd like! A virtual assistant might be able to take commands in the form of text, audio, or visual gestures. Virtual assistants adapt to user habits over time and get smarter.
Leveraging the power of NLP, a virtual assistant can recognize commands from spoken language, and identify people and pets from images that you upload to your assistant or keep in any online album that is accessible to them.
The most popular virtual assistants on the market right now are Amazon's Alexa, Google Assistant, the iPhone's Siri, Microsoft's Cortana, and Bixby, which runs on Samsung devices. Some virtual assistants are passive listeners and respond only when they receive a specific wake-up command. For example, Google Assistant can be activated using "Hey Google" or "OK Google", and can then be commanded to switch off the lights using "Switch off the bedroom lights" or to call a person from your contacts list using "Make a call to <contact name>". At Google I/O '18, Google unveiled Duplex, a phone-calling reservation AI, demonstrating that Google Assistant would not only be capable of making a call, but could also carry on a conversation and potentially book a reservation at a hair salon all by itself.
The use of virtual assistants is growing exponentially and is expected to reach 1.8 billion users by 2021. 54% of users agreed that virtual assistants help make daily tasks simpler, and 31% already use assistants in their daily lives. Additionally, 64% of users take advantage of virtual assistants for more than one purpose.
The technology that is powerful enough to identify or verify a face or understand a facial expression from digital images and videos is known as facial recognition. This system generally works by comparing the most common and prominent facial features from a given image with the faces stored in a database. Facial recognition also has the ability to understand patterns and variations based on an individual's facial textures and shape to uniquely recognize a person and is often described as a biometric AI-based application.
Initially, facial recognition was a form of desktop computer application; recently, however, it has been widely adopted on mobile platforms. Facial recognition, alongside biometrics such as fingerprint and iris recognition, finds a common application in the security systems of mobile devices. Generally, facial recognition is performed in two steps: feature extraction and selection first, and the classification of objects second. Later developments have introduced several other methods, such as three-dimensional recognition, skin texture analysis, and the use of thermal cameras.
Face ID, introduced with Apple's iPhone X, is a biometric authentication successor to the fingerprint-based authentication systems found in several Android-based smartphones. The facial recognition sensor of Face ID consists of two parts: a Romeo module and a Juliet module. The Romeo module projects over 30,000 infrared dots onto the face of the user. Its counterpart, the Juliet module, reads the pattern formed by the dots on the user's face. The pattern is then sent to the on-device Secure Enclave in the device's CPU to confirm whether or not the face matches the owner's. These facial patterns cannot be directly accessed by Apple. As an added layer of security, the system does not allow authorization to proceed when the user's eyes are closed.
The technology learns from changes in a user's appearance, and therefore works with makeup, beards, spectacles, sunglasses, and hats. It also works in the dark. The Flood Illuminator, a dedicated infrared flash, projects invisible infrared light onto the user's face so that the facial points can be read properly, helping the system function in low-light conditions or even complete darkness. Unlike iPhones, Samsung devices primarily rely on two-dimensional facial recognition, accompanied on the Galaxy Note 8 by an iris scanner that serves as biometric authentication. OnePlus, the leading premium smartphone seller in India, also depends on two-dimensional facial recognition only.
The integration of AI in cameras has empowered them to recognize, understand, and enhance scenes and photographs. AI cameras are able to understand and control the various parameters of a camera. They work on the principles of a digital image processing technique called computational photography, which uses algorithms, rather than optical processes, to identify and improve the contents of a picture with machine vision. These cameras use deep learning models that are trained on huge datasets of images, comprising several million samples, to automatically identify a scene, the availability of light, and the angle of the scene being captured.
When the camera is pointed in the right direction, the AI algorithms of the camera take over to change the settings of the camera to produce the best quality image. Under the hood, the system that enables AI-powered photography is not simple. The models used are highly optimized to produce the correct camera settings upon detection of the features of the scene to be captured in almost real time. They may also add dynamic exposure, color adjustments, and the best possible effect for the image. Sometimes, the images might be postprocessed automatically by the AI models instead of being processed during the clicking of the photograph in order to reduce the computational overhead of the device.
Nowadays, mobile devices are generally equipped with dual-lens cameras. These cameras use two lenses to add the bokeh effect (bokeh is Japanese for "blur") to pictures. The bokeh effect blurs the background around the main subject, making the picture aesthetically pleasing. AI-based algorithms assist in simulating this effect by identifying the subject and blurring the remaining portion, producing the portrait effect.
The Google Pixel 3 camera offers two AI-driven shooting modes, called Top Shot and Photobooth. In Top Shot mode, the camera captures several frames before and after the moment that the user is attempting to capture, and the AI models available on the device then pick the best frame. This is made possible by the vast amount of training provided to the camera's image recognition system, which is able to select the best-looking pictures almost as if a human were picking them. Photobooth mode allows the user to simply hold the device toward a scene of action; images are then taken automatically at the moments the camera predicts to be picture-perfect.
Predictive text is an input technology, generally used in messaging applications, that suggests words to the user depending on the words and phrases that are being entered. The prediction following each keypress is unique, rather than a repeated sequence of letters in the same constant order. Predictive text can allow an entire word to be input with a single keypress, which can significantly speed up the input process. This makes writing tasks such as typing a text message, writing an email, or making an entry into an address book highly efficient, using fewer device keys. A predictive text system adapts to the user's preferred interface style and their learned ability to operate the software, and it eventually gets smarter by analyzing and adapting to the user's language. The T9 dictionary is a good example of such a text predictor. It analyzes the frequency of the words used and suggests the most probable words; it is also capable of considering combinations of words.
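The frequency-analysis idea behind predictors such as T9 can be sketched in a few lines; the tiny corpus and the `suggest` helper here are invented purely for illustration:

```python
from collections import Counter

# A made-up "history" of words the user has previously typed.
corpus = ("the cat sat on the mat the cat ran "
          "then the dog sat on the log").split()
freq = Counter(corpus)

def suggest(prefix, k=2):
    # Find every known word matching the typed prefix...
    matches = [w for w in freq if w.startswith(prefix)]
    # ...and rank candidates by how often the user has typed them.
    return sorted(matches, key=lambda w: -freq[w])[:k]

# "the" has been typed far more often than "then", so it ranks first.
assert suggest("th")[0] == "the"
```

A real predictor would also weight recency and word combinations (n-grams), but the ranking-by-frequency core is the same.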
Google also introduced a new feature that helps users compose and send emails faster than before. The feature, called Smart Compose, understands the text being typed so that it can suggest words and phrases to finish sentences. Smart Compose helps users save time while writing emails by correcting spelling mistakes and grammatical errors, along with suggesting the words that users type most commonly. Smart Reply is another feature, similar to the reply suggestions in LinkedIn messaging, that suggests replies that can be sent with a single click, according to the context of the email received by the user. For example, if the user receives an email congratulating them on an accepted application, Smart Reply is likely to offer options such as "Thank you!," "Thanks for letting me know," and "Thank you for accepting my application." Users can then click on the preferred reply and send a quick response.
In recent times, we have seen a great surge in the number of applications incorporating AI into their features for increased user engagement and customized service delivery. In this section, we will briefly discuss how some of the largest players in the domain of mobile apps have leveraged the benefits of AI to boost their business.
The best and most popular example of machine learning in mobile apps is Netflix. The application uses linear regression, logistic regression, and other machine learning algorithms to provide the user with a perfectly personalized recommendation experience. Content that is classified by actor, genre, length, reviews, year, and more is used to train the machine learning algorithms. All of these algorithms learn and adapt to the user's actions, choices, and preferences. For example, John watched the first episode of a new television series but didn't really like it, so he won't watch the subsequent episodes. Netflix's recommendation system understands that he does not prefer TV shows of that kind and removes them from his recommendations. Similarly, if John picked the eighth recommendation from the recommendations list or wrote a bad review after watching a movie trailer, the algorithms involved try to adapt to his behavior and preferences to provide extremely personalized content.
Seeing AI, developed by Microsoft, is an intelligent camera app that uses computer vision to audibly help blind and visually impaired people learn about their surroundings. It comes with functionalities such as reading out short text and documents for the user, giving a description of a person, and identifying currencies, colors, handwriting, light, and even images in other apps using the device's camera. To make the app this advanced and responsive in real time, the developers have the app communicate with Microsoft Cognitive Services. OCR, barcode scanning, facial recognition, and scene recognition are the most powerful technologies brought together by the application to provide users with a collection of wonderful functionalities.
Allo was an AI-centric messaging app developed by Google. As of March 2019, Allo has been discontinued. However, it was an important milestone in the journey of AI-powered apps at Google. The application allowed users to perform an action on their Android phones via their voice. It used Smart Reply, a feature that suggested words and phrases by analyzing the context of the conversation. The application was not just limited to text. In fact, it was equally capable of analyzing images shared during a conversation and suggesting replies. This was made possible by powerful image recognition algorithms. Later, this Smart Reply feature was also implemented in Google's Inbox app and is now present in the Gmail app.
English Language Speech Assistant (ELSA), which is rated among the top five AI-based applications, is the world's smartest AI pronunciation tutor. The mobile application helps people improve their pronunciation. It is designed as an adventure game, differentiated by levels. Each level presents a set of words for the user to pronounce, which is taken as input. The user's response is examined carefully to point out their mistakes and help them improve. When the application detects a wrong pronunciation, it teaches the user the correct one by instructing them about the correct movements of the lips and the tongue so that the word is said correctly.
Socratic, a tutor application, allows a user to take pictures of mathematical problems and gives answers explaining the theory behind them, with details of how they should be solved. The application is not just limited to mathematics. Currently, it can help a user in 23 different subjects, including English, physics, chemistry, history, psychology, and calculus. Using the power of AI to analyze the required information, the application returns videos with step-by-step solutions. The application's algorithm, combined with computer vision technology, has the capability to read questions from images. Furthermore, it uses machine learning classifiers trained on millions of sample questions, which helps with the accurate prediction of the concepts involved in solving a question.
Now, let's take a deeper look at machine learning and deep learning.
It is important to understand a few key concepts of machine learning and deep learning before you are able to work on solutions that are inclusive of the technologies and algorithms associated with the domain of AI. When we talk about the current state of AI, we often mean systems where we are able to churn a huge amount of data to find patterns and make predictions based on those patterns.
While the term "artificial intelligence" might bring up images of talking humanoid robots or cars that drive by themselves to a layman, to a person studying the field, the images might instead be in the form of graphs and networks of interconnected computing modules.
In the next section, we will begin with an introduction to machine learning.
In the year 1959, Arthur Samuel coined the term machine learning. In a gentle rephrasing of his definition of machine learning, the field of computer science that enables machines to learn from past experiences and produce predictions based on them when provided with unknown input is called machine learning.
A more precise definition of machine learning can be stated as follows:
- A computer program that improves its performance, P, on any task, T, by learning from its experience, E, regarding task T, is called a machine learning program.
- Using the preceding definition in a common analogy: T is a task related to prediction, P is the measure of accuracy achieved by the computer program while performing T, and E is what the program was able to learn. As E increases, the computer program makes better predictions, which means that P improves because the program performs task T with higher accuracy.
- In the real world, you might come across a teacher teaching a pupil to perform a certain task and then evaluating the skill of the pupil at performing the task by making the pupil take an examination. The more training that the pupil receives, the better they will be able to perform the task, and the better their score will be in the examination.
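The definition can also be illustrated numerically. In the following sketch (the rule y = 3x + 1 and all numbers are invented for illustration), the task T is predicting y from x, the experience E is the number of training examples seen, and the performance P is the prediction error, which falls as E grows:

```python
import numpy as np

rng = np.random.RandomState(42)

def avg_test_error(n_train, repeats=100):
    # Average the test error over many runs to smooth out noise.
    errors = []
    for _ in range(repeats):
        # Experience E: n_train noisy observations of the rule y = 3x + 1.
        x = rng.uniform(0, 10, n_train)
        y = 3 * x + 1 + rng.normal(scale=2.0, size=n_train)
        slope, intercept = np.polyfit(x, y, 1)
        # Performance P: squared error of the learned rule vs. the true rule.
        x_test = np.linspace(0, 10, 50)
        errors.append(np.mean((slope * x_test + intercept
                               - (3 * x_test + 1)) ** 2))
    return np.mean(errors)

# More experience E yields better performance P on the prediction task T.
assert avg_test_error(200) < avg_test_error(5)
```

Just as with the pupil and the examination, the more examples the program trains on, the lower its error when examined.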
In the next section, let's try to understand deep learning.
We have been hearing the term learning for a long time, and in several contexts where it usually means gaining experience at performing a task. However, what would deep mean when prefixed to "learning"?
In computer science, deep learning refers to a machine learning model that has more than one layer of learning involved. What this means is that the computer program is composed of multiple algorithms through which the data passes one by one to finally produce the desired output.
Deep learning systems are created using the concept of neural networks. Neural networks are compositions of layers of neurons connected together such that data passes from one layer of neurons to another until it reaches the final or the output layer. Each layer of neurons gets data input in a form that may or may not be the same as the form in which the data was initially provided as input to the neural network.
Consider the following diagram of a neural network:
A few terms are introduced in the preceding diagram. Let's discuss each of them briefly.
The layer that holds the input values is called the input layer. Some argue that this layer is not actually a layer but only a variable that holds the data, and hence is the data itself, instead of being a layer. However, the dimensions of the matrix holding the layer are important and must be defined correctly for the neural network to communicate to the first hidden layer; therefore, it is conceptually a layer that holds data.
Any layer that is an intermediary between the input layer and the output layer is called a hidden layer. A typical neural network used in production environments may contain hundreds of hidden layers. Often, hidden layers contain a greater number of neurons than either the input or the output layer, although in some special circumstances this might not hold true. Having a greater number of neurons in the hidden layers is usually done to process the data in a dimension other than that of the input. This allows the program to reach insights or patterns that may not be visible in the data in the format in which the user feeds it into the network.
The complexity of a neural network is directly dependent on the number of layers of neurons in the network. While a neural network may discover deeper patterns in the data by adding more layers, it also adds to the computational expensiveness of the network. It is also possible that the network passes into an erroneous state called overfitting. On the contrary, if the network is too simple, or, in other words, is not adequately deep, it will reach another erroneous state called underfitting.
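These two failure modes can be illustrated with a rough sketch that uses polynomial degree as a stand-in for network complexity (an analogy chosen for brevity, not the book's setup; the data is invented):

```python
import numpy as np

# Noisy samples of a sine wave: the "dataset" for this illustration.
rng = np.random.RandomState(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

def train_error(degree):
    # Fit a polynomial of the given degree and measure its training error.
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return np.mean((pred - y) ** 2)

# An overly simple (underfitting) model cannot even fit the training data:
assert train_error(1) > 0.05
# A more complex model always fits the training data at least as well...
assert train_error(9) <= train_error(1) + 1e-9
# ...but a low training error alone says nothing about generalization:
# the degree-9 fit may be chasing the noise, which is overfitting.
```

The right complexity lies between the two extremes, which is why choosing the depth of a network is a balancing act rather than "more is always better".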
The final layer in which the desired output is produced and stored is called the output layer. This layer often corresponds to the number of desired output categories or has a single neuron holding the desired regression output.
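The flow of data from the input layer through the hidden layers to the output layer can be sketched with plain matrix multiplications. The sizes and weight values below are arbitrary illustrations, not trained parameters:

```python
import numpy as np

rng = np.random.RandomState(0)

x = np.array([0.5, -0.2, 0.1])   # input layer: simply holds the data
W1 = rng.randn(3, 4)             # weights: input -> first hidden layer
W2 = rng.randn(4, 4)             # weights: first -> second hidden layer
W3 = rng.randn(4, 2)             # weights: second hidden -> output layer

h1 = np.tanh(x @ W1)             # first hidden layer (4 neurons)
h2 = np.tanh(h1 @ W2)            # second hidden layer (4 neurons)
out = h2 @ W3                    # output layer (2 values)

# The hidden layers here hold more neurons (4) than the input (3),
# processing the data in a higher dimension, as discussed above.
assert h1.shape == (4,) and h2.shape == (4,) and out.shape == (2,)
```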
Each layer in the neural network applies a function, called the activation function, to its values. This function keeps the data contained inside neurons within a normalized range; otherwise, the values would grow too large or too small, leading to errors in computations that involve handling large decimal coefficients or large numbers in computers. Additionally, it is the activation function that enables the neural network to handle the non-linearity of patterns in data.
After a brief revision of the key terms, we are now ready to dive deeper into the world of deep learning. In this section, we will be learning about some famous deep learning algorithms and how they work.
Inspired by the animal visual cortex, a convolutional neural network (CNN) is primarily used for, and is the de facto standard for, image processing. The core concept of the convolutional layer is the presence of kernels (or filters) that learn to differentiate between the features of an image. A kernel is usually a much smaller matrix than the image matrix and is passed over the entire image in a sliding-window fashion, producing a dot product of the kernel with the corresponding slice of the image matrix. The dot product allows the program to identify the features in the image.
Consider the following image matrix:
[[10, 10, 10, 0, 0, 0],
[10, 10, 10, 0, 0, 0],
[10, 10, 10, 0, 0, 0],
[0, 0, 0, 10, 10, 10],
[0, 0, 0, 10, 10, 10],
[0, 0, 0, 10, 10, 10]]
The preceding matrix corresponds to an image that looks like this:
On applying a filter to detect horizontal edges, the filter is defined by the following matrix:
[[1, 1, 1],
[0, 0, 0],
[-1, -1, -1]]
The output matrix produced after the convolution of the original image with the filter is as follows:
[[ 0, 0, 0, 0],
[ 30, 10, -10, -30],
[ 30, 10, -10, -30],
[ 0, 0, 0, 0]]
There are no edges detected in the upper or lower parts of the image, where the output values are 0. Around the vertical middle of the image, moving from left to right, a clear horizontal edge (30) is found first, followed by two weaker responses (10 and -10), and finally another clear horizontal edge (-30). The sign of the last response is opposite to that of the first because the colors on that side of the image are flipped.
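We can verify this result with a direct sliding-window implementation of the convolution described above (no padding, stride 1):

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0, 0],
                  [10, 10, 10, 0, 0, 0],
                  [10, 10, 10, 0, 0, 0],
                  [0, 0, 0, 10, 10, 10],
                  [0, 0, 0, 10, 10, 10],
                  [0, 0, 0, 10, 10, 10]])

kernel = np.array([[1, 1, 1],
                   [0, 0, 0],
                   [-1, -1, -1]])   # horizontal edge detector

def convolve(img, k):
    kh, kw = k.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=int)
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the kernel with the image slice beneath it.
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

result = convolve(image, kernel)
assert (result == np.array([[0, 0, 0, 0],
                            [30, 10, -10, -30],
                            [30, 10, -10, -30],
                            [0, 0, 0, 0]])).all()
```

In a real CNN, the kernel values are not hand-picked like this; they are learned during training.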
Thus, by simple convolutions, it is possible to uncover patterns in the image files. CNNs also use several other concepts, such as pooling.
It is possible to understand pooling from the following diagram:
In the simplest terms, pooling is the method of consolidating several image pixels into a single pixel. The pooling method used in the preceding diagram is known as max pooling, wherein only the largest value from the region covered by the sliding window is kept in the resultant matrix. This greatly simplifies the image and helps to train filters that are generic and not exclusive to a single image.
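A minimal max-pooling implementation might look as follows; the 4 x 4 sample matrix is invented for illustration:

```python
import numpy as np

def max_pool(img, size=2):
    # Non-overlapping pooling: stride equals the window size.
    h, w = img.shape[0] // size, img.shape[1] // size
    out = np.zeros((h, w), dtype=img.dtype)
    for i in range(h):
        for j in range(w):
            # Keep only the largest value in each size x size block.
            out[i, j] = img[i * size:(i + 1) * size,
                            j * size:(j + 1) * size].max()
    return out

img = np.array([[1, 3, 2, 4],
                [5, 7, 6, 8],
                [9, 2, 1, 0],
                [3, 4, 5, 6]])
pooled = max_pool(img)

# The 4x4 image is consolidated into a 2x2 summary of its strongest values.
assert (pooled == np.array([[7, 8], [9, 6]])).all()
```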
Generative adversarial networks (GANs) are a fairly new concept in the field of AI and have come as a major breakthrough in recent times. They were introduced by Ian Goodfellow in his 2014 research paper. The core idea behind a GAN is the parallel run of two neural networks that compete against each other. The first neural network performs the task of generating samples and is called the generator. The other neural network tries to classify the samples based on the data previously provided and is called the discriminator. The functioning of GANs can be understood with the following diagram:
Here, the random image vector undergoes a generative process to produce fake images, which are then classified by the discriminator that has been trained with the real images. The discriminator's classifications are fed back to the generator, which adjusts itself to produce more convincing samples, while the fakes that the discriminator confidently rejects show the generator where it must improve. Over time, the discriminator learns to correctly recognize fake images, while the generator learns to produce images that resemble the real images more closely after each generation.
What we have at the end of the learning is a system that can produce near-real data, and also a system that can classify samples with very high precision.
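To make the adversarial loop concrete, here is a toy, hypothetical sketch on one-dimensional data: the generator is a simple affine transform of noise, the discriminator is a logistic classifier, and the two are updated alternately with hand-derived gradients. This illustrates the training dynamic only; it is nothing like a practical image GAN:

```python
import numpy as np

rng = np.random.RandomState(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

a, b = 1.0, 0.0           # generator: fake = a * z + b
w, c = 0.0, 0.0           # discriminator: D(x) = sigmoid(w * x + c)
lr, real_mean = 0.05, 4.0  # the "real" data is drawn around 4.0

for _ in range(2000):
    real = rng.normal(real_mean, 0.5, 64)
    z = rng.normal(0, 1, 64)
    fake = a * z + b
    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w -= lr * np.mean(-(1 - d_real) * real + d_fake * fake)
    c -= lr * np.mean(-(1 - d_real) + d_fake)
    # Generator step: push D(fake) toward 1, i.e. fool the discriminator.
    d_fake = sigmoid(w * fake + c)
    a -= lr * np.mean(-(1 - d_fake) * w * z)
    b -= lr * np.mean(-(1 - d_fake) * w)

# The generator's output mean (b) should have drifted toward the real
# mean (4.0) from its starting point (0.0).
assert abs(b - real_mean) < abs(0.0 - real_mean)
```

Real GANs replace the affine generator and logistic discriminator with deep networks and rely on automatic differentiation, but the alternating two-player update is exactly this loop.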
We will learn more about GANs in the upcoming chapters.
Not all data in the world exists independently of time. Stock market prices and spoken or written words are just a few examples of data that is bound to a time series. Such data has a temporal dimension, and you might assume that using it in a manner that respects this ordering, as data arriving with the passage of time rather than as a constant chunk, would be more intuitive and would produce better prediction accuracy. In many cases, this has been found to be true and has led to the emergence of neural network architectures that can take time as a factor while learning and predicting.
One such architecture is the recurrent neural network (RNN). The major characteristic of such a network is that it not only passes data from one layer to another in a sequential manner, but it also takes data from previous layers and steps. Recall from the Understanding machine learning and deep learning section the diagram of a simple artificial neural network (ANN) with two hidden layers, where data was fed into each layer only by the previous layer. In an RNN with, say, two hidden layers, it is not mandatory for the input to the second hidden layer to be provided only by the first hidden layer, as would be the case in a simple ANN.
This is depicted by the dashed arrows in the following diagram:
RNNs, in contrast to simple ANNs, are trained with a method called backpropagation through time (BPTT) instead of classic backpropagation. BPTT ensures that time is well represented in the backward propagation of the error by unrolling the network over its timesteps and defining the error as a function of the inputs that recur in the network.
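The recurrence itself is easy to sketch: the hidden state computed at one timestep is fed back in at the next, so earlier inputs influence later outputs. The weights below are arbitrary illustrative values, not trained parameters:

```python
import numpy as np

np.random.seed(1)
W_xh = np.random.randn(3, 4)   # input -> hidden weights
W_hh = np.random.randn(4, 4)   # hidden -> hidden (the recurrent link)

def rnn_forward(sequence):
    h = np.zeros(4)            # the hidden state starts empty
    for x in sequence:
        # The new state depends on the current input AND the old state.
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h

same_last = np.array([0.1, 0.2, 0.3])
h_a = rnn_forward([np.array([1.0, 0.0, 0.0]), same_last])
h_b = rnn_forward([np.array([0.0, 1.0, 0.0]), same_last])

# Identical final inputs but different histories produce different
# final states: the network "remembers" what came before.
assert not np.allclose(h_a, h_b)
```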
It is very common to observe vanishing and exploding gradients in RNNs. These are a severe bottleneck in the implementation of deep RNNs, where the relationships between the features of the data are more complex than linear functions. To overcome the vanishing gradient problem, the concept of long short-term memory (LSTM) was introduced by the German researchers Sepp Hochreiter and Juergen Schmidhuber in 1997.
LSTM has proved highly useful in the fields of NLP, image caption generation, speech recognition, and other domains, where it broke previously established records after it was introduced. LSTMs store information outside the main flow of the network so that it can be recalled at any moment, much like a secondary storage device in a computer system. This allows delayed rewards to be introduced to the network. LSTMs have even been compared, somewhat spiritually, to "karma": the reward that a person receives for actions carried out in the past.
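As an illustrative sketch of how an LSTM retains long-term information, here is a single LSTM step in NumPy. The variable names are invented for this example, and the gating follows the standard textbook formulation rather than any particular library's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps the concatenated [x; h] to the stacked
    pre-activations of the input (i), forget (f), and output (o)
    gates and the candidate values (g)."""
    z = W @ np.concatenate([x, h]) + b
    n = h.size
    i = sigmoid(z[0 * n:1 * n])   # input gate: how much new info to write
    f = sigmoid(z[1 * n:2 * n])   # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * n:3 * n])   # output gate: how much cell state to expose
    g = np.tanh(z[3 * n:4 * n])   # candidate values to write
    c = f * c + i * g             # cell state: the long-term "memory"
    h = o * np.tanh(c)            # new hidden state passed to the next layer
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for t in range(5):                # run a few time steps
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```

The cell state `c` is what lets information survive many time steps: the forget gate multiplies it rather than repeatedly squashing it through an activation, which is what mitigates the vanishing gradient problem.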
We shall be diving deeper into LSTMs and CNNs in the upcoming chapters of this book.
In this section, we shall be studying the basic concepts of reinforcement learning and NLP. These are some very important topics in the field of AI. They may or may not use deep learning networks for their implementations, but they are quite often implemented using deep networks. Therefore, it is crucial to understand how they function.
Reinforcement learning is a branch of machine learning that deals with creating AI "agents" that perform a set of possible actions in a given environment in order to maximize a reward. While the other two branches of machine learning (supervised and unsupervised learning) usually learn from a dataset laid out as a table, a reinforcement learning agent mostly learns by exploring the tree of decisions it could make in any given situation, so that the path through this tree eventually leads to the leaf with the maximum reward.
For example, consider a humanoid robot that wishes to learn to walk. It might start by thrusting both of its legs forward at once, in which case it would fall, and the reward, which, in this case, is the distance covered by the robot, would be 0. It would then learn to add a certain delay between one leg being put forward and the next. Because of this delay, the robot might manage to take x1 steps before both feet once again push outward simultaneously and it falls down.
Reinforcement learning deploys the concepts of exploration, meaning the search for a better solution, and exploitation, meaning the use of previously gained knowledge. Continuing our example, since x1 is greater than 0, the algorithm learns to keep approximately the same delay between strides. Over time, through the combined effect of exploration and exploitation, reinforcement learning algorithms become very strong, and the humanoid, in this case, learns not only how to walk but also how to run.
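The exploration-exploitation trade-off can be sketched with a multi-armed bandit, one of the simplest reinforcement learning settings. The following is a minimal, illustrative Python implementation of the epsilon-greedy strategy; the function name, reward distributions, and parameters are all invented for this example:

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Balance exploration (pull a random arm with probability epsilon)
    against exploitation (pull the best arm seen so far)."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n          # how often each arm was pulled
    estimates = [0.0] * n     # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:                       # explore
            arm = rng.randrange(n)
        else:                                            # exploit
            arm = max(range(n), key=lambda a: estimates[a])
        reward = rng.gauss(true_means[arm], 1.0)         # noisy reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Arm 2 has the highest true mean reward, so over time the agent
# should concentrate its pulls on it while still occasionally exploring.
counts, estimates = epsilon_greedy_bandit([0.1, 0.5, 0.9])
```

With pure exploitation the agent could lock onto a mediocre arm forever; with pure exploration it would never capitalize on what it has learned. The small epsilon keeps both forces in play, which mirrors how the humanoid in the example refines its stride.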
NLP is a vast field of AI that deals with the processing and understanding of human languages through the use of computer algorithms. NLP comprises several methods and techniques that are each geared toward a different part of human language understanding, such as understanding meaning based on the similarity of two text extracts, generating human language responses, understanding questions or instructions made in human languages, and the translation of text from one language to another.
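As a toy illustration of the "similarity of two text extracts" task mentioned above, here is a deliberately crude bag-of-words cosine similarity in plain Python. Real NLP systems use far richer representations, such as word embeddings; this sketch only counts shared words:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity: 1.0 for identical word
    distributions, 0.0 when no words are shared."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two sentences that differ by a single word score high but not 1.0.
sim = cosine_similarity("the cat sat on the mat", "the cat lay on the mat")
```

Even this naive measure captures the intuition behind similarity-based NLP: texts are mapped to vectors, and closeness in vector space stands in for closeness in meaning.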
NLP has found vast usage in the current world of technology, with several top tech companies racing toward excellence in the field. There are several voice-based user assistants, such as Siri, Cortana, and Google Assistant, that depend heavily upon accurate NLP in order to perform their functions correctly. NLP has also found usage in customer support, with automated platforms that reply to the most frequently made queries without the need for a human representative to answer them. These NLP-based customer support systems can also learn from the responses real representatives make while interacting with customers. One such major system in deployment can be found in the Help section of the DBS DigiBank application created by the Development Bank of Singapore.
Extensive research is underway in this domain, and it is expected to influence every other field of AI in the coming years. In the next section, let's take a look at the methods currently available for integrating deep learning with mobile applications.
With the ever-increasing popularity of AI, mobile application users expect apps to adapt to the information that is provided and made available to them. The only way to make applications adaptive to the data is by deploying fine-tuned machine learning models to provide a delightful user experience.
Firebase ML Kit is a machine learning Software Development Kit (SDK) available on Firebase for mobile developers. It facilitates the hosting and serving of mobile machine learning models, and it reduces the heavy task of running machine learning models on mobile devices to API calls that cover common mobile use cases such as face detection, text recognition, barcode scanning, image labeling, and landmark recognition. Each API call simply takes input parameters and returns the resulting analytical information. The APIs provided by ML Kit can run on the device, in the cloud, or both. The on-device APIs are independent of network connections and, consequently, run faster than the cloud-based APIs. The cloud-based APIs are hosted on the Google Cloud Platform and use machine learning technology to provide a higher level of accuracy. If the available APIs do not cover the required use case, custom TensorFlow Lite models can be built, hosted, and served using the Firebase console; ML Kit then acts as an API layer to the custom model, making it easy to run. Let's look at the following screenshot:
Here, you can see what the dashboard for Firebase ML Kit looks like.
Core ML, a machine learning framework released by Apple in iOS 11, is used to make applications running on iOS, such as Siri, Camera, and QuickType, more intelligent. Delivering efficient performance, Core ML facilitates the easy integration of machine learning models on iOS devices, giving applications the power to analyze and predict from the available data. Standard machine learning models such as tree ensembles, SVMs, and generalized linear models are supported by Core ML, as are deep learning models with over 30 types of neural network layers.
Using the Vision framework, features such as face tracking, face detection, text detection, and object tracking can be easily integrated into apps. The Natural Language framework helps to analyze natural text and deduce its language-specific metadata; used together with Create ML, it can also deploy custom NLP models. Support for GameplayKit helps in the evaluation of learned decision trees. Core ML is highly efficient because it is built on top of low-level technologies such as Metal and Accelerate, which allows it to take advantage of both the CPU and the GPU. Moreover, Core ML does not require an active network connection to run: it is heavily optimized for on-device execution, ensuring that all computations are done offline, within the device itself, minimizing memory footprint and power consumption.
Built on the original Caffe (Convolutional Architecture for Fast Feature Embedding), which was developed at the University of California, Berkeley, Caffe2 is a lightweight, modular, and scalable deep learning framework developed by Facebook. It helps developers and researchers deploy machine learning models and deliver AI-powered performance on Android, iOS, and Raspberry Pi. Additionally, it supports integration in Android Studio, Microsoft Visual Studio, and Xcode. Caffe2 comes with native Python and C++ APIs that work interchangeably, facilitating easy prototyping and optimization. It is efficient enough to handle large sets of data, and it facilitates automation, image processing, and statistical and mathematical operations. Caffe2, which is open source and hosted on GitHub, leverages community contributions for new models and algorithms.
TensorFlow, an open source software library developed by Google Brain, facilitates high-performance numerical computation. Due to its flexible architecture, it allows the easy deployment of deep learning models and neural networks across CPUs, GPUs, and TPUs. Gmail uses a TensorFlow model to understand the context of a message and predict replies in its widely known feature, Smart Reply. TensorFlow Lite is a lightweight version of TensorFlow that aids the deployment of machine learning models on Android and iOS devices. It leverages the power of the Android Neural Networks API to support hardware acceleration.
The TensorFlow ecosystem, which is available for mobile devices through TensorFlow Lite, is illustrated in the following diagram:
In the preceding diagram, you can see that we need to convert a TensorFlow model into a TensorFlow Lite model before we can use it on mobile devices. This is important because TensorFlow models are bulkier and have higher latency than the Lite models, which are optimized to run on mobile devices. The conversion is carried out through the TF Lite converter, which can be used in the following ways:
- Using Python APIs: The conversion of a TensorFlow model into a TensorFlow Lite model can be carried out using Python, with any of the following lines of code:
TFLiteConverter.from_saved_model(): Converts SavedModel directories.
TFLiteConverter.from_keras_model(): Converts tf.keras models.
TFLiteConverter.from_concrete_functions(): Converts concrete functions.
- Using the command-line tool: The TensorFlow Lite converter is also available as a CLI tool, although it is somewhat less flexible in its capabilities than the Python API version.
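As a minimal sketch of the Python route listed above (assuming TensorFlow 2.x is installed; the tiny untrained Keras model here is a stand-in for a real trained one):

```python
import tensorflow as tf

# A throwaway Keras model, standing in for a trained one.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])

# Convert it to the TensorFlow Lite flat-buffer format.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()   # bytes, ready to bundle with an app

# The resulting .tflite file is what gets shipped on-device.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```

The same converter exposes `from_saved_model()` for SavedModel directories, so the choice between the entry points is simply a matter of how the original model was persisted.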
We will demonstrate the conversion of a TensorFlow model into a TensorFlow Lite model in the upcoming chapters.
In this chapter, we learned about the growth of AI in mobile devices, which provides machines with the ability to reason and make decisions without being explicitly programmed. We also studied machine learning and deep learning, which encompass the technologies and algorithms associated with the domain of AI. We looked at various deep learning architectures, including CNNs, GANs, RNNs, and LSTMs.
We introduced reinforcement learning and NLP, along with the different methods of integrating AI on Android and iOS. Basic knowledge of deep learning and of how we can integrate it with mobile apps is important for the upcoming chapters, where we shall be extensively using this knowledge to create some real-world applications.
In the next chapter, we will learn about face detection using on-device models.