Neural Search - From Prototype to Production with Jina

By Jina AI, Bo Wang, Cristian Mitroi, Feng Wang, Shubham Saboo, and Susana Guzmán
About this book
Search is a big and ever-growing part of the tech ecosystem. Traditional search, however, has limitations that are hard to overcome because of the way it is designed. Neural search is a novel approach that uses the power of machine learning to retrieve information, treating vector embeddings as first-class citizens and opening up new possibilities for improving on the results obtained through traditional search. Although neural search is a powerful tool, it is new, and fine-tuning it can be tedious because it requires you to understand the several components on which it relies. Jina fills this gap by providing an infrastructure that reduces the time and complexity involved in creating deep learning–powered search engines.

This book will teach you the fundamentals of neural networks for neural search, their strengths and weaknesses, and how to use Jina to build a search engine. With the help of step-by-step explanations, practical examples, and self-assessment questions, you'll become well versed in the basics of neural search and core Jina concepts, and learn to apply this knowledge to build your own search engine. By the end of this deep learning book, you'll be able to make the most of Jina's neural search design patterns to build an end-to-end search solution for any modality.
Publication date: October 2022
Publisher: Packt
Pages: 188
ISBN: 9781801816823

 

Neural Networks for Neural Search

Search has always been a crucial part of information systems; getting the right information to the right user is essential. This is challenging because a user query, such as a set of keywords, cannot fully represent the user's information needs. Traditionally, symbolic search was developed to allow users to perform keyword-based searches, but such search applications were bound to a text-based search box. With the recent developments in deep learning and artificial intelligence, we can encode any kind of data into vectors and measure the similarity between two vectors. This allows users to create a query with any kind of data and get any kind of search result.

In this chapter, we will review important concepts in information retrieval and neural search, and look at the benefits that neural search provides to developers. Before introducing neural search, we will first cover the drawbacks of traditional symbolic search. Then, we'll move on to how neural networks can be used to build cross-modal and multi-modal search, including its major applications.

In this chapter, we're going to cover the following main topics:

  • Legacy search versus neural search
  • Machine learning for search
  • Practical applications for neural search
 

Technical requirements

This chapter has the following technical requirements:

  • Hardware: A desktop or laptop computer with a minimum of 4 GB of RAM; 8 GB is suggested
  • Operating system: A Unix-like operating system such as macOS, or any Linux-based distribution, such as Ubuntu
  • Programming language: Python 3.7 or higher, and the Python package installer, pip
 

Legacy search versus neural search

This section will guide you through the fundamentals of symbolic search systems, the different types of search applications, and their importance. This is followed by a brief description of how a symbolic search system works, with some code written in Python. Last but not least, we'll summarize the pros and cons of traditional symbolic search versus neural search. This will help us understand how neural search can better bridge the gap between a user's intent and the retrieved documents.

Exploring various data types and search scenarios

In today’s society, governments, enterprises, and individuals create a huge amount of data by using various platforms every day. We live in the era of big data, where things such as texts, images, videos, and audio files play a significant role in society and the fulfillment of daily tasks.

Generally speaking, there are three types of data:

  • Structured data: This includes data that can be logically expressed and stored in a two-dimensional table structure. Structured data strictly follows a specific data format and length specification and is mainly stored and managed using relational databases.
  • Unstructured data: This has neither a regular nor a complete structure, nor a predefined data model. This type of data cannot be appropriately represented by the two-dimensional logical tables used in databases. It includes office documents, text, pictures, Hypertext Markup Language (HTML), various reports, and images, audio, and video in all formats.
  • Semi-structured data: This falls somewhere between structured and unstructured data. It includes log files, Extensible Markup Language (XML), and JavaScript Object Notation (JSON). Semi-structured data does not conform to the data model structure of relational databases or other data tables, but it contains tags that separate semantic elements and stratify records and fields.

Search indices are widely used to search for unstructured and semi-structured data within a massive data collection to meet the information needs of users. Depending on the scope and application of the document collection, searches can be further divided into three types: web search, enterprise search, and personal search.

In a web search, the search engine first needs to index hundreds of millions of documents. The search results are then returned to users in an efficient manner while the system is continuously optimized. Typical examples of web search applications are Google, Bing, and Baidu.

In addition to web search, as a software development engineer, you are likely to encounter enterprise and personal search operations. In enterprise search scenarios, the search engine indexes internal documents of an enterprise to serve the employees and customers of the business, such as an internal patent search index of a company, or the search index of a music platform, such as SoundCloud.

If you are developing an email application and intend to allow users to search for historical emails, this constitutes a typical example of a personal search. This book focuses on enterprise and personal types of search operations.

Important Note

Make sure you understand the difference between search and match. Search, in most cases, is done in documents organized in an unstructured or semi-structured format, while match (such as an SQL-like query) takes place on structured data, such as tabular data.

As for different data types, the concept of modality plays an important role in a search system. Modality refers to the form of information, such as text, images, video, or audio. Cross-modal search (also known as cross-media search) means using a sample from one modality to retrieve semantically similar samples from another modality by exploiting the relationship between the two modalities.

For example, when we enter a keyword in an email inbox application, the appropriate emails are returned as the result of a unimodal search – searching text by text. When you enter a keyword on an image retrieval page, the search engine returns appropriate images as the result of a cross-modal search – searching images by text.

Of course, a unimodal search is not limited to searching text by text. The app known as Shazam, which is popular in the App Store, helps users to identify music and returns a track’s title to users in a short time. This can be seen as an application of unimodal search. Here, the concept of modality no longer refers to text, but to audio. On Pinterest, users can locate similar images through an image search, where the modality refers to an image. Likewise, the scope of a cross-modal search covers far more than searching for images by text.

Let’s consider this from another perspective. Is it possible for us to search across multiple modalities? Of course, the answer is “Yes!” Imagine a search scenario where a user uploads a photo of clothes and wants to look for similar types of clothing (we usually call this type of application “shop the look”), and at the same time enters a paragraph that describes the clothes in the search box to improve the accuracy of the search. In this way, our search keywords span two modalities (text and images). We refer to this search scenario as a multi-modal search.

Now that we have a grasp of the concept of modality, we will elaborate on the working principles, advantages, and disadvantages of symbolic search systems. By the end of this section, you will understand that symbolic search systems cannot deal with different modalities.

How does the traditional search system work?

As a developer, you may have used Elasticsearch or Apache Solr to build a search system for web applications. These two widely used search frameworks were developed on top of Apache Lucene. We'll take Lucene as a case in point to introduce the components of a search system. Imagine you intend to search for a keyword in thousands of text (.txt) documents. How would you complete this task?

The easiest solution is to traverse all text documents from a path and read through the contents of these documents. If the keyword is in the file, the name of the document will be returned:

# src/chapter-1/sequential_match.py
import os
import glob

dir_path = os.path.dirname(os.path.realpath(__file__))

def match_sequentially():
    matches = []
    query = 'hello jina'
    # Collect every .txt file in the resources folder and scan each one in turn
    txt_files = glob.glob(f'{dir_path}/resources/*.txt')
    for txt_file in txt_files:
        with open(txt_file, 'r') as f:
            if query in f.read():
                matches.append(txt_file)
    return matches

if __name__ == '__main__':
    matches = match_sequentially()
    print(matches)

The code fulfills the simplest search function by traversing all files with the .txt extension in the resources directory and opening them in turn. If the query keyword hello jina appears in a file, its filename is added to the list of matches, which is printed at the end. Although these lines of code allow you to conduct a basic search, the approach has many flaws:

  • Poor scalability: In a production environment, there may be millions of files to be retrieved. Meanwhile, users of the retrieval system expect to obtain retrieval results in the shortest possible time, posing stringent requirements for the performance of the search system.
  • Lack of a relevance measurement: The code achieves only the most basic Boolean retrieval, which returns the result of a match or mismatch. In a real-world scenario, users need a relevance score from the search system, with results sorted in descending order so that more relevant files are returned first. Obviously, the preceding code snippet cannot fulfill this function.

To address these issues, we need to index the files to be retrieved. Indexing is the process of converting files into a data structure that supports rapid search, avoiding a continuous scan of all files.

Indexing is already part of our daily lives: it is comparable to looking up a word in a dictionary or finding a book in a library. We'll use the most widely used search library, Lucene, to illustrate the idea.

Lucene Core (https://lucene.apache.org/) is a Java library providing powerful indexing and search features, as well as spellchecking, hit highlighting, and advanced analysis/tokenization capabilities. Apache Lucene sets the standard for search and indexing performance. It is the search core of both Apache Solr and Elasticsearch.

In Lucene, after all collections of files to be retrieved are loaded, you may extract texts from such files and convert them to Lucene Documents, which generally contain the title, body, abstract, author, and URL of a file.

Next, your file will be analyzed by Lucene’s text analyzer, which generally includes the following processes:

  • Tokenizer: This splits the raw input paragraphs into tokens that cannot be further decomposed.
  • Decomposing compound words: In languages such as German, words composed of two or more tokens are called compound words and need to be split.
  • Spell correction: Lucene allows users to conduct spellchecking to enhance the accuracy of retrieval.
  • Synonym analysis: This enables users to manually add synonyms in Lucene to improve the recall rate of the search system (note: the accuracy rate and recall rate will be elaborated upon shortly).
  • Stemming and lemmatization: The former derives the root of a word by removing its suffix (for example, the root form play is derived from plays, playing, and played), while the latter converts words into their basic forms (for example, is, are, and been are converted to be).

Let’s attempt to preprocess some texts using NLTK.

Important Note

NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources.

First, install the nltk Python package and download the punkt tokenizer data with these commands:

pip install nltk
python -m nltk.downloader 'punkt'

We preprocess the text Jina is a neural search framework built with cutting-edge technology called deep learning:

import nltk

sentence = 'Jina is a neural search framework built with cutting-edge technology called deep learning'

def tokenize_and_stem():
    tokens = nltk.word_tokenize(sentence)
    stemmer = nltk.stem.porter.PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens

if __name__ == '__main__':
    tokens = tokenize_and_stem()
    print(tokens)

This code carries out two operations on the sentence, tokenizing and stemming, and prints the result. The raw input string is parsed into a Python list of strings, and each parsed token is then stemmed to its root form. For instance, cutting and called are converted to cut and call respectively. For more operations, please refer to the official documentation of NLTK (https://www.nltk.org/).

After files have been converted into Lucene Documents and analyzed, the cleaned documents are indexed. Generally, in a traditional search system, all files are indexed using an inverted index. An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a map from content, such as words or numbers, to its locations in a database file, a document, or a set of documents.

Simply put, an inverted index consists of two parts: a term dictionary, and postings.

Tokens, their IDs, and their document frequency (the number of documents in the collection to be retrieved in which the token appears) are stored in the term dictionary. The collection of all tokens is called the vocabulary. All tokens are sorted in alphabetical order in the dictionary.

In the postings, we save the token ID and the document IDs where the token occurred. Assuming that, in the aforementioned example, the token jina from our query keyword hello jina appears in three documents of the entire collection (1.txt, 3.txt, and 11.txt), then the token is "jina" and its document frequency is 3. Meanwhile, the names of the three text documents, 1.txt, 3.txt, and 11.txt, are saved in the postings. The indexing of the text files is then complete, as shown in the following figure:

Figure 1.1 – Data structure of inverted index
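
To make the data structure in Figure 1.1 concrete, here is a minimal, purely illustrative Python sketch of a term dictionary and postings built over a handful of tiny documents. It is not how Lucene stores its index internally, and the document names, contents, and whitespace tokenizer are hypothetical:

# An illustrative inverted index (not Lucene's internal format)
from collections import defaultdict
documents = {
    '1.txt': 'hello jina is a neural search framework',
    '3.txt': 'jina supports cross-modal search',
    '11.txt': 'hello jina again',
    '2.txt': 'hello world',
}
term_dictionary = {}            # token -> [token ID, document frequency]
postings = defaultdict(set)     # token ID -> set of document names
for doc_id, text in documents.items():
    for token in set(text.lower().split()):   # naive whitespace tokenizer
        if token not in term_dictionary:
            term_dictionary[token] = [len(term_dictionary), 0]
        term_dictionary[token][1] += 1         # one more document contains this token
        postings[term_dictionary[token][0]].add(doc_id)
def lookup(token):
    # Return the documents containing a token without scanning every file
    entry = term_dictionary.get(token)
    return sorted(postings[entry[0]]) if entry else []
print(lookup('jina'))   # ['1.txt', '11.txt', '3.txt']

Looking up jina touches only the term dictionary and one postings list, which is exactly what makes an inverted index faster than the sequential scan shown earlier.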

When a user makes a query, keywords used for the query are generally shorter than the collection of documents to be retrieved. Lucene can perform the same preprocessing for such keywords (such as tokenization, decomposition, and spelling correction).

The processed tokens are mapped to the postings through the term dictionary in the inverted index so that matching files can be found quickly. Finally, Lucene's scoring kicks in and scores each related file according to a vector space model, in which both the indexed documents and the query are represented as vectors.

Assuming that our query keyword is jina, we map it onto the vocabulary of the inverted index and represent every term that does not appear in the query with -; the query vector [-, 'jina', -, -, ...] is then obtained. This is how a query is represented in the vector space model of a traditional search engine.

Figure 1.2 – Term occurrence in the vector space model

Next, in order to derive the ranking, we need to represent the tokens of the vector space model numerically. tf-idf is a simple, commonly used approach.

With this algorithm, we grant a higher weight to any token that appears frequently within a document. However, if the token also appears in many documents across the collection, we consider it weakly discriminative and reduce its weight again. If the token does not appear in a document at all, its weight is 0.

In Lucene, an algorithm called BM25, which further refines tf-idf, is employed more frequently. After the numerical calculation, the vectors are expressed as follows:

Figure 1.3 – Vector space representation

As shown in the preceding figure, because the word a appears too frequently (it appears in both document 1 and document 2), it has a low weight score. The token jina, a relatively uncommon word (appearing only in document 2), has been granted a higher weight.

In the query vector, because the query consists of only one word, jina, its weight is set to 1 and the weights of the other tokens, which do not appear, are set to 0. Afterward, we multiply the query vector and each document vector element by element and add up the results to obtain the score of each document for the query. The documents are then sorted by score in descending order (from high to low) and returned to the user.

In short, if the keyword used for a query appears frequently in a particular file and rarely in the rest of the collection, that file's relative score will be higher and it will be returned to the user with higher priority. Of course, Lucene also grants different weights to the various parts of a file. For example, the title and keywords of a file have a higher scoring weight than the body. Given that this book is about neural search, this aspect will not be elaborated upon further here.
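
As a rough illustration of the scoring just described, the following sketch computes textbook tf-idf weights for a tiny corpus and ranks the documents against a one-word query using a dot product. Note that Lucene's practical scoring and BM25 differ in their exact formulas, and the corpus here is hypothetical:

# Toy vector space scoring with tf-idf weights (illustrative only)
import math
from collections import Counter
documents = {
    'doc1': 'a cat sat on a mat',
    'doc2': 'jina is a neural search framework',
}
query = 'jina'
tokenized = {doc_id: text.split() for doc_id, text in documents.items()}
vocabulary = sorted({token for tokens in tokenized.values() for token in tokens})
def tf_idf_vector(tokens):
    counts = Counter(tokens)
    vector = []
    for term in vocabulary:
        tf = counts[term] / len(tokens)                        # term frequency in this document
        df = sum(term in doc for doc in tokenized.values())    # how many documents contain the term
        idf = math.log(len(documents) / df) if df else 0.0
        vector.append(tf * idf)
    return vector
doc_vectors = {doc_id: tf_idf_vector(tokens) for doc_id, tokens in tokenized.items()}
query_vector = [1.0 if term == query else 0.0 for term in vocabulary]
# Score = element-wise product summed up, then sort from high to low
scores = {doc_id: sum(q * d for q, d in zip(query_vector, vec))
          for doc_id, vec in doc_vectors.items()}
for doc_id, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(doc_id, round(score, 4))

Because a appears in both documents, its idf (and hence its weight) is 0, while jina appears only in doc2 and therefore receives a positive weight, so doc2 is ranked first for this query.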

Pros and cons of the traditional search system

In the previous section, we briefly revisited traditional symbolic search. Perhaps you have noticed that both Lucene and the Lucene-based search frameworks, such as Elasticsearch and Solr, are based on text retrieval. This has quite a few advantages in text-by-text search scenarios:

  • Mature technology: Lucene has been under development since 1999, so Lucene and Lucene-based search systems have existed for over 20 years and have been widely used in various web applications.
  • Easy integration: Developers of a web application do not need a deep understanding of Elasticsearch, Solr, or the internal logic of Lucene; only a small amount of code is required to integrate a high-performing, extensible search system into a web application.
  • Well-developed ecosystem: Thanks to the work of Elastic, the company behind it, Elasticsearch has extended its functionality significantly. It is no longer just a search framework, but a platform equipped with user management, a RESTful interface, data backup and restoration, and security features such as single sign-on and log auditing. Meanwhile, the Elasticsearch community has contributed a variety of plugins and integrations.

At the same time, you have probably realized that both Elasticsearch and Solr, with Lucene at their core, have unavoidable flaws.

In the previous section, we introduced the concept of modality. Lucene, and Elasticsearch, which is built on top of it, are inherently unable to support cross-modal and multi-modal search. Let's take a moment to review the operating principle of Lucene, since it powers most of the search systems people use on a daily basis. When text is preprocessed, the search keyword must be text; likewise, when the data collection to be retrieved is preprocessed and indexed, the result stored in the inverted index is also text.

In this way, a Lucene-based search platform can only accept queries in the text modality and retrieve data in the text modality. If the objects to be retrieved are images, audio, or video files, how can they be found using a traditional search system? It is quite simple; two main methods are employed:

  • Manual tagging and adding metadata: For example, when a user uploads a song to a music platform, they may manually tag the author, album, music type, release time, and other data. Doing so ensures that users are able to retrieve music using text.
  • Hypothesis of the surrounding text: If an image appears in an article, in the absence of user tagging, the traditional search system assumes it to be closely associated with its surrounding text. Accordingly, when a user's query keyword matches the surrounding text of the image, the image itself is matched.

The essence of both methods is to convert documents of non-text modalities into the text modality so as to reuse existing retrieval technology. However, this modal conversion either relies on a large amount of manual tagging or comes at the cost of query accuracy, which greatly undermines the user's search experience.

Likewise, this search mode limits the user's search habits to keyword search and cannot be extended to a real cross-modal, let alone multi-modal, search. To see why, note that we can use a vector space to represent the keywords of a query and a vector space to represent a text document to be retrieved. However, with the technology underlying traditional search systems, we cannot use such a vector space to represent a piece of music, an image, or a video, nor can we map two documents of different modalities into the same vector space to compare their similarity.

With the research and development of (statistical) machine learning techniques, more and more researchers and engineers have started to empower their search systems with machine learning algorithms.

 

Machine learning for search

As a cross-disciplinary task, neural search has gone beyond the boundaries of information retrieval. It requires a general understanding of the concepts of machine learning, deep learning, and how we can apply these techniques to improve a search task. In this section, we will give a brief introduction to machine learning and how it can be applied to search systems.

Understanding machine learning and artificial intelligence

Machine learning refers to a technique that teaches computers to make decisions in a way that comes naturally to humans, by enabling computers to learn the inherent patterns in data and acquire new experience and knowledge, thus improving their intelligence.

As various industries demand more efficient data processing and analysis to cope with their growing volumes of data, a large number of machine learning algorithms have emerged. Statistical machine learning primarily refers to the steps and processes of solving optimization problems through mathematical and statistical methods.

Depending on the data and model requirements, appropriate machine learning algorithms are selected and employed to tackle practical issues more efficiently. Machine learning has achieved great success in many fields, such as natural language understanding, computer vision, machine translation, and expert systems. It is fair to say that whether a system can learn has become a hallmark of whether it possesses intelligence.

Hinton et al. (2006) proposed the concept of deep learning (deep neural networks). In 2009, Hinton introduced deep neural networks to speech researchers, and in 2010 the field witnessed a remarkable breakthrough in speech recognition. In the years that followed, convolutional neural networks (CNNs) were applied to image recognition, leading to significant achievements.

LeCun, Bengio, and Hinton (2015), three pioneers of the field, published a review titled Deep Learning in Nature, showing that deep neural networks had been accepted not only by academia but also by industry. Furthermore, in 2016 and 2017, the world witnessed a general expansion of deep learning. AlphaGo and AlphaZero, developed by Google, learned the game of Go in a short period and decisively defeated the world's top players. The intelligent voice system launched by iFLYTEK boasts a recognition accuracy rate of over 97% and stands at the forefront of AI worldwide, and the autonomous driving systems developed by companies such as Google and Tesla have passed the milestone of on-road testing. These achievements have once again revealed the value and appeal of neural networks.

Machine learning has been applied to various industries, so maybe we can ask ourselves: can we apply machine learning to search applications? The answer here is “Yes.” In the next section, we’ll give a brief overview of different types of machine learning and how search can benefit from it.

Machine learning and learning-to-rank

Imagine a scenario where you intend to train a model capable of evaluating the price of a new apartment or house based on the collected data related to local real estate information and prices. This is one of the most important tasks of machine learning: regression.

Before the popularization of the deep learning technique, data analysts would have had to clean this data, use business logic to perform feature engineering, and design features of a real estate price predictor, such as the floor area, construction time, and type of apartments or houses, as well as the average prices of surrounding apartments or houses, and so on.

After feature engineering has been completed, the raw data is used to form a two-dimensional data table, similar to a spreadsheet, where each row represents a house record and each column represents a feature. The data is usually divided into two or three parts: the majority of the data is used for model training, while a smaller portion is held out for model evaluation.

Next, machine learning engineers select one or more appropriate algorithms from the machine learning toolkit to train the model and evaluate its performance on the test data. Finally, the model with the best performance is deployed in the production environment to serve customers.
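
As a minimal sketch of this workflow, the snippet below trains and evaluates a price-prediction model with scikit-learn. The engineered features and the data are purely synthetic and hypothetical, and any regressor could stand in for the linear model used here:

# Illustrative supervised regression workflow with hypothetical features and synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
rng = np.random.default_rng(42)
# Hypothetical engineered features: floor area, construction year, average price of surrounding homes
X = np.column_stack([
    rng.uniform(30, 200, 500),
    rng.integers(1960, 2022, 500),
    rng.uniform(2000, 9000, 500),
])
y = 1500 * X[:, 0] + 0.8 * X[:, 2] * X[:, 0] + rng.normal(0, 20000, 500)   # synthetic prices
# Majority of the data for training, the rest held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print('MAE on held-out data:', mean_absolute_error(y_test, model.predict(X_test)))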

Imagine another scenario where many landmark pictures have been collected from social networks. When a user uploads a new landmark picture, you expect your system to automatically recognize the name of the site. This is another important task of machine learning: classification.

In traditional machine learning and computer vision, local features such as SIFT, SURF, and HOG are used to build a Bag-of-Visual-Words (BoW) representation, through which a vector representation of the photo is established; a model is then trained on these vectors to predict the class. Nowadays, deep learning serves as the model that extracts visual features from images without the need for manual feature engineering.
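
A minimal sketch of the BoW idea follows, assuming the local descriptors (for example, SIFT vectors) have already been extracted; random arrays stand in for them here, and the vocabulary size of 64 is arbitrary:

# Sketch: Bag-of-Visual-Words from pre-extracted local descriptors (random stand-ins here)
import numpy as np
from sklearn.cluster import KMeans
rng = np.random.default_rng(0)
descriptors_per_image = [rng.random((int(rng.integers(50, 200)), 128)) for _ in range(20)]
# Build the visual vocabulary by clustering all descriptors from the training images
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))
def bow_vector(descriptors):
    # Histogram of visual word assignments = the image's BoW representation
    words = kmeans.predict(descriptors)
    histogram = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return histogram / histogram.sum()
print(bow_vector(descriptors_per_image[0]).shape)   # (64,) vector, ready for a classifier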

Let's take a moment to look at our two examples. In both cases, models are trained on engineered features, and all the training data is labeled with the ground truth; that is, the house (apartment) prices and landmark names are documented. Such tasks are collectively referred to as supervised machine learning.

Since we can perform regression analysis and classification of data through supervised machine learning, is it possible to apply supervised machine learning to search? The answer is yes, of course.

Assuming that our task is to optimize a search system, the goal is to predict the user's click-through rate for each document and return documents with a higher predicted click-through rate to the user first. Two families of approaches exist: learning-to-rank and, later, neural information retrieval. The concept of learning-to-rank (based on statistical machine learning) was proposed by academia in the early 1990s and evolved for nearly 20 years before losing momentum to deep learning around 2010, when neural information retrieval began its rise.

Just like the prediction of apartment (or house) prices or landmark recognition, engineers first perform feature engineering after collecting the data. Common features include the number of query keywords in the document title/body, the proportion of the title/body covered by the query keywords, the tf-idf score, and the BM25 score, among others. In addition, the final score of a traditional search system can itself be used as a numerical feature when training the model.
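
As a sketch of the idea, the snippet below trains a simple pointwise ranker on hypothetical query-document features and synthetic click labels; real learning-to-rank setups use richer features, real click logs, and often pairwise or listwise objectives:

# Sketch: pointwise learning-to-rank on hypothetical query-document features
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
rng = np.random.default_rng(1)
# Each row: [keyword hits in title, keyword coverage of body, tf-idf score, bm25 score]
features = rng.random((1000, 4))
clicks = (0.5 * features[:, 3] + 0.3 * features[:, 0] + rng.normal(0, 0.1, 1000) > 0.5).astype(float)
ranker = GradientBoostingRegressor().fit(features[:800], clicks[:800])   # train on most of the data
# At query time, score the candidate documents and return them from high to low
candidate_features = rng.random((10, 4))
order = np.argsort(-ranker.predict(candidate_features))
print(order)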

In a real-world scenario, Microsoft's Bing search platform was developed with its own Microsoft Learning to Rank Datasets, which contain 136 features. Microsoft also published a learning-to-rank contest calling for the use of these datasets to train models that predict how well web pages match a query. Afterward, a trained model was applied in the production environment of Bing search, which improved search quality to a certain extent.

At the same time, search companies such as Google, Yahoo, and Baidu have also conducted a large amount of research and have partially deployed their research results into the production environment.

In the fields of enterprise and personal search, Elastic has developed its learning-to-rank plugin, Elasticsearch LTR, which can be plugged into your Elasticsearch-powered search system. As a user, you still need to use a familiar machine learning framework to design features, train the learning-to-rank model, evaluate model performance, and select models. Elasticsearch's learning-to-rank support can then be plugged into the existing search system to produce a new predicted ranking score based on the model's output. Although machine learning can be used to design models for multi-modal data, Elasticsearch places more emphasis on text-to-text search. Figure 1.4 demonstrates how learning-to-rank works in a search system.

Figure 1.4 – Learning-to-rank

This book will focus on search powered by deep neural networks, namely neural information retrieval.

The advantage of neural information retrieval is that users do not have to design features themselves. Normally, we leverage two independent deep learning models (neural networks) as feature extractors to extract vectors from queries and documents, respectively. Then, we measure the similarity between the two vectors using metrics such as cosine similarity. At this stage, neural-network-powered search has become very promising for industrial use cases. In the next section, we will introduce some of the potential applications of neural-network-powered search.
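
The following sketch shows this core idea with plain NumPy: a query embedding is compared against document embeddings by cosine similarity and the closest documents are returned first. The embeddings here are hard-coded stand-ins for the output of two (hypothetical) neural encoders:

# Cosine similarity between a query embedding and document embeddings (illustrative)
import numpy as np
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
query_embedding = np.array([0.2, 0.8, 0.1])
document_embeddings = {
    'doc_a': np.array([0.1, 0.9, 0.0]),
    'doc_b': np.array([0.9, 0.1, 0.3]),
}
ranked = sorted(document_embeddings.items(),
                key=lambda item: cosine_similarity(query_embedding, item[1]),
                reverse=True)
for doc_id, embedding in ranked:
    print(doc_id, round(cosine_similarity(query_embedding, embedding), 3))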

 

Practical applications powered by neural search

The previous section provided an overview of the representation and principles of dense vectors. This section will focus on the application of these vectors. In our daily work and study, every file has a modality, such as text, image, audio, or video. If documents of any modality can be represented by dense vectors and mapped to the same vector space, it becomes possible to compare cross-modal similarity. This also allows us to use one modality to search for data in another modality.

This scenario was first extensively put into practice in the field of e-commerce, with image search as a common use case. A typical application is taking a product photo and hunting for related or similar products, both online and offline.

The e-commerce search primarily consists of steps such as the following:

  1. Preprocessing
  2. Feature extraction and fusion
  3. Large-scale similarity search

During preprocessing, techniques such as resizing, normalization, and semantic segmentation may first be employed to process the images. Resizing and normalization make the input image match the input format of the pre-trained neural network. Semantic segmentation removes background noise from the image, leaving only the product itself. Of course, we need to pre-train a neural network for feature extraction, which will be elaborated on shortly. By the same token, if the dataset of e-commerce products to be retrieved contains a large amount of noise, such as buildings and pedestrians in the background of fashion photos, it will be necessary to train a semantic segmentation model that can accurately extract the product outline from the photos.

During feature extraction, the fully connected (FC) layer of a deep neural network is generally used as a feature extractor. Common deep learning backbone models are AlexNet, VGGNet, Inception, and ResNet. These models are usually pre-trained on a large-scale dataset (such as ImageNet) for classification tasks, and transfer learning is then carried out with a dataset from the e-commerce domain to make the feature extractor suitable for that field, such as feature extraction for fashion. A feature extractor with deep learning at its core can be regarded as a global feature extractor. In some applications, traditional computer vision features, such as SIFT or VLAD, are employed to extract local features, which are fused with the global features to enhance the vector representation. The global feature extractor transforms the preprocessed image into a dense vector representation.
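
A common way to implement such a global feature extractor, sketched below with PyTorch and a reasonably recent torchvision, is to take a ResNet pre-trained on ImageNet and drop its final classification layer so that the pooled activations become the dense vector; the image path is hypothetical, and the backbone, input size, and fine-tuning strategy will vary per application:

# Sketch: a pre-trained ResNet as a global image feature extractor
import torch
from torchvision import models, transforms
from PIL import Image
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Identity()   # drop the classification head, keep the pooled features
model.eval()
image = Image.open('fashion_item.jpg').convert('RGB')   # hypothetical product photo
with torch.no_grad():
    embedding = model(preprocess(image).unsqueeze(0)).squeeze(0)
print(embedding.shape)   # a 2048-dimensional dense vector for this image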

When users perform an image-to-image search, the query keyword is itself an image. The system generates a dense vector representation of that image, and users can find the most similar images by comparing this vector against those of all images in the library. This is feasible in theory. However, in reality, as the number of products rapidly increases, there may be tens of millions of dense vectors in the image index. As a result, comparing vectors in a pair-wise manner fails to meet the user's requirement for a quick response from the retrieval system.

Therefore, large-scale similarity search techniques, such as product quantization, are generally used. They divide the vectors to be searched into multiple buckets and perform a quick match based on those buckets, greatly speeding up the vector-matching process at the cost of a small loss in recall. This technique is commonly referred to as approximate nearest neighbor (ANN) retrieval. Commonly used ANN libraries include FAISS, maintained by Facebook, and Annoy, maintained by Spotify.
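
As a minimal sketch of ANN search, the snippet below builds a FAISS index that combines bucketing (IVF) with product quantization over random vectors; the dimensionality and parameters are arbitrary examples, and real deployments tune them against recall and latency requirements:

# Sketch: approximate nearest neighbor search with FAISS (IVF + product quantization)
import numpy as np
import faiss
d, n = 128, 10000                        # vector dimensionality and corpus size (arbitrary)
database = np.random.random((n, d)).astype('float32')
queries = np.random.random((5, d)).astype('float32')
quantizer = faiss.IndexFlatL2(d)         # coarse quantizer that defines the buckets
index = faiss.IndexIVFPQ(quantizer, d, 100, 8, 8)   # 100 buckets, 8 sub-quantizers, 8 bits each
index.train(database)                    # learn the bucket centroids and PQ codebooks
index.add(database)
index.nprobe = 10                        # how many buckets to visit per query
distances, ids = index.search(queries, 5)
print(ids[0])                            # IDs of the 5 closest database vectors for the first query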

Likewise, image-to-image search in an e-commerce scenario is also applicable to other scenarios, such as tourism landmark retrieval (using pictures of tourist attractions to quickly locate other pictures of that attraction or similar attractions) or celebrity retrieval (using a photo of a celebrity to retrieve other pictures of them). In the field of search engines, there are many such applications, collectively referred to as reverse image search.

Another interesting application is question answering. Neural-network-based search systems can be powerful for building question-answering (QA) systems for different tasks. First, the questions and answers that are currently available are taken as a training dataset on which to develop a pre-trained text model. When the user enters a question, the pre-trained model encodes the question into a dense vector representation, similarity matching is conducted against the dense vector representations of the existing repository of answers, and the user is quickly guided to the answer. Second, many question-answering platforms, such as Quora, StackOverflow, and Zhihu, already have a large number of previously asked questions. When a user wants to ask a question, the question-answering system first determines whether the question has already been asked by someone else. If so, the user is advised to click and check the answers to similar questions instead of repeating the query. This also involves similarity matching, which is normally referred to as deduplication or paraphrase identification.

Meanwhile, in the real world, a large number of unexplored applications can be built using neural information retrieval. For instance, if you want to use text to search for untagged music, you need to map the text and music representations to the same vector space. Images can also be used to locate the moment a scene appears in a video. Conversely, when a user is watching a video, a product that appears in it can be retrieved and purchased right away. Deep learning can also be applied to specialized data retrieval, such as source code retrieval, DNA sequence retrieval, and more!

 

New terms learned in this chapter

  • Traditional search: Mostly applied to text retrieval. Measures similarity by the weighted score of occurrences of a query's tokens within the documents.
  • Indexing: The process of converting files into a data structure that supports rapid search, avoiding a continuous scan of all files.
  • Searching: The process of computing similarity scores between a user query and the indexed documents in the document store and returning the top-k matches.
  • Vector space model: A way to represent a document numerically. The dimensionality of the VSM is the number of distinct tokens across all documents. The value of each dimension is the weight of the corresponding term.
  • TF-IDF: Term Frequency-Inverse Document Frequency is an algorithm intended to reflect how important a word is to a document in a collection of documents to be indexed.
  • Machine learning: This refers to a technique that teaches computers to make decisions in a way that comes naturally to humans by enabling computers to learn the distribution of data and acquire new experience and knowledge.
  • Deep neural networks: A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers that aims to predict, classify, or learn a compact representation (dense vector) of a piece of data.
  • Neural search: Unlike symbolic search, neural search makes use of the representation (a dense vector) generated by DNNs and measures the similarity between a query vector and a document vector, returning the top-k matches based on certain metrics.
 

Summary

In this chapter, you have learned about the key concepts of searching and matching. We have also covered the difference between legacy search and neural-network-based search. We saw how neural networks can help us tackle the issues traditional search cannot solve, such as cross-modality or multi-modality search.

Neural networks are able to encode different types of information into a common embedding space and make different pieces of information comparable, and that’s why deep learning and neural networks have the potential to better fulfill a user’s information needs.

We have introduced several possible applications of deep-learning-powered search systems, for instance, vision-based product search in fashion or tourism, or text-based search for question answering and text deduplication. More kinds of applications are still to be explored!

You should now understand the core idea behind neural search: neural search has the ability to encode any kind of data into an expressive representation, namely an embedding. Creating a quality embedding is crucial to a search application powered by deep learning, since it determines the quality of the final search result.

In the next chapter, we will introduce the foundations of embeddings, such as how to encode information into embeddings, how to measure the distance between different embeddings, and some of the most important models we can use to encode different modalities of data.

About the Authors
  • Jina AI

    Jina AI is a neural search company that provides cloud-native neural search solutions powered by AI and deep learning. It provides an open-source neural search ecosystem for businesses and developers, enabling everyone to search for information in all kinds of data with high availability and scalability.

  • Bo Wang

    Bo Wang is a machine learning engineer at Jina AI. He has a background in computer science and is especially interested in the field of information retrieval. In recent years, he has been conducting research and engineering work on search intent classification, search result diversification, content-based image retrieval, and neural information retrieval. At Jina AI, Bo is working on developing a platform for automatically improving search quality with deep learning. In his spare time, he likes to play with his cats, watch anime, and play mobile games.

  • Cristian Mitroi

    Cristian Mitroi is a machine learning engineer with a wide breadth of experience across the full stack, from infrastructure to model iteration and deployment. His background is in linguistics, which led him to focus on NLP. He also enjoys, and has experience in, teaching and interacting with the community, and has given workshops at various events. In his spare time, he performs improv comedy and organizes too many pen-and-paper role-playing games.

  • Feng Wang

    Feng Wang is a machine learning engineer at Jina AI. He received his Ph.D. from the Department of Computer Science at Hong Kong Baptist University in 2018. He has been a full-time R&D engineer for the past few years, and his interests include data mining and artificial intelligence, with a particular focus on natural language processing, multi-modal representation learning, and recommender systems. In his spare time, he likes climbing, hiking, and playing mobile games.

  • Shubham Saboo

    Shubham Saboo has taken on multiple roles, from a data scientist to an AI evangelist, at renowned firms across the globe, where he was involved in building organization-wide data strategies and technology infrastructure to create and scale data teams from scratch. His work as an AI evangelist has led him to build communities and reach out to a broader audience to foster the exchange of ideas and thoughts in the burgeoning field of AI. As part of his passion for learning new things and sharing knowledge with the community, he writes technical blogs on the advancements in AI and its economic implications. In his spare time, you can find him traveling the world, which enables him to immerse himself in different cultures and refine his worldview.

  • Susana Guzmán

    Susana Guzmán is a product manager at Jina AI. She has a background in computer science and for several years worked at different firms as a software developer with a focus on computer vision, using both C++ and Python. Her strong interest in open source led her to Jina, where she started as a software engineer for a year before switching from engineering to product management once she had a clear overview of the product. In her spare time, she likes to cook food from different cuisines around the world, looking for her new favorite dish.
