You're reading from Neural Search - From Prototype to Production with Jina

Product type: Book
Published in: Oct 2022
Publisher: Packt
ISBN-13: 9781801816823
Edition: 1st Edition
Authors (6):

Jina AI

Jina AI is a neural search company that provides cloud-native neural search solutions powered by AI and deep learning. It provides an open-source neural search ecosystem for businesses and developers, enabling everyone to search for information in all kinds of data with high availability and scalability.

Bo Wang

Bo Wang is a machine learning engineer at Jina AI. He has a background in computer science and is especially interested in information retrieval. In recent years, he has conducted research and engineering work on search intent classification, search result diversification, content-based image retrieval, and neural information retrieval. At Jina AI, Bo is working on a platform for automatically improving search quality with deep learning. In his spare time, he likes to play with his cats, watch anime, and play mobile games.

Cristian Mitroi

Cristian Mitroi is a machine learning engineer with broad full-stack experience, from infrastructure to model iteration and deployment. His background in linguistics led him to focus on NLP. He also enjoys, and has experience in, teaching and interacting with the community, and has given workshops at various events. In his spare time, he performs improv comedy and organizes too many pen-and-paper role-playing games.

Feng Wang

Feng Wang is a machine learning engineer at Jina AI. He received his Ph.D. from the Department of Computer Science at Hong Kong Baptist University in 2018. He has been a full-time R&D engineer for the past few years, and his interests include data mining and artificial intelligence, with a particular focus on natural language processing, multi-modal representation learning, and recommender systems. In his spare time, he likes climbing, hiking, and playing mobile games.

Shubham Saboo

Shubham Saboo has taken on multiple roles, from data scientist to AI evangelist, at renowned firms across the globe, where he built organization-wide data strategies and technology infrastructure to create and scale data teams from scratch. His work as an AI evangelist has led him to build communities and reach a broader audience to foster the exchange of ideas in the burgeoning field of AI. Out of his passion for learning new things and sharing knowledge with the community, he writes technical blogs on advancements in AI and their economic implications. In his spare time, you can find him traveling the world, immersing himself in different cultures and refining his worldview.

Susana Guzmán

Susana Guzmán is the product manager at Jina AI. She has a background in computer science and spent several years as a software developer at different firms, focusing on computer vision in both C++ and Python. Her strong interest in open source led her to Jina, where she worked as a software engineer for a year before moving from engineering to product management once she had a clear overview of the product. In her spare time, she likes to cook dishes from cuisines around the world, looking for her new favorite.

Introducing Foundations of Vector Representation

Vectors and vector representation are at the very core of neural search, since the quality of the vectors determines the quality of the search results. In this chapter, you will learn about the concept of vectors within machine learning (ML). You will see common search algorithms that use vector representations, along with their strengths and weaknesses.

We’re going to cover the following main topics in this chapter:

  • Introducing vectors in ML
  • Measuring the similarity between two vectors
  • Local and distributed representations

By the end of this chapter, you will have a solid understanding of how every type of data can be represented in vectors and why this concept is at the very core of neural search.

Technical requirements

This chapter has the following technical requirements:

  • A laptop with a minimum of 4 GB RAM (8 GB or more is preferred)
  • Python installed with version 3.7, 3.8, or 3.9 on a Unix-like operating system, such as macOS or Ubuntu

The code for this chapter can be found at https://github.com/PacktPublishing/Neural-Search-From-Prototype-to-Production-with-Jina/tree/main/src/Chapter02.

Introducing vectors in ML

Text is an important means of recording human knowledge. As of June 2021, the number of web pages indexed by mainstream search engines such as Google and Bing had reached 2.4 billion, and the majority of that information is stored as text. How to store this textual information, and how to efficiently retrieve the required information from such a repository, has become a major issue in information retrieval. The first step in solving these problems is representing text in a format that computers can process.

As online information has become increasingly diverse, web pages now contain, in addition to text, a large amount of multimedia content, such as images, music, and video files. These files are more varied than text in both form and content and satisfy users’ needs from different perspectives. How to represent and retrieve these types of information, as well as how to pinpoint the multimodal information needed by users from...

Measuring similarity between two vectors

Measuring the similarity between two vectors is essential in a neural search system. Once all of the documents have been indexed into their vector representations, we apply the same encoding process to any incoming user query. Finally, we compare the encoded query vector against all of the encoded document vectors to find the most similar documents.

We can continue the example from the previous section and measure the similarity between doc1 and doc2. First, we run the script twice to encode both doc1 and doc2:

doc1 = 'Jina is a neural search framework'
doc2 = 'Jina is built with cutting edge technology called deep learning'

Then, we can produce a vector representation for both of them:

encoded_doc1 = [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
encoded_doc2 = [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

Since the dimension of the encoded result is always identical...
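
As a minimal, self-contained sketch of this comparison (not the exact script from the previous section, so the vocabulary order and the resulting vectors may differ from those shown above; the encode and cosine_similarity helpers are illustrative names, and NumPy is assumed to be installed), the two documents can be encoded as binary bag-of-words vectors and compared with cosine similarity:

import numpy as np

doc1 = 'Jina is a neural search framework'
doc2 = 'Jina is built with cutting edge technology called deep learning'

# Build a shared vocabulary from the tokens of both documents
vocabulary = sorted(set(doc1.lower().split()) | set(doc2.lower().split()))

def encode(doc):
    # Binary bag-of-words: 1 if the vocabulary word appears in the document, 0 otherwise
    tokens = set(doc.lower().split())
    return np.array([1 if word in tokens else 0 for word in vocabulary])

encoded_doc1 = encode(doc1)
encoded_doc2 = encode(doc2)

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(encoded_doc1, encoded_doc2))

The closer the printed score is to 1, the more tokens the two documents share under this encoding.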

Local and distributed representations

In this section, we’ll dive into local representations and distributed representations. We will go through the characteristics of these two types of representations and list the most widely used local and distributed representations for encoding different modalities of data.

Local vector representation

As a classic method of text representation, a local representation uses only disjoint dimensions of a vector to represent a given word. Disjoint here means that each dimension of the vector corresponds to a single token.

When only one dimension is used, this is called one-hot representation. One-hot means that the word is represented as a long vector whose dimension equals the total number of words to be represented. Most dimensions are 0, and only one dimension has the value 1; different words have their 1 in different dimensions. If this method of representation is stored sparsely...
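
As a brief illustrative sketch (the vocabulary below is hypothetical, taken from doc1 earlier in this chapter, and the one_hot helper is not code from the book), a one-hot encoder can be written as follows:

# Each word maps to a vector of vocabulary length containing a single 1
vocabulary = ['jina', 'is', 'a', 'neural', 'search', 'framework']

def one_hot(word):
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot('neural'))  # [0, 0, 0, 1, 0, 0]
print(one_hot('search'))  # [0, 0, 0, 0, 1, 0]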

Summary

This chapter described vector representation, a major step in the operation of a search engine.

First, we introduced the importance of vector representation and how to use it, and then covered local and distributed vector representation algorithms. For distributed vector representations, we covered the commonly used algorithms for text, images, and audio, and summarized common representation methods for other modalities and for multimodal data. We also saw that dense vector representations tend to carry richer contextual information than sparse vectors.

When building a scalable neural search system, it is important to create an encoder that can encode raw documents into high-quality embeddings. This encoding process needs to be fast to keep indexing time low. At search time, it is critical to apply the same encoding process and find the top-ranked documents in a reasonable...
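
As a rough sketch of this search-time step (random vectors stand in for real embeddings here; a real system would use a trained encoder, and NumPy is assumed), documents can be ranked against a query by cosine similarity and the top k returned:

import numpy as np

def top_k(query_vector, document_vectors, k=3):
    # Rank documents by cosine similarity to the query and return the
    # indices of the k best matches
    doc_matrix = np.array(document_vectors)
    scores = doc_matrix @ query_vector / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vector)
    )
    return np.argsort(scores)[::-1][:k]

# Random vectors stand in for real document and query embeddings
rng = np.random.default_rng(0)
document_vectors = rng.random((10, 8))
query_vector = rng.random(8)
print(top_k(query_vector, document_vectors, k=3))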

Further reading

  • Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
  • Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
  • He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
  • Schneider, Steffen, et al. “wav2vec: Unsupervised pre-training for speech recognition.” arXiv preprint arXiv:1904.05862 (2019).
  • Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International Conference on Machine Learning. PMLR, 2021.