Chapter 5. Scaling Search Performance

As data grows, it increases the time taken both to search and to create new indexes that keep up with the growing size of the repository. The simplest way to preserve search performance while scaling your data is to keep adding hardware, that is, more processing power and more memory. However, this is not a cost-effective alternative, so instead we want to optimize how the big data search instance runs. We covered the different Solr architectures in Chapter 4, Big Data Search Using Hadoop and Its Ecosystem; the most suitable one can be chosen on the basis of your requirements and usage patterns.

Optimizing the overall technology stack, which includes Apache Hadoop and Apache Solr, helps you maintain more data with reasonable performance. This optimization matters most while scaling your instance for big data with Hadoop and Solr. We are going...

Understanding the limits


Although you can have a completely distributed system for your big data search, there is a limit to how far you can go. As you keep distributing data across more shards, you may end up facing what is called the "laggard problem" for your instance's indexes.

This problem means that the response time of your search query, which is an aggregation of results from all the shards, is governed by the following formula:

QueryResponse = avg(max(shardResponseTime))

This means that the more shards you have, the more likely it is that one of them responds slowly to your queries (due to some anomaly); since the aggregated response must wait for every shard, your overall query response time starts to increase. For example, if most of your shards answer in around 100 milliseconds but one shard occasionally takes 2 seconds, the whole query takes roughly 2 seconds.

Distributed search in Apache Solr also has several limitations. Each document uploaded as distributed big data must have a unique key, and this unique key must be stored in the Solr repository. To achieve this, the key field in Solr's schema.xml should be declared with stored="true". This unique key...
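
As a minimal sketch of what a stored unique key looks like in schema.xml (the field name id is only an illustrative choice):

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>

The <uniqueKey> element names the field Solr uses to identify each document, and stored="true" ensures the key value can be returned and used when results from different shards are merged.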

Optimizing search schema


When Solr is used for a specific requirement (for example, log search for an enterprise application), it holds a specific schema that can be defined in schema.xml and copied over to the nodes. The indexes are built from the attributes defined in this schema, so the schema plays a vital role in the performance of your Solr instance.

Specifying default search field

In the schema.xml file of the Solr configuration, the system allows you to specify the <defaultSearchField> parameter. This parameter controls which field is searched when a query does not name a field explicitly. It is optional; if it is not specified, every query that does not provide a field name is run against all the available fields in the schema, which not only consumes more CPU time but also slows down the overall search performance.
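
For illustration, assuming the schema contains a general-purpose field named text, the entry would look like this:

<defaultSearchField>text</defaultSearchField>

With this in place, a query such as q=beach is evaluated only against the text field rather than against every field in the schema.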

Configuring search schema fields

In a custom schema, having a larger number of fields...

Index optimization


The indexes used in Apache Solr are inverted indexes. With the inverted indexing technique, all of your text is parsed and the words are extracted from it. These words are then stored as index entries, along with the locations where they appear. For example, consider the following statements:

  1. "Mike enjoys playing on a beach"

  2. "Playing on the ground is a good exercise"

  3. "Mike loves to exercise daily"

The index with location information for all these sentences will look like the following (the numbers in brackets denote (sentence number, word number)):

Mike     (1,1), (3,1)
enjoys   (1,2)
playing  (1,3), (2,1)
on       (1,4), (2,2)
a        (1,5), (2,6)
beach    (1,6)
the      (2,3)
ground   (2,4)
is       (2,5)
good     (2,7)
loves    (3,2)
to       (3,3)
exercise (2,8), (3,4)
daily    (3,5)

When you perform a delete on your inverted index, the document is not physically removed; it is only marked as deleted. It gets cleaned up only when the segment that it is a part of is merged. When you...
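
As a sketch of how this cleanup can be triggered explicitly, assuming a core named collection1 running on the default local Jetty port, you can either force a full merge or ask Solr to expunge deleted documents during a commit:

curl "http://localhost:8983/solr/collection1/update?optimize=true"
curl "http://localhost:8983/solr/collection1/update?commit=true&expungeDeletes=true"

Both operations rewrite index segments and are I/O-intensive, so they are best run during periods of low traffic.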

Optimizing search runtime


The speed of search at runtime is also a primary concern, so it should be optimized as well. You can perform this optimization at various levels. When Solr fetches the results for the queries passed by the user, you can limit the number of results fetched by specifying the rows attribute in your search. The following query returns 10 results starting at offset 10, that is, results 11 to 20:

q=Scaling+Big+Data&rows=10&start=10

The number of result documents that Solr collects and caches for each query can also be tuned in solrconfig.xml through the queryResultWindowSize setting.
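
In solrconfig.xml, this is a single element inside the <query> section; the value of 20 below is only an illustrative window size:

<queryResultWindowSize>20</queryResultWindowSize>

Keeping the window size close to the number of rows you typically request lets Solr serve the next page of results from the queryResultCache instead of re-executing the query.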

Let's look at various other optimizations possible in the search runtime.

Optimizing through search query

Whenever a query request is forwarded to a search instance, Solr can respond in various formats, such as XML or JSON. A typical Solr response contains not only the matched results but also your facets, highlighted text, and many other elements that are used by the client...
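
As an example of trimming this response, the fl parameter restricts which fields are returned and wt selects the response format; the field names below are hypothetical:

q=Scaling+Big+Data&wt=json&fl=id,title,score&rows=10

Returning only the fields that the client actually renders reduces both the size of the response and the time Solr spends loading and serializing stored fields.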

Monitoring Solr instance


You can monitor the Solr instance for memory and CPU usage. There are various ways of doing this; the Solr administration interface itself provides some usage statistics. Using standard tools such as JConsole and JVisualVM, you can connect to the Solr process and monitor its memory usage, threads, and CPU usage.

With JConsole, you can also look at the different JMX-based MBeans exposed by Solr. On an example Jetty setup, you can connect to Solr by using the following procedure:

  • Open the JDK folder that Solr is using

  • Go to the bin folder and run JConsole

  • In JConsole, connect to the Solr process; in the case of the default Jetty implementation, connect to the start.jar process

  • Once connected, switch to the MBeans tab

You will find the MBean browser as shown in the following screenshot:

For a clustered search instance, you can connect remotely through JConsole. However, while starting the JVM, you need to pass the following parameters to it (to bypass authentication...
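
A typical set of such JVM arguments looks like the following; the port 3000 is only an example, and disabling authentication and SSL in this way is appropriate only inside a trusted network:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=3000
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

With these flags in place, JConsole can connect remotely by entering host:3000 in its New Connection dialog.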

Summary


In this chapter, we covered various ways of optimizing Apache Solr and Hadoop instances. We started with schema optimization and index optimization, then looked at optimizing the container and the search runtime to speed up the overall process, as well as optimizing Hadoop instances. Finally, we looked at different ways of monitoring Solr instances for performance.
