Chapter 5. Scaling Search Performance

As data grows, it increases the time taken both to search and to create new indexes that keep up with the growing size of the repository. The simplest way to preserve search performance while scaling your data is to keep adding hardware, that is, more processing power and more memory. However, this is not a cost-effective alternative, so instead we want to optimize how the big data search instance runs. We covered the different Solr architectures in Chapter 4, Big Data Search Using Hadoop and Its Ecosystem; the most suitable one can be chosen on the basis of your requirements and usage patterns.

Optimizing the overall technology stack, which includes Apache Hadoop and Apache Solr, helps you maintain more data with reasonable performance. This optimization matters most while scaling your instance for big data with Hadoop and Solr. We are going...

Understanding the limits


Although you can have a completely distributed system for your big data search, there is a limit to how far you can go. As you keep distributing data across more shards, you may end up facing what is called the "laggard problem" for your instance's indexes.

This problem means that the response time of your search query, which is an aggregation of results from all the shards, is governed by the following formula:

QueryResponse = avg(max(shardResponseTime))

This means that the more shards you have, the more likely it is that one of them responds slowly to your queries (due to some anomaly); since the aggregated response must wait for every shard, your overall query response time starts to increase. For example, if most of your shards answer in around 100 milliseconds but one shard occasionally takes 2 seconds, the whole query takes roughly 2 seconds.

Distributed search in Apache Solr also has several limitations. Each document uploaded as distributed big data must have a unique key, and this unique key must be stored in the Solr repository. To achieve this, the key field in Solr's schema.xml should be declared with stored="true". This unique key...
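
As a minimal sketch of what a stored unique key looks like in schema.xml (the field name id is only an illustrative choice):

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>

The <uniqueKey> element names the field Solr uses to identify each document, and stored="true" ensures the key value can be returned and used when results from different shards are merged.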

Optimizing search schema


When Solr is used for a specific requirement (for example, log search for an enterprise application), it holds a specific schema that can be defined in schema.xml and copied over to the nodes. The indexes are built from the attributes defined in this schema, so the schema plays a vital role in the performance of your Solr instance.

Specifying default search field

In the schema.xml file of the Solr configuration, the system allows you to specify the <defaultSearchField> parameter. This parameter controls which field is searched when a query does not name a field explicitly. It is optional; if it is not specified, every query that does not provide a field name is run against all the available fields in the schema, which not only consumes more CPU time but also slows down the overall search performance.
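
For illustration, assuming the schema contains a general-purpose field named text, the entry would look like this:

<defaultSearchField>text</defaultSearchField>

With this in place, a query such as q=beach is evaluated only against the text field rather than against every field in the schema.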

Configuring search schema fields

In a custom schema, having a larger number of fields...

Index optimization


The indexes used in Apache Solr are inverted indexes. With the inverted indexing technique, all of your text is parsed and the words are extracted from it. These words are then stored as index entries, along with the locations where they appear. For example, consider the following statements:

  1. "Mike enjoys playing on a beach"

  2. "Playing on the ground is a good exercise"

  3. "Mike loves to exercise daily"

The index with location information for all these sentences will look like the following (the numbers in brackets denote (sentence number, word number)):

Mike     (1,1), (3,1)
enjoys   (1,2)
playing  (1,3), (2,1)
on       (1,4), (2,2)
a        (1,5), (2,6)
beach    (1,6)
the      (2,3)
ground   (2,4)
is       (2,5)
good     (2,7)
loves    (3,2)
to       (3,3)
exercise (2,8), (3,4)
daily    (3,5)

When you perform a delete on your inverted index, the document is not physically removed; it is only marked as deleted. It gets cleaned up only when the segment that it is a part of is merged. When you...
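
As a sketch of how this cleanup can be triggered explicitly, assuming a core named collection1 running on the default local Jetty port, you can either force a full merge or ask Solr to expunge deleted documents during a commit:

curl "http://localhost:8983/solr/collection1/update?optimize=true"
curl "http://localhost:8983/solr/collection1/update?commit=true&expungeDeletes=true"

Both operations rewrite index segments and are I/O-intensive, so they are best run during periods of low traffic.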

Optimizing search runtime


The speed of search at runtime is also a primary concern, so it should be optimized as well. You can perform this optimization at various levels. When Solr fetches the results for the queries passed by the user, you can limit the number of results fetched by specifying the rows attribute in your search. The following query returns 10 results starting at offset 10, that is, results 11 to 20:

q=Scaling+Big+Data&rows=10&start=10

The number of result documents that Solr collects and caches for each query can also be tuned in solrconfig.xml through the queryResultWindowSize setting.
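
In solrconfig.xml, this is a single element inside the <query> section; the value of 20 below is only an illustrative window size:

<queryResultWindowSize>20</queryResultWindowSize>

Keeping the window size close to the number of rows you typically request lets Solr serve the next page of results from the queryResultCache instead of re-executing the query.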

Let's look at various other optimizations possible in the search runtime.

Optimizing through search query

Whenever a query request is forwarded to a search instance, Solr can respond in various formats, such as XML or JSON. A typical Solr response contains not only the matched results but also your facets, highlighted text, and many other elements that are used by the client...
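
As an example of trimming this response, the fl parameter restricts which fields are returned and wt selects the response format; the field names below are hypothetical:

q=Scaling+Big+Data&wt=json&fl=id,title,score&rows=10

Returning only the fields that the client actually renders reduces both the size of the response and the time Solr spends loading and serializing stored fields.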

Monitoring Solr instance


You can monitor the Solr instance for memory and CPU usage. There are various ways of doing this; the Solr administration interface itself provides some usage statistics. Using standard tools such as JConsole and JVisualVM, you can connect to the Solr process and monitor its memory usage, threads, and CPU usage.

With JConsole, you can also look at the different JMX-based MBeans exposed by Solr. On an example Jetty setup, you can connect to Solr by using the following procedure:

  • Open the JDK folder that Solr is using

  • Go to the bin folder and run JConsole

  • In JConsole, connect to the Solr process; in the case of the default Jetty implementation, connect to the start.jar process

  • Once connected, switch to the MBeans tab

You will find the MBean browser as shown in the following screenshot:

For a clustered search instance, you can connect remotely through JConsole. However, while starting the JVM, you need to pass the following parameters to it (to bypass authentication...
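
A typical set of such JVM arguments looks like the following; the port 3000 is only an example, and disabling authentication and SSL in this way is appropriate only inside a trusted network:

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=3000
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false

With these flags in place, JConsole can connect remotely by entering host:3000 in its New Connection dialog.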

Summary


In this chapter, we covered various ways of optimizing Apache Solr and Hadoop instances. We started with schema optimization and index optimization, then looked at optimizing the container and the search runtime to speed up the overall process, as well as optimizing Hadoop instances. Finally, we looked at different ways of monitoring Solr instances for performance.
