Apache Solr 4 Cookbook


eBook: $26.99, now $22.94 (save 15%)
Formats: PDF, PacktLib, ePub, and Mobi
Print + free eBook + free PacktLib access to the book: $71.98, now $44.99 (save 37%)
Print cover price: $44.99
Free shipping to the UK, US, Europe, and selected countries in Asia.
  • Learn how to make Apache Solr search faster, more complete, and comprehensively scalable
  • Solve performance, setup, configuration, analysis, and query problems in no time
  • Get to grips with, and master, the new exciting features of Apache Solr 4

Book Details

Language : English
Paperback : 328 pages [ 235mm x 191mm ]
Release Date : January 2013
ISBN : 1782161325
ISBN 13 : 9781782161325
Author(s) : Rafał Kuć
Topics and Technologies : All Books, Big Data and Business Intelligence, Cookbooks, Open Source

Table of Contents

Preface
Chapter 1: Apache Solr Configuration
Chapter 2: Indexing Your Data
Chapter 3: Analyzing Your Text Data
Chapter 4: Querying Solr
Chapter 5: Using the Faceting Mechanism
Chapter 6: Improving Solr Performance
Chapter 7: In the Cloud
Chapter 8: Using Additional Solr Functionalities
Chapter 9: Dealing with Problems
Appendix: Real-life Situations
Index
  • Chapter 1: Apache Solr Configuration
    • Introduction
    • Running Solr on Jetty
    • Running Solr on Apache Tomcat
    • Installing a standalone ZooKeeper
    • Clustering your data
    • Choosing the right directory implementation
    • Configuring spellchecker to not use its own index
    • Solr cache configuration
    • How to fetch and index web pages
    • How to set up the extracting request handler
    • Changing the default similarity implementation
  • Chapter 2: Indexing Your Data
    • Introduction
    • Indexing PDF files
    • Generating unique fields automatically
    • Extracting metadata from binary files
    • How to properly configure Data Import Handler with JDBC
    • Indexing data from a database using Data Import Handler
    • How to import data using Data Import Handler and delta query
    • How to use Data Import Handler with the URL data source
    • How to modify data while importing with Data Import Handler
    • Updating a single field of your document
    • Handling multiple currencies
    • Detecting the document's language
    • Optimizing your primary key field indexing
  • Chapter 3: Analyzing Your Text Data
    • Introduction
    • Storing additional information using payloads
    • Eliminating XML and HTML tags from text
    • Copying the contents of one field to another
    • Changing words to other words
    • Splitting text by CamelCase
    • Splitting text by whitespace only
    • Making plural words singular without stemming
    • Lowercasing the whole string
    • Storing geographical points in the index
    • Stemming your data
    • Preparing text to perform an efficient trailing wildcard search
    • Splitting text by numbers and non-whitespace characters
    • Using Hunspell as a stemmer
    • Using your own stemming dictionary
    • Protecting words from being stemmed
  • Chapter 4: Querying Solr
    • Introduction
    • Asking for a particular field value
    • Sorting results by a field value
    • How to search for a phrase, not a single word
    • Boosting phrases over words
    • Positioning some documents over others in a query
    • Positioning documents with words closer to each other first
    • Sorting results by the distance from a point
    • Getting documents with only a partial match
    • Affecting scoring with functions
    • Nesting queries
    • Modifying returned documents
    • Using parent-child relationships
    • Ignoring typos in terms of performance
    • Detecting and omitting duplicate documents
    • Using field aliases
    • Returning a value of a function in the results
  • Chapter 5: Using the Faceting Mechanism
    • Introduction
    • Getting the number of documents with the same field value
    • Getting the number of documents with the same value range
    • Getting the number of documents matching the query and subquery
    • Removing filters from faceting results
    • Sorting faceting results in alphabetical order
    • Implementing the autosuggest feature using faceting
    • Getting the number of documents that don't have a value in the field
    • Having two different facet limits for two different fields in the same query
    • Using decision tree faceting
    • Calculating faceting for relevant documents in groups
  • Chapter 6: Improving Solr Performance
    • Introduction
    • Paging your results quickly
    • Configuring the document cache
    • Configuring the query result cache
    • Configuring the filter cache
    • Improving Solr performance right after the startup or commit operation
    • Caching whole result pages
    • Improving faceting performance for low cardinality fields
    • What to do when Solr slows down during indexing
    • Analyzing query performance
    • Avoiding filter caching
    • Controlling the order of execution of filter queries
    • Improving the performance of numerical range queries
  • Chapter 7: In the Cloud
    • Introduction
    • Creating a new SolrCloud cluster
    • Setting up two collections inside a single cluster
    • Managing your SolrCloud cluster
    • Understanding the SolrCloud cluster administration GUI
    • Distributed indexing and searching
    • Increasing the number of replicas on an already live cluster
    • Stopping automatic document distribution among shards
  • Chapter 8: Using Additional Solr Functionalities
    • Introduction
    • Getting more documents similar to those returned in the results list
    • Highlighting matched words
    • How to highlight long text fields and get good performance
    • Sorting results by a function value
    • Searching words by how they sound
    • Ignoring defined words
    • Computing statistics for the search results
    • Checking the user's spelling mistakes
    • Using field values to group results
    • Using queries to group results
    • Using function queries to group results
  • Chapter 9: Dealing with Problems
    • Introduction
    • How to deal with too many opened files
    • How to deal with out-of-memory problems
    • How to sort non-English languages properly
    • How to make your index smaller
    • Diagnosing Solr problems
    • How to avoid swapping
  • Appendix: Real-life Situations
    • Introduction
    • How to implement a product's autocomplete functionality
    • How to implement a category's autocomplete functionality
    • How to use different query parsers in a single query
    • How to get documents right after they were sent for indexation
    • How to search your data in a near real-time manner
    • How to get the documents with all the query words to the top of the results set
    • How to boost documents based on their publishing date

                      Rafał Kuć

                      Rafał Kuć is a born team leader and a software developer. Working as a consultant and a software engineer at Sematext Group, Inc., he concentrates on open source technologies such as Apache Lucene, Solr, ElasticSearch, and the Hadoop stack. He has more than 11 years of experience in various branches of software, from banking software to e-commerce products. He is mainly focused on Java, but is open to every tool and programming language that will help him reach his goal more easily and quickly.

                      He is also one of the founders of the solr.pl site, where he shares his knowledge and helps people resolve their problems with Solr and Lucene. He speaks at conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, and Lucene Revolution.

                      Rafał began his journey with Lucene in 2002, and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came, and this was it. He started working with ElasticSearch in the middle of 2010. Currently, Lucene, Solr, ElasticSearch, and information retrieval are his main points of interest. Rafał is also the author of Solr 3.1 Cookbook and its update, Solr 4.0 Cookbook, and a co-author of ElasticSearch Server, all published by Packt Publishing.

                      The book you are holding in your hands was something that I wanted to write after finishing the ElasticSearch Server book, and I got the opportunity. I wanted not to jump from topic to topic, but to concentrate on a few of them, write about what I know, and share the knowledge. Again, just like with the ElasticSearch Server book, I couldn't include all the topics I wanted, and some small details that are more or less important, depending on the use case, had to be left aside. Nevertheless, I hope that by reading this book you'll be able to easily get into all the details about ElasticSearch and the underlying Apache Lucene, and I also hope that it will let you get the desired knowledge more easily and quickly.


                      Errata

                      - 4 submitted: last submission 19 Nov 2013

                      Errata Type: Code; Page No. 214, 215, and 219
                      In the recipe Setting up two collections inside a single cluster, step 3, a space character is missing between the confname parameter and its value. That is:
                      cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /usr/share/config/books/conf -confnamebookscollection
                      and
                      cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /usr/share/config/users/conf -confnameuserscollection
                      should be
                      cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /usr/share/config/books/conf -confname bookscollection
                      and
                      cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /usr/share/config/users/conf -confname userscollection

                      In the recipe Managing your SolrCloud cluster, step 4, a space character is missing between the confname parameter and its value. That is:
                      cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /usr/share/config/books/conf -confnamebookscollection
                      should be
                      cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /usr/share/config/books/conf -confname bookscollection
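
To see why the missing space matters, the following Python sketch builds the corrected command as an argument list and compares it with the misprinted form. It assumes the zkcli.sh options are passed as `-cmd upconfig`, `-zkhost`, `-confdir`, and `-confname`; the helper name `upconfig_args` is hypothetical and exists only for this illustration. Nothing is executed against ZooKeeper.

```python
import shlex


def upconfig_args(zk_host, conf_dir, conf_name):
    """Build the argument list for a zkcli.sh upconfig call (hypothetical helper)."""
    return [
        "cloud-scripts/zkcli.sh",
        "-cmd", "upconfig",
        "-zkhost", zk_host,
        "-confdir", conf_dir,
        "-confname", conf_name,  # the flag and its value are separate tokens
    ]


args = upconfig_args(
    "localhost:2181", "/usr/share/config/books/conf", "bookscollection"
)

# Join the tokens into a single shell-safe command line for display.
command = " ".join(shlex.quote(a) for a in args)
print(command)

# In the misprinted form, the shell sees one fused token instead of a flag
# followed by its value, so the script cannot recognize the -confname option.
broken = shlex.split("cloud-scripts/zkcli.sh -confnamebookscollection")
print(broken[-1])
```

Passing `args` to `subprocess.run` would invoke the script with exactly these tokens, which avoids this class of spacing mistake altogether.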

                       

                      Errata Type: Code; Page No. 10
                      In Chapter 1, in the Running Solr on Jetty recipe, the example showing how to increase the header buffer size is based on Jetty 6. If you are using a newer version of Jetty, such as Jetty 8, use the requestHeaderSize property instead of headerBufferSize. The example will then look like this:
                      <Set name="requestHeaderSize">32768</Set>

                      Errata Type: Code; Page No. 28

                      Chapter 1, Recipe: How to fetch and index web pages
                      The example showing the schema.xml file should match its description, so it should read:
                      <schema name="nutch" version="1.5">

                      Errata Type: Code; Page No. 43

                      Chapter 2, Recipe: How to properly configure Data Import Handler with JDBC
                      In the db-data-config.xml example there is the following code snippet:
                      <field column="description" name="description" />
                      It should be:
                      <field column="desc" name="description" />
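
As a rough illustration of what the corrected mapping means, the Python sketch below renames database columns to Solr field names based on the field elements. It only mimics the column-to-name renaming and is not how Data Import Handler is implemented; the surrounding entity element, its name, the `id` field, the sample row, and the `map_row` helper are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The corrected <field> mapping from the erratum, wrapped in a hypothetical
# <entity> element so that it can be parsed on its own.
ENTITY_XML = """
<entity name="item">
  <field column="id" name="id" />
  <field column="desc" name="description" />
</entity>
"""


def map_row(entity_xml, row):
    """Rename database columns to Solr field names (illustrative only)."""
    entity = ET.fromstring(entity_xml)
    mapping = {f.get("column"): f.get("name") for f in entity.findall("field")}
    # Columns without a matching <field column="..."> are dropped in this sketch.
    return {mapping[col]: value for col, value in row.items() if col in mapping}


# A hypothetical database row whose column is called "desc".
row = {"id": 1, "desc": "A sample product"}
print(map_row(ENTITY_XML, row))
```

With the misprinted `column="description"`, the sketch finds no column of that name in the row, so the description field would stay empty.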

                      Errata Type: Code; Page No. 73

                      Chapter 3, Recipe: Eliminating XML and HTML tags from text
                      The value in the html field of the example document should be surrounded by a CDATA section, just as it is in the code you can download.
                      The example document should look like this:
                      <add>
                        <doc>
                          <field name="id">1</field>
                          <field name="html"><![CDATA[<html><head><title>My page</title></head><body><p>This is a <b>my</b> <i>sample</i> page</body></html>]]></field>
                        </doc>
                      </add>
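
The effect of the CDATA section can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module, which is an assumption of this example and not part of the book; the `build_doc` helper is likewise hypothetical. It parses the example document with and without the CDATA wrapper.

```python
import xml.etree.ElementTree as ET

# The HTML payload from the example document (note the unclosed <p> tag).
HTML = (
    "<html><head><title>My page</title></head><body>"
    "<p>This is a <b>my</b> <i>sample</i> page</body></html>"
)


def build_doc(html, use_cdata):
    """Build the update document, optionally wrapping the HTML in CDATA."""
    payload = "<![CDATA[" + html + "]]>" if use_cdata else html
    return (
        '<add><doc><field name="id">1</field>'
        '<field name="html">' + payload + "</field></doc></add>"
    )


# With CDATA, the HTML arrives as plain character data inside the field.
root = ET.fromstring(build_doc(HTML, use_cdata=True))
html_field = root.find("./doc/field[@name='html']")
print(html_field.text == HTML)  # the markup is preserved verbatim
print(len(html_field))          # number of child elements under the field

# Without CDATA, the parser treats the HTML tags as XML elements, and the
# unclosed <p> makes the whole document malformed.
try:
    ET.fromstring(build_doc(HTML, use_cdata=False))
    parsed_without_cdata = True
except ET.ParseError:
    parsed_without_cdata = False
print(parsed_without_cdata)
```

The point of the check is that only the CDATA variant delivers the raw markup as the field value, which is the input the tag-stripping analysis in the recipe operates on.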

                      Errata Type: Content; Page No. 90

                      Chapter 3, Recipe: Storing geographical points in the index
                      A sentence is missing before the last example. Currently the text reads "(…) can add data to index:" and it should read "(…) can add data to index. Now let’s look again at the query".

                      For autogenerating your unique key, please refer to the following blog entry on solr.pl:

                      http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/

                      Errata Type: Code; Page No. 6

                      The How to do it... section refers to the context directory. In the Solr 4.3.1 release, the directory name is contexts and not context.


                      What you will learn from this book

                      • Efficient and configurable Apache Solr 4 setup
                      • Index your data in different formats, forms, and sources
                      • Implement different autocomplete functionality
                      • Achieve near real time search with Apache Solr 4
                      • Improve and benchmark Apache Solr for increased performance
                      • Master SolrCloud functionality
                      • Diagnose and resolve your problems with Apache Solr 4
                      • Improve the relevance of your queries
                      • Overcome common problems when analyzing your data

                      In Detail

                      Apache Solr is a fast, scalable, open source enterprise search server built upon Apache Lucene. Solr is widely popular because it supports complex search criteria, faceting, result highlighting, query completion, query spell-checking, and relevancy tuning, among numerous other features.

                      "Apache Solr 4 Cookbook" will show you how to get the most out of your search engine. Full of practical recipes and examples, this book will show you how to set up Apache Solr, tune and benchmark performance as well as index and analyze your data to provide better, more precise, and useful search data.

                      "Apache Solr 4 Cookbook" will make your search better, more accurate and faster with practical recipes on essential topics such as SolrCloud, querying data, search faceting, text and data analysis, and cache configuration.

                      With numerous practical chapters centered on important Solr techniques and methods, Apache Solr 4 Cookbook is an essential resource for developers who wish to take their knowledge and skills further. Thoroughly updated and improved, this Cookbook also covers the changes in Apache Solr 4 including the awesome capabilities of SolrCloud.

                      Approach

                      "Apache Solr 4 Cookbook" is written in a helpful, practical style with numerous hands-on recipes to help you master Apache Solr to get more precise search results and analysis, higher performance, and reliability.

                      Who this book is for

                      This book is for developers who wish to learn how to master Apache Solr 4. This book will specifically appeal to developers who wish to quickly get to grips with the changes and new features of Apache Solr 4. This book is also handy as a practical guide to solving common problems and issues when using Apache Solr.
