Apache Solr 3 Enterprise Search Server


eBook: $25.49 (list price $29.99, save 15%). Formats: PDF, PacktLib, ePub, and Mobi.
Print + free eBook + free PacktLib access to the book: $49.99 (list price $79.98, save 37%). Print cover price: $49.99.
Free shipping to the UK, US, Europe, and selected countries in Asia.
  • Comprehensive information on Apache Solr 3 with examples and tips so you can focus on the important parts
  • Integration examples with databases, web-crawlers, XSLT, Java & embedded-Solr, PHP & Drupal, JavaScript, Ruby frameworks
  • Advice on data modeling; deployment considerations including security, logging, and monitoring; and advice on scaling Solr and measuring performance
  • An update of the best-selling title on Solr 1.4


Book Details

Language : English
Paperback : 418 pages [ 235mm x 191mm ]
Release Date : November 2011
ISBN : 1849516065
ISBN 13 : 9781849516068
Author(s) : David Smiley, Eric Pugh
Topics and Technologies : All Books, Big Data and Business Intelligence, Open Source

Table of Contents

Preface
Chapter 1: Quick Starting Solr
Chapter 2: Schema and Text Analysis
Chapter 3: Indexing Data
Chapter 4: Searching
Chapter 5: Search Relevancy
Chapter 6: Faceting
Chapter 7: Search Components
Chapter 8: Deployment
Chapter 9: Integrating Solr
Chapter 10: Scaling Solr
Appendix: Search Quick Reference
Index
  • Chapter 1: Quick Starting Solr
    • An introduction to Solr
      • Lucene, the underlying engine
      • Solr, a Lucene-based search server
      • Comparison to database technology
    • Getting started
      • Solr's installation directory structure
      • Solr's home directory and Solr cores
      • Running Solr
    • A quick tour of Solr
      • Loading sample data
      • A simple query
      • Some statistics
      • The sample browse interface
    • Configuration files
    • Resources outside this book
    • Summary
  • Chapter 2: Schema and Text Analysis
    • MusicBrainz.org
    • One combined index or separate indices
      • One combined index
        • Problems with using a single combined index
      • Separate indices
    • Schema design
      • Step 1: Determine which searches are going to be powered by Solr
      • Step 2: Determine the entities returned from each search
      • Step 3: Denormalize related data
        • Denormalizing—'one-to-one' associated data
        • Denormalizing—'one-to-many' associated data
      • Step 4: (Optional) Omit the inclusion of fields only used in search results
    • The schema.xml file
      • Defining field types
      • Built-in field type classes
        • Numbers and dates
        • Geospatial
      • Field options
      • Field definitions
        • Dynamic field definitions
      • Our MusicBrainz field definitions
      • Copying fields
      • The unique key
      • The default search field and query operator
    • Text analysis
      • Configuration
      • Experimenting with text analysis
      • Character filters
      • Tokenization
      • WordDelimiterFilter
      • Stemming
        • Correcting and augmenting stemming
      • Synonyms
        • Index-time versus query-time, and to expand or not
      • Stop words
      • Phonetic sounds-like analysis
      • Substring indexing and wildcards
        • ReversedWildcardFilter
        • N-grams
        • N-gram costs
      • Sorting Text
      • Miscellaneous token filters
    • Summary
  • Chapter 3: Indexing Data
    • Communicating with Solr
      • Direct HTTP or a convenient client API
      • Push data to Solr or have Solr pull it
      • Data formats
      • HTTP POSTing options to Solr
      • Remote streaming
    • Solr's Update-XML format
      • Deleting documents
    • Commit, optimize, and rollback
    • Sending CSV formatted data to Solr
      • Configuration options
    • The Data Import Handler Framework
      • Setup
      • The development console
      • Writing a DIH configuration file
        • Data Sources
        • Entity processors
        • Fields and transformers
      • Example DIH configurations
        • Importing from databases
        • Importing XML from a file with XSLT
        • Importing multiple rich document files (crawling)
      • Importing commands
        • Delta imports
    • Indexing documents with Solr Cell
      • Extracting text and metadata from files
      • Configuring Solr
      • Solr Cell parameters
      • Extracting karaoke lyrics
      • Indexing richer documents
    • Update request processors
    • Summary
  • Chapter 4: Searching
    • Your first search, a walk-through
    • Solr's generic XML structured data representation
    • Solr's XML response format
      • Parsing the URL
    • Request handlers
    • Query parameters
      • Search criteria related parameters
      • Result pagination related parameters
      • Output related parameters
      • Diagnostic related parameters
    • Query parsers and local-params
    • Query syntax (the lucene query parser)
      • Matching all the documents
      • Mandatory, prohibited, and optional clauses
        • Boolean operators
      • Sub-queries
        • Limitations of prohibited clauses in sub-queries
      • Field qualifier
      • Phrase queries and term proximity
      • Wildcard queries
        • Fuzzy queries
      • Range queries
        • Date math
      • Score boosting
      • Existence (and non-existence) queries
      • Escaping special characters
    • The Dismax query parser (part 1)
      • Searching multiple fields
      • Limited query syntax
      • Min-should-match
        • Basic rules
        • Multiple rules
        • What to choose
      • A default search
    • Filtering
    • Sorting
    • Geospatial search
      • Indexing locations
      • Filtering by distance
      • Sorting by distance
    • Summary
  • Chapter 5: Search Relevancy
    • Scoring
      • Query-time and index-time boosting
      • Troubleshooting queries and scoring
    • Dismax query parser (part 2)
      • Lucene's DisjunctionMaxQuery
      • Boosting: Automatic phrase boosting
        • Configuring automatic phrase boosting
        • Phrase slop configuration
        • Partial phrase boosting
      • Boosting: Boost queries
      • Boosting: Boost functions
        • Add or multiply boosts?
    • Function queries
      • Field references
      • Function reference
        • Mathematical primitives
        • Other math
        • ord and rord
        • Miscellaneous functions
      • Function query boosting
        • Formula: Logarithm
        • Formula: Inverse reciprocal
        • Formula: Reciprocal
        • Formula: Linear
      • How to boost based on an increasing numeric field
        • Step by step…
        • External field values
      • How to boost based on recent dates
        • Step by step…
    • Summary
  • Chapter 6: Faceting
    • A quick example: Faceting release types
      • MusicBrainz schema changes
    • Field requirements
    • Types of faceting
    • Faceting field values
      • Alphabetic range bucketing
    • Faceting numeric and date ranges
      • Range facet parameters
    • Facet queries
    • Building a filter query from a facet
      • Field value filter queries
      • Facet range filter queries
    • Excluding filters (multi-select faceting)
    • Hierarchical faceting
    • Summary
  • Chapter 7: Search Components
    • About components
    • The Highlight component
      • A highlighting example
      • Highlighting configuration
        • The regex fragmenter
        • The fast vector highlighter with multi-colored highlighting
    • The SpellCheck component
      • Schema configuration
      • Configuration in solrconfig.xml
        • Configuring spellcheckers (dictionaries)
        • Processing of the q parameter
        • Processing of the spellcheck.q parameter
      • Building the dictionary from its source
      • Issuing spellcheck requests
      • Example usage for a misspelled query
    • Query complete / suggest
      • Query term completion via facet.prefix
      • Query term completion via the Suggester
      • Query term completion via the Terms component
    • The QueryElevation component
      • Configuration
    • The MoreLikeThis component
      • Configuration parameters
        • Parameters specific to the MLT search component
        • Parameters specific to the MLT request handler
        • Common MLT parameters
      • MLT results example
    • The Stats component
      • Configuring the stats component
      • Statistics on track durations
    • The Clustering component
    • Result grouping/Field collapsing
      • Configuring result grouping
    • The TermVector component
    • Summary
  • Chapter 8: Deployment
    • Deployment methodology for Solr
      • Questions to ask
    • Installing Solr into a Servlet container
      • Differences between Servlet containers
        • Defining solr.home property
    • Logging
      • HTTP server request access logs
      • Solr application logging
        • Configuring logging output
        • Logging using Log4j
        • Jetty startup integration
        • Managing log levels at runtime
    • A SearchHandler per search interface?
    • Leveraging Solr cores
      • Configuring solr.xml
        • Property substitution
        • Include fragments of XML with XInclude
      • Managing cores
      • Why use multicore?
    • Monitoring Solr performance
      • Stats.jsp
      • JMX
        • Starting Solr with JMX
    • Securing Solr from prying eyes
      • Limiting server access
        • Securing public searches
        • Controlling JMX access
      • Securing index data
        • Controlling document access
        • Other things to look at
    • Summary
  • Chapter 9: Integrating Solr
    • Working with included examples
      • Inventory of examples
    • Solritas, the integrated search UI
      • Pros and Cons of Solritas
    • SolrJ: Simple Java interface
      • Using Heritrix to download artist pages
      • SolrJ-based client for Indexing HTML
      • SolrJ client API
        • Embedding Solr
        • Searching with SolrJ
        • Indexing
      • When should I use embedded Solr?
        • In-process indexing
        • Standalone desktop applications
        • Upgrading from legacy Lucene
    • Using JavaScript with Solr
      • Wait, what about security?
      • Building a Solr powered artists autocomplete widget with jQuery and JSONP
      • AJAX Solr
    • Using XSLT to expose Solr via OpenSearch
      • OpenSearch based Browse plugin
        • Installing the Search MBArtists plugin
    • Accessing Solr from PHP applications
      • solr-php-client
      • Drupal options
        • Apache Solr Search integration module
        • Hosted Solr by Acquia
    • Ruby on Rails integrations
      • The Ruby query response writer
      • sunspot_rails gem
        • Setting up MyFaves project
        • Populating MyFaves relational database from Solr
        • Build Solr indexes from a relational database
        • Complete MyFaves website
      • Which Rails/Ruby library should I use?
    • Nutch for crawling web pages
    • Maintaining document security with ManifoldCF
      • Connectors
      • Putting ManifoldCF to use
    • Summary
  • Chapter 10: Scaling Solr
    • Tuning complex systems
    • Testing Solr performance with SolrMeter
    • Optimizing a single Solr server (Scale up)
      • Configuring JVM settings to improve memory usage
        • MMapDirectoryFactory to leverage additional virtual memory
      • Enabling downstream HTTP caching
      • Solr caching
        • Tuning caches
      • Indexing performance
        • Designing the schema
        • Sending data to Solr in bulk
        • Don't overlap commits
        • Disabling unique key checking
        • Index optimization factors
      • Enhancing faceting performance
      • Using term vectors
      • Improving phrase search performance
    • Moving to multiple Solr servers (Scale horizontally)
      • Replication
      • Starting multiple Solr servers
        • Configuring replication
      • Load balancing searches across slaves
        • Indexing into the master server
        • Configuring slaves
      • Configuring load balancing
      • Sharding indexes
        • Assigning documents to shards
        • Searching across shards (distributed search)
    • Combining replication and sharding (Scale deep)
      • Near real time search
    • Where next for scaling Solr?
    • Summary

                        David Smiley

                        Born to code, David Smiley is a software engineer who is passionate about search, Lucene, spatial, and open source. He has a great deal of expertise with Lucene and Solr, which started in 2008 at MITRE. In 2009, as the lead author, he wrote Solr 1.4 Enterprise Search Server, the first book on Solr, published by Packt. It was updated in 2011, and again for this third edition. After the first book, he developed one- and two-day Solr training courses, delivered a half dozen times within MITRE, and he delivered LucidWorks’ training once too. Most of his excitement and energy relating to Lucene is centered on Lucene’s spatial module, including Spatial4j, for which he is largely responsible. He has presented his progress on this at Lucene Revolution and other conferences several times. Finally, he currently holds committer / Project Management Committee (PMC) status with the Lucene/Solr open-source project. During all this time, David has staked his career on search, working exclusively on such projects, formerly for MITRE, and now as an independent consultant for various clients. You can reach him at dsmiley@apache.org.

                        Eric Pugh

                        Fascinated by the “craft” of software development, Eric Pugh has been involved in the open source world as a developer, committer, and user for the past decade. He is an emeritus member of the Apache Software Foundation. In biotech, financial services, and defense IT, he has helped European and American companies develop coherent strategies for embracing open source software. As a speaker, he has advocated the advantages of Agile practices in search/discovery/analytics projects. Eric became involved in Solr when he submitted the patch SOLR-284 for parsing rich document types such as PDF and MS Office formats, which became the single most popular patch as measured by votes! The patch was subsequently cleaned up and enhanced by three other individuals, demonstrating the power of the Free/Open Source Model to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell. He blogs at http://www.opensourceconnections.com/blog/.



                        Errata

                        - 7 submitted: last submission 11 Apr 2013

                        Errata type: Typo | Errata date: 17th May 12

                        Page no.: 26

                        Check the second line in the second paragraph. "At the moment, we're just going to take a peak at the request handlers, which are defined with <requestHandler> elements."

                        Here, the word "peak" is wrong. Instead, it should be "peek".

                         

                        Errata type: Typo | Errata date: 24th December 12

                        Check for instances of the word "solconfig.xml":

                        Page 114: Check the sentence "This XML is used for most of the response XML and it is also used in parts of solconfig.xml too."

                        Page 204: Check the sentence "Choose the snippet fragmenting algorithm. This parameter refers to a named <fragmenter/> element in <highlighting/> in solconfig.xml. gap is the default typical choice based on a fragment size."

                        Page 204: Check the sentence "This parameter refers to a named <formatter/> element in <highlighting/> in solconfig.xml."

                        Page 204: Check the sentence "This is a reference to a named <encoder/> element in <highlighting/> in solconfig.xml."

                        Page 206: Check the sentence "This parameter refers to a named <fragListBuilder/> element in <highlighting/> in solconfig.xml."

                        Page 206: Check the sentence "This parameter refers to a named <fragmentsBuilder/> element in <highlighting/> in solconfig.xml."

                         

                        All of these instances are wrong. Each should be replaced with "solrconfig.xml".

                         

                        Errata type: Typo | Errata date: 24th December 12

                        Page 164:  Check the sentence "With luck, you may find some websites that will suffice, perhaps http://www.wolframapha.com."

                        This link is wrong. It should be "http://www.wolframalpha.com". (alpha instead of apha)

                        Errata type: Typo | Errata date: 24th December 12

                        Check for the instances of the word "XHMTL":

                        Page 103: Check the sentence "To return only the metadata, and discard all the body content of the XHMTL you would use xpath=/xhtml:html/ xhtml:head/descendant:node()."

                        Page 104: Check the sentence "Defaults to xml to produce the XHMTL structure."

                        Page 108: Check the sentence "This returns an XHMTL document that contains the metadata extracted from the document in the <head/> stanza, as well as the basic structure of the contents expressed as XHTML."

                         

                        All of these instances are wrong. Each should be replaced with "XHTML".

                        Errata type: Typo | Errata date: 21st February 12

                        Page no. 347: Check the line of code   <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>"/>

                        There's an extra "/> at the end of this line. It should be removed.
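
As a rough illustration of why the stray characters matter, the sketch below parses both versions of the line and reports any text left over after the filter element. The surrounding <analyzer> parent is an assumption added only so that the line can be parsed in context; it is not quoted from the book.

```python
import xml.etree.ElementTree as ET

# The line as printed on page 347, wrapped in a hypothetical <analyzer> parent
# (an assumption for illustration) so that it can be parsed in context.
PRINTED = ('<analyzer><filter class="solr.CommonGramsFilterFactory" '
           'words="commongrams.txt" ignoreCase="true"/>"/></analyzer>')

# The same line with the extra "/> removed, as the erratum asks.
CORRECTED = ('<analyzer><filter class="solr.CommonGramsFilterFactory" '
             'words="commongrams.txt" ignoreCase="true"/></analyzer>')


def stray_text_after_filter(xml_text):
    """Return any character data that directly follows the <filter/> element."""
    filter_element = ET.fromstring(xml_text).find("filter")
    return filter_element.tail or ""


print(repr(stray_text_after_filter(PRINTED)))    # leftover characters, if any
print(repr(stray_text_after_filter(CORRECTED)))  # expected to be empty
```

In this sketch the printed version still parses, but the extra characters survive as stray text inside the parent element, which is why they are easy to overlook.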

                        Errata type: Typo | Errata date: 2nd March 12

                        Page no. 250:

                        The last paragraph describes the HTTP server request log format and states that the last number in a log line is the time in milliseconds taken to serve the request. This is not correct; the number is the count of bytes sent in the response.
                        While it may be true that Jetty compiles JSP pages on first request, the number "3816" refers to the size of the admin page's HTML sent over the wire. (Otherwise, subsequent requests would show a smaller number.)
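
To make the corrected reading concrete, here is a minimal sketch that splits an access-log line into named fields, assuming the server writes the NCSA common log format. The sample line is hypothetical: only the trailing 3816 comes from the erratum, while the host, timestamp, and request path are made up for illustration.

```python
import re

# Hypothetical access-log line; only the trailing byte count (3816) is taken
# from the erratum, everything else is invented for illustration.
LOG_LINE = ('127.0.0.1 - - [02/Mar/2012:10:15:32 +0000] '
            '"GET /solr/admin/ HTTP/1.1" 200 3816')

# NCSA common log format: host ident user [timestamp] "request" status bytes
COMMON_LOG = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_access_log_line(line):
    """Split one common-log-format line into a dict of named fields."""
    match = COMMON_LOG.match(line)
    if match is None:
        raise ValueError("not a common-log-format line: %r" % line)
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the last column means no response body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


entry = parse_access_log_line(LOG_LINE)
print(entry["status"], entry["bytes"])  # HTTP status, then response size in bytes
```

Under this format the final column is a response size, so a request-duration figure would have to come from a different, explicitly configured log field.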

                        Errata type: Typo | Errata date: 25th April 12

                        Please check page no. 2 : "What you need for this book" section. First bullet point-"Java 6, a JDK release. Do not use Java 7."

                        There was a problem with the initial release of Java 7, but the 7u1 release addressed it, so Java 7 is now cleared for use. The advice on page 2 not to use Java 7 is therefore outdated.

                         

                        Sample chapters

                        You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.


                        What you will learn from this book

                        • Design a schema to include text indexing details like tokenization, stemming, and synonyms
                        • Import data using various formats like CSV, XML, and from databases, and extract text from common document formats
                        • Search using Solr’s rich query syntax, perform geospatial searches, and influence relevancy order
                        • Enhance search results with faceting, query spell-checking, auto-completing queries, highlighted search results, and more
                        • Integrate a host of technologies with Solr from the server side to client-side JavaScript, to frameworks like Drupal
                        • Scale Solr by learning how to tune it and how to use replication and sharding

                        In Detail

                        If you are a developer building an app today then you know how important a good search experience is. Apache Solr, built on Apache Lucene, is a wildly popular open source enterprise search server that easily delivers powerful search and faceted navigation features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query-completion, query spell-check, relevancy tuning, and more.

                        Apache Solr 3 Enterprise Search Server is a comprehensive reference guide for every feature Solr has to offer. It serves the reader right from initiation to development to deployment. It also comes with complete running examples to demonstrate its use and show how to integrate Solr with other languages and frameworks.

                        Using a large set of metadata about artists, releases, and tracks, courtesy of the MusicBrainz.org project, you will have a testing ground for Solr and will learn how to import this data in various ways. You will then learn how to search this data in different ways, including Solr's rich query syntax and "boosting" match scores based on record data.
                        Finally, we'll cover various deployment considerations, including indexing strategies and performance-oriented configuration that will enable you to scale Solr to meet the needs of a high-volume site.

                        Approach

                        The book is written as a reference guide. It includes fully working examples based on a real-world public data set.

                         

                        Who this book is for

                        This book is for developers who want to learn how to use Apache Solr in their applications. Only basic programming skills are needed.
