Apache Solr Beginner’s Guide


  • Learn to use Solr in real-world contexts, even if you are not a programmer, using simple configuration examples
  • Define simple configurations for searching data in several ways in your specific context, from suggestions to advanced faceted navigation
  • Learn in an easy-to-follow style, full of examples, illustrations, and tips suited to the needs of beginners

Book Details

Language : English
Paperback : 324 pages [ 235mm x 191mm ]
Release Date : December 2013
ISBN : 1782162526
ISBN 13 : 9781782162520
Author(s) : Alfredo Serafini
Topics and Technologies : All Books, Big Data and Business Intelligence, Beginner's Guides, Open Source


Table of Contents

Preface
Chapter 1: Getting Ready with the Essentials
Chapter 2: Indexing with Local PDF Files
Chapter 3: Indexing Example Data from DBpedia – Paintings
Chapter 4: Searching the Example Data
Chapter 5: Extending Search
Chapter 6: Using Faceted Search – from Searching to Finding
Chapter 7: Working with Multiple Entities, Multicores, and Distributed Search
Chapter 8: Indexing External Data sources
Chapter 9: Introducing Customizations
Appendix: Solr Clients and Integrations
Pop Quiz Answers
Index
  • Chapter 1: Getting Ready with the Essentials
    • Understanding Solr
    • Learning the powerful aspects of Solr
    • Working with Java installation
      • Downloading and installing Java
      • Configuring CLASSPATH and PATH variables for Java
    • Installing and testing Solr
    • Time for action – starting Solr for the first time
      • Taking a glance at the Solr interface
    • Time for action – posting some example data
    • Time for action – testing Solr with cURL
    • Who uses Solr?
    • Resources on Solr
    • How will we use Solr?
    • Summary
  • Chapter 2: Indexing with Local PDF Files
    • Understanding and using an index
    • Posting example documents to the first Solr core
      • Analyzing the elements we need in Solr core
    • Time for action – configuring Solr Home and Solr core discovery
      • Knowing the legacy solr.xml format
    • Time for action – writing a simple solrconfig.xml file
    • Time for action – writing a simple schema.xml file
    • Time for action – starting the new core
    • Time for action – defining an example document
    • Time for action – indexing an example document with cURL
      • Executing the first search on the new core
      • Adding documents to the index from the web UI
    • Time for action – updating an existing document
    • Time for action – cleaning an index
    • Creating an index prototype from PDF files
    • Time for action – defining the schema.xml file with only dynamic fields and tokenization
    • Time for action – writing a simple solrconfig.xml file with an update handler
      • Testing the PDF file core with dummy data and an example query
      • Defining a new tokenized field for fulltext
    • Time for action – using Tika and cURL to extract text from PDFs
      • Using cURL to index some PDF data
    • Time for action – finding copies of the same files with deduplication
    • Time for action – looking inside an index with SimpleTextCodec
    • Understanding the structure of an inverted index
      • Understanding how optimization affects the segments of an index
    • Writing the full configuration for our PDF index example
      • Writing the solrconfig.xml file
      • Writing the schema.xml file
    • Summarizing some easy recipes for the maintenance of an index
    • Summary
  • Chapter 3: Indexing Example Data from DBpedia – Paintings
    • Harvesting paintings' data from DBpedia
    • Analyzing the entities that we want to index
      • Analyzing the first entity – Painting
    • Writing Solr core configurations for the first tests
    • Time for action – defining the basic solrconfig.xml file
      • Looking at the differences between commits and soft commits
    • Time for action – defining the simple schema.xml file
      • Introducing analyzers, tokenizers, and filters
      • Thinking fields for atomic updates
      • Indexing a test entity with JSON
      • Understanding the update chain
      • Using the atomic update
      • Understanding how optimistic concurrency works
    • Time for action – listing all the fields with the CSV output
    • Defining a new Solr core for our Painting entity
    • Time for action – refactoring the schema.xml file for the paintings core by introducing tokenization and stop words
      • Using common field attributes for different use cases
      • Testing the paintings schema
    • Collecting the paintings data from DBpedia
      • Downloading data using the DBpedia SPARQL endpoint
      • Creating Solr documents for example data
      • Indexing example data
    • Testing our paintings core
    • Time for action – looking at a field using the Schema browser in the web interface
    • Time for action – searching the new data in the paintings core
      • Using the Solr web interface for simple maintenance tasks
    • Summary
  • Chapter 4: Searching the Example Data
    • Looking at Solr's standard query parameters
      • Adding a timestamp field for tracking the last modified time
      • Sending Solr's query parameters over HTTP
        • Testing HTTP parameters on browsers
      • Choosing a format for the output
    • Time for action – searching for all documents with pagination
    • Time for action – projecting fields with fl
      • Introducing pseudo-fields and DocTransformers
        • Adding a constant field using transformers
    • Time for action – adding a custom DocTransformer to hide empty fields in the results
      • Looking at core parameters for queries
      • Using the Lucene query parser with defType
    • Time for action – searching for terms with a Boolean query
    • Time for action – using q.op for the default Boolean operator
    • Time for action – selecting documents with the filter query
    • Time for action – searching for incomplete terms with the wildcard query
    • Time for action – using the Boost options
      • Understanding the basic Lucene score
    • Time for action – searching for similar terms with fuzzy search
    • Time for action – writing a simple phrase query example
    • Time for action – playing with range queries
    • Time for action – sorting documents with the sort parameter
    • Playing with the request
    • Time for action – adding a default parameter to a handler
    • Playing with the response
    • Summarizing the parameters that affect result presentation
    • Analyzing response format
    • Time for action – enabling XSLT Response Writer with Luke
    • Listing all fields names with CSV output
    • Listing all field details for a core
    • Exploring Solr for Open Data publishing
      • Publishing results in CSV format
      • Publishing results with an RSS feed
    • Good resources on Solr Query Syntax
    • Summary
  • Chapter 5: Extending Search
    • Looking at different search parsers – Lucene, Dismax, and Edismax
      • Starting from the previous core definition
    • Time for action – inspecting results using the stats and debug components
      • Looking at Lucene and Solr query parsers
    • Time for action – debugging a query with the Lucene parser
    • Time for action – debugging a query with the Dismax parser
      • Using an Edismax default handler
    • Time for action – executing a nested Edismax query
    • A short list of search components
      • Adding the blooming filter and real-time Get
    • Time for action – executing a simple pseudo-join query
      • Highlighting results to improve the search experience
    • Time for action – generating highlighted snippets over a term
    • Some idea about geolocalization with Solr
    • Time for action – creating a repository of cities
      • Playing more with spatial search
      • Looking at the new Solr 4 spatial features – from points to polygons
    • Time for action – expanding the original data with coordinates during the update process
    • Performing editorial correction on boosting
    • Introducing the spellcheck component
    • Time for action – playing with spellchecks
      • Using a file to spellcheck against a list of controlled words
      • Collecting some hints for spellchecking analysis
    • Summary
  • Chapter 6: Using Faceted Search – from Searching to Finding
    • Exploring documents suggestion and matching with faceted search
    • Time for action – prototyping an auto-suggester with facets
    • Time for action – creating wordclouds on facets to view and analyze data
    • Thinking about faceted search and findability
      • Faceting for narrowing searches and exploring data
    • Time for action – defining facets over enumerated fields
    • Performing data normalization for the keyword field during the update phase
      • Reading more about Solr faceting parameters
    • Time for action – finding interesting topics using faceting on tokenized fields with a filter query
    • Using filter queries for caching filters
    • Time for action – finding interesting subjects using a facet query
    • Time for action – using range queries and facet range queries
    • Time for action – using a hierarchical facet (pivot)
    • Introducing group and field collapsing
    • Time for action – grouping results
    • Playing with terms
    • Time for action – playing with a term suggester
      • Thinking about term vectors and similarity
        • Moving to semantics with vector space models
        • Looking at the next step – customizing similarity
    • Time for action – having a look at the term vectors
      • Reading about functions
    • Introducing the More Like This component and recommendations
    • Time for action – obtaining similar documents by More Like This
      • Adopting a More Like This handler
    • Summary
  • Chapter 7: Working with Multiple Entities, Multicores, and Distributed Search
    • Working with multiple entities
    • Time for action – searching for cities using multiple core joins
      • Preparing example data for multiple entities
        • Downloading files for multiple entities
        • Generating Solr documents
      • Playing with joins on multicores (a core for every entity)
    • Using sharding for distributed search
    • Time for action – playing with sharding (distributed search)
    • Time for action – finding a document from any shard
    • Collecting some ideas on schemaless versus normalization
      • Creating a single denormalized index
        • Adding a field to track entity type
        • Analyzing, designing, and refactoring our domain
        • Using document clustering as a domain analysis tool
        • Managing index replication
        • Clustering Solr for distributed search using SolrCloud
      • Taking a journey from single core to SolrCloud
      • Understanding why we need Zookeeper
    • Time for action – testing SolrCloud and Zookeeper locally
      • Looking at the suggested configurations for SolrCloud
        • Changing the schema.xml file
        • Changing the solrconfig.xml file
      • Knowing the pros and cons of SolrCloud
    • Summary
  • Chapter 8: Indexing External Data sources
    • Stepping further into the real world
      • Collecting example data from the Web Gallery of Art site
    • Time for action – indexing data from a database (for example, a blog or an e-commerce website)
    • Time for action – handling sub-entities (for example, joins on complex data)
    • Time for action – indexing incrementally using delta imports
    • Time for action – indexing CSV (for example, open data)
    • Time for action – importing Solr XML document files
      • Importing data from another Solr instance
      • Indexing emails
    • Time for action – indexing rich documents (for example, PDF)
    • Adding more consideration about tuning
      • Understanding Java Virtual Machine, threads, and Solr
      • Choosing the correct directory for implementation
      • Adopting Solr cache
    • Time for action – indexing artist data from Tate Gallery and DBpedia
      • Using DataImportHandler
    • Summary
  • Chapter 9: Introducing Customizations
    • Looking at the Solr customizations
      • Adding some more details to the core discovery
    • Playing with specific languages
    • Time for action – detecting language with Tika and LangDetect
    • Introducing stemming for query expansion
    • Time for action – adopting a stemmer
      • Testing language analysis with JUnit and Scala
      • Writing new Solr plugins
      • Introducing Solr plugin structure and lifecycle
      • Implementing interfaces for obtaining information
    • Following an example plugin lifecycle
    • Time for action – writing a new ResponseWriter plugin with the Thymeleaf library
    • Using Maven for development
    • Time for action – integrating Stanford NER for Named Entity extraction
      • Pointing ideas for Solr's customizations
    • Summary
  • Appendix: Solr Clients and Integrations
    • Introducing SolrJ – an embedded or remote Solr client using the Java (JVM) API
    • Time for action – playing with an embedded Solr instance
    • Choosing between an embedded or remote Solr instance
    • Time for action – playing with an external HttpSolrServer
    • Time for action – using Bean Scripting Framework and JavaScript
      • Jahia CMS
      • Magnolia CMS
      • Alfresco DMS and CMS
      • Liferay
      • Broadleaf
      • Apache Jena
      • Solr Groovy or the Grails plugin
      • Solr scala
      • Spring data
    • Writing Solr clients and integrations outside JVM
      • JavaScript
        • Taking a glance at ajax-solr, solrstrap, facetview, and jcloud
      • Ruby
      • Python
      • C# and .NET
      • PHP
      • Drupal
      • WordPress
      • Magento e-commerce
      • Platforms for analyzing, manipulating, and enhancing text
      • Hydra
      • UIMA
      • Apache Stanbol
      • Carrot2
      • VuFind
    • Summary

Alfredo Serafini

Alfredo Serafini is a freelance software consultant, currently living in Rome, Italy. He has a mixed background: a bachelor's degree in Computer Science Engineering (2003, with a thesis on Music Information Retrieval) and a professional master's course in Sound Engineering (2007, with a thesis on a gestural interface to the MAX/MSP platform). From 2003 to 2006, he worked as a consultant and developer with the Artificial Intelligence Research at Tor Vergata (ART) group, where he got his first chance to play with the Lucene library. Since then he has been working as a freelancer, alternating between teaching programming languages, mentoring small companies on topics like Information Retrieval and Linked Data, and (not surprisingly) working as a software engineer. He is currently a Linked Open Data enthusiast. He has also spent a lot of time with the Scala language as well as graph and network databases. You can find more information about his activities on his website, titled "designed to be unfinished", at http://www.seralf.it/.

Code Downloads

Download the code and support files for this book.


Submit Errata

Please let us know if you have found any errors not included in this list by completing our errata submission form. Our editors will check them and add them to the list. Thank you.


Errata

- 5 submitted: last submission 24 Jun 2014

Page number: 12

The link: http://lucene.apche.org/solr/

Should be: http://lucene.apache.org/solr/

Page number: 53 | Type: Technical

The line: "If you look at the provided sources, you'll find the full configuration at the path /SolrStarterBook/solr-app/chp02/pdfs/conf"

Should be: "If you look at the provided sources, you'll find the full configuration at the path /SolrStarterBook/solr-app/chp02/pdfs_full/conf"

Page number: 35 | Type: Technical

The line "Now, to simplify,save it with the name  docs.xml  under the  /SolrStarterBook/solr-app/test/chp02/ directory"

Should be "Now, to simplify,save it with the name  docs.xml  under the /SolrStarterBook/test/chp02/ directory"

Page number: 43 | Type: Technical

The line "One of  the essential  parts of the solrconfig,xml file will"

Should be "One of  the essential  parts of the solrconfig.xml file will"

Page number: 288 | Type: Technical

In Chapter 2, the fourth question of Q2 is "a field defined as stored can be returned in the output".
The answer is "true".

Questions 3 and 4 here should be a single question; for some reason they were split in two. A reader who looks at Q2 - 3 will find the complementary answer.

Sample chapters

You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.

What you will learn from this book

  • Understand what full-text search and faceted navigation are and when to use them
  • Install and use Solr for testing
  • Write your own configurations for the Solr index incrementally and test them with the Solr web UI
  • Learn how to test a running Solr instance using cURL with different formats, like XML, JSON, and so on
  • Analyze your data and define the entities to be indexed in Solr
  • Examine text and make auto-suggestions
  • Index data using various formats and various data sources, and learn how to expose data in various formats
  • Start using Solr in contexts like Open Data and Linked Data
  • Use Solr for expanding your data with resources from public, well-known knowledge bases

In Detail

With over 40 billion web pages online, optimizing a search engine's performance is essential.

Solr is an open source enterprise search platform from the Apache Lucene project. Full-text search, faceted search, hit highlighting, dynamic clustering, database integration, and rich document handling are just some of its many features. Solr is highly scalable thanks to its distributed search and index replication.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable with most popular programming languages. Solr's powerful external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.
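To make the HTTP interface described above more concrete, here is a minimal Python sketch that only builds the URL for a search request. It is an illustration, not an example taken from the book: the host and port (Solr's usual local default), the core name paintings, and the field name title are assumptions chosen for this example. The q, rows, and wt parameters are Solr's standard query, page-size, and output-format parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10, fmt="json"):
    """Return the URL of a search request against the /select handler of one core."""
    # q = the query string, rows = number of results, wt = response format
    params = urlencode({"q": query, "rows": rows, "wt": fmt})
    return f"{base_url}/{core}/select?{params}"


# Assumed values: a local Solr instance and a hypothetical core and field.
url = build_select_url("http://localhost:8983/solr", "paintings", "title:vermeer", rows=5)
print(url)
```

The same URL can be pasted into a browser or passed to cURL; changing wt to xml asks for the XML form of the same response.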

With Apache Solr Beginner's Guide you will learn how to configure your own search engine experience. Using real data as an example, you will have the chance to start writing step-by-step, simple, real-world configurations and understand when and where to adopt this technology.

Apache Solr Beginner's Guide will start by letting you explore a simple search over real data. You will then go through a step-by-step description that gives you the chance to explore several practical features. At the end of the book you will see how Solr is used in different real-world contexts.

Using data from public sources like DBpedia, you will define several different configurations, exploring some of the most interesting Solr features, such as faceted search and navigation, auto-suggestion, and rich document indexing. You will see how to configure different analyzers for handling different data types, without programming.
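As a rough illustration of how faceting can double as a simple auto-suggester, the Python sketch below builds the query parameters for a facet request and reshapes a facet response. It is a sketch under stated assumptions rather than the book's own configuration: the field name artist and the sample response are invented for this example. The facet, facet.field, and facet.prefix parameters are standard Solr faceting parameters, and by default Solr's JSON output lists facet values as a flat sequence of alternating terms and counts.

```python
from urllib.parse import urlencode


def build_facet_params(field, prefix=""):
    """Build the query string for a facet-only request (rows=0 returns no documents)."""
    params = {"q": "*:*", "rows": 0, "wt": "json", "facet": "true", "facet.field": field}
    if prefix:
        # Restrict facet values to terms starting with what the user has typed so far
        params["facet.prefix"] = prefix
    return urlencode(params)


def pair_facet_counts(flat):
    """Turn [term, count, term, count, ...] into [(term, count), ...]."""
    return list(zip(flat[0::2], flat[1::2]))


# Hand-written sample shaped like a Solr JSON facet response; not real results.
sample = {"facet_counts": {"facet_fields": {"artist": ["picasso", 12, "pissarro", 7]}}}
suggestions = pair_facet_counts(sample["facet_counts"]["facet_fields"]["artist"])

print(build_facet_params("artist", prefix="pi"))
print(suggestions)
```

Each pair holds a candidate completion and the number of documents that contain it, which is enough to rank suggestions by popularity.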

You will learn the basics of Solr, focusing on real-world examples and practical configurations.

Approach

Written in a friendly, example-driven format, the book includes plenty of step-by-step instructions and examples that are designed to help you get started with Apache Solr.

Who this book is for

This book is an entry-level introduction to the wonderful world of Apache Solr. It centers on a couple of simple projects, such as setting up Solr and customizing the Solr schema and configuration. It is for developers who want to start using Apache Solr but are stuck on, or intimidated by, the difficulty of setting it up and using it.

For anyone wanting to embed a search engine in their site to help users navigate the mammoth amount of data available, this book is an ideal starting point. Moreover, if you are a data architect or a project manager and want to make some key design decisions, you will find that every example included in the book contains ideas usable in real-world contexts.
