
Apache Solr Beginner’s Guide

Alfredo Serafini

Where do you start with Apache Solr? We’d suggest this book, which assumes no prior knowledge and takes you step by step through all the essentials, putting you on the road toward a successful implementation.
eBook: $26.99 (RRP $26.99)
Print + eBook: $44.99 (RRP $44.99)

Book Details

ISBN 13: 9781782162520
Paperback: 324 pages

About This Book

  • Learn to use Solr in real-world contexts, even if you are not a programmer, using simple configuration examples
  • Define simple configurations for searching data in several ways in your specific context, from suggestions to advanced faceted navigation
  • Learn in an easy-to-follow style, full of examples, illustrations, and tips to suit the demands of beginners

Who This Book Is For

This book is an entry-level introduction to Apache Solr. It centers on a couple of simple projects, such as setting up Solr and customizing the Solr schema and configuration. It is for developers who want to start using Apache Solr but are stuck on, or intimidated by, the difficulty of setting it up and using it.

For anyone wanting to embed a search engine in their site to help users navigate large amounts of data, this book is an ideal starting point. Moreover, if you are a data architect or a project manager and want to make some key design decisions, you will find that every example in the book contains ideas usable in real-world contexts.

Table of Contents

Chapter 1: Getting Ready with the Essentials
Understanding Solr
Learning the powerful aspects of Solr
Working with Java installation
Installing and testing Solr
Time for action – starting Solr for the first time
Time for action – posting some example data
Time for action – testing Solr with cURL
Who uses Solr?
Resources on Solr
How will we use Solr?
Summary
Chapter 2: Indexing with Local PDF Files
Understanding and using an index
Posting example documents to the first Solr core
Time for action – configuring Solr Home and Solr core discovery
Time for action – writing a simple solrconfig.xml file
Time for action – writing a simple schema.xml file
Time for action – starting the new core
Time for action – defining an example document
Time for action – indexing an example document with cURL
Time for action – updating an existing document
Time for action – cleaning an index
Creating an index prototype from PDF files
Time for action – defining the schema.xml file with only dynamic fields and tokenization
Time for action – writing a simple solrconfig.xml file with an update handler
Time for action – using Tika and cURL to extract text from PDFs
Time for action – finding copies of the same files with deduplication
Time for action – looking inside an index with SimpleTextCodec
Understanding the structure of an inverted index
Writing the full configuration for our PDF index example
Summarizing some easy recipes for the maintenance of an index
Summary
Chapter 3: Indexing Example Data from DBpedia – Paintings
Harvesting paintings' data from DBpedia
Analyzing the entities that we want to index
Writing Solr core configurations for the first tests
Time for action – defining the basic solrconfig.xml file
Time for action – defining the simple schema.xml file
Time for action – listing all the fields with the CSV output
Defining a new Solr core for our Painting entity
Time for action – refactoring the schema.xml file for the paintings core by introducing tokenization and stop words
Collecting the paintings data from DBpedia
Testing our paintings core
Time for action – looking at a field using the Schema browser in the web interface
Time for action – searching the new data in the paintings core
Summary
Chapter 4: Searching the Example Data
Looking at Solr's standard query parameters
Time for action – searching for all documents with pagination
Time for action – projecting fields with fl
Time for action – adding a custom DocTransformer to hide empty fields in the results
Time for action – searching for terms with a Boolean query
Time for action – using q.op for the default Boolean operator
Time for action – selecting documents with the filter query
Time for action – searching for incomplete terms with the wildcard query
Time for action – using the Boost options
Time for action – searching for similar terms with fuzzy search
Time for action – writing a simple phrase query example
Time for action – playing with range queries
Time for action – sorting documents with the sort parameter
Time for action – adding a default parameter to a handler
Time for action – enabling XSLT Response Writer with Luke
Summary
Chapter 5: Extending Search
Looking at different search parsers – Lucene, Dismax, and Edismax
Time for action – inspecting results using the stats and debug components
Time for action – debugging a query with the Lucene parser
Time for action – debugging a query with the Dismax parser
Time for action – executing a nested Edismax query
A short list of search components
Time for action – executing a simple pseudo-join query
Time for action – generating highlighted snippets over a term
Some idea about geolocalization with Solr
Time for action – creating a repository of cities
Time for action – expanding the original data with coordinates during the update process
Performing editorial correction on boosting
Introducing the spellcheck component
Time for action – playing with spellchecks
Summary
Chapter 6: Using Faceted Search – from Searching to Finding
Exploring documents suggestion and matching with faceted search
Time for action – prototyping an auto-suggester with facets
Time for action – creating wordclouds on facets to view and analyze data
Thinking about faceted search and findability
Time for action – defining facets over enumerated fields
Performing data normalization for the keyword field during the update phase
Time for action – finding interesting topics using faceting on tokenized fields with a filter query
Using filter queries for caching filters
Time for action – finding interesting subjects using a facet query
Time for action – using range queries and facet range queries
Time for action – using a hierarchical facet (pivot)
Introducing group and field collapsing
Time for action – grouping results
Playing with terms
Time for action – playing with a term suggester
Time for action – having a look at the term vectors
Introducing the More Like This component and recommendations
Time for action – obtaining similar documents by More Like This
Summary
Chapter 7: Working with Multiple Entities, Multicores, and Distributed Search
Working with multiple entities
Time for action – searching for cities using multiple core joins
Using sharding for distributed search
Time for action – playing with sharding (distributed search)
Time for action – finding a document from any shard
Collecting some ideas on schemaless versus normalization
Time for action – testing SolrCloud and Zookeeper locally
Summary
Chapter 8: Indexing External Data Sources
Stepping further into the real world
Time for action – indexing data from a database (for example, a blog or an e-commerce website)
Time for action – handling sub-entities (for example, joins on complex data)
Time for action – indexing incrementally using delta imports
Time for action – indexing CSV (for example, open data)
Time for action – importing Solr XML document files
Time for action – indexing rich documents (for example, PDF)
Adding more consideration about tuning
Time for action – indexing artist data from Tate Gallery and DBpedia
Summary
Chapter 9: Introducing Customizations
Looking at the Solr customizations
Playing with specific languages
Time for action – detecting language with Tika and LangDetect
Introducing stemming for query expansion
Time for action – adopting a stemmer
Following an example plugin lifecycle
Time for action – writing a new ResponseWriter plugin with the Thymeleaf library
Using Maven for development
Time for action – integrating Stanford NER for Named Entity extraction
Summary

What You Will Learn

  • Understand what full-text search and faceted navigation are and when to use them
  • Install and use Solr for testing
  • Write your own configurations for the Solr index incrementally and test them with the Solr web UI
  • Learn how to test a running Solr instance using cURL with different formats, such as XML and JSON
  • Analyze your data and define the entities to be indexed in Solr
  • Examine text and make auto-suggestions
  • Index data using various formats and various data sources, and learn how to expose data in various formats
  • Start using Solr in contexts like Open Data and Linked Data
  • Use Solr to expand your data with resources from public, well-known knowledge bases

In Detail

With over 40 billion web pages, optimizing a search engine's performance is essential.

Solr is an open source enterprise search platform from the Apache Lucene project. Full-text search, faceted search, hit highlighting, dynamic clustering, database integration, and rich document handling are just some of its many features. Solr is highly scalable thanks to its distributed search and index replication.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Apache Tomcat or Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable with most popular programming languages. Solr's powerful external configuration allows it to be tailored to many types of application without Java coding, and it has a plugin architecture to support more advanced customization.
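As a rough illustration of the REST-like HTTP and JSON API mentioned above, the following Python sketch builds a search URL for a core's /select handler and reads the documents out of a JSON response. The host and port (localhost:8983), the core name "paintings", and the query "title:sunflowers" are assumptions made for this example, not details taken from the book.

```python
import json
from urllib.parse import urlencode

# Assumption for illustration: a local Solr instance on the usual port.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(core, query, rows=10, base=SOLR_BASE):
    """Return the URL of a JSON search request against a core's /select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base}/{core}/select?{urlencode(params)}"


def extract_docs(response_text):
    """Pull the list of matching documents out of a Solr JSON response body."""
    return json.loads(response_text)["response"]["docs"]


# "paintings" and the field name "title" are hypothetical.
url = build_select_url("paintings", "title:sunflowers", rows=5)
print(url)
```

With a Solr instance actually running, the URL could be fetched with `urllib.request.urlopen(url)` or with cURL, and the response body passed to `extract_docs`.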

With Apache Solr Beginner's Guide you will learn how to configure your own search engine experience. Using real data as an example, you will have the chance to start writing step-by-step, simple, real-world configurations and understand when and where to adopt this technology.

Apache Solr Beginner's Guide will start by letting you explore a simple search over real data. You will then go through a step-by-step description that gives you the chance to explore several practical features. At the end of the book you will see how Solr is used in different real-world contexts.

Using data from public sources such as DBpedia, you will define several different configurations, exploring some of the most interesting Solr features, such as faceted search and navigation, auto-suggestion, and rich document indexing. You will see how to configure different analyzers for handling different data types, without programming.
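To give a flavor of what a faceted search request looks like, here is a minimal Python sketch that prepares the standard facet parameters and converts the facet counts of a JSON response into a dictionary. It assumes the JSON response writer's default flat layout, in which terms and counts alternate in one list; the field name "artist" and the sample counts are made up for illustration.

```python
from urllib.parse import urlencode


def facet_params(query, field, limit=10):
    """Build a query string asking Solr for facet counts on a single field."""
    return urlencode({
        "q": query,
        "rows": 0,              # return counts only, no documents
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,
        "wt": "json",
    })


def facet_counts(response, field):
    """Convert a flat [term, count, term, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# Hand-written sample in the assumed flat layout; not real data.
sample = {"facet_counts": {"facet_fields": {"artist": ["Monet", 12, "Degas", 7]}}}
print(facet_counts(sample, "artist"))
```

The query string from `facet_params` would be appended to a core's /select URL; the resulting term counts are what a faceted navigation menu typically displays.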

You will learn the basics of Solr, focusing on real-world examples and practical configurations.
