Apache Solr for Indexing Data

Apache Solr for Indexing Data: Enhance your Solr indexing experience with advanced techniques and the built-in functionalities available in Apache Solr

Book Dec 2015 160 pages 1st Edition

Product Details


Publication date : Dec 28, 2015
Length 160 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781783553235


Chapter 1. Getting Started

We will start this chapter with a quick overview of Solr, followed by a section that helps you get Solr up and running. We will also cover some basic building blocks of the Solr architecture, its directory structure, and its configuration files. This chapter covers the following topics:

  • Overview and installation of Solr

  • Running Solr

  • The Solr architecture and directory structure

  • Multicore Solr

Overview and installation of Solr


Solr is one of the most popular open source enterprise search platforms and comes from the Apache Lucene open source project. Its features include full text search, faceted search, highlighting, near-real-time indexing, dynamic clustering, rich document handling, and geospatial search. Solr is highly reliable and scalable, which is why it powers the search features of some of the world's largest Internet sites, for example, Netflix, TicketMaster, and SourceForge (source: https://wiki.apache.org/solr/PublicServers).

Solr is written in Java and runs as a standalone full text search server with a REST-like API. You feed documents into it (a process called indexing) as XML, JSON, CSV, or binary over HTTP. You query it through HTTP GET and receive XML, JSON, CSV, or binary results.
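
As a minimal sketch of what querying over HTTP GET means in practice, the following Python snippet builds the URL for a query. The base URL and the helper name build_select_url are assumptions made for illustration (a local Solr on port 8983 is set up later in this chapter); q and wt are the standard Solr parameters for the query string and the response format.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance listening on the default port.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query, fmt="json", base=SOLR_BASE):
    """Return the HTTP GET URL that asks Solr to run `query`.

    `q` carries the query string and `wt` selects the response
    format (for example json, xml, or csv).
    """
    params = urlencode({"q": query, "wt": fmt})
    return f"{base}/select?{params}"


# The match-all query; special characters are percent-encoded.
match_all_url = build_select_url("*:*")
print(match_all_url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, returns the matching documents once Solr is running and contains data.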

Let's go through the installation process of Solr. This section describes how to install Solr on Mac, Windows, and Linux, one operating system at a time.

Installing Solr in OS X (Mac)

The easiest way to install Solr on OS X is by using Homebrew. If you don't have Homebrew installed on your Mac, go to http://brew.sh/. Homebrew is a convenient way of installing packages and software on a Mac.

You will require JRE 1.7 or above to install Solr on OS X. Type java -version in the terminal to see which version of the JRE is installed on your computer. If it's lower than 1.7, upgrade it to a higher version before proceeding with the following instructions.
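
If you want to script this check, the comparison can be sketched in Python as follows. The function names and the sample version strings are hypothetical; the snippet only parses text in the format that java -version prints (on its standard error stream) and compares it with the 1.7 requirement.

```python
import re


def parse_java_version(version_output):
    """Extract (major, minor) from the output of `java -version`.

    The first line normally looks like: java version "1.7.0_80"
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a version string")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor


def meets_solr_requirement(version_output, required=(1, 7)):
    """Return True if the reported Java version is at least `required`."""
    return parse_java_version(version_output) >= required


# Hypothetical sample outputs, used only to exercise the functions.
print(meets_solr_requirement('java version "1.6.0_45"'))
print(meets_solr_requirement('java version "1.7.0_80"'))
```

To feed it real data, capture the output of java -version with subprocess.run(["java", "-version"], capture_output=True, text=True) and pass the stderr attribute of the result to meets_solr_requirement.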

Just type the following command in the terminal and it will automatically download all the files needed for Solr. Sit back and relax for a few minutes until it completes:

$ brew install solr

Running Solr


To test whether your installation was completed successfully, you need to run Solr. Type these commands in the terminal to run it:

$ cd /usr/local/Cellar/solr/4.4.0/libexec/example/
$ java -jar start.jar

After you run the preceding commands, you will see a large number of log messages in the terminal. Don't worry, this is normal. If any errors are reported, fix them before continuing. Once the messages stop and no error has been reported, open any web browser and go to http://localhost:8983/solr/#/.

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You will see the following screen in your browser:

A fresh Solr installation does not contain any data. In Solr terminology, a unit of data is called a document. You will learn how to index data in Solr in upcoming chapters.

Installing Solr in Windows

There are multiple ways of installing Solr on a Windows machine. Here, I have explained the way to set up Solr with Jetty running as a service via NSSM:

  1. Install the latest Java JDK from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

  2. Download the latest Solr release (ZIP version) from http://www.apache.org/dyn/closer.cgi/lucene/solr/. At the time of writing this book, the latest Solr release was 4.10.1.

  3. Unzip the Solr download. You should have files as shown in the following screenshot. Open the example folder.

  4. Copy the etc, lib, logs, solr, and webapps folders and start.jar to C:\solr (you will need to create the folder at C:\solr), as shown in the following screenshot:

  5. Now open the C:\solr\solr folder and copy its contents back to the root C:\solr folder. When you are done, you can delete the C:\solr\solr folder. The following image shows the selected folder, which you can now delete:

    At this point, your C:\solr directory should look like what is shown in the following screenshot:

  6. Solr can be run at this point if you start it from the command line. Change your directory to C:\solr and then run java -Dsolr.solr.home=C:/solr/ -jar start.jar.

  7. If you go to http://localhost:8983/solr/, you should see the Solr dashboard.

  8. Now that Solr is up and running, we can get Jetty to run as a Windows service. Since Jetty comes bundled with Solr, all that we need to do is run it as a service. There are several options for this, but the one I prefer is the Non-Sucking Service Manager (NSSM), which works across a wide range of Windows environments. NSSM can be downloaded from http://nssm.cc/download.

  9. Once you have downloaded NSSM, open the win32 or win64 folder as appropriate and copy nssm.exe to your C:\solr folder.

  10. Open Command Prompt, change the directory to C:\solr, and then run nssm install Solr.

  11. A dialog will open. Select java.exe as the application located at C:\Windows\System32\.

  12. In the options input box, enter: -Dsolr.solr.home=C:/solr/ -Djetty.home=C:/solr/ -Djetty.logs=C:/solr/logs/ -cp C:/solr/lib/*.jar;C:/solr/start.jar -jar C:/solr/start.jar

  13. Click on Install service. You should get a service successfully installed message.

  14. Finally run net start Solr.

  15. Jetty should now be running as a service. Check this by going to http://localhost:8983/solr/.

Installing Solr on Linux

To install Solr on Linux/Unix, you will need Java Runtime Environment (JRE) version 1.7 or higher. Then follow these steps:

  1. Download the latest Solr release (.tgz) from http://www.apache.org/dyn/closer.cgi/lucene/solr/. At the time of writing this book, the latest release was 4.10.1.

  2. Unpack the file to your desired location.

  3. Solr runs inside a Java servlet container, such as Tomcat or Jetty. The Solr distribution includes a working demo server in the example directory, which runs in Jetty. You can use the Jetty servlet container or your preferred servlet container. If you are using a servlet container other than Jetty and it's already running, stop that server.

  4. Copy the solr-4.10.1.war file from the Solr distribution under the dist directory to the webapps directory of your servlet container. Change the name of this file; it must be named solr.war.

  5. Copy the Solr home directory, solr-4.x.0/example/solr/, from the distribution to your desired Solr home location.

  6. Start your servlet container, passing to it the location of your Solr home in one of these ways:

    1. Set the solr.solr.home Java system property to your Solr home (for example, using this example jetty setup: java -Dsolr.solr.home=/some/dir -jar start.jar).

    2. Configure the servlet container so that a JNDI lookup of java:comp/env/solr/home by the Solr web app will point to your Solr Home.

    3. Start the servlet container in the directory containing ./solr. The default Solr Home is solr under the JVM's current working directory ($CWD/solr).

  7. To confirm the installation, go to http://localhost:8983/solr/ and you will see the Solr dashboard. Now your Solr is up and running.
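
The three ways of telling Solr where its home is (step 6) can be pictured as a simple fallback chain. The Python sketch below is only an illustration: the function name is hypothetical, and the order in which the JNDI value and the system property are consulted is an assumption, since the steps above do not state a precedence.

```python
import os


def resolve_solr_home(jndi_value=None, system_property=None, cwd=None):
    """Pick the Solr home directory from the available sources.

    jndi_value      -- result of the java:comp/env/solr/home lookup, if any
    system_property -- value of -Dsolr.solr.home, if any
    cwd             -- working directory of the JVM (defaults to the
                       current directory of this Python process)

    Assumption: JNDI is consulted first, then the system property.
    """
    if jndi_value:
        return jndi_value
    if system_property:
        return system_property
    # Default: a directory named "solr" under the working directory.
    return os.path.join(cwd or os.getcwd(), "solr")


print(resolve_solr_home(system_property="/some/dir"))
print(resolve_solr_home(cwd="/opt/jetty"))
```

In practice, setting only one of the three is the simplest way to avoid surprises about which location wins.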

Thus, by the end of the installation, your Solr is up and running. But since we have not fed any data into Solr, its index is still empty. Let's try to insert some example data into our server.

The Solr download comes with example data bundled in it. We can use the same data for indexing as an example. Go to the exampledocs directory under the example directory. Here, you will see a lot of files. Now go to the command line (terminal) and type the following commands:

$ cd $SOLR_HOME/example/exampledocs/
$ ./post.sh vidcard.xml

The post.sh script uses curl to call http://localhost:8983/solr/update and post the XML data from the vidcard.xml file. When the import completes (without any error), you will see a message that looks something like this:

Now let's try to check out our imported data from web browser. Try http://localhost:8983/solr/select?q=*:*&wt=json to fetch all of the data in your Solr instance, like this:

When you see the preceding data, it means that your Solr server is running properly and is ready to index your desired feed. You will read about indexing in depth in upcoming chapters.
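
The same round trip, posting XML to the update handler and then counting the documents returned by a select query, can be sketched in Python. This is an illustration rather than a replacement for post.sh: the helper names, the one-document payload, and the sample response are hypothetical. The requests are only built here, not sent, so the snippet runs without a Solr server.

```python
import json
from urllib.request import Request

# Assumption: the local Solr instance started earlier in this chapter.
UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_bytes, url=UPDATE_URL):
    """Build (but do not send) a POST request carrying XML for the update handler."""
    return Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


def count_documents(select_response_text):
    """Return numFound from the JSON body of a /select response."""
    return json.loads(select_response_text)["response"]["numFound"]


# Hypothetical one-document payload, followed by a commit so that
# the added document becomes visible to searches.
add_request = build_update_request(
    b'<add><doc><field name="id">EXAMPLE-1</field></doc></add>'
)
commit_request = build_update_request(b"<commit/>")

# Hypothetical response body in the shape returned with wt=json.
sample_response = (
    '{"responseHeader": {"status": 0}, '
    '"response": {"numFound": 1, "start": 0, "docs": [{"id": "EXAMPLE-1"}]}}'
)

print(add_request.get_method(), add_request.full_url)
print(count_documents(sample_response))
```

With Solr running, passing add_request and then commit_request to urllib.request.urlopen sends them to the server.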

The Solr architecture and directory structure


In real-world scenarios, Solr runs with other applications on a web server. A typical example is an online store application. The store provides a user interface, a shopping cart, an items catalogue, and a way to make purchases. It needs to store this information in some sort of database. Here, Solr makes it easy to add the capability of searching data in the online store. To make data searchable, you need to feed it to Solr for indexing. Data can be fed to Solr in various ways and in various formats, such as .pdf, .doc, and .txt. In the process of feeding data to Solr, you need to define a schema. A schema is a way of telling Solr about your data and how you want it to be indexed. Many factors need to be considered while feeding data, which we will discuss in detail in upcoming chapters.

Solr queries are RESTful, which means that a Solr query is just a simple HTTP request and the response is a structured document, mainly in XML, but it could be JSON, CSV, or any other format as well based on your requirement. A typical architecture of Solr in the real world looks something like this:

Do not worry if you are not able to understand the preceding diagram right now. We will cover every component related to indexing in detail. The purpose of this diagram is to give you a feel for the current architecture of Solr and how it works in the real world. If you look closely at the preceding diagram, you will find two .xml files named schema.xml and solrconfig.xml. These are the two most important files in the Solr configuration and are considered the building blocks of Solr.

Solr directory structure

Here's the directory layout of a typical Solr Home directory:

| + conf 
|     - schema.xml 
|     - solrconfig.xml 
|     - stopwords.txt
|     - synonyms.txt etc
| + data 
|     - index 
|     - spellchecker
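
A quick way to check that a directory follows this layout is to look for the two configuration files under conf. The Python sketch below uses a hypothetical helper name and a temporary directory as a stand-in for a real Solr home; it reports which of the expected files are missing.

```python
import tempfile
from pathlib import Path

# The two configuration files shown under conf in the layout above.
REQUIRED_CONF_FILES = ("schema.xml", "solrconfig.xml")


def missing_conf_files(core_dir):
    """Return the names of required files that are absent from core_dir/conf."""
    conf = Path(core_dir) / "conf"
    return [name for name in REQUIRED_CONF_FILES if not (conf / name).is_file()]


# Demonstration on a throwaway directory that only contains schema.xml.
with tempfile.TemporaryDirectory() as tmp:
    conf_dir = Path(tmp) / "conf"
    conf_dir.mkdir()
    (conf_dir / "schema.xml").write_text("<schema/>")
    demo_missing = missing_conf_files(tmp)

print(demo_missing)
```

Pointing missing_conf_files at your own Solr home (or at a core directory inside it) returns an empty list when both files are in place.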

Let's get a brief understanding of solrconfig.xml and schema.xml here before we proceed further, as these are the building blocks of Solr (as stated earlier). We will cover them in detail in the next few chapters.

The solrconfig.xml file is the core configuration file of Solr, with most parameters affecting Solr itself directly. This file can be found in the solr/collection1/conf/ directory. When configuring Solr, you'll work with solrconfig.xml often. The file consists of a series of XML statements that set configuration values, and some of the most important configurations are:

  • Defining data dir (the directory where indexed files remain)

  • Request handlers (handle incoming HTTP requests)

  • Listeners

  • Request dispatchers (used to manage HTTP communications)

  • Admin web interface settings

  • Replication and duplication parameters

These are some of the important configurations defined in solrconfig.xml. This file is well commented; I would advise you to go through it from the start and read all the comments. You will get a very good understanding of the various components involved in the Solr configuration.
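
Because solrconfig.xml is plain XML, you can also inspect it programmatically. The following Python sketch lists the request handlers declared in a configuration. The embedded fragment is a heavily shortened, hypothetical stand-in for a real solrconfig.xml, and the function name is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for a real solrconfig.xml.
SAMPLE_CONFIG = """<config>
  <dataDir>${solr.data.dir:}</dataDir>
  <requestHandler name="/select" class="solr.SearchHandler"/>
  <requestHandler name="/update" class="solr.UpdateRequestHandler"/>
</config>"""


def list_request_handlers(xml_text):
    """Map each request handler's name (its URL path) to its class."""
    root = ET.fromstring(xml_text)
    return {
        handler.get("name"): handler.get("class")
        for handler in root.iter("requestHandler")
    }


for path, handler_class in list_request_handlers(SAMPLE_CONFIG).items():
    print(path, "->", handler_class)
```

To run it against a real file, read the file's text with open(...).read() and pass it to list_request_handlers.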

The second most important configuration file is schema.xml. This file can be found in the solr/collection1/conf/ directory. As the name says, this file is used to define the schema of the data (content) that you want to index and make searchable. A unit of data is called a document in Solr terminology. The schema.xml file contains all the details about the fields that your documents can contain, and how these fields should be dealt with when adding documents to the index or when querying those fields. This file can be divided broadly into two sections:

  • The types section (the definitions of all types)

  • The fields section (the definitions of the document structure using types)

The structure of your document should be defined as a field under the fields section. Let's say you have to define a book as a document in Solr with fields as isbn, title, author, and price. The schema will be as follows:

<field name="isbn" type="string" required="true" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="price" type="int" indexed="true" stored="true"/>

In the preceding schema, you see a type attribute, which defines the data type of the field. You can change the behavior of the field by changing the type. The multiValued attribute is used to tell Solr that the field can hold multiple values, while the required attribute makes the field mandatory for creating a document. After the fields section ends, we need to mention which field is going to be unique. In our case, it is going to be isbn:

<uniqueKey>isbn</uniqueKey>
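
To make the effect of the required and multiValued attributes concrete, here is a small Python sketch that checks a document against the book schema. It is a simplified illustration, not how Solr validates documents internally: the function name and sample documents are hypothetical, the wrapping schema and fields elements are assumed, and features such as dynamic fields are ignored.

```python
import xml.etree.ElementTree as ET

# The book fields from this section, wrapped in an assumed minimal schema.
SCHEMA = """<schema name="books" version="1.5">
  <fields>
    <field name="isbn" type="string" required="true" indexed="true" stored="true"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="author" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <field name="price" type="int" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>isbn</uniqueKey>
</schema>"""


def check_document(doc, schema_xml=SCHEMA):
    """Return a list of problems found when comparing doc with the schema."""
    root = ET.fromstring(schema_xml)
    problems = []
    known = set()
    for field in root.iter("field"):
        name = field.get("name")
        known.add(name)
        value = doc.get(name)
        if field.get("required") == "true" and value is None:
            problems.append(f"missing required field: {name}")
        if isinstance(value, list) and field.get("multiValued") != "true":
            problems.append(f"field is not multiValued: {name}")
    for name in doc:
        if name not in known:
            problems.append(f"unknown field: {name}")
    return problems


good = {"isbn": "9781783553235", "title": "Apache Solr for Indexing Data",
        "author": ["Author One", "Author Two"], "price": 30}
bad = {"title": ["Two", "Titles"], "publisher": "Packt"}

print(check_document(good))
print(check_document(bad))
```

The second document is rejected for three reasons: it lacks the required isbn, it gives several values to the single-valued title field, and it uses a field that the schema does not define.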

The schema.xml file is also well commented. I again advise you to go through the comments in this file to start with; this will help you understand the various field types and data types in detail.

Cores in Solr (Multicore Solr)


Solr cores make it possible to run multiple indexes with different configurations and schemas in a single Solr instance. The multicore feature of Solr helps in unified administration of Solr instances for completely different applications. Cores in Solr are fairly isolated and have their own configuration and schema files. This makes it possible to manage cores at runtime (create or remove them) in a Solr instance without restarting the process.

Cores in Solr are managed through a configuration file called solr.xml. The solr.xml file is present in your Solr Home directory. Since its inception, solr.xml has evolved from configuring one core to managing multiple cores and eventually defining parameters for SolrCloud. Do not worry much about SolrCloud if you are not aware of it, as we have a dedicated chapter that covers SolrCloud in detail. In brief, SolrCloud is the name for Solr's distributed search and indexing capabilities. When we need to index huge amounts of data, we need to think of scalability and performance. This is where SolrCloud comes into the picture.

Starting from Solr 4.3, Solr maintains two distinct formats for solr.xml: legacy and discovery mode. The legacy format will be supported throughout the 4.x.0 series and will be deprecated in the 5.0 release of Solr. The default solr.xml config file looks something like this:

<solr>

  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${jetty.port:8983}</int>
    <str name="hostContext">${hostContext:solr}</str>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>

  <shardHandlerFactory name="shardHandlerFactory"
    class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>

</solr>

The preceding configuration shows that Solr configurations are SolrCloud friendly, but this does not mean that Solr is running in SolrCloud mode, unless you start Solr with some special parameters (explained in Chapter 10, Distributed Indexing). To configure multiple cores in Solr in legacy format, you need to edit the solr.xml file with the following code snippet and remove the existing discovery code from solr.xml:

<solr persistent="false">
  <cores adminPath="/admin/cores" defaultCoreName="core1">
    <core name="core1" instanceDir="core1" />
    <core name="core2" instanceDir="core2" />
  </cores>
</solr>

Now you need to create two cores (new directories, core1 and core2) in the Solr directory. You also need to create Solr configuration files for the new cores. To do this, just copy the same configuration files (the conf directory in collection1) into both cores for now and restart the Solr server after you have made these settings.

Once you restart the Solr server with the preceding configuration, two cores will be created, named core1 and core2, with the existing default Solr configuration settings. The instanceDir attribute defines the directory name, relative to solr.xml, in which to look for configuration and data files. You can modify the paths of these cores according to your wishes and the configuration files according to your use case. You can also change the names of the cores.
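
The directory preparation described above (creating core1 and core2 and copying the conf directory from collection1) can be scripted. The following Python sketch uses a hypothetical helper name and demonstrates the copy on a temporary directory standing in for the Solr home, so it can be tried without touching a real installation.

```python
import shutil
import tempfile
from pathlib import Path


def create_cores(solr_home, core_names, template_core="collection1"):
    """Create one directory per core and copy the template conf into it.

    Returns the names of the cores that were actually created;
    cores whose conf directory already exists are left untouched.
    """
    home = Path(solr_home)
    template_conf = home / template_core / "conf"
    created = []
    for name in core_names:
        target = home / name / "conf"
        if not target.exists():
            # copytree also creates the parent core directory.
            shutil.copytree(template_conf, target)
            created.append(name)
    return created


# Demonstration on a throwaway stand-in for the Solr home directory.
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "collection1" / "conf"
    template.mkdir(parents=True)
    (template / "schema.xml").write_text("<schema/>")
    created = create_cores(tmp, ["core1", "core2"])
    directories = sorted(p.name for p in Path(tmp).iterdir())
    core2_has_schema = (Path(tmp) / "core2" / "conf" / "schema.xml").is_file()

print(created)
print(directories)
```

After running create_cores against your real Solr home, restart the server so that the cores listed in solr.xml are loaded.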

You can verify your settings by opening the following URL in your browser: http://localhost:8983/solr/.

You will see two new cores in the Solr dashboard. Currently, there are no documents in either core because we have not indexed any data so far. This concludes the process of creating multiple cores in Solr.

Summary


Thus, by the end of the first chapter, you have learned what Solr is, how to install and run it on various operating systems, what the various components and basic building blocks of Solr are (such as its configuration files and directory structure), and how to set up configuration files. You also learned in brief about the architecture of Solr. In the last section, we covered multicore setup in the Solr 4.x.0 series. However, the legacy method of multicore setup is going to be deprecated in the Solr 5.x release, after which only discovery mode will remain.

In the next chapter, we will look deeply into the various components used in Solr configuration files, such as tokenizers, analyzers, filters, field types, and so on.


Key benefits

  • Learn about distributed indexing and real-time optimization to change index data on the fly
  • Index data from various sources and web crawlers using built-in analyzers and tokenizers
  • This step-by-step guide is packed with real-life examples on indexing data

Description

Apache Solr is a widely used, open source enterprise search server that delivers powerful indexing and searching features. These features help fetch relevant information from various sources and documentation. Solr also combines with other open source tools such as Apache Tika and Apache Nutch to provide more powerful features. This fast-paced guide starts by helping you set up Solr and get acquainted with its basic building blocks, to give you a better understanding of Solr indexing. You’ll quickly move on to indexing text and boosting the indexing time. Next, you’ll focus on basic indexing techniques, various index handlers designed to modify documents, and indexing a structured data source through Data Import Handler. Moving on, you will learn techniques to perform real-time indexing and atomic updates, as well as more advanced indexing techniques such as de-duplication. Later on, we’ll help you set up a cluster of Solr servers that combine fault tolerance and high availability. You will also gain insights into working scenarios of different aspects of Solr and how to use Solr with e-commerce data. By the end of the book, you will be competent and confident working with indexing and will have a good knowledge base to efficiently program elements.

What you will learn

  • Get to know the basic features of Solr indexing and the analyzers/tokenizers available
  • Index XML/JSON data in Solr using the HTTP Post tool and CURL command
  • Work with Data Import Handler to index data from a database
  • Use Apache Tika with Solr to index word documents, PDFs, and much more
  • Utilize Apache Nutch and Solr integration to index crawled data from web pages
  • Update indexes in real-time data feeds
  • Discover techniques to index multi-language and distributed data in Solr
  • Combine the various indexing techniques into a real-life working example of an online shopping web application


Table of Contents

18 Chapters

  • Apache Solr for Indexing Data
  • Credits
  • About the Authors
  • About the Reviewers
  • www.PacktPub.com
  • Preface
  • Getting Started
  • Understanding Analyzers, Tokenizers, and Filters
  • Indexing Data
  • Indexing Data – The Basic Technique and Using Index Handlers
  • Indexing Data with the Help of Structured Datasources – Using DIH
  • Indexing Data Using Apache Tika
  • Apache Nutch
  • Commits, Real-Time Index Optimizations, and Atomic Updates
  • Advanced Topics – Multilanguage, Deduplication, and Others
  • Distributed Indexing
  • Case Study of Using Solr in E-Commerce
  • Index

