Instant Apache Solr for Indexing Data How-to [Instant]
Learn how to index your data correctly and create better search experiences with Apache Solr with this book and ebook
This article by Alexandre Rafalovitch, author of Instant Apache Solr for Indexing Data How-to [Instant], will help you create a basic Solr collection and populate it with a simple dataset in CSV format.
You will start this journey by creating your own first collection. However, even before that, please download the latest 4.x Solr distribution from https://lucene.apache.org/solr/. Also, go through the tutorial available at https://lucene.apache.org/solr/tutorial.html. This will both give you a taste of Solr's capabilities and make sure that you understand the most basic ways of working with Solr.
Assuming that you have walked through the tutorial, you should be nearly ready with the setup. Still, it does not hurt to go through the checklist:
Make sure that you know how to start your operating system's shell (cmd.exe on Windows, Terminal/iTerm on Mac, and sh/bash/tcsh/zsh on Unix).
Ensure that running the java -version command at the shell prompt reports at least version 1.6. You may need to upgrade if you have an older version.
Ensure that you know where you unpacked the Solr distribution and the full path to the example directory within that. You needed that directory for the tutorial, but that's also where we are going to start our own Solr instance. That allows us to easily run an embedded Jetty web server and to also find all the additional JAR files that Solr needs to operate properly.
Now, create a directory where we will store our indexes and experiments. It can be anywhere on your drive. As Solr can run on any operating system where Java can run, we will use SOLR-INDEXING as a name whenever we refer to that directory. Make sure to substitute the absolute path of your real directory wherever that name appears.
How to do it...
As our first example, we will create an index that stores and allows for the searching of simplified e-mail information. For now, we will just look at the addr_from and addr_to e-mail addresses and the subject line. You will see that it takes only two simple configuration files to get the basic Solr index working.
Under the SOLR-INDEXING directory, create a collection1 directory and inside that create a conf directory.
In the conf directory, create two files: schema.xml and solrconfig.xml.
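The two steps above can be sketched in a few lines of Python (the SOLR-INDEXING path here is relative for illustration; use your own absolute path):

```python
import os

# Create SOLR-INDEXING/collection1/conf and the two empty config files
base = "SOLR-INDEXING"
conf_dir = os.path.join(base, "collection1", "conf")
os.makedirs(conf_dir, exist_ok=True)
for name in ("schema.xml", "solrconfig.xml"):
    # Empty placeholders; the content is filled in below
    open(os.path.join(conf_dir, name), "w").close()
```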
The schema.xml file should have the following content:
<?xml version="1.0" encoding="UTF-8" ?>
<schema version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="addr_from" type="string" indexed="true" stored="true" required="true" />
    <field name="addr_to" type="string" indexed="true" stored="true" required="true" />
    <field name="subject" type="string" indexed="true" stored="true" required="true" />
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" />
  </types>
</schema>
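Before starting Solr, you can sanity-check a schema like the one above for well-formedness using Python's standard library. This snippet embeds an abbreviated two-field copy rather than reading your file:

```python
import xml.etree.ElementTree as ET

# An abbreviated schema in the same shape as the one above
# (bytes literal, because ElementTree rejects str input that
# carries an encoding declaration)
schema = b"""<?xml version="1.0" encoding="UTF-8" ?>
<schema version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="subject" type="string" indexed="true" stored="true" />
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="string" class="solr.StrField" />
  </types>
</schema>"""

root = ET.fromstring(schema)                # raises ParseError if malformed
field_names = [f.get("name") for f in root.iter("field")]
```

A parse error here is much easier to diagnose than a failed Solr startup later.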
The solrconfig.xml file should have the following content:
<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <requestDispatcher handleSelect="false">
    <httpCaching never304="true" />
  </requestDispatcher>
  <requestHandler name="/select" class="solr.SearchHandler" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin" class="solr.admin.AdminHandlers" />
  <requestHandler name="/analysis/field" class="solr.FieldAnalysisRequestHandler" startup="lazy" />
</config>
That is it. Now, let's start our just-created Solr instance. Open a new shell (we'll need the current one later). On that shell's command prompt, change the directory to the example directory of the Solr distribution and run the following command:
java -Dsolr.solr.home=SOLR-INDEXING -jar start.jar
Notice that solr.solr.home is not a typo; you do need the solr part twice. And, as always, if you have spaces in your paths (now or later), you may need to escape them in platform-specific ways, such as with backslashes on Unix/Linux or by quoting the whole value.
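If you prefer to launch Solr from a script, passing the arguments as a list sidesteps shell quoting altogether. A minimal sketch, assuming a hypothetical path that contains a space:

```python
import subprocess

# Hypothetical Solr home path containing a space; as a list element
# it needs no shell escaping at all
solr_home = "/home/user/My Projects/SOLR-INDEXING"
cmd = ["java", "-Dsolr.solr.home=" + solr_home, "-jar", "start.jar"]

# Uncomment to actually launch (requires a Solr install and its
# example directory as the working directory):
# subprocess.Popen(cmd, cwd="/path/to/solr/example")
```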
In the window of your shell, you should see a long list of messages that you can safely ignore (at least for now).
You can verify that everything is working fine by checking for the following three elements:
The long list of messages should finish with a message like Started SocketConnector@0.0.0.0:8983. This means that Solr is now running on port 8983 successfully.
You should now have a directory called data, right next to the directory called conf that we created earlier.
If you open the web browser and go to http://localhost:8983/solr/, you should see a web-based admin interface that makes testing and troubleshooting your Solr instance much easier. We will be using this interface later, so do spend a couple of minutes clicking around now.
Now, let's load some actual content into our collection:
- Copy post.jar from the Solr distribution's example/exampledocs directory to our root SOLR-INDEXING directory.
- Create a file called input1.csv in the collection1 directory, next to the conf and data directories. Give it a CSV header naming the four schema fields, followed by three records, one per line (the addresses and subjects here are illustrative; any values will do):
id,addr_from,addr_to,subject
email1,jack@example.com,jill@example.com,"Open position: we need more Junior Java engineers"
email2,jill@example.com,jack@example.com,"Updating vacancy description"
email3,jack@example.com,alex@example.com,"Interview schedule"
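Note that a quoted CSV field can legally contain commas and even line breaks, and Solr's CSV loader accepts both. A quick illustration with Python's csv module (these records are illustrative, not the exact file contents):

```python
import csv
import io

# Two illustrative records; the first subject spans two physical lines
# inside one quoted field
data = '''id,addr_from,addr_to,subject
email1,jack@example.com,jill@example.com,"Open position:
we need more Junior Java engineers"
email2,jill@example.com,jack@example.com,"Updating vacancy description"
'''

rows = list(csv.DictReader(io.StringIO(data)))
# The multi-line quoted subject is parsed back as a single field value
```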
- Run the import command from the command line in the SOLR-INDEXING directory (one long command; do not split it across lines):
java -Dauto -Durl=http://localhost:8983/solr/collection1/update -jar post.jar collection1/input1.csv
You should see the following in one of the message lines:
"1 files indexed".
If you now open a web browser and go to http://localhost:8983/solr/collection1/select?q=*%3A*&wt=ruby&indent=true, you should see Solr output with all three documents displayed on the screen in a somewhat readable format.
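The q=*%3A* part of that URL is simply the percent-encoded form of the match-everything query *:*. A small sketch of decoding it and rebuilding the same query string programmatically:

```python
from urllib.parse import unquote, urlencode

# %3A is the percent-encoded colon, so q=*%3A* means q=*:*
decoded = unquote("*%3A*")

# Building an equivalent query string from the same parameters
# (urlencode handles all the escaping for you)
params = urlencode({"q": "*:*", "wt": "ruby", "indent": "true"})
```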
How it works...
We have created two files to get our example working. Let's review what they mean and how they fit together:
The schema.xml file in the collection's conf directory defines the actual shape of the data that you want to store and index. The fields define the structure of a record. Each field has a type, which is also defined in the same file. Each field declares whether it is stored, indexed, required, or multivalued, along with a small number of other, more advanced properties. The field type, on the other hand, defines what is actually done to the field when it is indexed and when it is searched. We will explore all of these later.
The solrconfig.xml file also in the collection's conf directory defines and tunes the components that make up Solr's runtime environment. At the very least, it needs to define which URLs can be called to add records to a collection (here, /update), which to query a collection (here, /select), and which to do various administrative tasks (here, /admin and /analysis/field).
Once Solr started, it created a single collection with the default name of collection1, and assigned an update handler to it at the /solr/collection1/update URL and a search handler at the /solr/collection1/select URL (as per solrconfig.xml). At that point, Solr was ready for the data to be imported into the four required fields (as per schema.xml).
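The mapping from handler names to URLs described above can be sketched as follows (the base URL assumes the default port and collection name):

```python
# Each requestHandler name from solrconfig.xml becomes a path under
# the collection's base URL
base = "http://localhost:8983/solr/collection1"
handler_names = ["/select", "/update", "/admin", "/analysis/field"]
endpoints = {name: base + name for name in handler_names}
```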
We then proceeded to populate the index from a CSV file (one of many update formats available) and then verified that the records are all present in an indented Ruby format (again, one of many result formats available).
This article helped you create a basic Solr collection and populate it with a simple dataset in CSV format.
About the Author :
Alexandre Rafalovitch is an IT professional with more than 20 years of experience. Throughout his career, he has worked as a software developer, as a QA engineer, in a senior tech support role, and as a web master. Alexandre has worked with Java, C#, Python, and even XQuery, building software and websites (both the backend and frontend components). He is familiar with the issues of processing and presenting multilingual content in many languages, including Russian, Chinese, and Arabic.
Alexandre has developed several small open source projects of his own and has contributed to several more, including W3C Jigsaw and Apache Solr. He has published several industrial and academic publications and has presented at JavaOne twice.
Alexandre is currently working for the United Nations; however, the views expressed herein are those of the author and do not necessarily reflect the views of the United Nations.