In this article by Steve Perkins, author of Hibernate Search by Example, we will look at some advanced strategies for improving the performance and scalability of production applications, through code as well as server architecture.
Before diving into some advanced strategies for improving performance and scalability, let's briefly recap some of the general performance tips already spread across the book:
- When mapping your entity classes for Hibernate Search, use the optional elements of the @Field annotation to strip unnecessary bloat from your Lucene indexes:
  - If you are definitely not using index-time boosting, then there is no reason to store the information needed to make it possible. Set the norms element to Norms.NO.
  - By default, the information needed for a projection-based query is not stored unless you set the store element to Store.YES or Store.COMPRESS. If you had projection-based queries that are no longer being used, then remove this element as part of the cleanup.
- Use conditional indexing and partial indexing to reduce the size of Lucene indexes.
- Rely on filters to narrow your results at the Lucene level, rather than using a WHERE clause at the database query level.
- Experiment with projection-based queries wherever possible, to reduce or eliminate the need for database calls. Be aware that with advanced database caching, the benefits might not always justify the added complexity.
- Test various index manager options, such as trying the near-real-time index manager or the async worker execution mode.
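The last tip is purely a configuration change, so it is cheap to experiment with. The following is a minimal sketch, assuming the Hibernate Search 4.x property names (verify them against the version you are running). The class and method names are illustrative only; in a real application these keys would normally live in hibernate.cfg.xml or persistence.xml.

```java
import java.util.Properties;

// Illustrative helper, not part of Hibernate Search.
// Property names are assumed from Hibernate Search 4.x.
final class IndexManagerOptions {

    // Selects the near-real-time index manager for all indexes.
    // As noted later in the article, this is not usable in a cluster.
    static Properties nearRealTime() {
        Properties p = new Properties();
        p.setProperty("hibernate.search.default.indexmanager", "near-real-time");
        return p;
    }

    // Selects the async worker execution mode for all indexes.
    static Properties asyncWorker() {
        Properties p = new Properties();
        p.setProperty("hibernate.search.default.worker.execution", "async");
        return p;
    }

    public static void main(String[] args) {
        nearRealTime().list(System.out);
        asyncWorker().list(System.out);
    }
}
```

Trying one option at a time, and measuring before and after, makes it easier to tell which setting actually helped.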
Running applications in a cluster
Making modern Java applications scale in a production environment usually involves running them in a cluster of server instances. Hibernate Search is perfectly at home in a clustered environment, and offers multiple approaches for configuring a solution.
The most straightforward approach requires very little Hibernate Search configuration. Just set up a file server for hosting your Lucene indexes, and make it available to every server instance in your cluster (for example, through NFS or Samba):
A simple cluster with multiple server nodes using a common Lucene index on a shared drive
Each application instance in the cluster uses the default index manager, and the usual filesystem directory provider.
In this arrangement, all of the server nodes are true peers. They each read from the same Lucene index, and whichever node performs an update is also responsible for the write. To prevent corruption, Hibernate Search relies on the locking strategy (that is, either "simple" or "native") to block simultaneous writes.
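As a rough sketch of what each peer node's settings might look like, the snippet below builds the same properties for every node. The property names assume Hibernate Search 4.x, and the mount point /mnt/shared/lucene is a made-up placeholder; the class name is illustrative as well.

```java
import java.util.Properties;

// Illustrative helper, not part of Hibernate Search.
// Property names are assumed from Hibernate Search 4.x.
final class SimpleClusterConfig {

    // Every node uses identical settings and points at the same shared path.
    static Properties forNode(String sharedIndexPath) {
        Properties p = new Properties();
        // The usual filesystem directory provider
        p.setProperty("hibernate.search.default.directory_provider", "filesystem");
        // Root directory of the Lucene indexes on the shared drive
        p.setProperty("hibernate.search.default.indexBase", sharedIndexPath);
        // Locking strategy that blocks simultaneous writes ("simple" or "native")
        p.setProperty("hibernate.search.default.locking_strategy", "simple");
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical mount point for the shared file server
        forNode("/mnt/shared/lucene").list(System.out);
    }
}
```

The sketch picks "simple" only as an example. Native OS file locks can behave unreliably on some network filesystems, so whichever strategy you choose is worth testing against your actual file share.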
Recall that the "near-real-time" index manager is explicitly incompatible with a clustered environment.
The advantage of this approach is two-fold. First and foremost is simplicity. The only steps involved are setting up a filesystem share, and pointing each application instance's directory provider to the same location. Secondly, this approach ensures that Lucene updates are instantly visible to all the nodes in the cluster.
However, a serious downside is that this approach can only scale so far. Very small clusters may work fine, but larger numbers of nodes trying to simultaneously access the same shared files will eventually lead to lock contention.
Also, the file server on which the Lucene indexes are hosted is a single point of failure. If the file share goes down, then your search functionality breaks catastrophically and instantly across the entire cluster.
When your scalability needs outgrow the limitations of a simple cluster, Hibernate Search offers more advanced models to consider. The common element among them is the idea of a master node being responsible for all Lucene write operations.
Clusters may also include any number of slave nodes. Slave nodes may still initiate Lucene updates, and the application code can't really tell the difference. However, under the covers, slave nodes delegate that work to be actually performed by the master node.
In a master-slave cluster, there is still an "overall master" Lucene index, which logically stands apart from all of the nodes. This may be filesystem-based, just as it is with a simple cluster. However, it may instead be based on JBoss Infinispan (http://www.jboss.org/infinispan), an open source in-memory NoSQL datastore sponsored by the same company that principally sponsors Hibernate development.
In a filesystem-based approach, all nodes keep their own local copies of the Lucene indexes. The master node actually performs updates on the overall master indexes, and all of the nodes periodically read from that overall master to refresh their local copies.
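The filesystem-based arrangement can be sketched as two sets of properties, one for the master and one for each slave. This assumes the Hibernate Search 4.x directory providers named filesystem-master and filesystem-slave and their sourceBase, indexBase, and refresh settings; all paths and the class name are hypothetical.

```java
import java.util.Properties;

// Illustrative helper, not part of Hibernate Search.
// Property names are assumed from Hibernate Search 4.x.
final class FilesystemReplicationConfig {

    private static final String PREFIX = "hibernate.search.default.";

    // Master: publishes its index to the shared copy at each refresh interval.
    static Properties master(String sharedCopy, String localIndex, int refreshSeconds) {
        return build("filesystem-master", sharedCopy, localIndex, refreshSeconds);
    }

    // Slave: refreshes its local copy from the shared copy at each interval.
    static Properties slave(String sharedCopy, String localIndex, int refreshSeconds) {
        return build("filesystem-slave", sharedCopy, localIndex, refreshSeconds);
    }

    private static Properties build(String provider, String sharedCopy,
                                    String localIndex, int refreshSeconds) {
        Properties p = new Properties();
        p.setProperty(PREFIX + "directory_provider", provider);
        p.setProperty(PREFIX + "sourceBase", sharedCopy);   // the "overall master" copy
        p.setProperty(PREFIX + "indexBase", localIndex);    // this node's local copy
        p.setProperty(PREFIX + "refresh", Integer.toString(refreshSeconds));
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical paths and a 30-minute refresh period
        master("/mnt/shared/lucene", "/var/lucene/local", 1800).list(System.out);
        slave("/mnt/shared/lucene", "/var/lucene/local", 1800).list(System.out);
    }
}
```

The refresh period is the main trade-off here: a shorter interval makes updates visible on the slaves sooner, at the cost of more frequent copying.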
In an Infinispan-based approach, the nodes all read from the Infinispan index (although it is still recommended to delegate writes to a master node). Therefore, the nodes do not need to maintain their own local index copies. In reality, because Infinispan is a distributed datastore, portions of the index will reside on each node anyway. However, it is still best to visualize the overall index as a separate entity.
There are two available mechanisms by which slave nodes delegate write operations to the master node:
A JMS message queue provider creates a queue, and slave nodes send messages to this queue with details about Lucene update requests. The master node monitors this queue, retrieves the messages, and actually performs the update operations.
You may instead replace JMS with JGroups (http://www.jgroups.org), an open source multicast communication system for Java applications. This has the advantage of being faster and more immediate. Messages are received in real-time, synchronously rather than asynchronously.
However, JMS messages are generally persisted to a disk while awaiting retrieval, and therefore can be recovered and processed later, in the event of an application crash. If you are using JGroups and the master node goes offline, then all the update requests sent by slave nodes during that outage period will be lost. To fully recover, you would likely need to reindex your Lucene indexes manually.
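The trade-off between the two mechanisms can be illustrated with a toy model. None of the types below belong to Hibernate Search, JMS, or JGroups; they are hypothetical stand-ins that only mimic the behavior described above, where queued requests survive a master outage and directly delivered requests do not.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy model only: these classes are hypothetical and use no real messaging API.
final class DelegationModel {

    // Stand-in for the master node, which alone applies index updates.
    static final class Master {
        boolean online = true;
        final List<String> applied = new ArrayList<>();

        void apply(String update) {
            applied.add(update);
        }
    }

    // Queue-style delivery: requests wait until the master collects them.
    static final class QueuedChannel {
        private final Queue<String> pending = new ArrayDeque<>();

        void send(String update) {
            pending.add(update);
        }

        void deliverPendingTo(Master master) {
            while (master.online && !pending.isEmpty()) {
                master.apply(pending.poll());
            }
        }
    }

    // Direct delivery: immediate when the master is up, dropped when it is not.
    static final class DirectChannel {
        int lost = 0;

        void send(String update, Master master) {
            if (master.online) {
                master.apply(update);
            } else {
                lost++;
            }
        }
    }

    public static void main(String[] args) {
        Master queuedMaster = new Master();
        Master directMaster = new Master();
        QueuedChannel queued = new QueuedChannel();
        DirectChannel direct = new DirectChannel();

        // Both masters are offline while a slave sends two update requests.
        queuedMaster.online = false;
        directMaster.online = false;
        for (String update : new String[] {"add #1", "delete #2"}) {
            queued.send(update);
            direct.send(update, directMaster);
        }

        // The masters come back: only the queued requests are still available.
        queuedMaster.online = true;
        directMaster.online = true;
        queued.deliverPendingTo(queuedMaster);

        System.out.println("queued applied: " + queuedMaster.applied.size());
        System.out.println("direct applied: " + directMaster.applied.size()
                + ", lost: " + direct.lost);
    }
}
```

In the direct case, the lost requests are exactly the ones that would force a manual reindex after the outage.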
A master-slave cluster using a directory provider based on filesystem or Infinispan, and worker based on JMS or JGroups. Note that when using Infinispan, nodes do not need their own separate index copies.
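The combinations in this diagram map onto two independent choices in a slave node's configuration: where the index lives, and how writes are handed to the master. The sketch below assumes the Hibernate Search 4.x property names and backend identifiers (jms, jgroupsSlave). The enum and class names are hypothetical, and the JNDI names are placeholders for whatever your JMS provider exposes.

```java
import java.util.Properties;

// Illustrative helper, not part of Hibernate Search.
// Property names and values are assumed from Hibernate Search 4.x.
final class SlaveNodeConfig {

    enum IndexStorage { FILESYSTEM, INFINISPAN }
    enum Transport { JMS, JGROUPS }

    private static final String PREFIX = "hibernate.search.default.";

    static Properties build(IndexStorage storage, Transport transport) {
        Properties p = new Properties();

        // Choice 1: directory provider (index paths omitted for brevity)
        switch (storage) {
            case FILESYSTEM:
                p.setProperty(PREFIX + "directory_provider", "filesystem-slave");
                break;
            case INFINISPAN:
                p.setProperty(PREFIX + "directory_provider", "infinispan");
                break;
        }

        // Choice 2: worker backend that forwards writes to the master
        switch (transport) {
            case JMS:
                p.setProperty(PREFIX + "worker.backend", "jms");
                // Placeholder JNDI names; use the ones your JMS provider defines
                p.setProperty(PREFIX + "worker.jms.connection_factory", "/ConnectionFactory");
                p.setProperty(PREFIX + "worker.jms.queue", "queue/hibernatesearch");
                break;
            case JGROUPS:
                p.setProperty(PREFIX + "worker.backend", "jgroupsSlave");
                break;
        }
        return p;
    }

    public static void main(String[] args) {
        build(IndexStorage.INFINISPAN, Transport.JMS).list(System.out);
    }
}
```

Keeping the two choices separate in this way mirrors the article's point that the directory provider and the worker backend can be mixed and matched.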
In this article, we explored the options for running applications in multi-node server clusters, to spread out and handle user requests in a distributed fashion. We also learned how to use sharding to help make our Lucene indexes faster and more manageable.
About the Author:
Steve Perkins is a Java developer based in Atlanta, GA, USA. Steve has been working with Java in a web and systems integration context for 15 years, for clients ranging from commerce and finance to media and entertainment. He has been using Hibernate intensively for over seven years, and is interested in best practices for data modeling and application design.
Apart from coding, Steve also has a keen interest in the subject of software patents, which eventually led to a law degree and becoming a licensed attorney. Steve co-authored In the Aftermath of In re Bilski, published in 2009, and In the Aftermath of Bilski v. Kappos, published in 2010, for the Practicing Law Institute Handbook Series.
Steve lives in Atlanta with his wife, Amanda, their son, Andrew, and more musical instruments than he has free time to play. You can visit his website at steveperkins.net and follow him on Twitter at @stevedperkins.