Getting Started with Hazelcast, Second Edition

Getting Started with Hazelcast, Second Edition: Get acquainted with the highly scalable data grid, Hazelcast, and learn how to bring its powerful in-memory features into your application

By Matthew Johns
Book Jul 2015 162 pages 1st Edition


Product Details


Publication date : Jul 30, 2015
Length 162 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781785285332

Getting Started with Hazelcast, Second Edition

Chapter 1. What is Hazelcast?

Most, if not all, applications need to store some data—some applications far more than others. As you hold this book in your hands and eagerly flip through its pages, it is probably safe to assume that you have previously architected, developed, or supported applications towards the latter end of that scale. We can also take it as given that you are all too painfully familiar with the common pitfalls and issues that tend to crop up when scaling or distributing a data layer. However, to make sure that we are all up to speed, in this chapter, we shall examine the following topics:

  • Traditional approaches towards data persistence

  • How caches help improve performance but come with their own problems

  • Hazelcast's fresh approach towards the problem

  • A brief overview of the generic capabilities of Hazelcast

  • The type of problems that we can solve using Hazelcast

Starting out as usual


In most modern software systems, data is key. In traditional architectures, the role of persisting and providing access to your system's data tends to fall to a relational database. Typically, this is a monolithic beast, perhaps with a degree of replication, although this tends to be more for resilience than for performance or load distribution.

For example, here is what a traditional architecture might look like (which hopefully looks rather familiar):

This presents us with an issue in terms of application scalability: it is relatively easy to scale our application layer by throwing more hardware at it to increase processing capacity. However, the monolithic constraints of the data layer will only allow us to go so far before diminishing returns or resource saturation stunts further performance increases. So, what can we do to address this?

In the past, and in legacy architectures, the only solution to this issue was to increase the performance capability of the database infrastructure, either by buying a bigger, faster server or by further tweaking and fettling the utilization of the available resources. Both options are dramatic, in terms of financial cost, manpower, or both. So, what else could we do?

Data deciding to hang around


In order to improve the performance of the existing setup, we can hold copies of our data away from the primary database and use them in preference wherever possible. There are a number of different strategies that we can adopt, from transparent second-level caching layers to external key-value object storage. The details and exact use of each varies significantly, depending on the technology or its place in the architecture, but the main aim of these systems is to sit alongside the primary database infrastructure and attempt to protect it from excessive load. This would then likely lead to improved performance of the primary database by reducing the overall dependency on it.
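The "use the copy in preference wherever possible" idea above can be sketched in a few lines of plain Java. The class and names here are illustrative only, not taken from any particular caching product; the database is stood in for by a simple function:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal cache-aside sketch: consult the local copy first, and only
// fall back to the (slow, shared) primary database on a miss.
class CacheAside {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> database; // stands in for the real DB call
    private int databaseHits = 0;

    CacheAside(Function<String, String> database) {
        this.database = database;
    }

    String get(String key) {
        return cache.computeIfAbsent(key, k -> {
            databaseHits++; // only misses reach the primary database
            return database.apply(k);
        });
    }

    int databaseHits() {
        return databaseHits;
    }
}
```

Repeated reads of the same key then cost the primary database only one round trip, which is exactly the load protection described above.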

However, this strategy tends to be valuable only as a short-term solution, effectively buying us a little more time before the database once again starts to reach its saturation point. The other downside is that it only protects the database from read-based load; if an application is predominantly write-heavy, this strategy has very little to offer.

So, the expanded architecture looks like the following figure:

Therein lies the problem


By insulating the database from the read load, a problem is introduced in the form of a cache consistency issue. How does the local data cache deal with data changing underneath it within the primary database? The answer is rather disappointing—it can't! The manifestation of issues will largely depend on the data needs of the application and how frequently the data changes, but typically, caching systems will operate in one of the following two modes to combat the problem:

  • Time-bound cache: This holds entries for a defined period (time-to-live, popularly abbreviated as TTL)

  • Write-through cache: This holds entries until they are invalidated by subsequent updates

Time-bound caches almost always have consistency issues, but at least the amount of time that the issue would be present is limited to the expiry time of each entry. However, we must consider the application's access to this data, because if the frequency of accessing a particular entry is less than its cache expiry time, the cache provides no real benefit.
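A time-bound cache reduces to remembering when each entry was written. The sketch below is a conceptual model rather than any product's implementation; the clock is passed in explicitly so the behaviour is deterministic:

```java
import java.util.HashMap;
import java.util.Map;

// Time-bound (TTL) cache sketch: each entry remembers when it was written,
// and reads older than the time-to-live are treated as misses.
class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long writtenAtMillis;
        Entry(V value, long writtenAtMillis) {
            this.value = value;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final long ttlMillis;

    TtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    void put(K key, V value, long nowMillis) {
        entries.put(key, new Entry<>(value, nowMillis));
    }

    // Returns null both for "never stored" and for "stored but expired",
    // so a stale entry simply looks like a miss to the caller.
    V get(K key, long nowMillis) {
        Entry<V> e = entries.get(key);
        if (e == null || nowMillis - e.writtenAtMillis >= ttlMillis) {
            entries.remove(key);
            return null;
        }
        return e.value;
    }
}
```

The consistency window mentioned above falls directly out of the code: a stale value can be served for at most `ttlMillis` after the underlying data changes.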

Write-through caches are consistent in isolation and can be configured to offer strict consistency, but if multiple write-through caches exist within the overall architecture, then there will be consistency issues between them. We can avoid this by having an intelligent cache that features a communication mechanism between the nodes that can propagate entry invalidations to the other nodes.

In practice, an ideal cache will feature a combination of both features, so that the entries will be held for a known maximum time, but will also pass around invalidations as changes are made.

So, the evolved architecture will look like the following figure:

So far, we have had a look at the general issues in scaling a data layer and introduced strategies to help combat the trade-offs that we will encounter along the way. However, the real world isn't this simple. There are various cache servers and in-memory database products in this area (such as memcached or Ehcache). However, most of these are stand-alone single instances, perhaps with some degree of distribution bolted on or provided by other supporting technologies. This tends to bring about the same issues that we experienced with the primary database: if the product is a single instance, we may encounter resource saturation or capacity issues, and if the distribution doesn't provide consistency control, we may encounter inconsistent data, which might harm our application.

Breaking the mould


Hazelcast is a radical new approach to data that was designed from the ground up around distribution. It embraces a new, scalable way of thinking in that data should be shared for resilience and performance, while allowing us to configure the trade-offs surrounding consistency as the data requirements dictate.

The first major feature of Hazelcast is its masterless nature. Each node is configured to be functionally the same and operates in a peer-to-peer manner. The oldest node in the cluster is the de facto leader; this node manages the membership by automatically making decisions as to which node is responsible for which data. In this way, as nodes join or drop out, the process is repeated and the cluster rebalances accordingly. This makes it incredibly simple to get Hazelcast up and running, as the system is self-discovering, self-clustering, and works straight out of the box.
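By way of illustration, a minimal hazelcast.xml of the kind used by the 3.x releases this book covers nominates no master anywhere; nodes simply discover one another, here over multicast (the group name "dev" is illustrative, and element names should be checked against the exact version in use):

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <group>
        <name>dev</name>
    </group>
    <network>
        <join>
            <!-- Nodes find each other automatically; no master is configured -->
            <multicast enabled="true"/>
            <tcp-ip enabled="false"/>
        </join>
    </network>
</hazelcast>
```

Starting a second node with the same configuration is all it takes for the two to form a cluster.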

The second feature of Hazelcast to remember is that it persists data entirely in-memory. This makes it incredibly fast, but the speed comes at a price: when a node is shut down, all the data that it held is lost. We combat this risk to resilience through replication, holding a number of copies of each piece of data across multiple nodes, so that in the event of failure the overall cluster will not suffer any data loss. By default, the standard backup count is 1, so we can immediately enjoy basic resilience. However, don't pull the plug on more than one node at a time until the cluster has reacted to the change in membership and reestablished the appropriate number of backup copies of data.
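The backup behaviour is configurable per collection. A sketch of the relevant hazelcast.xml fragment for the 3.x line (the map name "default" applies to any map without a more specific configuration):

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
    <map name="default">
        <!-- One synchronous backup of each entry, held on a different node -->
        <backup-count>1</backup-count>
    </map>
</hazelcast>
```

Raising the backup count trades memory (and write latency) for the ability to survive more simultaneous node failures.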

So, when we introduce our new peer-to-peer distributed cluster, we get something that looks like the following figure:

Note

A distributed cache is by far the most powerful of these options, as it can scale up in response to changes in the application's needs.

We previously identified that multi-node caches tend to suffer from either saturation or consistency issues. In the case of Hazelcast, each node owns a number of the partitions that make up the overall data, so the load is fairly spread across the cluster. Therefore, any saturation will be at the cluster level rather than in any individual node, and we can address it simply by adding more nodes. In terms of consistency, the backup copies of the data are internal to Hazelcast by default and are not directly used, so we enjoy strict consistency. This does mean that we have to interact with a specific node to retrieve or update a particular piece of data; however, exactly which node that is, is an internal operational detail that can vary over time. We, as developers, never actually need to know.

It is obvious that Hazelcast is not trying to entirely replace the role of a primary database. Its focus and feature set differ from those of the primary database (which has more transactionally stateful capabilities, long-term persistent storage, and so on). However, the more data and processes we master within Hazelcast, the less dependent we become on this constrained resource, and thus we remove the potential need to change the underlying database systems.

If you imagine the scenario where the data is split into a number of partitions, and each partition slice is owned by a node and backed up on another, the interactions will look like the following figure:

This means that for data belonging to Partition 1, our application will have to communicate with Node 1; for data belonging to Partition 2, with Node 2; and so on. The slicing of the data into each partition is dynamic, and in practice there are typically more partitions than nodes, hence each node will own a number of different partitions and hold backups for a number of others. As mentioned before, this is an internal operational detail and our application does not need to know it; however, it is important that we understand what is going on behind the scenes.
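The ownership scheme can be modelled conceptually as hash-to-partition, then partition-to-node. The sketch below mirrors the idea rather than Hazelcast's internal implementation (271 is Hazelcast's documented default partition count; the round-robin assignment is a simplification):

```java
import java.util.List;

// Conceptual model of partition ownership: a key hashes to one of a fixed
// number of partitions, and each partition is assigned to a node. Having
// more partitions than nodes means each node owns several of them.
class PartitionModel {
    static final int PARTITION_COUNT = 271; // Hazelcast's default

    static int partitionFor(Object key) {
        // Math.floorMod keeps the result non-negative even for negative hashes
        return Math.floorMod(key.hashCode(), PARTITION_COUNT);
    }

    static String ownerOf(Object key, List<String> nodes) {
        // Simplified round-robin assignment of partitions to nodes
        return nodes.get(partitionFor(key) % nodes.size());
    }
}
```

The important property is visible directly: the same key always maps to the same partition, so there is always exactly one owning node to talk to.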

Moving to new ground


So far, we have talked mostly about simple persisted data and caches, but in reality, we should not think of Hazelcast as purely a cache. It is much more powerful than just that. It is an in-memory data grid that supports a number of distributed collections, processors, and features. We can load the data from various sources into differing structures, send messages across the cluster, perform analytical processing on the stored data, take out locks to guard against concurrent activity, and listen to the goings-on inside the workings of the cluster. Most of these implementations correspond to a standard Java collection or function in a manner that is comparable to other similar technologies. However, in Hazelcast, the distribution and resilience capabilities are already built in.

  • Standard utility collections:

    • Map: Key-value pairs

    • List: A collection of objects

    • Set: Non-duplicated collection

    • Queue: Offer/poll FIFO collection

  • Specialized collection:

    • Multi-Map: Key–collection pairs

  • Lock: Cluster wide mutex

  • Topic: Publish and subscribe messaging

  • Concurrency utilities:

    • AtomicNumber: Cluster-wide atomic counter

    • IdGenerator: Cluster-wide unique identifier generation

    • Semaphore: Concurrency limitation

    • CountdownLatch: Concurrent activity gatekeeping

  • Listeners: Notifications to the application as things happen

Playing around with our data

In addition to data storage collections, Hazelcast also features a distributed executor service that allows runnable tasks to be created and run anywhere on the cluster to obtain, manipulate, and store results. We can have a number of collections containing source data, spin up tasks to process that disparate data (for example, averaging or aggregating), and output the results into another collection for consumption.
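This fan-out/fan-in pattern can be sketched with a plain java.util.concurrent.ExecutorService, with local threads standing in for the cluster members that Hazelcast would actually ship the tasks to; the class name and slicing here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Fan-out/fan-in sketch: each "node" sums its own slice of the data,
// and the caller combines the partial results into an overall average.
class DistributedAverage {
    static double average(List<int[]> slices) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, slices.size()));
        try {
            List<Future<long[]>> partials = new ArrayList<>();
            for (int[] slice : slices) {
                partials.add(pool.submit(() -> {
                    long sum = 0;
                    for (int v : slice) sum += v;
                    return new long[] { sum, slice.length }; // partial sum and count
                }));
            }
            long sum = 0, count = 0;
            for (Future<long[]> f : partials) {
                long[] partial = f.get();
                sum += partial[0];
                count += partial[1];
            }
            return (double) sum / count;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Only the small partial results travel back to the caller, which is the whole point of sending the work to where the data lives.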

However, more recently, along with this general-purpose capability, Hazelcast has introduced a few extra ways that allow us to directly interact with data. The MapReduce functionality allows us to build data-centric tasks to search, filter, and process held data to find potential insights within it. You may have heard of this functionality before; this extraction of value from raw data is at the heart of what big data is all about (forgive the excessive buzzword cliché). While MapReduce focuses more on generating additional information, the EntryProcessor interface enables us to quickly and safely manipulate data in-place throughout the cluster, on single entries, whole collections, or even selectively based on search criteria.
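The shape of an entry-processor-style update can be modelled against an ordinary map. This is a conceptual sketch only; Hazelcast's real EntryProcessor runs on the member that owns each entry, whereas here everything is local:

```java
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Entry-processor-style sketch: apply an update to entries where they live,
// rather than reading them out, changing them, and writing them back.
class InPlaceUpdate {
    // Applies 'update' to every value whose key matches 'criteria',
    // returning how many entries were changed.
    static <K, V> int processEntries(Map<K, V> map,
                                     Predicate<K> criteria,
                                     UnaryOperator<V> update) {
        int changed = 0;
        for (Map.Entry<K, V> entry : map.entrySet()) {
            if (criteria.test(entry.getKey())) {
                entry.setValue(update.apply(entry.getValue()));
                changed++;
            }
        }
        return changed;
    }
}
```

In the distributed case, the criteria and update function are what travel across the network, not the entries themselves.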

Again, just as we can scale up the data capacities by adding more nodes, we can also increase the processing capacity in exactly the same way. This essentially means that by building a data layer around Hazelcast, if our application's needs rapidly increase, we can continuously increase the number of nodes to satisfy the seemingly extensive demands, all without having to redesign or rearchitect the actual application.

Summary


With Hazelcast, we are dealing more with a technology than a server product; it is a distributed core library to build a system around, rather than one that tries to retrospectively bolt on distribution, or blindly connects to an off-the-shelf commercial system. While it is possible (and in some simple cases, quite practical) to run Hazelcast as a separate server-like cluster and connect to it remotely from our application, some of the greatest benefits come when we develop our own classes and tasks to run within and alongside it.

With such a large range of generic capabilities, there is an entire world of problems that Hazelcast can help us solve. We can use the technology in many ways. We can use it in isolation to hold data such as user sessions, run it alongside a more long-term persistent data store to increase capacity, or shift towards performing high performance and scalable operations on our data. By moving more and more responsibility from monolithic systems to such a generic, scalable one, there is no limit to the performance that we can unlock.

This technology not only helps us to keep the application and data layers separate, but also enables us to scale them up independently as our application grows. This will prevent our application from becoming a victim of its own success while taking the world by storm.

In the next chapter, we shall start using the technology and investigate the data collections that we discovered in this chapter.



What you will learn

  • Learn and store numerous data types in different distributed collections

  • Set up a cluster from the ground up

  • Work with truly distributed queues and topics for clusterwide messaging

  • Make your application more resilient by listening into cluster internals

  • Run tasks within and alongside our stored data

  • Filter and search our data using MapReduce jobs

  • Discover the new JCache standard and one of its first implementations


Table of Contents

19 Chapters
Getting Started with Hazelcast Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. What is Hazelcast?
2. Getting off the Ground
3. Going Concurrent
4. Divide and Conquer
5. Listening Out
6. Spreading the Load
7. Gathering Results
8. Typical Deployments
9. From the Outside Looking In
10. Going Global
11. Playing Well with Others
Configuration Summary
Index

