Integrating WebSphere eXtreme Scale Data Grid with Relational Database: Part 1

by Anthony Chaves | November 2009 | Java

WebSphere eXtreme Scale provides a solution to scalability issues through caching and grid technology. A data grid is a means of combining computing resources into a viable middleware layer. By using the partitioning and replication functions, we can build a production-worthy application with persistent, reliable, and durable data storage, without ever touching a disk. In this article by Anthony Chaves, we will explore some of the uses of an in-memory data grid. We'll also look at integrating WebSphere eXtreme Scale with relational databases.

Integrating a data grid and a database is vital since:

  • reporting tools don't work with data grids right now
  • less frequently used data can be stored on disk
  • our application may need to work with legacy applications

As stated above, there are three compelling reasons to integrate with a database backend. First, reporting tools do not have good data grid integration. CrystalReports and other reporting tools simply don't work with data grids right now. Loading data from a data grid into a data warehouse with existing tools isn't possible either.

The second reason we want to use a database with a data grid is when we have an extremely large data set. A data grid stores data in memory. Though much cheaper than in the past, system memory is still much more expensive than a typical magnetic hard disk. When dealing with extremely large data sets, we want to structure our data so that the most frequently used data is in the cache and less frequently used data is on the disk.

The third compelling reason to use a database with a data grid is that our application may need to work with legacy applications that have been using relational databases for years. Our application may need to provide more data to them, or operate on data already in the legacy database in order to stay ahead of a processing load. In this article, we will explore some of the good and not-so-good uses of an in-memory data grid. We'll also look at integrating WebSphere eXtreme Scale with relational databases.

You're going where?

Somewhere along the way, we all learned that software consists of algorithms and data. CPUs load instructions from our compiled algorithms, and those instructions operate on bits representing our data. The closer our data lives to the CPU, the faster our algorithm can use it. On the x86 CPU, the registers are the closest we can store data to the instructions executed by the CPU.

CPU registers are also the smallest and most expensive data storage location. The amount of data storable in registers is fixed because the number and size of CPU registers is fixed. Typically, we don't interact with registers directly, even though their correct usage is important to application performance. We let the compiler handle translating our algorithms into machine code; it knows the registers better than we do, and will use them far more effectively than we would most of the time.

Less expensive, and about an order of magnitude slower, we have the Level 1 cache on a CPU (see below). The Level 1 cache holds significantly more data than the combined storage capacity of the CPU registers. Reading data from the Level 1 cache, and copying it to a register, is still very fast. The Level 1 cache on my laptop has two 32K instruction caches, and two 32K data caches.

[Figure: the CPU cache hierarchy]

Still less expensive, and another order of magnitude slower, is the Level 2 cache. The Level 2 cache is typically much larger than the Level 1 cache. My laptop has 4MB of Level 2 cache. We still won't fit the contents of the Library of Congress into that 4MB, but it isn't a bad amount of data to keep near the CPU.

Up another level, we come to the main system memory. Consumer-level PCs come with 4GB of RAM, and a low-end server won't have any less than 8GB. At this point, we can safely store a large chunk of data, if not all of the data, used by an application. Once the application exits, its data is unloaded from main memory, and all of that data is lost. In fact, once our data is evicted from any storage at or below this level, it is lost. Our data is ephemeral unless it is put onto some secondary storage. The unit of measurement for accessing data in a register, in the Level 1 or Level 2 cache, or in main memory is the nanosecond.

Getting to secondary storage, we jump up an SI-prefix to a microsecond. Accessing data in the secondary storage cache is on the order of microseconds. If the data is not in cache, the access time is on the order of milliseconds. Accessing data on a hard drive platter is one million times slower than accessing that same data in main memory, and one billion times slower than accessing that data in a register. However, secondary storage is very cheap and holds millions of times more than primary storage. Data stored in secondary storage is durable. It doesn't disappear when the computer is reset after a crash.

Our operations teams comfortably build secondary storage silos to store petabytes of data. We typically build our applications so that the application server interacts with some relational database management system that sits in front of that storage silo. The network hop to communicate with the RDBMS is on the order of microseconds on a fast network, and milliseconds otherwise.

Sharing data between applications has been done with the disk + network + database approach for a long time. It has become the traditional way to build applications: a load balancer in front, with application servers or batch processes constantly communicating with a database to store data for the next process that needs it.

As we see with computer architecture, we insert data where it fits. We squeeze it as close to the CPU as possible for better performance. If a data segment doesn't fit in one level, we keep squeezing what fits into each higher storage level. That leaves us with a lot of unused memory and disk space in an application deployment. Storing data in memory is preferable to storing it on a hard drive, but memory segmentation across a deployment has made it difficult to store useful amounts of data at a few milliseconds' distance. We just use a massive, but slow, database instead.

Where does an IMDG fit?

We've used ObjectGrid to store all of our data so far. This diagram should look pretty familiar by now:

[Figure: the application working directly with the ObjectGrid APIs]

Because we're only using the ObjectGrid APIs, our data is stored in-memory. It is not persisted to disk. If our ObjectGrid servers crash, then our data is in jeopardy (we haven't covered replication yet). One way to get our data into a persistent store is to mark up our classes with some ORM framework like JPA. We can use the JPA API to persist, update, and remove our objects from a database after we perform the same operations on them using the ObjectMap or Entity APIs. The onus is on the application developer to keep both cache and database in sync:

[Figure: the application writing to both ObjectGrid and the database, keeping them in sync itself]
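
To make that burden concrete, here is a rough sketch of the manual dual-write approach, the one the diagram above warns against. The grid and emf variables, the payment object, and the getId() accessor are assumed for illustration; they are not part of the sample application shown so far:

import com.ibm.websphere.objectgrid.ObjectGrid;
import com.ibm.websphere.objectgrid.ObjectMap;
import com.ibm.websphere.objectgrid.Session;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;

// Hypothetical helper: every code path that writes a Payment has to repeat
// this pair of transactions.
void storePayment(ObjectGrid grid, EntityManagerFactory emf, Payment payment)
        throws Exception {
    // First write: the grid, through the ObjectMap API
    Session ogSession = grid.getSession();
    ObjectMap paymentMap = ogSession.getMap("Payment");
    ogSession.begin();
    paymentMap.insert(payment.getId(), payment);
    ogSession.commit();

    // Second write: the database, through JPA
    EntityManager em = emf.createEntityManager();
    em.getTransaction().begin();
    em.persist(payment);
    em.getTransaction().commit();
    em.close();

    // If either write fails, the cache and the database silently drift apart --
    // exactly the bookkeeping a Loader takes off our hands.
}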

If we took this approach, all of that effort would be wasted. WebSphere eXtreme Scale already provides functionality to integrate with an ORM framework, or any data store, through Loaders. A Loader is a BackingMap plugin that tells ObjectGrid how to transform an object into the desired output form. Typically, we'll use a Loader with an ORM specification like JPA. WebSphere eXtreme Scale comes with a few different Loaders out of the box, but we can always write our own.

A Loader works in the background, transforming operations on objects into some output, whether it's file output or SQL queries. A Loader plugs into a BackingMap in an ObjectGrid server instance, or in a local ObjectGrid instance. A Loader does not plug into a client-side BackingMap, though we can override Loader settings on a client-side BackingMap.

While the Loader runs in the background, we interact with an ObjectGrid instance. We use the ObjectMap API for objects with zero or simple relationships, and the Entity API for objects with more complex relationships. The Loader handles all of the details in transforming an object into something that can integrate with external data stores:

[Figure: a Loader plugged into a BackingMap, transforming object operations for the external data store]

Why is storing our data in a database so important? Haven't we seen how much faster WebSphere eXtreme Scale is than an RDBMS? Shouldn't all of our data be stored in memory? An in-memory data grid is good for certain things, but there are plenty of things that a traditional RDBMS is good at that an IMDG just doesn't support.

An obvious issue is that memory is significantly more expensive than hard drives. 8GB of server grade memory costs thousands of dollars. 8GB of server grade disk space costs pennies. Even though the disk is slower than memory, we can store a lot more data on it.

An IMDG shines where a sizeable portion of frequently-changing data can be cached so that all clients see the same data. The IMDG provides orders of magnitude better latency and read and write speeds than any RDBMS. But we need to be aware that, for large data sets, the entire data set may not fit in a typical IMDG. If we focus on the frequently-changing data that must be available to all clients, then using the IMDG makes sense.

Imagine a deployment with 10 servers, each with 64GB of memory. Let's say that of the 64GB, we can use 50GB for ObjectGrid. For a 1TB data set, we can store 50% of it in cache. That's great! As the data set grows to 5TB, we can fit 10% in cache. That's not as good as 50%, but if it is the 10% of the data that is accessed most frequently, then we come out ahead. If that 10% of data has a lot of writes to it, then we come out ahead.
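
As a quick back-of-the-envelope check of those percentages (and of the 0.5% figure mentioned next), a few lines of Java reproduce the arithmetic for the 10-node, 50GB-per-node deployment just described; the figures are illustrative only:

public class CacheCoverage {
    public static void main(String[] args) {
        double gridCapacityGb = 10 * 50.0;              // 10 nodes, 50GB usable each = 500GB
        double[] dataSetsGb = { 1000, 5000, 100000 };   // 1TB, 5TB, 100TB

        for (double dataSetGb : dataSetsGb) {
            double coverage = 100.0 * gridCapacityGb / dataSetGb;
            System.out.printf("%.0fGB data set -> %.1f%% of it fits in the grid%n",
                    dataSetGb, coverage);
        }
    }
}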

WebSphere eXtreme Scale gives us predictable, dynamic, and linear scalability. When our data set grows to 100TB, and the IMDG holds only 0.5% of the total data set, we can add more nodes to the IMDG and increase the total percentage of cacheable data (see below). This predictable scalability is immensely valuable: it makes capacity planning easier, and it makes hardware procurement easier because you know what you need. Linear scalability provides a graceful way to grow a deployment as usage and data grow. You can rest easy knowing the limits of your application when it's using an IMDG. The IMDG also acts as a shock absorber in front of a database. We're going to explore some of the reasons why an IMDG makes a good shock absorber when we look at the Loader functionality.

[Figure: adding nodes to the grid to increase the share of the data set that can be cached]

There are plenty of other situations, some that we have already covered, where an IMDG is the correct tool for the job. There are also plenty of situations where an IMDG just doesn't fit.

A traditional RDBMS has thousands of man-years of research, implementation tuning, and bug fixing already put into it. An RDBMS is well-understood and is easy to use in application development. There are standard APIs for interacting with them in almost any language:

[Figure: applications on different platforms sharing an RDBMS through standard APIs]

In-memory data grids don't have the supporting tools built around them that RDBMSs have. We can't plug CrystalReports into an ObjectGrid instance to get daily reports out of the data in the grid. Querying the grid is useful when we run simple queries, but falls short when we need to run a query over the entire data set, or run a complex query. The query engine in WebSphere eXtreme Scale is not as sophisticated as the query engine in an RDBMS. This also means the data we get from ad hoc queries is limited, and running ad hoc queries in the first place is more difficult. Even building an ad hoc query runner that interacts with an IMDG is of limited usefulness.

An RDBMS is a wonderful cross-platform data store. WebSphere eXtreme Scale is written in Java and only deals with Java objects. The simplest way for an organization to share data between applications is through a plain old database. We have standard APIs for database access in nearly every programming language; as long as we use a supported database driver and API, we get the results we expect, including from ORM frameworks on other platforms like .NET and Rails. We could go on and on about why an RDBMS needs to be in place, but I think the point is clear: it's something we still need to make our software as useful as possible.


JPALoader and JPAEntityLoader

One of the most common ways for a Java application to interact with a database is with some JPA implementation. We'll use the Hibernate implementation for these examples, though any JPA implementation will work. The JPA spec and the ObjectGrid APIs have many parallels, which makes learning the integration concepts much easier. The ObjectGrid EntityManager API most closely matches the JPA spec in terms of interacting with persistent data. In fact, both classes used for interacting with persistent data are called EntityManager, though they live in different packages. Because of the similarities in class and annotation names, we'll configure JPA and ObjectGrid differently: we'll use the ObjectGrid annotations and the JPA XML configuration when the class and annotation names overlap.

We need to configure our application models to use JPA. This means we create our persistence.xml and orm.xml files as we normally would for JPA:

File: persistence.xml
<?xml version="1.0" encoding="UTF-8"?>
<persistence xmlns="http://java.sun.com/xml/ns/persistence"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/persistence
                 http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd"
             version="1.0">
  <persistence-unit name="PaymentProcessor"
                    transaction-type="RESOURCE_LOCAL">
    <provider>org.hibernate.ejb.HibernatePersistence</provider>
    <class>wxs.sample.models.Payment</class>
    <class>wxs.sample.models.Address</class>
    <class>wxs.sample.models.Batch</class>
    <class>wxs.sample.models.Card</class>
    <properties>
      <property name="hibernate.connection.url"
                value="jdbc:mysql://galvatron:3306/payment_processor" />
      <property name="hibernate.connection.driver_class"
                value="com.mysql.jdbc.Driver" />
      <property name="hibernate.connection.password"
                value="pp_password" />
      <property name="hibernate.connection.username"
                value="pp_user" />
      <!-- <property name="hibernate.hbm2ddl.auto"
                     value="create" /> -->
      <property name="hibernate.show_sql" value="false" />
    </properties>
  </persistence-unit>
</persistence>

We specify our four classes to use as JPA entities and set the Hibernate connection properties. In this example, I'm using a MySQL database named payment_processor. I'm also asking Hibernate to generate my schema based on the XML configuration I provide for the JPA entity mappings (the hibernate.hbm2ddl.auto property, shown commented out above, turns this on when needed):

File: orm.xml
<?xml version="1.0" encoding="UTF-8"?>
<entity-mappings xmlns="http://java.sun.com/xml/ns/persistence/orm"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xsi:schemaLocation="http://java.sun.com/xml/ns/persistence/orm
                     orm_1_0.xsd" version="1.0">
  <entity class="wxs.sample.models.Payment" access="FIELD">
    <attributes>
      <id name="id"/>
      <basic name="paymentType">
        <enumerated>STRING</enumerated>
      </basic>
      <many-to-one name="batch"
                   target-entity="wxs.sample.models.Batch"
                   fetch="LAZY"/>
      <many-to-one name="card"
                   target-entity="wxs.sample.models.Card"
                   fetch="LAZY"/>
    </attributes>
  </entity>
  <entity class="wxs.sample.models.Batch" access="FIELD">
    <attributes>
      <id name="id"/>
      <basic name="status">
        <enumerated>STRING</enumerated>
      </basic>
    </attributes>
  </entity>
  <entity class="wxs.sample.models.Card" access="FIELD">
    <attributes>
      <id name="id"/>
      <basic name="cardType">
        <enumerated>STRING</enumerated>
      </basic>
      <many-to-one name="address"
                   target-entity="wxs.sample.models.Address"
                   fetch="LAZY">
        <!-- <cascade>
               <cascade-persist/>
               <cascade-merge/>
             </cascade> -->
      </many-to-one>
    </attributes>
  </entity>
  <entity class="wxs.sample.models.Address" access="FIELD">
    <attributes>
      <id name="id"/>
    </attributes>
  </entity>
</entity-mappings>

Again, there is nothing out of the ordinary here, just a normal orm.xml file. Our models now pull double duty, working for both ObjectGrid and JPA. Because of this, we need to annotate them with a bit of JPA information. First, we'll look at Payment:

File: Payment.java
// The unqualified @Entity, @Id, and @Index are the ObjectGrid annotations;
// the fully qualified javax.persistence.Entity and javax.persistence.Id mark
// the same class and key for JPA.
@Entity @javax.persistence.Entity
public class Payment implements Serializable {
    @Id @Index @javax.persistence.Id
    int id;
    @ManyToOne
    Batch batch;
    @ManyToOne
    Card card;
    PaymentType paymentType;
    BigDecimal amount;
}

We need to include the @javax.persistence.Entity and @javax.persistence.Id annotations in the class file. If we don't, an ObjectGridRuntimeException is thrown when the class is examined by ObjectGrid:

com.ibm.websphere.objectgrid.ObjectGridRuntimeException: The class 
class wxs.sample.models.Payment in the persistence unit
PaymentProcessor does not contain JPA key metadata.

We need to include these two annotations in each of our model classes, and leave the rest of the JPA configuration to the orm.xml file.
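
For example, a Batch class following the same pattern might look like the sketch below. It is based on the fields mapped for Batch in orm.xml; the BatchStatus enum and the @Index placement simply mirror Payment and are illustrative rather than taken from the sample code:

// Sketch only: id and status are the fields mapped for Batch in orm.xml.
// As in Payment.java, the unqualified annotations are ObjectGrid's and the
// fully qualified ones are JPA's.
@Entity @javax.persistence.Entity
public class Batch implements Serializable {
    @Id @Index @javax.persistence.Id
    int id;
    BatchStatus status;   // mapped as an enumerated STRING in orm.xml
}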

Now, we need to tell ObjectGrid about our JPA configuration. Of course, this happens in the objectgrid.xml file:

File: objectgrid.xml
<?xml version="1.0" encoding="UTF-8"?>
<objectGridConfig xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="http://ibm.com/ws/objectgrid/config
                      ../objectGrid.xsd"
                  xmlns="http://ibm.com/ws/objectgrid/config">
  <objectGrids>
    <objectGrid name="PaymentProcessorGrid"
                entityMetadataXMLFile="ppEntities.xml">
      <bean id="TransactionCallback"
            className="com.ibm.websphere.objectgrid.jpa.JPATxCallback">
        <property name="persistenceUnitName"
                  type="java.lang.String"
                  value="PaymentProcessor" />
      </bean>
      <backingMap name="Payment" pluginCollectionRef="Payment" />
      <backingMap name="Batch" pluginCollectionRef="Batch" />
      <backingMap name="Card" pluginCollectionRef="Card" />
      <backingMap name="Address" pluginCollectionRef="Address" />
      <backingMap name="idGeneratorMap" />
    </objectGrid>
  </objectGrids>
  <backingMapPluginCollections>
    <backingMapPluginCollection id="Payment">
      <bean id="Loader"
            className="com.ibm.websphere.objectgrid.jpa.JPAEntityLoader" />
    </backingMapPluginCollection>
    <backingMapPluginCollection id="Batch">
      <bean id="Loader"
            className="com.ibm.websphere.objectgrid.jpa.JPAEntityLoader" />
    </backingMapPluginCollection>
    <backingMapPluginCollection id="Card">
      <bean id="Loader"
            className="com.ibm.websphere.objectgrid.jpa.JPAEntityLoader" />
    </backingMapPluginCollection>
    <backingMapPluginCollection id="Address">
      <bean id="Loader"
            className="com.ibm.websphere.objectgrid.jpa.JPAEntityLoader" />
    </backingMapPluginCollection>
  </backingMapPluginCollections>
</objectGridConfig>

The first thing that should stick out is the slightly different format we're using for this file. By specifying backingMapPluginCollection, we can define BackingMapPlugins, and associate them with our BackingMaps. This keeps the BackingMap definition simpler by limiting it to the BackingMap name and BackingMap attributes. The plugins can safely go in their own section. They are associated with the BackingMap by name and ID. The pluginCollectionRef attribute names the ID of the backingMapPluginCollection to be used with a BackingMap.

The second thing that should stick out about this file is the TransactionCallback bean configuration:

<bean id="TransactionCallback"
      className="com.ibm.websphere.objectgrid.jpa.JPATxCallback">
  <property name="persistenceUnitName"
            type="java.lang.String"
            value="PaymentProcessor" />
</bean>

This transaction callback coordinates JPA transactions, which are separate from ObjectGrid transactions. The TransactionCallback bean is required to use the JPALoader or JPAEntityLoader. We set one property on it, the name of the persistence unit we specified in persistence.xml. We do not specify the location of that file because it is assumed to be on the classpath.

The backingMapPluginCollection specifies a JPAEntityLoader for each BackingMap we've defined:

<backingMapPluginCollection id="Payment">
  <bean id="Loader"
        className="com.ibm.websphere.objectgrid.jpa.JPAEntityLoader"/>
</backingMapPluginCollection>

The Loader's job

What does the JPAEntityLoader do? It plugs into the BackingMap and interacts with the JPA implementation on our behalf. An action on the contents of a BackingMap generates an equivalent action on the database used by the Loader. Whenever the BackingMap is changed through persist, merge, or remove, that action applies to the corresponding data in the database as well. Inserting an object with entityManager.persist(address) notifies the Loader, which executes the SQL equivalent of insert into address (id, streetLine1, streetLine2, city, state, zipCode). Removing an object from a BackingMap by calling entityManager.remove(address) triggers the Loader, which executes the SQL equivalent of delete from address where id = ? with the corresponding address ID.
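
From the application's point of view, that might look like the following sketch. The grid and address variables and the addressId key are assumed for illustration; the SQL in the comments is issued by the Loader, not by our code:

// Sketch only: we call the ObjectGrid EntityManager, and the JPAEntityLoader
// translates each change into SQL behind the scenes.
Session session = grid.getSession();
com.ibm.websphere.objectgrid.em.EntityManager em = session.getEntityManager();

em.getTransaction().begin();
em.persist(address);            // Loader: insert into address (...) values (...)
em.getTransaction().commit();

em.getTransaction().begin();
Address found = (Address) em.find(Address.class, addressId);
em.remove(found);               // Loader: delete from address where id = ?
em.getTransaction().commit();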

We defined relationships on our ObjectGrid Entities, and we define those same relationships in the JPA configuration. This allows us to maintain referential integrity in the database through the JPAEntityLoader. Because this is a new application, we can ask the JPA implementation to generate a schema based on our JPA entity definitions. Using the definitions in orm.xml, we should get a schema with these tables:

[Figure: the tables generated from the orm.xml entity definitions]

The table definitions should look like this:

[Figure: the generated table definitions]

Let's run our example in a local ObjectGrid instance first. It's easy to set up the BackingMaps to use a JPAEntityLoader programmatically. We'll set up the local instance in the useLocalObjectGrid method:

// Configure the JPA transaction callback with our persistence unit name
JPATxCallback jpaTxCallback = new JPATxCallback();
jpaTxCallback.setPersistenceUnitName("PaymentProcessor");
grid.setTransactionCallback(jpaTxCallback);

// Plug a JPAEntityLoader into the Batch BackingMap
BackingMap batchMap = grid.getMap("Batch");
batchMap.setLoader(new JPAEntityLoader());

ObjectGrid needs a JPATxCallback configured in order to use a JPA Loader. We set the persistence unit name to the one we specified in the persistence.xml file. The local ObjectGrid instance works with the same JPA files on the classpath as a distributed ObjectGrid instance. Next, we get a reference to the BackingMap used to store Batch objects, and set its Loader to a JPAEntityLoader instance. No further configuration is required right now. Batch objects become durable in the database after the ObjectGrid transaction and JPA transaction are complete. Let's do the same for the rest of our models:

BackingMap paymentMap = grid.getMap("Payment");
paymentMap.setLoader(new JPAEntityLoader());
BackingMap addressMap = grid.getMap("Address");
addressMap.setLoader(new JPAEntityLoader());
BackingMap cardMap = grid.getMap("Card");
cardMap.setLoader(new JPAEntityLoader());

We're ready to run it and see what happens! The programming model does not change even though we're using a JPAEntityLoader now. We interact with the ObjectGrid instance. Running the PaymentProcessor program now yields the same results as before, though it may run a bit slower.

Using these examples with a JPA Loader requires that all of the classes the JPA implementation normally uses be on the classpath. My classpath looks like this:

c:\wxs\workspace\PaymentProcessor\bin;
c:\jboss-4.2.2.GA\server\default\lib\*;
c:\jboss-4.2.2.GA\lib\*;
c:\wxs\ObjectGrid\lib\*

This picks up all of the classes used by the Hibernate JPA implementation, the ObjectGrid libraries, and the classes in our example. Of course, creating a JAR file for our classes is a better idea for production use, but this is fine for development and local ObjectGrid instances.

Performance and referential integrity

If you run the PaymentProcessor as it stands now, you should notice that performance is way down and your hard drives are spinning a lot. Because we stopped configuring the BackingMap after adding the JPAEntityLoader, we get the default write-through behavior. Let's talk about read-through, write-through, and write-behind behavior so that we understand why our application suddenly slowed to a crawl.

These three behaviors dictate the performance of our application when it uses an inline cache. They specify when a Loader goes to the database to fetch an object or write its SQL equivalent. Our goal is to reduce the number of times the Loader goes to the database for any reason. Accessing an object in the database is orders of magnitude slower than accessing it in the cache. We want to maximize the cache hit rate while minimizing the database write rate. The read-through, write-through, and write-behind behaviors also influence the entity relationships we define on our model classes.

A Loader used with an inline cache offers read-through behavior out of the box. Read-through means a Loader goes to the database and performs a select operation when the object is not found in the inline cache.

[Figure: a read-through request for a payment that is not yet in the cache]

Our PaymentProcessor application makes a request for a payment with ID 5 (seen above). This payment does not exist in the cache, so the BackingMap requests the Loader to find that payment in the database. If found, that payment is returned to the application, and it lives in the cache until removed or evicted. All of this happens transparently during our application call to find the payment. To our application, we read through the cache to the database. Our application does not call any JPA API to achieve this.
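
A sketch of that read, again with the grid variable assumed: the application only calls the ObjectGrid EntityManager, and the select happens inside find() only when the payment is not already cached:

// Sketch only: no JPA API appears in application code.
Session session = grid.getSession();
com.ibm.websphere.objectgrid.em.EntityManager em = session.getEntityManager();

em.getTransaction().begin();
// On a cache miss, the JPAEntityLoader reads through to the database;
// on a hit, no SQL is issued at all.
Payment payment = (Payment) em.find(Payment.class, 5);
em.getTransaction().commit();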

Continue Reading: Integrating WebSphere eXtreme Scale Data Grid with Relational Database: Part 2

 


About the Author


Anthony Chaves

Anthony writes software for customers of all sizes. He likes building scalable, robust software. Customers have thrown all kinds of different development environments at him: Java, C, Rails, mobile device platforms – but no .NET (yet).

Anthony particularly likes user/device authentication problems and applied scalability practices. Cloud-computing buzzword bingo doesn't fly with him. He started the Boston Scalability User Group in 2007.
