Data Modeling and Scalability in Google App

Amy Unruh

November 2010

Google App Engine Java and GWT Application Development

Build powerful, scalable, and interactive web applications in the cloud

  • Comprehensive coverage of building scalable, modular, and maintainable applications with GWT and GAE using Java
  • Leverage the Google App Engine services and enhance your app functionality and performance
  • Integrate your application with Google Accounts, Facebook, and Twitter
  • Safely deploy, monitor, and maintain your GAE applications
  • A practical guide with a step-by-step approach that helps you build an application in stages

In deciding how to design your application's data models, there are a number of ways in which your approach can increase the app's scalability and responsiveness. Here, we discuss several such approaches and how they are applied in the Connectr app. In particular, we describe how the Datastore access latency can sometimes be reduced; ways to split data models across entities to increase the efficiency of data object access and use; and how property lists can be used to support "join-like" behavior with Datastore entities.

Reducing latency—read consistency and Datastore access deadlines

By default, when an entity is updated in the Datastore, all subsequent reads of that entity will see the update; this is called strong consistency. To achieve it, each entity has a primary storage location, and with a strongly consistent read, the read waits for a machine at that location to become available.

However, App Engine allows you to change this default and use eventual consistency for a given Datastore read. With eventual consistency, the query may access a copy of the data from a secondary location if the primary location is temporarily unavailable. Changes to data will propagate to the secondary locations fairly quickly, but it is possible that an "eventually consistent" read may access a secondary location before the changes have been incorporated. In exchange, eventually consistent reads are faster on average, so they trade consistency for availability. In many contexts, for example, with web apps such as Connectr that display "activity stream" information, this is an acceptable tradeoff: completely up-to-date information is not required.

See read-consistency-deadlines-more-control.html, better-datastore.html, and events/io/2009/sessions/TransactionsAcrossDatacenters.html for more background on this and related topics.

In Connectr, we will add the use of eventual consistency to some of our feed object reads; specifically, those for feed content updates. We are willing to take the small chance that a feed object is slightly out-of-date in order to have the advantage of quicker reads on these objects.

The following code shows how to set eventual read consistency for a query, using server.servlets.FeedUpdateFriendServlet as an example.

Query q = pm.newQuery("select from " + FeedInfo.class.getName() +
  " where urlstring == :keys");
// Use eventual read consistency for this query
q.addExtension("datanucleus.appengine.datastoreReadConsistency", "EVENTUAL");

App Engine also allows you to change the default Datastore access deadline. By default, the Datastore will retry access automatically for up to about 30 seconds. You can set this deadline to a smaller amount of time. It can often be appropriate to set a shorter deadline if you are concerned with response latency, and are willing to use a cached version of the data for which you got the timeout, or are willing to do without it.

The following code shows how to set an access timeout interval (in milliseconds) for a given JDO query.

Query q = pm.newQuery("...");
// Set a Datastore access timeout
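
The "shorter deadline, fall back to a cached copy" idea described above can be sketched in plain Java, independent of the Datastore API. This is a minimal illustration under stated assumptions, not App Engine or Connectr code: the class DeadlineFallbackSketch, its in-memory cache, and the readWithDeadline method are all hypothetical names introduced here, and the slow read is simulated by a Callable.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not part of App Engine): bounds a slow read with a
// deadline and falls back to the last value that was read successfully.
class DeadlineFallbackSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<String, String>();
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Returns the fresh value if the read finishes within deadlineMillis;
    // otherwise returns the cached value for this key, or null if none exists.
    String readWithDeadline(String key, Callable<String> slowRead, long deadlineMillis) {
        Future<String> pending = pool.submit(slowRead);
        try {
            String fresh = pending.get(deadlineMillis, TimeUnit.MILLISECONDS);
            if (fresh != null) {
                cache.put(key, fresh);
            }
            return fresh;
        } catch (TimeoutException e) {
            pending.cancel(true);           // stop waiting on the slow read
            return cache.get(key);          // possibly stale, possibly null
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return cache.get(key);
        } catch (Exception e) {
            return cache.get(key);          // the read itself failed
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        DeadlineFallbackSketch sketch = new DeadlineFallbackSketch();
        // A fast read populates the cache.
        System.out.println(sketch.readWithDeadline("feed", () -> "v1", 1000));
        // A read that overruns its 100 ms deadline falls back to the cached value.
        System.out.println(sketch.readWithDeadline("feed", () -> {
            Thread.sleep(2000);
            return "v2";
        }, 100));
        sketch.shutdown();
    }
}
```

Whether a stale or missing value is acceptable depends on the data; for activity-stream style content it often is.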

Splitting big data models into multiple entities to make access more efficient

Often, the fields in a data model can be divided into two groups: main or summary information that you need often or first, and details, meaning data that you might not need at all or tend not to need immediately. If this is the case, then it can be productive to split the data model into multiple entities and set the details entity to be a child of the summary entity, for instance, by using JDO owned relationships. The child field will be fetched lazily, so the child entity won't be pulled in from the Datastore unless needed.

In our app, the Friend model can be viewed like this: initially, only a certain amount of summary information about each Friend is sent over RPC to the app's frontend (the Friend's name). Only if there is a request to view details of or edit a particular Friend, is more information needed.

So, we can make retrieval more efficient by defining a parent summary entity and a child details entity. We do this by keeping the "summary" information in Friend, and placing "details" in a FriendDetails object, which is set as a child of Friend via a JDO bidirectional, one-to-one owned relationship, as shown in Figure 1. We store the Friend's e-mail address and its list of associated URLs in FriendDetails. We'll keep the name information in Friend. That way, when we construct the initial 'FriendSummaries' list displayed on application load, and send it over RPC, we only need to access the summary object.

Figure 1: Splitting Friend data between a "main" Friend persistent class and a FriendDetails child class.

A details field of Friend points to the FriendDetails child, which we create when we create a Friend. In this way, the details will always be transparently available when we need them, but they will be lazily fetched—the details child object won't be initially retrieved from the database when we query Friend, and won't be fetched unless we need that information.
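
The effect of lazy fetching can be illustrated with a small plain-Java sketch. JDO handles this transparently for owned relationships; the code below only imitates the behavior so that it can run without the Datastore. The names LazyDetailsSketch, Summary, and Details are hypothetical stand-ins for Friend and FriendDetails, and the Supplier stands in for the Datastore fetch of the child entity.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins for the Friend / FriendDetails split (plain Java, no JDO).
class LazyDetailsSketch {

    // Plays the role of the "details" child entity.
    static final class Details {
        final String email;

        Details(String email) {
            this.email = email;
        }
    }

    // Plays the role of the "summary" parent entity.
    static final class Summary {
        final String name;
        private final Supplier<Details> loader; // stands in for the Datastore fetch
        private Details details;                // stays null until first requested

        Summary(String name, Supplier<Details> loader) {
            this.name = name;
            this.loader = loader;
        }

        // The details are loaded on first access only, then reused.
        Details getDetails() {
            if (details == null) {
                details = loader.get();
            }
            return details;
        }
    }

    public static void main(String[] args) {
        final int[] fetches = {0};
        Summary friend = new Summary("Ann", () -> {
            fetches[0]++;
            return new Details("ann@example.com");
        });
        // Building a summary list only touches the name; no details fetch happens.
        System.out.println(friend.name + ", detail fetches so far: " + fetches[0]);
        // Viewing or editing the Friend triggers a single fetch.
        friend.getDetails();
        friend.getDetails();
        System.out.println("detail fetches after viewing: " + fetches[0]);
    }
}
```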

As you may have noticed, the Friend model is already set up in this manner—this is the rationale for that design.


When splitting a data model like this, consider the queries your app will perform and how the design of the data objects will support those queries. For example, if your app often needs to query for property1 == x and property2 == y, and especially if both individual filters can produce large result sets, you are probably better off keeping both those properties on the same entity (for example, retaining both fields on the "main" entity, rather than moving one to a "details" entity).

For persistent classes (that is, "data classes") that you often access and update, it is also worth considering whether any of their fields do not require indexes. This would be the case if you never perform a query that includes that field. The fewer indexed fields a persistent class has, the quicker the writes of objects of that class.

Splitting a model by creating an "index" and a "data" entity

You can also consider splitting a model if you identify fields that you access only when performing queries, but don't require once you've actually retrieved the object. Often, this is the case with multi-valued properties. For example, in the Connectr app, this is the case with the friendKeys list of the server.domain.FeedIndex class. This multi-valued property is used to find relevant feed objects but is not used when displaying feed content information.

With App Engine, there is no way for a query to retrieve only the fields that you need, so the full object must always be pulled in. If the multi-valued property lists are long, this is inefficient.

To avoid this inefficiency, we can split up such a model into two parts, and put each one in a different entity—an index entity and a data entity. The index entity holds only the multi-valued properties (or other data) used only for querying, and the data entity holds the information that we actually want to use once we've identified the relevant objects. The trick to this new design is that the data entity key is defined to be the parent of the index entity key.

More specifically, when an entity is created, its key can be defined as a "child" of another entity's key, which becomes its parent. The child is then in the same entity group as the parent. Because such a child key is based on the path of its parent key, it is possible to derive the parent key given only the child key, using the getParent() method of Key, without requiring the child to be instantiated.

So with this design, we can first do a keys-only query on the index kind (which is faster than full object retrieval) to get a list of the keys of the relevant index entities. With that list, even though we've not actually retrieved the index objects themselves, we can derive the parent data entity keys from the index entity keys. We can then do a batch fetch with the list of relevant parent keys to grab all the data entities at once. This lets us retrieve the information we're interested in, without having to retrieve the properties that we do not need.
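
The three steps above (keys-only query on the index kind, parent-key derivation, batch fetch of the data entities) can be sketched with an in-memory model. This is a plain-Java illustration under stated assumptions, not Datastore code: SimpleKey is a hypothetical stand-in for a Datastore key, and the two maps stand in for the index and data kinds.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory sketch of the index/data split; all names here are hypothetical.
class IndexDataSketch {

    // Stand-in for a Datastore key: a kind, a name, and an optional parent key.
    static final class SimpleKey {
        final String kind;
        final String name;
        final SimpleKey parent;

        SimpleKey(String kind, String name, SimpleKey parent) {
            this.kind = kind;
            this.name = name;
            this.parent = parent;
        }

        SimpleKey getParent() {
            return parent;
        }
    }

    // "Index" entities: index key -> multi-valued friendKeys property.
    private final Map<SimpleKey, Set<String>> indexEntities =
        new LinkedHashMap<SimpleKey, Set<String>>();
    // "Data" entities: data key name (the feed URL) -> feed content.
    private final Map<String, String> dataEntities = new HashMap<String, String>();

    void addFeed(String url, String content, String... friendKeys) {
        SimpleKey dataKey = new SimpleKey("FeedInfo", url, null);
        // The index key is created as a child of the data key.
        SimpleKey indexKey = new SimpleKey("FeedIndex", url, dataKey);
        dataEntities.put(url, content);
        indexEntities.put(indexKey, new HashSet<String>(Arrays.asList(friendKeys)));
    }

    // Step 1: "keys-only" query; returns index keys, never the property lists.
    List<SimpleKey> indexKeysFor(String friendKey) {
        List<SimpleKey> keys = new ArrayList<SimpleKey>();
        for (Map.Entry<SimpleKey, Set<String>> e : indexEntities.entrySet()) {
            if (e.getValue().contains(friendKey)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    // Steps 2 and 3: derive the parent keys, then fetch the data entities by key.
    List<String> feedContentFor(String friendKey) {
        List<String> contents = new ArrayList<String>();
        for (SimpleKey indexKey : indexKeysFor(friendKey)) {
            contents.add(dataEntities.get(indexKey.getParent().name));
        }
        return contents;
    }

    public static void main(String[] args) {
        IndexDataSketch store = new IndexDataSketch();
        store.addFeed("http://a.example/feed", "content A", "f1", "f2");
        store.addFeed("http://b.example/feed", "content B", "f2");
        store.addFeed("http://c.example/feed", "content C", "f1");
        System.out.println(store.feedContentFor("f1")); // prints [content A, content C]
    }
}
```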

See Brett Slatkin's presentation, Building scalable, complex apps on App Engine (io/2009/sessions/BuildingScalableComplexApps.html) for more on this index/data design.

Figure 2: Splitting the feed model into an "index" part (server.domain.FeedIndex) and a "data" part (server.domain.FeedInfo)

Our feed model maps well to this design—we filter on the FeedIndex.friendKeys multi-valued property (which contains the list of keys of Friends that point to this feed) when we query for the feeds associated with a given Friend.

But, once we have retrieved those feeds, we don't need the friendKeys list further. So, we would like to avoid retrieving them along with the feed content. With our app's sample data, these property lists will not comprise a lot of data, but they would be likely to do so if the app was scaled up. For example, many users might have the same friends, or many different contacts might include the same company blog in their associated feeds.

So, we split up the feed model into an index part and a parent data part, as shown in Figure 2. The index class is server.domain.FeedIndex; it contains the friendKeys list for a feed. The data part, containing the actual feed content, is server.domain.FeedInfo. When a new FeedIndex object is created, its key will be constructed so that its corresponding FeedInfo object's key is its parent key. This construction must of course take place at object creation, as Datastore entity keys cannot be changed.

For a small-scale app, the payoff from this split model would perhaps not be worth it. But for the sake of example, let's assume that we expect our app to grow significantly.

The FeedInfo persistent class (the parent class) simply uses an app-assigned String primary key, urlstring (the feed URL string). The server.domain.FeedIndex constructor, shown in the code below, uses the key of its FeedInfo parent (the URL string) to construct its key. This places the two entities into the same entity group and allows the parent FeedInfo key to be derived from the FeedIndex entity's key.

@PersistenceCapable(identityType = IdentityType.APPLICATION)
public class FeedIndex implements Serializable {

  @PrimaryKey
  @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
  private Key key;

  // keys of the Friends that list this feed
  @Persistent
  private Set<String> friendKeys;

  public FeedIndex(String fkey, String url) {
    this.friendKeys = new HashSet<String>();
    this.friendKeys.add(fkey);
    // build a key whose parent is the FeedInfo key for this URL
    KeyFactory.Builder keyBuilder =
      new KeyFactory.Builder(FeedInfo.class.getSimpleName(), url);
    keyBuilder.addChild(FeedIndex.class.getSimpleName(), url);
    Key ckey = keyBuilder.getKey();
    this.key = ckey;
  }
  // ...
}

The following code, from server.servlets.FeedUpdateFriendServlet, shows how this model is used to efficiently retrieve the FeedInfo objects associated with a given Friend. Given a Friend key, a query is performed for the keys of the FeedIndex entities that contain this Friend key in their friendKeys list. Because this is a keys-only query, it is much more efficient than returning the actual objects. Then, each FeedIndex key is used to derive the parent (FeedInfo) key. Using that list of parent keys, a batch fetch is performed to fetch the FeedInfo objects associated with the given Friend. We did this without needing to actually fetch the FeedIndex objects.

... imports...
public class FeedUpdateFriendServlet extends HttpServlet {

  private static Logger logger =
    Logger.getLogger(FeedUpdateFriendServlet.class.getName());

  public void doPost(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {

    PersistenceManager pm = PMF.get().getPersistenceManager();

    Query q = null;
    try {
      String fkey = req.getParameter("fkey");
      if (fkey != null) {
        logger.info("in FeedUpdateFriendServlet, updating feeds for:" + fkey);
        // query for matching FeedIndex keys
        q = pm.newQuery("select key from " + FeedIndex.class.getName() +
          " where friendKeys == :id");
        List ids = (List) q.execute(fkey);
        if (ids.size() == 0) {
          // no matching feeds; nothing to do
          return;
        }
        // else, get the parent keys of the ids
        Key k = null;
        List<Key> parentlist = new ArrayList<Key>();
        for (Object id : ids) {
          // cast to key
          k = (Key) id;
          // derive the parent (FeedInfo) key from the FeedIndex key
          parentlist.add(k.getParent());
        }
        // fetch the parents using the keys
        Query q2 = pm.newQuery("select from " + FeedInfo.class.getName() +
          " where urlstring == :keys");
        // allow eventual consistency on read
        q2.addExtension(
          "datanucleus.appengine.datastoreReadConsistency", "EVENTUAL");
        List<FeedInfo> results = (List<FeedInfo>) q2.execute(parentlist);
        for (FeedInfo fi : results) {
          // ... update the feed content for fi ...
        }
      }
    }
    catch (Exception e) {
      logger.warning(e.getMessage());
    }
    finally {
      if (q != null) {
        q.closeAll();
      }
      pm.close();
    }
  }
}//end class


Use of property lists to support "join" behavior

Google App Engine does not support joins with the same generality as a relational database. However, property lists along with accompanying denormalization can often be used in GAE to support join-like functionality in a very efficient manner.

At the time of writing, there is GAE work in progress to support simple joins. However, this functionality is not yet officially part of the SDK.

Consider the many-to-many relationship between Friend and feed information in our application. With a relational database, we might support this relationship by using three tables: one for Friend data, one for Feed data, and a "join table" (sometimes called a "cross-reference table"), named, say, FeedFriend, with two columns—one for the friend ID and one for the feed ID. The rows in the join table would indicate which feeds were associated with which friends.

In our hypothetical relational database, a query to find the feeds associated with a given Friend fid would look something like this:

select feed.feedname from Feed feed, FeedFriend ff
where ff.friendid = 'fid' and ff.feedid = feed.id

If we wanted to find those feeds that both Friend 1 (fid1) and Friend 2 (fid2) had listed, the query would look something like this:

select feed.feedname from Feed feed, FeedFriend f1, FeedFriend f2
where f1.friendid = 'fid1' and f1.feedid = feed.id
and f2.friendid = 'fid2' and f2.feedid = feed.id

With Google App Engine, to support this type of query, we can denormalize the "join table" information and use Datastore multi-valued properties to hold the denormalized information. (Denormalization should not be considered a second-class citizen in GAE).

In Connectr, feed objects hold a list of the keys of the Friends that list that feed (friendKeys), and each Friend holds a list of the feed URLs associated with it.

So, with the first query above, the analogous JDOQL query is:

select from FeedIndex where friendKeys == 'fid'

If we want to find those feeds that are listed by both Friend 1 and Friend 2, the JDOQL query is:

select from FeedIndex where friendKeys == 'fid1' and
friendKeys == 'fid2'

Our data model, and its use of multi-valued properties, has allowed these queries to be very straightforward and efficient in GAE.
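
The matching rule behind these queries can be sketched in plain Java: an entity satisfies an equality filter on a multi-valued property if any one of its values equals the filter value, so two equality filters on friendKeys select the feeds whose list contains both Friend keys. The sketch below is an in-memory illustration only; MultiValuedFilterSketch and feedsListedByAll are hypothetical names, and the map stands in for the stored FeedIndex entities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory sketch of equality filters on a multi-valued property (hypothetical names).
class MultiValuedFilterSketch {

    // Returns the feeds whose friendKeys contain every one of the given friend ids,
    // mirroring "friendKeys == 'fid1' and friendKeys == 'fid2'".
    static List<String> feedsListedByAll(Map<String, Set<String>> friendKeysByFeed,
                                         String... friendIds) {
        List<String> matches = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> e : friendKeysByFeed.entrySet()) {
            if (e.getValue().containsAll(Arrays.asList(friendIds))) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }

    // Small sample data set: feed name -> keys of the Friends that list it.
    static Map<String, Set<String>> sample() {
        Map<String, Set<String>> feeds = new LinkedHashMap<String, Set<String>>();
        feeds.put("feedA", new HashSet<String>(Arrays.asList("fid1", "fid2")));
        feeds.put("feedB", new HashSet<String>(Arrays.asList("fid1")));
        feeds.put("feedC", new HashSet<String>(Arrays.asList("fid1", "fid2", "fid3")));
        return feeds;
    }

    public static void main(String[] args) {
        System.out.println(feedsListedByAll(sample(), "fid1"));         // prints [feedA, feedB, feedC]
        System.out.println(feedsListedByAll(sample(), "fid1", "fid2")); // prints [feedA, feedC]
    }
}
```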

Supporting the semantics of more complex joins

The semantics of more complex join queries can sometimes be supported in GAE with multiple synchronously-ordered multi-valued properties.

For example, suppose we decided to categorize the associated feeds of Friends by whether they were "Technical", "PR", "Personal", "Photography-related", and so on (and that we had some way of determining this categorization on a feed-by-feed basis). Then, suppose we wanted to find all the Friends whose feeds include "PR" feed(s), and to list those feed URLs for each Friend.

In a relational database, we might support this by adding a "Category" table to hold category names and IDs, and adding a category ID column to the Feed table. Then, the query might look like this:

select f.lastName, feed.feedname from Friend f, Category c,
Feed feed, FeedFriend ff
where c.name = 'PR' and feed.cat_id = c.id and ff.feedid = feed.id
and ff.friendid = f.id

We might attempt to support this type of query in GAE by adding a feedCategories multi-valued property list to Friend, which contained all the categories in which their feeds fell. Every time a feed was added to the Friend, this list would be updated with the new category as necessary. We could then perform a JDOQL query to find all such Friends:

select from Friend where feedCategories == 'PR'

However, for each returned Friend we would then need to check each of their feeds in turn to determine which feed(s) were the PR ones—requiring further Datastore access.

To address this, we could build a Friend feedCategories multi-valued property list whose ordering was synchronized with the urls list ordering, with the nth position in the categories list indicating the category of the nth feed. For example, suppose that url1 and url3 are of category 'PR', and url2 is of category 'Technical'. The two lists would then be ordered as follows:

urls = [ url1, url2, url3, ... ]

feedCategories = [PR, TECHNICAL, PR, ...]

(For efficiency, we would probably map the categories to integers.) Then, for each Friend returned from the previous query, we could determine which feed URLs were the 'PR' ones by their position in the feed list, without requiring further Datastore queries. In the previous example, it would be the URLs at positions 0 and 2: url1 and url3.
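
The positional lookup can be sketched in a few lines of plain Java. This is a minimal illustration that assumes the two lists have already been read from a Friend entity; the class SynchronizedListsSketch and the method urlsInCategory are hypothetical names, not part of Connectr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of reading two position-synchronized property lists (hypothetical names).
class SynchronizedListsSketch {

    // Returns the URLs whose category, found at the same position in
    // feedCategories, equals the requested category.
    static List<String> urlsInCategory(List<String> urls,
                                       List<String> feedCategories,
                                       String category) {
        if (urls.size() != feedCategories.size()) {
            // The technique relies on a one-to-one mapping between the lists.
            throw new IllegalArgumentException("lists are out of sync");
        }
        List<String> matches = new ArrayList<String>();
        for (int i = 0; i < urls.size(); i++) {
            if (category.equals(feedCategories.get(i))) {
                matches.add(urls.get(i));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("url1", "url2", "url3");
        List<String> feedCategories = Arrays.asList("PR", "TECHNICAL", "PR");
        System.out.println(urlsInCategory(urls, feedCategories, "PR")); // prints [url1, url3]
    }
}
```

The size check makes the write-time obligation explicit: every code path that adds or removes a URL has to update the categories list at the same position.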

This technique requires more expense at write time, in exchange for more efficient queries at read time. The approach is not always applicable (for example, it requires a one-to-one mapping between the items in the synchronized property lists), but it can be very effective when it does apply.


In this article we took a look at how Datastore access configuration and data modeling can impact application efficiency and discussed some approaches to entity design towards scalability.

In the next article we will take a look at Datastore Transactions.
