Summary: How to model application data for an in-memory data grid

Moving from Centralized to Distributed Data Model

When moving from a centralized to a distributed data store, your data needs to be partitioned across multiple nodes (AKA partitions). Implementing the partitioning mechanism is technically not a hard task; however, planning the distribution of your data for scalability and performance requires some thinking.

There are several questions which need to be answered when planning for data partitioning:
1. What is the information I should store in memory? The answer to this question is not a technical one, and should not be confused with the structure of the data. This is in essence a business question: How much will the data grow over time? For how long should you keep it?

We recommend using the following table for this process:

Data item     Estimated Quantity   Expected Growth   Estimated Object Size
Data Type A   100K                 10%               2K
Data Type B   200K                 20%               4K

Once you have identified the size and expected growth of your data, you can start thinking about partitioning it; however, there's more to consider before doing that.
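The figures in the sample table above translate into a rough memory estimate. The following is a minimal sketch, not a sizing tool: it assumes the growth percentages are per year and compounded, that "K" means 1,000 objects for quantities and 1,024 bytes for sizes, and it ignores index and per-entry overhead. None of these assumptions is stated in the table, and the class and method names are hypothetical.

```java
// Rough capacity estimate for the sample table.
// Assumptions (not from the table): yearly compounded growth,
// no indexing or per-entry overhead.
class CapacityEstimate {

    /** Estimated raw size in bytes after the given number of years. */
    static long estimateBytes(long count, double yearlyGrowth, long objectBytes, int years) {
        double projectedCount = count * Math.pow(1.0 + yearlyGrowth, years);
        return Math.round(projectedCount) * objectBytes;
    }

    public static void main(String[] args) {
        for (int years = 0; years <= 3; years++) {
            long typeA = estimateBytes(100_000, 0.10, 2 * 1024, years);
            long typeB = estimateBytes(200_000, 0.20, 4 * 1024, years);
            System.out.println("Year " + years + ": " + (typeA + typeB) + " bytes");
        }
    }
}
```

Dividing the projected total by the memory you are willing to dedicate per node gives a first lower bound on the number of partitions.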

2. What are my application's use cases? While you might be used to modeling your data by the logical relationships between your data items, in the case of distributed data you need to think differently. The rule of thumb here is to avoid cross-cluster relationships as much as possible, since they lead to cross-cluster queries and updates, which are usually much slower and less scalable than their local counterparts.

Thinking in terms of traditional relationships ("one to one", "one to many" and "many to many") can be misleading with distributed data.
The first question to ask is: How many different associations does each entity have?

If an entity is associated with several containers (parent entities), it can't be embedded within the containing entity. It might also be impossible to store it with all of its containers on the same partition.

Here's an example:
In the Pet Clinic application that is based on the Spring pet clinic sample, a Pet is only associated with an Owner. We can therefore store each Pet with its owner on the same partition. We can even embed the Pet object within the physical Owner entry.

However, if a Pet were associated with a Vet as well, we could not embed the Pet in the Vet physical entry (without duplicating each Pet entry), and we might not even be able to store the Pet and its Vet in the same partition.
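The co-location argument can be illustrated with plain hash-based routing. This is a minimal sketch and not the GigaSpaces routing implementation: the `partitionOf` helper, the partition count and the IDs are hypothetical, and it simply assumes that an entry's partition is derived from the hash of a single routing value.

```java
// Sketch: entries that share a routing value land on the same partition.
class RoutingSketch {
    static final int PARTITIONS = 4; // hypothetical cluster size

    /** Maps a routing value to a partition index in [0, PARTITIONS). */
    static int partitionOf(int routingValue) {
        return Math.floorMod(Integer.hashCode(routingValue), PARTITIONS);
    }

    public static void main(String[] args) {
        int ownerId = 17; // hypothetical IDs
        int vetId = 6;

        // The Pet is routed by its owner's ID, so it sits next to its Owner.
        int ownerPartition = partitionOf(ownerId);
        int petPartition = partitionOf(ownerId);
        System.out.println("Pet with Owner: " + (petPartition == ownerPartition));

        // The Vet is routed by its own ID and may live elsewhere. A Pet has
        // only one routing value, so it cannot follow both Owner and Vet.
        int vetPartition = partitionOf(vetId);
        System.out.println("Pet with Vet: " + (petPartition == vetPartition));
    }
}
```

With a single routing value per entry, an entity can be guaranteed to share a partition with only one of its parents, which is why the number of associations matters more than their cardinality.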

We have mentioned the concept of embedded relationships above. Let us now explain this concept's implications for your application.

Embedded vs. Non-Embedded Relationships

Embedded Relationships mean that one object physically contains the associated objects and there is a strong lifecycle dependency between them - once you delete the containing object, you also delete all of its contained objects. With this type of object association, you are always ensuring a local transaction since the entire object graph is stored in the same entry within the Space.

Here are examples of data access with embedded relationships:
Embedded Object Query - The info property is an object within the Person class:

SQLQuery<Person> query = new SQLQuery<Person>
	(Person.class, "info.socialSecurity < ? and info.socialSecurity >= ?");

Embedded Map Query - The info property is a Map within the Person class:

SQLQuery<Person> query = 
new SQLQuery<Person>(Person.class, "info.salary < 15000 and info.salary >= 8000");

Embedded Collection Query - The employees property is a collection within the Company class:

SQLQuery<Company> query = 
	new SQLQuery<Company>
	(Company.class, "employees[*].children[*].name = 'Junior Doe'");

See the SQLQuery section for details about embedded entities query and indexing.

Non-Embedded Relationships mean that one object is associated with a number of other objects, so you can navigate from one object to another. However, there is no lifecycle dependency between them, so if you delete the referencing object, you don't automatically delete the referenced object(s). The association is therefore expressed by storing IDs rather than the associated object itself. This type of relationship means that you don't duplicate data, but you are more likely to access more than one node in the cluster when querying or updating your data.

See the Parent Child Relationship for an example of non-embedded relationships.
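A non-embedded association can be pictured as two separate entries linked by an ID. The sketch below uses plain Java maps as a stand-in for the Space; the classes, fields and the `ownerOf` helper are hypothetical, and this is not the GigaSpaces API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a non-embedded relationship stores the ID of the referenced object.
class NonEmbeddedSketch {
    static class Owner {
        final int id;
        final String name;
        Owner(int id, String name) { this.id = id; this.name = name; }
    }

    static class Pet {
        final int id;
        final int ownerId; // reference by ID, not by containment
        final String name;
        Pet(int id, int ownerId, String name) {
            this.id = id; this.ownerId = ownerId; this.name = name;
        }
    }

    // Stand-ins for Space entries, keyed by ID.
    static final Map<Integer, Owner> owners = new HashMap<>();
    static final Map<Integer, Pet> pets = new HashMap<>();

    /** Navigating Pet -> Owner takes a second lookup, possibly on another node. */
    static Owner ownerOf(int petId) {
        Pet pet = pets.get(petId);
        return pet == null ? null : owners.get(pet.ownerId);
    }

    public static void main(String[] args) {
        owners.put(1, new Owner(1, "Jean"));
        pets.put(10, new Pet(10, 1, "Max"));
        System.out.println(ownerOf(10).name);

        // No lifecycle dependency: removing the Pet leaves the Owner in place.
        pets.remove(10);
        System.out.println(owners.containsKey(1));
    }
}
```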

When Should Objects be Embedded?

You already know it's not a good practice to embed related objects. But even when there's a good case for embedding related objects (sometimes at the cost of data duplication), you should still be aware of the following:

  • Embedding means no direct access: When an entity is embedded within another entity, you cannot apply CRUD operations to it directly. Instead, you need to get its root parent entity from the Space via a regular query and then navigate down the object graph until you reach the entity you need. This is not just a matter of convenience; it also has performance implications: whenever you want to perform CRUD operations on an embedded entity, you read the entire object graph first and (if you also need to update it) you write the entire object graph back to the Space.
  • On the other hand, with GigaSpaces non-embedded relationships mean you need to manage the relationship yourself, within your code.
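The cost described in the first point can be sketched with a map standing in for the Space. This is an illustration under stated assumptions rather than GigaSpaces code: the `Owner` and `Pet` classes, the `space` map and `updatePetWeight` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: updating an embedded entity means reading and rewriting its root.
class EmbeddedUpdateSketch {
    static class Pet {
        final String name;
        int weightKg;
        Pet(String name, int weightKg) { this.name = name; this.weightKg = weightKg; }
    }

    static class Owner {
        final int id;
        final List<Pet> pets = new ArrayList<>(); // embedded children
        Owner(int id) { this.id = id; }
    }

    // Stand-in for the Space: only Owner entries exist; Pets live inside them.
    static final Map<Integer, Owner> space = new HashMap<>();

    /** Returns true if the pet was found and updated. */
    static boolean updatePetWeight(int ownerId, String petName, int newWeightKg) {
        Owner owner = space.get(ownerId);      // 1. read the whole root entry
        if (owner == null) {
            return false;
        }
        for (Pet pet : owner.pets) {           // 2. navigate down the graph
            if (pet.name.equals(petName)) {
                pet.weightKg = newWeightKg;
                space.put(ownerId, owner);     // 3. write the whole entry back
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Owner owner = new Owner(1);
        owner.pets.add(new Pet("Max", 12));
        space.put(owner.id, owner);
        System.out.println(updatePetWeight(1, "Max", 13));
    }
}
```

There is no way to address the Pet by itself here; every change to it goes through the Owner entry, however large that entry is.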

Rules of Thumb for Choosing Embedded Relationships

  • Embed when an entity is meaningful only within the context of its containing object. For example, in the Pet Clinic application, a Pet has a meaning only when it has an Owner. A Pet in itself is meaningless without an Owner in this specific application. There is no business scenario for transferring a Pet from owner to owner or admitting a Pet to a Vet without the owner.
  • Embedding may sometimes mean duplicating your data. For example, if you want to reference a certain Visit from both the Pet and Vet class, you'll need to have duplicate Visit entries. So let's look into duplications:
    • Duplication means preferring scalability over footprint - the reason to duplicate is to avoid cluster wide transactions and in many cases it's the only way to partition your object in a scalable manner.
    • Duplication means higher memory consumption: While memory is considered a commodity and low cost today, duplication has a bigger price to pay - you might have two space objects that contain the same data.
    • Duplication means more lenient consistency. When you add a Visit to a Pet and Vet for example, you need to update them both. You can do it in one (potentially distributed) transaction, or in two separate transactions, which will scale better but be less consistent. This may be sufficient for many types of applications (e.g. social networks), where losing a post, although undesired, does not incur significant damage. In contrast, this is not feasible for financial applications where every operation should be accounted for.
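The consistency trade-off in the last point can be sketched as two independent writes. This is a simplified illustration: the two lists are hypothetical stand-ins for the Visit copies held with the Pet and with the Vet, and no real transaction API is involved.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a duplicated Visit must be written to two places.
class DuplicationSketch {
    static final List<String> petVisits = new ArrayList<>(); // copy kept with the Pet
    static final List<String> vetVisits = new ArrayList<>(); // copy kept with the Vet

    /** True when both copies hold the same visits. */
    static boolean consistent() {
        return petVisits.equals(vetVisits);
    }

    public static void main(String[] args) {
        String visit = "checkup"; // hypothetical visit record

        petVisits.add(visit);             // first local update
        // Between the two updates a reader can observe the copies out of sync.
        System.out.println(consistent()); // prints "false"

        vetVisits.add(visit);             // second local update
        System.out.println(consistent()); // prints "true"
    }
}
```

Wrapping both writes in one distributed transaction closes that window at the cost of scalability; leaving them separate keeps each write local but means a failure after the first write leaves the copies diverged until it is repaired.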