Summary: This section gives a detailed description of the Space's replication mechanism, and how to configure and optimize it.

Overview

Replication is the process of duplicating or copying application data and operations from a source space to one or more target spaces. Replication is used mainly for high availability, where a replica space runs in backup mode; for load balancing; and for sharing local site data with remote sites. Replicating application data and operations is crucial in a grid environment, where multiple clusters of workstations need to share the same data.
To replicate application data and operations, a dedicated replicator component runs inside the engine of each replicated space. This component replicates the space activity synchronously or asynchronously with the other spaces that belong to the same replication group in the cluster.

An Efficient Mechanism

The replication mechanism is efficient by default: changes that do not need to be propagated to the target space are not propagated. For example, if a space object is written and removed inside the same transaction, neither operation is replicated. You can modify this behavior by enabling the repl-original-state option.
The replicator components locate all replica spaces automatically, and reconnect automatically when a target space is rediscovered.
When a client performs a destructive operation on a space - such as write, update, take, notify registration, or transaction commit - a replicator thread at the source space is triggered, and handles all the work required to copy the operations and associated space objects to the target space(s).

The take operation replicates only the object ID; it does not carry any other information about the object (such as the object's full state).

How Replication Works when Using Transactions

Whenever you use space operations together with transactions, the operations are replicated to the target space only when the commit() method is called. The client gets an acknowledgement for the commit operation only after the target space has received the transaction data and committed it.
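For illustration, here is a minimal sketch of a transactional write whose replication is deferred until commit. It assumes an OpenSpaces setup with a distributed Jini transaction manager; MyData is a hypothetical space class:

import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.UrlSpaceConfigurer;
import org.openspaces.core.transaction.manager.DistributedJiniTxManagerConfigurer;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.TransactionStatus;
import org.springframework.transaction.support.TransactionCallbackWithoutResult;
import org.springframework.transaction.support.TransactionTemplate;

public class TxReplicationDemo {
    public static void main(String[] args) throws Exception {
        PlatformTransactionManager txManager = new DistributedJiniTxManagerConfigurer().transactionManager();
        final GigaSpace gigaSpace = new GigaSpaceConfigurer(new UrlSpaceConfigurer("/./mySpace").space())
                .transactionManager(txManager)
                .gigaSpace();

        new TransactionTemplate(txManager).execute(new TransactionCallbackWithoutResult() {
            protected void doInTransactionWithoutResult(TransactionStatus status) {
                gigaSpace.write(new MyData("first"));   // not replicated yet
                gigaSpace.write(new MyData("second"));  // not replicated yet
                // both writes are replicated to the target space only when
                // the surrounding template commits on exit from this callback
            }
        });
    }
}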

Synchronous vs. Asynchronous Replication

The GigaSpaces cluster provides synchronous and asynchronous replication schemes.
In a synchronous replication scheme, the client receives acknowledgement for a destructive operation only after both the source and target spaces have performed the operation.

When the target space is defined in backup mode but is not available, the client receives acknowledgement from the active primary space. However, the operation is performed on the backup space only when the source space re-establishes the connection with the backup. The primary space logs all destructive operations until the backup is started. The same behavior applies when running in asynchronous replication mode.
In the asynchronous scheme, destructive operations are performed in the source space, and acknowledgement is immediately returned to the client.

Operations are accumulated in the source space and sent asynchronously to the target space after a defined period of time has elapsed, or after a defined number of operations have been performed (whichever occurs first). The downside of this scheme is the possibility of data loss if the source space fails while transferring the accumulated operations to the target space. Another problem is data coherency - the source and the target do not hold identical data at all times.
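These two triggers map to the repl-interval-millis and repl-interval-opers properties (see Asynchronous Replication Options below). A minimal sketch of tuning them programmatically - the values are illustrative:

import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.UrlSpaceConfigurer;

public class AsyncTriggerConfig {
    // A sketch: replicate every 1000 ms, or after 200 destructive operations,
    // whichever occurs first.
    public static GigaSpace createSpace() {
        UrlSpaceConfigurer configurer = new UrlSpaceConfigurer("/./mySpace")
            .addProperty("cluster-config.groups.group.repl-policy.async-replication.repl-interval-millis", "1000")
            .addProperty("cluster-config.groups.group.repl-policy.async-replication.repl-interval-opers", "200");
        return new GigaSpaceConfigurer(configurer.space()).gigaSpace();
    }
}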
The following compares synchronous and asynchronous replication, aspect by aspect.

Data loss
  • Synchronous: Each space operation waits until completion is confirmed at the primary space as well as the backup space. An incomplete operation is rolled back at both locations; therefore the remote copy is always an exact duplicate of the primary.
  • Asynchronous: Might lose some data if there is an unplanned failover to the backup space. In a failover situation, the backup space is available for clients only after all data from its redo log has been processed, which might slow down the failover.
Distance
  • Synchronous: Sensitive to latency, which is tied directly to distance.
  • Asynchronous: Highly tolerant of latency; can be used when the primary and backup spaces are located in different geographical sites (different cities).
Performance impact
  • Synchronous: The client must wait for confirmation of each space operation from both the source space and the target space(s). Performance depends mainly on source space resources (CPU/memory), target space resources (CPU/memory), and the network bandwidth and latency between the source and target spaces.
  • Asynchronous: The client is acknowledged immediately after the source space has processed the incoming operation. Performance depends mainly on source space resources (CPU/memory).
Data integrity
  • Synchronous: Very accurate.
  • Asynchronous: Less accurate.
Failover time
  • Synchronous: Very rapid; the backup space does not need to process redo log data.
  • Asynchronous: The backup space needs to process redo log data; recovery time depends on the number of logged operations.
Bandwidth requirements
  • Synchronous: LAN.
  • Asynchronous: WAN/LAN.

Can Synchronous Replication and Asynchronous Replication be Mixed?

Yes. You can create a cluster that mixes synchronous and asynchronous replication between a source space and multiple target spaces - each source/target pair can have a different replication mode. You might define such a configuration when you need to provide different levels of quality of service. For example, some of the target space data may be less sensitive and not require 100% coherency with the source space - such as data used for reporting and monitoring - in which case asynchronous replication is sufficient. In contrast, whenever the target space is used for the runtime environment, or in an active-active configuration, synchronous replication is required. An example is a disaster recovery site, which needs full coherency with the operational site in order to ensure 100% recoverability in a failover situation.

Asynchronous Replication

How it Works with Multiple Concurrent Users Accessing the Space

GigaSpaces synchronous replication supports concurrent clients. When multiple clients access the same space and synchronous replication uses unicast, data is replicated to all target spaces simultaneously through separate threads, with each replication channel using a different thread. The threads are taken from a dedicated thread pool whose minimum and maximum size can be configured (see the Synchronous Replication Options below).

How Asynchronous Replication Works

  1. A destructive space operation is called.
  2. The source space:
    • Performs the operation.
    • Logs the operation and the relevant space object UID into an asynchronous redo log.
    • Sends acknowledgement to the client.
  3. A replication event is triggered, and the replicator:
    • Constructs a batch of operations in the source space.
    • Packs these into one object - a replication packet.
    • Transmits the packet to the target space.
  4. Once the packet is received at the target space, the operations are processed in order.
  5. The next batch is sent when the target space completes processing the replication packet.

The cluster configuration includes parameters that define the replication behavior and policies. These include parameters specific to synchronous and asynchronous replication, as well as common parameters.
When you use the replication transmission matrix, you can define the replication mode - synchronous or asynchronous - for each source/target pair. Asynchronous is the default replication mode. If you use synchronous replication, choose the unicast communication protocol. All source/target pairs share the common synchronous or asynchronous parameters.

Replication Conflict Resolution

When a space object is updated by different cluster nodes at the same time (which can happen in asynchronous replication mode), the target space is updated only if the replicated object version is more recent than the version of the existing object in the target space.

  • In the case of a take or update replication event for a non-existing space object, a Wrong clustered space usage – UID X doesn't exist in target Y error is written to the space log file.
  • In the case of a write replication event for a space object that already exists, a Wrong clustered space usage – UID X already exists in space error is written to the space log file.
  • In the case of a cancelled or renewed lease replication event for a space object that no longer exists, a This space object might be already expired or cancelled error is written to the space log file.
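The version check described above relies on each space object carrying a version number. With the POJO API this is typically exposed via the @SpaceVersion annotation; a minimal sketch (class and field names are illustrative):

import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import com.gigaspaces.annotation.pojo.SpaceVersion;

// The space maintains the version field on every update; the conflict
// resolution described above compares these versions.
@SpaceClass
public class VersionedData {
    private String id;
    private int version;

    @SpaceId(autoGenerate = true)
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    @SpaceVersion
    public int getVersion() { return version; }
    public void setVersion(int version) { this.version = version; }
}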

Synchronous Replication

When to Use Synchronous Replication

Synchronous replication is most beneficial in the following scenarios:
  • Whenever an application must replicate highly sensitive, mission-critical data as soon as it arrives at the source space.
  • Whenever any space operation must be duplicated with 100% data integrity to the target space.

How Synchronous Replication Works

  1. A destructive space operation is called.
  2. The source space:
    • Performs the operation.
    • Logs the operation and the relevant space object UID into an asynchronous redo log.
    • Sends acknowledgement to the client.
  3. The target space receives a replication packet, which is processed via a todo queue. The todo queue:
    • Makes sure parallel operations are processed in the correct order.
    • Verifies that there are no missing packets.
  4. With synchronous replication, packets can arrive in any order.
  5. When all missing packets have arrived, the target space sends a confirmation to the source space.

If the todo queue timeout - the time the target space waits for missing packets - elapses, the target space requests the missing packets from the source space's redo log.

When running in sync-rec-ack mode, the source space receives acknowledgement from the network layer that the replication packet has arrived at the target space.

The replication packet processing confirmation is sent from the target space to the source space via a heartbeat. If the source space fails to receive the heartbeat within the defined timeout period, it tries to re-establish a replication channel with the target space.

Splitting Replication of Large Batches into Smaller Chunks

In a GigaSpaces replicated topology, the take and clear operations are replicated identically. Therefore, references to the take operation in this section also apply to the clear operation.

When performing batch operations (writeMultiple, updateMultiple, takeMultiple) in synchronous replication mode, the actual data (space objects/UIDs) is replicated to the target spaces in one batch. Some of the target (replica) spaces might not be able to handle large amounts of data, resulting in out-of-memory errors.

To solve this problem, you can split these space objects into several chunks, thus providing better memory usage, stability, and scalability.

For example, when performing the take (clear) operation, you don't necessarily know how many space objects exist in the space, and all of them need to be removed. In this case, if you expect a large number of space objects in the space, it can be useful to take the space objects in several chunks instead of trying to take all of them at once. Even with writeMultiple, updateMultiple, and takeMultiple, where you control the number of space objects the operation applies to, it can still be useful to split the operation into chunks when working with a large number of space objects.
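For instance, here is a client-side sketch of taking in chunks (MyData is a hypothetical space class; a template with null fields matches all of its instances):

import org.openspaces.core.GigaSpace;

public class ChunkedTake {
    // Remove matching space objects in chunks of 5000 instead of all at once,
    // so each replicated batch stays small.
    public static void takeInChunks(GigaSpace gigaSpace) {
        MyData template = new MyData();
        MyData[] chunk;
        do {
            chunk = gigaSpace.takeMultiple(template, 5000);
            // each call removes (and replicates) at most 5000 objects
        } while (chunk.length == 5000);
    }
}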

On the replication side, splitting large batches into chunks is controlled by the cluster-config.groups.group.repl-policy.sync-replication.multiple-opers-chunk-size parameter. Its default value is 10000, meaning that by default the operation is replicated in chunks of 10000 objects each. To split the replication activity into smaller chunks, set this parameter to the desired batch size. For example, when performing a takeMultiple or clear operation with the following setting:

<os-core:space id="space" url="/./mySpace">
    <os-core:properties>
        <props>
            <prop key="cluster-config.groups.group.repl-policy.sync-replication.multiple-opers-chunk-size">5000</prop>
        </props>
    </os-core:properties>
</os-core:space>

The list of space object IDs to be removed from the replica space is replicated in multiple chunks of 5000 elements each.

Splitting large batches into smaller chunks is not supported for transactional operations.

Synchronous Replication and the Replication Filter

If you employ synchronous replication in unicast mode, a dedicated replication channel is constructed for each target space. The replication filter is called for each replication channel.
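A replication filter is a user-supplied class implementing the IReplicationFilter interface from com.j_spaces.core.cluster. The sketch below assumes that contract (method signatures as commonly documented; wiring the filter into the cluster configuration is not shown):

import com.j_spaces.core.IJSpace;
import com.j_spaces.core.cluster.IReplicationFilter;
import com.j_spaces.core.cluster.IReplicationFilterEntry;
import com.j_spaces.core.cluster.ReplicationPolicy;

// With synchronous unicast replication, this filter is invoked separately
// on each replication channel.
public class LoggingReplicationFilter implements IReplicationFilter {
    public void init(IJSpace space, String paramUrl, ReplicationPolicy replicationPolicy) {
        // called once when the filter is initialized for a channel
    }

    public void process(int direction, IReplicationFilterEntry entry, String remoteSpaceMemberName) {
        // called for each replicated entry on this channel; direction
        // distinguishes outgoing from incoming replication
        System.out.println("Replicating with member: " + remoteSpaceMemberName);
    }

    public void close() {
        // called when the replication channel is closed
    }
}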

Should Synchronous Replication be used for Primary/Backup Replication?

Yes. Because replication happens as part of each operation, synchronous replication keeps the primary and backup spaces fully coherent and provides full reliability if failover occurs. To ensure 100% data coherency in failover situations, synchronous replication must be defined between the primary and its backup spaces. If you use asynchronous replication between primary and backup spaces running in in-memory mode, data might be lost if the primary space fails while its redo log still holds data that has not yet been replicated.

Space Object Lease and Replication

Space object lease information is replicated across all relevant space cluster nodes. When a space object's lease is renewed or cancelled, the operation is also replicated to all relevant cluster nodes. If the cancel/renew operation is performed on a primary space that has already failed, or that fails at the exact time the operation is performed, the system (as with any other operation) fails over immediately to the backup space and cancels/renews the lease there.
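The lease returned from a write can be renewed or cancelled explicitly; both calls are destructive operations and are therefore replicated. A minimal sketch (MyData is a hypothetical space class; the lease durations are illustrative):

import com.j_spaces.core.LeaseContext;
import org.openspaces.core.GigaSpace;

public class LeaseDemo {
    public static void demo(GigaSpace gigaSpace) throws Exception {
        // write with a 60-second lease; the lease is replicated with the object
        LeaseContext<MyData> lease = gigaSpace.write(new MyData(), 60000);
        lease.renew(60000);  // the renewal is replicated to the relevant cluster nodes
        lease.cancel();      // and so is the cancellation
    }
}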

If you register for lease expiration events and cluster-config.groups.group.repl-policy.trigger-notify-templates is enabled, the client also gets notifications from the backup space.


General Replication Options

To change the default replication settings, modify the space properties at deployment time. You can set these properties via the pu.xml or programmatically. Here is an example of how to set the replication parameters using the pu.xml:

<os-core:space id="space" url="/./mySpace">
    <os-core:properties>
        <props>
            <prop key="cluster-config.groups.group.repl-policy.replication-mode">async</prop>
            <prop key="cluster-config.groups.group.repl-policy.policy-type">partial-replication</prop>
        </props>
    </os-core:properties>
</os-core:space>
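The same settings can also be applied programmatically, as mentioned above; a minimal sketch using UrlSpaceConfigurer:

import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.UrlSpaceConfigurer;

public class ReplicationConfig {
    // The same two properties as in the pu.xml example above, set when
    // the space is created.
    public static GigaSpace createSpace() {
        UrlSpaceConfigurer configurer = new UrlSpaceConfigurer("/./mySpace")
            .addProperty("cluster-config.groups.group.repl-policy.replication-mode", "async")
            .addProperty("cluster-config.groups.group.repl-policy.policy-type", "partial-replication");
        return new GigaSpaceConfigurer(configurer.space()).gigaSpace();
    }
}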
cluster-config.groups.group.repl-policy.replication-mode
  Optional values: sync, async.
  • The async mode replicates space operations to the target space after the client receives the acknowledgement from the source space for the operation.
  • The sync mode replicates space operations to the target space before the client receives the acknowledgement from the source space. The client gets the acknowledgement only after all target spaces confirm the arrival of the replicated data.
  Default: async with the async-replicated schema; sync with the sync-replicated, primary-backup, and partitioned-sync2backup cluster schemas.

cluster-config.groups.group.repl-policy.policy-type
  Optional values:
  • full-replication - all objects are replicated.
  • partial-replication - objects whose class is configured with @SpaceClass(replicate = false) are not replicated (a sketch follows this table). See the POJO Support - Advanced page for details. This allows you to control replication at the class level.
  Default: full-replication

cluster-config.groups.group.repl-policy.recovery
  Boolean value. Set to true if you want the space to recover its data from a replication partner when the system starts. See Controlling the Recovery Process for details.
  Default: true

cluster-config.groups.group.repl-policy.repl-find-timeout
  Timeout (in milliseconds) to wait for the lookup of a peer space. This parameter applies only when the space is searched for in a Jini Lookup Service.
  Default: 5000 [ms]

cluster-config.groups.group.repl-policy.repl-find-report-interval
  The interval (in milliseconds) at which the replicator thread prints a report of failed attempts to find/locate other members.
  Default: 30000 [ms]

cluster-config.groups.group.repl-policy.repl-original-state
  To replicate the original state of the entry for every operation, not just its latest state, enable the repl-original-state option. When using synchronous replication, this option is true by default.
  Default: false

cluster-config.groups.group.repl-policy.recovery-chunk-size
  Controls the chunk size used when performing recovery. Tuning this parameter might speed up recovery after the space is restarted.
  Default: 200

cluster-config.groups.group.repl-policy.recovery-thread-pool-size
  Number of threads used by the recovery process.
  Default: 4

cluster-config.groups.group.repl-policy.repl-full-take
  If set to true, the entire object is replicated when a take operation is called. If set to false, only the ID, class information, and primitive fields are replicated. This option is valid only when replicating data into a Mirror Service or a backup in a non-central-DB topology.
  Default: false
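As referenced in the policy-type entry above, here is a sketch of a class excluded from replication under partial-replication (the class itself is illustrative):

import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;

// With policy-type set to partial-replication, instances of this class stay
// local to the source space and are not replicated.
@SpaceClass(replicate = false)
public class LocalOnlyData {
    private String id;

    @SpaceId(autoGenerate = true)
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
}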

Asynchronous Replication Options

cluster-config.groups.group.repl-policy.async-replication.repl-chunk-size
  Number of packets transmitted together on the network when the replication event is triggered. The maximum value you can assign to this property is repl-interval-opers.
  Default: 500

cluster-config.groups.group.repl-policy.async-replication.repl-interval-millis
  Time (in milliseconds) to wait between replication operations.
  Default: 3000 [ms]

cluster-config.groups.group.repl-policy.async-replication.repl-interval-opers
  Number of destructive operations to wait before replicating.
  Default: 500

cluster-config.groups.group.repl-policy.async-replication.async-channel-shutdown-timeout
  How long (in milliseconds) the primary space waits for all existing redo log data to be replicated to its targets before shutting down.
  Default: 300000 [ms]

Synchronous Replication Options

cluster-config.groups.group.repl-policy.sync-replication.todo-queue-timeout
  The time (in milliseconds) that the target space waits for missing packets. When the timeout expires, replication is performed via the asynchronous replicator.
  Default: 1500 [ms]

cluster-config.groups.group.repl-policy.sync-replication.multiple-opers-chunk-size
  When performing batch operations (writeMultiple, updateMultiple, clear, takeMultiple), this element allows you to split the batch into smaller chunks.
  Default: 10000 (the replication operation is performed in chunks of 10000 objects)

cluster-config.groups.group.repl-policy.sync-replication.unicast.min-work-threads
  The synchronous replicator module allocates a thread to communicate with each target space. The thread is taken from a pool; this parameter defines the pool's minimum size.
  Default: 4

cluster-config.groups.group.repl-policy.sync-replication.unicast.max-work-threads
  The pool's maximum size. Make this number larger than the number of target spaces per partition.
  Default: 16