Summary: This section describes in detail how In-Memory Data Grid clustering works. It covers the fail-over, load-balancing, and replication facilities used.
Overview

GigaSpaces Data-Grid clustering scalability and high-availability are based on the following concepts:
- Scalability
- High-Availability
References
- NoCAP - Part II: Availability and Partition Tolerance
- NoCAP - Part III: GigaSpaces Clustering Explained

GigaSpaces Clustering

Below is a high-level description of GigaSpaces clustering. It explains how load-balancing, failover, replication, and high-availability work. More details about Data-Grid clustering can be found here: GigaSpaces Clustering Explained

Consistency

Consistency under concurrent updates

To ensure consistency in the case of concurrent updates on the same data record, each individual record is mapped to a single logical partition at any given point in time. To ensure scalability, different records of the same logical table are written to multiple partitions in parallel. Each partition supports the various locking semantics (pessimistic, optimistic (versioning), dirty-read, etc.) to control concurrent access to the same record within the context of a single partition.

Consistency between two or more replicas

To ensure continuous high availability, we keep one or more copies of our data. With asynchronous replication, we may end up with scenarios where read and write operations hit two different nodes at the same time and read two separate versions of the same data. Various algorithms have been developed to handle that situation. In our case, we chose to avoid getting into that situation in the first place through the use of a synchronous backup. The performance overhead of synchronous replication is fixed and is not proportional to the size of the cluster (each partition replicates data only to its backup replica). The replication to the database is kept asynchronous to reduce the overhead of writing to disk.

Transaction Consistency

Single operations or groups of operations can be executed under transactions. This ensures the ACID properties. Transactions can be made local to each partition; in this case, they are bound to the scope of a single partition and are highly optimized in terms of performance. Transactions can also span nodes (in this case the overhead is obviously going to be higher).

Ordering

All operations are ordered based on the time they were written. This is specifically relevant for ensuring consistency between the in-memory cluster and the long-term storage, which is updated asynchronously.
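The mapping of a record to a single logical partition is typically done by hashing a routing key carried by the record. The following minimal Java sketch illustrates the idea only; the class and method names are assumptions made for this example, not the actual GigaSpaces routing implementation.

```java
import java.util.List;

/**
 * Illustrative sketch of content-based routing: every record carries a routing
 * key, and the hash of that key maps the record to exactly one logical partition.
 * This demonstrates the concept only; it is not the GigaSpaces implementation.
 */
public class RoutingSketch {

    /** Maps a routing key to a logical partition id (0..partitionCount-1). */
    static int partitionFor(Object routingKey, int partitionCount) {
        // Mask the sign bit instead of Math.abs, which is unsafe for Integer.MIN_VALUE.
        return (routingKey.hashCode() & Integer.MAX_VALUE) % partitionCount;
    }

    public static void main(String[] args) {
        int partitions = 4;
        List<String> orderIds = List.of("order-1001", "order-1002", "order-1003");
        for (String id : orderIds) {
            // All operations on the same record always land on the same partition,
            // which is what keeps per-record locking semantics local to one partition.
            System.out.printf("%s -> partition %d%n", id, partitionFor(id, partitions));
        }
    }
}
```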
Availability

Primary failure

GigaSpaces keeps one or more replica nodes as a hot backup. The hot backup takes over immediately if a primary node fails. The hot backup nodes use synchronous replication to ensure that no data is lost before fail-over takes place.

Backup failure

When a backup node fails, the primary continues to serve requests and logs the operations in a redo-log. In parallel, a new backup is provisioned on demand to take over from the failed one. That process involves a provisioning process (in which a new backup is created) and a recovery process (during which the backup gets its state).

Failure of Multiple Nodes

To increase availability, some of the NoSQL variants suggest at least three or more replicas per partition. In this way, a simultaneous failure of multiple nodes can be handled. That obviously comes with a huge overhead - for each terabyte of data, we would have two terabytes of redundant information for backup purposes, and - equally important - we would also have the constant overhead of keeping all of them up to date. An alternative approach is to use on-demand backups. On-demand backups are provisioned automatically as soon as one of the nodes fails.

If spare capacity exists within the current pool of machines, the backup is provisioned onto an alternate machine within the existing pool - this process can take a few seconds (depending on the amount of data per partition). If no machine is available, a completely new machine is started and a new backup is provisioned onto it. The process of starting a new machine with its backup can take a few minutes. As soon as the node starts, it first uses a primary-election protocol to find the master node within its group and only then boots up. The startup process includes a recovery stage in which the node recovers its state from either the master node or the available replicas. The source node also stores, in a redo-log, all the updates that arrive after the recovery started, and replays them to fill in the gap once the new node has completed its recovery.

Client Failure

Clients use a cluster-aware proxy to communicate with the cluster. The smart proxy ensures that a write or read operation is always routed to one of the available partitions. The routing happens implicitly, so the client is not exposed to a fail-over scenario.

Lookup Discovery protocol

The GigaSpaces cluster discovery mechanism is based on the Jini specification. Services use the discovery protocol to find nodes within the cluster and share cluster state amongst all nodes.

Partition Tolerance

Network partition between primary and backup

When the connection between two nodes fails, the primary node logs all the transactions into a FIFO queue known as the redo-log. As soon as the communication is re-established, all the data is replayed to the backup. If the backup fails completely, the system starts a new instance as described above.
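Conceptually, the redo-log is a bounded FIFO queue on the primary: operations are appended while the replication target is unreachable and replayed in order once the connection is re-established. The sketch below is a simplified illustration of that idea in Java; the class name and the Consumer-based replication target are assumptions made for the example, not the actual GigaSpaces redo-log.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

/**
 * Simplified illustration of a redo-log: while the replication target is
 * unreachable, operations are queued in FIFO order; once the connection is
 * re-established they are replayed in the order they were written.
 * This is an illustrative sketch, not the GigaSpaces implementation.
 */
public class RedoLogSketch {

    private final Deque<String> redoLog = new ArrayDeque<>();
    private final Consumer<String> replicationTarget;
    private boolean targetReachable = true;

    public RedoLogSketch(Consumer<String> replicationTarget) {
        this.replicationTarget = replicationTarget;
    }

    /** Called by the primary for every write operation. */
    public void onWrite(String operation) {
        if (targetReachable) {
            replicationTarget.accept(operation);   // normal synchronous replication path
        } else {
            redoLog.addLast(operation);            // queue until the target is reachable again
        }
    }

    public void onDisconnect() {
        targetReachable = false;
    }

    /** Replays the backlog in FIFO order when the connection is re-established. */
    public void onReconnect() {
        while (!redoLog.isEmpty()) {
            replicationTarget.accept(redoLog.pollFirst());
        }
        targetReachable = true;
    }

    public static void main(String[] args) {
        RedoLogSketch log = new RedoLogSketch(op -> System.out.println("replicated: " + op));
        log.onWrite("write A");      // replicated immediately
        log.onDisconnect();
        log.onWrite("write B");      // queued
        log.onWrite("update B");     // queued
        log.onReconnect();           // "write B" then "update B" are replayed in order
    }
}
```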
Network partition with the long-term persistency

If the communication with the long-term persistency datastore fails, the replica logs all the operations until the connection is re-established. The log is also replicated to a backup node to ensure that the data won't be lost in case the primary partition fails before the data was successfully committed to the long-term storage.

Network partition between two or more data center sites

Most people refer to the scenario where two sites live in two separate locations and continue to work independently in the case of a network partition as THE reference scenario for partition tolerance. It is important to note that there are two classes of multi-site deployments that are fundamentally different as far as network partitions are concerned:

Disaster Recovery Site

Disaster recovery sites are very much like any node in a local network, but often live in different network segments and with higher latency than local networks. Nodes within a cluster can be tagged with a zone tag to mark their data-center affinity. The system can use this information to automatically provision primary and backup nodes between the two sites. It uses the zone tag to ensure that primary and backup are always spread between the two data centers.
Geographically Distributed sites over internet WAN

In this scenario, nodes are spread over internet connections where the SLAs are lower and the latency is significantly higher. In that case, it would be impractical to treat all of the nodes as a single cluster as in the previous case. It is more practical to use a federated cluster deployment (a cluster of clusters) in which asynchronous replication synchronizes the multiple sites, thereby avoiding the extreme latency overhead. In this mode, we can't achieve all three CAP properties. In most cases it is more common to choose AP over CA. Having said that, even when we choose AP we don't necessarily need to give up consistency completely. A suggested architecture combines such a federated deployment with asynchronous replication between the sites, providing a reasonable degree of consistency with a slight compromise on absolute availability and partition tolerance.
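One common way to retain a reasonable degree of consistency with asynchronous replication between sites is to version each record and resolve conflicting updates deterministically when replicated changes arrive. The sketch below shows a simple version-based merge in Java; the record structure, the merge rule, and the class names are assumptions made for illustration, not the GigaSpaces multi-site replication implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch of version-based conflict resolution for asynchronous
 * replication between two sites. Each record carries a monotonically increasing
 * version; a replicated update is applied only if it is newer than the local copy.
 * This is an assumption-based example, not the GigaSpaces implementation.
 */
public class SiteReplicaSketch {

    /** A record as stored on one site: value plus a version counter. */
    record VersionedValue(String value, long version) { }

    private final Map<String, VersionedValue> store = new ConcurrentHashMap<>();

    /** Local write: bumps the version of the record. */
    public void localWrite(String key, String value) {
        store.merge(key, new VersionedValue(value, 1),
                (old, ignored) -> new VersionedValue(value, old.version() + 1));
    }

    /** Applies an update replicated asynchronously from the remote site. */
    public void applyReplicatedUpdate(String key, VersionedValue remote) {
        store.merge(key, remote,
                (local, incoming) -> incoming.version() > local.version() ? incoming : local);
    }

    public VersionedValue get(String key) {
        return store.get(key);
    }

    public static void main(String[] args) {
        SiteReplicaSketch siteA = new SiteReplicaSketch();
        siteA.localWrite("user-42", "v1 written locally");           // version 1
        siteA.localWrite("user-42", "v2 written locally");           // version 2
        // A stale update replicated from the remote site is ignored...
        siteA.applyReplicatedUpdate("user-42", new VersionedValue("stale remote write", 1));
        // ...while a newer one wins.
        siteA.applyReplicatedUpdate("user-42", new VersionedValue("newer remote write", 3));
        System.out.println(siteA.get("user-42"));
    }
}
```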
Handling Extreme Failure

Multiple node Failures at the Same Time

In cases where both the primary and the backup of a given partition fail at the same time, we consider this cluster non-functional and block operations to it until it becomes functional again. In that case we compromise on certain aspects of Availability for the sake of Consistency. If your application can tolerate the potential consistency issues associated with such a failure, you can configure the cluster to route all operations to the currently available partitions. In that case we favor Availability over Consistency - this behavior can be useful in streaming scenarios where the system is used to pass events through.
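The two behaviors described above amount to a client-side routing policy for the case where an entire partition (primary and backup) is down: either fail the operation (favoring consistency) or reroute it to any available partition (favoring availability). The sketch below illustrates that choice in Java; the policy names and routing logic are assumptions made for the example, not a GigaSpaces configuration.

```java
import java.util.List;
import java.util.Optional;

/**
 * Illustrative sketch of the two policies for a fully failed partition: block the
 * operation (consistency first) or reroute it to any partition that is still up
 * (availability first, useful for pass-through/streaming scenarios).
 * This is an assumption-based example, not a GigaSpaces configuration option.
 */
public class PartitionDownPolicySketch {

    enum Policy { FAIL_FAST_FOR_CONSISTENCY, REROUTE_FOR_AVAILABILITY }

    /** Returns the partition id the operation should go to, or empty to reject it. */
    static Optional<Integer> route(int targetPartition, List<Boolean> partitionUp, Policy policy) {
        if (partitionUp.get(targetPartition)) {
            return Optional.of(targetPartition);                  // normal case
        }
        if (policy == Policy.FAIL_FAST_FOR_CONSISTENCY) {
            return Optional.empty();                              // block until the partition recovers
        }
        // Availability first: send the operation to any partition that is still up.
        for (int i = 0; i < partitionUp.size(); i++) {
            if (partitionUp.get(i)) {
                return Optional.of(i);
            }
        }
        return Optional.empty();                                  // nothing is available at all
    }

    public static void main(String[] args) {
        List<Boolean> partitionUp = List.of(true, false, true);   // partition 1 is fully down
        System.out.println(route(1, partitionUp, Policy.FAIL_FAST_FOR_CONSISTENCY)); // Optional.empty
        System.out.println(route(1, partitionUp, Policy.REROUTE_FOR_AVAILABILITY));  // Optional[0]
    }
}
```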
Recovery from a Total Failure

When the entire system crashes, it boots itself from the long-term persistent storage or from a snapshot (new). In that case we may lose some of the data that was kept in memory and had not yet been delivered to the long-term persistent storage. It is important to note that in a disaster-recovery setup this event can happen only when both sites go down at the same time. In that case the system recovers itself from the data that was last updated in the long-term storage.
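Recovery from a total failure essentially amounts to an initial load: on startup, each partition re-populates its in-memory state from the long-term store (or a snapshot), accepting that any update not yet persisted is lost. The following sketch illustrates that startup sequence; the PersistentStore interface is a placeholder assumption, not the GigaSpaces persistence API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of recovery from a total failure: on startup, the in-memory
 * state of a partition is rebuilt from whatever was last committed to long-term
 * storage. Updates that existed only in memory at crash time are lost.
 * PersistentStore is a placeholder interface, not the GigaSpaces persistence API.
 */
public class TotalRecoverySketch {

    /** Placeholder for the long-term storage (database or snapshot). */
    interface PersistentStore {
        Map<String, String> loadAll(int partitionId);
    }

    private final Map<String, String> inMemoryState = new HashMap<>();

    /** Rebuilds the partition's in-memory state from the persistent store. */
    public void recover(int partitionId, PersistentStore store) {
        inMemoryState.clear();
        inMemoryState.putAll(store.loadAll(partitionId));
        System.out.println("partition " + partitionId + " recovered "
                + inMemoryState.size() + " records from long-term storage");
    }

    public static void main(String[] args) {
        // A tiny in-memory "store" standing in for the real database or snapshot.
        PersistentStore store = partitionId -> Map.of("order-1001", "persisted state");
        new TotalRecoverySketch().recover(0, store);
    }
}
```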