Summary: About failure detection, reducing failure detection time, and relevant parameters.
Overview
If you are interested in reducing failure detection time, see Reducing Failure Detection Time below. Failure detection time is the time it takes for the space and the client to detect that a failure has occurred. Failure detection consists of two main phases:
- The backup space detects that the primary space is down, and takes over as primary.
- The client detects that the machine running the primary space is down. If it is running against a clustered space, it routes its requests to the new primary space (the backup space that has just taken over as primary).
If the client is running against a single primary space, a disconnection exception is thrown, and the client cannot proceed.
For more details about how failure scenarios are handled, refer to the Failover Group section. One of two main failure scenarios might occur:
- Process failure or machine crash
- Network cable disconnection
It takes GigaSpaces a few seconds to recover from a process failure or a machine crash. In case of network cable disconnection, the client first has to detect that it has been disconnected from the machine running the space. Therefore, recovery time in this case is longer. For details on how network failure is detected and handled, see the About Network Failure Detection section.
Reducing Failure Detection Time
Configuring failure detection time can help you handle extreme failure scenarios more effectively. For example, in extreme cases of network disconnection, you might want the failover process to take 2-3 seconds.
This type of configuration is advanced and very specific to your scenario, data loading, and other factors, and should therefore be performed carefully. If you are interested in making such changes, contact the GigaSpaces Support Team.
Failure Detection Parameters
Space Side Parameters
Active Election Parameters
The following parameters in the active election block of the cluster schema relate to failure detection and recovery (XPath overrides can be used instead of cluster schema values):
| Parameter | Parameter Description | Default Value |
| --- | --- | --- |
| yield-time | This parameter allows you to configure the time it takes to yield to other participants between every election phase. For more details on active election, see the Active Election and Avoiding Split-Brain Scenarios section. | 1000 |
| invocation-delay | This parameter limits the amount of time the backup space waits between each ping to the primary space. | 1000 |
| retry-count | Related to the invocation-delay parameter; defines the number of times the backup checks if the primary space has failed. | 3 |
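As a rough illustration of how invocation-delay and retry-count interact, the sketch below estimates how long the backup might take to conclude that the primary has failed. It assumes that the backup declares failure after retry-count consecutive failed pings spaced invocation-delay apart, and that both delays are in milliseconds. These are simplifying assumptions, not documented behavior, and the class and method names are hypothetical. The estimate ignores per-ping timeouts and the election that follows.

```java
/** Back-of-the-envelope estimate; not part of the GigaSpaces API. */
final class ActiveElectionEstimate {

    /**
     * Approximate time for the backup to give up on the primary, assuming
     * one ping every invocationDelayMs and retryCount failed pings in a row.
     */
    static long detectionWindowMs(long invocationDelayMs, int retryCount) {
        if (invocationDelayMs < 0 || retryCount < 0) {
            throw new IllegalArgumentException("values must be non-negative");
        }
        return invocationDelayMs * retryCount;
    }

    public static void main(String[] args) {
        // Default values listed for invocation-delay (1000) and retry-count (3).
        System.out.println(detectionWindowMs(1000, 3) + " ms");
        // A hypothetical, more aggressive setting, for comparison only.
        System.out.println(detectionWindowMs(300, 3) + " ms");
    }
}
```

Under these assumptions the defaults give a window of roughly three seconds, which is consistent with the "few seconds" recovery time mentioned earlier, but the real figure depends on the factors the sketch leaves out.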
Watchdog Parameters
These parameters are set both on the space side and the client side. Therefore, they are also mentioned in the Client Side Parameters section below.
| Parameter | Parameter Description | Default Value |
| --- | --- | --- |
| -Dcom.gs.transport_protocol.lrmi.connect_timeout | This system property defines a timeout (in [ms]) for the service's attempts to establish a socket connection. | 30000 |
| -Dcom.gs.transport_protocol.lrmi.request_timeout | This system property defines the timeout (in [ms]) for the backup space's requests to the primary space, to verify that the connection hasn't been broken. | 30000 |
Client Side Parameters
Liveness Detection Properties
The liveness detection mechanism defines how frequently the system checks the liveness of members, that is, whether available members have become unavailable:
| Property | Property Description | Default Value |
| --- | --- | --- |
| -Dcom.gs.cluster.livenessMonitorFrequency | This system property defines the heartbeat interval at which the client pings the spaces. | 10000 |
| -Dcom.gs.cluster.livenessDetectorFrequency | This system property defines the interval at which the client performs spaceFinder.find on the spaces it failed to ping (see livenessMonitorFrequency). | 5000 |
Watchdog Parameters
These parameters are set both on the space side and the client side. Therefore, they are also mentioned in the Space Side Parameters section above.
| Parameter | Parameter Description | Default Value |
| --- | --- | --- |
| -Dcom.gs.transport_protocol.lrmi.connect_timeout | This system property defines a timeout (in [ms]) for the client's attempts to establish a socket connection to the space. | 30000 |
| -Dcom.gs.transport_protocol.lrmi.request_timeout | This system property defines the timeout (in [ms]) for the client's requests to the primary space, to verify that the connection hasn't been broken. | 30000 |
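System properties such as these are normally passed on the JVM command line with -D. As a minimal sketch, they can also be set programmatically with the standard System.setProperty call. This assumes the code runs before any GigaSpaces class reads the properties; that ordering requirement is an assumption here, not something this page states. The class name and the 5000 ms values are hypothetical and chosen only for illustration.

```java
/** Illustrative helper; the class is hypothetical, the property names are from the tables above. */
final class WatchdogTimeouts {

    static final String CONNECT_TIMEOUT = "com.gs.transport_protocol.lrmi.connect_timeout";
    static final String REQUEST_TIMEOUT = "com.gs.transport_protocol.lrmi.request_timeout";

    /** Sets both watchdog timeouts (in milliseconds) as JVM system properties. */
    static void apply(long connectTimeoutMs, long requestTimeoutMs) {
        if (connectTimeoutMs <= 0 || requestTimeoutMs <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        System.setProperty(CONNECT_TIMEOUT, Long.toString(connectTimeoutMs));
        System.setProperty(REQUEST_TIMEOUT, Long.toString(requestTimeoutMs));
    }

    public static void main(String[] args) {
        // Hypothetical values, lower than the 30000 ms defaults.
        apply(5000, 5000);
        System.out.println(CONNECT_TIMEOUT + "=" + System.getProperty(CONNECT_TIMEOUT));
        System.out.println(REQUEST_TIMEOUT + "=" + System.getProperty(REQUEST_TIMEOUT));
    }
}
```

The command-line equivalent would be to pass each property as a -D argument when starting the JVM, which avoids any question of initialization order.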
Service Grid Parameters
The Service Grid uses two complementary mechanisms for service detection – the Lookup Service and fault-detection handlers. The fault-detection handlers are:
- GSMFaultDetectionHandler – used by GSMs to monitor each other.
- GSCFaultDetectionHandler – used by the GigaSpaces Management Center to monitor GSCs.
- PUFaultDetectionHandler – used by GSMs to monitor Processing Units deployed on GSCs.
The fault-detection handlers periodically check whether a service is alive. Their settings determine how often the check runs and, in case of failure, how many times to retry and at what interval.
The GSM and GSC fault-detection handler settings are located in the services.config file. The PUFaultDetectionHandler is configurable using the SLA - member alive indicator.
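To monitor service failures in the logs, it is advised to set the logging level of the fault-detection handlers to Level.FINE. The relevant logger entries are shown below at INFO; change the value to FINE to get the more detailed output: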
# ServiceGrid FaultDetectionHandler logging
com.gigaspaces.grid.gsc.GSCFaultDetectionHandler.level = INFO
com.gigaspaces.grid.gsm.GSMFaultDetectionHandler.level = INFO
org.openspaces.pu.container.servicegrid.PUFaultDetectionHandler.level = INFO
Jini Parameters
The LeaseRenewalManager in the advanced-space.config file is also related to failure detection and recovery:
| Parameter | Parameter Description | Default Value |
| --- | --- | --- |
| maxLeaseDuration | The time the system waits between every lease renewal. For example, if the parameter value is 8000, the system renews the space lease every 8000 [ms]. As this value is reduced, renewal requests are performed more frequently while the service is up, and lease expiration occurs sooner when the service goes down. | 8000 |
| roundTripTime | This parameter instructs the renewal process to begin a certain amount of time (by default, 100 [ms]) before the actual renewal time, thus making sure that the renewal process is successful. Significantly low values might result in failure to renew a lease. Durations of managed leases should exceed the roundTripTime. | 4000 |
For more details on the Lease Renewal Manager, refer to the Managing Resources Lease section.
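The relationship between the two lease parameters can be sketched with a small calculation. The example below assumes that a renewal attempt starts roundTripTime milliseconds before the lease would expire, so the attempt begins maxLeaseDuration minus roundTripTime after the lease was granted. This is a simplified reading of the descriptions above, and the class and method names are hypothetical. The check mirrors the guidance that lease durations should exceed roundTripTime.

```java
/** Simplified timing model; not the LeaseRenewalManager implementation. */
final class LeaseRenewalTiming {

    /**
     * Offset (ms after the lease is granted) at which renewal is assumed to start.
     * Rejects configurations where the lease does not outlast the round trip.
     */
    static long renewalStartOffsetMs(long maxLeaseDurationMs, long roundTripTimeMs) {
        if (roundTripTimeMs < 0) {
            throw new IllegalArgumentException("roundTripTime must be non-negative");
        }
        if (maxLeaseDurationMs <= roundTripTimeMs) {
            throw new IllegalArgumentException("maxLeaseDuration must exceed roundTripTime");
        }
        return maxLeaseDurationMs - roundTripTimeMs;
    }

    public static void main(String[] args) {
        // Default values from the table: maxLeaseDuration=8000, roundTripTime=4000.
        System.out.println(renewalStartOffsetMs(8000, 4000) + " ms");
    }
}
```

With the listed defaults, this model starts the renewal about halfway through each lease. Lowering maxLeaseDuration without also lowering roundTripTime shrinks that margin until the configuration becomes invalid.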
Unicast discovery parameters
When a Jini Lookup Service fails and is brought back online, a client (such as a GSC, a space, or a client with a space proxy) needs to rediscover it. The client uses Jini unicast discovery, retrying the connection to the failed remote lookup service.
The default unicast retry protocol takes a graduated approach: the wait before the next discovery attempt increases with each invocation, eventually reaching a maximum interval at which discovery is retried. In this way, the network is not flooded with unicast discovery requests referencing a lookup service that may not be available for quite some time (if ever).
The downside is that this may delay the discovery of services that are not brought up quickly. A discovery can be delayed by as much as 15 minutes. For example, if you have two GSMs and one fails and is brought back up only an hour later, its discovery can take about 15 minutes after it has loaded. These settings can be configured; see How to Configure Unicast Discovery.
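The graduated retry idea can be illustrated with a capped backoff schedule. The sketch below is not the Jini implementation and does not reproduce its actual intervals: it assumes a hypothetical starting delay of 5 seconds that doubles on each attempt, capped at the 15-minute maximum mentioned above. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Illustrative capped-doubling schedule; the real Jini intervals may differ. */
final class GraduatedRetrySchedule {

    /** Returns the wait (ms) before each of the first `attempts` retries. */
    static List<Long> delaysMs(long initialMs, long maxMs, int attempts) {
        if (initialMs <= 0 || maxMs < initialMs || attempts < 0) {
            throw new IllegalArgumentException("invalid schedule parameters");
        }
        List<Long> delays = new ArrayList<>();
        long delay = initialMs;
        for (int i = 0; i < attempts; i++) {
            delays.add(delay);
            // Double the wait, but never exceed the configured maximum.
            delay = Math.min(delay * 2, maxMs);
        }
        return delays;
    }

    public static void main(String[] args) {
        long start = TimeUnit.SECONDS.toMillis(5);  // hypothetical starting delay
        long cap = TimeUnit.MINUTES.toMillis(15);   // maximum interval from the text
        for (long d : delaysMs(start, cap, 10)) {
            System.out.println(TimeUnit.MILLISECONDS.toSeconds(d) + " s");
        }
    }
}
```

Once the schedule reaches the cap, every further attempt waits the full maximum interval. This is why a lookup service that returns after a long outage may not be rediscovered until the current wait expires.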