Summary: The network failure detection facility provides a robust and optimized mechanism to identify network failures.

Overview

When a client reads from an NIO SocketChannel and the connection between the client and server fails abnormally, no notification is generated. The most common example of such a failure is a network cable being disconnected while a read operation is in progress. In this case the read operation blocks for a long time without ever learning that the connection is gone. This can cause several problems, such as exhausting the number of open sockets on the server or causing client read requests to hang.

Known solutions and the problems they raise:

  1. TCP keep-alive mechanism – TCP can send keep-alive packets to detect such failures, but the keep-alive interval typically defaults to 2 hours and is usually tuned at the operating-system level rather than per connection, which is too slow to be useful here.
  2. Read timeout – the old Java IO package allowed read operations to be executed with a user-defined timeout. This feature does not work in NIO: you can still set the timeout on channel.socket(), but it applies only to reads from the socket's InputStream, not to channel reads.
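
The limitation in item 2 can be illustrated with a generic NIO workaround: put the channel in non-blocking mode and wait for readability with Selector.select(timeout). The sketch below is only an illustration of that general technique, not the mechanism this facility uses; the class name TimedChannelRead and the method readWithTimeout are hypothetical.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

// Hypothetical helper: a bounded-wait read on an NIO channel.
final class TimedChannelRead {

    /**
     * Waits at most timeoutMillis for the channel to become readable, then reads.
     * Returns the number of bytes read, or -1 at end of stream.
     * Throws SocketTimeoutException if nothing became readable in time
     * (a wakeup or thread interrupt during the wait is also reported as a timeout).
     * The channel is left in non-blocking mode.
     */
    static int readWithTimeout(SocketChannel channel, ByteBuffer buffer, long timeoutMillis)
            throws IOException {
        if (timeoutMillis <= 0) {
            // select(0) would block indefinitely, which defeats the purpose.
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        channel.configureBlocking(false);
        // A short-lived selector keeps the sketch simple; closing it deregisters the channel.
        try (Selector selector = Selector.open()) {
            channel.register(selector, SelectionKey.OP_READ);
            if (selector.select(timeoutMillis) == 0) {
                throw new SocketTimeoutException("no data within " + timeoutMillis + " ms");
            }
            return channel.read(buffer);
        }
    }
}
```

Opening a selector per read is wasteful under load; a long-lived selector shared by many channels would be the more usual design, at the cost of more bookkeeping.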

The network failure detection mechanism applies to both space clients and the space server side. It supports timeout operations (read/take with a timeout larger than 0), as well as write and batch operations. It is efficient in terms of memory and CPU consumption, with minimal overhead. Network failure detection is applicable only to the NIO protocol.

The network failure detection mechanism is timeout-based – each read operation has a predefined timeout. There are two timeout modes, listening timeout and request timeout:

  • Listening timeout – occurs when a thread listens on a server socket for longer than the defined timeout. This happens when a space listens for client requests, or when a client waits for server notifications.
    • Listening timeout handling – the timeout is treated as a broken link, and the server closes the existing connection to the client. When the link between the server and the client is restored, the connection is re-established.
    • Heartbeat mechanism – used to avoid closing valid connections. A dummy request is sent from the client to the server several times before the connection is closed.
  • Request timeout – occurs when the client sends a request to the space and doesn't get a reply within the defined timeout.
    • Request timeout handling – when the timeout expires, the connection is first tested by establishing a new connection to the server on the same port. If the new connection is established successfully, the timeout is ignored and the connection timeout is reset. Otherwise, the connection is closed.
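
The request timeout handling described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the class ConnectionProbe, the enum Action, and the methods canConnect and onRequestTimeout are hypothetical names, and the probe timeout is an assumed parameter.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch of the "test the link with a new connection" step.
final class ConnectionProbe {

    /** What the caller should do with the original connection after a request timeout. */
    enum Action { RESET_TIMEOUT, CLOSE_CONNECTION }

    /** Returns true if a new TCP connection to host:port can be opened within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket probe = new Socket()) {
            probe.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;   // the server is reachable; the probe socket is closed right away
        } catch (IOException e) {
            return false;  // refused, unreachable, or the probe itself timed out
        }
    }

    /** Probe succeeded: keep waiting with a fresh timeout. Probe failed: close the connection. */
    static Action onRequestTimeout(String host, int port, int probeTimeoutMillis) {
        return canConnect(host, port, probeTimeoutMillis)
                ? Action.RESET_TIMEOUT
                : Action.CLOSE_CONNECTION;
    }
}
```

A caller would invoke onRequestTimeout(host, port, ...) when a reply is overdue and then either restart its timer or close the channel, depending on the returned Action.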

Failure detection uses a watchdog design pattern to monitor the timeouts of NIO operations. The watchdog is a singleton thread that wakes up at a fixed interval, determined by the timeout resolution. On each run it checks all the threads that are currently registered and fires a timeout event on those that have exceeded their allowed timeout.
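
To make the watchdog pattern concrete, here is a minimal sketch of a singleton watchdog thread. It shows the general pattern only and is not the product's code: the class TimeoutWatchdog, its methods, and the 10 ms resolution are all assumptions chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a singleton watchdog that fires timeouts for registered threads.
final class TimeoutWatchdog {

    // Assumed resolution of 10 ms; a real system would make this configurable.
    private static final TimeoutWatchdog INSTANCE = new TimeoutWatchdog(10);

    /** One watched operation: its absolute deadline and the action to run on timeout. */
    private static final class Watch {
        final long deadlineNanos;
        final Runnable onTimeout;
        Watch(long deadlineNanos, Runnable onTimeout) {
            this.deadlineNanos = deadlineNanos;
            this.onTimeout = onTimeout;
        }
    }

    private final Map<Thread, Watch> watched = new ConcurrentHashMap<>();
    private final long resolutionMillis;

    private TimeoutWatchdog(long resolutionMillis) {
        this.resolutionMillis = resolutionMillis;
        Thread thread = new Thread(this::runLoop, "timeout-watchdog");
        thread.setDaemon(true); // never keeps the JVM alive on its own
        thread.start();
    }

    static TimeoutWatchdog getInstance() {
        return INSTANCE;
    }

    /** Registers the calling thread; onTimeout runs on the watchdog thread if the deadline passes. */
    void register(long timeoutMillis, Runnable onTimeout) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        watched.put(Thread.currentThread(), new Watch(deadline, onTimeout));
    }

    /** Called by the operating thread when its operation completes in time. */
    void unregister() {
        watched.remove(Thread.currentThread());
    }

    private void runLoop() {
        while (true) {
            try {
                Thread.sleep(resolutionMillis);
            } catch (InterruptedException e) {
                return;
            }
            long now = System.nanoTime();
            for (Map.Entry<Thread, Watch> entry : watched.entrySet()) {
                Watch watch = entry.getValue();
                // remove(key, value) ensures each timeout fires at most once.
                if (now - watch.deadlineNanos >= 0 && watched.remove(entry.getKey(), watch)) {
                    try {
                        watch.onTimeout.run();
                    } catch (RuntimeException ignored) {
                        // A failing callback must not kill the watchdog thread.
                    }
                }
            }
        }
    }
}
```

A thread would call register(timeout, callback) before a channel read and unregister() afterwards. One plausible callback closes the channel, which in Java NIO unblocks a thread stuck in a read with an AsynchronousCloseException. Because the watchdog only wakes once per resolution interval, a timeout may fire up to one interval late.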

When a network failure occurs and notify is used, the space tries to resend the notification the number of times specified in the <notifier-retries> property. When using a clustered space, <notifier-retries> is used together with the network failure detection properties.
