
      1 It enables a failed process to recover to the state it had prior to the failure by replaying the logged nondeterministic events.

      2 It limits the size of the log. By taking a checkpoint periodically, the logged events prior to the checkpoint can be garbage collected (a minimal sketch of this truncation follows this list).
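
      To make the log-truncation point concrete, the following sketch shows how a periodic checkpoint bounds the size of the log. It is written in Python purely for illustration; the file names checkpoint.dat and message.log, and the use of pickle, are assumptions and not part of the text.

import os
import pickle

CHECKPOINT_FILE = "checkpoint.dat"   # illustrative file names
MESSAGE_LOG = "message.log"

def take_checkpoint(state):
    """Persist the current process state, then garbage-collect the log.

    Once the state is safely on stable storage, every event logged
    before this point is no longer needed for recovery, so the log
    can be truncated; this is what keeps its size bounded.
    """
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())          # force the checkpoint to stable storage
    os.replace(tmp, CHECKPOINT_FILE)  # atomically install the new checkpoint
    open(MESSAGE_LOG, "wb").close()   # discard (garbage-collect) the earlier log entries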

      Logging protocols can be classified into three types [7]:

       ◾ Pessimistic logging. A message received is synchronously logged prior to its execution.

       ◾ Optimistic logging. To reduce the latency overhead, the nondeterministic events are first stored in volatile memory and logged asynchronously to stable storage. Consequently, the failure of a process might result in the permanent loss of some messages, which would force a rollback to a state earlier than the one the process was in when it failed.

       ◾ Causal logging. The nondeterministic events (and their determinants, such as the delivery order of the messages received at a process) that have not yet been logged to stable storage are piggybacked on each message sent. With the piggybacked information, a process has access to all the nondeterministic events that may have causal effects on its state, thereby enabling a consistent recovery of the system upon a failure.

      On the other hand, pessimistic logging protocols are much simpler in their design and implementation, and failure recovery can be made much faster [11] (specific advantages are elaborated in section 2.3.1.1 below). Therefore, our discussion will focus on pessimistic logging techniques, and there will be no further elaboration on optimistic and causal logging.

      The most straightforward implementation of pessimistic logging is to synchronously log every incoming message to stable storage before it is executed at a process. Each process can checkpoint its state periodically at its own pace, without the need to coordinate with other processes in the distributed system. Upon recovery from a failure, a process restores its state using the last checkpoint and replays all logged incoming messages to bring itself back to the state it was in right before the failure.
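
      The following sketch illustrates this straightforward implementation in Python. The class and method names are assumptions made for illustration; the essential points are that each incoming message is forced to stable storage before it is executed, and that recovery restores the last checkpoint and then replays the logged messages in the order they were logged.

import os
import pickle

class PessimisticLoggingProcess:
    """Minimal sketch of a process that uses pessimistic message logging."""

    def __init__(self, log_path="message.log", ckpt_path="checkpoint.dat"):
        self.log_path = log_path
        self.ckpt_path = ckpt_path
        self.state = {}          # application state (illustrative)

    def on_message(self, msg):
        # Pessimistic logging: the message is synchronously logged
        # to stable storage *before* it is executed.
        with open(self.log_path, "ab") as log:
            log.write(pickle.dumps(msg))
            log.flush()
            os.fsync(log.fileno())
        self.execute(msg)

    def execute(self, msg):
        # Placeholder for the application's deterministic message handler.
        pass

    def recover(self):
        # 1. Restore the state from the last checkpoint, if one exists.
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path, "rb") as f:
                self.state = pickle.load(f)
        # 2. Replay the logged messages, in the order they were logged,
        #    to reach the state held right before the failure.
        if os.path.exists(self.log_path):
            with open(self.log_path, "rb") as log:
                unpickler = pickle.Unpickler(log)
                while True:
                    try:
                        self.execute(unpickler.load())
                    except EOFError:
                        break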

      EXAMPLE 2.5

Figure 2.11 Schematic illustration of an example of pessimistic logging.

      Pessimistic logging can cope with the concurrent failure and recovery of two or more processes, as illustrated in the example shown in Figure 2.11. Messages received while a process is recovering (i.e., while it is restoring its state using the latest checkpoint and by replaying all the logged messages) can be buffered and examined when the process completes its recovery. It is possible that while a process is engaged in a recovery, another process fails and recovers itself concurrently, as the above example shows. In this case, P1 would receive a duplicate message (m6) regenerated by another recovering process, P2, and temporarily buffer it. P1 then discards it as soon as it has completed its recovery. Similarly, P2 would receive the duplicate message m4 regenerated by P1, which will be discarded after the recovery is completed.
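
      One way to detect and discard such duplicates is to tag every message with the sender's identifier and a sequence number, and to have the receiver remember the highest sequence number it has already executed for each sender. This particular mechanism is an assumption made for illustration here; duplicate detection is discussed further below. A minimal sketch in Python:

class DuplicateFilter:
    """Sketch of duplicate detection based on per-sender sequence numbers."""

    def __init__(self):
        self.last_seen = {}   # sender id -> highest sequence number executed

    def is_duplicate(self, sender_id, seq_no):
        return seq_no <= self.last_seen.get(sender_id, -1)

    def record(self, sender_id, seq_no):
        self.last_seen[sender_id] = max(seq_no, self.last_seen.get(sender_id, -1))


# Usage: a recovering process regenerates m6 with its original sequence
# number, so the receiver recognizes it as a duplicate and discards it.
f = DuplicateFilter()
f.record("P2", 6)                    # m6 was executed before the failure
assert f.is_duplicate("P2", 6)       # the regenerated m6 is discarded
assert not f.is_duplicate("P2", 7)   # a genuinely new message is accepted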

       2.3.1.1 Benefits of Pessimistic Logging.

      It is apparent that pessimistic logging has a number of very desirable characteristics:

       ◾ Processes do not need to track their dependencies. The relative ordering of the incoming messages to each process is naturally reflected in the log (i.e., during recovery, the messages in the log will be replayed in the order in which they are logged). Hence, the pessimistic logging mechanism is straightforward to implement and less error prone.

       ◾ Output commit is free with pessimistic logging. Because every message is logged to stable storage before it is executed, the state from which an output is sent is always recoverable, so output can be released to the outside world immediately without any additional coordination. This is a great fit for distributed applications that interact with their users frequently.

       ◾ There is no need to carry out coordinated global checkpointing because, by replaying the logged messages, a process can always bring itself to a state consistent with the other processes in the system. This further reduces the complexity of adding rollback recovery support to applications. Furthermore, a process can decide when it is best to take a local checkpoint, for example, when its message log grows too big.

       ◾ Recovery can be done completely locally at the failed process. The only impact on other processes is that they might receive duplicate messages, which they simply discard. Hence, the recovery is simpler and in general faster than with optimistic and causal logging. The localization of failure recovery also means that pessimistic logging supports the concurrent recovery of multiple failed processes.

      There are three issues that warrant additional elaboration: reconnection, message duplicate detection, and atomic message receiving and logging.

      Reconnection. A process must be able to cope with temporary connection failures and be ready to accept reconnections from other processes. This is an essential requirement for a recoverable distributed system. It calls for a design in which the application logic is independent of transport-level events, which can be achieved by using an event-based [8] or document-based distributed computing architecture such as Web services [15], in conjunction with appropriate exception handling.
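
      As an illustration of keeping the application logic independent of transport-level events, the sketch below wraps a TCP connection so that connection failures are caught and handled by reconnecting and retrying underneath the application-level send. It uses plain Python sockets; the retry count and back-off delay are assumptions made for illustration.

import socket
import time

class ReconnectingChannel:
    """Sketch of a channel that hides transport failures from the application."""

    def __init__(self, host, port, retries=5, delay=1.0):
        self.host, self.port = host, port
        self.retries, self.delay = retries, delay
        self.sock = None

    def _connect(self):
        self.sock = socket.create_connection((self.host, self.port))

    def send(self, data: bytes):
        # The application simply calls send(); temporary connection
        # failures are handled here by reconnecting and retrying.
        for attempt in range(self.retries):
            try:
                if self.sock is None:
                    self._connect()
                self.sock.sendall(data)
                return
            except OSError:
                self.sock = None        # drop the broken connection
                time.sleep(self.delay)  # back off before reconnecting
        raise ConnectionError("peer unreachable after repeated reconnection attempts")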