2 describes the checkpointing and logging mechanisms, which are widely used in practice to achieve some form of fault tolerance. Checkpointing and logging make the system recoverable but do not prevent service disruption. These mechanisms are relatively simple to implement and understand, and they incur minimal runtime overhead while demanding only modest extra resources (stable storage). Furthermore, checkpointing and logging serve as the foundation for more sophisticated dependability techniques.
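To make the idea concrete, the following is a minimal sketch of checkpointing combined with logging, not a mechanism taken from the chapter itself. It assumes a toy key-value store whose stable storage is a pair of local files; the class name `CheckpointedStore`, the file names, and the `demo` helper are all hypothetical.

```python
import json
import os
import tempfile


class CheckpointedStore:
    """Toy key-value store that recovers from a checkpoint plus a log."""

    def __init__(self, directory):
        # Hypothetical file names standing in for stable storage.
        self.checkpoint_path = os.path.join(directory, "checkpoint.json")
        self.log_path = os.path.join(directory, "updates.log")
        self.state = {}
        self.recover()

    def apply(self, key, value):
        # Logging: force the update to stable storage before applying it.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.state[key] = value

    def checkpoint(self):
        # Checkpointing: save the full state atomically, then empty the log.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.checkpoint_path)
        open(self.log_path, "w").close()

    def recover(self):
        # Recovery: load the last checkpoint, then replay the logged updates.
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                self.state = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    entry = json.loads(line)
                    self.state[entry["key"]] = entry["value"]


def demo():
    with tempfile.TemporaryDirectory() as directory:
        store = CheckpointedStore(directory)
        store.apply("x", 1)
        store.checkpoint()
        store.apply("y", 2)  # logged, but not yet part of any checkpoint
        # Simulate a crash and restart by building a fresh instance.
        return CheckpointedStore(directory).state
```

Calling `demo()` shows that the restarted store sees both updates: `x` comes from the checkpoint and `y` from replaying the log. The service is still unavailable between the crash and the end of recovery, which is the sense in which these mechanisms provide recoverability without preventing disruption.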
Chapter 3 covers research on recovery-oriented computing, including fault detection and diagnosis, microreboot, and system-level undo and redo. Recovery-oriented computing aims to facilitate faster recovery after a system failure and thereby improve the availability of the system. Similar to checkpointing and logging, the mechanisms for recovery-oriented computing do not prevent service disruption. Hence, recovery-oriented computing is a promising approach for many e-commerce applications, but it is not suitable for applications that require high reliability.
Chapter 4 outlines the replication technique for data and service fault tolerance. This is the fundamental technique to ensure high reliability. Through active replication (i.e., the use of multiple redundant copies of the application processes), the system would be able to mask the failure of a replica and continue to process clients’ requests (this is not entirely true: as we will show in later chapters, some failures may cause extended periods of system unavailability). With replication comes the complexity of maintaining consistency. Ideally, the replicas should always remain consistent with each other. However, doing so might incur too much runtime overhead to be acceptable for some applications, or may cause extended periods of system unavailability. Hence, strict consistency may have to be compromised either for better performance [15] or for better availability [19].
Chapter 5 explains group communication systems, which can be used to implement active replication. A group communication system typically offers a totally ordered reliable multicast service for messages, a membership service, and a view synchrony service. This set of services helps the replicas maintain consistency even in the presence of failures, which reduces the development cost of building dependable systems with active replication.
Chapter 6 discusses the consensus problem and describes several Paxos algorithms, including Classic Paxos, Dynamic Paxos, Cheap Paxos, and Fast Paxos. While it is easy for a group of processes to agree on the same value if all processes can communicate with each other promptly and none of them fails, distributed consensus is an incredibly hard problem when processes might fail and messages might be delayed for extended periods. The Classic Paxos algorithm solves the consensus problem (under the non-malicious fault model) in a very elegant and efficient manner by separating the safety concern from the liveness concern [9]. Additional Paxos algorithms were developed to minimize the resources required, and to reduce the latency of achieving consensus by using a higher redundancy level [10, 18].
Chapter 7 introduces the problem of Byzantine fault tolerance. A Byzantine fault is synonymous with a malicious fault. Because a maliciously faulty component may choose to behave like any of the non-malicious faults, the Byzantine fault model encompasses any arbitrary fault. The distributed consensus problem under the Byzantine fault model was first studied several decades ago by Lamport, Shostak, and Pease [11]. A much more efficient algorithm for achieving fault tolerance under the Byzantine fault model (referred to as Practical Byzantine Fault Tolerance) was proposed by Castro and Liskov in 1999 [5]. Since then, research on Byzantine fault tolerance has exploded. With the pervasiveness of cyberattacks and espionage, dealing with malicious faults has become a far more urgent concern than it was several decades ago.
Chapter 8 provides an overview of cryptocurrency and the blockchain technology, including the early conception of cryptocurrency, the first implementation of cryptocurrency in Bitcoin [12], the concept of smart contracts and its implementation in Ethereum [4], as well as the vision of decentralized organizations [16] powered by smart contracts and the blockchain technology.
Chapter 9 explains in depth the consensus algorithms used in the blockchain technology. Since the original Proof-of-Work (PoW) algorithm was introduced in Bitcoin, considerable effort has gone into improving PoW in various aspects and into finding alternative algorithms that do not consume as much energy. A common set of requirements for such algorithms is laid out [22], and different proposals are examined with respect to these requirements [17]. In this chapter, we also discuss the Proof-of-Stake (PoS) consensus algorithm, which is the second best-known consensus algorithm for blockchain after PoW. We will explain the PoS implementation in PeerCoin [8]. Introduced in 2013, it is the first implementation of PoS in a practical cryptocurrency, and it has gone through several revisions to address its initial vulnerabilities.
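As a rough illustration of why PoW is costly to perform yet cheap to check, the sketch below searches for a nonce whose hash falls below a target. It is a simplified hash puzzle under stated assumptions, not Bitcoin's actual block format or difficulty encoding; the function names and the `difficulty_bits` parameter are hypothetical.

```python
import hashlib


def block_hash(data: bytes, nonce: int) -> int:
    # Hash the payload together with the nonce and read the digest as an integer.
    digest = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def mine(data: bytes, difficulty_bits: int) -> int:
    # Try nonces in order until the hash has `difficulty_bits` leading zero bits.
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while block_hash(data, nonce) >= target:
        nonce += 1
    return nonce


def verify(data: bytes, nonce: int, difficulty_bits: int) -> bool:
    # Verification needs a single hash, however long the search took.
    return block_hash(data, nonce) < (1 << (256 - difficulty_bits))
```

For example, `mine(b"block", 12)` needs about 2^12 hash evaluations on average, while `verify` needs one. Each additional difficulty bit doubles the expected search work, which is the source of the energy consumption that the alternative algorithms try to avoid.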
Chapter 10 presents applications of the blockchain technology and the issues that will directly affect how widely it can be adopted, including the value of the blockchain technology and the efforts to increase the throughput of blockchain systems [1, 3, 14, 21]. We primarily focus on blockchain applications in the area of cyber-physical systems (CPS) [20]. CPS is evolving rapidly, and the integration of blockchain and CPS could potentially transform CPS design for much stronger security and robustness.
Wenbing Zhao
Cleveland, USA
March 2021
References
1. E. Akbari, W. Zhao, S. Yang, and X. Lou. The impact of block parameters on the throughput and security of blockchains. In Proceedings of the 2020 International Conference on Blockchain Technology, pages 13–18. ACM, 2020.
2. A. Arnold. Assessing the financial impact of downtime, April 2010. http://www.businesscomputingworld.co.uk/assessing-the-financial-impact-of-downtime/.
3. A. Back, M. Corallo, L. Dashjr, M. Friedenbach, G. Maxwell, A. Miller, A. Poelstra, J. Timón, and P. Wuille. Enabling blockchain innovations with pegged sidechains. URL: http://www.opensciencereview.com/papers/123/enablingblockchain-innovations-with-pegged-sidechains, 72, 2014.
4. V. Buterin et al. Ethereum white paper. https://ethereum.org/en/whitepaper/, 2013.
5. M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI ’99, pages 173–186, Berkeley, CA, USA, 1999. USENIX Association.
6. Channel Insider. Unplanned IT outages cost more than $5,000 per minute: Report. http://www.channelinsider.com/c/a/Spotlight/Unplanned-IT-Outages-Cost-More-than-5000-per-Minute-Report-105393/, May 2011.
7. J. Clark. The price of data center availability, October 2011. http://www.data-centerjournal.com/design/the-price-of-data-center-availability/.
8. S. King and S. Nadal. PPCoin: Peer-to-peer crypto-currency with proof-of-stake. https://www.peercoin.net/assets/paper/peercoin-paper.pdf, 2008.
9.