Cover Page

Scrivener Publishing

100 Cummings Center, Suite 541J

Beverly, MA 01915-6106

Publishers at Scrivener

Martin Scrivener (martin@scrivenerpublishing.com)

Phillip Carmical (pcarmical@scrivenerpublishing.com)

From Traditional Fault Tolerance to Blockchain

Wenbing Zhao

Cleveland State University


To My Parents

List of Figures

1.1 An example of a chain of threats with two levels of recursion.

1.2 Rollback recovery is enabled by periodically taking checkpoints and, usually, logging the requests received.

1.3 With redundant instances in the system, the failure of a replica can in some cases be masked so that the system continues providing services to its clients without any disruption.

1.4 Main types of assets in a distributed system.

2.1 An example distributed system.

2.2 Consistent and inconsistent global state examples.

2.3 An example of the domino effect in recovery with uncoordinated checkpointing.

2.4 Finite state machine specification for the coordinator in the Tamir and Sequin checkpointing protocol.

2.5 Finite state machine specification for the participant in the Tamir and Sequin checkpointing protocol.

2.6 Normal operation of the Tamir and Sequin checkpointing protocol in an example three-process distributed system.

2.7 Finite state machine specification for the Chandy and Lamport distributed snapshot protocol.

2.8 Normal operation of the Chandy and Lamport global snapshot protocol in an example three-process distributed system.

2.9 A comparison of the channel state definition between (a) the Chandy and Lamport distributed snapshot protocol and (b) the Tamir and Sequin global checkpointing protocol.

2.10 Example state intervals.

2.11 An example for pessimistic logging.

2.12 Transport level (a) and application level (b) reliable messaging.

2.13 Optimization of pessimistic logging: (a) concurrent message logging and execution (b) logging batched messages.

2.14 Probability density function of the logging latency.

2.15 A summary of the mean logging latency and mean end-to-end latency under various conditions.

2.16 Probability density function of the end-to-end latency.

2.17 Normal operation of the sender-based logging protocol.

2.18 An example normal operation of the sender-based logging protocol.

2.19 Two concurrent failures could result in the loss of determinant information for regular messages.

3.1 The three-tier architecture.

3.2 The Java EE architecture.

3.3 An example runtime path of an end-user request.

3.4 Component class and component instances.

3.5 The chi-square cumulative distribution function for degree of freedom of 1, 2, 3, 4, 5.

3.6 The path shape of the example runtime path shown in Figure 3.3.

3.7 Component class and component instances.

3.8 Dependency templates for nodes, processes, network paths, and the neighbor sets.

3.9 A partial dependency graph for an example system.

3.10 The error function.

3.11 A hypothetical dependency graph with abnormality for each component and the weight for each edge labeled.

3.12 The components that form a cycle in the f-map are reduced to a single unit in the r-map for recursive recovery.

3.13 The architecture of an Operator Undo framework.

4.1 The replication algorithm is typically implemented in a fault tolerance middleware framework.

4.2 Active replication, without (top) and with (bottom) voting at the client.

4.3 Passive replication.

4.4 Semi-active replication.

4.5 A write-all algorithm for data replication.

4.6 The problem of the write-all-available algorithm for data replication.

4.7 Preventing a transaction from accessing a not-fully-recovered replica is not sufficient to ensure one-copy serializable execution of transactions.

4.8 An example run of the quorum consensus algorithm on a single data item.

4.9 Basic steps for optimistic data replication for an operation-transfer system.

4.10 An example run of a system with three sites that uses Lamport clocks.

4.11 An example run of a system with three sites that uses vector clocks.

4.12 An example for the determination of the new version vector value after reconciling a conflict.

4.13 An example operation propagation using vector clocks in a system with three replicas.

4.14 An example for operation propagation using timestamp matrices in a system with three replicas.

4.15 Update commit using ack vectors in a system with three replicas.

4.16 Update commit using timestamp matrices in a system with three replicas.

4.17 An illustration of the CAP theorem.

4.18 Partition mode and partition recovery.

5.1 Examples of systems that ensure uniform total ordering and nonuniform total ordering.

5.2 In the sequencer based approach, a general system is structured into a combination of two subsystems, one with a single receiver and the other with a single sender of broadcast messages.

5.3 An example rotation sequencer based system in normal operation.

5.4 Normal operation of the membership view change protocol.

5.5 Membership change scenario: competing originators.

5.6 Membership change scenario: premature timeout.

5.7 Membership change scenario: temporary network partitioning.

5.8 A simplified finite state machine specification for Totem.

5.9 A successful run of the Totem Membership Protocol.

5.10 Membership changes due to a premature timeout by N2.

5.11 Messages sent before N1 fails in an example scenario.

5.12 Messages delivered during recovery for the example scenario.

5.13 Message sent before the network partitions into two groups, one with {N1, N2}, and the other with {N3, N4, N5}.

5.14 Messages delivered during recovery in the two different partitions for the example scenario.

5.15 Causal ordering using vector clocks.

6.1 Normal operation of the Paxos algorithm.

6.2 A deadlock scenario with two competing proposers in the Paxos algorithm.

6.3 If the system has already chosen a value, the safety property for consensus would hold even without the promise-not-to-accept-older-proposal requirement.

6.4 If two competing proposers propose concurrently, the system might end up choosing two different values without the promise-not-to-accept-older-proposal requirement.

6.5 With the promise-not-to-accept-older-proposal requirement in place, even if two competing proposers propose concurrently, only a single value may be chosen by the system.

6.6 Normal operation of Multi-Paxos in a client-server system with 3 server replicas and a single client.

6.7 View change algorithm for Multi-Paxos.

6.8 With reconfigurations, a group of 7 replicas (initially 5 active and 2 spare replicas) can tolerate up to 5 single faults (without reconfigurations, only up to 3 faults can be tolerated).

6.9 The Primary and secondary quorums formation for a system with 3 main replicas and 2 auxiliary replicas.

6.10 The Primary and secondary quorums formation as the system reconfigures due to the failures of main replicas.

6.11 Normal operation of Cheap Paxos in a system with 3 main replicas and 1 auxiliary replica.

6.12 The Primary and secondary quorums formation for a system with 3 main replicas and 2 auxiliary replicas.

6.13 Normal operation of (Multi-) Fast Paxos in a client-server system.

6.14 Collision recovery in an example system.

6.15 Expansion of the membership by adding two replicas in method 1.

6.16 Expansion of the membership by adding two replicas in method 2.

6.17 Reduction of the membership by removing two replicas one after another.

7.1 Two scenarios that highlight why it is impossible to use 3 generals to solve the Byzantine generals problem.

7.2 The message flow and the basic steps of the OM(1) algorithm.

7.3 The message flow and the basic steps of the OM(2) algorithm.

7.4 Normal operation of the PBFT algorithm.

7.5 PBFT view change protocol.

7.6 A worst case scenario for tentative execution.

7.7 Normal operation of Fast Byzantine fault tolerance.

7.8 Zyzzyva agreement protocol (case 1).

7.9 Zyzzyva agreement protocol (case 2).

7.10 A corner case in view change in Zyzzyva.

8.1 Bitcoin nodes.

8.2 The relationship between private key, public key, and address in Bitcoin.

8.3 Bitcoin transaction structure.

8.4 An example transaction chain in Bitcoin.

8.5 Bitcoin transactions per block, from its inception in 2009 through September 15, 2020. The data were downloaded from https://www.blockchain.com/charts/n-transactions-per-block.

8.6 Bitcoin block structure.

8.7 An issue with Bitcoin Merkle tree computation where different trees could produce the same Merkle root.

8.8 Bitcoin blockchain consensus and conflict resolution.

8.9 Structure of Ethereum transaction.

8.10 State transition via transaction in Bitcoin and Ethereum.

8.11 Ethereum smart contract structure.

8.12 An example transaction receipt in JSON format. The content is color-coded: the yellow blocks hold identifier information for the transaction, the contract invoked, and the block in which the transaction resides; the blue block contains the cumulative gas used; the green block contains the logs; the red block contains the logs Bloom filter string; the purple block contains the status of the transaction (success or not); and the pink block contains the gas used for this transaction alone.

8.13 Ethereum block structure.

8.14 The annotated source code on verification of an ommer block.

8.15 An example of which stale blocks may be chosen as an ommer block.

8.16 The annotated source code on the block reward scheme in Ethereum.

8.17 The cache size vs. the epoch number.

8.18 The dataset size vs. the epoch number.

8.19 The Ethash algorithm.

8.20 The double-spending attack steps.

9.1 A model for public blockchain consensus.

9.2 Main loop used by a mining node to compete in the creation of a new block using PoS in PeerCoin.

9.3 Major steps in the CreateNewBlock function in PeerCoin PoS.

9.4 Major steps in the CreateCoinStake function in PeerCoin PoS.

9.5 Major steps in the CheckStakeKernelHash function in PeerCoin PoS.

9.6 Information included in the data stream for computing PoS hash.

9.7 Major steps in PoET consensus.

10.1 Main benefits of blockchain for applications.

10.2 A model for cyber-physical systems.

10.3 Blockchain-enabled CPS applications.

10.4 Key operations and their relationship with the CPS applications and the blockchain benefits.

10.5 Basic CPS operations with respect to the latency and throughput requirements.

10.6 Stale block rate for different block sizes and block intervals.

10.7 Throughput for different combinations of block sizes and block intervals.

10.8 Payment channel operation.

10.9 Two level logging for sensing data with blockchain.

10.10 The format for the raw data (together with the aggregated data tuple) for local logging.

10.11 Summary of the token paradigm.

10.12 A classification of blockchain applications based on token usage.

10.13 The impossibility trinity hypothesis.

List of Tables

7.1 Messages received and final decisions in two cases for OM(1,4).

7.2 Messages received and step (3) calculation in two cases for instances of OM(1) at G1.

7.3 Messages received and step (3) calculation in two cases for instances of OM(1) at G2.

7.4 Messages received and step (3) calculation in two cases for instances of OM(1) at G3.

7.5 Messages received and step (3) calculation in two cases for instances of OM(1) at G4.

7.6 Messages received and step (3) calculation in two cases for instances of OM(1) at G5.

7.7 Final decision made at each lieutenant in step (3) of OM(2).

10.1 Blockchain-enabled IoT-based applications.

10.2 Blockchain-enabled supply chain applications.

10.3 Blockchain-enabled manufacturing applications.

10.4 Blockchain-enabled automobile production.

10.5 Blockchain-enabled energy systems.

10.6 Blockchain-enabled healthcare systems.

10.7 Blockchain-enabled smart city.

10.8 Blockchain-enabled workplace.

10.9 General discussions on blockchain-enabled CPS applications.

Acknowledgments

This book is dedicated to my parents. They tried their best to help me pursue my dreams through so many years’ financial hardship. They took their life savings to pay the government so that I could be free to emigrate to the greatest country on earth. When I stumbled and had nowhere else to go, they took me under their wings and took care of me. I am forever in their debt.

I also would like to thank my beautiful wife, Hao, and my lovely children Dorothy, Emily, and Arthur. It is they who make my life so enjoyable and meaningful.

W. Z.

Preface

Cloud services are playing an increasingly important role in all aspects of our society, for governments, businesses, and individuals alike. We depend on these services daily: financial services (e.g., online banking and stock trading), e-commerce (e.g., online shopping), civil infrastructure (e.g., the electric power grid and traffic control), entertainment (e.g., online gaming and multimedia streaming), and personal data storage (e.g., cloud services such as Dropbox, Google Drive, and OneDrive). Behind these cloud services is distributed computing, which addresses many critical issues in making the services dependable and trustworthy. The most important of these is building consensus in internal operations that span many different computing nodes.

Distributed consensus has been studied for several decades, since at least the 1970s. It is important because a distributed system spans many computing nodes, and these nodes must maintain a common view of the system state so that each can operate as planned towards the objectives of the system. Prolonged inconsistency among different components of the system would damage the integrity of the system and ultimately result in system-level failures that are visible to end users.

The cost of system failures is enormous. If a data center is brought down by a system failure, the average cost of downtime may range from $42,000 to about $300,000 per hour [2, 6]. The cost can be estimated by summing the wasted expenses and the loss of revenue. While the labor cost of downtime may be estimated relatively easily (roughly, wasted expenses per hour = number of employees × average salary per hour) [13], it is much harder to estimate the loss of revenue, especially the damage to the reputation of the business and the loyalty of its potential customers [2].
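The rough labor-cost formula above can be sketched as follows; the employee count and hourly salary are hypothetical figures for illustration, not values from the cited reports:

```python
# A minimal sketch of the labor-cost estimate of downtime described above:
# wasted expenses per hour = number of employees x average salary per hour.
# The inputs below are hypothetical illustration values.

def wasted_expenses_per_hour(num_employees: int, avg_hourly_salary: float) -> float:
    """Rough labor cost of one hour of downtime: the wages of idled employees."""
    return num_employees * avg_hourly_salary

# Example: 200 employees idled at an average of $40 per hour.
print(wasted_expenses_per_hour(200, 40.0))  # 8000.0
```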

Ensuring high availability of distributed systems is not cheap. In [7], the cost of a data center is estimated to range from $450 per square foot for 99.671% availability (i.e., 28.8 hours of downtime per year) to $1,100 per square foot for 99.995% availability (i.e., 0.4 hours of downtime per year). That is perhaps one reason why about 59% of Fortune 500 companies suffer 1.6 hours or more of downtime per week [2].
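The downtime figures quoted above follow directly from the availability percentages; a quick sketch of the conversion:

```python
# Converting an availability percentage to expected downtime per year,
# reproducing the figures quoted above (8,760 hours in a non-leap year).

HOURS_PER_YEAR = 365 * 24  # 8760

def annual_downtime_hours(availability_pct: float) -> float:
    """Expected hours of downtime per year for a given availability percentage."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR

print(round(annual_downtime_hours(99.671), 1))  # 28.8
print(round(annual_downtime_hours(99.995), 1))  # 0.4
```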

All classical consensus algorithms rely on a concept referred to as membership: every node knows how many nodes are in the current membership, the logical role of each node, and how to reach the other nodes. Another important construct is voting, carried out by exchanging messages. Typically, one of the members assumes a special role, referred to as the primary or the coordinator. The coordinator might fail or become compromised, in which case a new coordinator is elected through voting. Because of the heavy use of multiple rounds of message exchanges among the members, classical distributed consensus algorithms are expensive, highly complex, and not scalable.

In January 2009, the launch of the first practical cryptocurrency, Bitcoin [12], completely changed the picture. The most essential prerequisite for a cryptocurrency is the assurance that it is virtually impossible for anyone to double-spend the money (i.e., cryptocurrency) one has. Bitcoin addressed this requirement by introducing an immutable distributed ledger in the form of a chain of blocks, where each block aggregates hundreds or even thousands of transactions. This distributed ledger is often referred to as the blockchain. The immutability of the blockchain is achieved by several means: (1) cryptographic protection of the blockchain, such as digital signatures, one-way hash functions, and the chaining of blocks; (2) a massive degree of replication of the blockchain across many nodes in the Bitcoin network; and (3) a novel probabilistic consensus algorithm that is completely different from classical consensus algorithms.

The consensus algorithm used in Bitcoin does not involve any explicit form of voting; therefore, there is no extra message exchange among the nodes in the Bitcoin network for the purpose of reaching consensus. In Bitcoin, the consensus building process is converted into a lottery-like stochastic process in which the winner of the lottery gains the right to create a new block of transactions and collects a reward [22]. To ensure fairness and to keep the process stochastic, every participating node works on a Proof-of-Work (PoW) puzzle, and the first one to find a solution becomes the winner. The PoW puzzle has a predefined target difficulty, and a participating node experiments with different ways of making the hash of the block header meet the target difficulty. This is a CPU-intensive process; hence, the only way a node can gain an advantage over other nodes is to invest in better hardware that performs the hash operation faster. The Bitcoin consensus algorithm is referred to as PoW and sometimes as the Nakamoto algorithm, named after Bitcoin's creator, which is apparently a pseudonym. This novel form of consensus algorithm has sparked enormous interest in the research and application of the blockchain technology [20]. Some have even expressed the hope that the blockchain technology will lead to a new form of economy, much as the Internet has transformed our society [16].
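The PoW lottery can be illustrated with a minimal sketch: repeatedly hash a (simplified) block header with different nonces until the hash falls below a target. Bitcoin actually applies double SHA-256 to an 80-byte header with a specific field layout; the header bytes, nonce encoding, and the deliberately easy target below are simplified assumptions for illustration only.

```python
# A minimal sketch of a PoW nonce search, assuming a simplified header and
# an easy target. Real Bitcoin headers have a fixed 80-byte layout and a
# vastly smaller target, which is what makes mining CPU-intensive.
import hashlib

def mine(header: bytes, target: int) -> int:
    """Return the first nonce whose double-SHA-256 hash meets the target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(
            hashlib.sha256(header + nonce.to_bytes(8, "little")).digest()
        ).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

# An easy target (top 8 bits must be zero) so the search finishes quickly.
easy_target = 1 << 248
nonce = mine(b"example block header", easy_target)
print(nonce >= 0)  # True
```

Lowering the target shrinks the set of winning hashes, so more nonces must be tried on average; this is exactly how the network tunes the difficulty of the lottery.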

This book contains two parts. The first part consists of the first seven chapters and covers the most essential techniques for building dependable distributed systems. The last three chapters form the second part, which covers the blockchain technology.

Chapter 1 introduces the basic concepts and terminologies of dependable distributed computing, as well as the primary means to achieve dependability.

Chapter 2 describes the checkpointing and logging mechanisms, which are widely used in practice to achieve some form of fault tolerance. Checkpointing and logging enable the recoverability of the system but do not prevent service disruption. These mechanisms are relatively simple to implement and understand, and they incur minimal runtime overhead while demanding only modest extra resources (stable storage). Furthermore, checkpointing and logging also serve as the foundation for more sophisticated dependability techniques.

Chapter 3 covers research on recovery-oriented computing, including fault detection and diagnosis, microreboot, and system-level undo and redo. Recovery-oriented computing aims to facilitate faster recovery after a system failure and thereby improve the availability of the system. Like checkpointing and logging, the mechanisms for recovery-oriented computing do not prevent service disruption; hence, it is a promising approach for many e-commerce applications, but not suitable for applications that require high reliability.

Chapter 4 outlines the replication technique for data and service fault tolerance. This is the fundamental technique for ensuring high reliability. Through active replication (i.e., the use of multiple redundant copies of the application processes), the system is able to mask the failure of a replica and continue to process clients' requests (this is actually not entirely true; as we will show in later chapters, some failures may cause extended periods of unavailability). With replication comes the complexity of the consistency issue. Ideally, the replicas should always maintain consistency with each other. However, doing so may incur too much runtime overhead to be acceptable for some applications, or may cause extended periods of system unavailability. Hence, strict consistency may have to be compromised, either for better performance [15] or for better availability [19].

Chapter 5 explains group communication systems, which can be used to implement active replication. A group communication system typically offers a totally ordered reliable multicast service for messages, a membership service, and a view synchrony service. This set of services helps the replicas maintain consistency even in the presence of failures, which reduces the development cost of building dependable systems with active replication.

Chapter 6 discusses the consensus problem and describes several Paxos algorithms, including Classic Paxos, Dynamic Paxos, Cheap Paxos, and Fast Paxos. While it is easy for a group of processes to agree on the same value if all processes can communicate with each other promptly and none of them fails, distributed consensus is an incredibly hard problem when processes might fail and there might be extended delays in sending or receiving a message. The classical Paxos algorithm solves the consensus problem (under the non-malicious fault model) in a very elegant and efficient manner by separating the safety concern from the liveness concern [9]. Additional Paxos algorithms have been developed to minimize the resources required and to reduce the latency of achieving consensus by using a higher redundancy level [10, 18].

Chapter 7 introduces the problem of Byzantine fault tolerance. A Byzantine fault is synonymous with a malicious fault. Because a maliciously faulty component may choose to behave like any of the non-malicious faults, the Byzantine fault model encompasses any arbitrary fault. The distributed consensus problem under the Byzantine fault model was first studied several decades ago by Lamport, Shostak, and Pease [11]. A much more efficient algorithm for achieving fault tolerance under the Byzantine fault model (referred to as Practical Byzantine fault tolerance) was proposed by Castro and Liskov in 1999 [5]. Since then, research on Byzantine fault tolerance has exploded. With the pervasiveness of cyberattacks and espionage, dealing with malicious faults has become a much more urgent concern than it was several decades ago.

Chapter 8 provides an overview of cryptocurrency and the blockchain technology, including the early conception of cryptocurrency, the first implementation of cryptocurrency in Bitcoin [12], the concept of the smart contract and its implementation in Ethereum [4], as well as the vision of decentralized organizations [16] powered by smart contracts and the blockchain technology.

Chapter 9 explains the consensus algorithms used in the blockchain technology in depth. Since the original PoW algorithm was introduced in Bitcoin, there has been a great deal of effort on improving PoW in various aspects and on finding alternative algorithms that do not consume as much energy. A common set of requirements for such algorithms has been laid out [22], and different proposals have been examined with respect to these requirements [17]. In this chapter, we also discuss the Proof-of-Stake (PoS) consensus algorithm, the second most well-known blockchain consensus algorithm behind PoW. We explain the PoS implementation in PeerCoin [8], the first implementation of PoS in a practical cryptocurrency, in 2013; it has gone through several revisions to address its initial vulnerabilities.

Chapter 10 presents applications of the blockchain technology and issues that directly impact how widely the technology can be adopted, including the value of the blockchain technology and the efforts to increase the throughput of blockchain systems [1, 3, 14, 21]. We primarily focus on blockchain applications in the area of cyber-physical systems (CPS) [20]. CPS is evolving rapidly, and the integration of blockchain and CPS could potentially transform CPS design for much stronger security and robustness.

Wenbing Zhao
Cleveland, USA
March 2021

References

  1. E. Akbari, W. Zhao, S. Yang, and X. Lou. The impact of block parameters on the throughput and security of blockchains. In Proceedings of the 2020 International Conference on Blockchain Technology, pages 13–18. ACM, 2020.
  2. A. Arnold. Assessing the financial impact of downtime, April 2010. http://www.businesscomputingworld.co.uk/assessing-the-financial-impact-of-downtime/.
  3. A. Back, M. Corallo, L. Dashjr, M. Friedenbach, G. Maxwell, A. Miller, A. Poelstra, J. Timón, and P. Wuille. Enabling blockchain innovations with pegged sidechains. http://www.opensciencereview.com/papers/123/enablingblockchain-innovations-with-pegged-sidechains, 2014.
  4. V. Buterin et al. Ethereum white paper. https://ethereum.org/en/whitepaper/, 2013.
  5. M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI '99, pages 173–186, Berkeley, CA, USA, 1999. USENIX Association.
  6. Channel Insider. Unplanned IT outages cost more than $5,000 per minute: Report. http://www.channelinsider.com/c/a/Spotlight/Unplanned-IT-Outages-Cost-More-than-5000-per-Minute-Report-105393/, May 2011.
  7. J. Clark. The price of data center availability, October 2011. http://www.data-centerjournal.com/design/the-price-of-data-center-availability/.
  8. S. King and S. Nadal. PPCoin: Peer-to-peer crypto-currency with proof-of-stake. https://www.peercoin.net/assets/paper/peercoin-paper.pdf, 2012.
  9. L. Lamport. Paxos made simple. ACM SIGACT News (Distributed Computing Column), 32(4):18–25, December 2001.
  10. L. Lamport. Fast Paxos. Distributed Computing, 19(2):79–103, 2006.
  11. L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.
  12. S. Nakamoto. Bitcoin: A peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf, 2008.
  13. T. Pisello and B. Quirk. How to quantify downtime, January 2004. http://www.networkworld.com/careers/2004/0105man.html.
  14. J. Poon and T. Dryja. The Bitcoin Lightning Network: Scalable off-chain instant payments, 2016.
  15. Y. Saito and M. Shapiro. Optimistic replication. ACM Computing Surveys, 37(1):42–81, March 2005.
  16. M. Swan. Blockchain: Blueprint for a New Economy. O'Reilly Media, 2015.
  17. W. Wang, D. T. Hoang, P. Hu, Z. Xiong, D. Niyato, P. Wang, Y. Wen, and D. I. Kim. A survey on consensus mechanisms and mining strategy management in blockchain networks. IEEE Access, 7:22328–22370, 2019.
  18. W. Zhao. Fast Paxos made easy: Theory and implementation. International Journal of Distributed Systems and Technologies (IJDST), 6(1):15–33, 2015.
  19. W. Zhao. Optimistic Byzantine fault tolerance. International Journal of Parallel, Emergent and Distributed Systems, 31(3):254–267, 2016.
  20. W. Zhao, C. Jiang, H. Gao, S. Yang, and X. Luo. Blockchain-enabled cyber-physical systems: A review. IEEE Internet of Things Journal, 2020.
  21. W. Zhao, S. Yang, and X. Lou. Secure hierarchical processing and logging of sensing data and IoT events with blockchain. In Proceedings of the 2020 International Conference on Blockchain Technology, pages 52–56. ACM, 2020.
  22. W. Zhao, S. Yang, and X. Luo. On consensus in public blockchains. In Proceedings of the 2019 International Conference on Blockchain Technology, pages 1–5, 2019.