A Distributed Systems Roadmap

Here lives a map for navigating the basics of distributed systems theory. This roadmap is mainly for programmers and researchers interested in learning enough theory to construct and reason about “real” distributed systems. Important proofs and results help ensure that such people do not attempt to solve problems that cannot be solved. Once you knows what is impossible under certain sets of assumptions, you can manipulate assumptions and designs in creative ways, to fruitful ends.

This is a working document. I have read some of these papers, and merely skimmed others. I plan on creating syntheses for as many of the papers as possible.

I am creating this document for another reason. The different possible failure and threat models in distributed systems theory make it difficult to classify systems. Raft handles fail-stops, but not Byzantine failures. A system with 2t+1 nodes can handle t Byzantine failures if the system entails one honest and reliable node reading a value from each node in the system, but you’ll need 2t+1 nodes if you want all nodes to agree upon a common value. Et cetera…

 What Good are Models and What Models are Good?

 by Fred Schneider

This chapter presents a nice overview of the landscape. Arguably the most important type of question to ask when considering distributed systems problems and solutions goes something like: What conditions will the system reside in? Can components fail, and if so, how? What do we assume about the network? What types of failures will we tolerate?

The system’s viability will depend on the answers to those questions.

This paper outlines the difference between fail-stop and Byzantine failures, which is crucial.

 The Consensus Problem in Unreliable Distributed Systems

 by Michael J. Fischer

Seems to provide a nice overview of topics to come. Understanding all topics in this paper is not crucial, since it cites and discusses many papers you’ll read.

 The Byzantine Generals Problem

 by Leslie Lamport, Robert Shostak, and Marshall Pease

Introduces the abstract notion of Byzantine faults. A relatively approachable paper. I wrote a summary.

 Reaching Agreement in the Presence of Faults

 by Leslie Lamport, Robert Shostak, and Marshall Pease

This presents an in-depth proof showing a 3-processor system cannot tolerate 1 fault. More generally, it provides a proof showing that in order to tolerate m Byzantine faults, the system needs at least 3m+1 nodes.

These papers go hand-in-hand, though the second is not strictly necessary.

I also find Leslie Lamport’s short commentaries on these papers here and here amusing.

 The Impossibility of Distributed Consensus with One Faulty Process

 by Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson

Arguably the most influential paper of the bunch. “In this paper, we show the surprising result that no completely asynchronous consensus protocol can tolerate even a single unannounced process death. We do not consider Byzantine failures, and we assume that the message system is reliable…”

 Consensus in the Presence of Partial Synchrony

 by Cynthia Dwork and Nancy A. Lynch

Okay, so it’s impossible to achieve agreement with an asynchronous network (where both the delay on the communication links and the difference between processor speeds are unbounded and unknown). Under what circumstance can we achieve agreement? Introduces partial synchrony, in which both the delay on communication links and difference between process speeds are unknown, but bounded. (Fully synchronous systems have known and bounded delays/speed differences.)

It’s funny; basically every distributed systems paper needs to cite this one and remind its readers that we’re in a partially synchronous setting.

 Practical Byzantine Fault Tolerance

 by Miguel Castro and Barbara Liskov

One of the first (perhaps the first?) “practical” algorithms for tolerating Byzantine faults under partial synchrony. Fun fact: the paper was published in 1999, but Hyperledger Fabric uses PBFT.

 Time, Clocks, and the Ordering of Events in a Distributed System

 by Leslie Lamport

The man, the myth, the legend. How can we order events in a distributed setting?

 Consistent Global States of Distributed Systems

 by Ozalp Babaoglu and Keith Marzullo

A long, comprehensive overview. Probably a nice checkpoint. Many important problems in distributed computing admit solutions that contain a phase where some global property needs to be detected. This subproblem can be seen as an instance of the Global Predicate Evaluation (GPE) problem where the objective is to establish the truth of a Boolean expression whose variables may refer to the global system state.

 The Sybil Attack

 by John R. Douceur

Now, what happens when you open your system, and the number of nodes (peers) is unbounded?

 Perspectives on the CAP Theorem

 by Seth Gilbert and Nancy A. Lynch

A nice overview of the CAP Theorem and how it relates to the more fundamental concepts you’ve already read about, if you’ve gone this far.


There are a few directions to go from here: CRDTs, specific systems like Chubby or Dynamo, cryptocurrencies, to name a few. I expect to add them eventually.


Now read this

The Byzantine Generals Problem

The Byzantine Generals Problem is an abstract generalization for thinking about the failure in computer systems. Processors fail in multi-threaded systems, and nodes fail in distributed network systems. More specifically, processors and... Continue →