Distributed Hash Tables: Introduction
This post assumes you have ~ 1 or more years of programming or Computer Science experience. I’d say if you’re comfortable with hash functions & some basic network ideas, there’s something here for you. This series of posts is designed to explore the purpose, design, and function of Distributed Hash Tables; specifically & eventually Kademlia, which I am implementing here.
Sharing files is not easy. Locating and hosting files across the internet is particularly challenging. As a webmaster, you are free to host files on your HTTP server and serve these files up to any user that visits your website. One might have a few complaints with this situation, however:
- One can only see the files you choose to upload; the selection is limited.
- You might take these files down, at which point one can no longer download them.
- If your website becomes popular, you might have trouble meeting demand and your server might get overloaded, slow, etc.
- If someone with a law degree decides the files you are uploading are copyrighted or unsavory, it’s easy to get your server shut down (you can be sued) thus eliminating access to these files.
Perhaps the world desires a system where anyone can place his or her files up for grabs on the internet. They want anyone else in turn to be able to search for desired files, find where they are located (anywhere in the world) and download them.
Napster, founded by Shawn Fanning, John Fanning, and Sean Parke and released in 1999, took a stab at this problem. Their goal was to provide a peer-to-peer file-sharing service based around uploading and downloading MP3 files. As a consumer, you downloaded the Napster client, searched for a song, and in seconds you were downloading the file from some other random person connected to the network that we’ll call a producer.
The experience sounds much like the list of priorities above. However, there were some technical differences. Both consumers and producers connected to a central index server. This server stored what files were available and who was hosting them.
[FileA, FileB, FileC] => Producer X [FileD, FileE, FileF] => Producer Y
As a producer, you connected and said “Hi! I have Files [A, B, C]”. As a consumer, you might connect and ask for the location of File A. The index server would then give you the IP and port of the appropriate producer. You connect to the producer, and initiate a download.
Pros of Napster:
- Simple approach - consumers and producers are largely ignorant of technical details, they can just ask the central index server for information.
- Easy to update - the maintainers of Napster simply need to push updates to the central index server to change the behavior of the network.
- There is a single point of failure. If the central index server(s) go down, consumers can no longer talk to producers.
- Since Napster was “advertising” copyrighted songs on their servers, they were deemed liable for the exchange of copyrighted material being transferred via their network. They were sued and faded into the background.
- If a producer is hosting a valuable file that the world wants and they shut down their computer, that file is no longer available to anyone (unless someone else also chooses to manually host it).
Gnutella was developed by Justin Frankel and Tom Pepper around 2000. Gnutella was designed to avoid problems faced by networks like Napster, namely central points of failure. The aim was to provide the world with a decentralized way for many computer to connect share data.
Unfortunately, up to version 0.4, that’s all Gnutella was. It’s useful to think of Gnutella as simply a random, unorganized mess of computers, all connected to each other in a complicated and non-deterministic web. You might be connected to 5 computers, who are each in turn connected to 7 computers, and so on.
Producers and consumers are treated as equal as nodes in the network. If I am looking for a file, I simply ask every node I’m connected to for the file. They in turn ask each node they are connected to if they have seen the file. Imagine one node in a giant web starting a vibration which multiplies across the web and ends up vibrating every strand.
There is a timeout on requests - if one request is passed on to, say, 7 computers (hops), it will expire. If the file is found, each hop of the request is reversed until the file ends up at the requester.
Pros of Gnutella:
- No central point of failure. Anyone can join at end time and does not need to register information in a place than can, if shut down, bring down the whole network.
- Very simple design - everyone connects to everyone and floods the network every time a request is made.
- Wasteful - every time you make a request you must flood the network - so if requests are being made frequently (they are) then every node in the network is constantly sending and receiving messages.
- It’s possible you might not ever find a file that is in fact on the network. If you are searching for a file that is topologically on the other side of the network, requests might expire before ever getting to that point in the network.
- It’s ugly; the solution seems hacky and inelegant.
What do we want?
A solution that provides the following abilities would be ideal (I refer to
n as the number of nodes in the network):
- To find a file anyone in the network has, no matter how big the network is.
- To avoid querying every node in the network to find any file.
- To not rely on a central organization server of point of failure.
- To have the system perform well as
- To have the system perform well if nodes storing important files leave the network.
The Burden of Protocol
Distributed Hash Tables help mitigate most of these issues. To help us on that journey, we need to understand that Napster and Gnutella differ in the fundamental way they treat power and protocol.
In Napster’s case, the index server dictates how information is stored and how consumers and producers are connected to each other. This is where updates to the protocol go and where changes need to be made. If you remove the central index server, you’ve cut off the head of the snake. Consumers and producer nodes are stupid - they only know how to talk to the index server and do what they’re told.
By contrast, in Gnutella every node knows how to talk to every other node. Parameters like “how long to let a request live” are encoded onto every single node. They each know how to behave without being told by another node in the network. They are independent beings. Here, the “burden of protocol” is placed on every node, rather than at once central location. If you remove a node, it doesn’t matter; every node knows how to behave appropriately. The intelligent behavior emerges despite the lack of a node (like an index server) with a higher, more wholesome view of the network topology.
This characteristic is important in building failure-and-lawsuit-resilient peer-to-peer protocols. The downside is of course it becomes difficult to push updates to everyone at once: since no node is forced to run new software, there will often be a subset of nodes running different versions of software. See Bitcoin, for example.
Distributed Hash Tables also push the burden of protocol to every node - every node has behavior encoded in it. When every node correctly executes this behavior, an emergent, intelligent behavior is produced in the network.
This post outlines the historical background for understanding why we would want to talk about DHTs at all. Next up - some more technical background as well as a high level technical description.