Distributed Hash Tables: Introduction

This post assumes you have a year or more of programming or Computer Science experience. I’d say if you’re comfortable with hash functions & some basic networking ideas, there’s something here for you. This series of posts is designed to explore the purpose, design, and function of Distributed Hash Tables; specifically, and eventually, Kademlia, which I am implementing here.

 The Problem

Sharing files is not easy. Locating and hosting files across the internet is particularly challenging. As a webmaster, you are free to host files on your HTTP server and serve these files up to any user who visits your website. One might have a few complaints with this situation, however:

- You have to run (and pay for) the server yourself.
- Your files are only available as long as that one server stays up.
- Nobody can find your files unless they already know where to look.

Perhaps the world desires a system where anyone can place his or her files up for grabs on the internet. They want anyone else in turn to be able to search for desired files, find where they are located (anywhere in the world) and download them.

 History

 Napster

Napster, founded by Shawn Fanning, John Fanning, and Sean Parker and released in 1999, took a stab at this problem. Their goal was to provide a peer-to-peer file-sharing service based around uploading and downloading MP3 files. As a consumer, you downloaded the Napster client, searched for a song, and in seconds you were downloading the file from some other random person connected to the network, whom we’ll call a producer.

The experience sounds much like the wish list above. However, there were some important technical differences. Both consumers and producers connected to a central index server. This server stored which files were available and who was hosting them.

[FileA, FileB, FileC] => Producer X
[FileD, FileE, FileF] => Producer Y

As a producer, you connected and said “Hi! I have Files [A, B, C].” As a consumer, you might connect and ask for the location of File A. The index server would then give you the IP and port of the appropriate producer. You would then connect to that producer directly and initiate the download.
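To make the index server’s role concrete, here is a minimal sketch of that mapping in Python. The names and structure are purely illustrative (this is not Napster’s actual protocol); it is just a dictionary from filenames to producer addresses, plus the two operations described above.

```python
# A minimal, illustrative sketch of a Napster-style central index.
# (Hypothetical names; not Napster's real protocol or wire format.)

class IndexServer:
    def __init__(self):
        # filename -> set of (ip, port) pairs for producers hosting it
        self.index = {}

    def register(self, producer_addr, filenames):
        # A producer announces: "Hi! I have these files."
        for name in filenames:
            self.index.setdefault(name, set()).add(producer_addr)

    def locate(self, filename):
        # A consumer asks where a file lives. The server only returns
        # addresses; the actual download happens directly between peers.
        return self.index.get(filename, set())


server = IndexServer()
server.register(("10.0.0.5", 6699), ["FileA", "FileB", "FileC"])
server.register(("10.0.0.9", 6699), ["FileD", "FileE", "FileF"])
print(server.locate("FileA"))  # {('10.0.0.5', 6699)}
```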


Pros of Napster:

- Searching is fast and simple: one query to the central index tells you exactly who has the file you want.
- File transfers happen directly between peers, so the index server only has to handle metadata.

Cons:

- The central index server is a single point of failure: take it down, technically or legally, and the entire network stops working.

 Gnutella (LimeWire)

Gnutella was developed by Justin Frankel and Tom Pepper around 2000. Gnutella was designed to avoid problems faced by networks like Napster, namely central points of failure. The aim was to provide the world with a decentralized way for many computers to connect and share data.

Unfortunately, up to version 0.4, that’s about all Gnutella was. It’s useful to think of Gnutella as a random, unorganized mess of computers, all connected to each other in a complicated and non-deterministic web. You might be connected to 5 computers, each of which is in turn connected to 7 computers, and so on.

Producers and consumers are treated as equals; both are simply nodes in the network. If I am looking for a file, I simply ask every node I’m connected to for it. They in turn ask each node they are connected to whether they have seen the file. Imagine one node in a giant web starting a vibration that multiplies across the web and ends up shaking every strand.

There is a limit on how far requests travel: each query carries a hop count, and once it has been passed on to, say, 7 computers (hops), it expires. If the file is found, the answer is passed back hop by hop until it reaches the original requester.
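To make the flooding idea concrete, here is a toy sketch in Python. It is heavily simplified and entirely hypothetical (no real networking, queries are forwarded one neighbor at a time rather than in parallel, and real Gnutella routes responses back along the query path), but it shows the essential behavior: ask your neighbors, decrement a hop counter, and give up when it reaches zero.

```python
# A toy, assumed sketch of Gnutella-style query flooding with a hop limit.

class Node:
    def __init__(self, name, files=()):
        self.name = name
        self.files = set(files)
        self.neighbors = []        # directly connected nodes
        self.seen_queries = set()  # avoid re-flooding the same query

    def query(self, query_id, filename, ttl):
        # Stop if the query has run out of hops or we've already seen it.
        if ttl == 0 or query_id in self.seen_queries:
            return None
        self.seen_queries.add(query_id)

        if filename in self.files:
            return self.name  # a hit; in reality this travels back hop by hop

        # Otherwise, pass the query on to every neighbor with one less hop.
        for neighbor in self.neighbors:
            hit = neighbor.query(query_id, filename, ttl - 1)
            if hit:
                return hit
        return None


a, b, c, d = Node("A"), Node("B"), Node("C"), Node("D", files=["song.mp3"])
a.neighbors = [b, c]
b.neighbors = [a, d]
c.neighbors = [a]
print(a.query(query_id=1, filename="song.mp3", ttl=7))  # D
```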


Pros of Gnutella:

- There is no central point of failure: any node can disappear and the rest of the network keeps working.

Cons:

- Searching is expensive: a single query floods a large portion of the network.
- A file is not guaranteed to be found even if it exists somewhere, since queries expire after a fixed number of hops.

 What do we want?

A solution that provides the following abilities would be ideal (I refer to n as the number of nodes in the network):

- No central point of failure: no single node can take down, or be forced to take down, the whole network.
- Efficient search: locating a file should take far fewer than n messages, ideally on the order of log n.
- Small per-node state: each node should only need to know about a small fraction of the other nodes, again ideally on the order of log n.
- Reliability: if a file exists somewhere in the network, a search for it should actually find it.

 The Burden of Protocol

Distributed Hash Tables help mitigate most of these issues. To help us on that journey, we need to understand that Napster and Gnutella differ in the fundamental way they treat power and protocol.

In Napster’s case, the index server dictates how information is stored and how consumers and producers are connected to each other. This is where updates to the protocol go and where changes need to be made. If you remove the central index server, you’ve cut off the head of the snake. Consumer and producer nodes are stupid - they only know how to talk to the index server and do what they’re told.

By contrast, in Gnutella every node knows how to talk to every other node. Parameters like “how long to let a request live” are encoded into every single node. They each know how to behave without being told by another node in the network. They are independent beings. Here, the “burden of protocol” is placed on every node, rather than at one central location. If you remove a node, it doesn’t matter; every node knows how to behave appropriately. The intelligent behavior emerges despite the lack of a node (like an index server) with a higher, more complete view of the network topology.

This characteristic is important in building failure- and lawsuit-resilient peer-to-peer protocols. The downside, of course, is that it becomes difficult to push updates to everyone at once: since no node is forced to run new software, there will often be subsets of nodes running different versions. See Bitcoin, for example.

Distributed Hash Tables also push the burden of protocol to every node - every node has behavior encoded in it. When every node correctly executes this behavior, an emergent, intelligent behavior is produced in the network.
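As a rough preview of what that per-node behavior looks like (the real details come in the next post), here is a tiny sketch of the trick most DHTs share: hash both keys and node IDs into the same space, and make the node “closest” to a key responsible for storing it. The code below is an assumption-heavy toy using plain numeric distance; Kademlia, which this series builds toward, uses an XOR distance instead.

```python
import hashlib

# A toy sketch of the core DHT idea (not Kademlia specifically):
# keys and node IDs live in the same hash space, and the node closest
# to a key's hash stores it. Every node can compute this rule locally,
# so no central index server is needed.

def hash_id(value, bits=16):
    # Map any string into a small integer ID space.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** bits)

node_ids = sorted(hash_id(f"node-{i}") for i in range(8))

def responsible_node(key):
    key_id = hash_id(key)
    # "Closest" here means smallest absolute difference; Kademlia uses XOR.
    return min(node_ids, key=lambda n: abs(n - key_id))

print(responsible_node("song.mp3"))  # the ID of the node that stores this key
```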

 Conclusion

This post outlines the historical background for understanding why we would want to talk about DHTs at all. Next up: some more technical background, as well as a high-level technical description.

 