Sometimes, you need to tread ground that you have already walked.
Last time, we started our examination of Kademlia (link). We looked at how it works at a high level and covered its 4 RPCs. Let’s continue our discussion by digging a little deeper.
Kademlia arranges its nodes as leaves in a binary tree, and each node keeps one or more contacts (i.e., other nodes) in each subtree it does not occupy. Nodes store the key-value pairs whose keys fall near their own IDs, and XOR serves as the distance metric. The 4 RPCs defined by Kademlia are:
- Ping - checks to see if a node is still alive.
- Store - tells a node to store a key-value pair.
- Find_Node - finds a node in the network by its node ID.
- Find_Value - similarly, finds a value by its key.
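The XOR metric is simple enough to show outright. A minimal Python sketch, with node IDs as plain integers (the function name is ours, not the paper's):

```python
# Kademlia's distance: XOR the two IDs and read the result as an
# unsigned integer. Nodes sharing a long common prefix are "close",
# which ties the metric to the binary tree structure.
def xor_distance(id_a: int, id_b: int) -> int:
    return id_a ^ id_b

assert xor_distance(0b1010, 0b1011) == 1  # differ only in the last bit
assert xor_distance(0b1010, 0b0010) == 8  # differ in the first bit
```

The metric is symmetric and unidirectional: for any ID and distance, exactly one ID sits at that distance, so lookups for the same key converge along the same path.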
Kademlia uses node lookups to locate the nodes closest to a target ID. During this process, the searching node selects α candidate nodes, picked by their proximity to the target. Each candidate receives an asynchronous find_node. The recipients respond with new leads: the k closest nodes they know of to the target. The searching node then sends asynchronous find_nodes to α of the new leads, repeating until no closer nodes turn up.
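Here is a minimal, synchronous Python sketch of that loop. The integer contacts and the `find_node_rpc` callable are stand-ins for the real network layer, which would issue the α queries concurrently:

```python
import heapq

ALPHA = 3  # the paper's concurrency parameter
K = 20     # the paper's replication parameter

def node_lookup(target_id, seed_contacts, find_node_rpc):
    """Iteratively query the alpha closest unqueried candidates.
    find_node_rpc(contact, target_id) stands in for the RPC and
    returns the k closest contacts the queried node knows of."""
    candidates = set(seed_contacts)
    queried = set()
    while True:
        # The alpha closest candidates we have not yet asked.
        batch = heapq.nsmallest(
            ALPHA,
            (c for c in candidates if c not in queried),
            key=lambda c: c ^ target_id,
        )
        if not batch:
            break
        for contact in batch:  # a real node fires these asynchronously
            queried.add(contact)
            candidates.update(find_node_rpc(contact, target_id))
    # A real implementation stops earlier, once a round brings
    # nothing closer than the best k results seen so far.
    return heapq.nsmallest(K, candidates, key=lambda c: c ^ target_id)
```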
Find_value works like find_node, except that the lookup halts as soon as a node returns the value. The searching node then caches the key-value pair at the closest candidate it saw that did not return the value. The cached entries expire after some time (e.g., 1 hour).
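Continuing the sketch above (same `heapq` import and ALPHA constant), a hypothetical `find_value_rpc` that returns either a hit or more leads is enough to show the early exit and the caching step:

```python
def find_value(key, seed_contacts, find_value_rpc, store_rpc):
    """find_value_rpc(contact, key) is assumed to return either
    ("value", v) on a hit or ("nodes", contacts) on a miss;
    store_rpc caches a pair at a contact. Both are stand-ins."""
    candidates = set(seed_contacts)
    queried = []  # in query order, so we can find the cache target
    while True:
        batch = heapq.nsmallest(
            ALPHA,
            (c for c in candidates if c not in queried),
            key=lambda c: c ^ key,
        )
        if not batch:
            return None  # every lead exhausted without a hit
        for contact in batch:
            queried.append(contact)
            kind, payload = find_value_rpc(contact, key)
            if kind == "value":
                # Cache at the closest node that missed the key.
                misses = [c for c in queried if c != contact]
                if misses:
                    store_rpc(min(misses, key=lambda c: c ^ key),
                              key, payload)
                return payload
            candidates.update(payload)
```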
Kademlia nodes update their buckets with every request that they send or receive, so buckets generally do not grow stale. However, Kademlia dictates that nodes refresh every bucket hourly: if requests have not freshened a bucket within the past hour, the node freshens it anyway. To refresh a bucket, a node runs a node lookup on a random ID that falls within the bucket's range.
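A sketch of that refresh pass, assuming hypothetical bucket objects that record their covered ID range and when they were last touched:

```python
import random
import time

REFRESH_INTERVAL = 3600  # one hour, per the paper

def refresh_buckets(buckets, lookup):
    """Refresh any bucket untouched for an hour. The low/high/
    last_touched fields and the lookup callable (the node-lookup
    procedure from earlier) are assumptions of this sketch."""
    now = time.time()
    for bucket in buckets:
        if now - bucket.last_touched >= REFRESH_INTERVAL:
            # A lookup for a random ID in the bucket's range both
            # exercises and repopulates that part of the table.
            lookup(random.randrange(bucket.low, bucket.high))
            bucket.last_touched = now
```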
A node joining the network bootstraps itself with an existing node. Say W is bootstrapping with U. W inserts U into its bucket. W then searches for its node ID by sending a find_node to U. Doing that allows it to populate its buckets and make other nodes aware of its existence. Finally, W refreshes all of its buckets.
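The whole join procedure fits in a few lines. A sketch, where every callable is a stand-in for machinery covered elsewhere in this post:

```python
def bootstrap(own_id, known_contact, add_contact, lookup, refresh_all):
    add_contact(known_contact)  # seed the routing table (W inserts U)
    lookup(own_id)              # fill our buckets, announce ourselves
    refresh_all()               # then refresh every bucket we have
```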
Routing tables in Kademlia are binary trees, and k-buckets are the leaves. Each bucket has an ID prefix, and it holds nodes whose IDs share that prefix. The prefix is also the bucket's position in the tree. Together, the k-buckets cover the entire 160-bit ID space without overlap.
Nodes create buckets as needed. When W was bootstrapping, it had one bucket covering the entire ID space, and U went into that bucket. When W learns of a new node, it adds it to that bucket too. It repeats that process until the bucket is full (i.e., the bucket holds k nodes). The next time W tries to add a node, the insertion fails. So W splits the bucket and divides the existing nodes between the two new buckets according to their prefixes. Where there was one bucket, there are now two, each covering half the ID space. Then, W attempts to add the new node to its appropriate bucket again. It discards the new node if the insertion fails again.
W adds future new nodes in mostly the same way. The one change: if the bucket the new node belongs in is full, W checks whether the bucket is splittable (i.e., whether W's own node ID falls into the ID space that bucket covers). If the bucket isn't splittable, W pings the stalest (least-recently-seen) node in the bucket. If there is no response, W ejects the stalest node and inserts the new one. Otherwise, it discards the new node. The result is that Kademlia nodes know more about their own neighborhood than about regions further away.
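Putting those rules together, a sketch with IDs again as plain integers; the `KBucket` shape and `ping_rpc` helper are illustrative:

```python
from dataclasses import dataclass, field

K = 20  # bucket capacity, the paper's k

@dataclass
class KBucket:
    low: int   # inclusive lower bound of the covered ID range
    high: int  # exclusive upper bound
    contacts: list = field(default_factory=list)  # stalest first

def add_contact(buckets, own_id, contact_id, ping_rpc):
    """ping_rpc(contact) stands in for the Ping RPC and returns
    True if the contact answers."""
    bucket = next(b for b in buckets if b.low <= contact_id < b.high)
    if contact_id in bucket.contacts:
        # Already known: move to the most-recently-seen end.
        bucket.contacts.remove(contact_id)
        bucket.contacts.append(contact_id)
    elif len(bucket.contacts) < K:
        bucket.contacts.append(contact_id)
    elif bucket.low <= own_id < bucket.high:
        # Splittable: halve the range, redistribute, retry the insert.
        # (The unbalanced-tree exception below relaxes this check;
        # it is omitted here for brevity.)
        mid = (bucket.low + bucket.high) // 2
        buckets.remove(bucket)
        buckets.append(KBucket(bucket.low, mid,
                               [c for c in bucket.contacts if c < mid]))
        buckets.append(KBucket(mid, bucket.high,
                               [c for c in bucket.contacts if c >= mid]))
        add_contact(buckets, own_id, contact_id, ping_rpc)
    else:
        # Not splittable: evict the stalest contact only if it
        # fails to answer a ping; otherwise drop the newcomer.
        if not ping_rpc(bucket.contacts[0]):
            bucket.contacts.pop(0)
            bucket.contacts.append(contact_id)
```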
There is another exception, for when the binary tree is highly unbalanced. Say C joins the network as the only node with the prefix 000, while all existing nodes have the prefix 001. C bootstraps like normal. Its initial bucket quickly fills up and splits. All existing nodes land in the away bucket (i.e., the bucket covering the half of the ID space that C's ID is not in), while the near bucket sits empty. So, the tree is highly unbalanced. Now C learns of another new node. Again, the new node's prefix is 001. It belongs in the away bucket, but that bucket is full. Normally, the bucket wouldn't be split, but Kademlia makes an exception in this case.
Why? Let's consider an existing node, D. D has the prefix 001, so it tracks its neighbors (i.e., other nodes with the prefix 001) across multiple buckets. When C enters the network, D adds C to its farthest-away bucket, where C is the only entry. So, D is tracking C and all the other nodes. Compare this to C. Without an exception for highly unbalanced trees, everything C knows would have to fit into its single away bucket; having no neighbors, C would only ever know about k other nodes.
Nodes store the key-value pairs for which they are one of the k closest nodes, and they republish those pairs every hour to ensure that this remains the case. (Nodes entering or leaving the network can change which k nodes are closest.) Kademlia optimizes the republishing process in two ways. First, nodes do not republish a key-value pair for which they have received a store in the last hour. In other words, as long as the nodes' republishing intervals are offset, only one node republishes a given key-value pair each hour. Second, before republishing, nodes refresh the k-buckets in the subtree near the key if that subtree has at least k nodes. Doing so means the k closest nodes are already known, and node lookups are unnecessary.
The second optimization needs some explanation. After a refresh of the subtree, the key to be republished either falls within the subtree's ID range or it doesn't. If it does, the refresh ensures that the node knows the k closest nodes, because the subtree is already known to contain at least k nodes. If the key falls outside the range, the publishing node must itself be one of the k closest nodes to the key; after all, nodes only store a key-value pair if they are among the k closest to its key. Therefore, the node must have a k-bucket covering the region closer to the key, and that bucket must hold fewer than k nodes (if it held k, those nodes would all be closer to the key, and the publishing node would not be among the k closest). So, between that k-bucket and the refreshed subtree, the publishing node knows the k closest nodes.
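The first optimization reduces to a timestamp check. A sketch, where `stored` and `last_store_received` are hypothetical local tables:

```python
import time

REPUBLISH_INTERVAL = 3600  # one hour, per the paper

def keys_needing_republish(stored, last_store_received):
    """stored maps key -> value; last_store_received maps key ->
    the timestamp of the most recent incoming Store RPC. Any pair
    stored to us within the last hour was effectively republished
    on our behalf, so we skip it."""
    now = time.time()
    return [
        key for key in stored
        if now - last_store_received.get(key, 0.0) >= REPUBLISH_INTERVAL
    ]
```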
Keys also need to be republished when a new node joins the network. When learning about the new node, existing nodes send it store RPCs for the key-value pairs it should have. Existing nodes determine which pairs to send using their knowledge of the surrounding subtree.
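A sketch of that replication step, with `k_closest_known` standing in for a query against the local routing table and `store_rpc` for the Store RPC:

```python
def replicate_to_new_node(new_id, stored, k_closest_known, store_rpc):
    """k_closest_known(key) is assumed to answer, from our own
    buckets, which k known nodes are closest to the key."""
    for key, value in stored.items():
        # The newcomer now ranks among the k closest we know of,
        # so it should hold this pair too.
        if new_id in k_closest_known(key):
            store_rpc(new_id, key, value)
```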
That’s Kademlia in a nutshell. A notable optimization we did not discuss is accelerated lookups. In our examples so far, each branch of the tree corresponded to the value of a single bit. It doesn’t have to be that way. Branches can represent b bits at a time, which cuts the number of lookup hops by roughly a factor of b.
Kademlia can route more efficiently than Chord because its routing state is refreshed more often (every request a node sends or receives doubles as a keepalive), so it finds dead nodes faster. Additionally, Kademlia deals with latency and high churn better because it sends asynchronous messages to α nodes at a time and can proceed with whichever responses arrive first. So, Kademlia is better suited to mobile networks.