Consistent hashing forms the core of many distributed systems. It is a hashing technique that solves the problem of data distribution and locality with minimal impact to the system. In this article, i’ll walk you through exactly how consistent hashing works, and how can we build a distributed system with the help of consistent hashing.
For the purposes of this tutorial, we’re building a system where each user’s data must store together in a distributed environment. We’ll use username as the key for hashing.
Typically in a distributed system, data is split across multiple nodes / servers. In our example above, if we have two servers, we would want 50% of the users to be stored in 1st Node, and the remaining 50% in the other node.
A very simple logic would be to use username for splitting. All usernames between A — M can go into node 1, and all others can go to Node 2. Similarly, if we had 3 nodes, our data would then be split across 3 nodes using the first character of the username.
As you may have guessed, there is a problem with this approach. Practically speaking, a lot of usernames may start with A, E, R … and none may start with X, U , etc. In such cases our data may not be evenly distributed across both the servers. So, we must choose a strategy that helps us distribute the data uniformly.
As a solution to the above challenge, we can use a hashing algorithm that generally hash a uniform random distribution. We first map the username to a number, and then use that number as the hashing bucket. And that number stays the same always. We can decide to bucketize millions of users uniformly in buckets of any size. It can be 100, 1000, 1M. That doesn’t matter. A user who’s hashed in a particular bucket will always be assigned the same bucket number. So, our requirement for storing the data in a single place is satisfied. The diagram below shows how all users By using a good hash function, we can ensure that the distribution is uniform.
If we had only 2 servers with us, we can assign hashes from
0-49 to server 1 , and
50-99 to server 2. By doing so, we ensure that our keys are uniformly distributed across nodes in our cluster.
Even though this solution is better at solving the problem of distribution, there is still another issue with this approach.
What happens when we add a new node to our 2-server cluster. Based on our hashing strategy, we would now say that server 1 holds keys with Hash 0–33, server 2 holds nodes from 34–67 and the third server can hold data hashes between 68–99. Refer to the diagram below.
The problem with this approach is that every time a new node is added or removed from our cluster, hashes must be redistributed across machines. Imagine if we had 20 nodes, then there would be a redistribution of data everywhere.
When the data volume is huge, network and redistribution of data is an expensive operation. This problem can be prevented if we could somehow minimize hash redistribution across nodes. This is where consistent hashing comes into the picture.
Consistent hashing is an improvement over our previous hashing technique. our goal is to reduce the distribution of data whenever our server load increases. And at the same time provide hashing for keys such that we know exactly which node to query the data from.
The idea with consistent hashing is that we assume there is a virtual
Space where all keys are mapped to. Imagine we have a ring like structure which can have
hashing slots numbered from
0 to n-1. n being the number of hashing buckets. When we hash keys with our hashing algorithm, all Keys (user Ids in our case ) will be placed at some point on the ring. (Diagram below )
In the diagram above, we have decided that we’ll have 1024 buckets. In reality this may be any number. So, even if we had millions of keys, they would be uniformly distributed across these 1024 buckets.
Since we want to ensure that we distribute data across various machines in our cluster, we could just take all machine ids and hash them on the same ring. Let’s look at what happens if we have four machines in our cluster. We treat machine id / node id as another id in the same hashing space. This would make sure that machine ids are distributed on the same ring as the target key / user id.
If you notice the placement of machines in the ring, you can see that machines also get a bucket, and are placed on the same ring position depending on their hash for machine id.
The last part of consistent hashing is assigning keys to machines. Refer to the diagram below. We number our machine hash ranges for simplicity.
As seen in the diagram above we assigned all buckets between two nodes to the next node. So, a key that hashes to a bucket will always be placed to the next machine on the ring. In our example, we see M2 handles requests for hashing between
0-200, M3 handles keys that hash between
201-500, and so on.
Because the machines and keys share the same hash space, we can easily add or remove a node in our ring. Let’s add a new Machine m5 to our nodes and place it in between m1 and m2. Notice what happens to our data distribution when we do this operation.
Previously, M2 Was dealing with Hashes between
0-200. This is because all hashes are served by the next available node in our algorithm. Now when we add another node to our cluster, we place it at hash 0. In this case, there are two changes that will happen.
Notice how none of the other machines are impacted because of this. There’s no data movement between any other node. This is the beauty of consistent hashing. The algorithm ensures that the data movement and distribution of data is minimal.
When machine 5 takes responsibility of 901–1023 , it can just query M2 to request a move of these hashes from M2. That’s the only data movement that may happen in this case.
Consistent hashing is a nice technique to distribute our data and minimize data movements as nodes are added or removed from our system. There is another problem which we haven’t covered in this post, but our current design may lead to more load on some machines than the other. We’ll look at this in another post.
In order to understand distributed systems, consistent hashing serves as a very nice way of understanding how to distribute data evenly. And conceptually understand how many systems out there might be doing it.
I hope this post helps you understand some basics of this technique and gives you insight on how real-world systems may implement it.