Welcome to the distributed systems series, In this article we are going to learn consistent hashing and It’s usage in distributed systems. Why consistent hashing is important and how it plays a role in designing distributed systems such as database, cache ..etc. Let’s first understand what is hashing and how it is used to distribute data across machines. Then, we will understand what is consistent hashing.
Hashing
Hashing is a algorithmic technique which converts object into hashes or digest. Simple example would be hashcode function in java which returns a unique id for an immutable object. This returned id is used to choose the bucket from array of buckets for storage and retrieval. In order for this hashing function to return the correct value, the object or key that we use to hash should be immutable. This is how hashing works in java to store and retrieve value in HashMap data structure. If you know how hashmap works, concept is pretty similar in distributed systems. In distributed systems, we have array of machines to store the data and we have to decide which machines should hold the specific data. The following diagram explains how hashing is used to store {key, value} data on different machines.
The above example is pretty good to store and distribute large number of objects. since the data is partitioned horizontally, simple hash function based on key hash(key)%N would decide which machine to store the given {key, value}. N represents the number of machines in the cluster. What is the problem with above approach?. If we have to add or remove machines from the cluster, Everything {key, value} object that is stored on the cluster should be redistributed. This is not efficient due to the movement of all the keys in the cluster.
Why this is inefficient?. Let’s just imagine we have cache cluster of 100 machines. we have to redistribute all the keys while a new machine is added or removed from the cluster, we have to recalculate hash(key)%N for all keys and move them to corresponding machines. Surely it’s not efficient. The better way to solve this problem is using different hashing mechanism called Consistent Hashing.
Consistent Hashing
Consistent Hashing is a technique used in distributed systems to efficiently store and manage data. Let’s first understand how it works. It works by creating a hashring cluster with multiple points that range from start and end. The ring cluster is sorted in order, When a new machine is added to the ring cluster, It will take care of managing certain points in the circle. The below diagram explains how consistent hashing works. we created a simple hashring which have points ranges from 0 to 100. The points are ordered from 0 to 100 and each machine would take care of handling subset of the entire range. As specified as below, First machine will store points upto 20 which means if the hash(key) returns value that is ≤ 20, then that {key, value} is stored in the machine {20}.
How do we find the machine from hash(key)?. We can simply iterate or do binary search the range{0…100} to find the machine since points are already sorted. {key, value} is stored on the machine which has next higher range for the hash id. As specified in the below example, hash(k1) returns 18 and hash(k2) returns 38. k1 is stored on machine {20} and k2 is stored on machine {40}.
So far, we have seen how to add {key, value} to the respective machine using consistent hashing. The process is same for deleting object from the cluster. We have to get the hash(key) to find the node which has higher points for the computed hash id. Once it is found, we could simply delete the object from the cluster using key.
In our example, we have built a distributed ring cluster with points that ranges from 0 to 100. But what if the hash id for the key is higher than 100, In that case we will add {key, value} to the first node {20} of the cluster. Previously mentioned scenario is applicable for deleting {key, value} from the cluster, If the hash(key) is higher than 100 then it would search the {key, value} in the first node to delete.
Adding or Removing Nodes
In the previous section of this post, we saw how to add or remove data from the cluster using consistent hashing. In this section, we will see how adding or removing nodes from the cluster affects the movement of the data in the cluster.
While adding a new node to the cluster, we will calculate hash(serverId) to find where this node can be placed on our ring cluster. In the below example, we have removed 2 nodes {40} {50} from the cluster. When we remove these nodes, we could simply copy all the data that is stored on those nodes to node {60}. Now node{60} will take care of all the requests for hash(k2) and hash(k3). If we really look at this example, Instead of moving all the keys across the cluster, we have only moved the subset of the entire key ranges. This is the advantage of consistent hashing. Now, we have reduced the movement of key ranges to k/n where k is the total number of keys and n is the number of machines in the cluster.
Let’s add a new node to our cluster, hash(serverId) lies somewhere between {30} and {60}. Let’s consider this new node can take points upto {40}. When a new node is added, It has to copy all the data that belongs to {40} from {60}. Once it is done, {40} will take care of storing and retrieving hash(k2) which is 38 and node{60} will serve the request for hash(k3) which is 48.
So far, we have seen how consistent hashing is used to add or remove {key, value} from the cluster. How it effectively minimise the data movements across the cluster while a new node is added or removed from the cluster.
One more thing to understand here is that even though consistent hashing helps to minimise data movement, It will not guarantee that data is uniformly distributed across all the nodes of the cluster. This is completely depends on the hashing algorithm that we use.
Conclusion:
In this article, We have understood what is consistent hashing and why it is used in distributed systems such database, cache and how it minimise the data movement across the cluster. We have also understood how adding or removing a node in the cluster affects the movement of subset of key ranges in the cluster and finally we understood why consistent hashing is used in distributed systems.