System design for an LRU cache

Systems design is the process of defining the architecture, modules, interfaces, and data for a system to satisfy specified requirements. Systems design could be seen as the application of systems theory to product development.

In this article, we will learn how to design a cache.


Let's begin to design our system. Any system design starts with some requirements, and listing them gives us a better idea of the overall limitations of our MVP (Minimum Viable Product).

Requirements

  • Data size: Let's say we will store data at a scale similar to Facebook's, going up to a few TBs.

  • Cache Eviction Strategy: Over time we might not have space to store all the cache entries, so we need a strategy to evict old entries, such as LRU (Least Recently Used).

  • Access Patterns for Cache

    • Write-through cache: In this pattern, the write goes through the cache and is marked as successful only after both the cache and the database writes succeed. This pattern is useful for applications where written data is re-read soon after. However, write latency increases because there are two writes.

    • Write-around cache: In this pattern, the write goes directly to the database; the cache is populated only when a read request results in a cache miss. Reads of recently written data therefore have higher latency, since they cause a lot of cache misses.

    • Write-back cache: In this pattern, the write goes directly to the cache and is marked successful as soon as the cache is updated; the cache then asynchronously syncs to the database. Writes are very fast and throughput is high. The drawback is potential data loss if the cache dies before the data is committed to the database, which we can mitigate with replication.
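
    To make the three patterns concrete, here is a minimal sketch in Python. The cache and db dictionaries and the sync_queue are hypothetical stand-ins for a real cache cluster and backing database, not part of the original design.

    import queue

    cache, db = {}, {}          # hypothetical in-memory cache and backing database
    sync_queue = queue.Queue()  # pending cache-to-db syncs for the write-back pattern

    def write_through(key, value):
        # Marked successful only after both the cache and the database are updated.
        cache[key] = value
        db[key] = value

    def write_around(key, value):
        # The write goes straight to the database; the cache fills on a later read miss.
        db[key] = value

    def read_with_write_around(key):
        if key not in cache:             # cache miss -> load from the database
            cache[key] = db[key]
        return cache[key]

    def write_back(key, value):
        # Marked successful after the cache write; the database sync happens later.
        cache[key] = value
        sync_queue.put((key, value))     # a background worker would drain this queue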

Estimations

In this part, we will go over rough estimates for our system.

  • Cache Size: If we are caching at the scale of Google or Facebook, let's cap our cache at 50 TB.

  • No. of Machines: Since we are designing a cache, the data should live in memory for low latency. Assuming 128 GB of RAM per machine, the number of machines required = 50 TB / 128 GB, which is close to 390 machines.
    This is the minimum for our system; however, it might change based on the next topic, QPS.

  • Queries Per Second: To analyze the load on the system, we need to define its QPS; this will give us enough insight. With around 390 machines and a total load of 10M queries per second, each machine handles roughly ~25,000 queries per second.
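
    These back-of-the-envelope numbers can be reproduced in a few lines of Python; the 50 TB, 128 GB, and 10M QPS figures are just the assumptions from this section.

    TOTAL_CACHE_TB = 50          # total data to cache (assumption above)
    RAM_PER_MACHINE_GB = 128     # assumed RAM per machine
    TOTAL_QPS = 10_000_000       # assumed overall load

    machines = (TOTAL_CACHE_TB * 1000) / RAM_PER_MACHINE_GB   # close to 390 machines
    qps_per_machine = TOTAL_QPS / machines                    # roughly 25,000 QPS each

    print(round(machines), round(qps_per_machine))            # 391 25600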

CAP Theorem Check

Understanding the CAP theorem is very important when designing a system. It states that out of Consistency, Availability, and Partition Tolerance, only two can be guaranteed at a time.

  • Consistency: Every read receives the most recent write or an error

  • Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write

  • Partition Tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

The CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability.

When choosing consistency over availability, the system will return an error or a timeout if particular information cannot be guaranteed to be up to date.

Whereas, when choosing availability over consistency, the system will always process the query and try to return the most recent available version of the information, even if it cannot guarantee that it is up to date.

  • Partition Tolerance: This maps directly to latency; we need to ask ourselves how important latency is for our system.
    Since we are designing a cache, latency is of utmost importance for us.

  • Availability: If our cache is not available, it would result in a very high cache-miss rate and increased latency. Hence, we will choose availability over consistency.

  • Consistency: Given that the above two points matter most to us, and that in a huge cache like Facebook's or Twitter's missing a few updates is acceptable, we will aim for eventual consistency instead.

Hence, in designing a cache, we will go with Partition Tolerance and Availability.


In-Depth analysis 

Let's go in depth on our design, starting with a small MVP and iterating on it later.

  • Cache Data Structure with Eviction:

    If we go with LRU (Least Recently Used), then how should we store the data?

    If we only need the GET and SET operations, a simple HashMap would be good enough for our cache; however, since we need to evict keys once the cache is full, we need a different data structure, or a combination of data structures.

    We need a data structure that gives us the least recently used objects in order, and does so very fast.

    To maintain the order, we can store the elements in either an Array or a LinkedList. Now, when a GET operation is performed, we need to move the element from its current position to the start of the list, which means a delete followed by an insert at the head of the list.

    Deleting an element from the middle would take more time; to minimize this, we can keep another map that stores each element's node.

    Hence, we will go with the following data structures:

    • HashMap: This map would be our base map, containing our data as key-value pairs.

    • Doubly LinkedList: This list would contain the keys in the order in which they were accessed. Note that we use a LinkedList and not an Array.
      We know that we need to move the accessed node to the front of the list; if we used an Array here, that would be very costly (we would need to shift all the affected Array cells).

    • HashMap: This map would hold, for each key, its node in the LinkedList above; this lets us delete a node from the LinkedList in O(1) time and makes our cache faster. (A small end-to-end sketch of these structures follows the cache operations below.)

  • Cache Operations:

    • Read Operation

      • Step 1: GET operation on the HashMap
      • Step 2: Update the LinkedList, moving the item's node to the front
      • Step 3: Update the metadata HashMap with new pointers

    • Write Operation

      Insert a new key-value pair to the cache
      If the cache is full, then

      • Step 4: Find out the least recently used item from LinkedList
      • Step 5: Remove its entry from the LinkedList metadata HashMap
      • Step 6: Remove it from LinkedList
      • Step 7: Insert the new item in HashMap
      • Step 8: Insert the new item in LinkedList
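
    Putting the three data structures and the numbered steps above together, here is a minimal single-threaded sketch in Python (no locking or sharding yet). The class and method names are only illustrative, not a reference implementation.

    class Node:
        # Doubly-LinkedList node holding a cache key.
        def __init__(self, key=None):
            self.key = key
            self.prev = None
            self.next = None

    class LRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.data = {}     # base HashMap: key -> value
            self.nodes = {}    # metadata HashMap: key -> its LinkedList node
            # Sentinel head/tail; the most recently used node sits right after head.
            self.head, self.tail = Node(), Node()
            self.head.next, self.tail.prev = self.tail, self.head

        def _unlink(self, node):                 # remove a node from the list
            node.prev.next, node.next.prev = node.next, node.prev

        def _push_front(self, node):             # insert a node right after head
            node.prev, node.next = self.head, self.head.next
            self.head.next.prev = node
            self.head.next = node

        def get(self, key):
            if key not in self.data:
                return None                      # cache miss
            node = self.nodes[key]               # Step 1: HashMap lookup
            self._unlink(node)                   # Step 2: move the node to the front
            self._push_front(node)               # Step 3: self.nodes still points at it
            return self.data[key]

        def set(self, key, value):
            if key in self.data:                 # existing key: update and refresh recency
                self.data[key] = value
                self.get(key)
                return
            if len(self.data) >= self.capacity:  # cache is full
                lru = self.tail.prev             # Step 4: least recently used item
                del self.nodes[lru.key]          # Step 5: drop its metadata entry
                self._unlink(lru)                # Step 6: remove it from the LinkedList
                del self.data[lru.key]
            node = Node(key)
            self.data[key] = value               # Step 7: insert into the HashMap
            self.nodes[key] = node
            self._push_front(node)               # Step 8: insert into the LinkedList

    For example, with a capacity of 2, setting keys a and b, reading a, and then setting c evicts b, since b is now the least recently used.
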
  • Prioritization of operations:

    In most concurrent systems, writes compete with reads and with other writes. This requires locking items while writes are happening.

    We can make a trade-off between reads and writes: if we revisit the design goals, latency was an important factor for our system, so a read should never get blocked or have to wait because of writes.

    Following this logic, Step 1 should be really fast, and Step 2 can happen asynchronously. Similarly, all the write steps can be done asynchronously.

    Of our steps, Step 3, Step 5, and Step 7 deal with the HashMap. To minimize latency, we can take a read/write lock per row rather than on the entire map.
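
    As a rough illustration of deferring Step 2 (and the write steps) off the read path, the recency update can be pushed onto a queue that a background thread drains, so a read only touches the HashMap. The names below are illustrative and use nothing beyond the standard threading and queue modules.

    import threading, queue

    updates = queue.Queue()                  # pending recency-list updates

    def read(cache_data, key):
        value = cache_data.get(key)          # Step 1: fast HashMap lookup only
        if value is not None:
            updates.put(key)                 # Steps 2-3 are deferred to the worker
        return value

    def recency_worker(move_to_front):
        # Runs in the background, applying the deferred LinkedList updates.
        while True:
            key = updates.get()
            move_to_front(key)               # e.g. the LRUCache list manipulation above
            updates.task_done()

    # threading.Thread(target=recency_worker, args=(my_move_fn,), daemon=True).start()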

  • HashMap Implementation:

    The HashMap can be implemented in multiple ways; one way would be to use a LinkedList for colliding keys (separate chaining).

    Let's say our hashmap has size N and we wish to add (key, value) to it:
    Let M = an array of N pointers, with every element initialized to NULL
    For a given key, compute g = hash(key) % N
    Create newEntry = a LinkedList node holding the value, then set newEntry.next = M[g] and M[g] = newEntry

    Given this implementation, we can see that instead of having a lock on a hashmap level, we can have it for every single row. This way, a read for row i and a write for row j would not affect each other if i != j. Note that we would try to keep N as high as possible here to increase granularity.
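
    Here is a minimal sketch of that idea with one lock per row, using Python's threading module. The class name and the bucket count are illustrative, and a Python list stands in for the per-row LinkedList.

    import threading

    class StripedHashMap:
        # Chained hash map with one lock per bucket, so rows are locked independently.
        def __init__(self, n_buckets=1024):           # keep N high for granularity
            self.n = n_buckets
            self.buckets = [[] for _ in range(self.n)]          # row g: list of (key, value)
            self.locks = [threading.Lock() for _ in range(self.n)]

        def _row(self, key):
            return hash(key) % self.n                 # g = hash(key) % N

        def set(self, key, value):
            g = self._row(key)
            with self.locks[g]:                       # lock only row g
                for i, (k, _) in enumerate(self.buckets[g]):
                    if k == key:
                        self.buckets[g][i] = (key, value)
                        return
                self.buckets[g].append((key, value))  # newEntry goes into row g

        def get(self, key):
            g = self._row(key)
            with self.locks[g]:                       # a write to row j != g is unaffected
                for k, v in self.buckets[g]:
                    if k == key:
                        return v
                return None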

  • Sharding the Cache:

    Now that we have understood how we will store the cache data, let's discuss how we can shard the cache for high throughput.

    • QPS for a single machine:

      In the estimation section, we estimated that we would store around 50 TB of data and serve 10M queries per second. Every piece of data in our cache is stored in the hashmap, with a corresponding entry in the LinkedList. We can store 128 GB on every single machine.

      With that configuration, every machine has to handle 10M / 390 ≈ 25,000 QPS.

    • 25000 QPS Handling:

      Assuming that a machine has to serve 25,000 QPS and that each machine has 4 cores, we can calculate the per-request time as

      CPU time available per query = 4 * 1000 * 1000 / 25000 microseconds = 160us

      Alternatively, you can calculate the same as follows:

      25,000 QPS will be served by 4 cores (4 different threads), which means each core has to process 25000 / 4 = 6250 QPS. So each query needs to be served in 1 sec / 6250 = 160 microseconds.

      So each machine has to answer a query within 160 us. This is how the per-query budget is derived from the QPS. Then, based on the read/write traffic and the latency numbers at https://gist.github.com/jboner/2841832, the estimate is refined further by increasing the number of machines.

    • Sharding Among machines with 32 GB of RAM:

      The number of shards: 50 TB / 32 GB = 50 * 1000 / 32 ≈ 1562

      This leads to a QPS of approximately 6400 per shard (10M / 1562), which should be reasonable. (Note that with a lower main-memory size, the CPU cycles required for access drop as well.) Now, we also need to decide the shard number for every key. A simple way to do it would be to shard based on hash(key) % TOTAL_SHARDS.

      The hash function might differ based on the expected properties of the key. If the key is an auto-incrementing user_id, then the identity hash(key) = key might work well.

      Now, if we use TOTAL_SHARDS in our hash function, then whenever the number of shards changes we need to re-map all the data. This re-mapping adds a lot of overhead; a better way is to use consistent hashing with multiple copies of every shard on the ring. (Read more about consistent hashing at https://en.wikipedia.org/wiki/Consistent_hashing.)
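
      Here is a minimal sketch of such a ring with multiple virtual copies of each shard, using only the standard library. The shard names and the replica count are made up for illustration.

      import bisect, hashlib

      class ConsistentHashRing:
          def __init__(self, shards, replicas=100):
              # Place `replicas` virtual points per shard on the ring.
              self.ring = sorted(
                  (self._hash(f"{shard}#{i}"), shard)
                  for shard in shards
                  for i in range(replicas)
              )
              self.points = [h for h, _ in self.ring]

          @staticmethod
          def _hash(key):
              return int(hashlib.md5(key.encode()).hexdigest(), 16)

          def shard_for(self, key):
              # Walk clockwise to the first virtual point at or after hash(key).
              idx = bisect.bisect(self.points, self._hash(key)) % len(self.points)
              return self.ring[idx][1]

      # ring = ConsistentHashRing([f"cache-{i}" for i in range(1562)])
      # ring.shard_for("user:42")   # only a small fraction of keys move if a shard is added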


  • Fault handling of Shard machine:

    If we only have one machine per shard, then if the machine goes down, all requests to that shard will start hitting the DB and hence there will be elevated latency.

    As mentioned in the design goals, we would like to reduce high latency cases as much as possible. One way to avoid these cases would be to have multiple machines per shard where they maintain exactly the same amount of data.

    A read query for the shard could go to all the servers in the shard, and we can use the data from the one that responds first (a small sketch of this fan-out read appears at the end of this section).

    • Complications of multiple servers:

      Since we have multiple servers maintaining the same data, it is possible that the data is not in sync between the servers. This means that a few keys might be missing on some of the servers, and a few servers might have older values for the same keys.

      Imagine a case when one of the servers goes down, misses a bunch of additions and updates, and then comes back up.

    • Master-Slave technique:

      We can solve the above problem using this approach: there is only one active (master) server at a time in a shard, and it has a follower (slave) that keeps receiving the updates. When the master server goes down, the slave server takes over as the master. The master and slave can maintain a changelog with version numbers to make sure they are caught up.

      Recall the different write patterns from the access-patterns section; they can be implemented on top of this setup.

      If we are fine with all servers becoming eventually consistent, then we can have one master (taking all the write traffic) and many slaves where slaves can service the read traffic as well.
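
      The "use whichever replica answers first" read described above can be sketched with a thread pool. Here replica_fetchers is a hypothetical list of callables, one per server in the shard; it is not part of any specific client library.

      from concurrent.futures import ThreadPoolExecutor, as_completed

      def read_from_shard(replica_fetchers, key):
          # Fan the read out to every replica in the shard and use the first response.
          pool = ThreadPoolExecutor(max_workers=len(replica_fetchers))
          futures = [pool.submit(fetch, key) for fetch in replica_fetchers]
          try:
              for future in as_completed(futures):
                  value = future.result()
                  if value is not None:          # first replica that has the key wins
                      return value
              return None                        # every replica missed: fall back to the DB
          finally:
              pool.shutdown(wait=False)          # don't block on the slower replicas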









