Database sharding is a technique to achieve horizontal scalability in large-scale systems.
Most real-world systems rely on a database server that handles a large volume of read requests and a non-negligible number of write requests. Under heavy load, a single server can become overloaded, which degrades system performance.
To mitigate such impacts and improve the performance of a system, there are approaches such as database replication and database sharding. In this guide, we’ll first explore techniques to improve system performance, including:
Scaling up the database server
Database replication
Horizontal partitioning
After discussing these techniques, we’ll proceed to learn how database sharding works and also look at the advantages and limitations of this approach.
Techniques to Improve System Performance
Let’s start by discussing techniques to improve system performance when there are bottlenecks due to the database server:
#1. Scaling Up the Database Server
Scaling up the database server instance can seem like a straightforward approach to improve system performance. This includes enhancing processing power, adding more RAM, and the like.
However, this technique has a limitation: no server can have infinite storage and processing power, and beyond a certain point, adding more resources yields diminishing returns.
#2. Database Replication
When the database server instance is overloaded by incoming requests, we can consider database replication.
Under database replication, one master node typically receives the write requests, while multiple read replicas hold copies of the data and serve read requests.
This improves availability and mitigates system overload. We can now process multiple queries in parallel as the read requests can be routed to one of the read replicas.
But this introduces another problem. Write requests to the master node change the data, and these updates take some time to propagate to the read replicas.
Suppose a read request reaches one of the read replicas while a write operation is in progress at the master node, or shortly after it completes.
The changes in the master node may not have propagated to that read replica yet. In this case, we read outdated (stale) data, which is not desirable.
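To make the stale-read problem concrete, here is a minimal in-memory sketch. The MasterNode and ReadReplica classes, the explicit sync() step, and the "balance:42" key are all hypothetical stand-ins for illustration; real databases replicate through their own mechanisms.

```python
class MasterNode:
    """Accepts writes and keeps an ordered log of changes for replicas."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class ReadReplica:
    """Serves reads from its own copy, which may lag behind the master."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def sync(self, master):
        # Apply only the changes this replica has not seen yet.
        for key, value in master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(master.log)

    def read(self, key):
        return self.data.get(key)


master = MasterNode()
replica = ReadReplica()

master.write("balance:42", 100)
replica.sync(master)

master.write("balance:42", 250)      # new write, not yet propagated
stale = replica.read("balance:42")   # the replica still holds the old value
replica.sync(master)
fresh = replica.read("balance:42")   # now matches the master
```

Between the second write and the next sync, the replica returns the old value, which is exactly the window in which readers can see outdated data.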
#3. Horizontal Partitioning
Horizontal partitioning is another technique to optimize system performance. We may have a single large table with billions of rows (such as a table of customers and transaction data).
Read operations on such a large table are slow. With horizontal partitioning, the single large table is divided into multiple partitions (or smaller tables) that we can read from instead. Relational databases such as PostgreSQL natively support partitioning.
However, all the partitions are still inside a single database server instance. The only difference is that we can now read from the partitions instead of the single large table.
Therefore, when there is an increase in the number of incoming requests, the server may not be able to support the increased demand.
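As a rough illustration of the idea, the sketch below splits rows into partitions by customer_ID range. Range partitioning is only one possible scheme, and the boundary values, the partition_for helper, and the in-memory dictionaries are assumptions made for this example. Note that all partitions still live in the same process, just as all partitions live on one database server.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds for each partition's customer_ID range:
# partition 0 holds IDs below 1000, partition 1 holds 1000-1999, and so on.
PARTITION_BOUNDS = [1000, 2000, 3000]

# One dict per partition; all of them sit on the same "server" (this process).
partitions = [{} for _ in PARTITION_BOUNDS]


def partition_for(customer_id):
    """Return the index of the partition that holds this customer_ID."""
    index = bisect_right(PARTITION_BOUNDS, customer_id)
    if customer_id < 0 or index >= len(PARTITION_BOUNDS):
        raise ValueError(f"customer_ID {customer_id} is outside all partitions")
    return index


def insert_row(customer_id, row):
    partitions[partition_for(customer_id)][customer_id] = row


def find_row(customer_id):
    # Only one small partition is searched instead of the whole table.
    return partitions[partition_for(customer_id)].get(customer_id)


insert_row(42, {"name": "Ada"})
insert_row(1500, {"name": "Grace"})
```

A lookup touches a single partition, which is where the read speedup comes from, but the total load still lands on one machine.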
How Does Database Sharding Work?
Now that we’ve discussed the approaches to improve system performance and their limitations, let’s understand how database sharding works.
In sharding, we split the single large database into multiple smaller databases, each running on a database server instance. Each such smaller database is called a shard. And each shard contains a unique subset of the data.
But how do we partition the database into shards? And how do we determine which of the rows go into which of the shards?
🔑 Enter the sharding key.
Understanding Sharding Key
Let’s understand the role of the sharding key.
The sharding key, which is usually a column (or a combination of columns) in the database table, should be chosen such that the data is distributed evenly across the shards, because we don’t want a particular shard to be much larger than the others.
In a database that stores data on customers and transactions, the customer_ID is a good candidate for the sharding key.
Once we’ve decided on the sharding key, we can come up with a hashing function that determines which of the rows go into which of the shards.
In this example, say we need to split the database into five shards (shard #0 to shard #4) using the customer_ID as the sharding key. In this case, a simple hashing function is customer_ID % 5.
All customer_ID values that leave a remainder of zero when divided by 5 will map to shard #0. And customer_ID values that leave remainders 1 through 4 will map to shard #1 through shard #4, respectively.
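The mapping described above can be sketched in a few lines. The shard_for function name and the sample range of customer IDs (1 to 1000) are assumptions for illustration only.

```python
from collections import Counter

NUM_SHARDS = 5


def shard_for(customer_id: int) -> int:
    """Map a customer_ID to a shard number using customer_ID % 5."""
    return customer_id % NUM_SHARDS


# Count how many of the sample IDs land on each shard.
distribution = Counter(shard_for(cid) for cid in range(1, 1001))

for shard, count in sorted(distribution.items()):
    print(f"shard #{shard}: {count} customers")
```

With sequential IDs the rows spread evenly across the five shards. If the key values were skewed, for example if most IDs were multiples of 5, the same function would overload a single shard, which is why the choice of sharding key matters.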
After the database sharding is implemented this way, it’s important to have a routing layer that routes the incoming requests to the correct database shard.
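Here is a minimal sketch of what such a routing layer might look like. The ShardRouter class and its put and get methods are hypothetical, and plain dictionaries stand in for connections to separate database servers.

```python
class ShardRouter:
    """Routes each request to the shard that owns the customer_ID."""

    def __init__(self, num_shards):
        # Each dict stands in for a separate database server instance.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, customer_id):
        # Same rule as the hashing function: customer_ID % number of shards.
        return self.shards[customer_id % len(self.shards)]

    def put(self, customer_id, record):
        self._shard_for(customer_id)[customer_id] = record

    def get(self, customer_id):
        return self._shard_for(customer_id).get(customer_id)


router = ShardRouter(num_shards=5)
router.put(7, {"name": "Ada"})     # 7 % 5 == 2, so this goes to shard #2
router.put(10, {"name": "Grace"})  # 10 % 5 == 0, so this goes to shard #0
```

The application only talks to the router, so callers never need to know how many shards exist or which one holds a given customer.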
Advantages of Database Sharding
Here are some of the advantages of database sharding:
#1. High scalability
As the data grows, we can split a large database into more, smaller shards, each on its own server instance. So database sharding allows us to scale out horizontally.
#2. High availability
When there is a single database server instance that handles all the incoming requests, we have a single point of failure. If the database server is down, the entire application is down.
With database sharding, the probability of all the database shards being down at a given instant is relatively low. If a particular shard is down, we will not be able to process requests for the data on that shard, but the other shards can still process their incoming requests. This results in higher availability and increased fault tolerance.
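A small sketch shows the effect of a single shard outage. It assumes the five-shard customer_ID % 5 setup from earlier; the can_serve helper, the choice of shard #2 as the failed shard, and the 100 sample requests are made up for illustration.

```python
NUM_SHARDS = 5
down_shards = {2}  # assume shard #2 is currently unavailable


def can_serve(customer_id):
    """A request succeeds only if the shard that owns the key is up."""
    return customer_id % NUM_SHARDS not in down_shards


requests = range(1, 101)  # one request per customer_ID from 1 to 100
served = sum(can_serve(cid) for cid in requests)
failed = len(requests) - served

print(f"served: {served}, failed: {failed}")
```

Only the requests whose keys map to the failed shard are affected. With a single unsharded server, the same outage would fail every request.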
Limitations of Database Sharding
Now let’s go over some of the limitations of database sharding:
#1. Increased Complexity
Though sharding has advantages in terms of scalability and fault tolerance, it introduces complexity to the system.
From mapping records to shards to implementing the routing layer that sends queries to the respective shards, there’s considerable complexity involved in sharding databases.
#2. Need for Resharding
Another limitation of sharding is the need for resharding.
Though we use a hashing function to get an even distribution of data records, it is possible that one of the shards grows much larger than the others and runs out of capacity sooner. In this case, we have to account for resharding (or reshuffling), which comes with substantial overhead.
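To get a feel for the overhead, the sketch below counts how many keys change shards when a modulo-based scheme grows from five shards to six. The moved_keys helper and the sample of 30,000 sequential IDs are assumptions for illustration. Other schemes, such as consistent hashing, move fewer keys and are not shown here.

```python
def moved_keys(keys, old_shards, new_shards):
    """Return the keys whose shard changes when the shard count changes."""
    return [k for k in keys if k % old_shards != k % new_shards]


keys = range(30_000)
moved = moved_keys(keys, old_shards=5, new_shards=6)
fraction = len(moved) / len(keys)

print(f"{len(moved)} of {len(keys)} keys move ({fraction:.0%})")
```

Under this simple scheme, most rows have to be copied to a different shard when just one shard is added, which is why resharding is expensive.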
#3. Running Complex Queries
When we run analytical queries that involve joins or aggregations, we need records from multiple shards as opposed to a single database. This becomes a challenge when there are many such queries. We can reduce the need for cross-shard joins by denormalizing the data, but that still requires some effort!
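The sketch below illustrates why such queries are harder: even a simple total has to be computed on every shard and then combined. The sample rows and the total_per_shard and total_across_shards helpers are hypothetical, and lists of dictionaries stand in for the shards.

```python
# Each inner list stands in for the transactions stored on one shard.
shards = [
    [{"customer_id": 5, "amount": 20.0}, {"customer_id": 10, "amount": 5.0}],
    [{"customer_id": 1, "amount": 7.5}],
    [{"customer_id": 7, "amount": 12.5}],
]


def total_per_shard(rows):
    """Partial aggregate computed on a single shard."""
    return sum(row["amount"] for row in rows)


def total_across_shards(all_shards):
    """Query every shard, then combine the partial results."""
    return sum(total_per_shard(rows) for rows in all_shards)


grand_total = total_across_shards(shards)
print(f"total transaction amount: {grand_total}")
```

A sum combines easily, but joins between rows that live on different shards require moving data between servers, which is much more costly.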
Let’s wrap up the discussion with a summary of what we have learned.
Scaling up the hardware is not always optimal: a single server has finite capacity, and adding resources gives diminishing returns beyond a certain point. We also reviewed techniques such as database replication and horizontal partitioning and their limitations.
Then, we learned how database sharding works by splitting a large database into smaller and easy-to-manage shards. We discussed how the sharding key should be carefully chosen so as to get even partitions and the need for a routing layer to route the incoming requests to the correct database shard.
Database sharding has advantages such as high availability and scalability. The downsides include the complexity of setting up sharding, the overhead of resharding when one or more shards run out of capacity, and the difficulty of running complex queries across shards.
So you can consider sharding when you think the advantages outweigh the complexity introduced by sharding.
Bala Priya C
Bala Priya is a developer and technical writer from India with over three years of experience in the technical content writing space. She shares her learning with the developer community by authoring tech tutorials, how-to guides, and more…