A lot of forward-thinking companies are building software (web applications / mobile apps) to digitize and automate their business. When the amount of data that is being hosted on the software’s database is very high due to the high number of users or operations being done, database sharding is a common practice to ease up the database to reduce loading time.
Well, The word “Shard” means “a small part of a whole“. Hence Sharding means dividing a larger part into smaller parts. That means dividing the database into small databases.
Database sharding is the process of splitting up a database across multiple machines to improve the scalability of an application. In Sharding, one’s data is broken into two or smaller chunks, called logical shards. The logical shards are then distributed across separate database nodes, referred to as physical shards.
Need for Sharding:
Consider a very large database whose sharding has not been done. For example, let’s take a Database of a college in which all the student’s records (present and past) in the whole college are maintained in a single database. So, it would contain a very very large number of data, say 100,000 records.
Now when we need to find a student from this Database, each time around 100, 000 transactions have to be done to find the student, which is very very costly.
Now consider the same college students’ records, divided into smaller data shards based on years. Now each data shard will have around 1000–5000 student records only. So not only the database became much more manageable, but also the transaction cost of each time also reduces by a huge factor, which is achieved by Sharding.
Hence this is why Sharding is needed.
Types of Sharding :
1. Key based Sharding
Key-based sharding, also known as hash-based sharding, involves using a value taken from newly written data — such as a customer’s ID number, a client application’s IP address, a ZIP code, etc. — and plugging it into a hash function to determine which shard the data should go to. A hash function is a function that takes as input a piece of data (for example, a customer email) and outputs a discrete value, known as a hash value. In the case of sharding, the hash value is a shard ID used to determine which shard the incoming data will be stored on. Altogether, the process looks like this:
2. Range Based Sharding
Range-based sharding involves sharding data based on ranges of a given value. To illustrate, let’s say you have a database that stores information about all the products within a retailer’s catalog. You could create a few different shards and divvy up each products’ information based on which price range they fall into, like this:
3. Directory Based Sharding
To implement directory-based sharding, one must create and maintain a lookup table that uses a shard key to keep track of which shard holds which data. In a nutshell, a lookup table is a table that holds a static set of information about where specific data can be found. The following diagram shows a simplistic example of directory-based sharding:
Directory-based sharding is flexible as compared to range-based sharding and key-based sharding. Range-based sharding limits you to a specified range of values, while key-based sharding limits you to using a fixed hash-based function.
Benefits of Sharding :
The main appeal of sharding a database is that it can help to facilitate horizontal scaling, also known as scaling out.
- Smaller Databases are Easier to Manage.
Production databases must be fully managed for regular backups, database optimization, and other common tasks. With a single large database, these routine tasks can be very difficult to accomplish, if only in terms of the time window required for completion. By using the sharding approach, each individual “shard” can be maintained independently, providing a far more manageable scenario, performing such maintenance tasks in parallel. - Smaller Databases are Faster.
The scalability of sharding is apparent and achieved through the distribution of processing across multiple shards and servers in the network. What is less apparent is the fact that each individual shard database will outperform a single large database due to its smaller size. By hosting each shard database on its own server, the ratio between memory and data on disk is properly balanced, thereby reducing disk I/O and maximizing system resources. This results in less contention, greater join performance, faster index searches, and fewer database locks. Therefore, not only can a sharded system scale to new levels of capacity, individual transaction performance is benefited as well. - Database Sharding can Reduce Costs.
Most database sharding implementations take advantage of low-cost open-source databases and commodity databases. The technique can also take full advantage of reasonably priced “workgroup” versions of many commercial databases.
Drawbacks of Sharding :
- Adds complexity in the system: Properlyimplementing a sharded database architecture is a complex task. If not done correctly, there is a significant risk that the sharding process can lead to lost data or corrupted tables. Sharding also has a major impact on your team’s workflows.
- Rebalancing data: In a sharded database architecture, sometimes a shard outgrows other shards and becomes unbalanced, which is also known as a database hotspot. In this case, any benefits of sharding the database are canceled out. The database would likely need to be re-sharded to allow for a more even data distribution.
- Joining data from multiple shards: To implement some complex functionalities we may need to pull a lot of data from different sources spread across multiple shards. We can’t issue a query and get data from multiple shards. We need to issue multiple queries to different shards, get all the responses and merge them.
- No Native Support: Sharding is not natively supported by every database engine. Because of this, sharding often requires a “roll your own”. This means that documentation for sharding or tips for troubleshooting problems is often difficult to find.
The main question Should you implement it on your System ?
Well, maybe or maybe not it depends on a lot factors, Some see sharding as an inevitable outcome for databases that reach a certain size, while others see it as a headache that should be avoided unless it’s absolutely necessary, due to the operational complexity that sharding adds.
Because of this added complexity, sharding is usually only performed when dealing with very large amounts of data. Here are some common scenarios where it may be beneficial to shard a database:
1. The amount of application data grows to exceed the storage capacity of a single database node.
2. The volume of writes or reads to the database surpasses what a single node or its read replicas can handle, resulting in slowed response times or timeouts.
Before sharding, you should exhaust all other options for optimizing your database. Some optimizations you might want to consider include:
- Setting Up Of Low Latency remote database
- Implementing caching
- Setting up of database cluster with two or more nodes
- Upgrading to a large server
Many big tech company uses this method for their Distributed system and many of them innovate this method to next extent part for eg. Google, Facebook, Amazon &, etc.
Google Spanner and HBase — Range Sharding
You will saw many big tech companies using sharding architecture in their system designs.
However, Sharding can be a great solution for those looking to scale their database horizontally. However, it also adds a great deal of complexity and creates more potential failure points for your application. Sharding may be necessary for some, but the time and resources needed to create and maintain a sharded architecture could outweigh the benefits for others.
If you want to learn more on what other things are being done to maintain mobile apps, we have a great article covering that.
Suggested Reads:
[catlist categorypage=”yes”]