Distributed relational databases are a perfect match for Cloud computing models and distributed Cloud infrastructure. As such, they are the way forward for delivering new web scale applications.
But how is the data distributed in a distributed relational database? What is the best way to distribute data for my applications? How do I retune my distributed database for optimal performance as applications evolve and usage patterns change? You do all of this with your data distribution policy.
In this blog I’d like to explore different aspects of a data distribution policy. I want you to come away with a practical understanding you can use as you explore your distributed relational database options.
So, let’s dive in.
Data Distribution Policy: What It Is and Why You Should Care
A data distribution policy describes the rules under which data is distributed. A policy that matches your application’s workflow and usage patterns will give you critical web scale benefits:
- endless scalability
- geo-location of data nearest user populations
- data “tiering”
A poorly conceived data distribution policy will degrade performance, consume more system resources, and cause you problems.
In The Beginning, there was Sharding, and it wasn’t so Good
In the past, to distribute data across an “array” of linked databases, developers needed to program data distribution logic into their actual applications. The effect was to “shard” a database into slices of data. Quite literally every read or write would need to run through new custom-built application code to know where bits of data should be placed, or could be found. This is what Facebook, Twitter and many others did as, at the time, there was no better alternative.
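A minimal sketch of the DIY application-level sharding logic described above. All names here (`SHARD_DSNS`, `shard_for_user`) are hypothetical; real implementations at that time also had to handle resharding, backups, and failover on top of this routing code.

```python
# Hypothetical DIY sharding: the application itself decides which
# physical database holds each piece of data.

SHARD_DSNS = [
    "mysql://db0.example.com/app",
    "mysql://db1.example.com/app",
    "mysql://db2.example.com/app",
]

def shard_for_user(user_id: int) -> str:
    """Every read and write must pass through routing code like this
    to learn which database a user's rows live in."""
    return SHARD_DSNS[user_id % len(SHARD_DSNS)]
```

Note that even this simple modulo scheme bakes the shard count into the application: adding a fourth database changes where almost every user's data "should" live, which is exactly the kind of operational problem developers had to solve themselves.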
This extra sharding code required application developers to take on tasks typically handled by a database. A do-it-yourself approach may seem like a fun challenge (“…hey, after all, how hard can this really be??”). But with your database divided this way, you have the following issues to contend with:
- Operational issues become much more difficult, for example: backing up, adding indexes, changing schema.
- You also need to start checking your query results to verify that each query path is actually yielding accurate results.
A lot has been written about the challenges of sharding a relational database (here’s a good whitepaper you can read: Top 10 DIY MySQL Sharding Challenges), so I won’t go into them here. But, let’s also recognize that some great work has been accomplished by dedicated developers using sharding techniques. They have proven the inherent value of a distributed database to achieve massive scale. At the time, they had to shard as they had no alternative.
Today, there is a better way.
What is a Good Data Distribution Policy?
As I briefly mentioned, a data distribution policy describes the rules under which data is distributed across a set of smaller databases that, taken together and acting as one, comprise the entire distributed database.
The goal we are aiming for is an even and predictable distribution of workloads across the array of clusters in our distributed database. This brings us immense scalability and availability benefits to handle more concurrent users, higher transaction throughput and bigger volumes of data. But these benefits are all lost with a poorly conceived data distribution policy that does not align to your application’s unique usage and workloads. Let’s take a look.
Imagine we have a single database that is starting to exhibit signs of reaching its capacity limits. Throughput is becoming unpredictable. Users are getting frustrated waiting.
We decide the best way to improve the situation is to evolve to a distributed database. Our distributed database would aim to evenly divide the total workload across an array of databases. In this way, data distribution decreases the number of queries that any individual database cluster (or shard) receives.
Figure 1. A good data distribution policy ensures that a specific transaction or query completes within a single database.
The critical point here is that we want to distribute the data in such a way that we minimize the cross-database chatter (from cluster to cluster, or shard to shard), so that each transaction can be completed within a single cluster and in a single fetch/trip.
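One common way to achieve this single-cluster completion is to co-locate related rows under a shared distribution key. The sketch below is a hypothetical illustration (the names `NUM_SHARDS` and `shard_of` are invented for this example): by distributing every table on `customer_id`, a customer's orders and order items land on the same shard, so the typical transaction touches exactly one database.

```python
# Hypothetical illustration of co-location by distribution key.
NUM_SHARDS = 4

def shard_of(customer_id: int) -> int:
    """All tables that carry the distribution key route the same way."""
    return customer_id % NUM_SHARDS

# Related rows share the key, so they land on the same shard:
customer   = {"customer_id": 42}
order      = {"order_id": 9001, "customer_id": 42}
order_item = {"order_id": 9001, "customer_id": 42, "sku": "A-1"}
```

A join of `customer`, `order`, and `order_item` for customer 42 can then run entirely inside one shard, with no cross-database chatter.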
If we distribute data without respecting how the data is actually used, we can make matters worse.
Figure 2. A bad data distribution policy requires transactions or queries to access or collect data from multiple databases.
In the two images above, one case depicts 1,000,000 transactions spread equally across the available resources. The other shows a bad distribution policy, where each query must collect information from every cluster (or shard) – so in every practical sense we are actually increasing the overall workload.
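The arithmetic behind the two figures is worth spelling out. Using the post's 1,000,000-transaction example (the shard count of 4 is an assumption for illustration):

```python
# Back-of-the-envelope comparison of the two policies above.
QUERIES = 1_000_000
SHARDS = 4

# Good policy: each query is answered entirely by one shard.
good_total_work = QUERIES             # 1,000,000 units of work overall
good_per_shard = QUERIES // SHARDS    # each shard handles 250,000 queries

# Bad policy: every query fans out to every shard.
bad_total_work = QUERIES * SHARDS     # 4,000,000 units of work overall
bad_per_shard = QUERIES               # each shard still sees all 1,000,000
```

Under the bad policy, adding shards does not reduce any shard's load – it only multiplies the total work, which is the point the table below makes.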
| Bad Data Distribution Policy | Good Data Distribution Policy |
|---|---|
| The load isn’t distributed – it’s multiplied! | Distributes the workload evenly across available resources |
| Doesn’t scale | Distributes the sessions |
| Adding an additional DB does NOT reduce the overall workload | Delivers linear scalability |
| The limitation of a single DB becomes the limitation of the entire array | Adding another database increases the overall scale potential of the distributed database |
| When queries need data from multiple DBs, transactions must commit across multiple separate DBs (two-phase commit) before completing, adding significant overhead to each commit | Queries complete using data from a single, smaller database, greatly reducing commit overhead |

Table 1. A comparison of good and bad data distribution policies
So, we can see that unless we distribute the data intelligently, we will not achieve any benefit. Actually, we can see things can become worse than before.
The natural question we are led to ask is: “OK, so what is the best way to distribute data for my applications and my workloads?”
How to Create the Best Data Distribution Policy for Your Application
Distributing data across a cluster of smaller database instances while maintaining full relational database integrity, two-phase commit, and rollback (as well as leveraging SQL!) is today’s state of the art for distributed relational databases.
We can define two broad types of data distribution policy:
- Arbitrary Distribution: Data is distributed across database instances without any consideration or understanding of specific application requirements, or of how the data will be used by users or the application;
- Declarative, Policy-Based Distribution: Data is distributed across database instances in a way that is informed by the application’s requirements, data relationships, transactions, and how the data is read and written by the application.
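To make the distinction concrete, here is a hypothetical sketch of what a declarative policy might look like when expressed as data rather than as application code. The field names (`key_column`, `replicate`, and so on) are illustrative only, not any product's actual syntax:

```python
# A hypothetical declarative distribution policy, expressed as data.
POLICY = {
    "shards": 4,
    "tables": {
        "customers":   {"key_column": "customer_id"},
        "orders":      {"key_column": "customer_id"},  # co-located with customers
        "order_items": {"key_column": "customer_id"},  # joins stay shard-local
        "countries":   {"replicate": True},            # small lookup table, copied to every shard
    },
}

def route(table: str, row: dict) -> str:
    """Resolve a row to its home shard from the policy, not from app code."""
    rule = POLICY["tables"][table]
    if rule.get("replicate"):
        return "all-shards"
    return f"shard-{row[rule['key_column']] % POLICY['shards']}"
```

The point of the declarative approach is that routing decisions live in one analyzable place: changing the policy retunes the whole distribution without touching application logic.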
| Arbitrary Data Distribution Policy | Declarative Data Distribution Policy |
|---|---|
| **Pros** | **Pros** |
| Unsophisticated | Ensures that a specific transaction finds all the data it needs in one specific database |
| Predetermined (no forethought required) | Aligns with the schema and DB structure |
| **Cons** | Highly efficient and scalable |
| No intelligence about business, schema, or use cases | Anticipates future requirements and growth assumptions |
| Leads to excessive use of database nodes | **Cons** |
| Leads to excessive use of the network | Requires forethought and analysis |
Arbitrary data distribution is often used by NoSQL database technologies. In fact, breaking the monolithic single-instance database into a distributed database has been at the core of the NoSQL revolution, allowing NoSQL databases to tap into the scalability benefits of a distributed architecture. To get that scalability, however, NoSQL databases have been willing to abandon the relational model. NoSQL and document-store databases can rely on arbitrary data distribution because their data model does not provide for joins. Meanwhile, customers have needed something to handle their massive web scale database loads, so they’ve been willing to try new technologies, like MongoDB, with new non-relational approaches. And in some application scenarios, losing the relational data model has been an acceptable trade-off. Having a choice is good.
However, nowadays you can get massive web scale and keep the time-tested relational database model, if you use a declarative, policy-based data distribution approach.
Academia has written about various types of distributed relational databases for decades. But today they are a reality. Declarative, policy-based data distribution is the way forward.
The good news is that today tools can identify the best declarative, policy-based data distribution approach for you!
If you use MySQL, you can take what you know now and check out ScaleBase’s free online Analysis Genie service for MySQL. It guides you through very simple steps to create the best data distribution policy matched to your unique application requirements and data.
If you’re just naturally curious about how to evolve your relational database into a modern distributed relational database, let’s dive into the details by looking at two very typical database and development scenarios:
- Scaling an existing application
- Designing scalability in a brand new application
In tomorrow’s blog post we will discuss Scaling Existing Applications: Key Observations and Measurements.
Reference: Pinal Dave (http://blog.SQLAuthority.com)