In an earlier blog post we discussed How to Create a Distributed Relational SQL Database. Today we will continue with Scaling Existing Applications: Key Observations and Measurements.
For an existing application that is approaching the limits of its monolithic MySQL database, scaling up (buying bigger hardware) can only go so far. In today’s public, private and hybrid cloud world, which leverages distributed infrastructure, scaling up is a counterintuitive approach. And it’s expensive.
A monolithic MySQL database that is suffering from scalability issues (for example, inconsistent performance, inconsistent availability, or transaction throughput bottlenecks) can become a distributed MySQL database that retains its relational principles by applying a declarative, policy-based data distribution process.
A well-thought-out data distribution policy aligns with the application’s current database structure and commands. Related data across tables must be identified and grouped so that it stays localized in a single database instance.
Simply put (and to paraphrase a familiar saying), “the data that plays together should stay together.” Our goal is to enable “reads” and “writes” to be completed successfully using only data from within one database instance (or shard) of the total distributed database. The more data that can be processed within a single database instance, without reaching across machines in the distributed database array for other data, the better the application performs and the better the database scales. This way, compute processing is brought to the data, rather than the other way around.
Arriving at a Data Distribution Policy (Analysis)
Reads: Commands play a large role in queries, or reads. Examining which bits of data are accessed in joins, sub-queries or unions will indicate which pieces of data ought to be kept together on one machine. For example, after identifying the ‘users’ in a database, the next step would involve identifying the ‘orders’ related to those ‘users’, then the ‘items’ related to the ‘orders’, and so on. So, first, the tables and data that are related are identified. These relationships usually come from related tables that share the same foreign keys. However, even if there are no foreign keys, it is possible to identify which columns within and between tables are typically joined during query execution.
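The analysis step above can be sketched in a few lines of code. This is a minimal, hypothetical illustration (the `users`/`orders`/`items` schema and the `join_candidates` helper are assumptions for this example, not part of any analysis tool): it finds table pairs that share a column name, which usually marks a foreign-key join and therefore data that should be co-located.

```python
# Hypothetical schema: table name -> list of column names.
schema = {
    "users":  ["user_id", "name"],
    "orders": ["order_id", "user_id", "created_at"],
    "items":  ["item_id", "order_id", "sku"],
}

def join_candidates(schema):
    """Return pairs of tables sharing a column name (a likely join key)."""
    pairs = []
    tables = list(schema)
    for i, a in enumerate(tables):
        for b in tables[i + 1:]:
            shared = set(schema[a]) & set(schema[b])
            if shared:
                pairs.append((a, b, sorted(shared)))
    return pairs

print(join_candidates(schema))
# [('users', 'orders', ['user_id']), ('orders', 'items', ['order_id'])]
```

The result mirrors the chain described above: ‘users’ join ‘orders’ on ‘user_id’, and ‘orders’ join ‘items’ on ‘order_id’, so all three belong together on one shard.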
Writes: On the transactions, or writes, side, additions to the database need to be placed in the appropriate partitioned database instance (or shard) with their related data. For instance, an ‘order’ is made up of many ‘items’, which should therefore be added to the same shard as the ‘order’. A transaction is more efficient when it is contained to a single shard, so the order and all of its items are placed on the same shard. This practice eliminates the need for a distributed transaction with a 2-phase commit. In our example, because there is no direct link between ‘users’ and ‘items’, the distribution process is that much more significant: it requires a cascading key lookup.
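To make the cascading key lookup concrete, here is a minimal sketch. The table names, the modulo placement scheme and the `route_item` helper are all illustrative assumptions, not ScaleBase’s actual mechanism: an ‘item’ carries only an ‘order_id’, so we follow the chain item → order → user to find the distribution key and the target shard.

```python
NUM_SHARDS = 4

# The 'orders' table carries the distribution key (user_id) directly.
orders = {101: {"order_id": 101, "user_id": 7}}

def shard_for(user_id: int) -> int:
    """Map a distribution-key value to a shard (simple modulo placement)."""
    return user_id % NUM_SHARDS

def route_item(item: dict) -> int:
    """Items have no user_id column, so cascade: item -> order -> user -> shard."""
    order = orders[item["order_id"]]  # the cascading key lookup
    return shard_for(order["user_id"])

# A new item born linked to order 101 (owned by user 7) lands on user 7's shard:
print(route_item({"item_id": 5001, "order_id": 101}))  # 3  (7 % 4)
```

Because the item is routed through its parent order’s ‘user_id’, it always lands on the same shard as the order, keeping the whole transaction local.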
Efficiency dictates that we keep together data that is either read together, as in queries (e.g. sub-queries and joins), or written together, as in transactions.
Denormalization – Not the Best Solution
While denormalization may seem like a decent solution to data placement issues, ScaleBase’s cascading key lookup removes the need for denormalization while still resolving those issues efficiently. Returning to our example: because the ‘items’ table does not contain a ‘user_id’ column, the routing process can become very difficult. In this situation, if cascading key lookup is not utilized, denormalization is in order: after verifying which tables contain the distribution key, the key would have to be added to the tables where it is missing. While this helps solve the complex process of data distribution, it creates many additional problems along the way. Learn more about data distribution policies and the ScaleBase solution here.
A shard key is the field according to which data is routed (it can be ‘user_id’ or some other cascaded link, such as ‘order_id’, that eventually leads to ‘user_id’). The fields used to determine where to route data and commands cannot be empty (i.e. null) and must not be updated during the life of the row.
Looking at the data life-cycle, every piece of data must be born with a distribution key that it keeps for its entire life. In our example, an ‘order’ is always born with a ‘user_id’ that never changes. All ‘items’ are linked to a specific parent ‘order’, so they never migrate between ‘orders’; the parent ‘order’ never changes for existing ‘items’, leaving no possibility for them to become ‘orphans’.
It is not enough to simply have the shard key column in all tables; it must also be populated, as part of the data in the table. A row can be inserted into a table, updated many times and deleted. It is vital that every row is inserted into the database with a populated shard key. Returning to our example, the shard key must be included in every insert (i.e. it must not be null). If a row is inserted into the database with a ‘null’ shard key, it cannot be placed into the distributed database.
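The not-null rule above is simple to enforce at insert time. This is a sketch under stated assumptions (the `validate_row` helper and the row layout are hypothetical, not part of any product): reject any row that arrives without a populated distribution key, since such a row cannot be routed to a shard.

```python
def validate_row(row: dict, shard_key: str) -> None:
    """Refuse rows born without a populated distribution key."""
    if row.get(shard_key) is None:
        raise ValueError(f"cannot place row: shard key '{shard_key}' is null")

# A row born with its distribution key populated is accepted:
validate_row({"order_id": 101, "user_id": 7}, "user_id")

# A row with a null shard key is rejected before it reaches any shard:
try:
    validate_row({"order_id": 102, "user_id": None}, "user_id")
except ValueError as e:
    print(e)  # cannot place row: shard key 'user_id' is null
```

In a real deployment the same guarantee would typically come from a NOT NULL constraint on the shard key column, but application-side validation catches the problem earlier and with a clearer error.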
New Applications: Design for Scale from the Start
If you’re building a new web-facing, mobile or social app today, you have to design for millions of users, high transaction rates and ever-larger data volumes from the start.
The same data distribution principles applied to existing applications should also be applied to new applications: data that is read together or written together should be stored and accessed on the same database instance.
When designing a data distribution policy, the distribution key should be selected according to how the data will be grouped and accessed. You can then denormalize, adding the distribution key to every table. Or, better yet, you can plan the links between the tables within each shard from the beginning of the design process.
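For contrast with the cascading approach shown earlier, here is a sketch of the denormalization alternative (again with assumed table and column names): the distribution key ‘user_id’ is copied into the ‘items’ table at design time, so every row can be routed directly, with no parent lookup.

```python
NUM_SHARDS = 4

def shard_for(user_id: int) -> int:
    """Map a distribution-key value to a shard (simple modulo placement)."""
    return user_id % NUM_SHARDS

# With denormalization, 'items' carries user_id from birth,
# so routing needs no lookup through the parent 'order':
item = {"item_id": 5001, "order_id": 101, "user_id": 7}
print(shard_for(item["user_id"]))  # 3  (7 % 4)
```

The trade-off is the one described earlier: direct routing is simpler, but every table now duplicates the key, and that copy must be kept populated and immutable everywhere.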
For example, when looking at an ‘order’ table, it should be clear how it links to the ‘user’ table as well as the ‘order item’ table. It is in this initial phase that either cascading key lookup or denormalization should be chosen, identifying which data may still be missing from the equation. So, when designing the database, ask yourself about the life-cycle of your rows: are they born with a populated distribution key? Designing your application to guarantee this avoids the unpleasant situation of null distribution keys.
When developing an application from scratch, analysis tools are not appropriate, as new applications do not yet have anything to track. ScaleBase created a special guide, “Building a New Application with Massive Database Scalability – Getting Started with ScaleBase”. The goal of this document is to demonstrate how to build a new application that plans for massive database scalability right from the start. It outlines the high-level steps involved and provides a walkthrough of how to create a simple, straightforward database data distribution policy. As an aid, it uses a sample application called ‘Blog System’, mimicking today’s typical modern web-scale applications. It frames the steps involved in building the application, including activities such as uploading files, quick and frequent database writes, and speedy database queries.
If you want to add linear scalability to an existing MySQL application, you can leverage ScaleBase’s free online service, Analysis Genie. It will help define the best data distribution policy for your unique application, based on a guided analysis of the nature of your data, your data relationships and the functional use of your data.
If you are building a new web-facing, mobile or social app, you have to design for millions of users, high transaction rates and ever-larger data volumes. This guide will help you get started on your way to a more efficiently distributed database.
Have you been exploring distributed databases?
What are some challenges with distributed relational databases that you would like me to focus on in future blog posts?
Reference: Pinal Dave (http://blog.SQLAuthority.com)