MySQL – Scaling Existing Applications: Key Observations and Measurements

In the earlier blog post we discussed How to Create a Distributed Relational SQL Database. Today we will continue with Scaling Existing Applications: Key Observations and Measurements.

For an existing application that is approaching the limits of its monolithic MySQL database, scaling up – getting bigger hardware – can only go so far. In today’s public, private and hybrid cloud world built on distributed infrastructure, scaling up is a counterintuitive approach. And it’s expensive.

A monolithic MySQL database that is suffering from scalability issues (for example, inconsistent performance, inconsistent availability, or transaction throughput bottlenecks) can become a distributed MySQL database that retains its relational principles by applying a declarative, policy-based data distribution process.

A well-thought-out data distribution policy aligns with the application’s current database structure and commands. Related data in the various tables has to be identified and grouped so that it stays localized in a single database instance.

Simply put (and to paraphrase a familiar saying), “the data that plays together, should stay together.”  Our goal is to enable “reads” and “writes” to be completed successfully using only data from within one database instance (or shard) of the total distributed database. The more processing that can be completed within a single database instance, without reaching across machines in the distributed database array for additional data, the better the application’s performance and the database’s scalability. This way, compute processing is brought to the data, as opposed to the other way around.

Arriving at a Data Distribution Policy (Analysis)

Reads: Commands play a large role in queries, or reads. Examining which bits of data are accessed in joins, sub-queries or unions indicates which pieces of data ought to be kept together on one machine. For example, after identifying the ‘users’ in a database, the next step is to identify the ‘orders’ related to those ‘users’, then the ‘items’ related to the ‘orders’, and so forth. So, first, the related tables and data are identified. This usually comes from related tables that share the same foreign keys. However, even if there are no foreign keys, it is possible to identify which columns within and between tables are generally joined during query execution.
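As a quick sketch of this analysis step (assuming a reachable MySQL server and the pymysql driver; the schema name and credentials are placeholders), the foreign keys that reveal which tables “play together” can be read straight out of information_schema:

# A minimal sketch: list the foreign-key relationships so that related tables
# (users -> orders -> items) can be grouped under one distribution key.
# Assumes a reachable MySQL server; credentials and schema name are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret", database="shop")
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT TABLE_NAME, COLUMN_NAME,
                   REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
            FROM information_schema.KEY_COLUMN_USAGE
            WHERE TABLE_SCHEMA = %s
              AND REFERENCED_TABLE_NAME IS NOT NULL
        """, ("shop",))
        for table, col, ref_table, ref_col in cur.fetchall():
            # e.g. orders.user_id -> users.id, items.order_id -> orders.id
            print(f"{table}.{col} -> {ref_table}.{ref_col}")
finally:
    conn.close()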

Writes: On the transactions, or writes, side, additions to the database need to be placed in the appropriate partitioned database instance (or shard) with their related data. For instance, an ‘order’ is made up of many ‘items’, which are added to the same shard as the ‘order’. A transaction is more efficient when it is contained to a single shard, so the order and all of its items are placed on the same shard. This practice eliminates the need for a distributed transaction with a two-phase commit. In our example, because there is no direct link between ‘users’ and ‘items’, the distribution process matters even more: routing an ‘item’ requires a cascading key lookup (item → order → user).
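Here is a rough illustration of that idea, not ScaleBase’s actual implementation: the routing rule and shard connections are hypothetical, but the shape of the logic is the point. The ‘order’ and its ‘items’ are routed to the shard that owns the parent ‘user’, so the whole write stays local:

# Hypothetical sketch of cascading-key routing: the shard is chosen by user_id,
# and 'items' reach that shard indirectly through their parent 'order'.
def shard_for_user(user_id, shard_count=4):
    # placeholder routing rule; a real policy could be range- or lookup-based
    return user_id % shard_count

def insert_order(shard_conns, user_id, order_id, items):
    shard = shard_conns[shard_for_user(user_id)]     # resolve user_id -> shard once
    with shard.cursor() as cur:
        cur.execute("INSERT INTO orders (order_id, user_id) VALUES (%s, %s)",
                    (order_id, user_id))
        for item_id, sku in items:
            # items carry only order_id; they land on the right shard because
            # the parent order (and therefore its user) already lives there
            cur.execute("INSERT INTO items (item_id, order_id, sku) VALUES (%s, %s, %s)",
                        (item_id, order_id, sku))
    shard.commit()   # single-shard commit: no distributed 2-phase commit required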

Efficiency dictates that data which is either read together, as in queries (i.e. sub-query joins), or written together, as in transactions, should live on the same shard.

Denormalization – Not the Best Solution

While denormalization may seem like a decent solution to data placement issues, ScaleBase’s cascading key lookup removes the need for denormalization while efficiently resolving those placement issues. Returning to our example, because the ‘items’ table does not contain a ‘user_id’ column, routing an ‘item’ can become very difficult. In this situation, if cascading key lookup is not utilized, denormalization is in order: after verifying which tables contain the distribution key, the key has to be added to the tables in which it is missing. While this helps solve the complex process of data distribution, it creates additional problems along the way. Learn more about data distribution policies and the ScaleBase solution here.
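For completeness, here is roughly what the denormalization route would look like if cascading key lookup were not available (a sketch on the running example; table and column names are illustrative):

# Sketch of the denormalization alternative: copy the distribution key (user_id)
# from 'orders' into 'items', then require it on every future insert.
def add_user_id_to_items(conn):
    steps = [
        "ALTER TABLE items ADD COLUMN user_id BIGINT",
        "UPDATE items JOIN orders ON orders.order_id = items.order_id "
        "SET items.user_id = orders.user_id",
        "ALTER TABLE items MODIFY user_id BIGINT NOT NULL",
    ]
    with conn.cursor() as cur:
        for statement in steps:
            cur.execute(statement)
    conn.commit()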

Null Columns

A shard key is the field according to which data is directed (it can be the ‘user_id’ or some other cascaded link, such as ‘order_id’, that eventually leads to ‘user_id’). The fields that are used to determine where to route the data and commands cannot be empty (i.e. null) or updated during the life of the row.

Looking at the data life-cycle, every piece of data must be born with a distribution key that it keeps for the course of its entire life. Looking back at our example, an ‘order’ is always born with a ‘user_id’ that never changes. All ‘items’ are linked to a specific ‘parent order’, so they never migrate between ‘orders’, and the ‘parent order’ never changes for existing ‘items’, leaving no possibility for them to become ‘orphans’.

It is not enough to simply have the shard key in all tables; it needs to be populated as part of the data in each table as well. A row can be inserted into a table, updated many times and deleted. It is vital that every row is inserted into the database with a populated shard key. Returning to our example, it is imperative that the shard key is included in every insert (i.e. it is not null). If a row is inserted into the database with a ‘null’ shard key, it cannot be placed in the distributed database.
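One way to bake that rule into the schema itself (a sketch using the running example; names are illustrative) is to declare the distribution key NOT NULL, so the database rejects ‘orphan’ rows at insert time, and to fail fast in the application as well:

# Sketch: the distribution key (user_id) is declared NOT NULL, so every 'order'
# is born with it and a row with a NULL key can never reach the distributed database.
CREATE_ORDERS = """
CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY,
    user_id  BIGINT NOT NULL,   -- distribution key: populated at insert, never updated
    created  DATETIME NOT NULL
)
"""

def insert_order_row(conn, order_id, user_id):
    if user_id is None:
        # fail fast in the application too: a NULL key cannot be routed to any shard
        raise ValueError("an order must be born with a user_id (distribution key)")
    with conn.cursor() as cur:
        cur.execute("INSERT INTO orders (order_id, user_id, created) VALUES (%s, %s, NOW())",
                    (order_id, user_id))
    conn.commit()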

New Applications: Design for Scale from the Start

If you’re building a new app that is web-facing, mobile or social, today you have to design from the start for millions of users, high transaction rates and ever larger data volumes.

The same data distribution principles applied to existing applications should also be applied to new applications: data that is accessed together, whether for “reads” or “writes”, should be stored together on the same database.

When designing a data distribution policy, the distribution key should be selected according to how the data will be accessed and distributed. You can then denormalize, adding the distribution key to every table. Or, better yet, you can distribute by understanding the links between the tables within each shard from the beginning of the design process.

For example, when looking at an ‘order’ table, it should be clear how it is linked to the ‘user’ table, as well as the ‘order item’ table. It is in this initial phase that either cascading key lookup or denormalization should be carried out, identifying which data may still be missing from the equation. Therefore, when designing the database, ask yourself about the life-cycle of your rows: are they born with a populated distribution key? Designing your application so that this is taken care of avoids the unpleasant situation of null distribution keys.

When developing an application from scratch, analysis tools are not appropriate, as a new application does not yet have anything to track. ScaleBase created a special guide, “Building a New Application with Massive Database Scalability – Getting Started with ScaleBase”. The goal of this document is to demonstrate how to build a new application that plans for massive database scalability right from the start. It outlines the high-level steps involved and provides a walkthrough of how to create a simple, straightforward data distribution policy. As an aid, it uses a sample application, called ‘Blog System’, mimicking today’s typical web-scale applications. It frames the steps involved in building the application, including activities such as uploading files, performing quick and frequent database writes, and running speedy database queries.

If you want to add linear scalability to an existing MySQL application, you can leverage ScaleBase’s free online service, Analysis Genie. It helps define the best data distribution policy tailored to your unique application, based on a guided analysis of the nature of your data, data relationships and the functional use of your data.

If you are building a new app that is web-facing, mobile or social, you have to design for millions of users, high transaction rates and ever larger data volumes. This guide will help you get started and on your way to a more efficiently distributed database.

Have you been exploring distributed databases?

What are some challenges with distributed relational databases that you would like me to focus on in future blog posts?

Reference: Pinal Dave (http://blog.SQLAuthority.com)

MySQL – How to Create a Distributed Relational SQL Database

Distributed relational databases are a perfect match for Cloud computing models and distributed Cloud infrastructure.  As such, they are the way forward for delivering new web scale applications.

But how is the data distributed in a distributed relational database?  What is the best way to distribute data for my applications?  How do I retune my distributed database for optimal performance as applications evolve and usage patterns change?  You do all of this with your data distribution policy.

In this blog I’d like to explore different aspects of a data distribution policy. I want you to come away with a practical understanding you can use as you explore your distributed relational database options.

So, let’s dive in.

Data Distribution Policy: What It Is and Why You Should Care

A data distribution policy describes the rules under which data is distributed.  A policy that matches your application’s workflow and usage patterns will give you critical web scale benefits:

  • endless scalability
  • high-availability
  • geo-location of data nearest user populations
  • multi-tenancy
  • archiving
  • data tiering

A poorly conceived data distribution policy will degrade performance, use more system resources and cause you problems.

In The Beginning, there was Sharding, and it wasn’t so Good

In the past, to distribute data across an “array” of linked databases, developers needed to program data distribution logic into their actual applications. The effect was to “shard” a database into slices of data. Quite literally every read or write would need to run through new custom-built application code to know where bits of data should be placed, or could be found.  This is what Facebook, Twitter and many others did as, at the time, there was no better alternative.

This extra sharding code required application developers to take on tasks typically handled by a database.  A do-it-yourself approach may seem like a fun challenge (“hey, after all, how hard can this really be??”).  But with your database divided this way, you have the following issues to contend with:

  1. Operational issues become much more difficult, for example: backing up, adding indexes, changing schema.
  2. You also need to start checking your query results to test that each query path is actually yielding accurate results.

A lot has been written about the challenges of sharding a relational database (here’s a good whitepaper you can read: Top 10 DIY MySQL Sharding Challenges), so I won’t go into them here.  But, let’s also recognize that some great work has been accomplished by dedicated developers using sharding techniques. They have proven the inherent value of a distributed database to achieve massive scale.  At the time, they had to shard as they had no alternative.

Today, there is a better way.

What is a Good Data Distribution Policy?

As I briefly mentioned, a data distribution policy describes the rules under which data is distributed across a set of smaller databases that, taken together and acting as one, comprise the entire distributed database.

The goal we are aiming for is an even and predictable distribution of workloads across the array of clusters in our distributed database.  This brings us immense scalability and availability benefits to handle more concurrent users, higher transaction throughput and bigger volumes of data. But these benefits are all lost with a poorly conceived data distribution policy that does not align to your application’s unique usage and workloads. Let’s take a look.

Imagine we have a single database that is starting to exhibit signs of reaching its capacity limits.  Throughput is becoming unpredictable.  Users are getting frustrated waiting.

We decide the best way to improve the situation is to evolve to a distributed database. Our distributed database would aim to evenly divide the total workload across an array of databases.  In this way, data distribution decreases the number of queries that any individual database cluster (or shard) receives.

Figure 1. A good data distribution policy ensures that a specific transaction or query is completed within a single database.

The critical point here is that we want to distribute the data in such a way that we minimize the cross-database chatter (from cluster to cluster, or shard to shard), so that each transaction can be completed within a single cluster and in a single fetch/trip.

If we distribute data without respecting how the data is actually used, we can make matters worse.

Figure 2. A bad data distribution policy requires transactions or queries to access or collect data from multiple databases.

In the two images above, one case depicts 1,000,000 transactions spread equally across the available resources.  The other case shows a bad distribution policy, where each query needs to collect information from every cluster (or shard) – in every practical sense we are actually increasing the overall workload.
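The arithmetic behind those two pictures is worth spelling out. Using the 1,000,000 transactions from the example and an assumed four-database array:

# Toy numbers only: 1,000,000 transactions against a four-database array.
transactions = 1_000_000
databases = 4

# Good policy: each transaction is satisfied by exactly one database.
per_db_good = transactions / databases      # 250,000 queries per database

# Bad policy: each transaction must touch every database.
per_db_bad = transactions                   # 1,000,000 queries per database
total_work_bad = transactions * databases   # 4,000,000 queries overall: the load is multiplied

print(per_db_good, per_db_bad, total_work_bad)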

Data Distribution Policy

Bad data distribution policy:
  • The load isn’t distributed – it’s multiplied!
  • Doesn’t scale
  • Adding an additional DB does NOT reduce the overall workload
  • The limitation of a single DB becomes the limitation of the entire array
  • When queries need data from multiple DBs, transactions must commit against multiple separate DBs (2PC) before completing, which adds a lot of overhead to each commit

Good data distribution policy:
  • Distributes the workload evenly across available resources
  • Distributes the sessions
  • Delivers linear scalability
  • Adding another database increases the overall scale potential of the distributed database
  • Queries complete using data from a single, smaller database, which greatly reduces commit overhead

Table 1. A comparison of a good and bad data distribution policy

So, we can see that unless we distribute the data intelligently, we will not achieve any benefit. Actually, we can see things can become worse than before.

The natural question we are led to ask is: “OK, so what is the best way to distribute data for my applications and my workloads?”

Good question!

How to Create the Best Data Distribution Policy for Your Application

Distributing data across a cluster of smaller database instances while maintaining full relational database integrity, two-phase commit and rollback (as well as leveraging SQL!) is today’s state of the art for distributed relational databases.

We can define two broad types of data distribution policy:

  1. Arbitrary Distribution: Data is distributed across database instances without any consideration or understanding of specific application requirements, or of how the data will be used by users or the application;
  2. Declarative, Policy-Based Distribution: Data is distributed across database instances in a way that specifically accounts for all application requirements, data relationships, transactions, and how the data is read and written by the application.
Data Distribution Policy

Arbitrary data distribution policy:
  Pros:
  • Unsophisticated
  • Predetermined (no forethought required)
  Cons:
  • No intelligence about business, schema, or use cases
  • Leads to excessive use of database nodes
  • Leads to excessive use of the network

Declarative data distribution policy:
  Pros:
  • Ensures that a specific transaction finds all the data it needs in one specific database
  • Aligns with schema and DB structure
  • Highly efficient and scalable
  • Anticipates future requirements and growth assumptions
  Cons:
  • Requires forethought and analysis

Arbitrary data distribution is often used by NoSQL database technologies.  In fact, breaking the monolithic single-instance database into a distributed database has been the core of the NoSQL revolution so that NoSQL databases can tap into the scalability benefits of distributed database architecture. However, to get scalability, NoSQL databases have been willing to abandon the relational model. NoSQL and document store type databases can rely on arbitrary data distribution because their data model does not provide for joins. Meanwhile, customers have needed something to handle their massive web scale database loads, so they’ve been willing to try new technologies, like MongoDB, with new non-relational approaches. And in some application scenarios, losing the relational data model has been an OK trade-off. Having a choice is good.

However, nowadays you can get massive web scale and keep the time-tested relational database model, if you use a declarative, policy-based data distribution approach.

Academia has written about various types of distributed relational databases for decades. But today they are a reality. Declarative, policy-based data distribution is the way forward.

The good news is that today tools can identify the best declarative, policy-based data distribution approach for you!

If you use MySQL, you can take what you know now and check out ScaleBase’s free online Analysis Genie service for MySQL. It guides you through very simple steps to create the best data distribution policy matched to your unique application requirements and data.

If you’re just naturally curious about how to evolve your relational database into a modern distributed relational database, let’s dive into the details by looking at two very typical database and development scenarios:

  1. Scaling an existing application
  2. Designing scalability in a brand new application

In tomorrow’s blog post we will discuss Scaling Existing Applications: Key Observations and Measurements.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

MongoDB and MySQL – Comparing Scalability, Data Distribution & Query Model – Part 3

Note: This blog post is part 3 of the series. You can download a 30-day trial of ScaleBase to practice the concepts.

In this article comparing MongoDB and MySQL scalability, I want to focus on query models.

We’ve previously discussed how MongoDB and ScaleBase are both highly scalable distributed databases, where data is distributed across partitions and clusters so as to distribute workloads, transactions, and concurrent users.

Now let’s compare query models, and see how MongoDB and MySQL with ScaleBase answer application calls and queries.

The Challenge – Aggregating query results from several database nodes

Efficiently satisfying queries on a distributed system is famously challenging. While accessing a single collection or document is easy enough, building result sets from data spread across multiple nodes and collections can often become a manual coding process.

At a high-level, we can say a query can be filtered either by ID (and return a single result) or a range (and return multiple results).

In a distributed database, the challenge for a query is to efficiently and accurately access several data partitions, where each partition gives a partial result, and then to aggregate those results efficiently so as to provide one answer to the application and user.

At a high level we can say there are several operations that aggregate query results:

  • Aggregate functions: count, min, max, sum
  • Distinct
  • Group
  • Sort

The challenge is to execute these operations in a distributed data environment so that the application still “sees” one database.

Figure 5. A query in a distributed database environment

For example, in a distributed database with four partitions, a simple count of records across the entire database needs four queries, run in parallel on the four databases, producing four partial counts that must be summed and returned to the application as a single number.
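Conceptually, that fan-out looks like the sketch below. The shard connections and helper functions are hypothetical, and in practice the distributed database layer does this for you behind the scenes:

# Sketch of the fan-out: run COUNT(*) on each partition in parallel, then sum
# the partial counts into the single number the application sees.
from concurrent.futures import ThreadPoolExecutor

def count_on_shard(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM my_distributed_table")
        return cur.fetchone()[0]

def global_count(shard_conns):   # shard_conns: one open connection per partition
    with ThreadPoolExecutor(max_workers=len(shard_conns)) as pool:
        partial_counts = list(pool.map(count_on_shard, shard_conns))
    return sum(partial_counts)   # e.g. four partial counts become one total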

As I mentioned in my previous entry, since MongoDB does not do joins, discrete docs are “joined” together inside the application after several roundtrips to the database.

Aggregations can make this even more complex, but both MongoDB and ScaleBase do most of the work for you. An aggregation is an operation that scans over a set of documents and returns computed values such as Sum, Average, or Count.

So, what options do MongoDB and ScaleBase offer to aggregate query results from across several database partitions?

MongoDB Query Aggregation Options

MongoDB provides two main methods of aggregation: the aggregation pipeline (plus simple single purpose aggregation operations), and map-reduce.

MongoDB Aggregation Pipeline

The aggregation pipeline is a framework for aggregation built into MongoDB (since version 2.2). You can think of the pipeline framework as working something like the UNIX pipe command: you stream documents through the pipeline, filtering and transforming them at each stage.

Figure 6 Annotated Aggregation Pipeline from Mongodb.com documentation http://docs.mongodb.org/master/_images/aggregation-pipeline.png

Each pipeline stage can do things like skip, match, sort and even geospatial matching. You can improve performance by doing the filtering ($match) at the beginning of the pipeline, thus reducing the amount of data being scanned and manipulated. The aggregation pipeline can also use any available indexes.
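A small pipeline example, written here with pymongo (the database, collection and field names are invented for illustration), shows that ordering in practice: $match first, then $group, $sort and $limit:

# Sketch of an aggregation pipeline: $match early to shrink the working set,
# then $group, $sort and $limit. Collection and field names are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

pipeline = [
    {"$match": {"status": "shipped"}},                # filter as early as possible
    {"$group": {"_id": "$customer_id",                # one bucket per customer
                "total": {"$sum": "$amount"},
                "orders": {"$sum": 1}}},
    {"$sort": {"total": -1}},                         # biggest totals first
    {"$limit": 10},
]

for doc in orders.aggregate(pipeline):
    print(doc["_id"], doc["total"], doc["orders"])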

Single Purpose Aggregation Operations

The single purpose aggregation operations are very simple procedures that return a very specific set of data. Examples would be count on a single value, grouping, a distinct list of values, etc.

However, it’s important to note that in MongoDB the ‘group’ operation does NOT support distributed database nodes or partitions.  This is a major difference.  Secondly, all single purpose aggregation operation results must be less than 16MB.

Figure 7 Annotated Single Purpose Aggregation Operation from Mongodb.com documentation http://docs.mongodb.org/master/_images/distinct.png

MongoDB Map-Reduce

MongoDB’s Map-Reduce capability provides programmatic query processing flexibility not available in Aggregation Pipeline, but at a cost to performance and coherence.  Map-Reduce is a massively parallel process for manipulating and condensing large volumes of data down to something more useful. MongoDB provides the ‘mapReduce’ command to process map-reduce scripts.

In a map-reduce process, you match data that you want to work with (the map process) and then you filter and/or condense (with the reduce process).

The map process creates a set of key->value pairs that you want to work with, and the reduce process condenses those values. In MongoDB, a map process takes in a collection, and the output of the reduce operation can be a collection or it can be returned inline. If you generate a collection, it can then be used as input to another map-reduce process.
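Here is a minimal sketch of such a run, issued through pymongo’s generic command interface (field names are invented, and it assumes a server version that still accepts the mapReduce command):

# Sketch of a map-reduce run: the map step emits key->value pairs, the reduce
# step condenses the values for each key, and the results come back inline.
from pymongo import MongoClient
from bson.code import Code

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

map_fn = Code("function () { emit(this.customer_id, this.amount); }")
reduce_fn = Code("function (key, values) { return Array.sum(values); }")

result = db.command({
    "mapReduce": "orders",            # source collection
    "map": map_fn,
    "reduce": reduce_fn,
    "query": {"status": "shipped"},   # optional filter applied before the map step
    "out": {"inline": 1},             # return results inline instead of writing a collection
})

for row in result["results"]:
    print(row["_id"], row["value"])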

Figure 8 Annotated Map-Reduce from Mongodb.com documentation http://docs.mongodb.org/master/_images/map-reduce.png

Like the aggregation pipeline, map-reduce does support partitioned databases. MongoDB notes that while map-reduce performance has improved in later releases, the aggregation pipeline is usually more performant, though it may not be as flexible.

ScaleBase Query Aggregation Options

ScaleBase also provides two main methods of query data aggregation: ALL_DB Aggregation and an automatic and built-in “Map-Reduce-Like” capability that executes across distributed database partitions.

The primary difference is with ScaleBase you can use SQL and tools that you are most likely already familiar with.

Additionally, with ScaleBase you don’t need to decide between an aggregation pipeline and a map-reduce approach. Instead, you submit a regular SQL query, and ScaleBase performs all the aggregation operations behind the scenes for you.

ALL_DB Aggregation

ScaleBase supports single-phase and multi-phase ALL_DB aggregation.

For every ALL_DB command, ScaleBase performs:

  • Parallel execution
  • Local processing at each individual DB, such as Filter, local joins, local groups, local sort (That’s most of the processing!)
  • Meta-Aggregation at ScaleBase Controller

For example:

SELECT COUNT(*) FROM my_distributed_table;

ScaleBase delegates the same command to all databases in parallel. Each database executes it very efficiently on its smaller dataset and returns its count; ScaleBase then sums the n partial counts into one total count to return to the client.

The same goes with “sum of sums”, “min of mins” and so on. ScaleBase even supports a global ALL_DB average aggregated function.
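The global average case is worth a quick sketch, because an average of per-shard averages would be wrong; the correct meta-aggregation combines each shard’s SUM and COUNT (the numbers below are made up):

# Sketch: a correct ALL_DB average is sum-of-sums divided by sum-of-counts,
# not an average of the per-shard averages. The shard results below are made up.
per_shard = [
    {"sum": 1200.0, "count": 30},   # shard 1: SELECT SUM(amount), COUNT(*) ...
    {"sum":  900.0, "count": 10},   # shard 2
    {"sum":  300.0, "count": 60},   # shard 3
]

global_avg = sum(s["sum"] for s in per_shard) / sum(s["count"] for s in per_shard)
naive_avg  = sum(s["sum"] / s["count"] for s in per_shard) / len(per_shard)

print(global_avg)   # 24.0 (2400 / 100), the right answer
print(naive_avg)    # 45.0, which shows why averaging the averages misleads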

In addition, consider the following supported cases (a multi-phase example is sketched below):

  • Multi-phase aggregation:
    GROUP BY + HAVING + ORDER BY + LIMIT
  • DDL (ALTER TABLE, CREATE INDEX)
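A multi-phase example of the kind listed above might look like the following. This is only a sketch of the behavior as I understand it, not ScaleBase internals: each shard runs the grouping locally, and a final meta-aggregation merges the partial groups, re-applies HAVING, re-sorts and applies the LIMIT.

# Sketch: the application submits one ordinary SQL statement; the per-shard and
# meta-aggregation phases happen behind the scenes.
MULTI_PHASE_QUERY = """
SELECT   user_id, COUNT(*) AS orders, SUM(amount) AS spent
FROM     orders
GROUP BY user_id
HAVING   SUM(amount) > 1000
ORDER BY spent DESC
LIMIT    20
"""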

ScaleBase Map-Reduce

As mentioned before, ScaleBase provides a Map-Reduce-like capability. ScaleBase will deconstruct the query, run parallel scans, and aggregate the results.

To compare Map-Reduce in MySQL / ScaleBase and MongoDB, let’s look at the figure below.

Figure 9 MySQL Query and MongoDB Map-Reduce Script

Comparing a typical map-reduce query in MySQL and MongoDB, you can see that to get the same functionality, a simple SQL query is a lot easier to read and to write.  Ideally, less code also means fewer bugs and less maintenance.

And the above example is just a very simple SQL command on only one table, with no joins, three WHERE predicates and six SELECT predicates.  As we know, queries can be much longer than that, yet no more complicated to express. This is really the power of SQL.

You can see how the ScaleBase query flow resembles a map-reduce program in the figure below.

Figure 10. Map-Reduce Like Functionality in ScaleBase

With no extra coding, you get localized access at the partition/node level, running in parallel, with local sorts, joins and grouping.  ScaleBase does the final aggregation for your application in the middle ScaleBase layer. Not in the application.

Aggregation is a fact of life in the data world. You don’t want a tool that makes your life harder than it has to be. MongoDB and ScaleBase make it as easy as possible for you.

Summary

MongoDB was developed after SQL databases had been around for quite a while, already providing great query capabilities.  MongoDB is much more than a key/value database. It has great querying capabilities as well, some say equal to SQL databases, with very similar concepts of filters, indexes and result aggregation.

The main difference is in the “how”.

In MongoDB, for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface. However, map-reduce operations provide some programmatic flexibility that is not presently available in the aggregation pipeline.

The take-away is that (as we have already seen in scalability, auto-sharding and data redistribution), in query model and query execution there’s more similarity than difference between MongoDB and MySQL with ScaleBase.

Bottom line: in a distributed database environment you must have these aggregation capabilities.  MongoDB gives that to you. MySQL with ScaleBase gives that to you too, while retaining SQL and ACID properties.

If you’ve liked the ideas discussed in this series of articles comparing MongoDB and MySQL with ScaleBase, there is a free 30-day trial you can download (http://www.scalebase.com/download-trial/).

Reference: Pinal Dave (http://blog.sqlauthority.com)

MongoDB and MySQL – Comparing Scalability, Data Distribution & Query Model – Part 2

Note: This blog post is part 2 of the series. You can download a 30-day trial of ScaleBase to practice the concepts.

In a previous article comparing MongoDB and MySQL, we saw how MongoDB and MySQL with ScaleBase execute auto-sharding, a key technique to deliver database scalability.  In MongoDB, any joins across horizontal partitions need to be accomplished by the application, whereas MySQL with ScaleBase retains ACID compliance across the distributed database.

In this article, I want to focus on data distribution, reads/writes, joins and data rebalancing.

Data Distribution

As a reminder, the goal of dividing data into horizontal partitions is to distribute workloads across multiple servers. This allows an application to achieve higher TPS throughput, support more concurrent users, and handle a much larger overall database size, all while avoiding database bottlenecks.

For the best performance, any data that is logically related should be stored together. In this way, queries can be satisfied in a single fetch.  But, that ideal situation isn’t always available.  What happens when a query needs data from multiple partitions? Let’s take a look at a few example application scenarios and see how the way you distribute data can impact performance.

Reads and Writes in MongoDB and ScaleBase

Imagine a blogging application, with authors, articles, users, comments and tags.  The table below identifies four typical blogging activities (scenarios) and indicates how frequently they typically happen.

Figure 1 Example Blogging Application Scenarios

As you can see, we have two “write” activities and two “read” activities. And, two scenarios occur frequently, and two occur less often.

Below are typical data models for this simple application: MongoDB uses a typical document-type store, and ScaleBase a typical relational store.

Figure 2 Example Blogging Application Data Models for MongoDB and ScaleBase

The model on the left is the MongoDB data model (represented via BSON) and the model on the right is a relational model. While the exact way that the data is stored is different, both models accomplish the same thing.
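To make the comparison concrete, here is a sketch of what the two models might look like; these are not the exact models from Figure 2, and the field names are invented:

# Sketch: the document model embeds comments and tags inside the article;
# the relational model splits them into tables linked by foreign keys.
article_doc = {                       # MongoDB / BSON-style document
    "_id": 42,
    "author_id": 7,                   # reference to a document in the users collection
    "title": "Scaling out",
    "tags": ["mysql", "sharding"],
    "comments": [
        {"user_id": 99, "text": "Nice post"},
    ],
}

RELATIONAL_DDL = """
CREATE TABLE users    (id BIGINT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE articles (id BIGINT PRIMARY KEY, author_id BIGINT NOT NULL, title VARCHAR(200));
CREATE TABLE tags     (article_id BIGINT NOT NULL, tag VARCHAR(50));
CREATE TABLE comments (id BIGINT PRIMARY KEY, article_id BIGINT NOT NULL,
                       user_id BIGINT NOT NULL, text TEXT);
"""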

So, with the groundwork of our example understood, let’s examine how MongoDB and ScaleBase execute the four blogging application scenarios we outlined earlier.

Scenario Execution with MongoDB – Joins

You can see that for the two “write” scenarios in MongoDB, with that model, adding an article (a high-frequency activity) and adding a tag are both easily accomplished in a single call to a single database. For example, we already know the user, so it is simply a matter of inserting the relevant documents.

However, both of the “read” scenarios require that a “join” be made in the application. For example, if we want to query all of an author’s articles, along with that author’s details, we would need to first request the Users collection (table) and then query the Article collection (table). The application would then have to join those two sets of data and return it to the client.
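In code, that application-side join might look like the following pymongo sketch (collection and field names follow the invented models above):

# Sketch of the application-side "join": two round trips to MongoDB, then the
# application stitches the author's details onto each article before returning them.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["blog"]

author = db.users.find_one({"name": "Alice"})                     # round trip 1: the author
articles = list(db.articles.find({"author_id": author["_id"]}))   # round trip 2: the articles

for article in articles:
    article["author_name"] = author["name"]                       # join performed in the application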

Depending on how the Article collection is sharded, MongoDB may need to query multiple shards and aggregate that data before returning it to the blogging application, where it must then also be joined with the User collection.

Figure 3. Read / Write Scenario Execution in MongoDB

Scenario Execution with ScaleBase with MySQL – Joins

In ScaleBase with MySQL, the data may be distributed across articles, but the distribution process will co-locate logically related user data with an author’s articles. In cases where data is not co-located, ScaleBase will join the data at the database level. This means the application does not need to code those joins. In this way, three of the four scenarios are satisfied in a single call.
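With ScaleBase and MySQL, the same read is a single ordinary SQL join; the sketch below uses the same invented schema, and the join runs locally when the author and articles are co-located on one shard:

# Sketch: one ordinary SQL join serves the "author plus articles" read;
# the application never codes a cross-collection merge.
AUTHOR_ARTICLES = """
SELECT u.id, u.name, a.id AS article_id, a.title
FROM   users u
JOIN   articles a ON a.author_id = u.id
WHERE  u.name = %s
"""

def author_with_articles(conn, name):
    with conn.cursor() as cur:
        cur.execute(AUTHOR_ARTICLES, (name,))
        return cur.fetchall()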

Figure 4. Read / Write Scenario Execution in ScaleBase and MySQL

As you can see, both the MongoDB and ScaleBase data models can obviously support our example blogging application.  In MongoDB, there can be read scenarios where a join needs to be accomplished in the application.  In ScaleBase, for the one scenario that requires it, the join is accomplished by ScaleBase, and not within the application.

Data Rebalancing in MongoDB, and ScaleBase with MySQL

There can come a time when you need to rethink how data is distributed across servers. For example, application usage patterns may evolve. And, as the number of concurrent users goes up and/or transaction and data volumes increase, you may find that certain data nodes or partitions become “hotspots”, receiving more read/write workload than other parts of the distributed database cluster. When this happens, you will want to adjust how data is distributed in the future, and possibly also redistribute existing data to achieve a more balanced workload.

Data Chunks

In both MongoDB and ScaleBase, data is distributed in “chunks”. In MongoDB, a data chunk is a set of documents, and the default size for a data chunk is 64 megabytes. In ScaleBase, a data chunk is a logical data slice from an entire relational hierarchy.

Splitting Shards and Rebalancing Partitions

When you want to alleviate the workloads from a database hotspot, there are two typical data redistribution use-cases: splitting partitions and rebalancing partitions.

They are easily explained in a few pictures.


Figure 5. Data Redistribution: Splitting and Rebalancing Shards

Simply stated, in splitting partitions you are adding resources to your distributed database, and in re-balancing partitions you are redistributing data across already existing server resources.

Both MongoDB and ScaleBase can automatically redistribute data across partitions, online and without downtime, by adding and removing nodes as needed, to handle data volume changes and hotspots.

A new server might need to be added for more capacity, but it may also be possible to move “hot” data from an overused server node to an underused one. Remember that in MongoDB, a chunk is the smallest logical set of data; when migrating between over- and under-used nodes, MongoDB moves entire chunks from one server to the other. In ScaleBase, a data chunk is a logical data slice from an entire relational hierarchy, and ScaleBase similarly moves entire chunks of data in a way that maintains the logical data relations.

The best part, for both MongoDB users and ScaleBase users, is that the logic for data redistribution and rebalancing happens behind the scenes, automatically. Hotspots are detected and data is moved using logic at the data layer. Once again, your application does not need to contain code for that functionality.

Summary

We’ve covered a lot of material.  The figure below summarizes the discussion so far for MongoDB and ScaleBase auto-sharding, data distribution and join capabilities.

As you can see MongoDB and ScaleBase both equally support auto-sharding, data distribution at the database level, scalable workload distribution, and application transparency.  Where they differ is at the references and relational level.  In MongoDB, each document collection is sharded independently and joins are not supported.  In ScaleBase, MySQL relations are retained and form the basis of a data distribution policy, and joins are supported locally on each node, and across the entire distributed database cluster.

Next time I want to cover MongoDB and MySQL with ScaleBase query models.

If you’re curious to explore ScaleBase, there’s a free 30-day trial you can download (http://www.scalebase.com/download-trial/).

Reference: Pinal Dave (http://blog.sqlauthority.com)

MongoDB and MySQL – Comparing Scalability, Data Distribution & Query Model – Part 1

Note: This blog post is part 1 of the series. You can download a 30-day trial of ScaleBase to practice the concepts.

Introduction

Whether it is for extremely large data volumes, high transaction throughput or a massive number of concurrent users, today’s internet and mobile applications must deliver a new level of performance and usability. Databases need to be flexible enough to quickly accommodate growth and changes as they arise.

A lot has been written to compare MongoDB and MySQL. Both are solid solutions for DBAs to use. Given my focus on SQL, I’ve not yet written about MongoDB. However, I expect many of my readers are investigating NoSQL solutions, and the most popular NoSQL database is MongoDB.

At a high level, MongoDB is a NoSQL document database. It is considered highly scalable and cloud-ready.  MySQL is a leading relational database that I know many of you are already using.  I’d like to offer some thoughts comparing MongoDB and MySQL across three main areas:

  1. Scalability, throughput and cloud-readiness.
  2. Data distribution, reads / writes, joins and data rebalancing
  3. Query model

Because MySQL is not natively a distributed database, I want to level the playing field. ScaleBase is a well-regarded distributed relational database platform that uses MySQL (or MariaDB, or Percona Server) for data storage. Adding ScaleBase into the equation makes MySQL a distributed database too, but it’s still a relational database with SQL properties and ACID compliance.

MongoDB and MySQL, with ScaleBase, are both positioned to take software engineering teams and their applications to the next level of data scalability.

I hope my comparison can help you understand which technology might be best for your new projects.  Ultimately, data architects, dev-ops and developers need to choose the database technology that best meets their application’s requirements.

1. Scalability, throughput and cloud-readiness

MongoDB and MySQL with ScaleBase each leverage several proven techniques when it comes to scaling out applications.

Auto-Sharding

A very powerful technique that can greatly extend the scalability of a database is called “sharding.”  A lot has been written about sharding, and the difficulties inherent in homegrown, do-it-yourself partitioning approaches, so I won’t cover those topics in detail again here.

Let’s just say that sharding is a process that splits a large data set on a single server into smaller data sets, or horizontal partitions, placed onto multiple servers. The goal of this division of the data is to divide workloads between multiple servers, so as to 1) increase TPS throughput, 2) allow more concurrent users, and 3) allow for a much larger overall database size.

Figure 1 Scalability via auto-sharding

MongoDB Sharding

Sharding is an important part of how MongoDB achieves its scalability. In MongoDB, data is split between shards using a shard key. The shard key is a single field (or a set of composite fields) that identifies how the data is to be partitioned. Data is divided into “chunks”, by default 64MB of data. The chunk is the smallest logical block of data that will be sharded.

A particular shard is responsible for one or more chunks. Each shard (typically deployed as a replica set) stores a subset of all the application data, identified by the shard key. MongoDB controls which chunks are stored on a particular shard.

If you look directly into an individual shard to see what it contains, you will see a seemingly random subset of the data. Which data is stored in which shard, or partition, is opaque to the user and unknown to the application. That’s why you (the client) always need to connect to the mongos and let MongoDB find the data you need.
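For reference, turning sharding on for a collection is a couple of admin commands issued through the mongos (a sketch; the database, collection and shard key below are placeholders):

# Sketch: shard the blog.articles collection on author_id. Both are standard
# MongoDB admin commands sent through a mongos router; the names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")   # connect to the mongos, not to a shard

client.admin.command("enableSharding", "blog")        # mark the database as sharded
client.admin.command("shardCollection", "blog.articles",
                     key={"author_id": 1})            # choose the shard key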

Figure 2 MongoDB Architecture (taken from the MongoDB web site)

In the MongoDB architecture diagram above, the configuration of the cluster is stored in the Config Servers (which, incidentally, are mongod databases). Queries and data going back and forth between the client and the cluster are sent through the MongoDB routers, the mongos. The mongos will route a query to the appropriate backend database or databases.

Data distribution occurs at the database level – it’s not defined by the application. The application actually “sees” a single database, and MongoDB handles retrieving data from each shard partition (mongod). Because MongoDB is a non-relational database, there are no database joins (I’ll have more on that later), so joins are achieved in the application.  Most of the query processing is done in the data layer, with aggregations done in the MongoDB middleware.

ScaleBase Sharding

Now that we’ve reviewed sharding in MongoDB, you might be surprised to learn this is very similar to how MySQL with ScaleBase works. ScaleBase also automates sharding. Like MongoDB, ScaleBase’s data distribution is handled at the database level (and not in the application). Importantly, because ScaleBase maintains a relational data environment, ScaleBase will ensure that data that is accessed together is also stored together.

Like MongoDB, with ScaleBase the application “sees” a single database and not a bunch of shards/partitions. Like MongoDB, ScaleBase also provides agility, flexibility, and growth. Database optimization is dynamic. Rebalancing (adding and removing nodes to accommodate growth and to handle database “hotspots”) is dynamic and can be executed while the entire system is online.  All this without changes to the application design or codebase.

However, there are two ways that ScaleBase sharding differs.

  1. Retaining ACID / SQL: With ScaleBase, joins are supported, even in a distributed environment. As MySQL is a relational database, ScaleBase embraces and retains data relations expressed in joins and transactions. With ScaleBase, queries can have local joins (within one shard) as well as cross-database joins. Unlike MongoDB where joins are accomplished in your application, ScaleBase will automatically aggregate data across shards for you. In this way, ScaleBase gives you ACID transactions and 2-phase commits across a sharded cluster of MySQL database nodes. This is unique.
  2. MySQL Ecosystem: The second way ScaleBase is a bit different is it preserves your MySQL skills and ecosystem technologies. It provides this scalability to your existing MySQL applications, if they are growing beyond the capabilities of your server (or Amazon RDS instance).  Maybe more importantly, ScaleBase gives new MySQL apps a modern, distributed and relational data framework with near-linear scalability.

Figure 3 ScaleBase Architecture

Looking at a ScaleBase architecture diagram and relating back to MongoDB, the ScaleBase Config Servers here are the equivalent of the MongoDB mongod Config Servers. And ScaleBase is the equivalent of the MongoDB router, the mongos.  The similarities are striking.

Summary

So far I’ve only focused my comparison on scalability, and we’ve seen that MongoDB and MySQL with ScaleBase are highly comparable.  MongoDB is a NoSQL database well known for its scalability; the database world calls it a web-scale technology, and it’s catching on with developers worldwide.  MySQL is the world’s most widely used open source database, with a rich ecosystem of tooling and skilled users. ScaleBase is a modern relational database platform, built using standard MySQL and optimized for the cloud, that takes advantage of your existing MySQL development skills and ecosystem technologies.

If you’re curious to explore ScaleBase, there’s a free 30-day trial you can download (http://www.scalebase.com/download-trial/).

Next time I will cover MongoDB and ScaleBase data models and examine how they handle data distribution and redistribution.

Reference: Pinal Dave (http://blog.sqlauthority.com)