Note: This blog post part 2 of the series and you can download 30-day trial of ScaleBase to practice the concepts.
In a previous article comparing MongoDB and MySQL, we saw how MongoDB and MySQL with ScaleBase execute auto-sharding, a key technique to deliver database scalability. In MongoDB, any joins across horizontal partitions need to be accomplished by the application. Whereas, MySQL with ScaleBase retains ACID compliance across the distributed database.
In this article, I want to focus on data distribution, reads / writes, joins and data rebalancing
As a reminder, the goal of dividing data into horizontal partitions is to cause workloads to become distributed across multiple servers. This allows an application to experience higher TPS throughput, allow more concurrent users, and to allow for much larger overall database size, all while avoiding database bottlenecks.
For the best performance, any data that is logically related should be stored together. In this way, queries can be satisfied in a single fetch. But, that ideal situation isn’t always available. What happens when a query needs data from multiple partitions? Let’s take a look at a few example application scenarios and see how the way you distribute data can impact performance.
Imagine a blogging application, with authors, articles, users, comments and tags. The table below identifies four typical blogging activities (scenarios) and indicates how frequently they typically happen.
Figure 1 Example Blogging Application Scenarios
As you can see, we have two “write” activities and two “read” activities. And, two scenarios occur frequently, and two occur less often.
Below are typical data models for this simple application: MongoDB uses typical document type store, and ScaleBase’s typical relational store.
Figure 2 Example Blogging Application Data Models for MongoDB and ScaleBase
The model on the left is the MongoDB data model (represented via BSON) and the model on the right is a relational model. While the exact way that the data is stored is different, both models accomplish the same thing.
So, with the groundwork of our example understood, let’s examine how MongoDB and ScaleBase execute the four blogging application scenarios we outlined earlier.
You can see that for the two “write” scenarios, in MongoDB, with that model, adding an article, a high frequency activity, and adding a tag, are both easily accomplished in a single call to a single database. For example, we already know the user so it is simply a matter of inserting the relevant documents.
However, both of the “read” scenarios require that a “join” be made in the application. For example, if we want to query all of an author’s articles, along with that author’s details, we would need to first request the Users collection (table) and then query the Article collection (table). The application would then have to join those two sets of data and return it to the client.
Depending on how the Article collection is sharded, MongoDB may need to query multiple shards and then aggregate that data before returning it to the blogging application, and for that data to then also be joined with the User collection.
Figure 3. Read / Write Scenario Execution in MongoDB
In ScaleBase with MySQL, the data may be distributed across articles, but the distribution process will co-locate logically related user data with an author’s articles. In cases where data is not co-located, ScaleBase will join the data at the database level. This means the application does not need to code those joins. In this way, three of the four scenarios are satisfied in a single call.
Figure 4. Read / Write Scenario Execution in ScaleBase and MySQL
As you can see, both MongoDB and ScaleBase data models obviously can support our example Blogging application. In MongoDB, there can be read scenarios where a join needs to be accomplished in the application. In ScaleBase, for the one scenario that requires it, the join is accomplished by ScaleBase, and not within the application.
There can come a time when you need to rethink how data is distributed across servers. For example, application usage patterns may evolve. And, as the number of concurrent users goes up and/or transaction and data volumes increase, you may find that certain data nodes, or partitions become “hotspots” receiving more read/write workloads than other parts of the distributed database cluster. When this happens, you will want to adjust how data is distributed in the future, and possibly also redistribute existing data to achieve a more balanced distributed workload..
In both MongoDB and ScaleBase, data is distributed in “chunks”. In MongoDB, a data chuck is a set of documents and the default size for a data chunk 64 megabytes. In ScaleBase, a data chuck is a logical data slice from an entire relational hierarchy.
Splitting Shards and Rebalancing Partitions
When you want to alleviate the workloads from a database hotspot, there are two typical data redistribution use-cases: splitting partitions and rebalancing partitions.
They are easily explained in a few pictures.
Splitting Partitions Re-Balancing Partitions
Figure 5. Data Redistribution: Splitting and Rebalancing Shards
Simply stated, in splitting partitions you are adding resources to your distributed database, and in re-balancing partitions you are redistributing data across already existing server resources.
Both MongoDB and ScaleBase can automatically redistribute data across partitions, online and without downtime, by adding and removing nodes as needed, to handle data volume changes and hotspots.
A new server might need to be added for more capacity, but it may also be possible to move “hot” data from an overused server node to an underused server node. Remember that in MongoDB, a chunk is the smallest logical set of data. When migrating between over- and under-used nodes, MongoDB moves entire chunks from one server to the other. Whereas, in ScaleBase, a data chuck is a logical data slice from an entire relational hierarchy. ScaleBase will similarly move entire chunks of data in a way that maintains logical data relations.
The best part, for both MongoDB users and ScaleBase users, is that the logic for data redistribution and rebalancing happens behind the scenes, automatically. Hotspots are detected and data is moved using logic at the data layer. Once again, your application does not need to contain code for that functionality.
We’ve covered a lot of material. The figure below summarizes the discussion so far for MongoDB and ScaleBase auto-sharding, data distribution and join capabilities.
As you can see MongoDB and ScaleBase both equally support auto-sharding, data distribution at the database level, scalable workload distribution, and application transparency. Where they differ is at the references and relational level. In MongoDB, each document collection is sharded independently and joins are not supported. In ScaleBase, MySQL relations are retained and form the basis of a data distribution policy, and joins are supported locally on each node, and across the entire distributed database cluster.
Next time I want to cover MongoDB and MySQL with ScaleBase query models.
If you’re curious to explore ScaleBase, there’s a free 30-day trial you can download (http://www.scalebase.com/download-trial/).
Reference: Pinal Dave (http://blog.sqlauthority.com)