Here is the story of my latest weekend experiments. Those who know me know that I try to do something new every weekend and post my experience here. Earlier this week, I published my course on MySQL Backup and Recovery. After finishing the course, I realized that everything I had covered was related to native MySQL backup, and that there should be a solution outside MySQL which caters to the larger needs of MySQL backup. Additionally, I found that taking backups with native MySQL tools is hit and miss, and that there is no systematic reporting. I decided to do some research of my own on the internet. I tried out various tools, software packages and quite a few third-party plug-ins.
Out of all my experiments, one product in particular got my attention: BitCan. It is an excellent product. Here are a few things I found out about it:
User-friendly interface, with no need to write scripts or cron jobs
No need to install any plugin
Cloud-based solution
Storage is very elastic and grows/shrinks with your needs
SSH secure communication layer integrated
Backup monitoring dashboard
I decided to try out the solution on my own and found it very effective and efficient. I signed up for a FREE account and quickly did the following:
Connect my public database to BitCan
Schedule backups at an interval
Create an immediate backup
Download the backup to restore on my local machine
Monitor backup health in a single dashboard
Well, I already have a MySQL database which is available on the internet, and I decided to take a backup of that database. Here are a few images from my experiment.
Let us look at how we can take a backup with the help of BitCan.
Once you log in, you will see the following screen. BitCan allows a 30-day trial. Now click on the CREATE BACKUP link.
On the same screen there is a video as well. Please refer to the video for additional information. I watched this five-minute video and learned how to get started quickly.
Step 2: New Backup
Here you will enter your backup name and choose what kind of backup you want. You can take a backup of a file, a directory, MongoDB, or MySQL. I wanted to back up MySQL, so I selected MySQL and clicked on Save Backup.
Step 3: Configuration
Here, provide all the necessary details for your backup. Remember, I am selecting the public database, as my MySQL instance is accessible via the internet.
You can also see that I have selected a backup time of 12:00 AM on Sunday, Tuesday, Thursday and Saturday. Click on Save and continue to the next step.
Step 4: All Done!
That’s it. We are done. The backup will automatically happen at our scheduled time. If we want, we can execute the backup manually and view various information related to it.
The dashboard is a very interesting element of BitCan. I have never seen a backup dashboard that is so comprehensive and detailed. I loved it. Maybe in a future blog post we will discuss it in detail.
Here is another view of the dashboard.
Well, that’s it. In future blog posts we will discuss various other aspects of BitCan.
Now let’s compare query models and see how MongoDB and MySQL with ScaleBase answer application calls and queries.
The Challenge – Aggregating query results from several database nodes
Famously, efficiently satisfying queries on a distributed system can be challenging. While accessing a single collection or document is easy enough, building result-sets from data spread across multiple nodes and collections can often become a manual coding process.
At a high-level, we can say a query can be filtered either by ID (and return a single result) or a range (and return multiple results).
In a distributed database, the challenge for a query is to efficiently and accurately access several data partitions, where each partition gives a partial result, and then to aggregate those partial results efficiently so as to provide one answer to the application and user.
At a high level we can say there are several operations that aggregate query results:
– Aggregate functions: count, min, max, sum
The challenge is to execute these operations in a distributed data environment so that the application still “sees” one database.
Figure 5. A query in a distributed database environment
For example, for a distributed database with four partitions, a simple count of records across the entire database needs four queries, run in parallel on four databases, which results in four numbers that need to be summed up and returned to the application as a single summed number.
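That scatter-gather flow can be sketched in a few lines of Python. This is a toy model, not ScaleBase's actual implementation, and the partition contents are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Four partitions of one distributed table (toy data).
partitions = [
    [1, 2, 3],          # partition A holds 3 records
    [4, 5],             # partition B holds 2 records
    [6],                # partition C holds 1 record
    [7, 8, 9, 10],      # partition D holds 4 records
]

def local_count(rows):
    """The query each database node runs locally: SELECT COUNT(*)."""
    return len(rows)

def distributed_count(parts):
    """Run the local count on every partition in parallel, then sum
    the four partial counts into one answer for the application."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(local_count, parts))
    return sum(partial_counts)

print(distributed_count(partitions))  # a single number, as if from one database
```

The application only ever sees the final summed number; the four parallel queries and the meta-aggregation stay hidden in the data layer.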
As I mentioned in my previous entry, since MongoDB does not do joins, discrete docs are “joined” together inside the application after several roundtrips to the database.
Aggregations can make this even more complex, but both MongoDB and ScaleBase do most of the work for you. An aggregation is an operation that scans over a set of documents and returns computed values such as Sum, Average, or Count.
So, what options do MongoDB and ScaleBase offer to aggregate query results from across several database partitions?
MongoDB Query Aggregation Options
MongoDB provides two main methods of aggregation: the aggregation pipeline (and simple single purpose pipeline operations), and map-reduce.
MongoDB Aggregation Pipeline
The aggregation pipeline is a framework for aggregates built into MongoDB (since version 2.2). You can think of the pipeline framework as working something like the UNIX pipe command. You stream documents through the pipeline, filtering and transforming as needed on each operator.
Each pipeline operator can do things like skip, match, sort and even geospatial style matching. You can improve performance by doing filtering (match) at the beginning of the pipeline, thus reducing the amount of data being scanned and manipulated. The aggregation pipeline can use indexes that are available.
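As a rough analogy in plain Python (not actual MongoDB syntax; the documents, field names, and stage helpers here are invented), a pipeline is just a chain of stages, each consuming the previous stage's output, and filtering early shrinks what the later stages must scan:

```python
# Toy documents, in the spirit of a MongoDB collection.
docs = [
    {"city": "Boston", "pop": 650_000},
    {"city": "Austin", "pop": 950_000},
    {"city": "Reno",   "pop": 270_000},
    {"city": "Denver", "pop": 710_000},
]

def match(pred):          # like $match: keep only matching documents
    return lambda stream: (d for d in stream if pred(d))

def sort_by(key):         # like $sort: order the documents
    return lambda stream: iter(sorted(stream, key=key))

def skip(n):              # like $skip: drop the first n documents
    return lambda stream: (d for i, d in enumerate(stream) if i >= n)

def run_pipeline(stream, *stages):
    """Stream documents through each operator in turn, UNIX-pipe style."""
    for stage in stages:
        stream = stage(stream)
    return list(stream)

# Matching first means the sort and skip stages scan less data.
result = run_pipeline(
    iter(docs),
    match(lambda d: d["pop"] > 300_000),
    sort_by(lambda d: d["pop"]),
    skip(1),
)
print([d["city"] for d in result])
```

The same ordering advice applies in real MongoDB pipelines: a `$match` placed first can also use an index, while a `$match` placed after transformations cannot.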
Single Purpose Aggregation Operations
The single purpose aggregation operations are very simple procedures that return a very specific set of data. Examples would be count on a single value, grouping, a distinct list of values, etc.
However, it’s important to note that in MongoDB, the ‘Group’ function does NOT support distributed database nodes or partitions. This is a major difference. Secondly, all single purpose aggregation operation results must be less than 16MB.
MongoDB’s Map-Reduce capability provides programmatic query processing flexibility not available in Aggregation Pipeline, but at a cost to performance and coherence. Map-Reduce is a massively parallel process for manipulating and condensing large volumes of data down to something more useful. MongoDB provides the ‘mapReduce’ command to process map-reduce scripts.
In a map-reduce process, you match data that you want to work with (the map process) and then you filter and/or condense (with the reduce process).
The map process creates a set of key->value pairs that you want to work with and the reduce process takes those values in. In MongoDB, a map process takes in a collection and the output of the reduce operation can be a collection or it can be returned inline. If you generate a collection, it can then be used as input to another map-reduce process.
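The map/reduce shape described above can be sketched in plain Python. This is a toy example with invented data; MongoDB's real `mapReduce` command takes JavaScript functions and operates on collections:

```python
from collections import defaultdict

# Toy "collection": orders with a customer id and an amount (invented data).
orders = [
    {"cust": "a", "amount": 10},
    {"cust": "b", "amount": 5},
    {"cust": "a", "amount": 7},
    {"cust": "b", "amount": 3},
]

def map_phase(collection):
    """Emit the key->value pairs we want to work with (customer -> amount)."""
    for doc in collection:
        yield doc["cust"], doc["amount"]

def reduce_phase(pairs):
    """Condense all values emitted for each key into one value per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

totals = reduce_phase(map_phase(orders))
print(totals)  # one total amount per customer
```

The output dictionary plays the role of the result collection: it could be returned inline or fed into another map-reduce pass.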
Like the pipelined aggregation, map-reduce does support partitioned databases. MongoDB notes that while they have made improvements in map-reduce performance in later releases, the aggregation pipeline is usually more performant, though it may not be as dynamic and functional.
ScaleBase Query Aggregation Options
ScaleBase also provides two main methods of query data aggregation: ALL_DB Aggregation and an automatic and built-in “Map-Reduce-Like” capability that executes across distributed database partitions.
The primary difference is with ScaleBase you can use SQL and tools that you are most likely already familiar with.
Additionally, with ScaleBase you don’t need to decide between an aggregation pipeline or a map-reduce approach. Instead, you submit a regular SQL query, and ScaleBase performs all the aggregation operations behind the scenes for you.
ScaleBase supports single-phase and multi-phase ALL_DB aggregation.
For every ALL_DB command, ScaleBase performs:
Local processing at each individual DB, such as filtering, local joins, local grouping and local sorting (that’s most of the processing!)
Meta-Aggregation at ScaleBase Controller
For example:
SELECT COUNT(*) FROM my_distributed_table;
ScaleBase delegates the same command to all databases in parallel. The databases execute it very efficiently, in parallel and on smaller datasets, and return n counts to ScaleBase, which in turn sums all the counts into one big count to return to the client.
The same goes with “sum of sums”, “min of mins” and so on. ScaleBase even supports a global ALL_DB average aggregated function.
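The global average is the subtle case: averaging the per-partition averages gives a wrong answer when partitions have different sizes, so the controller must combine a (sum, count) pair from each node. A toy sketch (invented data, not ScaleBase internals):

```python
partitions = [
    [10, 20, 30],   # node 1: local average 20
    [40],           # node 2: local average 40
]

def local_sum_and_count(rows):
    """What each node returns for an ALL_DB AVG: (SUM, COUNT), not AVG."""
    return sum(rows), len(rows)

def global_avg(parts):
    """Meta-aggregation at the controller: sum of sums / sum of counts."""
    sums, counts = zip(*(local_sum_and_count(p) for p in parts))
    return sum(sums) / sum(counts)

# The naive "average of averages" ignores partition sizes.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)
print(global_avg(partitions))  # 25.0, the correct global average
print(naive)                   # 30.0, the wrong answer
```

Sum-of-sums and min-of-mins need no such trick because those functions are directly composable; AVG is the one that must be decomposed into SUM and COUNT.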
In addition, consider the following supported cases:
GROUP BY + HAVING + ORDER BY + LIMIT
DDL (ALTER TABLE, CREATE INDEX)
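Of the cases above, distributed ORDER BY + LIMIT is a good illustration of the two-phase pattern: each node can safely return only its own top N rows, and the controller merges those partial lists and applies the limit again. A toy sketch (invented data, not ScaleBase's actual planner):

```python
import heapq

LIMIT = 3

partitions = [
    [42, 7, 99, 3],
    [55, 1, 88],
    [23, 77],
]

def local_top(rows, n):
    """Phase 1 on each node: ORDER BY value DESC LIMIT n.
    A row outside a node's local top-n can never be in the global top-n."""
    return heapq.nlargest(n, rows)

def distributed_top(parts, n):
    """Phase 2 at the controller: merge the partial lists, re-sort, re-limit."""
    merged = [row for p in parts for row in local_top(p, n)]
    return heapq.nlargest(n, merged)

print(distributed_top(partitions, LIMIT))
```

Each node ships at most LIMIT rows over the wire instead of its whole result set, which is what makes the pattern efficient.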
As mentioned before, ScaleBase provides a Map-Reduce-like capability. ScaleBase will deconstruct the query, run parallel scans, and aggregate the results.
To compare Map-Reduce in MySQL / ScaleBase and MongoDB, let’s look at the figure below.
Figure 9 MySQL Query and MongoDB Map-Reduce Script
Comparing a typical Map-Reduce query in MySQL and MongoDB, you can see that to get the same functionality, a simple SQL query is a lot easier to read and to write. Ideally, less code can also mean fewer bugs and less maintenance.
And the above example is just a very simple SQL command on only one table, with no joins, three WHERE predicates and six SELECT predicates. As we know, queries can be much longer than that, yet no more complicated to write. This is really the power of SQL.
You can see how the ScaleBase query flow resembles a map-reduce program in the figure below.
Figure 10. Map-Reduce Like Functionality in ScaleBase
With no extra coding, you get localized access at the partition/node level, running in parallel, with local sorts, joins and grouping. ScaleBase does the final aggregation for your application in the middle ScaleBase layer. Not in the application.
Aggregation is a fact of life in the data world. You don’t want a tool that makes your life harder than it has to be. MongoDB and ScaleBase make it as easy as possible for you.
MongoDB was developed after SQL databases had already been around for quite a while, providing great query capabilities. MongoDB is much more than a key/value database. It has great querying capabilities as well, some say equal to those of SQL databases, with very similar concepts of filters, indexes, and result aggregation.
The main difference is in the “how”.
In MongoDB, for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface. However, map-reduce operations provide some programmatic flexibility that is not presently available in the aggregation pipeline.
Bottom line: in a distributed database environment you must have aggregation capabilities. MongoDB gives that to you. MySQL with ScaleBase gives that to you too, while retaining SQL and ACID properties.
In a previous article comparing MongoDB and MySQL, we saw how MongoDB and MySQL with ScaleBase execute auto-sharding, a key technique to deliver database scalability. In MongoDB, any joins across horizontal partitions need to be accomplished by the application. Whereas, MySQL with ScaleBase retains ACID compliance across the distributed database.
In this article, I want to focus on data distribution, reads/writes, joins and data rebalancing.
As a reminder, the goal of dividing data into horizontal partitions is to distribute workloads across multiple servers. This allows an application to achieve higher TPS throughput, support more concurrent users, and accommodate a much larger overall database size, all while avoiding database bottlenecks.
For the best performance, any data that is logically related should be stored together. In this way, queries can be satisfied in a single fetch. But, that ideal situation isn’t always available. What happens when a query needs data from multiple partitions? Let’s take a look at a few example application scenarios and see how the way you distribute data can impact performance.
Reads and Writes in MongoDB and ScaleBase
Imagine a blogging application, with authors, articles, users, comments and tags. The table below identifies four typical blogging activities (scenarios) and indicates how frequently they typically happen.
Figure 1 Example Blogging Application Scenarios
As you can see, we have two “write” activities and two “read” activities. And, two scenarios occur frequently, and two occur less often.
Below are typical data models for this simple application: MongoDB uses typical document type store, and ScaleBase’s typical relational store.
Figure 2 Example Blogging Application Data Models for MongoDB and ScaleBase
The model on the left is the MongoDB data model (represented via BSON) and the model on the right is a relational model. While the exact way that the data is stored is different, both models accomplish the same thing.
So, with the groundwork of our example understood, let’s examine how MongoDB and ScaleBase execute the four blogging application scenarios we outlined earlier.
Scenario Execution with MongoDB – Joins
You can see that for the two “write” scenarios in MongoDB, with that model, adding an article (a high-frequency activity) and adding a tag are both easily accomplished in a single call to a single database. For example, we already know the user, so it is simply a matter of inserting the relevant documents.
However, both of the “read” scenarios require that a “join” be made in the application. For example, if we want to query all of an author’s articles, along with that author’s details, we would need to first request the Users collection (table) and then query the Article collection (table). The application would then have to join those two sets of data and return it to the client.
Depending on how the Article collection is sharded, MongoDB may need to query multiple shards and then aggregate that data before returning it to the blogging application, and for that data to then also be joined with the User collection.
Figure 3. Read / Write Scenario Execution in MongoDB
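The application-side join looks roughly like the sketch below, written with plain Python lists standing in for the two collections (the collection and field names are invented for illustration; a real client would issue two pymongo/driver round trips here):

```python
# Two "collections" standing in for MongoDB Users and Articles (toy data).
users = [
    {"_id": 1, "name": "Ada"},
    {"_id": 2, "name": "Lin"},
]
articles = [
    {"author_id": 1, "title": "Sharding 101"},
    {"author_id": 1, "title": "Chunks and Shards"},
    {"author_id": 2, "title": "Pipelines"},
]

def articles_with_author(author_id):
    """Round trip 1: fetch the user. Round trip 2: fetch their articles.
    The 'join' then happens here, in application code."""
    author = next(u for u in users if u["_id"] == author_id)
    own = [a for a in articles if a["author_id"] == author_id]
    return [{"author": author["name"], "title": a["title"]} for a in own]

print(articles_with_author(1))
```

Every place the application needs this combined view, it must repeat (or factor out) this joining logic, which is exactly the coding burden a database-level join removes.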
Scenario Execution with ScaleBase with MySQL – Joins
In ScaleBase with MySQL, the data may be distributed across partitions, but the distribution process will co-locate logically related user data with an author’s articles. In cases where data is not co-located, ScaleBase will join the data at the database level. This means the application does not need to code those joins. In this way, three of the four scenarios are satisfied in a single call.
Figure 4. Read / Write Scenario Execution in ScaleBase and MySQL
As you can see, both the MongoDB and ScaleBase data models can obviously support our example blogging application. In MongoDB, there can be read scenarios where a join needs to be accomplished in the application. In ScaleBase, for the one scenario that requires it, the join is accomplished by ScaleBase, not within the application.
Data Rebalancing in MongoDB, and ScaleBase with MySQL
There can come a time when you need to rethink how data is distributed across servers. For example, application usage patterns may evolve. And, as the number of concurrent users goes up and/or transaction and data volumes increase, you may find that certain data nodes or partitions become “hotspots”, receiving more read/write workloads than other parts of the distributed database cluster. When this happens, you will want to adjust how data is distributed in the future, and possibly also redistribute existing data to achieve a more balanced distributed workload.
In both MongoDB and ScaleBase, data is distributed in “chunks”. In MongoDB, a data chunk is a set of documents, and the default size for a data chunk is 64 megabytes. In ScaleBase, a data chunk is a logical data slice from an entire relational hierarchy.
Splitting Shards and Rebalancing Partitions
When you want to alleviate the workloads from a database hotspot, there are two typical data redistribution use-cases: splitting partitions and rebalancing partitions.
They are easily explained in a few pictures.
Figure 5. Data Redistribution: Splitting and Rebalancing Shards
Simply stated, in splitting partitions you are adding resources to your distributed database, and in re-balancing partitions you are redistributing data across already existing server resources.
Both MongoDB and ScaleBase can automatically redistribute data across partitions, online and without downtime, by adding and removing nodes as needed, to handle data volume changes and hotspots.
A new server might need to be added for more capacity, but it may also be possible to move “hot” data from an overused server node to an underused server node. Remember that in MongoDB, a chunk is the smallest logical set of data. When migrating between over- and under-used nodes, MongoDB moves entire chunks from one server to the other. Whereas, in ScaleBase, a data chunk is a logical data slice from an entire relational hierarchy. ScaleBase will similarly move entire chunks of data in a way that maintains logical data relations.
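A toy model of that chunk migration between a hot and a cold node is sketched below. The threshold, node names, and data are invented; the real balancers in MongoDB and ScaleBase are far more sophisticated (they migrate online, throttle transfers, and consider data relations):

```python
# Each node holds a list of chunks; a chunk is the smallest unit we may move.
cluster = {
    "node_a": ["chunk1", "chunk2", "chunk3", "chunk4", "chunk5"],  # hot
    "node_b": ["chunk6"],                                          # cold
}

def rebalance(nodes, max_skew=1):
    """Move whole chunks from the fullest node to the emptiest node
    until the chunk counts differ by at most max_skew."""
    while True:
        hot = max(nodes, key=lambda n: len(nodes[n]))
        cold = min(nodes, key=lambda n: len(nodes[n]))
        if len(nodes[hot]) - len(nodes[cold]) <= max_skew:
            return nodes
        nodes[cold].append(nodes[hot].pop())  # migrate one whole chunk

rebalance(cluster)
print({n: len(chunks) for n, chunks in cluster.items()})
```

The key property the sketch preserves is that only whole chunks move: a chunk never gets split between nodes mid-migration.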
The best part, for both MongoDB users and ScaleBase users, is that the logic for data redistribution and rebalancing happens behind the scenes, automatically. Hotspots are detected and data is moved using logic at the data layer. Once again, your application does not need to contain code for that functionality.
We’ve covered a lot of material. The figure below summarizes the discussion so far for MongoDB and ScaleBase auto-sharding, data distribution and join capabilities.
As you can see, MongoDB and ScaleBase both equally support auto-sharding, data distribution at the database level, scalable workload distribution, and application transparency. Where they differ is at the references and relational level. In MongoDB, each document collection is sharded independently and joins are not supported. In ScaleBase, MySQL relations are retained and form the basis of a data distribution policy, and joins are supported locally on each node, and across the entire distributed database cluster.
Next time I want to cover MongoDB and MySQL with ScaleBase query models.
Whether it is for extremely large data volumes, high transaction throughput or a massive number of concurrent users, today’s internet and mobile applications require the ability to respond to a new level of performance and usability. Databases need to be flexible enough to quickly accommodate growth and changes as they arise.
A lot has been written to compare MongoDB and MySQL. Both are solid solutions for DBAs to use. Given my focus on SQL, I’ve not yet written about MongoDB. However, I expect many of my readers are investigating NoSQL solutions, and the most popular NoSQL database is MongoDB.
At a high level, MongoDB is a NoSQL document database. It is considered highly scalable and cloud-ready. MySQL is a leading relational database that I know many of you are already using. I’d like to offer some thoughts comparing MongoDB and MySQL across three main areas:
Scalability, throughput and cloud-readiness.
Data distribution, reads / writes, joins and data rebalancing
Query models, and how application calls and queries are answered
Because MySQL is not natively a distributed database, I want to level the playing field. ScaleBase is a well-regarded distributed relational database platform that uses MySQL (or MariaDB, or Percona Server) for data storage. Adding ScaleBase into the equation makes MySQL a distributed database too, but it’s still a relational database with SQL properties and ACID compliance.
MongoDB and MySQL, with ScaleBase, are both positioned to take software engineering teams and their applications to the next level of data scalability.
I hope my comparison can help you understand which technology might be best for your new projects. Ultimately, data architects, dev-ops and developers need to choose the database technology that best meets their application’s requirements.
1. Scalability, throughput and cloud-readiness
MongoDB and MySQL with ScaleBase each leverage several proven techniques when it comes to scaling out applications.
A very powerful technique that can greatly extend the scalability of a database is called “sharding.” A lot has been written about sharding, and the difficulties inherent in homegrown, do-it-yourself partitioning approaches, so I won’t cover those topics in detail again here.
Let’s just say that sharding is a process that splits large data sets on a single server into smaller data sets, or horizontal partitions, placed onto multiple servers. The goal of this division of the data is to divide workloads between multiple servers, so as to 1) increase TPS throughput, 2) allow more concurrent users, and 3) to allow for much larger overall database size.
Figure 1 Scalability via auto-sharding
Sharding is an important part of how MongoDB achieves its scalability. In MongoDB, data is split between shards using a shard key. The shard key is a single field (or a set of composite fields) that identifies the data to be sharded. Data is divided into “chunks”, by default 64MB of data. The chunk is the smallest logical block of data that will be sharded.
A particular shard is responsible for one or more chunks. Each shard (typically backed by a replica set) stores a subset of the application data, identified by the shard key. MongoDB controls which subset is stored on a particular shard.
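The shard-key mechanism can be sketched as a routing function: a deterministic mapping from a document's shard-key value to the shard that owns the containing chunk. A minimal range-based sketch (the ranges, field name, and shard count here are invented for illustration):

```python
import bisect

# Range-based shard map: upper bounds of the shard-key ranges (invented).
# Shard 0 owns keys < 100, shard 1 owns 100..999, shard 2 owns >= 1000.
range_bounds = [100, 1000]
shards = [dict(), dict(), dict()]

def route(shard_key):
    """Pick the shard whose key range contains this shard-key value."""
    return bisect.bisect_right(range_bounds, shard_key)

def insert(doc):
    shard = route(doc["user_id"])        # user_id plays the shard key here
    shards[shard][doc["user_id"]] = doc
    return shard

insert({"user_id": 42,   "name": "Ada"})   # lands on shard 0
insert({"user_id": 500,  "name": "Lin"})   # lands on shard 1
insert({"user_id": 5000, "name": "Sam"})   # lands on shard 2
print([len(s) for s in shards])
```

Because the mapping is deterministic, a router holding the range table can always send a query by `user_id` straight to the one shard that can answer it.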
If you look directly into an individual shard to see what it contains, you will see a random subset of the data. Which data replica is stored in which shard, or partition, is somewhat random to the user, and unknown to the application. That’s why you (the client) always need to connect to the mongos and let MongoDB find the data you need.
Figure 2 MongoDB Architecture (taken from the MongoDB web site)
In the MongoDB architecture diagram above, the configuration of the cluster is stored in the Config Servers (which, incidentally, are mongod databases). Queries and data going back and forth between the client and the cluster are sent through the MongoDB routers, the mongos processes. The mongos will route a query to the appropriate backend database or databases.
Data distribution occurs at the database level – it’s not defined by the application. The application actually “sees” a single database, and MongoDB handles retrieving data from each shard partition (mongod). Since MongoDB is a non-relational database, there are no database joins (I’ll have more on that later), so joins are achieved in the application. Most of the query processing is done in the data layer, with aggregations done in the MongoDB middleware.
Now that we’ve reviewed sharding in MongoDB, you might be surprised to learn this is very similar to how MySQL with ScaleBase works. ScaleBase also automates sharding. Like MongoDB, ScaleBase’s data distribution is handled at the database level (and not in the application). Importantly, because ScaleBase maintains a relational data environment, ScaleBase will ensure that data that is accessed together is also stored together.
Like MongoDB, with ScaleBase the application “sees” a single database and not a bunch of shards/partitions. Like MongoDB, ScaleBase also provides agility, flexibility, and growth. Database optimization is dynamic. Rebalancing (adding and removing nodes to accommodate growth and to handle database “hotspots”) is dynamic and can be executed while the entire system is online. All this without changes to the application design or codebase.
However, there are two ways that ScaleBase sharding differs.
Retaining ACID / SQL: With ScaleBase, joins are supported, even in a distributed environment. As MySQL is a relational database, ScaleBase embraces and retains data relations expressed in joins and transactions. With ScaleBase, queries can have local joins (within one shard) as well as cross-database joins. Unlike MongoDB where joins are accomplished in your application, ScaleBase will automatically aggregate data across shards for you. In this way, ScaleBase gives you ACID transactions and 2-phase commits across a sharded cluster of MySQL database nodes. This is unique.
MySQL Ecosystem: The second way ScaleBase is a bit different is it preserves your MySQL skills and ecosystem technologies. It provides this scalability to your existing MySQL applications, if they are growing beyond the capabilities of your server (or Amazon RDS instance). Maybe more importantly, ScaleBase gives new MySQL apps a modern, distributed and relational data framework with near-linear scalability.
Figure 3 ScaleBase Architecture
Looking at a ScaleBase architecture diagram and relating it back to MongoDB, the ScaleBase Config Servers here are the equivalent of MongoDB’s mongod Config Servers, and ScaleBase itself is the equivalent of the MongoDB router, the mongos. The similarities are striking.
So far I’ve only focused my comparison on scalability, and we’ve seen that MongoDB and MySQL with ScaleBase are highly comparable. MongoDB is a NoSQL database well known for its scalability. The database world calls it a web-scale technology, and it’s catching on with developers worldwide. MySQL is the world’s most widely used open source database, with a rich ecosystem of tooling and skilled users. ScaleBase is a modern relational database platform, built using standard MySQL and optimized for the cloud, that takes advantage of your existing MySQL development skills and ecosystem technologies.