Note: This blog post part 3 of the series and you can download 30-day trial of ScaleBase to practice the concepts.
In this article comparing MongoDB and MySQL scalability, I want to focus on query models.
We’ve previously discussed how MongoDB and ScaleBase are both highly scalable distributed databases, where data is distributed across partitions and clusters so as to distribute workloads, transactions, and concurrent users.
Now let’s compare query models, and see how MongoDB and MySQL with ScaleBase answer application calls and queries.
The Challenge – Aggregating query results from several database nodes
Famously, efficiently satisfying queries on a distributed system can be challenging. While accessing a single collection or document is easy enough, building result-sets from data spread across multiple nodes and collections can often become a manual coding process.
At a high-level, we can say a query can be filtered either by ID (and return a single result) or a range (and return multiple results).
In a distributed database, the challenge for a query is to efficiently and accurately access several data partitions where each partition gives a partial results, and then to aggregate results efficiently so as to provide one answer to the application and user.
At a high level we can say there are several operations that aggregate query results:
– Aggregate functions: count, min, max, sum
The challenge is to execute these operations in a distributed data environment and so that the applications still “sees” one database.
Figure 5. A query in a distributed database environment
For example, for a distributed database with four partitions, a simple count of records across the entire database needs four queries, run in parallel on four databases, which results in four numbers that need to be summed up and returned to the application as a single summed number.
As I mentioned in my previous entry, since MongoDB does not do joins, discrete docs are “joined” together inside the application after several roundtrips to the database.
Aggregations can make this even more complex, but both MongoDB and ScaleBase do most of the work for you. An aggregation is an operation that scans over a set of documents and returns computed values such as Sum, Average, or Count.
So, what options do MongoDB and ScaleBase offer to aggregate query results from across several database partitions?
MongoDB Query Aggregation Options
MongoDB provides two main methods of aggregation: the aggregation pipeline (and simple single purpose pipeline operations), and map-reduce.
MongoDB Aggregation Pipeline
The aggregation pipeline is a framework for aggregates built into MongoDB (since version 2.2). You can think of the pipeline framework as working something like the UNIX pipe command. You stream documents through the pipeline, filtering and transforming as needed on each operator.
Figure 6 Annotated Aggregation Pipeline from Mongodb.com documentation http://docs.mongodb.org/master/_images/aggregation-pipeline.png
Each pipeline operator can do things like skip, match, sort and even geospatial style matching. You can improve performance by doing filtering (match) at the beginning of the pipeline, thus reducing the amount of data being scanned and manipulated. The aggregation pipeline can use indexes that are available.
Single Purpose Aggregation Operations
The single purpose aggregation operations are very simple procedures that return a very specific set of data. Examples would be count on a single value, grouping, a distinct list of values, etc.
However, it’s important to note that in Mongo DB, the ‘Group’ function does NOT support distributed database nodes or partitions. This is a major difference. Secondly, all single purpose aggregation operation results must be less than 16MB.
Figure 7 Annotated Single Purpose Aggregation Operation from Mongodb.com documentation http://docs.mongodb.org/master/_images/distinct.png
MongoDB’s Map-Reduce capability provides programmatic query processing flexibility not available in Aggregation Pipeline, but at a cost to performance and coherence. Map-Reduce is a massively parallel process for manipulating and condensing large volumes of data down to something more useful. MongoDB provides the ‘mapReduce’ command to process map-reduce scripts.
In a map-reduce process, you match data that you want to work with (the map process) and then you filter and/or condense (with the reduce process).
The map process creates a set of key->value pairs that you want to work with and the reduce process takes those values in. In MongoDB, a map process takes in a collection and the output of the reduce operation can be a collection or it can be returned inline. If you generate a collection, it can then be used as input to another map-reduce process.
Figure 8 Annotated Map-Reduce from Mongodb.com documentation http://docs.mongodb.org/master/_images/map-reduce.png
Like the pipelined aggregation, map-reduce does support partitioned databases. MongoDB notes that while they have made improvements in map-reduce performance in later releases, the aggregation pipeline is usually more performant, though it may not be as dynamic and functional.
ScaleBase Query Aggregation Options
ScaleBase also provides two main methods of query data aggregation: ALL_DB Aggregation and an automatic and built-in “Map-Reduce-Like” capability that executes across distributed database partitions.
The primary difference is with ScaleBase you can use SQL and tools that you are most likely already familiar with.
Additionally, with ScaleBase you don’t need to decide between an aggregation pipeline or a map reduce approach. Instead, you submit regular SQL query, and ScaleBase performs the all the aggregations operations behind the scenes for you.
ScaleBase supports single-phase and multi-phase ALL_DB aggregation.
For every ALL_DB command, ScaleBase performs:
- Parallel execution
- Local processing at each individual DB, such as Filter, local joins, local groups, local sort (That’s most of the processing!)
- Meta-Aggregation at ScaleBase Controller
SELECT COUNT(*) FROM my_distributed _table;
ScaleBase delegates the same command to all databases in parallel. Databases perform it very efficiently in parallel, on smaller datasets, returns n counts to ScaleBase which in turn sums all the counts to one big count, to return to the client.
The same goes with “sum of sums”, “min of mins” and so on. ScaleBase even supports a global ALL_DB average aggregated function.
In addition, consider the following supported cases:
- Multi-phase aggregation:
GROUP BY + HAVING + ORDER BY + LIMIT
- DDL (ALTER TABLE, CREATE INDEX)
As mentioned before, ScaleBase provides a Ma-Reduce-Like capability. ScaleBase will deconstruct the query; run parallel scans, and aggregate results.
To compare Map-Reduce in MySQL / ScaleBase and MongoDB, let’s look at the figure below.
Figure 9 MySQL Query and MongoDB Map-Reduce Script
Comparing a typical Map-Reduce query in MySQL and MongoDB, you can see that to get the same functionality, a simple SQL query is a lot easier to read and to write. Ideally, less code can also mean less bugs and less maintenance.
And the above example is just for a very simple SQL command on only one table, with no joins, with three WHERE predicates and six SELECT predicates. As we know, queries can be much longer than that, but not more complicated. This is really the power of SQL.
You can see how the ScaleBase query flow resembles a map-reduce program in the figure below.
Figure 10. Map-Reduce Like Functionality in ScaleBase
With no extra coding, you get localized access at the partition/node level, running in parallel, with local sorts, joins and grouping. ScaleBase does the final aggregation for your application in the middle ScaleBase layer. Not in the application.
Aggregation is a fact of life in the data world. You don’t want a tool that makes your life harder than it has to be. MongoDB and ScaleBase make it as easy as possible for you.
MongoDB was developed after SQL databases were around for quite a while already and providing great query capabilities. MongoDB is much more than a key/value database. It has great querying capabilities as well, some say equal to SQL databases, with very similar concepts of filters, indexes, results aggregation
The main difference is in the “how”.
In MongoDB, for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface. However, map-reduce operations provide some programmatic flexibility that is not presently available in the aggregation pipeline.
The take-way is that (as we have seen in scalability, auto-sharding and data redistribution already), in query model and query execution there’s more similarity than difference between MongoDB and MySQL with ScaleBase.
Bottom line: In a distributed database environment you must have the aggregation capabilities. MongoDB give that to you. MySQL with ScaleBase gives that to you too, but retaining SQL and ACID properties.
If you’ve liked the ideas discussed in this series of articles comparing MongoDB and MySQL with Scalebase, offers a free 30-day trial you can download (http://www.scalebase.com/download-trial/).
Reference: Pinal Dave (http://blog.sqlauthority.com)