# SQL SERVER – Statistical Analysis in SQL Server – A Quick Introduction

SQL Server has very few statistical functions. Oracle has lots of them. XLeratorDB closes the gap. If you want to try out the examples in this blog post, I suggest you download the 15-day free trial of XLeratorDB now.

Most SQL Server users would never think of T-SQL as a platform for doing sophisticated statistical analysis. One reason for that is the relative lack of statistical functions that come with SQL Server, whereas Oracle users have lots of functions available to them. The developers at Westclintech have done an admirable job of filling that gap by providing over 300 statistical and mathematical functions, including functions for numerically stable calculations of moments about the mean (variance, standard deviation, kurtosis and skewness), distribution functions (beta, normal, bivariate normal, gamma, binomial, t, chi square, F, etc.), linear regression functions, correlation, statistical inference, interpolation, and many more.

Let’s look at a very simple statistical analysis. We are going to predict demand in units for some of our products based upon previous year’s unit sales.

Let’s put some data in a table.

```
SELECT *
INTO #n
FROM (VALUES
     (856018,'widget 1',2008,92303),(856018,'widget 1',2009,172259),(856018,'widget 1',2010,262942),(856018,'widget 1',2011,356509),(856018,'widget 1',2012,441066),(856018,'widget 1',2013,536820)
    ,(748396,'widget 2',2008,13614),(748396,'widget 2',2009,20498),(748396,'widget 2',2010,32809),(748396,'widget 2',2011,49666),(748396,'widget 2',2012,55390),(748396,'widget 2',2013,76579)
    ,(373748,'widget 3',2008,99394),(373748,'widget 3',2009,202891),(373748,'widget 3',2010,297216),(373748,'widget 3',2011,401210),(373748,'widget 3',2012,500209),(373748,'widget 3',2013,591125)
    ,(509819,'widget 4',2008,36723),(509819,'widget 4',2009,63566),(509819,'widget 4',2010,95523),(509819,'widget 4',2011,129923),(509819,'widget 4',2012,163620),(509819,'widget 4',2013,193636)
    ,(549801,'widget 5',2008,11127),(549801,'widget 5',2009,27420),(549801,'widget 5',2010,42190),(549801,'widget 5',2011,49519),(549801,'widget 5',2012,67777),(549801,'widget 5',2013,74313)
    ,(628964,'widget 6',2008,90024),(628964,'widget 6',2009,173597),(628964,'widget 6',2010,262586),(628964,'widget 6',2011,345029),(628964,'widget 6',2012,427760),(628964,'widget 6',2013,519919)
    ,(772493,'widget 7',2008,80854),(772493,'widget 7',2009,168631),(772493,'widget 7',2010,261083),(772493,'widget 7',2011,346453),(772493,'widget 7',2012,428067),(772493,'widget 7',2013,514310)
    ,(477653,'widget 8',2008,94886),(477653,'widget 8',2009,198272),(477653,'widget 8',2010,296506),(477653,'widget 8',2011,390449),(477653,'widget 8',2012,486135),(477653,'widget 8',2013,593191)
    ,(268817,'widget 9',2008,95857),(268817,'widget 9',2009,198910),(268817,'widget 9',2010,292704),(268817,'widget 9',2011,385263),(268817,'widget 9',2012,489351),(268817,'widget 9',2013,585276)
    ) n([Product ID],[Description],[Year],[Units])
```

We can use the XLeratorDB FORECAST function to predict unit demand for 2014 based upon the actual units from 2008 through 2013. FORECAST is a user-defined aggregate function, so the syntax is like any other aggregate function. Since we want to know the results by Product ID and Description, we will use a GROUP BY to summarize the results.

```
SELECT [Product ID]
    ,[Description]
    ,ROUND(wct.FORECAST(2014,[Units],[Year]), 0) AS Forecast
FROM #n
GROUP BY [Product ID]
    ,[Description]
```

This provides us with the 2014 predicted value for each product.

```
Product ID  Description  Forecast
----------  -----------  --------
856018      widget 1       622574
748396      widget 2        85062
373748      widget 3       694134
509819      widget 4       225745
549801      widget 5        89824
628964      widget 6       602593
772493      widget 7       602995
477653      widget 8       688146
268817      widget 9       682325
```

If we want to include the r-squared for the FORECAST (an indication of the goodness of fit of the prediction), we can use the RSQ function.

```
SELECT [Product ID]
    ,[Description]
    ,ROUND(wct.FORECAST(2014,[Units],[Year]), 0) AS Forecast
    ,wct.RSQ([Units],[Year]) AS [R-squared]
FROM #n
GROUP BY [Product ID]
    ,[Description]
```

In addition to the 2014 predicted values we can also see the r-squared associated with that prediction. The higher the r-squared, the better the fit of the least-squares line.

```
Product ID  Description  Forecast  R-squared
----------  -----------  --------  -----------------
856018      widget 1       622574  0.99946328252557
748396      widget 2        85062  0.976925841604458
373748      widget 3       694134  0.999667182131839
509819      widget 4       225745  0.998768744610505
549801      widget 5        89824  0.985511627560019
628964      widget 6       602593  0.999766345380639
772493      widget 7       602995  0.99964651777832
477653      widget 8       688146  0.999647675706309
268817      widget 9       682325  0.999735740913981
```
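Under the hood, FORECAST and RSQ are ordinary least-squares calculations. As a sanity check, here is a minimal Python sketch (my own illustration, not the XLeratorDB implementation) that reproduces the widget 1 row above:

```python
# Ordinary least-squares forecast and r-squared for widget 1,
# reproducing wct.FORECAST(2014, Units, Year) and wct.RSQ(Units, Year).
years = [2008, 2009, 2010, 2011, 2012, 2013]
units = [92303, 172259, 262942, 356509, 441066, 536820]

n = len(years)
mean_x = sum(years) / n
mean_y = sum(units) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, units))
sxx = sum((x - mean_x) ** 2 for x in years)
syy = sum((y - mean_y) ** 2 for y in units)

slope = sxy / sxx
forecast_2014 = mean_y + slope * (2014 - mean_x)   # least-squares prediction
r_squared = sxy * sxy / (sxx * syy)                # goodness of fit

print(round(forecast_2014))  # 622574
print(round(r_squared, 8))   # 0.99946328
```

The same slope and mean pair yields the prediction for any other year; the aggregate function simply performs this arithmetic per GROUP BY group.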

Finally, in this SQL we simply PIVOT the results in order to present the forecast alongside the historical results for each year.

```
SELECT [Product Id]
    ,[Description]
    ,[2008],[2009],[2010],[2011],[2012],[2013],[2014]
FROM (
    SELECT *
    FROM #n
    UNION ALL
    SELECT [Product ID]
        ,[Description]
        ,2014
        ,ROUND(wct.FORECAST(2014,[Units],[Year]), 0) AS Units
    FROM #n
    GROUP BY [Product ID]
        ,[Description]
    ) D
PIVOT(
    SUM([units]) FOR [Year] IN ([2008],[2009],[2010],[2011],[2012],[2013],[2014])
    ) P
```

Now we can see the predicted values for 2014 alongside the historical values for 2008 through 2013.

Now that I can actually see how this type of analysis works in T-SQL, I want to do statistical analysis in SQL Server. In my simple example above, we could have predicted the 2014 sales at almost any level of detail. Say we wanted to include geographic information in our predictions (assuming that we have recorded that information in our database): we could predict demand by region, by state, by city, probably even by store address, with only minor changes to our SQL; something that could be done in a few seconds, rather than in days or weeks.

Doing this type of analysis is also extremely fast and scalable. The FORECAST function had no problem chewing through 10,000,000 rows of data in 2.5 seconds on my machine. In fact, it seems that for certain types of analyses it might actually be possible to make predictions in real time.

### Numeric Instability

XLeratorDB has an enormous breadth of functions for statistical analysis, but one of the more interesting things that I discovered was that some of the built-in SQL Server functions are what the XLeratorDB documentation describes as ‘numerically unstable’.

Here’s a very simple example of numeric instability using a built-in SQL Server function.

```
SELECT STDEVP(x) AS [STDEVP]
FROM (VALUES
     (900000016.4)
    ,(900000010.3)
    ,(900000018.5)
    ,(900000006.2)
    ,(900000010.3)
    ) n(x)
```

This produces the following result.

```STDEVP
----------------------
0```

When I put the same values in Excel it returns 4.482231578.

This SQL performs the same calculation using the XLeratorDB function:

```
SELECT wct.STDEV_P(x) AS STDEV_P
FROM (VALUES
     (900000016.4)
    ,(900000010.3)
    ,(900000018.5)
    ,(900000006.2)
    ,(900000010.3)
    ) n(x)
```

producing the following result,

```STDEV_P
----------------------
4.4822315476529```

which is pretty close to the value returned by Excel. You can read more about what’s going on by reading this blog on the XLeratorDB web site.
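I don't know exactly how STDEVP is implemented internally, but the classic culprit for this kind of failure is the textbook one-pass formula, variance = E[x²] − (E[x])², which subtracts two nearly equal huge numbers. A short Python sketch (my own illustration) shows the cancellation on the same five values:

```python
import math

data = [900000016.4, 900000010.3, 900000018.5, 900000006.2, 900000010.3]
n = len(data)

# Naive one-pass formula: E[x^2] - (E[x])^2. With values near 9e8, x*x is
# near 8.1e17, and float64 keeps only ~16 significant digits, so the
# subtraction cancels nearly all of them.
naive_var = sum(x * x for x in data) / n - (sum(data) / n) ** 2

# Stable two-pass formula: compute the mean first, then sum the squared
# deviations, which are small numbers that do not cancel.
mean = sum(data) / n
stable_var = sum((x - mean) ** 2 for x in data) / n

print(math.sqrt(stable_var))  # ~4.48223155, close to Excel's 4.482231578
print(naive_var)              # garbage from cancellation (may even be negative)
```

Welford's online algorithm is the usual single-pass alternative when a second pass over the data is not practical.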

If you are interested in doing statistical analysis and even predictive analytics in SQL Server you should download the 15-day free trial of XLeratorDB today and find out what you can do.

Reference: Pinal Dave (http://blog.sqlauthority.com)

# SQL SERVER – Finding the Last Backup for All Databases – Notes from the Field #016

[Notes from Pinal]: During my presentations, I always ask developers a simple question – when did you take your last database backup? I get two kinds of answers: a few reply with "no idea" and a few reply with a date and time. When I further ask whether they are confident that the backup was taken successfully, I see confused faces. Well, in this episode of Notes from the Field, Tim answers that very question.

Linchpin People are database coaches and wellness experts for a data driven world. In this 16th episode of the Notes from the Field series, database expert Tim Radney (partner at Linchpin People) explains a very simple script to find when the last successful backup of each database was taken.

As data professionals (those responsible for supporting database environments), one of our most important tasks is making sure that we have proper backups. I regularly perform audits of SQL Server environments, and one of my first checks is to make sure that backups are being performed. A quick check is to see when the last full backup was made. You can do this by running the following script.

```
SELECT a.Name AS [DB_Name],
    COALESCE(CONVERT(VARCHAR(12), MAX(b.backup_finish_date), 101), '-') AS LastBackup
FROM sys.sysdatabases a
LEFT OUTER JOIN msdb.dbo.backupset b ON b.database_name = a.name
GROUP BY a.Name
```

This will only tell you when the last full backup was completed. If you are running daily full backups, this will be sufficient. The script I personally run can be found in a recent blog post. That post includes a script that will get the database name, recovery model, most recent full backup, most recent differential, and the last two transaction log backups.

Anytime I talk with someone about backups I have to stress the importance of validating your backups. You need to have a process in place to regularly validate your backups by restoring them to another environment. Backups that can’t be restored when you need them are useless.

If you want me to take a look at your server and its settings, or if your server is facing any issues, we can Fix Your SQL Server.

Note: Tim has also written an excellent book on SQL Backup and Recovery, a must have for everyone.

Reference: Pinal Dave (http://blog.sqlauthority.com)

# MongoDB and MySQL – Comparing Scalability, Data Distribution & Query Model – Part 3

Note: This blog post is part 3 of the series. You can download a 30-day trial of ScaleBase to practice the concepts.

In this article comparing MongoDB and MySQL scalability, I want to focus on query models.

Now let’s compare query models, and see how MongoDB and MySQL with ScaleBase answer application calls and queries.

### The Challenge – Aggregating query results from several database nodes

Famously, efficiently satisfying queries on a distributed system can be challenging. While accessing a single collection or document is easy enough, building result-sets from data spread across multiple nodes and collections can often become a manual coding process.

At a high-level, we can say a query can be filtered either by ID (and return a single result) or a range (and return multiple results).

In a distributed database, the challenge for a query is to efficiently and accurately access several data partitions, where each partition gives a partial result, and then to aggregate those partial results efficiently so as to provide one answer to the application and user.

At a high level we can say there are several operations that aggregate query results:

–     Aggregate functions: count, min, max, sum

–     Distinct

–     Group

–     Sort

The challenge is to execute these operations in a distributed data environment in such a way that the application still “sees” one database.

Figure 5. A query in a distributed database environment

For example, for a distributed database with four partitions, a simple count of records across the entire database needs four queries, run in parallel on four databases, which results in four numbers that need to be summed up and returned to the application as a single summed number.
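That scatter-gather flow can be sketched in a few lines of Python, with in-memory lists standing in for the four partitions (purely illustrative, not ScaleBase or MongoDB internals):

```python
from concurrent.futures import ThreadPoolExecutor

# Four in-memory "partitions" standing in for four database nodes.
partitions = [
    [1, 2, 3],       # node 1
    [4, 5],          # node 2
    [6, 7, 8, 9],    # node 3
    [10],            # node 4
]

def local_count(partition):
    # Each node runs its own SELECT COUNT(*) on a smaller data set.
    return len(partition)

# Run the four local counts in parallel, then sum the partial results
# into the single number the application sees.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(local_count, partitions))

total = sum(partial_counts)
print(partial_counts, total)  # [3, 2, 4, 1] 10
```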

As I mentioned in my previous entry, since MongoDB does not do joins, discrete docs are “joined” together inside the application after several roundtrips to the database.

Aggregations can make this even more complex, but both MongoDB and ScaleBase do most of the work for you. An aggregation is an operation that scans over a set of documents and returns computed values such as Sum, Average, or Count.

So, what options do MongoDB and ScaleBase offer to aggregate query results from across several database partitions?

### MongoDB Query Aggregation Options

MongoDB provides two main methods of aggregation: the aggregation pipeline (and simple single purpose pipeline operations), and map-reduce.

### MongoDB Aggregation Pipeline

The aggregation pipeline is a framework for aggregates built into MongoDB (since version 2.2). You can think of the pipeline framework as working something like the UNIX pipe command. You stream documents through the pipeline, filtering and transforming as needed on each operator.

Figure 6 Annotated Aggregation Pipeline from Mongodb.com documentation http://docs.mongodb.org/master/_images/aggregation-pipeline.png

Each pipeline operator can do things like skip, match, sort and even geospatial style matching. You can improve performance by doing filtering (match) at the beginning of the pipeline, thus reducing the amount of data being scanned and manipulated. The aggregation pipeline can use indexes that are available.
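To make the pipeline idea concrete, here is a toy Python imitation of a `$match` → `$group` flow (the real MongoDB operators are far richer; the document fields and function names here are my own):

```python
# Toy documents, standing in for a MongoDB collection.
orders = [
    {"cust": "A", "status": "shipped", "amount": 50},
    {"cust": "B", "status": "pending", "amount": 20},
    {"cust": "A", "status": "shipped", "amount": 30},
    {"cust": "B", "status": "shipped", "amount": 25},
]

def match(docs, **criteria):
    # Like $match: filter early, so later stages scan fewer documents.
    return [d for d in docs if all(d[k] == v for k, v in criteria.items())]

def group_sum(docs, key, field):
    # Like $group with a $sum accumulator.
    totals = {}
    for d in docs:
        totals[d[key]] = totals.get(d[key], 0) + d[field]
    return totals

# Stream the documents through the two stages.
result = group_sum(match(orders, status="shipped"), "cust", "amount")
print(result)  # {'A': 80, 'B': 25}
```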

### Single Purpose Aggregation Operations

The single purpose aggregation operations are very simple procedures that return a very specific set of data. Examples would be count on a single value, grouping, a distinct list of values, etc.

However, it’s important to note that in MongoDB, the ‘Group’ function does NOT support distributed database nodes or partitions. This is a major difference. Secondly, all single purpose aggregation operation results must be less than 16MB.

Figure 7 Annotated Single Purpose Aggregation Operation from Mongodb.com documentation http://docs.mongodb.org/master/_images/distinct.png

### MongoDB Map-Reduce

MongoDB’s Map-Reduce capability provides programmatic query processing flexibility not available in Aggregation Pipeline, but at a cost to performance and coherence.  Map-Reduce is a massively parallel process for manipulating and condensing large volumes of data down to something more useful. MongoDB provides the ‘mapReduce’ command to process map-reduce scripts.

In a map-reduce process, you match data that you want to work with (the map process) and then you filter and/or condense (with the reduce process).

The map process creates a set of key->value pairs that you want to work with and the reduce process takes those values in. In MongoDB, a map process takes in a collection and the output of the reduce operation can be a collection or it can be returned inline. If you generate a collection, it can then be used as input to another map-reduce process.

Figure 8 Annotated Map-Reduce from Mongodb.com documentation http://docs.mongodb.org/master/_images/map-reduce.png

Like the pipelined aggregation, map-reduce does support partitioned databases. MongoDB notes that while they have made improvements in map-reduce performance in later releases, the aggregation pipeline is usually more performant, though it may not be as dynamic and functional.
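The map/shuffle/reduce flow described above can be sketched in Python (shape only; MongoDB's `mapReduce` command takes JavaScript map and reduce functions, and the sample documents here are my own):

```python
from collections import defaultdict
from functools import reduce

docs = [
    {"sku": "widget 1", "units": 3},
    {"sku": "widget 2", "units": 5},
    {"sku": "widget 1", "units": 4},
]

# Map: emit a (key, value) pair per document.
emitted = [(d["sku"], d["units"]) for d in docs]

# Shuffle: gather values by key.
grouped = defaultdict(list)
for key, value in emitted:
    grouped[key].append(value)

# Reduce: condense each key's values down to a single value.
reduced = {key: reduce(lambda a, b: a + b, values)
           for key, values in grouped.items()}

print(reduced)  # {'widget 1': 7, 'widget 2': 5}
```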

### ScaleBase Query Aggregation Options

ScaleBase also provides two main methods of query data aggregation: ALL_DB Aggregation and an automatic and built-in “Map-Reduce-Like” capability that executes across distributed database partitions.

The primary difference is with ScaleBase you can use SQL and tools that you are most likely already familiar with.

Additionally, with ScaleBase you don’t need to decide between an aggregation pipeline or a map-reduce approach. Instead, you submit a regular SQL query, and ScaleBase performs all the aggregation operations behind the scenes for you.

### ALL_DB Aggregation

ScaleBase supports single-phase and multi-phase ALL_DB aggregation.

For every ALL_DB command, ScaleBase performs:

• Parallel execution
• Local processing at each individual DB, such as filters, local joins, local groups, local sorts (that’s most of the processing!)
• Meta-Aggregation at ScaleBase Controller

For example:

```
SELECT COUNT(*) FROM my_distributed_table;
```

ScaleBase delegates the same command to all databases in parallel. The databases execute it very efficiently, in parallel, on their smaller datasets, returning n counts to ScaleBase, which in turn sums all the counts into one grand total to return to the client.

The same goes for “sum of sums”, “min of mins” and so on. ScaleBase even supports a global ALL_DB average aggregate function.
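A global average is the interesting case: averaging the per-database averages would be wrong whenever partitions hold different row counts. The controller has to combine (sum, count) pairs, roughly like this Python sketch (my own illustration, not ScaleBase's actual code):

```python
# Per-partition data; the partitions hold different row counts, so
# averaging the local averages would give the wrong answer.
partitions = [[10, 20, 30], [40], [50, 60]]

# Each database returns a (local_sum, local_count) pair.
partials = [(sum(p), len(p)) for p in partitions]

# Meta-aggregation at the controller: sum of sums / sum of counts.
global_avg = sum(s for s, _ in partials) / sum(c for _, c in partials)

# The wrong way: mean of the local means.
avg_of_avgs = sum(s / c for s, c in partials) / len(partials)

print(global_avg)   # 35.0 (210 / 6)
print(avg_of_avgs)  # 38.33... (not the true average)
```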

In addition, consider the following supported cases:

• Multi-phase aggregation:
GROUP BY + HAVING + ORDER BY + LIMIT
• DDL (ALTER TABLE, CREATE INDEX)

### ScaleBase Map-Reduce

As mentioned before, ScaleBase provides a Map-Reduce-like capability. ScaleBase will deconstruct the query, run parallel scans, and aggregate the results.

To compare Map-Reduce in MySQL / ScaleBase and MongoDB, let’s look at the figure below.

Figure 9 MySQL Query and MongoDB Map-Reduce Script

Comparing a typical Map-Reduce query in MySQL and MongoDB, you can see that to get the same functionality, a simple SQL query is a lot easier to read and to write. Ideally, less code also means fewer bugs and less maintenance.

And the above example is just a very simple SQL command on only one table, with no joins, three WHERE predicates and six SELECT predicates. As we know, queries can be much longer than that, yet no more complicated to express. This is really the power of SQL.

You can see how the ScaleBase query flow resembles a map-reduce program in the figure below.

Figure 10. Map-Reduce Like Functionality in ScaleBase

With no extra coding, you get localized access at the partition/node level, running in parallel, with local sorts, joins and grouping. ScaleBase does the final aggregation for your application in the middle ScaleBase layer, not in the application.

Aggregation is a fact of life in the data world. You don’t want a tool that makes your life harder than it has to be. MongoDB and ScaleBase make it as easy as possible for you.

### Summary

MongoDB was developed after SQL databases had already been around for quite a while, providing great query capabilities. MongoDB is much more than a key/value database. It has great querying capabilities as well, some say equal to SQL databases, with very similar concepts of filters, indexes and result aggregation.

The main difference is in the “how”.

In MongoDB, for most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface. However, map-reduce operations provide some programmatic flexibility that is not presently available in the aggregation pipeline.

The takeaway is that (as we have already seen with scalability, auto-sharding and data redistribution), in query model and query execution there is more similarity than difference between MongoDB and MySQL with ScaleBase.

Bottom line: in a distributed database environment you must have these aggregation capabilities. MongoDB gives that to you. MySQL with ScaleBase gives that to you too, while retaining SQL and ACID properties.

If you’ve liked the ideas discussed in this series of articles comparing MongoDB and MySQL with ScaleBase, ScaleBase offers a free 30-day trial you can download (http://www.scalebase.com/download-trial/).

Reference: Pinal Dave (http://blog.sqlauthority.com)

# MongoDB and MySQL – Comparing Scalability, Data Distribution & Query Model – Part 2

Note: This blog post is part 2 of the series. You can download a 30-day trial of ScaleBase to practice the concepts.

In a previous article comparing MongoDB and MySQL, we saw how MongoDB and MySQL with ScaleBase execute auto-sharding, a key technique for delivering database scalability. In MongoDB, any joins across horizontal partitions need to be accomplished by the application, whereas MySQL with ScaleBase retains ACID compliance across the distributed database.

In this article, I want to focus on data distribution, reads / writes, joins and data rebalancing.

### Data Distribution

As a reminder, the goal of dividing data into horizontal partitions is to distribute workloads across multiple servers. This allows an application to achieve higher TPS throughput, support more concurrent users, and accommodate a much larger overall database size, all while avoiding database bottlenecks.

For the best performance, any data that is logically related should be stored together. In this way, queries can be satisfied in a single fetch.  But, that ideal situation isn’t always available.  What happens when a query needs data from multiple partitions? Let’s take a look at a few example application scenarios and see how the way you distribute data can impact performance.

### Reads and Writes in MongoDB and ScaleBase

Imagine a blogging application, with authors, articles, users, comments and tags.  The table below identifies four typical blogging activities (scenarios) and indicates how frequently they typically happen.

Figure 1 Example Blogging Application Scenarios

As you can see, we have two “write” activities and two “read” activities. And, two scenarios occur frequently, and two occur less often.

Below are typical data models for this simple application: MongoDB uses typical document type store, and ScaleBase’s typical relational store.

Figure 2 Example Blogging Application Data Models for MongoDB and ScaleBase

The model on the left is the MongoDB data model (represented via BSON) and the model on the right is a relational model. While the exact way that the data is stored is different, both models accomplish the same thing.

So, with the groundwork of our example understood, let’s examine how MongoDB and ScaleBase execute the four blogging application scenarios we outlined earlier.

### Scenario Execution with MongoDB – Joins

You can see that for the two “write” scenarios in MongoDB, with that model, adding an article (a high-frequency activity) and adding a tag are both easily accomplished in a single call to a single database. For example, we already know the user, so it is simply a matter of inserting the relevant documents.

However, both of the “read” scenarios require that a “join” be made in the application. For example, if we want to query all of an author’s articles, along with that author’s details, we would need to first request the Users collection (table) and then query the Article collection (table). The application would then have to join those two sets of data and return it to the client.

Depending on how the Article collection is sharded, MongoDB may need to query multiple shards and aggregate that data before returning it to the blogging application, where it must then also be joined with the User collection.

Figure 3. Read / Write Scenario Execution in MongoDB

### Scenario Execution with ScaleBase with MySQL – Joins

In ScaleBase with MySQL, the data may be distributed across articles, but the distribution process will co-locate logically related user data with an author’s articles. In cases where data is not co-located, ScaleBase will join the data at the database level. This means the application does not need to code those joins. In this way, three of the four scenarios are satisfied in a single call.

Figure 4. Read / Write Scenario Execution in ScaleBase and MySQL

As you can see, both the MongoDB and ScaleBase data models can obviously support our example blogging application. In MongoDB, there can be read scenarios where a join needs to be accomplished in the application. In ScaleBase, for the one scenario that requires it, the join is accomplished by ScaleBase, not within the application.

### Data Rebalancing in MongoDB, and ScaleBase with MySQL

There can come a time when you need to rethink how data is distributed across servers. For example, application usage patterns may evolve. And, as the number of concurrent users goes up and/or transaction and data volumes increase, you may find that certain data nodes, or partitions, become “hotspots”, receiving more read/write workloads than other parts of the distributed database cluster. When this happens, you will want to adjust how data is distributed in the future, and possibly also redistribute existing data to achieve a more balanced distributed workload.

### Data Chunks

In both MongoDB and ScaleBase, data is distributed in “chunks”. In MongoDB, a data chunk is a set of documents, and the default size for a data chunk is 64 megabytes. In ScaleBase, a data chunk is a logical data slice from an entire relational hierarchy.

### Splitting Shards and Rebalancing Partitions

When you want to alleviate the workloads from a database hotspot, there are two typical data redistribution use-cases: splitting partitions and rebalancing partitions.

They are easily explained in a few pictures.

Splitting Partitions                                                Re-Balancing Partitions

Figure 5. Data Redistribution: Splitting and Rebalancing Shards

Simply stated, in splitting partitions you are adding resources to your distributed database, and in re-balancing partitions you are redistributing data across already existing server resources.

Both MongoDB and ScaleBase can automatically redistribute data across partitions, online and without downtime, by adding and removing nodes as needed, to handle data volume changes and hotspots.

A new server might need to be added for more capacity, but it may also be possible to move “hot” data from an overused server node to an underused server node. Remember that in MongoDB, a chunk is the smallest logical set of data. When migrating between over- and under-used nodes, MongoDB moves entire chunks from one server to the other. In ScaleBase, a data chunk is a logical data slice from an entire relational hierarchy; ScaleBase will similarly move entire chunks of data in a way that maintains logical data relations.

The best part, for both MongoDB users and ScaleBase users, is that the logic for data redistribution and rebalancing happens behind the scenes, automatically. Hotspots are detected and data is moved using logic at the data layer. Once again, your application does not need to contain code for that functionality.
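The rebalancing logic described above, detecting a hotspot and moving whole chunks to an underused node, can be sketched as follows (my own toy illustration; the node names and 64MB chunk size are assumptions, and real balancers consider much more than total bytes held):

```python
# Nodes with the chunks they currently own; numbers are chunk sizes in MB.
nodes = {"node1": [64, 64, 64, 64], "node2": [64]}

def rebalance(nodes):
    # Detect the hottest and coolest node by total data held.
    hot = max(nodes, key=lambda n: sum(nodes[n]))
    cool = min(nodes, key=lambda n: sum(nodes[n]))
    # Move one entire chunk (the smallest migratable unit) at a time
    # until the two nodes are within one chunk of each other.
    while sum(nodes[hot]) - sum(nodes[cool]) > 64:
        nodes[cool].append(nodes[hot].pop())
    return nodes

rebalance(nodes)
print({n: len(chunks) for n, chunks in nodes.items()})  # {'node1': 3, 'node2': 2}
```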

### Summary

We’ve covered a lot of material.  The figure below summarizes the discussion so far for MongoDB and ScaleBase auto-sharding, data distribution and join capabilities.

As you can see, MongoDB and ScaleBase both equally support auto-sharding, data distribution at the database level, scalable workload distribution, and application transparency. Where they differ is in how references and relations are handled. In MongoDB, each document collection is sharded independently and joins are not supported. In ScaleBase, MySQL relations are retained and form the basis of the data distribution policy, and joins are supported both locally on each node and across the entire distributed database cluster.

Next time I want to cover MongoDB and MySQL with ScaleBase query models.

Reference: Pinal Dave (http://blog.sqlauthority.com)

# MongoDB and MySQL – Comparing Scalability, Data Distribution & Query Model – Part 1

Note: This blog post is part 1 of the series. You can download a 30-day trial of ScaleBase to practice the concepts.

### Introduction

Whether it is for extremely large data volumes, high transaction throughput or a massive number of concurrent users, today’s internet and mobile applications require the ability to respond to a new level of performance and usability. Databases need to be flexible enough to quickly accommodate growth and changes as they arise.

A lot has been written to compare MongoDB and MySQL. Both are solid solutions for DBAs to use. Given my focus on SQL, I’ve not yet written about MongoDB. However, I expect many of my readers are investigating NoSQL solutions, and the most popular NoSQL database is MongoDB.

At a high level, MongoDB is a NoSQL document database. It is considered highly scalable and cloud-ready. MySQL is a leading relational database that I know many of you are already using. I’d like to offer some thoughts comparing MongoDB and MySQL across three main areas:

1. Scalability, throughput and cloud-readiness
2. Data distribution, reads / writes, joins and data rebalancing
3. Query model

Because MySQL is not natively a distributed database, I want to level the playing field. ScaleBase is a well-regarded distributed relational database platform that uses MySQL (or MariaDB, or Percona Server) for data storage. Adding ScaleBase to the equation makes MySQL a distributed database too, but it’s still a relational database with SQL properties and ACID compliance.

MongoDB and MySQL, with ScaleBase, are both positioned to take software engineering teams and their applications to the next level of data scalability.

I hope my comparison can help you understand which technology might be best for your new projects.  Ultimately, data architects, dev-ops and developers need to choose the database technology that best meets their application’s requirements.

### 1. Scalability, throughput and cloud-readiness

MongoDB and MySQL with ScaleBase each leverage several proven techniques when it comes to scaling out applications.

### Auto-Sharding

A very powerful technique that can greatly extend the scalability of a database is called “sharding.”  A lot has been written about sharding, and the difficulties inherent in homegrown, do-it-yourself partitioning approaches, so I won’t cover those topics in detail again here.

Let’s just say that sharding is a process that splits large data sets on a single server into smaller data sets, or horizontal partitions, placed onto multiple servers. The goal of this division of the data is to divide workloads between multiple servers, so as to 1) increase TPS throughput, 2) allow more concurrent users, and 3) to allow for much larger overall database size.

Figure 1 Scalability via auto-sharding

### MongoDB Sharding

Sharding is an important part of how MongoDB achieves its scalability. In MongoDB, data is split between shards using a shard key. The shard key is a single field (or a set of composite fields) that identifies the data to be sharded. Data is defined in “chunks”, by default 64MB of data. The chunk is the smallest logical block of data that will be sharded.

A particular shard is responsible for one, or more, chunks. The shard stores a “replica set” of the application data. This is a subset of all the data and is identified by the particular shard key. MongoDB controls which replica set is stored on a particular shard.

If you look directly into an individual shard to see what it contains, you will see a random subset of the data. Which data replica is stored in which shard, or partition, is somewhat random to the user, and unknown to the application. That’s why you (the client) always need to connect to the mongos and let MongoDB find the data you need.

Figure 2 MongoDB Architecture (taken from the MongoDB web site)

In the MongoDB architecture diagram above, the configuration of the cluster is stored in the Config Servers (which coincidentally are mongod databases). Queries, and data, going back and forth between the client and the cluster are sent through the MongoDB routers, mongos. The mongos will route a query to the appropriate backend database or databases.

Data distribution occurs at the database level – it’s not defined by the application. The application actually “sees” a single database and MongoDB handles retrieving data from each shard partition (mongod). As a non-relational database, there are no database joins (I’ll have more on that later), so joins are achieved in the application.  Most of the query processing is done in the data layer with aggregations done in the MongoDB middleware.
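The routing step, finding which chunk's shard-key range covers a given key and forwarding the query to the shard that owns that chunk, can be pictured with a simplified Python sketch (the boundaries and shard names here are invented for illustration; real chunk metadata lives in the config servers):

```python
import bisect

# Chunk boundaries on the shard key: chunk i holds keys in
# the half-open range [bounds[i], bounds[i+1]).
bounds = [0, 1000, 5000, 10**9]                      # three chunks
chunk_to_shard = {0: "shard-A", 1: "shard-B", 2: "shard-A"}

def route(shard_key):
    # The router finds the chunk whose range covers the key, then
    # forwards the request to the shard owning that chunk.
    chunk = bisect.bisect_right(bounds, shard_key) - 1
    return chunk_to_shard[chunk]

print(route(42))    # shard-A (chunk 0)
print(route(2500))  # shard-B (chunk 1)
print(route(7000))  # shard-A (chunk 2)
```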

### ScaleBase Sharding

Now that we’ve reviewed sharding in MongoDB, you might be surprised to learn that this is very similar to how MySQL with ScaleBase works. ScaleBase also automates sharding. Like MongoDB, ScaleBase’s data distribution is handled at the database level (and not in the application). Importantly, because ScaleBase maintains a relational data environment, ScaleBase will ensure that data that is accessed together is also stored together.

Like MongoDB, with ScaleBase the application “sees” a single database and not a bunch of shards/partitions. Like MongoDB, ScaleBase also provides agility, flexibility, and growth. Database optimization is dynamic. Rebalancing (adding and removing nodes to accommodate growth and to handle database “hotspots”) is dynamic, and can be executed while the entire system is online. All this without changes to the application design or codebase.

However, there are two ways that ScaleBase sharding differs.

1. Retaining ACID / SQL: With ScaleBase, joins are supported, even in a distributed environment. As MySQL is a relational database, ScaleBase embraces and retains data relations expressed in joins and transactions. With ScaleBase, queries can have local joins (within one shard) as well as cross-database joins. Unlike MongoDB where joins are accomplished in your application, ScaleBase will automatically aggregate data across shards for you. In this way, ScaleBase gives you ACID transactions and 2-phase commits across a sharded cluster of MySQL database nodes. This is unique.
1. MySQL Ecosystem: The second way ScaleBase is a bit different is that it preserves your MySQL skills and ecosystem technologies. It provides this scalability to your existing MySQL applications if they are growing beyond the capabilities of your server (or Amazon RDS instance). Maybe more importantly, ScaleBase gives new MySQL apps a modern, distributed, relational data framework with near-linear scalability.
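To make the first point concrete, here is the kind of ordinary SQL join (table and column names are hypothetical) that ScaleBase can execute whether the matching rows live on one shard or across several – the application simply issues the query and the platform handles the distribution:

```
SELECT o.order_id, o.order_date, c.name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE o.order_date >= '2014-01-01';
```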

Figure 3 ScaleBase Architecture

Looking at a ScaleBase architecture diagram and relating back to MongoDB, the ScaleBase Config Servers here are the equivalent of the MongoDB mongod Config Servers. And ScaleBase is the equivalent of the MongoDB router, the mongos.  The similarities are striking.

### Summary

So far I’ve only focused my comparison on scalability, and we’ve seen that MongoDB and MySQL with ScaleBase are highly comparable. MongoDB is a NoSQL database well known for its scalability. The database world calls it a web scale technology and it’s catching on with developers worldwide. MySQL is the world’s most widely used open source database, with a rich ecosystem of tooling and skilled users. ScaleBase is a modern relational database platform, built using standard MySQL and optimized for the cloud, that takes advantage of your existing MySQL development skills and ecosystem technologies.

Next time I will cover MongoDB and ScaleBase data models and examine how they handle data distribution and redistribution.

Reference: Pinal Dave (http://blog.sqlauthority.com)

# MySQL – How to Detect Current Time Zone Name in MySQL

MySQL developers and DBAs often face a very simple challenge: retrieving the time zone of the session they are connected to. If you search the MySQL documentation and the web, you will find many different ways to do this. In this blog post, I will demonstrate the simple method I use to detect the time zone of the server for the current session.

Run the following script against your MySQL server:

`SELECT @@system_time_zone;`

It will return a result set with the name of your time zone. For example, I am in India and the time zone of my machine is India Standard Time, hence the result will contain the same.
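Among those other methods: MySQL also exposes its configured time zone through the global and session system variables. Note that these often return SYSTEM rather than a name, meaning MySQL is simply following the operating system’s time zone setting:

```
SELECT @@global.time_zone, @@session.time_zone;
```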

Let me know if you use any other method to retrieve the time zone of the server where MySQL is installed.

Reference: Pinal Dave (http://blog.sqlauthority.com)

# SQL SERVER – Get Current TimeZone Name in SQL Server

A very common question we developers often encounter is: what is the current time zone of the server where SQL Server is installed? It is very easy to retrieve. Here is a simple script for the same.

```
DECLARE @TimeZone VARCHAR(50)
EXEC master.dbo.xp_regread
    'HKEY_LOCAL_MACHINE',
    'SYSTEM\CurrentControlSet\Control\TimeZoneInformation',
    'TimeZoneKeyName',
    @TimeZone OUT
SELECT @TimeZone
```

As my server is in India, it will display India Standard Time as the result.

I have been using this script for quite a while, and I have no idea of its origin. Is there any other way to get the time zone in SQL Server?
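One alternative worth mentioning: on SQL Server 2016 and later, the sys.time_zone_info catalog view lists the known Windows time zone names with their current UTC offsets (handy together with AT TIME ZONE), and on Azure SQL the CURRENT_TIMEZONE() function returns the instance’s time zone directly:

```
-- SQL Server 2016+: known time zone names and their current UTC offsets
SELECT name, current_utc_offset
FROM sys.time_zone_info;

-- Azure SQL only:
-- SELECT CURRENT_TIMEZONE();
```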

Reference: Pinal Dave (http://blog.sqlauthority.com)