MySQL – How to Create a Distributed Relational SQL Database

Distributed relational databases are a perfect match for Cloud computing models and distributed Cloud infrastructure.  As such, they are the way forward for delivering new web scale applications.

But how is the data distributed in a distributed relational database? What is the best way to distribute data for my applications? How do I retune my distributed database for optimal performance as applications evolve and usage patterns change? You do all of this with your data distribution policy.

In this blog I’d like to explore different aspects of a data distribution policy. I want you to come away with a practical understanding you can use as you explore your distributed relational database options.

So, let’s dive in.

Data Distribution Policy: What It Is and Why You Should Care

A data distribution policy describes the rules under which data is distributed.  A policy that matches your application’s workflow and usage patterns will give you critical web scale benefits:

  • endless scalability
  • high-availability
  • geo-location of data nearest user populations
  • multi-tenancy
  • archiving
  • data tiering

A poorly conceived data distribution policy will degrade performance, use more system resources and cause you problems.

In The Beginning, there was Sharding, and it wasn’t so Good

In the past, to distribute data across an “array” of linked databases, developers needed to program data distribution logic into their actual applications. The effect was to “shard” a database into slices of data. Quite literally every read or write would need to run through new custom-built application code to know where bits of data should be placed, or could be found.  This is what Facebook, Twitter and many others did as, at the time, there was no better alternative.

This extra sharding code required application developers to take on tasks typically handled by a database. A do-it-yourself approach may seem like a fun challenge (“hey, after all, how hard can this really be?”). But with your database divided this way, you have the following issues to contend with:

  1. Operational issues become much more difficult, for example: backing up, adding indexes, changing schema.
  2. You also need to start checking your query results to verify that each query path actually yields accurate results.

A lot has been written about the challenges of sharding a relational database (here’s a good whitepaper you can read: Top 10 DIY MySQL Sharding Challenges), so I won’t go into them here.  But, let’s also recognize that some great work has been accomplished by dedicated developers using sharding techniques. They have proven the inherent value of a distributed database to achieve massive scale.  At the time, they had to shard as they had no alternative.

Today, there is a better way.

What is a Good Data Distribution Policy?

As I briefly mentioned, a data distribution policy describes the rules under which data is distributed across a set of smaller databases that, taken together and acting as one, comprise the entire distributed database.

The goal we are aiming for is an even and predictable distribution of workloads across the array of clusters in our distributed database.  This brings us immense scalability and availability benefits to handle more concurrent users, higher transaction throughput and bigger volumes of data. But these benefits are all lost with a poorly conceived data distribution policy that does not align to your application’s unique usage and workloads. Let’s take a look.

Imagine we have a single database that is starting to exhibit signs of reaching its capacity limits.  Throughput is becoming unpredictable.  Users are getting frustrated waiting.

We decide the best way to improve the situation is to evolve to a distributed database. Our distributed database would aim to evenly divide the total workload across an array of databases.  In this way, data distribution decreases the number of queries that any individual database cluster (or shard) receives.

Figure 1. A good data distribution policy: ensures that a specific transaction or query is complete within a specific database.

The critical point here is that we want to distribute the data in such a way that we minimize the cross-database chatter (from cluster to cluster, or shard to shard), so that each transaction can be completed within a single cluster and in a single fetch/trip.

If we distribute data without respecting how the data is actually used, we can make matters worse.

Figure 2. A bad data distribution policy: requires transactions or queries to access or collect data from multiple databases.

In the two images above, you can see that one case depicts 1,000,000 transactions equally spread across available resources.  And the other case shows a bad distribution policy where each query needs to collect information from every cluster (or shard) – thus in every practical sense we are actually increasing the overall workload.
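
To make the difference concrete, here is a minimal sketch in plain SQL. The customers and orders tables, their columns, and the choice of customer_id as the distribution key are all assumptions made for illustration; they are not taken from any particular product.

```sql
-- Hypothetical schema: both tables are assumed to be distributed by customer_id,
-- so a customer's row and all of that customer's orders live in the same database
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
name VARCHAR(100));
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
total DECIMAL(10,2));

-- Good fit: the distribution key is in the WHERE clause,
-- so a single database can answer the whole query
SELECT c.name, o.order_id, o.total
FROM customers c
INNER JOIN orders o ON o.customer_id = c.customer_id
WHERE c.customer_id = 42;

-- Poor fit: no distribution key in the query,
-- so every database must be asked and the partial results merged
SELECT COUNT(*) AS big_orders
FROM orders
WHERE total > 100;
```

If most of an application's queries look like the second one, adding databases adds work instead of dividing it, which is the situation Figure 2 describes.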

Data Distribution Policy

| Bad Data Distribution Policy | Good Data Distribution Policy |
| --- | --- |
| The load isn’t distributed; it’s multiplied! | Distributes the workload evenly across available resources |
| Doesn’t scale | Distributes the sessions |
| Adding an additional DB does NOT reduce the overall workload | Delivers linear scalability |
| The limitation of a single DB becomes the limitation of the entire array | Adding another database increases the overall scale potential of the distributed database |
| When queries need data from multiple DBs, transactions must commit on multiple separate DBs (2PC) before completing. This adds a lot of overhead to each commit. | Queries complete using data from a single, smaller database. This removes a lot of overhead from any commits. |

Table 1. A comparison of a good and bad data distribution policy

So, we can see that unless we distribute the data intelligently, we will not achieve any benefit. Actually, we can see things can become worse than before.

The natural question we are led to ask is: “OK, so what is the best way to distribute data for my applications and my workloads?”

Good question!

How to Create the Best Data Distribution Policy for Your Application

Distributing data across a cluster of smaller database instances while maintaining full relational database integrity, two-phase commit and rollback (as well as leveraging SQL!) is today’s state of the art for distributed relational databases.

We can define two broad types of data distribution policy:

  1. Arbitrary Distribution: This is when data is distributed across database instances, but without any consideration or understanding for specific application requirements and how the data will be used by users or the application;
  2. Declarative, Policy-Based Distribution: This is when data is distributed across database instances, but in a way that specifically understands all application requirements, data relationships, transactions, and how the data is used in reads and writes by the application.

Data Distribution Policy

Arbitrary Data Distribution Policy

Pros:
  • Unsophisticated
  • Predetermined (no forethought required)

Cons:
  • No intelligence about business, schema, use cases
  • Leads to excessive use of database nodes
  • Leads to excessive use of network

Declarative Data Distribution Policy

Pros:
  • Ensures that a specific transaction finds all the data it needs in one specific database
  • Aligns with schema and DB structure
  • Highly efficient and scalable
  • Anticipates future requirements and growth assumptions

Cons:
  • Requires forethought and analysis
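
The contrast between the two policy types can be sketched with a little arithmetic. The query below assumes a hypothetical array of four databases and places each row by a simple modulo on a key. This is only an illustration of the idea, not the placement logic of any specific product. The order and customer numbers are made up.

```sql
-- Four sample orders: three belong to customer 7, one to customer 9
SELECT o.order_id,
o.customer_id,
o.order_id % 4 AS shard_by_order_id,       -- arbitrary: ignores how the data is used
o.customer_id % 4 AS shard_by_customer_id  -- policy-based: keeps a customer's orders together
FROM (
SELECT 101 AS order_id, 7 AS customer_id
UNION ALL
SELECT 102, 7
UNION ALL
SELECT 103, 7
UNION ALL
SELECT 104, 9
) AS o;
```

Under the arbitrary placement, the three orders of customer 7 land on databases 1, 2 and 3, so reading that customer's orders touches three databases. Under the placement by customer_id, all three land on database 3 and the same read completes in one place.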

Arbitrary data distribution is often used by NoSQL database technologies.  In fact, breaking the monolithic single-instance database into a distributed database has been the core of the NoSQL revolution so that NoSQL databases can tap into the scalability benefits of distributed database architecture. However, to get scalability, NoSQL databases have been willing to abandon the relational model. NoSQL and document store type databases can rely on arbitrary data distribution because their data model does not provide for joins. Meanwhile, customers have needed something to handle their massive web scale database loads, so they’ve been willing to try new technologies, like MongoDB, with new non-relational approaches. And in some application scenarios, losing the relational data model has been an OK trade-off. Having a choice is good.

However, nowadays you can get massive web scale and keep the time-tested relational database model, if you use a declarative, policy-based data distribution approach.

Academia has written about various types of distributed relational databases for decades. But today they are a reality. Declarative, policy-based data distribution is the way forward.

The good news is that today tools can identify the best declarative, policy-based data distribution approach for you!

If you use MySQL, you can take what you know now and check out ScaleBase’s free online Analysis Genie service for MySQL. It guides you through very simple steps to create the best data distribution policy matched to your unique application requirements and data.

If you’re just naturally curious about how to evolve your relational database into a modern distributed relational database, let’s dive into the details by looking at two very typical database and development scenarios:

  1. Scaling an existing application
  2. Designing scalability in a brand new application

In tomorrow’s blog post we will discuss Scaling Existing Applications: Key Observations and Measurements.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

SQL SERVER – How to Catch Errors While Inserting Values in Table

Question: “I often get errors when I insert values into a table. I want to catch them gracefully. How do I do that?”

Answer: Very simple. Just use TRY…CATCH. I blogged a simple example of TRY…CATCH earlier, when the feature was introduced.

Here is an example I have built on that earlier blog post, where the user can catch the error details while inserting values into a table.

First, we will create a sample table.

CREATE TABLE SampleTable (ID INT IDENTITY(1,1), Col VARCHAR(10))
GO

Now we will attempt to insert values into this table in a way that throws an error, and we will catch that error.

BEGIN TRY
INSERT INTO SampleTable (Col)
SELECT 'FourthRow'
UNION ALL
SELECT 'FifthRow---------'
END TRY
BEGIN CATCH
SELECT
ERROR_NUMBER() AS ErrorNumber
,ERROR_MESSAGE() AS ErrorMessage;
END CATCH
GO

The second row of the above INSERT will throw an error because its value is longer than the column into which we are inserting it (Col is VARCHAR(10)). The error is caught by TRY…CATCH and its details are displayed by the SELECT statement in the CATCH block. Here is the result set.
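
If you would rather keep the error details than only display them, one possible variation is to write them to a log table inside the CATCH block. This is a minimal sketch: the InsertErrorLog table and its columns are hypothetical names chosen for this example, while ERROR_NUMBER, ERROR_SEVERITY, ERROR_STATE, ERROR_LINE and ERROR_MESSAGE are the built-in error functions.

```sql
-- Create the sample table only if the earlier script has not been run
IF OBJECT_ID('SampleTable') IS NULL
CREATE TABLE SampleTable (ID INT IDENTITY(1,1), Col VARCHAR(10))
GO
-- Hypothetical log table; the name and columns are assumptions for illustration
CREATE TABLE InsertErrorLog (
LogID INT IDENTITY(1,1),
ErrorNumber INT,
ErrorSeverity INT,
ErrorState INT,
ErrorLine INT,
ErrorMessage NVARCHAR(4000),
LoggedAt DATETIME DEFAULT GETDATE())
GO
BEGIN TRY
-- This value is longer than VARCHAR(10), so the INSERT fails
INSERT INTO SampleTable (Col)
SELECT 'SixthRow---------'
END TRY
BEGIN CATCH
-- The error functions return values only inside the CATCH block
INSERT INTO InsertErrorLog (ErrorNumber, ErrorSeverity, ErrorState, ErrorLine, ErrorMessage)
SELECT ERROR_NUMBER(), ERROR_SEVERITY(), ERROR_STATE(), ERROR_LINE(), ERROR_MESSAGE();
END CATCH
GO
-- Review what was logged
SELECT *
FROM InsertErrorLog
GO
```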

SQL SERVER – How to Find Running Total in SQL Server

Finding a running total is one of the most popular requests a user encounters in the industry. There are two different ways to find a running total. The first method works on SQL Server 2008 R2 and earlier versions. It is a very expensive way of finding a running total, and I always hated this solution when I had to implement it. However, I have been extremely delighted since SQL Server 2012, as it added support for ORDER BY and ROWS framing in the OVER clause of aggregate functions. The new method is much more efficient and cleaner to implement.

Let us first create a sample table and populate the same.

USE tempdb
GO
CREATE TABLE TestTable (ID INT, Value INT)
INSERT INTO TestTable (ID, Value)
SELECT 1, 10
UNION ALL
SELECT 2, 20
UNION ALL
SELECT 3, 30
UNION ALL
SELECT 4, 40
UNION ALL
SELECT 5, 50
UNION ALL
SELECT 6, 60
UNION ALL
SELECT 7, 70
GO
-- selecting table
SELECT ID, Value
FROM TestTable
GO

Here is the screenshot of the resultset.

Here is the query which you can execute on SQL Server 2008 R2 or earlier versions. The query is very expensive because the correlated subquery sums all the preceding rows again for every row of the table.

-- Running Total for SQL Server 2008 R2 and Earlier Version
SELECT ID, Value,
(
SELECT SUM(Value)
FROM TestTable T2
WHERE T2.ID <= T1.ID) AS RunningTotal
FROM TestTable T1
GO

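Here is the query which you can execute on SQL Server 2012 or later versions. The query is very efficient because the window function accumulates the total as it moves through the ordered rows instead of re-reading the table for each row.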

-- Running Total for SQL Server 2012 and Later Version
SELECT ID, Value,
SUM(Value) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM TestTable
GO

Both of the above queries return the following results.
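
One detail worth knowing about the SQL Server 2012 query is why it spells out ROWS UNBOUNDED PRECEDING. When ORDER BY is used in the OVER clause without a frame, the default frame is RANGE, which treats rows with the same ORDER BY value as one group. The sketch below uses a throwaway table variable with a duplicate ID (an assumption added for this illustration, as TestTable has no duplicates) to show the difference.

```sql
-- Sample data with a duplicate ID (2) to make the difference visible
DECLARE @T TABLE (ID INT, Value INT)
INSERT INTO @T (ID, Value)
SELECT 1, 10
UNION ALL
SELECT 2, 20
UNION ALL
SELECT 2, 30
UNION ALL
SELECT 3, 40
SELECT ID, Value,
-- ROWS frame: every row gets its own running total
SUM(Value) OVER(ORDER BY ID ROWS UNBOUNDED PRECEDING) AS RowsTotal,
-- Default frame (RANGE): rows with the same ID share one total
SUM(Value) OVER(ORDER BY ID) AS RangeTotal
FROM @T
GO
```

With this data, RangeTotal shows 60 for both rows with ID 2, while RowsTotal shows a different value for each of them. On TestTable, where every ID is unique, the two forms return the same totals.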

If there is any other better option, please share it here.

SQL SERVER – GROUP BY Columns with XMLPATH – Comma Delimit Multiple Rows

This is one of the most popular questions, and I keep getting it again and again in email, on Facebook and on other social media. I have decided to write about it here on the blog so that in the future I can simply refer to this post.

Here is the question. There is a table with the names of the students and their ClassID. We have to create another table with a different representation of the ClassID and the student names. In simple words, we have to group by ClassID and concatenate the student names. Here is an image representation of the same.

Here is the script for the original table, which generates the table displayed on the left side of the image.

USE tempdb
GO
CREATE TABLE StudentEnrolled (ClassID INT, FirstName VARCHAR(20), LastName VARCHAR(20))
GO
INSERT INTO StudentEnrolled (ClassID, FirstName, LastName)
SELECT 1, 'Thomas', 'Callan'
UNION ALL
SELECT 1, 'Henry', 'Quinto'
UNION ALL
SELECT 2, 'Greg', 'McCarthy'
UNION ALL
SELECT 2, 'Brad', 'Grey'
UNION ALL
SELECT 2, 'Loren', 'Oliver'
UNION ALL
SELECT 3, 'Elliot', 'Kirkland'
GO
--
SELECT *
FROM StudentEnrolled
GO

Now we can use FOR XML PATH to concatenate the first name and last name of each student and, along with that, group them by ClassID using the following script. This is just an example, but you can use this script for many other purposes in the future.

SELECT
[ClassID],
STUFF((
SELECT ', ' + [FirstName] + ' ' + [LastName]
FROM StudentEnrolled
WHERE (ClassID = SE.ClassID)
FOR XML PATH(''),TYPE).value('(./text())[1]','VARCHAR(MAX)'),1,2,'') AS FullName
FROM StudentEnrolled SE
GROUP BY ClassID
GO

Let me know if there is any better way to do the same.
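
One alternative worth mentioning, if you are on SQL Server 2017 or later, is the built-in STRING_AGG function, which did not exist when this script was written. The sketch below assumes the StudentEnrolled table from the script above is still in place.

```sql
-- Requires SQL Server 2017 or later, where STRING_AGG is available
SELECT
[ClassID],
STRING_AGG([FirstName] + ' ' + [LastName], ', ') AS FullName
FROM StudentEnrolled
GROUP BY ClassID
GO
```

The order of names inside each group is not guaranteed here. If it matters, STRING_AGG accepts a WITHIN GROUP (ORDER BY ...) clause.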

SQL SERVER – Building Technical Reference Library – Notes from the Field #048

[Note from Pinal]: This is the 48th episode of the Notes from the Field series. How do you build a technical reference library? In other words, how do you create your own reference so that, when you need help, you do not have to go out and look for it again? There are so many little tips and tricks one should know, and Brian Kelley has amazing skills to explain this simple concept in easy words.

In this episode of the Notes from the Field series, database expert Brian Kelley explains how to build a technical reference library. Read the experience of Brian in his own words.


Do you have a technical reference library? If you’re not sure what I mean, a technical reference library is your collection of notes, code, configuration options, bugs you’ve hit that you think you’ll hit again, and anything else that you might need to retrieve again in the future related to what you do in IT. If you have a technical reference library (hereafter referred to as TRL), is it:

  • outside of email?
  • distributed across multiple locations/computers?
  • searchable?
  • fast?

With my TRL, I’m more efficient because I’m not searching the Internet again and again for the same information. I also can ensure I handle strange cases, such as unusual configurations, which we seem to get a lot of in IT. It’s in my TRL, so I don’t have to go back through a vendor’s install document or go run someone down in the organization to get the information I need. I already have it if I put it in my TRL. Because of the efficiency that TRLs provide, most top-performing IT professionals that I know have some sort of system.

Outside of Email:

I used to have a folder in email where I kept technical reference documents. Because I try to follow Inbox Zero, I do have a Reference folder, but it’s not for technical documents. My Reference folder is typically related to what that mailbox is for. For instance, my LP Reference folder is for keeping procedures related to Linchpin such as how/where to enter time, who to contact about various things, etc.

Why don’t I have my technical documents in email any longer? Let me ask a question in response to that question: What happens when email is down? When email is down, you have no access to your TRL. Email does go down. I was faced with a case where I was responsible for getting email back up and, you guessed it, my technical notes were in email. That doesn’t work.

A second question to ask: How searchable is your TRL if it’s in email?  If you keep a lot of email, especially if you don’t have a specific folder for your TRL, searching may prove to be painful. That was the other problem I started to face.

Given these two issues, I advise building your TRL outside of email.

Distributed:

If your TRL is only on a single computer, you’re going to regret it someday. That day usually occurs when the computer in question crashes and all your notes are lost. Even if you have a backup, anything you put into the library after the backup is gone. Given the prevalence of cloud-based solutions nowadays, having a technical reference library which is distributed is easy. Here are some ideas:

  • Evernote
  • Microsoft OneNote
  • Microsoft SkyDrive
  • DropBox
  • Google Docs
  • Apple iCloud

I’m partial to the first two, Evernote and OneNote, because they aren’t simply “file systems.” They are designed to capture and catalog information for quick retrieval later.

All my examples will come from Evernote, because that’s the application I typically use. In fact, here’s my setup. I have a specific notebook for my TRL:

TRL Notebook

If I know exactly what I’m looking for or if I’ve added it recently, I should be able to find any note quickly in the list of notes for the notebook:

Note: SQL 2012 Slipstream

Searchable (and Fast!):

Even if what I’m looking for isn’t right there at the top of the list, I can search in Evernote (and OneNote, if I was using it) to quickly locate the document. For instance, by typing “Slipstream,” I quickly get to the article that I want:

Search of TRL

Products like Evernote and OneNote have specifically worked on Search in order to retrieve results quickly. They also provide options to search within a notebook, for instance. In my case here, since slipstream is such a specialized term compared with what else is in my Evernote notebooks, I didn’t feel the need to filter by notebook. However, I could have if I received a lot of hits back or if the search was taking too long.

Also note that I’ve not added any tags to this article. I’m hitting it using a text search as to the contents alone. The use of tags offers another option in order to locate the material I need quickly. If I had a lot of articles that came up for a particular search word or phrase, I could look at using tags to differentiate them better. It’s another reason to consider tools designed to hold information and make it quickly retrievable.

Build a System That Works for You:

Learning expert Cynthia Tobias was once helping a teacher who asked her students to keep a reference notebook for assignments and handouts in class, an academic version of the TRL I’ve described thus far. The teacher balked at one student’s notebook because it was messy. The teacher couldn’t imagine how the student could locate anything in the notebook and was going to give the student a poor score. Tobias asked the teacher, “What’s the point?” The point, the teacher indicated, was to be able to retrieve an assignment or handout quickly. Tobias challenged the teacher to check to see if the student could retrieve quickly (within a minute, for instance). If the student could, the teacher should leave the student alone. If the student couldn’t, then work with the student to improve the reference system.

That’s what you want to do. You want to develop a reference system that’s efficient for you. I’ve given you a snapshot of what works for me. It may not work for you. That’s okay. Start with something. If you’re starting from scratch, I would recommend starting with Evernote or OneNote. Put some notes in that you’ll need again. See how well you can retrieve those notes, especially as the number of notes increases. Make tweaks as you have to for performance sake. Most of all, build your TRL and become a better professional.

If you want to get started with performance tuning and database security with the help of experts, read more over at Fix Your SQL Server.

SQL SERVER – A Practical Use of Backup Encryption

Backup is extremely important for any DBA. Think of any disaster, and a backup will come to the rescue in that adverse situation. Likewise, it is critical that we keep our backups safe. If your backup falls into the wrong hands, it is quite possible that it will be misused and become a serious data security issue. In this blog post we will look at a practical scenario where we use Backup Encryption to improve the security of the backup.

Feature description

Database Backup Encryption is a brand new and long-awaited feature that is now available in SQL Server 2014. You can create an encrypted backup file by specifying the encryption algorithm and the encryptor (either a Certificate or an Asymmetric Key).

The ability to protect a backup file with a password has existed for many years. If you have used SQL Server for a long time, you might remember the WITH PASSWORD option of the BACKUP command. The option was meant to prevent unauthorized access to the backup file.

However, this approach did not provide reliable protection. In that regard, Greg Robidoux noted on MSSQLTIPS: “Although this does add a level of security if someone really wants to crack the passwords they will find a way, so look for additional ways to secure your data.”

To protect a backup file, SQL Server 2008 introduced the transparent data encryption (TDE) feature. With it, a database had to be transparently encrypted before backup. Accordingly, starting with SQL Server 2012 the PASSWORD and MEDIAPASSWORD parameters are no longer used when creating backups. Even so, data encryption and backup file encryption are two different scenarios.

If a database is stored locally, there is no need to encrypt it before backup. Fortunately, in SQL Server 2014 these are two independent processes. Along with data encryption, it is possible to encrypt a backup file based on a certificate or an asymmetric key. Supported encryption algorithms are:

  • AES 128
  • AES 192
  • AES 256
  • Triple DES

Practical use

To illustrate the above, I will create an encrypted backup of the AdventureWorks database. You can also back up directly to Azure and, if needed, restore the encrypted backup file on Azure.

I will use dbForge Studio for SQL Server to create the encrypted backup file.

To protect the backup file we need to create an encryptor: either a Certificate or Asymmetric Key. Then, we need to pass this encryptor to the target SQL Server to restore the backup. For this, the encryptor must be exported from the source SQL Server and imported to the target SQL Server. There are no problems with the certificate in this regard. It is more complicated with asymmetric keys.

Taking into account that the BACKUP ASYMMETRIC KEY command is not available, and that we cannot simply create a duplicate of an asymmetric key (unlike a symmetric key), the only approach is to create the asymmetric key outside SQL Server, for example with the sn.exe utility, and then transfer it into SQL Server (CREATE ASYMMETRIC KEY keyname FROM FILE = 'filename.snk'). After that we can use this asymmetric key to encrypt the backup file on the source instance. Later we need to use the same *.snk file to create the asymmetric key on the target instance (and restore the backup file).

In our example we will not use asymmetric keys. We will use a certificate. After all, a certificate is (behind the scenes) a pair of public/private keys.

Let’s create the server certificate and use it to encrypt the backup file.

The certificate will be protected with the database master key, because we didn’t specify the ENCRYPTION BY statement.

This is exactly what we need. Only certificates protected by the database master key can be used for encryption purposes. Otherwise, if we, for instance, protect the certificate with a password (ENCRYPTION BY PASSWORD = 'strongpassword'), the following error appears while attempting to encrypt the backup file:

“Cannot use certificate ‘CertName’, because its private key is not present or it is not protected by the database master key.”
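
For readers who prefer a script, the certificate creation and encrypted backup described above can be sketched in T-SQL as follows. The certificate name BackupCert, the password placeholder and the file path are assumptions made for this example. The statements themselves are standard SQL Server 2014 syntax.

```sql
USE master
GO
-- A database master key in master protects the private key of the certificate
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>'
GO
-- No ENCRYPTION BY PASSWORD clause here, so the private key
-- is protected by the database master key, as the backup requires
CREATE CERTIFICATE BackupCert
WITH SUBJECT = 'Backup encryption certificate'
GO
-- Encrypt the backup with AES 256 and the certificate created above
BACKUP DATABASE AdventureWorks
TO DISK = 'C:\Backup\AdventureWorks_encrypted.bak'
WITH COMPRESSION,
ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert)
GO
```

If the master key already exists in the master database, skip the CREATE MASTER KEY step.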

Encrypted backups (like regular backups) can be created locally on the hard drive or in Azure Storage.

Instead of writing tons of SQL code, I will use the convenient Back Up wizard in dbForge Studio for SQL Server. The wizard lets you create the database backup in several clicks.

Step 1: Set up the DB connection and the backup file location.

Step 2: Set up the media set.

Step 3: Select the encryption algorithm and certificate.

If you don’t want to spend extra effort transferring the backup file to Windows Azure, you can back up directly to Azure.
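
A backup straight to Azure Storage could look roughly like the sketch below. The storage account name mystorageaccount, the container backups, the credential name AzureBackupCredential and the certificate name BackupCert are all placeholders assumed for this example. The syntax is the SQL Server 2014 form, where the credential holds the storage account name and its access key.

```sql
-- The credential stores the storage account name and access key (placeholders)
CREATE CREDENTIAL AzureBackupCredential
WITH IDENTITY = 'mystorageaccount',
SECRET = '<storage account access key>'
GO
-- Write the encrypted backup directly to a blob in the container
BACKUP DATABASE AdventureWorks
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/AdventureWorks_encrypted.bak'
WITH CREDENTIAL = 'AzureBackupCredential',
COMPRESSION,
ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupCert)
GO
```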

After the script is executed, the blob (with the backup) appears in the required container.

If you have already created a backup with the same name in the same container, you may get the following error: “There is currently a lease on the blob and no lease ID was specified in the request.”

Further, you can restore the backup file on Windows Azure.
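
Wherever the restore happens, the target instance needs the same certificate before it can read the encrypted backup. Here is a minimal sketch of exporting the certificate from the source instance, importing it on the target and restoring. The certificate name BackupCert, the file paths and the password placeholders are assumptions for this example.

```sql
-- On the source instance: export the certificate together with its private key
USE master
GO
BACKUP CERTIFICATE BackupCert
TO FILE = 'C:\Backup\BackupCert.cer'
WITH PRIVATE KEY (
FILE = 'C:\Backup\BackupCert.pvk',
ENCRYPTION BY PASSWORD = '<private key password>')
GO
-- On the target instance: a master key must exist before the certificate is imported
USE master
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<another strong password>'
GO
CREATE CERTIFICATE BackupCert
FROM FILE = 'C:\Backup\BackupCert.cer'
WITH PRIVATE KEY (
FILE = 'C:\Backup\BackupCert.pvk',
DECRYPTION BY PASSWORD = '<private key password>')
GO
-- RESTORE needs no encryption options; it finds the matching certificate itself
RESTORE DATABASE AdventureWorks
FROM DISK = 'C:\Backup\AdventureWorks_encrypted.bak'
WITH RECOVERY
GO
```

The same idea applies when the backup sits in Azure Storage: use FROM URL with a CREDENTIAL instead of FROM DISK.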

Summary: 

Obviously, it is good practice to encrypt a backup file before transferring it. This, for instance, helps avoid data leaks while transferring backups from one data processing center to another.

SQL SERVER – UPDATE From SELECT Statement with Condition

An email from an old college friend landed in my mailbox:

“Hey Pinal,

I have two tables. I want to conditionally update data in one table based on another table. How can I do that? I have included sample scripts and an image for further explanation.

Thanks!”

It is always a delight to receive email from an old college friend, and it is even more interesting when they have a question where I can help. Here is the question and a sample script.

The user had two tables: ItemList and ItemPrice. The requirement was to update the Price column of the ItemPrice table with the US price for items whose Country is USA, which required dividing the column by 60. Here is the sample script for the tables displayed in the image.

USE tempdb;
GO
CREATE TABLE ItemList
(ID INT, ItemDesc VARCHAR(100), Country VARCHAR(100));
INSERT INTO ItemList (ID, ItemDesc, Country)
SELECT 1, 'Car', 'USA'
UNION ALL
SELECT 2, 'Phone', 'India'
UNION ALL
SELECT 3, 'Computer', 'USA';
GO
CREATE TABLE ItemPrice
(ID INT, Price VARCHAR(100));
INSERT INTO ItemPrice (ID, Price)
SELECT 1, 5000
UNION ALL
SELECT 2, 10000
UNION ALL
SELECT 3, 20000;
GO
-- SELECT Data
SELECT *
FROM ItemList;
SELECT *
FROM ItemPrice;

Now let us write a script which will update the table as per our expectation.

-- Update Statement
UPDATE ip
SET ip.Price = ip.Price/60
FROM ItemList il
INNER JOIN ItemPrice ip ON il.ID = ip.ID
WHERE il.Country = 'USA'
GO

Now let us check the result by selecting the data in our ItemPrice table.
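
A simple way to check is to join the two tables so that the country and the updated price sit side by side. This sketch assumes the sample tables above are still in place and the update has been run exactly once.

```sql
-- Show each item with its country and its (possibly updated) price
SELECT il.ID, il.ItemDesc, il.Country, ip.Price
FROM ItemList il
INNER JOIN ItemPrice ip ON il.ID = ip.ID
ORDER BY il.ID
GO
```

With the sample data, the two USA rows (ID 1 and 3) should now show 83 and 333, while the India row (ID 2) keeps 10000. The fractions are lost because Price is a VARCHAR column that is implicitly converted to an integer for the division by 60.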

Now you can see how we can update one table from another table with conditions. You can clean up after the above code by dropping the tables.

-- Clean up
DROP TABLE ItemPrice;
DROP TABLE ItemList;
GO

I hope this quick script helps, let me know if there is any better alternative.
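
One possible alternative, offered here only as a sketch, is to express the condition with EXISTS instead of a join in the FROM clause. It is meant to be run in place of the UPDATE statement above (not in addition to it) and before the clean-up script drops the tables.

```sql
-- Alternative to UPDATE ... FROM: update only rows that have a matching USA item
UPDATE ItemPrice
SET Price = Price/60
WHERE EXISTS (
SELECT 1
FROM ItemList il
WHERE il.ID = ItemPrice.ID
AND il.Country = 'USA')
GO
```

This form avoids the T-SQL specific FROM clause on UPDATE, which can make the statement easier to port to other database systems.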
