MySQL – How to Create a Distributed Relational SQL Database

Distributed relational databases are a perfect match for Cloud computing models and distributed Cloud infrastructure.  As such, they are the way forward for delivering new web scale applications.

But how is the data distributed in a distributed relational database?  What is the best way to distribute data for my applications?  How to I retune my distributed database for optimal performance as applications evolve and usage patterns change?  You do all of this with your data distribution policy.

In this blog I’d like to explore different aspects of a data distribution policy. I want you to come away with a practical understanding you can use as you explore your distributed relational database options.

So, let’s dive in.

Data Distribution Policy: What It Is and Why You Should You Care

A data distribution policy describes the rules under which data is distributed.  A policy that matches your application’s workflow and usage patterns will give you critical web scale benefits:

  • endless scalability
  • high-availability
  • geo-location of data nearest user populations
  • multi-tenancy
  • archiving
  • datatiering

A poorly conceived data distribution policy will degrade performance, use more system resources and cause you problems.

In The Beginning, there was Sharding, and it wasn’t so Good

In the past, to distribute data across an “array” of linked databases, developers needed to program data distribution logic into their actual applications. The effect was to “shard” a database into slices of data. Quite literally every read or write would need to run through new custom-built application code to know where bits of data should be placed, or could be found.  This is what Facebook, Twitter and many others did as, at the time, there was no better alternative.

This extra sharding code required application developers to take on tasks typically handled by a database.  A do-it-yourself approach may seem like a fun challenge (“hey, after all, how hard can this really be??”).  But with your database divided this way, you face the following issues to contend with:

  1. Operational issues become much more difficult, for example: backing up, adding indexes, changing schema.
  2. You also need to start checking your queries results to test that each query path is actually yielding accurate results.

A lot has been written about the challenges of sharding a relational database (here’s a good whitepaper you can read: Top 10 DIY MySQL Sharding Challenges), so I won’t go into them here.  But, let’s also recognize that some great work has been accomplished by dedicated developers using sharding techniques. They have proven the inherent value of a distributed database to achieve massive scale.  At the time, they had to shard as they had no alternative.

Today, there is a better way.

What is a Good Data Distribution Policy?

As I briefly mentioned, a data distribution policy describes the rules under which data is distributed across a set of smaller databases that, taken together and acting as one, comprise the entire distributed database.

The goal we are aiming for is an even and predictable distribution of workloads across the array of clusters in our distributed database.  This brings us immense scalability and availability benefits to handle more concurrent users, higher transaction throughput and bigger volumes of data. But these benefits are all lost with a poorly conceived data distribution policy that does not align to your application’s unique usage and workloads. Let’s take a look.

Imagine we have a single database that is starting to exhibit signs of reaching its capacity limits.  Throughput is becoming unpredictable.  Users are getting frustrated waiting.

We decide the best way to improve the situation is to evolve to a distributed database. Our distributed database would aim to evenly divide the total workload across an array of databases.  In this way, data distribution decreases the number of queries that any individual database cluster (or shard) receives.

Figure 1. A good data distribution policy: ensures that a specific transaction or query is complete within a specific database.

The critical point here is that we want to distribute the data in such a way that we minimize the cross-database chatter (from cluster to cluster, or shard to shard), so that each transaction can be completed within a single cluster and in a single fetch/trip.

If we distribute data without respecting how the data is actually used, we can make matters worse.

Figure 2. A bad data distribution policy: requires transactions or queries to access or collect data from multiple databases.

In the two images above, you can see that one case depicts 1,000,000 transactions equally spread across available resources.  And the other case shows a bad distribution policy where each query needs to collect information from every cluster (or shard) – thus in every practical sense we are actually increasing the overall workload.

Data Distribution Policy
Bad Data Distribution Policy Good Data Distribution Policy
The load isn’t distributed – it’s multiplied! Distributes the workload evenly across available resources
Doesn’t scale Distributes the sessions
Adding an additional DB does NOT reduce the overall workload Delivers linear scalability
The limitation of a single DB becomes the limitation of the entire array Adding another database, increases the overall scale potential of the distributed database
When queries need data from multiple DBs, transactions must commit multiple separate DBs (2PC) before completing. This adds a lot of overhead to each Commit. Queries complete using data from a single, smaller database. This reduces a lot of overhead to any Commits.

Table 1. A comparison of a good and bad data distribution policy

So, we can see that unless we distribute the data intelligently, we will not achieve any benefit. Actually, we can see things can become worse than before.

The natural question we are lead to ask is: “OK, So what is the best way to distribute data for my applications and my workloads?

Good question!

How Create the Best Data Distribution Policy for Your Application

Distributing data across a cluster of smaller database instances and maintaining full relational database integrity, two-phase commit and rollback, (as well as leveraging SQL!) is today’s state of the art  for distributed relational databases.

We can define two broad types of data distribution policy:

  1. Arbitrary Distribution: This is when data is distributed across database instances, but without any consideration or understanding for specific application requirements and how the data will be used by users or the application;
  2. Declarative, Policy-Based Distribution: This is when data is distributed across database instances, but in a way that specifically understands all application requirements, data relationships, transactions, and how the data is used in reads and writes by the application.
Data Distribution Policy
Arbitrary Data Distribution Policy Declarative Data Distribution Policy
Pros - Pros -
Unsophisticated  Ensures that a specific transaction finds all the data it needs in one specific database
 Predetermined (no forethought required)  Aligns with schema and DB structure
Cons - Highly efficient and scalable
 No intelligence about business, schema, use cases  Anticipates future requirements and growth assumptions
 Leads to excessive use of database nodes Cons -
Leads to excessive use of network  Requires forethought and analysis

Arbitrary data distribution is often used by NoSQL database technologies.  In fact, breaking the monolithic single-instance database into a distributed database has been the core of the NoSQL revolution so that NoSQL databases can tap into the scalability benefits of distributed database architecture. However, to get scalability, NoSQL databases have been willing to abandon the relational model. NoSQL and document store type databases can rely on arbitrary data distribution because their data model does not provide for joins. Meanwhile, customers have needed something to handle their massive web scale database loads, so they’ve been willing to try new technologies, like MongoDB, with new non-relational approaches. And in some application scenarios, losing the relational data model has been an OK trade-off. Having a choice is good.

However, nowadays you can get massive web scale and keep the time-tested relational database model, if you use a declarative, policy-based data distribution approach.

Academia has written about various types of distributed relational databases for decades. But today they are a reality. Declarative, policy-based data distribution is the way forward.

The good news is that today tools can identify the best declarative, policy-based data distribution approach for you!

If you use MySQL, you can take what you know now and check out ScaleBase’s free online Analysis Genie service for MySQL. It guides you through very simple steps to create the best data distribution policy matched to your unique application requirements and data.

If you’re just naturally curious about how to evolve your relational database into a modern distributed relational database, let’s dive into the details by looking at two very typical database and development scenarios:

  1. Scaling an existing application
  2. Designing scalability in a brand new application

In tomorrow’s blog post we will discuss about Scaling Existing Applications: Key Observations and Measurements.

Reference: Pinal Dave (http://blog.SQLAuthority.com)

About these ads

Google Drive Trick – Google Spreadsheet Formatting Dates to String

I have been using google drive and google spreadsheet for a while. I work with experts around the world and we all use different date format. The biggest confusion which often happens when we are sharing data is a timestamp. For some people it is natural to have dates time like dd/mm/yyyy and for some it is natural to have mm/dd/yyyy. Sometimes we are just confused looking at the date as there is no instruction about the format. Finally, we all decided that we will spell out the dates. For example, if we have date like 11/9/2014 to avoid confusion if it is November 9, 2014 or September 11, 2014 we decided to just write it in the string.

Now here is the problem with Google Spreadsheet – when I type any datetime, like as following, it will immediately convert them to mm/dd/yyyy format and defeat our purpose of spelling that out.

Finally, we figured out the solution how we can preserve the datetime as a string in Google Spreadsheet. Just start your datetime with a single quotes. For example now we enter datetime as follows:

‘September 9, 2014 instead of September 9, 2014. The best part is that, the single quotes are not visible on the screen.

Well, if there was any other way to do this and I was not aware of it, please let me know.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Database – Beginning with Cloud Database As A Service

I love my weekend projects. Everybody does different activities in their weekend – like traveling, reading or just nothing. Every weekend I try to do something creative and different in the database world. The goal is I learn something new and if I enjoy my learning experience I share with the world. This weekend, I decided to explore Cloud Database As A Service – Morpheus. In my career I have managed many databases in the cloud and I have good experience in managing them.

I should highlight that today’s applications use multiple databases from SQL for transactions and analytics, NoSQL for documents, In-Memory for caching to Indexing for search.  Provisioning and deploying these databases often require extensive expertise and time.  Often these databases are also not deployed on the same infrastructure and can create unnecessary latency between the application layer and the databases.  Not to mention the different quality of service based on the infrastructure and the service provider where they are deployed.

Moreover, there are additional problems that I have experienced with traditional database setup when hosted in the cloud:

  • Database provisioning & orchestration
  • Slow speed due to hardware issues
  • Poor Monitoring Tools
  • High network latency

Now if you have a great software and expert network engineer, you can continuously work on above problems and overcome them. However, not every organization have the luxury to have top notch experts in the field. Now above issues are related to infrastructure, but there are a few more problems which are related to software/application as well.

Here are the top three things which can be problems if you do not have application expert:

  • Replication and Clustering
  • Simple provisioning of the hard drive space
  • Automatic Sharding

Well, Morpheus looks like a product build by experts who have faced similar situation in the past. The product pretty much addresses all the pain points of developers and database administrators.

What is different about Morpheus is that it offers a variety of databases from MySQL, MongoDB, ElasticSearch to Reddis as a service.  Thus users can pick and chose any combination of these databases.  All of them can be provisioned in a matter of minutes with a simple and intuitive point and click user interface.  The Morpheus cloud is built on Solid State Drives (SSD) and is designed for high-speed database transactions.  In addition it offers a direct link to Amazon Web Services to minimize latency between the application layer and the databases.

Here are the few steps on how one can get started with Morpheus. Follow along with me.  First go to http://www.gomorpheus.com and register for a new and free account.

Step 1: Signup

It is very simple to signup for Morpheus.

Step 2: Select your database

 

I use MySQL for my daily routine, so I have selected MySQL. Upon clicking on the big red button to add Instance, it prompted a dialogue of creating a new instance.

 

Step 3: Create User

Now we just have to create a user in our portal which we will use to connect to a database hosted at Morpheus. Click on your database instance and it will bring you to User Screen. Over here you will notice once again a big red button to create a new user. I created a user with my first name.

 

Step 4: Configure your MySQL client

I used MySQL workbench and connected to MySQL instance, which I had created with an IP address and user.

 

That’s it! You are connecting to MySQL instance. Now you can create your objects just like you would create on your local box. You will have all the features of the Morpheus when you are working with your database.

Dashboard

While working with Morpheus, I was most impressed with its dashboard. In future blog posts, I will write more about this feature.  Also with Morpheus you use the same process for provisioning and connecting with other databases: MongoDB, ElasticSearch and Reddis.

Reference: Pinal Dave (http://blog.sqlauthority.com)

MySQL – Export the Resultset to CSV file

In SQL Server, you can use BCP command to export the result set to a csv file. In MySQL too, You can export data from a table or result set as a csv file in many methods. Here are two methods.

Method 1 : Make use of Work Bench

If you are using Work Bench as a querying tool, you can make use of it’s Export option in the result window. Run the following code in Work Bench

SELECT db_names FROM mysql_testing;

The result will be shown in the result windows. There is an option called “File”. Click on it and it will prompt you a window to save the result set (Screen shot attached to show how file option can be used). Choose the directory and type out the name of the file.

Method 2 : Make use of OUTFILE command

You can do the export using a query with OUTFILE command as shown below

SELECT db_names FROM mysql_testing
INTO OUTFILE 'C:/testing.csv'
FIELDS ENCLOSED BY '"' TERMINATED BY ';' ESCAPED BY '"'
LINES TERMINATED BY '\r\n';

After the execution of the above code, you can find a file named testing.csv in C drive of the server.

Reference: Pinal Dave (http://blog.sqlauthority.com)

MySQL – UDF – Validate Integer Function

There is a default function in SQL Server called ISNUMERIC to determine if the string is numeric. In SQL Server ISNUMERIC returns 1 for non numerials also. Ex ISNUMERIC (‘,’) returns 1. To avoid this you can write a user defined function as shown over here. In MySQL, there is no default function like ISNUMERIC as in SQL Server. You need to create a similar user defined function as shown below.

DELIMITER $$
CREATE FUNCTION 'udf_IsNumeric'(number_in VARCHAR(100)) RETURNS BIT
BEGIN
DECLARE
Ret BIT;
IF number_in NOT regexp '[^0123456789]' THEN
SET
Ret:= 1;
ELSE
SET
Ret:= 0;
END IF;
RETURN Ret;
END

The above function uses Regular expression. The expression regexp ‘[^0123456789]‘ will find out if the string has at least one character which is not a digit. The expression not regexp ‘[^0123456789]‘ will just negate that condition so that you get the string which has all characters as a digit.

You can test this using the following example.

SELECT '999' TestValue,udf_IsNumeric('999') NumericTest;
SELECT 'abc' TestValue,udf_IsNumeric('abc') NumericTest;
SELECT '9+9' TestValue,udf_IsNumeric('9+9') NumericTest;
SELECT '$9.9' TestValue,udf_IsNumeric('$9.9') NumericTest;
SELECT 'SQLAuthority' TestValue,udf_IsNumeric('SQLAuthority') NumericTest;

The result is

TestValue NumericTest
 ——— ———–
 999 1
TestValue NumericTest
 ——— ———–
 abc 0
TestValue NumericTest
 ——— ———–
 9+9 0
TestValue NumericTest
 ——— ———–
 $9.9 0
TestValue NumericTest
 ———— ———–
 SQLAuthority 0

You can find more examples about Regular expressions in MySQL over here.

Reference: Pinal Dave (http://blog.sqlauthority.com)

MySQL – How to Format Date in MySQL with DATE_FORMAT()

MySQL supports formatting the datetime values into a different formats using DATE_FORMAT() function. This function accepts date/datetime values as a first parameter and returns into a specific format defined as a second parameter.

Let us explore this with the following examples which are self-explanatory

Define a DATETIME variable
SET @date:='2014-06-16 14:12:49';

-- Display datetime values in YYYY-mm-dd format
SELECT date_format(@date,'%Y-%m-%d') AS formatted_date;

The result is 2014-06-16

-- Display datetime values as Long Date format
SELECT date_format(@date,'%W, %M %d,%Y') AS formatted_date;

The result is Monday, June 16,2014

-- Display datetime values as Full date format
SELECT date_format(@date,'%W, %M %d,%Y %T') AS formatted_date;

The result is Monday, June 16,2014 14:12:49

-- Display datetime values in HH:MM:SS format
SELECT date_format(@date,'%T') AS formatted_date;

The result is 14:12:49

-- Display datetime values in Month Year format
SELECT date_format(@date,'%M %Y')  AS formatted_date;

The result is June 2014

-- Display datetime values in mm-dd-yyyy format
SELECT date_format(@date,'%m-%d-%Y')  AS formatted_date;

The result is 06-16-2014

-- Display datetime values in dd-mm-yyyy format
SELECT date_format(@date,'%M %Y')  AS formatted_date;

The result is 16-06-2014

Reference: Pinal Dave (http://blog.sqlauthority.com)

MySQL – How INSERT() Function Works for MySQL

You know there is an INSERT statement which is used to add data to a table. But did you know that there is an INSERT function which will replace certain characters within a string?

INSERT() function in MySQL is used to insert a substring at the specified position up to the specified number of characters.

Let us consider this example.

set @str:='This is nothing';

The variable @str has a string value “This is nothing”. Suppose you want to change “nothing” into “everything” in this string. You can use INSERT() function as shown below

select INSERT(@str,9,2,'every');

The INSERT() function replaces the characters starting from the position 9 to next 2 characters by “everything”. So in the string part “nothing”, no is replaced by “every” and the full string becomes “This is everything” .

You can also add a string between two strings as shown below

set @str:='This is nothing';
 select INSERT(@str,9,0,'not ');

The result is “This is not nothing”. When the third parameter is 0, it will not replace any characters, but append it at the starting position defined in the second parameter

Note: This INSERT() function is functionally equivalent to SQL Server’s STUFF() function. You can refer the usage of STUFF() function in this post.

Reference: Pinal Dave (http://blog.sqlauthority.com)