SQL SERVER – Statistical Analysis in SQL Server – A Quick Introduction

February 21, 2014

SQL Server has very few statistical functions. Oracle has lots of them. XLeratorDB closes the gap. If you want to try out the example of this blog post, I suggest you download the 15-day free trial of XLeratorDB NOW.

Most SQL Server users would never think of T-SQL as a platform for doing sophisticated statistical analysis. One reason for that is the relative lack of statistical functions that come with SQL Server, even though Oracle users have lots of functions available to them. The developers at Westclintech have done an admirable job of filling in that gap by providing over 300 statistical and mathematical functions including functions for numerically stable calculations for moments about the mean (variance, standard deviation, kurtosis and skewness) distribution functions (beta, normal, bivariate normal, gamma, binomial, t, chi square, F, etc.), linear regression functions, correlation, statistical inference, interpolation, and many more.

Let’s look at a very simple statistical analysis. We are going to predict demand in units for some of our products based upon previous year’s unit sales.

Let’s put some data in a table.

SELECT *
INTO #n
FROM (VALUES
(856018,'widget 1',2008,92303)
,(856018,'widget 1',2009,172259)
,(856018,'widget 1',2010,262942)
,(856018,'widget 1',2011,356509)
,(856018,'widget 1',2012,441066)
,(856018,'widget 1',2013,536820)
,(748396,'widget 2',2008,13614)
,(748396,'widget 2',2009,20498)
,(748396,'widget 2',2010,32809)
,(748396,'widget 2',2011,49666)
,(748396,'widget 2',2012,55390)
,(748396,'widget 2',2013,76579)
,(373748,'widget 3',2008,99394)
,(373748,'widget 3',2009,202891)
,(373748,'widget 3',2010,297216)
,(373748,'widget 3',2011,401210)
,(373748,'widget 3',2012,500209)
,(373748,'widget 3',2013,591125)
,(509819,'widget 4',2008,36723)
,(509819,'widget 4',2009,63566)
,(509819,'widget 4',2010,95523)
,(509819,'widget 4',2011,129923)
,(509819,'widget 4',2012,163620)
,(509819,'widget 4',2013,193636)
,(549801,'widget 5',2008,11127)
,(549801,'widget 5',2009,27420)
,(549801,'widget 5',2010,42190)
,(549801,'widget 5',2011,49519)
,(549801,'widget 5',2012,67777)
,(549801,'widget 5',2013,74313)
,(628964,'widget 6',2008,90024)
,(628964,'widget 6',2009,173597)
,(628964,'widget 6',2010,262586)
,(628964,'widget 6',2011,345029)
,(628964,'widget 6',2012,427760)
,(628964,'widget 6',2013,519919)
,(772493,'widget 7',2008,80854)
,(772493,'widget 7',2009,168631)
,(772493,'widget 7',2010,261083)
,(772493,'widget 7',2011,346453)
,(772493,'widget 7',2012,428067)
,(772493,'widget 7',2013,514310)
,(477653,'widget 8',2008,94886)
,(477653,'widget 8',2009,198272)
,(477653,'widget 8',2010,296506)
,(477653,'widget 8',2011,390449)
,(477653,'widget 8',2012,486135)
,(477653,'widget 8',2013,593191)
,(268817,'widget 9',2008,95857)
,(268817,'widget 9',2009,198910)
,(268817,'widget 9',2010,292704)
,(268817,'widget 9',2011,385263)
,(268817,'widget 9',2012,489351)
,(268817,'widget 9',2013,585276)
)n([Product ID],[Description],[Year],[Units])

We can use the XLeratorDB FORECAST function to predict unit demand for 2014 based upon the actual units from 2008 through 2013. FORECAST is a user-defined aggregate function, so the syntax is like any other aggregate function. Since we want to know the results by Product ID and Description, we will use a GROUP BY to summarize the results.

SELECT
[Product ID]
,[Description]
,ROUND(wct.FORECAST(2014,[Units],[Year]), 0) AS Forecast
FROM
#n
GROUP BY
[Product ID]
,[Description]

This provides us with the 2014 predicted value for each product.

Product ID Description               Forecast
----------- ----------- ----------------------
856018 widget 1                    622574
748396 widget 2                     85062
373748 widget 3                    694134
509819 widget 4                    225745
549801 widget 5                     89824
628964 widget 6                    602593
772493 widget 7                    602995
477653 widget 8                    688146
268817 widget 9                    682325

If we wanted to include the r-squared for the FORECAST (which is an indication of the goodness-of-fit for the prediction) we can use the RSQ function.

SELECT
[Product ID]
,[Description]
,ROUND(wct.FORECAST(2014,[Units],[Year]), 0) AS Forecast
,wct.RSQ([Units],[Year]) AS [R-squared]
FROM
#n
GROUP BY
[Product ID]
,[Description]

In addition to the 2014 predicted values we can also see the r-squared associated with that prediction. The higher the r-squared, the better the fit of the least-squares line.

Product ID Description               Forecast              R-squared
----------- ----------- ---------------------- ----------------------
856018 widget 1                    622574       0.99946328252557
748396 widget 2                     85062      0.976925841604458
373748 widget 3                    694134      0.999667182131839
509819 widget 4                    225745      0.998768744610505
549801 widget 5                     89824      0.985511627560019
628964 widget 6                    602593      0.999766345380639
772493 widget 7                    602995       0.99964651777832
477653 widget 8                    688146      0.999647675706309
268817 widget 9                    682325      0.999735740913981

Finally, in this SQL we simply PIVOT the results in order to present the forecast alongside the historical results for each year.

SELECT
[Product Id]
,[Description]
,[2008]
,[2009]
,[2010]
,[2011]
,[2012]
,[2013]
,[2014]
FROM (
SELECT
*
FROM
#n
UNION ALL
SELECT
[Product ID]
,[Description]
,2014
,ROUND(wct.FORECAST(2014,[Units],[Year]), 0) AS Units
FROM
#n
GROUP BY
[Product ID]
,[Description]
) D
PIVOT(
SUM([units])
FOR
[Year]
IN
([2008],[2009],[2010],[2011],[2012],[2013],[2014])
) P

Now we can see the predicted values for 2014 alongside the historical value for 2008 through 2014.

Now that I can actually see how this type of analysis can work in T-SQL, I want to do statistical analysis in SQL Server. In my simple example above, we could have predicted the 2014 sales at almost any level of detail; say we wanted to include geographic information in our predictions (assuming that we have recorded that information in our database), we could predict demand by region, by state, by city, probably even by store address, with only minor changes to our SQL; something that could be done in a few seconds, rather than in days or week.

Doing this type of analysis is also extremely fast and scalable. The FORECAST function had no problem chewing through 10,000,000 rows of data in 2.5 seconds on my machine. In fact, it seems that for certain types of analyses it might actually be possible to make predictions in real time.

Numeric Instability

XLeratorDB has an enormous breadth of functions for statistical analysis, but one of the more interesting things that I discovered was that some of the built-in SQL Server functions are what the XLeratorDB documentation describes as ‘numerically unstable’.

Here’s a very simple example of numeric instability using a built-in SQL Server function.

SELECT STDEVP(x) AS [STDEVP]
FROM (VALUES (900000016.4),(900000010.3),(900000018.5),(900000006.2),(900000010.3))n(x)

This produces the following result.

STDEVP
----------------------
0

When I put the same values in Excel it returns 4.482231578.

This SQL performs the same calculation using the XLeratorDB function;

SELECT wct.STDEV_P(x) AS STDEV_P
FROM (VALUES (900000016.4),(900000010.3),(900000018.5),(900000006.2),(900000010.3))n(x)

producing the following result,

STDEV_P
----------------------
4.4822315476529

which is pretty close to the value returned by Excel. You can read more about what’s going on by reading this blog on the XLeratorDB web site.

If you are interested in doing statistical analysis and even predictive analytics in SQL Server you should download the 15-day free trial of XLeratorDB today and find out what you can do.

Reference: Pinal Dave (https://blog.sqlauthority.com)

SQL Scripts

SQL SERVER – Finding the Last Backup for All Databases – Notes from the Field #016

SQLAuthority News – 4 Whitepapers on Windows Azure

1 Comment. Leave new

Vikas
November 5, 2015 12:23 pm
Hi Pinal
I am a SQL DBA for about a decade now. I would like to pursue Big Data, Data Scientist, R and predictive analysis. With so many things around the web it is difficult to decipher what is the right way to go.
Can you chime in with your comments on how and from where I can pursue this without losing the niche of my past experience.
-Thanks
-Vikas
Reply

Jul 2026 Discount: Comprehensive Database Performance Health Check | Testimonials

SQL SERVER – Statistical Analysis in SQL Server – A Quick Introduction

Related Posts

SQL SERVER – Comparison Index Fragmentation, Index De-Fragmentation, Index Rebuild – SQL SERVER 2000 and SQL SERVER 2005

SQL SERVER – Why Server Authentication is Disabled? What Mode is SQL Server Using Currently?

SQL SERVER – Msg 1842 – The File Size, Max Size Cannot be Greater Than 2147483647 in Units of a Page Size

1 Comment. Leave new

Leave a ReplyCancel reply