[Note from Pinal]: This is a new episode of Notes from the Field series. I personally have no problem accepting that I do not know many topics. One of the such topic is Tabular Models. I always wanted to learn more and I finally got the opportunity to learn more.
In this episode of the Notes from the Field series I asked SQL Expert Bill Anton a very crucial question – How can we optimize Tabular Models for self service reporting with the help of Summarize By ? Bill was very kind to answer the questions and provides plenty of information about how a novice developer can work with Tabular Models.
Designing Tabular solutions for self-service environments can pose some tricky performance problems as there’s little control over how a user interacts with the tools and no control over the DAX that gets generated and executed against the underlying Tabular model. However, there are several ways the developer can “influence” the DAX queries that get generated in the reporting layer. One such way is to set the “Summarize By” property (to “Do Not Summarize” or “None”) for numeric columns in Tabular models.
Summarize By Property
By default, self-service reporting tools such as Power BI attempt to apply a SUM aggregation to numeric columns. You may have seen this before when using the table visualization in a Power BI report. Below is a perfect illustration of this issue…
The repeating 7338695 value in the Year column is clearly not a valid year nor is it the expected behavior from the user perspective. The value is derived from adding up the Year column (i.e. 2012, 2013, etc) across all the rows of the DimDate table.
Fortunately, users can quickly correct this issue in the Power BI report by changing the “Summarize By” property in the Visualizations pane for the [Year] field to “Don’t Summarize”…
The table now displays the correct/expected values…
However, there are several issues with this approach…
- Poor User Experience (UX)
Because the correction is made in the in the reporting layer via Power BI as opposed to the BI Semantic layer (i.e. in the Tabular model), every report developer will potentially run into this same issue and need to make the same adjustment - Poor Performance
In some cases (e.g. table visualizations with detailed data and many columns), using numeric columns with the “Summarize By” set to default in a Power BI report can result in a DAX query with horrible performance. And while changing the “Summarize By” property from the default value to “Don’t Summarize” will result in a better/faster DAX query, the reality is this change can’t be made until after the column has been added to the visualization – which means the penalty is paid at least once. Factor in a large number of concurrent users and this type of performance issue can quickly soak up all the CPU and memory of Analysis Services instance to the point where it becomes unresponsive.
A better solution is to make this adjustment in the Tabular model – doing so addresses both of these issues.
Tabular Model Adjustment
To make this adjustment in the Tabular model, just click on the column in either the grid view or diagram view and you’ll see the “Summarize By” property in the Properties window.
Note: this property can be changed for multiple columns at the same time by holding down the control key and selecting the desired columns.
Performance Example
Below is a fairly typical (though terribly unsexy) Power BI report based on the Adventure Works sample data set. This report shows the number of customers and orders for the selected year (in this case 2012). It also includes a table visualization with customer details for all customers who made purchases during the selected timeframe.
Now let’s say we want to add the following customer attributes to the table…
Note: the sigma icon to the left of the column name indicates the column is numeric and still configured with the default value for the “Summarize By” property (i.e. SUM)
When these columns are added to the table, we immediately notice it takes a really long time to refresh…
Here’s the query generated by Power BI…
EVALUATE CALCULATETABLE( FILTER( KEEPFILTERS( ADDCOLUMNS( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS(VALUES('Customer'[Customer Name])), VALUES('Product Big'[Product Subcategory]) ) ), VALUES('Product Big'[Product Category]) ) ), VALUES('Customer'[Gender]) ) ), VALUES('Customer'[EnglishOccupation]) ) ), VALUES('Customer'[EnglishEducation]) ) ), FILTER( KEEPFILTERS(VALUES('Customer'[Customer ID])), OR( NOT(ISBLANK('Internet Sales Big'[Internet Sales Amount])), AND( CALCULATE( NOT(ISBLANK(COUNTROWS('Internet Sales Big'))) ), OR( OR( OR( NOT( ISBLANK(CALCULATE(SUM('Customer'[NumberCarsOwned]))) ), NOT( ISBLANK(CALCULATE(SUM('Customer'[NumberChildrenAtHome]))) ) ), NOT( ISBLANK(CALCULATE(SUM('Customer'[TotalChildren]))) ) ), NOT( ISBLANK(CALCULATE(SUM('Customer'[YearlyIncome]))) ) ) ) ) ) ) ), "Internet_Sales_Amount", 'Internet Sales Big'[Internet Sales Amount], "SumNumberCarsOwned", CALCULATE(SUM('Customer'[NumberCarsOwned])), "SumNumberChildrenAtHome", CALCULATE(SUM('Customer'[NumberChildrenAtHome])), "SumTotalChildren", CALCULATE(SUM('Customer'[TotalChildren])), "SumYearlyIncome", CALCULATE(SUM('Customer'[YearlyIncome])) ) ), OR( OR( OR( OR( OR( OR( OR( OR( OR( OR( OR( NOT(ISBLANK('Customer'[Customer Name])), NOT(ISBLANK('Product Big'[Product Subcategory])) ), NOT(ISBLANK('Product Big'[Product Category])) ), NOT(ISBLANK('Customer'[Gender])) ), NOT(ISBLANK('Customer'[EnglishOccupation])) ), NOT(ISBLANK('Customer'[EnglishEducation])) ), NOT(ISBLANK('Customer'[Customer ID])) ), NOT(ISBLANK([Internet_Sales_Amount])) ), NOT(ISBLANK([SumNumberCarsOwned])) ), NOT(ISBLANK([SumNumberChildrenAtHome])) ), NOT(ISBLANK([SumTotalChildren])) ), NOT(ISBLANK([SumYearlyIncome])) ) ), KEEPFILTERS( FILTER(KEEPFILTERS(VALUES('Date'[Year])), 'Date'[Year] = 2012) ) ) ORDER BY 'Customer'[Customer Name], 'Product Big'[Product Subcategory], 'Product Big'[Product Category], 'Customer'[Gender], 'Customer'[EnglishOccupation], 'Customer'[EnglishEducation], 'Customer'[Customer ID]
On my system, this query takes ~5.2 seconds with a cold-cache…
…and ~3.5 seconds with a warm-cache…
The problem here is that these 4 additional columns (i.e. NumberCarsOwned, NumberChildrenAtHome, TotalChildren, YearlyIncome) are being treated as measures instead of descriptive columns used for grouping…
We also see, from the DAX Studio screenshots above, that this query is very much bound to the formula engine which means we will need to rewrite the DAX to get achieve better performance. Which begs the question, how do we rewrite the DAX when it is being auto-magically generated by the reporting client – Power BI?
Technically we can’t control exactly how the query is constructed, but we can “influence” it via how we write our measures and configure out model. And in this particular case, all we need to do is change the “Summarize By” property for these 4 columns to “influence” the auto-magically generated DAX.
After doing so, the report refreshes very quickly and we see the identical results…
We also see that a slightly different query has been generated (below) and now those 4 additional columns are being treated as grouping columns instead of calculated measures…
EVALUATE CALCULATETABLE( ADDCOLUMNS( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS( GENERATE( KEEPFILTERS(VALUES('Customer'[Customer Name])), VALUES('Product Big'[Product Subcategory]) ) ), VALUES('Product Big'[Product Category]) ) ), VALUES('Customer'[Gender]) ) ), VALUES('Customer'[EnglishOccupation]) ) ), VALUES('Customer'[EnglishEducation]) ) ), VALUES('Customer'[Customer ID]) ) ), VALUES('Customer'[NumberCarsOwned]) ) ), VALUES('Customer'[NumberChildrenAtHome]) ) ), VALUES('Customer'[TotalChildren]) ) ), FILTER( KEEPFILTERS(VALUES('Customer'[YearlyIncome])), NOT(ISBLANK('Internet Sales Big'[Internet Sales Amount])) ) ) ), "Internet_Sales_Amount", 'Internet Sales Big'[Internet Sales Amount] ), KEEPFILTERS( FILTER(KEEPFILTERS(VALUES('Date'[Year])), 'Date'[Year] = 2012) ) ) ORDER BY 'Customer'[Customer Name], 'Product Big'[Product Subcategory], 'Product Big'[Product Category], 'Customer'[Gender], 'Customer'[EnglishOccupation], 'Customer'[EnglishEducation], 'Customer'[Customer ID], 'Customer'[NumberCarsOwned], 'Customer'[NumberChildrenAtHome], 'Customer'[TotalChildren], 'Customer'[YearlyIncome]
Looking at the breakdowns in DAX Studio, we see that this new query runs in ~1.5 sec with a cold-cache…
…and ~162 ms with a warm-cache…
That’s a pretty good performance improvement for such a minor adjustment on a sample data set. However, with a larger data set, the performance improvement can be much more significant. For example, in a recent client engagement, this same adjustment meant the difference between a Power BI report that was failing after > 2 minutes with an out-of-memory exception …
…to one that rendered in under 250ms with a cold-cache…
…and ~ 115ms with a warm-cache…
Wrapping Up – Tabular Models
The “Summarize By” property is just one of the many ways in which a developer can influence the queries generated from self-service reporting tools (e.g. Power BI) against an Analysis Services Tabular solution.
My advice is to set this property to “Do Not Summarize” for every single numeric column in the model. Any field that needs to be summarize/aggregated should have an explicit measure defined in the calculation area. Not only does this improve the user experience, but it could also have a (potentially) huge impact on the scalability of the Tabular instance.
If you want to get started on Tabular Models with the help of experts, read more over at Fix Your SQL Server.
Reference: Pinal Dave (https://blog.sqlauthority.com)
1 Comment. Leave new
the codes look like lisp