Measures and aggregation
Measures are the numeric values that our users want to aggregate, slice, dice and otherwise analyze, and as a result, it's important to make sure they behave the way we want them to. One of the fundamental reasons for using Analysis Services is that, unlike a relational database it allows us to build into our cube design business rules about measures: how they should be formatted, how they should aggregate up, how they interact with specific dimensions and so on. It's therefore no surprise that we'll spend a lot of our cube development time thinking about measures.
Useful properties of Measures
Apart from the AggregateFunction property of a measure, which we'll come to next, there are two other important properties we'll want to set on a measure, once we've created it.
The Format String property of a measure specifies how the raw value of the measure gets formatted when it's displayed in query results. Almost all client tools will display the formatted value of a measure, and this allows us to ensure consistent formatting of a measure across all applications that display data from our cube.
A notable exception is Excel 2003 and earlier versions, which can only display raw measure values and not formatted values. Excel 2007 will display properly formatted measure values in most cases, but not all. For instance, it ignores the fourth section of the Format String which controls formatting for nulls. Reporting Services can display formatted values in reports, but doesn't by default; this blog entry describes how you can make it do so: http://tinyurl.com/gregformatstring.
There are a number of built-in formats that you can choose from, and you can also build your own by using syntax very similar to the one used by Visual BASIC for Applications (VBA) for number formatting. The Books Online topic FORMAT_STRING Contents gives a complete description of the syntax used.
Here are some points to bear in mind when setting the Format String property:
- If you're working with percentage values, using the % symbol will display your values multiplied by one hundred and add a percentage sign to the end. Note that only the display value gets multiplied by hundred—the real value of the measure will not be, so although your user might see a value of 98% the actual value of the cell would be 0.98.
- If you have a measure that returns null values in some circumstances and you want your users to see something other than null, don't try to use a MDX calculation to replace the nulls—this will cause severe query performance problems. You can use the fourth section of the Format String property to do this instead—for example, the following:
will display the string NA for null values, while keeping the actual cell value as null without affecting performance.
- Be careful while using the Currency built-in format: it will format values with the currency symbol for the locale specified in the Language property of the cube. This combination of the Currency format and the Language property is frequently recommended for formatting measures that contain monetary values, but setting this property will also affect the way number formats are displayed: for example, in the UK and the USA, the comma is used as a thousands separator, but in continental Europe it is used as a decimal separator. As a result, if you wanted to display a currency value to a user in a locale that didn't use that currency, then you could end up with confusing results. The value €100,101 would be interpreted as a value just over one hundred Euros to a user in France, but in the UK, it would be interpreted as a value of just over one hundred thousand Euros. You can use the desired currency symbol in a Format String instead, for example '$#,#.00', but this will not have an effect on the thousands and decimal separators used, which will always correspond to the Language setting. You can find an example of how to change the language property using a scoped assignment in the MDX Script here: http://tinyurl.com/gregformatstring.
- Similarly, while Analysis Services 2008 supports the translation of captions and member names for users in different locales, unlike in previous versions, it will not translate the number formats used. As a result, if your cube might be used by users in different locales you need to ensure they understand whichever number format the cube is using.
Many cubes have a lot of measures on them, and as with dimension hierarchies, it's possible to group measures together into folders to make it easier for your users to find the one they want. Most, but not all, client tools support display folders, so it may be worth checking whether the one you intend to use does.
By default each measure group in a cube will have its own folder containing all of the measures on the measure group; these top level measure group folders cannot be removed and it's not possible to make a measure from one measure group appear in a folder under another measure group. By entering a folder name in a measure's Display Folder property, you'll make the measure appear in a folder underneath its measure group with that name; if there isn't already a folder with that name, then one will be created, and folder names are case-sensitive. You can make a measure appear under multiple folders by entering a semi-colon delimited list of names as follows: Folder One; Folder Two.
You can also create a folder hierarchy by entering either a forward-slash / or back-slash delimited list (the documentation contradicts itself on which is meant to be used—most client tools that support display folders support both) of folder names as follows: Folder One; Folder TwoFolder Three.
Calculated measures defined in the MDX Script can also be associated with a measure group, through the Associated_Measure_Group property, and with a display folder through the Display_Folder property. These properties can be set either in code or in Form View in the Calculations tab in the Cube Editor:
If you don't associate a calculated measure with a measure group, but do put it in a folder, the folder will appear at the same level as the folders created for each measure group.
Built-in measure aggregation types
The most important property of a measure is AggregateFunction; it controls how the measure aggregates up through each hierarchy in the cube. When you run an MDX query, you can think of it as being similar to a SQL SELECT statement with a GROUP BY clause—but whereas in SQL you have to specify an aggregate function to control how each column's values get aggregated, in MDX you specify this for each measure when the cube is designed.
Basic aggregation types
Anyone with a passing knowledge of SQL will understand the four basic aggregation types available when setting the AggregateFunction property:
- Sum is the commonest aggregation type to use, probably the one you'll use for 90% of all the measures. It means that the values for this measure will be summed up.
- Count is another commonly used property value and aggregates either by counting the overall number of rows from the fact table that the measure group is built from (when the Binding Type property, found on the Measure Source dialog that appears when you click on the ellipses button next to the Source property of a measure, is set to Row Binding), or by counting non-null values from a specific measure column (when Binding Type property is set to Column Binding).
- Min and Max return the minimum and maximum measure values.
There isn't a built-in Average aggregation type—as we'll soon see, AverageOfChildren does not do a simple average—but it's very easy to create a calculated measure that returns an average by dividing a measure with AggregateFunction Sum by one with AggregateFunction Count, for example:
CREATE MEMBER CURRENTCUBE.[Measures].[Average Measure Example] AS
IIF([Measures].[Count Measure]=0, NULL,
[Measures].[Sum Measure]/[Measures].[Count Measure]);
The DistinctCount aggregation type counts the number of distinct values in a column in your fact table, similar to a Count(Distinct) in SQL. It's generally used in scenarios where you're counting some kind of key, for example, finding the number of unique Customers who bought a particular product in a given time period. This is, by its very nature, an expensive operation for Analysis Services and queries that use DistinctCount measures can perform worse than those which use additive measures. It is possible to get distinct count values using MDX calculations but this almost always performs worse; it is also possible to use many-to-many dimensions to get the same results and this may perform better in some circumstances; see the section on "Distinct Count" in the "Many to Many Revolution" white paper, available at http://tinyurl.com/m2mrev.
When you create a new distinct count measure, BIDS will create a new measure group to hold it automatically. Each distinct count measure needs to be put into its own measure group for query performance reasons, and although it is possible to override BIDS and create a distinct count measure in an existing measure group with measures that have other aggregation types, we strongly recommend that you do not do this.
The None aggregation type simply means that no aggregation takes place on the measure at all. Although it might seem that a measure with this aggregation type displays no values at all, that's not true: it only contains values at the lowest possible granularity in the cube, at the intersection of the key attributes of all the dimensions. It's very rarely used, and only makes sense for values such as prices that should never be aggregated.
If you ever find that your cube seems to contain no data even though it has processed successfully, check to see if you have accidentally deleted the Calculate statement from the beginning of your MDX Script. Without this statement, no aggregation will take place within the cube and you'll only see data at the intersection of the leaves of every dimension, as if every measure had AggregateFunction None.
Semi-additive aggregation types
The semi-additive aggregation types are:
They behave the same as measures with aggregation type Sum on all dimensions except Time dimensions. In order to get Analysis Services to recognize a Time dimension, you'll need to have set the dimension's Type property to Time in the Dimension Editor.
Sometimes you'll have multiple, role-playing Time dimensions in a cube, and if you have semi-additive measures, they'll be semi-additive for just one of these Time dimensions. In this situation, Analysis Services 2008 RTM uses the first Time dimension in the cube that has a relationship with the measure group containing the semi-additive measure. You can control the order of dimensions in the cube by dragging and dropping them in the Dimensions pane in the bottom left-hand corner of the Cube Structure tab of the Cube Editor; the following blog entry describes how to do this in more detail: http://tinyurl.com/gregsemiadd. However, this behavior has changed between versions in the past and may change again in the future.
Semi-additive aggregation is extremely useful when you have a fact table that contains snapshot data. For example, if you had a fact table containing information on the number of items in stock in a warehouse, then it would never make sense to aggregate these measures over time: if you had ten widgets in stock on January 1, eleven in stock on January 2, eight on January 3 and so on, the value you would want to display for the whole of January would never be the sum of the number of items in stock on each day in January. The value you do display depends on your organization's business rules.
Let's take a look at what each of the semi-additive measure values actually do:
- AverageOfChildren displays the average of all the values at the lowest level of granularity on the Time dimension. So, for example, if Date was the lowest level of granularity, when looking at a Year value, then Analysis Services would display the average value for all days in the year.
- FirstChild displays the value of the first time period at the lowest level of granularity, for example, the first day of the year.
- LastChild displays the value of the last time period at the lowest level of granularity, for example, the last day of the year.
- FirstNonEmpty displays the value of the first time period at the lowest level of granularity that is not empty, for example the first day of the year that has a value.
- LastNonEmpty displays the value of the last time period at the lowest level of granularity that is not empty, for example the last day of the year that has a value. This is the most commonly used semi-additive type; a good example of its use would be where the measure group contains data about stock levels in a warehouse, so when you aggregated along the Time dimension what you'd want to see is the amount of stock you had at the end of the current time period.
The following screenshot of an Excel pivot table illustrates how each of these semi-additive aggregation types works:
Note that the semi-additive measures only have an effect above the lowest level of granularity on a Time dimension. For dates like July 17th in the screenshot above, where there is no data for the Sum measure, the LastNonEmpty measure still returns null and not the value of the last non-empty date.
The semi-additive measure aggregation types (and None) are only available in Enterprise Edition; it's possible to recreate the same functionality using MDX but query performance will not be quite as good. Here's an example of how you could overwrite the value of a measure with the AggregateFunction Sum by using an MDX Script assignment to make it behave in the same way as a LastNonEmpty measure:
THIS = TAIL(
* [Measures].[Sales Amount])
We may also find that the performance of LastNonEmpty is still not good enough, especially on very large cubes. It is not as fast as LastChild, so one trick we could try is to calculate the last non empty value for a measure in a new column in your fact table of your ETL. We can then create a new measure from this column, use the LastChild measure as its aggregation type, then set its Visible property to false and use an MDX Script something like the following to display the new measure's value above the leaf level for the original measure:
SCOPE([Measures].[Sales Amount], [Date].[Date].[All]);
THIS = [Measures].[LastChild];
This assignment forces each time period to show the value of the last date within that time period. Typically, though, at the tail of your Date dimension you will have dates that have not occurred yet and which cannot contain any data—for example, if today's date is April 16th 2009, you might only have data in the cube up to April 15th, but the Date dimension will probably contain all of the dates up to December 31st 2009. For these future dates you'll either have to ensure your ETL populates the value of the LastChild measure up until the end of the current year or use another MDX Script assignment that is similar to the one described at the end of this blog entry: http://tinyurl.com/gregsemiadd.
A Chart of Accounts dimension is a feature common to many cubes containing financial data; in many ways it has a lot in common with the measures dimension, because each member on a Chart of Accounts dimension represents a different type of value such as profit and loss or balance sheet values for example. A Chart of Accounts dimension is often implemented with a parent/child hierarchy, and there are often complex business rules that govern how the members on the hierarchy should aggregate up, if they even do at all. Analysis Services provides several features to help you build and manage such a dimension, although it's widely accepted that building financial applications in Analysis Services is not as easy as it should be, or indeed as easy as it is in other OLAP servers.
The ByAccount aggregation type (available in Enterprise Edition only) is the first of these features we'll discuss, and it allows us to define different semi-additive behavior for a single measure for different members on the main hierarchy of a Chart of Accounts dimension. The steps to get it working are as follows:
- Create your Chart of Accounts dimension.
- Create another attribute that will be used to flag which members on the main hierarchy will use which semi-additive aggregation type. This should have one member for each type of attribute, where each member represents a different form of semi-additive behavior. Call it something like 'Account Type'.
- Set the following properties:
- On the dimension itself, set the Type property to Accounts.
- On the main hierarchy, set the Type property to Account.
- On the Account Type attribute, set the Type property to Account Type.
- Define how each account type should aggregate. This can be done by right-clicking on the name of the project in the Solution Explorer in BIDS and selecting Edit Database. In this screen, in the Account Type Mapping table, the Name column contains the name of the built-in or user-defined account types and the Alias column contains a comma-delimited list of member names on the Account Type hierarchy that map to each account type, and Aggregation Function allows you to select either Sum or one of the semi-additive aggregation types for each account type.
Most of this work can also be done by running the Define Account Intelligence wizard. To do this, go to the Dimension menu and select Add Business Intelligence and then select 'Define Account Intelligence' from the list of available enhancements.
In our experience, the ByAccount aggregation type is very rarely used. This is partly because cube developers do not know about it, and partly because most financial applications need to implement more complex logic for aggregation than the built-in semi-additive aggregation types allow. This means that aggregation logic is usually implemented using MDX Script assignments, which give complete flexibility although at the expense of a lot of complexity and do not perform as well as the built-in aggregation types. Probably, the best compromise is to start off by using ByAccount and then for the accounts which need more complex aggregation, configure them to use Sum in the Edit Database screen and then use custom MDX for them.
In addition to the AggregateFunction property of a measure, there are many other ways of controlling how measures aggregate up. Apart from writing code in the MDX Script, these methods involve different types of dimension calculations.
Unary operators and weights
In a similar way to how you can control semi-additive aggregation with the ByAccount aggregation type, you can control how members on a hierarchy (usually parent/child, but not necessarily) aggregate up to their parents by defining unary operators. To do this, you must create a new column in your dimension table to hold the unary operator values and then set the attribute's UnaryOperatorColumn property to point to it.
The unary operators supported are:
- +, which means that the member's value contributes as a positive value to its parent's total
- -, which means that the member's value contributes as a negative value to its parent's total
- *, which means that the member's value is multiplied with the value of the previous member in the level
- /, which means that the member's value is divided by the value of the previous member in the level
- ~ , which means the member's value is ignored when calculating the value of the parent
- Any numeric value, which works the same as + but where the numeric value is multiplied by the member's value first
Let's take a look at a simple example of this working. Once you've defined some unary operators, you can see which members have which operators in the Browser tab of the Dimension Editor:
From this, we can see that the All member has three children. They are Balance Sheet, Net Income and Statistical Accounts. Since both Balance Sheet and Statistical Accounts have unary operator ~, All member's value is going to be the same as that of Net Income. Similarly, we can see that the value of Net Income itself will be calculated as follows: + (Operating Profit) + (Other Income and Expense) – (Taxes).
The distributive unary operators, that is + and -, perform much better than calculated members that do the equivalent addition and subtraction in MDX formulas.
Custom Member Formulas
Where unary operators do not give you enough control, you can assign an MDX expression to use to calculate the value for each member on an attribute hierarchy using Custom Member Formulas. To do this, we need to create a new column on our dimension table to hold the MDX expressions and then set the CustomRollupColumn property of the attribute to this column.
We can also create another column to hold property values that can be applied to each member on the hierarchy too, which then needs to be assigned in the CustomRollupPropertiesColumn property. The contents of this column takes the form of a comma-delimited list of property values, just as you would use at the end of a calculated member definition; for example, if a member had the following expression associated with it in the dimension table:
then all values for that measure would have the Format String '$#,#.00' applied to them, and they would be displayed with a red cell background.
Custom Member Formulas offer something very similar to what you can do with MDX Script assignments, so what are the advantages and disadvantages of using them?
- In terms of query performance, Custom Member Formulas and MDX Script assignments are the same.
- As dimension calculations, Custom Member Formulas differ from MDX Script assignments in one important respect: they use the 'Closest Pass Wins' rule rather than the 'Last Pass Wins' rule when a cell's value can be calculated in two different ways from two different calculations. You don't have the flexibility you have with MDX Script assignments, but you can always be sure that a member's value will be calculated using the formula you specify and not be accidentally overwritten by another assignment.
- Using Custom Member Formulas means the dimension becomes, in some respects, self-documenting. If the formulas are made visible to the end user somehow, perhaps by creating another attribute with AttributeHierarchyEnabled=False in the formula column, and the formulas are relatively simple, then users should be able to view and understand them.
- If a dimension with Custom Member Formulas is shared across multiple cubes, then these calculations are automatically applied to all of the cubes. With MDX Script assignments, the code would have to be duplicated across cubes.
- The major drawback to using Custom Member Formulas is that if you need to edit a calculation, you need to edit a value in a relational table. Not only is this inconvenient, it makes debugging calculations much more difficult and it means that your MDX code can be split between two places: the dimension table and the MDX Script.
In some rare cases, you may encounter non-aggregatable measures, that's to say measures whose aggregated values have to be supplied from the data warehouse and cannot be derived from lower granularity values. It's always worth asking why measures are non-aggregatable: in many cases, it's the result of some kind of pre-aggregation or calculation taking place during the ETL phase. While Analysis Services can handle non-aggregatable measures, it's much better at handling additive data, so if you can build your cube from the additive data, we recommend you do so, even if the data volumes end up being much larger.
However, if you do have to work with non-aggregatable measures, one way of handling them is by using parent/child hierarchies. Every non-leaf member on a parent/child hierarchy has an extra, system-generated child called a datamember, and you can control the visibility of these datamembers by setting the MembersWithData property on the parent/child hierarchy. If your fact table contains values for a non-leaf member on a parent/child hierarchy, then these values will in fact be assigned to the member's datamember; by default, the real non-leaf member's values will then be aggregated from its children and its datamember. This can be seen in the following screenshot from Excel:
In order to display non-aggregatable values, all you need to do is to use an MDX Script assignment to make each non-leaf member on your parent/child hierarchy display just the value of its datamember, for example:
SCOPE([Measures].[Sales Amount Quota]);
This code would have the following effect on our Excel example:
As you can see, the value of the non-leaf member Alberts is now no longer the sum of its children and its datamember, but that of its datamember alone.
For more details on working with non-aggregatable values and parent/child hierarchies, see the following paper written by Richard Tkachuk: http://tinyurl.com/nonaggdata
In this first part, we covered useful properties of measures, basic aggregation types, Custom Member Formulas along with their advantages and disadvantages. In the next part, we will focus on measure groups and dimension/measure group relationships.