In this chapter we will discuss the differences between Multidimensional databases (cubes) and Tabular Models in order to help you decide which is best for meeting your particular needs. More details on how to implement each of these will be covered in the rest of this cookbook. In this chapter we will explore the following concepts:
The business value of Business Intelligence
The challenges and barriers faced when implementing Business Intelligence
Strategies for overcoming these challenges and barriers
Choosing multidimensional or Tabular Models
Star- and Snowflake-relational schema
A sample scenario for choosing the Snowflake schema
Business Intelligence (BI) used to be a competitive advantage for businesses that could afford it. Today, BI is increasingly becoming a fundamental and critical function of every business, which means it can no longer operate on an exclusively strategic basis. Also, it must be responsive to changing business needs in a time frame that allows the business to address those needs. The demand for more responsiveness (less time to implement) of BI continues to increase in parallel with the demand for more functionality.
In response to this increasing pressure on BI to perform at "Business speed", a new capability has emerged in the form of Self-service BI. These Self-service tools allow business users to acquire and analyze data from a variety of sources according to their specific needs at that moment. There are challenges and concerns that come with giving this capability to business users. We'll discuss these in detail in this chapter, but Self-Service BI is a good and growing solution to an important business need.
Microsoft's offerings for Self-Service BI include tools such as Power Pivot, Power View, and Tabular Models in Analysis Services. These tools continue to gain acceptance and are an increasingly presumed capability in Microsoft BI environments. It's not always clear, however, what precise mix of these tools, along with multidimensional cubes, relational data marts, and other presentation tools such as Excel and Reporting Services, would be optimal for any given situation.
There are a large number of variables in determining this optimal mix and we won't be discussing all of them in this cookbook. Our focus will be on the choice between and recipes for implementing Multidimensional cubes and Tabular Models.
"Making better decisions faster" is a common phrase used to describe the purpose of BI, but understanding how this purpose translates into value for the business is the key in understanding why and how BI should be implemented.
Making better decisions is valuable for the strategic management of an organization; making those decisions faster is possibly better, but strategic decisions tend to have longer time frames. So, faster is often not always better or even necessary. Making better operational decisions faster, given the much higher frequency and shorter decision time frame, is of great value to the business. For this discussion we'll focus on this less often considered operational value of BI.
Operational decisions are made every day by people at all levels in the organization. The nature of these decisions vary greatly, including things such as troubleshooting and resolving a specific question, finding a more efficient process for performing a task, or determining an appropriate staffing level for the coming week.
Regardless of the specifics, operational decisions are generally concerned with improving efficiency, increasing productivity, improving quality of the product, or lowering cost. BI can provide the information necessary to identify opportunities for improvement in these areas as well as to make informed decisions on how to implement these improvements. But, the greatest value is realized only when that information is of high quality and is available when needed. Poor quality or late information makes for poor quality or late decisions.
The need to deliver accurate information quickly is the fundamental challenge for Business Intelligence. The production of high quality information in a useful format takes time—data must be acquired, cleansed, modeled, and stored over continuous update and enhancement cycles. If any of these aspects of properly managing data are given less than appropriate attention, the quality of the information suffers—you take a short cut for speed of delivery and risk a reduction in the quality of the final product.
Even the highest quality information is of little value if it comes too late to be helpful. So the pressure is on meeting the business requests for information now, not in the several days or weeks it might take to define requirements, update the data model, develop ETL processes, test, validate, and finally make the information available in the data warehouse. Businesses often cannot wait and so they develop alternatives for acquiring and "managing" their own data. These alternatives, though they may answer the need for speed, inevitably result in both redundant data and inconsistent information.
Over time, technology and business groups have developed strategies and techniques aimed at coming closer to aligning managed data and the much faster business cycles. Improvements in traditional data storage engines, including the development of multidimensional models and ETL tools have helped. Iterative and agile development methodologies have given BI more of a continuous improvement than waterfall behavior and have made the environment more nimble. Still, there remained a gap where IT could not respond quickly enough to business demands, and businesses did not have the skill and discipline to sufficiently manage high quality data.
Self-Service BI is a good, and still improving answer for bridging the Business Intelligence technology and business gap. More than just tools and technology, Self-Service BI involves a commitment to cooperation and continuous—organic—improvement. With the right tools and cooperation between IT and business, it's now possible to provide long-term and high-quality managed data while also giving businesses the capability to meet their information needs in their needed time frame.
The Self-Service tools, such as Power Pivot, Power View, and the Analysis Services Tabular Model introduced with the SQL Server 2012, allow business resources to acquire, analyze, and share information relatively independent of IT and with a relatively low requirement for technical skill—the emphasis is on "relatively". It is possible for a business person to acquire data from a variety of resources through the use of tools provided by wizards and graphical interfaces. However, there remains the need for a higher than average technical capability—not a developer level but an analyst level resource is the typical profile. Also, though there is no requirement to involve IT or the managed data environment, these resources remain a source of considerable capability and information, and Self-Service users should look to them first to check if their needs may be met.
Traditional managed data and emerging Self-Service BI are, therefore, not competitive nor alternative technologies but rather complimentary technologies that together are a comprehensive, robust, and nimble information environment. Self-Service BI is the pointed end of the spear in which analysts self-serving information are in direct contact with the business and are tasked with responding quickly to information requests. As such, these analysts are the first to be aware of emerging and recurring questions and the information needs that answer those questions. By regularly harvesting this knowledge, those in charge of maintaining the managed data environment have a clear direction as to how their environment should evolve. Incorporating the newly identified, and vetted by Self-Service, sources and business rules for analysis into the data warehouse continuously improves the quality and depth of the still very valuable managed data environment.
Given the complimentary nature of managed and Self-Service data environments, it's reasonable to assume that in most organizations, at least one data warehouse will exist and will be available as the primary source of information.
Prior to the introduction of Tabular Models, cubes were often implemented as the outermost information interface for reporting and analysis. This configuration provided preaggregated values, ad-hoc analysis functionality, and a central store for business calculations. However, the development and maintenance of the Cube is in the exclusive domain of IT, and the business calculations are written in the MultiDimensional eXpressions (MDX) language (not the easiest of languages to learn). So, cubes are the logical (multidimensional) extension of the managed data environment. They provide high quality information and are consistent as well as fast to query, but dependent on their defined relational sources and, as a result, often slow to respond to changing needs.
The Tabular Model, like cubes part of the Analysis Services platform and multidimensional in nature offers much greater flexibility for the introduction of new data sources and subsequent definition of new dimensions, attributes, and measures.
Note
No formal ETL or data modeling is required; so, turnaround times for updates are greatly reduced, and no MDX is needed in order to define calculated values, as this is done via the new language, Data Analysis Expressions (DAX). Though the DAX language is considered by many to be easier to learn and use than MDX, it is not, in its current version, as capable as MDX. So, while certain basic operations are easier to express in DAX compared to MDX, implementing complex calculations is much more difficult using DAX. Although this book isn't dedicated to MDX or DAX, you can learn about the useful constructs of these languages in Chapter 6, MDX and Chapter 10, DAX Calculations and Queries.
In most environments, both cubes and Tabular Models will be used as each provides a useful and specific set of functionality. Determining which should be used for a given set of requirements will depend on the particulars of those requirements, but the following checklist provides a high-level guideline for selecting the most appropriate tool.
A Cube is best if the following requirements are satisfied (not a comprehensive list, more of a top five):
You need Writeback functionality: Writeback is most commonly used for budgeting and what-if analysis. It's not the most widely used functionality but can be a very important part of some BI environments. This functionality is not currently supported in the Tabular Model.
You have complex calculations: It's difficult in DAX to create complex calculations, which are relatively straightforward in MDX.
You plan to have Named sets: Named sets are very useful and user friendly. However, they are not currently supported in Tabular Models.
You have Many-to-Many relationships: While many-to-many relationships are possible in Tabular Models, they are complicated to set up, whereas in cubes these relationships are native and relatively easy.
You will use Role Playing Dimensions: Like many-to-many relationships, Role Playing Dimensions are possible in Tabular Models, but they are complex to set up and also not very intuitive for users.
A Tabular Model is best if the following requirements are satisfied (again, not a comprehensive list):
You need the ability to quickly add additional or external data: The Tabular Model allows you to connect to a wide variety of sources, while the cube is far more constrained on its source requirements. The Tabular Model also offers greater data compression compared to the multidimensional model. Generally, the cycle for Tabular Model development will be shorter compared to that for multidimensional model development.
Your model is relatively simplistic: As mentioned earlier, complex design scenarios, such as many-to-many relationships as well as parent-child or role playing dimensions can be implemented with the Tabular Model, but it would require much greater effort compared to the effort with the multidimensional model.
Fastest possible query performance is required: The Tabular Model is entirely stored in memory, so it is very fast at query time. Note that cubes are cached as they are queried, so there is a point at which the Cube may match the Tabular Model for a similar query but, in most cases, the Tabular Model will outperform the Cube for sheer query speed.
You want to use Power View or intend to run reports at a low granularity of data: An add-in for SQL Server Reporting Services, Power View is an ad hoc visualization and analysis tool with which users may explore Power Pivot or the Tabular Model data in a familiar Office-like interface. Prior to SQL 2012 SP1, Power View could not use a Cube as a source. This is no longer a limitation with the release of SP1. The Tabular Model is also likely to be more efficient to retrieve data at low granularity, as opposed to retrieving just the summary values.
For additional information on the comparison between tabular and multidimensional models, refer to http://technet.microsoft.com/en-us/library/hh212940.aspx.
There continues to be much discussion, and often debate, over the question of whether a Star or Snowflake schema is preferred and whether cubes or Tabular Models may be required.
Note
In the Star schema, each fact table is directly related to every dimension table; in the Snowflake schema, some dimension tables may be further normalized and connected to the fact table through other dimensions. You can use Star or Snowflake data models for building multidimensional as well as Tabular Models.
The answer is that either architecture is acceptable, and in most environments, the best choice is not one or the other but rather a mix of both.
Before making a decision on using a Star or Snowflake architecture for your relational scheme, it's important to understand the key characteristics of each. Stars are denormalized models, most typically seen in data marts. Though not optimal for data maintenance activities (as they are heavily data redundant), Stars are very fast to query and due to their far less complex schema, they are easier for business users to navigate. Snowflakes, on the other hand, are normalized models, most typically seen in data warehouses. Since they are normalized, Snowflakes are optimized for data maintenance, but the requirement of joining many tables to retrieve data mean a more complex overall schema and slower queries.
Given that our primary goal in BI is to provide access to data as quickly and intuitively as possible, Stars are generally considered to be the preferred "outer" data layer. Outer in this case implies that we may have (and often we do have) a normalized (Snowflake) data warehouse, which is the primary persistent managed data store. The denormalized (Star) Data Mart is populated from the Data Warehouse as a way of positioning data for optimal user, reporting, and application use. Cubes and Tabular Models, like all analysis tools, benefit greatly from this optimization but can consume the normalized data warehouse as well—usually not as efficiently.
However, this does not mean that my Data Mart must be entirely comprised of denormalized Star structures. You will find that as your environment matures, you will be faced with the fact data of differing grains along shared (conformed) dimensions. In such instances, you should consider normalizing (Snowflaking) those specific dimensions in order to accommodate those different facts. This is a good example of a Data Mart that is still considered a Star architecture but contains a small number of Snowflake dimensions.
Here's an example of a design decision process that would lead you to a Snowflake dimension. Start by assuming that all the dimensions in the Data Mart (versus the Data Warehouse, where we may have different ideas) will be modeled as Stars.
We start in our first design with a single dimension, Geography, containing the following columns:
skGeography
(surrogate key)PostalCode
(business key)CityID
CityName
StateID
StateName
CountryID
CountryName
We have one fact source table containing, say, population data with the following columns:
CensusDate
PostalCode
PopulationCount
In ETL, we would join this source table to the dimension table on the business key PostalCode
to retrieve the surrogate key and use this to load the data mart fact table:
CensusDate
skGeography
PopulationCount
Now, let's introduce a second fact source table containing projected population data, but with a different grain. Let's assume this data comes in, not at the Postal Code grain but rather at the State grain. We'd have a source table with columns such as follows:
ProjectionDate
StateID
ProjectedGrowth
We can't join this new source table to our existing Geography dimension because if we do so, we will get back many surrogate keys—each representing one postal code within the specified state. So, we need to Snowflake (partially normalize) the Geography dimension so that it will support the grain of each of our fact source tables, giving us two dimension tables similar to the the following two bullet lists:
dimGeography
:
skGeography
PostalCode
CityID
CityName
skGeographyState
and dimGeographyState
:
skGeographyState
StateID
StateName
CountryID
CountryName
Notice that we did not fully normalize the dimension (postal code and city both exist in the first table, state and country in the second). We just normalized the dimension enough to give us a single relationship between each of our two facts and this dimension.