The focus of this chapter is how to design a data warehouse specifically for Analysis Services. There are numerous books available that explain the theory of dimensional modeling and data warehouses; our goal here is not to discuss generic data warehousing concepts, but to help you adapt the theory to the needs of Analysis Services.
In this chapter, we will touch on just about every aspect of data warehouse design, and mention several subjects that cannot be analyzed in depth in a single chapter. We will cover some of these subjects in detail, such as Analysis Services cube, and dimension design, in later chapters. Other aspects, which are outside the scope of this book, will require further research on the part of the reader.
We try to answer this question by analyzing the different kinds of databases we encounter in our search for the best source for our cube. In the process of doing so, we describe the basics of dimensional modeling, as well as some of the competing theories on how data warehouses should be designed.
Typically, you create a BI solution when business users want to analyze, explore, and report on their data in an easy and convenient way. The data itself may be composed of thousands, millions, or even billions of rows, normally kept in a relational database built to perform a specific business purpose. We refer to this database as the On Line Transactional Processing (OLTP) database.
The OLTP database can be a legacy mainframe system, a CRM system, an ERP system, a general ledger system, or any kind of database that a company uses in order to manage their business.
Sometimes, the source OLTP may consist of simple flat files generated by processes running on a host. In such a case, the OLTP is not a real database, but we can still turn it into one by importing the flat files into a SQL Server database for example. Therefore, regardless of the specific media used to store the OLTP, we will refer to it as a database.
The OLTP system is normally a complex piece of software that handles information and transactions. From our point of view, though, we can think of it simply as a database.
We do not normally communicate in any way with the application that manages and populates the data in the OLTP. Our job is that of exporting data from the OLTP, cleaning it, integrating it with data from other sources, and loading it into the data warehouse.
We cannot make any assumptions, such as a guarantee in the schema structure over time, about the OLTP database's structure.
Somebody else has built the OLTP system and is probably currently maintaining it, so its structure may change over time. We do not usually have the option of changing anything in its structure anyway, so we have to take the OLTP system "as is" even if we believe that it could be made better.
The OLTP may well contain data that does not conform to the general rules of relational data modeling, such as foreign keys and constraints.
Normally in the OLTP system, you find historical data that is not correct. This is almost always the case. A system that runs for years very often has data that is incorrect and never will be corrected.
When building a BI solution we have to clean and fix this data, but normally it would be too expensive and disruptive to do this for old data in the OLTP system itself.
In our experience, the OLTP system is very often poorly documented. Our first task is, therefore, that of creating good documentation for the system, validating data, and checking it for any inconsistencies.
The OLTP database is not built to be easily queried for analytical workloads, and is certainly not going to be designed with Analysis Services cubes in mind. Nevertheless, a very common question is: "do we really need to build a dimensionally modeled data mart as the source for an Analysis Services cube?", and the answer is a definite "yes"!
As we'll see, the structure of a data mart is very different from the structure of an OLTP database and Analysis Services is built to work on data marts, not on generic OLTP databases. The changes that need to be made when moving data from the OLTP database to the final data mart structure should be carried out by specialized ETL software, such as SQL Server Integration Services, and cannot simply be handled by Analysis Services in the Data Source View.
Moreover, the OLTP database needs to be efficient for OLTP queries. OLTP queries tend to be very fast on small chunks of data, in order to manage everyday work. If we run complex queries ranging over the whole OLTP database, as BI-style queries often do, we will create severe performance problems for the OLTP database. There are very rare situations in which data can flow directly from the OLTP through to Analysis Services, but these are so specific that their description is outside the scope of this book.
Beware of the temptation to avoid building a data warehouse and data marts. Building an Analysis Services cube is a complex job that starts with getting the design of your data mart right. If we have a dimensional data mart, we have a database that holds dimension and fact tables that can contain transformed, cleansed, and calculated information. If, on the other hand, we rely on the OLTP database, we might finish our first cube in less time, but our data will be dirty, inconsistent, and unreliable, and cube processing will be slow. In addition, we will not be able to create complex relational models to accommodate our users' analytical needs.
We always have an OLTP system as the original source of our data, but when it comes to the data warehouse, it is can be difficult to answer the apparently simple question of whether we have a data warehouse. The problem is not the answer, as every analyst will happily reply "Yes, we do have a data warehouse"; the problem is in the meaning of the term "data warehouse".
There are at least two major approaches to data warehouse design and development and consequently, to the definition of what a data warehouse is. They are described in the books of the following two leading authors:
Ralph Kimball: If we are building a Kimball data warehouse, we build fact tables and dimension tables structured as data marts. We will end up with a data warehouse composed of the sum of all the data marts.
Bill Inmon: If our choice is that of an Inmon data warehouse, then we design a (somewhat normalized) physical relational database that will hold the data warehouse. Afterwards, we produce departmental data marts with their star schemas populated from that relational database.
If this were a book about data warehouse methodology, then we could write hundreds of pages about this topic, but luckily for the reader, the detailed differences between the Inmon and Kimball methodologies are out of the scope of this book. Readers can find out more about these methodologies in Building the Data Warehouse by Bill Inmon and The Data Warehouse Toolkit by Ralph Kimball. Both books should be present on any BI developer's bookshelf.
A picture is worth a thousand words when trying to describe the differences between the two approaches. In Kimball's bus architecture, data flows from the OLTP through to the data marts, as shown in the following diagram:
In contrast, in Inmon's view, data coming from the OLTP systems needs to be stored in the enterprise data warehouse and from there, goes to the data marts, as shown in the following diagram:
What is important is to understand is that the simple phrase "data warehouse" has different meanings in each of these methodologies. Finally, it is worth mentioning that in the recent years several authors proposed combined methodologies (see, for an example, http://www.sqlbi.com/articles/sqlbi-methodology).
We will adopt Inmon's meaning for the term data warehouse. This is because in Inmon's methodology, the data warehouse is a real database, while in Kimball's view, the data warehouse is composed of integrated data marts. For the purposes of this chapter, though, what is really important is the difference between the data warehouse and the data mart. At the end, the source of an SSAS cube needs to be a data mart, maybe composed of many fact tables and dimensions, but definitely a data mart.
Whether you are using the Kimball or Inmon methodology, the frontend database just before the Analysis Services cube should be a data mart. A data mart is a database that is modeled according to the rules of Kimball's dimensional modeling methodology, and is composed of fact tables and dimension tables.
As a result, we'll spend a lot of time discussing data mart structure in the rest of this chapter. However, you will not learn how to build and populate a data mart from reading this chapter; the books by Kimball and Inmon we've already cited do a much better job than we ever could.
Nevertheless, if you are reading this book, it means you are using Analysis Services and so you will need to design your data marts with specific features of Analysis Services in mind. This does not mean you should completely ignore the basic theory of data warehouse design and dimensional modeling, but instead, adapt the theory to the practical needs of the product you are going to use as the main interface for querying the data.
At the core of the data mart structure is the separation of the entire database into two distinct types of data structures:
Dimension: A dimension is the major analytical object in the BI space. A dimension can be a list of products or customers, time, geography, or any other entity used to analyze numeric data. Dimensions are stored in dimension tables.
Dimensions have attributes. An attribute of a product may be its color, its manufacturer, or its weight. An attribute of a date may be simply its weekday or its month.
Dimensions have both natural and surrogate keys. The natural key is the original product code, customer ID, or real date. The surrogate key is a new unique integer number used in the data marts as a key that joins fact tables to dimension tables. Also, the surrogate key need to be meaningless and relate only to the data warehouse.
Dimensions have relationships with facts. Their reason for being is to add qualitative information to the numeric information contained in the facts. Sometimes, a dimension might have a relationship with other dimensions, but directly or indirectly it will always be related to facts in some way.
Fact: A fact is something that has happened or has been measured. A fact may be the sale of a single product to a single customer or the total amount of sales of a specific item during a month. From our point of view, a fact is a numeric value that users would like to aggregate in different ways for reporting and analysis purposes. Facts are stored in fact tables.
We normally relate a fact table to several dimension tables, but we do not relate fact tables directly with other fact tables. Facts and dimensions are related via surrogate keys. This is one of the foundations of Kimball's methodology.
When you build an Analysis Services solution, you build Analysis Services dimension objects from the dimension tables in your data mart and cubes on top of the fact tables. The concepts of facts and dimensions are so deeply ingrained in the architecture of Analysis Services that you are effectively obliged to follow dimensional modeling methodology if you want to use Analysis Services at all.
When we define dimension tables and fact tables and create joins between them, we end up with a star schema. At the center of a star schema there is always a fact table. As the fact table is directly related to dimension tables, if we place these dimensions around the fact table, we get something resembling a star shape, as shown in the following diagram:
In the preceding diagram, we can see that there is one fact table,
FactInternetSales, and four dimension tables directly related to the fact table. Looking at this diagram, we can easily understand that a customer buys a product with a specific currency and that the sale takes place in a specific sales territory. Star schemas have a useful characteristic; that they are easily understandable by anybody at first glance.
Moreover, while the simplicity for human understanding is very welcome, the same simplicity helps Analysis Services understand and use star schemas. If we use star schemas, Analysis Services will find it easier to recognize the overall structure of our dimensional model and help us in the cube design process. On the other hand, snowflakes, described later, are harder both for humans and for Analysis Services to understand, and we're much more likely to find that we make mistakes during cube design—or that Analysis Services makes incorrect assumptions when setting properties automatically—the more complex the schema becomes.
It is not always easy to generate star schemas; sometimes we need (or inexperience causes us) to create a more complex schema that resembles that of a traditional, normalized relational model. Look at the same data mart when we add the
Geography dimension, as shown in the following diagram:
The snowflake schema is nothing but a star schema complicated by the presence of intermediate tables and joins between dimensions. The problem with snowflakes is that reading them at first glance is not so easy. Try to answer the following two questions:
Geographydimension be reached from
What does the
Is it a denormalization of the more complex relationship through
Is it a completely separate key added during ETL?
DimGeographyis not used to create a new dimension, but is being used to add geographic attributes to the
DimSalesTerritoryis not the territory of the customer, but the territory of the order, added during the ETL phase
The problem is that, in order to answer these questions, we would have to search through the documentation or the ETL code to discover the exact meaning of the fields.
So, the simplicity of the star schema is lost when we switch from a star schema to a snowflake schema. Nevertheless, sometimes snowflakes are necessary, but it is very important that—when a snowflake starts to appear in our project—we explain how to read the relationships and what the fields mean.
It might be the case that a snowflake design is mandatory, due to the overall structure of the data warehouse or due to the complexity of the database structure. In this case, we have basically the following options to present a star schema to SSAS:
We can use views to transform the underlying snowflake into a star schema. Using views to join tables, it's possible to hide the snowflake structure, persist our knowledge of how the tables in the snowflake schema should be joined together, and present to Analysis Services a pure star schema. This is—in our opinion—the best approach.
We can use Analysis Services to model
We can build Analysis Services dimensions from a set of snowflaked tables. This can have some benefits since it makes it easier for the Dimension Wizard to set up optimal attribute relationships within the dimension, but on the other hand as we've already noted, it means we have to remember which columns join to each other every time we build a dimension from these tables. It's very easy to make mistakes when working with complex snowflakes and to get the error message, the '[tablename]' table that is required for a join cannot be reached based on the relationships in the Data Source View when you try to process the dimension.
We can leave the snowflake schema in place and create one Analysis Services dimension for each table, and then use referenced relationships to link these dimensions back to the fact table. Even if this solution seems an interesting one, in our opinion, it is the worst.
First of all, the presence of reference dimensions may lead, as we will discuss later, to performance problems either during cube processing or during querying. Additionally, having two separate dimensions in the cube does not give us any benefits in the overall design and may make it less user-friendly. The only case where this approach could be advisable is when the dimension is a very complex one; in this case it might be useful to model it once and use reference dimensions where needed. There are some other situations where reference dimensions are useful, but they are rarely encountered.
In all cases, the rule of thumb we recommend and use is the same: keep it simple! This way we'll make fewer mistakes and find it easier to understand our design.
At the end of the dimensional modeling process, we often end up with some attributes that do not belong to any specific dimension. Normally these attributes have a very limited range of values (perhaps three or four values each, sometimes more) and they seem to be not important enough to be considered dimensions in their own right, although obviously we couldn't just drop them from the model altogether.
We have two choices, as shown:
Create a very simple dimension for each of these attributes. This will lead to rapid growth in the number of dimensions in the solution; something the users will not like because it makes the cube harder to use.
Merge all these attributes in a so-called junk dimension. A junk dimension is simply a dimension that merges together attributes that do not belong anywhere else and share the characteristic of having only a few distinct values each.
If you join several small dimensions into a single junk dimension, you will reduce the number of fields in the fact table and the number of required joins at query time. For a fact table of several million rows, this can represent a significant reduction in the amount of space used and the time needed for cube processing.
Reducing the number of dimensions will mean Analysis Services performs better during the aggregation design process and during querying, thereby improving the end user experience.
The end user will never like a cube with 30 or more dimensions; it will be difficult to use and to navigate. Reducing the number of dimensions will make the cube less intimidating.
However, there is one big disadvantage in using a junk dimension; whenever you join attributes together into a junk dimension, you are clearly stating that these attributes will never have the rank of a fully-fledged dimension. If you ever change your mind and need to break one of these attributes out into a dimension on its own you will not only have to change the cube design, but also reprocess the entire cube and run the risk that any queries and reports the users have already created will become invalid.
Degenerate dimensions are created when we have columns on the fact table that we want to use for analysis, but which do not relate to any existing dimension. Degenerate dimensions often have almost the same cardinality of the fact table; a typical example is the transaction number for a point of sale data mart. The transaction number may be useful for several reasons, for example, to calculate a "total sold in one transaction" measure. Moreover, it might be useful if we need to go back to the OLTP database to gather other information. However, even if it is feature that is requested often, users should not be allowed to navigate sales data using a transaction number because the resulting queries are likely to bring back enormous amounts of data and run very slowly. Instead, if the transaction number is ever needed as a dimension attribute, it should be displayed in a specifically-designed report that shows the contents of a small number of transactions.
Keep in mind that, even though the literature often discusses degenerate dimensions as separate entities, it is often the case that a big dimension might have some standard attributes and some degenerate ones. In the case of the transaction number, we might have a dimension holding both the transaction number and the point of sale number. The two attributes live in the same dimension, but one is degenerate (the transaction number) and one is a standard one (the POS number). Users might be interested in slicing sales by POS number and they would expect good performance when they did so; however, they should not be encouraged to slice by transaction number due to the cardinality of the attribute.
From an Analysis Services point of view, degenerate dimensions are no different than any other dimension. The only area to pay attention to is the design of the attributes. Degenerate attributes should not be made query-able to the end user (you can do this by setting the attribute's
AttributeHierarchyEnabled property to
False) for the reasons already mentioned. Also, for degenerate dimensions that are built exclusively from a fact table, Analysis Services has a specific type of dimension relationship type called
Fact. Using the
Fact relationship type will lead to some optimizations being made to the SQL generated if Relational Online Analytical Processing (ROLAP) storage is used for the dimension.
Dimensions change over time. A customer changes his or her address, a product may change its price or other characteristics, and—in general—any attribute of a dimension might change its value. Some of these changes are useful to track while some of them are not; working out which changes should be tracked and those which shouldn't can be quite difficult though.
Changes do not happen very often. If they do, then we might be better off splitting the attribute off into a separate dimension. If the changes happen rarely, then a technique known as Slowly Changing Dimensions (SCD) is the solution and we need to model this into our dimensions.
Type 1: We maintain only the last value of each attribute in the dimension table. If a customer changes address, then the previous one is lost and all the previous facts will be shown as if the customer always lived at the same address (new address).
Type 2: We create a new record in the dimension table whenever a change happens. All previous facts will still be linked to the old record. Thus, in our customer address example, the old facts will be linked to the old address and the new facts will be linked to the new address.
Type 3: If what we want is simply to know the "last old value" of a specific attribute of a dimension, we can add a field to the dimension table in order to save just the "last value of the attribute" before updating it. In the real world, this type of dimension is used very rarely.
The SCD type used is almost never the same across all the dimensions in a project. We will normally end up with several dimensions of Type 1 and occasionally with a couple of dimensions of Type 2. Also, not all the attributes of a dimension have to have the same SCD behavior. History is not usually stored for the date of birth of a customer, if it changes, since the chances are that the previous value was a mistake. On the other hand, the address of the same customer is likely to be an SCD attribute as it's likely we'll want to track changes for it. Finally, there may be the need to use the same dimension with different slowly changing types in different cubes, as shown. Handling these changes will inevitably make our ETL more complex:
Type 1 dimensions are relatively easy to handle and to manage. Each time we detect a change, we apply it to the dimension table in the data mart and that is all the work we need to do.
Type 2 dimensions are more complex. When we detect a change, we invalidate the old record by setting its "end of validity date" and insert a new record with the new values. As all the new data will refer to the new record, it is simple to use in queries. We should have only one valid record for each entity in the dimension.
The modeling of SCDs in Analysis Services will be covered later, but in this theoretical discussion, it might be interesting to spend some time on the different ways to model Type 2 SCDs in the relational data mart.
A single dimension will hold attributes with different SCD types since not all the attributes of a single dimension will need to have historical tracking. So, we will end up with dimensions with some Type 1 attributes and some Type 2 attributes. How do we model that in the data mart?
We basically have the following choices:
We can build two dimensions; one containing the historical values and one containing the current values. The historical dimension would contain only the historical values of attributes, while the actual dimension would contain all the current attribute values. Obviously, we will need two different dimensional tables in the data mart to hold the two dimensions.
This solution is very popular and is easy to design but has some serious drawbacks:
The number of dimensions in the cube is much larger. If we have several Type 2 dimensions, the number of dimensions might reach the point where we have usability problems.
In order to query for the actual value and the historical value of an attribute, Analysis Services needs to resolve the relationship between the two dimensions via the fact table, and for very big fact tables, this might be very time-consuming. This issue is not marginal because, if we give users the actual and historical values of an attribute, they will always want to mix the two attributes in a single query to check how the actual value has been changing over time.
We can build a complex dimension holding both the actual and historical values in a single dimensional table. This solution will lead to a much more complex ETL to build the dimensional table, but solves the drawbacks of the previous solution. For example, having both the current and historical attribute values in a single dimension can lead to better query performance when comparing these two values, because the query can be resolved at the dimension level and does not need to cross the fact table. Also, as we've stressed several times already, having fewer dimensions in the cube makes it much more user-friendly.
We can use the terms bridge table and factless fact table interchangeably—they both refer to the same thing; a table that is used to model a many-to-many relationship between two dimension tables. Since the name factless fact table can be misleading, and even if the literature often refers to these tables as such, we prefer the term bridge table instead.
All fact tables represent many-to-many relationships between dimensions, but for bridge tables, the recording of this relationship is their only reason to exist. They do not contain any numeric columns—facts—that can be aggregated (hence the use of the name factless fact table). Regular fact tables generate many-to-many relationships as a side effect, as their reason for being is the nature of the fact, not of the relationship.
Now, let us see an example of a bridge table. Consider the following situation in an OLTP database:
In any given period of time, a product can be sold on special offer. The bridge table (
SpecialOfferProduct) tells us which products were on special offer at what times, while the
SpecialOffer table tells us information about the special offer itself: when it started, when it finished, the amount of discount, and so on.
A common way of handling this situation is to denormalize the special offer information into a dimension directly linked to the fact table, so we can easily see whether a specific sale was made under special offer or not. In this way, we can use the fact table to hold both the facts and the bridge. Nevertheless, bridge tables offer a lot of benefits, and in situations such as this, they are definitely the best option. Let's take a look at the reasons why.
It is interesting to consider whether we can represent the relationship in the preceding example only using fact tables (that is, storing three types of data for each sale: product, sale, and special offer) or whether a bridge table is necessary. While the first option is certainly correct, we need to think carefully before using it because if we do use it, all data on special offers that did not generate any sales will be lost. If a specific special offer results in no product sales, then the fact table is not storing the relationship between the special offer and the product anywhere—it will be exactly as though the product had never been on special offer. The fact table does not contain any data that defines the relationship between the special offers and the products, it only knows about this relationship when a sale is made. This situation may lead to confusion or incorrect reports. We always need to remember that the absence of a fact may be as important as its presence. Indeed, sometimes the absence of a fact is more important than its presence.
We recommend using bridge tables to model many-to-many relationships that do not strictly depend on facts to define the relationship. The relationships modeled by many-to-many relationships are often not bound to any fact table and exist regardless of any fact table. This shows the real power of bridge tables, but as always, the more power we have the bigger our responsibilities will be, and bridge tables will inevitably sometimes cause us headaches.
Bridge tables are modeled in Analysis Services as measure groups that act as bridges between different dimensions, through the many-to-many dimension relationship type, one of the most powerful features of Analysis Services. This feature will be analyzed in greater detail in Chapter 6, Adding Calculations to the Cube.
Now that we have defined what a fact table is, let us go deeper and look at the two main types: transaction fact tables and snapshots.
A transaction fact table records an event, and for that event, certain measurements can be made. When we record a sale, for example, we create a new row in the transaction fact table that contains information relating to the sale, such as what product was sold, when the sale took place, what the value of the sale was, and so on.
A snapshot is the recording of a specific situation at a point in time. If we record in a fact table the total amount of sales for each specific product, we are not recording an event, but a specific situation. Snapshots can also be useful when we want to measure something not directly related to any other fact. If we want to rank out customers based on sales or payments, for example, we may want to store snapshots of this data in order to analyze how these rankings change over time in response to marketing campaigns.
Using a snapshot table containing aggregated data instead of a transaction table can drastically reduce the number of rows in our fact table, which in turn leads to smaller cubes, faster cube processing, and faster querying. The price we pay for this is the loss of any information that can only be stored at the transaction level and cannot be aggregated up into the snapshot, such as the transaction number data we encountered when discussing degenerate dimensions. Whether this is an acceptable price to pay is a question only the end users can answer.
In an ideal world, data that is stored in the data warehouse would never change. Some books suggest that we should only support insert operations in a data warehouse, not updates; data comes from the OLTP, is cleaned, and is then stored in the data warehouse until the end of time, and should never change because it represents the situation at the time of insertion.
Nevertheless, the real world is somewhat different to the ideal one. While some updates are handled by the slowly changing dimension techniques already discussed, there are other kinds of updates needed in the life of a data warehouse. In our experience, these other types of update in the data warehouse are needed fairly regularly and are of the following two main kinds:
Structural updates: When the data warehouse is up and running, we will need to perform updates to add information such as new measures or new dimension attributes. This is normal in the lifecycle of a BI solution.
Data updates: We need to update data that has already been loaded into the data warehouse, because it is wrong. We need to delete the old data and enter the new data, as the old data will inevitably lead to confusion. There are many reasons why bad data comes to the data warehouse; the sad reality is that bad data happens and we need to manage it gracefully.
Now, how do these kinds of updates interact with fact and dimension tables? Let's summarize briefly what the physical distinctions between fact and dimension tables are:
Dimension tables are normally small, usually with less than 1 million rows, and very frequently much less than that.
Fact tables are often very large; they can have up to hundreds of millions or even billions of rows. Fact tables may be partitioned, and loading data into them is usually the most time-consuming operation in the whole of the data warehouse.
Structural updates on dimension tables are very easy to make, provided that they can be incorporated in the existing SCD type without affecting granularity. You simply update the table with the new metadata, make the necessary changes to your ETL procedures, and the next time they are run the dimension will reflect the new values. If your users decide that they want to analyze data based on a new attribute on, say, the
Customer dimension, then the new attribute can be added for all of the customers in the dimension. Moreover, if the attribute is not present for some customers, then they can be assigned a default value; after all, updating one million rows is not a difficult task for SQL Server or any other modern relational database. However, even if updating the relational model is simple, the updates need to go through to Analysis Services and this might result in the need for a full process of the dimension and therefore the cube, which might be very time consuming.
On the other hand, structural updates may be a huge problem on fact tables. The problem is not that of altering the metadata, but determining and assigning a default value for the large number of rows that are already stored in the fact table. It's easy to insert data into fact tables. However, creating a new field with a default value would result in an
UPDATE command that will probably run for hours and might even bring down your database server. Worse, if we do not have a simple default value to assign, then we will need to calculate the new value for each row in the fact table, and so the update operation will take even longer. We have found that it is often better to reload the entire fact table rather than perform an update on it. Of course, in order to reload the fact table, you need to have all of your source data at hand and this is not always possible.
Data updates are an even bigger problem still, both on facts and dimensions. Data updates on fact tables suffer from the same problems as adding a new field. Often, the number of rows that we need to update is so high that running even simple SQL commands can take a very long time.
Data updates on dimensions can be a problem because they may require very complex logic. Suppose we have a Type 2 SCD and that a record was entered into the dimension table with incorrect attribute values. In this situation, we would have created a new record and linked all the facts received after its creation to the new (and incorrect) record. Recovering from this situation requires us to issue very precise
UPDATE statements to the relational database and to recalculate all the fact table rows that depend—for any reason—on the incorrect record. Bad data in dimensions is not very easy to spot, and sometimes several days—if not months—pass before someone (in the worst case the user) discovers that something went wrong.
There is no good recipe for stopping bad data getting into your data warehouse. When it happens, we need to be ready to spend a long time trying to recover from the error. It's worth pointing out that data warehouses or data marts that are rebuilt each night ("one shot databases") are not prone to this situation because, if bad data is corrected, the entire data warehouse can be reloaded from scratch and the problem fixed very quickly. This is one of the main advantages of "one shot" data warehouses, although of course they do suffer from several disadvantages too, such as their limited ability to hold historic data.
In Kimball's view of a data mart, all the natural keys remain part of the dimension although they are no longer the primary key, nor are they used to join to the fact. Instead, you should use as table identifiers surrogate keys, that is, simple integer values with the only scope to connect facts and dimension. This gives us complete freedom in the data mart to add to or redefine a natural key's meaning and, importantly, the usage of the smallest possible integer type for surrogate keys will lead to a smaller fact table.
All this is very good advice. Nevertheless, there are situations in which the rules surrounding the usage of surrogate keys should be relaxed or—to put it another way—there can be times when it's useful to make the surrogate keys meaningful instead of meaningless. Let's consider some of the times when this might be the case:
Date: We can use a meaningless key as a surrogate key for the
Datedimension. However, is there any point in doing so? In our opinion, the best representation of a date surrogate key is an integer in the form YYYYMMDD, so 20080109 represents January 9, 2008. Note that even the Kimball Group, writing in the book The Microsoft Data Warehouse Toolkit, accept that this can be a good idea. The main reason for this is that it makes SQL queries that filter by date much easier to write and much more readable; we very often want to partition a measure group by date, for instance. The reason that it's safe to do this is that the
Datedimension will never change. You might add some attributes to a
Datedimension table and you might load new data into it, but the data that is already there should never need to be altered.
All invalid dates may be easily represented with negative numbers, so -1 may be the unknown date, -2 may be the empty date, and so on. We will have plenty of space for all the dummy dates we will ever need. A word of warning about the type of the key: we sometimes face situations where the
DateTimetype has been used for the key of the
Datedimension. This is absolutely the wrong thing to do, as not only is a
DateTimerepresentation going to be bigger than the
DateTimetype does not let us add dummy values to the dimension easily.
If we define a
Range Amountdimension and an
ID_RangeAmountkey in the fact table, this will solve our modeling problem. However, what will happen when the customer wants to change the ranges? We will have to re-compute the whole fact table because the
ID_RangeAmountkey will become useless.
On the other hand, if you decide that $100 will be the granularity of the range dimension, you can use
ID_RangeAmountand, in doing so, you will be able to update the attributes of the
RangeAmountdimension that will lead to hierarchies without updating the fact table. The advantages of doing this are discussed in more detail in the following blog entry: http://tinyurl.com/rangedims, and we discuss how to model this in Analysis Services in Chapter 4, Measures and Measure Groups.
The surrogate key of a dimension will be surfaced in MDX in the unique names generated for members on the key attribute of the dimension. Since all Analysis Services client tools save their queries using these unique names, this means that if—for any reason—we change the surrogate key of a member in the future (for example, because we have a "one shot" data warehouse that is reloaded each night), the saved queries will no longer work.
Junk dimensions: Junk dimensions, when defined at the conformed dimension level, can cause problems when there is the need to update data. If, for example, we need a new attribute on a junk dimension, this can involve a change in the meaning of the existing members of the junk dimension. This will invalidate the older facts, requiring a careful rebuilding of the junk dimension. In our opinion, it is better to maintain the junk dimensions as separate dimensions in the relational data mart and then merge them into a single dimension for Analysis Services, creating a key attribute that uses a composite key made up of the keys of all the dimension tables that make up the junk dimension.
Therefore, the conclusion is that although surrogate keys are very useful and we are not saying that there is something wrong with them, in some well-defined situations it makes sense to deviate from the standard recommendations for surrogate keys and use different forms instead.
The last consideration is that—even in cases where we deviate from the standard—the usage of surrogate keys of the smallest integer type possible is always strongly advised. The Analysis Services engine is optimized for the handling of integer values and does not handle string and date values anywhere near as well.
When designing a data mart, questions often arise about the relationship between the fact table and the dimensions. Should the foreign keys be NULLable or not? Should we use the built-in foreign keys of SQL Server to handle the relationship? What about key errors?
Since these are very interesting topics, let us discuss them in more detail:
Can we use NULLable foreign keys columns? The answer is definitely no. First of all, NULLable columns take space in the fact table, one bit for each NULLable column. So, if we have two
TINYINTforeign keys in the fact table, we will end up consuming three bytes: two for the columns and one for the two null bits. Since fact tables need to be as small as possible, NULLable columns definitely waste space.
Moreover, when the data is moved into the cube, there will be no relationship between the dimension and the facts containing the
NULLkey, leading to either processing errors or situations where the user will see partial data. It is much better to add a new member to the dimension, and relate the facts with missing key values to that member in our ETL. Although the Unknown Members feature in Analysis Services does this for us, we will have more flexibility if we handle the issue ourselves.
Should we use SQL Server
FOREIGN KEYS? The correct technique to adopt here is as follows: we should define the foreign keys in the data model and activate them during debugging, in order to detect errors. When the system is in production, the foreign key can be disabled in order to speed up the ETL code.
Key errors: There should be no key errors at all. If we enable foreign key checking, then we will end up with a table without key errors. This is important; even though we can leverage Analysis Services' key error handling functionality during processing, we advise not using it because the presence of a key error should be detected and resolved during the ETL process, as with NULL keys.
Apart from the issue of modeling data in an appropriate way for Analysis Services, it's also important to understand how details of the physical implementation of the relational data mart can be significant too.
All of the dimension and fact tables you intend to use should exist within the same relational data source, so for example, if you're using SQL Server, this means all the tables involved should exist within the same SQL Server database. If you create multiple data sources within Analysis Services, then you'll find that one is treated as the primary data source; this has to point to an instance of SQL Server (either SQL Server 2000 or above) and all data from other data sources is fetched via the primary data source using the SQL Server
OPENROWSET function, which can lead to severe processing performance problems.
When we design data marts, we need to be aware that Analysis Services does not treat all data types the same way. The cube will be much faster, for both processing and querying, if we use the right data type for each column. Here, we provide a brief table that helps us during the design of the data mart, to choose the best data type for each column type:
Fact column type
Fastest SQL Server data types
(Note that decimal and vardecimal require more CPU power to process than money and float types)
Distinct count columns
(If your count column is
Clearly, we should always try to use the smallest data type that will be able to hold any single value within the whole range of values needed by the application.
This is the rule for relational tables. However, you also need to remember that the equivalent measure data type in Analysis Services must be large enough to hold the largest aggregated value of a given measure, not just the largest value present in a single fact table row.
Always remember that there are situations in which the rules must be overridden. If we have a fact table containing 20 billion rows, each composed of 20 bytes and a column that references a date, then it might be better to use a
SMALLINT column for the date if we find a suitable representation that holds all necessary values. We will gain 2 bytes for each row, and that means a 10 percent reduction in the size of the whole table.
When Analysis Services needs to process a cube or a dimension, it sends queries to the relational database in order to retrieve the information it needs. Not all the queries are simple
SELECTs, there are many situations in which Analysis Services generates complex queries. Even if we do not have space enough to cover all scenarios, we're going to provide some examples relating to SQL Server, and we advise the reader to have a look at the SQL queries generated for their own cube to check whether they can be optimized in some way.
During dimension processing, Analysis Services sends several queries, one for each attribute of the dimension, in the form of
SELECT DISTINCT ColName, where
is the name of the column holding the attribute.
Many of these queries are run in parallel (exactly which ones can be run in parallel depends on the attribute relationships defined on the Analysis Services dimension), so SQL Server will take advantage of its cache system and perform only one physical read of the table, so that all successive scans are performed from memory. Nevertheless, keep in mind that the task of detecting the
DISTINCT values of the attributes is done by SQL Server, not Analysis Services.
We also need to be aware that, if our dimensions are built from complex views, they might confuse the SQL Server engine which will not be able to perform at its best. If, for example, we add a very complex
WHERE condition to our view, then the condition will be evaluated more than once. We have personally seen a situation where the processing of a simple
Time dimension with only a few hundred rows, which had a very complex
WHERE condition, took tens of minutes to complete. If the views uses to feed a dimension are too complex, this normally means that the ETL phase is not populating the dimensions in the right way and it would be better to review the ETL and produce physical tables which are easier to query during cube processing.
If a dimension contains attributes that come from a joined table, the
JOIN is performed by SQL Server, not Analysis Services. This situation arises very frequently when we define snowflakes instead of simpler star schemas. Since some attributes of a dimension are computed by taking their values from another dimension table, Analysis Services will send a query to SQL Server containing the
INNER JOIN between the two tables.
As long as all the joins are made on the primary keys, this will not lead to any problems, but in cases where the
JOIN is not made on the primary key, bad performance might result. As we said before, if we succeed in the goal of exposing a simple star schema to Analysis Services, we will never have to handle these
JOINs. As we argue next, if a snowflake is really needed, we can still hide it from Analysis Services using views, and in these views we will have full control over and knowledge of the complexity of the query used.
Reference dimensions, when present in the cube definition, will lead to one of the most hidden and most dangerous types of
JOIN. When we define the relationship between a dimension and a fact table, we can use the
Referenced relationship type and use an intermediate dimension to relate the dimension to the fact table. Reference dimensions often appear in the design due to snowflakes or due to the need to reduce fact table size.
A referenced dimension may be materialized or not. If we decide to materialize a reference dimension (as SQL Server Data Tools will suggest) the result is that the fact table query will contain a
JOIN to the intermediate dimension, to allow Analysis Services to get the value of the key for the reference dimension.
JOINs are a problem with dimension queries, they are a serious problem with fact queries. It might be the case that SQL Server needs to write a large amount of data to its temporary database before returning information to Analysis Services. It all depends on the size of the intermediate table and the number of reference dimensions that appear in the cube design.
We are not going to say that referenced dimensions should not be used at all, as there are a few cases where reference dimensions are useful, and in the following chapters we will discuss them in detail. Nevertheless, we need to be aware that reference dimensions might create complex queries sent to SQL server and this can cause severe performance problems during cube processing.
The processing of dimensions related to measure group with a fact relationship type, usually created to hold degenerate dimensions, is performed in the same way as any other dimension. This means that
SELECT DISTINCT will be issued on all the degenerate dimension's attributes.
Clearly, as the dimension and the fact tables are the same, the query will ask for
DISTINCT over a fact table; given that fact tables can be very large, the query might take a long time to run. Nevertheless, if a degenerate dimension is needed and it is stored in a fact table, then there is no other choice but to pay the price with this query.
The last kind of query that we need to be aware of is when we have a measure group containing a
DISTINCT COUNT measure. In this case, due to the way Analysis Services calculates distinct counts, the query to the fact table will be issued with
ORDER BY for the column we are performing the distinct count on.
Needless to say, this will lead to very poor performance because we are asking SQL Server to sort a fact table on a column that is not part of the clustered index (usually, the clustered index is built on the primary key). The pressure on the temporary database will be tremendous and the query will take a lot of time.
There are some optimizations, mostly pertinent to partitioning, that need to be done when we have
DISTINCT COUNT measures in very big fact tables. What we want to point out is that in this case a good knowledge of the internal behavior of Analysis Services is necessary in order to avoid bad performance when processing.
The usage of indexes in data mart is a very complex topic and we cannot cover it all in a simple section. Nevertheless, there are a few general rules that can be followed for both fact and dimension tables.
Non-clustered indexes may be added for the natural key, in order to speed up the ETL phase for Slowly Changing Dimensions. The key might be composed of the natural key and the slowly changing dimension date of insertion. These indexes might be defined as
UNIQUE, but like any other constraint in the data mart, the uniqueness should be enforced in development and disabled in production.
It is questionable whether fact tables should have a primary key or not. We prefer to have a primary clustered key based on an integer field, because it makes it very simple to identify a row in the case where we need to check for its value or update it.
In the case where the table is partitioned by date, the primary key will be composed of the date and the integer surrogate key, to be able to meet the needs of partitioning.
If a column is used to create a
DISTINCT COUNT measure in a cube, then it might be useful to have that column in the clustered index, because Analysis Services will request an
ORDER BY clause on that column during the process of the measure group. It is clear that the creation of a clustered index is useful in large cubes where data is added incrementally, so processing will benefit from the ordered data. If on the other hand, we have a one-shot solution where all tables are reloaded from scratch and the cube is fully processed, then it is better to avoid the creation of a clustered index since the sorting of the data is performed only once, during cube processing.
Once the cube has been built, if MOLAP storage is being used, no other indexes are useful. However if the data mart is queried by other tools such as Reporting Services, or if ROLAP partitions are created in Analysis Services, then it might be necessary to add more indexes to the tables. Remember, though, that indexes slow down update and insert operations so they should be added with care. A deep analysis of the queries sent to the relational database will help to determine the best indexes to create.
Personnel and staff management
Clearly, this list is far from complete and is different for every business. SQL Server provides schemas to arrange tables and—in our experience—the usage of schemas to assign database objects to subject areas leads to a very clear database structure.
Some tables will inevitably have no place at all in any subject area, but we can always define a common subject area to hold all these tables.
A clear and consistent naming convention is good practice for any kind of relational database and a data mart is no different. As well as making the structure more readable, it will help you when you come to build your cube because SQL Server Data Tools will be able to work out automatically which columns in your dimension and fact tables should join to each other if they have the same names.
The Data Source View (DSV, from now on) is one of the places where we can create an interface between Analysis Services and the underlying relational model. In the DSV, we can specify joins between tables and we can create named queries and calculations to provide the equivalent of views and derived columns. It's very convenient for the cube developer to open up the DSV in SQL Server Data Tools and make these kind of changes.
This is all well and good, but nevertheless our opinion about the DSV is clear: it is almost too powerful and, using its features, we risk turning a clean, elegant structure into a mess. It is certainly true that there is the need for an interface between the relational model of the database and the final star schema, but we don't think it's a good idea to use the DSV for this purpose.
SQL Server gives us a much more powerful and easy-to-use tool to use instead: SQL Views. The following is a list of some of the reasons why we prefer to use views instead of the DSV:
Views are stored where we need them: When we need to read the specification of an interface, we want to be able to do it quickly. Views are stored in the database, exactly where we want them to be. If we need to modify the database, we want to be able to find all of the dependencies easily, and using views, we have a very easy way of tracking dependencies.
If we use the DSV, we are hiding these dependencies from the database administrator, the person who needs to be able to update and optimize the data mart. In addition, there are tools on the market that can analyze dependencies between table and views. It is not easy to do this if information on the joins between tables is stored outside the database.
We can easily change column names in views: In the database, we might have
SoldQtyas a field in a table. This is good because it is concise and does not contain useless spaces. In the cube, we want to show it as
Quantity Soldsimply because our user wants a more descriptive name.
Views are a very useful means of changing names when needed. In turn, with views we are publicly declaring the name change so that everybody will easily understand that a specific field with a name in one level is—in reality—a field that has another name in the previous level.
Clearly, we should avoid the practice of changing names at each level. As always, having the opportunity to do something does not mean that we need to do it.
We can perform simple calculations in views easily: If we need to multiply the value of two columns, for example,
Qty * Price, to use in a measure in our cube we have two options. We can perform the calculation in the DSV, but as earlier, we are hiding the calculation in a Visual Studio project and other people will not be able to see what we're doing easily. If we perform the calculation in a view, then other developers can reuse it, and tracking dependencies is more straightforward.
This is certainly true for simple calculations. On the other hand, if we're performing complex calculations in views then we are probably missing some transformation code in our ETL. Moreover, performing this calculation will waste time when we execute the view. Performing the calculation during ETL will mean we compute the value only once; from then it will always be available.
Views are made up of plain text: We can easily search for all the occurrences of a specific column, table, or any kind of value using a simple text editor. We do not need any specialized development tools or to dive into unreadable XML code to have a clear view of how a specific field is used.
If we need to update a view, we can do it without opening SQL Server Data Tools. This means that nearly everybody can do it, although, as it is very easy to update a view, some sort of security does need to be applied.
Views can be updated very quickly: A view can be updated very quickly as it does not require any kind of processing; we just
ALTERit and the work is done. We do not need to use an
UPDATEstatement if we want to make simple (and possibly temporary) changes to the data.
Views can reduce the number of columns we expose: There is really no need to expose more columns to a cube than it needs. Showing more columns will only lead to confusion and a chance that the wrong column will be used for something.
Views can provide default values when needed: When we have a NULLable column that contains
NULLvalues, we can easily assign a default value to it using views. We shouldn't really have a NULLable column in a data mart, but sometimes it happens.
Views can expose a star schema even if the relational model is more complex: As we've already mentioned, sometimes we end up with a relational design that is not a perfectly designed star schema. By removing unused columns, by creating joins when necessary, and in general by designing the appropriate queries, we can expose Analysis Services to a star schema, even when the relational model has a more complex structure.
Views are database objects: As views are database objects, they inherit two important properties:
We can configure security for views, and so stop unauthorized access to data very easily.
Views can belong to a schema. If we are using schemas for the definition of subject areas, we can assign views to subject areas. This will lead to a very clean project, where each object belongs to the subject area that is relevant to it.
Views can be optimized: With views we can use hints to improve performance. For example, we can use the
NOLOCKhint to avoid locking while reading from tables; although, of course, removing locking leads to the possibility of dirty reads, and it is up to the developer to decide whether doing this is a good idea or not. Moreover, we can analyze the execution path of a view in order to fine tune it. All this can be done without affecting in any way the Analysis Services project.
One very important point needs to be stressed: views should not be used as a substitute for proper ETL. Whenever views are used to feed Analysis Services, they should not contain complex calculations or
WHEREclauses as this can lead to serious processing performance and maintenance problems. We can use a view instead of ETL code for prototyping purposes, but this is very bad practice in a production system.
In this chapter, we've learned a bit about the theory of data warehouse and data mart design, and how it should be applied when we're using Analysis Services. We've found out that we definitely do need to have a data mart designed according to the principles of dimensional modeling, and that a star schema is preferable to a snowflake schema; we've also seen how certain common design problems such as Slowly Changing Dimensions, junk dimensions, and degenerate dimensions can be solved in a way that is appropriate for Analysis Services. Last of all, we've recommended the use of a layer of simple views between the tables in the data mart and Analysis Services to allow us to perform calculations, change column names and join tables, and we've found out why it's better to do this than do the same thing in the Data Source View.