Business Intelligence (BI) is currently one of the hot topics in job markets around the world. Most companies are establishing, or planning to establish, a Business Intelligence system and a data warehouse (DW), so knowledge of BI and data warehousing is in great demand. This chapter gives you an understanding of what Business Intelligence and a data warehouse are, what the main components of a BI system are, and what the steps to create a data warehouse are.
This chapter focuses on the designing of the data warehouse, which is the core of a BI system. The following chapters are about other BI components such as visualization, data integration, data governance, and so on. A data warehouse is a database designed for analysis, and this definition indicates that designing a data warehouse is different from modeling a transactional database. Designing the data warehouse is also called dimensional modeling. In this chapter, you will learn about the concepts of dimensional modeling.
Based on Gartner's definition (http://www.gartner.com/it-glossary/business-intelligence-bi/), Business Intelligence is defined as follows:
Business Intelligence is an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.
As the definition states, the main purpose of a BI system is to help decision makers to make proper decisions based on the results of data analysis provided by the BI system.
Nowadays, there are many operational systems in each industry. Businesses use multiple operational systems to simplify, standardize, and automate their everyday jobs and requirements. Each of these systems may have its own database, some of which may work with SQL Server and some with Oracle. Some of the legacy systems may work with legacy databases or even file operations. There are also systems that work through the Web via web services and XML. Operational systems are very useful in helping with day-to-day business operations, such as hiring a person in the human resources department, selling products through a retail store, and handling financial transactions.
The rising number of operational systems also adds another requirement, which is the integration of systems together. Business owners and decision makers not only need integrated data but also require an analysis of the integrated data. As an example, it is a common requirement for the decision makers of an organization to compare their hiring rate with the level of service provided by a business and the customer satisfaction based on that level of service. As you can see, this requirement deals with multiple operational systems such as CRM and human resources. The requirement might also need some data from sales and inventory if the decision makers want to bring sales and inventory factors into their decisions. As a supermarket owner or decision maker, it would be very important to understand what products in which branches were in higher demand. This kind of information helps you to provide enough products to cover demand, and you may even think about creating another branch in some regions.
The requirement of integrating multiple operational systems together in order to create consolidated reports and dashboards that help decision makers to make a proper decision is the main directive for Business Intelligence.
Some organizations and businesses use integrated ERP systems, so you may wonder whether data integration is still required when consolidated reports can be produced easily from these systems. Do such organizations still need a BI solution? The answer in most cases is yes. They might not require a separate BI system for the internal parts of their operations that are implemented through the ERP. However, they might need data from outside, for example, from another vendor's web service or from the many other protocols and channels used to send and receive information. That information requires consolidated analysis too, which brings the BI requirement back to the table.
After understanding what the BI system is, it's time to discover more about its components and understand how these components work with each other. There are also some BI tools that help to implement one or more components. The following diagram shows an illustration of the architecture and main components of the Business Intelligence system:
The BI architecture and components differ based on the tools, environment, and so on. The architecture shown in the preceding diagram contains components that are common in most of the BI systems. In the following sections, you will learn more about each component.
The data warehouse is the core of the BI system. A data warehouse is a database built for the purpose of data analysis and reporting. This purpose changes the design of this database as well. As you know, operational databases are built on normalization standards, which are efficient for transactional systems, for example, to reduce redundancy. As you probably know, a 3NF-designed database for a sales system contains many tables related to each other. So, for example, a report on sales information may consume more than 10 joined conditions, which slows down the response time of the query and report. A data warehouse comes with a new design that reduces the response time and increases the performance of queries for reports and analytics. You will learn more about the design of a data warehouse (which is called dimensional modeling) later in this chapter.
It is very likely that more than one system acts as the source of data required for the BI system. So there is a requirement for data consolidation that extracts data from different sources and transforms it into the shape that fits into the data warehouse, and finally, loads it into the data warehouse; this process is called Extract Transform Load (ETL). There are many challenges in the ETL process, out of which some will be revealed (conceptually) later in this chapter.
As the definition states, ETL is not just a data integration phase. Let's discover more about it with an example. In an operational sales database, you may have dozens of tables that provide sales transactional data. When you design that sales data into your data warehouse, you can denormalize it and build one or two tables for it. So, the ETL process should extract data from the sales database and transform it (combine, match, and so on) to fit the model of the data warehouse tables.
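The extract, transform, and load steps in this example can be sketched with Python's built-in sqlite3 module and two in-memory databases. The table and column names (Orders, OrderLines, SalesFlat) and the sample rows are assumptions made up for illustration, and this is not how any particular ETL tool works internally; it only shows normalized sales tables being flattened into one warehouse table.

```python
import sqlite3

# Hypothetical operational (source) database with normalized sales tables.
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerName TEXT, OrderDate TEXT);
    CREATE TABLE OrderLines (OrderID INTEGER, Product TEXT, Quantity INTEGER, UnitPrice REAL);
    INSERT INTO Orders VALUES (1, 'Devin', '2013-06-05'), (2, 'Mark', '2013-06-01');
    INSERT INTO OrderLines VALUES (1, 'Hat', 1, 23.0), (2, 'Hat', 2, 23.0), (2, 'Shoes', 1, 80.0);
""")

# Hypothetical data warehouse with a single denormalized sales table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE SalesFlat (OrderDate TEXT, CustomerName TEXT, Product TEXT, "
    "Quantity INTEGER, SalesAmount REAL)"
)

# Extract: read the joined rows from the source system.
extracted = source.execute("""
    SELECT o.OrderDate, o.CustomerName, l.Product, l.Quantity, l.UnitPrice
    FROM Orders AS o JOIN OrderLines AS l ON l.OrderID = o.OrderID
    ORDER BY o.OrderID, l.Product
""").fetchall()

# Transform: derive the sales amount for each order line.
transformed = [
    (order_date, customer, product, quantity, quantity * unit_price)
    for order_date, customer, product, quantity, unit_price in extracted
]

# Load: write the reshaped rows into the warehouse table.
warehouse.executemany("INSERT INTO SalesFlat VALUES (?, ?, ?, ?, ?)", transformed)
warehouse.commit()

loaded_rows = warehouse.execute("SELECT COUNT(*) FROM SalesFlat").fetchone()[0]
total_sales = warehouse.execute("SELECT SUM(SalesAmount) FROM SalesFlat").fetchone()[0]
print(loaded_rows, total_sales)
```

In a real system, each of the three steps would usually be a separate, restartable stage with logging and error handling; the sketch keeps them in one script only to make the flow visible.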
There are some ETL tools in the market that perform the extract, transform, and load operations. The Microsoft solution for ETL is SQL Server Integration Services (SSIS), which is one of the best ETL tools in the market. SSIS can connect to multiple data sources such as Oracle, DB2, text files, XML, web services, SQL Server, and so on. SSIS also has many built-in transformations to transform the data as required. Chapter 4, ETL with Integration Services, is about SSIS and how to do data transformations with this tool.
A data warehouse is designed to be the source of analysis and reports, so it works much faster than operational systems for producing reports. However, a DW is not fast enough to cover all requirements because it is still a relational database, and relational databases have many constraints that slow down the response of a query. The requirement for faster processing and a lower response time on the one hand, and aggregated information on the other, leads to the creation of another layer in BI systems. This layer, which we call the data model, contains a file-based or memory-based model of the data for producing very quick responses to reports.
Microsoft's solution for the data model is split into two technologies: the OLAP cube and the In-memory tabular model. The OLAP cube is a file-based data storage that loads data from a data warehouse into a cube model. The cube contains descriptive information as dimensions (for example, customer and product) and cells (for example, facts and measures, such as sales and discount). The following diagram shows a sample OLAP cube:
In the preceding diagram, the illustrated cube has three dimensions: Product, Customer, and Time. Each cell in the cube shows a junction of these three dimensions. For example, if we store the sales amount in each cell, then the green cell shows that Devin paid $23 for a Hat on June 5. Aggregated data can be fetched easily as well within the cube structure. For example, the orange set of cells shows how much Mark paid on June 1 for all products. As you can see, the cube structure makes it easier and faster to access the required information.
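The idea of addressing a cell by its three dimension members, and of aggregating over one dimension, can be sketched with a plain Python dictionary. This is only a toy model: a real OLAP cube stores and pre-aggregates data very differently, the function names are hypothetical, and every amount except the $23 paid by Devin is made up.

```python
# Each cell is addressed by (product, customer, date) and stores a sales amount.
# Only the Devin/Hat/June 5 value comes from the text; the rest are made up.
cube = {
    ("Hat", "Devin", "June 5"): 23,
    ("Hat", "Mark", "June 1"): 20,
    ("Shoes", "Mark", "June 1"): 75,
    ("Shoes", "Devin", "June 1"): 60,
}

def cell(product, customer, date):
    """Return the value at the junction of the three dimensions (0 if empty)."""
    return cube.get((product, customer, date), 0)

def slice_total(product=None, customer=None, date=None):
    """Sum every cell matching the given members; None means 'all members'."""
    wanted = (product, customer, date)
    return sum(
        amount
        for key, amount in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    )

devin_hat = cell("Hat", "Devin", "June 5")
mark_june_1 = slice_total(customer="Mark", date="June 1")
print(devin_hat, mark_june_1)
```

Leaving a dimension unspecified in slice_total corresponds to the orange set of cells: all products for one customer on one date.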
Microsoft SQL Server Analysis Services 2012 comes with two different types of modeling: multidimensional and tabular. Multidimensional modeling is based on the OLAP cube and is fitted with measures and dimensions, as you can see in the preceding diagram. The tabular model is based on a new In-memory engine for tables. The In-memory engine loads all data rows from tables into the memory and responds to queries directly from the memory. This is very fast in terms of the response time. You will learn more about SSAS Multidimensional in Chapter 2, SQL Server Analysis Services Multidimensional Cube Development, and about SSAS Tabular in Chapter 3, Tabular Development of SQL Server Analysis Services, of this book. The BI semantic model (BISM) provided by Microsoft is a combination of SSAS Tabular and Multidimensional solutions.
The frontend of a BI system is data visualization. In other words, data visualization is a part of the BI system that users can see. There are different methods for visualizing information, such as strategic and tactical dashboards, Key Performance Indicators (KPIs), and detailed or consolidated reports. As you probably know, there are many reporting and visualizing tools on the market.
Microsoft has provided a set of visualization tools to cover dashboards, KPIs, scorecards, and reports required in a BI application. PerformancePoint, as part of Microsoft SharePoint, is a dashboard tool that performs best when connected to SSAS Multidimensional OLAP cube. You will learn about PerformancePoint in Chapter 10, Dashboard Design. Microsoft's SQL Server Reporting Services (SSRS) is a great reporting tool for creating detailed and consolidated reports. SSRS is a mature technology in this area, which will be revealed in Chapter 9, Reporting Services. Excel is also a great slicing and dicing tool especially for power users. There are also components in Excel such as Power View, which are designed to build performance dashboards. You will learn more about Power View in Chapter 9, Reporting Services, and about Power BI features of Excel 2013 in Chapter 11, Power BI. Sometimes, you will need to embed reports and dashboards in your custom written application. Chapter 12, Integrating Reports in Application, of this book explains that in detail.
Every organization has a part of its business that is common between different systems. That part of the data in the business can be managed and maintained as master data. For example, an organization may receive customer information from an online web application form or from a retail store's spreadsheets, or based on a web service provided by other vendors.
Master Data Management (MDM) is the process of maintaining the single version of truth for master data entities through multiple systems. Microsoft's solution for MDM is Master Data Services (MDS). Master data can be stored in the MDS entities and it can be maintained and changed through the MDS Web UI or Excel UI. Other systems such as CRM, AX, and even DW can be subscribers of the master data entities. Even if one or more systems are able to change the master data, they can write back their changes into MDS through the staging architecture. You will learn more about MDS in Chapter 5, Master Data Management.
The quality of data is different in each operational system, especially when we deal with legacy systems or systems that have a high dependence on user inputs. As the BI system is based on data, the better the quality of data, the better the output of the BI solution. Because of this fact, working on data quality is one of the components of the BI systems. As an example, Auckland might be written as "Auck land" in some Excel files or be typed as "Aukland" by the user in the input form.
As a solution to improve the quality of data, Microsoft provides Data Quality Services (DQS). DQS works based on Knowledge Base domains, which means a Knowledge Base can be created for different domains, and the Knowledge Base will be maintained and improved by a data steward as time passes. There are also matching policies that can be used to apply standardization on the data. You will learn more about DQS in Chapter 6, Data Quality and Data Cleansing.
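To illustrate the general idea of cleansing values against a list of trusted domain values, here is a minimal sketch that uses fuzzy string matching from Python's standard difflib module. It is not how DQS works internally; the KNOWN_CITIES list, the cleanse_city function, and the 0.8 cutoff are all assumptions chosen for the example.

```python
import difflib

# Hypothetical list of trusted values for a "City" domain.
KNOWN_CITIES = ["Auckland", "Wellington", "Christchurch"]

def _key(value):
    """Comparison key: lowercase with all whitespace removed."""
    return "".join(value.lower().split())

# Map each comparison key back to its trusted spelling.
_CANONICAL = {_key(city): city for city in KNOWN_CITIES}

def cleanse_city(raw, cutoff=0.8):
    """Return the closest trusted city name, or None when nothing is close enough."""
    matches = difflib.get_close_matches(_key(raw), list(_CANONICAL), n=1, cutoff=cutoff)
    return _CANONICAL[matches[0]] if matches else None

print(cleanse_city("Auck land"), cleanse_city("Aukland"), cleanse_city("Sydney"))
```

Values that return None would typically be routed to a person for review rather than corrected automatically, which mirrors the role of the data steward described above.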
A data warehouse is a database built for analysis and reporting. In other words, a data warehouse is a database in which the only data entry point is through ETL, and its primary purpose is to cover reporting and data analysis requirements. This definition clarifies that a data warehouse is not like other transactional databases that operational systems write data into. When there is no operational system that works directly with a data warehouse, and when the main purpose of this database is for reporting, then the design of the data warehouse will be different from that of transactional databases.
If you recall from the database normalization concepts, the main purpose of normalization is to reduce the redundancy and dependency. The following table shows customers' data with their geographical information:
Customer first name
Let's elaborate on this example. As you can see from the preceding table, the geographical information in the records is redundant. This redundancy makes it difficult to apply changes. For example, in this structure, if Remuera, for any reason, is no longer part of Auckland city, then the change should be applied to every record that has Remuera as its suburb. The following screenshot shows the tables of geographical information:
So, a normalized approach is to move the geographical information out of the customer table and put it into another table. Then, the customer table only holds a key that points to that table. In this way, every time the information for Remuera changes, only one record in the geographical table changes and the key number remains unchanged. So, you can see that normalization is highly efficient in transactional systems.
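The difference in update cost between the two designs can be sketched in a few lines of Python. The customer names, the second suburb, and the new city name are made up for illustration; only Remuera and Auckland come from the example.

```python
# Denormalized: geography repeated on every customer record (names are made up).
flat_customers = [
    {"name": "Devin", "suburb": "Remuera", "city": "Auckland"},
    {"name": "Mark", "suburb": "Remuera", "city": "Auckland"},
    {"name": "Sarah", "suburb": "Karori", "city": "Wellington"},
]

# Normalized: geography stored once; customers hold only a key to it.
geography = {
    1: {"suburb": "Remuera", "city": "Auckland"},
    2: {"suburb": "Karori", "city": "Wellington"},
}
customers = [
    {"name": "Devin", "geo_key": 1},
    {"name": "Mark", "geo_key": 1},
    {"name": "Sarah", "geo_key": 2},
]

def move_suburb_flat(rows, suburb, new_city):
    """Update every matching customer record; return how many rows were touched."""
    touched = 0
    for row in rows:
        if row["suburb"] == suburb:
            row["city"] = new_city
            touched += 1
    return touched

def move_suburb_normalized(geo, suburb, new_city):
    """Update the geography record only; customer keys stay unchanged."""
    touched = 0
    for record in geo.values():
        if record["suburb"] == suburb:
            record["city"] = new_city
            touched += 1
    return touched

flat_touched = move_suburb_flat(flat_customers, "Remuera", "Hypothetical City")
normalized_touched = move_suburb_normalized(geography, "Remuera", "Hypothetical City")
print(flat_touched, normalized_touched)
```

With only three customers the difference is small, but in the flat design the number of touched rows grows with the number of customers, while in the normalized design it stays at one.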
This normalization approach is not that effective on analytical databases. If you consider a sales database with many tables related to each other and normalized at least up to the third normal form (3NF), then analytical queries on such databases may require more than 10 join conditions, which slows down the query response. In other words, from the point of view of reporting, it would be better to denormalize data and flatten it in order to make it as easy as possible to query. This means the first design in the preceding table might be better for reporting.
However, the query and reporting requirements are not that simple, and the business domains in the database are not as small as two or three tables. So real-world problems can be solved with a special design method for the data warehouse called dimensional modeling. There are two well-known methods for designing the data warehouse: the Kimball and Inmon methodologies.
The Inmon and Kimball methods are named after their creators. Both of these methods are in use nowadays. The main difference between these methods is that Inmon is top-down and Kimball is bottom-up. In this chapter, we will explain the Kimball method. You can read more about the Inmon methodology in Building the Data Warehouse, William H. Inmon, Wiley (http://www.amazon.com/Building-Data-Warehouse-W-Inmon/dp/0764599445), and about the Kimball methodology in The Data Warehouse Toolkit, Ralph Kimball, Wiley (http://www.amazon.com/The-Data-Warehouse-Toolkit-Dimensional/dp/0471200247). Both of these books are must-reads for BI and DW professionals and are reference books that are recommended to be on the bookshelf of all BI teams. This chapter draws on The Data Warehouse Toolkit, so for a detailed discussion, read that book.
To gain an understanding of data warehouse design and dimensional modeling, it's better to learn about the components and terminologies of a DW. A DW consists of Fact tables and dimensions. The relationship between a Fact table and its dimensions is based on the foreign key and primary key (the primary key of the dimension table is referenced in the Fact table as a foreign key).
Facts are numeric and additive values in the business process. For example, in the sales business, a fact can be a sales amount, discount amount, or quantity of items sold. All of these measures or facts are numeric values and they are additive. Additive means that you can add values of some records together and it provides a meaning. For example, adding the sales amount for all records is the grand total of sales.
Dimension tables are tables that contain descriptive information. Descriptive information, for example, can be a customer's name, job title, company, and even geographical information of where the customer lives. Each dimension table contains a list of columns, and the columns of the dimension table are called attributes. Each attribute contains some descriptive information, and attributes that are related to each other will be placed in a dimension. For example, the customer dimension would contain the attributes listed earlier.
Each dimension has a primary key, which is called the surrogate key. The surrogate key is usually an auto increment integer value. The primary key of the source system will be stored in the dimension table as the business key.
The Fact table is a table that contains a list of related facts and measures with foreign keys pointing to surrogate keys of the dimension tables. Fact tables usually store a large number of records, and most of the data warehouse space is filled by them (around 80 percent).
Grain is one of the most important terms in data warehouse design. Grain defines the level of detail stored in the Fact table. For example, you could build a data warehouse for sales in which the Grain is the most detailed level of transactions in the retail shop, that is, one record for each transaction at a specific date and time for the customer and sales person. Understanding Grain is important because it defines which dimensions are required.
There are two different schemas for creating a relationship between fact and dimensions: the snow flake and star schema. In the star schema, a Fact table will be at the center as a hub, and dimensions will be connected to the fact through a single-level relationship. There won't be (ideally) a dimension that relates to the fact through another dimension. The following diagram shows the different schemas:
The snow flake schema, as you can see in the preceding diagram, contains relationships of some dimensions through intermediate dimensions to the Fact table. If you look more carefully at the snow flake schema, you may find it more similar to the normalized form, and the truth is that a fully snow flaked design of the fact and dimensions will be in the 3NF.
The snow flake schema requires more joins to respond to an analytical query, so it responds more slowly. Hence, the star schema is the preferred design for the data warehouse. In practice, you cannot always build a complete star schema, and sometimes you will be required to do a level of snow flaking. However, the best practice is to avoid snow flaking as much as possible.
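The extra join that snow flaking introduces can be seen in a small sketch built with Python's sqlite3 module. The table names DimProductStar, DimProductSnow, and DimCategory, and all sample rows, are hypothetical; the point is only that both designs return the same answer while the snow flake query needs one more join.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Snow flake variant: the category lives in its own table.
    CREATE TABLE DimCategory (CategoryKey INTEGER PRIMARY KEY, CategoryName TEXT);
    CREATE TABLE DimProductSnow (ProductKey INTEGER PRIMARY KEY, ProductName TEXT,
                                 CategoryKey INTEGER REFERENCES DimCategory(CategoryKey));
    -- Star variant: the category name is repeated inside the product dimension.
    CREATE TABLE DimProductStar (ProductKey INTEGER PRIMARY KEY, ProductName TEXT,
                                 CategoryName TEXT);
    CREATE TABLE FactSales (ProductKey INTEGER, SalesAmount REAL);

    INSERT INTO DimCategory VALUES (1, 'Clothing'), (2, 'Footwear');
    INSERT INTO DimProductSnow VALUES (1, 'Hat', 1), (2, 'Shoes', 2);
    INSERT INTO DimProductStar VALUES (1, 'Hat', 'Clothing'), (2, 'Shoes', 'Footwear');
    INSERT INTO FactSales VALUES (1, 23.0), (1, 23.0), (2, 80.0);
""")

# Star schema: one join from the fact to the dimension.
star = db.execute("""
    SELECT p.CategoryName, SUM(f.SalesAmount)
    FROM FactSales AS f
    JOIN DimProductStar AS p ON p.ProductKey = f.ProductKey
    GROUP BY p.CategoryName ORDER BY p.CategoryName
""").fetchall()

# Snow flake schema: the same answer needs a second join through DimCategory.
snow = db.execute("""
    SELECT c.CategoryName, SUM(f.SalesAmount)
    FROM FactSales AS f
    JOIN DimProductSnow AS p ON p.ProductKey = f.ProductKey
    JOIN DimCategory AS c ON c.CategoryKey = p.CategoryKey
    GROUP BY c.CategoryName ORDER BY c.CategoryName
""").fetchall()

print(star)
print(snow)
```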
After a quick definition of the most common terminologies in dimensional modeling, it's now time to start designing a small data warehouse. One of the best ways of learning a concept and method is to see how it will be applied to a sample question.
Assume that you want to build a data warehouse for the sales part of a business that contains a chain of supermarkets; each supermarket sells a list of products to customers, and the transactional data is stored in an operational system. Our mission is to build a data warehouse that is able to analyze the sales information.
Before thinking about the design of the data warehouse, the very first question is what is the goal of designing a data warehouse? What kind of analytical reports would be required as the result of the BI system? The answer to these questions is the first and also the most important step. This step not only clarifies the scope of the work but also provides you with the clue about the Grain.
Defining the goal can also be called requirement analysis. Your job as a data warehouse designer is to analyze required reports, KPIs, and dashboards. Let's assume that the decision maker of a particular supermarket chain wants to have analytical reports such as the comparison of sales between stores, or the top 10 customers and/or top 10 bestselling products, or he wants to compare the sale on weekdays with weekends.
Choosing the business process.
Identifying the Grain.
In our example, there is only one business process, that is, sales. Grain, as we've described earlier, is the level of detail that will be stored in the Fact table. Based on the requirement, Grain is to have one record per sales transaction and date, per customer, per product, and per store.
Once Grain is defined, it is easy to identify dimensions. Based on the Grain, the dimensions would be date, store, customer, and product. It is useful to name dimensions with a Dim prefix to identify them easily in the list of tables. So our dimensions will be DimCustomer, DimDate, DimProduct, and DimStore. The next step is to identify the Fact table, which would be a single Fact table named FactSales. This table will store the defined Grain.
After identifying the Fact and dimension tables, it's time to go more in detail about each table and think about the attributes of the dimensions, and measures of the Fact table. Next, we will get into the details of the Fact table and then into each dimension.
There is only one Grain for this business process, and this means that one Fact table would be required. Based on the provided Grain, a Fact table would be connected to DimCustomer, DimDate, DimProduct, and DimStore. To connect to each dimension, there would be a foreign key in the Fact table that points to the primary key of the dimension table.
The table would also contain measures or facts. For the sales business process, facts that can be measured (numeric and additive) are SalesAmount, DiscountAmount, and QuantitySold. The Fact table would only contain relationships to other dimensions and measures. The following diagram shows some columns of the FactSales:
As you can see, the preceding diagram shows a star schema. We will go through the dimensions in the next step to explore them more in detail. Fact tables usually don't have too many columns because the number of measures and related tables won't be that much. However, Fact tables will contain many records. The Fact table in our example will store one record per transaction.
As the Fact table will contain millions of records, you should think about the design of this table carefully. String data types are not recommended in the Fact table because they don't add any numeric or additive value to the table. The relationship between a Fact table and its dimensions should be based on the surrogate key of each dimension. The best practice is to use the integer data type for surrogate keys; this is cost-effective in terms of the disk space required in the Fact table because an integer takes only 4 bytes, while a string usually takes much more. Using an integer as a surrogate key also speeds up the join between a fact and a dimension because the join and filter criteria are evaluated on integers, which is much faster than comparing strings.
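The Fact table design described so far can be sketched as a small star schema using Python's sqlite3 module. The key and measure column names follow the text (CustomerKey, DateKey, ProductKey, StoreKey, SalesAmount, DiscountAmount, QuantitySold); the dimension attribute columns and all sample rows are assumptions kept deliberately minimal, and a production schema on SQL Server would use its own data types and indexing.

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
    CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, FullName TEXT);
    CREATE TABLE DimDate     (DateKey INTEGER PRIMARY KEY, FullDate TEXT);
    CREATE TABLE DimProduct  (ProductKey INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE DimStore    (StoreKey INTEGER PRIMARY KEY, StoreName TEXT);

    -- The Fact table holds only integer foreign keys and numeric measures.
    CREATE TABLE FactSales (
        CustomerKey    INTEGER NOT NULL REFERENCES DimCustomer(CustomerKey),
        DateKey        INTEGER NOT NULL REFERENCES DimDate(DateKey),
        ProductKey     INTEGER NOT NULL REFERENCES DimProduct(ProductKey),
        StoreKey       INTEGER NOT NULL REFERENCES DimStore(StoreKey),
        SalesAmount    REAL    NOT NULL,
        DiscountAmount REAL    NOT NULL,
        QuantitySold   INTEGER NOT NULL
    );

    INSERT INTO DimCustomer VALUES (1, 'Devin'), (2, 'Mark');
    INSERT INTO DimDate VALUES (20130601, '2013-06-01'), (20130605, '2013-06-05');
    INSERT INTO DimProduct VALUES (1, 'Hat'), (2, 'Shoes');
    INSERT INTO DimStore VALUES (1, 'Store A'), (2, 'Store B');
    INSERT INTO FactSales VALUES
        (1, 20130605, 1, 1, 23.0, 0.0, 1),
        (2, 20130601, 1, 2, 46.0, 4.0, 2),
        (2, 20130601, 2, 2, 80.0, 0.0, 1);
""")

# Comparing sales between stores takes a single join to DimStore.
sales_by_store = dw.execute("""
    SELECT s.StoreName, SUM(f.SalesAmount), SUM(f.QuantitySold)
    FROM FactSales AS f
    JOIN DimStore AS s ON s.StoreKey = f.StoreKey
    GROUP BY s.StoreName ORDER BY s.StoreName
""").fetchall()
print(sales_by_store)
```

The query at the end corresponds to one of the reports from the requirement analysis, the comparison of sales between stores.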
If you are thinking about adding comments made by a sales person on the sales transaction as another column of the Fact table, first think about the analysis that you want to do based on those comments. Free text fields are rarely useful for analysis as they are; if you wish to analyze free text, you can categorize the text values through the ETL process and build another dimension for the categories. Then, add a foreign key-primary key relationship between the Fact table and that dimension.
The customer's information, such as the customer name, customer job, customer city, and so on, will be stored in this dimension. You may think that the customer city belongs in another dimension, such as a Geo dimension. However, the important point is that our goal in dimensional modeling is not normalization, so resist the tendency to normalize tables. For a data warehouse, it is much better to store more customer-related attributes in the customer dimension itself rather than designing a snow flake schema. The following diagram shows sample columns of the DimCustomer dimension:
The DimCustomer dimension may contain many more attributes. The number of attributes in your dimensions is usually high. Actually, a dimension table with a high number of attributes is the power of your data warehouse because attributes will be your filter criteria in the analysis, and the user can slice and dice data by attributes. So, it is good to think about all possible attributes for that dimension and add them in this step.
As we've discussed earlier, you see attributes such as Suburb, City, State, and Country inside the customer dimension. This is not a normalized design, and this design definitely is not a good design for a transactional database because it adds redundancy, and making changes won't be consistent. However, for the data warehouse design, not only is redundancy unimportant but it also speeds up analytical queries and prevents snow flaking.
You can also see two keys for this dimension:
CustomerKey is the surrogate key and primary key for the dimension in the data warehouse. CustomerKey is an integer field, which is autoincremented. It is important that the surrogate key is not an encoded value or a string key; if something is encoded in a source key, it should be decoded and stored in the relevant attributes. The surrogate key should be different from the primary key of the table in the source system. There are multiple reasons for that; for example, operational systems sometimes recycle their primary keys, which means they reassign the key value of a customer that is no longer in use to a new customer.
CustomerAlternateKey is the primary key of the source system. It is important to keep the primary key of the source system stored in the dimension because it would be necessary to identify changes from the source table and apply them into the dimension. The primary key of the source system will be called the business key or alternate key.
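The relationship between the two keys can be sketched with a toy in-memory dimension in Python. The class name, the upsert method, and the sample business keys are hypothetical; the sketch simply overwrites attributes when a known business key arrives again, which is only one of several possible ways to apply source changes.

```python
import itertools

class CustomerDimension:
    """Toy in-memory dimension that maps business keys to surrogate keys."""

    def __init__(self):
        self._next_key = itertools.count(1)   # auto-increment surrogate key
        self.rows = {}                        # CustomerKey -> attributes
        self._by_business_key = {}            # CustomerAlternateKey -> CustomerKey

    def upsert(self, alternate_key, name):
        """Insert a new member or update an existing one; return its CustomerKey."""
        key = self._by_business_key.get(alternate_key)
        if key is None:
            # Unknown business key: assign the next integer surrogate key.
            key = next(self._next_key)
            self._by_business_key[alternate_key] = key
        self.rows[key] = {"CustomerAlternateKey": alternate_key, "Name": name}
        return key

dim = CustomerDimension()
first = dim.upsert("C-1001", "Devin")
second = dim.upsert("C-1002", "Mark")
again = dim.upsert("C-1001", "Devin Smith")   # change detected via the business key
print(first, second, again, dim.rows[first]["Name"])
```

Because Fact table rows reference only the surrogate key, the change to the customer's name does not require touching any fact records.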
The date dimension is one of the dimensions that you will find in most of the business processes. There may be rare situations where you work with a Fact table that doesn't store date-related information.
DimDate contains many generic columns such as FullDate, Month, Year, Quarter, and MonthName. Obviously, you could derive all the other columns from the full date column with some date functions, but that would add extra processing time to every query. So, at the time of designing dimensions, don't worry about space and add as many attributes as required. The following diagram shows sample columns of the date dimension:
It would be useful to store holidays, weekdays, and weekends in the date dimension because in sales figures, a holiday or weekend will definitely affect the sales transactions and amounts. So, the user will require an understanding of why the sale is higher on a specific date rather than on other days. You may also add another attribute for promotions in this example, which states whether that specific date is a promotion date or not.
The date dimension will have a record for each date. The table, shown in the following screenshot, shows sample records of the date dimension:
As you can see in the records illustrated in the preceding screenshot, the surrogate key of the date dimension (DateKey) shows a meaningful value. This is one of the rare exceptions where we can keep the surrogate key of a dimension as an integer type but format it as YYYYMMDD so that it carries a meaning as well.
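A minimal sketch of generating date dimension rows with a YYYYMMDD integer key, using only Python's standard datetime module, could look like the following. The function name and the IsWeekend attribute name are assumptions; holidays and promotion flags are left out because they depend on the business and the country.

```python
from datetime import date, timedelta

def build_date_dimension(start, end):
    """Return one row per calendar day between start and end (inclusive)."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            # Integer surrogate key in YYYYMMDD form, for example 20130601.
            "DateKey": current.year * 10000 + current.month * 100 + current.day,
            "FullDate": current.isoformat(),
            "Year": current.year,
            "Quarter": (current.month - 1) // 3 + 1,
            "Month": current.month,
            "MonthName": current.strftime("%B"),
            "IsWeekend": current.weekday() >= 5,  # Saturday=5, Sunday=6
        })
        current += timedelta(days=1)
    return rows

dim_date = build_date_dimension(date(2013, 1, 1), date(2013, 12, 31))
print(len(dim_date), dim_date[0]["DateKey"], dim_date[-1]["DateKey"])
```

Storing derived attributes such as Quarter and IsWeekend once per day means reports can filter on them directly instead of recomputing them with date functions.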
In this example, if we store time information, where do you think the time attributes belong? Inside the date dimension? Definitely not. The date dimension stores one record per day, so it will have 365 records per year and 3650 records for 10 years. If we add time splits to this, down to the minute, we would require 24*60 records per day. So, the combination of the date and time for 10 years would have 3650*24*60 = 5256000 records. More than 5 million records is too many for a single dimension; dimensions usually contain far fewer records than Fact tables and only occasionally have more than one million records. So in this case, the best practice would be to add another dimension named DimTime and add all time-related attributes to that dimension. The following screenshot shows some example records and attributes of DimTime:
Usually, the date and time dimensions are generic and static, so you won't be required to populate these dimensions through ETL every night; you just load them once and then you could use them. I've written two general-purpose scripts to create and populate date and time dimensions on my blog that you can use. For the date dimension, visit the http://www.rad.pasfu.com/index.php?/archives/95-Script-for-Creating-and-Generating-members-for-Date-Dimensions-General-Purpose.html URL, and for the time dimension, visit the http://www.rad.pasfu.com/index.php?/archives/122-Script-for-Creating-and-Generating-members-for-Time-Dimension.html URL.
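To make the sizing argument for separate date and time dimensions concrete, the following sketch builds a minute-level time dimension and compares row counts. It is not the script from the blog referenced above; the function name, the HHMM-style TimeKey, and the PeriodOfDay attribute are assumptions for illustration.

```python
def build_time_dimension():
    """Return one row per minute of the day (24 * 60 rows)."""
    rows = []
    for hour in range(24):
        for minute in range(60):
            rows.append({
                "TimeKey": hour * 100 + minute,   # HHMM as an integer, e.g. 1730
                "Hour": hour,
                "Minute": minute,
                "PeriodOfDay": "AM" if hour < 12 else "PM",
            })
    return rows

dim_time = build_time_dimension()

days_in_10_years = 3650                           # ignores leap days, as in the text
combined_rows = days_in_10_years * 24 * 60        # one dimension for date and time
separate_rows = days_in_10_years + len(dim_time)  # DimDate plus DimTime
print(len(dim_time), combined_rows, separate_rows)
```

Since neither table depends on source data, both can be generated once and reused, which matches the load-once approach described above.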
The product dimension will have a ProductKey, which is the surrogate key, and a business key, which will be the primary key of the product in the source system (something similar to a product's unique number). The product dimension will also have information about the product categories. Again, the dimension is denormalized: the product subcategory and category are placed into the product dimension with redundant values. This decision was made in order to avoid snow flaking and to raise the performance of the join between the fact and dimensions.
We are not going to go in detail through the attributes of the store dimension. The most important part of this dimension is that it can have a relationship to the date dimension. For example, a store's opening date will be a key related to the date dimension. This type of snow flaking is unavoidable because you cannot copy all the date dimension's attributes in every other dimension that relates to it. On the other hand, the date dimension is in use with many other dimensions and facts. So, it would be better to have a conformed date dimension.
Outrigger is a Kimball term for a dimension, such as date, that is conformed and that other dimensions may reference through a many-to-one relationship that is only one layer deep.
There is also another type of fact, which is the snapshot Fact table. In snapshot fact, each record will be an aggregation of some transactional records for a snapshot period of time. For example, consider financial periods; you can create a snapshot Fact table with one record for each financial period, and the details of the transactions will be aggregated into that record.
Transactional facts are a good source for detailed and atomic reports. They are also good for aggregations and dashboards. The Snapshot Fact tables provide a very fast response for dashboards and aggregated queries, but they don't cover detailed transactional records. Based on your requirement analysis, you can create both kinds of facts or only one of them.
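The relationship between the two kinds of facts can be sketched in Python as follows. The sample rows, the function name `build_monthly_snapshot`, and the use of a calendar month as the snapshot period are assumptions made for illustration; a real financial period may not align with calendar months.

```python
from collections import defaultdict

# Hypothetical transactional fact rows: (DateKey, StoreKey, SalesAmount)
fact_sales = [
    (20140105, 1, 100.0),
    (20140120, 1, 50.0),
    (20140203, 1, 75.0),
    (20140110, 2, 20.0),
]

def build_monthly_snapshot(transactions):
    """Aggregate transactional rows into one record per (period, store)."""
    totals = defaultdict(lambda: {"SalesAmount": 0.0, "TransactionCount": 0})
    for date_key, store_key, amount in transactions:
        period_key = date_key // 100  # YYYYMMDD -> YYYYMM
        bucket = totals[(period_key, store_key)]
        bucket["SalesAmount"] += amount
        bucket["TransactionCount"] += 1
    return dict(totals)

fact_sales_snapshot = build_monthly_snapshot(fact_sales)
```

The snapshot holds far fewer rows than the transactional fact, which is why it answers dashboard queries quickly, but the individual transactions can no longer be recovered from it.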
There is also another type of Fact table called the accumulating Fact table. This Fact table is useful for storing processes and activities, such as order management. You can read more about different types of Fact tables in The Data Warehouse Toolkit, Ralph Kimball, Wiley (which was referenced earlier in this chapter).
We've explained that Fact tables usually contain FKs of dimensions and some measures. However, there are times when you would require a Fact table without any measure. These types of Fact tables record only that an event or condition occurred, and they are usually used to show the non-existence of a fact.
For example, assume that the sales business process does promotions as well, and you have a promotion dimension. So, each entry in the Fact table shows that a customer X purchased a product Y at a date Z from a store S when the promotion P was on (such as the new year's sales). This Fact table covers every requirement that queries the information about the sales that happened, or in other words, the transactions that happened. However, there are times when the promotion is on but no transaction happens! A report of such cases is valuable to decision makers because it lets them understand the situation and investigate why that promotion did not lead to sales.
So, this is an example of a requirement that the existing Fact table with the sales amount and other measures doesn't fulfill. We would need a Fact table that shows that store S ran the promotion P on the date D for product X. This Fact table doesn't have any fact or measure related to it; it just has FKs for dimensions. However, it is very informative because it tells us on which dates there was a promotion at specific stores on specific products. We call this a Factless Fact table or Bridge table.
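As a rough sketch of how a Factless Fact table answers the "promotion without sales" question, the following Python example uses an in-memory SQLite database. The table names `FactPromotion` and `FactSales`, the columns, and the sample rows are assumptions made for this example, and SQLite is used only so that the snippet is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Factless fact: only dimension keys, no measures
    CREATE TABLE FactPromotion (DateKey INT, StoreKey INT, ProductKey INT, PromotionKey INT);
    -- Regular transactional fact with a measure
    CREATE TABLE FactSales (DateKey INT, StoreKey INT, ProductKey INT, PromotionKey INT, SalesAmount REAL);

    INSERT INTO FactPromotion VALUES
        (20140101, 1, 10, 5), (20140101, 1, 11, 5), (20140101, 2, 10, 5);
    INSERT INTO FactSales VALUES (20140101, 1, 10, 5, 99.0);
""")

# Promotions that ran but produced no matching sales rows
promoted_but_not_sold = conn.execute("""
    SELECT p.DateKey, p.StoreKey, p.ProductKey, p.PromotionKey
    FROM FactPromotion AS p
    LEFT JOIN FactSales AS s
           ON s.DateKey = p.DateKey
          AND s.StoreKey = p.StoreKey
          AND s.ProductKey = p.ProductKey
          AND s.PromotionKey = p.PromotionKey
    WHERE s.DateKey IS NULL
    ORDER BY p.StoreKey, p.ProductKey
""").fetchall()
```

The sales fact alone could never produce this result, because a sale that did not happen leaves no row behind; the factless table supplies the list of what was promoted.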
Using examples, we've explored the usual dimensions such as customer and date. When a dimension (such as date) participates in more than one business process and is shared by different data marts, it is called a conformed dimension.
Sometimes, a dimension is required to be used in the Fact table more than once. For example, in the FactSales table, you may want to store the order date, shipping date, and transaction date. All three columns will point to the date dimension. In this situation, we won't create three separate dimensions; instead, we will reuse the existing DimDate three times under three different names. So, the date dimension literally plays the role of more than one dimension. This is the reason we call such dimensions role-playing dimensions.
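A role-playing dimension shows up in queries as the same table joined several times under different aliases. The following Python sketch uses an in-memory SQLite database; the key column names (`OrderDateKey`, `ShippingDateKey`, `TransactionDateKey`) and the sample data are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimDate (DateKey INT PRIMARY KEY, FullDate TEXT);
    CREATE TABLE FactSales (
        OrderDateKey INT, ShippingDateKey INT, TransactionDateKey INT, SalesAmount REAL
    );
    INSERT INTO DimDate VALUES
        (20140101, '2014-01-01'), (20140103, '2014-01-03'), (20140105, '2014-01-05');
    INSERT INTO FactSales VALUES (20140101, 20140103, 20140105, 250.0);
""")

# One physical DimDate table, joined three times, each alias playing a different role
rows = conn.execute("""
    SELECT od.FullDate AS OrderDate,
           sd.FullDate AS ShippingDate,
           td.FullDate AS TransactionDate,
           f.SalesAmount
    FROM FactSales AS f
    JOIN DimDate AS od ON od.DateKey = f.OrderDateKey
    JOIN DimDate AS sd ON sd.DateKey = f.ShippingDateKey
    JOIN DimDate AS td ON td.DateKey = f.TransactionDateKey
""").fetchall()
```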
There are other types of dimensions with some differences, such as the junk dimension and the degenerate dimension. The junk dimension is used for dimensions with very few member values (records) that are used in only one data mart (not conformed). For example, status dimensions can be good candidates for a junk dimension. If you create a status dimension for each situation in each data mart, then you will probably end up with more than ten status dimensions with fewer than five records in each. The junk dimension is a solution that combines such narrow dimensions into one bigger dimension.
You may or may not use a junk dimension in your data mart: using a junk dimension reduces readability, while not using one increases the number of narrow dimensions. So, the decision depends on the requirement analysis phase and on the dimensional modeling of the star schema.
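One common way to build a junk dimension is to generate every combination of the narrow status values and assign a single surrogate key to each combination. The following Python sketch assumes three hypothetical status attributes; the attribute names, their values, and the `JunkKey` column are invented for this example.

```python
from itertools import product

# Hypothetical low-cardinality status attributes that would otherwise
# each need their own tiny dimension table
payment_statuses = ["Paid", "Pending", "Refunded"]
delivery_statuses = ["Delivered", "In transit"]
order_channels = ["Online", "In store"]

# One junk dimension row per combination, with a single surrogate key
dim_junk = [
    {"JunkKey": key, "PaymentStatus": p, "DeliveryStatus": d, "OrderChannel": c}
    for key, (p, d, c) in enumerate(
        product(payment_statuses, delivery_statuses, order_channels), start=1
    )
]

# Lookup used while loading the fact: status combination -> JunkKey
junk_key_lookup = {
    (r["PaymentStatus"], r["DeliveryStatus"], r["OrderChannel"]): r["JunkKey"]
    for r in dim_junk
}
```

The Fact table then carries one `JunkKey` column instead of three separate foreign keys, at the cost of a dimension whose name says little about its contents.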
A degenerate dimension is another type of dimension, which is not a separate dimension table. In other words, a degenerate dimension doesn't have a table and it sits directly inside the Fact table. Assume that you want to store the transaction number (a string value). Where would be the best place to add that information? You may think that you would create another dimension, enter the transaction number there, assign a surrogate key, and use that surrogate key in the Fact table. This is not an ideal solution because that dimension will have exactly the same Grain as your Fact table, which means that the number of records in your sales transaction dimension will be equal to the number of records in the Fact table, so you will have a very deep dimension table, which is not recommended. On the other hand, you cannot think of another attribute for that dimension because all attributes related to the sales transaction already exist in other dimensions connected to the fact. So, instead of creating a dimension with the same Grain as the fact and with only one column, we leave that column (even if it is a string) inside the Fact table. This type of dimension is called a degenerate dimension.
Now that you understand dimensions, it is a good time to go into more detail about one of the most challenging concepts of data warehousing, which is the slowly changing dimension (SCD). A dimension's attribute values may change over time, and depending on the requirement, you will take different actions to respond to that change. As the changes in the dimension's attribute values happen only occasionally, this is called a slowly changing dimension. SCD is split into different types depending on the action to be taken after the change. In this section, we only discuss types 0, 1, and 2.
Type 0 doesn't accept any changes. Let's assume that the Employee Number is inside the Employee dimension. Employee Number is the business key and it is an important attribute for ETL because ETL distinguishes new employees or existing employees based on this field. So we don't accept any changes in this attribute. This means that type 0 of SCD is applied on this attribute.
Sometimes, a value may be typed wrongly in the source system, such as the first name, and it is likely that someone will come and fix that with a change. In such cases, we will accept the change, and we won't need to keep historical information (the previous name). So we simply replace the existing value with a new value. This type of SCD is called type 1. The following screenshot shows how type 1 works:
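In code, the type 1 behavior could be sketched roughly as follows. The in-memory list standing in for the Employee dimension, the column names, and the function `apply_type1_change` are assumptions made for this example.

```python
# Hypothetical Employee dimension with a misspelled first name
dim_employee = [
    {"EmployeeKey": 1, "EmployeeNumber": "E100", "FirstName": "Jhon"},
]

def apply_type1_change(dimension, business_key, attribute, new_value):
    """Overwrite the attribute in place; the previous value is not kept."""
    for row in dimension:
        # The business key (type 0, never changes) identifies the member
        if row["EmployeeNumber"] == business_key:
            row[attribute] = new_value
    return dimension

apply_type1_change(dim_employee, "E100", "FirstName", "John")
```

The surrogate key and the number of rows stay the same; only the attribute value is replaced, so no history survives.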
In other cases, it is a common requirement to maintain historical changes. For example, consider this situation: a customer recently changed their city from Seattle to Charlotte. You cannot use type 0 because it is quite likely that someone will change the city they live in. If you behave like type 1 and update the existing record, then you will lose the information about the customer's purchases at the time that they were in Seattle, and all entries will show that they are a customer from Charlotte. So the requirement for keeping the historical version resulted in the third type of SCD, which is type 2.
Type 2 is about maintaining historical changes. The way to keep historical changes is through a couple of metadata columns: FromDate and ToDate. Each new customer will be imported into the dimension with FromDate as a start date, and the ToDate will be left as null (or a big default value such as 29990101). If a change happens in the city, the existing record's ToDate will be set to the date of change, and a new record will be created as an exact copy of the previous record with the new city and with a new FromDate, which will be the date of change, and the ToDate field will be left as null. Using this solution, to find the latest and most up-to-date member information, you just need to look for the member record with ToDate as null. To fetch the historical information, you would need to search for the record whose FromDate and ToDate span the specified point in time. The following screenshot shows an example of SCD type 2:
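The type 2 steps described above can also be sketched in Python. The in-memory list standing in for the customer dimension, the `CustomerKey` and `CustomerNumber` column names, and the functions `apply_type2_change` and `city_as_of` are assumptions made for this example; the sketch uses null (`None`) rather than a big default value for an open ToDate.

```python
from datetime import date

# Hypothetical customer dimension with one current record (ToDate is None)
dim_customer = [
    {"CustomerKey": 1, "CustomerNumber": "C42", "City": "Seattle",
     "FromDate": date(2012, 1, 1), "ToDate": None},
]

def apply_type2_change(dimension, business_key, new_city, change_date):
    """Close the current record and append a new version with the new city."""
    current = next(
        (r for r in dimension
         if r["CustomerNumber"] == business_key and r["ToDate"] is None),
        None,
    )
    if current is None or current["City"] == new_city:
        return dimension  # unknown member or nothing changed
    current["ToDate"] = change_date  # mark the old version as ended
    new_row = dict(current)          # copy of the previous record
    new_row.update({
        "CustomerKey": max(r["CustomerKey"] for r in dimension) + 1,
        "City": new_city,
        "FromDate": change_date,
        "ToDate": None,              # the new version is the current one
    })
    dimension.append(new_row)
    return dimension

def city_as_of(dimension, business_key, as_of):
    """Return the city that was valid for the member on the given date."""
    for r in dimension:
        if (r["CustomerNumber"] == business_key
                and r["FromDate"] <= as_of
                and (r["ToDate"] is None or as_of < r["ToDate"])):
            return r["City"]
    return None

apply_type2_change(dim_customer, "C42", "Charlotte", date(2014, 6, 1))
```

After the change, `city_as_of(dim_customer, "C42", date(2013, 1, 1))` finds the Seattle version, while any date on or after the change date finds the Charlotte version. Each version has its own surrogate key, which is what lets older fact rows keep pointing at the Seattle record.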
There are other types of SCD that are based on combinations of the first three types and cover other kinds of requirements. You can read more about the different types of SCD and methods of implementing them in The Data Warehouse Toolkit referenced earlier in this chapter.
In this chapter, you learned what Business Intelligence is and what its components are. You studied the requirement for BI systems, and you saw the solution architecture to solve the requirements. Then, you read about data warehousing and the terminologies in dimensional modeling.
If you come from a DBA or database developer background and are familiar with database normalization, then you now know that in dimensional modeling, you should avoid normalization in some parts and design a star schema instead. You've learned that the Fact table holds numeric and additive values, and that descriptive information is stored in dimensions. You've learned about different types of facts such as transactional, snapshot, and accumulating, and also about different types of dimensions such as outriggers, role-playing, and degenerate.
Data warehousing and dimensional modeling together constitute the most important part of the BI system, which is sometimes called the core of the system. In the following chapters, we will go through some of the BI system components such as ETL, OLAP, Dashboards, and reports.