Choosing your database type
There are three broad categories for setting up a database when you install Oracle database using the installation wizard:
- Transaction Processing: Online transaction processing is typically for applications with multiple users inserting and updating low volumes per transaction
- General Purpose: A batch application that loads large volumes of data during specific windows
- Data Warehouse: An application that can have multiple users (transactions) requesting large volumes of data for read-only
Identify the key stakeholders, business process owners, and system owners within the identified process flows.
How to do it...
To determine the initial configuration of the database, it is important to get an understanding of the number of users and the expected volumes you will be processing:
- Schedule meetings with the stakeholders.
- Ask key questions to understand the nature of the solution:
- How many anticipated users do you expect for the solution?
- What time of the month/day do you expect the highest number of users utilizing the information?
- Does the information need to be updated real time, and why?
- Schedule meetings with the business process owners.
- Ask key questions to understand the nature of the business process:
- Is the business process updated real time by users, or is information entered and then processed within a batch?
- How many business users are involved with the business process?
- Schedule meetings with the system owners.
- Ask key questions to understand the source system:
- What is the database upon which the system resides?
- How large is the current database?
- Where is the system located?
- How is the system connected to the network?
- Does the system do a lot of batch processing?
- Can we connect to the system directly?
- What is the preferred method to connect to the system to extract information?
- What is the process to request access to extract information?
How it works...
There are many other questions that can and should be asked. The aforementioned questions are focused on gathering information to determine how many databases should be created, and the general configuration for each database.
Generally, a business intelligence solution is logically structured in the following way:
Information comes from multiple sources, and is loaded into a staging area. From the staging area, the information is transformed and loaded into the data warehouse. In the data warehouse, this information can be summarized or enhanced (additional calculations), ready to be consumed from the presentation area.
For the information sessions, if you have determined that you will be continuously extracting large volumes of data from multiple sources, which are geographically dispersed, then you will probably require a separate staging database to collect and consolidate the information. It is advantageous to select a separate database, so that as your information grows over a period of time, you can tune the stage environment to suit the requirements. When you perform the installation for Oracle, you can select a General Purpose template as the initial configuration for the stage database.
For the data warehouse, if you have a large number of users who access information 24 hours per day, 7 days per week, then it is recommended that the data warehouse again be a separate database. This will allow you to prepare the information within the staging database and load the results into the data warehouse. You can minimize the impact of the data transformation and loading on the users, and only require a smaller window to load information. The Data Warehouse template should be used when creating this database.
For the presentation area, if you are using a relational tool to publish information, then this should be located within the same database as the data warehouse. The main reason for this is the ease of administration and the ability to utilize the Oracle technology (summary management, query optimization, and so on), and minimize the amount of data movement.
For most projects, if you have a handful of sources with data loads happening during a preset window and reasonable user counts, a single General Purpose database will be the correct place to start. Data Marts would also be classified under this definition.
Oracle has published a white paper called Best Practices for a Data Warehouse on Oracle Database 11g, An Oracle White Paper, November 2010. This has a lot of very valuable information with regards to Balanced Configuration and Disk Layout for your database.
Defining your database layout
Oracle information is stored within a schema. The schema is the objects (tables, views, sequences, and so on) owned within the database. In a business intelligence and data warehouse solution, there are normally multiple schemas that encompass the solution.
Understand the proposed technology for the project and pre-read the installation requirements to get an understanding of the schema requirements.
How to do it...
Creating schemas within the database allows you to segregate and separate objects. This makes it easier to manage the objects with greater flexibility.
Separate your schemas into two categories:
- Application Owners: These are the required schemas to house the information for the solution:
- EDW_APP: The Enterprise Data Warehouse schema is for custom applications, to capture or enhance information or common lookups. This becomes a source for the data warehouse.
- EDW_STAGE: The Enterprise Data Warehouse stage schema will store all the objects required within the staging process.
- EDW: The Enterprise Data Warehouse publish schema will store all the objects required for the data warehouse and the presentation area.
- EDW_DM_<SA>: This is the Enterprise Data Warehouse data mart schema, where <SA> is replaced with the name of the subject area. The schema does not normally contain any specific objects, but refers back to the objects within the EDW schema.
- Technology/Tool/Repository/Pre-Packaged applications required schemas: These are the required schemas that enable the technology to build/run the solution. Typical schemas include:
- OWBSYS: This is the Oracle Warehouse Builder system owner schema; its name cannot be changed.
- OWB_<NAME>: This is the Oracle Warehouse Builder repository owner schema. It is recommended that you prefix with OWB, and then add an additional descriptive name.
- ODI_<NAME>: This is the Oracle Data Integrator master/work repository owner schema. It is recommended that you prefix with ODI, and then add an additional descriptive name.
- APEX_PUBLIC_USER: This is the Oracle Application Express Public User schema; its name cannot be changed.
- APEX_04000: This is the Oracle Application Express schema; its name cannot be changed, but may differ from version to version.
- FLOW_FILES: This is the Oracle Application Express schema; its name cannot be changed.
- MDS and BIPLATFORM: These are the Oracle Business Intelligence Enterprise Edition schemas. It is recommended that you prefix the schema names with the environment they are being created in.
For example, DEV_MDS, DEV_BIPLATFORM. These are created using the repository creation utility or RCU, during installation.
How it works...
Separating and segregating schemas within your application gives you greater flexibility. It allows you to manage and configure each schema appropriately for its function. The following figure outlines the different schemas. In the EDW_STAGE schema, it should store an image of the source and target objects:
The staging area should have a before picture and an after picture of the information. The before picture should be a representation of the source information within minimal data transformations. The after picture should look like the information that will be loaded into the data warehouse. This allows you to audit information before you load it into the data warehouse, and minimizes the amount of interruption to the users, while manipulating information.
The schemas for each of the tools within the solution should be autonomous to the database. This means that all components for the tool should be contained within a single database. For example, ODI has the concept of a Master and Work Repository. The configuration allows for a single master and multiple work repositories collocated on a single server or across multiple servers, with one or more than one schema. It is recommended that you use a single master with a single work repository within a single schema. Having autonomous environments and objects being exported from one environment to another has several advantages and disadvantages.
The advantages of autonomous environments and exporting objects from one environment to another are:
- Redundancy of environments allows you to tolerate hardware faults.
- Ability to upgrade software and test progressively. Having a single master and multiple work repository places software upgrades and downtime requirements on the architecture. This is a great concern if you have to upgrade all your environments first before you can test the latest patch, due to a single master and multiple-work repository configuration.
- Reduced dependency; errors which may cascade between Master and Work repositories will be isolated to a single environment.
- Security; ability to secure environments.
- Segregation and standard development practices; ability to isolate environments and use release management processes between environments.
The disadvantages are:
- Migration of code is cumbersome
- Potential software version inconsistencies may arise should environments not be patched to the same levels
Based on the previous recipe, you may have several databases, one for your stage area and another for a publish area. The relevant schemas should be placed within the relevant databases.
Within the data warehouse, you will normally have a layer of tables that will hold your atomic-level information. This is the level of information that is extracted from the source environment. Also, in this area, you may have additional objects that will aggregate the information into a model, which is optimized for your reporting tool. This is the presentation layer of information. This layer may change over a period of time, and you should ensure that you can regenerate the information from the atomic level, if required.
Oracle's white paper—Best Practices for a Data Warehouse on Oracle Database 11g, An Oracle White Paper, November 2010—has a lot of very valuable information with regards to staging, foundation, and presentation layers.
Selecting the Third Normal Form or a Dimensional model
Using the Third Normal Form or a Dimensional model to support your data warehouse solution is an important and highly debated decision within the industry. Primary reasons for choosing one over the other include:
- Skills and capabilities within the organizations
- The approach to building your warehouse solution
Understand the differences and limitations of a Third Normal Form and or a Dimensional-modeling approach to building your warehouse. Research the technology you plan to implement, to understand which model it works more optimally with.
How to do it...
Third Normal Form modeling is normally used within Online Transaction Processing Applications(OLTP). Dimensional modeling has been refined over the years and is generally accepted for data warehouses:
- Determine the modeling approach for the initial stage area, which houses a copy of the source. Typically, here you will mirror the source system. Note, you will only bring the relevant information that you require. This schema will therefore use the modeling applied in the source system.
- Determine the modeling approach for the presentation information used by the application. If you are using OBIEE, then it works more optimally using a dimensional model.
- Decide on the modeling approach for the atomic-level data model. If you are going to use OBIEE and drill down to detail from the presentation layer, then it is easier to use a dimensional model.
- The stage area therefore should be built using a model that mirrors your atomic-level data warehouse model.
How it works...
Matching the data model with different areas in your solution increases the amount of transparency within the data model, and makes it easier to troubleshoot. Optimizing your data model for different tool sets allows you to take full advantage of using the capabilities of the tools.
It is recommended that the information in the data warehouse always be at the atomic level (lowest level necessary to recreate any summary information). Therefore, should you change tools over a period of time, you will have the ability to recreate the presentation layer optimized for the new tool.
Other features such as denormalized or snowflake dimensions—type I, II, or III, slowly changing dimensions, and different types of fact tables are other capabilities, which can be embedded within your solution, based on your preferences and standards.
There is a lot of information available about these topics. A good place to start is The 10 Essential Rules of Data Modeling, from Kimball University. Additional topics are also available from the Kimball group.
Getting the basics right was the topic of this article. In this article, we focused on creating a successful foundation to interactively build your solution, which can save large amounts of time and money.
We began with selecting a database type, depending on the number of users and the expected volume you will be processing. After selecting the database type, we focused on schemas and the criteria’s to consider while creating schemas within database, for your project. We also discussed using the Third Normal Form or a Dimensional model to support your data warehouse solution.