Packt Publishing Community, Experience, Distilled

Data Profiling with IBM Information Analyzer


Data profiling is essentially data mining with a different purpose: you mine the data to understand the data itself. Whereas data mining is more commonly used to gain business insight (for example, customer buying characteristics), data profiling serves a technical purpose. More precisely, you profile data to gather and analyze its technical metadata characteristics. Information Analyzer, IBM’s data profiling software, helps you gain insight into such characteristics as a column’s data type and size (length).
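To make "technical metadata characteristics" concrete, here is a minimal Python sketch of inferring a column's data type from its values. It only illustrates the idea and is not how Information Analyzer works internally; the function name `infer_data_type` and the three type labels are assumptions made for this example.

```python
from typing import Iterable, Optional


def infer_data_type(values: Iterable[Optional[str]]) -> str:
    """Guess the narrowest type that fits every non-empty value.

    Hypothetical helper for illustration only; the labels "integer",
    "decimal" and "string" are not Information Analyzer terms.
    """
    kinds = set()
    for value in values:
        if value is None or value.strip() == "":
            continue  # NULLs and blanks say nothing about the type
        text = value.strip()
        try:
            int(text)
            kinds.add("integer")
            continue
        except ValueError:
            pass
        try:
            float(text)
            kinds.add("decimal")
        except ValueError:
            kinds.add("string")

    if not kinds:
        return "unknown"
    if "string" in kinds:
        return "string"   # one non-numeric value makes the column a string
    if "decimal" in kinds:
        return "decimal"  # mixed integers and decimals widen to decimal
    return "integer"


# Made-up sample values, not taken from the article's table.
print(infer_data_type(["1", "22", "333"]))
print(infer_data_type(["1", "2.5"]))
print(infer_data_type(["1", "abc"]))
```

A column declared as a wide character type but holding only values that parse as integers is the kind of mismatch profiling is meant to surface.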

This article is based on a case in which a database table grew unexpectedly and strained its initial disk space allocation. The table’s growth pattern, such as the number of new records (which was not that large), gave us no clue about the cause of the problem. In this article, Djoni Darmawikarta steps through the Information Analyzer process, runs one of its functions, called Column Analysis, on a simple table (a scaled-down version of the real table), and shows how the profiling output helps solve the problem.

Information Analyzer is client-server software. A data profiling user (metadata analyst) works in its GUI client, so to make it easier to show how I solved the problem, I’ll use a lot of screenshots.

Our example data is an Oracle table with two columns and three rows. (In real life, a table can typically have more than 50 columns and a few million rows.)

When you start the Information Analyzer client, called Information Server Console, you’ll see its start-up screen and then its log-in window.

When your log-in is successful, the console main window will show up.

Assuming the Oracle table that we’d like to profile is new, we must first identify it to Information Analyzer, which technically means importing its metadata.

Make sure you have connected the Oracle database to the Information Analyzer server before you import the metadata of its tables.

Expand Metadata Management from the HOME drop-down menu.

Then, click Import Metadata.

Our example Oracle data (table) is in the CLROPER database (hosted in DDOM02), so select CLROPER and then click Identify Next Level.

This might take a while, particularly for a database that has many tables and columns, so just wait.

On the completion message screen, click OK to close the screen.

All tables in the CLROPER database will be identified (listed), including our example table, named SPACE1. Next, we’ll identify the columns of the SPACE1 table, so select SPACE1 and then click Identify Next Level.

The result shows that Analyzer has correctly identified the two columns of the table.

Now, import metadata of all columns of the table by selecting the table and then clicking Import.

Click OK to continue.

Wait for completion.


Click OK on the successful completion screen.

We’re now done importing the metadata and ready to start our profiling task.

In Information Analyzer (as in most software these days), we group our profiling work into projects. Here, I’ll use an existing project (DJONI_TEST), so select Open Project from the drop-down arrow to the right of NO PROJECT SELECTED.

You’ll be shown the list of existing projects. Select your project, and click Open.

The project’s previous (existing) profiling work is shown.

Next, click Project Properties from the OVERVIEW drop-down menu.

Go to the Data Sources tab. Our SPACE1 table is not in the list yet, because we haven’t identified it specifically in our project (in the previous steps we identified it only at the server-wide level). To add it to the project, click Add.

Expand the SPACE1 table to see its columns. Select all of the columns as we want to profile all of them, and then click OK.

When completed, click Save All, and then close the Project Properties window.

Now we’re ready to profile our SPACE1 data, that is, to analyze its columns. On the main toolbar, select Investigate | Column Analysis.



Select all columns of the SPACE1 table to analyze, and click Run Column Analysis.

Click Submit.

Check status by clicking Details.

When the job status shows Schedule Complete, click Close to close the Activity Status (job status) window.

Close the Column Analysis window as well.

To check the result, click Open Column Analysis.

Our profiling output shows the metadata characteristics of the two columns. Our focus is on their sizes, so scroll to the right if necessary to see the Length columns.

Length has three columns: Defined, Inferred, and Selected. The Defined length of the first column (INTEGER1) is the length defined in the table metadata we imported, which is 38. The Inferred length, which is 3, is computed by Information Analyzer from the actual data values in all rows of the column. Information Analyzer then suggests (Selected) that 3 should be the length of this column. It derives the Inferred and Selected lengths of the other column, LARGECHAR1, in the same way.
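The difference between a Defined and an Inferred length can be sketched in a few lines of Python. This is a simplified illustration that treats the inferred length as the longest value observed; it is not Information Analyzer's actual algorithm, and the names `LengthProfile` and `profile_length`, as well as the sample values, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class LengthProfile:
    defined: int   # length declared in the table metadata
    inferred: int  # longest value actually observed in the data


def profile_length(values: Iterable[Optional[object]], defined: int) -> LengthProfile:
    """Compare a column's declared length with the longest value it holds."""
    longest = 0
    for value in values:
        if value is None:  # NULLs carry no length information
            continue
        # Trailing blanks are stripped so that blank-padded values
        # are measured by their real content.
        longest = max(longest, len(str(value).rstrip()))
    return LengthProfile(defined=defined, inferred=longest)


# Hypothetical three-row sample for a numeric column declared with length 38.
sample = [7, 42, 123]
profile = profile_length(sample, defined=38)
print(profile)
```

With this made-up sample, the longest value has three digits, so the inferred length is far below the declared 38, which is the same kind of gap the Length columns reveal.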

Based on this output from Information Analyzer, we can decide by how much we’d like to reduce the lengths of the columns, which in turn reduces the disk space needed for the data wherever the defined length drives storage (for example, in fixed-length character columns).
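A rough estimate of the space at stake is simple arithmetic. The Python sketch below assumes a fixed-length column in which every row occupies the full defined length; variable-length types store only the actual data, so the estimate would not apply to them. The function name and all figures are hypothetical, not taken from the real table.

```python
def estimated_savings_bytes(row_count: int, defined: int, selected: int,
                            bytes_per_char: int = 1) -> int:
    """Bytes saved by shrinking a fixed-length column from `defined` to `selected`.

    Assumes every row stores the full defined length, as a blank-padded
    fixed-length character column does.
    """
    if selected > defined:
        raise ValueError("selected length must not exceed the defined length")
    return row_count * (defined - selected) * bytes_per_char


# Hypothetical figures: 3 million rows, column shrunk from 2000 to 50 characters.
saved = estimated_savings_bytes(3_000_000, defined=2000, selected=50)
print(f"{saved / 1024**2:.1f} MiB saved")
```

Even with a modest number of rows, an oversized fixed-length column can account for most of a table's footprint, which fits a case where the row count alone did not explain the growth.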

Summary

Using a data profiling tool such as IBM Information Analyzer, we can analyze data, particularly large amounts of it, and gain knowledge that would otherwise not be apparent. Information Analyzer has many more functions; this article discussed only the basics of one of them (Column Analysis).



About the Author

Djoni Darmawikarta built his career at IBM Asia Pacific and Canada as a software engineer, international consultant, instructor, and project manager, for a total of 17 years. He’s currently a technical specialist in the Data Warehousing and Business Intelligence team of a Toronto-based insurance company. Outside of his office work, Djoni writes IT articles and books.

© Packt Publishing Ltd 2008
