Your message has been sent.
This article has been saved to your account.
Go to my account
This article has been emailed to your Kindle.
Send this article
Microsoft SQL Server's SQL Server Integration Services is a comprehensive, full featured tool for processing data in the back-end storage locations. Extraction of attributes in a body of text is an important task while mining data and the SSIS provides just the required transformation - Term Extraction, to accomplish this task. SSIS looks through the presented text document and sorts out terms. By terms are meant specifically nouns, noun phrases, and both nouns and noun phrases. You also have the flexibility of limiting extracted terms based on their frequency of occurrence. For example you could configure the task in such a way that you can capture terms that are repeated in the body of text a certain number of times.
This tutorial by Jayaram Krishnaswamy shows a simple example of extracting terms from a body of text.
The following text (SomeText.txt) file saved at a suitable location on the hard drive is used. This particular text is:
Rose is Red
Chrysanthemum is yellow
Violets are violet
Rose can be Pink
Hyacinth is white
Desk jobs are the best
Lily is also white
The girl is wearing a rose garland
The boy is handsome
Pink rose is not red
rose garland is made of roses
Roserose is not rose
He rose to power
The desk is made of rose wood
The reason for using the above text is to see how well the Term Extract transformation is able to distinguish words and phrases and find how often they are found in a body of text. The transformation works for text in English and is capable of distinguishing between nouns and other parts of speech.
In the following steps we will create a Visual Studio 2005 Business Intelligence project and access some text stored on the hard drive, and apply this transformation and review the results.
Creating a Business Intelligence Project
In the Visual Studio 2005 IDE, File | New | Project opens the New Project window as shown, where you can highlight the Integration Services Project template in the Business Intelligence page and change its default name to something different. For this tutorial TermExtract has been used as the project name.
Text to be accessed shown above is saved to a file, SomeText.txt in the C: drive. In order to access this from the Integration Services we need to create a Package with a data flow task. The source for this data is the SomeText.txt file on the C: drive.
Change the name of the default package file name to something different, In this case MineText.dtsx. Click Yes on the Microsoft Visual Studio message box asking whether you want to rename the package.
Add a Data Flow Task
Drag and drop a Data Flow Task to the Control Flow page as shown in the next figure. The Data Flow Task will access the SomeText.txt using a connection manager, an intermediary between SQL Server Integration Services and the external system.
Add a Flat File Source
Click on the Data Flow Task page. Drag and drop a Flat File Source from the Data Flow Sources group in the Toolbox and drop it on the Data Flow Page which is open as shown.
When the Flat File Source is dropped on the Data Flow Task page you may see this error in the error window as shown. This is nothing to worry about because a connection is not yet established.
Add a Connection Manager to Manage Flat File Source
Now right click in the Connection Manager's pane as shown to display the pick list of connection managers and choose New Flat File Connection... as shown.
This immediately displays the Flat File Connection Manager's editor window as shown. You must provide a name of your choice to the Connection Manager, and a description of your choice. Then you need to use the Browse button to locate the SomeText.txt file on your hard drive. The next figure shows the editor after these choices are made. The rest of the fields such as Locale, Code page, etc were automatically chosen by the program.
Now click on the Columns list item in the left of the Editor. The one column that gets populated with the data from the SomeText.txt gets displayed. The program has correctly configured the fields for this text.
Click on the OK button on the Editor. This adds a Connection Manager, MyText to the Connection manager's pane in the SSIS designer. With this, the SomeText.txt is available for the other controls that you may add.
Add a Term Extraction Transformation
The column that was populated in the above will now pass through the Term Extraction Transformation added by dragging and dropping this from the Toolbox on to the Data Flow Page.
Click the dangling green line and extend it to touch the Term Extraction Transformation. This is an easy way to establish a connection from the source to a transformation, a destination.
Double click the Term Extraction Transformation to open its Editor as shown in the next figure. In the Term Extraction tabbed page you see a single column which is displayed unchecked. Place a check mark for this column as shown in the next figure.
When the 'terms' are extracted, the output column will have a 'term' and a 'score' column. The term refers to a noun, a noun phrase, or a noun and a noun phrase. The score represents how many times each term is repeated in the body of the text. Pay attention to the message that says the column can have only values of a certain types and the disabled OK button.
The data type of the data going into the Term Extract Transformation can be found by right clicking on the connecting green line and looking at the page that reveals the Meta data list item as shown in the next figure. This is of the data type DT_STR.
To rectify this, there are two options, either use one more transformation, the data conversion transformation or use the Advanced Editor of the Flat File Source which can be displayed by right clicking the Flat File source component and choosing the Show Advanced Editor. This option was made to change Str[DT_STR] to Unicode str [DT_WSTR].
The DT_* shows the data type that are supported. The following information about these data types are shown extracted from the Books on line.
- DT_STR: A null-terminated ANSI/MBCS character string
- DT_NTEXT: A Unicode character string with a maximum length of 2^30-1 characters
- DT_WSTR: a null terminated Unicode character string
Now when you place a check mark for the Column 0 in the Term Extract Transformation Editor, the OK gets enabled.
Click on the Exclusion tab to reveal its page. This page when configured, allows you to exclude (skip) certain terms stored in an OLEDB database. The figure shows the details of editing this page. A Microsoft Access 2003 database called 'SkipTerms' was created and a new table 'SkipTable' was created in this database. It has two columns SkipID (autonumber, Primary Key) and SkipThis (text). A new OLEDB Connection was established along the same lines as the connection manager to the Flat File Source. Of course you need to choose an OLEDB Provider in making this connection. The 'SkipThis' column has just one entry, 'desk'. This noun is found twice in SomeText.txt. The word 'desk' will be skipped in the output column when the Column 0 is processed by this transformation.
Click on the Advanced tab to open its page as shown. This is where you choose type of terms, nouns, noun phrases, or both noun and noun phrases. You also select the score which shows how many times (Frequency and Frequency Threshold) the terms appear in the text. As chosen here, the transformation will be looking for noun(s) that gets repeated twice. The case sensitive option can also be chosen but left blank in this exercise. The score type TFIDF is another type of scoring more appropriate for a document collection and not a single document like in this article. You may learn more details on this from this link.
eBook Price: £14.99
Book Price: £24.99
Adding a Recordset Destination to Pipe the Data Out
As in memory, ADO Recordset Destination (This is like a Message box for table data) is a convenient way to show the output from the Term Extraction Transformation. The Recordset Destination can be dragged and dropped from the Tool box on to the Data Flow Page. In configuring this you need to crate a variable which is in the scope of Data Flow Task and of type Object as shown in the next figure.
With the variable in place you can edit the recordset destination component by double clicking it. This opens up the editor wherein you need to point to the variable you just created as shown. This is all that is necessary to configure this component. It is assumed that you have added a path from the Term Extraction Transformation to the Recordset Destination in a manner similar to what you did earlier.
Adding Data Viewers to Monitor Data Flow
Data Viewers are convenient in visualizing the data flow in SSIS packages. You normally place a data viewer in the path of data flow. In the present tutorial you see two data viewers one (DVFF) between the Flat File Source and the Term Extraction Transformation and the other (DVTE) between Term Extraction Transformation and the Recordset Destination. During run time the data viewer acts like a valve you can stop the flow review it and turn it on to let the data flow to the next step. You may provide a name to the Data Viewer. There are four types of data viewers (Grid, Histogram, Scatter Plot, and Chart) and the 'Grid' type is used in this article.
Build the Project and Execute the Package
This is easy. Click on the menu item Build to build the project. Some times you may need to refresh especially after you make changes midway so that the most current configuration is built. Right click the file MineText.dtsx and from the drop-down choose Execute task. It takes a bit of a time to do the crunching and after the processing is complete you will see the Data Viewers popping-up as shown and the DVFF data viewer is showing the data that is going to the Term Extraction Transformation. The Yellow color signifies that the process is still incomplete.
You can see the progress up to this point by looking at the Progress tabbed page as shown.
When you click on the green arrow in the DVFF table, the flow gets routed downstream as shown. There are two columns as mentioned earlier, Term and score. The data is stopped on its way to the recordset and you turn it on by clicking the green arrow on the DVTE data viewer. When this is done you will see that the execution has completed. From the 15 rows from the source the transformation chose two nouns rose and violet each repeated twice. The 'desk' although repeated twice was skipped as programmed.
Details regarding the usage of a Term Extraction Transformation in extracting nouns and their frequency of occurrence in a body of text are described.
eBook Price: £14.99
Book Price: £24.99
About the Author :
Dr. Jay Krishnaswamy is a graduate of the Indian Institute of Science, Bangalore writes on database and web development related topics to several computer programming related web sites. He is an active participant in several forums and discussion groups. Before working in the IT industry as a Microsoft Certified Trainer and a Siebel Certified consultant he taught at several institutes of technology and universities in India, Australia, Brazil and the USA. Links to his articles may be found at his web site or the blog. He lives in Plainsboro, NJ, USA and may be reached at email@example.com