Getting Started with Talend Open Studio for Data Integration



Chapter 5. Filtering, Sorting, and Other Processing Techniques

As we pass data from component to component and system to system, we often need to modify it in some way. This chapter introduces the Studio's processing components, which will become your "Swiss Army Knife" as you develop integration jobs. Components in the processing group act as intermediate transformation steps, intercepting data flows between input and output components. For example, we might place a filtering component between a database read component and a database write component, or between an XML file input and a CSV output. Alternatively, we might use a sorting component that takes sales order data from a file and sorts it by customer ID in ascending order.

In this chapter, we will look at:

  • Filtering data: Removing or passing through specific records based on some attribute of the data

  • Sorting data: Alphabetic and numeric sorts (singly or in combination)

  • Summing and aggregating...

Filtering data


As we pass data through an integration process, we may often wish to filter it in some way. Data from source systems may be fine in terms of its format, but its content scope may be too broad for the receiving systems. For example, suppose we have an export from our financial system of all outstanding customer invoices and we wish to send each customer a list of their invoices; we wouldn't send the full list to every customer, but rather a filtered list containing only that customer's own invoices.
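
Conceptually, this kind of filter simply keeps the rows that satisfy a condition. As a point of reference only (not the code the Studio generates), here is a minimal plain-Java sketch of the invoice scenario, using a hypothetical Invoice record and made-up values:

import java.util.List;
import java.util.stream.Collectors;

public class FilterExample {
    // Hypothetical record standing in for a row in the invoice flow
    record Invoice(String invoiceNumber, String customerId, double amount) {}

    // Pass through only the rows belonging to one customer, which is
    // the essence of what a filter component does
    static List<Invoice> forCustomer(List<Invoice> all, String customerId) {
        return all.stream()
                  .filter(inv -> inv.customerId().equals(customerId))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Invoice> all = List.of(
            new Invoice("INV-001", "C001", 120.00),
            new Invoice("INV-002", "C002", 75.50),
            new Invoice("INV-003", "C001", 310.25));
        System.out.println(forCustomer(all, "C001")); // only C001's invoices
    }
}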

We have seen in previous examples that the tMap component has filtering capabilities, but the Studio also provides a dedicated filtering component with some extra features for fine control (useful when there is no requirement for data mapping). We will look at three examples of how to use the filter component in your integration jobs:

  • A straightforward filter

  • The same filter, but also capturing the rejected records (a sketch of this pattern follows the list)

  • Finally, how to split a file based on filters
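
For the second case, capturing rejects amounts to splitting the flow on a predicate rather than discarding the non-matching rows. Here is a minimal plain-Java sketch of the idea with made-up values; again, this is an illustration, not the Studio's generated code:

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterWithRejects {
    public static void main(String[] args) {
        List<String> codes = List.of("US", "GB", "FRA", "DE", "JPN");

        // Keep rows matching the condition; everything else lands in the
        // reject branch, mirroring a filter with a reject output row
        Predicate<String> isTwoChars = code -> code.length() == 2;
        Map<Boolean, List<String>> split =
            codes.stream().collect(Collectors.partitioningBy(isTwoChars));

        System.out.println("accepted: " + split.get(true));   // [US, GB, DE]
        System.out.println("rejected: " + split.get(false));  // [FRA, JPN]
    }
}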

Simple filter

Let's start...

Sorting data


Our next data processing example will look at sorting data. Let's dive straight in:

  1. Create a new job and name it Sorting.

  2. We will use the currencies data file that we used in a previous example, so from the Repository Metadata drag the currencies delimited file onto the Job Designer. Select tFileInputDelimited from the pop-up window.

  3. From the Palette, search for delimited and drop a tFileOutputDelimited component onto the Job Designer. Now do a search for sort and drop a tSortRow component between the two delimited components.

  4. Right-click on the currencies delimited input and, using Row | Main, connect it to the sort component. Similarly, right-click on the sort component and use Row | Main to connect it to the delimited output component.

  5. In the Criteria box of the tSortRow component, click on + to add a row—some default values will be shown. The criteria we enter here will determine how the output is sorted. Let's leave the Schema column value as country, but change the sort...
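
Each criteria row we configure in the sort component boils down to a comparison on the chosen column. As an illustration only (not the Studio's generated code), here is a plain-Java sketch of an ascending alphabetical sort on the country column, using a hypothetical Currency record and made-up values:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortExample {
    // Hypothetical stand-in for a row of the currencies file
    record Currency(String country, String currencyCode) {}

    public static void main(String[] args) {
        List<Currency> rows = new ArrayList<>(List.of(
            new Currency("Japan", "JPY"),
            new Currency("Australia", "AUD"),
            new Currency("Germany", "EUR")));

        // An ascending alphabetical sort on the country column,
        // equivalent to one criteria row in the sort component
        rows.sort(Comparator.comparing(Currency::country));

        rows.forEach(System.out::println); // Australia, Germany, Japan
    }
}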

Aggregating data


An aggregation operation can often be thought of as summing some data items, but the Studio also uses aggregating functions to work out counts, minimum and maximum values, and average values, amongst other things. We will look at a simple data summation example.

The file we will work with is in the resources directory of this chapter and is named invoices.csv. It shows a number of invoices with the invoice number, customer name, and invoice amount. We want to extract the sum of the invoices for each customer.
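
Before walking through the Studio steps, it may help to see the underlying operation in plain Java. This sketch groups hypothetical invoice rows by customer and sums their values with BigDecimal, the same grouping-and-summing the aggregation component performs for us (the record and values are made up for illustration):

import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AggregateExample {
    // Hypothetical stand-in for a row of invoices.csv
    record Invoice(String invoiceNumber, String customerName, BigDecimal invoiceValue) {}

    public static void main(String[] args) {
        List<Invoice> rows = List.of(
            new Invoice("1001", "Acme", new BigDecimal("100.50")),
            new Invoice("1002", "Globex", new BigDecimal("75.00")),
            new Invoice("1003", "Acme", new BigDecimal("24.50")));

        // Group by customer and sum the invoice values, using BigDecimal
        // so the two decimal places of monetary values are preserved
        Map<String, BigDecimal> totals = rows.stream().collect(
            Collectors.groupingBy(Invoice::customerName,
                Collectors.reducing(BigDecimal.ZERO,
                    Invoice::invoiceValue, BigDecimal::add)));

        totals.forEach((customer, total) ->
            System.out.println(customer + " -> " + total)); // Acme -> 125.00
    }
}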

  1. Create a new job and name it Aggregating.

  2. Create a File delimited metadata item for the invoices.csv file, following the steps we have used previously. Name column 0 as invoice_number, column 1 as customer_name, and column 2 as invoice_value. The Studio will choose a data type of float for the invoice value. It is better to use the data type BigDecimal, which preserves the two decimal places we expect with monetary values, so change this in the Type column of the final...

Normalizing and denormalizing data


Database normalization is the process whereby a database schema is designed to reduce data duplication and redundancy. If a database is not designed with normalization principles in mind, it can:

  • Get overly large, due to duplicated data

  • Make data maintenance difficult or give rise to data integrity issues if the same data values reside in multiple tables

While we are not directly concerned with database schema design in this chapter, our next two examples look at processing operations born of the same principles as database normalization, so readers who aren't familiar with the concepts may wish to read some introductory material first. For a good primer on database normalization, go to http://en.wikipedia.org/wiki/Database_normalization.

Data normalization

Our first example shows how we can normalize data. Suppose we have a data file that has two fields: product_id and categories. A product can belong to more than one category and the category values are...
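
The operation itself is easy to picture in plain Java: each value in the multi-value field becomes its own output row. In the sketch below, the product data is made up and a comma delimiter is assumed purely for illustration:

import java.util.Arrays;
import java.util.Map;

public class NormalizeExample {
    public static void main(String[] args) {
        // One denormalized row: the categories field holds several values
        // (a comma delimiter is an assumption made for this sketch)
        Map<String, String> products = Map.of("P100", "books,videos,audio");

        // Normalizing turns each multi-value field into one row per value
        products.forEach((productId, categories) ->
            Arrays.stream(categories.split(","))
                  .forEach(cat -> System.out.println(productId + " | " + cat)));
        // P100 | books
        // P100 | videos
        // P100 | audio
    }
}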

Extracting delimited fields


As we have seen, some systems may store data in a denormalized form and, in the previous section, we saw how we could normalize that data. In essence, we were turning values from a single column into multiple rows. With some data, however, we may wish to resolve its denormalized form not into rows, but into individual columns. For example, suppose a system stores its employee data with the following schema:

[employee_id] | [name]

And the name field holds the first name and last name of the employee in the following format:

[last_name], [first_name]

An example file is shown as follows:

Note

Note that the schema does not have three fields, but that the second field contains the first and last name, separated by a comma.

Our objective in this example is to manipulate the data so that it maps to a three-field schema:

[employee_id] | [last_name] | [first_name]
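
The transformation is a simple split on the delimiter. As a plain-Java illustration (the employee values here are hypothetical), splitting the combined name field into two columns looks like this:

public class ExtractFieldsExample {
    public static void main(String[] args) {
        // A hypothetical input row: employee_id plus a combined name field
        String employeeId = "E042";
        String name = "Smith, Jane";

        // Split the combined field on the comma delimiter into two columns,
        // limiting to two parts in case a name ever contains another comma
        String[] parts = name.split(",", 2);
        String lastName = parts[0].trim();
        String firstName = parts.length > 1 ? parts[1].trim() : "";

        System.out.println(employeeId + " | " + lastName + " | " + firstName);
        // E042 | Smith | Jane
    }
}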

Follow the walk-through below:

  1. Create a new job and name it ExtractDelimitedFields.

  2. Create a File delimited metadata item for our input file...

Find and replace


We saw in Chapter 3, Transforming Files, in the StateLookup job example, that we can use the Studio to replace one value with another. In that example, we used a second reference data source and a tJoin component to take a value from the input file and replace it with the value held in the reference file (replacing the long form of the state name with its two-character state code). This works really well if the replacement data is held in a reference file or database. However, sometimes we won't have this data stored elsewhere, or the lookup is not as well structured as our state code example. In these instances, we can use a simple find and replace component to apply ad-hoc lookups. Let's see this technique in action.

In this job, we will take a list of two-character country codes and replace one of them with the full country name. Our input file is country-codes.csv in the resource directory of this chapter.
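
Under the hood, an ad-hoc find and replace is just a lookup against a small set of search/replace pairs. Here is a plain-Java sketch of the idea with made-up values, standing in for the pairs we would otherwise configure in the replace component:

import java.util.List;
import java.util.Map;

public class FindAndReplaceExample {
    public static void main(String[] args) {
        List<String> codes = List.of("US", "GB", "DE");

        // An ad-hoc lookup table, playing the role of the search/replace
        // pairs configured in the component; unmatched values pass through
        Map<String, String> replacements = Map.of("GB", "United Kingdom");

        codes.forEach(code ->
            System.out.println(replacements.getOrDefault(code, code)));
        // US
        // United Kingdom
        // DE
    }
}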

  1. Create a new job and name it FindAndReplace.

  2. Create...

Sampling rows


For our final example of processing techniques, we will look at how to extract specific rows from a data flow. For this, the Studio offers the tSampleRow component, which filters rows according to their line numbers. We're going to use the same data as in the previous example, country-codes.csv, and the job we create will be similar to the last one, so we'll duplicate it and make some modifications.

  1. In the repository, right-click on the FindAndReplace job and select Duplicate.

  2. In the pop-up window, change the name to SampleRow and click on OK.

  3. Double-click on the new job to open it and delete the tReplace component.

  4. In the Palette, search for sample and drag a tSampleRow component onto the Job Designer, in between the delimited input and the tLogRow component.

  5. Connect the components together using Row | Main in each case.

  6. The Basic Settings tab shows some examples of how to configure the component. For our job, let's configure it to filter rows 5 to 10. In the Range box...
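
Step 6 configures the component to pass only rows 5 to 10. In plain Java, the same line-number filter can be sketched as follows (the row data is made up for illustration):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SampleRowsExample {
    public static void main(String[] args) {
        // Hypothetical stand-in for the rows of country-codes.csv
        List<String> rows = IntStream.rangeClosed(1, 15)
            .mapToObj(i -> "row-" + i)
            .collect(Collectors.toList());

        // Keep only rows 5 to 10 by line number (1-based), the same
        // range configured in step 6 above
        List<String> sampled = rows.stream()
            .skip(4)      // drop rows 1-4
            .limit(6)     // keep the next six rows: 5 through 10
            .collect(Collectors.toList());

        System.out.println(sampled); // [row-5, row-6, ..., row-10]
    }
}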

Summary


In this chapter, we explored the Studio's processing components and how we can use them to modify the data flows between files, databases, and other start and end points through filters, data sorts, and aggregations. We saw how we can group and un-group data with the Studio's normalize and denormalize components. The tReplace component is a simple but hugely useful addition to your toolkit and will undoubtedly find a place in many of your integration jobs. Finally, we reviewed how to extract specific rows from a data flow with the tSampleRow component; later in the book we will see examples where, combined with a sorting component, this has some powerful applications.

When building integration jobs, you will commonly use the processing components in combination. For example, you might take a data flow from a database, filter out the records you want, then pass the data to a find and replace component, where you will substitute some values with...
