Getting Started with Talend Open Studio for Data Integration



Chapter 5. Filtering, Sorting, and Other Processing Techniques

As we pass data from component to component and system to system, we often need to modify it in some way. This chapter introduces the Studio's processing components, which will become your "Swiss Army Knife" as you develop integration jobs. Components in the processing group act as intermediate transformation steps, intercepting data flows between input and output components. For example, we might place a filtering component between a database read component and a database write component, or between an XML file input and a CSV output. Alternatively, we might use a sorting component that takes sales order data from a file and sorts it by customer ID in ascending order.

In this chapter, we will look at:

  • Filtering data: Removing or passing through specific records based on some attribute of the data

  • Sorting data: Alphabetic and numeric sorts (singly or in combination)

  • Summing and aggregating...

Filtering data


As we pass data through an integration process, we may often wish to filter it in some way. Data from source systems may be fine in terms of its format, but its content scope may be too broad for the receiving systems. For example, suppose we have an export from our financial system of all outstanding customer invoices and we wish to send each customer a list of their invoices; we wouldn't send the full list to every customer, but rather a filtered list containing only that customer's own invoices.
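
Conceptually, this kind of filter simply keeps the rows that satisfy a condition. As a point of reference only (not the code the Studio generates), here is a minimal plain-Java sketch of the invoice scenario, using a hypothetical Invoice record and made-up values:

import java.util.List;
import java.util.stream.Collectors;

public class FilterExample {
    // Hypothetical record standing in for a row in the invoice flow
    record Invoice(String invoiceNumber, String customerId, double amount) {}

    // Pass through only the rows belonging to one customer, which is
    // the essence of what a filter component does
    static List<Invoice> forCustomer(List<Invoice> all, String customerId) {
        return all.stream()
                  .filter(inv -> inv.customerId().equals(customerId))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Invoice> all = List.of(
            new Invoice("INV-001", "C001", 120.00),
            new Invoice("INV-002", "C002", 75.50),
            new Invoice("INV-003", "C001", 310.25));
        System.out.println(forCustomer(all, "C001")); // only C001's invoices
    }
}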

We have seen in previous examples that the tMap component has filtering capabilities, but the Studio also provides a dedicated filtering component with some extra features for fine control (useful when there is no requirement for data mapping). We will look at three examples of how to use the filter component in your integration jobs:

  • A straightforward filter

  • The same filter, but also capturing the rejected records (a sketch of this pattern follows the list)

  • Finally, how to split a file based on filters
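
For the second case, capturing rejects amounts to splitting the flow on a predicate rather than discarding the non-matching rows. Here is a minimal plain-Java sketch of the idea with made-up values; again, this is an illustration, not the Studio's generated code:

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterWithRejects {
    public static void main(String[] args) {
        List<String> codes = List.of("US", "GB", "FRA", "DE", "JPN");

        // Keep rows matching the condition; everything else lands in the
        // reject branch, mirroring a filter with a reject output row
        Predicate<String> isTwoChars = code -> code.length() == 2;
        Map<Boolean, List<String>> split =
            codes.stream().collect(Collectors.partitioningBy(isTwoChars));

        System.out.println("accepted: " + split.get(true));   // [US, GB, DE]
        System.out.println("rejected: " + split.get(false));  // [FRA, JPN]
    }
}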

Simple filter

Let's start...

Sorting data


Our next data processing example will look at sorting data. Let's dive straight in:

  1. Create a new job and name it Sorting.

  2. We will use the currencies data file that we used in a previous example, so from the Repository Metadata drag the currencies delimited file onto the Job Designer. Select tFileInputDelimited from the pop-up window.

  3. From the Palette, search for delimited and drop a tFileOutputDelimited component onto the Job Designer. Now do a search for sort and drop a tSortRow component between the two delimited components.

  4. Right-click on the currencies delimited input and, using Row | Main, connect it to the sort component. Similarly, right-click on the sort component and use Row | Main to connect it to the delimited output component.

  5. In the Criteria box of the tSortRow component, click on + to add a row—some default values will be shown. The criteria we enter here will determine how the output is sorted. Let's leave the Schema column value as country, but change the sort...
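
Each criteria row we configure in the sort component boils down to a comparison on the chosen column. As an illustration only (not the Studio's generated code), here is a plain-Java sketch of an ascending alphabetical sort on the country column, using a hypothetical Currency record and made-up values:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortExample {
    // Hypothetical stand-in for a row of the currencies file
    record Currency(String country, String currencyCode) {}

    public static void main(String[] args) {
        List<Currency> rows = new ArrayList<>(List.of(
            new Currency("Japan", "JPY"),
            new Currency("Australia", "AUD"),
            new Currency("Germany", "EUR")));

        // An ascending alphabetical sort on the country column,
        // equivalent to one criteria row in the sort component
        rows.sort(Comparator.comparing(Currency::country));

        rows.forEach(System.out::println); // Australia, Germany, Japan
    }
}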

Aggregating data


An aggregation operation can often be thought of as summing some data items, but the Studio also uses aggregating functions to work out counts, minimum and maximum values, and average values, amongst other things. We will look at a simple data summation example.

The file we will work with is in the resources directory of this chapter and is named invoices.csv. It shows a number of invoices with the invoice number, customer name, and invoice amount. We want to extract the sum of the invoices for each customer.
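
Before walking through the Studio steps, it may help to see the underlying operation in plain Java. This sketch groups hypothetical invoice rows by customer and sums their values with BigDecimal, the same grouping-and-summing the aggregation component performs for us (the record and values are made up for illustration):

import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class AggregateExample {
    // Hypothetical stand-in for a row of invoices.csv
    record Invoice(String invoiceNumber, String customerName, BigDecimal invoiceValue) {}

    public static void main(String[] args) {
        List<Invoice> rows = List.of(
            new Invoice("1001", "Acme", new BigDecimal("100.50")),
            new Invoice("1002", "Globex", new BigDecimal("75.00")),
            new Invoice("1003", "Acme", new BigDecimal("24.50")));

        // Group by customer and sum the invoice values, using BigDecimal
        // so the two decimal places of monetary values are preserved
        Map<String, BigDecimal> totals = rows.stream().collect(
            Collectors.groupingBy(Invoice::customerName,
                Collectors.reducing(BigDecimal.ZERO,
                    Invoice::invoiceValue, BigDecimal::add)));

        totals.forEach((customer, total) ->
            System.out.println(customer + " -> " + total)); // Acme -> 125.00
    }
}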

  1. Create a new job and name it Aggregating.

  2. Create a File delimited metadata item for the invoices.csv file, following the steps we have used previously. Name column 0 as invoice_number, column 1 as customer_name, and column 2 as invoice_value. The Studio will choose a data type of float for the invoice value. It is better to use the data type BigDecimal, which preserves the two decimal places we expect with monetary values, so change this in the Type column of the final...

Normalizing and denormalizing data


Database normalization is the process whereby a database schema is designed to reduce data duplication and redundancy. If a database is not designed with normalization principles in mind, it can:

  • Get overly large, due to duplicated data

  • Make data maintenance difficult or give rise to data integrity issues if the same data values reside in multiple tables

While we are not directly concerned with database schema design in this chapter, our next two examples look at processing operations born of the same principles as database normalization, so readers who aren't familiar with the concepts may wish to read some introductory material first. For a good primer on database normalization, go to http://en.wikipedia.org/wiki/Database_normalization.

Data normalization

Our first example shows how we can normalize data. Suppose we have a data file that has two fields: product_id and categories. A product can belong to more than one category and the category values are...
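
The operation itself is easy to picture in plain Java: each value in the multi-value field becomes its own output row. In the sketch below, the product data is made up and a comma delimiter is assumed purely for illustration:

import java.util.Arrays;
import java.util.Map;

public class NormalizeExample {
    public static void main(String[] args) {
        // One denormalized row: the categories field holds several values
        // (a comma delimiter is an assumption made for this sketch)
        Map<String, String> products = Map.of("P100", "books,videos,audio");

        // Normalizing turns each multi-value field into one row per value
        products.forEach((productId, categories) ->
            Arrays.stream(categories.split(","))
                  .forEach(cat -> System.out.println(productId + " | " + cat)));
        // P100 | books
        // P100 | videos
        // P100 | audio
    }
}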

Extracting delimited fields


As we have seen, some systems may store data in a denormalized form and, in the previous section, we saw how we could normalize that data. In essence, we were turning values from a single column into multiple rows. With some data, however, we may wish to resolve its denormalized form not into rows, but into individual columns. For example, suppose a system stores its employee data with the following schema:

[employee_id] | [name]

And the name field holds the first name and last name of the employee in the following format:

[last_name], [first_name]

An example file is shown as follows:

Note

Note that the schema does not have three fields, but that the second field contains the first and last name, separated by a comma.

Our objective in this example is to manipulate the data so that it maps to a three-field schema:

[employee_id] | [last_name] | [first_name]
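
The transformation is a simple split on the delimiter. As a plain-Java illustration (the employee values here are hypothetical), splitting the combined name field into two columns looks like this:

public class ExtractFieldsExample {
    public static void main(String[] args) {
        // A hypothetical input row: employee_id plus a combined name field
        String employeeId = "E042";
        String name = "Smith, Jane";

        // Split the combined field on the comma delimiter into two columns,
        // limiting to two parts in case a name ever contains another comma
        String[] parts = name.split(",", 2);
        String lastName = parts[0].trim();
        String firstName = parts.length > 1 ? parts[1].trim() : "";

        System.out.println(employeeId + " | " + lastName + " | " + firstName);
        // E042 | Smith | Jane
    }
}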

Follow the walk-through below:

  1. Create a new job and name it ExtractDelimitedFields.

  2. Create a File delimited metadata item for our input file...

Find and replace


We saw in Chapter 3, Transforming Files, in the StateLookup job example, that we can use the Studio to replace one value with another. In that example, we used a second reference data source and a tJoin component to take a value from the input file and replace it with the value held in the reference file (replacing the long form of the state name with its two-character state code). This works really well if the replacement data is held in a reference file or database. However, sometimes we won't have this data stored elsewhere, or the lookup is not as well structured as our state code example. In these instances, we can use a simple find and replace component to apply ad-hoc lookups. Let's see this technique in action.

In this job, we will take a list of two-character country codes and replace one of them with the full country name. Our input file is country-codes.csv in the resource directory of this chapter.
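
Under the hood, an ad-hoc find and replace is just a lookup against a small set of search/replace pairs. Here is a plain-Java sketch of the idea with made-up values, standing in for the pairs we would otherwise configure in the replace component:

import java.util.List;
import java.util.Map;

public class FindAndReplaceExample {
    public static void main(String[] args) {
        List<String> codes = List.of("US", "GB", "DE");

        // An ad-hoc lookup table, playing the role of the search/replace
        // pairs configured in the component; unmatched values pass through
        Map<String, String> replacements = Map.of("GB", "United Kingdom");

        codes.forEach(code ->
            System.out.println(replacements.getOrDefault(code, code)));
        // US
        // United Kingdom
        // DE
    }
}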

  1. Create a new job and name it FindAndReplace.

  2. Create...

Sampling rows


For our final example of processing techniques, we will look at how to extract specific rows from a data flow. For this, the Studio offers the tSampleRow component, which filters rows according to their line numbers. We're going to use the same data as in the previous example, country-codes.csv, and the job we create will be similar to the last one, so we'll duplicate it and make some modifications.

  1. In the repository, right-click on the FindAndReplace job and select Duplicate.

  2. In the pop-up window, change the name to SampleRow and click on OK.

  3. Double-click on the new job to open it and delete the tReplace component.

  4. In the Palette, search for sample and drag a tSampleRow component onto the Job Designer, in between the delimited input and the tLogRow component.

  5. Connect the components together using Row | Main in each case.

  6. The Basic Settings tab shows some examples of how to configure the component. For our job, let's configure it to filter rows 5 to 10. In the Range box...
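
Step 6 configures the component to pass only rows 5 to 10. In plain Java, the same line-number filter can be sketched as follows (the row data is made up for illustration):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SampleRowsExample {
    public static void main(String[] args) {
        // Hypothetical stand-in for the rows of country-codes.csv
        List<String> rows = IntStream.rangeClosed(1, 15)
            .mapToObj(i -> "row-" + i)
            .collect(Collectors.toList());

        // Keep only rows 5 to 10 by line number (1-based), the same
        // range configured in step 6 above
        List<String> sampled = rows.stream()
            .skip(4)      // drop rows 1-4
            .limit(6)     // keep the next six rows: 5 through 10
            .collect(Collectors.toList());

        System.out.println(sampled); // [row-5, row-6, ..., row-10]
    }
}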

Summary


In this chapter, we explored the Studio's processing components and how we can use them to modify the data flows between files, databases, and other start and end points through filters, data sorts, and aggregations. We saw how we can group and un-group data with the Studio's normalize and denormalize components. The tReplace component is a simple but hugely useful addition to your toolkit and will undoubtedly find a place in many of your integration jobs. Finally, we reviewed how to extract specific rows from a data flow with the tSampleRow component; later in the book we will see examples where, combined with a sorting component, this has some powerful applications.

When building integration jobs, you will commonly use the processing components in combination. For example, you might take a data flow from a database, filter out the records you want, then pass the data to a find and replace component, where you will substitute some values with...
