Reader small image

You're reading from  Talend Open Studio Cookbook

Product typeBook
Published inOct 2013
Reading LevelIntermediate
PublisherPackt
ISBN-139781782167266
Edition1st Edition
Languages
Tools
Right arrow
Author (1)
Rick Barton
Rick Barton
author image
Rick Barton

Rick Barton is a freelance consultant who has specialized in data integration and ETL for the last 13 years as part of an IT career spanning over 25 years. After gaining a degree in Computer Systems from Cardiff University, he began his career as a firmware programmer before moving into Mainframe data processing and then into ETL tools in 1999. He has provided technical consultancy to some of the UKs largest companies, including banks and telecommunications companies, and was a founding partner of a Big Data integration consultancy. Four years ago he moved back into freelance development and has been working almost exclusively with Talend Open Studio and Talend Integration Suite, on multiple projects, of various sizes, in UK. It is on these projects that he has learned many of the lessons that can be found in this, his first book.
Read more about Rick Barton

Right arrow

Chapter 3. Validating Data

This chapter contains recipes that show some of the techniques for validating data and handling invalid rows.

  • Enabling and disabling reject flows

  • Gathering all rejects prior to killing a job

  • Validating against the schema

  • Rejecting rows using tMap

  • Checking a column against a list of allowed values

  • Checking a column against a lookup

  • Creating validation rules for more complex requirements

  • Creating binary error codes to store multiple test results

Introduction


Clean, timely, and correct data is a business-critical resource for most organizations, because it enables (but is not restricted to) more accurate decision making, compliance, and improved efficiency.

Data integration is often the first point of contact for data arriving into a business (from third parties), and the hub for data held within a business, and as such, plays a key part in ensuring that data is fit for use.

This section concentrates on some of the features and methods within Talend that enable the developer to identify and capture invalid data, so that it can be reported.

Enabling and disabling reject flows


Rejected data is closely coupled to schemas (Chapter 2, Metadata and Schemas), as many of the input and output components will validate data according to a schema definition and then pass any incorrect data to a reject flow.

Reject flows thus allow non-conforming data to be collected and handled as per the needs of a project.

In some cases, depending upon the business requirement, rejects are not acceptable. In these cases, reject flows should be disabled and the job allowed to fail.

Tip

Whether a job dies on the first incorrect record, collects rejects in a file, or completely ignores rejects is a design decision that should be based upon the requirements for the process. Where possible, designers and developers should attempt to define how errors and rejects are handled before coding begins.

Getting ready

Open the job jo_cook_ch03_0000_inputReject.

How to do it…

  1. Run the job and it will fail with an unparseable date error.

  2. Open the tFileInputDelimited component...

Gathering all rejects prior to killing a job


As an alternative to collecting incorrect rows up to the point where a job fails (Die on error), you may wish to capture all rejects from an input before killing a job.

This has the advantage of enabling support personnel to identify all problems with source data in a single pass, rather than having to re-execute a job continually to find and fix a single error / set of errors at a time.

Getting ready

Open the job jo_cook_ch03_0010_validationSubjob. As you can see, the reject flow has been attached and the output is being sent to a temporary store (tHashMap).

How to do it…

  1. Add the tJava, tDie, tHashInput, and tFileOutputDelimited components.

  2. Add onSubjobOk to tJava from the tFileInputDelimited component.

  3. Add a flow from the tHashInput component to the tFileOutputDelimited component.

  4. Right-click the tJava component, select Trigger and then Runif. Link the trigger to the tDie component. Click the if link, and add the following code

    ((Integer)globalMap.get...

Validating against the schema


The tSchemaComplianceCheck is a very useful component for ensuring that the data passing downstream is correct with respect to the defined schema.

This simple exercise demonstrates how rows can be rejected using this component.

Getting ready

Open the job jo_cook_ch03_0020_schemaCompliance.

How to do it…

  1. Run the job. You should see two rows being rejected.

  2. Add a tSchemaComplianceCheck and two tLogRow, right click on tSchemaComplianceCheck_1 and select Row then Rejects. Join the flow one of the new tLogRow. Connect the main to the other as shown:

  3. Now, when you run the job, you will see an additional reject row being output from the tSchemaComplianceCheck component.

How it works…

The tFileInputDelimited component will detect only some of the anomalies within the data, whereas the tSchemaComplianceCheck component will perform a much more thorough validation of the data.

If you look at the output, you will see the log entry, which shows that the name field has exceeded the...

Rejecting rows using tMap


This recipe shows how tMap can be used to ensure that unwanted rows are not propagated downstream. This may be as a result of the filter criteria or a validation rule.

Getting ready

Open the job jo_cook_ch03_0030_tMapRejects.

How to do it…

  1. Open the tMap and click the Activate/unactivate expression filter button for the validRows output.

  2. In the Expression box add the code customer.age >= 18.

  3. Click on the tMapRejects output and then on the tMapSettings button.

  4. Click on Catch output reject value column to set it to true.

  5. Run the job. You should see that one of the rows has been rejected.

How it works…

In this example, tMap is working like an if statement. Therefore, if customer's age is greater than eighteen, then write the record to validRows or else pass the data to the tMapRejects.

There's more…

You can use this method to test for multiple different rejects, by adding additional outputs and adding different filter criteria to each output.

The tMap component will process any...

Checking a column against a list of allowed values


Often it is necessary to ensure that a column contains only values as defined in a list. This recipe shows how this can be achieved using a tMap expression.

Getting ready

Open the job jo_cook_ch03_0040_tMapValuesInList. You will notice that the job is very similar to the previous recipe Rejecting rows using tMap.

How to do it…

  1. Open tMap and click the expression builder button (…), and add the test criteria, as shown in the following screenshot:

  2. Run the job and you should see that one of the rows is rejected.

How it works…

The tMap conditions are the same as Java conditions, so the symbol || (pipe pipe) is a logical OR.

Thus, the condition checks for the value being UK or USA or France.

There's more…

This method is fine if the list is quite small and isn't liable to change. If the list is too large or subject to frequent changes, then the code will be hard to maintain and/or will need to be changed often, which will require re-testing of the code....

Checking a column against a lookup


Another method for validating a column is to refer to a lookup containing a list of allowed values that can be stored in any format (file, table, XML for example).

Getting ready

Open the job jo_cook_ch03_0050_tMapValuesInLookup.You will see that there are two inputs to the tMap: customer and country.

How to do it…

  1. Open tMap, and drag the field countryOfBirth from the customer input to the countryName field in the country input. This will create a key link, as shown in the following screenshot:

  2. Click the button tMap settings and set the value for Join Model to Inner Join.

  3. In the reject output, click on the button for tMap settings, and set the value for Catch lookup inner join reject to true.

  4. Run the job and you will see that three of the records have been rejected.

  5. Re-open the tMap and change the Expr.key on the country to StringHandling.UPCASE(customer.countryOfBirth)

  6. Re-run the job and you will see that now only one record has been rejected.

How it works…

The tMap...

Creating validation rules for more complex requirements


Sometimes validation rules require multiple inputs to provide a pass/fail result, so it is often easier to build and understand the code if it is written using Java.

If you aren't familiar with code routines in Talend, it is recommended that you first complete the recipe Creating custom functions using code routines, Chapter 5 , Using Java in Talend that will take you through the setup of code routines.

Getting ready

Open the job jo_cook_ch03_0060_validationCodeRoutine.

How to do it…

  1. Create a new code routine called validation, and copy the following code into it:

        /**
         * validateCustomerAge: Check customer is 18 or over for UK, 21 or over for rest of world.
         * returns true if valid, false if invalid
         * e.g. validateCustomerAge(23,"UK")
         * 
         * {talendTypes} Boolean
         * 
         * {Category} Validation
         * 
         * {param} string(age) input: Customer age
         * {param} string(country) input: Customer country
         ...

Creating binary error codes to store multiple test results


Note

Prior to doing this exercise, it is recommended that you first jump forward to Chapter 4, Mapping Data, and do the exercises related to ternary operators and using variables in tMap.

Sometimes, it is desirable to perform multiple checks on a row at the same time, so that when a row is rejected, all of the problems with the data can be identified from a single error message. An excellent method of recording this is to create a binary error code.

A binary error code is a binary number, where each of the digit position represents the result of a validation test: 0 being pass and 1 being fail.

For example, 1101 = failed test 1 (rightmost digit), test 3 and test 4 and passed test 2. This binary value can be held as a decimal integer, in this case 13.

Getting ready

Open the job jo_cook_ch03_0070_binaryErrorCode.

How to do it…

  1. Open tMap and create six new Integer type variables: nameTest, dateOfBirthTest, timestampTest, ageTest, countryOfBirthTest...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Talend Open Studio Cookbook
Published in: Oct 2013Publisher: PacktISBN-13: 9781782167266
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Rick Barton

Rick Barton is a freelance consultant who has specialized in data integration and ETL for the last 13 years as part of an IT career spanning over 25 years. After gaining a degree in Computer Systems from Cardiff University, he began his career as a firmware programmer before moving into Mainframe data processing and then into ETL tools in 1999. He has provided technical consultancy to some of the UKs largest companies, including banks and telecommunications companies, and was a founding partner of a Big Data integration consultancy. Four years ago he moved back into freelance development and has been working almost exclusively with Talend Open Studio and Talend Integration Suite, on multiple projects, of various sizes, in UK. It is on these projects that he has learned many of the lessons that can be found in this, his first book.
Read more about Rick Barton