Talend Open Studio Cookbook
Over 100 recipes to help you master Talend Open Studio and become a more effective data integration developer with this book and ebook
This article by Rick Barton, the author of Talend Open Studio Cookbook, focuses on gathering rejects. As an alternative to collecting incorrect rows only up to the point where a job fails (Die on error), you may wish to capture all rejects from an input before killing the job.
This has the advantage of enabling support personnel to identify all problems with the source data in a single pass, rather than having to re-execute the job repeatedly to find and fix one error or set of errors at a time.
Open the job jo_cook_ch03_0010_validationSubjob. As you can see, the reject flow has been attached and the main output is being sent to a temporary store (tHashOutput).
How to do it…
- Add the tJava, tDie, tHashInput, and tFileOutputDelimited components.
- Add an OnSubjobOk trigger from the tFileInputDelimited component to the tJava component.
- Add a flow from the tHashInput component to the tFileOutputDelimited component.
- Right-click the tJava component, select Trigger, and then Run if. Link the trigger to the tDie component. Click the if link, and add the following code:
((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) > 0
- Right-click the tJava component, select Trigger, and then Run if. Link this trigger to the tHashInput component. Click the if link, and add the following code:
((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) == 0
The job should now look like the following:
- Drag the generic schema sc_cook_ch3_0010_genericCustomer to both the tHashInput and tFileOutputDelimited.
- Run the job. You should see that the tDie component is activated, because the file contained two errors.
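The two Run if conditions in the steps above can be tried outside Talend with a plain Java sketch. It assumes that globalMap behaves like a java.util.Map<String, Object> and that the reject file is written by tFileOutputDelimited_1. The class RejectGate and its methods are hypothetical names used only for illustration, and treating a missing counter as zero is an assumption of this sketch rather than something the recipe does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the two Run if links leaving the tJava component.
class RejectGate {
    static final String REJECT_COUNTER = "tFileOutputDelimited_1_NB_LINE";

    // Reads the reject counter; a missing or non-Integer value is treated as 0
    // (an assumption of this sketch, to avoid a NullPointerException on unboxing).
    static int rejectCount(Map<String, Object> globalMap) {
        Object value = globalMap.get(REJECT_COUNTER);
        return (value instanceof Integer) ? (Integer) value : 0;
    }

    // Mirrors the condition on the link to tDie.
    static boolean shouldDie(Map<String, Object> globalMap) {
        return rejectCount(globalMap) > 0;
    }

    // Mirrors the condition on the link to tHashInput.
    static boolean shouldProcess(Map<String, Object> globalMap) {
        return rejectCount(globalMap) == 0;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put(REJECT_COUNTER, 2); // two rejected rows, as in the recipe's input file
        System.out.println("die=" + shouldDie(globalMap)
                + ", process=" + shouldProcess(globalMap));
    }
}
```

Because the two conditions are mutually exclusive, exactly one of the two links fires on any run: either the job dies or the valid rows are read back from the hash store.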
How it works…
What we have done in this exercise is create a validation stage that runs before the data is processed.
Valid rows are held in temporary storage (tHashOutput) and invalid rows are written to a reject file until all input rows are processed.
The job then checks how many records were rejected (using the Run if links). In this instance, there are invalid rows, so the Run if link to tDie is triggered, and the job is killed.
By ensuring that the data is correct before we start to process it into a target, we know that the data will be fit for writing to the target, thus avoiding the need for rollback procedures.
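The same validate-first pattern can be sketched in plain Java, independent of Talend. Here an in-memory list stands in for the hash store and a second list for the reject file. The class ValidateThenProcess, the semicolon-delimited layout, and the rule that a row needs three fields with a numeric first field are all assumptions made for this illustration and are not taken from the recipe's schema.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of "validate everything, then process or die".
class ValidateThenProcess {

    // Assumed rule: exactly three ';'-separated fields, the first a numeric id.
    static boolean isValid(String row) {
        String[] fields = row.split(";", -1);
        if (fields.length != 3) {
            return false;
        }
        try {
            Integer.parseInt(fields[0].trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Reads every input row before deciding. Invalid rows are added to 'rejects';
    // if any exist, the run is aborted and no valid rows are released for writing.
    static List<String> run(List<String> input, List<String> rejects) {
        List<String> valid = new ArrayList<>();
        for (String row : input) {
            if (isValid(row)) {
                valid.add(row);     // held in temporary storage
            } else {
                rejects.add(row);   // collected for the support team
            }
        }
        if (!rejects.isEmpty()) {
            throw new IllegalStateException(
                    rejects.size() + " rejected row(s); nothing written to target");
        }
        return valid;
    }

    public static void main(String[] args) {
        List<String> rejects = new ArrayList<>();
        List<String> input = Arrays.asList("1;Ann;UK", "x;Bob;UK", "3;Cy", "4;Dee;UK");
        try {
            List<String> valid = run(input, rejects);
            System.out.println("Processing " + valid.size() + " valid rows");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
            System.out.println("Rejects: " + rejects);
        }
    }
}
```

The point of the design is that the exception is raised only after the whole input has been scanned, so the rejects list is complete when the run stops.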
The rejected records can then be sent to the support team, who will have a record of all incorrect rows. These rows can be fixed in situ within the source file, and the job can simply be re-run from the beginning.
This technique is particularly valuable when rollback or correction of a job would be complex, or where there may be a higher than expected number of errors in an input.
An example would be when there are multiple executions of a job that appends to a target file. If the job fails midway through, then rolling back involves identifying which records were appended to the file by the job before failure, removing them from the file, fixing the offending record, and then re-running. This runs the risk of a second error causing the same thing to happen again.
On the other hand, if the job does not die, but a subsection of the data is rejected, then the rejects must be manipulated into the target file via a second manual execution of the job.
So, this method enables us to be certain that our records will not fail to write due to incorrect data, and therefore saves our target from becoming corrupted.
This article has shown how to collect all rejects before killing a job. It has also discussed the alternative, in which rejected rows must be manipulated into the target file after a partial run.
About the Author:
Rick Barton is a freelance consultant who has specialized in data integration and ETL for the last 13 years as part of an IT career spanning over 25 years. After gaining a degree in Computer Systems from Cardiff University, he began his career as a firmware programmer before moving into Mainframe data processing and then into ETL tools in 1999.
He has provided technical consultancy to some of the UK’s largest companies, including banks and telecommunications companies, and was a founding partner of a “Big Data” integration consultancy.
Four years ago he moved back into freelance development and has been working almost exclusively with Talend Open Studio and Talend Integration Suite on multiple projects of various sizes in the UK. It is on these projects that he has learned many of the lessons that can be found in this, his first book.