Open the job jo_cook_ch03_0010_validationSubjob. As you can see, the reject flow has been attached, and the output is being sent to a temporary store (tHashOutput).
How to do it…
- Add the tJava, tDie, tHashInput, and tFileOutputDelimited components.
- Add an OnSubjobOk trigger from the tFileInputDelimited component to the tJava component.
- Add a flow from the tHashInput component to the tFileOutputDelimited component.
- Right-click the tJava component, select Trigger, and then Run if. Link the trigger to the tDie component. Click the if link, and add the following code:
((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) > 0
- Right-click the tJava component again, select Trigger, and then Run if. Link this trigger to the tHashInput component. Click the if link, and add the following code (a null-safe sketch of both conditions follows this list):
((Integer)globalMap.get("tFileOutputDelimited_1_NB_LINE")) == 0
The job should now look like the following:
- Drag the generic schema sc_cook_ch3_0010_genericCustomer onto both the tHashInput and tFileOutputDelimited components.
- Run the job. You should see that the tDie component is activated, because the file contained two errors.
How it works…
In this exercise, we have created a validation stage prior to processing the data.
Valid rows are held in temporary storage (tHashOutput) and invalid rows are written to a reject file until all input rows are processed.
The job then checks how many records were rejected (using the Run if link). In this instance, there are invalid rows, so the Run if link to tDie is triggered, and the job is killed.
By ensuring that the data is correct before we start to process it into a target, we know that the data will be fit for writing to the target, thus avoiding the need for rollback procedures.
The captured records can then be sent to the support team, who will have a record of all incorrect rows. These rows can be fixed in situ within the source file, and the job can simply be re-run from the beginning.
This technique is particularly important when rollback/correction of a job may be particularly complex, or where there may be a higher than expected number of errors in an input.
An example would be multiple executions of a job that appends to a target file. If the job fails midway through, rolling back involves identifying which records the failed run appended to the file, removing them, fixing the offending record, and then re-running, with the risk that a second error causes the same thing to happen again.
On the other hand, if the job does not die but instead rejects a subsection of the data, then the rejects must be manipulated into the target file via a second, manual execution of the job.
So, this method enables us to be certain that our records will not fail to write due to incorrect data, and therefore saves our target from becoming corrupted.
This article has shown how the rejects are collected before killing a job. It has also shown how incorrect rejects can be manipulated into the target file.