You're reading from Talend Open Studio Cookbook

Product type: Book
Published in: Oct 2013
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781782167266
Edition: 1st

Author: Rick Barton

Rick Barton is a freelance consultant who has specialized in data integration and ETL for the last 13 years as part of an IT career spanning over 25 years. After gaining a degree in Computer Systems from Cardiff University, he began his career as a firmware programmer before moving into mainframe data processing, and then into ETL tools in 1999. He has provided technical consultancy to some of the UK's largest companies, including banks and telecommunications companies, and was a founding partner of a Big Data integration consultancy. Four years ago he moved back into freelance development and has been working almost exclusively with Talend Open Studio and Talend Integration Suite, on multiple projects of various sizes, in the UK. It is on these projects that he has learned many of the lessons that can be found in this, his first book.
Chapter 8. Managing Files

This chapter contains recipes that show some of the techniques used to read and write data to files. It also contains recipes that show techniques to manage files within a file system. We will cover the following recipes in this chapter:

  • Appending records to a file

  • Reading rows using a regular expression

  • Using temporary files

  • Storing data in memory using tHashMap

  • Reading headers and trailers using tMap

  • Reading headers and trailers with no identifiers

  • Using the information in the header and trailer

  • Adding a header and trailer to a file

  • Moving, copying, renaming, and deleting files and folders

  • Capturing file information

  • Processing multiple files at once

  • Processing control/validation files

  • Creating and writing files depending on the input data

Introduction


It isn't very efficient to process large batches of information via a web service, nor is it particularly desirable to pull data from an application database during peak hours. Thus, many organizations still maintain file-based overnight batch processes that use large extracts in file format.

In addition, many older, legacy applications rely solely on file-based data for communicating with the outside world.

It is therefore very important for the data integration developer to understand many file types and be able to manage them efficiently and effectively.

Note

This chapter deals with "flat" files, which, for our purposes, means files that do not carry their metadata with them, unlike the XML and JSON formats described in Chapter 9, Working with XML, Queues, and Web Services.

This does not mean that we will only deal with simple files. Some of the recipes in this chapter will deal with complex hierarchical file structures.

Appending records to a file


This simple recipe shows how a file can be built up within different sub jobs by appending data to an existing file. The append method is one way of building complex files, as will be demonstrated in later recipes in this chapter.

Getting ready

Open the jo_cook_ch08_0010_fileAppend job.

How to do it...

The steps for appending records to a file are as follows:

  1. Copy the complete subjob1 – copy me sub job and paste it to create a second sub job.

  2. Link the two sub jobs using an onSubjobOK link.

  3. Open tFixedFlowInput, and change Records from first subjob to Records from second subjob.

  4. Open tFileOutputDelimited on the new sub job, and tick Append, as shown in the following screenshot:

How it works...

The first sub job creates the file, and the second sub job appends to the same file.
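In plain Java terms, ticking Append maps to opening the output file in append mode rather than create/truncate mode. The following minimal sketch (file name and row contents invented) mirrors the two sub jobs: the first write creates the file, the second appends to it.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class AppendDemo {
    // The first write creates (or truncates) the file, like the first
    // tFileOutputDelimited; the second uses APPEND, like the second
    // tFileOutputDelimited with the Append box ticked.
    static Path buildFile(Path file) throws IOException {
        Files.write(file, List.of("Records from first subjob"));
        Files.write(file, List.of("Records from second subjob"),
                    StandardOpenOption.APPEND);
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path f = buildFile(Files.createTempFile("append", ".txt"));
        Files.readAllLines(f).forEach(System.out::println);
    }
}
```

Note that opening in append mode assumes the file already exists; in the job, the first sub job guarantees that.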

There's more…

While relatively trivial, this recipe demonstrates a very powerful method for creating files that do not adhere to the norm, such as files containing a mixture of fixed and delimited data...

Reading rows using a regular expression


Regular expressions (regex) are a powerful method for pattern matching and replacement in many programming languages; a full treatment is outside the scope of this book (a good starting point is the Javadoc for regex patterns at http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html). One interesting use for regular expressions is in dealing with unusual input formats that are difficult to describe using normal delimited or fixed-width file definitions. This recipe shows how regex can be used to identify a set of input columns within an unstructured input row.

Getting ready

The screenshot of the chapter8_jo_0020_jobLogData.txt file is as follows:

You should notice that there is no obvious delimiter, nor does each record fit a fixed-width format.

Now, open the jo_cook_ch08_0020_readRegexData job.

How to do it...

The steps for reading rows using regular expressions are as follows:

  1. Open tFileInputRegex and enter the following code:

    "^job: "+
    ...
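The pattern in the step above is truncated, so purely as an illustration, here is how java.util.regex (the engine behind tFileInputRegex) turns an unstructured row into columns: each capturing group in the pattern becomes one output column. The log-line format and field layout below are invented, not the book's actual file.

```java
import java.util.regex.*;

public class RegexRowDemo {
    // Hypothetical row format: "job: <name> started at <date>".
    // Each (...) group maps onto one column of the output schema.
    static final Pattern ROW = Pattern.compile(
        "^job: (\\w+) started at (\\d{4}-\\d{2}-\\d{2})");

    static String[] parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.find()) return null;   // rows that do not match are rejected
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] cols = parse("job: loadCustomers started at 2013-10-01");
        System.out.println(cols[0] + " | " + cols[1]);
    }
}
```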

Using temporary files


Occasionally, it is necessary to create intermediate files within a job that are only used during the lifetime of the job. This recipe shows how to use Talend temporary files.

Getting ready

Open the jo_cook_ch08_0030_temporaryFile job.

How to do it...

The steps for using temporary files are as follows:

  1. Open the tCreateTemporaryFile component, and change the name to customerTemp_XXXX.

  2. Select the options Remove file when execution is over, and Use temporary system directory.

  3. Open the tempCustomerOut component, and change File Name to ((String)globalMap.get("tCreateTemporaryFile_1_FILEPATH")).

  4. Repeat the steps for the tempCustomerIn component.

  5. Run the job, and you will see that data is written to and read from the temporary file.

How it works...

The tCreateTemporaryFile component creates an empty file that is then available for writing in the main sub job. The name of the file is stored in the globalMap variable tCreateTemporaryFile_1_FILEPATH, which is referenced by both the output...
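Under the covers this behaves much like standard Java temporary-file handling. A minimal plain-Java sketch of the same lifecycle (unique name in the system temporary directory, removal when execution is over, and one generated path shared by the writer and the reader):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class TempFileDemo {
    // Like customerTemp_XXXX with "Use temporary system directory": the
    // random part of the name guarantees uniqueness, and deleteOnExit()
    // plays the role of "Remove file when execution is over".
    static Path roundTrip() throws IOException {
        Path temp = Files.createTempFile("customerTemp_", ".tmp");
        temp.toFile().deleteOnExit();

        // The returned path plays the role of the FILEPATH globalMap entry:
        // both the writing and the reading step reference the same name.
        Files.write(temp, List.of("row 1", "row 2"));
        return temp;
    }

    public static void main(String[] args) throws IOException {
        Path temp = roundTrip();
        System.out.println(Files.readAllLines(temp));
    }
}
```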

Storing intermediate data in memory using tHashMap


While not strictly file based, there are alternative methods for storing intermediate data which are more efficient than using temporary files, so long as there is enough memory to hold the temporary data. This recipe shows how to do this using the tHashMap component.

Getting ready

Open the jo_cook_ch08_0040_temporaryDatatHashMap job. You will notice that this is the same job as in the previous recipe.

How to do it...

The steps for storing intermediate data in memory using tHashMap are as follows:

  1. Delete the tCreateTemporaryFile component.

  2. Replace the tFileOutputDelimited with a tHashOutput component, having a generic schema of sc_cook_ch8_0040_genericCustomerOut.

  3. Replace the tFileInputDelimited with a tHashInput component, using the same generic schema, sc_cook_ch8_0040_genericCustomerOut.

  4. Add the onSubjobOk link.

  5. Run the job, and the results will be the same as for the previous recipe.

How it works...

tHashMap creates an in-memory structure that holds all the data in the flow. It...
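A rough plain-Java equivalent of the tHashOutput/tHashInput pair is a shared in-memory collection: one step adds rows to it, and a later step reads them back without any disk I/O. This sketch is illustrative only; the real components also manage schemas and multiple named stores.

```java
import java.util.*;

public class HashOutputDemo {
    // One shared store stands in for the hash component's memory buffer.
    static final List<String[]> store = new ArrayList<>();

    // tHashOutput: append a row to the in-memory store.
    static void write(String[] row) { store.add(row); }

    // tHashInput: read the stored rows back, unchanged.
    static List<String[]> readAll() { return List.copyOf(store); }

    public static void main(String[] args) {
        write(new String[] {"1", "Alice"});
        write(new String[] {"2", "Bob"});
        for (String[] row : readAll())
            System.out.println(String.join(",", row));
    }
}
```

This is faster than a temporary file, but only works while the whole intermediate dataset fits in memory.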

Reading headers and trailers using tMap


This recipe shows how to parse a file that has header and trailer records, and a record type at the start of a line.

Getting ready

Open the jo_cook_ch08_0060_headTrailtMap job.

How to do it...

The steps for reading headers and trailers using tMap are as follows:

  1. Drag a tMap component onto the canvas.

  2. Connect the tFileInputFullRow to tMap, and rename the flow to customerIn.

  3. Open tMap, and create three new outputs. Name them header, detail, and trailer.

  4. Copy the input field line into each of the new outputs.

  5. Add the expression filter customerIn.line.startsWith("00") to the header output table.

  6. Add the expression filter customerIn.line.startsWith("01") to the detail output table.

  7. Add the expression filter customerIn.line.startsWith("99") to the trailer output table.

  8. Your tMap should now look like the one shown as follows:

  9. Close tMap, and drag three tExtractDelimitedFields components to the canvas, along with three tLogRow components.

  10. Join each output from tMap to each...
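The three filter expressions amount to routing each full row by the record-type identifier at the start of the line. A plain-Java sketch of that routing (the `|` delimiter and the record layouts are invented for illustration):

```java
import java.util.List;

public class RecordRouter {
    // The same startsWith tests as the three tMap expression filters:
    // "00" = header, "01" = detail, "99" = trailer.
    static String route(String line) {
        if (line.startsWith("00")) return "header";
        if (line.startsWith("01")) return "detail";
        if (line.startsWith("99")) return "trailer";
        return "rejected";
    }

    public static void main(String[] args) {
        for (String line : List.of("00|2013-10-01", "01|1|Alice", "99|1"))
            System.out.println(route(line) + ": " + line);
    }
}
```

Each routed row is then split into its real columns, which is the job of the tExtractDelimitedFields components.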

Reading headers and trailers with no identifiers


This recipe shows how to parse a file that has header and trailer records, but does not have an associated record type. Instead, the header is the first record in the file, and the trailer is the last record in the file.

Getting ready

Open the jo_cook_ch08_0070_headTrailtMapNoType job. You will see that it is a slightly changed version of the completed job from the previous recipe; the output schemas have changed.

How to do it...

The steps for reading headers and trailers with no identifiers are as follows:

  1. Drag a tFileRowCount component onto the canvas.

  2. Open the tFileRowCount, and change File Name to context.cookbookData+"/chapter8/chapter08_jo_0070_customerData.txt", which is the same as our input file.

  3. Connect an onSubJobOk trigger from the tFileRowCount component to the tFileInputDelimited.

  4. Open the tMap, and add a new variable rowCount. Set its expression to Numeric.sequence("rowNumber",1,1).

  5. Change the Filter expressions for header, detail, and...
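The positional logic that those changed filters implement can be sketched in plain Java: the total row count is known before the main flow starts (that is what tFileRowCount provides), and a sequence numbers each row as it arrives, as the rowCount tMap variable does.

```java
public class PositionalRouter {
    // With no record-type identifiers, position does the work:
    // first row = header, last row = trailer, everything else = detail.
    static String route(int rowNumber, int totalRows) {
        if (rowNumber == 1) return "header";
        if (rowNumber == totalRows) return "trailer";
        return "detail";
    }

    public static void main(String[] args) {
        int total = 4;   // supplied up front, as by tFileRowCount
        for (int row = 1; row <= total; row++)
            System.out.println("row " + row + " -> " + route(row, total));
    }
}
```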

Using the information in the header and trailer


This recipe follows on from the previous recipe, but shows how the information in the header can be added to the detail data, and the data in the trailer used for validation, as is typically the case with files of this type.

Getting ready

Open the jo_cook_ch08_0080_useHeaderAndTrailerInfo job. This job is the completed job from the previous recipe; however, do note that the tLogRow components have now been replaced with tHashOutput components. Also, note that three tHashInput components have also been added and configured.

How to do it...

We will perform two main tasks: the first is to use the trailer information to validate the file; the second is to take a column from the header and use it in all the output records.

Validation subjob

  1. Drag a tMap component onto the canvas, and join the trailer input to it. Rename the flow to trailerIn.

  2. Open the tMap component, and create an output table named rowCountError.

  3. Drag the input detailCount field to the output...
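At its core, the validation sub job is a single comparison: the detail count declared in the trailer against the number of detail rows actually read. A minimal sketch (parameter names assumed), where a mismatch is what routes a row to the rowCountError output:

```java
public class TrailerValidation {
    // True when the trailer's declared count matches the rows actually read.
    static boolean isValid(int trailerDetailCount, int actualDetailRows) {
        return trailerDetailCount == actualDetailRows;
    }

    public static void main(String[] args) {
        System.out.println(isValid(100, 100));  // file accepted
        System.out.println(isValid(100, 99));   // rowCountError case
    }
}
```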

Adding a header and trailer to a file


This recipe details a method for creating a file with a header and trailer record, which makes use of file append.

Getting ready

Open the jo_cook_ch08_0090_createHeaderAndTrailer job.

How to do it...

The steps for adding a header and trailer to a file are as follows:

  1. Open the tFixedFlowInput_1 component.

  2. Add the following for the field fileDate: TalendDate.getDate("CCYY-MM-DD").

  3. Open the tFixedFlowInput_2 component.

  4. Add the following for the field fileDate: ((Integer)globalMap.get("tFileInputDelimited_1_NB_LINE")).

  5. Open tFileOutputDelimited_1, and change the type to Append.

  6. Open tFileOutputDelimited_4, and change the type to Append.

  7. Run the job, and if you examine the output file, you should see that it has created a file with the current date in the header and the correct number of detail lines in the trailer.

How it works...

tFixedFlowInputs are used to generate a single row each for the header and the trailer.

The header sub job will create the file and add the...
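The overall pattern can be sketched in plain Java as three writes, of which the last two append. The `00|`/`01|`/`99|` record layouts here are invented for illustration; the date format stands in for the recipe's TalendDate call, and the trailer count stands in for the NB_LINE globalMap value.

```java
import java.io.IOException;
import java.nio.file.*;
import java.text.SimpleDateFormat;
import java.util.*;

public class HeaderTrailerWriter {
    // Header sub job creates the file; detail sub job appends the rows;
    // trailer sub job appends the row count.
    static Path build(Path file, List<String> details) throws IOException {
        String fileDate = new SimpleDateFormat("yyyy-MM-dd").format(new Date());
        Files.write(file, List.of("00|" + fileDate));             // create + header
        Files.write(file, details, StandardOpenOption.APPEND);    // detail rows
        Files.write(file, List.of("99|" + details.size()),
                    StandardOpenOption.APPEND);                   // trailer
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path f = build(Files.createTempFile("out", ".txt"),
                       List.of("01|1|Alice", "01|2|Bob"));
        Files.readAllLines(f).forEach(System.out::println);
    }
}
```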

Moving, copying, renaming, and deleting files and folders


As well as reading from and writing to files, Talend has a set of components that allow developers to perform file functions without the need to call native operating system commands. This recipe shows the basic file management components.

Getting ready

Open the job jo_cook_ch08_0100_basicFileCommands.

How to do it...

In the following steps, it is worth noting that Talend uses the Linux-style forward slash (/) in file paths, as opposed to the Windows backslash (\).

Copying a file to another directory

  1. Drag tFileCopy to the job.

  2. Set the file name to be context.cookbookData+"/chapter8/chapter08_jo_0100_copyFile.txt".

  3. Set the output directory to be context.cookbookData+"/outputData/chapter8".

  4. Run the job, and you will see that the new file has been created. This is a simple copy.

Copying file to a different name

  1. Open tFileCopy, tick the Rename box, and then add a Destination filename of chapter08_jo_0100_copyFileRenamed.txt.

  2. Run the job, and...
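These components wrap roughly the same operations that java.nio.file exposes directly. A self-contained sketch of copy, copy-with-rename, move, and delete, using temporary, invented paths rather than the book's context variables:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class FileCommands {
    static Path demo() throws IOException {
        Path dir = Files.createTempDirectory("chapter8");
        Path src = dir.resolve("copyFile.txt");
        Files.write(src, List.of("some data"));
        Path outDir = Files.createDirectories(dir.resolve("outputData"));

        // tFileCopy without Rename: same name, different directory
        Files.copy(src, outDir.resolve(src.getFileName()));

        // tFileCopy with Rename ticked: a destination filename is supplied
        Files.copy(src, outDir.resolve("copyFileRenamed.txt"));

        // a move is a copy that also removes the source; delete removes a file
        Files.move(src, outDir.resolve("moved.txt"));
        Files.delete(outDir.resolve("moved.txt"));
        return outDir;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(Files.exists(demo().resolve("copyFileRenamed.txt")));
    }
}
```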

Capturing file information


Another useful Talend feature is the ability to capture information about a file for use within downstream processing, most probably to perform validation prior to processing.

Getting ready

Open the jo_cook_ch08_0110_fileInformation job.

How to do it...

The steps for capturing file information are as follows:

  1. Drag a tFileProperties component from the right-hand panel. Open tFileProperties, and set the file name to context.cookbookData+"/chapter8/chapter08_jo_0110_customerData.txt".

  2. Drag tFlowToIterate to the canvas, and link the row from tFileProperties to it. Name the flow properties.

  3. Drag tFileRowCount to the canvas and set the filename to match the tFileProperties component.

  4. Add onSubjobOk from tFileProperties to tFileRowCount, and then to tFixedFlowInput, so that your job looks like the one shown as follows:

  5. Open tFixedFlowInput.

  6. Add ((Long)globalMap.get("properties.size")) to the field fileSize.

  7. Add ((Integer)globalMap.get("tFileRowCount_1_COUNT")) to the field numberOfRows...
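The two values captured here are a file size and a line count, which plain Java exposes as follows. Note the types: the size is a long (hence the (Long) cast in the recipe) while the row count is an int (hence the (Integer) cast).

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class FileInfo {
    // What tFileProperties reports for size...
    static long sizeOf(Path p) throws IOException { return Files.size(p); }

    // ...and what tFileRowCount reports for the number of rows.
    static int rowCount(Path p) throws IOException {
        return Files.readAllLines(p).size();
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("customerData", ".txt");
        Files.write(f, List.of("row 1", "row 2", "row 3"));
        System.out.println(sizeOf(f) + " bytes, " + rowCount(f) + " rows");
    }
}
```

Both values are typically checked in a validation step before the real processing starts.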

Processing multiple files at once


Often, with batch processes, it is required that multiple files are processed by the same job in a single tranche. This example shows how this can be achieved by merging a group of input files into a single output.

Getting ready

Open the jo_cook_ch08_0120_multipleFiles job. You will notice that it is currently reading a single file into a temporary file, and then copying the temporary file to a permanent output.

How to do it...

The steps for processing multiple files at once are as follows:

  1. Add a tFileList component, open it, and set the directory to context.cookbookData+"/chapter8".

  2. Click on the + button under the Filemask box, and add the filemask "chapter08_jo_0120_customerData_*.txt".

  3. Your tFileList should look like the one shown, as follows:

  4. Move the OnSubjobOk from the tFileInputDelimited to the tFileList.

  5. Add a tJava component.

  6. Right-click on tFileList and select Row, then Iterate, and link to the tJava.

  7. Right-click on the tJava and select Trigger, then OnComponentOk...
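tFileList's iteration over a filemask behaves like a glob over a directory: each matching file triggers one iteration, with the current file name made available to the downstream components. A plain-Java sketch using DirectoryStream (file names invented):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class MultiFileDemo {
    // One name per matching file, like tFileList iterating a filemask;
    // sorted for a deterministic processing order.
    static List<String> matchFiles(Path dir, String mask) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, mask)) {
            for (Path p : files) names.add(p.getFileName().toString());
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chapter8");
        Files.createFile(dir.resolve("customerData_1.txt"));
        Files.createFile(dir.resolve("customerData_2.txt"));
        Files.createFile(dir.resolve("other.txt"));
        System.out.println(matchFiles(dir, "customerData_*.txt"));
    }
}
```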

Processing control/validation files


Some organizations prefer to use a companion (control/validation) file containing file information instead of storing the information in the file header or trailer. This means that the detail file is much simpler to process, because it is a normal flat file.

In this recipe, the control file has the same name as the detail file; however, it is suffixed with .ctrl rather than .txt. This recipe shows how the control file is processed.

Getting ready

Open the jo_cook_ch08_0130_controlFile job. You will see that tFileList_1 is looking for files with the mask of chapter08_jo_0130_customerData*.txt. There are two of these in the directory.

How to do it...

The steps for processing control/validation files are as follows:

  1. Copy the first sub job.

  2. Change the new tFileList mask to StringHandling.EREPLACE(((String)globalMap.get("tFileList_1_CURRENT_FILE")),"txt","ctrl").

  3. Open tJava_2 and change the command to System.out.println("Found control file: "+((String)globalMap.get("tFileList_2_CURRENT_FILE...
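The StringHandling.EREPLACE call is a regular-expression replace; its effect here is simply to derive the companion control-file name from the data-file name. A plain-Java sketch (anchoring the match at the end of the name, a slightly safer variant than replacing "txt" anywhere):

```java
public class ControlFileName {
    // customerData_1.txt -> customerData_1.ctrl, matching the recipe's
    // convention of a same-named companion file with a .ctrl suffix.
    static String controlFileFor(String dataFile) {
        return dataFile.replaceAll("txt$", "ctrl");
    }

    public static void main(String[] args) {
        System.out.println(controlFileFor("chapter08_jo_0130_customerData_1.txt"));
    }
}
```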

Creating and writing files depending on the input data


Sometimes it is required that multiple files are written from a single data source where the file name is dependent upon the data held within the row. This recipe shows how this can be achieved.

Getting ready

Open the jo_cook_ch08_0140_filesFromInputData job.

How to do it...

The steps for creating and writing files depending on the input data are as follows:

  1. Run the job, and you will see that the file dummy.txt has been created and populated with six rows.

  2. Open the tJavaRow component, and you will see that the move of data from input to output has already been performed.

  3. Add in the following code after the generated code:

    // test for a change of input_row.key
    if (Numeric.sequence(input_row.key, 1, 1) == 1) {
      outtFileOutputDelimited_1.flush();

      // if this is the first record, then do not flush and close - we do not want to create dummy.txt
      // otherwise, if the sequence > 1, then we will close the previous file
      if(Numeric.sequence("all", 1...
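Stripped of the Talend plumbing, the idea is that each row's key selects (and, when first seen, creates) its own output file. A minimal plain-Java sketch with invented two-column rows, assuming the input arrives sorted by key:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class FilePerKey {
    // Each row is {key, value}; the output file name is derived from the
    // key held in the row, so one file is produced per distinct key.
    static Map<String, List<String>> split(List<String[]> rows) throws IOException {
        Path dir = Files.createTempDirectory("filesFromInputData");
        for (String[] row : rows) {
            Files.write(dir.resolve(row[0] + ".txt"), List.of(row[1]),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        // Read everything back, keyed by file name, to show the result.
        Map<String, List<String>> out = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.txt")) {
            for (Path p : files)
                out.put(p.getFileName().toString(), Files.readAllLines(p));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(split(List.of(
            new String[] {"A", "row 1"}, new String[] {"A", "row 2"},
            new String[] {"B", "row 3"})));
    }
}
```

The recipe's sequence test serves the same purpose as sorting by key here: it detects where one key's rows end and the next key's begin, so the previous file can be closed before a new one is opened.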