Getting Started with Talend Open Studio for Data Integration — Save 50%
Develop system integrations with speed and quality using Talend Open Studio for Data Integration with this book and ebook
Talend Open Studio for Data Integration (TOS) is an open source graphical development environment for creating custom integrations between systems.
This article by Jonathan Bowen, author of Getting Started with Talend Open Studio for Data Integration, shows how to manage files during integration jobs. We'll look at renaming, moving, copying, and deleting files; how to timestamp a file; connecting to remote servers to FTP files; and zipping and unzipping files.
Managing local files
In this section we will look at local file operations. We'll cover common operations that all computer users will be familiar with—copying, deleting, moving, renaming, and archiving files. We'll also look at some not-so-common techniques, such as timestamping files, checking for the existence of a file, and listing the files in a directory.
For our first file job, let's look at a simple file copy process. We will create a job that looks in a specific directory for a file and copies it to another location.
Let's do some setup first (we can use this for all of the file examples). In your project directory, create a new folder and name it FileManagement. Within this folder, create two more folders and name them Source and Target. In the Source directory, drop a simple text file and name it original.txt. Now let's create our job:
- Create a new folder in Repository and name it Chapter6.
- Create a new job within the Chapter6 directory and name it FileCopy.
- In the Palette, search for copy. You should be able to locate a tFileCopy component. Drop this onto the Job Designer.
- Click on its Component tab. Set the File Name field to point to the original.txt file in the Source directory.
- Set the Destination directory field to point to the Target directory.
For now, let's leave everything else unchanged. Click on the Run tab and then click on the Run button. The job should complete pretty quickly and, because we only have a single component, there are no data flows to observe. Check your Target folder and you will see the original.txt file in there, as expected. Note that the file still remains in the Source folder, as we were simply copying the file.
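For readers who want to see the same operation outside the Studio, here is a minimal plain-Java sketch of a file copy. It is not the code the Studio generates; the class and method names are hypothetical, the temporary folders stand in for the Source and Target directories, and overwriting an existing target file is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class FileCopySketch {

    // Copies sourceFile into targetDir under the same name and returns the new path.
    // The source file is left in place.
    static Path copyInto(Path sourceFile, Path targetDir) {
        try {
            Files.createDirectories(targetDir);
            Path target = targetDir.resolve(sourceFile.getFileName());
            // Assumption: an existing file of the same name is overwritten.
            return Files.copy(sourceFile, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the copy against throwaway folders; true when both copies exist afterwards.
    static boolean demo() {
        try {
            Path base = Files.createTempDirectory("FileManagement");
            Path source = Files.createDirectories(base.resolve("Source"));
            Path original = Files.write(source.resolve("original.txt"), "hello".getBytes());
            Path copied = copyInto(original, base.resolve("Target"));
            return Files.exists(copied) && Files.exists(original);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Copy succeeded and source kept: " + demo());
    }
}
```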
Copying and removing files
Our next example is a variant of our first file management job. Previously, we copied a file from one folder to another, but often you will want to move a file instead. To use an analogy from desktop operating systems and programs, we want to do a cut and paste rather than a copy and paste. Open the FileCopy job and follow the given steps:
- Remove the original.txt file from the Target directory, making sure it still exists in the Source directory.
- In the Basic settings tab of the tFileCopy component, select the checkbox for Remove source file.
- Now run the job. This time the original.txt file will be copied to the Target directory and then removed from the Source directory.
We can also use the tFileCopy component to rename files as we copy or move. Again, let's work with the FileCopy job we have created previously. Reset your Source and Target directories so that the original.txt file only exists in Source.
- In the Basic settings tab, check the Rename checkbox. This will reveal a new parameter, Destination filename.
- Change the default value of the Destination filename parameter to modified_name.txt.
- Run the job. The original file will be copied to the Target directory and renamed. The original file will also be removed from the Source directory.
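A move with a rename can be sketched in plain Java as a single operation. This is only an illustration of the behavior described above, not the Studio's implementation; the class and method names are hypothetical and the folders are temporary stand-ins for Source and Target.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class FileMoveSketch {

    // Moves sourceFile into targetDir under newName; the source no longer exists afterwards.
    static Path moveAndRename(Path sourceFile, Path targetDir, String newName) {
        try {
            Files.createDirectories(targetDir);
            return Files.move(sourceFile, targetDir.resolve(newName),
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True when the renamed file exists in the target and the original is gone.
    static boolean demo() {
        try {
            Path base = Files.createTempDirectory("FileManagement");
            Path source = Files.createDirectories(base.resolve("Source"));
            Path original = Files.write(source.resolve("original.txt"), "hello".getBytes());
            Path moved = moveAndRename(original, base.resolve("Target"), "modified_name.txt");
            return Files.exists(moved) && Files.notExists(original);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Moved and renamed: " + demo());
    }
}
```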
It is really useful to be able to delete files, for example, once they have been transformed or processed into other systems. Our integration jobs should "clean up afterwards", rather than leaving lots of interim files cluttering up the directories. In this job example we'll delete a file from a directory. This is a single-component job.
- Create a new job and name it FileDelete.
- In your workspace directory, FileManagement/Source, create a new text file and name it file-to-delete.txt.
- From the Palette, search for filedelete and drag a tFileDelete component onto the Job Designer.
- Click on its Component tab to configure it. Change the File Name parameter to be the path to the file you created earlier in step 2.
- Run the job. After it is complete, go to your Source directory and the file will no longer be there.
Note that the file does not get moved to the recycle bin on your computer, but is deleted immediately.
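As a point of comparison, a delete in plain Java behaves the same way: the file is removed directly rather than sent to a recycle bin. The sketch below uses hypothetical names and a temporary folder; it is not the code the Studio generates.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class FileDeleteSketch {

    // Deletes the file immediately and reports whether it is gone.
    // Files.delete throws NoSuchFileException (an IOException) if the file is missing.
    static boolean deletePermanently(Path file) {
        try {
            Files.delete(file);
            return Files.notExists(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Creates a throwaway file and deletes it.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("Source");
            Path file = Files.createFile(dir.resolve("file-to-delete.txt"));
            return deletePermanently(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Deleted: " + demo());
    }
}
```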
Timestamping a file
Integration jobs, like any software, can fail or raise errors in real-life use. Server issues, previously unencountered bugs, or a host of other things can cause a job to behave in an unexpected manner, and when this happens, manual intervention may be needed to investigate the issue or recover the job that failed. A useful trick to incorporate into your jobs is to save files once they have been consumed or processed, in case you need to reprocess them at some point or, indeed, just for investigation and debugging purposes should something go wrong. A common way to save files is to rename them using a date/timestamp. By doing this you can easily identify when files were processed by the job. Follow the given steps to achieve this:
- Create a new job and call it FileTimestamp.
- Create a file in the Source directory named timestamp.txt. The job is going to move this to the Target directory, adding a timestamp to the file as it processes.
- From the Palette, search for filecopy and drop a tFileCopy component onto the Job Designer.
- Click on its Component tab and change the File Name parameter to point to the timestamp.txt file we created in the Source directory.
- Change the Destination Directory to direct to your Target directory.
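- Check the Rename checkbox and change the Destination filename parameter to "timestamp"+TalendDate.getDate("yyyyMMddHHmmss")+".txt". Note the uppercase HH, which gives the hour on a 24-hour clock; lowercase hh is the 12-hour form and would produce the same timestamp for a morning and an afternoon run 12 hours apart.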
The previous code snippet concatenates the fixed file name, "timestamp", with the current date/time as generated by the Studio's getDate function at runtime. The file extension ".txt" is added to the end too.
- Run the job and you will see a new version of the original file drop into the Target directory, complete with timestamp. Run the job again and you will see another file in Target with a different timestamp applied.
Depending on your requirements you can configure different timestamp formats. For example, if you are only going to be processing one file a day, you could dispense with the hours, minutes, and seconds elements of the timestamp and simply set the output format to "yyyyMMdd". Alternatively, to make the timestamp more readable, you could separate its elements with hyphens, for example "yyyy-MM-dd".
You can find more information about Java date formats at http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html.
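To make the effect of the pattern letters concrete, the following sketch formats one fixed date with several patterns using SimpleDateFormat directly. The class and method names are hypothetical, and the fixed date is chosen only so the results are repeatable. It also shows the difference between the HH (24-hour) and hh (12-hour) hour fields.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

class TimestampNameSketch {

    // Builds prefix + formatted date + extension.
    // Locale.ROOT keeps digits and calendar independent of the machine's locale.
    static String timestampedName(String prefix, String pattern, Date when, String extension) {
        return prefix + new SimpleDateFormat(pattern, Locale.ROOT).format(when) + extension;
    }

    public static void main(String[] args) {
        // A fixed afternoon time: 2 January 2013, 15:04:05 local time.
        Date when = new GregorianCalendar(2013, Calendar.JANUARY, 2, 15, 4, 5).getTime();

        // 24-hour clock: timestamp20130102150405.txt
        System.out.println(timestampedName("timestamp", "yyyyMMddHHmmss", when, ".txt"));
        // 12-hour clock, no AM/PM marker: timestamp20130102030405.txt
        System.out.println(timestampedName("timestamp", "yyyyMMddhhmmss", when, ".txt"));
        // Date only, hyphenated: timestamp2013-01-02.txt
        System.out.println(timestampedName("timestamp", "yyyy-MM-dd", when, ".txt"));
    }
}
```

Because hh drops the morning/afternoon distinction, the HH form is usually the safer choice when the timestamp has to keep file names unique.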
Listing files in a directory
Our next example job will show how to list all of the files (or all the files matching a specific naming pattern) in a directory. Where might we use such a process? Suppose our target system had a data "drop-off" directory, where all integration files from multiple sources were placed before being picked up to be processed. As an example, this drop-off directory might contain four product catalogue XML files, three CSV files containing inventory data, and 50 order XML files detailing what had been ordered by the customers. We might want to build a catalogue import process that picks up the four catalogue files, processes them by mapping to a different format, and then moves them to the catalogue import directory. The nature of the processing means we have to deal with each file individually, but we want a single execution of the process to pick up all available files at that point in time. This is where our file listing process comes in very handy and, as you might expect, the Studio has a component to help us with this task. Follow the given steps:
- Let's start by preparing the directory and files we want to list. Copy the FileList directory from the resource files to the FileManagement directory we created earlier. The FileList directory contains six XML files.
- Create a new job and name it FileList.
- Search for Filelist in the Palette and drop a tFileList component onto the Job Designer.
- Additionally, search for logrow and drop a tLogRow component onto the designer too.
We will use the tFileList component to read all of the filenames in the directory and pass this through to the tLogRow component. In order to do this, we need to connect the tFileList and tLogRow. The tFileList component works in an iterative manner—it reads each filename and passes it onwards before getting the next filename. Its connector type is Iterative, rather than the more common Main connector. However, we cannot connect an iterative component to the tLogRow component, so we need to introduce another component that will act as an intermediary between the two.
- Search for iteratetoflow in the Palette and drop a tIterateToFlow component onto the Job Designer. This bridges the gap between an iterate component and a flow component.
- Click on the tFileList component and then click on its Component tab. Change the directory value so that it points to the FileList directory we created in step 1.
- Click on the + button to add a new row to the File section. Change the value to "*.xml". This configures the component to search for any files with an XML extension.
- Right-click on the tFileList component, select Row | Iterate, and drop the resulting connector onto the tIterateToFlow component.
- The tIterateToFlow component requires a schema and, as the tFileList component does not have a schema, it cannot propagate this to the iterate-to-flow component when we join them. Instead we will have to create the schema directly. Click on the tIterateToFlow component and then on its Component tab. Click on the Edit schema button and, in the pop-up schema editor, click on the + button to add a row and then rename the column value to filename. Click on OK to close the window.
- A new row will be added to the Mapping table. We need to edit its value, so click in the Value column, delete the setting that exists, and press Ctrl + space bar to access the global variables list.
- Scroll through the global variable drop-down list and select "tFileList_1_CURRENT_FILE". This will add the required parameter to the Value column.
- Right-click on the tIterateToFlow component, select Row | Main, and connect this to the tLogRow component.
Let's run the job. It may run too quickly to be visible to the human eye, but the tFileList component will read the name of the first file it finds, pass this forward to the tIterateToFlow component, go back and read the second file, and so on. As the iterate-to-flow component receives its data, it will pass this on to tLogRow as row data. You will see the filenames listed as the output of the tLogRow component.
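The iterate-then-log pattern can be sketched in plain Java with a directory stream and a glob filter. This is an analogy for what the job does, not the Studio's generated code; the class and method names and the demo file names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FileListSketch {

    // Returns the names of entries in dir matching the glob (for example "*.xml"), sorted by name.
    static List<String> listNames(Path dir, String glob) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) { // one iteration per matching entry
                names.add(p.getFileName().toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(names);
        return names;
    }

    // Creates two XML files and one text file, then lists only the XML files.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("FileList");
            Files.createFile(dir.resolve("b.xml"));
            Files.createFile(dir.resolve("a.xml"));
            Files.createFile(dir.resolve("notes.txt"));
            return listNames(dir, "*.xml");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [a.xml, b.xml]
    }
}
```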
Now that we have cracked the basics of the file list component, let's extend the example to a real-life situation. Let's suppose we have a number of text files in our input directory, all conforming to the same schema. In the resources directory, you will find five files named fileconcat1.txt, fileconcat2.txt, and so on. Each of these has a "random" number of rows. Copy these files into the Source directory of your workspace. The aim of our job is to pick up each file in turn and write its output to a new file, thereby concatenating all of the original files. Let's see how we do this:
- Create a new job and name it FileConcat.
- For this job we will need a file list component, a delimited file output component, and a delimited file input component. As we will see in a minute, the delimited input component will be a "placeholder" for each of the input files in turn.
- Find the components in the Palette and drop them onto the Job Designer.
- Click on the file list component and change its Directory value to point to the Source directory.
- In the Files box, add a row and change the Filemask value to "*.txt".
- Right-click on the file list component and select Row | Iterate. Drop the connector onto the delimited input component.
- Select the delimited input component and edit its schema so that it has a single field rowdata of data type String.
- We need to modify the File name/Stream value, but in this case it is not a fixed file we are looking for but a different file with each iteration of the file list component. TOS gives us an easy way to add such variables into the component definitions. First, though, click on the File name/Stream box and clear the default value.
- In the bottom-left corner of the Studio you should see a window named Outline. If you cannot see the Outline window, select Window | Show View from the menu bar and type outline into the pop-up search box. You will see the Outline view in the search results; double-click on this to open it.
- Now that we can see the Outline window, expand the tFileList item to see the variables available in it. The variables are different depending upon the component selected. In the case of a file list component, the variables are mostly attributes of the current file being processed. We are interested in the filename for each iteration, so click on the variable Current File Name with path and drag it to the File name/Stream box in the Component tab of the delimited input component.
- You can see that the Studio completes the parameter value with a globalMap variable—in this case, tFileList_1_CURRENT_FILEPATH, which denotes the current filename and its directory path.
- Now right-click on the delimited input, select Row | Main, and drop the connector onto the delimited output.
- Change the File Name of the delimited output component to fileconcatout.txt in our target directory and check the Append checkbox, so that the Studio adds the data from each iteration to the bottom of the output file. If Append is not checked, then the Studio will overwrite the data on each iteration and all that will be left will be the data from the final iteration.
- Run the job and check the output file in the target directory. You will see a single file with the contents of the five original files in it. Note that the Studio shows the number of iterations of the file list component that have been executed, but does not show the number of lines written to the output file, as we are used to seeing in non-iterative jobs.
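The role of the Append option can be illustrated with a short plain-Java sketch that writes each input file to the end of one output file. The names are hypothetical and the two small input files stand in for the fileconcat files; this is not the code the Studio generates.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

class FileConcatSketch {

    // Appends every line of each input file to output, in the order given.
    // Without APPEND, each write would replace the previous content and only
    // the last input file would remain in the output.
    static void appendAll(List<Path> inputs, Path output) {
        try {
            for (Path in : inputs) {
                Files.write(output, Files.readAllLines(in),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Concatenates two small files and returns the lines of the result.
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("Source");
            Path f1 = Files.write(dir.resolve("fileconcat1.txt"), Arrays.asList("a", "b"));
            Path f2 = Files.write(dir.resolve("fileconcat2.txt"), Arrays.asList("c"));
            Path out = dir.resolve("fileconcatout.txt");
            appendAll(Arrays.asList(f1, f2), out);
            return Files.readAllLines(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [a, b, c]
    }
}
```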
Checking for files
Let's look at how we can check for the existence of a file before we undertake an operation on it. Perhaps the first question is "Why do we need to check if a file exists?"
To illustrate why, open the FileDelete job that we created earlier. If you look at its component configuration, you will see that it will delete a file named file-to-delete.txt in the Source directory. Go to this directory using your computer's file explorer and delete this file manually. Now try to run the FileDelete job. You will get an error when the job executes:
The assumption behind a delete component (or a copy, rename, or other file operation process) is that the file does, in fact, exist and so the component can do its work. When the Studio finds that the file does not exist, an error is produced. Obviously, such an error is not desirable. In this particular case nothing too untoward happens—the job simply errors and exits—but it is better if we can avoid unnecessary errors.
What we should really do here is check if the file exists and, if it does, then delete it. If it does not exist, then the delete command should not be invoked. Let's see how we can put this logic together:
- Create a new job and name it FileExist.
- Search for fileexist in the Palette and drop a tFileExist component onto the Job Designer. Then search for filedelete and place a tFileDelete component onto the designer too.
- In our Source directory, create a file named file-exist.txt and configure File Name of the tFileDelete component to point to this.
- Now click on the tFileExist component and set its File name/Stream parameter to be the same file in the Source directory.
- Right-click on the tFileExist component, select Trigger | Run if, and drop the connector onto the tFileDelete component. The connecting line between the two components is labeled If.
- When our job runs the first component will execute, but the second component, tFileDelete, will only run if some conditions are satisfied. We need to configure the if conditions.
- Click on If and, in the Component tab, a Condition box will appear.
- In the Outline window (in the bottom-left corner of the Studio), expand the tFileExist component. You will see three attributes there. The Exists attribute is highlighted in red in the following screenshot:
- Click on the Exists attribute and drag it into the Conditions box of the Component tab.
- As before, a global-map variable is written to the configuration.
- The logic of our job is as follows:
i. Run the tFileExist component.
ii. If the file named in tFileExist actually exists, run the tFileDelete component.
Note that if the file does not exist, the job will exit.
We can check if the job works as expected by running it twice. The file we want to delete is in the Source directory, so we would expect both components to run on the first execution (and for the file to be deleted). When the if condition is evaluated, the result will show in the Job Designer view. In this case, the if condition was true—the file did exist.
Now try to run the job again. We know that the file we are checking for does not exist, as it was deleted on the last execution.
This time, the if condition evaluates to false, and the delete component does not get invoked. You can also see in the console window that the Studio did not log any errors. Much better!
Sometimes we may want to verify that a file does not exist before we invoke another component. We can achieve this in a similar way to checking for the existence of a file, as shown earlier. Drag the Exists variable into the Conditions box and prefix the statement with !—the Java operator for "not":
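As a sketch of what the two conditions look like, the example below evaluates them against an ordinary map that stands in for the Studio's globalMap. It assumes the component keeps its default name, tFileExist_1, so that the Exists variable is stored under the key tFileExist_1_EXISTS; if you have renamed the component, the key in your job will differ. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class RunIfConditionSketch {

    // Condition for "run the next component only if the file exists".
    // Throws NullPointerException if the variable has not been set.
    static boolean fileExists(Map<String, Object> globalMap) {
        return ((Boolean) globalMap.get("tFileExist_1_EXISTS"));
    }

    // The same condition prefixed with !, the Java "not" operator.
    static boolean fileMissing(Map<String, Object> globalMap) {
        return !((Boolean) globalMap.get("tFileExist_1_EXISTS"));
    }

    public static void main(String[] args) {
        // Stand-in for the globalMap that the Studio populates at runtime.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFileExist_1_EXISTS", Boolean.FALSE);

        System.out.println("exists:  " + fileExists(globalMap));  // exists:  false
        System.out.println("missing: " + fileMissing(globalMap)); // missing: true
    }
}
```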
Archiving and unarchiving files
Integration jobs often produce or consume lots of files, so the ability to archive (or zip) files can help save on disk space or help speed up an FTP process. Let's create a simple job that looks for files in a specific directory and zips up those that it finds. To set it up, copy the five filearchive text files from the resource directory of this article into your Source directory. Then:
- Create a new job and name it FileArchive.
- Search for Filearchive in the Palette and drag a tFileArchive component onto the Job Designer.
- In its Component tab, set the Directory parameter to the Source directory. Note that, by default, the archive component will look at the files in the named directory and any of its subdirectories. If you do not want this behavior, uncheck the Subdirectories checkbox.
- Set the Archive file parameter to archive.zip in your Target directory.
- We could simply archive all files in our Source directory by leaving the All Files checkbox checked, but in this case we will be more specific, so uncheck the All Files checkbox.
- The Files box will be revealed. Add a new row to this by clicking on the + button and change the Filemask value to filearchive*.txt. This will select all of the files matching this name format.
- Let's leave the other parameters unchanged and run the job. Check the Target directory and you will see the newly created archive file.
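For comparison, zipping the files that match a mask can be sketched with the java.util.zip classes from the standard library. The names are hypothetical and this is not the Studio's implementation. Unlike the component's default behavior, this sketch only looks at files directly inside the source directory and does not descend into subdirectories.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class FileArchiveSketch {

    // Zips the regular files directly inside sourceDir that match glob; returns how many were added.
    static int zipMatching(Path sourceDir, String glob, Path zipFile) {
        int count = 0;
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out);
             DirectoryStream<Path> files = Files.newDirectoryStream(sourceDir, glob)) {
            for (Path file : files) {
                if (!Files.isRegularFile(file)) {
                    continue; // skip directories that happen to match the mask
                }
                zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zip); // stream the file's bytes into the current entry
                zip.closeEntry();
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Creates three files, two of which match the mask, and archives the matches.
    static int demo() {
        try {
            Path source = Files.createTempDirectory("Source");
            Path target = Files.createTempDirectory("Target");
            Files.write(source.resolve("filearchive1.txt"), "one".getBytes());
            Files.write(source.resolve("filearchive2.txt"), "two".getBytes());
            Files.write(source.resolve("other.txt"), "skip me".getBytes());
            return zipMatching(source, "filearchive*.txt", target.resolve("archive.zip"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Files archived: " + demo()); // Files archived: 2
    }
}
```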
FTP file operations
Integration jobs often connect different systems residing on different servers; so the Studio's FTP components will frequently play a part in your developments. The Studio supports many FTP actions—for example, Get, Put, Delete, Rename, File List, File Exist, and so on—and we'll look at how to use some of these in this section.
Readers may find it useful to have an FTP client installed on their computers to follow this section of the article and to check that files have been FTP'd correctly. There are many free FTP clients available for download on the Internet. FileZilla is recommended and can be downloaded from http://filezilla-project.org/
We will start by defining an FTP connection in our repository metadata. As we saw previously with our database connection, it is really useful to be able to define a connection that can be used repeatedly. Follow the given steps:
- In the Repository, expand the Metadata section, right-click on the FTP icon, and select Create FTP.
- Enter FTP_CONNECTION in the Name box and enter values in the Purpose and Description boxes if you wish. Click on Next.
- Enter your FTP username, password, host, and port into the appropriate boxes.
- Click on Finish.
Now that the connection is set up, we can create some FTP jobs.
The FTP Put component takes a file (or files) from your local machine and FTPs them to a remote computer. Create a new job named FTPPut and follow the given steps:
- Click on the FTP connection you created in the Repository Metadata and drag it onto the Job Designer.
- Select tFTPPut in the pop-up window and click on OK.
- Click on the FTP Put component in the Job Designer. You will see that a lot of its configuration values are already set, courtesy of the metadata FTP connection. We will need to configure the Local Directory, Remote Directory, and Files settings.
- Change the Local directory field to be the directory on your computer where the files you want to transfer reside.
- Change the Remote Directory to be the directory on the remote server where you are going to transfer to.
- Click on the + button in the Files box to add a row. Change the Filemask and New name values to be that of the file you are transferring. Note that the Studio gives the option to change the filename as part of the transfer, but we will ignore this for now.
- This is all we need to configure for a basic FTP Put job, so let's run the job. You will find that the file is transferred to the remote server successfully, leaving the original file in the local directory.
Now let's create a job that does the opposite of FTP Put—transfers a file from a remote server to your local computer.
The job configuration process is very similar, except that when you select your FTP component from the pop-up window after dragging the FTP connection onto the Job Designer, select tFTPGet rather than tFTPPut.
Have a go at configuring this job now. When you have finished, the tFTPGet component should look something like the following screenshot:
Note that the Studio does not give the option to change the filename with the FTP Get component.
Run your job and you will see the file transfer back to your local directory (again, leaving the original on the remote server).
There is a sample FTP Get job in the resources directory of this article for you to refer to.
FTP File Exist
We saw earlier in the article how the Studio can check for the existence of a file in a directory before it processes it. It offers a similar FTP component that allows you to check for the existence of a file on a remote server.
Following the file exists example earlier in the article, create a job that checks for a file on your FTP server and, if it exists, gets the file to your local machine.
You will need the following components: tFTPFileExist, tFTPGet, and a Run If connector.
There is a sample job in the resources directory for you to refer to.
FTP File List and Rename
Our next job will use two FTP components to list the files in a remote directory and rename them with a timestamp. Follow the given steps:
- Create a new job and name it FTPListAndRename.
- Drag the FTP_CONNECTION connection from the Repository Metadata onto the Job Designer. Select a tFTPFileList component from the pop-up window.
- We also need an FTP file rename component, so drag another FTP_CONNECTION component onto the Job Designer and select tFTPRename from the pop-up window this time.
- Our job now has two different components, but both have the same name.
- To make your job clearer to read and understand, you can rename each component to show its purpose. By double-clicking on the text name of each component, you can put it into "edit mode" and rename it appropriately. Here, we have renamed components to FTPFileList and FTPFileRename.
- Configure the file list component by changing the Remote directory value to /Test and the Filemask value to *.* (this is a wildcard entry and will list every filename with any extension).
- Connect the file list component to the file rename component by right-clicking on the file list, selecting Row | Iterate, and dropping the connector onto the file rename component.
- Now let's configure the file rename component. Change the Remote directory value to /Test.
- Click on the + button of the File box to add a new row.
- In the Filemask column, press Ctrl + space bar to access the global variables. Select FTP File List.CURRENT_FILE from the drop-down list.
- In the New name column, press Ctrl + space bar again and scroll to find TalendDate.getDate. Select this to add it to the New name column. Modify the default value to be as follows:
- Place your cursor at the end of the getDate variable and press Ctrl + space bar again. Select FTP File List.CURRENT_FILE from the drop-down list.
- Place your cursor between the two variables that have been added and type the following line:
- Your New name expression should now be as follows:
- This will change the filename so that it is prefixed with the current date. The two elements of the new filename, the current date and the old filename, will be separated with an underscore.
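The shape of the finished New name expression can be sketched as follows. This is a reconstruction from the description above, not the original listing: the date pattern "yyyyMMdd" and the global variable key tFTPFileList_1_CURRENT_FILE are assumptions (the key assumes the component's default internal name), and the class and method names are hypothetical. In the Studio, the equivalent expression would take the form TalendDate.getDate("yyyyMMdd")+"_"+((String)globalMap.get("tFTPFileList_1_CURRENT_FILE")); the sketch below uses SimpleDateFormat and an ordinary map so that it runs outside the Studio.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class FtpNewNameSketch {

    // Current date + underscore + the filename of the current iteration.
    static String newName(Map<String, Object> globalMap, Date now) {
        String date = new SimpleDateFormat("yyyyMMdd", Locale.ROOT).format(now);
        return date + "_" + ((String) globalMap.get("tFTPFileList_1_CURRENT_FILE"));
    }

    public static void main(String[] args) {
        // Stand-in for the globalMap entry that the file list component sets on each iteration.
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("tFTPFileList_1_CURRENT_FILE", "orders.csv");

        // Fixed date (2 January 2013) so the result is repeatable.
        Date now = new GregorianCalendar(2013, Calendar.JANUARY, 2).getTime();
        System.out.println(newName(globalMap, now)); // 20130102_orders.csv
    }
}
```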
- Run the job to test the process. You will find that the remote files are renamed with the timestamp, as configured.
Deleting files on an FTP server
Our final FTP example shows how to delete files on an FTP server. This is a routine that you might typically use at the end of a processing job as part of a clean-up process, possibly after the files have been archived for safekeeping. There are at least two ways to approach this problem: the first is brute-force, the second is more controlled.
In the first method you could simply use a tFTPDelete component and, applying a file mask to pick out the files to be deleted, simply delete what is there. The second method uses a tFTPFileList component to specify what should be deleted. In this example, we will use the file list method and add in a renaming component too.
The desired flow of functionality is as follows:
- List the files on an FTP server.
- Time-stamp and move the files to a processed directory.
- Delete the original source files.
Follow the given steps:
- Create a new job and name it FTPFileDelete.
- Using the FTP Metadata in the Repository, create three components on the Job Designer: tFTPFileList, tFTPRename, and tFTPDelete.
- Connect the three components using iterate connectors.
- Configure the file list component to look for CSV files on the remote server directory, using the Filemask parameter.
- Now click on the rename component. We are going to use this to move the file to a new directory and add a timestamp to the filename.
- Configure the Remote directory and then place your cursor in the Filemask column of the Files box. Press Ctrl + space bar to show a list of global variables we can use in this job.
- Find and select FTP_CONNECTION.CURRENT_FILE. This accesses the current file from the tFTPFileList iteration.
- In the New name column, add the following code:
"/Test/Processed/"+TalendDate.getDate("yyyyMMdd")+"_"+((String)globalMap.get("tFTPFileList_1_CURRENT_FILE"))
- This does a number of things. It, again, uses the current filename but prefixes this with the current date (TalendDate.getDate("yyyyMMdd")) and an underscore ("_"). We also specify that the file path is "/Test/Processed/" and so the Studio will move it to the Processed subdirectory as part of this file rename process. Note that you need to create the Processed directory on the remote server first, because the Studio does not create it on the fly.
- Finally, click on the tFTPDelete component and configure its Remote directory and Files parameters. The Files/Filemask parameter should again be set to the CURRENT_FILE variable.
- Run the job and you should see the file move from the "/Test" directory to "/Test/Processed" with the addition of a timestamp, as specified.
Managing files is a very important part of the integration development process. Because files are so widely used in integration jobs, we need to have strategies to effectively deal with them and, even though file management is the less "glamorous" part of the integration job (compared to XML mapping or database extraction, for example), its importance cannot be overstated. Moving files in a logical manner, renaming and archiving appropriately, and checking for the existence of a file before attempting to do anything with it are simply good practice. Without these techniques your integration jobs will only be partially complete, so it is highly recommended that readers spend time planning their developments to incorporate these elements.
About the Author :
Jonathan Bowen is an E-commerce and Retail Systems Consultant and has worked in and around the retail industry for the past 20 years. His early career was in retail operations, then in the late 1990s he switched to the back office and has been integrating and implementing retail systems ever since.
Since 2006, he has worked for one of the UK’s largest e-commerce platform vendors as Head of Projects and, later, Head of Product Strategy. In that time he has worked on over 30 major e-commerce implementations.
Outside of work, Jonathan, like many parents, has a busy schedule of sporting events, music lessons, and parties to take his kids to, and any downtime is often spent catching up with the latest tech news or trying to record electronic music in his home studio.
You can get in touch with Jonathan at his website: www.learnintegration.com.