Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Getting Started with Talend Open Studio for Data Integration

You're reading from  Getting Started with Talend Open Studio for Data Integration

Product type Book
Published in Nov 2012
Publisher Packt
ISBN-13 9781849514729
Pages 320 pages
Edition 1st Edition
Languages
Author (1):
Jonathan Bowen Jonathan Bowen
Profile icon Jonathan Bowen

Table of Contents (22) Chapters

Getting Started with Talend Open Studio for Data Integration
Credits
Foreword
Foreword
About the Author
Acknowledgement
About the Reviewers
www.PacktPub.com
Preface
Knowing Talend Open Studio Working with Talend Open Studio Transforming Files Working with Databases Filtering, Sorting, and Other Processing Techniques Managing Files Job Orchestration Managing Jobs Global Variables and Contexts Worked Examples Installing Sample Jobs and Data Resources Index

Chapter 3. Transforming Files

Let's start our "deep dive" into the world of Talend Open Studio for data integration by looking at some common file integration techniques. Of all the methods of integrating systems, using files and file exchanges is probably one of the oldest and most common. With the rise of other, more modern, integration methods over the last few years, web services for example, integrating files might be seen as a bit unfashionable. But don't be fooled; exchanging files between systems can be an excellent way to integrate. To support this, lots and lots of applications have file-based integration APIs.

In this chapter, we will learn how to:

  • Transform files from one format to another

  • Use the Studio's expression editor to modify data

  • Build advanced and multi-schema XML files

  • Use lookups to enrich data

  • Get familiar with the Studio development environment by following the detailed step-by-step examples

Transforming XML to CSV


Let's start with a simple file format transformation. Many modern applications use Extensible Markup Language (XML) formats to get data in and out. Other, often simpler, systems use a Comma Separated Format (CSV). Common, desktop-based systems, such as Excel and Access, have wizards for taking data in the CSV format. We'll work through the process of taking a simple XML file and extracting its data into a comma-separated format.

Before we dive in and actually start to configure a the Studio job, let's look at the data that we want to transform. Our input file is an XML product catalogue named catalogue.xml, which is present in the datafiles of this chapter. Open this in the XML viewer of your choice. You can see that the data is pretty self-explanatory. The file contains data about Stock Keeping Units (SKUs). There are a number of repeating SKU elements, each containing an skuid, skuname, size, colour, and price.

We want to extract this data into a spreadsheet-style...

Transforming CSV to XML


Working the other way around, we will now transform a simple CSV file into an XML format. Again, this is a common integration scenario, where, for example, data needs to move from a spreadsheet to a format suitable for loading into another application via its native API.

Start by creating a new job in the Chapter3 folder and call it CSV2XML. We're going to use the CSV output from our previous task as an input file this time, so take a copy of catalogue-out.csv from your DataOut folder and drop it into the DataIn folder. To avoid confusion, let's rename this to csv2xml-catalogue.csv.

As in the previous example, we'll start by declaring the metadata for the input file before completing the main configuration of the job. Perform the following steps:

  1. In the Metadata section of the Repository window, right-click on File delimited and select Create file delimited. The metadata wizard window will appear. Enter csv_catalogue into the Name field and click on Next:

  2. In step 2...

Maps and expressions


In most integration scenarios, we are unlikely to find that all data fields can be passed from one system to another without any modification. Because different systems model the same objects in different ways, there's often the need, to not only change the file format, but also to change the data model and content in some way.

For our next job design, we'll do another CSV to XML transformation; but this time, the data models of the input and the output (and hence the schemas) will be different. We'll use the Studio's mapping component and Expression editor to help us deal with these differences.

To start off, let's look at our two data models to examine the differences. Our CSV file is a customer datafile and has the following fields:

  • Customer ID

  • First Name

  • Last Name

  • Address1

  • Address2

  • Town City

  • County

  • Postcode

  • Telephone

We know that all of these fields have a string data type and that all fields are mandatory, except for Address2.

The XML file we want to produce has a similar,...

Advanced XML output for complex XML structures


The XML output format we used in the last example was pretty straightforward. Every piece of data is contained within its own XML element; however, it is very common for XML files to be more complex, with subelements being repeated within a parent element, and data items being held in XML attributes rather than in elements.

In this example, we will produce an XML file that contains data about customer orders that have been dispatched. This file will contain the following information:

  • The order ID

  • The order-line ID

  • The product SKU for each line

  • The quantity of each dispatched SKU

  • The dispatch date (in the format yyyy-MM-dd hh:mm)

  • The courier tracking ID (so that customers can track their order)

The XML format we need to adhere to is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<DISPATCH_DOCKET>
<ORDER ID="1000">
<ORDER_LINE ID="1" SKU="123456789" QUANTITY="1" DISPATCHED_DATE="2011-01-01 12:00" TRACKING_ID="ABC12345"/>
<...

Working with multi-schema XML files


The XML files we have worked with so far were straightforward and only dealt with a single collection of elements. However, many systems produce or consume XML files that contain multiple collections of elements; these are called multi-schema XML files. Here's an example:

<?xml version="1.0" encoding="UTF-8"?>

<catalogue>
  <skus>
    <sku>
      <skuid>432345</skuid>
      <skuname>Check Shirt</skuname>
      <size>S</size>
      <colour>Green</colour>
      <price>29.99</price>
    </sku>
  </skus>
  <inventory>
    <sku>
      <skuid>432345</skuid>
      <stock_on_hand>12</stock_on_hand>
    </sku>
  </inventory>
</catalogue>

This shows a product catalogue file with two schemas, one for the product details and one for the inventory. There's nothing in the XML structure to connect the two schemas. In essence...

Enriching data with lookups


So far, we have looked at integration scenarios where we have transformed files from one format to another, but in all cases the data we needed in the output file was contained, in some form, in the input file. However, it is commonplace in real-life scenarios that we need to transform data to the requirements of one system, but the originating system does not actually contain the data we need. It's time to improvise!

In this section, we'll create a job that passes data from one component to another, but on the way, uses a lookup data to replace some data. Imagine that we need to transform some customer data. Our original file is simple, containing the following fields:

  • Company name

  • Address

  • City

  • State

  • Zip code

The following is a sample file:

Let's name this file as corporate-addresses.csv and drop it into your DataIn folder.

The output file required by the receiving system is exactly the same as this, with one exception. Its state field is only two characters long (as...

Extracting data from Excel files


Spreadsheets are a ubiquitous business tool in the modern world, and there is a vast amount of critical data that resides in these common desktop files. As the tools are so commonplace and easy to use, spreadsheets are often the tool of choice for storing and manipulating all kinds of data. In this section, we'll look at a couple of ways to pull data from a spreadsheet. One of the most-used features of spreadsheets is the sheets functionality, which is the ability to add another page within the spreadsheet file. Sheets within a file may be closely related (for example, each sheet represents sales data for a given month) or may be less closely related (for example, a customers spreadsheet may contain customer data, such as the first name, last name, and e-mail in sheet 1, and address data in sheet 2). Instead of taking spreadsheet data and converting it into the CSV format before transforming it, the Studio has Excel components that allow us to address multiple...

Summary


This was the first time we got our hands dirty with the Studio, the Studio, and we looked at some common integration scenarios based on common file formats. We looked at transforming files from one format to another—CSV to XML, for example. We also explored the tMap component and its built-in Expression editor. We built jobs using the multi-schema XML component, and saw how we can map data to the most complex XML structures using the advance XML output component. We also saw how to join data from different sources using both the tJ join component and the join functionality within tMap.

Let's move on to look at how the Studio can work with another common systems component—the relational database.

lock icon The rest of the chapter is locked
You have been reading a chapter from
Getting Started with Talend Open Studio for Data Integration
Published in: Nov 2012 Publisher: Packt ISBN-13: 9781849514729
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}