Your message has been sent.
This article has been saved to your account.
Go to my account
This article has been emailed to your Kindle.
Send this article
Debatching data is the process of turning one huge pile of data into many small piles of data. There are multiple tools for debatching large bulk data loads including BizTalk Server and SQL Server Integration Services (SSIS). One can use such tools to break up large batches of data, manipulate it as needed, and send it on to its next reincarnation (for example, into an API, a relational database, or a text file). In this article by Mike Sexton, co-author of Applied Architecture Patterns on the Microsoft Platform, we will take a look at options for processing large data sets and explain how to take giant sets of data and insert them into databases for analysis.
(For more resources on Microsoft Platform, see here.)
Why is it better to shovel one ton of data using two thousand, one pound shovels instead of one big load from a huge power shovel? After all, large commercial databases and the attendant bulk loader or SQL Loader programs are designed to do just that: insert huge loads of data in a single shot.
The bulk load approach works under certain tightly constrained circumstances. They are as follows:
- The "bulk" data comes to you already matching the table structure of the destination system. Of course, this may mean that it was debatched before it gets to your system.
- The destination system can accept some, potentially significant, error rate when individual rows fail to load.
- There are no updates or deletes, just inserts.
- Your destination system can handle bulk loads. Certain systems (for example, some legacy medical systems or other proprietary systems) cannot handle bulk operations.
As the vast majority of data transfer situations will not meet these criteria, we must consider various options. First, one must consider which side of the database event horizon one should perform these tasks. One could, for example, simply dump an entire large file into a staging table on SQL Server, and then debatch using SQL to move the data to the "permanent" tables.
Big Box Stores owns and operates retail chains that include huge Big Box warehouse stores, large retail operations in groceries and general department stores and small convenience stores that sell gasoline, beverages, and fast food. The company has added brands and stores over the past few years through a series of mergers. Each brand has its own unique point of sale system. The stores operate in the United States, Canada, Mexico, and Western Europe.
The loss prevention department has noticed that a number of store sales and clerical staff are helping themselves to "five-finger bonuses." The staff members use various ruses to take money from cash registers, obtain goods without paying for them, or otherwise embezzle money or steal goods from Big Box. These patterns typically unfold over periods of several days or weeks. For example, employees will make purchases using their employee discount at the store where they work, then return the product for full price at another store where they are not known or they have an accomplice return the goods to the employee for a full refund.
The various methods used to steal from Big Box fall into these recognized patterns and a good deal of this theft can be uncovered by analyzing patterns of sales transactions. Standard ETL techniques will be used to import data concerning the stores, products, and employees to a database where we can analyze these patterns and detect employee theft.
We have been tasked with designing a system that will import comma-delimited files exported by the point of sales (POS) systems into a SQL Server database that will then perform the analysis. Data concerning each sale will be sent from each of the point of sale systems. The files will hold all or part of the prior day's sales and will range from 30,000 to over 2.5 million rows of data per file. For stores that have "regular" business hours, files will become available approximately one hour after the stores close. This time will vary based on the day of the week and the time of year. During "normal" operations, stores typically close at 9:00 PM local time. During certain peak shopping periods (for example, Christmas or local holiday periods) stores remain open until midnight, local time. Convenience stores are opened 24 hours per day, 7 days per week. Data will be sent for these stores after the POS system has closed the books on the prior day, typically at 1:00 AM local time.
The POS systems can be extended to periodically expose "final" sales to the system throughout the business day via a web service. The impact of using this method during a peak sales period is unknown, and performance of the POS may degrade. A full day's data may also be extracted from the POS system in the comma-delimited format discussed as follows. The web service would expose the data using the natural hierarchy of "sales header" and "sales detail."
All data must be loaded and available to the loss prevention department by 9 AM CET for European stores and 9 AM EST for North American stores. It should be noted that the different POS use different data types to identify stores, employees, products, and sales transactions. The load job must account for this and properly relate the data from the store to the master data loaded in a separate application.
The data will be sent in two comma-delimited files, one containing the "Sales Header" data and one containing the sales details. The data will be in the following format:
SalesID, StoreID, EmployeeID, EmployeeFirstName, EmployeeLastName, RegisterID, RegisterLocation, storeAddress, StoreCity, StoreProvince, StorePostalCode, CustomerID, CustomerFirstName, CustomerLastName, CustomerPostalCode, Date, Time, Method of Payment, CreditCardNumber, TotalSales, Amount Tendered, Change, PriorSalesID, Return
SalesID, ProductID, Quantity, markedPrice, ActualPrice, ReturnItem, DiscountCode, DiscountPercent, DiscountDescription, OriginalPurchaseDate, OriginalPurchaseStore, OriginalPurchaseSalesID, originalCustomerID, OriginalFirstName, OriginalLastName, OriginalStoreID, OriginalRegisterID, OriginalEmployeeID
Our mission is to move this data into a data mart that will use a standard star schema for this analysis. Big Box intended to prosecute employees for larceny or theft offences based on evidence this system gathers. Given the legal requirements that evidence gathered through this process must stand up in court, it is vital that this data be correct and loaded with a minimal number of errors or issues.
As is fairly typical, the use case above does not contain information on all of the facts we would need to consider when designing a solution. Every company has operating assumptions that the enterprise takes as a "given" and others we learn through our own involvement with the enterprise. These "facts" are so ingrained into an organization's culture that people may not even recognize the need to explicitly state these requirements.
For example, if a consultant arrives at a company in Denver, CO that only does business in the United States, then he or she can expect that the business language will be English with US spelling. The exact same company in Calgary, doing business in Canada will need both English with British spelling and French. It is doubtful one would ever see such "requirements" stated explicitly, but anyone designing a solution would do well to keep them in mind.
Other facts may be extrapolated or derived from the given requirements. When you are designing a solution you must take these criteria into account as well. It would be at best unwise to design a solution that was beyond the skill set for the IT staff, for example. In this case, it is probably safe to say the following:
|Fact or Extrapolation||Reason|
|Big Box has a very sophisticated IT staff that can handle any advanced and sophisticated technologies.||They are currently handling multiple POS systems on 2 continents and already do sophisticated ETL work from these systems to existing BI systems.|
|Getting the deliverable "right" is more important than getting it done "fast".||Legal requirements for using data as evidence.|
|Data must be secure during movement to avoid allegations of evidence tampering.||Legal requirements for using data as evidence.|
|Some level of operational control and monitoring must be built into the application we will design.||Common courtesy to the Network Operations Center (NOC) staff who will deal with this, if nothing else.|
We can tackle this problem from multiple angles, so let us take a look at the available options.
Candidate architecture #1–SSIS
First, we will explore the pros and cons of using SSIS for our solution platform.
Solution design aspects
This scenario is the sweet spot for SSIS. SSIS is, first and foremost, an ETL and batch data processing tool. SSIS can easily read multiple files from a network drive and has the tools out of the box that can debatch, either before or after loading to a database.
Nonetheless, we are faced with certain hurdles that will need to be accounted for in our design. We do not control precisely when the POS data will be made available.
There are a number of variables that influence that timing, not the least of which is the potential need for human intervention in closing books for the day and variable times throughout the year and across the globe when a particular store's books will be closed. We need to expect that files will be delivered over a time range.
In some ways this is helpful, as it spreads some of the load over time.
One of the great things about SSIS in this situation is the flexibility it provides. We can load all of the data in a single batch to a staging table then move it (debatch) to its final destinations using SQL, or we can debatch on the application side and load directly to the final tables, or any combination that suits us and the strengths of the development team. SSIS can also be extended to monitor directories and load data when it becomes available. Finally, SSIS integrates easily into NOC monitoring systems and provides the ability to guarantee data security and integrity as required for this application. Moreover, SSIS does not incur any additional licensing costs, as it ships with SQL Server out of the box.
Solution delivery aspects
It is not clear from our use case what depth of experience Big Box staff has with SSIS. However, they certainly have experience with database technologies, SQL queries, and with other advance technologies associated with data transfer and integration, given the size of the enterprise operations. We can reasonably expect them to pick up any unfamiliar technologies quickly and easily.
This application will require some extensions to the typical ETL paradigm. Here data must go through some amount of human intervention through the daily "closing" before it is made available. This will involve tasks such as physically counting cash to make sure it matches the records in the POS system. Any number of factors can accelerate or delay the completion of this task. SSIS will therefore need to monitor the directories where data are delivered to ensure the data is available. Also, we will need to design the system so that it does not attempt to load partially completed files. This is a classic ETL problem with many potential solutions and certainly does not present insurmountable issues.
Solution operations aspects
In this case, we have one vitally important operational requirement; the solution must guarantee data integrity and security so that the data can be used to prosecute thieves or otherwise stand up to evidentiary rules. SSIS and SQL Server 2008 Enterprise Edition can handle these requirements. SQL Server 2008 security and data access auditing features will meet chain of custody requirements and ensure that no data tampering occurred. SSIS can enforce business rules programmatically to ensure the precise and accurate transfer of the data sent by the POS systems.
Many of these requirements will be filled with the design of the database itself. We would use, for example, the data access auditing now available with SQL Server 2008 to monitor who has been working with data. The database would use only Windows-based security, not SQL Server based security. Other steps to harden SQL Server against attack should be taken.
All the previously mentioned features secure the data while at rest. We will need to focus on how to ensure data integrity during the transfer of the data—while the data is in motion. SSIS has logging tools that will be used to monitor unsuccessful data transfers. Moreover, we can extend these tools to ensure either a complete data load or that we will have an explanation for any failure to load. It should be noted that the loss prevention staff is interested in outliers, so they will want to carefully examine data that fails to meet business requirements (and therefore fails to load to our target system) to look for patterns of theft.
We understand that Big Box staff has the technical wherewithal to handle this relatively simple extension to existing SQL Server technologies. This is a group of database professionals who deal with multiple stores performing over 2 million transactions per day. They support the POS, financial, inventory, and other systems required to handle this volume on two continents. This is a small step for them in terms of their ability to live with this solution.
Candidate architecture #2–BizTalk Server
While not primarily targeted at bulk data solutions, BizTalk Server can parse large inbound data sets, debatch the individual records, and insert them into a target system.
Solution design aspects
The POS systems that provide sales data to the Big Box data hub typically produce comma-delimited files. Using BizTalk Server, we can define the document structure of delimited files and natively accept and parse them. The requirements earlier also stated that the POS systems could be extended to publish a more real-time feed via web services as opposed to the daily file drop of data. This is more in tune with how BizTalk does standard processing (real-time data feeds) and would be a preferred means to distribute data through the BizTalk bus.
BizTalk Server's SQL Server adapter is built to insert a record at a time into a database. This means that the BizTalk solution needs to break apart these large inbound data sets and insert each record individually into the final repository. Messages are debatched automatically in BizTalk via pipeline components and specially defined schemas, but this is a CPU-intensive process. We would want to isolate the servers that receive and parse these data sets so that the high CPU utilization doesn't impede other BizTalk-based solutions from running.
Solution delivery aspects
Big Box leverages SQL Server all across the organization, but does not currently have a BizTalk footprint. This means that they'll need to set up a small infrastructure to host this software platform. They do have developers well-versed in .NET development and have typically shown a penchant for utilizing external consultants to design and implement large enterprise solutions. It would be critical for them to build up a small center of excellence in BizTalk to ensure that maintenance of this application and the creation of new ones can progress seamlessly.
Solution operations aspects
BizTalk Server provides strong operational support through tooling, scripting, and monitoring. If the downstream database becomes unavailable, BizTalk will queue up the messages that have yet to be delivered. This ensures that no sales information gets lost in transit and provides a level of guarantee that the data mart is always accurate.
Given the relatively large sets of data, the operations team will need to configure a fairly robust BizTalk environment, which can handle the CPU-intensive debatching and perform the database inserts in a timely fashion.
Big Box would be well served by moving to a more real-time processing solution in the near future. This way, they can do more live analysis and not have to wait until daily intervals to acquire the latest actionable data. A messaging-based solution that relies on BizTalk Server is more in tune with that vision.
However, this is a critical program and speed to market is a necessity. Big Box accepts a high level of risk in procuring a new enterprise software product and getting the environments and resources in place to design, develop, and support solutions built upon it.
|SQL Server and SSIS|
|Easily deployed and extensible ETL tool||Need to build sophisticated error handling systems|
|Designed to handle batch processing of large files, exactly the task at hand|
|No additional licensing costs - comes with SQL Server|
|Can be built and maintained by current staff|
|Provides for live, real-time analysis||CPU-intensive processes|
|Can leverage BizTalk capability to send events to downstream transactional systems||High database process overhead|
|Enterprise-class hosting infrastructure||Additional licensing and capital costs|
|Not clear if staff has the skills to support product|
When all is said and done, this is exactly the scenario that SSIS was designed to handle, a batch load to a data mart. Moreover, the selection of SSIS entails no additional licensing costs, as might be the case with BizTalk.
eBook Price: $35.99
Book Price: $59.99
(For more resources on Microsoft Platform, see here.)
Building the solution
This section outlines the construction of a proof-of-concept application that will implement the functionality required for this application. The details of data warehousing, OLAP, and data security each would require large, detailed books in and of themselves. At a minimum, we would suggest you consider studying the following topics:
- Partitioning strategies for data storage and loading
- Compression strategies
- Transparent data encryption
- Data access auditing for PID
- Multi-dimensional data structures
- Encryption strategies for data in motion
- SAN and physical infrastructure design
From a very high level (and shown below), we will be taking data already encrypted on the source system using agreed upon encryption algorithms, decrypting it, debatching it, and loading it into our database, or, as an alternative, loading it into a staging table then debatching it using SQL:
We are dealing with people's credit card numbers so we must encrypt this data, particularly if we plan on sending it to a potentially open folder in a .csv format. There are numerous methods for handling encryption, and the determination of which methods to use will depend largely on both data sources and data targets. In this case, we will use simple symmetric key encryption available with SQL Server, mostly because it is easy for purposes of illustration. Moreover, the data will be transferred using internal Big Box networks so we can also rely on network-level security to protect the data. Asymmetrical keys should also be considered here for real-world scenarios.
In order to generate our simple .csv files we created a mock point of sale (POS) database named SourceOfWisdom with two tables for our sales header and sales detail data. To encrypt the data, we first create a master key, then a certificate, and finally a symmetric key.
WHERE name = N'AppPatternCert'
DROP CERTIFICATE AppPatternCert
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'InY0urDre@ms';
/* Note Password complexity */
CREATE CERTIFICATE AppPatternCert WITH SUBJECT = 'Key Protection';
CREATE SYMMETRIC KEY DebatchPatternKey WITH
KEY_SOURCE = 'KeepMeSecret',
ALGORITHM = AES_256,
IDENTITY_VALUE = 'ThisIsSecret2'
ENCRYPTION BY CERTIFICATE AppPatternCert;
-- a simple test for the certificate
OPEN SYMMETRIC KEY DebatchPatternKey
DECRYPTION BY CERTIFICATE AppPatternCert;
encryptbykey(key_guid('DebatchPatternKey'), 'CreditCardNumber' )
As one would expect, we have a target system that does not match our sources. Moreover, we can expect the iron law of BI systems will be strictly enforced here. That law is "no one ever tells the BI staff anything." Psychic abilities are part of the job description. We can expect that there will be changes made to our source systems, or that data will be entered in one system but not in another. We must design our systems accordingly.
Our target schema is shown as follows. You can create this sample database with the accompanying project.
(Move the mouse over the image to enlarge.)
Debatching with SSIS and SQL Server
With our sources and targets defined, we now must decide how we will debatch the data so that it is properly loaded into the target tables. While using SSIS we can decide to leverage the objects that come with SSIS out of the box, or simply load the data into staging tables and use SQL set operations to move the data to target tables.
There is no "right" answer to this issue any more than there is a "right" answer to "which is better, a hammer or a screwdriver?" We have a box full of very useful tools and we use them for their most appropriate tasks. There are some general rules that should be followed.
Set operations should be done with SQL. A set operation is the operation any relational database uses to match data by common characteristics across data sets. Databases are based on set or relational algebra and therefore handle these operations very efficiently.
"Non-set" operations should be handled within the SSIS package, using the objects available or with custom scripts.
If there is a specific object on either system that makes your life easier, use that.
Always default to the system that allows you to write the most efficient code. Generally, this will be where you are most comfortable.
In this case, we have used the SQL Server symmetric key encryption to secure PID. This data is held in the SalesHeader.csv file, so we will need to work with this data on the SQL side of things. For purposes of illustration, we will demonstrate debatching the sales detail data on the application side.
We will also make certain assumptions for this illustration that would be risky in real-world applications. First, we will assume that all of data we will be loading is fresh and there are no updates to previously loaded records. Second, we will assume that the source files will not be locked and will be a complete data set when we begin our loading process. There are numerous ways to control for this in the real world, including the use of empty trigger files.
We will create a single SSIS package for extracting the data from the flat files, transforming it, and finally loading it into the appropriate tables in the destination database. The appropriate code is in two separate sequence containers. We will assume a certain amount of familiarity with SSIS here, including the primer information, and not outline every step in the detail.
Debatching with SQL
"Debatching" in this context can fit into the "transform" or the "load" phases of any ETL job. Here we will take one of our source batches of data, load it as a single batch into a staging table, then "debatch" using SQL statements.
You will need a staging table on your target database for this solution. We will be doing a significant amount of our transformation and loading using SQL, and this is a fast way to get the data where we can use SQL. Note that we also create a storage schema. Long term, this schema will allow storage flexibility by allowing us to manipulate which physical storage devices hold a particular schema. The table can be created with the following script:
CREATE SCHEMA staging AUTHORIZATION dbo
CREATE TABLE staging.SalesHeaderSource(
SalesID int NULL,
StoreID int NULL,
storeAddress nvarchar(75) NULL,
StoreCity nvarchar(50) NULL,
StoreProvince nvarchar(2) NULL,
StorePostalCode nvarchar(5) NULL,
EmployeeID int NULL,
EmployeeFirstName nvarchar(25) NULL,
EmployeeLastName nvarchar(25) NULL,
RegisterID int NULL,
RegisterLocation nvarchar(10) NULL,
RegisterType nvarchar(30) NULL,
DateOfSale nvarchar(50) NULL,
TimeofSale nvarchar(25) NULL,
Method_of_Payment nvarchar(25) NULL,
CreditCardNumber nvarchar(max) NULL,
CustomerID nvarchar(max) NULL,
CustomerFirstName nvarchar(max) NULL,
CustomerLastName nvarchar(max) NULL,
CustomerPostalCode nvarchar(max) NULL,
TotalSales numeric(9, 2) NULL,
AmountTendered numeric(9, 2) NULL,
Change numeric(9, 2) NULL,
PriorSalesID nvarchar(76) NULL,
ReturnTrans NVARCHAR(2) NULL
Start an SSIS project in Visual Studio, rename the default package to debatch.dtsx, then drag a sequence container from the tool box to the package. Rename the sequence container "Load Header" or any similar descriptive term that floats your boat. Place an execute SQL Task (named "prep staging table" for our example), a data flow task (named "Load Staging Table" for our example), and a second execute SQL task (named "Cleanse Return Data"). These three steps will handle our staging table load.
You should also note the creation of connection managers for both our source CSV file and our target database.
Our data flow task will take data from our .csv file—which will, of course, be in ASCII—convert the data to the appropriate data type, and load it into the staging table.
The .csv file does not have column names in the first row. For ease of coding and because we mere mortals cannot remember the names of fields by a numeric designation, we will need to go into the source connection manager and rename our columns.
Mapping these column names to the target field in the database will make your life easier in the long term.
Next, we will need to convert our ASCII data to the appropriate data types. Many of us got spoiled by earlier versions of SQL Server doing ASCII to Unicode conversions for us under the covers. You must make this explicit with SSIS.
Next, we map the data into the correct staging table fields as shown next and the data will load to staging.
We want to make sure we have the correct values, particularly for data that comes from the source as a null. Often SSIS will treat this as a string and load the word "null" instead of a null value. We therefore perform an execute SQL task that simply executes the following:
UPDATE staging.SalesHeaderSource SET PriorSalesID = null WHERE
ISNUMERIC(PriorSalesID) = 0
UPDATE staging.SalesHeaderSource SET ReturnTrans = 0
WHERE ISNUMERIC(ReturnTrans) = 0
Now that our data is properly staged, we will move it to the final tables using a series of SQL statements. Recall that we made an admittedly dangerous assumption that applies to this step—all of our data is new. We will illustrate a merge statement, new with SQL Server 2008 and a very convenient alternative to the hoops that we previously jumped through to "upsert" data.
The data will be moved in a manner consistent with the hierarchy of the data and the foreign key constraints we have defined. First, we move the stores data with a merge statement.
SET NOCOUNT ON
MERGE LossPrevention.Stores AS TARGET
USING(select distinct StoreID, storeAddress,StoreCity, StoreProvince,
StorePostalCode FROM staging.SalesHeaderSource)
AS SOURCE (StoreID, storeAddress,StoreCity, StoreProvince,
ON (TARGET.StoreID = source.StoreID)
WHEN MATCHED THEN
UPDATE SET storeAddress = source.storeAddress,
StoreCity = source.StoreCity,
StoreProvince = source.StoreProvince,
StorePostalCode = source.StorePostalCode
WHEN NOT MATCHED THEN
INSERT (StoreID, storeAddress,StoreCity, StoreProvince,
VALUES (source.StoreID, source.storeAddress, source.StoreCity, source.
Next, for the sake of simplicity, we will execute a series of inserts. For example, we would use this insert for registers:
INSERT INTO [FindCrooks].[LossPrevention].[Registers]
RegisterID, StoreID, RegisterLocation, RegisterType
The balance of the inserts may be found in the accompanying code.
Of course, all of the merge and insert statements can be placed within stored procedures and executed there.
Here, we have taken a large "batch" of data and broken it apart (debatched) using SQL statements.
Debatch with SSIS
Here, we will start once again with a data flow task; however, after bringing the data into the SSIS process we will debatch within that process, rather than on the database. Once again, drag a sequence container onto the package and label it "Load Details". Next, drag a data flow task into the sequence container and label it "get details".
As in the earlier data flow, drag a flat file source into the data flow. You will need a new connection object to the details CSV file and, as before, you should rename the columns to make them human readable.
We will illustrate the loading of two tables, the "SalesDetail" and "ReturnSales" tables, using an SSIS multicast object.
The first issue we encounter is that our source system has used the word "Null" for null values. SSIS interprets this value as a string holding the word "null" so we will need to deal with this issue, particularly on the numeric and integer value columns. Add a derived columns object and connect it to the flat-file source. For illustration, we will add a new column for the "Original Customer ID" value by substituting an empty string for the word "Null" and we will replace the "Null" in the "DiscountPercent" field with "0" as illustrated next. Note that this is a simple "REPLACE" function.
We must then perform our data type conversions, as we did earlier, converting ASCII to Unicode and strings to the appropriate numeric or integer types, as required. In order to preserve relational integrity, we will need to get the store identifier, along with other specific data that was provided in the sales header file. For this, drag a lookup transformation onto the data flow. In the data transformation stage, we will have already converted the sales ID to a four byte signed integer data type. This will allow us to relate the sales header data to the sales detail data in an accurate manner.
We will be sending our data to two separate tables. Drag a multicast task onto the flow. This will allow you to send duplicate datasets to multiple destinations. For you Matrix fans, think of multicasts as the agent Smith of ETL. As you can see from the illustration shown next, we simply send this to the OLEDB destinations using fast loads for the tables. Recall that in this application, we are looking for sales that do not meet certain rules. These require closer inspection as they may indicate employee theft or customer theft (for example, shoplifting then "returning" the item for a full refund). In this case, we send this data to a flat file. In the "real world" we would send this to an "alerts" table to flag it for further review.
Here we have three powerful tools for data movement, SSIS, Microsoft Sync Framework, and SQL Server Service Broker that can handle the data movement, master data management, and data governance needs of a variety of organizations and in a variety of business circumstances. The tools can be used in a variety of combinations to get data where it needs to be.
- Content Based Routing on Microsoft Platform [Article]
- Building the Content Based Routing Solution on Microsoft Platform [Article]
eBook Price: $35.99
Book Price: $59.99