
Mastering SQL Server 2014 Data Mining

By Amarpreet Singh Bassan, Debarchan Sarkar
About this book

Whether you are new to data mining or a seasoned expert, this book will provide you with the skills you need to successfully create, customize, and work with the Microsoft Data Mining Suite. Starting with the basics, this book covers how to clean the data, design the problem, and choose a data mining model that will give you the most accurate prediction.

Next, you will be taken through the various classification models, such as the decision tree, neural network, and Naïve Bayes models. Following this, you'll learn about the clustering and association algorithms, along with the sequencing and regression algorithms, and understand the data mining expressions associated with each algorithm. With ample screenshots that offer a step-by-step account of how to build a data mining solution, this book will ensure your success with this cutting-edge data mining system.

Publication date:
December 2014
Publisher
Packt
Pages
304
ISBN
9781849688949

 

Chapter 1. Identifying, Staging, and Understanding Data

We will begin our discussion with an introduction to the data mining life cycle, and this chapter will focus on its first three stages. You are expected to have a basic understanding of the Microsoft Business Intelligence stack and familiarity with terms such as extract, transform, and load (ETL), data warehouse, and so on. This chapter builds on that basic understanding.

We will cover the following topics in this chapter:

  • Data mining life cycle

  • Identifying the goal

  • Staging data

  • Understanding and cleansing data

 

Data mining life cycle


Before going into further detail, it is important to understand the various stages of the data mining life cycle. The data mining life cycle can be broadly classified into the following steps:

  1. Understanding the business requirement.

  2. Understanding the data.

  3. Preparing the data for analysis.

  4. Preparing the data mining models.

  5. Evaluating the results of the analysis prepared with the models.

  6. Deploying the models to SQL Server Analysis Services.

  7. Repeating steps 1 to 6 in case the business requirement changes.

Let's look at each of these stages in detail.

First and foremost, the task that needs to be well defined even before beginning the mining process is identifying the goals. This is a crucial part of the data mining exercise, and you need to answer the following questions:

  • What and whom are we targeting?

  • What is the outcome we are targeting?

  • What is the time frame for which we have the data and what is the target time period that our data is going to forecast?

  • What would the success measures look like?

Let's define a classic problem and understand more about the preceding questions. Note that for most of this book, we will be using the AdventureWorks and AdventureWorksDW databases for our data mining activities, as they already have the schema and dimensions predefined. This lets us focus on how to extract the information rather than spend our time defining the schema.

The details on how to acquire the AdventureWorks databases are discussed in the Preface of this book.

Consider an instance where you are a salesperson for the AdventureWorks Cycles company and you need to make predictions that could be used in marketing the products. The problem sounds simple and straightforward, but any serious data miner would immediately come up with many questions. Why? The answer lies in the exactness of the information being searched for. Let's discuss this in detail.

The problem statement comprises the words predictions and marketing. When we talk about predictions, there are several insights that we seek, namely:

  • What is it that we are predicting? (for example: customers, product sales, and so on)

  • What is the time period of the data that we are selecting for prediction?

  • What time period are we going to have the prediction for?

  • What is the expected outcome of the prediction exercise?

From the marketing point of view, several follow-up questions that must be answered are as follows:

  • What is our target for marketing: a new product or an older product?

  • Is our marketing strategy product centric or customer centric? Are we going to market our product irrespective of the customer classification, or are we marketing our product according to customer classification?

  • On what timeline in the past is our marketing going to be based?

We might observe that there are many questions that overlap the two categories and, therefore, there is an opportunity to consolidate the questions and classify them as follows:

  • What is the population that we are targeting?

  • What are the factors that we will actually be looking at?

  • What is the time period of the past data that we will be looking at?

  • What is the time period in the future that we will be considering the data mining results for?

Let's throw some light on these aspects based on the AdventureWorks example. We will get answers to the preceding questions and arrive at a more refined problem statement.

What is the population that we are targeting? The target population might be classified according to the following aspects:

  • Age

  • Salary

  • Number of kids

What are the factors that we are actually looking at? They might be classified as follows:

  • Geographical location: The people living in hilly areas would prefer All Terrain Bikes (ATB) and the population on plains would prefer daily commute bikes.

  • Household: The people living in posh areas would look for bikes with the latest gears and also look for accessories that are state of the art, whereas people in the suburban areas would mostly look for budgetary bikes.

  • Affinity of components: The people who tend to buy bikes would also buy some accessories.

What is the time period of the past data that we would be looking at? Usually, the data that we get is quite large and often contains information that we might adequately label as noise. In order to sift out the effective information, we will have to determine exactly how far into the past we should look; for example, we can look at the data for the past year, the past two years, or the past five years.

We also need to decide the future time period for which we will consider the data mining results. We might be looking at predicting our market strategy for an upcoming festive season or for the whole year. We need to be aware that market trends change, and so do people's needs and requirements, so we need to keep an optimal time frame for refreshing our findings; for example, the predictions from the past five years' data might be valid for the upcoming two or three years, depending on the results that we get.

Now that we have taken a closer look into the problem, let's redefine the problem more accurately. AdventureWorks Cycles has several stores in various locations and, based on the location, we would like to get an insight into the following:

  • Which products should be stocked where?

  • Which products should be stocked together?

  • How many products should be stocked?

  • What is the trend of sales for a new product in an area?

It is not necessary that we get answers to all of these detailed questions, but simply continuing to look for the answers will yield several insights that help us make better business decisions.

 

Staging data


In this phase, we collect data from all the sources and dump it into a common repository, which can be any database system such as SQL Server, Oracle, and so on. Usually, an organization has various applications to keep track of the data from various departments, and it is quite possible that these applications use different database systems to store the data. Thus, the staging phase is characterized by dumping the data from all the other data storage systems into a centralized repository.

Extract, transform, and load

This term is most common when we talk about data warehousing. As the name makes clear, ETL has the following three parts (a small T-SQL sketch follows the list):

  • Extract: The data is extracted from a different source database and other databases that might contain the information that we seek

  • Transform: Some transformation is applied to the data to fit the operational needs, such as cleaning, calculation, removing duplicates, reformatting, and so on

  • Load: The transformed data is loaded into the destination data store database
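To make the three parts concrete, here is a minimal, hypothetical T-SQL sketch of a single ETL step; the stg.CustomerFeed and dw.DimCustomerStage names are illustrative only and are not part of the AdventureWorks samples:

INSERT INTO dw.DimCustomerStage (CustomerAlternateKey, MaritalStatus, Gender)
SELECT DISTINCT                               -- Transform: remove duplicates
       UPPER(LTRIM(RTRIM(src.CustomerCode))), -- Transform: trim and reformat the key
       src.MaritalStatus,
       src.Gender
FROM   stg.CustomerFeed AS src                -- Extract: read from the staged copy of the source
WHERE  src.CustomerCode IS NOT NULL;          -- Transform: drop incomplete rows
-- Load: the INSERT itself writes the cleansed rows into the destination table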

We usually believe that ETL is only required until we load the data into the data warehouse, but this is not true. ETL can be used anywhere we feel the need to transform data, as shown in the following figure:

Data warehouse

As evident from the preceding figure, the next stage is the data warehouse. The AdventureWorksDW database is the outcome of the ETL applied to the staging database, which is AdventureWorks. We will now discuss the concepts of data warehousing and some best practices, and then relate to these concepts with the help of the AdventureWorksDW database.

Measures and dimensions

There are a few common terminologies you will encounter as you enter the world of data warehousing. This section discusses them to help you get familiar with them; a short query after the list illustrates the distinction:

  • Measure: Any business entity that can be aggregated, or whose value can be ascertained as a number, is termed a measure; for example, sales, number of products, and so on

  • Dimension: This is any business entity that lends some meaning to the measures, for example, in an organization, the quantity of goods sold is a measure but the month is a dimension
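The distinction is easiest to see in a query. In the following sketch (which assumes the standard AdventureWorksDW column names; verify them against your copy of the database), SalesAmount and OrderQuantity are measures being aggregated, while the calendar attributes from DimDate form the dimension that gives those numbers meaning:

SELECT d.CalendarYear,
       d.MonthNumberOfYear,
       SUM(f.SalesAmount)   AS MonthlySales,  -- measure: aggregated numeric value
       SUM(f.OrderQuantity) AS UnitsSold      -- measure
FROM   dbo.FactInternetSales AS f
JOIN   dbo.DimDate           AS d             -- dimension: lends meaning (which month?)
       ON f.OrderDateKey = d.DateKey
GROUP BY d.CalendarYear, d.MonthNumberOfYear
ORDER BY d.CalendarYear, d.MonthNumberOfYear;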

Schema

Basically, a schema determines the relationship of the various entities with each other. There are essentially two types of schema (a query sketch after the list contrasts the two), namely:

  • Star schema: This is a relationship where the measures have a direct relationship with the dimensions. Let's look at an instance wherein a seller has several stores that sell several products. The relationship of the tables based on the star schema will be as shown in the following screenshot:

  • Snowflake schema: This is a relationship wherein the measures may have a direct and indirect relationship with the dimensions. We will be designing a snowflake schema if we want a more detailed drill down of the data. Snowflake schema would usually involve hierarchies, as shown in the following screenshot:
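The practical difference shows up in how many joins it takes to reach a dimension attribute. The following sketch assumes the standard AdventureWorksDW schema: a star dimension is reached in one hop, whereas the snowflaked product hierarchy needs extra hops through the subcategory and category tables:

-- Star: the fact table joins directly to the dimension
SELECT t.SalesTerritoryRegion, SUM(f.SalesAmount) AS Sales
FROM   dbo.FactInternetSales AS f
JOIN   dbo.DimSalesTerritory AS t ON f.SalesTerritoryKey = t.SalesTerritoryKey
GROUP BY t.SalesTerritoryRegion;

-- Snowflake: the product dimension is normalized into a hierarchy,
-- so reaching the category requires additional joins
SELECT pc.EnglishProductCategoryName, SUM(f.SalesAmount) AS Sales
FROM   dbo.FactInternetSales     AS f
JOIN   dbo.DimProduct            AS p  ON f.ProductKey            = p.ProductKey
JOIN   dbo.DimProductSubcategory AS ps ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
JOIN   dbo.DimProductCategory    AS pc ON ps.ProductCategoryKey   = pc.ProductCategoryKey
GROUP BY pc.EnglishProductCategoryName;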

Data mart

While a data warehouse is a more organization-wide repository of data, extracting data from such a huge repository might well be an uphill task. We segregate the data according to the department or the specialty that the data belongs to, so that we have much smaller sections of the data to work with and extract information from. We call these smaller data warehouses data marts.

Let's consider the sales for AdventureWorks Cycles. To make any predictions on the sales of AdventureWorks Cycles, we will have to group all the tables associated with sales together in a data mart. Based on the AdventureWorks database, we have the following tables in the AdventureWorks sales data mart.

The Internet sales fact table (FactInternetSales) has the following columns:

[ProductKey]
 [OrderDateKey]
 [DueDateKey]
 [ShipDateKey]
 [CustomerKey]
 [PromotionKey]
 [CurrencyKey]
 [SalesTerritoryKey]
 [SalesOrderNumber]
 [SalesOrderLineNumber]
 [RevisionNumber]
 [OrderQuantity]
 [UnitPrice]
 [ExtendedAmount]
 [UnitPriceDiscountPct]
 [DiscountAmount]
 [ProductStandardCost]
 [TotalProductCost]
 [SalesAmount]
 [TaxAmt]
 [Freight]
 [CarrierTrackingNumber]
 [CustomerPONumber]
 [OrderDate]
 [DueDate]
 [ShipDate]

From the preceding columns, we can easily identify that if we need to separate out the tables to perform the sales analysis alone, we can safely include the following:

  • Product: This provides the following data:

    [ProductKey]
    [ListPrice]
  • Date: This provides the following data:

    [DateKey]
  • Customer: This provides the following data:

    [CustomerKey]
  • Currency: This provides the following data:

    [CurrencyKey]
  • Sales territory: This provides the following data:

    [SalesTerritoryKey]

The preceding tables provide the relevant dimensions for the facts already contained in the FactInternetSales table; hence, we can easily perform all the analysis pertaining to the organization's sales.
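As a quick check that the data mart is self-sufficient for sales analysis, a query such as the following (a sketch assuming the standard AdventureWorksDW table and column names) touches only the fact table and the dimensions listed above:

SELECT d.CalendarYear,
       st.SalesTerritoryRegion,
       p.EnglishProductName,
       SUM(f.SalesAmount)   AS Sales,
       SUM(f.OrderQuantity) AS Units
FROM   dbo.FactInternetSales AS f
JOIN   dbo.DimProduct        AS p  ON f.ProductKey        = p.ProductKey
JOIN   dbo.DimCustomer       AS c  ON f.CustomerKey       = c.CustomerKey
JOIN   dbo.DimDate           AS d  ON f.OrderDateKey      = d.DateKey
JOIN   dbo.DimCurrency       AS cu ON f.CurrencyKey       = cu.CurrencyKey
JOIN   dbo.DimSalesTerritory AS st ON f.SalesTerritoryKey = st.SalesTerritoryKey
GROUP BY d.CalendarYear, st.SalesTerritoryRegion, p.EnglishProductName;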

Refreshing data

Based on the nature of the business and the requirements of the analysis, data can be refreshed either incrementally, wherein only new data is added to the tables, or as a full refresh, wherein the tables are cleaned and reloaded with the complete dataset consisting of both the old and the new data.

Let's discuss the preceding points in the context of the AdventureWorks database. We will take the employee table to begin with. The following is the list of columns in the employee table:

[BusinessEntityID]
,[NationalIDNumber]
,[LoginID]
,[OrganizationNode]
,[OrganizationLevel]
,[JobTitle]
,[BirthDate]
,[MaritalStatus]
,[Gender]
,[HireDate]
,[SalariedFlag]
,[VacationHours]
,[SickLeaveHours]
,[CurrentFlag]
,[rowguid]
,[ModifiedDate]

Considering a real-world organization, we do not have a large number of employees leaving and joining the organization, so it will not really make sense to have a procedure in place that reloads the entire dimension. Prior to SQL Server 2008, we had to follow the method described in the next section to keep track of changes. SQL Server 2008 introduced Change Data Capture (CDC) and Change Tracking (CT), which help with the incremental loading of the data warehouse; however, the solution presented in the following section is generalized and will work for any source database. When it comes to managing changes in dimension tables, Slowly Changing Dimensions (SCD) are worth a mention, and we will look at them briefly here. There are three types of SCD (a minimal sketch of the Type 1 and Type 2 patterns follows the list), namely:

  • Type 1: The older values are overwritten by new values

  • Type 2: A new row specifying the present value for the dimension is inserted

  • Type 3: The column specifying TimeStamp from which the new value is effective is updated
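The following is a minimal sketch of the Type 1 and Type 2 patterns, shown as alternatives rather than a sequence to run together. The staging.Employee source and the simplified DimEmployee columns (EmployeeAlternateKey, JobTitle, StartDate, EndDate, CurrentFlag) are assumptions for illustration, not the exact AdventureWorksDW structure:

-- Type 1: overwrite the old value in place (no history is kept)
MERGE dbo.DimEmployee AS tgt
USING staging.Employee AS src
    ON tgt.EmployeeAlternateKey = src.BusinessEntityID
WHEN MATCHED AND tgt.JobTitle <> src.JobTitle THEN
    UPDATE SET tgt.JobTitle = src.JobTitle;

-- Type 2: expire the current row and insert a new one (history is kept)
UPDATE d
SET    d.EndDate = GETDATE(), d.CurrentFlag = 0
FROM   dbo.DimEmployee  AS d
JOIN   staging.Employee AS s ON d.EmployeeAlternateKey = s.BusinessEntityID
WHERE  d.CurrentFlag = 1 AND d.JobTitle <> s.JobTitle;

INSERT INTO dbo.DimEmployee (EmployeeAlternateKey, JobTitle, StartDate, EndDate, CurrentFlag)
SELECT s.BusinessEntityID, s.JobTitle, GETDATE(), NULL, 1
FROM   staging.Employee AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.DimEmployee AS d
                   WHERE d.EmployeeAlternateKey = s.BusinessEntityID
                     AND d.CurrentFlag = 1);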

Let's take HireDate as the column used to keep track of incremental loading. We will also have to maintain a small table that keeps track of the data that has been loaded from the employee table. So, we create a table as follows:

CREATE TABLE employee_load_status (
    HireDate   DATETIME,
    LoadStatus VARCHAR(10)   -- 'success' or 'failed'
);

The following script will load the employee table from the AdventureWorks database to the DimEmployee table in the AdventureWorksDW database:

WITH employee_loaded_date (HireDate) AS
(
    SELECT ISNULL(MAX(HireDate), CAST('1900-01-01' AS DATETIME))
    FROM   employee_load_status WHERE LoadStatus = 'success'
    UNION ALL
    SELECT ISNULL(MIN(HireDate), CAST('1900-01-01' AS DATETIME))
    FROM   employee_load_status WHERE LoadStatus = 'failed'
)
INSERT INTO DimEmployee
SELECT * FROM employee
WHERE  HireDate >= (SELECT MIN(HireDate) FROM employee_loaded_date);

This will reload all the data from the date of the first failure till the present day.
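For the preceding script to work across runs, each run has to record its outcome in employee_load_status. One possible bookkeeping pattern (a sketch, not the book's prescribed procedure) is to wrap the load in TRY/CATCH and log a watermark:

DECLARE @watermark DATETIME =
    (SELECT ISNULL(MAX(HireDate), '1900-01-01')
     FROM   employee_load_status
     WHERE  LoadStatus = 'success');

BEGIN TRY
    -- ... run the INSERT INTO DimEmployee statement shown above ...

    -- On success, advance the watermark to the newest HireDate loaded
    INSERT INTO employee_load_status (HireDate, LoadStatus)
    SELECT MAX(HireDate), 'success' FROM DimEmployee;
END TRY
BEGIN CATCH
    -- On failure, log the old watermark so the next run retries from it
    INSERT INTO employee_load_status (HireDate, LoadStatus)
    VALUES (@watermark, 'failed');
END CATCH;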

A similar procedure can be followed to load the fact table but there is a catch. If we look at the sales table in the AdventureWorks database, we see the following columns:

[BusinessEntityID]
,[TerritoryID]
,[SalesQuota]
,[Bonus]
,[CommissionPct]
,[SalesYTD]
,[SalesLastYear]
,[rowguid]
,[ModifiedDate]

The SalesYTD column might change with every passing day, so do we perform a full load every day or do we perform an incremental load based on date? This will depend upon the procedure used to load the data in the sales table and the ModifiedDate column.

Assuming the ModifiedDate column reflects the date on which the load was performed, we also see that there is no table in the AdventureWorksDW that will use the SalesYTD field directly. We will have to apply some transformation to get the values of OrderQuantity, DateOfShipment, and so on.

Let's look at this with a simpler example. Consider we have the following sales table:

Name     SalesAmount   Date
Rama     1000          11-02-2014
Shyama   2000          11-02-2014

Consider we have the following fact table:

id   SalesAmount   Datekey

We will have to think of whether to apply incremental load or a complete reload of the table based on our end needs. So the entries for the incremental load will look like this:

id   SalesAmount   Datekey
Ra   1000          11-02-2014
Sh   2000          11-02-2014
Ra   4000          12-02-2014
Sh   5000          13-02-2014

Also, a complete reload will appear as shown here:

id   TotalSalesAmount   Datekey
Ra   5000               12-02-2014
Sh   7000               13-02-2014

Notice how the SalesAmount column changes to TotalSalesAmount depending on the load criteria.
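In T-SQL, the two strategies might be sketched as follows; stg.Sales and the dw fact tables are hypothetical names, and Datekey is assumed to be an integer surrogate key:

-- Incremental load: append only the rows newer than what is already loaded
INSERT INTO dw.FactSales (id, SalesAmount, Datekey)
SELECT s.id, s.SalesAmount, s.Datekey
FROM   stg.Sales AS s
WHERE  s.Datekey > (SELECT ISNULL(MAX(Datekey), 0) FROM dw.FactSales);

-- Full reload: truncate and rebuild, rolling SalesAmount up to TotalSalesAmount
TRUNCATE TABLE dw.FactSalesTotal;
INSERT INTO dw.FactSalesTotal (id, TotalSalesAmount, Datekey)
SELECT id, SUM(SalesAmount), MAX(Datekey)
FROM   stg.Sales
GROUP BY id;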

 

Understanding and cleansing data


Before entering the warehouse, the data should go through several cleaning operations. These involve shaping up the raw data through transformations, null value removal, duplicate removal, and so on. In this section, we will discuss the techniques and methodologies pertaining to understanding and cleansing the data; we will see how to identify the data that needs to be cleansed and how to set a benchmark for data sanity for subsequent data refreshes.

Data cleansing is the act of detecting data that is out of sync, inaccurate, or incomplete, and then correcting, deleting, or synchronizing it so that the chances of any ambiguity or inaccuracy in a prediction based on this data are minimized.

When we talk about data cleansing, we definitely need to have a basic data benchmark in place: a dataset against which the incoming data can be compared. Alternatively, there needs to be a set of conditions or criteria against which the incoming data can be compared. Once the data is compared and we have a list of deviations from the data sanity criteria, we can easily correct the deviations or delete the deviating data.

We will now get more specific about the meaning of the benchmark data or the set of conditions. The following are some of the situations that will give us a deeper understanding of the criteria that benchmark data will possess:

  • The data will have a very high conformance to the constraints or rules of the data store such as the target data type, the target data ranges, the target foreign key constraints, the target unique constraints, and so on.

  • The data should not be incomplete; for example, some of the data or a group of data might be expected to add up to a fixed total, and we need to make sure that such a total exists in the data store.

  • The data should be consistent. We might get the data feed from various sources, but the data needs to be in a common format; the most common example is customerid. The accounts department might have customerid as C<number>, but the sales department might have the same detail as ID<number>. In such cases, it is very important to understand how the data is stored in the different databases and how we can apply different data transformations so that the data is finally loaded into the main database or the warehouse in a consistent manner.

    In continuation with the preceding point, there might be a situation wherein the data is pooled in from different regions, in which case differences in locales give rise to complications, such as differences in currency, differences in the units of measurement of weights, and so on. In such a situation, the task of combining the data is even more difficult and requires more stages of transformation.

Thus, as we see from the preceding discussion, benchmark data would have the following aspects (a hand-rolled check along these lines is sketched after the list):

  • Higher conformance to the constraints

  • Higher degree of completeness

  • Higher consistency

  • Lower locale-specific variations
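Before turning to a tool, checks like these can be scripted by hand. The following is a small sketch against a hypothetical stg.Customer staging table, using the same columns that appear in the DQS example later in this chapter:

-- Conformance: keys that violate the expected 'AW' + 8-digit pattern
SELECT CustomerAlternateKey
FROM   stg.Customer
WHERE  CustomerAlternateKey NOT LIKE 'AW[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]';

-- Completeness: rows with missing mandatory attributes
SELECT COUNT(*) AS MissingGender
FROM   stg.Customer
WHERE  Gender IS NULL;

-- Consistency: values outside the agreed domain
SELECT DISTINCT MaritalStatus
FROM   stg.Customer
WHERE  MaritalStatus NOT IN ('M', 'S');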

The preceding task of ensuring high-quality data might seem an uphill task and might not be feasible for a human being; therefore, we have many tools at our disposal that can analyze the incoming data, detect any variation from the defined standard, suggest corrections, and even correct the data for us if we so desire. One such tool, which we will be using in our scenario, is Data Quality Services (DQS).

DQS is a knowledge-based product that is available with SQL Server and can be installed as part of the SQL Server installation.

Rather than going into a theoretical discussion of DQS, we will go through a practical example here and see how DQS makes the task of maintaining data quality easier and ensures a high degree of data quality.

The first thing that we need to do to have a data quality project is to create a knowledge base. We will have to start an instance of Data Quality Client.

Once started, the screen will look like this:

Start-up screen of the Data Quality Client

We will now create a new knowledge base by clicking on the New Knowledge Base button, as shown in the following screenshot:

Data Quality Client screen

This opens up the new Knowledge Base Management window, as shown here:

The Knowledge Base Management interface

Now, we will have to decide which columns we are going to use as our reference. In AdventureWorksDW, we take CustomerAlternateKey, MaritalStatus, and Gender from the DimCustomer table as the knowledge base.

We will now develop the knowledge base as follows:

Selecting the source of the data

Click on the icon, as shown in the following screenshot, to create a domain. We will create a unique domain for every column:

Creating a Domain

Click on the Next button to start the data discovery and we will reach the following screen:

Starting the Data Discovery Process once the Domains for the Fields have been created

In the following screenshot, we can see that CustomerAlternateKey has all unique values, while the MaritalStatus and Gender columns each have two unique values:

Unique values contained in each column

Then, we reach the following screen that gives us detailed data for each column:

Results of the discovery for each column

When we click on the Finish button, we will see the following screen:

We publish this knowledge base to be used for other datasets

We will now click on Publish, and we will have base data ready to test the sanity of other datasets.

Let's induce some anomaly in the data by changing the first record of the DimCustomer table as follows:

Columnname             PreviousValue   NextValue   ChangedValue
CustomerAlternateKey   AW00011002                  00011002
MaritalStatus          M               N
Gender                 M               N

Now, let's apply the DQS implementation over this new table and see the results. We can click on New Data Quality Project, as shown here:

Starting a New Data Quality Project

We name the project as testSanity, and select the activity as Cleaning, as shown in the following screenshot:

Project is for cleaning the data

We now click on Next and reach the next screen where we need to select the table to be cleansed, as shown here:

The Map tab in Data Quality Project

The remaining fields will be populated automatically. We now click on Next and start the analysis to arrive at the following screenshot:

The CustomerKey details in Data Quality Project

We now highlight the Gender domain to see the suggestion and the confidence for this suggestion, as shown here:

The Gender details in Data Quality Project

The confidence and suggestion for MaritalStatus is as shown in the following screenshot:

The MaritalStatus details in Data Quality Project

The Confidence column indicates the certainty with which DQS is suggesting the change in the value. So, we can see that it suggests the change in CustomerAlternateKey with a confidence of 73 percent, but it suggests adding the value N for MaritalStatus and Gender with a confidence of 0 percent, which we can approve or ignore.

 

Summary


In this chapter, we've covered the first three steps of any data mining process. We've considered the reasons why we would want to undertake a data mining activity and identified the goal we have in mind. We then looked at how to stage the data and cleanse it. In the next chapter, we will look at how the SQL Server Business Intelligence Suite will help us work with this data.

About the Authors
  • Amarpreet Singh Bassan

    Amarpreet Singh Bassan is a Microsoft Data Platform engineer who works on SQL Server and its surrounding technologies. He is a subject matter expert in SQL Server Analysis Services and Reporting Services. Amarpreet is also a part of Microsoft's HDInsight team.

    Browse publications by this author
  • Debarchan Sarkar

    Debarchan Sarkar is a Microsoft Data Platform engineer. He specializes in the Microsoft SQL Server Business Intelligence stack. Debarchan is a subject matter expert in SQL Server Integration Services and delves deep into the open source world, specifically the Apache Hadoop framework. He is currently working on a technology called HDInsight, which is Microsoft's distribution of Hadoop on Windows. He has authored various books on SQL Server and Big Data, including Microsoft SQL Server 2012 with Hadoop, Packt Publishing, and Pro Microsoft HDInsight: Hadoop on Windows, Apress. His Twitter handle is @debarchans.

    Browse publications by this author
Latest Reviews (1 review total)
Very satisfied with the broad product range.