Data mining is about finding hidden patterns and associations in data. A large number of analyses that can only be performed by a human in a reasonable time if the amount of data is small, can be done by a computer in a very short time. Before Excel 2016, it was possible to install an add-in (called Data Mining) that was packed with different methods and models that could be used, mostly as black boxes, to get insights and discover information in any dataset. Microsoft then changed its policy and started migrating this capability to the cloud, first to Office 365 and, most recently, to Azure. It is also possible to calculate many things in Excel, with built-in functions and even with the help of macros programmed in Visual Basic Applications (VBA). Here, we will show two cases where data mining can help us find important information...
You're reading from Hands-On Machine Learning with Microsoft Excel 2019
Technical requirements
To complete this section, the reader will need to download the transactions_by_dept.csv and cohort_input_data.csv files from the GitHub repository at https://github.com/PacktPublishing/Hands-On-Machine-Learning-with-Microsoft-Excel-2019/tree/master/Chapter06.
Learning by example – Market Basket Analysis
We have all read the sentence in almost every online store: People who bought this product also bought.... It all started with Amazon in the 1990s, and it is widespread today. This same principle is even being tested in physical stores, where customers can get personalized suggestions on which items to buy based on their shopping history and similarity with other products. These techniques are generally known as cross-selling, and they are useful since it is often easier to sell an additional product to an existing customer than to acquire a new one.
In this section, we will show a simplified example of Market Basket Analysis, which is the type of data mining technique behind these recommender systems. The results might not be as accurate as those obtained with more advanced methods, but are still useful to explain the method...
Learning by example – Customer Cohort Analysis
An excellent way of gaining insights about a company's customers and their behavior is to perform a segmented analysis. These segments are groups of customers that share the same characteristics and are usually called cohorts. Their definition depends very much on the type of business we are dealing with.
The dataset we will analyze contains a list of customer IDs, the date when they first purchased something from us, the day they left us, and the mean amount of money spent per month. What does "left us" mean in this context? It is mainly a matter of definition. For example, if we refer to a credit card, we could consider that the customer left when they cancelled the card, or maybe we want to go further and say that if the customer did not spend any money for two or three months, we consider that they have left...
Summary
In this chapter, we learned about two data mining techniques: Market Basket Analysis and Customer Cohort Analysis. The first one tells us about hidden relations between store departments or products based on the customers' behavior. The second shows the time evolution of the number of customers, revealing differences between different customer segments or cohorts.
There are a large number of data mining analyses that can help us dig into our datasets and find hidden information, which is extremely important in order for us to understand our business and make the right decisions.
In the next chapter, we will study a special kind of dataset in detail: time series. We will see that it needs a different kind of modeling and analysis.
Questions
- How can the unexplainable, random associations be avoided in a Market Basket Analysis?
- If we find a correlation between two products or two departments in a given store, can this be generalized to other places?
- Some columns in the cohort matrix show zeroes starting at a given date. What does this mean in the context of the current dataset?
- The last rows in the cohort matrix show zero consistently. What does this mean in the context of the current dataset?
- How would you maximize the amount of money spent by customers as time passes?
Further reading
- Database Marketing: Analyzing and Managing Customers by Robert C. Blattberg, Byung-Do Kim, Scott A Neslin
- Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking, 1st Edition, by Foster Provost, Tom Fawcett