R: Mining spatial, text, web, and social media data

By Nathan H. Danneman, Richard Heimann, Pradeepta Mishra and 1 more
About this book
Data mining is the first step to understanding data and making sense of heaps of data. Properly mined data forms the basis of all data analysis and computing performed on it. This learning path will take you from the very basics of data mining to advanced data mining techniques, and will end with a specialized branch of data mining: social media mining. You will learn how to manipulate data with R using code snippets and how to mine frequent patterns, association, and correlation while working with R programs. You will discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on RHadoop projects. Once you are comfortable with data mining with R, you will move on to implementing your knowledge with the help of end-to-end data mining projects. You will learn how to apply different mining concepts to various statistical and data applications in a wide range of fields. At this stage, you will be able to complete complex data mining cases and handle any issues you might encounter during projects. After this, you will gain hands-on experience of generating insights from social media data. You will get detailed instructions on how to obtain, process, and analyze a variety of socially generated data, along with the theoretical background needed to interpret your findings accurately. You will be shown R code and examples of data that can be used as a springboard for your own analyses of business, social, or political data. This Learning Path combines some of the best that Packt has to offer in one complete, curated package. It includes content from the following Packt products:

  • Learning Data Mining with R by Bater Makhabel
  • R Data Mining Blueprints by Pradeepta Mishra
  • Social Media Mining with R by Nathan Danneman and Richard Heimann
Publication date: June 2017
Publisher: Packt
ISBN: 9781788293747

 

Part 1. Module 1

Learning Data Mining with R

Develop key skills and techniques with R to create and customize data mining algorithms

 

Chapter 1. Warming Up

In this chapter, you will learn basic data mining terms such as data definition, preprocessing, and so on.

The most important data mining algorithms will be illustrated with R to help you grasp the principles quickly, including but not limited to, classification, clustering, and outlier detection. Before diving right into data mining, let's have a look at the topics we'll cover:

  • Data mining
  • Social network mining
  • Text mining
  • Web data mining
  • Why R
  • Statistics
  • Machine learning
  • Data attributes and description
  • Data measuring
  • Data cleaning
  • Data integration
  • Data reduction
  • Data transformation and discretization
  • Visualization of results

Throughout human history, data has accumulated from every aspect of life: for example, websites, social networks (identified by a user's e-mail, name, or account), search terms, locations on maps, companies, IP addresses, books, films, music, and products.

Data mining techniques can be applied to any kind of old or emerging data; each data type can be best dealt with using certain, but not all, techniques. In other words, the data mining techniques are constrained by data type, size of the dataset, context of the tasks applied, and so on. Every dataset has its own appropriate data mining solutions.

New data mining techniques need to be researched whenever a new data type appears to which the old techniques cannot be applied, or which cannot be transformed into the traditional data types. The evolution of stream mining algorithms applied to Twitter's huge source set is one typical example. The graph mining algorithms developed for social networks are another example.

The most popular and basic forms of data are from databases, data warehouses, ordered/sequence data, graph data, text data, and so on. In other words, they are federated data, high dimensional data, longitudinal data, streaming data, web data, numeric, categorical, or text data.

Big data

Big data is a large amount of data that does not fit in the memory of a single machine. In other words, the size of the data itself becomes a part of the issue when studying it. Besides volume, two other major characteristics of big data are variety and velocity; these are the famous three Vs of big data. Velocity is the rate at which data arrives and has to be processed. Variety denotes the range of data source types. Noise arises more frequently in big data source sets and affects the mining results, which calls for efficient data preprocessing algorithms.

As a result, distributed filesystems are used as tools for the successful implementation of parallel algorithms on large amounts of data; it is a certainty that we will get even more data with each passing second. Data analytics and visualization techniques are the primary factors of the data mining tasks related to massive data. The characteristics of massive data have motivated many new data mining platforms, one of which is RHadoop. We'll be describing this in a later section.

Some data types that are important to big data are as follows:

  • The first data type is camera video, which carries additional metadata that can be analyzed to expedite crime investigations, enhance retail analysis, support military intelligence, and so on.
  • The second data type is from embedded sensors, such as medical sensors, used to monitor any potential outbreaks of a virus.
  • The third data type is from entertainment, that is, information freely published through social media by anyone.
  • The last data type is consumer images aggregated from social media; the tags attached to these images are important.

Here is a table illustrating the history of data size growth. It shows that the amount of information will more than double every two years, changing the way researchers and companies manage data and extract value from it through data mining techniques, and prompting new data mining studies.

Year | Data Sizes | Comments
N/A | | 1 MB (Megabyte) = Big data. The human brain holds about 200 MB of information.
N/A | | 1 PB (Petabyte) = Big data. It is similar to the size of 3 years' observation data for Earth by NASA and is equivalent to 70.8 times the books in America's Library of Congress.
1999 | 1 EB | 1 EB (Exabyte) = Big data. The world produced 1.5 EB of unique information.
2007 | 281 EB | The world produced about 281 exabytes of unique information.
2011 | 1.8 ZB | 1 ZB (Zettabyte) = Big data. This is all data gathered by human beings in 2011.
Very soon | | 1 YB (Yottabyte) = Big data.
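
The "more than double every two years" claim can be turned into a quick back-of-the-envelope projection. The following is a minimal sketch in base R, assuming exact doubling every two years from the 2011 figure in the table; the helper name `project_size` is hypothetical, and the doubling rule is an assumption taken from the text rather than a measured law.

```r
# Hypothetical helper: project data volume assuming it doubles every
# `doubling_years` years (an assumption taken from the text).
project_size <- function(base_size, base_year, target_year, doubling_years = 2) {
  base_size * 2^((target_year - base_year) / doubling_years)
}

# Starting point from the table: 1.8 ZB in 2011.
years <- seq(2011, 2021, by = 2)
sizes_zb <- project_size(1.8, 2011, years)
print(data.frame(year = years, size_zb = sizes_zb))

# Years needed to grow from 1.8 ZB to 1 YB (1 YB = 1000 ZB)
# under the same doubling assumption.
years_to_yb <- 2 * log2(1000 / 1.8)
print(round(years_to_yb, 1))
```

Under this assumption, the step from zettabytes to a yottabyte takes roughly two decades, which gives a sense of scale for the "Very soon" row.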

Scalability and efficiency

Efficiency, scalability, performance, optimization, and the ability to perform in real time are important issues for almost any algorithm, and data mining is no exception. They serve as the necessary metrics or benchmark factors for data mining algorithms.

As the amount of data continues to grow, data mining algorithms must remain efficient and scalable in order to extract information from massive datasets in many data repositories or data streams.

The move of data storage from a single machine to wide distribution, the huge size of many datasets, and the computational complexity of the data mining methods are all factors that drive the development of parallel and distributed data-intensive mining algorithms.

Data source

Data serves as the input for the data mining system, so data repositories are important. In an enterprise environment, databases and log files are common sources. In web data mining, web pages are the source of data. Data continuously fetched from various sensors is also a typical data source.

Note

Here are some free online data sources particularly helpful to learn about data mining:

  • Frequent Itemset Mining Dataset Repository: A repository with datasets for methods to find frequent itemsets (http://fimi.ua.ac.be/data/).
  • UCI Machine Learning Repository: This is a collection of datasets suitable for classification tasks (http://archive.ics.uci.edu/ml/).
  • The Data and Story Library at statlib: DASL (pronounced "dazzle") is an online library of data files and stories that illustrate the use of basic statistics methods. We hope to provide data from a wide variety of topics so that statistics teachers can find real-world examples that will be interesting to their students. Use DASL's powerful search engine to locate the story or data file of interest. (http://lib.stat.cmu.edu/DASL/)
  • WordNet: This is a lexical database for English (http://wordnet.princeton.edu)

Data mining

Data mining is the discovery of a model in data; it is also called exploratory data analysis, and it discovers useful, valid, unexpected, and understandable knowledge from the data. It shares some goals with other sciences, such as statistics, artificial intelligence, machine learning, and pattern recognition. Data mining is frequently treated as an algorithmic problem. Clustering, classification, association rule learning, anomaly detection, regression, and summarization are all data mining tasks.

The data mining methods can be summarized into two main categories of data mining problems: feature extraction and summarization.

Feature extraction

This is to extract the most prominent features of the data and ignore the rest. Here are some examples:

  • Frequent itemsets: This model makes sense for data that consists of baskets of small sets of items.
  • Similar items: Sometimes your data looks like a collection of sets and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. It's a fundamental problem of data mining.
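
Both examples above can be sketched in a few lines of base R. This is a minimal illustration on made-up basket data, not the book's own implementation; the helper names `support` and `jaccard` and the 0.5 support threshold are assumptions chosen for the example.

```r
# Toy market-basket data (made up for illustration).
baskets <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)

# Support of an itemset: the fraction of baskets containing every item in it.
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Enumerate all item pairs and keep those that meet a minimum support.
items <- sort(unique(unlist(baskets)))
pairs <- combn(items, 2, simplify = FALSE)
pair_support <- vapply(pairs, support, numeric(1), baskets = baskets)
names(pair_support) <- vapply(pairs, paste, character(1), collapse = ",")
frequent_pairs <- pair_support[pair_support >= 0.5]
print(frequent_pairs)

# Jaccard similarity: size of the intersection divided by size of the union.
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}
print(jaccard(baskets[[4]], baskets[[5]]))
```

Enumerating every pair is only practical for tiny examples; it is used here to keep the definition of support visible.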

Summarization

The target is to summarize the dataset succinctly and approximately, such as clustering, which is the process of examining a collection of points (data) and grouping the points into clusters according to some measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.
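
As a minimal sketch of this idea, the following uses k-means from base R's stats package on synthetic points. k-means is just one clustering method, picked here for illustration; the text does not name a specific algorithm, and the data and variable names are made up.

```r
set.seed(42)  # make the random data and the k-means starts reproducible

# Two synthetic, well-separated groups of 2-D points (made up for illustration).
group_a <- matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2)
group_b <- matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
pts <- rbind(group_a, group_b)

# k-means assigns each point to the nearest of k cluster centres.
fit <- kmeans(pts, centers = 2, nstart = 10)

print(table(fit$cluster))
print(fit$centers)

# Small within-cluster spread and large between-cluster spread indicate
# that points are close to their own cluster and far from the other one.
print(fit$tot.withinss)
print(fit$betweenss)
```

The two cluster centres act as the succinct, approximate summary of the 40 points.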

The data mining process

There are two popular processes to define the data mining process in different perspectives, and the more widely adopted one is CRISP-DM:

  • Cross-Industry Standard Process for Data Mining (CRISP-DM)
  • Sample, Explore, Modify, Model, Assess (SEMMA), which was developed by the SAS Institute, USA

CRISP-DM

There are six phases in this process that are shown in the following figure; it is not rigid, but often has a great deal of backtracking:

[Figure: the six phases of CRISP-DM]

Let's look at the phases in detail:

  • Business understanding: This task includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a plan.
  • Data understanding: This task evaluates data requirements and includes initial data collection, data description, data exploration, and the verification of data quality.
  • Data preparation: Once the available data resources have been identified in the previous step, the data needs to be selected, cleaned, and built into the desired form and format.
  • Modeling: Visualization and cluster analysis are useful for initial analysis. The initial association rules can be developed by applying tools such as generalized rule induction. This is a data mining technique that discovers knowledge represented as rules, describing the data in terms of causal relationships between conditional factors and a given decision or outcome. Models appropriate to the data type can also be applied.
  • Evaluation: The results should be evaluated in the context specified by the business objectives in the first step. In most cases, this leads to the identification of new needs, which in turn means returning to the prior phases.
  • Deployment: Data mining can be used both to verify previously held hypotheses and for knowledge discovery.

SEMMA

Here is an overview of the process for SEMMA:

[Figure: overview of the SEMMA process]

Let's look at these processes in detail:

  • Sample: In this step, a portion of a large dataset is extracted
  • Explore: To gain a better understanding of the dataset, unanticipated trends and anomalies are searched for in this step
  • Modify: The variables are created, selected, and transformed to focus on the model construction process
  • Model: A combination of variables and models is sought that predicts a desired outcome
  • Assess: The findings from the data mining process are evaluated for their usefulness and reliability
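
The five SEMMA steps can be walked through on a small scale in base R. This is a rough sketch, assuming the built-in iris dataset and a linear regression as stand-ins; the derived variable `Petal.Area` and the choice of model are illustrative assumptions, not something SEMMA prescribes.

```r
set.seed(1)

# Sample: draw a portion of the dataset (iris ships with base R).
idx <- sample(nrow(iris), size = 100)
train <- iris[idx, ]
holdout <- iris[-idx, ]

# Explore: summary statistics help spot unusual values or trends.
print(summary(train$Petal.Length))

# Modify: create a derived variable on both subsets.
train$Petal.Area <- train$Petal.Length * train$Petal.Width
holdout$Petal.Area <- holdout$Petal.Length * holdout$Petal.Width

# Model: fit a linear regression that predicts sepal length.
model <- lm(Sepal.Length ~ Petal.Area + Species, data = train)

# Assess: measure the prediction error on the held-out rows.
pred <- predict(model, newdata = holdout)
rmse <- sqrt(mean((holdout$Sepal.Length - pred)^2))
print(rmse)
```

In practice each step is revisited several times; the linear order here is only for readability.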

Social network mining

As we mentioned before, data mining finds a model in data; social network mining finds a model in the graph data that represents the social network.

Social network mining is one application of web data mining; popular topics include social sciences and bibliometry, PageRank and HITS, shortcomings of the coarse-grained graph model, enhanced models and techniques, evaluation of topic distillation, and measuring and modeling the Web.

Social network

When it comes to the discussion of social networks, you will think of Facebook, Google+, LinkedIn, and so on. The essential characteristics of a social network are as follows:

  • There is a collection of entities that participate in the network. Typically, these entities are people, but they could be something else entirely.
  • There is at least one relationship between the entities of the network. On Facebook, this relationship is called friends. Sometimes, the relationship is all-or-nothing; two people are either friends or they are not. However, in other examples of social networks, the relationship has a degree. This degree could be discrete, for example, friends, family, acquaintances, or none as in Google+. It could be a real number; an example would be the fraction of the average day that two people spend talking to each other.
  • There is an assumption of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related.
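
The locality assumption can be checked numerically on a small graph. The sketch below, in base R, uses a made-up six-person network and compares the overall chance that two people are related (edge density) with the chance that B and C are related given that A is related to both (often called transitivity); the network and variable names are hypothetical.

```r
# Hypothetical friendship network: two triangles joined by one edge,
# stored as a symmetric 0/1 adjacency matrix.
people <- c("A", "B", "C", "D", "E", "F")
adj <- matrix(0, 6, 6, dimnames = list(people, people))
edges <- rbind(c("A", "B"), c("A", "C"), c("B", "C"),
               c("C", "D"),
               c("D", "E"), c("D", "F"), c("E", "F"))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1  # friendship is mutual, so mirror each edge

# Edge density: the chance that two randomly chosen people are related.
n <- nrow(adj)
density <- sum(adj) / (n * (n - 1))

# paths2[i, k] counts the people j who are related to both i and k.
paths2 <- adj %*% adj
diag(paths2) <- 0  # ignore paths that return to their starting point

# Transitivity: the share of such i-j-k paths where i and k are also related.
transitivity <- sum(paths2 * adj) / sum(paths2)

print(density)
print(transitivity)
```

A transitivity above the density is what the clustering intuition predicts: sharing a common contact makes a relationship more likely than average.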

Here are some varieties of social networks:

  • Telephone networks: The nodes in this network are phone numbers and represent individuals
  • E-mail networks: The nodes represent e-mail addresses, which represent individuals
  • Collaboration networks: The nodes here represent individuals who published research papers; an edge connecting two nodes represents two individuals who published one or more papers jointly

Social networks are modeled as undirected graphs. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges.
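
The graph model described above can be sketched in base R as an adjacency matrix. The names and ties below are hypothetical and serve only as an illustration; the triangle count is one simple way to look at the locality property, since a triangle is a case where A is related to B and C, and B and C are also related.

```r
# Hypothetical toy network: five people and their undirected "friend" ties.
people <- c("Ann", "Bob", "Cat", "Dan", "Eve")
edges <- rbind(c("Ann", "Bob"), c("Ann", "Cat"), c("Bob", "Cat"),
               c("Cat", "Dan"), c("Dan", "Eve"))

# Symmetric adjacency matrix: adj[i, j] = 1 when i and j are related.
adj <- matrix(0, nrow = length(people), ncol = length(people),
              dimnames = list(people, people))
adj[edges] <- 1            # tie from the first to the second person
adj[edges[, 2:1]] <- 1     # mirror it, because the graph is undirected

degree <- rowSums(adj)     # number of ties per person
# Each closed triad contributes 6 to the trace of adj^3.
triangles <- sum(diag(adj %*% adj %*% adj)) / 6
print(degree)
print(triangles)
```

If the relationship has a degree, the 1s in the matrix could be replaced by edge labels or weights, such as the fraction of the day two people spend talking to each other.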

Here is an example that uses Coleman's High School Friendship Data from the sna R package. The data comes from a study of friendship ties between 73 boys in a high school in one academic year; the ties reported by all informants are provided for two time points (fall and spring). The dataset is named coleman and is stored as an array in R. When the network is plotted, each node denotes a specific student and each line represents a tie between two students.
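
A minimal sketch of loading this dataset follows. It assumes that the sna package is installed (the block does nothing otherwise) and that the first slice of the array holds the fall ties and the second the spring ties; check the package documentation before relying on that order.

```r
# Assumes the sna package is installed; the block is skipped otherwise.
has_sna <- requireNamespace("sna", quietly = TRUE)
if (has_sna) {
  data(coleman, package = "sna")   # array holding two 73 x 73 adjacency matrices
  print(dim(coleman))
  fall <- coleman[1, , ]           # assumed: ties reported in fall
  spring <- coleman[2, , ]         # assumed: ties reported in spring
  cat("Reported ties, first time point: ", sum(fall), "\n")
  cat("Reported ties, second time point:", sum(spring), "\n")
  # sna::gplot(fall) would draw the network as nodes and lines.
}
```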

Text mining

Text mining works on text data. It is concerned with extracting relevant information from large bodies of natural language text and with searching for interesting relationships, syntactic correlations, or semantic associations between the extracted entities or terms. It is also defined as the automatic or semiautomatic processing of text. The related algorithms include text clustering, text classification, natural language processing, and web mining.

One characteristic of text mining is that text is often mixed with numbers; in other words, the source dataset contains hybrid data types. The text is usually a collection of unstructured documents, which is preprocessed and transformed into a structured, numerical representation. After the transformation, most data mining algorithms can be applied effectively.

The process of text mining is described as follows:

  • Text mining starts with preparing the text corpus, which consists of reports, letters, and so forth
  • The second step is to build a semistructured text database based on the text corpus
  • The third step is to build a term-document matrix that records the term frequencies
  • The final step is further analysis, such as text analysis, semantic analysis, information retrieval, and information summarization
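
The third step can be sketched in base R as follows. The three-document corpus is hypothetical, and the tokenizer simply splits on anything that is not a letter; dedicated text mining packages offer much richer preprocessing.

```r
# Hypothetical three-document corpus.
corpus <- c(d1 = "R is a language for statistics",
            d2 = "Text mining turns text into numbers",
            d3 = "Statistics and text mining with R")

# Lower-case each document and split it into word tokens.
tokens <- strsplit(tolower(corpus), "[^a-z]+")
names(tokens) <- names(corpus)

# Term-document matrix: one row per term, one column per document,
# each cell holding the term frequency.
terms <- sort(unique(unlist(tokens)))
tdm <- sapply(tokens, function(tok) as.vector(table(factor(tok, levels = terms))))
rownames(tdm) <- terms
print(tdm)
```

Once the corpus is in this numeric form, standard data mining algorithms can be applied to the rows or columns of the matrix.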

Information retrieval and text mining

Information retrieval helps users find information and is most commonly associated with online documents. It focuses on the acquisition, organization, storage, retrieval, and distribution of information. The task of Information Retrieval (IR) is to retrieve relevant documents in response to a query. The fundamental technique of IR is measuring similarity. Key steps in IR are as follows:

  • Specify a query. The following are some of the types of queries:
    • Keyword query: This is expressed by a list of keywords to find documents that contain at least one keyword
    • Boolean query: This is constructed with Boolean operators and keywords
    • Phrase query: This is a query that consists of a sequence of words that makes up a phrase
    • Proximity query: This is a relaxed version of the phrase query and can be a combination of keywords and phrases
    • Full document query: This query is a full document to find other documents similar to the query document
    • Natural language questions: This query helps to express users' requirements as a natural language question
  • Search the document collection.
  • Return the subset of relevant documents.
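
The three steps can be sketched in base R for a keyword query. The documents and the query are hypothetical, and cosine similarity over term-frequency vectors is used here as one common similarity measure, not as the only option.

```r
# Hypothetical document collection and keyword query.
docs <- c(d1 = "data mining with r",
          d2 = "social network analysis",
          d3 = "text mining and web mining")
query <- "text mining"

# Term-frequency vector of a text over a fixed vocabulary.
tf <- function(text, vocab) {
  tok <- strsplit(tolower(text), "[^a-z]+")[[1]]
  as.vector(table(factor(tok, levels = vocab)))
}
vocab <- sort(unique(unlist(strsplit(tolower(docs), "[^a-z]+"))))

# Cosine similarity between two vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Search the collection, then keep the relevant subset, best match first.
q_vec <- tf(query, vocab)
scores <- sapply(docs, function(d) cosine(tf(d, vocab), q_vec))
ranked <- sort(scores[scores > 0], decreasing = TRUE)
print(ranked)
```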

Mining text for prediction

Predicting outcomes from text is just as ambitious as prediction from numerical data and shares the problems associated with numerical classification. It is generally a classification problem.

Prediction from text needs prior experience, in the form of labeled sample documents, to learn how to make predictions for new documents. Once the text is transformed into numeric data, standard prediction methods can be applied.

Web data mining

Web mining aims to discover useful information or knowledge from the web hyperlink structure, page content, and usage data. The Web is one of the biggest data sources that can serve as input for data mining applications.

Web data mining draws on IR, machine learning (ML), statistics, pattern recognition, and data mining. Although many data mining approaches can be applied to it, web mining is not purely a data mining problem because web data is heterogeneous and semistructured or unstructured.

Web mining tasks can be divided into at least three types:

  • Web structure mining: This helps to find useful information or valuable structural summary about sites and pages from hyperlinks
  • Web content mining: This helps to mine useful information from web page contents
  • Web usage mining: This helps to discover user access patterns from web logs to detect intrusion, fraud, and attempted break-in

The algorithms applied to web data mining originate from classical data mining algorithms. The two share many similarities, such as the mining process; however, differences exist too. The following characteristics make web data mining different from data mining:

  • Much of the data is unstructured
  • The information on the Web keeps changing and the amount of data keeps growing
  • Any data type is available on the Web, including structured and unstructured data
  • The information on the Web is heterogeneous, and redundant pages are present too
  • Vast amounts of information on the Web are linked
  • The data is noisy

Web data mining differs from data mining in the huge and dynamic volume of its source data, the wide variety of data formats, and so on. The most popular data mining tasks related to the Web are as follows:

  • Information extraction (IE): The task of IE consists of several steps: tokenization, sentence segmentation, part-of-speech assignment, named entity identification, phrasal parsing, sentential parsing, semantic interpretation, discourse interpretation, template filling, and merging.
  • Natural language processing (NLP): This studies the linguistic characteristics of human-human and human-machine interaction, models of linguistic competence and performance, frameworks to implement processes with such models, the iterative refinement of these processes and models, and evaluation techniques for the resulting systems. Classical NLP tasks related to web data mining are tagging, knowledge representation, ontologies, and so on.
  • Question answering: The goal is to find answers to natural language questions in a collection of text. It can be categorized into slot filling, limited domain, and open domain, in increasing order of difficulty. One simple example is answering customer queries from a predefined FAQ.
  • Resource discovery: The popular applications are collecting important pages preferentially; similarity search using link topology, topical locality and focused crawling; and discovering communities.

Why R?

R is a high-quality, cross-platform, flexible, widely used, free and open source language for statistics, graphics, mathematics, and data science, created by statisticians for statisticians.

R offers more than 5,000 algorithms and has millions of users with domain knowledge worldwide, and it is supported by a vibrant and talented community of contributors. It allows access to both well-established and experimental statistical techniques.

R is a free, open source software environment for statistical computing and graphics maintained by the R Project, and the R source code is available under the terms of the Free Software Foundation's GNU General Public License. R compiles and runs on a wide variety of platforms, such as UNIX, Linux, Windows, and Mac OS.

What are the disadvantages of R?

R has three shortcomings:

  • It is memory bound: it requires the entire dataset to be stored in memory (RAM) to achieve high performance, which is also called in-memory analytics.
  • As with other open source systems, anyone can create and contribute packages, whether rigorously tested or not. In other words, packages contributed to the R community can be bug-prone and may need more testing to ensure the quality of the code.
  • R can be slower than some commercial languages.

Fortunately, there are packages available to overcome these problems. Some of the solutions can be categorized as parallelism solutions; the essence here is to spread the work across multiple CPUs, which addresses the memory and speed shortcomings just listed. Good examples include, but are not limited to, RHadoop. You will read more on this topic in the following sections. You can download the snow add-on package from the Comprehensive R Archive Network (CRAN); the parallel package is bundled with R itself.
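
As a minimal sketch of the parallelism idea, the following uses the parallel package to spread a toy task over two worker processes. The function slow_square and the choice of two workers are illustrative assumptions; real speedups depend on the task and the machine.

```r
library(parallel)   # bundled with R

# A deliberately simple task standing in for an expensive computation.
slow_square <- function(x) {
  Sys.sleep(0.05)
  x^2
}

cl <- makeCluster(2)                          # two workers (an arbitrary choice)
par_result <- parLapply(cl, 1:8, slow_square) # work is split across the workers
stopCluster(cl)

seq_result <- lapply(1:8, slow_square)        # same work on a single CPU
same <- identical(par_result, seq_result)     # results agree; only timing differs
print(same)
```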

Statistics

Statistics studies the collection, analysis, interpretation or explanation, and presentation of data. It serves as the foundation of data mining and the relations will be illustrated in the following sections.

Statistics and data mining

Statisticians were the first to use the term data mining. Originally, data mining was a derogatory term referring to attempts to extract information that was not supported by the data. Today, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the observed data is drawn.

Data mining has an inherent relationship with statistics; one of the mathematical foundations of data mining is statistics, and many statistics models are used in data mining.

Statistical methods can be used to summarize a collection of data and can also be used to verify data mining results.

Statistics and machine learning

As statistics and machine learning have developed, a continuum has formed between the two subjects. Statistical tests are used to validate machine learning models and to evaluate machine learning algorithms, and machine learning techniques are combined with standard statistical techniques.

Statistics and R

R is a statistical programming language. It provides a large number of statistical functions, which are based on statistical theory. Many R add-on package contributors come from the field of statistics and use R in their research.

The limitations of statistics on data mining

Statistics also places limits on data mining: one can make errors by trying to extract what really isn't in the data.

Bonferroni's principle is an informal version of a statistical theorem known as the Bonferroni correction. If the number of items returned by an algorithm dramatically exceeds the number you would expect to be real, you can assume that a big portion of the items you find are bogus.
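
A small worked example may make this concrete. The numbers below are hypothetical: with 1,000 tests at a significance level of 0.05, about 50 results would look significant by chance alone, and the Bonferroni correction, available in base R through p.adjust, compensates by scaling each p-value by the number of tests.

```r
# Expected number of false positives when every null hypothesis is true.
n_tests <- 1000
alpha <- 0.05
expected_false_positives <- n_tests * alpha   # 50 chance "discoveries"

# Bonferroni correction: multiply each p-value by the number of tests
# (capped at 1) and compare the result against alpha.
p_values <- c(0.00001, 0.0004, 0.01, 0.04)    # hypothetical p-values
adjusted <- p.adjust(p_values, method = "bonferroni", n = n_tests)
significant <- adjusted < alpha
print(data.frame(p_values, adjusted, significant))
```

Only the smallest p-value survives the correction, although all four would have passed an uncorrected 0.05 threshold.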

Machine learning

The data to which a ML algorithm is applied is called a training set, which consists of a set of pairs (x, y), called training examples. The pairs are explained as follows:

  • x: This is a vector of values, often called the feature vector. Each value, or feature, can be categorical (values are taken from a set of discrete values, such as {S, M, L}) or numerical.
  • y: This is the label, the classification or regression values for x.

The objective of the ML process is to discover a function $y = f(x)$ that best predicts the value of y associated with each value of x. The type of y is in principle arbitrary, but there are several common and important cases.

  • y: This is a real number. The ML problem is called regression.
  • y: This is a Boolean value true or false, more commonly written as +1 and -1, respectively. In this class, the problem is binary classification.
  • y: Here this is a member of some finite set. The member of this set can be thought of as classes, and each member represents one class. The problem is multiclass classification.
  • y: This is a member of some potentially infinite set, for example, a parse tree for x, which is interpreted as a sentence.

Machine learning has not proved successful in situations where we can describe the goals of the mining more directly. Machine learning and data mining are two different topics, although they share some algorithms, especially when the goal is to extract information. There are situations where machine learning makes sense; the typical one is when we have little idea of what we are looking for in the dataset.

Approaches to machine learning

The major classes of algorithms are listed here. Each is distinguished by the form of the function $f$.

  • Decision tree: This form of $f$ is a tree, and each node of the tree has a function of x that determines to which child or children the search must proceed.
  • Perceptron: These are threshold functions applied to the components of the vector $x = [x_1, x_2, \ldots, x_n]$. A weight $w_i$ is associated with the ith component, for each i = 1, 2, … n, and there is a threshold $\theta$. The output is +1 if $\sum_{i=1}^{n} w_i x_i \ge \theta$ and the output is -1 otherwise.
  • Neural nets: These are acyclic networks of perceptrons, with the outputs of some perceptrons used as inputs to others.
  • Instance-based learning: This uses the entire training set to represent the function $f$.
  • Support-vector machines: The result of this class is a classifier that tends to be more accurate on unseen data. Class separation is achieved by looking for the optimal hyperplane that separates two classes by maximizing the margin between the classes' closest points.
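
To illustrate the perceptron entry, here is a minimal sketch of the threshold function in R. It assumes the usual convention that the output is +1 when the weighted sum reaches the threshold; the weights and threshold are hypothetical values chosen so that the perceptron behaves like a logical AND of two binary features.

```r
# Perceptron output: +1 if sum(w * x) >= theta, otherwise -1.
perceptron <- function(x, w, theta) {
  if (sum(w * x) >= theta) 1 else -1
}

# Hypothetical weights and threshold implementing a logical AND.
w <- c(1, 1)
theta <- 1.5

print(perceptron(c(1, 1), w, theta))   # both features on
print(perceptron(c(1, 0), w, theta))   # only one feature on
```

Training a perceptron means finding w and theta from the training set; that step is not shown here.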

Machine learning architecture

The data aspects of machine learning here mean the way data is handled and the way it is used to build the model.

  • Training and testing: Assuming all the data is suitable for training, separate out a small fraction of the available data as the test set; use the remaining data to build a suitable model or classifier.
  • Batch versus online learning: In batch mode, the entire training set is available at the beginning of the process; in online learning, the training set arrives in a stream and cannot be revisited after it is processed.
  • Feature selection: This helps to figure out what features to use as input to the learning algorithm.
  • Creating a training set: This creates, by hand, the label information that turns data into a training set.
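
The training and testing split can be sketched in a few lines of base R. The iris dataset ships with R; the 20 percent test fraction and the seed are arbitrary assumptions made for this illustration.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility only
data(iris)                                     # example dataset bundled with R
n <- nrow(iris)
test_idx <- sample(n, size = round(0.2 * n))   # hold out 20% (an assumed fraction)
test_set <- iris[test_idx, ]                   # used only to evaluate the model
train_set <- iris[-test_idx, ]                 # used to build the model
cat(nrow(train_set), "training rows,", nrow(test_set), "test rows\n")
```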

Data attributes and description

An attribute is a field representing a certain feature, characteristic, or dimension of a data object.

In most situations, data can be modeled or represented with a matrix, with columns for data attributes and rows for data records. In other cases, such as text, time series, images, audio, and video, the data cannot be represented directly as a matrix, but it can be transformed into one by appropriate methods, such as feature extraction.

The type of a data attribute arises from its context, domain, or semantics; there are numerical, non-numerical, categorical, and text data types. Two views of data attributes and descriptions are widely used in data mining and R. They are as follows:

  • Data in algebraic or geometric view: The entire dataset can be modeled as a matrix; linear algebra and abstract algebra play an important role here.
  • Data in probability view: The observed data is treated as a multidimensional random variable; each numeric attribute is a random variable, and the number of random variables is the dimension of the data. Irrespective of whether the values are discrete or continuous, probability theory can be applied here.

To help you learn R more naturally, we shall adopt a geometric, algebraic, and probabilistic view of the data.

Here is a matrix example. The number of columns is determined by m, which is the dimensionality of the data. The number of rows is determined by n, which is the size of the dataset.

$$D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$

Where $x_i$ denotes the ith row, which is an m-tuple as follows:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$$

And $X_j$ denotes the jth column, which is an n-tuple as follows:

$$X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})$$
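
In R, this matrix view maps directly onto the matrix type. The sketch below uses hypothetical values with n = 4 records and m = 3 attributes; the row and column names are illustrative only.

```r
# Hypothetical dataset with n = 4 records (rows) and m = 3 attributes (columns).
D <- matrix(c(5.1, 3.5, 1.4,
              4.9, 3.0, 1.4,
              6.2, 2.9, 4.3,
              5.9, 3.0, 5.1),
            nrow = 4, ncol = 3, byrow = TRUE,
            dimnames = list(paste0("x", 1:4), c("X1", "X2", "X3")))

print(dim(D))    # n rows and m columns
print(D[2, ])    # the second row: an m-tuple describing one record
print(D[, 3])    # the third column: an n-tuple holding one attribute
```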

Numeric attributes

Numerical data is convenient to deal with because it is quantitative and allows arbitrary calculations. Its properties are the same as those of integer or floating-point values.

Numeric attributes that take values from a finite or countably infinite set are called discrete, for example a human being's age, which is an integer value ranging from 1 to 150. Attributes that can take any real value are called continuous. There are two main kinds of numeric types:

  • Interval-scaled: This is a quantitative value measured on a scale of equal-sized units, such as temperature in degrees Celsius. Differences between values are meaningful, but there is no inherent zero-point.
  • Ratio-scaled: This value supports ratios between values in addition to differences between values. It is a numeric attribute with an inherent zero-point; hence, we can say a value is a multiple of another value. An example is the weight of a fish in metric units, such as grams or kilograms.

Categorical attributes

The values of categorical attributes come from a set-valued domain composed of a set of symbols, such as clothing sizes, which are categorized as {S, M, L}. Categorical attributes can be divided into two types:

  • Nominal: The values in this set are unordered and are not quantitative; only the equality operation makes sense here.
  • Ordinal: In contrast to the nominal type, the data has an ordered meaning here. The inequality operation is available here in addition to the equality operation.
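
R represents both kinds of categorical attribute with factors. In this small sketch the values are hypothetical: an unordered factor stands for a nominal attribute and an ordered factor for an ordinal one.

```r
# Nominal attribute: unordered categories, only == and != are meaningful.
color <- factor(c("red", "green", "red"))

# Ordinal attribute: ordered categories, so < and > are meaningful too.
size <- factor(c("M", "S", "L"), levels = c("S", "M", "L"), ordered = TRUE)

print(color[1] == color[3])   # equality works for both kinds
print(size[2] < size[1])      # ordering works only for the ordered factor
print(max(size))              # largest size present
```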

Data description

Basic descriptions can be used to identify features of the data and to distinguish noise or outliers. A couple of basic statistical descriptions are as follows:

  • Measures of central tendency: These measure the location of the middle or center of a data distribution: the mean, median, mode, midrange, and so on.
  • Measures of dispersion: These measure how spread out the data is: the range, quartiles, interquartile range, and so on.
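
These descriptions are mostly one-liners in base R. The sample below is hypothetical, and the 1.5 x IQR rule used to flag outliers is a common rule of thumb rather than something prescribed by this text.

```r
x <- c(2, 4, 4, 5, 7, 9, 30)      # hypothetical sample; 30 looks suspicious

# Central tendency
print(mean(x))
print(median(x))
mode_x <- as.numeric(names(which.max(table(x))))   # most frequent value
midrange <- (min(x) + max(x)) / 2

# Dispersion
print(range(x))
q <- unname(quantile(x, c(0.25, 0.75)))            # first and third quartiles
iqr <- IQR(x)

# Rule of thumb: flag values more than 1.5 * IQR beyond the quartiles.
outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
print(outliers)
```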

Data measuring

Data measuring is used in clustering, outlier detection, and classification. It refers to measures of proximity, that is, similarity and dissimilarity. The similarity between two tuples or data records is a real value ranging from 0 to 1; the higher the value, the greater the similarity between the tuples. Dissimilarity works in the opposite way; the higher the dissimilarity value, the more dissimilar the two tuples are.

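For a dataset, the data matrix stores the n data tuples in an n x m matrix (n tuples and m attributes):

$$\begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix}$$

The dissimilarity matrix stores the collection of proximities for all pairs of the n tuples in the dataset, often in an n x n matrix. In the following matrix, $d(i, j)$ means the dissimilarity between tuples i and j; the value is 0 when the two tuples are identical, close to 0 when they are highly similar or near each other, and the higher the value, the more dissimilar they are.

$$\begin{pmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{pmatrix}$$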

Dissimilarity and similarity are closely related concepts. A similarity measure can often be expressed as a function of a dissimilarity measure, and vice versa; for example, when d(i, j) is scaled to the range 0 to 1, the similarity can be taken as 1 - d(i, j).

Here is a table with a list of some of the most used measures for different attribute value types:

| Attribute value types | Dissimilarity |
| --- | --- |
| Nominal attributes | The dissimilarity between two tuples can be computed by the following equation: d(i, j) = (p - m)/p, where p is the dimension of the data and m is the number of attributes on which the two tuples are in the same state. |
| Ordinal attributes | The treatment of ordinal attributes is similar to that of numeric attributes, but the values need to be transformed before the methods are applied. |
| Interval-scaled | Euclidean, Manhattan, and Minkowski distances are used to calculate the dissimilarity of data tuples. |
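
The measures for nominal and numeric attributes can be sketched in base R. The helper nominal_dissim and the sample tuples are hypothetical; dist is the base R function for distance matrices, and it returns the lower triangle in the same layout as a dissimilarity matrix.

```r
# Nominal attributes: d(i, j) = (p - m) / p, where p is the number of
# attributes and m is the number of attributes on which i and j match.
nominal_dissim <- function(a, b) {
  p <- length(a)
  m <- sum(a == b)
  (p - m) / p
}
print(nominal_dissim(c("red", "S", "yes"), c("red", "M", "no")))  # 2 of 3 differ

# Numeric attributes: distance matrices for hypothetical 2-D tuples.
pts <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))
print(dist(pts, method = "euclidean"))
print(dist(pts, method = "manhattan"))
print(dist(pts, method = "minkowski", p = 3))
```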

Numeric attributes

Numerical data is convenient to deal with because it is quantitative and allows arbitrary calculations. It has the same properties as integer or floating-point values.

Numeric attributes that take values from a finite or countably infinite set are called discrete, for example a human being's age in years, which is an integer in a range such as 1 to 150. Attributes that can take any real value are called continuous. There are two main kinds of numeric types:

  • Interval-scaled: This is a quantitative value measured on a scale of equal-sized units, such as the weight of a certain fish measured in metric units such as grams or kilograms.
  • Ratio-scaled: Here, ratios between values are meaningful in addition to differences between values. It is a numeric attribute with an inherent zero-point; hence, we can say a value is a multiple of another value.

Categorical attributes

The values of categorical attributes come from a set-valued domain composed of a set of symbols, such as clothing sizes categorized as {S, M, L}. Categorical attributes can be divided into two groups or types:

  • Nominal: The values in this set are unordered and are not quantitative; only the equality operation makes sense here.
  • Ordinal: In contrast to the nominal type, the data has an ordered meaning here. The inequality operation is available here in addition to the equality operation.
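In R, both kinds of categorical attribute are commonly represented as factors. The sketch below uses made-up values and assumes the ordering S < M < L for the size example; it only illustrates which comparisons make sense for each type.

```r
# Nominal: an unordered factor; only equality tests are meaningful
color <- factor(c("red", "blue", "red"))
same_color <- color[1] == color[3]

# Ordinal: an ordered factor; inequality tests follow the declared level order
size <- factor(c("S", "L", "M"), levels = c("S", "M", "L"), ordered = TRUE)
smaller <- size[1] < size[2]   # compares S with L
largest <- max(size)

print(same_color)
print(smaller)
print(largest)
```

For an unordered factor, relational operators such as < are not meaningful; R returns NA with a warning in that case.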

Data description

Basic statistical descriptions can be used to identify properties of the data and to distinguish noise or outliers. A couple of basic statistical descriptions are as follows:

  • Measures of central tendency: These measure the location of the middle or center of a data distribution: the mean, median, mode, midrange, and so on.
  • Measures of dispersion: These describe the spread of the data: the range, quartiles, interquartile range, and so on.
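These descriptions map directly onto base R functions. The following sketch uses an arbitrary sample vector; stat_mode is a hypothetical helper, since base R has no built-in function for the statistical mode.

```r
x <- c(2, 4, 4, 5, 7, 9, 10, 12)

# Measures of central tendency
central <- c(
  mean     = mean(x),
  median   = median(x),
  midrange = (min(x) + max(x)) / 2
)

# Hypothetical helper: the most frequent value(s) in v
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

# Measures of dispersion
dispersion <- list(
  range     = range(x),
  quartiles = quantile(x, probs = c(0.25, 0.5, 0.75)),
  iqr       = IQR(x)
)

print(central)
print(stat_mode(x))
print(dispersion)
```

Note that quantile() supports several interpolation rules through its type argument, so quartiles computed by hand or by other tools may differ slightly from the default.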

Data cleaning

Data cleaning is one part of data quality. The aim of Data Quality (DQ) is to have the following:

  • Accuracy (data is recorded correctly)
  • Completeness (all relevant data is recorded)
  • Uniqueness (no duplicated data record)
  • Timeliness (the data is not old)
  • Consistency (the data is coherent)

Data cleaning attempts to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data. Data cleaning is usually an iterative two-step process consisting of discrepancy detection and data transformation.

In most situations, the data cleaning process contains two steps. They are as follows:

  • The first step is to audit the source dataset to find discrepancies.
  • The second step is to choose a transformation to fix each discrepancy (based on the accuracy of the attribute to be modified and the closeness of the new value to the original value). This is followed by applying the transformation to correct the discrepancy.

Missing values

While collecting data from all sorts of data sources, there are many cases in which some fields are left blank or contain a null value. Good data entry procedures should avoid or minimize the number of missing values or errors. Missing values and default values can be indistinguishable.

If some fields are missing a value, there are several solutions, each with different considerations and shortcomings, and each applicable within a certain context:

  • Ignore the tuple: By ignoring the tuple, you also lose the values that are present in its remaining attributes. This method is applicable when the tuple contains several attributes with missing values or when the percentage of missing values per attribute doesn't vary considerably.
  • Fill in the missing value manually: This is not practical for large datasets.
  • Use a global constant to fill the value: Filling every missing value with the same constant may mislead the mining process, and is not foolproof.
  • Use a measure of central tendency for the attribute to fill the missing value: The mean can be used for symmetric data distributions, while the median is the better choice for skewed ones.
  • Use the attribute mean or median for the class: Use the attribute mean or median of all samples belonging to the same class as the given tuple.
  • Use the most probable value to fill the missing value: The missing value can be filled with a value determined by regression or by an inference-based tool, such as Bayesian formalism or decision tree induction.

The most popular method is the last one; it is based on the present values and values from other attributes.
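Three of the simpler strategies from the list can be sketched in base R as follows. The data frame and its column names (class, income) are invented for illustration; regression-based or tree-based imputation would follow the same pattern but with a fitted model supplying the values.

```r
df <- data.frame(
  class  = c("a", "a", "a", "b", "b", "b"),
  income = c(10, NA, 14, 30, 34, NA)
)
missing <- is.na(df$income)

# Ignore the tuple: keep only rows without missing values
complete <- df[complete.cases(df), ]

# Central tendency of the whole attribute: fill with the overall median
filled_global <- df
filled_global$income[missing] <- median(df$income, na.rm = TRUE)

# Central tendency per class: fill with the mean of the tuple's own class
class_mean <- ave(df$income, df$class,
                  FUN = function(v) mean(v, na.rm = TRUE))
filled_class <- df
filled_class$income[missing] <- class_mean[missing]

print(complete)
print(filled_global)
print(filled_class)
```

The class-based fill gives different replacements for the two missing entries because each is taken from the tuples that share its class.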

Junk, noisy data, or outlier

As in a physics or statistics experiment, noise is a random error that occurs while the measured data is being collected. No matter what means you apply to the data gathering process, noise inevitably exists.

Approaches for data smoothing are listed here. As data mining research progresses, new methods keep emerging. Let's have a look at them:

  • Binning: This is a local smoothing method in which neighboring values are used to compute the final value for each bin. The sorted data is distributed into a number of bins, and each value in a bin is replaced by a value computed from the values in that bin, for example, the bin mean, the bin median, or the nearest bin boundary (the minimum or maximum value of that bin).
  • Regression: The goal of regression is to find the best-fitting line or curve, or something similar to one in a multidimensional space; as a result, the other values can be used to predict the value of the target attribute or variable. In this respect, it is a popular means of smoothing.
  • Classification or outlier analysis: A classifier is another natural way to find noise or outliers. During classification, most of the source data is grouped into a number of groups, except for the outliers.
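The binning approach can be sketched in a few lines of base R. The values below are arbitrary sample data, and equal-frequency bins of three values each are an assumption made for the example.

```r
x <- sort(c(4, 8, 15, 21, 21, 24, 25, 28, 34))

# Equal-frequency partitioning: 3 bins with 3 sorted values each
bins <- rep(1:3, each = 3)

# Smoothing by bin means and bin medians:
# every value is replaced by the mean (median) of its bin
by_mean   <- ave(x, bins, FUN = mean)
by_median <- ave(x, bins, FUN = median)

# Smoothing by bin boundaries:
# every value is replaced by the closer of its bin's minimum and maximum
by_boundary <- ave(x, bins, FUN = function(v) {
  ifelse(v - min(v) <= max(v) - v, min(v), max(v))
})

print(by_mean)
print(by_median)
print(by_boundary)
```

In this sketch, a value that is equally far from both boundaries is assigned to the lower one; that tie-breaking rule is a choice of the example, not something the text prescribes.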

Data integration

Data integration combines data from multiple sources to form a coherent data store. The common issues here are as follows:

  • Heterogeneous data: This has no common key
  • Different definition: This is intrinsic, that is, the same data is defined differently, such as in a different database schema
  • Time synchronization: This checks whether the data was gathered over the same time periods
  • Legacy data: This refers to data left over from an old system
  • Sociological factors: These limit what data can be gathered

There are several approaches that deal with the preceding issues:

  • Entity identification problem: Schema integration and object matching are tricky. This is referred to as the entity identification problem.
  • Redundancy and correlation analysis: Some redundancies can be detected by correlation analysis. Given two attributes, such an analysis can measure how strongly one attribute implies the other, based on the available data.
  • Tuple duplication: In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level.
  • Data value conflict detection and resolution: Attributes may differ in their level of abstraction, where an attribute in one system is recorded at a different abstraction level than in another.
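A small base R sketch can make three of these tasks concrete. The two data frames and all of their column names are invented for the example: the key is called cust_id in one source and customer_no in the other, and the order amount is deliberately stored twice in different units.

```r
customers <- data.frame(
  cust_id = c(1, 2, 3),
  city    = c("Oslo", "Rome", "Lima")
)
orders <- data.frame(
  customer_no  = c(1, 1, 2, 2, 3),
  amount_usd   = c(10, 10, 25, 40, 5),
  amount_cents = c(1000, 1000, 2500, 4000, 500)
)

# Entity identification: match the differently named key columns
merged <- merge(customers, orders, by.x = "cust_id", by.y = "customer_no")

# Tuple duplication: drop rows that exactly repeat an earlier row
deduped <- merged[!duplicated(merged), ]

# Redundancy: a correlation near 1 or -1 suggests that one attribute
# can be derived from the other
r <- cor(deduped$amount_usd, deduped$amount_cents)

print(deduped)
print(r)
```

Correlation only detects linear redundancy between numeric attributes; for nominal attributes a test such as chi-square (chisq.test() in R) is the usual counterpart.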

Data dimension reduction

Reduction of dimensionality is often necessary in the analysis of complex multivariate datasets, which are typically high-dimensional. Examples include problems modeled with a large number of variables and data mining tasks that involve the multidimensional analysis of qualitative data. There are also many methods for data dimension reduction for qualitative data.

The goal of dimensionality reduction is to replace a large matrix with two or more other matrices whose sizes are much smaller than the original, but from which the original can be approximately reconstructed, usually by taking their product, with only a minor loss of information.

Eigenvalues and Eigenvectors

An eigenvector of a matrix is a nonzero vector (v in the following equation) such that multiplying the matrix (A in the following equation) by it gives a constant multiple of the same vector. That constant (λ in the following equation) is the eigenvalue associated with this eigenvector. A matrix may have several eigenvectors.

$$
A v = \lambda v
$$

An eigenpair is an eigenvector together with its eigenvalue, that is, $(\lambda, v)$ in the preceding equation.
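The defining property can be checked numerically with the base R function eigen(). The 2 x 2 matrix below is an arbitrary symmetric example chosen so that the eigenvalues are easy to verify by hand.

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e <- eigen(A)

# First eigenpair: eigenvalue lambda and its eigenvector v
lambda <- e$values[1]
v <- e$vectors[, 1]

# A v and lambda v should agree up to floating-point error
residual <- A %*% v - lambda * v

print(e$values)
print(max(abs(residual)))
```

eigen() returns the eigenvalues in decreasing order and scales each eigenvector to unit length, so the first column of e$vectors is the principal eigenvector.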

Principal-Component Analysis

The Principal-Component Analysis (PCA) technique for dimensionality reduction views data that consists of a collection of points in a multidimensional space as a matrix, in which rows correspond to the points and columns to the dimensions.

The product of this matrix and its transpose has eigenpairs, and the principal eigenvector can be viewed as the direction in the space along which the points best line up. The second eigenvector represents the direction in which deviations from the principal eigenvector are the greatest.

Dimensionality reduction by PCA approximates the data by representing the matrix of points with a small number of its eigenvectors. This minimizes the root-mean-square error for the given number of columns in the representing matrix.
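As a rough illustration, the sketch below reduces synthetic two-dimensional points to one principal component with the base R function prcomp(). Two assumptions are worth flagging: the data is randomly generated for the example, and prcomp() centers the columns and works through an SVD of the centered data rather than forming the product of the matrix and its transpose explicitly, so its directions correspond to the eigenvectors of the covariance matrix.

```r
set.seed(1)
u <- rnorm(100)
pts <- cbind(x = u, y = 2 * u + rnorm(100, sd = 0.1))  # points near a line

pca <- prcomp(pts, center = TRUE, scale. = FALSE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep only the first k principal components
k <- 1
scores <- pca$x[, 1:k, drop = FALSE]
approx <- scores %*% t(pca$rotation[, 1:k, drop = FALSE])
approx <- sweep(approx, 2, pca$center, "+")   # undo the centering

rmse <- sqrt(mean((pts - approx)^2))

# The principal direction matches the top eigenvector of the
# covariance matrix, up to sign
ev <- eigen(cov(pts))

print(explained)
print(rmse)
```

Because the points lie close to a line, a single component already explains almost all of the variance and the reconstruction error is small.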

Singular-value decomposition

The singular-value decomposition (SVD) of a matrix consists of the following three matrices:

  • U
  • Σ
  • V

U and V are column-orthonormal; that is, as vectors, their columns are orthogonal and have length 1. Σ is a diagonal matrix, and the values along its diagonal are called singular values. The original matrix equals the product of U, Σ, and the transpose of V.

SVD is useful when there are a small number of concepts that connect the rows and columns of the original matrix.

In the full decomposition, the matrices U and V are typically as large as the original. To reduce dimensionality by using fewer columns for U and V, delete the columns corresponding to the smallest singular values from U, V, and Σ. This minimizes the error in reconstructing the original matrix from the modified U, Σ, and V.
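The truncation step can be sketched with the base R function svd(). The matrix M below is a made-up example with two underlying "concepts" (two blocks of proportional rows), so it has only two nonzero singular values.

```r
M <- matrix(c(1, 1, 1, 0, 0,
              3, 3, 3, 0, 0,
              4, 4, 4, 0, 0,
              5, 5, 5, 0, 0,
              0, 0, 0, 4, 4,
              0, 0, 0, 5, 5,
              0, 0, 0, 2, 2), ncol = 5, byrow = TRUE)

s <- svd(M)   # s$u, s$d (singular values), s$v

# The full product U Sigma V^T reproduces M
full <- s$u %*% diag(s$d) %*% t(s$v)

# Keep only the k largest singular values and the matching columns
truncate_svd <- function(s, k) {
  s$u[, 1:k, drop = FALSE] %*%
    diag(s$d[1:k], nrow = k) %*%
    t(s$v[, 1:k, drop = FALSE])
}

M2 <- truncate_svd(s, 2)   # rank-2 approximation
M1 <- truncate_svd(s, 1)   # rank-1 approximation

# Frobenius-norm reconstruction errors
err2 <- norm(M - M2, type = "F")
err1 <- norm(M - M1, type = "F")

print(round(s$d, 4))
print(c(err2 = err2, err1 = err1))
```

For this matrix, two components are enough to reconstruct M almost exactly, while dropping to one component gives an error equal to the size of the singular values that were discarded.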

CUR decomposition

The CUR decomposition seeks to decompose a sparse matrix into sparse, smaller matrices whose product approximates the original matrix.

The CUR decomposition chooses from a given sparse matrix a set of columns C and a set of rows R, which play the role of U and the transpose of V in SVD. The choice of rows and columns is made randomly, with a distribution that depends on the square root of the sum of the squares of the elements (the Frobenius norm). Between C and R is a square matrix called U, which is constructed from a pseudo-inverse of the intersection of the chosen rows and columns.

Tip

The CUR solution yields three component matrices C, U, and R, whose product approximates the original matrix M. For the R community, rCUR is an R package for CUR matrix decomposition.

Data transformation and discretization

As we know from the previous section, some data formats are better suited than others to specific data mining algorithms. Data transformation converts the original data into the preferred format for the input of a given data mining algorithm before processing.

Data transformation

Data transformation routines convert the data into appropriate forms for mining. They're shown as follows:

  • Smoothing: This uses binning, regression, and clustering to remove noise from the data
  • Attribute construction: In this routine, new attributes are constructed and added from the given set of attributes
  • Aggregation: Here, summary or aggregation operations are performed on the data
  • Normalization: Here, the attribute data is scaled so as to fall within a smaller range
  • Discretization: In this routine, the raw values of a numeric attribute are replaced by interval labels or conceptual labels
  • Concept hierarchy generation for nominal data: Here, attributes can be generalized to higher level concepts

Normalization data transformation methods

To avoid dependency on the choice of measurement units on data attributes, the data should be normalized. This means transforming or mapping the data to a smaller or common range. All attributes gain an equal weight after this process. There are many normalization methods. Let's have a look at some of them:

  • Min-max normalization: This preserves the relationships among the original data values and performs a linear transformation on the original data. It is applicable when the actual minimum and maximum values of an attribute are known.
  • z-score normalization: Here the values for an attribute are normalized based on the mean and standard deviation of that attribute. It is useful when the actual minimum and maximum of an attribute to be normalized are unknown.
  • Normalization by decimal scaling: This normalizes by moving the decimal point of the values of an attribute.
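The three methods can be compared on a small vector in base R. This is a minimal sketch; the values in x are hypothetical, and the z-score uses the sample standard deviation returned by sd().

```r
x <- c(-986, -120, 45, 310, 917)  # hypothetical attribute values

# Min-max normalization to the range [0, 1].
min_max <- (x - min(x)) / (max(x) - min(x))

# z-score normalization: subtract the mean, divide by the standard deviation.
z_score <- (x - mean(x)) / sd(x)

# Decimal scaling: divide by 10^j, where j is the smallest integer
# such that max(|x|) / 10^j < 1.
j <- floor(log10(max(abs(x)))) + 1
decimal_scaled <- x / 10^j
```

Here j is 3, so decimal scaling maps -986 to -0.986 and 917 to 0.917.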

Data discretization

Data discretization transforms numeric data by mapping values to interval or concept labels. Discretization techniques include the following:

  • Data discretization by binning: This is a top-down unsupervised splitting technique based on a specified number of bins.
  • Data discretization by histogram analysis: In this technique, a histogram partitions the values of an attribute into disjoint ranges called buckets or bins. It is also an unsupervised method.
  • Data discretization by cluster analysis: In this technique, a clustering algorithm can be applied to discretize a numerical attribute by partitioning the values of that attribute into clusters or groups.
  • Data discretization by decision tree analysis: Here, a decision tree employs a top-down splitting approach; it is a supervised method. To discretize a numeric attribute, the method selects the value of the attribute that has minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.
  • Data discretization by correlation analysis: This employs a bottom-up approach by finding the best neighboring intervals and then merging them to form larger intervals, recursively. It is a supervised method.
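As an illustration of the unsupervised binning technique, base R's cut() can produce both equal-width and equal-frequency bins. This is a minimal sketch; the vector x and the labels low, medium, and high are hypothetical.

```r
x <- c(4, 8, 15, 21, 21, 24, 25, 28, 34)  # hypothetical numeric attribute

# Equal-width binning: three intervals of the same width.
equal_width <- cut(x, breaks = 3)

# Equal-frequency binning: break points at the tertiles of x.
breaks <- quantile(x, probs = seq(0, 1, length.out = 4))
equal_freq <- cut(x, breaks = breaks, include.lowest = TRUE,
                  labels = c("low", "medium", "high"))

table(equal_width)  # number of values per equal-width interval
table(equal_freq)   # three values in each equal-frequency bin
```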

Visualization of results

Visualization is the graphic presentation of data: portrayals meant to reveal complex information at a glance. The term refers to all types of structured representation of information, including graphs, charts, diagrams, maps, storyboards, and other structured illustrations.

Good visualization of results gives you the chance to look at data through the eyes of experts. Good visualizations are beautiful not only for their aesthetic design, but also for the elegant layers of detail that efficiently generate insight and new understanding.

The result of every data mining algorithm can be visualized and clarified with appropriate graphics. Visualization plays an important role in the data mining process.

There are four major features that create the best visualizations:

  • Novel: It must not merely be a conduit for information, but should offer some novelty in the form of a new style of presenting information.
  • Informative: Attention to these factors and to the data itself makes a data visualization effective, successful, and beautiful.
  • Efficient: A good visualization has an explicit goal, a clearly defined message, or a special perspective on the information that it is made to convey. It must be as simple and straightforward as possible, but shouldn't lose out on necessary, relevant complexity. Irrelevant data serves as noise here. It should reflect the qualities of the data that it represents, and reveal properties and relationships inherent and implicit in the data source to bring new knowledge, insight, and enjoyment to the final user.
  • Aesthetic: The graphic must serve the primary goal of presenting information; this concerns not only the axes and layout, shapes, lines, and typography, but also the appropriate usage of these ingredients.

Visualization with R

R can produce publication-quality diagrams and plots. Some graphics facilities are distributed with R, and others are not part of the standard R installation. You can use R graphics from the command line.

The most important feature of the R graphics setup is the existence of two distinct graphics systems within R:

  • The traditional graphics system
  • Grid graphics system

The most appropriate facilities will be evaluated and applied to the visualization of every result of all algorithms listed in the book.

Functions in the graphics systems and add-on packages can be divided into several types:

  • High-level functions that produce complete plots
  • Low-level functions to add further output to an existing plot
  • Functions to work interactively with graphical output

    Note

    R graphics output can be produced in a wide range of graphical formats, such as PNG, JPEG, BMP, TIFF, SVG, PDF, and PS.
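The difference between high-level and low-level functions, and the use of an output format, can be sketched with the traditional graphics system. This is a minimal example, assuming the built-in cars dataset and a hypothetical file name in the session's temporary directory; pdf() is used here, but png(), jpeg(), and the other device functions follow the same open, draw, close pattern.

```r
out_file <- file.path(tempdir(), "scatter.pdf")  # hypothetical output path

pdf(out_file, width = 6, height = 4)            # open a PDF graphics device
plot(cars$speed, cars$dist,                     # high-level: a complete plot
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Stopping distance versus speed")
abline(lm(dist ~ speed, data = cars), lty = 2)  # low-level: add a fitted line
legend("topleft", legend = "Linear fit", lty = 2)
invisible(dev.off())                            # close the device, write the file
```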

To reinforce what you have learned in this chapter, here are some practice questions to check your understanding of the concepts.

Time for action

Let's now test what we've learned so far:

  • What is the difference between data mining and machine learning?
  • What is data preprocessing and data quality?
  • Download R and install R on your machine.
  • Compare and contrast data mining and machine learning.

Summary

In this chapter, we looked at the following topics:

  • An introduction to data mining and available data sources
  • A quick overview of R and the necessity to use R
  • A description of statistics and machine learning, and their relations to data mining
  • The two standard industrial data mining processes
  • Data attributes types and the data measurement approaches
  • The three important steps in data preprocessing
  • An introduction to the scalability and efficiency of data mining algorithms, and data visualization methods and necessities
  • A discussion on social network mining, text mining, and web data mining
  • A short introduction to RHadoop and MapReduce

In the following chapters, the reader will learn how to implement various data mining algorithms and manipulate data with R.

 

Chapter 2. Mining Frequent Patterns, Associations, and Correlations

In this chapter, we will learn how to mine frequent patterns, association rules, and correlation rules when working with R programs. Then, we will evaluate all these methods with benchmark data to determine the interestingness of the frequent patterns and rules. We will cover the following topics in this chapter:

  • Introduction to associations and patterns
  • Market basket analysis
  • Hybrid association rules mining
  • Mining sequence datasets
  • High-performance algorithms

The algorithms to find frequent items from various data types can be applied to numeric or categorical data. Most of these algorithms have one common basic algorithmic form, which is A-Priori, depending on certain circumstances. Another basic algorithm is FP-Growth, which is similar to A-Priori. Most pattern-related mining algorithms derive from these basic algorithms.

With frequent patterns found as one input, many algorithms are designed to find association and correlation rules. Each algorithm is only a variation from the basic algorithm.

As datasets from various domains grow in size and variety, new algorithms are designed, such as the multistage algorithm, the multihash algorithm, and the limited-pass algorithm.

An overview of associations and patterns

One popular task for data mining is to find relations among the source dataset; this is based on searching frequent patterns from various data sources, such as market baskets, graphs, and streams.

All the algorithms illustrated in this chapter are written from scratch in the R language for the purpose of explaining association analysis, and their output is checked against standard R packages for these algorithms, such as arules.

Patterns and pattern discovery

With many applications across a broad field, frequent pattern mining is often used in solving various problems, such as the market investigation for a shopping mall from the transaction data.

Frequent patterns are the ones that often occur in the source dataset. The dataset types for frequent pattern mining can be itemset, subsequence, or substructure. As a result, the frequent patterns found are known as:

  • Frequent itemset
  • Frequent subsequence
  • Frequent substructures

These three frequent patterns will be discussed in detail in the upcoming sections.

These newly found frequent patterns will serve as an important platform when searching for recurring interesting rules or relationships in the given dataset.

Various patterns are proposed to improve the efficiency of mining on a dataset. Some of them are as follows; they will be defined in detail later:

  • Closed patterns
  • Maximal patterns
  • Approximate patterns
  • Condensed patterns
  • Discriminative frequent patterns

The frequent itemset

The frequent itemset originated from true market basket analysis. In a store such as Amazon, there are many orders or transactions; a certain customer performs a transaction where their Amazon shopping cart includes some items. The mass result of all customers' transactions can be used by the storeowner to find out what items are purchased together by customers. As a simple definition, itemset denotes a collection of zero or more items.

We call a transaction a basket, and a set of items can belong to any basket. We set the variable s as the support threshold, which is compared with the number of baskets in which a certain set of items appears. If that number is not less than s, we call the itemset a frequent itemset.

An itemset is called a k-itemset if it contains k items, where k is a positive integer. The support count of an itemset X is $support\_count(X)$, the number of itemsets (transactions) in the given dataset that contain X.

For a predefined minimum support threshold s, the itemset X is a frequent itemset if $support\_count(X) \ge s$. The minimum support threshold s is a customizable parameter, which can be adjusted by domain experts or based on experience.
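The support count and the frequency test can be written in a few lines of base R. This is a minimal sketch; the baskets, the item names, and the function name support_count are hypothetical.

```r
# Hypothetical baskets; each element is one transaction.
baskets <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)

# Number of baskets that contain every item of the itemset.
support_count <- function(itemset, baskets) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

s <- 3  # minimum support threshold
count_diapers_beer <- support_count(c("diapers", "beer"), baskets)  # 3
count_milk_beer <- support_count(c("milk", "beer"), baskets)        # 2
is_frequent <- count_diapers_beer >= s                               # TRUE
```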

The frequent itemset is also used in many domains. Some of them are shown in the following table:

| Application | Items | Baskets | Comments |
|---|---|---|---|
| Related concepts | Words | Documents | |
| Plagiarism | Documents | Sentences | |
| Biomarkers | Biomarkers and diseases | The set of data about a patient | |

If an itemset is frequent, then all of its subsets must also be frequent. This is known as the A-Priori principle, the foundation of the A-Priori algorithm. The direct application of the A-Priori principle is to prune the huge number of candidate itemsets.

One important factor that affects the number of frequent itemsets is the minimum support count: the lower the minimum support count, the larger the number of frequent itemsets.

For the purpose of optimizing the frequent itemset-generation algorithm, some more concepts are proposed:

  • An itemset X is closed in dataset S if there is no proper superset Y of X such that $support\_count(Y) = support\_count(X)$; X is also called a closed itemset. If X is also frequent, then X is a closed frequent itemset.
  • An itemset X is a maximal frequent itemset if X is frequent and there is no itemset Y such that $X \subset Y$ and Y is frequent; in other words, X does not have frequent supersets.
  • An itemset X is considered a constrained frequent itemset if it is frequent and satisfies the user-specified constraints.
  • An itemset X is an approximate frequent itemset if only an approximate support count is derived for it during mining.
  • An itemset X is a top-k frequent itemset in the dataset S if X is among the k most frequent itemsets, given a user-defined value k.
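The closed and maximal definitions can be checked by brute force on a tiny dataset. This is a minimal sketch, feasible only for a handful of items because it enumerates every itemset; the transactions and the helper has_superset are hypothetical.

```r
transactions <- list(c("a", "b", "c"), c("a", "b", "c"),
                     c("a", "b"), c("a", "b"), "c")
min_sup <- 2
items <- sort(unique(unlist(transactions)))

# Enumerate every non-empty itemset over the items.
itemsets <- unlist(
  lapply(seq_along(items), function(k) combn(items, k, simplify = FALSE)),
  recursive = FALSE
)
support <- vapply(itemsets, function(x) {
  sum(vapply(transactions, function(t) all(x %in% t), logical(1)))
}, numeric(1))

# TRUE when itemset i has a proper superset j for which cond[j] holds.
has_superset <- function(i, cond) {
  any(vapply(seq_along(itemsets), function(j) {
    cond[j] && length(itemsets[[j]]) > length(itemsets[[i]]) &&
      all(itemsets[[i]] %in% itemsets[[j]])
  }, logical(1)))
}

idx <- seq_along(itemsets)
frequent <- support >= min_sup
closed <- !vapply(idx, function(i) has_superset(i, support == support[i]), logical(1))
maximal <- frequent & !vapply(idx, function(i) has_superset(i, frequent), logical(1))

labels <- vapply(itemsets, paste, character(1), collapse = ",")
closed_frequent <- labels[frequent & closed]  # "c" "a,b" "a,b,c"
maximal_frequent <- labels[maximal]           # "a,b,c"
```

In this dataset {a} is frequent but not closed, because its superset {a,b} has the same support count of 4.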

The following example is of a transaction dataset. All itemsets contain only items from one fixed set of items (the set itself was a formula image and is not recoverable). Let's assume that the minimum support count is 3.

[Table: transactions T001 to T010, with the columns "tid (transaction id)" and "List of items in the itemset or transaction"; the item lists were formula images and are not recoverable.]

Then, we will get the frequent itemsets (the resulting itemsets were formula images and are not recoverable).

The frequent subsequence

The frequent sequence is an ordered list of elements where each element contains at least one event. An example of this is the page-visit sequence on a site, that is, the order in which a certain user visits web pages. Here are two examples of the frequent subsequence:

  • Customer: The successive shopping records of a certain customer in a shopping mart serve as the sequence, each item bought serves as the event item, and all the items bought by the customer in one shopping trip are treated as an element or transaction
  • Web usage data: A user's visit history on the WWW is treated as a sequence, each UI/page serves as the event or item, and the element or transaction can be defined as the pages visited by the user with one click of the mouse

The length of a sequence is defined by the number of items contained in the sequence. A sequence of length k is called a k-sequence. The size of a sequence is defined by the number of itemsets in the sequence. We call a sequence $s = \langle a_1 a_2 \ldots a_n \rangle$ a subsequence of the sequence $t = \langle b_1 b_2 \ldots b_m \rangle$, or t a supersequence of s, when there exist integers $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots, a_n \subseteq b_{i_n}$.
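The subsequence relation can be tested with a short greedy scan. This is a minimal sketch under the usual definition, in which each element of the shorter sequence must be contained in a distinct, later element of the longer one; the function name is_subsequence and the page names are hypothetical.

```r
# TRUE if 'small' is a subsequence of 'large'. Both are ordered lists of
# itemsets (character vectors).
is_subsequence <- function(small, large) {
  pos <- 1
  for (element in small) {
    # Advance to the earliest remaining element of 'large' containing 'element'.
    while (pos <= length(large) && !all(element %in% large[[pos]])) {
      pos <- pos + 1
    }
    if (pos > length(large)) return(FALSE)
    pos <- pos + 1
  }
  TRUE
}

visits <- list(c("home", "news"), "search", c("cart", "checkout"))

seq_size <- length(visits)          # size: number of itemsets (3)
seq_length <- sum(lengths(visits))  # length: number of items (5)

in_order <- is_subsequence(list("home", "cart"), visits)      # TRUE
out_of_order <- is_subsequence(list("cart", "home"), visits)  # FALSE
```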

The frequent substructures

In some domains, the tasks under research can be modeled with graph theory. As a result, there are requirements for mining common subgraphs (subtrees or sublattices); some examples are as follows:

  • Web mining: Web pages are treated as the vertices of graph, links between pages serve as edges, and a user's page-visiting records construct the graph.
  • Network computing: Any device with computation ability on the network serves as the vertex, and the interconnection between these devices serves as the edge. The whole network that is made up of these devices and interconnections is treated as a graph.
  • Semantic web: XML elements serve as the vertices, and the parent/child relations between them are edges; all these XML files are treated as graphs.

A graph G is represented by G = (V, E), where V represents a group of vertices, and E represents a group of edges. A graph $G' = (V', E')$ is called a subgraph of graph G = (V, E) once $V' \subseteq V$ and $E' \subseteq E$. Here is an example of a subgraph. There is the original graph with vertices and edges on the left-hand side of the following figure and the subgraph on the right-hand side with some edges omitted (or omission of vertices in other circumstances):

[Figure: the original graph (left) and a subgraph of it (right)]
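The subgraph condition, that both the vertices and the edges of the smaller graph are contained in the larger graph, can be checked directly. This is a minimal sketch that assumes undirected edges and represents a graph as a list with a vertex vector V and a two-column edge matrix E; the names edge_keys and is_subgraph and the example graphs are hypothetical.

```r
# Turn each edge (one row of E) into an order-independent key such as "A-B".
edge_keys <- function(E) {
  apply(E, 1, function(e) paste(sort(e), collapse = "-"))
}

# TRUE if every vertex and every edge of 'sub' also occurs in 'g'.
is_subgraph <- function(sub, g) {
  all(sub$V %in% g$V) && all(edge_keys(sub$E) %in% edge_keys(g$E))
}

G <- list(V = c("A", "B", "C", "D"),
          E = rbind(c("A", "B"), c("B", "C"), c("C", "D"), c("D", "A")))
G_sub <- list(V = c("A", "B", "C"),
              E = rbind(c("B", "A"), c("B", "C")))
not_sub <- list(V = c("A", "C"), E = rbind(c("A", "C")))

sub_ok <- is_subgraph(G_sub, G)     # TRUE
sub_bad <- is_subgraph(not_sub, G)  # FALSE: the edge A-C is not in G
```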

Relationship or rules discovery

Mining of association rules is based on the frequent patterns found. Different emphases on the interestingness of relations give rise to two types of relations for further research: association rules and correlation rules.

Association rules

In a later section, a method to show association analysis is illustrated; this is a useful method to discover interesting relationships within a huge dataset. The relations can be represented in the form of association rules or frequent itemsets.

Association rule mining is to find the result rule set on a given dataset (the transaction dataset or another sequence-pattern-type dataset), given a predefined minimum support count s and a predefined minimum confidence c, such that any found rule $X \rightarrow Y$ satisfies $support(X \rightarrow Y) \ge s$ and $confidence(X \rightarrow Y) \ge c$.

$X \rightarrow Y$ is an association rule where X and Y are itemsets; X and Y are disjoint. The interestingness of this rule is measured by its support and confidence. Support means the frequency with which this rule appears in the dataset, and confidence means the probability of the appearance of Y when X is present.

For association rules, the key measures of rule interestingness are rule support and confidence. Their relationship is given as follows:

$$confidence(X \rightarrow Y) = P(Y \mid X) = \frac{support\_count(X \cup Y)}{support\_count(X)}$$

support_count(X) is the number of itemsets in the dataset that contain X.

By convention, the confidence value and the support value are represented as percentages between 0 and 100.

The association rule $X \rightarrow Y$ is strong once $support(X \rightarrow Y) \ge s$ and $confidence(X \rightarrow Y) \ge c$. Here, s is the predefined minimum support threshold, and c is the predefined minimum confidence threshold.
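Support and confidence of a single rule can be computed directly from support counts. This is a minimal sketch; the baskets, the rule diapers → beer, and the thresholds are hypothetical, and both measures are expressed as percentages.

```r
# Hypothetical baskets; each element is one transaction.
baskets <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)
n <- length(baskets)

support_count <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

X <- "diapers"
Y <- "beer"

# support(X -> Y): share of baskets that contain both X and Y.
support <- 100 * support_count(c(X, Y)) / n                     # 60
# confidence(X -> Y): share of the baskets containing X that also contain Y.
confidence <- 100 * support_count(c(X, Y)) / support_count(X)   # 75

s <- 50      # minimum support threshold, in percent
c_min <- 70  # minimum confidence threshold, in percent
is_strong <- support >= s && confidence >= c_min                 # TRUE
```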

The meaning of the found association rules should be explained with caution, especially when there is not enough evidence to judge whether a rule implies causality. A rule only shows the co-occurrence of its prefix and postfix. The following are the different kinds of rules you can come across:

  • A rule is a Boolean association rule if it concerns associations between the presence of items
  • A rule is a single-dimensional association rule if there is, at the most, only one dimension referred to in the rule
  • A rule is a multidimensional association rule if there are at least two dimensions referred to in the rule
  • A rule is a correlation-association rule if the relations or rules are measured by statistical correlation, which, once passed, leads to a correlation rule
  • A rule is a quantitative-association rule if at least one item or attribute contained in it is quantitative

Correlation rules

In some situations, the support and confidence pairs are not sufficient to filter uninteresting association rules. In such a case, we will use support count, confidence, and correlations to filter association rules.

There are many methods to calculate the correlation of an association rule, such as $\chi^2$ analysis, all-confidence analysis, and cosine. For a k-itemset $X = \{i_1, i_2, \ldots, i_k\}$, define the all-confidence value of X as:

$$all\_conf(X) = \frac{support\_count(X)}{\max\{support\_count(i_j) \mid i_j \in X\}}$$
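Two of these measures are easy to compute from support counts. This is a minimal sketch on hypothetical baskets: all-confidence divides the support count of the itemset by the largest support count among its single items, and cosine divides the support count of the union of A and B by the square root of the product of the two individual support counts. The function names are illustrative.

```r
# Hypothetical baskets; each element is one transaction.
baskets <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)

support_count <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# all_conf(X) = support_count(X) / max over items i in X of support_count(i)
all_confidence <- function(itemset) {
  support_count(itemset) / max(vapply(itemset, support_count, numeric(1)))
}

# cosine(A, B) = support_count(A and B together) / sqrt(count(A) * count(B))
cosine <- function(a, b) {
  support_count(c(a, b)) / sqrt(support_count(a) * support_count(b))
}

ac <- all_confidence(c("diapers", "beer"))  # 3 / 4 = 0.75
cs <- cosine("diapers", "beer")             # 3 / sqrt(4 * 3)
```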

Market basket analysis

Market basket analysis is the methodology used to mine a shopping cart of items bought or just those kept in the cart by customers. The concept is applicable to a variety of applications, especially for store operations. The source dataset is a massive data record. The aim of market basket analysis is to find the association rules between the items within the source dataset.

The market basket model

The market basket model is a model that illustrates the relation between a basket and its associated items. Many tasks from different areas of research have this relation in common. To summarize them all, the market basket model is suggested as the most typical example to be researched.

The basket is also known as the transaction; it contains an itemset, that is, a set of items that belong to the same transaction.

The A-Priori algorithm is a level-wise itemset mining algorithm. In contrast to A-Priori, the Eclat algorithm is an itemset mining algorithm based on tidset intersection. FP-growth is a frequent pattern tree algorithm. A tidset denotes a collection of zero or more IDs of transaction records.

A-Priori algorithms

As a common strategy to design algorithms, the problem is divided into two subproblems:

  • The frequent itemset generation
  • Rule generation

The strategy dramatically decreases the search space for association mining algorithms.

Input data characteristics and data structure

As the input of the A-Priori algorithm, the original input itemset is binarized, that is, 1 represents the presence of a certain item in the itemset; otherwise, it is 0. As a default assumption, the average size of the itemset is small. The popular preprocessing method is to map each unique available item in the input dataset to a unique integer ID.
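The binarized representation and the item-to-ID mapping can be sketched in base R. This is a minimal illustration; the transactions, transaction IDs, and item names are hypothetical.

```r
# Hypothetical transactions, named by transaction ID.
baskets <- list(
  T001 = c("bread", "milk"),
  T002 = c("bread", "beer"),
  T003 = c("milk", "beer", "bread")
)

# Map each unique item to an integer ID.
items <- sort(unique(unlist(baskets)))
item_id <- setNames(seq_along(items), items)  # beer = 1, bread = 2, milk = 3

# One row per transaction, one column per item; 1 marks presence, 0 absence.
binary <- t(vapply(baskets, function(b) as.integer(items %in% b),
                   integer(length(items))))
colnames(binary) <- items

item_counts <- colSums(binary)  # support count of each single item
```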

The itemsets are usually stored within databases or files and will be read in several passes. To control the efficiency of the algorithm, we need to control the number of passes. During each pass over the data, a representation of every itemset of interest must be kept so that it can be counted and stored for later use by the algorithm.

There is a monotonicity feature in the itemsets under research; this implies that every subset of a frequent itemset is frequent. This characteristic is used to prune the search space for the frequent itemset in the process of the A-Priori algorithm. It also helps compact the information related to the frequent itemset. This feature gives us an intrinsic view that focuses on smaller-sized frequent itemsets. For example, there are three frequent 2-itemsets contained by one certain frequent 3-itemset.

Tip

A k-itemset means an itemset containing k items.

The basket is in a format called the horizontal format and contains a basket or transaction ID and a number of items; it is used as the basic input format for the A-Priori algorithm. In contrast, there is another format known as the vertical format; this uses an item ID and a series of the transaction IDs. The algorithm that works on vertical data format is left as an exercise for you.
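Converting between the two formats takes only a couple of lines. This is a minimal sketch; the transactions and the variable names horizontal and vertical are hypothetical.

```r
# Horizontal format: transaction ID -> items.
horizontal <- list(
  T001 = c("bread", "milk"),
  T002 = c("bread", "beer"),
  T003 = c("beer", "bread", "milk")
)

# Repeat each transaction ID once per item it contains, then group the
# IDs by item to obtain the vertical format: item -> transaction IDs.
tids <- rep(names(horizontal), lengths(horizontal))
vertical <- split(tids, unlist(horizontal, use.names = FALSE))

vertical$beer   # "T002" "T003"
vertical$bread  # "T001" "T002" "T003"
```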

The A-Priori algorithm

Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.

Note

One important assumption is that the items within any itemset are in a lexicographic order.

  • Join action: Given that $L_k$ is the set of frequent k-itemsets, a set of candidates to find $L_{k+1}$ is generated. Let's call it $C_{k+1}$. Two itemsets in $L_k$ are joined when their first k-1 items are identical.
    $C_{k+1} = L_k \bowtie L_k = \{l_1 \cup l_2 \mid l_1, l_2 \in L_k,\ l_1[1..k-1] = l_2[1..k-1],\ l_1[k] < l_2[k]\}$
  • Prune action: $L_{k+1} \subseteq C_{k+1}$, and the size of $C_{k+1}$, the candidate itemset, is usually much bigger than that of $L_{k+1}$. To save computation cost, the monotonicity characteristic of frequent itemsets is used here to prune $C_{k+1}$.
    $\forall c \in C_{k+1}$: if some k-subset of c is not in $L_k$, remove c from $C_{k+1}$
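The join and prune actions can be sketched as one candidate-generation function. This is an illustration under the lexicographic-order assumption stated in the note above, not the AprioriGen helper used later in this chapter; the name apriori_gen and the example itemsets are hypothetical, and the input list is assumed to be non-empty.

```r
# Candidate generation for one level (join, then prune).
# L_k: non-empty list of frequent k-itemsets, each a sorted character vector.
apriori_gen <- function(L_k) {
  k <- length(L_k[[1]])
  key <- function(x) paste(x, collapse = ",")
  frequent_keys <- vapply(L_k, key, character(1))
  candidates <- list()
  for (a in L_k) for (b in L_k) {
    # Join: same first k-1 items, and the last item of a precedes that of b.
    if (identical(a[-k], b[-k]) && a[k] < b[k]) {
      cand <- c(a, b[k])
      # Prune: every k-subset of the candidate must itself be frequent.
      subsets <- combn(cand, k, simplify = FALSE)
      if (all(vapply(subsets, key, character(1)) %in% frequent_keys)) {
        candidates[[length(candidates) + 1]] <- cand
      }
    }
  }
  candidates
}

L2 <- list(c("a", "b"), c("a", "c"), c("a", "d"), c("b", "c"))
C3 <- apriori_gen(L2)
```

The join produces {a,b,c}, {a,b,d}, and {a,c,d}; the prune step keeps only {a,b,c}, because {b,d} and {c,d} are not frequent.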

Here is the pseudocode to find all the frequent itemsets:

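  L_1 = {frequent 1-itemsets in D}
  for (k = 1; L_k is not empty; k = k + 1):
      C_{k+1} = candidates generated from L_k by join and prune
      for each transaction t in D:
          for each candidate c in C_{k+1} that is contained in t:
              increase the support count of c by 1
      L_{k+1} = {c in C_{k+1} : support count of c >= MIN_SUP}
  return the union of all L_k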

The R implementation

The R code of the A-Priori frequent itemset generation algorithm is given here. D is a transaction dataset, and MIN_SUP is the minimum support count threshold. The output of the algorithm is L, the set of frequent itemsets in D.

The output of the A-Priori function can be verified with the R add-on package arules, which is a pattern-mining and association-rules-mining package that includes the A-Priori and Eclat algorithms. Here is the R code:

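# A-Priori frequent itemset generation.
# data: the transaction dataset; I: the set of all items;
# MIN_SUP: the minimum support count threshold.
# The helper functions called here are assumed to be defined elsewhere.
Apriori <- function (data, I, MIN_SUP, parameter = NULL){
  f <- CreateItemsets()                          # f[[k]]: candidate k-itemsets
  c <- FindFrequentItemset(data, I, 1, MIN_SUP)  # c[[k]]: frequent k-itemsets
  k <- 2
  len4data <- GetDatasetSize(data)
  while( !IsEmpty(c[[k-1]]) ){
    f[[k]] <- AprioriGen(c[[k-1]])               # join and prune
    for( idx in seq_len(len4data) ){
      # candidates that are contained in transaction idx
      ft <- GetSubSet(f[[k]], data[[idx]])
      len4ft <- GetDatasetSize(ft)
      for( jdx in seq_len(len4ft) ){
        IncreaseSupportCount(f[[k]], ft[jdx])
      }
    }
    c[[k]] <- FindFrequentItemset(f[[k]], I, k, MIN_SUP)
    k <- k+1
  }
  c
}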

To verify the R code, its output is compared with the output of the arules package.

Tip

arules (Hahsler et al., 2011) provides support for mining frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules. Both the A-Priori and Eclat algorithms are available. In addition, cSPADE can be found in arulesSequences, an add-on for arules.

Given: [formula image not recoverable]

At first, we will sort D into an ordered list, either by a predefined ordering or simply by the natural order of characters, which is used here. Then: [formula image not recoverable]

Let's assume that the minimum support count is 5; the following table is an input dataset:

[Table: transactions T001 to T010, with the columns "tid (transaction id)" and "List of items in the itemset or transaction"; the item lists were formula images and are not recoverable.]

In the first scan or pass of the dataset D, get the count of each candidate itemset in $C_1$. The candidate itemsets and their related counts are as follows:

| Itemset | Support count |
|---|---|
| [not recoverable] | 6 |
| [not recoverable] | 8 |
| [not recoverable] | 2 |
| [not recoverable] | 5 |
| [not recoverable] | 2 |
| [not recoverable] | 3 |
| [not recoverable] | 3 |

We will get $L_1$ after comparing the support count with the minimum support count.

| Itemset | Support count |
|---|---|
| [not recoverable] | 6 |
| [not recoverable] | 8 |
| [not recoverable] | 5 |

We will generate $C_2$ by joining $L_1$ with itself, $C_2 = L_1 \bowtie L_1$.

| Itemset | Support count |
|---|---|
| [not recoverable] | 4 |
| [not recoverable] | 3 |
| [not recoverable] | 4 |

After comparing the support count with the minimum support count, we will get $L_2 = \emptyset$. The algorithm then terminates.

A-Priori algorithm variants

Variants of the A-Priori algorithm are designed mainly to improve efficiency and scalability. Some of these improvements are discussed in the upcoming sections.

The Eclat algorithm

The A-Priori algorithm makes roughly as many passes over the dataset as the length of the longest frequent pattern. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm works on the vertical data format, that is, it uses <item id, tid set> pairs instead of <tid, item id set> pairs to discover frequent itemsets in a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1)-itemsets from frequent k-itemsets.

Candidate itemsets are generated by set intersection. The vertical format structure is called a tidset, as defined earlier: the set of IDs of all transactions that contain the item I is the tidset of I.

The support count is computed by intersecting tidsets. Given two tidsets, X and Y, the support count of the merged itemset is the cardinality of X ∩ Y.

[Figure: pseudocode of the Eclat algorithm, with its initial values]

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat <- function (p,f,MIN_SUP){
  #p: frequent tidsets sharing a prefix, f: frequent itemsets found so far
  #helper functions such as MergeTidSets are assumed to be defined elsewhere
  len4tidsets <- length(p)
  for(idx in seq_len(len4tidsets)){
     AddFrequentItemset(f,p[[idx]],GetSupport(p[[idx]]))
     #pa collects the frequent extensions of p[[idx]]
     pa <- GetFrequentTidSets(NULL,MIN_SUP)
     for(jdx in idx:len4tidsets){
        if(ItemCompare(p[[jdx]],p[[idx]]) > 0){
           #intersect the two tidsets to get the tidset of the merged itemset
           xab <- MergeTidSets(p[[idx]],p[[jdx]])
           if(GetSupport(xab)>=MIN_SUP){
              AddFrequentItemset(pa,xab,GetSupport(xab))
           }
        }
     }
     if(!IsEmptyTidSets(pa)){
        Eclat(pa,f,MIN_SUP)
     }
  }
}

Here is the running result of one example, with I = {beer, chips, pizza, wine}. The transaction dataset is shown in the horizontal and vertical formats, respectively, in the following tables:

tid | X
1   | {beer, chips, wine}
2   | {beer, chips}
3   | {pizza, wine}
4   | {chips, pizza}

x     | tidset
beer  | {1,2}
chips | {1,2,4}
pizza | {3,4}
wine  | {1,3}
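The conversion from the horizontal to the vertical format can be sketched in a few lines of base R. This is a minimal illustration and not the book's code; the variable names (transactions, tids, items, tidsets) are chosen here for the example.

```r
# Horizontal format: one character vector of items per transaction (tid 1 to 4)
transactions <- list(
  c("beer", "chips", "wine"),
  c("beer", "chips"),
  c("pizza", "wine"),
  c("chips", "pizza")
)

# Flatten into two parallel vectors: a transaction ID for every item occurrence
tids  <- rep(seq_along(transactions), lengths(transactions))
items <- unlist(transactions)

# Vertical format: group the transaction IDs by item, giving one tidset per item
tidsets <- split(tids, items)
print(tidsets)
```

Each element of tidsets is named after an item and holds the IDs of the transactions that contain it.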

The binary format of this information is in the following table:

tid | beer | chips | pizza | wine
1   | 1    | 1     | 0     | 1
2   | 1    | 1     | 0     | 0
3   | 0    | 0     | 1     | 1
4   | 0    | 1     | 1     | 0

Before calling the Eclat algorithm, we set MIN_SUP=2, f to the empty set, and p to the frequent 1-itemsets with their tidsets, {<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>}.

The running process is illustrated in the following figure. After two iterations, we get the frequent tidsets {<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>, <{beer, chips}, 12>}:

[Figure: running process of the Eclat algorithm on the example dataset]
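The run on this example can be reproduced with a self-contained base R sketch. It is not the book's implementation: the function name eclat_mine, the use of comma-joined names to label itemsets, and returning the result instead of updating f in place are all assumptions made for illustration.

```r
# Frequent 1-itemsets with their tidsets (vertical format of the example)
tidsets <- list(beer = c(1, 2), chips = c(1, 2, 4), pizza = c(3, 4), wine = c(1, 3))

# p: named list of tidsets; each name is a comma-joined itemset
# found: frequent itemsets collected so far (itemset name -> tidset)
eclat_mine <- function(p, min_sup, found = list()) {
  for (i in seq_along(p)) {
    found[[names(p)[i]]] <- p[[i]]
    pa <- list()  # frequent extensions of p[[i]]
    if (i < length(p)) {
      for (j in (i + 1):length(p)) {
        tids <- intersect(p[[i]], p[[j]])
        if (length(tids) >= min_sup) {
          merged <- union(strsplit(names(p)[i], ",")[[1]],
                          strsplit(names(p)[j], ",")[[1]])
          pa[[paste(merged, collapse = ",")]] <- tids
        }
      }
    }
    # Recurse into the extensions that share the prefix p[[i]]
    if (length(pa) > 0) found <- eclat_mine(pa, min_sup, found)
  }
  found
}

frequent <- eclat_mine(tidsets, min_sup = 2)
print(frequent)
```

The result contains the four single items and the pair beer,chips, each with its tidset, which matches the frequent tidsets listed above.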

The output of the Eclat function can be verified with the R add-on package, arules.
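A check with arules might look like the following sketch. It assumes that the arules package is installed and is skipped otherwise; the variable names are chosen here. Note that arules expects a relative support, so a minimum support count of 2 out of 4 transactions is written as 0.5.

```r
baskets <- list(
  c("beer", "chips", "wine"),
  c("beer", "chips"),
  c("pizza", "wine"),
  c("chips", "pizza")
)

if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  # Coerce the list of baskets to the transactions class used by arules
  trans <- as(baskets, "transactions")
  # Mine frequent itemsets with Eclat; support is relative (2 of 4 = 0.5)
  itemsets <- eclat(trans, parameter = list(support = 0.5))
  inspect(sort(itemsets, by = "support"))
}
```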

The FP-growth algorithm

The FP-growth algorithm is an efficient method for mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that no candidate itemsets are generated. A pattern-growth strategy is used instead, with the FP-tree as the underlying data structure.

Input data characteristics and data structure

The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is an example of how the structure is generated, with I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the upcoming images. Each node in the FP-tree represents an item, and the path from the root to that node, that is, the node list, represents an itemset. Each node stores both the item and the support information of this itemset.

tid | X
1   | {A, B, C, D, E}
2   | {A, B, C, E}
3   | {A, D, E}
4   | {B, E, D}
5   | {B, E, C}
6   | {E, C, D}
7   | {E, D}

The sorted item order is listed in the following table:

item          | E | D | C | B | A
support_count | 7 | 5 | 4 | 4 | 3

Reorder the items of each transaction by this decreasing order to get the new sorted transaction dataset, as shown in this table:

tid | X
1   | {E, D, C, B, A}
2   | {E, C, B, A}
3   | {E, D, A}
4   | {E, D, B}
5   | {E, C, B}
6   | {E, D, C}
7   | {E, D}
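The counting and reordering steps can be sketched in base R as follows. The variable names are chosen here, and breaking ties in reverse alphabetical order is an assumption made only so that C precedes B, as in the tables above.

```r
transactions <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)
MIN_SUP <- 3

# Support count of every item that occurs in the dataset
counts <- table(unlist(transactions))

# Drop infrequent items (none are dropped here; F never occurs)
counts <- counts[counts >= MIN_SUP]

# Decreasing support count; ties broken in reverse alphabetical order (assumption)
item_order <- names(counts)[order(as.integer(counts), names(counts), decreasing = TRUE)]

# Rewrite each transaction so that its items follow item_order
sorted_transactions <- lapply(transactions, function(t) item_order[item_order %in% t])

print(item_order)
print(sorted_transactions)
```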

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, sort the items in decreasing order of support count. Next, remove the infrequent items from the sorted list. Then, reorder the items of each transaction in the original dataset by this order.
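Building the tree from the sorted transactions can be sketched as follows. This is an illustrative base R version, not the book's code: representing nodes as environments and the helper names new_node, insert_transaction, and print_tree are assumptions.

```r
# Sorted transactions copied from the table above
sorted_transactions <- list(
  c("E", "D", "C", "B", "A"), c("E", "C", "B", "A"), c("E", "D", "A"),
  c("E", "D", "B"), c("E", "C", "B"), c("E", "D", "C"), c("E", "D")
)

# A node is an environment so that it can be updated in place
new_node <- function(item) {
  node <- new.env()
  node$item <- item
  node$count <- 0L
  node$children <- list()
  node
}

# Walk down from the root, creating missing nodes and incrementing the
# count of every node on the path
insert_transaction <- function(root, items) {
  node <- root
  for (item in items) {
    child <- node$children[[item]]
    if (is.null(child)) {
      child <- new_node(item)
      node$children[[item]] <- child
    }
    child$count <- child$count + 1L
    node <- child
  }
  invisible(root)
}

root <- new_node(NA_character_)
for (tx in sorted_transactions) insert_transaction(root, tx)

# Print the tree, one node per line, indented by depth
print_tree <- function(node, depth = 0) {
  for (child in node$children) {
    cat(strrep("  ", depth), child$item, ":", child$count, "\n", sep = "")
    print_tree(child, depth + 1)
  }
}
print_tree(root)
```

Because every sorted transaction starts with E, the root has a single child E whose count equals the number of transactions.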

Given MIN_SUP=3, the itemsets are processed according to this logic:

[Figure: FP-tree construction from the sorted transactions]

The results after performing steps 4 and 7 are shown here; the process of the algorithm is simple and straightforward:

[Figure: the FP-tree after steps 4 and 7]

A header table is usually bound together with the frequent pattern tree. Each record of the header table stores a link to the nodes of a specific item.

[Figure: the FP-tree with its header table]

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent patterns or itemsets. In this example, items are removed from the frequent pattern tree in reverse order, that is, starting from the leaves; the order is therefore A, B, C, D, and E. Using this order, we then build the projected FP-tree for each item.
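The paths that feed the projected FP-tree of an item can be collected directly from the sorted transactions. The sketch below is an illustration in base R; the function name prefix_paths and the comma-joined path labels are assumptions, and it reads the transactions rather than an actual tree.

```r
sorted_transactions <- list(
  c("E", "D", "C", "B", "A"), c("E", "C", "B", "A"), c("E", "D", "A"),
  c("E", "D", "B"), c("E", "C", "B"), c("E", "D", "C"), c("E", "D")
)

# For every transaction containing `item`, keep the items that precede it;
# identical prefixes are counted together
prefix_paths <- function(item, transactions) {
  hits <- Filter(function(t) item %in% t, transactions)
  paths <- vapply(
    hits,
    function(t) paste(t[seq_len(match(item, t) - 1)], collapse = ","),
    character(1)
  )
  table(paths)
}

print(prefix_paths("A", sorted_transactions))
print(prefix_paths("B", sorted_transactions))
```

For item A, the three prefix paths each occur once; for item B, the path E,C occurs twice.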

The FP-growth algorithm

Here is the pseudocode with its recursive definition:

[Figure: pseudocode of the FP-growth algorithm, with its initial input values]

The R implementation

Here is the R source code of the main FP-growth algorithm:

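FPGrowth <- function (r,p,f,MIN_SUP){
    #r: (projected) FP-tree, p: current prefix itemset, f: frequent itemsets found so far
    #helper functions such as IsPath and MergeSet are assumed to be defined elsewhere
    RemoveInfrequentItems(r)
    if(IsPath(r)){
       #a single path: every subset of its items, merged with p, is frequent
       y <- GetSubset(r)
       len4y <- GetLength(y)
       for(idx in seq_len(len4y)){
          x <- MergeSet(p,y[idx])
          SetSupportCount(x, GetMinCnt(x))
          Add2Set(f,x,GetSupportCount(x))
       }
    }else{
       len4r <- GetLength(r)
       for(idx in seq_len(len4r)){
          x <- MergeSet(p,r[idx])
          SetSupportCount(x, GetSupportCount(r[idx]))
          #record x itself before growing it further
          Add2Set(f,x,GetSupportCount(x))
          #build the projected FP-tree of x from the paths that lead to item idx
          rx <- CreateProjectedFPTree()
          path4idx <- GetAllSubPath(PathFromRoot(r,idx))
          len4path <- GetLength(path4idx)
          for( jdx in seq_len(len4path) ){
             CountCntOnPath(r, idx, path4idx, jdx)
             InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
          }
          if( !IsEmpty(rx) ){
             FPGrowth(rx,x,f,MIN_SUP)
          }
       }
    }
}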

The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemsets (MFI). It applies the maximality property, that is, extra steps check whether a frequent itemset is maximal instead of only checking whether it is frequent. The algorithm is based partially on the tidset intersection from the Eclat algorithm. The diffset, also known as the differential set, is used for fast frequency testing. It is the difference between the tidsets of two corresponding itemsets.
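The diffset idea can be illustrated with the beer and chips tidsets from the Eclat example. The sketch assumes the usual definition, in which the diffset of XY is t(X) \ t(Y) and the support of XY equals the support of X minus the size of that diffset; the variable names are chosen here.

```r
# Tidsets taken from the Eclat example
t_chips <- c(1, 2, 4)
t_beer  <- c(1, 2)

# Diffset: transactions that contain chips but not beer
d_chips_beer <- setdiff(t_chips, t_beer)

# Support of {chips, beer} from the diffset, without intersecting tidsets
support_via_diffset <- length(t_chips) - length(d_chips_beer)

# The same support computed by tidset intersection, for comparison
support_via_tidset <- length(intersect(t_chips, t_beer))

print(c(diffset = support_via_diffset, tidset = support_via_tidset))
```

The diffset is usually much smaller than the tidset intersection it replaces, which is what makes the frequency test fast.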

A candidate MFI is checked against the definition. Let M be the set of MFIs found so far and Y a newly found frequent itemset: if some X in M is a superset of Y, then Y is discarded; however, if some X in M is a subset of Y, then X is removed from M.
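The maximality check described above can be sketched as a small base R function. The function name update_mfi and the representation of itemsets as character vectors are assumptions made for this example.

```r
# M: list of maximal frequent itemsets found so far (character vectors)
# Y: a newly found frequent itemset
update_mfi <- function(M, Y) {
  # Y is discarded when some X in M is a superset of Y
  has_superset <- any(vapply(M, function(X) all(Y %in% X), logical(1)))
  if (has_superset) return(M)
  # Otherwise, remove every X that is a subset of Y, then add Y
  is_subset <- vapply(M, function(X) all(X %in% Y), logical(1))
  c(M[!is_subset], list(Y))
}

M <- list()
M <- update_mfi(M, "beer")               # M is {beer}
M <- update_mfi(M, c("beer", "chips"))   # beer is replaced by {beer, chips}
M <- update_mfi(M, "chips")              # discarded: {beer, chips} covers it
print(M)
```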

Here is the pseudocode, with the initial values set before calling the GenMax algorithm, where D is the input transaction dataset:

[Figure: pseudocode of the GenMax algorithm, with its initial values]

The R implementation

Here is the R source code of the main GenMax algorithm:

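GenMax <- function (p,m,MIN_SUP){
  #p: frequent tidsets, m: maximal frequent itemsets found so far
  #helper functions such as MergeTidSets are assumed to be defined elsewhere
  y <- GetItemsetUnion(p)
  #stop if a superset of the union of all itemsets in p is already in m
  if( SuperSetExists(m,y) ){
    return(invisible(NULL))
  }
  len4p <- GetLength(p)
  for(idx in seq_len(len4p)){
     q <- GenerateFrequentTidSet()
     #pair p[[idx]] with every later tidset; empty when idx is the last index
     for(jdx in seq_len(len4p-idx)+idx){
        xij <- MergeTidSets(p[[idx]],p[[jdx]])
        if(GetSupport(xij)>=MIN_SUP){
           AddFrequentItemset(q,xij,GetSupport(xij))
        }
     }
     if( !IsEmpty(q) ){
        GenMax(q,m,MIN_SUP)
     }else if( !SuperSetExists(m,p[[idx]]) ){
        Add2MFI(m,p[[idx]])
     }
  }
}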

The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
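What a closure check decides can be shown with a direct test based on the definition: an itemset is closed when no proper superset has the same support, which is equivalent to the itemset being equal to the intersection of all transactions that contain it. The sketch below uses the four-transaction example from the Eclat section; it is not the Charm algorithm itself, and the function names closure and is_closed are chosen here.

```r
transactions <- list(
  c("beer", "chips", "wine"),
  c("beer", "chips"),
  c("pizza", "wine"),
  c("chips", "pizza")
)

# Closure of an itemset: the items shared by all transactions containing it
closure <- function(itemset, transactions) {
  covering <- Filter(function(t) all(itemset %in% t), transactions)
  Reduce(intersect, covering)
}

# An itemset is closed when it equals its own closure
is_closed <- function(itemset, transactions) {
  setequal(closure(itemset, transactions), itemset)
}

print(is_closed("beer", transactions))               # beer always occurs with chips
print(is_closed("chips", transactions))
print(is_closed(c("beer", "chips"), transactions))
```

Here beer is frequent but not closed, because {beer, chips} has the same support; reporting only {beer, chips} loses no support information.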

Here is the pseudocode, with the initial values set before calling the Charm algorithm, where D is the input transaction dataset:

[Figure: pseudocode of the Charm algorithm, with its initial values]

The R implementation

Here is the R source code of the main algorithm:

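Charm <- function (p,c,MIN_SUP){
  #p: frequent tidsets, c: closed frequent itemsets found so far
  #helper functions such as MergeTidSets are assumed to be defined elsewhere
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in seq_len(len4p)){
     q <- GenerateFrequentTidSet()
     #pair p[[idx]] with every later tidset; empty when idx is the last index
     for(jdx in seq_len(len4p-idx)+idx){
        xij <- MergeTidSets(p[[idx]],p[[jdx]])
        if(GetSupport(xij)>=MIN_SUP){
          if( IsSameTidSets(p,idx,jdx) ){
            #identical tidsets: the two itemsets always occur together
            ReplaceTidSetBy(p,idx,xij)
            ReplaceTidSetBy(q,idx,xij)
            RemoveTidSet(p,jdx)
          }else{
             if( IsSuperSet(p[[idx]],p[[jdx]]) ){
               ReplaceTidSetBy(p,idx,xij)
               ReplaceTidSetBy(q,idx,xij)
             }else{
               Add2CFI(q,xij)
             }
          }
        }
     }
     if( !IsEmpty(q) ){
       Charm(q,c,MIN_SUP)
     }
     #add p[[idx]] to the closed set c unless c already covers it
     if( !IsSuperSetExists(c,p[[idx]]) ){
        Add2CFI(c,p[[idx]])
     }
  }
}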

The algorithm to generate association rules

While the A-Priori algorithm generates frequent itemsets, the support count of each frequent itemset is calculated and recorded for later use in association rule mining.

To generate association rules from l, where l is a frequent itemset, two steps are needed:

  • First, get all the nonempty subsets of l.
  • Then, for each nonempty proper subset X of l, the rule X → (l − X) is a strong association rule only if support_count(l) / support_count(X) ≥ MIN_CONF. The support count of any rule of a frequent itemset is not less than the minimum support count.
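The two steps can be sketched in base R for a single frequent itemset. The example reuses the four-transaction dataset from the Eclat section; the function names support_count and generate_rules and the threshold of 0.8 are assumptions made for illustration.

```r
transactions <- list(
  c("beer", "chips", "wine"),
  c("beer", "chips"),
  c("pizza", "wine"),
  c("chips", "pizza")
)

# Number of transactions that contain every item of the itemset
support_count <- function(itemset) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Rules X -> (l - X) for every nonempty proper subset X of l
generate_rules <- function(l, min_conf) {
  rules <- list()
  if (length(l) < 2) return(rules)
  for (k in seq_len(length(l) - 1)) {
    for (lhs in combn(l, k, simplify = FALSE)) {
      conf <- support_count(l) / support_count(lhs)
      if (conf >= min_conf) {
        rules[[length(rules) + 1]] <- list(
          lhs = lhs, rhs = setdiff(l, lhs), confidence = conf
        )
      }
    }
  }
  rules
}

rules <- generate_rules(c("beer", "chips"), min_conf = 0.8)
str(rules)
```

Only beer → chips passes the threshold: its confidence is 2/2, while chips → beer has confidence 2/3.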

Here is the pseudocode:

[Figure: pseudocode of the association rule generation algorithm]

The R implementation

The R code of the algorithm to generate A-Priori association rules is as follows:

AprioriGenerateRules <- function (D,F,MIN_SUP,MIN_CONF){
  #helper functions such as AddRule2RuleSets are assumed to be defined elsewhere
  #create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in seq_len(len4f)){
     #add rule F[[idx]] => {}
     AddRule2RuleSets(r,F[[idx]],NULL)
     c <- list()
     c[[1]] <- CreateItemSets(F[[idx]])
     h <- list()
     k <- 1
     while( !IsEmptyItemSets(c[[k]]) ){
       #get heads of confident association rules in c[[k]]
       h[[k]] <- getPrefixOfConfidentRules(c[[k]],F[[idx]],D,MIN_CONF)
       c[[k+1]] <- CreateItemSets()

       #get candidate heads
       len4hk <- length(h[[k]])
       for(jdx in seq_len(max(len4hk-1,0))){
          if( Match4Itemsets(h[[k]][jdx],h[[k]][jdx+1]) ){
             tempItemset <- CreateItemset(h[[k]][jdx],h[[k]][jdx+1][k])
             if( IsSubset2Itemsets(h[[k]],tempItemset) ){
                Append2ItemSets(c[[k+1]],tempItemset)
             }
          }
       }
       #move on to the next level; without this the loop never ends
       k <- k+1
     }
     #append all new association rules to rule set
     AddRule2RuleSets(r,F[[idx]],h)
  }
  r
}

To verify the R code, compare its output with the output of the arules and rattle packages.

Tip

The arules (Hahsler et al., 2011) and rattle packages provide support for association rule analysis. The arulesViz package is used to visualize the resulting association rules.

The market basket model

The market basket model is a model that illustrates the relation between a basket and its associated items. Many tasks from different areas of research have this relation in common. To summarize them all, the market basket model is suggested as the most typical example to be researched.

The basket is also known as a transaction; it contains an itemset, that is, a set of items that belong to the same basket.

The A-Priori algorithm is a level-wise itemset mining algorithm. In contrast to A-Priori, the Eclat algorithm mines itemsets by tidset intersection. FP-growth is a frequent pattern tree algorithm. A tidset denotes a collection of zero or more IDs of transaction records.

A-Priori algorithms

As a common strategy to design algorithms, the problem is divided into two subproblems:

  • The frequent itemset generation
  • Rule generation

The strategy dramatically decreases the search space for association mining algorithms.

Input data characteristics and data structure

As the input of the A-Priori algorithm, the original input itemset is binarized, that is, 1 represents the presence of a certain item in the itemset; otherwise, it is 0. As a default assumption, the average size of the itemset is small. The popular preprocessing method is to map each unique available item in the input dataset to a unique integer ID.
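Both preprocessing steps, the mapping to integer IDs and the binarization, can be sketched in base R. The example reuses the four-transaction dataset from the Eclat section, and the variable names are chosen here.

```r
transactions <- list(
  c("beer", "chips", "wine"),
  c("beer", "chips"),
  c("pizza", "wine"),
  c("chips", "pizza")
)

# Map each unique item to an integer ID (its position in item_names)
item_names <- sort(unique(unlist(transactions)))
encoded <- lapply(transactions, match, table = item_names)

# Binarize: one row per transaction, one column per item, 1 marks presence
binary <- t(vapply(
  transactions,
  function(t) as.integer(item_names %in% t),
  integer(length(item_names))
))
colnames(binary) <- item_names

print(encoded)
print(binary)
```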

The itemsets are usually stored within databases or files and will go through several passes. To control the efficiency of the algorithm, we need to control the number of passes. During each pass, the count of every itemset of interest has to be kept in a representation that can be stored and reused by later steps of the algorithm.

There is a monotonicity feature in the itemsets under research; this implies that every subset of a frequent itemset is frequent. This characteristic is used to prune the search space for frequent itemsets in the A-Priori algorithm. It also helps compact the information related to the frequent itemsets. This feature lets us focus on smaller-sized frequent itemsets first. For example, every frequent 3-itemset contains three frequent 2-itemsets.

Tip

A k-itemset is an itemset containing k items.

The basket is in a format called the horizontal format and contains a basket or transaction ID and a number of items; it is used as the basic input format for the A-Priori algorithm. In contrast, there is another format known as the vertical format; this uses an item ID and a series of the transaction IDs. The algorithm that works on vertical data format is left as an exercise for you.

The A-Priori algorithm

Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.

Note

One important assumption is that the items within any itemset are in a lexicographic order.

  • Join action: Given that L_k is the set of frequent k-itemsets, a set of candidates to find L_{k+1} is generated. Let's call it C_{k+1}.
  • Prune action: L_{k+1} is a subset of C_{k+1}, and the candidate set C_{k+1} is usually much bigger than L_{k+1}. To save computation cost, the monotonicity characteristic of frequent itemsets is used here to prune the size of C_{k+1}.
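The join and prune actions can be sketched together in base R. This is an illustration under assumptions, not the book's AprioriGen: the function name apriori_gen is chosen here, itemsets are sorted character vectors, and the sample 2-itemsets over the items A to D are hypothetical.

```r
# Lk: list of frequent k-itemsets, each a lexicographically sorted character vector
apriori_gen <- function(Lk) {
  candidates <- list()
  n <- length(Lk)
  if (n < 2) return(candidates)
  k <- length(Lk[[1]])
  keys <- vapply(Lk, paste, character(1), collapse = ",")
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      a <- Lk[[i]]
      b <- Lk[[j]]
      # Join: the first k-1 items agree and the last items differ
      if (identical(a[-k], b[-k]) && a[k] != b[k]) {
        cand <- sort(union(a, b))
        # Prune: every k-subset of the candidate must itself be frequent
        sub_keys <- vapply(combn(cand, k, simplify = FALSE),
                           paste, character(1), collapse = ",")
        if (all(sub_keys %in% keys)) {
          candidates[[length(candidates) + 1]] <- cand
        }
      }
    }
  }
  candidates
}

# Hypothetical frequent 2-itemsets
L2 <- list(c("A", "B"), c("A", "C"), c("B", "C"), c("B", "D"))
C3 <- apriori_gen(L2)
print(C3)
```

The join produces {A, B, C} and {B, C, D}; the prune step drops {B, C, D} because its subset {C, D} is not frequent.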

Here is the pseudocode to find all the frequent itemsets:

[Figure: pseudocode of the A-Priori algorithm]

The R implementation

The R code of the A-Priori frequent itemset generation algorithm follows. D is a transaction dataset, and MIN_SUP is the minimum support count threshold. The output of the algorithm is L, the set of frequent itemsets in D.

The output of the Apriori function can be verified with the R add-on package arules, a pattern-mining and association-rule-mining package that includes the A-Priori and Eclat algorithms. Here is the R code:

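Apriori <- function (data, I, MIN_SUP, parameter = NULL){
  #f[[k]] holds the candidate k-itemsets, c[[k]] the frequent k-itemsets
  #helper functions such as AprioriGen are assumed to be defined elsewhere
  f <- CreateItemsets()
  c <- list()
  c[[1]] <- FindFrequentItemset(data,I,1,MIN_SUP)
  k <- 2
  len4data <- GetDatasetSize(data)
  while( !IsEmpty(c[[k-1]]) ){
    #join and prune the frequent (k-1)-itemsets into candidate k-itemsets
    f[[k]] <- AprioriGen(c[[k-1]])
    for( idx in seq_len(len4data) ){
      #candidates contained in transaction idx
      ft <- GetSubSet(f[[k]],data[[idx]])
      len4ft <- GetDatasetSize(ft)
      for( jdx in seq_len(len4ft) ){
        IncreaseSupportCount(f[[k]],ft[jdx])
      }
    }
    c[[k]] <- FindFrequentItemset(f[[k]],I,k,MIN_SUP)
    k <- k+1
  }
  c
}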

A-Priori algorithms

As a common strategy to design algorithms, the problem is divided into two subproblems:

  • The frequent itemset generation
  • Rule generation

The strategy dramatically decreases the search space for association mining algorithms.

Input data characteristics and data structure

As the input of the A-Priori algorithm, the original input itemset is binarized, that is, 1 represents the presence of a certain item in the itemset; otherwise, it is 0. As a default assumption, the average size of the itemset is small. The popular preprocessing method is to map each unique available item in the input dataset to a unique integer ID.

The itemsets are usually stored within databases or files, and the algorithm makes several passes over them. To control the efficiency of the algorithm, we need to control the number of passes. During each pass, the itemsets you are interested in must be represented in a form that can be counted and stored for later use by the algorithm.

Itemsets have a monotonicity property: every subset of a frequent itemset is also frequent. This property is used to prune the search space for frequent itemsets in the A-Priori algorithm. It also helps compact the information related to the frequent itemsets, because attention can be focused on the smaller frequent itemsets first. For example, every frequent 3-itemset contains three frequent 2-itemsets.
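
As a small illustration of the monotonicity property, the following base R sketch lists the 2-item subsets of a hypothetical 3-itemset and shows the kind of subset check used for pruning. The function name `all_subsets_frequent` and the comma-joined keys are assumptions made for this example.

```r
# A hypothetical frequent 3-itemset.
itemset <- c("A", "B", "C")

# All 2-item subsets; by monotonicity each of them must also be frequent.
subsets2 <- combn(itemset, 2, simplify = FALSE)
print(subsets2)

# Pruning check: a candidate can be frequent only if every
# (k-1)-subset is already known to be frequent.
all_subsets_frequent <- function(candidate, frequent) {
  subs <- combn(candidate, length(candidate) - 1, simplify = FALSE)
  keys <- vapply(subs, paste, character(1), collapse = ",")
  all(keys %in% frequent)
}

print(all_subsets_frequent(itemset, c("A,B", "A,C", "B,C")))
print(all_subsets_frequent(itemset, c("A,B", "A,C")))
```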

Tip

A k-itemset is an itemset containing k items.

The basket data is stored in the horizontal format, in which each record contains a basket or transaction ID and a number of items; this is the basic input format for the A-Priori algorithm. In contrast, the vertical format stores an item ID together with the series of IDs of the transactions that contain the item. The algorithm that works on the vertical data format is left as an exercise for you.
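
The two formats can be converted into each other. The following base R sketch, again assuming the beer/chips example from the Eclat section, turns a horizontal dataset into its vertical form; the variable names are hypothetical.

```r
# Horizontal format: one entry per transaction ID, holding its items.
horizontal <- list(
  "1" = c("beer", "chips", "wine"),
  "2" = c("beer", "chips"),
  "3" = c("pizza", "wine"),
  "4" = c("chips", "pizza")
)

# Vertical format: one entry per item, holding the IDs of the
# transactions that contain it (its tidset).
long <- stack(horizontal)            # columns: values (item), ind (tid)
vertical <- split(as.integer(as.character(long$ind)), long$values)
print(vertical)
```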

The A-Priori algorithm

Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.

Note

One important assumption is that the items within any itemset are in a lexicographic order.

  • Join action: Given that $L_k$ is the set of frequent k-itemsets, a set of candidates to find $L_{k+1}$ is generated by joining $L_k$ with itself. Let's call it $C_{k+1}$.
  • Prune action: $L_{k+1} \subseteq C_{k+1}$, and the candidate set $C_{k+1}$ is usually much bigger than $L_{k+1}$. To save computation cost, the monotonicity characteristic of frequent itemsets is used to prune $C_{k+1}$: any candidate that has an infrequent k-subset is removed.
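
The join and prune actions can be sketched together in base R. This is an illustrative sketch rather than the book's implementation: it assumes a non-empty list of frequent k-itemsets stored as lexicographically sorted character vectors, and the name `apriori_gen` is hypothetical.

```r
# Lk: frequent k-itemsets, each a sorted character vector.
# Returns the candidate (k+1)-itemsets after join and prune.
apriori_gen <- function(Lk) {
  key <- function(s) paste(s, collapse = ",")
  frequent <- vapply(Lk, key, character(1))
  k <- length(Lk[[1]])
  candidates <- list()
  for (i in seq_along(Lk)) {
    for (j in seq_along(Lk)) {
      a <- Lk[[i]]; b <- Lk[[j]]
      # Join: same first k-1 items, last item of a sorts before last of b.
      if (identical(a[-k], b[-k]) && a[k] < b[k]) {
        cand <- c(a, b[k])
        # Prune: every k-subset of the candidate must be frequent.
        subs <- combn(cand, k, simplify = FALSE)
        if (all(vapply(subs, key, character(1)) %in% frequent)) {
          candidates[[length(candidates) + 1]] <- cand
        }
      }
    }
  }
  candidates
}

L2 <- list(c("A", "B"), c("A", "C"), c("A", "D"), c("B", "C"))
C3 <- apriori_gen(L2)
print(C3)
```

In this hypothetical input, the join produces {A, B, C}, {A, B, D}, and {A, C, D}, and the prune step keeps only {A, B, C} because {B, D} and {C, D} are not frequent.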

Here is the pseudocode to find all the frequent itemsets:

[Figure: pseudocode of the A-Priori frequent itemset generation]

The R implementation

The R code of the A-Priori frequent itemset generation algorithm is given here. D is a transaction dataset, and MIN_SUP is the minimum support count threshold. The output of the algorithm is L, the set of frequent itemsets in D.

The output of the A-Priori function can be verified with the R add-on package, arules, which is a pattern-mining and association-rules-mining package that includes the A-Priori and Eclat algorithms. Here is the R code:

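Apriori <- function (data, I, MIN_SUP, parameter = NULL){
  #helper functions such as FindFrequentItemset are assumed to be defined elsewhere
  #f[[k]] holds the candidate k-itemsets, c[[k]] the frequent k-itemsets
  f <- CreateItemsets()
  c <- list()
  c[[1]] <- FindFrequentItemset(data,I,1,MIN_SUP)
  k <- 2
  len4data <- GetDatasetSize(data)
  while( !IsEmpty(c[[k-1]]) ){
    f[[k]] <- AprioriGen(c[[k-1]])
    for( idx in seq_len(len4data) ){
      #candidates contained in transaction idx
      ft <- GetSubSet(f[[k]],data[[idx]])
      len4ft <- GetDatasetSize(ft)
      for( jdx in seq_len(len4ft) ){
        IncreaseSupportCount(f[[k]],ft[jdx])
      }
    }
    c[[k]] <- FindFrequentItemset(f[[k]],I,k,MIN_SUP)
    k <- k+1
  }
  c
}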
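
The helper functions used in the listing above are not defined in this section. For experimentation, the following self-contained base R sketch performs the same level-wise search on the beer/chips example from the Eclat section. It is a simplification under stated assumptions: candidates are built as unions of two frequent k-itemsets and every candidate is counted directly, so the subset-based prune step is skipped. The names `support_count` and `apriori_sketch` are hypothetical.

```r
# Count how many transactions contain every item of `itemset`.
support_count <- function(itemset, transactions) {
  sum(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

# Level-wise search: returns a list whose k-th element holds the
# frequent k-itemsets (sorted character vectors).
apriori_sketch <- function(transactions, min_sup) {
  items <- sort(unique(unlist(transactions)))
  level <- Filter(function(s) support_count(s, transactions) >= min_sup,
                  as.list(items))
  frequent <- list()
  while (length(level) > 0) {
    frequent[[length(frequent) + 1]] <- level
    k <- length(level[[1]])
    # Candidates: unions of two frequent k-itemsets that have k+1 items.
    cands <- list()
    for (i in seq_along(level)) for (j in seq_along(level)) {
      if (i < j) {
        u <- sort(union(level[[i]], level[[j]]))
        if (length(u) == k + 1) cands[[paste(u, collapse = ",")]] <- u
      }
    }
    level <- Filter(function(s) support_count(s, transactions) >= min_sup,
                    unname(cands))
  }
  frequent
}

tx <- list(c("beer", "chips", "wine"), c("beer", "chips"),
           c("pizza", "wine"), c("chips", "pizza"))
result <- apriori_sketch(tx, min_sup = 2)
print(result)
```

With a minimum support count of 2, the sketch finds the four single items and the pair {beer, chips}, which agrees with the Eclat example later in this section.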

The output of this R code can be verified with the arules package.

Tip

The arules package (Hahsler et al., 2011) provides support for mining frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules. Both the A-Priori and Eclat algorithms are available. The cSPADE algorithm can be found in arulesSequences, an add-on for arules.

Given:

[Formula image]

At first, we will sort D into an ordered list, either by a predefined ordering or simply by the natural order of characters, which is used here. Then:

[Formula image]

Let's assume that the minimum support count is 5; the following table is an input dataset:

[Table: tid (transaction id) and list of items in each transaction, for T001 to T010; the item lists were images]

In the first scan or pass of the dataset D, get the count of each candidate itemset in $C_1$. The seven candidate 1-itemsets have the support counts 6, 8, 2, 5, 2, 3, and 3.

We will get $L_1$ after comparing the support count with the minimum support count. Three itemsets remain, with the support counts 6, 8, and 5.

We will generate $C_2$ by joining $L_1$ with itself, $C_2 = L_1 \bowtie L_1$. The three candidate 2-itemsets have the support counts 4, 3, and 4.

After comparing the support count with the minimum support count, we will get $L_2 = \emptyset$, because no candidate reaches the minimum support count of 5. The algorithm then terminates.

A-Priori algorithm variants

The variants of the A-Priori algorithm are designed mainly for efficiency and scalability. Some of these improvements are discussed in the upcoming sections.

The Eclat algorithm

The A-Priori algorithm scans the dataset as many times as the length of the longest frequent pattern. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm works on the vertical data format, that is, it uses <item id, tid set> instead of <tid, item id set> to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1)-itemsets from k-itemsets.

Candidate itemsets are generated by set intersection. The vertical format structure is called a tidset: the set of all transaction IDs that contain a given item is the tidset of that item.

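The support count is computed by the intersection between tidsets. Given two itemsets X and Y with tidsets $t(X)$ and $t(Y)$, the support count of $X \cup Y$ is the cardinality of $t(X) \cap t(Y)$. The pseudocode is called with $F \leftarrow \emptyset$ and $P \leftarrow \{\langle i, t(i) \rangle \mid i \in I, |t(i)| \ge MIN\_SUP\}$.

[Figure: pseudocode of the Eclat algorithm]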

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat <- function (p,f,MIN_SUP){
  #helper functions such as AddFrequentItemset are assumed to be defined elsewhere
  len4tidsets <- length(p)
  for(idx in seq_len(len4tidsets)){
     AddFrequentItemset(f,p[[idx]],GetSupport(p[[idx]]))
     #pa collects the frequent extensions of p[[idx]]
     pa <- GetFrequentTidSets(NULL,MIN_SUP)
     for(jdx in idx:len4tidsets){
       if(ItemCompare(p[[jdx]],p[[idx]]) > 0){
          xab <- MergeTidSets(p[[idx]],p[[jdx]])
          if(GetSupport(xab)>=MIN_SUP){
             AddFrequentItemset(pa,xab,GetSupport(xab))
          }
       }
     }
     if(!IsEmptyTidSets(pa)){
       Eclat(pa,f,MIN_SUP)
     }
  }
}

Here is the running result of one example, I = {beer, chips, pizza, wine}. The transaction dataset in the horizontal and vertical formats, respectively, is shown in the following tables:

tid    X
1      {beer, chips, wine}
2      {beer, chips}
3      {pizza, wine}
4      {chips, pizza}

x        tidset
beer     {1,2}
chips    {1,2,4}
pizza    {3,4}
wine     {1,3}

The binary format of this information is in the following table:

tid    beer    chips    pizza    wine
1      1       1        0        1
2      1       1        0        0
3      0       0        1        1
4      0       1        1        0

Before calling the Eclat algorithm, we will set MIN_SUP=2, $F = \emptyset$, and

P = {<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>}

The running process is illustrated in the following figure. After two iterations, we will get the frequent tidsets {<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>, <{beer, chips}, 12>}:

[Figure: running process of the Eclat algorithm on the example dataset]

The output of the Eclat function can be verified with the R add-on package, arules.
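
The example can also be reproduced with a short, self-contained base R sketch. This is not the listing above: it is an illustrative version that keeps itemsets as comma-separated names of a list of tidsets, and the name `eclat_sketch` is hypothetical.

```r
# Vertical format: item -> tidset, as in the example tables.
tidsets <- list(beer = c(1, 2), chips = c(1, 2, 4),
                pizza = c(3, 4), wine = c(1, 3))

# p: named list of tidsets for itemsets sharing a common prefix.
# Returns a named list mapping each frequent itemset to its tidset.
eclat_sketch <- function(p, min_sup) {
  found <- list()
  for (i in seq_along(p)) {
    found[[names(p)[i]]] <- p[[i]]
    # Extend the i-th itemset with every later one via tidset intersection.
    ext <- list()
    for (j in seq_along(p)) {
      if (j > i) {
        tids <- intersect(p[[i]], p[[j]])
        if (length(tids) >= min_sup) {
          items <- union(strsplit(names(p)[i], ",")[[1]],
                         strsplit(names(p)[j], ",")[[1]])
          ext[[paste(items, collapse = ",")]] <- tids
        }
      }
    }
    if (length(ext) > 0) found <- c(found, eclat_sketch(ext, min_sup))
  }
  found
}

# Start from the frequent 1-itemsets (MIN_SUP = 2).
frequent <- eclat_sketch(Filter(function(s) length(s) >= 2, tidsets), 2)
print(names(frequent))
```

The only frequent 2-itemset is {beer, chips} with tidset {1, 2}, which matches the result stated above.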

The FP-growth algorithm

The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.

Input data characteristics and data structure

The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is an example of the generation of the structure, with I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the upcoming images. Each node in the FP-tree represents an item, and the path from the root to that node, that is, the node list, represents an itemset. Each node stores the item together with the support information of this itemset.

tid    X
1      {A, B, C, D, E}
2      {A, B, C, E}
3      {A, D, E}
4      {B, E, D}
5      {B, E, C}
6      {E, C, D}
7      {E, D}

The sorted item order is listed in the following table:

item             E    D    C    B    A
support_count    7    5    4    4    3

Reorder each transaction by this decreasing order to get the new sorted transaction dataset, as shown in this table:

tid    X
1      {E, D, C, B, A}
2      {E, C, B, A}
3      {E, D, A}
4      {E, D, B}
5      {E, C, B}
6      {E, D, C}
7      {E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.
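
The ordering steps can be sketched in base R on the example dataset. The tie-breaking rule (reverse alphabetical order between C and B) is an assumption chosen only to reproduce the order shown in the table; the variable names are hypothetical.

```r
transactions <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)
min_sup <- 3

# Support count of each item, in decreasing order. Ties are broken by
# reverse alphabetical order here, an assumption chosen to reproduce the
# E, D, C, B, A order of the example.
counts <- table(unlist(transactions))
items <- names(counts)
ord <- order(-as.integer(counts), -rank(items))
item_order <- items[ord][as.integer(counts)[ord] >= min_sup]

# Reorder every transaction by that global order, dropping infrequent items.
sorted_tx <- lapply(transactions, function(tx) item_order[item_order %in% tx])
print(item_order)
print(sorted_tx[[2]])
```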

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

[Figure: step-by-step insertion of the sorted itemsets into the FP-tree]

The results after performing steps 4 and 7 are listed here; the process of the algorithm is simple and straightforward:

[Figure: the FP-tree after steps 4 and 7]

A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.

[Figure: the FP-tree with its header table]
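
Since the figures are not available here, the following base R sketch shows one way to build the same information. It is a simplified representation under stated assumptions, not a pointer-based tree: each node is identified by its path from the root, and the header table is a list of node paths per item. All names are hypothetical.

```r
# Transactions already sorted by decreasing item support.
sorted_tx <- list(
  c("E", "D", "C", "B", "A"), c("E", "C", "B", "A"), c("E", "D", "A"),
  c("E", "D", "B"), c("E", "C", "B"), c("E", "D", "C"), c("E", "D")
)

# Each tree node is identified by its path from the root ("E/D/C"), and
# node_count[path] is the count stored in that node.
node_count <- integer(0)
for (tx in sorted_tx) {
  for (depth in seq_along(tx)) {
    path <- paste(tx[seq_len(depth)], collapse = "/")
    node_count[path] <- if (is.na(node_count[path])) 1L else node_count[path] + 1L
  }
}

# Header table: for every item, the nodes (paths) that carry that item.
last_item <- sub(".*/", "", names(node_count))
header <- split(names(node_count), last_item)
print(node_count)
print(header$B)
```

Summing the counts of all nodes linked from one header entry gives the support count of that item; for B, the three nodes add up to 4, as in the sorted item table.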

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent patterns or itemsets. In this example, the items are removed from the frequent pattern tree in reverse order, that is, starting from the leaves; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with its recursive definition; the input values are $R \leftarrow$ FP-tree(D), $P \leftarrow \emptyset$, and $F \leftarrow \emptyset$:

[Figure: pseudocode of the FP-growth algorithm]

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth <- function (r,p,f,MIN_SUP){
    #helper functions such as RemoveInfrequentItems are assumed to be defined elsewhere
    RemoveInfrequentItems(r)
    if(IsPath(r)){
       #a single path: every subset of the path extends the prefix p
       y <- GetSubset(r)
       len4y <- GetLength(y)
       for(idx in seq_len(len4y)){
          x <- MergeSet(p,y[idx])
          SetSupportCount(x, GetMinCnt(x))
          Add2Set(f,x,GetSupportCount(x))
       }
    }else{
       len4r <- GetLength(r)
       for(idx in seq_len(len4r)){
          x <- MergeSet(p,r[idx])
          SetSupportCount(x, GetSupportCount(r[idx]))
          #record the extended itemset itself as frequent
          Add2Set(f,x,GetSupportCount(x))
          #build the projected FP-tree for item idx
          rx <- CreateProjectedFPTree()
          path4idx <- GetAllSubPath(PathFromRoot(r,idx))
          len4path <- GetLength(path4idx)
          for( jdx in seq_len(len4path) ){
            CountCntOnPath(r, idx, path4idx, jdx)
            InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
          }
          if( !IsEmpty(rx) ){
            FPGrowth(rx,x,f,MIN_SUP)
          }
       }
    }
}

The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemsets (MFIs). The maximality property is applied, that is, extra steps check for maximal frequent itemsets instead of only frequent itemsets. The algorithm is based partially on the tidset intersection from the Eclat algorithm. The diffset, or differential set as it is also known, is used for fast frequency testing. It is the difference between the tidsets of the corresponding itemsets.
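
A diffset can be illustrated with two tidsets from the earlier beer/chips example. This is a minimal sketch, assuming the diffset of an extension is taken relative to the tidset of the itemset being extended; the variable names are hypothetical.

```r
# Tidsets of two items from the earlier beer/chips example.
t_chips <- c(1, 2, 4)
t_beer  <- c(1, 2)

# Diffset of {chips, beer} relative to chips: the transactions that
# contain chips but not beer.
d <- setdiff(t_chips, t_beer)

# Support of the extended itemset without intersecting the full tidsets.
sup_via_diffset <- length(t_chips) - length(d)
sup_via_tidsets <- length(intersect(t_chips, t_beer))
print(c(sup_via_diffset, sup_via_tidsets))
```

Both computations give the same support count; the diffset version only needs the typically smaller difference set.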

A candidate MFI is checked against the definition. Let M be the current set of MFIs and Y a newly found frequent itemset. If some X in M is a superset of Y, then Y is discarded; however, if X is a subset of Y, then X is removed from M.
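
The update rule for M can be sketched as a small base R function. The name `update_mfi` and the representation of M as a list of character vectors are assumptions made for this illustration.

```r
# M is a list of itemsets (character vectors); y is a new frequent itemset.
update_mfi <- function(M, y) {
  is_subset <- function(a, b) all(a %in% b)
  # y is discarded when some member of M already contains it.
  if (any(vapply(M, function(x) is_subset(y, x), logical(1)))) return(M)
  # Otherwise drop every member that y contains, then add y.
  M <- Filter(function(x) !is_subset(x, y), M)
  c(M, list(y))
}

M <- list()
M <- update_mfi(M, c("beer"))
M <- update_mfi(M, c("pizza"))
M <- update_mfi(M, c("beer", "chips"))   # replaces {beer}
M <- update_mfi(M, c("chips"))           # discarded, {beer, chips} covers it
print(M)
```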

Before calling the GenMax algorithm, the initial values are $M \leftarrow \emptyset$ and $P \leftarrow \{\langle i, t(i) \rangle \mid i \in D, sup(i) \ge MIN\_SUP\}$, where D is the input transaction dataset. Here is the pseudocode:

[Figure: pseudocode of the GenMax algorithm]

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax <- function (p,m,MIN_SUP){
  #helper functions such as GetItemsetUnion are assumed to be defined elsewhere
  y <- GetItemsetUnion(p)
  if( SuperSetExists(m,y) ){
    #nothing under this branch can be maximal
    return(invisible(NULL))
  }
  len4p <- GetLength(p)
  for(idx in seq_len(len4p)){
     q <- GenerateFrequentTidSet()
     #the range is empty when idx equals len4p
     for(jdx in seq_len(len4p)[-seq_len(idx)]){
       xij <- MergeTidSets(p[[idx]],p[[jdx]])
       if(GetSupport(xij)>=MIN_SUP){
          AddFrequentItemset(q,xij,GetSupport(xij))
       }
     }
     if( !IsEmpty(q) ){
       GenMax(q,m,MIN_SUP)
     }else if( !SuperSetExists(m,p[[idx]]) ){
       Add2MFI(m,p[[idx]])
     }
  }
}

The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
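
The notion of a closed frequent itemset can be checked directly on a small result set. The following base R sketch uses the frequent itemsets of the earlier beer/chips example (MIN_SUP = 2); it is a brute-force illustration of the definition, not the Charm algorithm, and the variable names are hypothetical.

```r
# Frequent itemsets and support counts from the earlier beer/chips example.
itemsets <- list("beer", "chips", "pizza", "wine", c("beer", "chips"))
support  <- c(2, 3, 2, 2, 2)

# An itemset is closed when no proper superset has the same support.
is_closed <- vapply(seq_along(itemsets), function(i) {
  !any(vapply(seq_along(itemsets), function(j) {
    j != i &&
      length(itemsets[[j]]) > length(itemsets[[i]]) &&
      all(itemsets[[i]] %in% itemsets[[j]]) &&
      support[j] == support[i]
  }, logical(1)))
}, logical(1))

closed <- itemsets[is_closed]
print(closed)
```

Here {beer} is not closed, because its superset {beer, chips} has the same support count of 2, whereas {chips} is closed because its support count of 3 is higher than that of any superset.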

Before calling the Charm algorithm, the initial values are $C \leftarrow \emptyset$ and $P \leftarrow \{\langle i, t(i) \rangle \mid i \in D, sup(i) \ge MIN\_SUP\}$, where D is the input transaction dataset. Here is the pseudocode:

[Figure: pseudocode of the Charm algorithm]

The R implementation

Here is the R source code of the main algorithm:

Charm <- function (p,c,MIN_SUP){
  #helper functions such as SortBySupportCount are assumed to be defined elsewhere
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for(idx in seq_len(len4p)){
     q <- GenerateFrequentTidSet()
     #the range is empty when idx equals len4p
     for(jdx in seq_len(len4p)[-seq_len(idx)]){
        xij <- MergeTidSets(p[[idx]],p[[jdx]])
        if(GetSupport(xij)>=MIN_SUP){
          if( IsSameTidSets(p,idx,jdx) ){
            ReplaceTidSetBy(p,idx,xij)
            ReplaceTidSetBy(q,idx,xij)
            RemoveTidSet(p,jdx)
          }else{
             if( IsSuperSet(p[[idx]],p[[jdx]]) ){
               ReplaceTidSetBy(p,idx,xij)
               ReplaceTidSetBy(q,idx,xij)
             }else{
               Add2CFI(q,xij)
             }
          }
        }
     }
     if( !IsEmpty(q) ){
       Charm(q,c,MIN_SUP)
     }
     if( !IsSuperSetExists(c,p[[idx]]) ){
        #c is the set of closed frequent itemsets
        Add2CFI(c,p[[idx]])
     }
  }
}

The algorithm to generate association rules

While the A-Priori algorithm generates the frequent itemsets, the support count of each frequent itemset is calculated and recorded for the subsequent association rule mining.

To generate association rules of the form $X \Rightarrow (l - X)$, where l is a frequent itemset, two steps are needed:

  • First, get all the nonempty subsets of l.
  • Then, for each subset X of l, $X \subset l$, the rule $X \Rightarrow (l - X)$ is a strong association rule only if $confidence(X \Rightarrow (l - X)) = \frac{support\_count(l)}{support\_count(X)} \ge MIN\_CONF$. The support count of any rule generated from a frequent itemset is not less than the minimum support count.
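
The two steps can be sketched in base R with the support counts of the earlier beer/chips example. The names `rules_from` and `key`, and the comma-joined itemset keys, are assumptions made for this illustration; only proper subsets are used as rule bodies so that the head is never empty.

```r
# Support counts from the earlier beer/chips example, keyed by itemset.
support <- c("beer" = 2, "chips" = 3, "beer,chips" = 2)
key <- function(s) paste(sort(s), collapse = ",")

# All rules X => (l - X) from frequent itemset l that reach min_conf.
rules_from <- function(l, support, min_conf) {
  out <- data.frame(lhs = character(0), rhs = character(0),
                    confidence = numeric(0), stringsAsFactors = FALSE)
  for (m in seq_len(length(l) - 1)) {
    for (x in combn(l, m, simplify = FALSE)) {
      conf <- support[[key(l)]] / support[[key(x)]]
      if (conf >= min_conf) {
        out <- rbind(out, data.frame(lhs = key(x), rhs = key(setdiff(l, x)),
                                     confidence = conf,
                                     stringsAsFactors = FALSE))
      }
    }
  }
  out
}

rules <- rules_from(c("beer", "chips"), support, min_conf = 0.8)
print(rules)
```

With a minimum confidence of 0.8, only beer => chips (confidence 2/2) is kept; chips => beer has confidence 2/3 and is dropped.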

Here is the pseudocode:

[Figure: pseudocode of the association rule generation algorithm]

The R implementation

The R code of the algorithm to generate A-Priori association rules is as follows:

AprioriGenerateRules <- function (D,F,MIN_SUP,MIN_CONF){
  #helper functions such as CreateRuleSets are assumed to be defined elsewhere
  #create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for(idx in seq_len(len4f)){
     #add rule F[[idx]] => {}
     AddRule2RuleSets(r,F[[idx]],NULL)
     c <- list()
     c[[1]] <- CreateItemSets(F[[idx]])
     h <- list()
     k <- 1
     while( !IsEmptyItemSets(c[[k]]) ){
       #get heads of confident association rules in c[[k]]
       h[[k]] <- getPrefixOfConfidentRules(c[[k]],
                                           F[[idx]],D,MIN_CONF)
       c[[k+1]] <- CreateItemSets()

       #get candidate heads
       len4hk <- length(h[[k]])
       #seq_len avoids the 1:0 trap when h[[k]] has fewer than two heads
       for(jdx in seq_len(max(len4hk-1,0))){
          if( Match4Itemsets(h[[k]][jdx],
                             h[[k]][jdx+1]) ){
             tempItemset <- CreateItemset(h[[k]][jdx],
                                          h[[k]][jdx+1][k])
             if( IsSubset2Itemsets(h[[k]],
                                   tempItemset) ){
               Append2ItemSets(c[[k+1]],
                               tempItemset)
             }
          }
       }
       #move on to the next level of candidate heads
       k <- k+1
     }
     #append all new association rules to rule set
     AddRule2RuleSets(r,F[[idx]],h)
  }
  r
}

The output of this R code can be verified with the arules and rattle packages.

Tip

The arules (Hahsler et al., 2011) and rattle packages provide support for association rule analysis. The arulesViz package is used to visualize the resulting association rules.

Input data characteristics and data structure

As the input of the A-Priori algorithm, the original input itemset is binarized, that is, 1 represents the presence of a certain item in the itemset; otherwise, it is 0. As a default assumption, the average size of the itemset is small. The popular preprocessing method is to map each unique available item in the input dataset to a unique integer ID.

The itemsets are usually stored within databases or files and will go through several passes. To control the efficiency of the algorithm, we need to control the count of passes. During the process when itemsets pass through other itemsets, the representation format for each itemset you are interested in is required to count and store for further usage of the algorithm.

There is a monotonicity feature in the itemsets under research; this implies that every subset of a frequent itemset is frequent. This characteristic is used to prune the search space for the frequent itemset in the process of the A-Priori algorithm. It also helps compact the information related to the frequent itemset. This feature gives us an intrinsic view that focuses on smaller-sized frequent itemsets. For example, there are three frequent 2-itemsets contained by one certain frequent 3-itemset.

Tip

When we talk about k-itemsets means an itemset containing k items.

The basket is in a format called the horizontal format and contains a basket or transaction ID and a number of items; it is used as the basic input format for the A-Priori algorithm. In contrast, there is another format known as the vertical format; this uses an item ID and a series of the transaction IDs. The algorithm that works on vertical data format is left as an exercise for you.

The A-Priori algorithm

Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.

Note

One important assumption is that the items within any itemset are in a lexicographic order.

  • Join action: Given that The A-Priori algorithm is the set of frequent k-itemsets, a set of candidates to find The A-Priori algorithm is generated. Let's call it The A-Priori algorithm.
    The A-Priori algorithm
  • Prune action: The A-Priori algorithm, the size of The A-Priori algorithm, the candidate itemset, is usually much bigger than The A-Priori algorithm, to save computation cost; monotonicity characteristic of frequent itemset is used here to prune the size of The A-Priori algorithm.
    The A-Priori algorithm

Here is the pseudocode to find all the frequent itemsets:

The A-Priori algorithm
The A-Priori algorithm
The A-Priori algorithm

The R implementation

R code of the A-Priori frequent itemset generation algorithm goes here. D is a transaction dataset. Suppose MIN_SUP is the minimum support count threshold. The output of the algorithm is L, which is a frequent itemsets in D.

The output of the A-Priori function can be verified with the R add-on package, arules, which is a pattern-mining and association-rules-mining package that includes A-Priori and éclat algorithms. Here is the R code:

Apriori <- function (data, I, MIN_SUP, parameter = NULL){
  f <- CreateItemsets()
  c <- FindFrequentItemset(data,I,1, MIN_SUP)
  k <- 2
  len4data <- GetDatasetSize(data)
  while( !IsEmpty(c[[k-1]]) ){
        f[[k]] <- AprioriGen(c[k-1])
         for( idx in 1: len4data ){
             ft <- GetSubSet(f[[k]],data[[idx]])
             len4ft <- GetDatasetSize(ft)
             for( jdx in 1:len4ft ){
                IncreaseSupportCount(f[[k]],ft[jdx])
             }
         }
         c[[k]] <- FindFrequentItemset(f[[k]],I,k,MIN_SUP)
         k <- k+1
  }
  c
}

To verify the R code, the arules package is applied while verifying the output.

Tip

Arules (Hahsler et al., 2011) provides the support to mine frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules too. A-Priori and Eclat algorithms are both available. Also cSPADE can be found in arulesSequence, the add-on for arules.

Given:

The R implementation

At first, we will sort D into an ordered list in a predefined order algorithm or simply the natural order of characters, which is used here. Then:

The R implementation

Let's assume that the minimum support count is 5; the following table is an input dataset:

tid (transaction id)

List of items in the itemset or transaction

T001

The R implementation

T002

The R implementation

T003

The R implementation

T004

The R implementation

T005

The R implementation

T006

The R implementation

T007

The R implementation

T008

The R implementation

T009

The R implementation

T010

The R implementation

In the first scan or pass of the dataset D, get the count of each candidate itemset The R implementation. The candidate itemset and its related count:

Itemset

Support count

The R implementation

6

The R implementation

8

The R implementation

2

The R implementation

5

The R implementation

2

The R implementation

3

The R implementation

3

We will get the The R implementation after comparing the support count with minimum support count.

Itemset

Support count

The R implementation

6

The R implementation

8

The R implementation

5

We will generate The R implementation by The R implementation, The R implementation.

Itemset

Support count

The R implementation

4

The R implementation

3

The R implementation

4

After comparing the support count with the minimum support count, we will get The R implementation. The algorithm then terminates.

A-Priori algorithm variants

The various variants of A-Priori algorithms are designed mainly for the purpose of efficiency and scalability. Some of the improvements of the A-Priori algorithms are discussed in the upcoming sections.

The Eclat algorithm

The A-Priori algorithm loops as many times as the maximum length of the pattern somewhere. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, for example, using <item id, tid set> instead of <tid, item id set> that is, with the input data in the vertical format in the sample market basket file, or to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1) itemsets from k itemsets.

The candidate itemset is generated by set intersection. The vertical format structure is called a tidset as defined earlier. If all the transaction IDs related to the item I are stored in a vertical format transaction itemset, then the itemset is the tidset of the specific item.

The support count is computed by the intersection between tidsets. Given two tidsets, X and Y, The Eclat algorithm is the cardinality of The Eclat algorithm. The pseudocode is The Eclat algorithm, The Eclat algorithm.

The Eclat algorithm

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat  <- function (p,f,MIN_SUP){
  len4tidsets <- length(p)
  for(idx in 1:len4tidsets){
     AddFrequentItemset(f,p[[idx]],GetSupport(p[[idx]]))
     Pa <- GetFrequentTidSets(NULL,MIN_SUP)
       for(jdx in idx:len4tidsets){
         if(ItemCompare(p[[jdx]],p[[idx]]) > 0){
             xab <- MergeTidSets(p[[idx]],p[[jdx]])
             if(GetSupport(xab)>=MIN_SUP){
                AddFrequentItemset(pa,xab,
                GetSupport(xab))
               }
           }
     }
     if(!IsEmptyTidSets(pa)){
       Eclat(pa,f,MIN_SUP)
    }
  }
}

Here is the running result of one example, I = {beer, chips, pizza, wine}. The transaction dataset with horizontal and vertical formats, respectively, are shown in the following table:

tid

X

1

{beer, chips, wine}

2

{beer, chips}

3

{pizza, wine}

4

{chips, pizza}

x

tidset

beer

{1,2}

chips

{1,2,4}

pizza

{3,4}

wine

{1,3}

The binary format of this information is in the following table.

tid

beer

chips

pizza

wine

1

1

1

0

1

2

1

1

0

0

3

0

0

1

1

4

0

1

1

0

Before calling the Eclat algorithm, we will set MIN_SUP=2, The R implementation,

The R implementation

The running process is illustrated in the following figure. After two iterations, we will get frequent tidsets, {beer, 12 >, < chips, 124>, <pizza, 34>, <wine, 13>, < {beer, chips}, 12>}:

The R implementation

The output of the Eclat function can be verified with the R add-on package, arules.

The FP-growth algorithm

The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.

Input data characteristics and data structure

The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.

tid

X

1

{A, B, C, D, E}

2

{A, B, C, E}

3

{A, D, E}

4

{B, E, D}

5

{B, E, C}

6

{E, C, D}

7

{E, D}

The sorted item order is listed in the following table:

item

E

D

C

B

A

support_count

7

5

4

4

3

Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:

tid

X

1

{E, D, C, B, A}

2

{E, C, B, A}

3

{E, D, A}

4

{E, D, B}

5

{E, C, B}

6

{E, D, C}

7

{E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

Input data characteristics and data structure

The result after performing steps 4 and 7 are listed here, and the process of the algorithm is very simple and straight forward:

Input data characteristics and data structure

A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.

Input data characteristics and data structure

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with recursion definition; the input values are The FP-growth algorithm

The FP-growth algorithm

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth  <- function (r,p,f,MIN_SUP){
    RemoveInfrequentItems(r)
    if(IsPath(r)){
       y <- GetSubset(r)
       len4y <- GetLength(y)
       for(idx in 1:len4y){
          x <- MergeSet(p,y[idx])
          SetSupportCount(x, GetMinCnt(x))
          Add2Set(f,x,support_count(x))
       }
  }else{
      len4r <- GetLength(r)
      for(idx in 1:len4r){
         x <- MergeSet(p,r[idx])
         SetSupportCount(x, GetSupportCount(r[idx]))
         rx <- CreateProjectedFPTree()
         path4idx <- GetAllSubPath(PathFromRoot(r,idx))
         len4path <- GetLength(path4idx)
         for( jdx in 1:len4path ){
           CountCntOnPath(r, idx, path4idx, jdx)
           InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
         }
         if( !IsEmpty(rx) ){
           FPGrowth(rx,x,f,MIN_SUP)
         }
      }
  }
}
The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemset (MFI) to which the maximality properties are applied, that is, more steps are added to check the maximal frequent itemsets instead of only frequent itemsets. This is based partially on the tidset intersection from the Eclat algorithm. The diffset, or the differential set as it is also known, is used for fast frequency testing. It is the difference between two tidsets of the corresponding items.

The candidate MFI is determined by its definition: assuming M as the set of MFI, if there is one X that belongs to M and it is the superset of the newly found frequent itemset Y, then Y is discarded; however, if X is the subset of Y, then X should be removed from M.

Here is the pseudocode before calling the GenMax algorithm, The GenMax algorithm with maximal frequent itemsets, where D is the input transaction dataset.

The GenMax algorithm with maximal frequent itemsets

The R implementation

Here is the R source code of the main GenMax algorithm:

GenMax  <- function (p,m,MIN_SUP){
  y <- GetItemsetUnion(p)
  if( SuperSetExists(m,y) ){
    return
  }
  len4p <- GetLenght(p)
  for(idx in 1:len4p){
     q <- GenerateFrequentTidSet()
     for(jdx in (idx+1):len4p){
       xij <- MergeTidSets(p[[idx]],p[[jdx]])
         if(GetSupport(xij)>=MIN_SUP){
            AddFrequentItemset(q,xij,GetSupport(xij))
          }
     }
     if( !IsEmpty(q) ){
       GenMax(q,m,MIN_SUP)
     }else if( !SuperSetExists(m,p[[idx]]) ){
     Add2MFI(m,p[[idx]])
     }
   }
}
The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets allow you to get the longest frequent patterns that have the same support. This allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.

Before calling the Charm algorithm, C is initialized to the empty set and P to the set of frequent 1-itemsets with their tidsets, where D is the input transaction dataset. The pseudocode is as follows:

[Figure: pseudocode of the Charm algorithm]

The R implementation

Here is the R source code of the main algorithm:

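Charm <- function(p, c, MIN_SUP) {
  # The helper functions are assumed to be defined elsewhere and to
  # update p, q, and c (the set of closed frequent itemsets) by reference.
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for (idx in seq_len(len4p)) {
    q <- GenerateFrequentTidSet()
    # seq_len(len4p - idx) + idx is empty when idx == len4p
    for (jdx in seq_len(len4p - idx) + idx) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        if (IsSameTidSets(p, idx, jdx)) {
          # identical tidsets: merge jdx into idx and drop jdx
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        } else {
          if (IsSuperSet(p[[idx]], p[[jdx]])) {
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          } else {
            Add2CFI(q, xij)
          }
        }
      }
    }
    if (!IsEmpty(q)) {
      Charm(q, c, MIN_SUP)
    }
    if (!IsSuperSetExists(c, p[[idx]])) {
      # the original passed an undefined m here; c is the CFI set
      Add2CFI(c, p[[idx]])
    }
  }
}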
The algorithm to generate association rules

While the A-Priori algorithm generates frequent itemsets, the support count of each frequent itemset is calculated and recorded for later use in association rule mining.

To generate association rules of the form X → Y from a frequent itemset l, two steps are needed:

  • First, get all the nonempty subsets of l.
  • Then, for each nonempty proper subset X of l, let Y = l − X; the rule X → Y is a strong association rule only if its confidence, support_count(l) / support_count(X), is not less than the minimum confidence. The support count of any rule generated from a frequent itemset is not less than the minimum support count.
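The two steps above can be sketched in base R as follows. The support counts, the key helper, and the function rules_from_itemset are hypothetical and only illustrate the confidence test; the sketch assumes that the support count of every subset is already recorded.

```r
# Hypothetical support counts for the frequent itemset {A, B} and its subsets.
support <- c("A" = 4, "B" = 5, "A,B" = 3)

# Canonical lookup key for an itemset.
key <- function(items) paste(sort(items), collapse = ",")

rules_from_itemset <- function(l, support, min_conf) {
  rules <- list()
  n <- length(l)
  # Enumerate the nonempty proper subsets X of l via masks 1..(2^n - 2).
  for (mask in seq_len(2^n - 2)) {
    in_x <- (mask %/% 2^(seq_len(n) - 1)) %% 2 == 1
    X <- l[in_x]
    Y <- l[!in_x]
    conf <- support[[key(l)]] / support[[key(X)]]
    if (conf >= min_conf) {
      rules[[length(rules) + 1]] <- list(lhs = X, rhs = Y, confidence = conf)
    }
  }
  rules
}

rules <- rules_from_itemset(c("A", "B"), support, min_conf = 0.7)
```

With these counts, A → B has confidence 3/4 and is kept, while B → A has confidence 3/5 and is dropped.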

Here is the pseudocode:

[Figure: pseudocode of the association rule generation algorithm]

The R implementation

Here is the R source code of the main algorithm to generate A-Priori association rules:

AprioriGenerateRules <- function(D, F, MIN_SUP, MIN_CONF) {
  # The helper functions are assumed to be defined elsewhere.
  # create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for (idx in seq_len(len4f)) {
    # add rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while (!IsEmptyItemSets(c[[k]])) {
      # get heads of confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k+1]] <- CreateItemSets()

      # get candidate heads
      len4hk <- length(h[[k]])
      for (jdx in seq_len(max(len4hk - 1, 0))) {
        if (Match4Itemsets(h[[k]][jdx], h[[k]][jdx+1])) {
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx+1][k])
          if (IsSubset2Itemsets(h[[k]], tempItemset)) {
            Append2ItemSets(c[[k+1]], tempItemset)
          }
        }
      }
      # move on to the next level; without this the loop never ends
      k <- k + 1
    }
    # append all new association rules to rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}

The output of the R code can be verified against the arules and rattle packages.

Tip

The arules (Hahsler et al., 2011) and rattle packages provide support for association rule analysis. The arulesViz package is used to visualize the resulting association rules.

The A-Priori algorithm

Two actions are performed in the generation process of the A-Priori frequent itemset: one is join, and the other is prune.

Note

One important assumption is that the items within any itemset are in a lexicographic order.

  • Join action: Given that L_k is the set of frequent k-itemsets, a set of candidates to find L_(k+1) is generated. Let's call it C_(k+1). Two itemsets in L_k are joined when their first k-1 items are identical and the last item of the first precedes the last item of the second.
  • Prune action: L_(k+1) is a subset of C_(k+1), and C_(k+1), the candidate itemset, is usually much bigger than L_(k+1). To save computation cost, the monotonicity of frequent itemsets is used to prune C_(k+1): any candidate with a k-subset that is not in L_k cannot be frequent and is removed.
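The join and prune actions can be sketched in base R as follows. The function name apriori_gen and the representation of itemsets as sorted character vectors are assumptions made for this illustration.

```r
# Lk: list of frequent k-itemsets, each a lexicographically sorted
# character vector. Returns the pruned candidate (k+1)-itemsets.
apriori_gen <- function(Lk) {
  k <- length(Lk[[1]])
  keys <- vapply(Lk, paste, character(1), collapse = ",")
  candidates <- list()
  for (a in Lk) {
    for (b in Lk) {
      # Join: same first k-1 items, last item of a before last item of b.
      if (identical(a[-k], b[-k]) && a[k] < b[k]) {
        cand <- c(a, b[k])
        # Prune: every k-subset of the candidate must be frequent.
        subsets <- combn(cand, k, simplify = FALSE)
        ok <- all(vapply(subsets,
                         function(s) paste(s, collapse = ",") %in% keys,
                         logical(1)))
        if (ok) candidates[[length(candidates) + 1]] <- cand
      }
    }
  }
  candidates
}

L2 <- list(c("A", "B"), c("A", "C"), c("B", "C"), c("B", "D"))
C3 <- apriori_gen(L2)
```

In this example, the join produces {A, B, C} and {B, C, D}; the latter is pruned because its subset {C, D} is not frequent.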

Here is the pseudocode to find all the frequent itemsets:

[Figure: pseudocode of the A-Priori algorithm]

The R implementation

Here is the R code of the A-Priori frequent itemset generation algorithm. D is a transaction dataset, and MIN_SUP is the minimum support count threshold. The output of the algorithm is L, the set of frequent itemsets in D.

The output of the A-Priori function can be verified with the R add-on package arules, a pattern-mining and association-rule-mining package that includes the A-Priori and Eclat algorithms. Here is the R code:

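Apriori <- function(data, I, MIN_SUP, parameter = NULL) {
  # f[[k]] holds the candidate k-itemsets, c[[k]] the frequent k-itemsets;
  # the helper functions are assumed to be defined elsewhere
  f <- CreateItemsets()
  c <- list()
  c[[1]] <- FindFrequentItemset(data, I, 1, MIN_SUP)
  k <- 2
  len4data <- GetDatasetSize(data)
  while (!IsEmpty(c[[k-1]])) {
    # join and prune the frequent (k-1)-itemsets
    f[[k]] <- AprioriGen(c[[k-1]])
    for (idx in seq_len(len4data)) {
      # candidates contained in transaction idx
      ft <- GetSubSet(f[[k]], data[[idx]])
      len4ft <- GetDatasetSize(ft)
      for (jdx in seq_len(len4ft)) {
        IncreaseSupportCount(f[[k]], ft[jdx])
      }
    }
    c[[k]] <- FindFrequentItemset(f[[k]], I, k, MIN_SUP)
    k <- k + 1
  }
  c
}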

The output of the R code can be verified against the arules package.

Tip

The arules package (Hahsler et al., 2011) supports mining frequent itemsets, maximal frequent itemsets, closed frequent itemsets, and association rules. Both the A-Priori and Eclat algorithms are available. cSPADE can be found in arulesSequences, an add-on for arules.

Given:

[Formula image: definition of the input for the example]

First, we sort D into an ordered list, using either a predefined ordering or simply the natural order of characters, which is used here. Then:

[Formula image: the sorted result]

Let's assume that the minimum support count is 5; the following table is an input dataset:

tid (transaction id)    List of items in the itemset or transaction
T001                    [image]
T002                    [image]
T003                    [image]
T004                    [image]
T005                    [image]
T006                    [image]
T007                    [image]
T008                    [image]
T009                    [image]
T010                    [image]

In the first scan or pass of the dataset D, get the count of each candidate 1-itemset in C_1. The candidate itemsets and their counts are as follows:

Itemset    Support count
[image]    6
[image]    8
[image]    2
[image]    5
[image]    2
[image]    3
[image]    3

We get L_1 after comparing the support counts with the minimum support count:

Itemset    Support count
[image]    6
[image]    8
[image]    5

We generate C_2 by joining L_1 with itself:

Itemset    Support count
[image]    4
[image]    3
[image]    4

After comparing the support counts with the minimum support count, we get an empty L_2, because no candidate reaches the minimum support count of 5. The algorithm then terminates.

A-Priori algorithm variants

The variants of the A-Priori algorithm are designed mainly for efficiency and scalability. Some of these improvements are discussed in the upcoming sections.

The Eclat algorithm

The A-Priori algorithm scans the dataset as many times as the maximum length of the frequent patterns. This is the motivation for the Equivalence CLASS Transformation (Eclat) algorithm. The Eclat algorithm uses the vertical data format, that is, <item id, tid set> instead of <tid, item id set>, to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1)-itemsets from k-itemsets.

Candidate itemsets are generated by set intersection. The vertical format structure is called a tidset, as defined earlier: the tidset of an item is the set of IDs of all the transactions that contain that item.

The support count is computed by intersecting tidsets. Given the tidsets t(X) and t(Y) of two itemsets X and Y, the support count of the combined itemset is the cardinality of t(X) ∩ t(Y). The pseudocode is as follows:

[Figure: pseudocode of the Eclat algorithm]

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

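Eclat <- function(p, f, MIN_SUP) {
  # The helper functions are assumed to be defined elsewhere and to
  # update f (the set of frequent itemsets) and pa by reference.
  len4tidsets <- length(p)
  for (idx in seq_len(len4tidsets)) {
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    # pa collects the frequent extensions of p[[idx]]
    # (the original mixed the names Pa and pa)
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for (jdx in idx:len4tidsets) {
      if (ItemCompare(p[[jdx]], p[[idx]]) > 0) {
        xab <- MergeTidSets(p[[idx]], p[[jdx]])
        if (GetSupport(xab) >= MIN_SUP) {
          AddFrequentItemset(pa, xab, GetSupport(xab))
        }
      }
    }
    if (!IsEmptyTidSets(pa)) {
      Eclat(pa, f, MIN_SUP)
    }
  }
}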

Here is the running result of one example, where I = {beer, chips, pizza, wine}. The transaction dataset is shown in the horizontal and vertical formats, respectively, in the following tables:

tid    X
1      {beer, chips, wine}
2      {beer, chips}
3      {pizza, wine}
4      {chips, pizza}

x        tidset
beer     {1,2}
chips    {1,2,4}
pizza    {3,4}
wine     {1,3}

The binary format of this information is in the following table:

tid    beer    chips    pizza    wine
1      1       1        0        1
2      1       1        0        0
3      0       0        1        1
4      0       1        1        0
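The three representations can be derived from one another with a few lines of base R. This is a minimal sketch; the variable names are chosen for illustration only.

```r
# Horizontal format: tid -> items.
horizontal <- list(
  "1" = c("beer", "chips", "wine"),
  "2" = c("beer", "chips"),
  "3" = c("pizza", "wine"),
  "4" = c("chips", "pizza")
)

# One (tid, item) pair per occurrence.
tids  <- rep(as.integer(names(horizontal)), lengths(horizontal))
items <- unlist(horizontal, use.names = FALSE)

# Vertical format: item -> tidset.
vertical <- split(tids, items)

# Binary format: one row per tid, one column per item.
binary <- table(tid = tids, item = items)
```

Here, vertical$chips holds the tidset of chips, and binary["4", "chips"] is 1 because transaction 4 contains chips.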

Before calling the Eclat algorithm, we set MIN_SUP=2, f to the empty set, and p to the frequent 1-itemsets with their tidsets:

{<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>}

After two iterations, we get the frequent tidsets {<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>, <{beer, chips}, 12>}. The running process is illustrated in the following figure:

[Figure: running process of the Eclat algorithm on the example]

The output of the Eclat function can be verified with the R add-on package, arules.
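The example can also be reproduced with a small, self-contained base R sketch. The function eclat_sketch is a hypothetical simplification that stores each itemset as a comma-separated name; it is not the book's Eclat function and not the arules implementation.

```r
# p: named list of tidsets; each name is a comma-separated itemset.
# Returns all frequent itemsets found, with their tidsets.
eclat_sketch <- function(p, min_sup, found = list()) {
  for (i in seq_along(p)) {
    found[[names(p)[i]]] <- p[[i]]
    pa <- list()
    for (j in seq_along(p)) {
      if (j > i) {
        # Support of the merged itemset is the size of the intersection.
        tids <- intersect(p[[i]], p[[j]])
        if (length(tids) >= min_sup) {
          merged <- union(strsplit(names(p)[i], ",")[[1]],
                          strsplit(names(p)[j], ",")[[1]])
          pa[[paste(merged, collapse = ",")]] <- tids
        }
      }
    }
    if (length(pa) > 0) found <- eclat_sketch(pa, min_sup, found)
  }
  found
}

p <- list(beer = c(1, 2), chips = c(1, 2, 4), pizza = c(3, 4), wine = c(1, 3))
frequent <- eclat_sketch(p, min_sup = 2)
```

With MIN_SUP=2, the only frequent itemset beyond the four single items is {beer, chips} with the tidset {1, 2}, which matches the result stated above.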

The FP-growth algorithm

The FP-growth algorithm is an efficient method for mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that no candidate itemsets need to be generated here; a pattern-growth strategy is used instead. The underlying data structure is the FP-tree.

Input data characteristics and data structure

The data structure used is a hybrid of the vertical and horizontal formats; all the transaction itemsets are stored within a tree structure called a frequent pattern tree (FP-tree). Here is an example of how the structure is generated, with I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the images that follow. Each node in the FP-tree represents an item, and the path from the root to that node, that is, the node list, represents an itemset. Each node stores the support information of this itemset along with the item.

tid    X
1      {A, B, C, D, E}
2      {A, B, C, E}
3      {A, D, E}
4      {B, E, D}
5      {B, E, C}
6      {E, C, D}
7      {E, D}

The sorted item order is listed in the following table:

item             E    D    C    B    A
support_count    7    5    4    4    3

Reorder each transaction by this decreasing order to get the new sorted transaction dataset, as shown in this table:

tid    X
1      {E, D, C, B, A}
2      {E, C, B, A}
3      {E, D, A}
4      {E, D, B}
5      {E, C, B}
6      {E, D, C}
7      {E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, sort the items in decreasing order of support count. Next, remove the infrequent items from the sorted list. Then, reorder each itemset in the original transaction dataset by this order.
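The ordering step can be sketched in base R on the example dataset. Breaking ties in reverse alphabetical order is an assumption made here only so that the result matches the item order in the table above.

```r
transactions <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)
min_sup <- 3

# Support count of each item; infrequent items are dropped.
counts <- table(unlist(transactions))
counts <- counts[counts >= min_sup]

# Decreasing support; ties broken in reverse alphabetical order (assumption).
ord <- order(-as.integer(counts), -rank(names(counts)))
item_order <- names(counts)[ord]

# Reorder every transaction by the global item order.
sorted_tx <- lapply(transactions, function(tx) item_order[item_order %in% tx])
```

The resulting item_order is E, D, C, B, A, and the second transaction becomes {E, C, B, A}.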

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

[Figure: FP-tree building steps]

The results after performing steps 4 and 7 are shown here; the process of the algorithm is simple and straightforward:

[Figure: FP-tree after steps 4 and 7]

A header table is usually bound together with the frequent pattern tree. Each record of the header table stores a link to the nodes of a specific item.

[Figure: FP-tree with its header table]
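As a rough sketch of the tree-building step, the FP-tree of the example can be represented in base R by one count per root-to-node path. This flat representation and the helper nodes_for are assumptions chosen for brevity; a full implementation would use linked nodes.

```r
sorted_tx <- list(
  c("E", "D", "C", "B", "A"), c("E", "C", "B", "A"), c("E", "D", "A"),
  c("E", "D", "B"), c("E", "C", "B"), c("E", "D", "C"), c("E", "D")
)

# Each tree node is identified by its path from the root, e.g. "E/D/C";
# node_count[path] is the support count stored in that node.
node_count <- integer(0)
for (tx in sorted_tx) {
  for (depth in seq_along(tx)) {
    path <- paste(tx[seq_len(depth)], collapse = "/")
    old <- if (path %in% names(node_count)) node_count[[path]] else 0L
    node_count[path] <- old + 1L
  }
}

# Header-table view: all nodes (paths) that end in a given item.
nodes_for <- function(item) {
  node_count[sub(".*/", "", names(node_count)) == item]
}
```

For example, nodes_for("B") returns the three nodes of item B, and their counts add up to the support count of B.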

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent patterns or itemsets. Items are removed from the frequent pattern tree in reverse order, that is, from the leaves; in this example, the order is A, B, C, D, and E. Using this order, we then build the projected FP-tree for each item.

The FP-growth algorithm

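Here is the pseudocode with its recursive definition; the input values are the FP-tree R built from the dataset D, an empty prefix P, and an empty set F of frequent itemsets:

[Figure: pseudocode of the FP-growth algorithm]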

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth <- function(r, p, f, MIN_SUP) {
  # r: (projected) FP-tree, p: current prefix itemset, f: set of frequent
  # itemsets found so far; helper functions are assumed to be defined elsewhere
  RemoveInfrequentItems(r)
  if (IsPath(r)) {
    # single path: every subset of the path, merged with p, is frequent
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for (idx in seq_len(len4y)) {
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, GetSupportCount(x))
    }
  } else {
    len4r <- GetLength(r)
    for (idx in seq_len(len4r)) {
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      # build the projected FP-tree for item idx
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for (jdx in seq_len(len4path)) {
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      if (!IsEmpty(rx)) {
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}
The algorithm to generate association rules

While the A-Priori algorithm generates frequent itemsets, the support count of each frequent itemset is calculated and recorded for use in association rule mining.

To generate association rules X → Y from a frequent itemset l, two steps are needed:

  • First, get all the nonempty subsets of l.
  • Then, for each subset X of l, let Y = l − X. The rule X → Y is a strong association rule only if confidence(X → Y) = support_count(l) / support_count(X) ≥ MIN_CONF. Because l is frequent, the support count of any rule generated from it is not less than the minimum support count.
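The two steps above can be sketched in base R as follows. This is an illustration, not the book's implementation: the support counts are taken from the beer and chips example in the Eclat section, and support_count, key, and generate_rules are hypothetical names introduced for this sketch. Only nonempty proper subsets are used as the left-hand side so that Y is never empty.

```r
# Support counts keyed by the sorted, comma-joined itemset (illustrative values).
support_count <- c("beer" = 2, "chips" = 3, "beer,chips" = 2)
key <- function(items) paste(sort(items), collapse = ",")

generate_rules <- function(l, min_conf) {
  rules <- list()
  n <- length(l)
  if (n < 2) return(rules)
  for (size in seq_len(n - 1)) {                  # nonempty proper subsets of l
    for (X in combn(l, size, simplify = FALSE)) {
      Y <- setdiff(l, X)
      conf <- support_count[[key(l)]] / support_count[[key(X)]]
      if (conf >= min_conf) {
        rules[[length(rules) + 1]] <- list(lhs = X, rhs = Y, confidence = conf)
      }
    }
  }
  rules
}

rules <- generate_rules(c("beer", "chips"), min_conf = 0.8)
```

With a minimum confidence of 0.8, only beer → chips (confidence 2/2) is kept; chips → beer has confidence 2/3 and is rejected.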

Here is the pseudocode:

[Figure: pseudocode of the algorithm to generate association rules]

The R implementation

Here is the R source code of the algorithm that generates A-Priori association rules:

AprioriGenerateRules <- function(D, F, MIN_SUP, MIN_CONF) {
  # create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for (idx in seq_len(len4f)) {
    # add rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while (!IsEmptyItemSets(c[[k]])) {
      # get heads of confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k + 1]] <- CreateItemSets()

      # get candidate heads
      len4hk <- length(h[[k]])
      for (jdx in seq_len(max(len4hk - 1, 0))) {
        if (Match4Itemsets(h[[k]][jdx], h[[k]][jdx + 1])) {
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx + 1][k])
          if (IsSubset2Itemsets(h[[k]], tempItemset)) {
            Append2ItemSets(c[[k + 1]], tempItemset)
          }
        }
      }
      # move on to the next level of candidate heads
      k <- k + 1
    }
    # append all new association rules to rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}

To verify the R code, its output can be compared with the output of the arules and Rattle packages.

Tip

The arules (Hahsler et al., 2011) and Rattle packages provide support for association rule analysis. The arulesViz package is used to visualize the resulting association rules.

A-Priori algorithm variants

The variants of the A-Priori algorithm are designed mainly to improve efficiency and scalability. Some of these improvements are discussed in the upcoming sections.

The Eclat algorithm

The A-Priori algorithm loops over the dataset as many times as the maximum length of the frequent patterns. This is the motivation for the Equivalence CLAss Transformation (Eclat) algorithm. The Eclat algorithm explores the vertical data format, that is, it uses <item id, tid set> instead of <tid, item id set> (for example, the sample market basket file stored in the vertical format) to discover frequent itemsets from a transaction dataset. The A-Priori property is also used in this algorithm to get frequent (k+1)-itemsets from frequent k-itemsets.

Candidate itemsets are generated by set intersection. The vertical format structure is called a tidset, as defined earlier: if the IDs of all the transactions that contain an item I are stored together in the vertical format, then that set of IDs is the tidset of the item.

The support count is computed by intersecting tidsets. Given two tidsets, X and Y, the support count of the merged itemset is the cardinality of X ∩ Y. The pseudocode is shown in the following figure:

[Figure: pseudocode of the Eclat algorithm]
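As a small illustration of the support computation, the following base R sketch intersects two tidsets and counts the result. The tidsets are those of the beer, chips, pizza, and wine example shown later in this section; support_of is a hypothetical helper name.

```r
# Tidsets of the example items: the IDs of the transactions containing each item.
tidsets <- list(beer = c(1, 2), chips = c(1, 2, 4), pizza = c(3, 4), wine = c(1, 3))

# Support count of the 2-itemset {a, b} = size of the tidset intersection.
support_of <- function(a, b) length(intersect(tidsets[[a]], tidsets[[b]]))

sup_beer_chips <- support_of("beer", "chips")   # shared tids: 1 and 2
sup_beer_pizza <- support_of("beer", "pizza")   # no shared tids
```

No pass over the original transactions is needed once the tidsets exist, which is the main appeal of the vertical format.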

The R implementation

Here is the R code for the Eclat algorithm to find the frequent patterns. Before calling the function, f is set to empty, and p is the set of frequent 1-itemsets:

Eclat <- function(p, f, MIN_SUP) {
  len4tidsets <- length(p)
  for (idx in seq_len(len4tidsets)) {
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    pa <- GetFrequentTidSets(NULL, MIN_SUP)
    for (jdx in idx:len4tidsets) {
      # combine p[[idx]] only with the items that come after it
      if (ItemCompare(p[[jdx]], p[[idx]]) > 0) {
        xab <- MergeTidSets(p[[idx]], p[[jdx]])
        if (GetSupport(xab) >= MIN_SUP) {
          AddFrequentItemset(pa, xab, GetSupport(xab))
        }
      }
    }
    if (!IsEmptyTidSets(pa)) {
      Eclat(pa, f, MIN_SUP)
    }
  }
}

Here is the running result of one example, with I = {beer, chips, pizza, wine}. The transaction dataset is shown in the horizontal and vertical formats, respectively, in the following tables:

tid    X
1      {beer, chips, wine}
2      {beer, chips}
3      {pizza, wine}
4      {chips, pizza}

x        tidset
beer     {1,2}
chips    {1,2,4}
pizza    {3,4}
wine     {1,3}
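The conversion from the horizontal to the vertical format can be sketched in base R as follows. The list layout (transactions named by their tid) is an assumption made for this illustration.

```r
# Horizontal format: one entry per transaction, named by its tid.
transactions <- list(
  "1" = c("beer", "chips", "wine"),
  "2" = c("beer", "chips"),
  "3" = c("pizza", "wine"),
  "4" = c("chips", "pizza")
)

# Pair every item with the tid of the transaction it occurs in ...
tids  <- rep(as.integer(names(transactions)), lengths(transactions))
items <- unlist(transactions, use.names = FALSE)

# ... then group the tids by item to obtain one tidset per item.
tidsets <- split(tids, items)
```

The resulting list has one element per item, for example tidsets$chips contains 1, 2, and 4, which matches the vertical table above.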

The binary format of this information is in the following table:

tid    beer    chips    pizza    wine
1      1       1        0        1
2      1       1        0        0
3      0       0        1        1
4      0       1        1        0
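The binary format can be derived from the same transactions with a short base R sketch. The variable names are chosen for this illustration only.

```r
transactions <- list(
  "1" = c("beer", "chips", "wine"),
  "2" = c("beer", "chips"),
  "3" = c("pizza", "wine"),
  "4" = c("chips", "pizza")
)

# One column per item, in alphabetical order.
items <- sort(unique(unlist(transactions)))

# For each transaction, mark with 1 the items it contains; transpose so that
# rows are transactions and columns are items.
binary <- t(vapply(transactions,
                   function(tr) as.integer(items %in% tr),
                   integer(length(items))))
colnames(binary) <- items
```

Column sums of this matrix give the support count of each single item, and the row named "4" reads 0, 1, 1, 0 as in the table.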

Before calling the Eclat algorithm, we set MIN_SUP=2, f to the empty set, and p to the frequent 1-itemsets {<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>}.

The running process is illustrated in the following figure. After two iterations, we get the frequent tidsets {<beer, 12>, <chips, 124>, <pizza, 34>, <wine, 13>, <{beer, chips}, 12>}:

[Figure: running process of the Eclat algorithm on the example dataset]

The output of the Eclat function can be verified with the R add-on package, arules.
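For readers who want to reproduce the example without the helper functions used in the listing above, here is a self-contained base R sketch of the same recursion. It is a simplified illustration, not the book's code: each entry of p is assumed to be a list with an items vector and a tids vector, and eclat_sketch is a hypothetical name.

```r
tidsets <- list(beer = c(1, 2), chips = c(1, 2, 4), pizza = c(3, 4), wine = c(1, 3))
min_sup <- 2

# p: list of list(items = ..., tids = ...); returns the frequent itemsets
# found at this level and below.
eclat_sketch <- function(p, min_sup) {
  found <- list()
  for (i in seq_along(p)) {
    found[[length(found) + 1]] <- p[[i]]
    pa <- list()
    for (j in seq_along(p)) {
      if (j <= i) next                      # only combine with later entries
      tids <- intersect(p[[i]]$tids, p[[j]]$tids)
      if (length(tids) >= min_sup) {
        pa[[length(pa) + 1]] <- list(items = union(p[[i]]$items, p[[j]]$items),
                                     tids = tids)
      }
    }
    if (length(pa) > 0) found <- c(found, eclat_sketch(pa, min_sup))
  }
  found
}

# Start from the frequent 1-itemsets.
p <- lapply(names(tidsets), function(x) list(items = x, tids = tidsets[[x]]))
p <- Filter(function(e) length(e$tids) >= min_sup, p)
frequent <- eclat_sketch(p, min_sup)
labels <- vapply(frequent, function(e) paste(e$items, collapse = ","), character(1))
```

On this dataset the sketch finds the four single items plus {beer, chips}, in line with the frequent tidsets listed above.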

The FP-growth algorithm

The FP-growth algorithm is an efficient method for mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that no candidate itemsets are generated here; a pattern-growth strategy is used instead, with the FP-tree as the underlying data structure.

Input data characteristics and data structure

The data structure used is a hybrid of the vertical and horizontal formats; all the transaction itemsets are stored within a tree structure called a frequent pattern tree (FP-tree). Here is an example of how the structure is generated, with I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the figures that follow. Each node in the FP-tree represents an item, and the path from the root to that node represents an itemset. Each node stores its item together with the support information of that itemset.

tid    X
1      {A, B, C, D, E}
2      {A, B, C, E}
3      {A, D, E}
4      {B, E, D}
5      {B, E, C}
6      {E, C, D}
7      {E, D}

The sorted item order is listed in the following table:

item             E    D    C    B    A
support_count    7    5    4    4    3

Reorder each transaction in this decreasing order of support count to get the new sorted transaction dataset, as shown in this table:

tid    X
1      {E, D, C, B, A}
2      {E, C, B, A}
3      {E, D, A}
4      {E, D, B}
5      {E, C, B}
6      {E, D, C}
7      {E, D}
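The counting, sorting, and reordering steps can be sketched in base R. The variable names are chosen for this illustration, and the tie between B and C is broken in reverse alphabetical order only to reproduce the E, D, C, B, A order of the table; that tie-breaking rule is an assumption, not something the text specifies.

```r
transactions <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)
min_sup <- 3

# Support count of every item, keeping only the frequent ones.
counts <- table(unlist(transactions))
counts <- counts[counts >= min_sup]

# Decreasing support count; ties broken in reverse alphabetical order (assumed).
item_names <- names(counts)
ord <- order(-as.integer(counts), -rank(item_names))
item_order <- item_names[ord]

# Rewrite every transaction in that global item order.
sorted_tx <- lapply(transactions, function(tr) item_order[item_order %in% tr])
```

Any consistent tie-breaking rule works for building the tree, as long as every transaction is sorted with the same global order.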

The FP-tree building process is illustrated in the following figures, which show each itemset being added to the FP-tree. The support information is updated at the same time, that is, the support counts of the nodes on the path to the inserted node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, first sort the items in decreasing order of support count and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.

Given MIN_SUP=3, the itemsets are processed according to this logic:

[Figure: FP-tree building process for the example dataset]

The results after performing steps 4 and 7 are shown here; the process of the algorithm is simple and straightforward:

[Figure: FP-tree after steps 4 and 7]

A header table is usually bound together with the frequent pattern tree. Each record of the header table stores a link to the nodes of a specific item.

[Figure: FP-tree with its header table]
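The insertion step can be sketched in base R by using environments as mutable tree nodes. This is a minimal illustration under stated assumptions: new_node and insert_transaction are hypothetical names, the transactions are the sorted ones from the table above, and the header table is left out to keep the sketch short.

```r
# A node is an environment so that its count can be updated in place.
new_node <- function(item) {
  node <- new.env()
  node$item <- item
  node$count <- 0
  node$children <- list()
  node
}

# Walk down from the root, creating missing nodes and incrementing the
# count of every node on the path.
insert_transaction <- function(root, items) {
  node <- root
  for (item in items) {
    if (is.null(node$children[[item]])) {
      node$children[[item]] <- new_node(item)
    }
    node <- node$children[[item]]
    node$count <- node$count + 1
  }
  invisible(root)
}

sorted_tx <- list(
  c("E", "D", "C", "B", "A"), c("E", "C", "B", "A"), c("E", "D", "A"),
  c("E", "D", "B"), c("E", "C", "B"), c("E", "D", "C"), c("E", "D")
)

root <- new_node(NA_character_)
for (tx in sorted_tx) insert_transaction(root, tx)

e_node <- root$children[["E"]]
```

Because every sorted transaction starts with E, the root has a single child whose count is 7; its D child has count 5 and its C child has count 2.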

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent patterns or itemsets. Items are removed from the frequent pattern tree in reverse order, that is, starting from the leaves; therefore, the order in this example is A, B, C, D, and E. Using this order, we then build the projected FP-tree for each item.

The FP-growth algorithm

Here is the pseudocode with its recursive definition; the input values are the FP-tree built from D, an empty prefix, and an empty set of frequent itemsets:

[Figure: pseudocode of the FP-growth algorithm]

The R implementation

Here is the R source code of the main FP-growth algorithm:

FPGrowth <- function(r, p, f, MIN_SUP) {
  RemoveInfrequentItems(r)
  if (IsPath(r)) {
    # r is a single path: every subset of it, merged with p, is frequent
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for (idx in seq_len(len4y)) {
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, GetSupportCount(x))
    }
  } else {
    len4r <- GetLength(r)
    for (idx in seq_len(len4r)) {
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      # record x as a frequent itemset before projecting the tree
      Add2Set(f, x, GetSupportCount(x))
      # build the projected FP-tree of x from the paths that lead to item idx
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for (jdx in seq_len(len4path)) {
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      if (!IsEmpty(rx)) {
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}
The FP-growth algorithm

The FP-growth algorithm is an efficient method targeted at mining frequent itemsets in a large dataset. The main difference between the FP-growth algorithm and the A-Priori algorithm is that the generation of a candidate itemset is not needed here. The pattern-growth strategy is used instead. The FP-tree is the data structure.

Input data characteristics and data structure

The data structure used is a hybrid of vertical and horizontal datasets; all the transaction itemsets are stored within a tree structure. The tree structure used in this algorithm is called a frequent pattern tree. Here is example of the generation of the structure, I = {A, B, C, D, E, F}; the transaction dataset D is in the following table, and the FP-tree building process is shown in the next upcoming image. Each node in the FP-tree represents an item and the path from the root to that item, that is, the node list represents an itemset. The support information of this itemset is included in the node as well as the item too.

tid

X

1

{A, B, C, D, E}

2

{A, B, C, E}

3

{A, D, E}

4

{B, E, D}

5

{B, E, C}

6

{E, C, D}

7

{E, D}

The sorted item order is listed in the following table:

item

E

D

C

B

A

support_count

7

5

4

4

3

Reorder the transaction dataset with this new decreasing order; get the new sorted transaction dataset, as shown in this table:

tid

X

1

{E, D, C, B, A}

2

{E, C, B, A}

3

{E, D, A}

4

{E, D, B}

5

{E, C, B}

6

{E, D, C}

7

{E, D}

The FP-tree building process is illustrated in the following images, along with the addition of each itemset to the FP-tree. The support information is calculated at the same time, that is, the support counts of the items on the path to that specific node are incremented.

The most frequent items are put at the top of the tree; this keeps the tree as compact as possible. To start building the FP-tree, the items should be decreasingly ordered by the support count. Next, get the list of sorted items and remove the infrequent ones. Then, reorder each itemset in the original transaction dataset by this order.

Given MIN_SUP=3, the following itemsets can be processed according to this logic:

Input data characteristics and data structure

The result after performing steps 4 and 7 are listed here, and the process of the algorithm is very simple and straight forward:

Input data characteristics and data structure

A header table is usually bound together with the frequent pattern tree. A link to the specific node, or the item, is stored in each record of the header table.

Input data characteristics and data structure

The FP-tree serves as the input of the FP-growth algorithm and is used to find the frequent pattern or itemset. Here is an example of removing the items from the frequent pattern tree in a reverse order or from the leaf; therefore, the order is A, B, C, D, and E. Using this order, we will then build the projected FP-tree for each item.

The FP-growth algorithm

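Here is the pseudocode with its recursive definition; the input values correspond to the parameters of the R function that follows: the FP-tree r, the prefix itemset p, the set of frequent itemsets f, and the minimum support MIN_SUP.

[Figure: FP-growth pseudocode]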

The R implementation

Here is the R source code of the main FP-growth algorithm:

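FPGrowth <- function(r, p, f, MIN_SUP) {
  # r: FP-tree, p: prefix itemset, f: set of frequent itemsets found so far.
  # The helper functions are assumed to be defined elsewhere and to update
  # their first argument in place.
  RemoveInfrequentItems(r)
  if (IsPath(r)) {
    # r is a single path: combine each subset of the path with the prefix p
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for (idx in seq_len(len4y)) {
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, GetSupportCount(x))
    }
  } else {
    len4r <- GetLength(r)
    for (idx in seq_len(len4r)) {
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      # build the projected FP-tree for item idx from its paths to the root
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for (jdx in seq_len(len4path)) {
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      # recurse on the projected tree with the extended prefix
      if (!IsEmpty(rx)) {
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}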
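The function above relies on helper functions that are not listed here, so it cannot be run on its own. As a hedged, self-contained illustration of the same pattern-growth recursion, the following sketch works directly on conditional pattern bases (lists of prefix paths with counts) and skips the tree compression; it is a simplification for the small example, not the FP-tree implementation itself. The names `pattern_growth`, `paths`, and `counts` are hypothetical.

```r
# Sorted transactions from the example; each one is a path with count 1
sorted_tx <- list(
  c("E", "D", "C", "B", "A"),
  c("E", "C", "B", "A"),
  c("E", "D", "A"),
  c("E", "D", "B"),
  c("E", "C", "B"),
  c("E", "D", "C"),
  c("E", "D")
)

pattern_growth <- function(paths, counts, min_sup, prefix = character(0)) {
  result <- list()
  for (item in unique(unlist(paths))) {
    has <- vapply(paths, function(p) item %in% p, logical(1))
    sup <- sum(counts[has])
    if (sup < min_sup) next            # infrequent in this projection
    itemset <- c(item, prefix)
    result[[paste(sort(itemset), collapse = ",")]] <- sup
    # conditional pattern base: the part of each path that precedes `item`
    cond_paths <- lapply(paths[has], function(p) p[seq_len(match(item, p) - 1)])
    cond_counts <- counts[has]
    keep <- lengths(cond_paths) > 0
    if (any(keep)) {
      result <- c(result,
                  pattern_growth(cond_paths[keep], cond_counts[keep],
                                 min_sup, itemset))
    }
  }
  result
}

freq <- pattern_growth(sorted_tx, rep(1L, length(sorted_tx)), min_sup = 3)
```

Because every recursive call only looks at items that precede the chosen item in the global order, each frequent itemset is generated once. The result is a named list that maps each itemset (items joined by commas) to its support count.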

The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemsets (MFIs). It applies the maximality property, that is, it adds steps that check whether a frequent itemset is maximal rather than merely frequent. It is based partially on the tidset intersection from the Eclat algorithm. The diffset, also known as the differential set, is used for fast frequency testing; it is the difference between the tidsets of the two corresponding items.
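As a small worked illustration of diffsets on the example dataset from the FP-growth section: the support of the combined itemset equals the support of the first item minus the size of the diffset, so no intersection has to be stored. The helper name `tidset` is hypothetical.

```r
transactions <- list(
  c("A", "B", "C", "D", "E"),
  c("A", "B", "C", "E"),
  c("A", "D", "E"),
  c("B", "E", "D"),
  c("B", "E", "C"),
  c("E", "C", "D"),
  c("E", "D")
)

# tidset of an item: the ids of the transactions that contain it
tidset <- function(transactions, item) {
  which(vapply(transactions, function(tx) item %in% tx, logical(1)))
}

t_C <- tidset(transactions, "C")
t_B <- tidset(transactions, "B")

# diffset of {C, B} relative to C: tids that contain C but not B
d_CB <- setdiff(t_C, t_B)

# support({C, B}) = support(C) - |diffset|
sup_CB <- length(t_C) - length(d_CB)

# same value as the direct tidset intersection
sup_CB_direct <- length(intersect(t_C, t_B))
```

The diffset here contains a single transaction id, whereas the intersection contains three; on dense data this difference in size is what makes the frequency test cheaper.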

A candidate MFI is checked against the definition: let M be the current set of MFIs and Y a newly found frequent itemset. If some X in M is a superset of Y, then Y is discarded; if some X in M is a subset of Y, then X is removed from M.
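The update rule for M can be sketched as a small base R function. The name `update_mfi` and the representation of M as a list of character vectors are assumptions made for this illustration.

```r
update_mfi <- function(M, Y) {
  is_subset <- function(a, b) all(a %in% b)
  # Y is covered by an existing maximal itemset: discard it
  if (any(vapply(M, function(X) is_subset(Y, X), logical(1)))) {
    return(M)
  }
  # drop every X that the new itemset Y makes non-maximal, then add Y
  M <- Filter(function(X) !is_subset(X, Y), M)
  c(M, list(Y))
}

M <- list(c("A", "E"), c("B", "C"))
M1 <- update_mfi(M, c("B", "C", "E"))  # {B, C} is replaced by {B, C, E}
M2 <- update_mfi(M1, c("C", "E"))      # discarded, covered by {B, C, E}
```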

The following pseudocode shows the setup performed before calling the GenMax algorithm, where D is the input transaction dataset, followed by the algorithm itself:

[Figure: GenMax pseudocode]

The R implementation

Here is the R source code of the main GenMax algorithm:

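GenMax <- function(p, m, MIN_SUP) {
  # p: list of itemset-tidset pairs, m: set of maximal frequent itemsets.
  # The helper functions are assumed to be defined elsewhere and to update
  # their first argument in place.
  y <- GetItemsetUnion(p)
  if (SuperSetExists(m, y)) {
    # the whole branch is already covered by a known MFI
    return(invisible(NULL))
  }
  len4p <- GetLength(p)
  for (idx in seq_len(len4p)) {
    q <- GenerateFrequentTidSet()
    # idx + seq_len(...) is empty when idx is the last index
    for (jdx in idx + seq_len(len4p - idx)) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        AddFrequentItemset(q, xij, GetSupport(xij))
      }
    }
    if (!IsEmpty(q)) {
      GenMax(q, m, MIN_SUP)
    } else if (!SuperSetExists(m, p[[idx]])) {
      # no frequent extension and no known superset: p[[idx]] is maximal
      Add2MFI(m, p[[idx]])
    }
  }
}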
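Since the helpers used above are not listed, here is a self-contained sketch of the same recursion that can be run on the example dataset from the FP-growth section. It uses plain tidset intersection rather than diffsets and returns the updated MFI set instead of modifying it in place; both choices, and all names, are assumptions made for illustration.

```r
transactions <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)

has_superset <- function(M, x) {
  any(vapply(M, function(m) all(x %in% m), logical(1)))
}

genmax <- function(P, M, min_sup) {
  # P: list of list(items = ..., tids = ...); M: list of MFIs found so far
  y <- unique(unlist(lapply(P, `[[`, "items")))
  if (has_superset(M, y)) return(M)
  for (i in seq_along(P)) {
    Q <- list()
    for (j in seq_along(P)) {
      if (j <= i) next
      tids <- intersect(P[[i]]$tids, P[[j]]$tids)
      if (length(tids) >= min_sup) {
        Q[[length(Q) + 1]] <- list(
          items = union(P[[i]]$items, P[[j]]$items), tids = tids
        )
      }
    }
    if (length(Q) > 0) {
      M <- genmax(Q, M, min_sup)
    } else if (!has_superset(M, P[[i]]$items)) {
      M[[length(M) + 1]] <- P[[i]]$items
    }
  }
  M
}

# Initial P: the frequent single items with their tidsets
min_sup <- 3
items <- sort(unique(unlist(transactions)))
P0 <- lapply(items, function(it) {
  list(items = it,
       tids = which(vapply(transactions, function(tx) it %in% tx, logical(1))))
})
P0 <- Filter(function(e) length(e$tids) >= min_sup, P0)
mfi <- genmax(P0, list(), min_sup)
mfi_keys <- vapply(mfi, function(m) paste(sort(m), collapse = ","), character(1))
```

With MIN_SUP=3, the frequent itemsets of the example that have no frequent superset are {A, E}, {B, C, E}, and {D, E}, which is what `mfi_keys` contains.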

The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. Closed frequent itemsets give the longest frequent patterns that share the same support, which allows us to prune frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
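A closure check can be illustrated directly on the example dataset from the FP-growth section: the closure of an itemset is the intersection of all transactions that contain it, and an itemset is closed when it equals its own closure. This brute-force sketch only shows the definition; it is not how Charm performs the check, and the function names are hypothetical.

```r
transactions <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)

closure <- function(transactions, itemset) {
  covering <- Filter(function(tx) all(itemset %in% tx), transactions)
  if (length(covering) == 0) return(itemset)  # itemset never occurs
  Reduce(intersect, covering)                 # items shared by all of them
}

is_closed <- function(transactions, itemset) {
  setequal(closure(transactions, itemset), itemset)
}

closure_B <- closure(transactions, "B")
```

Here {B} is not closed: every transaction that contains B also contains E, so {B, E} has the same support and {B} is redundant.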

The following pseudocode shows the setup performed before calling the Charm algorithm, where D is the input transaction dataset, followed by the algorithm itself:

[Figure: Charm pseudocode]

The R implementation

Here is the R source code of the main algorithm:

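Charm <- function(p, c, MIN_SUP) {
  # p: list of itemset-tidset pairs, c: set of closed frequent itemsets.
  # The helper functions are assumed to be defined elsewhere and to update
  # their first argument in place.
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for (idx in seq_len(len4p)) {
    q <- GenerateFrequentTidSet()
    # idx + seq_len(...) is empty when idx is the last index
    for (jdx in idx + seq_len(len4p - idx)) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        if (IsSameTidSets(p, idx, jdx)) {
          # identical tidsets: merge jdx into idx and drop jdx
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        } else {
          if (IsSuperSet(p[[idx]], p[[jdx]])) {
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          } else {
            Add2CFI(q, xij)
          }
        }
      }
    }
    if (!IsEmpty(q)) {
      Charm(q, c, MIN_SUP)
    }
    # closure check: add p[[idx]] only if c holds no superset of it
    if (!IsSuperSetExists(c, p[[idx]])) {
      Add2CFI(c, p[[idx]])
    }
  }
}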

The algorithm to generate association rules

While the A-Priori algorithm generates the frequent itemsets, the support count of each frequent itemset is calculated and recorded for later use in association rule mining.

To generate an association rule X → (l − X), where l is a frequent itemset, two steps are needed:

  • First, get all the nonempty subsets of l.
  • Then, for each subset X of l, let Y = l − X; the rule X → Y is a strong association rule only if confidence(X → Y) ≥ MIN_CONF. The support count of any rule of a frequent itemset is not less than the minimum support count.
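The two steps can be sketched in base R for one frequent itemset of the example dataset from the FP-growth section. The confidence of X → Y is computed as support_count(l) / support_count(X). This is a minimal illustration with hypothetical function names; it enumerates only nonempty proper subsets, so the rule with an empty right-hand side is left out.

```r
transactions <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)

support_count <- function(transactions, itemset) {
  sum(vapply(transactions, function(tx) all(itemset %in% tx), logical(1)))
}

generate_rules <- function(transactions, l, min_conf) {
  n <- length(l)
  sup_l <- support_count(transactions, l)
  rules <- list()
  # each mask in 1 .. 2^n - 2 selects one nonempty proper subset X of l
  for (mask in seq_len(2^n - 2)) {
    in_x <- (mask %/% 2^(seq_len(n) - 1)) %% 2 == 1
    X <- l[in_x]
    Y <- l[!in_x]
    conf <- sup_l / support_count(transactions, X)
    if (conf >= min_conf) {
      rules[[length(rules) + 1]] <- list(lhs = X, rhs = Y, confidence = conf)
    }
  }
  rules
}

rules <- generate_rules(transactions, c("B", "C", "E"), min_conf = 0.7)
```

For l = {B, C, E} with support count 3, the rule {E} → {B, C} has confidence 3/7 and is rejected, while the other five candidate rules pass the 0.7 threshold.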

Here is the pseudocode:

[Figure: pseudocode for association rule generation]

The R implementation

Here is the R source code of the main algorithm for generating A-Priori association rules:

AprioriGenerateRules <- function(D, F, MIN_SUP, MIN_CONF) {
  # The helper functions are assumed to be defined elsewhere and to update
  # their first argument in place.
  # create empty rule set
  r <- CreateRuleSets()
  for (idx in seq_along(F)) {
    # add rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while (!IsEmptyItemSets(c[[k]])) {
      # get heads of confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k + 1]] <- CreateItemSets()

      # get candidate heads
      len4hk <- length(h[[k]])
      for (jdx in seq_len(max(len4hk - 1, 0))) {
        if (Match4Itemsets(h[[k]][jdx], h[[k]][jdx + 1])) {
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx + 1][k])
          if (IsSubset2Itemsets(h[[k]], tempItemset)) {
            Append2ItemSets(c[[k + 1]], tempItemset)
          }
        }
      }
      # move on to the next level of candidate heads
      k <- k + 1
    }
    # append all new association rules to rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}

To verify the output of the R code, the arules and Rattle packages are used.

Tip

The arules (Hahsler et al., 2011) and Rattle packages provide support for association rule analysis. The arulesViz package is used to visualize the resulting association rules.
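As a hedged sketch of such a cross-check, the example dataset from the FP-growth section can be passed to `apriori()` from arules. The thresholds below are assumptions chosen for this illustration: a relative support of 0.4 corresponds to a minimum count of 3 out of 7 transactions, and 0.7 is an arbitrary minimum confidence. The code is skipped when the package is not installed.

```r
transactions <- list(
  c("A", "B", "C", "D", "E"), c("A", "B", "C", "E"), c("A", "D", "E"),
  c("B", "E", "D"), c("B", "E", "C"), c("E", "C", "D"), c("E", "D")
)

if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  # coerce the list of item vectors to the arules transactions class
  trans <- as(transactions, "transactions")
  rules <- apriori(
    trans,
    parameter = list(support = 0.4, confidence = 0.7, minlen = 2),
    control = list(verbose = FALSE)
  )
  inspect(rules)  # print each rule with its support and confidence
}
```

Note that arules reports support as a fraction of the transactions rather than as a count, so the values need to be multiplied by 7 before comparing them with the support counts in the tables above.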

The R implementation

Here is the R source code of the main FP-growth algorithm:

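# r: FP-tree (or projected FP-tree), p: current prefix itemset,
# f: set of frequent itemsets found so far, MIN_SUP: minimum support count.
# The helper functions are not shown; they are assumed to be defined elsewhere.
FPGrowth <- function(r, p, f, MIN_SUP) {
  RemoveInfrequentItems(r)
  if (IsPath(r)) {
    # r is a single path: combine every subset of the path with the prefix p
    y <- GetSubset(r)
    len4y <- GetLength(y)
    for (idx in seq_len(len4y)) {
      x <- MergeSet(p, y[idx])
      SetSupportCount(x, GetMinCnt(x))
      Add2Set(f, x, GetSupportCount(x))
    }
  } else {
    len4r <- GetLength(r)
    for (idx in seq_len(len4r)) {
      x <- MergeSet(p, r[idx])
      SetSupportCount(x, GetSupportCount(r[idx]))
      # build the projected FP-tree for x from the paths leading to item idx
      rx <- CreateProjectedFPTree()
      path4idx <- GetAllSubPath(PathFromRoot(r, idx))
      len4path <- GetLength(path4idx)
      for (jdx in seq_len(len4path)) {
        CountCntOnPath(r, idx, path4idx, jdx)
        InsertPath2ProjectedFPTree(rx, idx, path4idx, jdx, GetCnt(idx))
      }
      # recurse on the non-empty projected tree
      if (!IsEmpty(rx)) {
        FPGrowth(rx, x, f, MIN_SUP)
      }
    }
  }
}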
The GenMax algorithm with maximal frequent itemsets

The GenMax algorithm is used to mine maximal frequent itemsets (MFI). It applies the maximality property, that is, extra steps are added to check whether a frequent itemset is maximal rather than merely frequent. It is based partially on the tidset intersection from the Eclat algorithm. The diffset (differential set), which is the difference between the tidsets of the two corresponding itemsets, is used for fast frequency testing.
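The relationship between tidset intersection and diffsets can be illustrated with a minimal base R sketch. The item names and transaction IDs are made up for illustration, and the variable names are assumptions, not part of the book's code.

```r
# Toy vertical database: each item maps to its tidset (transaction IDs).
# Items and tids are invented for illustration only.
tidsets <- list(
  A = c(1, 3, 4, 5),
  B = c(1, 2, 3, 4, 5, 6),
  C = c(2, 4, 5, 6)
)

# The tidset of the combined itemset AC is the intersection of both tidsets.
tidset_ac <- intersect(tidsets$A, tidsets$C)

# The diffset of AC relative to A holds the tids that contain A but not C.
diffset_ac <- setdiff(tidsets$A, tidsets$C)

# Fast frequency test: sup(AC) = sup(A) - |diffset(AC)|.
sup_ac_via_diffset <- length(tidsets$A) - length(diffset_ac)
sup_ac_via_tidset <- length(tidset_ac)
```

Both routes give the same support count; the diffset route only needs the (often much smaller) difference set once the support of the prefix is known.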

A candidate MFI is checked against the definition. Let M be the current set of MFIs: if some X in M is a superset of the newly found frequent itemset Y, then Y is discarded; if instead X is a subset of Y, then X is removed from M.
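The maximality check described above can be sketched in base R as follows. This is only an illustration under simple assumptions: itemsets are character vectors, M is a plain list, and update_mfi is a hypothetical helper name that does not appear in the book's code.

```r
# Update the set M of maximal frequent itemsets with a new frequent itemset y.
# M is a list of character vectors; update_mfi is a hypothetical helper.
update_mfi <- function(M, y) {
  is_subset <- function(a, b) all(a %in% b)
  # If some X in M is a superset of y, y is not maximal: discard it.
  for (x in M) {
    if (is_subset(y, x)) return(M)
  }
  # Otherwise remove every X that is a subset of y, then add y.
  keep <- vapply(M, function(x) !is_subset(x, y), logical(1))
  c(M[keep], list(y))
}

M <- list(c("a", "b"), c("c"))
M <- update_mfi(M, c("a"))       # discarded: {a} is a subset of {a, b}
M <- update_mfi(M, c("c", "d"))  # {c} is removed and {c, d} is added
```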

Here is the pseudocode of the GenMax algorithm, including the initialization performed before it is called, where D is the input transaction dataset:

[Figure: pseudocode of the GenMax algorithm with maximal frequent itemsets]

The R implementation

Here is the R source code of the main GenMax algorithm:

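# p: list of frequent itemsets with their tidsets, m: set of maximal
# frequent itemsets found so far, MIN_SUP: minimum support count.
# The helper functions are not shown; they are assumed to be defined elsewhere.
GenMax <- function(p, m, MIN_SUP) {
  # stop if the union of all itemsets in p already has a superset in m
  y <- GetItemsetUnion(p)
  if (SuperSetExists(m, y)) {
    return(invisible(NULL))
  }
  len4p <- GetLength(p)
  for (idx in seq_len(len4p)) {
    q <- GenerateFrequentTidSet()
    # combine p[[idx]] with every later itemset (empty when idx == len4p)
    for (jdx in idx + seq_len(len4p - idx)) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        AddFrequentItemset(q, xij, GetSupport(xij))
      }
    }
    if (!IsEmpty(q)) {
      GenMax(q, m, MIN_SUP)
    } else if (!SuperSetExists(m, p[[idx]])) {
      # no frequent extension and no superset in m: p[[idx]] is maximal
      Add2MFI(m, p[[idx]])
    }
  }
}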
The Charm algorithm with closed frequent itemsets

Closure checks are performed during the mining of closed frequent itemsets. A closed frequent itemset is the longest frequent pattern among those that have the same support, so keeping only closed itemsets prunes frequent patterns that are redundant. The Charm algorithm also uses the vertical tidset intersection for efficient closure checks.
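As a small illustration of what a closure check decides, the following base R sketch marks which itemsets in a toy collection are closed, that is, have no proper superset with the same support. The itemsets, support counts, and the name is_closed are assumptions made for this example; it is a brute-force check, not the Charm algorithm itself.

```r
# Frequent itemsets with their support counts (toy values for illustration).
itemsets <- list(c("a"), c("b"), c("a", "b"), c("a", "b", "c"))
supports <- c(4, 5, 4, 2)

# An itemset is closed if no proper superset has the same support.
is_closed <- function(i) {
  for (j in seq_along(itemsets)) {
    proper_superset <- length(itemsets[[j]]) > length(itemsets[[i]]) &&
      all(itemsets[[i]] %in% itemsets[[j]])
    if (proper_superset && supports[j] == supports[i]) return(FALSE)
  }
  TRUE
}

closed <- vapply(seq_along(itemsets), is_closed, logical(1))
```

Here {a} is not closed because {a, b} has the same support, so {a} is redundant once {a, b} is reported.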

Here is the pseudocode of the Charm algorithm, including the initialization performed before it is called, where D is the input transaction dataset:

[Figure: pseudocode of the Charm algorithm with closed frequent itemsets]

The R implementation

Here is the R source code of the main algorithm:

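# p: list of frequent itemsets with their tidsets, c: set of closed
# frequent itemsets found so far, MIN_SUP: minimum support count.
# The helper functions are not shown; they are assumed to be defined elsewhere.
Charm <- function(p, c, MIN_SUP) {
  SortBySupportCount(p)
  len4p <- GetLength(p)
  for (idx in seq_len(len4p)) {
    q <- GenerateFrequentTidSet()
    # combine p[[idx]] with every later itemset (empty when idx == len4p)
    for (jdx in idx + seq_len(len4p - idx)) {
      xij <- MergeTidSets(p[[idx]], p[[jdx]])
      if (GetSupport(xij) >= MIN_SUP) {
        if (IsSameTidSets(p, idx, jdx)) {
          # identical tidsets: merge jdx into idx and drop jdx
          ReplaceTidSetBy(p, idx, xij)
          ReplaceTidSetBy(q, idx, xij)
          RemoveTidSet(p, jdx)
        } else {
          if (IsSuperSet(p[[idx]], p[[jdx]])) {
            ReplaceTidSetBy(p, idx, xij)
            ReplaceTidSetBy(q, idx, xij)
          } else {
            Add2CFI(q, xij)
          }
        }
      }
    }
    if (!IsEmpty(q)) {
      Charm(q, c, MIN_SUP)
    }
    # add p[[idx]] to the closed set c if no superset of it exists in c
    if (!IsSuperSetExists(c, p[[idx]])) {
      Add2CFI(c, p[[idx]])
    }
  }
}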
The algorithm to generate association rules

While the A-Priori algorithm generates frequent itemsets, the support count of each frequent itemset is calculated and recorded for later use in association rules mining.

To generate association rules of the form X ⇒ (l − X), where l is a frequent itemset, two steps are needed:

  • First, get all the nonempty subsets of l.
  • Then, for each nonempty proper subset X of l, the rule X ⇒ (l − X) is a strong association rule only if support_count(l) / support_count(X) ≥ min_conf, where min_conf is the minimum confidence threshold. Because every such rule is generated from a frequent itemset, its support count is not less than the minimum support count.

Here is the pseudocode:

[Figure: pseudocode of the algorithm to generate association rules]
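The two steps can also be sketched directly in base R. This is a minimal illustration rather than the book's implementation: the support counts are toy numbers, itemsets are keyed by their sorted items joined with commas, and key and generate_rules are hypothetical helper names.

```r
# Support counts of a frequent itemset and all of its nonempty subsets
# (toy numbers; keys are sorted items joined by ",").
support_count <- c("a" = 6, "b" = 7, "c" = 6,
                   "a,b" = 4, "a,c" = 4, "b,c" = 4, "a,b,c" = 2)
key <- function(items) paste(sort(items), collapse = ",")

# Generate every rule X => (l - X) from the frequent itemset l whose
# confidence support_count(l) / support_count(X) reaches min_conf.
generate_rules <- function(l, support_count, min_conf) {
  rules <- list()
  for (size in seq_len(length(l) - 1)) {
    for (x in combn(l, size, simplify = FALSE)) {
      conf <- support_count[[key(l)]] / support_count[[key(x)]]
      if (conf >= min_conf) {
        rules[[length(rules) + 1]] <- data.frame(
          lhs = key(x), rhs = key(setdiff(l, x)), confidence = conf,
          stringsAsFactors = FALSE)
      }
    }
  }
  # Returns NULL when no rule is confident enough.
  do.call(rbind, rules)
}

rules <- generate_rules(c("a", "b", "c"), support_count, min_conf = 0.5)
```

With these numbers, only the rules with two items on the left-hand side reach the confidence threshold, because the single items are much more frequent than the full itemset.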

The R implementation

The R code of the algorithm to generate association rules from A-Priori frequent itemsets is as follows:

# D: transaction dataset, F: list of frequent itemsets,
# MIN_SUP: minimum support count, MIN_CONF: minimum confidence.
# The helper functions are not shown; they are assumed to be defined elsewhere.
AprioriGenerateRules <- function(D, F, MIN_SUP, MIN_CONF) {
  # create empty rule set
  r <- CreateRuleSets()
  len4f <- length(F)
  for (idx in seq_len(len4f)) {
    # add rule F[[idx]] => {}
    AddRule2RuleSets(r, F[[idx]], NULL)
    c <- list()
    c[[1]] <- CreateItemSets(F[[idx]])
    h <- list()
    k <- 1
    while (!IsEmptyItemSets(c[[k]])) {
      # get heads of confident association rules in c[[k]]
      h[[k]] <- getPrefixOfConfidentRules(c[[k]], F[[idx]], D, MIN_CONF)
      c[[k + 1]] <- CreateItemSets()

      # get candidate heads by joining neighbouring heads
      len4hk <- length(h[[k]])
      for (jdx in seq_len(max(len4hk - 1, 0))) {
        if (Match4Itemsets(h[[k]][jdx], h[[k]][jdx + 1])) {
          tempItemset <- CreateItemset(h[[k]][jdx], h[[k]][jdx + 1][k])
          if (IsSubset2Itemsets(h[[k]], tempItemset)) {
            Append2ItemSets(c[[k + 1]], tempItemset)
          }
        }
      }
      # move on to the next level; without this the loop never terminates
      k <- k + 1
    }
    # append all new association rules to rule set
    AddRule2RuleSets(r, F[[idx]], h)
  }
  r
}

To verify the R code, its output can be compared with the results produced by the arules and rattle packages.

Tip

The arules (Hahsler et al., 2011) and rattle packages provide support for association rule analysis. The arulesViz package is used to visualize the resulting association rules.

Hybrid association rules mining

There are two interesting applications of association rules mining: one is multilevel and multidimensional association rules mining, while the other is constraint-based mining.

Mining multilevel and multidimensional association rules

For a given transactional dataset, if a conceptual hierarchy exists for some dimensions of the dataset, then we can apply multilevel association rules mining to this dataset. Any association rules mining algorithm applicable to the transaction dataset can be used for this task. The following table shows an example from the Amazon store:

TID | Items purchased
1 | Dell Venue 7 16 GB Tablet, HP Pavilion 17-e140us 17.3-Inch Laptop...
2 | Samsung Galaxy Tab 3 Lite, Razer Edge Pro 256GB Tablet…
3 | Acer C720P-2666 Chromebook, Logitech Wireless Combo MK270 with Keyboard and Mouse…
4 | Toshiba CB35-A3120 13.3-Inch Chromebook, Samsung Galaxy Tab 3 (7-Inch, White)…

The following flowchart explains multilevel pattern mining:

[Figure: flowchart of mining multilevel and multidimensional association rules]

Based on the conceptual hierarchy, lower-level concepts can be projected to higher-level concepts, and the new dataset with higher-level concepts can replace the original dataset with lower-level concepts.
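A minimal base R sketch of this projection is shown here. The concept hierarchy (which product belongs to which category) and the shortened product names are assumptions made for the example, and project_up is a hypothetical helper name.

```r
# Hypothetical concept hierarchy: item -> higher-level concept.
hierarchy <- c(
  "Dell Venue 7 16 GB Tablet"    = "Tablet",
  "Samsung Galaxy Tab 3 Lite"    = "Tablet",
  "HP Pavilion 17-e140us Laptop" = "Laptop",
  "Acer C720P-2666 Chromebook"   = "Laptop"
)

# Toy transactions at the lowest conceptual level.
transactions <- list(
  c("Dell Venue 7 16 GB Tablet", "HP Pavilion 17-e140us Laptop"),
  c("Samsung Galaxy Tab 3 Lite", "Dell Venue 7 16 GB Tablet"),
  c("Acer C720P-2666 Chromebook")
)

# Replace each item by its higher-level concept and drop duplicates.
project_up <- function(transaction, hierarchy) {
  unique(unname(hierarchy[transaction]))
}
higher_level <- lapply(transactions, project_up, hierarchy = hierarchy)

# Support count of a concept = number of transactions that contain it.
concept_support <- table(unlist(higher_level))
```

The projected transactions can then be passed to any frequent itemset algorithm in place of the original ones.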

The support counts are calculated at each conceptual level. Many A-Priori-like algorithms treat the support count slightly differently; the possible treatments include the following:

  • A uniform minimum support threshold is used across all the levels
  • Reduced minimum support threshold is used for lower levels
  • Group-based minimum support threshold
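The difference between a uniform threshold and a reduced threshold for lower levels can be seen in a short base R sketch. The support counts, level names, and thresholds are invented for illustration.

```r
# Support counts per conceptual level (toy numbers for illustration).
level_support <- list(
  level1 = c("Tablet" = 60, "Laptop" = 45),
  level2 = c("Dell Venue 7" = 12, "Galaxy Tab 3" = 30, "HP Pavilion 17" = 8)
)

# Reduced minimum support count at the lower (more specific) level.
min_sup <- c(level1 = 50, level2 = 10)

# Keep the concepts that reach the threshold of their own level.
frequent <- lapply(names(level_support), function(lv) {
  s <- level_support[[lv]]
  names(s)[s >= min_sup[[lv]]]
})
names(frequent) <- names(level_support)
```

With a uniform threshold of 50, no level 2 item would be frequent; the reduced threshold lets specific products surface as well.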

Note

The A-Priori property does not always hold here; there are some exceptions.

Multilevel association rules are mined from multiple levels of the conceptual hierarchy.

Constraint-based frequent pattern mining

Constraint-based frequent pattern mining is a heuristic method with some user-specified constraints to prune the search space.

Common constraints include, but are not limited to, the following:

  • Knowledge-type constraint (specifies what we are going to mine)
  • Data constraint (limits the portion of the original dataset to be mined)
  • Dimension-level constraints
  • Interestingness constraints
  • Rule constraints
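As a rough illustration of two of these constraint types, the following base R sketch filters a small table of rules. The rules and thresholds are made up, and the constraints are applied here as a simple post-filter; constraint-based mining algorithms would instead push such constraints into the search to prune candidates early.

```r
# Toy rule table; values are invented for illustration.
rules <- data.frame(
  lhs = c("Tablet", "Laptop", "Tablet,Laptop"),
  rhs = c("Keyboard", "Mouse", "Mouse"),
  support = c(0.20, 0.05, 0.10),
  confidence = c(0.60, 0.90, 0.40),
  stringsAsFactors = FALSE
)

# Interestingness constraint: minimum support and minimum confidence.
interesting <- rules$support >= 0.10 & rules$confidence >= 0.50

# Rule constraint: the antecedent must mention "Tablet".
rule_ok <- grepl("Tablet", rules$lhs, fixed = TRUE)

selected <- rules[interesting & rule_ok, ]
```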

Mining sequence dataset

Sequential pattern mining is the major task in sequence dataset mining. A-Priori-like algorithms mine sequential patterns with a breadth-first strategy, whereas pattern-growth methods use a depth-first strategy. These algorithms are sometimes combined with constraints for various reasons.

The common purchase patterns of a store's customers can be mined from sequential patterns. Sequential patterns also play an important role in other areas, especially advertising and marketing campaigns. In addition, individual customer behavior can be predicted from sequential patterns in domains such as web log mining, web page recommendation systems, bioinformatics analysis, medical treatment sequence tracking and analysis, and disaster prevention and safety management.

The rules in this chapter, which are mined from sequence patterns, are of many types. Some of them are listed as follows:

  • A sequential rule is of the form $X \rightarrow Y$, where $X \subset Y$, that is, Y is a sequence and X is a proper subsequence of Y
  • A label sequential rule (LSR) is of the form $X \rightarrow Y$, where Y is a sequence, and X is a sequence generated from Y by replacing some of its items with wildcards
  • A class sequential rule (CSR) is of the form $X \rightarrow y$, where X is a sequence and y is a class label

Sequence dataset

A sequence dataset S is defined as a set of tuples, (sid, s), in which sid is a sequence ID, and s is a sequence.

The support of a sequence X in a sequence dataset S is the number of tuples in S whose sequence contains X: $support_S(X) = |\{(sid, s) \in S \mid X \subseteq s\}|$.

Sequential patterns have an intrinsic property that the related algorithms rely on, just as the A-Priori algorithm relies on the A-Priori property. For a sequence X and its subsequence Y, $support_S(X) \le support_S(Y)$.
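
The two definitions above can be checked on a small example. The following base R sketch assumes, for simplicity, that every element of a sequence is a single item; the dataset `S` and the helper names `is_subsequence` and `sequence_support` are made up for this illustration.

```r
# A toy sequence dataset (hypothetical): names are sequence IDs (sid),
# values are sequences of single items.
S <- list(
  s1 = c("a", "b", "c", "d"),
  s2 = c("a", "c", "d"),
  s3 = c("b", "a", "d"),
  s4 = c("c", "b")
)

# TRUE if x is a subsequence of s (order is kept, gaps are allowed).
is_subsequence <- function(x, s) {
  pos <- 0
  for (item in x) {
    hits <- which(s == item)
    hits <- hits[hits > pos]
    if (length(hits) == 0) return(FALSE)
    pos <- hits[1]  # greedily match the earliest possible position
  }
  TRUE
}

# Support of x in S: number of tuples (sid, s) whose sequence s contains x.
sequence_support <- function(x, S) {
  sum(vapply(S, function(s) is_subsequence(x, s), logical(1)))
}

sup_ad  <- sequence_support(c("a", "d"), S)       # Y = <a, d>: 3
sup_acd <- sequence_support(c("a", "c", "d"), S)  # X = <a, c, d>: 2
```

Because `<a, d>` is a subsequence of `<a, c, d>`, every sequence that contains the longer pattern also contains the shorter one, so the support of the longer pattern can never exceed that of the shorter one.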

The GSP algorithm

The generalized sequential patterns (GSP) algorithm is an A-Priori-like algorithm, but it is applied to sequence patterns. It is a level-wise algorithm and has a breadth-first approach. Here is the feature list:

  • GSP is an extension of the A-Priori algorithm

    It uses the A-Priori property (downward-closed), that is, given the minimum support count, if a sequence is not accepted, all its supersequences are discarded.

  • It requires multiple passes over the initial transaction dataset
  • It uses the horizontal data format
  • In each pass, the candidate set is generated by a self-join of the patterns found in the previous pass
  • In pass k, a sequence pattern is accepted only if all its (k-1)-subpatterns were accepted in pass (k-1)

An overview of the GSP algorithm is shown in the following figure:

[Figure: overview of the GSP algorithm]

Here is the pseudocode:

[Figure: pseudocode of the GSP algorithm]

The R implementation

Here is the R source code of the main algorithm:

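GSP <- function(d, I, MIN_SUP) {
  f <- NULL                      # frequent sequences found so far
  c <- list()                    # c[[k]] holds the candidate prefix tree of level k
  c[[1]] <- CreateInitalPrefixTree(NULL)
  len4I <- GetLength(I)
  for (idx in seq_len(len4I)) {  # seed level 1 with every single item
    SetSupportCount(I[idx], 0)
    AddChild2Node(c[[1]], I[idx], NULL)
  }
  k <- 1
  while (!IsEmpty(c[[k]])) {
    ComputeSupportCount(c[[k]], d)   # one pass over the dataset per level
    while (TRUE) {
      r <- GetLeaf(c[[k]])
      if (is.null(r)) {              # is.null() is the reliable NULL test in R
        break
      }
      if (GetSupport(r) >= MIN_SUP) {
        AddFrequentItemset(f, r, GetSupport(r))
      } else {
        RemoveLeaf(c[[k]], r)        # prune the infrequent leaf
      }
    }
    c[[k + 1]] <- ExtendPrefixTree(c[[k]])   # build the next level's candidates
    k <- k + 1
  }
  f
}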
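
The helper functions called in the listing above are not shown in this section, so it cannot be run on its own. As a rough, self-contained sketch of the same level-wise idea, the following base R code mines sequences whose elements are single items. It replaces the prefix tree with plain lists, and the names `gsp_sketch` and `is_subsequence` as well as the toy dataset are assumptions of this example rather than the book's implementation.

```r
# TRUE if x is a subsequence of s (order is kept, gaps are allowed).
is_subsequence <- function(x, s) {
  pos <- 0
  for (item in x) {
    hits <- which(s == item)
    hits <- hits[hits > pos]
    if (length(hits) == 0) return(FALSE)
    pos <- hits[1]
  }
  TRUE
}

gsp_sketch <- function(S, min_sup) {
  support <- function(x) {
    sum(vapply(S, function(s) is_subsequence(x, s), logical(1)))
  }
  key <- function(x) paste(x, collapse = ">")

  # Pass 1: frequent 1-sequences.
  level <- as.list(sort(unique(unname(unlist(S)))))
  level <- Filter(function(x) support(x) >= min_sup, level)
  frequent <- list()

  while (length(level) > 0) {
    for (x in level) frequent[[key(x)]] <- support(x)
    known <- vapply(level, key, character(1))

    # Self-join: s1 without its first item must equal s2 without its last item.
    candidates <- list()
    for (s1 in level) {
      for (s2 in level) {
        if (identical(s1[-1], s2[-length(s2)])) {
          candidates[[length(candidates) + 1]] <- c(s1, s2[length(s2)])
        }
      }
    }

    # Prune: every subsequence obtained by dropping one item must be frequent.
    candidates <- Filter(function(cand) {
      all(vapply(seq_along(cand),
                 function(i) key(cand[-i]) %in% known, logical(1)))
    }, candidates)

    # Count: one pass over the dataset for the surviving candidates.
    level <- Filter(function(x) support(x) >= min_sup, candidates)
  }
  unlist(frequent)
}

# Toy sequence dataset (hypothetical).
S <- list(
  s1 = c("a", "b", "c", "d"),
  s2 = c("a", "c", "d"),
  s3 = c("b", "a", "d"),
  s4 = c("c", "b")
)
patterns <- gsp_sketch(S, min_sup = 2)
```

With a minimum support count of 2, the result is a named vector of support counts whose names are patterns such as `a>d` and `a>c>d`.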

The SPADE algorithm

Sequential Pattern Discovery using Equivalence classes (SPADE) is a vertical sequence-mining algorithm applied to sequence patterns; it has a depth-first approach. Here are the features of the SPADE algorithm:

  • SPADE is an extension of the A-Priori algorithm
  • It uses the A-Priori property
  • Multiple passes over the initial transaction dataset are required
  • The vertical data format is used
  • It uses a simple join operation
  • All sequences are found in three dataset passes

A short description of the SPADE algorithm is shown in the following figure:

[Figure: short description of the SPADE algorithm]

Here is the pseudocode, starting with the steps performed before calling the SPADE algorithm:

[Figure: pseudocode of the SPADE algorithm]

The R implementation

Here is the R source code of the main algorithm:

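SPADE <- function(p, f, k, MIN_SUP) {
  len4p <- GetLength(p)
  for (idx in seq_len(len4p)) {
    # every member of the current equivalence class is frequent
    AddFrequentItemset(f, p[[idx]], GetSupport(p[[idx]]))
    pa <- GetFrequentTidSets(NULL, MIN_SUP)   # new class with prefix p[[idx]]
    for (jdx in seq_len(len4p)) {
      xab <- CreateTidSets(p[[idx]], p[[jdx]], k)   # join the two id-lists
      if (GetSupport(xab) >= MIN_SUP) {
        AddFrequentTidSets(pa, xab)
      }
    }
    if (!IsEmptyTidSets(pa)) {
      SPADE(pa, f, k + 1, MIN_SUP)   # recurse on the new class
    }
  }
}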
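
To make the vertical data format and the join operation more concrete, here is a small base R sketch. It stores one id-list of (sid, eid) pairs per item and joins two id-lists so that the second item occurs after the first in the same sequence. The `events` table and the function names `temporal_join` and `idlist_support` are assumptions of this example; they stand in for helpers such as `CreateTidSets`, whose code is not shown in this section.

```r
# Horizontal toy data (hypothetical): one row per event.
events <- data.frame(
  sid  = c(1, 1, 1, 2, 2, 3, 3, 3),   # sequence ID
  eid  = c(1, 2, 3, 1, 2, 1, 2, 3),   # event position within the sequence
  item = c("a", "b", "c", "a", "c", "b", "a", "c"),
  stringsAsFactors = FALSE
)

# Vertical format: one id-list of (sid, eid) pairs per item.
idlists <- split(events[, c("sid", "eid")], events$item)

# Temporal join: keep occurrences of the second pattern that come after
# some occurrence of the first pattern in the same sequence.
temporal_join <- function(la, lb) {
  m <- merge(la, lb, by = "sid", suffixes = c(".a", ".b"))
  m <- m[m$eid.a < m$eid.b, ]
  unique(data.frame(sid = m$sid, eid = m$eid.b))
}

# Support is the number of distinct sequences in an id-list.
idlist_support <- function(l) length(unique(l$sid))

ab  <- temporal_join(idlists$a, idlists$b)   # pattern <a, b>
ac  <- temporal_join(idlists$a, idlists$c)   # pattern <a, c>
abc <- temporal_join(ab, idlists$c)          # pattern <a, b, c>
```

Once the id-lists are built, the support of a longer pattern is obtained from the id-lists of shorter patterns alone, without going back to the original dataset.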

Rule generation from sequential patterns

Sequential rules, label sequential rules, and class sequential rules can all be generated from the sequential patterns found by the preceding sequential pattern discovery algorithms.
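
As a minimal sketch of this step, the following base R code derives sequential rules from a set of frequent patterns and their support counts. It assumes the common definition of confidence for a rule from X to Y as support(Y) divided by support(X), with X a proper subsequence of Y. The pattern supports, the `>` separator, and the function names are assumptions of this example.

```r
# Frequent patterns and their support counts (hypothetical values).
patterns <- c("a" = 3, "c" = 3, "d" = 3,
              "a>c" = 2, "a>d" = 3, "c>d" = 2, "a>c>d" = 2)

split_pattern <- function(k) strsplit(k, ">", fixed = TRUE)[[1]]

# TRUE if x is a subsequence of s (order is kept, gaps are allowed).
is_subsequence <- function(x, s) {
  pos <- 0
  for (item in x) {
    hits <- which(s == item)
    hits <- hits[hits > pos]
    if (length(hits) == 0) return(FALSE)
    pos <- hits[1]
  }
  TRUE
}

sequential_rules <- function(patterns, min_conf) {
  lhs <- character(0); rhs <- character(0); conf <- numeric(0)
  for (y in names(patterns)) {
    for (x in names(patterns)) {
      xs <- split_pattern(x)
      ys <- split_pattern(y)
      # X must be a proper subsequence of Y
      if (length(xs) < length(ys) && is_subsequence(xs, ys)) {
        confidence <- patterns[[y]] / patterns[[x]]
        if (confidence >= min_conf) {
          lhs <- c(lhs, x); rhs <- c(rhs, y); conf <- c(conf, confidence)
        }
      }
    }
  }
  data.frame(lhs = lhs, rhs = rhs, confidence = conf,
             stringsAsFactors = FALSE)
}

rules <- sequential_rules(patterns, min_conf = 0.8)
```

Each row of `rules` reads as "sequences that contain `lhs` also contain `rhs` with the given confidence".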

High-performance algorithms

As datasets grow, there is a steady demand for high-performance association and pattern mining algorithms.

The introduction of Hadoop and other MapReduce-like platforms offers a chance to meet this demand. I will discuss this further in the upcoming chapters. Depending on the size of the dataset, some algorithms need to be revised and adjusted; a recursive algorithm, for example, will eventually run out of space on the call stack and can be challenging to convert to MapReduce.

Time for action

To enhance your knowledge about this chapter, here are some practice questions that'll let you understand the concepts better:

  • Write an R program to find how many unique items' names are contained in the given sample market basket transaction file. Map each item's name to a unique integer ID. Find out all the closed frequent itemsets. Find out all the maximal frequent itemsets and their support count. Set a support count threshold to various values yourself.
  • Write an R program to implement the A-PrioriTid algorithm.

Summary

In this chapter, we looked at the following topics:

  • Market basket analysis
  • As the first step of association rule mining, the frequent itemset is the key factor. Along with the algorithm design, closed itemsets and maximal frequent itemsets are defined too.
  • As the target of association rule mining, association rules are mined with the measures of support count and confidence. Correlation rules are mined with the correlation formulae, in addition to the support count.
  • Monotonicity of frequent itemset; if an itemset is frequent, then all its subsets are frequent.
  • The A-Priori algorithm, which is the first efficient mining algorithm to mine frequent patterns; many variants originated from it.
  • Frequent patterns in sequence.

The next chapter will cover the basic classification algorithms, including ID3, C4.5, and CART. Classification is a major application of data mining.

 

Chapter 3. Classification

In this chapter, you will learn the popular classification algorithms written in the R language. Empirical classifier performance and accuracy benchmarks are also included. Along with the introduction of various classification algorithms, you will also learn various ways to improve the classifier.

Classification has massive applications in modern life. With the exponential growth of information, there is a need for high-performance classification algorithms that judge whether an event or object belongs to one of a predefined set of categories. Such algorithms can be applied in a wide variety of industries, such as bioinformatics, cybercrime, and banking. Successful classification algorithms learn the predefined categories from training datasets and then predict the unknown category of a single event, given a common set of features.

Along with the continual growth of computer science, classification algorithms need to be implemented on many diverse platforms, including distributed infrastructure, cloud environments, real-time devices, and parallel computing systems.

In this chapter, we will cover the following topics:

  • Classification
  • Generic decision tree induction
  • High-value credit card customers classification using ID3
  • Web spam detection using C4.5
  • Web key resource page judgment using CART
  • Trojan traffic identification method and Bayes classification
  • Spam e-mail identification and Naïve Bayes classification
  • Rule-based classification and the player types in computer games

Classification

Given a set of predefined class labels, the task of classification is to assign a label to each data object of the input dataset using the classifier's trained model. Typically, the input can be discrete or continuous, but the output is a discrete value, such as a binary or nominal value. Classification algorithms are often described as learning a model or function, in which x is a tuple of attribute values that are discrete or continuous, and y is an attribute with a discrete value such as a categorical label:

$$y = f(x)$$

This function can also be treated as a classification model. It can be used to distinguish objects belonging to different classes or to predict the class y of a new tuple x in the above (x, y). From another point of view, classification algorithms aim to find a model from the input data and to apply this model to future predictions, given a common set of attributes.

Generally speaking, a set of attributes is selected as the input for the classification system. There are special algorithms used to select only the useful attributes from this set to ensure the efficiency of the classification system.

Almost every classification task needs this preprocessing procedure, but the exact means vary from case to case. Here are three mainstream methods applied:

  • Data cleaning
  • Relevance analysis
  • Data transformation and reduction

A standard classification process often includes two steps. A classification model whose accuracy is acceptable is adopted as the classifier for datasets in production. The two steps are as follows, and the diagram illustrates them with an example:

  • Training (supervised learning): The classification model is built upon the training dataset, that is, the (instance, class label) pairs
  • Classification validation: The accuracy of the model is checked with the test dataset to decide whether to accept the model

[Figure: the two steps of the classification process, illustrated with an example]
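
The two steps can be sketched in a few lines of base R. This example uses the built-in `iris` dataset and a nearest-centroid classifier purely as a stand-in model; neither is part of this chapter, and the 70/30 split and the seed are arbitrary assumptions.

```r
set.seed(42)  # arbitrary seed, only to make the split repeatable

# Hold out 30% of the (instance, class label) pairs as the test dataset.
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Step 1, training: the "model" is one mean vector (centroid) per class.
centroids <- aggregate(train[, features],
                       by = list(Species = train$Species), FUN = mean)

# Predict the class whose centroid is closest to each record.
predict_centroid <- function(newdata) {
  x  <- as.matrix(newdata[, features])
  cm <- as.matrix(centroids[, features])
  # squared Euclidean distance of every record to every centroid
  d <- matrix(apply(cm, 1, function(center) rowSums(sweep(x, 2, center)^2)),
              nrow = nrow(x))
  as.character(centroids$Species[max.col(-d, ties.method = "first")])
}

# Step 2, validation: check the accuracy on the held-out test dataset.
accuracy <- mean(predict_centroid(test) == as.character(test$Species))
```

Whether the resulting `accuracy` is high enough to accept the model is a decision that depends on the application.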

In the following sections, we will introduce some classification algorithms with different designs.

Generic decision tree induction

There are various definitions of the term decision tree. Most commonly, a decision tree represents the process of judging the class of a given data instance or record by moving from the root node down to some leaf node. As a major classification model, decision tree induction builds a decision tree from the input dataset and its class labels. A decision tree can be applied to many attribute data types, including, but not limited to, nominal, categorical, numeric, and symbolic data, and mixtures of them. The following listing illustrates Hunt's decision tree definition. Step #7 applies a selected attribute test condition to partition the records into smaller datasets.

[Figure: Hunt's decision tree algorithm]

The decision tree is popular for its simplicity and low computational effort compared to other algorithms. Here are some characteristics of the decision tree induction:

  • The greedy strategy is usually applied to the decision tree.
  • It infers the decision tree once, from the entire training dataset.
  • The algorithm requires no parameters to obtain the classification model from the input dataset.
  • Like many other tasks, finding an optimal decision tree is an NP-complete problem.
  • The algorithm to build the decision tree is fast; tree construction is efficient, even on large datasets.
  • It provides an expressive representation for discrete-valued functions.
  • It is robust to noise.
  • Using top-down, recursive partitioning, most decision tree algorithms apply the divide-and-conquer strategy.
  • The size of the sample dataset usually decreases dramatically as the tree is traversed downward.
  • A subtree can be replicated many times in the decision tree.
  • The test condition usually contains only one attribute.
  • The performance of decision tree algorithms is affected by the impurity measure.

Consider a decision tree when the instances in the source dataset can be described by attribute-value pairs and the target function has discrete values, even if the training dataset contains some noise.

An example of a decision tree built with the input dataset in a table (the classical play golf dataset) is shown in the following diagram. The decision tree is composed of three kinds of nodes: the root node, the internal nodes, and the leaf nodes. Each leaf is given a class label. Nodes other than leaves test the attribute set to determine which branch (child) of the node each input record belongs to.

[Figure: the play golf dataset and the decision tree built from it]

Given a built decision tree, a test record can be classified easily. Starting from the root node, apply the test condition of the current node to the test record and move to the child node that corresponds to the test result. Repeat this until a leaf is reached; the leaf's label is the class of the test record.
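
The walk from the root to a leaf can be sketched with nested lists in base R. The tree below is the one commonly cited for the play golf dataset and may differ in detail from the diagram; the constructor names `leaf` and `node` and the function `classify` are assumptions of this example.

```r
# A leaf carries a class label; an internal node tests one attribute and
# has one child per attribute value.
leaf <- function(label) list(label = label)
node <- function(attribute, children) {
  list(attribute = attribute, children = children)
}

# Commonly cited decision tree for the play golf dataset (assumed here).
golf_tree <- node("outlook", list(
  sunny    = node("humidity", list(high = leaf("no"), normal = leaf("yes"))),
  overcast = leaf("yes"),
  rainy    = node("windy", list(true = leaf("no"), false = leaf("yes")))
))

# Walk from the root to a leaf, following the branch that matches the
# record's value for the attribute tested at each node.
classify <- function(tree, record) {
  while (is.null(tree$label)) {
    value <- record[[tree$attribute]]
    tree <- tree$children[[value]]
    if (is.null(tree)) stop("no branch for value: ", value)
  }
  tree$label
}

record <- list(outlook = "sunny", humidity = "normal", windy = "false")
classify(golf_tree, record)  # "yes"
```

Only the attributes that lie on the path are ever inspected, so a record is classified in at most as many tests as the tree is deep.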

Two issues remain. The first is how to divide the training set at a given node as the decision tree grows, that is, which test condition to choose among the various attributes. This is a question of attribute selection measures, which are illustrated in the following section. The second, equally important, issue is model overfitting.

There are two strategies for terminating the growth of a decision tree node. The naïve strategy stops at a node when all the data objects assigned to it belong to the same class, or when all its records have the same attribute values; the node is then assigned the majority class label of the training records within it. The second strategy terminates the algorithm earlier to avoid model overfitting, and will be introduced in the tree pruning section.

Attribute selection measures

The node can have more than two children or branches depending on the attribute test condition and the selected attribute. To split the node, attribute selection measures with various implementations are applied. Attribute selection measures within the same node may also vary for binary branches or multiway branches. Some common attribute selection measures are the following:

  • Entropy: This concept is used in information theory to describe the impurity of an arbitrary collection of data. Given a target attribute with c classes, and $p_i$ as the proportion/probability of S belonging to class i, entropy is defined as follows; Gain is defined in the next point. Entropy measures how disordered a dataset is. The higher the value of entropy, the more uncertainty the source dataset shows.
    $$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
    The size and coverage of the training set assigned to a certain node affect the correctness of the following equations. The gain is better for those situations.
  • Gain: Here, $S_v$ is the subset of S for which attribute A has the value v.
    $$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$
  • Gain Ratio: This is applied in the C4.5 classification algorithm using the following formula:
    $$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}$$
  • Information Gain: The ID3 algorithm uses this statistical property to decide which attribute is selected to be tested at any node in the tree, and measures the association between inputs and outputs of the decision tree.

    With the concept of information gain, the definition of a decision tree can be thought of in this way:

    • A decision tree is a tree-structured plan that uses a set of attribute tests to predict output
    • To decide which attribute should be tested first, simply find the one with the highest information gain
    • It, then, recurs
  • Gini Index: It is used in the CART classification algorithm. The Gini index for a specific split point is calculated using the following equation. It is used to gauge the purity of the split point.
    $$Gini(S) = 1 - \sum_{i=1}^{c} p_i^2, \qquad Gini_A(S) = \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Gini(S_v)$$
  • Split Info:
    $$SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$
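
The measures above can be computed directly in base R. The sketch below uses the standard textbook definitions of entropy, information gain, split info, gain ratio, and the Gini index, and evaluates them on two attributes of the classical play golf dataset. The function names are chosen for this example only.

```r
# Two attributes and the class label of the classical play golf dataset.
golf <- data.frame(
  outlook = c("sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
              "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
              "overcast", "rainy"),
  windy = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
            FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  play = c("no", "no", "yes", "yes", "yes", "no", "yes",
           "no", "yes", "yes", "yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Entropy of a vector of labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gini index of a vector of labels.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Information gain of splitting labels y by attribute values a.
info_gain <- function(y, a) {
  parts <- split(y, a)
  weighted <- sum(vapply(parts, function(part) {
    length(part) / length(y) * entropy(part)
  }, numeric(1)))
  entropy(y) - weighted
}

# Split info is the entropy of the partition sizes themselves.
split_info <- function(a) entropy(a)

gain_ratio <- function(y, a) info_gain(y, a) / split_info(a)

entropy(golf$play)                   # about 0.940
gini(golf$play)                      # about 0.459
info_gain(golf$play, golf$outlook)   # about 0.247
info_gain(golf$play, golf$windy)     # about 0.048
gain_ratio(golf$play, golf$outlook)  # about 0.156
```

Because outlook has a much higher information gain than windy, an algorithm that selects attributes by information gain would test outlook before windy on this data.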

Tree pruning

The initial decision tree is often built with many branches that reflect outliers or noise, which are also common causes of model overfitting. Tree pruning is therefore usually needed as a follow-up step after the tree is built, to achieve higher classification accuracy or a lower error rate. The two types of pruning in production are as follows:

  • Post-pruning: This approach performs tree pruning after the tree has grown to its maximum form. The cost-complexity pruning algorithm used in CART and the pessimistic pruning in C4.5 are both examples of post-pruning.
  • Pre-pruning: This is also known as the early stopping strategy. It avoids generating an overgrown tree by stopping the growth of the tree earlier using additional restrictions, such as a threshold.

Repetition and replication are the two major factors that make decision trees unreasonably large and inefficient.

General algorithm for the decision tree generation

Here is the pseudocode of the general decision tree induction algorithm:

[Figure: pseudocode of the general decision tree induction algorithm]

Another variety of the algorithm is described here, with input parameters as follows:

  • D denotes the training data
  • The leaf size is defined by $\eta$
  • The leaf purity threshold is defined by $\pi$

The output of the algorithm is a decision tree, as shown in the following listing:

[Figure: pseudocode of the decision tree algorithm with the leaf size and leaf purity thresholds]

Line 1 denotes the partition size. Line 4 denotes the stopping condition. Line 9 through line 17 try to get the two branches with the new split. Finally, line 19 applies the algorithm recursively on the two new subbranches to build the subtree. This algorithm is implemented with R in the following section.

The R implementation

The main function of the R code for the generic decision tree induction is listed as follows. Here, data is the input dataset, c is the set of class labels, x is the set of attributes, and yita and pi have the same definitions as the leaf size and the leaf purity threshold in the previous pseudocode:

DecisionTree <- function(data, c, x, yita, pi) {
  result.tree <- NULL
  # stop splitting once the stopping condition on leaf size or purity holds
  if (StoppingCondition(data, c, yita, pi)) {
    result.tree <- CreateLeafNode(data, c, yita, pi)
    return(result.tree)
  }

  # choose the best split and partition the data accordingly
  best.split <- GetBestSplit(data, c, x)
  newdata <- SplitData(data, best.split)

  # build the two subtrees recursively
  tree.yes <- DecisionTree(newdata$yes, c, x, yita, pi)
  tree.no <- DecisionTree(newdata$no, c, x, yita, pi)
  result.tree <- CreateInternalNode(data,
                     best.split, tree.yes, tree.no)

  result.tree
}
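
The helper functions used by DecisionTree are not listed in this section. As one possible reading of the stopping condition and the leaf creation, the following sketch treats purity as the fraction of records in the majority class, stops when a partition has at most yita records or a purity of at least pi, and labels a leaf with its majority class. The function names and their simplified signatures (they take only the vector of class labels) are assumptions of this example, not the book's helpers.

```r
# Purity of a partition: fraction of its records in the majority class.
node_purity <- function(labels) max(table(labels)) / length(labels)

# Hypothetical stand-in for StoppingCondition(): stop when the partition is
# small enough (size <= yita) or pure enough (purity >= pi).
stopping_condition_sketch <- function(labels, yita, pi) {
  length(labels) <= yita || node_purity(labels) >= pi
}

# Hypothetical stand-in for CreateLeafNode(): label the leaf with the
# majority class of the records that reached it.
create_leaf_sketch <- function(labels) {
  counts <- table(labels)
  list(label  = names(counts)[which.max(counts)],
       size   = length(labels),
       purity = node_purity(labels))
}

labels <- c("yes", "yes", "yes", "no")
stopping_condition_sketch(labels, yita = 2, pi = 0.9)  # FALSE: keep splitting
stopping_condition_sketch(labels, yita = 2, pi = 0.7)  # TRUE: purity is 0.75
create_leaf_sketch(labels)$label                       # "yes"
```

Raising yita or lowering pi makes the tree stop earlier, which is one simple form of the pre-pruning strategy described earlier.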

The weather dataset is chosen as a sample to verify the generic decision tree induction algorithm. It comes from the R package rattle and contains 366 examples with 23 attributes and one target, the class label. In R, weather is a data frame that contains 366 observations of 24 variables. The details of the dataset can be retrieved with the following R code:

> library(rattle)
> str(weather)

Attribute selection measures

The node can have more than two children or branches depending on the attribute test condition and the selected attribute. To split the node, attribute selection measures with various implementations are applied. Attribute selection measures within the same node may also vary for binary branches or multiway branches. Some common attribute selection measures are the following:

  • Entropy: This concept is used in information theory to describe the impurity of an arbitrary collection of data. Given the target attribute class set with size of c, and Attribute selection measures as the proportion/probability of S belonging to class i, the definition is here, and the definition Gain is shown in the next point. Entropy always means how disordered a dataset is. The higher the value of entropy, the more the uncertainty shown by the source dataset.
    Attribute selection measures
    The size and coverage of the training set assigned to a certain node affect the correctness of the following equations. The gain is better for those situations.
  • Gain:
    Attribute selection measures
  • Gain Ratio: This is applied in the C4.5 classification algorithm using the following formula:
    Attribute selection measures
  • Information Gain: The ID3 algorithm uses this statistical property to decide which attribute is selected to be tested at any node in the tree, and measures the association between inputs and outputs of the decision tree.

    With the concept of information gain, the definition of a decision tree can be thought of in this way:

    • A decision tree is a tree-structured plan that uses a set of attribute tests to predict output
    • To decide which attribute should be tested first, simply find the one with the highest information gain
    • It, then, recurs
  • Gini Index: It is used in the CART classification algorithm. The Gini index for a specific split point is calculated using the following equation. It is used to gauge the purity of the split point.
    Attribute selection measures
  • Split Info:
    Attribute selection measures

Tree pruning

The initial decision tree is often built with many branches reflecting outliers or noise, which are also common causes of model overfitting. Usually, the direct consequent in tree pruning is needed for the after-dealt of decision tree aiming, which is required for classifying higher accuracy or lower error rates. The two types of pruning in production are as follows:

  • Post-pruning: This approach is to perform tree pruning after the tree grows to the maximum form. The cost-complexity pruning algorithm used in CART and the pessimistic pruning in C4.5 are both examples of post-pruning.
  • Pre-pruning: This is also known as the early stopping strategy, which avoids an over-matured tree generation, to stop the growth of a tree earlier using additional restrictions, such as a threshold.

Repetition and replication are the two major factors that make decision trees unreasonably large and inefficient.

General algorithm for the decision tree generation

Here is the pseudocode of the general decision induction tree algorithm:

General algorithm for the decision tree generation

Another variety of the algorithm is described here, with input parameters as follows:

  • D denotes the training data
  • The leaf size is defined by General algorithm for the decision tree generation
  • The leaf purity threshold is defined by General algorithm for the decision tree generation

The output of the algorithm is a decision tree, as shown in the following screenshot:

General algorithm for the decision tree generation

Line 1 denotes the partition size. Line 4 denotes the stopping condition. Line 9 through line 17 try to get the two branches with the new split. Finally, line 19 applies the algorithm recursively on the two new subbranches to build the subtree. This algorithm is implemented with R in the following section.

The R implementation

The main function of the R code for the generic decision tree induction is listed as follows. Here data is the input dataset, c is the set of class labels, x is the set of attributes, and yita and pi have the same definitions as in the previous pseudocodes:

  1 DecisionTree <- function(data,c,x,yita,pi){
  2         result.tree <- NULL
  3         if( StoppingCondition(data,c,yita,pi) ){
  4                 result.tree <- CreateLeafNode(data,c,yita,pi)
  5                 return(result.tree)
  6         }
  7 
  8         best.split <- GetBestSplit(data,c,x)
  9         newdata <- SplitData(data,best.split)
 10 
 11         tree.yes <- DecisionTree(newdata$yes,c,x,yita,pi)
 12         tree.no <- DecisionTree(newdata$no,c,x,yita,pi)
 13         result.tree <- CreateInternalNode(data,
 14                    best.split,tree.yes,tree.no)
 15         
 16         result.tree
 17 }

One sample dataset is chosen to verify the generic decision tree induction algorithm, the weather dataset. It is from the R package Rattle, which contains 366 examples of 23 attributes, and one target or the class label. In the R language, weather is a data frame, which contains 366 observations of 24 variables. The details for the dataset can be retrieved with the following R code:

> Library(rattle)
> str(weather)

Tree pruning

The initial decision tree is often built with many branches reflecting outliers or noise, which are also common causes of model overfitting. Usually, the direct consequent in tree pruning is needed for the after-dealt of decision tree aiming, which is required for classifying higher accuracy or lower error rates. The two types of pruning in production are as follows:

  • Post-pruning: This approach is to perform tree pruning after the tree grows to the maximum form. The cost-complexity pruning algorithm used in CART and the pessimistic pruning in C4.5 are both examples of post-pruning.
  • Pre-pruning: This is also known as the early stopping strategy, which avoids an over-matured tree generation, to stop the growth of a tree earlier using additional restrictions, such as a threshold.

Repetition and replication are the two major factors that make decision trees unreasonably large and inefficient.

General algorithm for the decision tree generation

Here is the pseudocode of the general decision induction tree algorithm:

General algorithm for the decision tree generation

Another variety of the algorithm is described here, with input parameters as follows:

  • D denotes the training data
  • The leaf size is defined by General algorithm for the decision tree generation
  • The leaf purity threshold is defined by General algorithm for the decision tree generation

The output of the algorithm is a decision tree, as shown in the following screenshot:

General algorithm for the decision tree generation

Line 1 denotes the partition size. Line 4 denotes the stopping condition. Line 9 through line 17 try to get the two branches with the new split. Finally, line 19 applies the algorithm recursively on the two new subbranches to build the subtree. This algorithm is implemented with R in the following section.

The R implementation

The main function of the R code for the generic decision tree induction is listed as follows. Here data is the input dataset, c is the set of class labels, x is the set of attributes, and yita and pi have the same definitions as in the previous pseudocodes:

  1 DecisionTree <- function(data,c,x,yita,pi){
  2         result.tree <- NULL
  3         if( StoppingCondition(data,c,yita,pi) ){
  4                 result.tree <- CreateLeafNode(data,c,yita,pi)
  5                 return(result.tree)
  6         }
  7 
  8         best.split <- GetBestSplit(data,c,x)
  9         newdata <- SplitData(data,best.split)
 10 
 11         tree.yes <- DecisionTree(newdata$yes,c,x,yita,pi)
 12         tree.no <- DecisionTree(newdata$no,c,x,yita,pi)
 13         result.tree <- CreateInternalNode(data,
 14                    best.split,tree.yes,tree.no)
 15         
 16         result.tree
 17 }

One sample dataset is chosen to verify the generic decision tree induction algorithm, the weather dataset. It is from the R package Rattle, which contains 366 examples of 23 attributes, and one target or the class label. In the R language, weather is a data frame, which contains 366 observations of 24 variables. The details for the dataset can be retrieved with the following R code:

> Library(rattle)
> str(weather)

High-value credit card customers classification using ID3

The Iterative Dichotomiser 3 (ID3) algorithm is one of the most popular decision tree induction algorithms. It does not tolerate missing values or noisy data, and attribute values must come from a finite, fixed set.

ID3 uses entropy to measure the homogeneity of a sample and to choose the split. The entropy of a sample S and the information gain of each attribute A are computed using the following equations, where p_i is the proportion of examples in S that belong to class i and S_v is the subset of S for which A has the value v. The attribute with the highest information gain is assigned to the root of the final tree. Then a new subtree is built recursively for each value of the attribute bound to the root.

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

Note

With the play golf dataset as the input dataset, the entropy and the information gain of each attribute work out as follows:

  • Entropy (root) = 0.940
  • Gain (S, Windy) = 0.048, Gain (S, Humidity) = 0.151
  • Gain (S, Temperature) = 0.029, Gain (S, Outlook) = 0.246
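The values in this note can be checked with a few lines of R. The sketch below works from the class counts of the play golf dataset (9 yes and 5 no overall, broken down per attribute value); the function and variable names are illustrative choices, not part of the book's code.

```r
# Entropy of a vector of class counts, in bits
EntropyFromCounts <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Information gain from a contingency table:
# rows are attribute values, columns are class counts
GainFromTable <- function(tab) {
  weights <- rowSums(tab) / sum(tab)
  EntropyFromCounts(colSums(tab)) -
    sum(weights * apply(tab, 1, EntropyFromCounts))
}

# Class counts (yes, no) per attribute value in the play golf dataset
tables <- list(
  Outlook     = rbind(sunny = c(yes = 2, no = 3),
                      overcast = c(yes = 4, no = 0),
                      rainy = c(yes = 3, no = 2)),
  Temperature = rbind(hot = c(yes = 2, no = 2),
                      mild = c(yes = 4, no = 2),
                      cool = c(yes = 3, no = 1)),
  Humidity    = rbind(high = c(yes = 3, no = 4),
                      normal = c(yes = 6, no = 1)),
  Windy       = rbind(false = c(yes = 6, no = 2),
                      true = c(yes = 3, no = 3))
)

root.entropy <- EntropyFromCounts(c(9, 5))
gains <- sapply(tables, GainFromTable)
print(round(c(Entropy = root.entropy, gains), 3))
```

The computed gains agree with the listed values to within the last digit (Outlook and Humidity round to 0.247 and 0.152), and Outlook has the highest gain, so it becomes the root.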

ID3 (like C4.5 and CART) builds the decision tree recursively in a top-down, divide-and-conquer manner, searching the space of possible decision trees with a greedy strategy. With greedy search, each step makes the decision that most improves the optimization target at that point. For each node, the algorithm finds the test condition that best segments the training data assigned to it.

The characteristics of the decision induction tree in the case of ID3 include the following:

  • Each non-leaf node of the tree corresponds to an input attribute, and each arc to a possible value of that attribute
  • Entropy is used to determine how informative a particular input attribute is about the output class on a given dataset
  • The algorithm is recursive

    Note

    A recursive algorithm can be described briefly as follows:

    • Break the original problem into two or more smaller problems of the same type
    • Call the recursive algorithm on each smaller problem
    • Combine the results of the recursive calls to solve the original problem
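The three steps of the note can be seen in any divide-and-conquer routine. As a minimal illustration that is unrelated to decision trees, here is a merge sort sketch in R; the function name is an illustrative choice.

```r
MergeSort <- function(v) {
  if (length(v) <= 1) return(v)          # base case: nothing to split

  # Break the problem into two smaller problems of the same type
  mid <- length(v) %/% 2
  left <- MergeSort(v[seq_len(mid)])     # recursive call on each part
  right <- MergeSort(v[-seq_len(mid)])

  # Combine the two sorted halves into the solution
  merged <- numeric(0)
  while (length(left) > 0 && length(right) > 0) {
    if (left[1] <= right[1]) {
      merged <- c(merged, left[1])
      left <- left[-1]
    } else {
      merged <- c(merged, right[1])
      right <- right[-1]
    }
  }
  c(merged, left, right)
}

MergeSort(c(5, 2, 9, 1, 5, 6))
```

ID3 follows the same shape: the data is split by the chosen attribute, the algorithm is called on each subset, and the resulting subtrees are attached to one node.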

The ID3 algorithm

The input parameters for the ID3 algorithm are as follows:

  • I denotes the set of input attributes, which may be tested by the resulting decision tree
  • T denotes the set of training data objects, or training examples

The output parameter of the algorithm is as follows:

  • O denotes the set of output attributes, that is, the attributes whose values will be predicted by the tree

Here is the pseudocode of the general algorithm:

[Figure: pseudocode of the ID3 algorithm]

The R implementation

The main function of the R code for the ID3 algorithm is listed as follows. Here data is the input training dataset, ix is the set of input attributes, and ox is the output attribute:

  # ID3 skeleton. The helpers (IsEmpty, CreateNode, GetInformationGain,
  # and so on) are assumed to be defined elsewhere.
  ID3 <- function(data, ix, ox) {
    # No training examples left: return a failure node
    if (IsEmpty(data)) {
      return(CreateNode("Failure"))
    }
    # All examples share one output value: return it as a leaf
    if (IsEqualAttributeValue(data, ox)) {
      return(CreateNode(GetMajorityAttributeValue(data, ox)))
    }
    # No input attributes left: fall back to the majority output value
    if (IsEmpty(ix)) {
      return(CreateNode(GetMajorityAttributeValue(data, ox)))
    }

    # Pick the attribute with the highest information gain
    gain <- GetInformationGain(data, ix)
    best.split <- GetBestSplit(data, gain, ix)

    values <- GetAttributeValues(best.split)
    values.count <- GetAttributeValuesCount(best.split)
    data.subsets <- SplitData(data, best.split)

    result.tree <- CreateNode(best.split)
    new.ix <- RemoveAttribute(ix, best.split)

    # Build one child subtree per value of the chosen attribute
    for (idx in seq_len(values.count)) {
      newdata <- GetAt(data.subsets, idx)
      value <- GetAt(values, idx)
      new.child <- ID3(newdata, new.ix, ox)
      # Assumes AddChildNode returns the updated tree; R passes
      # arguments by value, so the result has to be kept
      result.tree <- AddChildNode(result.tree, new.child, value)
    }

    result.tree
  }
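Because the helper functions in the listing are left undefined, it cannot be run as is. The following is a compact, self-contained sketch of the same idea under simplifying assumptions: all attributes are categorical, the tree is a nested list, and a leaf is just a class label. The names Id3Sketch, InfoGain, and PredictId3 are illustrative and do not come from the book.

```r
# Play golf data, with Windy stored as a logical column
golf <- data.frame(
  Outlook = c("sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
              "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
              "overcast", "rainy"),
  Temperature = c("hot", "hot", "hot", "mild", "cool", "cool", "cool",
                  "mild", "cool", "mild", "mild", "mild", "hot", "mild"),
  Humidity = c("high", "high", "high", "high", "normal", "normal", "normal",
               "high", "normal", "normal", "normal", "high", "normal", "high"),
  Windy = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
            FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  Play = c("no", "no", "yes", "yes", "yes", "no", "yes",
           "no", "yes", "yes", "yes", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

Entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

InfoGain <- function(attribute, data, target) {
  subsets <- split(data[[target]], data[[attribute]])
  weights <- lengths(subsets) / nrow(data)
  Entropy(data[[target]]) - sum(weights * sapply(subsets, Entropy))
}

Id3Sketch <- function(data, attrs, target) {
  labels <- data[[target]]
  # Leaf: pure node, or no attributes left (use the majority class)
  if (length(unique(labels)) == 1 || length(attrs) == 0) {
    return(names(which.max(table(labels))))
  }
  gains <- sapply(attrs, InfoGain, data = data, target = target)
  best <- attrs[which.max(gains)]
  # One child per observed value of the best attribute
  children <- lapply(split(data, data[[best]]), Id3Sketch,
                     attrs = setdiff(attrs, best), target = target)
  list(attribute = best, children = children)
}

# Walk the tree from the root until a leaf label is reached
PredictId3 <- function(tree, row) {
  while (is.list(tree)) {
    tree <- tree$children[[as.character(row[[tree$attribute]])]]
  }
  tree
}

tree <- Id3Sketch(golf, c("Outlook", "Temperature", "Humidity", "Windy"), "Play")
str(tree)
PredictId3(tree, list(Outlook = "sunny", Temperature = "cool",
                      Humidity = "normal", Windy = FALSE))
```

Since only observed attribute values produce children, the empty-data case of the listing never arises in this sketch; a value unseen during training simply yields NULL at prediction time.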

Web attack detection

As information technology has developed, many systems have emerged that identify malicious use of software systems, web systems, and so on. One of them is the intrusion detection system (IDS), which detects malicious behavior and conducts content inspection without relying on a firewall. Its techniques include signature detection, anomaly detection, and so on.

Decision tree classifiers, such as ID3, C4.5, and CART, play an important role as analyzers, alongside the other important components of an IDS, such as the sensor, manager, operator, and administrator. The classifications needed here are activity monitor, file integrity checker, host firewall, log parser, and packet pattern matching.

IDS faces many issues. One of them is a new variety of a known attack pattern, which existing IDS often detect at a low rate. This drives the design of new types of IDS integrated with artificial intelligence, especially decision tree technologies.

Beyond the IDS that have already been built, there are also competitions that apply data mining techniques to web attack detection. One of them is the KDD Cup. The topic of KDD Cup 1999 was computer network intrusion detection: building a classifier to predict unauthorized behavior.

The dataset came from the DARPA Intrusion Detection Evaluation Program. The training dataset contains about five million data instances and the test dataset about two million. There are 24 attack types in the training dataset, and a further 14 that appear only in the test dataset. Each data instance has 41 attributes: 9 for the TCP connection, 13 for content features contained in the TCP connection, 9 for traffic features that use a two-second time window, and the remaining 10 for host-related traffic features. All the attacks can be categorized into the following four groups:

  • DOS: This refers to denial of service
  • R2L: This refers to unauthorized access to the local machine from the remote machine
  • U2R: This refers to unauthorized access to local super-user privileges by a local unprivileged user
  • Probing: This refers to surveillance and probing

With suitable preprocessing, the ID3 algorithm can be applied to web attack detection datasets of various sizes. As the dataset grows, ID3 can be kept efficient through parallelization.

For simplicity, the following example uses only four types of attack to label a dataset for a simple IDS:

  • SQL Injection
  • Cross Site Scripting
  • Code Injection
  • Directory Traversal

All four types of attack share a common trait: the web queries carry a malicious pattern. To build the dataset, normalize the web queries (the URL and the collection of reserved tags), then label each query that matches an attack-specific pattern with the corresponding one of the four attack types. After training ID3 on the dataset and applying it to the existing IDS, a better detection rate can be achieved.
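The normalize-then-label step can be sketched in a few lines of R. The regular expressions below are hypothetical and far from complete; they only illustrate how a query might be normalized and assigned one of the four labels before a tree is trained on the result. The function names are illustrative as well.

```r
# Hypothetical signature patterns, for illustration only
attack.patterns <- c(
  "SQL Injection"        = "union\\s+select|or\\s+1\\s*=\\s*1|'\\s*--",
  "Cross Site Scripting" = "<script|javascript:|onerror\\s*=",
  "Code Injection"       = "\\b(eval|system|exec)\\s*\\(",
  "Directory Traversal"  = "\\.\\./|\\.\\.\\\\"
)

# Normalize a raw query string: decode percent-escapes, lower the case
NormalizeQuery <- function(query) {
  tolower(utils::URLdecode(query))
}

# Return the first matching attack label, or "Normal" if none matches
LabelQuery <- function(query) {
  q <- NormalizeQuery(query)
  hit <- sapply(attack.patterns, grepl, x = q, perl = TRUE)
  if (any(hit)) names(attack.patterns)[hit][1] else "Normal"
}

queries <- c("id=1%20UNION%20SELECT%20password",
             "q=%3Cscript%3Ealert(1)%3C/script%3E",
             "cmd=system(%22ls%22)",
             "file=..%2F..%2Fetc%2Fpasswd",
             "page=2&sort=asc")
data.frame(query = queries, label = sapply(queries, LabelQuery),
           row.names = NULL)
```

In practice the labels would come from inspected traffic rather than from the same patterns the classifier is meant to learn; the sketch only shows the shape of the preprocessing.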

High-value credit card customers classification

With the growth of credit card usage, the banking industry needs to identify high-value credit card customers among all customers in order to create a more customer-oriented strategy and increase profit. Similar requirements exist, such as finding interesting rules in the dataset.

To achieve this target, we need to include accurate customer attributes (of whatever type) in the training data objects. Possible choices are transaction records, usage behaviors, customer age, annual income, education background, financial assets, and so on.

There is no need to include every customer-related attribute; only the attributes most relevant to this target should be adopted. Domain experts can help with this selection.

With the appropriate attributes selected, the ID3 algorithm can be applied to extract the most discriminating features, which help to judge which customers are more likely to be profitable.

Web spam detection using C4.5

C4.5 is an extension of ID3. The major extensions include handling data with missing attribute values, and handling attributes that belong to an infinite continuous range.

It is one of the decision tree algorithms, and is also a supervised learning classification algorithm. A model is learned that maps the input attribute values to mutually exclusive class labels. The learned model is then used to classify new, unseen instances. The attribute selection measure adopted in C4.5 is the gain ratio, which corrects the bias of information gain toward attributes with many values. Here S is the set of examples and S_v is the subset of S for which attribute A has the value v:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

$$SplitInfo(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

$$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}$$
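The effect of the gain ratio is easiest to see on an attribute with many distinct values. The sketch below uses six made-up examples: an identifier-like attribute that is unique per example and a binary attribute that separates the classes perfectly. Both reach the same information gain, but the gain ratio penalizes the identifier. All names and data here are illustrative assumptions.

```r
Entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Gain, split information, and gain ratio of one categorical attribute
GainRatio <- function(attr, labels) {
  subsets <- split(labels, attr)
  w <- lengths(subsets) / length(labels)
  gain <- Entropy(labels) - sum(w * sapply(subsets, Entropy))
  split.info <- -sum(w * log2(w))   # zero if the attribute has one value
  c(gain = gain, split.info = split.info, gain.ratio = gain / split.info)
}

# Made-up toy data
labels <- c("yes", "yes", "no", "no", "yes", "no")
id     <- as.character(1:6)                  # unique per example
flag   <- c("a", "a", "b", "b", "a", "b")    # separates the classes

r.id <- GainRatio(id, labels)
r.flag <- GainRatio(flag, labels)
print(round(rbind(id = r.id, flag = r.flag), 3))
```

Both attributes have a gain of 1 bit, but the identifier's split information is log2(6), so its gain ratio falls well below that of the binary attribute.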

A suite of variants is derived from the generic C4.5 algorithm, including C4.5, C4.5-no-pruning, C4.5-rules, and so forth. All of them are called C4.5 algorithms, which means that C4.5 is a suite of algorithms.

Compared to other algorithms, there are many important characteristics of C4.5:

  • Tree pruning, a post-pruning strategy, removes parts of the tree structure according to selected accuracy criteria
  • An improved use of continuous attributes
  • Missing values handling
  • Inducing rule sets
  • It supports multiway tests that depend on the attribute value, and is not limited to binary tests
  • Information theoretic tests are applied via the gain and gain ratio
  • It is a greedy learning algorithm, that is, as the tree grows, the test with the best criterion value is chosen at each node
  • The data must fit in main memory (many extended algorithms can use secondary storage, such as BOAT, Rainforest, SLIQ, SPRINT, and so forth)
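One common way to handle a continuous attribute, sketched here under simplifying assumptions, is to sort its distinct values, treat the midpoints between neighbors as candidate thresholds, and keep the threshold with the highest information gain. The function name and the toy data are illustrative, and this sketch scores candidates by plain information gain rather than reproducing every detail of C4.5.

```r
Entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Find the binary threshold on a numeric attribute with the highest gain
BestThreshold <- function(values, labels) {
  v <- sort(unique(values))
  if (length(v) < 2) return(NULL)          # nothing to split on
  candidates <- (head(v, -1) + tail(v, -1)) / 2
  base <- Entropy(labels)
  gains <- sapply(candidates, function(t) {
    left <- labels[values <= t]
    right <- labels[values > t]
    base - (length(left) * Entropy(left) +
            length(right) * Entropy(right)) / length(labels)
  })
  best <- which.max(gains)
  list(threshold = candidates[best], gain = gains[best])
}

# Made-up toy data: low values are "no", high values are "yes"
values <- c(1, 2, 3, 10, 11, 12)
labels <- c("no", "no", "no", "yes", "yes", "yes")
BestThreshold(values, labels)
```

The resulting test (value <= threshold) then competes with the tests on the other attributes when the best split for a node is chosen.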

The C4.5 algorithm

Here is the pseudocode for the basic C4.5 algorithm:

[Figure: pseudocode of the basic C4.5 algorithm]

The R implementation

The R code for the C4.5 algorithm is listed as follows:

  # C4.5 skeleton. The helpers (IsEmpty, CreateNode, GetGainRatio,
  # and so on) are assumed to be defined elsewhere.
  C45 <- function(data, x) {
    # No training examples left: return a failure node
    if (IsEmpty(data)) {
      return(CreateNode("Failure"))
    }
    # Nothing left in x: fall back to the majority class
    if (IsEmpty(x)) {
      return(CreateNode(GetMajorityClassValue(data, x)))
    }
    # A single element left in x: use its class value
    if (1 == GetCount(x)) {
      return(CreateNode(GetClassValue(x)))
    }

    # Pick the attribute with the best gain ratio
    gain.ratio <- GetGainRatio(data, x)
    best.split <- GetBestSplit(data, x, gain.ratio)

    data.subsets <- SplitData(data, best.split)
    values <- GetAttributeValues(data.subsets, best.split)
    values.count <- GetCount(values)

    result.tree <- CreateNode(best.split)
    new.x <- RemoveAttribute(x, best.split)

    # Build one child subtree per value of the chosen attribute
    for (idx in seq_len(values.count)) {
      newdata <- GetAt(data.subsets, idx)
      value <- GetAt(values, idx)
      new.child <- C45(newdata, new.x)
      # Assumes AddChildNode returns the updated tree; R passes
      # arguments by value, so the result has to be kept
      result.tree <- AddChildNode(result.tree, new.child, value)
    }

    result.tree
  }

A parallel version with MapReduce

As the dataset grows in volume, the C4.5 algorithm can be parallelized with the MapReduce programming model, using the Hadoop technology suite and, for R code in particular, RHadoop.

The MapReduce programming model is illustrated in the following diagram:

[Figure: the MapReduce programming model]
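The part of tree induction that parallelizes most naturally is counting how often each attribute value occurs with each class, since entropy and gain ratio are computed from those counts. The sketch below imitates the map and reduce phases in base R on two made-up data chunks; it does not use Hadoop or RHadoop, and all names are illustrative.

```r
# Made-up records, split into chunks as if stored on different workers
chunks <- list(
  data.frame(outlook = c("sunny", "rainy"),
             play = c("no", "yes"), stringsAsFactors = FALSE),
  data.frame(outlook = c("sunny", "overcast", "rainy"),
             play = c("no", "yes", "yes"), stringsAsFactors = FALSE)
)

# Map: each chunk emits counts keyed by "attribute value|class"
MapCounts <- function(chunk) {
  table(paste(chunk$outlook, chunk$play, sep = "|"))
}

# Reduce: merge two partial results by summing the counts per key
ReduceCounts <- function(a, b) {
  keys <- union(names(a), names(b))
  total <- setNames(numeric(length(keys)), keys)
  total[names(a)] <- total[names(a)] + as.numeric(a)
  total[names(b)] <- total[names(b)] + as.numeric(b)
  total
}

counts <- Reduce(ReduceCounts, lapply(chunks, MapCounts))
print(counts)
```

Because the reduce step is associative, the partial counts can be merged in any order, which is what allows the map phase to run on separate machines.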

Web spam detection

Spamming occurs along with the emergence of search engine technologies to pursue higher rank with the deceiving search engines relevancy algorithm, but not improve their own website technologies. It performs deliberate actions to trigger unjustifiable favorable relevance or importance for a specific web page, in contrast to the page's true value or merit. The spam page finally receives a substantial amount of score from other spam pages to boost its rank in search engine results by deliberately manipulating the search engine indexes. Finally, traffic is driven to the spammed pages. As a direct result of the Web spam, the information quality of the Web world is degraded, user experience is manipulated, and the security risk for use increases due to exploitation of user privacy.

One classic example, known as a link farm, is illustrated in the following diagram. A densely connected set of pages is created with the aim of cheating a link-based ranking algorithm, a practice that is also called collusion:

[Figure: a link farm]

There are three major categories of spam from a business point of view:

  • Page Spoofing
  • Browser-Based Attacks
  • Search Engine Manipulations

There are three major categories of spam from a technology point of view:

  • Link spam: This consists of creating a link structure, often a tight-knit community of links, aimed at affecting the outcome of a link-based ranking algorithm. Possible techniques include honey pots, anchor-text spam, blog/wiki spam, link exchanges, link farms, and expired domains.
  • Content spam: This crafts the contents of web pages. One example is inserting unrelated keywords into the web page content to obtain a higher ranking on search engines. Possible techniques include hidden text (size and color), repetition, keyword stuffing/dilution, and language-model-based techniques (phrase stealing and dumping).
  • Cloaking: This sends content to search engines that differs from the version viewed by human visitors.

Link-based spam detection usually relies on automatic classifiers, on detecting anomalous behavior of a link-based ranking algorithm, and so on. Classifier or language-model disagreement can be used to detect content spam. Cloaking detection is more direct: one approach is to compare the page as indexed by the search engine with the page as seen by visitors.
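The comparison-based cloaking check can be sketched as a similarity test between two copies of the same page. The following is a minimal illustration under stated assumptions: the two strings stand in for the page as fetched by a crawler and by a browser, and the helpers `tokenize` and `jaccard_similarity` as well as the 0.5 threshold are hypothetical choices rather than part of any established detector.

```r
# Split a text into its set of distinct lowercase words
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  unique(words[nzchar(words)])
}

# Jaccard similarity: shared words divided by all words seen in either text
jaccard_similarity <- function(text_a, text_b) {
  a <- tokenize(text_a)
  b <- tokenize(text_b)
  if (length(union(a, b)) == 0) return(1)
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical copies of one URL: what the crawler indexed vs. what users see
crawler_view <- "cheap flights hotels deals cheap flights best prices"
visitor_view <- "win a free prize click here now"

similarity <- jaccard_similarity(crawler_view, visitor_view)
is_cloaked <- similarity < 0.5   # threshold is an assumption
print(similarity)
print(is_cloaked)
```

A practical detector would have to tolerate legitimate differences such as advertisements or personalized content, so the threshold would need tuning on labeled pages.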

How can a decision tree be applied to web spam detection? Statistical analysis shows that the links and contents of spam pages differ from those of normal pages. Some properties are valuable for detecting spam, such as the trustworthiness of the page, neutrality (facts), and bias. Web spam detection is therefore a good example for illustrating the C4.5 algorithm.

Some domain-related knowledge can be applied to the classification solution. One observed phenomenon is that spam pages tend to link to one another. The links between web pages and websites are often not set randomly but follow certain rules, and these rules can be detected by classifiers.

The attributes of a data point can be divided into two groups, link-based features and content-based features:

  • Link-based features: These include degree-based measures, such as the in-degree and out-degree of the web hosts; PageRank-based features, where PageRank is an algorithm that computes a score for every page; and TrustRank-based features, which measure the trustworthiness of a web page relative to a set of benchmarked, trustworthy web pages.
  • Content-based features: These include the number of words in the page, number of words in the title, average word length, amount of anchor text, fraction of anchor text, fraction of visible text, fraction of page drawn from globally popular words, fraction of globally popular words, compressibility, corpus precision and corpus recall, query precision and query recall, independent trigram or n-gram likelihood, and entropy of trigrams or n-grams.
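As an illustration of the link-based features, the following sketch computes in-degree, out-degree, and a PageRank score with base R. The four-host graph, the damping factor of 0.85, the fixed number of iterations, and the function name `page_rank` are assumptions made for the example; TrustRank is not covered here.

```r
# Hypothetical link graph: links[i, j] = 1 means host i links to host j
hosts <- c("A", "B", "C", "D")
links <- matrix(c(0, 1, 1, 0,
                  1, 0, 1, 0,
                  1, 1, 0, 0,
                  0, 0, 1, 0),
                nrow = 4, byrow = TRUE, dimnames = list(hosts, hosts))

# Degree-based measures
out_degree <- rowSums(links)
in_degree  <- colSums(links)

# PageRank by power iteration
page_rank <- function(adj, damping = 0.85, iterations = 100) {
  n <- nrow(adj)
  out <- rowSums(adj)
  # Row-normalize; hosts without out-links spread their score evenly
  transition <- adj / ifelse(out == 0, 1, out)
  transition[out == 0, ] <- 1 / n
  rank <- rep(1 / n, n)
  for (i in seq_len(iterations)) {
    rank <- (1 - damping) / n + damping * as.vector(rank %*% transition)
  }
  names(rank) <- rownames(adj)
  rank
}

pr <- page_rank(links)
print(data.frame(in_degree, out_degree, page_rank = round(pr, 4)))
```

Host D has no in-links, so it keeps only the baseline score, while host C collects score from all three other hosts.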

All these features are combined into one data point to build a preprocessed dataset, which in turn can be used by classification algorithms, especially decision tree algorithms such as C4.5, to act as web spam classifiers that distinguish spam from normal pages. Among the classification solutions compared, C4.5 is reported to achieve the best performance.
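To make the idea of one data point per page concrete, the following sketch derives a handful of the content-based features and stores them as one row of a data frame. It is a simplified illustration: the function name `content_features`, the tokenization rule, the choice of five features, and the two toy pages are assumptions, and compressibility is approximated with gzip through base R's `memCompress`.

```r
content_features <- function(title, body) {
  words <- unlist(strsplit(tolower(body), "[^a-z0-9]+"))
  words <- words[nzchar(words)]
  title_words <- unlist(strsplit(tolower(title), "[^a-z0-9]+"))
  title_words <- title_words[nzchar(title_words)]

  # Compressibility: compressed size divided by raw size
  raw_bytes <- charToRaw(body)
  ratio <- length(memCompress(raw_bytes, type = "gzip")) / length(raw_bytes)

  # Entropy (in bits) of the distribution of word trigrams
  n <- length(words)
  entropy <- 0
  if (n >= 3) {
    trigrams <- paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n])
    p <- table(trigrams) / length(trigrams)
    entropy <- -sum(p * log2(p))
  }

  data.frame(
    n_words           = n,
    n_title_words     = length(title_words),
    avg_word_length   = if (n > 0) mean(nchar(words)) else 0,
    compression_ratio = ratio,
    trigram_entropy   = entropy
  )
}

# Two hypothetical pages, each becoming one labeled data point
dataset <- rbind(
  cbind(content_features("Cheap pills",
                         "cheap pills cheap pills cheap pills cheap pills"),
        label = "spam"),
  cbind(content_features("Decision trees",
                         "a decision tree splits the data on one attribute at a time"),
        label = "normal")
)
print(dataset)
```

The keyword-stuffed page contains only two distinct trigrams, so its trigram entropy is low compared with ordinary prose; this is the kind of signal a decision tree can split on.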

The C4.5 algorithm

Here is the pseudocode for the basic C4.5 algorithm:

[Figure: pseudocode for the basic C4.5 algorithm]

The R implementation

The R code for C4.5 is listed as follows:

    # Helper functions such as IsEmpty(), CreateNode(), and GetGainRatio()
    # are not defined in this listing.
    # data: the training tuples; x: the list of candidate attributes
    C45 <- function(data, x) {
        # No training data left: return a failure node
        if (IsEmpty(data)) {
            return(CreateNode("Failure"))
        }
        # No attributes left: label the node with the majority class
        if (IsEmpty(x)) {
            return(CreateNode(GetMajorityClassValue(data, x)))
        }
        # Only one entry left in x: label the node with its class value
        if (1 == GetCount(x)) {
            return(CreateNode(GetClassValue(x)))
        }

        # Choose the splitting attribute by gain ratio
        gain.ratio <- GetGainRatio(data, x)
        best.split <- GetBestSplit(data, x, gain.ratio)

        # Partition the data by the values of the chosen attribute
        data.subsets <- SplitData(data, best.split)
        values <- GetAttributeValues(data.subsets, best.split)
        values.count <- GetCount(values)

        result.tree <- CreateNode(best.split)
        new.x <- RemoveAttribute(x, best.split)
        idx <- 0
        # idx is incremented before use, so the loop must stop at values.count
        while (idx < values.count) {
            idx <- idx + 1
            newdata <- GetAt(data.subsets, idx)
            value <- GetAt(values, idx)
            # Grow one child subtree per attribute value
            new.child <- C45(newdata, new.x)
            AddChildNode(result.tree, new.child, value)
        }

        result.tree
    }
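The listing above leaves the gain-ratio computation to a helper. As a rough, self-contained sketch of that criterion (not the book's `GetGainRatio`), the following code computes the gain ratio of one categorical attribute. The function names `entropy` and `gain_ratio`, the toy vectors, and the choice to return 0 when the split information is 0 are assumptions.

```r
# Entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Gain ratio = information gain / split information
gain_ratio <- function(attribute, labels) {
  counts  <- table(attribute)
  weights <- setNames(as.vector(counts) / length(attribute), names(counts))
  # Expected entropy of the class after splitting on the attribute
  conditional <- sum(sapply(names(weights), function(v) {
    weights[[v]] * entropy(labels[attribute == v])
  }))
  info_gain  <- entropy(labels) - conditional
  split_info <- -sum(weights * log2(weights))
  if (split_info == 0) return(0)   # attribute has a single value
  info_gain / split_info
}

# Hypothetical training data: two candidate attributes and the class label
labels      <- c("spam", "spam", "spam", "normal", "normal", "normal")
hidden_text <- c("yes", "yes", "yes", "no", "no", "no")
long_title  <- c("yes", "no", "yes", "no", "yes", "no")

print(gain_ratio(hidden_text, labels))
print(gain_ratio(long_title, labels))
```

In this toy data, `hidden_text` separates the classes perfectly and therefore scores higher than `long_title`, so it would be chosen as the split attribute.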

Web key resource page judgment using CART

Classification and Regression Trees (CART) is one of the most popular decision tree algorithms. It is a binary recursive partitioning algorithm that can be used to process continuous and nominal attributes.

There are three main steps in the CART algorithm. The first is to construct the maximum tree (a binary tree). The second is to choose the right size for the tree. The last is to classify new data using the resulting tree.

Compared to other algorithms, CART has several important characteristics:

  • Binary decision tree (a binary recursive partitioning process)
  • The source dataset can have continuous or nominal attributes
  • No stopping rule (unless no possible splits are available)
  • Tree pruning with cost-complexity pruning
  • Nonparametric
  • No variables need to be selected in advance
  • Missing values are dealt with using an adaptive strategy
  • Outliers can be easily handled
  • No assumptions
  • Computationally fast
  • At each split point, only one variable is used
  • Only one optimal tree is taken as the result tree, which is selected from a sequence of nested pruned candidate trees generated by CART
  • Missing values in the source dataset are handled automatically

The weighted Gini index equation is defined as follows:

[Equation image: the weighted Gini index]
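The equation image is not reproduced here, so the following sketch uses the standard textbook form, which may differ in notation from the original: the Gini index of a set D is 1 minus the sum of the squared class proportions, and the weighted Gini index of a binary split of D into D1 and D2 is (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2). The function names `gini` and `weighted_gini` and the toy labels are assumptions.

```r
# Gini index of a vector of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted Gini index of a binary split into a left and a right partition
weighted_gini <- function(left, right) {
  n <- length(left) + length(right)
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# A split that separates the classes perfectly
print(weighted_gini(c("key", "key"), c("other", "other")))
# A split that leaves both partitions evenly mixed
print(weighted_gini(c("key", "other"), c("key", "other")))
```

For the weighted Gini index, a lower value indicates a purer split.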

The CART measure is different: the goodness of a split point is proportional to the value of the measure, so the higher the value, the better the split.

[Equation image: the CART measure]
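The original formula is not reproduced here. One formulation found in the decision tree literature, taken as an assumption for this sketch, multiplies twice the fractions of points sent to each side by the summed absolute differences of the class proportions on the two sides. The function name `cart_measure` and the toy labels are hypothetical.

```r
# CART measure (assumed form):
#   2 * P(left) * P(right) * sum over classes of |P(class | left) - P(class | right)|
cart_measure <- function(left, right) {
  n <- length(left) + length(right)
  classes <- union(left, right)
  p_left  <- sapply(classes, function(cl) mean(left == cl))
  p_right <- sapply(classes, function(cl) mean(right == cl))
  2 * (length(left) / n) * (length(right) / n) * sum(abs(p_left - p_right))
}

# A split that separates the classes perfectly
print(cart_measure(c("key", "key"), c("other", "other")))
# A split that leaves both partitions evenly mixed
print(cart_measure(c("key", "other"), c("key", "other")))
```

Under this form, a balanced split with very different class distributions on the two sides scores highest, which matches the rule that higher is better.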

The CART algorithm

The full set of splitting rules is too lengthy to include in this section, so the following parts are omitted:

  • Separate handling of continuous and categorical splitters
  • Special handling for categorical splitters with multiple levels
  • Missing value handling
  • Tree pruning
  • Tree selection

Here is the pseudocode for the simplified tree-growing algorithm of CART:

[Figure: pseudocode for the simplified CART tree-growing algorithm]
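The core of tree growing is the search for the best binary split at a node. The following sketch shows that single step for one continuous attribute, scoring each candidate threshold by the weighted Gini index. It is a simplification and not the contents of the pseudocode figure: the function name `best_split`, the use of midpoints between sorted values as candidates, and the toy data are assumptions.

```r
best_split <- function(x, labels) {
  gini <- function(l) {
    p <- table(l) / length(l)
    1 - sum(p^2)
  }
  values <- sort(unique(x))
  if (length(values) < 2) return(NULL)   # nothing to split on
  # Candidate thresholds: midpoints between consecutive distinct values
  thresholds <- (head(values, -1) + tail(values, -1)) / 2
  scores <- sapply(thresholds, function(t) {
    left  <- labels[x <= t]
    right <- labels[x > t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
  })
  best <- which.min(scores)              # lowest weighted Gini wins
  list(threshold = thresholds[best], gini = scores[best])
}

# Hypothetical node: in-degree of six pages and their class labels
in_degree <- c(1, 2, 3, 10, 12, 15)
page_type <- c("ordinary", "ordinary", "ordinary", "key", "key", "key")

split <- best_split(in_degree, page_type)
print(split)
```

A full tree-growing routine would apply this search to every attribute, split the data on the winner, and recurse on both partitions until no split is possible.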

The simplified pruning algorithm is as follows:

[Figure: pseudocode for the simplified CART pruning algorithm]
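Cost-complexity pruning repeatedly removes the weakest link, that is, the internal node whose subtree buys the least error reduction per extra leaf. The following sketch computes that quantity for three internal nodes. The error values, the column names, and the node names are made-up numbers for illustration only; for each node, alpha is (error of the node as a leaf minus error of its subtree) divided by (number of leaves in the subtree minus 1).

```r
# Hypothetical internal nodes of a grown tree
subtrees <- data.frame(
  node          = c("t1", "t2", "t3"),
  node_error    = c(0.30, 0.15, 0.12),  # error if the node is collapsed to a leaf
  subtree_error = c(0.10, 0.10, 0.04),  # error of the subtree rooted at the node
  leaves        = c(4, 2, 3),           # number of leaves in that subtree
  stringsAsFactors = FALSE
)

# Error increase per removed leaf when the subtree is pruned
subtrees$alpha <- (subtrees$node_error - subtrees$subtree_error) /
  (subtrees$leaves - 1)

# The node with the smallest alpha is pruned first
weakest <- subtrees$node[which.min(subtrees$alpha)]
print(subtrees)
print(weakest)
```

Repeating this step on the pruned tree yields the sequence of nested candidate trees from which the final tree is selected.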

The R implementation

Please refer to the R code file ch_02_cart.R in the bundle of R code for the previously mentioned algorithms. The following section applies the CART algorithm to one example.

Web key resource page judgment

Web key resource judgment arises from the domain of web information retrieval and web search engines. The original concept comes from the authority value, the hub value, and the HITS algorithm. When querying IR systems or search engines, finding important and relevant information in an ever-increasing volume of information is a challenging task. Better judgment leads to less index storage and more informative query results.

A key resource page is a high-quality web page that carries much more information on a selected topic than an ordinary web page on the same topic. To measure the importance of a web page, feature selection is the first step required in the design.

The link-based features used in current search technologies can't resolve such issues at an acceptable accuracy rate. To improve the accuracy rate, more global information across many data instances can be adopted in addition to the attributes or features of a single data instance, that is, the local attributes.

Experimental results show that a key web page should contain in-site out-links with anchor text pointing to other pages. Non-content attributes, such as link-related attributes and the content structure of pages, can be applied to judge key resource pages. The possible attributes are listed as follows:

  • In-degree or in-links: This denotes the number of links pointing to the page. Observation shows that key pages tend to have a higher number of in-links; more links from other sites to a page mean, to a certain extent, more recommendations.
  • URL length or the depth of a page's URL: There are four types of URLs defined in the following box: root, subroot, path, and filename. The four types of URLs map to four levels of length, that is, 1 to 4 respectively. A lower level is associated with a lower prior probability, and a higher prior probability means a greater likelihood of being a key resource page.
  • In-site out-link anchor text rate: This refers to the ratio of the length of the anchor text to the length of the document or page content.
  • In-site out-link number: This refers to the number of links embedded in a page.
  • Document length (in words): This is measured after filtering out predefined non-usable characters from the document. This attribute can help predict the relevance of a page because its distribution is non-uniform.
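Two of these attributes can be derived directly from a page's URL and text. The sketch below is an illustration under assumptions, because the box defining the four URL types is not reproduced here: it treats a URL with no path as root, a single directory as subroot, deeper directories as path, and anything not ending in a slash as filename. The function names `url_level` and `anchor_text_rate` are hypothetical.

```r
# Map a URL to a level from 1 (root) to 4 (filename); the rules are assumed
url_level <- function(url) {
  # Drop the scheme and host, then any query string or fragment
  path <- sub("^[a-z]+://[^/]+", "", tolower(url))
  path <- sub("[?#].*$", "", path)
  if (path == "" || path == "/") return(1L)    # root
  if (!grepl("/$", path)) return(4L)           # filename
  depth <- length(strsplit(gsub("^/|/$", "", path), "/")[[1]])
  if (depth == 1) 2L else 3L                   # subroot or path
}

# Ratio of total anchor text length to page text length (in characters)
anchor_text_rate <- function(anchor_texts, page_text) {
  sum(nchar(anchor_texts)) / nchar(page_text)
}

urls <- c("http://example.com/",
          "http://example.com/docs/",
          "http://example.com/docs/r/",
          "http://example.com/docs/r/index.html")
url_levels <- sapply(urls, url_level, USE.NAMES = FALSE)
print(url_levels)

rate <- anchor_text_rate(c("home", "docs"), "0123456789012345")
print(rate)
```

Together with the in-degree, the number of in-site out-links, and the document length, such values form one row per page in the dataset passed to CART.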

With the attributes just mentioned, the uniform sampling problem can be bypassed to a certain extent. The dataset can then be built easily and used by decision tree induction algorithms such as CART.

The CART algorithm

Splitting rules, with the omission of the following parts, is too lengthy to include in this section:

  • Separate handling of continuous and categorical splitters
  • Special handling for categorical splitters with multiple levels
  • Missing value handling
  • Tree pruning
  • Tree selection

Here is the pseudocode for the CART algorithm, the simplified tree-growing algorithm:

The CART algorithm

The simplified pruning algorithm is as follows:

The CART algorithm

The R implementation

Please look up the R codes file ch_02_cart.R from the bundle of R codes for the previously mentioned algorithms. One example is chosen to apply the CART algorithm to, in the following section.

Web key resource page judgment

Web key resource judgment arises from the domain of web information retrieval and web search engines. The original concept is from the authority value, the hub value, and the HITS algorithm. During queries of information from IR systems or search engines, finding important and related information from an overwhelmingly increasing volume of information is a challenging task. A better judgment leads to less indexing storage and a more informative querying result.

A key resource page is a high quality web page with much more information per selected topic compared to an ordinary web page on the same topic. In order to measure the importance of a certain web page, feature selection is the first thing required in the design.

The link-based features used in current search technologies can't resolve such issues at an acceptable accuracy rate. To improve the accuracy rate, more global information across many data instances can be adopted in addition to the single-data-instance-related attributes or features, which means local attributes.

Experimental results show that the key web page should contain in-site out-links with anchor text to other pages. Non-content attributes, such as web page links related attributes and content structures of pages, can be applied to judge key resource pages. The possible attributes are listed as follows:

  • In-degree or in-links: This denotes the number of links pointing to the page. Observation shows that key pages tend to receive more in-links from other sites, which can be read as more recommendations to a certain extent.
  • URL length or the depth of a page's URL: There are four types of URLs defined in the following box: root, subroot, path, and filename. The four types of URLs map to four levels of length, that is, 1 to 4 respectively. Each level has its own prior probability of being a key resource page, and a higher prior probability means that the page is more likely to be one.
  • In-site out-link anchor text rate: This refers to the ratio of the length of the anchor text to the length of the document or page content.
  • In-site out-link number: This refers to the number of links embedded in a page.
  • Document length (in words): This is measured after filtering predefined non-usable characters out of the document. This attribute can help predict the relevance of a page because document length is not uniformly distributed.
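The URL level attribute can be derived directly from the URL string. The sketch below is one possible reading of the four types, under assumptions that the text does not spell out: a root is a bare host name, a subroot is a host followed by a single directory, a path is a deeper directory, and a filename is any URL whose last component contains a dot. A trailing index.html is treated as the directory itself. The function name `UrlLevel` is hypothetical.

```r
# Map a URL to a level from 1 (root) to 4 (filename).
# The type definitions are assumptions made for this sketch.
UrlLevel <- function(url) {
  # Drop the scheme and any query string or fragment
  path <- sub("^[a-zA-Z]+://", "", url)
  path <- sub("[?#].*$", "", path)
  # Treat a trailing index.html (or index.htm) as the directory itself
  path <- sub("(^|/)index\\.html?$", "\\1", path)

  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  dirs <- parts[-1]                 # components after the host name
  dirs <- dirs[nzchar(dirs)]

  ends.in.file <- !grepl("/$", path) && length(dirs) > 0 &&
    grepl(".", dirs[length(dirs)], fixed = TRUE)
  if (ends.in.file) return(4L)      # filename
  if (length(dirs) == 0) return(1L) # root
  if (length(dirs) == 1) return(2L) # subroot
  3L                                # path
}

urls <- c("http://example.com/",
          "http://example.com/docs/",
          "http://example.com/a/b/c/",
          "http://example.com/a/b/page.html")
levels.found <- sapply(urls, UrlLevel, USE.NAMES = FALSE)
```

The dot test for file names is a heuristic, so extensionless files are counted as directories.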

With the attributes just mentioned, the uniform sampling problem can be bypassed to a certain extent. The dataset can then be built easily and used by decision tree induction algorithms such as CART.
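As an illustration of how such a dataset could be passed to a CART implementation, here is a minimal sketch that uses the rpart package. It is not the code from ch_02_cart.R. The column names, the generated values, and the labelling rule are all invented for this example. Real data would come from a crawl with judged key resource pages.

```r
library(rpart)  # a CART-style implementation distributed with R

set.seed(42)
n <- 200
# Hypothetical page features, named after the attributes listed above
pages <- data.frame(
  in_degree        = rpois(n, lambda = 5),
  url_level        = sample(1:4, n, replace = TRUE),
  anchor_text_rate = runif(n, 0, 0.5),
  out_link_number  = rpois(n, lambda = 20),
  doc_length       = round(rlnorm(n, meanlog = 6, sdlog = 1))
)
# Synthetic labelling rule, invented only so the tree has something to learn
is.key <- pages$in_degree >= 6 & pages$url_level <= 2
pages$key_resource <- factor(ifelse(is.key, "key", "ordinary"),
                             levels = c("ordinary", "key"))

# Grow a classification tree and predict on the training pages
fit <- rpart(key_resource ~ ., data = pages, method = "class")
pred <- predict(fit, newdata = pages, type = "class")
accuracy <- mean(pred == pages$key_resource)
```

Calling `print(fit)` shows the chosen splits. The accuracy here is measured on the training data, so it overstates how well the tree would do on unseen pages.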

Trojan traffic identification method and Bayes classification

Bayes classification is a probabilistic classification algorithm based on Bayes' theorem. It predicts the class of an instance as the one that maximizes the posterior probability. The risk with Bayes classification is that it needs enough data to estimate the joint probability density reliably.

Given a dataset D of size n, each instance or point $x_j$ in D has dimension m and a class label $y_j \in \{c_1, c_2, \ldots, c_k\}$. To predict the class $\hat{y}$ of any x, we use the following formula:

$$\hat{y} = \arg\max_{c_i} P(c_i \mid x)$$

Based on Bayes' theorem, where $P(x \mid c_i)$ is the likelihood:

$$P(c_i \mid x) = \frac{P(x \mid c_i)\, P(c_i)}{P(x)}$$

Because $P(x)$ does not depend on the class, we get the following new equation for predicting $\hat{y}$ for x:

$$\hat{y} = \arg\max_{c_i} P(x \mid c_i)\, P(c_i)$$
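A small numeric example may make the decision rule concrete. The priors and likelihoods below are invented for illustration and are not measured values.

```r
# Hypothetical priors P(c) and likelihoods P(x | c) for one observed point x
prior      <- c(trojan = 0.1, normal = 0.9)
likelihood <- c(trojan = 0.6, normal = 0.05)

# Bayes' theorem: posterior = likelihood * prior / evidence
evidence  <- sum(likelihood * prior)
posterior <- likelihood * prior / evidence

# The predicted class maximizes the posterior; dividing by the evidence
# does not change which class wins
predicted <- names(which.max(posterior))
```

Even though the prior favours the normal class, the much larger likelihood under the Trojan class makes its posterior the larger of the two for this point.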

Estimating

With these new definitions for predicting a class, the prior probability and the likelihood need to be estimated.

Prior probability estimation

Given the dataset D, if the number of instances in D labeled with class $c_i$ is $n_i$ and the size of D is n, we get the estimate of the prior probability of the class $c_i$ as follows:

$$\hat{P}(c_i) = \frac{n_i}{n}$$
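In R, this estimate is a relative frequency table of the class labels. A minimal sketch with made-up labels:

```r
# Hypothetical class labels for eight training instances
labels <- factor(c("trojan", "normal", "normal", "normal",
                   "trojan", "normal", "normal", "normal"))

counts <- table(labels)            # n_i for each class
prior  <- counts / length(labels)  # n_i / n
```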

Likelihood estimation

For numeric attributes, assuming all attributes are numeric, the estimation equations are as follows. One assumption is made: each class $c_i$ is normally distributed around some mean $\mu_i$ with the corresponding covariance matrix $\Sigma_i$. The subset $D_i$ of instances labeled with $c_i$ is used to estimate $\hat{\mu}_i$ and $\hat{\Sigma}_i$ for $c_i$:

$$f_i(x) = \frac{1}{(2\pi)^{m/2}\, |\hat{\Sigma}_i|^{1/2}} \exp\left(-\frac{1}{2}(x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i)\right)$$
$$\hat{\mu}_i = \frac{1}{n_i} \sum_{x_j \in D_i} x_j$$
$$\hat{\Sigma}_i = \frac{1}{n_i} \sum_{x_j \in D_i} (x_j - \hat{\mu}_i)(x_j - \hat{\mu}_i)^T$$
$$\hat{y} = \arg\max_{c_i} f_i(x)\, \hat{P}(c_i)$$

Categorical attributes can be handled similarly, with minor differences.
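For the numeric case, the mean vector, the covariance matrix, and the normal density can be computed with base R. The sketch below uses the built-in iris data purely as a convenient numeric example. The function name `GaussianDensity` is hypothetical, and the covariance is divided by the class size, which is an assumption about the estimator intended here.

```r
# Instances of one class from the built-in iris data
x.class <- as.matrix(iris[iris$Species == "setosa", 1:4])
n.i <- nrow(x.class)

mu.hat <- colMeans(x.class)        # class mean vector
z <- sweep(x.class, 2, mu.hat)     # centered data
sigma.hat <- crossprod(z) / n.i    # covariance estimate, divided by n_i

# Multivariate normal density at point x
GaussianDensity <- function(x, mu, sigma) {
  d <- x - mu
  m <- length(mu)
  exp(-0.5 * sum(d * solve(sigma, d))) / sqrt((2 * pi)^m * det(sigma))
}

# The density is largest at the class mean
f.at.mean <- GaussianDensity(mu.hat, mu.hat, sigma.hat)
```

Note that `cov()` in R divides by the class size minus one, so it differs from `sigma.hat` by that factor.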

The Bayes classification

The pseudocode of the Bayes classification algorithm is as follows:

[Figure: pseudocode of the Bayes classification algorithm]

The R implementation

The R code for the Bayes classification is listed as follows:

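    # Train a Gaussian Bayes classifier: one prior, mean vector, and
    # covariance matrix per class. All attributes must be numeric.
    BayesClassifier <- function(data, classes) {
        classes <- as.factor(classes)
        # Split the rows of the data by class label
        data.subsets <- split(as.data.frame(data), classes)
        cards <- sapply(data.subsets, nrow)        # class sizes
        prior.p <- cards / sum(cards)              # prior probabilities
        means <- lapply(data.subsets, colMeans)    # class mean vectors
        # Covariance estimate divided by the class size
        cov.m <- Map(function(d, n) cov(d) * (n - 1) / n,
                     data.subsets, cards)

        # R passes arguments by value, so the model is built as a list
        list(cards = cards, prior.p = prior.p, means = means, cov.m = cov.m)
    }

    # Return the label with the maximal posterior probability for point x
    GetLabelForMaxPostProbability <- function(bayes.model, x) {
        x <- as.numeric(x)
        log.post <- sapply(names(bayes.model$prior.p), function(k) {
            d <- x - bayes.model$means[[k]]
            s <- bayes.model$cov.m[[k]]
            # Log prior plus log normal density, shared constant dropped
            log(bayes.model$prior.p[[k]]) -
                0.5 * as.numeric(determinant(s)$modulus) -
                0.5 * sum(d * solve(s, d))
        })
        names(which.max(log.post))
    }

    # Train on the given data and classify a single point x
    TestClassifier <- function(x, data, classes) {
        bayes.model <- BayesClassifier(data, classes)
        GetLabelForMaxPostProbability(bayes.model, x)
    }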

The following section applies the Bayes classification algorithm to one example.

Trojan traffic identification method

A Trojan horse is a malicious program that surreptitiously performs its operations under the guise of a legitimate program. It has a specific pattern and distinctive malicious behavior (such as its traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes for dynamic ports, impersonate software, and redirect the traffic of affected services to other systems, which lets attackers hijack connections, intercept valuable data, and inject fake information or phishing content.

Depending on their purpose, Trojans come in many designs, each with a certain traffic behavior. With the ability to identify Trojan traffic, further processing can be performed to protect information. As a result, detecting Trojan traffic is one of the main tasks in detecting Trojans on a system. The behavior of a Trojan is an outlier compared to that of normal software, so classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:

[Figure: Trojan traffic behavior]

The malicious traffic behaviors include, but are not limited to, spoofing source IP addresses and short-term and long-term scanning of address/port flows, which serves as a survey for subsequent attacks. Known Trojan traffic behaviors are used as the positive training data instances. The normal traffic behaviors are used as the negative data instances in the training dataset. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, ARP cache, intranet router table, socket connections, process image, system port behavior, opened file updates, remote file updates, shell history, packet TCP/IP header information, identification fields (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at a certain checkpoint, and the class label for the type of detection. DNS traffic plays an important role in Trojan detection too, because Trojan traffic has a certain relation with DNS traffic.

[Figure: a possible structure for classifying Trojan traffic]

Traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This led to the introduction of mining technologies for the classification of Trojan traffic. The Bayesian classifier is one of the better solutions among others. The preceding diagram is one such possible structure.
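To show how the pieces fit together, here is a self-contained sketch that trains a Gaussian Bayes classifier on synthetic flow summaries. Only the numeric attributes (number of flows, packets, and bytes) are used. The distributions, the column names, and the helper names `LogScore` and `Classify` are all invented for illustration. A real study would use collected traffic records.

```r
set.seed(1)
n.normal <- 150
n.trojan <- 50
# Synthetic flow summaries; the distributions are invented for illustration
flows <- data.frame(
  n.flows   = c(rnorm(n.normal, 20, 5),    rnorm(n.trojan, 60, 10)),
  n.packets = c(rnorm(n.normal, 400, 80),  rnorm(n.trojan, 150, 40)),
  n.bytes   = c(rnorm(n.normal, 3e5, 5e4), rnorm(n.trojan, 4e4, 1e4)),
  label     = factor(rep(c("normal", "trojan"), c(n.normal, n.trojan)))
)
features <- c("n.flows", "n.packets", "n.bytes")

# Per-class prior, mean vector, and sample covariance matrix
model <- lapply(split(flows[features], flows$label), function(d) {
  list(prior = nrow(d) / nrow(flows), mean = colMeans(d), cov = cov(d))
})

# Log prior plus log normal density, up to a constant shared by all classes
LogScore <- function(x, m) {
  d <- x - m$mean
  log(m$prior) - 0.5 * as.numeric(determinant(m$cov)$modulus) -
    0.5 * sum(d * solve(m$cov, d))
}

# Pick the class with the largest score
Classify <- function(x) {
  names(which.max(sapply(model, function(m) LogScore(x, m))))
}

pred <- apply(as.matrix(flows[features]), 1, Classify)
accuracy <- mean(pred == as.character(flows$label))
```

The two synthetic classes are far apart, so the training accuracy is high. Real Trojan traffic overlaps much more with normal traffic, and the accuracy should be measured on held-out data.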

Estimating

With new definitions to predict a class, the prior probability and its likelihood needs to be estimated.

Prior probability estimation

Given the dataset D, if the number of instances in D labeled with class Prior probability estimation is Prior probability estimation and the size of D is n, we get the estimation for the prior probability of the class Prior probability estimation as follows:

Prior probability estimation

Likelihood estimation

For numeric attributes, assuming all attributes are numeric, here is the estimation equation. One presumption is declared: each class Likelihood estimation is normally distributed around some mean Likelihood estimation with the corresponding covariance matrix Likelihood estimation. Likelihood estimation is used to estimate Likelihood estimation, Likelihood estimation for Likelihood estimation:

Likelihood estimation
Likelihood estimation
Likelihood estimation
Likelihood estimation

For categorical attributes, it can also be dealt with similarly but with minor difference.

The Bayes classification

The pseudocode of the Bayes classification algorithm is as follows:

The Bayes classification

The R implementation

The R code for the Bayes classification is listed as follows:

  1 BayesClassifier <- function(data,classes){
  2     bayes.model <- NULL
  3 
  4     data.subsets <- SplitData(data,classes)
  5     cards <- GetCardinality(data.subsets)
  6     prior.p <- GetPriorProbability(cards)
  7     means <- GetMeans(data.subsets,cards)
  8    cov.m <-GetCovarianceMatrix(data.subsets,cards,means)
  9 
 10     AddCardinality(bayes.model,cards)
 11     AddPriorProbability(bayes.model,prior.p)
 12     AddMeans(bayes.model,means)
 13     AddCovarianceMatrix(bayes.model,cov.m)
 14 
 15     return(bayes.model)
 16 }
 17 
 18 TestClassifier <- function(x){
 19     data <- GetTrainingData()
 20     classes <- GetClasses()
 21     bayes.model <- BayesClassifier(data,classes)
 22 
 23     y <- GetLabelForMaxPostProbability(bayes.model,x)
 24 
 25     return(y)
 26 }

One example is chosen to apply the Bayes classification algorithm, in the following section.

Trojan traffic identification method

A Trojan horse, which is a malicious program, surreptitiously performs its operation under the guise of a legitimate program. It has a specific pattern and unique malicious behavior (such as traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes for dynamic ports, impersonate software and redirect traffic of affected services to other systems, make them available to attackers to hijack connections, intercept valuable data, and inject fake information or phishing.

Depending on the purpose of Trojans, there are many versatile types of designs for Trojans, each with a certain traffic behavior. With the ability to identify the Trojan traffic, further processing can be performed to protect information. As a result, detecting the traffic of Trojans is one of the main tasks to detect Trojans on system. The behavior of Trojans is an outlier compared to the normal software. So the classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:

Trojan traffic identification method

The malicious traffic behaviors include but are not limited to spoofing the source IP addresses and (short and long) term scanning the flow of the address/port that serves as the survey for successive attacks. Known Trojan traffic behaviors are used as the positive training data instances. The normal traffic behaviors are used as the negative data instances in the training dataset. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, ARP cache, intranet router table, socket connections, process image, system ports behavior, opened files updates, remote files updates, shell history, packet TCP/IP headers information, identification fields (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at certain checkpoint, and the class label for the type of detection. The DNS traffic plays an important role in the Trojans' detection too; the traffics of Trojans has certain a relation with DNS traffic.

Trojan traffic identification method

The traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This led to the introduction of mining technologies for the classification of Trojan traffic. The Bayesian classifier is one of the better solutions among others. The preceding diagram is one such possible structure.

Prior probability estimation

Given the dataset D, if the number of instances in D labeled with class Prior probability estimation is Prior probability estimation and the size of D is n, we get the estimation for the prior probability of the class Prior probability estimation as follows:

Prior probability estimation

Likelihood estimation

For numeric attributes, assuming all attributes are numeric, here is the estimation equation. One presumption is declared: each class Likelihood estimation is normally distributed around some mean Likelihood estimation with the corresponding covariance matrix Likelihood estimation. Likelihood estimation is used to estimate Likelihood estimation, Likelihood estimation for Likelihood estimation:

Likelihood estimation
Likelihood estimation
Likelihood estimation
Likelihood estimation

For categorical attributes, it can also be dealt with similarly but with minor difference.

The Bayes classification

The pseudocode of the Bayes classification algorithm is as follows:

The Bayes classification
The R implementation

The R code for the Bayes classification is listed as follows:

  1 BayesClassifier <- function(data,classes){
  2     bayes.model <- NULL
  3 
  4     data.subsets <- SplitData(data,classes)
  5     cards <- GetCardinality(data.subsets)
  6     prior.p <- GetPriorProbability(cards)
  7     means <- GetMeans(data.subsets,cards)
  8    cov.m <-GetCovarianceMatrix(data.subsets,cards,means)
  9 
 10     AddCardinality(bayes.model,cards)
 11     AddPriorProbability(bayes.model,prior.p)
 12     AddMeans(bayes.model,means)
 13     AddCovarianceMatrix(bayes.model,cov.m)
 14 
 15     return(bayes.model)
 16 }
 17 
 18 TestClassifier <- function(x){
 19     data <- GetTrainingData()
 20     classes <- GetClasses()
 21     bayes.model <- BayesClassifier(data,classes)
 22 
 23     y <- GetLabelForMaxPostProbability(bayes.model,x)
 24 
 25     return(y)
 26 }

One example is chosen to apply the Bayes classification algorithm, in the following section.

Trojan traffic identification method

A Trojan horse, which is a malicious program, surreptitiously performs its operation under the guise of a legitimate program. It has a specific pattern and unique malicious behavior (such as traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes for dynamic ports, impersonate software and redirect traffic of affected services to other systems, make them available to attackers to hijack connections, intercept valuable data, and inject fake information or phishing.

Depending on the purpose of Trojans, there are many versatile types of designs for Trojans, each with a certain traffic behavior. With the ability to identify the Trojan traffic, further processing can be performed to protect information. As a result, detecting the traffic of Trojans is one of the main tasks to detect Trojans on system. The behavior of Trojans is an outlier compared to the normal software. So the classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:

Trojan traffic identification method

The malicious traffic behaviors include but are not limited to spoofing the source IP addresses and (short and long) term scanning the flow of the address/port that serves as the survey for successive attacks. Known Trojan traffic behaviors are used as the positive training data instances. The normal traffic behaviors are used as the negative data instances in the training dataset. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, ARP cache, intranet router table, socket connections, process image, system ports behavior, opened files updates, remote files updates, shell history, packet TCP/IP headers information, identification fields (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at certain checkpoint, and the class label for the type of detection. The DNS traffic plays an important role in the Trojans' detection too; the traffics of Trojans has certain a relation with DNS traffic.

Trojan traffic identification method

The traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This led to the introduction of mining technologies for the classification of Trojan traffic. The Bayesian classifier is one of the better solutions among others. The preceding diagram is one such possible structure.

Likelihood estimation

For numeric attributes, assuming all attributes are numeric, here is the estimation equation. One presumption is declared: each class Likelihood estimation is normally distributed around some mean Likelihood estimation with the corresponding covariance matrix Likelihood estimation. Likelihood estimation is used to estimate Likelihood estimation, Likelihood estimation for Likelihood estimation:

Likelihood estimation
Likelihood estimation
Likelihood estimation
Likelihood estimation

For categorical attributes, it can also be dealt with similarly but with minor difference.

The Bayes classification

The pseudocode of the Bayes classification algorithm is as follows:

The Bayes classification
The R implementation

The R code for the Bayes classification is listed as follows:

  1 BayesClassifier <- function(data,classes){
  2     bayes.model <- NULL
  3 
  4     data.subsets <- SplitData(data,classes)
  5     cards <- GetCardinality(data.subsets)
  6     prior.p <- GetPriorProbability(cards)
  7     means <- GetMeans(data.subsets,cards)
  8    cov.m <-GetCovarianceMatrix(data.subsets,cards,means)
  9 
 10     AddCardinality(bayes.model,cards)
 11     AddPriorProbability(bayes.model,prior.p)
 12     AddMeans(bayes.model,means)
 13     AddCovarianceMatrix(bayes.model,cov.m)
 14 
 15     return(bayes.model)
 16 }
 17 
 18 TestClassifier <- function(x){
 19     data <- GetTrainingData()
 20     classes <- GetClasses()
 21     bayes.model <- BayesClassifier(data,classes)
 22 
 23     y <- GetLabelForMaxPostProbability(bayes.model,x)
 24 
 25     return(y)
 26 }

One example is chosen to apply the Bayes classification algorithm, in the following section.

Trojan traffic identification method

A Trojan horse, which is a malicious program, surreptitiously performs its operation under the guise of a legitimate program. It has a specific pattern and unique malicious behavior (such as traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes for dynamic ports, impersonate software and redirect traffic of affected services to other systems, make them available to attackers to hijack connections, intercept valuable data, and inject fake information or phishing.

Depending on the purpose of Trojans, there are many versatile types of designs for Trojans, each with a certain traffic behavior. With the ability to identify the Trojan traffic, further processing can be performed to protect information. As a result, detecting the traffic of Trojans is one of the main tasks to detect Trojans on system. The behavior of Trojans is an outlier compared to the normal software. So the classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:

Trojan traffic identification method

The malicious traffic behaviors include but are not limited to spoofing the source IP addresses and (short and long) term scanning the flow of the address/port that serves as the survey for successive attacks. Known Trojan traffic behaviors are used as the positive training data instances. The normal traffic behaviors are used as the negative data instances in the training dataset. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, ARP cache, intranet router table, socket connections, process image, system ports behavior, opened files updates, remote files updates, shell history, packet TCP/IP headers information, identification fields (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at certain checkpoint, and the class label for the type of detection. The DNS traffic plays an important role in the Trojans' detection too; the traffics of Trojans has certain a relation with DNS traffic.

Trojan traffic identification method

The traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This led to the introduction of mining technologies for the classification of Trojan traffic. The Bayesian classifier is one of the better solutions among others. The preceding diagram is one such possible structure.

The Bayes classification

The pseudocode of the Bayes classification algorithm is as follows:

The Bayes classification

The R implementation

The R code for the Bayes classification is listed as follows:

  1 BayesClassifier <- function(data,classes){
  2     bayes.model <- NULL
  3 
  4     data.subsets <- SplitData(data,classes)
  5     cards <- GetCardinality(data.subsets)
  6     prior.p <- GetPriorProbability(cards)
  7     means <- GetMeans(data.subsets,cards)
  8    cov.m <-GetCovarianceMatrix(data.subsets,cards,means)
  9 
 10     AddCardinality(bayes.model,cards)
 11     AddPriorProbability(bayes.model,prior.p)
 12     AddMeans(bayes.model,means)
 13     AddCovarianceMatrix(bayes.model,cov.m)
 14 
 15     return(bayes.model)
 16 }
 17 
 18 TestClassifier <- function(x){
 19     data <- GetTrainingData()
 20     classes <- GetClasses()
 21     bayes.model <- BayesClassifier(data,classes)
 22 
 23     y <- GetLabelForMaxPostProbability(bayes.model,x)
 24 
 25     return(y)
 26 }

One example is chosen to apply the Bayes classification algorithm, in the following section.

Trojan traffic identification method

A Trojan horse, which is a malicious program, surreptitiously performs its operation under the guise of a legitimate program. It has a specific pattern and unique malicious behavior (such as traffic and other operations). For example, it may obtain account information and sensitive system information for further attacks. It can also fork processes for dynamic ports, impersonate software and redirect traffic of affected services to other systems, make them available to attackers to hijack connections, intercept valuable data, and inject fake information or phishing.

Depending on the purpose of Trojans, there are many versatile types of designs for Trojans, each with a certain traffic behavior. With the ability to identify the Trojan traffic, further processing can be performed to protect information. As a result, detecting the traffic of Trojans is one of the main tasks to detect Trojans on system. The behavior of Trojans is an outlier compared to the normal software. So the classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:

Trojan traffic identification method

The malicious traffic behaviors include but are not limited to spoofing the source IP addresses and (short and long) term scanning the flow of the address/port that serves as the survey for successive attacks. Known Trojan traffic behaviors are used as the positive training data instances. The normal traffic behaviors are used as the negative data instances in the training dataset. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, ARP cache, intranet router table, socket connections, process image, system ports behavior, opened files updates, remote files updates, shell history, packet TCP/IP headers information, identification fields (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at certain checkpoint, and the class label for the type of detection. The DNS traffic plays an important role in the Trojans' detection too; the traffics of Trojans has certain a relation with DNS traffic.

Trojan traffic identification method

The traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This led to the introduction of mining technologies for the classification of Trojan traffic. The Bayesian classifier is one of the better solutions among others. The preceding diagram is one such possible structure.

The R implementation

The R code for the Bayes classification is listed as follows:

  1 BayesClassifier <- function(data,classes){
  2     bayes.model <- NULL
  3 
  4     data.subsets <- SplitData(data,classes)
  5     cards <- GetCardinality(data.subsets)
  6     prior.p <- GetPriorProbability(cards)
  7     means <- GetMeans(data.subsets,cards)
  8    cov.m <-GetCovarianceMatrix(data.subsets,cards,means)
  9 
 10     AddCardinality(bayes.model,cards)
 11     AddPriorProbability(bayes.model,prior.p)
 12     AddMeans(bayes.model,means)
 13     AddCovarianceMatrix(bayes.model,cov.m)
 14 
 15     return(bayes.model)
 16 }
 17 
 18 TestClassifier <- function(x){
 19     data <- GetTrainingData()
 20     classes <- GetClasses()
 21     bayes.model <- BayesClassifier(data,classes)
 22 
 23     y <- GetLabelForMaxPostProbability(bayes.model,x)
 24 
 25     return(y)
 26 }

One example is chosen to apply the Bayes classification algorithm, in the following section.

Trojan traffic identification method

A Trojan horse is a malicious program that surreptitiously performs its operations under the guise of a legitimate program. It has a specific pattern and distinctive malicious behavior (in its traffic and in other operations). For example, it may obtain account information and sensitive system information for use in further attacks. It can also fork processes that use dynamic ports, impersonate software, and redirect the traffic of affected services to other systems, which lets attackers hijack connections, intercept valuable data, and inject fake information or phishing content.

Trojans are designed in many different ways depending on their purpose, and each design has a characteristic traffic behavior. Once Trojan traffic can be identified, further processing can be performed to protect information. As a result, detecting Trojan traffic is one of the main tasks in detecting Trojans on a system. The behavior of a Trojan is an outlier compared to that of normal software, so classification algorithms such as the Bayesian classification algorithm can be applied to detect the outliers. Here is a diagram showing the Trojan traffic behavior:

[Diagram: Trojan traffic behavior]

Malicious traffic behaviors include, but are not limited to, spoofing source IP addresses and short-term and long-term scanning of address/port flows, which serves as reconnaissance for subsequent attacks. Known Trojan traffic behaviors are used as the positive instances in the training dataset, and normal traffic behaviors are used as the negative instances. These kinds of datasets are continuously collected by NGOs.

The attributes used for a dataset may include the latest DNS request, the NetBIOS name table on the host machine, the ARP cache, the intranet router table, socket connections, the process image, system port behavior, updates to opened files, updates to remote files, shell history, TCP/IP packet header information, the identification field (IPID) of the IP header, Time To Live (TTL), and so forth. One possible attribute set for a dataset is source IP, source port, target IP, target port, number of flows, number of packets, number of bytes, timestamp at a certain checkpoint, and the class label for the type of detection. DNS traffic plays an important role in Trojan detection too, because Trojan traffic has a certain relationship with DNS traffic.

[Diagram: a possible structure for classifying Trojan traffic]

Traditional technologies for detecting a Trojan often rely on the Trojan's signature and can be deceived by dynamic ports, encrypted messages, and so on. This led to the introduction of mining technologies for classifying Trojan traffic, and the Bayesian classifier is one of the better options. The preceding diagram shows one possible structure.

Identify spam e-mail and Naïve Bayes classification

Naïve Bayes classification assumes that all attributes are conditionally independent given the class. This simplifies Bayes classification, because the dependencies among attributes no longer need to be estimated. The likelihood can be defined with the following equation:

[Equation: Naïve Bayes likelihood]
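Under the independence assumption, the likelihood is usually written as a product of per-attribute terms. The notation here is an assumption made for illustration: X = (x_1, ..., x_n) is an instance with n attributes and C_i is a class.

```latex
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
```

The predicted class is then the C_i that maximizes P(X | C_i) P(C_i).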

Some of the characteristics of the Naïve Bayes classification are as follows:

  • Robust to isolated noise
  • Robust to irrelevant attributes
  • Its performance might degrade due to correlated attributes in the input dataset

The Naïve Bayes classification

The pseudocode of the Naïve Bayes classification algorithm, with minor differences from the Bayes classification algorithm, is as follows:

[Figure: pseudocode of the Naïve Bayes classification algorithm]

The R implementation

The R code for the Naïve Bayes classification is listed as follows:

    # SplitData, GetCardinality, and the other Get* helpers are assumed to be
    # defined elsewhere; this listing only outlines the training and test flow.
    NaiveBayesClassifier <- function(data, classes) {
        # Partition the training data by class label.
        data.subsets <- SplitData(data, classes)

        # Per-class counts, priors, means, and per-attribute variances.
        cards <- GetCardinality(data.subsets)
        prior.p <- GetPriorProbability(cards)
        means <- GetMeans(data.subsets, cards)
        variances.m <- GetVariancesMatrix(data.subsets, cards, means)

        # R passes arguments by value, so the model is assembled as a list
        # and returned instead of being filled in by calls on a NULL object.
        naive.bayes.model <- list(cards = cards, prior.p = prior.p,
                                  means = means, variances.m = variances.m)
        return(naive.bayes.model)
    }

    TestClassifier <- function(x) {
        data <- GetTrainingData()
        classes <- GetClasses()
        naive.bayes.model <- NaiveBayesClassifier(data, classes)

        # Pick the class label with the maximum posterior probability for x.
        y <- GetLabelForMaxPostProbability(naive.bayes.model, x)
        return(y)
    }
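As a runnable counterpart to the outline above, here is a minimal base R sketch of a Gaussian Naïve Bayes classifier. It assumes numeric attributes with non-zero variance in every class. The function names and the toy data (two made-up attributes per e-mail: number of links and number of HTML form tags) are hypothetical:

```r
# Train: per class, store the prior plus per-attribute means and variances.
train_naive_bayes <- function(X, y) {
  X <- as.matrix(X)
  lapply(split(seq_len(nrow(X)), y), function(idx) {
    Xi <- X[idx, , drop = FALSE]
    list(prior = length(idx) / nrow(X),
         mean  = colMeans(Xi),
         var   = apply(Xi, 2, var))
  })
}

# Predict: under the independence assumption the log-likelihood is a sum
# of one-dimensional normal log-densities, one per attribute.
predict_naive_bayes <- function(model, x) {
  scores <- sapply(model, function(p) {
    log(p$prior) + sum(dnorm(x, mean = p$mean, sd = sqrt(p$var), log = TRUE))
  })
  names(scores)[which.max(scores)]
}

# Toy, made-up data: columns are (number of links, number of form tags).
X <- rbind(c(9, 3), c(12, 4), c(10, 2), c(11, 5),
           c(1, 0), c(2, 1),  c(0, 0),  c(1, 1))
y <- factor(rep(c("spam", "legit"), each = 4))

model_nb <- train_naive_bayes(X, y)
pred_spam  <- predict_naive_bayes(model_nb, c(10, 3))
pred_legit <- predict_naive_bayes(model_nb, c(1, 0))
```

Compared with the full Bayes classifier, only a vector of variances is kept per class instead of a covariance matrix, which is the "minor difference" between the two algorithms.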

The following section applies the Naïve Bayes classification algorithm to an example.

Identify spam e-mail

E-mail spam is one of the major issues on the Internet. It refers to irrelevant, inappropriate, and unsolicited e-mails sent to recipients who did not ask for them, for purposes such as advertising and promotion or spreading malware.

Note

The spam track at the Text Retrieval Conference (TREC) formally defines e-mail spam as unsolicited, unwanted e-mail that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient.

The increase in e-mail users, business e-mail campaigns, and suspicious usage of e-mail has resulted in a massive volume of spam e-mail, which in turn necessitates highly efficient solutions for detecting it.

E-mail spam filters are automated tools that recognize spam and prevent its further delivery. The classifier serves as the spam detector here. One solution is to combine the results of several e-mail spam classifiers to improve classification effectiveness and robustness.
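One simple way to combine several classifiers is a majority vote over their decisions. The sketch below is only an illustration of that idea and is not necessarily the combination scheme the text has in mind; the classifier names and votes are hypothetical:

```r
# Combine the 0/1 spam votes of several classifiers by simple majority.
# votes: matrix with one row per e-mail and one column per classifier.
majority_vote <- function(votes) {
  as.integer(rowSums(votes) > ncol(votes) / 2)
}

# Hypothetical outputs of three classifiers for four e-mails (1 = spam).
votes <- cbind(clf_a = c(1, 0, 1, 0),
               clf_b = c(1, 0, 0, 0),
               clf_c = c(0, 1, 1, 0))

combined <- majority_vote(votes)
```

With an odd number of classifiers there are no ties; with an even number, this sketch labels a tie as non-spam.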

Spam e-mail can be judged from its content, title, and so on. As a result, attributes of an e-mail such as the subject, content, sender address, IP address, time-related attributes, in-count/out-count, and communication interaction average can be selected as the attribute set of each data instance in the dataset. Example attributes include the occurrence of HTML form tags, IP-based URLs, age of linked-to domains, nonmatching URLs, HTML e-mail, number of links in the e-mail body, and so on. The candidate attributes include both discrete and continuous types.

The training dataset for the Naïve Bayes classifier will be composed of the labeled spam e-mails and legitimate e-mails.

Rule-based classification of player types in computer games and rule-based classification

Unlike other classification algorithms, the learned model in rule-based classification is a set of IF-THEN rules. The rule set can be derived from a decision tree or learned directly by the algorithm described later in this section. An IF-THEN rule has the following format:

          IF condition_holds_true THEN make_a_conclusion

An alternative format is as follows:

[Equation: alternative IF-THEN rule format]

For a given instance or record in the source dataset, if the rule antecedent holds true, the rule is said to cover the instance, and the rule is satisfied.

Given a rule R, the coverage and accuracy are defined as follows:

[Equations: coverage and accuracy of rule R]
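Under the usual textbook definitions (an assumption here), coverage is the fraction of all records that the rule covers, and accuracy is the fraction of the covered records that the rule classifies correctly. A small sketch in R, with made-up attribute names, class labels, and records:

```r
# Toy, made-up records: two attributes and a class label.
d <- data.frame(
  combat  = c("high", "high", "low", "high", "low", "low"),
  explore = c("low",  "low",  "high", "high", "low", "high"),
  class   = c("A",    "A",    "B",    "B",    "A",   "B"),
  stringsAsFactors = FALSE
)

# Rule: IF combat == "high" AND explore == "low" THEN class = "A".
rule <- list(
  covers = function(d) d$combat == "high" & d$explore == "low",
  label  = "A"
)

# Fraction of all records whose attributes satisfy the antecedent.
rule_coverage <- function(rule, d) sum(rule$covers(d)) / nrow(d)

# Fraction of the covered records whose class matches the consequent.
rule_accuracy <- function(rule, d) {
  covered <- rule$covers(d)
  sum(covered & d$class == rule$label) / sum(covered)
}

cov_r <- rule_coverage(rule, d)  # 2 of the 6 records are covered
acc_r <- rule_accuracy(rule, d)  # both covered records belong to class "A"
```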

Transformation from decision tree to decision rules

It is very convenient to transform a decision tree into a set of decision rules for further processing. One rule can be written for every path from the root to a leaf in the decision tree. The left-hand side of the rule, the rule antecedent, is built by combining the labels of the nodes and the labels of the arcs along the path, and the rule consequent is given by the leaf node. One example of extracting classification rules from a decision tree is illustrated in the following diagram:

[Diagram: extracting classification rules from a decision tree]

One important question is how to prune the resulting rule set.
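The path-to-rule transformation can be sketched with a short recursive function. The tree representation below (a nested list with `attribute`, `children`, and `label` fields) and the attribute values are assumptions made for this illustration only:

```r
# A tiny, hypothetical decision tree as a nested list. Internal nodes carry
# an attribute name and one child per arc label; leaves carry a class label.
tree <- list(
  attribute = "outlook",
  children = list(
    sunny = list(
      attribute = "windy",
      children = list(yes = list(label = "stay"),
                      no  = list(label = "play"))
    ),
    rainy = list(label = "stay")
  )
)

# Walk every root-to-leaf path; each path becomes one IF-THEN rule.
tree_to_rules <- function(node, conditions = character(0)) {
  if (!is.null(node$label)) {
    antecedent <- if (length(conditions) > 0) {
      paste(conditions, collapse = " AND ")
    } else {
      "TRUE"
    }
    return(paste("IF", antecedent, "THEN", node$label))
  }
  unlist(lapply(names(node$children), function(arc) {
    cond <- paste0(node$attribute, " = ", arc)
    tree_to_rules(node$children[[arc]], c(conditions, cond))
  }))
}

rules <- tree_to_rules(tree)
```

The result is one rule string per leaf, so the rule set is as large as the number of leaves before any pruning is applied.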

Rule-based classification

Rules are learned sequentially, one at a time. The LearnOneRule function follows a greedy strategy: its goal is to cover as many positive instances in the source dataset as possible while covering none, or as few as possible, of the negative instances. All instances in the source dataset that belong to the class being learned are treated as positive, and those that belong to other classes are treated as negative. An initial rule r is generated and then refined until the stop condition is met.

Sequential covering algorithm

The input parameters of the generic sequential covering algorithm are the dataset of class-labeled tuples and the attribute set with all possible values of each attribute. The output is a set of IF-THEN rules. The pseudocode is as follows:

[Figure: pseudocode of the generic sequential covering algorithm]

The RIPPER algorithm

Repeated Incremental Pruning to Produce Error Reduction (RIPPER) is a direct rule-based classifier whose rule set is relatively easy to interpret and which is the most practical for class-imbalance problems.

To grow a rule, the algorithm starts from an empty rule and adds the conjuncts that maximize or improve the information gain measure, that is, FOIL's information gain. It stops when the rule no longer covers any negative examples. The resulting rule is pruned immediately with incremental reduced error pruning: any final sequence of conditions is removed if doing so maximizes the pruning measure v, which is calculated as follows:

[Equation: pruning measure v]
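In the usual description of RIPPER, which is assumed here, the pruning measure is computed from the number of positive examples p and negative examples n that the rule covers on the pruning set:

```latex
v = \frac{p - n}{p + n}
```

A rule that covers only positive examples has v = 1, and a rule that covers as many negatives as positives has v = 0.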

The sequential covering algorithm is used to build a rule set; the new description length (DL) is computed once a new rule is added to the rule set. The rule set is then optimized.

Here, p is the number of positive examples covered by this rule and n is the number of negative examples covered by this rule. P denotes the number of positive examples of this class, and N the number of negative examples of this class.
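FOIL's information gain, used while growing a rule, compares the rule before and after a candidate conjunct is added. The sketch below uses the common textbook form; the argument names are assumptions: `p0` and `n0` are the positive and negative examples covered before the conjunct is added, and `p1` and `n1` are those covered afterwards.

```r
# FOIL's information gain for adding one conjunct to a rule.
foil_gain <- function(p0, n0, p1, n1) {
  if (p1 == 0) return(0)  # a rule covering no positives gains nothing
  p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))
}

# Made-up counts: the rule covered 10 positives and 10 negatives before,
# and covers 8 positives and 2 negatives after the new conjunct.
gain <- foil_gain(p0 = 10, n0 = 10, p1 = 8, n1 = 2)
```

The gain rewards conjuncts that raise the proportion of positives while still covering many of them.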

The pseudocode of the RIPPER algorithm is as follows:

[Figure: pseudocode of the RIPPER algorithm]

The R implementation

The R code for the rule-based classification is listed as follows:

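    # GetCount, GetAt, LearnOneRule, FilterData, and CheckTermination are
    # assumed to be defined elsewhere; this listing outlines the control flow.
    SequentialCovering <- function(data, x, classes) {
        rule.set <- list()

        classes.size <- GetCount(classes)
        # Learn rules for one class at a time.
        for (idx in seq_len(classes.size)) {
            one.class <- GetAt(classes, idx)
            repeat {
                # Learn one rule on the remaining data, then remove the
                # instances that the new rule covers.
                one.rule <- LearnOneRule(data, x, one.class)
                data <- FilterData(data, one.rule)
                rule.set <- c(rule.set, list(one.rule))
                if (CheckTermination(data, x, classes, rule.set)) {
                    break
                }
            }
        }
        return(rule.set)
    }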
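Because the helpers in the listing above are left undefined, the following is a small, self-contained sketch of sequential covering in base R. It assumes categorical attributes stored as character columns and a column named `class`, and it grows each rule greedily by accuracy on the covered records, a simplification that stands in for the FOIL-based growth of RIPPER. All names and the toy data are hypothetical:

```r
# Greedily grow one rule for class `cls`: repeatedly add the attribute = value
# test that most improves accuracy on the records the rule still covers.
learn_one_rule <- function(d, attrs, cls) {
  conds <- list()
  covered <- rep(TRUE, nrow(d))
  repeat {
    if (all(d$class[covered] == cls)) break   # no negatives left
    best <- NULL
    best_acc <- mean(d$class[covered] == cls)
    for (a in setdiff(attrs, names(conds))) {
      for (v in unique(d[[a]][covered])) {
        keep <- covered & d[[a]] == v
        acc <- mean(d$class[keep] == cls)
        if (acc > best_acc) {
          best <- list(attr = a, value = v, keep = keep)
          best_acc <- acc
        }
      }
    }
    if (is.null(best)) break                  # no test improves the rule
    conds[[best$attr]] <- best$value
    covered <- best$keep
  }
  list(conds = conds, class = cls)
}

# TRUE for each record that satisfies every condition of the rule.
covers <- function(rule, d) {
  ok <- rep(TRUE, nrow(d))
  for (a in names(rule$conds)) ok <- ok & d[[a]] == rule$conds[[a]]
  ok
}

# Learn rules class by class, removing covered records after each rule.
sequential_covering <- function(d, attrs, classes) {
  rules <- list()
  for (cls in classes) {
    while (any(d$class == cls)) {
      rule <- learn_one_rule(d, attrs, cls)
      hit <- covers(rule, d)
      if (!any(hit & d$class == cls)) break   # guard against no progress
      rules[[length(rules) + 1]] <- rule
      d <- d[!hit, , drop = FALSE]
    }
  }
  rules
}

# Toy, made-up player records.
players <- data.frame(
  combat  = c("high", "high", "high", "low", "low", "low"),
  explore = c("low",  "low",  "high", "high", "high", "low"),
  class   = c("fighter", "fighter", "explorer",
              "explorer", "explorer", "fighter"),
  stringsAsFactors = FALSE
)

rules <- sequential_covering(players, c("combat", "explore"),
                             c("fighter", "explorer"))
```

On this toy data the first rule is IF explore = low THEN fighter, and the remaining records are all explorers, so the second rule has an empty antecedent and acts as a default rule.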

The following section applies the rule-based classification algorithm to an example.

Rule-based classification of player types in computer games

In computer game development, and in the game context generally, improving the player's experience is a continual task. Classifying player types is a major part of it, and it in turn brings further improvements, including to game design.

One of the popular player models based on a typology of temperament is the DGD player typology, which is illustrated in the following diagram. Given this model, game players can be labeled with appropriate types, which helps in explaining games, designing new games, and so forth.

[Diagram: the DGD player typology]

Based on the player behaviors or models, we can train a decision tree model on the dataset and then derive the rule set from the trained decision tree. The dataset comes from the game log and some predefined domain knowledge.

Transformation from decision tree to decision rules

It is very convenient to transform the decision tree into a decision rules set for further processing. Along with every path from the root to a leaf in the decision tree, a rule can be written. The left-hand side, the rule antecedent, of any rule is constructed by the combination of the label of the nodes and the labels of the arcs, then the rule consequent by the leaf node. One example of extracting classification rules from the decision tree is illustrated in the following diagram:

Transformation from decision tree to decision rules

One important question is the pruning of the resulting rules set.

Rule-based classification

Rules are learned sequentially, one at a time. Here is the pseudocode of the algorithm to build a rule-based classifier. The LearnOneRule function is designed with the greedy strategy. Its target is to cover the positive instance in the source dataset as much as possible, and none or as few as possible of the negative instance at the same time. All the instances in the source dataset with a specific class are defined as positive, and those that belong to other classes are considered to be negative. An initial rule r is generated, which keeps refining until the stop condition is met.

Sequential covering algorithm

The pseudocode of the generic sequential covering algorithm is as follows. The input parameters include the dataset with class-labeled tuples and the attributes set with all of their possible values. The output is a set of IF-THEN rules as follows:

Sequential covering algorithm

The RIPPER algorithm

Repeated Incremental Pruning to Produce Error Reduction (RIPPER) is a direct rule-based classifier, in which the rule set is relatively convenient to interpret and the most practical for imbalance problems.

As per the growth of a rule, the algorithm starts from an empty rule and adds conjuncts, which maximize or improve the information gain measure, that is, the FOIL. It stops at the situation so that the rule does not cover negative rules. The resulting rule is pruned immediately with incremental reduced error pruning. Any final sequence of conditions is removed once it maximizes the measure of pruning v, which is calculated as follows:

The RIPPER algorithm

The sequential covering algorithm is used to build a rule set; the new description length (DL) is computed once a new rule is added to the rule set. The rule set is then optimized.

Given are p as the number of positive examples covered by this rule and n as the number of negative rules covered by this rule. P denotes the number of positive examples of this class, and N the number of the negative examples of this class.

The RIPPER algorithm
The RIPPER algorithm
The RIPPER algorithm

The pseudocode of the RIPPER algorithm is as follows:

The RIPPER algorithm

The R implementation

The R code for the rule-based classification is listed as follows:

  1 SequentialCovering <- function(data,x,classes){
  2     rule.set <- NULL
  3 
  4     classes.size <- GetCount(classes)
  5      idx <- 0
  6     while( idx <= classes.size ){
  7         idx <- idx+1
  8         one.class <- GetAt(classes,idx)
  9         repeat{
 10             one.rule <- LearnOneRule(newdata,x,one.class)
 11             data <- FilterData(data,one.rule)
 12             AddRule(rule.set,one.rule)
 13             if(CheckTermination(data,x,classes,rule.set)){
 14                 break;
 15             }
 16         }
 17     }
 18     return(rule.set)
 19 }

One example is chosen to apply the rule-based classification algorithm, in the following section.

Rule-based classification of player types in computer games

During computer game progressing and in the game context, improving the experience of a game is always a continual task. Classification of player types is one major task, which in turn brings more improvements including game design.

One of the popular player models of typological of temperature is the DGD player topology, which is illustrated in the following diagram. Given this model, the game players can be labeled with appropriate types, the game can be explained, it helps in designing new games, and so forth.

Rule-based classification of player types in computer games

Based on the player behaviors or models, we can train the decision tree model with the dataset, and the rules set from the trained decision tree model. The dataset will come from the game log and some predefined domain knowledge.

Rule-based classification

Rules are learned sequentially, one at a time. Here is the pseudocode of the algorithm to build a rule-based classifier. The LearnOneRule function is designed with the greedy strategy. Its target is to cover the positive instance in the source dataset as much as possible, and none or as few as possible of the negative instance at the same time. All the instances in the source dataset with a specific class are defined as positive, and those that belong to other classes are considered to be negative. An initial rule r is generated, which keeps refining until the stop condition is met.

Sequential covering algorithm

The pseudocode of the generic sequential covering algorithm is as follows. The input parameters include the dataset with class-labeled tuples and the attributes set with all of their possible values. The output is a set of IF-THEN rules as follows:

Sequential covering algorithm

The RIPPER algorithm

Repeated Incremental Pruning to Produce Error Reduction (RIPPER) is a direct rule-based classifier, in which the rule set is relatively convenient to interpret and the most practical for imbalance problems.

As per the growth of a rule, the algorithm starts from an empty rule and adds conjuncts, which maximize or improve the information gain measure, that is, the FOIL. It stops at the situation so that the rule does not cover negative rules. The resulting rule is pruned immediately with incremental reduced error pruning. Any final sequence of conditions is removed once it maximizes the measure of pruning v, which is calculated as follows:

The RIPPER algorithm

The sequential covering algorithm is used to build a rule set; the new description length (DL) is computed once a new rule is added to the rule set. The rule set is then optimized.

Given are p as the number of positive examples covered by this rule and n as the number of negative rules covered by this rule. P denotes the number of positive examples of this class, and N the number of the negative examples of this class.

The RIPPER algorithm
The RIPPER algorithm
The RIPPER algorithm

The pseudocode of the RIPPER algorithm is as follows:

The RIPPER algorithm

The R implementation

The R code for the rule-based classification is listed as follows:

  1 SequentialCovering <- function(data,x,classes){
  2     rule.set <- NULL
  3 
  4     classes.size <- GetCount(classes)
  5      idx <- 0
  6     while( idx <= classes.size ){
  7         idx <- idx+1
  8         one.class <- GetAt(classes,idx)
  9         repeat{
 10             one.rule <- LearnOneRule(newdata,x,one.class)
 11             data <- FilterData(data,one.rule)
 12             AddRule(rule.set,one.rule)
 13             if(CheckTermination(data,x,classes,rule.set)){
 14                 break;
 15             }
 16         }
 17     }
 18     return(rule.set)
 19 }

One example is chosen to apply the rule-based classification algorithm, in the following section.

Rule-based classification of player types in computer games

During computer game progressing and in the game context, improving the experience of a game is always a continual task. Classification of player types is one major task, which in turn brings more improvements including game design.

One of the popular player models of typological of temperature is the DGD player topology, which is illustrated in the following diagram. Given this model, the game players can be labeled with appropriate types, the game can be explained, it helps in designing new games, and so forth.

Rule-based classification of player types in computer games

Based on the player behaviors or models, we can train the decision tree model with the dataset, and the rules set from the trained decision tree model. The dataset will come from the game log and some predefined domain knowledge.

Sequential covering algorithm

The pseudocode of the generic sequential covering algorithm is as follows. The input parameters include the dataset with class-labeled tuples and the attributes set with all of their possible values. The output is a set of IF-THEN rules as follows:

Sequential covering algorithm

The RIPPER algorithm

Repeated Incremental Pruning to Produce Error Reduction (RIPPER) is a direct rule-based classifier, in which the rule set is relatively convenient to interpret and the most practical for imbalance problems.

As per the growth of a rule, the algorithm starts from an empty rule and adds conjuncts, which maximize or improve the information gain measure, that is, the FOIL. It stops at the situation so that the rule does not cover negative rules. The resulting rule is pruned immediately with incremental reduced error pruning. Any final sequence of conditions is removed once it maximizes the measure of pruning v, which is calculated as follows:

The RIPPER algorithm

The sequential covering algorithm is used to build a rule set; the new description length (DL) is computed once a new rule is added to the rule set. The rule set is then optimized.

Given are p as the number of positive examples covered by this rule and n as the number of negative rules covered by this rule. P denotes the number of positive examples of this class, and N the number of the negative examples of this class.

The RIPPER algorithm
The RIPPER algorithm
The RIPPER algorithm

The pseudocode of the RIPPER algorithm is as follows:

The RIPPER algorithm

The R implementation

The R code for the rule-based classification is listed as follows:

  1 SequentialCovering <- function(data,x,classes){
  2     rule.set <- NULL
  3 
  4     classes.size <- GetCount(classes)
  5      idx <- 0
  6     while( idx <= classes.size ){
  7         idx <- idx+1
  8         one.class <- GetAt(classes,idx)
  9         repeat{
 10             one.rule <- LearnOneRule(newdata,x,one.class)
 11             data <- FilterData(data,one.rule)
 12             AddRule(rule.set,one.rule)
 13             if(CheckTermination(data,x,classes,rule.set)){
 14                 break;
 15             }
 16         }
 17     }
 18     return(rule.set)
 19 }

One example is chosen to apply the rule-based classification algorithm, in the following section.

Rule-based classification of player types in computer games

In the context of computer games, improving the experience of a game is a continual task. Classifying player types is one major part of this task, and it in turn brings further improvements, including improvements to game design.

One popular player model based on temperament typology is the DGD player typology, which is illustrated in the following diagram. Given this model, game players can be labeled with appropriate types, the game can be explained, new games can be designed, and so forth.

[Figure: the DGD player typology]

Based on the player behaviors or models, we can train a decision tree model with the dataset and then extract the rule set from the trained decision tree model. The dataset comes from the game log and some predefined domain knowledge.

Time for action

Here are some exercises for you to check what you've learned so far:

  • Run the R code of the ID3 algorithm step by step on a small dataset, and trace the values of the important factors at each step
  • Prepare a dataset of web logs and create an application that detects web attacks using ID3
  • Implement R code that generates decision rules from a decision tree
  • What is Gain Ratio?

Summary

In this chapter, we learned the following facts:

  • Classification assigns instances to one of a set of predefined categories
  • Decision tree induction learns a decision tree from a source dataset of (instance, class label) pairs in the supervised learning mode
  • ID3 is a decision tree induction algorithm
  • C4.5 is an extension of ID3
  • CART is a decision tree induction algorithm
  • Bayes classification is a statistical classification algorithm
  • Naïve Bayes classification is a simplified version of Bayes classification that presumes the attributes are independent of one another given the class
  • Rule-based classification is a classification model that applies a rule set, which can be built by a direct method, such as the sequential covering algorithm, or by an indirect method that transforms a decision tree

In the next chapter, we'll cover more advanced classification algorithms, including the Bayesian belief network, SVM, the k-Nearest Neighbors algorithm, and so on.

 

Chapter 4. Advanced Classification

In this chapter, you will learn about the top classification algorithms written in the R language. You will also learn the ways to improve the classifier.

We will cover the following topics:

  • Ensemble methods
  • Biological traits and Bayesian belief network
  • Protein classification and the k-Nearest Neighbors algorithm
  • Document retrieval and Support Vector Machine
  • Text classification using sentential frequent itemsets and classification using frequent patterns
  • Classification using the backpropagation algorithm

Ensemble (EM) methods

EM methods are developed to improve the accuracy of classification. A combined classifier is typically more accurate than its base classifiers, because a majority-vote ensemble makes a mistake only when at least half of its base classifiers are wrong.
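The effect of majority voting can be checked numerically. The sketch below assumes that the base classifiers make independent errors with the same error rate, which rarely holds exactly in practice; the function name and the numbers are illustrative only.

```r
# Probability that a majority vote over k independent base classifiers is wrong,
# when each base classifier has error rate eps. Use an odd k to avoid ties.
ensemble_error <- function(k, eps) {
  majority <- floor(k / 2) + 1   # fewest wrong votes that make the ensemble wrong
  sum(dbinom(majority:k, size = k, prob = eps))
}

single <- ensemble_error(k = 1, eps = 0.35)     # a single base classifier
combined <- ensemble_error(k = 25, eps = 0.35)  # 25 base classifiers voting
```

Under these assumptions, the error of the 25-member ensemble is far below the 0.35 error rate of one base classifier. The benefit disappears when the base error rate exceeds 0.5 or when the errors are strongly correlated.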

The conceptual structure of EM methods is illustrated in the following diagram:

[Figure: conceptual structure of ensemble methods]

The label for a new data tuple is decided by the vote of a group of base classifiers. A combined classifier is created from several base classifiers. Each base classifier is trained with a different dataset, or with a training set re-sampled with replacement from the original training dataset.

Three popular EM methods are discussed in the successive sections:

  • Bagging
  • Boosting
  • Random forests

The bagging algorithm

Here is a concise description of the bagging algorithm (short for bootstrap aggregation), followed by the summarized pseudocode. For iteration i ($i = 1, 2, \ldots, K$), a training set, $D_i$, of d tuples is sampled with replacement from the original set of tuples, D; that is, each training set is a bootstrap sample. A classifier model, $M_i$, is then learned from each training set $D_i$. To classify an unknown or test tuple, X, each classifier, $M_i$, returns its class prediction, which counts as one vote. The number of classifiers that predict the same class, $c_j$, for the test tuple X is as follows, where $I(\cdot)$ returns 1 if its argument is true and 0 otherwise:

$$votes(c_j) = \sum_{i=1}^{K} I(M_i(X) = c_j)$$

The bagged classifier, $M^*$, counts the votes and assigns the class with the most votes to X. Each vote has the same weight in the equation:

$$M^*(X) = \arg\max_{c_j} \sum_{i=1}^{K} I(M_i(X) = c_j)$$

For the prediction of continuous values, the average of the predictions for a given test tuple is used as the result. Bagging reduces the variance of the base classifiers, so the bagged classifier is often more accurate than a single base classifier.

The input parameters for bagging algorithm are as follows:

  • D: This is the training tuples dataset
  • K: This is the number of classifiers combined
  • S: This is a classification learning algorithm or scheme used to learn the base classifiers
  • $M^*$: This is the ensemble classifier, which is the output of the algorithm

The summarized pseudocode for the bagging algorithm is as follows:

[Figure: pseudocode of the bagging algorithm]
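The bagging procedure described in this section can be sketched in base R as follows. This is not the code from ch_04_bagging.R: all function names are hypothetical, and a simple nearest-centroid learner stands in for the learning scheme S so that the example needs no extra packages. The built-in iris data is used only for illustration.

```r
set.seed(1)

# Hypothetical base learner S: one column of attribute means per class.
learn_centroid <- function(x, y) {
  sapply(split(as.data.frame(x), y), colMeans)
}

# Predict the class whose centroid is closest (squared Euclidean distance).
predict_centroid <- function(centers, x) {
  x <- as.matrix(x)
  d <- sapply(seq_len(ncol(centers)), function(j) {
    rowSums(sweep(x, 2, centers[, j])^2)
  })
  d <- matrix(d, nrow = nrow(x))
  colnames(centers)[apply(d, 1, which.min)]
}

# Learn K classifiers, each from a bootstrap sample of the training tuples.
bagging <- function(x, y, K, learn) {
  lapply(seq_len(K), function(i) {
    idx <- sample(nrow(x), replace = TRUE)   # bootstrap sample of d tuples
    learn(x[idx, , drop = FALSE], y[idx])    # base classifier for this sample
  })
}

# Each classifier casts one vote per tuple; the most frequent class wins.
predict_bagged <- function(models, x, predict_one) {
  votes <- sapply(models, function(m) predict_one(m, x))
  votes <- matrix(votes, nrow = nrow(x))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

models <- bagging(iris[, 1:4], iris$Species, K = 11, learn = learn_centroid)
pred <- predict_bagged(models, iris[, 1:4], predict_centroid)
accuracy <- mean(pred == iris$Species)
```

Here the accuracy is measured on the training data itself, so it only shows that the pieces fit together; a held-out test set would be needed for an honest estimate.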

The boosting and AdaBoost algorithms

As opposed to the bagging algorithm, the boosting algorithm uses weighted voting and a weighted sample of the training tuples for each base classifier. The base classifiers are learned iteratively. Once a classifier is learned, the tuple weights are updated for the learning of the next base classifier, so that each successive model emphasizes the tuples misclassified by the former classifier. As a direct result, the accuracy of each classifier determines the weight of its vote on the final label when an unknown test tuple is given to the combined classifier.

Adaptive Boosting, or AdaBoost, is one of the boosting algorithms. If it contains K base classifiers, AdaBoost is performed in K passes. Given the training set $D_i$ and its corresponding classifier $M_i$, the error rate $error(M_i)$ of classifier $M_i$ is defined as follows, where $w_j$ is the weight of tuple $X_j$ and d is the number of tuples:

$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$$

The new classifier $M_i$ is discarded if its error, $error(M_i)$, is bigger than 0.5. In that case, the training set is resampled and the classifier is trained again from scratch.

For tuple $X_j$, the error function is as follows:

$$err(X_j) = \begin{cases} 1 & \text{if } X_j \text{ is misclassified} \\ 0 & \text{otherwise} \end{cases}$$

All the weights of the tuples in the training dataset D are initialized with $1/d$. When the classifier $M_i$ is learned from the training set $D_i$, the weight of each correctly classified tuple is multiplied by $error(M_i) / (1 - error(M_i))$. After the updates, the weights of all tuples are normalized, which means that the weights of the misclassified tuples increase and the weights of the correctly classified tuples decrease.

The weight of the vote of classifier $M_i$ is as follows:

$$\alpha_i = \log \frac{1 - error(M_i)}{error(M_i)}$$

We define $W(c_j)$ as the weight of the vote for class $c_j$ over the K classifiers, with $\alpha_i$ representing the weight of the ith classifier and $I(\cdot)$ returning 1 if its argument is true and 0 otherwise:

$$W(c_j) = \sum_{i=1}^{K} \alpha_i \, I(M_i(X) = c_j)$$

The AdaBoost combined classifier, $M^*$, sums the votes multiplied by their respective weights and assigns the class with the largest total weight to X:

$$M^*(X) = \arg\max_{c_j} W(c_j)$$

The input parameters for the AdaBoost algorithm are as follows:

  • D, which denotes a set of training tuples
  • K, which is the number of rounds
  • A classification learning algorithm

The output of the algorithm is a composite model. The pseudocode of AdaBoost is listed here:

[Figure: pseudocode of the AdaBoost algorithm]
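A single AdaBoost round can be sketched as follows: compute the weighted error of the current classifier, shrink the weights of the correctly classified tuples, normalize, and compute the weight of the classifier's vote. The function name and the toy input are assumptions made for illustration; this is not the code from ch_04_adaboost.R.

```r
# One boosting round. w holds the current tuple weights (summing to 1) and
# misclassified marks the tuples that the current classifier got wrong.
adaboost_round <- function(w, misclassified) {
  err <- sum(w[misclassified])                  # weighted error of the classifier
  if (err > 0.5) {
    stop("error above 0.5: discard this classifier and resample")
  }
  # Correctly classified tuples are multiplied by err / (1 - err)
  w[!misclassified] <- w[!misclassified] * err / (1 - err)
  list(error = err,
       weights = w / sum(w),                    # normalized tuple weights
       vote = log((1 - err) / err))             # weight of this classifier's vote
}

d <- 5
w0 <- rep(1 / d, d)                             # initial weights 1/d
res <- adaboost_round(w0, c(TRUE, FALSE, FALSE, FALSE, FALSE))
```

In this toy round, the one misclassified tuple ends with weight 0.5 and each of the other four with 0.125, so the next classifier concentrates on the tuple that was missed. An error of exactly 0 would give an infinite vote weight, a case that a full implementation has to handle separately.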

The Random forests algorithm

The random forests algorithm is an ensemble method that combines a group of decision trees, each of which is generated by randomly selecting the attributes considered at each node to be split. Given an unknown tuple, each classifier votes, and the most popular class decides the final result. The pseudocode for the ForestRI algorithm to generate a forest is as follows:

[Figure: pseudocode of the ForestRI algorithm]

T denotes a total order of the variables in line 2. In line 5, $\pi(X_{T(j)})$ denotes the set of variables preceding $X_{T(j)}$. Prior knowledge is required for line 6.

Another algorithm, ForestRC, does not select attributes at random when splitting a node; instead, it splits on random linear combinations of the existing attributes. New attributes are built as random linear combinations of the original attributes. With a couple of new attributes added, the best split is searched over the updated attribute set, which includes both the new and the original attributes.
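The attribute-construction step of ForestRC can be illustrated with a short sketch. It assumes that each new attribute combines L randomly chosen original attributes with coefficients drawn uniformly from [-1, 1]; the function name, the parameter names, and the use of the built-in iris data are illustrative and are not taken from ch_04_forestrc.R.

```r
set.seed(7)

# Append n_new attributes, each a random linear combination of L original ones.
add_random_combinations <- function(x, n_new = 3, L = 2) {
  x <- as.matrix(x)
  combos <- sapply(seq_len(n_new), function(i) {
    cols <- sample(ncol(x), L)                # L randomly chosen original attributes
    coefs <- runif(L, min = -1, max = 1)      # random coefficients (assumed range)
    as.numeric(x[, cols, drop = FALSE] %*% coefs)
  })
  combos <- matrix(combos, nrow = nrow(x))
  colnames(combos) <- paste0("rc", seq_len(n_new))
  cbind(x, combos)                            # original and new attributes together
}

augmented <- add_random_combinations(iris[, 1:4])
```

A tree learner would then search for the best split over all columns of the augmented matrix at the node being split.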

The R implementation

Here we provide R implementations of three algorithms: bagging, AdaBoost, and random forests. Please look up the R code files ch_04_bagging.R, ch_04_adaboost.R, ch_04_forestrc.R, and ch_04_forestri.R in the bundle of R code for the previously mentioned algorithms. The code can be tested with the following commands:

> source("ch_04_bagging.R")
> source("ch_04_adaboost.R")
> source("ch_04_forestrc.R")
> source("ch_04_forestri.R")

Parallel version with MapReduce

The following algorithm is a parallelized version of AdaBoost, in which several workers each construct boosting classifiers. The dataset of the pth worker, together with its size, is defined by the following formula:

[Formula: dataset of the pth worker]

The classifier is defined in the following format, which includes its weight:

[Formula: classifier and its weight]

The input is the training datasets of the M workers, and the output is the final classifier.

[Figure: pseudocode of the parallelized AdaBoost algorithm]

Biological traits and the Bayesian belief network

The Bayesian belief network, once trained, can be used for classification. It is based on Bayes' theorem, which is defined in The Bayes classification section of Chapter 3, Classification, and it consists of two parts: a directed acyclic graph and a conditional probability table (CPT) for each variable. Each variable is represented by one node in the graph, and the network models uncertainty by graphically representing the conditional dependencies between distinct components. The arcs in the graph represent causal knowledge. The interaction among the diverse sources of uncertainty is also graphically illustrated.

The uncertainty comes from various sources:

  • The way to associate the knowledge by the expert
  • The domain intrinsic uncertainty
  • The requirement of the knowledge to be translated
  • The accuracy and availability of knowledge

Here is an example of a Bayesian belief network with four Boolean variables and the corresponding arcs. Whether the grass is wet is influenced by whether the sprinkler has been working and whether it has just rained, and so on. Each arc has a certain probability.

Let us have a look at the CPT representation of the network:

[Figure: example Bayesian belief network and its CPT]

In the network, each variable is conditionally independent of its non-descendants, given its parents. Here is the definition of the joint probability distribution:

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(x_i))$$
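To make the factorization concrete, the following sketch computes a joint probability for a four-variable sprinkler network as the product of one CPT entry per variable. The variable Cloudy, the arc structure (Cloudy to Sprinkler, Cloudy to Rain, and both Sprinkler and Rain to WetGrass), and all probability values are assumptions chosen for illustration; they are not the values from the book's figure.

```r
# Hypothetical CPTs; each entry is the probability that the variable is TRUE.
p_cloudy <- 0.5
p_sprinkler <- c("TRUE" = 0.1, "FALSE" = 0.5)  # indexed by Cloudy
p_rain <- c("TRUE" = 0.8, "FALSE" = 0.2)       # indexed by Cloudy
p_wet <- matrix(c(0.99, 0.90,
                  0.90, 0.00),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("TRUE", "FALSE"),   # rows: Sprinkler
                                c("TRUE", "FALSE")))  # columns: Rain

# Probability of a Boolean value, given the probability of TRUE.
bern <- function(p_true, value) if (value) p_true else 1 - p_true

# Joint probability: the product of P(variable | its parents) over all variables.
joint <- function(cloudy, sprinkler, rain, wet) {
  bern(p_cloudy, cloudy) *
    bern(p_sprinkler[[as.character(cloudy)]], sprinkler) *
    bern(p_rain[[as.character(cloudy)]], rain) *
    bern(p_wet[as.character(sprinkler), as.character(rain)], wet)
}

p <- joint(cloudy = TRUE, sprinkler = FALSE, rain = TRUE, wet = TRUE)
```

For this assignment the product is 0.5 × 0.9 × 0.8 × 0.9 = 0.324, and summing the joint probability over all 16 assignments gives 1, as expected for a valid distribution.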

The Bayesian belief network (BBN) algorithm

Before the BBN algorithm can be applied to classification, the network must be trained. During training, expert knowledge, that is, prior knowledge, can be used to help design the network. For the variables that participate in a direct dependency, experts must specify the conditional probabilities. There are many algorithms to learn the network from the training dataset; we will introduce an adaptive probabilistic networks algorithm.

The input parameters for the BBN algorithm are as follows:

  • T, which denotes a total order of the variables
  • CPT, the conditional probability tables

The output of the algorithm is the topology structure of the BBN. The pseudocode is as follows:

[Figure: pseudocode of the BBN algorithm]

T denotes a total order of the variables in line 2. In line 5, $\pi(X_{T(j)})$ denotes the set of variables preceding $X_{T(j)}$.

The R implementation

Please look up the R code file ch_04_bnn.R in the bundle of R code for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_04_bnn.R")

Biological traits

The analysis of biological traits is one of the important applications of the BBN algorithm.

Protein classification and the k-Nearest Neighbors algorithm

The k-Nearest Neighbors (kNN) algorithm is one of the lazy learners that postpones the learning until the test tuple or test instance is provided.

A single training tuple is represented by a point in an n-dimensional space. In other words, the combination of n attribute values represents a specific training tuple. No training takes place before the arrival of the test tuple that needs to be classified. Some preprocessing steps are needed, such as normalizing attributes whose values are large compared to those of the other attributes. The data normalization approaches used in data transformation can be applied here for preprocessing.

When a test tuple is given, the k nearest training tuples are found in the training tuple space by using a specific measure of the distance between the test tuple and a training tuple. The k nearest training tuples are also known as the kNN. One popular measure is the Euclidean distance in real space, shown in the following equation for two tuples $X_1 = (x_{11}, \ldots, x_{1n})$ and $X_2 = (x_{21}, \ldots, x_{2n})$. This method is only applicable to numeric attributes:

$$dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

For nominal attributes, one solution is to define the difference between two attribute values as 0 if they are identical and as 1 otherwise. We already know that there are many approaches to dealing with missing values in the attributes. The value of k can be chosen experimentally: among the candidate values up to a predefined threshold, select the k that gives the lowest error rate on the training tuples.

The class label of the test tuple is the most common class among its kNN, as decided by voting.

The kNN algorithm

The input parameters for kNN algorithm are as follows:

  • D, the set of training objects
  • z, the test object, which is a vector of attribute values
  • L, the set of classes used to label the objects

The output of the algorithm is the class of z, represented as $c_z \in L$.

The pseudocode snippet for kNN is illustrated here:

[Figure: pseudocode of the kNN algorithm]

The I function in line 6 denotes an indicator function that returns the value 1 if its argument is true and 0 otherwise.
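The kNN procedure can be sketched in base R as follows, using the Euclidean distance and a majority vote among the k nearest training objects. The function name, argument names, and the toy data are assumptions for illustration; this is not the code from ch_04_knn.R.

```r
# Classify the test object z from the training objects (D_x, D_y).
knn_classify <- function(D_x, D_y, z, k = 3) {
  D_x <- as.matrix(D_x)
  d <- sqrt(rowSums(sweep(D_x, 2, z)^2))   # Euclidean distance from z to each object
  nearest <- order(d)[seq_len(k)]          # indices of the k nearest neighbors
  votes <- table(D_y[nearest])             # count of neighbors in each class
  names(votes)[which.max(votes)]           # class with the most votes
}

# Toy training set with two well-separated classes.
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1),
                 c(5, 5), c(6, 5), c(5, 6))
train_y <- c("a", "a", "a", "b", "b", "b")

label_1 <- knn_classify(train_x, train_y, z = c(1.5, 1.5))
label_2 <- knn_classify(train_x, train_y, z = c(5.5, 5.5))
```

The table of votes plays the role of the sum of indicator values in the pseudocode: each neighbor contributes 1 to the count of its own class and 0 to every other class.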

The R implementation

Please look up the R code file ch_04_knn.R in the bundle of R code for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_04_knn.R")

Document retrieval and Support Vector Machine

Support Vector Machine (SVM) is a classification algorithm applicable to both linear and nonlinear data classification. It is based on an assumption: if two classes of data cannot be divided by a hyper-plane in the original space, then after the source dataset is mapped to a space of sufficiently high dimension, an optimal separating hyper-plane must exist.

Here are two concepts that need to be clearly defined:

  • Linearly separable: This means that a dataset can be divided into the target classes by a linear equation that takes a training tuple as input.
  • Nonlinearly separable: This means that no such linear equation exists in a space with the same dimension as that of the training tuples.

The linear hyper-plane can be represented by the linear discriminant equation, given the weight vector w, the bias b, and the training tuple x:

$$\mathbf{w} \cdot \mathbf{x} + b = 0$$

Tuples on one side of the hyper-plane satisfy $\mathbf{w} \cdot \mathbf{x} + b > 0$, and tuples on the other side satisfy $\mathbf{w} \cdot \mathbf{x} + b < 0$.

With the preceding equation, we have the following image to illustrate a hyper-plane:

[Figure: a separating hyper-plane]

The target of SVM is to find the optimal hyper-plane, for which the margin between data points belonging to different classes is maximized.

There are two hyper-planes that are parallel to the $\mathbf{w} \cdot \mathbf{x} + b = 0$ hyper-plane and at equal distance from it. They are the boundary hyper-planes, and all the support vectors lie on them. This is illustrated in the following diagram:

[Figure: boundary hyper-planes and support vectors]

The nonlinearly separable case is illustrated in the following diagram:

[Figure: a nonlinearly separable dataset]

In the following diagram, after the vectors are mapped from a low-dimensional space to a high-dimensional space, the nonlinearly separable case is transformed into a linearly separable case:

[Figure: mapping to a high-dimensional space]
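The role of the hyper-plane and its margin can be sketched in a few lines of R. The sketch assumes the usual canonical scaling, in which the two boundary hyper-planes are $\mathbf{w} \cdot \mathbf{x} + b = 1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$, so the margin width is $2 / \lVert \mathbf{w} \rVert$. The weight vector, bias, and test points are hypothetical values, not the result of training.

```r
# Value of the linear discriminant w . x + b for a tuple x.
decision_value <- function(x, w, b) {
  sum(w * x) + b
}

# Class label: +1 on one side of the hyper-plane, -1 on the other.
classify <- function(x, w, b) {
  if (decision_value(x, w, b) >= 0) 1 else -1
}

# Distance between the two boundary hyper-planes under canonical scaling.
margin_width <- function(w) {
  2 / sqrt(sum(w^2))
}

w <- c(2, 1)   # hypothetical weight vector
b <- -4        # hypothetical bias

label_pos <- classify(c(3, 1), w, b)   # 2*3 + 1 - 4 = 3, positive side
label_neg <- classify(c(0, 1), w, b)   # 0 + 1 - 4 = -3, negative side
width <- margin_width(c(3, 4))         # 2 / 5
```

Maximizing the margin is therefore equivalent to minimizing the norm of w, subject to every training tuple lying on the correct side of its boundary hyper-plane.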

The SVM algorithm

The input parameters for a dual SVM algorithm are as follows:

  • D, the set of training objects
  • K
  • C
  • The SVM algorithm

The output of the algorithm is the SVM algorithm. The pseudocode snippet for this algorithm is illustrated here:

[Figure: pseudocode of the dual SVM algorithm]

Here is the pseudocode for another version of the SVM algorithm, which is the primal kernel SVM algorithm. The input parameters for the primal kernel SVM algorithm are as follows: