Exploring Data with RapidMiner

By Andrew Chisholm

About this book

Data is everywhere and the amount is increasing so much that the gap between what people can understand and what is available is widening relentlessly. There is huge value in data, but much of this value lies untapped. 80% of data mining is about understanding data, exploring it, cleaning it, and structuring it so that it can be mined. RapidMiner is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics. It is used for research, education, training, rapid prototyping, application development, and industrial applications.

Exploring Data with RapidMiner is packed with practical examples to help practitioners get to grips with their own data. The chapters are arranged within an overall framework but can also be consulted on an ad hoc basis. The book provides simple to intermediate examples showing modeling, visualization, and more, using RapidMiner.

Exploring Data with RapidMiner is a helpful guide that presents the important steps in a logical order. This book starts with importing data and then leads you through cleaning, handling missing values, visualizing, and extracting additional information, as well as understanding the time constraints that real data places on getting a result. The book uses real examples to help you understand how to set up processes quickly.

This book will give you a solid understanding of the possibilities that RapidMiner gives for exploring data and you will be inspired to use it for your own work.

Publication date: November 2013
Publisher: Packt
Pages: 162
ISBN: 9781782169338

 

Chapter 1. Setting the Scene

You have data, you know it has hidden value, and you want to mine it. The problem is you're a bit stuck.

The data you have could be anything and you have a lot of it. It is probably from where you work, and you are probably very knowledgeable about how it is gathered, how to interpret it, and what it means. You may also know a domain expert to whom you can turn for additional expertise.

You also have more than a passing knowledge of data mining and you have spent a short time becoming familiar with RapidMiner to perform data mining activities, such as clustering, classification, and regression. You know well that mining data is not just a case of using a spreadsheet to draw a few graphs and pie charts; there is much more.

Given all of this, what is the problem, why are you stuck, and what is this book for?

Simply put, real data is huge, stored in a multitude of formats, contains errors and missing values, and does not yield its secrets willingly. If, like me, your first steps in data mining involved using simple test datasets with a few hundred rows (all with clean data), you will quickly find that 10 million rows of data of dubious quality stored in a database combined with some spreadsheets and sundry files presents a whole new set of problems. In fact, estimates put the proportion of time spent cleaning, understanding, interpreting, and exploring data at something like 80 percent. The remaining 20 percent is the time spent on mining.

The problem restated is that if you don't spend time cleaning, reformatting, restructuring, and generally getting to know your data as part of an exploration, you will remain stuck and will get poor results. If we agree that this is the activity to be done, we come to a basic question: how will we do this?

This book's answer is to use RapidMiner, a very powerful and ultimately easy-to-use product. These qualities, coupled with its open source availability, mean it is very widely used. It does have a learning curve that can seem daunting; be assured, once you have climbed it, the product truly becomes easy to use and lives up to its name.

This book is therefore an intermediate-level practical guide to using RapidMiner to explore data and includes techniques to import, visualize, clean, format, and restructure data. This overall objective gives a context in which the various techniques can be considered together. This is helpful because it shows what is possible and makes it easier to modify the techniques for whatever real data is encountered. Hints and tips are provided along the way; in fact, some readers may prefer to use these hints as a starting point.

Having set the scene, let us consider some of the aspects of data exploration raised in this introduction. The following sections explain these aspects and point to the chapters where they are considered in detail.

 

A process framework


It is important to think carefully about the framework within which any data mining investigation is done. A systematic yet simple approach will help results happen and will ensure everyone involved knows what to do and what to expect.

The following diagram shows a simple process framework, derived in part from CRISP-DM (ftp://ftp.software.ibm.com/software/analytics/spss/documentation/modeler/14.2/en/CRISP_DM.pdf):

There are six main phases. The process starts with Business understanding and proceeds clockwise, but it is quite normal to return, at any stage, to previous phases in an iterative way. Not all the stages are mandatory. It is possible that the business has an objective that is not related to data mining and modeling at all; it might be enough to summarize large volumes of data in some sort of dashboard, in which case the Modeling step would be skipped.

The Business understanding phase is the most important phase to get right. Without clear organizational objectives set by what we might loosely call the business, as well as its continuing involvement, the whole activity is doomed. The output from this phase includes the criteria for determining success. For the purpose of this book, it is assumed that this critical phase has been started and that this clear view exists.

Data understanding and Data preparation follow Business understanding. These phases involve activities such as importing, extracting, transforming, cleaning, and loading data into new databases, as well as visualizing it and generally getting a thorough understanding of what the data is. This book is concerned with these two phases.

The Modeling, Evaluation, and Deployment phases concern building models to make predictions, testing these with real data, and deploying them in live use. This is the part that most people regard as data mining but it represents 20 percent of the effort. This book does not concern itself with these phases in any detail.

Having said that, it is important to have a view of the Modeling phase that will eventually be undertaken because this will impact the data exploration and understanding activity. For example, a predictive analytics project may try to predict the likelihood of a mobile phone customer switching to a competitor based on usage data. This has implications for how the data should be structured. Another example is using online shopping behavior to predict customer purchases, where a market basket analysis would be undertaken. This might require a different structure for the data. Yet another example would be an unsupervised clustering exercise to try and summarize different types of customers, where the aim is to find groups of similar customers. This can sometimes change the focus of the exploration to find relationships between all the attributes of the data.

Evaluation is also important because this is where success is estimated. An unbalanced dataset, where there are few examples of the target to be predicted, will have an effect on the validation to be performed. A regression problem, which estimates a numerical result, will also require a different validation approach from a classification problem, in which nominal values are predicted.

Having set the scene for what is to be covered, the following sections will give some more detail about what the Data understanding and Data preparation phases contain, to give a taste of the chapters to come.

 

Data volume and velocity


There is no doubt that data is growing. Even a cursory glance at historical trends and future predictions reveals graphs trending sharply upwards for data volumes, data sources, and datatypes as well as for the speed at which data is being created. There are also graphs showing the cost of data storage going down, linked to the increased power and reduced cost of processing, the presence of new devices such as smartphones, and the ability of standard communication networks such as the Internet to make the movement of data easy.

So, there is more and more data being generated by more and more devices and it is becoming easier to move it around.

However, the ability of people to process and understand data remains constant. The net result is a gap in understanding that is getting wider.

For evidence of this, it is interesting to use Google Trends to look for search terms such as data visualization, data understanding, data value, and data cost. All of these have been trending up to a greater or lesser extent since 2007, which points to the concerns of people who, overwhelmed with data, are driven to search for these terms.

Clearly, there is a need for something to help close the understanding gap to make the process of exploring data more efficient. As the first step, therefore, Chapter 8, Reducing Data Size, and Chapter 9, Resource Constraints, give some practical advice on determining how long a RapidMiner process will take to run. Some techniques to sample or reduce the size of data are also included to allow results to be obtained within a meaningful time span while understanding the effect on accuracy.
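
Chapter 9 measures process runtimes within RapidMiner itself; purely as a language-neutral illustration of the extrapolation idea, here is a minimal Python sketch in which process_sample is a hypothetical stand-in for whatever processing step is being timed.

```python
import time

import numpy as np


def process_sample(rows):
    """Hypothetical stand-in for the real processing step being timed."""
    data = np.random.rand(rows, 10)
    return data.mean(axis=0)


# Time the processing of progressively larger samples.
sample_sizes = [10_000, 50_000, 100_000, 200_000]
timings = []
for n in sample_sizes:
    start = time.perf_counter()
    process_sample(n)
    timings.append(time.perf_counter() - start)

# Fit a straight line (seconds vs. rows) and extrapolate to the full dataset.
slope, intercept = np.polyfit(sample_sizes, timings, 1)
full_size = 10_000_000
print(f"Estimated runtime for {full_size:,} rows: "
      f"{slope * full_size + intercept:.1f} seconds")
```

Plotting the timings first is worthwhile: if they curve upwards, a straight-line extrapolation will underestimate the full-data runtime.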

 

Data variety, formats, and meanings


For the purpose of this book, data is something that can be processed by a computer. This means that it is probably stored in a file on a disk or in a database or it could be in the computer's memory. Additionally, it might not physically exist until it is asked for. In other words, it could be the response to a web service query, which mashes up data sources to produce a result. Furthermore, some data is available in real time as a result of some external process being asked to gather or generate results.

Having found the data, understanding its format and the fields within it represents a challenge. With the increase in data volume comes an inevitable increase in the formats of data, owing simply to there being more diverse sources of data. User-generated content, mash-ups, and the possibility of defining one's own XML datatypes mean that the meaning and interpretation of a field may not be obvious simply by looking at its name.

The obvious example is date formats. The date 1/5/2012 means January 5, 2012 to someone from the US whereas it means May 1, 2012 to someone from the UK. Another example in the context of a measurement of time is where results are recorded in microseconds, labeled as elapsed time, and then interpreted by a person as being in seconds. Yet another example could be a field labeled Item with the value Bat. Is this referring to a small flying mammal or is it something to play cricket with?
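
RapidMiner handles such conversions with the parsing operators covered in Chapter 4; as a quick language-neutral illustration of why the intended format must be stated explicitly, consider this small Python sketch.

```python
from datetime import datetime

raw = "1/5/2012"

# The same string yields two different dates depending on the intended
# format; only an explicit format string removes the ambiguity.
us_reading = datetime.strptime(raw, "%m/%d/%Y")
uk_reading = datetime.strptime(raw, "%d/%m/%Y")

print(us_reading.strftime("%B %d, %Y"))  # January 05, 2012
print(uk_reading.strftime("%B %d, %Y"))  # May 01, 2012
```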

To address some aspects of data, Chapter 2, Loading Data, Chapter 4, Parsing and Converting Attributes, and Chapter 7, Transforming Data, take the initial steps to help close the understanding gap mentioned earlier.

 

Missing data


Most data has missing values. These arise for many reasons: errors during the gathering process, deliberate withholding for legitimate or malicious reasons, and simple bugs in the way data is processed. Having a strategy to handle missing values is very important because some algorithms perform very poorly even with a small percentage of missing data.

On the face of it, missing data is easy to detect, but there is a pitfall for the unwary, since a missing value could in fact be a completely legitimate empty value. For example, a commuter train could start at one station and stop at all intermediate stations before reaching a final destination. An express train would not stop at the intermediate stations at all, and there would be no recorded arrival and departure times for these stops. This is not missing data, but if it is treated as though it were, the data would become unrepresentative and would lead to unpredictable results when used for mining.

That's not all; there are different types of missing data. Some values are missing completely at random, while others depend on the other data in complex ways. It is also possible for missing data to be correlated with the value to be predicted. Any strategy for handling missing values therefore has to consider these issues, because the simple strategy of deleting records not only removes precious data but could also bias the results of any data mining activity. The typical starting approach is to fill in missing values manually. This is not advisable because it is time consuming, error prone, risks bias, is not repeatable, and does not scale.
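
Chapter 6 addresses all of this with RapidMiner operators; the following pandas sketch, using an invented train timetable, merely illustrates the idea of flagging legitimate empty values before imputing the genuinely missing ones.

```python
import numpy as np
import pandas as pd

# Invented timetable: NaN for the express train is a legitimate empty
# value (it never stops there), not missing data, so flag it before
# any imputation takes place.
stops = pd.DataFrame({
    "train": ["commuter", "commuter", "express", "express"],
    "station": ["A", "B", "A", "B"],
    "stops_here": [True, True, False, False],
    "dwell_seconds": [40.0, np.nan, np.nan, np.nan],  # only row 1 is truly missing
})

truly_missing = stops["dwell_seconds"].isna() & stops["stops_here"]

# Deleting rows would discard the express records entirely; instead,
# impute only the genuinely missing values (here, with the mean).
mean_dwell = stops.loc[stops["stops_here"], "dwell_seconds"].mean()
stops.loc[truly_missing, "dwell_seconds"] = mean_dwell
print(stops)
```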

What is needed is a systematic method of handling missing values and determining a way to process them automatically with little or no manual intervention. Chapter 6, Missing Values, takes the first step on this road.

 

Cleaning data


It is almost certain that any data encountered in the real world has quality issues. In simple terms, this means that values are invalid or very different from other values. Of course, it can get more complex than this when it is not at all obvious that a particular value is anomalous. For example, the heights of people could be recorded and the range could be between 1 and 2 meters. If there is data for young children in the sample, lower heights are expected, but isn't a 2-meter five-year-old child an anomaly? It probably is, and anomalies such as these do occur in real data.
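
Chapter 5 covers outlier handling in RapidMiner itself; as a sketch of why context matters, this pandas example with invented heights flags values that are extreme within their own age group rather than within the whole sample.

```python
import pandas as pd

# Invented heights in meters: 2.0 m is unremarkable in the sample as a
# whole but anomalous for a five-year-old, so score within age groups.
people = pd.DataFrame({
    "age":    [5, 5, 5, 5, 5, 30, 30, 30],
    "height": [1.05, 1.10, 1.08, 1.07, 2.00, 1.75, 1.82, 1.68],
})

grouped = people.groupby("age")["height"]
z = (people["height"] - grouped.transform("mean")) / grouped.transform("std")

# An absolute z-score above the threshold marks a within-group anomaly;
# here only the 2.00 m five-year-old is flagged.
people["anomaly"] = z.abs() > 1.5
print(people)
```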

As with missing data, a systematic and automatic approach is required to identify and deal with such anomalies, and Chapter 5, Outliers, gives some details.

 

Visualizing data


A picture paints a thousand words, and this is particularly true when trying to understand data and close the understanding gap. Faced with a million rows of data, there is often no better way to see what quality issues there are, how the attributes relate to one another, and whether there are other systematic features that need to be understood and explained.

There are many types of visualizations that can be used and it is also important to combine these with the use of descriptive statistics, such as the mean and standard deviation.

Examples include 2D and 3D scatter plots, density plots, bubble charts, series, surfaces, box plots, and histograms. It is often important to aggregate data into summaries for presentation because the larger the data gets, the more time it takes to process; indeed, summarizing becomes mandatory as the resource limits of the available computers are reached.
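
The book uses RapidMiner's built-in charts for all of this; the following matplotlib and pandas sketch on synthetic data merely illustrates the principle of pairing a plot with descriptive statistics and aggregating before plotting.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data standing in for a large example set.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group": rng.choice(list("ABC"), size=100_000),
    "value": rng.normal(loc=50, scale=10, size=100_000),
})

# Descriptive statistics first: count, mean, std, and quartiles.
print(df["value"].describe())

# Aggregate before plotting so the chart summarizes rather than drowns.
summary = df.groupby("group")["value"].agg(["mean", "std"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.hist(df["value"], bins=50)
ax1.set_title("Distribution of value")
summary["mean"].plot.bar(yerr=summary["std"], ax=ax2)
ax2.set_title("Mean value per group")
plt.tight_layout()
plt.show()
```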

Some initial techniques are given in Chapter 3, Visualizing Data.

 

Resource constraints


There is never enough time and there is never enough money. In other words, there is never enough time to get all the investigation and processing done, both in terms of the capacity of a person to look at the data and understand it, and in terms of processing power and capacity. To be valuable in the real world, a process must handle all the data in a time that meets the requirements set at the outset. Referring back to the overall process framework, the business objectives must consider this and set acceptance criteria for it.

This pervades all aspects of the data mining process, from loading data, cleaning it, handling missing values, and transforming it for subsequent processing, to performing the classification or clustering itself.

When faced with huge data that is taking too long to process, there are many techniques that can be used to speed things up and Chapter 9, Resource Constraints, gives some details. This can start by breaking the process into steps and ensuring that intermediate results are saved. Very often, an initial load of data from a database can dwarf all other activities in terms of elapsed time. It may also be the case that it is simply not possible to load the data at all, making a batch approach necessary.
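
RapidMiner supports batching through the looping and storage operators discussed in Chapter 9; purely as an illustration of the idea, here is a pandas sketch (the filename and column names are placeholders) that reads a large file in chunks and saves an aggregated intermediate result.

```python
import pandas as pd

# "data.csv" and its columns are placeholders for a file too large to
# load in one go; process it chunk by chunk so memory stays bounded.
partials = []
for chunk in pd.read_csv("data.csv", chunksize=500_000):
    # Reduce each chunk immediately to a per-customer partial sum.
    partials.append(chunk.groupby("customer_id")["amount"].sum())

# Combine the per-chunk partial sums and save the intermediate result
# so later steps need not repeat the expensive load.
totals = pd.concat(partials).groupby(level=0).sum()
totals.to_csv("customer_totals.csv")
```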

It is well known that different data mining algorithms perform differently depending on the number of rows of data and the number of attributes per row. One of the outputs from the data preparation phase is a dataset that is capable of being mined. This means that it must be possible for the data to be mined in a reasonable amount of time and so it is important that attention is paid to reducing the size of the data while bearing in mind that any reduction could affect the accuracy of the resulting data mining activity.

Reducing the number of rows, by filtering them or by aggregating them into group summaries, is one method. Another approach is to focus on the attributes and remove those that have no effect on the final outcome. It is also possible to transform attributes into their principal components for summarization purposes.
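
Chapter 8 implements these reductions with RapidMiner operators; purely as a sketch of the two ideas on synthetic data, the following scikit-learn example samples the rows and then projects the attributes onto their principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 20 attributes driven by 5 underlying factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100_000, 5))
mixing = rng.normal(size=(5, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(100_000, 20))

# Row reduction: a simple random sample of 10 percent of the examples.
sample_idx = rng.choice(X.shape[0], size=10_000, replace=False)
X_sample = X[sample_idx]

# Attribute reduction: keep the principal components that explain
# 95 percent of the variance instead of all 20 original attributes.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_sample)
print(X_sample.shape, "->", X_reduced.shape)  # roughly (10000, 20) -> (10000, 5)
```

As the text notes, any such reduction trades speed against accuracy, so the effect on the eventual mining result should always be checked.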

All of this does not help you think any quicker, but by speeding up the intermediate steps, it helps keep the train of thought going as the data is being understood.

 

Terminology


The following list defines some common terms that RapidMiner uses:

Process: An executable unit containing the functionality to be executed. The user creates a process by arranging operators and joining them together in whatever way is required.

Operator: A single block of functionality, available from the RapidMiner Studio GUI, that can be arranged in a process and connected to other operators. Each operator has parameters that can be configured as per the specific requirements of the process.

Macro: A global variable that can be set and used by most operators to modify operator behavior.

Repository: A location where processes, data, models, and files can be stored and read, either from the RapidMiner Studio GUI or from a process.

Example: A single row of data.

Example set: A set of one or more examples.

Attribute: A column of data.

Type: The type of an attribute; it can be real, integer, date_time, nominal (both polynominal and binominal), or text.

Role: An attribute's role dictates how operators will use the attribute. The most common role is regular. Attributes with the other standard roles are known as special attributes, and these roles include label, id, cluster, prediction, and outlier. It is also possible to give an attribute a custom role, in which case most operators will ignore it (there are exceptions).

Label: The target attribute to be predicted in a data mining classification context. This is one of the special roles for an attribute.

ID: A special role that indicates an identifier for an example. Some operators use the ID as part of their operation.

This list is given here so that readers are aware of the terminology up front and to make it easier to find later.

 

Accompanying material


Many RapidMiner processes have been produced for this book, and most are available on the Internet.

Some of the processes contain additional bonus material. Note that, where files need to be accessed, you will have to edit the processes to match the locations of your files.

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

 

Summary


So far, we have seen how extracting value from data should be considered an iterative process that consists of a number of phases. It is important to have a clear business objective as well as the continued involvement of key people throughout the process. Above all, the bulk of data mining is about data cleaning, exploration, and understanding, and making this clear at the beginning avoids disappointment.

Having seen some of the aspects of data cleaning, exploring, and understanding, you will recognize some of the practical issues that have prevented you from getting value out of your data.

Without further ado, let's get straight onto the next logical step: importing data. RapidMiner provides many ways to do this and these are covered in Chapter 2, Loading Data.

About the Author

  • Andrew Chisholm

    Andrew Chisholm completed his degree in Physics at Oxford University nearly thirty years ago. This coincided with the growth in software engineering and it led him to a career in the IT industry. For the last decade he has been very involved in mobile telecommunications, where he is currently a product manager for a market-leading test and monitoring solution used by many mobile operators worldwide. Throughout his career, he has always maintained an active interest in all aspects of data. In particular, he has always enjoyed finding ways to extract value from data and presenting this in compelling ways to help others meet their objectives. Recently, he completed a Master's in Data Mining and Business Intelligence with first class honors. He is a certified RapidMiner expert and has been using this product to solve real problems for several years. He maintains a blog where he shares some miscellaneous helpful advice on how to get the best out of RapidMiner. He approaches problems from a practical perspective and has a great deal of relevant hands-on experience with real data. This book draws this experience together in the context of exploring data, the first and most important step in a data mining process. He has published conference papers relating to unsupervised clustering and cluster validity measures and contributed a chapter called Visualizing cluster validity measures to an upcoming book entitled RapidMiner: Use Cases and Business Analytics Applications, published by Chapman & Hall/CRC.

