Mastering Machine Learning with R

By Cory Lesmeister

About this book

Machine learning is a field of artificial intelligence concerned with building systems that learn from data. Given the growing prominence of R, a cross-platform, zero-cost statistical programming environment, there has never been a better time to start applying machine learning to your data.

The book starts with an introduction to the Cross-Industry Standard Process for Data Mining. It takes you through multivariate regression in detail. Moving on, you will also address classification and regression trees. You will learn a couple of unsupervised techniques. Finally, the book will walk you through text analysis and time series.

The book delivers practical, real-world solutions to a variety of problems and tasks, such as complex recommendation systems. By the end of this book, you will have gained expertise in machine learning with R and will be able to build complex ML projects using R and its packages.

Publication date:
October 2015
Publisher
Packt
Pages
400
ISBN
9781783984527

 

Chapter 1. A Process for Success

 

"If you don't know where you are going, any road will get you there."

 
 --Lewis Carroll
 

"If you can't describe what you are doing as a process, you don't know what you're doing."

 
 --W. Edwards Deming

At first glance, this chapter may seem to have nothing to do with machine learning, but it has everything to do with it, specifically with implementing it and making change happen. The smartest people, the best software, and the best algorithms do not guarantee success, however success is defined.

In most projects, if not all, the key to successfully solving problems or improving decision-making is not the algorithm but the softer, more qualitative skills of communication and influence. The problem many of us have with this is that it is hard to quantify how effective one is at these skills. It is probably safe to say that many of us ended up in analytical work partly out of a desire to avoid them. After all, the highly successful TV comedy The Big Bang Theory was built on this premise. This chapter is therefore meant to set you up for success. The intent is to provide a process, a flexible one at that, through which you can become a Change Agent: a person who can influence others and turn insights into action without positional power. We will focus on the Cross-Industry Standard Process for Data Mining (CRISP-DM), probably the best-known and most respected process for analytical projects. Even if you use another industry process or something proprietary, there should still be a few gems in this chapter that you can take away.

I will not hesitate to say that all of this is easier said than done, and without question, I'm guilty of every sin, of both commission and omission, that will be discussed in this chapter. With skill and some luck, you can avoid the many physical and emotional scars I've picked up over the last ten and a half years.

Finally, we will also have a look at a flowchart (a cheat sheet) that you can use to help you identify which methodology to apply to the problem at hand.

 

The process


The CRISP-DM process was designed specifically for data mining. However, it is flexible and thorough enough that it can be applied to any analytical project, whether it is predictive analytics, data science, or machine learning. Don't be intimidated by the long list of tasks, as you can apply your judgment to the process and adapt it to any real-world situation. The following figure provides a visual representation of the process and shows the feedback loops, which give it its flexibility:

Figure from CRISP-DM 1.0, Step-by-step data mining guide

The process has the following six phases:

  • Business Understanding

  • Data Understanding

  • Data Preparation

  • Modeling

  • Evaluation

  • Deployment

For an in-depth review of the entire process with all of its tasks and subtasks, you can examine the paper by SPSS, CRISP-DM 1.0, step-by-step data mining guide, available at https://the-modeling-agency.com/crisp-dm.pdf.

I will discuss each step in the process, covering the important tasks, at a higher level than the guide does. We will not skip any critical details, but the focus will be on the techniques that one can apply to the tasks. Keep in mind that these process steps will be used in later chapters as a framework for applying the machine learning methods in general and the R code specifically.

 

Business understanding


One cannot overestimate how important this first step of the process is in achieving success. It is the foundational step, and failure or success here will likely determine failure or success for the rest of the project. The purpose of this step is to identify the requirements of the business so that you can translate them into analytical objectives. It has the following four tasks:

  1. Identify the business objective

  2. Assess the situation

  3. Determine the analytical goals

  4. Produce a project plan

Identify the business objective

The key to this task is to identify the goals of the organization and frame the problem. An effective question to ask is, what are we going to do differently? This may seem like a benign question, but it can really challenge people to ponder what they need from an analytical perspective, and it can get to the root of the decision that needs to be made. It can also prevent you from going out and doing a lot of unnecessary work on some fishing expedition. As such, the key for you is to identify the decision. A working definition of a decision can be put forward to the team as the irrevocable choice to commit or not commit resources. Additionally, remember that the choice to do nothing different is indeed a decision.

This does not mean that a project should not be launched if the choices are not absolutely clear. There will be times when the problem is not or cannot be well-defined; to paraphrase former Defense Secretary Donald Rumsfeld, there are known unknowns. Indeed, there will probably be many times when the problem is ill-defined and the project's main goal is to further the understanding of the problem and generate hypotheses; again calling on Secretary Rumsfeld, these are the unknown unknowns, which means that you don't know what you don't know. However, with ill-defined problems, one should go forward with an understanding of what will happen next in terms of resource commitment based on the various outcomes of hypothesis exploration.

Another thing to consider in this task is managing expectations. There is no such thing as perfect data, no matter what its depth and breadth. This is not the time to make guarantees but to communicate what is possible, given your expertise.

I recommend a couple of outputs from this task. The first is a mission statement. This is not the touchy-feely mission statement of an organization, but it is your mission statement or, more importantly, the mission statement approved by the project sponsor. I stole this idea from my years of military experience and I could write volumes on why it is effective, but that is for another day. Let's just say that in the absence of clear direction or guidance, the mission statement or whatever you want to call it becomes the unifying statement and can help prevent scope creep. It consists of the following points:

  • Who: This is yourself or the team or project name; everyone likes a cool project name, for example, Project Viper, Project Fusion, and so on

  • What: This is the task that you will perform, for example, conduct machine learning

  • When: This is the deadline

  • Where: This could be geographical, or by function, department, initiative, and so on

  • Why: This is the purpose of doing the project, that is, the business goal

The second output is as clear a definition of success as possible. Literally ask, what does success look like? Help the team/sponsor paint a picture of success that you can understand. Your job then is to translate this into modeling requirements.

Assess the situation

This task helps you in project planning by gathering information on the resources available, constraints, and assumptions, identifying the risks, and building contingency plans. I would further add that this is also the time to identify the key stakeholders that will be impacted by the decisions to be made.

A couple of points here. When examining the resources that are available, do not neglect to scour the records of past and current projects. Odds are someone in the organization has worked or is working on the same problem, and it may be essential to synchronize your work with theirs. Don't forget to enumerate the risks in terms of time, people, and money. Do everything in your power to create a list of the stakeholders, both those that impact your project and those that could be impacted by it. Identify who these people are and how they can influence or be impacted by the decision. Once this is done, work with the project sponsor to formulate a communication plan with these stakeholders.

Determine the analytical goals

Here, you are looking to translate the business goal into technical requirements. This includes turning the success criterion defined while identifying the business objective into a technical measure of success. This might be a metric such as root mean squared error (RMSE) or a target level of predictive accuracy.
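To make the two example criteria concrete, here is a minimal sketch in base R. All of the values are made up for illustration; in a real project they would come from a model's predictions on held-out data.

```r
# Hypothetical observed and predicted values for a numeric outcome
actual    <- c(10, 12, 15, 20)
predicted <- c(11, 12, 13, 22)

# RMSE: square root of the mean squared prediction error
rmse <- sqrt(mean((actual - predicted)^2))

# Hypothetical observed and predicted classes for a Yes/No outcome
actual_class    <- c("Yes", "No", "Yes", "No", "Yes")
predicted_class <- c("Yes", "No", "No",  "No", "Yes")

# Accuracy: share of observations classified correctly
accuracy <- mean(actual_class == predicted_class)

rmse      # 1.5
accuracy  # 0.8
```

Agreeing with the sponsor on which of these numbers matters, and what value counts as good enough, is the point of this task.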

Produce a project plan

The task here is to build an effective project plan with all the information gathered up to this point. Regardless of what technique you use, whether it be a Gantt chart or some other graphic, produce it and make it a part of your communication plan. Make this plan widely available to the stakeholders and update it on a regular basis and as circumstances dictate.

 

Data understanding


After enduring the all-important pain of the first step, you can now get your hands on the data. The tasks in this step consist of the following:

  1. Collect the data

  2. Describe the data

  3. Explore the data

  4. Verify the data quality

This step is the classic case of Extract, Transform, Load (ETL). There are some considerations here. You need to make an initial determination that the data available is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data is missing. This may drive the learning method that you use and/or whether imputation of the missing data is necessary and feasible.

Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. You are likely to stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This is especially critical with time series, where the business rules for classifying the data often change over time. Finally, it is a good idea to begin documenting any code at this step. As part of the documentation process, if a data dictionary is not available, save yourself the heartache later on and make one.
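As a rough illustration of checking how much data is missing, the following base R sketch computes the share of missing values per variable and counts the fully observed rows. The data frame and its column names are hypothetical.

```r
# Hypothetical data frame with some missing values
df <- data.frame(
  age    = c(34, NA, 51, 29),
  income = c(NA, NA, 72000, 58000),
  region = c("N", "S", NA, "W")
)

# Proportion of missing values in each variable
missing_share <- colMeans(is.na(df))

# Number of rows with no missing values at all
complete_rows <- sum(complete.cases(df))

missing_share  # age 0.25, income 0.50, region 0.25
complete_rows  # 1
```

A variable that is mostly missing, or a dataset with very few complete rows, is an early signal that imputation or a different learning method may be needed.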

 

Data preparation


Almost there! This step has the following five tasks:

  1. Select the data

  2. Clean the data

  3. Construct the data

  4. Integrate the data

  5. Format the data

These tasks are relatively self-explanatory. The goal is to get the data ready to feed into the algorithms. This includes merging, feature engineering, and transformations. If imputation is needed, it happens here as well. Additionally, with R, pay attention to how the outcome needs to be labeled. If your outcome/response variable is Yes/No, it may not work in some packages and will require a transformed or new variable coded as 1/0. At this point, you should also split your data into the various subsets if applicable: train, test, or validation. This step can be an unforgiving burden, but most experienced people will tell you that it is where you can separate yourself from your peers. With this, let's move on to the money step.
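The recoding and splitting mentioned above can be sketched in base R as follows. The data, the 70/30 ratio, and the seed are all assumptions chosen for illustration, not recommendations.

```r
# Hypothetical data with a Yes/No response
df <- data.frame(
  x       = 1:10,
  outcome = c("Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "Yes")
)

# New 1/0 variable kept alongside the original label
df$outcome01 <- ifelse(df$outcome == "Yes", 1, 0)

# Random 70/30 split; the seed is arbitrary and only makes the split repeatable
set.seed(123)
train_idx <- sample(seq_len(nrow(df)), size = round(0.7 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

nrow(train)  # 7
nrow(test)   # 3
```

Keeping the original label next to the recoded one makes it easy to switch between packages that expect a factor and those that expect a numeric 1/0 response.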

 

Modeling


This is where all the work that you've done up to this point can lead to fist-pumping exuberance or fist-pounding exasperation. But hey, if it was that easy, everyone would be doing it. The tasks are as follows:

  1. Select a modeling technique

  2. Generate a test design

  3. Build a model

  4. Assess a model

Oddly, this process step includes considerations that you have already thought of and prepared for. Even in the first step, you need at least a modicum of an idea about how you will be modeling. Remember that this is a flexible, iterative process and not a strict linear flowchart like an aircrew checklist.

The cheat sheet included in this chapter should help guide you in the right direction for the modeling techniques. A test design refers to the creation of your test and train datasets and/or the use of cross-validation and this should have been thought of and accounted for in the data preparation.
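For the cross-validation variant of a test design, one possible sketch in base R is to assign every row to a fold and derive the train and test indices from that assignment. The dataset size, the number of folds, and the seed are hypothetical.

```r
# Hypothetical dataset size and number of folds
n <- 20
k <- 5

# Assign each row to one of k folds at random, with equal fold sizes
set.seed(42)  # arbitrary seed, for repeatability only
fold <- sample(rep(seq_len(k), length.out = n))

# Rows held out for testing and rows kept for training in each fold
splits <- lapply(seq_len(k), function(i) {
  list(train = which(fold != i), test = which(fold == i))
})

table(fold)  # 4 rows in each of the 5 folds
```

Each element of splits can then be used to fit a model on the train rows and score it on the test rows, so that every observation is held out exactly once.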

Model assessment involves comparing the models with the criteria/criterion that you developed in the business understanding, for example, RMSE, Lift, ROC, and so on.

 

Evaluation


With the evaluation process, the main goal is to confirm that the work that has been done and the model selected at this point meet the business objective. Ask yourself and others, have we achieved the definition of success? Let the Netflix Prize serve as a cautionary tale here. I'm sure you are aware that Netflix awarded a $1 million prize to the team that could produce the best recommendation algorithm as defined by the lowest RMSE. However, Netflix did not implement it because the incremental accuracy gained was not worth the engineering effort! Always apply Occam's razor. At any rate, here are the tasks:

  1. Evaluate the results

  2. Review the process

  3. Determine the next steps

In reviewing the process, it may be necessary—as you no doubt determined earlier in the process—to take the results through governance and communicate with the other stakeholders in order to gain their buy-in. As for the next steps, if you want to be a change agent, make sure that you answer the what, so what, and now what in the stakeholders' minds. If you can tie their now what into the decision that you made earlier, you are money.

 

Deployment


If everything is done according to the plan up to this point, it might just come down to flipping a switch and your model goes live. Assuming that this is not the case, here are the tasks of this step:

  1. Plan the deployment

  2. Plan the monitoring and maintenance

  3. Produce the final report

  4. Review the project

After the deployment and monitoring/maintenance are underway, it is crucial, for you and for those who will follow in your footsteps, to produce a well-written final report. This report should include a white paper and briefing slides. I have to say that I resisted the drive to put my findings in a white paper, as I was an indentured servant to the military's passion for PowerPoint slides. However, slides can and will be used against you, cherry-picked or misrepresented by various parties for their benefit. Trust me, that just doesn't happen with a white paper, as it becomes an extension of your findings and beliefs.

Now for the all-important process review. You may have your own proprietary way of conducting it, but here is what it should cover, whether you conduct it in a formal or informal way:

  • What was the plan?

  • What actually happened?

  • Why did it happen or not happen?

  • What should be sustained in future projects?

  • What should be improved upon in future projects?

  • Create an action plan to ensure that sustainment and improvement happen

That concludes the review of the CRISP-DM process, which provides a comprehensive and flexible framework to improve the odds of success for your project and help make you an agent of change.

 

Algorithm flowchart


The purpose of this section is to create a tool that will help you not just select the possible modeling techniques but also think more deeply about the problem. The residual benefit is that it may help you frame the problem with the project sponsor/team. The techniques in the flowchart are certainly not comprehensive but are extensive enough to get you started. The flowchart also includes techniques not discussed in this book.

The following figure starts the flow of selecting the potential modeling techniques. As you answer the question(s), it will take you to one of the four additional charts:

Figure 1

If the data is text or in time series format, then you will follow the flow in the following figure:

Figure 2

In this branch of the flowchart, you do not have text or time series data. Additionally, you are not trying to predict what category the observations belong to.

Figure 3

To get to this section, you would have data that is not text or time series. You want to categorize the data, but it does not have an outcome label, which brings us to clustering methods, as follows:

Figure 4

This brings us to a situation where we want to categorize the data and it is labeled, that is, classification:

Figure 5

 

Summary


This chapter was about how to set yourself and your team up for success in any project that you tackle. The CRISP-DM process was put forward as a flexible and comprehensive framework to facilitate the softer skills of communication and influence. Each process step and the tasks in each step were enumerated. More than that, the commentary provided some techniques and considerations to help in executing the process. By taking heed of the process, you can indeed become an agent of positive change in any organization.

The other item put forth in this chapter was an algorithm flowchart, a cheat sheet to help identify the proper techniques to apply in order to solve the business problem. With this foundation in place, we can now move on to applying these techniques to real-world problems.

About the Author

  • Cory Lesmeister

    Cory Lesmeister has over 14 years of quantitative experience and is currently a senior data scientist for the advanced analytics team at Cummins, Inc. in Columbus, Indiana. Cory spent 16 years at Eli Lilly and Company in sales, market research, Lean Six Sigma, marketing analytics, and new product forecasting. He also has several years of experience in the insurance and banking industries, both as a consultant and as a manager of marketing analytics. A former US Army active duty and reserve officer, Cory was stationed in Baghdad, Iraq, in 2009 serving as the strategic advisor to the 29,000-person Iraqi Oil Police, succeeding where others failed by acquiring and delivering promised equipment to help the country secure and protect its oil infrastructure. Cory has a BBA in Aviation Administration from the University of North Dakota and a commercial helicopter license.

