Hello everyone, and welcome to Machine Learning Using C# and .NET. Our goal in this book is to expose you, a seasoned C# .NET developer, to the many open source machine learning frameworks that are available, as well as examples of using those packages. Along the way, we'll talk about logging, facial and motion detection, decision trees, image recognition, intuitive deep learning, quantum computing, and more. In many cases, you'll be up and running within minutes. It's my true hope that there is something for everyone in this book. Most importantly, having worked with developers for 30 years now, here's why I wrote this book.
As a lifelong Microsoft developer, I have often watched developers struggle to find the resources needed for everyday problems. Let's face it, none of us have the time to do things the way we'd like, and few of us are fortunate enough to work in a true research and development unit. We've made quite a journey over the years though, from those of us old enough to remember keeping the seminal copy of the C programmer's reference and 50 other books on our desk, to now being able to type a quick search into Google and get exactly (okay, sometimes exactly) what we are looking for. But now that the age of AI is upon us, things take a bit of a different turn. As C# developers, Google search isn't always our best friend when it comes to machine learning, because almost everything being used is Python, R, MATLAB, or Octave. We also have to remember that machine learning has been around for many years; it's just recently that corporate America has embraced it and we're seeing more and more people become involved. The computing power is now available, and academia has made incredible strides and progress in bringing it out into the world. But the world, my friends, as you have no doubt heard, is a scary place! Where is a C# .NET developer to turn? Let's start answering this question with a short story in the next section, which, unfortunately, is as true as the sky is blue. At least here in sunny Florida!
In this chapter, we are going to learn the following topics:
- Data mining
- Artificial Intelligence (AI) and bio-AI
- Deep learning
- Probability and statistics
- Supervised learning
- Unsupervised learning
- Reinforcement learning
- Whether to buy, build or open source
I once had a boss whom I told I was using machine learning to discover more about our data. His response was, "What do you think you can learn that I don't already know?!" If you haven't encountered one of those in your career, congratulations (and let me know if you have any openings!). But you more than likely have, or will. Here's how it was handled. And no, I didn't quit!
Him: "But I already know all that. And machine learning is just a buzzword; it's all data in the end, and we're all just data stewards. The rest is all buzzwords. Why should we be doing this, and how is it going to help me in the end?"
Me: "Well, let me ask you this. What do you think happens when you type a search for something in Google?"
Him: A deer-in-the-headlights look, with a slight hint of anger.
Him: "What do you mean? Google obviously compares my search against other searches that have historically looked for the same thing."
Me: "OK, and how does that get done?"
Him: A slightly bigger hint at anger and frustration.
Him: "Obviously it's computers searching the web and matching up my search criteria against others."
Me: "But did you ever think about how that search gets matched up against the billions of other searches going on, and how all the data behind the searches keeps getting updated? People obviously cannot be involved, or it wouldn't scale."
Him: "Of course, algorithms are finely tuned and give the results we are looking for, or at least, recommendations."
Me: "Right, and it is machine learning that does just that." (not always but close enough!)
Him: "OK, well I don't see what more I can learn from the data so let's see how it goes."
So, let's be honest, folks. Sometimes no amount of logic will override blinders or resistance to change, but the story has a much different and more important meaning behind it than a boss who defies everything we learned in biology. In the world of machine learning, it's a lot harder to prove and show what's going on, whether or not things are working, how they are working, why they are (or are not) working, and so on, to someone who isn't in the day-to-day trenches of development like you are. And even then, it can be very difficult for you to understand what the algorithm is doing as well.
Here are just some of the questions you should be asking yourself when it comes to deciding whether or not machine learning is right for you:
- Are you just trying to be buzzword compliant (which might be what's really being asked for) or is there a true need for this type of solution?
- Do you have the data you need?
- Is the data clean enough for usage (more on that later)?
- Do you know where, and whether, you can get data that you might be missing? More importantly, how do you know that data is in fact missing?
- Do you have a lot of data or just a small amount?
- Is there another known and proven solution that already exists that we could use instead?
- Do you know what you are trying to accomplish?
- Do you know how you are going to accomplish it?
- How will you explain it to others?
- How will you be able to prove what's going on under the hood when asked?
These are just some of the many questions we will tackle together as we embark on our machine learning journey. It's all about developing what I call the machine learning mindset.
Nowadays, it seems that if someone does a SQL query that returns more than one row, they call themselves a data scientist. Fair enough for the resume; everyone needs a pat on the back occasionally, even if it's self-provided. But are we really operating as data scientists, and what exactly does data scientist mean? Are we really doing machine learning, and what exactly does that mean? Well, by the end of this book, we'll hopefully have found the answers to all of that, or at the very least, created an environment where you can find the answers on your own!
Not all of us have the luxury of working in the research or academic world. Many of us have daily fires to fight, and the right solution just might be a tactical solution that has to be in place in 2 hours. That's what we, as C# developers, do. We sit behind our desks all day, headphones on if we're lucky, and type away. But do we ever really get the full time we want or need to develop a project the way we'd like? If we did, there wouldn't be as much technical debt in our projects as we have, right (you do track your technical debt, right)?
We need to be smart about how we can get ahead of the curve, and sometimes we do that by thinking more than we code, especially upfront. The academic side of things is invaluable; there's simply no replacement for knowledge. But most production code in corporate America isn't written in academic languages such as Python, R, MATLAB, and Octave. Even though all that academic wealth is available, it's not available in the form that suits us best to do our jobs.
In the meantime, let's stop and praise those that contribute to the open source community. It is because of them that we have some excellent third-party open source solutions out there that we can leverage to get the job done. It's such a privilege that the open source community allows us to utilize what they have developed, and the objective of this book is to expose you to just some of those tools and show how you can use them. Along the way, we'll try and give you at least some of the basic behind-the-scenes knowledge that you should know, just so that everything isn't a black hole versus a black box!
You've heard buzzwords everywhere. I used to have a 2-4 hour commute to and from work each day, and I can't remember the total number of billboards I would see that had the words machine learning or AI on them. They are everywhere, but what exactly does it all mean? AI, machine learning, data science, Natural Language Processing (NLP), data mining, neurons, phew! It seems like as soon as corporate America got involved, what was once a finely tuned art became a messy free-for-all of micromanaged projects with completely unrealistic expectations. I've even heard a prospective client say, "I'm not sure what it means, but I just don't want to be left behind!"
The first thing we must do is to learn the proper way to approach a machine learning project. Let's start with some definitions:
Tom Mitchell has defined machine learning as:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
Our definition is going to be just a bit different. It will hopefully be something that you can use when asked to defend your chosen path:
Now, what about those things we call techniques? Make no mistake; techniques such as probability and statistics are all there, just hidden under the covers. And the tools we're going to use for our examples will hide the details just like Python, R, and the rest of them do! That being said, it would be a complete disservice to you if we didn't at least make you aware of some of the basics, which we'll cover in a moment. I don't mean to lower the importance of any of them, as they are all equally important, but our goal here is to get all C# developers up and running as quickly as possible. We're going to give you enough information to make you buzzword compliant, and then you'll know more than just the black box API calls! I encourage each one of you to pursue as much academic knowledge as possible in this field. Machine learning and artificial intelligence are changing daily, it seems, so always keep up with the latest. The more you know, the better you will be at gaining acceptance for your project.
Since we brought up the topic of buzzword compliant, let's clear up a few terms right from the start. Data mining, machine learning, artificial intelligence, the list goes on and on. I'll only cover a few terms for now, but here's an easy way to think about it.
You're on a road trip with your family. Let's assume you have children, and let's put aside the are we there yet conversations! You are driving down the highway and one of your kids (a very young toddler) yells TRUCK and points out the window at a truck. This child is very young, so how did they know that particular vehicle was a truck (let's assume it really was!)? They know it's a truck because every previous time they did the same thing, you said Yes or No. That's machine learning. Then, when you told them Yes or No, that's reinforcement learning. If you said Yes, that's a big truck, that's adding context to the reinforcement, and that moves us down the road into deep learning. See what you've been teaching your children without even knowing it?
Hope that helped.
Data mining deals with searching large amounts of data for very specific information. You are searching through your data looking for something specific. For example, a credit card company would use data mining to learn about buyers' habits by analyzing purchases and their locations. This information then becomes very useful for things such as targeted advertisements.
Machine learning, on the other hand, focuses on performing the actual task of searching for that data using algorithms you have provided. Makes sense?
Enough said for now, but here is an excellent link where you can learn more about data mining: https://blog.udacity.com/2014/12/24-data-science-resources-keep-finger-pulse.html
Artificial Intelligence is a higher order of machine learning. Some people have defined it as when the machine appears as smart as or smarter than a human. As for me, the jury is still out on that one. The more I watch the daily news, the more I wonder which intelligence it is that is artificial, and for that matter, what intelligence really is! There are so many definitions floating around, but in a nutshell, Artificial Intelligence is considered a machine doing things that a human could or should do, in a manner such that any reasonable person would not be able to distinguish the machine from the human in its response. In any event, Artificial Intelligence is a very far-reaching subject, and unfortunately there are as many meanings as there are people using the term!
Bio-AI refers to putting a biological component alongside a computational component. Genotypes, phenotypes, neurons, mirror neurons, canonical neurons, synapses... you'll hear all that mentioned under this category, Artificial Neural Networks (ANNs)! Bio-AI is mostly used in the medical field. For now, we need not concern ourselves with this, but just know that the term exists and that biology is the basis for its incorporation.
For many years, it was believed that neural networks (using a concept known as hidden layers) needed only a single hidden layer to solve any problem. With the increase in computing power, the decrease in computing hardware cost, and advances in neural network algorithms, it's now common to have hundreds or even thousands of hidden layers in your network. The increase in the number of hidden layers, among other things, is what deep learning is, in a very small nutshell! Here's a visual comparison that might help in making things clearer:
As you can see in the following representational image, there are several hidden layers in the network.
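To make the picture concrete, here is a minimal, hypothetical sketch in C# of what stacking hidden layers means computationally: each layer multiplies its input by a weight matrix and passes the result through a non-linear activation. The fixed weights below are invented purely for illustration; a real network learns them from data.

```csharp
using System;
using System.Linq;

// One "layer" of a feed-forward network: multiply the input by a weight
// matrix (one row per neuron) and pass each sum through a non-linear
// activation (tanh here).
static double[] Layer(double[] input, double[][] weights) =>
    weights.Select(row => Math.Tanh(row.Zip(input, (w, x) => w * x).Sum()))
           .ToArray();

double[] input = { 0.5, -0.2 };

// "Deep" learning is, at its core, stacking many of these layers;
// two hidden layers here, but the pattern extends to hundreds.
double[] hidden1 = Layer(input,   new[] { new[] { 0.1, 0.4 }, new[] { -0.3, 0.8 } });
double[] hidden2 = Layer(hidden1, new[] { new[] { 0.7, -0.1 }, new[] { 0.2, 0.5 } });
double[] output  = Layer(hidden2, new[] { new[] { 0.6, 0.9 } });

Console.WriteLine($"Network output: {output[0]:F4}");
```

Adding depth is literally just more calls to `Layer`; the frameworks we use later do exactly this (plus learning the weights), hidden behind their APIs.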
Believe it or not, this is what you are doing; it's just very well abstracted from your view. But let me give you an incredibly, overly simplified, quick primer... just in case you are rusty!
You see a polar bear walking in the snow. You wonder what kind of footprints it makes. That's probability. Next, you see footprints in the snow and wonder if it's a polar bear. That's statistics. Kaboom! Now you're primed! You're also probably wondering what is wrong with this author, so maybe another example just in case!
- Probability deals with predicting the likelihood of future event(s).
- Statistics deals with analyzing the frequency of past event(s).
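If you prefer to see that distinction in code, here is a tiny C# illustration: probability reasons forward from a known model (a fair die) to the chance of an event, while statistics reasons backward from observed data to an estimated frequency. The sample of past rolls is made up for illustration.

```csharp
using System;
using System.Linq;

// Probability: reason forward from a known model to the chance of an event.
// For a fair six-sided die, the chance of rolling a 6 is 1/6.
double pSix = 1.0 / 6.0;
Console.WriteLine($"P(roll a 6) = {pSix:F3}");

// Statistics: reason backward from observed data to the underlying process.
// Given a (made-up) sample of past rolls, estimate how often a 6 appeared.
int[] observedRolls = { 6, 2, 3, 6, 1, 4, 6, 5, 2, 6 };
double observedFrequency = observedRolls.Count(r => r == 6) / (double)observedRolls.Length;
Console.WriteLine($"Observed frequency of 6 = {observedFrequency:F3}");
```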
Next, let's talk about how we're going to approach our machine learning project, and while doing so, continue to define/refine our machine learning mindset. Let's start by defining the basic steps that we need to use each time we approach one of these projects. Basically, we can break them down into the following categories.
There are countless types of data at your disposal, from SQL and NoSQL databases, Excel files, Access databases, text files, and on and on. You need to decide where your data is located, how it is formatted, how you will import and refine it. You need to always keep in mind that there is no substitute for large amounts of testing and training data, as well as the quality of it. Garbage in, garbage out can get very messy in machine learning!
As we said previously, there is simply no substitute for data quality. Is there data that is missing, malformed, or incorrect? And let's not forget about another term you'll get familiar with, data outliers. Those are the nasty little pieces of data that simply don't fit nicely with the rest of your data! Do you have those? If so, should they be there, and if so, how will they be treated? If you are not sure, here's what a data outlier might look like if you are plotting your data:
In statistics, an outlier is an observation point that is distant from other observations, sometimes very much so, sometimes not. The outlier itself may be due to variability in measurement, indicate an experiment defect, or it might in fact be valid. If you see outliers in your data, you need to understand why. They can indicate some form of measurement error, and the algorithm that you are using may not be robust enough to handle these outliers.
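One simple, common way to flag candidate outliers while preparing your data is the z-score: measure how many standard deviations each point sits from the mean and flag anything beyond a threshold. The data below is hypothetical, and the threshold of 2 is just one conventional choice (2.5 or 3 is also widely used); whatever gets flagged still needs the human judgment described above.

```csharp
using System;
using System.Linq;

// Flag points whose z-score (distance from the mean, in standard
// deviations) exceeds a chosen threshold.
double[] data = { 10, 12, 11, 13, 12, 11, 95, 12, 10, 13 };

double mean = data.Average();
double stdDev = Math.Sqrt(data.Average(x => (x - mean) * (x - mean)));

var outliers = data.Where(x => Math.Abs(x - mean) / stdDev > 2).ToArray();
Console.WriteLine($"Outliers: {string.Join(", ", outliers)}");
```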
When creating and training a model, here are a few things that you need to consider.
- You need to choose the appropriate machine learning algorithm for the task at hand, one which is representative of the data you are working with. You will then split your data into 2-3 subsets: training, validation, and test. The rules for the correct proportions vary based upon the amount of data you are working with. For example, if you have 10,000 rows of data, then perhaps 80% for training and 20% for testing is good. But if you have 10^8 rows of data, perhaps 95% training and 5% testing is better.
- There is one rule that you must always follow to the letter. Whatever proportions you decide to use for your test, train, and validation sets, ALL THE DATA MUST COME FROM THE SAME DATASET. This is so very important. You never want to take some data from one dataset to train on, and then data from a completely different dataset to test on. That will just lead to frustration. Always accumulate large datasets to train, test, and validate on!
- Validation data can be used to tune and validate your model prior to using the test dataset. Some people use a validation set, some don't. However you split your data up, you will always have a set to train with and a set to test with. Your algorithm must be flexible enough to handle data it has not previously seen, and you cannot verify that if you test with the same data you developed against. Following are the two ways that the data can be split (one with a cross-validation set and the other without one):
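As a concrete sketch of the splitting rule above, here is one hypothetical way to carve a single dataset into train, validation, and test subsets in C#. The 80/10/10 proportions are an arbitrary example; the key point is that all three subsets are drawn from the same shuffled dataset.

```csharp
using System;
using System.Linq;

// A stand-in dataset of 100 rows, here just the numbers 1..100.
int[] dataset = Enumerable.Range(1, 100).ToArray();

// Shuffle first, so each subset is a random sample of the SAME dataset.
// A fixed seed keeps the example repeatable.
var rng = new Random(42);
var shuffled = dataset.OrderBy(_ => rng.Next()).ToArray();

int trainCount = (int)(shuffled.Length * 0.8);
int validCount = (int)(shuffled.Length * 0.1);

var train = shuffled.Take(trainCount).ToArray();
var valid = shuffled.Skip(trainCount).Take(validCount).ToArray();
var test  = shuffled.Skip(trainCount + validCount).ToArray();

Console.WriteLine($"Train: {train.Length}, Validation: {valid.Length}, Test: {test.Length}");
```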
Once you have used your training data, you will move on to testing/evaluating your model using the test dataset you prepared earlier. This is where we find out how well our model works against data that it has not previously seen. If our model fails here, we return to go, do not collect $200, and refine our process!
As you are evaluating your model, you may, at some point, determine that you need to choose a different model or introduce more features/variables/hyper-parameters to improve the efficiency and performance of your model. One good way of reducing your exposure here is to spend the extra time in the Data collection section and Data preparation section. As we said earlier, there is simply no substitute for a lot of correct data.
If you have to tune your models, and you will, there are many approaches to doing so. Here are just a few:
- Grid search
- Random search
- Bayesian optimization
- Gradient-based optimization
- Evolutionary optimization
Let's look at an example dataset—the infamous and always used Iris dataset.
The Iris dataset is a dataset of flowers introduced by the biologist and statistician Ronald Fisher in 1936. This dataset contains 50 samples from each of 3 species of the Iris flower (Iris setosa, Iris virginica, and Iris versicolor). Each sample consists of four features: the length of the sepal, the width of the sepal, the length of the petal, and the width of the petal. Combined, this data supports a linear discriminant model distinguishing one species from another.
So, how do we go from the flower to the data:
We need to now take what we know about the visual representation of what we are working with (the flower) and transform it into something the computer can understand. We do so by breaking down all the information we know about the flower into columns (features) and rows (data items) as you can see below:
Now that all the measurements are in a format which the computer can understand, our first step should be to make sure we have no missing or malformed data, as that spells trouble. If you look at the yellow highlights in the previous screenshot, you can see that we are missing data. We need to ensure that this gets populated before we feed it to our application. Once the data is properly prepared and validated, we are ready to go. If we run the Iris validator from Encog, our output should reflect that we have 150 rows of data, which it does:
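The kind of check just described can also be sketched in plain C#. The few rows below are hypothetical sample measurements (with `null` standing in for a missing value), not the actual Encog output; the point is simply that every row must have all four features populated before training.

```csharp
using System;
using System.Linq;

// Hypothetical Iris rows: sepal length, sepal width, petal length, petal width.
// null marks a missing measurement that must be fixed before training.
double?[][] rows =
{
    new double?[] { 5.1, 3.5, 1.4, 0.2 },
    new double?[] { 4.9, 3.0, null, 0.2 },  // malformed: petal length missing
    new double?[] { 6.3, 3.3, 6.0, 2.5 },
};

for (int i = 0; i < rows.Length; i++)
{
    if (rows[i].Any(v => v == null))
        Console.WriteLine($"Row {i}: missing data, fix before training");
}
Console.WriteLine($"Complete rows: {rows.Count(r => r.All(v => v != null))} of {rows.Length}");
```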
Now, let's briefly familiarize ourselves with the different types of machine learning which we will discuss throughout the book, starting with the next chapter. It is important that you are at least familiar with these terms as they surely will come up one day, and the more you know and understand, the better you will be able to approach your problem and explain it to others.
Here is a simple diagram which shows the three main categories of machine learning:
These types of machine learning models are used to predict the outcome based upon the data presented to it. The instructions provided are explicit and detailed, or at least should be, which is what has garnered the label supervised. We are basically learning a function which maps an input to an output based upon input and output pairs. This function is inferred from training data which is called labeled, in that it specifically tells the function what it expects. In supervised learning, there is always an input and corresponding output (or more correctly, a desired output value). More formally, this type of algorithm uses a technique known as inductive bias to accomplish this, which basically means that there are a set of assumptions which the algorithm will use to predict the outputs given inputs it may or may not have previously seen.
In supervised learning, we typically have access to a set of p features, X1, X2, X3, ..., Xp, measured on n observations, and a response Y measured on those same n observations. We then try to predict Y using X1, X2, X3, ..., Xp.
Models such as Support Vector Machines (SVM), linear regression, Naive Bayes, and tree-based methods are just a few examples of supervised learning.
Next, let's briefly discuss a few things which we need to concern ourselves with when it comes to supervised learning. They are, in no particular order:
- Bias-variance trade-off
- Amount of training data
- Input space dimensionality
- Incorrect output values
- Data heterogeneity
Before we talk about the bias-variance trade-off, it only makes sense that we would first make sure you are familiar with the individual terms themselves.
When we talk about the bias-variance trade-off, bias refers to an error from incorrect assumptions in the learning algorithm. High bias causes what is known as under-fitting, a phenomenon which causes the algorithm to miss the relevant relationships between features and outputs in the data.
Variance, on the other hand, is a sensitivity error to small fluctuations in the training set. High variance can cause your algorithm to model random noise rather than the actual intended outputs, a phenomenon known as over-fitting.
There is a trade-off between bias and variance that every machine learning developer needs to understand. It has a direct correlation to under and over fitting of your data. We say that a learning algorithm has a high variance for an input if it predicts a different output result when used on a different training set, and that of course is not good.
A machine learning algorithm with low bias must be flexible enough so that it can fit the data well. If the algorithm is too flexible, however, it will fit each training and test dataset differently, resulting in high variance.
Your algorithm must be flexible enough to adjust this trade-off either by inherent algorithmic knowledge or a parameter which can be user adjusted.
The following figure shows a simple model with high bias (to the left), and a more complex model with high variance (to the right).
As we have said repeatedly, there simply is no substitute for having enough data to get the job done correctly and completely. This directly correlates with the complexity of your learning algorithm. A less complex algorithm with high bias and low variance can learn well from a smaller amount of data. However, if your learning algorithm is complex (many input features, parameters, and so on), with low bias and high variance, then you will need a much larger training set from which to learn.
With every learning problem our input is going to be in the form of a vector. The feature vector, meaning the features of the data itself, can affect the learning algorithm greatly. If the input feature vectors are very large, which is called high-dimensionality, then learning can be more difficult even if you only need just a few of those features. Sometimes, the extra dimensions confuse your learning algorithm, which results in high variance. This, in turn, means that you will have to tune your algorithm to have lower variance and higher bias. It is sometimes easier, if applicable, to remove the extra features from your data, thus improving your learning function accuracy.
That being said, a popular technique known as dimensionality reduction is used by several machine learning algorithms. These algorithms will identify and remove irrelevant features.
The question we ask ourselves here is how many errors exist in the desired output values (the labels) of our training data. If many do, the learning algorithm may be attempting to fit the data too well, resulting in something we mentioned previously, over-fitting. Over-fitting can result from incorrect data, or a learning algorithm which is too complex for the task at hand. If this happens, we need to either tune our algorithm or look for one which will provide us with higher bias and lower variance.
Heterogeneity, according to Webster's dictionary, means the quality or state of consisting of dissimilar or diverse elements: the quality or state of being heterogeneous. To us, this means that the feature vectors include features of many different kinds. If this applies to our application, then it may be better for us to apply a different learning algorithm for the task. Some learning algorithms also require that our data is scaled to fit within certain ranges, such as [0, 1] or [-1, 1]. As we get into learning algorithms that utilize distance functions as their basis, such as nearest neighbor and support vector methods, you will see that they are exceptionally sensitive to this. On the other hand, tree-based algorithms (decision trees, and so on) handle this phenomenon quite well.
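Scaling a feature into a range such as [0, 1] is typically done with min-max scaling: subtract the minimum and divide by the range. Here is a minimal sketch with hypothetical values:

```csharp
using System;
using System.Linq;

// Min-max scaling: map a feature into the [0, 1] range, which
// distance-based learners (nearest neighbor, SVMs) are sensitive to.
double[] feature = { 2.0, 5.0, 9.0, 4.0, 12.0 };

double min = feature.Min();
double max = feature.Max();
double[] scaled = feature.Select(x => (x - min) / (max - min)).ToArray();

Console.WriteLine(string.Join(", ", scaled));
```

The minimum maps to 0, the maximum to 1, and everything else lands in between, so no single feature dominates a distance calculation merely because of its units.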
We will end this discussion by saying that we should always start with the least complex, and most appropriate algorithm, and ensure our data is collected and prepared correctly. From there, we can always experiment with different learning algorithms and tune them to see which one works best for our situation. Make no mistake, tuning algorithms may not be a simple task, and in the end, consumes a lot more time than we have available. Always ensure the appropriate amount of data is available first!
In contrast to supervised learning, unsupervised learning usually has more leeway in how the outcome is determined. The data is treated such that, to the algorithm, no single feature is more important than any other in the dataset. These algorithms learn from datasets of input data without the expected output being labeled. k-means clustering (cluster analysis) is an example of an unsupervised model. It is very good at finding patterns in the data that have meaning relative to the input data. The big difference from the supervised setting is that we still have p features, X1, X2, X3, ..., Xp, measured on n observations, but we are no longer interested in predicting Y, because we no longer have Y. Our only interest is to discover data patterns over the features that we have:
In the previous diagram, you can see that data such as this lends itself much more to a non-linear approach, where the data appears to be in clusters relative to importance. It is non-linear because there is no way we will get a straight line to accurately separate and categorize the data. Unsupervised learning allows us to approach a problem with little to no idea what the results will, or should, look like. Structure is derived from the data itself versus supervised rules applied to output labels. This structure is usually derived by clustering relationships of data.
For example, let's say we have 10^8 genes from our genomic data science experiment. We would like to group this data into similar segments, such as hair color, lifespan, weight, and so on.
The second example is what is famously known as the cocktail party effect, which basically refers to the brain's auditory ability to focus attention on one thing and filter out the noise around it.
Both examples can use clustering to accomplish their goals.
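To show what clustering looks like mechanically, here is a deliberately tiny, hypothetical k-means sketch in one dimension with k = 2: assign each point to its nearest centroid, move each centroid to the mean of its cluster, and repeat. Real libraries handle many dimensions, initialization, and convergence checks for you; the points and starting centroids below are invented for illustration.

```csharp
using System;
using System.Linq;

// Six 1-D points that visibly form two groups, and two starting guesses.
double[] points = { 1.0, 1.5, 2.0, 8.0, 8.5, 9.0 };
double[] centroids = { 0.0, 10.0 };

for (int iter = 0; iter < 10; iter++)
{
    // Assignment step: each point joins its nearest centroid's cluster.
    var clusters = points.GroupBy(p =>
        Math.Abs(p - centroids[0]) <= Math.Abs(p - centroids[1]) ? 0 : 1);

    // Update step: move each centroid to the mean of its cluster.
    foreach (var cluster in clusters)
        centroids[cluster.Key] = cluster.Average();
}

Console.WriteLine($"Centroids: {centroids[0]}, {centroids[1]}");
```

Notice that no labels were ever supplied; the structure (two clusters around 1.5 and 8.5) emerges from the data itself.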
Reinforcement learning is a case where the machine is trained for a specific outcome with the sole purpose of maximizing efficiency and/or performance. The algorithm is rewarded for making correct decisions and penalized for making incorrect ones. Continual training is used to constantly improve performance, and the continual learning process means less human intervention. Markov models are an example of reinforcement learning, and self-driving cars are a great example of just such an application. The car constantly interacts with its environment, watching for obstacles, speed limits, distances, pedestrians, and so on, in order to (hopefully) make the correct decisions.
The biggest difference with reinforcement learning is that we do not deal with labeled input and output data. The focus here is on performance, meaning somehow finding a balance between exploring unseen data and exploiting what the algorithm has already learned.
The algorithm applies an action to its environment, receives a reward or a penalty based upon what it has done, repeats, and so on as shown in the following image. You can just imagine how many times per second this is happening in that nice little autonomous taxi that just picked you up at the hotel.
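That action-reward loop can be sketched with a tiny, hypothetical two-action "bandit" in C#: the agent acts, the environment returns a reward, and the agent nudges its value estimate for that action toward what it just received, using an epsilon-greedy strategy to balance exploration and exploitation. The reward values and learning rate are invented for illustration; a self-driving system runs a vastly more sophisticated version of this loop.

```csharp
using System;

var rng = new Random(0);
double[] value = new double[2];         // the agent's estimated value of each action
double[] trueReward = { 0.2, 0.8 };     // the environment's hidden payoffs
double learningRate = 0.1;

for (int step = 0; step < 1000; step++)
{
    // Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    int action = rng.NextDouble() < 0.1
        ? rng.Next(2)
        : (value[1] > value[0] ? 1 : 0);

    // The environment responds with a reward...
    double reward = trueReward[action];

    // ...and the agent nudges its estimate toward what it just experienced.
    value[action] += learningRate * (reward - value[action]);
}

Console.WriteLine($"Learned values: {value[0]:F2}, {value[1]:F2}");
```

After enough iterations, the agent's estimates approach the true payoffs and it settles on the better action, all without ever being shown a labeled "correct" answer.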
Next, let's ask ourselves the ever-important question: buy, build, or open source?
It would be my recommendation, and of course one reason why I'm writing this book, to expose yourself to the open source world. I realize that many developers suffer from the 'not invented here' syndrome, but we should really be honest with ourselves before going down that path. Do we really think we have the expertise to do it better and faster, and have it tested, within our time constraints, compared to what is already out there? We should first see what is already out there that we can use. There are so many fabulous open source toolkits for us to use, and the developers of those have put tremendous amounts of hours and work into developing and testing them. Obviously, open source is not a solution for everyone, every time, but even if you cannot use it in your application, there is certainly tremendous knowledge you can gain by using and experimenting with it.
Buying usually isn't an option. If you're lucky enough to find something to purchase, you probably won't get the approval as it will cost a pretty penny! And what happens if you need to modify the product to do something you need? Good luck getting access to the source or having the support team change their priorities just for you. Not going to happen, at least not as fast as we'll probably need it to!
And as for building it yourself, hey we're developers, it's what we all want to do, right? But before you fire up Visual Studio and take off, think long and hard about what you are getting into.
So open source should always be a first choice. You can bring it in house (assuming the licensing allows you to), and adapt it to your standards if need be (code contracts, more unit tests, better documentation, and so on).
Although the code is in Python and R, I encourage those interested in expanding upon what we have talked about in this chapter to visit Jason Brownlee's site, https://machinelearningmastery.com/. His explanations of machine learning are clear and passionate, cover an incredible amount of depth, and his enthusiasm for the field is second to none. I highly recommend perusing his blog and site to learn as much as you can.
In this chapter, we discussed many aspects of machine learning with C#, different strategies for implementing your code (build, buy, or open source), and lightly touched upon some important definitions. I hope this has got you ready for the chapters to come.
Before we dive right into our source code and applications, I want to take some time to discuss with you something that is very near and dear to my heart: logging. It's something that we all do (or should do), and there is a phenomenal tool out there that you need to know about if you do not already. We'll be using it quite a bit in this book, so it's definitely helpful to spend some time on it up front, starting in the next chapter.
- By Nicoguaro - Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=46257808
- Creative Commons Attribution-ShareAlike 3.0 Unported
- Encog framework is copyright of Jeff Heaton/Heaton research