C# Machine Learning Projects

By Yoon Hyup Hwang

About this book

Machine learning is applied in almost every kind of real-world setting and industry, from medicine to advertising and from finance to scientific research. This book will help you learn how to choose a model for your problem, how to evaluate the performance of your models, and how to use C# to build machine learning models for your future projects.

You will get an overview of machine learning systems and how you, as a C# and .NET developer, can apply your existing knowledge to the wide gamut of intelligent applications, all through a project-based approach. You will start by setting up your C# environment for machine learning with the required packages: Accord.NET, LiveCharts, and Deedle. We will then take you right from building classification models for spam email filtering and applying NLP techniques to Twitter sentiment analysis, to time-series and regression analysis for forecasting foreign exchange rates and house prices, as well as drawing insights on customer segments in e-commerce. You will then build a recommendation model for music genre recommendation and an image recognition model for handwritten digits. Lastly, you will learn how to detect anomalies in network and credit card transaction data for cyber attack and credit card fraud detection.

By the end of this book, you will be putting your skills into practice and applying your machine learning knowledge in real projects.

Publication date:
June 2018


Chapter 1. Basics of Machine Learning Modeling

It can be difficult to see how machine learning (ML) affects the daily lives of ordinary people. In fact, ML is everywhere! When you searched for a restaurant for dinner, you almost certainly used ML. When you searched for a dress to wear to a dinner party, you used ML. On your way to your dinner appointment, you probably used ML as well if you used a ride-sharing app. ML has become so widely used that it is now an essential, if usually unnoticeable, part of our lives. With ever-growing data and its accessibility, the applications of and need for ML are rapidly rising across various industries. However, the number of trained data scientists has yet to catch up with the business demand for ML, despite the abundant resources and software libraries that make building ML models easier, because it takes time and experience for data scientists and ML engineers to master such skill sets. This book prepares such individuals with real-world projects based on real-world datasets.

In this chapter, we will learn about some of the real-life examples and applications of ML, the essential steps in building ML models, and how to set up our C# environment for ML. After this brief introductory chapter, we will dive immediately into building classification ML models using text datasets in Chapter 2, Spam Email Filtering, and Chapter 3, Twitter Sentiment Analysis. Then, we will use financial and real estate property data to build regression models in Chapter 4, Foreign Exchange Rate Forecast, and Chapter 5, Fair Value of House and Property. In Chapter 6, Customer Segmentation, we will use a clustering algorithm to gain insight into customer behavior using e-commerce data. In Chapter 7, Music Genre Recommendation, and Chapter 8, Handwritten Digit Recognition, we will build recommendation and image recognition models using audio and image data. Lastly, we will use semi-supervised learning techniques to detect anomalies in Chapter 9, Cyber Attack Detection, and Chapter 10, Credit Card Fraud Detection.

In this chapter, we will cover the following topics:

  • Key ML tasks and applications
  • Steps in building ML models
  • Setting up a C# environment for ML

Key ML tasks and applications

There are many areas where ML is used in our daily lives without being noticed. Media companies use ML to recommend the most relevant content, such as news articles, movies, or music, for you to read, watch, or listen to. E-commerce companies use ML to suggest items that match your interests and that you are most likely to purchase. Game companies use ML to detect your motion and joint movements for their motion sensor games. Some other common uses of ML in industry include face detection on cameras for better focusing, automated question answering, where chatbots or virtual assistants interact with customers to answer questions and requests, and detecting and preventing fraudulent transactions. In this section, we will take a look at some of the applications we use in our daily lives that rely heavily on ML:

  • Google News feed: Google News feed uses ML to generate a personalized stream of articles based on the user's interests and other profile data. Collaborative filtering algorithms are frequently used for such recommendation systems and are built from the view history data of their user base. Media companies use such personalized recommendation systems to attract more traffic to their websites and increase the number of subscribers.
  • Amazon product recommendations: Amazon uses users' browsing and order history data to train an ML model that recommends products a user is most likely to purchase. This is a good use case for supervised learning in the e-commerce industry. These recommendation algorithms help e-commerce companies maximize their profits by displaying the items most relevant to each user's interests.
  • Netflix movie recommendation: Netflix uses movie ratings, view history, and preference profiles to recommend other movies that a user might like. They train collaborative filtering algorithms with this data to make personalized recommendations. More than 80 per cent of the TV shows people watch on Netflix are discovered through the platform's recommendation system, according to an article in Wired (http://www.wired.co.uk/article/how-do-netflixs-algorithms-work-machine-learning-helps-to-predict-what-viewers-will-like), making this a very useful and profitable example of ML at a media company.
  • Face detection on cameras: Cameras detect faces for better focusing and light metering. This is the most frequently used example of computer vision and classification. Also, some photo management software uses clustering algorithms to group similar faces in your images together so that you can search photos by certain people in them later.
  • Alexa – Virtual assistant: Virtual assistant systems, such as Alexa, can answer questions such as What's the weather in New York? or complete certain tasks, such as Turn on the living room lights. These kinds of virtual assistant systems are typically built using speech recognition, natural language understanding (NLU), deep learning, and various other machine learning technologies.
  • Microsoft Xbox Kinect: Kinect can sense how far each object is from the sensor and detect joint positions. Kinect is trained with a randomized decision forest algorithm that builds lots of individual decision trees from depth images.

The following screenshot shows different examples of recommendation systems using ML:

 Left: Google News Feed, top-right: Amazon product recommendation, bottom-right: Netflix movie recommendation

The following screenshot depicts a few other examples of ML applications:

Left: Face detection, middle: Amazon Alexa, right: Microsoft Xbox Kinect


Steps in building ML models

Now that we have seen some examples of ML applications, the question is, How do we go about building such ML applications and systems? Books about ML and ML courses taught in universities typically start by covering the mathematics and theories behind ML algorithms and then apply those algorithms to a given dataset. This approach is great for people who are completely new to the subject and are looking to learn the foundations of ML. However, aspiring data scientists with some prior knowledge and experience who are looking to apply that knowledge to real ML projects often stumble over where to start and how to approach a given project. In this section, we will discuss a typical workflow for building an ML application, which we will follow throughout the book. The following figure summarizes our approach to developing an application using ML, and we will discuss it in more detail in the following subsections:


 Steps in building ML models

As seen in the preceding diagram, the steps to follow when building ML models are as follows:

  • Problem definition: The first step in starting any project is not only understanding the problem, but also defining the problem that you are trying to solve using ML. Poor definition of a problem will result in a meaningless ML system, since the models will have been trained and optimized for a problem that you are not actually trying to solve. This first step is unarguably the most important step in building useful ML models and applications. You should at least answer the following four questions before you jump into building ML models:
    • What is the problem? This is where you describe and state the problem that you are trying to solve. For example, a problem description for a small business lending project might be: we need a system to assess a small business owner's ability to pay back a loan.
    • Why is it a problem? It is important to define why such a problem is actually a problem and why the new ML model is going to be useful. Maybe you have a working model already and you have noticed it is performing worse than before; you might have obtained new data sources that you can use for building a new prediction model; or maybe you want your existing model to produce prediction results more quickly. There can be multiple reasons why you think this is a problem and why you need a new model. Defining why it is a problem will help you stay on the right track while you are building a new ML model.
    • What are some of the approaches to solving this problem? This is where you brainstorm your approaches to solve the given problem. You should think about how this model is going to be used (do you need this to be a real-time system or is this going to be run as a batch process?), what type of problem it is (is it a classification problem, regression, clustering, or something else?), and what types of data you would need for your model. This will provide a good basis for future steps in building your machine learning model.
    • What are the success criteria? This is where you define your checkpoints. You should think about what metrics you will look at and what your target model performance should look like. If you are building a model that is going to be used in a real-time system, then you can also set the target execution speed and data availability at runtime as part of your success criteria. Setting these success criteria will help you keep moving forward without being stuck at a certain step.
  • Data collection: Having data, preferably lots of it, is the most essential and critical part of building an ML model. No data, no model. Depending on your project, your approaches to collecting data can vary. You can purchase existing data sources from other vendors, you can scrape websites and extract data from there, you can use publicly available data, or you can collect your own data. There are multiple ways you can gather the data you need for your ML model, but you need to keep in mind these two elements of your data when you are in the process of data collection—the target variable and the feature variables. The target variable is the answer for your predictions, and the feature variables are the factors that your models will use to learn how to predict the target variable. Often, target variables are not present in a labeled form. For example, when you are dealing with Twitter data to predict the sentiment of each tweet, you might not have labeled sentiment data for each tweet. In this case, you will have to take an extra step to label your target variables. Once you have collected your data, you can move on to the data preparation step.
  • Data preparation: Once you have gathered all of your input data, you need to prepare it so that it is in a usable format. This step is more important than you might think. If your data is messy and you do not clean it up for your learning algorithms, your algorithms will not learn well from your dataset and will not perform as expected. Also, even high-quality data is meaningless if it is not in a format that your algorithms can be trained with. Bad data, bad model. You should at least handle the following common problems to have your data ready for the next steps:
    • File format: If you are getting your data from multiple data sources, you will most likely run into different formats for each data source. Some data might be in CSV format, while other data is in JSON or XML format. Some data might even be stored in a relational database. In order to train your ML model, you will need to first merge all these data sources in different formats into one standard format.
    • Data format: It can also be the case that data formats vary among different data sources. For example, some data might have the address field broken down into street address, city, state, and ZIP, while some others might not. Some data might have the date field in the American date format (mm/dd/yyyy), while some others may be in British format (dd/mm/yyyy). These data format discrepancies among data sources can cause issues when you are parsing the values. In order to train your ML model, you will need to have a uniform data format for each field.
    • Duplicate records: Often you will see the same exact records repeated in your dataset. This problem can occur during data collection, where you recorded a data point more than once, or while merging different datasets in your data preparation process. Having duplicate records can adversely affect your model, so it is good to check for duplicates in your dataset before you move on to the next steps.
    • Missing values: It is also common to see some records with empty or missing values in the data. This can also have an adverse effect when you are training your ML models. There are multiple ways to handle missing values in your data, but you will have to be careful and understand your data very well, as this can change your model performance dramatically. Some of the ways you can handle the missing values include dropping records with missing values, replacing missing values with the mean or median, replacing missing values with a constant, or replacing missing values with a dummy variable and an indicator variable for missing. It will be beneficial to study your data before you deal with the missing values.
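Mean imputation, one of the strategies listed above, can be sketched in a few lines of plain C#. This is only an illustration; here `double.NaN` stands in for a missing entry:

```csharp
using System;
using System.Linq;

class ImputeExample
{
    static void Main()
    {
        // A toy column with one missing value marked as NaN
        double[] values = { 1.0, double.NaN, 3.0, 5.0 };

        // Compute the mean over the observed (non-missing) values only
        double mean = values.Where(v => !double.IsNaN(v)).Average();

        // Replace each missing value with that mean
        double[] imputed = values
            .Select(v => double.IsNaN(v) ? mean : v)
            .ToArray();

        Console.WriteLine(string.Join(", ", imputed)); // 1, 3, 3, 5
    }
}
```

Replacing with the median or a constant follows the same pattern; which choice is right depends on the distribution of the variable, which is why studying your data first matters.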
  • Data analysis: Now that your data is ready, it is time to actually look at the data and see if you can recognize any patterns and draw some insights from the data. Summary statistics and plots are two of the best ways to describe and understand your data. For continuous variables, looking at the minimum, maximum, mean, median, and quartiles is a good place to start. For categorical variables, you can look at the counts and percentages of categories. As you are looking at these summary statistics, you can also start plotting graphs to visualize the structures of your data. The following figure shows some commonly used charts for data analysis. Histograms are frequently used to show and inspect underlying distributions of variables, outliers, and skewness. Box plots are frequently used to visualize five-number summary, outliers, and skewness. Pairwise scatter plots are frequently used to detect obvious pairwise correlations among the variables:

Data analysis and visualizations. Top-left: histogram of nominal house sale prices, top-right: histogram of house sale prices on a logarithmic scale, bottom-left: box plots of the distributions of basement, first floor, and second floor square footage, bottom-right: scatter plot of first versus second floor square footage
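The summary statistics mentioned above can be computed directly with LINQ. A minimal sketch on a made-up price column (the quartile helper uses simple linear interpolation; libraries may define quartiles slightly differently):

```csharp
using System;
using System.Linq;

class SummaryStats
{
    // Linear-interpolation percentile over a sorted copy of the data
    static double Percentile(double[] data, double p)
    {
        double[] sorted = data.OrderBy(v => v).ToArray();
        double rank = p * (sorted.Length - 1);
        int lo = (int)Math.Floor(rank);
        int hi = (int)Math.Ceiling(rank);
        return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
    }

    static void Main()
    {
        double[] prices = { 120.0, 135.5, 150.0, 180.0, 240.0 };
        Console.WriteLine($"min={prices.Min()}, max={prices.Max()}");
        Console.WriteLine($"mean={prices.Average()}");
        Console.WriteLine($"median={Percentile(prices, 0.5)}");
        Console.WriteLine($"Q1={Percentile(prices, 0.25)}, Q3={Percentile(prices, 0.75)}");
    }
}
```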

    • Feature engineering: Feature engineering is the most important part of the model building process in applied ML. However, this is one of the least discussed topics in many textbooks and ML courses. Feature engineering is the process of transforming raw input data into more informative data for your algorithms to learn from. For example, for your Twitter sentiment prediction model that we will build in Chapter 3, Twitter Sentiment Analysis, your raw input data may only contain a list of text in one column and a list of sentiment targets in another column. Your ML model will probably not learn how to predict each tweet's sentiment well with this raw data. However, if you transform this data so that each column represents the number of occurrences of each word in each tweet, then your learning algorithm can learn the relationship between the existence of certain words and sentiments more easily. You can also group each word with its adjacent word (bigram) and have the number of occurrences of each bigram in each tweet as another group of features. As you can see from this example, feature engineering is a way of making your raw data more representative and informative of the underlying problems. Feature engineering is not only a science, but also an art. Feature engineering requires good domain knowledge of the dataset, the creativity to build new features from raw input data, and multiple iterations for better results. As we work through this book, we will cover how to build text features using some natural language processing (NLP) techniques, how to build time series features, how to sub-select features to avoid overfitting issues, and how to use dimensionality reduction techniques to transform high-dimensional data into fewer dimensions.

Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.

-Andrew Ng
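The word-count transformation described above, including the bigram variant, can be sketched as follows. The whitespace tokenization is deliberately naive; the projects later in the book use proper NLP tooling instead:

```csharp
using System;
using System.Collections.Generic;

class BagOfWords
{
    // Count unigram and bigram occurrences in a single tweet
    static Dictionary<string, int> Featurize(string tweet)
    {
        string[] words = tweet.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        var counts = new Dictionary<string, int>();
        void Add(string key)
        {
            counts.TryGetValue(key, out int c);
            counts[key] = c + 1;
        }

        foreach (string w in words)
            Add(w);

        // Bigrams: each word paired with its adjacent word
        for (int i = 0; i < words.Length - 1; i++)
            Add(words[i] + " " + words[i + 1]);

        return counts;
    }

    static void Main()
    {
        var features = Featurize("not good not fun");
        Console.WriteLine(features["not"]);      // 2
        Console.WriteLine(features["not good"]); // 1
    }
}
```

A real pipeline would then align these per-tweet dictionaries into one matrix with a column per word or bigram, which is the format the learning algorithms expect.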

  • Train/test algorithms: Once you have created your features, it is time to train and test some ML algorithms. Before you start training your models, it is good to think about performance metrics. Depending on the problem you are solving, your choice of performance measure will differ. For example, if you are building a stock price forecast model, you might want to minimize the difference between your prediction and the actual price and choose root mean square error (RMSE) as your performance measure. If you are building a credit model to predict whether a person can be approved for a loan or not, you would want to use the precision rate as your performance measure, since incorrect loan approvals (false positives) will have a more negative impact than incorrect loan disapprovals (false negatives). As we work through the chapters, we will discuss more specific performance metrics for each project.
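As a concrete illustration of the RMSE measure mentioned above, here is a sketch of the metric itself; the numbers are made up:

```csharp
using System;
using System.Linq;

class RmseExample
{
    // Root mean square error between predictions and actual values
    static double Rmse(double[] predicted, double[] actual)
    {
        double meanSquaredError = predicted
            .Zip(actual, (p, a) => (p - a) * (p - a))
            .Average();
        return Math.Sqrt(meanSquaredError);
    }

    static void Main()
    {
        double[] actual = { 100.0, 102.0, 104.0 };
        double[] predicted = { 101.0, 101.0, 106.0 };
        // errors: 1, -1, 2 -> squares: 1, 1, 4 -> mean 2 -> sqrt(2)
        Console.WriteLine(Rmse(predicted, actual)); // ~1.414
    }
}
```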

Once you have specific performance measures in mind for your model, you can now train and test various learning algorithms and their performance. Depending on your prediction target, your choice of learning algorithms will also vary. The following figure shows illustrations of some of the common machine learning problems. If you were solving classification problems, you would want to train classifiers, such as the logistic regression model, the Naive Bayes classifier, or the random forest classifier. On the other hand, if you had a continuous target variable, then you would want to train regressors, such as the linear regression model, k-nearest neighbor, or Support Vector Machine (SVM). If you would like to draw some insights from data by using unsupervised learning, you would want to use k-means clustering or mean shift algorithms:

 Illustrations of ML problems. Left: classification, middle: regression, right: clustering 

Lastly, we will have to think about how we test and evaluate the performance of the learning algorithms we tried. Splitting your dataset into train and test sets and running cross-validation are the two most commonly used methods of testing and comparing ML models. The purpose of splitting a dataset into two subsets, one for training and one for testing, is to train a model on the train set without exposing it to the test set, so that prediction results on the test set are indicative of the general model performance on unseen data. K-fold cross-validation is another way to evaluate model performance. It first splits a dataset into K equally sized subsets, then leaves one fold out for testing and trains on the rest, rotating the held-out fold. For example, in 3-fold cross-validation, a dataset will first be split into three equally sized subsets. In the first iteration, we will use folds #1 and #2 to train our model and test it on fold #3. In the second iteration, we will use folds #1 and #3 to train and test our model on fold #2. In the third iteration, we will use folds #2 and #3 to train and test our model on fold #1. Then, we will average the performance measures to estimate the model performance:
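The fold rotation described above is just index bookkeeping. A sketch (the model training itself is omitted, and assigning every k-th record to a fold assumes the data has already been shuffled):

```csharp
using System;
using System.Linq;

class KFoldSketch
{
    static void Main()
    {
        int n = 9, k = 3;
        int[] indices = Enumerable.Range(0, n).ToArray();

        for (int fold = 0; fold < k; fold++)
        {
            // Every k-th index belongs to the held-out test fold
            int[] test = indices.Where(i => i % k == fold).ToArray();
            int[] train = indices.Where(i => i % k != fold).ToArray();

            Console.WriteLine(
                $"fold {fold}: train on {train.Length}, test on {test.Length}");
            // ... train a model on `train`, evaluate it on `test`,
            //     then average the k performance measures
        }
    }
}
```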

  • Improve results: By now you will have one or two candidate models that perform reasonably well, but there might be still some room to improve. Maybe you noticed your candidate models are overfitting to some extent, maybe they do not meet your target performance, or maybe you have some more time to iterate on your models—regardless of your intent, there are multiple ways that you can improve the performance of your model and they are as follows:
    • Hyperparameter tuning: You can tune the configurations of your models to potentially improve the performance results. For example, for random forest models, you can tune the maximum height of the tree or number of trees in the forest. For SVMs, you can tune the kernels or cost values.
    • Ensemble methods: Ensembling combines the results of multiple models to get better results. Bagging trains the same algorithm on different random subsets of your dataset, boosting trains models sequentially so that each new model focuses on the examples the previous models predicted poorly, and stacking feeds the outputs of the sub-models into a meta model that learns how to combine their results.
    • More feature engineering: Iterating on feature engineering is another way to improve model performance.
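Hyperparameter tuning in its simplest form is a loop over candidate settings that keeps the best-scoring one. A sketch; `TrainAndScore` is a hypothetical stand-in for training a model with that setting and cross-validating it:

```csharp
using System;

class GridSearchSketch
{
    // Hypothetical helper: trains a random forest with the given number
    // of trees and returns a cross-validated score (here a toy formula
    // that happens to peak at 100 trees)
    static double TrainAndScore(int numberOfTrees)
    {
        return 1.0 - Math.Abs(numberOfTrees - 100) / 1000.0;
    }

    static void Main()
    {
        int[] candidates = { 10, 50, 100, 200 };
        int best = candidates[0];
        double bestScore = double.MinValue;

        foreach (int trees in candidates)
        {
            double score = TrainAndScore(trees);
            if (score > bestScore) { bestScore = score; best = trees; }
        }
        Console.WriteLine($"best number of trees = {best}");
    }
}
```

Real tuning is usually done over a grid or random sample of several hyperparameters at once, with the score coming from cross-validation so that the chosen setting generalizes.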
  • Deploy: Time to put your models into action! Once you have your models ready, it is time to let them run in production. Make sure you test extensively before your models take full charge. It will also be beneficial to plan to develop monitoring tools for your models, since model performance can decrease over time as the input data evolves.

Setting up a C# environment for ML

Now that we have discussed the steps and approaches to building ML models that we will follow throughout this book, let's start setting up our C# environment for ML. We will first install and set up Visual Studio and then two packages (Accord.NET and Deedle) that we will frequently use for our projects in the following chapters.

Setting up Visual Studio for C#

Assuming you have some prior knowledge of C#, we will keep this part brief. In case you need to install Visual Studio for C#, go to https://www.visualstudio.com/downloads/ and download one of the versions of Visual Studio. In this book, we use the Community Edition of Visual Studio 2017. If it prompts you to download .NET Framework before you install Visual Studio, go to https://www.microsoft.com/en-us/download/details.aspx?id=53344 and install it first.

Installing Accord.NET

Accord.NET is a .NET ML framework. On top of ML packages, the Accord.NET framework also has mathematics, statistics, computer vision, computer audition, and other scientific computing modules. We are mainly going to use the ML package of the Accord.NET framework. 

Once you have installed and set up your Visual Studio, let's start installing the ML framework for C#, Accord.NET. It is easiest to install it through NuGet. To install it, open the package manager (Tools | NuGet Package Manager | Package Manager Console) and install Accord.MachineLearning and Accord.Controls by typing in the following commands:

PM> Install-Package Accord.MachineLearning
PM> Install-Package Accord.Controls

Now, let's build a sample ML application using these Accord.NET packages. Open your Visual Studio and create a new Console Application under the Visual C# category. Use the preceding commands to install those Accord.NET packages through NuGet and add references to our project. You should see some Accord.NET packages added to your references in your Solutions Explorer and the result should look something like the following screenshot:

The model we are going to build now is a very simple logistic regression model. Given two-dimensional arrays and an expected output, we are going to develop a program that trains a logistic regression classifier and then plot the results showing the expected output and the actual predictions by this model. The input and output for this model look like the following:

The code for this sample logistic regression classifier is as follows:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

using Accord.Controls;
using Accord.Statistics;
using Accord.Statistics.Models.Regression;
using Accord.Statistics.Models.Regression.Fitting;

namespace SampleAccordNETApp
{
    class Program
    {
        static void Main(string[] args)
        {
            double[][] inputs =
            {
                new double[] { 0, 0 },
                new double[] { 0.25, 0.25 },
                new double[] { 0.5, 0.5 },
                new double[] { 1, 1 },
            };

            // Expected class label for each input point
            int[] outputs =
            {
                0, 0, 1, 1,
            };

            // Train a logistic regression model using
            // iteratively reweighted least squares
            var learner = new IterativeReweightedLeastSquares<LogisticRegression>()
            {
                MaxIterations = 100
            };
            var logit = learner.Learn(inputs, outputs);

            // Predict output
            bool[] predictions = logit.Decide(inputs);

            // Plot the results
            ScatterplotBox.Show("Expected Results", inputs, outputs);
            ScatterplotBox.Show("Actual Logistic Regression Output", inputs, predictions.ToZeroOne());

            // Keep the console open until a key is pressed
            Console.ReadKey();
        }
    }
}


Once you are done writing this code, you can run it by hitting F5 or clicking on the Start button on top. If everything runs smoothly, it should produce the two plots shown in the following figure. If it fails, check for references or typos. You can always right-click on the class name or the light bulb icon to make Visual Studio help you find which packages are missing from the namespace references:

 Plots produced by the sample program. Left: actual prediction results, right: expected output

This sample code can be found at the following link: https://github.com/yoonhwang/c-sharp-machine-learning/blob/master/ch.1/SampleAccordNETApp.cs.

Installing Deedle

Deedle is an open source .NET library for data frame programming. Deedle lets you do data manipulation in a way that is similar to R data frames and pandas data frames in Python. We will be using this package to load and manipulate the data for our ML projects in the following chapters.

Similar to how we installed Accord.NET, we can install the Deedle package from NuGet. Open the package manager (Tools | NuGet Package Manager | Package Manager Console) and install Deedle using the following command:

PM> Install-Package Deedle

Let's briefly look at how we can use this package to load data from a CSV file and do simple data manipulations. For more information, you can visit http://bluemountaincapital.github.io/Deedle/ for API documentation and sample code. We are going to use daily AAPL stock price data from 2010 to 2013 for this exercise. You can download this data from the following link: https://github.com/yoonhwang/c-sharp-machine-learning/blob/master/ch.1/table_aapl.csv.

Open your Visual Studio and create a new Console Application under the Visual C# category. Use the preceding command to install the Deedle library through NuGet and add references to your project. You should see the Deedle package added to your references in your Solutions Explorer.

Now, we are going to load the CSV data into a Deedle data frame and then do some data manipulations. First, we are going to update the index of the data frame with the Date field. Then, we are going to apply some arithmetic operations on the Open and Close columns to calculate the percentage changes from open to close prices. Lastly, we will calculate daily returns by taking the differences between the close and the previous close prices, dividing them by the previous close prices, and then multiplying it by 100. The code for this sample Deedle program is shown as follows:

using Deedle;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace DeedleApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Read AAPL stock prices from a CSV file
            var root = Directory.GetParent(Directory.GetCurrentDirectory()).Parent.FullName;
            var aaplData = Frame.ReadCsv(Path.Combine(root, "table_aapl.csv"));
            // Print the raw data
            Console.WriteLine("-- Raw Data --");
            aaplData.Print();

            // Set the Date field as the index
            var aapl = aaplData.IndexRows<String>("Date").SortRowsByKey();
            Console.WriteLine("-- After Indexing --");
            aapl.Print();

            // Calculate the percent change from open to close
            var openCloseChange =
                ((
                    aapl.GetColumn<double>("Close") - aapl.GetColumn<double>("Open")
                ) / aapl.GetColumn<double>("Open")) * 100.0;
            aapl.AddColumn("openCloseChange", openCloseChange);
            Console.WriteLine("-- Simple Arithmetic Operations --");
            aapl.Print();

            // Shift close prices by one row and calculate daily returns:
            // (close - previous close) / previous close * 100
            var previousClose = aapl.Shift(1).GetColumn<double>("Close");
            var dailyReturn = aapl.Diff(1).GetColumn<double>("Close") / previousClose * 100.0;
            aapl.AddColumn("dailyReturn", dailyReturn);
            Console.WriteLine("-- Shift --");
            aapl.Print();
        }
    }
}


When you run this code, you will see the following outputs.

The raw dataset looks like the following:

After indexing this dataset with the date field, you will see the following:

After applying simple arithmetic operations to compute the change rate from open to close, you will see the following:

Finally, after shifting close prices by one row and computing daily returns, you will see the following:

As you can see from this sample Deedle project, we can run various data manipulation operations with one or two lines of code, where it would have required more lines of code to apply the same operations using native C#. We will use the Deedle library frequently throughout this book for data manipulation and feature engineering.

This sample Deedle code can be found at the following link: https://github.com/yoonhwang/c-sharp-machine-learning/blob/master/ch.1/DeedleApp.cs.



Summary

In this chapter, we briefly discussed some key ML tasks and real-life examples of ML applications. We also learned the steps for developing ML models and the common challenges and tasks in each step. We are going to follow these steps as we work through our projects in the following chapters and we will explore certain steps in more detail, especially for feature engineering, model selection, and model performance evaluations. We will discuss the various techniques we can apply in each step depending on the types of problems we are solving. Lastly, in this chapter, we walked you through how to set up a C# environment for our future ML projects. We built a simple logistic regression classifier using the Accord.NET framework and used the Deedle library to load and manipulate the data.

In the next chapter, we are going to dive straight into applying the fundamentals of ML, which we covered in this chapter, to build a ML model for spam email filtering. We will follow the steps for building ML models that we discussed in this chapter to transform raw email data into a structured dataset, analyze the email text data to draw some insights, and then finally build classification models that predict whether an email is a spam or not. We will also discuss some commonly used model evaluation metrics for classification models in the next chapter.

About the Author

  • Yoon Hyup Hwang

    Yoon Hyup Hwang is a seasoned data scientist in the marketing and financial sectors with expertise in predictive modeling, machine learning, statistical analysis, and data engineering. He has 8+ years' experience of building numerous machine learning models and data products using Python and R. He holds an MSE in computer and information technology from the University of Pennsylvania and a BA in economics from the University of Chicago. In his spare time, he enjoys practicing various martial arts, snowboarding, and roasting coffee. Born and raised in Busan, South Korea, he currently works in New York and lives in New Jersey with his artist wife, Sunyoung, and a playful dog, Dali (named after Salvador Dali).

