Mastering Java Machine Learning

By Dr. Uday Kamath , Krishna Choppella

About this book

Java is one of the main languages used by practicing data scientists; much of the Hadoop ecosystem is Java-based, and it is certainly the language that most production systems in Data Science are written in. If you know Java, Mastering Java Machine Learning is your next step on the path to becoming an advanced practitioner in Data Science.

This book aims to introduce you to an array of advanced techniques in machine learning, including classification, clustering, anomaly detection, stream learning, active learning, semi-supervised learning, probabilistic graph modeling, text mining, deep learning, and big data batch and stream machine learning. Accompanying each chapter are illustrative examples and real-world case studies that show how to apply the newly learned techniques using sound methodologies and the best Java-based tools available today.

On completing this book, you will have an understanding of the tools and techniques for building powerful machine learning models to solve data science problems in just about any domain.

Publication date:
July 2017
Publisher
Packt
Pages
556
ISBN
9781785880513

 

Chapter 1. Machine Learning Review

Recent years have seen the revival of artificial intelligence (AI), and machine learning in particular, both in academic circles and in industry. In the last decade, AI has seen dramatic successes of the kind that eluded practitioners during the long period of relative decline that followed the original promise of the field.

What made these successes possible, in large part, was the impetus provided by the need to process the prodigious amounts of ever-growing data, key algorithmic advances by dogged researchers in deep learning, and the inexorable increase in raw computational power driven by Moore's Law. Among the areas of AI leading the resurgence, machine learning has seen spectacular developments, and continues to find the widest applicability in an array of domains. The use of machine learning to help in complex decision making at the highest levels of business and, at the same time, its enormous success in improving the accuracy of what are now everyday applications, such as searches, speech recognition, and personal assistants on mobile phones, have made its effects commonplace in the family room and the board room alike. Articles breathlessly extolling the power of deep learning can be found today not only in the popular science and technology press but also in mainstream outlets such as The New York Times and The Huffington Post. Machine learning has indeed become ubiquitous in a relatively short time.

An ordinary user encounters machine learning in many ways in their day-to-day activities. Most e-mail providers, including Yahoo and Gmail, give the user automated sorting and categorization of e-mails into headings such as Spam, Junk, Promotions, and so on, which is made possible using text mining, a branch of machine learning. When shopping online for products on e-commerce websites, such as https://www.amazon.com/, or watching movies from content providers, such as Netflix, one is offered recommendations for other products and content by so-called recommender systems, another branch of machine learning, as an effective way to retain customers.

Forecasting the weather, estimating real estate prices, predicting voter turnout, and even election results—all use some form of machine learning to see into the future, as it were.

The ever-growing availability of data and the promise of systems that can enrich our lives by learning from that data place a growing demand on the skills of the limited workforce of professionals in the field of data science. This demand is particularly acute for well-trained experts who know their way around the landscape of machine learning techniques in the more popular languages, such as Java, Python, R, and increasingly, Scala. Fortunately, thanks to the thousands of contributors in the open source community, each of these languages has a rich and rapidly growing set of libraries, frameworks, and tutorials that make state-of-the-art techniques accessible to anyone with an internet connection and a computer, for the most part. Java is an important vehicle for this spread of tools and technology, especially in large-scale machine learning projects, owing to its maturity and stability in enterprise-level deployments and the portable JVM platform, not to mention the legions of professional programmers who have adopted it over the years. Consequently, mastery of the skills so lacking in the workforce today will put any aspiring professional with a desire to enter the field at a distinct advantage in the marketplace.

Perhaps you already apply machine learning techniques in your professional work, or maybe you simply have a hobbyist's interest in the subject. If you're reading this, it's likely you can already bend Java to your will, no problem, but now you feel you're ready to dig deeper and learn how to use the best-of-breed open source ML Java frameworks in your next data science project. If that is indeed you, how fortuitous is it that the chapters in this book are designed to do all that and more!

Mastery of a subject, especially one that has such obvious applicability as machine learning, requires more than an understanding of its core concepts and familiarity with its mathematical underpinnings. Unlike an introductory treatment of the subject, a book that purports to help you master the subject must be heavily focused on practical aspects in addition to introducing more advanced topics that would have stretched the scope of the introductory material. To warm up before we embark on sharpening our skills, we will devote this chapter to a quick review of what we already know. For the ambitious novice with little or no prior exposure to the subject (who is nevertheless determined to get the fullest benefit from this book), here's our advice: make sure you do not skip the rest of this chapter; instead, use it as a springboard to explore unfamiliar concepts in more depth. Seek out external resources as necessary. Wikipedia them. Then jump right back in.

For the rest of this chapter, we will review the following:

  • History and definitions

  • What is not machine learning?

  • Concepts and terminology

  • Important branches of machine learning

  • Different data types in machine learning

  • Applications of machine learning

  • Issues faced in machine learning

  • The meta-process used in most machine learning projects

  • Information on some well-known tools, APIs, and resources that we will employ in this book

 

Machine learning – history and definition


It is difficult to give an exact history, but the ideas behind machine learning as we define it today can be traced back as early as the 1630s. In his Discourse on the Method (1637), Rene Descartes refers to automata and says:

For we can easily understand a machine's being constituted so that it can utter words, and even emit some responses to action on it of a corporeal kind, which brings about a change in its organs; for instance, if touched in a particular part it may ask what we wish to say to it; if in another part it may exclaim that it is being hurt, and so on.

Alan Turing, in his famous publication Computing Machinery and Intelligence, gives basic insights into the goals of machine learning by asking the question "Can machines think?"

Arthur Samuel in 1959 wrote, "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."

Tom Mitchell, in 1997, gave a more exact definition of machine learning: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."

Machine learning has a relationship with several areas:

  • Statistics: It uses the elements of data sampling, estimation, hypothesis testing, learning theory, and statistical-based modeling, to name a few

  • Algorithms and computation: It uses the basic concepts of search, traversal, parallelization, distributed computing, and so on from basic computer science

  • Database and knowledge discovery: For its ability to store, retrieve, and access information in various formats

  • Pattern recognition: For its ability to find interesting patterns from the data to explore, visualize, and predict

  • Artificial intelligence: Though it is considered a branch of artificial intelligence, it also has relationships with other branches, such as heuristics, optimization, evolutionary computing, and so on

 

What is not machine learning?


It is important to recognize areas that share a connection with machine learning but cannot themselves be considered part of machine learning. Some disciplines may overlap to a smaller or larger extent, yet the principles underlying machine learning are quite distinct:

  • Business intelligence (BI) and reporting: Reporting key performance indicators (KPIs), querying OLAP for slicing, dicing, and drilling into the data, dashboards, and so on that form the central components of BI are not machine learning.

  • Storage and ETL: Data storage and ETL are key elements in any machine learning process, but, by themselves, they don't qualify as machine learning.

  • Information retrieval, search, and queries: The ability to retrieve data or documents based on search criteria or indexes, which forms the basis of information retrieval, is not really machine learning. Many forms of machine learning, such as semi-supervised learning, can rely on searching for similar data during modeling, but that doesn't qualify searching as machine learning.

  • Knowledge representation and reasoning: Representing knowledge for performing complex tasks, such as ontology, expert systems, and semantic webs, does not qualify as machine learning.

 

Machine learning – concepts and terminology


In this section, we will describe the different concepts and terms normally used in machine learning:

  • Data or dataset: The basics of machine learning rely on understanding the data. The data or dataset normally refers to content available in structured or unstructured format for use in machine learning. Structured datasets have specific formats, whereas an unstructured dataset is normally in the form of free-flowing text. Data can be available in various storage types or formats. In structured data, every element, known as an instance, an example, or a row, follows a predefined structure. Data can also be categorized by size: small or medium data have a few hundred to a few thousand instances, whereas big data refers to a large volume, mostly in the millions or billions of instances, that cannot be stored or accessed using common devices or fit in the memory of such devices.

  • Features, attributes, variables, or dimensions: In structured datasets, as mentioned before, there are predefined elements with their own semantics and data type, which are known variously as features, attributes, metrics, indicators, variables, or dimensions.

  • Data types: The features defined earlier need some form of typing in many machine learning algorithms or techniques. The most commonly used data types are as follows:

    • Categorical or nominal: This indicates well-defined categories or values present in the dataset. For example, eye color—black, blue, brown, green, grey; document content type—text, image, video.

    • Continuous or numeric: This indicates a numeric nature of the data field. For example, a person's weight measured by a bathroom scale, the temperature reading from a sensor, or the monthly balance in dollars on a credit card account.

    • Ordinal: This denotes data that can be ordered in some way. For example, garment size—small, medium, large; boxing weight classes: heavyweight, light heavyweight, middleweight, lightweight, and bantamweight.

  • Target or label: A feature or set of features in the dataset, which is used for learning from training data and predicting in an unseen dataset, is known as a target or a label. The term "ground truth" is also used in some domains. A label can have any form as specified before, that is, categorical, continuous, or ordinal.

  • Machine learning model: Each machine learning algorithm, based on what it learned from the dataset, maintains the state of its learning for predicting or giving insights into future or unseen data. This is referred to as the machine learning model.

  • Sampling: Data sampling is an essential step in machine learning. Sampling means choosing a subset of examples from a population with the intent of treating the behavior seen in the (smaller) sample as being representative of the behavior of the (larger) population. In order for the sample to be representative of the population, care must be taken in the way the sample is chosen. Generally, a population consists of every object sharing the properties of interest in the problem domain, for example, all people eligible to vote in the general election, or all potential automobile owners in the next four years. Since it is usually prohibitive (or impossible) to collect data for all the objects in a population, a well-chosen subset is selected for the purpose of analysis. A crucial consideration in the sampling process is that the sample is unbiased with respect to the population. The following are types of probability-based sampling:

    • Uniform random sampling: This refers to sampling in which each object in the population has an equal probability of being chosen.

    • Stratified random sampling: This refers to the sampling method used when the data can be categorized into multiple classes. In such cases, in order to ensure all categories are represented in the sample, the population is divided into distinct strata based on these classifications, and each stratum is sampled in proportion to the fraction of its class in the overall population. Stratified sampling is common when the population density varies across categories, and it is important to compare these categories with the same statistical power. Political polling often involves stratified sampling when it is known that different demographic groups vote in significantly different ways. Disproportional representation of each group in a random sample can lead to large errors in the outcomes of the polls. When we control for demographics, we can avoid oversampling the majority over the other groups.

    • Cluster sampling: Sometimes there are natural groups among the population being studied, and each group is representative of the whole population. An example is data that spans many geographical regions. In cluster sampling, you take a random subset of the groups followed by a random sample from within each of those groups to construct the full data sample. This kind of sampling can reduce the cost of data collection without compromising the fidelity of distribution in the population.

    • Systematic sampling: Systematic or interval sampling is used when there is a certain ordering present in the sampling frame (a finite set of objects treated as the population and taken to be the source of data for sampling, for example, the corpus of Wikipedia articles, arranged lexicographically by title). If the sample is then selected by starting at a random object and selecting every k-th object after it, that is called systematic sampling. The value of k is calculated as the ratio of the population size to the sample size.

  • Model evaluation metrics: Evaluating models for performance is generally based on different evaluation metrics for different types of learning. In classification, it is generally based on accuracy, receiver operating characteristic (ROC) curves, training speed, memory requirements, and false positive rate, to name a few (see Chapter 2, Practical Approach to Real-World Supervised Learning). In clustering, the number of clusters found, cohesion, and separation form the general metrics (see Chapter 3, Unsupervised Machine Learning Techniques). In stream-based learning, apart from the standard metrics mentioned earlier, adaptability, speed of learning, and robustness to sudden changes are some of the conventional metrics for evaluating the performance of the learner (see Chapter 5, Real-Time Stream Machine Learning).
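
The probability-based sampling schemes described above can be sketched in a few lines of plain Java. This is a minimal illustration rather than a library-grade implementation: the class and method names are hypothetical, the systematic sampler assumes the random start is drawn from the first k positions and uses integer division for k, and the stratified sampler assumes the strata have already been formed and simply draws the same fraction from each.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper class; names are illustrative only.
final class SamplingSketch {

    // Uniform random sampling without replacement: every object is equally likely to be chosen.
    static <T> List<T> uniformSample(List<T> population, int sampleSize, Random rng) {
        List<T> shuffled = new ArrayList<>(population);
        Collections.shuffle(shuffled, rng);
        int n = Math.max(0, Math.min(sampleSize, shuffled.size()));
        return new ArrayList<>(shuffled.subList(0, n));
    }

    // Systematic sampling: pick a random start in [0, k) and then take every k-th object,
    // where k = population size / sample size (integer division is an assumption of this sketch).
    static <T> List<T> systematicSample(List<T> frame, int sampleSize, Random rng) {
        List<T> sample = new ArrayList<>();
        if (sampleSize <= 0 || frame.isEmpty()) {
            return sample;
        }
        int k = Math.max(1, frame.size() / sampleSize);
        int start = rng.nextInt(k);
        for (int i = start; i < frame.size() && sample.size() < sampleSize; i += k) {
            sample.add(frame.get(i));
        }
        return sample;
    }

    // Stratified sampling: draw the same fraction from every stratum, so each stratum
    // is represented in proportion to its share of the population.
    static <T> Map<String, List<T>> stratifiedSample(Map<String, List<T>> strata,
                                                     double fraction, Random rng) {
        Map<String, List<T>> sample = new LinkedHashMap<>();
        for (Map.Entry<String, List<T>> stratum : strata.entrySet()) {
            int n = (int) Math.round(stratum.getValue().size() * fraction);
            sample.put(stratum.getKey(), uniformSample(stratum.getValue(), n, rng));
        }
        return sample;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        List<Integer> population = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            population.add(i);
        }

        System.out.println("uniform:    " + uniformSample(population, 10, rng));
        System.out.println("systematic: " + systematicSample(population, 10, rng));

        // Two made-up strata: an 80-object majority and a 20-object minority.
        Map<String, List<Integer>> strata = new LinkedHashMap<>();
        strata.put("majority", population.subList(0, 80));
        strata.put("minority", population.subList(80, 100));
        stratifiedSample(strata, 0.1, rng)
            .forEach((name, s) -> System.out.println(name + ": " + s));
    }
}
```

With a 10% fraction, the stratified sampler keeps the 80/20 split of the population in the sample, which a small uniform sample is not guaranteed to do.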

To illustrate these concepts, a concrete example in the form of a commonly used sample weather dataset is given. The data gives a set of weather conditions and a label that indicates whether the subject decided to play a game of tennis on the day or not:

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

The dataset is in the format of an ARFF (attribute-relation file format) file. It consists of a header giving the information about features or attributes with their data types and actual comma-separated data following the data tag. The dataset has five features, namely outlook, temperature, humidity, windy, and play. The features outlook and windy are categorical features, while humidity and temperature are continuous. The feature play is the target and is categorical.
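
To make the header and data sections concrete, the following sketch reads the weather data above with a few lines of plain Java and counts the values of the target. It is not a full ARFF parser: it handles only the constructs that appear in this example (the attribute and data tags, nominal and numeric types, comma-separated rows), and it assumes the target is the last attribute. The class and method names are hypothetical; in practice one would normally rely on a library reader.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal reader for the weather example; not a complete ARFF implementation.
final class WeatherArffSketch {

    static final String ARFF = String.join("\n",
        "@relation weather",
        "@attribute outlook {sunny, overcast, rainy}",
        "@attribute temperature numeric",
        "@attribute humidity numeric",
        "@attribute windy {TRUE, FALSE}",
        "@attribute play {yes, no}",
        "@data",
        "sunny,85,85,FALSE,no", "sunny,80,90,TRUE,no", "overcast,83,86,FALSE,yes",
        "rainy,70,96,FALSE,yes", "rainy,68,80,FALSE,yes", "rainy,65,70,TRUE,no",
        "overcast,64,65,TRUE,yes", "sunny,72,95,FALSE,no", "sunny,69,70,FALSE,yes",
        "rainy,75,80,FALSE,yes", "sunny,75,70,TRUE,yes", "overcast,72,90,TRUE,yes",
        "overcast,81,75,FALSE,yes", "rainy,71,91,TRUE,no");

    static final class Dataset {
        final List<String> attributeNames = new ArrayList<>();
        final List<String> attributeTypes = new ArrayList<>();
        final List<String[]> rows = new ArrayList<>();
    }

    static Dataset parse(String arff) {
        Dataset ds = new Dataset();
        boolean inData = false;
        for (String raw : arff.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue; // skip blank lines and comment lines
            }
            String lower = line.toLowerCase();
            if (!inData && lower.startsWith("@attribute")) {
                String[] parts = line.split("\\s+", 3); // tag, name, type
                ds.attributeNames.add(parts[1]);
                // A brace-delimited value list marks a categorical (nominal) feature.
                ds.attributeTypes.add(parts[2].startsWith("{") ? "nominal" : parts[2]);
            } else if (lower.startsWith("@data")) {
                inData = true;
            } else if (inData) {
                ds.rows.add(line.split("\\s*,\\s*"));
            }
        }
        return ds;
    }

    // Counts the label values, assuming the target is the last attribute.
    static Map<String, Integer> labelCounts(Dataset ds) {
        int target = ds.attributeNames.size() - 1;
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] row : ds.rows) {
            counts.merge(row[target], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Dataset ds = parse(ARFF);
        for (int i = 0; i < ds.attributeNames.size(); i++) {
            System.out.println(ds.attributeNames.get(i) + " : " + ds.attributeTypes.get(i));
        }
        System.out.println(ds.rows.size() + " instances, label counts " + labelCounts(ds));
    }
}
```

For the 14 instances listed above, the target play takes the value yes nine times and no five times.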

 

Machine learning – types and subtypes


We will now explore different subtypes or branches of machine learning. Though the following list is not comprehensive, it covers the most well-known types:

  • Supervised learning: This is the most popular branch of machine learning, which is about learning from labeled data. If the data type of the label is categorical, it becomes a classification problem, and if numeric, it is known as a regression problem. For example, if the goal of using the dataset is the detection of fraud, which has categorical values of either true or false, we are dealing with a classification problem. If, on the other hand, the target is to predict the best price at which to list a home for sale, which is a numeric dollar value, the problem is one of regression. The following figure illustrates labeled data that warrants the use of classification techniques, such as logistic regression, that are suitable for linearly separable data, that is, when there exists a line that can cleanly separate the two classes. For higher-dimensional data that is linearly separable, one speaks of a separating hyperplane:

    Linearly separable data

    An example of a dataset that is not linearly separable.

    This type of problem calls for classification techniques, such as support vector machines.

  • Unsupervised learning: Understanding the data and exploring it for building machine learning models when the labels are not given is called unsupervised learning. Clustering, manifold learning, and outlier detection are techniques that are covered under this topic, which are dealt with in detail in Chapter 3, Unsupervised Machine Learning Techniques. Examples of problems that require unsupervised learning are many. Grouping customers according to their purchasing behavior is one example. In the case of biological data, tissue samples can be clustered based on similar gene expression values using unsupervised learning techniques.

    The following figure represents data with inherent structure that can be revealed as distinct clusters using an unsupervised learning technique, such as k-means:

    Clusters in data

    Different techniques are used to detect global outliers—examples that are anomalous with respect to the entire dataset, and local outliers—examples that are misfits in their neighborhood. In the following figure, the notion of local and global outliers is illustrated for a two-feature dataset:

    Local and global outliers

  • Semi-supervised learning: When the dataset has only some labeled data and a large amount of data that is not labeled, learning from such a dataset is called semi-supervised learning. When dealing with financial data with the goal of detecting fraud, for example, there may be a large amount of unlabeled data and only a small number of known fraud and non-fraud transactions. In such cases, semi-supervised learning may be applied.

  • Graph mining: Mining data represented as graph structures is known as graph mining. It is the basis of social network analysis and structure analysis in different bioinformatics, web mining, and community mining applications.

  • Probabilistic graph modeling and inferencing: Learning and exploiting conditional dependence structures present between features expressed as a graph-based model comes under the branch of probabilistic graph modeling. Bayesian networks and Markov random fields are two classes of such models.

  • Time-series forecasting: This refers to a form of learning where data has distinct temporal behavior and the relationship with time is modeled. A common example is in financial forecasting, where the performance of stocks in a certain sector may be the target of the predictive model.

  • Association analysis: This is a form of learning where data is in the form of an item set or market basket, and association rules are modeled to explore and predict the relationships between the items. A common example in association analysis is to learn the relationships between the most common items bought by customers when they visit the grocery store.

  • Reinforcement learning: This is a form of learning where machines learn to maximize performance based on feedback in the form of rewards or penalties received from the environment. A recent example that famously used reinforcement learning, among other techniques, was AlphaGo, the machine developed by Google DeepMind that decisively beat the World Go Champion Lee Sedol in March 2016. The model first trained on millions of board positions in a supervised learning stage, then played against itself in a reinforcement learning stage, guided by a reward and penalty scheme, to ultimately become good enough to triumph over the best human player.

  • Stream learning or incremental learning: Learning in a supervised, unsupervised, or semi-supervised manner from stream data in real time or pseudo-real time is called stream or incremental learning. Learning the behaviors of sensors from different types of industrial systems for categorizing into normal and abnormal is an application that needs real-time feeds and real-time detection.
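
The notion of linear separability mentioned under supervised learning can be checked with a very small experiment. The sketch below uses a perceptron, a simpler linear classifier than the logistic regression named in the text, chosen here only because it is known to reach zero training errors whenever a separating line exists. The two toy datasets, the class name, and the epoch limit are all assumptions made for illustration.

```java
// Hypothetical illustration of linear separability using a perceptron
// (a stand-in for the logistic regression mentioned in the text).
final class PerceptronSketch {

    // Learns weights {bias, w1, ..., wd} with the perceptron update rule; labels must be +1 or -1.
    static double[] train(double[][] x, int[] y, int maxEpochs) {
        double[] w = new double[x[0].length + 1];
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            int mistakes = 0;
            for (int i = 0; i < x.length; i++) {
                if (predict(w, x[i]) != y[i]) {
                    // Move the decision boundary toward the misclassified example.
                    w[0] += y[i];
                    for (int j = 0; j < x[i].length; j++) {
                        w[j + 1] += y[i] * x[i][j];
                    }
                    mistakes++;
                }
            }
            if (mistakes == 0) {
                break; // every example is on the correct side: a separating line was found
            }
        }
        return w;
    }

    // Classifies by the sign of the linear score bias + w . features.
    static int predict(double[] w, double[] features) {
        double score = w[0];
        for (int j = 0; j < features.length; j++) {
            score += w[j + 1] * features[j];
        }
        return score > 0 ? 1 : -1;
    }

    static int trainingErrors(double[] w, double[][] x, int[] y) {
        int errors = 0;
        for (int i = 0; i < x.length; i++) {
            if (predict(w, x[i]) != y[i]) {
                errors++;
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Two well-separated groups of points: linearly separable.
        double[][] separable = {{1, 1}, {2, 1}, {1, 2}, {4, 4}, {5, 4}, {4, 5}};
        int[] separableLabels = {-1, -1, -1, 1, 1, 1};

        // XOR-style layout: no single line separates the two classes.
        double[][] xor = {{0, 0}, {1, 1}, {0, 1}, {1, 0}};
        int[] xorLabels = {-1, -1, 1, 1};

        double[] w1 = train(separable, separableLabels, 1000);
        double[] w2 = train(xor, xorLabels, 1000);
        System.out.println("errors on separable data: " + trainingErrors(w1, separable, separableLabels));
        System.out.println("errors on XOR data:       " + trainingErrors(w2, xor, xorLabels));
    }
}
```

On the separable data the training loop stops early with no errors, whereas on the XOR-style data at least one example always remains misclassified, which is the situation that calls for a nonlinear technique.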

 

Datasets used in machine learning


To learn from data, we must be able to understand and manage data in all forms. Data originates from many different sources, and consequently, datasets may differ widely in structure or have little or no structure at all. In this section, we present a high-level classification of datasets with commonly occurring examples.

Based on their structure, or the lack thereof, datasets may be classified as containing the following:

  • Structured data: Datasets with structured data are more amenable to being used as input to most machine learning algorithms. The data is in the form of records or rows following a well-known format with features that are either columns in a table or fields delimited by separators or tokens. There is no explicit relationship between the records or instances. The dataset is available chiefly in flat files or relational databases. The records of financial transactions at a bank shown in the following figure are an example of structured data:

    Financial card transactional data with labels of fraud

  • Transaction or market data: This is a special form of structured data where each entry corresponds to a collection of items. Examples of market datasets are the lists of grocery items purchased by different customers or movies viewed by customers, as shown in the following table:

    Market dataset for items bought from grocery store

  • Unstructured data: Unstructured data is normally not available in well-known formats, unlike structured data. Text data, image, and video data are different formats of unstructured data. Usually, a transformation of some form is needed to extract features from these forms of data into a structured dataset so that traditional machine learning algorithms can be applied.

    Sample text data, with no discernible structure, hence unstructured. Separating spam from normal messages (ham) is a binary classification problem. Here, the positive examples (spam) and the negative examples (ham) are distinguished by their labels, the second token in each instance of data. SMS Spam Collection Dataset (UCI Machine Learning Repository), source: Tiago A. Almeida from the Federal University of Sao Carlos.

  • Sequential data: Sequential data have an explicit notion of "order" to them. The order can be some relationship between features and a time variable in time series data, or it can be symbols repeating in some form in genomic datasets. Two examples of sequential data are weather data and genomic sequence data. The following figure shows the relationship between time and the sensor level for weather:

    Time series from sensor data

    Three genomic sequences are taken into consideration to show the repetition of the sequences CGGGT and TTGAAAGTGGTG in all three genomic sequences:

    Genomic sequences of DNA as a sequence of symbols.

  • Graph data: Graph data is characterized by the presence of relationships between entities in the data to form a graph structure. Graph datasets may be in a structured record format or an unstructured format. Typically, the graph relationship has to be mined from the dataset. Claims in the insurance domain can be considered structured records containing relevant claim details with claimants related through addresses, phone numbers, and so on. This can be viewed in a graph structure. Using the World Wide Web as an example, we have web pages available as unstructured data containing links, and graphs of relationships between web pages that can be built using web links, producing some of the most extensively mined graph datasets today:

    Insurance claim data, converted into a graph structure showing the relationship between vehicles, drivers, policies, and addresses
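
As an illustration of the transaction or market data described above, each entry can be represented as a set of items, and simple questions about co-occurring items reduce to counting. The sketch below computes two measures that are standard in association analysis but are not defined in this chapter: support (the fraction of transactions containing an itemset) and confidence (how often the right-hand side appears when the left-hand side does). The grocery baskets are made up for the example, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of market-basket data; the baskets below are invented for illustration.
final class MarketBasketSketch {

    // support(X) = fraction of transactions that contain every item in X.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    // confidence(A -> B) = support(A union B) / support(A).
    static double confidence(List<Set<String>> transactions, Set<String> lhs, Set<String> rhs) {
        Set<String> both = new HashSet<>(lhs);
        both.addAll(rhs);
        return support(transactions, both) / support(transactions, lhs);
    }

    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    static List<Set<String>> sampleTransactions() {
        List<Set<String>> transactions = new ArrayList<>();
        transactions.add(basket("bread", "milk"));
        transactions.add(basket("bread", "diapers", "beer", "eggs"));
        transactions.add(basket("milk", "diapers", "beer", "cola"));
        transactions.add(basket("bread", "milk", "diapers", "beer"));
        transactions.add(basket("bread", "milk", "diapers", "cola"));
        return transactions;
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = sampleTransactions();
        System.out.println("support({diapers, beer}) = "
            + support(transactions, basket("diapers", "beer")));
        System.out.println("confidence(diapers -> beer) = "
            + confidence(transactions, basket("diapers"), basket("beer")));
    }
}
```

In these five baskets, diapers and beer appear together in three, and diapers appear in four, so the rule "diapers implies beer" has a support of 3/5 and a confidence of 3/4.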

 

Machine learning applications


Given the rapidly growing use of machine learning in diverse areas of human endeavor, any attempt to list typical applications in the different industries where some form of machine learning is in use must necessarily be incomplete. Nevertheless, in this section, we list a broad set of machine learning applications by domain and the type of learning employed:

| Domain/Industry | Applications | Machine Learning Type |
|---|---|---|
| Financial | Credit risk scoring, fraud detection, and anti-money laundering | Supervised, unsupervised, graph models, time series, and stream learning |
| Web | Online campaigns, health monitoring, and ad targeting | Supervised, unsupervised, semi-supervised |
| Healthcare | Evidence-based medicine, epidemiological surveillance, drug events prediction, and claim fraud detection | Supervised, unsupervised, graph models, time series, and stream learning |
| Internet of things (IoT) | Cyber security, smart roads, and sensor health monitoring | Supervised, unsupervised, semi-supervised, and stream learning |
| Environment | Weather forecasting, pollution modeling, and water quality measurement | Time series, supervised, unsupervised, semi-supervised, and stream learning |
| Retail | Inventory, customer management and recommendations, layout, and forecasting | Time series, supervised, unsupervised, semi-supervised, and stream learning |

Applications of machine learning

 

Practical issues in machine learning


It is necessary to appreciate the nature of the constraints and potentially sub-optimal conditions one may face when dealing with problems requiring machine learning. An understanding of the nature of these issues, the impact of their presence, and the methods to deal with them will be addressed throughout the discussions in the coming chapters. Here, we present a brief introduction to the practical issues that confront us:

  • Data quality and noise: Missing values, duplicate values, incorrect values due to human or instrument recording error, and incorrect formatting are some of the important issues to be considered while building machine learning models. Not addressing data quality can result in incorrect or incomplete models. In the next chapter, we will highlight some of these issues and some strategies to overcome them through data cleansing.

  • Imbalanced datasets: In many real-world datasets, there is an imbalance among labels in the training data. This imbalance in a dataset affects the choice of learning, the process of selecting algorithms, and model evaluation and verification. If the right techniques are not employed, the models can suffer large biases, and the learning is not effective. Detailed in the next few chapters are various techniques that use meta-learning processes, such as cost-sensitive learning, ensemble learning, outlier detection, and so on, which can be employed in these situations.

  • Data volume, velocity, and scalability: Often, a large volume of data exists in raw form or as real-time streaming data at high speed. Learning from the entire data becomes infeasible either due to constraints inherent to the algorithms or hardware limitations, or combinations thereof. In order to reduce the size of the dataset to fit the resources available, data sampling must be done. Sampling can be done in many ways, and each form of sampling introduces a bias. Validating the models against sample bias must be performed by employing various techniques, such as stratified sampling, varying sample sizes, and increasing the size of experiments on different sets. Using big data machine learning can also overcome the volume and sampling biases.

  • Overfitting: One of the core problems in predictive models is that the model is not generalized enough and is made to fit the given training data too well. This results in poor performance of the model when applied to unseen data. There are various techniques described in later chapters to overcome these issues.

  • Curse of dimensionality: When dealing with high-dimensional data, that is, datasets with a large number of features, scalability of machine learning algorithms becomes a serious concern. One of the issues with adding more features to the data is that it introduces sparsity, that is, there are now fewer data points on average per unit volume of feature space unless an increase in the number of features is accompanied by an exponential increase in the number of training examples. This can hamper performance in many methods, such as distance-based algorithms. Adding more features can also deteriorate the predictive power of learners, as illustrated in the following figure. In such cases, a more suitable algorithm is needed, or the dimensionality of the data must be reduced.

    Curse of dimensionality illustrated in classification learning, where adding more features deteriorates classifier performance.
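
The sparsity argument made under the curse of dimensionality can be worked through with simple arithmetic. Assuming, purely for illustration, that each feature is divided into the same number of bins, the feature space contains bins^d cells for d features, so a fixed number of training examples is spread over exponentially more cells as d grows. The class name, the choice of 10 bins, and the 1,000 examples are assumptions of this sketch.

```java
// Hypothetical arithmetic sketch of sparsity in high-dimensional feature spaces.
final class SparsitySketch {

    // Number of grid cells when each of numFeatures features is split into binsPerFeature bins.
    static double cells(int binsPerFeature, int numFeatures) {
        return Math.pow(binsPerFeature, numFeatures);
    }

    // Average number of examples that fall into one cell of the grid.
    static double averagePointsPerCell(long numPoints, int binsPerFeature, int numFeatures) {
        return numPoints / cells(binsPerFeature, numFeatures);
    }

    // Examples needed to keep a given average density per cell as features are added.
    static double pointsNeeded(double targetPerCell, int binsPerFeature, int numFeatures) {
        return targetPerCell * cells(binsPerFeature, numFeatures);
    }

    public static void main(String[] args) {
        long numPoints = 1000;
        int bins = 10;
        for (int d : new int[] {1, 2, 3, 5, 10}) {
            System.out.printf("d=%d: %.0f cells, %.6f points per cell, %.0f points needed for 100 per cell%n",
                d, cells(bins, d), averagePointsPerCell(numPoints, bins, d), pointsNeeded(100, bins, d));
        }
    }
}
```

With 1,000 examples and 10 bins per feature, the average drops from 100 examples per cell with one feature to a single example per cell with three features, and keeping 100 per cell would require ten times as many examples for every feature added.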

 

Machine learning – roles and process


Any effort to apply machine learning to a large-sized problem requires the collaborative effort of a number of roles, each abiding by a set of systematic processes designed for rigor, efficiency, and robustness. The following roles and processes ensure that the goals of the endeavor are clearly defined at the outset and the correct methodologies are employed in data analysis, data sampling, model selection, deployment, and performance evaluation—all as part of a comprehensive framework for conducting analytics consistently and with repeatability.

Roles

Participants play specific parts in each step. These responsibilities are captured in the following four roles:

  • Business domain expert: A subject matter expert with knowledge of the problem domain

  • Data engineer: Involved in the collection, transformation, and cleaning of the data

  • Project manager: Overseer of the smooth running of the process

  • Data scientist or machine learning expert: Responsible for applying descriptive or predictive analytic techniques

Process

CRISP-DM (Cross-Industry Standard Process for Data Mining), referred to here as CRISP, is a well-known high-level process model that defines the analytics process. In this section, we have added some of our own extensions to the CRISP process that make it more comprehensive and better suited for analytics using machine learning. The entire iterative process is demonstrated in the following schematic figure. We will discuss each step of the process in detail in this section.

  • Identifying the business problem: Understanding the objectives and the end goals of the project or process is the first step. This is normally carried out by a business domain expert in conjunction with the project manager and machine learning expert. What are the end goals in terms of data availability, formats, specification, collection, ROI, business value, deliverables? All these questions are discussed in this phase of the process. Identifying the goals clearly, and in quantifiable terms where possible, such as dollar amount saved, finding a pre-defined number of anomalies or clusters, or predicting no more than a certain number of false positives, and so on, is an important objective of this phase.

  • Machine learning mapping: The next step is mapping the business problem to one or more machine learning types discussed in the preceding section. This step is generally carried out by the machine learning expert. In it, we determine whether we should use just one form of learning (for example, supervised, unsupervised, semi-supervised) or if a hybrid of forms is more suitable for the project.

  • Data collection: Obtaining the raw data in the agreed format and specification for processing follows next. This step is normally carried out by data engineers and may require handling some basic ETL steps.

  • Data quality analysis: In this step, we perform analysis on the data for missing values, duplicates, and so on, conduct basic statistical analysis on the categorical and continuous types, and similar tasks to evaluate the quality of data. Data engineers and data scientists can perform the tasks together.

  • Data sampling and transformation: Determining whether data needs to be divided into samples and performing data sampling of various sizes for training, validation, or testing—these are the tasks performed in this step. It consists of employing different sampling techniques, such as oversampling and random sampling of the training datasets for effective learning by the algorithms, especially when the data is highly imbalanced in the labels. The data scientist is involved in this task.

  • Feature analysis and selection: This is an iterative process combined with modeling in many tasks to make sure the features are analyzed for either their discriminating values or their effectiveness. It can involve finding new features, transforming existing features, handling the data quality issues mentioned earlier, selecting a subset of features, and so on ahead of the modeling process. The data scientist is normally assigned this task.

  • Machine learning modeling: This is an iterative process working on different algorithms based on data characteristics and learning types. It involves different steps, such as generating hypotheses, selecting algorithms, tuning parameters, and getting results from evaluation to find models that meet the criteria. The data scientist carries out this task.

  • Model evaluation: While this step is related to all the preceding steps to some degree, it is more closely linked to the business understanding phase and machine learning mapping phase. The evaluation criteria must map in some way to the business problem or the goal. Each problem/project has its own goal, whether that be improving true positives, reducing false positives, finding anomalous clusters or behaviors, or analyzing data for different clusters. Different techniques that implicitly or explicitly measure these targets are used based on learning techniques. Data scientists and business domain experts normally take part in this step.

  • Model selection and deployment: Based on the evaluation criteria, one or more models—independent or as an ensemble—are selected. The deployment of models normally needs to address several issues: runtime scalability measures, execution specifications of the environment, and audit information, to name a few. Audit information that captures the key parameters based on learning is an essential part of the process. It ensures that model performance can be tracked and compared to check for the deterioration and aging of the models. Saving key information, such as training data volumes, dates, data quality analysis, and so on, is independent of learning types. Supervised learning might involve saving the confusion matrix, true positive ratios, false positive ratios, area under the ROC curve, precision, recall, error rates, and so on. Unsupervised learning might involve clustering or outlier evaluation results, cluster statistics, and so on. This is the domain of the data scientist, as well as the project manager.

  • Model performance monitoring: This task involves periodically tracking the model performance in terms of the criteria it was evaluated against, such as the true positive rate, false positive rate, performance speed, memory allocation, and so on. It is imperative to measure how these metrics deviate between successive evaluations of the trained model. The deviations, together with the tolerance allowed for them, indicate when to repeat the process or retune the models as time progresses. The data scientist is responsible for this stage.
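The supervised-learning audit values named in the deployment step (precision, recall, false positive ratio, error rate) all follow from the four cells of a binary confusion matrix. The sketch below shows one way they might be computed; the ConfusionMatrix class, its method names, and the choice to return 0.0 for an empty denominator are assumptions made for illustration only.

```java
/** Illustrative audit metrics from a binary confusion matrix; names are hypothetical. */
final class ConfusionMatrix {
    final long tp; // true positives
    final long fp; // false positives
    final long tn; // true negatives
    final long fn; // false negatives

    ConfusionMatrix(long tp, long fp, long tn, long fn) {
        if (tp < 0 || fp < 0 || tn < 0 || fn < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    /** Fraction of predicted positives that are truly positive. */
    double precision() { return ratio(tp, tp + fp); }

    /** Fraction of actual positives that were found (the true positive rate). */
    double recall() { return ratio(tp, tp + fn); }

    /** Fraction of actual negatives wrongly flagged as positive. */
    double falsePositiveRate() { return ratio(fp, fp + tn); }

    /** Fraction of all predictions that were wrong. */
    double errorRate() { return ratio(fp + fn, tp + fp + tn + fn); }

    // Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    private static double ratio(long numerator, long denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    public static void main(String[] args) {
        ConfusionMatrix cm = new ConfusionMatrix(80, 10, 890, 20);
        System.out.printf("precision=%.3f recall=%.3f fpr=%.4f error=%.3f%n",
                cm.precision(), cm.recall(), cm.falsePositiveRate(), cm.errorRate());
    }
}
```

Storing the four raw counts alongside the derived ratios in the audit record makes it possible to recompute any of these metrics later when models are compared.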

As may be observed from the preceding diagram, the entire process is an iterative one. After a model or set of models has been deployed, business and environmental factors may change in ways that affect the performance of the solution, requiring a re-evaluation of business goals and success criteria. This takes us back through the cycle again.
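The monitoring step can be illustrated with a small check that compares the metrics recorded at deployment with those from a later evaluation and flags any that have moved by more than a tolerance. This is a sketch under stated assumptions: MetricDriftMonitor, the metric names, the example values, and the single shared absolute tolerance are hypothetical; a real project might use a separate tolerance per metric.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative drift check between two evaluations; all names are hypothetical. */
class MetricDriftMonitor {

    /**
     * Returns each metric whose absolute change from the baseline exceeds the
     * tolerance, mapped to the signed change (current minus baseline).
     */
    static Map<String, Double> deviations(Map<String, Double> baseline,
                                          Map<String, Double> current,
                                          double tolerance) {
        Map<String, Double> flagged = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : baseline.entrySet()) {
            Double now = current.get(entry.getKey());
            if (now == null) {
                continue; // metric was not measured in the current evaluation
            }
            double delta = now - entry.getValue();
            if (Math.abs(delta) > tolerance) {
                flagged.put(entry.getKey(), delta);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, Double> atDeployment = new LinkedHashMap<>();
        atDeployment.put("truePositiveRate", 0.80);
        atDeployment.put("falsePositiveRate", 0.010);

        Map<String, Double> thisMonth = new LinkedHashMap<>();
        thisMonth.put("truePositiveRate", 0.70);
        thisMonth.put("falsePositiveRate", 0.012);

        // Any flagged metric is a signal to consider retuning or retraining.
        System.out.println(deviations(atDeployment, thisMonth, 0.05));
    }
}
```

A non-empty result is the kind of signal that sends the project back through the cycle described above.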

 

Machine learning – tools and datasets


A sure way to master the techniques necessary to successfully complete a project of any size or complexity in machine learning is to familiarize yourself with the available tools and frameworks by performing experiments with widely-used datasets, as demonstrated in the chapters to follow. A short survey of the most popular Java frameworks is presented in the following list. Later chapters will include experiments that you will do using the following tools:

  • RapidMiner: A leading analytics platform, RapidMiner has multiple offerings, including Studio, a visual design framework for processes, Server, a product to facilitate a collaborative environment by enabling sharing of data sources, processes, and practices, and Radoop, a system with translations to enable deployment and execution on the Hadoop ecosystem. RapidMiner Cloud provides a cloud-based repository and on-demand computing power.

  • Weka: This is a comprehensive open source Java toolset for data mining and building machine learning applications with its own collection of publicly available datasets.

  • Knime: KNIME (we are urged to pronounce it with a silent k, as "naime") Analytics Platform is written in Java and offers an integrated toolset, a rich set of algorithms, and a visual workflow to do analytics without the need for standard programming languages, such as Java, Python, and R. However, one can write scripts in Java and other languages to implement functionality not available natively in KNIME.

  • Mallet: This is a Java library for NLP. It offers document classification, sequence tagging, topic modeling, and other text-based applications of machine learning, as well as an API for task pipelines.

  • Elki: This is research-oriented Java software primarily focused on data mining with unsupervised algorithms. It achieves high performance and scalability using data index structures that improve access performance of multi-dimensional data.

  • JCLAL: This is a Java Class Library for Active Learning, and is an open source framework for developing Active Learning methods, one of the areas that deal with learning predictive models from a mix of labeled and unlabeled data (semi-supervised learning is another).

  • KEEL: This is open source software written in Java for designing experiments, primarily suited to implementing evolutionary learning and soft computing techniques for data mining problems.

  • DeepLearning4J: This is a distributed deep learning library for Java and Scala. DeepLearning4J is integrated with Spark and Hadoop. Anomaly detection and recommender systems are use cases that lend themselves well to the models generated via deep learning techniques.

  • Spark-MLlib: (Included in Apache Spark distribution) MLlib is the machine learning library included in Spark, written mainly in Scala and Java. Since the introduction of DataFrames in Spark, the spark.ml package, which is written on top of DataFrames, is recommended over the original spark.mllib package. MLlib includes support for all stages of the analytics process, including statistical methods, classification and regression algorithms, clustering, dimensionality reduction, feature extraction, model evaluation, and PMML support, among others. Another aspect of MLlib is the support for the use of pipelines or workflows. MLlib is accessible from R, Scala, and Python, in addition to Java.

  • H2O: H2O is a Java-based library with API support in R and Python, in addition to Java. H2O can also run on Spark as its own application called Sparkling Water. H2O Flow is a web-based interactive environment with executable cells and rich media in a single notebook-like document.

  • MOA/SAMOA: Aimed at machine learning from data streams with a pluggable interface for stream processing platforms, SAMOA, at the time of writing, is an Apache Incubator project.

  • Neo4j: Neo4j is an open source NoSQL graph database implemented in Java and Scala. As we will see in later chapters, graph analytics has a variety of use cases, including matchmaking, routing, social networks, network management, and so on. Neo4j supports fully ACID-compliant transactions.

  • GraphX: This is included in the Apache Spark distribution. GraphX is the graph library accompanying Spark. The API has extensive support for viewing and manipulating graph structures, as well as some graph algorithms, such as PageRank, Connected Components, and Triangle Counting.

  • OpenMarkov: OpenMarkov is a tool for editing and evaluating probabilistic graphical models (PGM). It includes a GUI for interactive learning.

  • Smile: Smile is a machine learning platform for the JVM with an extensive library of algorithms. Its capabilities include NLP, manifold learning, association rules, genetic algorithms, and a versatile set of tools for visualization.

Datasets

A number of publicly available datasets have aided research and learning in data science immensely. Several of those listed in the following section are well known and have been used by scores of researchers to benchmark their methods over the years. New datasets are constantly being made available to serve different communities of modelers and users. The majority are real-world datasets from different domains. The exercises in this volume will use several datasets from this list.

  • UC Irvine (UCI) database: Maintained by the Center for Machine Learning and Intelligent Systems at UC Irvine, the UCI database is a catalog of some 350 datasets of varying sizes, from a dozen to more than forty million records and up to three million attributes, with a mix of multivariate text, time-series, and other data types. (https://archive.ics.uci.edu/ml/index.html)

  • Tunedit: (http://tunedit.org/) This offers Tunedit Challenges and tools to conduct repeatable data mining experiments. It also offers a platform for hosting data competitions.

  • Mldata.org: (http://mldata.org/) Supported by the PASCAL 2 organization that brings together researchers and students across Europe and the world, mldata.org is primarily a repository of user-contributed datasets that encourages data and solution sharing amongst groups of researchers to help with the goal of creating reproducible solutions.

  • KDD Challenge Datasets: (http://www.kdnuggets.com/datasets/index.html) KDNuggets aggregates multiple dataset repositories across a wide variety of domains.

  • Kaggle: Billed as the Home of Data Science, Kaggle is a leading platform for data science competitions and also a repository of datasets from past competitions and user-submitted datasets.

 

Summary


Machine learning has already demonstrated impressive successes despite being a relatively young field. With the ubiquity of Java resources, Java's platform independence, and the selection of ML frameworks in Java, superior skill in machine learning using Java is a highly desirable asset in the market today.

Machine learning has been around in some form—if only in the imagination of thinkers, in the beginning—for a long time. More recent developments, however, have had a radical impact in many spheres of our everyday lives. Machine learning has much in common with statistics, artificial intelligence, and several other related areas. Whereas some data management, business intelligence, and knowledge representation systems may also be related in the central role of data in each of them, they are not commonly associated with principles of learning from data as embodied in the field of machine learning.

Any discourse on machine learning would assume an understanding of what data is and what data types we are concerned with. Are they categorical, continuous, or ordinal? What are the data features? What is the target, and which ones are predictors? What kinds of sampling methods can be used—uniform random, stratified random, cluster, or systematic sampling? What is the model? We saw an example dataset for weather data that included categorical and continuous features in the ARFF format.

The types of machine learning include supervised learning, the most common when labeled data is available, unsupervised when it's not, and semi-supervised when we have a mix of both. The chapters that follow will go into detail on these, as well as graph mining, probabilistic graph modeling, deep learning, stream learning, and learning with Big Data.

Data comes in many forms: structured, unstructured, transactional, sequential, and graphs. We will use data of different structures in the exercises to follow later in this book.

The list of domains and the different kinds of machine learning applications keeps growing. This review presents the most active areas and applications.

Understanding and dealing effectively with practical issues, such as noisy data, skewed datasets, overfitting, data volumes, and the curse of dimensionality, is the key to successful projects—it's what makes each project unique in its challenges.

Analytics with machine learning is a collaborative endeavor with multiple roles and well-defined processes. For consistent and reproducible results, adopting the enhanced CRISP methodology outlined here is critical—from understanding the business problem to data quality analysis, modeling and model evaluation, and finally to model performance monitoring.

Practitioners of data science are blessed with a rich and growing catalog of datasets available to the public and an increasing set of ML frameworks and tools in Java as well as other languages. In the following chapters, you will be introduced to several datasets, APIs, and frameworks, along with advanced concepts and techniques to equip you with all you will need to attain mastery in machine learning.

Ready? Onward then!

About the Authors

  • Dr. Uday Kamath

    Dr. Uday Kamath is the chief data scientist at BAE Systems Applied Intelligence. He specializes in scalable machine learning and has spent 20 years in the domain of AML, fraud detection in financial crime, cyber security, and bioinformatics, to name a few. Dr. Kamath is responsible for key products in areas focusing on the behavioral, social networking and big data machine learning aspects of analytics at BAE AI. He received his PhD at George Mason University, under the able guidance of Dr. Kenneth De Jong, where his dissertation research focused on machine learning for big data and automated sequence mining.

  • Krishna Choppella

    Krishna Choppella builds tools and client solutions in his role as a solutions architect for analytics at BAE Systems Applied Intelligence. He has been programming in Java for 20 years. His interests are data science, functional programming, and distributed computing.
