
You're reading from Machine Learning with R - Third Edition

Product type: Book

Published in: Apr 2019

Reading level: Intermediate

Publisher: Packt

ISBN-13: 9781788295864

Edition: 3rd Edition

Tools: RStudio

Concepts: Machine Learning

Author: Brett Lantz

Chapter 12. Specialized Machine Learning Topics

Congratulations on reaching this point in your machine learning journey! If you have not already started work on your own projects, you will do so soon. And in doing so, you may find that the task of turning data into action is more difficult than it first appeared.

As you gathered data, you might have realized that the information was trapped in a proprietary format or spread across pages on the web. Making matters worse, after spending hours reformatting the data, maybe your computer slowed to a crawl after it ran out of memory. Perhaps R even crashed or froze your machine. Hopefully you were undeterred, as these issues can be remedied with a bit more effort.

This chapter covers techniques that may not apply to every project, but will prove useful for working around such specialized issues. You might find the information particularly useful if you tend to work with data that is:

Stored in unstructured or proprietary formats such as web pages...

Managing and preparing real-world data

Unlike the examples in this book, real-world data is rarely packaged in a simple CSV form that can be downloaded from a website. Instead, significant effort is needed to prepare data for analysis. Data must be collected, merged, sorted, filtered, or reformatted to meet the requirements of the learning algorithm. This process is known informally as data munging or data wrangling.
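As a minimal sketch of these steps in base R, the following merges, reformats, filters, and sorts two small made-up data frames. The `customers` and `orders` tables, their columns, and their values are hypothetical and serve only as an illustration.

```r
# Two hypothetical sources describing the same customers
customers <- data.frame(
  id   = c(3, 1, 2),
  name = c("Cho", "Ana", "Ben"),
  stringsAsFactors = FALSE
)
orders <- data.frame(
  id     = c(1, 2, 2, 3),
  amount = c("10.50", "7.25", "3.00", "n/a"),  # numbers stored as text
  stringsAsFactors = FALSE
)

# Merge: join the two sources on their shared key
merged <- merge(customers, orders, by = "id")

# Reformat: convert text to numbers; unparseable values become NA
merged$amount <- suppressWarnings(as.numeric(merged$amount))

# Filter: drop records with a missing amount
clean <- merged[!is.na(merged$amount), ]

# Sort: order by amount, largest first
clean <- clean[order(-clean$amount), ]

print(clean)
```

Real projects usually need many more such steps, but the pattern of chaining small, checkable transformations stays the same.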

Data preparation has become even more important as the size of typical datasets has grown from megabytes to gigabytes and data is gathered from unrelated and messy sources, many of which are stored in massive databases. Several packages and resources for retrieving and working with proprietary data formats and databases are listed in the following sections.

Making data "tidy" with the tidyverse packages

A new approach has been rapidly taking shape as the dominant paradigm for working with data in R. Championed by Hadley Wickham, the mind behind many of the packages that drove much...

Working with online data and services

With growing amounts of data available from web-based sources, it is increasingly important for machine learning projects to be able to access and interact with online services. R is able to read data from online sources natively, with some caveats. First, by default, R cannot access secure websites (those using the https:// rather than the http:// protocol). Second, most web pages do not provide data in a form that R can understand. The data will need to be parsed, that is, broken apart and rebuilt into a structured form, before it can be useful. We'll discuss the workarounds shortly.

However, if neither of these caveats applies, that is, if the data is already online on a non-secure website and in a tabular form like CSV that R can understand natively, then R's read.csv() and read.table() functions can access it from the web just as if it were on your local machine. Simply supply the full Uniform Resource Locator (URL) for the dataset...
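A minimal sketch of the idea follows. The web address in the commented-out line is hypothetical, so the runnable part of the example reads a small temporary file instead; the read.csv() call has the same form in both cases.

```r
# Hypothetical address; substitute the URL of a real CSV file:
# mydata <- read.csv("http://www.example.com/mydata.csv")

# Offline stand-in so the sketch runs anywhere: write a small CSV to a
# temporary file, then read it back with the same kind of call.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("age,income", "34,52000", "51,61000"), csv_path)

mydata <- read.csv(csv_path)
str(mydata)
```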

Working with domain-specific data

Machine learning has undoubtedly been applied to problems across every discipline. Although the basic techniques are similar across all domains, some domains are so specialized that communities have formed to develop solutions to the challenges unique to the field. This leads to the discovery of new techniques and new terminology that are relevant only to domain-specific problems.

This section covers a pair of domains that use machine learning techniques extensively, but require specialized knowledge to unlock their full potential. Since entire books have been written on these topics, this will serve as only the briefest of introductions. For more detail, seek out the help provided by the resources cited in each section.

Analyzing bioinformatics data

The field of bioinformatics is concerned with the application of computers and data analysis to the biological domain, particularly with regard to better understanding the genome. As genetic data is unique compared to many...

Improving the performance of R

Base R has a reputation for being slow and memory inefficient, a reputation that is at least somewhat earned. These faults are largely unnoticed on a modern PC for datasets of many thousands of records, but datasets with a million records or more can exceed the limits of what is currently possible with consumer-grade hardware. The problem is worsened if the dataset contains many features or if complex learning algorithms are being used.
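To get a feel for why memory becomes a constraint, here is a rough back-of-the-envelope sketch. It assumes a purely numeric dataset, in which each value is stored as an 8-byte double; the helper `estimate_bytes()` and the dataset dimensions are hypothetical, chosen only for illustration.

```r
# Rough size of a purely numeric dataset: each double takes 8 bytes
estimate_bytes <- function(n_rows, n_cols) n_rows * n_cols * 8

# Check the estimate against a real 100,000 x 10 data frame
n_rows <- 1e5
n_cols <- 10
df <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))

actual    <- as.numeric(object.size(df))
estimated <- estimate_bytes(n_rows, n_cols)

cat("actual bytes:   ", actual, "\n")
cat("estimated bytes:", estimated, "\n")

# Extrapolate to a hypothetical 10 million rows and 100 features
cat("10M x 100 (GB): ", estimate_bytes(1e7, 100) / 1e9, "\n")
```

The estimate covers only the raw data. Many operations make temporary copies, so the working memory needed during an analysis can be a multiple of this figure.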

Note

CRAN has a high-performance computing task view that lists packages pushing the boundaries on what is possible in R at http://cran.r-project.org/web/views/HighPerformanceComputing.html.

Packages that extend R past the capabilities of the base package are being developed rapidly. This work comes primarily on two fronts: some packages add the capability to manage extremely large datasets by making data operations faster or by allowing the size of data to exceed the amount of available system memory; others allow R to work faster...

Summary

It is certainly an exciting time to be studying machine learning. Ongoing work on the relatively uncharted frontiers of parallel and distributed computing offers great potential for tapping the knowledge found in the deluge of big data. And the burgeoning data science community is facilitated by the free and open-source R programming language, which provides a very low barrier to entry: you simply need to be willing to learn.

The topics you have learned, both in this chapter and in previous chapters, provide the foundation for understanding more advanced machine learning methods. It is now your responsibility to keep learning and adding tools to your arsenal. Along the way, be sure to keep in mind the no free lunch theorem: no learning algorithm rules them all, and they all have varying strengths and weaknesses. For this reason, there will always be a human element to machine learning, adding subject-specific knowledge and the ability to match the appropriate algorithm to the task...


Author (1)

Brett Lantz

Brett Lantz (DataSpelunking) has spent more than 10 years using innovative data methods to understand human behavior. A sociologist by training, Brett was first captivated by machine learning during research on a large database of teenagers' social network profiles. Brett is a DataCamp instructor and a frequent speaker at machine learning conferences and workshops around the world. He is known to geek out about data science applications for sports, autonomous vehicles, foreign language learning, and fashion, among many other subjects, and hopes to one day blog about these subjects at Data Spelunking, a website dedicated to sharing knowledge about the search for insight in data.