Reader small image

You're reading from  Machine Learning with R - Third Edition

Product typeBook
Published inApr 2019
Reading LevelIntermediate
PublisherPackt
ISBN-139781788295864
Edition3rd Edition
Languages
Tools
Right arrow
Author (1)
Brett Lantz
Brett Lantz
author image
Brett Lantz

Brett Lantz (DataSpelunking) has spent more than 10 years using innovative data methods to understand human behavior. A sociologist by training, Brett was first captivated by machine learning during research on a large database of teenagers' social network profiles. Brett is a DataCamp instructor and a frequent speaker at machine learning conferences and workshops around the world. He is known to geek out about data science applications for sports, autonomous vehicles, foreign language learning, and fashion, among many other subjects, and hopes to one day blog about these subjects at Data Spelunking, a website dedicated to sharing knowledge about the search for insight in data.
Read more about Brett Lantz

Right arrow

Chapter 12. Specialized Machine Learning Topics

Congratulations on reaching this point in your machine learning journey! If you have not already started work on your own projects, you will do so soon. And in doing so, you may find that the task of turning data into action is more difficult than it first appeared.

As you gathered data, you might have realized that the information was trapped in a proprietary format or spread across pages on the web. Making matters worse, after spending hours reformatting the data, maybe your computer slowed to a crawl after it ran out of memory. Perhaps R even crashed or froze your machine. Hopefully you were undeterred, as these issues can be remedied with a bit more effort.

This chapter covers techniques that may not apply to every project, but will prove useful for working around such specialized issues. You might find the information particularly useful if you tend to work with data that is:

  • Stored in unstructured or proprietary formats such as web pages...

Managing and preparing real-world data


Unlike the examples in this book, real-world data is rarely packaged in a simple CSV form that can be downloaded from a website. Instead, significant effort is needed to prepare data for analysis. Data must be collected, merged, sorted, filtered, or reformatted to meet the requirements of the learning algorithm. This process is known informally as data munging or data wrangling.

Data preparation has become even more important as the size of typical datasets has grown from megabytes to gigabytes and data is gathered from unrelated and messy sources, many of which are stored in massive databases. Several packages and resources for retrieving and working with proprietary data formats and databases are listed in the following sections.

Making data "tidy" with the tidyverse packages

A new approach has been rapidly taking shape as the dominant paradigm for working with data in R. Championed by Hadley Wickham, the mind behind many of the packages that drove much...

Working with online data and services


With growing amounts of data available from web-based sources, it is increasingly important for machine learning projects to be able to access and interact with online services. R is able to read data from online sources natively, with some caveats. First, by default, R cannot access secure websites (those using https:// rather than the http:// protocol). Secondly, it is important to note that most web pages do not provide data in a form that R can understand. The data will need to be parsed, or broken apart and rebuilt into a structured form before it can be useful. We'll discuss the workarounds shortly.

However, if neither of these caveats apply, that is, if the data are already online in a non-secure website and in a tabular form like CSV that R can understand natively, then R's read.csv() and read.table() functions can access it from the web just as if it were on your local machine. Simply supply the full Uniform Resource Locator (URL) for the dataset...

Working with domain-specific data


Machine learning has undoubtedly been applied to problems across every discipline. Although the basic techniques are similar across all domains, some are so specialized that communities have formed to develop solutions to the challenges unique to the field. This leads to the discovery of new techniques and new terminology that is relevant only to domain-specific problems.

This section covers a pair of domains that use machine learning techniques extensively, but require specialized knowledge to unlock their full potential. Since entire books have been written on these topics, this will serve as only the briefest of introductions. For more detail, seek out the help provided by the resources cited in each section.

Analyzing bioinformatics data

The field of bioinformatics is concerned with the application of computers and data analysis to the biological domain, particularly with regard to better understanding the genome. As genetic data is unique compared to many...

Improving the performance of R


Base R has a reputation for being slow and memory inefficient, a reputation that is at least somewhat earned. These faults are largely unnoticed on a modern PC for datasets of many thousands of records, but datasets with a million records or more can exceed the limits of what is currently possible with consumer-grade hardware. The problem is worsened if the dataset contains many features or if complex learning algorithms are being used.

Note

CRAN has a high-performance computing task view that lists packages pushing the boundaries on what is possible in R at http://cran.r-project.org/web/views/HighPerformanceComputing.html.

Packages that extend R past the capabilities of the base package are being developed rapidly. This work comes primarily on two fronts: some packages add the capability to manage extremely large datasets by making data operations faster or by allowing the size of data to exceed the amount of available system memory; others allow R to work faster...

Summary


It is certainly an exciting time to be studying machine learning. Ongoing work on the relatively uncharted frontiers of parallel and distributed computing offers great potential for tapping the knowledge found in the deluge of big data. And the burgeoning data science community is facilitated by the free and open-source R programming language, which provides a very low barrier for entry—you simply need to be willing to learn.

The topics you have learned, both in this chapter as well as previous chapters, provide the foundation for understanding more advanced machine learning methods. It is now your responsibility to keep learning and adding tools to your arsenal. Along the way, be sure to keep in mind the no free lunch theorem—no learning algorithm rules them all, and they all have varying strengths and weaknesses. For this reason, there will always be a human element to machine learning, adding subject-specific knowledge and the ability to match the appropriate algorithm to the task...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Machine Learning with R - Third Edition
Published in: Apr 2019Publisher: PacktISBN-13: 9781788295864
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Brett Lantz

Brett Lantz (DataSpelunking) has spent more than 10 years using innovative data methods to understand human behavior. A sociologist by training, Brett was first captivated by machine learning during research on a large database of teenagers' social network profiles. Brett is a DataCamp instructor and a frequent speaker at machine learning conferences and workshops around the world. He is known to geek out about data science applications for sports, autonomous vehicles, foreign language learning, and fashion, among many other subjects, and hopes to one day blog about these subjects at Data Spelunking, a website dedicated to sharing knowledge about the search for insight in data.
Read more about Brett Lantz