You're reading from  R Web Scraping Quick Start Guide

Product type: Book
Published in: Oct 2018
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789138733
Edition: 1st Edition
Author (1)
Olgun Aydin

Olgun Aydin is a PhD candidate at the Department of Statistics at Mimar Sinan University, and is studying deep learning for his thesis. He also works as a data scientist. Olgun is familiar with big data technologies, such as Hadoop and Spark, and is a very big fan of R. He has already published academic papers about the application of statistics, machine learning, and deep learning. He loves statistics, and loves to investigate new methods and share his experience with other people.

XML Path Language and Regular Expression Language

XPath primarily operates on the nodes of XML 1.0 or XML 1.1 trees. It is used to navigate the hierarchical structure of an XML document. XPath uses a non-XML syntax and works on the logical structure of XML documents, which is also known as the data model. XPath is designed to be embedded in a host language, such as a programming language. It also has a natural subset that can be used for matching, that is, for testing whether a node matches a pattern.

A regular expression (regex or regexp) describes a search pattern. Regular expressions grew out of the work of the American mathematician Stephen Cole Kleene in the 1950s, and they came into common use with Unix text-processing programs. There are different syntaxes for writing regular expressions, notably the POSIX standard and Perl syntax. Regular expressions are used in search engines, in search-and-replace operations, and in lexical analysis...

XML Path (XPath)

XPath is a syntax that provides functionality shared between XSL Transformations (XSLT) and XPointer. It addresses parts of an XML document, and it also provides basic facilities for manipulating strings, numbers, and Booleans. XPath defines a way to compute a string value for each type of node in an XML document. The primary syntactic construct in XPath is the expression. Evaluating an expression yields an object that has one of the following four basic types:

  • Node-set
  • Boolean
  • Number
  • String
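The node-set case can be sketched with a short example. This is only an illustration, not code from the book: it uses Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath and returns lists of elements rather than Boolean, number, or string results. The XML document and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
XML_TEXT = """
<library>
  <book id="b1" lang="en"><title>R Basics</title><price>20</price></book>
  <book id="b2" lang="tr"><title>Veri Bilimi</title><price>35</price></book>
  <book id="b3" lang="en"><title>Web Scraping</title><price>30</price></book>
</library>
"""

root = ET.fromstring(XML_TEXT)

# Path expression: every <title> that is a child of a <book> child of the root.
titles = [t.text for t in root.findall("./book/title")]

# Predicate on an attribute: only books whose lang attribute equals "en".
english_ids = [b.get("id") for b in root.findall("./book[@lang='en']")]

# Descendant search: all <price> elements anywhere below the root.
prices = [int(p.text) for p in root.findall(".//price")]

print(titles)
print(english_ids)
print(prices)
```

Each call to findall evaluates one path expression and returns the matching nodes in document order. A full XPath implementation could also return the other three types, for example a number from count(//book).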

Keywords in XPath are written in lowercase characters and are not reserved. Each node has a unique identity, a typed value, and a string value. In addition, some nodes have a name. The typed value of a node is a sequence of zero or more atomic values. A sequence containing exactly one item is called a singleton. An item is identical to a singular...

Regular expression language (Regex)

Regular expressions are extremely useful for extracting information from text by searching for a specific sequence of ASCII or Unicode characters. They are frequently used in search operations on the web. One of the most useful features of regex is that it is available in almost all programming languages. Basically, a regular expression is a pattern that describes a piece of text. The name comes from the mathematical theory underlying the idea, and it is usually shortened to regex or regexp.
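As a minimal sketch of extracting information from text, the following example uses Python's standard re module. The sample text and both patterns are assumptions made for illustration; in particular, the e-mail pattern is deliberately simple and does not cover every valid address.

```python
import re

# Hypothetical text, invented for this illustration.
text = "Contact sales@example.com or support@example.org before 2018-10-31."

# A deliberately simple e-mail pattern; real addresses can be more complex.
email_pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# An ISO-style date: year, month, and day, each in its own capture group.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# findall returns every non-overlapping match as a string.
emails = email_pattern.findall(text)

# search returns the first match; groups() exposes the captured parts.
date_match = date_pattern.search(text)
year, month, day = date_match.groups()

print(emails)
print(year, month, day)
```

The same patterns can be pasted into other regex tools with little or no change, since character classes, quantifiers, and groups are shared by most regex flavors.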

You can apply regular expressions to any text that you can access through your programming language. If you know how a regex engine works, it becomes easier to write regexes: you will understand why a regular expression does not do what you expect, and you will be able to write more complex expressions.
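One common case where knowing the engine's behavior helps is greedy versus lazy quantifiers. The sketch below, again using Python's re module on an invented string, shows how the same pattern gives a different result once the quantifier is made lazy.

```python
import re

# Hypothetical HTML fragment, invented for this illustration.
html = "<b>first</b> and <b>second</b>"

# Greedy: ".+" consumes as much as possible, then backtracks to the LAST "</b>".
greedy = re.findall(r"<b>.+</b>", html)

# Lazy: ".+?" consumes as little as possible, stopping at the FIRST "</b>".
lazy = re.findall(r"<b>.+?</b>", html)

print(greedy)
print(lazy)
```

The greedy pattern returns one long match spanning both tags, while the lazy pattern returns each tagged piece separately. Neither result is wrong; the engine simply follows the quantifier it was given.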

There are basically only two...

Exercises on RegEx and XPath

In this part, we are going to do some exercises on writing XPath and RegEx rules. For this purpose, we will use some open source tools.

One of the best open source platforms for practising writing RegEx rules is https://regex101.com/:

Regular expressions on regex101.com

As you can see from the screenshot of regex101.com:

  • You can add text to the Test String part and your regular expression to the Regular Expression part.
  • In the Test String part, you will be able to see the matched strings.

For testing your XPath rules, you can use Google Chrome's developer tools:

  • You can reach this menu through the More Tools section; click the three dots in your Chrome browser, as shown in the following screenshot:
Developer tools option on the web page
  • When you click Developer tools, you...

Summary

In this chapter, we have tried to focus on the idea behind XPath and RegEx; we became familiar with tools to write/test RegEx and XPath rules. In the exercise section, we focused on a real-life example and tried to write RegEx and XPath rules by using the tools that were mentioned.

In the next chapter, we will talk about the rvest library of R. We will investigate how it works and how to build scraping systems using the rvest library.
