You're reading from Mastering Text Mining with R

Product type: Book
Published in: Dec 2016
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783551811
Edition: 1st Edition
Author (1)
KUMAR ASHISH

Ashish Kumar is a seasoned data science professional, a published author, and a thought leader in the field of data science and machine learning. An IIT Madras graduate and a Young India Fellow, he has around 7 years of experience in implementing and deploying data science and machine learning solutions for challenging industry problems in both hands-on and leadership roles. Natural Language Processing, IoT analytics, R Shiny product development, and ensemble ML methods are his core areas of expertise. He is fluent in Python and R and teaches a popular ML course at Simplilearn. When not crunching data, Ashish sneaks off to the next hip beach around and enjoys the company of his Kindle. He also trains and mentors data science aspirants and fledgling start-ups.

Processing text using regular expressions


The web consists predominantly of unstructured text. One of the main tasks in web scraping is to collect the relevant information from heaps of textual data. Within the unstructured text, we are often interested in specific pieces of information, especially when we want to analyze the data using quantitative methods. Such information can include phone numbers, zip codes, latitude and longitude values, or addresses.

First, we gather the unstructured text; next, we determine the recurring patterns behind the information we are looking for; and then we apply these patterns to the unstructured text to extract the information. When we are web scraping, we have to identify and extract those parts of the document that contain the relevant information. Ideally, we can do so using XPath, although sometimes the crucial information is hidden within values. Sometimes relevant information might be scattered across an HTML document. We need to write regular expressions...
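
The gather, describe, and apply workflow can be illustrated with a minimal sketch. It is written in Python rather than R so that it runs with the standard library alone; the idea of the patterns should carry over to R's regular expression functions. The sample text, the pattern names, and the `extract` helper are assumptions made up for this illustration, and the patterns only cover simple US-style formats.

```python
import re

# Step 1: gather the unstructured text.
# This made-up snippet stands in for text collected from a web page.
text = (
    "Contact our Springfield office at (555) 123-4567 or 555-987-6543. "
    "Mail: 742 Evergreen Terrace, Springfield, IL 62704. "
    "Coordinates: 39.7817, -89.6501."
)

# Step 2: describe the recurring patterns behind the information we want.
# These are illustrative patterns, not exhaustive validators.
PHONE = re.compile(r"\(?\b\d{3}\)?[ -]?\d{3}-\d{4}\b")  # (555) 123-4567 or 555-987-6543
ZIP_CODE = re.compile(r"\b\d{5}(?:-\d{4})?\b")          # 62704 or 62704-1234
COORDINATE = re.compile(r"-?\d{1,3}\.\d+")              # decimal latitude or longitude


def extract(raw_text):
    """Step 3: apply each pattern to the text and collect every match."""
    return {
        "phones": PHONE.findall(raw_text),
        "zip_codes": ZIP_CODE.findall(raw_text),
        "coordinates": COORDINATE.findall(raw_text),
    }


result = extract(text)
for kind, matches in result.items():
    print(kind, matches)
```

Keeping each pattern as a separately named, compiled object makes it easier to tighten one pattern (for example, to accept international phone formats) without touching the others.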
