You're reading from R Web Scraping Quick Start Guide

Product type: Book

Published in: Oct 2018

Reading level: Beginner

Publisher: Packt

ISBN-13: 9781789138733

Edition: 1st Edition

Concepts: Data Mining

Author: Olgun Aydin

Learning about data on the internet

Data is an essential part of any research, whether it be academic, marketing, or scientific. The World Wide Web (WWW) contains all kinds of information from different sources. These include social, financial, security, and academic resources, all of which are accessible via the internet.

People may want to collect and analyse data from multiple websites. Websites in different categories display information in different formats. Even on a single website, you may not be able to see all the data at once. The data may be spread across multiple pages under various sections.

Most websites do not allow you to save a copy of the data to your local storage. Without a dedicated tool, the only option is to manually copy and paste the data shown by the website into a local file on your computer. This is a very tedious process that can take a lot of time.

Web scraping is a technique for extracting data from multiple websites into a single spreadsheet or database, so that the data becomes easier to analyse or visualize. In other words, web scraping transforms unstructured data from the web into a centralized local database.
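The idea of pulling data from several pages into one table can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not a production scraper. The site names and HTML snippets in `PAGES` are hypothetical stand-ins for downloaded pages, and the sketch assumes every table row holds exactly two cells (item and price).

```python
import csv
import io
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows
        self._row = None       # row currently being read, if any
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def scrape_rows(html):
    """Return the table rows found in one page as lists of cell text."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page sources; a real scraper would download these.
PAGES = {
    "shop-a.example": "<table><tr><td>Laptop</td><td>900</td></tr></table>",
    "shop-b.example": "<table><tr><td>Phone</td><td>500</td></tr>"
                      "<tr><td>Tablet</td><td>300</td></tr></table>",
}


def combine_to_csv(pages):
    """Merge the rows of every page into one CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "item", "price"])
    for source, html in pages.items():
        for item, price in scrape_rows(html):
            writer.writerow([source, item, price])
    return buffer.getvalue()


print(combine_to_csv(PAGES))
```

The output is a single CSV table in which each row records which page it came from, so data that lived on separate sites can be analysed together.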

Well-known companies and organizations, including Google, Amazon, Wikipedia, Facebook, and many more, provide APIs (Application Programming Interfaces): documented interfaces that let programs interact with their data structures and other software components. Where such an API exists, data collection from those websites is fast and can be performed without any web scraping software.
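To see why an API removes the need for scraping, consider a response that already arrives as structured data. The sketch below assumes a JSON response; the payload and its field names (`items`, `name`, `price`) are hypothetical, since every real API defines its own format.

```python
import json

# Hypothetical response body; a real API documents its own fields.
RESPONSE_BODY = """
{"items": [{"name": "Laptop", "price": 900},
           {"name": "Phone", "price": 500}]}
"""


def prices_by_name(body):
    """Turn the JSON text into a plain dictionary of name -> price."""
    payload = json.loads(body)
    return {item["name"]: item["price"] for item in payload["items"]}


print(prices_by_name(RESPONSE_BODY))
```

No HTML has to be parsed here: the structure is part of the response itself.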

One of the most commonly used representations when scraping semi-structured web pages is the labeled, ordered rooted tree. In this tree, the labels are the tags of the HTML markup language, and the tree hierarchy represents the different nesting levels of the elements that make up the web page. The representation of a web page as a labeled, ordered rooted tree is referred to as the DOM (Document Object Model), a standard developed largely by the World Wide Web Consortium (W3C).
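The tree structure can be made concrete with a small example. The following sketch builds a simplified tag tree from an HTML string using Python's standard `html.parser` module. It is an illustration only, not a full DOM implementation: it ignores text and attributes, and it assumes every opened tag is explicitly closed. The `Node`, `TreeBuilder`, `build_tree`, and `outline` names are made up for this example.

```python
from html.parser import HTMLParser


class Node:
    """One element in the simplified tree: a tag label plus children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree whose nesting mirrors the nesting of the HTML tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        # A new tag becomes a child of the element that is currently open.
        node = Node(tag, self._current)
        self._current.children.append(node)
        self._current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element, then return to its parent.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one tag per line."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


HTML = ("<html><head><title>Demo</title></head>"
        "<body><h1>Hi</h1><p>Some <b>bold</b> text</p></body></html>")

print("\n".join(outline(build_tree(HTML))))
```

Each level of indentation in the printed outline corresponds to one nesting level in the markup, which is exactly the hierarchy the tree is meant to capture.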

The general idea behind the DOM starts from the fact that HTML web pages are plain text annotated with HTML tags, which are special keywords defined in the markup language. The browser interprets these tags to render web-specific items. HTML tags can be nested in a hierarchical structure, and the DOM captures this hierarchy as a document tree whose nodes represent the HTML tags. We will take a look at DOM structures while we focus on XPath rules.
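As a small preview of how a path expression selects nodes from the document tree, the sketch below uses Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. It assumes the page is well-formed markup that an XML parser can read; the page content and the `product`, `name`, and `price` class names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html>
  <body>
    <div class="product">
      <span class="name">Laptop</span><span class="price">900</span>
    </div>
    <div class="product">
      <span class="name">Phone</span><span class="price">500</span>
    </div>
    <div class="ad"><span class="name">Sponsored</span></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Direct children of <body>: the hierarchy is walked node by node.
body_children = [child.tag for child in root.find("body")]

# Path expression: any <div class="product">, then its <span class="name">.
name_path = ".//div[@class='product']/span[@class='name']"
product_names = [span.text for span in root.findall(name_path)]

print(body_children)
print(product_names)
```

The path describes a route through the tree (which parent, which child, which attribute) rather than a position in the raw text, so the advertisement block is skipped even though it also contains a `name` span.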

Author (1)

Olgun Aydin

Olgun Aydin is a PhD candidate at the Department of Statistics at Mimar Sinan University, and is studying deep learning for his thesis. He also works as a data scientist. Olgun is familiar with big data technologies, such as Hadoop and Spark, and is a very big fan of R. He has already published academic papers about the application of statistics, machine learning, and deep learning. He loves statistics, and loves to investigate new methods and share his experience with other people.