Instant Web Scraping with Java [Instant]


This title is available as an eBook only
eBook: $14.99
Formats: PDF, PacktLib, ePub, and Mobi
Learn in an Instant - Short, Fast, Focused
  • Learn something new in an Instant! A short, fast, focused guide delivering immediate results
  • Get your Java environment set up and running
  • Gather clean, formatted web data into your own database
  • Learn how to work around crawler-resistant websites and legally subvert security measures
  • Use built-in Java features to perform parallel processing and distributed scraping
  • Build test cases for your own websites using JUnit

Book Details

Language : English
eBook : 72 pages
Release Date : August 2013
ISBN : 1849696888
ISBN 13 : 9781849696883
Author(s) : Ryan Mitchell
Topics and Technologies : All Books, Instant, Web Development

Table of Contents

Preface
Instant Web Scraping with Java
  • Instant Web Scraping with Java
    • Setting up your Java Environment (Simple)
    • Writing and executing HelloWorld.java (Simple)
    • Writing a simple scraper (Simple)
    • Writing a more complicated scraper (Intermediate)
    • Handling errors (Simple)
    • Writing robust, scalable code (Advanced)
    • Persisting data (Advanced)
    • Writing tests (Intermediate)
    • Going undercover (Intermediate)
    • Submitting a basic form (Advanced)
    • Scraping Ajax Pages (Advanced)
    • Faster scraping through threading (Intermediate)
    • Faster scraping with RMI (Advanced)

Ryan Mitchell

Ryan Mitchell has ten years of programming experience, including Java, C, Perl, PHP, and Python. In addition to “traditional” programming, she specializes in web technologies, with three years of Drupal development experience, and is a certified Sitecore developer. She graduated from Olin College of Engineering and is currently studying at the Harvard University Extension School for a Master's in Software Engineering. In addition to academic life, she currently works at Velir Studios as a Web Systems Analyst, and has also worked as a developer for Harvard University and Abine Inc.

Errata

- 1 submitted: last submission 27 Nov 2013

Errata type: code | Page number: 16

 

The following code snippet at the beginning of the page:

    public class WikiScraper {
        public static void main(String[] args) {
            scrapeTopic("/wiki/Python");
        }

should be:

    public class WikiScraper {
        public static void main(String[] args) {
            scrapeTopic("/wiki/Java");
        }
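
The erratum only changes the relative path passed to scrapeTopic; the method body is not shown on this page. As a rough illustration of what a method taking such a path might do, the sketch below joins the path to a base URL and pulls article links out of a page. Everything here is an assumption made for illustration: the class name WikiLinkSketch, the helpers buildUrl and extractWikiLinks, the base URL, and the regular expression are hypothetical and are not taken from the book. It works on a hard-coded HTML string so it runs without a network connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the book's WikiScraper implementation.
class WikiLinkSketch {
    // Assumed base URL; the erratum only shows the relative path "/wiki/Java".
    static final String BASE_URL = "https://en.wikipedia.org";

    // Matches href values that start with /wiki/ and contain no colon or fragment,
    // which skips namespace pages such as /wiki/File:... in this simplified form.
    private static final Pattern WIKI_LINK =
            Pattern.compile("href=\"(/wiki/[^\":#]+)\"");

    // Turns a relative topic path into an absolute URL.
    static String buildUrl(String topicPath) {
        return BASE_URL + topicPath;
    }

    // Collects every matching /wiki/ path found in the given HTML, in document order.
    static List<String> extractWikiLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = WIKI_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Stand-in for a downloaded page, so the sketch needs no network access.
        String sampleHtml = "<p><a href=\"/wiki/Java\">Java</a> and "
                + "<a href=\"/wiki/File:Logo.png\">logo</a> and "
                + "<a href=\"/wiki/Python\">Python</a></p>";
        System.out.println(buildUrl("/wiki/Java"));
        System.out.println(extractWikiLinks(sampleHtml));
    }
}
```

A regular expression is used here only to keep the sketch dependency-free; a real scraper would more likely rely on an HTML parser.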

What you will learn from this book

  • Set up your Java environment and work with the Eclipse IDE
  • Execute complicated web crawlers that run without intervention
  • Handle errors, write documentation, and produce robust code
  • Log scraped data for later retrieval and analysis
  • Write code to test website content and functionality with the JUnit framework
  • Learn techniques for getting around website security measures designed to prevent automated scraping
  • Fill and submit forms automatically
  • Use threading to run scrapers in parallel
  • Use Java’s Remote Method Invocation (RMI) to create multi-server distributed scrapers
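
To give a feel for the threading item in the list above, here is a minimal sketch of running several scrape tasks in parallel with the standard java.util.concurrent package. This page does not say which built-in features the book uses, so the choice of ExecutorService is an assumption, and the class ParallelScrapeSketch and its scrape and scrapeAll methods are hypothetical names. The scrape method is a stub that returns a placeholder string instead of fetching a page, so the example runs offline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; not code from the book.
class ParallelScrapeSketch {
    // Stub standing in for a real page fetch; returns a placeholder result.
    static String scrape(String path) {
        return "scraped:" + path;
    }

    // Runs one scrape task per path on a fixed-size thread pool and
    // returns the results in the same order as the input paths.
    static List<String> scrapeAll(List<String> paths, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String path : paths) {
                tasks.add(() -> scrape(path));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task has finished and returns
            // the futures in the same order as the submitted tasks.
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown(); // stop accepting work so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println(scrapeAll(List.of("/wiki/Java", "/wiki/Python"), 2));
    }
}
```

With a real fetch in place of the stub, the pool size would also act as a cap on how many requests hit the target site at once.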

In Detail

Java is often thought of as a stuffy enterprise language, while web scraping is the often-murky domain of scripting languages. By combining the robustness and extensibility of Java with the flexibility and power of web scraping, we can create immensely useful tools that can solve very difficult problems.

Instant Web Scraping with Java will guide you, step by step, through setting up your Java environment. You will also learn how to write simple web scrapers and distributed networks of crawlers. Throughout the book, we will provide useful tips, out-of-the-box working code, and additional resources to build expert knowledge.

Instant Web Scraping with Java will teach you how to build your own web scrapers using real-world scraping examples that collect and store data from Wikipedia, public records data sites, IP address geolocation services, and more. You will learn how to run scrapers across multiple servers, run them in parallel, and subvert common methods of anti-scraper security used on modern websites. This book will also provide you with detailed step-by-step instructions, out-of-the-box working code, and expert pointers to further resources on key topics.

Instant Web Scraping with Java will show you how to view and collect any Internet data at the speed of your processor!

Approach

This book is filled with practical, step-by-step instructions and clear explanations for the most important and useful tasks. It is made up of short, concise recipes that teach a variety of useful web scraping techniques using Java. You will start with a simple recipe for setting up your Java environment and gradually move on to more advanced recipes, such as building complex scrapers.

Who this book is for

Instant Web Scraping with Java is aimed at developers who, while not necessarily familiar with Java, are at least ready to dive into the complexities of this language with simple, step-by-step instructions leading the way. It is assumed that you have at least an intermediate knowledge of HTML, some knowledge of MySQL, and access to an Internet-connected computer while doing most of the exercises (after all, scraping the Web is difficult if your code can’t get online!).
