Creating our first bot, WebBot

by Shay Michael Anderson | October 2013 | Open Source

In this article by Shay Michael Anderson, the author of the book Instant Simple Botting with PHP, you will get started with building your own bot. You should be aware of and comfortable with HTTP requests and responses, how to develop HTTP packages, and why we use bootstrap files.


With the knowledge you have gained, we are now ready to develop our first bot, which will be a simple bot that gathers data (documents) based on a list of URLs and datasets (field and field values) that we will require.

First, let's start by creating our bot package directory. So, create a directory called WebBot so that the files in our project_directory/lib directory look like the following:

'-- project_directory
    |-- lib
    |   |-- HTTP (our existing HTTP package)
    |   |   '-- (HTTP package files here)
    |   '-- WebBot
    |       |-- bootstrap.php
    |       |-- Document.php
    |       '-- WebBot.php
    |-- (our other files)
    '-- 03_webbot.php

As you can see, we have a very clean and simple directory and file structure that any programmer should be able to easily follow and understand.

The WebBot class

Next, open the WebBot.php file and add the code from the project_directory/lib/WebBot/WebBot.php file.

In our WebBot class, we first use the __construct() method to pass in the array of URLs (or documents) we want to fetch and the array of document fields, which defines the datasets and their regular expression patterns. The regular expression patterns are used to populate the dataset values (or document field values). If you are unfamiliar with regular expressions, now would be a good time to study them. Then, in the __construct() method, we verify whether there are URLs to fetch. If there are none, we set an error message stating this problem.
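As a rough illustration of the constructor check described above, the following sketch stores the URLs and fields and records an error when the URL list is empty. The class name UrlListSketch, its properties, and the error text are hypothetical and are not taken from the book's WebBot.php file.

```php
<?php
// Hypothetical sketch only: names and the error message are illustrative,
// not the book's actual WebBot implementation.
class UrlListSketch
{
    /** @var string|null error message, or null when there is no error */
    public $error = null;

    private $urls;
    private $fields;

    public function __construct(array $urls, array $documentFields)
    {
        $this->urls = $urls;
        $this->fields = $documentFields;

        // without at least one URL there is nothing to fetch
        if (count($this->urls) < 1) {
            $this->error = 'No URLs set';
        }
    }

    public function hasError(): bool
    {
        return $this->error !== null;
    }
}
```

Recording the error on the object, rather than throwing, lets the calling script decide how to report the problem.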

Next, we use the __formatUrl() method to properly format the URLs we fetch data from. This method also sets the correct protocol: either HTTP or HTTPS (Hypertext Transfer Protocol Secure). If the protocol is already set for the URL, for example http://www.[dom].com, we leave it unchanged. Also, if the class configuration setting conf_force_https is set to true, we force the HTTPS protocol, again unless the protocol is already set for the URL.
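The protocol rules above could be sketched as a small standalone function. This is only an approximation under stated assumptions: the function name formatUrlSketch and its $forceHttps parameter are hypothetical stand-ins for the private __formatUrl() method and the conf_force_https setting.

```php
<?php
// Hypothetical sketch of the URL formatting rules; not the book's code.
function formatUrlSketch(string $url, bool $forceHttps = false): string
{
    $url = trim($url);

    // a protocol that is already present is never overridden
    if (preg_match('/^https?:\/\//i', $url) === 1) {
        return $url;
    }

    // otherwise prepend HTTPS when forced, or plain HTTP by default
    return ($forceHttps ? 'https://' : 'http://') . $url;
}

echo formatUrlSketch('www.example.com'), PHP_EOL;
echo formatUrlSketch('www.example.com', true), PHP_EOL;
echo formatUrlSketch('http://www.example.com', true), PHP_EOL;
```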

We then use the execute() method to fetch data for each URL, create the Document objects and add them to the array of documents, and track document statistics. This method also implements fetch delay logic that delays each fetch by a number of seconds if one is set in the class configuration setting conf_delay_between_fetches. We also include logic that only allows distinct URL fetches, meaning that if we have already fetched data for a URL, we won't fetch it again; this eliminates duplicate URL data fetches. The Document object is used as a container for the URL data, and we can use it to access the URL data, the data fields, and their corresponding data field values.

In the execute() method, you can see that we perform a \HTTP\Request::get() request using the URL and our default timeout value, which is set with the class configuration setting conf_default_timeout. We then pass the \HTTP\Response object that is returned by the \HTTP\Request::get() method to the Document object. Then, the Document object uses the data from the \HTTP\Response object to build the document data.
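The fetch loop described above, with distinct URL fetches and an optional delay, might look roughly like the following sketch. To keep it self-contained, the HTTP call is passed in as a callable that stands in for \HTTP\Request::get(); the function name executeSketch, its parameters, and the choice not to delay before the first fetch are all assumptions.

```php
<?php
// Hypothetical sketch of the fetch loop; $fetch stands in for the
// \HTTP\Request::get() call and must accept ($url, $timeout).
function executeSketch(array $urls, callable $fetch, int $delaySeconds = 0, int $timeout = 30): array
{
    $documents = [];
    $fetched = []; // URLs already requested, used to skip duplicates

    foreach ($urls as $id => $url) {
        // distinct fetches only: skip a URL we have already requested
        if (isset($fetched[$url])) {
            continue;
        }

        // pause between fetches (assumption: no pause before the first one)
        if ($delaySeconds > 0 && count($fetched) > 0) {
            sleep($delaySeconds);
        }

        $fetched[$url] = true;
        $documents[$id] = $fetch($url, $timeout);
    }

    return $documents;
}
```

Injecting the fetch step as a callable also makes the loop easy to exercise with a stub, without sending real HTTP requests.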

Finally, we include the getDocuments() method, which simply returns all the Document objects in an array that we can use for our own purposes as we desire.

The WebBot Document class

Next, we need to create a class called Document that can be used to store document data and field names with their values. To do this we will carry out the following steps:

  1. We first pass the data retrieved by our WebBot class to the Document class.
  2. Then, we define our document's fields and values using regular expression patterns.
  3. Next, add the code from the project_directory/lib/WebBot/Document.php file.

    Our Document class accepts the \HTTP\Response object that is set in the WebBot class's execute() method, along with the document fields and the document ID.

  4. In the Document __construct() method, we set our class properties: the HTTP Response object, the fields (and regular expression patterns), the document ID, and the URL that we use to fetch the HTTP response.
  5. We then check whether the HTTP response was successful (status code 200), and if it wasn't, we set the error with the status code and message.
  6. Lastly, we call the __setFields() method.
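The constructor steps above can be sketched as follows. Both classes are hypothetical: ResponseStub is a minimal stand-in for the \HTTP\Response object, whose real property names are not shown in this article, and DocumentSketch leaves out the field parsing step.

```php
<?php
// Hypothetical stand-in for \HTTP\Response; property names are assumed.
class ResponseStub
{
    public $statusCode;
    public $statusMessage;
    public $body;

    public function __construct(int $statusCode, string $statusMessage, string $body)
    {
        $this->statusCode = $statusCode;
        $this->statusMessage = $statusMessage;
        $this->body = $body;
    }
}

// Hypothetical sketch of the Document constructor logic; not the book's code.
class DocumentSketch
{
    public $error = null;
    public $id;
    public $url;

    private $response;
    private $fields;

    public function __construct(ResponseStub $response, array $fields, $id, string $url)
    {
        $this->response = $response;
        $this->fields = $fields;
        $this->id = $id;
        $this->url = $url;

        // anything other than 200 is recorded as an error
        if ($response->statusCode !== 200) {
            $this->error = $response->statusCode . ' ' . $response->statusMessage;
        }
        // field parsing would follow here
    }
}
```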

The __setFields() method parses out and sets the field values from the HTTP response body. For example, if in our fields we have a title field defined as $fields = ['title' => '<title>(.*)<\/title>'];, the __setFields() method will add the title field and pull all values found inside the <title>*</title> tags from the HTML response body. So, if there were two title tags in the URL data, the __setFields() method would add the field and its values to the document as follows:

['title'] => [ 0 => 'title x', 1 => 'title y' ]

If we have the WebBot class configuration variable—conf_include_document_field_raw_values—set to true, the method will also add the raw values (it will include the tags or other strings as defined in the field's regular expression patterns) as a separate element, for example:

['title'] => [ 0 => 'title x', 1 => 'title y', 'raw' => [ 0 => '<title>title x</title>', 1 => '<title>title y</title>' ] ]

The preceding code is very useful when we want to extract specific data (or field values) from URL data.
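A minimal sketch of this kind of field extraction, using PHP's preg_match_all(), is shown below. The function name extractFieldsSketch and its $includeRaw parameter are hypothetical. Wrapping each pattern in / delimiters and adding the i, s, and U modifiers (case-insensitive, dot matches newlines, ungreedy) is also an assumption, made so that two title tags produce two separate values.

```php
<?php
// Hypothetical sketch of field extraction; not the book's __setFields() code.
function extractFieldsSketch(string $body, array $fields, bool $includeRaw = false): array
{
    $result = [];

    foreach ($fields as $name => $pattern) {
        $result[$name] = [];

        // $matches[0] holds the full (raw) matches, $matches[1] the captured values
        if (preg_match_all('/' . $pattern . '/isU', $body, $matches) > 0) {
            $result[$name] = $matches[1];

            if ($includeRaw) {
                $result[$name]['raw'] = $matches[0];
            }
        }
    }

    return $result;
}

$body = '<title>title x</title><p>text</p><title>title y</title>';
$fields = ['title' => '<title>(.*)<\/title>'];

print_r(extractFieldsSketch($body, $fields, true));
```

Because the pattern is wrapped in / delimiters, any forward slash inside a field pattern has to be escaped, as it is in the title example.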

To conclude the Document class, we have two more methods as follows:

  • getFields(): This method simply returns the fields and field values
  • getHttpResponse(): This method can be used to get the \HTTP\Response object that was originally set by the WebBot execute() method

These methods give us access to the document's internal data and objects if we need to work with them directly.

The WebBot bootstrap file

Now we will create a bootstrap.php file (at project_directory/lib/WebBot/) to load the HTTP package and our WebBot package classes, and set our WebBot class configuration settings:

<?php
namespace WebBot;

/**
 * Bootstrap file
 *
 * @package WebBot
 */

// load our HTTP package
require_once './lib/HTTP/bootstrap.php';

// load our WebBot package classes
require_once './lib/WebBot/Document.php';
require_once './lib/WebBot/WebBot.php';

// set unlimited execution time
set_time_limit(0);

// set default timeout to 30 seconds
\WebBot\WebBot::$conf_default_timeout = 30;

// set delay between fetches to 1 second
\WebBot\WebBot::$conf_delay_between_fetches = 1;

// do not use HTTPS protocol (we'll use HTTP protocol)
\WebBot\WebBot::$conf_force_https = false;

// do not include document field raw values
\WebBot\WebBot::$conf_include_document_field_raw_values = false;

We use our HTTP package to handle HTTP requests and responses. In the previous two sections, you saw how our WebBot class uses HTTP requests to fetch the data and how the HTTP Response object stores the fetched data. That is why we need to include the bootstrap file to load the HTTP package properly.

Then, we load our WebBot package files. Because our WebBot class uses the Document class, we load that class file first.

Next, we use the built-in PHP function set_time_limit() to tell the PHP interpreter that we want to allow unlimited execution time for our script. You don't necessarily have to use unlimited execution time. However, for testing reasons, we will use it for this example.

Finally, we set the WebBot class configuration settings. These settings are used by the WebBot object internally to make our bot work as we desire. We should always make the configuration settings as simple as possible to help other developers understand. This means we should also include detailed comments in our code to ensure easy usage of package configuration settings.

We have set up four configuration settings in our WebBot class. These are static and public variables, meaning that we can set them from anywhere after we have included the WebBot class, and once we set them they will remain the same for all WebBot objects unless we change the configuration variables. If you do not understand the PHP keyword static, now would be a good time to research this subject.

  • The first configuration variable is conf_default_timeout. This variable is used to globally set the default timeout (in seconds) for all WebBot objects we create. The timeout value tells the \HTTP\Request class how long it should continue trying to send a request before stopping and treating it as a bad or timed-out request. By default, this configuration setting value is set to 30 (seconds).
  • The second configuration variable, conf_delay_between_fetches, is used to set a time delay (in seconds) between fetches (or HTTP requests). This can be very useful when gathering a lot of data from a website or web service. For example, say you had to fetch one million documents from a website. You wouldn't want to unleash your bot on that kind of mission without fetch delays, because the sheer volume of requests could cause problems for that website. By default, this value is set to 0, or no delay.
  • The third configuration variable, conf_force_https, can be used to force the HTTPS protocol when set to true. As mentioned earlier, this will not override any protocol that is already set in the URL. If the conf_force_https variable is set to false, the HTTP protocol will be used. By default, this value is set to false.
  • The fourth and final configuration variable, conf_include_document_field_raw_values, will force the Document object to include the raw values gathered from the fields' regular expression patterns when set to true. We discussed this setting in The WebBot Document class section earlier in this article. By default, this value is set to false.
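To illustrate why public static configuration variables apply to every object, here is a small sketch. The class ConfigSketch is hypothetical; only the variable names and their default values are taken from the descriptions above.

```php
<?php
// Hypothetical sketch: static properties belong to the class, not to each object.
class ConfigSketch
{
    public static $conf_default_timeout = 30;
    public static $conf_delay_between_fetches = 0;

    public function timeout(): int
    {
        // every instance reads the same class-level value
        return self::$conf_default_timeout;
    }
}

$first = new ConfigSketch();

// changing the static value affects objects created before and after the change
ConfigSketch::$conf_default_timeout = 10;
$second = new ConfigSketch();

var_dump($first->timeout() === $second->timeout());
```

This shared state is what lets a bootstrap file set the values once for the whole script.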

Summary

In this article you have learned how to get started with building your first bot using HTTP requests and responses.

Instant Simple Botting with PHP [Instant] Enhance your botting skills and create your own web bots with PHP with this book and ebook
Published: September 2013
About the Author


Shay Michael Anderson

Shay Michael Anderson has been programming and developing software since 1999. He quickly decided on software development as his career and enrolled in a college. He achieved his Bachelor of Science in Software Engineering degree through his studies at Oregon Tech and Colorado Tech. He then received a Master of Science in Software Systems Management from Colorado Tech. While earning his degrees in college, he achieved the undergraduate certificates for Software Engineering Application, Software Engineering Process, Object-Oriented Methods, and Unix Network Administration, and the graduate certificates for Systems Analysis and Integration, Network and Telecommunications, Data Management, and Project Management. Ever since he graduated from college he has been employed as a Web Application Developer, a Software Engineer, and a senior Software Engineer.

He is currently working as a senior Software Engineer for a large e-commerce and retail company. He develops and manages massive software systems, which are backed by a database cluster storing over a billion records. He also publishes open source software on his website, www.shayanderson.com.
