Before you can build a web scraper, you must take a moment to think about how the internet works. At its core, the internet is a network of computers connected together, discoverable through Domain Name System (DNS) servers. When you want to visit a website, your browser sends the host name from the website's URL to a DNS server, the host name is translated into an IP address, and your browser then sends a request to the machine at that IP address. The machine, called a web server, receives and inspects the request, and decides what to send back to your browser. Your browser then parses the information sent by the server and displays content on your screen depending on the format of the data. The web server and browser are able to communicate because both adhere to a global set of rules called the Hypertext Transfer Protocol (HTTP). In this chapter, you will learn some of the key points...
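The lookup step described above can be sketched in Go. This is a minimal illustration, not part of the chapter's own code: the helper name hostFromURL is hypothetical, and the addresses returned by the lookup depend on your resolver and network, so no particular output is shown.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// hostFromURL (a hypothetical helper) returns the host name a client
// needs to resolve before it can connect, e.g. "www.example.com".
func hostFromURL(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.Hostname(), nil
}

func main() {
	host, err := hostFromURL("http://www.example.com/index.html")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("host to resolve:", host)

	// Ask the system resolver for the IP addresses behind the host name.
	// This step needs network access and its results vary.
	addrs, err := net.LookupHost(host)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, addr := range addrs {
		fmt.Println("resolved address:", addr)
	}
}
```

In everyday scraping code you rarely call the resolver yourself, because the HTTP client performs this lookup for you when it opens a connection.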
You're reading from Go Web Scraping Quick Start Guide
What do HTTP requests look like?
When a client (such as a browser) requests a web page from a server, it sends an HTTP request. The format for such a request defines an action, a resource, and the version of the HTTP protocol. Some HTTP requests include extra information for the server to process, such as a query or specific metadata. Depending on the action, you may also be sending new information for the server to process.
HTTP request methods
There are nine current HTTP request methods, which define a general action desired by the client. Each method carries a particular connotation as to how the server should process the request. The nine request methods are as follows:
- GET
- POST
- PUT
- DELETE
- HEAD
- CONNECT
- TRACE...
What do HTTP responses look like?
When the server responds to your request, it will provide a status code, some response headers, and the content of the resource in most cases. Staying with our previous request for http://www.example.com/index.html, you will be able to see what a typical response looks like, section by section.
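As a rough preview of those three parts, the sketch below parses a hand-written response with the standard library, so it runs without a network connection. The sample text is an assumption made up for this example, not a capture from a real server, and parseResponse is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// sampleResponse is a hand-written response used only for illustration.
const sampleResponse = "HTTP/1.1 200 OK\r\n" +
	"Content-Type: text/html; charset=UTF-8\r\n" +
	"Content-Length: 13\r\n" +
	"\r\n" +
	"<html></html>"

// parseResponse reads raw response text into an http.Response and
// returns it together with the fully read body.
func parseResponse(raw string) (*http.Response, string, error) {
	resp, err := http.ReadResponse(bufio.NewReader(strings.NewReader(raw)), nil)
	if err != nil {
		return nil, "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, "", err
	}
	return resp, string(body), nil
}

func main() {
	resp, body, err := parseResponse(sampleResponse)
	if err != nil {
		fmt.Println("could not parse response:", err)
		return
	}
	fmt.Println("status:", resp.Status)
	fmt.Println("content type:", resp.Header.Get("Content-Type"))
	fmt.Println("body:", body)
}
```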
Status line
The first line of an HTTP response is called the status line and typically looks like this:
HTTP/1.1 200 OK
First, it tells you what version of the HTTP protocol the server is using. This usually matches the version sent in the client's HTTP request. In this case, our server is using version 1.1. The next portion is the HTTP status code. This is a code used to indicate the status of the...
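A status line such as HTTP/1.1 200 OK can be split into its parts by hand. The following is a simplified sketch under the assumption that the line is well formed and space-separated; the statusLine type and parseStatusLine function are hypothetical names, and real clients such as net/http do this parsing for you.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// statusLine holds the pieces of the first line of an HTTP response.
type statusLine struct {
	Version string // e.g. "HTTP/1.1"
	Code    int    // e.g. 200
	Reason  string // e.g. "OK"; may be empty
}

// parseStatusLine splits a status line into version, code, and reason.
func parseStatusLine(line string) (statusLine, error) {
	// Split into at most three parts so a reason such as "Not Found"
	// stays together.
	parts := strings.SplitN(line, " ", 3)
	if len(parts) < 2 {
		return statusLine{}, fmt.Errorf("malformed status line: %q", line)
	}
	code, err := strconv.Atoi(parts[1])
	if err != nil {
		return statusLine{}, fmt.Errorf("invalid status code %q: %w", parts[1], err)
	}
	sl := statusLine{Version: parts[0], Code: code}
	if len(parts) == 3 {
		sl.Reason = parts[2]
	}
	return sl, nil
}

func main() {
	sl, err := parseStatusLine("HTTP/1.1 200 OK")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("version=%s code=%d reason=%s\n", sl.Version, sl.Code, sl.Reason)
}
```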
What are HTTP status codes?
HTTP status codes are used to inform the HTTP client of the status of the HTTP request. In some cases, the HTTP server needs to inform the client that the request was not understood, or that extra actions need to be taken in order to get a full response. The HTTP status codes are divided into five separate ranges, each one covering a specific type of response.
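Because each range is identified by the first digit of the code, a scraper can decide how to react with a small helper. This is a minimal sketch: statusClass is a hypothetical name, and the class labels follow the wording of the HTTP specification rather than anything defined in this chapter.

```go
package main

import "fmt"

// statusClass maps a status code to its class using the leading digit.
func statusClass(code int) string {
	switch code / 100 {
	case 1:
		return "informational"
	case 2:
		return "success"
	case 3:
		return "redirection"
	case 4:
		return "client error"
	case 5:
		return "server error"
	default:
		return "unknown"
	}
}

func main() {
	for _, code := range []int{101, 200, 301, 404, 503} {
		fmt.Println(code, statusClass(code))
	}
}
```

A scraper might, for example, retry on a server error but skip the page on a client error.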
100–199 range
These codes are used to provide information to the HTTP client on how to deliver a request. These codes are usually processed by the HTTP client itself and will be handled before your web scraper needs to worry about them.
For example, the client may prefer that requests be sent using the HTTP/2 protocol and request...
What do HTTP requests/responses look like in Go?
Now that you are familiar with the basics of HTTP requests and responses, it's time to see what this looks like in Go. The standard library in Go provides a package named net/http, which contains all of the tools you will need to build a client that is capable of requesting pages from web servers and processing the responses with very little effort.
Let's take a look at the example from the beginning of this chapter, where we were accessing the web page at http://www.example.com/index.html. The underlying HTTP request instructs the web server at example.com to GET the index.html resource:
GET /index.html HTTP/1.1
Host: www.example.com
Using the Go net/http package, you would use the following line of code:
r, err := http.Get("http://www.example.com/index.html")
Summary
In this chapter, we covered the basic formats of HTTP requests and responses. We also saw how HTTP requests are made in Go, as well as how the http.Response struct relates to real HTTP responses. Finally, we created a small program that sent an HTTP request to http://www.example.com/index.html and processed the HTTP response. For the full HTTP specification, I encourage you to visit https://www.w3.org/Protocols/.
In Chapter 3, Web Scraping Etiquette, we look at the best practices for being a good citizen of the web.