
You're reading from Go Web Scraping Quick Start Guide

Product type: Book
Published in: Jan 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789615708
Edition: 1st Edition
Author (1)
Vincent Smith

Vincent Smith has been a software engineer for 10 years, having worked in fields ranging from health and IT to machine learning and large-scale web scrapers. He has worked for both Fortune 500 companies and start-ups and has sharpened his skills with the best of both worlds. While obtaining a degree in electrical engineering, he learned the foundations of writing good code through his Java courses. These basics helped spur his move into software development early in his professional career in order to provide support for his team. He fell in love with the process of teaching computers how to behave, which set him on the path he still walks today.

The Request/Response Cycle

Before you can build a web scraper, take a moment to think about how the internet works. At its core, the internet is a network of connected computers, discoverable through Domain Name System (DNS) servers. When you want to visit a website, your browser sends the hostname from the website's URL to a DNS server, the hostname is translated into an IP address, and your browser then sends a request to the machine at that IP address. The machine, called a web server, receives and inspects the request and decides what to send back to your browser. Your browser then parses the information sent by the server and displays content on your screen depending on the format of the data. The web server and browser are able to communicate because both adhere to a global set of rules called the Hypertext Transfer Protocol (HTTP). In this chapter, you will learn some of the key points...
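
As a rough illustration of the first step in that cycle, the following sketch pulls out the part of a URL that would be handed to DNS. The helper name hostnameOf is hypothetical, and the program deliberately stops short of a real lookup so that it runs without network access.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostnameOf (a hypothetical helper) returns the host portion of a URL,
// which is the name a client asks DNS to translate into an IP address.
func hostnameOf(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	return u.Hostname(), nil
}

func main() {
	host, err := hostnameOf("http://www.example.com/index.html")
	if err != nil {
		fmt.Println("could not parse URL:", err)
		return
	}
	fmt.Println("name sent to DNS:", host)
	// A real lookup could then use net.LookupHost(host), which needs
	// network access and is left out of this sketch.
}
```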

What do HTTP requests look like?

When a client (such as a browser) requests a web page from a server, it sends an HTTP request. The format of such a request defines an action, a resource, and the version of the HTTP protocol. Some HTTP requests include extra information for the server to process, such as a query or specific metadata. Depending on the action, you may also be sending new information for the server to process.

HTTP request methods

There are nine current HTTP request methods, which define a general action desired by the client. Each method carries a particular connotation as to how the server should process the request. The nine request methods are as follows:

  • GET
  • POST
  • PUT
  • DELETE
  • HEAD
  • CONNECT
  • TRACE...

What do HTTP responses look like?

When the server responds to your request, it will provide a status code, some response headers, and the content of the resource in most cases. Staying with our previous request for http://www.example.com/index.html, you will be able to see what a typical response looks like, section by section.

Status line

The first line of an HTTP response is called the status line and typically looks like this:

HTTP/1.1 200 OK

First, it tells you what version of the HTTP protocol the server is using. This should match the version sent in the client's HTTP request. In this case, our server is using version 1.1. The next portion is the HTTP status code. This is a code used to indicate the status of the...
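
The parts of the status line can be seen by parsing a raw response with Go's net/http package. In this sketch the helper parseStatus is hypothetical, and the Content-Length header is added only so that the sample text forms a complete response.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// parseStatus (a hypothetical helper) reads a raw HTTP response and returns
// the protocol version, the numeric status code, and the full status text.
func parseStatus(raw string) (proto string, code int, status string, err error) {
	resp, err := http.ReadResponse(bufio.NewReader(strings.NewReader(raw)), nil)
	if err != nil {
		return "", 0, "", err
	}
	defer resp.Body.Close()
	return resp.Proto, resp.StatusCode, resp.Status, nil
}

func main() {
	// The Content-Length header is an assumption added for this sketch.
	raw := "HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n"
	proto, code, status, err := parseStatus(raw)
	if err != nil {
		fmt.Println("could not parse response:", err)
		return
	}
	fmt.Println("version:", proto)
	fmt.Println("code:", code)
	fmt.Println("status:", status)
}
```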

What are HTTP status codes?

HTTP status codes are used to inform the HTTP client of the status of the HTTP request. In some cases, the HTTP server needs to inform the client that the request was not understood, or that extra actions need to be taken in order to get a full response. The HTTP status codes are divided into five separate ranges, each one covering a specific type of response.
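
Because the range is determined by the first digit of the code, a scraper can sort responses with simple integer division. The following is a minimal sketch; the function name statusClass and its labels are chosen for this illustration, covering the standard classes from 1xx through 5xx.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusClass (a hypothetical helper) names the range a status code falls
// in, based on its first digit.
func statusClass(code int) string {
	switch code / 100 {
	case 1:
		return "informational"
	case 2:
		return "success"
	case 3:
		return "redirection"
	case 4:
		return "client error"
	case 5:
		return "server error"
	default:
		return "unknown"
	}
}

func main() {
	codes := []int{
		http.StatusOK,
		http.StatusMovedPermanently,
		http.StatusNotFound,
		http.StatusInternalServerError,
	}
	for _, code := range codes {
		// http.StatusText returns the standard reason phrase for a code.
		fmt.Printf("%d %s: %s\n", code, http.StatusText(code), statusClass(code))
	}
}
```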

100–199 range

These codes are used to provide information to the HTTP client on how to deliver a request. These codes are usually processed by the HTTP client itself and will be handled before your web scraper needs to worry about them.

For example, the client may prefer that requests be sent using the HTTP/2 protocol and request...

What do HTTP requests/responses look like in Go?

Now that you are familiar with the basics of HTTP requests and responses, it's time to see what this looks like in Go. The standard library in Go provides a package named net/http, which contains all of the tools you will need to build a client that is capable of requesting pages from web servers and processing the responses with very little effort.

Let's take a look at the example from the beginning of this chapter, where we were accessing the web page at http://www.example.com/index.html. The underlying HTTP request instructs the web server at www.example.com to GET the index.html resource:

GET /index.html HTTP/1.1
Host: www.example.com

Using the Go net/http package, you would use the following line of code:

r, err := http.Get("http://www.example.com/index.html")

The Go programming language allows for multiple variables...

Summary

In this chapter, we covered the basic formats of HTTP requests and responses. We also saw how HTTP requests are made in Go, as well as how the http.Response struct relates to real HTTP responses. Finally, we created a small program that sent an HTTP request to http://www.example.com/index.html and processed the HTTP response. For the full HTTP specification, I encourage you to visit https://www.w3.org/Protocols/.

In Chapter 3, Web Scraping Etiquette, we look at the best practices for being a good citizen of the web.
