HTML documents are not always well formed, and malformed or untrusted markup can expose vulnerabilities to attacks such as cross-site scripting (XSS). Luckily, Jsoup provides methods for cleaning such documents. It can also parse incorrect HTML and transform it into correct HTML. Let's have a look at how we can produce a well-formed, safe HTML document.
Our task in this section is to clean the buggy, XSSed HTML:
If you open this file in Chrome or Firefox, you will see the XSS in action. Imagine that users open this HTML and are redirected to a page that attackers fully control: the attackers could then, for example, steal the users' cookies, which is very dangerous.
There are thousands of ways for XSS attacks to occur, so you should sanitize any untrusted HTML; this is where Jsoup does its job.
Load the HTML into a Document object.
Create a whitelist.
Add more allowed tags and attributes.
Create a Cleaner, which will do the cleaning task.
Clean the dirty HTML.
Print the new clean HTML.
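The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the dirty HTML string, the class name CleanDirtyHtml, and the choice to parse from a string rather than a file are all made up for the example, and the sketch adds only iframe to the whitelist. It also assumes a Jsoup release that still ships the Whitelist class; recent releases rename it to Safelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public class CleanDirtyHtml {

    // Hypothetical dirty input, invented for this sketch.
    static final String DIRTY_HTML =
        "<html><head><title>Demo</title></head><body>"
        + "<h1>Jsoup: the HTML parser</h1>"
        + "<iframe src=\"javascript:alert('XSS');\"></iframe>"
        + "<p onclick=\"alert('XSS')\">Hello</p>"
        + "</body></html>";

    static String clean(String dirtyHtml) {
        // Load the Document (from a string here, to stay self-contained).
        Document dirty = Jsoup.parse(dirtyHtml);

        // Create a whitelist of known safe tags.
        Whitelist whitelist = Whitelist.relaxed();

        // Add more allowed tags and attributes (assumption: iframe only).
        whitelist.addTags("iframe")
                 .addAttributes("iframe", "src")
                 .addProtocols("iframe", "src", "http", "https");

        // Create the Cleaner and clean the dirty HTML into a new Document.
        Cleaner cleaner = new Cleaner(whitelist);
        Document clean = cleaner.clean(dirty);

        // Return the cleaned body markup.
        return clean.body().html();
    }

    public static void main(String[] args) {
        // Print the new clean HTML.
        System.out.println(clean(DIRTY_HTML));
    }
}
```

In this sketch the iframe element survives but its javascript: source does not, and the onclick handler on the paragraph is dropped because the relaxed list allows no attributes on p.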
This is the result of the cleaning:
Indeed, the resulting HTML is very clean and there is almost no script at all.
The complete example source code for this section is available at \source\Section05.
The concept of cleaning dirty HTML in Jsoup is to identify the known safe tags and allow only them in the resulting parse tree. These allowed tags are defined in a Whitelist.
Here we define a Whitelist, created through the relaxed() method, which includes the following tags: a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, and ul.
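To see what the relaxed list does on its own, here is a minimal sketch. The input string and the class name RelaxedWhitelistSketch are assumptions made for illustration, and the convenience method Jsoup.clean(String, Whitelist) is used instead of an explicit Cleaner to keep the example short. Tags outside the list, such as script and iframe, should not appear in the output.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class RelaxedWhitelistSketch {

    static String cleanRelaxed(String html) {
        // Only tags known to the relaxed whitelist are kept.
        return Jsoup.clean(html, Whitelist.relaxed());
    }

    public static void main(String[] args) {
        // Hypothetical input: b is on the list, script and iframe are not.
        String dirty = "<p>Hi <b>there</b>"
            + "<script>alert('XSS')</script>"
            + "<iframe src=\"http://example.com/\"></iframe></p>";
        System.out.println(cleanRelaxed(dirty));
    }
}
```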
To add more tags, use the addTags(String... tags) method. As you can see, the list of tags created through relaxed() doesn't include <meta>, <title>, <script>, and <iframe>, so I added them to the list manually by using addTags().
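As a small, hedged illustration of addTags(), the sketch below adds only iframe (an assumption made to keep the example narrow; the class and method names are hypothetical). Note that allowing a tag does not allow its attributes: the src attribute is still stripped until it is whitelisted separately.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class AddTagsSketch {

    static String cleanWithIframeTag(String html) {
        // Start from the relaxed list and allow one extra tag.
        Whitelist whitelist = Whitelist.relaxed().addTags("iframe");
        return Jsoup.clean(html, whitelist);
    }

    public static void main(String[] args) {
        // The iframe element is kept, but its src attribute is not allowed yet.
        System.out.println(
            cleanWithIframeTag("<iframe src=\"http://example.com/\"></iframe>"));
    }
}
```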
If the allowed tags have attributes, you should add the list of allowed attributes for each tag.
One special attribute is src, which contains a URL to a file; it is good practice to restrict it to specific protocols to prevent inline-scripting XSS. Consider the following line from the buggy HTML:
The src attribute is supposed to refer to a URL, but here it does not. The fix is to ensure that the src value is fetched over HTTP or HTTPS. That is what the following line means:
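A minimal sketch of such a protocol rule is shown here. It assumes the tag in question is iframe, and the class name, method name, and example URLs are hypothetical. With addProtocols() in place, a src that uses http should be kept, while one that uses javascript: should be removed.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class SrcProtocolSketch {

    static String cleanIframe(String html) {
        Whitelist whitelist = Whitelist.relaxed()
            .addTags("iframe")
            // Allow the src attribute on iframe...
            .addAttributes("iframe", "src")
            // ...but only when its value is an http or https URL.
            .addProtocols("iframe", "src", "http", "https");
        return Jsoup.clean(html, whitelist);
    }

    public static void main(String[] args) {
        // Kept: the URL uses an allowed protocol.
        System.out.println(
            cleanIframe("<iframe src=\"http://example.com/video\"></iframe>"));
        // Stripped: javascript: is not an allowed protocol.
        System.out.println(
            cleanIframe("<iframe src=\"javascript:alert('XSS');\"></iframe>"));
    }
}
```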
You can chain these calls when adding tags or attributes.
While Whitelist provides the list of safe tags, Cleaner takes a Whitelist as input and uses it to clean the input HTML:
A new Document object is created once the cleaning is done.
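The following sketch illustrates that point under a few assumptions: the input string, the class name CleanerSketch, and the helper method beforeAndAfter are invented for the example. The Cleaner builds a separate Document, so the dirty one is left unchanged.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public class CleanerSketch {

    // Returns the body markup of the dirty document and of the cleaned copy.
    static String[] beforeAndAfter(String html) {
        Document dirty = Jsoup.parse(html);
        Cleaner cleaner = new Cleaner(Whitelist.relaxed());

        // clean() returns a new Document; it does not modify the dirty one.
        Document clean = cleaner.clean(dirty);

        return new String[] { dirty.body().html(), clean.body().html() };
    }

    public static void main(String[] args) {
        String[] result = beforeAndAfter("<p onclick=\"alert('XSS')\">Hello</p>");
        System.out.println("Dirty: " + result[0]);
        System.out.println("Clean: " + result[1]);
    }
}
```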