Filtering user-submitted HTML with HTML Purifier

This article was originally published in the June 2012 issue of php|architect magazine.

One of the trickiest types of user input to filter is HTML.
Between WYSIWYG editors, Cross-Site Scripting (XSS) attacks and pasting from
Word, it’s enough to make you pull your hair out. In this article I’ll cover
how to set up and use HTML Purifier, a PHP library that makes filtering
and transforming HTML a breeze.

Solving the problem of HTML input

If you’ve been programming for any period of time, I’m sure you’ve come across
the problem of accepting HTML. Tools like TinyMCE or CKEditor do a great job
of giving your users a WYSIWYG editor. You can limit which buttons appear on
the editor, but if users copy and paste from a word processor you end up with
all sorts of weird markup in the submission. On top of that, you have to
assume that all input coming from the user is malicious, so you also need a
way to make sure the submission is safe.

One solution has been to abandon HTML entirely and just accept plain text: run
the input through nl2br and hope for the best. Another approach is to use
a lightweight syntax like Markdown or BBCode, but most users I’ve dealt with
have difficulty figuring out the syntax.
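As a rough illustration of the plain-text approach, the sketch below escapes the submission before calling nl2br, so that any markup is displayed as literal text rather than interpreted. The function name plain_text_to_html is made up for this example, and the escaping step is my assumption about how you would use nl2br safely, not something nl2br does for you.

```php
<?php
// Hypothetical helper: treat the submission as plain text.
// Escape every HTML special character first, then turn newlines into <br />.
function plain_text_to_html($input)
{
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
    return nl2br($escaped);
}

echo plain_text_to_html("Hello <b>world</b>\nSecond line");
```

This is safe, but users lose all formatting, which is usually why plain text gets rejected as a solution.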

Another solution would be to try your best to clean up the input with a
combination of HTML Tidy, strip_tags and regular expressions. You end up
with something that you think works, but a few weeks later something sneaks
in and breaks everything. Either that, or you have a solution that’s so
complicated and fragile that nobody but you can work with it.
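To see why this approach tends to leak, here is a small sketch using strip_tags with an allowed-tag list. The input string is invented for the example. strip_tags removes tags that are not on the list, but it does not inspect the attributes of the tags it keeps, and it leaves the text content of removed tags in place.

```php
<?php
// Invented sample input: a link with an inline event handler, plus a script tag.
$input = '<a href="#" onclick="alert(1)">click</a><script>alert(2)</script>';

// Allow only <a> tags. Attributes on allowed tags are not filtered.
$result = strip_tags($input, '<a>');

echo $result;
```

The onclick handler survives untouched, so you would still need more code on top of this to deal with attributes.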

In this article I’m going to show another solution, a PHP library called
HTML Purifier. HTML Purifier actually parses the HTML and applies a secure yet
permissive whitelist that filters out the nasty stuff that can be present in
user-submitted HTML. It will also clean up your HTML and make sure it’s
standards compliant.

Best of all, it’s open source and written in PHP, so if it does something that
you want to behave differently you are free to hack and extend it to your
heart’s content.

Installing HTML Purifier

HTML Purifier is very simple to get set up and configured. It requires a
minimum of PHP 5.0.5 and no special extensions.

There are many ways to download and use HTML Purifier. They are all described
at http://htmlpurifier.org/download but I will cover two common ways to
install it here.

The first way is to just download it. The current version as of writing this
article is 4.4.0, which you can download from
http://htmlpurifier.org/download. Either the full or lite version will do.
The only difference is that the lite version does not include user
documentation, unit tests and such.

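The other way of installing it is to use PEAR. At the time of writing, the
download page gives the commands as pear channel-discover htmlpurifier.org
followed by pear install hp/HTMLPurifier.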

You’ll also need to make the folder HTMLPurifier/DefinitionCache/Serializer
writable by the web server. This is necessary whether you download the zip file
or install via PEAR.

You then just include the file HTMLPurifier.auto.php and you’re ready to
rock.

Getting Started

The minimum you need to do to get started using HTML Purifier is as follows:
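The original listing is not reproduced here, so what follows is a minimal sketch of the basic usage. It assumes HTMLPurifier.auto.php can be found on your include path (adjust the path to match your install), and the $dirty_html string is an invented sample.

```php
<?php
// Assumption: the library directory is on the include path.
require_once 'HTMLPurifier.auto.php';

// Invented sample input with an unclosed tag, an event handler and a script.
$dirty_html = '<p onclick="alert(1)">Hello <b>world<script>alert(2)</script>';

// Build a purifier with the default configuration and clean the input.
$config     = HTMLPurifier_Config::createDefault();
$purifier   = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

echo $clean_html;
```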

The default configuration will clean up the markup: missing end tags will be
added and invalid nesting will be corrected. Unsafe tags and attributes like
<script> or onclick will be stripped out.

If you take a look at Listing 1 you’ll see an example of some
dirty HTML, and in Listing 2 the clean HTML generated by the default
configuration. It does a good job: if you left it like this you would be
safe from Cross-Site Scripting attacks, and the missing end tags and
incorrectly nested tags would be fixed.

But there are probably some things that you would like to change.

Configuration

Next we’re going to start configuring HTML Purifier in order to customize it to
our particular needs. We want our output to be XHTML 1.0 Strict and we want to
whitelist special as the only class that is allowed.

The following is how we set up custom configuration:
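The original listing is not reproduced here. A sketch of the configuration described above could look like this, again assuming HTMLPurifier.auto.php is on the include path; the $dirty_html value is an invented sample.

```php
<?php
// Assumption: the library directory is on the include path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();

// Output should conform to XHTML 1.0 Strict.
$config->set('HTML.Doctype', 'XHTML 1.0 Strict');

// Only the class "special" is allowed; any other class value is removed.
$config->set('Attr.AllowedClasses', 'special');

$purifier = new HTMLPurifier($config);

// Invented sample input.
$dirty_html = '<p class="special other"><u>Hello</u> world';
echo $purifier->purify($dirty_html);
```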

We start out by creating a default config object with
HTMLPurifier_Config::createDefault(), then we change the
configuration variables with $config->set(). There are a lot of
configuration options, which you can read about at
http://htmlpurifier.org/live/configdoc/plain.html. The two that we’re using
here are HTML.Doctype and Attr.AllowedClasses.

The first config option that we set is HTML.Doctype, which can be set to
“HTML 4.01 Transitional”, “HTML 4.01 Strict”, “XHTML 1.0 Transitional”, “XHTML
1.0 Strict” or “XHTML 1.1”. Unfortunately, at this time there is no HTML5
support. However, since HTML Purifier is open source you are free to add that
functionality yourself and hopefully contribute it back.

This will change what tags are allowed and in some cases will transform your
tags to be standards compliant. In my example one of the changes it makes is
converting my <u> tag to <span style="text-decoration:underline;">.

The second option we set is Attr.AllowedClasses, which takes a
case-insensitive, comma-separated list of all classes you want to allow in
your clean HTML. In my example we’re just allowing the class special.

Now take a look at Listing 3 to see how our newly configured clean HTML looks.

Tag Transformation and Adding Attributes

The output from our configuration is pretty good, but I have some more
modifications I’d like to make. The <u> tag was converted because it’s not
allowed in the XHTML 1.0 Strict standard, but I’d also prefer the <b> and
<i> tags to be converted to <strong> and <em> respectively.

Thanks to the flexibility of HTML Purifier this is an easy task to complete,
but it requires that we modify the HTML definition. The HTML definition,
according to the documentation, is the "Definition of the purified HTML that
describes allowed children, attributes, and many other things."

In our case we’re going to modify the HTML definition, adding two tag
transformations in order to accomplish our conversions. The way to do this is
by inserting the following after the $config definition:
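The original listing is not reproduced here. The sketch below shows one way to register the two transformations, assuming HTMLPurifier.auto.php is on the include path. It fetches the raw definition with getHTMLDefinition(true); as far as I can tell, a definition edited this way is not cached unless you also set HTML.DefinitionID, so treat the caching behaviour as something to verify for your version.

```php
<?php
// Assumption: the library directory is on the include path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Strict');
$config->set('Attr.AllowedClasses', 'special');

// Ask for the raw (editable) HTML definition. This has to happen
// before the config is used to purify anything.
$def = $config->getHTMLDefinition(true);

// Convert <b> to <strong> and <i> to <em>.
$def->info_tag_transform['b'] = new HTMLPurifier_TagTransform_Simple('strong');
$def->info_tag_transform['i'] = new HTMLPurifier_TagTransform_Simple('em');

$purifier = new HTMLPurifier($config);

// Invented sample input.
echo $purifier->purify('<p><b>bold</b> and <i>italic</i></p>');
```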

In the example you’ll see that we’re defining two new keys in the
$def->info_tag_transform array, one for <b> and one for <i>. To each we
assign an instance of HTMLPurifier_TagTransform_Simple that defines
what we’re transforming the tag to.

If you dig into the source code you’ll see that the library is actually using
this functionality extensively. For instance they use:
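The exact line from the library source is not reproduced here. As a stand-in, the sketch below builds the equivalent transform ourselves, using the two-argument form of the HTMLPurifier_TagTransform_Simple constructor. Registering it on our own $def is purely for illustration, since the library already does this internally for the XHTML doctypes.

```php
<?php
// Assumption: the library directory is on the include path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$def    = $config->getHTMLDefinition(true);

// First argument: the tag to transform to.
// Second argument: CSS to place in the new tag's style attribute.
$def->info_tag_transform['u'] = new HTMLPurifier_TagTransform_Simple(
    'span',
    'text-decoration:underline;'
);
```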

This is how the <u> tags are converted in the definition for XHTML: the first
parameter transforms the tag to a <span> and the second parameter
defines the style attribute.

Now we’re getting somewhere. However by default HTML Purifier is filtering out
the target attribute from our <a> tag. The way to fix this is to add an
attribute to the HTML definition. This is done by adding the following line to
our HTML definition modifications:
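The original line is not reproduced here. A sketch of the call, assuming HTMLPurifier.auto.php is on the include path, is shown below. The particular list of permitted values after Enum# is my choice for the example.

```php
<?php
// Assumption: the library directory is on the include path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Doctype', 'XHTML 1.0 Strict');

$def = $config->getHTMLDefinition(true);

// Allow a target attribute on <a>, restricted to a fixed set of values.
// The value list here is an example choice.
$def->addAttribute('a', 'target', 'Enum#_blank,_self,_parent,_top');

$purifier = new HTMLPurifier($config);

// Invented sample input.
echo $purifier->purify('<a href="http://example.com/" target="_blank">link</a>');
```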

The addAttribute method adds a custom attribute to a pre-existing element.
The first param is which element we’re modifying, the second param is what
attribute we’re adding and the third param is the definition of what we’re
allowing through.

In the example we’re only allowing the specific values that we’ve defined. If
we didn’t want to be so strict we could have set our third param as “Text” and
that would have allowed any text to pass through.

I was not able to find great documentation about the available definitions so I
had to dive into the source code. If you’re interested in learning more check
out the class HTMLPurifier_AttrTypes.

Take a look at Listing 4 to see the final clean HTML after
our modifications of the HTML definition.

Conclusion

As you can see from my examples, HTML Purifier is a workhorse capable of a lot
of sophisticated HTML filtering and transforming. The documentation is great if
you’re just doing some standard filtering and tweaking, but if you need to get
into some heavy tweaking it will probably require some code diving.