I believe most developers already know how important it is to escape special HTML characters when displaying user-supplied content on a website, in order to prevent XSS attacks. In PHP, this can be done simply by calling htmlspecialchars([dangerous data]). However, I haven't seen many articles that explain what to do when your website uses a rich text editor and users need to enter HTML themselves (as on a blog).
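For reference, here is a minimal sketch of the plain escaping approach. The variable names and the sample input are made up for illustration; the ENT_QUOTES flag and the explicit 'UTF-8' charset are my own choices rather than something the call requires.

```php
<?php
// Hypothetical user input containing a script tag.
$userInput = '<script>alert("xss")</script>';

// Escape <, >, & and both kinds of quotes so the browser shows the text
// instead of interpreting it as markup.
$escaped = htmlspecialchars($userInput, ENT_QUOTES, 'UTF-8');

echo $escaped;
// prints: &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

Because every tag is turned into text, this approach cannot be used as-is when users are supposed to submit formatted HTML.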

The easiest way to accept user-entered HTML safely is to strip unwanted tags such as <script>, <iframe> or <embed>. There is a really good PHP library called HTML Purifier that is designed to solve exactly this problem.

To use it, you can either download the latest archive from HTML Purifier's official website or, if you are using a framework (such as ZF2 or Symfony 2), install it via Composer (https://packagist.org/packages/ezyang/htmlpurifier). Once it's installed, you can use the following code to purify your HTML data.

// Load the library first: Composer's vendor/autoload.php, or
// HTMLPurifier.auto.php if you downloaded the archive.
$config = HTMLPurifier_Config::createDefault();

// Optional: the following line turns off the definition cache.
// If you would rather keep the cache on for better performance, the
// htmlpurifier/library/HTMLPurifier/DefinitionCache/Serializer folder must be
// writable by the web server (chmod 777 works, but is broader than needed).
$config->set('Cache.DefinitionImpl', null);
$purifier = new HTMLPurifier($config);

// Replace $rawHTML with the raw HTML from the user input.
$clean_html = $purifier->purify($rawHTML);

HTML Purifier mainly does two things.
First, it tries to repair the HTML and guarantee that the output is standards-compliant. For example,

<p>test

will be converted to

<p>test</p>

Second, it removes malicious HTML tags to prevent XSS attacks. For example,

<script>alert(0);</script>
<p>line1</p>
<iframe src="somedangerousURL.com"></iframe>

will be converted to

<p>line1</p>

This is a great tool, and it will make your site a lot more secure by protecting it from common XSS attacks. For more information and configuration options, check the official HTML Purifier website.
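To give an idea of what such configuration can look like, here is a rough sketch that restricts the allowed tags and moves the cache to a directory of your own. It assumes the library was installed via Composer (so the autoloader path is a guess about your project layout); the tag list, the cache location and the sample input are made up for illustration. HTML.Allowed and Cache.SerializerPath are configuration directives from the HTML Purifier documentation.

```php
<?php
// Assumption: HTML Purifier was installed via Composer, so the autoloader lives here.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();

// Allow only a small set of tags and attributes (illustrative list, adjust to your editor).
$config->set('HTML.Allowed', 'p,br,strong,em,ul,ol,li,a[href|title]');

// Keep the definition cache on, but store it in a directory we control
// instead of loosening permissions inside the library folder.
$cacheDir = sys_get_temp_dir() . '/htmlpurifier-cache'; // hypothetical location
if (!is_dir($cacheDir)) {
    mkdir($cacheDir, 0755, true);
}
$config->set('Cache.SerializerPath', $cacheDir);

$purifier = new HTMLPurifier($config);

// $rawHTML stands in for whatever the rich text editor submitted.
$rawHTML = '<p onclick="evil()">Hello <strong>world</strong></p><script>alert(0);</script>';
echo $purifier->purify($rawHTML);
```

An allowlist like this is usually easier to reason about than trying to enumerate every dangerous tag, since anything you did not explicitly permit is dropped.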

P.S. Some of you may find that the output contains weird characters after the HTML has been purified. This is not a problem with HTML Purifier; it is usually a problem with your encoding. There are a few things you should check:

  • The encoding PHP sends to the browser, e.g.
    header('Content-Type:text/html; charset=UTF-8');
  • If you store the purified HTML in a database, check the encoding of your database access layer (e.g. if you are using PDO directly) or your database abstraction layer (e.g. Doctrine). Please note that the database itself also needs to be set to UTF-8. I use Doctrine 2, and if you don't know where to set the encoding, it goes next to your database credentials:
        'user' => [username],
        'password' => [password],
        'dbname' => [dbName],
        'host' => [host],
        'charset' => 'utf8'
  • To ensure the website displays all non-ASCII characters correctly, add the following meta tag:
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
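If you are curious where such stray characters come from, the sketch below reproduces one common cause. It rests on my assumption that the text is valid UTF-8 but some layer reads it as a single-byte encoding such as ISO-8859-1. It also assumes the mbstring extension is available, and the variable names are illustrative.

```php
<?php
// A non-breaking space (U+00A0) encoded as UTF-8 is the two bytes C2 A0.
$nbsp = "\xC2\xA0";

// Simulate a layer that wrongly treats those bytes as ISO-8859-1 and
// re-encodes them as UTF-8: each byte becomes its own character.
$misread = mb_convert_encoding($nbsp, 'UTF-8', 'ISO-8859-1');

echo bin2hex($nbsp), "\n";    // c2a0
echo bin2hex($misread), "\n"; // c382c2a0, i.e. "Â" followed by a non-breaking space
```

If every layer from the browser to the database agrees on UTF-8, this double conversion never happens, which is what the checklist above is meant to ensure.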