Whitelists

Html Laundry is a strict washing machine: it removes everything it does not recognize, and it recognizes only what is defined in a whitelist.
You can create a very loose whitelist, but you cannot declare that any element or any attribute is allowed; each one has to be listed.
In short, how safe the cleaned HTML is depends on the whitelist you use. Do not be afraid of this, though: you can cut out dangerous tags very effectively.

A whitelist is an XML file. Its root element is tags. The child elements of tags are the allowed HTML elements. The following whitelist allows only the b and img elements.
<tags>
  <b></b>
  <img />
</tags>
Based on the whitelist above, here are some examples:

before clean: Hello <b>world</b>!
after clean:  Hello <b>world</b>!

before clean: Hello <b>world !
after clean:  Hello <b>world !</b>

before clean: Hello <span><b>world</b>!</span>
after clean:  Hello <b>world</b>!

before clean: Hello <b>world</b>! <img src="http://i1.codeplex.com/Images/v17601/logo-home.png" />
after clean:  Hello <b>world</b>! <img />

In the last example the img element lost its src attribute, because the allowed attributes have to be defined in the whitelist too.

<tags>
  <b></b>
  <img src="" />
</tags>
Based on the whitelist above the examples are:

before clean: Hello <b>world</b>! <img src="http://i1.codeplex.com/Images/v17601/logo-home.png" />
after clean:  Hello <b>world</b>! <img src="http://i1.codeplex.com/Images/v17601/logo-home.png" />

before clean: Hello <b>world</b>! <img src="http://i1.codeplex.com/Images/v17601/logo-home.png" alt="logo" />
after clean:  Hello <b>world</b>! <img src="http://i1.codeplex.com/Images/v17601/logo-home.png" />
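
The element and attribute-name filtering described so far can be sketched in Python. This is not Html Laundry's implementation, only a minimal illustration built on the standard html.parser module; the names WhitelistCleaner and clean, and the dict that stands in for the XML whitelist, are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr"}  # elements written without a closing tag


class WhitelistCleaner(HTMLParser):
    """Keeps only whitelisted elements and whitelisted attribute names."""

    def __init__(self, allowed):
        super().__init__()
        self.allowed = allowed   # hypothetical: tag name -> set of attribute names
        self.out = []            # output fragments
        self.open_tags = []      # allowed elements that are still open

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed:
            return  # unknown element: drop the tag, keep the text inside it
        kept = ""
        for name, value in attrs:
            if name in self.allowed[tag]:
                kept += ' {}="{}"'.format(name, escape(value or ""))
        if tag in VOID_TAGS:
            self.out.append("<{}{} />".format(tag, kept))
        else:
            self.out.append("<{}{}>".format(tag, kept))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # end tag of a dropped, void, or never-opened element
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</{}>".format(top))
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def clean(html_text, allowed):
    parser = WhitelistCleaner(allowed)
    parser.feed(html_text)
    parser.close()
    while parser.open_tags:  # close elements the input left open
        parser.out.append("</{}>".format(parser.open_tags.pop()))
    return "".join(parser.out)


# Only b and img are allowed, and img may keep its src attribute.
whitelist = {"b": set(), "img": {"src"}}
print(clean("Hello <span><b>world</b>!</span>", whitelist))
print(clean('Hello <b>world ! <img src="/logo.png" alt="logo" />', whitelist))
```

The sketch drops unknown tags but keeps their text, which matches the span example above; what a real cleaner should do with the content of elements such as script is a separate decision that this sketch does not address.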

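In the last before/after pair for this whitelist the alt attribute is removed: the only attribute that remains is the one defined in the whitelist.
However, allowing any value for an attribute can lead to serious XSS problems. You can use a regular expression (regex) to validate an attribute's value.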

<tags>
  <b></b>
  <img src="(/|mailto\:|(news|(ht|f)tp(s?))\://){0,1}[@\w\.]+" />
</tags>

Based on the whitelist above the examples are:

before clean: Hello <b>world</b>! <img src="http://i1.codeplex.com/Images/v17601/logo-home.png" />
after clean:  Hello <b>world</b>! <img src="http://i1.codeplex.com/Images/v17601/logo-home.png" />

before clean: Hello <b>world</b>! <img src="javascript:alert('wow')" />
after clean:  Hello <b>world</b>! <img />

In the last example the img element lost its src attribute because the attribute's value did not match the given regex.
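
The idea of keeping an attribute only when its value matches a regex can be sketched in Python as follows. How Html Laundry applies the pattern (to the whole value or to a part of it) is not stated on this page, so the sketch assumes the whole value must match. SRC_PATTERN is a simplified, hypothetical pattern rather than the one from the whitelist above, and is_allowed is a hypothetical helper name.

```python
import re

# Hypothetical, simplified pattern: an http, https, or ftp URL made of
# word characters, dots, slashes, and hyphens.
SRC_PATTERN = re.compile(r"(https?|ftp)://[\w./-]+")


def is_allowed(value, pattern=SRC_PATTERN):
    """Return True only if the entire attribute value matches the pattern."""
    return pattern.fullmatch(value) is not None


print(is_allowed("http://i1.codeplex.com/Images/v17601/logo-home.png"))
print(is_allowed("javascript:alert('wow')"))

# Why whole-value matching matters: a substring search with a loose pattern
# finds "javascript" inside the dangerous value and would accept it.
LOOSE = re.compile(r"[\w.]+")
print(LOOSE.search("javascript:alert('wow')") is not None)
```

Whichever matching mode a cleaner uses, it is worth testing each whitelist pattern against known dangerous values such as javascript: URLs before relying on it.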

The style attribute is special, and Html Laundry's whitelist treats it differently. In the whitelist, the value of a style attribute can hold several regexes at once, separated by semicolons. In the next whitelist the b element has a style attribute with two regex filters. This allows a style attribute with font-size and font-style values, each in its own exact format.

<tags>
  <b style="font-size:\d+((em)|(pt)|(px))?;font-style:(italic)|(normal)"></b>
  <img src="(/|mailto\:|(news|(ht|f)tp(s?))\://){0,1}[@\w\.]+" />
</tags>
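
One way to picture the semicolon-separated style filters is the Python sketch below. It rests on several assumptions that this page does not confirm: each declaration of the incoming style is checked on its own, a declaration is kept only if it matches one filter completely, declarations that match nothing are dropped individually, and whitespace is ignored. The font-style filter is written here with the alternation grouped explicitly, as font-style:((italic)|(normal)), which is an assumption about the intended meaning. The name clean_style is hypothetical.

```python
import re

# Filters as in the whitelist, except that the font-style alternation is
# grouped explicitly (an assumption about what is intended).
STYLE_FILTER = r"font-size:\d+((em)|(pt)|(px))?;font-style:((italic)|(normal))"


def clean_style(style, style_filter=STYLE_FILTER):
    """Keep only the declarations that fully match one of the filters."""
    # Splitting on ";" assumes that no individual regex contains a semicolon.
    patterns = [re.compile(p) for p in style_filter.split(";") if p]
    kept = []
    for declaration in style.split(";"):
        declaration = re.sub(r"\s+", "", declaration)  # ignore whitespace
        if declaration and any(p.fullmatch(declaration) for p in patterns):
            kept.append(declaration)
    return ";".join(kept)


print(clean_style("font-size: 12px; font-style: italic"))
print(clean_style("font-size:12px;color:red"))
print(clean_style("background:url(javascript:alert(1))"))
```

Under these assumptions the first call keeps both declarations, the second keeps only font-size, and the third returns an empty string.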

There is a special element called attributes. Because attribute values can contain very complex regex filters, and the same attributes can be used in many elements, there is a way to define a default regex for an attribute. The default regex is used if and only if the allowed element defines an attribute with the same name and an empty value. The next whitelist is therefore equivalent to the previous one.
<tags>
  <b style=""></b>
  <img src="" />
  <attributes
    style="font-size:\d+((em)|(pt)|(px))?;font-style:(italic)|(normal)"
    src="(/|mailto\:|(news|(ht|f)tp(s?))\://){0,1}[@\w\.]+"
  />
</tags>
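
The default lookup can be sketched in Python with the standard xml.etree.ElementTree module. This only illustrates the rule stated above (an empty attribute value falls back to the default of the same name); load_whitelist is a hypothetical helper, not part of Html Laundry.

```python
import xml.etree.ElementTree as ET

EXPLICIT = r"""
<tags>
  <b style="font-size:\d+((em)|(pt)|(px))?;font-style:(italic)|(normal)"></b>
  <img src="(/|mailto\:|(news|(ht|f)tp(s?))\://){0,1}[@\w\.]+" />
</tags>
"""

WITH_DEFAULTS = r"""
<tags>
  <b style=""></b>
  <img src="" />
  <attributes
    style="font-size:\d+((em)|(pt)|(px))?;font-style:(italic)|(normal)"
    src="(/|mailto\:|(news|(ht|f)tp(s?))\://){0,1}[@\w\.]+"
  />
</tags>
"""


def load_whitelist(xml_text):
    """Return {element: {attribute: regex}} with the defaults filled in."""
    root = ET.fromstring(xml_text)
    defaults = {}
    for element in root.findall("attributes"):
        defaults.update(element.attrib)
    rules = {}
    for element in root:
        if element.tag == "attributes":
            continue  # holds the defaults; it is not an allowed HTML element
        # An empty value uses the default with the same name. If there is no
        # default, it stays empty, meaning the attribute has no regex filter.
        rules[element.tag] = {
            name: value or defaults.get(name, "")
            for name, value in element.attrib.items()
        }
    return rules


# Both whitelists resolve to the same rules.
print(load_whitelist(EXPLICIT) == load_whitelist(WITH_DEFAULTS))
```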

Last edited Mar 10, 2011 at 11:56 AM by Tocsi, version 5
