Class HTMLSanitizer



object --+
         |
        HTMLSanitizer

A filter that removes potentially dangerous HTML tags and attributes from the stream.

>>> from genshi import HTML
>>> from genshi.filters import HTMLSanitizer
>>> html = HTML('<div><script>alert(document.cookie)</script></div>')
>>> print(html | HTMLSanitizer())
<div/>

The default sets of safe tags and attributes can be modified when the filter is instantiated. For example, to allow inline style attributes, the following instantiation would work:

>>> html = HTML('<div style="background: #000"></div>')
>>> sanitizer = HTMLSanitizer(safe_attrs=HTMLSanitizer.SAFE_ATTRS | set(['style']))
>>> print(html | sanitizer)
<div style="background: #000"/>

Note that even in this case, the filter does attempt to remove dangerous constructs from style attributes:

>>> html = HTML('<div style="background: url(javascript:void); color: #000"></div>')
>>> print(html | sanitizer)
<div style="color: #000"/>

This handles HTML entities and unicode escapes in CSS and JavaScript text, among other constructs. However, the style attribute is still excluded by default because it is very hard for such sanitizing to be completely safe, especially considering how much error recovery current web browsers perform.



Instance Methods
 
__init__(self, safe_tags=frozenset(['a', 'abbr', 'acronym', 'address', 'area', 'b', 'bi..., safe_attrs=frozenset(['abbr', 'accept', 'accept-charset', 'accesskey', 'a..., safe_schemes=frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto']), uri_attrs=frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc',...)
Create the sanitizer.
 
__call__(self, stream)
Apply the filter to the given stream.

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Class Variables
  SAFE_TAGS = frozenset(['a', 'abbr', 'acronym', 'address', 'are...
  SAFE_ATTRS = frozenset(['abbr', 'accept', 'accept-charset', 'a...
  SAFE_SCHEMES = frozenset([None, 'file', 'ftp', 'http', 'https'...
  URI_ATTRS = frozenset(['action', 'background', 'dynsrc', 'href...

Properties

Inherited from object: __class__

Method Details

__init__(self, safe_tags=frozenset(['a', 'abbr', 'acronym', 'address', 'area', 'b', 'bi..., safe_attrs=frozenset(['abbr', 'accept', 'accept-charset', 'accesskey', 'a..., safe_schemes=frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto']), uri_attrs=frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc',...)
(Constructor)

 

Create the sanitizer.

The exact set of allowed elements and attributes can be configured.

Parameters:
  • safe_tags - a set of tag names that are considered safe
  • safe_attrs - a set of attribute names that are considered safe
  • safe_schemes - a set of URI schemes that are considered safe
  • uri_attrs - a set of names of attributes that contain URIs
Overrides: object.__init__

__call__(self, stream)
(Call operator)

 
Apply the filter to the given stream.
Parameters:
  • stream - the markup event stream to filter

Class Variable Details

SAFE_TAGS

Value:
frozenset(['a',
           'abbr',
           'acronym',
           'address',
           'area',
           'b',
           'big',
           'blockquote',
...

SAFE_ATTRS

Value:
frozenset(['abbr',
           'accept',
           'accept-charset',
           'accesskey',
           'action',
           'align',
           'alt',
           'axis',
...

SAFE_SCHEMES

Value:
frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto'])

URI_ATTRS

Value:
frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc', 'src'])
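
Since the class variables are frozensets, custom policies can be derived from them with ordinary set operations without modifying the defaults. The sketch below copies the two fully listed values from above so that it runs without Genshi; in real code you would reference `HTMLSanitizer.SAFE_SCHEMES` and `HTMLSanitizer.URI_ATTRS` directly. The attribute name `data-href` is hypothetical.

```python
# Values copied from the listing above (assumption: they match your Genshi version).
SAFE_SCHEMES = frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto'])
URI_ATTRS = frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc', 'src'])

# Narrow the policy: set difference removes schemes you do not want to allow.
web_only_schemes = SAFE_SCHEMES - frozenset(['file', 'ftp'])

# Widen the policy: set union adds a (hypothetical) attribute to check as a URI.
extended_uri_attrs = URI_ATTRS | frozenset(['data-href'])

# The originals are immutable, so the defaults stay untouched.
print(sorted(str(s) for s in web_only_schemes))
print(sorted(extended_uri_attrs))
```

The derived sets can then be passed as `safe_schemes` and `uri_attrs` when instantiating the sanitizer. Note that an attribute listed only in `uri_attrs` would presumably still need to appear in `safe_attrs` to be kept at all.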