Class FeedNormalizer::HtmlCleaner
In: lib/html-cleaner.rb
Parent: Object

Methods

Constants

HTML_ELEMENTS = %w( a abbr acronym address area b bdo big blockquote br button caption center cite code col colgroup dd del dfn dir div dl dt em fieldset font h1 h2 h3 h4 h5 h6 hr i img ins kbd label legend li map menu ol optgroup p pre q s samp small span strike strong sub sup table tbody td tfoot th thead tr tt u ul var )   allowed html elements.
HTML_ATTRS = %w( abbr accept accept-charset accesskey align alt axis border cellpadding cellspacing char charoff charset checked cite class clear cols colspan color compact coords datetime dir disabled for frame headers height href hreflang hspace id ismap label lang longdesc maxlength media method multiple name nohref noshade nowrap readonly rel rev rows rowspan rules scope selected shape size span src start summary tabindex target title type usemap valign value vspace width )   allowed attributes.
HTML_URI_ATTRS = %w( href src cite usemap longdesc )   allowed attributes that can contain URIs, so extra caution is required. NOTE: this does not list every URI attribute, only the ones that are allowed.
DODGY_URI_SCHEMES = %w( javascript vbscript mocha livescript data )

Public Class methods

Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; e.g. &#123; will NOT become &amp;#123;

This method could be improved by adding a whitelist of html entities.
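
The behaviour described above can be sketched roughly as follows. This is an illustration only, not the library's code: the method name escape_preserving_entities and the constant BARE_AMPERSAND are hypothetical. An ampersand is escaped only when it does not already begin something shaped like an entity.

```ruby
# Hypothetical sketch, not the library's implementation.
# Matches "&" only when it is NOT the start of something shaped like an
# entity (&#123;  &#x7B;  &name;).
BARE_AMPERSAND = /&(?!#\d+;|#x[0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)/

def escape_preserving_entities(str)
  str.gsub(BARE_AMPERSAND, "&amp;")  # escape stray ampersands first
     .gsub('"', "&quot;")
     .gsub("<", "&lt;")
     .gsub(">", "&gt;")
end

puts escape_preserving_entities("a & b &#123; <i>")
```

Because the pattern accepts any &name; form, an unknown name such as &bogus; also passes through untouched, which is the gap a whitelist of HTML entities would close.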

Does this:

  • Unescape HTML
  • Parse HTML into tree
  • Find ‘body’ if present, and extract tree inside that tag, otherwise parse whole tree
  • Each tag:
    • remove tag if not whitelisted
    • escape HTML tag contents
    • remove all attributes not on whitelist
    • extra-scrub URI attrs; see dodgy_uri?

Extra (i.e. unmatched) ending tags and comments are removed.
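
The two attribute steps in the list above might look something like this for a single tag. It is a minimal sketch under stated assumptions: the attributes have already been parsed into a Hash, ALLOWED_ATTRS is a shortened stand-in for HTML_ATTRS, and scrub_attributes, URI_ATTRS and BAD_SCHEMES are hypothetical names. The scheme test is also simpler than what dodgy_uri? is described as doing.

```ruby
# Hypothetical sketch, not the library's implementation.
ALLOWED_ATTRS = %w( href src title alt )              # shortened stand-in for HTML_ATTRS
URI_ATTRS     = %w( href src cite usemap longdesc )   # same values as HTML_URI_ATTRS
BAD_SCHEMES   = %w( javascript vbscript mocha livescript data )

# Returns a new Hash holding only the attributes that survive cleaning.
def scrub_attributes(attrs)
  attrs.select do |name, value|
    name = name.downcase
    next false unless ALLOWED_ATTRS.include?(name)   # not on the whitelist: drop
    next true  unless URI_ATTRS.include?(name)       # plain attribute: keep
    uri = value.to_s.strip.downcase
    BAD_SCHEMES.none? { |scheme| uri.start_with?("#{scheme}:") }
  end
end

p scrub_attributes("href" => "javascript:alert(1)", "onclick" => "x()", "title" => "hi")
```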

Returns true if the given string contains a suspicious URL, e.g. a javascript: link.

This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.
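
A check of this kind could be sketched as below. The names suspicious_uri? and DODGY_SCHEMES are hypothetical, and the approach is an assumption rather than the library's actual logic: decode the string both as HTML and as a URI, drop whitespace and control characters, then compare the start against each rejected scheme.

```ruby
require "cgi"

# Hypothetical sketch, not the library's implementation.
DODGY_SCHEMES = %w( javascript vbscript mocha livescript data )

def suspicious_uri?(uri)
  raw = uri.to_s
  # The scheme may be hidden behind HTML entities or percent-encoding,
  # so test both decodings.
  [CGI.unescapeHTML(raw), CGI.unescape(raw)].any? do |candidate|
    # Remove whitespace and control characters that may be tolerated
    # inside a scheme name.
    compact = candidate.gsub(/[\x00-\x20\x7f]/, "").downcase
    DODGY_SCHEMES.any? { |scheme| compact.start_with?("#{scheme}:") }
  end
end

p suspicious_uri?("java&#115;cript:alert(1)")
p suspicious_uri?("http://example.com/")
```

Like the documented method, this sketch treats every data: URL as suspicious; narrowing it to dangerous media types would need an extra check on the part after the scheme.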

For all other feed elements:

  • Unescape HTML.
  • Parse HTML into tree (taking ‘body’ as root, if present)
  • Takes text out of each tag, and escapes HTML.
  • Returns all text concatenated.
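
As a rough illustration of the steps above, the sketch below swaps the tree parse for a simple regular expression. That swap is an assumption made to keep the example self-contained: it is far less robust than a real parser, it does not handle the 'body' step, and flatten_sketch is a hypothetical name.

```ruby
require "cgi"

# Hypothetical sketch: a regex stands in for the HTML tree parse,
# for illustration only.
def flatten_sketch(str)
  unescaped = CGI.unescapeHTML(str)        # 1. unescape HTML
  text = unescaped.gsub(/<[^>]*>/, "")     # 2-3. drop the tags, keep their text
  CGI.escapeHTML(text)                     # 4. escape the remaining text
end

puts flatten_sketch("&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;")
```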

Unescapes HTML. If xml is true, also converts XML-only named entities to HTML.
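
One way to picture the xml flag is the sketch below. It assumes the XML-only entity in question is &apos; and uses a hypothetical method name, unescape_html. Depending on the Ruby version, CGI.unescapeHTML may already decode &apos; on its own, in which case the flag makes no visible difference.

```ruby
require "cgi"

# Hypothetical sketch, not the library's implementation.
def unescape_html(str, xml = true)
  # &apos; is defined by XML but not by HTML 4, so rewrite it as a
  # numeric reference before decoding.
  str = str.gsub("&apos;", "&#39;") if xml
  CGI.unescapeHTML(str)
end

puts unescape_html("&lt;b&gt; &apos;x&apos; &amp;")
```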
