tagsoup-0.10.1: Parsing and extracting information from (possibly malformed) HTML/XML documents
Text.HTML.TagSoup
Contents
Data structures and parsing
Tag identification
Extraction
Utility
Combinators
Description

This module is for working with HTML/XML. It deals with both well-formed XML and malformed HTML from the web. It features:

  • A lazy parser, based on the HTML 5 specification - see parseTags.
  • A renderer that can write out HTML/XML - see renderTags.
  • Utilities for extracting information from a document - see ~==, sections and partitions.

The standard practice is to parse a String to [Tag String] using parseTags, then operate upon it to extract the necessary information.
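
As a rough illustration of that parse-then-extract workflow, the sketch below collects the href of every <a> tag. The sample input and the helper name hrefs are hypothetical; only parseTags, isTagOpenName and fromAttrib come from this module.

```haskell
import Text.HTML.TagSoup

-- Hypothetical sample input; any HTML string would do.
sample :: String
sample = "<ul><li><a href=\"/a\">First</a></li><li><a href=\"/b\">Second</a></li></ul>"

-- Hypothetical helper: parse the document, keep the <a> open tags,
-- and read the href attribute of each one.
hrefs :: String -> [String]
hrefs html = [fromAttrib "href" t | t <- parseTags html, isTagOpenName "a" t]

main :: IO ()
main = mapM_ putStrLn (hrefs sample)
```

Because parseTags is lazy, a pipeline of this shape only parses as much of the input as the consumer demands.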

Synopsis
data Tag str
= TagOpen str [Attribute str]
| TagClose str
| TagText str
| TagComment str
| TagWarning str
| TagPosition !Row !Column
type Row = Int
type Column = Int
type Attribute str = (str, str)
parseTags :: StringLike str => str -> [Tag str]
parseTagsOptions :: StringLike str => ParseOptions str -> str -> [Tag str]
data ParseOptions str = ParseOptions {
optTagPosition :: Bool
optTagWarning :: Bool
optEntityData :: (str, Bool) -> [Tag str]
optEntityAttrib :: (str, Bool) -> (str, [Tag str])
optTagTextMerge :: Bool
}
parseOptions :: StringLike str => ParseOptions str
parseOptionsFast :: StringLike str => ParseOptions str
renderTags :: StringLike str => [Tag str] -> str
renderTagsOptions :: StringLike str => RenderOptions str -> [Tag str] -> str
escapeHTML :: StringLike str => str -> str
data RenderOptions str = RenderOptions {
optEscape :: str -> str
optMinimize :: str -> Bool
}
renderOptions :: StringLike str => RenderOptions str
canonicalizeTags :: StringLike str => [Tag str] -> [Tag str]
isTagOpen :: Tag str -> Bool
isTagClose :: Tag str -> Bool
isTagText :: Tag str -> Bool
isTagWarning :: Tag str -> Bool
isTagPosition :: Tag str -> Bool
isTagOpenName :: Eq str => str -> Tag str -> Bool
isTagCloseName :: Eq str => str -> Tag str -> Bool
fromTagText :: Show str => Tag str -> str
fromAttrib :: (Show str, Eq str, StringLike str) => str -> Tag str -> str
maybeTagText :: Tag str -> Maybe str
maybeTagWarning :: Tag str -> Maybe str
innerText :: StringLike str => [Tag str] -> str
sections :: (a -> Bool) -> [a] -> [[a]]
partitions :: (a -> Bool) -> [a] -> [[a]]
class TagRep a
(~==) :: (StringLike str, TagRep t) => Tag str -> t -> Bool
(~/=) :: (StringLike str, TagRep t) => Tag str -> t -> Bool
Data structures and parsing
data Tag str
A single HTML element. A whole document is represented by a list of Tag. There is no requirement for TagOpen and TagClose to match.
Constructors
TagOpen str [Attribute str]
    An open tag with Attributes in their original order
TagClose str
    A closing tag
TagText str
    A text node, guaranteed not to be the empty string
TagComment str
    A comment
TagWarning str
    Meta: A syntax error in the input file
TagPosition !Row !Column
    Meta: The position of a parsed element
Instances
Functor Tag
Typeable1 Tag
Eq str => Eq (Tag str)
Data str => Data (Tag str)
Ord str => Ord (Tag str)
Show str => Show (Tag str)
StringLike str => TagRep (Tag str)
type Row = Int
The row/line of a position, starting at 1
type Column = Int
The column of a position, starting at 1
type Attribute str = (str, str)
An HTML attribute id="name" generates ("id","name")
parseTags :: StringLike str => str -> [Tag str]

Parse a string to a list of tags, using an HTML 5 compliant parser.

 parseTags "<hello>my&amp;</world>" == [TagOpen "hello" [],TagText "my&",TagClose "world"]
parseTagsOptions :: StringLike str => ParseOptions str -> str -> [Tag str]

Parse a string to a list of tags, using the settings supplied in the ParseOptions parameter, e.g. to output position information:

 parseTagsOptions parseOptions{optTagPosition = True} "<hello>my&amp;</world>" ==
    [TagPosition 1 1,TagOpen "hello" [],TagPosition 1 8,TagText "my&",TagPosition 1 15,TagClose "world"]
data ParseOptions str
These options control how parseTags works.
Constructors
ParseOptions
optTagPosition :: Bool
    Should TagPosition values be given before some items (default=False, fast=False)
optTagWarning :: Bool
    Should TagWarning values be given (default=False, fast=False)
optEntityData :: (str, Bool) -> [Tag str]
    How to look up an entity (Bool = has ending ';')
optEntityAttrib :: (str, Bool) -> (str, [Tag str])
    How to look up an entity in an attribute (Bool = has ending ';')
optTagTextMerge :: Bool
    Require no adjacent TagText values (default=True, fast=False)
parseOptions :: StringLike str => ParseOptions str
The default parse options value, using the settings marked default in ParseOptions.
parseOptionsFast :: StringLike str => ParseOptions str
A ParseOptions value optimised for speed, using the settings marked fast in ParseOptions.
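
A minimal sketch of overriding a single option, assuming the record-update style used in the parseTagsOptions example: it switches on optTagWarning and collects the warning messages. The helper name warnings is hypothetical, and the wording of any warning text is left to the parser.

```haskell
import Text.HTML.TagSoup

-- Hypothetical helper: parse with warnings enabled and keep only the
-- TagWarning messages, discarding every other tag.
warnings :: String -> [String]
warnings html = [w | TagWarning w <- parseTagsOptions opts html]
  where
    -- Start from the defaults and change one field.
    opts = parseOptions { optTagWarning = True }

main :: IO ()
main = mapM_ putStrLn (warnings "<p>fine</p>")
```

The same pattern applies to parseOptionsFast: start from it and re-enable only the fields that are needed.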
renderTags :: StringLike str => [Tag str] -> str

Show a list of tags, as they might have been parsed, using the default settings given in RenderOptions.

 renderTags [TagOpen "hello" [],TagText "my&",TagClose "world"] == "<hello>my&amp;</world>"
renderTagsOptions :: StringLike str => RenderOptions str -> [Tag str] -> str

Show a list of tags using the settings supplied in the RenderOptions parameter, e.g. to avoid escaping any characters:

 renderTagsOptions renderOptions{optEscape = id} [TagText "my&"] == "my&"
escapeHTML :: StringLike str => str -> str
Replace the four characters &"<> with their HTML entities (the list from xmlEntities).
data RenderOptions str

These options control how renderTags works.

The strange quirk of only minimizing <br> tags is due to Internet Explorer treating <br></br> as <br><br>.

Constructors
RenderOptions
optEscape :: str -> str
    Escape a piece of text (default = escape the four characters &"<>)
optMinimize :: str -> Bool
    Minimise <b></b> -> <b/> (default = minimise only <br> tags)
renderOptions :: StringLike str => RenderOptions str
The default render options value, described in RenderOptions.
canonicalizeTags :: StringLike str => [Tag str] -> [Tag str]
Turns all tag names and attributes to lower case and converts DOCTYPE to upper case.
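
A small sketch of canonicalizeTags applied to hand-built tags, so that parser behaviour does not enter into it. The input list shouty is hypothetical. The sketch assumes that "attributes" refers to attribute names and that attribute values and text nodes are left untouched.

```haskell
import Text.HTML.TagSoup

-- Hypothetical mixed-case tags, built by hand rather than parsed.
shouty :: [Tag String]
shouty = [TagOpen "A" [("HREF", "Index.HTML")], TagText "Home", TagClose "A"]

main :: IO ()
main = print (canonicalizeTags shouty)
```

Canonicalising before matching lets a query such as isTagOpenName "a" work regardless of the casing used in the source document.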
Tag identification
isTagOpen :: Tag str -> Bool
Test if a Tag is a TagOpen
isTagClose :: Tag str -> Bool
Test if a Tag is a TagClose
isTagText :: Tag str -> Bool
Test if a Tag is a TagText
isTagWarning :: Tag str -> Bool
Test if a Tag is a TagWarning
isTagPosition :: Tag str -> Bool
Test if a Tag is a TagPosition
isTagOpenName :: Eq str => str -> Tag str -> Bool
Returns True if the Tag is TagOpen and matches the given name
isTagCloseName :: Eq str => str -> Tag str -> Bool
Returns True if the Tag is TagClose and matches the given name
Extraction
fromTagText :: Show str => Tag str -> str
Extract the string from within a TagText; crashes if the tag is not a TagText.
fromAttrib :: (Show str, Eq str, StringLike str) => str -> Tag str -> str
Extract an attribute value; crashes if the tag is not a TagOpen. Returns "" if the attribute is not present.
maybeTagText :: Tag str -> Maybe str
Extract the string from within a TagText, otherwise Nothing
maybeTagWarning :: Tag str -> Maybe str
Extract the string from within a TagWarning, otherwise Nothing
innerText :: StringLike str => [Tag str] -> str
Extract all text content from tags (similar to Verbatim found in HaXml)
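
The extraction functions can be sketched on a hand-built fragment as follows. The names anchor and fragment are hypothetical; the calls themselves are the ones documented in this section.

```haskell
import Text.HTML.TagSoup

-- Hypothetical open tag for a link.
anchor :: Tag String
anchor = TagOpen "a" [("href", "/docs")]

-- Hypothetical fragment, roughly <a href="/docs"><b>Read</b> the docs</a>.
fragment :: [Tag String]
fragment =
  [ anchor
  , TagOpen "b" [], TagText "Read", TagClose "b"
  , TagText " the docs"
  , TagClose "a"
  ]

main :: IO ()
main = do
  putStrLn (fromAttrib "href" anchor)    -- attribute present: "/docs"
  print (fromAttrib "id" anchor)         -- attribute absent: ""
  print (maybeTagText anchor)            -- not a TagText: Nothing
  putStrLn (innerText fragment)          -- text nodes joined: "Read the docs"
```

When the constructor of a tag is not known in advance, maybeTagText is the safer choice, since fromTagText and fromAttrib crash on the wrong constructor.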
Utility
sections :: (a -> Bool) -> [a] -> [[a]]
This function takes a list, and returns all suffixes whose first item matches the predicate.
partitions :: (a -> Bool) -> [a] -> [[a]]
This function is similar to sections, but splits the list so no element appears in any two partitions.
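
To make the difference between the two concrete, the sketch below redefines both functions locally under the primed names sections' and partitions' so that it runs with base alone. These definitions are assumptions written to match the descriptions above; they are not taken from the library source.

```haskell
import Data.List (groupBy, tails)

-- Local stand-in for sections (assumed definition, not the library source):
-- every non-empty suffix whose first element satisfies the predicate.
sections' :: (a -> Bool) -> [a] -> [[a]]
sections' p xs = [s | s@(x:_) <- tails xs, p x]

-- Local stand-in for partitions (assumed definition, not the library source):
-- drop everything before the first match, then start a new group at each match.
partitions' :: (a -> Bool) -> [a] -> [[a]]
partitions' p = groupBy (\_ y -> not (p y)) . dropWhile (not . p)

main :: IO ()
main = do
  print (sections' even [1, 2, 3, 4 :: Int])       -- [[2,3,4],[4]]
  print (partitions' even [1, 2, 3, 4, 5 :: Int])  -- [[2,3],[4,5]]
```

With tags, the predicate is typically a match on an open tag, so each result starts at one such tag: sections keeps the whole remainder of the document, while partitions stops just before the next match.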
Combinators
class TagRep a
A class that allows either a String or a Tag str to be used as a match pattern
(~==) :: (StringLike str, TagRep t) => Tag str -> t -> Bool

Performs an inexact match. The first argument is the tag being tested and the second is the pattern. If the second item is a blank string, that is considered to match anything. For example:

 (TagText "test" ~== TagText ""    ) == True
 (TagText "test" ~== TagText "test") == True
 (TagText "test" ~== TagText "soup") == False

For TagOpen, the pattern on the right may omit attributes that are present on the tag being matched.
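
A short sketch of the TagOpen case, using a hypothetical tag named link. The last line uses a String pattern and assumes that the String instance of TagRep accepts a string containing exactly one tag.

```haskell
import Text.HTML.TagSoup

-- Hypothetical tag used as the thing to match.
link :: Tag String
link = TagOpen "a" [("href", "/home"), ("class", "nav")]

main :: IO ()
main = do
  print (link ~== TagOpen "a" [("href", "/home")])  -- pattern omits class: True
  print (link ~== TagOpen "a" [("id", "top")])      -- tag has no such attribute: False
  print (link ~/= TagClose "a")                     -- different constructor: True
  print (link ~== "<a>")                            -- String pattern (assumed instance): True
```

Putting the pattern on the right keeps the parsed tag on the left, which is the argument order that sections and partitions predicates usually need, as in (~== "<a>").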

(~/=) :: (StringLike str, TagRep t) => Tag str -> t -> Bool
Negation of ~==
Produced by Haddock version 2.4.2