stringsearch-0.3.0: Fast searching, splitting and replacing of ByteStringsSource codeContentsIndex
Data.ByteString.Lazy.Search.DFA
Portabilitynon-portable (BangPatterns)
StabilityProvisional
MaintainerDaniel Fischer <daniel.is.fischer@web.de>
Contents
Overview
Complexity and performance
Partial application
Finding substrings
Breaking on substrings
Replacing
Splitting
Description
Fast search of lazy ByteString values. Breaking, splitting and replacing using a deterministic finite automaton.
Synopsis
indices :: ByteString -> ByteString -> [Int64]
nonOverlappingIndices :: ByteString -> ByteString -> [Int64]
breakOn :: ByteString -> ByteString -> (ByteString, ByteString)
breakAfter :: ByteString -> ByteString -> (ByteString, ByteString)
breakFindAfter :: ByteString -> ByteString -> ((ByteString, ByteString), Bool)
replace :: Substitution rep => ByteString -> rep -> ByteString -> ByteString
split :: ByteString -> ByteString -> [ByteString]
splitKeepEnd :: ByteString -> ByteString -> [ByteString]
splitKeepFront :: ByteString -> ByteString -> [ByteString]
Overview

This module provides functions related to searching a substring within a string. The searching algorithm uses a deterministic finite automaton based on the Knuth-Morris-Pratt algorithm. The automaton is implemented as an array of (patternLength + 1) * σ state transitions, where σ is the alphabet size (256), so it is only suitable for short enough patterns, therefore the patterns in this module are required to be strict ByteStrings.

When searching a pattern in a UTF-8-encoded ByteString, be aware that these functions work on bytes, not characters, so the indices are byte-offsets, not character offsets.

Complexity and performance

The time and space complexity of the preprocessing phase is O(patternLength * σ). The searching phase is O(targetLength), each target character is inspected only once.

In general the functions in this module have about the same performance as the corresponding functions using the Knuth-Morris-Pratt algorithm but are considerably slower than the Boyer-Moore functions. For very short patterns or, in the case of indices, patterns with a short period which occur often, however, times are close to or even below the Boyer-Moore times.

Partial application
All functions can usefully be partially applied. Given only a pattern, the automaton is constructed only once, allowing efficient re-use.
Finding substrings
indicesSource
:: ByteStringStrict pattern to find
-> ByteStringLazy string to search
-> [Int64]Offsets of matches
indices finds the starting indices of all possibly overlapping occurrences of the pattern in the target string. If the pattern is empty, the result is [0 .. length target].
nonOverlappingIndicesSource
:: ByteStringStrict pattern to find
-> ByteStringLazy string to search
-> [Int64]Offsets of matches
nonOverlappingIndices finds the starting indices of all non-overlapping occurrences of the pattern in the target string. It is more efficient than removing indices from the list produced by indices.
Breaking on substrings
breakOnSource
:: ByteStringStrict pattern to search for
-> ByteStringLazy string to search in
-> (ByteString, ByteString)Head and tail of string broken at substring

breakOn pattern target splits target at the first occurrence of pattern. If the pattern does not occur in the target, the second component of the result is empty, otherwise it starts with pattern. If the pattern is empty, the first component is empty. For a non-empty pattern, the first component is generated lazily, thus the first parts of it can be available before the pattern has been found or determined to be absent.

   uncurry append . breakOn pattern = id
breakAfterSource
:: ByteStringStrict pattern to search for
-> ByteStringLazy string to search in
-> (ByteString, ByteString)Head and tail of string broken after substring
breakAfter pattern target splits target behind the first occurrence of pattern. An empty second component means that either the pattern does not occur in the target or the first occurrence of pattern is at the very end of target. If you need to discriminate between those cases, use breakFindAfter. If the pattern is empty, the first component is empty. For a non-empty pattern, the first component is generated lazily, thus the first parts of it can be available before the pattern has been found or determined to be absent. uncurry append . breakAfter pattern = id
breakFindAfterSource
:: ByteStringStrict pattern to search for
-> ByteStringLazy string to search in
-> ((ByteString, ByteString), Bool)Head and tail of string broken after substring and presence of pattern

breakFindAfter does the same as breakAfter but additionally indicates whether the pattern is present in the target.

   fst . breakFindAfter pat = breakAfter pat
Replacing
replaceSource
:: Substitution rep
=> ByteStringStrict pattern to replace
-> repReplacement string
-> ByteStringLazy string to modify
-> ByteStringLazy result

replace pat sub text replaces all (non-overlapping) occurrences of pat in text with sub. If occurrences of pat overlap, the first occurrence that does not overlap with a replaced previous occurrence is substituted. Occurrences of pat arising from a substitution will not be substituted. For example:

   replace "ana" "olog" "banana" = "bologna"
   replace "ana" "o" "bananana" = "bono"
   replace "aab" "abaa" "aaab" = "abaaab"

The result is a lazy ByteString, which is lazily produced, without copying. Equality of pattern and substitution is not checked, but

   replace pat pat text == text

holds (the internal structure is generally different). If the pattern is empty but not the substitution, the result is equivalent to (were they Strings) cycle sub.

For non-empty pat and sub a lazy ByteString,

   concat . Data.List.intersperse sub . split pat = replace pat sub

and analogous relations hold for other types of sub.

Splitting
splitSource
:: ByteStringStrict pattern to split on
-> ByteStringLazy string to split
-> [ByteString]Fragments of string

split pattern target splits target at each (non-overlapping) occurrence of pattern, removing pattern. If pattern is empty, the result is an infinite list of empty ByteStrings, if target is empty but not pattern, the result is an empty list, otherwise the following relations hold (where patL is the lazy ByteString corresponding to pat):

   concat . Data.List.intersperse patL . split pat = id,
   length (split pattern target) ==
               length (nonOverlappingIndices pattern target) + 1,

no fragment in the result contains an occurrence of pattern.

splitKeepEndSource
:: ByteStringStrict pattern to split on
-> ByteStringLazy string to split
-> [ByteString]Fragments of string

splitKeepEnd pattern target splits target after each (non-overlapping) occurrence of pattern. If pattern is empty, the result is an infinite list of empty ByteStrings, otherwise the following relations hold:

   concat . splitKeepEnd pattern = 'id,'

all fragments in the result except possibly the last end with pattern, no fragment contains more than one occurrence of pattern.

splitKeepFrontSource
:: ByteStringStrict pattern to split on
-> ByteStringLazy string to split
-> [ByteString]Fragments of string
splitKeepFront is like splitKeepEnd, except that target is split before each occurrence of pattern and hence all fragments with the possible exception of the first begin with pattern.
Produced by Haddock version 2.4.2