SimpleParse is a BSD-licensed Python package
providing a simple parser generator for use with the mxTextTools
text-tagging engine. SimpleParse allows you to generate tagging tables
for use with the text-tagging engine directly from your
EBNF grammar.
Unlike most parser generators, SimpleParse generates single-pass
parsers (there is no distinct tokenization stage), an
approach taken from the predecessor project (mcf.pars) which
attempted to create "autonomously parsing regex objects". The resulting
parsers are not as generalized as those created by,
for instance, the Earley algorithm, but they do tend to be useful for
the parsing of computer file formats and the like (as distinct from
natural language and similar "hard" parsing problems).
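As a rough illustration of what "single-pass" means here, the following sketch parses a toy name=number list directly from the characters of the text, with no token list built first. It is not SimpleParse code and does not use mxTextTools: the function names, the toy format, and the (tag, start, stop, children) node layout are all assumptions made for the example.

```python
# Toy single-pass (scannerless) parser. No token list is ever built:
# every function reads characters straight from the text buffer.
# All names and the node layout are hypothetical, not SimpleParse's API.

def match_while(text, pos, predicate):
    """Advance past characters satisfying predicate; return the new position."""
    while pos < len(text) and predicate(text[pos]):
        pos += 1
    return pos


def parse_pair(text, pos):
    """pair := name '=' number. Returns (node, next_pos) or (None, pos)."""
    name_end = match_while(text, pos, str.isalpha)
    if name_end == pos or text[name_end:name_end + 1] != "=":
        return None, pos
    num_start = name_end + 1
    num_end = match_while(text, num_start, str.isdigit)
    if num_end == num_start:
        return None, pos
    node = ("pair", pos, num_end, [
        ("name", pos, name_end, []),
        ("number", num_start, num_end, []),
    ])
    return node, num_end


def parse_pairs(text):
    """pairs := pair (';' pair)*. Returns (nodes, offset where parsing stopped)."""
    nodes = []
    node, pos = parse_pair(text, 0)
    if node is None:
        return nodes, 0
    nodes.append(node)
    # Keep consuming ';' pair for as long as a complete pair follows.
    while text[pos:pos + 1] == ";":
        node, next_pos = parse_pair(text, pos + 1)
        if node is None:
            break
        nodes.append(node)
        pos = next_pos
    return nodes, pos
```

Calling parse_pairs("a=1;bc=22") yields two "pair" nodes and stops at offset 9, the end of the text.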
In addition to the parser generator, the SimpleParse
project
includes a sub-project to create a modified version of the mxTextTools
engine which reorganizes the code to allow for certain common EBNF
constructs.
I actively welcome and support both new developers and new users. If you are interested in working on the project, feel free to contact me.
You will need a copy of Python with distutils support (Python versions 2.0 and above include this). If you want to build the non-recursive TextTools engine, you'll also need a C compiler compatible with your Python build and understood by distutils.
To install the base SimpleParse engine, download the latest version in your preferred format. If you are using the Win32 installer, simply run the executable. If you are using one of the source distributions, unpack the distribution into a temporary directory (maintaining the directory structure) then run:
setup.py install
in the top directory created by the expansion process.
You will want the mxBase
2.1.0 distribution to run SimpleParse. This package should be
available in all the standard formats; follow the same instructions as
for the SimpleParse package to install it. If you want to use
the non-recursive implementation, you will need to get the source
archive. It is possible to use mxBase 2.0.3 with SimpleParse,
but not to use it for building the non-recursive TextTools engine
(2.0.3 also lacks a lot of features and bug-fixes found in the 2.1.0
versions).
Note: without the non-recursive rewrite of 2.1.0, the test suite will not pass all tests. A number of tests (which are tested with a number of different versions of the simpleparse grammar) will fail with the recursive version of 2.1.0 as well. I'm not sure why they fail with the recursive version, but it does argue for using the non-recursive rewrite.
To build the non-recursive TextTools engine, you'll need to
get the source distribution for the non-recursive implementation from
the SimpleParse
file repository. This archive is intended to be expanded over the
mxBase source archive from the top-level directory (it was created with
the 2.1.0 beta5 distribution specifically, but
should work with any 2.1.0 beta distribution), replacing one file and
adding four others.
cd egenix-mx-base-2.1.0
gunzip non-recursive-1.0.0b1.tar.gz
tar -xvf non-recursive-1.0.0b1.tar
(Or use WinZip on Windows.) When you have completed that, run:
setup.py build --force install
in the top directory of the eGenix-mx-base source tree.
New in 2.0.1:
diff -w -r1.4 error.py
32c32
< return '%s: %s'%( self.__class__.__name__, self.messageFormat(message) )
---
> return '%s: %s'%( self.__class__.__name__, self.messageFormat(self.message) )
New in 2.0:
General
Our (current) parsers are top-down, in that they work from the top
of the parsing graph (the root production). They are not, however,
tokenising parsers, so there is no appropriate LL(x) designation as far
as I can see (there is an arbitrary lookahead mechanism that could
theoretically parse the entire rest of the file just to see if a
particular character matches). I would hazard a guess that they
are theoretically closest to a deterministic recursive-descent parser.
There are no backtracking facilities, so any ambiguity is handled by
choosing the first successful match of a grammar (not the longest, as
in most top-down parsers, mostly because without tokenisation, it would
be expensive to do checks for each possible match's length). As a
result of this, the parsers are entirely deterministic.
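The practical consequence of first-match selection can be seen in a small sketch. The two helper functions below are hypothetical (they are not part of SimpleParse) and only compare literal alternatives, but they show why the order of alternatives in a grammar matters when the first success wins.

```python
# Hypothetical helpers contrasting first-match with longest-match
# selection among literal alternatives. Not SimpleParse's API.

def first_match(text, pos, alternatives):
    """Return the end offset of the first alternative matching at pos, else None."""
    for literal in alternatives:
        if text.startswith(literal, pos):
            return pos + len(literal)
    return None


def longest_match(text, pos, alternatives):
    """Return the end offset of the longest alternative matching at pos, else None."""
    ends = [pos + len(literal)
            for literal in alternatives
            if text.startswith(literal, pos)]
    return max(ends) if ends else None


text = "integer x"
short_first = first_match(text, 0, ["int", "integer"])   # stops after "int"
long_first = first_match(text, 0, ["integer", "int"])    # consumes "integer"
longest = longest_match(text, 0, ["int", "integer"])     # order does not matter
```

With the alternatives ordered as "int" then "integer", the first-match strategy stops after "int" even though the longer alternative would also have matched. Listing the longer alternative first avoids that, at no extra cost per match.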
The time/memory characteristics are such that, in general, the time
to parse an input text varies with the amount of text to parse. There
are two major factors: the time to do the actual parsing (which, for
simple deterministic grammars, should be close to linear with the length
of the text, though a pathological grammar might have radically
different operating characteristics) and the time to build the results
tree (which depends on the memory architecture of the machine, the
currently free memory, and the phase of the moon). As a rule,
SimpleParse parsers will be faster (for suitably limited grammars) than
anything you can code directly in Python. They will not generally
outperform grammar-specific parsers written in C.
mxTextTools Rewrite Enhancements
Alternate C Back-end?
© 1998-2003, Copyright by Mike C. Fletcher; All Rights Reserved.
mailto: mcfletch@users.sourceforge.net
Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee or royalty is hereby granted, provided that the above copyright notice appear in all copies and that both the copyright notice and this permission notice appear in supporting documentation or portions thereof, including modifications, that you make.
THE AUTHOR MIKE C. FLETCHER DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE AUTHOR
BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES
OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE!
An Open Source project