This section gives some explanations of how to write a language specification for Styx. Contrary to yacc, Styx is implemented reflectively, meaning it is written with its own help. Thus, a proper Styx definition of the Styx language exists within the Styx source distribution. For omitted details you may refer to this source (styx.sty), from which we cite frequently in this part of the document. It not only provides a proper definition, but also gives a plethora of examples.
Referring back to the above walk-through, a specification of a language is written down in one file consisting of three sections:
Language
    section stating the name of the language.
Regular Grammar
    section defining the tokens.
Context Free Grammar
    section in which, tautologically, the context free grammar is defined.

start Source
  :root : "Language" Ide
          "Regular" "Grammar" QlxDfns
          "Context" "Free" "Grammar" Dfns

An extra twist is implemented within the Styx generators, requiring as a naming convention that Styx source files are named after the language and carry the extension ".sty". Thus, if you specify a language named "calc", you have to name the language definition file "calc.sty".
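To make the naming convention and the three-section layout concrete, here is a hedged sketch of what such a file "calc.sty" might look like; the token and non-terminal names (Nat, Opr, Sum, Term) are made up for illustration and not taken from the Styx distribution:

; calc.sty -- an illustrative sketch only
Language calc

Regular Grammar

  let Digit = '0'..'9'
  tok Nat   = Digit+            ; natural numbers
  tok Opr   = '+-'              ; one-character operators
  ign Space = " " | "\n"        ; ignored formatting

Context Free Grammar

  start Sum                     ; can be parsed individually
    :plus : Sum "+" Term
    :term : Term
  let Term
    :nat  : Nat

The individual constructions used here are explained in the remainder of this section.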
The character set for the source is ASCII. For later reference, we distinguish between printable characters and control characters:
; Character Set
let Byte      = '\00' .. '\ff'    ; all extended ascii
let Control   = '\00' .. '\1f'    ; control
              | '\7f'             ; DEL
              | '\ff'             ; space-like extended ascii
let Printable = Byte - Control
Space, Newline, Return and Formfeed are used to separate tokens and are otherwise completely ignored. The source itself is format-free. Note that the page separator character may also be used; we never refer to the source by pages. Additionally, no tabulator characters may be used within the source. We had so many problems with different programs having different ideas about how to expand them that we dropped them from this specification.
ign Space = " "            ; ASCII - Space
ign Line  = "\n" | "\r\n"  ; UNIX / DOS
ign Page  = "\p"           ; weak separation convention
Comments start with a semicolon and extend to the end of the line.
; Comments et al
com Comment = ';' {Printable}
The regular tokens are identifiers (consisting of letters and digits, starting with a letter), three sorts of literals, and a set of operators.
; complex tokens
tok Ide = Letter {Letter} {Digit}    ; Identifier
tok Nat = Digit+                     ; Natural
tok Set = '\'' {LitChar} '\''        ; CharacterSet
tok Seq = '\"' {LitChar} '\"'        ; CharacterSequence (String)
tok Opr = (Special - ';')+           ; Operator
Beside the natural numbers, which are later used to denote characters by their ASCII code, two sorts of strings form the literals: one with single and one with double quotes. While the first is used to denote sets of characters, the second denotes sequences of characters, i.e. strings. When they contain a single character, their meaning is of course identical.
Contrary to C syntax, both the single and the double quote have to be escaped when used inside these literals themselves. Additionally, a hexadecimal notation for characters is provided within the character literals. Some control characters (form feed, return, newline, tabulator) can also be denoted within the quotation by a single character after the backslash.
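For instance, the following equations rely on the escape conventions just described (a hedged sketch; these particular definitions are not taken from styx.sty):

; illustrative escape usage
let Quotes  = '\'\"'            ; the two quote characters, each escaped by a backslash
let NewLine = "\n"              ; newline, denoted by a letter escape
let Escape  = '\5c'             ; the backslash itself, given by its hexadecimal code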
For completeness, here are the remaining definitions for the literals:
; Definitions
let Letter      = 'A'..'Z' | 'a'..'z'
let HexDigit    = '0'..'9' | 'a'..'f'
let Digit       = '0'..'9'
let Normal      = Letter | Digit | Space
let Quote       = '\'\"\`\\'
tok Parenthesis = '()[]{}'                ; one character tokens
let Special     = Printable - Normal - Parenthesis - Quote
let LitChar     = Printable - Quote
                | '\\' (Quote | 'prnt' | HexDigit HexDigit)
The remaining tokens are operators and parentheses. Neither token class has a meaning of its own; both are used to form reserved words later in the regular grammar. Operators are made up of special characters.
The reserved words are "Language", "Regular", "Grammar", "Context", "Free", "let", "tok", "ign", "com", "ica", "start" and "err". Further, "cons", "nil" and words starting with "ign" have a special meaning when used as production names.
Following the introducing "Regular Grammar" keywords, the regular grammar is specified as a collection of equations. After a leading keyword, which gives some hints on how to treat the equation, a name is assigned to a regular expression. Have a look at the preceding definitions to get an idea of how this looks.
; REG-Section
let QlxDfns          ; Qlx-Definitions
  :nil  :
  :cons : QlxDfn QlxDfns
let QlxDfn           ; Qlx-Definition
  :defn : QlxCat QlxOpt Ide "=" Exp
let QlxCat           ; QlxCategory
  :letC : "let"
  :tokC : "tok"
  :ignC : "ign"
  :comC : "com"
The definitions can come in any order. This means that an applied occurrence does not need to follow its definition textually. It is only required that no recursion is used. You can therefore order the definitions for other purposes. Note that, contrary to the lex program, no implicit semantics is placed on the order of the definitions either.
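For example (a hedged sketch, the names are made up), the following order is perfectly legal, since the applied occurrence of Letter precedes its definition without introducing any recursion:

tok Word   = Letter+            ; uses Letter before it is defined
let Letter = 'a'..'z'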
The leading keywords introducing the equations (see QlxCat) serve two purposes. First, the equations introduced using "let" are auxiliary. They do not specify tokens, but only regular sets possibly used in them. See the above section for typical applications of this feature.
The next keyword, "tok", introduces regular tokens. The identifiers following this keyword are the only ones that can later be used within the context free grammar. Also, when keywords are implicitly used there, only these regular sets will be considered.
The remaining keywords ("ign" and "com") introduce tokens that will be more or less ignored. "com" is for comments; the semantics is that they will be stored in the derivation tree (for possible source-to-source translation), but will not be accessible through the language-specific interface. Also, both "com" and "ign" tokens may be inserted at any place within the language sources.
"ign" tokens are completely ignored and never even leave the scanner. Conceptually, they do their duty as formating character. Because the scanner knows about the newline character and provides line and column position with each token, these class of characters may (somehow indirect) be accessible in the source tree later. If no strange things are done with the control characters (i.e. only uses space and newline as formaters), on can fully reproduce the source from the derivation tree modulo trailing spaces and empty lines.
Collectively, all definitions besides the "let" ones are considered to form the tokens of the language. Styx' lexical analyser requires these token definitions to be pairwise disjoint: no two of them may contain the same word. While the lex program resolves possible non-empty intersections by an implicit priority, one has to make this explicit when using Styx. There are many ways to do this; one possibility is to use the difference operator ("-") to clarify the situation. Styx will issue errors as soon as non-empty intersections are detected.
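As a hedged illustration (the definitions below are made up), the difference operator can be used to keep a reserved word out of the identifier tokens, making the two token classes disjoint:

let Letter = 'a'..'z'
tok End    = "end"
tok Word   = Letter+ - "end"    ; identifiers, with the reserved word taken out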
In the language interface, the tokens will be offered as symbols. Basically, these are unique strings, allowing them to be compared by the C identity predicate ("=="). By the same means, string-equal tokens are stored only once.
This can become a disadvantage when the tokens are abnormal within the defined language. Because we consider this a weak design anyway, few means are provided to introduce a normalizer for such tokens. Since the typical case is case-insensitive languages, a normalizer for these is built into Styx. One can specify this situation by the proper QlxOpt.
let QlxOpt           ; QlxOption
  :non  :
  :ignca: "[" "ica" "]"

The "[ica]" construction before the defined identifier indicates that case is to be ignored. As a result, the corresponding tokens are normalized to all lower-case letters. Note that using abnormal tokens has many disadvantages. In particular, one loses some source information through this normalization, since people who define such abnormalities are typically unable to decide whether they really mean what they do. I have seen, for example, PASCAL implementations which were case insensitive, yet had identifiers like "FileRead" defined in them. This is certainly asking for trouble. We cannot help bad design, and strongly suggest not using normalizers on tokens.
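Should you need it nevertheless, here is a hedged sketch of the notation (the token name is made up):

let Letter        = 'A'..'Z' | 'a'..'z'
tok [ica] Keyword = Letter+     ; "BEGIN", "Begin" and "begin" all normalize to "begin"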
Here we finally come to the right hand side of the regular equations.
let Exp4             ; Expression prio 4
  :sequ : Seq
  :set  : Set
  :ident: Ide
The meaning of the set and sequence literals has already been defined when these tokens were introduced. An identifier denotes the regular set corresponding to some other equation.
let Exp3             ; Expression prio 3
  :ign1 : Exp4
  :range: Exp4 ".." Exp4
  :ign2 : "(" Exp ")"
Round parentheses may be used to group expressions; the double dot operator ".." can be used to construct character ranges. Both of its arguments have to be single characters.
let Exp2             ; Expression prio 2
  :opt  : "[" Exp "]"
  :star : "{" Exp "}"
  :plus : Exp3 "+"
  :ign1 : Exp3
Next in binding strength come the different sorts of repetition and the option. The "+" suffix means one or more occurrences, the curly brackets stand for zero or more occurrences, and the square brackets mean zero or one occurrence.
let Exp1             ; Expression prio 1
  :conc : Exp1 Exp2
  :ign1 : Exp2
Concatenation is denoted by juxtaposing expressions; the corresponding operator is omitted.
let Exp              ; Expression prio 0
  :union: Exp "|" Exp1
  :diff : Exp "-" Exp1
  :ign1 : Exp1
Finally, and weakest in binding strength, we have the set union ("|") and difference ("-") operations.
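To summarize the relative binding strengths, consider the following made-up definitions (a hedged sketch, using the Letter and Digit sets from the definitions above):

let Sample = 'a' 'b'+ | '0'..'9' {Letter}
           ; read as ( 'a' ('b'+) ) | ( ('0'..'9') {Letter} ):
           ; repetition binds tighter than concatenation, which binds tighter than union
let Other  = Letter | Digit - '0'
           ; read as ( Letter | Digit ) - '0':
           ; union and difference share the lowest priority and group to the left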
Here we deal with the definition of the context free grammar section in the Styx sources. This is straightforward and basically a list nested three levels deep.
On the top level, we have a list of definitions (Dfns) of non-terminal identifiers, whose bodies consist of a list of productions (Prds) for these non-terminals, again each identified by a name. The body of an individual production is formed by a list of members (Mbrs), which are either identifiers denoting terminals or non-terminals, or strings denoting keywords.
The defined non-terminal names have to be unique within the scope of the source and disjoint from the names of the regular sets defined in the previous section. The production names have to be unique within each non-terminal definition.
The keywords (string members) have to belong to one of the defined regular sets of tokens.
; CFG-Section
let Dfns             ; Definitions
  :nil  :
  :cons : Dfn Dfns
let Dfn              ; Definition
  :defn : Cat DfnOpt Ide Prds
let Prds             ; Productions
  :nil  :
  :cons : Prd Prds
let Prd              ; Production
  :prod : Lay Ide ":" Mbrs
let Mbrs             ; Members
  :nil  :
  :cons : Mbr Mbrs
let Mbr              ; Member
  :ntm  : Ide
  :tkm  : Seq
Some options apply to this construction, of which the most important is the distinction between start and inner productions. Start productions indicate those non-terminals which can later be parsed individually, while inner productions can only be parsed as part of a start production. Referring back to the regular grammar specification, this distinction is much like the one between the "let" and "tok" categories. We use a similar syntactic device for the indication: a leading keyword. The start productions are indicated by a leading "start" and the inner productions by a leading "let".
let Cat              ; Category
  :letC : "let"
  :bgnC : "start"
The remaining options deal with error recovery and pretty printing. Use the error option to specify a non-terminal as a resumption point within the implemented panic-mode error recovery, which has been tested but may nevertheless not work as expected. To force the default error handling, where the parse process stops when a syntax error occurs, omit the error option. The layout option is not documented yet. Choose the colon (":") until we release a proper specification.
let DfnOpt           ; DfnOption
  :non  :
  :errnt: "[" "err" "]"
let Lay
  :reg  : ":"
  :line : "?"
  :nof  : "!"
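As a hedged sketch of the error option (the non-terminal, production and member names are made up and assumed to be defined elsewhere):

let [err] Stmt                  ; resumption point for the panic-mode recovery
  :assign : Ide "=" Expr        ; Ide, Expr and the "=" keyword assumed defined
  :skip   : ";"                 ; ";" likewise assumed to be covered by a token set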
Some of the identifiers for the production names are reserved for normalization. These are "cons", "nil" and "ign(0-9)+". Apart from the other keywords used in the Styx grammar, you are otherwise free to choose these names. The mentioned identifiers serve as indications of how to make up the depth grammar. A separate section is devoted to this topic; see below.
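For instance (a hedged sketch in the style of the list definitions above, with Arg assumed to be defined elsewhere), a non-terminal intended to become a list in the depth grammar uses exactly these names:

let Args                        ; normalized to a list in the depth grammar
  :nil  :
  :cons : Arg Args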