Grammar Rules

The following discussion is rather abstract and will probably be quite tedious to most readers.  If you find your eyes glazing over, you should skip down to the examples in the next section, because the concrete examples make this all a lot clearer than the theoretical discussion we're about to embark upon.  Once you've figured out how the examples work, come back here for the details.

 

TADS 3 provides an intrinsic class called GrammarProd, for "grammar production," to make it easier to create a parser.  You do not create GrammarProd objects directly; instead, the compiler creates these for you when you define "grammar rules."

 

The term "production" is used here in a specialized technical sense specific to computerized parsing.  A production is an element of a grammar that is composed of smaller parts.  For example, if we were creating a computerized grammar for English, we might include a "sentence" production that consists of a subject, a predicate, and one or more objects, which in turn would themselves be productions consisting of small word groups.  Productions are so named because computer parsers are frequently designed to build grammatical structures by starting at individual words and working up through sequentially larger structures; each time a set of words is recognized as a functional group, the parser produces the larger structure from the constituent parts, thus the larger structure is called a production.

 

One of the most important things about productions is that a production almost always has more than one way of being built.  This is really the entire point of defining productions, because it allows us to recognize the different forms a particular syntactic element can take.  For example, in English, there are many different kinds of noun phrases: a simple noun, an adjective followed by a noun, a pronoun, a possessive pronoun followed by a noun, a possessive pronoun followed by an adjective followed by a noun, a possessive pronoun followed by an adjective followed by another adjective followed by a noun; we could go on all day.  Despite all of these different syntactic forms a noun phrase can take, though, we can label any of these different forms as a noun phrase and then use it to build up larger sentence structures.  This means that, when we define our larger structures, we don't have to worry about all the different ways a noun phrase can be formed—we simply know that we need a noun phrase, however formed.  The "noun phrase" production thus allows us to hide all of this complexity behind a simple label.

 

A grammar rule is defined using the grammar keyword.  A grammar statement has this form:

 

  grammar production_name optional_tag : alternative_list : class_list
    property_list
  ;

 

The optional_tag is a symbol or number token enclosed in parentheses.  This is not required, but if it is present, it provides a way to distinguish the rule from other rules associated with the same production_name at run-time, and to refer to the rule in modify and replace statements.  This tag is included in the string in the first element returned by the grammarInfo method.  If the tag is present, the combination of the production name, tag, and parentheses forms the full name of the object, and so this combination must be globally unique, just as any other object name must be.
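
For example, this rule (which appears again in the grammarInfo example later in this article) defines an alternative for the nounPhrase production and tags it (1), so the full name of the match object is nounPhrase(1); that's the name a modify or replace statement would use to refer to it.  (The item syntax between the colons is described below.)

grammar nounPhrase(1): adjective->adj_ noun->noun_ : object;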

 

The class_list and property_list are defined in exactly the same way as for ordinary objects.  The reason there's a class and property list is that the grammar statement defines an ordinary class, in addition to defining a GrammarProd object.  The ordinary class that is defined has no name, but is otherwise like any other class.

 

The production_name specifies the name of the GrammarProd object.  A given production name can occur in any number of grammar statements; a grammar statement does not uniquely define a production object, but simply adds one or more alternative syntax rules to the production.

 

The alternative_list is a set of one or more syntax rules to be associated with the production.  Each alternative list looks like this:

 

  item_list | item_list ...

 

The vertical bar (the same symbol used for the bitwise-OR operator) separates multiple item lists, if more than one is specified (it's also perfectly fine to specify only one item list, in which case no bar is needed).  Using the vertical bar to include multiple item lists is equivalent to using a separate grammar statement for each item list, so it is never necessary to use multiple lists; however, it is often convenient to specify a group of similar item lists together.
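
For example, this sketch (the politeness production name is purely illustrative, and the single-quoted strings are literal token items, described below) defines two alternatives in a single statement:

grammar politeness: 'please' | 'kindly' : object;

This is equivalent to writing a separate statement for each item list:

grammar politeness: 'please' : object;
grammar politeness: 'kindly' : object;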

 

Each item list specifies a syntax rule for the production.  An item list looks like this:

 

  optional_qualifiers  item1 item2 ... itemN  optional_star

 

Each item in the list can be one of the following:

  - a literal token, written as a single-quoted string, which matches an input token with that text (as determined by the Dictionary's comparator);
  - a token type, which matches any input token of that type;
  - a dictionary property name, which matches any input token defined in the dictionary as a word under that property (such as adjective or noun in the examples later in this article);
  - the name of another production (a "sub-production"), which matches the input at that point if it matches any of that production's alternatives.

Each item type can optionally be followed by an arrow symbol, "->" (a hyphen followed by a greater-than sign), then a property name.  If this sequence is present, it indicates that, when the parser successfully matches the item, it will store the matching value in the given property of the object created to represent the production match.  For a token type or dictionary property item, the value stored in the property is simply the token value of the input token that matches the item.  For a sub-production item, the value stored in the property is the object created to represent the sub-production match.
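
For example, in this rule from the samples later in this article, the parser stores the token matching the adjective dictionary property in the adj_ property of the match object, and stores the match object for the nounPhrase sub-production in the np_ property:

grammar nounPhrase: adjective->adj_ nounPhrase->np_: object;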

 

The optional_star, if present, is simply an asterisk, "*".  This symbol indicates that any input tokens remaining after the tokens that match the rest of the syntax rule should simply be ignored.  It's tempting to think of the "*" as a "wildcard" that matches everything left in the input token list, but that's not quite how it works: the "*" doesn't actually match anything; it merely tells the parser to ignore any remaining tokens.  If an alternative of the root production does not end with a "*" symbol, either directly in its own rule or indirectly in its last sub-production, the parser will match the alternative only if it matches the entire input token list.  If the root alternative does end (directly or indirectly) with the "*" symbol, however, the parser will match the alternative even if extra input tokens remain after matching the alternative's items.  The "*" symbol, if present, must always be the last item in an alternative's list.
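
For example, a root rule along these lines (the command production name, the tag, and the 'take' literal are purely illustrative) would match input beginning with "take" followed by a noun phrase, and would ignore any tokens that follow:

grammar command(take): 'take' nounPhrase->np_ * : object;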

 

The optional_qualifiers, if present, specify additional information about the alternative.  Only one qualifier is currently valid:

 

  [badness number]

 

This qualifier assigns the alternative a "badness" rating, which can be used to create catch-all syntax patterns that you don't want to use except as a last resort; assigning a badness rating tells the parser to ignore the alternative until all other alternatives are exhausted.  This is especially useful for handling syntax errors in the user input, because it allows you to create alternatives that match anything in particular parts of the input; this helps pinpoint where the problem is, which in turn lets you give the user better feedback about it.
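
For example, a last-resort rule might look something like the following sketch; the tag, the badness value, and the use of the standard tokenizer's tokWord token type are all illustrative assumptions rather than anything the syntax requires:

/* illustrative catch-all: accept any single word as a noun phrase,
   but only if no ordinary nounPhrase rule matches */
grammar nounPhrase(unknown): [badness 100] tokWord->txt_ : object;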

Modifying and Replacing Grammar Rules

A grammar rule object can be replaced or modified by another grammar rule, just as a normal object can be replaced or modified, using the replace and modify keywords.  You might want to modify or replace a grammar rule when you are using a library that defines a set of grammar rules, because this allows you to use the rules the library defines, but remove or change specific rules whose library definitions are not what you want.

 

To use modify with a grammar rule, use this syntax:

 

  modify grammar production_name tag : alternative_list :
    property_list
  ;

 

This is almost the same as the normal grammar syntax, but note that no class list follows the colon after the alternative list.  No superclasses are specified with modify because the modified object has the same superclass or superclasses as the original object being modified.  Note also that the tag (a token enclosed in parentheses) is required, because this provides the unique name for the match object that distinguishes it from other match objects defined for the same production name.

 

Note that the alternative_list is optional: if you leave it out (so you just put two colons in a row after the name), then the compiler retains the original alternative list of the rule being modified.  This lets you override just the properties or methods of the grammar rule object, without changing any of the grammar it matches.
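
For example, assuming the nounPhrase(1) rule and the debugPrint() method shown elsewhere in this article, this sketch keeps the rule's original syntax and merely overrides the method:

modify grammar nounPhrase(1): :
  debugPrint() { "noun phrase (modified rule)\n"; }
;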

 

To use replace with a grammar rule, use this syntax:

 

  replace grammar production_name tag : alternative_list : class_list
    property_list
  ;

 

This is exactly the same as a normal grammar definition, except that the replace keyword precedes the definition.

 

If you use replace or modify with a grammar rule, the original grammar rule is completely replaced by the new grammar rule.  In this respect, replace or modify behave exactly the same way.  The differences between the two are their treatment of the original object's property list (in the case of replace, the original list is completely lost; in the case of modify, the original properties are inherited by the modified object) and of the original's superclass (replace specifies a brand new superclass, and modify uses the original object's superclass).

 

If you want to delete an existing grammar rule entirely, you can use the replace syntax, and specify an unmatchable alternative list.  An alternative list is unmatchable if it contains a token string that the tokenizer will never produce; for example, in most cases, tokenizers do not return spaces as tokens, so you can use the string ' ' as an unmatchable alternative:

 

  replace grammar nounPhrase(1): ' ': object;

Calling the Parser

To use a grammar rule, simply call the parseTokens() method on a production object.  (You must include the header file "gramprod.h" in your source file to use this method, because the method comes from the GrammarProd intrinsic class, which is defined in this header file.)

 

The parseTokens() method takes two arguments: the token list to parse, in the same format as the list returned by the Tokenizer class's tokenize() method, and the Dictionary object to use to look up the tokens that are matched against dictionary properties in the grammar rules.

Call this method on the "root" production of the grammar you wish to parse.  There's nothing special about a root production object—you can use any production here.  For example, if you want to parse just a noun phrase, you can call parseTokens() on your noun phrase production object, even if the noun phrase production is used as a sub-production in other rules.

 

This method returns a list of matches.  If the return list is empty, it indicates that there were no matches at all.  Otherwise, each entry in the list is the top object of a "match tree."
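
For example, a caller might check for the no-match case like this (nounPhrase, toks, and gDict are the same names used in the sample code later in this article):

  local matchList = nounPhrase.parseTokens(toks, gDict);
  if (matchList.length() == 0)
    "That doesn't match any of our grammar rules.\n";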

 

A match tree is a tree of objects that the parser dynamically creates to represent the syntax structure of the match.  Each object is an instance of one of the unnamed classes defined in a grammar statement.  In each of these objects, the properties named after the arrow symbols ("->") in the grammar item list are set to the actual values from the input token list.

Finding the Original Tokens

As the parser builds the match tree, it sets properties of each match object to indicate the indices of the first and last tokens involved in the match; these bounds are inclusive.  The properties the parser sets are called firstTokenIndex and lastTokenIndex.

 

In addition, the parser automatically sets the tokenList property of each match tree object to a reference to the original token list passed into parseTokens().  So, for a given match tree object match, the tokens matching the production can be obtained as follows:

 

  toks = match.tokenList.sublist(
     match.firstTokenIndex,
     match.lastTokenIndex - match.firstTokenIndex + 1);

 

In addition to the token list, the parser stores a list of "match results" in each object in the match tree, in the property tokenMatchList.  This list gives the result of the matchValues() method in the Dictionary's comparator object for each token that matched a literal in a grammar rule.  Each element of this list gives the match result for the corresponding element of the token list, so tokenMatchList[3] gives the matchValues() result for the third token.  If a token matches a dictionary property rather than a literal, its tokenMatchList entry will be nil, since matchValues() is not used to match such tokens.

 

The tokenMatchList information can be used to find out how well a particular token matched a grammar literal.  For example, this can be used to determine if the token matched with truncation, or with accent substitution using an equivalence mapping (see the StringComparator class for more details on these types of matches).
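
For example, here's a sketch of a routine that lists the raw comparator results for a match; interpreting the individual result flags depends on the comparator installed in the dictionary (see StringComparator), so this simply displays the values:

/* show the comparator match result for each token in a match */
showMatchResults(match)
{
  for (local i = match.firstTokenIndex ; i <= match.lastTokenIndex ; ++i)
  {
    local res = match.tokenMatchList[i];
    if (res != nil)
      "token <<i>>: literal match, comparator result = <<res>>\n";
    else
      "token <<i>>: matched a dictionary property (no comparator result)\n";
  }
}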

 

(Note: to be precise, the parser uses the properties exported under the global names "GrammarProd.firstTokenIndex" and "GrammarProd.lastTokenIndex" for the indices, "GrammarProd.tokenList" for the token list, and "GrammarProd.tokenMatchList" for the token match result list.  Since the GrammarProd header file, gramprod.h, defines these exports, most users can ignore this detail.)

grammarInfo

The compiler automatically generates a method called "grammarInfo" for each match object class.  This method provides information that allows the program to traverse a match tree without knowing anything about the structure of the tree, which is useful for debugging as well as for searching a tree for particular types of matches.

 

The grammarInfo method returns a list.  The first element in the list is the name of the match, which is simply the name of the production, plus the tag if one was specified.  Each subsequent element is the value of one of the properties used after an arrow ("->") in the production's rule or rules; the properties appear in the list in the same order they are specified in the rule.  If the rule contains multiple alternatives, each property appears only once in the list.

 

For example, suppose we define a rule like this:

 

grammar nounPhrase(1): adjective->adj_ noun->noun_ : object;

 

Now, suppose this rule matched the input "magic book."  The grammarInfo method for the match object would look like this:

 

['nounPhrase(1)', 'magic', 'book']

 

The first element is the production name with the tag.  The second element is the value of the adj_ property, which in this case is the literal token matched; likewise, the third element is the value of the noun_ property, which is another literal token.

 

The grammarInfo method can be used to write generic routines that traverse arbitrary match trees.  For example, we could write a simple routine to display, for debugging purposes, the contents of a match tree:

 

showGrammar(match, indent)
{
  local info;
 
  /* indent by the desired amount (two spaces per level) */
  for (local i = 0 ; i < indent ; ++i)
    "\ \ ";
 
  /* if it's not a sub-production, treat it specially */
  if (match == nil)
  {
    /* this tree element isn't used – skip it */
    return;
  }
  else if (dataType(match) == TypeSString)
  {
    /* it's a literal token match – show it */
    "'<<match>>'\n";
    return;
  }
 
  /* get the grammar info for the object */
  info = match.grammarInfo();
 
  /* show the production rule name, and the original text it matched */
  "<<info[1]>> [<<showGrammarText(match)>>]\n";
 
  /* show each sub-match */
  for (local i = 2 ; i <= info.length ; ++i)
    showGrammar(info[i], indent + 1);
}
 
/* show the text of a match tree item */
showGrammarText(match)
{
  for (local i = match.firstTokenIndex ; i <= match.lastTokenIndex ; ++i)
  {
    /* show a space before each entry except the first */
    if (i != match.firstTokenIndex)
      " ";
 
    /* show this token's text */
    "<<match.tokenList[i]>>";
  }
}

 

Sample Grammar Rules

Let's define some grammar rules for a very simple noun phrase parser.  For the purposes of this example, we'll keep things simple: our noun phrases will consist of a noun, or an adjective and a noun, or an adjective and an adjective and a noun, and so on.

 

The simplest way we could define these rules is by enumerating all of the different phrasings we could use.  So, we could write something like this:

 

grammar nounPhrase: noun->noun_: object;
grammar nounPhrase: adjective->adj_ noun->noun_: object;
grammar nounPhrase: adjective->adj1_ adjective->adj2_ noun->noun_:
   object;
grammar nounPhrase: adjective->adj1_ adjective->adj2_
   adjective->adj3_ noun->noun_: object;
 

And so on, up to some fixed limit on the number of adjectives we'll allow.

 

Now, this will work, but it's not very flexible.  If we stop at six adjectives, users might complain that they can't use eight.  We could add phrasings for seven and eight adjectives, but then users might complain about ten.  Our work would never end.

 

Fortunately, there is a more general approach.  The recursion-minded reader might have by now observed that we could in principle express a rule for an unlimited number of adjectives by saying that a noun phrase is a noun, or an adjective followed by a noun phrase.  This neatly subsumes any number of adjectives: with one adjective, we have a noun phrase consisting of an adjective followed by a noun phrase consisting of a noun; with two adjectives, we have an adjective followed by a noun phrase which consists of an adjective followed by a noun phrase which consists of a noun; and so on.

 

Not only can we express our noun phrase syntax this way in principle, we can express it this way in fact.  This is precisely the kind of thing at which our production scheme excels.  Here's the simpler and much more flexible noun phrase grammar:

 

grammar nounPhrase: noun->noun_: object;
grammar nounPhrase: adjective->adj_ nounPhrase->np_: object;

 

That's it—this completely solves our problem, no matter how many adjectives our verbose user types.  Suppose the user types "little red wagon": we first match the second alternative (with adjective "little"), leaving "red wagon" still to be parsed; we then match the second alternative again (with adjective "red"), leaving just "wagon" remaining; and finally we match the first alternative (with noun "wagon").  Here's how we actually call the parser to parse this:

 

  match = nounPhrase.parseTokens(tokList, gDict);

 

(Where do we get the tokList value to pass into the method?  We can use the standard run-time library class Tokenizer, or a subclass, and call its tokenize() method to produce a token list from a simple string we want to parse.)

 

So, what good is all of this?  Yes, we've built a set of rules that define a syntax structure, and we know how to call the parser to scan the rules.  However, how do we actually use this for anything?  What do we know, apart from whether or not an input token list matches the syntax?  It turns out we know a great deal.

 

When the parser matches our syntax lists, it creates "match trees" that give us the details of exactly how the input tokens are structured, according to our grammar.  The match tree consists of objects that the parser creates dynamically; these objects are instances of our grammar objects—not of the production objects, but of the unnamed "processor" objects that go with the grammar statement.

 

Let us, for the moment, assign some arbitrary labels to our processor objects.  These aren't really class names – we're just using these to keep track of what's going on:

 

grammar nounPhrase: noun->noun_: object; // pseudo-class "A"
grammar nounPhrase: adjective->adj_ nounPhrase->np_: 
    object; // pseudo-class "B"

 

When we call the parseTokens() method, the parser builds up match trees consisting of instances of "A" and "B".    For our "little red wagon" example, here's what the match tree looks like, assigning arbitrary names to the objects (which wouldn't really happen, since they're dynamically created):

 

obj1: B  adj_ = 'little'  np_ = obj2 ;
obj2: B  adj_ = 'red'     np_ = obj3 ;
obj3: A  noun_ = 'wagon' ;

 

The "root" of the match tree is obj1.  This example might make it more apparent why we call this a "tree": obj1 is the root, with an adjective on one branch and obj2 on the other; obj2 in turn branches to an adjective and obj3; and obj3 has a noun.

 

The next step in making these match trees useful is to give them some methods.

 

The fact that a nounPhrase can be used in syntax anywhere a noun phrase is required means that all of the different alternative forms of noun phrases should provide a common interface.  Clearly, they don't right now: each alternative has its own set of properties.  However, these properties are not meant as a public interface; they're intended only for the use of the specific alternative processor object.  What we must now do is define a public interface, which provides a common set of functionality for every nounPhrase alternative, and then specify an alternative-specific implementation of the public interface; the implementation effectively provides the bridge from the alternative-specific properties to the common information that any noun phrase must provide.

 

What sorts of things should noun phrases do?  In a real system, the primary thing we need from a noun phrase production is to resolve itself to a set of game objects.

 

For now, though, let's define a very simple public interface that simply provides a debugging display of the object.  To do this, we'll define a method, debugPrint(), that any nounPhrase processor object must provide.  Here's how we could implement this:

 

grammar nounPhrase: noun->noun_: object
  debugPrint() { "noun phrase: noun = <<noun_>>\n"; }
;
grammar nounPhrase: adjective->adj_ nounPhrase->np_: object
  debugPrint()
  {
    "noun phrase: adjective = <<adj_>>, nounPhrase = {\n";
    np_.debugPrint();
    "}\n";

  }
;

 

Note how the public interface method is implemented in every noun phrase processor object, but the actual implementation varies according to the underlying information in the processor object.  We actually take advantage of this in the recursive call to np_.debugPrint() in the second alternative: we know for a fact that np_ refers to a noun phrase processor object, because our syntax says so, so we can call its debugPrint() method without having to know what kind of nounPhrase it is.  This is important in our "little red wagon" example, because we'll actually have both kinds of nounPhrase alternatives present.

 

Here's some code that will parse a noun phrase and then show the debugging display for the result:

 

  local str;
  local toks;
  local match;
 
  /* ask for a string of input */
  ">";
  str = inputLine();
 
  /* tokenize the string with the standard tokenizer */
  toks = Tokenizer.tokenize(str);
 
  /* parse the tokens */
  match = nounPhrase.parseTokens(toks, gDict);
 
  /* display the match tree */
  for (local i = 1, local cnt = match.length() ; i <= cnt ; ++i)
  {
    "Match #<<i>>:\n";
    match[i].debugPrint();
  }

 

For our "little red wagon" example, we'll see something like this:

 

Match #1:
noun phrase: adjective = little, nounPhrase = {
noun phrase: adjective = red, nounPhrase = {
noun phrase: noun = wagon
}
}