Context-free Syntax Definition
expression grammar
Disambiguating Expressions
Lexical Syntax Definition
identifiers, layout
Regular Expressions
Overloading Delimiters
BibTeX is a language for the description of bibliographical information such as articles and books. For example, the following entry describes a PhD thesis.

  @PhdThesis{Vis97.thesis,
    author = {Visser, Eelco},
    title  = {Syntax Definition for Language Prototyping},
    year   = {1997},
    month  = {September},
    school = {University of Amsterdam},
    URL    = { http://www.cs.uu.nl/~visser/thesis/ }
  }
In the syntax of BibTeX entries the symbol { has two meanings: (1) indicating the start of the list of fields and (2) indicating the start of a field body. The second kind of use requires a lexical treatment, since the body of a field consists of an arbitrary list of characters until the closing } is found. In an approach where scanner and parser are separated, it is not possible for the scanner to know which kind of { is encountered. Furthermore, a field body can contain nested occurrences of { and }, which should only occur in matching pairs.
Finally, the treatment of whitespace is different between entries, between the fields of an entry and inside the body of a field.
  context-free syntax
    C {Entry C}* C                               -> Entries
    "@" EName "{" Key "," {Field ","}* ","? "}"  -> Entry
    Name "=" Value                               -> Field
    "{" ValWords "}"                             -> Value
    (ValWord | ("{" ValWords "}"))*              -> ValWords

  lexical syntax
    ~[\{\}\ \t\n]+ -> ValWord

  lexical restrictions
    ValWord -/- ~[\{\}\ \t\n]
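The requirement that braces inside a field body occur only in matching pairs can also be illustrated outside SDF. The following Python sketch is not part of the BibTeX syntax definition; the function name `read_field_body` and its interface are hypothetical. It reads one field body by counting nesting depth, which is the kind of bookkeeping a purely regular scanner cannot express.

```python
from typing import Tuple


def read_field_body(text: str, start: int) -> Tuple[str, int]:
    """Return (body, index just past the closing brace).

    `start` must point at the opening '{' of a field body.
    Nested braces are allowed as long as they are balanced.
    """
    if start >= len(text) or text[start] != "{":
        raise ValueError("field body must start with '{'")
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Strip the outer pair of braces from the result.
                return text[start + 1:i], i + 1
    raise ValueError("unbalanced braces in field body")
```

For instance, applied to `{Syntax {Definition} for X}, year` at index 0, the sketch returns the body `Syntax {Definition} for X` and leaves `, year` unread.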
The complete syntax definition for BibTeX (which also treats double quotes in field bodies correctly) can be found at
http://www.cwi.nl/~mdejonge/grammar-base/bibtex.0/index.html
Solving Lexical Ambiguities
Longest Match
follow restrictions
Reserved Words
in a normal scanner generator like LEX
when combining languages we want to have separate sets of reserved words;
a COBOL reserved word should not be used as a COBOL identifier, but might be quite usable as a SQL identifier
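The idea of separate reserved-word sets per language can be sketched in a few lines of Python. This is an illustration only: the sets below are hypothetical and heavily abbreviated (the words are taken from keywords that appear in the grammar fragments on this page), and the function name `is_identifier` is an assumption, not part of any tool.

```python
# Hypothetical, abbreviated reserved-word sets, one per language.
RESERVED = {
    "COBOL": {"DATA", "DIVISION", "SECTION", "FILE", "FD"},
    "SQL": {"SELECT"},
}


def is_identifier(word: str, language: str) -> bool:
    """A word may be used as an identifier unless it is reserved
    in the language of the fragment in which it occurs."""
    return word.upper() not in RESERVED[language]
```

With these sets, `is_identifier("DATA", "COBOL")` is false while `is_identifier("DATA", "SQL")` is true: the word is rejected only in the language that reserves it.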
Ignoring Whitespace in Lexicals
In Fortran whitespace inside lexicals is not significant. This can be accommodated in SDFII by using context-free syntax to define lexicals.
Dividing a Syntax Definition into Modules
reuse of pieces of syntax
renaming
Combining Languages
COBOL is a language for manipulating business information represented by means of lists of records. COBOL programs are often mixed with fragments from other languages. For example, SQL queries can be embedded to access a database and CICS programs are used for process control. It is desirable to describe the syntax of each of the languages separately and combine these descriptions as needed.
In a traditional syntax definition formalism this is not possible: (1) The restricted grammar classes, such as LL or LALR, on which the context-free syntax is based are not closed under composition. (2) The regular grammars on which the definition of the lexical syntax is based are not closed under composition either.
In practice, this translates to the following: A scanner does not consider the context in determining the sort of a token. Therefore, normal scanners cannot deal with
(LEX provides a workaround by means of modes.)
In SDFII the syntax of the composing languages can be described in separate modules and combined at will. For example, consider the following fragments from a syntax definition for COBOL. (Note that the actual combined syntax definition for COBOL, CICS and SQL consists of 1600 lines of code divided into 38 modules.)
Module ID defines the syntax of identifiers.

  module ID
  lexical syntax
    [0-9]* [A-Z]                                                 -> Lex-Id
    [0-9]* [A-Z] [A-Za-z0-9\-]* [A-Za-z0-9]                      -> Lex-Id
    [0-9]+ [\-] [0-9\-]* [A-Z] [A-Za-z0-9\-]* [A-Za-z0-9]        -> Lex-Id
  context-free syntax
    Lex-Id -> Id
  lexical restrictions
    Lex-Id -/- [A-Za-z0-9\-]
Module COBOL defines the syntax of COBOL programs. The actual syntax definition for COBOL consists of 36 modules. Here only the productions relevant for the example are shown. Note that the syntax of Picture overlaps with the syntax for Id. This overlap is disambiguated by context.
  module COBOL
  imports ID
  %% ...
  lexical syntax
    [0-9XxAa\(\)pZzVvSszBCRD\/\,\$\+\-\*\:]+ -> Picture
  context-free syntax
    Ident-div Env-div Data-div Proc-div             -> Program
    "DATA" "DIVISION" "." File-sec Ws-sec Link-sec  -> Data-div
    "FILE" "SECTION" "." File-desc*                 -> File-sec
    "FD" Id Fd-item* "." Data-desc*                 -> File-desc
    Dd-header Dd-body*                              -> Data-desc
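The overlap between Picture and Lex-Id can be checked mechanically. The Python sketch below transliterates the two lexical definitions above into regular expressions and reports which sorts match a whole token. The transliteration, the helper name `sorts` and the sample tokens are assumptions made for illustration; the follow restriction on Lex-Id is not modelled.

```python
import re

# Transliteration of the three Lex-Id productions from module ID.
LEX_ID = re.compile(
    r"[0-9]*[A-Z]"
    r"|[0-9]*[A-Z][A-Za-z0-9\-]*[A-Za-z0-9]"
    r"|[0-9]+-[0-9\-]*[A-Z][A-Za-z0-9\-]*[A-Za-z0-9]"
)

# Transliteration of the Picture production from module COBOL.
PICTURE = re.compile(r"[0-9XxAa()pZzVvSszBCRD/,$+\-*:]+")


def sorts(token: str) -> set:
    """Return the set of lexical sorts whose pattern matches the whole token."""
    result = set()
    if LEX_ID.fullmatch(token):
        result.add("Lex-Id")
    if PICTURE.fullmatch(token):
        result.add("Picture")
    return result
```

A token such as `ZZ9` matches both sorts, whereas `X(10)` matches only Picture and `CUSTOMER-NAME` only Lex-Id. A scanner that looks at the characters alone therefore cannot classify `ZZ9`; the surrounding context has to decide.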
Module SQL defines the syntax for SQL queries. Queries are embedded into COBOL programs by means of the keywords EXEC SQL ... END-EXEC.
  module SQL
  lexical syntax
    [A-Z0-9\-\_\.\:]+ -> Sql-id
  context-free syntax
    "SELECT" Distinct Select-list From-into Where Order-by  -> Select
    Select                                                  -> Sql-item
    "EXEC" "SQL" Sql-item+ "END-EXEC" "."                   -> Data-desc
    "EXEC" "SQL" Sql-item+ "END-EXEC"                       -> Stat

Module CICS defines the syntax of CICS commands and their embedding in COBOL programs. Note that a command can have a reference to an A-exp, which is a COBOL expression.

  module CICS
  imports PROGRAM
  lexical syntax
    [A-Z]+ -> Cics-kw
  context-free syntax
    Stat* "EXEC" "CICS" Cics-command Cics-opt* "."     -> Sentence
    "EXEC" "CICS" Cics-command Cics-opt* "END-EXEC"    -> Stat
    Cics-kw                                            -> Cics-opt
    Cics-kw "(" Cics-arg ")"                           -> Cics-opt
    A-exp                                              -> Cics-arg
    Str                                                -> Cics-arg
    "ADDRESS" "OF" A-exp                               -> Cics-arg
    "LENGTH" "OF" A-exp                                -> Cics-arg
    "ABEND"                                            -> Cics-command
    %% etc.