Reasoning About XMLSchema Languages Using Formal Language Theory

Program-Transformation.Org: The Program Transformation Wiki
Dongwon Lee, Murali Mani, Makoto Murata.

Technical Report, IBM Almaden Research Center, RJ# 10197, Log# 95071*, November 16, 2000


A mathematical framework using formal language theory to describe and compare XML schema languages is presented. Our framework uses the work in two related areas - regular tree languages [CDG+97] and ambiguity in regular expressions [BEGO71, BKW98]. Using these work as well as the content in two classical references [HU79, AU79], we present the following results: (1) a normal form representation for regular tree grammars, (2) a framework of marked regular expressions and model groups, and their ambiguities, (3) five subclasses of regular tree grammars and their corresponding languages to describe XML content models: regular tree languages, TD(1) (top-down input scan with 1-vertical lookahead), single-type constraint languages, TDLL(1) (top-down and left-right input scan with 1-vertical and 1-horizontal lookaheads), and local tree languages, (4) the closure properties of the five language classes under boolean set operations, (5) a classification and comparison of a few XML schema proposals and type systems: DTD, XML-Schema, DSD, XDuce, RELAX, and (6) properties of the grammar classes under two common operations: XML document validity checking and type resolution (i.e., XML document interpretation).


-- MartinBravenboer - 30 May 2002