Compiler Compiler

In Castle, you can define grammar(s) directly in your code. A Castle compiler will translate them into functions, using the build-in (PEG) compiler-compiler – at least that was it called back in the days of YACC.

How does one use that? And why should you?

Grammars, a short intro

A grammar is a collection of (parsing)-rules and some optional settings. Let us start with a simple example with some rules written in a mixture of EBNF and PEG meta-syntax.

castle_file  <- ( import_line | interface | implementation )* ;
import_line  <- IMPORT_stmt ( STRING_literal | qualID ) ';' ;
qualID       <-  '.'? nameID  ('.' nameID )*   ;
IMPORT_stmt  =   'import' ;
...

This basically defines that a castle_file is a sequence of import_line(s), interface(s), or implementation(s); which are all “non-terminal(s)” – see below. All of those ’non-terminals’ are defined by more rules. For example, an import_line starts with the IMPORT_stmt then comes either a STRING_literal or a qualID, and it ends with a semicolon (‘;’). Likewise, the IMPORT_stmt is set to ‘import’ (literally).

As we see, the grammar contains non-terminals and terminals. Non-terminals are abstract and defined by grammar rules, containing both (other) non-terminals and terminals. Terminals are concrete: they are the things (tokens) you type when programming. Some terminals are like constants like the semicolon at the end of import_line, therefore they are quoted in the grammar (Notice, the is also a non-quoted semicolon on each line, which is part of the syntax of grammar.)
Other terminals are more like variables, they have a value. The STRING_literal is a good example. Its value is the string itself. Similar for numbers and variable names.

In this (example) grammar, a qualID is a nameID (a name that is used as ID, like in any programming language), optionally followed by sub-names (again like most languages: a dotted name, specifying a field (in a field, in …). In Castle, that name may start with a dot –which is a shorthand notation for “in the current namespace”. You can ignore that for now.

A grammar defines how one (aka the compiler) should read the input –a text–, or more formally: how to parse it. The result of this parsing is twofold. It will check whether the input conforms to the grammar; resulting in a boolean, for the mathematics under us. And it will translate a sequential (flat) text into a tree structure; which is typically much more useful for a software engineer.

A well-known example is this HTML file. On disk, it’s nothing but text, which is easy to store and transfer. But when sent to your browser, it’s parsed to create the DOM; a tree of the document, with sections, paragraphs, hyperlinks, etc. By regarding it as a tree, it becomes easy to describe or selected parts (e.g. with CSS) how parts should be shown: all headers have a background, the first row in a table is highlighted, etc.

Parsing

Another well-known example is (the source of a) program. As code, it is just text. But the compiler will parse it into a parse tree and/or an “abstract syntax tree” (AST); which is built out of classes, methods, statements, etc.
But also your favorite IDE will parse it; to highlight the code, give tooltips, enable you to quickly navigate and refactor it, and all those convenient features that make it your favorite editor. And even you are probably parsing text as part of your daily job. When you un-serialise data, you are (often) parsing text; when you read the configuration, you are (or should be ) parsing that text. Even a simple input from the user might need a bit of parsing. The text “42” is not the number $42.0$ – you need to convert it; parse it.

There are many ways to parse. A full-fledged grammar to translate (the text) ‘42’ into the int “$42$” or the float “$42.0$” isn’t needed, a stdlib-function as “atoi()` or ``atof() will do. But how about handling complex numbers ($4+j2$) or fractions ($\frac{17}{42}$)?

Non-parsing

As writing a proper parser used to be (too) hard, other similar (but more simple) techniques are often used, like globing (``*.Castle`` on the bash-prompt will result in all Castle-files). It is elementary and will do in simple cases.
Others try to use regular expressions for parsing. RegExps are indeed more powerful than globing. They are often used to highlight code. A pattern as //.*$ can be used to highlight (single-line) comments. It often works, but this simple pattern might match a piece of text inside a multi-line-(doc)string – which is wrong.

Those tricks aren’t a sound solution to parse generic input/text; although I have seen cunning RegExps that almost (always) work. Regular expressions do have not the same power as grammars; that is already proven half a century ago and is not repeated here.

Grammars are more powerful

A grammar (even a simple one) is more powerful. You can define the (input) structure hierarchically. And specify the sub-structure of each lump. For example: when a multi-line-string has no sub-structure, the parser will never find a statement inside it; it simply is not hunting for it.

As most programming languages do not have built-in support for grammars, one has to resort to external tools. Like the famous YACC; developed in 197X. YACC will read a grammar-file and generates C-code that can be compiled and linked to your code.

Back then, writing compiler-compilers was a popular academic research exercise (YACC stands for: Yet Another Compiler Compiler). It was great for compiler designers, but clumsy to use for average developers: The syntax to write a grammar was hard to grasp, with many pitfalls, the interface between your code and the parser was awkward (you had to call yyparse(); needed some globals; OO wasn’t invented, no inheritance or data-hiding, which resulted in puzzling tricks to use multiple parsers, etc).
Aside from that, more and better parsing strategies are developed; that is handled in another blog.

Unleash that power!

With those better parsing algorithms, faster computers with a lot more memory, and other inventions, writing grammars has become more peaceful. Except that you still need an extra step, another syntax, as you still need to use an external tool. That sometimes isn’t maintained after a couple of years …
The effect is, most developers don’t use grammars; they write parser-like code manually, or they settle for less optimal results. Or are utterly not aware that grammar can provide another, better, easier solution.

With a few lines, you can define the structure of the input. Each rule is like a function: it has a name (the ‘left-hand side’ (LHS) of the rule, so the part before the arrow), and an implementation; the part after the arrow. That implementation “calls” other rules, like normal code.
When you call the “main rule function”, with the input stream as input, that file is parsed, and the complete input is ready to use; not more manual scanning and parsing. And when the file structure is slightly updated, you just add a few details to the grammar.

Castle has it built-in

Grammars makes reading text easy. Define the structure, call the “main rule” and use the values. Castle makes that simple!

UseCase: Castle has build-in grammar support U_Grammars

links incoming: R_GrammarCallables

In Castle one can define a grammar directly into the source code; as Grammar is code!

And, like many other details, the language is hiding the nasty details of parsing strategies. There is no need to generate, compile, and use that code, with external tools. All that clutter is gone.

Tip

The standard parsing algorithm is PEG, but that is not a requirement.

The syntax of grammars is quite generic, it’s the implementation of the Castle compiler that implements the parsing strategy; it should support PEG. But it is free to support others as well (with user-selectable compiler-plugins).
This is not unlike other compiler options.

To use the grammar, you simply call one of those rules as a function: pass the input (string) and it will return a (generic) tree structure. When you like to verify the syntax is correct: use the tree as a boolean: when it not-empty the input is valid.
Typically, however, you traverse that tree, like you do in many situations.

To read that early configuration: parse the file and walk over the tree. Or use it “ala a DOM” by using Castle’s Matching Statements (todo) to simply. Curious about how that works: continue reading in Grammar is code. Or skip to “Why there are No inline actions”.

All Sphinx Directives Grammar is code

Comments

comments powered by Disqus