Grammar is code

In Compiler Compiler we have seen that we can define a grammar within a Castle-program. And we have argued that each grammar-rule can be considered as a function.

In this post, we look into the details of how this works. And we will confirm that grammar is code …

Let’s start with an example. The grammar below is a simplified description of what a (PEG) parsing-rule looks like. The syntax in Castle is a bit more complicated, with more details and options, but very similar. Most Castle-grammars can be parsed by the grammar below!

PEG_rule        <-  rule_name '<-' expression ';' ;
expression      <-  ordered_choice+ ;
ordered_choice  <-  sequence ( '|' sequence)* ;
sequence        <-  group | atom ;
group           <-  '(' expression ')' ;
atom            <-  rule_xref | str_lit | regexp ;
str_lit         <-  '"'  /[^"]*/  '"' ;
regexp          <-  '/'  /[^\/]*/  '/' ;
rule_name       <-  ID ;
rule_xref       <-  ID ;

With this grammar we can read and check whether a string is a valid rule by simply calling: \(ast:=PEG\_rule(single\_line)\). When it is not, ast is \(False\); otherwise ast has 4 children (or fields), where ast[0] represents the rule_name and ast[2] is an expression, which is a sequence of ordered_choice(s).
As the rule has two constants, ast[1] and ast[3] will always match the strings ‘<-’ and ‘;’.
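
To make that concrete: for the grammar’s own second line, the tree has roughly the shape below (sketched as plain Python data; the real ast-type in Castle is richer, and the shape of ast[2] is simplified here):

# Input (one line):   expression <- ordered_choice+ ;
# The four children of the resulting ast:
ast = [
    "expression",          # ast[0]: the rule_name
    "<-",                  # ast[1]: always the constant '<-'
    ["ordered_choice+"],   # ast[2]: the expression (here: one ordered_choice)
    ";",                   # ast[3]: always the constant ';'
]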

Grammar to code

Before we study how Castle expands a grammar to code, let’s examine how to do that with manually crafted code, or using an external compiler-compiler. In both cases, we will focus on the “validation” part only. And we skip a lot of details, assuming you know a bit of the theory of (PEG) parsers. If not, read Guido van Rossum’s excellent blog-series PEG Parsing Series Overview.

Hand written code

Again, let’s suppose we would like to verify that a string contains a PEG-rule. Then we need some functions [1], whose signatures are something like:

def PEG_rule(text: str) -> ast: ...
def expression(text: str) -> ast: ...
def ordered_choice(text: str) -> ast: ...
def sequence(text: str) -> ast: ...
...

Here we assume the type ast is defined in the chosen language, and that it will be Null/None/Empty to signal \(False\): not valid.

The implementation of PEG_rule() checks that the input-string starts with a rule_name, followed by the literal string “<-”, then has an expression, and finally the literal string “;”. When one of those (four) steps fails, we return \(False\).
We follow this pattern for all rules.
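
In Python, that could look like the sketch below. (We deviate slightly from the signatures above: each helper returns the parsed node plus the remaining text, so the caller knows how far we got; rule_name(), expression() and expect() are assumed to follow the same convention.)

def PEG_rule(text: str):
    """rule_name '<-' expression ';' -- return the 4 children, or None."""
    name, text = rule_name(text)        # step 1: a rule_name
    if name is None:
        return None
    ok, text = expect(text, "<-")       # step 2: the literal '<-'
    if not ok:
        return None
    expr, text = expression(text)       # step 3: an expression
    if expr is None:
        return None
    ok, text = expect(text, ";")        # step 4: the literal ';'
    if not ok:
        return None
    return [name, "<-", expr, ";"]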

This concept is easy to implement: each rule-function calls other rule-functions, as defined by the grammar. When we need to check for a literal string we use a helper like expect(txt: str, literal: str) -> bool. Also, we need a function to find an ID; again easy. Sometimes we need to implement a loop, to handle * and +. Or we need an \(or\), to implement alternatives (|). None of that is rocket science.
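
A sketch of those helpers, using the same (node, remaining-text) convention as above (match_ID is our name for the ID-helper; ordered_choice() is assumed to exist and follow the same pattern):

import re

def expect(text: str, literal: str):
    """Check for a literal string (skipping leading spaces)."""
    text = text.lstrip()
    if text.startswith(literal):
        return True, text[len(literal):]
    return False, text

def match_ID(text: str):
    """Find an ID, as used by rule_name and rule_xref."""
    m = re.match(r"\s*([A-Za-z_]\w*)", text)
    if m:
        return m.group(1), text[m.end():]
    return None, text

def expression(text: str):
    """ordered_choice+ : a loop that demands at least one match."""
    choices = []
    while True:
        node, rest = ordered_choice(text)   # defined like the others
        if node is None:
            break
        choices.append(node)
        text = rest
    return (choices, text) if choices else (None, text)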

A real implementation is a bit harder, as we have to strip spaces (and comments), handle newlines, and keep track of where we are. Typically that is (nowadays) done by embedding those functions in a class; then the “input text” can be stored in the instance (instead of being passed around constantly). That instance also holds a ‘cursor’ to the current location.
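
A sketch of that class-based setup (our own naming, not Castle’s generated code; rule_name() and expression() become methods following the same pattern):

class Parser:
    def __init__(self, text: str):
        self.text = text    # the input, stored once
        self.pos = 0        # the 'cursor': where we are now

    def skip_ws(self):
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1

    def expect(self, literal: str) -> bool:
        self.skip_ws()
        if self.text.startswith(literal, self.pos):
            self.pos += len(literal)
            return True
        return False

    def PEG_rule(self):
        saved = self.pos    # remember, so we can back-track on failure
        name = self.rule_name()
        if name and self.expect("<-"):
            expr = self.expression()
            if expr and self.expect(";"):
                return [name, "<-", expr, ";"]
        self.pos = saved
        return None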

More details

There are a lot of details that make writing a grammar complex. We mention a few, and what their effect is on the (manually written) code.

When using alternatives (the | operator in the grammar), a PEG-parser will always try the first alternative first. Only when that fails, it backs up and tries the next alternative. Sometimes this means (almost) starting over, parsing the same file almost completely again. Therefore the packrat algorithm is usually used, which relies on memoization.
This is not hard: just add a few lines of boilerplate before and after each call, to store intermediate partial-ast(s) in a cache.
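
In the class-based version above, that boilerplate can even be hidden in a decorator; a minimal sketch (assuming the Parser gets a self.cache = {} in its __init__):

from functools import wraps

def memoize(rule):
    """Packrat boilerplate: cache (result, end-position) per start-position."""
    @wraps(rule)
    def wrapper(self):
        key = (rule.__name__, self.pos)
        if key in self.cache:                   # seen before: reuse the result
            result, self.pos = self.cache[key]
            return result
        result = rule(self)
        self.cache[key] = (result, self.pos)    # store the partial-ast
        return result
    return wrapper

# Usage: decorate every rule-method, e.g.
#   @memoize
#   def PEG_rule(self): ...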

Sometimes, we like to use another parser-strategy, like LALR (used by Yacc), GLR (e.g. Bison, the successor of Yacc) or LL(k) (introduced by ANTLR, which was popular for a while); each one has its pros and cons. Still, all (or almost all) start with the same grammar (although smarter strategies may result in shorter, easier-to-maintain [2] grammars) [3].

For a long time PEG-parsers were not able to handle left-recursive rules [4]; until somebody discovered that this restriction is not needed. Grammars in Castle can be left recursive! Both direct and indirect recursion are allowed.

expr <- expr '-' term | term

Note

It is always possible to rewrite a left-recursive grammar to one that isn’t. However, that makes the grammar harder to read & maintain (for humans). It also influences the outcome; the tree will differ.

For example, a simple calculation such as \(7-5-3\) should result in \(((7-5)-3)\), but that needs left recursion. When rewriting it, you must be careful not to get \((7-(5-3))\)!
This can be fixed by adding an extra step. But it is better to use the updated PEG-strategy: just add more boilerplate code!
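
That extra step is a left-fold over the flat list that a rewritten rule like expr <- term ('-' term)* yields; a sketch:

def fold_left(parts):
    """Rebuild left-associativity from a flat parse like [7, '-', 5, '-', 3]."""
    tree = parts[0]
    for i in range(1, len(parts), 2):
        tree = (tree, parts[i], parts[i + 1])   # the left operand keeps growing
    return tree

# ((7-5)-3), as it should be -- not (7-(5-3)):
assert fold_left([7, "-", 5, "-", 3]) == ((7, "-", 5), "-", 3)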

For that reason Castle supports left recursion! You can write the grammar as you need it; we are generating that extra boilerplate anyhow.
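
That boilerplate is, in essence, the “seed growing” trick (known from Warth et al.) on top of the packrat cache. A minimal sketch, again as a method on our Parser class; Castle’s real implementation will differ:

def grow_left_recursion(self, rule):
    """Evaluate a left-recursive rule at the current position.

    First seed the cache with a failure, so the recursive call stops;
    then re-run the rule while each pass consumes more input."""
    start = self.pos
    key = (rule.__name__, start)
    self.cache[key] = (None, start)     # the seed: recursion fails at first
    best = (None, start)
    while True:
        self.pos = start
        result = rule(self)
        if result is None or self.pos <= best[1]:
            break                       # no more progress: stop growing
        best = (result, self.pos)
        self.cache[key] = best          # grow the seed
    result, self.pos = best
    return result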

Generating the code

You might recognise the pattern: to make the grammar more useful, the algorithms become more complex and add more code. This “extra” code, however, is not hard; you just need the same (or almost the same) lines in many places.
This begs for automation. And that is exactly what most compiler-compilers do.

A compiler-compiler reads the grammar and generates the code. As shown above, it will generate (C, C++, C#, Java, Python, or …) functions [5] that call each other. It will also detect left-recursion, and might compensate for that. The result: more boilerplate-code; but as it is automatically generated, this is easy.

Classic tools

There are many tools that we can use for inspiration. A short overview, and how they influence Castle.

Possibly the most famous compiler-compiler is Yacc. It was developed in the 1970s and generates C-code that can be compiled and linked into your program. To parse a string, you had to call yyparse(). It would however be relatively simple to generate functions with the name of each rule, using the same machinery. In that decade, however, the goal was different. Memory was limited, which we can also see in the grammars of that time: one had to craft them carefully, as there was no back-tracking and only a single token of look-ahead.

Bison is the GNU reimplementation of Yacc, but it can use several parsing-algorithms. Like Yacc, it uses a separate lexer: flex (whereas Yacc uses lex). A lexer splits the input-string into a stream of tokens, using another (simpler, but faster) algorithm. At the time, that was relevant.
As a lexer can be implemented with a parsing-algorithm (but not the other way around), and as the need for speed no longer demands a separate lexer, modern parsers are often “scannerless”. This removes the need for two meta-syntaxes (one for the lexer/scanner and one for the parser) and so is simpler to use.
Castle also uses a scannerless approach.

Castle: rules are functions

Resolution: In Castle each rule acts as a callable (R_GrammarCallables)

Each parse-rule that is defined in Castle is a kind of function; more accurately, it’s a “callable”.

In Castle too you can use grammars, but now directly in your program, using the Castle-syntax. And Castle will generate “code”: Castle-functions, that is. But now without an extra tool.
This “generated code” is not ‘code as text’. Why should we generate code as text, only to read & parse it back and compile it again? It is easier to generate the AST that would be the result of parsing that generated code directly.

But the effect is the same. You create a set of functions with this generic “text to tree” signature, just by writing some simple rules. Castle does the rest for you. Easy!
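
To illustrate the idea in Python (Castle’s internals will differ): a rule can become a callable directly, without any ‘code as text’ in between:

def literal(lit):
    """Make a callable that matches a literal string."""
    def parse(text, pos):
        if text.startswith(lit, pos):
            return lit, pos + len(lit)
        return None, pos
    return parse

def sequence(*parts):
    """Make a callable that matches all parts, in order."""
    def parse(text, pos):
        children = []
        for part in parts:
            node, pos = part(text, pos)
            if node is None:
                return None, pos
            children.append(node)
        return children, pos
    return parse

# The rule `greeting <- 'hello' ' ' 'world' ;` becomes a callable directly:
greeting = sequence(literal("hello"), literal(" "), literal("world"))
print(greeting("hello world", 0))   # (['hello', ' ', 'world'], 11)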

And in case you are wondering how to use actions as in e.g. YACC: you don’t (need them). See No inline actions for why they are outdated, and both Matching Statements (todo) and Cyclic woods (todo).


Footnotes
