Grammar is code¶
In Compiler Compiler we have seen that we can define a grammar within a Castle-program. And we have argued that each grammars-rule can be considered as a function.
In this post, we look into de details of how this works. And will confirm grammars is code …
Let’s start with an example. The grammer below is a simplified description of how a (PEG) parsing-rule looks like. The syntax in Castle is a bit more complicated, by having more details and options; but very simular. Most of the Caste-grammers can be parsed by the grammer below!
PEG_rule <- rule_name '<-' expression ';' ; expression <- ordered_choice+ ; ordered_choice <- sequence ( '|' sequence)* ; sequence <- group | atom ; group <- '(' expression ')' ; atom <- rule_xref | str_lit | regexp ; str_lit <- '"' /[^"]* '"' ; regexp <- '/' /[^/]* '/' ; rule_name <- ID ; rule_xref <- ID ;
With this grammer we can read and check whether an a string is a valid rule by simply calling:
\(ast:=PEG\_rule(single\_line)\). When not,
ast is \(False\), else
ast has 4 children (or fields).
ast represent the
ast is an
expression; which is a sequence of
As the rule has two constants
ast will always match the strings ‘<-’ and ‘;’.
Grammar to code¶
Before we study how Castle expands a grammar to code, let examine how to do that by manually crafted code, or using a external compiler-compiler. In both cases, we will focus on the “validation” part only. And we skip a lot of details; assuming you know a bit of theory of (PEG) parser. When not, read Guido van Rossum’s excellent blog-series PEG Parsing Series Overview
Hand written code¶
Again, let’s suppost we like to verify a string contains a peg-rule. Then we need some functions 1, which signature are something like:
def PEG_rule(text: str) -> ast : ... def expression(text: str) -> ast : ... def ordered_choice(text: str) -> ast : ... def sequence(text: str) -> ast : ... ...
ast PEG_rule(char* txt); ast expression(char* txt); ast ordered_choice(char* txt); ast sequence(char* txt); ...
Where we assume the type
ast is defined in the chosen language and will be Null/None/Empty to signal \(False\):
The implementation of
PEG_rule() checks thats the input-string starts with a
rule_name, followed by the literal
string “<-”, then has an
expression and finally the literal string “;”. When one of those (four) steps fail,
we return \(False\).
We follow this pattern for all rules.
This concept is easy to implement: each rule-function calls other rule-functions as defined by the grammar. When we
need to check for a literal string we use
expect(txt: str, literal: str) -> bool(). Also, we need a function to find
an ID; again easy. Sometimes we need to implement a loop –to handle
+. Or we need an \(or\), to
implement alternatives (
|). None of that is rocket science.
A real implementation is a bit harder, as we have to strip spaces (and comments), handle newlines, and need to keep track of where we are. Typically that is (nowadays) done by embedding those functions in a class; then the “input text” can be stored in the instance (instead of passing them constantly). That instance also has a ‘cursor’ to the current location.
There are a lot of details that make writing a grammer complex. We mention a few, and what it effect is on the (manually written) code.
When using alternatives (the
| operator in the grammar), a PEG-parser will always try the first alternative first,
Only when that fails, it back-ups an try the next alternative. Sometimes means (almost) start again, and parse the same file almost
completely again. Therefore the packrat algorithm is usually used; using memoization.
This is not hard: just add a few lines of boilerplate before and after each call. To store intermediate partial-ast(s) in a cache.
Sometimes, we like to use another parser-strategy, like LALR (used by Yacc), GLR (e.g Bison, the successor of Yacc) or LL(k) (introduced by ANTLR, which was popular for a while); each one has it pros and cons. Still, all (or almost) start with the same grammar (although smarter strategies may result is shorter, easier to maintain 2 grammars) 3.
For a long time PEG-parsers where not able to handle left recursive rules 4. Until somebody discovered that is not correct. Grammars in Castle can be left recursive! Both direct and indirect recursion is allowed.
expr <- expr '-' term | term
A <- B "a" | "a" B <- A "b" | "b"
expr <- term ( '-' term )*
It is always possible to rewrite a lef-recursief grammar to one that isn’t. However, that make the grammar harder to read & maintain (for humans). It does also influence the outcome; the tree will differ.
By example, an simple calculation as \(7-5-3\) should result in \(((7-5)-3)\) but that needs left
recursion. When rewriting it, you must be carefull not to get \((7-(5-3))\)!
This can be fixes, by adding an extra step. But it is better to use the update PEG-strategy: Just add more boilerplate code!
For that reason Castle will support recursion! You can write the grammar as you need, as we are generating that extra boilerplate anyhow.
Generating the code¶
You might recognise the pattern: To make the grammer more useful, the algorithms become more complex and adds more
code. This “extra” code, however is not hard; you just need the same (or almost the same) lines at many places.
This begs for automation. And that is exactly what most compiler-compilers do.
A compiler-compilers read the grammar and generates the code. As shown above it will generate (C, C++, C#, Java, Python, or …) functions 5 that call each-other. It will also detect left-recursion, and might compensate for that. The result: more boilerplate-code; but as it is automatically generated this is easy.
There are many tools, that we can use for inspiration. A short overview, and how it influences Castle.
Possible the most famous compiler-compilers is Yacc. It was developed in 197X and generates C-code that can be compiled
and linked to your code. To parse a string, you had to call
yyparse()). It would however be relatively simple to
generate functions with the name of each rule, using the same machinery. In that decade however, the goal was
differently. Memory was limited, what we can also see in the used grammar: one had to craft it carefully as the was no
back-tracking an only a single token look-ahead.
Bison is Gnu reimplementation of Yacc, but can use several parsing-algorithms.cLike Yacc, it used a separate Lexer:
flex (whereas Yacc uses lex). A lexer splits the input-string into a stream of Tokens using another (simpler,
but faster) algorithm. In that time that was relevant.
As a lexer can be implemented with a parsing-algorithm (but not the other-way around), and as the need for speed doesn’t demand a separate lexer anymore; modern parsings are often “scannerless”. This removes the need to use two meta-syntaxes (for the lexer/scanner and the parser) and so is simpler to use.
Also Castle use a scannerless approach.
Castle: rules are functions¶
Resolution: In Castle each rule act as a *callable* R_GrammerCallables
Each parse-rule that is defined in Castle is a kind of function; more accurately it’s a “callable”
Also in Castle you can use grammars; but now directly in your program, using the Castle-syntax. And Castle will generate
“code” –Castle-functions that is. But now without an extra tool.
This “generate code” is not ‘code as text’. Why should we generate code, to read & parse it back and compile it directly? It easier to generate the AST, that would be the result of parsing the generated-code, directly.
But the effect is the same. You create a set of function with this generic “text to tree” signature, by writing some simle rule. Castle does the rest for you. Easy!
Instead of a function, it can also be a method, or any *callable. We use ‘function’ a generic term, in the mathematical meaning: some input (parameters) and an output (return value).
This is not specially for grammers; all it valid for all programming-languages. New languages may introduce new concepts (like –once– OO). When the compiler becomes smarter, the programmer can focus in the important bits!
Aside of multiple parser-algorithms, there are also several notation to write the grammar itself; like EBNF ABNF, and YACC Most implementations of a given algorithm, use a dialect of a standard one, to enable No inline actions, or ..
Also Caste does this: We use the Caste-grammer, which is based on both EBNF and PEG; but using the classic ‘|’ instead of the ‘’ for ordered-choice.
Without going into details left-recursion is hard for many parsing-algorithms. In the shown approach, a rule-function (for a rule that is direct left-recurse) will call itself as first step. In this way no progress is made, and the stack will quickly overrun.
Some tools, like Yacc by example, use another approach. Instead of many functions it has a generic (run-time) library that used code-tables; which are generated by the tool. Still, that is just a implementation detail.