QuickNote: PEGEN

To implement CCastle we need a parser, as part of the compiler. Eventually, that parser will be written in Castle. For now, we kickstart it in Python, which has several packages that can assist us. As we would like to use a PEG parser, there are a few options. Arpeggio is well known and has some nice options, but (like most PEG parsers) it can’t handle left recursion.

Recently, Python itself switched to a PEG parser, one that supports left recursion (which is a recent development). That parser is also available as a package: pegen. It is hardly documented, however.

This blog is written to remember some lessons learned when playing with it, and to serve as a kind of informal docs.

See also

QuickNote: Arpeggio is another candidate package for the PEG parser in the initial Workshop Tools

Built-In Lexer

Pegen is written specifically for Python and uses a specialised lexer, unlike most PEG parsers, which use PEG for lexing too. Pegen uses the tokenizer that is part of Python. This comes with some restrictions.

This lexer (or tokenize(r), as Python calls it) is used both to read the grammar (the PEG file) and to read the source files that are parsed by the generated parser.

Hint

These restrictions apply when we use pegen as a module: python -m pegen ..., which calls simple_parser_main().
They also apply when we use the parser class in our own code (so, with from pegen.parser import Parser ...). Then a bit more is possible, as we can configure another (self-made) lexer. The interface is quite narrowly tied to Python, however.

Tokens

The lexer will recognise some tokens that are special for Python, like INDENT & DEDENT. Some generic tokens, like NAME (which is an ID) and NUMBER, are also known and can be used to define the language.
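To get a feel for these token types, here is a minimal sketch that uses only Python's own tokenize module (not pegen). The snippet in SOURCE and the helper name token_names are made up for this illustration.

```python
import io
import token
import tokenize

# A tiny, made-up snippet: one indented block with a name and a number.
SOURCE = "if x:\n    y = 42\n"


def token_names(source):
    """Return the token-type name of every token the Python lexer produces."""
    readline = io.StringIO(source).readline
    return [token.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


names = token_names(SOURCE)
print(names)
```

Note that the lexer reports the word if as a plain NAME too; it does not know about keywords, which is left to the parser.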

Unfortunately, it will also find some tokens (typically operators) that are hardcoded for Python, even when we would like to use them differently, possibly combined with other characters. In that case the literal strings as set in the grammar will not be found, as the lexer has already split them into its own tokens.

Note

Pegen speaks about (soft) keywords for all kinds of literal terminals, even when they are more like operators than words.

Warning

When the grammar defines (literal) terminals (or keywords) –especially for operators– make sure the lexer will not break them into predefined tokens!
This will not give an error, but it does not work!

Left_arrow_BAD: '<-'      ## This is WRONG, as ``<`` is seen as a token. And so,  `<-` is never found
Left_arrow_OKE: '<' '-'   ## This is acceptable

This splitting results however in 2 entries in the resulting tree –unless one uses grammar actions to create one new “token”.
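Besides grammar actions, another route to a single entry is to fuse the two tokens before the parser sees them. The sketch below is that alternative (it is not a grammar action): a generator, with the made-up name fuse_pairs, that wraps the token stream of Python's tokenize module. It only fuses when the two tokens touch, so `< -` with a space stays two tokens. Whether pegen accepts such a wrapped stream, and whether it then matches the fused token against a '<-' literal, are assumptions that were not verified here.

```python
import io
import tokenize


def fuse_pairs(tokens, first="<", second="-"):
    """Yield tokens, replacing a directly adjacent first+second pair by one OP token."""
    pending = None                      # a `first` token we are holding back
    for tok in tokens:
        if pending is not None:
            if tok.string == second and tok.start == pending.end:
                # The pair touches: emit one combined token instead of two.
                yield tokenize.TokenInfo(tokenize.OP, first + second,
                                         pending.start, tok.end, pending.line)
                pending = None
                continue
            yield pending               # no pair after all: release the held token
            pending = None
        if tok.type == tokenize.OP and tok.string == first:
            pending = tok
        else:
            yield tok
    if pending is not None:
        yield pending


def lex(source):
    """Tokenize a string with the standard Python tokenizer."""
    return tokenize.generate_tokens(io.StringIO(source).readline)


fused = [tok.string for tok in fuse_pairs(lex("a <- b\n"))]
spaced = [tok.string for tok in fuse_pairs(lex("a < - b\n"))]
print(fused[:3])
print(spaced[:4])
```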

See also

See https://docs.python.org/3/library/token.html for an overview of the predefined tokens.

Tip

A quick trick to see how a file is split into tokens: use python -m tokenize [-e] filename.peg.
Make sure you do not use string literals that are (e.g.) composed of two tokens, like the above-mentioned <-.
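The same check can be done from Python, for one literal at a time. This is a small sketch on top of the standard tokenize module; the helper name split_literal and the list of candidate literals are just examples.

```python
import io
import tokenize

# Bookkeeping tokens that are not part of the literal itself.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}


def split_literal(literal):
    """Return the pieces the Python tokenizer makes of a grammar literal."""
    readline = io.StringIO(literal + "\n").readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type not in SKIP]


for literal in ("<-", "->", "=>"):
    pieces = split_literal(literal)
    verdict = "ok" if len(pieces) == 1 else "split"
    print(f"{literal!r:6} {verdict:6} {pieces}")
```

A literal that comes back as more than one piece should be written as separate terminals in the grammar.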

Rule names

The GeneratedParser inherits from (and calls) the base pegen.parser.Parser class and has a method for each rule name. This implies some names should not be used as rule names (in all cases); see the sidebar.
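A rule whose name equals a method of the base class would silently override that method. The sketch below shows one way to look for such clashes. BaseParser is a hypothetical stand-in with two made-up helper methods, so the example runs without pegen; clashing_rule_names is also a name invented here.

```python
class BaseParser:
    """Hypothetical stand-in for a parser base class with helper methods."""

    def name(self):
        return "NAME token"

    def expect(self, literal):
        return literal


def clashing_rule_names(base_class, rule_names):
    """Return the rule names that would override an attribute of base_class."""
    reserved = {attr for attr in dir(base_class) if not attr.startswith("__")}
    return sorted(set(rule_names) & reserved)


clashes = clashing_rule_names(BaseParser, ["start", "statement", "name", "expect"])
print(clashes)
```

With pegen installed, passing pegen.parser.Parser instead of the stand-in should list the names to avoid for the installed version.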

Meta Syntax (issues)

No: regexps

PEGEN has no support for regular expressions, probably because it uses a custom lexer.

Unordered Group starts a comment

PEGEN (or its lexer) uses the # to start a comment. This implies that an unordered group, ( sequence )# as in Arpeggio, is not recognized.

A workaround is to use another character like @ instead of the hash (#).
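The effect is easy to see with Python's own tokenizer. In this sketch the two grammar-like lines are made up, and only the standard tokenize module is used; anything pegen does on top of the tokenizer when reading a grammar is not covered.

```python
import io
import token
import tokenize


def lex(source):
    """Return (type-name, string) pairs for every token in source."""
    readline = io.StringIO(source).readline
    return [(token.tok_name[tok.type], tok.string)
            for tok in tokenize.generate_tokens(readline)]


with_hash = lex("group: ( a b )# c\n")   # '#' swallows the rest of the line
with_at = lex("group: ( a b )@ c\n")     # '@' is an ordinary operator token

print(with_hash)
print(with_at)
```

With the hash, everything after it ends up in one COMMENT token; with @, both the operator and the following name survive as tokens.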

Result/Output

cmd-tool

The command-line tool python -m pegen ... only prints the parsed tree: a list (shown as []) with sub-lists and/or TokenInfo named tuples. Each TokenInfo has 5 elements: a token type (an int, shown with its enum name), the token string (that was parsed), the begin & end location (line and column number), and the full line that is being parsed.
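For reference, this is what those five elements look like for one token. The sketch uses the standard tokenize module directly, on a made-up one-line source.

```python
import io
import token
import tokenize

tokens = list(tokenize.generate_tokens(io.StringIO("answer = 42\n").readline))
tok = tokens[2]                              # the NUMBER token for 42

print(tok.type, token.tok_name[tok.type])    # numeric type and its name
print(repr(tok.string))                      # the text that was matched
print(tok.start, tok.end)                    # (line, column) pairs
print(repr(tok.line))                        # the full source line

# For operators, exact_type tells which operator it is.
equals = tokens[1]
print(token.tok_name[equals.exact_type])
```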

No info about the matched grammar rule (e.g. the rule name) is shown. Actually, that info is not part of the parsed tree.

See also

This structure is described in the tokenize module; without specifying its name: TokenInfo.

The parser

The GeneratedParser (and/or its base class, pegen.parser.Parser) returns only (lists of) tokens from the tokenizer (an OO wrapper around tokenize). And so, these are the same TokenInfo objects as described above.
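As the result is just nested lists of TokenInfo objects, a small recursive helper is enough to walk it. In this sketch the helper flatten is a made-up name, and the tree is hand-built from tokenize output, not real pegen output. Skipping non-token, non-list entries (such as a None placeholder) is an assumption about what such a tree may contain.

```python
import io
import tokenize


def flatten(tree):
    """Yield every TokenInfo in a nested list structure, in order."""
    # TokenInfo is itself a tuple, so it must be tested before list/tuple.
    if isinstance(tree, tokenize.TokenInfo):
        yield tree
    elif isinstance(tree, (list, tuple)):
        for item in tree:
            yield from flatten(item)
    # Anything else (assumed: e.g. None) is silently skipped.


toks = list(tokenize.generate_tokens(io.StringIO("a <- b\n").readline))
# Hand-made nesting, only to have something tree-shaped to walk.
tree = [[toks[0], [toks[1], toks[2]], toks[3]], toks[4:]]

text = [tok.string for tok in flatten(tree)]
print(text[:4])
```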

Stability

The current pegen package on PyPI is V0.1.0, which already shows it is not mature. That version is dated September 2021 on GitHub (with 36 commits). The current version (Nov 22) has 20 commits more (56).
It can be installed with pip install git+https://github.com/we-like-parsers/pegen
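To see which version ended up in the current environment, the standard importlib.metadata module can be asked. This is a minimal sketch with a made-up helper name, installed_version. Whether the version string of a git install differs from the PyPI release was not checked here.

```python
from importlib import metadata


def installed_version(package="pegen"):
    """Return the installed version of a package, or None when it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version())
```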

It is, however, not fully compatible. For example, pegen/parser.py::simple_parser_main() now expects an AST object (to print), not a list of TokenInfo.

Tip

The pegen package is NOT used inside the (C)Python tool. The CPython version is heavily tied to other details of CPython; it can also generate C code. The pegen package is based on it and more or less in sync; it can generate Python code only, but does not depend on the implementation details of the compiler.

Buggy current version

The git version contains (at least) one bug. The function parser::simple_parser_main(), which is called when using the generated file, uses the AST module to print (show) the result, which simply does not work.
Probably, that default main isn’t used a lot (also, I prefer to use, and have used, my own main). Still, it shows its immaturity.
