.. include:: /std/localtoc.irst
.. _QN_PEGEN:
================
QuickNote: PEGEN
================
.. post:: 2022/11/3
:category: CastleBlogs, rough
:tags: Grammar, PEG
To implement CCastle we need a parser, as part of the compiler. Eventually, that parser will be written in Castle. For
now, we kickstart it in Python, which has several packages that can assist us. As we like to use a PEG parser, there
are a few options. `Arpeggio `__ is well known, and has some nice options --
but can’t handle `left recursion `__ -- like most PEG-parsers.
Recently, Python itself switched to a PEG parser that supports `left recursion
`__ (which is a recent development). That parser is also available as a
package: `pegen `__; but it is hardly documented.
This blog is written to remember some lessons learned when playing with it, and to serve as a kind of informal docs.
.. seealso:: :ref:`QN_Arpeggio` is another candidate package for the PEG parser in the initial :ref:`Castle-WorkshopTools`
Built-In Lexer
==============
Pegen is written specifically for Python and uses a specialised lexer, unlike most PEG-parsers, which use PEG for lexing too. Pegen
uses the `tokenizer `__ that is part of Python. This comes with some
restrictions.
This lexer --or tokenize(r), as Python calls it-- is used **both** to read the grammar (the PEG file) *and* to read the
source-files that are parsed by the generated parser.
.. hint::
These restrictions apply when we use pegen as a module: ``python -m pegen ...``; that calls `simple_parser_main()`.
|BR|
They also apply when we use the parser-class in our own code --so, when importing pegen with
``from pegen.parser import Parser ...``. Then a bit more is possible, as we can configure another (self-made) lexer.
The interface is, however, quite narrowly tied to Python.
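
As a minimal sketch of such a *self-made* lexer: the generator below wraps the standard token stream and glues an
adjacent ``<`` and ``-`` together into one ``<-`` token. The name ``merge_left_arrow`` is made up for this note; it is
not part of pegen. Whether a generated parser then matches a ``'<-'`` literal is an assumption that is not verified here.

.. code-block:: python

   import io
   import tokenize


   def merge_left_arrow(tokens):
       """Yield tokens, merging an adjacent '<' and '-' into one '<-' OP token."""
       pending = None  # a '<' token that may become the start of '<-'
       for tok in tokens:
           if pending is not None:
               is_minus = tok.type == tokenize.OP and tok.string == "-"
               if is_minus and tok.start == pending.end:  # no space in between
                   yield tokenize.TokenInfo(
                       tokenize.OP, "<-", pending.start, tok.end, pending.line
                   )
                   pending = None
                   continue
               yield pending  # it was just a plain '<'
               pending = None
           if tok.type == tokenize.OP and tok.string == "<":
               pending = tok
           else:
               yield tok
       if pending is not None:
           yield pending


   raw = tokenize.generate_tokens(io.StringIO("a <- b\n").readline)
   merged_strings = [tok.string for tok in merge_left_arrow(raw)]
   print(merged_strings)

The wrapped generator could then be handed to the tokenizer-wrapper of pegen, in place of the plain
``tokenize.generate_tokens(...)`` result.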
Tokens
------
The lexer will recognise some tokens that are special for Python, like `INDENT` & `DEDENT`. Also some generic tokens
like `NAME` (which is an ID) and `NUMBER` are known, and can be used to define the language.
Unfortunately, it will also find some tokens --typically operators-- that are *hardcoded* for Python, even when we like to use
them differently, possibly combined with other characters. In that case, the literal-strings as set
in the grammar will never be found.
.. note::
Pegen speaks about *(soft)* **keywords** for all kinds of literal terminals, even when they are more like operators
than *words*.
.. warning::
When the grammar defines (literal) terminals (or keywords) --especially for operators-- make sure the lexer will not
break them into predefined tokens!
|BR|
This will not give an error, but it does not work!
.. code-block:: PEG
Left_arrow_BAD: '<-' ## This is WRONG, as ``<`` is seen as a token. And so, `<-` is never found
Left_arrow_OKE: '<' '-' ## This is acceptable
This *splitting* results, however, in two entries in the resulting tree --unless one uses `grammar actions
`__ to create one new “token”.
.. seealso:: See https://docs.python.org/3/library/token.html, for an overview of the predefined tokens
.. tip::
A quick trick to see how a file is split into tokens: use ``python -m tokenize [-e] filename.peg``.
|BR|
Make sure you do not use string-literals that (e.g.) are composed of two tokens, like the above-mentioned ``<-``.
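
The same check can be done from Python itself. This small sketch (using only the standard ``tokenize`` module; the
input line is just an example) shows that ``<-`` comes out as two separate ``OP`` tokens:

.. code-block:: python

   import io
   import tokenize

   source = "a <- b\n"
   tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

   # Show every token with its type-name, like ``python -m tokenize`` does
   for tok in tokens:
       print(tokenize.tok_name[tok.type], repr(tok.string))

   # Only the operators: the arrow is split into '<' and '-'
   ops = [tok.string for tok in tokens if tok.type == tokenize.OP]
   print(ops)
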
.. sidebar:: Reserved
:class: localtoc
- showpeek
- name
- number
- string
- op
- type_comment
- soft_keyword
- expect
- expect_forced
- positive_lookahead
- negative_lookahead
- make_syntax_error
Rule names
----------
The *GeneratedParser* inherits from (and calls) the base ``pegen.parser.Parser`` class, and has a method for each
rule-name. This implies some names should never be used as rule-names -- see the sidebar.
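
A quick way to guard against such clashes is to check the rule-names of a grammar against that list. The sketch below
simply hardcodes the names from the sidebar; the helper ``reserved_clashes`` is made up for this note and is not part
of pegen.

.. code-block:: python

   # Names from the sidebar: methods that already exist on the parser base-class
   RESERVED = frozenset({
       "showpeek", "name", "number", "string", "op", "type_comment",
       "soft_keyword", "expect", "expect_forced", "positive_lookahead",
       "negative_lookahead", "make_syntax_error",
   })


   def reserved_clashes(rule_names):
       """Return the (sorted) rule-names that collide with a reserved name."""
       return sorted(set(rule_names) & RESERVED)


   clashes = reserved_clashes(["start", "name", "expr", "op"])
   print(clashes)
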
Meta Syntax (issues)
====================
No: regexps
-----------
PEGEN has **no** support for regular expressions, probably because it uses a custom lexer.
Unordered Group starts a comment
--------------------------------
PEGEN (or its lexer) uses ``#`` to start a comment. This implies that an **Unordered group** ``( sequence )#`` --as in
`Arpeggio `__-- is not recognized.
A workaround is to use another character like ``@`` instead of the hash (``#``).
Result/Output
=============
cmd-tool
--------
The command-line tool ``python -m pegen ...`` only prints the parsed tree: a list (shown as ``[`` ... ``]``) with
sub-lists and/or `TokenInfo` named-tuples. Each `TokenInfo` has 5 elements: a token type (an int and its enum-name), the
token-string (that was parsed), the begin & end location (line- & column-number), and the full line that is being
parsed.
No info about the matched grammar-rule (e.g. the rule-name) is shown. Actually, that info is not part of the parsed-tree.
.. seealso:: This `structure is described `__ in
the tokenize module; without specifying its name: TokenInfo.
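
To get a feeling for those five elements, one can unpack a single `TokenInfo` by hand. This sketch uses only the
standard ``tokenize`` module; the input line is an arbitrary example.

.. code-block:: python

   import io
   import tokenize

   # Take the first token of a one-line source
   first = next(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))

   # A TokenInfo is a named-tuple with exactly these five fields
   tok_type, tok_string, start, end, line = first

   print(tokenize.tok_name[tok_type])  # the enum-name of the token type
   print(repr(tok_string))             # the token-string
   print(start, end)                   # (line, column) of begin and end
   print(repr(line))                   # the full line that is being parsed
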
The parser
----------
The GeneratedParser (and/or its base class: ``pegen.parser.Parser``) returns only (lists of) tokens from the tokenizer (an
OO wrapper around tokenize). And so, the same TokenInfo objects as described above.
Stability
=========
The current pegen package on `pypi `__ is V0.1.0 -- which already shows it is not
mature. `That version github `__ is dated September 2021 (with 36
commits). The `current `__
version (Nov 2022) has 20 commits more (56).
|BR|
That version can be installed with ``pip install git+https://github.com/we-like-parsers/pegen``
It is, however, not fully compatible. For example, ``pegen/parser.py::simple_parser_main()`` now expects an AST object (to
print), not a list of TokenInfo.
.. tip::
The pegen package is **NOT** the one used inside the `(C)Python tool
`__. The CPython version is tightly coupled to other
details of CPython, and can also generate C-code. The pegen-package is based on it and more-or-less in sync; it can
generate Python-code only, but does not depend on the compiler-implementation details.
.. seealso:: https://we-like-parsers.github.io/pegen/#differences-with-cpythons-pegen
Buggy current version
---------------------
The git version contains (at least) one bug. The function ``parser::simple_parser_main()``, which is called when using the
generated file, uses the AST module to print (show) the result -- which simply does not work.
|BR|
Probably, that *default main* isn’t used a lot (also, I prefer to use --and have used-- my own main). Still, it shows its
immaturity.
.. LocalWords: lexer tokenize cpython regexps tokenizer