greta/examples.md at main · verse-lab/greta

Examples

In this document, we explain how TAs are used to remove precedence ambiguities from CFG of a small toy language (only containing + and * operations).

Step 1. CFG $G$ for a toy language

Given that a language has ambiguities in the form of shift/reduce conflicts, the first step involves constructing a CFG $G$ for the language. $G$ is defined as a tuple $(V, \Sigma, S, P)$ where

$V$ is a set of non-terminals (aka variables),
$\Sigma$ is a set of terminals,
$S$ is a set of start symbols, and
$P$ is a set of productions.

For our small toy language, initial CFG $G$ is defined with the following tuple:

G  := (V, Σ, S, P)

V   = { E }
Σ   = { N, +, *, (, ) }
S   = { E }
P   = { E -> N      ;
        E -> E + E  ;
        E -> E * E  ;
        E -> ( E )  }

$V$ contains non-terminals generated per expr. $\Sigma$ is defined based on %token

The grammar $G$ is specified in the parser.mly, where the following lines refer to the productions $P$:

(* parser.mly *)
program : expr EOF { $1 };
expr: 
  | INT  { Ast.Int $1 }
  | expr PLUS expr { Ast.Plus ($1, $3) }
  | expr MUL expr { Ast.Mul ($1, $3) }
  | LPAREN expr RPAREN { Ast.Paren $2 };

In our implementation, Converter.mly_to_cfg extracts relevant lines -- such as the lines above -- from the parser.mly and coverts to its corresponding CFG. Running the project with the above grammar results in the shift/reduce conflicts:

$ dune exec ./main.exe
Warning: 2 states have shift/reduce conflicts.
Warning: 4 shift/reduce conflicts were arbitrarily resolved.

Step 2. Mapping CFG $G$ to TA $A$

Before we deal with the conflicts in Step 3, we first convert the CFG $G$ to its corresponding TA $A$ by labeling productions. $A$ is defined as a tuple $(Q, F, S, \Delta)$ where

$Q$ is a set of states,
$F$ is a set of (ranked) constructor labels (aka alphabet),
$S$ is a set of root states, and
$\Delta$ is a set of productions.

Note that this is a top-down automata where the $A$ accepts a tree $t$ if it starts from a root state in $S$ and expanding downward can construct the tree $t$.

$Q$ corresponds to $V$ of $G$, added with ϵ. $F$ is defined based on the $\Sigma$ of $G$, where each of its arity is computed based on the number of exprs in respective production rule. $S$ refers to the first expr that appears in the productions. For our toy language, initial TA $A$ is defined with the following tuple:

A  := (Q, F, S, Δ)

Q   = { E, ϵ }
F   = { <+, 2>, <*, 2>, <(), 1>, <N, 1>, <ϵ, 1> }
S   = { E }
Δ   = { E ->_N ϵ    ;
        E ->_+ E E  ;
        E ->_* E E  ;
        E ->_() E   }

As noted in [1], TRE that corresponds to the above TA is written as follows:

TRE : 
    ( +([],[]) | *([],[]) | ()([]) )* . N()

Step 3. Generate examples based on conflict(s) in the language

We make the Menhir parser generator provide explanations on conflicts via -- inspection --dump --explain flags. This generates parser.conflicts file in the _build/default/ directory. If there are no conflicts in the grammar, no parser.conflicts file will be generated upon compilation.

Upon close inspection of parser.conflicts generated by Menhir, the 4 shift/reduce conflicts refer to conflicts from the following expressions:

expr MUL expr PLUS expr
expr PLUS expr MUL expr
expr PLUS expr PLUS expr
expr MUL expr MUL expr

We assume that an ambiguity reported by Menhir is represented with the most concise expression that covers other bigger, complicated expressions containing the same type of ambiguity.

For the grammar $G$, following line in parser.conflicts indicate that there are two types of examples that cause shift-reduce conflicts:

...
expr MUL expr
         expr . PLUS expr
...
expr PLUS expr
          expr . PLUS expr
...

Based on the first example, we can create following options for the user:

     (Option 1)         |      (Option 2)       
                        |                       
        MUL             |         PLUS          
       /   \            |         /  \          
     expr PLUS          |       MUL  expr       
          /  \          |      /  \             
       expr  expr       |    expr expr

That is, the $G$ contains ambiguities caused by precedence involving + and *. When there is $n_1 * n_2 + n_3$, it is not sure whether to do perform $n_1 * n_2$ first or $n_2 + n_3$ first. Without specifying precedence, Menhir arbitrarily resolves this conflict, presumably resulting in an incorrect computation of $n_1 * (n_2 + n_3)$. For example, 2*3+4 is evaluated as Mul(2,Plus(3, 4)).

Similarly, the following options are generated for the second example:

     (Option 1)         |      (Option 2)       
                        |                       
        PLUS            |         PLUS          
        /  \            |         /  \          
     expr PLUS          |       PLUS expr       
          /  \          |       /  \             
       expr  expr       |     expr expr

This refers to the associativity of + operator. When there is $n_1 + n_2 + n_3$, it is not sure whether to do perform $n_1 + n_2$ first or $n_2 + n_3$ first. As in the earlier example, Menhir arbitrarily resolves this conflict, with the right associativity of $n_1 + (n_2 + n_3)$.

Step 4. The user selects an example

In this step, we let the user --- aka the language designer --- select one example based on their preference. Note that there are two different formats to present examples to the user: trees and code snippets. The format will be finalized based on the user study in the future. In this document, we're using the tree format as shown in the Step 3 above.

For simplicity, suppose that we are only dealing with the first example and the Option 2 was selected by the user to eliminate the conflicts.

Step 5. TA $A'$ encoding restrictions

This example (Option 2) is subsequently fed to the following algorithm to automatically generate a TA $A'$ encodinig restrictions.

$$ a + b $$

(TODO: Here, we need to further clarify an algorithm involved in generating $A'$ based on the example selected by the user.)

(TODO: Take note of Angluin's Algorithm and apply it in Tree languages context. Explain how the algorithm I came up with is principled.)

Now we specify a TA $A'$ encoding restrictions as follows:

A' := (Q', F, S', Δ')

Q'  = { X, Y, ϵ }
F   = { <+, 2>, <*, 2>, <(), 1>, <N, 1>, <ϵ, 1> }
S'  = { X }
Δ'  = { X ->_+ X X  ;
        X ->_ϵ Y    ;

        Y ->_N ϵ    ;
        Y ->_* Y Y  ;
        Y ->_() X   }

TRE that corresponds to the above TA is written as follows:

TRE : 
    ( ( +([],[]) )* . ( *([],[]) )* . ( N() | ()([]) ) )*

Step 6. TA $A''$ resulted from intersection of $A$ and $A'$

TA $A''$ resulted from taking intersection of $A$ and $A'$ (i.e., $A \cap A'$ where $A = (Q, F, S, \Delta)$ and $A' = (Q', F, S', \Delta')$) is defined as a tuple $(Q'', F, S'', \Delta'')$ where:

$Q''$ is a cross product of $Q$ and $Q'$, i.e., $Q \times Q'$,
$S''$ is a cross product of $S$ and $S'$, i.e., $S \times S'$, and
$\Delta''$ is a cross product of $\Delta$ and $\Delta'$, i.e., $\Delta \times \Delta'$.

Based on the above definition of intersection, we can define $A''$ as follows:

A'':= (Q'', F, S'', Δ'')

Q'' = { EX, EY, ϵ }
F   = { <+, 2>, <*, 2>, <(), 1>, <N, 1>, <ϵ, 1> }
S'' = { EX }
Δ'' = { EX ->_N ϵ       ;
        EX ->_+ EX EX   ;
        EX ->_* EY EY   ;
        EX ->_() EX     ;
        EY ->_N ϵ       ;
        EY ->_* EY EY   ;
        EY ->_() EX     }

By introducing epsilon introductions, the productions $\Delta''$ can be simplified further, as shown below:

Δ'' = { EX ->_+ EX EX   ;
        EX ->_ϵ EY      ;
        EY ->_* EY EY   ;
        EY ->_N ϵ       ;
        EY ->_() EX     }

Step 7. Convert TA $A''$ to CFG $G'$

The resulted $A''$ is converted back to CFG $G'$ by unlabeling the productions:

G' := (V', Σ', S', P')

V'  = { X, Y }
Σ'  = { N, +, *, (, ) }
S'  = { X }
P'  = { X -> X + X  ;
        X -> Y      ;
        Y -> Y * Y  ;
        Y -> N
        Y -> ( X )  }

(TODO: mention how this CFG gets converted to a new .mly file)

(TODO: include further example to illustrate a situation where there are multiple conflicts - e.g., dangling else, etc.)

References

[1]: Adams, M. D., & Might, M. (2017). Restricting grammars with tree automata. Proceedings of the ACM on Programming Languages, 1(OOPSLA), 1-25. https://doi.org/10.1145/3133906.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Examples

Step 1. CFG $G$ for a toy language

Step 2. Mapping CFG $G$ to TA $A$

Step 3. Generate examples based on conflict(s) in the language

Step 4. The user selects an example

Step 5. TA $A'$ encoding restrictions

Step 6. TA $A''$ resulted from intersection of $A$ and $A'$

Step 7. Convert TA $A''$ to CFG $G'$

References

FilesExpand file tree

examples.md

Latest commit

History

examples.md

File metadata and controls

Examples

Step 1. CFG $G$ for a toy language

Step 2. Mapping CFG $G$ to TA $A$

Step 3. Generate examples based on conflict(s) in the language

Step 4. The user selects an example

Step 5. TA $A'$ encoding restrictions

Step 6. TA $A''$ resulted from intersection of $A$ and $A'$

Step 7. Convert TA $A''$ to CFG $G'$

References