In this document, we explain how TAs are used to remove precedence ambiguities from CFG of a small toy language (only containing + and * operations).
Given that a language has ambiguities in the form of shift/reduce conflicts, the first step involves constructing a CFG
-
$V$ is a set of non-terminals (aka variables), -
$\Sigma$ is a set of terminals, -
$S$ is a set of start symbols, and -
$P$ is a set of productions.
For our small toy language, initial CFG
G := (V, Σ, S, P)
V = { E }
Σ = { N, +, *, (, ) }
S = { E }
P = { E -> N ;
E -> E + E ;
E -> E * E ;
E -> ( E ) }
expr. %token
The grammar parser.mly, where the following lines refer to the productions
(* parser.mly *)
program : expr EOF { $1 };
expr:
| INT { Ast.Int $1 }
| expr PLUS expr { Ast.Plus ($1, $3) }
| expr MUL expr { Ast.Mul ($1, $3) }
| LPAREN expr RPAREN { Ast.Paren $2 };In our implementation, Converter.mly_to_cfg extracts relevant lines -- such as the lines above -- from the parser.mly and coverts to its corresponding CFG. Running the project with the above grammar results in the shift/reduce conflicts:
$ dune exec ./main.exe
Warning: 2 states have shift/reduce conflicts.
Warning: 4 shift/reduce conflicts were arbitrarily resolved.Before we deal with the conflicts in Step 3, we first convert the CFG
-
$Q$ is a set of states, -
$F$ is a set of (ranked) constructor labels (aka alphabet), -
$S$ is a set of root states, and -
$\Delta$ is a set of productions.
Note that this is a top-down automata where the
ϵ. exprs in respective production rule. expr that appears in the productions. For our toy language, initial TA
A := (Q, F, S, Δ)
Q = { E, ϵ }
F = { <+, 2>, <*, 2>, <(), 1>, <N, 1>, <ϵ, 1> }
S = { E }
Δ = { E ->_N ϵ ;
E ->_+ E E ;
E ->_* E E ;
E ->_() E }
As noted in [1], TRE that corresponds to the above TA is written as follows:
TRE :
( +([],[]) | *([],[]) | ()([]) )* . N()
We make the Menhir parser generator provide explanations on conflicts via -- inspection --dump --explain flags. This generates parser.conflicts file in the _build/default/ directory. If there are no conflicts in the grammar, no parser.conflicts file will be generated upon compilation.
Upon close inspection of parser.conflicts generated by Menhir, the 4 shift/reduce conflicts refer to conflicts from the following expressions:
expr MUL expr PLUS exprexpr PLUS expr MUL exprexpr PLUS expr PLUS exprexpr MUL expr MUL expr
We assume that an ambiguity reported by Menhir is represented with the most concise expression that covers other bigger, complicated expressions containing the same type of ambiguity.
For the grammar parser.conflicts indicate that there are two types of examples that cause shift-reduce conflicts:
...
expr MUL expr
expr . PLUS expr
...
expr PLUS expr
expr . PLUS expr
...
Based on the first example, we can create following options for the user:
(Option 1) | (Option 2)
|
MUL | PLUS
/ \ | / \
expr PLUS | MUL expr
/ \ | / \
expr expr | expr expr
That is, the + and *. When there is 2*3+4 is evaluated as Mul(2,Plus(3, 4)).
Similarly, the following options are generated for the second example:
(Option 1) | (Option 2)
|
PLUS | PLUS
/ \ | / \
expr PLUS | PLUS expr
/ \ | / \
expr expr | expr expr
This refers to the associativity of + operator. When there is
In this step, we let the user --- aka the language designer --- select one example based on their preference. Note that there are two different formats to present examples to the user: trees and code snippets. The format will be finalized based on the user study in the future. In this document, we're using the tree format as shown in the Step 3 above.
For simplicity, suppose that we are only dealing with the first example and the Option 2 was selected by the user to eliminate the conflicts.
This example (Option 2) is subsequently fed to the following algorithm to automatically generate a TA
(TODO: Here, we need to further clarify an algorithm involved in generating $A'$ based on the example selected by the user.)
(TODO: Take note of Angluin's Algorithm and apply it in Tree languages context. Explain how the algorithm I came up with is principled.)
Now we specify a TA
A' := (Q', F, S', Δ')
Q' = { X, Y, ϵ }
F = { <+, 2>, <*, 2>, <(), 1>, <N, 1>, <ϵ, 1> }
S' = { X }
Δ' = { X ->_+ X X ;
X ->_ϵ Y ;
Y ->_N ϵ ;
Y ->_* Y Y ;
Y ->_() X }
TRE that corresponds to the above TA is written as follows:
TRE :
( ( +([],[]) )* . ( *([],[]) )* . ( N() | ()([]) ) )*
TA
-
$Q''$ is a cross product of$Q$ and$Q'$ , i.e.,$Q \times Q'$ , -
$S''$ is a cross product of$S$ and$S'$ , i.e.,$S \times S'$ , and -
$\Delta''$ is a cross product of$\Delta$ and$\Delta'$ , i.e.,$\Delta \times \Delta'$ .
Based on the above definition of intersection, we can define
A'':= (Q'', F, S'', Δ'')
Q'' = { EX, EY, ϵ }
F = { <+, 2>, <*, 2>, <(), 1>, <N, 1>, <ϵ, 1> }
S'' = { EX }
Δ'' = { EX ->_N ϵ ;
EX ->_+ EX EX ;
EX ->_* EY EY ;
EX ->_() EX ;
EY ->_N ϵ ;
EY ->_* EY EY ;
EY ->_() EX }
By introducing epsilon introductions, the productions
Δ'' = { EX ->_+ EX EX ;
EX ->_ϵ EY ;
EY ->_* EY EY ;
EY ->_N ϵ ;
EY ->_() EX }
The resulted
G' := (V', Σ', S', P')
V' = { X, Y }
Σ' = { N, +, *, (, ) }
S' = { X }
P' = { X -> X + X ;
X -> Y ;
Y -> Y * Y ;
Y -> N
Y -> ( X ) }
(TODO: mention how this CFG gets converted to a new .mly file)
(TODO: include further example to illustrate a situation where there are multiple conflicts - e.g., dangling else, etc.)
[1]: Adams, M. D., & Might, M. (2017). Restricting grammars with tree automata. Proceedings of the ACM on Programming Languages, 1(OOPSLA), 1-25. https://doi.org/10.1145/3133906.