JavaScript has one of the most challenging grammar to parse, this tutorial details all the sweat and tears I had while learning it.
LL(1) Grammar
According to Wikipedia,
an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right
The first L means the scanning the source from Left to right, and the second L means the construction of a Leftmost derivation tree.
Context-free and the (1) in LL(1) means a tree can be constructed by just peeking at the next token and nothing else.
LL Grammars are of particular interest in academia because we are lazy human beings and we want to write programs that generate parsers automatically so we don't need to write parsers by hand.
Unfortunately, most industrial programming languages do not have a nice LL(1) grammar, and this applies to JavaScript too.
Mozilla started the jsparagus project a few years ago and wrote a LALR parser generator in Python. They haven't updated it much in the past two years and they sent a strong message at the end of js-quirks.md
What have we learned today?
- Do not write a JS parser.
- JavaScript has some syntactic horrors in it. But hey, you don't make the world's most widely used programming language by avoiding all mistakes. You do it by shipping a serviceable tool, in the right circumstances, for the right users.
The only practical way to parse JavaScript is to write a recursive descent parser by hand because of the nature of its grammar, so let's learn all the quirks in the grammar before we shoot ourselves in the foot.
The list below starts simple and will become difficult to grasp, so please grab a coffee and take your time.
Identifiers
There are three types of identifiers defined in #sec-identifiers
,
IdentifierReference[Yield, Await] :
BindingIdentifier[Yield, Await] :
LabelIdentifier[Yield, Await] :
estree
and some ASTs do not distinguish the above identifiers,
and the specification does not explain them in plain text.
BindingIdentifier
s are declarations and IdentifierReference
s are references to binding identifiers.
For example in var foo = bar
, foo
is a BindingIdentifier
and bar
is a IdentifierReference
in the grammar:
VariableDeclaration[In, Yield, Await] :
BindingIdentifier[?Yield, ?Await] Initializer[?In, ?Yield, ?Await] opt
Initializer[In, Yield, Await] :
= AssignmentExpression[?In, ?Yield, ?Await]
follow AssignmentExpression
into PrimaryExpression
we get
PrimaryExpression[Yield, Await] :
IdentifierReference[?Yield, ?Await]
Declaring these identifiers differently in the AST will greatly simply downstream tools, especially for semantic analysis.
pub struct BindingIdentifier {
pub node: Node,
pub name: Atom,
}
pub struct IdentifierReference {
pub node: Node,
pub name: Atom,
}
Class and Strict Mode
ECMAScript Class is born after strict mode, so they decided that everything inside a class must be strict mode for simplicity.
It is stated as such in #sec-class-definitions
with just a Node: A class definition is always strict mode code.
It is easy to declare strict mode by associating it with function scopes, but a class
declaration does not have a scope,
we need to keep an extra state just for parsing classes.
https://github.com/swc-project/swc/blob/f9c4eff94a133fa497778328fa0734aa22d5697c/crates/swc_ecma_parser/src/parser/class_and_fn.rs#L85
Legacy Octal and Use Strict
#sec-string-literals-early-errors
disallows escaped legacy octal inside strings "\01"
:
EscapeSequence ::
LegacyOctalEscapeSequence
NonOctalDecimalEscapeSequence
It is a Syntax Error if the source text matched by this production is strict mode code.
The best place to detect this is inside the lexer, it can ask the parser for strict mode state and throw errors accordingly.
But, this becomes impossible when mixed with directives:
https://github.com/tc39/test262/blob/747bed2e8aaafe8fdf2c65e8a10dd7ae64f66c47/test/language/literals/string/legacy-octal-escape-sequence-prologue-strict.js#L16-L19
use strict
is declared after the escaped legacy octal, yet the syntax error needs to be thrown.
Fortunately, no real code uses directives with legacy octals ... unless you want to pass the test262 case from above.
Non-simple Parameter and Strict Mode
Identical function parameters is allowed in non-strict mode function foo(a, a) { }
,
and we can forbid this by adding use strict
: function foo(a, a) { "use strict" }
.
Later on in es6, other grammars were added to function parameters, for example function foo({ a }, b = c) {}
.
Now, what happens if we write the following where "01" is a strict mode error?
function foo(value=(function() { return "\01" }())) {
"use strict";
return value;
}
More specifically, what should we do if there is a strict mode syntax error inside the parameters thinking from the parser perspective?
So in #sec-function-definitions-static-semantics-early-errors
, it just bans this by stating
FunctionDeclaration :
FunctionExpression :
It is a Syntax Error if FunctionBodyContainsUseStrict of FunctionBody is true and IsSimpleParameterList of FormalParameters is false.
Chrome throws this error with a mysterious message "Uncaught SyntaxError: Illegal 'use strict' directive in function with non-simple parameter list".
A more in-depth explanation is described in this blog post by the author of ESLint.
Fun fact, the above rule does not apply if we are targeting es5
in TypeScript, it transpiles to
function foo(a, b) {
"use strict";
if (b === void 0) { b = "\01"; }
}
Parenthesized Expression
Parenthesized expressions are supposed to not have any semantic meanings?
For instance the AST for ((x))
can just be a single IdentifierReference
, not ParenthesizedExpression
-> ParenthesizedExpression
-> IdentifierReference
.
And this is the case for JavaScript grammar.
But ... who would have thought it can have run-time meanings. Found in this estree issue, it shows that
> fn = function () {};
> fn.name
< "fn"
> (fn) = function () {};
> fn.name
< ''
So eventually acorn and babel added the preserveParens
option for compatibility.
Function Declaration in If Statement
If we follow the grammar precisely in #sec-ecmascript-language-statements-and-declarations
:
Statement[Yield, Await, Return] :
... lots of statements
Declaration[Yield, Await] :
... declarations
The Statement
node we define for our AST would obviously not contain Declaration
,
but in Annex B #sec-functiondeclarations-in-ifstatement-statement-clauses
,
it allows declaration inside the statement position of if
statements in non-strict mode:
if (x) function foo() {}
else function bar() {}
Label statement is legit
We probably have never written a single line of labelled statement, but it is legit in modern JavaScript and not banned by strict mode.
The following syntax is correct, it returns a labelled statement (not object literal).
<Foo
bar={() => {
baz: "quaz";
}}
/>
// ^^^^^^^^^^^ `LabelledStatement`
let
is not a keyword
let
is not a keyword so it is allowed to appear anywhere unless the grammar explicitly states let
is not allowed in such positions.
Parsers need to peek at the token after the let
token and decide what it needs to be parsed into, e.g.:
let a;
let = foo;
let instanceof x;
let + 1;
while (true) let;
a = let[0];
For-in / For-of and the [In] context
If we look at the grammar for for-in
and for-of
in #prod-ForInOfStatement
,
it is immediately confusing to understand how to parse these.
There are two major obstacles for us to understand: the [lookahead ≠ let]
part and the [+In]
part.
If we have parsed to for (let
, we need to check the peeking token is:
- not
in
to disallowfor (let in)
- is
{
,[
or an identifier to allowfor (let {} = foo)
,for (let [] = foo)
andfor (let bar = foo)
Once reached the of
or in
keyword, the right-hand side expression needs to be passed with the correct [+In] context to disallow
the two in
expressions in #prod-RelationalExpression
:
RelationalExpression[In, Yield, Await] :
[+In] RelationalExpression[+In, ?Yield, ?Await] in ShiftExpression[?Yield, ?Await]
[+In] PrivateIdentifier in ShiftExpression[?Yield, ?Await]
Note 2: The [In] grammar parameter is needed to avoid confusing the in operator in a relational expression with the in operator in a for statement.
And this is the only application for the [In]
context in the entire specification.
Also to note, the grammar [lookahead ∉ { let, async of }]
forbids for (async of ...)
,
and it needs to be explicitly guarded against.
Block-Level Function Declarations
In Annex B.3.2 #sec-block-level-function-declarations-web-legacy-compatibility-semantics
,
an entire page is dedicated to explain how FunctionDeclaration
is supposed to behave in Block
statements.
It boils down to
https://github.com/acornjs/acorn/blob/11735729c4ebe590e406f952059813f250a4cbd1/acorn/src/scope.js#L30-L35
The name of a FunctionDeclaration
needs to be treated the same as a var
declaration if its inside a function declaration.
This code snippet errors with a re-declaration error since bar
is inside a block scope:
function foo() {
if (true) {
var bar;
function bar() {} // redeclaration error
}
}
meanwhile, the following does not error because it is inside a function scope, function bar
is treated as a var declaration:
function foo() {
var bar;
function bar() {}
}
Grammar Context
The syntactic grammar has 5 context parameters for allowing and disallowing certain constructs,
namely [In]
, [Return]
, [Yield]
, [Await]
and [Default]
.
It is best to keep a context during parsing, for example in Rome:
https://github.com/rome/tools/blob/5a059c0413baf1d54436ac0c149a829f0dfd1f4d/crates/rome_js_parser/src/state.rs#L404-L425
And toggle and check these flags accordingly by following the grammar.
AssignmentPattern vs BindingPattern
In estree
, the left-hand side of an AssignmentExpression
is a Pattern
:
extend interface AssignmentExpression {
left: Pattern;
}
and the left-hand side of a VariableDeclarator
is a Pattern
:
interface VariableDeclarator <: Node {
type: "VariableDeclarator";
id: Pattern;
init: Expression | null;
}
A Pattern
can be a Identifier
, ObjectPattern
, ArrayPattern
:
interface Identifier <: Expression, Pattern {
type: "Identifier";
name: string;
}
interface ObjectPattern <: Pattern {
type: "ObjectPattern";
properties: [ AssignmentProperty ];
}
interface ArrayPattern <: Pattern {
type: "ArrayPattern";
elements: [ Pattern | null ];
}
But from the specification perspective, we have the following JavaScript:
// AssignmentExpression:
{ foo } = bar;
^^^ IdentifierReference
[ foo ] = bar;
^^^ IdentifierReference
// VariableDeclarator
var { foo } = bar;
^^^ BindingIdentifier
var [ foo ] = bar;
^^^ BindingIdentifier
This starts to become confusing because we now have a situation where we cannot directly distinguish whether the Identifier
is a BindingIdentifier
or a IdentifierReference
inside a Pattern
:
enum Pattern {
Identifier, // Is this a `BindingIdentifier` or a `IdentifierReference`?
ArrayPattern,
ObjectPattern,
}
This will lead to all sorts of unnecessary code further down the parser pipeline.
For example, when setting up the scope for semantic analysis, we need to inspect the parents of this Identifier
to determine whether we should bind it to the scope or not.
A better solution is to fully understand the specification and decide what to do.
The grammar for AssignmentExpression
and VariableDeclaration
are defined as:
13.15 Assignment Operators
AssignmentExpression[In, Yield, Await] :
LeftHandSideExpression[?Yield, ?Await] = AssignmentExpression[?In, ?Yield, ?Await]
13.15.5 Destructuring Assignment
In certain circumstances when processing an instance of the production
AssignmentExpression : LeftHandSideExpression = AssignmentExpression
the interpretation of LeftHandSideExpression is refined using the following grammar:
AssignmentPattern[Yield, Await] :
ObjectAssignmentPattern[?Yield, ?Await]
ArrayAssignmentPattern[?Yield, ?Await]
14.3.2 Variable Statement
VariableDeclaration[In, Yield, Await] :
BindingIdentifier[?Yield, ?Await] Initializer[?In, ?Yield, ?Await]opt
BindingPattern[?Yield, ?Await] Initializer[?In, ?Yield, ?Await]
The specification distinguishes this two grammar by defining them separately with an AssignmentPattern
and a BindingPattern
.
So in situations like this, do not be afraid to deviate from estree
and define extra AST nodes for our parser:
enum BindingPattern {
BindingIdentifier,
ObjectBindingPattern,
ArrayBindingPattern,
}
enum AssignmentPattern {
IdentifierReference,
ObjectAssignmentPattern,
ArrayAssignmentPattern,
}
I was in a super confusing state for a whole week until I finally reached enlightenment:
we need to define an AssignmentPattern
node and a BindingPattern
node instead of a single Pattern
node.
estree
must be correct because people have been using it for years so it cannot be wrong?- how are we going to cleanly distinguish the
Identifier
s inside the patterns without defining two separate nodes? I just cannot find where the grammar is? - After a whole day of navigating the specification ...
the grammar for
AssignmentPattern
is in the 5th subsection of the main section "13.15 Assignment Operators" with the subtitle "Supplemental Syntax" 🤯 - this is really out of place because all grammar is defined in the main section, not like this one defined after the "Runtime Semantics" section
The following cases are really difficult to grasp. Here be dragons.
Ambiguous Grammar
Let's first think like a parser and solve the problem - given the /
token, is it a division operator or the start of a regex expression?
a / b;
a / / regex /;
a /= / regex /;
/ regex / / b;
/=/ / /=/;
It is almost impossible, isn't it? Let's break these down and follow the grammar.
The first thing we need to understand is that the syntactic grammar drives the lexical grammar as stated in #sec-ecmascript-language-lexical-grammar
There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements.
This means that the parser is responsible for telling the lexer which token to return next.
The above example indicates that the lexer needs to return either a /
token or a RegExp
token.
For getting the correct /
or RegExp
token, the specification says:
The InputElementRegExp goal symbol is used in all syntactic grammar contexts where a RegularExpressionLiteral is permitted ... In all other contexts, InputElementDiv is used as the lexical goal symbol.
And the syntax for InputElementDiv
and InputElementRegExp
are
InputElementDiv ::
WhiteSpace
LineTerminator
Comment
CommonToken
DivPunctuator <---------- the `/` and `/=` token
RightBracePunctuator
InputElementRegExp ::
WhiteSpace
LineTerminator
Comment
CommonToken
RightBracePunctuator
RegularExpressionLiteral <-------- the `RegExp` token
This means whenever the grammar reaches RegularExpressionLiteral
, /
need to be tokenized as a RegExp
token (and throw an error if it does not have a matching /
).
All other cases we'll tokenize /
as a slash token.
Let's walk through an example:
a / / regex /
^ ------------ PrimaryExpression:: IdentifierReference
^ ---------- MultiplicativeExpression: MultiplicativeExpression MultiplicativeOperator ExponentiationExpression
^^^^^^^^ - PrimaryExpression: RegularExpressionLiteral
This statement does not match any other start of Statement
,
so it'll go down the ExpressionStatement
route:
ExpressionStatement
--> Expression
--> AssignmentExpression
--> ... -->
MultiplicativeExpression
--> ... -->
MemberExpression
--> PrimaryExpression
--> IdentifierReference
.
We stopped at IdentifierReference
and not RegularExpressionLiteral
,
the statement "In all other contexts, InputElementDiv is used as the lexical goal symbol" applies.
The first slash is a DivPunctuator
token.
Since this is a DivPunctuator
token,
the grammar MultiplicativeExpression: MultiplicativeExpression MultiplicativeOperator ExponentiationExpression
is matched,
the right-hand side is expected to be an ExponentiationExpression
.
Now we are at the second slash in a / /
.
By following ExponentiationExpression
,
we reach PrimaryExpression: RegularExpressionLiteral
because RegularExpressionLiteral
is the only matching grammar with a /
:
RegularExpressionLiteral ::
/ RegularExpressionBody / RegularExpressionFlags
This second /
will be tokenized as RegExp
because
the specification states "The InputElementRegExp goal symbol is used in all syntactic grammar contexts where a RegularExpressionLiteral is permitted".
As an exercise, try and follow the grammar for /=/ / /=/
.