Tuesday, January 6, 2015
A Technical Comparison of the Shift and SpiderMonkey AST Formats
Since publishing our announcement of the Shift AST specification, many developers have asked for more details about how the Shift AST format compares to the SpiderMonkey AST format. We should first enumerate what we consider to be the properties of a good AST format.
A good AST format…
- minimizes the number of inhabitants that do not represent a program.
- is at least partially homogenous to allow for a simple and efficient visitor.
- does not impede moving, copying, or replacing subtrees.
- discourages duplication in code that operates on it.
The following is a list of differences that we consider improvements over the SpiderMonkey AST format.
- The top-level node returned from any successful parse is named Script, not Program, to ease upgrade to ECMAScript 6. ECMAScript 6 parsers need two modes: one mode that produces a Script and one mode that produces a Module. Modules allow import/export declarations at the top level and are always in strict mode.
- Functions (including getters/setters) represent their body using a FunctionBody, not a BlockStatement, to support directives and because a function’s body is neither a generic statement position nor a block.
- Script contains a FunctionBody, not a [Statement], to support top-level directives and for uniform handling of the shared FunctionBody structure.
- The concepts of BlockStatement and Block have been separated. A BlockStatement contains a Block, not a [Statement]. A Block contains a [Statement]. This is better for transformation: a BlockStatement may be replaced by any other Statement, but a Block must be replaced only by another Block. Block is also used to represent the body and finalizer of a TryFinallyStatement and body of a CatchClause (all of which cannot be arbitrary statements).
- Similarly, the concepts of VariableDeclarationStatement and VariableDeclaration have been separated. A VariableDeclaration is used within for and for-in statements (both of which cannot contain arbitrary statements in that position).
- The VariableDeclaration declarators list is required to be non-empty.
- The concepts of IdentifierExpression and Identifier have been separated. An IdentifierExpression contains an Identifier in expression position. Identifiers are also used for function names, break labels, and static member access property names.
- MemberExpression has been separated into StaticMemberExpression and ComputedMemberExpression so that the computed flag and the type of property cannot conflict.
- SwitchStatementWithDefault has been separated out of SwitchStatement to guarantee that a SwitchStatement has no more than one default clause.
- TryStatement has been separated into TryFinallyStatement (try/catch/finally and try/finally) and TryCatchStatement (try/catch) to disallow a TryStatement with no handler and no finalizer.
- SequenceExpression and LogicalExpression are just BinaryExpressions. AssignmentExpression remains separate in preparation for ECMAScript 6, where its left operand will need to be a Binding.
- Separated Literal into LiteralBooleanExpression, LiteralNullExpression, LiteralNumericExpression, LiteralRegExpExpression, and LiteralStringExpression. The SpiderMonkey Literal node is overloaded to the point that it is not used anywhere without qualifying that only a subset of its values may be used.
- LiteralRegExpExpression is represented by a string, not a RegExp. This allows for JSON serialization and a simpler equivalence definition.
- Property has been separated into Getter, Setter, and DataProperty, each of which have a PropertyName. PropertyName has a kind (“identifier”, “string”, or “number”) and string value. FunctionExpressions are much too permissive to represent getters/setters, and the Property kind could conflict with the value.
- Added UseStrictDirective and UnknownDirective nodes to represent directives. These nodes will be replaced with a single Directive node in the future.
- Removed support for SpiderMonkey-specific language extensions (expression closures, multiple catch clauses, for-each-in, etc.) other than block-scoped declarations.
The following is a list of differences that we believe are insignificant improvements over the SpiderMonkey AST format.
- SourceLocation format. Source position information was not originally part of the Shift specification because it was not important for any of Shape Security’s usages. Support for source position tracking was only recently added with this experimental interface. If a use case for tracking end position without source content is identified, that information may be added to SourceLocation.
- Names. We tried to be internally consistent with names like binding, value, and body. We made no effort to carry over SpiderMonkey naming conventions.
- UpdateExpression and UnaryExpression are replaced by PrefixExpression and PostfixExpression. Ignoring the fact that the prefix flag on SpiderMonkey’s UnaryExpression is unnecessary, there are pros and cons to each way this set of operations is grouped. For example, during scope analysis, it is easier to group the increment/decrement operators together to generate write references, but during code generation it is easier to group the prefix/postfix operators separately.
Deviations From ECMAScript 5
The following is a list of intentional supported extensions to ECMAScript 5.
- VariableDeclarationKind contains let and const, which should only be allowed in ECMAScript 6, but popular implementations had widespread support for these declaration kinds long before they had support for any other ECMAScript 6 feature. Because of this, many people consider them to be an unofficial extension to ECMAScript 5.
- Similarly, FunctionDeclarations in arbitrary Statement position were allowed by many ECMAScript 5 implementations (with varying semantics), so the Declaration interface was removed, and FunctionDeclaration was moved to Statement.
The following is a list of restrictions that must be checked in addition to the structural correctness to ensure that a Shift AST represents an ECMAScript program. Ideally, this list would be as small as possible, but because of the context sensitivity inherent in the design of the ECMAScript language, these additional restrictions are either impossible or infeasible to enforce in the AST structure. The reason we want this list to be small is because each program that operates on a Shift format AST needs to either be aware of all of these restrictions and handle them gracefully or require consumers to guarantee that the input AST is valid.
Luckily for developers, at the time of the initial Shift specification announcement we made available the Shift validator, which both validates and enumerates validation errors for a given Shift format AST. This makes it very easy to ensure that a Shift format AST does not include any of the below listed problems, as well as debugging problems when they are detected.
- BreakStatement without a label must be nested within an IterationStatement, SwitchCase, or SwitchDefault. BreakStatement with a label must be nested within a LabeledStatement with an equivalent label (without crossing a function boundary).
- ContinueStatement without a label must be nested within an IterationStatement. ContinueStatement with a label must be nested within an IterationStatement that is labeled with an equivalent label (without crossing a function boundary).
- LiteralRegExpExpression value must represent a valid RegExp.
- Identifier name must always be an IdentifierName, and must not be a reserved word in any position other than a StaticMemberExpression property.
- An IfStatement with alternate must not have another IfStatement without alternate nested within its consequent in a way that does not represent a valid program. See isProblematicIfStatement in estools/esutils for more details.
- LabeledStatement must not be nested within a LabeledStatement with the same label name.
- LiteralNumericExpression value must be non-negative, finite, and non-NaN.
- ObjectExpression cannot contain data/getter or data/setter properties with the same name.
- ObjectExpression cannot contain more than one data property with the name
- A PropertyName with kind of “number” can have a non-numeric value, and a PropertyName with a kind of “identifier” can have a non-IdentifierName value. It is possible this may one day be fixed.
- ReturnStatement must be nested within a FunctionExpression, FunctionDeclaration, Getter, or Setter.
- VariableDeclaration as ForInStatement left must have exactly one declarator. This can (and likely will) be fixed.
- In strict mode (in other words, nested within a FunctionBody that has a UseStrictDirective), function names, function parameters, catch bindings, setter parameters, prefix/postfix increment/decrement operands, assignment bindings, and variable declaration bindings must not be restricted words (arguments or eval).
- In strict mode, function parameters must be unique within their containing parameter list.
- In strict mode, Identifier name must not be a future reserved word in expression position.
- In strict mode, a PrefixExpression must not have both a delete operator and an Identifier operand.
- In strict mode, WithStatement is disallowed.
Hopefully this has cleared up the questions you had about the Shift AST. If you think of anything we haven’t, or if you have additional questions, leave a comment below so that others may benefit from the discussion.
Edit: A previous version of this post did not distinguish BreakStatement/ContinueStatement nodes with a label from those without.