Abstract Syntax Tree

AST design evolves together with the whole compiler

  • AST = Abstract Syntax Tree
  • Serves as main data structure for all post-parsing phases of the compiler
  • AST should be concise while also flexible enough for all post-parsing phases
  • Not uncommon to revisit and change the initial AST design

Concrete vs Abstract

  • Concrete syntax trees show every detail of the parsing process
    • Show the structure and organization of the grammar
    • Also called parse tree
  • Abstract syntax trees instead only preserve the important structural information of the input program

AST Design and Construction

  • There are different things to consider when designing an AST:
  1. It should be possible to unparse the AST
    • Execution of AST shall reflect the execution of the original program
    • AST nodes need to contain sufficient information to recall the essential elements they represent
  2. Implementation of AST should be decoupled from information represented within the AST
    • Using accessors
  3. Different phases of the compiler view elements of the AST fundamentally different
    • There is no single class hierarchy that can describe AST nodes for all purposes
    • Usage of the AST by the different phases is facilitated by various phase-specific interfaces implemented by AST nodes
  • Going from source language \(L\) to a grammar and AST design:
  1. Create unambiguous grammar for \(L\)
    • There can be certain production rules only for the purpose of removing ambiguity from the grammar
  2. Create AST from the grammar
    • Grammar details for disambiguation are not part of the AST design
    • Semantically useless symbols and punctuation such as , or ; are also discarded
  3. AST is constructed by adding semantic actions into the grammar
  4. Design different phases of the compiler
    • Each phase might add new requirements to the AST design
    • Grammar/AST design might be revisited

Left and right values

  • When an identifier is used, it can mean either
    1. The value associated with that name
    2. The location (address) where that value is stored
  • The meaning depends on the context where the identifier is used

x = y

  • The identifier y refers to the value of y

    • This is also called right value (R-value) because it is to the right =
    • An object’s self reference (this) is typically only available in right-value form
  • The identifier x refers to the location of x, not the its value

    • Also called left value (L-value)
    • Some languages allow for using R-values in place of L-values (for example with dereference operator)
    • In C: *e means the R-value of e is used as an L-value
    • Other languages, like Java, limit L-values to reduce the ability to change storage unintentionally

Design Pattern for ASTs

  • Class hierarchy for the AST is kept relatively flat
  • Node management is placed into a common superclass AbstractNode
    • Each type of node (assignment, if, while) is then a simple extension of AbstractNode

Visitor Pattern

  • ASTs for languages like Java have \(\approx50\) different node types
  • Compilers like GCC have \(\approx200\) different passes
  • To handle this we can make use of the visitor pattern
    • Each pass can be in its own visitor class

TODO: Example