Intermediate Representation

Why do we care?

  • Translating the AST directly into the target language is fine for simple languages
    • Input syntax directly determines the output
  • This results in bad output
    • No optimizations
    • Richer source language features are hard to encode
    • It is hard to optimize the resulting assembly directly
      • The representation is too concrete
      • Only a fixed number of registers
  • Retargeting the compiler to a new architecture is hard
    • Target assembly code is hard-wired into the translation
  • Instead, introduce an abstract machine language
    • Hides details of the target architecture
    • This allows machine independent code generation and optimization
    • Goal: get program closer to machine code without losing the information needed for analysis and optimizations
    • This is what we call an intermediate representation (IR)
    • In practice, multiple IRs might be used (each might be for different purposes)

What makes a good IR?

  • Easy to translate into (from the level above)
  • Easy to translate from (to the level below)
  • Narrow interface
    • Fewer constructs means simpler phases and optimizations
    • Source language might have constructs like while, for, foreach
      • In IR we only have while and translate for and foreach into while

Different Abstraction Levels

High-level IRs

  • Abstract syntax and new node types not generated by the parser
    • E.g., Have type checking information
    • Preserve high-level language constructs
      • Structured control-flow, variable names, methods, functions
      • May do some simplifications like for to while
    • Allow high-level optimizations based on program structure
      • Inline small functions, reuse constants
    • Useful for semantic analysis like type-checking

Mid-level IRs

  • Intermediate between AST and assembly
  • May have unstructured jumps, abstract registers or memory locations
  • Convenient for translation to high-quality machine code
  • All intermediate values might be named to facilitate optimizations that attempt to minimize stack/register usage
(x1 + (x2 - x3)) * x4) --> t1 = x2 - x3
                           t2 = x1 + t1
                           t3 = t2 * x4
  • Examples:
    1. Triples: OP a b
      • Useful for instruction selection
    2. Quadruples: a = b OP c three-address-form
    3. SSA (static-single-assignment-form)
      • Variant of quadruples where each variable is assigned exactly once
      • Easy dataflow analysis for optimizations
      • e.g. LLVM is based on SSA
    4. Stack-based
      • Easy to generate
      • e.g. Java Bytecode

High-level IRs

  • Machine-dependent assembly code and extra “pseudo-instructions”
    • “Pseudo instruction” to interface with garbage collector or memory allocator (parts of the language runtime system)
    • Source structure of the program is lost
    • Translation to actual assembly is straightforward
    • Allows low-level optimizations based on target architecture
      • Register allocation, instruction selection, memory layout