A Chord Project: Tasks, Targets, and DependenciesGuide for DevelopersArchitecture of ChordJava Program Representation

Java Program Representation

Chord uses Joeq to translate Java bytecode, one class file at a time, into a three-address-like intermediate representation of the input Java program called quadcode. This chapter describes all aspects of quadcode and how it relates to bytecode. It first explains how to pretty-print bytecode and quadcode (Section *) which is useful for debugging analyses and deciphering their output. The remaining sections describe the quadcode representation along with the API of Joeq and Chord for navigating it. Briefly, the representation consists of a set of classes that may be loaded (Section *). The representation of each class consists of a set of members (Section *) which are the fields and methods of the class. The representation of a concrete method (Section *) consists of a control-flow graph (CFG). The representation of a CFG (Section *) consists of a set of registers and a set of basic blocks linked by directed edges denoting flow of control between basic blocks. Each basic block contains zero or more primitive statements called quads (Section *). Finally, the most common way to traverse all quads is discussed (Section *).

Pretty-Printing

Consider the following Java program contained in file examples/hello_world/src/test/HelloWorld.java in Chord's main directory:

package test;

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

First compile this program by running command ant in directory examples/hello_world/.

To pretty-print the bytecode representation of a class, run the following command:

javap -classpath <CLASS_PATH> -bootclasspath <BOOT_CLASS_PATH> -private -verbose <CLASS_NAME>

where:

Program javap comes along with the JVM. The output of the above command for our example is as follows:

Compiled from "HelloWorld.java"
public class test.HelloWorld extends java.lang.Object
  SourceFile: "HelloWorld.java"
  minor version: 0
  major version: 49
  Constant pool:
const #1 = Method   #6.#20; //  java/lang/Object."<init>":()V
const #2 = Field    #21.#22;    //  java/lang/System.out:Ljava/io/PrintStream;
const #3 = String   #23;    //  Hello World!
const #4 = Method   #24.#25;    //  java/io/PrintStream.println:(Ljava/lang/String;)V
const #5 = class    #26;    //  test/HelloWorld
const #6 = class    #27;    //  java/lang/Object
const #7 = Asciz    <init>;
const #8 = Asciz    ()V;
const #9 = Asciz    Code;
const #10 = Asciz   LineNumberTable;
const #11 = Asciz   LocalVariableTable;
const #12 = Asciz   this;
const #13 = Asciz   Ltest/HelloWorld;;
const #14 = Asciz   main;
const #15 = Asciz   ([Ljava/lang/String;)V;
const #16 = Asciz   args;
const #17 = Asciz   [Ljava/lang/String;;
const #18 = Asciz   SourceFile;
const #19 = Asciz   HelloWorld.java;
const #20 = NameAndType #7:#8;//  "<init>":()V
const #21 = class   #28;    //  java/lang/System
const #22 = NameAndType #29:#30;//  out:Ljava/io/PrintStream;
const #23 = Asciz   Hello World!;
const #24 = class   #31;    //  java/io/PrintStream
const #25 = NameAndType #32:#33;//  println:(Ljava/lang/String;)V
const #26 = Asciz   test/HelloWorld;
const #27 = Asciz   java/lang/Object;
const #28 = Asciz   java/lang/System;
const #29 = Asciz   out;
const #30 = Asciz   Ljava/io/PrintStream;;
const #31 = Asciz   java/io/PrintStream;
const #32 = Asciz   println;
const #33 = Asciz   (Ljava/lang/String;)V;

{
public test.HelloWorld();
  Code:
   Stack=1, Locals=1, Args_size=1
   0:   aload_0
   1:   invokespecial   #1; //Method java/lang/Object."<init>":()V
   4:   return
  LineNumberTable:
   line 3: 0
  LocalVariableTable:
   Start  Length  Slot  Name   Signature
   0      5      0    this       Ltest/HelloWorld;

public static void main(java.lang.String[]);
  Code:
   Stack=2, Locals=1, Args_size=1
   0:   getstatic   #2; //Field java/lang/System.out:Ljava/io/PrintStream;
   3:   ldc #3; //String Hello World!
   5:   invokevirtual   #4; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
   8:   return
  LineNumberTable:
   line 5: 0
   line 6: 8
  LocalVariableTable:
   Start  Length  Slot  Name   Signature
   0      9      0    args       [Ljava/lang/String;
}

To pretty-print the quadcode representation of a class, run the following command:

ant -Dchord.work.dir=<WORK_DIR> -Dchord.print.classes=<CLASS_NAME> \
    -Dchord.verbose=0 -Dchord.out.file=<OUT_FILE> run

where:

The output of the above command for our example is as follows:

*** Class: test.HelloWorld
Method: main:([Ljava/lang/String;)V@test.HelloWorld
    0#1
    5#3
    5#2
    8#4
Control flow graph for main:([Ljava/lang/String;)V@test.HelloWorld:
BB0 (ENTRY) (in: <none>, out: BB2)

BB2 (in: BB0 (ENTRY), out: BB1 (EXIT))
1: GETSTATIC_A T1, .out
3: MOVE_A T2, AConst: "Hello World!"
2: INVOKEVIRTUAL_V println:(Ljava/lang/String;)V@java.io.PrintStream, (T1, T2)
4: RETURN_V

BB1 (EXIT)  (in: BB2, out: <none>)

Exception handlers: []
Register factory: Registers: 3

Whole Program

This and the following sections describe the quadcode representation along with the API of Joeq and Chord for navigating it. This API is contained in packages chord.program, joeq.Class, and joeq.Compiler.Quad.

The quadcode representation of the whole program is a unique global object of class chord.program.Program which can be obtained by calling static method chord.program.Program.g(). This class provides a rich API (in the form of public instance methods) to access various parts of the representation, most notably:

IndexSet<jq_Type> getTypes() All types referenced in classes that may be loaded.
IndexSet<jq_Reference> getClasses() All classes that may be loaded.*
IndexSet<jq_Method> getMethods() All methods that may be called.

* Includes both classes/interfaces and array types, represented as objects of jq_Class and jq_Array, respectively; both these are subclasses of jq_Reference.

See Chapter * for how Chord determines which classes may be loaded and which methods may be called.

The quadcode representation of each type is a unique object of the appropriate subclass of joeq.Class.jq_Type in the following hierarchy:

                    jq_Type
                       |
      ------------------------------------
      |                                  |
jq_Primitive                       jq_Reference
                                         |
                          -------------------------------
                          |                             |
                      jq_Class                       jq_Array

Class Members

Each primitive type (e.g., boolean, int, etc.) is represented by a unique jq_Primtive object. Each class and each interface type is represented by a unique jq_Class object. Each array type is represented by a unique jq_Array object.

Members (i.e., fields and methods) of the class/interface represented by an object of class joeq.Class.jq_Class can be accessed using the following API provided by that class.

String getName() Fully-qualified name of the class, e.g., "java.lang.String[]".
jq_InstanceField[] getDeclaredInstanceFields() All instance fields declared in the class.
jq_StaticField[] getDeclaredStaticFields() All static fields declared in the class.
jq_InstanceMethod[] getDeclaredInstanceMethods() All instance methods declared in the class.
jq_StaticMethod[] getDeclaredStaticMethods() All static methods declared in the class.

Chord uses format mName:mDesc@cName, described in class chord.program.MethodSign, to uniquely identify each field and each method in the input Java program, where mName denotes the name of the field/method, mDesc denotes the descriptor of the field/method (see below), and cName denotes the fully-qualified name of the class declaring the field/method. For instance, "main:[Ljava/lang/String;@test.HelloWorld" uniquely identifies the main method in the example above. We next review field descriptors and method descriptors from the Java bytecode specification.

A field descriptor represents the type of a local variable or a (static or instance) field. It is a series of characters generated by the grammar:

FieldDescriptor : FieldType
      FieldType : BaseType | ObjectType | ArrayType
       BaseType : B | C | D | F | I | J | S | Z
     ObjectType : L <classname> ;
      ArrayType : [ ComponentType
  ComponentType : FieldType

The characters of BaseType, the `L' and `;' of ObjectType, and the `[' of ArrayType are all ASCII characters. The <classname> represents a fully qualified class or interface name. The interpretation of the field types is as shown in the below table:

BaseType Character Type Interpretation
B byte signed byte
C char Unicode character
D double double-precision floating-point value
F float single-precision floating-point value
I int integer
J long long integer
L<classname>; reference an instance of class <classname>
S short signed short
Z boolean true or false
[ reference one array dimension

For example, the descriptor of type int is simply I. The descriptor of an instance variable of type Object is "Ljava/lang/Object;". Note that the internal form of the fully qualified name for class Object is used. The descriptor of a multidimensional double array of type "double[][][]" is "[[[D".

A method descriptor represents the types of the arguments and return result of a method:

   MethodDescriptor : ( ParameterDescriptor* ) ReturnDescriptor
ParameterDescriptor : FieldType
   ReturnDescriptor : FieldType | V

A parameter descriptor represents the type of an argument of a method. A return descriptor represents the type of the return result of a method. The character `V' indicates that the method returns no value (its return type is void).

The method descriptor is the same whether it is a static or an instance method. Although an instance method is passed this, a reference to the current class instance, in addition to its intended arguments, that fact is not reflected in the method descriptor.

For example, the method descriptor for the method "Object foo(int i, double d, Thread t)" is "(IDLjava/lang/Thread;)Ljava/lang/Object;". Note that internal forms of the fully qualified names of Thread and Object are used in the method descriptor.

Methods

The quadcode representation of each method is a unique object of class joeq.Class.jq_Method. Components of the method, most notably its control-flow graph, can be accessed using the following API provided by that class.

String getName() Name of the method.
String getDesc().toString() Descriptor of the method, e.g., "(Ljava/lang/String;)V".
jq_Class getDeclaringClass() Declaring class of the method.
ControlFlowGraph getCFG() Control-flow graph of the method.*
int getLineNumber(int bci) Line number of the given bytecode offset (-1 if not found).
Quad getQuad(int bci) First quad at the given bytecode offset (null if not found).
Quad getQuad(int bci, Class kind) First quad of the given kind at the given bytecode offset (null if not found).
Quad getQuad(int bci, Class[] kind) First quad of any given kind at the given bytecode offset (null if not found).
String toString() Unique identifier of the method in format mName:mDesc@cName.

* The control-flow graph must not be asked if the method is abstract (which can be determined by calling instance method isAbstract() of jq_Method).

Control-Flow Graphs

The control-flow graph (CFG) of each method consists of a set of registers, called the register factory, and a directed graph whose nodes are basic blocks and whose edges denote flow of control between basic blocks.

The CFG of each method is a unique object of class joeq.Compiler.Quad.ControlFlowGraph. Components of the CFG can be accessed using the following API provided by that class.

getRegisterFactory() Set of all local variables.
EntryOrExitBasicBlock entry() Unique entry basic block.
EntryOrExitBasicBlock exit() Unique exit basic block.
ListIterator.BasicBlock reversePostOrderIterator() Iterator over all basic blocks in reverse post-order.
jq_Method getMethod() Containing method of the CFG.

The register factory contains one register per argument of the method (called local variables) and one register per temporary in the method body (called stack variables). Temporaries include those declared by programmers as well as those generated by Joeq. The reason Joeq can generate temporaries is that the quadcode representation, which is register-based, is constructed from Java bytecode, which is stack-based; moreover, Joeq does the Static Single Assignment (SSA) transformation by default, which introduces temporaries to ensure that there is at most one static assignment to any variable. Registers corresponding to local variables are named R0, R1, ..., Rn, while those corresponding to stack variables are named Tn+1, Tn+2, ..., Tm.

For instance, the register factory of the main method in the example above has 3 registers: R0 denoting the args argument of the method and T1 and T2 denoting temporaries generated by Joeq.

Each register factory is a unique object of class joeq.Compiler.Quad.RegisterFactory.

Besides the register factory, a CFG has a directed graph whose nodes are basic blocks and whose edges denote flow of control between basic blocks. Each basic block contains a straight-line sequence of zero or more primitive statements called quads (Section *). Each CFG is guaranteed to contain at least two basic blocks: a unique entry basic block with no incoming edges and a unique exit block with no outgoing edges. The entry and exit basic blocks do not contain any quads.

Each basic block is a unique object of class joeq.Compiler.Quad.BasicBlock (the entry and exit basic blocks are instances of a subclass joeq.Compiler.Quad.EntryOrExitBasicBlock). Components of the basic block can be accessed using the following API provided by that class.

int size() Number of quads contained in the basic block.
Quad getQuad(int index) Quad at the given 0-based index.
List.BasicBlock getPredecessors() List of immediate predecessor basic blocks.
List.BasicBlock getSuccessors() List of immediate successor basic blocks.
jq_Method getMethod() Containing method of the basic block.

Quads

Chord uses format offset!mName:mDesc@cName, described in class chord.program.MethodElem, to uniquely identify each bytecode instruction in the input Java program, where offset is the (0-based) bytecode offset of the instruction in its containing method, mName is the name of the method, mDesc is the descriptor of the method, and cName is the fully-qualified name of the class declaring the method. For instance, "8!main:[Ljava/lang/String;@test.HelloWorld" uniquely identifies the return instruction in the main method in the example above.

The quadcode representation is register-based, as opposed to Java bytecode that is used to construct it, which is stack-based. As a result, it uses quads to represent bytecode instructions. A quad is a primitive statement that consists of an operator and upto four operands. There is no one-to-one correspondence between bytecode instructions and quads: certain bytecode instructions generate a sequence of more than one quads while others do not generate any quad. The API of class jq_Method provides various getQuad(...) methods to access the quad(s) corresponding to a bytecode instruction (see Section *).

Each quad is a unique object of class joeq.Compiler.Quad.Quad. Components of the quad can be accessed using the following API provided by that class.

Operator getOperator() Kind of the quad.
int getBCI() Bytecode offset of the quad in its containing method.
String toByteLocStr() Unique identifier of the quad in format offset!mName:mDesc@cName.
String toJavaLocStr() Location of the quad in format fileName:lineNum in Java source code.
String toLocStr() Location of the quad in both Java bytecode and source code.
String toVerboseStr() Verbose description of the quad (its location plus contents).
jq_Method getMethod() Containing method of the quad.

The kind of each quad is determined by its operator which is a unique object of the appropriate subclass of joeq.Compiler.Quad.Operator in the following hierarchy:

Operator
    |
    |--- Move
    |--- Phi
    |--- Unary
    |--- Binary
    |--- New
    |--- NewArray
    |--- MultiNewArray
    |--- Getstatic
    |--- Putstatic
    |--- ALoad
    |--- AStore
    |--- Getfield
    |--- Putfield
    |--- CheckCast
    |--- InstanceOf
    |--- ALength
    |--- Return
    |
    |--- Branch
    |       |
    |       |--- IntIfCmp
    |       |--- Goto
    |       |--- Jsr
    |       |--- Ret
    |       |--- LookupSwitch
    |       |--- TableSwitch
    |
    |--- Invoke
    |       |
    |       |--- InvokeVirtual
    |       |--- InvokeStatic
    |       |--- InvokeInterface
    |
    |--- Monitor
            |
            |--- MONITORENTER
            |--- MONITOREXIT

The number and kinds of operands of each quad depends upon the kind of the operator. Each of the above subclasses of Operator provides an API to access the operands of the quad. For instance, the components of a Getfield quad q of the form "l = b.f" can be accessed as follows:

Operand lo = Getfield.getSrc(q);
Operand bo = Getfield.getBase(q);
if (lo instanceof RegisterOperand && bo instanceof RegisterOperand) {
    Register l = ((RegisterOperand) lo).getRegister();
    Register b = ((RegisterOperand) bo).getRegister();
    jq_Field f = Getfield.getField(q).getField();
    ...
}

Traversing Quadcode

A common way to traverse all quads in the input Java program is as follows:


import chord.program.Program;
import joeq.Compiler.Quad.QuadVisitor;
import joeq.Class.jq_Method;
import joeq.Compiler.Quad.ControlFlowGraph;
import joeq.Util.Templates.ListIterator;
import joeq.Compiler.Quad.BasicBlock;
import joeq.Compiler.Quad.Quad;

QuadVisitor qv = new QuadVisitor.EmptyVisitor() {
    public void visitMove(Quad q) { ... }
    public void visitPhi(Quad q) { ... }
    public void visitUnary(Quad q) { ... }
    ...
};
Program program = Program.g();
for (jq_Method m : program.getMethods()) {
    if (!m.isAbstract()) {
        ControlFlowGraph cfg = m.getCFG();
        ListIterator.BasicBlock it = cfg.reversePostOrderIterator();
        while (it.hasNext()) {
            BasicBlock b = it.nextBasicBlock();
            for (int i = 0; i < b.size(); i++) {
                Quad q = b.getQuad(i);
                q.accept(qv);
            }
        }
    }
}

mhn@cs.stanford.edu

A Chord Project: Tasks, Targets, and DependenciesGuide for DevelopersArchitecture of ChordJava Program Representation