Guide for DevelopersDynamic AnalysisDatalog Analysis

Datalog Analysis

A common way to rapidly prototype an analysis in Chord is using a declarative logic-programming language called Datalog. This chapter describes all aspects of Datalog analyses in Chord. Section * explains how to write and run a Datalog analysis. Section * explains how to tune its performance. Finally, Section * explains the representation of BDDs (Binary Decision Diagrams) which are a data structure central to executing Datalog analyses.

Writing a Datalog Analysis

A Datalog analysis declares a bunch of input/output program relations, each over one or more program domains, and provides a bunch of rules (constraints) specifying how to compute the output relations from the input relations. It can be defined in any file with suffix .dlog in any directory in the path specified by property chord.dlog.analysis.path. An example Datalog analysis is shown below:

    # name=datarace-dlog

    # Program domains
    .include "E.dom"
    .include "F.dom"
    .include "T.dom"

    # BDD variable order
    .bddvarorder E0xE1_T0_T1_F0

    # Input/intermediate/output program relations
    field(e:E0,f:F0) input
    write(e:E0) input
    reach(t:T0,e:E0) input
    alias(e1:E0,e2:E1) input
    escape(e:E0) input
    unguarded(t1:T0,e1:E0,t2:T1,e2:E1) input
    datarace(t1:T0,e1:E0, t2:T1,e2:E1) output

    # Analysis constraints
    hasWrite(e1,e2) :- write(e1).
    hasWrite(e1,e2) :- write(e2).
    candidate(e1,e2) :- field(e1,f), field(e2,f), hasWrite(e1,e2), e1 <= e2.
    datarace(t1,e1,t2,e2) :- candidate(e1,e2), reach(t1,e1), reach(t2,e2), \
        alias(e1,e2), escape(e1), escape(e2), unguarded(t1,e1,t2,e2).

Any line that begins with a "#" is regarded a comment, except a line of the form "# name=<ANALYSIS_NAME>", which specifies the name <ANALYSIS_NAME> of the Datalog analysis. Each Datalog analysis is expected to have exactly one such line. The above Datalog analysis is named datarace-dlog.

The ".include "<DOM_NAME>.dom"" lines specify each domain named <DOM_NAME> that is needed by the Datalog analysis, i.e., each domain over which any relation that is input/output by the Datalog analysis is defined. The declaration of each such relation specifies the relation's schema: all the domains over which the relation is defined. If the same domain appears multiple times in a relation's schema then contiguous integers starting from 0 must be used to distinguish them; for instance, in the above example, candidate is a binary relation, both of whose attributes have domain E, and they are distinguished as E0 and E1.

Each relation is represented symbolically (as opposed to explicitly) using a graph-based data structure called a Binary Decision Diagram (BDD for short). Each domain containing N elements is assigned log2(N) BDD variables. The size of a BDD and the efficiency of operations on it depends heavily on the order of these BDD variables. The ".bddvarorder <BDD_VAR_ORDER>" line in the Datalog analysis enables the Datalog analysis writer to specify this order. It must list all domains along with their numerical suffixes, separated by `_' or `x'. Using a `_' between two domains, such as T0_T1, means that the BDD variables assigned to domain T0 precede those assigned to domain T1 in the BDD variable order for this Datalog analysis. Using a `x' between two domains, such as E0xE1, means that the BDD variables assigned to domains E0 and E1 will be interleaved in the BDD variable order for this Datalog analysis. See Section * for more details on BDD representations.

Each Datalog analysis rule is a Horn clause of the form "R(t) :- R1(t1), ..., Rn(tn)" meaning that if relations R1, ..., Rn contain tuples t1, ..., tn respectively, then relation R contains tuple t. A backslash may be used at the end of a line to break long rules for readability. The Datalog analysis solver bddbddb used in Chord does not apply any sophisticated optimizations to simplify the rules; besides the BDD variable order, the manner in which these rules are expressed heavily affects the performance of the solver. For instance, an important manual optimization involves breaking down long rules into multiple shorter rules communicating via intermediate relations. See Section * for hints on tuning the performance of Datalog analyses.

Running a Datalog analysis is no different from running any other kind of analysis in Chord. See Chapter * for how to run an analysis.

Tuning Performance

There are several tricks analysis writers can try to improve the performance of bddbddb, the Datalog solver used by Chord, often by several orders of magnitude. Try these tricks by running the following command:

ant -Ddlog.file=<FILE> -Dwork.dir=<DIR> solve

where <FILE> is the file defining the Datalog analysis to be tuned, and <DIR> is the directory containing the program domains (*.dom files) and program relations (*.bdd files) consumed by the analysis (this is by default the chord_output/bddbddb/ directory generated by a previous run of Chord.

  1. Set properties noisy=yes, tracesolve=yes, and fulltracesolve=yes on the above command line and observe which rule gets "stuck" (i.e., takes several seconds to solve). fulltracesolve is seldom useful, but noisy and tracesolve are often very useful. Once you identify the rule that is getting stuck, it will also tell you which relations and which domains used in that rule, and which operation on them, is taking a long time to solve. Then try to fix the problem with that rule by doing either or both of the following:

  2. Once you have ensured that none of the rules is getting "stuck", you will notice that some rules are applied too many times, and so although each application of the rule itself isn't taking too much time, the cumulative time for the rule is too much. After finishing solving a Datalog analysis, bddbddb prints how long each rule took to solve (both in terms of the number of times it was applied and the cumulative time it took). It sorts the rules in the order of the cumulative time. You need to focus on the rules that took the most time to solve (they will be at the bottom of the list). Assuming you removed the problem of rules getting "stuck", the rules will roughly be in the order of the number of times they were applied. Here is an example:

    OUT> Rule VH(u:V0,h:H0) :- VV(u:V0,v:V1), VH(v:V1,h:H0), VHfilter
    OUT>    Updates: 2871
    OUT>    Time: 6798 ms
    OUT>    Longest Iteration: 0 (0 ms)
    OUT> Rule IM(i:I0,m:M0) :- reachableI(i:I0), specIMV(i:I0,m:M0,v:V0), VH(v:V0,_:H0).
    OUT>    Updates: 5031
    OUT>    Time: 6972 ms
    OUT>    Longest Iteration: 0 (0 ms)

    Notice that the second rule was applied 5031 times whereas the first was applied 2871 times. More importantly, the second rule took 6972 milliseconds in all, compared to 6798 for the first rule. Hence, you should focus on the second rule first, and try to speed it up. This means that you should focus only on relations IM, reachableI, specIMV, and VH, and the domains I0, M0, V0, and H0. Any changes you make that do not affect these relations and domains are unlikely to make your solving faster. In general, look at the last few rules, not just the last one, and try to identify the "sub-analysis" of the Datalog analysis that seems problematic, and then focus on speeding up just that sub- analysis.

  3. You can add the .split keyword at the end of certain rules as a hint to bddbddb to decompose those rules into simpler ones that can be solved faster. You can also set property split_all_rules=yes as shorthand for splitting all rules without adding the .split keyword to any of them, though I seldom find splitting all rules helpful.

  4. You can try to decompose a single Datalog analysis file into two separate Datalog analysis files. Of course, you cannot separate mutually-recursive rules into two different analyses, but if you unnecessarily club together rules that could have gone into different analyses, then they can put conflicting demands on bddbddb (e.g., on the BDD variable order). So if rule 2 uses the result of rule 1 and rule 1 does not use the result of rule 2, then put rule 1 and rule 2 in separate Datalog analyses.

  5. Observe the sizes of the BDDs representing the relations that are input and output. bddbddb prints both the number of tuples in each relation and the number of nodes in the BDD. Try changing the BDD variable order for the domains of the relation, and observe how the number of nodes in the BDD for that relation change. You will notice that some orders perform remarkably better than others. Then note down these orders as invariants that you will not violate as you tweak other things.

  6. The relative order of values *within* domains (e.g., in domains named M, H, C, etc. in Chord) affects the performance of bddbddb, but I've never tried changing this and studying its effect. It might be worth trying. For instance, John Whaley's PLDI'04 paper describes a specific way in which he numbers contexts (in domain C) and that it was fundamental to the speedup of his "infinity"-CFA points-to analysis.

  7. Finally, it is worth emphasizing that BDDs are not magic. If your algorithm itself is fundamentally hard to scale, then BDDs are unlikely to help you a whole lot. Secondly, many things are awkward to encode as integers (e.g., the abstract contexts in domain C in Chord) or as Datalog rules. For instance, I've noticed that summary-based context-sensitive program analyses are hard to express in Datalog. The may-happen-in-parallel analysis provided in Chord shows a relatively simple kind of summary-based analysis that uses the Reps-Horwitz-Sagiv tabulation algorithm. But this is as far as I could get--more complicated summary-based algorithms are best written in Java itself instead of Datalog.

BDD Representation

Each domain containing N elements is assigned log2(N) BDD variables in the underlying BDD factory with contiguous IDs. For instance, domain F0 containing [128..256) elements will be assigned 8 variables with IDs (say) 63,64,65,66,67,68,69,70 and domain Z0 containing [8..16) elements will be assigned 4 variables with IDs (say) 105,106,107,108.

If two domains are uninterleaved in the declared domain order in a Datalog program (i.e., `_' is used instead of `x' between them), then the BDD variables assigned to those domains are ordered in reverse order in the underlying BDD factory. For instance, the BDD variable order corresponding to the declared domain order F0_Z0 is (in level2var format) "70,69,68,67,66,65,64,63,108,107,106,105".

If two domains are interleaved in the declared domain order in a Datalog program (i.e., `x' is used instead of `_' between them), then the BDD variables assigned to those domains are still ordered in reverse order of IDs in the underlying BDD factory, but they are also interleaved. For instance, the BDD variable order corresponding to the declared domain order F0xZ0 is (in level2var format) "70,108,69,107,68,106,67,105,66,65,64,63".

Each BDD variable is at a unique "level" which is its 0-based position in the BDD variable order in the underlying BDD factory.

We will next illustrate the format of a BDD stored on disk (in a .bdd file) using the following example:

# V0:16 H0:12
# 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 82 83 84 85 86 87 88 89 90 91 92 93
28489 134
39 36 33 30 27 24 21 18 15 12 9 6 3 0 81 79 77 75 73 71 69 67 65 63 61 59 57 55  \
   53 51 82 80 78 76 74 72 70 68 66 64 62 60 58 56 54 52 37 34 31 28 25 22 19 16 \
   13 10 7 4 1 117 116 115 114 113 112 111 110 109 108 107 106 50 49 48 47 46 45 \
   44 43 42 41 40 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 \
   86 85 84 83 133 132 131 130 129 128 127 126 125 124 123 122 121 120 119 118   \
   38 35 32 29 26 23 20 17 14 11 8 5 2
287 83 0 1
349123 84 287 0
349138 85 0 349123

The first comment line indicates the domains in the relation (in the above case, V0 and H0, represented using 16 and 12 unique BDD variables, respectively).

If there are N domains, there are N following comment lines, each specifying the BDD variables assigned to the corresponding domain.

The following line has two numbers: the number of nodes in the represented BDD (28489 in this case) and the number of variables in the BDD factory from which the BDD was dumped to disk. Note that the number of variables (134 in this case) is not necessarily the number of variables in the represented BDD (16+12=28 in this case) though it is guaranteed to be greater than or equal to it.

The following line specifies the BDD variable order in var2level format. In this case, the specified order subsumes V0_H0 (notice that levels "81 79 77 75 73 71 69 67 65 63 61 59 57 55 53 51", which are at positions "14 15 ... 28 29" in the specified order are lower than levels "105 104 103 102 101 100 99 98 97 96 95 94" which are at positions "82 83 .. 92 93").

Each of the following lines specifies a unique node in the represented BDD; it has format "X V L H" where:

The order of these lines specifying BDD nodes is such that the lines specifying the BDD nodes in the L and H entries of each BDD node are guaranteed to occur before the line specifying that BDD node (for instance, the L entry 287 on the second line and the R entry 349123 on the third line are IDs of BDD nodes specified on the first and second lines, respectively).

Note on Terminology: The support of a BDD b is another BDD r = that represents all the variables used in b. The support BDD r is a linear tree each of whose nodes contains a separate variable, the low branch is 0, and high branch is the node representing the next variable. To list all the variables used in a BDD b use r.scanSet(). Needless to say, scanSet() simply walks along the high branches starting at the root of BDD r.

Guide for DevelopersDynamic AnalysisDatalog Analysis