Datalog Analysis

A common way to rapidly prototype an analysis in Chord is to use a declarative logic-programming language called Datalog. This chapter describes all aspects of Datalog analyses in Chord. Section * explains how to write and run a Datalog analysis. Section * explains how to tune its performance. Finally, Section * explains the representation of BDDs (Binary Decision Diagrams), a data structure central to executing Datalog analyses.

A Datalog analysis declares a set of input and output program relations, each over
one or more program domains, and provides a set of rules (constraints)
specifying how to compute the output relations from the input relations. It can
be defined in any file with suffix `.dlog` in any directory on the path
specified by property `chord.dlog.analysis.path`. An example Datalog
analysis is shown below:

```
# name=datarace-dlog

# Program domains
.include "E.dom"
.include "F.dom"
.include "T.dom"

# BDD variable order
.bddvarorder E0xE1_T0_T1_F0

# Input/intermediate/output program relations
field(e:E0,f:F0) input
write(e:E0) input
reach(t:T0,e:E0) input
alias(e1:E0,e2:E1) input
escape(e:E0) input
unguarded(t1:T0,e1:E0,t2:T1,e2:E1) input
hasWrite(e1:E0,e2:E1)
candidate(e1:E0,e2:E1)
datarace(t1:T0,e1:E0,t2:T1,e2:E1) output

# Analysis constraints
hasWrite(e1,e2) :- write(e1).
hasWrite(e1,e2) :- write(e2).
candidate(e1,e2) :- field(e1,f), field(e2,f), hasWrite(e1,e2), e1 <= e2.
datarace(t1,e1,t2,e2) :- candidate(e1,e2), reach(t1,e1), reach(t2,e2), \
    alias(e1,e2), escape(e1), escape(e2), unguarded(t1,e1,t2,e2).
```

Any line that begins with a "`#`" is regarded as a comment, except a
line of the form "`# name=<ANALYSIS_NAME>`", which specifies the name
`<ANALYSIS_NAME>` of the Datalog analysis.
Each Datalog analysis is expected to have exactly one such line.
The above Datalog analysis is named `datarace-dlog`.

The "`.include "<DOM_NAME>.dom"`" lines specify each domain named
`<DOM_NAME>` that is needed by the Datalog analysis, i.e., each domain over which
any relation that is input/output by the Datalog analysis is defined.
The declaration of each such relation specifies the relation's schema:
all the domains over which the relation is defined.
If the same domain appears multiple times in a relation's schema then
contiguous integers starting from 0 must be used to distinguish them; for instance,
in the above example, `candidate` is a binary relation, both of whose
attributes have domain `E`, and they are distinguished as `E0` and `E1`.

Each relation is represented symbolically (as opposed to explicitly)
using a graph-based data structure called a Binary Decision Diagram (BDD for short).
Each domain containing N elements is assigned log2(N) BDD variables.
The size of a BDD and the efficiency of operations on it depend heavily
on the order of these BDD variables.
The "`.bddvarorder <BDD_VAR_ORDER>`" line in the Datalog analysis enables the Datalog
analysis writer to specify this order.
It must list all domains along with their numerical suffixes, separated
by `_` or `x`.
Using a `_` between two domains, such as `T0_T1`, means that the BDD variables assigned
to domain `T0` precede those assigned to domain `T1` in the BDD variable
order for this Datalog analysis.
Using an `x` between two domains, such as `E0xE1`, means that the
BDD variables assigned to domains `E0` and `E1`
will be interleaved in the BDD variable order for this Datalog analysis.
See Section * for more details on BDD representations.

Each Datalog analysis rule is a Horn clause of the form
"`R(t) :- R1(t1), ..., Rn(tn)`"
meaning that if relations `R1`, ..., `Rn` contain tuples `t1`, ..., `tn`
respectively, then relation `R` contains tuple `t`.
A backslash may be used at the end of a line to break long rules for readability.
The Datalog analysis solver bddbddb used in Chord does not apply any
sophisticated optimizations to simplify the rules; besides the BDD variable order,
the manner in which these rules are expressed heavily affects the performance of
the solver. For instance, an important manual optimization involves breaking down
long rules into multiple shorter rules communicating via intermediate relations.
See Section * for hints on tuning the performance
of Datalog analyses.
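For instance, the long `datarace` rule from the example above could be decomposed as follows. The intermediate relation `raceCand` is hypothetical, and whether this particular split actually helps depends on the analysis and the BDD variable order; it is only a sketch of the technique:

```
# Original rule:
datarace(t1,e1,t2,e2) :- candidate(e1,e2), reach(t1,e1), reach(t2,e2), \
    alias(e1,e2), escape(e1), escape(e2), unguarded(t1,e1,t2,e2).

# One possible decomposition via an intermediate relation:
raceCand(e1:E0,e2:E1)

raceCand(e1,e2) :- candidate(e1,e2), alias(e1,e2), escape(e1), escape(e2).
datarace(t1,e1,t2,e2) :- raceCand(e1,e2), reach(t1,e1), reach(t2,e2), \
    unguarded(t1,e1,t2,e2).
```

The split keeps each rule's RHS short: the first rule joins only relations over domains `E0` and `E1`, while the second brings in the `T0` and `T1` domains.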

Running a Datalog analysis is no different from running any other kind of analysis in Chord. See Chapter * for how to run an analysis.

There are several tricks analysis writers can try to improve the performance of bddbddb, the Datalog solver used by Chord, often by several orders of magnitude. Try these tricks by running the following command:

```
ant -Ddlog.file=<FILE> -Dwork.dir=<DIR> solve
```

where `<FILE>` is the file defining the Datalog analysis
to be tuned, and `<DIR>` is the directory containing the
program domains (`*.dom` files) and program relations (`*.bdd` files)
consumed by the analysis (by default, this is the
`chord_output/bddbddb/` directory generated
by a previous run of Chord).

Set properties `noisy=yes`, `tracesolve=yes`, and `fulltracesolve=yes`
on the above command line and observe which rule gets "stuck" (i.e., takes
several seconds to solve). `fulltracesolve` is seldom useful, but `noisy`
and `tracesolve` are often very useful. Once you identify the rule that is
getting stuck, the output will also tell you which relations and domains
used in that rule, and which operation on them, is taking a long time to
solve. Then try to fix the problem with that rule by doing either or both
of the following:

- Break down the rule into multiple rules by creating intermediate relations (the more relations you have on the RHS of a rule, the longer it generally takes to solve that rule).
- Change the relative order of the domains of those relations in the BDD variable order (note that you can use either `_` or `x` between a pair of domains).

Once you have ensured that none of the rules is getting "stuck", you will notice that some rules are applied too many times, and so although each application of the rule itself isn't taking too much time, the cumulative time for the rule is too much. After finishing solving a Datalog analysis, bddbddb prints how long each rule took to solve (both in terms of the number of times it was applied and the cumulative time it took). It sorts the rules in the order of the cumulative time. You need to focus on the rules that took the most time to solve (they will be at the bottom of the list). Assuming you removed the problem of rules getting "stuck", the rules will roughly be in the order of the number of times they were applied. Here is an example:

```
OUT> Rule VH(u:V0,h:H0) :- VV(u:V0,v:V1), VH(v:V1,h:H0), VHfilter(u:V0,h:H0).
OUT>     Updates: 2871
OUT>     Time: 6798 ms
OUT>     Longest Iteration: 0 (0 ms)
OUT> Rule IM(i:I0,m:M0) :- reachableI(i:I0), specIMV(i:I0,m:M0,v:V0), VH(v:V0,_:H0).
OUT>     Updates: 5031
OUT>     Time: 6972 ms
OUT>     Longest Iteration: 0 (0 ms)
```

Notice that the second rule was applied 5031 times whereas the first was applied 2871 times. More importantly, the second rule took 6972 milliseconds in all, compared to 6798 for the first rule. Hence, you should focus on the second rule first and try to speed it up. This means that you should focus only on relations `IM`, `reachableI`, `specIMV`, and `VH`, and the domains `I0`, `M0`, `V0`, and `H0`. Any changes you make that do not affect these relations and domains are unlikely to make your solving faster. In general, look at the last few rules, not just the last one, and try to identify the "sub-analysis" of the Datalog analysis that seems problematic, and then focus on speeding up just that sub-analysis.
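For long logs it can help to re-sort the statistics yourself. The sketch below (a hypothetical helper, not part of Chord or bddbddb) parses output of the form shown above and orders rules by cumulative time:

```python
def rank_rules(log_text):
    # Parse bddbddb "OUT>" rule statistics (format as shown above) and
    # return (rule, updates, millis) tuples, slowest rule last.
    rules, current = [], None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("OUT>"):
            line = line[len("OUT>"):].strip()
        if line.startswith("Rule "):
            current = [line[len("Rule "):], 0, 0]
            rules.append(current)
        elif current and line.startswith("Updates:"):
            current[1] = int(line.split()[1])
        elif current and line.startswith("Time:"):
            current[2] = int(line.split()[1])
    return sorted((tuple(r) for r in rules), key=lambda r: r[2])

log = """\
OUT> Rule VH(u:V0,h:H0) :- VV(u:V0,v:V1), VH(v:V1,h:H0), VHfilter(u:V0,h:H0).
OUT>     Updates: 2871
OUT>     Time: 6798 ms
OUT> Rule IM(i:I0,m:M0) :- reachableI(i:I0), specIMV(i:I0,m:M0,v:V0), VH(v:V0,_:H0).
OUT>     Updates: 5031
OUT>     Time: 6972 ms"""

print(rank_rules(log)[-1][1:])  # -> (5031, 6972)
```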

You can add the `.split` keyword at the end of certain rules as a hint to
bddbddb to decompose those rules into simpler ones that can be solved faster.
You can also set property `split_all_rules=yes` as shorthand for splitting all
rules without adding the `.split` keyword to any of them, though I seldom find
splitting all rules helpful.

You can also try to decompose a single Datalog analysis file into two separate Datalog analysis files. Of course, you cannot separate mutually-recursive rules into two different analyses, but if you unnecessarily club together rules that could have gone into different analyses, then they can put conflicting demands on bddbddb (e.g., on the BDD variable order). So if rule 2 uses the result of rule 1 and rule 1 does not use the result of rule 2, then put rule 1 and rule 2 in separate Datalog analyses.

Observe the sizes of the BDDs representing the relations that are input and output. bddbddb prints both the number of tuples in each relation and the number of nodes in the BDD. Try changing the BDD variable order for the domains of the relation, and observe how the number of nodes in the BDD for that relation changes. You will notice that some orders perform remarkably better than others. Then note down these orders as invariants that you will not violate as you tweak other things.

The relative order of values *within* domains (e.g., in domains named `M`, `H`,
`C`, etc. in Chord) affects the performance of bddbddb, but I've never tried
changing this and studying its effect. It might be worth trying. For instance,
John Whaley's PLDI'04 paper describes a specific way in which he numbers
contexts (in domain `C`) and notes that it was fundamental to the speedup of
his "infinity"-CFA points-to analysis.

Finally, it is worth emphasizing that BDDs are not magic. If your algorithm
itself is fundamentally hard to scale, then BDDs are unlikely to help you a
whole lot. Secondly, many things are awkward to encode as integers (e.g., the
abstract contexts in domain `C` in Chord) or as Datalog rules. For instance,
I've noticed that summary-based context-sensitive program analyses are hard to
express in Datalog. The may-happen-in-parallel analysis provided in Chord shows
a relatively simple kind of summary-based analysis that uses the
Reps-Horwitz-Sagiv tabulation algorithm. But this is as far as I could get;
more complicated summary-based algorithms are best written in Java itself
instead of Datalog.

Each domain containing N elements is assigned log2(N) BDD variables in the underlying BDD factory with contiguous IDs.
For instance,
domain `F0` containing [128..256) elements will be assigned 8 variables with IDs (say) 63,64,65,66,67,68,69,70 and
domain `Z0` containing [8..16) elements will be assigned 4 variables with IDs (say) 105,106,107,108.
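The variable counts quoted above can be reproduced by a one-line sketch (a hypothetical helper, not part of Chord):

```python
def num_bdd_vars(n):
    # Number of BDD variables assigned to a domain with n elements,
    # matching the ranges quoted above: [128..256) -> 8, [8..16) -> 4.
    # (Equivalent to ceil(log2(n + 1)) for n >= 1.)
    return n.bit_length()

print(num_bdd_vars(128), num_bdd_vars(255))  # -> 8 8
print(num_bdd_vars(8), num_bdd_vars(15))     # -> 4 4
```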

If two domains are uninterleaved in the declared domain order in a Datalog program (i.e., `_` is used instead of `x` between them),
then the BDD variables assigned to those domains are ordered in reverse order in the underlying BDD factory.
For instance, the BDD variable order corresponding to the declared domain order `F0_Z0` is (in level2var format)
"70,69,68,67,66,65,64,63,108,107,106,105".

If two domains are interleaved in the declared domain order in a Datalog program (i.e., `x` is used instead of `_` between them),
then the BDD variables assigned to those domains are still ordered in reverse order of IDs in the underlying BDD factory,
but they are also interleaved.
For instance, the BDD variable order corresponding to the declared domain order `F0xZ0` is (in level2var format)
"70,108,69,107,68,106,67,105,66,65,64,63".

Each BDD variable is at a unique "level" which is its 0-based position in the BDD variable order in the underlying BDD factory.

We will next illustrate the format of a BDD stored on disk (in a .bdd file) using the following example:

```
# V0:16 H0:12
# 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
# 82 83 84 85 86 87 88 89 90 91 92 93
28489 134
39 36 33 30 27 24 21 18 15 12 9 6 3 0 81 79 77 75 73 71 69 67 65 63 61 59 57 55 \
53 51 82 80 78 76 74 72 70 68 66 64 62 60 58 56 54 52 37 34 31 28 25 22 19 16 \
13 10 7 4 1 117 116 115 114 113 112 111 110 109 108 107 106 50 49 48 47 46 45 \
44 43 42 41 40 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 \
86 85 84 83 133 132 131 130 129 128 127 126 125 124 123 122 121 120 119 118 \
38 35 32 29 26 23 20 17 14 11 8 5 2
287 83 0 1
349123 84 287 0
349138 85 0 349123
...
```

The first comment line indicates the domains in the relation (in the above case, `V0` and `H0`,
represented using 16 and 12 unique BDD variables, respectively).

If there are *N* domains, there are *N* following comment lines, each specifying the
BDD variables assigned to the corresponding domain.

The following line has two numbers: the number of nodes in the represented BDD (28489 in this case) and the number of variables in the BDD factory from which the BDD was dumped to disk. Note that the number of variables (134 in this case) is not necessarily the number of variables in the represented BDD (16+12=28 in this case) though it is guaranteed to be greater than or equal to it.

The following line specifies the BDD variable order in var2level format. In this case, the specified order subsumes
`V0_H0` (notice that levels "81 79 77 75 73 71 69 67 65 63 61 59 57 55 53 51", which are at positions "14 15 ... 28 29"
in the specified order are lower than levels "105 104 103 102 101 100 99 98 97 96 95 94" which are at positions
"82 83 .. 92 93").

Each of the following lines specifies a unique node in the represented BDD; it has format "X V L H" where:

- X is the ID of the BDD node
- V is the ID of the BDD variable labeling that node (unless it is 0 or 1, in which case it represents a leaf node)
- L is the ID of the BDD node's low child
- H is the ID of the BDD node's high child

The order of these lines specifying BDD nodes is such that the lines specifying the BDD nodes in the L and H entries of each BDD node are guaranteed to occur before the line specifying that BDD node (for instance, the L entry 287 on the second line and the H entry 349123 on the third line are IDs of BDD nodes specified on the first and second lines, respectively).
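A reader for the node lines can exploit this ordering guarantee. The sketch below (a hypothetical helper, not part of Chord) parses "X V L H" lines and checks that each node's children are defined first:

```python
def read_bdd_nodes(lines):
    # Parse "X V L H" node lines from a .bdd file. Terminal nodes 0 and 1
    # are implicit; every child ID must already have been seen, per the
    # ordering guarantee described above.
    nodes = {0: "leaf", 1: "leaf"}
    for line in lines:
        x, v, lo, hi = map(int, line.split())
        assert lo in nodes and hi in nodes, "children must be defined first"
        nodes[x] = (v, lo, hi)
    return nodes

nodes = read_bdd_nodes([
    "287 83 0 1",
    "349123 84 287 0",
    "349138 85 0 349123",
])
print(nodes[349138])  # -> (85, 0, 349123)
```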

Note on Terminology:
The *support* of a BDD `b` is another BDD `r` = `b.support()`
that represents all the variables used in `b`.
The support BDD `r` is a linear tree, each of whose nodes contains a distinct variable;
each node's low branch is 0, and its high branch is the node representing the next variable.
To list all the variables used in a BDD `b` use `r.scanSet()`.
Needless to say, `scanSet()` simply walks along the high branches
starting at the root of BDD `r`.
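The walk that `scanSet()` performs can be illustrated on a toy node representation (the tuples here are a hypothetical model, not the actual data structure used by the BDD library):

```python
# Toy model of a support BDD: a node is (var, low, high); the terminals
# are the integers 0 and 1. In a support BDD every low branch is 0 and
# every high branch leads to the node for the next variable.
def scan_set(node):
    vars_ = []
    while node not in (0, 1):
        var, low, high = node
        vars_.append(var)
        node = high  # walk along the high branches only
    return vars_

support = (3, 0, (7, 0, (9, 0, 1)))  # support over variables 3, 7, 9
print(scan_set(support))  # -> [3, 7, 9]
```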

mhn@cs.stanford.edu
