Relational Abstractions Based on Labeled Union-Find


Context
In program analysis, various abstractions are used to store known facts about the program being studied. Two popular choices are non-relational abstractions, which store numeric information on variables (like intervals ), or relational abstraction, which store relations between variables (like ). The former is fast (complexity in where is the number of variables) but imprecise, whereas the latter is very precise but cost-prohibitive (polyhedra has complexity).
In the middle of this spectrum lie weakly-relational abstractions. They only store relations between pairs of variables. For example, octagons store relations for some constant . These abstractions are faster than polyhedra, but still expensive, in large part due to having to compute a transitive closure to find all known relations, which costs .
Our goal is to find a new family of relational abstract domains that are cheaper than the weakly-relational domains. For this, a central question is can we compute the expensive transitive closure much more cheaply? The answer is yes, if we assume that the relation obtained on each path between two variables is always the same. This allows eliminating the vast majority of relations, we only need to store a spanning tree and can still recover any arbitrary relation in amortized almost-constant time, using a variation of the efficient union-find data structure.
Labeled union-find
We can use an extension of the classical union-find data structure to represent this spanning tree. The extension proceeds by adding labels (representing relations) to the parent edges, and is thus called labeled union-find. In order to properly adapt the union-find algorithms, these labels need an associative composition operation , an inverse and a neutral element. That is to say, they must have a group structure. This requirement also derives fairly naturally from our previous assumption (same relation on each path).
The labeled union-find relational abstraction
When using labeled union-find to represent abstract relations between variables, the soundness of operations places strong requirements on the relations that can be used. We show that these relations must correspond to injective functions between equivalence classes. The following examples relations are suitable:
- Constant offset: relations of the form for some constant
- Two Value per Equality (TVPE): , with , or
- Modular TVPE: between 64-bit vectors with odd.
- Xor and rotation: between fixed-size bitvector, which can encode many shifts
However, we cannot use relations like bounded difference , as they are not injective. Doing so inevitably leads to precision loss.
Combining with other abstractions
Labeled union-find groups variables into related class, which each point to the same representative. This can often be used to simplify other abstractions, especially whenever information on any element can be deduced from information on the representative.
For example, when combining the constant offset labeled union-find with the interval abstraction, we only need to store intervals on representative elements, not on all variables, since these can be recovered. If we know that and then we do not need to store . This reduces storage cost and propagation time, since all related variables are updated at once any time new information is learned.
Labeled union-find can also help relational abstraction similarly, shrinking their size and thus their computation cost. Furthermore, it can be modified to detect any entailed equalities and notify other abstractions of these facts.
Examples
We have implemented labeled union-find domains both in Codex, a sound static analyzer based on abstract interpretation, and in Colibri2, a constraint solver.
Codex already performs constraint propagation using relations between the values computed by the program. However, the new domain can find new relations; in particular, an important source of improvement comes from relating simultaneously incremented loop counters. For instance, consider the following C snippet:
int i = 0, j = 4;
while(i < 10) { i += 1; j += 3; }
Without labeled union-find, Codex learns that at the end of the loop, , and . However, with labeled union find and the TVPE relation, is inferred. Thus, at the end of the loop, Codex knows .
In Colibri2, we were able to increase propagations when using constant difference alongside the interval domain. For example, instance, if for some , and if we know that , then we can learn that is unsatifsiable. This sort of reasoning seems easy, but in practice, a decision procedure for non linear-arithmetic is difficult to implement and costly. Using labeled union-find (to relate and ) enables solving some easy cases such as these.
Going further
- Read the paper.
- You can also read the WIP workshop paper. It is only 4 pages long and less technical.
- To be presented at the Programming Language Design and Implementation (PLDI) 2025 conference.
- Download the software artifact from Zenodo to explore the code and see the performance results.