Foundations of Dataflow Analysis
22 Sep 2018This aims to be Part-2 in a multi-part series on Program Analysis. \(\newcommand{\bigsqcap}{\mathop ⨅}\)
Developing intuitions
Continuing our odd/even analysis from the previous post, we are interested in knowing whether the value of a particular variable at a program point is odd or even. The control-flow graph for the program looks like the following:
At the Program Start
point, we have no information about the initial values of variables p
and q
. When the execution of the program flows from one node (or basic block in a general sense) to another, there is a transfer of control and data - in terms of information about the environment, variables, etc. Each node can be thought of as collecting information from its predecessors, operating upon them and then passing the information ahead to its successors.
Dataflow analysis can be thought of as a process of collecting information about the way program objects are defined and used in a program.1 To develop our odd/even analysis, we will try and model it as a dataflow analysis problem (since we can see conceptual overlap in the ideas).
Dataflow analysis can be characterized as a forward and backward analysis routine. Depending on which program point has initial information, we can define the analysis to follow the control flow or choose to flow the data in a reverse order to draw certain other conclusions. For a basic block b, we can formally define the following terms:
- \(in_b\) : Represents the knowledge at the entrance of block \(b\). Knowledge here implies any information which the analysis might be interested in.
- \(out_b\) : Represents the knowledge at the exit of block \(b\).
- \(join_b\) : Represents the result of a specified merge strategy applied on basic blocks \(b_i\) which flow into b. Merge strategies may vary depending on the kind of analysis and the domain.
- \(transfer_b\) : Can be thought of as a function applied on \(in_b\) to generate \(out_b\). This is termed as the transfer function. Often times, the transfer function models the evaluation strategy of the program in the abstract analysis domain used.
Engineering new dataflow analyses
Intuitively, a generic dataflow analysis algorithm can be thought to work as:
- Define initial program state information
- for \(i \in 1, 2, 3..n\) (where \(n\) is the number of program points)
- Define \(in_b\) for block \(b_i\) by applying a merge strategy on \(join_b\).
- Apply \(trans_b\) and generate \(out_b\).
- Flow \(out_b\) into all successors of block \(b_i\).
- Check if we have reached a maxima in terms of collected information. If yes goto 3, else push all successors of \(b_i\) into a queue and repeat 2.
- Terminate
Let us discuss a few properties of the simplified algorithm above:
-
Correctness: When analysing dataflow algorithms, how do we verify their correctness? It a challenging problem to make sure that the information being gathered does indeed capture the exact semantics of the program. We will look at correctness in a later post.
-
Termination:
"Check if we reached a maxima in terms of collected information."
. To be honest, that is a pretty vague statement I made. The termination of the algorithm depends on this condition. We need to define certain properties for our \(join\) and \(trans\) functions so as to guarantee eventual termination.- Continuous increase in knowledge: Intuitively it is obvious that as the algorithm proceeds, we continue to collect more and more information about various objects across program points. We therefore define both of the above functions such that they never throw away any information collected during the course of the algorithm. This ensures that they constantly increase (or remain the same).
- Upper bound on information: We also apply an upper bound on the collected information.
The application of these two conditions results in two possible outcomes:
- Data flowing into blocks might cease to change. This implies that during the iterative procedure of the analysis algorithm, there is no change in \(in_b\) for any block \(b\) at which point we can halt (since the \(transfer_b\) function depends only on the input, corresponding \(out_b\) generated will be equivalent across iterations).
- In the case where there is continuous increase in knowledge, the upper bound on collected information guarantees us that eventually hitting the upper bound would drive us towards outcome 1 and the algorithm will halt.
Let us try and develop some mathematics and then come back prove claims we have made above. We will now cover Lattices, a mathematical structure which will help us better represent static analyses and work with them. Elements of a lattice are representative of abstract properties of variables, expressions and other programming constructs for all possible executions of a procedure.
Partial Order and bounds
Let’s first start with the concept of a partial order. A partial order is a set S equipped with a binary relation \(\sqsubseteq\) where the following conditions are satisfied:
- Reflexivity: \(\forall x \in S : x \sqsubseteq x\)
- Transitivity : \(\forall x, y, z \in S : x \sqsubseteq y \wedge y \sqsubseteq z \Rightarrow x \sqsubseteq z\)
- Anti-symmetry : \(\forall x, y \in S : x \sqsubseteq y \wedge y \sqsubseteq x \Rightarrow x = y\)
Let’s define the concept of a lower and an upper bound for such as set S.
Let \(X \subseteq Y\). We say that \(y \in S\) is an upper bound for X (denoted as \(X \sqsubseteq y\), if we have : \(\forall x \in X : x \sqsubseteq y\) . A least upper bound, written as \(\bigsqcup X\) is defined as:
\[X \sqsubseteq \bigsqcup X \wedge \forall y \in S : X \sqsubseteq y \Rightarrow \bigsqcup X \sqsubseteq y\]In simple terms: Consider a subset X of S. Now, consider the set U of all upper bounds of X. y is said to be the least upper bound of X if it is “lesser” than all other elements of U.
\(\bigsqcup\) is also called the join operator.
For two subsets \(l_1, l_2\), we can write \(l_1 \sqcup l_2\) for denoting \(\bigsqcup \{ l_1, l_2 \}\).
Similarly, we will now define a lower bound for X. We wil say that \(y \in S\) is a lower bound for X, written \(y \sqsubseteq X\) if \(\forall x \in X : y \sqsubseteq x\) . A greatest lower bound, written as \(\bigsqcap X\) is defined as:
\[\bigsqcap X \sqsubseteq X \wedge \forall y \in S : y \sqsubseteq X \Rightarrow y \sqsubseteq \bigsqcap X\]In simple terms: Consider a subset X of S. Now, consider the set L of all lower bounds of X. y is said to be the greatest lower bound of X if it is “greater” than all other elements of L.
\(\bigsqcap\) is also called the meet operator.
For two subsets \(l_1, l_2\), we can write \(l_1 \bigsqcap l_2\) for denoting \(\bigsqcup \{ l_1, l_2 \}\).2
Lattices
We are interested in complete lattices and hence will be defining those:
Complete lattice : A complete lattice \(L = (L, \sqsubseteq) = (L, \sqsubseteq, \bigsqcup, \bigsqcap, \top, \bot)\) is a partially ordered set for which \(\bigsqcup X\) and \(\bigsqcap X\) exist for all X where \(X \subseteq S\).
(a) Partial Order and Lattice | (b) Partial Order but not a lattice |
The diagrams seen above are called Hasse Diagrams and are commonly used to describe lattices. An upward curve/line from a node y to node x of this diagram denotes that x covers y or rather for a set with a partial ordering, it denotes \(y \sqsubseteq x\). Assume that the relation between the nodes of figures (a) and (b) is some partial ordering. As can be seen, figure (b) has a couple of “culprit” nodes (d and e) which do not satisfy the condition that any two subsets of the entire set S should have a (unique) least upper bound. On the other hand, figure (a) is a complete lattice.
Furthermore, we will be defining two crucial and unique elements of this lattice structure.
- \(\bot = \bigsqcup \phi = \bigsqcap L\), where \(\bot\) is called the bottom element of the lattice.
- \(\top = \bigsqcap \phi = \bigsqcup L\), where \(\top\) is called the top element of the lattice.
And finally, the height of the lattice is the length of the shortest path from \(\bot\) to \(\top\).
Let us now look at different types of lattices, namely Powerset, Product and Map lattice structures.
-
Powerset lattice: Every finite set \(X\) defines a lattice \((2^{A}, \subseteq)\), where \(\bot = \phi, \top = A, x \sqcup y = x \cup y\) and \(x \sqcap y = x \cap y\). We call this the powerset lattice for A. For a set of 3 elements, the Hasse diagram of its powerset lattice looks like the following:
As can be guessed, the height of a powerset lattice for a set A is \(\vert A \vert\).
-
Product lattice: If \(L_1, L_2, L_3...\) are lattices, then so is the general product.
\[L_1 \times L_2 \times L_3 ... \times L_n = \{(x_1, x_2, .... x_n \vert x_i \in L_i\}\]where the lattice order is considered point-wise across the various lattices. Interestingly the height of a product lattice is the sum of the individual lattices.
-
Map lattice: If \(A\) is a set and \(L\) is a lattice, then we obtain a map lattice consisting of a set of functions ordered pointwise.
\[A \rightarrow L = \{[a_1 \mapsto x_1, a_2 \mapsto x_2 ..] \vert A = \{a_1, a_2, . . \} \wedge x_1, x_2, . . \in L \}\]where the \(\mapsto\) notation implies that a function \(f\) maps the LHS to the corresponding RHS.
The map lattice gives us a set of functions from \(A \rightarrow L\). For such functions \(f, g \in A \rightarrow L\),
\[f \sqsubseteq g \Leftrightarrow \forall a_i \in A : f(a_i) \sqsubseteq g(a_i)\]
Other types of lattices include flat lattice, lift lattice and sum lattice.
Motivating the need for monotonicity
In our particular example we are interested in whether a variable is odd/even at a particular program point. We can abstract program values into elements of the set -
\(Vals = \{ \bot, o, e, \top \}\).
The lattice is pretty simple to visualize:
Let us draw some analogies with our discussion on dataflow analysis earlier:
- Let the set of variables of the program be \(Vars = \{p, q\}\). Consider the lattice \(L = Vals\) to be the analysis domain of this particular problem.
-
Functions \(f_i\) where \(i \in 1, 2, 3..n\) can be thought of as transfer functions across a particular program point such that, \(f_i: L \rightarrow L\). \(f_i\) maps one instance of the analysis domain to another. For our example, we can think of this idea in the following context. Consider program point 2 of our odd/even example:
2 - q = evenNumberGenerator ()
Across this point, let the transfer function be \(f_2\). Let the input (\(in_2\)) for this point be some abstract value \(a_2\). We can define \(f_2\) as:
\[f_2 = a_2 [ q \mapsto e ]\]The abstract value for the variable
q
is updated in the environment and a new instance of the analysis domain is produced as \(out_2\). -
Similarly consider the \(join\) function defined earlier. For a particular program point \(i\) with multiple predecessor points (\(i_1, i_2, .. i_n\)), the \(join\) function can be defined s:
\[join_i: L_1 \times L_2 \times ... L_n \rightarrow L\] -
Let us discuss analysis termination again. When the analysis is represented as a lattice structure, we can see that the information bound we were talking about is actually the top (\(\top\)) element. According to definition of a complete lattice we are guaranteed to have a top element and hence in the case of dataflow analysis, an information upper-bound.
-
Now that we have mathematically defined the \(trans\) and \(join\) functions, we need to impose some constraints on them to ensure the
continuous increase of information
property of dataflow analysis. Naively, consider an example analysis domain as follows:Two things to note here. The y-axis describes an increase in information. Consider \(c\) which is formed by the \(join\) of \(a\) and \(b\). According to our definition, the \(join\) function must not lose any knowledge. Hence,
\[a \sqsubseteq join (a, b)\] \[b \sqsubseteq join (a, b)\]This is analogous to our representation of information in terms of the lattice structure. As we have seen in our Hasse diagram, an upwards edge denotes a partial ordering between the end nodes.
Secondly, we cannot allow the transfer function to reduce known information. Since \(a, b, d, e\) represent the same amount of information, the \(trans(c)\) function violates our condition of continuous increase. “More precise input will not result in a less precise output”. Mathematically, this can be defined as monotonicity.
We will talk in detail about monotonic functions and general methods of solving dataflow analysis problems in the next post!
Footnotes:
- We will take a look at Dataflow analysis in more detail in the next post but it necessary to develop some intuitions and mathematical tooling to better understand it.
-
It is important to see here that the existence of a G.L.B and a L.U.B is not necessary for a subset X, but if they exist they have to be unique (this property arises due to the anti-symmetric nature of the ordering relation).
-
For generating Hasse diagram SVGs I am using the tikz library and https://tex.s2cms.ru.