Foundations of Dataflow Analysis

This aims to be Part-2 in a multi-part series on Program Analysis. \(\newcommand{\bigsqcap}{\mathop ⨅}\)

Developing intuitions

Continuing our odd/even analysis from the previous post, we are interested in knowing whether the value of a particular variable at a program point is odd or even. The control-flow graph for the program looks like the following:

At the Program Start point, we have no information about the initial values of variables p and q. When the execution of the program flows from one node (or basic block in a general sense) to another, there is a transfer of control and data - in terms of information about the environment, variables, etc. Each node can be thought of as collecting information from its predecessors, operating upon them and then passing the information ahead to its successors.

Dataflow analysis can be thought of as a process of collecting information about the way program objects are defined and used in a program.1 To develop our odd/even analysis, we will try and model it as a dataflow analysis problem (since we can see conceptual overlap in the ideas).

Dataflow analysis can be characterized as a forward and backward analysis routine. Depending on which program point has initial information, we can define the analysis to follow the control flow or choose to flow the data in a reverse order to draw certain other conclusions. For a basic block b, we can formally define the following terms:

Engineering new dataflow analyses

Intuitively, a generic dataflow analysis algorithm can be thought to work as:

  1. Define initial program state information
  2. for \(i \in 1, 2, 3..n\) (where \(n\) is the number of program points)
    • Define \(in_b\) for block \(b_i\) by applying a merge strategy on \(join_b\).
    • Apply \(trans_b\) and generate \(out_b\).
    • Flow \(out_b\) into all successors of block \(b_i\).
    • Check if we have reached a maxima in terms of collected information. If yes goto 3, else push all successors of \(b_i\) into a queue and repeat 2.
  3. Terminate

Let us discuss a few properties of the simplified algorithm above:

Let us try and develop some mathematics and then come back prove claims we have made above. We will now cover Lattices, a mathematical structure which will help us better represent static analyses and work with them. Elements of a lattice are representative of abstract properties of variables, expressions and other programming constructs for all possible executions of a procedure.

Partial Order and bounds

Let’s first start with the concept of a partial order. A partial order is a set S equipped with a binary relation \(\sqsubseteq\) where the following conditions are satisfied:

Let’s define the concept of a lower and an upper bound for such as set S.

Let \(X \subseteq Y\). We say that \(y \in S\) is an upper bound for X (denoted as \(X \sqsubseteq y\), if we have : \(\forall x \in X : x \sqsubseteq y\) . A least upper bound, written as \(\bigsqcup X\) is defined as:

\[X \sqsubseteq \bigsqcup X \wedge \forall y \in S : X \sqsubseteq y \Rightarrow \bigsqcup X \sqsubseteq y\]

In simple terms: Consider a subset X of S. Now, consider the set U of all upper bounds of X. y is said to be the least upper bound of X if it is “lesser” than all other elements of U.

\(\bigsqcup\) is also called the join operator.

For two subsets \(l_1, l_2\), we can write \(l_1 \sqcup l_2\) for denoting \(\bigsqcup \{ l_1, l_2 \}\).

Similarly, we will now define a lower bound for X. We wil say that \(y \in S\) is a lower bound for X, written \(y \sqsubseteq X\) if \(\forall x \in X : y \sqsubseteq x\) . A greatest lower bound, written as \(\bigsqcap X\) is defined as:

\[\bigsqcap X \sqsubseteq X \wedge \forall y \in S : y \sqsubseteq X \Rightarrow y \sqsubseteq \bigsqcap X\]

In simple terms: Consider a subset X of S. Now, consider the set L of all lower bounds of X. y is said to be the greatest lower bound of X if it is “greater” than all other elements of L.

\(\bigsqcap\) is also called the meet operator.

For two subsets \(l_1, l_2\), we can write \(l_1 \bigsqcap l_2\) for denoting \(\bigsqcup \{ l_1, l_2 \}\).2

Lattices

We are interested in complete lattices and hence will be defining those:

Complete lattice : A complete lattice \(L = (L, \sqsubseteq) = (L, \sqsubseteq, \bigsqcup, \bigsqcap, \top, \bot)\) is a partially ordered set for which \(\bigsqcup X\) and \(\bigsqcap X\) exist for all X where \(X \subseteq S\).

(a) Partial Order and Lattice (b) Partial Order but not a lattice

The diagrams seen above are called Hasse Diagrams and are commonly used to describe lattices. An upward curve/line from a node y to node x of this diagram denotes that x covers y or rather for a set with a partial ordering, it denotes \(y \sqsubseteq x\). Assume that the relation between the nodes of figures (a) and (b) is some partial ordering. As can be seen, figure (b) has a couple of “culprit” nodes (d and e) which do not satisfy the condition that any two subsets of the entire set S should have a (unique) least upper bound. On the other hand, figure (a) is a complete lattice.

Furthermore, we will be defining two crucial and unique elements of this lattice structure.

And finally, the height of the lattice is the length of the shortest path from \(\bot\) to \(\top\).

Let us now look at different types of lattices, namely Powerset, Product and Map lattice structures.

Other types of lattices include flat lattice, lift lattice and sum lattice.

Motivating the need for monotonicity

In our particular example we are interested in whether a variable is odd/even at a particular program point. We can abstract program values into elements of the set -

\(Vals = \{ \bot, o, e, \top \}\).

The lattice is pretty simple to visualize:

Let us draw some analogies with our discussion on dataflow analysis earlier:

We will talk in detail about monotonic functions and general methods of solving dataflow analysis problems in the next post!

Footnotes:

  1. We will take a look at Dataflow analysis in more detail in the next post but it necessary to develop some intuitions and mathematical tooling to better understand it.

  1. It is important to see here that the existence of a G.L.B and a L.U.B is not necessary for a subset X, but if they exist they have to be unique (this property arises due to the anti-symmetric nature of the ordering relation).

  2. For generating Hasse diagram SVGs I am using the tikz library and https://tex.s2cms.ru.