Of Program Analysis and more

Introduction

Program Analysis is the branch of computer science which deals with the study of reasoning about the behavior of computer programs. We usually care about specific properties of a program such as safety, control flow and side-effects when performing such analyses. Intuitions and results developed through these techniques are critical for verifying runtime behaviors of programs.

Program Analysis can be broadly categorized into static and dynamic program analysis.

Dynamic Program Analysis uses information which is available at runtime of a program. This implies executing the program and analyzing its memory utilization, resource usage, network activity and other such interesting parameters. Depending on the property being looked into, analyses can also be more specific. Some examples are:

Although Dynamic Analyses provide concrete reports, they are usually quite intense to incorporate in the development workflow.

For such analyses to be deemed complete, we would need to look at all possible execution runs of the program. Mathemtically, an object which represents information derived from “all possible executions” (concrete semantics) would be infinite and and hence non-computable. Hence, non-trivial questions about such concrete semantics of a program are undecidable. It is then imperative that we guide and optimize such analyses through intelligence gathered from the knowledge of other properties of the program. It is to derive this intelligence that we complement the above with static analysis which is the focus of this blog series.

Static Program Analysis allows us to reason about computer programs without actually running them. This is a broad area which helps in the study of compiler optimizations, error-detection, memory-safety, correctness, etc. Let’s dive in!

Please note that some of the explainations are hand-wavy and are only intended to help develop basic intuitions about the subject.

Need for automated program guarantees

We can think about a lot of interesting questions regarding the behavior of computer programs. For example:

… (and so on)

Static Program Analysis allows for developing an algorithmic framework which sets up study for answering questions similar to the ones stated above. Through these analyses we want to develop program guarantees automatically and be more confident about critical software systems. This allows for development of highly performant, optimized and safe programs.

Maybe I have painted an overly optimistic view of static analysis. It’s not magic, and certainly not easy. Although we can get rid of a lot of persistent issues through manual source code parsing and careful code reviews, the important phrase to look out for here is "develop program guarantees automatically". We would like to develop programs which analyze programs and decide whether they have a given property or not, preferably without any user intervention. This is where the challenge creeps in. An interesting “elusive triangle” diagram can be used to better represent this idea:

Of the three vertices of the triangle, we can realistically develop systems which take care of only two of them. Detailing the combinations:

Rice’s Theorem and undecidability

Before we move on to Rice’s theorem, we should probably gain a basic understanding of Turing’s Halting Problem. From Wikipedia,

In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running (i.e., halt) or continue to run forever.

Consider that we do come up with a program g (a Turing Machine) that determines if programs halt. So,

      (p,x)        p halts on x
Input ------> g -----------------------> True
               \   p doesn't halt on x 
                -----------------------> False

We will now contstruct a new setup from g,

      (p,x)        p halts on x
Input ------> g -----------------------> Loop forever
               \   p doesn't halt on x 
                -----------------------> Halt

Let us call this entire setup as g*. This program g* is then given as input to g with p == g*, x == g*.

Assume that g* does halt. In that case, it will take the true branch and hence loop forever. In the other case, if we assume that g* does not halt, it will immediately halt.

This creates a contradiction. This is the heart of Turing’s Halting Problem. We cannot come up with programs which determine if other programs can halt.

Now, Rice’s Theorem informally states that interesting questions about semantic properties of a program are undecidable. There is an elegant proof for the same which uses Turing’s Halting Problem. Assume that we build an analyzer that decides if a variable in a program has a constant value in any execution.

      (p,x)        x is constant
Input ------> A -----------------------> True
               \   x is not constant
                -----------------------> False

We could then create a program which exploits this analyzer to decide the halting problem on an empty input. For example, consider a Turing Machine M:

x = const_1;
if M:
    x = const_2;

Now, x can only be constant if and only if the Turing Machine M does not halt on empty input. If we feed this program to analyzer A we would be basically generating a solution to Turing’s Halting Problem which is again a contradiction.

Approximations

Let’s look at what we mean by “Conservative Algorithms”. Consider an analysis P and a property Q.

int i = 2;
if (...) { i = 3; }

An unsound analysis will claim that “i is always 3”, which is not necessarily the case if the conditional is false.

int i = 2;
if (...) { i = 3; }

An imprecise analysis will claim that “i is between 0 and 1000”. Although this argument is sound, it is not precise and we can do much better.

When designing analysis techniques, we will consider trade-offs between the following notions:

Consider one of the motivating questions we had put forth when discussing static program analysis:

“What is the value of the variable x at program point P?”

Rather than knowing the exact value of the variable x (which is indeed a very hard problem to solve), we are more interested in knowing all of its possible values, even if it introduces a certain amount of imprecision. Do note that this ensures that the analyzer is sound.

An algorithm which is sound, but imprecise is said to be “Conservative”. Such analyses also ensure that all errors lie to the same side. That is why, they are also called “Safe” analyses. They are allowed to generate spurious results (false positives) but by design we ensure that they would be a possible superset of all the correct ones.

If suppose the possible values of variable x at program point P are \(\{1, 2, 3, 4, 5\}\). A conservative analysis might generate the result \(\{0, 1, 2, 3, 4, 5, 19\}\). This highlights imprecision (\(\{0, 19\} \subseteq R\)) but is sound \(\{1, 2, 3, 4, 5\} \subseteq R\). When our goal is to circumvent Rice’s theorem, we are aiming for approximations which can drive possible solutions forward. Static analyses are hence mostly conservative in nature.1

Defining a static analysis

Let us now design a static analysis that can find if the intergral values of variables in a program are odd or even at a program point. Consider the following program,

1. p = oddNumberGenerator ()    | (p, o)
2. q = evenNumberGenerator ()   | (p, o)  (q, e)
                                | (p, o)  (q, e)
3. if (p > q)                   | (p, o)  (q, e)
4.     p = p*2 + q              | (p, e)  (q, e)
                                | (p, oe) (q, e)
5. print (p)                    | (p, oe) (q, e)
                                | (p, oe) (q, e)
6. if (p <= q)                  | (p, oe) (q, e)
7.     p = p + 1                | (p, oe) (q, e)
                                | (p, e)  (q, e)
8. print (q)                    | (p, e)  (q, e)
                                | (p, e)  (q, e)
9. q = q + p                    | (p, e)  (q, e)

At each point, we have the pair (var, val) where var is p or q and val is the odd/even value of the variable. The analysis detailed here is precise since it is performed manually (especially after point 7, where we can see that the variable p has to take an even value irrespective of its relation with q).

Now, to formally analyze the above program, we would need some mathematical tooling to help us about. Specifically, structures which will help us capture information about “what we know about a variable” in a program. I will be detailing lattices in the next blog post!

This aims to be the first in a multi-part series about program analysis. I wish to go about developing the subject as follows (list is subject to timely edit):

I am also learning the above topics as I go along writing these posts, so please forgive me if there any mistakes. Please send all love, hate and suggestions via email or @ me on Twitter :) Until next time!

References:

Footnotes:

  1. There are multiple exceptions to this rule though. Static program checkers for example, generate program points to highlight existence of bugs. There is a very high chance that it generates multiple false-positives but precision is preferred over soundness in this case.