Of Program Analysis and more

Introduction

Program Analysis is the branch of computer science which deals with the study of reasoning about the behavior of computer programs. We usually care about specific properties of a program such as safety, control flow and side-effects when performing such analyses. Intuitions and results developed through these techniques are critical for verifying runtime behaviors of programs.

Program Analysis can be broadly categorized into static and dynamic program analysis.

Dynamic Program Analysis uses information which is available at runtime of a program. This implies executing the program and analyzing its memory utilization, resource usage, network activity and other such interesting parameters. Depending on the property being looked into, analyses can also be more specific. Some examples are:

  • Memory Sanitization: Tools such as Valgrind and DMalloc monitor memory allocation for a particular program tracking potential leaks, invalid reads/writes and overflows.

  • Binary instrumentation: Usually consists of a Controller module which manages tracing of the runtime of a program. Such a controller enables in-memory patching, taint-tracking and instruction counting among other things to analyze specific execution runs of a program.

  • Fuzzers: Fuzzing involves (smartly) randomizing inputs to a program to either improve coverage analysis or enable bug-hunting. This is an exciting area of research for black-box and white-box testing of binaries.

Although Dynamic Analyses provide concrete reports, they are usually quite intense to incorporate in the development workflow.

For such analyses to be deemed complete, we would need to look at all possible execution runs of the program. Mathemtically, an object which represents information derived from “all possible executions” (concrete semantics) would be infinite and and hence non-computable. Hence, non-trivial questions about such concrete semantics of a program are undecidable. It is then imperative that we guide and optimize such analyses through intelligence gathered from the knowledge of other properties of the program. It is to derive this intelligence that we complement the above with static analysis which is the focus of this blog series.

Static Program Analysis allows us to reason about computer programs without actually running them. This is a broad area which helps in the study of compiler optimizations, error-detection, memory-safety, correctness, etc. Let’s dive in!

Please note that some of the explainations are hand-wavy and are only intended to help develop basic intuitions about the subject.

Need for automated program guarantees

We can think about a lot of interesting questions regarding the behavior of computer programs. For example:

  • Is a variable always constant for all possible execution runs of a program ?
  • Does the program contain dead code which is never called ?
  • Is there a possibility of infinite loops ?
  • Are there instances of memory vulnerabilities such as double-free, dangling pointers and use-after-free ?
  • What is the values of the variable x at program point P ?

… (and so on)

Static Program Analysis allows for developing an algorithmic framework which sets up study for answering questions similar to the ones stated above. Through these analyses we want to develop program guarantees automatically and be more confident about critical software systems. This allows for development of highly performant, optimized and safe programs.

Maybe I have painted an overly optimistic view of static analysis. It’s not magic, and certainly not easy. Although we can get rid of a lot of persistent issues through manual source code parsing and careful code reviews, the important phrase to look out for here is "develop program guarantees automatically". We would like to develop programs which analyze programs and decide whether they have a given property or not, preferably without any user intervention. This is where the challenge creeps in. An interesting “elusive triangle” diagram can be used to better represent this idea:

Of the three vertices of the triangle, we can realistically develop systems which take care of only two of them. Detailing the combinations:

  • Large Programs, Deep Properties: Analysing large systems and proving deep properties would involve close scrutiny of the source code and intensive manual analysis. Automated static analyses would yield highly approximate results which might not be very useful in this scenario.

  • Deep Properties, Automation: Here, we are typically looking at smaller, much simpler (highly decoupled) programs. But since the complexity is an order of magnitude lesser, proving such deep properties is not as impactful. For example, we could prove that the canonical bubble sort program always sorts the array in O(n^2), but what does that achieve?

  • Large Programs, Automation: This area is the sweet spot for traditional analysis algorithms under the umbrella of static analysis. Even the simplest of passes such as constant propagation and dead-code elimination can prove to be surprisingly useful when operating on large-scale systems.

Rice’s Theorem and undecidability

Before we move on to Rice’s theorem, we should probably gain a basic understanding of Turing’s Halting Problem. From Wikipedia,

In computability theory, the halting problem is the problem of determining, from a description of an arbitrary computer program and an input, whether the program will finish running (i.e., halt) or continue to run forever.

Consider that we do come up with a program g (a Turing Machine) that determines if programs halt. So,

      (p,x)        p halts on x
Input ------> g -----------------------> True
               \   p doesn't halt on x 
                -----------------------> False

We will now contstruct a new setup from g,

      (p,x)        p halts on x
Input ------> g -----------------------> Loop forever
               \   p doesn't halt on x 
                -----------------------> Halt

Let us call this entire setup as g*. This program g* is then given as input to g with p == g*, x == g*.

Assume that g* does halt. In that case, it will take the true branch and hence loop forever. In the other case, if we assume that g* does not halt, it will immediately halt.

This creates a contradiction. This is the heart of Turing’s Halting Problem. We cannot come up with programs which determine if other programs can halt.

Now, Rice’s Theorem informally states that interesting questions about semantic properties of a program are undecidable. There is an elegant proof for the same which uses Turing’s Halting Problem. Assume that we build an analyzer that decides if a variable in a program has a constant value in any execution.

      (p,x)        x is constant
Input ------> A -----------------------> True
               \   x is not constant
                -----------------------> False

We could then create a program which exploits this analyzer to decide the halting problem on an empty input. For example, consider a Turing Machine M:

x = const_1;
if M:
    x = const_2;

Now, x can only be constant if and only if the Turing Machine M does not halt on empty input. If we feed this program to analyzer A we would be basically generating a solution to Turing’s Halting Problem which is again a contradiction.

Approximations

Let’s look at what we mean by “Conservative Algorithms”. Consider an analysis P and a property Q.

  • Soundness : We say that a program analysis P is sound if the property Q that it approximates always triggers P. An unsound analysis will generate false negatives.
int i = 2;
if (...) { i = 3; }

An unsound analysis will claim that “i is always 3”, which is not necessarily the case if the conditional is false.

  • Precision : We say that a program analysis P is precise for a claimed property Q if it triggers only when Q is true.
int i = 2;
if (...) { i = 3; }

An imprecise analysis will claim that “i is between 0 and 1000”. Although this argument is sound, it is not precise and we can do much better.

When designing analysis techniques, we will consider trade-offs between the following notions:

  • Soundness : Does the analysis catch all instances of the claimed property Q ?
  • Precision : If the analysis claims property Q, is it satisfied ?
  • Execution Time of the analysis under consideration

Consider one of the motivating questions we had put forth when discussing static program analysis:

“What is the value of the variable x at program point P?”

Rather than knowing the exact value of the variable x (which is indeed a very hard problem to solve), we are more interested in knowing all of its possible values, even if it introduces a certain amount of imprecision. Do note that this ensures that the analyzer is sound.

An algorithm which is sound, but imprecise is said to be “Conservative”. Such analyses also ensure that all errors lie to the same side. That is why, they are also called “Safe” analyses. They are allowed to generate spurious results (false positives) but by design we ensure that they would be a possible superset of all the correct ones.

If suppose the possible values of variable x at program point P are . A conservative analysis might generate the result . This highlights imprecision () but is sound . When our goal is to circumvent Rice’s theorem, we are aiming for approximations which can drive possible solutions forward. Static analyses are hence mostly conservative in nature.1

Defining a static analysis

Let us now design a static analysis that can find if the intergral values of variables in a program are odd or even at a program point. Consider the following program,

1. p = oddNumberGenerator ()    | (p, o)
2. q = evenNumberGenerator ()   | (p, o)  (q, e)
                                | (p, o)  (q, e)
3. if (p > q)                   | (p, o)  (q, e)
4.     p = p*2 + q              | (p, e)  (q, e)
                                | (p, oe) (q, e)
5. print (p)                    | (p, oe) (q, e)
                                | (p, oe) (q, e)
6. if (p <= q)                  | (p, oe) (q, e)
7.     p = p + 1                | (p, oe) (q, e)
                                | (p, e)  (q, e)
8. print (q)                    | (p, e)  (q, e)
                                | (p, e)  (q, e)
9. q = q + p                    | (p, e)  (q, e)

At each point, we have the pair (var, val) where var is p or q and val is the odd/even value of the variable. The analysis detailed here is precise since it is performed manually (especially after point 7, where we can see that the variable p has to take an even value irrespective of its relation with q).

Now, to formally analyze the above program, we would need some mathematical tooling to help us about. Specifically, structures which will help us capture information about “what we know about a variable” in a program. I will be detailing lattices in the next blog post!

This aims to be the first in a multi-part series about program analysis. I wish to go about developing the subject as follows (list is subject to timely edit):

I am also learning the above topics as I go along writing these posts, so please forgive me if there any mistakes. Please send all love, hate and suggestions via email or @ me on Twitter :) Until next time!

References:

Footnotes:

  1. There are multiple exceptions to this rule though. Static program checkers for example, generate program points to highlight existence of bugs. There is a very high chance that it generates multiple false-positives but precision is preferred over soundness in this case.