Solving a Project Euler problem using SymEx and Clojure

This is a repost from the LeadLab blog. They work on applying technology to healthcare. Do check out their cool research!

Hello! We are going to try and solve a Project Euler problem. The problem statement is as follows:

A Pythagorean triplet is a set of three natural numbers, a < b < c, for which,
a^2 + b^2 = c^2
For example, 32 + 42 = 9 + 16 = 25 = 52.
There exists exactly one Pythagorean triplet for which a + b + c = 1000. Find the product abc.

Pretty easy to just iterate over all the possible values over 1000. But we will try and do something new! \o/

Okay, now let’s say that a, b and c are symbolic variables. We will use the clojure.core.logic library for this.

(ns clj.p9
  ;; conflicting operator
  (:refer-clojure :exclude [==])
  (:require [clojure.core.logic :refer :all]
            [clojure.core.fd    :as fd]))

(defn solution
  ;; Let us say we take as input the sum of the three pythagorean variables and input the result.
  [sum]
  ;; We will have the result in a vec as [a b c].
  ;; Since we want the product a*b*c, let's just multiply them together.
  ;; We use the function first since the two results provided are:
  ;; [a b c] and [b a c] with c being the hypotenuse. Hence we choose one of them.
  (reduce * (first
  ;; Here we go! `run*` implies that we want the computer to compute
  ;; all the possible values for the expression `q`.
  (run* [q]
  ;; Declare 3 fresh symbolic variables for a, b and c.
  (fresh [a b c]
  ;; The solution should be a vec of values of the variables a, b and c.
  (== q [a b c])
  ;; Now we know that all that all the number have to be distinct.
  ;; We hence use the fd/distinct to add that constraint.
  (fd/distinct [a b c])
  ;; Another contraint we can derive is that since the sum of two sides is always &gt; than the third,
  ;; each side has to be lesser than half of the `sum` value.
  ;; This can reduce our search space by quite a lot. fd/in provides us an option to specify the range.
  ;; (fd/interval [start] [end]) is the actual numeric range.
  (fd/in a b c (fd/interval 1 (/ sum 2)))
  ;; We now add the given constraints on the variables.
  (fd/eq
    ;; sum = a + b + c
    (= sum (+ a b c))
    ;; c^2 = a^2 + b^2 (since they are pythagorean triplets)
    (= (* c c) (+ (* a a) (* b b)))))))))

EZ PZ. Let’s run this!

(time (soln 1000))
"Elapsed time: 2345.989466 msecs"
31875000

There we have it! Solved in 2.3 seconds. Although this was fairly simple, but we learnt a lot on the way. core.logic is not as highly optimized as z3 but is fairly fast. It is made for a relational algebra which is a superset of symbolic analysis. z3 has a nice Python API to interact with it!

Issues involved

Current research trends include the inclusion of analysis of source and raw binaries as well. Program testing can be approached in an entirely new way if the inputs to methods(maybe?) are considered to be symbolic and we then fuzz test the over the entire search space. It is although very important to realize that there are a lot of problems which can come up during practical implementations:

  1. Path explosion: If there are a lot of branching statements in the program, there will be generation of a new path expression at each branch hence creating a new instance of the symbolic runtime increasing the number of paths. This is a very serious problem and requires smart inputs from the user to prune certain branches off.

  2. Environment interactions and system calls: A symbolic engine will find it very tough to interact with the underlying OS or the runtime environment. This can hinder a lot of core systems testing. The program may take decision based on the return value of a particular read statement or write to a particular file and then taking decisions. There is still ongoing research to find implement a virtual model of the underlying stack for the engine to interact with.

A thought experiment: Symbolic Execution in Healthcare?

Medical systems need to completely fool proof and verified before they are sent to production. Symbolic execution and analysis can be absolutely critical in such an analysis. Like normal programs, even though certain inputs are completely pointless(for example providing a negative number to a square root calculator) it is important that we implement correct error handling.

Although the above is practical, it would be more interesting to apply this logic to the human body directly which is… let’s just say it’s a though experiment. Some of my thoughts related to the ideas below stem from the discussion here. Below are some assumptions I make:

  1. Since the human body is made up of a finite number of atoms, at any point in time there is a set of values which describe the entire body state(Assuming we manage to deconstruct the brain).

  2. Although huge, we have a model of the human body which can be provided to a computer to generate an approximate system which will emulate the body internals and reactions to stimuli. Although far-fetched, we are slowly inching toward such a goal.

  3. The computer we imagine is fast enough to perform calculations on such a scale.

Imagine medicine in such a scenario. Assume a medicinal drug to be a symbolic variable and the program to be the human body itself. The medicine can be thought to be an input to the body and hence the system should be able to track the breakdown of such a medicine in the body. The IF conditions in this case would represent exsiting patient conditions and his body state which would lead to different reactions in the body. Once we have all the leaf-nodes of a graph of such a system, the resulting information can be highly infomative in providing crucial healthcare and desigining of drugs, since we will have data regarding each breakdown of the chemicals. Drug design can be highly benefit from such a scenario!

The possibilities are endless. For now, let me get back to ValidIS.

References!

[1] Symbolic Execution and Program Testing, JC King.
[2] Lecture notes on Symbolic Execution, University of Maryland.