In this post, I’ll talk about how I wrote a small Virtual Machine in Prolog which can both interpret concrete, assembly-like programs and run basic symbolic execution, which is useful in data flow analyses of programs. The full code is available in this repository.
This post has not been written or edited by AI.
One of my favourite exercises when learning a new language is to build something which exercises its non-trivial capabilities, like pattern matching and the flexibility of its data structures, or which exposes how briefly ideas can be expressed in the language. Building a simple virtual machine forces one to reckon with ideas like expression trees, recursive traversals, term rewriting, etc.
I’ve also been reading recently about symbolic execution as a way to perform dataflow analysis. As an added challenge, I chose to enhance the concrete interpretation of a program with symbolic execution capabilities.
We need a dictionary implementation. SWI-Prolog has dictionaries, but since we are building everything from scratch, we will write a very naive implementation using only lists. Granted, some semantics of a dictionary can be violated for now (for example, you can start off with duplicate keys), but let’s assume the happy path.
% get2(Key, Map, Value): look up Key in the list of K-V pairs; yields 'empty' if absent.
get2(_,[],empty).
get2(K, [(-(K,VX))|_],VX) :- !.
get2(K, [_|T],R) :- get2(K,T,R).
% put2(-(K,V), Map, UpdatedMap): set K to V, replacing every existing entry for K,
% or appending a new entry if nothing was replaced.
put2_(-(K,V),[],Replaced,R) :- Replaced->R=[];R=[-(K,V)].
put2_(-(K,V),[-(K,_)|T],_,[-(K,V)|RX]) :- put2_(-(K,V),T,true,RX).
put2_(-(K,V),[H|T],Replaced,[H|RX]) :- put2_(-(K,V),T,Replaced,RX).
put2(-(K,V),Map,R) :- put2_(-(K,V),Map,false,R).
To represent entries in a dictionary, we use the K-V compound term, which is basically syntactic sugar for -(K,V). These entries live inside a list. Both get2 and put2 behave in predictable ways, except when the dictionary has duplicate keys. In that case:
- get2(K,Map,Value) returns the value of the first matching key.
- put2(-(K,V),InputMap,OutputMap) modifies all matching keys with the value V.
In our current implementation, we will not worry about duplicate entries yet.
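For instance, a quick session with these predicates might look like this (output shown roughly as SWI-Prolog would print it):
?- put2(-(x,const(1)),[],M0), put2(-(y,const(2)),M0,M1), get2(x,M1,V).
M0 = [x-const(1)],
M1 = [x-const(1), y-const(2)],
V = const(1).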
We will also need push/pop operations on stacks. This is very simple. Note that the top of the stack is always the leftmost element.
push_(V,Stack,UpdatedStack) :- UpdatedStack=[V|Stack].
pop_([],empty,[]).
pop_([H|Rest],H,Rest).
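For example (note how the most recently pushed value comes back first):
?- push_(const(7),[],S1), push_(const(9),S1,S2), pop_(S2,Top,Rest).
S1 = [const(7)],
S2 = [const(9), const(7)],
Top = const(9),
Rest = [const(7)].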
We will be logging quite a bit inside the rules. Thus it is important to have a structured way of logging at different levels, like DEBUG, INFO, WARNING, etc. This is what a basic logging setup looks like:
log_with_level(LogLevel,FormatString,Args) :- format(string(Message),FormatString,Args),format('[~w]: ~w~n',[LogLevel,Message]).
debug(Message) :- log_with_level('DEBUG',Message,[]).
debug(FormatString,Args) :- log_with_level('DEBUG',FormatString,Args).
info(Message) :- log_with_level('INFO',Message,[]).
info(FormatString,Args) :- log_with_level('INFO',FormatString,Args).
warning(Message) :- log_with_level('WARN',Message,[]).
warning(FormatString,Args) :- log_with_level('WARN',FormatString,Args).
error(Message) :- log_with_level('ERROR',Message,[]).
error(FormatString,Args) :- log_with_level('ERROR',FormatString,Args).
dont_log(_).
dont_log(_,_).
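Calling one of these predicates produces a tagged line on standard output, for example:
?- info('Starting VM at IP ~w',[const(0)]).
[INFO]: Starting VM at IP const(0)
true.
The dont_log variants have the same arities as the real loggers, so the log(...) term passed around in the environment can name either a logger (e.g., debug) or dont_log for each level; the rules simply invoke whichever is supplied via call/2 or call/3.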
The minimal instruction set comprises the following:
mov(reg,reg|constant)
cmp(reg,reg|constant)
label(name)
j(label)
jz(label|address)
jnz(label|address)
push(reg|constant)
pop(reg)
call(label)
ret
hlt
term(string)
nop
inc(reg)
dec(reg)
mul(reg,reg|constant)
For convenience, I chose to not have a fixed number of registers; thus, you can use any symbol as a register. In this respect, we will be treating registers more like conventional variables.
There will be one special register called the Instruction Pointer (IP). This will point to the next instruction to be executed. Jump instructions like j, jnz, and jz can modify the IP to change the flow of the program.
The other useful data structure will be the stack, which is operated on by push, pop, call, and ret (the last two use it to keep track of return addresses when entering and leaving procedures).
There will be one flag called the Zero Flag. This should probably be better named the Equals Flag, because it is set to zero if the two sides of a cmp are equal, and otherwise to -1/+1 depending upon their relative ordering.
Data can be of two types:
- Constants, like const(5)
- Symbols, like sym(x), sym(abcd), etc.
For the purposes of this simple VM, I chose not to have any memory. I may add it later, and then I will update this post accordingly.
The execution model is simple and similar to how we’d expect a very basic single-threaded VM to behave. Every instruction is sequentially mapped to a specific memory address (simple incrementing integers for our purposes), and the Instruction Pointer starts at 0. Executing an instruction can update registers, set flags, modify the stack, or change the Instruction Pointer.
Jumps work by modifying the value of the Instruction Pointer to the destination address. Procedure calls work similarly, but with an added side effect: the address of the instruction after the call is pushed onto the stack. When a ret is encountered, the topmost value is popped off the stack and assigned back to the Instruction Pointer. This simulates the return from the procedure.
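As a tiny worked example (using the sequential addressing described in the next section; the label itself occupies an address and is presumably treated as a no-op):
call(label(sub)), % address 0: pushes 1 (the address of the next instruction) and jumps to address 2
hlt,              % address 1: reached once the ret below fires
label(sub),       % address 2: the destination the call jumps to
ret               % address 3: pops 1 off the stack and assigns it to the IP, resuming at the hlt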
Because the IP can change to arbitrary locations, there are a couple of mappings we need to build before we can run anything: one from each address to its instruction, and one from each label to its address.
These mappings are done in the following code:
instruction_pointer_map([],IPMap,_,IPMap).
instruction_pointer_map([Instr|T],IPMap,IPCounter,FinalIPMap) :- put2(-(IPCounter,Instr),IPMap,UpdatedIPMap),
plusOne(IPCounter,UpdatedIPCounter),
instruction_pointer_map(T,UpdatedIPMap,UpdatedIPCounter,FinalIPMap).
label_map([],LabelMap,_,LabelMap).
label_map([label(Label)|T],LabelMap,IPCounter,FinalLabelMap) :- put2(-(label(Label),IPCounter),LabelMap,UpdatedLabelMap),
plusOne(IPCounter,UpdatedIPCounter),
label_map(T,UpdatedLabelMap,UpdatedIPCounter,FinalLabelMap),
!.
label_map([_|T],LabelMap,IPCounter,FinalLabelMap) :- plusOne(IPCounter,UpdatedIPCounter),
label_map(T,LabelMap,UpdatedIPCounter,FinalLabelMap).
The label_map predicate simply steps through the full list of instructions, adding a mapping from the label to the current address (incremented each time through plusOne) whenever it encounters a label instruction.
The instruction_pointer_map predicate simply assigns every instruction it finds to an incrementing counter (its address).
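As an illustrative query (assuming, as in the rest of the post, that addresses are const/1 terms and the counters start at const(0)), building both maps for a tiny program might look like this:
?- Program=[mov(reg(r0),const(1)),label(loop),dec(reg(r0)),jnz(label(loop)),hlt],
   instruction_pointer_map(Program,[],const(0),IPMap),
   label_map(Program,[],const(0),LabelMap).
Program = [mov(reg(r0),const(1)),label(loop),dec(reg(r0)),jnz(label(loop)),hlt],
IPMap = [const(0)-mov(reg(r0),const(1)),const(1)-label(loop),const(2)-dec(reg(r0)),const(3)-jnz(label(loop)),const(4)-hlt],
LabelMap = [label(loop)-const(1)].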
Symbolic execution is a technique used to determine the provenance of data in a piece of code. Assume you have a very simplified code segment, like so:
Code=[
mov(reg(hl),const(10)), % load 10 into register HL
mov(reg(bc),const(5)), % load 5 into register BC
inc(reg(hl)), % increment HL by one
mul(reg(hl),reg(bc)) % multiply contents of HL with that of BC, store the result in HL
].
Now, suppose you wished to determine what sort of data transformations were taking place in this code; there are several reasons you might want to do this.
Now, you can run this piece of code on concrete numbers and generate lots of test cases for different values in hl, bc, etc. However, as a human, you may be able to induce a generic rule which explains the behaviour of this piece of code, namely:
hl=(hl+1)*bc
To deduce this rule, note that you didn’t use concrete numbers: you used symbols like hl and bc to represent the values that these registers could store. Symbolic execution does exactly this: instead of storing concrete numbers in registers, we store symbols. When we “execute” the program, every operation which modifies a symbol records a log of the operations applied to it. So, for example:
Code=[
mov(reg(hl),sym(a)), % load symbol a into register HL
mov(reg(bc),sym(b)), % load symbol b into register BC
inc(reg(hl)), % increment HL by one, HL now holds inc(sym(a))
mul(reg(hl),reg(bc)) % HL now holds mul(inc(sym(a)),sym(b))
].
Thus, at the end, hl’s contents are mul(inc(sym(a)),sym(b)), which is interpreted as hl=(hl+1)*bc.
Symbolic execution is a powerful technique for program analysis. There is however one wrinkle we need to take care of when building a symbolic interpreter: branching.
Consider the instruction jz(label(some_label)). During concrete execution, we can look at the value of the Zero Flag and then determine whether we want to jump to some_label or continue with the normal execution flow. However, the Zero Flag is set based on a comparison between two concrete values: what if those values are symbols? You cannot meaningfully compare sym(a) and sym(b) numerically: they each represent a range of possible values.
So, then the question becomes: which path do we take?
The image below is an example of the sort of situation you might encounter at a branch point.
The answer is that we take both paths. Effectively, we split our execution world into two branches: one which makes the jump, and the other one which contains normal execution. These two branches then continue on as individual threads to completion. Of course, if these branches encounter more conditional jump instructions, more sub-worlds split out of these as well, and so on.
Symbolic execution thus explores all possibilities of a program.
One issue is that this can easily result in a combinatorial explosion of paths, as each binary branch doubles the number of effective worlds. Symbolic execution engines tackle this in various ways. However, for our simple VM, we will simply keep splitting our world into new branches whenever we encounter conditional jumps.
plusOne(sym(X),sym(inc(sym(X)))).
plusOne(const(X),const(PlusOne)) :- PlusOne is X+1.
minusOne(sym(X),sym(dec(sym(X)))).
minusOne(const(X),const(MinusOne)) :- MinusOne is X-1.
product(const(LHS),const(RHS),const(Product)) :- Product is LHS*RHS.
product(sym(LHS),sym(RHS),sym(product(sym(LHS),sym(RHS)))).
product(sym(LHS),const(RHS),sym(product(sym(LHS),const(RHS)))).
product(const(LHS),sym(RHS),sym(product(const(LHS),sym(RHS)))).
The predicate definitions take into account the meaningful combinations of constants and symbols. As a rule of thumb, any expression containing even one symbol cannot be simplified (save degenerate cases like adding or subtracting 0, multiplying by 1 or 0, subtracting an expression from itself, etc., but we leave those aside for simplicity). Thus the only arithmetic simplification we perform is when everything in an expression is a constant. For all other combinations, the predicate creates a new compound term which reflects the operation being performed; this term will ultimately be inspected at the end of a run as part of the value of a register.
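For example, chaining these helpers on symbols versus constants (illustrative queries):
?- plusOne(sym(a),X), product(X,sym(b),Y), product(const(3),const(4),Z).
X = sym(inc(sym(a))),
Y = sym(product(sym(inc(sym(a))),sym(b))),
Z = const(12).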
Comparison works similarly to the arithmetic operations, in that an actual comparison is only performed when both sides are constants. The special case is that if both sides are symbolic expressions with the same structure, that also counts as equality. Otherwise, a new compound term logging the comparison (wrapped in a symbol itself) is returned.
equate(LHS,LHS,const(0)).
equate(sym(LHS),sym(RHS),sym(cmp(sym(LHS),sym(RHS)))).
equate(sym(LHS),const(RHS),sym(cmp(sym(LHS),const(RHS)))).
equate(const(LHS),sym(RHS),sym(cmp(const(LHS),sym(RHS)))).
equate(const(LHS),const(RHS),const(1)) :- LHS < RHS.
equate(const(LHS),const(RHS),const(-1)) :- LHS > RHS.
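A few illustrative comparisons (first answers shown):
?- equate(const(3),const(7),R1), equate(sym(a),const(7),R2), equate(sym(a),sym(a),R3).
R1 = const(1),
R2 = sym(cmp(sym(a),const(7))),
R3 = const(0).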
The virtual machine state must reflect the exact configuration of registers, flags, stack, etc. These are represented as:
vmState(IP,Stack,CallStack,Registers,flags(zero(v1),hlt(v2),branch(v3)))
- The zero() compound term stores the result of the most recent cmp: either a const() value, or a symbolic comparison term.
- The branch() compound term is useful to keep track of when to split worlds during symbolic execution.
- The hlt() compound term keeps track of a halt condition (which can be an explicit HLT instruction, or the flow falling off the end of a program without a HLT instruction).
This vmState compound term is passed around everywhere and is essentially equivalent to a struct in C (SWI-Prolog has dedicated facilities for representing structures, but I’ve deliberately kept the code as implementation-neutral as possible).
Accessing data inside this structure is very easy, thanks to unification and pattern matching; for example, if we wished to only extract the value of the Zero Flag, and not care about the other values, we can simply write:
vmState(_,_,_,_,flags(zero(ZeroFlagValue),_,_))
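For example, a hypothetical state can be destructured like this:
?- VmState = vmState(const(3),[],[],[],flags(zero(const(0)),hlt(false),branch(false))),
   VmState = vmState(_,_,_,_,flags(zero(ZeroFlagValue),_,_)).
VmState = vmState(const(3),[],[],[],flags(zero(const(0)),hlt(false),branch(false))),
ZeroFlagValue = const(0).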
For reference, this is the high-level predicate flow that we will refer to in the explanation of the predicates.
Before looking at how branches in symbolic execution are handled, it is instructive to understand how a single-world, concrete execution flow works. The inner loop which evaluates a world starts with the vm predicate:
- ExecutionMode is either symbolic or concrete. For the purposes of this explanation, let us assume that it is concrete.
- StateIn is simply the initial VM state, as described in Virtual Machine state.
- vmMaps is a tuple of the IP map and the label map, as described in Building the navigation maps.
- TraceOut contains the program trace for this execution. Since we are only doing a concrete flow, there will be no ChildWorlds.
The code below represents the part of the vm predicate which is pertinent to this discussion.
vm(Program,ExecutionMode,StateIn,vmMaps(IPMap,LabelMap),world(StateIn,TraceOut,ChildWorlds)) :-
exec_(vmMaps(IPMap,LabelMap),
StateIn,[],
traceOut(FinalTrace,VmStateOut),
env(log(debug,info,warning,error),ExecutionMode)),
VmStateOut=vmState(FinalIP,FinalStack,FinalCallStack,FinalRegisters,FinalVmFlags),
minusOne(FinalIP,LastInstrIP),
TraceOut=traceOut(FinalTrace,vmState(LastInstrIP,FinalStack,FinalCallStack,FinalRegisters,FinalVmFlags)),
...
.
The core interpretation is triggered by the exec_ predicate. The remaining lines involve taking the output VM state, adjusting its final IP value (which is one address beyond the last executed instruction) to point back to the last executed instruction, and repackaging it to bind it to the output variable TraceOut.
Let’s look at the internal exec_ loop. The exec_ predicate has two cases:
The first case applies when the hlt flag is set to true (hlt(true)). This indicates that a HLT instruction was encountered while executing the previous instruction. At this point, the TraceOut variable is bound to the accumulated program trace and the current VM state, and returned to the vm predicate.
exec_(_,vmState(IP,Stack,CallStack,Registers,flags(ZeroFlag,hlt(true),BranchFlag)),
    TraceAcc,
    traceOut(TraceAcc,vmState(IP,Stack,CallStack,Registers,flags(ZeroFlag,hlt(true),BranchFlag))),
    env(log(_,Info,_,_),_)) :- call(Info,'EXITING PROGRAM LOOP!!!').
In the second case (i.e., the previous instruction was not a HLT), we first get the instruction corresponding to the current IP value from the IPMap structure. This is then passed to exec_helper with all the associated context.
exec_(vmMaps(IPMap,LabelMap),vmState(IP,Stack,CallStack,Registers,VmFlags),TraceAcc,StateOut,Env) :-
    get2(IP,IPMap,Instr),
    exec_helper(Instr,vmMaps(IPMap,LabelMap),
        vmState(IP,Stack,CallStack,Registers,VmFlags),TraceAcc,StateOut,Env).
The exec_helper predicate is where the interpretation actually happens. This also has two cases.
The first case is when there is no HLT instruction, and execution falls off the end of the program. This is treated as equivalent to an implicit HLT instruction. Thus, the TraceOut variable is bound to whatever context is already present; the only explicit modification is the hlt flag, which is set to true.
exec_helper(empty,VmMaps,vmState(IP,Stack,CallStack,Registers,flags(ZeroFlag,_,BranchFlag)),
    TraceAcc,
    traceOut(TraceAcc,ExitState),
    env(log(Debug,Info,Warn,Error),ExecutionMode)) :-
    ExitState=vmState(IP,Stack,CallStack,Registers,flags(ZeroFlag,hlt(true),BranchFlag)),
    call(Warn,'No other instruction found, but no HLT is present. Halting program.'),
    exec_(VmMaps,ExitState,TraceAcc,traceOut(TraceAcc,ExitState),env(log(Debug,Info,Warn,Error),ExecutionMode)).
In the second case, before the interpret predicate is called, a NextIP variable is initialised to the current IP incremented by one, to indicate the next instruction that will be executed, assuming there are no jumps. This is so that a conditional jump can either return the same NextIP (i.e., no jump), or a different IP (indicating a jump). The interpret predicate is then called; it has different cases depending upon the instruction encountered. The various cases are described in Instruction Interpretation.
The true branch of the shouldBranch(...) if-then-else is only triggered during symbolic execution, so we’ll not worry about that for the moment. The false branch is the concrete execution flow, which is straightforward: it simply calls the exec_ predicate recursively with the updated IP. This mutually recursive call between exec_ and exec_helper continues until a halt condition is reached.
exec_helper(...) :-
...,
plusOne(IP,NextIP),
interpret(Instr,VmMaps,vmState(NextIP,Stack,CallStack,Registers,VmFlags),vmState(UpdatedIP,UpdatedStack,UpdatedCallStack,UpdatedRegisters,UpdatedVmFlags),env(log(Debug,Info,Warning,Error),ExecutionMode)),
(shouldBranch(UpdatedVmFlags)->
(
...
);
(
call(Debug,'Next IP is ~w',[UpdatedIP]),
exec_(VmMaps,vmState(UpdatedIP,UpdatedStack,UpdatedCallStack,UpdatedRegisters,UpdatedVmFlags),TraceAcc,traceOut(RemainingTrace,vmState(FinalIP,FinalStack,FinalCallStack,FinalRegisters,FinalVmFlags)),env(log(Debug,Info,Warning,Error),ExecutionMode)),
FinalTrace=[traceEntry(Instr,vmState(UpdatedIP,UpdatedStack,UpdatedCallStack,UpdatedRegisters,UpdatedVmFlags))|RemainingTrace]
)
),
!.
Many of the cases for the interpret predicate are about retrieving the values of (one or two) registers, performing some manipulation on them (symbolic or arithmetic), and storing the result in a register, as in this example for the MUL instruction.
interpret(mul(reg(LHSRegister),reg(RHSRegister)),_,vmState(NextIP,Stack,CallStack,Registers,VmFlags),vmState(NextIP,Stack,CallStack,UpdatedRegisters,VmFlags),env(log(Debug,_,_,_),_)) :-
get2(LHSRegister,Registers,LHSValue),
get2(RHSRegister,Registers,RHSValue),
call(Debug,'LHS is ~w,~w',[LHSRegister,LHSValue]),
call(Debug,'RHS is ~w,~w',[RHSRegister,RHSValue]),
product(LHSValue,RHSValue,Product),
call(Debug,'And result is ~w',[Product]),
update_reg(-(reg(LHSRegister),Product),Registers,UpdatedRegisters).
The two get2 calls retrieve the values of LHSRegister and RHSRegister. product calculates their (symbolic or arithmetic) product. update_reg updates LHSRegister with the product.
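As an illustration of this pattern, here is how the analogous case for the INC instruction could look. This is my sketch, modelled on the MUL clause above; the actual clause in the repository may differ in its details.
interpret(inc(reg(Register)),_,vmState(NextIP,Stack,CallStack,Registers,VmFlags),vmState(NextIP,Stack,CallStack,UpdatedRegisters,VmFlags),env(log(Debug,_,_,_),_)) :-
    get2(Register,Registers,Value),   % fetch the current (symbolic or concrete) value
    plusOne(Value,Incremented),       % symbolic inc(...) or concrete +1
    call(Debug,'Incremented ~w to ~w',[Register,Incremented]),
    update_reg(-(reg(Register),Incremented),Registers,UpdatedRegisters).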
The two instructions that modify the VM state differently are the JZ and JNZ instructions, whose behaviour, as we have discussed, depends upon whether the ExecutionMode is symbolic or concrete.
Let’s talk about how symbolic execution works. The symbolic execution mode is controlled by two variables:
- The ExecutionMode variable, which can either be symbolic or concrete.
- The branch flag, which is explicitly set to true by conditional jump instructions, but only when the ExecutionMode is symbolic.
To see this link, look at the two cases for the JZ (Jump if Zero) instruction.
interpret(jz(JumpIP),_,vmState(OldNextIP,Stack,CallStack,Registers,VmFlags),vmState(UpdatedIP,Stack,CallStack,Registers,UpdatedVmFlags),env(_,mode(symbolic))) :- interpret_symbolic_condition(OldNextIP,JumpIP,VmFlags,isZero,UpdatedVmFlags,UpdatedIP).
interpret(jz(JumpIP),_,vmState(OldNextIP,Stack,CallStack,Registers,VmFlags),vmState(UpdatedIP,Stack,CallStack,Registers,UpdatedVmFlags),env(_,mode(concrete))) :- interpret_condition(OldNextIP,JumpIP,VmFlags,isZero,UpdatedVmFlags,UpdatedIP).
The first case triggers when the mode is symbolic, and calls the interpret_symbolic_condition predicate, which is a single clause that directly sets branch(true):
interpret_symbolic_condition(OldNextIP,_,flags(ZeroFlag,HltFlag,_),_,flags(ZeroFlag,HltFlag,branch(true)),OldNextIP).
The second case triggers when the mode is concrete and calls the interpret_condition predicate, which is described in Instruction Interpretation.
Where is branch(true) actually used? It is used in the exec_helper predicate, reproduced here with the pertinent code:
exec_helper(...) :-
...,
(shouldBranch(UpdatedVmFlags)->
(
terminateForBranch(vmState(UpdatedIP,UpdatedStack,UpdatedCallStack,UpdatedRegisters,UpdatedVmFlags),vmState(FinalIP,FinalStack,FinalCallStack,FinalRegisters,FinalVmFlags)),
FinalTrace=TraceAcc
);
(
...
)
),
!.
The shouldBranch(...) check only succeeds when the branch flag is branch(true). At this point, exec_helper simply returns the entire trace and the VM state as-is, effectively ending the execution of this thread. This is because, beyond this point, two new world threads need to be created and interpreted as their own worlds with identical starting points.
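Concretely, shouldBranch can be as small as a single fact that matches on the branch flag (a sketch; the repository version may differ):
shouldBranch(flags(_,_,branch(true))).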
Where does this world splitting take place?
This happens in the vm predicate, reproduced here with the relevant code:
vm(...) :-
...,
(shouldTerminateWorld(FinalVmFlags)->(...);
(
NewStartIP_One=FinalIP,
branchDestination(LastInstrIP,LabelMap,IPMap,NewStartIP_Two),
Branches=[NewStartIP_One,NewStartIP_Two],
info("Branches are: ~w",[Branches]),
explore(Program,ExecutionMode,VmStateOut,vmMaps(IPMap,LabelMap),Branches,[],ChildWorlds)
)
).
At this point, we are back at the top, and we need to decide whether to halt the execution entirely, or whether to branch out into different worlds. How do we determine this?
Answer: We know we are back at the top, but we also know that we aren’t in a halt condition (an explicit HLT instruction or execution flow falling off the end), so we must be at a branch point. We therefore extract two IP values: NewStartIP_One (the default execution flow IP value) and NewStartIP_Two (the jump IP value). We then recursively call the explore predicate, which is the top-level entry predicate for our VM.
Let’s look at the explore predicate. The important variable here is the list of Instruction Pointers. If there is only one IP value (which is what happens when this predicate is first invoked), there is only one branch / world. If there are two (or more, though that is not a situation that arises with this VM), there are multiple branches which need to be symbolically interpreted independently, starting from the same set of initial conditions (the VM state, etc.). explore is thus called recursively for every conditional branch, until there are none left, signalling the end of the symbolic execution tree.
explore(_,_,_,_,[],WorldAcc,WorldAcc).
explore(Program,ExecutionMode,VmState,VmMaps,[IP|OtherIPs],WorldAcc,[WorldOut|OtherWorldOuts]) :-
VmState=vmState(_,Stack,CallStack,Registers,flags(ZeroFlag,_,_)),
FreshState=vmState(IP,Stack,CallStack,Registers,flags(ZeroFlag,hlt(false),branch(false))),
vm(Program,ExecutionMode,FreshState,VmMaps,WorldOut),
explore(Program,ExecutionMode,VmState,VmMaps,OtherIPs,WorldAcc,OtherWorldOuts),
!.
If you wish to trace the flow yourself and are using SWI-Prolog, I’d highly recommend using the graphical debugger. It has full support for breakpoints, step over/step into, forcing predicates to fail, on-the-fly edit-and-recompile, and more.
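For example, prefixing a goal with gtrace/0 opens the graphical tracer for that run (here with a trivial two-instruction program):
?- gtrace, execute_symbolic([mov(reg(r0),const(5)),hlt],mode(concrete),AllWorlds).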
These are a couple of example programs you can run with this VM.
execute_symbolic([
mov(reg(r0),const(5)),
mov(reg(r1),const(1)),
call(label(factorial)),
push(const(30)),
cmp(reg(r1),const(121)),
hlt,
label(factorial),
cmp(reg(r0),const(0)),
jz(label(base)),
push(reg(r0)),
dec(reg(r0)),
call(label(factorial)),
pop(reg(r0)),
mul(reg(r1),reg(r0)),
ret,
label(base),
mov(reg(r1),const(1)),
ret
],
mode(concrete),AllWorlds),print_worlds(AllWorlds,0).
execute_symbolic([
mov(reg(r1),const(10)),inc(reg(r1)),
mov(reg(r2),const(20)),
cmp(reg(r1), const(0)),
jnz(const(7)),
cmp(reg(r2), const(1)),
mov(reg(r1),const(30)),
jz(const(10)),
mov(reg(r3), const(40)),
hlt,
mov(reg(r3), const(50)),
hlt
],
mode(symbolic),AllWorlds),print_worlds(AllWorlds,0).