2011年7月28日星期四

Lazy CEK Machine

In the CEK Machine, though the substitution for an identifier is delayed, it doesn't mean it is a lazy machine. As described in Eager and Lazy Evaluation, a lazy evaluation strategy doesn't evaluate an argument before applying it to a function. However, in the CEK machine, it always evaluates the argument to a value, so it is still a call-by-value machine. If you are careful enough, you may already notice whenever an environment is created, the expression in its closure is always a value. The real power of environment is working with lazy evaluation. So in this post, I am going to implement the call-by-name and call-by-need strategies for the CEK machine.

Call by Name
It is quite easy to implement the call-by-name strategy based on the current call-by-value one. Since the machine no longer evaluates the argument before applying it to a function, the reduction rules related to applications should be modified(For a primitive operator, the operands are needed at the moment it is evaluated, so it stays unchanged.). In the call-by-value machine, when the control string is an application, first an ArgKK continuation is handled, then a FunKK continuation is handled. Now, the two steps, 2.1 and 2.2, should be combined into one, and the FunKK continuation type is removed.

2) If the control string is a value(abstraction or constant), check the type of continuation:
    2.1) If it is ArgKK, pop up the continuation, create a new environment that maps the function parameter to current closure, and set the function body and the new environment as the current expression and environment respectively;

However, the machine may need to evaluate the same argument many times now. Take expression (lambda x  + x x) ((lambda y y) 1) as an example, x is mapped to ((lambda y y) 1) after reducing the outer application. Then, when evaluating + x x, ((lambda y y) 1) is evaluated twice. The reason is that a parameter may be used several times in a function body, and whenever it is needed, the original argument it maps to should be evaluated. What if remember the result of the argument for the first time it is evaluated, so it doesn't bother to re-compute it later? I will talk this in next section.

Call by Need
The idea is quite straightforward. Remember the value for a closure, so next time the closure is needed, the value is retrieved directly. Considering the same example (lambda x + x x) ((lambda y y) 1), when the first x in + x x is needed, ((lambda y y) 1) is evaluated and the result is 1. After this, the environment which maps x to ((lambda y y) 1) is updated so x is mapped to 1. When the second x is needed, 1 is retrieved from the environment and no extra evaluation is performed.

However, the implementation is not that simple. When evaluating ((lambda y y) 1), I can't simply put it as the control string of current machine, because this may destroy the machine state and the machine can't go back to previous state after it is evaluated. Instead, a separate evaluation process should be performed on it. This is kind of a sub-machine, whose evaluation logic is the same, but it relies on some states from the parent machine, such as environments because the expression itself may contain free variables. In current evaluate() function, it initializes the machine state at the beginning and clean it up at the end. It was designed like this because no sub-machine was ever needed before. I am going to refactor it so it is fed with a machine state and focuses on the reduction logic, instead of managing the lifecycle of the machine state. One advantage is an expression can be put into a particular context for evaluation. It is more flexible. The interface is changed to:

// file: eval.c
static int _evaluate(State *state);

TreeNode * evaluate(TreeNode *expr) {
    Environment *globalEnv = buildGlobalEnvironment();
    State *state = cek_newState();
    state->closure = cek_newClosure(expr,globalEnv);

    TreeNode* result = NULL;
    if(_evaluate(state)) {
        // The result control string may contain free variables, need to 
        // substitute them using the environment for it.
        TreeNode *tmp = resolveFreeVariables(state->closure->expr,state->closure->env);
        if(tmp!=NULL) {
            // actually, tmp and state->closure->expr are the same expression
            result = tmp;
            state->closure->expr = NULL;
        }
    }
    cek_cleanup(state);
    return result;
}

And the code for looking up an identifier is:

// file: eval.c
static Closure* lookupVariable(const char *name, Environment *env) {
    if(env==NULL) return NULL;
    if(strcmp(name,env->name)==0) {
        Closure *closure = env->closure;
        if(closure->env!=NULL) {
            // evaluates the expression in its environment
            State *state = cek_newState();
            state->closure = cek_newClosure(duplicateTree(closure->expr),cek_newEnvironment("",cek_newClosure(NULL,NULL),closure->env));

            if(_evaluate(state)) {
                deleteTree(closure->expr);
                cek_deleteClosure(closure);
                // save the evaluated result
                env->closure = cek_newClosure(state->closure->expr,state->closure->env);
                state->closure->expr = NULL;
            }
            cek_cleanup(state);
        }
        return env->closure;
    }
    return lookupVariable(name,env->parent);
}

More Words about Closure and Environment
Another design improvement I want to mention is the global environment. Currently, if the control string is an identifier, the machine first tries to look it up in the environments and if not found it tries to resolve it from the builtin and standard libraries. Since a library maps identifiers to expressions, it is just like an environment. So the identifier lookup can be treated universally by introducing a global environment, which contains all the identifiers and functions provided by the evaluator. It is the root environment for all environments that are created during the evaluation and it is initialized when the machine is started. Thus, to handle a control string which is an identifier, the machine only needs to look it up in the environments. If it is a builtin or standard function, the machine can finally find it in the global environment.

A closure is an expression associated with an environment. The environment is where the expression is defined rather than it is evaluated. Let's consider the reduction steps of expression (lambda x  (lambda y (lambda x + x y) 2) x) 1.

(lambda x (lambda y (lambda x + x y) 2) x) 1
  => (lambda y (lambda x + x y) 2) x)                    A = {x=>(1,G), G}
  => (lambda x + x y) 2                                          B = {y=>(x,A), A}
  => + x y                                                               C = {x=>(2,B), B}
  => + 2 y                                                               C = {x=>(2,B), B}
  => + 2 1                                                               C = {x=>(2,B), B}  *
  => 3

The result is 3 rather than 4, because at step (*), y is mapped to x, and x is looked up in environment A, where it is defined, but not the latest environment C where is evaluated.

You can get the source code in branches call-by-name and call-by-need via github.

2011年7月23日星期六

CEK Machine

In the textual machines I mentioned in previous posts, when reducing an application, without exception, everyone requires the substitution of an argument in place of all the occurences of an identifier in the abstraction body. The machine traverses the expression tree during substitution, and if the result is not a value, it needs to re-traverse it again to evaluate it. Take (lambda x + x x) 1 as an example, + x x is traversed to replace x with 1. After that, we get result + 1 1. To evaluate it, the expression + 1 1 is traversed again. It is inefficient to traverse an expression twice. We cannot get rid of this if we stick on the substitution when reducing an application. We can, however, remember the identifier and the argument at the moment and delay the substitution until it is actually needed.

To represent this delay substitution, a component that maps an identifier to an expression is needed. It is called an environment. Every expression in the machine is replaced by a pair, whose first part is an expression probably with free variables in it and the other part is an environment. It is called a closure. Of course, the environment could not map identifiers to expressions, because the machine never works on a single expression. It should map identifiers to closures. And the continuation also contains a closure. Now, the machine state changes to a control string, an environment and a continuation, so it is called a CEK machine. Here is the data structures introduced for environment and closure:

// file: cek_machine.h
struct envStruct {
    char *name;
    struct closureStruct *closure;
    struct envStruct *parent;
};

struct closureStruct {
    TreeNode *expr;
    struct envStruct *env;
};

typedef struct continuationStruct {
    ContinuationKind tag;
    Closure * closure;
    struct continuationStruct * next;
} Continuation;

Conceptually, an environment can contain several identifier-closure mappings. However, in the definition above an environment only has one such mapping. Since mappings are created when reducing applications and an abstraction only has one parameter, I keep one mapping in a standalone environment and arrange them hierarchically. When looking up for the mapped closure of an identifier, first try it in its associated environment, and if failed, try it in the parent one.

CEK Machine
There is no significant change to continuation actions. Some new actions, like creating a closure, deleting a closure, creating an environment, looking up in an environment, are very common in the reduction steps. By the way, an identifier is no long a value because it may be mapped to an expression, which can be further reduced. Here are the reduction rules:
1) If the control string is an identifier, look up the mapped closure in the environment, and set the expression and environment of the closure as current expression and environment respectively;
2) If the control string is a value(abstraction or constant), check the type of continuation:
    2.1) If it is FunKK, pop up the continuation, create a new environment that maps the function parameter to current closure, and set the function body and the new environment as the current expression and environment respectively;
    2.2) If it is ArgKK, set the continuation as FunKK, and switch the current expression and environment with these in the closure of continuation respectively;
    2.3) If it is OprKK, pop up the continuation, evaluate the primitive expression, and set the result as the current expression and set the current environment to null;
    2.4) If it is OpdKK, set next argument expression and its environment as current expression and environment respectively;
3) If the control string is an application, create and push a continuation of type ArgKK, whose closure consists of the argument expression and current environment, and set the abstractioin expression as current expression, with current environment unchanged;
4) If the control string is a primitive, create and push a continuation of type OpdKK, whose closure consists of the primitive expression and current environment, and set the first argument expression as current expression, with current environment unchanged.

Though implementation is straightforward according to the reduction rules, the code is getting complicated, so I am not duplicate it here. You can still access it on github.

A drawback of the CEK machine is no intermediate expression can be obtained after each step. What a pity! The machine always delays the substitution and expression is not actually reduced after each step. It does less work at the expense of losing immediate effects.