Closures, lambda lifting, name resolution, etc. #21

Open · bartvm opened this issue Jan 23, 2018 · 7 comments

@bartvm (Member) commented Jan 23, 2018

I did a bit of research on environments, closures, lambda lifting, etc. etc. for Myia and thought I would write a quick summary. I'll go over some basic definitions to ground the discussion, even though people might be familiar with them already:

Static vs runtime name resolution

Python's name resolution is done at runtime

def f():
    return x
locals()['x'] = 1
f()  # 1

But not for the local scope: inside a function, writing to locals() has no effect on name resolution

locals()['x'] = 1
def f():
    locals()['x'] = 2
    return x  # At parse-time this is determined to be a non-local variable
f()  # 1

It's worth noting that the Python documentation says:

On the other hand, the actual search for names is done dynamically, at run time — however, the language definition is evolving towards static name resolution, at “compile” time, so don’t rely on dynamic name resolution! (In fact, local variables are already determined statically.)

Note that closures are always runtime data structures. The runtime aspect of dynamic name resolution refers specifically to the fact that the namespace can be manipulated at runtime, and that no compile-time errors will be thrown for undefined variables. The actual name resolution rules can be the same for dynamic and static resolution.
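
For instance, referencing an undefined name is not an error at definition time; the error only appears at call time (minimal sketch, undefined_name being an arbitrary unbound name):

def f():
    return undefined_name  # no error here, at definition time

f()  # NameError is only raised now, at call time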

Forward references

Python's support for forward references comes from the fact that it does runtime name resolution, but the two are separate, i.e. forward references can also be supported in static name resolution (although that generally requires backtracking or multiple passes).

def f():
    return x  # x is unresolved at this point
x = 2  # x gets bound when defined later on

Referencing names or values

Forward references make some situations ambiguous, because we need to decide whether to capture values at the point of definition or at the call site.

def f():
    x = 1
    def g():
        return x
    y = g()  # 1
    x = 2
    z = g()  # should z be 1 or 2? Python says 2
    return g

If your intermediate representation references variables by name, it is possible to deal with this ambiguity i.e. one can say "return the variable named x", and the variable x can change between calls to g.

However, Myia references variables by value. That is, return x is parsed to a graph with a direct edge to the variable x in its enclosing scope; we have to draw an edge to either x = 1 or x = 2.

The advantage of having these direct edges to the outer graph is that it simplifies optimization, because the outer and inner scope can be optimized jointly (which is particularly important in the case of backpropagators). The downside is that we cannot apply the same name resolution rules as Python. To avoid discrepancies between the two approaches, the following rule can be enforced:

If a free variable is undefined, it must be defined exactly once after its capture. If a variable is defined, it cannot be redefined after its capture.
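
As a hypothetical sketch of what this rule allows and forbids:

def f():
    def g():
        return x  # x is captured here while still undefined
    x = 1  # OK: x is defined exactly once after its capture
    # x = 2  # would violate the rule: no redefinition after capture
    return g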

Lambda lifting vs. closure conversion

Our intermediate representation can probably be executed directly using something like a spaghetti stack (as the current VM does). However, for code generation we will probably want to do lambda lifting and/or closure conversion at some point during the lowering process.

Lambda lifting involves the removal of free variables by turning them into parameters. This requires changing the arguments passed at the call site. The problem with lambda lifting is that it cannot generally be done when you have higher-order functions.

def f(g):
    return g()  # g1(x) or g2(x, y)

def h(c, x, y):
    def g1():
        return x
    def g2():
        return x + y
    return f(g1) + f(g2)
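
To make the problem concrete, here is a sketch of what lifting would produce (g1_lifted and g2_lifted are hypothetical names):

def g1_lifted(x):
    return x

def g2_lifted(x, y):
    return x + y

# f cannot be rewritten consistently: g1_lifted needs the argument (x,)
# while g2_lifted needs (x, y), and f cannot know statically which of
# the two it was passed.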

Closure conversion removes free variables by packing them into environments that the function is called with. These are runtime data structures.

def f(env, g):
    return g(env)

def g1(env):
    return env.x

def g2(env):
    return env.x + env.y

def h(c, x, y):
    env = makeenv(x=x, y=y)
    return f(env, g1) + f(env, g2)

Implementing closures

If the nested code is executed directly, closures only need to be created when a function is returned. However, if the goal is to make every function top-level, all nested functions must have their free variables removed and turned into environments.

There are roughly three methods of creating environments for closures. With each of these approaches you can make the environments mutable or immutable. A mutable environment means that a closure will be called with the environment as it is at the time of the call (like Python), whereas an immutable environment means that a closure will be called with the environment as it was at the time of its definition.
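
A sketch of the difference, using plain dicts as stand-in environments:

def make_closures():
    env = {"x": 1}
    def g(e):
        return e["x"]
    mutable = (g, env)          # shares env: sees later updates
    immutable = (g, dict(env))  # snapshots env at definition time
    env["x"] = 2
    return mutable, immutable

mutable, immutable = make_closures()
mutable[0](mutable[1])      # returns 2: environment at the time of the call
immutable[0](immutable[1])  # returns 1: environment at the time of definition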

1. Copy

Each time a closure is created we create a new environment (heap allocated) using the current locals (stack allocated).

The downside of this approach is that local variables need to be copied into the environment. The upside is that if the makeenv call is not reached, no environment will be constructed.

def f():
    x = 1
    def g():
        return x
    y = g()
    def h():
        return g() + y
    return h

This is converted to:

def g(env):
    return env.x

def h(env):
    return g(env) + env.y

def f():
    x = 1
    # The environment is created for g
    env = makeenv(x=x)  # x is copied
    y = g(env)
    # The environment is updated for h
    env.y = y  # y is copied
    return make_closure(h, env)

2.a. Rewrite with shared environments

The rewriting approach rewrites functions that might create closures to store and load local variables in an environment which can then be used directly to create a closure.

The shared environment approach means that each environment contains only the variables created in that function's scope, plus a pointer to its parent environment. This saves memory, but might require a bit of pointer hopping to resolve variables from outer scopes at runtime.

def f(env):
    env = makeenv(parent=env)  # Closures contain a link to their parent environment
    env.x = 1
    env.y = g(env)
    return make_closure(h, env)
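
Resolving a free variable then walks the parent chain at runtime; a minimal sketch (Env and env_lookup are hypothetical helpers):

class Env:
    def __init__(self, parent=None):
        self.parent = parent
        self.vars = {}

def env_lookup(env, name):
    # Hop through parent environments until the name is found.
    while env is not None:
        if name in env.vars:
            return env.vars[name]
        env = env.parent
    raise NameError(name)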

2.b. Rewrite with flat environments

Flat environments optimize lookup performance by making sure that all free variables are stored directly in the environment. The downside is that this costs more memory, because the environment needs to be copied anew for each nested function.

def f(env):
    env = env.copy()
    env.x = 1
    env.y = g(env)
    return make_closure(h, env)
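
The copying is where the extra memory goes; sketched with dicts as stand-in environments:

outer_env = {"x": 1, "y": 2}
inner_env = dict(outer_env)  # full copy for each nested function
inner_env["z"] = 3           # outer_env is unaffected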

Corner cases

A few noteworthy corner cases are recursion and mutual recursion. If we use flat environments, we have to make sure that we don't create a new environment on each recursive call, but instead re-use the one the function was originally called with. For mutual recursion this gets trickier: all the functions involved must share a single flat environment.
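
A sketch of the mutual recursion case, with dicts as stand-in flat environments:

def is_even(env, n):
    return True if n == 0 else env["is_odd"](env, n - 1)

def is_odd(env, n):
    return False if n == 0 else env["is_even"](env, n - 1)

# Both functions must be reachable through the single shared environment.
env = {"is_even": is_even, "is_odd": is_odd}
assert is_even(env, 4)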

When using shared environments, the closure environments simply become cyclical for any form of recursion (i.e. f inside of f is a closure with the environment that f was first called with).

Random thoughts

  • It might be a good thing to do closure conversion before doing automatic differentiation: Each function's environment is now made explicit, and y = f(env, x) will have a derivative denv, dx = df(dy), so there's a nice symmetry to it. However, I think it's worthwhile to optimize f and df jointly before closure converting the generated df.
  • Type inference can still be done after closure conversion (environments can have accompanying type maps).
  • For the VM, I think support for closures will be good to add eventually, but I think we should keep the current spaghetti stack approach so that we can debug these graphs directly.
  • In Parser #4 I proposed supporting Python globals by having an environment object that functions are closed over. It's worth noting that this would lead to a special case after closure conversion, where global_env = env.globals; sum = global_env.sum i.e. one variable in the environment will be another environment containing the globals (so accessing Python globals will require two lookups).
@breuleux (Member)

If a free variable is undefined, it must be defined exactly once after its capture. If a variable is defined, it cannot be redefined after its capture.

I would also support the simpler "a variable which is captured can only be defined once."

Regarding closure conversion:

def f(env, g):
    return g(env)

def g1(env):
    return env.x

def g2(env):
    return env.x + env.y

def h(c, x, y):
    env = makeenv(x=x, y=y)
    return f(env, g1) + f(env, g2)

That's not typically how closure conversion works. You are supposed to pack the environment and the function together:

def f(g):
    return g()  # runtime interprets this as get_function(g)(get_env(g))

def g1(env):
    return env.x

def g2(env):
    return env.x + env.y

def h(c, x, y):
    env = makeenv(x=x, y=y)
    g1c = makeclosure(g1, env)
    g2c = makeclosure(g2, env)
    return f(g1c) + f(g2c)
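
A runnable version of this convention, with tuples as closures (makeclosure, get_function, and get_env being the helpers sketched above):

def makeclosure(fn, env):
    return (fn, env)

def get_function(c):
    return c[0]

def get_env(c):
    return c[1]

def f(g):
    # What the runtime does for the call g():
    return get_function(g)(get_env(g))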

It might be a good thing to do closure conversion before doing automatic differentiation: Each function's environment is now made explicit, and y = f(env, x) will have a derivative denv, dx = df(dy), so there's a nice symmetry to it.

Well, in general you're not going to call y = f(env, x), you're going to call y = get_closure(f)(get_env(f), x), so you need to derive through get_closure and get_env (and make_closure etc.) I don't think it's necessarily a nicer symmetry than gf, gx = df(gy) either.

However, I think it's worthwhile to optimize f and df jointly before closure converting the generated df.

If there are substantial benefits to having direct pointers to free variables as in Thorin (and I've expressed mild skepticism about that), you probably don't want to optimize some half-closure-converted representation.

@bartvm (Member, Author) commented Jan 23, 2018

If there are substantial benefits to having direct pointers to free variables as in Thorin (and I've expressed mild skepticism about that), you probably don't want to optimize some half-closure-converted representation.

My suggestion would be to optimize, then closure convert, then AD, then optimize again, then closure convert again. Presumably the first optimization pass would have exploited any optimization opportunities in closures in the primals.

I do think it might be important for backpropagators to have direct pointers to the intermediate variables in the primals. It allows you to optimize the forward and backward pass together as a single graph.

That's not typically how closure conversion works. You are supposed to pack the environment and the function together.

Sure, that's the definition of a closure. My notation wasn't very clear here--I was trying to make it explicit at which points the environment gets constructed, mutated, and passed around without muddling it with packing and unpacking notation--but I don't believe it changes anything about the implementation details I summarized.

(As a sidenote, you don't need to do the packing and unpacking if you are calling a function in the scope it was defined in.)

I would also support the simpler "a variable which is captured can only be defined once."

I think that would actually complicate the implementation rather than simplify it, because doing name resolution by accessing the latest definition in scope is the default. Backtracking to see whether there was more than one definition at the point a variable gets captured is just extra work.

Well, in general you're not going to call y = f(env, x), you're going to call y = get_closure(f)(get_env(f), x), so you need to derive through get_closure and get_env (and make_closure etc.) I don't think it's necessarily a nicer symmetry than gf, gx = df(gy) either.

The get_closure and get_env adjoints seem trivial; it's just packing and unpacking stuff.

Intuitively I like y = f(env, x)/denv, dx = df(dy) because there is less magic going on: denv is simply the derivative of an argument like any other. When you use y = f(x) and denv, dx = df(dy), there is a certain kind of asymmetry that bugs me. Although f(x) can read from its environment, df(dy) cannot write to it. So whereas f(x) accesses things implicitly, we have to make df(dy)'s modification on the environment explicit by constructing denv. From there on we make use of the fact that a function, as a non-numerical object, doesn't have a derivative, so we can use their partials as a way to smuggle environment sensitivities back to the adjoint of the closure definition.

I don't know how much this matters in practice, possibly little, but I do wonder if AD on closure converted code isn't more natural because there are more invariants i.e. type(env) == type(denv) whereas type(denv) != type(f).

Here's a worked out example.

def f(g):
    return g()

def h(x):
    def g():
        return x
    y = f(g)
    return y

This will be closure converted to something like this.

def g(env):
    y = env.x
    return y

def h(x):
    env = make_env()
    env.x = x
    g_ = (g, env) # a closure
    y = f(g_)
    return y

def f(g):
    y = g[0](g[1])  # calling a closure
    return y

Now the backpropagator approach seems natural to me, because the environment sensitivities have just become the gradients of the environments, which were passed as explicit arguments. This means you don't really need to treat them in any special way.

def g(env):
    y = env.x
    def dg(dy):
        denv = make_env()
        denv.x = dy  # adjoint of y = env.x
        return denv
    return y, dg

def h(x):
    env = make_env()
    env.x = x
    g_ = (g, env)
    y, df = f(g_)
    def dh(dy):
        dg_ = df(dy)  # normal adjoint of a function call
        dg, denv = dg_  # adjoint of make_closure: unpack function and environment parts
        dx = denv.x  # adjoint of writing x into the environment is reading its sensitivity
        return dx
    return y, dh

def f(g):
    y, dg = g[0](g[1])
    def df(dy):
        denv = dg(dy)  # normal adjoint of the environment argument
        dg_ = (0, denv)  # non-numeric objects always have gradient 0
        return dg_
    return y, df
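
If we add minimal stand-ins for the environment helpers (make_env is hypothetical, and _Env is just an attribute bag), the example runs end to end; since g just returns env.x, the sensitivity flows straight through:

class _Env:
    pass

def make_env():
    return _Env()

y, dh = h(3.0)
assert y == 3.0
assert dh(1.0) == 1.0  # y == x, so dx == dy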

@breuleux (Member)

My suggestion would be to optimize, then closure convert, then AD, then optimize again, then closure convert again. Presumably the first optimization pass would have exploited any optimization opportunities in closures in the primals.

But there may be optimization opportunities between the AD-generated code and free variables, e.g.

def f(x):
    a = x * x
    def g():
        return a * log(x)
    return g()

dg/dx is a/x. With a direct pointer to a, we can easily see that this is (x*x)/x = x, but after closure conversion of g, a is opaque -- we don't see what it is. So we would miss that optimization.

I think that would actually complicate the implementation rather than simplify it, because doing name resolution by accessing the latest definition in scope is the default. Backtracking to see whether there was more than one definition at the point a variable gets captured is just extra work.

Ah, okay, that makes sense. If that's simpler, then let's do that.

I don't know how much this matters in practice, possibly little, but I do wonder if AD on closure converted code isn't more natural because there are more invariants i.e. type(env) == type(denv) whereas type(denv) != type(f).

I think closure conversion is actually mildly problematic from a type system point of view, because the environment's type isn't supposed to be part of the closure's type: it's not supposed to leak. If you have a HOF F(f), all that matters are the types of f's inputs and outputs. The environment of f could be literally anything. So if you do closure conversion, generally speaking, the type of closure_env(f) is going to be top/any. closure_env is not typable in the general case and this could be a problem (it could limit expressiveness).

That's why when you closure-convert in a typed language you'd lift functions and insert make_closure calls, but you wouldn't change call sites from g(x) to g[0](g[1], x), because the type system will choke on this -- you either leave them as they are (the runtime can deal with closures on its own) or have something like run_closure(g, x).

This might be a reason why the grad transform is difficult to integrate in a type system, since it is potentially equivalent to having a closure_env primitive. So we may very well have that problem anyway.

(Also, nothing actually stops type(denv) == type(f) from being true. At some point in master I did exactly that. The semantics are a bit dubious, so I wouldn't necessarily recommend it, but it works just fine.)

@breuleux (Member)

Actually, I suppose an explicit closure could be typed as (e, e -> a -> b), but it does require e to be a type variable in several situations.
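
In Python typing terms, one uncurried sketch of that shape (names hypothetical):

from typing import Callable, Tuple, TypeVar

E = TypeVar("E")
A = TypeVar("A")
B = TypeVar("B")

# (e, e -> a -> b): an environment paired with a function taking the
# environment and an argument.
Closure = Tuple[E, Callable[[E, A], B]]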

@bartvm (Member, Author) commented Jan 24, 2018

But there may be optimization opportunities between the AD-generated code and free variables [...]

That's a good example.

Actually, I suppose an explicit closure could be typed as (e, e -> a -> b), but it does require e to be a type variable in several situations.

The environment sensitivities returned by a backpropagator must have a type, no? The symmetry I am thinking about is that the type of f would be env -> a -> b which makes the backpropagator have type b -> (a, env). If you make the closure transparent then you have type a -> b for the primal and b -> (a, ?) for the backpropagator, so we need to have an explicit type for environments anyway.

(Also, nothing actually stops type(denv) == type(f) from being true. At some point in master I did exactly that. The semantics are a bit dubious, so I wouldn't necessarily recommend it, but it works just fine.)

Letting type(denv) == type(f) seems dubious indeed, because now you have a -> b for the primal, and b -> (a, (a -> b)) for the backpropagator...

@bartvm (Member, Author) commented Jan 25, 2018

So here's another way of thinking about it, which maybe clarifies my thinking a bit: If primals read from their environment, adjoints should really be writing to an adjoint environment. For example, take this function

def f():
    return x ** 2  # x is a free variable read from the environment

then its transformation looks like

def f():
    def df(dy):
        dx = x * 2 * dy  # dx is an environment sensitivity
    return x ** 2, df

and if you write this out more explicitly, it looks something like this, with line-by-line adjoints whose types work out

env: Env = make_env()
env.x: float = x
f_: Callable[[], float] = make_closure(f, env)
y, df_ = f_()  # Tuple[float, Callable[[float], None]]
...
df_(dy)  # No return value
df, denv = unmake_closure(df_)  # Tuple[Callable[[float], None], Env]
dx: float = denv.dx
denv = make_env()  # type-specific zero gradient

Although this is nicely symmetric, it shows why this transformation is awkward: The adjoint of reading from an environment is writing to one, but that requires an impure function.

The backpropagator approach in the paper sidesteps this problem by making the environment sensitivities explicit return values. However, this means exposing them to the type system and higher-order gradients, which prompted my question: Why not just make environments of the primals explicit as well i.e. do closure conversion? But I agree that this is not ideal e.g. for optimization.

A third option would be a representation in which the backpropagator can explicitly write to the scope of other functions, like in Python with the nonlocal keyword. I don't think this is really possible though, since you would have to write into the scope of the backpropagator of the function that your primal's free variables were bound in, which can only be determined at runtime.

def g(x):
    def f():
        def df(dy):
            nonlocal dx  # should write into scope of dg somehow
            dx += x * 2 * dy
        return x ** 2, df
    y, df = f()
    def dg(dy):
        dx = 0
        df(dy)
        return dx
    return y, dg

So I guess that if we don't want to do closure conversion before AD, the only option is to accept the slightly awkward asymmetry and just return environments explicitly (and explicitly type them as environments with an accompanying type map that we can use for type inference).

@breuleux (Member)

Yeah, I see what you mean, and the gradient transform's certainly going to be simpler if it doesn't need all the special machinery to deal with free variables. This being said, I think I may already have a working implementation of Grad in #22 that uses our representation directly, so unless we determine the advantages of pointing to free variables directly are not worth it, I say we just proceed as planned, but keep this option in mind if we need to reevaluate later.
