Why are λ-calculus optimal evaluators able to compute big modular exponentiations without formulas?

Church numbers are an encoding of natural numbers as functions.

(\ f x → (f x))             -- church number 1
(\ f x → (f (f (f x))))     -- church number 3
(\ f x → (f (f (f (f x))))) -- church number 4

Neatly, you can exponentiate two church numbers by just applying one to the other. That is, if you apply 4 to 2, you get the church number 16, or 2^4. Obviously, that is utterly impractical. Church numbers need a linear amount of memory and are really, really slow. Computing something like 10^10 - which GHCi quickly answers correctly - would take ages, and the result couldn't fit in your computer's memory anyway.
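
To make the "exponentiation is application" trick concrete, here is a minimal Haskell sketch of church numbers (the names Church, expC and toInt are purely illustrative):

{-# LANGUAGE RankNTypes #-}

newtype Church = Church (forall a. (a -> a) -> a -> a)

two, four :: Church
two  = Church (\f x -> f (f x))
four = Church (\f x -> f (f (f (f x))))

-- Exponentiation is just application: applying the exponent to the base.
expC :: Church -> Church -> Church
expC (Church base) (Church ex) = Church (ex base)

-- Read a numeral back as an Int by counting applications of (+1).
toInt :: Church -> Int
toInt (Church n) = n (+ 1) 0

main :: IO ()
main = print (toInt (expC two four)) -- 16, i.e. 2^4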

I've been experimenting with optimal λ evaluators lately. In my tests, I accidentally typed the following into my optimal λ-calculator:

10 ^ 10 % 13

It was supposed to be multiplication, not exponentiation. Before I could move my fingers to abort what I expected to be a forever-running program, it answered my request:

3
{ iterations: 11523, applications: 5748, used_memory: 27729 }

real    0m0.104s
user    0m0.086s
sys     0m0.019s

With my "bug alert" flashing, I went to Google and verified, 10^10%13 == 3 indeed. But the λ-calculator wasn't supposed to find that result, it can barely store 10^10. I started stressing it, for science. It instantly answered me 20^20%13 == 3, 50^50%13 == 4, 60^60%3 == 0. I had to use external tools to verify those results, since Haskell itself wasn't able to compute it (due to integer overflow) (it is if you use Integers not Ints, of course!). Pushing it to its limits, this was the answer to 200^200%31:

5
{ iterations: 10351327, applications: 5175644, used_memory: 23754870 }

real    0m4.025s
user    0m3.686s
sys     0m0.341s
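
For the record, the arbitrary-precision check mentioned above is trivial with Haskell's Integer (a throwaway snippet, just to double-check the values):

-- With machine Ints these powers overflow; with Integer they don't.
checks :: [Integer]
checks = [ (10 ^ 10)   `mod` 13   -- 3
         , (20 ^ 20)   `mod` 13   -- 3
         , (50 ^ 50)   `mod` 13   -- 4
         , (200 ^ 200) `mod` 31   -- 5
         ]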

If we had one copy of the universe for each atom in the universe, and a computer for each atom in all of those universes, we still couldn't store the church number 200^200. This prompted me to wonder whether my Mac was really that powerful. Maybe the optimal evaluator was able to skip the unnecessary branches and arrive right at the answer, in the same fashion Haskell does with lazy evaluation. To test this, I compiled the λ program to Haskell:

-- A Term is either a function on Terms or a native Double.
data Term = F !(Term -> Term) | N !Double
instance Show Term where {
    show (N x) = "(N "++(if fromIntegral (floor x) == x then show (floor x) else show x)++")";
    show (F _) = "(λ...)"}
infixl 0 #
(F f) # x = f x -- application
-- Builds the Church numeral corresponding to a native number.
churchNum = F(\(N n)->F(\f->F(\x->if n<=0 then x else (f#(churchNum#(N(n-1))#f#x)))))
-- The exponential-modulus function, mechanically compiled from the λ-term:
expMod    = (F(\v0->(F(\v1->(F(\v2->((((((churchNum # v2) # (F(\v3->(F(\v4->(v3 # (F(\v5->((v4 # (F(\v6->(F(\v7->(v6 # ((v5 # v6) # v7))))))) # v5))))))))) # (F(\v3->(v3 # (F(\v4->(F(\v5->v5)))))))) # (F(\v3->((((churchNum # v1) # (churchNum # v0)) # ((((churchNum # v2) # (F(\v4->(F(\v5->(F(\v6->(v4 # (F(\v7->((v5 # v7) # v6))))))))))) # (F(\v4->v4))) # (F(\v4->(F(\v5->(v5 # v4))))))) # ((((churchNum # v2) # (F(\v4->(F(\v5->v4))))) # (F(\v4->v4))) # (F(\v4->v4))))))) # (F(\v3->(((F(\(N x)->F(\(N y)->N(x+y)))) # v3) # (N 1))))) # (N 0))))))))
main = print $ (expMod # N 5 # N 5 # N 4)

This correctly outputs 1 (that is, 5 ^ 5 % 4) - but throw anything above 10^10 at it and it will hang, which eliminates that hypothesis.

The optimal evaluator I used is a 160-line, unoptimized JavaScript program that included no kind of modular-exponentiation arithmetic, and the lambda-calculus modulus function I used was equally simple:

(λab.(b(λcd.(c(λe.(d(λfg.(f(efg)))e))))(λc.(c(λde.e)))(λc.(a(b(λdef.(d(λg.(egf))))(λd.d)(λde.(ed)))(b(λde.d)(λd.d)(λd.d))))))

I used no specific modular arithmetic algorithm or formula. So, how is the optimal evaluator able to arrive at the right answers?


Solution 1:

The phenomenon comes from the amount of shared beta-reduction steps, which can be dramatically different between Haskell-style lazy evaluation (or usual call-by-value, which is not that far off in this respect) and Vuillemin-Lévy-Lamping-Kathail-Asperti-Guerrini-(et al.) "optimal" evaluation. This is a general feature, completely independent of the arithmetic formulas you could use in this particular example.

Sharing means having a representation of your lambda-term in which one "node" can describe several similar parts of the actual lambda-term you represent. For instance, you can represent the term

\x. x ((\y.y)a) ((\y.y)a)

using a (directed acyclic) graph in which there is only one occurrence of the subgraph representing (\y.y)a, and two edges targeting that subgraph. In Haskell terms, you have one thunk that you evaluate only once, and two pointers to this thunk.
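
In Haskell, that thunk is just a let binding; a rough sketch (the name term is illustrative, and the trace call only serves to show that the shared thunk is forced once):

import Debug.Trace (trace)

-- One thunk, two pointers: the trace message fires only once,
-- even though ya is used twice.
term :: (Int -> Int -> Int) -> Int
term x = let ya = trace "forcing (\\y.y) a" (id 42)
         in x ya ya

main :: IO ()
main = print (term (+)) -- the trace prints once, then 84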

Haskell-style memoization implements sharing of complete subterms. This level of sharing can be represented by directed acyclic graphs. Optimal sharing does not have this restriction: it can also share "partial" subterms, which may imply cycles in the graph representation.

To see the difference between these two levels of sharing, consider the term

\x. (\z.z) ((\z.z) x)

If your sharing is restricted to complete subterms, as is the case in Haskell, you may have only one occurrence of \z.z, but the two beta-redexes here will be distinct: one is (\z.z) x and the other is (\z.z) ((\z.z) x), and since they are not equal terms they cannot be shared.

If the sharing of partial subterms is allowed, then it becomes possible to share the partial term (\z.z) [] (that is, not just the function \z.z, but "the function \z.z applied to something"), which evaluates in one step to just that something, whatever the argument is. Hence you can have a graph in which only one node represents the two applications of \z.z to two distinct arguments, and in which these two applications can be reduced in just one step. Note that there is a cycle on this node, since the argument of the "first occurrence" is precisely the "second occurrence". Finally, with optimal sharing you can go from (a graph representing) \x. (\z.z) ((\z.z) x) to (a graph representing) the result \x.x in just one step of beta-reduction (plus some bookkeeping). This is basically what happens in your optimal evaluator (and the graph representation is also what prevents space explosion).
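
Spelled out on this example: without sharing of partial terms, reaching the normal form takes two separate beta steps,

\x. (\z.z) ((\z.z) x)   →   \x. (\z.z) x   →   \x. x

whereas with the shared node for (\z.z) [] the two applications are contracted together, and a single beta step (plus bookkeeping) goes directly to \x.x.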

For slightly extended explanations, you can look at the paper Weak Optimality, and the Meaning of Sharing (what you are interested in is the introduction and Section 4.1, and maybe some of the bibliographic pointers at the end).

Coming back to your example, the coding of arithmetic functions on Church integers is one of the "well-known" mines of examples where optimal evaluators can perform better than mainstream languages (in this sentence, "well-known" actually means that a handful of specialists are aware of these examples). For more such examples, take a look at the paper Safe Operators: Brackets Closed Forever by Asperti and Chroboczek (and, by the way, you will find there interesting lambda-terms that are not EAL-typeable; so I'm encouraging you to take a look at oracles, starting with this Asperti/Chroboczek paper).

As you said yourself, this kind of encoding is utterly impractical, but it still represents a nice way of understanding what is going on. And let me conclude with a challenge for further investigation: will you be able to find an example on which optimal evaluation of these supposedly bad encodings is actually on par with traditional evaluation of a reasonable data representation? (As far as I know, this is a real open question.)

Solution 2:

This isn't an answer, but it's a suggestion of where you might start looking.

There's a trivial way to calculate modular exponentiations in little space, specifically by rewriting

(a * x ^ y) % z

as

(((a * x) % z) * x ^ (y - 1)) % z

If an evaluator evaluates like this and keeps the accumulating parameter a in normal form, then you will avoid using too much space. If your evaluator is indeed optimal, then presumably it must not do any more work than this one, so in particular it can't use more space than the time this one takes to evaluate.
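
For concreteness, this is how that accumulating rewriting looks in Haskell (a sketch only; the name powMod is illustrative). It is linear in the exponent, but it never builds the huge intermediate power:

-- Keep the accumulator reduced modulo z at every step, mirroring
-- (a * x^y) % z  ==  (((a * x) % z) * x^(y - 1)) % z
powMod :: Integer -> Integer -> Integer -> Integer
powMod x y z = go 1 y
  where
    go a 0 = a `mod` z
    go a n = go ((a * x) `mod` z) (n - 1)

-- e.g. powMod 200 200 31 == 5, matching the result in the question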

I'm not really sure what an optimal evaluator actually is, so I'm afraid I can't make this more rigorous.