Does PyPy translate itself?

Am I getting this straight? Does the PyPy interpreter actually interpret itself and then translate itself?

So here's my current understanding:

  • RPython's toolchain involves partially executing the program to be translated to get a sort of preprocessed version to annotate and translate.
  • The PyPy interpreter, running on top of CPython, executes to partially interpret itself, at which point it hands control off to its RPython half, which performs the translation?

If this is true, then this is one of the most mind-bending things I have ever seen.


Solution 1:

PyPy's translation process is actually much less conceptually recursive than it sounds.

Really, all it is is a Python program that processes Python function/class/other objects (not Python source code) and outputs C code. But of course it doesn't process just any Python objects; it can only handle particular forms, which are what you get if you write your to-be-translated code in RPython.
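
To give a rough, made-up illustration of what "particular forms" means (the real RPython rules are more involved), the key restriction is roughly that each variable must keep one inferable type:

def longest(words):
    # fine as RPython-ish code: words is always a list of strings and
    # best is always an int, so the types can be inferred
    best = 0
    for w in words:
        if len(w) > best:
            best = len(w)
    return best

def not_translatable(flag):
    # ordinary Python, but not RPython: x is sometimes an int and
    # sometimes a string, so no single type can be assigned to it
    if flag:
        x = 1
    else:
        x = "one"
    return x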

Since the translation toolchain is a Python program, you can run it on top of any Python interpreter, which obviously includes PyPy's python interpreter. So that's nothing special.

Since it translates RPython objects, you can use it to translate PyPy's python interpreter, which is written in RPython.

But you can't run it on the translation framework itself, which is not RPython. Only PyPy's python interpreter itself is RPython.

Things only get interesting because RPython code is also Python code (but not the reverse), and because RPython doesn't ever "really exist" in source files, but only in memory inside a working Python process that necessarily includes other non-RPython code (there are no "pure-RPython" imports or function definitions, for example, because the translator operates on functions that have already been defined and imported).

Remember that the translation toolchain operates on in-memory Python code objects. Python's execution model means that these don't exist before some Python code has been running. You can imagine that starting the translation process looks a bit like this, if you highly simplify it:

from my_interpreter import main
from pypy import translate

translate(main)

As we all know, just importing main is going to run lots of Python code, including all the other modules my_interpreter imports. But the translation process starts analysing the function object main; it never sees, and doesn't care about, whatever code was executed to come up with main.
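
As a made-up sketch of what my_interpreter might look like (the names here are invented, not PyPy's), the module-level code is free to be arbitrary Python, because it runs on the hosting interpreter at import time; only what is reachable from main gets analysed:

# my_interpreter.py -- illustrative only
# This runs at import time under the hosting Python interpreter and can
# use any Python feature it likes; the translator never sees it run.
OPCODES = {name: i for i, name in enumerate(["LOAD", "STORE", "ADD"])}

def main(argv):
    # Only this function object (and everything reachable from it) is
    # handed to the translator, so only this part has to be RPython.
    return 0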

One way to think of this is that "programming in RPython" means "writing a Python program which generates an RPython program and then feeds it to the translation process". That's relatively easy to understand and is kind of similar to how many other compilers work (e.g. one way to think of programming in C is that you are essentially writing a C pre-processor program that generates a C program, which is then fed to the C compiler).
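
As a made-up sketch of that way of thinking (not taken from PyPy itself), full Python can run at import time to build the function objects that will later be translated; the builder itself is never translated:

def make_adder(n):
    # ordinary Python "preprocessor" code: it runs before translation starts
    def adder(x):
        return x + n          # the functions it builds are the RPython part
    return adder

add_one = make_adder(1)
add_ten = make_adder(10)
# The translator would later be pointed at code that calls add_one/add_ten;
# make_adder itself never needs to be RPython.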

Things only get confusing in the PyPy case because all 3 components (the Python program which generates the RPython program, the RPython program, and the translation process) are loaded into the same Python interpreter. This means it's quite possible to have functions that are RPython when called with some arguments and not when called with other arguments, to call helper functions from the translation framework as part of generating your RPython program, and lots of other weird things. So the situation gets rather blurry around the edges, and you can't necessarily divide your source lines cleanly into "RPython to be translated", "Python generating my RPython program" and "handing the RPython program over to the translation framework".
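
As a small, made-up illustration of that blurriness:

def double(x):
    return x + x              # works on ints and on strings in plain Python

def entry_point(argv):
    return double(21)         # double only ever sees ints here, so it
                              # annotates cleanly as RPython

def blurry_entry_point(argv):
    n = double(21)
    s = double("=")           # the same function called with an int *and*
    print(s)                  # a string: the annotator can no longer give
    return n                  # double a single consistent type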


The PyPy interpreter, running on top of CPython, executes to partially interpret itself

What I think you're alluding to here is PyPy's use of the flow object space during translation, to do abstract interpretation. Even this isn't as crazy and mind-bending as it seems at first. I'm much less informed about this part of PyPy, but as I understand it:

PyPy implements all of the operations of a Python interpreter by delegating them to an "object space", which contains an implementation of all the basic built-in operations. But you can plug in different object spaces to get different effects, and so long as they implement the same "object space" interface the interpreter will still be able to "execute" Python code.
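
A highly simplified, made-up sketch of that idea (the real object space interface is far richer):

class StdLikeSpace(object):
    # a "normal" object space: actually carries out the operations
    def add(self, w_a, w_b):
        return w_a + w_b

def interpret_add(space, w_x, w_y):
    # the interpreter never does the arithmetic itself; it always
    # delegates to whatever object space it was given
    return space.add(w_x, w_y)

print(interpret_add(StdLikeSpace(), 2, 3))   # 5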

The RPython code objects that the PyPy translation toolchain processes are ordinary Python code that could be executed by an interpreter. So PyPy re-uses part of its Python interpreter as part of the translation toolchain, by plugging in the flow object space. When "executing" code with this object space, the interpreter doesn't actually carry out the operations of the code; it instead produces flow graphs, which are analogous to the sorts of intermediate representation used by many other compilers. They're just a simple machine-manipulable representation of the code, to be further processed. This is how regular (R)Python code objects get turned into the input for the rest of the translation process.
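
Continuing the same made-up sketch (this is not the real FlowObjSpace, just the shape of the idea), a flow-like object space records each operation instead of performing it, so "running" the interpreter yields a little graph-like trace:

class FlowLikeSpace(object):
    def __init__(self):
        self.operations = []
    def add(self, w_a, w_b):
        w_res = "v%d" % len(self.operations)          # fresh result variable
        self.operations.append(("add", w_a, w_b, w_res))
        return w_res

def interpret_add(space, w_x, w_y):                   # same shape as above
    return space.add(w_x, w_y)

space = FlowLikeSpace()
interpret_add(space, "x", "y")
print(space.operations)   # [('add', 'x', 'y', 'v0')] -- a tiny flow-graph-ish trace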

Since the usual thing that is translated with the translation process is PyPy's Python interpreter, it indeed "interprets itself" with the flow object space. But all that really means is that you have a Python program that is processing Python functions, including the ones doing the processing. In itself it isn't any more mind-bending than applying a decorator to itself, or having a wrapper-class wrap an instance of itself (or wrap the class itself).
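
(For comparison, here is the decorator version of that non-paradox: a decorator is just a function, so nothing stops you from passing it through itself.)

def logged(f):
    def wrapper(*args, **kwargs):
        print("calling", f.__name__)
        return f(*args, **kwargs)
    return wrapper

logged = logged(logged)        # the decorator, decorated by itself

@logged                        # prints "calling logged" while decorating...
def greet():
    print("hello")

greet()                        # ...and "calling greet" followed by "hello"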


Um, that got a bit rambly. I hope it helps, anyway, and I hope I haven't said anything inaccurate; please correct me if I have.

Solution 2:

Disclaimer: I'm not an expert on PyPy - in particular, I don't understand the details of the RPython translation; I'm only citing stuff that I've read before. For a more specific post on how RPython translation may work, check out this answer.

The answer is, yes, it can (but only after it was first compiled using CPython).

Longer description:

At first it seems highly mind-bending and paradoxical, but once you understand it, it's easy. Check out the article on Wikipedia.

Bootstrapping in program development began during the 1950s when each program was constructed on paper in decimal code or in binary code, bit by bit (1s and 0s), because there was no high-level computer language, no compiler, no assembler, and no linker. A tiny assembler program was hand-coded for a new computer (for example the IBM 650) which converted a few instructions into binary or decimal code: A1. This simple assembler program was then rewritten in its just-defined assembly language but with extensions that would enable the use of some additional mnemonics for more complex operation codes.

The process is called software bootstrapping. Basically, you build one tool, say a C++ compiler, in a lower-level language which already exists (everything at one point had to be coded in binary), say ASM. Now that you have C++ in existence, you can code a C++ compiler in C++, then use the ASM-built C++ compiler to compile your new one. Once you have your new compiler compiled, you can use it to compile itself.

So basically, you make the first computer tool ever by hand-coding it, use that tool to make another, slightly better one, use that one to make a better one still... and eventually you get all the complex software we have today! :)

Another interesting case, is the CoffeeScript language, which is written in... CoffeeScript. (Although this use case still requires the use of an external interpreter, namely Node.js)

The PyPy interpreter, running on top of CPython, executes to partially interpret itself, at which point it hands control off to its RPython half, which performs the translation?

You can compile PyPy using an already-compiled PyPy interpreter, or you can use CPython to compile it instead. However, since PyPy now has a JIT, it'll be faster to compile PyPy using itself than using CPython. (PyPy is now faster than CPython in most cases.)