How exactly is Python Bytecode Run in CPython?
Solution 1:
Yes, your understanding is correct. There is basically (very basically) a giant switch statement inside the CPython interpreter that says "if the current opcode is so and so, do this and that".
http://hg.python.org/cpython/file/3.3/Python/ceval.c#l790
Other implementations, like PyPy, have JIT compilation, i.e. they translate frequently executed Python code to machine code on the fly.
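To make the "giant switch statement" idea concrete, here is a toy sketch of such a dispatch loop in Python. It is not CPython's code: the opcode names are borrowed from CPython, but the (opcode, argument) tuples, the run function and the names dictionary are assumptions made up for illustration.

```python
def run(instructions, names):
    """Evaluate a list of (opcode, argument) pairs on a value stack."""
    stack = []
    for opcode, arg in instructions:
        # The if/elif chain plays the role of the switch statement in ceval.c.
        if opcode == "LOAD_NAME":
            stack.append(names[arg])      # push the value bound to a name
        elif opcode == "LOAD_CONST":
            stack.append(arg)             # push a constant
        elif opcode == "BINARY_TRUE_DIVIDE":
            right = stack.pop()
            left = stack.pop()
            stack.append(left / right)    # replace two operands with the result
        elif opcode == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    raise RuntimeError("program ended without RETURN_VALUE")


# Hypothetical encoding of the expression i/3.
program = [
    ("LOAD_NAME", "i"),
    ("LOAD_CONST", 3),
    ("BINARY_TRUE_DIVIDE", None),
    ("RETURN_VALUE", None),
]
print(run(program, {"i": 6}))  # 2.0
```

The real loop works on compact byte-encoded instructions and handles far more opcodes, but the shape (fetch an opcode, branch on it, manipulate the stack) is the same.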
Solution 2:
If you want to see the bytecode of some code (whether source code, a live function object or code object, etc.), the dis module will tell you exactly what you need. For example:
>>> import dis
>>> dis.dis('i/3')
  1           0 LOAD_NAME                0 (i)
              3 LOAD_CONST               0 (3)
              6 BINARY_TRUE_DIVIDE
              7 RETURN_VALUE
The dis docs explain what each bytecode means. For example, LOAD_NAME:
    Pushes the value associated with co_names[namei] onto the stack.
To understand this, you have to know that the bytecode interpreter is a virtual stack machine, and what co_names is. The inspect module docs have a nice table showing the most important attributes of the most important internal objects, so you can see that co_names is an attribute of code objects which holds a tuple of the names the bytecode refers to (inside a function these are globals, attributes and the like; a function's own local variable names live in co_varnames instead). In other words, LOAD_NAME 0 pushes the value associated with the 0th name in co_names (and dis helpfully looks this up and sees that the 0th name is 'i').
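You can check this yourself by looking at the code object directly. The following is a small sketch; the function f and the value bound to i are made up for illustration.

```python
# Compile the same expression that was disassembled above.
code = compile("i/3", "<string>", "eval")
print(code.co_names)            # ('i',)

# The bytecode alone is not enough: evaluating it needs a namespace for i.
print(eval(code, {"i": 6}))     # 2.0


def f():
    x = 1
    return x


# Inside a function, locals are listed in co_varnames, not co_names.
print(f.__code__.co_varnames)   # ('x',)
print(f.__code__.co_names)      # ()
```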
That already shows that a string of bytecodes isn't enough on its own; the interpreter also needs the other attributes of the code object, and in some cases attributes of the function object (which is also where the locals and globals environments come from).
The inspect module also has some tools that can help you further in investigating live code.
This is enough to figure out a lot of interesting stuff. For example, you probably know that Python figures out at compile time whether a variable in a function is local, closure, or global, based on whether you assign to it anywhere in the function body (and on any nonlocal or global statements); if you write three different functions and compare their disassembly (and the relevant other attributes) you can pretty easily figure out exactly what it must be doing.
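A minimal sketch of that experiment follows. The function names and the load_ops helper are made up for illustration, and the exact opcode names printed vary between CPython versions, so treat the output as something to explore rather than a fixed result.

```python
import dis

counter = 0  # module-level name


def uses_local():
    x = 1            # assigned in the body, so x is a local
    return x


def uses_global():
    return counter   # never assigned here, so counter is looked up as a global


def make_closure():
    y = 1
    def inner():
        return y     # assigned in the enclosing function, so y is a free variable
    return inner


def load_ops(func):
    """Return the set of LOAD_* opcode names that appear in func's bytecode."""
    return {ins.opname for ins in dis.get_instructions(func)
            if ins.opname.startswith("LOAD_")}


for func in (uses_local, uses_global, make_closure()):
    code = func.__code__
    print(func.__name__, code.co_varnames, code.co_names,
          code.co_freevars, sorted(load_ops(func)))
```

The local shows up in co_varnames, the global in co_names (loaded with LOAD_GLOBAL), and the closure variable in co_freevars (loaded with LOAD_DEREF).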
(The one bit that's tricky here is understanding closure cells. To really get this, you will need to have 3 levels of functions, to see how the one in the middle forwards things along for the innermost one.)
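Here is one way to set up the three-level case, as a sketch with made-up function names. Note that middle never uses x itself, yet x still appears in its co_freevars, because it has to pass the cell through to inner.

```python
def outer():
    x = 1
    def middle():
        def inner():
            return x
        return inner
    return middle


middle = outer()
inner = middle()

print(outer.__code__.co_cellvars)    # ('x',)  outer owns the cell
print(middle.__code__.co_freevars)   # ('x',)  middle only forwards it
print(inner.__code__.co_freevars)    # ('x',)  inner actually reads it

# In CPython both closures share the very same cell object.
print(middle.__closure__[0] is inner.__closure__[0])
print(inner.__closure__[0].cell_contents)   # 1
```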
To understand how the bytecode is interpreted and how the stack machine works (in CPython), you need to look at the ceval.c source code. The answers by thy435 and eyquem already cover this.
Understanding how pyc files are read only takes a bit more information. Ned Batchelder has a great (if slightly out-of-date) blog post called The structure of .pyc files, that covers all of the tricky and not-well-documented parts. (Note that in 3.3, some of the gory code related to importing has been moved from C to Python, which makes it much easier to follow.) But basically, it's just some header info and the module's code object, serialized by marshal.
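As a rough sketch of "header plus marshalled code object", the snippet below writes a tiny module, compiles it to a pyc, and loads the code object back by hand. The 16-byte header size is an assumption that holds for CPython 3.7 and later (older versions used shorter headers), and the file names are made up.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    with open(src, "w") as f:
        f.write("answer = 6 * 7\n")
    # Compile to an explicit path so we know where the pyc ends up.
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, "demo.pyc"))
    with open(pyc, "rb") as f:
        data = f.read()

HEADER_SIZE = 16                          # assumption: CPython 3.7+ layout
magic = data[:4]                          # identifies the bytecode version
code = marshal.loads(data[HEADER_SIZE:])  # the module's code object

namespace = {}
exec(code, namespace)                     # run the module body
print(magic == importlib.util.MAGIC_NUMBER, namespace["answer"])
```

This skips everything the real import system does with the header (checking the source timestamp or hash, for instance), so it is only meant to show the file layout.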
Understanding how source gets compiled to bytecode is the fun part.
Design of CPython's Compiler explains how everything works. (Some of the other sections of the Python Developer's Guide are also useful.)
For the early stuff (tokenizing and parsing), you can just use the ast module to jump right to the point where it's time to do the actual compiling. Then see compile.c for how that AST gets turned into bytecode.
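For instance, you can stop after parsing, look at the tree, and then hand it to the built-in compile function, which runs the same compiler stage from Python. This is a small sketch; the "<ast>" filename and the value bound to i are arbitrary.

```python
import ast
import dis

# Tokenize and parse only: the result is an AST, not bytecode.
tree = ast.parse("i/3", mode="eval")
print(ast.dump(tree))

# compile() accepts an AST and produces the code object.
code = compile(tree, "<ast>", "eval")
dis.dis(code)

print(eval(code, {"i": 6}))   # 2.0
```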
The macros can be a bit tough to work through, but once you grasp the idea of how the compiler uses a stack to descend into blocks, and how it uses compiler_addop and friends to emit bytecodes at the current level, it all makes sense.
One thing that surprises most people at first is the way functions work. The function definition's body is compiled into a code object. Then the function definition itself is compiled into code (inside the enclosing function body, module, etc.) that, when executed, builds a function object from that code object. (Once you think about how closures must work, it's obvious why it works that way. Each instance of the closure is a separate function object with the same code object.)
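You can observe both halves of this from Python. In the sketch below (make_adder and add are made-up names), the inner function's code object is stored as a constant of the enclosing function's code, and every call to make_adder builds a new function object around that same code object.

```python
def make_adder(n):
    def add(x):
        return x + n
    return add


add1 = make_adder(1)
add2 = make_adder(2)

# Two distinct function objects...
print(add1 is add2)                      # False
# ...sharing one code object.
print(add1.__code__ is add2.__code__)    # True

# That code object is a constant of the enclosing function's code.
print(any(const is add1.__code__ for const in make_adder.__code__.co_consts))  # True

# Each function object carries its own closure cell for n.
print(add1(10), add2(10))                # 11 12
```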
And now you're ready to start patching CPython to add your own statements, right? Well, as Changing CPython's Grammar shows, there's a lot of stuff to get right (and there's even more if you need to create new opcodes). You might find it easier to learn PyPy as well as CPython, and start hacking on PyPy first, and only come back to CPython once you know that what you're doing is sensible and doable.