How are coroutines implemented in JVM langs without JVM support?
This question came up after reading the Loom proposal, which describes an approach of implementing coroutines in the Java programming language.
Particularly this proposal says that to implement this feature in the language, additional JVM support will be required.
As I understand it there are already several languages on the JVM that have coroutines as part of their feature set such as Kotlin and Scala.
So how is this feature implemented without additional support and can it be implemented efficiently without it?
Solution 1:
tl;dr Summary:
Particularly this proposal says that to implement this feature in the language the additional JVM support will be required.
When they say "required", they mean "required in order to be implemented in such a way that it is both performant and interoperable between languages".
So how this feature is implemented without additional support
There are many ways, the most easy to understand how it can possibly work (but not necessarily easiest to implement) is to implement your own VM with your own semantics on top of the JVM. (Note that is not how it is actually done, this is only an intuition as to why it can be done.)
and can it be implemented efficiently without it ?
Not really.
Slightly longer explanation:
Note that one goal of Project Loom is to introduce this abstraction purely as a library. This has three advantages:
- It is much easier to introduce a new library than it is to change the Java programming language.
- Libraries can immediately be used by programs written in every single language on the JVM, whereas a Java language feature can only be used by Java programs.
- A library with the same API that does not use the new JVM features can be implemented, which will allow you to write code that runs on older JVMs with a simple re-compile (albeit with less performance).
However, implementing it as a library precludes clever compiler tricks turning co-routines into something else, because there is no compiler involved. Without clever compiler tricks, getting good performance is much harder, ergo, the "requirement" for JVM support.
Longer explanation:
In general, all of the usual "powerful" control structures are equivalent in a computational sense and can be implemented using each other.
The most well-known of those "powerful" universal control-flow structures is the venerable GOTO
, another one are Continuations. Then, there are Threads and Coroutines, and one that people don't often think about, but that is also equivalent to GOTO
: Exceptions.
A different possibility is a re-ified call stack, so that the call-stack is accessible as an object to the programmer and can be modified and re-written. (Many Smalltalk dialects do this, for example, and it is also kind-of like how this is done in C and assembly.)
As long as you have one of those, you can have all of those, by just implementing one on top of the other.
The JVM has two of those: Exceptions and GOTO
, but the GOTO
in the JVM is not universal, it is extremely limited: it only works inside a single method. (It is essentially intended only for loops.) So, that leaves us with Exceptions.
So, that is one possible answer to your question: you can implement co-routines on top of Exceptions.
Another possibility is to not use the JVM's control-flow at all and implement your own stack.
However, that is typically not the path that is actually taken when implementing co-routines on the JVM. Most likely, someone who implements co-routines would choose to use Trampolines and partially re-ify the execution context as an object. That is, for example, how Generators are implemented in C♯ on the CLI (not the JVM, but the challenges are similar). Generators (which are basically restricted semi-co-routines) in C♯ are implemented by lifting the local variables of the method into fields of a context object and splitting the method into multiple methods on that object at each yield
statement, converting them into a state machine, and carefully threading all state changes through the fields on the context object. And before async
/await
came along as a language feature, a clever programmer implemented asynchronous programming using the same machinery as well.
HOWEVER, and that is what the article you pointed to most likely referred to: all that machinery is costly. If you implement your own stack or lift the execution context into a separate object, or compile all your methods into one giant method and use GOTO
everywhere (which isn't even possible because of the size limit on methods), or use Exceptions as control-flow, at least one of these two things will be true:
- Your calling conventions become incompatible with the JVM stack layout that other languages expect, i.e. you lose interoperability.
- The JIT compiler has no idea what the hell your code is doing, and is presented with byte code patterns, execution flow patterns, and usage patterns (e.g. throwing and catching ginormous amounts of exceptions) it doesn't expect and doesn't know how to optimize, i.e. you lose performance.
Rich Hickey (the designer of Clojure) once said in a talk: "Tail Calls, Performance, Interop. Pick Two." I generalized this to what I call Hickey's Maxim: "Advanced Control-Flow, Performance, Interop. Pick Two."
In fact, it is generally hard to achieve even one of interop or performance.
Also, your compiler will become more complex.
All of this goes away, when the construct is available natively in the JVM. Imagine, for example, if the JVM didn't have Threads. Then, every language implementation would create its own Threading library, which is hard, complex, slow, and doesn't interoperate with any other language implementation's Threading library.
A recent, and real-world, example are lambdas: many language implementations on the JVM had lambdas, e.g. Scala. Then Java added lambdas as well, but because the JVM doesn't support lambdas, they must be encoded somehow, and the encoding that Oracle chose was different from the one Scala had chosen before, which meant that you couldn't pass a Java lambda to a Scala method expecting a Scala Function
. The solution in this case was that the Scala developers completely re-wrote their encoding of lambdas to be compatible with the encoding Oracle had chosen. This actually broke backwards-compatibility in some places.
Solution 2:
From the Kotlin Documentation on Coroutines (emphasis mine):
Coroutines simplify asynchronous programming by putting the complications into libraries. The logic of the program can be expressed sequentially in a coroutine, and the underlying library will figure out the asynchrony for us. The library can wrap relevant parts of the user code into callbacks, subscribe to relevant events, schedule execution on different threads (or even different machines!), and the code remains as simple as if it was sequentially executed.
Long story short, they are compiled down to code that uses callbacks and a state machine to handle suspending and resuming.
Roman Elizarov, the project lead, gave two fantastic talks at KotlinConf 2017 on this subject. One is an Introduction to Coroutines, the second is a Deep Dive on Coroutines.
Solution 3:
Coroutines do not rely on features of the operating system or the JVM. Instead, coroutines and suspend
functions are transformed by the compiler producing a state machine capable of handling suspensions in general and passing around suspending coroutines keeping their state. This is enabled by Continuations, which are added as a parameter to each and every suspending function by the compiler; this technique is called “Continuation-passing style”(CPS).
One example can be observed in the transformation of suspend
functions:
suspend fun <T> CompletableFuture<T>.await(): T
The following shows its signature after CPS transformation:
fun <T> CompletableFuture<T>.await(continuation: Continuation<T>): Any?
If you want to know the hard details, you need to read this explanation.
Solution 4:
The Project Loom was preceded by the Quasar library by the same author.
Here is a quote from it's docs:
Internally, a fiber is a continuation which is then scheduled in a scheduler. A continuation captures the instantaneous state of a computation, and allows it to be suspended and then resumed at a later time from the point where it was suspended. Quasar creates continuations by instrumenting (at the bytecode level) suspendable methods. For scheduling, Quasar uses ForkJoinPool, which is a very efficient, work-stealing, multi-threaded scheduler.
Whenever a class is loaded, Quasar’s instrumentation module (usually run as a Java agent) scans it for suspendable methods. Every suspendable method f is then instrumented in the following way: It is scanned for calls to other suspendable methods. For every call to a suspendable method g, some code is inserted before (and after) the call to g that saves (and restores) the state of a local variables to the fiber’s stack (a fiber manages its own stack), and records the fact that this (i.e. the call to g) is a possible suspension point. At the end of this “suspendable function chain”, we’ll find a call to Fiber.park. park suspends the fiber by throwing a SuspendExecution exception (which the instrumentation prevents you from catching, even if your method contains a catch(Throwable t) block).
If g indeed blocks, the SuspendExecution exception will be caught by the Fiber class. When the fiber is awakened (with unpark), method f will be called, and then the execution record will show that we’re blocked at the call to g, so we’ll immediately jump to the line in f where g is called, and call it. Finally, we’ll reach the actual suspension point (the call to park), where we’ll resume execution immediately following the call. When g returns, the code inserted in f will restore f’s local variables from the fiber stack.
This process sounds complicated, but its incurs a performance overhead of no more than 3%-5%.
It seems that almost all pure java continuation libraries used similar bytecode instrumentation approach to capture and restore local variables on the stack frames.
Only Kotlin and Scala compilers were brave enough to implement more detached and potentially more performant approach with CPS transformations to state machines mentioned in some other answers here.