When can Hotspot allocate objects on the stack? [duplicate]

I have done some experimentation in order to see when Hotspot is able to stack allocate. It turns out that its stack allocation is quite a bit more limited than what you might expect based on the available documentation. The referenced paper by Choi "Escape Analysis for Java" suggests that an object that is only ever assigned to local variables can always be stack allocated. But that is not true.

All of this are implementation details of the current Hotspot implementation, so they could change in future versions. This refers to my OpenJDK install which is version 1.8.0_121 for X86-64.

The short summary, based on quite a bit of experimentation, seems to be:

Hotspot can stack-allocate an object instance if

  • all its uses are inlined
  • it is never assigned to any static or object fields, only to local variables
  • at each point in the program, which local variables contain references to the object must be JIT-time determinable, and not depend on any unpredictable conditional control flow.
  • If the object is an array, its size must be known at JIT time and indexing into it must use JIT-time constants.

To know when these conditions hold you need to know quite a bit about how Hotspot works. Relying on Hotspot to definately do stack allocation in a certain situation can be risky, as a lot of non-local factors are involved. Especially knowing if everything is inlined can be difficult to predict.

Practically speaking, simple iterators will usually be stack allocatable if you just use them to iterate. For composite objects only the outer object can ever be stack allocated, so lists and other collections always cause heap allocation.

If you have a HashMap<Integer,Something> and you use it in myHashMap.get(42), the 42 may stack allocate in a test program, but it will not in a full application because you can be sure that there will be more than two types of key objects in HashMaps in the entire program, and therefore the hashCode and equals methods on the key won't inline.

Beyond that I don't see any generally applicable rules, and it will depend on the specifics of the code.

Hotspot internals

The first important thing to know is that escape analysis is performed after inlining. This means that Hotspot's escape analysis is in this respect more powerful than the description in the Choi paper, since an object returned from a method but local to the caller method can still be stack allocated. Because of this iterators can nearly always be stack allocated if you do e.g. for(Foo item : myList) {...} (and the implementation of myList.iterator() is simple enough, which they usually are.)

Hotspot only compiles optimized versions of methods once it determines the method is 'hot', so code that is not run a lot of times does not get optimized at all, in which case there is no stack allocation or inlining whatsoever. But for those methods you usually don't care.

Inlining

Inlining decisions are based on profiling data that Hotspot collects first. The declared types do not matter so much, even if a method is virtual Hotspot can inline it based on the types of the objects it sees during profiling. Something similar holds for branches (i.e. if-statements and other control flow constructs): If during profiling Hotspot never sees a certain branch being taken, it will compile and optimize the code based on the assumption that the branch is never taken. In both cases, if Hotspot cannot prove that its assumptions will always be true, it will insert checks in the compiled code known as 'uncommon traps', and if such a trap is hit Hotspot will de-optimize and possibly re-optimize taking the new information into account.

Hotspot will profile which object types occur as receivers at which call sites. If Hotspot only sees a single type or only two distinct types occuring at a call site, it is able to inline the called method. If there are only one or two very common types and other types occur much less often Hotspot should also still be able to inline the methods of the common types, including a check for which code it needs to take. (I'm not entirely sure about this last case with one or two common types and more uncommon types though). If there are more than two common types, Hotspot will not inline the call at all but instead generate machine code for an indirect call.

'Type' here refers to the exact type of an object. Implemented interfaces or shared superclasses are not taken into account. Even if different receiver types occur at a call site but they all inherit the same implementation of a method (e.g. multiple classes that all inherit hashCode from Object), Hotspot will still generate an indirect call and not inline. (So i.m.o. hotspot is quite stupid in such cases. I hope future versions improve this.)

Hotspot will also only inline methods that are not too big. 'Not too big' is determined by the -XX:MaxInlineSize=n and -XX:FreqInlineSize=n options. Inlinable methods with a JVM bytecode size below MaxInlineSize are always inlined, methods with a JVM bytecode size below FreqInlineSize are inlined if the call is 'hot'. Larger methods are never inlined. By default MaxInlineSize is 35 and FreqInlineSize is platform dependent but for me it is 325. So make sure your methods are not too big if you want them inlined. It can sometimes help to split out the common path from a large method, so that it can be inlined into its callers.

Profiling

One important thing to know about profiling is that profiling sites are based on the JVM bytecode, which itself is not inlined in any way. So if you have e.g. a static method

static <T,U> List<U> map(List<T> list, Function<T,U> func) {
    List<U> result = new ArrayList();
    for(T item : list) { result.add(func.call(item)); }
    return result; 
}

that maps a SAM Function callable over a list and returns the transformed list, Hotspot will treat the call to func.call as a single program-wide call site. You might call this map function at several spots in your program, passing a different func in at each call site (but the same one for one call site). In that case you might expect that Hotspot is able to inline map, and then also the call to func.call since at every use of map there is only a single func type. If this were so, Hotspot would be able to optimize the loop down very tightly. Unfortunately Hotspot is not smart enough for that. It only keeps a single profile for the func.call call site, lumping all the func types that you pass to map together. You will probably use more than two different implementations of func, so Hotspot will not be able to inline the call to func.call. Link for more details, and archived link as the original appears to be gone.

(As an aside, in Kotlin the equivalent loop can be fully inlined as the Kotlin compiler can do inlining of calls at the bytecode level. So for some uses it could be significantly faster than Java.)

Scalar Replacement

Another important thing to know is that Hotspot does not actually implement stack allocation of objects. Instead it implements scalar replacement, which means that an object is deconstructed into its constituent fields and those fields are stack allocated like normal local variables. This means that there is no object left at all. Scalar replacement only works if there is never a need to create a pointer to the stack-allocated object. Some forms of stack allocation in e.g. C++ or Go would be able to allocate full objects on the stack and then pass references or pointers to them to called functions, but in Hotspot this does not work. Therefore if there is ever a need to pass an object reference to a non-inlined method, even if the reference would not escape the called method, Hotspot will always heap-allocate such an object.

In principle Hotspot could be smarter about this, but right now it is not.

Test program

I used the following program and variations to see when Hotspot will do scalar replacement.

// Minimal example for which the JVM does not scalarize the allocation. If field is final, or the second allocation is unconditional, it will.

class Scalarization {

        int field = 0xbd;
        long foo(long i) { return i * field; }


        public static void main(String[] args) {
                long result = 0;
                for(long i=0; i<100; i++) {
                        result += test();
                }
                System.out.println("Result: "+result);
        }


        static long test() {
                long ctr = 0x5;
                for(long i=0; i<0x10000; i++) {

                Scalarization s = new Scalarization();
                ctr = s.foo(ctr);
                if(i == 0) s = new Scalarization();
                ctr = s.foo(ctr);
                }
                return ctr;
        }
}

If you compile and run this program with javac Scalarization.java; java -verbose:gc Scalarization you can see if scalar replacement worked by the number of garbage collections. If scalar replacement works, no garbage collection happened on my system, if scalar replacement did not work I see a few garbage collections.

Variants that Hotspot is able to scalarize run significantly faster than versions where it does not. I verified the generated machine code (instructions) to make sure Hotspot was not doing any unexpected optimizations. If hotspot is able to scalar replace the allocations, it can then also do some additional optimizations on the loop, unrolling it a few iterations and then combining those iterations together. So in the scalarized versions the effective loop count is lower with each iteraton doing the work of multiple source code level iterations. So the speed difference is not only due to allocation and garbage collection overhead.

Observations

I tried a number of variations on the above program. One condition for scalar replacement is that the object must never be assigned to an object (or static) field, and presumably also not into an array. So in code like

Foo f = new Foo();
bar.field = f;

the Foo object cannot be scalar replaced. This holds even if bar itself is scalar replaced, and also if you never again use bar.field. So an object can only ever be assigned to local variables.

That alone is not enough, Hotspot must also be able to determine statically at JIT-time which object instance will be the target of a call. For example, using the following implementations of foo and test and removing field causes heap allocation:

long foo(long i) { return i * 0xbb; }

static long test() {
    long ctr = 0x5;
    for(long i=0; i<0x10000; i++) {
        Scalarization s = new Scalarization();
        ctr = s.foo(ctr);
        if(i == 50) s = new Scalarization();
        ctr = s.foo(ctr);
    }
    return ctr;
}

While if you then remove the conditional for the second assignment no more heap allocation occurs:

static long test() {
    long ctr = 0x5;
    for(long i=0; i<0x10000; i++) {
        Scalarization s = new Scalarization();
        ctr = s.foo(ctr);
        s = new Scalarization();
        ctr = s.foo(ctr);
    }
    return ctr;
}

In this case Hotspot can determine statically which instance is the target for each call to s.foo.

On the other hand, even if the second assignment to s is a subclass of Scalarization with a completely different implementation, as long as the assignment is unconditional Hotspot will still scalarize the allocations.

Hotspot does not appear to be able to move an object to the heap that was previously scalar replaced (at least not without deoptimizing). Scalar replacement is an all-or-nothing affair. So in the original test method both allocations of Scalarization always happen on the heap.

Conditionals

One important detail is that Hotspot will predict conditionals based on its profiling data. If a conditional assignment is never executed, Hotspot will compile code under that assumption, and then might be able to do scalar replacement. If at a later point in time the condtion does get taken, Hotspot will need to recompile the code with this new assumption. The new code will not do scalar replacement since Hotspot can no longer determine the receiver instance of following calls statically.

For instance in this variant of test:

static long limit = 0;

static long test() {
    long ctr = 0x5;
    long i = limit;
    limit += 0x10000;
    for(; i<limit; i++) { // In this form if scalarization happens is nondeterministic: if the condition is hit before profiling starts scalarization happens, else not.

        Scalarization s = new Scalarization();
        ctr = s.foo(ctr);
        if(i == 0xf9a0) s = new Scalarization();
        ctr = s.foo(ctr);
    }
    return ctr;
}

the conditional assignemnt is only executed once during the lifetime of the program. If this assignment occurs early enough, before Hotspot starts full profiling of the test method, Hotspot never notices the conditional being taken and compiles code that does scalar replacement. If profiling has already started when the conditional is taken, Hotspot will not do scalar replacement. With the test value of 0xf9a0, whether scalar replacement happens is nondeterministic on my computer, since exactly when profiling starts can vary (e.g. because profiling and optimized code is compiled on background threads). So if I run the above variant it sometimes does a few garbage collections, and sometimes does not.

Hotspot's static code analysis is much more limited than what C/C++ and other static compilers can do, so Hotspot is not as smart in following the control flow in a method through several conditionals and other control structures to determine the instance that a variable refers to, even if it would be statically determinable for the programmer or a smarter compiler. In many cases the profiling information will make up for that, but it is something to be aware of.

Arrays

Arrays can be stack allocated if their size is known at JIT time. However indexing into an array is not supported unless Hotspot can also statically determine the index value at JIT-time. So stack allocated arrays are pretty useless. Since most programs don't use arrays directly but use the standard collections this is not very relevant, as embedded objects such as the array containing the data within an ArrayList already need to be heap-allocated due to their embedded-ness. I suppose the reasoning for this restriction is that there exists no indexing operation on local variables so this would require additional code generation functionality for a pretty rare use case.