Why is StringBuilder#append(int) faster in Java 7 than in Java 8?

TL;DR: Side effects in append apparently break StringConcat optimizations.

Very good analysis in the original question and updates!

For completeness, below are a few missing steps:

  • See through the -XX:+PrintInlining for both 7u55 and 8u5. In 7u55, you will see something like this:

     @ 16   org.sample.IntStr::inlineSideEffect (25 bytes)   force inline by CompilerOracle
       @ 4   java.lang.StringBuilder::<init> (7 bytes)   inline (hot)
       @ 18   java.lang.StringBuilder::append (8 bytes)   already compiled into a big method
       @ 21   java.lang.StringBuilder::toString (17 bytes)   inline (hot)
    

    ...and in 8u5:

     @ 16   org.sample.IntStr::inlineSideEffect (25 bytes)   force inline by CompilerOracle
       @ 4   java.lang.StringBuilder::<init> (7 bytes)   inline (hot)
         @ 3   java.lang.AbstractStringBuilder::<init> (12 bytes)   inline (hot)
           @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
       @ 18   java.lang.StringBuilder::append (8 bytes)   inline (hot)
         @ 2   java.lang.AbstractStringBuilder::append (62 bytes)   already compiled into a big method
       @ 21   java.lang.StringBuilder::toString (17 bytes)   inline (hot)
         @ 13   java.lang.String::<init> (62 bytes)   inline (hot)
           @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
           @ 55   java.util.Arrays::copyOfRange (63 bytes)   inline (hot)
             @ 54   java.lang.Math::min (11 bytes)   (intrinsic)
             @ 57   java.lang.System::arraycopy (0 bytes)   (intrinsic)
    

    You might notice that 7u55 version is shallower, and it looks like nothing is called after StringBuilder methods -- this is a good indication the string optimizations are in effect. Indeed, if you run 7u55 with -XX:-OptimizeStringConcat, the subcalls will reappear, and performance drops to 8u5 levels.

  • OK, so we need to figure out why 8u5 does not do the same optimization. Grep http://hg.openjdk.java.net/jdk9/jdk9/hotspot for "StringBuilder" to figure out where VM handles the StringConcat optimization; this will get you into src/share/vm/opto/stringopts.cpp

  • hg log src/share/vm/opto/stringopts.cpp to figure out the latest changes there. One of the candidates would be:

    changeset:   5493:90abdd727e64
    user:        iveresov
    date:        Wed Oct 16 11:13:15 2013 -0700
    summary:     8009303: Tiered: incorrect results in VM tests stringconcat...
    
  • Look for the review threads on OpenJDK mailing lists (easy enough to google for changeset summary): http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2013-October/012084.html

  • Spot "String concat optimization optimization collapses the pattern [...] into a single allocation of a string and forming the result directly. All possible deopts that may happen in the optimized code restart this pattern from the beginning (starting from the StringBuffer allocation). That means that the whole pattern must me side-effect free." Eureka?

  • Write out the contrasting benchmark:

    @Fork(5)
    @Warmup(iterations = 5)
    @Measurement(iterations = 5)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class IntStr {
        private int counter;
    
        @GenerateMicroBenchmark
        public String inlineSideEffect() {
            return new StringBuilder().append(counter++).toString();
        }
    
        @GenerateMicroBenchmark
        public String spliceSideEffect() {
            int cnt = counter++;
            return new StringBuilder().append(cnt).toString();
        }
    }
    
  • Measure it on JDK 7u55, seeing the same performance for inlined/spliced side effects:

    Benchmark                       Mode   Samples         Mean   Mean error    Units
    o.s.IntStr.inlineSideEffect     avgt        25       65.460        1.747    ns/op
    o.s.IntStr.spliceSideEffect     avgt        25       64.414        1.323    ns/op
    
  • Measure it on JDK 8u5, seeing the performance degradation with the inlined effect:

    Benchmark                       Mode   Samples         Mean   Mean error    Units
    o.s.IntStr.inlineSideEffect     avgt        25       84.953        2.274    ns/op
    o.s.IntStr.spliceSideEffect     avgt        25       65.386        1.194    ns/op
    
  • Submit the bug report (https://bugs.openjdk.java.net/browse/JDK-8043677) to discuss this behavior with VM guys. The rationale for original fix is rock solid, it is interesting however if we can/should get back this optimization in some trivial cases like these.

  • ???

  • PROFIT.

And yeah, I should post the results for the benchmark which moves the increment from the StringBuilder chain, doing it before the entire chain. Also, switched to average time, and ns/op. This is JDK 7u55:

Benchmark                      Mode   Samples         Mean   Mean error    Units
o.s.IntStr.integerToString     avgt        25      153.805        1.093    ns/op
o.s.IntStr.stringBuilder0      avgt        25      128.284        6.797    ns/op
o.s.IntStr.stringBuilder1      avgt        25      131.524        3.116    ns/op
o.s.IntStr.stringBuilder2      avgt        25      254.384        9.204    ns/op
o.s.IntStr.stringFormat        avgt        25     2302.501      103.032    ns/op

And this is 8u5:

Benchmark                      Mode   Samples         Mean   Mean error    Units
o.s.IntStr.integerToString     avgt        25      153.032        3.295    ns/op
o.s.IntStr.stringBuilder0      avgt        25      127.796        1.158    ns/op
o.s.IntStr.stringBuilder1      avgt        25      131.585        1.137    ns/op
o.s.IntStr.stringBuilder2      avgt        25      250.980        2.773    ns/op
o.s.IntStr.stringFormat        avgt        25     2123.706       25.105    ns/op

stringFormat is actually a bit faster in 8u5, and all other tests are the same. This solidifies the hypothesis the side-effect breakage in SB chains in the major culprit in the original question.


I think this has to do with the CompileThreshold flag which controls when the byte code is compiled into machine code by JIT.

The Oracle JDK has a default count of 10,000 as document at http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html.

Where OpenJDK I couldn't find a latest document on this flag; but some mail threads suggest a much lower threshold: http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-November/004239.html

Also, try turn on / off the Oracle JDK flags like -XX:+UseCompressedStrings and -XX:+OptimizeStringConcat. I am not sure if those flags are turned on by default on OpenJDK though. Could someone please suggest.

One experiement you can do, is to firstly run the program by a lot of times, say, 30,000 loops, do a System.gc() and then try to look at the performance. I believe they would yield the same.

And I assume your GC setting is the same too. Otherwise you are allocating a lot of objects and the GC might well be the major part of your run time.