Difference between compact strings and compressed strings in Java 9

What are the advantages of compact strings over compressed strings in JDK9?


Solution 1:

Compressed strings (Java 6) and compact strings (Java 9) both have the same motivation (strings are often effectively Latin-1, so half the space is wasted) and goal (make those strings small) but the implementations differ a lot.

Compressed Strings

In an interview Aleksey Shipilëv (who was in charge of implementing the Java 9 feature) had this to say about compressed strings:

UseCompressedStrings feature was rather conservative: while distinguishing between char[] and byte[] case, and trying to compress the char[] into byte[] on String construction, it done most String operations on char[], which required to unpack the String. Therefore, it benefited only a special type of workloads, where most strings are compressible (so compression does not go to waste), and only a limited amount of known String operations are performed on them (so no unpacking is needed). In great many workloads, enabling -XX:+UseCompressedStrings was a pessimization.

[...] UseCompressedStrings implementation was basically an optional feature that maintained a completely distinct String implementation in alt-rt.jar, which was loaded once the VM option is supplied. Optional features are harder to test, since they double the number of option combinations to try.

Compact Strings

In Java 9 on the other hand, compact strings are fully integrated into the JDK source. String is always backed by byte[], where characters use one byte if they are Latin-1 and otherwise two. Most operations do a check to see which is the case, e.g. charAt:

public char charAt(int index) {
    if (isLatin1()) {
        return StringLatin1.charAt(value, index);
    } else {
        return StringUTF16.charAt(value, index);
    }
}

Compact strings are enabled by default and can be partially disabled with -XX:-CompactStrings - "partially" because strings are then still backed by a byte[] and operations returning chars must still assemble them from two separate bytes (due to intrinsics it is hard to say whether this has a performance impact).
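To illustrate that reassembly, here is a minimal sketch (not the JDK's actual code; the real StringUTF16 picks its high/low byte shifts at VM startup based on native byte order, while this sketch simply assumes big-endian layout):

```java
import java.nio.charset.StandardCharsets;

public class TwoByteChar {
    // Sketch: rebuild a UTF-16 char from two bytes, roughly what a
    // UTF-16-backed charAt must do. Big-endian layout is assumed here.
    static char getChar(byte[] value, int index) {
        int hi = (value[index * 2] & 0xFF) << 8;  // most significant byte
        int lo = value[index * 2 + 1] & 0xFF;     // least significant byte
        return (char) (hi | lo);
    }

    public static void main(String[] args) {
        byte[] bytes = "ab".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(getChar(bytes, 0)); // a
        System.out.println(getChar(bytes, 1)); // b
    }
}
```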

More

If you're interested in more background on compact strings, I recommend reading the interview linked above and/or watching this great talk by the same Aleksey Shipilëv (which also explains the new string concatenation).

Solution 2:

-XX:+UseCompressedStrings and Compact Strings are different things.

UseCompressedStrings meant that Strings containing only ASCII characters could be stored as a byte[], but this was off by default. In JDK 9 this optimization is always on; it is built in rather than controlled by that flag.

Until Java 9, Strings were stored internally as a char[] in UTF-16 encoding. From Java 9 on they are stored as a byte[]. Why?

Because in ISO_LATIN_1 each character can be encoded in a single byte (8 bits), versus the 16 bits used until now (8 of which were never used for such characters). This works only for ISO_LATIN_1 strings, but those are the majority of Strings used anyway.

So that is done for space usage.
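The space difference is easy to measure by encoding the same text in both charsets (a sketch of the accounting only; the JDK performs the equivalent Latin-1 check internally when it constructs a String):

```java
import java.nio.charset.StandardCharsets;

public class SpaceUsage {
    public static void main(String[] args) {
        String s = "compact";  // Latin-1 only, 7 characters

        // stored as UTF-16 (pre-Java 9 style): two bytes per character
        int utf16Bytes = s.getBytes(StandardCharsets.UTF_16BE).length;

        // stored as ISO-8859-1 / Latin-1 (Java 9 compact form): one byte each
        int latin1Bytes = s.getBytes(StandardCharsets.ISO_8859_1).length;

        System.out.println(utf16Bytes);   // 14
        System.out.println(latin1Bytes);  // 7
    }
}
```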

Here is a small example that should make things more clear:

class StringCharVsByte {
    public static void main(String[] args) {
        String first = "first";          // Latin-1 only
        String russianFirst = "первый";  // Cyrillic, does not fit in one byte per char

        char[] c1 = first.toCharArray();
        char[] c2 = russianFirst.toCharArray();

        // print the most significant 8 bits of each char
        for (char c : c1) {
            System.out.println(c >>> 8);
        }

        for (char c : c2) {
            System.out.println(c >>> 8);
        }
    }
}

In the first case we get only zeroes, meaning the most significant 8 bits of every char are zero; in the second case we get non-zero values, meaning that at least one of the most significant 8 bits is set.

That means that if we store Strings internally as an array of chars, many string literals waste half of each char. It turns out many applications waste a lot of space because of this.

You have a String made of 10 Latin-1 characters? You just lost 80 bits, or 10 bytes. Compact strings were introduced to mitigate this, and now there is no space loss for such Strings.

Internally this also means some very nice things. To distinguish between Strings that are LATIN1 and UTF-16, there is a coder field:

/**
 * The identifier of the encoding used to encode the bytes in
 * {@code value}. The supported values in this implementation are
 *
 * LATIN1
 * UTF16
 *
 * @implNote This field is trusted by the VM, and is a subject to
 * constant folding if String instance is constant. Overwriting this
 * field after construction will cause problems.
 */
private final byte coder;

Now, based on this coder, length is computed differently:

public int length() {
    return value.length >> coder();
}

If our String is Latin-1 only, coder is zero, so the length of value (the byte array) equals the number of characters. For UTF-16 strings the byte array holds two bytes per char, so shifting right by one divides the byte count by two.
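The same shift trick can be reproduced outside the JDK by encoding a string both ways (a sketch; the coder values 0 for LATIN1 and 1 for UTF16 mirror the field shown above):

```java
import java.nio.charset.StandardCharsets;

public class LengthShift {
    public static void main(String[] args) {
        String latin = "first";       // Latin-1 only, 5 characters
        String cyrillic = "первый";   // needs UTF-16, 6 characters

        byte[] latinValue = latin.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf16Value = cyrillic.getBytes(StandardCharsets.UTF_16BE);

        int latinCoder = 0;  // LATIN1: one byte per char
        int utf16Coder = 1;  // UTF16: two bytes per char

        // value.length >> coder reproduces what String.length() computes
        System.out.println(latinValue.length >> latinCoder);  // 5
        System.out.println(utf16Value.length >> utf16Coder);  // 6
    }
}
```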

Solution 3:

Compact Strings get the best of both worlds.

As can be seen in the definition provided in the OpenJDK documentation (JEP 254: Compact Strings):

The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.

As mentioned by @Eugene, most strings can be encoded in Latin-1 format and require only one byte per character, and hence do not need the whole two-byte space provided by the previous String class implementation.

The new String class implementation shifts from a UTF-16 char array to a byte array plus an encoding-flag field. That additional field shows whether the characters are stored in UTF-16 or Latin-1 format.

This also means that strings can still be stored in UTF-16 format when required. And this is the main point of difference between the compressed Strings of Java 6 and the compact Strings of Java 9: with compressed strings, the byte[] form could only represent pure ASCII, whereas compact strings represent both Latin-1 and UTF-16 content in a byte[].
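To see that non-Latin-1 content keeps working transparently under compact strings, here is a small check (a sketch; the internal coder switch is invisible through the public API):

```java
public class Utf16Fallback {
    public static void main(String[] args) {
        String latin = "héllo";     // é is U+00E9, still fits Latin-1
        String nonLatin = "日本語";  // forces the UTF-16 representation

        // the public API behaves identically for both representations
        System.out.println(latin.length());      // 5
        System.out.println(nonLatin.length());   // 3
        System.out.println(nonLatin.charAt(1));  // 本
    }
}
```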