How can I Zip and Unzip a string using GZIPOutputStream that is compatible with .Net?

Solution 1:

The GZIP methods:

public static byte[] compress(String string) throws IOException {
    ByteArrayOutputStream os = new ByteArrayOutputStream(string.length());
    GZIPOutputStream gos = new GZIPOutputStream(os);
    gos.write(string.getBytes());
    gos.close();
    byte[] compressed = os.toByteArray();
    os.close();
    return compressed;
}

public static String decompress(byte[] compressed) throws IOException {
    final int BUFFER_SIZE = 32;
    ByteArrayInputStream is = new ByteArrayInputStream(compressed);
    GZIPInputStream gis = new GZIPInputStream(is, BUFFER_SIZE);
    StringBuilder string = new StringBuilder();
    byte[] data = new byte[BUFFER_SIZE];
    int bytesRead;
    while ((bytesRead = gis.read(data)) != -1) {
        string.append(new String(data, 0, bytesRead));
    }
    gis.close();
    is.close();
    return string.toString();
}

And a test:

final String text = "hello";
try {
    byte[] compressed = compress(text);
    for (byte character : compressed) {
        Log.d("test", String.valueOf(character));
    }
    String decompressed = decompress(compressed);
    Log.d("test", decompressed);
} catch (IOException e) {
    e.printStackTrace();
}

=== Update ===

If you need .Net compability my code has to be changed a little:

public static byte[] compress(String string) throws IOException {
    byte[] blockcopy = ByteBuffer
        .allocate(4)
        .order(java.nio.ByteOrder.LITTLE_ENDIAN)
        .putInt(string.length())
        .array();
    ByteArrayOutputStream os = new ByteArrayOutputStream(string.length());
    GZIPOutputStream gos = new GZIPOutputStream(os);
    gos.write(string.getBytes());
    gos.close();
    os.close();
    byte[] compressed = new byte[4 + os.toByteArray().length];
    System.arraycopy(blockcopy, 0, compressed, 0, 4);
    System.arraycopy(os.toByteArray(), 0, compressed, 4, os.toByteArray().length);
    return compressed;
}

public static String decompress(byte[] compressed) throws IOException {
    final int BUFFER_SIZE = 32;
    ByteArrayInputStream is = new ByteArrayInputStream(compressed, 4, compressed.length - 4);
    GZIPInputStream gis = new GZIPInputStream(is, BUFFER_SIZE);
    StringBuilder string = new StringBuilder();
    byte[] data = new byte[BUFFER_SIZE];
    int bytesRead;
    while ((bytesRead = gis.read(data)) != -1) {
        string.append(new String(data, 0, bytesRead));
    }
    gis.close();
    is.close();
    return string.toString();
}

You can use the same test script.

Solution 2:

Whatever it was that compressed "Hello" to BQAAAB+LC... is a particularly poor implementation of a gzipper. It expanded "Hello" far, far more than necessary, using a dynamic block instead of a static block in the deflate format. After removing the four-byte prefix to the gzip stream (which always starts with hex 1f 8b), "Hello" was expanded to 123 bytes. In the world of compression, that is considered a crime.

The Compress method that you are complaining about is working correctly and properly. It is generating a static block and a total output of 25 bytes. The gzip format has a ten-byte header and eight-byte trailer overhead, leaving the five-byte input having been coded in seven bytes. That's more like it.

Streams that are not compressible will be expanded, but it shouldn't be by much. The deflate format used by gzip will add five bytes to every 16K to 64K for incompressible data.

To get actual compression, in general you need to give the compressor much more to work with that five bytes, so that it can find repeated strings and biased statistics in compressible data. I understand that you were just doing tests with a short string. But in an actual application, you would never use a general-purpose compressor with such short strings, since it would always be better to just send the string.

Solution 3:

In your Decompress() method, the first 4 bytes of the Base64 decoded input are skipped before passing to GZipInputStream. These bytes are found to be 05 00 00 00 in this particular case. So in the Compress() method, these bytes have to be put back in just before the Base64 encode.

If I do this, Compress() returns the following:

BQAAAB+LCAAAAAAAAADLSM3JyQcAhqYQNgUAAAA=

I know that this is not exactly the same as your expectation, which is:

BQAAAB+LCAAAAAAABADtvQdgHEmWJSYvbcp7f0r1StfgdKEIgGATJNiQQBDswYjN5pLsHWlHIymrKoHKZVZlXWYWQMztnbz33nvvvffee++997o7nU4n99//P1xmZAFs9s5K2smeIYCqyB8/fnwfPyLmeVlW/w+GphA2BQAAAA==

But, if my result is plugged back into Decompress(), I think you'll still get "Hello". Try it. The difference may be due to the different compression level with which you got the original string.

So what are the mysterious prefixed bytes 05 00 00 00? According to this answer it may be the length of the compressed string so that the program knows how long the decompressed byte buffer should be. Still that does not tally in this case.

This is the modified code for compress():

public static String Compress(String text) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();

    // TODO: Should be computed instead of being hard-coded
    baos.write(new byte[]{0x05, 0, 0, 0}, 0, 4);

    GZIPOutputStream gzos = new GZIPOutputStream(baos);
    gzos.write(text.getBytes());
    gzos.close();

    return Base64.encode(baos.toByteArray());
}

Update:

The reason why the output strings in Android and your .NET code don't match is that the .NET GZip implementation does a faster compression (and thus larger output). This can be verified for sure by looking at the raw Base64 decoded byte values:

.NET:

1F8B 0800 0000 0000 0400 EDBD 0760 1C49
9625 262F 6DCA 7B7F 4AF5 4AD7 E074 A108
8060 1324 D890 4010 ECC1 88CD E692 EC1D
6947 2329 AB2A 81CA 6556 655D 6616 40CC
ED9D BCF7 DE7B EFBD F7DE 7BEF BDF7 BA3B
9D4E 27F7 DFFF 3F5C 6664 016C F6CE 4ADA
C99E 2180 AAC8 1F3F 7E7C 1F3F 22E6 7959
56FF 0F86 A610 3605 0000 00

My Android version:

1F8B 0800 0000 0000 0000 CB48 CDC9 C907
0086 A610 3605 0000 00

Now if we check the GZip File Format, we see that both the .NET and Android versions are mostly identical in the initial header and trailing CRC32 & Size fields. The only differences are in the below fields:

XFL = 04 (compressor used fastest algorithm) in the case of .NET, whereas it's 00 in Android
The actual compressed blocks

So it's clear from the XFL field that the .NET compression algorithm produces longer output.

Infact, when I creates a binary file with these raw data values and then uncompressed them using gunzip, both the .NET and Android versions gave exactly the same output as "hello".

So you don't have to bother about the differing results.

Solution 4:

I tried your code in my project, and found a encoding bug in compress method on Android:

byte[] blockcopy = ByteBuffer
        .allocate(4)
        .order(java.nio.ByteOrder.LITTLE_ENDIAN)
        .putInt(str.length())
        .array();
ByteArrayOutputStream os = new ByteArrayOutputStream(str.length());
GZIPOutputStream gos = new GZIPOutputStream(os);
gos.write(str.getBytes());

on above code, u should use the corrected encoding, and fill the bytes length, not the string length:

byte[] data = str.getBytes("UTF-8");

byte[] blockcopy = ByteBuffer
        .allocate(4)
        .order(java.nio.ByteOrder.LITTLE_ENDIAN)
        .putInt(data.length)
            .array();

ByteArrayOutputStream os = new ByteArrayOutputStream( data.length );    
GZIPOutputStream gos = new GZIPOutputStream(os);
gos.write( data );