Hadoop: compress file in HDFS?

Solution 1:

For me, it's lower overhead to write a Hadoop Streaming job to compress files.

This is the command I run:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dmapred.reduce.tasks=0 \
  -input <input-path> \
  -output $OUTPUT \
  -mapper "cut -f 2"

I'll also typically stash the output in a temp folder in case something goes wrong:

OUTPUT=/tmp/hdfs-gzip-`basename $1`-$RANDOM

One additional note, I do not specify a reducer in the streaming job, but you certainly can. It will force all the lines to be sorted which can take a long time with a large file. There might be a way to get around this by overriding the partitioner but I didn't bother figuring that out. The unfortunate part of this is that you potentially end up with many small files that do not utilize HDFS blocks efficiently. That's one reason to look into Hadoop Archives

Solution 2:

I suggest you write a MapReduce job that, as you say, just uses the Identity mapper. While you are at it, you should consider writing the data out to sequence files to improve performance loading. You can also store sequence files in block-level and record-level compression. Yo should see what works best for you, as both are optimized for different types of records.

Solution 3:

The streaming command from Jeff Wu along with a concatenation of the compressed files will give a single compressed file. When a non java mapper is passed to the streaming job and the input format is text streaming outputs just the value and not the key.

hadoop jar contrib/streaming/hadoop-streaming-1.0.3.jar \
            -Dmapred.reduce.tasks=0 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
            -input filename \
            -output /filename \
            -mapper /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
hadoop fs -cat /path/part* | hadoop fs -put - /path/compressed.gz

Solution 4:

This is what I've used:

/*
 * Pig script to compress a directory
 * input:   hdfs input directory to compress
 *          hdfs output directory
 * 
 * 
 */

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;

--comma seperated list of hdfs directories to compress
input0 = LOAD '$IN_DIR' USING PigStorage();

--single output directory
STORE input0 INTO '$OUT_DIR' USING PigStorage(); 

Though it's not LZO so it may be a bit slower.