Is there a way to grep gzipped content in hdfs without extracting it?

I'm looking for a way to zgrep hdfs files

something like:

hadoop fs -zcat hdfs://myfile.gz | grep "hi"

or

hadoop fs -cat hdfs://myfile.gz | zgrep "hi"

Neither of these really works for me. Is there any way to achieve this from the command line?


Solution 1:

zless/zcat/zgrep are just shell wrappers that make gzip output the decompressed data to stdout. To do what you want, you'll just have to write a wrapper around the hadoop fs commands.

Aside: The reason this probably didn't work for you is that you're missing an additional slash in your hdfs URI.

You wrote:

hadoop fs -cat hdfs://myfile.gz | zgrep "hi"

This attempts to contact a host or cluster named myfile.gz. What you really want is either hdfs:///myfile.gz or (assuming your config files are set up correctly) just /myfile.gz, which the hadoop command resolves against the cluster/namenode defined by fs.defaultFS. Note that a relative path such as myfile.gz, with no leading slash, is resolved against your HDFS home directory (typically /user/<username>), not the root.
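To make the slash issue concrete, here is a rough sketch that splits a URI the way a generic scheme://authority/path parser would. The uri_parts function is a hypothetical helper written only for illustration; it is not part of Hadoop and does no validation.

```shell
# uri_parts URI
# Hypothetical helper (not a Hadoop command): splits scheme://authority/path
# with plain parameter expansion to show which part is read as the host.
uri_parts() {
    local rest="${1#*://}"            # drop the "scheme://" prefix
    local authority="${rest%%/*}"     # text before the first slash
    local path="${rest#"$authority"}" # whatever follows the authority
    printf 'authority=[%s] path=[%s]\n' "$authority" "$path"
}

uri_parts hdfs://myfile.gz     # authority=[myfile.gz] path=[]
uri_parts hdfs:///myfile.gz    # authority=[] path=[/myfile.gz]
```

With two slashes the file name lands in the authority slot and the path is empty; with three slashes the authority is empty, so the default filesystem is used and /myfile.gz is the path.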

The following works for me.

$ hadoop fs -ls hdfs:///user/hcoyote/foo.gz
Found 1 items
-rw-r--r--   3 hcoyote users    5184637 2015-02-20 12:17 hdfs:///user/hcoyote/foo.gz

$ hadoop fs -cat hdfs:///user/hcoyote/foo.gz | gzip -c -d | grep -c Authorization
425893

$ hadoop fs -cat hdfs:///user/hcoyote/foo.gz | zgrep -c Authorization
425893

Solution 2:

This command line automatically picks the right decompressor for any simple text file and prints the uncompressed data to standard output:

hadoop fs -text hdfs:///path/to/file [hdfs:///path/to/another/file]

I have used this for .snappy & .gz files. It probably works for .lzo and .bz2 files.

This is an important feature because Hadoop uses a custom file format for Snappy files. This is the only direct way to uncompress a Hadoop-created Snappy file. There is no command-line 'unsnappy' command like there is for the other compressors. I also don't know of any direct command that creates one. I've only created them as Hive table data.

Note: hadoop fs -text is single-threaded and runs the decompression on the machine where you run the command.
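
Since the decompressed text arrives on standard output, it can be piped straight into plain grep, which covers the original question. A minimal sketch, with hdfs_text_grep being a hypothetical name chosen for this example:

```shell
# hdfs_text_grep PATTERN HDFS_PATH...
# Hypothetical helper (not a Hadoop command): lets "hadoop fs -text" choose
# the decompressor, then greps the plain text it prints.
hdfs_text_grep() {
    if [ "$#" -lt 2 ]; then
        echo "usage: hdfs_text_grep PATTERN HDFS_PATH..." >&2
        return 2
    fi
    local pattern="$1"
    shift
    # All remaining arguments are passed to -text as file paths.
    hadoop fs -text "$@" | grep -e "$pattern"
}
```

Usage would be along the lines of: hdfs_text_grep "hi" hdfs:///path/to/file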