Is there a way to grep gzipped content in hdfs without extracting it?
I'm looking for a way to zgrep hdfs files, something like:
hadoop fs -zcat hdfs://myfile.gz | grep "hi"
or
hadoop fs -cat hdfs://myfile.gz | zgrep "hi"
Neither really works for me. Is there any way to achieve this from the command line?
Solution 1:
zless/zcat/zgrep are just shell wrappers that make gzip output the decompressed data to stdout. To do what you want, you'll just have to write a wrapper around the hadoop fs commands.
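As a minimal sketch, such a wrapper could look like the script below (the name hdfs-zgrep and its interface are just illustrative):

#!/usr/bin/env bash
# hdfs-zgrep: zgrep-style wrapper for gzipped files on HDFS (illustrative name).
# Usage: hdfs-zgrep PATTERN hdfs:///path/to/file.gz [more files ...]
pattern="$1"
shift
for f in "$@"; do
    # Stream each file out of HDFS, decompress it, and grep the result.
    hadoop fs -cat "$f" | gzip -c -d | grep -- "$pattern"
done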
Aside: The reason this probably didn't work for you is that you're missing an additional slash in your hdfs URI.
You wrote:
hadoop fs -cat hdfs://myfile.gz | zgrep "hi"
This attempts to contact a host or cluster called myfile.gz. What you really want is either hdfs:///myfile.gz or (assuming your config files are set up correctly) just myfile.gz, which the hadoop command will resolve against the cluster/namenode defined by fs.defaultFS.
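For reference, fs.defaultFS is set in core-site.xml; a typical entry looks like this (the host and port are placeholders, not values from this answer):

<property>
  <!-- placeholder namenode address; substitute your own -->
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>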
The following works for me.
$ hadoop fs -ls hdfs:///user/hcoyote/foo.gz
Found 1 items
-rw-r--r-- 3 hcoyote users 5184637 2015-02-20 12:17 hdfs:///user/hcoyote/foo.gz
$ hadoop fs -cat hdfs:///user/hcoyote/foo.gz | gzip -c -d | grep -c Authorization
425893
$ hadoop fs -cat hdfs:///user/hcoyote/foo.gz | zgrep -c Authorization
425893
Solution 2:
This command automatically finds the right decompressor for any simple text file and prints the uncompressed data to standard output:
hadoop fs -text hdfs:///path/to/file [hdfs:///path/to/another/file]
I have used this for .snappy & .gz files. It probably works for .lzo and .bz2 files.
This is an important feature because Hadoop uses a custom container format for Snappy files, so this is the only direct way to uncompress a Hadoop-created Snappy file. There is no command-line 'unsnappy' like there is for the other compressors, and I don't know of any direct command that creates one either; I've only created them as Hive table data.
Note: hadoop fs -text is single-threaded and runs the decompression on the machine where you run the command.
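Applied to the original question, this gives a codec-agnostic equivalent of zgrep (using the asker's hypothetical path):

hadoop fs -text hdfs:///myfile.gz | grep "hi"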