Decompress a ZIP with a given filename encoding
I have ZIP file(s) containing files whose filenames are stored in some encoding. Let's say I know the encoding of those filenames, but I still don't know how to decompress them properly.
Here is an example file; it contains one file, "【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass".
I know the encoding used is GB18030 (Chinese).
The question is: how do I unpack that file on FreeBSD using unzip or another CLI utility so that I get a properly encoded filename? I have tried everything I could, but the result was never good. Please help.
I tried this on OS X:
MBP1:test 2ge$ bsdtar xf gb18030.zip
MBP1:test 2ge$ ls
%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12/ gb18030.zip
MBP1:test 2ge$ cd %A1%BESSK%D7%D6Ļ%D7顿The\ Vampire\ Diaries\ %CE%FCѪ%B9%ED%C8ռ%C7S06E12/
MBP1:%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12 2ge$ ls
%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12.ass*
MBP1:%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12 2ge$ find . | iconv -f gb18030 -t utf-8
.
./%A1%BESSK%D7%D6L抬%D7椤縏he Vampire Diaries %CE%FC血%B9%ED%C8占%C7S06E12.ass
MBP1:%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12 2ge$ convmv -r -f gb18030 -t utf-8 --notest .
Skipping, already UTF-8: ./%A1%BESSK%D7%D6Ļ%D7顿The Vampire Diaries %CE%FCѪ%B9%ED%C8ռ%C7S06E12.ass
Ready!
I tried something similar with unzip, but I get a similar problem.
Thanks. I am now trying on FreeBSD, which I connect to over SSH from OS X (Terminal):
# locale
LANG=
LC_CTYPE="C"
LC_COLLATE="C"
LC_TIME="C"
LC_NUMERIC="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=C
The first thing I would like to do is display Chinese names properly. I changed the locale:
setenv LC_ALL zh_CN.GB18030
setenv LANG zh_CN.GB18030
Then I downloaded the file and tried "ls" to see the proper characters, but no luck. So I think I first have to sort out the Chinese locale, so that when I do get a proper result I can verify it by comparing. Can you please help me with this as well?
Here's what I do on Ubuntu 16.04 to unzip a zip in any encoding, as long as I know what that encoding is. The same method should work on FreeBSD because it only relies on the widely available unzip tool.
- I double-check the exact name of the encoding, so as not to misspell it: https://www.iana.org/assignments/character-sets/character-sets.xhtml
- I simply run
  $ unzip -O <encoding> <filename> -d <target_dir>
  or
  $ unzip -I <encoding> <filename> -d <target_dir>
  choosing between -O and -I according to the instructions here:
  $ unzip -h
  UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
  ...
  -O CHARSET specify a character encoding for DOS, Windows and OS/2 archives
  -I CHARSET specify a character encoding for UNIX and other archives
  ...
  which means that I simply try -O first, and it should usually work, because not a lot of people would create a .zip file on Unix...
So, for your specific example:
- The exact encoding name is GB18030.
- I use the -O flag and:
  $ unzip -O GB18030 gb18030.zip -d target_dir
  Archive: gb18030.zip
  creating: target_dir/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12/
  inflating: target_dir/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass
... it works.
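If guessing between -O and -I feels too hit-and-miss, the archive itself records which system created each member (the "version made by" host code in the ZIP central directory), and Python's zipfile exposes it as ZipInfo.create_system. The following is only a rough sketch: the function name suggest_unzip_flag is made up for illustration, and the assumption that -O corresponds to the FAT, HPFS, NTFS and VFAT host codes is my reading of the help text above, not something I have checked against the unzip source.

```python
import zipfile

# "Version made by" host codes from the ZIP specification that denote
# DOS, Windows or OS/2 file systems: 0 = FAT, 6 = HPFS, 10 = NTFS, 14 = VFAT.
# Assumption: these are the archives that unzip's -O option is meant for.
DOS_LIKE_HOSTS = {0, 6, 10, 14}


def suggest_unzip_flag(zip_file):
    """Guess "-O" or "-I" from the host system recorded for each member.

    zip_file may be a path or a binary file-like object.
    Returns None when the archive is empty or mixes both kinds of hosts.
    """
    with zipfile.ZipFile(zip_file) as archive:
        hosts = {info.create_system for info in archive.infolist()}
    if hosts and hosts <= DOS_LIKE_HOSTS:
        return "-O"
    if hosts and hosts.isdisjoint(DOS_LIKE_HOSTS):
        return "-I"
    return None  # no clear answer: try both flags by hand
```

For example, print(suggest_unzip_flag("gb18030.zip")) prints the flag to try first. Some tools write a host code that does not match the system they ran on, so treat the result as a hint only.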
Method 1: use the unar utility
sudo apt-get install unar
unar -e gb18030 gb18030.zip
Method 2: use a Python script to unzip the file (reference: https://gist.github.com/usunyu/dfc6e56af6e6caab8018bef4c3f3d452#file-gbk-unzip-py)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# unzip-gbk.py
# Note: this is a Python 2 script (print statements, byte-string filenames).
import os
import sys
import zipfile
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--encoding", help="encoding for filename, default gbk")
parser.add_argument("-l", help="list filenames in zipfile, do not unzip", action="store_true")
parser.add_argument("file", help="process file.zip")
args = parser.parse_args()

print "Processing File " + args.file
file = zipfile.ZipFile(args.file, "r")
if args.encoding:
    print "Encoding " + args.encoding
for name in file.namelist():
    # Decode the raw filename bytes with the requested encoding (gbk by default).
    if args.encoding:
        utf8name = name.decode(args.encoding)
    else:
        utf8name = name.decode('gbk')
    pathname = os.path.dirname(utf8name)
    if args.l:
        print "Filename " + utf8name
    else:
        print "Extracting " + utf8name
        # Create the parent directory first if it does not exist yet.
        if not os.path.exists(pathname) and pathname != "":
            os.makedirs(pathname)
        data = file.read(name)
        # Existing files and directories are left alone.
        if not os.path.exists(utf8name):
            fo = open(utf8name, "wb")  # binary mode: write the bytes unchanged
            fo.write(data)
            fo.close()
file.close()
Running it on the example gb18030.zip extracts the following directory and file:
【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12
【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12/【SSK字幕组】The Vampire Diaries 吸血鬼日记S06E12.ass
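The script above targets Python 2. On Python 3, zipfile already hands back str names, decoded as CP437 whenever the archive does not mark the name as UTF-8, so calling .decode() on them fails. Below is a rough Python 3 sketch of the same idea, assuming the usual workaround of re-encoding the name to CP437 to recover the raw bytes before decoding them with the real encoding. The names decode_name and extract_with_encoding are mine, not from the gist, and unlike the script above this version overwrites existing files.

```python
import os
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def decode_name(info, encoding="gb18030"):
    """Return the member name decoded with the given legacy encoding."""
    if info.flag_bits & UTF8_FLAG:
        return info.orig_filename  # already decoded correctly by zipfile
    # zipfile decoded the raw bytes as CP437; undo that, then decode properly.
    return info.orig_filename.encode("cp437").decode(encoding)


def extract_with_encoding(zip_path, target_dir=".", encoding="gb18030"):
    """Extract every member under target_dir and return the decoded names."""
    root = os.path.realpath(target_dir)
    names = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            name = decode_name(info, encoding)
            dest = os.path.realpath(os.path.join(root, name))
            # Refuse names such as "../x" that would land outside target_dir.
            if os.path.commonpath([root, dest]) != root:
                raise ValueError("unsafe path in archive: " + name)
            if name.endswith("/"):
                os.makedirs(dest, exist_ok=True)  # directory entry
            else:
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                with archive.open(info) as src, open(dest, "wb") as dst:
                    dst.write(src.read())
            names.append(name)
    return names
```

For the example archive this would be called as extract_with_encoding("gb18030.zip", "target_dir", "gb18030"). On Python 3.11 and later, zipfile.ZipFile also accepts a metadata_encoding argument when reading, which may make the CP437 round trip unnecessary.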