Read Unicode characters from command-line arguments in Python 2.x on Windows
Solution 1:
Here is a solution that is just what I'm looking for, making a call to the Windows GetCommandLineArgvW
function:
Get sys.argv with Unicode characters under Windows (from ActiveState)
But I've made several changes, to simplify its usage and better handle certain uses. Here is what I use:
win32_unicode_argv.py
"""
win32_unicode_argv.py
Importing this will replace sys.argv with a full Unicode form.
Windows only.
From this site, with adaptations:
http://code.activestate.com/recipes/572200/
Usage: simply import this module into a script. sys.argv is changed to
be a list of Unicode strings.
"""
import sys
def win32_unicode_argv():
"""Uses shell32.GetCommandLineArgvW to get sys.argv as a list of Unicode
strings.
Versions 2.x of Python don't support Unicode in sys.argv on
Windows, with the underlying Windows API instead replacing multi-byte
characters with '?'.
"""
from ctypes import POINTER, byref, cdll, c_int, windll
from ctypes.wintypes import LPCWSTR, LPWSTR
GetCommandLineW = cdll.kernel32.GetCommandLineW
GetCommandLineW.argtypes = []
GetCommandLineW.restype = LPCWSTR
CommandLineToArgvW = windll.shell32.CommandLineToArgvW
CommandLineToArgvW.argtypes = [LPCWSTR, POINTER(c_int)]
CommandLineToArgvW.restype = POINTER(LPWSTR)
cmd = GetCommandLineW()
argc = c_int(0)
argv = CommandLineToArgvW(cmd, byref(argc))
if argc.value > 0:
# Remove Python executable and commands if present
start = argc.value - len(sys.argv)
return [argv[i] for i in
xrange(start, argc.value)]
sys.argv = win32_unicode_argv()
Now, the way I use it is simply to do:
import sys
import win32_unicode_argv
and from then on, sys.argv
is a list of Unicode strings. The Python optparse
module seems happy to parse it, which is great.
Solution 2:
Dealing with encodings is very confusing.
I believe if your inputing data via the commandline it will encode the data as whatever your system encoding is and is not unicode. (Even copy/paste should do this)
So it should be correct to decode into unicode using the system encoding:
import sys
first_arg = sys.argv[1]
print first_arg
print type(first_arg)
first_arg_unicode = first_arg.decode(sys.getfilesystemencoding())
print first_arg_unicode
print type(first_arg_unicode)
f = codecs.open(first_arg_unicode, 'r', 'utf-8')
unicode_text = f.read()
print type(unicode_text)
print unicode_text.encode(sys.getfilesystemencoding())
running the following Will output: Prompt> python myargv.py "PC・ソフト申請書08.09.24.txt"
PC・ソフト申請書08.09.24.txt
<type 'str'>
<type 'unicode'>
PC・ソフト申請書08.09.24.txt
<type 'unicode'>
?日本語
Where the "PC・ソフト申請書08.09.24.txt" contained the text, "日本語". (I encoded the file as utf8 using windows notepad, I'm a little stumped as to why there's a '?' in the begining when printing. Something to do with how notepad saves utf8?)
The strings 'decode' method or the unicode() builtin can be used to convert an encoding into unicode.
unicode_str = utf8_str.decode('utf8')
unicode_str = unicode(utf8_str, 'utf8')
Also, if your dealing with encoded files you may want to use the codecs.open() function in place of the built-in open(). It allows you to define the encoding of the file, and will then use the given encoding to transparently decode the content to unicode.
So when you call content = codecs.open("myfile.txt", "r", "utf8").read()
content
will be in unicode.
codecs.open: http://docs.python.org/library/codecs.html?#codecs.open
If I'm miss-understanding something please let me know.
If you haven't already I recommend reading Joel's article on unicode and encoding: http://www.joelonsoftware.com/articles/Unicode.html
Solution 3:
Try this:
import sys
print repr(sys.argv[1].decode('UTF-8'))
Maybe you have to substitute CP437
or CP1252
for UTF-8
. You should be able to infer the proper encoding name from the registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\OEMCP