Sorting a list of strings with a specific locale in Python
Solution 1:
You could use PyICU's collator to avoid changing global settings:
import icu  # PyICU
def sorted_strings(strings, locale=None):
    if locale is None:
        return sorted(strings)
    collator = icu.Collator.createInstance(icu.Locale(locale))
    return sorted(strings, key=collator.getSortKey)
Example:
>>> L = [u'sandwiches', u'angel delight', u'custard', u'éclairs', u'glühwein']
>>> sorted_strings(L)
['angel delight', 'custard', 'glühwein', 'sandwiches', 'éclairs']
>>> sorted_strings(L, 'en_US')
['angel delight', 'custard', 'éclairs', 'glühwein', 'sandwiches']
Disadvantage: dependency on the PyICU library; the behavior is slightly different from locale.strcoll.
I don't know how to get a locale.strxfrm function for a given locale name without changing the locale globally. As a hack, you could run your function in a separate child process:
import multiprocessing
pool = multiprocessing.Pool()
# ...
pool.apply(locale_aware_sort, [strings, loc])
Disadvantage: might be slow, resource hungry
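The snippet above calls a `locale_aware_sort` function that the answer does not define. A minimal sketch of what it might look like follows; the function body and the `sort_in_child` helper name are assumptions, not part of the original answer. Because `setlocale` runs inside the worker, the parent process keeps its own locale.

```python
import locale
import multiprocessing


def locale_aware_sort(strings, loc):
    """Sort strings by the collation rules of `loc`.

    Meant to run in a worker process: it changes that process's
    LC_COLLATE setting and does not restore it.
    """
    locale.setlocale(locale.LC_COLLATE, loc)
    return sorted(strings, key=locale.strxfrm)


def sort_in_child(strings, loc):
    """Hypothetical helper: run locale_aware_sort in a throwaway worker."""
    with multiprocessing.Pool(processes=1) as pool:
        return pool.apply(locale_aware_sort, [strings, loc])


if __name__ == "__main__":
    # 'C' is always available; other names depend on the locales
    # installed on the system.
    print(sort_in_child(['b', 'c', 'a'], 'C'))
```

Starting a pool per call is the slow part; reusing one long-lived pool would cut that cost, at the price of keeping a worker around.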
Using an ordinary threading.Lock won't work unless you can control every place where locale-aware functions (they are not limited to the locale module, e.g., re) could be called from multiple threads.
You could compile your function using Cython to synchronize access using the GIL. The GIL will make sure that no other Python code can be executed while your function is running.
Disadvantage: not pure Python
Solution 2:
The ctypes solution is fine, but if anyone in the future would like to just modify your original solution, here is a way to do so:
Temporary changes of global settings can safely be accomplished with a context manager.
from contextlib import contextmanager
from functools import cmp_to_key
import locale

@contextmanager
def changedlocale(newone):
    old_locale = locale.getlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, newone)
        yield locale.strcoll
    finally:
        locale.setlocale(locale.LC_COLLATE, old_locale)

def sort_strings(strings, locale_=None):
    if locale_ is None:
        return sorted(strings)
    with changedlocale(locale_) as strcoll:
        # sorted() has no cmp= argument in Python 3; cmp_to_key works on 2.7 and 3.x
        return sorted(strings, key=cmp_to_key(strcoll))
This ensures a clean restoration of the original locale - as long as you don't use threading.
Solution 3:
Glibc does support a locale API with an explicit state. Here's a quick wrapper for that API made with ctypes.
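# -*- coding: utf-8 -*-
import ctypes

class Locale(object):
    def __init__(self, locale):
        LC_ALL_MASK = 8127
        # LC_COLLATE_MASK = 8
        self.libc = ctypes.CDLL("libc.so.6")
        # locale_t is a pointer; ctypes' default int return type would truncate it on 64-bit systems
        self.libc.newlocale.restype = ctypes.c_void_p
        self.libc.newlocale.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p]
        self.libc.strxfrm_l.restype = ctypes.c_size_t
        self.libc.strxfrm_l.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_size_t, ctypes.c_void_p]
        self.libc.freelocale.argtypes = [ctypes.c_void_p]
        if not isinstance(locale, bytes):
            locale = locale.encode('ascii')
        self.ctx = self.libc.newlocale(LC_ALL_MASK, locale, None)
        if not self.ctx:
            raise ValueError('unknown locale: %r' % locale)

    def strxfrm(self, src, iteration=1):
        # stopgap for text strings: encode to UTF-8, which only suits UTF-8 locales
        if not isinstance(src, bytes):
            src = src.encode('utf-8')
        size = 3 * iteration * len(src) + 1  # +1 leaves room for the terminating NUL
        dest = ctypes.create_string_buffer(size)
        n = self.libc.strxfrm_l(dest, src, size, self.ctx)
        if n < size:
            return dest.value
        elif iteration <= 4:
            return self.strxfrm(src, iteration + 1)
        else:
            raise Exception('max number of iterations trying to increase dest reached')

    def __del__(self):
        if getattr(self, 'ctx', None):
            self.libc.freelocale(self.ctx)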
And a short test:
locale1 = Locale('C')
locale2 = Locale('mk_MK.UTF-8')
a_list = ['а', 'б', 'в', 'ј', 'ќ', 'џ', 'ш']
import random
random.shuffle(a_list)
assert sorted(a_list, key=locale1.strxfrm) == ['а', 'б', 'в', 'ш', 'ј', 'ќ', 'џ']
assert sorted(a_list, key=locale2.strxfrm) == ['а', 'б', 'в', 'ј', 'ќ', 'џ', 'ш']
What's left to do is to implement the rest of the locale functions, add support for Python unicode strings (with the wchar* functions, I guess), and pull the constants in from the C include files automatically instead of hard-coding them.
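One possible direction for the wchar* idea is to compare text strings directly with `wcscoll_l`, the wide-character counterpart of `strcoll_l`, and let `functools.cmp_to_key` turn that into a sort key. This is only a sketch under the same glibc assumptions as the wrapper above (`libc.so.6`, `LC_COLLATE_MASK = 8`); the `WideCollator` class and its method names are hypothetical.

```python
import ctypes
import functools

LC_COLLATE_MASK = 8  # glibc-specific value, assumed as in the wrapper above


class WideCollator(object):
    """Hypothetical helper: compare Python text strings with wcscoll_l."""

    def __init__(self, locale_name):
        self.ctx = None
        self.libc = ctypes.CDLL("libc.so.6")
        # Declare pointer-sized types so locale_t is not truncated to int.
        self.libc.newlocale.restype = ctypes.c_void_p
        self.libc.newlocale.argtypes = [
            ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p]
        self.libc.wcscoll_l.restype = ctypes.c_int
        self.libc.wcscoll_l.argtypes = [
            ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_void_p]
        self.libc.freelocale.argtypes = [ctypes.c_void_p]
        self.ctx = self.libc.newlocale(
            LC_COLLATE_MASK, locale_name.encode("ascii"), None)
        if not self.ctx:
            raise ValueError("locale not available: %r" % locale_name)

    def compare(self, a, b):
        """Return a negative, zero, or positive int, like strcoll."""
        return self.libc.wcscoll_l(a, b, self.ctx)

    def sorted(self, strings):
        return sorted(strings, key=functools.cmp_to_key(self.compare))

    def __del__(self):
        if getattr(self, "ctx", None):
            self.libc.freelocale(self.ctx)
```

For example, `WideCollator('mk_MK.UTF-8').sorted(a_list)` would sort text strings without encoding them first, provided that locale is installed. A comparison function is slower than a precomputed key for long lists, so a `wcsxfrm_l`-based key may be worth exploring for large inputs.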